
IP Regex without subnet for all sites

As part of a project at my company, I need to extract IP addresses from some websites, but only those that don't carry a subnet suffix (unlike, e.g., 196.82.1.12/24).

If an address has a subnet suffix, I don't want to grab the part preceding the suffix; I want to skip that address entirely.

For example, given the following input:

<td>212.179.35.154</td>
<td>200.139.97.126/24</td>
<td>"201.139.97.126"</td>
<td>F5 BIG-IP</td>
<td>unknown</td>
<td class="date">26-Feb-2011</td>

The desired output would be:

212.179.35.154

201.139.97.126

Please note that some lines have quotes surrounding the IP address; since no /NUMBER follows, those addresses are still valid.

I've been trying to find an appropriate regex for days now, such as:

(<td>(\d+\.){3}\d+<\/td>)
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}[^\/]

However, they all seem to have a flaw.

Thanks in advance!

Submitted July 28th 2020 by Admin

Answers

You can use a negative lookahead assertion, by using the pattern syntax (?!...), like this:

import re

s = """
<td>212.179.35.154</td>
<td>200.139.97.126/24</td>
<td>"201.139.97.126"</td>
<td>F5 BIG-IP</td>
<td>unknown</td>
<td class="date">26-Feb-2011</td>
"""

# Dotted quad, rejected when followed by optional digits and then a slash
pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?!\d*\/)"
print(re.findall(pattern, s))

Output:

['212.179.35.154', '201.139.97.126']

The (?!\d*\/) part tells it "don't match the previous pattern if it is followed by any number of digits (including none) and then a forward slash".
The \d* part is needed because otherwise the engine backtracks and matches 200.139.97.12 (without the 6) out of 200.139.97.126/24.

A small note: your original pattern matches more than just legal IP addresses (for example 999.999.999.999), but I kept your approach.
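If only legal addresses are wanted, one possible follow-up is to filter the regex candidates through the standard library's ipaddress module. This is a minimal sketch, not part of the original answer: the helper name extract_valid_ips and the sample string are made up for illustration.

```python
import ipaddress
import re

# Same pattern as above: dotted quad not followed by digits and a slash.
PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?!\d*/)")


def extract_valid_ips(text):
    """Hypothetical helper: keep only candidates that parse as IPv4."""
    valid = []
    for candidate in PATTERN.findall(text):
        try:
            # Raises a ValueError subclass for octets above 255, etc.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        valid.append(candidate)
    return valid


# Illustrative input: one legal address, one illegal, one with a subnet.
sample = "<td>212.179.35.154</td><td>999.1.1.1</td><td>200.139.97.126/24</td>"
print(extract_valid_ips(sample))
```

Here 999.1.1.1 passes the regex but is dropped by the ipaddress check, and the /24 entry never matches the regex in the first place.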

Admin | 1 year ago



This looks to me like a task where a negative lookahead will be useful. I would do:

import re
txt = '''<td>212.179.35.154</td>
<td>200.139.97.126/24</td>
<td>"201.139.97.126"</td>
<td>F5 BIG-IP</td>
<td>unknown</td>
<td class="date">26-Feb-2011</td>'''
pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?![0-9/])"
found = re.findall(pattern, txt)
print(found)

Output:

['212.179.35.154', '201.139.97.126']

The negative lookahead (?![0-9/]) says: reject a match if it is immediately followed by a digit (0-9) or by /. Including the digits is crucial here, because with only / in the lookahead the regex engine backtracks and one of the matches would be:

200.139.97.12

(note the missing 6 at the end)
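The backtracking effect can be reproduced on the subnetted row alone. This is a small sketch for comparison; the variable names are illustrative only.

```python
import re

row = "<td>200.139.97.126/24</td>"
ip = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# Lookahead that only forbids a slash: the last octet shrinks to "12",
# which is followed by "6" rather than "/", so a truncated match slips through.
slash_only = re.findall(ip + r"(?!/)", row)

# Lookahead that forbids a digit or a slash: every shortened octet is
# still followed by a digit, so nothing matches on this row.
digits_and_slash = re.findall(ip + r"(?![0-9/])", row)

print(slash_only)        # ['200.139.97.12']
print(digits_and_slash)  # []
```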

Admin | 1 year ago


