Python 2.7 find IP address and replace with text - python

I have used Python to extract a route table from a router and am trying to
strip out superfluous text, and
replace the destination of each route with a text string to match a different customer grouping.
At the moment I have:
infile = "routes.txt"
outfile = "output.txt"
delete_text = ["ROUTER1# sh run | i route", "ip route"]
client_list = ["CUST_A", "CUST_B"]
subnet_list = ["1.2.3.4", "5.6.7.8"]
fin = open(infile)
fout = open(outfile, "w+")
for line in fin:
    for word in delete_text:
        line = line.replace(word, "")
    for word in subnet_list:
        line = line.replace("1.2.3.4", "CUST_A")
    for word in subnet_list:
        line = line.replace("5.6.7.8", "CUST_B")
    fout.write(line)
fin.close()
fout.close()
f = open('output.txt', 'r')
file_contents = f.read()
print(file_contents)
f.close()
This works to an extent, but when it searches for and replaces e.g. 5.6.7.8, it also picks up that string inside other IP addresses, e.g. 5.6.7.88, and replaces those too, which I don't want to happen.
What I am after is for only an exact match to be found and replaced.

You could use re.sub() with explicit word boundaries (\b):
>>> re.sub(r'\b5.6.7.8\b', 'CUST_B', 'test 5.6.7.8 test 5.6.7.88 test')
'test CUST_B test 5.6.7.88 test'

As you found out, your approach is bad because it results in false positives (i.e., undesirable matches). You should parse the lines into tokens and then match the individual tokens. That might be as simple as first doing tokens = line.split() to split on whitespace. That, however, may not work if the line contains quoted strings. Consider the result of this statement: "ab 'cd ef' gh".split(). So you might need a more sophisticated parser.
You could use the re module to perform substitutions, using the \b meta sequence to ensure the matches begin and end on a "word" boundary. But that has its own, unique, failure modes. For example, consider that the . (period) character matches any character. So doing re.sub(r'\b5.6.7.8\b', ...) as @NPE suggested will actually match not just the literal word 5.6.7.8 but also 5x6.7y8. That may not be a concern given the inputs you expect, but it is something most people don't consider and is therefore another source of bugs. Regular expressions are seldom the correct tool for a problem like this one.
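If the exact-literal behaviour matters, a small sketch along those lines escapes the IP before adding the word boundaries (the sample strings are taken from the discussion above):

```python
import re

line = 'test 5.6.7.8 test 5.6.7.88 test 5x6.7y8'
# re.escape turns each dot into \. so it matches only a literal dot,
# and \b keeps 5.6.7.88 from matching as a prefix
pattern = r'\b' + re.escape('5.6.7.8') + r'\b'
result = re.sub(pattern, 'CUST_B', line)
print(result)  # test CUST_B test 5.6.7.88 test 5x6.7y8
```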

thanks guys, I've been testing with this and the re.sub function just seems to print out the string CUSTBCUSTBCUSTBCUSTB... repeating in a loop.
I have amended the code snippet above to:
for word in subnet_list:
    line = re.sub(r'\b5.6.7.8\b', 'CUST_B', '5.6.7.88')
Ideally I would like the string element to be replaced in all of the list occurrences while preserving the list structure?
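One way to get what the comment asks for, sketched under the assumption that subnet_list and client_list line up index-for-index, is to zip the two lists and substitute into the current line rather than a literal string:

```python
import re

client_list = ["CUST_A", "CUST_B"]
subnet_list = ["1.2.3.4", "5.6.7.8"]

line = "1.2.3.4 via 5.6.7.8 and 5.6.7.88"
for subnet, client in zip(subnet_list, client_list):
    # escape the IP so its dots stay literal, and anchor on word boundaries
    line = re.sub(r'\b' + re.escape(subnet) + r'\b', client, line)
print(line)  # CUST_A via CUST_B and 5.6.7.88
```

The same loop body would replace the hard-coded re.sub calls inside the question's per-line loop.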


What should be the regex pattern for this?

I want to extract only this whole part - "value":["10|8.0|1665|82|apple|#||0","8|1132|188.60|banana|#||0"] - from all the lines in a text file and then write it into another text file. This part has different values in every line.
I have written this regex pattern but am unable to get this whole part into another text file.
with open("result.txt", "w+") as result_file:
    with open("log.txt", "r") as log_file:
        for lines in log_file:
            all_values = re.findall(r'("value"+:"[\w\.#|-]+")', lines)
            for i in all_values:
                result_file.write(i)
In your pattern, you can omit the outer parentheses of the capture group to get a match only.
The part "+ matches a double quote one or more times, which does not seem to be required.
You don't get the whole match because there are more characters in the string than are listed in the character class [\w\.#|-]+
As a broader match, you can use
"value":\[".*?"]
"value": match literally
\[" Match ["
.*? Match any char, as few times as possible
"] Match "]
Regex demo
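As a sketch of how the broader pattern might slot into the question's loop (the sample line here is made up from the question's data):

```python
import re

line = 'junk "value":["10|8.0|1665|82|apple|#||0","8|1132|188.60|banana|#||0"] junk'
# the non-greedy .*? stops at the first "] so the whole bracketed part is matched
matches = re.findall(r'"value":\[".*?"]', line)
print(matches)
```

Each element of matches could then be written to the result file as in the original code.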

python re.sub newline multiline dotall

I have this CSV with the following lines written in it (please note the newline \n):
"<a>https://google.com</a>",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Dirección
I am trying to delete all those commas and move the address one row up. So, in Python I am using this:
with open('Reutput.csv') as e, open('Put.csv', 'w') as ee:
    text = e.read()
    text = str(text)
    re.compile('<a/>*D', re.MULTILINE|re.DOTALL)
    replace = re.sub('<a/>*D', '<a/>",D', text)  # fix commas between fields
    replace = str(replace)
    ee.write(replace)
As far as I know, re.MULTILINE and re.DOTALL are necessary to handle the \n. I am using re.compile because it is the only way I know to add them, but obviously compiling is not needed here.
How could I finish with this text?
"<a>https://google.com</a>",Dirección
You don't need the compile statement at all, because you aren't using it. You can put either the compiled pattern or the raw pattern in the re.sub function. You also don't need the MULTILINE flag, which has to do with the interpretation of the ^ and $ metacharacters, which you don't use.
The heart of the problem is that you are compiling the flag into a regular expression pattern, but since you aren't using the compiled pattern in your substitute command, it isn't getting recognized.
One more thing. re.sub returns a string, so replace = str(replace) is unnecessary.
Here's what worked for me:
import re

with open('Reutput.csv') as e:
    text = e.read()
s = re.compile('</a>".*D', re.DOTALL)
replace = re.sub(s, '</a>"D', text)  # fix commas between fields
print(replace)
If you just call re.sub without compiling, you need to call it like
re.sub('</a>".*D', '</a>"D', text, flags=re.DOTALL)
I don't know exactly what your application is, of course, but if all you want to do is to delete all the commas and newlines, it might be clearer to write
replace = ''.join((c for c in text if c not in ',\n'))
When you use re.compile you need to save the returned Regular Expression object and then call sub on that. You also need to have a .* to match any character instead of matching close html tags. The re.MULTILINE flag is only for the begin and end string symbols (^ and $) so you do not need it in this case.
regex = re.compile('</a>.*D',re.DOTALL)
replace = regex.sub('</a>",D',text)
That should work. You don't need to convert replace to a string since it is already a string.
Alternatively, you can write a regular expression that doesn't use the dot:
replace = re.sub('"(,|\n)*D','",D',text)
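Run against a reduced stand-in for the sample in the question, that comma-and-newline alternative behaves like this:

```python
import re

text = '"<a>https://google.com</a>",,,,\n,,Dirección'
# the group (,|\n)* consumes the run of commas and the newline between the fields
cleaned = re.sub(r'"(,|\n)*D', '",D', text)
print(cleaned)  # "<a>https://google.com</a>",Dirección
```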
This worked for me, using re.sub with multiline text:
#!/usr/bin/env python3
import re

with open("myfile.txt") as infile:
    text = infile.read()

replace = re.sub(r"value1\n\s+nickname", "value\n    name", text, flags=re.DOTALL)

with open("newFile.txt", "w") as output:
    output.write(replace)

stemming problems in python

I want to find the stems of Persian verbs. First I made a file containing some common and exception stems. I want my code to search that file first: if the stem is there, it returns the stem, and if not, it goes through the rest of the code and returns the stem by deleting suffixes and prefixes. The problems are: 1) it pays no attention to the file and, ignoring it, just goes through the rest of the code and outputs a wrong stem, even though the exceptions are in the file; 2) because I used "for", the suffixes and prefixes of some verbs affect other verbs and strip those verbs' suffixes and prefixes, which sometimes outputs a wrong stem. How should I change the code so that each "for" loop works independently and doesn't affect the others? (I have to write just one function and call only it.)
I reduced some suffixes and prefixes.
def stemmer(verb, file):
    with open(file, encoding="utf-8") as f:
        f = f.read().split()
    for i in f:
        if i in verb:
            return i
        else:
            for i in suffix1:
                if verb.endswith(i):
                    verb = verb[:-len(i)]
            return verb
You don't have to post all of your code, sara. We are only concerned with the snippet that causes the problem.
My guess is that the problematic part is the check if i in verb, which might fail most of the time because of trailing characters left over after splitting the text. Normally, when you split the tokens, you also need to trim the trailing characters with the strip() method:
>>> 'who\n'.strip() in 'who'
True
Conditionals like:
>>> "word\n" in "word"
False
>>> 'who ' in 'who'
False
will always fail and that's why the program doesn't check the exceptions at all.
I found the answer. The problem is caused by the "else:"; there is no need for it.
def stemmer(verb, file):
    with open(file, encoding="utf-8") as f:
        f = f.read().split()
    for i in f:
        if i in verb:
            return i
    for i in suffix1:  # remote past tense
        if verb.endswith(i):
            verb = verb[:-len(i)]
            break
    return verb

Using regex in Python 2.7.3 to search text and output matches

I am trying to accomplish exactly what the title says. The program is meant to read a .txt file from a specified path and match the terms specified in the code. This is what I have so far:
import re

source = open("C:\\test.txt", "r")
lines = []
for line in source:
    line = line.strip()
    lines.append(line)
    if re.search('reply', line):
        print 'found: ', line
As you can see, I am specifying the term 'reply' using re.search but this restricts me to one term. I know there is a way to specify a list or dictionary of words to search for, but my attempts have failed. I think it's possible to create a list with something like ...
keywords = ['reply', 'error', 'what']
... but despite what I've read on this site, I can't seem to incorporate this into the code properly. Any advice or assistance with this is greatly appreciated!
PS. If I wanted to make the search case sensitive, would I be able to use ...
"(.*)(R|r)eply(.*)"
... in the list of terms I want to find?
One way:
import re

source = open("input", "r")
lines = []
keywords = ['reply', 'error', 'what']
# join list with OR, '|', operators
# re.I makes it case-insensitive
exp = re.compile("|".join(keywords), re.I)
for line in source:
    line = line.strip()
    lines.append(line)
    if re.search(exp, line):
        print 'found: ', line
With re.search(), you pass a single string, but you can specify quite complex patterns. See the docs on the Python re module, which has a section on "Regular Expression Syntax".
In fact you have the answer in your question... "R|r" searches for "R" or "r", so "reply|error|what" searches for 'reply', 'error', or 'what'.
PS. If I wanted to make the search case sensitive, would I be able to use ...
"(.*)(R|r)eply(.*)"
There's no need for the .* bit (and it may make your code slower). The re.search() function looks for a match anywhere in the string. (R|r)eply will look for 'reply' or 'Reply', it won't match 'REPLY' or 'rePly'.
If you want a case insensitive search, there's a flags=re.IGNORECASE option that you can pass to re.search(). E.g.:
re.search('reply', line, flags=re.IGNORECASE)
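Putting the joined keyword list together with that flag, a minimal Python 3 sketch (the sample lines are invented; re.escape is added in case a keyword ever contains a regex metacharacter):

```python
import re

keywords = ['reply', 'error', 'what']
exp = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

lines = ["Reply received", "no problems here", "WHAT is this"]
# keep only the lines where any keyword matches, regardless of case
found = [line for line in lines if exp.search(line)]
print(found)  # ['Reply received', 'WHAT is this']
```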

dealing with \n characters at end of multiline string in python

I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regexes cover multiple lines using the re.DOTALL option. If the last character in a compiled regex is a \n, the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other, no doubt related, problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.
If I correctly understood you, and all you need is to get the text without a newline at the end of each line and then iterate over that text to find a required word, then you can try the following:
data = (line for line in text.split('\n') if line.strip())  # gives you all non-empty lines without '\n' at the end
Now you can search/replace any text you need, using list slicing or regex functionality.
Or you can use replace in order to change all the '\n' to whatever you want:
text.replace('\n', '')
My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print content
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
I can't get a good handle on what is going on from your explanation, but you may be able to fix it by replacing all multiple newlines with a single newline as you read in the file. Another option might be to just trim the regex itself, removing the \n at the end unless you need it for something.
Is the question mark there to prevent the regex from matching more than one line at a time? If so, then you probably want to be using the MULTILINE flag instead of the DOTALL flag. The ^ sign will then match just after a newline or at the beginning of the string, and the $ sign will match just before a newline character or at the end of the string.
eg.
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves with the problem of empty lines. But why not just run one additional regex at the end that removes blank lines.
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
