I want to extract only this whole part - "value":["10|8.0|1665|82|apple|#||0","8|1132|188.60|banana|#||0"] from all the lines in a text file and then write it into another text file. This part has different values in every line.
I have written this regex pattern, but I am unable to get the whole part into another text file.
import re

with open("result.txt", "w+") as result_file:
    with open("log.txt", "r") as log_file:
        for lines in log_file:
            all_values = re.findall(r'("value"+:"[\w\.#|-]+")', lines)
            for i in all_values:
                result_file.write(i)
In your pattern, you can omit the outer parentheses of the capture group to get the match only.
The "+ part matches a double quote one or more times, which does not seem to be required.
You don't get the whole match because there are more characters in the string than those listed in the character class [\w\.#|-]+ (for example the square brackets, the commas and the inner double quotes).
As a broader match, you can use
"value":\[".*?"]
"value": match literally
\[" Match ["
.*? Match any character, as few times as possible
"] Match "]
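A minimal sketch of how this pattern could be used in the original loop, assuming the log lines look like the sample in the question. The helper name extract_values and the sample line are made up for illustration:

```python
import re

# Pattern from the answer above: "value":[" ... "] with a non-greedy middle part.
VALUE_PATTERN = re.compile(r'"value":\[".*?"]')

def extract_values(lines):
    """Collect every "value":[...] fragment found in an iterable of lines."""
    found = []
    for line in lines:
        found.extend(VALUE_PATTERN.findall(line))
    return found

# Hypothetical log line; a real log.txt will look different.
sample = ['{"id":1,"value":["10|8.0|1665|82|apple|#||0","8|1132|188.60|banana|#||0"],"x":2}']
for value in extract_values(sample):
    print(value)
```

To process the real files, pass the open log file object to extract_values() and write each result to result.txt followed by "\n", so that the matches do not run together on one line.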
I have used Python to extract a route table from a router and am trying to
strip out superfluous text, and
replace the destination of each route with a text string to match a different customer grouping.
At the moment I have:
infile = "routes.txt"
outfile = "output.txt"
delete_text = ["ROUTER1# sh run | i route", "ip route"]
client_list = ["CUST_A","CUST_B"]
subnet_list = ["1.2.3.4","5.6.7.8"]
fin = open(infile)
fout = open(outfile, "w+")
for line in fin:
    for word in delete_text:
        line = line.replace(word, "")
    for word in subnet_list:
        line = line.replace("1.2.3.4", "CUST_A")
    for word in subnet_list:
        line = line.replace("5.6.7.8", "CUST_B")
    fout.write(line)
fin.close()
fout.close()
f = open('output.txt', 'r')
file_contents = f.read()
print (file_contents)
f.close()
This works to an extent, but when it searches for and replaces e.g. 5.6.7.8, it also picks up that string within other IP addresses, e.g. 5.6.7.88, and replaces those as well, which I don't want to happen.
What I am after is for only an exact match to be found and replaced.
You could use re.sub() with explicit word boundaries (\b):
>>> re.sub(r'\b5.6.7.8\b', 'CUST_B', 'test 5.6.7.8 test 5.6.7.88 test')
'test CUST_B test 5.6.7.88 test'
As you found out, your approach is bad because it results in false positives (i.e., undesirable matches). You should parse the lines into tokens and then match the individual tokens. That might be as simple as first doing tokens = line.split() to split on whitespace. That, however, may not work if the line contains quoted strings. Consider the result of this statement: "ab 'cd ef' gh".split(). So you might need a more sophisticated parser.
You could use the re module to perform substitutions, using the \b meta sequence to ensure the matches begin and end on a "word" boundary. But that has its own, unique, failure modes. For example, consider that the . (period) character matches any character. So doing re.sub(r'\b5.6.7.8\b', ...) as @NPE suggested will actually match not just the literal word 5.6.7.8 but also 5x6.7y8. That may not be a concern given the inputs you expect, but it is something most people don't consider and is therefore another source of bugs. Regular expressions are seldom the correct tool for a problem like this one.
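If you do go the regex route, one way to avoid the "dot matches anything" problem is to build the pattern with re.escape(). The following is only a sketch: the replacements dict (built by hand from the question's subnet_list and client_list) and the helper replace_exact are hypothetical names, not part of the original script.

```python
import re

# Hypothetical mapping, pairing the question's subnet_list with client_list.
replacements = {"1.2.3.4": "CUST_A", "5.6.7.8": "CUST_B"}

def replace_exact(line, mapping):
    """Replace each subnet only where it appears as a whole 'word'."""
    for subnet, client in mapping.items():
        # re.escape() turns each "." into "\." so it matches a literal dot only.
        pattern = r"\b" + re.escape(subnet) + r"\b"
        line = re.sub(pattern, client, line)
    return line

print(replace_exact("ip route 5.6.7.8 via 5.6.7.88 and 5x6y7z8", replacements))
```

Note that the third argument to re.sub() is the line being processed, and the result is assigned back to line. Also be aware that \b still sees a boundary between a digit and a following dot, so a longer dotted string such as 5.6.7.8.9 would be partially replaced; splitting the line into tokens avoids that.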
thanks guys, I've been testing with this and the re.sub function just seems to print out the below string in a loop : CUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTB .
I have amended the code snippet above to :
for word in subnet_list:
    line = re.sub(r'\b5.6.7.8\b', 'CUST_B', '5.6.7.88')
Ideally I would like the string element to be replaced in all of the list occurrences along with preserving the list structure ?
I'm trying to extract the strings from a file that start with ${ and end with } using Python. I am using the code below to do so, but I don't get the expected result.
My input file looks like this:
Click ${SWIFT_TAB}
Click ${SEARCH_SWIFT_CODE}
and I want to get a list as below:
${SWIFT_TAB}
${SEARCH_SWIFT_CODE}
My current code looks like this:
def findStringFromFile(file):
    import os,re
    with open(file) as f:
        ans = []
        for line in f:
            matches = re.findall(r'\b\${\S+}\b', line)
            ans.extend(matches)
    print (ans)
I am expecting a list of strings that start with ${ and end with }, but all I currently get is an empty list.
The problem is that your regexp is buggy, and doesn't match the strings you want to extract. Specifically, you have two issues:
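{ and } are regexp metacharacters (they delimit quantifiers such as {2,3}), just like $. Python's re module happens to treat a brace that does not form a valid quantifier as a literal character, so this is not what breaks your pattern, but it is safer and clearer to escape them if you want to match them literally.
\b matches a word boundary, i.e. a position between a "word character" (a letter, a number or an underscore) and a "non-word character" (anything else) or the beginning/end of the string. It does not match between, say, a space and $, nor between } and the newline at the end of the line.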
To fix these issues, change your line:
matches = re.findall(r'\b\${\S+}\b', line)
to:
matches = re.findall(r'\$\{\S+\}', line)
and it should work.
See the Python regular expressions documentation for more details.
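As a quick check of the fix, here is a small self-contained sketch. The function name find_placeholders is made up, and the two sample lines are taken from the question's input file:

```python
import re

# Corrected pattern from the answer: no \b, braces escaped.
PLACEHOLDER = re.compile(r'\$\{\S+\}')

def find_placeholders(lines):
    """Return all ${...} tokens found in an iterable of lines."""
    ans = []
    for line in lines:
        ans.extend(PLACEHOLDER.findall(line))
    return ans

lines = ["Click ${SWIFT_TAB}\n", "Click ${SEARCH_SWIFT_CODE}\n"]
print(find_placeholders(lines))

# The version with \b finds nothing, because there is no word boundary
# between the space and "$".
print(re.findall(r'\b\$\{\S+\}\b', lines[0]))
```

Because \S+ is greedy, two placeholders written back to back with no whitespace between them would come out as a single match; if that can occur in your files, a character class such as [^}]+ in place of \S+ may be a better fit.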
I have Shakespeare's full works data from here that I want to use in a word embedding algorithm to create a model. The model requires that the whole text be provided with only single spaces and no other kind of whitespace. How can I do this? I found how to do this for a single string, but it isn't working for a text file.
My attempt (I am not very knowledgeable about Python):
with open(file_path, 'r') as data:
    for line in data:
        cleanedline = line.strip('\n')
The cleanedline doesn't have the \n removed when printed, so I didn't write the lines back into the file.
You could try a regular expression:
import re
with open(file_path) as data:
    text = re.sub(r'\s+', ' ', data.read())
The \s+ regular expression pattern will match any sequence of one or more whitespace characters. re.sub() will replace each such match with a single space.
Whitespace consists of characters such as space, tab, new line, return, form feed, vertical tab etc. It does not include punctuation.
Another way to do this without regex is to use split() then join():
with open(file_path) as data:
    text = ' '.join(data.read().split())
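The two approaches differ slightly at the edges of the text, which the following sketch illustrates. The function names and the sample string are made up for the example; for a real file you would pass data.read() instead:

```python
import re

def collapse_with_regex(text):
    """Replace every run of whitespace with one space (edges included)."""
    return re.sub(r'\s+', ' ', text)

def collapse_with_split(text):
    """Split on whitespace and re-join; leading/trailing whitespace is dropped."""
    return ' '.join(text.split())

sample = "  To be,\tor not\n\nto be  \n"
print(repr(collapse_with_regex(sample)))
print(repr(collapse_with_split(sample)))
```

The regex version leaves a single space at the start and end if the text began or ended with whitespace, while split()/join() removes it entirely. Calling .strip() on the regex result makes the two equivalent.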
I have an auto-generated bibliography file which stores my references. The citekey in the generated file is of the form xxxxx:2009tb. Is there a way to make a program detect such a pattern and change the citekey to the form xxxxx:2009?
It's not quite clear to me which expression you want to match, but you can build everything with regex, using import re and re.sub as shown. [0-9]{4} matches exactly 4 digits.
(Edit, to incorporate suggestions)
import re
inf = 'temp.txt'
outf = 'out.txt'
with open(inf) as f, open(outf, 'w') as o:
    text = f.read()
    # keep the key and the four-digit year (group 1), drop the trailing "tb"
    text = re.sub(r"(xxxxx:[0-9]{4})tb", r"\1", text) # match your regex here
    o.write(text)
You actually just want to remove the two letters after the year in a reference. Supposing we can uniquely identify a reference as a colon followed by four digits and two letters, then the following regular expression would work (at least it works in this example code):
import re
s = """
according to some works (newton:2009cb), gravity is not the same that
severity (darwin:1873dc; hampton:1956tr).
"""
new_s = re.sub(r'(:[0-9]{4})\w{2}', r'\1', s)
print(new_s)
Explanation: match a colon : followed by four digits [0-9]{4} followed by any two "word" characters \w{2}. The parentheses capture just the part you want to keep, and r'\1' means you are replacing each whole match with the smaller part of it that is in the first (and only) group of parentheses. The r prefix makes the string raw, so that \1 reaches re.sub as a backreference instead of being interpreted by Python as an escape sequence.
Hope this helps!
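Since \w{2} also matches digits and underscores, a slightly stricter variant may be worth considering. This is only a sketch under the assumption that the suffix is always two lowercase letters and that the key before the colon consists of word characters; the function name strip_citekey_suffix is made up:

```python
import re

# Assumption: citekeys look like <word chars>:<4 digits><2 lowercase letters>.
CITEKEY = re.compile(r'(\w+:[0-9]{4})[a-z]{2}\b')

def strip_citekey_suffix(text):
    """Drop the two-letter suffix after the year in every citekey."""
    return CITEKEY.sub(r'\1', text)

print(strip_citekey_suffix("xxxxx:2009tb"))
print(strip_citekey_suffix("(newton:2009cb), (darwin:1873dc; hampton:1956tr)"))
```

The trailing \b keeps the pattern from cutting into a longer run of letters, and keys that already have the short form are left untouched.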
I am writing a small Python script to gather some data from a database. The only problem is that when I export data as XML from MySQL, it includes a \b character in the XML file. I wrote code to remove it, but then realized I didn't need to do that processing every time, so I put it in a method and call it when I find a \b in the XML file. Only now the regex isn't matching, even though I know the \b is there.
here is what I am doing:
Main program:
'''Program should start here'''
#test the file to see if processing is needed before parsing
processing = False
for line in xml_file:
    p = re.compile("\b")
    if(p.match(line)):
        print(p.match(line))
        processing = True
        break #only one match needed
if(processing):
    print("preprocess")
    preprocess(xml_file)
Preprocessing method:
def preprocess(file):
    #exporting from MySQL query browser adds a weird
    #character to the result set, remove it
    #so the XML parser can read the data
    print("in preprocess")
    lines = []
    for line in xml_file:
        lines.append(re.sub("\b", "", line))
    #go to the beginning of the file
    xml_file.seek(0);
    #overwrite with correct data
    for line in lines:
        xml_file.write(line);
    xml_file.truncate()
Any help would be great,
Thanks
\b is a special sequence for the regular expression engine:
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
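So a raw r"\b" will not find the character. In a regular (non-raw) string literal, Python converts \b into the backspace character (\x08) before the regex engine ever sees it. If you want to escape that character explicitly in the regex, remember that a backslash in a Python string needs to be escaped as well (unless you use raw strings, which you don't want here), so you need a total of 3 backslashes:
p = re.compile("\\\b")
This will produce a pattern matching the backspace character. Also note that p.match() only matches at the beginning of the line; use p.search() to find the character anywhere in the line.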
Correct me if I'm wrong, but there is no need to use a regex in order to replace '\b'; you can simply use the replace method for this purpose:
def preprocess(file):
    #exporting from MySQL query browser adds a weird
    #character to the result set, remove it
    #so the XML parser can read the data
    print("in preprocess")
    #build the full list before seeking, so the file is read completely first
    lines = [line.replace("\b", "") for line in xml_file]
    #go to the beginning of the file
    xml_file.seek(0)
    #overwrite with correct data
    for line in lines:
        xml_file.write(line)
    # OR: xml_file.writelines(lines)
    xml_file.truncate()
Note that there is no need in Python to use ';' at the end of a statement.
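To make the difference between match() and search() concrete, and to show a regex-free check, here is a small sketch. The helper names and the sample lines are made up; it assumes the stray character really is a backspace (\x08):

```python
import re

BACKSPACE = "\b"  # in a non-raw Python string this is the single character \x08

def needs_preprocessing(lines):
    """True if any line contains a backspace character."""
    return any(BACKSPACE in line for line in lines)

def strip_backspaces(lines):
    """Return the lines with every backspace character removed."""
    return [line.replace(BACKSPACE, "") for line in lines]

# Hypothetical XML lines with a backspace in the middle of the second one.
sample = ["<row>\n", "<name>Al\bice</name>\n"]

pattern = re.compile("\b")
print(pattern.match(sample[1]))                # match() anchors at the line start
print(pattern.search(sample[1]) is not None)   # search() scans the whole line
print(needs_preprocessing(sample))
print(strip_backspaces(sample))
```

Here match() returns None because the backspace is not the first character of the line, while search() and the plain "in" test both find it.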