I have the following pattern to match :
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
For some context, it's part of a larger file , which contains many similar patterns separated by commas :
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page'),
(11,'more random stuff 1nyny5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','subcat'),
(14,'more random stuff 21dd5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
my goal is to ditch all patterns ending with 'page' and to keep the rest. For that, I'm trying to use
regular expressions to identify those patterns. Here is the one I come out with for now :
"\(.*?,\'page\'\)"
However, it's not working as expected.
In the following python code, I use this regex, and replace every match with an empty string :
import re
txt = "(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),"
txt += "(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),"
txt += "(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),"
txt += "(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),"
txt += "(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),"
txt += "(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
new_txt = re.sub("\(.*?,\'page\'\)", "",txt)
I was expecting that new_text would contains all patterns ending with 'subcat', and remove all
patterns ending with 'page', however, I obtain :
new_txt = ,,,,
What's happening here ? How can I change my regex to obtain the desired result ?
We might be tempted to do a regex replacement here, but that would basically always leave open edge cases, as #Wiktor has correctly pointed out in a comment below. Instead, a more foolproof approach is to use re.findall and simply extract every tuple with does not end in 'page'. Here is an example:
parts = re.findall(r"\(\d+,'[^']*?'(?:,'[^']*?'){4},'(?!page')[^']*?'\),?", txt)
print(''.join(parts))
This prints:
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),(15,'Anti-fascism','DL.8:NB�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
The regex pattern used above just matches a leading number, followed by 5 singly quoted terms, and then a sixth singly quoted term which is not 'page'. Then, we string join the tuples in the list output to form a string.
What happens is that you concatenate the string, then then remove all until the first occurrence of ,'page') leaving only the trailing comma's.
Another workaround might be using a list of the strings, and join them with a newline instead of concatenating them.
Then use your pattern matching an optional comma and newline at the end to remove the line, leaving the ones that end with subcat
import re
lines = [
"(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),",
"(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),",
"(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),",
"(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),",
"(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
"(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
"(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),",
"(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
]
new_txt = re.sub("\(.*,'page'\)(?:,\n)?", "", '\n'.join(lines))
print(new_txt)
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Or you can use a list comprehension to keep the lines that do not match the pattern.
result = [line for line in lines if not re.match(r"\(.*,'page'\),?$", line)]
print('\n'.join(result))
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Another option to match the parts that end with 'page') for the example data:
\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?
The pattern matches:
\(\d+, Match ( followed by 1+ digits and a comma
[^)]* Optionally match any char except )
(?: Non capture group
\)(?!,\s*\(\d+,)[^)]* Only match a ) when not directly followed by the pattern ,\s*\(\d+, which matches the start of the parts in the example data
)* Close group and optionally repeat
,'page'\),? Match ,'page') with an optional comma
Regex demo
I am having no luck getting anything from this regex search.
I have a text file that looks like this:
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
I want to extract the lines that begin with "REF*23*" and ending with the "~"
txtfile = open(i + fileName, "r")
for line in txtfile:
line = line.rstrip()
p = re.findall(r'^REF*23*.+~', line)
print(p)
But this gives me nothing. As much as I'd like to dig deep into regex with python I need a quick solution to this. What i'm eventually wanting is just the digits between the last "*" and the "~" Thanks
You do not really need a regex if the only task is to extract the lines that begin with "REF*23*" and ending with the "~":
results = []
with open(i + fileName, "r") as txtfile:
for line in txtfile:
line = line.rstrip()
if line.startswith('REF*23*') and line.endswith('~'):
results.append(line)
print(results)
If you need to get the digit chunks:
results = []
with open(i + fileName, "r") as txtfile:
for line in txtfile:
line = line.rstrip()
if line.startswith('REF*23*') and line.endswith('~'):
results.append(line[7:-1]) # Just grab the slice
See non-regex approach demo.
NOTES
In a regex, * must be escaped to match a literal asterisk
You read line by line, re.findall(r'^REF*23*.+~', line) makes little sense as the re.findall method is used to get multiple matches while you expect one
Your regex is not anchored on the right, you need $ or \Z to match ~ at the end of the line. So, if you want to use a regex, it would look like
m = re.search(r'^REF\*23\*(.*)~$', line):
if m:
results.append(m.group(1)) # To grab just the contents between delimiters
# or
results.append(line) # To get the whole line
See this Python demo
In your case, you search for lines that start and end with fixed text, thus, no need using a regex.
Edit as an answer to the comment
Another text file is a very long unbroken like with hardly any spaces. I need to find where a section begins with REF*0F* and ends with ~, with the number I want in between.
You may read the file line by line and grab all occurrences of 1+ digits between REF*0F* and ~:
results = []
with open(fileName, "r") as txtfile:
for line in txtfile:
res = re.findall(r'REF\*0F\*(\d+)~', line)
if len(res):
results.extend(res)
print(results)
You can entirely use string functions to get only the digits (though a simple regex might be more easy to understand, really):
raw = """
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
"""
result = [digits[:-1]
for line in raw.split("\n") if line.startswith("REF*23*") and line.endswith("~")
for splitted in [line.split("*")]
for digits in [splitted[-1]]]
print(result)
This yields
['526344060']
* is a special character in regex, so you have to escape it as #The Fourth Bird points out. You are using an raw string, which means you don't have to escape chars from Python-language string parsing, but you still have to escape it for the regex engine.
r'^REF\*23\*.+~'
or
'^REF\\*23\\*.+~'
# '\\*' -> '\*' by Python string
# '\*' matches '*' literally by regex engine
will work. Having to escape things twice leads to the Leaning Toothpick Syndrome. Using a raw-string means you have to escape once, "saving some trees" in this regard.
Additional changes
You might also want to throw parens around .+ to match the group, if you want to match it. Also change the findall to match, unless you expect multiple matches per line.
results = []
with open(i + fileName, "r") as txtfile:
line = line.rstrip()
p = re.match(r'^REF\*23\*(.+)~', line)
if p:
results.append(int(p.group(1)))
Consider using a regex tester such as this one.
Based on this forum Replacing a line in a file based on a keyword search, by line from another file i am having little difficulty in my real file. Where as shown in picture below, i want to search a keyword "PBUSH followed by number(keeps increasing)" and based on that keyword i'd search in the other file if it is present or not. If it is present then replace the data from the line "PBUSH number K Some decimals" to the found line in another file, keeping search keyword as same. It'll keep going till the end of file, which looks like
and the code i modified (notice the findall and sub format) looks like:
import re
path1 = "C:\Users\sony\Desktop\PBUSH1.BDF"
path2 = "C:\Users\sony\Desktop\PBUSH2.BDF"
with open(path1) as f1, open(path2) as f2:
dat1 = f1.read()
dat2 = f2.read()
matches = re.findall('^PBUSH \s [0-9] \s K [0-9 ]+', dat1, flags=re.MULTILINE)
for match in matches:
dat2 = re.sub('^{} \s [0-9] \s K \s'.format(match.split(' ')[0]), match, dat2, flags=re.MULTILINE)
with open(path2, 'w') as f:
f.write(dat2)
Here my search keyword is PBUSH spaces Number and then the rest follows as shown in the PBUSH lines. I am unable to make it work. What could be the possible reason!
Better to use groups in such case and separate the whole string into two, one for matching phrase and other for data.
import re
# must use raw strings for paths, otherwise we need to
# escape \ characters
input1 = r"C:\Users\sony\Desktop\PBUSH1.BDF"
input2 = r"C:\Users\sony\Desktop\PBUSH2.BDF"
with open(input1) as f1, open(input2) as f2:
dat1 = f1.read()
dat2 = f2.read()
# use finditer instead of findall so that we will get
# a match object for each match.
# For each matching line we also have one subgroup, containing the
# "PBUSH NNN " part, whereas the whole regex matches until
# the next end of line
matches = re.finditer('^(PBUSH\s+[0-9]+\s+).*$', dat1, flags=re.MULTILINE)
for match in matches:
# for each match we construct a regex that looks like
# "^PBUSH 123 .*$", then replace all matches thereof
# with the contents of the whole line
dat2 = re.sub('^{}.*$'.format(match.group(1)), match.group(0), dat2, flags=re.MULTILINE)
with open(input2, 'w') as outf:
outf.write(dat2)