python regex command to extract data excluding comment line - python

I need to extract data from a data file: the lines beginning with the letter "U" or "L", excluding comment lines that begin with the character "/".
Example:
/data file FLG.dat
UAB-AB LRD1503 / reminder latches
I used a regex pattern in my Python program, but it captures only the comment lines, not the identifiers beginning with "U" or "L".

You can use ^([UL].+?)(?:/.*|)$. Code:
import re
s = """/data file FLG.dat
UAB-AB LRD1503 / reminder latches
LAB-AB LRD1503 / reminder latches
SAB-AB LRD1503 / reminder latches"""
lines = re.findall(r"^([UL].+?)(?:/.*|)$", s, re.MULTILINE)
If you want to delete the spaces at the end of each string, you can use a list comprehension with the same regular expression:
lines = [match.group(1).strip() for match in re.finditer(r"^([UL].+?)(?:/.*|)$", s, re.MULTILINE)]
Or you can edit the regular expression so that it does not include the spaces before the slash, ^([UL].+?)(?:\s*/.*|)$:
lines = re.findall(r"^([UL].+?)(?:\s*/.*|)$", s, re.MULTILINE)
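To see the difference between the variants, here is a minimal sketch that runs the first and the last pattern on the sample string from this answer (the variable names with_space and without_space are only illustrative):

```python
import re

s = """/data file FLG.dat
UAB-AB LRD1503 / reminder latches
LAB-AB LRD1503 / reminder latches
SAB-AB LRD1503 / reminder latches"""

# First pattern: the captured group keeps the space before the slash.
with_space = re.findall(r"^([UL].+?)(?:/.*|)$", s, re.MULTILINE)

# Last pattern: \s* moves that whitespace into the discarded comment part.
without_space = re.findall(r"^([UL].+?)(?:\s*/.*|)$", s, re.MULTILINE)

print(with_space)     # ['UAB-AB LRD1503 ', 'LAB-AB LRD1503 ']
print(without_space)  # ['UAB-AB LRD1503', 'LAB-AB LRD1503']
```

The comment line and the line starting with "S" are skipped in both cases because neither begins with U or L.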

In case the comments in your data lines are optional, here is a regular expression that covers both types of line, with or without a comment.
The regular expression for that is R"^([UL][^/]*)"
(edited, original RE was R"^([UL][^/]*)(/.*)?$")
The first group is the data you want to extract. In the original form, the second (optional) group would catch the comment, if any; the edited form simply stops at the first slash.
This example code prints only the 2 valid data lines.
import re
lines=["/data file FLG.dat",
"UAB-AB LRD1503 / reminder latches",
"UAB-AC LRD1600",
"MAB-AD LRD1700 / does not start with U or L"
]
datare = re.compile(R"^([UL][^/]*)")
matches = (match.group(1).strip() for match in (datare.match(line) for line in lines) if match)
for match in matches:
    print(match)
Note how match.group(1) extracts the first group of your RE and strip() removes any trailing spaces from the match.
Also note that you can replace lines in this example with a file handle and it would work the same way.
If the matches = line looks too complicated, it is a compact way of writing this:
for line in lines:
    match = datare.match(line)
    if match:
        print(match.group(1).strip())
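As a sketch of the file-handle remark, the loop can be wrapped in a small generator. Here io.StringIO is only a stand-in for an open file, and extract_data is a hypothetical helper name, not something from the original answer:

```python
import io
import re

DATA_RE = re.compile(r"^([UL][^/]*)")

def extract_data(handle):
    """Yield the data part of every line that starts with U or L."""
    for line in handle:
        match = DATA_RE.match(line)
        if match:
            # strip() also removes the trailing newline read from the handle
            yield match.group(1).strip()

# Stand-in for: with open("FLG.dat") as handle: ...
sample = io.StringIO(
    "/data file FLG.dat\n"
    "UAB-AB LRD1503 / reminder latches\n"
    "UAB-AC LRD1600\n"
    "MAB-AD LRD1700 / does not start with U or L\n"
)
records = list(extract_data(sample))
print(records)  # ['UAB-AB LRD1503', 'UAB-AC LRD1600']
```

Iterating over the handle reads one line at a time, so the whole file never has to be held in memory.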

Related

I want to extract gene boundaries(like 1..234, 234..456) from a file using regex in python but every time I use this code it returns me empty list

I want to extract gene boundaries (like 1..234, 234..456) from a file using a regex in Python, but every time I run this code it returns an empty list.
Below is what I have so far:
import re
#with open('boundaries.txt','a') as wf:
with open('sequence.gb','r') as rf:
    for line in rf:
        x = re.findall(r"^\s+\w+\s+\d+\W\d+", line)
        print(x)
The pattern does not match because \W matches only a single non-word character after the first digits, while the two numbers are separated by two dots.
You can repeat it one or more times with \W+:
^\s+\w+\s+\d+\W+\d+
As you want a single match from the start of the string, you can also use re.match without the anchor ^.
import re
s=" gene 1..3256"
pattern = r"\s+\w+\s+\d+\W+\d+"
m = re.match(pattern, s)
if m:
    print(m.group())
Output
gene 1..3256
Maybe you used the wrong regex. Try the code below.
for line in rf:
    x = re.findall(r"g+.*\s*\d+", line)
    print(x)
You can also use regex101 to test your regex pattern online.
A more suitable pattern would be: ^\s*gene\s*(\d+\.\.\d+)
Explanation:
^ - match beginning of a line
\s* - match zero or more whitespaces
gene - match gene literally
(...) - capturing group
\d+ - match one or more digits
\.\. - match .. literally
Then it is enough to take the first capturing group of the match to get the gene boundaries.
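A minimal sketch of that pattern in use. The sample lines are an assumption about what the feature lines look like, since the example file is not shown in the question:

```python
import re

GENE_RE = re.compile(r"^\s*gene\s*(\d+\.\.\d+)")

# Assumed sample lines; the real file would be read line by line instead.
sample_lines = [
    "     gene            1..3256",
    "     CDS             1..3256",
    "     gene            3300..4100",
]

boundaries = []
for line in sample_lines:
    m = GENE_RE.match(line)
    if m:
        boundaries.append(m.group(1))  # first capturing group only

print(boundaries)  # ['1..3256', '3300..4100']
```

Because the literal word gene is part of the pattern, other feature lines such as CDS are ignored even though they contain the same numbers.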

how to write a regular expression to match a small part of a repeating pattern?

I have the following pattern to match :
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
For some context, it's part of a larger file , which contains many similar patterns separated by commas :
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page'),
(11,'more random stuff 1nyny5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','subcat'),
(14,'more random stuff 21dd5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
My goal is to ditch all patterns ending with 'page' and to keep the rest. For that, I'm trying to use
regular expressions to identify those patterns. Here is the one I have come up with so far:
"\(.*?,\'page\'\)"
However, it's not working as expected.
In the following python code, I use this regex, and replace every match with an empty string :
import re
txt = "(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),"
txt += "(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),"
txt += "(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),"
txt += "(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),"
txt += "(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),"
txt += "(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
new_txt = re.sub("\(.*?,\'page\'\)", "",txt)
I was expecting that new_txt would contain all patterns ending with 'subcat', with all
patterns ending with 'page' removed. However, I obtain:
new_txt = ,,,,
What's happening here? How can I change my regex to obtain the desired result?
We might be tempted to do a regex replacement here, but that would basically always leave open edge cases, as @Wiktor has correctly pointed out in a comment below. Instead, a more foolproof approach is to use re.findall and simply extract every tuple which does not end in 'page'. Here is an example:
parts = re.findall(r"\(\d+,'[^']*?'(?:,'[^']*?'){4},'(?!page')[^']*?'\),?", txt)
print(''.join(parts))
This prints:
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),(15,'Anti-fascism','DL.8:NB�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
The regex pattern used above just matches a leading number, followed by 5 singly quoted terms, and then a sixth singly quoted term which is not 'page'. Then, we string join the tuples in the list output to form a string.
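What happens is that you concatenate the strings, and the non-greedy pattern then removes everything from an opening parenthesis up to the first following occurrence of ,'page'). A match that starts at a 'subcat' tuple therefore runs on through the next 'page' tuple, leaving only the separating commas.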
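The effect is easier to see on a shortened string. This is a small sketch with made-up tuples, not the data from the question:

```python
import re

# Shortened, made-up stand-in for the concatenated data.
txt = "(1,'a','page'),(2,'b','subcat'),(3,'c','page'),(4,'d','subcat')"

pattern = r"\(.*?,'page'\)"

# Each match starts at the next "(" and stops at the first ",'page')" after it.
found = re.findall(pattern, txt)
print(found)
# ["(1,'a','page')", "(2,'b','subcat'),(3,'c','page')"]

remaining = re.sub(pattern, "", txt)
print(remaining)
# ,,(4,'d','subcat')
```

The second match swallows the 'subcat' tuple (2, ...) together with the 'page' tuple that follows it. Only the last 'subcat' tuple survives, because no 'page' tuple comes after it.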
Another workaround might be to use a list of the strings and join them with a newline instead of concatenating them.
Then use your pattern, followed by an optional comma and newline, to remove those lines, leaving the ones that end with subcat.
import re
lines = [
"(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),",
"(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),",
"(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),",
"(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),",
"(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
"(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
"(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),",
"(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
]
new_txt = re.sub(r"\(.*,'page'\)(?:,\n)?", "", '\n'.join(lines))
print(new_txt)
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Or you can use a list comprehension to keep the lines that do not match the pattern.
result = [line for line in lines if not re.match(r"\(.*,'page'\),?$", line)]
print('\n'.join(result))
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Another option to match the parts that end with 'page') for the example data:
\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?
The pattern matches:
\(\d+, Match ( followed by 1+ digits and a comma
[^)]* Optionally match any char except )
(?: Non capture group
\)(?!,\s*\(\d+,)[^)]* Only match a ) when not directly followed by the pattern ,\s*\(\d+, which matches the start of the parts in the example data
)* Close group and optionally repeat
,'page'\),? Match ,'page') with an optional comma
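A short sketch of this pattern used with re.sub on made-up data. The name PAGE_TUPLE is only illustrative, and one field deliberately contains a closing parenthesis to exercise the lookahead:

```python
import re

PAGE_TUPLE = re.compile(r"\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?")

# Made-up data; the second tuple has a ")" inside a quoted field.
txt = "(1,'a','page'),(2,'b (x)','subcat'),(3,'c','page')"

kept = PAGE_TUPLE.sub("", txt)
print(kept)  # (2,'b (x)','subcat'),
```

A match can never cross from one tuple into the next, because a ) that is directly followed by ,( and digits is never consumed inside the group. Note that the trailing comma of the kept tuple stays in place.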

How can I find all paths in javascript file with regex in Python?

Sample Javascript (content):
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("sdfdsfsfds",'/test/path'),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
regex = ""
endpoints = re.findall(regex, content)
Output I want:
> /cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=
> /test/path
I want to find all strings starting with "/ or '/ using a regex. I've tried many URL regexes, but none of them worked for me.
This should do it:
regex = r"""["']\/[^"']*"""
Note that you will need to trim the first character from the match. This also assumes that there are no quotation marks in the path.
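A minimal sketch of the trimming step, run on a shortened stand-in for the JavaScript from the question (the content string is abbreviated for illustration):

```python
import re

# Abbreviated stand-in for the JavaScript source in the question.
content = (
    'r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),'
    "r.setAttribute(\"sdfdsfsfds\",'/test/path')"
)

regex = r"""["']\/[^"']*"""

# Each match still starts with the opening quote, so drop the first character.
endpoints = [m[1:] for m in re.findall(regex, content)]
for endpoint in endpoints:
    print(endpoint)
# /cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=
# /test/path
```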
Consider:
import re
txt = ... #your code
pat = r"(\"|\')(\/.*?)\1"
for el in re.findall(pat, txt):
    print(el[1])
Each el is a match of the pattern: a single or double quote, then a slash followed by as few characters as possible, then the same character as at the beginning (the same type of quote).
.* stands for any number of any characters, and the following ? makes it non-greedy, i.e. it matches as few characters as possible. \1 refers to the first group, so it matches whichever type of quote was matched at the beginning. By specifying el[1] we return the second group, i.e. whatever was matched within the quotes.

Reading a text file and combining 2 lines into one using a regular expression

I am fairly new to python. I am trying to use regular expressions to match specific text in a file.
I can extract the data, but only with one regular expression at a time, since the two values are on different lines and I am struggling to put them together. These several lines repeat throughout the file.
[06/05/2020 08:30:16]
othertext <000.000.000.000> xx s
example <000.000.000.000> xx s
I managed to print one or the other regular expressions:
[06/05/2020 08:30:16]
or
example <000.000.000.000> xx s
But not combined into something like this:
(timestamp) (text)
[06/05/2020 08:30:16] example <000.000.000.000> xx s
These are the regular expressions
regex = r"^\[\d\d\/\d\d\/\d\d\d\d\s\d\d\:\d\d\:\d\d\]" #Timestamp
regex = r"(^example\s+.*\<000\.000\.000\.000\>\s+.*$)" # line that contain the text
This is the code so far. I have tried a secondary for loop with another condition, but it seems to match only one of the regular expressions at a time.
Any pointers will be greatly appreciated.
import re
filename = input("Enter the file: ")
regex = r"^\[\d\d\/\d\d\/\d\d\d\d\s\d\d\:\d\d\:\d\d\]" #Timestamp
with open (filename, "r") as file:
    list = []
    for line in file:
        for match in re.finditer(regex, line, re.S):
            match_text = match.group()
            list.append(match_text)
            print (match_text)
You can match blocks of text like this in one go with a regex of this type:
(^\[\d\d\/\d\d\/\d\d\d\d[ ]+\d\d:\d\d:\d\d\])\s+[\s\S]*?(^example.*)
However, the whole file's text needs to be read into memory at once to do so, and the regex needs the re.M flag so that ^ matches at the start of each line.
The key elements of the regex:
[\s\S] idiomatically, this matches any character, including newlines
* zero or more times
? not greedily; otherwise it would skip past the nearest (^example.*) line to the last one in the text
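A sketch of how this could look in Python, assuming the file has already been read into one string (for a real file, text = file.read() would replace the literal below; the second block of sample data is made up):

```python
import re

# Stand-in for the result of file.read(); the second block is invented.
text = """[06/05/2020 08:30:16]
othertext <000.000.000.000> xx s
example <000.000.000.000> xx s
[06/05/2020 08:31:20]
othertext <000.000.000.000> yy s
example <000.000.000.000> yy s
"""

block_re = re.compile(
    r"(^\[\d\d\/\d\d\/\d\d\d\d[ ]+\d\d:\d\d:\d\d\])\s+[\s\S]*?(^example.*)",
    re.M,  # let ^ match at the start of every line, not just the string
)

# findall returns one (timestamp, line) tuple per block.
combined = [f"{stamp} {line}" for stamp, line in block_re.findall(text)]
for entry in combined:
    print(entry)
# [06/05/2020 08:30:16] example <000.000.000.000> xx s
# [06/05/2020 08:31:20] example <000.000.000.000> yy s
```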

Having a problem with Python Regex: Prints "None" when printing "matches". Regex works in tester

I'm supposed to extract groups of text from a file with a top ten list: name, rank, etc. for each. You can see the file and the regex here https://regex101.com/r/fXK5YV/1. It works in there and you can see the capturing groups.
import re
pattern = '''
(?P<list><li\sclass="regular-search-result">(.|\n)*?(?<=\<span class=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span class=\"review-count rating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.match(yelp_html)
This prints None.
There is definitely text inside of yelp_html.
What am I doing wrong?
I see two issues:
You're not using a raw string (prefix the string with an r), which means that some of your backslashes represent special characters instead of being part of the pattern; for example, \n becomes a real newline, which re.VERBOSE then ignores.
With re.VERBOSE, unescaped whitespace in the pattern is ignored, so the literal spaces inside your tags (for instance in <span class=...>) never match; write them as \s, as the regex in your link is not compiled in verbose mode. Also note that match only matches at the start of the string, so the code below uses finditer to find every entry.
import re
pattern = r'''
(?P<list><li\sclass=\"regular-search-result\">(.|\n)*?(?<=\<span\sclass=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span\sclass=\"review-count\srating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})
(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.finditer(yelp_html)
for item in matches:
    print(item.group('rank', 'name', 'stars', 'numrevs', 'price'))
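One detail is worth checking in isolation: under re.VERBOSE, unescaped whitespace in the pattern is ignored, so a literal space between a tag name and its attribute has to be written as \s. A minimal sketch on a made-up snippet (the HTML and the names loose and strict are only illustrative):

```python
import re

# Made-up HTML snippet, not the page from the question.
html = '<li><span class="rank">7</span></li>'

# The space after "span" is dropped by re.VERBOSE, so this looks for "<spanclass=".
loose = re.compile(r'''<span class="rank">(?P<rank>\d+)''', re.VERBOSE)

# \s survives verbose mode and matches the space.
strict = re.compile(r'''<span\sclass="rank">(?P<rank>\d+)''', re.VERBOSE)

print(loose.search(html))                 # None
print(strict.search(html).group("rank"))  # 7

# match() only succeeds at the very start of the string; search() scans ahead.
print(strict.match(html))                 # None
```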
