Python Regex matching on unexpected parts of search string - python

I'm trying to parse a page using regex (Python 2.7; IPython QTConsole). The page is a .txt file from a web directory that I fetched with urllib2:
>>> import re
>>> Z = '[A-Z]{2}Z[0-9]{3}.*?\\$\\$'
>>> snippet = re.search(Z, page, re.DOTALL)
>>> snippet.group() # Only including the first part for brevity.
'PZZ570-122200-\nPOINT ARENA TO POINT REYES 10 TO 60 NM OFFSHORE-\n249 AM PDT FRI SEP 12 2014\n.TODAY...SW WINDS 5 KT. WIND WAVES 2 FT OR LESS.\nNW SWELL 3 TO 5 FT AT 12 SECONDS. PATCHY FOG IN THE MORNING.\n.TONIGHT...W WINDS 10 KT. WIND WAVES 2 FT OR LESS.'
I want to search for the newline followed by a period. I'd like to get the first and second occurrences as below. The objective is to parse the information between the first and second (and subsequent) \n\. delimiters. I know I could do look-around, but I'm having trouble making the lookahead greedy. Further, I can't figure out why the following doesn't work.
>>> pat = r"\n\."
>>> s = re.search(pat, snippet.group(), re.DOTALL)
>>> e = re.search(pat, snippet.group()[s.end():], re.DOTALL)
The s above works, but I get a strange result for e.
>>> [s.group(), s.start(), e.group(), e.end()]
['\n.', 90, '\n.', 110]
>>> snippet.group()[s.start():e.end()]
'\n.TODAY...SW WINDS 5'
>>> snippet.group()[e.start():e.end()]
' 5'
I guess there's some formatting in snippet.group() that's hidden? If that's the case, then it's strange that some newlines are explicit as if snippet.group() is raw, and others are hidden. Why are e.group(), and snippet.group()[e.start():e.end()] different?
I apologize if this question has already been addressed. I couldn't find anything related.
Thanks very much in advance.

To split a string in Python, it might be easier to use str.split() or re.split(), e.g.:
"1\n.2\n.3".split("\n.")

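As for why e looked wrong in the question: e was found in the slice snippet.group()[s.end():], so e.start() and e.end() count from the start of that slice, not from the start of the full string. Below is a minimal sketch of two ways around that. The variable text is a shortened, made-up stand-in for snippet.group(). A compiled pattern's search() accepts a start position and keeps offsets relative to the whole string, and re.split() returns the delimited pieces directly.

```python
import re

# Shortened stand-in for snippet.group() (hypothetical sample text).
text = "HEADER\n.TODAY...SW WINDS 5 KT.\n.TONIGHT...W WINDS 10 KT."
pat = re.compile(r"\n\.")

first = pat.search(text)

# Searching a slice: offsets are relative to the slice, so they must be
# shifted by first.end() before they can index the full string.
in_slice = pat.search(text[first.end():])
shifted_start = in_slice.start() + first.end()

# Passing a start position keeps offsets relative to the full string.
second = pat.search(text, first.end())

# Everything between the first and second "\n." delimiters.
between = text[first.end():second.start()]
print(between)

# re.split() returns every piece between the delimiters in one call.
parts = re.split(r"\n\.", text)
print(parts)
```

If only the pieces are needed and not their positions, the re.split() form is probably the simpler choice.
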
Related

How do I print everything between two sentences?

I'm very new to coding and only know the very basics. I am using Python and trying to print everything between two sentences in a text. I only want the content between them, not before or after. It's probably very easy, but I couldn't figure it out.
Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom) Ev 134 Haukelifjell Hordaland / Telemark — Icy. 10 o'clock 1 degree. Valid from: 05.01.2020 13:53 Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)
I want to collect the bold text to use on a website later. Everything except the italic text (the sentence before and the sentence after) is dynamic, if that matters.
You can use split to cut the string and access the parts that you are interested in.
If you know how to get the full text already, it's easy to get the bold sentence by removing the two constant sentences before and after.
full_text = "Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom) Ev 134 Haukelifjell Hordaland / Telemark — Icy. 10 o'clock 1 degree. Valid from: 05.01.2020 13:53 Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
s1 = "Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom)"
s2 = "Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
bold_text = full_text.split(s1)[1] # Remove the left part.
bold_text = bold_text.split(s2)[0] # Remove the right part.
bold_text = bold_text.strip() # Clean up spaces on each side if needed.
print(bold_text)
It looks like a job for regular expressions; Python has the re module for that.
You should:
Open the file
Read its content in a variable
Use search or match function in the re module
In particular, in the last step you should use your "surrounding" strings as "delimiters" and capture everything between them. You can achieve this using a regex pattern like str1 + "(.*)" + str2. If the surrounding strings contain regex metacharacters such as parentheses (as they do in your example), pass them through re.escape() first so they are matched literally.
You can take a look at the regex documentation, but just to give you an idea:
".*" matches any run of characters (except newlines, unless re.DOTALL is set)
"()" lets you capture the content inside them and access it later by index (e.g. re.search(pattern, original_string).group(1))

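A minimal sketch of the delimiter approach from the answer above, assuming a shortened version of the road report (full_text and the middle sentence are sample data, not the real feed). The surrounding strings contain parentheses, so they are escaped before being joined into a pattern; without re.escape() the parentheses would be read as groups and the search would not match.

```python
import re

# Shortened sample data; s1 and s2 are the fixed sentences around the target.
s1 = "Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom)"
s2 = "Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
full_text = s1 + " Icy. 10 o'clock 1 degree. " + s2

# Unescaped: "(" and ")" act as grouping, so the literal text is not found.
unescaped = re.search(s1 + "(.*)" + s2, full_text)

# Escaped: the delimiters are matched literally; (.*?) captures what lies between.
pattern = re.escape(s1) + r"(.*?)" + re.escape(s2)
match = re.search(pattern, full_text, re.DOTALL)
between = match.group(1).strip() if match else None
print(between)
```

The re.DOTALL flag is only needed if the text between the delimiters can span several lines.
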
Parsing variable length data

I'm using Python 3 and I'm relatively new to RegEx.
I'm struggling to come up with a good way to tackle the following problem.
I have a text string (that can include line breaks etc.) that contains several sets of information.
For example:
TAG1/123456 TAG2/ABCDEFG HISTAG3/A1B1C1D1 QWERTY TAG4/0987654321
TAG5/THE CAT SAT ON THE MAT MYTAG6/FLINTSTONE
TAG7/99887766AA
I need this parsed to the following
TAG1/123456
TAG2/ABCDEFG
HISTAG3/A1B1C1D1 QWERTY
TAG4/0987654321
TAG5/THE CAT SAT ON THE MAT
MYTAG6/FLINTSTONE
TAG7/99887766AA
I can't seem to work out how to deal with the variable length tags :( TAG3 and TAG5
I always end up capturing the next tag i.e.
TAG5/THE CAT SAT ON THE MAT TAG6
In reality the TAGs themselves are also variable. Most are 3 characters followed by '/' but not all. Some are 4, 5 and 6 characters long. But all are followed by '/' and all EXCEPT the first one are preceded by a space.
Updated Information
I have updated the example to show these variable tags. But to clarify a tag can be 1-8 alphabetic characters, preceded by a space and terminated by '/'
The data after the tag can be one or more words (alphanumeric) and is defined as all the data that follows the '/' of the tag up until the start of the next tag or the end of the string.
Any pointers would be greatly appreciated.
I think this is one way to achieve what you want:
import re
s = """TAG1/123456 TAG2/ABCDEFG TAG3/A1B1C1D1 QWERTY TAG4/0987654321
TAG5/THE CAT SAT ON THE MAT TAG6/FLINTSTONE
TAG7/99887766AA"""
r = re.compile(r'\w+/.+?(?=$|\s+\w+/)')
tags = r.findall(s)
print(*tags, sep='\n')
Output:
TAG1/123456
TAG2/ABCDEFG
TAG3/A1B1C1D1 QWERTY
TAG4/0987654321
TAG5/THE CAT SAT ON THE MAT
TAG6/FLINTSTONE
TAG7/99887766AA
The important bits are the non-greedy qualifier +? and the lookahead (?=$|\s+\w+/).
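The same lookahead can also drive re.split(), which may read more naturally when the goal is to cut the string at each tag boundary. This is a sketch under the assumption that a tag is any run of word characters directly followed by '/', and that the data itself never contains such a run; the dictionary step is an illustrative extra, not part of the answer above.

```python
import re

s = """TAG1/123456 TAG2/ABCDEFG HISTAG3/A1B1C1D1 QWERTY TAG4/0987654321
TAG5/THE CAT SAT ON THE MAT MYTAG6/FLINTSTONE
TAG7/99887766AA"""

# Split on whitespace only where the next word is immediately followed by "/".
fields = re.split(r"\s+(?=\w+/)", s)
print(*fields, sep="\n")

# Separate each tag from its data at the first "/".
parsed = {}
for field in fields:
    tag, _, data = field.partition("/")
    parsed[tag] = data
```

Because the lookahead consumes nothing, the tag stays attached to the field that follows the split point.
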

Python Regex Problems and grouping

I'm trying to parse data from a text file that has lines like:
On 1-1-16 1:48 Bob used: 187
On 1-5-16 2:50 Bob used: 2
I want to print only the time and the number used, so it would look like:
1-1-16, 1:48, 187
1-5-16, 2:50, 2
I'm using this regex:
print(re.search(r"On ([0-9,-, ]+)Bob used ([0-9\.]+)", line.strip()))
I get results that say <_sre.SRE_Match object; span=(23, 26), match='Bob used: 187'>
I tried using .group() but it gives the error "'NoneType' object has no attribute 'group'". I also noticed it's only finding the second grouping (the number) and not the first (the date and time).
How can this be fixed?
You are missing the : after Bob used, and you need a more precise expression for the date part, for instance \d+-\d+-\d+ \d+:\d+ as below:
>>> s = 'On 1-1-16 1:48 Bob used: 187 On 1-5-16 2:50 Bob used: 2'
>>> re.search(r"On (\d+-\d+-\d+ \d+:\d+) Bob used: ([0-9\.]+)", s).groups()
('1-1-16 1:48', '187')
You didn't give enough information on how you're using it, but since you're getting a Match object back, it shouldn't be None when you call .group() unless you're failing to store the result to the correct place. Most likely you are processing many lines, some of which match, and some of which don't, and you're not checking whether you matched before accessing groups.
Your code should always verify it got a Match before working with it further; make sure your test is structured like:
match = re.search(r"On ([0-9,-, ]+)Bob used ([0-9\.]+)", line.strip())
if match is not None:
    ... do stuff with match.group() here ...
... but not here ...
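Putting the two answers above together, a minimal sketch might look like the following. The lines list stands in for the text file (the malformed entry is made up to show the None check), and the pattern assumes every valid line has the shape "On <date> <time> <name> used: <number>".

```python
import re

# One capture group each for the date, the time, and the number used.
LINE_RE = re.compile(r"On (\d+-\d+-\d+) (\d+:\d+) \w+ used: (\d+)")

# Stand-in for the lines of the text file (hypothetical sample data).
lines = [
    "On 1-1-16 1:48 Bob used: 187",
    "this line does not match",
    "On 1-5-16 2:50 Bob used: 2",
]

rows = []
for line in lines:
    match = LINE_RE.search(line.strip())
    if match is None:
        continue  # skip lines that do not match instead of calling .group() on None
    rows.append(", ".join(match.groups()))

print("\n".join(rows))
```
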
I'm pretty new to regular expressions myself, but I came up with this:
import re
source = "On 1-1-16 1:48 Bob used: 187\nOn 1-5-16 2:50 Bob used: 2"
x = re.finditer(r'([0-9]-)+[0-9]+', source)  # dates
y = re.finditer(r'[0-9]+:[0-9]+', source)  # times
z = re.finditer(r': [0-9]*', source)  # counts, still prefixed with ': '
L = []
for i, j, k in zip(x, y, z):
    L.append((i.group(), j.group(), k.group().replace(': ', '')))
print(L)
Output:
[('1-1-16', '1:48', '187'), ('1-5-16', '2:50', '2')]

Match to second regular expression if first has no matches

I'm attempting to extract the text between HTML tags using regex in python. The catch is that sometimes there are no HTML tags in the string, so I want my regex to match the entire string. So far, I've got the part that matches the inner text of the tag:
(?<=>).*(?=<\/)
This would match to Russia in the tag below
<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>
Alternately, the entire string would be matched:
Typhoon Vongfong prompted ANA to cancel 101 flights, affecting about 16,600 passengers, the airline said in a faxed statement. Japan Airlines halted 31 flights today and three tomorrow, it said by fax. The storm turned northeast after crossing Okinawa, Japan’s southernmost prefecture, with winds gusting to 75 knots (140 kilometers per hour), according to the U.S. Navy’s Joint Typhoon Warning Center.
Otherwise I want it to return all the text in the string.
I've read a bit about regex conditionals online, but I can't seem to get them to work. If anyone can point me in the right direction, that would be great. Thanks in advance.
You could do this with a single regex. You don't need to go for any workaround.
>>> import re
>>> s='<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>'
>>> re.findall(r'(?<=>)[^<>]+(?=</)|^(?!.*?>.*?</).*', s, re.M)
['Russia']
>>> s='This is Russia Today'
>>> re.findall(r'(?<=>)[^<>]+(?=</)|^(?!.*?>.*?</).*', s, re.M)
['This is Russia Today']
Here is a work-around. Instead of adjusting the regex, we adjust the string:
>>> s='<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>'
>>> re.findall(r'(?<=>)[^<>]*(?=<\/)', s if '>' in s else '>%s</' % s)
['Russia']
>>> s='This is Russia Today'
>>> re.findall(r'(?<=>)[^<>]*(?=<\/)', s if '>' in s else '>%s</' % s)
['This is Russia Today']
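If a single pattern is not a hard requirement, the fallback can also live in plain Python: try the tag pattern first and return the whole string when it finds nothing. This is only a sketch; the function name inner_text_or_all is made up for illustration.

```python
import re

# Text that sits between a ">" and the next "</".
TAG_TEXT = re.compile(r"(?<=>)[^<>]+(?=</)")

def inner_text_or_all(s):
    """Return the text inside tags, or the whole string if there are no tags."""
    found = TAG_TEXT.findall(s)
    return found if found else [s]

print(inner_text_or_all('<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>'))
print(inner_text_or_all('This is Russia Today'))
```
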

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call group() on the info object:
print(info.group(1)) # or info.groups()[0]
# strip() drops the trailing space that the first two groups pick up
print('"%s","%s","%s"' % tuple(g.strip() for g in info.groups()))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the possible characters in the title. If the 'Michael Schenker Group' part can contain anything other than letters, digits, underscores, and whitespace, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, reading left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be dashes or other punctuation here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
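The breakdown above can also be written into the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows comments. A minimal sketch using the same pattern and sample title (with the trailing parenthesis put back):

```python
import re

pat = re.compile(r"""
    ([\w\s]+)   # group 1: band name (word and space characters)
    \(          # literal opening parenthesis
    ([\w\s]+)   # group 2: venue
    (\d+/\d+)   # group 3: date, digits on both sides of a slash
    \)          # literal closing parenthesis
""", re.VERBOSE)

s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
info = pat.match(s)

# The first two groups keep a trailing space, so strip each value.
band, venue, date = (g.strip() for g in info.groups())
print(band, venue, date, sep=",")
```

One caveat: \w also matches digits, so with a two-digit month such as 12/26 the venue group would take the leading 1 and the date would come out as 2/26. A pattern that requires a space before the date avoids that.
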
The Python intro to regex is quite good, and you might want to spend an evening going over it: http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
import feedparser # from www.feedparser.org
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    # Split "band (venue date)" into its three parts.
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:  # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # lat and lng are numbers, so convert them before joining.
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
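For the CSV step, the standard csv module handles quoting if a band or venue name ever contains a comma. The sketch below leaves out the feed and the geocoding entirely: titles is a made-up list standing in for the feed entries, and parse_title is a hypothetical helper built on the same pattern as above.

```python
import csv
import io
import re

TITLE_RE = re.compile(r"(.*) \((.*) (\d+/\d+)\)")

def parse_title(title):
    """Return (band, venue, date) or None if the title has another shape."""
    m = TITLE_RE.match(title)
    return m.groups() if m else None

# Stand-in for entry.title values from the feed (hypothetical sample data).
titles = [
    "randy travis (Billy Bobs 3/21)",
    "Michael Schenker Group (House of Blues Dallas 3/26)",
    "a title without a venue",
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["band", "venue", "date"])
for title in titles:
    parsed = parse_title(title)
    if parsed:
        writer.writerow(parsed)

csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a real file instead of io.StringIO only requires passing an open file object to csv.writer.
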
