Match to second regular expression if first has no matches - python

I'm attempting to extract the text between HTML tags using regex in Python. The catch is that sometimes there are no HTML tags in the string, in which case I want my regex to match the entire string. So far, I've got the part that matches the inner text of the tag:
(?<=>).*(?=<\/)
This matches Russia in the tag below:
<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>
Alternatively, for a string with no tags such as the following, the entire string should be matched:
Typhoon Vongfong prompted ANA to cancel 101 flights, affecting about 16,600 passengers, the airline said in a faxed statement. Japan Airlines halted 31 flights today and three tomorrow, it said by fax. The storm turned northeast after crossing Okinawa, Japan’s southernmost prefecture, with winds gusting to 75 knots (140 kilometers per hour), according to the U.S. Navy’s Joint Typhoon Warning Center.
That is, when there are no tags I want it to return all the text in the string.
I've read a bit about regex conditionals online, but I can't seem to get them to work. If anyone can point me in the right direction, that would be great. Thanks in advance.

You could do this with a single regex. You don't need to go for any workaround.
>>> import re
>>> s='<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>'
>>> re.findall(r'(?<=>)[^<>]+(?=</)|^(?!.*?>.*?</).*', s, re.M)
['Russia']
>>> s='This is Russia Today'
>>> re.findall(r'(?<=>)[^<>]+(?=</)|^(?!.*?>.*?</).*', s, re.M)
['This is Russia Today']

Here is a work-around. Instead of adjusting the regex, we adjust the string:
>>> s='<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>'
>>> re.findall(r'(?<=>)[^<>]*(?=<\/)', s if '>' in s else '>%s</' % s)
['Russia']
>>> s='This is Russia Today'
>>> re.findall(r'(?<=>)[^<>]*(?=<\/)', s if '>' in s else '>%s</' % s)
['This is Russia Today']
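The fallback can also be spelled out in plain Python rather than packed into one pattern. A minimal sketch, assuming the goal is a list of inner texts or, failing that, the whole string; the helper name inner_text_or_all is made up for illustration:

```python
import re

# Same inner-text pattern as in the answers above.
TAG_TEXT = re.compile(r'(?<=>)[^<>]+(?=</)')

def inner_text_or_all(s):
    """Return the text found between tags, or [s] when nothing matches."""
    matches = TAG_TEXT.findall(s)
    return matches if matches else [s]

print(inner_text_or_all('<a href="http://topics.bloomberg.com/russia/">Russia</a>'))
print(inner_text_or_all('This is Russia Today'))
```

This trades the single regex for an explicit "if nothing matched" branch, which may be easier to extend if more cases turn up later.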


How to extract all the sentences with review/text in the below text?

Here I want to extract the review/text entries, but my pattern only extracts the first part of each one.
The following is the output:
<re.Match object; span=(226, 258), match='review/text: I like Creme Brulee'>
<re.Match object; span=(750, 860), match='review/text: not what I was expecting in terms of>
import re
text='''
'product/productId: B004K2IHUO\n',
'review/userId: A2O9G2521O626G\n',
'review/profileName: Rachel Westendorf\n',
'review/helpfulness: 0/0\n',
'review/score: 5.0\n',
'review/time: 1308700800\n',
'review/summary: The best\n',
'review/text: I like Creme Brulee. I loved that these were so easy. Just sprinkle on the sugar that came with and broil. They look amazing and taste great. My guess thought I really went out of the way for them when really it took all of 5 minutes. I will be ordering more!\n',
'\n',
'product/productId: B004K2IHUO\n',
'review/userId: A1ZKFQLHFZAEH9\n',
'review/profileName: S. J. Monson "world citizen"\n',
'review/helpfulness: 2/8\n',
'review/score: 3.0\n',
'review/time: 1236384000\n',
'review/summary: disappointing\n',
"review/text: not what I was expecting in terms of the company's reputation for excellent home delivery products\n",
'\n',
'''
pattern=re.compile(r'review/text:\s[^.]+')
matches=pattern.finditer(text)
for match in matches:
    print(match)
If you don't mind not using re, and if the identifier is 'review/text' and your data is always comma-separated, you can get the lines simply with:
matches = [s.strip() for s in text.split(',') if s.strip(' "\n\'').startswith('review/text')]
for match in matches:
    print(match)
where s.strip(' "\n\'') removes spaces, ", ', and newline characters from the beginning and end of each piece before the string comparison. These two lines are returned:
'review/text: I like Creme Brulee. I loved that these were so easy. Just sprinkle on the sugar that came with and broil. They look amazing and taste great. My guess thought I really went out of the way for them when really it took all of 5 minutes. I will be ordering more!
'
"review/text: not what I was expecting in terms of the company's reputation for excellent home delivery products
"
Use
matches = re.findall(r'review/text:.+', text)
EXPLANATION
--------------------------------------------------------------------------------
review/text:             the literal text 'review/text:'
--------------------------------------------------------------------------------
.+                       any character except \n (1 or more times,
                         matching as many as possible)
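To see why the original pattern stopped early: [^.]+ cannot cross a period, so the match ends at the first full stop, while .+ only stops at a newline. A small sketch on made-up sample text (not the data from the question), with a capture group added on the assumption that only the review body is wanted:

```python
import re

# Hypothetical sample in the same "key: value" layout as the question.
text = (
    "review/summary: The best\n"
    "review/text: I like it. Easy to make.\n"
    "\n"
    "review/summary: disappointing\n"
    "review/text: not what I expected\n"
)

# [^.]+ stops at the first period, so the first review is cut short.
truncated = re.search(r'review/text:\s[^.]+', text).group()

# .+ runs to the end of the line; the group drops the "review/text:" prefix.
reviews = re.findall(r'review/text:\s*(.+)', text)

print(truncated)
print(reviews)
```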

regex for removing entity names

Given tweets like the following:
Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform
Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold
How do I write a regex that removes both "by Cormark" and "by Zacks Investment Research"
I tried this:
"by ([A-Za-z ]+\w to)"
in Python, but the match includes the word "to". I would like the regex to stop before capturing the word "to".
It would also be interesting if someone could show me how to write a regex that captures camel-case examples, like "Zacks Investment Research".
You can use a positive lookahead to require the word "to" without including it in the match:
>>> s1 = "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform"
>>>
>>> s2 = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
>>>
>>> import re
>>> re.sub(r'by[\w\s]+(?=to)','',s1)
'Brick Brewing Co Limited (BRB) Downgraded to Market Perform'
>>> re.sub(r'by[\w\s]+(?=to)','',s2)
'Brinker International Inc (EAT) Upgraded to Hold'
>>>
Note that the regex [\w\s]+ will match any combination of word characters and white spaces. If you just want to match the alphabetical characters and white space you can use [a-z\s] with re.I flag (Ignore case).
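One caveat with (?=to): it also succeeds in the middle of a word, and because [\w\s]+ is greedy, the match can run past the real "to". A hedged sketch, assuming the rating always follows a standalone word "to"; the sample sentence with "Sector" and the helper name remove_rater are invented for illustration:

```python
import re

# \b keeps "by" and "to" whole words; +? stops at the first standalone "to".
RATER = re.compile(r'\bby\s+[a-z\s]+?(?=\bto\b)', re.I)

def remove_rater(tweet):
    """Drop the 'by <name> ' part that precedes the standalone word 'to'."""
    return RATER.sub('', tweet)

s = "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Sector Perform"
greedy_result = re.sub(r'by[\w\s]+(?=to)', '', s)  # overshoots into "Sector"
print(greedy_result)
print(remove_rater(s))
```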
To remove all capitalized words after by, you can use
by [A-Z][a-z]*(?: +[A-Z][a-z]*)*
Explanation:
by - literal sequence of 3 characters b, y and a space
[A-Z][a-z]* - a capitalized word (one uppercase followed by zero or more lowercase letters)
(?: +[A-Z][a-z]*)* - zero or more sequences of...
+[A-Z][a-z]* - 1 or more spaces followed by an uppercase letter followed by zero or more lowercase letters.
A regular space may be replaced with \s in the pattern to match any whitespace. Also, to match CaMeL words, you can replace all [a-z] with [a-zA-Z].
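Applied with re.sub, the capitalized-word pattern might look like the sketch below. The trailing " +" and the leading \b are additions that are not in the pattern above; they are assumed here so that the space after the name is removed as well and "by" is matched only as a whole word:

```python
import re

# "by", a run of capitalized words, then the spaces that follow the name.
CAP_NAME = re.compile(r'\bby [A-Z][a-z]*(?: +[A-Z][a-z]*)* +')

tweets = [
    "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform",
    "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold",
]
cleaned = [CAP_NAME.sub('', t) for t in tweets]
for line in cleaned:
    print(line)
```

Because the pattern stops at the first lowercase word, it does not depend on the word "to" at all.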
You could also do it with the str.index method, then slice and concatenate:
>>> def remove_name(s):
...     b = s.index(' by ')
...     t = s.index(' to ')
...     s = s[:b]+s[t:]
...     return s
...
>>>
>>> s = 'Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform'
>>> remove_name(s)
'Brick Brewing Co Limited (BRB) Downgraded to Market Perform'
>>>
>>> s = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
>>> remove_name(s)
'Brinker International Inc (EAT) Upgraded to Hold'

Python Regex matching on unexpected parts of search string

I'm trying to parse a page using regex (Python 2.7; IPython QTConsole). The page is a .txt file pulled from a web directory, which I fetched using urllib2.
>>> import re
>>> Z = '[A-Z]{2}Z[0-9]{3}.*?\\$\\$'
>>> snippet = re.search(Z, page, re.DOTALL)
>>> snippet.group() # Only including the first part for brevity.
'PZZ570-122200-\nPOINT ARENA TO POINT REYES 10 TO 60 NM OFFSHORE-\n249 AM PDT FRI SEP 12 2014\n.TODAY...SW WINDS 5 KT. WIND WAVES 2 FT OR LESS.\nNW SWELL 3 TO 5 FT AT 12 SECONDS. PATCHY FOG IN THE MORNING.\n.TONIGHT...W WINDS 10 KT. WIND WAVES 2 FT OR LESS.'
I want to search for the newline followed by a period. I'd like to get the first and second occurrences as below. The objective is to parse the information between the first and second (and subsequent) \n\. delimiters. I know I could do look-around, but I'm having trouble making the lookahead greedy. Further, I can't figure out why the following doesn't work.
>>> pat = r"\n\."
>>> s = re.search(pat, snippet.group(), re.DOTALL)
>>> e = re.search(pat, snippet.group()[s.end():], re.DOTALL)
The s above works, but I get a strange result for e.
>>> [s.group(), s.start(), e.group(), e.end()]
['\n.', 90, '\n.', 110]
>>> snippet.group()[s.start():e.end()]
'\n.TODAY...SW WINDS 5'
>>> snippet.group()[e.start():e.end()]
' 5'
I guess there's some formatting in snippet.group() that's hidden? If that's the case, then it's strange that some newlines are explicit as if snippet.group() is raw, and others are hidden. Why are e.group(), and snippet.group()[e.start():e.end()] different?
I apologize if this question has already been addressed. I couldn't find anything related.
Thanks very much in advance.
To split a string in Python, it might be easier to use str.split() or re.split(), e.g.:
"1\n.2\n.3".split("\n.")

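As for why e looked wrong in the question: e was found in the slice snippet.group()[s.end():], so e.start() and e.end() count from the start of that slice, not from the start of the full text. A minimal sketch of two ways around that, using a shortened stand-in for the forecast text: the search() method of a compiled pattern accepts a start position and keeps offsets relative to the whole string, and split() avoids offsets altogether.

```python
import re

# Shortened, hypothetical stand-in for the forecast text in the question.
snippet = (
    "PZZ570-122200-\nPOINT ARENA TO POINT REYES\n"
    ".TODAY...SW WINDS 5 KT.\nNW SWELL 3 TO 5 FT.\n"
    ".TONIGHT...W WINDS 10 KT."
)

delim = re.compile(r"\n\.")

# Passing a start position keeps all offsets relative to the full string.
first = delim.search(snippet)
second = delim.search(snippet, first.end())
between = snippet[first.end():second.start()]

# Splitting on the delimiter yields every section at once.
parts = delim.split(snippet)

print(repr(between))
print(parts)
```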
Extracting sub-string after the first space in Python

I need help using regex or plain Python to extract a substring from each string in a set. The strings are alphanumeric. I just want the substring that starts after the first space and ends before the last space, as in the example below.
Example 1:
A:01 What is the date of the election ?
BK:02 How long is the river Nile ?
Results:
What is the date of the election
How long is the river Nile
While I am at it, is there an easy way to extract the part of a string before or after a certain character? For example, I want to extract the date or day from strings like the ones given in Example 2.
Example 2:
Date:30/4/2013
Day:Tuesday
Results:
30/4/2013
Tuesday
I have actually read about regex but it's very alien to me. Thanks.
I recommend using split:
>>> s="A:01 What is the date of the election ?"
>>> " ".join(s.split()[1:-1])
'What is the date of the election'
>>> s="BK:02 How long is the river Nile ?"
>>> " ".join(s.split()[1:-1])
'How long is the river Nile'
>>> s="Date:30/4/2013"
>>> s.split(":")[1]
'30/4/2013'
>>> s="Day:Tuesday"
>>> s.split(":")[1]
'Tuesday'
>>> s="A:01 What is the date of the election ?"
>>> s.split(" ", 1)[1].rsplit(" ", 1)[0]
'What is the date of the election'
>>>
There's no need to dig into regex if this is all you need; you can use str.partition
s = "A:01 What is the date of the election ?"
before,sep,after = s.partition(' ') # could be, eg, a ':' instead
If all you want is the last part, you can use _ as a placeholder for 'don't care':
_,_,theReallyAwesomeDay = s.partition(':')
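Putting partition and its right-hand counterpart rpartition together covers both examples from the question. A small sketch; the function names middle_words and value_after are hypothetical, and it assumes each string contains at least two spaces (for the first helper) or the separator (for the second):

```python
def middle_words(s):
    """Text after the first space and before the last space."""
    _, _, rest = s.partition(' ')        # drop everything up to the first space
    middle, _, _ = rest.rpartition(' ')  # drop everything after the last space
    return middle

def value_after(s, sep=':'):
    """Text after the first occurrence of sep."""
    return s.partition(sep)[2]

print(middle_words("A:01 What is the date of the election ?"))
print(value_after("Date:30/4/2013"))
print(value_after("Day:Tuesday"))
```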

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print(repr(item.title[0:-1]))
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output currently consists of repr() strings? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
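import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)' # title with the trailing parenthesis put back
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')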
To get at each group individually, call group() on the info object. The first two groups keep their trailing space, so strip them before formatting:
print(info.group(1)) # or info.groups()[0]
print('"%s","%s","%s"' % (info.group(1).strip(), info.group(2).strip(), info.group(3)))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the characters that can appear in the title. If there are non-alphanumeric characters in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, reading left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : A literal closing parenthesis, matching the ")" that was put back at the end of the title.
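One thing to watch with this pattern: since [\w\s]+ also matches digits, a two-digit month such as 12/26 can lose its first digit to the venue group. A sketch of a variant, assuming every title has the shape "band (venue date" with an optional closing parenthesis; the group names and parse_title are illustrative, and the 12/26 title is made up:

```python
import re

TITLE = re.compile(
    r'(?P<band>.+?)\s*\('        # band name, up to the opening parenthesis
    r'(?P<venue>.+)\s+'          # venue, followed by whitespace
    r'(?P<date>\d+/\d+)\)?$'     # month/day, closing parenthesis optional
)

def parse_title(title):
    """Return a dict with band, venue and date, or None if the title does not fit."""
    m = TITLE.match(title)
    return m.groupdict() if m else None

print(parse_title('randy travis (Billy Bobs 3/21'))
print(parse_title('Michael Schenker Group (House of Blues Dallas 12/26)'))
```

Requiring whitespace before the date is what keeps the digits together; named groups are a convenience and could be replaced with plain numbered groups.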
The Python documentation for the re module is quite good, and you might want to spend an evening going over it: http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, I'm not sure where you got that from, but I'm pretty sure you can simply use item.title. All that code does is remove the last character from the string and then call repr() on it, which only changes how the string is displayed (it adds the quotes), not the data itself.
Your code should look something like this:
import re
import geocoders # from GeoPy
import feedparser # from www.feedparser.org

us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw: # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # lat and lng are numbers, so convert them before joining
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
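For the final CSV the question asks for, the csv module takes care of quoting (for example, a venue name that contains a comma). A sketch in Python 3, assuming the parsed and geocoded rows are already available as tuples; the coordinates are the placeholder values from the question, and rows_to_csv is a hypothetical helper:

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize (band, venue, date, lat, long) tuples to CSV text with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["band", "venue", "date", "lat", "long"])
    writer.writerows(rows)
    return buf.getvalue()

# Placeholder coordinates copied from the question, not real geocoding results.
rows = [
    ("randy travis", "Billy Bobs", "3/21", 1234.5678, 1234.5678),
    ("Michael Schenker Group", "House of Blues Dallas", "3/26", 4321.8765, 4321.8765),
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

To write straight to disk instead, the same writer can be given a file opened with newline="" in place of the StringIO buffer.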
