Multiple distinct replaces using RegEx - python

I am trying to write some Python code that will replace some unwanted strings using RegEx. The code I have written is adapted from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove every \u2019m, \u2019s, \u2019ve, and so on.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
u"\u201c":"", u"\u201d":"", u"\u2013":"" and u"\u2018":""
However, it doesn't work for:
u"\u2019[a-z]": re.escape turns the [a-z] part of the key into \\[a\\-z\\], which no longer matches as a character class.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?

The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
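import re
replacements = {'newlines': ' ',
                'deletions': ''}
# [a-z]* (rather than [a-z]?) also removes multi-letter suffixes such as \u2019ve
pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]*|\u2013|\u2018)')
def lookup(match):
    # lastgroup is the name of the named group that matched
    return replacements[match.lastgroup]
text = pattern.sub(lookup, text_1)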

The problem here is actually the escaping. This code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]*", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've added the * to the \u2019 match so that a bare \u2019 and multi-letter suffixes such as \u2019ve are removed too, as I suppose that's what you want given your test string.
For completeness, I think I should also link to the Unidecode package, which may be closer to what you're trying to achieve by removing these characters.
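A minimal sketch of why the escaped key fails, written for Python 3 (an assumption; the question uses Python 2). The names `escaped`, `unescaped`, and `cleaned` are illustrative only. It uses `[a-z]*` so that a suffix such as `ve` is removed whole.

```python
import re

text_1 = ('I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. '
          'That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.')

# re.escape() turns the character class into literal text, so the escaped
# pattern only matches the literal characters "[a-z]" after the apostrophe.
escaped = re.compile(re.escape('\u2019[a-z]'))
print(escaped.search(text_1))  # no match in the sample text

# Left unescaped, the class works as a class.
unescaped = re.compile('\u201c|\u201d|\u2019[a-z]*|\u2013|\u2018')
cleaned = unescaped.sub('', text_1)
print(cleaned)
```

Note that a space remains after "market," because only the closing quote itself is removed.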

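The simplest way is this regex, which removes everything from a literal backslash up to the next space (so it only applies when the text contains real backslash characters rather than decoded Unicode characters):
X = re.compile(r'(\\)(.*?) ')
text = re.sub(X, ' ', text_1)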

Related

Making multiple "any" more efficient

I am using any to see if a string in a longer string (description) matches with any strings across several lists. I have the code working, but I feel like it's an inefficient way of doing a comparison, and would like feedback on how I can make it more efficient.
def convert_category(description):
    categoryFood = ['COUNTDOWN', 'BAKE', 'MCDONALDS', 'ST PIERRE', 'PAK N SAVE', 'NEW WORLD']
    categoryDIY = ['BUNNINGS', 'MITRE10']
    containsFood = any(keyword in description for keyword in categoryFood)
    containsDIY = any(keyword in description for keyword in categoryDIY)
    if containsFood:
        return 'Food and Groceries'
    elif containsDIY:
        return 'Home and DIY'
    return ''
I would use a regular expression. They are optimized for this kind of problem - searching for any of multiple strings - and the hot part of the code is pushed into a fast library. With big enough strings you should notice the difference.
import re
foodPattern = '|'.join(map(re.escape, categoryFood))
diyPattern = '|'.join(map(re.escape, categoryDIY))
containsFood = re.search(foodPattern, description) is not None
containsDIY = re.search(diyPattern, description) is not None
You can easily extend this with word boundaries or similar features to make the keyword matching smarter, for example so that it only matches whole words.
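A sketch of the whole-word variant, assuming the keyword lists from the question. The names `CATEGORIES` and `PATTERNS` are illustrative. The patterns are compiled once rather than on every call.

```python
import re

# Category keyword lists copied from the question.
CATEGORIES = {
    'Food and Groceries': ['COUNTDOWN', 'BAKE', 'MCDONALDS', 'ST PIERRE',
                           'PAK N SAVE', 'NEW WORLD'],
    'Home and DIY': ['BUNNINGS', 'MITRE10'],
}

# One whole-word alternation per category, compiled a single time.
PATTERNS = [
    (name, re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'))
    for name, words in CATEGORIES.items()
]

def convert_category(description):
    # Return the first category whose pattern matches, or '' if none does.
    for name, pattern in PATTERNS:
        if pattern.search(description):
            return name
    return ''

print(convert_category('EFTPOS PAK N SAVE ALBANY'))
print(convert_category('BUNNINGS 1234'))
print(repr(convert_category('BAKERY SUPPLIES')))
```

With the `\b` anchors, 'BAKE' no longer matches inside 'BAKERY', which the plain substring test would have accepted.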
From the sounds of things, the only other way to make this faster is the negligible gain from returning earlier, as soon as one category matches. Marking as answered and closing.

Regex not specific enough

So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (usually information about the book title, author, page number, etc.). I thought it was functional, but sometimes there would be random periods (.) on certain lines of the output. At first I thought the program was just buggy, but then I realized that the regex I'm using to match the book's title and author was also matching any sentence that ended in brackets.
This is the code for the regex that I'm using to detect the book's title and author:
titleRegex = re.compile(r'(.+)\((.+)\)')
Example
Desired book title and author match: Book title (Author name)
What would also get matched: I like apples because they are green (they are sometimes red as well).
In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted.
Here is the unformatted text file that goes into my program
The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.
Would there be any ways to make my title regex more specific so that it only picks up author titles and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?
I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)
import re
titleRegex = re.compile(r'(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')
regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]
string = open("/Users/devinnagami/myclippings.txt").read()
for x in range(len(regexList)):
    newString = re.sub(regexList[x], ' ', string)
    string = newString
finalText = newString.split(' ')
with open('booknotes.txt', 'w') as f:
    for item in finalText:
        f.write('%s\n' % item)
There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.
For instance:
quoteInfoRegex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n" +
    r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n" +
    r"\n" +
    r"(?P<quote>.*?)\n", flags=re.MULTILINE)
# data is the full text of the clippings file
for m in quoteInfoRegex.finditer(data):
    print(m.groupdict())
This pulls out each clipping and parses it, relying on the fact that the book title is the first line after the row of equals signs and that the quote itself comes below the metadata line.
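A small usage sketch of that combined expression. The `sample` text is made up to follow the layout the pattern expects (separator row, title line, metadata line, blank line, quote); the real clippings file may differ, for instance a first entry with no separator above it would not be matched.

```python
import re

quoteInfoRegex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n"
    r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+)"
    r" \| Added on (?P<added>.*?)\n"
    r"\n"
    r"(?P<quote>.*?)\n", flags=re.MULTILINE)

# Hypothetical sample data, not taken from the asker's file.
sample = (
    "==========\n"
    "Book title (Author name)\n"
    "- Your Highlight on page 12 | Location 180-182 | Added on Monday, January 1, 2018 10:15:30 AM\n"
    "\n"
    "I like apples because they are green (they are sometimes red as well).\n"
    "==========\n"
    "Other book (Someone Else)\n"
    "- Your Highlight on page 3 | Location 40-41 | Added on Tuesday, January 2, 2018 9:00:00 PM\n"
    "\n"
    "A second highlight.\n"
)

# Collect one dict per clipping; the quote keeps its own parentheses intact.
clippings = [m.groupdict() for m in quoteInfoRegex.finditer(sample)]
for clip in clippings:
    print(clip['title'], '|', clip['author'], '|', clip['quote'])
```

Because the title line is identified by its position rather than by its brackets, a highlighted sentence that ends in brackets is left alone.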

Extracting titles from a text

I'm working on a project, where I have to extract honorific titles (Mr, Mrs, St, etc.) from a novel. The desired output with the text I'm working with is:
['Col', 'Dr', 'Mr', 'Mrs', 'Otto', 'Rev', 'St']
However, with the code I wrote, the output is this:
{'Tom.', 'Mrs.', 'Otto.', 'Mary.', 'Bots.', 'Come.', 'No.', 'Col.', 'Cain.', 'Dr.', 'Gang.', 'Ike.', 'Kean.', 'St.', 'Hank.', 'Him.', 'Finn.', 'Ann.', 'Jane.', 'Alas.', 'Huck.', 'Sis.', 'Buck.', 'Jim.', 'Sid.', 'Mr.', 'Bill.', 'Rev.', 'Yes.'}
This is the code I have so far:
def get_titles(text):
    pattern = re.compile(r'[A-Z][a-z]{1,3}\.')
    title_tokens = set(re.findall(pattern, text))
    pattern2 = re.compile('[A-Z][a-z]{1,3}')
    pseudo_titles = set(re.findall(pattern2, text))
    pseudo_titles = [word.strip() for word in pseudo_titles]
    pseudo_titles = [word.replace('\n', '') for word in pseudo_titles]
    difference = title_tokens.difference(pseudo_titles)
    return difference
test = get_titles(text)
print(test)
As you can notice, the output gives me additional words with periods in them. I believe the issue stems from the regular expressions, but I'm not sure. Any advice or tips are appreciated.
The text can be found here: http://www.gutenberg.org/files/76/76-0.txt
Essentially, you are asking for an algorithm that can tell the difference between a title and a one-word sentence. These are lexically indistinguishable; for example, consider the following two strings:
"Do I know who did this? Yes. Smith did it."
"Do I know who did this? Mr. Smith did it."
In the first sentence, "Yes." is a one-word sentence, and in the second, "Mr." is a title. As humans we only know this because we understand the meanings of the tokens "Yes" and "Mr"; so an algorithm which is able to distinguish between these cases requires some information about the meanings of the tokens it's parsing. It cannot work purely lexically like a regex does. This means you must either write a whitelist of allowed titles, or a blacklist of words which are not titles, or otherwise the problem is much more difficult.
Alternatively, if your project doesn't involve parsing titles from very many novels, you could just trim down the results by hand, using your human knowledge that "Tom" and "Yes" aren't titles. It shouldn't be that much work.
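A minimal sketch of the whitelist option. The set `KNOWN_TITLES` is a hypothetical starting list, not something derived from the novel, and would need extending by hand.

```python
import re

# Hypothetical whitelist; extend it for the novel at hand.
KNOWN_TITLES = {'Col', 'Dr', 'Mr', 'Mrs', 'Rev', 'St'}

# A capitalised word of two to four letters followed by a period.
CANDIDATE_RE = re.compile(r'\b([A-Z][a-z]{1,3})\.')

def get_titles(text, known_titles=KNOWN_TITLES):
    # The regex only proposes candidates; the whitelist decides.
    candidates = set(CANDIDATE_RE.findall(text))
    return sorted(candidates & set(known_titles))

sample = ('Do I know who did this? Yes. Smith did it. '
          'Do I know who did this? Mr. Smith did it.')
print(get_titles(sample))
```

Here the regex still picks up "Yes" as a candidate, and it is the whitelist that rejects it.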

Using regular expressions in python to extract location mentions in a sentence

I am writing Python code to extract the name of a road, street, or highway from a sentence. For example, given a sentence like "There is an accident along Uhuru Highway", I want my code to extract the name of the highway mentioned. I have written the code below.
sentence="there is an accident along uhuru highway"
listw=[word for word in sentence.lower().split()]
for i in range(len(listw)):
    if listw[i] == "highway":
        print(listw[i-1] + " " + listw[i])
I can achieve this, but my code is not optimized. I am thinking of using regular expressions. Any help, please?
'uhuru highway' can be found as follows
import re
m = re.search(r'\S+ highway', sentence) # non-white-space followed by ' highway'
print(m.group())
# 'uhuru highway'
If the location you want to extract will always have highway after it, you can use:
>>> sentence = "there is an accident along uhuru highway"
>>> a = re.search(r'.* ([\w\s\d\-\_]+) highway', sentence)
>>> print(a.group(1))
uhuru
You can do the following without using regexes:
sentence.split("highway")[0].strip().split(' ')[-1]
First split according to "highway". You'll get:
['there is an accident along uhuru ', '']
And now you can easily extract the last word from the first part.
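The same idea can be sketched for roads and streets as well as highways. `ROAD_WORDS` and `extract_locations` are illustrative names, and the sketch assumes the place name is a single word directly before the road word.

```python
import re

# Road-type words to look for; an assumed list based on the question.
ROAD_WORDS = ('highway', 'road', 'street')

# One word followed by one of the road words, matched case-insensitively.
LOCATION_RE = re.compile(
    r'\b(\w+)\s+(' + '|'.join(ROAD_WORDS) + r')\b', re.IGNORECASE)

def extract_locations(sentence):
    # Return every "<name> <road word>" pair found in the sentence.
    return [m.group(0) for m in LOCATION_RE.finditer(sentence)]

print(extract_locations("There is an accident along Uhuru Highway"))
print(extract_locations("Jam on Mombasa Road and Moi Street"))
```

Multi-word names such as "Thika Super Highway" would need a wider pattern, which this sketch does not attempt.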

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudoPython, just notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call them on the info object (strip() drops the trailing space that the first two groups capture):
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1).strip(), info.group(2).strip(), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
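The same breakdown can be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace and allows comments inside the pattern. This is a sketch in Python 3 syntax; the variable names are illustrative.

```python
import re

pat = re.compile(r"""
    ([\w\s]+)    # group 1: band name, word and space characters
    \(           # literal opening parenthesis
    ([\w\s]+)    # group 2: venue name
    (\d+/\d+)    # group 3: date such as 3/26
    \)           # literal closing parenthesis
""", re.VERBOSE)

s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
band, venue, date = pat.match(s).groups()

# Groups 1 and 2 capture a trailing space, so strip them before use.
band, venue = band.strip(), venue.strip()
print(band, venue, date, sep=' | ')
```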
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, I'm not sure where you got that from, but I'm pretty sure you can simply use item.title. All you're doing is removing the last character from the string and then calling repr() on it, which only adds the surrounding quotes to what gets printed.
Your code should look something like this:
import re
from geopy import geocoders # from GeoPy
import feedparser # from www.feedparser.org
us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        # band is the band name the user selected
        if band == bandRaw:
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # join() needs strings, so convert lat and lng first
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
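For the final .csv step, a sketch using the standard csv module instead of joining strings by hand, so that a venue containing a comma is quoted correctly. It is written in Python 3, leaves out the geocoding columns because those need a network call, and uses illustrative names (`parse_title`, `titles_to_csv`).

```python
import csv
import io
import re

TITLE_RE = re.compile(r'(.*) \((.*) (\d+/\d+)\)')

def parse_title(title):
    # Split "band (venue date)" into a (band, venue, date) tuple, or None.
    m = TITLE_RE.match(title)
    return m.groups() if m else None

def titles_to_csv(titles, band):
    # Build CSV text for the entries that belong to the selected band.
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerow(['band', 'venue', 'date'])
    for title in titles:
        parsed = parse_title(title)
        if parsed and parsed[0] == band:
            writer.writerow(parsed)
    return buf.getvalue()

titles = ['randy travis (Billy Bobs 3/21)',
          'Michael Schenker Group (House of Blues Dallas 3/26)']
print(titles_to_csv(titles, 'randy travis'))
```

To write a file rather than a string, the same writer can be given a file opened with `newline=''`.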
