US-format phone numbers to links in Python - python

I'm working a piece of code to turn phone numbers into links for mobile phone - I've got it but it feels really dirty.
import re
from string import digits
PHONE_RE = re.compile('([(]{0,1}[2-9]\d{2}[)]{0,1}[-_. ]{0,1}[2-9]\d{2}[-_. ]{0,1}\d{4})')
def numbers2links(s):
result = ""
last_match_index = 0
for match in PHONE_RE.finditer(s):
raw_number = match.group()
number = ''.join(d for d in raw_number if d in digits)
call = '%s' % (number, raw_number)
result += s[last_match_index:match.start()] + call
last_match_index = match.end()
result += s[last_match_index:]
return result
>>> numbers2links("Ghost Busters at (555) 423-2368! How about this one: 555 456 7890! 555-456-7893 is where its at.")
'Ghost Busters at (555) 423-2368! How about this one: 555 456 7890! 555-456-7893 is where its at.'
Is there anyway I could restructure the regex or the the regex method I'm using to make this cleaner?
Update
To clarify, my question is not about the correctness of my regex - I realize that it's limited. Instead I'm wondering if anyone had any comments on the method of substiting in links for the phone numbers - is there anyway I could use re.replace or something like that instead of the string hackery that I have?

Nice first take :) I think this version is a bit more readable (and probably a teensy bit faster). The key thing to note here is the use of re.sub. Keeps us away from the nasty match indexes...
import re
PHONE_RE = re.compile('([(]{0,1}[2-9]\d{2}[)]{0,1}[-_. ]{0,1}[2-9]\d{2}[-_. ]{0,1}\d{4})')
NON_NUMERIC = re.compile('\D')
def numbers2links(s):
def makelink(mo):
raw_number = mo.group()
number = NON_NUMERIC.sub("", raw_number)
return '%s' % (number, raw_number)
return PHONE_RE.sub(makelink, s)
print numbers2links("Ghost Busters at (555) 423-2368! How about this one: 555 456 7890! 555-456-7893 is where its at.")
A note: In my practice, I've not noticed much of a speedup pre-compiling simple regular expressions like the two I'm using, even if you're using them thousands of times. The re module may have some sort of internal caching - didn't bother to read the source and check.
Also, I replaced your method of checking each character to see if it's in string.digits with another re.sub() because I think my version is more readable, not because I'm certain it performs better (although it might).

Your regexp only parses a specific format, which is not the international standard. If you limit yourself to one country, it may work.
Otherwise, the international standard is ITU E.123 : "Notation for national and international telephone numbers,
e-mail addresses and Web addresses"

Why not re-use the work of others - for example, from RegExpLib.com?
My second suggestion is to remember there are other countries besides the USA, and quite a few of them have telephones ;-) Please don't forget us during your software development.
Also, there is a standard for the formatting of telephone numbers; the ITU's E.123. My recollection of the standard was that what it describes doesn't match well with popular usage.
Edit: I mixed up G.123 and E.123. Oops. Props Bortzmeyer

First off, reliably capturing phone numbers with a single regular expression is notoriously difficult with a strong tendency to being impossible. Not every country has a definition of a "phone number" that is as narrow as it is in the U.S. Even in the U.S., things are more complicated than they seem (from the Wikipedia article on the North American Numbering Plan):
A) Country code: optional prefix ("1" or "+1" or "001")
((00|\+)?1)?
B) Numbering Plan Area code (NPA): cannot begin with 1, digit 2 cannot be 9
[2-9][0-8][0-9]
C) Exchange code (NXX): cannot begin with 1, cannot end with "11", optional parentheses
\(?[2-9](00|[2-9]{2})\)?
D) Station Code: four digits, cannot all be 0 (I suppose)
(?!0{4})\d{4}
E) an optional extension may follow
([x#-]\d+)?
S) parts of the number are separated by spaces, dashes, dots (or not)
[. -]?
So, the basic regex for the U.S. would be:
((00|\+)?1[. -]?)?\(?[2-9][0-8][0-9]\)?[. -]?[2-9](00|[2-9]{2})[. -]?(?!0{4})\d{4}([. -]?[x#-]\d+)?
| A |S | | B | S | C | S | D | S | E |
And that's just for the relatively trivial numbering plan of the U.S., and even there it certainly is not covering all subtleties. If you want to make it reliable you have to develop a similar beast for all expected input languages.

A few things that will clean up your existing regex without really changing the functionality:
Replace {0,1} with ?, [(] with (, [)] with ). You also might as well just make your [2-9] b e a \d as well, so you can make those patterns be \d{3} and \d{4} for the last part. I doubt it will really increase the rate of false positives.

Related

Regex for matching alphabet, numbers and special charters while looping in python

I am trying to find words and print using below code. Everything is working perfect but only issue is i am unable to print the last word(which is number).
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid ']
import re
for i in words:
y = re.findall('{} ([^ ]*)'.format(i), textfile)
print(y)
Text file i working with:
textfile = """1, REBECCA M. ROTH , COLLECTOR OF TAXES of the taxing district of the
township of MORRIS for Six Hundred Sixty Seven dollars andFifty Two cents, the land
in said taxing district described as Block No. 10303 Lot No. 10 :
and known as 239 E HANOVER AVE , on the tax Taxes For: 2012
Sewer
Assessments For Improvements
Total Cost of Sale 35.00
Total
Premium (if any) Paid 1,400.00 """
Would like to know where am i making mistake.
Any suggestion is appreciated.
A couple of issues:
As others have mentioned, you need to escape special characters like parentheses ( ) and dots .. Very simply, you can use re.escape
Another issue is the trailing space in Premium \(if any\) Paid (it's trying to match two spaces instead of one as you're also checking for a space in your regex {} ([^ ]*))
You should instead change your code to the following:
See working code here
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid']
import re
for i in words:
y = re.findall('{} ([^ ]*)'.format(re.escape(i)), textfile)
print(y)
Two problems:
Your current 'Premium (if any) Paid ' string ends on a space, and '{} ([^ ]*)' also has a space after {}, which adds them together. Delete the trailing space in 'Premium (if any) Paid '.
You need to escape parenthesis, so if you want to keep your regular expression unchanged, the string in the list should be ['Premium \(if any\) Paid']. You can also use re.escape instead.
For your particular cases, this seems to be an optimal solution:
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid']
import re
for i in words:
y = re.findall('{}\s+([\S]*)'.format(re.escape(i)), text, re.I)
print(y)

Split a string with one delimiter but multiple conditions

Good morning,
I found multiple threads dealing with splitting strings with multiple delimiters, but not with one delimiter and multiple conditions.
I want to split the following strings by sentences:
desc = Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish.
If I do:
[t.split('. ') for t in desc]
I get:
['Dr', 'Anna Pytlik is an expert in conservative and aesthetic dentistry', 'She speaks both English and Polish.']
I don't want to split the first dot after 'Dr'. How can I add a list of substrings in which case the .split('. ') should not apply?
Thank you!
You could use re.split with a negative lookbehind:
>>> desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."
>>> re.split(r"(?<!Dr|Mr)\. ", desc)
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry',
'She speaks both English and Polish.']
Just add more "exceptions", delimited with |.
Update: Seems like negative lookbehind requires all the alternatives to have the same length, so this does not work with both "Dr." and "Prof." One workaround might be to pad the pattern with ., e.g. (?<!..Dr|..Mr|Prof). You could easily write a helper method to pad each title with as many . as needed. However, this may break if the very first word of the text is Dr., as the .. will not be matched.
Another workaround might be to first replace all the titles with some placeholders, e.g. "Dr." -> "{DR}" and "Prof." -> "{PROF}", then split, then swap the original titles back in. This way you don't even need regular expressions.
pairs = (("Dr.", "{DR}"), ("Prof.", "{PROF}")) # and some more
def subst_titles(s, reverse=False):
for x, y in pairs:
s = s.replace(*(x, y) if not reverse else (y, x))
return s
Example:
>>> text = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. Prof. Miller speaks both English and Polish."
>>> [subst_titles(s, True) for s in subst_titles(text).split(". ")]
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry', 'Prof. Miller speaks both English and Polish.']
You could split and then join again Dr/Mr/...
It doesn't need complicated regexes and could be faster (you should benchmark it to choose best option).

Returning first number in function if keyword is met

I'm working with data and I have it set to spit out items I need.
Example:
LOT OF 4 American motor vinegar
Lot of (6) 808 metal/steel/G/N LWAP
LOT 12 product number 57838290
What I want is to have it spit out the amount in each lot, could be lowercase or capitalized, if 'lot' is found in the text. I think I have my code half built, but since the value isn't in a set location I don't know how to retrieve it. Also, the list above is from a TEXT string so it doesn't recognize integers
def auction(title):
for word in title.split():
if word.startswith('lot'):
return # not sure what to return (from the example the answer would be 4 6 and 12)
You can re-write that in the following order:
def auction(title):
found = False;
for word in title.split():
if word.upper().startswith('LOT'):
found = True;
if found:
if word.isdigit():
return int(word)
The base is same as your own, we set the boolean value to True after we found the LOT value (in any upper or lower case). Then we check to see if the word is a digit and if it was, return it's value.
Can use regular expression
import re
def auction(title):
for word in title.split():
if word.startswith('lot'):
search_result = re.search('([0-9]+)', title)
if search_result
return int(search_result.groups()[0])
Some folks don't like regular expressions, but in cases like this, they're pretty handy. I might try something like this:
import re
inputs = [
"LOT OF 4 CISCO AIRONET 4800 AIR-LM4800 DSSS WLAN PC CARD",
"Lot of (6) CISCO AIRONET AIR-LAP1252AG-A-K9 DUAL BAND 802.11A/G/N LWAP",
"LOT 12 Cisco Systems Aironet 1200 Wireless Access Point AIR-AP1231G-A-K9 MP21G",
"CISCO AIRONET 4800 AIR-LM4800 DSSS WLAN PC CARD lot of 4",
"Ocelot 4800 AIR-LM4800"]
patterns = [
r'\blot(?:\s+of|)\s+(\d+)',
r'\blot(?:\s+of|)\s+\((\d+)\)']
for a in inputs:
for pattern in patterns:
m = re.search(pattern, a, flags=re.IGNORECASE)
if m:
print "lot size = ", m.group(1)
break
else:
print "No lot size found!"
Outputs:
lot size = 4
lot size = 6
lot size = 12
lot size = 4
No lot size found!
The patterns here look a bit horrible, but they're just saying this: find the word 'lot', possibly followed (or not) by the word 'of' and then some digits. Or, in the second case some digits surrounded by literal parentheses.
Since this is free text you're parsing, you'll probably have some errors that might be have to be corrected either by hand or by adding more patterns.
you can use list comprehension to see if you need parse the string
num=['0','1','2','3','4','5','6','7','8','9']
t='this is a lot of 10'
if [e for e in t if e in num]!=[]:
parse_the_string(t)
def parse_the_string(the_string):
the_string=the_string.upper()
the_number=''
number_founded=False
for n in the_string[the_string.find("LOT"):]:
if n.isdigit():
the_number+=n
number_founded=True
elif number_founded:
break;
return the_number

Python - RegEx for splitting text into sentences (sentence-tokenizing) [duplicate]

This question already has answers here:
How can I split a text into sentences?
(20 answers)
Closed 3 years ago.
I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of the sentence and not at decimals or abbreviations or title of a name or if the sentence has a .com This is attempt at regex that doesn't work.
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
for stuff in sentences:
print(stuff)
Example output of what it should look like
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
Try this. split your string this.You can also check demo.
http://regex101.com/r/nG1gU7/27
Ok so sentence-tokenizers are something I looked at in a little detail, using regexes, nltk, CoreNLP, spaCy. You end up writing your own and it depends on the application. This stuff is tricky and valuable and people don't just give their tokenizer code away. (Ultimately, tokenization is not a deterministic procedure, it's probabilistic, and also depends very heavily on your corpus or domain, e.g. legal/financial documents vs social-media posts vs Yelp reviews vs biomedical papers...)
In general you can't rely on one single Great White infallible regex, you have to write a function which uses several regexes (both positive and negative); also a dictionary of abbreviations, and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.
To illustrate how easily this can get seriously complicated, let's try to write you that functional spec for a deterministic tokenizer just to decide whether single or multiple period ('.'/'...') indicates end-of-sentence, or something else:
function isEndOfSentence(leftContext, rightContext)
Return False for decimals inside numbers or currency e.g. 1.23 , $1.23, "That's just my $.02" Consider also section references like 1.2.A.3.a, European date formats like 09.07.2014, IP addresses like 192.168.1.1, MAC addresses...
Return False (and don't tokenize into individual letters) for known abbreviations e.g. "U.S. stocks are falling" ; this requires a dictionary of known abbreviations. Anything outside that dictionary you will get wrong, unless you add code to detect unknown abbreviations like A.B.C. and add them to a list.
Ellipses '...' at ends of sentences are terminal, but in the middle of sentences are not. This is not as easy as you might think: you need to look at the left context and the right context, specifically is the RHS capitalized and again consider capitalized words like 'I' and abbreviations. Here's an example proving ambiguity which : She asked me to stay... I left an hour later. (Was that one sentence or two? Impossible to determine)
You may also want to write a few patterns to detect and reject miscellaneous non-sentence-ending uses of punctuation: emoticons :-), ASCII art, spaced ellipses . . . and other stuff esp. Twitter. (Making that adaptive is even harder). How do we tell if #midnight is a Twitter user, the show on Comedy Central, text shorthand, or simply unwanted/junk/typo punctuation? Seriously non-trivial.
After you handle all those negative cases, you could arbitrarily say that any isolated period followed by whitespace is likely to be an end of sentence. (Ultimately, if you really want to buy extra accuracy, you end up writing your own probabilistic sentence-tokenizer which uses weights, and training it on a specific corpus(e.g. legal texts, broadcast media, StackOverflow, Twitter, forums comments etc.)) Then you have to manually review exemplars and training errors. See Manning and Jurafsky book or Coursera course [a].
Ultimately you get as much correctness as you are prepared to pay for.
All of the above is clearly specific to the English-language/ abbreviations, US number/time/date formats. If you want to make it country- and language-independent, that's a bigger proposition, you'll need corpora, native-speaking people to label and QA it all, etc.
All of the above is still only ASCII, which is practically speaking only 96 characters. Allow the input to be Unicode, and things get harder still (and the training-set necessarily must be either much bigger or much sparser)
In the simple (deterministic) case, function isEndOfSentence(leftContext, rightContext) would return boolean, but in the more general sense, it's probabilistic: it returns a float 0.0-1.0 (confidence level that that particular '.' is a sentence end).
References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version used to be on YouTube, was taken down]
Try to split the input according to the spaces rather than a dot or ?, if you do like this then the dot or ? won't be printed in the final result.
>>> import re
>>> s = """Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."""
>>> m = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', s)
>>> for i in m:
... print i
...
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
sent = re.split('(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)',text)
for s in sent:
print s
Here the regex used is : (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)
First block: (?<!\w\.\w.) : this pattern searches in a negative feedback loop (?<!) for all words (\w) followed by fullstop (\.) , followed by other words (\.)
Second block: (?<![A-Z][a-z]\.): this pattern searches in a negative feedback loop for anything starting with uppercase alphabets ([A-Z]), followed by lower case alphabets ([a-z]) till a dot (\.) is found.
Third block: (?<=\.|\?): this pattern searches in a feedback loop of dot (\.) OR question mark (\?)
Fourth block: (\s|[A-Z].*): this pattern searches after the dot OR question mark from the third block. It searches for blank space (\s) OR any sequence of characters starting with a upper case alphabet ([A-Z].*).
This block is important to split if the input is as
Hello world.Hi I am here today.
i.e. if there is space or no space after the dot.
Naive approach for proper english sentences not starting with non-alphas and not containing quoted parts of speech:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
EndPunctuation = re.compile(r'([\.\?\!]\s+)')
NonEndings = re.compile(r'(?:Mrs?|Jr|i\.e)\.\s*$')
parts = EndPunctuation.split(text)
sentence = []
for part in parts:
if len(part) and len(sentence) and EndPunctuation.match(sentence[-1]) and not NonEndings.search(''.join(sentence)):
print(''.join(sentence))
sentence = []
if len(part):
sentence.append(part)
if len(sentence):
print(''.join(sentence))
False positive splitting may be reduced by extending NonEndings a bit. Other cases will require additional code. Handling typos in a sensible way will prove difficult with this approach.
You will never reach perfection with this approach. But depending on the task it might just work "enough"...
I'm not great at regular expressions, but a simpler version, "brute force" actually, of above is
sentence = re.compile("([\'\"][A-Z]|([A-Z][a-z]*\. )|[A-Z])(([a-z]*\.[a-z]*\.)|([A-Za-z0-9]*\.[A-Za-z0-9])|([A-Z][a-z]*\. [A-Za-z]*)|[^\.?]|[A-Za-z])*[\.?]")
which means
start acceptable units are '[A-Z] or "[A-Z]
please note, most regular expressions are greedy so the order is very important when we do |(or). That's, why I have written i.e. regular expression first, then is come forms like Inc.
Try this:
(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)
I wrote this taking into consideration smci's comments above. It is a middle-of-the-road approach that doesn't require external libraries and doesn't use regex. It allows you to provide a list of abbreviations and accounts for sentences ended by terminators in wrappers, such as a period and quote: [.", ?', .)].
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']
def find_sentences(paragraph):
end = True
sentences = []
while end > -1:
end = find_sentence_end(paragraph)
if end > -1:
sentences.append(paragraph[end:].strip())
paragraph = paragraph[:end]
sentences.append(paragraph)
sentences.reverse()
return sentences
def find_sentence_end(paragraph):
[possible_endings, contraction_locations] = [[], []]
contractions = abbreviations.keys()
sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
for sentence_terminator in sentence_terminators:
t_indices = list(find_all(paragraph, sentence_terminator))
possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
for contraction in contractions:
c_indices = list(find_all(paragraph, contraction))
contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
max_end_start = max([pe[0] for pe in possible_endings])
possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
end = (-1 if not len(possible_endings) else max(possible_endings))
return end
def find_all(a_str, sub):
start = 0
while True:
start = a_str.find(sub, start)
if start == -1:
return
yield start
start += len(sub)
I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
My example is based on the example of Ali, adapted to Brazilian Portuguese. Thanks Ali.
ABREVIACOES = ['sra?s?', 'exm[ao]s?', 'ns?', 'nos?', 'doc', 'ac', 'publ', 'ex', 'lv', 'vlr?', 'vls?',
'exmo(a)', 'ilmo(a)', 'av', 'of', 'min', 'livr?', 'co?ls?', 'univ', 'resp', 'cli', 'lb',
'dra?s?', '[a-z]+r\(as?\)', 'ed', 'pa?g', 'cod', 'prof', 'op', 'plan', 'edf?', 'func', 'ch',
'arts?', 'artigs?', 'artg', 'pars?', 'rel', 'tel', 'res', '[a-z]', 'vls?', 'gab', 'bel',
'ilm[oa]', 'parc', 'proc', 'adv', 'vols?', 'cels?', 'pp', 'ex[ao]', 'eg', 'pl', 'ref',
'[0-9]+', 'reg', 'f[ilĂ­]s?', 'inc', 'par', 'alin', 'fts', 'publ?', 'ex', 'v. em', 'v.rev']
ABREVIACOES_RGX = re.compile(r'(?:{})\.\s*$'.format('|\s'.join(ABREVIACOES)), re.IGNORECASE)
def sentencas(texto, min_len=5):
# baseado em https://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing
texto = re.sub(r'\s\s+', ' ', texto)
EndPunctuation = re.compile(r'([\.\?\!]\s+)')
# print(NonEndings)
parts = EndPunctuation.split(texto)
sentencas = []
sentence = []
for part in parts:
txt_sent = ''.join(sentence)
q_len = len(txt_sent)
if len(part) and len(sentence) and q_len >= min_len and \
EndPunctuation.match(sentence[-1]) and \
not ABREVIACOES_RGX.search(txt_sent):
sentencas.append(txt_sent)
sentence = []
if len(part):
sentence.append(part)
if sentence:
sentencas.append(''.join(sentence))
return sentencas
Full code in: https://github.com/luizanisio/comparador_elastic
If you want to break up sentences at 3 periods (not sure if this is what you want) you can use this regular expresion:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r'\.{3}', text)
for stuff in sentences:
print(stuff)

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently is repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
parse item.title for bandRaw, venue, date
if bandRaw == str(band)
send venue name + ", Dallas, TX" to google for geocoding
return lat,long
list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
pat = re.compile('([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individual, just call them on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
if m:
bandRaw, venue, date = m.groups()
if band == bandRaw:
place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
lines.append(",".join([band, venue, date, lat, lng]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.

Categories

Resources