I am trying to parse a csv file with the following contents:
# country,title1,title2,type
GB,Fast Friends,Burn Notice, S:4, E:2,episode,
SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,
The expected output is:
['SE', 'The Spiderwick Chronicles', '"SPIDERWICK CHRONICLES, THE"', 'movie']
['GB', 'Fast Friends', 'Burn Notice, S:4, E:2', 'episode']
The problem is that the commas in the 'title' fields are not escaped. I tried using csv.reader as well as string and regex parsing, but was unable to get unambiguous matches.
Is it possible at all to parse this file accurately with unescaped commas in half of the fields? Or does it require creating a new CSV?
You may be able to play a trick if you can make the assumption that all commas will appear in title2. Otherwise, you have ambiguous data.
strings = ['SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,',
           'GB,Fast Friends,Burn Notice, S:4, E:2,episode,']

for string in strings:
    xs = string.split(',')
    country = xs[0]
    title1 = xs[1]
    # rejoin the middle fields with commas; because of the trailing
    # comma in the input, the type field is the second-to-last element
    title2 = ','.join(xs[2:-2])
    mtype = xs[-2]
    print([country, title1, title2, mtype])
Output:
['SE', 'The Spiderwick Chronicles', '"SPIDERWICK CHRONICLES, THE"', 'movie']
['GB', 'Fast Friends', 'Burn Notice, S:4, E:2', 'episode']
You can use regular expressions (import re) - see the re module documentation.
Match for (\".*\",)|(.*,)
This way you're looking either for [quoted string,] or [any string,].
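To make that concrete, here is a sketch using a single pattern with a greedy middle group instead of the alternation above. It assumes exactly four logical fields with a trailing comma, and that stray commas appear only in title2:

```python
import re

# assumption: four fields plus a trailing comma, with unescaped
# commas confined to the third (title2) field
pattern = re.compile(r'^([^,]+),([^,]+),(.+),([^,]+),$')

lines = [
    'GB,Fast Friends,Burn Notice, S:4, E:2,episode,',
    'SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,',
]
for line in lines:
    m = pattern.match(line)
    if m:
        print(list(m.groups()))
```

The greedy `(.+)` soaks up everything between the second and last comma-free fields, so embedded commas in title2 survive intact.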
If there are commas in the fields, I would save the Excel file as a text file with fields separated by tabs.
I have a text file, and I have been given a task in Python involving reading it:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have ideas about how it can be improved? (It gives me an error after some words; I guess the error happens because one of the 'Mr.' occurrences is at the end of a line.)
orig_text = open('emma.txt', encoding='UTF-8')
lines = orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
    wordsdirty = line.split()
    try:
        print(wordsdirty[wordsdirty.index('Mr.') + 1])
    except ValueError:
        continue
Try this:
import re

text = "When did Mr. Churchill tell Mr. James Brown about the fish"
m = [x[0] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue simply read the entire file:
text = file.read()
Then, to count the occurrences, simply run (after from collections import Counter):
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
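Putting those pieces together, a complete sketch might look like this (the sample text here is just an illustration standing in for the novel):

```python
import re
from collections import Counter

text = "When did Mr. Churchill tell Mr. James Brown about the fish? Mr. Churchill knew."
# element 0 of each findall tuple is the whole 'Mr. Name [Name ...]' match
matches = [x[0] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
# drop the leading 'Mr. ' and count how often each name occurs
counts = Counter(name[4:] for name in matches)
print(counts)
```

Because the whole file is read as one string, a 'Mr.' at the end of a line no longer causes a problem.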
This can easily be done using a regex and a capturing group.
Take a look here for reference, in this scenario you might want to do something like
import re
from collections import Counter

# retrieve a list of strings that match your regex
matches = re.findall(r"Mr\. ([a-zA-Z]+)", your_entire_file)  # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it to memory, take a look at this question
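On the memory-mapping point, here is a minimal sketch using the stdlib mmap module. The small file written first is only a stand-in for the real novel:

```python
import mmap

# write a small stand-in file (assumption: this plays the role of emma.txt)
with open('sample.txt', 'w', encoding='utf-8') as f:
    f.write('Mr. Churchill spoke.\nMr. Frank Churchill replied.\n')

# map the file into memory and decode it as one string
with open('sample.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    text = mm.read().decode('utf-8')

print(text.count('Mr.'))
```

For a 10-15 paragraph file this is overkill and a plain file.read() is simpler, but the pattern scales to much larger texts.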
I'm working on a project, where I have to extract honorific titles (Mr, Mrs, St, etc.) from a novel. The desired output with the text I'm working with is:
['Col', 'Dr', 'Mr', 'Mrs', 'Otto', 'Rev', 'St']
However, with the code I wrote, the output is this:
{'Tom.', 'Mrs.', 'Otto.', 'Mary.', 'Bots.', 'Come.', 'No.', 'Col.', 'Cain.', 'Dr.', 'Gang.', 'Ike.', 'Kean.', 'St.', 'Hank.', 'Him.', 'Finn.', 'Ann.', 'Jane.', 'Alas.', 'Huck.', 'Sis.', 'Buck.', 'Jim.', 'Sid.', 'Mr.', 'Bill.', 'Rev.', 'Yes.'}
This is the code I have so far:
def get_titles(text):
pattern = re.compile('[A-Z][a-z]{1,3}\.')
title_tokens = set(re.findall(pattern, text))
pattern2 = re.compile('[A-Z][a-z]{1,3}')
pseudo_titles = set(re.findall(pattern2, text))
pseudo_titles = [word.strip() for word in pseudo_titles]
pseudo_titles = [word.replace('\n', '') for word in pseudo_titles]
difference = title_tokens.difference(pseudo_titles)
return difference
test = get_titles(text)
print(test)
As you can notice, the output gives me additional words with periods in them. I believe the issue stems from the regular expressions, but I'm not sure. Any advice or tips are appreciated.
The text can be found here: http://www.gutenberg.org/files/76/76-0.txt
Essentially, you are asking for an algorithm which can tell the difference between a title and a one-word sentence. These are lexically indistinguishable; for example, consider the following two strings:
"Do I know who did this? Yes. Smith did it."
"Do I know who did this? Mr. Smith did it."
In the first sentence, "Yes." is a one-word sentence, and in the second, "Mr." is a title. As humans we only know this because we understand the meanings of the tokens "Yes" and "Mr"; so an algorithm which is able to distinguish between these cases requires some information about the meanings of the tokens it's parsing. It cannot work purely lexically like a regex does. This means you must either write a whitelist of allowed titles, or a blacklist of words which are not titles, or otherwise the problem is much more difficult.
Alternatively, if your project doesn't involve parsing titles from very many novels, you could just trim down the results by hand, using your human knowledge that "Tom" and "Yes" aren't titles. It shouldn't be that much work.
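A minimal sketch of the whitelist approach; the set of known titles here is an assumption, so extend it to fit your novel:

```python
import re

# hypothetical whitelist of honorifics - adjust for your text
KNOWN_TITLES = {'Mr', 'Mrs', 'Dr', 'St', 'Col', 'Rev', 'Otto'}

def get_titles(text):
    # candidate tokens: a capitalised short word followed by a period
    candidates = set(re.findall(r'([A-Z][a-z]{1,3})\.', text))
    # keep only the candidates that are on the whitelist
    return sorted(candidates & KNOWN_TITLES)

sample = "Mr. Smith met Dr. Jones. Yes. Col. Brown said no."
print(get_titles(sample))
```

One-word sentences like "Yes." still match the regex, but the set intersection discards them.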
My question is simply: if neither split() nor splitlines() manages to break a string into multiple lines, does that mean that nothing is delimiting the string?
My example is pretty in depth but in short: I have parsed specific data out of an HTML table with BeautifulSoup, but when I go to print the data it is all one messy string instead of a neat table format. I tried converting it to a Pandas DataFrame but still no success. I then tried using the above commands to neaten up the output but those also failed. This all leads me to believe it must in fact be one continuous string with no delimiters (even though obviously in the table they are separate entries).
I would love help with this problem. I am not sure if I'm using the commands wrong, or if my data really is this difficult to work with. Thank you.
My data (and how I expect it should be printed):
My relevant code:
rows = table.findAll("tr")[1:2]
data = {
    'ID': [],
    'Available Quota': [],
    'Live Weight Pounds': [],
    'Price': [],
    'Date Posted': []
}
for row in rows:
    cols = row.findAll("td")
    data['ID'].append(cols[0].get_text())
    data['Available Quota'].append(cols[1].get_text())
    data['Live Weight Pounds'].append(cols[2].get_text())
    data['Price'].append(cols[3].get_text())
    data['Date Posted'].append(cols[4].get_text())

fishData = pd.DataFrame(data)
#print(fishData)
str1 = ''.join(data['Available Quota'])
#print(type(str1))
#str1.split("\n")
str1.splitlines()
print(str1)
What gets printed:
GOM CODGOM HADDDABSGOM YT
My guess is that there's some formatting happening inside the table cells that you're throwing away. Supposing that the four lines visible in your table cell are separated by <br> tags, BeautifulSoup will discard that information when you call get_text:
>>> s = 'First line <br />Second line <br />Third line'
>>> soup = BeautifulSoup(s)
>>> soup.get_text()
u'First line Second line Third line'
As noted over here, you can swap out <br> tags for newlines, which might make your life easier:
>>> for br in soup.find_all("br"):
... br.replace_with("\n")
>>> soup.get_text()
u'First line \nSecond line \nThird line'
The strings and stripped_strings generators might also be useful here; they return chunks of text which were originally separated by tags:
>>> soup = BeautifulSoup(s)
>>> list(soup.stripped_strings)
[u'First line', u'Second line', u'Third line']
So, what happens if you do:
data['Available Quota'].extend(cols[1].stripped_strings)
Hopefully, you should have the list you're looking for in data['Available Quota']:
>>> data['Available Quota']
['GOM', 'CODGOM', 'HADDDABSGOM', 'YT']
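For comparison, the same separate-chunks idea can be seen with nothing but the standard library's html.parser. This is only an illustration of why <br>-separated pieces come out as distinct strings; the cell markup below is a guess at your table's structure:

```python
from html.parser import HTMLParser

class CellText(HTMLParser):
    """Collect the text chunks inside a cell; <br> tags naturally split them."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # each run of text between tags arrives as a separate call
        data = data.strip()
        if data:
            self.chunks.append(data)

parser = CellText()
parser.feed('<td>GOM<br/>CODGOM<br/>HADDDABSGOM<br/>YT</td>')
print(parser.chunks)
```

BeautifulSoup's stripped_strings does essentially this for you, which is why it is the easier route here.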
If you simply replace:
str1 = ''.join(data['Available Quota'])
with
str1 = '\n'.join(data['Available Quota'])
and then comment out the:
str1.splitlines()
Then your print statement will print out the following:
GOM
CODGOM
HADDDABSGOM
YT
A simple example of what the output of the '\n'.join() looks like:
In [41]: b
Out[41]: ['a', 'b', 'c']
In [42]: print('\n'.join(b))
a
b
c
I'm sorry to ask another question about the same text file.
Below is the string from the text file I'm working with.
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd
This string consists of "word/tag" pairs, as you can see. From this string, I want to filter only the sequences of "adjective + noun" and turn them into bigrams. For example, "Grand/jj-tl Jury/nn-tl" is exactly the word sequence I want. (nn means noun, jj means adjective, and suffixes such as "-tl" are additional information about the tag.)
Maybe this will be an easy job. I first used a regex for filtering. Below is my code.
import re

f = open(textfile)
raw = f.read()
tag_list = re.findall(r"\w+/jj-?\w* \w+/nn-?\w*", raw)
print(tag_list)
This code gives me the exact list of matches. However, what I want is bigram data; the code only gives me a list of strings, like this:
['Grand/jj-tl Jury/nn-tl', 'recent/jj primary/nn', 'Executive/jj-tl Committee/nn-tl']
I want this data to be converted such as below.
[('Grand/jj-tl, Jury/nn-tl'), ('recent/jj ,primary/nn'), ('Executive/jj-tl , Committee/nn-tl')]
i.e., a list of bigram data. I need your advice.
I think once you have found the tag_list it should be an easy job afterwards just using the list comprehension:
>>> tag_list = ['Grand/jj-tl Jury/nn-tl', 'recent/jj primary/nn', 'Executive/jj-tl Committee/nn-tl']
>>> [tag.replace(' ', ', ') for tag in tag_list]
['Grand/jj-tl, Jury/nn-tl', 'recent/jj, primary/nn', 'Executive/jj-tl, Committee/nn-tl']
In your original example, I am not sure why you have ('Grand/jj-tl, Jury/nn-tl') - note that parentheses around a single string do not make a tuple - and I am also not sure why you would want to join these bigrams with a comma.
I think it would be better to have a list of list where the inner list have the bigram data:
>>> [tag.split() for tag in tag_list]
[['Grand/jj-tl', 'Jury/nn-tl'], ['recent/jj', 'primary/nn'], ['Executive/jj-tl', 'Committee/nn-tl']]
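If you want actual tuples, which is what "bigram" usually means (e.g. in NLTK), split each match into a pair instead:

```python
# the matches produced by the findall step above
tag_list = ['Grand/jj-tl Jury/nn-tl', 'recent/jj primary/nn',
            'Executive/jj-tl Committee/nn-tl']

# split each two-word match and wrap it in a tuple
bigrams = [tuple(tag.split()) for tag in tag_list]
print(bigrams)
```

Tuples are hashable, so this form also lets you feed the bigrams straight into a Counter or a set later on.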
I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print(repr(item.title[0:-1]))
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently is repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re

s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call group() on the info object (stripping the trailing spaces the pattern leaves behind):
print(info.group(1))  # or info.groups()[0]
print('"%s","%s","%s"' % (info.group(1).strip(), info.group(2).strip(), info.group(3)))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, I'm not sure where you got that from, but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then wrapping it in repr(), which isn't necessary here.
Your code should look something like this:
import re
import feedparser  # from www.feedparser.org
from geopy import geocoders  # from GeoPy

us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)

lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))

result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.