Regex to help split up list into two-tuples - python

Given a list of actors, with their their character name in brackets, separated by either a semi-colon (;) or comm (,):
Shelley Winters [Ruby]; Millicent Martin [Siddie]; Julia Foster [Gilda];
Jane Asher [Annie]; Shirley Ann Field [Carla]; Vivien Merchant [Lily];
Eleanor Bron [Woman Doctor], Denholm Elliott [Mr. Smith; abortionist];
Alfie Bass [Harry]
How would I parse this into a list of two-typles in the form of [(actor, character),...]
--> [('Shelley Winters', 'Ruby'), ('Millicent Martin', 'Siddie'),
('Denholm Elliott', 'Mr. Smith; abortionist')]
I originally had:
actors = [item.strip().rstrip(']') for item in re.split('\[|,|;',data['actors'])]
data['actors'] = [(actors[i], actors[i + 1]) for i in range(0, len(actors), 2)]
But this doesn't quite work, as it also splits up items within brackets.

You can go with something like:
>>> re.findall(r'(\w[\w\s\.]+?)\s*\[([\w\s;\.,]+)\][,;\s$]*', s)
[('Shelley Winters', 'Ruby'),
('Millicent Martin', 'Siddie'),
('Julia Foster', 'Gilda'),
('Jane Asher', 'Annie'),
('Shirley Ann Field', 'Carla'),
('Vivien Merchant', 'Lily'),
('Eleanor Bron', 'Woman Doctor'),
('Denholm Elliott', 'Mr. Smith; abortionist'),
('Alfie Bass', 'Harry')]
One can also simplify some things with .*?:
re.findall(r'(\w.*?)\s*\[(.*?)\][,;\s$]*', s)

inputData = inputData.replace("];", "\n")
inputData = inputData.replace("],", "\n")
inputData = inputData[:-1]
for line in inputData.split("\n"):
actorList.append(line.partition("[")[0])
dataList.append(line.partition("[")[2])
togetherList = zip(actorList, dataList)
This is a bit of a hack, and I'm sure you can clean it up from here. I'll walk through this approach just to make sure you understand what I'm doing.
I am replacing both the ; and the , with a newline, which I will later use to split up every pair into its own line. Assuming your content isn't filled with erroneous ]; or ], 's this should work. However, you'll notice the last line will have a ] at the end because it didn't have a need a comma or semi-colon. Thus, I splice it off with the third line.
Then, just using the partition function on each line that we created within your input string, we assign the left part to the actor list, the right part to the data list and ignore the bracket (which is at position 1).
After that, Python's very useful zip funciton should finish the job for us by associating the ith element of each list together into a list of matched tuples.

Related

How can I split concatenated strings that contain no delimiters in python?

Let's say I have a list of concatenated firstname + lastname combinations like this:
["samsmith","sallyfrank","jamesandrews"]
I also have lists possible_firstnames and possible_lastnames.
If I want to split those full name strings based on values that appear in possible_firstnames and possible_lastnames, what is the best way of doing so?
My initial strategy was to compare characters between full name strings and each possible_firstnames/possible_lastnames value one by one, where I would split the full name string on discovery of a match. However, I realize that I would encounter a problem if, for example, "Sal" was included as a possible first name (my code would try to turn "sallyfrank" into "Sal Lyfrank" etc).
My next step would be to crosscheck what remains in the string after "sal" to values in possible_lastnames before finalizing the split, but this is starting to approach the convoluted and so I am left wondering if there is perhaps a much simpler option that I have been overlooking from the very beginning?
The language that I am working in is Python.
If you are getting similar names, like sam, samantha and saman, put them in reverse order so that the shortest is last
full_names = ["samsmith","sallyfrank","jamesandrews", "samanthasang", "samantorres"]
first_name = ["sally","james", "samantha", "saman", "sam"]
matches = []
for name in full_names:
for first in first_name:
if name.startswith(first):
matches.append(f'{first} {name[len(first):]}')
break
print(*matches, sep='\n')
Result
sam smith
sally frank
james andrews
samantha sang
saman torres
This won't pick out a name like Sam Antony. It would show this as *Saman Tony", in which case, your last name idea would work.
It also won't pick out Sam Anthanei. This could be Samantha Nei, Saman Thanei or Sam Anthanei if all three surnames were in your surname list.
Is this what u wanted
names = ["samsmith","sallyfrank","jamesandrews"]
pos_fname = ["sally","james"]
pos_lname = ["smith","frank"]
matches = []
for i in names:
for n in pos_fname:
if i.startswith(n):
break
else:
continue
for n in pos_lname:
if i.endswith(n):
matches.append(f"{i[:-len(n)].upper()} {n.upper()}")
break
else:
continue
print(matches)

Confused about the type of parameter that goes into this method

I'm trying to understand how this code works, we have:
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson',
'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']
def split_title_and_name(person):
return person.split()[0] + ' ' + person.split()[-1]
So we are given a list, and this method is supposed to basically delete everything in the middle between "Dr." and the last name. As far as I know, the split() function cannot be used for lists, but for strings. so person must be a string. However, we also add [0] and [-1] to person, which means we should be getting the first and last character of "person" but instead, we get first word and last word. I cannot make sense of this code! May you please help me understand?
Any help is greatly appreciated, thank you :)
The split function splits the string into a list of words. And then we select the first and last words to form the output.
>>> person = 'Dr. Christopher Brooks'
>>> person.split()
['Dr.', 'Christopher', 'Brooks']
>>> person.split()[0]
'Dr.'
>>> person.split()[-1]
'Brooks'
This is not a real answer, just adding this for clarification on how the function would be used, given a list of strings.
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson',
'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']
def split_title_and_name(person: str):
return person.split()[0] + ' ' + person.split()[-1]
# This code does not actually run (I guess this might have been what you were trying)
# result = split_title_and_name(people)
# Using a for loop to print the result of running function over each list element
print('== With loop')
for person in people:
result = split_title_and_name(person)
print(result)
# Using a list comprehension to get the same results as above
print('== With list comprehension')
results = [split_title_and_name(person) for person in people]
print(results)
Python's split() method splits a string into a list. You can specify the separator, the default separator is any whitespace. So in your case, you didn't specify any separator and therefore this function will split the string person into ['Dr.', 'Christopher', 'Brooks'] and therefore [0] = 'Dr.' and [-1] = 'Brooks'.
The syntax for split() function is: string.split(separator, maxsplit), here both parameters are optional.
If you don't give any parameters, the default values for separator is any whitespace such as space, \t , \n , etc and maxsplit is -1 (meaning, all occurrences)
You can learn more about split() on https://www.w3schools.com/python/ref_string_split.asp

How to find required word in novel in python?

I have a text and I have got a task in python with reading module:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have ideas about how can it be improved? (It gives me error after some words, I guess error happens due to the reason that one of the Mr. is at the end of the line.)
orig_text= open('emma.txt', encoding = 'UTF-8')
lines= orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
wordsdirty = line.split()
try:
print (wordsdirty[wordsdirty.index('Mr.') + 1])
except ValueError:
continue
Try this:
text = "When did Mr. Churchill told Mr. James Brown about the fish"
m = [x[0] for x in re.findall('(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue simply read the entire file:
text = file.read()
Then, to count the occurrences, simply run:
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
This can be easily done using regex and capturing group.
Take a look here for reference, in this scenario you might want to do something like
# retrieve a list of strings that match your regex
matches = re.findall("Mr\. ([a-zA-Z]+)", your_entire_file) # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it to memory, take a look at this question

Removing white space from txt with python

I have a .txt file (scraped as pre-formatted text from a website) where the data looks like this:
B, NICKOLAS CT144531X D1026 JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL JA-15-0050 D0015 JUDGE EDWARD A ROBERTS
I'd like to remove all extra spaces (they're actually different number of spaces, not tabs) in between the columns. I'd also then like to replace it with some delimiter (tab or pipe since there's commas within the data), like so:
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Looked around and found that the best options are using regex or shlex to split. Two similar scenarios:
Python Regular expression must strip whitespace except between quotes,
Remove white spaces from dict : Python.
You can apply the regex '\s{2,}' (two or more whitespace characters) to each line and substitute the matches with a single '|' character.
>>> import re
>>> line = 'ANDREWS VS BALL JA-15-0050 D0015 JUDGE EDWARD A ROBERTS '
>>> re.sub('\s{2,}', '|', line.strip())
'ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS'
Stripping any leading and trailing whitespace from the line before applying re.sub ensures that you won't get '|' characters at the start and end of the line.
Your actual code should look similar to this:
import re
with open(filename) as f:
for line in f:
subbed = re.sub('\s{2,}', '|', line.strip())
# do something here
What about this?
your_string ='ANDREWS VS BALL JA-15-0050 D0015 JUDGE EDWARD A ROBERTS'
print re.sub(r'\s{2,}','|',your_string.strip())
Output:
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Expanation:
I've used re.sub() which takes 3 parameter, a pattern, a string you want to replace with and the string you want to work on.
What I've done is taking at least two space together , I 've replaced them with a | and applied it on your string.
s = """B, NICKOLAS CT144531X D1026 JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL JA-15-0050 D0015 JUDGE EDWARD A ROBERTS
"""
# Update
re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
In [71]: print re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Considering there are at least two spaces separating the columns, you can use this:
lines = [
'B, NICKOLAS CT144531X D1026 JUDGE ANNIE WHITE JOHNSON ',
'ANDREWS VS BALL JA-15-0050 D0015 JUDGE EDWARD A ROBERTS '
]
for line in lines:
parts = []
for part in line.split(' '):
part = part.strip()
if part: # checking if stripped part is a non-empty string
parts.append(part)
print('|'.join(parts))
Output for your input:
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
It looks like your data is in a "text-table" format.
I recommend using the first row to figure out the start point and length of each column (either by hand or write a script with regex to determine the likely columns), then writing a script to iterate the rows of the file, slice the row into column segments, and apply strip to each segment.
If you use a regex, you must keep track of the number of columns and raise an error if any given row has more than the expected number of columns (or a different number than the rest). Splitting on two-or-more spaces will break if a column's value has two-or-more spaces, which is not just entirely possible, but also likely. Text-tables like this aren't designed to be split on a regex, they're designed to be split on the column index positions.
In terms of saving the data, you can use the csv module to write/read into a csv file. That will let you handle quoting and escaping characters better than specifying a delimiter. If one of your columns has a | character as a value, unless you're encoding the data with a strategy that handles escapes or quoted literals, your output will break on read.
Parsing the text above would look something like this (i nested a list comprehension with brackets instead of the traditional format so it's easier to understand):
cols = ((0,34),
(34, 50),
(50, 59),
(59, None),
)
for line in lines:
cleaned = [i.strip() for i in [line[s:e] for (s, e) in cols]]
print cleaned
then you can write it with something like:
import csv
with open('output.csv', 'wb') as csvfile:
spamwriter = csv.writer(csvfile, delimiter='|',
quotechar='"', quoting=csv.QUOTE_MINIMAL)
for line in lines:
spamwriter.writerow([line[col_start:col_end].strip()
for (col_start, col_end) in cols
])
Looks like this library can solve this quite nicely:
http://docs.astropy.org/en/stable/io/ascii/fixed_width_gallery.html#fixed-width-gallery
Impressive...

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently is repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
parse item.title for bandRaw, venue, date
if bandRaw == str(band)
send venue name + ", Dallas, TX" to google for geocoding
return lat,long
list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
pat = re.compile('([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individual, just call them on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
if m:
bandRaw, venue, date = m.groups()
if band == bandRaw:
place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
lines.append(",".join([band, venue, date, lat, lng]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.

Categories

Resources