Python - Splitting a string of names separated by spaces and commas - python

I have an API feeding into my program, into a Django many-to-many model field. The names of the individuals in my database are stored as separate first and last names. However, the API sends a bulk list of names as a single string, like so: "Jones, Bob Smith, Jason Donald, Mic" (last name, comma, space, first name, space, next last name, etc.).
How would I separate this string in a way that would allow me to filter and add a particular user to the many-to-many field?
Thanks!!

This answer does not handle first or last names that contain a space (that case is much more complicated, because a word could then have a space on both its left and its right).
You need to replace the comma-space sequence with something that contains no space (because a space also separates two different names).
string = "Jones, Bob Smith, Jason Donald, Mic"
names = []
for name in string.replace(', ', ',').split(' '):
    name = name.split(',')
    last_name = name[0]
    first_name = name[1]
    names.append((last_name, first_name))
names
Output:
[('Jones', 'Bob'), ('Smith', 'Jason'), ('Donald', 'Mic')]

You can use a regex:
import re
s = "Jones, Bob Smith, Jason Donald, Mic"
re.findall(r'(\S+), (\S+)', s)
# [('Jones', 'Bob'), ('Smith', 'Jason'), ('Donald', 'Mic')]
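To use the parsed pairs for a lookup, it can help to wrap the regex in a small function that returns named fields. This is only a sketch: `parse_names` and the `first_name`/`last_name` keys are hypothetical names chosen for illustration, and it still assumes that neither part of a name contains whitespace.

```python
import re

# Matches "Last, First" pairs; assumes neither part contains whitespace.
NAME_PAIR = re.compile(r'(\S+), (\S+)')

def parse_names(bulk):
    """Return one {'first_name', 'last_name'} dict per name in the bulk string."""
    return [
        {"first_name": first, "last_name": last}
        for last, first in NAME_PAIR.findall(bulk)
    ]

people = parse_names("Jones, Bob Smith, Jason Donald, Mic")
print(people)
```

If the model fields happen to be called `first_name` and `last_name` (an assumption, the question does not say), each dict could be passed as keyword arguments to a queryset filter before adding the matching user to the many-to-many field.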

Related

How can I split concatenated strings that contain no delimiters in python?

Let's say I have a list of concatenated firstname + lastname combinations like this:
["samsmith","sallyfrank","jamesandrews"]
I also have lists possible_firstnames and possible_lastnames.
If I want to split those full name strings based on values that appear in possible_firstnames and possible_lastnames, what is the best way of doing so?
My initial strategy was to compare characters between full name strings and each possible_firstnames/possible_lastnames value one by one, where I would split the full name string on discovery of a match. However, I realize that I would encounter a problem if, for example, "Sal" was included as a possible first name (my code would try to turn "sallyfrank" into "Sal Lyfrank" etc).
My next step would be to crosscheck what remains in the string after "sal" to values in possible_lastnames before finalizing the split, but this is starting to approach the convoluted and so I am left wondering if there is perhaps a much simpler option that I have been overlooking from the very beginning?
The language that I am working in is Python.
If you have similar names, like sam, samantha and saman, order them from longest to shortest so that the shortest is tried last:
full_names = ["samsmith","sallyfrank","jamesandrews", "samanthasang", "samantorres"]
first_name = ["sally","james", "samantha", "saman", "sam"]
matches = []
for name in full_names:
    for first in first_name:
        if name.startswith(first):
            matches.append(f'{first} {name[len(first):]}')
            break
print(*matches, sep='\n')
Result
sam smith
sally frank
james andrews
samantha sang
saman torres
This won't pick out a name like Sam Antony. It would show this as "Saman Tony", in which case your last-name idea would work.
It also won't pick out Sam Anthanei. This could be Samantha Nei, Saman Thanei or Sam Anthanei if all three surnames were in your surname list.
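The cross-check against the surname list can be sketched by trying every split point and keeping only the splits where both halves are known. This is a minimal illustration, not a definitive solution: `split_full_name` is a hypothetical helper, and the surname list below is made up for the example.

```python
def split_full_name(full_name, first_names, last_names):
    """Return every (first, last) split where both halves are known names."""
    firsts = set(first_names)
    lasts = set(last_names)
    return [
        (full_name[:i], full_name[i:])
        for i in range(1, len(full_name))
        if full_name[:i] in firsts and full_name[i:] in lasts
    ]

first_names = ["sally", "james", "samantha", "saman", "sam"]
last_names = ["smith", "frank", "andrews", "antony", "tony"]  # example data

print(split_full_name("sallyfrank", first_names, last_names))
print(split_full_name("samantony", first_names, last_names))
```

Returning a list makes the ambiguity explicit: one result means a safe split, several results (as for "samantony" here) mean the data alone cannot decide, and an empty list means no known combination fits.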
Is this what you wanted?
names = ["samsmith","sallyfrank","jamesandrews"]
pos_fname = ["sally","james"]
pos_lname = ["smith","frank"]
matches = []
for i in names:
    for n in pos_fname:
        if i.startswith(n):
            break
    else:
        continue
    for n in pos_lname:
        if i.endswith(n):
            matches.append(f"{i[:-len(n)].upper()} {n.upper()}")
            break
    else:
        continue
print(matches)

How to find required word in novel in python?

I have a text file and I have been given a file-reading task in Python:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have any ideas about how this can be improved? (It gives me an error after some words; I guess the error happens because one of the "Mr." tokens is at the end of a line.)
orig_text= open('emma.txt', encoding = 'UTF-8')
lines= orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
    wordsdirty = line.split()
    try:
        print (wordsdirty[wordsdirty.index('Mr.') + 1])
    except ValueError:
        continue
Try this:
import re
from collections import Counter
text = "When did Mr. Churchill told Mr. James Brown about the fish"
m = [x[0] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue, simply read the entire file (note that a name starting on the next line is still missed unless the space in the pattern is replaced with \s+):
text = file.read()
Then, to count the occurrences, simply run:
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
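Putting those steps together, a minimal sketch might look like the following. The function name `count_mr_names` and the sample sentence are made up for illustration, and the pattern is an assumption about what counts as a name: one or more capitalised words after "Mr.", so a capitalised word that merely follows a name would be swallowed too.

```python
import re
from collections import Counter

# "Mr." followed by one or more capitalised words; \s+ also matches a line break.
MR_NAME = re.compile(r'Mr\.((?:\s+[A-Z][a-z]+)+)')

def count_mr_names(text):
    """Count each name that follows "Mr." in the text."""
    # Collapse any whitespace (including newlines) inside a name to single spaces.
    names = (" ".join(match.split()) for match in MR_NAME.findall(text))
    return Counter(names)

sample = "Mr. Churchill met Mr. Frank\nChurchill. Later Mr. Churchill left."
counts = count_mr_names(sample)
print(dict(counts))
```

For the real task, the whole file would be read into one string first (for example with `open('emma.txt', encoding='UTF-8').read()`) and passed to the function, so a "Mr." at the end of a line no longer causes an error.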
This can easily be done using a regex with a capturing group.
See the documentation of Python's re module for reference. In this scenario you might want to do something like:
import re
from collections import Counter
# retrieve a list of strings that match your regex
matches = re.findall(r"Mr\. ([a-zA-Z]+)", your_entire_file) # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it into memory (for example with the mmap module).

input value same as predefined list but getting a different output

I've written a program that takes in the name and age of multiple entries separated by commas, separates the letters from the digits, and then compares each name with a predefined set/list.
If an entry doesn't match the predefined data, the program prints an "incorrect entry" message along with the element that didn't match.
Here's the code:
from string import digits
print("enter name and age")
order=input("Seperate entries using a comma ',':")
order1=order.strip()
order2=order1.replace(" ","")
order_sep=order2.split()
removed_digits=str.maketrans('','',digits)
names=order.translate(removed_digits)
print(names)
names1=names.split(',')
names_list=['abby','chris','john','cena']
names_list=set(names_list)
for name in names1:
    if name not in names_list:
        print(f"{name}:doesnt match with predefined data")
The problem I'm having is that even when I enter chris or john, the program treats them as if they don't belong to the predefined list.
sample input: ravi 19,chris 20
output:ravi ,chris
ravi :doesnt match with predefined data
chris :doesnt match with predefined data
I also have another issue: I've written a part to eliminate whitespace, but I don't know why it doesn't eliminate it.
sample input: ravi , chris
ravi :doesnt match with predefined data
()chris :doesnt match with predefined data
There's a space where I've put the parentheses.
Any suggestions to tackle this problem and/or improve this code are appreciated!
The original code builds names from order rather than from the whitespace-stripped order2, so after the digits are removed "chris 20" becomes "chris " with a trailing space, which never matches "chris". I think some of the parts can be simplified, especially when removing the digits. As long as the input is entered with a space between the name and age, you can use split() twice: first to separate the entries with split(','), and then to separate out the ages with split(). It makes comparisons easier later if you store the names by themselves with no punctuation or whitespace around them. To print the names out from an iterable, you can use the str.join() function. Here is an example:
print("enter name and age")
order = input("Seperate entries using a comma ',': ")
names1 = [x.split()[0] for x in order.split(',')]
print(', '.join(names1))
names_list=['abby', 'chris', 'john', 'cena']
for name in names1:
    if name not in names_list:
        print(f"{name}:doesnt match with predefined data")
This will give the desired output:
enter name and age
Seperate entries using a comma ',': ravi 19, chris 20
ravi, chris
ravi:doesnt match with predefined data
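If the ages are needed as well, the same double-split idea can be sketched as a small function. The names `parse_entries` and `KNOWN_NAMES` are hypothetical, and lower-casing the names and skipping empty entries are assumptions about the desired behaviour rather than part of the original program.

```python
KNOWN_NAMES = {"abby", "chris", "john", "cena"}

def parse_entries(raw):
    """Split 'name age, name age' input into (name, age) tuples."""
    entries = []
    for chunk in raw.split(","):
        parts = chunk.split()  # split() with no argument ignores surrounding whitespace
        if not parts:
            continue  # skip empty entries such as a trailing comma
        name = parts[0].lower()
        # The age is optional; anything that is not a plain number becomes None.
        age = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else None
        entries.append((name, age))
    return entries

entries = parse_entries("ravi 19 , chris 20,")
unknown = [name for name, _ in entries if name not in KNOWN_NAMES]
print(entries)
print(unknown)
```

Because each chunk is split on whitespace, stray spaces around the commas no longer end up inside the names, which is what caused the false mismatches in the question.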

Regex to help split up list into two-tuples

Given a list of actors, with their character names in brackets, separated by either a semicolon (;) or a comma (,):
Shelley Winters [Ruby]; Millicent Martin [Siddie]; Julia Foster [Gilda];
Jane Asher [Annie]; Shirley Ann Field [Carla]; Vivien Merchant [Lily];
Eleanor Bron [Woman Doctor], Denholm Elliott [Mr. Smith; abortionist];
Alfie Bass [Harry]
How would I parse this into a list of two-tuples in the form [(actor, character), ...]?
--> [('Shelley Winters', 'Ruby'), ('Millicent Martin', 'Siddie'),
('Denholm Elliott', 'Mr. Smith; abortionist')]
I originally had:
actors = [item.strip().rstrip(']') for item in re.split(r'\[|,|;',data['actors'])]
data['actors'] = [(actors[i], actors[i + 1]) for i in range(0, len(actors), 2)]
But this doesn't quite work, as it also splits up items within brackets.
You can go with something like:
>>> re.findall(r'(\w[\w\s\.]+?)\s*\[([\w\s;\.,]+)\][,;\s$]*', s)
[('Shelley Winters', 'Ruby'),
('Millicent Martin', 'Siddie'),
('Julia Foster', 'Gilda'),
('Jane Asher', 'Annie'),
('Shirley Ann Field', 'Carla'),
('Vivien Merchant', 'Lily'),
('Eleanor Bron', 'Woman Doctor'),
('Denholm Elliott', 'Mr. Smith; abortionist'),
('Alfie Bass', 'Harry')]
One can also simplify some things with .*?:
re.findall(r'(\w.*?)\s*\[(.*?)\][,;\s$]*', s)
actorList = []
dataList = []
inputData = inputData.replace("];", "\n")
inputData = inputData.replace("],", "\n")
inputData = inputData[:-1]
for line in inputData.split("\n"):
    actorList.append(line.partition("[")[0])
    dataList.append(line.partition("[")[2])
togetherList = list(zip(actorList, dataList))
This is a bit of a hack, and I'm sure you can clean it up from here. I'll walk through this approach just to make sure you understand what I'm doing.
I am replacing both ]; and ], with a newline, which I later use to split every pair onto its own line. Assuming your content isn't filled with erroneous ]; or ], sequences, this should work. However, you'll notice the last line will have a ] at the end because it had no need for a comma or semicolon. Thus, I slice it off with the third replacement line.
Then, using the partition function on each line that we created within your input string, we assign the left part to the actor list and the right part to the data list, and ignore the bracket itself (which is at position 1).
After that, Python's very useful zip function finishes the job by pairing the ith element of each list into a list of matched tuples.
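A variant of the same partition idea can avoid the newline replacement and the final slice by splitting on the closing bracket instead. This is only a sketch: `parse_cast` is a hypothetical name, and it assumes that character names never contain "[" or "]".

```python
def parse_cast(text):
    """Return (actor, character) tuples from 'Actor [Character]' entries."""
    pairs = []
    # Every entry ends with "]", so splitting there never cuts inside the brackets.
    for chunk in text.split("]"):
        actor, bracket, character = chunk.partition("[")
        if not bracket:
            continue  # text after the last "]" contains no entry
        # Strip the ";" or "," separator and whitespace left over from the previous entry.
        pairs.append((actor.strip(" ;,\n\t"), character.strip()))
    return pairs

cast = (
    "Eleanor Bron [Woman Doctor], Denholm Elliott [Mr. Smith; abortionist];\n"
    "Alfie Bass [Harry]"
)
print(parse_cast(cast))
```

Since the separators are only stripped from the actor side, a semicolon inside the brackets (as in "Mr. Smith; abortionist") is left untouched.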

Method for parsing text Cc field of email header?

I have the plain text of a Cc header field that looks like so:
friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>
Are there any battle tested modules for parsing this properly?
(bonus if it's in python! the email module just returns the raw text without any methods for splitting it, AFAIK)
(also bonus if it splits name and address into two fields)
There are a bunch of functions available in the standard Python email module, but I think you're looking for
email.utils.parseaddr() or email.utils.getaddresses()
>>> import email.utils
>>> addresses = 'friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>'
>>> email.utils.getaddresses([addresses])
[('', 'friend#email.com'), ('John Smith', 'john.smith#email.com'), ('Smith, Jane', 'jane.smith#uconn.edu')]
I haven't used it myself, but it looks to me like you could use the csv package quite easily to parse the data.
The code below is completely unnecessary. I wrote it before realising that you could pass getaddresses() a list containing a single string containing multiple addresses.
I haven't had a chance to look at the specifications for addresses in email headers, but based on the string you provided, this code should do the job splitting it into a list, making sure to ignore commas if they are within quotes (and therefore part of a name).
from email.utils import getaddresses
addrstring = ',friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>,'
def addrparser(addrstring):
    addrlist = ['']
    quoted = False
    # ignore comma at beginning or end
    addrstring = addrstring.strip(',')
    for char in addrstring:
        if char == '"':
            # toggle quoted mode
            quoted = not quoted
            addrlist[-1] += char
        # a comma outside of quotes means a new address
        elif char == ',' and not quoted:
            addrlist.append('')
        # anything else is the next letter of the current address
        else:
            addrlist[-1] += char
    return getaddresses(addrlist)
print(addrparser(addrstring))
Gives:
[('', 'friend#email.com'), ('John Smith', 'john.smith#email.com'),
('Smith, Jane', 'jane.smith#uconn.edu')]
I'd be interested to see how other people would go about this problem!
Convert a string containing multiple email addresses (with names) into a dictionary.
import email.utils
emailstring = 'Friends <friend#email.com>, John Smith <john.smith#email.com>,"Smith" <jane.smith#uconn.edu>'
Split the string by comma:
email_list = emailstring.split(',')
Make a dictionary with the name as key and the email as value:
email_dict = dict(map(email.utils.parseaddr, email_list))
The result looks like this:
{'John Smith': 'john.smith#email.com', 'Friends': 'friend#email.com', 'Smith': 'jane.smith#uconn.edu'}
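Two caveats apply to a name-keyed dictionary built this way: splitting on every comma breaks quoted names such as "Smith, Jane", and a plain dict holds only one address per name. A minimal sketch that sidesteps both uses getaddresses() on the whole header and collects the addresses in lists. The header string and the `@example.com` addresses are made up for illustration.

```python
from collections import defaultdict
from email.utils import getaddresses

# Example header; the addresses are placeholders.
header = (
    'Friends <friend@example.com>, John Smith <john.smith@example.com>, '
    '"Smith, Jane" <jane.smith@example.com>, Friends <friend_co@example.com>'
)

# Map each display name to every address that appears with it.
addresses_by_name = defaultdict(list)
for name, address in getaddresses([header]):
    addresses_by_name[name].append(address)

print(dict(addresses_by_name))
```

Keeping a list per name is a design choice; if names are expected to be unique, a plain dict is simpler, at the cost of silently dropping repeats.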
Note:
If the same name appears with different email addresses, only one record is kept, because the later entry overwrites the earlier one in the dictionary.
'Friends <friend#email.com>, John Smith <john.smith#email.com>,"Smith" <jane.smith#uconn.edu>, Friends <friend_co#email.com>'
Here "Friends" appears twice.
