Python Multiple Strings to Tuples - python

Hi everyone, I wonder if you can help with my problem.
I am defining a function which takes a string and converts it into a tuple of 5 items. The function will need to handle a number of strings in which some of the items vary in length. How would I go about doing this, given that slicing the string at fixed indexes does not work for every string?
As an example -
I want to convert a string like the following:
Doctor E212 40000 Peter David Jones
The tuple items of the string will be:
Job(Doctor), Department(E212), Pay(40000), Other names (Peter David), Surname (Jones)
However, some of the strings have 2 other names while others have just 1.
How would I go about converting strings like this into tuples when the number of other names can vary between 1 and 2?
I am a bit of a novice when it comes to Python, as you can probably tell ;)

With Python 3, you can just split() and use "catch-all" tuple unpacking with *:
>>> string = "Doctor E212 40000 Peter David Jones"
>>> job, dep, sal, *other, names = string.split()
>>> job, dep, sal, " ".join(other), names
('Doctor', 'E212', '40000', 'Peter David', 'Jones')
Alternatively, you can use regular expressions, e.g. something like this:
>>> import re
>>> m = re.match(r"(\w+) (\w+) (\d+) ([\w\s]+) (\w+)", string)
>>> job, dep, sal, other, names = m.groups()
>>> job, dep, sal, other, names
('Doctor', 'E212', '40000', 'Peter David', 'Jones')
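Since the question asks for a function, the unpacking approach above can be wrapped up roughly as follows. This is only a sketch: the name parse_record and the second sample line are made up for illustration, and it assumes every line has at least four whitespace-separated fields (otherwise the unpacking raises ValueError).

```python
def parse_record(line):
    """Split 'Job Dept Pay Other... Surname' into a 5-item tuple."""
    # The starred target soaks up however many "other names" there are.
    job, dep, pay, *other, surname = line.split()
    return job, dep, pay, " ".join(other), surname


# Two other names
print(parse_record("Doctor E212 40000 Peter David Jones"))
# One other name (hypothetical sample line)
print(parse_record("Nurse A101 28000 Mary Smith"))
```

Note that the pay stays a string here; convert it with int() if you need a number.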

Related

How can I split concatenated strings that contain no delimiters in python?

Let's say I have a list of concatenated firstname + lastname combinations like this:
["samsmith","sallyfrank","jamesandrews"]
I also have lists possible_firstnames and possible_lastnames.
If I want to split those full name strings based on values that appear in possible_firstnames and possible_lastnames, what is the best way of doing so?
My initial strategy was to compare characters between full name strings and each possible_firstnames/possible_lastnames value one by one, where I would split the full name string on discovery of a match. However, I realize that I would encounter a problem if, for example, "Sal" was included as a possible first name (my code would try to turn "sallyfrank" into "Sal Lyfrank" etc).
My next step would be to crosscheck what remains in the string after "sal" to values in possible_lastnames before finalizing the split, but this is starting to approach the convoluted and so I am left wondering if there is perhaps a much simpler option that I have been overlooking from the very beginning?
The language that I am working in is Python.
If some first names are prefixes of others, like sam, saman and samantha, order the list so that longer names come first and the shortest is last:
full_names = ["samsmith", "sallyfrank", "jamesandrews", "samanthasang", "samantorres"]
first_name = ["sally", "james", "samantha", "saman", "sam"]
matches = []
for name in full_names:
    for first in first_name:
        if name.startswith(first):
            matches.append(f'{first} {name[len(first):]}')
            break
print(*matches, sep='\n')
Result
sam smith
sally frank
james andrews
samantha sang
saman torres
This won't pick out a name like Sam Antony. It would show this as "Saman Tony", in which case your idea of checking the last name would work.
It also can't resolve a name like samanthanei. This could be Samantha Nei, Saman Thanei or Sam Anthanei if all three surnames were in your surname list.
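The last-name cross-check mentioned in the question can be sketched like this. The function name split_full_name and the sample name lists are assumptions made for illustration. Instead of stopping at the first prefix match, it tries every known first name and keeps only the splits whose remainder is a known last name, so an ambiguous input returns more than one candidate.

```python
def split_full_name(full_name, first_names, last_names):
    """Return every (first, last) split where both halves are known names."""
    last_names = set(last_names)  # fast membership test
    return [
        (first, full_name[len(first):])
        for first in first_names
        if full_name.startswith(first) and full_name[len(first):] in last_names
    ]


# Hypothetical sample data
possible_firstnames = ["sal", "sally", "sam", "saman", "samantha"]
possible_lastnames = ["frank", "smith", "antony", "tony"]

print(split_full_name("sallyfrank", possible_firstnames, possible_lastnames))
# [('sally', 'frank')]  ("sal" is rejected because "lyfrank" is not a last name)
print(split_full_name("samantony", possible_firstnames, possible_lastnames))
# [('sam', 'antony'), ('saman', 'tony')]  (genuinely ambiguous)
```

An empty list means no valid split was found; more than one entry means the lists alone cannot decide.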
Is this what you wanted?
names = ["samsmith", "sallyfrank", "jamesandrews"]
pos_fname = ["sally", "james"]
pos_lname = ["smith", "frank"]
matches = []
for i in names:
    for n in pos_fname:
        if i.startswith(n):
            break
    else:
        continue
    for n in pos_lname:
        if i.endswith(n):
            matches.append(f"{i[:-len(n)].upper()} {n.upper()}")
            break
    else:
        continue
print(matches)

Confused about the type of parameter that goes into this method

I'm trying to understand how this code works, we have:
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson',
          'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']
def split_title_and_name(person):
    return person.split()[0] + ' ' + person.split()[-1]
So we are given a list, and this function is supposed to drop everything between "Dr." and the last name. As far as I know, split() works on strings, not lists, so person must be a string. However, we also apply [0] and [-1], which I would expect to give the first and last characters of person; instead, we get the first word and the last word. I cannot make sense of this code! Could you please help me understand?
Any help is greatly appreciated, thank you :)
The split function splits the string into a list of words. And then we select the first and last words to form the output.
>>> person = 'Dr. Christopher Brooks'
>>> person.split()
['Dr.', 'Christopher', 'Brooks']
>>> person.split()[0]
'Dr.'
>>> person.split()[-1]
'Brooks'
This is not a real answer, just adding this for clarification on how the function would be used, given a list of strings.
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson',
          'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']
def split_title_and_name(person: str):
    return person.split()[0] + ' ' + person.split()[-1]
# This code does not actually run (I guess this might have been what you were trying)
# result = split_title_and_name(people)
# Using a for loop to print the result of running function over each list element
print('== With loop')
for person in people:
    result = split_title_and_name(person)
    print(result)
# Using a list comprehension to get the same results as above
print('== With list comprehension')
results = [split_title_and_name(person) for person in people]
print(results)
Python's split() method splits a string into a list. You can specify the separator; the default is any whitespace. In your case no separator is given, so the string person is split into ['Dr.', 'Christopher', 'Brooks'], and therefore [0] is 'Dr.' and [-1] is 'Brooks'.
The syntax of split() is string.split(separator, maxsplit); both parameters are optional.
If you don't pass any arguments, the separator defaults to any run of whitespace (space, \t, \n, etc.) and maxsplit defaults to -1 (meaning all occurrences).
You can learn more about split() on https://www.w3schools.com/python/ref_string_split.asp
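A few quick experiments may make the difference clearer. Indexing the string itself gives characters, while indexing the list returned by split() gives words; the separator and maxsplit arguments only change how that list is built. The variable names below are just for illustration.

```python
person = 'Dr. VG Vinod Vydiswaran'

# Indexing the string gives single characters
first_char, last_char = person[0], person[-1]
print(first_char, last_char)          # D n

# Indexing the list returned by split() gives whole words
words = person.split()
print(words)                          # ['Dr.', 'VG', 'Vinod', 'Vydiswaran']
print(words[0], words[-1])            # Dr. Vydiswaran

# maxsplit limits how many splits are made
title_and_rest = person.split(maxsplit=1)
print(title_and_rest)                 # ['Dr.', 'VG Vinod Vydiswaran']

# An explicit separator keeps empty fields; the whitespace default does not
comma_fields = 'a,b,,c'.split(',')
print(comma_fields)                   # ['a', 'b', '', 'c']
space_fields = '  a   b '.split()
print(space_fields)                   # ['a', 'b']
```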

List within a string and print formatting

I am creating something which takes a tuple, converts it into a string and then reorganises the string using print formatting. 'other' can sometimes have 2 names, hence why I have used * and the " ".join(other) in this function:
def strFormat(x):
    #Convert to string
    s=' '
    s = s.join(x)
    print(s)
    #Split string into different parts
    payR, dep, sal, *other, surn = s.split()
    payR, dep, sal, " ".join(other), surn
    #Print formatting!
    print (surn , other, payR, dep, sal)
The problem with this is that it prints a list of 'other' within the string like this:
Jones ['David', 'Peter'] 84921 Python 63120
But I want it more like this:
Jones David Peter 84921 Python 63120
So that it is ready for formatting into something like this:
Jones, David Peter 84921 Python £63120
Am I going about this the right way and how do I stop the list appearing within the string?
You're close. Change this line (which does nothing):
payR, dep, sal, " ".join(other), surn
to
other = " ".join(other)
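Put together, a version with that fix applied might look like the sketch below. The name str_format, the sample tuple, and the choice to return the string instead of printing it are assumptions for illustration; the output layout follows the target line in the question.

```python
def str_format(record):
    """Rearrange (payroll, dept, salary, other names, surname) for printing."""
    # Join to one string, then split again so "other" may hold 1 or 2 names
    pay_r, dep, sal, *other, surn = " ".join(record).split()
    other = " ".join(other)  # list of names -> single string
    return f"{surn}, {other} {pay_r} {dep} £{sal}"


# Hypothetical input tuple matching the output shown in the question
print(str_format(("84921", "Python", "63120", "David Peter", "Jones")))
# Jones, David Peter 84921 Python £63120
```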

Python matching regex multiple times in a row (not the findall way)

This question is not asking about finding 'a' multiple times in a string etc.
What I would like to do is match:
[ a-zA-Z0-9]{1,3}\.
multiple times in a row. One way of doing this is with |:
'[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.|[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.|[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.'
so this matches the regexp 4 or 3 or 2 times.
Matches stuff like:
a. v. b.
m a.b.
Is there any way to write this more compactly?
I tried doing
([ a-zA-Z0-9]{1,3}\.){2,4}
but it does not behave the way I expected. This one matches:
regex.findall(string)
[u' b.', u'b.']
string is:
a. v. b. split them a.b. split somethinf words. THen we say some more words, like ten
Is there any way to do this? The goal is to match possible English abbreviations and names like Mary J. E., things that the sentence tokenizer treats as sentence punctuation but that are not.
I want to match all of this:
U.S. , c.v.a.b. , a. v. p.
First of all, your regex does match the way you expect:
>>> s="aa2.jhf.jev.d23.llo."
>>> import re
>>> re.search(r'([ a-zA-Z0-9]{1,3}\.){2,4}',s).group(0)
'aa2.jhf.jev.d23.'
But findall returns the contents of the capture group rather than the whole match, so if you want to get substrings like U.S. , c.v.a.b. , a. v. p. you need to put the whole regex in a capture group:
>>> s = 'a. v. b. split them a.b. split somethinf words. THen we say some more'
>>> re.findall(r'(([ a-zA-Z0-9]{1,3}\.){2,4})',s)
[('a. v. b.', ' b.'), ('m a.b.', 'b.')]
Then use a list comprehension to take the first element of each tuple:
>>> [i[0] for i in re.findall(r'(([ a-zA-Z0-9]{1,3}\.){2,4})',s)]
['a. v. b.', 'm a.b.']

Regex to help split up list into two-tuples

Given a list of actors, with their character names in brackets, separated by either a semicolon (;) or a comma (,):
Shelley Winters [Ruby]; Millicent Martin [Siddie]; Julia Foster [Gilda];
Jane Asher [Annie]; Shirley Ann Field [Carla]; Vivien Merchant [Lily];
Eleanor Bron [Woman Doctor], Denholm Elliott [Mr. Smith; abortionist];
Alfie Bass [Harry]
How would I parse this into a list of two-tuples in the form [(actor, character), ...]?
--> [('Shelley Winters', 'Ruby'), ('Millicent Martin', 'Siddie'),
('Denholm Elliott', 'Mr. Smith; abortionist')]
I originally had:
actors = [item.strip().rstrip(']') for item in re.split(r'\[|,|;', data['actors'])]
data['actors'] = [(actors[i], actors[i + 1]) for i in range(0, len(actors), 2)]
But this doesn't quite work, as it also splits up items within brackets.
You can go with something like:
>>> re.findall(r'(\w[\w\s\.]+?)\s*\[([\w\s;\.,]+)\][,;\s$]*', s)
[('Shelley Winters', 'Ruby'),
('Millicent Martin', 'Siddie'),
('Julia Foster', 'Gilda'),
('Jane Asher', 'Annie'),
('Shirley Ann Field', 'Carla'),
('Vivien Merchant', 'Lily'),
('Eleanor Bron', 'Woman Doctor'),
('Denholm Elliott', 'Mr. Smith; abortionist'),
('Alfie Bass', 'Harry')]
One can also simplify some things with .*?:
re.findall(r'(\w.*?)\s*\[(.*?)\][,;\s$]*', s)
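For reference, here is a self-contained run of the simplified pattern on the sample data from the question. It assumes the cast list is available as a single string (joined here with spaces instead of line breaks); the names s and pairs are just for illustration.

```python
import re

# Sample cast list from the question, as one string
s = ("Shelley Winters [Ruby]; Millicent Martin [Siddie]; Julia Foster [Gilda]; "
     "Jane Asher [Annie]; Shirley Ann Field [Carla]; Vivien Merchant [Lily]; "
     "Eleanor Bron [Woman Doctor], Denholm Elliott [Mr. Smith; abortionist]; "
     "Alfie Bass [Harry]")

# Group 1: actor name (lazy, up to the opening bracket)
# Group 2: everything inside the brackets, including any ; or ,
pairs = re.findall(r'(\w.*?)\s*\[(.*?)\][,;\s$]*', s)

for actor, character in pairs:
    print(f"{actor} -> {character}")
```

Because the lazy (.*?) stops at the first closing bracket, separators inside the brackets (as in "Mr. Smith; abortionist") stay part of the character name.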
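inputData = inputData.replace("];", "\n")
inputData = inputData.replace("],", "\n")
inputData = inputData[:-1]  # drop the trailing "]" of the last entry
actorList = []
dataList = []
for line in inputData.split("\n"):
    # partition returns (before, separator, after); strip stray spaces from the name
    actorList.append(line.partition("[")[0].strip())
    dataList.append(line.partition("[")[2])
togetherList = list(zip(actorList, dataList))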
This is a bit of a hack, and I'm sure you can clean it up from here. I'll walk through this approach just to make sure you understand what I'm doing.
I am replacing each ]; and ], with a newline, which I will later use to split every pair onto its own line. Assuming your content isn't filled with stray ]; or ], sequences, this should work. However, you'll notice the last entry still has a ] at the end because it is not followed by a comma or semicolon. Thus, I slice it off with inputData[:-1].
Then, using the partition function on each line that we created within your input string, we assign the left part to the actor list and the right part to the data list, and ignore the bracket (which is at position 1).
After that, Python's very useful zip function finishes the job for us by pairing the ith element of each list into a list of matched tuples.
