How can I split concatenated strings that contain no delimiters in python? - python

Let's say I have a list of concatenated firstname + lastname combinations like this:
["samsmith","sallyfrank","jamesandrews"]
I also have lists possible_firstnames and possible_lastnames.
If I want to split those full name strings based on values that appear in possible_firstnames and possible_lastnames, what is the best way of doing so?
My initial strategy was to compare characters between full name strings and each possible_firstnames/possible_lastnames value one by one, where I would split the full name string on discovery of a match. However, I realize that I would encounter a problem if, for example, "Sal" was included as a possible first name (my code would try to turn "sallyfrank" into "Sal Lyfrank" etc).
My next step would be to crosscheck what remains in the string after "sal" to values in possible_lastnames before finalizing the split, but this is starting to approach the convoluted and so I am left wondering if there is perhaps a much simpler option that I have been overlooking from the very beginning?
The language that I am working in is Python.

If you are getting similar names, like sam, samantha and saman, put them in reverse order so that the shortest is last
full_names = ["samsmith","sallyfrank","jamesandrews", "samanthasang", "samantorres"]
first_name = ["sally","james", "samantha", "saman", "sam"]
matches = []
for name in full_names:
for first in first_name:
if name.startswith(first):
matches.append(f'{first} {name[len(first):]}')
break
print(*matches, sep='\n')
Result
sam smith
sally frank
james andrews
samantha sang
saman torres
This won't pick out a name like Sam Antony. It would show this as *Saman Tony", in which case, your last name idea would work.
It also won't pick out Sam Anthanei. This could be Samantha Nei, Saman Thanei or Sam Anthanei if all three surnames were in your surname list.

Is this what u wanted
names = ["samsmith","sallyfrank","jamesandrews"]
pos_fname = ["sally","james"]
pos_lname = ["smith","frank"]
matches = []
for i in names:
for n in pos_fname:
if i.startswith(n):
break
else:
continue
for n in pos_lname:
if i.endswith(n):
matches.append(f"{i[:-len(n)].upper()} {n.upper()}")
break
else:
continue
print(matches)

Related

Search for terms in list within dataframe column, add terms found to new column

Example dataframe:
data = pd.DataFrame({'Name': ['Nick', 'Matthew', 'Paul'],
'Text': ["Lived in Norway, England, Spain and Germany with his car",
"Used his bikes in England. Loved his bike",
"Lived in Alaska"]})
Example list:
example_list = ["England", "Bike"]
What I need
I want to create a new column, called x, where if a term from example_list is found as a string/substring in data.Text (case insensitive), it adds the word it was found from to the new column.
Output
So in row 1, the word England was found and returned, and bike was found and returned, as well as bikes (which bike was a substring of).
Progress so far:
I have managed - with the following code - to return terms that match the terms regardless of case, however it wont find substrings... e.g. if search for "bike", and it finds "bikes", I want it to return "bikes".
pattern = fr'({"|".join(example_list)})'
data['Text'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(", ")
I think I might have found a solution for your pattern there:
pattern = fr'({"|".join("[a-zA-Z]*" + ex + "[a-zA-Z]*" for ex in example_list)})'
data['x'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(",")
Basically what I do is, I extend the pattern by optionally allowing letters before the (I think you don't explicitly mention this, maybe this has to be omitted) and after the word.
As an output I get the following:
I'm just not so sure, in which format you want this x-column. In your code you join it via commas (which I followed here) but in the picture you only have a list of the values. If you specify this, I could update my solution.

How to find required word in novel in python?

I have a text and I have got a task in python with reading module:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have ideas about how can it be improved? (It gives me error after some words, I guess error happens due to the reason that one of the Mr. is at the end of the line.)
orig_text= open('emma.txt', encoding = 'UTF-8')
lines= orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
wordsdirty = line.split()
try:
print (wordsdirty[wordsdirty.index('Mr.') + 1])
except ValueError:
continue
Try this:
text = "When did Mr. Churchill told Mr. James Brown about the fish"
m = [x[0] for x in re.findall('(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue simply read the entire file:
text = file.read()
Then, to count the occurrences, simply run:
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
This can be easily done using regex and capturing group.
Take a look here for reference, in this scenario you might want to do something like
# retrieve a list of strings that match your regex
matches = re.findall("Mr\. ([a-zA-Z]+)", your_entire_file) # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it to memory, take a look at this question

Is there a way to combine multiple strings using Regex?

Having an issue with Regex and not really understanding its usefulness right now.
Trying to extrapolate data from a file. file consists of first name, last name, grade
File:
Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B
Opening file code:
##Regex Code r'([A-Za-z]+)(: B)
regcode = r'([A-Za-z]+)(: B)'
answer=re.findall(regcode,file)
return answer
The expected result is first name last name. The given result is last name and letter grade. How do I just get the first name and last name for all B grades?
Since you must use regex for this task, here's a simple regex solution that returns the full name:
'(.*): B'
Which works in this case because:
(.*) returns all text up to a match of : B
Click here to see my test and matching output. I recommend this site for your regex testing needs.
You can do it without regex:
students = '''Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B'''
for x in students.split('\n'):
string = x.split(': ')
if string[1] == 'B':
print(string[0])
# Robert Right
# Jim Jim
or
[x[0:-3] for x in students.split('\n') if x[-1] == 'B']
If a regex solution is required (I perosnally like the solution of Roman Zhak more), put inside a group what you are interested in, i.e. the first name and the second name. Follows colon and B:
import re
file = """
Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B
"""
regcode = r'([A-Za-z]+) ([A-Za-z]+): B'
answer=re.findall(regcode,file,re.)
print(answer) # [('Robert', 'Right'), ('Jim', 'Jim')]
Add a capturing group ('()') to your expression. Everything outside the group will be ignored, even if it matches the expression.
re.findall('(\w+\s+\w+):\s+B', file)
#['Robert Right', 'Jim Jim']
'\w' is any alphanumeric character, '\s' is any space-like character.
You can add two groups, one for the first name and one for the last name:
re.findall('(\w+)\s+(\w+):\s+B', data)
#[('Robert', 'Right'), ('Jim', 'Jim')]
The latter will not work if there are more than two names on one line.

input value same as predefined list but getting a different output

Ive written a program which takes in the name and age of multiple entries seperated by a comma and then sepearates the aplhabets from the numerics and then compares the name with a pre defined set/list.
If the entry doesnt match with the pre defined data, the program sends a message"incorrect entry" along with the element which didnt match.
heres the code:
from string import digits
print("enter name and age")
order=input("Seperate entries using a comma ',':")
order1=order.strip()
order2=order1.replace(" ","")
order_sep=order2.split()
removed_digits=str.maketrans('','',digits)
names=order.translate(removed_digits)
print(names)
names1=names.split(',')
names_list=['abby','chris','john','cena']
names_list=set(names_list)
for name in names1:
if name not in names_list:
print(f"{name}:doesnt match with predefined data")
the problem im having is even when i enter chris or john, the program treats them as they dont belong to the pre defined list
sample input : ravi 19,chris 20
output:ravi ,chris
ravi :doesnt match with predefined data
chris :doesnt match with predefined data
also i have another issue , ive written a part to eliminate whitespace but i dont know why, it doesnt elimintae them
sample input:ravi , chris
ravi :doesnt match with predefined data
()chris :doesnt match with predefined data
theres a space where ive put parenthesis.
any suggestion to tackle this problem and/or improve this code is appreciated!
I think some of the parts can be simplified, especially when removing the digits. As long as the input is entered with a space between the name and age, you can use split() twice. First to separate the entries with split(',') and next to separate out the ages with split(). It makes comparisons easier later if you store the names by themselves with no punctuation or whitespace around them. To print the names out from an iterable, you can use the str.join() function. Here is an example:
print("enter name and age")
order = input("Seperate entries using a comma ',': ")
names1 = [x.split()[0] for x in order.split(',')]
print(', '.join(names1))
names_list=['abby', 'chris', 'john', 'cena']
for name in names1:
if name not in names_list:
print(f"{name}:doesnt match with predefined data")
This will give the desired output:
enter name and age
Seperate entries using a comma ',': ravi 19, chris 20
ravi, chris
ravi:doesnt match with predefined data

Python - Splitting a string of names separated by spaces and commas

I have an API feeding into my program into a django many to many model field. The names of the individuals within my database are structured with a separated first name and last name. However, the API is sending a bulk list of names structured as as a string list as so: "Jones, Bob Smith, Jason Donald, Mic" (Last name-comma-space-first name-space-new last name- etc.)
How would I separate this string in a way that would allow me to filter and add a particular user to the many-to-many field?
Thanks!!
This answer excludes the case where a first name or last name contains space (this case is much more complicated as you will have a word with a space on his left AND on his right).
You need to replace the -comma-space- by something without a space (because you also have a space between two different names).
string = "Jones, Bob Smith, Jason Donald, Mic"
names = []
for name in string.replace(', ', ',').split(' '):
name = name.split(',')
last_name = name[0]
first_name = name[1]
names.append((last_name, first_name))
names
Output:
[('Jones', 'Bob'), ('Smith', 'Jason'), ('Donald', 'Mic')]
You can use regex:
s = "Jones, Bob Smith, Jason Donald, Mic"
list(re.findall(r'(\S+), (\S+)', s))
# [('Jones', 'Bob'), ('Smith', 'Jason'), ('Donald', 'Mic')]

Categories

Resources