How can I extract names from a concatenated string using Python?

Suppose I have a string of concatenated names like so:
names = 'johnwilliamsfrankbrown'
How do I go from here to a list of names and surnames ["john", "williams", "frank", "brown"]?
So far I have only found pieces of code to extract words from non-concatenated strings.

As timgeb noted in the comments, this is only possible if you already know which names you expect. Assuming that you have this information, you can extract them like this:
>>> import re
>>> names = ['john', 'frank', 'brown', 'williams']
>>> regex = '(' + '|'.join(names) + ')'
>>> separated_names = re.findall(regex, 'johnwilliamsfrankbrown')
>>> separated_names
['john', 'williams', 'frank', 'brown']
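A slightly more defensive variant of the same idea is sketched below. It assumes the list of known names is complete; the helper name split_known_names is made up for illustration. It escapes each name, tries longer names first so that a short name does not shadow a longer one sharing its prefix, and checks that nothing in the input was skipped, since re.findall silently ignores characters it cannot match. Genuinely ambiguous inputs can still be split in an unintended way.

```python
import re

def split_known_names(text, known_names):
    """Split `text` into known names (hypothetical helper, not from the answer)."""
    # Try longer names first so "johnson" is preferred over "john".
    ordered = sorted(known_names, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in ordered)
    found = re.findall(pattern, text)
    # findall skips unmatched characters, so verify the pieces cover the input.
    if "".join(found) != text:
        raise ValueError("text contains parts that are not known names")
    return found

print(split_known_names("johnwilliamsfrankbrown",
                        ["john", "frank", "brown", "williams"]))
```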

Related

Get proper list from list of unicode list

I have a list with a unicode string in a form of a list.
my_list = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
I want a list which I am able to iterate such as;
name_list = [James, Williams, Kevin, Parker, Alex, Emma, Katie, Annie]
I have tried several possible solutions given here, but none of them worked in my case.
# Tried
name_list = name_list.encode('ascii', 'ignore').decode('utf-8')
#Gives unicode return type
# Tried
ast.literal_eval(name_list)
#Gives me invalid token error

Firstly, a list does not have an encode method; you have to apply string methods to the items in the list.
Secondly, to normalize the string you can use the normalize function from Python's unicodedata library. With the NFKD form, the no-break space '\xa0' becomes an ordinary space, which strip() then removes, and other compatibility characters are normalized as well.
Then, instead of using eval, which is generally unsafe, use a list comprehension to build a list:
import unicodedata
li = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
inner_li = unicodedata.normalize("NFKD", li[0]) #<--- notice the list selection
#get only part of the string you want to convert into a list
new_li = [i.strip() for i in inner_li[1:-1].split(',')]
new_li
>> ['James', 'Williams', 'Kevin', 'Parker', 'Alex', 'Emma', 'Katie', 'Annie']
Note that your expected output is written as a list of bare variable names, which raises a NameError unless those names were defined beforehand. The result here is a list of strings.
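The same steps can be wrapped in a small function. This is only a sketch under the assumption that the input always looks like a bracketed, comma-separated list; parse_bracketed_names is a hypothetical name, not part of any library.

```python
import unicodedata

def parse_bracketed_names(text):
    """Turn u'[A, B\xa0, C]' into ['A', 'B', 'C'] (hypothetical helper)."""
    # NFKD maps the no-break space U+00A0 to an ordinary space.
    inner = unicodedata.normalize("NFKD", text).strip()
    # Drop the surrounding brackets only if both are present.
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    # Strip each piece and skip empty ones (e.g. from a trailing comma).
    return [part.strip() for part in inner.split(",") if part.strip()]

my_list = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
print(parse_bracketed_names(my_list[0]))
```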

This is a good application for regular expressions:
import re
body = re.findall(r"\[\s*(.+)\s*]", my_list[0])[0] # extract the stuff in []s
names = re.split(r"\s*,\s*", body) # extract the names
#['James', 'Williams', 'Kevin', 'Parker', 'Alex', 'Emma', 'Katie', 'Annie']

import unicodedata
lst = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
lst = unicodedata.normalize("NFKD", lst[0])
lst2 = lst[1:-1].split(", ") # remove open and close brackets
print(lst2)
output will be:
["James", "Williams", "Kevin", "Parker", "Alex", "Emma", "Katie ", "Annie"]
If you want to remove all leading and trailing whitespace:
lst3 = [i.strip() for i in lst2]
print(lst3)
output will be:
["James", "Williams", "Kevin", "Parker", "Alex", "Emma", "Katie", "Annie"]

Checking the elements of a list for multiple strings

Say I have a list:
['[name]\n', 'first_name,jane\n', 'middle_name,anna\n', 'last_name,doe\n', '[age]\n', 'age,30\n', 'dob,1/1/1988\n']
How could I check if the strings 'jane', 'anna' and 'doe' are ALL contained in elements of the list?

For each name, you can use any to see if it is contained in any of the strings in the list, then use all to make sure this is true for every name:
>>> data = ['[name]\n', 'first_name,jane\n', 'middle_name,anna\n', 'last_name,doe\n', '[age]\n', 'age,30\n', 'dob,1/1/1988\n']
>>> names = ['jane', 'anna', 'doe']
>>> all(any(name in sub for sub in data) for name in names)
True
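The check above is a substring test, so a name like 'jan' would also count as present. If exact values matter, one possible approach is to parse the "key,value" lines first. The sketch below assumes every data line has the form key,value and that section headers such as [name] contain no comma; values_in is a made-up helper name.

```python
def values_in(data):
    """Collect the value part of every 'key,value' line (hypothetical helper)."""
    values = set()
    for line in data:
        key, sep, value = line.strip().partition(",")
        if sep:  # section headers like '[name]' have no comma and are skipped
            values.add(value)
    return values

data = ['[name]\n', 'first_name,jane\n', 'middle_name,anna\n',
        'last_name,doe\n', '[age]\n', 'age,30\n', 'dob,1/1/1988\n']
names = ['jane', 'anna', 'doe']

# True only if every name appears as an exact value.
print(set(names) <= values_in(data))
```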

Split string with multiple separators from an array (Python)

Given an array of separators:
columns = ["Name:", "ID:", "Date:", "Building:", "Room:", "Notes:"]
and a string where some columns were left blank (and there is random white space):
input = "Name: JohnID:123:45Date: 8/2/17Building:Room:Notes: i love notes"
How can I get this:
["John", "123:45", "8/2/17", "", "", "i love notes"]
I've tried simply removing the substrings to see where I can go from there but I'm still stuck
import re
input = re.sub(r'|'.join(map(re.escape, columns)), "", input)

Use the list to generate a regular expression by inserting (.*) after each column name, then use strip to remove spaces:
import re
columns = ["Name:", "ID:", "Date:", "Building:", "Room:", "Notes:"]
s = "Name: JohnID:123:45Date: 8/2/17Building:Room:Notes: i love notes"
result = [x.strip() for x in re.match("".join(map("{}(.*)".format,columns)),s).groups()]
print(result)
yields:
['John', '123:45', '8/2/17', '', '', 'i love notes']
The stripping can instead be handled by the regular expression itself, at the cost of a more complex pattern but a simpler overall expression. Note that the greedy (.*) still captures any whitespace that precedes the next column name, so only leading whitespace is removed:
result = re.match("".join(map(r"{}\s*(.*)\s*".format,columns)),s).groups()
More robust: if the column names contain regex special characters, they have to be escaped (not the case here):
result = re.match("".join([r"{}\s*(.*)\s*".format(re.escape(x)) for x in columns]),s).groups()
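If trailing whitespace should be removed by the pattern as well, one option is to make the groups lazy and anchor the end of the string. The following is a minimal sketch, assuming every column name occurs exactly once and in the given order; split_fields is a hypothetical helper name.

```python
import re

def split_fields(text, columns):
    """Return the value after each column name, stripped (hypothetical helper)."""
    # Lazy groups (.*?) stop as early as possible, so the \s* after each
    # group absorbs trailing whitespace instead of the group itself.
    pattern = "".join(r"{}\s*(.*?)\s*".format(re.escape(c)) for c in columns) + "$"
    match = re.match(pattern, text)
    if match is None:
        raise ValueError("text does not contain all columns in order")
    return list(match.groups())

columns = ["Name:", "ID:", "Date:", "Building:", "Room:", "Notes:"]
s = "Name: JohnID:123:45Date: 8/2/17Building:Room:Notes: i love notes"
print(split_fields(s, columns))
```

Without the final "$" the last lazy group would match the empty string, which is why the anchor is needed.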

How about using re.split?
>>> import re
>>> columns = ["Name:", "ID:", "Date:", "Building:", "Room:", "Notes:"]
>>> i = "Name: JohnID:123:45Date: 8/2/17Building:Room:Notes: i love notes"
>>> re.split('|'.join(map(re.escape, columns)), i)
['', ' John', '123:45', ' 8/2/17', '', '', ' i love notes']
To get rid of the whitespace, split on whitespace too:
>>> re.split(r'\s*' + (r'\s*|\s*'.join(map(re.escape, columns))) + r'\s*', i.strip())
['', 'John', '123:45', '8/2/17', '', '', 'i love notes']
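The leading empty string comes from the text before the first separator. A possible follow-up, sketched here under the assumption that all columns are present and appear in order, is to drop it and pair each value with its column name; fields_by_column is a made-up helper name.

```python
import re

def fields_by_column(text, columns):
    """Map each column name to its value (hypothetical helper)."""
    # Each alternative swallows the whitespace around one column name.
    separator = "|".join(r"\s*{}\s*".format(re.escape(c)) for c in columns)
    parts = re.split(separator, text.strip())
    # parts[0] is whatever precedes the first column name (empty here).
    return dict(zip(columns, parts[1:]))

columns = ["Name:", "ID:", "Date:", "Building:", "Room:", "Notes:"]
i = "Name: JohnID:123:45Date: 8/2/17Building:Room:Notes: i love notes"
print(fields_by_column(i, columns))
```

If a column were missing or out of order, the values would be paired with the wrong names, so this only suits input with a fixed layout.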

Python LOB to List

Using:
cur.execute(SQL)
response = cur.fetchall()  # response[0][0] is a LOB object
names = response[0][0].read()
I have the following SQL response as the string names:
'Mike':'Mike'
'John':'John'
'Mike/B':'Mike/B'
As you can see, it comes formatted. It is actually formatted like \\'Mike\\':\\'Mike\\'\n\\'John\\'... and so on.
I want to check whether, for example, Mike is in the list at least once (I don't care how many times, only that it appears at least once).
I would like to have something like this:
l = ['Mike', 'Mike', 'John', 'John', 'Mike/B', 'Mike/B']
so I could simply iterate over the list and ask:
for name in l:
    if 'Mike' == name:
        pass  # do something
Any ideas how I could do that?
Many thanks
Edit:
When I do:
list = names.split()
I receive a list that is nearly what I want, but the elements inside still look like this:
list = ['\\'Mike\\':\\'Mike\\", ...]

names = ['\\'Mike\\':\\'Mike\\", ...]
for name in names:
    if "Mike" in name:
        print("Mike is here")
The \\' business is caused by MySQL escaping the '.
If you have a list of names, try this:
my_names = ["Tom", "Dick", "Harry"]
names = ['\\'Mike\\':\\'Mike\\", ...]
for name in names:
    for my_name in my_names:
        if my_name in name:
            print(my_name, "is here")

import re
pattern = re.compile(r"[\n\\:']+")
list_of_names = pattern.split(names)
# ['', 'Mike', 'Mike', 'John', 'John', 'Mike/B', 'Mike/B', '']
# Quick tip: avoid naming a list "list", since that shadows the built-in type
You can keep your results this way or do a final cleanup to remove empty strings:
clean_list = list(filter(lambda x: x!='', list_of_names))
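Since the question only asks whether a name occurs at least once, a set may be more convenient than a list. The sketch below assumes the raw text has the escaped format described in the question; the sample string and the helper unique_names are illustrative, not actual database output.

```python
import re

def unique_names(raw):
    """Return the distinct names found in the raw LOB text (hypothetical helper)."""
    # Split on runs of newlines, backslashes, colons and single quotes.
    parts = re.split(r"[\n\\:']+", raw)
    # A set drops duplicates; the condition drops empty strings.
    return {part for part in parts if part}

# Sample input modelled on the format shown in the question (an assumption).
raw = "\\'Mike\\':\\'Mike\\'\n\\'John\\':\\'John\\'\n\\'Mike/B\\':\\'Mike/B\\'"
found = unique_names(raw)
print("Mike" in found)
```

Membership tests on a set match whole names only, so 'Mike' does not match 'Mike/B'.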

Regex extract element after string

If I have a string s = "Name: John, Name: Abby, Name: Kate". How do I extract everything in between Name: and ,. So I'd want to have an array a = John, Abby, Kate
Thanks!

No need for a regex:
>>> s = "Name: John, Name: Abby, Name: Kate"
>>> [x[len('Name: '):] for x in s.split(', ')]
['John', 'Abby', 'Kate']
Or even:
>>> prefix = 'Name: '
>>> s[len(prefix):].split(', ' + prefix)
['John', 'Abby', 'Kate']
Now if you still think a regex is more appropriate:
>>> import re
>>> re.findall(r'Name:\s+([^,]*)', s)
['John', 'Abby', 'Kate']

The interesting question is how you would choose among the many ways to do this in Python. The answer using "split" is nice if you're confident that the format will be exact. If you would like some protection from minor format changes, a regular expression might be useful. You should think through what parts of the format are most likely to be stable, and capture those in your regular expression, while leaving flexibility for the others. Here is an example that assumes that the names are alphabetic, and that the word "Name" and the colon are stable:
import re
s = "Name: John, Name: Abby, Name: Kate"
names = [i.group(1) for i in re.finditer(r"Name:\s+([A-Za-z]*)", s)]
print(names)
You might instead want to allow for hyphens or other characters inside a name; you can do so by changing the text inside [A-Za-z].
A good page about Python regular expressions with lots of examples is http://docs.python.org/howto/regex.html.
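As an illustration of widening the character class, the sketch below also accepts hyphens and apostrophes inside a name and tolerates missing spaces after the colon. Which characters to allow is an assumption about the data; NAME_PATTERN and extract_names are made-up names.

```python
import re

# A letter, followed by any mix of letters, apostrophes and hyphens.
# The hyphen is placed last in the class so it is taken literally.
NAME_PATTERN = re.compile(r"Name:\s*([A-Za-z][A-Za-z'-]*)")

def extract_names(text):
    """Return every name that follows a 'Name:' label (hypothetical helper)."""
    return NAME_PATTERN.findall(text)

print(extract_names("Name: John, Name: Abby, Name: Kate"))
print(extract_names("Name: Mary-Jane, Name: O'Neil,Name:Kate"))
```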

A few more ways to do it:
>>> s
'Name: John, Name: Abby, Name: Kate'
Method 1:
>>> [x.strip(' ,') for x in s.split("Name:")[1:]]
['John', 'Abby', 'Kate']
Method 2:
>>> [x.rsplit(":",1)[-1].strip() for x in s.split(",")]
['John', 'Abby', 'Kate']
Method 3:
>>> import re
>>> [x.strip() for x in re.findall(":([^,]*)",s)]
['John', 'Abby', 'Kate']
Method 4:
>>> [x.strip() for x in s.replace('Name:','').split(',')]
['John', 'Abby', 'Kate']
Also note that I consistently applied strip, which makes sense if there can be multiple spaces between the 'Name:' token and the actual name.
Methods 2 and 3 can be used in a more generalized way.
