Is there a way to combine multiple strings using Regex? - python

Having an issue with Regex and not really understanding its usefulness right now.
Trying to extract data from a file. The file consists of first name, last name, and grade.
File:
Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B
Opening file code:
## Regex code: r'([A-Za-z]+)(: B)'
regcode = r'([A-Za-z]+)(: B)'
answer=re.findall(regcode,file)
return answer
The expected result is first name last name. The given result is last name and letter grade. How do I just get the first name and last name for all B grades?

Since you must use regex for this task, here's a simple regex solution that returns the full name:
'(.*): B'
Which works in this case because:
(.*) returns all text up to a match of : B
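As a quick check, the pattern can be run against the sample data from the question. This is only a minimal sketch: the names `text`, `names` and `strict_names` are illustrative, and the anchored variant with re.MULTILINE is an optional tightening that is not part of the original answer.

```python
import re

text = """Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B"""

# (.*) captures everything on the line that precedes ": B"
names = re.findall(r'(.*): B', text)
print(names)  # ['Robert Right', 'Jim Jim']

# Optional: anchor to whole lines so a grade such as "B+" is not matched
strict_names = re.findall(r'^(.*): B$', text, flags=re.MULTILINE)
print(strict_names)  # ['Robert Right', 'Jim Jim']
```

Because `.` does not match a newline by default, each match stays within a single line.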

You can do it without regex:
students = '''Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B'''
for x in students.split('\n'):
    string = x.split(': ')
    if string[1] == 'B':
        print(string[0])
# Robert Right
# Jim Jim
or
[x[0:-3] for x in students.split('\n') if x[-1] == 'B']

If a regex solution is required (I personally prefer Roman Zhak's solution), put what you are interested in, i.e. the first name and the last name, inside groups, followed by the colon and B:
import re
file = """
Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B
"""
regcode = r'([A-Za-z]+) ([A-Za-z]+): B'
answer = re.findall(regcode, file)
print(answer) # [('Robert', 'Right'), ('Jim', 'Jim')]

Add a capturing group ('()') to your expression. re.findall() then returns only the text captured by the group; everything outside the group must still match but is left out of the result.
re.findall(r'(\w+\s+\w+):\s+B', file)
#['Robert Right', 'Jim Jim']
'\w' is any alphanumeric character or underscore, '\s' is any whitespace character.
You can add two groups, one for the first name and one for the last name:
re.findall(r'(\w+)\s+(\w+):\s+B', file)
#[('Robert', 'Right'), ('Jim', 'Jim')]
Both patterns assume exactly two names per line; with more than two, only the last two are captured.
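To illustrate the limitation with more than two names on a line, here is a minimal sketch. The sample string `data` and the alternative pattern in `full_names` are assumptions made for illustration, not part of the original answer.

```python
import re

# Hypothetical input with a three-word name
data = "Mary Ann Thomas: B\nJim Jim: B"

# The two-group pattern silently drops the first word of the longer name
two_groups = re.findall(r'(\w+)\s+(\w+):\s+B', data)
print(two_groups)  # [('Ann', 'Thomas'), ('Jim', 'Jim')]

# One possible alternative: capture everything before the colon on each line
full_names = re.findall(r'^([\w ]+):\s+B$', data, flags=re.MULTILINE)
print(full_names)  # ['Mary Ann Thomas', 'Jim Jim']
```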

Related

Extract a substring from a column and replace column data frame

I need some help extracting a substring from a column in my data frame and then replacing that column with the substring. I was also wondering which would perform better: stripping the string with plain Python string methods, or using a regular expression to replace the string with the substring.
The string looks something like this in the column:
Person
------
<Person 1234567 Tom Brady>
<Person 456789012 Mary Ann Thomas>
<Person 92145 John Smith>
What I would like is this:
Person
------
Tom Brady
Mary Ann Thomas
John Smith
What I have so far as far as regular expressions go is this:
/^([^.]+[.]+[^.]+)[.]/g
And that just gets this part '<Person 1234567 ', not sure how to get the '>' from the end.
Multiple ways, but you can use str.replace():
import pandas as pd
df = pd.DataFrame({'Person': ['<Person 1234567 Tom Brady>',
'<Person 456789012 Mary Ann Thomas>',
'<Person 92145 John Smith>']})
df['Person'] = df['Person'].str.replace(r'(?:<Person[\d\s]+|>)', '', regex=True)
print(df)
Prints:
Person
0 Tom Brady
1 Mary Ann Thomas
2 John Smith
Pattern used: (?:<Person[\d\s]+|>), which breaks down as:
(?: - Open non-capture group for alternation;
<Person[\d\s]+ - Match literal '<Person' followed by 1+ whitespace characters or digits;
| - Or;
> - A literal '>'
) - Close group.
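The same pattern can be checked without pandas by applying it to plain strings with re.sub. This is a minimal sketch using the rows from the question; the names `PATTERN`, `rows` and `cleaned` are illustrative.

```python
import re

# Same alternation as above: the "<Person 123 " prefix OR the closing ">"
PATTERN = re.compile(r'(?:<Person[\d\s]+|>)')

rows = [
    '<Person 1234567 Tom Brady>',
    '<Person 456789012 Mary Ann Thomas>',
    '<Person 92145 John Smith>',
]

# Both alternatives are replaced with the empty string, leaving only the name
cleaned = [PATTERN.sub('', row) for row in rows]
print(cleaned)  # ['Tom Brady', 'Mary Ann Thomas', 'John Smith']
```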
To keep things simple, you can first pull out every run of characters that is not a digit, a hyphen, or an angle bracket:
res = re.findall(r"[^<>0-9-]+", string)
res[1].strip()
For '<Person 1234567 Tom Brady>' this returns the list ['Person ', ' Tom Brady'], so the name is res[1]; strip() removes the surrounding whitespace.
You can read more about re.findall() in the documentation.
Python regex has a function called search that finds the matching pattern in a string. With the examples given, you can use regex to extract the names with:
import re
s = "<Person 1234567 John Smith>"
re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", s).group(0)
>>> 'John Smith'
The regular expression [A-Z][a-z]+(\s[A-Z][a-z]+)+ is just matching the names (Tom Brady, Mary Ann Thomas, etc.)
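I like to use pandas' apply function to run an operation on each row, so the final result would look like this:
import re
import pandas as pd

def extract_name(row):
    # Replace the raw value with the first run of capitalized words
    row["Person"] = re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", row["Person"]).group(0)
    return row

# Sample rows from the question; substitute your own DataFrame here
df = pd.DataFrame({"Person": ["<Person 1234567 Tom Brady>", "<Person 92145 John Smith>"]})
df2 = df.apply(extract_name, axis=1)
and df2 has the Person column with the extracted names.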
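One caveat with this approach: re.search returns None when a row contains no match, so calling .group(0) directly would raise an AttributeError. A minimal sketch of a more defensive helper follows; the name `extract_name_safe` and the choice to return None for non-matching rows are assumptions, not part of the original answer.

```python
import re

# Non-capturing group, since only the whole match is needed
NAME_RE = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)+")

def extract_name_safe(text):
    """Return the first run of capitalized words, or None if there is none."""
    match = NAME_RE.search(text)
    return match.group(0) if match else None

print(extract_name_safe("<Person 456789012 Mary Ann Thomas>"))  # Mary Ann Thomas
print(extract_name_safe("no name here"))                        # None
```

A function of this shape takes a single string, so it could be applied to one column with df["Person"].map(extract_name_safe) instead of a row-wise apply.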

New to Python: How to keep the first letter of each word capitalized?

I was practicing with this tiny program with the hopes to capitalize the first letter of each word in: john Smith.
I wanted to capitalize the j in john so I would have an end result of John Smith and this is the code I used:
name = "john Smith"
if (name[0].islower()):
    name = name.capitalize()
print(name)
Though, capitalizing the first letter caused an output of: John smith where the S was converted to a lowercase. How can I capitalize the letter j without messing with the rest of the name?
I thank you all for your time and future responses!
I appreciate it very much!!!
As @j1-lee pointed out, what you are looking for is the title method, which capitalizes each word (as opposed to capitalize, which capitalizes only the first character of the string and lowercases the rest, as if it were a sentence).
So your code becomes
name = "john smith"
name = name.title()
print(name) #> John Smith
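Note that title() also lowercases every letter after the first one in each word. If the goal is literally to uppercase the first character and leave the rest of the string untouched, slicing is one option. This is a minimal sketch; the variable names and the "john McDonald" example are illustrative.

```python
name = "john Smith"

# Uppercase only the first character; the rest is copied unchanged
first_only = name[:1].upper() + name[1:]
print(first_only)  # John Smith

# title() rewrites the case of every word, which can alter mixed-case names
mixed = "john McDonald".title()
print(mixed)  # John Mcdonald
```

Using name[:1] rather than name[0] keeps the expression safe for an empty string.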
Of course you should be using str.title(). However, if you want to reinvent that functionality then you could do this:
name = 'john paul smith'
r = ' '.join(w[0].upper()+w[1:].lower() for w in name.split())
print(r)
Output:
John Paul Smith
Note:
This is not strictly equivalent to str.title(), as it collapses each run of whitespace in the original string into a single space.
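The standard library already ships a function with the same split, capitalize, and join behavior: string.capwords. A minimal sketch comparing it with str.title() on an input with extra spaces (the sample string is illustrative):

```python
import string

name = "john   paul smith"

# capwords splits on whitespace, capitalizes each word, and joins with one space
collapsed = string.capwords(name)
print(collapsed)  # John Paul Smith

# title() keeps the original whitespace intact
preserved = name.title()
print(preserved)  # John   Paul Smith
```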

How can I split concatenated strings that contain no delimiters in python?

Let's say I have a list of concatenated firstname + lastname combinations like this:
["samsmith","sallyfrank","jamesandrews"]
I also have lists possible_firstnames and possible_lastnames.
If I want to split those full name strings based on values that appear in possible_firstnames and possible_lastnames, what is the best way of doing so?
My initial strategy was to compare characters between full name strings and each possible_firstnames/possible_lastnames value one by one, where I would split the full name string on discovery of a match. However, I realize that I would encounter a problem if, for example, "Sal" was included as a possible first name (my code would try to turn "sallyfrank" into "Sal Lyfrank" etc).
My next step would be to crosscheck what remains in the string after "sal" to values in possible_lastnames before finalizing the split, but this is starting to approach the convoluted and so I am left wondering if there is perhaps a much simpler option that I have been overlooking from the very beginning?
The language that I am working in is Python.
If some first names are prefixes of others, like sam, saman and samantha, order the list from longest to shortest so that the longest match is tried first:
full_names = ["samsmith","sallyfrank","jamesandrews", "samanthasang", "samantorres"]
first_name = ["sally","james", "samantha", "saman", "sam"]
matches = []
for name in full_names:
    for first in first_name:
        if name.startswith(first):
            matches.append(f'{first} {name[len(first):]}')
            break
print(*matches, sep='\n')
Result
sam smith
sally frank
james andrews
samantha sang
saman torres
This won't pick out a name like Sam Antony. It would show this as "Saman Tony", in which case your last name idea would work.
It also won't pick out Sam Anthanei. This could be Samantha Nei, Saman Thanei or Sam Anthanei if all three surnames were in your surname list.
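The last-name cross-check mentioned in the question can be sketched by trying every split point and keeping only the splits where both halves are known names. This is a minimal sketch under that assumption; the function name `split_full_name` and the sample lists are hypothetical.

```python
def split_full_name(full, first_names, last_names):
    """Return every (first, last) split where both parts are known names."""
    firsts, lasts = set(first_names), set(last_names)
    return [
        (full[:i], full[i:])
        for i in range(1, len(full))
        if full[:i] in firsts and full[i:] in lasts
    ]

# "sal" is rejected because "lyfrank" is not a known last name
print(split_full_name("sallyfrank", ["sal", "sally"], ["frank", "smith"]))
# [('sally', 'frank')]

# Genuinely ambiguous input yields several candidates instead of a guess
print(split_full_name("samanthanei",
                      ["sam", "saman", "samantha"],
                      ["anthanei", "thanei", "nei"]))
# [('sam', 'anthanei'), ('saman', 'thanei'), ('samantha', 'nei')]
```

Returning a list leaves the decision about ambiguous names to the caller: an empty list means no valid split, and more than one entry means the lists alone cannot decide.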
Is this what you wanted?
names = ["samsmith","sallyfrank","jamesandrews"]
pos_fname = ["sally","james"]
pos_lname = ["smith","frank"]
matches = []
for i in names:
    for n in pos_fname:
        if i.startswith(n):
            break
    else:
        # No known first name matched; skip this entry
        continue
    for n in pos_lname:
        if i.endswith(n):
            matches.append(f"{i[:-len(n)].upper()} {n.upper()}")
            break
    else:
        continue
print(matches)

Python - Applying a function to separate string in column every two words

I want to add a separator (,) every two words to better delineate the full names in each row.
For example df['Names'] is currently:
John Smith David Smith Golden Brown Austin James
and I would like to be:
John Smith, David Smith, Golden Brown, Austin James
I was able to find some code which splits the string every x words which would be perfect for my purposes shown below:
def splitText(string):
    words = string.split()
    grouped_words = [' '.join(words[i: i + 2]) for i in range(0, len(words), 2)]
    return grouped_words
However I'm not sure how to apply this to the column of choice.
I tried the following:
df['Names'].apply(splitText())
This gives me a missing positional argument.
Asking for any advice on either modifying the function or my application of it to a column dataframe. I'm pretty new to this stuff so any advice would be great!
Cheers
Pass the function itself, without ():
df['Names'].apply(splitText)
This works the same as using a lambda function:
df['Names'].apply(lambda x: splitText(x))
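Note that splitText returns a list of name pairs, while the desired output in the question is a single comma-separated string. A minimal sketch of a variant that joins the pairs follows; the function name `join_pairs` and its `sep` parameter are hypothetical.

```python
def join_pairs(text, sep=", "):
    """Group words two at a time and join the groups with `sep`."""
    words = text.split()
    pairs = (" ".join(words[i:i + 2]) for i in range(0, len(words), 2))
    return sep.join(pairs)

row = "John Smith David Smith Golden Brown Austin James"
print(join_pairs(row))
# John Smith, David Smith, Golden Brown, Austin James
```

It is passed to apply the same way, without parentheses: df['Names'].apply(join_pairs).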

regex groups: How to get the desired output with a more specific match pattern?

The following input list of entries
l = ["555-8396 Neu, Allison",
"Burns, C. Montgomery",
"555-5299 Putz, Lionel",
"555-7334 Simpson, Homer Jay"]
is expected to be transformed to:
Allison Neu 555-8396
C. Montgomery Burns
Lionel Putz 555-5299
Homer Jay Simpson 555-7334
I tried the following:
for i in l:
    mo = re.search(r"([0-9]{3}-[0-9]{4})?\s*(\w*),\s*(\S.*$)", i)
    if mo:
        print("{} {} {}".format(mo.group(3), mo.group(2), mo.group(1)))
and it results in the following incorrect output (note the "None" in the second line of output)
Allison Neu 555-8396
C. Montgomery Burns None
Lionel Putz 555-5299
Homer Jay Simpson 555-7334
However the following solution mentioned in the e-book does indeed give the desired output:
for i in l:
    mo = re.search(r"([0-9-]*)\s*([A-Za-z]+),\s+(.*)", i)
    print(mo.group(3) + " " + mo.group(2) + " " + mo.group(1))
In short, it boils down to the difference in the groups() output of the 2 reg exp searches:
>>> mo = re.search(r"([0-9]{3}-[0-9]{4})?\s*(\w*),\s*(\S.*$)", "Burns, C. Montgomery")
>>> mo.groups()
(None, 'Burns', 'C. Montgomery')
versus
>>> mo = re.search(r"([0-9-]*)\s*(\w*),\s*(\S.*$)", "Burns, C. Montgomery")
>>> mo.groups()
('', 'Burns', 'C. Montgomery')
None vs ''
I wanted to do a more accurate match of the phone number format with [0-9]{3}-[0-9]{4} instead of using [0-9-]* which can match arbitrary number and - combinations (ex: "0-1-2" or "1-23").
Why does "*" result in a different grouping than "?"?
Yes, it is trivial for me to take care of the "None" while printing out the result, but I am interested to know the reason for the difference in grouping results.
((?:[0-9]{3}-[0-9]{4})?)\s*(\w*),\s*(\S.*$)
Try this. See the demo:
https://regex101.com/r/Qx6ylw/1
In the book's example the group itself was not optional, only its contents were. In your regex the whole group is optional.
Let me say in plain English what RegEx demos are hinting at and actually answer your actual question:
([0-9-]*) Matches 0 or more characters of digits or the - character. When there is no telephone present, that would be the case of matching 0 characters. But note the operative word matching, i.e. it is still a match. Thus, mo.group(1) returns ''.
([0-9]{3}-[0-9]{4})? Attempts to match a phone number in a specific format, but this match is optional. When the phone number is not present in the input, the match does not exist and thus mo.group(1) returns None.
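The difference can be seen side by side on the entry without a phone number. This is a minimal sketch; the variable names are illustrative, and the second pattern moves the `?` inside the capturing group so that the group always participates in the match.

```python
import re

entry = "Burns, C. Montgomery"

# The whole group is optional: when skipped, it never participates -> None
optional_group = re.search(r"([0-9]{3}-[0-9]{4})?\s*(\w*),\s*(\S.*$)", entry)
print(optional_group.group(1))  # None

# Only the contents are optional: the group matches the empty string -> ''
optional_contents = re.search(r"((?:[0-9]{3}-[0-9]{4})?)\s*(\w*),\s*(\S.*$)", entry)
print(repr(optional_contents.group(1)))  # ''
```

The second form keeps the strict phone-number format while still yielding an empty string when the number is absent.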
With judicious whitespace trimming, a simple find-and-replace also works:
Find: ^((?:\d+(?:-\d+)+)?)\s*([^,]*?)\s*,\s*(.*)
Replace: \3 \2 \1
https://regex101.com/r/oo0NWy/1
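In Python, this find-and-replace could be expressed with re.sub. A minimal sketch follows; the names `PATTERN` and `reorder` are hypothetical, and the final strip() is an added step that removes the trailing space left behind when no phone number is present.

```python
import re

PATTERN = re.compile(r"^((?:\d+(?:-\d+)+)?)\s*([^,]*?)\s*,\s*(.*)")

def reorder(entry):
    """Rewrite 'phone Last, First' as 'First Last phone'."""
    return PATTERN.sub(r"\3 \2 \1", entry).strip()

entries = ["555-8396 Neu, Allison",
           "Burns, C. Montgomery",
           "555-5299 Putz, Lionel",
           "555-7334 Simpson, Homer Jay"]

for entry in entries:
    print(reorder(entry))
# Allison Neu 555-8396
# C. Montgomery Burns
# Lionel Putz 555-5299
# Homer Jay Simpson 555-7334
```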
This code solves your problem:
for i in l:
    mo = re.search(r"([0-9]{3}-[0-9]{4})?\s*(\w*),\s*(\S.*$)", i)
    if mo:
        if mo.group(1):
            print("{} {} {}".format(mo.group(3), mo.group(2), mo.group(1)))
        else:
            print("{} {}".format(mo.group(3), mo.group(2)))
Output:
Allison Neu 555-8396
C. Montgomery Burns
Lionel Putz 555-5299
Homer Jay Simpson 555-7334
