I have a DataFrame:
Name
Joe Smith
Jane Doe
Homer Simpson
I am trying to format this to get to:
Name
Smith, Joe
Doe, Jane
Simpson, Homer
I have this code, and it works for ~80% of the users in my list, but some users are not coming through right.
invalid_users = ['Test User', 'Test User2', 'Test User3']

for index, row in df_Users.iterrows():
    gap_pos = df_Users["Name"][index].find(" ")
    if gap_pos > 0 and row["Name"] not in invalid_users:
        row["Name"] = df_Users["Name"][index][len(df_Users["Name"][index])-gap_pos+1:].strip() + ', ' + df_Users["Name"][index][:gap_pos]
The users who are not coming through correctly usually have their last name truncated somewhere, i.e. Simpson ==> mpson.
What am I doing wrong here?
Just split on the space, then reverse it (that's what .str[::-1] is doing) and join on ', ':
>>> df['Name'].str.split(' ').str[::-1].str.join(', ')
0 Smith, Joe
1 Doe, Jane
2 Simpson, Homer
Name: Name, dtype: object
And if your data contains names like Jr. Joe Smith, then you may do it the following way:
df['Name'].str.split(' ').str[::-1].apply(lambda x:(x[0],' '.join(x[1:]))).str.join(', ')
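To write the result back to the frame, just assign it to the column (assuming you want to overwrite Name in place):

df['Name'] = df['Name'].str.split(' ').str[::-1].str.join(', ')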
I'm not sure what you were trying to do with len there, but it's not right. You just want to start straight from gap_pos:
row["Name"] = df_Users["Name"][index][gap_pos+1:].strip() +', ' + df_Users["Name"][index][:gap_pos]
I would be tempted to use split for this.
Pandas is a library that takes advantage of vectorized operations, especially for simple transformations and most DataFrame manipulations.
Given your example, here is some code that would work:
import pandas as pd
df = pd.DataFrame({"name": ["Joe Smith", "Jane Doe", "Homer Simpson"]})
# df
# name
# 0 Joe Smith
# 1 Jane Doe
# 2 Homer Simpson
df["name"] = df["name"].apply(lambda x: f"{x.split(' ')[1]}, {x.split(' ')[0]}")
# df
# name
# 0 Smith, Joe
# 1 Doe, Jane
# 2 Simpson, Homer
The apply function takes every row and applies the specified function to each one of them.
Here, the specified function is a lambda function that, supposing the name pattern is "FirstName LastName", does what you want.
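If some of your names may contain more than two words, a hedged variant using maxsplit keeps everything after the first space together as the last name (this is an assumption about your data; your example only shows two-word names):

# split at most once, so 'Homer J Simpson' would become 'J Simpson, Homer'
df["name"] = df["name"].apply(lambda x: ", ".join(reversed(x.split(" ", 1))))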
Related
I have two dataframes df1 and df2
df1 =

   University   School        Student first name   last name   nick name
   AAA          Law           John                 Mckenzie    Stevie
   BBB          Business      Steve                Savannah    JO
   CCC          Engineering   Mark                 Justice     Fre
   DDD          Arts          Stuart               Little      Rah
   EEE          Life science  Adam                 Johnson     meh

120 rows x 5 columns
df2 =

   Statement
   Stuart had a headache last nigh which was due to th……
   Rah basically found a new found friend which lead to the……
   Gerome got a brand new watch which was……….
   Adam was found chilling all through out his life……
   Savannah is such a common name that……..

3000 rows x 1 column
The aim is to form df3: match the string literals from every cell of the columns "Student first name", "last name" and "nick name" against each statement to produce the table below.
df3 =

   Statement                                               Matching   University   School
   Stuart had a headache last nigh which was due to th…    Stuart     DDD          Arts
   Rah basically found a new found friend which lead to    Rah        DDD          Arts
   Gerome got a brand new watch which was……….              NA         NA           NA
   Adam was found chilling all through out his life……      Adam       EEE          Life science
   Savannah is such a common name that……..                 Savannah   BBB          Business

3000 rows x 4 columns
You can melt and merge:
import re

df1_melt = df1.melt(['University', 'School'], value_name='Match')
regex = '|'.join(map(re.escape, df1_melt['Match']))

out = df2.join(
    df1_melt[['Match', 'University', 'School']]
    .merge(df2['Statement']
           .str.extract(f'({regex})', expand=False)
           .rename('Match'),
           how='right', on='Match'
           )
)
output:
Statement Match University School
0 Stuart had a headache last nigh which was due to the Stuart DDD Arts
1 Rah basically found a new found friend which lead to the Rah DDD Arts
2 Gerome got a brand new watch which was NaN NaN NaN
3 Adam was found chilling all through out his life Adam EEE Life science
4 Savannah is such a common name that Savannah BBB Business
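For reference, this is roughly what the intermediate df1_melt looks like on the sample data (a sketch, first rows only; the 'variable' column is added by melt):

    University   School     variable             Match
0   AAA          Law        Student first name   John
1   BBB          Business   Student first name   Steve
...
5   AAA          Law        last name            Mckenzie
...
10  AAA          Law        nick name            Stevie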
Naïve approach: loop over the columns to find matches, then loop again to merge on the matches:
import re
import pandas as pd

columns_to_match = ["Student first name", "last name", "nick name"]

dfs = []
for column in columns_to_match:
    search_strings = df1[column].unique().tolist()
    regex = "|".join(map(re.escape, search_strings))
    df2["Matching"] = df2["Statement"].str.extract(f"({regex})")
    dfs.append(df2.dropna())

matched_df = pd.concat(dfs).reset_index(drop=True)

dfs = []
for column in columns_to_match:
    final_df = df1.merge(matched_df, how="inner", left_on=column, right_on="Matching")
    dfs.append(final_df)

final_df = pd.concat(dfs).reset_index(drop=True).drop(columns=columns_to_match)
My answer makes the following assumptions:
The index on df1 serves as the student ID and is unique.
That you only want to fill the first student found. A statement like "John and Steve are friends" will be assigned to John.
import re
import pandas as pd

assigned = pd.Series([False] * len(df2))
df3 = df2.copy()

# Loop through each student, taking their first, last and nick name
for idx, names in df1[["Student first name", "last name", "nick name"]].iterrows():
    # If all statements have been assigned, terminate the loop
    if assigned.all():
        break
    # Combine the student's first, last and nick name into a regex pattern
    pattern = f"({'|'.join(names.map(re.escape))})"
    # For each UNASSIGNED statement, find the pattern. We only search unassigned
    # statements to lower the number of searches.
    match = df3.loc[~assigned, "Statement"].str.extract(pattern, expand=False)
    # Mark the statement as assigned
    cond = ~assigned & match.notna()
    assigned[cond] = True
    # Fill in the student's info
    df3.loc[cond, "Match"] = match[cond]
    df3.loc[cond, "University"] = df1.loc[idx, "University"]
    df3.loc[cond, "School"] = df1.loc[idx, "School"]
Rather than iterating through each cell, you could create three dataframes (merging on each of the three columns separately) and concatenate the results into one dataframe.
import pandas as pd

df2['Matching'] = df2['Statement'].str.split().str[0]

dfs = []
for col in ['Student first name', 'last name', 'nick name']:
    df_temp = pd.merge(df2, df1[[col, 'University', 'School']].rename(columns={col: 'Matching'}),
                       how='left')
    dfs.append(df_temp)

df3 = pd.concat(dfs).drop_duplicates()
Please help me with a Python script to filter the CSV below.
Below is an example of the CSV dump, on which I have done the initial filtration.
Last_name   Gender   Name       Phone       city
Ford        Male     Tom        123         NY
Rich        Male     Robert     21312       LA
Ford        Female   Jessica    123123      NY
Ford        Male     John       3412        NY
Rich        Other    Linda      12312       LA
Ford        Other    James      4321        NY
Smith       Male     David      123123      TX
Rich        Female   Mary       98689       LA
Rich        Female   Jennifer   86860       LA
Ford        Male     Richard    12123       NY
Smith       Other    Daniel     897097      TX
Ford        Other    Lisa       123123123   NY
def gather_info(L_name):
    dump_filename = "~/Documents/name_report.csv"
    LN = []
    with open(dump_filename, "r") as FH:
        for var in FH.readlines():
            if L_name in var:
                final = var.split(",")
                print(final[1], final[2], final[3])
    return LN

if __name__ == "__main__":
    L_name = input("Enter the Last name: ")
    la_name = gather_info(L_name)
With this, I am able to filter by the last name. For example, if I choose L_name as Ford, then my output is:
Gender   Name      Phone
Male     Tom       123
Female   Jessica   123123
Male     John      3412
Other    James     4321
Male     Richard   12123
Other    Lisa      123123123
I need help extending the script by selecting each gender and the values in the list, performing other functions on them, then moving on to the next gender and its values to perform the same functions. For example, first it selects the gender Male [Tom, John] and performs the other functions, then selects the next gender Female [Jessica] and performs the same functions, and then selects the gender Other [James, Lisa] and performs the same functions.
I would recommend using the pandas module, which allows for easy filtering and grouping of data:
import pandas as pd

if __name__ == '__main__':
    data = pd.read_csv('name_reports.csv')
    L_name = input("Enter the last name: ")
    by_last_name = data[data['Last_name'] == L_name]
    groups = by_last_name.groupby(['Gender'])
    for group_name, group_data in groups:
        print(group_name)
        print(group_data)
Breaking this down into its pieces, the first part is

data = pd.read_csv('name_reports.csv')

This reads the data from the CSV and places it into a DataFrame.
Second, we have

by_last_name = data[data['Last_name'] == L_name]

This filters the DataFrame down to the rows whose Last_name equals L_name.
Next, we group the data:

groups = by_last_name.groupby(['Gender'])

This groups the filtered DataFrame by gender.
Then we iterate over this. Each iteration returns a tuple with the group name and the DataFrame associated with that group:
for group_name, group_data in groups:
print(group_name)
print(group_data)
This loop just prints out the data. To access fields from it, you can use the iterrows function:
for index, row in group_data.iterrows():
    print(row['city'])
    print(row['Phone'])
    print(row['Name'])
And then you can use those for whatever function you want. I would recommend reading the pandas documentation, since depending on the function you plan on using there may be a better way to do it with the library. Here is the link to the DataFrame reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
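For example, a hedged sketch building on the groups above: collect each gender's names into a list and hand them to a function of your own (process here is a hypothetical placeholder, not a pandas function):

def process(gender, names):
    # do whatever work you need per gender group
    print(gender, names)  # for Ford, the Male group would include Tom, John and Richard

for group_name, group_data in groups:
    process(group_name, group_data['Name'].tolist())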
Since you cannot use the pandas module, a method using only the csv module would look like this:
import csv

def has_last_name(row, last_name):
    return row['Last_name'] == last_name

def has_gender(row, current_gender):
    return row['Gender'] == current_gender

if __name__ == '__main__':
    data = None
    genders = ['Male', 'Female', 'Other']
    with open('name_reports.csv') as csvfile:
        data = list(csv.DictReader(csvfile, delimiter=','))
    L_name = input('Enter the Last name: ')
    get_by_last_name = lambda row: has_last_name(row, L_name)
    filtered_by_last_name = list(filter(get_by_last_name, data))
    for gender in genders:
        get_by_gender = lambda row: has_gender(row, gender)
        filtered_by_gender = list(filter(get_by_gender, filtered_by_last_name))
        print(filtered_by_gender)
The important part is the filter built-in function. It takes a function that receives an item from a list and returns a bool; filter applies that function to an iterable and returns a generator of the items for which the function returned true. The other important part is csv.DictReader, which returns each row of your CSV file as a dictionary, allowing you to access attributes by key instead of by index.
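As a minimal illustration of filter, independent of the CSV data:

nums = [1, 2, 3, 4, 5]
evens = list(filter(lambda n: n % 2 == 0, nums))  # keeps items where the function returns True
print(evens)  # [2, 4]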
I'm trying to replace specific characters in a data frame, but only if the string in the column starts with the characters specified. I mean, the df is as below:
UBICACION   NAME
AL03        Joe
FL03        Maria
AL07        Karla
DAL5        Marco
The desired output would be:
UBICACION   NAME
FL03        Joe
FL03        Maria
FL07        Karla
DAL5        Marco
This is my try:
df['UBICACION'] = df['UBICACION'].replace("AL", "FL")
The last line is not working, because it changes the whole word instead of just the specified leading characters.
Hope you can help me, I'm a little bit new to this. Best regards.
DataFrame.replace includes a regex=True option, so you can use ^AL:
df['UBICACION'] = df['UBICACION'].replace('^AL', 'FL', regex=True)
# UBICACION NAME
# 0 FL03 Joe
# 1 FL03 Maria
# 2 FL07 Karla
# 3 DAL5 Marco
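Equivalently, the Series string accessor offers the same anchored replacement (a minor variant, assuming the same frame):

df['UBICACION'] = df['UBICACION'].str.replace('^AL', 'FL', regex=True)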
try this:
df["UBICACION"] = df["UBICACION"].apply(lambda x: f"FL{x[2:]}" if x.startswith("AL") else x)
I have a large Excel file to clean, around 200000 rows. So I'm using pandas to drop unwanted rows when certain conditions are met, but it takes some time to run.
My current code looks like this:
import phonenumbers as pn
from tqdm import tqdm

def cleanNumbers(number):  # checks whether number is a valid phone number
    valid = True
    try:
        num = pn.parse('+' + str(number), None)
        if not pn.is_valid_number(num):
            valid = False
    except Exception:
        valid = False
    return valid

for UncleanNum in tqdm(TeleNum):
    valid = cleanNumbers(UncleanNum)  # calling the cleanNumbers function
    if valid is False:
        # dropping the row if the number is not a valid number
        df = df.drop(df[df.telephone == UncleanNum].index)
It takes around 30 minutes for this code to finish. Is there a more efficient way to drop rows with pandas? If not, can I use numpy to get the same output?
I'm not that acquainted with pandas or numpy, so if you have any tips to share, that would be helpful.
Edit:
I'm using the phonenumbers lib to check whether a telephone number is valid. If it's not a valid phone number, I drop the row that number is on.
Example data:

address     name   surname    telephone
Street St.  Bill   Billinson  7398673456897   <-- let's say this is wrong
Street St.  Nick   Nick       324523452345
Street St.  Sam    Sammy      234523452345
Street St.  Bob    Bob        32452345234534  <-- and this too
Street St.  John   Greg       234523452345
Output:

address     name   surname  telephone
Street St.  Nick   Nick     324523452345
Street St.  Sam    Sammy    234523452345
Street St.  John   Greg     234523452345
This is what my code does, but it is slow.
In my opinion the main bottleneck here is not the drop, but the custom function being called repeatedly for a large number of values.
Create a list of all valid numbers, then filter by boolean indexing with Series.isin:
v = [UncleanNum for UncleanNum in tqdm(TeleNum) if cleanNumbers(UncleanNum)]
df = df[df.telephone.isin(v)]
EDIT:
After some testing, the solution can be simplified, because the function returns a boolean:
df1 = df[df['telephone'].apply(cleanNumbers)]
I am trying to figure out how to remove a word from a group of words in a column and insert that removed word into a new column. I figured out how to remove part of a column and insert it into a new row, but I cannot figure out how to target a specific word (by placement, I assume; "Mr." is always the 2nd word; or maybe by taking the word between the first "," and ".", which is also always constant in my data set).
Name Age New_Name
Doe, Mr. John 23 Mr.
Anna, Mrs. Fox 33 Mrs.
EDITED the above to add another row
How would I remove the "Mr." from the name column and insert it into the "New_Name" column?
So far I have come up with:
data['New_name'] = data.Name.str[:2]
This doesn't allow me to specifically target "Mr." though.
I think I have to use a string.split, but the exact code is eluding me.
If the Mr. is always in the same position as indicated by your example, this can be accomplished with a list comprehension:
df['New_Name'] = [x.split(' ')[1] for x in df['Name']]
and

df['Name'] = [' '.join(x.split(' ')[::2]) for x in df['Name']]
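For the two-row frame above, this should yield (a sketch of the expected result, not verified output):

        Name  Age New_Name
0  Doe, John   23      Mr.
1  Anna, Fox   33     Mrs.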
First, you have to get the title from the name (it is between the comma and the dot) and store it in another column. Then repeat the operation to remove the title from the 'Name' column:
import pandas as pd

df = pd.DataFrame({'Name': ['Doe, Mr. John', 'Anna, Ms. Fox'], 'Age': [23, 33]})
df['New_Name'] = df['Name'].apply(lambda x: x[x.find(',')+len(','):x.rfind('.')]+'.')
df['Name'] = df['Name'].apply(lambda x: x.replace(x[x.find(',')+len(','):x.rfind('.')]+'.', ''))
print(df)
Output:
Age Name New_Name
0 23 Doe, John Mr.
1 33 Anna, Fox Ms.
You can use the pandas str.replace and str.extract methods.
First, extract the title to form the new column:
df['New_Name'] = df['Name'].str.extract(r',\s([A-Za-z]+\.)')
Then use replace to swap the extracted title for a single space:
df['Name'] = df['Name'].str.replace(r'\s([A-Za-z]+\.)\s', ' ', regex=True)
You get:
Age Name New_Name
0 23 Doe, John Mr.
name = "Doe, Mr. John"
# if you always expect a title (Mr/Ms) between comma and dot
# split to lastname, title and firstname and strip spaces
newname = [ n.strip() for n in name.replace(".", ",").split(",") ]
print(newname)
#> ['Doe', 'Mr', 'John']
Then you can print a title and a firstname-lastname column, or other combinations of them.
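For example, to rebuild a "Title FirstName LastName" string from those parts:

print(f"{newname[1]}. {newname[2]} {newname[0]}")
#> Mr. John Doe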