Remove common word from headers in pandas data frame

Let's say I have the following dataframe:
import pandas as pd
data = [['Mallika', 23, 'Student'], ['Yash', 25, 'Tutor'], ['Abc', 14, 'Clerk']]
data_frame = pd.DataFrame(data, columns=['Student.first.name.word', 'Student.Current.Age.word', 'Student.Current.Profession.word'])
Student.first.name.word Student.Current.Age.word Student.Current.Profession.word
0 Mallika 23 Student
1 Yash 25 Tutor
2 Abc 14 Clerk
How would I sub out the common column header words "Student" and "word"
so that you would get the following dataframe:
first.name Current.Age Current.Profession
0 Mallika 23 Student
1 Yash 25 Tutor
2 Abc 14 Clerk

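You can remove those words and the dots from the column names with a regex and assign the result back (regex=True is needed on pandas 2.0 and later, where str.replace matches literally by default):
data_frame.columns = data_frame.columns.str.replace(r"(Student|word|\.)", "", regex=True)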
to get
>>> data_frame
name Age Profession
0 Mallika 23 Student
1 Yash 25 Tutor
2 Abc 14 Clerk
After the question was updated, you can split, slice, and join instead:
data_frame.columns = data_frame.columns.str.split(r"\.").str[1:-1].str.join(".")
That is, split on the literal dot, drop the first and last elements, and join the rest with a dot
to get
first.name Current.Age Current.Profession
0 Mallika 23 Student
1 Yash 25 Tutor
2 Abc 14 Clerk
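If the words to drop are known in advance, an anchored regex is another option that fits the updated headers. The sketch below assumes "Student" always leads and "word" always trails each header; the anchors ^ and $ leave any other occurrence of those words untouched.

```python
import pandas as pd

data = [["Mallika", 23, "Student"], ["Yash", 25, "Tutor"], ["Abc", 14, "Clerk"]]
data_frame = pd.DataFrame(
    data,
    columns=[
        "Student.first.name.word",
        "Student.Current.Age.word",
        "Student.Current.Profession.word",
    ],
)

# Strip "Student." only at the start and ".word" only at the end of each header.
data_frame.columns = data_frame.columns.str.replace(
    r"^Student\.|\.word$", "", regex=True
)
print(list(data_frame.columns))
```

Unlike the split-and-slice version, this leaves a header unchanged if it lacks the expected first or last word.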

Here is an extension of my answer on removing common prefixes. The benefit of this method is that it finds the prefix and suffix in a general way, so there is no need to hardcode any patterns.
import os
import re
cols = data_frame.columns
common_prefix = os.path.commonprefix(cols.tolist())
common_suffix = os.path.commonprefix([col[::-1] for col in cols])[::-1]
# Escape the affixes (they contain dots) and anchor them to the start and end
pattern = f"^{re.escape(common_prefix)}|{re.escape(common_suffix)}$"
data_frame.columns = cols.str.replace(pattern, "", regex=True)
name Age Profession
0 Mallika 23 Student
1 Yash 25 Tutor
2 Abc 14 Clerk
Update: the same solution works unchanged for the updated question:
first.name Current.Age Current.Profession
0 Mallika 23 Student
1 Yash 25 Tutor
2 Abc 14 Clerk
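The same idea can also be written without a regex by slicing each header by the length of the shared prefix and suffix. This is only a sketch: strip_common_affixes is a hypothetical helper name, and because os.path.commonprefix compares character by character, headers such as "Student.first" and "Student.final" would share the prefix "Student.fi" and be cut mid-word.

```python
import os


def strip_common_affixes(names):
    """Return the names with their shared leading and trailing text removed."""
    names = [str(n) for n in names]
    prefix = os.path.commonprefix(names)
    # Reverse every name so the shared ending becomes a shared beginning.
    suffix = os.path.commonprefix([n[::-1] for n in names])[::-1]
    # Slice by length instead of building a regex, so dots need no escaping.
    return [n[len(prefix):len(n) - len(suffix)] for n in names]


headers = [
    "Student.first.name.word",
    "Student.Current.Age.word",
    "Student.Current.Profession.word",
]
print(strip_common_affixes(headers))
```

The result can be assigned back with data_frame.columns = strip_common_affixes(data_frame.columns).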

To remove all common words, not just hardcoded ones, you can try:
from functools import reduce
df = data_frame
# Split every header into words and keep the words shared by all headers
common_words = [i.split(".") for i in df.columns.tolist()]
common_words = reduce(lambda x, y: set(x).intersection(y), common_words)
pat = r'\b(?:{})\b'.format('|'.join(common_words))
# Drop the common words, then trim the leftover leading and trailing dots
df.columns = df.columns.str.replace(pat, "", regex=True).str[1:-1]
Output:
print(df)
first.name Current.Age Current.Profession
0 Mallika 23 Student
1 Yash 25 Tutor
2 Abc 14 Clerk
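The word-intersection approach can also be sketched without a regex and without the final [1:-1] slice, which assumes the common words sit at both ends of every header. Here drop_common_words is a hypothetical helper, and the "." separator is taken from the question's headers.

```python
def drop_common_words(columns, sep="."):
    """Remove every word that appears in all of the given headers."""
    parts = [str(col).split(sep) for col in columns]
    # Words present in every header; empty when there are no headers at all.
    common = set(parts[0]).intersection(*parts[1:]) if parts else set()
    # Rebuild each header from the words that are not shared by all headers.
    return [sep.join(w for w in words if w not in common) for words in parts]


headers = [
    "Student.first.name.word",
    "Student.Current.Age.word",
    "Student.Current.Profession.word",
]
print(drop_common_words(headers))
```

Note that "Current" survives because it is missing from the first header. With a single header every word counts as common, so the result would be an empty string.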

Related

How to leave certain values (which have a comma in them) intact when separating list-values in strings in pandas?

From the dataframe, I create a new dataframe, in which the values from the "Select activity" column contain lists, which I will split and transform into new rows. But there is a value: "Nothing, just walking", which I need to leave unchanged. Tell me, please, how can I do this?
The original dataframe looks like this:
Name Age Select activity Profession
0 Ann 25 Cycling, Running Saleswoman
1 Mark 30 Nothing, just walking Manager
2 John 41 Cycling, Running, Swimming Accountant
My code looks like this:
df_new = df.loc[:, ['Name', 'Age']]
df_new['Activity'] = df['Select activity'].str.split(', ')
df_new = df_new.explode('Activity').reset_index(drop=True)
I get this result:
Name Age Activity
0 Ann 25 Cycling
1 Ann 25 Running
2 Mark 30 Nothing
3 Mark 30 just walking
4 John 41 Cycling
5 John 41 Running
6 John 41 Swimming
To keep the value "Nothing, just walking" from being split into two values, I added the following line:
if df['Select activity'].isin(['Nothing, just walking']) is False:
But it throws an error.

Then let's look ahead after the comma to guarantee a capital letter, and only then split. So instead of ", " the pattern becomes ", (?=[A-Z])":
df_new = df.loc[:, ["Name", "Age"]]
df_new["Activity"] = df["Select activity"].str.split(", (?=[A-Z])")
df_new = df_new.explode("Activity", ignore_index=True)
I only changed the splitter and passed ignore_index=True to explode instead of resetting the index afterwards (and switched to double quotes)
to get
>>> df_new
Name Age Activity
0 Ann 25 Cycling
1 Ann 25 Running
2 Mark 30 Nothing, just walking
3 John 41 Cycling
4 John 41 Running
5 John 41 Swimming
Or as a single chained expression:
df_new = (df.loc[:, ["Name", "Age"]]
.assign(Activity=df["Select activity"].str.split(", (?=[A-Z])"))
.explode("Activity", ignore_index=True))
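The lookahead relies on every activity starting with a capital letter. If that cannot be guaranteed, an alternative is to name the answers that must stay intact, which is closer to what the isin attempt in the question was after. The following is a sketch under that assumption; keep_whole and split_activities are illustrative names, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Ann", "Mark", "John"],
        "Age": [25, 30, 41],
        "Select activity": [
            "Cycling, Running",
            "Nothing, just walking",
            "Cycling, Running, Swimming",
        ],
        "Profession": ["Saleswoman", "Manager", "Accountant"],
    }
)

# Hypothetical set of answers that must never be split.
keep_whole = {"Nothing, just walking"}


def split_activities(value):
    # Wrap protected answers in a one-element list; split everything else.
    return [value] if value in keep_whole else value.split(", ")


df_new = (
    df.loc[:, ["Name", "Age"]]
    .assign(Activity=df["Select activity"].map(split_activities))
    .explode("Activity", ignore_index=True)
)
print(df_new)
```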

Pandas Series and Nan Values for mismatched values

I have these two dictionaries,
dico = {'Name': ['Arthur','Henri','Lisiane','Patrice','Zadig','Sacha'],
        "Age": ["20","18","62","73",'21','20'],
        "Studies": ['Economics','Maths','Psychology','Medical','Cinema','CS']
       }
dico2 = {'Surname': ['Arthur1','Henri2','Lisiane3','Patrice4']}
dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)
I would like to match the Surname column against the Name column and add it to dico, to get the following output:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
3 Patrice Nan 73 Medical
4 Zadig Nan 21 Cinema
5 Sacha Nan 20 CS
and ultimately delete the rows for which Surname is NaN:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
Here is my attempt:
map_list = []
for name in dico['Name']:
    best_ratio = None
    for idx, surname in enumerate(dico2['Surname']):
        if best_ratio == None:
            best_ratio = fuzz.ratio(name, surname)
            best_idx = 0
        else:
            ratio = fuzz.ratio(name, surname)
            if ratio > best_ratio:
                best_ratio = ratio
                best_idx = idx
    map_list.append(dico2['Surname'][best_idx]) # obtain surname
dico['Surname'] = pd.Series(map_list) # add column
dico = dico[["Name", "Surname", "Age", "Studies"]] # reorder columns
#if the surname is not a great match, print "Nan"
dico = dico.drop(dico[dico.Surname == "NaN"].index)
but when I print(dico), the output is as follows:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
3 Patrice Patrice4 73 Medical
4 Zadig Patrice4 21 Cinema
5 Sacha Patrice4 20 CS
I don't see why after the Patrice row, there's a mismatch, while I want it to be "Nan".

Let's try pd.MultiIndex.from_product to create all name and surname combinations, score each pair with fuzz.ratio, and keep the pairs above a threshold to build a dict. Then we can use Series.map and DataFrame.dropna:
from fuzzywuzzy import fuzz
comb = pd.MultiIndex.from_product((dico['Name'],dico2['Surname']))
scores = comb.map(lambda x: fuzz.ratio(*x)) #or fuzz.partial_ratio(*x)
d = dict(a for a,b in zip(comb,scores) if b>90) #change threshold
out = dico.assign(SurName=dico['Name'].map(d)).dropna(subset=['SurName'])
print(out)
Name Age Studies SurName
0 Arthur 20 Economics Arthur1
1 Henri 18 Maths Henri2
2 Lisiane 62 Psychology Lisiane3
3 Patrice 73 Medical Patrice4

You could do the following thing. Define the function:
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['Surname'] = m
    m2 = df_1['Surname'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['Surname'] = m2
    return df_1
and run
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = fuzzy_merge(dico, dico2, 'Name', 'Surname',threshold=90, limit=2)
This returns:
Name Age Studies Surname
0 Arthur 20 Economics Arthur1
1 Henri 18 Maths Henri2
2 Lisiane 62 Psychology Lisiane3
3 Patrice 73 Medical Patrice4
4 Zadig 21 Cinema
5 Sacha 20 CS
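The loop in the question always keeps the best-scoring candidate, however poor the score, so no row can end up unmatched; both answers fix that with a threshold. The same idea can be sketched with the standard library's difflib in place of fuzzywuzzy. This is a swapped-in technique, not what the answers use: the scores are on a 0 to 1 scale, and the 0.8 cutoff and the best_match helper are assumptions chosen for this data.

```python
import difflib

import pandas as pd

dico = pd.DataFrame(
    {
        "Name": ["Arthur", "Henri", "Lisiane", "Patrice", "Zadig", "Sacha"],
        "Age": ["20", "18", "62", "73", "21", "20"],
        "Studies": ["Economics", "Maths", "Psychology", "Medical", "Cinema", "CS"],
    }
)
dico2 = pd.DataFrame({"Surname": ["Arthur1", "Henri2", "Lisiane3", "Patrice4"]})

candidates = dico2["Surname"].tolist()


def best_match(name, cutoff=0.8):
    # get_close_matches returns at most n candidates scoring >= cutoff, best first.
    found = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return found[0] if found else None


# Names without a good enough match get None and are then dropped.
out = dico.assign(Surname=dico["Name"].map(best_match)).dropna(subset=["Surname"])
print(out)
```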

How to map to multiple values in a dictionary in pandas

I have the following pandas df:
Name
Jack
Alex
Jackie
Susan
I also have the following dict:
d = {'Jack':['Male','22'],'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
I would like to add two columns, Gender and Age, so that my df becomes:
Name Gender Age
Jack Male 22
Alex Male 26
Jackie Female 28
Susan Female 30
I have tried:
df['Gender'] = df.Name.map(d[0])
df['Age'] = df.Name.map(d[1])
but no such luck. Any ideas or help would be much appreciated! Thanks!

df['Gender'] = df.Name.map(lambda x: d[x][0])
df['Age'] = df.Name.map(lambda x: d[x][1])
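One caveat with indexing the dictionary directly is that d[x] raises a KeyError for any name that is not a key. A minimal sketch of a tolerant variant using dict.get follows; the extra name "Zed" and the no_match placeholder are made up for illustration.

```python
import pandas as pd

# "Zed" is a made-up name with no entry in the dictionary.
df = pd.DataFrame({"Name": ["Jack", "Alex", "Jackie", "Susan", "Zed"]})
d = {
    "Jack": ["Male", "22"],
    "Alex": ["Male", "26"],
    "Jackie": ["Female", "28"],
    "Susan": ["Female", "30"],
}

# Fallback pair returned when a name is missing from the dictionary.
no_match = (None, None)

df["Gender"] = df["Name"].map(lambda name: d.get(name, no_match)[0])
df["Age"] = df["Name"].map(lambda name: d.get(name, no_match)[1])
print(df)
```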

Take all the values of the dictionary and build a dataframe from them:
d = {'Jack':['Male','22'],'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
value_list = list(d.values())
df = pd.DataFrame(value_list, columns =['Gender', 'Age'])
print(df)

Use the pd.DataFrame constructor with Series.map, and pd.concat to join the result to df:
df = pd.concat([df, pd.DataFrame(df.Name.map(d).tolist(), columns=['Gender', 'Age'])], axis=1)
print(df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30

The solutions below also work when a name has no match in the dictionary, for example:
d = {'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
print (df)
Name Gender Age
0 Alex Male 26
1 Jack NaN NaN
2 Jackie Female 28
3 Susan Female 30
Use DataFrame.from_dict on your dictionary and attach it to the Name column with DataFrame.join. The advantage is that it works the same way if the input data has more columns:
df = df.join(pd.DataFrame.from_dict(d, orient='index', columns=['Gender','Age']), on='Name')
print (df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
Your approach works if you create two dictionaries:
d1 = {k:v[0] for k,v in d.items()}
d2 = {k:v[1] for k,v in d.items()}
df['Gender'] = df.Name.map(d1)
df['Age'] = df.Name.map(d2)
print (df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30

Choose higher value based off column value between two dataframes

I have a question about choosing values based on two dataframes.
>>> df[['age','name']]
age name
0 44 Anna
1 22 Bob
2 33 Cindy
3 44 Danis
4 55 Cindy
5 66 Danis
6 11 Anna
7 43 Bob
8 12 Cindy
9 19 Danis
10 11 Anna
11 32 Anna
12 55 Anna
13 33 Anna
14 32 Anna
>>> df2[['age','name']]
age name
5 66 Danis
4 55 Cindy
0 44 Anna
7 43 Bob
The expected result is all rows of df whose 'age' is higher than the 'age' in df2 for the same 'name'.
Expected result:
age name
12 55 Anna

Per the comments, use merge and then filter the dataframe:
df.merge(df2, on='name', suffixes=('', '_y')).query('age > age_y')[['name','age']]
Output:
name age
4 Anna 55
IIUC, you can use this to find the max age of all names:
pd.concat([df,df2]).groupby('name')['age'].max()
Output:
name
Anna 55
Bob 43
Cindy 55
Danis 66
Name: age, dtype: int64

Try this, where age is the value you want to compare against:
index = df[df['age'] > age].index
df.loc[index]

There are a few edge cases you don't say how you would like to resolve, but in general you want to iterate down the dataframes, compare the ages row by row, and keep the larger one. You could do so in the following manner:
rows = []
for x in range(min(len(df), len(df2))):
    # Compare the rows at the same position and keep the one with the larger age
    if df['age'].iloc[x] > df2['age'].iloc[x]:
        rows.append(df[['age', 'name']].iloc[x])
    else:
        rows.append(df2[['age', 'name']].iloc[x])
df3 = pd.DataFrame(rows, columns=['age', 'name'])
You will still need to modify this to reflect how you want to handle names that appear in only one dataframe, or dataframes of different sizes (this loop stops at the shorter one).

One solution that comes to mind is to merge and filter:
df.merge(df2, on='name', suffixes=('', '_y')).query('age.gt(age_y)', engine='python')[['age','name']]
Out[175]:
age name
4 55 Anna
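The merge builds a new index, which is why the row is labelled 4 rather than 12 as in the expected result. A sketch that keeps the original row labels is to look up each name's reference age with Series.map and filter df directly. It assumes every name appears at most once in df2; the data below is copied from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [44, 22, 33, 44, 55, 66, 11, 43, 12, 19, 11, 32, 55, 33, 32],
        "name": [
            "Anna", "Bob", "Cindy", "Danis", "Cindy", "Danis", "Anna", "Bob",
            "Cindy", "Danis", "Anna", "Anna", "Anna", "Anna", "Anna",
        ],
    }
)
df2 = pd.DataFrame(
    {"age": [66, 55, 44, 43], "name": ["Danis", "Cindy", "Anna", "Bob"]},
    index=[5, 4, 0, 7],
)

# Look up each row's reference age by name (names must be unique in df2).
reference_age = df["name"].map(df2.set_index("name")["age"])

# Keep only the rows whose age is strictly greater than the reference age.
# Names missing from df2 would map to NaN and compare as False.
result = df[df["age"] > reference_age]
print(result)
```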

Filter pandas dataframe based on a column: keep all rows if a value is in that column

So I have a dataframe like the following:
Name Age City
A 21 NY
A 20 DC
A 35 OR
B 18 DC
B 19 PA
I need to keep all the rows for every Name and Age pair where a specific value is among those associated with column City. For example if my target city is NY, then my desired output would be:
Name Age City
A 21 NY
A 20 DC
A 35 OR
Edit1: I am not necessarily looking for a single value. There might be cases where there are multiple cities that I am looking for. For example: NY and DC at the same time.
Edit2: I have tried the following, which does not return the correct output:
df = df[df['City'] == 'NY']
and
df = df[df['City'].isin('NY')]

You can create a function that first tests City for equality, collects the unique names that match, and then filters by those names with isin:
def get_df_by_val(df, val):
    return df[df['Name'].isin(df.loc[df['City'].eq(val), 'Name'].unique())]
print (get_df_by_val(df, 'NY'))
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
print (get_df_by_val(df, 'PA'))
Name Age City
3 B 18 DC
4 B 19 PA
print (get_df_by_val(df, 'OR'))
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
EDIT:
If you need to check multiple values per group, use GroupBy.transform and compare sets with issubset:
vals = ['NY', 'DC']
df1 = df[df.groupby('Name')['City'].transform(lambda x: set(vals).issubset(x))]
print (df1)
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
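The question does not say whether a name must have all target cities or just one of them. The sketch below covers both readings behind a flag; keep_names_with_cities and require_all are hypothetical names introduced here for illustration.

```python
import pandas as pd


def keep_names_with_cities(df, cities, require_all=True):
    """Keep every row of the names associated with the target cities."""
    wanted = set(cities)
    # Collect the set of cities seen for each name.
    cities_per_name = df.groupby("Name")["City"].apply(set)
    if require_all:
        # The name must be associated with every target city.
        ok = cities_per_name.map(lambda seen: wanted <= seen)
    else:
        # The name must be associated with at least one target city.
        ok = cities_per_name.map(lambda seen: bool(wanted & seen))
    return df[df["Name"].isin(ok[ok].index)]


df = pd.DataFrame(
    {
        "Name": ["A", "A", "A", "B", "B"],
        "Age": [21, 20, 35, 18, 19],
        "City": ["NY", "DC", "OR", "DC", "PA"],
    }
)
print(keep_names_with_cities(df, ["NY", "DC"]))
print(keep_names_with_cities(df, ["NY", "PA"], require_all=False))
```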
