Stacking values by duplicating some other columns in a pandas dataframe? - python

I have a data-frame like this:
df
ID Brands Age Gender City
1 BMW_Audi_VW 50 M Milano
2 VW_BMW 45 F SF
I would like to split the Brands column on "_" and duplicate all the other columns except City.
I can do it based on the ID column like this:
df = df.set_index('ID').stack().str.split('_', expand=True).unstack(-1).stack(0).reset_index()
but it duplicates only the ID column. I need all columns except "City".
Here is the desired output that I am looking for:
ID Brands Age Gender City
1 BMW 50 M Milano
1 Audi 50 M None
1 VW 50 M None
2 VW 45 F SF
2 BMW 45 F None

Use DataFrame.explode on the column values split by Series.str.split, then set None in the repeated rows with DataFrame.mask:
# split Brands into lists, then expand each list element to its own row
df = df.assign(Brands = df['Brands'].str.split('_')).explode('Brands')
# blank every column not listed in include for the repeated rows
include = ['ID','Brands','Age','Gender']
cols = df.columns.difference(include)
# rows created by explode keep the original index label, so duplicated() flags the repeats
df[cols] = df[cols].mask(df.index.to_series().duplicated(), None)
df = df.reset_index(drop=True)
print (df)
ID Brands Age Gender City
0 1 BMW 50 M Milano
1 1 Audi 50 M None
2 1 VW 50 M None
3 2 VW 45 F SF
4 2 BMW 45 F None
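If only one column needs blanking, a shorter variant (a sketch, not from the answer above) can index the duplicated labels directly:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2],
                   'Brands': ['BMW_Audi_VW', 'VW_BMW'],
                   'Age': [50, 45],
                   'Gender': ['M', 'F'],
                   'City': ['Milano', 'SF']})
df = df.assign(Brands=df['Brands'].str.split('_')).explode('Brands')
# rows created by explode share the original label, so flag every repeat
df.loc[df.index.duplicated(), 'City'] = None
df = df.reset_index(drop=True)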
EDIT:
Check difference:
#Brands column is assigned to the Brands column (the same column)
df1 = df.assign(Brands = df['Brands'].str.split('_')).explode('Brands')
print (df1)
ID Brands Age Gender City
0 1 BMW 50 M Milano
0 1 Audi 50 M Milano
0 1 VW 50 M Milano
1 2 VW 45 F SF
1 2 BMW 45 F SF
#Brands column is assigned to the Brands1 column (another column), so explode('Brands') has nothing to expand
df2 = df.assign(Brands1 = df['Brands'].str.split('_')).explode('Brands')
print (df2)
ID Brands Age Gender City Brands1
0 1 BMW_Audi_VW 50 M Milano [BMW, Audi, VW]
1 2 VW_BMW 45 F SF [VW, BMW]
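To actually expand the second frame you would explode the list column itself (a one-line sketch; Brands keeps the original joined strings):
print (df2.explode('Brands1'))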

Related

Creation of DataFrame with specific conditions on rows

Given the following Python pandas DataFrame:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  NaN      NaN    NaN    NaN
832932  France   NaN    30     NaN
31728#  NaN      NaN    NaN    NaN
I would like to make the following modifications for each row:
If the ID column has any '#' value, the row is left unchanged.
If the ID column has no '#' value, and country is NaN, "Other" is added to the country column, and a 0 is added to other column.
Finally, only if the money column is NaN and the other column has a value, we assign money and money_add from the following lookup table (matching other against other_ID):
other_ID  money  money_add
19        4532   723823
50        1213   238232
18        1813   273283
30        1313   83293
0         8932   3920
Example of the resulting table:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  Other    8932   0      3920
832932  France   1313   30     83293
31728#  NaN      NaN    NaN    NaN
First set values in both columns where the two conditions match, then align the lookup table with the filtered non-# rows and fill only the missing values with DataFrame.update:
m1 = df['ID'].str.contains('#')
m2 = df['country'].isna()
# rows without '#' in ID and with missing country get the default values
df.loc[~m1 & m2, ['country','other']] = ['Other',0]
# put both frames on the same key: other_ID in the lookup, other in df
df1 = df1.set_index(df1['other_ID'])
df = df.set_index(df['other'].mask(m1))
# overwrite=False fills only missing values in the shared columns
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print (df)
ID country money other money_add
0 832932 France 12131 19.0 82932.0
1 217#8# NaN NaN NaN NaN
2 1329T2 Other 8932.0 0.0 3920.0
3 832932 France 1313.0 30.0 83293.0
4 31728# NaN NaN NaN NaN
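The snippet assumes the two frames are already loaded; a minimal reproduction (exact dtypes are an assumption) looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['832932', '217#8#', '1329T2', '832932', '31728#'],
                   'country': ['France', np.nan, np.nan, 'France', np.nan],
                   'money': [12131, np.nan, np.nan, np.nan, np.nan],
                   'other': [19, np.nan, np.nan, 30, np.nan],
                   'money_add': [82932, np.nan, np.nan, np.nan, np.nan]})
df1 = pd.DataFrame({'other_ID': [19, 50, 18, 30, 0],
                    'money': [4532, 1213, 1813, 1313, 8932],
                    'money_add': [723823, 238232, 273283, 83293, 3920]})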

Pandas Series and NaN values for mismatched values

I have these two dictionaries:
import pandas as pd

dico = {'Name': ['Arthur','Henri','Lisiane','Patrice','Zadig','Sacha'],
"Age": ["20","18","62","73",'21','20'],
"Studies": ['Economics','Maths','Psychology','Medical','Cinema','CS']
}
dico2 = {'Surname': ['Arthur1','Henri2','Lisiane3','Patrice4']}
dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)
in which I would like to match the Surname column against the Name column and append it to dico, for the following output:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
3 Patrice NaN 73 Medical
4 Zadig NaN 21 Cinema
5 Sacha NaN 20 CS
and ultimately delete the rows for which Surname is NaN:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
from fuzzywuzzy import fuzz

map_list = []
for name in dico['Name']:
    best_ratio = None
    for idx, surname in enumerate(dico2['Surname']):
        if best_ratio == None:
            best_ratio = fuzz.ratio(name, surname)
            best_idx = 0
        else:
            ratio = fuzz.ratio(name, surname)
            if ratio > best_ratio:
                best_ratio = ratio
                best_idx = idx
    map_list.append(dico2['Surname'][best_idx])  # obtain surname
dico['Surname'] = pd.Series(map_list)  # add column
dico = dico[["Name", "Surname", "Age", "Studies"]]  # reorder columns
# if the surname is not a great match, print "NaN"
dico = dico.drop(dico[dico.Surname == "NaN"].index)
but when I print(dico), the output is as follows:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
3 Patrice Patrice4 73 Medical
4 Zadig Patrice4 21 Cinema
5 Sacha Patrice4 20 CS
I don't see why there's a mismatch after the Patrice row; I want those values to be NaN.
Your loop always appends the best match it finds, even when the best ratio is poor, so Zadig and Sacha get paired with Patrice4 instead of NaN; you need a score threshold. Let's try pd.MultiIndex.from_product to create the combinations, assign a score with zip and fuzz.ratio, filter by a threshold to build our dict, then use Series.map and DataFrame.dropna:
from fuzzywuzzy import fuzz
comb = pd.MultiIndex.from_product((dico['Name'],dico2['Surname']))
scores = comb.map(lambda x: fuzz.ratio(*x)) #or fuzz.partial_ratio(*x)
d = dict(a for a,b in zip(comb,scores) if b>90) #change threshold
out = dico.assign(SurName=dico['Name'].map(d)).dropna(subset=['SurName'])
print(out)
Name Age Studies SurName
0 Arthur 20 Economics Arthur1
1 Henri 18 Maths Henri2
2 Lisiane 62 Psychology Lisiane3
3 Patrice 73 Medical Patrice4
You could do the following. Define the function:
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['Surname'] = m
    m2 = df_1['Surname'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['Surname'] = m2
    return df_1
and run
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = fuzzy_merge(dico, dico2, 'Name', 'Surname', threshold=90, limit=2)
This returns:
Name Age Studies Surname
0 Arthur 20 Economics Arthur1
1 Henri 18 Maths Henri2
2 Lisiane 62 Psychology Lisiane3
3 Patrice 73 Medical Patrice4
4 Zadig 21 Cinema
5 Sacha 20 CS
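To reach the asker's final table, the non-matches (which fuzzy_merge leaves as empty strings) still need to be dropped; a one-line sketch:
df = df[df['Surname'] != '']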

filtering data in pandas where string is in multiple columns

I have a dataframe that looks like this:
team_1 score_1 team_2 score_2
AUS 2 SCO 1
ENG 1 ARG 0
JPN 0 ENG 2
I can retrieve all the data for a single team by using:
#list specifying the team of interest
team = ['ENG']
#slice the dataframe to the rows where 'team_1' or 'team_2' is in the list 'team'
df.loc[df['team_1'].isin(team) | df['team_2'].isin(team)]
team_1 score_1 team_2 score_2
ENG 1 ARG 0
JPN 0 ENG 2
How can I now return only the score for my 'team' such as:
team score
ENG 1
ENG 2
Maybe creating an index to each team so as to filter out?
Maybe encoding the team_1 and team_2 columns to filter out?
new_df_1 = df[df.team_1 =='ENG'][['team_1', 'score_1']]
new_df_1 = new_df_1.rename(columns={"team_1":"team", "score_1":"score"})
# team score
# 0 ENG 1
new_df_2 = df[df.team_2 =='ENG'][['team_2', 'score_2']]
new_df_2 = new_df_2.rename(columns={"team_2":"team", "score_2":"score"})
# team score
# 1 ENG 2
then concat two dataframe:
pd.concat([new_df_1, new_df_2])
the output is :
team score
0 ENG 1
1 ENG 2
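If you prefer a clean 0..n index on the combined frame, concat can renumber directly (a small sketch):
pd.concat([new_df_1, new_df_2], ignore_index=True)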
Melt the team columns (keeping the score columns as id_vars), filter for values in team, compute score as the row-wise sum of the score columns, and keep only team and score:
team = ["ENG"]
cols = ["score_1", "score_2"]
(
    df
    .melt(cols, value_name="team")
    .query("team in @team")
    .assign(score=lambda x: x.filter(like="score").sum(axis=1))
    .loc[:, ["team", "score"]]
)
team score
1 ENG 1
5 ENG 2

Filter pandas dataframe based on a column: keep all rows if a value is in that column

So I have a dataframe like the following:
Name Age City
A 21 NY
A 20 DC
A 35 OR
B 18 DC
B 19 PA
I need to keep all the rows for every Name where a specific value appears among that Name's City values. For example, if my target city is NY, then my desired output would be:
Name Age City
A 21 NY
A 20 DC
A 35 OR
Edit1: I am not necessarily looking for a single value. There might be cases where there are multiple cities that I am looking for, for example NY and DC at the same time.
Edit2: I have tried the following, which do not return the correct output (daah):
df = df[df['City'] == 'NY']
and
df = df[df['City'].isin('NY')]
You can create a function - first test City for equality and get all unique matching names, then filter again by isin:
def get_df_by_val(df, val):
    return df[df['Name'].isin(df.loc[df['City'].eq(val), 'Name'].unique())]
print (get_df_by_val(df, 'NY'))
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
print (get_df_by_val(df, 'PA'))
Name Age City
3 B 18 DC
4 B 19 PA
print (get_df_by_val(df, 'OR'))
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
EDIT:
If you need to check multiple values per group, use GroupBy.transform and compare sets with issubset:
vals = ['NY', 'DC']
df1 = df[df.groupby('Name')['City'].transform(lambda x: set(vals).issubset(x))]
print (df1)
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
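If instead any one of the values should be enough to keep a group (Edit1 can be read either way), intersect the sets rather than testing a subset (a sketch):
vals = ['NY', 'DC']
df2 = df[df.groupby('Name')['City'].transform(lambda x: bool(set(vals) & set(x)))]
print (df2)
# B's rows are kept too, because B has DC even though it lacks NY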

Pandas merge fail to extract common Index values

I'm trying to merge 2 DataFrames of different sizes, both are indexed by 'Country'. The first dataframe 'GDP_EN' contains every country in the world, and the second dataframe 'ScimEn' contains 15 countries.
When I try to merge these DataFrames, instead of merging the columns based on the index countries of ScimEn, I get back 'Country_x' and 'Country_y'. 'Country_x' came from GDP_EN and holds its first 15 countries in alphabetical order; 'Country_y' holds the 15 countries from ScimEn. I'm wondering why they didn't merge.
I used:
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
I think both DataFrames are not indexed by Country; Country is a regular column, so add the parameter on='Country':
GDP_EN = pd.DataFrame({'Country':['USA','France','Slovakia', 'Russia'],
                       'a':[4,8,6,9]})
print (GDP_EN)
Country a
0 USA 4
1 France 8
2 Slovakia 6
3 Russia 9
ScimEn = pd.DataFrame({'Country':['France','Slovakia'],
                       'b':[80,70]})
print (ScimEn)
Country b
0 France 80
1 Slovakia 70
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
print (DF)
Country_x a Country_y b
0 USA 4 France 80
1 France 8 Slovakia 70
DF=pd.merge(GDP_EN,ScimEn,on='Country',how='right')
print (DF)
Country a b
0 France 8 80
1 Slovakia 6 70
If Country is the index in both DataFrames, it works perfectly:
GDP_EN = pd.DataFrame({'Country':['USA','France','Slovakia', 'Russia'],
                       'a':[4,8,6,9]}).set_index('Country')
print (GDP_EN)
a
Country
USA 4
France 8
Slovakia 6
Russia 9
print (GDP_EN.index)
Index(['USA', 'France', 'Slovakia', 'Russia'], dtype='object', name='Country')
ScimEn = pd.DataFrame({'Country':['France','Slovakia'],
                       'b':[80,70]}).set_index('Country')
print (ScimEn)
b
Country
France 80
Slovakia 70
print (ScimEn.index)
Index(['France', 'Slovakia'], dtype='object', name='Country')
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
print (DF)
a b
Country
France 8 80
Slovakia 6 70
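Equivalently, once both frames are indexed by Country, DataFrame.join aligns on the index and should give the same result (a sketch):
DF = GDP_EN.join(ScimEn, how='right')
print (DF)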
