Stacking values by duplicating some other columns in a pandas dataframe? - python

I have a data-frame like this:
df
ID Brands Age Gender City
1 BMW_Audi_VW 50 M Milano
2 VW_BMW 45 F SF
I would like to split the Brands column on "_" and duplicate all the other columns except City.
I can do it based on the ID column like this:
df = df.set_index('ID').stack().str.split('_', expand=True).unstack(-1).stack(0).reset_index()
but it duplicates only the ID column. I need all columns except "City".
Here is the desired output that I am looking for:
ID Brands Age Gender City
1 BMW 50 M Milano
1 Audi 50 M None
1 VW 50 M None
2 VW 45 F SF
2 BMW 45 F None

Use DataFrame.explode on the column values split by Series.str.split, then set None in the repeated rows with DataFrame.mask:
# split Brands into lists, then expand each list element to its own row
df = df.assign(Brands = df['Brands'].str.split('_')).explode('Brands')
# blank every column not listed in include for the repeated rows
include = ['ID','Brands','Age','Gender']
cols = df.columns.difference(include)
# rows created by explode keep the original index label, so duplicated() flags the repeats
df[cols] = df[cols].mask(df.index.to_series().duplicated(), None)
df = df.reset_index(drop=True)
print (df)
ID Brands Age Gender City
0 1 BMW 50 M Milano
1 1 Audi 50 M None
2 1 VW 50 M None
3 2 VW 45 F SF
4 2 BMW 45 F None
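If only one column needs blanking, a shorter variant (a sketch, not from the answer above) can index the duplicated labels directly:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2],
                   'Brands': ['BMW_Audi_VW', 'VW_BMW'],
                   'Age': [50, 45],
                   'Gender': ['M', 'F'],
                   'City': ['Milano', 'SF']})
df = df.assign(Brands=df['Brands'].str.split('_')).explode('Brands')
# rows created by explode share the original label, so flag every repeat
df.loc[df.index.duplicated(), 'City'] = None
df = df.reset_index(drop=True)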
EDIT:
Check difference:
#Brands column is assigned to the Brands column (the same column)
df1 = df.assign(Brands = df['Brands'].str.split('_')).explode('Brands')
print (df1)
ID Brands Age Gender City
0 1 BMW 50 M Milano
0 1 Audi 50 M Milano
0 1 VW 50 M Milano
1 2 VW 45 F SF
1 2 BMW 45 F SF
#Brands column is assigned to the Brands1 column (another column), so explode('Brands') has nothing to expand
df2 = df.assign(Brands1 = df['Brands'].str.split('_')).explode('Brands')
print (df2)
ID Brands Age Gender City Brands1
0 1 BMW_Audi_VW 50 M Milano [BMW, Audi, VW]
1 2 VW_BMW 45 F SF [VW, BMW]
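To actually expand the second frame you would explode the list column itself (a one-line sketch; Brands keeps the original joined strings):
print (df2.explode('Brands1'))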

Related

Creation of DataFrame with specific conditions on rows

Given the following Python pandas DataFrame:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  NaN      NaN    NaN    NaN
832932  France   NaN    30     NaN
31728#  NaN      NaN    NaN    NaN
I would like to make the following modifications for each row:
If the ID column has any '#' value, the row is left unchanged.
If the ID column has no '#' value, and country is NaN, "Other" is added to the country column, and a 0 is added to other column.
Finally, only if the money column is NaN and the other column has a value, we assign money and money_add from the following lookup table (matching other against other_ID):
other_ID  money  money_add
19        4532   723823
50        1213   238232
18        1813   273283
30        1313   83293
0         8932   3920
Example of the resulting table:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  Other    8932   0      3920
832932  France   1313   30     83293
31728#  NaN      NaN    NaN    NaN
First set values in both columns where the two conditions match, then align the lookup table with the filtered non-# rows and fill only the missing values with DataFrame.update:
m1 = df['ID'].str.contains('#')
m2 = df['country'].isna()
# rows without '#' in ID and with missing country get the default values
df.loc[~m1 & m2, ['country','other']] = ['Other',0]
# put both frames on the same key: other_ID in the lookup, other in df
df1 = df1.set_index(df1['other_ID'])
df = df.set_index(df['other'].mask(m1))
# overwrite=False fills only missing values in the shared columns
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print (df)
ID country money other money_add
0 832932 France 12131 19.0 82932.0
1 217#8# NaN NaN NaN NaN
2 1329T2 Other 8932.0 0.0 3920.0
3 832932 France 1313.0 30.0 83293.0
4 31728# NaN NaN NaN NaN
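The snippet assumes the two frames are already loaded; a minimal reproduction (exact dtypes are an assumption) looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['832932', '217#8#', '1329T2', '832932', '31728#'],
                   'country': ['France', np.nan, np.nan, 'France', np.nan],
                   'money': [12131, np.nan, np.nan, np.nan, np.nan],
                   'other': [19, np.nan, np.nan, 30, np.nan],
                   'money_add': [82932, np.nan, np.nan, np.nan, np.nan]})
df1 = pd.DataFrame({'other_ID': [19, 50, 18, 30, 0],
                    'money': [4532, 1213, 1813, 1313, 8932],
                    'money_add': [723823, 238232, 273283, 83293, 3920]})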

Pandas Series and NaN values for mismatched values

I have these two dictionaries:
import pandas as pd

dico = {'Name': ['Arthur','Henri','Lisiane','Patrice','Zadig','Sacha'],
"Age": ["20","18","62","73",'21','20'],
"Studies": ['Economics','Maths','Psychology','Medical','Cinema','CS']
}
dico2 = {'Surname': ['Arthur1','Henri2','Lisiane3','Patrice4']}
dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)
in which I would like to match the Surname column against the Name column and append it to dico, for the following output:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
3 Patrice NaN 73 Medical
4 Zadig NaN 21 Cinema
5 Sacha NaN 20 CS
and ultimately delete the rows for which Surname is NaN:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
from fuzzywuzzy import fuzz

map_list = []
for name in dico['Name']:
    best_ratio = None
    for idx, surname in enumerate(dico2['Surname']):
        if best_ratio == None:
            best_ratio = fuzz.ratio(name, surname)
            best_idx = 0
        else:
            ratio = fuzz.ratio(name, surname)
            if ratio > best_ratio:
                best_ratio = ratio
                best_idx = idx
    map_list.append(dico2['Surname'][best_idx])  # obtain surname
dico['Surname'] = pd.Series(map_list)  # add column
dico = dico[["Name", "Surname", "Age", "Studies"]]  # reorder columns
# if the surname is not a great match, print "NaN"
dico = dico.drop(dico[dico.Surname == "NaN"].index)
but when I print(dico), the output is as follows:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
3 Patrice Patrice4 73 Medical
4 Zadig Patrice4 21 Cinema
5 Sacha Patrice4 20 CS
I don't see why there's a mismatch after the Patrice row; I want those values to be NaN.
Your loop always appends the best match it finds, even when the best ratio is poor, so Zadig and Sacha get paired with Patrice4 instead of NaN; you need a score threshold. Let's try pd.MultiIndex.from_product to create the combinations, assign a score with zip and fuzz.ratio, filter by a threshold to build our dict, then use Series.map and DataFrame.dropna:
from fuzzywuzzy import fuzz
comb = pd.MultiIndex.from_product((dico['Name'],dico2['Surname']))
scores = comb.map(lambda x: fuzz.ratio(*x)) #or fuzz.partial_ratio(*x)
d = dict(a for a,b in zip(comb,scores) if b>90) #change threshold
out = dico.assign(SurName=dico['Name'].map(d)).dropna(subset=['SurName'])
print(out)
Name Age Studies SurName
0 Arthur 20 Economics Arthur1
1 Henri 18 Maths Henri2
2 Lisiane 62 Psychology Lisiane3
3 Patrice 73 Medical Patrice4
You could do the following. Define the function:
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['Surname'] = m
    m2 = df_1['Surname'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['Surname'] = m2
    return df_1
and run
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = fuzzy_merge(dico, dico2, 'Name', 'Surname', threshold=90, limit=2)
This returns:
Name Age Studies Surname
0 Arthur 20 Economics Arthur1
1 Henri 18 Maths Henri2
2 Lisiane 62 Psychology Lisiane3
3 Patrice 73 Medical Patrice4
4 Zadig 21 Cinema
5 Sacha 20 CS
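To reach the asker's final table, the non-matches (which fuzzy_merge leaves as empty strings) still need to be dropped; a one-line sketch:
df = df[df['Surname'] != '']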

filtering data in pandas where string is in multiple columns

I have a dataframe that looks like this:
team_1 score_1 team_2 score_2
AUS 2 SCO 1
ENG 1 ARG 0
JPN 0 ENG 2
I can retrieve all the data for a single team by using:
#list specifying the team of interest
team = ['ENG']
#slice the dataframe to the rows where 'team_1' or 'team_2' is in the list 'team'
df.loc[df['team_1'].isin(team) | df['team_2'].isin(team)]
team_1 score_1 team_2 score_2
ENG 1 ARG 0
JPN 0 ENG 2
How can I now return only the score for my 'team' such as:
team score
ENG 1
ENG 2
Maybe creating an index to each team so as to filter out?
Maybe encoding the team_1 and team_2 columns to filter out?
new_df_1 = df[df.team_1 =='ENG'][['team_1', 'score_1']]
new_df_1 = new_df_1.rename(columns={"team_1":"team", "score_1":"score"})
# team score
# 0 ENG 1
new_df_2 = df[df.team_2 =='ENG'][['team_2', 'score_2']]
new_df_2 = new_df_2.rename(columns={"team_2":"team", "score_2":"score"})
# team score
# 1 ENG 2
then concat two dataframe:
pd.concat([new_df_1, new_df_2])
the output is :
team score
0 ENG 1
1 ENG 2
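If you prefer a clean 0..n index on the combined frame, concat can renumber directly (a small sketch):
pd.concat([new_df_1, new_df_2], ignore_index=True)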
Melt the team columns (keeping the score columns as id_vars), filter for values in team, compute score as the row-wise sum of the score columns, and keep only team and score:
team = ["ENG"]
cols = ["score_1", "score_2"]
(
    df
    .melt(cols, value_name="team")
    .query("team in @team")
    .assign(score=lambda x: x.filter(like="score").sum(axis=1))
    .loc[:, ["team", "score"]]
)
team score
1 ENG 1
5 ENG 2

Filter pandas dataframe based on a column: keep all rows if a value is in that column

So I have a dataframe like the following:
Name Age City
A 21 NY
A 20 DC
A 35 OR
B 18 DC
B 19 PA
I need to keep all the rows for every Name where a specific value appears among that Name's City values. For example, if my target city is NY, then my desired output would be:
Name Age City
A 21 NY
A 20 DC
A 35 OR
Edit1: I am not necessarily looking for a single value. There might be cases where there are multiple cities that I am looking for, for example NY and DC at the same time.
Edit2: I have tried the following, which do not return the correct output (daah):
df = df[df['City'] == 'NY']
and
df = df[df['City'].isin('NY')]
You can create a function - first test City for equality and get all unique matching names, then filter again by isin:
def get_df_by_val(df, val):
    return df[df['Name'].isin(df.loc[df['City'].eq(val), 'Name'].unique())]
print (get_df_by_val(df, 'NY'))
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
print (get_df_by_val(df, 'PA'))
Name Age City
3 B 18 DC
4 B 19 PA
print (get_df_by_val(df, 'OR'))
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
EDIT:
If you need to check multiple values per group, use GroupBy.transform and compare sets with issubset:
vals = ['NY', 'DC']
df1 = df[df.groupby('Name')['City'].transform(lambda x: set(vals).issubset(x))]
print (df1)
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
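If instead any one of the values should be enough to keep a group (Edit1 can be read either way), intersect the sets rather than testing a subset (a sketch):
vals = ['NY', 'DC']
df2 = df[df.groupby('Name')['City'].transform(lambda x: bool(set(vals) & set(x)))]
print (df2)
# B's rows are kept too, because B has DC even though it lacks NY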

Pandas merge fail to extract common Index values

I'm trying to merge 2 DataFrames of different sizes, both are indexed by 'Country'. The first dataframe 'GDP_EN' contains every country in the world, and the second dataframe 'ScimEn' contains 15 countries.
When I try to merge these DataFrames, instead of merging the columns based on the index countries of ScimEn, I get back 'Country_x' and 'Country_y'. 'Country_x' came from GDP_EN and holds its first 15 countries in alphabetical order; 'Country_y' holds the 15 countries from ScimEn. I'm wondering why they didn't merge.
I used:
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
I think both DataFrames are not indexed by Country; Country is a regular column, so add the parameter on='Country':
GDP_EN = pd.DataFrame({'Country':['USA','France','Slovakia', 'Russia'],
                       'a':[4,8,6,9]})
print (GDP_EN)
Country a
0 USA 4
1 France 8
2 Slovakia 6
3 Russia 9
ScimEn = pd.DataFrame({'Country':['France','Slovakia'],
                       'b':[80,70]})
print (ScimEn)
Country b
0 France 80
1 Slovakia 70
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
print (DF)
Country_x a Country_y b
0 USA 4 France 80
1 France 8 Slovakia 70
DF=pd.merge(GDP_EN,ScimEn,on='Country',how='right')
print (DF)
Country a b
0 France 8 80
1 Slovakia 6 70
If Country is the index in both DataFrames, it works perfectly:
GDP_EN = pd.DataFrame({'Country':['USA','France','Slovakia', 'Russia'],
                       'a':[4,8,6,9]}).set_index('Country')
print (GDP_EN)
a
Country
USA 4
France 8
Slovakia 6
Russia 9
print (GDP_EN.index)
Index(['USA', 'France', 'Slovakia', 'Russia'], dtype='object', name='Country')
ScimEn = pd.DataFrame({'Country':['France','Slovakia'],
                       'b':[80,70]}).set_index('Country')
print (ScimEn)
b
Country
France 80
Slovakia 70
print (ScimEn.index)
Index(['France', 'Slovakia'], dtype='object', name='Country')
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
print (DF)
a b
Country
France 8 80
Slovakia 6 70
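Equivalently, once both frames are indexed by Country, DataFrame.join aligns on the index and should give the same result (a sketch):
DF = GDP_EN.join(ScimEn, how='right')
print (DF)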
