Python - expand rows in dataframe n times

I need to write a function that expands a dataframe. For example, the input to the function is:
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', 'phone', 'food', 'bag']
})
Suppose n is 3. Then, for each person in the Name column, I have to add 3 new rows and leave Cart as np.nan. The output should look like this:
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', np.nan, np.nan, np.nan, 'phone', 'food', 'bag', np.nan, np.nan, np.nan]
})
How can I solve this using copy() and append()?

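You can use np.repeat with pd.Series.unique. DataFrame.append was removed in pandas 2.0, so pd.concat is used here. Note that the new rows are added after all the original rows rather than after each person's rows:
n = 3
print (pd.concat([df, pd.DataFrame(np.repeat(df["Name"].unique(), n), columns=["Name"])]))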
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Sasha phone
4 Sasha food
5 Sasha bag
0 Ali NaN
1 Ali NaN
2 Ali NaN
3 Sasha NaN
4 Sasha NaN
5 Sasha NaN
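The output above puts all the NaN rows at the end instead of after each person's rows. A minimal sketch, assuming it is acceptable for the groups to come out in sorted Name order (as they already are in this example), is to follow the concatenation with a stable sort. Mergesort is stable, so each person's original rows stay ahead of the padding rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ali", "Ali", "Ali", "Sasha", "Sasha", "Sasha"],
    "Cart": ["book", "phonecase", "shirt", "phone", "food", "bag"],
})
n = 3

# Padding rows: every unique name repeated n times. Cart is absent here,
# so it is filled with NaN when the frames are concatenated.
extra = pd.DataFrame({"Name": np.repeat(df["Name"].unique(), n)})

# A stable sort on Name keeps the original rows first within each group.
out = (pd.concat([df, extra])
         .sort_values("Name", kind="mergesort", ignore_index=True))
print(out)
```

If the groups must keep their original order of appearance instead of sorted order, a per-group concatenation is the safer route.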

Try this one (it adds n rows to each group of rows with the same Name value):
import pandas as pd
import numpy as np
n = 3
list_of_df_unique_names = [df[df["Name"]==name] for name in df["Name"].unique()]
# DataFrame.append was removed in pandas 2.0, so each group is padded with pd.concat
df2 = pd.concat([pd.concat([d, pd.DataFrame({"Name": np.repeat(d["Name"].values[-1], n)})])
                 for d in list_of_df_unique_names]).reset_index(drop=True)
print(df2)
Output:
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Ali NaN
4 Ali NaN
5 Ali NaN
6 Sasha phone
7 Sasha food
8 Sasha bag
9 Sasha NaN
10 Sasha NaN
11 Sasha NaN
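Since the question asks for a function, the same per-group idea can be sketched with groupby. The function name expand_rows and its key parameter are made up for illustration; sort=False keeps the groups in their order of first appearance:

```python
import pandas as pd

def expand_rows(df, n, key="Name"):
    """Return a copy of df with n extra rows per group; other columns are NaN."""
    pieces = []
    for name, group in df.groupby(key, sort=False):
        # Padding frame holds only the key column; the rest becomes NaN on concat.
        pad = pd.DataFrame({key: [name] * n})
        pieces.append(pd.concat([group, pad]))
    return pd.concat(pieces, ignore_index=True)

df = pd.DataFrame({
    "Name": ["Ali", "Ali", "Ali", "Sasha", "Sasha", "Sasha"],
    "Cart": ["book", "phonecase", "shirt", "phone", "food", "bag"],
})
out = expand_rows(df, 3)
print(out)
```

This sketch assumes df has at least one row; pd.concat raises an error on an empty list of pieces.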

Maybe not the most beautiful of all solutions, but it works. Say that you want to add 4 NaN rows by group. Then, given your df:
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', 'phone', 'food', 'bag']
})
you can create an empty list DF, loop over range(4), filter df by name, and in every pass add one empty row to each group:
DF = []
names = list(set(df.Name))
for i in range(4):
    for name in names:
        gf = df[df['Name'] == name]
        a = pd.concat([gf, gf.groupby('Name')['Cart'].apply(lambda x: x.shift(-1).iloc[-1]).reset_index()]).sort_values('Name').reset_index(drop=True)
        DF.append(a)
DF_full = pd.concat(DF)
Now you'll end up with several copies of your original df, so you need to drop the duplicated rows while keeping the NaN rows:
DFF = DF_full.sort_values(['Name','Cart'])
DFF = DFF[(~DFF.duplicated()) | (DFF['Cart'].isnull())]
which gives:
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Ali NaN
3 Ali NaN
3 Ali NaN
3 Ali NaN
2 Sasha bag
1 Sasha food
0 Sasha phone
3 Sasha NaN
3 Sasha NaN
3 Sasha NaN
3 Sasha NaN

Related

Merge three dataframe, count the matches and add new columns

I have 3 dataframes (df1, df2, df3). I want to merge them on a column and add two new columns: one should say which dataframes matched, and the other how many of them matched.
# df1
data = {'ID': ["M1", "M2", "M3", "M4"],
'Movie': ["Top gun", "Thor", "Batman", "MadMax"],
'Actor' : ["Tom", "Chris", "Bale", "Tom"],
'type': ["Action", "SciFi", "Comic", "SciFi"]}
df1 = pd.DataFrame(data)
# df2
data = {'ID': ["M1", "M2", "M3"],
'highlight': ["Flight school", "Love and thunder", "I am Batman"]}
df2 = pd.DataFrame(data)
# df3
data = {'ID': ["M2", "M3"],
'no of parts': [3, 3],
'co-star' : ["portman", "neeson"],
'award': ["yes", "yes"]}
df3 = pd.DataFrame(data)
In the expected output, match and No of match are the new columns.
Thank you for your time
Any help would be much appreciated
You can merge your three dataframes on ID, then use the indicator parameter to merge to determine which dataframes had valid data, using this info to generate the match column. You can then count the number of | characters in match to determine the No of match column:
import pandas as pd
data = {'ID': ["M1", "M2", "M3", "M4"], 'Movie': ["Top gun", "Thor", "Batman", "MadMax"], 'Actor' : ["Tom", "Chris", "Bale", "Tom"], 'type': ["Action", "SciFi", "Comic", "SciFi"]}
df1 = pd.DataFrame(data)
data = {'ID': ["M1", "M2", "M3"], 'highlight': ["Flight school", "Love and thunder", "I am Batman"]}
df2 = pd.DataFrame(data)
data = {'ID': ["M2", "M3"], 'no of parts': [3, 3], 'co-star' : ["portman", "neeson"], 'award': ["yes", "yes"]}
df3 = pd.DataFrame(data)
df = df1.merge(df2, on='ID', how='left', indicator='df1df2').merge(df3, on='ID', how='left',indicator='df3')
df['match'] = df['df1df2'].map({'both':'df1|df2', 'left_only':'df1'})+df['df3'].map({'both':'|df3', 'left_only':''})
df['No of match'] = df['match'].str.count(r'\|')+1
df = df.drop(['df1df2', 'df3'], axis=1)
Output:
ID Movie Actor type highlight no of parts co-star award match No of match
0 M1 Top gun Tom Action Flight school NaN NaN NaN df1|df2 2
1 M2 Thor Chris SciFi Love and thunder 3.0 portman yes df1|df2|df3 3
2 M3 Batman Bale Comic I am Batman 3.0 neeson yes df1|df2|df3 3
3 M4 MadMax Tom SciFi NaN NaN NaN NaN df1 1
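To see what the indicator parameter contributes, here is a minimal sketch of a single merge step. The column name in_df2 is chosen only for this illustration; for a left join the indicator column holds 'both' where the ID was found in the right frame and 'left_only' where it was not:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": ["M1", "M2", "M3", "M4"],
                    "Movie": ["Top gun", "Thor", "Batman", "MadMax"]})
df2 = pd.DataFrame({"ID": ["M1", "M2", "M3"],
                    "highlight": ["Flight school", "Love and thunder", "I am Batman"]})

# indicator="in_df2" adds a categorical column describing where each row came from
step = df1.merge(df2, on="ID", how="left", indicator="in_df2")
print(step[["ID", "in_df2"]])
```

Mapping those labels to strings such as 'df1|df2' and 'df1' is what builds the match column.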
You can use pandas.concat on a list of the input DataFrames. This will work on any number of input DataFrames (not just 3):
# dataframes will be later named in order: 1->2->3
# you can easily tweak this solution to use a dictionary
# and custom names if desired
dfs = [df1, df2, df3]
out = (pd
.concat([d.set_index('ID').assign(ID=f'df{i}')
for i,d in enumerate(dfs, start=1)], axis=1)
.assign(**{'match': lambda d: d[['ID']].agg(lambda x: '|'.join(x.dropna()),
axis=1),
'No of matches': lambda d: d[['ID']].notna().sum(axis=1)
})
.drop('ID', axis=1).reset_index()
)
NB. This approach uses a temporary ID column, so make sure it is not present among the columns of any input DataFrame. You can choose another name for safety if needed.
output:
ID Movie Actor type highlight no of parts co-star award match No of matches
0 M1 Top gun Tom Action Flight school NaN NaN NaN df1|df2 2
1 M2 Thor Chris SciFi Love and thunder 3.0 portman yes df1|df2|df3 3
2 M3 Batman Bale Comic I am Batman 3.0 neeson yes df1|df2|df3 3
3 M4 MadMax Tom SciFi NaN NaN NaN NaN df1 1
Use DataFrame.merge with a left join and the indicator parameter to see which DataFrames matched. Then use DataFrame.pop to remove each indicator column while mapping it through a dictionary with Series.map, concatenate the mapped df3 column, and finally count the | characters with Series.str.count:
df = (df1.merge(df2, on='ID', how='left', indicator='df2')
.merge(df3, on='ID', how='left', indicator='df3'))
df['match'] = (df.pop('df2').map({'both':'df1|df2', 'left_only':'df1'}) +
df.pop('df3').map({'both':'|df3', 'left_only':''}))
df['No of match'] = df['match'].str.count(r'\|') + 1
print (df)
ID Movie Actor type highlight no of parts co-star award \
0 M1 Top gun Tom Action Flight school NaN NaN NaN
1 M2 Thor Chris SciFi Love and thunder 3.0 portman yes
2 M3 Batman Bale Comic I am Batman 3.0 neeson yes
3 M4 MadMax Tom SciFi NaN NaN NaN NaN
match No of match
0 df1|df2 2
1 df1|df2|df3 3
2 df1|df2|df3 3
3 df1 1
Another idea is to use concat after assigning helper columns df1-df3 filled with True, and then build the match column with DataFrame.dot:
dfs = [df1, df2, df3]
L = [df_.set_index('ID').assign(**{f'df{i}':True}) for i, df_ in enumerate(dfs, start=1)]
df = pd.concat(L, axis=1)
cols = df.filter(regex=r'^df\d').columns
df['match'] = df[cols].fillna(False).dot(cols + '|').str[:-1]
df['No of match'] = df['match'].str.count(r'\|') + 1
df = df.drop(cols, axis=1)
print (df)
Movie Actor type highlight no of parts co-star award \
ID
M1 Top gun Tom Action Flight school NaN NaN NaN
M2 Thor Chris SciFi Love and thunder 3.0 portman yes
M3 Batman Bale Comic I am Batman 3.0 neeson yes
M4 MadMax Tom SciFi NaN NaN NaN NaN
match No of match
ID
M1 df1|df2 2
M2 df1|df2|df3 3
M3 df1|df2|df3 3
M4 df1 1
You can try this one too, merging with reduce and a lambda:
from functools import reduce
dfs = [df1, df2, df3]
df_final = reduce(lambda left,right: pd.merge(left,right,on='ID',how='outer'), dfs)
# use the first non-ID column of each frame to detect whether that frame matched
df_temp = df_final[[df1.columns[1],df2.columns[1],df3.columns[1]]]
df_final["match"] = df_temp.apply(lambda x: "|".join(["df"+str(idx+1) for idx,i in enumerate(x) if pd.notna(i)]),axis=1)
df_final["No of match"] = df_final["match"].apply(lambda x: x.count("|")+1)
Output:
ID Movie Actor type ... co-star award match No of match
0 M1 Top gun Tom Action ... NaN NaN df1|df2 2
1 M2 Thor Chris SciFi ... portman yes df1|df2|df3 3
2 M3 Batman Bale Comic ... neeson yes df1|df2|df3 3
3 M4 MadMax Tom SciFi ... NaN NaN df1 1
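The reduce approach above detects a match by checking whether one data column of each frame is non-null, which would misfire if that column held a genuine NaN. A hedged variant, sketched here with a dictionary called frames (a name introduced only for this example), tests ID membership directly with Series.isin:

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"ID": ["M1", "M2", "M3", "M4"],
                    "Movie": ["Top gun", "Thor", "Batman", "MadMax"]})
df2 = pd.DataFrame({"ID": ["M1", "M2", "M3"],
                    "highlight": ["Flight school", "Love and thunder", "I am Batman"]})
df3 = pd.DataFrame({"ID": ["M2", "M3"], "no of parts": [3, 3]})

frames = {"df1": df1, "df2": df2, "df3": df3}

# Outer-merge every frame on ID
merged = reduce(lambda left, right: left.merge(right, on="ID", how="outer"),
                frames.values())

# One boolean column per input frame: does that frame contain this ID?
flags = pd.DataFrame({name: merged["ID"].isin(frame["ID"])
                      for name, frame in frames.items()})

merged["match"] = flags.apply(
    lambda row: "|".join(name for name, hit in row.items() if hit), axis=1)
merged["No of match"] = flags.sum(axis=1)
print(merged)
```

Because the flags are built from the ID column alone, missing values inside the data columns do not affect the result.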

Matching IDs with names on a pandas DataFrame

I have a DataFrame with 3 columns: ID, BossID and Name. Each row has a unique ID and has a corresponding name. BossID is the ID of the boss of the person in that row. Suppose I have the following DataFrame:
df = pd.DataFrame({'id':[1,2,3,4,5], 'bossId':[np.nan, 1, 2, 2, 3],
'name':['Anne Boe','Ben Coe','Cate Doe','Dan Ewe','Erin Aoi']})
So here, Anne is Ben's boss and Ben Coe is Cate and Dan's boss, etc.
Now, I want to have another column that has the boss's name for each person.
The desired output is:
id boss name boss_name
0 1 NaN Anne NaN
1 2 1.0 Ben Anne
2 3 2.0 Cate Ben
3 4 2.0 Dan Ben
4 5 3.0 Erin Cate
I can get my output using an ugly double for-loop. Is there a cleaner way to obtain the desired output?
This should work:
bossmap = df.set_index('id')['name'].squeeze()
df['boss_name'] = df['bossId'].map(bossmap)
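For reference, the mapping approach can be run end to end on the sample data as follows. This is only a sketch; the variable name id_to_name is introduced here, and the lookup relies on every id being unique, as the question states:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "bossId": [np.nan, 1, 2, 2, 3],
                   "name": ["Anne Boe", "Ben Coe", "Cate Doe", "Dan Ewe", "Erin Aoi"]})

# Series indexed by id, so it works as a lookup table id -> name
id_to_name = df.set_index("id")["name"]

# Each bossId is looked up in that table; a missing bossId stays NaN
df["boss_name"] = df["bossId"].map(id_to_name)
print(df)
```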
You can set id as index, then use pd.Series.reindex
df = df.set_index('id')
df['boss_name'] = df['name'].reindex(df['bossId']).to_numpy() # or .to_list()
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe
Create a separate dataframe with 'id' and 'name'.
Rename 'name' to 'boss_name' and set 'id' as the index.
Merge df with the new dataframe.
import numpy as np
import pandas as pd
# test dataframe
df = pd.DataFrame({'id':[1,2,3,4,5], 'bossId':[np.nan, 1, 2, 2, 3], 'name':['Anne Boe','Ben Coe','Cate Doe','Dan Ewe','Erin Aoi']})
# separate dataframe with id and name
names = df[['id', 'name']].dropna().set_index('id').rename(columns={'name': 'boss_name'})
# merge the two
df = df.merge(names, left_on='bossId', right_index=True, how='left')
# df
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe

Error pd.pivot "MultiIndex.name must be a hashable type"

I'm trying to apply something similar to R's tidyr::spread to my pandas dataframe. I have seen people use pd.pivot for this, but so far I have had no success.
So in this example I have the following dataframe DF:
df = pd.DataFrame({'action_id' : [1,2,1,4,5],
'name': ['jess', 'alex', 'jess', 'cath', 'mary'],
'address': ['house', 'house', 'park', 'park', 'park'],
'date': [ '01/01', '02/01', '03/01', '04/01', '05/01']})
What it looks like:
OK, so what I want is a multi-index pivot table with 'action_id' and 'name' as the index, the address column "spread" into columns, and the cells filled with the 'date' column. So my df would look like this:
What I tried to do was:
df.pivot(index = ['action_id', 'name'], columns = 'address', values = 'date')
And I got the error TypeError: MultiIndex.name must be a hashable type
Does anyone know what I am doing wrong?
You do not need to mention the index in pd.pivot
This will work
import pandas as pd
df = pd.DataFrame({'action_id' : [1,2,1,4,5],
'name': ['jess', 'alex', 'jess', 'cath', 'mary'],
'address': ['house', 'house', 'park', 'park', 'park'],
'date': [ '01/01', '02/01', '03/01', '04/01', '05/01']})
# index is simply left out, so the existing row index is used
df = pd.concat([df, pd.pivot(data=df, columns='address', values='date')], axis=1) \
.reset_index(drop=True).drop(['address','date'], axis=1)
print(df)
action_id name house park
0 1 jess 01/01 NaN
1 2 alex 02/01 NaN
2 1 jess NaN 03/01
3 4 cath NaN 04/01
4 5 mary NaN 05/01
And to arrive at what you want, you need to do a groupby
df = df.groupby(['action_id','name']).agg({'house':'first','park':'first'}).reset_index()
print(df)
action_id name house park
0 1 jess 01/01 03/01
1 2 alex 02/01 NaN
2 4 cath NaN 04/01
3 5 mary NaN 05/01
Don't forget to accept the answer if it helped you.
Another option:
df2 = df.set_index(['action_id','name','address']).date.unstack().reset_index()
df2.columns.name = None
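For what it is worth, DataFrame.pivot accepts a list for index from pandas 1.1 onward, so on a recent version the call from the question should work as written. A minimal sketch, assuming pandas 1.1 or later (the variable name wide is chosen only for this example):

```python
import pandas as pd

df = pd.DataFrame({"action_id": [1, 2, 1, 4, 5],
                   "name": ["jess", "alex", "jess", "cath", "mary"],
                   "address": ["house", "house", "park", "park", "park"],
                   "date": ["01/01", "02/01", "03/01", "04/01", "05/01"]})

# A list of index columns is supported by pivot in pandas >= 1.1
wide = df.pivot(index=["action_id", "name"], columns="address", values="date")
wide.columns.name = None   # drop the leftover 'address' label on the columns
wide = wide.reset_index()  # turn action_id and name back into ordinary columns
print(wide)
```

pivot raises an error if the same (action_id, name, address) combination appears more than once; in that case pivot_table or the groupby approach is needed.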

How delete rows from dataframe after comparison

I want to filter my dataframe using only the part of another dataframe (filter) that satisfies a condition, and I don't know how to do it.
import numpy as np
import pandas as pd
table = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],
'rating': [3., 4., 5., np.nan, np.nan, np.nan],
'name': ['John', 'Paul', 'Adam', 'Graham', 'Eva', 'Thomas']})
filter = pd.DataFrame({'name': ['John', 'Paul','Adam', 'Graham', 'Eva', 'Thomas'],
'qty': [1, 1, 3, 10, 7, 5]})
>>> table
movie name rating
0 thg John 3
1 thg Paul 4
3 mol Adam 5
4 mol Graham NaN
5 lob Eva NaN
6 lob Thomas NaN
I know that this doesn't work, but I can't figure out how to fix it. Please help:
result=df[(df['name'] == filter[qty<3]) ]
>>> result
movie name rating
0 thg John 3
1 thg Paul 4
I believe you need:
table[table['name'].isin(filt.loc[filt['qty']<3,'name'])]
movie rating name
0 thg 3.0 John
1 thg 4.0 Paul
Note: I have renamed the filter variable to filt, since filter is a built-in function and you should not use it as a variable name.
You can try the following. I am undeleting this answer because it does not use loc, unlike the other answers, although it is essentially the same:
result = table[table['name'].isin(filter[filter['qty']<3]['name'].values)]
Use Series.isin with a callable (the filter DataFrame is named df_filter here):
table[table['name'].isin(df_filter.loc[lambda x: x['qty']<3, 'name'])]
movie rating name
0 thg 3.0 John
1 thg 4.0 Paul
or DataFrame.merge
table.merge(df_filter.loc[lambda x: x['qty'].lt(3), ['name']])
I have a more concise solution, why not use query:
df=table.merge(filter,on="name")
df.query("qty<3")
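Putting the isin approach together on the sample data gives a self-contained sketch. The names filt and allowed are chosen for this illustration, with filt replacing filter to avoid shadowing the built-in:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({"movie": ["thg", "thg", "mol", "mol", "lob", "lob"],
                      "rating": [3., 4., 5., np.nan, np.nan, np.nan],
                      "name": ["John", "Paul", "Adam", "Graham", "Eva", "Thomas"]})
filt = pd.DataFrame({"name": ["John", "Paul", "Adam", "Graham", "Eva", "Thomas"],
                     "qty": [1, 1, 3, 10, 7, 5]})

# Names whose qty is below 3
allowed = filt.loc[filt["qty"] < 3, "name"]

# Keep only the rows of table whose name is in that set
result = table[table["name"].isin(allowed)]
print(result)
```

Unlike the merge-based answers, this leaves the columns of table unchanged; no qty column is added to the result.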

Update values in pandas dataframe

I would like to update fields in my dataframe :
df = pd.DataFrame([
{'Name': 'Paul', 'Book': 'Plane', 'Cost': 22.50},
{'Name': 'Jean', 'Book': 'Harry Potter', 'Cost': 2.50},
{'Name': 'Jim', 'Book': 'Sponge bob', 'Cost': 5.00}
])
Book Cost Name
0 Plane 22.5 Paul
1 Harry Potter 2.5 Jean
2 Sponge bob 5.0 Jim
I want to change the names using this dictionary:
{"Paul": "Paula", "Jim": "Jimmy"}
to get this result :
Book Cost Name
0 Plane 22.5 Paula
1 Harry Potter 2.5 Jean
2 Sponge bob 5.0 Jimmy
Any ideas?
I think you need replace with the dictionary d:
d = {"Paul": "Paula", "Jim": "Jimmy"}
df.Name = df.Name.replace(d)
print (df)
Book Cost Name
0 Plane 22.5 Paula
1 Harry Potter 2.5 Jean
2 Sponge bob 5.0 Jimmy
Another solution uses map and combine_first. map returns NaN where there is no match, so those values need to be replaced by the original ones:
df.Name = df.Name.map(d).combine_first(df.Name)
print (df)
Book Cost Name
0 Plane 22.5 Paula
1 Harry Potter 2.5 Jean
2 Sponge bob 5.0 Jimmy
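The two approaches can be compared side by side in a short sketch. The variable names replaced and mapped are introduced only for this example, and fillna is used here in place of combine_first as an equivalent way to restore the unmatched names:

```python
import pandas as pd

df = pd.DataFrame([
    {"Name": "Paul", "Book": "Plane", "Cost": 22.50},
    {"Name": "Jean", "Book": "Harry Potter", "Cost": 2.50},
    {"Name": "Jim", "Book": "Sponge bob", "Cost": 5.00},
])
d = {"Paul": "Paula", "Jim": "Jimmy"}

# replace swaps whole values that equal a dictionary key and leaves the rest alone
replaced = df["Name"].replace(d)

# map yields NaN for names missing from d, so fall back to the original value
mapped = df["Name"].map(d).fillna(df["Name"])

print(replaced.tolist())
print(mapped.tolist())
```

Both give the same result here. The map variant would behave differently only if Name already contained missing values or d mapped a name to NaN.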
