Reshaping Pandas dataframe grouping variables - python

I have a pandas DataFrame in the following format
ID Name
0 1 Jim
1 1 Jimmy
2 2 Mark
3 2 Marko
4 3 Sergi
5 3 Sergi
I want to reshape the dataframe in the following format
ID Name_1 Name_2
0 1 Jim Jimmy
1 2 Mark Marko
2 3 Sergi Sergi
So that I can compare the two names. I am unable to use pd.pivot or pd.pivot_table for this requirement.
Should be fairly simple. Please, can you suggest how to do this?

You can use cumcount with pivot, then add_prefix to rename the columns:
df['groups'] = df.groupby('ID').cumcount() + 1
df = df.pivot(index='ID', columns='groups', values='Name').add_prefix('Name_')
print (df)
groups Name_1 Name_2
ID
1 Jim Jimmy
2 Mark Marko
3 Sergi Sergi
Another solution uses groupby and unstack, then add_prefix to rename the columns:
df1 = df.groupby('ID')["Name"] \
        .apply(lambda x: pd.Series(x.values)) \
        .unstack(1) \
        .rename(columns=lambda x: x+1) \
        .add_prefix('Name_')
print (df1)
Name_1 Name_2
ID
1 Jim Jimmy
2 Mark Marko
3 Sergi Sergi
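For reference, here is a self-contained sketch of the cumcount + pivot approach that rebuilds the sample data from the question. The names `position`, `wide` and `same_name` are made up for illustration, and the final comparison column is only one guess at what "compare the two names" might mean.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3],
    "Name": ["Jim", "Jimmy", "Mark", "Marko", "Sergi", "Sergi"],
})

# Number each row within its ID group: 1, 2, ...
position = df.groupby("ID").cumcount() + 1

# One row per ID, one column per position within the group.
wide = (
    df.assign(groups=position)
      .pivot(index="ID", columns="groups", values="Name")
      .add_prefix("Name_")
      .rename_axis(columns=None)   # drop the leftover "groups" label
      .reset_index()
)

# Hypothetical follow-up: flag the IDs whose two names are identical.
wide["same_name"] = wide["Name_1"] == wide["Name_2"]
print(wide)
```

If some ID has only one row, the missing position is filled with NaN, so the comparison for that ID evaluates to False.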

Related

Python merging data frames and renaming column values

In python, I have a df that looks like this
Name ID
Anna 1
Sarah 2
Max 3
And a df that looks like this
Name ID
Dan 1
Hallie 2
Cam 3
How can I merge the df’s so that the ID column looks like this
Name ID
Anna 1
Sarah 2
Max 3
Dan 4
Hallie 5
Cam 6
This is just a minimal reproducible example. My actual data set has thousands of values. I'm basically merging data frames and want the IDs in numerical order (continuing from the previous data frame) instead of restarting from one each time.
Use pd.concat:
out = pd.concat([df1, df2.assign(ID=df2['ID'] + df1['ID'].max())], ignore_index=True)
print(out)
# Output
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
Concatenate the two DataFrames, reset_index and use the new index to assign "ID"s
df_new = pd.concat((df1, df2)).reset_index(drop=True)
df_new['ID'] = df_new.index + 1
Output:
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
You can concat dataframes with ignore_index=True and then set ID column:
df = pd.concat([df1, df2], ignore_index=True)
df['ID'] = df.index + 1
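The answers above take two slightly different routes, and the difference only shows up when the IDs are not a clean 1..n sequence. A small sketch on the sample data from the question (the names `shifted` and `renumbered` are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Anna", "Sarah", "Max"], "ID": [1, 2, 3]})
df2 = pd.DataFrame({"Name": ["Dan", "Hallie", "Cam"], "ID": [1, 2, 3]})

# Option A: shift the second frame's IDs past the largest ID in the first.
# Any gaps in df2's IDs are preserved, just offset.
shifted = pd.concat(
    [df1, df2.assign(ID=df2["ID"] + df1["ID"].max())],
    ignore_index=True,
)

# Option B: discard the old IDs and renumber every row from 1.
renumbered = pd.concat([df1, df2], ignore_index=True)
renumbered["ID"] = renumbered.index + 1

print(shifted)
print(renumbered)
```

On this data both options give IDs 1 through 6. Option B is probably the simpler choice if the original ID values carry no meaning of their own.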

How to count the most popular value from multiple value pandas column

I have the following problem:
I have a pandas dataframe with shop IDs and shop categories, looking something like this:
id cats
0 10002718 182,45001,83079
1 10004056 9798
2 10009726 17,45528
3 10009752 64324,17
4 1001107 44607,83520,76557
... ... ...
24922 9992184 45716
24923 9997866 77063
24924 9998461 45001,44605,3238,72627,83785
24925 9998954 69908,78574,77890
24926 9999728 45653,44605,83648,85023,84481,68822
So the problem is that each shop can have multiple categories, and the task is to count the frequency of each category. What's the easiest way to do it?
In the end I need a dataframe with these columns:
cats count
0 1 133
1 2 1
2 3 15
3 4 12
Use Series.str.split with Series.explode and Series.value_counts:
df1 = (df['cats'].str.split(',')
                 .explode()
                 .value_counts()
                 .rename_axis('cats')
                 .reset_index(name='count'))
Or pass expand=True to split into a DataFrame, then use DataFrame.stack:
df1 = (df['cats'].str.split(',', expand=True)
                 .stack()
                 .value_counts()
                 .rename_axis('cats')
                 .reset_index(name='count'))
print (df1.head(10))
cats count
0 17 2
1 44605 2
2 45001 2
3 83520 1
4 64324 1
5 44607 1
6 45653 1
7 69908 1
8 83785 1
9 83079 1
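A runnable sketch of the split + explode + value_counts chain, using only the first five rows shown in the question. The extra `.str.strip()` is an assumption in case the real data has spaces after the commas. The category values stay strings here.

```python
import pandas as pd

# First five rows of the sample data from the question.
df = pd.DataFrame({
    "id": [10002718, 10004056, 10009726, 10009752, 1001107],
    "cats": ["182,45001,83079", "9798", "17,45528", "64324,17",
             "44607,83520,76557"],
})

counts = (
    df["cats"]
      .str.split(",")      # each cell becomes a list of category strings
      .explode()           # one row per (shop, category) pair
      .str.strip()         # assumption: tolerate "17, 45528" style input
      .value_counts()      # frequency of each category, most common first
      .rename_axis("cats")
      .reset_index(name="count")
)
print(counts)
```

If numeric categories are needed later, `counts["cats"].astype(int)` should work as long as every value is a clean integer string.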

Rename multi index dataframe pandas [duplicate]

I have a multiIndex pandas dataframe.
Issue
high med low
name age empId
Jack 44 Ab1 0 1 0
Bob 34 Ab2 0 0 1
Mike 52 Ab6 1 1 0
When I'm executing df.columns I'm getting the following result:-
MultiIndex(levels=[['Issue'], ['high', 'med', 'low']],
labels=[[0, 0, 0], [0, 1, 2]])
I'm looking to flatten this dataframe by renaming the Multi_index issue columns.
Expected output df:
name age empId Issue_high Issue_med Issue_low
Jack 44 Ab1 0 1 0
Bob 34 Ab2 0 0 1
Mike 52 Ab6 1 1 0
I tried this:
df2 = df.rename(columns={'high':'Issue_high','low':'Issue_low','med':'Issue_med'}, level = 1)
I'm getting an error:
rename() got an unexpected keyword argument "level"
Is there any way to get the output structure?
Edit: By using df.columns = df.columns.map('_'.join) I'm getting
Issue_high Issue_med Issue_low
name age empId
Jack 44 Ab1 0 1 0
Bob 34 Ab2 0 0 1
Mike 52 Ab6 1 1 0
df.columns
>>> Index(['Issue_high',
'Issue_med', 'Issue_low'],
dtype='object')
Use Index.map with join if all values are strings:
df.columns = df.columns.map('_'.join)
Or use format or f-strings, which also work when the levels contain numeric values:
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
Finally, use DataFrame.reset_index to convert the MultiIndex in the index to columns:
df = df.reset_index()
print (df)
name age empId Issue_high Issue_med Issue_low
0 Jack 44 Ab1 0 1 0
1 Bob 34 Ab2 0 0 1
2 Mike 52 Ab6 1 1 0
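To try this end to end, the frame from the question can be rebuilt as below. This is only a sketch: how the original frame was constructed is not shown in the question, so the `MultiIndex` construction here is an assumption. The list comprehension with `str()` is just another way of writing the format-based variants above.

```python
import pandas as pd

# Rebuild a frame shaped like the one in the question (construction assumed).
index = pd.MultiIndex.from_tuples(
    [("Jack", 44, "Ab1"), ("Bob", 34, "Ab2"), ("Mike", 52, "Ab6")],
    names=["name", "age", "empId"],
)
columns = pd.MultiIndex.from_product([["Issue"], ["high", "med", "low"]])
df = pd.DataFrame([[0, 1, 0], [0, 0, 1], [1, 1, 0]],
                  index=index, columns=columns)

# Join the two column levels; str() keeps this safe for non-string labels.
df.columns = ["_".join(map(str, col)) for col in df.columns]

# Move the three index levels back into ordinary columns.
flat = df.reset_index()
print(flat)
```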

Python Help Pandas row and Column

Hi I am kind of new to python, but I have a dataframe like this:
ID NAME NAME1 VALUE
1 Sarah orange 5
1 Roger apple 3
2 Amy pineapple 2
2 Kia pear 8
I want it like this:
ID NAME NAME1 VALUE NAME NAME1 VALUE
1 Sarah orange 5 Roger apple 3
2 Amy pineapple 2 Kia pear 8
I am using pandas but not sure how I can achieve this and write to a csv. Any help would be highly appreciated! Thanks!
Use set_index with cumcount to build a MultiIndex, reshape with unstack, sort the column MultiIndex by its second level with sort_index, and finally flatten it with a list comprehension and reset_index:
df = df.set_index(['ID',df.groupby('ID').cumcount()]).unstack().sort_index(axis=1, level=1)
#python 3.6+
df.columns = [f'{a}_{b}' for a, b in df.columns]
#python below 3.6
#df.columns = ['{}_{}'.format(a,b) for a, b in df.columns]
df = df.reset_index()
print (df)
ID NAME_0 NAME1_0 VALUE_0 NAME_1 NAME1_1 VALUE_1
0 1 Sarah orange 5 Roger apple 3
1 2 Amy pineapple 2 Kia pear 8
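The question also asks about writing the result to a CSV. A self-contained sketch follows. It uses `to_csv` without a path, which returns the CSV text as a string, and the file name in the comment is hypothetical.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "NAME": ["Sarah", "Roger", "Amy", "Kia"],
    "NAME1": ["orange", "apple", "pineapple", "pear"],
    "VALUE": [5, 3, 2, 8],
})

wide = (
    df.set_index(["ID", df.groupby("ID").cumcount()])  # (ID, position in group)
      .unstack()                                       # positions become columns
      .sort_index(axis=1, level=1)                     # keep each position together
)
wide.columns = [f"{a}_{b}" for a, b in wide.columns]   # flatten the MultiIndex
wide = wide.reset_index()

# With no path, to_csv returns the CSV text instead of writing a file.
csv_text = wide.to_csv(index=False)
print(csv_text)

# To write a file instead (hypothetical file name):
# wide.to_csv("output.csv", index=False)
```

Note that CSV headers should be unique, so the suffixed names (`NAME_0`, `NAME_1`, ...) are kept rather than repeating `NAME NAME1 VALUE` as in the desired output.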

how to merge two dataframes and sum the values of columns

I have two dataframes
df1
Name class value
Sri 1 5
Ram 2 8
viv 3 4
df2
Name class value
Sri 1 5
viv 4 4
My desired output is,
df,
Name class value
Sri 2 10
Ram 2 8
viv 7 8
Please help, thanks in advance!
I think you need set_index on both DataFrames, then add, and finally reset_index:
df = df1.set_index('Name').add(df2.set_index('Name'), fill_value=0).reset_index()
print (df)
Name class value
0 Ram 2.0 8.0
1 Sri 2.0 10.0
2 viv 7.0 8.0
If values in Name are not unique use groupby and aggregate sum:
df = df1.groupby('Name').sum().add(df2.groupby('Name').sum(), fill_value=0).reset_index()
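The float values in the output above appear because aligning on Name introduces missing values for Ram before `fill_value=0` is applied. A sketch of the same approach with a cast back to integers (the `astype(int)` step and the name `summed` are additions, not part of the answer above):

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({"Name": ["Sri", "Ram", "viv"],
                    "class": [1, 2, 3], "value": [5, 8, 4]})
df2 = pd.DataFrame({"Name": ["Sri", "viv"],
                    "class": [1, 4], "value": [5, 4]})

summed = (
    df1.set_index("Name")
       .add(df2.set_index("Name"), fill_value=0)  # missing names count as 0
       .astype(int)                               # undo the float upcast
       .reset_index()
)
print(summed)
```

The cast is only safe when every column holds whole numbers. Leave it out if the real data contains fractional values.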
pd.concat + groupby + sum
You can concatenate your individual dataframes and then group by your key column:
df = pd.concat([df1, df2])\
       .groupby('Name')[['class', 'value']]\
       .sum().reset_index()
print(df)
Name class value
0 Ram 2 8
1 Sri 2 10
2 viv 7 8
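A self-contained version of the concat + groupby + sum approach on the sample data. The name `total` is made up for illustration, and `as_index=False` is used here in place of the trailing `reset_index()`.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({"Name": ["Sri", "Ram", "viv"],
                    "class": [1, 2, 3], "value": [5, 8, 4]})
df2 = pd.DataFrame({"Name": ["Sri", "viv"],
                    "class": [1, 4], "value": [5, 4]})

total = (
    pd.concat([df1, df2])                  # stack the rows of both frames
      .groupby("Name", as_index=False)     # keep Name as a regular column
      [["class", "value"]]                 # list selection (double brackets)
      .sum()
)
print(total)
```

Because nothing is aligned against missing rows, the integer dtypes are preserved, and duplicate names within a single frame are handled without extra steps.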
