Say I have the following dataframe
c1 c2 c3
p  x  1
n  x  2
n  y  1
p  y  2
p  y  1
n  x  2
etc. I then want this in the following format:
p  n  x  y
4  5  5  4
i.e., I want to sum column 3 for each group in columns 1 & 2, but I don't want the unique combinations of columns 1 & 2, which is what grouping by both columns and summing the third would give. Is there any way to do this using groupby?
As Karan said, just call groupby on each of your label columns separately, then concatenate (and transpose) the results:
import pandas as pd
df = pd.DataFrame([['p', 'x', 1],
                   ['n', 'x', 2],
                   ['n', 'y', 1],
                   ['p', 'y', 2],
                   ['p', 'y', 1],
                   ['n', 'x', 2]])
df.columns = ['c1', 'c2', 'c3']
sums1 = df.groupby('c1')[['c3']].sum()  # select c3 so the other label column is not summed
sums2 = df.groupby('c2')[['c3']].sum()
sums = pd.concat([sums1, sums2]).T
sums
    n  p  x  y
c3  5  4  5  4
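An equivalent route, for what it's worth, is to melt the two label columns into a single column first, so one groupby covers both; this is a sketch of that idea:

```python
import pandas as pd

df = pd.DataFrame([['p', 'x', 1],
                   ['n', 'x', 2],
                   ['n', 'y', 1],
                   ['p', 'y', 2],
                   ['p', 'y', 1],
                   ['n', 'x', 2]],
                  columns=['c1', 'c2', 'c3'])

# stack c1 and c2 into one 'value' column, then sum c3 per label
sums = df.melt(id_vars='c3').groupby('value')['c3'].sum()
print(sums.to_dict())  # {'n': 5, 'p': 4, 'x': 5, 'y': 4}
```

Each original row appears twice after the melt (once under its c1 label, once under its c2 label), which is exactly what lets a single groupby produce all four sums.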
Related
I have python list:
my_list = [1, 'V']
I have pd.Dataframe:
A B C
0 f v b
1 f i n
2 f i m
I need to create a new column in my dataframe with value = my_list:
A B C D
0 f v b [1, 'V']
1 f i n [1, 'V']
2 f i m [1, 'V']
As far as I understand, Python lists can be cell values, because df.groupby with apply(list) produces them:
df = df.groupby(['A', 'B'], group_keys=True)['C'].apply(list).reset_index(name='H')
A B H
0 f i [n, m]
1 f v [b]
Is it possible without converting my_list's type? What is the easiest way to do that?
I tried:
df['D'] = my_list
df['D'] = pd.Series(my_list)
but they did not meet my expectations
You can try using np.repeat: wrap the list in a one-element object array so it is repeated as a whole, and set the repeat count to the number of rows, which can be found from the shape of the dataframe.
import numpy as np
import pandas as pd

my_list = [1, 'V']
df = pd.DataFrame({'col1': ['f', 'f', 'f'], 'col2': ['v', 'i', 'i'], 'col3': ['b', 'n', 'm']})

# np.repeat(my_list, df.shape[0]) would flatten the list and repeat each
# element; an object-array wrapper repeats the list object itself
row = np.empty(1, dtype=object)
row[0] = my_list
df['new_col'] = np.repeat(row, df.shape[0])
This repeats my_list once for every row, so each cell of new_col holds [1, 'V'].
You can do it by creating a new array with my_list through hstack and then forming a new DataFrame. The code below has been tested and works fine.
import numpy as np
import pandas as pd
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.array([1, 'V']).repeat(3).reshape(2,3).transpose()
df = pd.DataFrame(np.hstack((a1,a2)))
Edit: Another code that has been tested is:
import pandas as pd
import numpy as np
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.squeeze(np.dstack((np.array(1).repeat(3), np.array('V').repeat(3))))
df = pd.DataFrame(np.hstack((a1,a2)))
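Note that the hstack variants above spread the two values over extra columns. If the goal is literally one cell per row holding the list [1, 'V'], a plain Python list of lists is the simplest route; a sketch:

```python
import pandas as pd

my_list = [1, 'V']
df = pd.DataFrame({'A': ['f', 'f', 'f'], 'B': ['v', 'i', 'i'], 'C': ['b', 'n', 'm']})

# one copy of the list per row; the resulting column has dtype object
df['D'] = [my_list] * len(df)
print(df['D'].iloc[0])  # [1, 'V']
```

All three cells reference the same list object here; use [list(my_list) for _ in range(len(df))] instead if each row needs an independent copy.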
This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
There is a similar question with a solution that does not fully fit my needs, and I do not understand all the details of that solution, so I am not able to adapt it to my situation.
This is my initial dataframe where all unique values in the Y column should become a column.
Y P v
0 A X 0
1 A Y 1
2 B X 2
3 B Y 3
4 C X 4
5 C Y 5
The result should look like this, where P is the first column (it could also be the index), so P can be understood as a row heading; the values from Y become the column headings, and the values from v now fill the cells.
P A B C
0 X 0 2 4
1 Y 1 3 5
Not working approach
This is based on https://stackoverflow.com/a/52082963/4865723
new_index = ['Y', df.groupby('Y').cumcount()]
final = df.set_index(new_index)
final = final['P'].unstack('Y')
print(final)
The problem here is that the index (or first column) does not contain the values from Y and the v column is totally gone.
Y A B C
0 X X X
1 Y Y Y
My own unfinished idea
>>> df.groupby('Y').agg(list)
P v
Y
A [X, Y] [0, 1]
B [X, Y] [2, 3]
C [X, Y] [4, 5]
I do not know if this help or how to go further from this point on.
The full MWE
#!/usr/bin/env python3
import pandas as pd
# initial data
df = pd.DataFrame({
    'Y': ['A', 'A', 'B', 'B', 'C', 'C'],
    'P': list('XYXYXY'),
    'v': range(6)
})
print(df)
# final result I want
final = pd.DataFrame({
    'P': list('XY'),
    'A': [0, 1],
    'B': [2, 3],
    'C': [4, 5]
})
print(final)
# approach based on:
# https://stackoverflow.com/a/52082963/4865723
new_index = ['Y', df.groupby('Y').cumcount()]
final = df.set_index(new_index)
final = final['P'].unstack('Y')
print(final)
You don't need anything complex, this is a simple pivot:
df.pivot(index='P', columns='Y', values='v').reset_index()
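Applied to the MWE above, this reproduces the desired frame; the only cosmetic leftover is the Y name on the column axis, which rename_axis can drop:

```python
import pandas as pd

df = pd.DataFrame({
    'Y': ['A', 'A', 'B', 'B', 'C', 'C'],
    'P': list('XYXYXY'),
    'v': range(6)
})

# values from Y become columns, P becomes an ordinary column again
final = (df.pivot(index='P', columns='Y', values='v')
           .reset_index()
           .rename_axis(columns=None))
print(final)
#    P  A  B  C
# 0  X  0  2  4
# 1  Y  1  3  5
```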
This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 1 year ago.
I have a pandas dataframe with unique values in ID column.
df = pd.DataFrame({'ID': ['A', 'B', 'C'],
                   'STAT': ['X', 'X', 'X'],
                   'IN1': [1, 3, 7],
                   'IN2': [2, 5, 8],
                   'IN3': [3, 6, 9]})
I need to create a new dataframe where I have a row for each value in IN1, IN2 and IN3 with corresponding ID and STAT:
df_new = pd.DataFrame({'IN': [1, 2, 3, 3, 5, 6, 7, 8, 9],
                       'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                       'STAT': ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X']})
You can use pandas.wide_to_long:
(pd.wide_to_long(df, ['IN'], j='to_drop', i='ID')
.droplevel('to_drop')
.sort_index()
.reset_index()
)
output:
ID STAT IN
0 A X 1
1 A X 2
2 A X 3
3 B X 3
4 B X 5
5 B X 6
6 C X 7
7 C X 8
8 C X 9
You can use melt
df.melt(id_vars=['ID','STAT'], value_name='IN')
Gives:
ID STAT variable IN
0 A X IN1 1
1 B X IN1 3
2 C X IN1 7
3 A X IN2 2
4 B X IN2 5
5 C X IN2 8
6 A X IN3 3
7 B X IN3 6
8 C X IN3 9
To match the required output, sort by ID and drop the variable column:
(df.melt(id_vars=['ID','STAT'], value_name='IN')
.sort_values(by='ID')
.drop('variable', axis=1)
)
Gives the same rows as the wide_to_long answer.
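To reproduce df_new exactly (column order IN, ID, STAT and a fresh index), the melt result can be sorted on ID plus the original column name; a stable sort keeps the IN1..IN3 order within each ID:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C'],
                   'STAT': ['X', 'X', 'X'],
                   'IN1': [1, 3, 7],
                   'IN2': [2, 5, 8],
                   'IN3': [3, 6, 9]})

df_new = (df.melt(id_vars=['ID', 'STAT'], value_name='IN')
            .sort_values(['ID', 'variable'], kind='stable')
            .drop(columns='variable')
            .reset_index(drop=True)
            [['IN', 'ID', 'STAT']])
print(df_new)
```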
I was going through this link: Return top N largest values per group using pandas
and found multiple ways to find the topN values per group.
However, I prefer the dictionary method with the agg function, and would like to know whether it is possible to get an equivalent of the dictionary method for the following problem.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})
print(df)
A B C D
0 1 1 10 X
1 1 1 20 Y
2 1 2 30 X
3 2 2 40 Y
4 2 1 50 Y
I can do this:
df1 = df.groupby(['A'])['C'].nlargest(2).droplevel(-1).reset_index()
print(df1)
A C
0 1 30
1 1 20
2 2 50
3 2 40
# also this
df1 = df.sort_values('C', ascending=False).groupby('A', sort=False).head(2)
print(df1)
# also this
df.set_index('C').groupby('A')['B'].nlargest(2).reset_index()
Required
df.groupby('A',as_index=False).agg(
{'C': lambda ser: ser.nlargest(2) # something like this
})
Is it possible to use the dictionary here?
If you want a dictionary mapping each value of A to its 2 top values from C, you can run:
df.groupby(['A'])['C'].apply(lambda x: x.nlargest(2).tolist()).to_dict()
For your DataFrame, the result is:
{1: [30, 20], 2: [50, 40]}
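For completeness, the dictionary form from the question does run if the lambda reduces each group to a single object, e.g. a list, since agg needs one value per group; whether a list-valued column suits your downstream use is a separate question. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})

# each group's C values are reduced to one list, so agg accepts the dict
out = df.groupby('A', as_index=False).agg({'C': lambda s: s.nlargest(2).tolist()})
print(out)
#    A         C
# 0  1  [30, 20]
# 1  2  [50, 40]
```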
I want to replace certain values in a dataframe containing multiple categoricals.
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
If I apply .replace on a single column, the result is as expected:
>>> df.s1.replace('a', 1)
0 1
1 b
2 c
Name: s1, dtype: object
If I apply the same operation to the whole dataframe, an error is shown (short version):
>>> df.replace('a', 1)
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
During handling of the above exception, another exception occurred:
ValueError: Wrong number of dimensions
If the dataframe contains integers as categories, the following happens:
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
>>> df.replace(1, 3)
s1 s2
0 3 3
1 2 3
2 3 4
But,
>>> df.replace(1, 2)
ValueError: Wrong number of dimensions
What am I missing?
Without digging, that seems to be buggy to me.
My Work Around
pd.DataFrame.apply with pd.Series.replace
This has the advantage that you don't need to mess with changing any types.
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
df.apply(pd.Series.replace, to_replace=1, value=2)
s1 s2
0 2 2
1 2 3
2 3 4
Or
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.apply(pd.Series.replace, to_replace='a', value=1)
s1 s2
0 1 1
1 b c
2 c d
#cᴏʟᴅsᴘᴇᴇᴅ's Work Around
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.applymap(str).replace('a', 1)
s1 s2
0 1 1
1 b c
2 c d
The reason for this behavior is that each column has a different set of categories:
In [224]: df.s1.cat.categories
Out[224]: Index(['a', 'b', 'c'], dtype='object')
In [225]: df.s2.cat.categories
Out[225]: Index(['a', 'c', 'd'], dtype='object')
so if you replace with a value that exists in both sets of categories, it will work:
In [226]: df.replace('d','a')
Out[226]:
s1 s2
0 a a
1 b c
2 c a
As a solution, you might want to make your columns categorical manually, using:
pd.Categorical(..., categories=[...])
where categories contains all possible values for all columns.
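A related sketch that sidesteps replace entirely: since replacing 'a' by 1 really means relabeling a category, Series.cat.rename_categories (which maps old category labels to new ones) can be applied per column, and it works regardless of each column's category set:

```python
import pandas as pd

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')

# rename the category 'a' to 1 in every column; categories not listed in
# the mapping are left unchanged
out = df.apply(lambda s: s.cat.rename_categories({'a': 1}))
print(out)
#   s1 s2
# 0  1  1
# 1  b  c
# 2  c  d
```

Both columns stay categorical, just with 1 in place of 'a' among their categories.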