This question already has answers here: How can I pivot a dataframe?
I have a dataframe like this:
   customer_id product_id  sum
0            1          A   20
1            2          A   10
2            2          C   30
3            1          B   40
I want to transform it into a wide (mostly sparse) dataframe where each customer is a row and the columns are the decoded products, in order. Something like this:
   customer_id  Product_A_Sum  Product_B_Sum  Product_C_Sum
0            1             20             40              0
1            2             10              0             30
I have a brute-force solution, something like this (list_products is the list of unique product IDs):
df_new = pd.DataFrame()
df_new['customer_id'] = df['customer_id'].unique()
for product in list_products:
    # per-customer total for this one product, merged in as a new column
    temp = (df[df['product_id'] == product]
            .groupby('customer_id')['sum'].sum()
            .reset_index()
            .rename(columns={'sum': 'sum_' + product}))
    df_new = df_new.merge(temp, how='left', on='customer_id').fillna(0)
This code works, but my list of products is large, so it will not scale. Are there any pandas tricks that would make this easier?
Thank you in advance!
Try this:
df = df.pivot_table(index='customer_id', columns='product_id', values='sum', fill_value=0)
df.columns = [f"Product_sum_{x}" for x in df.columns]
df = df.reset_index()
Output:
   customer_id  Product_sum_A  Product_sum_B  Product_sum_C
0            1             20             40              0
1            2             10              0             30
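If you want the exact column names from the question (Product_A_Sum and so on), just swap the f-string in the rename step above:
df.columns = [f"Product_{x}_Sum" for x in df.columns]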
You should use pivot:
df.pivot(index='customer_id', columns='product_id', values='sum')
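A note on pivot vs pivot_table: pivot has no fill_value argument and raises on duplicate customer/product pairs, whereas pivot_table aggregates them. A minimal sketch of the full chain, assuming the sample frame above:
out = (df.pivot(index='customer_id', columns='product_id', values='sum')
         .fillna(0)
         .reset_index())
out.columns.name = None  # drop the leftover 'product_id' axis name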
Related
I have a DataFrame with a column Sales.
How can I split it into two DataFrames based on the Sales value?
The first DataFrame should hold rows with 'Sales' < s and the second rows with 'Sales' >= s.
You can use boolean indexing:
df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})
print (df)
   A  Sales
0  3     10
1  4     20
2  7     30
3  6     40
4  1     50
s = 30
df1 = df[df['Sales'] >= s]
print (df1)
   A  Sales
2  7     30
3  6     40
4  1     50
df2 = df[df['Sales'] < s]
print (df2)
   A  Sales
0  3     10
1  4     20
It's also possible to invert the mask with ~:
mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print (df1)
   A  Sales
2  7     30
3  6     40
4  1     50
print (df2)
   A  Sales
0  3     10
1  4     20
print (mask)
0    False
1    False
2     True
3     True
4     True
Name: Sales, dtype: bool
print (~mask)
0     True
1     True
2    False
3    False
4    False
Name: Sales, dtype: bool
Using groupby, you could split into two DataFrames like this:
In [1047]: df1, df2 = [x for _, x in df.groupby(df['Sales'] < 30)]

In [1048]: df1
Out[1048]:
   A  Sales
2  7     30
3  6     40
4  1     50

In [1049]: df2
Out[1049]:
   A  Sales
0  3     10
1  4     20
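One caveat: groupby only yields groups that actually occur, so the two-way unpacking above raises a ValueError if every row falls on one side of the condition. A defensive sketch, assuming the df from above:
groups = dict(tuple(df.groupby(df['Sales'] < 30)))
df1 = groups.get(False, df.iloc[0:0])  # Sales >= 30; empty frame if no such rows
df2 = groups.get(True, df.iloc[0:0])   # Sales < 30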
Using groupby and a list comprehension:
Store all the split DataFrames in a list and access each separated DataFrame by its index.
DF = pd.DataFrame({'chr':["chr3","chr3","chr7","chr6","chr1"],'pos':[10,20,30,40,50],})
ans = [y for x, y in DF.groupby('chr')]
Access each separated DataFrame like this:
ans[0]
ans[1]
ans[-1] # this is the last separated DF
Access a column of a separated DataFrame like this:
ans_i_chr = ans[i].chr  # the 'chr' column of the i-th group
One-liner using the walrus operator (Python 3.8+):
df1, df2 = df[(mask:=df['Sales'] >= 30)], df[~mask]
Consider using copy to avoid SettingWithCopyWarning:
df1, df2 = df[(mask:=df['Sales'] >= 30)].copy(), df[~mask].copy()
Alternatively, you can use the query method:
df1, df2 = df.query('Sales >= 30').copy(), df.query('Sales < 30').copy()
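If the threshold lives in a variable, query can reference it with the @ prefix:
s = 30
df1, df2 = df.query('Sales >= @s').copy(), df.query('Sales < @s').copy()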
I like this for speeding up searches or rolling-average .apply(lambda x: ...) style functions, so I split big files into a dictionary of DataFrames:
df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df.Sales.unique()}
This should do it if you want to split based on categorical groups.
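For a plain equality split like this, groupby builds the same dictionary in one pass; a sketch, assuming the df from the earlier answers:
df_dict = dict(tuple(df.groupby('Sales')))
df_dict[30]  # sub-frame of all rows where Sales == 30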
This question already has answers here: Convert columns into rows with Pandas
I have a dataframe like this:
time   a   b
   0  10  20
   1  11  21
Now I need a dataframe like this:
time   a
   0  10
   1  11
   0  20
   1  21
This can be done with melt:
df.melt('time', value_name='a').drop('variable', axis=1)
Output:
   time   a
0     0  10
1     1  11
2     0  20
3     1  21
Or, if you have columns other than a and b in your data, name the value columns explicitly:
df.melt('time', ['a','b'], value_name='a').drop('variable', axis=1)
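For just two value columns, the same reshape can also be spelled with concat; a sketch, assuming only time, a, and b exist:
pd.concat([df[['time', 'a']],
           df[['time', 'b']].rename(columns={'b': 'a'})],
          ignore_index=True)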
This question already has answers here: Pandas - Duplicate Row based on condition
I need to replicate some rows in a pandas DataFrame, like this:
name  times
   A      2
   B      1
   C      3
   D     20
...
What I need is to replicate rows only when the times value is less than 20.
What I'm doing now is:
for t in df["times"]:
    if t < 20:
        df = df.loc[df.index.repeat(t)]
But the script keeps running and I have to stop it (I've been waiting a long time...).
Is there any way to improve this or to do it another way?
Use:
# condition: times < 20
mask = df['times'].lt(20)
# filter by boolean indexing
df1 = df[mask].copy()
# repeat each of those rows 'times' times
df1 = df1.loc[df1.index.repeat(df1['times'])]
# add back the rows with times >= 20, sort, and create a default index
df = pd.concat([df1, df[~mask]]).sort_index().reset_index(drop=True)
print (df)
  name  times
0    A      2
1    A      2
2    B      1
3    C      3
4    C      3
5    C      3
6    D     20
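For reference, the same result comes out of a single repeat call by using 1 as the count for the rows that should stay untouched; a sketch starting from the original df, assuming numpy is imported as np:
import numpy as np
repeats = np.where(df['times'] < 20, df['times'], 1)
df = df.loc[df.index.repeat(repeats)].reset_index(drop=True)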
This question already has answers here: How to change the order of DataFrame columns?
I have a pandas dataframe like this:
d = {'class': [0, 1,1,0,1,0], 'A': [0,4,8,1,0,0],'B':[4,1,0,0,3,1],'Z':[0,9,3,1,4,7]}
df = pd.DataFrame(data=d)
   A  B  Z  class
0  0  4  0      0
1  4  1  9      1
2  8  0  3      1
3  1  0  1      0
4  0  3  4      1
5  0  1  7      0
and I have a list like this: ['Z', 'B', 'class', 'A'].
Now I want to reorder the columns of my pandas dataframe according to this list of column names, so the new dataframe has the column order:
Z B class A
Use reindex:
L = ['Z','B','class','A']
df = df.reindex(columns=L)
Or select the subset directly:
df = df[L]
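The practical difference between the two: reindex silently inserts all-NaN columns for labels that don't exist, while the subset selection raises a KeyError in recent pandas versions; a quick sketch:
df.reindex(columns=['Z', 'missing'])  # 'missing' appears as a NaN column
# df[['Z', 'missing']]                # raises KeyError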
This question already has answers here: How to delete rows from a pandas DataFrame based on a conditional expression, and How do you filter pandas dataframes by multiple columns?
I want to drop rows with a zero value in specific columns.
>>> df
   salary  age  gender
0   10000   23       1
1   15000   34       0
2   23000   21       1
3       0   20       0
4   28500    0       1
5   35000   37       1
Some data in the salary and age columns are missing and encoded as 0.
The third column, gender, is a binary variable where 1 means male and 0 means female; there, 0 is not missing data.
I want to drop the rows where either salary or age is missing,
so I can get
>>> df
   salary  age  gender
0   10000   23       1
1   15000   34       0
2   23000   21       1
3   35000   37       1
Option 1
You can filter your dataframe using pd.DataFrame.loc:
df = df.loc[~((df['salary'] == 0) | (df['age'] == 0))]
Option 2
Or a smarter way to implement your logic:
df = df.loc[df['salary'] * df['age'] != 0]
This works because if either salary or age is 0, their product will also be 0.
Option 3
The following method is easily extended to several columns; using the question's columns:
df.loc[(df[['salary', 'age']] != 0).all(axis=1)]
Explanation
In all 3 cases, Boolean arrays are generated which are used to index your dataframe.
All these methods can be further optimised by using numpy representation, e.g. df['salary'].values.
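As a quick sketch of that last point, here is Option 1 rewritten against the raw arrays (same logic, just bypassing pandas' index alignment):
df = df.loc[(df['salary'].values != 0) & (df['age'].values != 0)]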