I need some help with data aggregation in Python.
I have a DataFrame with 3 columns and N rows. The first two columns contain indices (call them X and Y); the last one contains values. The task is to compute the sum() of the third-column values for each pair (x_i, y_j) and write it into a new DataFrame at the intersection of x_i and y_j.
Or, more simply, transform:
ind1 ind2 value
x1 y1 k1
x2 y1 k2
x3 y1 k3
x1 y2 k4
x2 y2 k5
x3 y2 k6
into some kind of 2D array:
y1 y2
________
x1 |k1 k4
x2 |k2 k5
x3 |k3 k6
I've tried pandas.groupby but didn't find a proper solution. So, what should I do?
You want to pivot your data. Example:
In [5]: data = {'ind1': ['x1','x2','x3','x1','x2','x3'],
'ind2': ['y1','y1','y1','y2','y2','y2'],
'value': ['k1','k2','k3','k4','k5','k6']}
In [6]: df = pd.DataFrame(data=data)
In [7]: df
Out[7]:
ind1 ind2 value
0 x1 y1 k1
1 x2 y1 k2
2 x3 y1 k3
3 x1 y2 k4
4 x2 y2 k5
5 x3 y2 k6
In [9]: df.pivot(index='ind1', columns='ind2', values='value')
Out[9]:
ind2 y1 y2
ind1
x1 k1 k4
x2 k2 k5
x3 k3 k6
You can find more information here: http://pandas.pydata.org/pandas-docs/stable/reshaping.html
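Note that pivot only reshapes and raises an error if an (ind1, ind2) pair occurs more than once. Since the question asks for a sum per pair, here is a minimal sketch using pivot_table with aggfunc='sum'. The numeric values and the extra duplicate row are made up for illustration; they are not from the question.

```python
import pandas as pd

# Illustrative data: the last row repeats the pair (x1, y1) on purpose.
df = pd.DataFrame({
    "ind1": ["x1", "x2", "x3", "x1", "x2", "x3", "x1"],
    "ind2": ["y1", "y1", "y1", "y2", "y2", "y2", "y1"],
    "value": [1, 2, 3, 4, 5, 6, 10],
})

# Sum all values that share the same (ind1, ind2) pair; missing pairs become 0.
table = df.pivot_table(index="ind1", columns="ind2", values="value",
                       aggfunc="sum", fill_value=0)

# The same result via groupby, which is what the question was attempting.
via_groupby = df.groupby(["ind1", "ind2"])["value"].sum().unstack(fill_value=0)

print(table)
```

Here the cell at (x1, y1) holds 1 + 10 = 11, whereas df.pivot would refuse the duplicated pair.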
Let's say we have a df like the one below:
df = pd.DataFrame({'A':['y2','x3','z1','z1'],'B':['y2','x3','a2','z1']})
A B
0 y2 y2
1 x3 x3
2 z1 a2
3 z1 z1
If we wanted to sort the values on just the numbers in column A, we could do:
df.sort_values(by='A',key=lambda x: x.str[1])
A B
3 z1 z1
2 z1 a2
0 y2 y2
1 x3 x3
If we wanted to sort by both columns A and B, but have the key only apply to column A, is there a way to do that?
df.sort_values(by=['A','B'],key=lambda x: x.str[1])
Expected output:
A B
2 z1 a2
3 z1 z1
0 y2 y2
1 x3 x3
You can sort by B, then sort by A with a stable method:
(df.sort_values('B')
.sort_values('A', key=lambda x: x.str[1], kind='mergesort')
)
Output:
A B
2 z1 a2
3 z1 z1
0 y2 y2
1 x3 x3
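An alternative sketch, assuming a pandas version where sort_values accepts key (1.1 or later): the key callable is applied to each column in by separately and receives that column as a Series, so it can check the Series name and leave B untouched. The helper name digit_key_for_a is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': ['y2', 'x3', 'z1', 'z1'],
                   'B': ['y2', 'x3', 'a2', 'z1']})

def digit_key_for_a(col):
    # Transform only column A (sort on its second character);
    # every other column is returned unchanged.
    return col.str[1] if col.name == 'A' else col

result = df.sort_values(by=['A', 'B'], key=digit_key_for_a)
print(result)
```

This avoids the second sort, at the cost of a key function that has to know the column names.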
I am trying to manipulate a data frame into the output data frame format. There are multiple values in a particular cell separated by ','. When I use .stack() to convert a number of values to rows, the remaining empty cells are filled with NaN. Is there any generic solution in pandas to handle this?
Input data frame:
x1 y1 x2 x3 x4
abc x or y v1,v2,v3 l1,l2,l3 self
abc z no1,no2,no3 e1,e2,e3 self
Output data frame:
x1 y1 x2 x3 x4
abc x v1 l1 self
v2 l2
v3 l3
y v1 l1 self
v2 l2
v3 l3
abc z no1 e1 self
no2 e2
no3 e3
df.set_index(df.index).apply(lambda x: x.str.split(",").apply(pd.Series).stack()).reset_index(drop=True).fillna("")
Output:
x1 x2 x3 x4
0 abc v1 l1 self
1 v2 l2
2 v3 l3
3 abc no1 e1 self
4 no2 e2
5 no3 e3
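A sketch of the same reshaping with str.split and explode, assuming pandas 1.3 or later (multi-column explode) and that x2 and x3 always hold the same number of comma-separated items in a given row. Like the answer above, it leaves out the y1 column. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "x1": ["abc", "abc"],
    "x2": ["v1,v2,v3", "no1,no2,no3"],
    "x3": ["l1,l2,l3", "e1,e2,e3"],
    "x4": ["self", "self"],
})

list_cols = ["x2", "x3"]

# Turn each comma-separated string into a list.
split = df.copy()
for col in list_cols:
    split[col] = split[col].str.split(",")

# One output row per list item; the original row label is repeated.
long = split.explode(list_cols)

# Blank the repeated scalar cells so only the first row of each group shows them.
repeated = long.index.duplicated()
long.loc[repeated, ["x1", "x4"]] = ""
long = long.reset_index(drop=True)

print(long)
```

Because the blanks are written explicitly, no NaN values appear and no fillna call is needed.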
I happen to have the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'Prod1': ['10','','10','','',''],
'Prod2': ['','5','5','','','5'],
'Prod3': ['','','','8','8','8'],
'String1': ['','','','','',''],
'String2': ['','','','','',''],
'String3': ['','','','','',''],
'X1': ['x1','x2','x3','x4','x5','x6'],
'X2': ['','','y1','','','y2']
})
print(df)
Prod1 Prod2 Prod3 String1 String2 String3 X1 X2
0 10 x1
1 5 x2
2 10 5 x3 y1
3 8 x4
4 8 x5
5 5 8 x6 y2
It's a schematic table of Products with associated Strings; the actual Strings are in columns (X1, X2), but they should eventually move to (String1, String2, String3) based on whether the corresponding product has a value or not.
For instance:
row 0 has a value on Prod1, hence x1 should move to String1.
row 1 has a value on Prod2, hence x2 should move to String2.
In the actual dataset, most rows have a value in a single Prod and therefore a single String, but some rows have values in several Prods; in that case the String columns should be filled from the X columns with priority to the left. The final result should look like:
Prod1 Prod2 Prod3 String1 String2 String3 X1 X2
0 10 x1
1 5 x2
2 10 5 x3 y1
3 8 x4
4 8 x5
5 5 8 x6 y2
I was thinking about nested column/row loops, but I'm still not familiar enough with pandas to get to the solution.
Thank you very much in advance for any suggestion!
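I'll break down the steps:
# Mark which String columns should be filled (True where the matching Prod is non-empty)
df[['String1', 'String2', 'String3']]=(df[['Prod1', 'Prod2', 'Prod3']]!='')
# Keep only the True cells, as a long (row, column) index
df1=df[['String1', 'String2', 'String3']].replace({False:np.nan}).stack().to_frame()
# Fill those cells, left to right, with the non-empty X values (the counts must match in every row)
df1[0]=df[['X1','X2']].replace({'':np.nan}).stack().values
# Reshape back to wide form and write the result into the String columns
df[['String1', 'String2', 'String3']]=df1[0].unstack()
df.replace({None:''})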
Out[1036]:
Prod1 Prod2 Prod3 String1 String2 String3 X1 X2
0 10 x1 x1
1 5 x2 x2
2 10 5 x3 y1 x3 y1
3 8 x4 x4
4 8 x5 x5
5 5 8 x6 y2 x6 y2
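For comparison, a row-wise sketch of the same left-priority rule that does not depend on how stack treats missing values. It is slower than the vectorised steps above, and the helper name fill_strings is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'Prod1': ['10', '', '10', '', '', ''],
                   'Prod2': ['', '5', '5', '', '', '5'],
                   'Prod3': ['', '', '', '8', '8', '8'],
                   'String1': [''] * 6,
                   'String2': [''] * 6,
                   'String3': [''] * 6,
                   'X1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
                   'X2': ['', '', 'y1', '', '', 'y2']})

prod_cols = ['Prod1', 'Prod2', 'Prod3']
string_cols = ['String1', 'String2', 'String3']
x_cols = ['X1', 'X2']

def fill_strings(row):
    # Non-empty X values, in left-to-right order.
    values = [v for v in row[x_cols] if v != '']
    # String columns whose matching Prod column has a value.
    targets = [s for p, s in zip(prod_cols, string_cols) if row[p] != '']
    out = dict.fromkeys(string_cols, '')
    out.update(zip(targets, values))  # pair them up left to right
    return pd.Series(out)

df[string_cols] = df.apply(fill_strings, axis=1)
print(df)
```

If a row has more X values than filled Prods (or the reverse), zip simply stops at the shorter list, so the surplus is dropped rather than raising an error.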
Let's suppose I have one dataframe with at least two columns, col1 and col2. I also have another dataframe whose index holds the values of col1 and whose column names are the values of col2.
import pandas as pd
df1 = pd.DataFrame( {'col1': ['x1', 'x2', 'x2'], 'col2': ['y0', 'y1', 'y0']})
print(df1)
col1 col2
0 x1 y0
1 x2 y1
2 x2 y0
print(df2)
y0 y1
x1 1 4
x2 2 5
x3 3 6
Now I want to add a col3 that holds the value of the second dataframe at the row given by col1 and the column given by col2.
The result should look like this:
col1 col2 col3
0 x1 y0 1
1 x2 y1 5
2 x2 y0 2
Thank you all!
You can use stack to build a new df and then merge:
df2 = df2.stack().reset_index()
df2.columns = ['col1','col2','col3']
print (df2)
col1 col2 col3
0 x1 y0 1
1 x1 y1 4
2 x2 y0 2
3 x2 y1 5
4 x3 y0 3
5 x3 y1 6
print (pd.merge(df1, df2, on=['col1','col2'], how='left'))
col1 col2 col3
0 x1 y0 1
1 x2 y1 5
2 x2 y0 2
Another solution is to create a new Series and use join:
s = df2.stack().rename('col3')
print (s)
x1 y0 1
y1 4
x2 y0 2
y1 5
x3 y0 3
y1 6
Name: col3, dtype: int64
print (df1.join(s, on=['col1','col2']))
col1 col2 col3
0 x1 y0 1
1 x2 y1 5
2 x2 y0 2
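A third possibility, sketched here under the assumption that df2 looks like the printed table in the question (its construction is never shown, so the constructor call below is a guess): translate the labels into integer positions and index the underlying array directly.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['x1', 'x2', 'x2'],
                    'col2': ['y0', 'y1', 'y0']})
# Assumed construction of df2, matching the printed output in the question.
df2 = pd.DataFrame({'y0': [1, 2, 3], 'y1': [4, 5, 6]},
                   index=['x1', 'x2', 'x3'])

# Positions of each col1 label in df2's index and each col2 label in its columns.
rows = df2.index.get_indexer(df1['col1'])
cols = df2.columns.get_indexer(df1['col2'])

# get_indexer returns -1 for unknown labels, which would silently pick the
# last row or column, so fail loudly instead.
if (rows < 0).any() or (cols < 0).any():
    raise KeyError("df1 contains labels that are missing from df2")

result = df1.assign(col3=df2.to_numpy()[rows, cols])
print(result)
```

This skips the intermediate stacked frame, which can matter when df2 is large and df1 is short.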
Simple join
Pandas supports the join operation both on indexes and on columns, meaning you can do this:
df1.merge(df2, left_on='col1', right_index=True)
Produces
col1 col2 y0 y1
0 x1 y0 1 4
1 x2 y1 2 5
2 x2 y0 2 5
Getting the proper value into col3 is the next step.
Apply
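This is a bit inefficient, but it is one way to get the correct data into a single column (df is the merged frame from above; each row looks up the column named in its own col2):
df = df1.merge(df2, left_on='col1', right_index=True)
df['col3'] = df.apply(lambda row: row[row['col2']], axis=1)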
I have a tab-delimited table in a file (file-A) that looks like this:
ind1 A1 B1 C1
ind2 A2 B2 C2
ind3 A3 B3 C3
and one column of values in another file (file-B) ordered as follows:
ind1
X1
Y1
ind2
X2
Y2
ind3
X1
Y2
I would like to combine the two files such that the values listed under each individual in file-B (ind1, ind2, etc.) become inserted "in between" values corresponding to each individual in file-A. Here's what the output should look like for this particular case:
ind1 A1 X1 B1 Y1 C1
ind2 A2 X2 B2 Y2 C2
ind1 A3 X3 B3 Y3 C3
I believe that your example has errors:
file2 should end with
X3
Y3
and the last line of output should be:
ind3 .....
This awk one-liner works for your example:
awk -F'\t' -v OFS='\t' 'NR==FNR{a[NR]=$0;next}{print $1,$2,a[(FNR-1)*3+2],$3,a[FNR*3],$4}' file2 file
with your data:
kent$ head file file2
==> file <==
ind1 A1 B1 C1
ind2 A2 B2 C2
ind3 A3 B3 C3
==> file2 <==
ind1
X1
Y1
ind2
X2
Y2
ind3
X3
Y3
kent$ awk -F'\t' -v OFS='\t' 'NR==FNR{a[NR]=$0;next}{print $1,$2,a[(FNR-1)*3+2],$3,a[FNR*3],$4}' file2 file
ind1 A1 X1 B1 Y1 C1
ind2 A2 X2 B2 Y2 C2
ind3 A3 X3 B3 Y3 C3
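For readers who prefer Python, a rough equivalent of the awk one-liner, assuming the same layout: tab-separated rows with four fields in the first file, and groups of three lines (label, first value, second value) in the second. The function name interleave is made up for this sketch, and it works on lists of lines rather than opening files.

```python
def interleave(table_lines, value_lines):
    """Insert the two values of each group between the columns of a row."""
    merged = []
    for i, line in enumerate(table_lines):
        name, a, b, c = line.split("\t")
        # Group i occupies lines 3*i .. 3*i+2: label, first value, second value.
        _label, first, second = value_lines[3 * i: 3 * i + 3]
        merged.append("\t".join([name, a, first, b, second, c]))
    return merged


table = ["ind1\tA1\tB1\tC1", "ind2\tA2\tB2\tC2", "ind3\tA3\tB3\tC3"]
values = ["ind1", "X1", "Y1", "ind2", "X2", "Y2", "ind3", "X3", "Y3"]

for row in interleave(table, values):
    print(row)
```

Like the awk version, this matches rows to groups purely by position and does not check that the labels agree.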