Pandas: Pack column into rows - python

I've been reading through pd.stack, pd.unstack, and pd.pivot, but I can't wrap my head around getting what I want done.
Given a dataframe as follows:
id1 id2 id3 vals vals1
0 1 a -1 10 20
1 1 a -2 11 21
2 1 a -3 12 22
3 1 a -4 13 23
4 1 b -1 14 24
5 1 b -2 15 25
6 1 b -3 16 26
7 1 b -4 17 27
I'd like to get the following result
id1 id2 -1_vals -2_vals ... -1_vals1 -2_vals1 -3_vals1 -4_vals1
0 1 a 10 11 ... 20 21 22 23
1 1 b 14 15 ... 24 25 26 27
It's kind of a groupby with a pivot: the column id3 is being spread across columns, where each new column name is the concatenation of the corresponding id3 value and the original column name.
EDIT: It is guaranteed that id3 is unique per id1 + id2 pair, but some id1 + id2 groups will have different sets of id3 values - in that case it is OK to put NaNs there.

Use DataFrame.set_index with DataFrame.unstack and DataFrame.sort_index to get a MultiIndex in the columns, then flatten it with a list comprehension using f-strings:
df1 = (df.set_index(['id1', 'id2', 'id3'])
         .unstack()
         .sort_index(level=[0, 1], ascending=[True, False], axis=1))
# Python 3.6+
df1.columns = [f'{b}_{a}' for a, b in df1.columns]
# Python below 3.6
# df1.columns = ['{}_{}'.format(b, a) for a, b in df1.columns]
df1 = df1.reset_index()
print(df1)
id1 id2 -1_vals -2_vals -3_vals -4_vals -1_vals1 -2_vals1 -3_vals1 \
0 1 a 10 11 12 13 20 21 22
1 1 b 14 15 16 17 24 25 26
-4_vals1
0 23
1 27
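If you are on pandas 1.1+, where DataFrame.pivot accepts list-likes for index, a minimal alternative sketch with the same sort-and-flatten steps as above:
# assumes pandas >= 1.1 for the list-valued index argument
df1 = (df.pivot(index=['id1', 'id2'], columns='id3')
         .sort_index(level=[0, 1], ascending=[True, False], axis=1))
df1.columns = [f'{b}_{a}' for a, b in df1.columns]
df1 = df1.reset_index()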

Related

Python Rank with non numeric columns

I'm trying to find a way to do nested ranking (row numbering) in Python that is equivalent to the T-SQL shown below.
I have a table that looks like this:
data = {
    'col1': [11, 11, 11, 22, 22, 33, 33],
    'col2': ['a', 'b', 'c', 'a', 'b', 'a', 'b']
}
df = pd.DataFrame(data)
# col1 col2
# 11 a
# 11 b
# 11 c
# 22 a
# 22 b
# 33 a
# 33 b
Looking for the Python equivalent of:
SELECT
    col1,
    col2,
    ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2) AS rnk
FROM df
GROUP BY col1, col2
The output should be:
# col1 col2 rnk
# 11 a 1
# 11 b 2
# 11 c 3
# 22 a 1
# 22 b 2
# 33 a 1
# 33 b 2
I've tried to use rank() and groupby(), but I keep running into the error "No numeric types to aggregate". Is there a way to rank non-numeric columns and give them row numbers?
Use cumcount():
df['rnk'] = df.groupby('col1')['col2'].cumcount() + 1
col1 col2 rnk
0 11 a 1
1 11 b 2
2 11 c 3
3 22 a 1
4 22 b 2
5 33 a 1
6 33 b 2
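Note that cumcount follows the current row order; the sample data happens to be sorted by col2 within each col1 group already. If that isn't guaranteed, a hedged sketch that sorts first, so the result matches ROW_NUMBER's ORDER BY (index alignment puts the values back on the original rows):
df['rnk'] = df.sort_values('col2').groupby('col1').cumcount() + 1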

Combining Multiple Dataframes Based On Multiple Columns Matching And Summing Other Columns Pandas Python

I currently have multiple pandas dataframes like below:
df1
id1 id2 col_sum_1 col_sum_2
0 13 15 3 4
1 15 234 7 6
2 63 627 1 7
df2
id1 id2 col_sum_1 col_sum_2
0 13 15 8 3
1 15 234 2 3
2 63 627 8 1
df3
id1 id2 col_sum_1 col_sum_2
0 13 15 3 5
1 15 234 7 7
2 63 627 4 4
I want to create a new dataframe from these by joining rows where id1 and id2 match, then summing col_sum_1 and col_sum_2 to get the following outcome:
df
id1 id2 col_sum_1 col_sum_2
0 13 15 14 12
1 15 234 16 16
2 63 627 13 12
Is there a way in pandas to join the 3 tables where id1 and id2 are equal, and then sum col_sum_1 and col_sum_2 across the joined rows to create a new dataframe?
merge() all three dataframes, then sum(axis=1) across the rows. Finally, clean up the columns.
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO(""" id1 id2 col_sum_1 col_sum_2
0 13 15 3 4
1 15 234 7 6
2 63 627 1 7
"""), sep=r"\s+")
df2 = pd.read_csv(io.StringIO(""" id1 id2 col_sum_1 col_sum_2
0 13 15 8 3
1 15 234 2 3
2 63 627 8 1"""), sep=r"\s+")
df3 = pd.read_csv(io.StringIO(""" id1 id2 col_sum_1 col_sum_2
0 13 15 3 5
1 15 234 7 7
2 63 627 4 4"""), sep=r"\s+")
(
    df1.merge(df2, on=["id1", "id2"])
       .merge(df3, on=["id1", "id2"])
       .assign(col_sum_1=lambda dfa: dfa.loc[:, [c for c in dfa.columns if "col_sum_1" in c]].sum(axis=1),
               col_sum_2=lambda dfa: dfa.loc[:, [c for c in dfa.columns if "col_sum_2" in c]].sum(axis=1))
       .drop(columns=["col_sum_1_x", "col_sum_2_x", "col_sum_1_y", "col_sum_2_y"])
)
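A hedged sketch generalizing this to any number of frames sharing id1 and id2 (assumes the frames and column names from the question): inner-join on the keys, then sum the like-named columns.
from functools import reduce

dfs = [df1, df2, df3]  # works for any number of frames
# give each frame's value columns a unique suffix, keyed by position
indexed = [d.set_index(["id1", "id2"]).add_suffix(f"_{i}") for i, d in enumerate(dfs)]
# inner-join keeps only id1/id2 pairs present in every frame
joined = reduce(lambda left, right: left.join(right, how="inner"), indexed)
result = pd.DataFrame({
    "col_sum_1": joined.filter(like="col_sum_1").sum(axis=1),
    "col_sum_2": joined.filter(like="col_sum_2").sum(axis=1),
}).reset_index()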
First, you can concatenate the dataframes:
>>> df = pd.concat([df1, df2, df3]).groupby(['id1', 'id2']).sum().reset_index()
>>> df
id1 id2 col_sum_1 col_sum_2
0 13 15 14 12
1 15 234 16 16
2 63 627 13 12
Note: The above produces the desired dataframe for the 3 "input" dataframes in the question. The next steps are not needed if all the "input" dataframes have only rows with the same pairs of id1 and id2 values.
Then, you can find the common id1 and id2 pairs in the "input" dataframes:
>>> common_pairs = set(zip(df1.id1, df1.id2)) & set(zip(df2.id1, df2.id2)) & set(zip(df3.id1, df3.id2))
>>> common_pairs
{(63, 627), (13, 15), (15, 234)}
Finally, you can create a MultiIndex and use it to keep only the rows with the common_pairs:
>>> idx = pd.MultiIndex.from_frame(df[['id1', 'id2']])
>>> df = df.loc[idx.isin(common_pairs)].reset_index(drop=True)
>>> df
id1 id2 col_sum_1 col_sum_2
0 13 15 14 12
1 15 234 16 16
2 63 627 13 12

pandas version of SQL CROSS APPLY

Say we have a DataFrame df
df = pd.DataFrame({
    "Id": [1, 2],
    "Value": [2, 5]
})
df
Id Value
0 1 2
1 2 5
and some function f which takes an element of df and returns a DataFrame.
def f(value):
    return pd.DataFrame({"A": range(10, 10 + value), "B": range(20, 20 + value)})
f(2)
A B
0 10 20
1 11 21
We want to apply f to each element in df["Value"], and join the result to df, like so:
Id Value A B
0 1 2 10 20
1 1 2 11 21
2 2 5 10 20
2 2 5 11 21
2 2 5 12 22
2 2 5 13 23
2 2 5 14 24
In T-SQL, with a table df and table-valued function f, we would do this with a CROSS APPLY:
SELECT * FROM df
CROSS APPLY f(df.Value)
How can we do this in pandas?
You could apply the function to each element in Value in a list comprehension and use pd.concat to concatenate all the resulting dataframes. Also assign the corresponding Id so that it can later be used to merge both dataframes:
l = pd.concat([f(row.Value).assign(Id=row.Id) for _, row in df.iterrows()])
df.merge(l, on='Id')
Id Value A B
0 1 2 10 20
1 1 2 11 21
2 2 5 10 20
3 2 5 11 21
4 2 5 12 22
5 2 5 13 23
6 2 5 14 24
This is one of the few cases where I would use DataFrame.iterrows. We can iterate over each row, concat the frame your function returns with the original row (a per-row cartesian product), and at the same time fill the resulting NaNs with bfill and ffill:
df = pd.concat([pd.concat([f(r['Value']), pd.DataFrame(r).T], axis=1).bfill().ffill()
                for _, r in df.iterrows()],
               ignore_index=True)
Which yields:
print(df)
A B Id Value
0 10 20 1.0 2.0
1 11 21 1.0 2.0
2 10 20 2.0 5.0
3 11 21 2.0 5.0
4 12 22 2.0 5.0
5 13 23 2.0 5.0
6 14 24 2.0 5.0
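A hedged alternative that avoids iterrows: build one expanded frame per Value with pd.concat over a dict (so the original index is carried in the outer level of a MultiIndex), then join back onto df. This sketch assumes df has a unique index, as in the example:
# keys of the dict become the outer index level; droplevel(1) leaves a
# duplicated index that join uses to replicate the matching df rows
expanded = pd.concat({i: f(v) for i, v in df['Value'].items()})
out = df.join(expanded.droplevel(1)).reset_index(drop=True)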

Merging 3 databases with same names, and renaming them in python

I have 3 dataframes with 25 columns each; all the columns are the same in the 3 dataframes.
I want to merge the 3 dataframes and rename the columns with the suffix "_a" for the 25 columns of df1, "_b" for the columns of df2, and "_c" for the columns of df3.
I am using the code below:
pd.merge(pd.merge(df1, df2, 'left', on='year', suffixes=['_a', '_b']), df3, 'left', on='year')
How do I use rename or some other function to change all 25 columns of df3 in the code above?
Thanks.
Note that suffixes only apply to overlapping column names, and after the first merge df3's columns no longer clash with the _a/_b columns, so suffix them yourself before merging:
pd.merge(pd.merge(df1, df2, 'left', on='year', suffixes=['_a', '_b']),
         df3.set_index('year').add_suffix('_c').reset_index(),
         'left', on='year')
Another approach:
Source DFs:
In [68]: d1
Out[68]:
col1 col2 col3
0 1 2 3
1 4 5 6
In [69]: d2
Out[69]:
col1 col2 col3
0 11 12 13
1 14 15 16
In [70]: d3
Out[70]:
col1 col2 col3
0 21 22 23
1 24 25 26
Let's create a list of DFs:
In [71]: dfs = [d1,d2,d3]
and a list of suffixes:
In [73]: suffixes = ['_a','_b','_c']
Now we can merge them in one step like as follows:
In [74]: pd.concat([df.add_suffix(suffixes[i]) for i,df in enumerate(dfs)], axis=1)
Out[74]:
col1_a col2_a col3_a col1_b col2_b col3_b col1_c col2_c col3_c
0 1 2 3 11 12 13 21 22 23
1 4 5 6 14 15 16 24 25 26
Short explanation: in the list comprehension we are generating a list of DFs with already renamed columns:
In [75]: [suffixes[i] for i,df in enumerate(dfs)]
Out[75]: ['_a', '_b', '_c']
In [76]: [df.add_suffix(suffixes[i]) for i,df in enumerate(dfs)]
Out[76]:
[ col1_a col2_a col3_a
0 1 2 3
1 4 5 6, col1_b col2_b col3_b
0 11 12 13
1 14 15 16, col1_c col2_c col3_c
0 21 22 23
1 24 25 26]
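If the frames should still be aligned on the year key rather than concatenated by position, a hedged sketch combining add_suffix with the merge from the question (assumes each frame has a year column, per the question; d1/d2/d3 above do not):
from functools import reduce

dfs = [df1, df2, df3]
suffixes = ['_a', '_b', '_c']
# suffix everything except the join key, then chain left merges on year
renamed = [d.set_index('year').add_suffix(s).reset_index() for d, s in zip(dfs, suffixes)]
merged = reduce(lambda left, right: left.merge(right, on='year', how='left'), renamed)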

Dynamically reshape the dataframe in pandas

I have a dataframe with 4 columns and 4 rows. I need to reshape it into 2 columns and 4 rows. The 2 new columns are the result of adding the values of col1 + col3 and col2 + col4. I do not wish to create any other in-memory object for it.
I am trying
df['A','B'] = df['A']+df['C'], df['B']+df['D']
Can it be achieved by using the drop function only? Is there any other, simpler method for this?
The dynamic way of summing two columns at a time is to use groupby:
df.groupby(np.arange(len(df.columns)) % 2, axis=1).sum()
Out[11]:
0 1
0 2 4
1 10 12
2 18 20
3 26 28
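On newer pandas, where groupby(axis=1) is deprecated, a hedged equivalent transposes first (assumes numpy is imported as np):
# group the transposed rows in the same alternating pattern, then transpose back
df.T.groupby(np.arange(len(df.columns)) % 2).sum().T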
You can use rename afterwards if you want to change the column names, but that would require some logic.
Consider the sample dataframe df
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
One line of code
pd.DataFrame(
    df.values.reshape(4, 2, 2).transpose(0, 2, 1).sum(2),
    columns=df.columns[:2]
)
A B
0 2 4
1 10 12
2 18 20
3 26 28
Another line of code
df.iloc[:, :2] + df.iloc[:, 2:4].values
A B
0 2 4
1 10 12
2 18 20
3 26 28
Yet another
df.assign(A=df.A + df.C, B=df.B + df.D).drop(['C', 'D'], axis=1)
A B
0 2 4
1 10 12
2 18 20
3 26 28
This works for me:
df['A'], df['B'] = df['A'] + df['C'], df['B'] + df['D']
df = df.drop(['C', 'D'], axis=1)
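And a hedged in-place variant, given the asker's wish not to create another object:
df['A'] += df['C']
df['B'] += df['D']
df.drop(columns=['C', 'D'], inplace=True)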
