pandas groupby transpose str column - python

Here is what I am trying to do:
>>> import pandas as pd
>>> dftemp = pd.DataFrame({'a': [1] * 3 + [2] * 3 + [3], 'b': 'a a b c d e f'.split()})
>>> dftemp
a b
0 1 a
1 1 a
2 1 b
3 2 c
4 2 d
5 2 e
6 3 f
How can I transpose column 'b' grouped by column 'a', so that the output looks like this:
a b0 b1 b2
0 1 a a b
3 2 c d e
6 3 f NaN NaN

Using pivot_table with cumcount (df here is the question's dftemp):
(df.assign(flag=df.groupby('a').b.cumcount())
.pivot_table(index='a', columns='flag', values='b', aggfunc='first')
.add_prefix('B'))
flag B0 B1 B2
a
1 a a b
2 c d e
3 f NaN NaN
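For readers who want to run the cumcount approach end to end, here is a minimal self-contained sketch. It assumes the seven-row frame displayed in the question (the `f` row included). The intermediate name `pos`, the lowercase `b` prefix, and the final `reset_index` and column-name cleanup are additions for illustration, not part of the answer above.

```python
import pandas as pd

# Assumed input: the seven-row frame displayed in the question.
dftemp = pd.DataFrame({'a': [1] * 3 + [2] * 3 + [3],
                       'b': 'a a b c d e f'.split()})

# Position of each row inside its 'a' group: 0, 1, 2, 0, 1, 2, 0.
pos = dftemp.groupby('a').cumcount()

wide = (dftemp.assign(flag=pos)
        .pivot_table(index='a', columns='flag', values='b', aggfunc='first')
        .add_prefix('b')      # column labels 0, 1, 2 become b0, b1, b2
        .reset_index())       # turn the 'a' index back into a column
wide.columns.name = None      # drop the leftover 'flag' label on the columns
print(wide)
```

The within-group position is what makes the pivot possible: each (a, flag) pair is unique, so `aggfunc='first'` never has to choose between values.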

You can group by the column, flatten the values of each group, and rebuild the result as a DataFrame:
df = df.groupby(['a'])['b'].apply(lambda x: x.values.flatten())
pd.DataFrame(df.values.tolist(),index=df.index).add_prefix('B')
Out:
B0 B1 B2
a
1 a a b
2 c d e
3 f None None

You could try something like this:
>>> dftemp = pd.DataFrame({'a': [1] * 3 + [2] * 2 + [3]*1, 'b': 'a a b c d e'.split()})
>>> dftemp
a b
0 1 a
1 1 a
2 1 b
3 2 c
4 2 d
5 3 e
>>> dftemp.groupby('a')['b'].apply(lambda df: df.reset_index(drop=True)).unstack()
0 1 2
a
1 a a b
2 c d None
3 e None None

Given the ordering of your DataFrame you could find where the group changes and use np.split to create a new DataFrame.
import numpy as np
import pandas as pd
splits = dftemp[(dftemp.a != dftemp.a.shift())].index.values
df = pd.DataFrame(np.split(dftemp.b.values, splits[1:])).add_prefix('b').fillna(np.nan)
df['a'] = dftemp.loc[splits, 'a'].values
Output
b0 b1 b2 a
0 a a b 1
1 c d e 2
2 f NaN NaN 3
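To make the boundary detection in the np.split answer easier to follow, here is a small sketch that stops at the intermediate values. It assumes the seven-row frame from the question, that equal values of 'a' sit in contiguous rows, and that the frame has a default RangeIndex (the index labels are reused as positions). The names `starts` and `chunks` are illustrative.

```python
import numpy as np
import pandas as pd

dftemp = pd.DataFrame({'a': [1] * 3 + [2] * 3 + [3],
                       'b': 'a a b c d e f'.split()})

# A row starts a new group when its 'a' differs from the previous row's 'a'.
# The first row always qualifies because shift() puts NaN in front of it.
starts = dftemp.index[dftemp['a'] != dftemp['a'].shift()].to_numpy()

# Cut 'b' just before every group start except the first one.
chunks = np.split(dftemp['b'].to_numpy(), starts[1:])

print(starts.tolist())
print([chunk.tolist() for chunk in chunks])
```

If the rows of one group were not adjacent, each run would be treated as a separate group, so this approach only fits data that is already ordered by 'a'.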

Related

complete and repeat one dataframe along another one

How do you combine two DataFrames so that one is repeated in full for every row of the other? For example:
d1 = pd.DataFrame([[1,3],[2,4]])
print(d1)
0 1
0 1 3
1 2 4
and
d2 = pd.DataFrame([['A','D'],['B','E'],['C','F']])
print(d2)
0 1
0 A D
1 B E
2 C F
combined into:
d3 = pd.DataFrame([[1,3,'A','D'],[1,3,'B','E'],[1,3,'C','F'],[2,4,'A','D'],[2,4,'B','E'],[2,4,'C','F']])
print(d3)
0 1 2 3
0 1 3 A D
1 1 3 B E
2 1 3 C F
3 2 4 A D
4 2 4 B E
5 2 4 C F
I can loop over d1 and concat, but is there existing functionality that already does this?
Thanks
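I believe what you are looking for is a cross join.
The following code gives the result (how='cross' requires pandas 1.2 or later); you will just need to clean up the column names afterwards, because the labels 0 and 1 exist in both frames and come back with suffixes: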
df1 = pd.DataFrame([[1,3],[2,4]])
df2 = pd.DataFrame([['A','D'],['B','E'],['C','F']])
df1.merge(df2, how = 'cross')
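One possible way to handle the column-name cleanup is to relabel the second frame before merging, so that no labels clash. This is only a sketch: the helper name `d2_shifted` and the choice to continue the integer labels are assumptions made for illustration.

```python
import pandas as pd

d1 = pd.DataFrame([[1, 3], [2, 4]])
d2 = pd.DataFrame([['A', 'D'], ['B', 'E'], ['C', 'F']])

# Shift d2's integer labels past d1's so nothing overlaps: 0, 1 -> 2, 3.
d2_shifted = d2.rename(columns=lambda c: c + d1.shape[1])

# Every row of d1 is paired with every row of d2_shifted.
d3 = d1.merge(d2_shifted, how='cross')  # how='cross' needs pandas 1.2 or later
print(d3)
```

Because the labels no longer overlap, merge has no reason to append suffixes and the result already has columns 0 to 3.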
Another option: create a key column with the value 1 in both DataFrames, merge on that key, and then drop it.
import pandas as pd
d1 = pd.DataFrame([[1,3],[2,4]])
print(d1)
d2 = pd.DataFrame([['A','D'],['B','E'],['C','F']])
print(d2)
d1['key'] = 1
d2['key'] = 1
d1.merge(d2, on='key').drop('key', axis=1)
Here is an alternative solution using pd.merge() and df.assign():
d2.columns = ['2', '3']
d3 = pd.merge(d1.assign(key=1), d2.assign(key=1), on='key', suffixes=('', '')).drop('key', axis=1)
print(d3)
0 1 2 3
0 1 3 A D
1 1 3 B E
2 1 3 C F
3 2 4 A D
4 2 4 B E
5 2 4 C F

Join an array to every row in the pandas dataframe

I have a data frame and an array as follows:
df = pd.DataFrame({'x': range(0,5), 'y' : range(1,6)})
s = np.array(['a', 'b', 'c'])
I would like to attach the array to every row of the data frame, so that each element of the array becomes its own column.
What would be the most efficient way to do this?
Just plain assignment:
# replace the first `s` with your desired column names
df[s] = [s]*len(df)
Try this:
for i in s:
    df[i] = i
Output:
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You could use pandas.concat:
pd.concat([df, pd.DataFrame(s).T], axis=1).ffill()
output:
x y 0 1 2
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You can try using df.loc here.
df.loc[:, s] = s
print(df)
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
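A further variant, not taken from the answers above, is to let `DataFrame.assign` broadcast each scalar down the rows. This sketch assumes that each array element should serve as both the column name and the constant value, as in the outputs shown; the name `out` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(0, 5), 'y': range(1, 6)})
s = np.array(['a', 'b', 'c'])

# Each element of s becomes a column named after it and filled with it.
# assign() returns a new frame, so df itself is left unchanged.
out = df.assign(**{str(v): str(v) for v in s})
print(out)
```

Returning a new frame instead of mutating `df` can be convenient inside method chains.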

How to populate categories in one column and paste the exact value in other column

It has been a long time since I last worked with the pandas library. I searched but could not find an efficient way to do this; there may be an existing function in the library for it.
Let's say I have the dataframe below:
df1 = pd.DataFrame({'V1':['A','A','B'],
'V2':['B','C','C'],
'Value':[4, 1, 5]})
df1
I would like to extend this dataset so that it contains both orderings of each category pair, each with the same corresponding value:
df2 = pd.DataFrame({'V1':['A','B','A', 'C', 'B', 'C'],
'V2':['B','A','C','A','C','B'],
'Value':[4, 4 , 1, 1, 5, 5]})
df2
In other words, in df1 the pair A and B has a Value of 4, and I also want a row in the second dataframe where B and A has a Value of 4. It is very similar to melting. I do not want to use a for loop; I am looking for a more efficient way.
Use:
df = pd.concat([df1, df1.rename(columns={'V2':'V1', 'V1':'V2'})]).sort_index().reset_index(drop=True)
Output:
V1 V2 Value
0 A B 4
1 B A 4
2 A C 1
3 C A 1
4 B C 5
5 C B 5
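The rename-and-concat idea can be unpacked into separate steps, as in the sketch below. The name `swapped` is illustrative, and `kind='stable'` is an addition that makes the order of rows sharing an index label explicit instead of relying on the default sort.

```python
import pandas as pd

df1 = pd.DataFrame({'V1': ['A', 'A', 'B'],
                    'V2': ['B', 'C', 'C'],
                    'Value': [4, 1, 5]})

# Same rows with the V1/V2 labels exchanged; concat aligns columns by name.
swapped = df1.rename(columns={'V1': 'V2', 'V2': 'V1'})

out = (pd.concat([df1, swapped])
       .sort_index(kind='stable')   # keep each original row ahead of its mirror
       .reset_index(drop=True))
print(out)
```

Both frames keep their original index labels 0, 1, 2, so sorting by index is what interleaves each row with its mirrored copy.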
Or np.vstack:
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns)
V1 V2 Value
0 A B 4
1 A C 1
2 B C 5
3 B A 4
4 C A 1
5 C B 5
>>>
For correct order:
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns, index=[*df1.index, *df1.index]).sort_index()
V1 V2 Value
0 A B 4
0 B A 4
1 A C 1
1 C A 1
2 B C 5
2 C B 5
>>>
And index reset:
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns, index=[*df1.index, *df1.index]).sort_index().reset_index(drop=True)
V1 V2 Value
0 A B 4
1 B A 4
2 A C 1
3 C A 1
4 B C 5
5 C B 5
>>>
You can use assign together with pd.concat (DataFrame.append did the same job before it was removed in pandas 2.0):
pd.concat([df1, df1.assign(V1=df1.V2, V2=df1.V1)], ignore_index=True)
Output:
V1 V2 Value
0 A B 4
1 A C 1
2 B C 5
3 B A 4
4 C A 1
5 C B 5

Python - Pandas Tricky sum of columns

This is a tricky question. I have a dataframe like the one below and I want to create 3 columns with conditional sums, such as:
If the id = A then A = A1 and B and C = B1
If the id = B then B = B1 and A and C = A1
Example data:
id A1 B1 A B C
A 5 4 5 4 4
B 6 1 6 1 6
A 7 2 7 2 2
B 6 8 6 8 6
C 2 1 2 1 0
I'm trying to come up with a general solution so that I don't need a lot of separate sums along an axis.
Your condition can be reduced to:
if id == A, then column A = column A1, column C = column B1
if id == B, then column B = column B1, column C = column A1
So it translates to pandas code as:
df = pd.DataFrame([[5,4],[6,1],[7,2],[6,8],[2,1]], index=['A', 'B', 'A', 'B', 'C'], columns=['A1', 'B1'])
df['A'] = df['A1']
df['B'] = df['B1']
df['C'] = (df.index == 'B')*df['A1'] +(df.index == 'A')*df['B1']
# or a faster method from @user3483203
# df['id'] = df.index
# df['C'] = np.select([df.id.eq('A'), df.id.eq('B')], [df.B1, df.A1], 0)
# >>> df
# A1 B1 A B C
# A 5 4 5 4 4
# B 6 1 6 1 6
# A 7 2 7 2 2
# B 6 8 6 8 6
# C 2 1 2 1 0
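The commented-out np.select variant can be written as a runnable sketch. It assumes `id` is kept as an ordinary column rather than the index, and it uses the five sample rows from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'A', 'B', 'C'],
                   'A1': [5, 6, 7, 6, 2],
                   'B1': [4, 1, 2, 8, 1]})

df['A'] = df['A1']
df['B'] = df['B1']

# The first matching condition wins; rows matching neither get the default 0.
df['C'] = np.select([df['id'].eq('A'), df['id'].eq('B')],
                    [df['B1'], df['A1']],
                    default=0)
print(df)
```

Adding another id only means appending one more condition and one more choice to the two lists, which is closer to the general solution the question asks for.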

How can I add a column to a pandas DataFrame that uniquely identifies grouped data? [duplicate]

Given the following data frame:
import pandas as pd
import numpy as np
df=pd.DataFrame({'A':['A','A','A','B','B','B'],
'B':['a','a','b','a','a','a'],
})
df
A B
0 A a
1 A a
2 A b
3 B a
4 B a
5 B a
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
I've tried this so far:
df['C']=df.groupby(['A','B'])['B'].transform('rank')
...but it doesn't work!
Use groupby/cumcount:
In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
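It may help to see that cumcount numbers rows per group even when the rows of a group are not adjacent. The frame below is hypothetical data made up for this sketch, not the frame from the question.

```python
import pandas as pd

# Hypothetical frame whose groups are interleaved rather than contiguous.
df_mixed = pd.DataFrame({'A': ['x', 'y', 'x', 'y', 'x'],
                         'B': ['a', 'a', 'a', 'a', 'a']})

# cumcount() counts 0, 1, 2, ... within each (A, B) group in row order,
# so adding 1 gives a 1-based running number per group.
df_mixed['C'] = df_mixed.groupby(['A', 'B']).cumcount() + 1
print(df_mixed)
```

The result is an integer column, so no cast is needed to use it as a label.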
Use the groupby.rank function.
Here is a working example:
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
a 1
a 2
a 3
b 4
b 5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
a 1 1.0
a 2 2.0
a 3 3.0
b 4 1.0
b 5 2.0
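Since rank returns floating-point values, a cast can be added when integer labels are wanted. The following is a sketch based on the same example frame; the intermediate name `ranks` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})

# method='first' breaks ties by order of appearance, so ranks are unique per group.
ranks = df.groupby('C1')['C2'].rank(method='first', ascending=True)

# rank() yields floats; cast once there are no missing values to worry about.
df['RANK'] = ranks.astype(int)
print(df)
```

Note that this ranks by the values of C2 within each group, whereas cumcount numbers rows purely by their position.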
