I have the following data frame
df = pd.DataFrame([
{"A": 1, "B": "20", "pairs": [(1,2), (2,3)]},
{"A": 2, "B": "22", "pairs": [(1,1), (2,2), (1,3)]},
{"A": 3, "B": "24", "pairs": [(1,1), (3,3)]},
{"A": 4, "B": "26", "pairs": [(1,3)]},
])
>>> df
A B pairs
0 1 20 [(1, 2), (2, 3)]
1 2 22 [(1, 1), (2, 2), (1, 3)]
2 3 24 [(1, 1), (3, 3)]
3 4 26 [(1, 3)]
Instead of keeping these as a list of tuples, I'd like to make new columns for these pairs, p1 and p2, holding the first and second members of each tuple respectively. There is also a wide-to-long element here, in that a single row explodes into as many rows as there are pairs in the list.
This does not appear to fit the wide-to-long documentation I can find. My desired output format is this:
>>> df
A B p1 p2
0 1 20 1 2
1 1 20 2 3
2 2 22 1 1
3 2 22 2 2
4 2 22 1 3
5 3 24 1 1
6 3 24 3 3
7 4 26 1 3
First explode, then join:
s = df.explode('pairs').reset_index(drop=True)
out = s.join(pd.DataFrame(s.pop('pairs').tolist(), columns=['p1', 'p2']))
out
Out[98]:
A B p1 p2
0 1 20 1 2
1 1 20 2 3
2 2 22 1 1
3 2 22 2 2
4 2 22 1 3
5 3 24 1 1
6 3 24 3 3
7 4 26 1 3
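A side note on the reset_index(drop=True): join aligns on index labels, and the frame built from the tuple list has a fresh 0..7 RangeIndex, so without the reset the two exploded rows labelled 0 would both pick up the first tuple's values and the later tuples would never be used. The same idea spelled with concat, as a minimal sketch:
s = df.explode('pairs').reset_index(drop=True)
out = pd.concat([s.drop(columns='pairs'),
                 pd.DataFrame(s['pairs'].tolist(), columns=['p1', 'p2'])],
                axis=1)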
Use explode:
>>> df.join(df.pop('pairs').explode().apply(pd.Series)
.rename(columns={0: 'p1', 1: 'p2'}))
A B p1 p2
0 1 20 1 2
0 1 20 2 3
1 2 22 1 1
1 2 22 2 2
1 2 22 1 3
2 3 24 1 1
2 3 24 3 3
3 4 26 1 3
Is this what you have in mind?
(df.explode('pairs')                     # blow it up into individual rows
   .assign(p1=lambda d: d.pairs.str[0],  # first member of each tuple
           p2=lambda d: d.pairs.str[-1]) # second (last) member
   .drop(columns='pairs')
)
Out[1234]:
A B p1 p2
0 1 20 1 2
0 1 20 2 3
1 2 22 1 1
1 2 22 2 2
1 2 22 1 3
2 3 24 1 1
2 3 24 3 3
3 4 26 1 3
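Note that the .str accessor is not limited to strings; it indexes element-wise into any list-like, which is why .str[0] and .str[-1] pull the tuple members here. A quick illustration:
pd.Series([(1, 2), (2, 3)]).str[0]
0    1
1    2
dtype: int64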
Another option, using the apply method and a longer chain (performance-wise I have no idea which is better):
(df
.set_index(['A', 'B'])
.pairs
.apply(pd.Series)
.stack()
.apply(pd.Series)
.droplevel(-1)
.set_axis(['p1', 'p2'], axis=1)
.reset_index()
)
Out[1244]:
A B p1 p2
0 1 20 1 2
1 1 20 2 3
2 2 22 1 1
3 2 22 2 2
4 2 22 1 3
5 3 24 1 1
6 3 24 3 3
7 4 26 1 3
Since pairs is a list of tuples, you may get some performance gain by moving the wrangling into pure Python before recombining into a DataFrame:
from itertools import chain

repeats = [*map(len, df.pairs)]
reshaped = chain.from_iterable(df.pairs)
reshaped = pd.DataFrame(reshaped,
                        columns=['p1', 'p2'],
                        index=df.index.repeat(repeats))
df.drop(columns='pairs').join(reshaped)
Out[1265]:
A B p1 p2
0 1 20 1 2
0 1 20 2 3
1 2 22 1 1
1 2 22 2 2
1 2 22 1 3
2 3 24 1 1
2 3 24 3 3
3 4 26 1 3
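As a quick sanity check (a sketch; fast and slow are just illustrative names), this route should agree with the plain explode route:
fast = df.drop(columns='pairs').join(reshaped).reset_index(drop=True)
slow = df.explode('pairs').reset_index(drop=True)
slow = slow.join(pd.DataFrame(slow.pop('pairs').tolist(), columns=['p1', 'p2']))
pd.testing.assert_frame_equal(fast, slow)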
Related
Good morning,
I have a problem when trying to replace some values. I have a dataframe with a column "loc10p" that separates the records into 10 groups, and within each group I have split the records into smaller subgroups, but the subgroup numbering restarts at 1 in every group instead of continuing from the previous group's last subgroup. For example:
c2[c2.loc10p.isin([1,2])].sort_values(['loc10p','subgrupoloc10'])[['loc10p','subgrupoloc10']]
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 1
8 2 1
9 2 1
16 2 1
17 2 1
18 2 2
23 2 2
How can I transform that into something like the following:
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 3
8 2 3
9 2 3
16 2 3
17 2 3
18 2 4
23 2 4
I tried a loop that separates each group into a different dataframe and then replaces the subgroup values with a running counter, but it didn't replace anything:
w = 1
temporal = []
for e in range(1, 11):
    temp = c2[c2['loc10p'] == e]
    temporal.append(temp)
for e, i in zip(temporal, range(1, 9)):
    try:
        e.loc[:, 'subgrupoloc10'] = w
        w += 1
    except:
        pass
Any help will be really appreciated!!
Try with ngroup:
df['out'] = df.groupby(['loc10p','subgrupoloc10']).ngroup()+1
Out[204]:
1 1
7 1
15 1
0 2
14 2
30 2
31 2
2 3
8 3
9 3
16 3
17 3
18 4
23 4
dtype: int64
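One caveat: ngroup numbers the groups following the sorted order of the keys. Since the data here is already sorted, that coincides with the order of appearance; on unsorted data, sort=False makes the numbering follow first appearance instead:
df['out'] = df.groupby(['loc10p', 'subgrupoloc10'], sort=False).ngroup() + 1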
Try:
groups = (df["subgrupoloc10"] != df["subgrupoloc10"].shift()).cumsum()
df["subgrupoloc10"] = groups
print(df)
Prints:
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 3
8 2 3
9 2 3
16 2 3
17 2 3
18 2 4
23 2 4
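This relies on rows of the same subgroup sitting together; if a new loc10p group happened to start with the same subgrupoloc10 value the previous group ended with, the shift comparison would miss the boundary. Comparing both columns guards against that (a sketch):
key = df[["loc10p", "subgrupoloc10"]]
df["subgrupoloc10"] = (key != key.shift()).any(axis=1).cumsum()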
Let's say I have a dataframe:
index day
0 21
1 2
2 7
and to each day I want to assign three values: 0, 1, 2. In the end the dataframe should look like this:
index day value
0 21 0
1 21 1
2 21 2
3 2 0
4 2 1
5 2 2
6 7 0
7 7 1
8 7 2
Does anyone have any idea?
You could introduce a column containing (0, 1, 2)-tuples and then explode the dataframe on that column:
import pandas as pd
df = pd.DataFrame({'day': [21, 2, 7]})
df['value'] = [(0, 1, 2)] * len(df)
df = df.explode('value')
df.index = range(len(df))
print(df)
day value
0 21 0
1 21 1
2 21 2
3 2 0
4 2 1
5 2 2
6 7 0
7 7 1
8 7 2
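On pandas 1.1 or later, explode's ignore_index flag can replace the manual index reset:
df = df.explode('value', ignore_index=True)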
Try:
N = 3
df = df.assign(value=[range(N) for _ in range(len(df))]).explode("value")
print(df)
Prints:
index day value
0 0 21 0
0 0 21 1
0 0 21 2
1 1 2 0
1 1 2 1
1 1 2 2
2 2 7 0
2 2 7 1
2 2 7 2
A reindex option:
df = (
df.reindex(index=pd.MultiIndex.from_product([df.index, [0, 1, 2]]),
level=0)
.droplevel(0)
.rename_axis(index='value')
.reset_index()
)
df:
value day
0 0 21
1 1 21
2 2 21
3 0 2
4 1 2
5 2 2
6 0 7
7 1 7
8 2 7
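If performance matters, a plain NumPy construction avoids explode altogether. A sketch, assuming the values are always 0, 1, 2:
import numpy as np
out = pd.DataFrame({'day': np.repeat(df['day'].to_numpy(), 3),
                    'value': np.tile(np.arange(3), len(df))})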
I have a dataframe (df)
a b c
1 2 20
1 2 15
2 4 30
3 2 20
3 2 15
and I want to keep only the rows holding the max values of column c.
I tried
a = df.loc[df.groupby('b')['c'].idxmax()]
but the groupby removes duplicates, so I get
a b c
1 2 20
2 4 30
It drops the a == 3 rows because they share the same b value as the a == 1 rows.
Is there any way to write the code so it does not remove them?
Just also take column a into account when you do the groupby:
a = df.loc[df.groupby(['a', 'b'])['c'].idxmax()]
a b c
0 1 2 20
2 2 4 30
3 3 2 20
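Note that idxmax returns a single label per group, so if one (a, b) group contained its maximum twice, only the first such row would survive. The transform approach below keeps all of them.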
I think you need:
df = df[df['c'] == df.groupby('b')['c'].transform('max')]
print (df)
a b c
0 1 2 20
2 2 4 30
3 3 2 20
The difference shows up on modified data:
print (df)
a b c
0 1 2 30
1 1 2 30
2 1 2 15
3 2 4 30
4 3 2 20
5 3 2 15
#only 1 max rows per groups a and b
a = df.loc[df.groupby(['a', 'b'])['c'].idxmax()]
print (a)
a b c
0 1 2 30
3 2 4 30
4 3 2 20
#all max rows per groups b
df1 = df[df['c'] == df.groupby('b')['c'].transform('max')]
print (df1)
a b c
0 1 2 30
1 1 2 30
3 2 4 30
#all max rows per groups a and b
df2 = df[df['c'] == df.groupby(['a', 'b'])['c'].transform('max')]
print (df2)
a b c
0 1 2 30
1 1 2 30
3 2 4 30
4 3 2 20
Say I have the following dataframe, and I want to set c to 2 for the first two rows where column a equals 1.
>>> df = pd.DataFrame({"a" : [1,1,1,1,2,2,2,2], "b" : [2,3,1,4,5,6,7,2], "c" : [1,2,3,4,5,6,7,8]})
>>> df.loc[df["a"] == 1, "c"].iloc[0:2] = 2
>>> df
a b c
0 1 2 1
1 1 3 2
2 1 1 3
3 1 4 4
4 2 5 5
5 2 6 6
6 2 7 7
7 2 2 8
The code in the second line doesn't work because the chained .loc/.iloc indexing assigns to a copy, so the original dataframe is not modified. How would I do this?
A dirty way would be:
df.loc[df[df['a'] == 1][:2].index, 'c'] = 2
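A slightly cleaner spelling of the same idea, slicing the first two matching labels straight off the index:
df.loc[df.index[df['a'] == 1][:2], 'c'] = 2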
You can use Index.isin:
import pandas as pd
df = pd.DataFrame({"a" : [1,1,1,1,2,2,2,2],
"b" : [2,3,1,4,5,6,7,2],
"c" : [1,2,3,4,5,6,7,8]})
#more general index
df.index = df.index + 10
print (df)
a b c
10 1 2 1
11 1 3 2
12 1 1 3
13 1 4 4
14 2 5 5
15 2 6 6
16 2 7 7
17 2 2 8
print (df.index.isin(df.index[:2]))
[ True True False False False False False False]
df.loc[(df["a"] == 1) & (df.index.isin(df.index[:2])), "c"] = 2
print (df)
a b c
10 1 2 2
11 1 3 2
12 1 1 3
13 1 4 4
14 2 5 5
15 2 6 6
16 2 7 7
17 2 2 8
If the index is nice (starts from 0, without duplicates):
df.loc[(df["a"] == 1) & (df.index < 2), "c"] = 2
print (df)
a b c
0 1 2 2
1 1 3 2
2 1 1 3
3 1 4 4
4 2 5 5
5 2 6 6
6 2 7 7
7 2 2 8
Another solution:
mask = df["a"] == 1
mask = mask & (mask.cumsum() < 3)
df.loc[mask.index[:2], "c"] = 2
print (df)
a b c
0 1 2 2
1 1 3 2
2 1 1 3
3 1 4 4
4 2 5 5
5 2 6 6
6 2 7 7
7 2 2 8
Suppose I have the pandas DataFrame below which is already sorted on column A.
import pandas as pd
data = {'A': range(15),
'B': list(range(5)) * 3}
df = pd.DataFrame(data)
# just in case:
df.sort_values('A', inplace=True)
The resulting dataframe looks something like this:
A | B
-----
0 | 0
1 | 1
2 | 2
3 | 3
4 | 4
5 | 0
6 | 1
7 | 2
8 | 3
9 | 4
10 | 0
11 | 1
12 | 2
13 | 3
14 | 4
I would like to group this into three groups based on the "stopping points" in column B where the value of that column goes down from 4 to 0. A naive use of groupby can't accommodate this because there is no key that distinguishes groups.
It would be straightforward to do this by iterating over the individual rows in sorted order, but I was wondering if there was a pandas-native solution.
IIUC, you can create a new column C for the groupby with cumsum:
df['C'] = ((df.B == 0).cumsum())
print (df)
A B C
0 0 0 1
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 5 0 2
6 6 1 2
7 7 2 2
8 8 3 2
9 9 4 2
10 10 0 3
11 11 1 3
12 12 2 3
13 13 3 3
14 14 4 3
print (df.groupby('C').sum())
A B
C
1 10 10
2 35 10
3 60 10
Or better, group by the Series directly:
print (df[['A','B']].groupby([((df.B == 0).cumsum())]).sum())
A B
B
1 10 10
2 35 10
3 60 10
For storing the groups it is possible to use a dict comprehension:
for i, g in df[['A','B']].groupby([((df.B == 0).cumsum())]):
    print (i)
    print (g)
1
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
2
A B
5 5 0
6 6 1
7 7 2
8 8 3
9 9 4
3
A B
10 10 0
11 11 1
12 12 2
13 13 3
14 14 4
dfs = {i-1: g for i,g in df[['A','B']].groupby([((df.B == 0).cumsum())])}
print (dfs[0])
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
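The on-the-fly key also combines with transforms, so column C never needs to be materialized. For example, a per-group running total of A (a sketch):
df['A_cumsum'] = df.groupby((df.B == 0).cumsum())['A'].cumsum()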