Explode multiple list columns into rows - python

How to explode the list into rows?
I have the following data frame:
import pandas as pd

df = pd.DataFrame([
    (1, [1, 2, 3], ['a', 'b', 'c']),
    (2, [4, 5, 6], ['d', 'e', 'f']),
    (3, [7, 8], ['g', 'h']),
])
It is displayed as follows:
0 1 2
0 1 [1, 2, 3] [a, b, c]
1 2 [4, 5, 6] [d, e, f]
2 3 [7, 8] [g, h]
I want to have the following output:
0 1 2
0 1 1 a
1 1 2 b
2 1 3 c
3 2 4 d
4 2 5 e
5 2 6 f
6 3 7 g
7 3 8 h

You can use str.len to get the length of each list, repeat the scalar column that many times with numpy.repeat, and flatten the list columns with itertools.chain:
from itertools import chain
import numpy as np
import pandas as pd
df2 = pd.DataFrame({
0: np.repeat(df.iloc[:,0].values, df.iloc[:,1].str.len()),
1: list(chain.from_iterable(df.iloc[:,1])),
2: list(chain.from_iterable(df.iloc[:,2]))})
print (df2)
0 1 2
0 1 1 a
1 1 2 b
2 1 3 c
3 2 4 d
4 2 5 e
5 2 6 f
6 3 7 g
7 3 8 h
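On pandas 1.3 or later, DataFrame.explode accepts a list of columns, so the same result can probably be reached without rebuilding the frame by hand. This is a minimal sketch, assuming the lists within each row have equal lengths (as in the question); ignore_index=True renumbers the rows.

```python
import pandas as pd

df = pd.DataFrame([
    (1, [1, 2, 3], ['a', 'b', 'c']),
    (2, [4, 5, 6], ['d', 'e', 'f']),
    (3, [7, 8], ['g', 'h']),
])

# Explode columns 1 and 2 together; this needs pandas >= 1.3 and
# equal list lengths within each row.
df2 = df.explode([1, 2], ignore_index=True)
print(df2)
```

The exploded columns keep object dtype, so a follow-up astype may be needed if numeric columns are expected.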

Related

How to combine repeated header columns for multi-index pandas dataframe?

Current dataframe:
a a b b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
desired dataframe:
a b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
It's a multi-index data frame, and I want a dynamic method that groups repeated top-level headers into one for the columns where they repeat.
The two dataframes are exactly the same. If you want to change the style of the display you can do the following:
df = pd.DataFrame(np.array([[1, 2, 9, 1, 4],
[2, 3, 9, 2, 4],
[3, 8, 7, 8, 3],
[8, 8, 9, 0, 0]]),
columns=pd.MultiIndex.from_arrays([list('aabbc'), list('klmno')]),
index =list('abcd')
)
Default print style:
>>> print(df)
a b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
Alternative style:
>>> with pd.option_context('display.multi_sparse', False):
... print (df)
a a b b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
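To check that the difference is only in how the header is rendered, one can inspect the column index directly. The sketch below rebuilds the same frame; the variable names top_level, sparse_text and full_text are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 9, 1, 4],
              [2, 3, 9, 2, 4],
              [3, 8, 7, 8, 3],
              [8, 8, 9, 0, 0]]),
    columns=pd.MultiIndex.from_arrays([list('aabbc'), list('klmno')]),
    index=list('abcd'),
)

# The top level is stored once per column, even though print() hides repeats.
top_level = df.columns.get_level_values(0).tolist()
print(top_level)

# Rendering with and without sparsification changes the text, not the data.
sparse_text = df.to_string()
with pd.option_context('display.multi_sparse', False):
    full_text = df.to_string()
print(sparse_text != full_text)
```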

Pandas Split lists into multiple rows

I have a Dataframe like this
pd.DataFrame([(1,'a','i',[1,2,3],['a','b','c']),(2,'b','i',[4,5],['d','e','f']),(3,'a','j',[7,8,9],['g','h'])])
Output:
0 1 2 3 4
0 1 a i [1, 2, 3] [a, b, c]
1 2 b i [4, 5] [d, e, f]
2 3 a j [7, 8, 9] [g, h]
I want to explode columns 3 and 4, matching their elements by position and preserving the rest of the columns, as shown below. I went through this question, but its answer creates a new DataFrame and defines every column again, which is memory-inefficient (I have 18L, i.e. 1.8 million, rows and 19 columns).
0 1 2 3 4
0 1 a i 1 a
1 1 a i 2 b
2 1 a i 3 c
3 2 b i 4 d
4 2 b i 5 e
5 2 b i NaN f
6 3 a j 7 g
7 3 a j 8 h
8 3 a j 9 NaN
Update: I forgot to mention that where one list is shorter than the other, the missing positions should be filled with NaN.
Another solution:
df_out = df.explode(3)
df_out[4] = df[4].explode()
print(df_out)
Prints:
0 1 2 3 4
0 1 a i 1 a
0 1 a i 2 b
0 1 a i 3 c
1 2 b i 4 d
1 2 b i 5 e
1 2 b i 6 f
2 3 a j 7 g
2 3 a j 8 h
EDIT: To handle uneven cases:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        (1, "a", "i", [1, 2, 3], ["a", "b", "c"]),
        (2, "b", "i", [4, 5], ["d", "e", "f"]),
        (3, "a", "j", [7, 8, 9], ["g", "h"]),
    ]
)

def fn(x):
    # pad the shorter of the two lists with NaN so both have equal length
    if len(x[3]) < len(x[4]):
        x[3].extend([np.nan] * (len(x[4]) - len(x[3])))
    elif len(x[3]) > len(x[4]):
        x[4].extend([np.nan] * (len(x[3]) - len(x[4])))
    return x
# "even-out" the lists:
df = df.apply(fn, axis=1)
# explode them:
df_out = df.explode(3)
df_out[4] = df[4].explode()
print(df_out)
Prints:
0 1 2 3 4
0 1 a i 1 a
0 1 a i 2 b
0 1 a i 3 c
1 2 b i 4 d
1 2 b i 5 e
1 2 b i NaN f
2 3 a j 7 g
2 3 a j 8 h
2 3 a j 9 NaN
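A possible variant of the padding step that leaves the original lists untouched is sketched below. It assumes pandas 1.3 or later (for multi-column explode) and uses itertools.zip_longest to do the padding; pairs and padded are illustrative names, not part of the answer above.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

df = pd.DataFrame([
    (1, "a", "i", [1, 2, 3], ["a", "b", "c"]),
    (2, "b", "i", [4, 5], ["d", "e", "f"]),
    (3, "a", "j", [7, 8, 9], ["g", "h"]),
])

# Pair up the two lists row by row; zip_longest pads the shorter one with NaN.
pairs = [list(zip_longest(a, b, fillvalue=np.nan)) for a, b in zip(df[3], df[4])]

# Write the padded lists into a copy so the lists in df are not mutated.
padded = df.copy()
padded[3] = [[x for x, _ in p] for p in pairs]
padded[4] = [[y for _, y in p] for p in pairs]

# Multi-column explode needs pandas >= 1.3.
df_out = padded.explode([3, 4], ignore_index=True)
print(df_out)
```

Because new lists are built instead of extending the existing ones, df can still be reused afterwards with its original contents.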
You can use pd.Series.explode:
df = df.apply(pd.Series.explode).reset_index(drop=True)
output:
0 1 2 3 4
0 1 a i 1 a
1 1 a i 2 b
2 1 a i 3 c
3 2 b i 4 d
4 2 b i 5 e
5 2 b i 6 f
6 3 a j 7 g
7 3 a j 8 h

How to count the number of occurrences of semi-duplicate rows and make the count a new column

I have a pandas dataframe as follows:
df = pd.DataFrame({'A':[4, 4, 1, 5, 1, 1],
'B':[2, 2, 2, 5, 2, 2],
'C':[1, 1, 3, 5, 3, 3],
'D':['q', 'e', 'r', 'y', 'u', 'w']})
which looks like
A B C D
0 4 2 1 q
1 4 2 1 e
2 1 2 3 r
3 5 5 5 y
4 1 2 3 u
5 1 2 3 w
I would like to add a new column that is the count of duplicate rows, with respect to only the columns A, B, and C. This would look like
A B C D Count
0 4 2 1 q 2
1 4 2 1 e 2
2 1 2 3 r 3
3 5 5 5 y 1
4 1 2 3 u 3
5 1 2 3 w 3
I'm guessing this will be something like df.groupby(['A','B','C']).size() but I am unsure how to map the values back to the new 'Count' column. Thanks!
You can use transform:
df['Count'] = df.groupby(['A','B','C']).D.transform('count')
df['Count']
0 2
1 2
2 3
3 1
4 3
5 3
Name: Count, dtype: int64
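One caveat worth illustrating: transform('count') counts non-missing values of the selected column, so a NaN in D lowers the result, whereas transform('size') counts every row in the group. The sketch below uses the question's data with one D value replaced by NaN purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [4, 4, 1, 5, 1, 1],
                   'B': [2, 2, 2, 5, 2, 2],
                   'C': [1, 1, 3, 5, 3, 3],
                   'D': ['q', np.nan, 'r', 'y', 'u', 'w']})  # one missing D value

groups = df.groupby(['A', 'B', 'C'])
count = groups['D'].transform('count')  # skips the missing value in D
size = groups['A'].transform('size')    # counts every row in the group

df['Count'] = count
df['Size'] = size
print(df)
```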

pandas cut multiple columns

I am looking to apply a bin across a number of columns.
a = [1, 2, 9, 1, 5, 3]
b = [9, 8, 7, 8, 9, 1]
c = [a, b]
print(pd.cut(c, 3, labels=False))
which works great and creates:
[[0 0 2 0 1 0]
[2 2 2 2 2 0]]
However, I would like to apply the cut and create a dataframe for each list, holding the values and their bins, as below.
Values bin
0 1 0
1 2 0
2 9 2
3 1 0
4 5 1
5 3 0
Values bin
0 9 2
1 8 2
2 7 2
3 8 2
4 9 2
5 1 0
This is a simple example of what I'm looking to do. In reality I have 63 separate dataframes, and a and b are examples of a column from each dataframe.
Use zip with a list comprehension to build a list of dataframes:
c = [a, b]
r = pd.cut(c, 3, labels=False)
df_list = [pd.DataFrame({'Values' : v, 'Labels' : l}) for v, l in zip(c, r)]
df_list
[ Labels Values
0 0 1
1 0 2
2 2 9
3 0 1
4 1 5
5 0 3, Labels Values
0 2 9
1 2 8
2 2 7
3 2 8
4 2 9
5 0 1]
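Newer pandas releases appear to require one-dimensional input for pd.cut, so passing the nested list directly may raise an error there. A hedged workaround is to bin the flattened values (so both lists share the same bin edges) and reshape afterwards; the column name 'bin' follows the question rather than the answer above.

```python
import numpy as np
import pandas as pd

a = [1, 2, 9, 1, 5, 3]
b = [9, 8, 7, 8, 9, 1]
c = np.array([a, b])

# Bin the flattened values so both lists share the same 3 bin edges,
# then restore the original 2 x 6 shape.
r = pd.cut(c.ravel(), 3, labels=False).reshape(c.shape)

# One dataframe per original list, pairing each value with its bin index.
df_list = [pd.DataFrame({'Values': v, 'bin': l}) for v, l in zip(c, r)]
for frame in df_list:
    print(frame)
```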

Pandas DataFrame drop tuple or list of columns

When using the drop method of a pandas.DataFrame, it accepts lists of column names but not tuples, despite the documentation saying that "list-like" arguments are acceptable. Am I reading the documentation incorrectly? I would expect my MWE to work.
MWE
import pandas as pd
df = pd.DataFrame({k: range(5) for k in list('abcd')})
df.drop(['a', 'c'], axis=1) # Works
df.drop(('a', 'c'), axis=1) # Errors
Versions - Using Python 2.7.12, Pandas 0.20.3.
The problem is that a tuple selects a single MultiIndex label:
np.random.seed(345)
mux = pd.MultiIndex.from_arrays([list('abcde'), list('cdefg')])
df = pd.DataFrame(np.random.randint(10, size=(4,5)), columns=mux)
print (df)
a b c d e
c d e f g
0 8 0 3 9 8
1 4 3 4 1 7
2 4 0 9 6 3
3 8 0 3 1 5
df = df.drop(('a', 'c'), axis=1)
print (df)
b c d e
d e f g
0 0 3 9 8
1 3 4 1 7
2 0 9 6 3
3 0 3 1 5
The same tuple selects that column when used on the original DataFrame (before the drop):
df = df[('a', 'c')]
print (df)
0 8
1 4
2 4
3 8
Name: (a, c), dtype: int32
Pandas treats tuples as multi-index values, so try this instead:
In [330]: df.drop(list(('a', 'c')), axis=1)
Out[330]:
b d
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Here is an example of deleting rows (axis=0, the default) in a multi-index DataFrame:
In [342]: x = df.set_index(np.arange(len(df), 0, -1), append=True)
In [343]: x
Out[343]:
a b c d
0 5 0 0 0 0
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [344]: x.drop((0,5))
Out[344]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [345]: x.drop([(0,5), (4,1)])
Out[345]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
So when you pass a tuple, Pandas treats it as a single multi-index label.
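The behaviour on a flat column index can be checked with a small sketch like the one below. It assumes a reasonably recent pandas, where the failed lookup surfaces as a KeyError; cols and tuple_failed are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({k: range(5) for k in 'abcd'})
cols = ('a', 'c')  # tuple of column names

try:
    df.drop(cols, axis=1)  # the tuple is looked up as ONE label ('a', 'c')
    tuple_failed = False
except KeyError:
    tuple_failed = True

# Converting to a list makes pandas treat each element as its own label.
result = df.drop(columns=list(cols))
print(tuple_failed, list(result.columns))
```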
I used this to delete a column whose label is a tuple:
del df3[('val1', 'val2')]
and the column was removed.
