I have a DataFrame like this:
pd.DataFrame([(1,'a','i',[1,2,3],['a','b','c']),(2,'b','i',[4,5],['d','e','f']),(3,'a','j',[7,8,9],['g','h'])])
Output:
0 1 2 3 4
0 1 a i [1, 2, 3] [a, b, c]
1 2 b i [4, 5] [d, e, f]
2 3 a j [7, 8, 9] [g, h]
I want to explode columns 3 and 4, matching their indices and preserving the rest of the columns, like this. I went through this question, but the answer there is to create a new DataFrame and define all the columns again, which is memory inefficient (I have 18L (1.8 million) rows and 19 columns).
0 1 2 3 4
0 1 a i 1 a
1 1 a i 2 b
2 1 a i 3 c
3 2 b i 4 d
4 2 b i 5 e
5 2 b i NaN f
6 3 a j 7 g
7 3 a j 8 h
8 3 c j 9 NaN
Update: Forgot to mention that for missing indices the other column should get NaN.
Another solution (this assumes the two list columns have matching lengths in each row; see the EDIT below for uneven lists):
df_out = df.explode(3)
df_out[4] = df[4].explode()
print(df_out)
Prints:
0 1 2 3 4
0 1 a i 1 a
0 1 a i 2 b
0 1 a i 3 c
1 2 b i 4 d
1 2 b i 5 e
1 2 b i 6 f
2 3 a j 7 g
2 3 a j 8 h
EDIT: To handle uneven cases:
df = pd.DataFrame(
[
(1, "a", "i", [1, 2, 3], ["a", "b", "c"]),
(2, "b", "i", [4, 5], ["d", "e", "f"]),
(3, "a", "j", [7, 8, 9], ["g", "h"]),
]
)
import numpy as np

def fn(x):
    if len(x[3]) < len(x[4]):
        x[3].extend([np.nan] * (len(x[4]) - len(x[3])))
    elif len(x[3]) > len(x[4]):
        x[4].extend([np.nan] * (len(x[3]) - len(x[4])))
    return x
# "even-out" the lists:
df = df.apply(fn, axis=1)
# explode them:
df_out = df.explode(3)
df_out[4] = df[4].explode()
print(df_out)
Prints:
0 1 2 3 4
0 1 a i 1 a
0 1 a i 2 b
0 1 a i 3 c
1 2 b i 4 d
1 2 b i 5 e
1 2 b i NaN f
2 3 a j 7 g
2 3 a j 8 h
2 3 a j 9 NaN
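Once the lists have been evened out like this, newer pandas (1.3+) can also explode both columns in a single call instead of exploding them separately; a minimal sketch of the same evening-out step followed by a multi-column explode:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        (1, "a", "i", [1, 2, 3], ["a", "b", "c"]),
        (2, "b", "i", [4, 5], ["d", "e", "f"]),
        (3, "a", "j", [7, 8, 9], ["g", "h"]),
    ]
)

def fn(x):
    # pad the shorter list with NaN so both lists match in length
    if len(x[3]) < len(x[4]):
        x[3].extend([np.nan] * (len(x[4]) - len(x[3])))
    elif len(x[3]) > len(x[4]):
        x[4].extend([np.nan] * (len(x[3]) - len(x[4])))
    return x

df = df.apply(fn, axis=1)

# pandas >= 1.3: explode both list columns in one call
# (their per-row lengths must match, which fn guarantees)
df_out = df.explode([3, 4], ignore_index=True)
```

With ignore_index=True the result is renumbered 0..8, so there is no need for a later reset_index.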
You can use pd.Series.explode (note that this needs the lists in each row to have equal lengths):
df = df.apply(pd.Series.explode).reset_index(drop=True)
output:
0 1 2 3 4
0 1 a i 1 a
1 1 a i 2 b
2 1 a i 3 c
3 2 b i 4 d
4 2 b i 5 e
5 2 b i 6 f
6 3 a j 7 g
7 3 a j 8 h
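If the lists can differ in length, one way to keep the apply(pd.Series.explode) approach working is to pad the shorter list first; a sketch, where the pad_lists helper is illustrative and not part of the answer above:

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        (1, "a", "i", [1, 2, 3], ["a", "b", "c"]),
        (2, "b", "i", [4, 5], ["d", "e", "f"]),
        (3, "a", "j", [7, 8, 9], ["g", "h"]),
    ]
)

def pad_lists(row):
    # zip_longest fills the shorter list with NaN so lengths match
    pairs = list(zip_longest(row[3], row[4], fillvalue=np.nan))
    row[3] = [p[0] for p in pairs]
    row[4] = [p[1] for p in pairs]
    return row

out = (df.apply(pad_lists, axis=1)
         .apply(pd.Series.explode)
         .reset_index(drop=True))
```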
I have data that looks like this:
library("tidyverse")
df <- tibble(user = c(1, 1, 2, 3, 3, 3), x = c("a", "b", "a", "a", "c", "d"), y = 1)
df
# user x y
# 1 1 a 1
# 2 1 b 1
# 3 2 a 1
# 4 3 a 1
# 5 3 c 1
# 6 3 d 1
Python format:
import pandas as pd
df = pd.DataFrame({'user':[1, 1, 2, 3, 3, 3], 'x':['a', 'b', 'a', 'a', 'c', 'd'], 'y':1})
I'd like to "complete" the data frame so that every user has a record for every possible x with the default y fill set to 0.
This is somewhat trivial in R (tidyverse/tidyr):
df %>%
complete(nesting(user), x = c("a", "b", "c", "d"), fill = list(y = 0))
# user x y
# 1 1 a 1
# 2 1 b 1
# 3 1 c 0
# 4 1 d 0
# 5 2 a 1
# 6 2 b 0
# 7 2 c 0
# 8 2 d 0
# 9 3 a 1
# 10 3 b 0
# 11 3 c 1
# 12 3 d 1
Is there a complete equivalent in pandas / python that will yield the same result?
You can use reindex with an index built by MultiIndex.from_product:
df = df.set_index(['user','x'])
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['user', 'x'])
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
user x y
0 1 a 1
1 1 b 1
2 1 c 0
3 1 d 0
4 2 a 1
5 2 b 0
6 2 c 0
7 2 d 0
8 3 a 1
9 3 b 0
10 3 c 1
11 3 d 1
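A sketch of a close variant that builds the product index from the unique column values rather than from the existing index levels, so it does not depend on the index having been set first:

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 1, 2, 3, 3, 3],
                   'x': ['a', 'b', 'a', 'a', 'c', 'd'],
                   'y': 1})

# build the full user-x product from the unique values in each column
mux = pd.MultiIndex.from_product([df['user'].unique(), df['x'].unique()],
                                 names=['user', 'x'])
out = df.set_index(['user', 'x']).reindex(mux, fill_value=0).reset_index()
```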
Or set_index + unstack + stack:
df = df.set_index(['user','x'])['y'].unstack(fill_value=0).stack().reset_index(name='y')
print (df)
user x y
0 1 a 1
1 1 b 1
2 1 c 0
3 1 d 0
4 2 a 1
5 2 b 0
6 2 c 0
7 2 d 0
8 3 a 1
9 3 b 0
10 3 c 1
11 3 d 1
We could use the complete function from pyjanitor, which provides a convenient abstraction to generate the missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
df.complete('user', 'x', fill_value=0)
user x y
0 1 a 1
1 1 b 1
2 1 c 0
3 1 d 0
4 2 a 1
5 2 b 0
6 2 c 0
7 2 d 0
8 3 a 1
9 3 b 0
10 3 c 1
11 3 d 1
More examples can be found in the pyjanitor documentation.
Another pandas option could be pivot + fillna + melt:
df2 = (df
.pivot(index='user', columns='x', values='y')
.fillna(0)
.melt(value_name='y', ignore_index=False)
.reset_index()
.sort_values(['user', 'x'])
)
It's now very easy to use those dplyr/tidyr APIs in Python with datar:
>>> from datar.all import f, c, tibble, complete, nesting
>>> df = tibble(user=c(1, 1, 2, 3, 3, 3), x=c("a", "b", "a", "a", "c", "d"), y=1)
>>> df >> complete(nesting(f.user), x=c("a", "b", "c", "d"), fill={'y': 0})
user x y
<int64> <object> <float64>
0 1 a 1.0
1 1 b 1.0
2 1 c 0.0
3 1 d 0.0
4 2 a 1.0
5 2 b 0.0
6 2 c 0.0
7 2 d 0.0
8 3 a 1.0
9 3 b 0.0
10 3 c 1.0
11 3 d 1.0
I am the author of the package. Feel free to submit issues if you have any questions.
Disclaimer: it's my first question, I feel a little awkward :/
I have a DataFrame that has two columns, title and val:
>>> df=pd.DataFrame(
... {
... "title": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
... "val": [1, 4, 3, 4, 5, 6, 2, 4, 5],
... }
... )
>>> df
title val
0 a 1
1 a 4
2 a 3
3 b 4
4 b 5
5 b 6
6 c 2
7 c 4
8 c 5
I would like to group the rows into clusters with the same title, then sort the clusters in descending order of the maximum val they contain. If the maximum val is equal, sort the clusters alphabetically by title. The rows within each cluster must be sorted by val in descending order.
I know I can do it the long way, like:
>>> df.loc[:,'max_value']=df.groupby('title', as_index=False).transform(max)
>>> df
title val max_value
0 a 1 4
1 a 4 4
2 a 3 4
3 b 4 6
4 b 5 6
5 b 6 6
6 c 2 5
7 c 4 5
8 c 5 5
>>> df.sort_values(['max_value','title', 'val'], ascending=False, inplace=True)
>>> df.drop(columns='max_value', inplace=True)
>>> df
title val
5 b 6
4 b 5
3 b 4
8 c 5
7 c 4
6 c 2
1 a 4
2 a 3
0 a 1
But is there a shortcut?
This is sorting by multiple columns, so create a helper column (here a) used only for sorting:
df = df.loc[df.assign(a=df.groupby('title')['val'].transform('max'))
              .sort_values(['a', 'title', 'val'], ascending=False).index]
print (df)
title val
5 b 6
4 b 5
3 b 4
8 c 5
7 c 4
6 c 2
1 a 4
2 a 3
0 a 1
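A sketch of the same idea as a single method chain, using transform('max') (passing the string rather than the bare builtin avoids deprecation warnings in newer pandas):

```python
import pandas as pd

df = pd.DataFrame({"title": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
                   "val": [1, 4, 3, 4, 5, 6, 2, 4, 5]})

# add the per-cluster max, sort on it, then drop the helper column
out = (df.assign(max_val=df.groupby('title')['val'].transform('max'))
         .sort_values(['max_val', 'title', 'val'], ascending=False)
         .drop(columns='max_val'))
```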
Current dataframe:
a a b b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
desired dataframe:
a b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
It's a MultiIndex DataFrame; I want a dynamic way to group repeated column headers into one.
The two dataframes are exactly the same; only the display differs. If you want to change the display style, you can do the following:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 9, 1, 4],
                            [2, 3, 9, 2, 4],
                            [3, 8, 7, 8, 3],
                            [8, 8, 9, 0, 0]]),
                  columns=pd.MultiIndex.from_arrays([list('aabbc'), list('klmno')]),
                  index=list('abcd'))
default print style:
>>> print(df)
a b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
Alternative style:
>>> with pd.option_context('display.multi_sparse', False):
... print (df)
a a b b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
My pandas dataframe looks like this
A B C D E
(Name1, 1) NaN NaN NaN NaN NaN
(Name2, 2) NaN NaN NaN NaN NaN
How do I access a particular cell or change the value of a particular cell?
I created the dataframe using this:
from itertools import product

id = list(product(array1, array2))
data = pd.DataFrame(index=id, columns=array3)
I think you need MultiIndex:
import numpy as np
import pandas as pd

np.random.seed(124)
array1 = np.array(['Name1', 'Name2'])
array2 = np.array([1, 2])
array3 = np.array(list('ABCDE'))

idx = pd.MultiIndex.from_product([array1, array2])
data = pd.DataFrame(np.random.randint(10, size=[len(idx), len(array3)]),
                    index=idx, columns=array3)
print (data)
A B C D E
Name1 1 1 7 2 9 0
2 4 4 5 5 6
Name2 1 9 6 0 8 9
2 9 0 2 2 1
print (data.index)
MultiIndex(levels=[['Name1', 'Name2'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
data.loc[('Name1', 2), 'B'] = 20
print (data)
A B C D E
Name1 1 1 7 2 9 0
2 4 20 5 5 6
Name2 1 9 6 0 8 9
2 9 0 2 2 1
For more complicated selections, use slicers:
idx = pd.IndexSlice
data.loc[idx['Name1', 2], 'B'] = 20
print (data)
A B C D E
Name1 1 1 7 2 9 0
2 4 20 5 5 6
Name2 1 9 6 0 8 9
2 9 0 2 2 1
idx = pd.IndexSlice
print (data.loc[idx['Name1', 2], 'A'])
4
# select all rows with 2 in the second level, column A
idx = pd.IndexSlice
print (data.loc[idx[:, 2], 'A'])
Name1 2 4
Name2 2 9
Name: A, dtype: int32
# select 1 from the second level and slice columns B through D
idx = pd.IndexSlice
print (data.loc[idx[:, 1], idx['B':'D']])
B C D
Name1 1 7 2 9
Name2 1 6 0 8
For simpler selections use DataFrame.xs:
print (data.xs('Name1', axis=0, level=0))
A B C D E
1 1 7 2 9 0
2 4 4 5 5 6
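For single-cell get/set, DataFrame.at also accepts the full MultiIndex tuple as the row key; a minimal sketch on a frame shaped like the one above:

```python
import numpy as np
import pandas as pd

np.random.seed(124)
idx = pd.MultiIndex.from_product([['Name1', 'Name2'], [1, 2]])
data = pd.DataFrame(np.random.randint(10, size=(len(idx), 5)),
                    index=idx, columns=list('ABCDE'))

# .at is the fast scalar accessor; the row key is the full index tuple
data.at[('Name1', 2), 'B'] = 20
```

.at is faster than .loc for single scalars because it skips the slicing machinery entirely.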
How to explode the list into rows?
I have the following data frame:
df = pd.DataFrame([
(1,
[1,2,3],
['a','b','c']
),
(2,
[4,5,6],
['d','e','f']
),
(3,
[7,8],
['g','h']
)
])
Shown in output as follows
0 1 2
0 1 [1, 2, 3] [a, b, c]
1 2 [4, 5, 6] [d, e, f]
2 3 [7, 8] [g, h]
I want to have the following output:
0 1 2
0 1 1 a
1 1 2 b
2 1 3 c
3 2 4 d
4 2 5 e
5 2 6 f
6 3 7 g
7 3 8 h
You can use str.len to get the lengths of the lists, repeat the other columns with numpy.repeat, and flatten the lists with itertools.chain:
from itertools import chain

import numpy as np
import pandas as pd
df2 = pd.DataFrame({
0: np.repeat(df.iloc[:,0].values, df.iloc[:,1].str.len()),
1: list(chain.from_iterable(df.iloc[:,1])),
2: list(chain.from_iterable(df.iloc[:,2]))})
print (df2)
0 1 2
0 1 1 a
1 1 2 b
2 1 3 c
3 2 4 d
4 2 5 e
5 2 6 f
6 3 7 g
7 3 8 h
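On pandas 1.3+, where the per-row list lengths match as they do here, DataFrame.explode can take several columns at once; a sketch without itertools:

```python
import pandas as pd

df = pd.DataFrame([(1, [1, 2, 3], ['a', 'b', 'c']),
                   (2, [4, 5, 6], ['d', 'e', 'f']),
                   (3, [7, 8], ['g', 'h'])])

# explode both list columns at once; ignore_index renumbers the result 0..n-1
df2 = df.explode([1, 2], ignore_index=True)
```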