Split a data frame using groupby and merge the subsets into columns - python

I have a large pandas.DataFrame that looks something like this:
import numpy
import pandas

test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))
id score name
0 -0.652909 A
1 0.100885 A
2 0.410907 A
0 0.304012 B
1 -0.198157 B
2 -0.054764 B
0 0.358484 C
1 0.616415 C
2 0.389018 C
3 1.164172 C
So the index is non-unique, but it is unique within each group of the name column. I would like to split the data frame into subsections by name, assemble the score columns into one big new data frame by means of an outer join, and rename the score columns to the respective group keys. What I have at the moment is:
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
    df = df.join(sub["score"], how="outer")
    df.columns.values[-1] = key
This yields the expected result:
id A B C
0 -0.652909 0.304012 0.358484
1 0.100885 -0.198157 0.616415
2 0.410907 -0.054764 0.389018
3 NaN NaN 1.164172
but it does not seem very pandas-ic. Is there a better way?
Edit: Based on the answers I ran some simple timings.
%%timeit
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
    df = df.join(sub["score"], how="outer")
    df.columns.values[-1] = key
100 loops, best of 3: 2.46 ms per loop
%%timeit
test.set_index([test.index, "name"]).unstack()
1000 loops, best of 3: 1.04 ms per loop
%%timeit
test.pivot_table("score", test.index, "name")
100 loops, best of 3: 2.54 ms per loop
So unstack seems the method of choice.
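For reference, the pivot_table call above can also be spelled with keyword arguments, which is clearer and more robust across pandas versions; a minimal sketch assuming the test frame defined at the top:
# equivalent to test.pivot_table("score", test.index, "name"), with explicit keywords
test.pivot_table(values="score", index=test.index, columns="name")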

The function you are looking for is unstack. For pandas to know what to unstack, we first create a MultiIndex in which the name column is added as the last index level. unstack() then unstacks (by default) the last index level, so we get exactly what you want:
In[152]: test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))
In[153]: test
Out[153]:
score name
0 -0.208392 A
1 -0.103659 A
2 1.645287 A
0 0.119709 B
1 -0.047639 B
2 -0.479155 B
0 -0.415372 C
1 -1.390416 C
2 -0.384158 C
3 -1.328278 C
In[154]: test.set_index([test.index, 'name'], inplace=True)
test.unstack()
Out[154]:
score
name A B C
0 -0.208392 0.119709 -0.415372
1 -0.103659 -0.047639 -1.390416
2 1.645287 -0.479155 -0.384158
3 NaN NaN -1.328278
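If you would rather end up with plain A/B/C column names instead of the extra 'score' column level, a small variation of the same idea is to select the score column before unstacking (a sketch, using the MultiIndexed test from In[154]):
test["score"].unstack()   # columns are now just A, B, C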

I recently came across a similar problem, which was solved by using pivot_table:
a = """id score name
0 -0.652909 A
1 0.100885 A
2 0.410907 A
0 0.304012 B
1 -0.198157 B
2 -0.054764 B
0 0.358484 C
1 0.616415 C
2 0.389018 C
3 1.164172 C"""
df = pd.read_csv(StringIO(a), sep=r"\s+")
df = df.pivot_table(values='score', index='id', columns='name')
print(df)
Output:
name A B C
id
0 -0.652909 0.304012 0.358484
1 0.100885 -0.198157 0.616415
2 0.410907 -0.054764 0.389018
3 NaN NaN 1.164172
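As a side note, since every (id, name) pair occurs at most once in this data, plain pivot gives the same values without the implicit aggregation that pivot_table performs; a sketch reusing the string a and the imports above (the raw name is only for illustration):
raw = pd.read_csv(StringIO(a), sep=r"\s+")   # re-parse, since df was overwritten above
print(raw.pivot(index='id', columns='name', values='score'))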

Python: build object of Pandas dataframes

I have a dataframe with dtype=object columns, i.e. categorical variables, for which I'd like the counts of each level. I'd like the result to be a pretty summary of all categorical variables.
To achieve the aforementioned goals, I tried the following:
(line 1) grab the names of all object-type variables
(line 2) count the number of observations for each level (a, b of v1)
(line 3) rename the column so it reads "count"
stringCol = list(df.select_dtypes(include=['object'])) # list object of categorical variables
a = df.groupby(stringCol[0]).agg({stringCol[0]: 'count'})
a = a.rename(index=str, columns={stringCol[0]: 'count'}); a
count
v1
a 1279
b 2382
I'm not sure how to elegantly get the following result where all string column counts are printed. Like so (only v1 and v4 shown, but should be able to print such results for a variable number of columns):
count count
v1 v4
a 1279 l 32
b 2382 u 3055
y 549
The way I can think of doing it is:
select one element of stringCol
calculate the count for each group of the column.
store the result in a Pandas dataframe.
store the Pandas dataframe in an object (list?)
repeat
if last element of stringCol is done, break.
but there must be a better way than that, just not sure how to do it.
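As a side note (not from the original answers): if a single frame holding the level counts of every object column is acceptable, apply with value_counts gets close in one line, with NaN where a level does not occur in a column; a hedged sketch assuming df and stringCol as defined above:
# one row per level, one column per categorical variable
print(df[stringCol].apply(pd.Series.value_counts))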
I think the simplest is to use a loop:
import pandas as pd

df = pd.DataFrame({'A':list('abaaee'),
                   'B':list('abbccf'),
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aacbbb')})
print (df)
A B C D E F
0 a a 7 1 5 a
1 b b 8 3 3 a
2 a b 9 5 6 c
3 a c 4 7 9 b
4 e c 2 1 2 b
5 e f 3 0 4 b
stringCol = list(df.select_dtypes(include=['object']))
for c in stringCol:
    a = df[c].value_counts().rename_axis(c).to_frame('count')
    # alternative
    # a = df.groupby(c)[c].count().to_frame('count')
    print (a)
count
A
a 3
e 2
b 1
count
B
b 2
c 2
a 1
f 1
count
F
b 3
a 2
c 1
For a list of DataFrames, use a list comprehension:
dfs = [df[c].value_counts().rename_axis(c).to_frame('count') for c in stringCol]
print (dfs)
[ count
A
a 3
e 2
b 1, count
B
b 2
c 2
a 1
f 1, count
F
b 3
a 2
c 1]
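If you want those per-column counts in one object rather than a list of DataFrames, a further sketch (reusing dfs and stringCol from above, not part of the original answer) is to concatenate them with the source column as an extra index level:
summary = pd.concat(dfs, keys=stringCol)
summary.index.names = ['column', 'level']   # name the two index levels
print(summary)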

Return All Values of Column A and Put them in Column B until Specific Value Is reached

I am still having trouble with this, and nothing seems to work for me. I have a data frame with two columns. I am trying to return all of the values in column A in a new column, B. However, I want to go through column A and stop returning those values, returning 0 instead, once the cumulative sum reaches 8 or the next value would make it greater than 8.
df max_val = 8
A
1
2
2
3
4
5
1
The output should look something like this
df max_val = 8
A B
1 1
2 2
2 2
3 3
4 0
5 0
1 0
I thought of something like this:
def func(x):
    if df['A'].cumsum() <= max_val:
        return x
    else:
        return 0
This doesn't work:
df['B'] = df['A'].apply(func, axis =1 )
Neither does this:
df['B'] = func(df['A'])
You can use Series.where:
df['B'] = df['A'].where(df['A'].cumsum() <= max_val, 0)
print (df)
A B
0 1 1
1 2 2
2 2 2
3 3 3
4 4 0
5 5 0
6 1 0
Approach #1: One approach using np.where -
df['B'] = np.where(df.A.cumsum() <= max_val, df.A, 0)
Sample output -
In [145]: df
Out[145]:
A B
0 1 1
1 2 2
2 2 2
3 3 3
4 4 0
5 5 0
6 1 0
Approach #2: Another using array initialization -
def app2(df, max_val):
    a = df.A.values
    colB = np.zeros(df.shape[0], dtype=a.dtype)
    idx = np.searchsorted(a.cumsum(), max_val, 'right')
    colB[:idx] = a[:idx]
    df['B'] = colB
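For completeness, a small usage sketch of app2 on the question's sample data (assuming pandas and numpy are imported as pd and np):
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 1]})
app2(df, 8)    # fills column B in place
print(df)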
Runtime test
Seems like @jezrael's Series.where based one is the closest, so timing against it on a bigger dataset -
In [293]: df = pd.DataFrame({'A':np.random.randint(0,9,(1000000))})
In [294]: max_val = 1000000
# #jezrael's soln
In [295]: %timeit df['B1'] = df['A'].where(df['A'].cumsum() <= max_val, 0)
100 loops, best of 3: 8.22 ms per loop
# Proposed in this post
In [296]: %timeit df['B2']= np.where((df.A.cumsum()<=max_val), df.A ,0)
100 loops, best of 3: 6.45 ms per loop
# Proposed in this post
In [297]: %timeit app2(df, max_val)
100 loops, best of 3: 4.47 ms per loop
df['B'] = [a if c <= 8 else 0 for a, c in zip(df['A'], df['A'].cumsum())]
df
Out[7]:
A B
0 1 1
1 2 2
2 2 2
3 3 3
4 4 0
5 5 0
6 1 0
Why don't you add the values up in a running variable, like this:
total = 0
B = []
for x in df['A']:
    total = total + x
    B.append(x if total <= max_val else 0)
df['B'] = B
Splitting it into multiple lines:
import pandas as pd

A = [1, 2, 2, 3, 4, 5, 1]
MAXVAL = 8
df = pd.DataFrame(data=A, columns=['A'])
df['cumsumA'] = df['A'].cumsum()
df['B'] = df['A'] * (df['cumsumA'] <= MAXVAL).astype(int)
You can then drop the 'cumsumA' column
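For completeness, the helper column can then be dropped (a one-line sketch):
df = df.drop(columns=['cumsumA'])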
The below will work fine -
import numpy as np
max_val = 8
df['B'] = np.where(df['A'].cumsum() <= max_val, df['A'], 0)
I hope this helps.
Just a way to do it with .loc:
df['C'] = df['A'].cumsum()
df['B'] = df['A']
df.loc[df['C'] > max_val, 'B'] = 0

Compare two columns ( string formats) in two data frames while length of columns is not the same

The following are two data frames:
Data frame A:
index codes
1 A
2 B
3 C
4 D
Data frame B
index cym
1 A
2 L
3 F
4 B
5 N
6 X
The lengths of A and B are not equal. I want to compare column "codes" (data frame A) with column "cym" (data frame B) and return the rows whose cym value does not appear in codes, together with the index column of data frame B. The output should look like this:
index cym
2 L
3 F
5 N
6 X
I tried to solve it using the merge and equals functions, but I could not generate the output.
You can use isin:
B[~B.cym.isin(A.codes)]
#index cym
#1 2 L
#2 3 F
#4 5 N
#5 6 X
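Since the question mentions merge, here is a hedged sketch of that route as well, using merge's indicator flag (the column names codes and cym come from the question; everything else is assumption):
# assumes codes values are unique in A, otherwise rows of B may be duplicated
merged = B.merge(A[['codes']], left_on='cym', right_on='codes', how='left', indicator=True)
print(merged[merged['_merge'] == 'left_only'].drop(columns=['codes', '_merge']))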
A more verbose but faster version of @Psidom's answer:
mask = ~np.in1d(B.cym.values, A.codes.values)
pd.DataFrame(
    B.values[mask],
    B.index[mask],
    B.columns
)
index cym
1 2 L
2 3 F
4 5 N
5 6 X
Timing
%timeit B[~B.cym.isin(A.codes)]
1000 loops, best of 3: 348 µs per loop
%%timeit
mask = ~np.in1d(B.cym.values, A.codes.values)
pd.DataFrame(
    B.values[mask],
    B.index[mask],
    B.columns
)
10000 loops, best of 3: 194 µs per loop
For the sake of completeness:
In [22]: B.query("cym not in @A.codes")
Out[22]:
index cym
0 2 L
1 3 F
2 5 N
3 6 X

Pandas: Keep rows if at least one of them contains certain value

I have the following dataframe in Pandas
letter number
------ -------
a 2
a 0
b 1
b 5
b 2
c 1
c 0
c 2
I'd like to keep all rows of a letter if at least one of that letter's numbers is 0.
Result would be:
letter number
------ -------
a 2
a 0
c 1
c 0
c 2
since b has no number equal to 0.
What is the best way to do this?
Thanks!
You need filtration:
df = df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
print (df)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
Another solution with transform, which gets the count of 0 rows per group and then filters by boolean indexing:
print (df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()))
0 1
1 1
2 0
3 0
4 0
5 1
6 1
7 1
Name: number, dtype: int64
df = df[df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()) > 0]
print (df)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
EDIT:
It is faster not to use groupby; better is loc with isin:
df1 = df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
print (df1)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
Comparing with another solution:
In [412]: %timeit df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 815 µs per loop
In [413]: %timeit df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
1000 loops, best of 3: 657 µs per loop
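A closely related vectorized variant, shown here as a sketch: compare to 0 with eq, group that boolean by letter, and keep the groups where any value is True:
mask = df['number'].eq(0).groupby(df['letter']).transform('any')
print(df[mask])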
You can also do this without the groupby by working out which letters to keep and then using isin. I think this is a bit neater personally:
>>> letters_to_keep = df[df['number'] == 0]['letter']
>>> df_reduced = df[df['letter'].isin(letters_to_keep)]
>>> df_reduced
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
I suspect this would be faster than doing a groupby, though that may not be relevant here! A simple timeit indicates this is the case:
>>> %%timeit
... df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
100 loops, best of 3: 2.26 ms per loop
>>> %%timeit
... df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 820 µs per loop

Pivot Table and Counting

I have a data set indicating who has shopped at which stores.
ID Store
1 C
1 A
2 A
2 B
3 A
3 B
3 C
Can I use a pivot table to determine how often shoppers go to other stores? I'm thinking of a 3x3 matrix whose rows and columns are stores and whose cells indicate how many people went to both stores.
Desired output
A B C
A 3 2 2
B 2 3 1
C 2 1 3
You can create a contingency table of ID and Store with pd.crosstab() and then calculate the matrix product of its transpose and itself, which should produce what you need:
mat = pd.crosstab(df.ID, df.Store)
mat.T.dot(mat)
#Store A B C
#Store
# A 3 2 2
# B 2 2 1
# C 2 1 2
Note: since only two IDs visited store B and two visited store C, I suppose the corresponding diagonal cells should be 2 instead of 3.
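If an ID could visit the same store more than once, the raw counts would inflate the dot product; a hedged refinement is to clip the crosstab to 0/1 first, so the result stays the number of distinct shoppers in common:
mat = pd.crosstab(df.ID, df.Store).clip(upper=1)   # 1 if the ID ever visited the store
print(mat.T.dot(mat))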
Another, faster solution with groupby, unstack and dot:
df = df.groupby(['ID','Store']).size().unstack(fill_value=0)
df = df.T.dot(df)
print (df)
Store A B C
Store
A 3 2 2
B 2 2 1
C 2 1 2
Timings:
In [119]: %timeit (jez(df))
1000 loops, best of 3: 1.72 ms per loop
In [120]: %timeit (psi(df))
100 loops, best of 3: 7.07 ms per loop
Code for timings:
N = 1000
df = pd.DataFrame({'ID': np.random.choice(5, N),
                   'Store': np.random.choice(list('ABCDEFGHIJK'), N)})
print (df)
def jez(df):
    df = df.groupby(['ID','Store']).size().unstack(fill_value=0)
    return df.T.dot(df)
def psi(df):
    mat = pd.crosstab(df.ID, df.Store)
    return mat.T.dot(mat)
print (jez(df))
print (psi(df))
