R dcast equivalent in python pandas

I am trying to do the equivalent of the below commands in python:
test <- data.frame(convert_me=c('Convert1','Convert2','Convert3'),
values=rnorm(3,45, 12), age_col=c('23','33','44'))
test
library(reshape2)
t <- dcast(test, values ~ convert_me+age_col, length )
t
That is, this:
convert_me values age_col
Convert1 21.71502 23
Convert2 58.35506 33
Convert3 60.41639 44
becomes this:
values Convert2_33 Convert1_23 Convert3_44
21.71502 0 1 0
58.35506 1 0 0
60.41639 0 0 1
I know that with dummy variables I can take the values of a column and turn them into column names, but is there a way to easily merge (combine) the two columns, as R does?

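Assuming your data is already in a pandas DataFrame named df (a minimal sketch of constructing it from the values printed above):
import pandas as pd

df = pd.DataFrame({
    'convert_me': ['Convert1', 'Convert2', 'Convert3'],
    'values': [21.71502, 58.35506, 60.41639],
    'age_col': ['23', '33', '44'],  # kept as strings, matching the R example
})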
You can use the crosstab function for this:
In [14]: pd.crosstab(index=df['values'], columns=[df['convert_me'], df['age_col']])
Out[14]:
convert_me Convert1 Convert2 Convert3
age_col 23 33 44
values
21.71502 1 0 0
58.35506 0 1 0
60.41639 0 0 1
or pivot_table (with len as the aggregating function, but here you have to fill the resulting NaNs with zeros yourself using fillna):
In [18]: df.pivot_table(index=['values'], columns=['age_col', 'convert_me'], aggfunc=len).fillna(0)
Out[18]:
age_col 23 33 44
convert_me Convert1 Convert2 Convert3
values
21.71502 1 0 0
58.35506 0 1 0
60.41639 0 0 1
See here for the docs on this: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#pivot-tables-and-cross-tabulations
Most functions in pandas will return a multi-level (hierarchical) index, in this case for the columns. If you want to 'melt' this into one level like in R you can do:
In [15]: df_cross = pd.crosstab(index=df['values'], columns=[df['convert_me'], df['age_col']])
In [16]: df_cross.columns = ["{0}_{1}".format(l1, l2) for l1, l2 in df_cross.columns]
In [17]: df_cross
Out[17]:
Convert1_23 Convert2_33 Convert3_44
values
21.71502 1 0 0
58.35506 0 1 0
60.41639 0 0 1
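An equivalent way to flatten the column MultiIndex, assuming both levels are strings as they are here, is to map a join over the column tuples:
df_cross.columns = df_cross.columns.map('_'.join)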

You can also use the pd.get_dummies function. As of pandas 0.22.0 (current at the time of writing), pd.get_dummies is the usual way to one-hot encode a DataFrame.
import pandas as pd

df_dummies = pd.get_dummies(
    df[['convert_me', 'age_col']].apply(lambda x: '_'.join(x.astype(str)), axis=1),
    prefix_sep='')
df = pd.concat([df["values"], df_dummies], axis=1)
# Out[39]:
# values Convert1_23 Convert2_33 Convert3_44
# 0 21.71502 1 0 0
# 1 58.35506 0 1 0
# 2 60.41639 0 0 1

Related

pivot long form categorical data by group and dummy code categorical variables

For the following dataframe, I am trying to pivot the categorical variable ('purchase_item') into wide format and dummy code it as 1/0, based on whether or not a customer purchased it in each of the 4 quarters of 2016.
I would like to generate a pivoted dataframe as follows:
To get the desired result shown above, I have tried various ways of combining the groupby/pivot_table functions with a call to the get_dummies() function in pandas. Example:
data.groupby(["cust_id", "purchase_qtr"])["purchase_item"].reset_index().get_dummies()
However, none of my attempts have worked thus far.
Can somebody please help me generate the desired result?
One way of doing this is to compute the cross-tabulation, and then force all values greater than 1 to become 1, while keeping all 0's as they are:
TL;DR
out = (
    pd.crosstab([df["cust_id"], df["purchase_qtr"]], df["purchase_item"])
    .gt(0)
    .astype(int)
    .reset_index()
)
Breaking it all down:
Create Data
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group1": np.repeat(["a", "b", "c"], 4),
    "group2": [1, 2, 3] * 4,
    "item": np.random.choice(["ab", "cd", "ef", "gh", "zx"], size=12)  # random draw, so your items may differ from the output below
})
print(df)
group1 group2 item
0 a 1 cd
1 a 2 ef
2 a 3 gh
3 a 1 ef
4 b 2 zx
5 b 3 ab
6 b 1 ab
7 b 2 gh
8 c 3 gh
9 c 1 cd
10 c 2 ef
11 c 3 gh
Cross Tabulation
This returns a frequency table indicating how often each of the categories are observed together:
crosstab = pd.crosstab([df["group1"], df["group2"]], df["item"])
print(crosstab)
item ab cd ef gh zx
group1 group2
a 1 0 1 1 0 0
2 0 0 1 0 0
3 0 0 0 1 0
b 1 1 0 0 0 0
2 0 0 0 1 1
3 1 0 0 0 0
c 1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 2 0
Coerce Counts to Dummy Codes
Since we want to dummy code, and not count the co-occurrence of categories, we can use a quick trick to force all values greater than 0 to become 1: gt(0) followed by astype(int).
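Applied to the crosstab computed above, that step looks like this (the dummies name is just illustrative):
dummies = crosstab.gt(0).astype(int)
print(dummies)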
item ab cd ef gh zx
group1 group2
a 1 0 1 1 0 0
2 0 0 1 0 0
3 0 0 0 1 0
b 1 1 0 0 0 0
2 0 0 0 1 1
3 1 0 0 0 0
c 1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
You can do it in one line via several tricks.
1) Unique index count in conjunction with casting as bool
This works in any case, even when you have no other column besides the index, columns and values. It counts the unique indices in each index-column intersection and returns 1 if the count is greater than 0, else 0.
df.reset_index().pivot_table(index=['cust_id', 'purchase_qtr'],
                             columns='purchase_item',
                             values='index',
                             aggfunc='nunique', fill_value=0)\
  .astype(bool).astype(int)
2) Checking whether any other column is not null
If you have other columns besides the index, columns and values AND want to use one of them for intuition, like purchase_date in your case. It is more intuitive because you can "read" it like: check, per customer and per quarter, whether the purchase date of the item is not null, and parse the result as an integer.
df.pivot_table(index=['cust_id', 'purchase_qtr'],
               columns='purchase_item', values='purchase_date',
               aggfunc=lambda x: all(pd.notna(x)), fill_value=0)\
  .astype(int)
3) Taking the len of the elements in each index-column intersection
This takes the len of the elements falling in each index-column intersection and returns 1 if it is more than 0, else 0. Same intuitive approach:
df.pivot_table(index=['cust_id', 'purchase_qtr'],
               columns='purchase_item',
               values='purchase_date',
               aggfunc=len, fill_value=0)\
  .astype(bool).astype(int)
All of the above return the desired dataframe.
Note that you should only use crosstab when you don't already have a DataFrame, as crosstab calls pivot_table internally.

Drop specific rows from pandas dataframe according to condition

I have a dataframe with several columns [A, B, C, ..., Z]. I want to delete all rows from the dataframe which have the property that their values in columns [B, C, ..., Z] are equal to 0 (integer zero).
Example df:
A B C ... Z
0 3 0 0 ... 0
1 1 0 0 ... 0
2 2 1 2 ... 3 <-- keep only this as it has values other than zero
I tried to do this like so:
df = df[(df.columns[1:] != 0).all()]
I can't get it to work. I am not too experienced with conditions in indexers. I wanted to avoid a solution that chains a zero test for every column. I am sure that there is a more elegant solution to this.
Thanks!
EDIT:
The solution worked for an artificially created dataframe, but when I used it on my df that I got from reading a csv, it failed. The file looks like this:
A;B;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z
0;25310;169;81;0;0;0;12291181;31442;246;0;0;0;0;0;0;0;0;0;251;31696;0;0;329;0;0
1;6252727;20480;82;0;0;0;31088;85;245;0;0;0;0;0;0;0;0;0;20567;331;0;0;329;0;0
2;6032184;10961;82;0;0;0;31024;84;245;0;0;0;0;0;0;0;0;0;11046;330;0;0;329;0;0
3;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
4;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
5;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
6;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
7;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
8;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
9;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
10;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
I read it using the following commands:
import pandas as pd
# retrieve csv file as dataframe
df = pd.read_csv('PATH/TO/FILE',
                 decimal=',',
                 sep=';')
df[list(df)] = df[list(df)].astype('int')
print(df)
df = df[(df.iloc[:, 1:] != 0).all(axis=1)]
print(df)
The first print statement shows that the frame is read correctly, but the second print gives me an empty dataframe. How can this be?
Use iloc to select all columns except the first:
df = df[(df.iloc[:, 1:] != 0).all(axis=1)]
print (df)
A B C Z
2 2 1 2 3
EDIT: In the file you posted, every row contains at least one zero among the columns after A, so requiring all of those values to be non-zero removes every row. To drop only the rows in which all of those columns are zero, keep rows where any value is non-zero:
df = df[(df.iloc[:, 1:] != 0).any(axis=1)]
print (df)
A B C D E F G H I J ... Q R S T \
0 0 25310 169 81 0 0 0 12291181 31442 246 ... 0 0 0 251
1 1 6252727 20480 82 0 0 0 31088 85 245 ... 0 0 0 20567
2 2 6032184 10961 82 0 0 0 31024 84 245 ... 0 0 0 11046
U V W X Y Z
0 31696 0 0 329 0 0
1 331 0 0 329 0 0
2 330 0 0 329 0 0
[3 rows x 26 columns]

Drop all columns where all values are zero

I have a simple question which relates to similar questions here, and here.
I am trying to drop all columns from a pandas dataframe, which have only zeroes (vertically, axis=1). Let me give you an example:
df = pd.DataFrame({'a':[0,0,0,0], 'b':[0,-1,0,1]})
a b
0 0 0
1 0 -1
2 0 0
3 0 1
I'd like to drop column a since it has only zeroes.
However, I'd like to do it in a nice and vectorized fashion if possible. My data set is huge - so I don't want to loop. Hence I tried
df = df.loc[(df).any(1), (df!=0).any(0)]
b
1 -1
3 1
Which allows me to drop both columns and rows. But if I just try to drop the columns, loc seems to fail. Any ideas?
You are really close; use any, since 0 is cast to False:
df = df.loc[:, df.any()]
print (df)
b
0 0
1 1
2 0
3 1
If it's a matter of 0s and not sum, use df.any:
In [291]: df.T[df.any()].T
Out[291]:
b
0 0
1 -1
2 0
3 1
Alternatively:
In [296]: df.T[(df != 0).any()].T # or df.loc[:, (df != 0).any()]
Out[296]:
b
0 0
1 -1
2 0
3 1
In [73]: df.loc[:, df.ne(0).any()]
Out[73]:
b
0 0
1 1
2 0
3 1
or:
In [71]: df.loc[:, ~df.eq(0).all()]
Out[71]:
b
0 0
1 1
2 0
3 1
If we want to check those that do NOT sum up to 0:
In [78]: df.loc[:, df.sum().astype(bool)]
Out[78]:
b
0 0
1 1
2 0
3 1

Efficiently transform pandas dataFrame using column name as factor

I would like to transform a DataFrame produced by a piece of software into a more Python-friendly one, and I can't do it in a simple way with pandas because I have to use information contained in the column names. Here is a simple example:
import pandas as pd
d = {'00' : [1],'01' : [11], '10': [111], '11':[1111]}
pd.DataFrame(d)
00 01 10 11
0 1 11 111 1111
The column names contain the factors that I need to use in the rows; I would like to get something like this:
df = {'trt': [0,0,1,1], 'grp': [0,1,0,1], 'value':[1,11,111,1111]}
pd.DataFrame(df)
grp trt value
0 0 0 1
1 1 0 11
2 0 1 111
3 1 1 1111
Any ideas on how to do this properly?
One solution uses MultiIndex.from_arrays, created by indexing the column names with str, and then transposes with T:
df.columns = pd.MultiIndex.from_arrays([df.columns.str[0], df.columns.str[1]])
print (df)
0 1
0 1 0 1
0 1 11 111 1111
df1 = df.T.reset_index()
df1.columns = ['grp','trt','value']
print (df1)
grp trt value
0 0 0 1
1 0 1 11
2 1 0 111
3 1 1 1111
A similar solution uses rename_axis and renames the index:
d = {'00' : [1],'01' : [11], '10': [111], '11':[1111]}
df = pd.DataFrame(d)
df.columns = pd.MultiIndex.from_arrays([df.columns.str[0], df.columns.str[1]])
print(df.rename_axis(('grp','trt'), axis=1).rename(index={0:'value'}).T.reset_index())
grp trt value
0 0 0 1
1 0 1 11
2 1 0 111
3 1 1 1111
To me the simplest solution is just melting the original frame and splitting the column names in a second step. Something like this:
df = pd.DataFrame(d)
mf = pd.melt(df)
mf[['grp', 'trt']] = mf.pop('variable').apply(lambda x: pd.Series(tuple(x)))
Here's mf after melting:
variable value
0 00 1
1 01 11
2 10 111
3 11 1111
And the final result, after splitting the variable column:
value grp trt
0 1 0 0
1 11 0 1
2 111 1 0
3 1111 1 1
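If you also want the columns in the same order as the desired output (grp, trt, value), a plain column selection reorders them:
mf = mf[['grp', 'trt', 'value']]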
I'd encourage you to read up more on melting here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html . It can be incredibly useful.

Deleting multiple series from a dataframe in one command

In short ... I have a Python Pandas data frame that is read in from an Excel file using 'read_table'. I would like to keep a handful of the series from the data, and purge the rest. I know that I can just delete what I don't want one-by-one using 'del data['SeriesName']', but what I'd rather do is specify what to keep instead of specifying what to delete.
If the simplest answer is to copy the existing data frame into a new data frame that only contains the series I want, and then delete the existing frame in its entirety, I would be satisfied with that solution ... but if that is indeed the best way, can someone walk me through it?
TIA ... I'm a newb to Pandas. :)
You can use the DataFrame drop function to remove columns. You have to pass the axis=1 option for it to work on columns and not rows. Note that it returns a copy so you have to assign the result to a new DataFrame:
In [1]: from pandas import *
In [2]: df = DataFrame(dict(x=[0,0,1,0,1], y=[1,0,1,1,0], z=[0,0,1,0,1]))
In [3]: df
Out[3]:
x y z
0 0 1 0
1 0 0 0
2 1 1 1
3 0 1 0
4 1 0 1
In [4]: df = df.drop(['x','y'], axis=1)
In [5]: df
Out[5]:
z
0 0
1 0
2 1
3 0
4 1
Basically the same as Zelazny7's answer -- just specifying what to keep:
In [68]: df
Out[68]:
x y z
0 0 1 0
1 0 0 0
2 1 1 1
3 0 1 0
4 1 0 1
In [70]: df = df[['x','z']]
In [71]: df
Out[71]:
x z
0 0 0
1 0 0
2 1 1
3 0 0
4 1 1
Edit:
You can specify a large number of columns through indexing/slicing into the DataFrame.columns object.
This object (of type pandas.Index) can be viewed as a list of column labels (with some extended functionality).
See this extension of above examples:
In [4]: df.columns
Out[4]: Index([x, y, z], dtype=object)
In [5]: df[df.columns[1:]]
Out[5]:
y z
0 1 0
1 0 0
2 1 1
3 1 0
4 0 1
In [7]: df.drop(df.columns[1:], axis=1)
Out[7]:
x
0 0
1 0
2 1
3 0
4 1
You can also specify a list of columns to keep with the usecols option in pandas.read_table. This speeds up the loading process as well.
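A minimal sketch, assuming a tab-separated file named data.txt (a placeholder name) that contains the x, y, z columns from the examples above:
import pandas as pd

# only the listed columns are parsed; everything else is skipped at read time
df = pd.read_table('data.txt', usecols=['x', 'z'])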
