How to enumerize and split csv column in Pandas? - python

We have a CSV with a fixed (repeating) set of strings in one column. Say the column is called "Type" and its values are [A, B, B, C]. I want to add as many new columns as there are unique values, each holding 0 or 1, like this:
Type Type_1 Type_2 Type_3
A 1 0 0
B 0 1 0
B 0 1 0
C 0 0 1
How do I turn a column into a set of 0/1 columns?

Try str.get_dummies:
df.Type.str.get_dummies()
A B C
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
Update
df=df.join(df.pop('Type').str.get_dummies())
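A minimal runnable sketch of this approach, using the sample values from the question (note that get_dummies names the new columns after the values themselves, A/B/C, rather than the Type_1/Type_2/Type_3 labels shown above):

```python
import pandas as pd

# Sample frame matching the question's "Type" column.
df = pd.DataFrame({"Type": ["A", "B", "B", "C"]})

# str.get_dummies builds one 0/1 column per unique string value;
# pop removes the original column so only the dummies are joined back.
df = df.join(df.pop("Type").str.get_dummies())
print(df)
```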

Related

pivot long form categorical data by group and dummy code categorical variables

For the following dataframe, I am trying to pivot the categorical variable ('purchase_item') into wide format and dummy code them as 1/0 - based on whether or not a customer purchased it in each of the 4 quarters within 2016.
I would like to generate a pivotted dataframe as follows:
To get the desired result shown above, I have tried various ways of combining the groupby/pivot_table functions with a call to pandas' get_dummies() function. Example:
data.groupby(["cust_id", "purchase_qtr"])["purchase_item"].reset_index().get_dummies()
However, none of my attempts have worked thus far.
Can somebody please help me generate the desired result?
One way of doing this is to get the crosstabulation, and then force all values > 1 to become 1, while keeping all 0's as they are:
TL;DR
out = (
    pd.crosstab([df["cust_id"], df["purchase_qtr"]], df["purchase_item"])
    .gt(0)
    .astype(int)
    .reset_index()
)
Breaking it all down:
Create Data
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group1": np.repeat(["a", "b", "c"], 4),
    "group2": [1, 2, 3] * 4,
    "item": np.random.choice(["ab", "cd", "ef", "gh", "zx"], size=12)
})
print(df)
group1 group2 item
0 a 1 cd
1 a 2 ef
2 a 3 gh
3 a 1 ef
4 b 2 zx
5 b 3 ab
6 b 1 ab
7 b 2 gh
8 c 3 gh
9 c 1 cd
10 c 2 ef
11 c 3 gh
Cross Tabulation
This returns a frequency table indicating how often each combination of categories is observed together:
crosstab = pd.crosstab([df["group1"], df["group2"]], df["item"])
print(crosstab)
item ab cd ef gh zx
group1 group2
a 1 0 1 1 0 0
2 0 0 1 0 0
3 0 0 0 1 0
b 1 1 0 0 0 0
2 0 0 0 1 1
3 1 0 0 0 0
c 1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 2 0
Coerce Counts to Dummy Codes
Since we want dummy codes, not counts of how often the categories co-occur, we can force every value greater than 0 (gt(0)) to become 1 (astype(int)):
item ab cd ef gh zx
group1 group2
a 1 0 1 1 0 0
2 0 0 1 0 0
3 0 0 0 1 0
b 1 1 0 0 0 0
2 0 0 0 1 1
3 1 0 0 0 0
c 1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
You can do this in one line via several tricks.
1) Unique index count in conjunction with casting as bool
This works in any case, even when you have no columns besides the index, columns and values. It counts the unique indices in each index-column intersection and returns 1 if the count is greater than 0, else 0.
df.reset_index().pivot_table(index=['cust_id', 'purchase_qtr'],
                             columns='purchase_item',
                             values='index',
                             aggfunc='nunique', fill_value=0)\
  .astype(bool).astype(int)
2) Checking whether any other column is not null
If you have other columns besides index, columns and values AND want to use them, like purchase_date in your case, this is more intuitive because you can "read" it: check, per customer per quarter, whether the purchase date of the item is not null, and parse the result as an integer.
df.pivot_table(index=['cust_id', 'purchase_qtr'],
               columns='purchase_item', values='purchase_date',
               aggfunc=lambda x: all(pd.notna(x)), fill_value=0)\
  .astype(int)
3) Counting the elements falling in each index-column intersection
This takes the len of the elements falling in each index-column intersection and returns 1 if it is greater than 0, else 0. Same intuitive approach:
df.pivot_table(index=['cust_id', 'purchase_qtr'],
               columns='purchase_item',
               values='purchase_date',
               aggfunc=len, fill_value=0)\
  .astype(bool).astype(int)
All three return the desired dataframe.
Note that you should only use crosstab when you don't already have a dataframe, as it calls pivot_table internally.
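A runnable sketch of trick 1 on a small made-up frame (the cust_id/purchase_qtr/purchase_item names mirror the question; the data itself is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "cust_id": [1, 1, 1, 2, 2],
    "purchase_qtr": ["Q1", "Q1", "Q2", "Q1", "Q2"],
    "purchase_item": ["ab", "cd", "ab", "cd", "cd"],
})

# reset_index materialises the row index as a column named 'index',
# which nunique then counts per (cust_id, purchase_qtr, item) cell.
out = (
    df.reset_index()
      .pivot_table(index=["cust_id", "purchase_qtr"],
                   columns="purchase_item",
                   values="index",
                   aggfunc="nunique",
                   fill_value=0)
      .astype(bool)
      .astype(int)
)
print(out)
```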

How to update multiple rows and columns in pandas?

I want to update multiple rows and columns in a CSV file using pandas.
I've tried the iterrows() method, but it only works on a single column.
Here is the logic I want to apply across multiple rows and columns:
if value < mean:
    value += std_dev
else:
    value -= std_dev
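That per-value logic needs no iterrows at all: it can be vectorized with np.where. A sketch, assuming the mean and standard deviation are computed per column (the column names here are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [10.0, 20.0, 30.0, 40.0]})

mean = df.mean()  # per-column means
std = df.std()    # per-column standard deviations (ddof=1)

# Where a value is below its column mean, add the std; otherwise subtract it.
df = pd.DataFrame(np.where(df < mean, df + std, df - std),
                  index=df.index, columns=df.columns)
print(df)
```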
Here is another way of doing it.
Say your data looks like this:
price strings value
0 1 A a
1 2 B b
2 3 C c
3 4 D d
4 5 E f
Now let's make the strings column the index:
df.set_index('strings', inplace=True)
#Result
price value
strings
A 1 a
B 2 b
C 3 c
D 4 d
E 5 f
Now set the values of rows C, D and E to 0:
df.loc[['C', 'D','E']] = 0
#Result
price value
strings
A 1 a
B 2 b
C 0 0
D 0 0
E 0 0
Or you can do it more precisely, working on the original (unindexed) frame:
df.loc[df.strings.isin(["C", "D", "E"]), df.columns.difference(["strings"])] = 0
df
Out[82]:
price strings value
0 1 A a
1 2 B b
2 0 C 0
3 0 D 0
4 0 E 0
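A self-contained sketch of the boolean-mask variant, built on the sample data from this answer:

```python
import pandas as pd

df = pd.DataFrame({"price": [1, 2, 3, 4, 5],
                   "strings": list("ABCDE"),
                   "value": list("abcdf")})

# Zero out every column except 'strings' for the rows C, D and E.
mask = df["strings"].isin(["C", "D", "E"])
df.loc[mask, df.columns.difference(["strings"])] = 0
print(df)
```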

Add columns to pandas dataframe containing max of each row, AND corresponding column name

My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to know the maximum value in each row and append that info as a new column. I would also like to add another column to the existing dataframe containing the name of the column where the max value is located.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
import pandas as pd

# Make pandas dataframe
df = pd.DataFrame({'a': [1, 0, 0, 1, 3], 'b': [0, 0, 1, 0, 1], 'c': [0, 0, 0, 0, 0]})
# Calculate max
my_series = df.max(numeric_only=True, axis=1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
a b c maxval
0 1 0 0 1
1 0 0 0 0
2 0 1 0 1
3 1 0 0 1
4 3 1 0 3
So far so good. Now for the add another column to the existing dataframe containing the name of the column part:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.
You can compare the df against maxval using eq with axis=0, then use apply with a lambda to produce a boolean mask to mask the columns and join them:
In [183]:
df['maxcol'] = df.loc[:, :'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x == x.max()]), axis=1)
df
Out[183]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
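The .ix accessor has since been removed from pandas; a self-contained version of the same idea with current accessors (the hardcoded value_cols list is for clarity):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 0, 0, 1, 3],
                   "b": [0, 0, 1, 0, 1],
                   "c": [0, 0, 0, 0, 0]})

value_cols = ["a", "b", "c"]
df["maxval"] = df[value_cols].max(axis=1)

# Boolean frame: True where a cell equals its row maximum.
is_max = df[value_cols].eq(df["maxval"], axis=0)

# Join the names of all tied columns per row (handles ties like row 1).
df["maxcol"] = is_max.apply(lambda row: ",".join(row.index[row]), axis=1)
print(df)
```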

Pandas DataFrame Groupby to get Unique row condition and identify with increasing value up to Number of Groups

I have a DataFrame where a combination of column values identify a unique address (A,B,C). I would like to identify all such rows and assign them a unique identifier that I increment per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1
df.groupby(['A','B','C']).transform(set_id)
This returns a null dataframe... this is definitely not the way to do it; I am new to pandas. The above should probably use df[['A','B','C']].drop_duplicates() to get all the unique values.
Thank you.
I think this is what you need:
df2 = df[['A','B','C']].drop_duplicates() #get unique values of ABC
df2 = df2.reset_index(drop = True).reset_index() #reset index to create a column named index
df2=df2.rename(columns = {'index':'ID'}) #rename index to ID
df = pd.merge(df,df2,on = ['A','B','C'],how = 'left') #append ID column with merge
# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Use `groupby` to keep running total of changes in key value.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0
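On modern pandas the same ID can be produced in one line with groupby(...).ngroup(); passing sort=False numbers the groups in order of first appearance, which matches the desired output. A sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({"A": [0] * 6,
                   "B": [1] * 6,
                   "C": [1, 2, 1, 3, 2, 1],
                   "D": [0, 0, 1, 0, 1, 2],
                   "E": [1, 1, 1, 1, 0, 1]})

# ngroup assigns each group a consecutive integer; sort=False keeps
# the IDs in the order each (A, B, C) key first appears.
df["ID"] = df.groupby(["A", "B", "C"], sort=False).ngroup()
print(df)
```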

dividing a string into separate columns pandas python

I have seen similar questions, but nothing that really matches my problem. If I have a table of values such as:
value
a
b
b
c
I want to use pandas to add in columns to the table to show for example:
value a b
a 1 0
b 0 1
c 0 0
I have tried the following:
df['a'] = 0
def string_count(indicator):
    if indicator == 'a':
        df['a'] == 1
df['a'].apply(string_count)
But this produces:
0 None
1 None
2 None
3 None
I would like to at least get to the point where the choices are hardcoded in (i.e. I already know that a, b and c appear), but it would be even better if I could scan the column of strings and then insert a column for each unique string.
Am I approaching this the wrong way?
dummies = pd.get_dummies(df.value)
a b c
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
If you only want to display unique occurrences, you can add:
dummies.index = df.value
dummies.drop_duplicates()
a b c
value
a 1 0 0
b 0 1 0
c 0 0 1
Alternatively:
df = df.join(pd.get_dummies(df.value))
value a b c
0 a 1 0 0
1 b 0 1 0
2 b 0 1 0
3 c 0 0 1
Where you could again .drop_duplicates() to only see unique entries from the value column.
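Putting the pieces together as a runnable sketch with the question's sample values (the astype(int) keeps the 0/1 display even on pandas versions where get_dummies defaults to boolean columns):

```python
import pandas as pd

df = pd.DataFrame({"value": ["a", "b", "b", "c"]})

# One 0/1 column per unique string, joined back next to the original.
df = df.join(pd.get_dummies(df["value"]).astype(int))
print(df)
print(df.drop_duplicates())
```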
