How to create pandas dummies based on column values

I would like to create dummies based on column values...
This is what the df looks like
I want to create this
This is my approach so far:
import pandas as pd

df = pd.read_csv('test.csv')
v = df.Values
v_set = set()
for line in v:
    # each cell holds a comma-separated list of values
    for x in line.split(','):
        if x != "":
            v_set.add(x)
# add one empty column per distinct value
for val in v_set:
    df[val] = ''
With the code above I am able to create the columns in my df, like this
How do I go about updating the row values to create the dummies?
This is where I am having problems.
Thanks in advance.

You could use pandas.Series.str.get_dummies. This will allow you to split the column directly on a delimiter.
df = pd.concat([df.ID, df.Values.str.get_dummies(sep=",")], axis=1)
ID 1 2 3 4
0 1 1 1 0 0
1 2 0 0 1 1
df.Values.str.get_dummies(sep=",") will generate
1 2 3 4
0 1 1 0 0
1 0 0 1 1
Then, we do a pd.concat to glue the ID column and the dummy columns together.
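For readers without the original test.csv, here is a minimal self-contained sketch of the same idea. The sample rows are an assumption, chosen only so that they reproduce the output shown above; the real file may differ.

```python
import pandas as pd

# Assumed sample data standing in for test.csv (not from the original post)
df = pd.DataFrame({"ID": [1, 2], "Values": ["1,2", "3,4"]})

# Split each cell on "," and build one 0/1 indicator column per distinct value
dummies = df["Values"].str.get_dummies(sep=",")

# Put the ID column and the indicator columns side by side
result = pd.concat([df["ID"], dummies], axis=1)
print(result)
```

Bracket access (df["Values"]) is used here instead of df.Values only because it keeps working for column names that are not valid attribute names.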

Related

Pandas count based on condition in current row from records before current row

I have a dataframe show as follows:
import pandas as pd
df=pd.DataFrame({'col1':['a','a','a','b','b','c']})
df.sort_values('col1', inplace=True)
df['Ref']=0
Thus the dataframe looks like:
a 0
a 0
a 0
b 0
b 0
c 0
For the Ref column, I want to show how many earlier rows have the same col1 value as the current row. For illustration, this is what I want to achieve:
a 0
a 1
a 2
b 0
b 1
c 0
I can use df.iterrows() and loop row by row. Unfortunately, in my case that takes 15 minutes to run. I am wondering if there is a faster way to do this.

Group the data by col1 and use cumcount
import pandas as pd
df = pd.DataFrame({'col1':['a','a','a','b','b','c']})
df['Ref'] = df.groupby('col1').cumcount()
df.sort_values('col1', inplace=True)
Output:
>>> df
col1 Ref
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 c 0
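As a small illustration, cumcount numbers the rows within each group in their existing order, so the frame does not have to be sorted first. The shuffled input below is an assumption made for the example, not data from the question.

```python
import pandas as pd

# Assumed unsorted input, to show that cumcount does not depend on sorting
df = pd.DataFrame({'col1': ['b', 'a', 'c', 'a', 'b', 'a']})

# Number each row by how many earlier rows share its col1 value
df['Ref'] = df.groupby('col1').cumcount()

# A stable sort keeps the within-group order, so Ref stays increasing per group
sorted_df = df.sort_values('col1', kind='stable')
print(sorted_df)
```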

python pandas group by and aggregate columns

I am using pandas version 0.23.0. I want to use the data frame groupby function to generate new aggregated columns using lambda functions.
My data frame looks like
ID Flag Amount User
1 1 100 123345
1 1 55 123346
2 0 20 123346
2 0 30 123347
3 0 50 123348
I want to generate a table which looks like
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM Flag0_User_Count Flag1_User_Count
1 2 2 0 155 0 2
2 2 0 50 0 2 0
3 1 0 50 0 1 0
here:
Flag0_Count is count of Flag = 0
Flag1_Count is count of Flag = 1
Flag0_Amount_SUM is SUM of amount when Flag = 0
Flag1_Amount_SUM is SUM of amount when Flag = 1
Flag0_User_Count is Count of Distinct User when Flag = 0
Flag1_User_Count is Count of Distinct User when Flag = 1
I have tried something like
df.groupby(["ID"])["Flag"].apply(lambda x: sum(x==0)).reset_index()
but it creates a new data frame. This means I will have to do this for all columns and then merge them together into a new data frame.
Is there an easier way to accomplish this?

Use DataFrameGroupBy.agg with a dictionary that maps column names to aggregate functions, then reshape with unstack, flatten the MultiIndex in the columns, rename the columns and finally call reset_index:
df = (df.groupby(["ID", "Flag"])
        .agg({'Flag':'size', 'Amount':'sum', 'User':'nunique'})
        .unstack(fill_value=0))
# Python 3.6+
df.columns = [f'{i}{j}' for i, j in df.columns]
# Python below 3.6
#df.columns = ['{}{}'.format(i, j) for i, j in df.columns]
d = {'Flag0':'Flag0_Count',
     'Flag1':'Flag1_Count',
     'Amount0':'Flag0_Amount_SUM',
     'Amount1':'Flag1_Amount_SUM',
     'User0':'Flag0_User_Count',
     'User1':'Flag1_User_Count',
     }
df = df.rename(columns=d).reset_index()
print (df)
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM \
0 1 0 2 0 155
1 2 2 0 50 0
2 3 1 0 50 0
Flag0_User_Count Flag1_User_Count
0 0 2
1 2 0
2 1 0
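On newer pandas versions, a similar result can be sketched with named aggregation, which builds the target column names directly and avoids the rename dictionary. This is an alternative, not the code from the answer above; it assumes pandas 0.25 or later, and the output names Count, Amount_SUM and User_Count are chosen here only to match the requested headers.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "Flag": [1, 1, 0, 0, 0],
    "Amount": [100, 55, 20, 30, 50],
    "User": [123345, 123346, 123346, 123347, 123348],
})

out = (df.groupby(["ID", "Flag"])
         .agg(Count=("Amount", "size"),        # rows per (ID, Flag)
              Amount_SUM=("Amount", "sum"),    # total amount per (ID, Flag)
              User_Count=("User", "nunique"))  # distinct users per (ID, Flag)
         .unstack("Flag", fill_value=0))       # one column per Flag value

# Flatten ('Count', 0) into 'Flag0_Count', and so on
out.columns = [f"Flag{flag}_{name}" for name, flag in out.columns]
out = out.reset_index()
print(out)
```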

reset a recurring multiindex in pandas

I have a pandas data frame in python coming from a pd.concat with a recurring multiindex:
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
0 0 8880873
1 1000521
1 0 1135488
1 5388773
Now, I want to reset only the first level of the MultiIndex, so that I get a running number on that level. Something like this:
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
2 0 8880873
1 1000521
3 0 1135488
1 5388773
In general, I have around 5 million records and not the biggest machine, so I'm looking for a memory-efficient solution.
ignore_index=True in pd.concat does not work, because then I lose the MultiIndex.
Many thanks

You can take the first level with get_level_values and convert it with to_series, compare it with its shifted values, use cumsum to number the consecutive blocks, and finally rebuild the index with MultiIndex.from_arrays:
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1
mux = pd.MultiIndex.from_arrays([a, df.index.get_level_values(1)], names=df.index.names)
df.index = mux
Or:
df = df.set_index(mux)
print (df)
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
2 0 8880873
1 1000521
3 0 1135488
1 5388773
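The steps above can be tried end to end with the following sketch. The frame is rebuilt by hand from the values in the question, and the names level0 and block_id are illustrative; the first level is wrapped in a plain Series here, a small variation on to_series that avoids carrying the duplicated index along.

```python
import pandas as pd

# Rebuild the frame from the question with its repeating first level
idx = pd.MultiIndex.from_arrays([[0, 0, 1, 1, 0, 0, 1, 1],
                                 [0, 1, 0, 1, 0, 1, 0, 1]])
df = pd.DataFrame({"customer_id": [46841769, 4683936, 8880872, 8880812,
                                   8880873, 1000521, 1135488, 5388773]},
                  index=idx)

# First level as a plain Series with a default 0..n-1 index
level0 = pd.Series(df.index.get_level_values(0).to_numpy())

# A new block starts wherever the value differs from the previous row;
# the cumulative sum of those starts, minus 1, numbers the blocks from 0
block_id = level0.ne(level0.shift()).cumsum() - 1

df.index = pd.MultiIndex.from_arrays(
    [block_id.to_numpy(), df.index.get_level_values(1)],
    names=df.index.names)
print(df)
```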

Add column to pandas without headers

How does one append a column of constant values to a pandas dataframe without headers? I want to append the column at the end.
With headers I can do it this way:
df['new'] = pd.Series([0 for x in range(len(df.index))], index=df.index)

Every non-empty DataFrame has columns, an index and values; when no header is supplied, the columns simply get default integer labels.
You can use the next integer as the new column label and fill the column with a scalar:
df[len(df.columns)] = 0
Sample:
df = pd.DataFrame({0:[1,2,3],
1:[4,5,6]})
print (df)
0 1
0 1 4
1 2 5
2 3 6
df[len(df.columns)] = 0
print (df)
0 1 2
0 1 4 0
1 2 5 0
2 3 6 0
Also, for creating a new column with a name, the simplest way is:
df['new'] = 1
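If the header-less frame comes from a file, the same pattern might look like the sketch below. The inline CSV text is an assumption standing in for a real file; header=None makes read_csv assign the integer labels 0, 1, and so on, which is what makes len(df.columns) a valid next label.

```python
import io
import pandas as pd

# Assumed CSV content without a header row
csv_text = "1,4\n2,5\n3,6\n"

# header=None tells read_csv to label the columns 0, 1, ...
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Append a constant column under the next integer label
df[len(df.columns)] = 0
print(df)
```

Note that this relies on the existing labels being exactly 0 to n-1; with other labels, len(df.columns) could collide with an existing column.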

Add columns to pandas dataframe containing max of each row, AND corresponding column name

My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to know the maximum value for each row and append that as a new column. I would also like to add another column to the existing dataframe containing the name of the column where the maximum value is located.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
# Make pandas dataframe
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
# Calculate max
my_series = df.max(numeric_only=True, axis = 1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
a b c maxval
0 1 0 0 1
1 0 0 0 0
2 0 1 0 1
3 1 0 0 1
4 3 1 0 3
So far so good. Now for the part where I add another column containing the name of the column that holds the maximum:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.

You can compare the value columns against maxval using eq with axis=0, then use apply with a lambda that turns each row's boolean mask into the matching column names and joins them:
In [183]:
df['maxcol'] = df.loc[:,:'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x==x.max()]),axis=1)
df
df
Out[183]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
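The same idea can be written as a self-contained sketch with the intermediate mask spelled out. The names value_cols and is_max are illustrative and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0, 1, 3],
                   'b': [0, 0, 1, 0, 1],
                   'c': [0, 0, 0, 0, 0]})

# Columns that take part in the comparison (maxval itself is excluded)
value_cols = ['a', 'b', 'c']
df['maxval'] = df[value_cols].max(axis=1)

# True wherever a cell equals its row maximum
is_max = df[value_cols].eq(df['maxval'], axis=0)

# Join the names of all columns that hold the maximum in each row
df['maxcol'] = is_max.apply(
    lambda row: ','.join(row.index[row.to_numpy()]), axis=1)
print(df)
```

Listing the value columns explicitly avoids depending on column position, at the cost of having to keep the list in sync with the frame.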
