I have a DataFrame that looks like this:
col1 id
0 a 10
1 b 20
2 c 30
3 b 10
4 d 10
5 a 30
6 e 40
My desired output is this:
a b c d e
10 1 1 0 1 0
20 0 1 0 0 0
30 1 0 1 0 0
40 0 0 0 0 1
I have this code:
import pandas as pd
df['dummies'] = 1
df.pivot(index='id', columns='col1', values='dummies')
I get an error:
137
138 if mask.sum() < len(self.index):
--> 139 raise ValueError('Index contains duplicate entries, '
140 'cannot reshape')
141
ValueError: Index contains duplicate entries, cannot reshape
There are duplicate ids because multiple values in col1 can be attributed to a single id.
How can I achieve the desired output?
Thanks!
You could use pd.crosstab
In [329]: pd.crosstab(df.id, df.col1)
Out[329]:
col1 a b c d e
id
10 1 1 0 1 0
20 0 1 0 0 0
30 1 0 1 0 0
40 0 0 0 0 1
Or, use pd.pivot_table
In [336]: df.pivot_table(index='id', columns='col1', aggfunc=len, fill_value=0)
Out[336]:
col1 a b c d e
id
10 1 1 0 1 0
20 0 1 0 0 0
30 1 0 1 0 0
40 0 0 0 0 1
Or, use groupby and unstack
In [339]: df.groupby(['id', 'col1']).size().unstack(fill_value=0)
Out[339]:
col1 a b c d e
id
10 1 1 0 1 0
20 0 1 0 0 0
30 1 0 1 0 0
40 0 0 0 0 1
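Note that pivot raises this error only when an (id, col1) pair occurs more than once. The sample above repeats ids but never repeats a pair, so the real data presumably contains at least one. The following is a minimal sketch of that case, assuming one repeated row added purely for illustration (df_dup is a hypothetical name). It shows that the counting approaches then return counts above 1, and how to cap them if a strict 0/1 matrix is wanted:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": list("abcbdae"),
    "id": [10, 20, 30, 10, 10, 30, 40],
})

# Assumption for illustration: repeat the first row so (10, 'a') occurs twice
df_dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# pivot cannot place two values in the same (id, col1) cell
try:
    df_dup.assign(dummies=1).pivot(index="id", columns="col1", values="dummies")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# crosstab counts occurrences, so the repeated pair shows up as 2
counts = pd.crosstab(df_dup["id"], df_dup["col1"])

# Cap the counts at 1 to get a presence/absence matrix
indicators = counts.clip(upper=1)
print(indicators)
```

The same clip(upper=1) step can be appended to the pivot_table and groupby variants.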
Related
I have a dataframe with 45 columns. Most hold string values, so I'm trying to use pd.get_dummies to turn the strings into numbers using df = pd.get_dummies(df, drop_first=True); however, the columns without string values are removed from my dataframe. I don't want to have to type out 40 or so column names. How can I iterate over every column, ignoring the ones without strings, and still have them remain after the get_dummies call?
Columns can be filtered by dtypes to programmatically determine which columns to pass to get_dummies, namely only the "object or category" type columns:
new_df = pd.get_dummies(
df,
columns=df.columns[(df.dtypes == 'object') | (df.dtypes == 'category')]
)
Sample Data:
import numpy as np
import pandas as pd
np.random.seed(5)
n = 10
df = pd.DataFrame({
'A': np.random.randint(1, 100, n),
'B': pd.Series(np.random.choice(list("ABCD"), n), dtype='category'),
'C': np.random.random(n) * 100,
'D': np.random.choice(list("EFGH"), n)
})
new_df = pd.get_dummies(
df,
columns=df.columns[(df.dtypes == 'object') | (df.dtypes == 'category')]
)
df:
A B C D
0 79 A 76.437261 G
1 62 D 11.090076 E
2 17 B 20.415475 E
3 74 B 11.909536 E
4 9 D 87.790307 G
5 63 A 52.367529 E
6 28 D 49.213600 F
7 31 A 73.187110 H
8 81 B 1.458075 H
9 8 D 9.336303 H
df.dtypes:
A int32
B category
C float64
D object
dtype: object
new_df:
A C B_A B_B B_D D_E D_F D_G D_H
0 79 76.437261 1 0 0 0 0 1 0
1 62 11.090076 0 0 1 1 0 0 0
2 17 20.415475 0 1 0 1 0 0 0
3 74 11.909536 0 1 0 1 0 0 0
4 9 87.790307 0 0 1 0 0 1 0
5 63 52.367529 1 0 0 1 0 0 0
6 28 49.213600 0 0 1 0 1 0 0
7 31 73.187110 1 0 0 0 0 0 1
8 81 1.458075 0 1 0 0 0 0 1
9 8 9.336303 0 0 1 0 0 0 1
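The dtype filter can also be written with select_dtypes. This is a minimal sketch on a small made-up frame, assuming that every non-numeric column should be encoded (that would also pick up boolean or datetime columns if the real data has any). Passing dtype=int is an optional choice here so that the indicator columns come out as 0/1 integers:

```python
import pandas as pd

# Small made-up frame with the same mix of dtypes as the sample data
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": pd.Series(["x", "y", "x"], dtype="category"),
    "C": [0.5, 1.5, 2.5],
    "D": ["p", "q", "p"],
})

# Assumption: everything that is not numeric should be one-hot encoded
cat_cols = df.select_dtypes(exclude="number").columns

# Numeric columns are kept as they are; only cat_cols are expanded
new_df = pd.get_dummies(df, columns=cat_cols, dtype=int)
print(new_df)
```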
I have a question. I am trying to count how many times values from df1:
Record
v:12:14
v:14:18
v:15:19
appear in df2, when df2 is filtered on multiple conditions:
df2:
Patient Test Treatment Record
1 A 15 v:12:14
2 A 30 v:14:18
3 C 15 v:15:19
4 C 20 v:15:19
1 B 15 v:12:14
2 B 15 v:14:18
3 A 20 v:12:14
4 B 30 v:15:19
Essentially ending in a matrix like this:
Patient Record A:15 A:30 C:15 C:20 B:15 A:20 B:30
1 v:12:14 1 0 0 0 1 1 0
2 v:14:18 0 1 0 0 1 0 0
3 v:15:19 0 0 1 1 0 0 1
4 v:15:19 0 0 1 1 0 0 1
3 v:12:14 1 0 0 0 1 1 0
Does anyone have any ideas? I am doing it now by iterating two data frames, but I feel like it can be done faster and better.
Thanks in advance for the help!
You can do this with a pivot table:
matrix = df2.pivot_table(
index=['Patient', 'Record'],
columns=['Test', 'Treatment'],
aggfunc='size',
).fillna(0).astype(int)
Output:
Test A B C
Treatment 15 20 30 15 30 15 20
Patient Record
1 v:12:14 1 0 0 1 0 0 0
2 v:14:18 0 0 1 1 0 0 0
3 v:12:14 0 1 0 0 0 0 0
v:15:19 0 0 0 0 0 1 0
4 v:15:19 0 0 0 0 1 0 1
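The pivot table has two-level columns, whereas the question asked for single labels such as A:15. Here is a minimal sketch of flattening them, assuming the "Test:Treatment" label format from the desired output and rebuilding df2 from the sample rows:

```python
import pandas as pd

# df2 rebuilt from the sample rows in the question
df2 = pd.DataFrame({
    "Patient": [1, 2, 3, 4, 1, 2, 3, 4],
    "Test": list("AACCBBAB"),
    "Treatment": [15, 30, 15, 20, 15, 15, 20, 30],
    "Record": ["v:12:14", "v:14:18", "v:15:19", "v:15:19",
               "v:12:14", "v:14:18", "v:12:14", "v:15:19"],
})

matrix = df2.pivot_table(
    index=["Patient", "Record"],
    columns=["Test", "Treatment"],
    aggfunc="size",
).fillna(0).astype(int)

# Join each (Test, Treatment) pair into one label, e.g. ('A', 15) -> 'A:15'
matrix.columns = [f"{test}:{treatment}" for test, treatment in matrix.columns]

# Turn Patient and Record back into ordinary columns
matrix = matrix.reset_index()
print(matrix)
```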
I have a dataframe called df
The columns in the dataframe can be logically grouped. Hence I grouped the column names in lists A, B, C where:
A = [column_1, column_2, column_3]
B = [column_4, column_5, column_6]
C = [column_7, column_8, column_9]
In addition to the columns column_1 to column_9, df has one more column called "filename_ID", which is used as the index and thus is not grouped. The columns column_1 to column_9 contain only 0 and 1 values.
Now I want to filter the dataframe so that it only includes rows where there is at least one non-zero value for each group (A, B, C). That is, I only want to keep the rows (with their respective filename_ID) that fulfill this condition.
I have managed to create a separate dataframe for each group:
df_A = df.loc[(df[A]!=0).any(axis=1)]
df_B = df.loc[(df[B]!=0).any(axis=1)]
df_C = df.loc[(df[C]!=0).any(axis=1)]
However, I don't know how to apply all conditions simultaneously, i.e. how to create one new dataframe where all rows fulfill the condition that each logical column group contains at least one non-zero value.
Setup
import numpy as np
import pandas as pd
np.random.seed([3, 1415])
df = pd.DataFrame(
np.random.randint(2, size=(10, 9)),
columns=[f"col{i + 1}" for i in range(9)]
)
df
col1 col2 col3 col4 col5 col6 col7 col8 col9
0 0 1 0 1 0 0 1 0 1
1 1 1 1 0 1 1 0 1 0
2 0 0 0 0 0 0 0 0 0
3 1 0 1 1 1 1 0 0 0
4 0 0 1 1 1 1 1 0 1
5 1 1 0 1 1 1 1 1 1
6 1 0 1 0 0 0 1 1 0
7 0 0 0 0 0 1 0 1 0
8 1 0 1 0 1 0 0 1 1
9 1 0 1 0 0 1 0 1 0
Solution
Create a dictionary
m = {
**dict.fromkeys(['col1', 'col2', 'col3'], 'A'),
**dict.fromkeys(['col4', 'col5', 'col6'], 'B'),
**dict.fromkeys(['col7', 'col8', 'col9'], 'C'),
}
Then group the columns by that mapping with axis=1 and keep the rows where every group has at least one non-zero value
df[df.groupby(m, axis=1).any().all(axis=1)]
col1 col2 col3 col4 col5 col6 col7 col8 col9
0 0 1 0 1 0 0 1 0 1
1 1 1 1 0 1 1 0 1 0
4 0 0 1 1 1 1 1 0 1
5 1 1 0 1 1 1 1 1 1
8 1 0 1 0 1 0 0 1 1
9 1 0 1 0 0 1 0 1 0
Notice the ones that didn't make it
col1 col2 col3 col4 col5 col6 col7 col8 col9
2 0 0 0 0 0 0 0 0 0
3 1 0 1 1 1 1 0 0 0
6 1 0 1 0 0 0 1 1 0
7 0 0 0 0 0 1 0 1 0
You could also have had columns like this:
cols = [['col1', 'col2', 'col3'], ['col4', 'col5', 'col6'], ['col7', 'col8', 'col9']]
m = {k: v for v, c in enumerate(cols) for k in c}
And performed the same groupby
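Recent pandas versions (2.1 and later) deprecate groupby(..., axis=1). A minimal sketch of the same idea that groups the transposed frame instead, using a small hand-written frame (the data and the names keep and filtered are illustrative only):

```python
import pandas as pd

# Hand-written data: rows 0 and 3 have a non-zero value in every group
df = pd.DataFrame(
    [
        [0, 1, 0, 1, 0, 0, 1, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 1, 1, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 1, 0, 1, 0],
    ],
    columns=[f"col{i + 1}" for i in range(9)],
)

m = {
    **dict.fromkeys(['col1', 'col2', 'col3'], 'A'),
    **dict.fromkeys(['col4', 'col5', 'col6'], 'B'),
    **dict.fromkeys(['col7', 'col8', 'col9'], 'C'),
}

# After transposing, the column names are the index, so a plain groupby
# with the mapping groups them; transpose back to get one row per record
keep = df.T.groupby(m).any().T.all(axis=1)
filtered = df[keep]
print(filtered)
```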
Try the following:
column_groups = [A, B, C]
masks = [(df[cols] != 0).any(axis=1) for cols in column_groups]
full_mask = np.logical_and.reduce(masks)
full_df = df[full_mask]
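For this to run, A, B and C need to be lists of column-name strings. A minimal self-contained sketch, assuming three made-up rows and hypothetical filename_ID values f1 to f3:

```python
import numpy as np
import pandas as pd

# Made-up data; f2 has no non-zero value in group C
df = pd.DataFrame(
    [
        [0, 1, 0, 1, 0, 0, 1, 0, 1],
        [1, 0, 1, 1, 1, 1, 0, 0, 0],
        [0, 0, 1, 0, 1, 0, 0, 0, 1],
    ],
    columns=[f"column_{i}" for i in range(1, 10)],
    index=pd.Index(["f1", "f2", "f3"], name="filename_ID"),
)

A = ["column_1", "column_2", "column_3"]
B = ["column_4", "column_5", "column_6"]
C = ["column_7", "column_8", "column_9"]

# One boolean mask per group: True where the group has a non-zero value
masks = [(df[cols] != 0).any(axis=1) for cols in [A, B, C]]

# Element-wise AND across all masks
full_mask = np.logical_and.reduce(masks)
full_df = df[full_mask]
print(full_df)
```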
I created a CSV file with sample data.
Sample Input:
ID a1 a2 a3 a4 a5 a6 a7 a8 a9
1 1 1 1 1 1 1 1 1 1
2 0 0 0 1 0 0 0 1 0
3 0 1 0 0 0 0 1 0 0
4 0 0 0 0 1 0 1 0 1
5 1 1 0 1 1 1 1 0 1
6 0 0 0 0 1 0 0 1 0
7 1 0 1 1 1 0 1 1 1
8 1 1 1 0 1 1 1 0 1
9 0 0 0 1 0 1 0 0 0
10 0 0 1 0 0 0 0 0 0
11 1 0 1 0 1 1 0 1 1
12 1 1 0 1 0 1 1 0 1
import pandas as pd
df = pd.read_csv('check.csv')
df['sumA'] = df.a1+df.a2+df.a3
df['sumB'] = df.a4+df.a5+df.a6
df['sumC'] = df.a7+df.a8+df.a9
new_df = df[(df.sumA>0)&(df.sumB>0)&(df.sumC>0)]  # at least one non-zero value per group
new_df = new_df.drop(['sumA','sumB','sumC'],axis=1)
Output:
ID a1 a2 a3 a4 a5 a6 a7 a8 a9
0 1 1 1 1 1 1 1 1 1 1
4 5 1 1 0 1 1 1 1 0 1
6 7 1 0 1 1 1 0 1 1 1
7 8 1 1 1 0 1 1 1 0 1
10 11 1 0 1 0 1 1 0 1 1
11 12 1 1 0 1 0 1 1 0 1
I have a dataframe like the one below:
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 1 0 0
3 0 0 1 0
I want to convert it into this:
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 1 0 0
3 1 1 1 0
So far I have tried:
df = df.replace('0', np.nan)
df = df.ffill().fillna('0')
The code above works, but I think there must be a better approach to this problem.
Use cumsum with data converted to numeric and then replace by DataFrame.mask:
df = df.mask(df.astype(int).cumsum() >= 1, '1')
print (df)
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 1 0 0
3 1 1 1 0
Detail:
print (df.astype(int).cumsum())
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 2 0 0
3 1 2 1 0
Or the same principle in numpy with numpy.where:
arr = df.values.astype(int)
df = pd.DataFrame(np.where(np.cumsum(arr, axis=0) >= 1, '1', '0'),
index=df.index,
columns= df.columns)
print (df)
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 1 0 0
3 1 1 1 0
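Because the values are only 0 and 1, a running maximum gives the same result as testing whether the cumulative sum has reached 1. A minimal sketch using cummax, which is swapped in here for the cumsum and mask combination and assumes the values are stored as strings as in the question:

```python
import pandas as pd

# Sample data from the question, stored as strings
df = pd.DataFrame({
    "A": ["1", "0", "0", "0"],
    "B": ["0", "1", "1", "0"],
    "C": ["0", "0", "0", "1"],
    "D": ["0", "0", "0", "0"],
})

# Once a column has seen a 1, the running maximum stays at 1
result = df.astype(int).cummax()

# Optional: convert back to strings to match the original dtype
result_str = result.astype(str)
print(result)
```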
I have a dataframe named df as follows:
ticker class_n
1 a
2 b
3 c
4 d
5 e
6 f
7 a
8 b
............................
I want to add new columns to this dataframe, one for each unique value of class_n. Each new column should be 1 where class_n equals the column name and 0 otherwise.
For example, I want to get the following dataframe:
ticker class_n a b c d e f
1 a 1 0 0 0 0 0
2 b 0 1 0 0 0 0
3 c 0 0 1 0 0 0
4 d 0 0 0 1 0 0
5 e 0 0 0 0 1 0
6 f 0 0 0 0 0 1
7 a 1 0 0 0 0 0
8 b 0 1 0 0 0 0
My code is the following:
lst_class = list(set(list(df['class_n'])))
for cla in lst_class:
    df[cla] = 0
    df.loc[df['class_n'] is cla, cla] =1
but I get an error:
KeyError: 'cannot use a single bool to index into setitem'
Thanks!
Use pd.get_dummies
df.join(pd.get_dummies(df.class_n))
ticker class_n a b c d e f
0 1 a 1 0 0 0 0 0
1 2 b 0 1 0 0 0 0
2 3 c 0 0 1 0 0 0
3 4 d 0 0 0 1 0 0
4 5 e 0 0 0 0 1 0
5 6 f 0 0 0 0 0 1
6 7 a 1 0 0 0 0 0
7 8 b 0 1 0 0 0 0
Or the same thing but a little more manually
f, u = pd.factorize(df.class_n.values)
d = pd.DataFrame(np.eye(u.size, dtype=int)[f], df.index, u)
df.join(d)
ticker class_n a b c d e f
0 1 a 1 0 0 0 0 0
1 2 b 0 1 0 0 0 0
2 3 c 0 0 1 0 0 0
3 4 d 0 0 0 1 0 0
4 5 e 0 0 0 0 1 0
5 6 f 0 0 0 0 0 1
6 7 a 1 0 0 0 0 0
7 8 b 0 1 0 0 0 0
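As for the error itself: `df['class_n'] is cla` is an identity check that evaluates to a single False rather than a row-by-row comparison, so .loc receives one bool instead of a mask. A minimal sketch of the original loop with an element-wise == instead, assuming the eight sample rows shown in the question:

```python
import pandas as pd

# The eight sample rows from the question
df = pd.DataFrame({
    "ticker": range(1, 9),
    "class_n": list("abcdefab"),
})

# sorted() only makes the column order predictable
for cla in sorted(df["class_n"].unique()):
    # == compares every row with cla and yields a boolean Series
    df[cla] = (df["class_n"] == cla).astype(int)

print(df)
```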