python: Turning a column into dummy variables

I would like to add dummy variables for the column TypePhase.
wm_id TypePhase
2 ['N', 'A', 'B', 'C', 'D']
2 ['N', 'A', 'B', 'C', 'D']
3 ['N', 'W', 'A', 'B', 'C', 'D']
2 ['N', 'A', 'B', 'C', 'D']
3 ['N', 'P', 'A', 'B', 'C', 'D']
2 ['N', 'A', 'B', 'C', 'D']
I tried df.TypePhase = df.TypePhase.apply(lambda s: '_'.join(s)) but I did not get the expected result. I know that I need to apply something like
pd.get_dummies(df_new['TypePhase']).rename(columns=lambda x: 'AAAAAAAAA_' + str(x))
But I don't get it right.
Please, any suggestions?
Many thanks in advance.
carlo

I think all values in column TypePhase are strings, so it is possible to use str.get_dummies together with a double str.strip.
Last, join back to the original.
pop extracts the column from the original DataFrame, so it is not necessary to delete it afterwards.
print (type(df.loc[0, 'TypePhase']))
<class 'str'>
df1 = df.pop('TypePhase').str.strip('[]').str.get_dummies(', ')
#remove ' from new column names
df1.columns = df1.columns.str.strip("'")
df = df.join(df1)
print (df)
wm_id A B C D N P W
0 2 1 1 1 1 1 0 0
1 2 1 1 1 1 1 0 0
2 3 1 1 1 1 1 0 1
3 2 1 1 1 1 1 0 0
4 3 1 1 1 1 1 1 0
5 2 1 1 1 1 1 0 0
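If TypePhase actually holds Python lists rather than their string representation, the strip('[]') step is not needed; a rough sketch (on made-up sample data, with an arbitrary '|' separator) could join each list into one string and then call str.get_dummies:
import pandas as pd

# hypothetical frame where TypePhase holds real lists, not strings
df = pd.DataFrame({'wm_id': [2, 3],
                   'TypePhase': [['N', 'A', 'B', 'C', 'D'],
                                 ['N', 'W', 'A', 'B', 'C', 'D']]})

# join each list into one delimited string, then expand to 0/1 indicator columns
df1 = df.pop('TypePhase').apply('|'.join).str.get_dummies('|')
df = df.join(df1)
print(df)
Checking type(df.loc[0, 'TypePhase']) as in the answer above tells you which of the two cases you are dealing with.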

Related

Comparison of two columns

How can I find the same values in the columns regardless of their position?
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': ['A', 'B', 'C', 'D', 'E', np.nan, 'H'],
                   'two': ['B', 'E', 'C', np.nan, np.nan, 'H', 'L']})
The result I want to get:
three
0 B
1 E
2 C
3 H
The exact logic is unclear; you can try:
out = pd.DataFrame({'three': sorted(set(df['one'].dropna())
                                    & set(df['two'].dropna()))})
output:
three
0 B
1 C
2 E
3 H
Or maybe you want to keep the items of col two?
out = (df.loc[df['two'].isin(df['one'].dropna()), 'two']
         .to_frame(name='three')
      )
output:
three
0 B
1 E
2 C
5 H
Try this:
df = pd.DataFrame(set(df['one']).intersection(df['two']), columns=['Three']).dropna()
print(df)
Output:
Three
1 C
2 H
3 E
4 B

DataFrame selecting users that match a condition in all categories

I have the following DataFrame:
user category x y
0 AB A 1 1
1 EF A 1 1
2 SG A 1 0
3 MN A 1 0
4 AB B 0 0
5 EF B 0 1
6 SG B 0 1
7 MN B 0 0
8 AB C 1 1
9 EF C 1 1
10 SG C 1 1
11 MN C 1 1
I want to select users that have x=y in all categories. I was able to do that using the following code:
import pandas as pd

data = pd.DataFrame({'user': ['AB', 'EF', 'SG', 'MN', 'AB', 'EF',
                              'SG', 'MN', 'AB', 'EF', 'SG', 'MN'],
                     'category': ['A', 'A', 'A', 'A', 'B', 'B',
                                  'B', 'B', 'C', 'C', 'C', 'C'],
                     'x': [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1],
                     'y': [1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1]})
data = data[data['x'] == data['y']][['user', 'category']]
count_users_match = data.groupby('user', as_index=False).count()
count_cat = data['category'].unique().shape[0]
print(count_users_match[count_users_match['category'] == count_cat])
Output:
user category
0 AB 3
I felt that this was quite a long solution. Is there a shorter way to achieve this?
Try this:
filtered = (data.x.eq(data.y)
                .groupby(data['user'])
                .sum()
                .loc[lambda x: x == data['category'].nunique()]
                .reset_index(name='category'))
Output:
>>> filtered
user category
0 AB 3
We could use query + groupby + size to find the number of matching rows for each user, then compare that with the total number of categories:
tmp = data.query('x==y').groupby('user').size()
out = tmp[tmp == data['category'].nunique()].reset_index(name='category')
Output:
user category
0 AB 3
This is a more compact way to do it, but I don't know if it is also more efficient.
out = [{'user': user,
        'frequency': data.loc[data['x'] == data['y']]['user'].value_counts()[user]}
       for user in data['user'].unique()
       if data.loc[data['x'] == data['y']]['user'].value_counts()[user]
       == data['user'].value_counts()[user]]
>>> out
[{'user': 'AB', 'frequency': 3}]
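Not from the answers above, but as a vectorized sketch of the same idea (assuming each user appears exactly once per category, as in the sample data), groupby plus all expresses the condition "x equals y in every row of the user" directly:
# True for a user only if every one of their rows has x == y
qualified = (data.assign(match=data['x'].eq(data['y']))
                 .groupby('user')['match']
                 .all())
print(qualified[qualified].index.tolist())  # ['AB']
This sidesteps the repeated value_counts lookups, at the cost of returning only the qualifying user names rather than their match counts.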

Aggregate group by response based on certain count (or size) Pandas

I am looking to create a sum based on certain values obtained after a groupby count (or size). I have created a mock DataFrame and the desired output below; it should be self-explanatory from the example what I am looking for. I checked quite a bit, but it seems there is no straight answer.
import pandas as pd

data = {'col1': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B',
                 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
        'col2': ['A', 'B', 'C', 'B', 'A', 'B', 'C', 'A', 'C', 'B', 'B',
                 'C', 'A', 'B', 'A', 'A', 'A', 'B', 'C', 'C']}
data = pd.DataFrame(data)
data.groupby(['col1', 'col2'])['col2'].count()
The output for this count is:
A A 2
B 2
C 1
B A 1
B 3
C 2
C A 4
B 2
C 3
I would like to do a further calculation on this output and get:
A A 2
(B+C) 3
B (A+C) 3
B 3
C (A+B) 6
C 3
You could create dummy columns and groupby using those columns:
out = (data
       .assign(match=data['col1'] == data['col2'], count=1)
       .groupby(['col1', 'match'], as_index=False)
       .agg({'col2': lambda x: '+'.join(x.unique()), 'count': 'sum'})
       .drop(columns='match'))
Output:
col1 col2 count
0 A B+C 3
1 A A 2
2 B C+A 3
3 B B 3
4 C A+B 6
5 C C 3

Change values in each cell based on a comparison with another cell using pandas

I want to compare the values in column 0 to the values in all the other columns and change the values of those columns accordingly.
I have 4329 rows x 197 columns.
From this:
0 1 2 3
0 G G G T
1 A A G A
2 C C C C
3 T A T G
To this:
0 1 2 3
0 G 1 1 0
1 A 1 0 1
2 C 1 1 1
3 T 0 1 0
I've tried a nested for loop, which does not work and is slow.
for index, row in df.iterrows():
    for name, value in row.iteritems():
        if name == 0:
            c = value
            continue
        if value == c:
            value = 1
        else:
            value = 0
I haven't been able to piece together a way to use apply or applymap for the problem.
Here's an approach with iloc and eq:
df.iloc[:,1:] = df.iloc[:,1:].eq(df.iloc[:,0], axis=0).astype(int)
Output:
0 1 2 3
0 G 1 1 0
1 A 1 0 1
2 C 1 1 1
3 T 0 1 0
df = pd.DataFrame([['G', 'G', 'G', 'T'],
                   ['A', 'A', 'G', 'A'],
                   ['C', 'C', 'C', 'C'],
                   ['T', 'A', 'T', 'G']])
df2 = df[[0]].join(df.apply(lambda c: df[0] == c)[[1, 2, 3]].astype(int))
print(df2)
I guess ... there's probably a better way though.
You could also do something like
df.apply(lambda c: (df[0] == c).astype(int) if c.name > 0 else c)

drop first and last row from within each group

This is a follow-up question to "get first and last values in a groupby".
How do I drop the first and last rows within each group?
I have this df
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, -1),
                  [['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
                   ['a', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']],
                  ['X', 'Y'])
df
I intentionally made the second row have the same index value as the first row. I won't have control over the uniqueness of the index.
X Y
a a 0 1
a 2 3
c 4 5
d 6 7
b e 8 9
f 10 11
g 12 13
c h 14 15
i 16 17
d j 18 19
I want this
X Y
a b 2.0 3
c 4.0 5
b f 10.0 11
Because the groups at level 0 equal to 'c' and 'd' both have fewer than 3 rows, all of their rows should be dropped.
I'd apply a similar technique to what I did for the other question:
def first_last(df):
    return df.ix[1:-1]

df.groupby(level=0, group_keys=False).apply(first_last)
Note: in pandas version 0.20.0 and above, ix is deprecated and the use of iloc is encouraged instead.
So the df.ix[1:-1] should be replaced by df.iloc[1:-1].
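For reference, a minimal sketch of the same group-wise trim written against current pandas, using iloc as the note suggests and the df built in the question:
def first_last(g):
    # keep everything except the first and last row of each group;
    # groups with fewer than 3 rows come back empty and are dropped
    return g.iloc[1:-1]

print(df.groupby(level=0, group_keys=False).apply(first_last))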
