I have a pandas dataframe like this (each value in column B is a string of values joined with the | symbol):
A B
a 1|2|3
b 2|4|5
c 3|2|5
I want to create indicator columns that say whether each value is present in that row's B string:
A B 1 2 3 4 5
a 1|2|3 1 1 1 0 0
b 2|4|5 0 1 0 1 1
c 3|2|5 0 1 1 0 1
I have done this by looping over the columns, but can it be done with a lambda or a comprehension?
You can use str.get_dummies:
print(df)
A B
0 a 1|2|3
1 b 2|4|5
2 c 3|2|5
print(df.B.str.get_dummies(sep='|'))
1 2 3 4 5
0 1 1 1 0 0
1 0 1 0 1 1
2 0 1 1 0 1
And if you need to keep the original column B, use join:
print(df.join(df.B.str.get_dummies(sep='|')))
A B 1 2 3 4 5
0 a 1|2|3 1 1 1 0 0
1 b 2|4|5 0 1 0 1 1
2 c 3|2|5 0 1 1 0 1
Hope this helps.
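For reference, a minimal self-contained sketch of this approach (the sample frame mirrors the question's data):

```python
import pandas as pd

# Sample frame matching the question's data
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['1|2|3', '2|4|5', '3|2|5']})

# Split each B string on '|' and build one indicator column per distinct value
dummies = df['B'].str.get_dummies(sep='|')

# Keep the original columns alongside the indicators
out = df.join(dummies)
print(out)
```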
In [19]: df
Out[19]:
A B
0 a 1|2|3
1 b 2|4|5
2 c 3|2|5
In [20]: op = df.merge(df.B.apply(lambda s: pd.Series({col: 1 for col in s.split('|')})),
    ...:               left_index=True, right_index=True).fillna(0)
In [21]: op
Out[21]:
A B 1 2 3 4 5
0 a 1|2|3 1 1 1 0 0
1 b 2|4|5 0 1 0 1 1
2 c 3|2|5 0 1 1 0 1
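A self-contained version of the merge/apply approach, with one small change: casting the filled values to int so the result prints as whole numbers rather than floats (the sample data is assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['1|2|3', '2|4|5', '3|2|5']})

# Turn each B string into a Series of 1s keyed by its values;
# apply aligns the per-row Series into a DataFrame of indicators
indicators = df['B'].apply(lambda s: pd.Series({v: 1 for v in s.split('|')}))

# Missing value/row combinations come back as NaN, so fill with 0
# and cast to int before merging back on the index
op = df.merge(indicators.fillna(0).astype(int),
              left_index=True, right_index=True)
print(op)
```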
I have a dataframe df
ID ID2 escto1 escto2 escto3
1 A 1 0 0
2 B 0 1 0
3 C 0 0 3
4 D 0 2 0
So, either using indexing or a wildcard on the column names, like 'escto*', I want something like:
if df.iloc[:, 2:]>0 then df.helper=1
or
df.loc[(df.iloc[:, 3:]>0,'Transfer')]=1
So that output becomes
ID ID2 escto1 escto2 escto3 helper
1 A 1 0 0 1
2 B 0 1 0 1
3 C 0 0 3 1
4 D 0 2 0 1
One option is to use the boolean output:
df.assign(helper=df.filter(like='escto').gt(0).any(axis=1).astype(int))
ID ID2 escto1 escto2 escto3 helper
0 1 A 1 0 0 1
1 2 B 0 1 0 1
2 3 C 0 0 3 1
3 4 D 0 2 0 1
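A runnable sketch of the filter/any idea, using sample data assumed from the question:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'ID2': list('ABCD'),
                   'escto1': [1, 0, 0, 0],
                   'escto2': [0, 1, 0, 2],
                   'escto3': [0, 0, 3, 0]})

# Select the escto* columns by name, test each cell for > 0,
# and flag every row where any of them is positive
df['helper'] = df.filter(like='escto').gt(0).any(axis=1).astype(int)
print(df)
```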
So I have a pandas dataframe that looks something like this.
name is_something
0 a 0
1 b 1
2 c 0
3 c 1
4 a 1
5 b 0
6 a 1
7 c 0
8 a 1
Is there a way to use groupby and merge to create a new column that gives the number of times a name appears with an is_something value of 1 in the whole dataframe? The updated dataframe would look like this:
name is_something no_of_times_is_something_is_1
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
I know you can just loop through the dataframe to do this but I'm looking for a more efficient way because the dataset I'm working with is quite large. Thanks in advance!
If the is_something column contains only 0 and 1 values, just use sum with GroupBy.transform to fill the new column with aggregated values:
df['new'] = df.groupby('name')['is_something'].transform('sum')
print (df)
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
If other values are possible, first compare with 1, convert to integer, and then use transform with sum:
df['new'] = df['is_something'].eq(1).astype('i1').groupby(df['name']).transform('sum')
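Both variants, in one self-contained sketch (sample data taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'name': list('abccabaca'),
                   'is_something': [0, 1, 0, 1, 1, 0, 1, 0, 1]})

# Per-name total of is_something, broadcast back to every row of that name
df['new'] = df.groupby('name')['is_something'].transform('sum')

# Variant that also works when is_something can hold values other than 0/1:
# count only the exact 1s
df['new2'] = df['is_something'].eq(1).astype('i1').groupby(df['name']).transform('sum')
print(df)
```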
Or you can map it:
df['New'] = df.name.map(df.query('is_something == 1').groupby('name')['is_something'].sum())
df
name is_something New
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
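The map variant as a runnable sketch; note that names which never have is_something equal to 1 would map to NaN (all names here have at least one 1, matching the question's data):

```python
import pandas as pd

df = pd.DataFrame({'name': list('abccabaca'),
                   'is_something': [0, 1, 0, 1, 1, 0, 1, 0, 1]})

# Sum is_something per name over only the rows where it is 1,
# then map those totals back onto every row via the name column
totals = df.query('is_something == 1').groupby('name')['is_something'].sum()
df['New'] = df['name'].map(totals)
print(df)
```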
You could do:
df['new'] = df.groupby('name')['is_something'].transform(lambda xs: xs.eq(1).sum())
print(df)
Output
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
I have a dataframe like below,
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 1 0 0
3 0 0 1 0
I want to convert it into this:
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 1 0 0
3 1 1 1 0
So far I have tried:
df = df.replace('0', np.NaN)
df = df.fillna(method='ffill').fillna('0')
My code above works fine, but I think there is a better approach to this problem.
Use cumsum on the data converted to numeric, and then replace with DataFrame.mask:
df = df.mask(df.astype(int).cumsum() >= 1, '1')
print (df)
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 1 0 0
3 1 1 1 0
Detail:
print (df.astype(int).cumsum())
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 2 0 0
3 1 2 1 0
Or the same principle in NumPy with numpy.where:
arr = df.values.astype(int)
df = pd.DataFrame(np.where(np.cumsum(arr, axis=0) >= 1, '1', '0'),
                  index=df.index, columns=df.columns)
print (df)
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 1 0 0
3 1 1 1 0
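A self-contained sketch of the mask/cumsum approach; the frame of '0'/'1' strings is assumed from the question:

```python
import pandas as pd

# Frame of '0'/'1' strings, as in the question
df = pd.DataFrame([list('1000'), list('0100'), list('0100'), list('0010')],
                  columns=list('ABCD'))

# Once a column has seen a '1', its running sum stays >= 1,
# so mask everything from that point down with '1'
out = df.mask(df.astype(int).cumsum() >= 1, '1')
print(out)
```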
Is there a way to convert pandas dataframe to series with multiindex? The dataframe's columns could be multi-indexed too.
The approach below works, but only when the column levels have names.
In [163]: d
Out[163]:
a 0 1
b 0 1 0 1
a 0 0 0 0
b 1 2 3 4
c 2 4 6 8
In [164]: d.stack(d.columns.names)
Out[164]:
a b
a 0 0 0
1 0
1 0 0
1 0
b 0 0 1
1 2
1 0 3
1 4
c 0 0 2
1 4
1 0 6
1 8
dtype: int64
I think you can use nlevels to find the number of levels in the MultiIndex, then create a range and stack:
print (d.columns.nlevels)
2
#for python 3 add `list`
print (list(range(d.columns.nlevels)))
[0, 1]
print (d.stack(list(range(d.columns.nlevels))))
a b
a 0 0 0
1 0
1 0 0
1 0
b 0 0 1
1 2
1 0 3
1 4
c 0 0 2
1 4
1 0 6
1 8
dtype: int64
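A runnable sketch; the two-level columns here are a reconstruction of the example's frame (level names 'a' and 'b', values 0 and 1 on each level):

```python
import pandas as pd

# Rebuild the example frame with two-level columns
cols = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=['a', 'b'])
d = pd.DataFrame([[0, 0, 0, 0], [1, 2, 3, 4], [2, 4, 6, 8]],
                 index=list('abc'), columns=cols)

# Stack every column level by position, so this works
# even when the levels are unnamed
s = d.stack(list(range(d.columns.nlevels)))
print(s)
```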
I am looking for a way to generate a sequence of numbers that resets on every break.
Example
ID VAR
A 0
A 0
A 1
A 1
A 0
A 0
A 1
A 1
B 1
B 1
B 1
B 0
B 0
B 0
B 0
Each time VAR is 1 and the ID is the same as before, the counter increments;
but if the ID changes or VAR is 0, it starts again from 0.
Desired output
ID VAR DESIRED
A 0 0
A 0 0
A 1 1
A 1 2
A 0 0
A 0 0
A 1 1
A 1 2
B 1 1
B 1 2
B 1 3
B 0 0
B 0 0
B 0 0
B 0 0
You can create an intermediate index, then group by this index together with ID and take the cumulative sum of VAR:
df['ix'] = df['VAR'].diff().fillna(0).abs().cumsum()
df['DESIRED'] = df.groupby(['ID','ix'])['VAR'].cumsum()
In [21]: df
Out[21]:
ID VAR ix DESIRED
0 A 0 0 0
1 A 0 0 0
2 A 1 1 1
3 A 1 1 2
4 A 0 2 0
5 A 0 2 0
6 A 1 3 1
7 A 1 3 2
8 B 1 3 1
9 B 1 3 2
10 B 1 3 3
11 B 0 4 0
12 B 0 4 0
13 B 0 4 0
14 B 0 4 0
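The full recipe as a self-contained sketch (data taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'ID': list('AAAAAAAABBBBBBB'),
                   'VAR': [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]})

# Intermediate block index: bump whenever VAR changes value,
# so each run of equal VAR values gets its own label
df['ix'] = df['VAR'].diff().fillna(0).abs().cumsum()

# Within each (ID, block) group, the cumulative sum of VAR counts
# consecutive 1s and stays 0 over the runs of 0
df['DESIRED'] = df.groupby(['ID', 'ix'])['VAR'].cumsum()
print(df)
```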