I have a dataframe that looks like this:
x = pd.DataFrame.from_dict({'A':[1,2,0,4,0,6], 'B':[0, 0, 0, 44, 48, 81], 'C':[1,0,1,0,1,0]})
(assume it might have other columns).
I want to add a column that gives, for each row, the number of 0s in the specific columns A, B and C:
A B C num_zeros
0 1 0 1 1
1 2 0 0 2
2 0 0 1 2
3 4 44 0 1
4 0 48 1 1
5 6 81 0 1
Create a boolean dtype dataframe using ==, then use sum with axis=1:
x['num_zeros'] = (x == 0).sum(1)
Output:
A B C num_zeros
0 1 0 1 1
1 2 0 0 2
2 0 0 1 2
3 4 44 0 1
4 0 48 1 1
5 6 81 0 1
Now, if you want to explicitly define which columns to count, e.g. only the B and C columns, you can use this:
x['Num_zeros_in_BC'] = (x == 0)[['B','C']].sum(1)
Output:
A B C num_zeros Num_zeros_in_BC
0 1 0 1 1 1
1 2 0 0 2 2
2 0 0 1 2 1
3 4 44 0 1 1
4 0 48 1 1 0
5 6 81 0 1 1
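Since the question says the frame may have other columns, note that (x == 0).sum(1) counts zeros in every column. A minimal sketch that restricts the count to A, B and C first; the column named "other" is a hypothetical extra column added only for illustration:

```python
import pandas as pd

x = pd.DataFrame({"A": [1, 2, 0, 4, 0, 6],
                  "B": [0, 0, 0, 44, 48, 81],
                  "C": [1, 0, 1, 0, 1, 0],
                  "other": [0, 0, 0, 0, 0, 0]})  # hypothetical extra column

cols = ["A", "B", "C"]  # only these columns are inspected

# eq(0) builds a boolean frame; summing along axis=1 counts the True values per row
x["num_zeros"] = x[cols].eq(0).sum(axis=1)
print(x["num_zeros"].tolist())  # [1, 2, 2, 1, 1, 1]
```

Selecting the columns before comparing also avoids comparing columns that are never counted.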
I have a data frame with a column containing only 0s and 1s. I need to create a flag column that marks rows where the first column has at least a certain number of consecutive ones.
In the example below the threshold is 4: if there are 4 or more consecutive ones, the flag should be 1 for all of those consecutive rows.
col1 Flag
0 1 0
1 0 0
2 1 1
3 1 1
4 1 1
5 1 1
6 0 0
7 1 0
8 1 0
9 0 0
10 1 1
11 1 1
12 1 1
13 1 1
14 1 1
15 0 0
One change: let's say there is a new column, Group, and we need to group by it when computing the flag:
Group col1 Flag
0 A 1 0
1 B 0 0
2 B 1 1
3 B 1 1
4 B 1 1
5 B 1 1
6 C 0 0
7 C 1 0
8 C 1 0
9 C 0 0
10 D 1 0
11 D 1 0
12 D 1 0
13 E 1 0
14 E 1 0
15 E 0 0
As you can see, there are consecutive ones from 10 to 14, but they belong to different groups. The elements in a group can be in any order.
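Not that hard: use cumsum to create the grouping key, then transform with count. Each block after the first starts with its 0 row, so a count of 5 corresponds to a run of 4 ones: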
(df.groupby(df.col1.ne(1).cumsum())['col1'].transform('count').ge(5) & df.col1.eq(1)).astype(int)
Out[83]:
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 1
11 1
12 1
13 1
14 1
15 0
Name: col1, dtype: int32
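One caveat with counting rows per block: if the column starts with a run of 1s, that first block has no leading 0, so a run of exactly 4 would have a count of 4 and be missed by ge(5). A minimal sketch of a variant that sums the 1s in each block instead, so the threshold is the run length itself; the helper name flag_runs and its min_len parameter are hypothetical:

```python
import pandas as pd

def flag_runs(s, min_len=4):
    """Return 1 for every row that belongs to a run of at least min_len ones."""
    # Every non-1 row starts a new block, so each run of 1s shares one key.
    key = s.ne(1).cumsum()
    # The sum of the values in a block equals the number of 1s in that run.
    run_length = s.groupby(key).transform("sum")
    return (run_length.ge(min_len) & s.eq(1)).astype(int)

df = pd.DataFrame({"col1": [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]})
df["Flag"] = flag_runs(df["col1"])
print(df["Flag"].tolist())
# [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
```

This assumes the column holds only 0s and 1s, as stated in the question.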
You can achieve this in a couple of steps:
rolling(4).sum() to obtain the sum of each window of 4 consecutive rows of your column
Use where to keep the 1's from "col1" whose window sum (from the previous step) is >= 4, and turn the rest of the values into NaN
bfill(limit=3) to backward fill the leftover 1s in your column by a maximum of 3 places
fillna(0) to fill what's left over with 0
df["my_flag"] = (df["col1"]
    .where(
        df["col1"].rolling(4).sum() >= 4
    )                # Selects the 1's whose consecutive sum >= 4. All other values become NaN
    .bfill(limit=3)  # Moving backwards from our leftover values,
                     # take the existing value and fill in a maximum of 3 NaNs
    .fillna(0)       # Fill in the rest of the NaNs with 0
    .astype(int))    # Cast to integer data type, since we were working with floats temporarily
print(df)
col1 Flag my_flag
0 1 0 0
1 0 0 0
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 0 0 0
7 1 0 0
8 1 0 0
9 0 0 0
10 1 1 1
11 1 1 1
12 1 1 1
13 1 1 1
14 1 1 1
15 0 0 0
Edit:
For a grouped approach, you just need to use groupby().rolling to create your mask for use in where(). Everything after that is the same. I separated the rolling step to keep it as readable as possible:
grouped_counts_ge_4 = (df.groupby("Group")["col1"]
    .rolling(4)
    .sum()
    .ge(4)
    .reset_index(level=0, drop=True))
df["my_flag"] = (df["col1"]
    .where(grouped_counts_ge_4)
    .bfill(limit=3)  # Moving backwards from our leftover values,
                     # take the existing value and fill in a maximum of 3 NaNs
    .fillna(0)       # Fill in the rest of the NaNs with 0
    .astype(int))    # Cast to integer data type, since we were working with floats temporarily
print(df)
Group col1 Flag my_flag
0 A 1 0 0
1 B 0 0 0
2 B 1 1 1
3 B 1 1 1
4 B 1 1 1
5 B 1 1 1
6 C 0 0 0
7 C 1 0 0
8 C 1 0 0
9 C 0 0 0
10 D 1 0 0
11 D 1 0 0
12 D 1 0 0
13 E 1 0 0
14 E 1 0 0
15 E 0 0 0
Try this:
df['Flag'] = np.where(df['col1'].groupby((df['col1'].diff().ne(0) | df['col1'].eq(0)).cumsum()).transform('size').ge(4),1,0)
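The one-liner above covers the ungrouped case. For the grouped variant of the question, a minimal sketch (not taken from any of the answers) is to start a new block whenever the Group value changes as well, so that a run of 1s never spans two groups. The name min_run is an assumption for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Group": list("ABBBBBCCCCDDDEEE"),
    "col1": [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0],
})

min_run = 4  # threshold from the question

# A new block starts at every non-1 row and at every row whose Group differs
# from the previous row.
new_block = df["col1"].ne(1) | df["Group"].ne(df["Group"].shift())
key = new_block.cumsum()

# The sum of col1 inside a block is the number of 1s in that run.
run_length = df.groupby(key)["col1"].transform("sum")
df["my_flag"] = (run_length.ge(min_run) & df["col1"].eq(1)).astype(int)
print(df["my_flag"].tolist())
# [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Because the key is built from row-to-row changes, rows of the same group that are not adjacent are treated as separate runs.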
I am working with a dataframe consisting of a continuity column df['continuity'] and a group column df['group'].
Both are binary columns.
I want to add an extra column 'group_id' that gives consecutive rows of 1s the same integer value, where the first group of rows gets a 1, the next a 2, etc. After each time the continuity value of a row is 0, the counting should start again at 1.
Since this question is rather specific, I'm not sure how to tackle this in a vectorized way. Below is an example, where the first two columns are the input and the third column is the output I'd like to have.
continuity group group_id
1 0 0
1 1 1
1 1 1
1 1 1
1 0 0
1 1 2
1 1 2
1 1 2
1 0 0
1 0 0
1 1 3
1 1 3
0 1 1
0 0 0
1 1 1
1 1 1
1 0 0
1 0 0
1 1 2
1 1 2
I believe you can use:
# label consecutive runs in each column: the counter increases whenever the value changes
b = df[['continuity','group']].ne(df[['continuity','group']].shift()).cumsum()
# mark the first row of each run where group is 1
c = ~b.duplicated() & (df['group'] == 1)
# per continuity run, cumulative count of those first rows where group is 1, else 0
df['new'] = np.where(df['group'] == 1,
                     c.groupby(b['continuity']).cumsum(),
                     0).astype(int)
print(df)
continuity group group_id new
0 1 0 0 0
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 0 0 0
5 1 1 2 2
6 1 1 2 2
7 1 1 2 2
8 1 0 0 0
9 1 0 0 0
10 1 1 3 3
11 1 1 3 3
12 0 1 1 1
13 0 0 0 0
14 1 1 1 1
15 1 1 1 1
16 1 0 0 0
17 1 0 0 0
18 1 1 2 2
19 1 1 2 2
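For readers who want to run this end to end, here is a self-contained sketch of the same idea written with explicit intermediate names (cont_changed, block, starts and run_no are hypothetical names, not from the answer). It rebuilds the sample data from the question and numbers each run of group == 1 within its continuity block:

```python
import pandas as pd

df = pd.DataFrame({
    "continuity": [1] * 12 + [0, 0] + [1] * 6,
    "group": [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
})

# True on the first row of every run of equal continuity values
cont_changed = df["continuity"].ne(df["continuity"].shift())
block = cont_changed.cumsum()

# A run of group == 1 starts where group is 1 and either the previous row
# was not 1 or a new continuity block begins.
starts = df["group"].eq(1) & (df["group"].shift().ne(1) | cont_changed)

# Count the starts seen so far inside each continuity block,
# then blank out the rows where group is 0.
run_no = starts.astype(int).groupby(block).cumsum()
df["group_id"] = run_no.where(df["group"].eq(1), 0)
print(df["group_id"].tolist())
# [0, 1, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3, 1, 0, 1, 1, 0, 0, 2, 2]
```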
I have a dataframe with about 60 columns and the following structure:
A B C Y
0 12 1 0 1
1 13 1 0 [....] 0
2 14 0 1 1
3 15 1 0 0
4 16 0 1 1
I want to create a new column Z that is the sum of the values from columns B to Y.
How can I proceed?
To create a copy of the dataframe that includes the new column, use assign:
df.assign(Z=df.loc[:, 'B':'Y'].sum(1))
A B C Y Z
0 12 1 0 1 2
1 13 1 0 0 1
2 14 0 1 1 2
3 15 1 0 0 1
4 16 0 1 1 2
To assign it to the same dataframe, in place, use:
df['Z'] = df.loc[:, 'B':'Y'].sum(1)
df
A B C Y Z
0 12 1 0 1 2
1 13 1 0 0 1
2 14 0 1 1 2
3 15 1 0 0 1
4 16 0 1 1 2
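Try this (it sums every column after the first one, so it assumes B to Y are all the columns that follow A):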
df['z']=df.iloc[:,1:].sum(1)
You could use assign:
In [2361]: df.assign(Z=df.loc[:, 'B':'Y'].sum(1))
Out[2361]:
A B C Y Z
0 12 1 0 1 2
1 13 1 0 0 1
2 14 0 1 1 2
3 15 1 0 0 1
4 16 0 1 1 2
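To illustrate the difference between the label-based and positional slices, here is a small sketch using only the four columns visible in the question (the elided columns are left out). The variable name positional is an assumption for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 13, 14, 15, 16],
                   "B": [1, 1, 0, 1, 0],
                   "C": [0, 0, 1, 0, 1],
                   "Y": [1, 0, 1, 0, 1]})

# Label-based slice: includes both endpoints B and Y, wherever they sit.
df["Z"] = df.loc[:, "B":"Y"].sum(axis=1)
print(df["Z"].tolist())  # [2, 1, 2, 1, 2]

# Positional slice: everything after the first column, which now includes Z too.
positional = df.iloc[:, 1:].sum(axis=1)
print(positional.tolist())  # [4, 2, 4, 2, 4]
```

The label-based slice gives the same result if the statement is run again after Z exists, while the positional one does not.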
I have a Pandas Dataframe with three columns: row, column, value. The row values are all integers below some N, and the column values are all integers below some M. The values are all positive integers.
How do I efficiently create a Dataframe with N rows and M columns that holds the value val at position (i, j) if (i, j, val) is a row in my original Dataframe, and some default value (0) otherwise? Furthermore, is it possible to create a sparse Dataframe directly, since the data is already quite large and N*M is about 10 times the size of my data?
A NumPy solution would suit here for performance -
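a = df.values  # integer array with the columns row, col, val in that order
m,n = a[:,:2].max(0)+1  # output shape: largest row and column index plus one
out = np.zeros((m,n),dtype=a.dtype)  # every cell starts at the default value 0
out[a[:,0], a[:,1]] = a[:,2]  # write each val at its (row, col) position
df_out = pd.DataFrame(out)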
Sample run -
In [58]: df
Out[58]:
row col val
0 7 1 30
1 3 3 0
2 4 8 30
3 5 8 18
4 1 3 6
5 1 6 48
6 0 2 6
7 4 7 6
8 5 0 48
9 8 1 48
10 3 2 12
11 6 8 18
In [59]: df_out
Out[59]:
0 1 2 3 4 5 6 7 8
0 0 0 6 0 0 0 0 0 0
1 0 0 0 6 0 0 48 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 12 0 0 0 0 0 0
4 0 0 0 0 0 0 0 6 30
5 48 0 0 0 0 0 0 0 18
6 0 0 0 0 0 0 0 0 18
7 0 30 0 0 0 0 0 0 0
8 0 48 0 0 0 0 0 0 0
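The shape above is inferred from the largest indices present, which can come out smaller than N x M if the last rows or columns have no entries. A minimal sketch that takes the shape explicitly and then converts the result to a sparse dtype; the helper name to_grid and its parameters are hypothetical, and the column names row, col and val are taken from the sample run:

```python
import numpy as np
import pandas as pd

def to_grid(df, n_rows=None, n_cols=None, fill_value=0):
    """Scatter (row, col, val) triples into a dense n_rows x n_cols frame."""
    rows = df["row"].to_numpy()
    cols = df["col"].to_numpy()
    vals = df["val"].to_numpy()
    # Fall back to the largest index seen when no explicit shape is given.
    n_rows = rows.max() + 1 if n_rows is None else n_rows
    n_cols = cols.max() + 1 if n_cols is None else n_cols
    out = np.full((n_rows, n_cols), fill_value, dtype=vals.dtype)
    out[rows, cols] = vals  # fancy indexing writes each val at (row, col)
    return pd.DataFrame(out)

df = pd.DataFrame({
    "row": [7, 3, 4, 5, 1, 1, 0, 4, 5, 8, 3, 6],
    "col": [1, 3, 8, 8, 3, 6, 2, 7, 0, 1, 2, 8],
    "val": [30, 0, 30, 18, 6, 48, 6, 6, 48, 48, 12, 18],
})

grid = to_grid(df, n_rows=10, n_cols=9)  # one more row than the data implies

# Store only the non-zero cells; this still builds the dense array first.
sparse_grid = grid.astype(pd.SparseDtype("int64", 0))
```

This does not avoid the dense intermediate. If SciPy is available, building a scipy.sparse.coo_matrix from the three columns and passing it to pd.DataFrame.sparse.from_spmatrix is one way to skip that step.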
Is there a way to convert pandas dataframe to series with multiindex? The dataframe's columns could be multi-indexed too.
The code below works, but only when the levels of the column MultiIndex have names.
In [163]: d
Out[163]:
a 0 1
b 0 1 0 1
a 0 0 0 0
b 1 2 3 4
c 2 4 6 8
In [164]: d.stack(d.columns.names)
Out[164]:
a b
a 0 0 0
1 0
1 0 0
1 0
b 0 0 1
1 2
1 0 3
1 4
c 0 0 2
1 4
1 0 6
1 8
dtype: int64
I think you can use nlevels to find the number of levels in the MultiIndex, then create a range of level positions and pass it to stack:
print (d.columns.nlevels)
2
#in Python 3, wrap range in `list`
print (list(range(d.columns.nlevels)))
[0, 1]
print (d.stack(list(range(d.columns.nlevels))))
a b
a 0 0 0
1 0
1 0 0
1 0
b 0 0 1
1 2
1 0 3
1 4
c 0 0 2
1 4
1 0 6
1 8
dtype: int64
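A self-contained sketch of this on a frame whose column levels have no names, which is the case the question asks about. The data mirrors the example above; the variable names cols, levels and s are assumptions for illustration:

```python
import pandas as pd

# Two unnamed column levels, so d.columns.names is [None, None]
cols = pd.MultiIndex.from_product([[0, 1], [0, 1]])
d = pd.DataFrame([[0, 0, 0, 0],
                  [1, 2, 3, 4],
                  [2, 4, 6, 8]],
                 index=list("abc"), columns=cols)

# Refer to the column levels by position instead of by name.
levels = list(range(d.columns.nlevels))  # [0, 1]
s = d.stack(levels)

print(type(s).__name__)   # Series
print(s.index.nlevels)    # 3: the row label plus the two column levels
print(s.loc[("b", 1, 0)])  # 3
```

Stacking every column level leaves no columns, which is why the result is a Series rather than a DataFrame.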