I have a column in my dataframe that holds string values, as shown below.
What I want to do is replace every NaN value with 0 and every other value (whether string or int) with 1.
I tried this
func_lambda = lambda x: 1 if any(dataframe['Colum'].values != 0) else 0
But it replaces the whole column with 1.
this is my df.head
datacol.head(20)
Out[77]:
0 nan
1 4500856427
2 4003363
3 nan
4 16-4989
5 nan
6 nan
7 WVE-78686557032
8 nan
9 4501581113
10 D4-SC-0232737-1/G1023716
11 nan
12 nan
13 4502549104
14 nan
15 nan
16 nan
17 IT008297
18 15\036628
19 299011667
Name: Customer_PO_Number, dtype: object
Check this:
import pandas as pd
df = pd.DataFrame({"Customer_PO_Number":
['nan','4500856427','4003363','nan','16 - 4989','nan','nan','WVE - 78686557032',
'nan','4501581113','D4 - SC - 0232737 - 1 / G1023716','nan','nan','4502549104',
'nan','nan','nan','IT008297','15\03662','8','299011667']})
df.replace('nan', 0, inplace=True) # for replacing nan to 0
df[df != 0] = 1 # for replacing others to 1
print(df)
It will give you output like this:
Customer_PO_Number
0 0
1 1
2 1
3 0
4 1
5 0
6 0
7 1
8 0
9 1
10 1
11 0
12 0
13 1
14 0
15 0
16 0
17 1
18 1
19 1
Hope it will help you! :)
You can use a boolean test and cast the result to int:
(df['Customer_PO_Number'] != 'nan').astype(int)
Output:
0 0
1 1
2 1
3 0
4 1
5 0
6 0
7 1
8 0
9 1
10 1
11 0
12 0
13 1
14 0
15 0
16 0
17 1
18 1
19 1
Name: Customer_PO_Number, dtype: int32
If the 'nan' values are really np.nan, then you can use notnull:
df['Customer_PO_Number'].notnull().astype(int)
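If the column mixes real np.nan values with the literal string 'nan' (hard to tell from the head above), a combined check is a safer bet (a small sketch):

s = df['Customer_PO_Number']
# 1 only for rows that are neither missing nor the literal string 'nan'
df['flag'] = (s.notna() & s.astype(str).str.lower().ne('nan')).astype(int)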
Every time a 1 appears in the 'Signal' column, I would like to create a new column that carries the corresponding value from the 'Value' column forward (please see the expected output below).
Initial data:
Index  Value  Signal
0      3      0
1      8      0
2      8      0
3      7      1
4      9      0
5      10     0
6      14     1
7      10     0
8      10     0
9      4      1
10     10     0
11     10     0
Expected Output:
Index  Value  Signal  New_Col_1  New_Col_2  New_Col_3
0      3      0       0          0          0
1      8      0       0          0          0
2      8      0       0          0          0
3      7      1       7          0          0
4      9      0       7          0          0
5      10     0       7          0          0
6      14     1       7          14         0
7      10     0       7          14         0
8      10     0       7          14         0
9      4      1       7          14         4
10     10     0       7          14         4
11     10     0       7          14         4
What would be a way to do it?
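For anyone who wants to run the answers below, the sample frame can be rebuilt like this (treating Index as an ordinary column, which the pivot relies on):

import pandas as pd

df = pd.DataFrame({
    "Index":  list(range(12)),
    "Value":  [3, 8, 8, 7, 9, 10, 14, 10, 10, 4, 10, 10],
    "Signal": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})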
You can use a pivot:
out = df.join(df
   # keep only the values where signal is 1
   # and get Signal's cumsum
   .assign(val=df['Value'].where(df['Signal'].eq(1)),
           col=df['Signal'].cumsum()
           )
   # pivot cumsumed Signal to columns
   .pivot(index='Index', columns='col', values='val')
   # ensure column 0 is absent (using loc to avoid KeyError)
   .loc[:, 1:]
   # forward fill the values
   .ffill()
   # rename columns
   .add_prefix('New_Col_')
)
output:
Index Value Signal New_Col_1 New_Col_2 New_Col_3
0 0 3 0 NaN NaN NaN
1 1 8 0 NaN NaN NaN
2 2 8 0 NaN NaN NaN
3 3 7 1 7.0 NaN NaN
4 4 9 0 7.0 NaN NaN
5 5 10 0 7.0 NaN NaN
6 6 14 1 7.0 14.0 NaN
7 7 10 0 7.0 14.0 NaN
8 8 10 0 7.0 14.0 NaN
9 9 4 1 7.0 14.0 4.0
10 10 10 0 7.0 14.0 4.0
11 11 10 0 7.0 14.0 4.0
Another option, also based on pivot:
# create a new column label by incrementing at the rows that have a signal
df['new_col'] = 'new_col_' + df['Signal'].cumsum().astype(str)
# rows having no signal get a dummy '0' label
df['new_col'] = df['new_col'].mask(df['Signal']==0, '0')
# pivot table
df2 = (df.pivot(index=['Index','Signal', 'Value'], columns='new_col', values='Value')
       .reset_index()
       .ffill().fillna(0)                          # forward fill and fill remaining NaN with 0
       .drop(columns=['0','Index'])                # drop the extra columns
       .rename_axis(columns={'new_col':'Index'})   # rename the columns axis
       .astype(int))                               # cast to int, removing decimals
df2
Index Signal Value new_col_1 new_col_2 new_col_3
0 0 3 0 0 0
1 0 8 0 0 0
2 0 8 0 0 0
3 1 7 7 0 0
4 0 9 7 0 0
5 0 10 7 0 0
6 1 14 7 14 0
7 0 10 7 14 0
8 0 10 7 14 0
9 1 4 7 14 4
10 0 10 7 14 4
11 0 10 7 14 4
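If you find the pivot hard to follow, a plain loop over the signal count gives the same result (a minimal sketch, assuming 'Signal' only contains 0 and 1):

vals = df['Value'].where(df['Signal'].eq(1))    # keep Value only on the signal rows
group = df['Signal'].cumsum()                   # 1, 2, 3, ... from each signal onward
out = df.copy()
for k in range(1, int(group.max()) + 1):
    out[f'New_Col_{k}'] = vals.where(group.eq(k)).ffill().fillna(0)

This scales poorly if there are many signals, but it is easy to reason about.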
I have a pandas dataframe with, let's say, two columns, for example:
value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0
Now I want to add a third column (new_boolean) with the following criteria:
I specify a period, for this example period = 4.
Now I take a look at all rows where boolean == 1.
new_boolean will be 1 for the maximum value in the last period rows.
For example, boolean == 1 at index 1 (value = 5). So I look at the last period rows: the values are [1, 5], 5 is the maximum, so new_boolean at index 1 will be 1.
Second example: index 7 (value = 7): I get the values [7, 4, 12, 9]; 12 is the maximum, so new_boolean will be 1 in the row with value 12 (index 5).
result:
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
How can I do this algorithmically?
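For reference, the sample frame from the table above can be rebuilt with:

import pandas as pd

df = pd.DataFrame({
    "value":   [1, 5, 0, 3, 9, 12, 4, 7, 8, 2, 17, 15, 6],
    "boolean": [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})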
Compute the rolling max of the 'value' column
>>> rolling_max_value = df.rolling(window=4, min_periods=1)['value'].max()
>>> rolling_max_value
0 1.0
1 5.0
2 5.0
3 5.0
4 9.0
5 12.0
6 12.0
7 12.0
8 12.0
9 8.0
10 17.0
11 17.0
12 17.0
Name: value, dtype: float64
Select only the relevant values, i.e. where 'boolean' = 1
>>> on_values = rolling_max_value[df.boolean == 1].unique()
>>> on_values
array([ 5., 9., 12., 17.])
The rows where 'new_boolean' = 1 are the ones where 'value' belongs to on_values
>>> df['new_boolean'] = df.value.isin(on_values).astype(int)
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
EDIT:
OP raised a good point
Does this also work if I have multiple columns with the same value and they have different booleans?
The previous solution doesn't account for that. To solve this, instead of computing the rolling max, we gather the row labels associated with the rolling max values, i.e. the rolling argmax or idxmax. To my knowledge, Rolling objects don't have an idxmax method, but we can easily compute it via apply.
def idxmax(values):
    return values.idxmax()

rolling_idxmax_value = (
    df.rolling(min_periods=1, window=4)['value']
    .apply(idxmax)
    .astype(int)
)
on_idx = rolling_idxmax_value[df.boolean == 1].unique()
df['new_boolean'] = 0
df.loc[on_idx, 'new_boolean'] = 1
Results:
>>> rolling_idxmax_value
0 0
1 1
2 1
3 1
4 4
5 5
6 5
7 5
8 5
9 8
10 10
11 10
12 10
Name: value, dtype: int64
>>> on_idx
array([ 1,  4,  5, 10])
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
I did this in two steps, which I think makes the solution clearer:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO('''
id value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0'''),delim_whitespace=True,index_col=0)
df['new_bool'] = df['value'].rolling(min_periods=1, window=4).max()
df['new_bool'] = df.apply(lambda x: 1 if ((x['value'] == x['new_bool']) & (x['boolean'] == 1)) else 0, axis=1)
df
Result:
value boolean new_bool
id
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 0
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 0
11 15 1 0
12 6 0 0
I have the following dataframe:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,1,1,0,0,1,1,1,0,1,1,1,1,0,0,0]})
Now I would like to set rows to zero wherever fewer than four 1's appear "in a row", i.e. I would like to have the following resulting DataFrame:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0]})
I was not able to find a way to achieve this nicely...
Try with groupby and where:
streaks = df.groupby(df["col"].ne(df["col"].shift()).cumsum()).transform("sum")
output = df.where(streaks.ge(4), 0)
>>> output
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
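To see what the grouping does, it helps to print the intermediates (a quick sketch of the same idea, shown for the single column):

run_id = df["col"].ne(df["col"].shift()).cumsum()      # new id each time the value changes
run_sum = df.groupby(run_id)["col"].transform("sum")   # 0 for runs of zeros, run length for runs of ones
print(pd.concat({"col": df["col"], "run_id": run_id, "run_sum": run_sum}, axis=1))
# rows belonging to runs with fewer than four 1's are zeroed out
result = df["col"].where(run_sum.ge(4), 0)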
We can do
df.loc[df.groupby(df.col.eq(0).cumsum()).transform('count')['col']<5,'col'] = 0
df
Out[77]:
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
I have a dataframe like the following:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total
-----------------------------------------------------------------------------------------------------
0 A 2 21 0 18 3 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0 156
1 B 0 20 12 2 0 8 14 23 0 ... 0 0 0 0 0 0 0 0 0 0 231
2 C 0 38 19 3 1 3 3 7 1 ... 0 0 0 0 0 0 0 0 0 0 78
3 D 3 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5
and I want to know the % of events (the numbers in the cells) before and after the first sequence of zeros of length n appears in each row. This problem started as another question found here: Length of first sequence of zeros of given size after certain column in pandas dataframe, and I am trying to modify the code to do what I need, but I keep getting errors and can't seem to find the right way. This is what I have tried:
def func(row, n):
    """Returns the number of events before the
    first sequence of 0s of length n is found
    """
    idx = np.arange(0, 91)
    a = row[idx]
    b = (a != 0).cumsum()
    c = b[a == 0]
    d = c.groupby(c).count()
    # in case there is no sequence of 0s with length n
    try:
        e = c[c >= d.index[d >= n][0]]
        f = str(e.index[0])
    except IndexError:
        e = [90]
        f = str(e[0])
    idx_sliced = np.arange(0, int(f)+1)
    a = row[idx_sliced]
    if (int(f) + n > 90):
        perc_before = 100
    else:
        perc_before = a.cumsum().tail(1).values[0]/row['total']
    return perc_before
As is, the error I get is:
---> perc_before = a.cumsum().tail(1).values[0]/row['total']
TypeError: ('must be str, not int', 'occurred at index 0')
Finally, I would apply this function to a dataframe and return a new column with the % of events before the first sequence of n 0s in each row, like this:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total %_before
---------------------------------------------------------------------------------------------------------------
0 A 2 21 0 18 3 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0 156 43
1 B 0 20 12 2 0 8 14 23 0 ... 0 0 0 0 0 0 0 0 0 0 231 21
2 C 0 38 19 3 1 3 3 7 1 ... 0 0 0 0 0 0 0 0 0 0 78 90
3 D 3 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5 100
If trying to solve this, you can test by using this sample input:
a = pd.Series([1,1,13,0,0,0,4,0,0,0,0,0,12,1,1])
b = pd.Series([1,1,13,0,0,0,4,12,1,12,3,0,0,5,1])
c = pd.Series([1,1,13,0,0,0,4,12,2,0,5,0,5,1,1])
d = pd.Series([1,1,13,0,0,0,4,12,1,12,4,50,0,0,1])
e = pd.Series([1,1,13,0,0,0,4,12,0,0,0,54,0,1,1])
df = pd.DataFrame({'0':a, '1':b, '2':c, '3':d, '4':e})
df = df.transpose()
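Note that my function divides by row['total'], which this sample does not have yet; assuming it is simply the per-row sum of events (as in the frame above), it can be added with:

df['total'] = df.sum(axis=1)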
Give this a try:
import numpy as np

def percent_before(row, n, ncols):
    """Return the percentage of activities that happen before
    the first sequence of at least `n` consecutive 0s
    """
    start_index, i, size = 0, 0, 0
    for i in range(ncols):
        if row[i] == 0:
            # increase the size of the island
            size += 1
        elif size >= n:
            # found the island we want
            break
        else:
            # start a new island
            # row[start_index] is always non-zero
            start_index = i
            size = 0
    if size < n:
        # didn't find the island we want
        return 1
    else:
        # get the sum of activities that happen
        # before the island
        idx = np.arange(0, start_index + 1).astype(str)
        return row.loc[idx].sum() / row['total']
df['percent_before'] = df.apply(percent_before, n=3, ncols=15, axis=1)
Result:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 total percent_before
0 1 1 13 0 0 0 4 0 0 0 0 0 12 1 1 33 0.454545
1 1 1 13 0 0 0 4 12 1 12 3 0 0 5 1 53 0.283019
2 1 1 13 0 0 0 4 12 2 0 5 0 5 1 1 45 0.333333
3 1 1 13 0 0 0 4 12 1 12 4 50 0 0 1 99 0.151515
4 1 1 13 0 0 0 4 12 0 0 0 54 0 1 1 87 0.172414
For the full frame, call apply with ncols=91.
Another possible solution:
def get_vals(df, n):
    df, out = df.T, []
    for col in df.columns:
        diff_to_previous = df[col] != df[col].shift(1)
        g = df.groupby(diff_to_previous.cumsum())[col].agg(['idxmin', 'size'])
        vals = df.loc[g.loc[g['size'] >= n, 'idxmin'].values, col]
        if len(vals):
            out.append(df.loc[np.arange(0, vals[vals == 0].index[0]), col].sum() / df[col].sum())
        else:
            out.append(1.0)
    return out
df['percent_before'] = get_vals(df, n=3)
print(df)
Prints:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 percent_before
0 1 1 13 0 0 0 4 0 0 0 0 0 12 1 1 0.454545
1 1 1 13 0 0 0 4 12 1 12 3 0 0 5 1 0.283019
2 1 1 13 0 0 0 4 12 2 0 5 0 5 1 1 0.333333
3 1 1 13 0 0 0 4 12 1 12 4 50 0 0 1 0.151515
4 1 1 13 0 0 0 4 12 0 0 0 54 0 1 1 0.172414
As one of the comments on the previous question was about speed, I guess you can try to vectorize the problem. I used this dataframe to try it (slightly different from your original input):
ID 0 1 2 3 4 5 6 7 8 total
0 A 2 21 0 18 3 0 0 0 2 46
1 B 0 0 12 2 0 8 14 23 0 59
2 C 0 38 19 3 1 3 3 7 1 75
3 D 3 0 0 1 0 0 0 0 0 4
The idea is to chain commands to build a mask: find where the data is not equal to 0, take the cumsum along the column axis, and see where the diff along the columns is equal to 0. To find the first such spot, you can use cummax so that all the columns after it (row-wise) are considered True. Mask the original dataframe with the opposite of this mask, sum along the columns and divide by total. For example, with n=2:
n=2
df['%_before'] = df[~(df.ne(0).cumsum(axis=1).diff(n, axis=1)[range(9)]
.eq(0).cummax(axis=1))].sum(axis=1)/df.total
print (df)
ID 0 1 2 3 4 5 6 7 8 total %_before
0 A 2 21 0 18 3 0 0 0 2 46 0.956522
1 B 0 0 12 2 0 8 14 23 0 59 0.000000
2 C 0 38 19 3 1 3 3 7 1 75 1.000000
3 D 3 0 0 1 0 0 0 0 0 4 0.750000
In your case, you need to replace range(9) with range(91) to cover all your columns.
You can do this using the rolling method.
For your example input, looking for a run of 5 zeros, we can use
df.rolling(window=5, axis=1).apply(lambda x : np.sum(x))
The output would like
0 1 2 3 4 5 6 7 8 9 10 11 12 13 \
0 NaN NaN NaN NaN 15.0 14.0 17.0 4.0 4.0 4.0 4.0 0.0 12.0 13.0
1 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 17.0 29.0 32.0 28.0 16.0 20.0
2 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 18.0 18.0 23.0 19.0 12.0 11.0
3 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 17.0 29.0 33.0 79.0 67.0 66.0
4 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 16.0 16.0 16.0 66.0 54.0 55.0
14
0 14.0
1 9.0
2 12.0
3 55.0
4 56.0
Looking at the output, it's easy to see that in the first row the value in column 11 is 0, which means that starting at position 7 there are 5 consecutive zeros.
Since no 0 appears in the rolling sums of the other rows, none of them contain 5 contiguous zeros.
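To turn that observation into the requested percentage, one possible continuation (a sketch, assuming the sample frame above with integer column labels 0 to 14 and n = 5) is:

n = 5
cols = list(range(15))
roll = df[cols].rolling(window=n, axis=1).sum()   # same rolling sums as printed above (axis=1 is deprecated in recent pandas)

has_run = roll.eq(0).any(axis=1)                  # does the row contain n zeros in a row?
end = roll.eq(0).idxmax(axis=1)                   # last column of the first such run
start = end - n + 1                               # first column of the run

pct = []
for label, row in df[cols].iterrows():
    if has_run[label]:
        # events strictly before the run, over the row total
        pct.append(row.iloc[:int(start[label])].sum() / row.sum())
    else:
        pct.append(1.0)                           # no run of n zeros: everything counts
df['%_before'] = pct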
I want to do a cumulative sum on a pandas dataframe without carrying the sum over into the trailing zero rows. For example, given the dataframe:
A B
1 1 2
2 5 0
3 10 0
4 10 1
5 0 1
6 5 2
7 0 0
8 0 0
9 0 0
The cumulative sum should run over index 1 to 6 only:
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
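For reference, the sample frame can be rebuilt like this:

import pandas as pd

df = pd.DataFrame({"A": [1, 5, 10, 10, 0, 5, 0, 0, 0],
                   "B": [2, 0, 0, 1, 1, 2, 0, 0, 0]},
                  index=range(1, 10))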
If you do not want to apply cumsum to the trailing rows that are 0 in all columns:
Check whether each row contains any non-zero value, shift the mask and take the cumulative sum. Finally compare with the last (maximum) value and filter:
a = df.ne(0).any(1).shift().cumsum()
m = a != a.max()
df[m] = df[m].cumsum()
print (df)
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
A similar solution if you want to process each column separately - just omit the any:
print (df)
A B
1 1 2
2 5 0
3 10 0
4 10 1
5 0 1
6 5 0
7 0 0
8 0 0
9 0 0
a = df.ne(0).shift().cumsum()
m = a != a.max()
df[m] = df[m].cumsum()
print (df)
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 0
7 0 0
8 0 0
9 0 0
Use
In [262]: s = df.ne(0).all(1)
In [263]: l = s[s].index[-1]
In [264]: df[:l] = df.cumsum()
In [265]: df
Out[265]:
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
I will use last_valid_index
v=df.replace(0,np.nan).apply(lambda x : x.last_valid_index())
df[pd.DataFrame(df.index.values<=v.values[:,None],columns=df.index,index=df.columns).T].cumsum().fillna(0)
Out[890]:
A B
1 1.0 2.0
2 6.0 2.0
3 16.0 2.0
4 26.0 3.0
5 26.0 4.0
6 31.0 6.0
7 0.0 0.0
8 0.0 0.0
9 0.0 0.0
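Broken into steps, the one-liner above does roughly this (a sketch):

import numpy as np
import pandas as pd

# last row label per column that still holds a non-zero value
last_nonzero = df.replace(0, np.nan).apply(lambda col: col.last_valid_index())
# keep[i, j] is True while row label i is at or before column j's last non-zero row
keep = pd.DataFrame(df.index.values[:, None] <= last_nonzero.values,
                    index=df.index, columns=df.columns)
result = df[keep].cumsum().fillna(0)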
To skip all rows after the first (0, 0) row, get the first row index where df['A'] and df['B'] are both 0 using idxmax(0):
>>> m = ((df["A"]==0) & (df["B"]==0)).idxmax(0)
>>> df[:m] = df[:m].cumsum()
>>> df
A B
0 1 2
1 6 2
2 16 2
3 26 3
4 26 4
5 31 6
6 0 0
7 0 0
8 0 0