I have a daraframe as below:
Datetime Data Fn
0 18747.385417 11275.0 0
1 18747.388889 8872.0 1
2 18747.392361 7050.0 0
3 18747.395833 8240.0 1
4 18747.399306 5158.0 1
5 18747.402778 3926.0 0
6 18747.406250 4043.0 0
7 18747.409722 2752.0 1
8 18747.420139 3502.0 1
9 18747.423611 4026.0 1
I want to calculate the sum of continious non zero values of Column (Fn)
I want my result dataframe as below:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0
1 18747.388889 8872.0 1 1
2 18747.392361 7050.0 0 0
3 18747.395833 8240.0 1 1
4 18747.399306 5158.0 1 2 <<<
5 18747.402778 3926.0 0 0
6 18747.406250 4043.0 0 0
7 18747.409722 2752.0 1 1
8 18747.420139 3502.0 1 2
9 18747.423611 4026.0 1 3
You can use groupby() and cumsum():
groups = df.Fn.eq(0).cumsum()
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
Details
First use df.Fn.eq(0).cumsum() to create pseudo-groups of consecutive non-zeros. Each zero will get a new id while consecutive non-zeros will keep the same id:
groups = df.Fn.eq(0).cumsum()
# groups Fn (Fn added just for comparison)
# 0 1 0
# 1 1 1
# 2 2 0
# 3 2 1
# 4 2 1
# 5 3 0
# 6 4 0
# 7 4 1
# 8 4 1
# 9 4 1
Then group df.Fn.ne(0) on these pseudo-groups and cumsum() to generate the within-group sequences:
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
# Datetime Data Fn Sum
# 0 18747.385417 11275.0 0 0
# 1 18747.388889 8872.0 1 1
# 2 18747.392361 7050.0 0 0
# 3 18747.395833 8240.0 1 1
# 4 18747.399306 5158.0 1 2
# 5 18747.402778 3926.0 0 0
# 6 18747.406250 4043.0 0 0
# 7 18747.409722 2752.0 1 1
# 8 18747.420139 3502.0 1 2
# 9 18747.423611 4026.0 1 3
How about using cumsum and reset when value is 0
df['Fn2'] = df['Fn'].replace({0: False, 1: True})
df['Fn2'] = df['Fn2'].cumsum() - df['Fn2'].cumsum().where(df['Fn2'] == False).ffill().astype(int)
df
You can store the fn column in a list and then create a new list and iterate over the stored fn column and check the previous index value if it is greater than zero then add it to current index else do not update it and after this u can make a dataframe for the list and concat column wise to existing dataframe
fn=df[Fn]
sum_list[0]=fn first value
for i in range(1,lenghtofthe column):
if fn[i-1]>0:
sum_list.append(fn[i-1]+fn[i])
else:
sum_list.append(fn[i])
dfsum=pd.Dataframe(sum_list)
df=pd.concat([df,dfsum],axis=1)
Hope this will help you.there may me syntax errors that you can refer google.But the idea is this
try this:
sum_arr = [0]
for val in df['Fn']:
if val > 0:
sum_arr.append(sum_arr[-1] + 1)
else:
sum_arr.append(0)
df['sum'] = sum_arr[1:]
df
Related
I am trying to create a target variable based on 2 conditions. I have X values that are binary and X2 values that are also binary. My condition is whenver X changes from 1 to zero, we have one in y only if it is followed by a change from 0 to 1 in X2. If that was followed by a change from 0 to 1 in X then we don't do the change in the first place. I attached a picture from excel.
I also did the following to account for the change in X
df['X-prev']=df['X'].shift(1)
df['Change-X;]=np.where(df['X-prev']+df['X']==1,1,0)
# this is the data frame
X=[1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0]
X2=[0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1]
df=pd.DataFrame()
df['X']=X
df['X2']=X2
however, this is not enough as I need to know which change came first after the X change. I attached a picture of the example.
Thanks a lot for all the contributions.
Keep rows that match your transition (X=1, X+1=0) and (X2=1, X2-1=0) then merge all selected rows to a list where a value of 0 means 'start a cycle' and 1 means 'end a cycle'.
But in this list, you can have consecutive start or end so you need to filter again to get only cycles of (0, 1). After that, reindex this new series by your original dataframe index and back fill with 1.
x1 = df['X'].sub(df['X'].shift(-1)).eq(1)
x2 = df['X2'].sub(df['X2'].shift(1)).eq(1)
sr1 = pd.Series(0, df.index[x1])
sr2 = pd.Series(1, df.index[x2])
sr = pd.concat([sr2, sr1]).sort_index()
df['Y'] = sr[sr.lt(sr.shift(-1)) | sr.gt(sr.shift(1))] \
.reindex(df.index).bfill().fillna(0).astype(int)
>>> df
X X2 Y
0 1 0 0 # start here: (X=1, X+1=0) but never ended before another start
1 1 0 0
2 0 0 0
3 0 0 0
4 1 0 0 # start here: (X=1, X+1=0)
5 0 0 1 # <- fill with 1
6 0 0 1 # <- fill with 1
7 0 0 1 # <- fill with 1
8 0 0 1 # <- fill with 1
9 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
10 0 1 0
11 0 1 0
12 0 1 0
13 0 1 0
14 0 0 0
15 0 0 0
16 0 1 0 # end here: (X2=1, X2-1=0) but never started before
17 0 0 0
18 0 0 0
19 0 0 0
20 1 0 0
21 1 0 0 # start here: (X=1, X+1=0)
22 0 0 1 # <- fill with 1
23 0 0 1 # <- fill with 1
24 0 0 1 # <- fill with 1
25 0 0 1 # <- fill with 1
26 0 0 1 # <- fill with 1
27 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
28 0 1 0
29 0 1 0
I'm trying to flag some price data as "stale" if the quoted price of the security hasn't changed over lets say 3 trading days. I'm currently trying it with:
firm["dev"] = np.std(firm["Price"],firm["Price"].shift(1),firm["Price"].shift(2))
firm["flag"] == np.where(firm["dev"] = 0, 1, 0)
But I'm getting nowhere with it. This is what my dataframe would look like.
Index
Price
Flag
1
10
0
2
11
0
3
12
0
4
12
0
5
12
1
6
11
0
7
13
0
Any help is appreciated!
If you are okay with other conditions, you can first check if series.diff equals 0 and take cumsum to check if you have a cumsum of 2 (n-1). Also check if the next row is equal to current, when both these conditions suffice, assign a flag of 1 else 0.
n=3
firm['Flag'] = (firm['Price'].diff().eq(0).cumsum().eq(n-1) &
firm['Price'].eq(firm['Price'].shift())).astype(int)
EDIT, to make it a generalized function with consecutive n, use this:
def fun(df,col,n):
c = df[col].diff().eq(0)
return (c|c.shift(-1)).cumsum().ge(n) & df[col].eq(df[col].shift())
firm['flag_2'] = fun(firm,'Price',2).astype(int)
firm['flag_3'] = fun(firm,'Price',3).astype(int)
print(firm)
Price Flag flag_2 flag_3
Index
1 10 0 0 0
2 11 0 0 0
3 12 0 0 0
4 12 0 1 0
5 12 1 1 1
6 11 0 0 0
7 13 0 0 0
I want to know how can I make the source code of the following problem based on Python.
I have a dataframe that contain this column:
Column X
1
0
0
0
1
1
0
0
1
I want to create a list b counting the sum of successive 0 value for getting something like that :
List X
1
3
3
3
1
1
2
2
1
If I understand your question correctly, you want to replace all the zeros with the number of consecutive zeros in the current streak, but leave non-zero numbers untouched. So
1 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 0
becomes
1 4 4 4 4 1 1 1 1 2 2 1 1 1 5 5 5 5 5
To do that, this should work, assuming your input column (a pandas Series) is called x.
result = []
i = 0
while i < len(x):
if x[i] != 0:
result.append(x[i])
i += 1
else:
# See how many times zero occurs in a row
j = i
n_zeros = 0
while j < len(x) and x[j] == 0:
n_zeros += 1
j += 1
result.extend([n_zeros] * n_zeros)
i += n_zeros
result
Adding screenshot below to make usage clearer
How to calculate amounts that row values greater than a specific value in pandas?
For example, I have a Pandas DataFrame dff. I want to count row values greater than 0.
dff = pd.DataFrame(np.random.randn(9,3),columns=['a','b','c'])
dff
a b c
0 -0.047753 -1.172751 0.428752
1 -0.763297 -0.539290 1.004502
2 -0.845018 1.780180 1.354705
3 -0.044451 0.271344 0.166762
4 -0.230092 -0.684156 -0.448916
5 -0.137938 1.403581 0.570804
6 -0.259851 0.589898 0.099670
7 0.642413 -0.762344 -0.167562
8 1.940560 -1.276856 0.361775
I am using an inefficient way. How to be more efficient?
dff['count'] = 0
for m in range(len(dff)):
og = 0
for i in dff.columns:
if dff[i][m] > 0:
og += 1
dff['count'][m] = og
dff
a b c count
0 -0.047753 -1.172751 0.428752 1
1 -0.763297 -0.539290 1.004502 1
2 -0.845018 1.780180 1.354705 2
3 -0.044451 0.271344 0.166762 2
4 -0.230092 -0.684156 -0.448916 0
5 -0.137938 1.403581 0.570804 2
6 -0.259851 0.589898 0.099670 2
7 0.642413 -0.762344 -0.167562 1
8 1.940560 -1.276856 0.361775 2
You can create a boolean mask of your DataFrame, that is True wherever a value is greater than your threshold (in this case 0), and then use sum along the first axis.
dff.gt(0).sum(1)
0 1
1 1
2 2
3 2
4 0
5 2
6 2
7 1
8 2
dtype: int64
Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination1 to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
1 See the pandas cookbook; the section on grouping, "Grouping like Python’s itertools.groupby"
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df)))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].as_matrix()
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
a = numpy.asarray(array)
a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
return a