Combining a loop and a conditional to make a new column on a dataframe - python

I want to write a function with a loop and a conditional that counts only when Actual Result = 1, so that the number increases by 1 every time Actual Result = 1.
This is my dataframe:
This is my code, but it doesn't produce the result that I want:
def func_count(x):
    for i in range(1, 880):
        if x['Actual Result'] == 1:
            result = i
        else:
            result = '-'
    return result

X_machine_learning['Count'] = X_machine_learning.apply(lambda x: func_count(x), axis=1)
When I check and filter with Count != '-', the number is always equal to 1 and never increases by 1 each time the actual result = 1. Any solution?

Try something like this:
import pandas as pd

df = pd.DataFrame({
    'age': [30, 25, 40, 12, 16, 17, 14, 50, 22, 10],
    'actual_result': [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
})

count = 0
lst_count = []
for i in range(len(df)):
    if df['actual_result'][i] == 1:
        count += 1
        lst_count.append(count)
    else:
        lst_count.append('-')
df['count'] = lst_count
print(df)
Result
   age  actual_result count
0   30              0     -
1   25              1     1
2   40              1     2
3   12              1     3
4   16              0     -
5   17              0     -
6   14              1     4
7   50              1     5
8   22              1     6
9   10              0     -

Actually, you don't need to loop over the dataframe at all; that is mostly a pandas anti-pattern and should be avoided. With df your dataframe, you could try the following instead:
m = df["Actual Result"] == 1
df["Count"] = m.cumsum().where(m, "-")
Result for the following dataframe
df = pd.DataFrame({"Actual Result": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})
is
   Actual Result Count
0              1     1
1              1     2
2              0     -
3              1     3
4              1     4
5              1     5
6              0     -
7              0     -
8              1     6
9              0     -
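As a brief usage note (assuming the df and Count names from the question): m.cumsum() numbers only the rows where the mask is True, and .where(m, "-") fills every other row with "-", so the Count column ends up mixing integers and strings. The filter mentioned in the question is then a plain boolean selection:
# Keep only the rows that were actually counted
counted = df[df["Count"] != "-"]
print(counted)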

Related

Why is the condition for my dataframe not working?

Here is the code:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'0': [1, 0, 11, 0], '1': [0, 11, 4, 0]})
print(df1.head(5))
df2 = df1.copy()
columns = list(df2.columns)
print(columns)
for i in columns:
    idx1 = np.where((df2[i] > 0) & (df2[i] < 10))
    df2.loc[idx1] = 1
    idx3 = np.where(df2[i] == 0)
    df2.loc[idx3] = 0
    idx2 = np.where(df2[i] > 10)
    df2.loc[idx2] = 0
print(df2.head(5))
output:
    0   1
0   1   0
1   0  11
2  11   4
3   0   0
['0', '1']
   0  1
0  1  1
1  0  0
2  0  0
3  0  0
The concerning part is:
idx1 = np.where((df2[i] > 0) & (df2[i] < 10))
df2.loc[idx1] = 1
Why isn't this logic working?
According to this logic, this is what my output should be:
expected:
   0  1
0  1  1
1  0  0
2  0  1
3  0  0
This can be done much more simply. You can operate directly on the dataframe as a whole; there is no need to cycle through the columns individually.
Also, you don't need numpy.where to grab indices; you can use the boolean dataframe from the selection directly.
sel = (df2 > 0) & (df2 < 10)
df2[sel] = 1
df2[df2 == 0] = 0
df2[df2 > 10] = 0
(The first line is only to make the second line not overly complicated to the eye.)
Given your conditions however, the result is
   0  1
0  1  0
1  0  0
2  0  1
3  0  0
That is because you only set numbers between 0 and 10 (exclusive) to 1: a number like 11 is set to 0, and 0 is also set to 0, not to 1 (the latter appears in your expected output). Your expected output does not seem to align with your own logic. It looks like anything between 0 and 10 (exclusive) should become 1 and everything else 0.
If so, try this:
df2 = pd.DataFrame(np.where((0 < df1) & (df1 < 10), 1, 0))
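One small caveat: np.where returns a plain NumPy array, so the wrapped DataFrame loses the original labels. A minimal sketch (same assumed rule as above) that carries over the column names and index of df1:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'0': [1, 0, 11, 0], '1': [0, 11, 4, 0]})
# 1 where the value is strictly between 0 and 10, else 0; labels taken from df1
df2 = pd.DataFrame(np.where((0 < df1) & (df1 < 10), 1, 0),
                   columns=df1.columns, index=df1.index)
print(df2)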

Find number of consecutively increasing/decreasing values in a pandas column (and fill another col with it) in an optimized way

I am trying to create a new column for a dataframe. The column I use for it is a price column. Basically what I am trying to achieve is getting the number of times that the price has increased/decreased consecutively. I need this to be rather quick because the dataframes can be quite big.
For example, the result should look like:
input = [1,2,3,2,1]
increase = [0,1,2,0,0]
decrease = [0,0,0,1,2]
You can compute the diff and apply a cumsum on the positive/negative values:
df = pd.DataFrame({'col': [1,2,3,2,1]})
s = df['col'].diff()
df['increase'] = s.gt(0).cumsum().where(s.gt(0), 0)
df['decrease'] = s.lt(0).cumsum().where(s.lt(0), 0)
Output:
   col  increase  decrease
0    1         0         0
1    2         1         0
2    3         2         0
3    2         0         1
4    1         0         2
Resetting the count
As I realize your example is ambiguous, here is an additional method in case you want to reset the counts for each increasing/decreasing group, using groupby.
The resetting counts are labeled inc2/dec2:
df = pd.DataFrame({'col': [1,2,3,2,1,2,3,1]})
s = df['col'].diff()
s1 = s.gt(0)
s2 = s.lt(0)
df['inc'] = s1.cumsum().where(s1, 0)
df['dec'] = s2.cumsum().where(s2, 0)
si = df['inc'].eq(0)
sd = df['dec'].eq(0)
df['inc2'] = si.groupby(si.cumsum()).cumcount()
df['dec2'] = sd.groupby(sd.cumsum()).cumcount()
Output:
   col  inc  dec  inc2  dec2
0    1    0    0     0     0
1    2    1    0     1     0
2    3    2    0     2     0
3    2    0    1     0     1
4    1    0    2     0     2
5    2    3    0     1     0
6    3    4    0     2     0
7    1    0    3     0     1
import pandas as pd

data = {
    'input': [1, 2, 3, 2, 1]
}
df = pd.DataFrame(data)
diffs = df['input'].diff()
df['a'] = (df['input'] > df['input'].shift(periods=1, axis=0)).cumsum() - (df['input'] > df['input'].shift(periods=1, axis=0)).astype(int).cumsum() \
    .where(~(df['input'] > df['input'].shift(periods=1, axis=0))) \
    .ffill().fillna(0).astype(int)
df['b'] = (df['input'] < df['input'].shift(periods=1, axis=0)).cumsum() - (df['input'] < df['input'].shift(periods=1, axis=0)).astype(int).cumsum() \
    .where(~(df['input'] < df['input'].shift(periods=1, axis=0))) \
    .ffill().fillna(0).astype(int)
print(df)
output
   input  a  b
0      1  0  0
1      2  1  0
2      3  2  0
3      2  0  1
4      1  0  2
Coding this manually using numpy might look like this:
import numpy as np

input = np.array([1, 2, 3, 2, 1])
increase = np.zeros(len(input), dtype=int)  # int dtype so the result prints as integers
decrease = np.zeros(len(input), dtype=int)

for i in range(1, len(input)):
    if input[i] > input[i-1]:
        increase[i] = increase[i-1] + 1
        decrease[i] = 0
    elif input[i] < input[i-1]:
        increase[i] = 0
        decrease[i] = decrease[i-1] + 1
    else:
        increase[i] = 0
        decrease[i] = 0

increase  # array([0, 1, 2, 0, 0])
decrease  # array([0, 0, 0, 1, 2])
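If the counts should end up as DataFrame columns, as the question's title suggests, one way is to wrap the loop above in a small helper and assign the results back. This is only a sketch built on the manual NumPy version, with a hypothetical column name 'col':
import numpy as np
import pandas as pd

def consecutive_counts(values):
    # Return (increase, decrease) consecutive run lengths for a 1-D sequence
    values = np.asarray(values)
    increase = np.zeros(len(values), dtype=int)
    decrease = np.zeros(len(values), dtype=int)
    for i in range(1, len(values)):
        if values[i] > values[i - 1]:
            increase[i] = increase[i - 1] + 1
        elif values[i] < values[i - 1]:
            decrease[i] = decrease[i - 1] + 1
    return increase, decrease

df = pd.DataFrame({'col': [1, 2, 3, 2, 1]})
df['increase'], df['decrease'] = consecutive_counts(df['col'])
print(df)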

How to set ranges of rows in pandas?

I have the following working code that sets "new_col" to 1 at the locations covered by the intervals dictated by starts and ends.
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": np.arange(10)})
starts = [1, 5, 8]
ends = [1, 6, 10]
value = 1

df["new_col"] = 0
for s, e in zip(starts, ends):
    df.loc[s:e, "new_col"] = value
print(df)
   a  new_col
0  0        0
1  1        1
2  2        0
3  3        0
4  4        0
5  5        1
6  6        1
7  7        0
8  8        1
9  9        1
I want these intervals to come from another dataframe pointer_df.
How can I vectorize this?
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
Attempt:
df.loc[pointer_df["starts"]:pointer_df["ends"], "new_col"] = 2
print(df)
obviously doesn't work and gives
raise AssertionError("Start slice bound is non-scalar")
AssertionError: Start slice bound is non-scalar
EDIT:
It seems all answers use some kind of Python for loop, but the question was how to vectorize the operation above. Is this not doable without for loops/list comprehensions?
You could do:
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
rang = np.arange(len(df))
indices = [i for s, e in pointer_df.to_numpy() for i in rang[slice(s, e + 1, None)]]
df.loc[indices, 'new_col'] = value
print(df)
Output
   a  new_col
0  0        0
1  1        1
2  2        0
3  3        0
4  4        0
5  5        1
6  6        1
7  7        0
8  8        1
9  9        1
If you want a method that does not use any for loop or list comprehension and relies only on numpy, you could do:
def indices(start, end, ma=10):
    limits = end + 1
    lens = np.where(limits < ma, limits, end) - start
    np.cumsum(lens, out=lens)
    i = np.ones(lens[-1], dtype=int)
    i[0] = start[0]
    i[lens[:-1]] += start[1:]
    i[lens[:-1]] -= limits[:-1]
    np.cumsum(i, out=i)
    return i

pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
df.loc[indices(pointer_df.starts.values, pointer_df.ends.values, ma=len(df)), "new_col"] = value
print(df)
I adapted the method to your use case from the one in this answer.
for i, j in zip(pointer_df["starts"], pointer_df["ends"]):
    print(i, j)
Then apply the same method as in your original loop, but with the start/end pairs taken from pointer_df.
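Spelled out, that suggestion amounts to the same loop as in the question, just fed from pointer_df rather than from the plain lists (a small sketch, not a vectorized solution):
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10)})
pointer_df = pd.DataFrame({"starts": [1, 5, 8], "ends": [1, 6, 10]})
value = 1

df["new_col"] = 0
# Same label-based slicing as the original loop, driven by pointer_df
for s, e in zip(pointer_df["starts"], pointer_df["ends"]):
    df.loc[s:e, "new_col"] = value
print(df)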

How to print ones and zeros in columns with their indexes in Python?

I have a list of zeros and ones, and I want to print them in two different columns with headings and index numbers, something like this:
list = [1,0,1,1,1,0,1,0,1,0,0]
ones zeros
1 1 2 0
3 1 6 0
4 1 8 0
5 1 10 0
7 1 11 0
9 1
This is the desired output.
I tried this:
list = [1,0,1,1,1,0,1,0,1,0,0]
print('ones',end='\t')
print('zeros')
for index,ele in enumerate(list,start=1):
    if ele==1:
        print(index,ele,end=" ")
    elif ele==0:
        print(" ")
        print(index,ele,end=" ")
    else:
        print()
But this gives output like this:
ones zeros
1 1
2 0 3 1 4 1 5 1
6 0 7 1
8 0 9 1
10 0
11 0
How do I get the desired output?
Any help is appreciated.
You can use itertools.zip_longest, str.ljust, f-strings (for formatting), and some calculations for the printing part, and use two lists to hold the indices of both zeros and ones:
l = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]
ones, zeros = [], []
max_len_zeros = max_len_ones = 0
for index, num in enumerate(l, 1):
    if num == 0:
        zeros.append(index)
        max_len_zeros = max(max_len_zeros, len(str(index)))
    else:
        ones.append(index)
        max_len_ones = max(max_len_ones, len(str(index)))

from itertools import zip_longest

print('ones' + ' ' * (max_len_ones + 2) + 'zeros')
for ones_index, zeros_index in zip_longest(ones, zeros, fillvalue=''):
    one = '1' if ones_index else ' '
    this_one_index = str(ones_index).ljust(max_len_ones)
    zero = '0' if zeros_index else ''
    this_zero_index = str(zeros_index).ljust(max_len_zeros)
    print(f'{this_one_index} {one} {this_zero_index} {zero}')
Output:
ones zeros
1 1 2 0
3 1 6 0
4 1 8 0
5 1 10 0
7 1 11 0
9 1
List with more zeros than ones:
In: l = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]
Out:
ones zeros
1 1 2 0
4 1 3 0
7 1 5 0
9 1 6 0
10 1 8 0
14 1 11 0
12 0
13 0
15 0
List with equal number of zeros and ones:
In: l = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
Out:
ones zeros
1 1 2 0
3 1 4 0
5 1 6 0
8 1 7 0
9 1 10 0
11 1 13 0
12 1 14 0
15 1 16 0
18 1 17 0
20 1 19 0
It's hard to do what you need in a purely iterative way. I have a kind of "broken" solution that shows both how you could better approach what you are trying to do and why an iterative approach is limited in this case.
I updated your code as follows:
list = [1,0,1,1,1,0,1,0,1,0,0]
print('ones',end='\t')
print('zeros')
for index,ele in enumerate(list,start=1):
    # First check if extra space OR new lines OR both are needed
    if index > 1:
        if ele==1:
            print()
        elif ele==0:
            if list[index-2]==1:
                print('', end=' \t')
            else:
                print('', end='\n\t\t')
    # THEN, write your desired output without any end
    if ele==1:
        print(index,ele,end="")
    elif ele==0:
        print(index,ele,end="")
# Finally an empty line
print()
It gives the following output:
ones zeros
1 1 2 0
3 1
4 1
5 1 6 0
7 1 8 0
9 1 10 0
11 0
As you can see, its limitation is that you can't go "up" and rewrite earlier lines.
However, if you need to display EXACTLY what you've shown, you need to construct an intermediate data structure (for example a dict) and then display it using zip.
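For illustration, here is a minimal sketch of that idea: collect the indices first, then print the two columns side by side with itertools.zip_longest (the tab-based spacing is an assumption, not the exact layout asked for):
from itertools import zip_longest

data = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]

# Intermediate structure: one list of indices per column
ones = [i for i, v in enumerate(data, start=1) if v == 1]
zeros = [i for i, v in enumerate(data, start=1) if v == 0]

print('ones\tzeros')
for one_idx, zero_idx in zip_longest(ones, zeros, fillvalue=None):
    left = f'{one_idx} 1' if one_idx is not None else ''
    right = f'{zero_idx} 0' if zero_idx is not None else ''
    print(f'{left}\t{right}')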

Return rows based off the most recent increase in value from other columns python

This question was a little hard to title succinctly.
I have a pandas df that contains integers and a relevant key column. When a value is present in the key column, I want to return the most recent increase in integers from the other columns.
For the df below, the key column is ['Area']. When X is in ['Area'], I want to find the most recent increase in integers from columns ['ST_A','PG_A','ST_B','PG_B'].
import pandas as pd

d = ({
    'ST_A' : [0,0,0,0,0,1,1,1,1],
    'PG_A' : [0,0,0,1,1,1,2,2,2],
    'ST_B' : [0,1,1,1,1,1,1,1,1],
    'PG_B' : [0,0,0,0,0,0,0,1,1],
    'Area' : ['','','X','','X','','','','X'],
})
df = pd.DataFrame(data=d)
Output:
   ST_A  PG_A  ST_B  PG_B Area
0     0     0     0     0
1     0     0     1     0
2     0     0     1     0    X
3     0     1     1     0
4     0     1     1     0    X
5     1     1     1     0
6     1     2     1     0
7     1     2     1     1
8     1     2     1     1    X
I tried to use df = df.loc[(df['Area'] == 'X')] but this returns the rows where X is situated. I need something that uses X to return the most recent row where there was an increase in Columns ['ST_A','PG_A','ST_B','PG_B'].
I have also tried:
cols = ['ST_A','PG_A','ST_B','PG_B']
df[cols] = df[cols].diff()
df = df.fillna(0.)
df = df.loc[(df[cols] == 1).any(axis=1)]
This returns all rows where there was an increase in Columns ['ST_A','PG_A','ST_B','PG_B']. Not the most recent increase before X in ['Area'].
Intended Output:
   ST_A  PG_A  ST_B  PG_B Area
1     0     0     1     0
3     0     1     1     0
7     1     2     1     1
Does this question make sense or do I need to simplify it?
I believe you can use NumPy here via np.searchsorted:
import numpy as np

# Row positions where any of the value columns increased relative to the previous row
increases = np.where(df.iloc[:, :-1].diff().gt(0).max(1))[0]
# Row positions where Area is 'X'
marks = np.where(df['Area'].eq('X'))[0]
# For each mark, pick the latest increase that happened before it
idx = increases[np.searchsorted(increases, marks) - 1]
res = df.iloc[idx]
print(res)
print(res)
   ST_A  PG_A  ST_B  PG_B Area
1     0     0     1     0
3     0     1     1     0
7     1     2     1     1
Not efficient, though it works: a big chunk of code that is kind of slow:
indexes = np.where(df['Area']=='X')[0].tolist()
indexes2 = list(map((1).__add__, np.where(df[df.columns[:-1]].sum(axis=1) < df[df.columns[:-1]].shift(-1).sum(axis=1).sort_index())[0].tolist()))
l = []
for i in indexes:
    if min(indexes2, key=lambda x: abs(x-i)) in l:
        l.append(min(indexes2, key=lambda x: abs(x-i)) - 2)
    else:
        l.append(min(indexes2, key=lambda x: abs(x-i)))
print(df.iloc[l].sort_index())
Output:
  Area  PG_A  PG_B  ST_A  ST_B
1          0     0     0     1
3          1     0     0     1
7          2     1     1     1
