How to set ranges of rows in pandas? - python

I have the following working code, which sets "new_col" to 1 at the locations given by the intervals defined by starts and ends.
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": np.arange(10)})
starts = [1, 5, 8]
ends = [1, 6, 10]
value = 1
df["new_col"] = 0
for s, e in zip(starts, ends):
    df.loc[s:e, "new_col"] = value
print(df)
   a  new_col
0  0        0
1  1        1
2  2        0
3  3        0
4  4        0
5  5        1
6  6        1
7  7        0
8  8        1
9  9        1
I want these intervals to come from another dataframe pointer_df.
How to vectorize this?
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
Attempt:
df.loc[pointer_df["starts"]:pointer_df["ends"], "new_col"] = 2
print(df)
obviously doesn't work and gives
raise AssertionError("Start slice bound is non-scalar")
AssertionError: Start slice bound is non-scalar
EDIT:
It seems all the answers so far use some kind of Python-level for loop.
The question was: how to vectorize the operation above?
Is this not doable without for loops/list comprehensions?

You could do:
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
rang = np.arange(len(df))
indices = [i for s, e in pointer_df.to_numpy() for i in rang[s:e + 1]]
df.loc[indices, 'new_col'] = value
print(df)
Output
   a  new_col
0  0        0
1  1        1
2  2        0
3  3        0
4  4        0
5  5        1
6  6        1
7  7        0
8  8        1
9  9        1
If you want a method that does not use any for loop or list comprehension and relies only on NumPy, you could do:
def indices(start, end, ma=10):
    # exclusive upper bounds; intervals reaching past the frame are clipped
    limits = end + 1
    lens = np.where(limits < ma, limits, end) - start
    # running total of interval lengths marks the boundary between intervals
    np.cumsum(lens, out=lens)
    # step array: 1 inside an interval, a jump to the next start at each boundary
    i = np.ones(lens[-1], dtype=int)
    i[0] = start[0]
    i[lens[:-1]] += start[1:]
    i[lens[:-1]] -= limits[:-1]
    # the cumulative sum of the steps enumerates every covered index
    np.cumsum(i, out=i)
    return i
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
df.loc[indices(pointer_df.starts.values, pointer_df.ends.values, ma=len(df)), "new_col"] = value
print(df)
I adapted the method to your use case from the one in this answer.
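Another fully loop-free possibility (a sketch of the classic difference-array technique, assuming the integer positional index used above; not part of the original answer): mark +1 at each start and -1 just past each end, then take a cumulative sum; every position with a positive running total lies inside some interval.
mark = np.zeros(len(df) + 1, dtype=int)
np.add.at(mark, pointer_df["starts"].to_numpy(), 1)
np.add.at(mark, np.minimum(pointer_df["ends"].to_numpy() + 1, len(df)), -1)
df["new_col"] = np.where(np.cumsum(mark[:-1]) > 0, value, 0)
Using np.add.at rather than plain indexing means duplicate or overlapping boundaries are handled correctly.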

for i, j in zip(pointer_df["starts"], pointer_df["ends"]):
    print(i, j)
Then apply the same method as in your original loop, but with the bounds taken from pointer_df, as sketched below.
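That is (a minimal sketch, reusing the assignment from the question):
for s, e in zip(pointer_df["starts"], pointer_df["ends"]):
    df.loc[s:e, "new_col"] = value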

Related

Combining looping and conditional to make new columns on dataframe

I want to make a function with a loop and a conditional that counts only when Actual Result = 1, so the number always increases by 1 whenever the Actual Result = 1.
This is my dataframe:
This is my code, but it doesn't produce the result that I want:
def func_count(x):
    for i in range(1, 880):
        if x['Actual Result'] == 1:
            result = i
        else:
            result = '-'
        return result
X_machine_learning['Count'] = X_machine_learning.apply(lambda x: func_count(x), axis=1)
When I check and filter with count != '-', the result will be like this:
The number always equals 1 and does not increase by 1 every time the actual result = 1. Any solution?
Try something like this:
import pandas as pd
df = pd.DataFrame({
    'age': [30, 25, 40, 12, 16, 17, 14, 50, 22, 10],
    'actual_result': [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
})
count = 0
lst_count = []
for i in range(len(df)):
    if df['actual_result'][i] == 1:
        count += 1
        lst_count.append(count)
    else:
        lst_count.append('-')
df['count'] = lst_count
print(df)
Result
   age  actual_result count
0   30              0     -
1   25              1     1
2   40              1     2
3   12              1     3
4   16              0     -
5   17              0     -
6   14              1     4
7   50              1     5
8   22              1     6
9   10              0     -
Actually, you don't need to loop over the dataframe at all; that is generally a pandas anti-pattern and should be avoided. With df your dataframe, you could try the following instead:
m = df["Actual Result"] == 1
df["Count"] = m.cumsum().where(m, "-")  # running count of True rows; '-' everywhere else
Result for the following dataframe
df = pd.DataFrame({"Actual Result": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})
is
   Actual Result Count
0              1     1
1              1     2
2              0     -
3              1     3
4              1     4
5              1     5
6              0     -
7              0     -
8              1     6
9              0     -

pandas for loop for running average does not work

I tried to make a kind of running average: out of 90 rows, every 3 rows in column A should be averaged, and that average should fill the corresponding rows in column B.
For example:
From this:
A  B
2  0
3  0
4  0
7  0
9  0
8  0
to this:
A  B
2  3
3  3
4  3
7  8
9  8
8  8
I tried running this code:
x = 0
for i in df['A']:
    if x < 90:
        y = (df['A'][x] + df['A'][x + 1] + df['A'][x + 2]) / 3
        df['B'][x] = y
        df['B'][x + 1] = y
        df['B'][x + 2] = y
        x = x + 3
        print(y)
It does print the correct y, but it does not change B.
I know there is a better way to do it, and if anyone knows one, it would be great if they shared it. But the more important thing for me is to understand why what I wrote has no effect on the df.
You could group by the index divided by 3, then use transform to compute the mean of those values and assign to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
A B
0 2 3
1 3 3
2 4 3
3 7 8
4 9 8
5 8 8
Note that this relies on the index being of the form 0,1,2,3,4,.... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead, i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
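As for why your original loop prints the correct y but never changes B: df['B'][x] = y is chained indexing, so the assignment can land on a temporary copy of the column rather than on df itself (pandas flags this with a SettingWithCopyWarning, and under copy-on-write semantics it never propagates). Writing through a single indexer avoids the problem:
df.loc[x, 'B'] = y  # one indexing operation, writes directly into df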
i = 0
batch_size = 3
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8, 9, 10], 'B': [-1] * 8})
while i < len(df):
    j = min(i + batch_size - 1, len(df) - 1)
    avg = sum(df.loc[i:j, 'A']) / (j - i + 1)
    df.loc[i:j, 'B'] = [avg] * (j - i + 1)
    i += batch_size
df
For the corner case when len(df) % batch_size != 0, this assumes we take the average of just the leftover rows (e.g. with 8 rows and batch_size 3, the final pair averages only those two values).

Checking Value from Specific Column of dataframe and updating values from an array to Column 2

I have a dataframe with two columns, Column_A and Column_B, and an array of letters from A to P, which are as follows:
df = pd.DataFrame({
    'Column_A': [0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
    'Column_B': []
})
The array is as follows:
label = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P']
Expected output is
'A':[0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
'B':['A','A','A','A','A','E','E','E','E','E','I','I','I','I','I','M']
The value in Column_B changes as soon as the value in Column_A is 1, and the new value is taken from the given label array.
I have tried using this for loop:
for row in df.index:
    try:
        if df.loc[row, 'Column_A'] == 1:
            df.at[row, 'Column_B'] = label[row + 4]
            print(label[row])
        else:
            df.ColumnB.fillna('ffill')
    except IndexError:
        row = (row + 4) % 4
        df.at[row, 'Coumn_B'] = label[row]
I also want to loop back to the start if it reaches the last value in the label array.
A solution that should do the trick looks like this:
label = list('ABCDEFGHIJKLMNOP')
df = pd.DataFrame({
    'Column_A': [0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
    'Column_B': label
})
I'm not exactly sure what you intended with the fillna; I think you don't need it.
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row in df.index:
    if df.loc[row, 'Column_A'] == 1:
        lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
        df.at[row, 'Column_B'] = label[lookup]
        print(label[row])
I also avoid the exception handling in this case, because the index overflow can be handled without it; see the sketch below.
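With the fixed step of 4 used here, the wrap-around can also be written as a single modulo (a small sketch, assuming that step size):
lookup = (lookup + 4) % max_index  # wraps back to the start once past the end of label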
Btw., if you have a large dataframe, you can probably make the code faster by eliminating one lookup (though you'd need to verify whether it really runs faster). The solution would then look like this:
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row, record in df.iterrows():
    if record['Column_A'] == 1:
        lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
        df.at[row, 'Column_B'] = label[lookup]
        print(label[row])
Option 1
cond1 = df.Column_A == 1    # rows where a new label starts
cond2 = df.index == 0       # seed the very first row
mappr = lambda x: label[x]  # row position -> letter
df.assign(Column_B=np.where(cond1 | cond2, df.index.map(mappr), np.nan)).ffill()
    Column_A Column_B
0          0        A
1          0        A
2          0        A
3          0        A
4          0        A
5          1        F
6          0        F
7          0        F
8          0        F
9          0        F
10         1        K
11         0        K
12         0        K
13         0        K
14         0        K
15         1        P
Option 2
a = np.append(0, np.flatnonzero(df.Column_A))  # row positions where each run starts
b = df.Column_A.to_numpy().cumsum()            # run number for every row
c = np.array(label)
df.assign(Column_B=c[a[b]])                    # label at each row's run start
    Column_A Column_B
0          0        A
1          0        A
2          0        A
3          0        A
4          0        A
5          1        F
6          0        F
7          0        F
8          0        F
9          0        F
10         1        K
11         0        K
12         0        K
13         0        K
14         0        K
15         1        P
Using groupby with transform, then map (unpacked step by step after the output):
df.reset_index().groupby(df.Column_A.eq(1).cumsum())['index'].transform('first').map(dict(enumerate(label)))
Out[139]:
0 A
1 A
2 A
3 A
4 A
5 F
6 F
7 F
8 F
9 F
10 K
11 K
12 K
13 K
14 K
15 P
Name: index, dtype: object
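The same pipeline split into steps, for readability (a sketch; run_id and first_pos are names introduced here):
run_id = df.Column_A.eq(1).cumsum()  # a new group starts at every 1
first_pos = df.reset_index().groupby(run_id)['index'].transform('first')
first_pos.map(dict(enumerate(label)))  # each group's starting row position -> letter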

Python DataFrame Accumulator Based on Flag

I have a logic-driven flag column and I need to create a column that increments by 1 when the flag is true and decrements by 1 when the flag is false down to a floor of zero.
I've tried a few different methods and I can't get the Accumulator 'shift' to reference the new value created by the process. I know the method below wouldn't stop at zero anyway, but I was just trying to work through the concept before and this is the most to-the-point example to explain the goal. Do I need a for loop to iterate line-by-line?
df = pd.DataFrame(data=np.random.randint(2,size=10), columns=['flag'])
df['accum'] = 0
df['accum'] = np.where(df['flag'] == 1, df['accum'].shift(1) + 1, df['accum'].shift(1) - 1)
df['dOutput'] = [1,0,1,2,1,2,3,2,1,0] #desired output
df
As far as I know, there's no NumPy or pandas vectorized operation to do this, so you should iterate line by line:
def cumsum_with_floor(series):
    acc = 0
    output = []
    accum_list = []
    for val in series:
        val = 1 if val else -1
        acc += val
        accum_list.append(val)
        acc = acc if acc > 0 else 0
        output.append(acc)
    return pd.Series(output, index=series.index), pd.Series(accum_list, index=series.index)
series = pd.Series([1,0,1,1,0,0,0,1])
dOutput, accum = cumsum_with_floor(series)
dOutput
Out:
0 1
1 0
2 1
3 2
4 1
5 0
6 0
7 1
dtype: int64
accum  # shifted by one step forward compared with your example
Out:
0 1
1 -1
2 1
3 1
4 -1
5 -1
6 -1
7 1
dtype: int64
But maybe there's somebody who knows a suitable combination of pd.clip and pd.cumsum, or other vectorized operations.
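Such a combination does exist, in fact: flooring a running sum at zero is the same as subtracting the clipped running minimum from the unfloored cumulative sum. A minimal sketch, assuming the same +1/-1 encoding as above:
steps = series.replace({0: -1})         # +1 where the flag is set, -1 otherwise
s = steps.cumsum()                      # unfloored running sum
dOutput = s - s.cummin().clip(upper=0)  # reflect the walk at the zero floor
For the example series [1,0,1,1,0,0,0,1] this reproduces the dOutput above, [1,0,1,2,1,0,0,1].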

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination[1] to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like), then & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
[1] See the pandas cookbook, the section on grouping: "Grouping like Python's itertools.groupby".
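Broken into named steps (a sketch; run_id and pos_in_run are names introduced here):
run_id = (df.A != df.A.shift()).cumsum()    # a new id every time the value changes
pos_in_run = df.groupby(run_id).cumcount()  # 0, 1, 2, ... within each run
df['B'] = ((pos_in_run <= 1) & df.A.astype(bool)).astype(int)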
Another way (checking whether the previous two values are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object; it can just be a normal NumPy array):
import numpy
a = df['A'].to_numpy(copy=True)
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
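Applied to the example column, usage would look like this (a small sketch):
df['B'] = trim_runs(df['A'].to_numpy(copy=True), 2)  # copy=True so column A itself is not zeroed in place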
