how to merge rows summing over each column without iteration - python

I have the following df:
  prevent  _p  _n    _id
0       1   0   0  83135
0       0   1   0  83135
0       0   1   0  82238
I would like to merge all rows having the same _id column by summing over each column, to get the desired output in a dataframe final (please note that if the sum is greater than 1, the value should just be 1):
  prevent  _p  _n    _id
0       1   1   0  83135
0       0   1   0  82238
I can easily do this using the following code iterating over the dataframe:
final = pd.DataFrame()
for id_ in _ids:
    out = df[df._id == id_]
    prevent = 0
    _p = 0
    _n = 0
    d = {}
    if len(out) > 0:
        for row in out.itertuples():
            if prevent == 0:
                prevent += row.prevent
            if _p == 0:
                _p += row._p
            if _n == 0:
                _n += row._n
        d['_p'] = _p
        d['_n'] = _n
        d['prevent'] = prevent
        t = pd.DataFrame([d])
        t['_id'] = id_
        final = pd.concat([final, t])
I have several hundred thousand rows, so this will be very inefficient. Is there a way to vectorize this?

Treat 0 and 1 as boolean with any, then convert them back to integers:
df.groupby("_id").any().astype("int").reset_index()
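As a quick check (a minimal sketch; the df construction below just reproduces the sample data from the question):

import pandas as pd

df = pd.DataFrame({
    'prevent': [1, 0, 0],
    '_p':      [0, 1, 1],
    '_n':      [0, 0, 0],
    '_id':     [83135, 83135, 82238],
})

final = df.groupby("_id").any().astype("int").reset_index()
print(final)
#      _id  prevent  _p  _n
# 0  82238        0   1   0
# 1  83135        1   1   0

Note that _id moves to the front after reset_index; the values are already capped at 1 because any() only reports whether a nonzero value was present in the group.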

Check groupby
out = df.groupby('_id',as_index=False).sum()
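One caveat (my note, not part of this answer): a plain sum can exceed 1 when an _id has several rows with the same flag set, while the question wants the values capped at 1. Clipping the value columns afterwards is one way to handle that; the column names here are taken from the sample data:

out = df.groupby('_id', as_index=False).sum()
out[['prevent', '_p', '_n']] = out[['prevent', '_p', '_n']].clip(upper=1)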

Related

Multiple conditional statements on list comprehension

So this is my code, and I want to know if I can use a list comprehension to execute the same operation (count the clusters across rows and output a list of length df.shape[0]). There are at least two rows for the same cluster number, but there can be more, and the numbering cycles. I tried but couldn't figure it out.
Any suggestions?
My code:
import pandas as pd

cluster_global = 0
cluster_relativo = 0
cluster_index = []

for index, row in df.iterrows():
    if row['cluster'] == cluster_relativo:
        cluster_index.append(cluster_global)
    elif row['cluster'] == (cluster_relativo + 1):
        cluster_global += 1
        cluster_relativo += 1
        cluster_index.append(cluster_global)
    elif row['cluster'] == 0:
        cluster_global += 1
        cluster_relativo = 0
        cluster_index.append(cluster_global)
The DataFrame looks like:
index  cluster
0      0
1      0
2      1
3      1
4      1
5      2
6      2
7      0
8      0
...    ...
n      m<40
Do you want this?
from itertools import groupby

result = [0 if index == 0 and key == 0
          else index
          for index, (key, group) in enumerate(groupby(my_values))
          for _ in group]
print(result)
Replace my_values in the list comprehension with df['cluster'].values to test.
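For reference, a minimal self-contained run on the sample column from the question (the df construction here is assumed):

import pandas as pd
from itertools import groupby

df = pd.DataFrame({'cluster': [0, 0, 1, 1, 1, 2, 2, 0, 0]})

my_values = df['cluster'].values
result = [0 if index == 0 and key == 0
          else index
          for index, (key, group) in enumerate(groupby(my_values))
          for _ in group]
print(result)  # [0, 0, 1, 1, 1, 2, 2, 3, 3]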

Vectorized function with counter on pandas dataframe column

Consider this pandas dataframe where the condition column is 1 when value is below 5 (any threshold).
import pandas as pd
d = {'value': [30,100,4,0,80,0,1,4,70,70],'condition':[0,0,1,1,0,1,1,1,0,0]}
df = pd.DataFrame(data=d)
df
Out[1]:
value condition
0 30 0
1 100 0
2 4 1
3 0 1
4 80 0
5 0 1
6 1 1
7 4 1
8 70 0
9 70 0
What I want is for all consecutive values below 5 to have the same id, and for all values above five to have 0 (or NA, or a negative value; it doesn't matter, they just need to be the same). I want to create a new column called new_id that contains these cumulative ids as follows:
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
In a very inefficient for loop I would do this (which works):
counter = 1  # implied initialisation: the expected output starts its ids at 1
for i in range(0, df.shape[0]):
    if (df.loc[df.index[i], 'condition'] == 1) & (df.loc[df.index[i-1], 'condition'] == 0):
        new_id = counter  # assign new id
        counter += 1
    elif (df.loc[df.index[i], 'condition'] == 1) & (df.loc[df.index[i-1], 'condition'] != 0):
        new_id = counter - 1  # assign current id
    elif (df.loc[df.index[i], 'condition'] == 0):
        new_id = df.loc[df.index[i], 'condition']  # assign 0
    df.loc[df.index[i], 'new_id'] = new_id
df
But this is very inefficient, and I have a very big dataset. Therefore I tried different kinds of vectorization, but so far I have failed to keep it from counting up inside each "cluster" of consecutive points:
# First try using cumsum():
df['new_id'] = 0
df['new_id_temp'] = ((df['condition'] == 1)).astype(int).cumsum()
df.loc[(df['condition'] == 1), 'new_id'] = df['new_id_temp']
df[['value', 'condition', 'new_id']]
# Another try using list comprehension but this just does +1:
[row+1 for ind, row in enumerate(df['condition']) if (row != row-1)]
I also tried using apply() with a custom if else function but it seems like this does not allow me to use a counter.
There are already a ton of similar posts about this, but none of them keeps the same id for consecutive rows.
Example posts are:
Maintain count in python list comprehension
Pandas cumsum on a separate column condition
Python - keeping counter inside list comprehension
python pandas conditional cumulative sum
Conditional count of cumulative sum Dataframe - Loop through columns
You can use cumsum(), as you did in your first try; just modify it a bit:
# calculate delta
df['delta'] = df['condition']-df['condition'].shift(1)
# get rid of -1 for the cumsum (replace it by 0)
df['delta'] = df['delta'].replace(-1,0)
# cumulative sum conditional: multiply with condition column
df['cumsum_x'] = df['delta'].cumsum()*df['condition']
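Running this on the example frame (a quick check on my side, not part of the original answer), cumsum_x matches the desired new_id except that the very first value is NaN from the shift; filling the delta with 0 first avoids that:

df['delta'] = (df['condition'] - df['condition'].shift(1)).fillna(0).replace(-1, 0)
df['new_id'] = (df['delta'].cumsum() * df['condition']).astype(int)
print(df[['value', 'condition', 'new_id']])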
Welcome to SO! Why not just rely on base Python for this?
def counter_func(l):
    new_id = [0]  # First value is zero in any case
    counter = 0
    for i in range(1, len(l)):
        if l[i] == 0:
            new_id.append(0)
        elif l[i] == 1 and l[i-1] == 0:
            counter += 1
            new_id.append(counter)
        elif l[i] == l[i-1] == 1:
            new_id.append(counter)
        else:
            new_id.append(None)
    return new_id

df["new_id"] = counter_func(df["condition"])
Looks like this
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
Edit:
You can also use numba, which sped up the function quite a lot for me: from about 1 s to ~60 ms.
You should pass numpy arrays into the function to use it, meaning you'll have to use df["condition"].values.
from numba import njit
import numpy as np

@njit
def func(arr):
    res = np.empty(arr.shape[0])
    counter = 0
    res[0] = 0  # First value is zero anyway
    for i in range(1, arr.shape[0]):
        if arr[i] == 0:
            res[i] = 0
        elif arr[i] and arr[i-1] == 0:
            counter += 1
            res[i] = counter
        elif arr[i] == arr[i-1] == 1:
            res[i] = counter
        else:
            res[i] = np.nan
    return res

df["new_id"] = func(df["condition"].values)

Looping over a pandas column and creating a new column if it meets conditions

I have a pandas dataframe and I want to loop over the last column "n" times based on a condition.
import random as random
import pandas as pd

p = 0.5
df = pd.DataFrame()
start = []
for i in range(5):
    if random.random() < p:
        start.append("0")
    else:
        start.append("1")
df['start'] = start
print(df['start'])
Essentially, I want to loop over the final column "n" times, and if a value is 0, change it to 1 with probability p, so that the results become the new final column (I am simulating on-off switching at every time unit with probability p).
e.g. after one iteration, the dataframe would look something like:
0 0
0 1
1 1
0 0
0 1
after two:
0 0 1
0 1 1
1 1 1
0 0 0
0 1 1
What is the best way to do this?
Sorry if I am asking this wrong, I have been trying to google for a solution for hours and coming up empty.
Like this. Append columns named 1, 2, ...
# continue from question code ...
# colname is 1, 2, ...
for col in range(1, 5):
    tmp = []
    for i in range(5):
        # check final col
        if df.iloc[i, col-1:col][0] == "0":
            if random.random() < p:
                tmp.append("0")
            else:
                tmp.append("1")
        else:  # == "1"
            tmp.append("1")
    # append new col
    df[str(col)] = tmp
print(df)
print(df)
# initial
s
0 0
1 1
2 0
3 0
4 0
# result
s 1 2 3 4
0 0 0 1 1 1
1 0 0 0 0 1
2 0 0 1 1 1
3 1 1 1 1 1
4 0 0 0 0 0
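If the row count gets large, the same simulation can be done without the inner Python loop by working on whole columns at once. This is only a sketch of the idea (NumPy-based, not part of the answer above; it assumes df['start'] holds the strings "0"/"1" as in the question):

import numpy as np

rng = np.random.default_rng()
n = 4  # number of extra columns to simulate

cur = (df['start'] == "1").to_numpy()
for col in range(1, n + 1):
    # a "0" switches to "1" with probability p; a "1" stays "1"
    cur = cur | (rng.random(len(df)) < p)
    df[str(col)] = np.where(cur, "1", "0")

This still loops over the n columns, but each step is vectorized over all rows.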

find start end index of bouts of consecutive equal values

Given a dataframe df:
import pandas
df = pandas.DataFrame(data=[1,0,0,1,1,1,0,1,0,1,1,1], columns=['A'])
df
Out[20]:
A
0 1
1 0
2 0
3 1
4 1
5 1
6 0
7 1
8 0
9 1
10 1
11 1
I would like to find the start and end index of each interval of ones of length at least 3.
In this case what I expect is
(3, 5) and (9, 11)
Use the shifting cumsum trick to mark consecutive groups, then use groupby to get indices and filter by your conditions.
v = (df['A'] != df['A'].shift()).cumsum()
u = df.groupby(v)['A'].agg(['all', 'count'])
m = u['all'] & u['count'].ge(3)
df.groupby(v).apply(lambda x: (x.index[0], x.index[-1]))[m]
A
3 (3, 5)
7 (9, 11)
dtype: object
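For intuition (my own quick check, not part of the original answer), v labels each run of equal values with an increasing integer, which is what makes the groupby work:

print(v.tolist())
# [1, 2, 2, 3, 3, 3, 4, 5, 6, 7, 7, 7]

Groups 3 and 7 are the runs of ones with length of at least 3.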
I don't explicitly know Pandas, but I do know Python, and took this as a small challenge:
def find_sub_in_list(my_list, sublist, greedy=True):
    matches = []
    results = []
    for item in range(len(my_list)):
        aux_list = my_list[item:]
        if len(sublist) > len(aux_list) or len(aux_list) == 0:
            break
        start_match = None
        end_pos = None
        if sublist[0] == my_list[item]:
            start_match = item
            for sub_item in range(len(sublist)):
                if sublist[sub_item] != my_list[item+sub_item]:
                    end_pos = False
            if end_pos == None and start_match != None:
                end_pos = start_match + len(sublist)
                matches.append([start_match, end_pos])
    if greedy:
        results = []
        for match in range(len(matches)-1):
            if matches[match][1] > matches[match+1][0]:
                results.append([matches[match][0], matches[match+1][1]])
            else:
                results.append(matches[match])
    else:
        results = matches
    return results
my_list = [1,1,1,0,1,1,0,1,1,1,1]
interval = 3
sublist = [1]*interval
matches = find_sub_in_list(my_list, sublist)
print(matches)

Iterating a Pandas dataframe over 'n' next rows

I have this Pandas dataframe df:
station a_d direction
a 0 0
a 0 0
a 1 0
a 0 0
a 1 0
b 0 0
b 1 0
c 0 0
c 1 0
c 0 1
c 1 1
b 0 1
b 1 1
b 0 1
b 1 1
a 0 1
a 1 1
a 0 0
a 1 0
I'd like to assign a value_id that increments when the direction value changes, and that is assigned only to the last pair of rows for each station (the pair with different [0, 1] a_d values). I can ignore the last dataframe rows (in this example, the last two). In other words:
station a_d direction id_value
a 0 0
a 0 0
a 1 0
a 0 0 0
a 1 0 0
b 0 0 0
b 1 0 0
c 0 0 0
c 1 0 0
c 0 1 1
c 1 1 1
b 0 1
b 1 1
b 0 1 1
b 1 1 1
a 0 1 1
a 1 1 1
a 0 0
a 1 0
Using df.iterrows() I wrote this script:
df['value_id'] = ""
value_id = 0
row_iterator = df.iterrows()
for i, row in row_iterator:
    if i == 0:
        continue
    elif (df.loc[i-1, 'direction'] != df.loc[i, 'direction']):
        value_id += 1
    for z in range(1, 11):
        if i+z >= len(df)-1:
            break
        elif (df.loc[i+1, 'a_d'] == df.loc[i, 'a_d']):
            break
        elif (df.loc[i+1, 'a_d'] != df.loc[i, 'a_d']) and (df.loc[i+2, 'station'] == df.loc[i, 'station'] and (df.loc[i+2, 'direction'] == df.loc[i, 'direction'])):
            break
        else:
            df.loc[i, 'value_id'] = value_id
It works, but it's very slow. With a 10*10^6-row dataframe I need a faster way. Any idea?
@user5402's code works well, but I note that adding a break after the last else also reduces computation time:
df['value_id'] = ""
value_id = 0
row_iterator = df.iterrows()
for i, row in row_iterator:
    if i == 0:
        continue
    elif (df.loc[i-1, 'direction'] != df.loc[i, 'direction']):
        value_id += 1
    for z in range(1, 11):
        if i+z >= len(df)-1:
            break
        elif (df.loc[i+1, 'a_d'] == df.loc[i, 'a_d']):
            break
        elif (df.loc[i+1, 'a_d'] != df.loc[i, 'a_d']) and (df.loc[i+2, 'station'] == df.loc[i, 'station'] and (df.loc[i+2, 'direction'] == df.loc[i, 'direction'])):
            break
        else:
            df.loc[i, 'value_id'] = value_id
            break
You are not effectively using z in the inner for loop. You never access the i+z-th row. You access the i-th row and the i+1-th row and the i+2-th row, but never the i+z-th row.
You can replace that inner for loop with:
if i+1 > len(df)-1:
    pass
elif (df.loc[i+1, 'a_d'] == df.loc[i, 'a_d']):
    pass
elif (df.loc[i+2, 'station'] == df.loc[i, 'station'] and (df.loc[i+2, 'direction'] == df.loc[i, 'direction'])):
    pass
else:
    df.loc[i, 'value_id'] = value_id
Note that I also slightly optimized the second elif because at that point you already know df.loc[i+1,'a_d'] does not equal df.loc [i,'a_d'].
Not having to loop over z will save a lot of time.
