This is my code. Can I use a list comprehension to do the same thing, i.e. number the clusters across rows and output a list of length df.shape[0]? Each cluster number spans at least two consecutive rows (possibly more), and the numbers cycle back to 0. I tried but couldn't figure it out.
Any suggestions?
My code:
import pandas as pd
cluster_global = 0
cluster_relativo = 0
cluster_index = []
for index, row in df.iterrows():
    if row['cluster'] == cluster_relativo:
        cluster_index.append(cluster_global)
    elif row['cluster'] == (cluster_relativo + 1):
        cluster_global += 1
        cluster_relativo += 1
        cluster_index.append(cluster_global)
    elif row['cluster'] == 0:
        cluster_global += 1
        cluster_relativo = 0
        cluster_index.append(cluster_global)
The DataFrame looks like:
index  cluster
0      0
1      0
2      1
3      1
4      1
5      2
6      2
7      0
8      0
...    ...
n      m<40
Do you want this?
from itertools import groupby
result = [
    0 if index == 0 and key == 0 else index
    for index, (key, group) in enumerate(groupby(my_values))
    for _ in group
]
print(result)
To test, replace my_values in the list comprehension with df['cluster'].values.
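If a pandas-only route is acceptable, the same numbering can probably be obtained without itertools by counting how often the value changes from one row to the next. This is only a sketch: the sample data below is made up to mirror the table in the question, and it assumes that every change in the cluster column starts a new global cluster.

```python
import pandas as pd

# Sample data mirroring the table in the question (an assumption, not the real df).
df = pd.DataFrame({"cluster": [0, 0, 1, 1, 1, 2, 2, 0, 0]})

# True on every row whose value differs from the previous row
# (the first row always counts as a change because it is compared with NaN).
changed = df["cluster"].ne(df["cluster"].shift())

# Running count of changes, shifted down by one so the first run gets index 0.
cluster_index = (changed.cumsum() - 1).tolist()
print(cluster_index)  # [0, 0, 1, 1, 1, 2, 2, 3, 3]
```

Unlike the original loop, this does not check that the values only ever step up by one or reset to 0; it simply starts a new index whenever the value changes.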
I have a specific issue: I want the inner loop to execute 3 times for each run of identical strings, and then have the outer for loop continue processing the rest of the list:
strings = ['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B',\
'A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B']
i=0
for string in strings:
    if string == 'A':
        while i < 3:
            print(string, i)
            i += 1
            if i == 3: continue
    elif string == 'B':
        while i < 3:
            print(string, i)
            i += 1
            if i == 3: continue
    # print(string)
Current result:
A 0
A 1
A 2
Expected result: once the inner loop completes, the outer loop should carry on and process the next run of strings:
A 0
A 1
A 2
B 0
B 1
B 2
A 0
A 1
A 2
B 0
B 1
B 2
If I understand the logic correctly, you could use itertools.groupby to help you form the groups:
variant #1
from itertools import groupby
MAX = 3
for k, g in groupby(strings):
    for i in range(min(len(list(g)), MAX)):
        print(f'{k} {i}')
    print()
variant #2
from itertools import groupby
MAX = 3
for k, g in groupby(strings):
    for i, _ in enumerate(g):
        if i >= MAX:
            break
        print(f'{k} {i}')
    print()
output:
A 0
A 1
A 2
B 0
B 1
B 2
A 0
A 1
A 2
B 0
B 1
B 2
variant #3: without import
prev = None
count = 0
MAX = 3
for s in strings:
    if s != prev:
        # a new run starts: separate it from the previous one and reset the counter
        if prev is not None:
            print()
        count = 0
    if count < MAX:
        print(f'{s} {count}')
        count += 1
    prev = s
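As a further variation on the groupby idea, itertools.islice can cap each run at MAX items without turning the group into a list first. This is a sketch under the same reading of the question; the lines list is only there so the result can be inspected, and strings is rebuilt here to keep the snippet self-contained.

```python
from itertools import groupby, islice

# Same content as the list in the question: 10 x 'A', 10 x 'B', twice.
strings = ["A"] * 10 + ["B"] * 10 + ["A"] * 10 + ["B"] * 10
MAX = 3  # maximum number of items to process per run

lines = []
for key, group in groupby(strings):
    # islice stops after MAX items; groupby skips the rest of the run
    # when the outer loop asks for the next group.
    for i, _ in enumerate(islice(group, MAX)):
        lines.append(f"{key} {i}")

print("\n".join(lines))
```

Runs shorter than MAX are handled naturally, since islice simply stops when the group is exhausted.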
I have the following df:
   prevent  _p  _n    _id
0        1   0   0  83135
0        0   1   0  83135
0        0   1   0  82238
I would like to merge all rows that share the same _id by summing each column, producing the desired output in a dataframe, final (note that if the sum is greater than 1, the value should just be 1):
   prevent  _p  _n    _id
0        1   1   0  83135
0        0   1   0  82238
I can easily do this by iterating over the dataframe with the following code:
final = pd.DataFrame()
for id_ in _ids:
    out = df[df._id == id_]
    prevent = 0
    _p = 0
    _n = 0
    d = {}
    if len(out) > 0:
        for row in out.itertuples():
            if prevent == 0:
                prevent += row.prevent
            if _p == 0:
                _p += row._p
            if _n == 0:
                _n += row._n
        d['_p'] = _p
        d['_n'] = _n
        d['prevent'] = prevent
        t = pd.DataFrame([d])
        t['_id'] = id_
        final = pd.concat([final, t])
I have several hundred thousand rows, so this will be very inefficient. Is there a way to vectorize this?
Treat 0 and 1 as boolean with any, then convert them back to integers:
df.groupby("_id").any().astype("int").reset_index()
Check groupby. A plain sum() can exceed 1, but for 0/1 columns max() is equivalent to a sum capped at 1:
out = df.groupby('_id', as_index=False).max()
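To make the groupby approach concrete, here is a small sketch run on the three sample rows from the question. The default integer index and the final column reordering are choices made for this illustration; sort=False is used so that groups appear in order of first occurrence, as in the desired output.

```python
import pandas as pd

# The three sample rows from the question (with a default index).
df = pd.DataFrame({
    "prevent": [1, 0, 0],
    "_p": [0, 1, 1],
    "_n": [0, 0, 0],
    "_id": [83135, 83135, 82238],
})

# any() is True when a group holds at least one non-zero value,
# which is the same as a sum capped at 1 for 0/1 data.
final = (
    df.groupby("_id", sort=False)
      .any()
      .astype(int)
      .reset_index()[["prevent", "_p", "_n", "_id"]]
)
print(final)
```

Because the work happens inside a single groupby call rather than a Python-level loop over ids, this should scale far better to several hundred thousand rows.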
Consider this pandas dataframe where the condition column is 1 when value is below 5 (any threshold).
import pandas as pd
d = {'value': [30,100,4,0,80,0,1,4,70,70],'condition':[0,0,1,1,0,1,1,1,0,0]}
df = pd.DataFrame(data=d)
df
Out[1]:
value condition
0 30 0
1 100 0
2 4 1
3 0 1
4 80 0
5 0 1
6 1 1
7 4 1
8 70 0
9 70 0
What I want is to have all consecutive values below 5 to have the same id and all values above five have 0 (or NA or a negative value, doesn't matter, they just need to be the same). I want to create a new column called new_id that contains these cumulative ids as follows:
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
In a very inefficient for loop I would do this (which works):
counter = 1  # next id to hand out
for i in range(0, df.shape[0]):
    if (df.loc[df.index[i], 'condition'] == 1) & (df.loc[df.index[i-1], 'condition'] == 0):
        new_id = counter  # assign new id
        counter += 1
    elif (df.loc[df.index[i], 'condition'] == 1) & (df.loc[df.index[i-1], 'condition'] != 0):
        new_id = counter - 1  # assign current id
    elif (df.loc[df.index[i], 'condition'] == 0):
        new_id = df.loc[df.index[i], 'condition']  # assign 0
    df.loc[df.index[i], 'new_id'] = new_id
df
But this is very inefficient and I have a very big dataset. I therefore tried different kinds of vectorization, but so far I have failed to keep the id from counting up inside each "cluster" of consecutive points:
# First try using cumsum():
df['new_id'] = 0
df['new_id_temp'] = ((df['condition'] == 1)).astype(int).cumsum()
df.loc[(df['condition'] == 1), 'new_id'] = df['new_id_temp']
df[['value', 'condition', 'new_id']]
# Another try using list comprehension but this just does +1:
[row+1 for ind, row in enumerate(df['condition']) if (row != row-1)]
I also tried using apply() with a custom if else function but it seems like this does not allow me to use a counter.
There is already a ton of similar posts about this but none of them keep the same id for consecutive rows.
Example posts are:
Maintain count in python list comprehension
Pandas cumsum on a separate column condition
Python - keeping counter inside list comprehension
python pandas conditional cumulative sum
Conditional count of cumulative sum Dataframe - Loop through columns
You can use cumsum(), as in your first try, with a small modification:
# calculate delta (fill_value=0 treats the row before the first one as condition 0)
df['delta'] = df['condition'] - df['condition'].shift(1, fill_value=0)
# get rid of -1 for the cumsum (replace it by 0)
df['delta'] = df['delta'].replace(-1, 0)
# conditional cumulative sum: multiply by the condition column
df['cumsum_x'] = df['delta'].cumsum() * df['condition']
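The same idea can be written with boolean masks, which may read more directly: mark the rows where a run of condition == 1 begins, count those starts cumulatively, and zero out the rows where the condition is 0. This is a sketch on the question's sample data; the names cond and starts are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "value": [30, 100, 4, 0, 80, 0, 1, 4, 70, 70],
    "condition": [0, 0, 1, 1, 0, 1, 1, 1, 0, 0],
})

cond = df["condition"].eq(1)
# A run starts where the condition holds now but did not hold on the previous row.
prev = cond.shift(fill_value=False).astype(bool)
starts = cond & ~prev

# Count the starts seen so far, then reset rows outside a run to 0.
df["new_id"] = starts.cumsum().where(cond, 0)
print(df)
```

On the sample data this yields new_id values of 0, 0, 1, 1, 0, 2, 2, 2, 0, 0, matching the desired column in the question.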
Welcome to SO! Why not just rely on base Python for this?
def counter_func(l):
    new_id = [0]  # First value is zero in any case
    counter = 0
    for i in range(1, len(l)):
        if l[i] == 0:
            new_id.append(0)
        elif l[i] == 1 and l[i-1] == 0:
            counter += 1
            new_id.append(counter)
        elif l[i] == l[i-1] == 1:
            new_id.append(counter)
        else:
            new_id.append(None)
    return new_id
df["new_id"] = counter_func(df["condition"])
Looks like this
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
Edit:
You can also use numba, which sped up the function quite a lot for me: from about 1 s to ~60 ms.
The function needs NumPy arrays as input, so you'll have to pass df["condition"].values.
from numba import njit
import numpy as np
@njit
def func(arr):
    res = np.empty(arr.shape[0])
    counter = 0
    res[0] = 0  # First value is zero anyway
    for i in range(1, arr.shape[0]):
        if arr[i] == 0:
            res[i] = 0
        elif arr[i] and arr[i-1] == 0:
            counter += 1
            res[i] = counter
        elif arr[i] == arr[i-1] == 1:
            res[i] = counter
        else:
            res[i] = np.nan
    return res
df["new_id"] = func(df["condition"].values)
I have a pandas dataframe and I want to loop over the last column "n" times based on a condition.
import random
import pandas as pd
p = 0.5
df = pd.DataFrame()
start = []
for i in range(5):
    if random.random() < p:
        start.append("0")
    else:
        start.append("1")
df['start'] = start
print(df['start'])
Essentially, I want to loop over the final column "n" times and if the value is 0, change it to 1 with probability p so the results become the new final column. (I am simulating on-off every time unit with probability p).
e.g. after one iteration, the dataframe would look something like:
0 0
0 1
1 1
0 0
0 1
after two:
0 0 1
0 1 1
1 1 1
0 0 0
0 1 1
What is the best way to do this?
Sorry if I am asking this wrong, I have been trying to google for a solution for hours and coming up empty.
Like this: append new columns named 1, 2, ...
# continue from question code ...
# new columns are named "1", "2", ...
for col in range(1, 5):
    tmp = []
    for i in range(5):
        # check the current final column
        if df.iloc[i, col - 1] == "0":
            # a "0" switches to "1" with probability p
            if random.random() < p:
                tmp.append("1")
            else:
                tmp.append("0")
        else:  # == "1"
            tmp.append("1")
    # append new col
    df[str(col)] = tmp
print(df)
# initial
s
0 0
1 1
2 0
3 0
4 0
# result
s 1 2 3 4
0 0 0 1 1 1
1 0 0 0 0 1
2 0 0 1 1 1
3 1 1 1 1 1
4 0 0 0 0 0
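For larger frames, the per-row inner loop could be replaced by one NumPy operation per time step. The sketch below is an illustration under a few assumptions that are not in the original: states are stored as integers 0/1 rather than strings, a seeded numpy Generator replaces the random module, and n_rows and n_steps are hypothetical parameter names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
p = 0.5
n_rows, n_steps = 5, 4  # hypothetical sizes matching the example above

# Initial column: 0 with probability p, otherwise 1 (as in the question's code).
state = (rng.random(n_rows) >= p).astype(int)
columns = {"start": state}

for step in range(1, n_steps + 1):
    # Each row that is currently 0 switches to 1 with probability p; 1s stay 1.
    flips = rng.random(n_rows) < p
    state = np.where((state == 0) & flips, 1, state)
    columns[str(step)] = state

df = pd.DataFrame(columns)
print(df)
```

Collecting the columns in a dict and building the DataFrame once at the end avoids repeatedly inserting columns into an existing frame.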
I have a logic-driven flag column and I need to create a column that increments by 1 when the flag is true and decrements by 1 when the flag is false down to a floor of zero.
I've tried a few different methods and I can't get the Accumulator 'shift' to reference the new value created by the process. I know the method below wouldn't stop at zero anyway, but I was just trying to work through the concept before and this is the most to-the-point example to explain the goal. Do I need a for loop to iterate line-by-line?
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.randint(2, size=10), columns=['flag'])
df['accum'] = 0
df['accum'] = np.where(df['flag'] == 1, df['accum'].shift(1) + 1, df['accum'].shift(1) - 1)
df['dOutput'] = [1,0,1,2,1,2,3,2,1,0]  # desired output
df
Output
As far as I know, there's no numpy or pandas vectorized operation to do this, so, you should iterate line-by-line:
def cumsum_with_floor(series):
    acc = 0
    output = []
    accum_list = []
    for val in series:
        val = 1 if val else -1
        acc += val
        accum_list.append(val)
        acc = acc if acc > 0 else 0
        output.append(acc)
    return pd.Series(output, index=series.index), pd.Series(accum_list, index=series.index)
series = pd.Series([1,0,1,1,0,0,0,1])
dOutput, accum = cumsum_with_floor(series)
dOutput
Out:
0 1
1 0
2 1
3 2
4 1
5 0
6 0
7 1
dtype: int64
accum  # shifted one step forward compared with your example
Out:
0 1
1 -1
2 1
3 1
4 -1
5 -1
6 -1
7 1
dtype: int64
But maybe somebody knows a suitable combination of Series.clip and Series.cumsum, or other vectorized operations.
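One vectorized candidate, offered as a sketch rather than a definitive answer: take the plain cumulative sum of the +1/-1 steps and subtract the lowest value (capped at 0) that this sum has reached so far. Every time the unfloored sum sets a new low below zero, the floored accumulator would have been held at 0, so subtracting the running minimum makes up exactly that difference. The function name floored_accumulator is made up for this example.

```python
import numpy as np
import pandas as pd

def floored_accumulator(flags: pd.Series) -> pd.Series:
    """+1 for truthy flags, -1 otherwise, never dropping below zero."""
    steps = np.where(flags.to_numpy().astype(bool), 1, -1)
    total = np.cumsum(steps)
    # Lowest point the unfloored sum has reached so far, capped at 0.
    lowest = np.minimum.accumulate(np.minimum(total, 0))
    return pd.Series(total - lowest, index=flags.index)

series = pd.Series([1, 0, 1, 1, 0, 0, 0, 1])
print(floored_accumulator(series).tolist())  # [1, 0, 1, 2, 1, 0, 0, 1]
```

On the sample series this reproduces the dOutput produced by the loop above.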