I want to generate a cycle counter for my DataFrame. One cycle in the example below has a length of 4. The last column shows how it is supposed to look; the other columns are my attempts.
My current code looks like this:
import pandas as pd
import numpy as np
l = list(np.linspace(0,10,12))
data = [
('time',l),
('A',[0,5,0.6,-4.8,-0.3,4.9,0.2,-4.7,0.5,5,0.1,-4.6]),
('B',[ 0,300,20,-280,-25,290,30,-270,40,300,-10,-260]),
]
df = pd.DataFrame.from_dict(dict(data))
length = len(df)
df.loc[0,'cycle']=1
df['cycle'] = length/4 +df.loc[0,'cycle']
i = 0
for i in range(0,length):
    df.loc[i,'new_cycle']=i+1
df['want_cycle']= [1,1,1,1,2,2,2,2,3,3,3,3]
print(length)
print(df)
I also need an if condition in the code, so that the value of df['new_cycle'] only increases when the index counter reaches, for example, 4. So far I have failed to find a proper way to implement such a condition.
Because your dataframe's row index is the default range index starting at 0, you can use floor division on the index to calculate the cycle:
df['cycle'] = df.index//4 + 1
Output:
time A B cycle
0 0.000000 0.0 0 1
1 0.909091 5.0 300 1
2 1.818182 0.6 20 1
3 2.727273 -4.8 -280 1
4 3.636364 -0.3 -25 2
5 4.545455 4.9 290 2
6 5.454545 0.2 30 2
7 6.363636 -4.7 -270 2
8 7.272727 0.5 40 3
9 8.181818 5.0 300 3
10 9.090909 0.1 -10 3
11 10.000000 -4.6 -260 3
Now, if your dataframe index isn't the default, then you can use something like this:
df['cycle'] = [df.index.get_loc(i) // 4 + 1 for i in df.index]
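A positional variant, as a minimal sketch: when the index is not the default range, np.arange(len(df)) gives each row's position directly, which avoids one get_loc call per row. The sample frame and its string index are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-default (string) index
df = pd.DataFrame({"A": range(12)}, index=list("abcdefghijkl"))

# Row position // cycle length, shifted so that cycles start at 1
cycle_length = 4
df["cycle"] = np.arange(len(df)) // cycle_length + 1

print(df["cycle"].tolist())
```

Because this only uses row positions, it should also behave sensibly when the index contains duplicate labels.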
I've added just one thing: a new variable called new_cycle, which keeps the count you're after.
In the for loop we check whether i is divisible by 4 without a remainder. If it is, we add 1 to the new variable, and then we fill the dataframe with this value the same way you did.
import pandas as pd
import numpy as np
l = list(np.linspace(0,10,12))
data = [
('time',l),
('A',[0,5,0.6,-4.8,-0.3,4.9,0.2,-4.7,0.5,5,0.1,-4.6]),
('B',[ 0,300,20,-280,-25,290,30,-270,40,300,-10,-260]),
]
df = pd.DataFrame.from_dict(dict(data))
length = len(df)
df.loc[0,'cycle']=1
df['cycle'] = length/4 +df.loc[0,'cycle']
new_cycle = 0
for i in range(0,length):
    if i % 4 == 0:
        new_cycle += 1
    df.loc[i,'new_cycle']= new_cycle
df['want_cycle'] = [1,1,1,1,2,2,2,2,3,3,3,3]
print(length)
print(df)
I have following dataframe called condition:
[0] [1] [2] [3]
1 0 0 1 0
2 0 1 0 0
3 0 0 0 1
4 0 0 0 1
For easier reproduction:
import numpy as np
import pandas as pd
n=4
t=3
condition = pd.DataFrame([[0,0,1,0], [0,1,0,0], [0,0,0, 1], [0,0,0, 1]], columns=['0','1', '2', '3'])
condition.index=np.arange(1,n+1)
Further, I have several dataframes that should be filled in a for loop:
df = pd.DataFrame([],index = range(1,n+1),columns= range(t+1) ) #NaN DataFrame
df_2 = pd.DataFrame([],index = range(1,n+1),columns= range(t+1) )
df_3 = pd.DataFrame(3,index = range(1,n+1),columns= range(t+1) )
for i,t in range(t,-1,-1):
    if condition[t]==1:
        df.loc[:,t] = df_3.loc[:,t]**2
        df_2.loc[:,t]=0
    elif (condition == 0 and no 1 in any column after t)
        df.loc[:,t] = 2.5
        ....
    else:
        df.loc[:,t] = 5
        df_2.loc[:,t]= df.loc[:,t+1]
I am aware that this for loop is not correct, but what I want to do is check condition elementwise (recursively): if the value in condition is 1, fill the dataframe df with the squared value of df_3. If the value in condition is 0, I need to distinguish two cases.
In the first case, no 1 follows the 0 in that row (as in rows 1 and 2 of condition), and df should be 2.5.
In the second case, a 1 follows later in the row, and df should be filled with 5 (as in rows 3 and 4).
So the dataframe df should look something like this
[0] [1] [2] [3]
1 5 5 9 2.5
2 5 9 2.5 2.5
3 5 5 5 9
4 5 5 5 9
The code should include for loop.
Thanks!
I am not sure if this is what you want, but based on your desired output you can do this with only masking operations (which is more efficient than looping over the rows anyway). Your code could look like this:
is_one = condition.astype(bool)
is_after_one = (condition.cumsum(axis=1) - condition).astype(bool)
df = pd.DataFrame(5, index=condition.index, columns=condition.columns)
df_2 = pd.DataFrame(2.5, index=condition.index, columns=condition.columns)
df_3 = pd.DataFrame(3, index=condition.index, columns=condition.columns)
df.where(~is_one, other=df_3 * df_3, inplace=True)
df.where(~is_after_one, other=df_2, inplace=True)
which yields:
0 1 2 3
1 5 5 9.0 2.5
2 5 9 2.5 2.5
3 5 5 5.0 9.0
4 5 5 5.0 9.0
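The three cases can also be written as a single np.select call, which may make the case logic easier to read. This is only a sketch of an alternative, not part of the answer above; it rebuilds the sample condition frame from the question so that it runs on its own.

```python
import numpy as np
import pandas as pd

condition = pd.DataFrame(
    [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]],
    columns=["0", "1", "2", "3"],
    index=range(1, 5),
)
df_3 = pd.DataFrame(3, index=condition.index, columns=condition.columns)

is_one = condition.astype(bool)
# cumsum counts the 1s up to and including each cell; subtracting the cell
# itself leaves the number of 1s strictly to its left
is_after_one = (condition.cumsum(axis=1) - condition).astype(bool)

# First matching condition wins; cells matching neither get the default
values = np.select(
    [is_one.to_numpy(), is_after_one.to_numpy()],
    [(df_3 ** 2).to_numpy(dtype=float), 2.5],
    default=5.0,
)
result = pd.DataFrame(values, index=condition.index, columns=condition.columns)
print(result)
```

Each row of the sample contains a single 1, so the order of the two conditions does not matter here. With several 1s per row the two approaches could differ, because np.select takes the first condition that matches.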
EDIT after comment:
If you really want to loop explicitly over the rows and columns, you could do it like this with the same result:
n_rows = condition.index.size
n_cols = condition.columns.size
for row_index in range(n_rows):
    for col_index in range(n_cols):
        cond = condition.iloc[row_index, col_index]
        if col_index < n_cols - 1:
            rest_row = condition.iloc[row_index, col_index + 1:].to_list()
        else:
            rest_row = []
        if cond == 1:
            df.iloc[row_index, col_index] = df_3.iloc[row_index, col_index] ** 2
        elif cond == 0 and 1 not in rest_row:
            # fill whole row at once
            df.iloc[row_index, col_index:] = 2.5
            # stop iterating over the rest
            break
        else:
            df.iloc[row_index, col_index] = 5
            # positional access, since the column labels are not integers
            df_2.iloc[:, col_index] = df.iloc[:, col_index + 1]
The result is the same, but this is much less efficient and harder to read, so I would not recommend doing it this way.
I have the below code
for i in range(index, len(df_2)+1):
    if df_2.loc[i, 'Duration'] == 0:
        df_2.loc[i, 'Duration'] = df_2.loc[i, "idle_hrs"] + df_2.loc[i - 1, "Duration"]
How can I write this in a simpler way to reduce the running time? Is there a way to write it in list-comprehension style?
You can use shift for the accumulation.
import pandas as pd
index = 2
df = pd.DataFrame({'Duration': [1,0,2,0], 'idle_hrs': [10,20,30,40]})
df
Duration idle_hrs
0 1 10
1 0 20
2 2 30
3 0 40
start_df = df[:index].copy()  # copy, so the saved rows are not affected by the update below
df.loc[df['Duration'] == 0, 'Duration'] = df['idle_hrs'] + df['Duration'].shift(1)
df.iloc[:index] = start_df
df
Duration idle_hrs
0 1.0 10
1 0.0 20
2 2.0 30
3 42.0 40
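One thing to watch: the loop in the question reads the already-updated Duration of the previous row, so a run of consecutive zeros keeps accumulating, whereas a single shift only sees the original values. The following is a sketch of one way to reproduce the loop's behaviour without iterating, assuming index is at least 1. The sample data and the helper names (is_zero, run_id, base, idle) are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Duration": [1, 0, 0, 2, 0], "idle_hrs": [10, 20, 30, 40, 50]})
index = 1  # first row position the loop would touch (assumed >= 1)

# Rows to fill: zero Duration at or after the start position
is_zero = (df["Duration"] == 0) & (np.arange(len(df)) >= index)

# Every row that is not being filled starts a new run; the zeros after it share its id
run_id = (~is_zero).cumsum()

# Last known Duration before each run of zeros
base = df["Duration"].where(~is_zero).ffill()

# Accumulate idle_hrs within each run, counting only the rows being filled
idle = df["idle_hrs"].where(is_zero, 0)
df["Duration"] = base + idle.groupby(run_id).cumsum()

print(df)
```

Here rows 1 and 2 form one run of zeros, so row 2 builds on the filled value of row 1, just as the loop would.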
I want to count, for each row, the number of previous rows that have a greater value than the current row in a column, and store the count in a new column. It would be like a rolling countif that goes back to the beginning of the column. The desired example output below shows the given Value column and the Count column I want to create.
Desired Output:
Value Count
5 0
7 0
4 2
12 0
3 4
4 3
1 6
I plan on using this code with a large dataframe so the fastest way possible is appreciated.
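We can use subtract.outer from numpy to build the matrix of pairwise differences, keep its lower triangle (each row compared with the rows before it), check where the difference is less than 0, and sum those hits per row: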
a = np.sum(np.tril(np.subtract.outer(df.Value.values,df.Value.values), k=0)<0, axis=1)
# results in array([0, 0, 2, 0, 4, 3, 6])
df['Count'] = a
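Since the question mentions a large dataframe, note that the outer difference builds an n-by-n matrix, which may not fit in memory for long columns. As a hedged alternative (not part of the answer above), the same counts can be collected with the standard-library bisect module by keeping the values seen so far in sorted order.

```python
from bisect import bisect_right, insort

import pandas as pd

df = pd.DataFrame({"Value": [5, 7, 4, 12, 3, 4, 1]})

seen = []    # earlier values, kept in ascending order
counts = []
for v in df["Value"].tolist():
    # Everything to the right of v's insertion point is strictly greater than v
    counts.append(len(seen) - bisect_right(seen, v))
    insort(seen, v)

df["Count"] = counts
print(df["Count"].tolist())
```

Each lookup is a binary search, although insort still has to shift list elements, so this is a memory-friendly sketch rather than a guaranteed speed-up.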
IMPORTANT: this only works with pandas < 1.0.0 and the error seems to be a pandas bug. An issue is already created at https://github.com/pandas-dev/pandas/issues/35203
We can do this with expanding and applying a function which checks for values that are higher than the last element in the expanding array.
import pandas as pd
import numpy as np
# setup
df = pd.DataFrame([5,7,4,12,3,4,1], columns=['Value'])
# calculate countif
df['Count'] = df.Value.expanding(1).apply(lambda x: np.sum(np.where(x > x[-1], 1, 0))).astype('int')
Input
Value
0 5
1 7
2 4
3 12
4 3
5 4
6 1
Output
Value Count
0 5 0
1 7 0
2 4 2
3 12 0
4 3 4
5 4 3
6 1 6
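On newer pandas versions the expanding window is passed to the function as a Series by default, where x[-1] is treated as a label lookup. A possible workaround, offered as a sketch rather than a confirmed fix for the linked issue, is to pass raw=True so the function receives a plain NumPy array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([5, 7, 4, 12, 3, 4, 1], columns=["Value"])

# raw=True hands each expanding window to the function as a NumPy array,
# so x[-1] is the current (last) element of the window
df["Count"] = (
    df["Value"]
    .expanding(1)
    .apply(lambda x: np.sum(x > x[-1]), raw=True)
    .astype("int")
)
print(df["Count"].tolist())
```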
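values = df['Value'].tolist()
counts = []
for i in range(len(values)):
    count = 0
    # compare the current value with every earlier value
    for j in values[:i]:
        if values[i] < j:
            count += 1
    counts.append(count)
df['Count'] = counts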
The below generator will do what you need. You may be able to further optimize this if needed.
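def generator (data) :
    # assumes integer values, because range() is used to walk the values above v
    m=max(data)
    count_dict ={}
    i=0
    while i<len(data) :
        v=data[i]
        count_dict[v] =count_dict[v] +1 if v in count_dict else 1
        # sum the counts of every value above v, up to and including the maximum
        t=sum([(count_dict[j] if j in count_dict else 0) for j in range(v+1,m+1)])
        i +=1
        yield t
d=[1, 5,7,3,5,8]
foo=generator (d)
result =[b for b in foo]
print(result)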
I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
decay = 0.5
test = pd.DataFrame(np.random.randint(1,10,12),columns = ['val'])
test
val
0 4
1 5
2 7
3 9
4 1
5 1
6 8
7 7
8 3
9 9
10 7
11 2
decayed = []
for i, v in test.iterrows():
    if i ==0:
        decayed.append(v.val)
        continue
    d = decayed[i-1] + v.val*decay
    decayed.append(d)
test['loop_decay'] = decayed
test.head()
val loop_decay
0 4 4.0
1 5 6.5
2 7 10.0
3 9 14.5
4 1 15.0
Consider a vectorized version with cumsum() where you cumulatively sum (val * decay) with the very first val.
However, you then need to subtract the very first (val * decay) since cumsum() includes it:
test['loop_decay'] = (test.loc[0,'val']) + (test['val']*decay).cumsum() - (test.loc[0,'val']*decay)
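To check that the cumulative-sum formula matches the loop, here is a small self-contained sketch. It uses the first five values from the question's sample and positional access with .iloc, which is an assumption made so the snippet runs on current pandas versions.

```python
import numpy as np
import pandas as pd

decay = 0.5
test = pd.DataFrame({"val": [4, 5, 7, 9, 1]})

# Loop version from the question: previous result plus the decayed current value
decayed = [float(test["val"].iloc[0])]
for v in test["val"].iloc[1:]:
    decayed.append(decayed[-1] + v * decay)

# Vectorized version: first value plus the running sum of the decayed later values
first = test["val"].iloc[0]
test["loop_decay"] = first + (test["val"] * decay).cumsum() - first * decay

print(test)
print(np.allclose(test["loop_decay"], decayed))
```

The result agrees with the loop_decay column printed by test.head() in the question.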
You can utilize pd.Series.shift() to create a dataframe with val[i] and val[i-1] and then apply your function across a single axis (1 in this case):
# Create a series that shifts the rows by 1
test['val2'] = test.val.shift()
# Set the first row on the shifted series to 0
test.loc[test.index[0], 'val2'] = 0
# Apply the decay formula:
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)
Is there a faster way to do the following loop? Maybe apply or a rolling apply could be used for this.
Basically, I need to access the previous row's value to determine the current cell's value.
df.ix[0] = (np.abs(df.ix[0]) >= So) * np.sign(df.ix[0])
for i in range(1, len(df)):
    for col in list(df.columns.values):
        if ((df[col].ix[i] > 1.25) & (df[col].ix[i-1] == 0)):
            df[col].ix[i] = 1
        elif ((df[col].ix[i] < -1.25) & (df[col].ix[i-1] == 0)):
            df[col].ix[i] = -1
        elif ((df[col].ix[i] <= -0.75) & (df[col].ix[i-1] < 0)) | ((df[col].ix[i] >= 0.5) & (df[col].ix[i-1] > 0)):
            df[col].ix[i] = df[col].ix[i-1]
        else:
            df[col].ix[i] = 0
As you can see, the loop updates the dataframe as it goes, and each step needs the most recently updated previous row, so using shift will not work.
For example:
Input:
A B C
1.3 -1.5 0.7
1.1 -1.4 0.6
1.0 -1.3 0.5
0.4 1.4 0.4
Output:
A B C
1 -1 0
1 -1 0
1 -1 0
0 1 0
You can use the .shift() function to access previous or next values.
Previous value for the col column:
df['col'].shift()
Next value for the col column:
df['col'].shift(-1)
Example:
In [38]: df
Out[38]:
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
In [39]: df['prev_a'] = df['a'].shift()
In [40]: df
Out[40]:
a b c prev_a
0 1 0 5 NaN
1 9 9 2 1.0
2 2 2 8 9.0
3 6 3 0 2.0
4 6 1 7 6.0
In [43]: df['next_a'] = df['a'].shift(-1)
In [44]: df
Out[44]:
a b c prev_a next_a
0 1 0 5 NaN 9.0
1 9 9 2 1.0 2.0
2 2 2 8 9.0 6.0
3 6 3 0 2.0 6.0
4 6 1 7 6.0 NaN
I am surprised there isn't a native pandas solution to this as well, because shift and rolling do not get it done. I have devised a way to do this using the standard pandas syntax but I am not sure if it performs any better than your loop... My purposes just required this for consistency (not speed).
import pandas as pd
df = pd.DataFrame({'a':[0,1,2], 'b':[0,10,20]})
new_col = 'c'
def apply_func_decorator(func):
    # remembers the previous row (including the computed column) between calls
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper
@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)
df[new_col] = df.apply(running_total, axis=1)
print(df)
# Output will be:
# a b c
# 0 0 0 0
# 1 1 10 11
# 2 2 20 33
Disclaimer: I used pandas 0.16 but with only slight modification this will work for the latest versions too.
Others had similar questions and I posted this solution on those as well:
Reference previous row when iterating through dataframe
Reference values in the previous row with map or apply
@maxU has it right with shift. I think you can even compare dataframes directly, something like this:
df_prev = df.shift(1)  # previous row; shift(-1) would give the next row
df_out = pd.DataFrame(index=df.index,columns=df.columns)
df_out[(df>1.25) & (df_prev == 0)] = 1
df_out[(df<-1.25) & (df_prev == 0)] = -1
df_out[(df<=-.75) & (df_prev <0)] = df_prev
df_out[(df>=.5) & (df_prev >0)] = df_prev
The syntax may be off, but if you provide some test data I think this could work.
Saves you having to loop at all.
EDIT - Update based on comment below
I would try my absolute best not to loop through the DF itself. You're better off going column by column, sending to a list and doing the updating, then just importing back again. Something like this:
df.iloc[0] = (np.abs(df.iloc[0]) >= 1.25) * np.sign(df.iloc[0])
for col in df.columns.tolist():
    # work on a plain list so each step sees the already-updated previous value
    currData = df[col].tolist()
    for currRow in range(1,len(currData)):
        if currData[currRow]> 1.25 and currData[currRow-1]== 0:
            currData[currRow] = 1
        elif currData[currRow] < -1.25 and currData[currRow-1]== 0:
            currData[currRow] = -1
        elif currData[currRow] <=-.75 and currData[currRow-1]< 0:
            currData[currRow] = currData[currRow-1]
        elif currData[currRow]>= .5 and currData[currRow-1]> 0:
            currData[currRow] = currData[currRow-1]
        else:
            currData[currRow] = 0
    df[col] = currData
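The same column-by-column idea can be wrapped in a small function that works on NumPy arrays, which keeps the state handling in one place. This is a sketch only: latch_column and its parameter names are hypothetical, the thresholds are the ones from the question, and the data are columns A and C of the question's example.

```python
import numpy as np
import pandas as pd


def latch_column(values, enter=1.25, hold_pos=0.5, hold_neg=-0.75):
    """Hypothetical helper: apply the question's rules to one column."""
    out = np.zeros(len(values))
    # First row: sign of the value if its magnitude reaches the entry threshold
    out[0] = np.sign(values[0]) if abs(values[0]) >= enter else 0.0
    for i in range(1, len(values)):
        prev, cur = out[i - 1], values[i]
        if prev == 0 and cur > enter:
            out[i] = 1.0
        elif prev == 0 and cur < -enter:
            out[i] = -1.0
        elif (prev < 0 and cur <= hold_neg) or (prev > 0 and cur >= hold_pos):
            out[i] = prev
        # in every other case the entry keeps its initial value of 0
    return out


df = pd.DataFrame({"A": [1.3, 1.1, 1.0, 0.4], "C": [0.7, 0.6, 0.5, 0.4]})
result = pd.DataFrame(
    {col: latch_column(df[col].to_numpy()) for col in df.columns},
    index=df.index,
)
print(result)
```

Writing into a separate output array means the original values are never overwritten while the loop is still reading them.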