How can I fill empty DataFrame based on conditions? - python

I have the following dataframe called condition:
[0] [1] [2] [3]
1 0 0 1 0
2 0 1 0 0
3 0 0 0 1
4 0 0 0 1
For easier reproduction:
import numpy as np
import pandas as pd
n=4
t=3
condition = pd.DataFrame([[0,0,1,0], [0,1,0,0], [0,0,0, 1], [0,0,0, 1]], columns=['0','1', '2', '3'])
condition.index=np.arange(1,n+1)
Further, I have several dataframes that should be filled in a for loop:
df = pd.DataFrame([],index = range(1,n+1),columns= range(t+1) ) #NaN DataFrame
df_2 = pd.DataFrame([],index = range(1,n+1),columns= range(t+1) )
df_3 = pd.DataFrame(3,index = range(1,n+1),columns= range(t+1) )
for i,t in range(t,-1,-1):
    if condition[t]==1:
        df.loc[:,t] = df_3.loc[:,t]**2
        df_2.loc[:,t]=0
    elif (condition == 0 and no 1 in any column after t)
        df.loc[:,t] = 2.5
        ....
    else:
        df.loc[:,t] = 5
        df_2.loc[:,t]= df.loc[:,t+1]
I am aware that this for loop is not correct. What I want to do is check the condition elementwise (recursively): if it is 1 (in condition), fill dataframe df with the squared values of df_3. If it is 0 in condition, I need to distinguish two cases.
In the first case, there is no 1 after the 0 (rows 1 and 2 in condition), so df = 2.5.
In the second case, there is a 1 later in the row, so df is filled with 5 (rows 3 and 4).
So the dataframe df should look something like this
[0] [1] [2] [3]
1 5 5 9 2.5
2 5 9 2.5 2.5
3 5 5 5 9
4 5 5 5 9
The code should include for loop.
Thanks!

I am not sure if this is what you want, but based on your desired output you can do this with only masking operations (which is more efficient than looping over the rows anyway). Your code could look like this:
is_one = condition.astype(bool)
is_after_one = (condition.cumsum(axis=1) - condition).astype(bool)
df = pd.DataFrame(5, index=condition.index, columns=condition.columns)
df_2 = pd.DataFrame(2.5, index=condition.index, columns=condition.columns)
df_3 = pd.DataFrame(3, index=condition.index, columns=condition.columns)
df.where(~is_one, other=df_3 * df_3, inplace=True)
df.where(~is_after_one, other=df_2, inplace=True)
which yields:
0 1 2 3
1 5 5 9.0 2.5
2 5 9 2.5 2.5
3 5 5 5.0 9.0
4 5 5 5.0 9.0
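As a cross-check of the masking logic, the same three-way fill can be sketched with np.select. This is only an illustration, not part of the answer above: the names running and result are made up here, and the frame is built with default integer column labels rather than the string labels from the question. The key idea is that the row-wise cumulative sum minus the cell itself is non-zero exactly for cells that come after a 1.

```python
import numpy as np
import pandas as pd

condition = pd.DataFrame(
    [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]],
    index=np.arange(1, 5),
)

# Running count of 1s along each row; subtracting the cell itself leaves
# the number of 1s strictly to the left of that cell.
running = condition.cumsum(axis=1)
is_one = condition.astype(bool).to_numpy()
is_after_one = (running - condition).astype(bool).to_numpy()

# First matching condition wins: 1 -> 3**2, after a 1 -> 2.5, otherwise 5.
result = pd.DataFrame(
    np.select([is_one, is_after_one], [9.0, 2.5], default=5.0),
    index=condition.index,
    columns=condition.columns,
)
print(result)
```

Because np.select builds the whole array in one pass, the result has a uniform float dtype instead of the mixed integer and float columns shown above.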
EDIT after comment:
If you really want to loop explicitly over the rows and columns, you could do it like this with the same result:
n_rows = condition.index.size
n_cols = condition.columns.size
for row_index in range(n_rows):
    for col_index in range(n_cols):
        cond = condition.iloc[row_index, col_index]
        if col_index < n_cols - 1:
            rest_row = condition.iloc[row_index, col_index + 1:].to_list()
        else:
            rest_row = []
        if cond == 1:
            df.iloc[row_index, col_index] = df_3.iloc[row_index, col_index] ** 2
        elif cond == 0 and 1 not in rest_row:
            # fill whole row at once
            df.iloc[row_index, col_index:] = 2.5
            # stop iterating over the rest
            break
        else:
            df.iloc[row_index, col_index] = 5
            # positional access on both sides, since the column labels are strings
            df_2.iloc[:, col_index] = df.iloc[:, col_index + 1]
The result is the same, but this is much less efficient and harder to read, so I would not recommend it.

Related

Combining looping and conditional to make new columns on dataframe

I want to write a function with a loop and a conditional that counts only the rows where Actual Result = 1.
So the number should increase by 1 each time Actual Result = 1.
This is my dataframe:
This is my code, but it doesn't produce the result that I want:
def func_count(x):
    for i in range(1,880):
        if x['Actual Result']==1:
            result = i
        else:
            result = '-'
        return result
X_machine_learning['Count'] = X_machine_learning.apply(lambda x:func_count(x),axis=1)
When I check and filter with count != '-', the result looks like this:
The number is always equal to 1 and does not increase by 1 every time the actual result = 1. Any solution?
Try something like this:
import pandas as pd
df = pd.DataFrame({
    'age': [30,25,40,12,16,17,14,50,22,10],
    'actual_result': [0,1,1,1,0,0,1,1,1,0]
})
count = 0
lst_count = []
for i in range(len(df)):
    if df['actual_result'][i] == 1:
        count+=1
        lst_count.append(count)
    else:
        lst_count.append('-')
df['count'] = lst_count
print(df)
Result
age actual_result count
0 30 0 -
1 25 1 1
2 40 1 2
3 12 1 3
4 16 0 -
5 17 0 -
6 14 1 4
7 50 1 5
8 22 1 6
9 10 0 -
Actually, you don't need to loop over the dataframe; explicit row loops are usually a pandas anti-pattern and best avoided. With df as your dataframe, you could try the following instead:
m = df["Actual Result"] == 1
df["Count"] = m.cumsum().where(m, "-")
Result for the following dataframe
df = pd.DataFrame({"Actual Result": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})
is
Actual Result Count
0 1 1
1 1 2
2 0 -
3 1 3
4 1 4
5 1 5
6 0 -
7 0 -
8 1 6
9 0 -
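One design note on the cumsum approach: filling with "-" turns Count into an object column that mixes integers and strings. If the counts are needed for further arithmetic, a variant that leaves the gaps as missing values may be more convenient. The sketch below assumes a pandas version with the nullable "Int64" dtype and reuses the example frame from this answer.

```python
import pandas as pd

df = pd.DataFrame({"Actual Result": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})

m = df["Actual Result"] == 1
# where(m) blanks out the rows that should not be counted (NaN);
# the nullable Int64 dtype keeps the counts as integers with <NA> gaps.
df["Count"] = m.cumsum().where(m).astype("Int64")
print(df)
```

Rows where Actual Result is 0 then show <NA> instead of "-", and the column can still be summed or compared numerically.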

Create a New Column Based on Some Values From Other Column in Pandas

Imagine this data set:
A
1 2
2 4
3 3
4 5
5 7
6 6
I would like to create new column by this condition from A:
if A[i] < A[i-1] then B[i] = -1 else B[i] = 1
the result is:
A B
1 2 NaN
2 4 1
3 3 -1
4 5 1
5 7 1
6 6 -1
All codes and solutions that I have found just compare the rows in same location.
Use the diff function, then the sign function:
df.assign(B = np.sign(df.A.diff()))
Out[248]:
A B
0 2 NaN
1 4 1.0
2 3 -1.0
3 5 1.0
4 7 1.0
5 6 -1.0
df['B']=[-1 if i!=0 and df['A'][i] < df['A'][i-1] else 1 for i,v in enumerate(df['A'])]
or
df['B']=[-1 if i!=0 and df['A'][i] < df['A'][i-1] else 1 for i in range(len(df['A']))]
Edit (for three states like greater, less, and equal):
import numpy as np
# assumes the default 0-based integer index
df['B'] = np.nan
for i in range(1, len(df['A'])):
    if df['A'][i] < df['A'][i-1]:
        df.loc[i, 'B'] = -1
    elif df['A'][i] == df['A'][i-1]:
        df.loc[i, 'B'] = 0
    else:
        df.loc[i, 'B'] = 1
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [2,4,3,5,7,6]})
df['B'] = np.where(df['A'] < df['A'].shift(1), -1, 1)
in order to keep the nan in the beginning:
df['B'] = np.where(df['A'].shift(1).isna(), np.nan, df['B'])

how to pass row values from a column based on bool value from several other columns

I have the following df :
df = data.frame("T" = c(1,2,3,5,1,3,2,5), "A" = c("0","0","1","1","0","1","0","1"), "B" = c("0","0","0","1","0","0","0","1"))
df
T A B
1 1 0 0
2 2 0 0
3 3 1 0
4 5 1 1
5 1 0 0
6 3 1 0
7 2 0 0
8 5 1 1
Columns A and B were produced as follows:
df['A'] = [int(x) for x in total_df["T"] >= 3]
df['B'] = [int(x) for x in total_df["T"] >= 5]
I have a data split:
train_size = 0.6
training = df.head(int(train_size * df.shape[0]))
test = df.tail(int((1 - train_size) * df.shape[0]))
Here is the question:
How can I pass row values from "T" to a list called 'tr_return' from 'training' where both columns "A" & "B" are == 1?
I tried this:
tr_returns = training[training['A' and 'B'] == 1]['T'] or
tr_returns = training[training['A'] == 1 and training['B'] == 1]['T']
But neither one works :( Any help will be appreciated!
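Both attempts fail for reasons specific to Python rather than pandas: 'A' and 'B' evaluates to the string 'B' before pandas ever sees it, and the and keyword between two boolean Series raises a ValueError because the truth value of a Series is ambiguous. A minimal sketch of the usual elementwise form follows, assuming A and B are stored as integers (the R snippet above shows them as strings, so this is an assumption) and rebuilding the frame in pandas.

```python
import pandas as pd

df = pd.DataFrame({"T": [1, 2, 3, 5, 1, 3, 2, 5]})
df["A"] = (df["T"] >= 3).astype(int)
df["B"] = (df["T"] >= 5).astype(int)

train_size = 0.6
training = df.head(int(train_size * df.shape[0]))

# Combine the two conditions elementwise with &; the parentheses are
# required because & binds more tightly than ==.
mask = (training["A"] == 1) & (training["B"] == 1)
tr_returns = training.loc[mask, "T"].tolist()
print(tr_returns)
```

With this split the training set holds the first four rows, so only the row with T = 5 satisfies both conditions.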

How to set ranges of rows in pandas?

I have the following working code that sets 1 to "new_col" at the locations pointed by intervals dictated by starts and ends.
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": np.arange(10)})
starts = [1, 5, 8]
ends = [1, 6, 10]
value = 1
df["new_col"] = 0
for s, e in zip(starts, ends):
    df.loc[s:e, "new_col"] = value
print(df)
a new_col
0 0 0
1 1 1
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 0
8 8 1
9 9 1
I want these intervals to come from another dataframe pointer_df.
How to vectorize this?
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
Attempt:
df.loc[pointer_df["starts"]:pointer_df["ends"], "new_col"] = 2
print(df)
obviously doesn't work and gives
raise AssertionError("Start slice bound is non-scalar")
AssertionError: Start slice bound is non-scalar
EDIT:
It seems all answers use some kind of Pythonic for loop.
The question was how to vectorize the operation above.
Is this not doable without for loops or list comprehensions?
You could do:
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
rang = np.arange(len(df))
indices = [i for s, e in pointer_df.to_numpy() for i in rang[slice(s, e + 1, None)]]
df.loc[indices, 'new_col'] = value
print(df)
Output
a new_col
0 0 0
1 1 1
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 0
8 8 1
9 9 1
If you want a method that does not use any for loop or list comprehension and relies only on numpy, you could do:
def indices(start, end, ma=10):
    limits = end + 1
    lens = np.where(limits < ma, limits, end) - start
    np.cumsum(lens, out=lens)
    i = np.ones(lens[-1], dtype=int)
    i[0] = start[0]
    i[lens[:-1]] += start[1:]
    i[lens[:-1]] -= limits[:-1]
    np.cumsum(i, out=i)
    return i
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
df.loc[indices(pointer_df.starts.values, pointer_df.ends.values, ma=len(df)), "new_col"] = value
print(df)
I adapted the method to your use case from the one in this answer.
for i,j in zip(pointer_df["starts"],pointer_df["ends"]):
    print (i,j)
Apply the same method, but on your dictionary.
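Another loop-free option for the interval question is a broadcast comparison: compare every row position against every (start, end) pair and mark a row if it falls inside any interval. This is a sketch under two assumptions: df has the default 0-based integer index, so positions and labels coincide, and the ends are inclusive, as with df.loc slicing. The name positions is made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10)})
pointer_df = pd.DataFrame({"starts": [1, 5, 8], "ends": [1, 6, 10]})

# Column vector of row positions, shape (n_rows, 1).
positions = np.arange(len(df))[:, None]
starts = pointer_df["starts"].to_numpy()
ends = pointer_df["ends"].to_numpy()

# Broadcasting yields an (n_rows, n_intervals) boolean matrix;
# a row is selected if it lies inside at least one interval.
mask = ((positions >= starts) & (positions <= ends)).any(axis=1)
df["new_col"] = np.where(mask, 1, 0)
print(df)
```

The intermediate matrix has one entry per row and interval, so memory grows with the product of the two; for very many intervals, the cumulative-sum method above is likely the better fit.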

Pandas DataFrame use previous row value for complicated 'if' conditions to determine current value

Is there a faster way to do the following loop, perhaps with apply or a rolling apply?
Basically, I need to access the previous row's value to determine the current cell's value.
df.ix[0] = (np.abs(df.ix[0]) >= So) * np.sign(df.ix[0])
for i in range(1, len(df)):
    for col in list(df.columns.values):
        if ((df[col].ix[i] > 1.25) & (df[col].ix[i-1] == 0)) | :
            df[col].ix[i] = 1
        elif ((df[col].ix[i] < -1.25) & (df[col].ix[i-1] == 0)):
            df[col].ix[i] = -1
        elif ((df[col].ix[i] <= -0.75) & (df[col].ix[i-1] < 0)) | ((df[col].ix[i] >= 0.5) & (df[col].ix[i-1] > 0)):
            df[col].ix[i] = df[col].ix[i-1]
        else:
            df[col].ix[i] = 0
As you can see, in the function, I am updating the dataframe, I need to access the most updated previous row, so using shift will not work.
For example:
Input:
A B C
1.3 -1.5 0.7
1.1 -1.4 0.6
1.0 -1.3 0.5
0.4 1.4 0.4
Output:
A B C
1 -1 0
1 -1 0
1 -1 0
0 1 0
You can use the .shift() function to access previous or next values:
previous value for col column:
df['col'].shift()
next value for col column:
df['col'].shift(-1)
Example:
In [38]: df
Out[38]:
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
In [39]: df['prev_a'] = df['a'].shift()
In [40]: df
Out[40]:
a b c prev_a
0 1 0 5 NaN
1 9 9 2 1.0
2 2 2 8 9.0
3 6 3 0 2.0
4 6 1 7 6.0
In [43]: df['next_a'] = df['a'].shift(-1)
In [44]: df
Out[44]:
a b c prev_a next_a
0 1 0 5 NaN 9.0
1 9 9 2 1.0 2.0
2 2 2 8 9.0 6.0
3 6 3 0 2.0 6.0
4 6 1 7 6.0 NaN
I am surprised there isn't a native pandas solution to this as well, because shift and rolling do not get it done. I have devised a way to do this using the standard pandas syntax but I am not sure if it performs any better than your loop... My purposes just required this for consistency (not speed).
import pandas as pd
df = pd.DataFrame({'a':[0,1,2], 'b':[0,10,20]})
new_col = 'c'
def apply_func_decorator(func):
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper
@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)
df[new_col] = df.apply(running_total, axis=1)
print(df)
print(df)
# Output will be:
# a b c
# 0 0 0 0
# 1 1 10 11
# 2 2 20 33
Disclaimer: I used pandas 0.16 but with only slight modification this will work for the latest versions too.
Others had similar questions and I posted this solution on those as well:
Reference previous row when iterating through dataframe
Reference values in the previous row with map or apply
@maxU has it right with shift. I think you can even compare dataframes directly, something like this:
df_prev = df.shift(1)
df_out = pd.DataFrame(index=df.index,columns=df.columns)
df_out[(df>1.25) & (df_prev == 0)] = 1
df_out[(df<-1.25) & (df_prev == 0)] = -1
df_out[(df<=-.75) & (df_prev <0)] = df_prev
df_out[(df>=.5) & (df_prev >0)] = df_prev
The syntax may be off, but if you provide some test data I think this could work.
Saves you having to loop at all.
EDIT - Update based on comment below
I would try my absolute best not to loop through the DF itself. You're better off going column by column, sending to a list and doing the updating, then just importing back again. Something like this:
# .iloc replaces the .ix accessor from the question (.ix was removed in pandas 1.0)
df.iloc[0] = (np.abs(df.iloc[0]) >= 1.25) * np.sign(df.iloc[0])
for col in df.columns.tolist():
    currData = df[col].tolist()
    for currRow in range(1,len(currData)):
        if currData[currRow]> 1.25 and currData[currRow-1]== 0:
            currData[currRow] = 1
        elif currData[currRow] < -1.25 and currData[currRow-1]== 0:
            currData[currRow] = -1
        elif currData[currRow] <=-.75 and currData[currRow-1]< 0:
            currData[currRow] = currData[currRow-1]
        elif currData[currRow]>= .5 and currData[currRow-1]> 0:
            currData[currRow] = currData[currRow-1]
        else:
            currData[currRow] = 0
    df[col] = currData
