I have a dataframe that looks like this:
print(df)
Out[1]:
Numbers
0 0
1 1
2 1
3 1
4 1
5 0
6 0
7 1
8 0
9 1
10 1
I want to transform it to this:
print(dfo)
Out[2]:
Numbers
0 0
1 1
2 2
3 3
4 4
5 0
6 0
7 1
8 0
9 1
10 2
The solution, I thought, could be a loop over the rows with two ifs: check whether the value in df is 1; if it is, check whether the value at i-1 is also 1; if so, take the value at i-1 in dfo and add 1; otherwise put 0 in dfo.
I've tried this:
# Import pandas library
import pandas as pd
# Initialize the list of elements
numbers = [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
# Create the pandas DataFrame with the column name provided explicitly
df = pd.DataFrame(numbers, columns=['Numbers'])
# Print the dataframe.
df
data1c = df.copy()
for j in df:
    for i in range(len(df)):
        if df.loc[i, j] == 1:
            if df.loc[i-1, j] == 1:
                data1c.loc[i, j] = data1c.loc[i-1, j] + 1
            elif df.loc[i-1, j] == 0:
                data1c.loc[i, j] = 1
        elif df.loc[i, j] == 0:
            data1c.loc[i, j] = 0
print(data1c)
Numbers
0 0
1 1
2 2
3 3
4 4
5 0
6 0
7 1
8 0
9 1
10 2
For a dataframe with one column it works, but when I tried it with a dataframe with two columns:
input = {'A': [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
         'B': [1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]}
df = pd.DataFrame(input)
# Print the dataframe.
df
data2c = df.copy()
for j in df:
    for i in range(len(df)):
        if df.loc[i, j] == 1:
            if df.loc[i-1, j] == 1:
                data2c.loc[i, j] = data2c.loc[i-1, j] + 1
            elif df.loc[i-1, j] == 0:
                data2c.loc[i, j] = 1
        elif df.loc[i, j] == 0:
            data2c.loc[i, j] = 0
I get :
File "C:\Users\talls\.conda\envs\Spyder\lib\site-packages\pandas\core\indexes\range.py", line 393, in get_loc
raise KeyError(key) from err
KeyError: -1
Why do I get this error, and how do I fix it?
Or:
Is there another way to get my desired output?
I know this is not the answer to the literal question "How do I use iloc to reference the rows above?", but it is the answer to your underlying question.
df = pd.DataFrame([0,1,1,1,1,0,0,1,0,1,1], columns=['Numbers'])
df['tmp'] = (~(df['Numbers'] == df['Numbers'].shift(1))).cumsum()
df['new'] = df.groupby('tmp')['Numbers'].cumsum()
print(df)
Numbers tmp new
0 0 1 0
1 1 2 1
2 1 2 2
3 1 2 3
4 1 2 4
5 0 3 0
6 0 3 0
7 1 4 1
8 0 5 0
9 1 6 1
10 1 6 2
How does this work? The inner part ~(df['Numbers'] == df['Numbers'].shift(1)) checks whether each row equals the previous one. This works for the first row as well, because comparing a number with NaN is always False. Negating the comparison marks the start of each new run with True, and the cumulative sum turns those marks into run labels stored in the tmp column. Grouping by tmp and taking a cumulative sum over the values then gives the required answer in the new column.
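To make the mechanics visible, here is a step-by-step sketch of the same trick with each intermediate result named (the variable names are mine, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1], columns=['Numbers'])

# Step 1: is each row equal to the previous one? (row 0 compares False vs NaN)
same_as_prev = df['Numbers'] == df['Numbers'].shift(1)
# Step 2: negate, so the start of every new run is True
new_run = ~same_as_prev
# Step 3: cumulative sum turns those flags into run labels 1, 2, 3, ...
run_id = new_run.cumsum()
# Step 4: cumulative sum of the values within each run
df['new'] = df.groupby(run_id)['Numbers'].cumsum()
print(df['new'].tolist())  # [0, 1, 2, 3, 4, 0, 0, 1, 0, 1, 2]
```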
For the two-column version, you do exactly the same thing, once for column A and once for column B:
df = pd.DataFrame(input)
df['tmp'] = (~(df['A'] == df['A'].shift(1))).cumsum()
df['newA'] = df.groupby('tmp')['A'].cumsum()
#Just reusing column tmp
df['tmp'] = (~(df['B'] == df['B'].shift(1))).cumsum()
df['newB'] = df.groupby('tmp')['B'].cumsum()
print(df)
A B tmp newA newB
0 0 1 1 0 1
1 1 1 1 1 2
2 1 0 2 2 0
3 1 0 2 3 0
4 1 1 3 4 1
5 0 1 3 0 2
6 0 1 3 0 3
7 1 0 4 1 0
8 0 0 4 0 0
9 1 1 5 1 1
10 1 0 6 2 0
11 0 1 7 0 1
12 1 1 7 1 2
13 1 0 8 2 0
To answer the question as you originally posed it (I mentioned this in a comment already): you need to put in a safeguard against i == 0. You can do that in two ways:
for j in df:
    for i in range(len(df)):
        if i == 0:
            continue
        elif df.loc[i, j] == 1:
            if df.loc[i-1, j] == 1:
                data2c.loc[i, j] = data2c.loc[i-1, j] + 1
            elif df.loc[i-1, j] == 0:
                data2c.loc[i, j] = 1
        elif df.loc[i, j] == 0:
            data2c.loc[i, j] = 0
or start at 1 instead of 0:
for j in df:
    for i in range(1, len(df)):
        if df.loc[i, j] == 1:
            if df.loc[i-1, j] == 1:
                data2c.loc[i, j] = data2c.loc[i-1, j] + 1
            elif df.loc[i-1, j] == 0:
                data2c.loc[i, j] = 1
        elif df.loc[i, j] == 0:
            data2c.loc[i, j] = 0
The resulting dataframe:
A B
0 0 1
1 1 2
2 2 0
3 3 0
4 4 1
5 0 2
6 0 3
7 1 0
8 0 0
9 1 1
10 2 0
11 0 1
12 1 2
13 2 0
Related
I am working with a large dataset, but to illustrate the problem I will use a simplified example (some rows were deleted while cleaning the dataset).
Imagine I have this dataset:
Code_0 Code_1 Code_3 Code_4 Code_5 Code_6 Code_7
3 1 1 3 0 0 1 1
9 0 0 0 0 0 0 1
10 4 2 3 1 1 0 0
15 0 3 0 5 1 1 1
So, what I want to do for every row is: if Code_5 equals 1 and Code_0 is greater than 0, add one to Code_0. Likewise, if Code_6 == 1 and Code_1 is greater than 0, add one to Code_1. Finally, if Code_7 == 1 and Code_3 is greater than 0, add one to Code_3. So I want to end up with this:
Code_0 Code_1 Code_3 Code_4 Code_5 Code_6 Code_7
3 1 2 4 0 0 1 1
9 0 0 0 0 0 0 1
10 5 2 3 1 1 0 0
15 0 4 0 5 1 1 1
I wrote this code, but it is not working. In any case, I think there may be a better option.
def add_one(x):
    if x == df_data['Code_5']:
        if x == 1 and df_data['Code_0'] > 0:
            df_data['Code_0'] = df_data['Code_0'] + 1
    if x == df_data['Code_6']:
        if x == 1 and df_data['Code_1'] > 0:
            df_data['Code_1'] = df_data['Code_1'] + 1
    if x == df_data['Code_7']:
        if x == 1 and df_data['Code_2'] > 0:
            df_data['Code_2'] = df_data['Code_2'] + 1
df_data['Code_0'] = df_data['Code_5'].apply(add_one)
df_data['Code_1'] = df_data['Code_6'].apply(add_one)
df_data['Code_2'] = df_data['Code_7'].apply(add_one)
Can anyone help me, please?
You can simplify by passing the complete row:
In [163]: def add_one(row):
     ...:     if row['Code_5'] == 1 and row['Code_0'] > 0:
     ...:         row['Code_0'] = row['Code_0'] + 1
     ...:     if row['Code_6'] == 1 and row['Code_1'] > 0:
     ...:         row['Code_1'] = row['Code_1'] + 1
     ...:     if row['Code_7'] == 1 and row['Code_2'] > 0:
     ...:         row['Code_2'] = row['Code_2'] + 1
     ...:     return row
     ...:
In [164]: add_one
Out[164]: <function __main__.add_one(row)>
In [165]: df = df.apply(lambda x: add_one(x), axis=1)
axis=1 specifies that the function is applied row-wise (once per row, across the columns).
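As an alternative to apply, the same updates can be written with boolean masks, avoiding the row-wise Python loop entirely. This is a sketch of mine using the data from the question; note that the question's prose (and its expected output) pairs Code_7 with Code_3, even though the posted code used Code_2:

```python
import pandas as pd

df_data = pd.DataFrame(
    {'Code_0': [1, 0, 4, 0], 'Code_1': [1, 0, 2, 3], 'Code_3': [3, 0, 3, 0],
     'Code_4': [0, 0, 1, 5], 'Code_5': [0, 0, 1, 1], 'Code_6': [1, 0, 0, 1],
     'Code_7': [1, 1, 0, 1]},
    index=[3, 9, 10, 15])

# Each flag column increments its paired target column wherever the flag
# is 1 and the target is already positive.
for flag, target in [('Code_5', 'Code_0'), ('Code_6', 'Code_1'), ('Code_7', 'Code_3')]:
    mask = df_data[flag].eq(1) & df_data[target].gt(0)
    df_data.loc[mask, target] += 1
print(df_data)
```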
I have a dataframe with two columns, Column_A and Column_B, and an array of letters from A to P, as follows:
df = pd.DataFrame({
    'Column_A': [0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
    'Column_B': []
})
the array is as follows:
label = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P']
Expected output is
'A':[0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
'B':['A','A','A','A','A','E','E','E','E','E','I','I','I','I','I','M']
The value in Column_B changes as soon as the value in Column_A is 1, and the new value is taken from the given array label.
I have tried using this for loop:
for row in df.index:
    try:
        if df.loc[row, 'Column_A'] == 1:
            df.at[row, 'Column_B'] = label[row+4]
            print(label[row])
        else:
            df.Column_B.fillna('ffill')
    except IndexError:
        row = (row+4) % 4
        df.at[row, 'Column_B'] = label[row]
I also want to loop back to the start once it reaches the last value in the label array.
A solution that should do the trick looks like this:
label=list('ABCDEFGHIJKLMNOP')
df = pd.DataFrame({
'Column_A': [0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
'Column_B': label
})
I'm not exactly sure what you intended with the fillna; I don't think you need it.
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row in df.index:
    if df.loc[row, 'Column_A'] == 1:
        lookup = lookup+4 if lookup+4 < max_index else lookup % 4
        df.at[row, 'Column_B'] = label[lookup]
        print(label[row])
I also avoid the exception handling in this case, because the "index overflow" can be handled without exception handling.
By the way, if you have a large dataframe you can probably make the code faster by eliminating one lookup (but you'd need to verify whether it really runs faster). That solution would look like this:
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row, record in df.iterrows():
    if record['Column_A'] == 1:
        lookup = lookup+4 if lookup+4 < max_index else lookup % 4
        df.at[row, 'Column_B'] = label[lookup]
        print(label[row])
Option 1
cond1 = df.Column_A == 1
cond2 = df.index == 0
mappr = lambda x: label[x]
df.assign(Column_B=np.where(cond1 | cond2, df.index.map(mappr), np.nan)).ffill()
Column_A Column_B
0 0 A
1 0 A
2 0 A
3 0 A
4 0 A
5 1 F
6 0 F
7 0 F
8 0 F
9 0 F
10 1 K
11 0 K
12 0 K
13 0 K
14 0 K
15 1 P
Option 2
a = np.append(0, np.flatnonzero(df.Column_A))
b = df.Column_A.to_numpy().cumsum()
c = np.array(label)
df.assign(Column_B=c[a[b]])
Column_A Column_B
0 0 A
1 0 A
2 0 A
3 0 A
4 0 A
5 1 F
6 0 F
7 0 F
8 0 F
9 0 F
10 1 K
11 0 K
12 0 K
13 0 K
14 0 K
15 1 P
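Option 2 is terse, so here is my reading of it spelled out with comments (same arrays, same names as the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column_A': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]})
label = list('ABCDEFGHIJKLMNOP')

# Row positions where Column_A is 1, with 0 prepended to cover the leading run
a = np.append(0, np.flatnonzero(df.Column_A))   # [0, 5, 10, 15]
# Running count of 1s seen so far: which run each row belongs to
b = df.Column_A.to_numpy().cumsum()
c = np.array(label)
# a[b] maps each row to the position that started its run;
# indexing c with that turns the position into its letter
result = c[a[b]]
print(result.tolist())
```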
Using groupby with transform, then map:
df.reset_index().groupby(df.Column_A.eq(1).cumsum())['index'].transform('first').map(dict(enumerate(label)))
Out[139]:
0 A
1 A
2 A
3 A
4 A
5 F
6 F
7 F
8 F
9 F
10 K
11 K
12 K
13 K
14 K
15 P
Name: index, dtype: object
Following is the Dataframe I am starting from:
import pandas as pd
import numpy as np
d= {'PX_LAST':[1,2,3,3,3,1,2,1,1,1,3,3],'ma':[2,2,2,2,2,2,2,2,2,2,2,2],'action':[0,0,1,0,0,-1,0,1,0,0,-1,0]}
df_zinc = pd.DataFrame(data=d)
df_zinc
Now, I need to add a column called 'buy_sell', which:
when 'action'==1, populates with 1 if 'PX_LAST' >'ma', and with -1 if 'PX_LAST'<'ma'
when 'action'==-1, populates with the opposite of the previous non-zero value that was populated
FYI: in my data, the row that needs to be filled with the opposite of the previous non-zero item is always at the same distance from that item (i.e., 2 in the current example). This should make the code easier to write.
The code I have made so far is below. It seems right to me; do you have any fixes to propose?
index = 0
while index < df_zinc.shape[0]:
    if df_zinc['action'][index] == 1:
        if df_zinc['PX_LAST'][index] < df_zinc['ma'][index]:
            df_zinc.loc[index, 'buy_sell'] = -1
        else:
            df_zinc.loc[index, 'buy_sell'] = 1
    elif df_zinc['action'][index] == -1:
        df_zinc['buy_sell'][index] = df_zinc['buy_sell'][index-3] * -1
    index = index + 1
df_zinc
The resulting dataframe would look like this:
df_zinc['buy_sell'] = [0,0,1,0,0,-1,0,-1,0,0,1,0]
df_zinc
So, this would be my suggestion according to the example output (and assuming I understood the question properly):
def buy_sell(row):
    if row['action'] == 0:
        return 0
    if row['PX_LAST'] > row['ma']:
        return 1 * (-1 if row['action'] == 0 else 1)
    else:
        return -1 * (-1 if row['action'] == 0 else 1)
    return 0

df_zinc = df_zinc.assign(buy_sell=df_zinc.apply(buy_sell, axis=1))
df_zinc
This should behave as expected by the rules. It does not take into account the possibility of 'PX_LAST' being equal to 'ma', returning 0 by default, as it was not clear what rule to follow in that scenario.
EDIT
OK, after the new logic was explained, I think this should do the trick:
def assign_buysell(df):
    last_nonzero = None
    def buy_sell(row):
        nonlocal last_nonzero
        if row['action'] == 0:
            return 0
        if row['action'] == 1:
            if row['PX_LAST'] < row['ma']:
                last_nonzero = -1
            elif row['PX_LAST'] > row['ma']:
                last_nonzero = 1
        elif row['action'] == -1:
            last_nonzero = last_nonzero * -1
        return last_nonzero
    return df.assign(buy_sell=df.apply(buy_sell, axis=1))

df_zinc = assign_buysell(df_zinc)
This solution is independent of how long ago the non-zero value was seen; it simply remembers the last non-zero value and returns its opposite when action is -1.
You can use np.select, and use np.nan as a label for the rows that satisfy the third condition:
c1 = df_zinc.action.eq(1) & df_zinc.PX_LAST.gt(df_zinc.ma)
c2 = df_zinc.action.eq(1) & df_zinc.PX_LAST.lt(df_zinc.ma)
c3 = df_zinc.action.eq(-1)
df_zinc['buy_sell'] = np.select([c1,c2, c3], [1, -1, np.nan])
Now in order to fill NaNs with the value from n rows above (in this case 3), you can fillna with a shifted version of the dataframe:
df_zinc['buy_sell'] = df_zinc.buy_sell.fillna(df_zinc.buy_sell.shift(3)*-1)
Output
PX_LAST ma action buy_sell
0 1 2 0 0.0
1 2 2 0 0.0
2 3 2 1 1.0
3 3 2 0 0.0
4 3 2 0 0.0
5 1 2 -1 -1.0
6 2 2 0 0.0
7 1 2 1 -1.0
8 1 2 0 0.0
9 1 2 0 0.0
10 3 2 -1 1.0
11 3 2 0 0.0
I would use np.select for this, since you have multiple conditions:
conditions = [
    (df_zinc['action'] == 1) & (df_zinc['PX_LAST'] > df_zinc['ma']),
    (df_zinc['action'] == 1) & (df_zinc['PX_LAST'] < df_zinc['ma']),
    (df_zinc['action'] == -1) & (df_zinc['PX_LAST'] > df_zinc['ma']),
    (df_zinc['action'] == -1) & (df_zinc['PX_LAST'] < df_zinc['ma'])
]
choices = [1, -1, 1, -1]
df_zinc['buy_sell'] = np.select(conditions, choices, default=0)
Result:
print(df_zinc)
PX_LAST ma action buy_sell
0 1 2 0 0
1 2 2 0 0
2 3 2 1 1
3 3 2 0 0
4 3 2 0 0
5 1 2 -1 -1
6 2 2 0 0
7 1 2 1 -1
8 1 2 0 0
9 1 2 0 0
10 3 2 -1 1
11 3 2 0 0
Here is my solution, using shift() to grab the value from 3 rows above:
df_zinc['buy_sell'] = 0
df_zinc.loc[(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] < df_zinc['ma']), 'buy_sell'] = -1
df_zinc.loc[(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] > df_zinc['ma']), 'buy_sell'] = 1
df_zinc.loc[df_zinc['action'] == -1, 'buy_sell'] = -df_zinc['buy_sell'].shift(3)
df_zinc['buy_sell'] = df_zinc['buy_sell'].astype(int)
print(df_zinc)
Output:
PX_LAST ma action buy_sell
0 1 2 0 0
1 2 2 0 0
2 3 2 1 1
3 3 2 0 0
4 3 2 0 0
5 1 2 -1 -1
6 2 2 0 0
7 1 2 1 -1
8 1 2 0 0
9 1 2 0 0
10 3 2 -1 1
11 3 2 0 0
I want to know how I can write Python code for the following problem.
I have a dataframe that contain this column:
Column X
1
0
0
0
1
1
0
0
1
I want to create a list b that replaces each 0 with the count of successive 0 values, to get something like this:
List X
1
3
3
3
1
1
2
2
1
If I understand your question correctly, you want to replace all the zeros with the number of consecutive zeros in the current streak, but leave non-zero numbers untouched. So
1 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 0
becomes
1 4 4 4 4 1 1 1 1 2 2 1 1 1 5 5 5 5 5
To do that, this should work, assuming your input column (a pandas Series) is called x.
result = []
i = 0
while i < len(x):
    if x[i] != 0:
        result.append(x[i])
        i += 1
    else:
        # See how many times zero occurs in a row
        j = i
        n_zeros = 0
        while j < len(x) and x[j] == 0:
            n_zeros += 1
            j += 1
        result.extend([n_zeros] * n_zeros)
        i += n_zeros
result
result
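For comparison, the same zero-run counting can be done without an explicit loop, reusing the run-labelling trick from the answers above (a sketch of mine; x is the input Series):

```python
import pandas as pd

x = pd.Series([1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0])

# Label each run of equal values, then compute each run's size
run_id = x.ne(x.shift()).cumsum()
run_size = x.groupby(run_id).transform('size')

# Keep non-zero values as they are; replace zeros with their run length
result = x.where(x.ne(0), run_size)
print(result.tolist())  # [1, 4, 4, 4, 4, 1, 1, 1, 1, 2, 2, 1, 1, 1, 5, 5, 5, 5, 5]
```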
I have a df like so:
Count
1
0
1
1
0
0
1
1
1
0
and I want to return a 1 in a new column if there are two or more consecutive occurrences of 1 in Count, and a 0 if there are not. So each row in the new column gets a 1 based on this criterion being met in the column Count. My desired output would then be:
Count New_Value
1 0
0 0
1 1
1 1
0 0
0 0
1 1
1 1
1 1
0 0
I am thinking I may need to use itertools, but I have been reading about it and haven't found what I need yet. I would also like the method to work for any number of consecutive occurrences, not just 2. For example, sometimes I need to count 10 consecutive occurrences; I just use 2 in the example here.
You could:
df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
to get:
Count consecutive
0 1 1
1 0 0
2 1 2
3 1 2
4 0 0
5 0 0
6 1 3
7 1 3
8 1 3
9 0 0
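Unpacking that one-liner into named steps (same logic, just spelled out):

```python
import pandas as pd

df = pd.DataFrame({'Count': [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})

# A new run starts wherever the value differs from the previous row
run_id = (df.Count != df.Count.shift()).cumsum()
# Size of the run each row belongs to
run_size = df.Count.groupby(run_id).transform('size')
# Multiplying by Count zeroes out the runs of 0s
df['consecutive'] = run_size * df.Count
print(df['consecutive'].tolist())  # [1, 0, 2, 2, 0, 0, 3, 3, 3, 0]
```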
From here you can, for any threshold:
threshold = 2
df['consecutive'] = (df.consecutive >= threshold).astype(int)
to get:
Count consecutive
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0
or, in a single step:
(df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
In terms of efficiency, using pandas methods provides a significant speedup when the size of the problem grows:
df = pd.concat([df for _ in range(1000)])
%timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
1000 loops, best of 3: 1.47 ms per loop
compared to:
%%timeit
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size
pd.Series(l)
10 loops, best of 3: 76.7 ms per loop
Not sure if this is optimized, but you can give it a try:
from itertools import groupby
import pandas as pd
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size
df['new_Value'] = pd.Series(l)
df
Count new_Value
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0