I have a dataframe that looks like this:
print(df)
Out[1]:
Numbers
0 0
1 1
2 1
3 1
4 1
5 0
6 0
7 1
8 0
9 1
10 1
I want to transform it to this:
print(dfo)
Out[2]:
Numbers
0 0
1 1
2 2
3 3
4 4
5 0
6 0
7 1
8 0
9 1
10 2
I thought the solution could be an iloc with two ifs:
Check whether the value in df is 1 (first if). If it is, check whether the value at i-1 is also 1 (second if); if so, take the value of dfo at i-1 and add 1. In the elif branches, just put the value 0 in dfo.
I've tried this:
# Import pandas library
import pandas as pd
# initialize list elements
list = [0,1,1,1,1,0,0,1,0,1,1]
# Create the pandas DataFrame with column name is provided explicitly
df = pd.DataFrame(list, columns=['Numbers'])
# print dataframe.
df
data1c = df.copy()
for j in df:
    for i in range(len(df)):
        if df.loc[i, j] == 1:
            if df.loc[i-1, j] == 1:
                data1c.loc[i, j] = data1c.loc[i-1, j]+1
            elif df.loc[i-1, j] == 0:
                data1c.loc[i, j] = 1
        elif df.loc[i, j] == 0:
            data1c.loc[i, j] = 0
print(data1c)
Numbers
0 0
1 1
2 2
3 3
4 4
5 0
6 0
7 1
8 0
9 1
10 2
For a dataframe with one column it works, but when I tried it with a dataframe with two columns:
input = {'A': [0,1,1,1,1,0,0,1,0,1,1,0,1,1],
'B': [1,1,0,0,1,1,1,0,0,1,0,1,1,0]}
df = pd.DataFrame(input)
# Print the output.
df
data2c = df.copy()
for j in df:
    for i in range(len(df)):
        if df.loc[i, j] == 1:
            if df.loc[i-1, j] == 1:
                data2c.loc[i, j] = data2c.loc[i-1, j]+1
            elif df.loc[i-1, j] == 0:
                data2c.loc[i, j] = 1
        elif df.loc[i, j] == 0:
            data2c.loc[i, j] = 0
I get :
File "C:\Users\talls\.conda\envs\Spyder\lib\site-packages\pandas\core\indexes\range.py", line 393, in get_loc
raise KeyError(key) from err
KeyError: -1
Why do I get this error and how do I fix it?
Or is there another way to get my desired output?
I know this is not the answer to the question "How to use iloc to get reference from the rows above?", but it is the answer to your proposed question.
df = pd.DataFrame([0,1,1,1,1,0,0,1,0,1,1], columns=['Numbers'])
df['tmp'] = (~(df['Numbers'] == df['Numbers'].shift(1))).cumsum()
df['new'] = df.groupby('tmp')['Numbers'].cumsum()
print(df)
Numbers tmp new
0 0 1 0
1 1 2 1
2 1 2 2
3 1 2 3
4 1 2 4
5 0 3 0
6 0 3 0
7 1 4 1
8 0 5 0
9 1 6 1
10 1 6 2
How does this work? The inner part ~(df['Numbers'] == df['Numbers'].shift(1)) checks whether the previous row is the same as the current row. This also works for the first row, because comparing a number with NaN always yields False. Negating the result marks the start of each new sequence with True. The cumulative sum of those markers gives every run its own number in the tmp column. Grouping by tmp and taking a cumulative sum of Numbers within each group then gives the required answer in the new column.
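The intermediate steps described above can be printed one at a time. This is only a sketch on the question's sample data; the names is_new_run, run_id and counts are made up for illustration and are not part of the answer's code.

```python
import pandas as pd

df = pd.DataFrame({'Numbers': [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1]})

# True wherever the value differs from the previous row
# (row 0 is compared with NaN, so it also counts as a new run)
is_new_run = df['Numbers'] != df['Numbers'].shift(1)

# Running count of run starts: every run gets its own label
run_id = is_new_run.cumsum()

# Cumulative sum inside each run: runs of 1s count up, runs of 0s stay 0
counts = df['Numbers'].groupby(run_id).cumsum()

print(is_new_run.tolist())
print(run_id.tolist())
print(counts.tolist())
```

The run_id values are the same as the tmp column in the output above.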
For the two-column version, you'd do exactly the same for both columns A and B:
df = pd.DataFrame(input)
df['tmp'] = (~(df['A'] == df['A'].shift(1))).cumsum()
df['newA'] = df.groupby('tmp')['A'].cumsum()
#Just reusing column tmp
df['tmp'] = (~(df['B'] == df['B'].shift(1))).cumsum()
df['newB'] = df.groupby('tmp')['B'].cumsum()
print(df)
A B tmp newA newB
0 0 1 1 0 1
1 1 1 1 1 2
2 1 0 2 2 0
3 1 0 2 3 0
4 1 1 3 4 1
5 0 1 3 0 2
6 0 1 3 0 3
7 1 0 4 1 0
8 0 0 4 0 0
9 1 1 5 1 1
10 1 0 6 2 0
11 0 1 7 0 1
12 1 1 7 1 2
13 1 0 8 2 0
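If there are more than two columns, the same idea can be wrapped in a small helper that loops over the columns. This is a sketch only: the function name count_consecutive_ones is hypothetical, and it assumes every column contains only 0s and 1s.

```python
import pandas as pd

def count_consecutive_ones(frame):
    """Return a frame in which each run of 1s is numbered 1, 2, 3, ... and 0s stay 0."""
    result = pd.DataFrame(index=frame.index)
    for col in frame.columns:
        # Label each run of equal values in this column
        run_id = (frame[col] != frame[col].shift(1)).cumsum()
        # Cumulative sum within each run
        result[col] = frame[col].groupby(run_id).cumsum()
    return result

data = {'A': [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
        'B': [1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]}
out = count_consecutive_ones(pd.DataFrame(data))
print(out)
```

This avoids the shared tmp column and leaves the input frame untouched.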
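To answer the question you originally asked (I mentioned it in a comment already): the KeyError comes from the first row. When i == 0, .loc[i-1, j] asks for the label -1, which does not exist in the index. Column A starts with 0, so the inner check is never reached for i == 0, but column B starts with 1. You need to put in a safeguard against i == 0. You can do that in two ways: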
for j in df:
    for i in range(len(df)):
        if i == 0:
            continue
        elif df.loc[i, j] == 1:
            if df.loc[i-1, j] == 1:
                data2c.loc[i, j] = data2c.loc[i-1, j]+1
            elif df.loc[i-1, j] == 0:
                data2c.loc[i, j] = 1
        elif df.loc[i, j] == 0:
            data2c.loc[i, j] = 0
or start at 1 instead of 0:
for j in df:
    for i in range(1,len(df)):
        if df.loc[i, j] == 1:
            if df.loc[i-1, j] == 1:
                data2c.loc[i, j] = data2c.loc[i-1, j]+1
            elif df.loc[i-1, j] == 0:
                data2c.loc[i, j] = 1
        elif df.loc[i, j] == 0:
            data2c.loc[i, j] = 0
The resulting dataframe:
A B
0 0 1
1 1 2
2 2 0
3 3 0
4 4 1
5 0 2
6 0 3
7 1 0
8 0 0
9 1 1
10 2 0
11 0 1
12 1 2
13 2 0
I am very new to python and coding. I have this homework that I have to do:
You will receive on the first line the number of rows of the matrix (n), and on the next n lines you will get each row of the matrix as a string (zeros and ones separated by a single space). You have to calculate how many blocks you have (ones connected horizontally or vertically). Here are examples:
Input:
5
1 1 0 0 0
1 1 0 0 0
0 0 0 0 0
0 0 0 1 1
0 0 0 1 1
Output:
2
Input:
6
1 1 0 1 0 1
0 1 1 1 1 1
0 1 0 0 0 0
0 1 1 0 0 0
0 1 1 1 1 0
0 0 0 1 1 0
Output:
1
Input:
4
0 1 0 1 1 0
1 0 1 1 0 1
1 0 0 0 0 0
0 0 0 1 0 0
Output:
5
The code I have come up with so far is:
n = int(input())
blocks = 0
matrix = [[int(i) for i in input().split()] for j in range(n)]
#loop or something to find the blocks in the matrix
print(blocks)
Any help will be greatly appreciated.
def valid(y, x):
    # True when (y, x) lies inside the matrix
    return 0 <= y < N and 0 <= x < horizontal_len

def find_blocks(y, x):
    Q.append(y)
    Q.append(x)
    # search in 4 directions (right, down, left, up)
    dy = [0, 1, 0, -1]
    dx = [1, 0, -1, 0]
    # when nothing is left in Q, the whole block has been visited
    while Q:
        y = Q.pop(0)
        x = Q.pop(0)
        for d in range(len(dy)):
            next_y = y + dy[d]
            next_x = x + dx[d]
            # if the neighbour is inside the matrix and is 1 (not 0), include it in the block
            if valid(next_y, next_x) and matrix[next_y][next_x] == 1:
                Q.append(next_y)
                Q.append(next_x)
                matrix[next_y][next_x] = -1

N = int(input())
matrix = []
for rows in range(N):
    row = list(map(int, input().split()))
    matrix.append(row)
# row length
horizontal_len = len(matrix[0])
blocks = 0
# search from matrix[0][0] to matrix[N-1][horizontal_len-1]
for start_y in range(N):
    for start_x in range(horizontal_len):
        # if a cell is 1, start a new block from it
        if matrix[start_y][start_x] == 1:
            # mark visited 1s as -1 so they are not counted again
            matrix[start_y][start_x] = -1
            Q = []
            # visit the rest of the block
            find_blocks(start_y, start_x)
            blocks += 1
print(blocks)
I used the BFS algorithm to solve this question. The comments may not be enough to understand the logic.
If you have questions about this solution, let me know!
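For comparison, here is a sketch of the same BFS idea written as a function that takes the matrix as an argument instead of reading it from input, so it can be tried on the question's examples directly. The name count_blocks and the use of collections.deque and a seen set are my own choices, not part of the answer above. It assumes blocks are joined only through the four horizontal and vertical neighbours.

```python
from collections import deque

def count_blocks(matrix):
    """Count groups of 1s that touch horizontally or vertically."""
    rows = len(matrix)
    seen = set()
    blocks = 0
    for start_y in range(rows):
        for start_x in range(len(matrix[start_y])):
            if matrix[start_y][start_x] != 1 or (start_y, start_x) in seen:
                continue
            # An unvisited 1 starts a new block; BFS marks the rest of it
            blocks += 1
            seen.add((start_y, start_x))
            queue = deque([(start_y, start_x)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < len(matrix[ny])
                            and matrix[ny][nx] == 1 and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
    return blocks

example = [
    [0, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]
print(count_blocks(example))
```

Unlike the version above, this one does not overwrite the input matrix, and popleft() on a deque avoids the cost of pop(0) on a list.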
Following is the Dataframe I am starting from:
import pandas as pd
import numpy as np
d= {'PX_LAST':[1,2,3,3,3,1,2,1,1,1,3,3],'ma':[2,2,2,2,2,2,2,2,2,2,2,2],'action':[0,0,1,0,0,-1,0,1,0,0,-1,0]}
df_zinc = pd.DataFrame(data=d)
df_zinc
Now, I need to add a column called 'buy_sell', which:
when 'action'==1, populates with 1 if 'PX_LAST' >'ma', and with -1 if 'PX_LAST'<'ma'
when 'action'==-1, populates with the opposite of the previous non-zero value that was populated
FYI: in my data, the row that needs to be filled with the opposite of the previous non-zero item is always at the same distance from the previous non-zero item (i.e., 2 in the current example). This should facilitate making the code.
The code that I have written so far is the following. It seems right to me. Do you have any fixes to propose?
while index < df_zinc.shape[0]:
    if df_zinc['action'][index] == 1:
        if df_zinc['PX_LAST'][index]<df_zinc['ma'][index]:
            df_zinc.loc[index,'buy_sell'] = -1
        else:
            df_zinc.loc[index,'buy_sell'] = 1
    elif df_zinc['action'][index] == -1:
        df_zinc['buy_sell'][index] = df_zinc['buy_sell'][index-3]*-1
    index=index+1
df_zinc
the resulting dataframe would look like this:
df_zinc['buy_sell'] = [0,0,1,0,0,-1,0,-1,0,0,1,0]
df_zinc
So, this would be my suggestion according to the example output (assuming I understood the question properly):
def buy_sell(row):
    if row['action'] == 0:
        return 0
    if row['PX_LAST'] > row['ma']:
        return 1
    elif row['PX_LAST'] < row['ma']:
        return -1
    # PX_LAST equal to ma: no rule given, so default to 0
    return 0
df_zinc = df_zinc.assign(buy_sell=df_zinc.apply(buy_sell, axis=1))
df_zinc
This should behave as expected by the rules. It does not take into account the possibility of 'PX_LAST' being equal to 'ma', returning 0 by default, as it was not clear what rule to follow in that scenario.
EDIT
OK, now that the new logic has been explained, I think this should do the trick:
def assign_buysell(df):
    last_nonzero = None

    def buy_sell(row):
        nonlocal last_nonzero
        if row['action'] == 0:
            return 0
        if row['action'] == 1:
            if row['PX_LAST'] < row['ma']:
                last_nonzero = -1
            elif row['PX_LAST'] > row['ma']:
                last_nonzero = 1
        elif row['action'] == -1:
            last_nonzero = last_nonzero * -1
        return last_nonzero

    return df.assign(buy_sell=df.apply(buy_sell, axis=1))

df_zinc = assign_buysell(df_zinc)
This solution is independent of how long ago the nonzero value was seen. It simply remembers the last nonzero value and returns the opposite when action is -1.
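The same "remember the last nonzero value" idea can also be sketched without apply, by forward-filling the signal from the action == 1 rows. This is an illustration under assumptions, not a drop-in replacement: it assumes every -1 action is preceded by a 1 action and that 1 and -1 actions alternate, and it yields 0 on a 1 row where PX_LAST equals ma. The names is_entry, is_exit, entry_signal and last_entry are made up for this example.

```python
import numpy as np
import pandas as pd

df_zinc = pd.DataFrame({
    'PX_LAST': [1, 2, 3, 3, 3, 1, 2, 1, 1, 1, 3, 3],
    'ma': [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'action': [0, 0, 1, 0, 0, -1, 0, 1, 0, 0, -1, 0],
})

is_entry = df_zinc['action'].eq(1)
is_exit = df_zinc['action'].eq(-1)

# +1 or -1 on the action == 1 rows, depending on which side of ma the price is; NaN elsewhere
entry_signal = np.sign(df_zinc['PX_LAST'] - df_zinc['ma']).where(is_entry)

# Carry the most recent entry signal forward so the action == -1 rows can see it
last_entry = entry_signal.ffill()

df_zinc['buy_sell'] = np.select(
    [is_entry, is_exit],
    [entry_signal, -last_entry],
    default=0,
)
print(df_zinc)
```

Like the closure above, this does not depend on how many rows separate the 1 and the -1.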
You can use np.select, and use np.nan as a label for the rows that satisfy the third condition:
c1 = df_zinc.action.eq(1) & df_zinc.PX_LAST.gt(df_zinc.ma)
c2 = df_zinc.action.eq(1) & df_zinc.PX_LAST.lt(df_zinc.ma)
c3 = df_zinc.action.eq(-1)
df_zinc['buy_sell'] = np.select([c1,c2, c3], [1, -1, np.nan])
Now in order to fill NaNs with the value from n rows above (in this case 3), you can fillna with a shifted version of the dataframe:
df_zinc['buy_sell'] = df_zinc.buy_sell.fillna(df_zinc.buy_sell.shift(3)*-1)
Output
PX_LAST ma action buy_sell
0 1 2 0 0.0
1 2 2 0 0.0
2 3 2 1 1.0
3 3 2 0 0.0
4 3 2 0 0.0
5 1 2 -1 -1.0
6 2 2 0 0.0
7 1 2 1 -1.0
8 1 2 0 0.0
9 1 2 0 0.0
10 3 2 -1 1.0
11 3 2 0 0.0
I would use np.select for this, since you have multiple conditions:
conditions = [
(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] > df_zinc['ma']),
(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] < df_zinc['ma']),
(df_zinc['action'] == -1) & (df_zinc['PX_LAST'] > df_zinc['ma']),
(df_zinc['action'] == -1) & (df_zinc['PX_LAST'] < df_zinc['ma'])
]
choices = [1, -1, 1, -1]
df_zinc['buy_sell'] = np.select(conditions, choices, default=0)
result
print(df_zinc)
PX_LAST ma action buy_sell
0 1 2 0 0
1 2 2 0 0
2 3 2 1 1
3 3 2 0 0
4 3 2 0 0
5 1 2 -1 -1
6 2 2 0 0
7 1 2 1 -1
8 1 2 0 0
9 1 2 0 0
10 3 2 -1 1
11 3 2 0 0
Here is my solution, using the function shift() to fetch the value from three rows above:
df_zinc['buy_sell'] = 0
df_zinc.loc[(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] < df_zinc['ma']), 'buy_sell'] = -1
df_zinc.loc[(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] > df_zinc['ma']), 'buy_sell'] = 1
df_zinc.loc[df_zinc['action'] == -1, 'buy_sell'] = -df_zinc['buy_sell'].shift(3)
df_zinc['buy_sell'] = df_zinc['buy_sell'].astype(int)
print(df_zinc)
output:
PX_LAST ma action buy_sell
0 1 2 0 0
1 2 2 0 0
2 3 2 1 1
3 3 2 0 0
4 3 2 0 0
5 1 2 -1 -1
6 2 2 0 0
7 1 2 1 -1
8 1 2 0 0
9 1 2 0 0
10 3 2 -1 1
11 3 2 0 0
I have a DataFrame that is as follows:
df[16820:16830]
data0 start_stop
16820 1 0
16821 1 1
16822 1 0
16823 1 0
16824 1 0
16825 1 -1
16826 0 0
16827 0 0
16828 1 1
16829 0 0
16830 1 -1
What I need to do is mark the values between a 1 and the following -1 in the start_stop column as valid (1 means 'start', -1 means 'stop') and the values between a -1 and the next 1 as invalid (rubbish that I will later discard).
Is there any efficient way to do this instead of iterating with loops over the whole dataframe?
The end result would look like this:
data0 start_stop valid
16820 1 0 False
16821 1 1 True
16822 1 0 True
16823 1 0 True
16824 1 0 True
16825 1 -1 False
16826 0 0 False
16827 0 0 False
16828 1 1 True
16829 0 0 True
16830 1 -1 False
...
The loop that would achieve it is, I think, this:
df = df.reset_index(drop=True)
value = False
for i in range(0,df.shape[0]):
    if df.loc[i, 'start_stop'] == 1:
        df.loc[i,'valid'] = True
        value = True
    elif df.loc[i, 'start_stop'] == -1:
        df.loc[i, 'valid'] = False
        value = False
    if df.loc[i, 'start_stop'] == 0:
        df.loc[i, 'valid'] = value
Thanks!
This should work
df['valid'] = df.start_stop.cumsum()
Then
df['valid'] = df['valid'].apply(lambda x: True if x==1 else False)
df
start_stop valid
0 0 False
1 1 True
2 0 True
3 0 True
4 0 True
5 -1 False
6 0 False
7 0 False
8 1 True
9 0 True
10 -1 False
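The two steps above can be tried end to end on the sample values. This is a minimal sketch that rebuilds only the start_stop column from the question and uses .eq(1) in place of the lambda. It assumes starts and stops strictly alternate and that the data begins outside a valid interval; otherwise the running sum would not stay within 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({'start_stop': [0, 1, 0, 0, 0, -1, 0, 0, 1, 0, -1]})

# The running sum is 1 from each start up to the row before the matching stop, 0 elsewhere
running = df['start_stop'].cumsum()
df['valid'] = running.eq(1)
print(df)
```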
I have this Pandas dataframe df:
station a_d direction
a 0 0
a 0 0
a 1 0
a 0 0
a 1 0
b 0 0
b 1 0
c 0 0
c 1 0
c 0 1
c 1 1
b 0 1
b 1 1
b 0 1
b 1 1
a 0 1
a 1 1
a 0 0
a 1 0
I'd like to assign a value_id that increments whenever the direction value changes, and that is set only on the last pair of rows for each station (a pair being two consecutive rows with a_d values 0 and 1) before the station changes. The last rows of the dataframe (the last two in this example) can be ignored. In other words:
station a_d direction id_value
a 0 0
a 0 0
a 1 0
a 0 0 0
a 1 0 0
b 0 0 0
b 1 0 0
c 0 0 0
c 1 0 0
c 0 1 1
c 1 1 1
b 0 1
b 1 1
b 0 1 1
b 1 1 1
a 0 1 1
a 1 1 1
a 0 0
a 1 0
Using df.iterrows() I wrote this script:
df['value_id'] = ""
value_id = 0
row_iterator = df.iterrows()
for i, row in row_iterator:
    if i == 0:
        continue
    elif (df.loc[i-1,'direction'] != df.loc [i,'direction']):
        value_id += 1
    for z in range(1,11):
        if i+z >= len(df)-1:
            break
        elif (df.loc[i+1,'a_d'] == df.loc [i,'a_d']):
            break
        elif (df.loc[i+1,'a_d'] != df.loc [i,'a_d']) and (df.loc [i+2,'station'] == df.loc [i,'station'] and (df.loc [i+2,'direction'] == df.loc [i,'direction'])):
            break
        else:
            df.loc[i,'value_id'] = value_id
It works, but it's very slow. With a dataframe of 10*10^6 rows, I need a faster way. Any ideas?
@user5402's code works well, but I noticed that adding a break after the last else also reduces the computation time:
df['value_id'] = ""
value_id = 0
row_iterator = df.iterrows()
for i, row in row_iterator:
    if i == 0:
        continue
    elif (df.loc[i-1,'direction'] != df.loc [i,'direction']):
        value_id += 1
    for z in range(1,11):
        if i+z >= len(df)-1:
            break
        elif (df.loc[i+1,'a_d'] == df.loc [i,'a_d']):
            break
        elif (df.loc[i+1,'a_d'] != df.loc [i,'a_d']) and (df.loc [i+2,'station'] == df.loc [i,'station'] and (df.loc [i+2,'direction'] == df.loc [i,'direction'])):
            break
        else:
            df.loc[i,'value_id'] = value_id
            break
You are not effectively using z in the inner for loop. You never access the i+z-th row. You access the i-th row and the i+1-th row and the i+2-th row, but never the i+z-th row.
You can replace that inner for loop with:
# same bound as the original check with z == 1, so row i+2 always exists below
if i+1 >= len(df)-1:
    pass
elif (df.loc[i+1,'a_d'] == df.loc [i,'a_d']):
    pass
elif (df.loc [i+2,'station'] == df.loc [i,'station'] and (df.loc [i+2,'direction'] == df.loc [i,'direction'])):
    pass
else:
    df.loc[i,'value_id'] = value_id
Note that I also slightly optimized the second elif because at that point you already know df.loc[i+1,'a_d'] does not equal df.loc [i,'a_d'].
Not having to loop over z will save a lot of time.
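One part of the loop can be computed for all rows at once without iterating: the counter that goes up each time direction changes. The sketch below covers only that counter, not the rule that decides which rows receive a value_id. It rebuilds only the direction column from the question, and the name direction_counter is made up for illustration.

```python
import pandas as pd

# direction column from the question: 9 rows of 0, 8 rows of 1, 2 rows of 0
df = pd.DataFrame({'direction': [0] * 9 + [1] * 8 + [0] * 2})

# ne(shift()) is True on every row whose direction differs from the previous row
# (the first row is compared with NaN, so it is True too, hence the -1)
direction_counter = df['direction'].ne(df['direction'].shift()).cumsum() - 1
print(direction_counter.tolist())
```

On each row this holds the value the loop's value_id variable has when that row is processed.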