I have a dataframe and I need to change the 3rd column according to the rule
1) if the difference between row i+1 and row i of the 2nd column is > 1, then add 1 to the 3rd column
I wrote code using a loop, but it takes forever to run.
It is pure Python, and there must be a better way to do this in pandas.
So, how can I rewrite my code in pandas to reduce the running time?
old_store_id = -1
for i in range(0, df_sort.shape[0]):
    if old_store_id != df_sort.iloc[i, 0]:
        old_store_id = df_sort.iloc[i, 0]
        continue
    if (df_sort.iloc[i, 1] - df_sort.iloc[i-1, 1]) > 1:
        df_sort.iloc[i, 2] = df_sort.iloc[i-1, 2] + 1
    else:
        df_sort.iloc[i, 2] = df_sort.iloc[i-1, 2]
df['value'] = df.groupby('store_id')['period_id'].transform(lambda x: (x.diff()>1).cumsum()+1)
So we group by store_id, check when the diff between periods is greater than 1, then take the cumsum of the bool. We added 1 to make the counter start at 1 instead of 0.
Make sure that period_id is sorted correctly before using the above code, otherwise it will not work.
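To see it in action, here is a minimal sketch on invented data (the column names match the snippet above):

import pandas as pd

df = pd.DataFrame({
    "store_id":  [1, 1, 1, 2, 2],
    "period_id": [1, 2, 5, 3, 4],
})
# Within each store, a gap of more than 1 period starts a new counter value
df["value"] = df.groupby("store_id")["period_id"].transform(
    lambda x: (x.diff() > 1).cumsum() + 1
)
print(df["value"].tolist())  # [1, 1, 2, 1, 1]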
I'm trying to find a vectorized way of determining the first instance where my column of data has a sign change. I looked at this question and it gets close to what I want, except it evaluates my first zeros as true. I'm open to different solutions including changing how the data is set up in the first place. I'll detail what I'm doing below.
I have two columns, let's call them positive and negative, that look at a third column. The third column has values ranging between [-5, 5]. When this column is in [3, 5], my positive column gets a +1 on that same row; all other rows are 0 in that column. Likewise, when the third column is between [-5, -3], my negative column gets a -1 in that row; all other rows are 0.
I combine these columns into one column. You can conceptualize this as 'turn machine on, keep it on/off, turn it off, keep it on/off, turn machine on ... etc.' The problem I'm having is that my combined column looks something like below:
pos = [1,1,1,0, 0, 0,0,0,0,0,1, 0,1]
neg = [0,0,0,0,-1,-1,0,0,0,0,0,-1,0]
com = [1,1,1,0,-1,-1,0,0,0,0,1,-1,1]
# Below is what I want to have as the final column.
cor = [1,0,0,0,-1, 0,0,0,0,0,1,-1,1]
The problem with what I've linked is that it gets close, but it evaluates the first 0 as a sign change as well. Zeros should be ignored; I tried a few things, but I seem to be creating new errors. For the sake of completeness, this is what the linked code outputs:
lnk = [True,False,False,True,True,False,True,False,False,False,True,True,True]
As you can see, it handles the repeated 1s and -1s correctly (no flip), but it treats the zeros as sign changes. I'm not sure if I should change how the combined column is made, change the logic that creates the component columns, or both. The big thing is that I need to vectorize this code for performance reasons.
Any help would be greatly appreciated!
Let's suppose your dataframe is named df with columns pos and neg; then you can try something like the following (note the neg column uses < 0, since a 0 → -1 switch shows up as a negative difference):
df.loc[:, "switch_pos"] = (np.diff(df.pos, prepend=0) > 0)*1
df.loc[:, "switch_neg"] = (np.diff(df.neg, prepend=0) < 0)*(-1)
You can then combine your two switch columns.
Explanations
np.diff gives you the row-by-row difference: for the pos column it yields 1 for a 0 → 1 transition and -1 for a 1 → 0 transition. Given your desired output, you only want the 0 → 1 transitions, which is why you keep only the greater-than-zero values (and, symmetrically, only the less-than-zero values for the neg column).
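Putting it together on the question's arrays (a sketch; since the two switch columns never fire on the same row, the combined column is just their sum):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pos": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    "neg": [0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, -1, 0],
})
df["switch_pos"] = (np.diff(df.pos, prepend=0) > 0) * 1
df["switch_neg"] = (np.diff(df.neg, prepend=0) < 0) * (-1)
df["combined"] = df.switch_pos + df.switch_neg
print(df.combined.tolist())
# [1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, -1, 1]  matches the desired `cor`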
I am using the numpy and pandas modules to work with data from an Excel sheet. I want to iterate through a column and make sure each row's value is higher than the previous one's by 1.
For example, if cell A1 of the Excel sheet has a value of 1, I would like to make sure cell A2 has a value of 2. And I would like to do this for the entire column of my Excel sheet.
The problem is I'm not sure if this is a good way to go about doing this.
This is the code I've come up with so far:
import numpy as np
import pandas as pd

i = 1
df = pd.read_excel("HR-Employee-Attrition(1).xlsx")
out = df['EmployeeNumber'].to_numpy().tolist()
print(out)
for i in out:
    if out[i] + 1 == out[i+1]:
        if out[i] == 1470:
            break
        i += 1
        pass
    else:
        print(out[i])
        break
It gives me the error:
IndexError: list index out of range.
Could someone advise me on how to check every row in my excel column?
If I understood the problem correctly, you may need to iterate over the length of the list minus 1 to avoid going out of range:
for i in range(len(out)-1):
    if out[i] + 1 == out[i+1]:
        if out[i] == 1470:
            break
        i += 1
        pass
    else:
        print(out[i])
        break
but there is an easier way to achieve this, which is:
df['EmployeeNumber'].diff()
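For example, a quick sketch of how the .diff() result could be used to flag rows that don't increase by exactly 1 (the EmployeeNumber values here are invented):

import pandas as pd

df = pd.DataFrame({"EmployeeNumber": [1, 2, 3, 5, 6]})
diffs = df["EmployeeNumber"].diff()  # first entry is NaN, then row-to-row differences
bad = df[diffs.notna() & (diffs != 1)]
print(bad)  # rows whose value is not exactly the previous value + 1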
I don't understand why you are using a for-loop for such a thing:
I've created an Excel sheet, with two columns, like this:
Index  Name
1      A
2      B
       C
       D
       E
I selected the two numbers (1 and 2) and double-clicked on the right-bottom corner of the selection rectangle, while recording what I was doing, and this macro got recorded:
Selection.AutoFill Destination:=Range("A2:A6")
As you see, Excel does not write a for-loop for this (a for-loop might prove to be a performance hole in the case of large Excel sheets).
The result on my Excel sheet was:
Index  Name
1      A
2      B
3      C
4      D
5      E
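For completeness, a pandas sketch of the same fill (invented dataframe, not read from the question's file):

import pandas as pd

df = pd.DataFrame({"Name": list("ABCDE")})
# Vectorized fill of 1..n, no loop needed -- the analogue of Excel's AutoFill
df["Index"] = range(1, len(df) + 1)
print(df)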
I have the following dataframe:
df = pd.DataFrame({"A": [1, 2, -3, -4, 5],
                   "B": [1, -2, 3, -4, 5]})
I want to replace, just in column A,
all positive values with 1 and all negative values with 0.
I tried to do it this way:
df[df["A"]>0]["A"] = 1
df[df["A"]<0]["A"] = 0
but that didn't work (the dataframe didn't change at all).
However, the below code did work:
df["A"][df["A"]>0] = 1
df["A"][df["A"]<0] = 0
Can anyone tell me what the difference between the two is?
Why didn't the first one work, while the second one did?
Thanks!
To put it simply:
df[df["A"]>0]["A"] gives you a copy of the dataframe
and df["A"][df["A"]>0] gives you a view of it.
The copy isn't linked to the dataframe, so changing it won't do anything to the original.
Go to this link for more info:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
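As a side note (not part of the answer above, but what the linked docs recommend), a single .loc assignment avoids the copy-versus-view ambiguity entirely:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, -3, -4, 5],
                   "B": [1, -2, 3, -4, 5]})
# One indexing operation, so pandas writes directly into df
df.loc[df["A"] > 0, "A"] = 1
df.loc[df["A"] < 0, "A"] = 0
print(df["A"].tolist())  # [1, 1, 0, 0, 1]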
As an alternative to the above, you can use np.where and do:
import numpy as np
df['A'] = np.where(df['A'] > 0, 1, 0)  # 0's will be given 0 here
which will essentially replace all +ve's with 1, and everything else with 0.
I have a dataframe df where one column is timestamp and one is A. Column A contains decimals.
I would like to add a new column B and fill it with the current value of A divided by the value of A one minute earlier. That is:
df['B'] = df['A'](current) / df['A'](current - 1 min)
NOTE: The data does not come in exactly every 1 minute so "the row one minute earlier" means the row whose timestamp is the closest to (current - 1 minute).
Here is how I do it:
First, I use the timestamp as index, in order to use get_loc, and create a new dataframe new_df that starts 1 minute after df. This way I am sure the data exists when I go look 1 minute earlier, even within the first minute of new_df.
new_df = df.loc[df['timestamp'] > df.timestamp[0] + delta]  # delta = 1 min timedelta
values = []
for index, row in new_df.iterrows():
    v = row.A / df.iloc[df.index.get_loc(row.timestamp - delta, method='nearest')]['A']
    values.append(v)
v_ser = pd.Series(values)
new_df['B'] = v_ser.values
I'm afraid this is not that great. It takes a long time for large dataframes. Also, I am not 100% sure the above is completely correct. Sometimes I get this message:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
What is the best / most efficient way to do the task above? Thank you.
PS. If someone can think of a better title please let me know. It took me longer to write the title than the post and I still don't like it.
You could try to use .asof() if the DataFrame has been indexed correctly by the timestamps (if not, use .set_index() first).
Simple example here
import pandas as pd
import numpy as np

n_vals = 50

# Create a DataFrame with random values and 'unusual times'
df = pd.DataFrame(data=np.random.randint(low=1, high=6, size=n_vals),
                  index=pd.date_range(start=pd.Timestamp.now(),
                                      freq='23s', periods=n_vals),
                  columns=['value'])

# Demonstrate how to use .asof() to get the value that was the 'state' at
# the time 1 min before each index entry. Note the .values call.
df['value_one_min_ago'] = df['value'].asof(df.index - pd.Timedelta('1m')).values

# Note that there will be some NaNs to deal with; consider .fillna()
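From there, the ratio the question asks for would just be (a sketch reusing the columns created above):

# Current value divided by the value ~1 minute earlier
df['B'] = df['value'] / df['value_one_min_ago']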
I've got a fairly large data set of about 2 million records, each of which has a start time and an end time. I'd like to insert a field into each record that counts how many records there are in the table where:
Start time is less than or equal to "this row"'s start time
AND end time is greater than "this row"'s start time
So basically each record ends up with a count of how many events, including itself, are "active" concurrently with it.
I've been trying to teach myself pandas to do this, but I am not even sure where to start looking. I can find lots of examples of summing rows that meet a given condition like '> 2', but I can't seem to grasp how to iterate over rows to conditionally sum a column based on values in the current row.
You can try the code below to get the final result.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[2, 10], [5, 8], [3, 8], [6, 9]]), columns=["start", "end"])
active_events = {}
for i in df.index:
    active_events[i] = len(df[(df["start"] <= df.loc[i, "start"]) & (df["end"] > df.loc[i, "start"])])
last_columns = pd.DataFrame({'No. active events': pd.Series(active_events)})
df.join(last_columns)
Here goes. This is going to be SLOW.
Note that this counts each row as overlapping with itself, so the results column will never be 0. (Subtract 1 from the result to do it the other way.)
import pandas as pd

df = pd.DataFrame({'start_time': [4, 3, 1, 2], 'end_time': [7, 5, 3, 8]})
df = df[['start_time', 'end_time']]  # just changing the order of the columns for aesthetics

def overlaps_with_row(row, frame):
    starts_before_mask = frame.start_time <= row.start_time
    ends_after_mask = frame.end_time > row.start_time
    return (starts_before_mask & ends_after_mask).sum()

df['number_which_overlap'] = df.apply(overlaps_with_row, frame=df, axis=1)
Yields:
In [8]: df
Out[8]:
start_time end_time number_which_overlap
0 4 7 3
1 3 5 2
2 1 3 1
3 2 8 2
[4 rows x 3 columns]
def counter(s: pd.Series):
    return ((df["start"] <= s["start"]) & (df["end"] >= s["start"])).sum()

df["count"] = df.apply(counter, axis=1)
This feels like a much simpler approach, using the apply method. It doesn't really compromise on speed either: apply, although not as fast as vectorized operations like cumsum(), should still be faster than an explicit for loop.
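If apply is still too slow at two million rows, a fully vectorized sketch is possible with np.searchsorted, assuming 'active' means start <= s and end > s as in the first two answers: the count for a start s is (number of starts <= s) minus (number of ends <= s).

import numpy as np
import pandas as pd

df = pd.DataFrame({'start_time': [4, 3, 1, 2], 'end_time': [7, 5, 3, 8]})
starts = np.sort(df['start_time'].to_numpy())
ends = np.sort(df['end_time'].to_numpy())
s = df['start_time'].to_numpy()
# events begun on or before s, minus events already finished by s
df['number_which_overlap'] = (np.searchsorted(starts, s, side='right')
                              - np.searchsorted(ends, s, side='right'))
print(df['number_which_overlap'].tolist())  # [3, 2, 1, 2]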