Hi I have a dataframe in pandas like below,
exit  new_column
0     0
0     0
1     1
0     0
0     0
0     0
0     0
0     0
1     1
I need to create the desired output column as given below. For each 1 in the exit column, check whether another 1 occurs within the next 10 rows. If so, set the new column to 1 for the current (earlier) occurrence of 1 and to 0 for the later occurrence. If there is no other 1 in the next 10 rows, set the new column to 1 for the current occurrence. Wherever exit is 0, the new column is 0.
exit  new_column  desired_output
0     0           0
0     0           0
1     1           1
0     0           0
0     0           0
0     0           0
0     0           0
0     0           0
1     1           0
I tried the code below, but I am not able to achieve the desired output column; I get results similar to new_column, which is not intended.
df['new_column'] = 0
for i, row in df.iterrows():
    if row['exit'] == 1:
        next_rows = df.loc[i+1:i+10, 'exit']
        if (next_rows == 1).any():
            df.loc[i, 'new_column'] = 1
            later_occurrence_index = next_rows[next_rows == 1].index[0]
            df.loc[later_occurrence_index, 'new_column'] = 0
        else:
            df.loc[i, 'new_column'] = 1
    else:
        df.loc[i, 'new_column'] = 0
You can use rolling:
# flag a row only if it is a 1 and it is the sole 1 in its trailing window of 10 rows
check_one = lambda x: (x.iloc[-1] == 1) & (x.sum() == 1)

df['out'] = (df[['exit', 'new_column']].eq(1).all(axis=1)
               .rolling(10, min_periods=1)
               .apply(check_one)
               .astype(int))
print(df)
# Output
exit new_column out
0 0 0 0
1 0 0 0
2 1 1 1
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 1 1 0
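For comparison, a minimal forward-looking sketch that follows one literal reading of the stated rule, consulting only the exit column; it rebuilds the 9-row sample above, and this loop is just an illustration, not the answer's method:
import pandas as pd

df = pd.DataFrame({'exit': [0, 0, 1, 0, 0, 0, 0, 0, 1]})

df['desired_output'] = 0
skip = set()  # positions of 1s already claimed as a "later occurrence"
for i in df.index[df['exit'] == 1]:
    if i in skip:
        continue                      # later occurrence of an earlier 1 -> stays 0
    df.loc[i, 'desired_output'] = 1   # current occurrence gets 1
    nxt = df.loc[i + 1:i + 10, 'exit']
    later = nxt[nxt == 1].index
    if len(later):
        skip.add(later[0])            # the next 1 within 10 rows stays 0
print(df)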
I'm trying to fill in a column with numbers from -5000 to 5004, stepping by 4, between a condition in one column and a condition in another. The count starts when start == 1. The count won't always get to 5004, so it needs to stop when end == 1.
Here is an example of the input:
start end
1 0
0 0
0 0
0 0
0 0
0 1
0 0
0 0
1 0
0 0
I have tried np.arange:
df['time'] = df['start'].apply(lambda x: np.arange(-5000,5004,4) if x==1 else 0)
This obviously doesn't work - I ended up with a series in one cell. I also messed around with cycle from itertools, but that doesn't work because the distances between the start and end aren't always equal. I also feel there might be a way to do this with ffill:
rise = df[df.start.where(df.start==1).ffill(limit=1250).notnull()]
Not sure how to edit this to stop at the correct place though.
I'd love to have a lambda function that achieves this, but I'm not sure where to go from here.
Here is my expected output:
start end time
1 0 -5000
0 0 -4996
0 0 -4992
0 0 -4988
0 0 -4984
0 1 -4980
0 0 nan
0 0 nan
1 0 -5000
0 0 -4996
# start a new group at every start==1 row and at the row just after every end==1 row
grouping = df['start'].add(df['end'].shift(1).fillna(0)).cumsum()
# number rows within each group: 0, 1, 2, ... becomes -5000, -4996, -4992, ...
df['time'] = df.groupby(grouping).cumcount() * 4 - 5000
# blank out groups that contain neither a start nor an end flag
df.loc[df.groupby(grouping).filter(lambda x: x[['start', 'end']].sum().sum() == 0).index, 'time'] = np.nan
Output:
>>> df
start end time
0 1 0 -5000.0
1 0 0 -4996.0
2 0 0 -4992.0
3 0 0 -4988.0
4 0 0 -4984.0
5 0 1 -4980.0
6 0 0 NaN
7 0 0 NaN
8 1 0 -5000.0
9 0 0 -4996.0
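A minimal sketch, assuming the sample frame above, that prints the intermediate grouping labels so you can see where each counting segment begins:
import pandas as pd

df = pd.DataFrame({
    'start': [1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    'end':   [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
})

# the label increments at every start row and right after every end row,
# giving [1, 1, 1, 1, 1, 1, 2, 2, 3, 3] for this sample
grouping = df['start'].add(df['end'].shift(1).fillna(0)).cumsum()
print(grouping.astype(int).tolist())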
a = [[0,0,0,0],[0,-1,1,0],[1,-1,1,0],[1,-1,1,0]]
df = pd.DataFrame(a, columns=['A','B','C','D'])
df
Output:
A B C D
0 0 0 0 0
1 0 -1 1 0
2 1 -1 1 0
3 1 -1 1 0
Reading down each column, all values begin at 0 on the first row; once they change they can never change back, and they become either 1 or -1. I would like to rearrange the dataframe columns into this order:
1. Columns that hit 1, ordered by the earliest row in which they do so
2. Columns that hit -1, ordered by the earliest row in which they do so
3. Finally, the remaining columns that never changed value and stayed 0 (if there are any left)
Desired Output:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
My main data frame is 3000 rows by 61 columns; is there any way of doing this quickly?
We have to handle the positive and negative values separately. One way is to take the sum of each column and then use sort_values to adjust the ordering; since values never revert, a larger positive sum means an earlier 1 and a more negative sum means an earlier -1:
a = df.sum().sort_values(ascending=False)
b = pd.concat((a[a.gt(0)], a[a.lt(0)].sort_values(), a[a.eq(0)]))
out = df.reindex(columns=b.index)
print(out)
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
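A small sketch, assuming the 4x4 sample frame above, that shows the intermediate ordering values and why the sums encode the desired order:
import pandas as pd

df = pd.DataFrame([[0, 0, 0, 0], [0, -1, 1, 0], [1, -1, 1, 0], [1, -1, 1, 0]],
                  columns=['A', 'B', 'C', 'D'])

# column sums: C=3, A=2, D=0, B=-3; because values never revert,
# a larger positive sum means an earlier 1 and a more negative sum an earlier -1
a = df.sum().sort_values(ascending=False)
b = pd.concat((a[a.gt(0)], a[a.lt(0)].sort_values(), a[a.eq(0)]))
print(b.index.tolist())  # ['C', 'A', 'B', 'D']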
Try with pd.Series.first_valid_index:
s = df.where(df.ne(0))                     # NaN out the zeros
s1 = s.apply(pd.Series.first_valid_index)  # first row where each column becomes nonzero
s2 = s.bfill().iloc[0]                     # the value it eventually becomes (1, -1, or NaN if never)
out = df.loc[:, pd.concat([s2, s1], axis=1, keys=[0, 1])
               .sort_values([0, 1], ascending=[False, True]).index]
out
Out[35]:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
I have a dataset like:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
1 0
2 0
3 0
3 0
I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
2 0
3 0
3 0
i.e.
1 0 --> gets removed since this row appears after id 1 already had a status of 1
How can I implement this efficiently, given that I have a very large (200 GB+) dataset?
Thanks for your help.
Here's an idea:
You can create a dict with the first index where the status is 1 for each ID (assuming the DataFrame is sorted by ID):
d = df.loc[df["Status"]==1].drop_duplicates()
d = dict(zip(d["Id"], d.index))
Then you add a column holding that first index for each Id:
df["first"] = df["Id"].map(d)
Finally you keep every row whose index is at most that first index; rows whose Id never reaches 1 get NaN in first and should be kept as well:
df = df.loc[(df.index <= df["first"]) | df["first"].isna()]
EDIT: Revisiting this question a month later, there is actually a much simpler way with groupby and cumsum: group by Id, take the cumulative sum of Status, and drop every row that comes after the first 1 for its Id (subtracting the current Status keeps the row where Status first becomes 1):
df[df.groupby('Id')['Status'].cumsum().sub(df['Status']).eq(0)]
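A minimal sketch, rebuilding the sample data above and assuming the column names Id and Status, that shows the intermediate per-Id cumulative sums behind that one-liner:
import pandas as pd

df = pd.DataFrame({
    'Id':     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    'Status': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# cs counts how many 1s each Id has seen so far, including the current row;
# subtracting the current Status leaves the count of earlier 1s for that Id
cs = df.groupby('Id')['Status'].cumsum()
keep = cs.sub(df['Status']).eq(0)
print(df[keep])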
The best way I have found is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:
def remove(series):
    indexless = series.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless
df.groupby('Id').apply(remove).reset_index(drop=True)
Output:
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
6 2 0
7 3 0
8 3 0
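Since the question mentions a 200 GB+ dataset, here is a rough, hedged sketch of applying the same filtering chunk by chunk; it assumes the data lives in a CSV (data.csv is a placeholder name), that chunks are read in row order, and it carries the set of Ids that have already hit Status 1 across chunk boundaries:
import pandas as pd

seen = set()        # Ids whose Status already reached 1 in an earlier chunk
kept_chunks = []

for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
    cs = chunk.groupby('Id')['Status'].cumsum()
    # keep rows with no earlier 1 for their Id, neither in this chunk nor in a previous one
    keep = cs.sub(chunk['Status']).eq(0) & ~chunk['Id'].isin(seen)
    kept_chunks.append(chunk[keep])
    # remember Ids that reached Status 1 in this chunk
    seen.update(chunk.loc[chunk['Status'] == 1, 'Id'].unique())

result = pd.concat(kept_chunks, ignore_index=True)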
Use groupby with cumsum to find, for each Id, the rows at and after its first Status of 1.
res = df.groupby('Id', group_keys=False).apply(lambda x: x[x.Status.cumsum() > 0])
res
Id Status
4 1 1
6 1 0
Then exclude the indices in res where Status == 0.
not_select_id = res[res.Status==0].index
df[~df.index.isin(not_select_id)]
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
I have the following dataframe:
c3ann c3nfx c3per c4ann c4per pastr primf
c3ann 1 0 1 0 1 0 1
c3nfx 1 0 1 0 1 0 1
c3per 1 0 1 0 1 0 1
c4ann 1 0 1 0 1 0 1
c4per 1 0 1 0 1 0 1
pastr 1 0 1 0 1 0 1
primf 1 0 1 0 1 0 1
I would like to reorder the rows and columns so that the order is this:
primf pastr c3ann c3nfx c3per c4ann c4per
I can do this for just the columns like this:
cols = ['primf', 'pastr', 'c3ann', 'c3nfx', 'c3per', 'c4ann', 'c4per']
df = df[cols]
How do I do this such that the row headers are also changed appropriately?
You can use reindex to reorder both the columns and index at the same time.
df = df.reindex(index=cols, columns=cols)
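A minimal usage sketch, assuming only the row/column labels from the question (the values here are placeholders):
import pandas as pd

orig = ['c3ann', 'c3nfx', 'c3per', 'c4ann', 'c4per', 'pastr', 'primf']
cols = ['primf', 'pastr', 'c3ann', 'c3nfx', 'c3per', 'c4ann', 'c4per']

# placeholder values; only the labels matter for the reordering
df = pd.DataFrame(1, index=orig, columns=orig)

# reindex reorders rows and columns together; any label missing from the frame would become NaN
df = df.reindex(index=cols, columns=cols)
print(df)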
I have a DataFrame where a combination of column values (A, B, C) identifies a unique address. I would like to identify all such rows and assign them a unique identifier that I increment per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1

df.groupby(['A','B','C']).transform(set_id)
This returns a NULL dataframe, so this is definitely not the way to do it; I am new to pandas. The approach should probably use df[['A','B','C']].drop_duplicates() to get all the unique values.
Thank you.
I think this is what you need:
df2 = df[['A','B','C']].drop_duplicates()             # get the unique values of A, B, C
df2 = df2.reset_index(drop=True).reset_index()        # reset the index to create a column named 'index'
df2 = df2.rename(columns={'index': 'ID'})             # rename 'index' to 'ID'
df = pd.merge(df, df2, on=['A','B','C'], how='left')  # append the ID column with a merge
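For reference, a minimal alternative sketch using groupby(...).ngroup(), which numbers the groups directly; sort=False is assumed so that IDs follow the order in which each (A, B, C) combination first appears:
import pandas as pd

df = pd.DataFrame({
    'A': [0, 0, 0, 0, 0, 0],
    'B': [1, 1, 1, 1, 1, 1],
    'C': [1, 2, 1, 3, 2, 1],
    'D': [0, 0, 1, 0, 1, 2],
    'E': [1, 1, 1, 1, 0, 1],
})

# ngroup() labels each (A, B, C) group with an integer; sort=False keeps first-appearance order
df['ID'] = df.groupby(['A', 'B', 'C'], sort=False).ngroup()
print(df)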
# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Use `groupby` to keep running total of changes in key value.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0