I have a DataFrame indexed by dates, with values of 0 or 1. I need to extract the first 1 of every run of 1s from this data frame.
For example:
2019-11-27 0
2019-11-29 0
2019-12-02 0
2019-12-03 1
2019-12-04 1
2019-12-05 1
2020-06-01 0
2020-06-02 0
2020-06-03 1
2020-06-04 1
2020-06-05 1
So I want to get:
2019-12-03 1
2020-06-03 1
Assuming you want the first date with value 1 in a dataframe ordered by date ascending, a window operation might be the best way to do this:
df['PrevValue'] = df['value'].rolling(2).agg(lambda rowset: int(rowset.iloc[0]))
This line of code adds an extra column named "PrevValue" to the dataframe, containing the value of the previous row (or NaN for the first row).
Next, you could query the data as follows:
df_filtered = df.query("value == 1 & PrevValue == 0")
Resulting in the following output:
date value PrevValue
3 2019-12-03 1 0.0
8 2020-06-03 1 0.0
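For reference, the same idea can be written more compactly with shift; this is just a sketch assuming the column is named value as above (the first row behaves the same way, since its shifted value is NaN):

df_filtered = df[(df['value'] == 1) & (df['value'].shift() == 0)]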
I built a function that satisfies your requirements.
Important note: set the col argument to the name of your column, otherwise it might cause you a problem.

def funfun(df, col="values"):
    '''
    df : dataframe
    col (str) : name of the column that you want to scan
    '''
    a = []
    vals = df[col].tolist()
    for i in range(len(vals) - 1):
        # keep the row right after every 0 -> 1 transition
        if (vals[i], vals[i + 1]) == (0, 1):
            a.append(df.iloc[i + 1])
    return a
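A quick usage sketch on the example frame above (this assumes the 0/1 column is named "values"):

first_ones = funfun(df, col="values")

first_ones is then a list of the rows that directly follow a 0 -> 1 transition, i.e. the 2019-12-03 and 2020-06-03 rows from the example.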
I am having trouble with Pandas.
I am trying to compare each value of a row to another value in the same row.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock variation to the variation of the columns labelled 'CAC 40'.
If the value is greater I want to turn it into a Boolean 1, or 0 if it is lower.
This should return a dataframe filled only with 1 or 0 so I can then summarize by columns.
I have tried the apply method but this doesn't work.
It returns a pandas Series (attached below).
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0

data2 = data.apply(compare_to_cac, axis=1)
Please can someone help me out?
I worked with this data (column names are not important here, only the CAC 40 one is):
A B CAC 40
0 0 2 9
1 1 3 9
2 2 4 1
3 3 5 2
4 4 7 2
With just a for loop:

import numpy as np

for column in df.columns:
    if column == "CAC 40":
        continue
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    df[column] = np.select(condition, value, default=0)

Which gives me as a result:
A B CAC 40
0 0 0 9
1 0 0 9
2 1 1 1
3 1 1 2
4 1 1 2
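For reference, a fully vectorized sketch of the same comparison (assuming the df above; ge is used so that ties count as 1, matching the question's >= rule):

result = df.drop(columns="CAC 40").ge(df["CAC 40"], axis=0).astype(int)

This compares every remaining column to the "CAC 40" column row by row and returns a 0/1 frame that can then be summed per column.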
I have this DF with users and KPIs (1 = present, 0 = not):
global_df = pd.DataFrame({'Users':[1,2,3,4],
'KPI_1':[1,0,0,0],
'KPI_2':[1,1,0,0]})
It looks like:
Users KPI_1 KPI_2
0 1 1 1
1 2 0 1
2 3 0 0
3 4 0 0
Then I have a separate df, one row per filter, with the conditions (just showing one):
filter_condition = pd.DataFrame({'KPI_1':[1],
'KPI_2':[1]})
It looks like:
KPI_1 KPI_2
0 1 1
This returns KeyError:
global_df[global_df[any(filter_condition)]]
Expected result (users 1 and 2, because they have KPI_1 or KPI_2 set to 1):
Users KPI_1 KPI_2
0 1 1 1
1 2 0 1
Note this is only the first row of the dataframe; I need to be able to iterate over every row in the filter_condition dataframe, where there can be N columns. For the sake of the example I just added two.
EDIT
This is close, but it wipes out the Users values and adds a lot of NaNs:
global_df[global_df[filter_condition.columns]==1]
Users KPI_1 KPI_2
0 NaN 1.0 1.0
1 NaN NaN 1.0
2 NaN NaN NaN
3 NaN NaN NaN
Iterate over the rows of filter_condition. For each row, slice the portion of global_df that shares columns with the filter row, compare it with the row's values using eq, and reduce along the columns axis with any to create a boolean mask:
for i, r in filter_condition.iterrows():
    filtered_df = global_df[global_df[r.keys()].eq(r).any(axis=1)]
    # TODO: Process the filtered dataframe
Users KPI_1 KPI_2
0 1 1 1
1 2 0 1
We can use reduce from the functools lib to build a boolean mask to filter with.
We can also create a dictionary from your filter dataframe, giving the values to filter each column by.
from functools import reduce
d = filter_condition.stack().groupby(level=1).agg(list).to_dict()
# {'KPI_1': [1], 'KPI_2': [1]}
global_df.loc[reduce(lambda a,b: a | b, [global_df[k].isin(v) for k,v in d.items()])]
Users KPI_1 KPI_2
0 1 1 1
1 2 0 1
Use df.loc. Just do:
result = global_df.loc[(global_df['KPI_1'] == 1) | (global_df['KPI_2'] == 1)]
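If the KPI columns should not be hard-coded (the question notes there can be N of them), the same idea can be written against the filter row instead; a sketch assuming a single filter row as in the example:

cols = filter_condition.columns
result = global_df.loc[global_df[cols].eq(filter_condition.iloc[0]).any(axis=1)]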
Well, the error is explainable: you are trying to index global_df with a boolean key that is not present in it.
indexes = set()
for column in filter_condition.columns:
    indexes = indexes.union(global_df[global_df[column].isin(filter_condition[column])].index)
global_df.loc[sorted(indexes)]
I've used isin here as you've used a list in your filter_condition, so I am hoping you're planning to use multiple values there.
This is one of the ways you could choose; I will update once I come up with a better approach. Also, since the columns are going to be the same, it would be better to use a Series rather than a DataFrame for the filter.
EDIT: Global_df is fixed now in the question.
Another simple approach
condition = '|'.join(f'{column} in {filter_condition[column].values}' for column in filter_condition.columns)
global_df.query(condition)
This will basically perform the same function as above
global_df:
Users KPI_1 KPI_2
0 1 1 1
1 2 0 1
2 3 0 0
3 4 0 0
filter_condition:
KPI_1 KPI_2
0 1 1
Output:
Users KPI_1 KPI_2
0 1 1 1
1 2 0 1
I have a pandas dataframe that looks like this,
id start end
0 1 2020-02-01 2020-04-01
1 2 2020-04-01 2020-04-28
I have two additional parameters that are date values, say x and y. x and y will always be the first day of a month.
I want to expand the above data frame to the one shown below for x = "2020-01-01" and y = "2020-06-01",
id month status
0 1 2020-01 -1
1 1 2020-02 1
2 1 2020-03 2
3 1 2020-04 2
4 1 2020-05 -1
5 1 2020-06 -1
6 2 2020-01 -1
7 2 2020-02 -1
8 2 2020-03 -1
9 2 2020-04 1
10 2 2020-05 -1
11 2 2020-06 -1
The dataframe is expanded such that for each id there will be months_between(x, y) additional rows, one per month. A status column is added and its values are filled in such that:
If the month column value is equal to the month of start, fill status with 1.
If the month column value is greater than the month of start but less than or equal to the month of end, fill it with 2.
If the month column value is less than the month of start, fill it with -1. Also, if the month column value is greater than the month of end, fill status with -1.
I'm trying to solve this in pandas without looping. The current solution I have uses loops and takes too long to run on huge datasets.
Is there any pandas functions that can help me here?
Thanks #Code Different for the solution. It solves the issue. However there is an extension to the problem where the dataframe can look like this,
id start end
0 1 2020-02-01 2020-02-20
1 1 2020-04-01 2020-05-10
2 2 2020-04-10 2020-04-28
One id can have more than one entry. For the above x and y, which are 6 months apart, I want to have 6 rows for each id in the dataframe. The solution currently creates 6 rows for each row in the dataframe, which is okay but not ideal when dealing with dataframes with millions of ids.
Make sure the start and end columns are of type Timestamp:

# Convert if start and end are still strings
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

# Explode each month between x and y
x = '2020-01-01'
y = '2020-06-01'
df['month'] = [pd.date_range(x, y, freq='MS')] * len(df)
df = df.explode('month').drop_duplicates(['id', 'month'])

# Determine the status
df['status'] = -1
cond = df['start'] == df['month']
df.loc[cond, 'status'] = 1
cond = (df['start'] < df['month']) & (df['month'] <= df['end'])
df.loc[cond, 'status'] = 2
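If you also want the month column rendered as in the expected output (YYYY-MM strings) and the columns reordered, a small follow-up sketch (the strftime format is an assumption about the desired display):

df['month'] = df['month'].dt.strftime('%Y-%m')
df = df[['id', 'month', 'status']].reset_index(drop=True)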
It's a bit hard to explain, so I'll start off with what I'm trying to achieve using Excel.
Basically, the value of the "Active" column is based on values of the same row different column values (columns 'Act Count' and 'De Count'), as well as the value in the previous row of the "Active" column.
From the excel formula, if 'Act Count' < 4 and 'De Count' < 4, 'Active' = the previous row's 'Active' value.
I want to transition this to Python pandas dataframe.
Here is the sample data:
import pandas as pd
df = pd.DataFrame({'Act Count':[1,2,3,4,0,0,0,0,0,0,0,0,0,0],
'De Count':[0,0,0,0,0,0,0,0,1,2,3,4,5,6]})
You can assume the first row value of 'Active' = 0.
I know of the .shift() function; however, I feel like I can't use it because I can't shift a column that doesn't exist yet.
This is not an elegant solution, but it will do the job.

import pandas as pd

act_count = [1,2,3,4,0,0,0,0,0,0,0,0,0,0]
de_count = [0,0,0,0,0,0,0,0,1,2,3,4,5,6]
active = [0]
for i in range(1, len(act_count)):
    if act_count[i] >= 4:
        active.append(100)
    elif de_count[i] >= 4:
        active.append(0)
    else:
        active.append(active[i-1])

df = pd.DataFrame({'Act Count': act_count, 'De Count': de_count,
                   'Active': active})
You can use this method:

import numpy as np

# Add an empty column named Active to the existing dataframe
df['Active'] = np.nan
# Put the first value as 0
df.loc[0, 'Active'] = 0
for index in range(1, df.shape[0]):
    if df['Act Count'].iloc[index] >= 4:
        df.loc[index, 'Active'] = 100
    elif df['De Count'].iloc[index] >= 4:
        df.loc[index, 'Active'] = 0
    else:
        df.loc[index, 'Active'] = df['Active'].iloc[index - 1]
print(df)
Output:
Act Count De Count Active
0 1 0 0.0
1 2 0 0.0
2 3 0 0.0
3 4 0 100.0
4 0 0 100.0
5 0 0 100.0
6 0 0 100.0
7 0 0 100.0
8 0 1 100.0
9 0 2 100.0
10 0 3 100.0
11 0 4 0.0
12 0 5 0.0
13 0 6 0.0
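If the loop turns out to be too slow on a large frame, a vectorized sketch of the same logic (it assumes the df from the question; np.select marks the rows that force a new state, ffill carries the last state forward, and the leading rows fall back to 0):

import numpy as np

# 100 where Act Count >= 4, 0 where De Count >= 4, NaN otherwise
df['Active'] = np.select([df['Act Count'] >= 4, df['De Count'] >= 4], [100, 0], default=np.nan)
# carry the last decided state forward; rows before any trigger default to 0
df['Active'] = df['Active'].ffill().fillna(0)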
I am trying to apply a left join to the two dataframes shown below.
outlier day season
0 11556.0 0 1
==========================================
date bikeid date2
0 1 16736 2016-06-06
1 1 16218 2016-06-13
2 1 15254 2016-06-20
3 1 16327 2016-06-27
4 1 17745 2016-07-04
5 1 16975 2016-07-11
6 1 17705 2016-07-18
7 1 16792 2016-07-25
8 1 18540 2016-08-01
9 1 17212 2016-08-08
10 1 11556 2016-08-15
11 1 17694 2016-08-22
12 1 14936 2016-08-29
outliers = pd.merge(outliers, sum_Day, how = 'left', left_on = ['outlier'], right_on = ['bikeid'])
outliers = outliers.dropna(axis=1, how='any')
trip_outlier day season
0 11556.0 0 1
As shown above, after applying the left join I dropped all the columns containing NaN, which gives the result above. However, the desired result should be as shown below:
trip_outlier day season date2
0 11556.0 0 1 2016-08-15
It seems the dtype of the outlier column in outliers is float. The same dtype is needed in both joined columns.
Check it by:
print (outliers['outlier'].dtype)
print (sum_Day['bikeid'].dtype)
So use astype to convert:
outliers['outlier'] = outliers['outlier'].astype(int)
#if not int
#sum_Day['bikeid'] = sum_Day['bikeid'].astype(int)
EDIT:
If there are some NaNs in the outlier column, it is not possible to convert to int, so it is first necessary to remove the NaNs:

outliers = outliers.dropna(subset=['outlier'])
outliers['outlier'] = outliers['outlier'].astype(int)
One way to get the desired result would be using the below code:
outliers = outliers.merge(sum_Day.rename(columns={'bikeid': 'outlier'}),
                          on='outlier', how='left')
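Combining this with the dtype fix above (a minimal sketch; it assumes the NaNs have already been removed from outlier so the int cast is safe):

outliers['outlier'] = outliers['outlier'].astype(int)
result = outliers.merge(sum_Day.rename(columns={'bikeid': 'outlier'}), on='outlier', how='left')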