Pandas: "Left Join" not working correctly - python

I am trying to apply a left join to the two dataframes shown below: outliers (first) and sum_Day (second).
outlier day season
0 11556.0 0 1
==========================================
date bikeid date2
0 1 16736 2016-06-06
1 1 16218 2016-06-13
2 1 15254 2016-06-20
3 1 16327 2016-06-27
4 1 17745 2016-07-04
5 1 16975 2016-07-11
6 1 17705 2016-07-18
7 1 16792 2016-07-25
8 1 18540 2016-08-01
9 1 17212 2016-08-08
10 1 11556 2016-08-15
11 1 17694 2016-08-22
12 1 14936 2016-08-29
outliers = pd.merge(outliers, sum_Day, how = 'left', left_on = ['outlier'], right_on = ['bikeid'])
outliers = outliers.dropna(axis=1, how='any')
trip_outlier day season
0 11556.0 0 1
As shown above, after applying the left join I dropped every column containing NaN (dropna with axis=1), which gives the result above. However, the desired result should be as shown below:
trip_outlier day season date2
0 11556.0 0 1 2016-08-15

It seems the dtype of the outlier column in outliers is float. Both join columns need the same dtype.
Check it by:
print (outliers['outlier'].dtype)
print (sum_Day['bikeid'].dtype)
So use astype to convert:
outliers['outlier'] = outliers['outlier'].astype(int)
#if not int
#sum_Day['bikeid'] = sum_Day['bikeid'].astype(int)
EDIT:
If the outlier column contains NaNs, it is not possible to convert it to int, so it is necessary to remove the NaNs first:
outliers = outliers.dropna(subset=['outlier'])
outliers['outlier'] = outliers['outlier'].astype(int)
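A minimal sketch of the whole check-and-convert flow, using toy frames abbreviated from the question (only a couple of rows kept for illustration):

import pandas as pd

# toy frames mirroring the question
outliers = pd.DataFrame({'outlier': [11556.0], 'day': [0], 'season': [1]})
sum_Day = pd.DataFrame({'bikeid': [11556, 17212],
                        'date2': ['2016-08-15', '2016-08-08']})

print(outliers['outlier'].dtype)   # float64
print(sum_Day['bikeid'].dtype)     # int64

# align the key dtypes, then left join
outliers['outlier'] = outliers['outlier'].astype(int)
merged = outliers.merge(sum_Day, how='left',
                        left_on='outlier', right_on='bikeid')
print(merged)   # date2 is populated instead of being dropped as NaN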

One way to get the desired result is to rename the key column and merge on it; renaming bikeid to outlier also keeps the join key from appearing twice in the output:
outliers = outliers.merge(sum_Day.rename(columns={'bikeid': 'outlier'}),
                          on='outlier', how='left')


Summing values up to a column value change in pandas dataframe

I have a pandas data frame that looks like this:
Count Status
Date
2021-01-01 11 1
2021-01-02 13 1
2021-01-03 14 1
2021-01-04 8 0
2021-01-05 8 0
2021-01-06 5 0
2021-01-07 2 0
2021-01-08 6 1
2021-01-09 8 1
2021-01-10 10 0
I want to calculate the difference between the initial and final value of the "Count" column before the "Status" column changes from 0 to 1 or vice-versa (for every cycle) and make a new dataframe out of these values.
The output for this example would be:
Cycle Difference
1 3
2 -6
3 2
Use GroupBy.agg on consecutive groups created by comparing shifted values and taking a cumulative sum, then subtract the first value from the last:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first', 'last'])
        .eval('last - first')
        .reset_index(name='Difference'))
print (df)
Cycle Difference
0 1 3
1 2 -6
2 3 2
3 4 0
If you need to filter out single-row groups, add a GroupBy.size aggregation and then filter the rows with DataFrame.loc:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first', 'last', 'size'])
        .loc[lambda x: x['size'] > 1]
        .eval('last - first')
        .reset_index(name='Difference'))
print (df)
Cycle Difference
0 1 3
1 2 -6
2 3 2
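To see what the grouper produces, here is a minimal sketch rebuilt from the sample frame in the question:

import pandas as pd

df = pd.DataFrame({'Count': [11, 13, 14, 8, 8, 5, 2, 6, 8, 10],
                   'Status': [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]},
                  index=pd.date_range('2021-01-01', periods=10, name='Date'))

# True wherever Status differs from the previous row; the cumulative
# sum then numbers each run of equal values 1, 2, 3, ...
grouper = df['Status'].ne(df['Status'].shift()).cumsum()
print(grouper.tolist())   # [1, 1, 1, 2, 2, 2, 2, 3, 3, 4]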
You can use a GroupBy.agg on the groups formed of the consecutive values, then take the last value minus the first (see below for variants):
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())['Count']
         .agg(lambda x: x.iloc[-1] - x.iloc[0]))
output:
Status
1 3
2 -6
3 2
4 0
Name: Count, dtype: int64
If you only want to do this for groups of more than one element:
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())['Count']
         .agg(lambda x: x.iloc[-1] - x.iloc[0] if len(x) > 1 else pd.NA)
         .dropna())
output:
Status
1 3
2 -6
3 2
Name: Count, dtype: object
output as DataFrame:
add .rename_axis('Cycle').reset_index(name='Difference'):
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())['Count']
         .agg(lambda x: x.iloc[-1] - x.iloc[0] if len(x) > 1 else pd.NA)
         .dropna()
         .rename_axis('Cycle')
         .reset_index(name='Difference'))
output:
Cycle Difference
0 1 3
1 2 -6
2 3 2

Python repeatable cycle for picking only first values equal 1

I have a df with a date index and values 0 or 1. I need to pick out the first 1 of every run of 1s from this data frame.
For example:
2019-11-27 0
2019-11-29 0
2019-12-02 0
2019-12-03 1
2019-12-04 1
2019-12-05 1
2020-06-01 0
2020-06-02 0
2020-06-03 1
2020-06-04 1
2020-06-05 1
So I want to get:
2019-12-03 1
2020-06-03 1
Assuming you want the first date with value 1 in a dataframe ordered by date ascending, a window operation might be the best way to do this:
df['PrevValue'] = df['value'].rolling(2).agg(lambda rowset: int(rowset.iloc[0]))
This line of code adds an extra column named "PrevValue" to the dataframe containing the value of the previous row or "NaN" for the first row.
Next, you could query the data as follows:
df_filtered = df.query("value == 1 & PrevValue == 0")
Resulting in the following output:
date value PrevValue
3 2019-12-03 1 0.0
8 2020-06-03 1 0.0
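A shift-based comparison reaches the same rows without a rolling window; a sketch, assuming the same single-column frame (named value here, with the question's dates as the index):

import pandas as pd

df = pd.DataFrame({'value': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1]},
                  index=pd.to_datetime(['2019-11-27', '2019-11-29',
                                        '2019-12-02', '2019-12-03',
                                        '2019-12-04', '2019-12-05',
                                        '2020-06-01', '2020-06-02',
                                        '2020-06-03', '2020-06-04',
                                        '2020-06-05']))

# keep rows where value is 1 and the previous value was not 1
first_ones = df[df['value'].eq(1) & df['value'].shift().ne(1)]
print(first_ones)   # rows for 2019-12-03 and 2020-06-03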
I built a function that can satisfy your requirements. Important note: you should change the col argument to match your column name, otherwise it will cause a problem.
def funfun(df, col="values"):
    '''
    df : dataframe
    col (str) : please insert the name of column that you want to scan
    '''
    # note: iterating the integer keys of to_dict() assumes a default RangeIndex
    a = []
    c = df.to_dict()
    for i in range(len(c[col]) - 1):
        b = c[col][i], c[col][i + 1]
        if b == (0, 1):
            a.append(df.iloc[i + 1])
    return a
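A usage sketch, assuming a frame with a default integer index and a values column (matching the function's default col):

import pandas as pd

df = pd.DataFrame({'values': [0, 0, 1, 1, 0, 1, 1]})
for row in funfun(df, col='values'):
    print(row.name, row['values'])   # positions 2 and 5, the first 1 of each run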

Python Pandas - Selecting specific rows based on the max and min of two columns with the same group id

I am looking for a way to identify the row that is the 'master' row. The way I am defining the master row: for each group_id, it is the row with the minimum cust_hierarchy; if there is a tie, use the row with the most recent date.
I have supplied some sample tables below:
row_id  group_id  cust_hierarchy  most_recent_date  master (I am looking for)
1       0         2               2020-01-03        1
2       0         7               2019-01-01        0
3       1         7               2019-05-01        0
4       1         6               2019-04-01        0
5       1         6               2019-04-03        1
I was thinking of possibly ordering by the two columns (cust_hierarchy ascending, most_recent_date descending) and then adding a new column that places a 1 on the first row for each group_id?
Does anyone have any helpful code for this?
You can basically do a groupby with idxmin(), but with a little bit of sorting first to ensure the most recent date is the one selected by the min operation:
import pandas as pd
import numpy as np

# example data
dates = ['2020-01-03', '2019-01-01', '2019-05-01',
         '2019-04-01', '2019-04-03']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'group_id': [0, 0, 1, 1, 1],
                   'cust_hierarchy': [2, 7, 7, 6, 6],
                   'most_recent_date': dates})

# solution
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()
df before:
group_id cust_hierarchy most_recent_date
0 0 2 2020-01-03
1 0 7 2019-01-01
2 1 7 2019-05-01
3 1 6 2019-04-01
4 1 6 2019-04-03
df after:
group_id cust_hierarchy most_recent_date master
0 0 2 2020-01-03 True
1 0 7 2019-01-01 False
2 1 7 2019-05-01 False
3 1 6 2019-04-01 False
4 1 6 2019-04-03 True
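An equivalent spelling of the same idea, sketched against the example data above: sort so the desired row comes first within each group, then keep one row per group with head(1):

first = (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                        ascending=[True, False])
           .groupby('group_id')
           .head(1))
df['master'] = df.index.isin(first.index)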
Use duplicated on sort_values: sort so the master row comes last within each group_id, mark all the earlier rows as duplicates, and invert:
df['master'] = 1 - (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                                   ascending=[False, True])
                      .duplicated('group_id', keep='last')
                      .astype(int))

Aggregation in pandas dataframe with columns names in one row

I am using Python 3.6 and I am doing an aggregation, which I have done correctly, but the column names are not in the form I want.
df = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
                   'revenue': [1, 3, 5, 1, 5],
                   'month': ['2012-01-01', '2012-01-01', '2012-03-01',
                             '2014-01-01', '2012-01-01']})
print(df)
ID month revenue
0 1 2012-01-01 1
1 1 2012-01-01 3
2 2 2012-03-01 5
3 2 2014-01-01 1
4 2 2012-01-01 5
Doing the aggregation below.
df = df.groupby(['ID']).agg({'revenue':'sum','month':[('distinct_m','nunique'),('month_m','first')]}).reset_index()
print(df)
ID revenue month
sum distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
Desired output is:
ID revenue distinct_m month
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
The problem is that I am using a mixed form of expressions inside agg(). Had it been only agg({'revenue':'sum'}), I would have got a column named revenue in precisely the same format I wanted, as shown below:
ID revenue
0 1 4
1 2 11
But, since I am creating 2 additional columns as well, using tuple form ('distinct_m','nunique'),('month_m','first'), I get column names spread across two rows.
Is there a way to get the desired output shown above in one aggregation agg()? I want to avoid using tuple form for 'revenue':'sum'. I am not looking for multiple operations afterwards to get the column names right. I am using Python 3.6.
To avoid this problem, use named aggregation, which works in pandas 0.25+ and lets you specify each output column name:
df = (df.groupby(['ID']).agg(revenue=('revenue', 'sum'),
                             distinct_m=('month', 'nunique'),
                             month_m=('month', 'first'))
        .reset_index())
print(df)
ID revenue distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
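The keyword form above is shorthand for pd.NamedAgg; an equivalent explicit spelling is:

df = (df.groupby(['ID'])
        .agg(revenue=pd.NamedAgg(column='revenue', aggfunc='sum'),
             distinct_m=pd.NamedAgg(column='month', aggfunc='nunique'),
             month_m=pd.NamedAgg(column='month', aggfunc='first'))
        .reset_index())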
For lower pandas versions it is possible to flatten the MultiIndex columns and then rename:
df = df.groupby(['ID']).agg({'revenue': 'sum',
                             'month': [('distinct_m', 'nunique'),
                                       ('month_m', 'first')]})
df.columns = df.columns.map('_'.join)
df = df.rename(columns={'revenue_sum': 'revenue',
                        'month_distinct_m': 'distinct_m',
                        'month_month_m': 'month_m'})
df = df.reset_index()
print(df)
ID revenue distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01

How to efficiently add rows for those data points which are missing from a sequence using pandas?

I have the following time series dataset of the number of sales per day, as a pandas data frame.
date, sales
20161224,5
20161225,2
20161227,4
20161231,8
Now, if I have to include the missing data points here (i.e. the missing dates) with a constant value (zero) so that it looks the following way, how can I do this efficiently (assuming the data frame is ~50MB) using Pandas?
date, sales
20161224,5
20161225,2
20161226,0**
20161227,4
20161228,0**
20161229,0**
20161231,8
** Missing rows that have been added to the data frame.
Any help will be appreciated.
First convert the date column with to_datetime, then set_index and reindex by a date_range spanning the min and max of the index, reset_index, and if necessary change the format back with strftime:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = df.set_index('date')
df = (df.reindex(pd.date_range(df.index.min(), df.index.max()), fill_value=0)
        .reset_index()
        .rename(columns={'index': 'date'}))
print (df)
date sales
0 2016-12-24 5
1 2016-12-25 2
2 2016-12-26 0
3 2016-12-27 4
4 2016-12-28 0
5 2016-12-29 0
6 2016-12-30 0
7 2016-12-31 8
Last, if you need to change the format back:
df.date = df.date.dt.strftime('%Y%m%d')
print (df)
date sales
0 20161224 5
1 20161225 2
2 20161226 0
3 20161227 4
4 20161228 0
5 20161229 0
6 20161230 0
7 20161231 8
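As an alternative sketch: once date is the index, DataFrame.asfreq can do the reindex-to-daily step in one call, with fill_value filling the newly created rows:

df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = (df.set_index('date')
        .asfreq('D', fill_value=0)
        .rename_axis('date')   # make sure the index keeps its name
        .reset_index())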
