Pandas data frame: adding columns based on previous time periods - python

I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head() or the top few rows look like:
  person  period  value
0    P22       1      0
1    P23       1      0
2    P24       1      1
3    P25       1      0
4    P26       1      1
5    P22       2      1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
  person  period  value  lastperiod
5    P22       2      1           0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
    df['lastperiod'] == [???]
How should this be formulated?

You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
  person  period  value  lastPeriod
1    P22       1      0         NaN
2    P23       1      0         NaN
3    P24       1      1         NaN
4    P25       1      0         NaN
5    P26       1      1         NaN
6    P22       2      1           0
Here the NaN signify missing data (i.e. there wasn't an entry in the previous period).
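For reference, a minimal self-contained sketch of the same idea (the constructor just mirrors the sample data; in recent pandas you can also call shift directly on the grouped column, without apply):
import pandas as pd

df = pd.DataFrame({'person': ['P22', 'P23', 'P24', 'P25', 'P26', 'P22'],
                   'period': [1, 1, 1, 1, 1, 2],
                   'value':  [0, 0, 1, 0, 1, 1]})

# shift() gives every row the value from that person's previous row,
# which here is the previous period since the data is ordered by period
df['lastPeriod'] = df.groupby('person')['value'].shift()
print(df)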

Related

pandas groupby excluding when a column takes some value

Is there a way to exclude rows that take certain values when aggregating?
For example:
ID | Company | Cost
1 | Us | 2
1 | Them | 1
1 | Them | 1
2 | Us | 1
2 | Them | 2
2 | Them | 1
I would like to do a groupby and sum, but ignore rows where Company is "Us".
The result should be something like this:
ID | Sum of cost
1 | 2
2 | 3
I solved it by doing this, but I want to know if there's a smarter solution:
df_agg = df[df['Company']!="Us"][['ID','Cost']].groupby(['ID']).sum()
Use:
print (df)
ID Company Cost
0 1 Us 2
1 1 Them 1
2 1 Them 1
3 2 Us 1
4 2 Them 2
5 2 Them 1
6 3 Us 1 <- added a new row to show the difference
If you need to filter first and unmatched groups (if any exist) are not important, use:
df1 = df[df.Company!="Us"].groupby('ID', as_index=False).Cost.sum()
print (df1)
ID Cost
0 1 2
1 2 3
Or the same with DataFrame.query:
df1 = df.query('Company!="Us"').groupby('ID', as_index=False).Cost.sum()
print (df1)
ID Cost
0 1 2
1 2 3
If you need all ID groups, with Cost=0 for the Us rows, first set Cost to 0 and then aggregate:
df2 = (df.assign(Cost=df.Cost.where(df.Company != "Us", 0))
         .groupby('ID', as_index=False).Cost
         .sum())
print (df2)
ID Cost
0 1 2
1 2 3
2 3 0
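A minimal, self-contained sketch of both variants on the sample data (the constructor below just mirrors the table from the question):
import pandas as pd

df = pd.DataFrame({'ID':      [1, 1, 1, 2, 2, 2, 3],
                   'Company': ['Us', 'Them', 'Them', 'Us', 'Them', 'Them', 'Us'],
                   'Cost':    [2, 1, 1, 1, 2, 1, 1]})

# Variant 1: drop the "Us" rows before aggregating (IDs with only "Us" rows disappear)
df1 = df[df.Company != "Us"].groupby('ID', as_index=False).Cost.sum()

# Variant 2: zero out the "Us" costs so every ID keeps a row in the result
df2 = (df.assign(Cost=df.Cost.where(df.Company != "Us", 0))
         .groupby('ID', as_index=False).Cost
         .sum())

print(df1)
print(df2)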

Count occurrences of specific value in column based on categories of another column

I have a dataset that looks like this:
Categories | Clicks
1 | 1
1 | 3
1 | 2
2 | 2
2 | 1
2 | 1
2 | 2
3 | 1
3 | 2
3 | 3
4 | 2
4 | 1
And to make some bar plots I would like it to look like this:
Categories | Clicks_count | Clicks_prob
1 | 1 | 33%
2 | 2 | 50%
3 | 1 | 33%
4 | 1 | 50%
So basically: group by Categories and calculate, for Clicks_count, the number of times per category that Clicks takes the value 1, and for Clicks_prob, the probability of Clicks taking the value 1 (i.e. the count of Clicks==1 divided by the number of observations in category i).
How could I do this? To get the second column I tried:
df.groupby("Categories")["Clicks"].count().reset_index()
but the result is:
Categories | Clicks
1 | 3
2 | 4
3 | 3
4 | 2
Try sum and mean on the condition Clicks==1. Since you're working with groups, put them in groupby:
df['Clicks'].eq(1).groupby(df['Categories']).agg(['sum','mean'])
Output:
            sum      mean
Categories
1             1  0.333333
2             2  0.500000
3             1  0.333333
4             1  0.500000
To match the desired output's naming, use named aggregation:
df['Clicks'].eq(1).groupby(df['Categories']).agg(Clicks_count='sum', Clicks_prob='mean')
Output:
            Clicks_count  Clicks_prob
Categories
1                      1     0.333333
2                      2     0.500000
3                      1     0.333333
4                      1     0.500000
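As a usage sketch, here is the same aggregation end to end on the sample data; the percentage formatting at the end is an assumption about how you want the bar-plot labels to look:
import pandas as pd

df = pd.DataFrame({'Categories': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4],
                   'Clicks':     [1, 3, 2, 2, 1, 1, 2, 1, 2, 3, 2, 1]})

out = (df['Clicks'].eq(1)
         .groupby(df['Categories'])
         .agg(Clicks_count='sum', Clicks_prob='mean')
         .reset_index())

# Optional: show the probability as a percentage string for plot labels
out['Clicks_prob'] = (out['Clicks_prob'] * 100).round().astype(int).astype(str) + '%'
print(out)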

how to fill date column in one dataframe with nearest dates from another dataframe

I have a dataframe visit =
visit_occurrence_id visit_start_date person_id
1 2016-06-01 1
2 2019-05-01 2
3 2016-01-22 1
4 2017-02-14 2
5 2018-05-11 3
and another dataframe measurement =
measurement_date  person_id  visit_occurrence_id
2017-09-04        1          NaN
2018-04-24        2          NaN
2018-05-22        2          NaN
2019-02-02        1          NaN
2019-01-28        3          NaN
2019-05-07        1          NaN
2018-12-11        3          NaN
2017-04-28        3          NaN
I want to fill visit_occurrence_id in the measurement table with the visit_occurrence_id from the visit table, matching on person_id and the nearest possible date.
I have written code, but it is taking a lot of time; measurement has 7*10^5 rows.
Note: visit_start_date and measurement_date are object types
my code -
import datetime as dt

unique_person_list = measurement['person_id'].unique().tolist()

def nearest_date(row, date_list):
    date_list = [dt.datetime.strptime(date, '%Y-%m-%d').date() for date in date_list]
    row = min(date_list, key=lambda x: abs(x - row))
    return row

modified_measurement = pd.DataFrame(columns=measurement.columns)
for person in unique_person_list:
    near_visit_dates = visit[visit['person_id'] == person]['visit_start_date'].tolist()
    if near_visit_dates:
        near_visit_dates = list(filter(None, near_visit_dates))
        near_visit_dates = [i.strftime('%Y-%m-%d') for i in near_visit_dates]
        store_dates = measurement.loc[measurement['person_id'] == person]['measurement_date']
        store_dates = store_dates.apply(nearest_date, args=(near_visit_dates,))
        modified_measurement = modified_measurement.append(store_dates)
My code's execution time is quite high. Can you help me either reduce the time complexity or suggest another solution?
edit - adding dataframe constructors.
import numpy as np
import pandas as pd

measurement = {'measurement_date': ['2017-09-04', '2018-04-24', '2018-05-22', '2019-02-02',
                                    '2019-01-28', '2019-05-07', '2018-12-11', '2017-04-28'],
               'person_id': [1, 2, 2, 1, 3, 1, 3, 3],
               'visit_occurrence_id': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
visit = {'visit_occurrence_id': [1, 2, 3, 4, 5],
         'visit_start_date': ['2016-06-01', '2019-05-01', '2016-01-22', '2017-02-14', '2018-05-11'],
         'person_id': [1, 2, 1, 2, 3]}

# Create DataFrames
measurement = pd.DataFrame(measurement)
visit = pd.DataFrame(visit)
You can do the following:
import datetime

df = pd.merge(measurement[["person_id", "measurement_date"]], visit, on="person_id", how="inner")
df["dt_diff"] = df[["visit_start_date", "measurement_date"]].apply(
    lambda x: abs(datetime.datetime.strptime(x["visit_start_date"], '%Y-%m-%d').date()
                  - datetime.datetime.strptime(x["measurement_date"], '%Y-%m-%d').date()),
    axis=1)
df = pd.merge(df, df.groupby(["person_id", "measurement_date"])["dt_diff"].min(),
              on=["person_id", "dt_diff", "measurement_date"], how="inner")
res = pd.merge(measurement, df, on=["measurement_date", "person_id"],
               suffixes=["", "_2"])[["measurement_date", "person_id", "visit_occurrence_id_2"]]
Output:
measurement_date person_id visit_occurrence_id_2
0 2017-09-04 1 1
1 2018-04-24 2 2
2 2018-05-22 2 2
3 2019-02-02 1 1
4 2019-01-28 3 5
5 2019-05-07 1 1
6 2018-12-11 3 5
7 2017-04-28 3 5
Here's what I've come up with:
# The date columns are stored as strings (object dtype), so convert them first
measurement['measurement_date'] = pd.to_datetime(measurement['measurement_date'])
visit['visit_start_date'] = pd.to_datetime(visit['visit_start_date'])

# Get all visit start dates
df = measurement.drop('visit_occurrence_id', axis=1).merge(visit, on='person_id')
df['date_difference'] = abs(df.measurement_date - df.visit_start_date)
# Find the smallest visit start date for each person_id - measurement_date pair
df['smallest_difference'] = df.groupby(['person_id', 'measurement_date'])['date_difference'].transform(min)
df = df[df.date_difference == df.smallest_difference]
df = df[['measurement_date', 'person_id', 'visit_occurrence_id']]
# Fill in visit_occurrence_id from original dataframe
measurement.drop("visit_occurrence_id", axis=1).merge(
df, on=["measurement_date", "person_id"]
)
This produces:
| | measurement_date | person_id | visit_occurrence_id |
|---:|:-------------------|------------:|----------------------:|
| 0 | 2017-09-04 | 1 | 1 |
| 1 | 2018-04-24 | 2 | 2 |
| 2 | 2018-05-22 | 2 | 2 |
| 3 | 2019-02-02 | 1 | 1 |
| 4 | 2019-01-28 | 3 | 5 |
| 5 | 2019-05-07 | 1 | 1 |
| 6 | 2018-12-11 | 3 | 5 |
| 7 | 2017-04-28 | 3 | 5 |
I believe there's probably a cleaner way of writing this using sklearn: https://scikit-learn.org/stable/modules/neighbors.html
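For the speed concern, here is a sketch of another approach using pd.merge_asof with direction='nearest'; it assumes the date columns are converted to datetime first and that both frames are sorted on their date keys, which merge_asof requires:
import pandas as pd

measurement['measurement_date'] = pd.to_datetime(measurement['measurement_date'])
visit['visit_start_date'] = pd.to_datetime(visit['visit_start_date'])

# Match each measurement to the nearest visit_start_date within the same person_id
res = pd.merge_asof(
    measurement.drop(columns='visit_occurrence_id').sort_values('measurement_date'),
    visit.sort_values('visit_start_date'),
    left_on='measurement_date',
    right_on='visit_start_date',
    by='person_id',
    direction='nearest',
)
print(res)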

How to copy values from one df to the original df with a certain condition?

Currently I am working on a clustering problem, and I have a problem with copying values from one dataframe to the original dataframe.
   CustomerID        Date      Time  TotalSum CohortMonth  CohortIndex
0     17850.0  2017-11-29  08:26:00     15.30  2017-11-01            1
1     17850.0  2017-11-29  08:26:00     20.34  2017-11-01            1
2     17850.0  2017-11-29  08:26:00     22.00  2017-11-01            1
3     17850.0  2017-11-29  08:26:00     20.34  2017-11-01            1
And the dataframe with values (clusters) to copy:
CustomerID  Cluster
   12346.0        1
   12346.0        1
   12346.0        1
Please help me with the problem: how to copy values from the second df to the first dataframe, based on the CustomerID criteria.
I tried code like this:
df.merge(ic,left_on='CustomerID',right_on='Cluster',how='left').drop('CustomerID',1).fillna('')
But it doesn't work and I get an error...
Besides, I tried a version of code like this:
df, ic = [d.reset_index(drop=True) for d in (df, ic)]
ic.join(df[['CustomerID']])
But it gives the same error, or an error like 'CustomerID' not in df...
Sorry if this is an unclear or badly formatted question... It is my first question on Stack Overflow. Thank you all.
UPDATE
I have tried this
df1 = df.merge(ic, left_on='CustomerID', right_on='Cluster', how='left')
if ic['CustomerID'].values != df1['CustomerID_x'].values:
    df1.Cluster = ic.Cluster
else:
    df1.Cluster = 'NaN'
But I've got different clusters for the same customer.
   CustomerID_x        Date      Time  TotalSum CohortMonth  CohortIndex  CustomerID_y  Cluster
0       17850.0  2017-11-29  08:26:00     15.30  2017-11-01            1           NaN      1.0
1       17850.0  2017-11-29  08:26:00     20.34  2017-11-01            1           NaN      0.0
2       17850.0  2017-11-29  08:26:00     22.00  2017-11-01            1           NaN      1.0
3       17850.0  2017-11-29  08:26:00     20.34  2017-11-01            1           NaN      2.0
4       17850.0  2017-11-29  08:26:00     20.34  2017-11-01            1           NaN      1.0
Given what you've written, I think you want:
>>> df1 = pd.DataFrame({"CustomerID": [17850.0] * 4, "CohortIndex": [1,1,1,1] })
>>> df1
CustomerID CohortIndex
0 17850.0 1
1 17850.0 1
2 17850.0 1
3 17850.0 1
>>> df2
CustomerID Cluster
0 12346.0 1
1 17850.0 1
2 12345.0 1
>>> pd.merge(df1, df2, 'left', 'CustomerID')
CustomerID CohortIndex Cluster
0 17850.0 1 1
1 17850.0 1 1
2 17850.0 1 1
3 17850.0 1 1
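If you would rather avoid the merge, a minimal sketch that maps the cluster onto the original frame by CustomerID (assuming each customer belongs to a single cluster, which the drop_duplicates guards):
# Build a CustomerID -> Cluster lookup, then map it onto df1
cluster_by_customer = df2.drop_duplicates('CustomerID').set_index('CustomerID')['Cluster']
df1['Cluster'] = df1['CustomerID'].map(cluster_by_customer)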

Python pandas - remove groups based on NaN count threshold

I have a dataset based on different weather stations,
stationID | Time | Temperature | ...
----------+------+-------------+-------
123 | 1 | 30 |
123 | 2 | 31 |
202 | 1 | 24 |
202 | 2 | 24.3 |
202 | 3 | NaN |
...
And I would like to remove 'stationID' groups, which have more than a certain number of NaNs. For instance, if I type:
>>> df.groupby('stationID')
then, I would like to drop groups that have (at least) a certain number of NaNs (say 30) within a group. As I understand it, I cannot use dropna(thresh=10) with groupby:
>>> df2.groupby('station').dropna(thresh=30)
AttributeError: Cannot access callable attribute 'dropna' of 'DataFrameGroupBy' objects...
So, what would be the best way to do that with Pandas?
IIUC you can do:
df2.loc[df2.groupby('station')['Temperature'].filter(lambda x: len(x[pd.isnull(x)]) < 30).index]
Example:
In [59]:
df = pd.DataFrame({'id':[0,0,0,1,1,1,2,2,2,2], 'val':[1,1,np.nan,1,np.nan,np.nan, 1,1,1,1]})
df
Out[59]:
id val
0 0 1.0
1 0 1.0
2 0 NaN
3 1 1.0
4 1 NaN
5 1 NaN
6 2 1.0
7 2 1.0
8 2 1.0
9 2 1.0
In [64]:
df.loc[df.groupby('id')['val'].filter(lambda x: len(x[pd.isnull(x)] ) < 2).index]
Out[64]:
id val
0 0 1.0
1 0 1.0
2 0 NaN
6 2 1.0
7 2 1.0
8 2 1.0
9 2 1.0
So this will filter out the groups that have more than 1 NaN value.
You can create a column giving the number of null values per stationID, and then use loc to keep only the rows from groups under the threshold:
df['station_id_null_count'] = \
    df.groupby('stationID').Temperature.transform(lambda group: group.isnull().sum())
df.loc[df.station_id_null_count < 30, :]  # keep groups with fewer than 30 NaNs
Using @EdChum's setup: since you don't mention your final output, adding this.
vals = df.groupby(['id'])['val'].apply(lambda x: (np.size(x)-x.count()) < 2 )
vals[vals]
id
0 True
2 True
Name: val, dtype: bool
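As a further sketch, DataFrameGroupBy.filter can drop the offending groups directly and return the remaining rows; this assumes the threshold is 30 and the column of interest is Temperature:
# Keep only stationID groups whose Temperature column has fewer than 30 NaNs
kept = df.groupby('stationID').filter(lambda g: g['Temperature'].isnull().sum() < 30)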
