I want to select duplicate rows between 2 dataframes - python

I want to filter rows in df1 by its date column (dtype datetime64[ns]) against df2 (same column name and dtype). I tried searching for a solution, but I get errors such as:
Can only compare identically-labeled Series objects or 'Timestamp' object is not iterable.
sample df1

id  date        value
1   2018-10-09  120
2   2018-10-09  60
3   2018-10-10  59
4   2018-11-25  120
5   2018-08-25  120
sample df2
date
2018-10-09
2018-10-10
sample result that I want

id  date        value
1   2018-10-09  120
2   2018-10-09  60
3   2018-10-10  59
In fact, I want this program to run once every 7 days, counting back from the day it starts, so I want to remove dates that are not within the past 7 days.
from datetime import date, timedelta

import pandas as pd

# create new dataframe -> df2
df2 = pd.DataFrame({'date': []})

# Set the date to the last 7 days.
days_use = 7  # 7 -> 1
for x in range(days_use, 0, -1):
    use_day = date.today() - timedelta(days=x)
    df2.loc[x] = use_day

# Change to datetime64[ns]
df2['date'] = pd.to_datetime(df2['date'])

Use isin:
>>> df1[df1["date"].isin(df2["date"])]
id date value
0 1 2018-10-09 120
1 2 2018-10-09 60
2 3 2018-10-10 59
If you want to create df2 with the dates for the past week, you can simply use pd.date_range:
df2 = pd.DataFrame({"date": pd.date_range(pd.Timestamp.today().date() - pd.DateOffset(7), periods=7)})
>>> df2
date
0 2022-05-03
1 2022-05-04
2 2022-05-05
3 2022-05-06
4 2022-05-07
5 2022-05-08
6 2022-05-09
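
Alternatively, if the end goal is only the past-week filter, a boolean mask avoids building df2 at all. A minimal sketch, assuming df1['date'] is already datetime64[ns]:

import pandas as pd

# Keep rows whose date falls within the 7 days before today.
today = pd.Timestamp.today().normalize()
recent = df1[(df1['date'] >= today - pd.Timedelta(days=7)) & (df1['date'] < today)]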

Related

How to do a rolling count of values grouping by date python

Hi, I have a table of data like the one below, and I want to do a rolling count that groups by date and also counts the values of prior dates.
Table of data:

Date      ID
1/1/2020  123
2/1/2020  432
2/1/2020  5234
4/1/2020  543
5/1/2020  645
6/1/2020  231
My desired output is something like this:

Date      count
1/1/2020  1
2/1/2020  3
4/1/2020  4
5/1/2020  5
6/1/2020  6
I have tried the following, but it doesn't work the way I want:
df[['id','date']].groupby('date').cumcount()
Convert the column to datetimes for correct ordering, then aggregate with GroupBy.size and add a cumulative sum with Series.cumsum:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.groupby('Date').size().cumsum().reset_index(name='count')
print (df)
Date count
0 2020-01-01 1
1 2020-01-02 3
2 2020-01-04 4
3 2020-01-05 5
4 2020-01-06 6
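
For reference, the posted table can be rebuilt as follows so the snippet above runs as-is (a minimal sketch):

import pandas as pd

df = pd.DataFrame({'Date': ['1/1/2020', '2/1/2020', '2/1/2020',
                            '4/1/2020', '5/1/2020', '6/1/2020'],
                   'ID': [123, 432, 5234, 543, 645, 231]})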

How to filter by a pandas column if the number of unique values in another column is at least a given value

Please, I want to filter AccountIDs that have transaction data for at least 3 months. This is just a small fraction of the entire dataset.
Here is what I did, but I am not sure it is right.
data = data.groupby('AccountID').apply(lambda x: x['TransactionDate'].nunique() >= 3)
I get a Series with boolean values as output, but I want a pandas DataFrame.
TransactionDate AccountID TransactionAmount
0 2020-12-01 8 400000.0
1 2020-12-01 22 25000.0
2 2020-12-02 22 551500.0
3 2020-01-01 17 116.0
4 2020-01-01 24 2000.0
5 2020-01-02 68 6000.0
6 2020-03-03 20 180000.0
7 2020-03-01 66 34000.0
8 2020-02-01 66 20000.0
9 2020-02-01 66 40600.0
The output I get:
AccountID
1 True
2 True
3 True
4 True
5 True
You are close; you need GroupBy.transform to repeat the aggregated values across a Series the same size as the original df, which makes filtering by boolean indexing possible:
data = data[data.groupby('AccountID')['TransactionDate'].transform('nunique') >= 3]
If dates within the same month can fall on different days, count unique months instead: use Series.dt.to_period to build a helper column of month periods:
s = data.assign(new = data['TransactionDate'].dt.to_period('m')).groupby('AccountID')['new'].transform('nunique')
data = data[s >= 3]
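
For illustration, a runnable sketch of the month-based filter on the posted fraction; the threshold is lowered to 2 here because no account in this small sample covers 3 distinct months:

import pandas as pd

# The posted fraction as a frame (a minimal reconstruction).
data = pd.DataFrame({
    'TransactionDate': pd.to_datetime(['2020-12-01', '2020-12-01', '2020-12-02',
                                       '2020-01-01', '2020-01-01', '2020-01-02',
                                       '2020-03-03', '2020-03-01', '2020-02-01',
                                       '2020-02-01']),
    'AccountID': [8, 22, 22, 17, 24, 68, 20, 66, 66, 66],
    'TransactionAmount': [400000.0, 25000.0, 551500.0, 116.0, 2000.0,
                          6000.0, 180000.0, 34000.0, 20000.0, 40600.0],
})

# Distinct months per account, repeated per row via transform.
s = (data.assign(new=data['TransactionDate'].dt.to_period('m'))
         .groupby('AccountID')['new'].transform('nunique'))
print(data[s >= 2])  # rows for AccountID 66, which spans February and March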

How can I join columns by DatetimeIndex, matching day, month and hour from data from different years?

I have a dataset with meteorological features for 2019, to which I want to join two columns of power consumption datasets for 2017, 2018. I want to match them by hour, day and month, but the data belongs to different years. How can I do that?
The meteo dataset is a 6 column similar dataframe with datetimeindexes belonging to 2019.
You can create three additional columns from the index that represent the hour, day and month, and use them for a later join. DatetimeIndex has attributes for the different parts of the timestamp:
import pandas as pd
ind = pd.date_range(start='2020-01-01', end='2020-01-20', periods=10)
df = pd.DataFrame({'number' : range(10)}, index = ind)
df['hour'] = df.index.hour
df['day'] = df.index.day
df['month'] = df.index.month
print(df)
number hour day month
2020-01-01 00:00:00 0 0 1 1
2020-01-03 02:40:00 1 2 3 1
2020-01-05 05:20:00 2 5 5 1
2020-01-07 08:00:00 3 8 7 1
2020-01-09 10:40:00 4 10 9 1
2020-01-11 13:20:00 5 13 11 1
2020-01-13 16:00:00 6 16 13 1
2020-01-15 18:40:00 7 18 15 1
2020-01-17 21:20:00 8 21 17 1
2020-01-20 00:00:00 9 0 20 1
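
The actual join can then be done on those helper columns. A minimal sketch with hypothetical meteo and power frames (the names and values are assumptions, not from the question):

import pandas as pd

# Hypothetical data: weather for 2019, consumption for 2017.
meteo = pd.DataFrame({'temp': [3.1, 2.4]},
                     index=pd.to_datetime(['2019-03-01 00:00', '2019-03-01 01:00']))
power = pd.DataFrame({'kwh': [10.0, 20.0]},
                     index=pd.to_datetime(['2017-03-01 00:00', '2017-03-01 01:00']))

for frame in (meteo, power):
    frame['hour'] = frame.index.hour
    frame['day'] = frame.index.day
    frame['month'] = frame.index.month

# Match rows by month, day and hour, ignoring the year.
joined = meteo.merge(power, on=['month', 'day', 'hour'], how='left')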

Number of active IDs in each period

I have a dataframe that looks like this
ID | START | END
1 |2016-12-31|2017-02-30
2 |2017-01-30|2017-10-30
3 |2016-12-21|2018-12-30
I want to know the number of active IDs in each possible day. So basically count the number of overlapping time periods.
What I did to calculate this was to create a new DataFrame c_df with the columns date and count. The date column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then for every row in my original data frame I calculated a range between its start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment the corresponding count cells in c_df by one.
All these loops, though, are not very efficient for big datasets, and they look ugly. Is there a more efficient way of doing this?
If your dataframe is small enough so that performance is not a concern, create a date range for each row, then explode them and count how many times each date exists in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
  .explode() \
  .value_counts() \
  .sort_index()
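
For reference, a runnable sketch of the sample frame (the posted END of 2017-02-30 is not a valid calendar date, so 2017-02-28 is substituted here):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'START': pd.to_datetime(['2016-12-31', '2017-01-30', '2016-12-21']),
                   'END': pd.to_datetime(['2017-02-28', '2017-10-30', '2018-12-30'])})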
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Works with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]
end = df['END'].values[:, None]
mask = (start <= dates) & (dates <= end)
result = pd.DataFrame({
    'Date': dates,
    'Count': mask.sum(axis=0)
})
Create an IntervalIndex and use a generator or list comprehension with contains to check each date against each interval (Note: I made a smaller sample to test this solution on).
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
                         'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1

Getting complicated average values for pandas DataFrame

I have a simple DataFrame with 2 columns, date and value. I need to create another DataFrame which would contain an average value for every month of every year. For example, I have daily data in the range from 2015-01-01 till 2018-12-31, and
I need averages for every month in 2015, 2016, etc.
What is the easiest way to do that?
You can aggregate by month period with Series.dt.to_period and mean:
df['date'] = pd.to_datetime(df['date'])
df1 = df.groupby(df['date'].dt.to_period('m'))['col'].mean().reset_index()
Another solution with year and months in separate columns:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df1 = df.groupby(['year','month'])['col'].mean().reset_index()
Sample:
df = pd.DataFrame({'date':['2015-01-02','2016-03-02','2015-01-23','2016-01-12','2015-03-02'],
                   'col':[1,2,5,4,6]})
print (df)
date col
0 2015-01-02 1
1 2016-03-02 2
2 2015-01-23 5
3 2016-01-12 4
4 2015-03-02 6
df['date'] = pd.to_datetime(df['date'])
df1 = df.groupby(df['date'].dt.to_period('m'))['col'].mean().reset_index()
print (df1)
date col
0 2015-01 3
1 2015-03 6
2 2016-01 4
3 2016-03 2
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df2 = df.groupby(['year','month'])['col'].mean().reset_index()
print (df2)
year month col
0 2015 1 3
1 2015 3 6
2 2016 1 4
3 2016 3 2
To get the monthly average values of a DataFrame when it has daily data rows, I would:
Convert the column with the dates, df['date'], into the index of the DataFrame: df.set_index('date', inplace=True)
Then convert the index dates into a month index: df.index.month
Finally, calculate the mean of the DataFrame grouped by month: df.groupby(df.index.month).data.mean()
I go slowly through each step here:
Generate a DataFrame with dates and values
You first need to import pandas and NumPy, as well as the datetime module:
import pandas as pd
import numpy as np
from datetime import datetime
Generate a column 'date' between 1/1/2018 and 3/05/2018, at weekly 'W' intervals, and a column 'data' with random integers between 0 and 99:
date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame(date_rng, columns=['date'])
df['data']=np.random.randint(0,100,size=(len(date_rng)))
The df has two columns, 'date' and 'data':
date data
0 2018-01-07 42
1 2018-01-14 54
2 2018-01-21 30
3 2018-01-28 43
4 2018-02-04 65
5 2018-02-11 40
6 2018-02-18 3
7 2018-02-25 55
8 2018-03-04 81
Set the 'date' column as the index of the DataFrame:
df.set_index('date',inplace=True)
df has one column 'data' and the index is 'date':
data
date
2018-01-07 42
2018-01-14 54
2018-01-21 30
2018-01-28 43
2018-02-04 65
2018-02-11 40
2018-02-18 3
2018-02-25 55
2018-03-04 81
Capture the month number from the index:
months=df.index.month
Obtain the mean value of each month by grouping by month:
monthly_avg=df.groupby(months).data.mean()
The mean of the dataset by month, 'monthly_avg', is:
date
1 42.25
2 40.75
3 81.00
Name: data, dtype: float64
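
One caveat: grouping by df.index.month alone merges the same month across different years. If the data spans several years, group by year and month together (a minimal sketch):

# Keep years separate by grouping on both parts of the date.
monthly_avg = df.groupby([df.index.year, df.index.month])['data'].mean()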
