I'm a new pandas user and I'm trying to do something with my DataFrame.
I have a DataFrame watchers with two columns, repo_id and created_at:
In: watchers.head()
Out:
repo_id created_at
0 1 2010-05-12 06:16:00
1 1 2009-02-16 12:51:54
2 2 2011-02-09 03:53:14
3 1 2010-09-01 09:05:21
4 2 2009-03-04 09:44:56
I want to create a new DataFrame grouped by the month of created_at and by repo_id, with the count of rows for each combination. The result should look similar to:
In: watchers_by_month()
Out:
repo_id month count
0 1 2009-02-28 32
1 1 2009-03-31 42
2 2 2009-05-31 3
3 2 2009-06-30 24
4 3 2013-04-30 23
The order doesn't matter; I just still need to know the repo_id for each count.
I tried a few things with my DataFrame, but I don't know how to achieve the effect above.
The only thing I could get:
In: watchers.index = watchers['created_at']
watchers.groupby(['repo_id', pd.Grouper(freq='M')]).count()
Out:
created_at
repo_id created_at
1 2009-02-28 323
2009-03-31 56
2009-04-30 29
2009-05-31 24
2009-06-30 35
... ... ...
107672 2013-04-30 6
2013-05-31 3
2013-06-30 3
2013-07-31 6
2013-08-31 1
Assuming your watchers['created_at'] is of dtype datetime64[ns], first create an additional month column:
watchers['month'] = watchers['created_at'].dt.month
watchers_by_month = (watchers.groupby(by=['repo_id', 'month'])['created_at']
                             .count()
                             .reset_index()
                             .rename(columns={'created_at': 'count'}))
If your watchers['created_at'] is not of dtype datetime64[ns], first convert it with pd.to_datetime(), then create the additional month column and run the above code.
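Note that .dt.month keeps only the month number, so e.g. February 2009 and February 2010 would be merged into one group. If you want year-plus-month granularity like the expected output, a monthly period is a drop-in variant (a sketch; swap it in for the month line above):
watchers['month'] = watchers['created_at'].dt.to_period('M')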
You are very close; just aggregate with size (which returns a Series, so reset_index accepts a name) and turn the result into a DataFrame with reset_index:
(watchers.groupby(['repo_id', pd.Grouper(freq='M')])
         .size().reset_index(name='count')
)
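To also match the expected column names, the grouped date level comes back under the index's name (created_at here, assuming you kept the index from your attempt), so a rename finishes the job:
watchers_by_month = (watchers.groupby(['repo_id', pd.Grouper(freq='M')])
                             .size()
                             .reset_index(name='count')
                             .rename(columns={'created_at': 'month'}))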
Let's say I have this DataFrame:
ID  date_time
1   2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020-04-14 22:10:56, 2021-06-02 22:18:06
2   2010-09-13 21:43:09, 2011-05-04 23:08:15, 2012-06-04 23:08:16
3   2013-06-14 23:29:17, 2014-08-13 23:20:22, 2014-08-13 23:20:22
I want to remove the YYYY-MM-DD date at the start of every comma-separated entry and calculate the average hour from the list.
The final output would be:
ID  date_time                    AVG_hour
1   21:10:56,22:18:06,22:10:56   22
2   21:43:09,23:08:15,23:08:16   22
3   23:29:17,23:20:22,23:20:22   22
I tried the following, but it did not work:
df['date_time'] = [para.split(None, 1)[1] for para in df['date_time']]
df.head()
Here is one way to do it:
# Split on comma, convert each value to a datetime, keep only the time part as a
# timedelta, take the total seconds and convert to hours, then average with np.mean
# and round the result
import numpy as np
import pandas as pd

df['Avg_hour'] = df['date_time'].str.split(',').apply(
    lambda x: round(np.mean([pd.to_timedelta(pd.to_datetime(i).strftime('%H:%M:%S')).total_seconds() / 3600
                             for i in x])))
df
ID date_time Avg_hour
0 1 2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020... 22
1 2 2010-09-13 21:43:09, 2011-05-04 23:08:15,2012-... 23
2 3 2013-06-14 23:29:17, 2014-08-13 23:20:22,2014-... 23
# Same as above, rounded to 2 decimal places
df['Avg_hour'] = df['date_time'].str.split(',').apply(
    lambda x: round(np.mean([pd.to_timedelta(pd.to_datetime(i).strftime('%H:%M:%S')).total_seconds() / 3600
                             for i in x]), 2))
df
ID date_time Avg_hour
0 1 2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020... 21.99
1 2 2010-09-13 21:43:09, 2011-05-04 23:08:15,2012-... 22.66
2 3 2013-06-14 23:29:17, 2014-08-13 23:20:22,2014-... 23.39
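The expected output also strips the date part out of date_time itself, which the code above leaves untouched. A sketch of that cleanup, assuming the time is always the last space-separated token of each entry:
# Keep only the HH:MM:SS part of every comma-separated entry
df['date_time'] = df['date_time'].str.split(',').apply(
    lambda x: ','.join(i.strip().split(' ')[-1] for i in x))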
I need to build a new column that compares each date with the previous one under a special rule: I need to find repeat purchases within the past 3 months. I have no idea how to do this. Here is an example and my expected output.
transaction.csv:
code,transaction_datetime
1,2021-12-01
1,2022-01-24
1,2022-05-29
2,2021-11-20
2,2022-04-12
2,2022-06-02
3,2021-04-23
3,2022-04-22
expected output:
code,transaction_datetime,repeat_purchase_P3M
1,2021-12-01,no
1,2022-01-24,2021-12-01
1,2022-05-29,no
2,2021-11-20,no
2,2022-04-12,no
2,2022-06-02,2022-04-12
3,2021-04-23,no
3,2022-04-22,no
df = pd.read_csv('file.csv')
df.transaction_datetime = pd.to_datetime(df.transaction_datetime)
grouped = df.groupby('code')['transaction_datetime']
# Within each code, take the previous purchase date (shift) and keep it only where the
# gap to the current purchase (diff) is under 90 days; otherwise write 'no'
df['repeated_purchase_P3M'] = grouped.shift().dt.date.where(grouped.diff().dt.days < 90, 'no')
df
code transaction_datetime repeated_purchase_P3M
0 1 2021-12-01 no
1 1 2022-01-24 2021-12-01
2 1 2022-05-29 no
3 2 2021-11-20 no
4 2 2022-04-12 no
5 2 2022-06-02 2022-04-12
6 3 2021-04-23 no
7 3 2022-04-22 no
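If "past 3 months" should mean calendar months rather than a flat 90 days, a variant with pd.DateOffset works too (the month-based reading is my assumption, not the asker's stated rule):
prev = grouped.shift()
# A purchase repeats if it falls on or before the previous purchase date plus 3 calendar months
within_3m = df.transaction_datetime <= prev + pd.DateOffset(months=3)
df['repeat_purchase_P3M'] = prev.dt.date.where(within_3m, 'no')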
I have a time series that looks like this:
value date
63.85 2017-01-15
63.95 2017-01-22
63.88 2017-01-29
64.02 2017-02-05
63.84 2017-02-12
62.13 2017-03-05
65.36 2017-03-25
66.45 2017-04-25
And I would like to reverse the order of the rows so they look like this:
value date
66.45 2000-01-01
65.36 2000-02-01
62.13 2000-02-20
63.84 2000-03-12
64.02 2000-03-19
63.88 2000-03-26
63.95 2000-04-02
63.85 2000-04-09
As you can see, the "value" column simply has its rows flipped, but for the date column I would like to keep the same "difference in days" between consecutive dates, flipped as well. It doesn't really matter what the start date is, as long as the day differences are flipped correctly. In the second dataframe of the example, the start date is 2000-01-01 and the second value is 2000-02-01, which is 31 days later: the same difference as between the last (2017-04-25) and penultimate (2017-03-25) dates of the first dataframe. Likewise, between the second (2000-02-01) and third (2000-02-20) values of the second dataframe the "difference in days" is 20 days, the same as between the penultimate (2017-03-25) and antepenultimate (2017-03-05) dates of the first dataframe. And so on.
I believe the first step would be to calculate these "day differences", but I would like to know how to do it efficiently. Thank you :)
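For reference, the flipped day differences described above can be computed in one line; the answers below build full solutions around this idea (a sketch assuming date is already datetime64):
day_diff = -df['date'].iloc[::-1].diff()  # NaT, 31 days, 20 days, 21 days, 7 days, ...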
NumPy has support for this via its datetime and timedelta data types.
First you reverse both columns in your time series as follows:
import pandas as pd
import numpy as np
df2 = df.iloc[::-1]
df2
where df is your original time series data and df2 (shown below) is the reversed time series.
value date
7 66.45 2017-04-25
6 65.36 2017-03-25
5 62.13 2017-03-05
4 63.84 2017-02-12
3 64.02 2017-02-05
2 63.88 2017-01-29
1 63.95 2017-01-22
0 63.85 2017-01-15
Next you find the day differences and store them as timedelta objects:
dates_np = np.array(df2.date).astype(np.datetime64)  # Convert dates to np.datetime64 objects
timeDeltas = np.insert(abs(np.diff(dates_np)), 0, 0)  # Prepend a 0 since np.diff shortens the array by one
d2 = {'value': df2.value, 'day_diff': timeDeltas}     # Data for a new dataframe (df3)
df3 = pd.DataFrame(data=d2)
df3
where df3 (the day differences table) looks like this:
value day_diff
7 66.45 0 days
6 65.36 31 days
5 62.13 20 days
4 63.84 21 days
3 64.02 7 days
2 63.88 7 days
1 63.95 7 days
0 63.85 7 days
Lastly, to get back to dates accumulating from a start date, you do the following:
startDate = np.datetime64('2000-01-01') # You can change this if you like
df4 = df2.copy()  # Copy column data from df2
df4.date = np.cumsum(df3.day_diff) + startDate  # np.cumsum accumulates the day_diff sums
df4
where df4 (the start date accumulation) looks like this:
value date
7 66.45 2000-01-01
6 65.36 2000-02-01
5 62.13 2000-02-21
4 63.84 2000-03-13
3 64.02 2000-03-20
2 63.88 2000-03-27
1 63.95 2000-04-03
0 63.85 2000-04-10
I noticed there is a 1-day discrepancy with the expected table from the third row onward; this appears to come from the expected table itself, since 2000-02-01 plus the stated 20 days is 2000-02-21, not 2000-02-20.
Here's how I did it:
Creating the DataFrame:
value = [63.85, 63.95, 63.88, 64.02, 63.84, 62.13, 65.36, 66.45]
date = ["2017-01-15", "2017-01-22", "2017-01-29", "2017-02-05", "2017-02-12", "2017-03-05", "2017-03-25", "2017-04-25",]
df = pd.DataFrame({"value": value, "date": date})
Creating a second DataFrame with the values reversed and converting the date column to datetime
new_df = df.astype({'date': 'datetime64[ns]'})
new_df.sort_index(ascending=False, inplace=True, ignore_index=True)
new_df
value date
0 66.45 2017-04-25
1 65.36 2017-03-25
2 62.13 2017-03-05
3 63.84 2017-02-12
4 64.02 2017-02-05
5 63.88 2017-01-29
6 63.95 2017-01-22
7 63.85 2017-01-15
I then used pandas.Series.diff to calculate the time delta between each row and converted those values to absolute values.
time_delta_series = new_df['date'].diff().abs()
time_delta_series
0 NaT
1 31 days
2 20 days
3 21 days
4 7 days
5 7 days
6 7 days
7 7 days
Name: date, dtype: timedelta64[ns]
Then you need to convert those values to a cumulative time delta.
But to use the cumsum() method you need to first remove the missing values (NaT).
time_delta_series = time_delta_series.fillna(pd.Timedelta(seconds=0)).cumsum()
time_delta_series
0 0 days
1 31 days
2 51 days
3 72 days
4 79 days
5 86 days
6 93 days
7 100 days
Name: date, dtype: timedelta64[ns]
Then you can create your starting date and create the date column for the second DataFrame we created before:
from datetime import date
start = date(2000, 1, 1)
new_df['date'] = start
new_df['date'] = new_df['date'] + time_delta_series
new_df
value date
0 66.45 2000-01-01
1 65.36 2000-02-01
2 62.13 2000-02-21
3 63.84 2000-03-13
4 64.02 2000-03-20
5 63.88 2000-03-27
6 63.95 2000-04-03
7 63.85 2000-04-10
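For comparison, the same result in a more compact form (a sketch, assuming df['date'] is already datetime64):
start = pd.Timestamp('2000-01-01')
rev = df.iloc[::-1].reset_index(drop=True)
# Flip the rows, take the absolute day differences, accumulate them, and offset from the start date
rev['date'] = start + rev['date'].diff().abs().fillna(pd.Timedelta(0)).cumsum()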
I have a column called Date, of object dtype, holding both date and time in the form '2019/10/07,12:44:58'.
I have tried slicing the date part out of this Date column and then converting it to a proper date format. I want to apply this to the Date column to create a new column called date1, without using a for loop.
As we can see, the first two rows have a different date format, so we convert the rest to datetime first with errors='coerce'. Then we convert the first two rows with their own format and use fillna to bring both together:
# Parse the YYYY/MM/DD rows; the DD-MM-YYYY rows become NaT
date1 = pd.to_datetime(data['Date'], format='%Y/%m/%d,%H:%M:%S', errors='coerce')
# Parse only the rows that failed above, using the DD-MM-YYYY format
date2 = pd.to_datetime(data.loc[date1.isna(), 'Date'], format='%d-%m-%Y,%H:%M:%S')
data['Date'] = date1.fillna(date2)
Date Open High Low Close Qty Value(Lk) \
0 2019-10-07 12:45:17 1208.65 1208.85 1208.40 1208.85 1125 13.60
1 2019-10-07 12:45:00 1208.70 1209.10 1208.40 1209.10 9344 112.95
2 2019-10-07 12:43:58 1208.80 1209.40 1208.35 1208.65 7342 88.75
3 2019-10-07 12:42:58 1208.70 1209.20 1208.40 1209.00 9355 113.08
4 2019-10-07 12:41:57 1208.75 1209.00 1207.80 1208.35 5890 71.17
Trades BS
0 4
1 15
2 13
3 15
4 13
Original data:
Date Open High Low Close Qty Value(Lk) \
0 07-10-2019,12:45:17 1208.65 1208.85 1208.40 1208.85 1125 13.60
1 07-10-2019,12:45:00 1208.70 1209.10 1208.40 1209.10 9344 112.95
2 2019/10/07,12:43:58 1208.80 1209.40 1208.35 1208.65 7342 88.75
3 2019/10/07,12:42:58 1208.70 1209.20 1208.40 1209.00 9355 113.08
4 2019/10/07,12:41:57 1208.75 1209.00 1207.80 1208.35 5890 71.17
Trades BS
0 4
1 15
2 13
3 15
4 13
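The question also asked for a new date1 column holding just the date part; once Date is datetime, that is one more line (dt.date gives plain Python date objects; use dt.normalize() instead to stay datetime64):
data['date1'] = data['Date'].dt.date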
My data looks like below:
id, date, target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-25,44
2,2016-10-27,12
I want to fill in the missing dates within each id.
For example, the date range of id=1 is 2016-10-24 ~ 2016-10-28, and 2016-10-26 is missing. Moreover, the date range of id=2 is 2016-10-21 ~ 2016-10-27, and 2016-10-23, 2016-10-24 and 2016-10-26 are missing.
I want to fill in the missing dates and fill in the target value as 0.
Therefore, I want my data to be as below:
id, date, target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-26,0
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-23,0
2,2016-10-24,0
2,2016-10-25,44
2,2016-10-26,0
2,2016-10-27,12
Can somebody help me?
Thanks in advance.
You can use groupby with resample - the problem is that fillna needs actual rows to fill, so you need asfreq first:
#if necessary convert to datetime
df.date = pd.to_datetime(df.date)
df = df.set_index('date')
df = df.groupby('id').resample('d')['target'].asfreq().fillna(0).astype(int).reset_index()
print (df)
id date target
0 1 2016-10-24 22
1 1 2016-10-25 31
2 1 2016-10-26 0
3 1 2016-10-27 44
4 1 2016-10-28 12
5 2 2016-10-21 22
6 2 2016-10-22 31
7 2 2016-10-23 0
8 2 2016-10-24 0
9 2 2016-10-25 44
10 2 2016-10-26 0
11 2 2016-10-27 12
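An equivalent sketch without resample, reindexing each group against a full daily date_range (starting again from the raw frame, with date already converted to datetime):
out = (df.set_index('date')
         .groupby('id')['target']
         .apply(lambda g: g.reindex(pd.date_range(g.index.min(), g.index.max(), freq='d'),
                                    fill_value=0))
         .reset_index())
out.columns = ['id', 'date', 'target']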