companies_id transaction_month count
0 2020-10-01 3
1 2020-10-01 5
1 2020-11-01 5
1 2020-12-01 18
1 2021-01-01 8
I want the result to be like
companies_id transaction_month count first_month
0 2020-10-01 3
1 2020-10-01 5 2020-10-01
1 2020-11-01 5 2020-10-01
1 2020-12-01 18 2020-10-01
1 2021-01-01 8 2020-10-01
This is my data set. I want to add a new column called "first_month" that contains, for each company, the transaction_month value of the first month whose count is >= 5.
For example, in the case of companies_id 1: the first month with 5 or more transactions is 2020-10-01, so the "first_month" column should contain 2020-10-01 throughout, i.e. in all rows with companies_id 1.
Use Series.where to replace transaction_month with NaN where count is not >= 5, then use GroupBy.transform with 'first' to broadcast the first non-missing value per group into a new column:
df['transaction_month'] = pd.to_datetime(df['transaction_month'])
print (df['transaction_month'].where(df['count'] >= 5))
0 NaT
1 2020-10-01
2 2020-11-01
3 2020-12-01
4 2021-01-01
Name: transaction_month, dtype: datetime64[ns]
df['first_month'] = (df['transaction_month'].where(df['count'] >= 5)
.groupby(df['companies_id'])
.transform('first'))
print (df)
companies_id transaction_month count first_month
0 0 2020-10-01 3 NaT
1 1 2020-10-01 5 2020-10-01
2 1 2020-11-01 5 2020-10-01
3 1 2020-12-01 18 2020-10-01
4 1 2021-01-01 8 2020-10-01
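An equivalent approach (a sketch, assuming rows are sorted by month within each company, as in the sample) filters first and maps the result back; companies with no qualifying month, like companies_id 0, get NaT:
first = df.loc[df['count'] >= 5].groupby('companies_id')['transaction_month'].first()
df['first_month'] = df['companies_id'].map(first)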
I have the following data frame, where time_stamp is already sorted in ascending order:
time_stamp indicator
0 2021-01-01 00:00:00 1
1 2021-01-01 00:02:00 1
2 2021-01-01 00:03:00 NaN
3 2021-01-01 00:04:00 NaN
4 2021-01-01 00:09:00 NaN
5 2021-01-01 00:14:00 NaN
6 2021-01-01 00:19:00 NaN
7 2021-01-01 00:24:00 NaN
8 2021-01-01 00:27:00 1
9 2021-01-01 00:29:00 NaN
10 2021-01-01 00:32:00 2
11 2021-01-01 00:34:00 NaN
12 2021-01-01 00:37:00 2
13 2021-01-01 00:38:00 NaN
14 2021-01-01 00:39:00 NaN
I want to create a new column that shows the time difference between each row's time_stamp and the time_stamp of the nearest row above it where indicator is not NaN.
Below is what the output should look like (time_diff is a timedelta value, but I'll show the subtraction by index to illustrate; for example, (2 - 1) means df['time_stamp'][2] - df['time_stamp'][1]):
time_stamp indicator time_diff
0 2021-01-01 00:00:00 1 NaT # (or undefined)
1 2021-01-01 00:02:00 1 1 - 0
2 2021-01-01 00:03:00 NaN 2 - 1
3 2021-01-01 00:04:00 NaN 3 - 1
4 2021-01-01 00:09:00 NaN 4 - 1
5 2021-01-01 00:14:00 NaN 5 - 1
6 2021-01-01 00:19:00 NaN 6 - 1
7 2021-01-01 00:24:00 NaN 7 - 1
8 2021-01-01 00:27:00 1 8 - 1
9 2021-01-01 00:29:00 NaN 9 - 8
10 2021-01-01 00:32:00 2 10 - 8
11 2021-01-01 00:34:00 NaN 11 - 10
12 2021-01-01 00:37:00 2 12 - 10
13 2021-01-01 00:38:00 NaN 13 - 12
14 2021-01-01 00:39:00 NaN 14 - 12
We can use a for loop that keeps track of the last non-NaN entry, but I'm looking for a solution that does not use a for loop.
I've ended up doing this:
# create an intermediate column holding the timestamp wherever the
# `indicator` value is not NaN, then forward-fill it so every row carries
# the most recent such timestamp
df['tracking'] = df['time_stamp'].where(df['indicator'].notna())
df['tracking'] = df['tracking'].ffill()
# use that to subtract the tracked value from `time_stamp`
df['time_diff'] = df['time_stamp'] - df['tracking']
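Note that this gives a marker row (one where indicator is not NaN) a time_diff of zero against itself, whereas the expected table measures it against the previous marker. A minimal sketch of that variant shifts the marked timestamps down one row, so each row only sees markers strictly above it:
df['time_diff'] = df['time_stamp'] - df['time_stamp'].where(df['indicator'].notna()).shift().ffill()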
Below is a sample of the dataframe (df):
   alpha  value
0      a      5
1      a      8
2      a      4
3      b      2
4      b      1
I know how to make a sequence of numbers per group:
df["serial"] = df.groupby("alpha").cumcount()+1
   alpha  value  serial
0      a      5       1
1      a      8       2
2      a      4       3
3      b      2       1
4      b      1       2
But instead of numbers I need a date-time sequence at 30-minute intervals.
Expected result:
   alpha  value               serial
0      a      5  2021-01-01 23:30:00
1      a      8  2021-01-02 00:00:00
2      a      4  2021-01-02 00:30:00
3      b      2  2021-01-01 23:30:00
4      b      1  2021-01-02 00:00:00
You can simply multiply your result by a pd.Timedelta and add it to a starting pd.Timestamp:
print ((df.groupby("alpha").cumcount()+1)*pd.Timedelta(minutes=30)+pd.Timestamp("2021-01-01 23:00:00"))
0 2021-01-01 23:30:00
1 2021-01-02 00:00:00
2 2021-01-02 00:30:00
3 2021-01-01 23:30:00
4 2021-01-02 00:00:00
dtype: datetime64[ns]
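To attach this as the serial column from the expected result, the same expression can simply be assigned:
df["serial"] = (df.groupby("alpha").cumcount() + 1) * pd.Timedelta(minutes=30) + pd.Timestamp("2021-01-01 23:00:00")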
Try to_datetime for the start timestamp and groupby with cumcount, multiplied by a 30-minute pd.Timedelta:
>>> df['serial'] = pd.to_datetime('2021-01-01 23:30:00') + df.groupby('alpha').cumcount() * pd.Timedelta(minutes=30)
>>> df
alpha value serial
0 a 5 2021-01-01 23:30:00
1 a 8 2021-01-02 00:00:00
2 a 4 2021-01-02 00:30:00
3 b 2 2021-01-01 23:30:00
4 b 1 2021-01-02 00:00:00
Problem: I am trying to make a bar chart based on a very simple Pandas DataFrame that has a DateTime index and integers in one column. Data below. Only some of the data is showing up in Matplotlib, however.
Code:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, ax = plt.subplots(figsize=(16,6))
ax.bar(unique_opps_by_month.index, unique_opps_by_month['Opportunity Name'])
ax.xaxis.set_major_locator(mdates.MonthLocator((1,4,7,10)))
ax.xaxis.set_minor_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))
ax.set_xlim('2016-01', '2021-04')
fig.autofmt_xdate()
The output, however, looks like this (figure not reproduced; many of the bars are missing):
This doesn't match the data though! There should be eight bars in 2017, and not all the same height. There are similar problems in the other years as well. Why are only some of the bars showing? How can I make Matplotlib show all the data?
Data
2016-02-01 1
2016-05-01 1
2016-08-01 1
2016-09-01 1
2017-01-01 1
2017-02-01 1
2017-03-01 1
2017-04-01 1
2017-07-01 3
2017-10-01 2
2017-11-01 3
2017-12-01 1
2018-02-01 2
2018-03-01 2
2018-04-01 2
2018-06-01 1
2018-07-01 1
2018-08-01 1
2018-11-01 1
2018-12-01 2
2019-03-01 5
2019-04-01 2
2019-05-01 1
2019-06-01 2
2019-07-01 1
2019-08-01 2
2019-09-01 4
2019-11-01 5
2020-01-01 4
2020-02-01 6
2020-03-01 1
2020-06-01 1
2020-07-01 2
2020-09-01 3
2020-10-01 5
2020-11-01 4
2020-12-01 6
2021-01-01 3
2021-02-01 6
2021-03-01 3
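A likely cause (an assumption, since the figure itself is not reproduced here): ax.bar defaults to a width of 0.8 in x-axis units, which on a date axis means 0.8 days. Spread over roughly five years of monthly data, each bar ends up narrower than a pixel, so many are dropped when rendered. Passing an explicit width on the order of days should make every bar visible:
ax.bar(unique_opps_by_month.index, unique_opps_by_month['Opportunity Name'],
       width=20)  # ~20 days wide, so monthly bars stay visible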
I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint zero, and each subsequent row holds a value recorded 5 minutes after the previous one.
I would like to calculate the mean of all the IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT" Time Elapsed=0.0 would have the value 128.3 ((100+105+180)/3); Time Elapsed=5.0 would have the value 150.0 ((150+110+190)/3); Time Elapsed=10.0 would have the value 133.3 ((125+90+185)/3), and so on for Time Elapsed=15, 20, 25, etc.
I'm not sure how to create a new column holding the time elapsed for each ID (e.g. 0.0, 5.0, 10.0, etc.). I think that once I know how to do that, I can use groupby to calculate the means for each elapsed time.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way: use groupby with transform to build the group key 'Time Elapsed', then group by it and take the mean:
df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
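The desired output lists Time Elapsed in minutes (0.0, 5.0, 10.0) rather than as timedeltas; a small follow-up sketch converts the key to minutes before grouping:
df['Time Elapsed'] = df['Time Elapsed'].dt.total_seconds() / 60
df.groupby('Time Elapsed').Value.mean()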
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame.
First get the year, month and day for each DateTime, since they all change in your data:
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential counter column (per this SO post). The counter is computed within (1) each year, (2) then each month and (3) then each day. Since the data are in multiples of 5 minutes, scale the counter by 5 so it counts elapsed minutes (0, 5, 10) rather than a sequence of increasing integers:
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() * 5
print(df)
   ID            DateTime  Value  month  day  year  Time Elapsed
1   1 2018-01-01 15:00:00    100      1    1  2018             0
1   1 2018-01-01 15:05:00    150      1    1  2018             5
1   1 2018-01-01 15:10:00    125      1    1  2018            10
2   2 2018-02-02 13:15:00    105      2    2  2018             0
2   2 2018-02-02 13:20:00    110      2    2  2018             5
2   2 2018-02-02 13:25:00     90      2    2  2018            10
3   3 2019-03-03 05:05:00    180      3    3  2019             0
3   3 2019-03-03 05:10:00    190      3    3  2019             5
3   3 2019-03-03 05:15:00    185      3    3  2019            10
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
0     128.333333
5     150.000000
10    133.333333
Name: Value, dtype: float64
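Note that grouping by ['year', 'month', 'day'] only works here because each ID happens to fall on a distinct date; if two IDs shared a date, their counters would interleave. Grouping by ID directly (a sketch) avoids that assumption:
df['Time Elapsed'] = df.groupby('ID').cumcount() * 5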
I have a dataframe with PERIOD_START_TIME at 15-minute intervals, and I need to aggregate to 1 hour and calculate the sum, avg and max for almost every column in the dataframe (it has about 20 columns).
PERIOD_START_TIME ID val1 val2
06.21.2017 22:15:00 12 3 0
06.21.2017 22:30:00 12 5 6
06.21.2017 22:45:00 12 0 3
06.21.2017 23:00:00 12 5 2
...
06.21.2017 22:15:00 15 9 2
06.21.2017 22:30:00 15 0 2
06.21.2017 22:45:00 15 1 5
06.21.2017 23:00:00 15 0 1
...
Desired output:
PERIOD_START_TIME ID val1(avg) val1(sum) val1(max) ...
06.21.2017 22:00:00 12 3.25 13 5
...
06.21.2017 23:00:00 15 2.25 10 9 ...
The same applies to val2 and every other column in the dataframe.
I have no idea how to group period start time by hour rather than by whole day, or how to start.
I believe you need Series.dt.floor to round down to hours, and then aggregate with agg:
# parse the month.day.year strings shown above, if not already datetime
df['PERIOD_START_TIME'] = pd.to_datetime(df['PERIOD_START_TIME'], format='%m.%d.%Y %H:%M:%S')
df = df.groupby([df['PERIOD_START_TIME'].dt.floor('H'),'ID']).agg(['mean','sum', 'max'])
# flatten the MultiIndex columns, e.g. ('val1', 'mean') -> 'val1_mean'
df.columns = df.columns.map('_'.join)
print (df)
val1_mean val1_sum val1_max val2_mean val2_sum \
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 2.666667 8 5 3 9
15 3.333333 10 9 3 9
2017-06-21 23:00:00 12 5.000000 5 5 2 2
15 0.000000 0 0 1 1
val2_max
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 6
15 5
2017-06-21 23:00:00 12 2
15 1
df = df.reset_index()
print (df)
PERIOD_START_TIME ID val1_mean val1_sum val1_max val2_mean val2_sum \
0 2017-06-21 22:00 12 2.666667 8 5 3 9
1 2017-06-21 22:00 15 3.333333 10 9 3 9
2 2017-06-21 23:00 12 5.000000 5 5 2 2
3 2017-06-21 23:00 15 0.000000 0 0 1 1
val2_max
0 6
1 5
2 2
3 1
Very similarly, you can convert PERIOD_START_TIME to a pandas Period:
df['PERIOD_START_TIME'] = df['PERIOD_START_TIME'].dt.to_period('H')
df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean']).reset_index()
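As in the floor-based answer, the resulting MultiIndex columns can be flattened before resetting the index (a sketch reusing the same pattern):
out = df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean'])
out.columns = out.columns.map('_'.join)  # e.g. ('val1', 'max') -> 'val1_max'
out = out.reset_index()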