Reshaping tables in pandas - python

Below is an extract of a dataframe which I have created my merging multiple query log dataframes:
keyword hits date average time
1 the cat sat on 10 10-Jan 10
2 who is the sea 5 10-Jan 1.2
3 under the earth 30 1-Dec 2.5
4 what is this 100 1-Feb 9
Is there a way I can pivot the data using Pandas so that rows are daily dates (e.g. 1-Jan, 2-Jan etc.) and the corresponding 1 column to each date is the daily sum of hits (sum of the hits for that day e.g. sum of hits for 1-Jan) divided by the monthly sum of hits (e.g. for the whole of Jan) for that month (i.e. the month normalised daily hit percentage for each day)

Parse the dates so we can extract the month later.
In [99]: df.date = df.date.apply(pd.Timestamp)
In [100]: df
Out[100]:
keyword hits date average time
1 the cat sat on 10 2013-01-10 00:00:00 10.0
2 who is the sea 5 2013-01-10 00:00:00 1.2
3 under the earth 30 2013-12-01 00:00:00 2.5
4 what is this 100 2013-02-01 00:00:00 9.0
Group by day and sum the hits.
In [101]: daily_totals = df.groupby('date').hits.sum()
In [102]: daily_totals
Out[102]:
date
2013-01-10 15
2013-02-01 100
2013-12-01 30
Name: hits, dtype: int64
Group by month, and divide each row (each daily total) by the sum of all the daily totals in that month.
In [103]: normalized_totals = daily_totals.groupby(lambda d: d.month).transform(lambda x: float(x)/x.sum())
In [104]: normalized_totals
Out[104]:
date
2013-01-10 1
2013-02-01 1
2013-12-01 1
Name: hits, dtype: int64
Your simple example only gave one day in each month, so all these are 1.

Related

Calculating values from time series in pandas multi-indexed pivot tables

I've got a dataframe in pandas that stores the Id of a person, the quality of interaction, and the date of the interaction. A person can have multiple interactions across multiple dates, so to help visualise and plot this I converted it into a pivot table grouping first by Id then by date to analyse the pattern over time.
e.g.
import pandas as pd
df = pd.DataFrame({'Id':['A4G8','A4G8','A4G8','P9N3','P9N3','P9N3','P9N3','C7R5','L4U7'],
'Date':['2016-1-1','2016-1-15','2016-1-30','2017-2-12','2017-2-28','2017-3-10','2019-1-1','2018-6-1','2019-8-6'],
'Quality':[2,3,6,1,5,10,10,2,2]})
pt = df.pivot_table(values='Quality', index=['Id','Date'])
print(pt)
Leads to this:
Id
Date
Quality
A4G8
2016-1-1
2
2016-1-15
4
2016-1-30
6
P9N3
2017-2-12
1
2017-2-28
5
2017-3-10
10
2019-1-1
10
C7R5
2018-6-1
2
L4U7
2019-8-6
2
However, I'd also like to...
Measure the time from the first interaction for each interaction per Id
Measure the time from the previous interaction with the same Id
So I'd get a table similar to the one below
Id
Date
Quality
Time From First
Time To Prev
A4G8
2016-1-1
2
0 days
NA days
2016-1-15
4
14 days
14 days
2016-1-30
6
29 days
14 days
P9N3
2017-2-12
1
0 days
NA days
2017-2-28
5
15 days
15 days
2017-3-10
10
24 days
9 days
The Id column is a string type, and I've converted the date column into datetime, and the Quality column into an integer.
The column is rather large (>10,000 unique ids) so for performance reasons I'm trying to avoid using for loops. I'm guessing the solution is somehow using pd.eval but I'm stuck as to how to apply it correctly.
Apologies I'm a python, pandas, & stack overflow) noob and I haven't found the answer anywhere yet so even some pointers on where to look would be great :-).
Many thanks in advance
Convert Dates to datetimes and then substract minimal datetimes per groups by GroupBy.transformb subtracted by column Date and for second new column use DataFrameGroupBy.diff:
df['Date'] = pd.to_datetime(df['Date'])
df['Time From First'] = df['Date'].sub(df.groupby('Id')['Date'].transform('min'))
df['Time To Prev'] = df.groupby('Id')['Date'].diff()
print (df)
Id Date Quality Time From First Time To Prev
0 A4G8 2016-01-01 2 0 days NaT
1 A4G8 2016-01-15 3 14 days 14 days
2 A4G8 2016-01-30 6 29 days 15 days
3 P9N3 2017-02-12 1 0 days NaT
4 P9N3 2017-02-28 5 16 days 16 days
5 P9N3 2017-03-10 10 26 days 10 days
6 P9N3 2019-01-01 10 688 days 662 days
7 C7R5 2018-06-01 2 0 days NaT
8 L4U7 2019-08-06 2 0 days NaT
df["Date"] = pd.to_datetime(df.Date)
df = df.merge(
df.groupby(["Id"]).Date.first(),
on="Id",
how="left",
suffixes=["", "_first"]
)
df["Time From First"] = df.Date-df.Date_first
df['Time To Prev'] = df.groupby('Id').Date.diff()
df.set_index(["Id", "Date"], inplace=True)
df
output:

Keep only rows from first X hours with starting point from another dataframe

I have a DataFrame (df1) with patients, where each patient (with unique id) has an admission timestamp:
admission_timestamp id
0 2020-03-31 12:00:00 1
1 2021-01-13 20:52:00 2
2 2020-04-02 07:36:00 3
3 2020-04-05 16:27:00 4
4 2020-03-21 18:51:00 5
I also have a DataFrame (df2) with for each patient (with unique id), data for a specific feature. For example:
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
2 1 temperature 2020-04-03 13:04:33 36.51
3 2 temperature 2020-04-02 07:44:12 36.45
4 2 temperature 2020-04-08 08:36:00 36.50
Where effective_timestamp is of type: datetime64[ns], for both columns. The ids for both dataframes link to the same patients.
In reality there is a lot more data with +- 1 value per minute. What I want is for each patient, only the data for the first X (say 24) hours after the admission timestamp from df1. So the above would result in:
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
3 2 temperature 2020-04-02 07:44:12 36.45
This would thus include first searching for the admission timestamp, and with this timestamp, drop all rows for that patient where the effective_timestamp is not within X hours from the admission timestamp. Here, X should be variable (could be 7, 24, 72, etc). I could not find a similar question on SO. I tried this using panda's date_range but I don't know how to perform that for each patient, with a variable value for X. Any help is appreciated.
Edit: I could also merge the dataframes together so each row in df2 has the admission_timestamp, and then subtract the two columns to get the difference in time. And then drop all rows where difference > X. But this sounds very cumbersome.
Let's use pd.DateOffset
First get the value of admission_timestamp for a given patient id, and convert it to pandas datetime.
Let's say id = 1
>>admissionTime = pd.to_datetime(df1[df1['id'] == 1]['admission_timestamp'].values[0])
>>admissionTime
Timestamp('2020-03-31 12:00:00')
Now, you just need to use pd.DateOffset to add 24 hours to it.
>>admissionTime += pd.DateOffset(hours=24)
Now, just look for the rows where id=1 and effective_timestamp < admissionTime
>>df2[(df2['id'] == 1) & (df2['effective_timestamp']<admissionTime)]
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61

Reverse rows in time series dataframe

I have a time series that looks like this:
value date
63.85 2017-01-15
63.95 2017-01-22
63.88 2017-01-29
64.02 2017-02-05
63.84 2017-02-12
62.13 2017-03-05
65.36 2017-03-25
66.45 2017-04-25
And I would like to reverse the order of the rows so they look like this:
value date
66.45 2000-01-01
65.36 2000-02-01
62.13 2000-02-20
63.84 2000-03-12
64.02 2000-03-19
63.88 2000-03-26
63.95 2000-04-02
63.85 2000-04-09
As you can see, the "value" column requires to simply flip the row values, but for the date column what I would like to do is keep the same "difference in days" between dates. It doesn't really matter what the start date value is as long as the difference in days is flipped correctly too. In the second dataframe of the example, the start date value is 2000-01-01 and the second value is 2020-02-01, which is 31 days later than the first date. This "day difference" of 31 days is the same one as the last (2017-04-25) and penultimate date (2017-03-25) of the first dataframe. And, the same for the second (2000-02-01) and the third value (2000-02-20) of the second dataframe: the "difference in days" is 20 days, the same one between the penultimate date (2017-03-25) and the antepenultimate date (2017-03-05) of the first dataframe. And so on.
I believe that the steps needed to do this would require to first calculate this "day differences", but I would like to know how to do it efficiently. Thank you :)
NumPy has support for this via its datetime and timedelta data types.
First you reverse both columns in your time series as follows:
import pandas as pd
import numpy as np
df2 = df
df2 = df2.iloc[::-1]
df2
where df is your original time series data and df2 (shown below) is the reversed time series.
value date
7 66.45 2017-04-25
6 65.36 2017-03-25
5 62.13 2017-03-05
4 63.84 2017-02-12
3 64.02 2017-02-05
2 63.88 2017-01-29
1 63.95 2017-01-22
0 63.85 2017-01-15
Next you find the day differences and store them as timedelta objects:
dates_np = np.array(df2.date).astype(np.datetime64) # Convert dates to np.datetime64 ojects
timeDeltas = np.insert(abs(np.diff(dates_np)), 0, 0) # np.insert is to account for -1 length during np.diff call
d2 = {'value': df_reversed.value, 'day_diff': timeDeltas} # Create new dataframe (df3)
df3 = pd.DataFrame(data=d2)
df3
where df3 (the day differences table) looks like this:
value day_diff
7 66.45 0 days
6 65.36 31 days
5 62.13 20 days
4 63.84 21 days
3 64.02 7 days
2 63.88 7 days
1 63.95 7 days
0 63.85 7 days
Lastly, to get back to dates accumulating from a start data, you do the following:
startDate = np.datetime64('2000-01-01') # You can change this if you like
df4 = df2 # Copy coumn data from df2
df4.date = np.array(np.cumsum(df3.day_diff) + startDate # np.cumsum accumulates the day_diff sum
df4
where df4 (the start date accumulation) looks like this:
value date
7 66.45 2000-01-01
6 65.36 2000-02-01
5 62.13 2000-02-21
4 63.84 2000-03-13
3 64.02 2000-03-20
2 63.88 2000-03-27
1 63.95 2000-04-03
0 63.85 2000-04-10
I noticed there is a 1-day discrepancy with my final table, however this is most likely due to the implementation of timedelta inclusivity/exluclusivity.
Here's how I did it:
Creating the DataFrame:
value = [63.85, 63.95, 63.88, 64.02, 63.84, 62.13, 65.36, 66.45]
date = ["2017-01-15", "2017-01-22", "2017-01-29", "2017-02-05", "2017-02-12", "2017-03-05", "2017-03-25", "2017-04-25",]
df = pd.DataFrame({"value": value, "date": date})
Creating a second DataFrame with the values reversed and converting the date column to datetime
new_df = df.astype({'date': 'datetime64'})
new_df.sort_index(ascending=False, inplace=True, ignore_index=True)
new_df
value date
0 66.45 2017-04-25
1 65.36 2017-03-25
2 62.13 2017-03-05
3 63.84 2017-02-12
4 64.02 2017-02-05
5 63.88 2017-01-29
6 63.95 2017-01-22
7 63.85 2017-01-15
I then used pandas.Series.diff to calculate the time delta between each row and converted those values to absolute values.
time_delta_series = new_df['date'].diff().abs()
time_delta_series
0 NaT
1 31 days
2 20 days
3 21 days
4 7 days
5 7 days
6 7 days
7 7 days
Name: date, dtype: timedelta64[ns]
Then you need to convert those values to a cumulative time delta.
But to use the cumsum() method you need to first remove the missing values (NaT).
time_delta_series = time_delta_series.fillna(pd.Timedelta(seconds=0)).cumsum()
time_delta_series
0 0 days
1 31 days
2 51 days
3 72 days
4 79 days
5 86 days
6 93 days
7 100 days
Name: date, dtype: timedelta64[ns
Then you can create your starting date and create the date column for the second DataFrame we created before:
from datetime import date
start = date(2000, 1, 1)
new_df['date'] = start
new_df['date'] = new_df['date'] + time_delta_series
new_df
value date
0 66.45 2000-01-01
1 65.36 2000-02-01
2 62.13 2000-02-21
3 63.84 2000-03-13
4 64.02 2000-03-20
5 63.88 2000-03-27
6 63.95 2000-04-03
7 63.85 2000-04-10

Add Months to Data Frame using a period column

I'm looking to add a %Y%m%d date column to my dataframe using a period column that has integers 1-32, which represent monthly data points starting at a defined environment variable "odate" (e.g. if odate=20190531 then period 1 should be 20190531, period 2 should be 20190630, etc.)
I tried defining a dictionary with the number of periods in the column as the keys and the value being odate + MonthEnd(period -1)
This works fine and well; however, I want to improve the code to be flexible given changes in the number of periods.
Is there a function that will allow me to fill the date columns with the odate in period 1 and then subsequent month ends for subsequent periods?
example dataset:
odate=20190531
period value
1 5.5
2 5
4 6.2
3 5
5 40
11 5
desired dataset:
odate=20190531
period value date
1 5.5 2019-05-31
2 5 2019-06-30
4 6.2 2019-08-31
3 5 2019-07-31
5 40 2019-09-30
11 5 2020-03-31
You can use pd.date_range():
pd.date_range(start = '2019-05-31', periods = 100,freq='M')
You can change total periods depending on what you need, the freq='M' means a Month-End frequency
Here is a list of Offset Aliases you can for freq parameter.
If you just want to add or subtract some period to a date you can use pd.DataOffset:
odate = pd.Timestamp('20191031')
odate
>> Timestamp('2019-10-31 00:00:00')
odate - pd.DateOffset(months=4)
>> Timestamp('2019-06-30 00:00:00')
odate + pd.DateOffset(months=4)
>> Timestamp('2020-02-29 00:00:00')
To add given the period column to Month Ends:
odate = pd.Timestamp('20190531')
df['date'] = df.period.apply(lambda x: odate + pd.offsets.MonthEnd(x-1))
df
period value date
0 1 5.5 2019-05-31
1 2 5.0 2019-06-30
2 4 6.2 2019-08-31
3 3 5.0 2019-07-31
4 5 40.0 2019-09-30
5 11 5.0 2020-03-31
To improve performance use list-comprehension:
df['date'] = [odate + pd.offsets.MonthEnd(period-1) for period in df.period]

Count String Values in Column across 30 Minute Time Bins using Pandas

I am looking to determine the count of string variables in a column across a 3 month data sample. Samples were taken at random times throughout each day. I can group the data by hour, but I require the fidelity of 30 minute intervals (ex. 0500-0600, 0600-0630) on roughly 10k rows of data.
An example of the data:
datetime stringvalues
2018-06-06 17:00 A
2018-06-07 17:30 B
2018-06-07 17:33 A
2018-06-08 19:00 B
2018-06-09 05:27 A
I have tried setting the datetime column as the index, but I cannot figure how to group the data on anything other than 'hour' and I don't have fidelity on the string value count:
df['datetime'] = pd.to_datetime(df['datetime']
df.index = df['datetime']
df.groupby(df.index.hour).count()
Which returns an output similar to:
datetime stringvalues
datetime
5 0 0
6 2 2
7 5 5
8 1 1
...
I researched multi-indexing and resampling to some length the past two days but I have been unable to find a similar question. The desired result would look something like this:
datetime A B
0500 1 2
0530 3 5
0600 4 6
0630 2 0
....
There is no straightforward way to do a TimeGrouper on the time component, so we do this in two steps:
v = (df.groupby([pd.Grouper(key='datetime', freq='30min'), 'stringvalues'])
.size()
.unstack(fill_value=0))
v.groupby(v.index.time).sum()
stringvalues A B
05:00:00 1 0
17:00:00 1 0
17:30:00 1 1
19:00:00 0 1

Categories

Resources