I have a data set which contains samples at the 1-second level from workout data (heart rate, watts, etc.). The data feed is not perfect and sometimes there are gaps. I need the dataset at 1-second intervals with no missing rows.
Once I resample the data, it looks something like this:
   activity_id  watts
t
1        12345      5
2        12345    NaN
3        12345     15
6        98765    NaN
7        98765     10
8        98765     12
After the resample I can't get the interpolate to work properly. The problem is that the interpolation runs across the entire dataframe, and I need it to 'reset' for every workout ID within the dataframe. The data should look like this once it's working properly:
   activity_id  watts
t
1        12345      5
2        12345     10
3        12345     15
6        98765    NaN
7        98765     10
8        98765     12
Here's the snippet of code I have tried. It's not throwing any errors, but it's also not doing the interpolation...
seconds = 1
df = df.groupby(['activity_id']).resample(str(seconds) + 'S').mean().reset_index(level='activity_id', drop=True)
df = df.reset_index(drop=False)
df = df.groupby('activity_id').apply(lambda group: group.interpolate(method='linear'))
Marked as correct answer here but not working for me:
Pandas interpolate within a groupby
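For reference, a minimal sketch of the per-group interpolation being described, assuming the resampled frame has t as its index and activity_id / watts as columns (this is only an illustration, not the accepted answer from the linked question); interpolating inside a groupby transform keeps values from bleeding across activities, so the leading NaN of activity 98765 stays NaN:

import pandas as pd

# Interpolate watts within each activity only; linear interpolation does not
# backfill a leading NaN, so the first row of activity 98765 remains NaN.
df['watts'] = (
    df.groupby('activity_id')['watts']
      .transform(lambda s: s.interpolate(method='linear'))
)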
I have a large input table with multiple columns in the format below:
Col-A  Col-B  Col-C       Col-D   Col-E   Col-F
001    10     01/01/2020  123456  123123  123321
001    20     01/02/2020  123456  123111
002    10     01/03/2020  111000  111123
And I'd like to write code that produces one row per value for each Col-A, so that instead of the multiple columns Col-D, Col-E, and Col-F I only have Col-D:
Col-A  Col-B  Col-C       Col-D
001    10     01/01/2020  123456
001    10     01/01/2020  123123
001    10     01/01/2020  123321
001    20     01/02/2020  123456
001    20     01/02/2020  123111
002    10     01/03/2020  111000
002    10     01/03/2020  111123
Any ideas will be appreciated,
Thanks,
Nurbek
You can use pd.melt
import pandas as pd

newdf = pd.melt(
    df,
    id_vars=['Col-A', 'Col-B', 'Col-C'],
    value_vars=['Col-D', 'Col-E', 'Col-F']
).dropna()
This will drop 'Col-D', 'Col-E' and 'Col-F', but create two new columns, variable and value. The variable column denotes which column each value came from. To achieve what you ultimately want, you can drop the variable column and rename the value column to Col-D.
newdf = newdf.drop(['variable'], axis=1)
newdf = newdf.rename(columns={"value":"Col-D"})
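A minimal sketch of the whole chain on the example frame above (the final sort is optional, only to match the desired output order):

import pandas as pd

newdf = (
    pd.melt(df,
            id_vars=['Col-A', 'Col-B', 'Col-C'],
            value_vars=['Col-D', 'Col-E', 'Col-F'])
      .dropna()
      .drop(columns='variable')
      .rename(columns={'value': 'Col-D'})
      .sort_values(['Col-A', 'Col-B'])
      .reset_index(drop=True)
)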
What about something like this:
df2 = df[["Col-A", "Col-B", "Col-C", "Col-D"]]
columns = ["Col-E", "Col-F", ..., "Col-Z"]  # the remaining value columns
for col in columns:
    # append() returns a new frame, so assign it back; rename the column so
    # the appended values line up under Col-D
    df2 = df2.append(
        df[["Col-A", "Col-B", "Col-C", col]].rename(columns={col: "Col-D"})
    ).reset_index(drop=True)
You just append each of the remaining columns to the base frame.
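Since DataFrame.append is deprecated in recent pandas (and removed in 2.0), a sketch of the same idea with pd.concat, using the same assumed column names:

import pandas as pd

parts = [df[["Col-A", "Col-B", "Col-C", "Col-D"]]]
for col in ["Col-E", "Col-F"]:  # extend with any further value columns
    parts.append(
        df[["Col-A", "Col-B", "Col-C", col]].rename(columns={col: "Col-D"})
    )
df2 = pd.concat(parts, ignore_index=True).dropna(subset=["Col-D"])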
I have a Pandas data frame that keeps data for checkouts of laptops in my department. The dataframe has columns for time checked out (column name Out), time checked in (In), the name of the person checking out (Name), and the number of machines checked out by that person (Number). I want to create a new dataframe that shows both the number of times checkouts occurred in a given week, and the number of machines checked out in a given week. The original data frame is called cb.
I was able to make a pivot table that gets me the number of machines checked out by week:
dates = pd.pivot_table(cb, values="Number", index="Out", aggfunc=sum)
I'm wondering what I can add to this line of code to add a new column that calculates the number of times machines were checked out. For example if two people checked out laptops in a given week, person 1 checked out 10 laptops, and person 2 checked out 5, then there should be a "Number" column that reads "15" for this week and another column "Frequency" that reads "2".
Is this possible with a single pivot_table line or is there more to it? Thanks in advance.
EDIT: Here's what I hope is a small example of what I am looking for. First, here's raw data from the CSV I am reading:
Name  Number  DateOut    TimeOut  DateIn     TimeIn
C     1       8/31/2017  2:00p    9/1/2017   3:40p
Ma    2       8/31/2017  3:30p    .          .
S     1       9/6/2017   10:50a   9/6/2017   1:55p
S     3       9/7/2017   10:00a   9/7/2017   3:00p
C     1       9/7/2017   2:20p    9/8/2017   11:00a
Ma    2       9/7/2017   4:00p    9/8/2017   10:00a
S     4       9/8/2017   10:50a   9/8/2017   3:15p
W     6       9/11/2017  8:15a    9/11/2017  11:00a
B     4       9/11/2017  10:45a   9/11/2017  1:00p
S     4       9/11/2017  10:55a   9/11/2017  3:55p
S     3       9/12/2017  12:55p   9/12/2017  3:00p
Ma    2       9/12/2017  4:00p    9/15/2017  10:00a
S     1       9/13/2017  11:00a   9/13/2017  1:00p
T     1       9/13/2017  1:00p    .          .
K     1       9/13/2017  2:00p    9/14/2017  10:00a
F     2       9/13/2017  4:00p    9/14/2017  11:45a
S     3       9/14/2017  1:00p    9/14/2017  3:00p
C     1       9/14/2017  3:50p    9/15/2017  10:00a
F     4       9/15/2017  9:35a    9/15/2017  3:00p
(Names redacted for privacy.)
The code for reading it in (parsing the given dates into a correct DateTime index):
import pandas as pd

cb = pd.read_csv("chromebookdata.csv", na_values=".",
                 parse_dates={"In": [2, 3], "Out": [4, 5]})
cb['In'] = pd.to_datetime(cb['In'], errors="coerce")
cb['Out'] = pd.to_datetime(cb['Out'], errors="coerce")
Creating a pivot table that gives the number of machines each week:
dates = pd.pivot_table(cb, values="Number", index="Out", aggfunc=sum)
dates_weekly = dates.resample("W").sum()
This pivot table gives me the number of machines checked out per week:
            Number
In
2017-09-03     3.0
2017-09-10    11.0
2017-09-17    33.0
What I want is a new column for the number of times checkouts occurred, so for these data it would look like:
            Number  Count
In
2017-09-03     3.0      2
2017-09-10    11.0      5
2017-09-17    33.0     12
Assuming your dates_weekly and cb dataframes are sorted by date:
# Round your dates to the day
cb['dates'] = cb['dates'].dt.floor('d')
# Group by the rounded date and count the number of rows for each date
dates_weekly['frequency'] = cb.groupby('dates').agg('count')
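A minimal sketch of another way to get the weekly count directly, assuming the parsed Out column from the question is used; size() counts checkout events (rows) per week, and the weekly bins line up with the resampled pivot above:

# Count checkout rows per week and attach them to the weekly sums
counts = cb.set_index('Out').resample('W').size()
dates_weekly['Count'] = counts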
You can pass a list to the aggfunc argument. Try aggfunc=['sum', 'count'] when you create the pivot table.
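A sketch of what that looks like with the frame from the question (the result has MultiIndex columns; summing the weekly resample of the count column gives the number of checkouts per week):

dates = pd.pivot_table(cb, values="Number", index="Out",
                       aggfunc=['sum', 'count'])
dates_weekly = dates.resample("W").sum()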
I'm currently working with panel data in Python and I'm trying to compute the rolling average for each time series observation within a given group (ID).
Given the size of my data set (thousands of groups with multiple time periods), the .groupby and .apply() functions are taking way too long to compute (has been running over an hour and still nothing -- entire data set only contains around 300k observations).
I'm ultimately wanting to iterate over multiple columns, doing the following:
1. Compute a rolling average for each time step in a given column, per group ID.
2. Create a new column containing the difference between the original value and the moving average, x_t - (x_{t-1} + x_t) / 2.
3. Store the column in a new DataFrame, which would be identical to the original data set except that it holds the residual from #2 instead of the original value.
4. Repeat and append the new residuals to df_resid (as seen below).
df_resid

date        id  rev_resid  exp_resid
2005-09-01   1  NaN        NaN
2005-12-01   1  -10000     -5500
2006-03-01   1  -352584    -262058.5
2006-06-01   1  240000     190049.5
2006-09-01   1  82648.75   37724.25
2005-09-01   2  NaN        NaN
2005-12-01   2  4206.5     24353
2006-03-01   2  -302574    -331951
2006-06-01   2  103179     117405.5
2006-09-01   2  -52650     -72296.5
Here's a small sample of the original data.
df

date        id  rev       exp
2005-09-01   1  745168.0  545168.0
2005-12-01   1  725168.0  534168.0
2006-03-01   1  20000.0   10051.0
2006-06-01   1  500000.0  390150.0
2006-09-01   1  665297.5  465598.5
2005-09-01   2  956884.0  736987.0
2005-12-01   2  965297.0  785693.0
2006-03-01   2  360149.0  121791.0
2006-06-01   2  566507.0  356602.0
2006-09-01   2  461207.0  212009.0
And the (very slow) code:
df['rev_resid'] = df.groupby('id')['rev'].apply(lambda x: x.rolling(center=False, window=2).mean())
I'm hoping there is a much more computationally efficient way to do this (primarily with respect to #1), and could be extended to multiple columns.
Any help would be truly appreciated.
To speed up the calculation: if the dataframe is already sorted on 'id', you don't have to do the rolling within a groupby (if it isn't sorted, do so first). Since your window is only length 2, we can mask the result by checking where id == id.shift(); this works because the frame is sorted.
d1 = df[['rev', 'exp']]

# residual = value - rolling mean; keep it only on rows whose id matches the
# previous row's id, then join back onto the original frame by index
df.join(
    d1.rolling(2).mean().rsub(d1).add_suffix('_resid')[df.id.eq(df.id.shift())]
)
        date  id       rev       exp  rev_resid  exp_resid
0 2005-09-01   1  745168.0  545168.0        NaN        NaN
1 2005-12-01   1  725168.0  534168.0  -10000.00   -5500.00
2 2006-03-01   1   20000.0   10051.0 -352584.00 -262058.50
3 2006-06-01   1  500000.0  390150.0  240000.00  190049.50
4 2006-09-01   1  665297.5  465598.5   82648.75   37724.25
5 2005-09-01   2  956884.0  736987.0        NaN        NaN
6 2005-12-01   2  965297.0  785693.0    4206.50   24353.00
7 2006-03-01   2  360149.0  121791.0 -302574.00 -331951.00
8 2006-06-01   2  566507.0  356602.0  103179.00  117405.50
9 2006-09-01   2  461207.0  212009.0  -52650.00  -72296.50
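If the frame cannot be assumed sorted by id, a sketch of the same residuals with a vectorised groupby rolling (assuming a reasonably recent pandas; no Python-level apply involved), which also extends naturally to more columns:

cols = ['rev', 'exp']

# Rolling mean per id; groupby().rolling() returns a MultiIndex (id, row),
# so drop the id level to align with the original index.
roll = df.groupby('id')[cols].rolling(2).mean().reset_index(level=0, drop=True)

resid = (df[cols] - roll).add_suffix('_resid')
out = df.join(resid)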
I have a data frame that looks like:
   app_id  subproduct        date
0      23           3  2015-05-29
1      23           4  2015-05-29
2      25           5  2015-05-29
3      23           3  2015-05-29
4      24           7  2015-05-29
...
I run:
groupings = insightevents.groupby([insightevents['created_at_date'].dt.year,
                                   insightevents['created_at_date'].dt.month,
                                   insightevents['created_at_date'].dt.week,
                                   insightevents['created_at_date'].dt.day,
                                   insightevents['created_at_date'].dt.dayofweek])

inboxinsights = pd.DataFrame([groupings['app_id'].unique(),
                              groupings['subproduct'].unique()]).transpose()
This gives me:
                app_id        subproduct
2015 5 22 29 4  [23, 24, 25]  [3, 4, 5, 7]
However, what I actually want is not just the unique values, but also the full app_id and subproduct loads for the day as additional columns, so:
                unique_app_id  unique_subproduct  subproduct       app_id
2015 5 22 29 4  [23, 24, 25]   [3, 4, 5, 7]       [3, 3, 4, 5, 7]  [23, 23, 23, 24, 25]
I find that just doing:
inboxinsights = pd.DataFrame([groupings['app_id'].unique(),
                              groupings['subproduct'].unique(),
                              groupings['app_id'],
                              groupings['subproduct']]).transpose()
Doesn't work and just gives me:
AttributeError: 'Series' object has no attribute 'type'
If you wanted just the number of unique values, that's easy:
inboxinsights.groupby('date').agg({'app_id': 'nunique', 'subproduct': 'nunique'})
returns the number of unique app_ids and subproducts per date.
But it looks like you want the list of what those actually are. I found this other SO question helpful:
not_unique_inboxinsights = inboxinsights.groupby('date').agg(lambda x: tuple(x))
And then you say you want both the unique and the non-unique values. For that, I would make two groupby dataframes and concatenate them, like this:
unique_inboxinsights = inboxinsights.groupby('date').agg(lambda x: set(tuple(x)))
Hope that helps.
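A minimal sketch of getting both the unique values and the full per-day lists in one pass, assuming the column names from the question and pandas 0.25+ for named aggregation:

out = insightevents.groupby(insightevents['created_at_date'].dt.date).agg(
    unique_app_id=('app_id', lambda x: sorted(x.unique())),
    app_id=('app_id', list),
    unique_subproduct=('subproduct', lambda x: sorted(x.unique())),
    subproduct=('subproduct', list),
)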
I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one datetime64[ns]. I would like to find some way to 'group' records together which have datetimes which are very close to one another. For example, I might be interested in grouping the following transactions together if they occur within two seconds of each other by assigning a common ID called Grouped ID:
Transaction ID  Time      Grouped ID
1               08:10:02  1
2               08:10:03  1
3               08:10:50
4               08:10:55
5               08:11:00  2
6               08:11:01  2
7               08:11:02  2
8               08:11:03  3
9               08:11:04  3
10              08:15:00
Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals - once a full 2 second window has passed, a new window would begin with the next transaction (as shown in transactions 5 - 9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combine transactions within 50 ms) but stuck with seconds for ease of presentation above.
Thanks very much for any insight you can offer!
The solution I suggest requires you to reindex your data with your Time data.
You can use a list of datetimes with the desired frequency, use searchsorted to find the nearest datetimes in your index, and then use it for slicing (as suggested in question python pandas dataframe slicing by date conditions and Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval).
I'm using pandas 0.14.1 and the DateOffset object (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you can adapt the code. DateOffset goes down to the microsecond level.
Using the following code,
import pandas as pd
import pandas.tseries.offsets as pto
import numpy as np

# Create some test data
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)},
                  index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))

# Define the periods that delimit the groups (ticks)
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5 * pto.Milli())

# Find the nearest index positions matching the ticks
index_ticks = np.unique(df.index.searchsorted(ticks))

# Make a dataframe that will hold the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id'])

# Set the group ids (positional slicing, since searchsorted returns positions)
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i

# Update the original dataframe
df['Group id'] = dgroups['Group id']
I was able to obtain this kind of dataframe:
                            value  Group id
2014-11-03 00:00:00             0         0
2014-11-03 00:00:00.001000      1         0
2014-11-03 00:00:00.002000      2         0
2014-11-03 00:00:00.003000      3         0
2014-11-03 00:00:00.004000      4         0
2014-11-03 00:00:00.005000      5         1
2014-11-03 00:00:00.006000      6         1
2014-11-03 00:00:00.007000      7         1
2014-11-03 00:00:00.008000      8         1
2014-11-03 00:00:00.009000      9         1
2014-11-03 00:00:00.010000     10         2
2014-11-03 00:00:00.011000     11         2
2014-11-03 00:00:00.012000     12         2
2014-11-03 00:00:00.013000     13         2
2014-11-03 00:00:00.014000     14         2
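For the 2-second example in the question, the same idea might look roughly like this (a sketch: df is assumed to be indexed by the transaction timestamps, and, like the approach above, it uses fixed ticks rather than windows that restart at the first transaction after a gap):

import numpy as np
import pandas as pd

# 2-second ticks spanning the data; each tick starts a candidate window
ticks = pd.date_range(df.index.min(), df.index.max(), freq='2S')

# Positions in the sorted index where each tick would be inserted
positions = np.unique(df.index.searchsorted(ticks))

# Assign a group id per window; rows after the last tick boundary keep NaN
df['Grouped ID'] = np.nan
for i, (lo, hi) in enumerate(zip(positions[:-1], positions[1:]), start=1):
    df.iloc[lo:hi, df.columns.get_loc('Grouped ID')] = i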