I have a pandas DataFrame of in/out interface traffic sampled every 10 minutes, and I want to aggregate the two time series into hourly buckets for analysis. What seemed simple has turned out to be surprisingly hard to figure out! I just need to bucket the data into hourly bins.
import pandas as pd

times = []
ins = []
outs = []
for row in results['results']:
    times.append(row['DateTime'])
    ins.append(row['Intraffic'])
    outs.append(row['Outtraffic'])

df = pd.DataFrame()
df['datetime'] = times
df['datetime'] = pd.to_datetime(df['datetime'])
df.index = df['datetime']
df['ins'] = ins
df['outs'] = outs
I have tried df.resample('H').mean(), and I have tried pandas groupby, but I was having trouble handling the two columns and getting the means over the hourly buckets.
I believe this should do what you want:
df = pd.DataFrame()
df['datetime'] = times
df['datetime'] = pd.to_datetime(df['datetime'])
df['ins'] = ins
df['outs'] = outs
df.set_index('datetime', inplace=True)  # moves 'datetime' into the index instead of leaving a duplicate column
new_df = df.groupby(pd.Grouper(freq='H')).mean()
That last line groups your data by timestamp into hourly chunks, based on the index, and then spits out a new DataFrame with the mean of each column.
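As an aside, your df.resample('H').mean() attempt likely failed only because 'datetime' was still a regular column as well as the index; once set_index has moved it into the index, resample is equivalent to the Grouper call above. A minimal sketch, assuming the ins and outs columns are numeric:

new_df = df.resample('H').mean()  # same hourly means as groupby(pd.Grouper(freq='H'))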
For a given dataframe column, I would like to randomly select roughly 60% of the rows per day and add them to a new column, put the remaining 40% in another column, multiply the 40% column by -1, and then create a new column that merges these back together for each day (so that each day I end up with a 60/40 ratio).
I have asked the same question without the daily specification here: Randomly selecting rows from dataframe column
The example below illustrates this (although the ratio there is not exactly 60/40):
import numpy as np
import pandas as pd

dates = ['1/1/2019', '1/1/2019', '1/1/2019', '1/2/2019', '1/1/2019', '1/2/2019']

# starting data
dict0 = {'date': dates, 'x1': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(dict0)
df['date'] = pd.to_datetime(df['date']).dt.date

# step 1: roughly 60% of x1 goes to x2, the rest to x3
dict1 = {'date': dates, 'x1': [1, 2, 3, 4, 5, 6], 'x2': [1, np.nan, 3, np.nan, 5, 6], 'x3': [np.nan, 2, np.nan, 4, np.nan, np.nan]}
df = pd.DataFrame(dict1)
df['date'] = pd.to_datetime(df['date']).dt.date

# step 2: the 40% column x3 is multiplied by -1
dict2 = {'date': dates, 'x1': [1, 2, 3, 4, 5, 6], 'x2': [1, np.nan, 3, np.nan, 5, 6], 'x3': [np.nan, -2, np.nan, -4, np.nan, np.nan]}
df = pd.DataFrame(dict2)
df['date'] = pd.to_datetime(df['date']).dt.date

# step 3: x2 and x3 are merged back into x4
dict3 = {'date': dates, 'x1': [1, 2, 3, 4, 5, 6], 'x2': [1, np.nan, 3, np.nan, 5, 6], 'x3': [np.nan, -2, np.nan, -4, np.nan, np.nan], 'x4': [1, -2, 3, -4, 5, 6]}
df = pd.DataFrame(dict3)
df['date'] = pd.to_datetime(df['date']).dt.date
You can use groupby and sample to get the index values of the selected rows, then create the column x4 with loc, and fill the remaining NaNs with the -1 multiplied column:
# per day, sample ~60% of the rows and grab their original row labels
idx = df.groupby('date').apply(lambda x: x.sample(frac=0.6)).index.get_level_values(1)
df.loc[idx, 'x4'] = df.loc[idx, 'x1']   # the selected 60% keep their sign
df['x4'] = df['x4'].fillna(-df['x1'])   # the remaining 40% get flipped
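One caveat: sample draws randomly, so the 60/40 split changes on every run. If you need a reproducible split, sample accepts a random_state seed; a small variant of the line above (the seed value 0 is arbitrary):

idx = df.groupby('date').apply(lambda x: x.sample(frac=0.6, random_state=0)).index.get_level_values(1)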
I have the following dataframe
How can I aggregate the number of tickets (summing) for every month?
I tried:
df_res[df_res["type"]=="other"].groupby(["type","date"])["n_tickets"].sum()
The date column has object dtype.
You need to assign the filtered rows to a new DataFrame first, so that the Series created by Series.dt.month has the same length:
# if necessary, convert to datetimes first
df_res['date'] = pd.to_datetime(df_res['date'])
df = df_res[df_res["type"] == "pax"]
# type is constant within the filter, so it can be omitted from the groupby
out = df.groupby(df["date"].dt.month)["n_tickets"].sum()
# if you need a column with the same value `pax`:
# out = df.groupby(['type', df["date"].dt.month])["n_tickets"].sum()
If you want to group by pax and non-pax:
import numpy as np

types = np.where(df_res["type"] == "pax", 'pax', 'no pax')
df_res.groupby([types, df_res["date"].dt.month])["n_tickets"].sum()
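Note that dt.month on its own folds the same month from different years into one bucket. If your data spans more than one year, grouping by a monthly period keeps them apart; a hedged variant using the standard dt.to_period accessor:

out = df.groupby(df['date'].dt.to_period('M'))['n_tickets'].sum()  # 2019-01 and 2020-01 stay distinct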
I have a time series of daily data from 2000 to 2015. What I want is another, single time series that only contains data from each year between April 15 and June 15 (because that is the period relevant for my analysis).
I have already written code to do this myself, given below:
import pandas as pd
df = pd.read_table(myfilename, delimiter=",", parse_dates=['Date'], na_values=-99)
dff = df[df['Date'].apply(lambda x: x.month>=4 and x.month<=6)]
dff = dff[dff['Date'].apply(lambda x: x.day>=15 if x.month==4 else True)]
dff = dff[dff['Date'].apply(lambda x: x.day<=15 if x.month==6 else True)]
I think this code is quite inefficient, as it has to operate on the dataframe three times to get the desired subset.
I would like to know the following two things:
Is there an inbuilt pandas function to achieve this?
If not, is there a more efficient and better way to achieve this?
Let the data frame look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2000-01-01', periods=365*10, freq='D'),
                   'Value': np.random.random(365*10)})
Create a series of dates with the year normalized to a common value (2000 is a leap year, so Feb 29 maps cleanly):

x = df.Date.apply(lambda d: pd.Timestamp(2000, d.month, d.day))

then filter with this series to select from the dataframe:

df[(x >= pd.Timestamp(2000, 4, 15)) & (x <= pd.Timestamp(2000, 6, 15))]
Try this:
index = pd.date_range("2000/01/01", "2016/01/01")
s = index.to_series()
# encode month and day as one number: April 15 -> 415, June 15 -> 615
s[(s.dt.month * 100 + s.dt.day).between(415, 615)]
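Applied to the question's own DataFrame (with a datetime Date column), the same month*100+day encoding filters everything in one vectorized pass; a sketch assuming the column names from the question:

md = df['Date'].dt.month * 100 + df['Date'].dt.day  # April 15 -> 415, June 15 -> 615
dff = df[md.between(415, 615)]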
I am looking to join two dataframes on their 'Date' columns using pandas. I usually use df2 = pd.concat([df, df1], axis=1), but for some reason this is not working here.

In this example, I am pulling the data from a SQL file, creating a new column called 'Date' that merges my year and month columns, and then pivoting. When I try to concatenate the two dataframes, they show up side by side instead of merged together.
What comes up:
Date Count of Cats Date Count of Dogs
What I want to come up:
Date Count of Cats Count of Dogs
Any ideas?
My other problem is that I am trying to make sure the Date column is written to Excel as a string and not a datetime. Please keep this in mind when thinking about a solution.
Here is my code:
import numpy as np
import pandas as pd

executeScriptsFromFile('cats.sql')
df = pd.DataFrame(cursor.fetchall())
df.columns = [rec[0] for rec in cursor.description]
monthend = {'Q1': '3/31', 'Q2': '6/30', 'Q3': '9/30', 'Q4': '12/31'}
df['Date'] = df['QUARTER'].map(monthend) + '/' + df['YEAR']
df['Date'] = pd.to_datetime(df['Date'])
df10 = df.pivot_table(['Breed'], ['Date'], aggfunc=np.sum, fill_value=0)
df10.reset_index(drop=False, inplace=True)
df10.reindex_axis(['Breed', 'Count of Cats'], axis=1)
df10.columns = ('Breed', 'Count of Cats')

executeScriptsFromFile('dogs.sql')
df = pd.DataFrame(cursor.fetchall())
df.columns = [rec[0] for rec in cursor.description]
monthend = {'Q1': '3/31', 'Q2': '6/30', 'Q3': '9/30', 'Q4': '12/31'}
df['Date'] = df['QUARTER'].map(monthend) + '/' + df['YEAR']
df['Date'] = pd.to_datetime(df['Date'])
df11 = df.pivot_table(['Breed'], ['Date'], aggfunc=np.sum, fill_value=0)
df11.reset_index(drop=False, inplace=True)
df11.reindex_axis(['Breed', 'Count of Dogs'], axis=1)
df11.columns = ('Breed', 'Count of Dogs')
df11a = df11.round(0)

df12 = pd.concat([df10, df11a], axis=1)
I think you have to remove this code:

df10.reset_index(drop=False, inplace=True)
df11.reset_index(drop=False, inplace=True)

because the date level needs to stay in the index so that concat can align the two frames by date.

Also, to convert the index to strings, use:

df.index = df.index.astype(str)
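Putting it together, a minimal sketch of the corrected ending, keeping the question's variable names (the date format passed to strftime is an assumption):

# leave Date in the index of df10 and df11 so concat aligns on it
df12 = pd.concat([df10, df11.round(0)], axis=1)
df12.index = df12.index.strftime('%m/%d/%Y')  # write Date to Excel as plain strings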
Sample DataFrame:

process_id | app_path | start_time

The desired output DataFrame should be multi-indexed on the date and time values in the start_time column, with unique dates as the first level of the index and one-hour ranges as the second level; the count of records in each time slot should be calculated.
def activity(self):
    # find unique dates from the db file
    columns = self.df['start_time'].map(lambda x: x.date()).unique()
    result = pandas.DataFrame(np.zeros((1, len(columns))), columns=columns)
    for i in range(len(self.df)):
        col = self.df.iloc[i]['start_time'].date()
        result[col][0] = result.get_value(0, col) + 1
    return result
I have tried the above code, which gives output like:

15-07-2014  16-07-2014  17-07-2014  18-07-2014
      3217        2114        1027        3016
I want to count records on a per-hour basis as well.
It would be helpful to start your question with some sample data. Since you didn't, I assumed the following is representative of your data (it looks like app_path was not being used):
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=10000, freq='1Min')
df = pd.DataFrame(np.random.randint(low=100, high=500, size=len(rng)), index=rng)
df.columns = ['process_id']
It looks like you could benefit from exploring the groupby method on pandas DataFrames. Using groupby, your example above becomes a simple one-liner:
df.groupby([df.index.year, df.index.month, df.index.day]).count()

and grouping by hour simply means adding the hour to the group:

df.groupby([df.index.year, df.index.month, df.index.day, df.index.hour]).count()
Don't reinvent the wheel in pandas; use the methods it provides for much more readable, as well as faster, code.
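If you want a layout closer to what the question describes (dates as one level, hour slots as the other), here is a hedged sketch building on the same sample data; size and unstack are standard pandas methods, but the exact presentation is an assumption:

# count rows per (date, hour), then pivot the hours into columns
counts = df.groupby([df.index.date, df.index.hour]).size()
table = counts.unstack(fill_value=0)  # one row per date, one column per hour 0-23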