Python pandas finding data in between time - python

I am using crime statistics (in a data frame) and I am trying to find when most crimes occur: between 12am-8am, 8am-4pm, or 4pm-12am. I have already converted the column to datetime. The code I used is:
#first attempt
df_15['FIRST_OCCURRENCE_DATE']=pd.date_range('01/01/2015',periods=10000,freq='H')
df_15[(df_15['FIRST_OCCURRENCE_DATE'] > '2015-1-1 00:00:00') & (df_15['FIRST_OCCURRENCE_DATE'] <= '2015-12-31 08:00:00')]
#second attempt
df_15 = df_15.set_index(df_15['FIRST_OCCURRENCE_DATE'])
df_15.loc['2015-01-01 00:00:00':'2015-12-31 00:00:00']
#third attempt
date_rng = pd.date_range(start='00:00:00', end='08:00:00',freq='H')
date_rng1 = pd.DataFrame(date_rng)
date_rng1.head(30)
#fourth attempt
df_15.FIRST_OCCURRENCE_DATE.dt.hour
ts = pd.to_datetime('12/31/2015 08:00:00')
df_15.loc[df_15.FIRST_OCCURRENCE_DATE <= ts,:].head()
The results I get are time entries that go outside of 08:00:00.
PS. all the data is from the same year

Looks like you can just do a little arithmetic and count:
(df_15['FIRST_OCCURRENCE_DATE'].dt.hour // 8).value_counts()
There are a lot of ways to solve this problem but this is likely the simplest. Extract the hour of day from each date, find which time slot it belongs to. Floor-divide by 8 to get 0 (12AM-8AM), 1 (8AM-4PM), or 2 (4PM-12AM) for each, and just count these occurrences.
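To make the floor-division idea concrete, here is a minimal sketch on made-up data. The 48 hourly timestamps and the `labels` mapping are assumptions for illustration only; only the column name comes from the question.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the crime data: one timestamp per hour for two days.
stamps = pd.Timestamp("2015-01-01") + pd.to_timedelta(np.arange(48), unit="h")
df_15 = pd.DataFrame({"FIRST_OCCURRENCE_DATE": stamps})

# Hour of day (0-23) floor-divided by 8 gives slot 0, 1 or 2.
slot = df_15["FIRST_OCCURRENCE_DATE"].dt.hour // 8

# Hypothetical labels, added only to make the counts readable.
labels = {0: "12AM-8AM", 1: "8AM-4PM", 2: "4PM-12AM"}
counts = slot.map(labels).value_counts()
print(counts)
```

Because the filter looks only at the hour of day, the date part of each timestamp never matters, which is what the comparison against full timestamps in the question was missing.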

Related

Pandas Dataframes: Addition of float to column value based on if condition

Relative newbie with Python and Pandas, finally admitting defeat on not being able to figure this out myself. I have a pandas Dataframe from our energy suppliers API, each row is a 30min interval showing wholesale energy costs in p/kWH 'value_exc_vat', the solar output for the house 'export' and a datetime stamp 'datetime'.
| index |'value_exc_vat'|'datetime'|'export'|'hour'|'export_rate'|'export_rate_var'|
'hour' is taken from datetime for each row e.g. 13, 14, 15, 16, etc.
To calculate the price/kWh we are paid i need to calculate
0.97 x 'value_exc_vat' + peak_rate_uplift
peak_rate_uplift is only applied during the hours 16 to 19 inclusive
I've tried just about every method i can think of but i can't get this to work.
peak_rate = [16,17,18,19]
for hour in df['hour']:
    if hour == peak_rate:
        df['export_rate_var'] = (df['export_rate'] + peak_rate_uplift)
    else:
        df['export_rate_var'] = df['export_rate']
Printing the output from the if function i can see that 'hour' is being selected for the correct values but the remainder of the statement doesn't then add the peak_rate_uplift I would expect.
Any advice or help on how to apply the addition to the selected row would be appreciated, feels like it should be something simple but I've been at this for 3 days now...
You could use:
peak_rate = [16,17,18,19]
df['export_rate_var'] = (df['export_rate'] + df.hour.isin(peak_rate) * peak_rate_uplift)
Here df.hour.isin(peak_rate) returns a boolean Series. Multiplying it by peak_rate_uplift gives a Series that equals peak_rate_uplift during the peak hours and 0 elsewhere.
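A small self-contained check of the boolean-mask approach, assuming a made-up uplift of 5.0 and a four-row frame (neither value is from the question):

```python
import pandas as pd

peak_rate = [16, 17, 18, 19]
peak_rate_uplift = 5.0  # assumed value, for illustration only

# Toy frame: two off-peak rows (15, 20) and two peak rows (16, 19).
df = pd.DataFrame({"hour": [15, 16, 19, 20],
                   "export_rate": [10.0, 10.0, 10.0, 10.0]})

# True for rows whose hour is one of the peak hours.
is_peak = df["hour"].isin(peak_rate)

# False * uplift is 0.0 and True * uplift is the uplift itself.
df["export_rate_var"] = df["export_rate"] + is_peak * peak_rate_uplift
print(df)
```

The original loop fails because `hour == peak_rate` compares a single number with a whole list, which is never true, and each assignment overwrites the entire column rather than one row.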
Does this work:
peak_rate = [16,17,18,19]
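for i in range(len(df)):
    # a single value has no .isin, so test membership with "in"
    if df.hour.iloc[i] in peak_rate:
        # update only row i instead of overwriting the whole column
        df.loc[df.index[i], 'export_rate_var'] = df['export_rate'].iloc[i] + peak_rate_uplift
    else:
        df.loc[df.index[i], 'export_rate_var'] = df['export_rate'].iloc[i]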

Selecting events that happen 1 hour before the predicted event

I'm training a binary classifier to predict whether a certain sequence of industrial log events ends up in an error or not.
For each error, I need to capture the events that happened in the hour before the error-event. I'm using a pandas DataFrame and converted the time with pd.to_datetime() so I ended up with a Year/Month/Day/Hour/Minute/Second column, which is not the index of the dataframe.
Things I tried are pulling out the corresponding hours and minutes with this code below
hours = data2.event_timestamp.apply(lambda x: x.hour)
minutes = data2.event_timestamp.apply(lambda x: x.minute)
I managed to loop over the dataset and capture a fixed amount of events, disregarding time, that happen before the error with this code:
dataarray = []
for index, row in data2.iterrows():
    array = np.asarray(row)
    dataarray.append(array)
listwitheventswithnoerror = []
listwitheventswitherror = []
"""-----------------------------------------------------------------"""
for index, array in enumerate(dataarray):
    if index > 50:
        if array[1] == 0: # 0 is for non-errors
            sample = dataarray[index-50:index]
            listwitheventswithnoerror.append(sample)
for index, array in enumerate(dataarray):
    if index > 50:
        if array[1] != 0: #non zero is for errors
            sample = dataarray[index-50:index]
            listwitheventswitherror.append(sample)
I can't seem to grasp how I can change this code to instead of taking 50 events, take the events that happen in the hour before, regarding the time column. Help would be much appreciated.
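One possible way to switch from "the last 50 events" to "the last hour" is to filter on the timestamp column with a `pd.Timedelta`. This is only a sketch on made-up data: `event_timestamp` follows the question, while the `error_code` column name and the sample values are assumptions.

```python
import pandas as pd

# Made-up log. "error_code" is a hypothetical column: 0 = no error, non-zero = error.
data2 = pd.DataFrame({
    "event_timestamp": pd.to_datetime([
        "2020-01-01 08:00", "2020-01-01 09:30", "2020-01-01 09:50",
        "2020-01-01 10:00", "2020-01-01 12:00", "2020-01-01 12:30",
    ]),
    "error_code": [0, 0, 0, 7, 0, 3],
}).sort_values("event_timestamp")

one_hour = pd.Timedelta(hours=1)
windows = []  # one DataFrame of preceding events per error

# Loop over the timestamps of the error events only.
for t in data2.loc[data2["error_code"] != 0, "event_timestamp"]:
    # Events in the half-open interval [t - 1 hour, t), so the error itself is excluded.
    in_window = (data2["event_timestamp"] >= t - one_hour) & (data2["event_timestamp"] < t)
    windows.append(data2[in_window])

print([len(w) for w in windows])
```

Each element of `windows` has a variable number of rows, so a classifier that expects fixed-length input would still need padding or aggregation on top of this.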

Slicing my data frame is returning unexpected results

I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (columns). Then the next five days are recorded under it. To make things more complicated, the day of the week, date, and billing day is shown over the first recording of KVAR each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal is to create a simple python script that would make the data into a data frame with the columns: DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is my script returns NaN data for the KW, KVAR, and KVA data after the first five days (which is correlated with a new instance of a for loop). What is weird to me is that when I try to print out the same ranges I get the data that I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    #starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50,0]
    val_start = 3
    val_end = 51
    date_val = [0,2]
    day_type = [1,2]
    # There are 7 row movements that need to take place.
    for row_move in range(1,8):
        day = [1,2,3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I would cycle through the five days, grab their data in a temporary dataframe,
        # and then append that dataframe onto the output dataframe
        for col_move in range(1,6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            #These are the 3 values that stop working after the first column change
            # I get the values that I expect for the first 5 days
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # trouble shooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # increase values for each iteration of row loop.
            # seems to work perfectly when I print the data
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # increase values for each iteration of column loop
        # seems to work perfectly when I print the data
        date_val[0] = date_val[0] + 55
        day_type[0] = day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output
test = make_df(df1)
Below is some sample output. It shows where the data starts to break down after the fifth day (or first instance of the column shift in the for loop). What am I doing wrong?
Could be index alignment: when a Series is assigned to a DataFrame column, pandas matches rows by index label, so rows whose labels do not match come out as NaN.
import pandas as pd
import numpy as np
output = pd.DataFrame(np.random.rand(5,2), columns=['a','b']) # fake data
output['c'] = list('abcde') # add a column of non-numerical entries
tmp = pd.DataFrame(columns=['a','b','c'])
tmp['a'] = output.iloc[0:2, 2]
tmp['b'] = output.iloc[3:5, 2] # generates NaN
tmp['c'] = output.iloc[0:2, 2]
pd.concat([output, tmp]) # DataFrame.append was removed in pandas 2.0
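If alignment is indeed the cause, one common workaround is to drop the index from the right-hand side before assigning, so values are filled by position. A minimal sketch with made-up numbers (the `source` frame and the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Column "a" holds 0, 2, 4, 6, 8 with index labels 0..4.
source = pd.DataFrame(np.arange(10).reshape(5, 2), columns=["a", "b"])

tmp = pd.DataFrame()
tmp["first"] = source.iloc[0:2, 0]                    # tmp now has index labels 0, 1
tmp["aligned"] = source.iloc[3:5, 0]                  # labels 3, 4 do not match -> NaN
tmp["by_position"] = source.iloc[3:5, 0].to_numpy()   # plain array, filled by position
print(tmp)
```

In the question's function, `temp_df['TIME'] = time` fixes the index labels of `temp_df` first, which would explain why later slices taken from other row ranges come out as NaN.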
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read from the csv files, or df1 itself.
My guess: if val_start:val_end gives out-of-range indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series and possibly make its way into temp_df. iloc does not raise on out-of-range row slices, though an out-of-range column index would trigger IndexError.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=['a','b','c'], index=np.arange(5)) # fake data
df.iloc[0:2, 1] # returns the subset
df.iloc[100:102, 1] # returns: Series([], Name: b, dtype: float64)
A little off topic, but I would recommend preprocessing the csv files rather than dealing with indexing in a pandas DataFrame, as the original format is fairly complex. Slice the data by date and later use pd.melt or DataFrame.groupby to shape it into the format you like. Alternatively, try a MultiIndex if you stick with pandas I/O.

Python Data manipulation: Duplicate and Average row and column values using dates

Hi I have a dataset in the following format:
Code for replicating the data:
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
I input the numbers as a string to show blank cells
Where the first three columns denotes date (Year, Month and Day) and the following columns represent individuals (My actual data file consists of about 300 such rows and about 1000 subjects. I presented a subset of the data here).
Where the column value refers to expenditure on FMCG products.
What I would like to do is the following:
Part 1 (Beginning and end points)
a) For each individual locate the first observation and duplicate the value of the first observation for at least the previous six months. For example: Subject C's 1st observation is on the 10th of August 2008. In that case I would want all the rows from June 10, 2008 to be equal to 65 for Subject C (roughly 2/12/2008 is the cutoff date, so we leave the 3rd cell from the top for Subject_C's column blank).
b) Locate last observation and repeat the last observation for the following 3 months. For example for Subject_A, we repeat 35 twice (till 6th November 2008).
Please refer to the following diagram for the highlighted cell with the solutions.
Part II - (Rows in between)
Next I would like to do two things (I would need to do the following three steps separately, not all at one time):
For individuals like Subject_A, locate two observations that come one after the other (30 and 35).
i) Use the average of the two observations. In this case we would have 32.5 in the four rows without caring about time.
for eg:
ii) Find the total time between two observations and take half of it. For the first half of the period assign the first value and for the second half assign the second value. For example, for Subject_A the total time between 01/22/2008 and 08/10/2008 is 201 days. For the first 201/2 = 100.5 days assign the value of 30 to Subject_A and for the remaining days assign 35. In this case the columns for Subject_A and Subject_C will look like:
The final dataset will use (a), (b) & (i) or (a), (b) & (ii)
Final data I [using a,b and i]
Final data II [using a,b and ii]
I would appreciate any help with this. Thanks in advance. Please let me know if the steps are unclear.
Follow up question and Issues
Thanks @Juan for the initial answer. Here's my follow up question. Suppose that Subject_A has more than 2 observations (code for the example data below). Would we be able to extend this code to incorporate more than 2 observations?
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
Issues
For the current code, I found an issue for part II (ii). This is the output that I get:
This is actually on the right track. The two cells above 35 do not seem to get updated. Is there something wrong on my end? Also the same question as before, would we be able to extend it to the case of >2 observations?
Here is a code solution for Subject_A. It should work for the other subjects too:
import numpy as np
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
## Create a variable named date
d1['date']= pd.to_datetime(d1['Year']+'/'+d1['Month']+'/'+d1['Day'])
# convert to float, to calculate mean
d1['Subject_A'] = d1['Subject_A'].replace('',np.nan).astype(float)
# index of the not null rows
subja = d1['Subject_A'].notnull()
### max and min index row with notnull value
max_id_subja = d1.loc[subja,'date'].idxmax()
min_id_subja = d1.loc[subja,'date'].idxmin()
### max and min date for Sub A with notnull value
max_date_subja = d1.loc[subja,'date'].max()
min_date_subja = d1.loc[subja,'date'].min()
### value for max and min date
max_val_subja = d1.loc[max_id_subja,'Subject_A']
min_val_subja = d1.loc[min_id_subja,'Subject_A']
#### Cutoffs
# DateOffset shifts by calendar months (recent pandas versions reject unit='M' for Timedelta)
min_cutoff = min_date_subja-pd.DateOffset(months=6)
max_cutoff = max_date_subja+pd.DateOffset(months=3)
## PART I.a
d1.loc[(d1['date']<min_date_subja) & (d1['date']>min_cutoff),'Subject_A'] = min_val_subja
## PART I.b
d1.loc[(d1['date']>max_date_subja) & (d1['date']<max_cutoff),'Subject_A'] = max_val_subja
## PART II
d1_2i = d1.copy()
d1_2ii = d1.copy()
lower_date = min_date_subja
lower_val = min_val_subja.copy()
next_dates_index = d1_2i.loc[(d1['date']>min_date_subja) & subja].index
for N in next_dates_index:
    next_date = d1_2i.loc[N,'date']
    next_val = d1_2i.loc[N,'Subject_A']
    #PART II.i
    d1_2i.loc[(d1['date']>lower_date) & (d1['date']<next_date),'Subject_A'] = np.mean([lower_val,next_val])
    #PART II.ii
    mean_time_a = pd.Timedelta((next_date-lower_date).days/2, unit='d')
    d1_2ii.loc[(d1['date']>lower_date) & (d1['date']<=lower_date+mean_time_a),'Subject_A'] = lower_val
    d1_2ii.loc[(d1['date']>lower_date+mean_time_a) & (d1['date']<=next_date),'Subject_A'] = next_val
    lower_date = next_date
    lower_val = next_val
print(d1_2i)
print(d1_2ii)
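As a possible alternative for Part II (ii), splitting each gap at its midpoint amounts to giving every row the value of the observation nearest in time, which `pd.merge_asof` with `direction="nearest"` can do for any number of observations. This is a sketch on a shortened, made-up version of the Subject_A column, not the answer's method. It only covers rows between the first and last observation, and how exact midpoint ties are resolved is not checked here.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the question's data: two observations, 30 and 35.
d = pd.DataFrame({
    "date": pd.to_datetime(["2008-01-06", "2008-01-22", "2008-02-06", "2008-06-18",
                            "2008-07-03", "2008-08-10", "2008-08-14"]),
    "Subject_A": [np.nan, 30, np.nan, np.nan, np.nan, 35, np.nan],
})

# Rows that actually carry an observation.
obs = d.dropna(subset=["Subject_A"])[["date", "Subject_A"]]

# For every date, look up the observation closest in time (both frames are sorted by date).
nearest = pd.merge_asof(d[["date"]], obs, on="date", direction="nearest")

# Only fill rows lying between the first and the last observation.
between = (d["date"] >= obs["date"].min()) & (d["date"] <= obs["date"].max())
d["Subject_A_filled"] = d["Subject_A"].where(~between, nearest["Subject_A"].to_numpy())
print(d)
```

Rows outside the observed range stay NaN here, so the Part I rules (six months before, three months after) would still be applied separately.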

Calculating a max for every X number of lines, how to take leap year into account?

I am trying to take the yearly max of rainfall data for multiple years held in one array. I understand that I could use a for loop if I wanted the max of a single range, and I saw there was a similar question to the problem I'm having. However, I need to take leap years into account!
So for the first year I have 14616 data points from 1960-1965, not including 1965, which contains 2 leap years: 1960 and 1964. A leap year contains 2928 data points and every other year contains 2920 data points.
My first thought was to modify the solution from the similar question, which involved using a for loop as follows (just a straight copy paste from theirs):
for i,d in enumerate(data_you_want):
    if (i % 600) == 0:
        avg_for_day = np.mean(data_you_want[i - 600:i])
        daily_averages.append(avg_for_day)
Theirs involved taking the average of every 600 lines in their data. I thought there might be a way to just modify this, but I couldn't figure out a way for it to work. If modifying this won't work, is there another way to loop over it with the leap years taken into account, without cutting up the file manually?
Fake data:
import numpy as np
fake = np.random.randint(2, 30, size = 14616)
Use pandas to handle the leap year functionality.
Create timestamps for your data with pandas.date_range().
import pandas as pd
index = pd.date_range(start = '1960-1-1 00:00:00', end = '1964-12-31 23:59:59' , freq='3H')
Then create a DataFrame using the timestamps for the index.
df = pd.DataFrame(data = fake, index = index)
Aggregate by year - taking advantage of the DatetimeIndex flexibility.
>>> df.loc['1960'].max()
0 29
dtype: int32
>>> df.loc['1960'].mean()
0 15.501366
dtype: float64
>>>
>>> len(df.loc['1960'])
2928
>>> len(df.loc['1961'])
2920
>>> len(df.loc['1964'])
2928
>>>
I just cobbled this together from the Time Series / Date functionality section of the docs. Given pandas capability this looks a bit naive and probably can be improved upon.
Like resampling (using the same DataFrame)
>>> df.resample('A').mean()
0
1960-12-31 15.501366
1961-12-31 15.170890
1962-12-31 15.412329
1963-12-31 15.538699
1964-12-31 15.382514
>>> df.resample('A').max()
0
1960-12-31 29
1961-12-31 29
1962-12-31 29
1963-12-31 29
1964-12-31 29
>>>
>>> r = df.resample('A')
>>> r.agg([np.sum, np.mean, np.std])
0
sum mean std
1960-12-31 45388 15.501366 8.211835
1961-12-31 44299 15.170890 8.117072
1962-12-31 45004 15.412329 8.257992
1963-12-31 45373 15.538699 7.986877
1964-12-31 45040 15.382514 8.178057
>>>
Food for thought:
Time-aware Rolling vs. Resampling
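Another option, sketched here under my own assumptions rather than taken from the docs section above, is to group on the calendar year of the index. The year boundaries then come from the timestamps themselves, so leap years need no special handling. The seeded random data and the name `rain` are made up for the example.

```python
import numpy as np
import pandas as pd

# Fake data in the same shape as the question: 14616 three-hourly readings.
rng = np.random.default_rng(0)
fake = rng.integers(2, 30, size=14616)

# Three-hourly timestamps starting 1960-01-01 00:00, one per reading.
index = pd.Timestamp("1960-01-01") + pd.to_timedelta(np.arange(fake.size) * 3, unit="h")
rain = pd.Series(fake, index=index)

# Group on the calendar year taken from the index.
yearly_max = rain.groupby(rain.index.year).max()
points_per_year = rain.groupby(rain.index.year).size()
print(yearly_max)
print(points_per_year)
```

`points_per_year` should show 2928 readings for the leap years 1960 and 1964 and 2920 for the others, matching the counts in the question.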
