Add index weekday to pandas dataframe - python

I have the following data frame, which is indexed by date_time:
date_time rsvp_limit rsvp_yes dropout
2017-11-30 19:00:00 240 229 0.045833
2017-10-19 19:00:00 300 300 0.000000
2017-06-26 19:00:00 300 300 0.000000
When I try to add a weekday column to it, somehow it does not succeed:
weekday_dropoouts = events['dropout'].copy()
weekday_dropoouts['weekday'] = weekday_dropoouts.index.weekday_name
weekday_dropoouts[:3]
Gives me:
date_time
2017-11-30 19:00:00 0.0458333
2017-10-19 19:00:00 0
2017-06-26 19:00:00 0
Name: dropout, dtype: object
What I'm trying to achieve is a bar plot per weekday, i.e. basically I'm trying to figure out on which weekday events experience the highest dropout.
I'm sure I'm missing something fundamental here, but I can't figure out what it is.

Could it be a type issue?
Is weekday_dropoouts.index definitely a DatetimeIndex?
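A likely cause, assuming events is the DataFrame from the post: events['dropout'].copy() returns a Series, and item assignment on a Series creates a new element labelled 'weekday' (coercing the dtype to object, as seen in the output above) rather than a new column. A sketch of the DataFrame route, using day_name() since weekday_name was removed in pandas 1.0:
import pandas as pd

# Select with a list of columns to keep a DataFrame rather than a Series.
weekday_dropouts = events[['dropout']].copy()
weekday_dropouts['weekday'] = weekday_dropouts.index.day_name()

# Mean dropout per weekday, as a bar plot.
weekday_dropouts.groupby('weekday')['dropout'].mean().plot.bar()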


How to plot part of date time in Python

I have massive data from a CSV file which spans every hour for a whole year. It has not been difficult to plot the whole data (or specific data) through the whole year.
However, I would like to take a closer look at a single month (for example, just plot January or February), and for the life of me I haven't found out how to do that.
Date Company1 Company2
2020-01-01 00:00:00 100 200
2020-01-01 01:00:00 110 180
2020-01-01 02:00:00 90 210
2020-01-01 03:00:00 100 200
.... ... ...
2020-12-31 21:00:00 100 200
2020-12-31 22:00:00 80 230
2020-12-31 23:00:00 120 220
All of the columns are correctly formatted, and the datetime column is parsed correctly. How can I slice or define exactly the period I want to plot?
You can extract the month portion of a pandas datetime using .dt.month on a datetime series. Then check if that is equal to the month in question:
df_january = df[df['Date'].dt.month == 1]
You can then plot using your df_january dataframe. N.B. this will pick up data from other years as well if your dataset expands to cover other years.
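If the dataset does grow beyond one year, adding a year condition keeps the selection unambiguous; a small sketch using the column names from the post:
df_january_2020 = df[(df['Date'].dt.month == 1) & (df['Date'].dt.year == 2020)]
df_january_2020.plot(x='Date', y='Company1')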
@WakemeUpNow had the solution I hadn't noticed: defining xlim while plotting did the trick.
import matplotlib.pyplot as plt

df.plot(x='Date', y='Company1', xlim=('2020-01-01 00:00:00', '2020-12-31 23:00:00'))
plt.show()
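Alternatively, if Date is set as the index, pandas' partial-string indexing can select a whole month directly; a sketch assuming the frame from the post:
df_indexed = df.set_index('Date')
df_january = df_indexed.loc['2020-01']  # every row in January 2020
df_january.plot(y='Company1')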

selecting rows in dataframe using datetime.datetime

I am new to Python.
I want to select a range of rows by using the datetime which is also the index.
I am not sure if having the datetime as the index is a problem or not.
my dataframe looks like this:
gradient
date
2022-04-15 10:00:00 0.013714
2022-04-15 10:20:00 0.140792
2022-04-15 10:40:00 0.148240
2022-04-15 11:00:00 0.016510
2022-04-15 11:20:00 0.018219
...
2022-05-02 15:40:00 0.191208
2022-05-02 16:00:00 0.016198
2022-05-02 16:20:00 0.043312
2022-05-02 16:40:00 0.500573
2022-05-02 17:00:00 0.955833
I have made variables containing the start and end dates of the rows I want to select:
A_start_646 = datetime.datetime(2022,4,27, 11,0,0)
S_start_646 = datetime.datetime(2022,4,28, 3,0,0)
D_start_646 = datetime.datetime(2022,5,2, 15,25,0)
D_end_646 = datetime.datetime(2022,5, 2, 15,50,0)
So I would like to make a new dataframe. I saw some examples on the internet, but they use another way of expressing the date.
Does someone know a solution?
I feel kind of stupid and smart at the same time now, because I have already answered my own question. My apologies!
So this is the answer:
new_df = data_646_mean[A_start_646 : S_start_646]
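For reference, the explicit .loc form does the same thing, and a DatetimeIndex also accepts date strings; both endpoints of the slice are inclusive:
new_df = data_646_mean.loc[A_start_646:S_start_646]
new_df = data_646_mean.loc['2022-04-27 11:00:00':'2022-04-28 03:00:00']  # equivalent string slice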

How do I take the mean on either side of a value in a pandas DataFrame?

I have a Pandas DataFrame where the index is datetimes for every 12 minutes in a day (120 rows total). I went ahead and resampled the data to every 30 minutes.
Time Rain_Rate
1 2014-04-02 00:00:00 0.50
2 2014-04-02 00:30:00 1.10
3 2014-04-02 01:00:00 0.48
4 2014-04-02 01:30:00 2.30
5 2014-04-02 02:00:00 4.10
6 2014-04-02 02:30:00 5.00
7 2014-04-02 03:00:00 3.20
I want to take 3 hour means centered on hours 00, 03, 06, 09, 12, 15 ,18, and 21. I want the mean to consist of 1.5 hours before 03:00:00 (so 01:30:00) and 1.5 hours after 03:00:00 (04:30:00). The 06:00:00 time would overlap with the 03:00:00 average (they would both use 04:30:00).
Is there a way to do this using pandas? I've tried a few things but they haven't worked.
Method 1
I'm going to suggest just changing your resample from the get-go to get the chunks you want. Here's some fake data resembling yours, before any resampling:
import numpy as np
import pandas as pd

dr = pd.date_range('04-02-2014 00:00:00', '04-03-2014 00:00:00', freq='12T', closed='left')
data = np.random.rand(120)
df = pd.DataFrame(data, index=dr, columns=['Rain_Rate'])
df.index.name = 'Time'
#df.head()
Rain_Rate
Time
2014-04-02 00:00:00 0.616588
2014-04-02 00:12:00 0.201390
2014-04-02 00:24:00 0.802754
2014-04-02 00:36:00 0.712743
2014-04-02 00:48:00 0.711766
Averaging in 3-hour chunks from the start is the same as first doing 30-minute chunks and then 3-hour chunks. You just have to tweak a couple of things to get the bins you want. First, add the point you will start binning from (i.e. 10:30 pm on the previous day, even if there's no data there; the first bin runs from 10:30 pm to 1:30 am), then resample starting from this point:
before = df.index[0] - pd.Timedelta(minutes=90) #only if the first index is at midnight!!!
df.loc[before] = np.nan
df = df.sort_index()
output = df.resample('3H', base=22.5, loffset='90min').mean()
The base parameter here means the bins start at the 22.5th hour (10:30 pm), and loffset shifts the bin labels forward by 90 minutes, so each bin is labelled by its center. You get the following output:
Rain_Rate
Time
2014-04-02 00:00:00 0.555515
2014-04-02 03:00:00 0.546571
2014-04-02 06:00:00 0.439953
2014-04-02 09:00:00 0.460898
2014-04-02 12:00:00 0.506690
2014-04-02 15:00:00 0.605775
2014-04-02 18:00:00 0.448838
2014-04-02 21:00:00 0.387380
2014-04-03 00:00:00 0.604204 #this is the bin at midnight on the following day
You could also start with the data binned at 30 minutes and use this method, and should get the same answer.*
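Note that base and loffset were deprecated in pandas 1.1 and later removed; a sketch of the same binning with the newer offset argument, shifting the labels to the bin centers by hand:
output = df.resample('3H', offset='-90min').mean()  # bin edges at 22:30, 01:30, 04:30, ...
output.index = output.index + pd.Timedelta(minutes=90)  # label each bin by its center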
Method 2
Another approach would be to find the locations of the indexes you want to create averages for, and then calculate the averages for entries in the 3 hours surrounding:
resampled = df.resample('30T').mean()  # like your data in the post
centers = [0, 3, 6, 9, 12, 15, 18, 21]
mask = df.index.hour.isin(centers) & (df.index.minute == 0)
df_centers = df.index[mask]
output = []
for center in df_centers:
    cond1 = df.index >= (center - pd.Timedelta(hours=1.5))
    cond2 = df.index <= (center + pd.Timedelta(hours=1.5))
    output.append(df[cond1 & cond2].values.mean())
Output here is the same, but the answers are in a list (and the last point of "24 hours" is not included):
[0.5555146139562004,
0.5465709237162698,
0.43995277270996735,
0.46089800625663596,
0.5066902552121085,
0.6057747262752732,
0.44883794039466535,
0.3873795731806939]
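If a labelled result is more convenient than a bare list, the centers computed above can serve as its index:
result = pd.Series(output, index=df_centers)  # same means, labelled by center time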
*You mentioned you wanted some points on the edge of bins to be included in both bins. resample doesn't do this (and generally I don't think most people want to), but the second method is explicit about doing so (by using >= and <= in cond1 and cond2). The two methods still achieve the same result here, presumably because using resample at different stages causes data points to fall into different bins. It's hard for me to wrap my head around that, but one could do a little manual binning to verify what is going on. The point is, I would recommend spot-checking the output of these methods (or any resample-based method) against your raw data to make sure things look correct. For these examples, I did so using Excel.

to_datetime in pandas changes the date of my datetime data

I use the following code to extract the datetime from a .csv file:
house_data = 'test_1house_EV.csv'
house1 = pandas.read_csv(house_data)
time = pandas.to_datetime(house1["localminute"])
The datetime data to be extracted are the 1440 minutes of September 1, 2017.
However, after using to_datetime, the minutes between 00:00 and 05:00 are placed on September 2.
e.g. the original data looks like this:
28 2017-09-01 00:28:00-05
29 2017-09-01 00:29:00-05
...
1411 2017-09-01 23:31:00-05
1412 2017-09-01 23:32:00-05
but the datetime data looks like this:
28 2017-09-01 05:28:00
29 2017-09-01 05:29:00
...
1410 2017-09-02 04:30:00
1411 2017-09-02 04:31:00
Does anyone know how to fix this?
Use this, as per @James' suggestion (the trailing -%f simply consumes the -05 UTC offset as if it were fractional seconds, so the offset is ignored):
pd.to_datetime(house1["localminute"], format='%Y-%m-%d %H:%M:%S-%f')
You can slice off the last three characters of the date string before converting.
pd.to_datetime(house1.localminute.str[:-3])
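Both answers above simply discard the UTC offset. If you would rather honour it and still keep local wall-clock times, a sketch (assuming the whole file sits at UTC-05; note that the POSIX name Etc/GMT+5 means UTC-05):
time = (pd.to_datetime(house1['localminute'], utc=True)  # parse offset-aware, converted to UTC
          .dt.tz_convert('Etc/GMT+5')                    # convert back to UTC-05
          .dt.tz_localize(None))                         # drop the tz, keep local clock time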

How to sum field across two DataFrames when the indexes don't line up?

I am brand new to complex data analysis in general, and to pandas in particular. I have a feeling that pandas should be able to handle this task easily, but my newbieness prevents me from seeing the path to a solution. I want to sum one column across two files at a given time each day (3pm in this case). If a file doesn't have a record at 3pm on a given day, I want to use the most recent record before that.
Let me give a concrete example. I have data in two CSV files. Here are a couple small examples:
datetime value
2013-02-28 09:30:00 0.565019720442
2013-03-01 09:30:00 0.549536266504
2013-03-04 09:30:00 0.5023031467
2013-03-05 09:30:00 0.698370467751
2013-03-06 09:30:00 0.75834927162
2013-03-07 09:30:00 0.783620442226
2013-03-11 09:30:00 0.777265379462
2013-03-12 09:30:00 0.785787872851
2013-03-13 09:30:00 0.784873183044
2013-03-14 10:15:00 0.802959366653
2013-03-15 10:15:00 0.802959366653
2013-03-18 10:15:00 0.805413095911
2013-03-19 09:30:00 0.80816233134
2013-03-20 10:15:00 0.878912249996
2013-03-21 10:15:00 0.986393922571
and the other:
datetime value
2013-02-28 05:00:00 0.0373634672097
2013-03-01 05:00:00 -0.24700085273
2013-03-04 05:00:00 -0.452964976056
2013-03-05 05:00:00 -0.2479288295
2013-03-06 05:00:00 -0.0326855588777
2013-03-07 05:00:00 0.0780461766619
2013-03-08 05:00:00 0.306247682656
2013-03-11 06:00:00 0.0194146154407
2013-03-12 05:30:00 0.0103653153719
2013-03-13 05:30:00 0.0350377752558
2013-03-14 05:30:00 0.0110884755383
2013-03-15 05:30:00 -0.173216846788
2013-03-19 05:30:00 -0.211785013352
2013-03-20 05:30:00 -0.891054563968
2013-03-21 05:30:00 -1.27207563599
2013-03-22 05:30:00 -1.28648629004
2013-03-25 05:30:00 -1.5459897419
Note that a) neither file actually has a 3pm record, and b) the two files don't always have records for any given day. (2013-03-08 is missing from the first file, while 2013-03-18 is missing from the second, and the first file ends before the second.) As output, I envision a dataframe like this (perhaps just the date without the time):
datetime value
2013-Feb-28 15:00:00 0.6023831876517
2013-Mar-01 15:00:00 0.302535413774
2013-Mar-04 15:00:00 0.049338170644
2013-Mar-05 15:00:00 0.450441638251
2013-Mar-06 15:00:00 0.7256637127423
2013-Mar-07 15:00:00 0.8616666188879
2013-Mar-08 15:00:00 0.306247682656
2013-Mar-11 15:00:00 0.7966799949027
2013-Mar-12 15:00:00 0.7961531882229
2013-Mar-13 15:00:00 0.8199109582998
2013-Mar-14 15:00:00 0.8140478421913
2013-Mar-15 15:00:00 0.629742519865
2013-Mar-18 15:00:00 0.805413095911
2013-Mar-19 15:00:00 0.596377317988
2013-Mar-20 15:00:00 -0.012142313972
2013-Mar-21 15:00:00 -0.285681713419
2013-Mar-22 15:00:00 -1.28648629004
2013-Mar-25 15:00:00 -1.5459897419
I have a feeling this is perhaps a three-liner in pandas, but it's not at all clear to me how to do this. Further complicating my thinking about this problem, more complex CSV files might have multiple records for a single day (same date, different times).
It seems that I need to either generate a new pair of input dataframes with times at 15:00 and then sum across their value columns keying on just the date, or, during the sum operation, select the record with the greatest time on any given day with time <= 15:00:00. Given that datetime.time objects can't be compared for magnitude, I suspect I might have to group rows by date, then within each group select only the row nearest to (but not greater than) 3pm. At that point my brain kind of explodes.
I got nowhere looking at the documentation, as I don't really understand all the database-like operations pandas supports. Pointers to relevant documentation (especially tutorials) would be much appreciated.
First combine your DataFrames so that everything is in one table. (Note that DataFrame.append was removed in pandas 2.0; pd.concat([df1, df2]) is the modern equivalent.)
df3 = df1.append(df2)
Next, use groupby to sum across timestamps:
df4 = df3.groupby('datetime').aggregate(sum)
Now df4 has a value column that is the sum of all rows with matching datetime values.
Assuming you have the timestamps as datetime objects, you can do whatever filtering you like at any stage:
import datetime

filtered = df[df['datetime'] < datetime.datetime(year, month, day, hour, minute, second)]
I'm not sure exactly what you are trying to do; you may need to parse your timestamp columns before filtering.
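For the specific 3pm sum, here is a sketch of one possible pipeline, assuming df1 and df2 each have a parsed 'datetime' column and a 'value' column as in the post. (between_time sidesteps the time-comparison worry; datetime.time values do in fact support ordering comparisons.)
import pandas as pd

def value_asof_3pm(df):
    # Last 'value' at or before 15:00 on each date.
    s = df.set_index('datetime')['value'].sort_index()
    s = s.between_time('00:00', '15:00')  # drop anything after 3pm
    return s.groupby(s.index.date).last()

# Missing days in one file simply contribute nothing to that day's sum.
total = value_asof_3pm(df1).add(value_asof_3pm(df2), fill_value=0)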
