pandas get data for the end day of month? - python

The data is given as following:
return
2010-01-04 0.016676
2010-01-05 0.003839
...
2010-01-05 0.003839
2010-01-29 0.001248
2010-02-01 0.000134
...
What I want is to extract the values for the last day of each month that appears in the data.
2010-01-29 0.00134
2010-02-28 ......
If I use resample directly, i.e. df.resample('M').last(), I get the correct rows but with the wrong index: it automatically uses the last calendar day of the month as the index.
2010-01-31 0.00134
2010-02-28 ......
How can I get the correct answer in a Pythonic way?

An assumption made here is that your date data is part of the index. If not, I recommend setting it first.
Single Year
I don't think the resampling or grouper functions would do. Let's group on the month number instead and call DataFrameGroupBy.tail.
df.groupby(df.index.month).tail(1)
Multiple Years
If your data spans multiple years, you'll need to group on both the year and the month. Using a single grouper created from DatetimeIndex.strftime:
df.groupby(df.index.strftime('%Y-%m')).tail(1)
Or, using multiple groupers—
df.groupby([df.index.year, df.index.month]).tail(1)
Note—if your index is not a DatetimeIndex as assumed here, you'll need to replace df.index with pd.to_datetime(df.index, errors='coerce') above.
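As a minimal sketch of the multi-year approach, assuming made-up business-day data (the `ret` column name and the date range are illustrative, not taken from the question), grouping on year and month and taking the last row of each group keeps the original index labels:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value per business day, so calendar month ends
# that fall on a weekend are missing from the index.
idx = pd.bdate_range("2010-01-04", "2010-03-31")
df = pd.DataFrame({"ret": np.arange(len(idx), dtype=float)}, index=idx)

# Last available row of every (year, month) pair; the index is left untouched.
month_ends = df.groupby([df.index.year, df.index.month]).tail(1)
print(month_ends)
```

Here January and February 2010 end on a weekend, so the selected rows carry the last dates actually present in the data rather than the calendar month ends that resample would use as labels.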

Although this doesn't answer the question properly I'll leave it if someone is interested.
An approach that only works if you are certain you have every calendar day (IMPORTANT!) is to add one day with pd.Timedelta and check whether the resulting day is 1. In a small timing test it ran about 6x faster than the groupby solution.
df[(df['dates'] + pd.Timedelta(days=1)).dt.day == 1]
Or if index:
df[(df.index + pd.Timedelta(days=1)).day == 1]
Full example:
import pandas as pd
df = pd.DataFrame({
'dates': pd.date_range(start='2016-01-01', end='2017-12-31'),
'i': 1
}).set_index('dates')
dfout = df[(df.index + pd.Timedelta(days=1)).day == 1]
print(dfout)
Returns:
i
dates
2016-01-31 1
2016-02-29 1
2016-03-31 1
2016-04-30 1
2016-05-31 1
2016-06-30 1
2016-07-31 1
2016-08-31 1
2016-09-30 1
2016-10-31 1
2016-11-30 1
2016-12-31 1
2017-01-31 1
2017-02-28 1
2017-03-31 1
2017-04-30 1
2017-05-31 1
2017-06-30 1
2017-07-31 1
2017-08-31 1
2017-09-30 1
2017-10-31 1
2017-11-30 1
2017-12-31 1

Related

How can I count values for each date in a Dataframe based conditionally on the value of a column?

I have a dataframe with xenophobic and non-xenophobic tweets.
For each day, I want to count the number of tweets that have a sentiment of 1.
This is the DataFrame df_unevaluated:
sentiment id date text
0 0 9.820000e+17 2018-04-05 11:43:31+00:00 but if she had stated another fact like that I may have thought...
1 0 1.170000e+18 2019-09-03 22:53:30+00:00 the worst thing that dude has done this week is ramble about the...
2 0 1.140000e+18 2019-06-28 17:43:07+00:00 i think immigrants of all walks of life should be allowed into...
3 0 2.810000e+17 2012-12-18 00:43:57+00:00 why is america not treating the immigrants like normal people...
4 1 8.310000e+17 2017-02-14 01:42:26+00:00 who the hell wants to live in canada anyhow the people there...
...
This is what I've tried:
# Put all tweets with sentiment = 1 into a DataFrame
for i in range(len(df_unevaluated)):
    if df_unevaluated['sentiment'][i] == 1:
        df_xenophobic = df_xenophobic.append(df_unevaluated.iloc[[i]])
# Store a copy of df_xenophobic in df_counts
df_counts = df_xenophobic
# Change df_counts to get counts for each date
df_counts = (pd.to_datetime(df_counts['date'])
.dt.floor('d')
.value_counts()
.rename_axis('date')
.reset_index(name='count'))
# Sort data and drop index column
df_counts = df_counts.sort_values('date')
df_counts = df_counts.reset_index(drop=True)
# Look at data
df_counts.head()
This was the output:
date count
0 2012-03-14 00:00:00+00:00 1
1 2012-03-19 00:00:00+00:00 1
2 2012-04-07 00:00:00+00:00 1
3 2012-04-10 00:00:00+00:00 1
4 2012-04-19 00:00:00+00:00 1
...
This is what I expected:
date count
0 2012-03-14 00:00:00+00:00 1
1 2012-03-15 00:00:00+00:00 0
2 2012-03-16 00:00:00+00:00 0
3 2012-03-17 00:00:00+00:00 0
4 2012-03-18 00:00:00+00:00 0
5 2012-03-19 00:00:00+00:00 1
6 2012-03-20 00:00:00+00:00 0
7 2012-03-21 00:00:00+00:00 0
...
These are some links I've read through:
Python & Pandas - Group by day and count for each day
Using value_counts in pandas with conditions
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.floor.html
To be clearer, each date has the format YYYY-MM-DD HH:MM:SS+00:00.
As seen in my attempt, I try to round the dates column to its day. My goal is to count the number of times sentiment = 1 for that day.
If I understood your question correctly, then it should be as simple as follows:
import pandas as pd
# Data Load
df = pd.DataFrame(data={'Date': ['2022-11-28 11:43:31+00:00', '2022-11-28 22:53:30+00:00', '2022-11-29 17:43:07+00:00', '2022-12-01 01:42:26+00:00', '2022-12-01 02:40:26+00:00'],
'Sentiment': [ 0, 1, 0, 1, 1]})
df['Date'] = pd.to_datetime(df['Date']).dt.date
df_counts = df.groupby(by=['Date']).sum().reset_index()
The df_counts data frame should give output like this:
Date Sentiment
0 2022-11-28 1
1 2022-11-29 0
2 2022-12-01 2
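The question also expects days without any matching tweets to appear with a count of 0. One way to sketch that, assuming the same `Date` and `Sentiment` columns as the sample data in this answer, is to count per day and then reindex over the full daily range:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2022-11-28 11:43:31+00:00", "2022-11-28 22:53:30+00:00",
             "2022-11-29 17:43:07+00:00", "2022-12-01 01:42:26+00:00",
             "2022-12-01 02:40:26+00:00"],
    "Sentiment": [0, 1, 0, 1, 1],
})

# Truncate each timestamp to midnight of its day (the UTC timezone is kept).
days = pd.to_datetime(df["Date"]).dt.floor("D")

# Count the rows with Sentiment == 1 for each day that occurs in the data.
counts = (df["Sentiment"] == 1).astype(int).groupby(days).sum()

# Reindex over every calendar day in the range; days with no rows become 0.
full_range = pd.date_range(counts.index.min(), counts.index.max(), freq="D")
counts = (counts.reindex(full_range, fill_value=0)
                .rename_axis("date")
                .reset_index(name="count"))
print(counts)
```

Comparing with `== 1` rather than summing the column directly keeps the count correct even if the sentiment column ever holds values other than 0 and 1.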

Best approach to group the differences between each row by month, year, etc?

I have a dataframe like the following:
Index Diff
2019-03-14 11:32:21.583000+00:00 0
2019-03-14 11:32:21.583000+00:00 2
2019-04-14 11:32:21.600000+00:00 13
2019-04-14 11:32:21.600000+00:00 14
2019-05-14 11:32:21.600000+00:00 19
2019-05-14 11:32:21.600000+00:00 27
What would be the best approach to group by the month and take the difference inside of those months?
Using the .diff() option I am able to find the difference between each row, but I am trying to use the df.groupby(pd.Grouper(freq='M')) with no success.
Expected Output:
Index Diff
0 2019-03-31 00:00:00+00:00 2.0
1 2019-04-30 00:00:00+00:00 1.0
2 2019-05-31 00:00:00+00:00 8.0
Any help would be much appreciated!!
If your dates are a regular column rather than the index, you can comment out df1 = df.reset_index(). If they are on the index, check that the index is a DatetimeIndex; if it is not, convert it with df.index = pd.to_datetime(df.index). Then compute the within-month differences with df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].diff() and group the full dataframe by month again to aggregate:
input:
import pandas as pd
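# Six rows, two per month, matching the data in the question
df = pd.DataFrame({'Diff': [0, 2, 13, 14, 19, 27]},
                  index=pd.to_datetime(['2019-03-14 11:32:21.583000+00:00',
                                        '2019-03-14 11:32:21.583000+00:00',
                                        '2019-04-14 11:32:21.600000+00:00',
                                        '2019-04-14 11:32:21.600000+00:00',
                                        '2019-05-14 11:32:21.600000+00:00',
                                        '2019-05-14 11:32:21.600000+00:00']))
df.index.name = 'Index'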
code:
df1 = df.reset_index()
df1['Diff'] = df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].diff()
df1 = df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].max().reset_index()
df1
output:
Index Diff
0 2019-03-31 00:00:00+00:00 2.0
1 2019-04-30 00:00:00+00:00 1.0
2 2019-05-31 00:00:00+00:00 8.0

Is there a Pandas function to highlight a week's 10 lowest values in a time series?

Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
'2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
'2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
'2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest. I guess there should be an nsmallest counterpart. pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
My bad, so your DatetimeIndex is sampled hourly, and you need the hours with the fewest events in each week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well I'd start by converting each hour into columns.
1. Create an hour column that holds the hour of the day.
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values.
You'll then have one datetime index and 24 hour columns, with values denoting the number of events. pandas.DataFrame.pivot_table
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
Then you can resample it to weekly level aggregate using sum.
df.resample('w').sum()
The last part is a bit tricky to do on the dataframe. But fairly simple if you just need the output.
for row in df.itertuples():
    print(sorted(row[1:]))
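A more direct route to the original goal, sketched here under the assumption of an hourly DatetimeIndex and a hypothetical `n_events` column, is to group the rows by week and take the 10 smallest values in each group:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two full weeks (Monday to Sunday) of hourly event counts.
idx = pd.date_range("2020-06-01", periods=24 * 14, freq="h")
rng = np.random.default_rng(0)
df = pd.DataFrame({"n_events": rng.integers(0, 50, size=len(idx))}, index=idx)

# Group the hourly rows into calendar weeks, then keep the 10 smallest
# counts per week. The result is indexed by (week label, original timestamp).
lowest = df.groupby(pd.Grouper(freq="W"))["n_events"].nsmallest(10)
print(lowest)

# Boolean flag on the original frame, e.g. for highlighting in a plot.
df["is_low"] = df.index.isin(lowest.index.get_level_values(1))
```

The `is_low` column is only one possible way to carry the result back to the original rows; ties at the cutoff are resolved by nsmallest keeping the first occurrences.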

Resample DataFrame with DatetimeIndex and keep date range

My problem might sound trivial but I haven't found any solution for it:
I want the resampled data to remain in the same date range as the original data when I resample a DataFrame with a DatetimeIndex e.g. into three-monthly values.
Minimal example:
import numpy as np
import pandas as pd
# data from 2014 to 2016
dim = 8760 * 3 + 24
idx = pd.date_range('1/1/2014 00:00:00', freq='h', periods=dim)
df = pd.DataFrame(np.random.randn(dim, 2), index=idx)
# resample to three months
df = df.resample('3M').sum()
print(df)
yielding
0 1
2014-01-31 24.546928 -16.082389
2014-04-30 -52.966507 -40.255773
2014-07-31 -32.580114 47.096810
2014-10-31 -9.501333 12.872683
2015-01-31 -106.504047 45.082733
2015-04-30 -34.230358 70.508420
2015-07-31 -35.916497 104.930101
2015-10-31 -16.780425 17.411410
2016-01-31 68.512994 -43.772082
2016-04-30 -0.349917 27.794895
2016-07-31 -30.408862 -18.182486
2016-10-31 -97.355730 -105.961101
2017-01-31 -7.221361 40.037358
Why does the resampling exceed the date range e.g. create an entry for 2017-01-31 and how can I prevent this and instead remain within the original range e.g. between 2014-01-01 and 2016-12-31? And shouldn't this be the expected standard behaviour going from January-March, April-June, ... October-December?
Thanks in advance!
There are 36 months in your DataFrame.
With '3M', the bins are anchored on month ends and are closed and labelled on the right by default. The first row therefore contains everything up to the end of your first month (2014-01-31), the second row covers the three months after that, and so on. Your last row covers everything after 2016-10-31 up to three months later, which is 2017-01-31.
If you want, you could change it to
df.resample('3M', closed='left', label='left').sum()
giving you
2013-10-31 3.705955 25.394287
2014-01-31 38.778872 -12.655323
2014-04-30 10.382832 -64.649173
2014-07-31 66.939190 31.966008
2014-10-31 -39.453572 27.431183
2015-01-31 66.436348 29.585436
2015-04-30 78.731608 -25.150526
2015-07-31 14.493226 -5.842421
2015-10-31 -2.394419 58.017105
2016-01-31 -36.295499 -14.542251
2016-04-30 69.794101 62.572736
2016-07-31 76.600558 -17.706111
2016-10-31 -68.842328 -32.723581
but then the first row would be 'outside your range'.
If you resample every 3 months, then either your first row is going to be outside your range, or your last one is.
EDIT
If you want the bins to be 'first three months', 'next three months', and so on, you could write
df.resample('3MS').sum()
as this will take the beginning of each month rather than its end (see https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases)
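To illustrate the month-start variant, here is a small sketch with made-up daily data (a constant column `x`, chosen so that each sum equals the number of days in the bin). The resulting labels stay inside the original 2014 to 2016 range:

```python
import numpy as np
import pandas as pd

# Daily data covering 2014-01-01 through 2016-12-31 (illustrative values).
idx = pd.date_range("2014-01-01", "2016-12-31", freq="D")
df = pd.DataFrame({"x": np.ones(len(idx))}, index=idx)

# Month-start anchored bins: Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec.
quarterly = df.resample("3MS").sum()
print(quarterly)
```

Each row is labelled with the first day of its three-month bin, so the first label is 2014-01-01 and the last is 2016-10-01.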

pick month start and end data in python

I have stock data downloaded from Yahoo Finance. I want to pick the rows corresponding to the start and end of each month. I am trying to do this with a pandas DataFrame, but I can't find the right method to get the start and end of the month. I would be grateful if somebody could help me solve this.
Please note that if 1st of the month is holiday and there is no data for that, I need to pick up 2nd day's data. Same rule applies to last of the month also. Thanks in advance.
Example data is
2016-01-05,222.80,222.80,217.00,217.75,15074800,217.75
2016-01-04,226.95,226.95,220.05,220.70,14092000,220.70
2015-12-31,225.95,226.55,224.00,224.45,11558300,224.45
2015-12-30,229.00,229.70,224.85,225.80,11702800,225.80
2015-12-29,228.85,229.95,227.50,228.20,7263200,228.20
2015-12-28,229.05,229.95,228.00,228.90,8756800,228.90
........
........
2015-12-04,240.00,242.15,238.05,241.10,11115100,241.10
2015-12-03,244.15,244.50,240.40,241.10,7155600,241.10
2015-12-02,250.55,250.65,243.75,244.60,10881700,244.60
2015-11-30,249.65,253.00,245.00,250.20,12865400,250.20
2015-11-27,243.00,250.50,242.80,249.70,15149900,249.70
2015-11-26,241.95,244.90,241.00,242.50,13629800,242.50
First, you should convert your date column to datetime format, then group by month, then sort groupby Series by date and take the first/last from it using head/tail methods, like so:
In [37]: df
Out[37]:
0 1 2 3 4 5 6
0 2016-01-05 222.80 222.80 217.00 217.75 15074800 217.75
1 2016-01-04 226.95 226.95 220.05 220.70 14092000 220.70
2 2015-12-31 225.95 226.55 224.00 224.45 11558300 224.45
3 2015-12-30 229.00 229.70 224.85 225.80 11702800 225.80
4 2015-12-29 228.85 229.95 227.50 228.20 7263200 228.20
5 2015-12-28 229.05 229.95 228.00 228.90 8756800 228.90
In [25]: import datetime
In [29]: df[0] = df[0].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
In [36]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).head(1))
Out[36]:
0 1 2 3 4 5 6
0
1 1 2016-01-04 226.95 226.95 220.05 220.7 14092000 220.7
12 5 2015-12-28 229.05 229.95 228.00 228.9 8756800 228.9
In [38]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).tail(1))
Out[38]:
0 1 2 3 4 5 6
0
1 0 2016-01-05 222.80 222.80 217.0 217.75 15074800 217.75
12 2 2015-12-31 225.95 226.55 224.0 224.45 11558300 224.45
You can merge the result dataframes, using pd.concat()
For the first / last business day of each month, you can use .resample() with 'BMS' (business month start) and 'BM' (business month end), like so (using pandas 0.18 syntax; pandas 2.2 and later spell the month-end alias 'BME'):
df.resample('BMS').first()
df.resample('BM').last()
This assumes that your data have a DateTimeIndex as usual when downloaded from yahoo using pandas_datareader:
from datetime import datetime
from pandas_datareader.data import DataReader
df = DataReader('FB', 'yahoo', datetime(2015, 1, 1), datetime(2015, 3, 31))['Open']
df.head()
Date
2015-01-02 78.580002
2015-01-05 77.980003
2015-01-06 77.230003
2015-01-07 76.760002
2015-01-08 76.739998
Name: Open, dtype: float64
df.tail()
Date
2015-03-25 85.500000
2015-03-26 82.720001
2015-03-27 83.379997
2015-03-30 83.809998
2015-03-31 82.900002
Name: Open, dtype: float64
do:
df.resample('BMS').first()
Date
2015-01-01 78.580002
2015-02-02 76.110001
2015-03-02 79.000000
Freq: BMS, Name: Open, dtype: float64
and
df.resample('BM').last()
to get:
Date
2015-01-30 78.000000
2015-02-27 80.680000
2015-03-31 82.900002
Freq: BM, Name: Open, dtype: float64
Assuming you have downloaded data from Yahoo:
> import pandas_datareader.data as web
> import datetime
> start = datetime.datetime(2016,1,1)
> end = datetime.datetime(2016,5,1)
> df = web.DataReader("AAPL", "yahoo", start, end)
You simply pick the month end and start rows with:
df[df.index.is_month_end]
df[df.index.is_month_start]
If you want to access a specific row, like the first row of the first starting day of the selected starting days, you simply do:
df[df.index.is_month_start].iloc[0]
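Note that is_month_start and is_month_end only match the calendar first and last day, so a month whose first or last day is a holiday yields no row. A hedged sketch of the holiday-tolerant variant the question asks for, using a few rows from the sample data and an assumed `close` column name, sorts the index and takes the first and last available row of each month:

```python
import pandas as pd

# A few rows from the question's sample (date and last column); newest first.
df = pd.DataFrame(
    {"close": [217.75, 220.70, 224.45, 225.80, 241.10, 244.60, 250.20, 249.70]},
    index=pd.to_datetime([
        "2016-01-05", "2016-01-04", "2015-12-31", "2015-12-30",
        "2015-12-03", "2015-12-02", "2015-11-30", "2015-11-27",
    ]),
)

df = df.sort_index()                    # oldest first
month = df.index.to_period("M")         # one label per calendar month

first_rows = df.groupby(month).head(1)  # first available day of each month
last_rows = df.groupby(month).tail(1)   # last available day of each month
print(first_rows)
print(last_rows)
```

Because the grouping is done on whatever dates are present, December 2015 starts at 2015-12-02 here, the first date this subset has for that month.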
