Given a dataframe df with only one column, consisting of datetime values that can be repeated, e.g.:
date
2017-09-17
2017-09-17
2017-09-22
2017-11-04
2017-11-15
and df.dtypes is date datetime64[ns].
How can I create a new dataframe from the existing one so that, for every month of a particular year, a second column holds the number of observations in that month?
The result for the above example would be something like:
date     observations
2017-09             3
2017-11             2
You can do:
(df['date'].dt.to_period('M')        # truncate each date to its month
 .value_counts()                     # count rows per month
 .reset_index(name='observations')   # back to a dataframe
)
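For reference, here is a minimal end-to-end sketch of that pipeline on the sample data; the sort_index() call is an optional extra that restores chronological order (value_counts sorts by count), and on older pandas the first output column may come back named 'index' rather than 'date':
import pandas as pd

# reconstruction of the sample frame from the question
df = pd.DataFrame({'date': pd.to_datetime(
    ['2017-09-17', '2017-09-17', '2017-09-22', '2017-11-04', '2017-11-15'])})

out = (df['date'].dt.to_period('M')    # truncate each date to its month
       .value_counts()                 # count rows per month
       .sort_index()                   # restore chronological order
       .reset_index(name='observations'))
print(out)
#       date  observations
# 0  2017-09             3
# 1  2017-11             2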
I have a dataframe that looks like the following, where the 'Date' column already has datetime64 dtype:
Date Income_Company_A
0 1990-02-01 2185600.0
1 1990-02-02 3103200.0
........................................
5467 2011-10-10 29555500.0
5468 2011-10-11 54708100.0
How can I get the values of Income_Company_A at the ending date of each year, i.e., 31 Dec of every year from 1990 to 2011?
Also, if the value is Null/NaN on a year-end date, how can I fill it with the most recent earlier value from the dataframe?
The first output with NaN values should look like this:
1990-12-31 1593200.0
1991-12-31 4802000.0
1992-12-31 3302000.0
1993-12-31 5765200.0
1994-12-31 NaN
Then, after replacing the NaN for 1994-12-31 with the value found at a prior date (for example, 1994-12-29, 7865200.0), the final output should look like this:
1990-12-31 1593200.0
1991-12-31 4802000.0
1992-12-31 3302000.0
1993-12-31 5765200.0
1994-12-31 7865200.0
Assuming the Date column is already in datetime dtype, forward-fill first and then keep the 31 Dec rows (filtering first and calling ffill() afterwards would fill a missing year-end from the previous year's 31 Dec rather than from the prior date):
df.ffill().loc[(df['Date'].dt.month == 12) & (df['Date'].dt.day == 31)]
If 31 Dec itself can be absent from the data, take the last available date of each year instead:
df.ffill().loc[df.groupby(df['Date'].dt.year)['Date'].idxmax()]
Use resample and take the last valid value of each year:
out = df.assign(Date=pd.to_datetime(df['Date'])).resample('Y', on='Date').last()
You can omit the .assign(...) if your Date column already has datetime64 dtype.
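To see why the fill has to happen before the filter, here is a small sketch on made-up data covering only the 1993-1994 slice of the question:
import pandas as pd

# made-up slice of the data around the 1994 year end
df = pd.DataFrame({
    'Date': pd.to_datetime(['1993-12-31', '1994-12-29', '1994-12-30', '1994-12-31']),
    'Income_Company_A': [5765200.0, 7865200.0, None, None],
})

# fill first, then filter: the NaN on 1994-12-31 inherits 7865200.0 from 1994-12-29
out = df.ffill().loc[(df['Date'].dt.month == 12) & (df['Date'].dt.day == 31)]
print(out)
#         Date  Income_Company_A
# 0 1993-12-31         5765200.0
# 3 1994-12-31         7865200.0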
I have a hypothetical time series dataframe with some missing observations (the assumption is that the dataframe should include all dates of the year with corresponding values). As the head and tail output shows, certain dates and their corresponding values are missing (30th Jan and 29th Dec). There are many more such gaps in the dataframe, sometimes spanning more than one consecutive date.
Is there a way to detect the missing dates, insert them into the dataframe, and fill the corresponding values with a rolling average over a one-week window (this would naturally increase the number of rows of the dataframe)? I'd appreciate any input.
df.head(3)
date value
0 2020-01-28 25
1 2020-01-29 32
2 2020-01-31 45
df.tail(3)
date value
3 2020-12-28 24
4 2020-12-30 35
5 2020-12-31 37
df.dtypes
date object
value int64
dtype: object
Create a DatetimeIndex, then use DataFrame.asfreq to insert the missing dates, and fill only the resulting NaNs with a 7-day rolling mean:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('d')
df['value'] = df['value'].fillna(df['value'].rolling('7D', min_periods=1).mean())
If you need all dates of the year, reindex against a full-year date range instead:
df['date'] = pd.to_datetime(df['date'])
idx = pd.date_range('2020-01-01', '2020-12-31')
df = df.set_index('date').reindex(idx)
df['value'] = df['value'].fillna(df['value'].rolling('7D', min_periods=1).mean())
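A quick check of the idea on the head of the sample data (2020-01-30 is the row being inserted; the observed rows keep their original values):
import pandas as pd

df = pd.DataFrame({'date': ['2020-01-28', '2020-01-29', '2020-01-31'],
                   'value': [25, 32, 45]})

df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('d')                   # inserts 2020-01-30 as NaN
fill = df['value'].rolling('7D', min_periods=1).mean()  # mean over the past week
df['value'] = df['value'].fillna(fill)                  # only the gap is touched
print(df.loc['2020-01-30', 'value'])                    # 28.5, the mean of 25 and 32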
This is related to a previous question which I asked here (pandas average by timestamp and day of the week).
Here, I perform a groupby operation as follows:
df = pd.DataFrame(np.random.random(2838),index=pd.date_range('2019-09-13 12:40:00', periods=2838, freq='5T'))
# Reset the index
df.reset_index(inplace=True)
df = df.groupby(df['index'].dt.strftime('%A %H:%M')).mean()  # after reset_index the datetimes live in the 'index' column
df.reset_index(inplace=True)
Now if I check the data types of the columns, we have:
index object
0 float64
The 'index' column does not retain its datetime data type. How can I preserve it?
I wouldn't group on formatted strings like that; instead, keep the DatetimeIndex (skip the reset_index step) and group on two levels:
days = df.index.day_name()
times = df.index.time
df.groupby([days,times]).mean()
which gives (head):
0
Friday 00:00:00 0.524322
00:05:00 0.857684
00:10:00 0.593461
00:15:00 0.755158
00:20:00 0.049511
where the first-level index holds the (string) day names, and the second-level index holds datetime.time objects.
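For reference, a runnable version of this approach (the frame is rebuilt from the question, so the random values will differ); the point is that the second level can be looked up with datetime.time objects rather than parsed strings:
import datetime

import numpy as np
import pandas as pd

ind = pd.date_range('2019-09-13 12:40:00', periods=2838, freq='5T')
df = pd.DataFrame(np.random.random(2838), index=ind)

out = df.groupby([df.index.day_name(), df.index.time]).mean()

# select the Friday 00:05 slot without any string parsing
print(out.loc[('Friday', datetime.time(0, 5))])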
I've got a pandas dataframe that looks like this
miles dollars gallons date gal_cost mpg tank%_used day
0 253.2 21.37 11.138 2019-01-15 1.918657 22.732986 0.821993 Tuesday
1 211.9 22.24 11.239 2019-01-26 1.978824 18.853991 0.829446 Saturday
2 258.1 22.70 11.708 2019-02-02 1.938845 22.044756 0.864059 Saturday
3 223.0 22.24 11.713 2019-02-15 1.898745 19.038675 0.864428 Friday
I'd like to create a new column called 'id' that is unique for each entry. For the first entry in the df, the id would be c0115201901 because it is from the df_c dataframe, the date is 01 15 2019 and it is the first entry.
I know I'll end up doing something like this
df_c = df_c.assign(id=('c'+df_c['date']) + ?????)
but I'd like to parse the df_c['date'] column to pull values for the day, month and year individually. The df_c['date'] column is a datetime64[ns] type.
The other issue is I'd like to have a counter at the end of the id to count which number entry for the date it is. For example, 01 for the first entry, 02 for the second, etc.
I also have a df_m dataframe, but I can repeat the process with a different letter for that dataframe.
Refer to the pandas datetime-properties docs.
The date parts can be extracted easily with the .dt accessor:
df_c['date'].dt.day, df_c['date'].dt.month, df_c['date'].dt.year
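Putting it together, here is a hedged sketch of the full id (it assumes at most 99 entries share a date, and uses strftime for the date part plus a per-date cumcount for the counter; only the date column of df_c is rebuilt here):
import pandas as pd

df_c = pd.DataFrame({'date': pd.to_datetime(
    ['2019-01-15', '2019-01-26', '2019-02-02', '2019-02-15'])})

# 01, 02, ... numbering of the entries that share a date
counter = (df_c.groupby('date').cumcount() + 1).astype(str).str.zfill(2)
df_c = df_c.assign(id='c' + df_c['date'].dt.strftime('%m%d%Y') + counter)
print(df_c['id'].tolist())
# ['c0115201901', 'c0126201901', 'c0202201901', 'c0215201901']
For the df_m dataframe, the same expression with 'm' in place of 'c' should work.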
I was trying out time series analysis with pandas dataframes and found that there are easy ways to select specific rows, such as all the rows of a year or the rows between two dates.
For example, consider
ind = pd.date_range('2004-01-01', '2019-08-13')
data = np.random.randn(len(ind))
df = pd.DataFrame(data, index=ind)
Here, we can select all the rows between and including the dates '2014-01-23' and '2014-06-18' with
df['2014-01-23':'2014-06-18']
and all the rows of the year '2015' with just
df['2015']
Is there a similar way to select all the rows belonging to a specific month but for all years?
I found ways to get all the rows of a particular month and a particular year with syntax like
df['01-2015'] #all rows of January 2015
I was hoping pandas would have a way with simple syntax to get all rows of a month irrespective of the year. Does such a way exist?
Use DatetimeIndex.month, compare, and filter with boolean indexing:
print (df[df.index.month == 1])
0
2004-01-01 2.398676
2004-01-02 2.074744
2004-01-03 0.106972
2004-01-04 0.294587
2004-01-05 0.243768
...
2019-01-27 -1.623171
2019-01-28 -0.043810
2019-01-29 -0.999764
2019-01-30 -0.928471
2019-01-31 -0.304730
[496 rows x 1 columns]
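If you ever need more than one month, the same pattern extends with Index.isin:
df[df.index.month.isin([1, 2])]   # all January and February rows, regardless of year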