Hello to the people of the web,
I have a dataframe with 'DATE' (datetime) as the index and 'TMAX' as a column of values:
tmax dataframe
What I'm trying to do is check, for every month (of each year), the number of samples (each 'TMAX' value counts as one sample). If a month has fewer than 28 samples, I want to drop that particular month (of that particular year) and all of its samples.
I have the following code:
import pandas as pd

if __name__ == '__main__':
    df = pd.read_csv("2961941.csv")
    # parse the 'DATE' column as datetime and set it as the index
    # (avoids repetition; infer_datetime_format=True speeds up parsing)
    df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
    df.set_index('DATE', inplace=True)
    # create a new table holding just 'TMAX' ('DATE' is already the index)
    tmax = df.filter(['TMAX'], axis=1)
    # erase rows with missing data (dropna returns a copy, so reassign)
    tmax = tmax.dropna()
    # create snow table & delete rows with missing info
    snow = df.filter(['SNOW']).dropna()
    # for index, row in tmax.iterrows():
Thanks for the help.
I can suggest trying the following. Here I collect the per-month sample counts into a variable a, broadcast back to every row, and then filter out the months that have fewer than 28 samples. It worked for me.
# count non-missing values per calendar month and broadcast to every row
a = df.groupby(pd.Grouper(level='DATE', freq="M")).transform('count')
# keep only rows whose month has at least 28 TMAX samples
print(df[a['TMAX'] >= 28])
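For reference, here is a minimal end-to-end sketch of the same idea on made-up data (the dates and values below are placeholders, not the asker's actual dataset):
import pandas as pd
import numpy as np

# hypothetical daily TMAX readings with one deliberately sparse month
idx = pd.date_range("2020-01-01", "2020-03-31", freq="D", name="DATE")
tmax = pd.DataFrame({"TMAX": np.random.rand(len(idx)) * 30}, index=idx)
tmax = tmax.drop(tmax.loc["2020-02-05":"2020-02-28"].index)  # leave February short

# count non-missing samples per month and broadcast back to each row
counts = tmax.groupby(pd.Grouper(level="DATE", freq="M")).transform("count")
filtered = tmax[counts["TMAX"] >= 28]  # February (4 samples) disappears entirely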
Hi, I have an index ['release_date'] in a month, date, year format. I was trying to split this column by doing
test['date_added'].str.split(' ',expand=True) #code_1
but it's creating 4 columns: for some rows the values get shifted into the wrong column, which creates the 4th column.
code_1
This is the error I am facing.
I tried splitting ['release_date'], expecting it to be split into 3 columns, but for some reason a few rows are being shifted into another column.
If someone wants to inspect the dataframe, you can fetch it in Google Colab with:
!gdown 1x-_Kq9qYrybB9-DxJHoeVlPabmAm6xbQ
you can use:
df['day'] = pd.DatetimeIndex(df['date_added']).day
df['Month'] = pd.DatetimeIndex(df['date_added']).month
df['year'] = pd.DatetimeIndex(df['date_added']).year
Or, if the column already holds datetime objects:
day, month, year = zip(*[(d.day, d.month, d.year) for d in df['date_added']])
df = df.assign(day=day, month=month, year=year)
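Both snippets assume df['date_added'] contains parseable dates; if it still holds raw strings, converting first is safest (a sketch; errors='coerce' turns unparseable values into NaT, which also helps you find the shifted rows):
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
print(df[df['date_added'].isna()])  # rows that failed to parse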
I have this dataset, wind_modified. In it, the columns are locations, the index is the date, and the values are wind speeds.
Let's say I want to find the average wind speed in January for each location. How do I use groupby, or any other method, to find the average?
Would it be possible without resetting the index?
Edit - [This][2] is the actual dataset. I have combined the three columns "Yr, Mo, Dy" into one, "DATE", and made it the index.
I imported the dataset using pd.read_fwf, and "DATE" is of type datetime64[ns].
Sure. If you want all Januaries across all years, first filter them with boolean indexing and then take the mean:
#if necessary convert index to DatetimeIndex
#df.index = pd.to_datetime(df.index)
df1 = df[df.index.month == 1].mean().to_frame().T
Or, if you need each January for each year separately, filter first, then use groupby with DatetimeIndex.year and aggregate with mean:
df2 = df[df.index.month == 1]
df3 = df2.groupby(df2.index.year).mean()
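As a quick illustration on made-up data (the location columns RPT and VAL and the values are placeholders, not your real file):
import pandas as pd

idx = pd.to_datetime(["1961-01-01", "1961-01-15", "1961-02-01",
                      "1962-01-03", "1962-02-07"])
df = pd.DataFrame({"RPT": [15.0, 14.0, 10.0, 16.0, 9.0],
                   "VAL": [13.0, 12.0, 11.0, 14.0, 8.0]}, index=idx)

# January mean across all years, as a one-row frame
df1 = df[df.index.month == 1].mean().to_frame().T

# January mean per year
df2 = df[df.index.month == 1]
df3 = df2.groupby(df2.index.year).mean()
print(df1)
print(df3)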
Given the following dataset, which I extracted from an Excel file via pandas:
…
[131124 rows x 2 columns]
date datetime64[ns]
places_occupees int64
dtype: object
Is there a way to sort this data by the hour of the day, no matter the date?
What I would like to do is get all the data between 9 and 10 o'clock in the morning, for instance.
You can find a sample of the dataset below.
https://ufile.io/jlilr
After converting to datetime with pd.to_datetime(df['date']), you can create a separate column holding the hour, e.g. df['Hour'] = df.date.dt.hour, and then sort by it:
df.sort_values('Hour')
EDIT:
Since you want to sort by time, instead of using just the hour you can put the time-of-day part into a 'time' column. To get times between 9 and 10, filter on hour == 9 and then sort by the time column, as below:
df['date'] = pd.to_datetime(df['date'])
#put the timestamp part of the datetime into a separate column
df['time'] = df['date'].dt.time
#filter by times between 9 and 10 and sort by timestamp
df.loc[df.date.dt.hour==9].sort_values('time')
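As an aside, if the datetime column can serve as the index, pandas' between_time does the same window selection in one call (a sketch, assuming 'date' parses cleanly; note both endpoints are inclusive by default):
df = df.set_index('date').sort_index()
nine_to_ten = df.between_time('09:00', '10:00')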
I want to read an Excel file where the second line holds dates in string format and the first line holds the weekday corresponding to each date, and then convert that second line from string to datetime. If I read only the second line as the header, skipping the weekday line entirely, I can convert it like this:
Receipts_tbl.columns = pd.to_datetime(Receipts_tbl.columns)
How do I do that with a multiindexed dataframe, where the first level of the column index stays as weekdays and I want the second level converted to datetime?
Thanks
You didn't give an example of what your data source looks like, so I'm inferring.
If you use pd.read_excel with header=None, it will treat the first two rows as data, and you can manipulate them to achieve your goal. Here's a minimal example, with an example "real" data row beneath:
df = pd.DataFrame([['Mon', 'Tues'],
                   ['10-02-1995', '11-23-1997'],
                   [12, 32]])
# 0 1
#0 Mon Tues
#1 10-02-1995 11-23-1997
#2 12 32
Next, convert the date row (label 1) to datetime, as you said in your question.
df.loc[1] = pd.to_datetime(df.loc[1])
Create a multi-index from the first two rows, and set it as the dataframe's columns
df.columns = df.T.set_index([0,1]).index.set_names(['DOW', 'Date'])
Lastly, select from the third row (label 2) down, since the first two rows now live in the columns.
df = df.loc[2:].reset_index(drop=True)
df
#DOW Mon Tues
#Date 812592000000000000 880243200000000000
#0 12 32
Note that DOW and Date are now a multilevel index for the columns, and the 'data' rows have been reindexed to start at 0.
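If the raw-nanosecond display of the Date level bothers you, one option (untested against your exact data, so treat it as a sketch) is to rebuild that level as datetimes after setting the columns:
df.columns = df.columns.set_levels(
    pd.to_datetime(df.columns.levels[1]), level='Date')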
Please let me know if I misunderstood your question.
Assuming you have this data in the clipboard
Day Date Data
Mo 2018-08-06 blah
Mo 2018-08-06 blah
Mo 2018-08-06 blah
Tu 2018-08-07 blah
Try
import pandas as pd
df = pd.read_clipboard().set_index(['Day', 'Date'])
to get a multiindexed example
Then convert the Date level to datetime:
df2 = df.reset_index()
df2.Date = pd.to_datetime(df2.Date, yearfirst=True)
Afterwards you can set the multiindex again, if you want.
Note: check out the documentation on to_datetime if your datetime string is formatted differently; it assumes month first unless you set dayfirst or yearfirst to True.
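Putting the pieces together, restoring the two-level index afterwards is one more call (a sketch using the example column names above):
df2 = df.reset_index()
df2.Date = pd.to_datetime(df2.Date, yearfirst=True)
df3 = df2.set_index(['Day', 'Date'])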
I have a dataframe with columns customerId, amount, and date; the date range is 1/1/2016 to 9/9/2017. I am trying to find the top 10,000 customers, determined by the total amount of money they spent in 2016. My plan was to sort the amount column in descending order and then restrict the date column to 2016 using
mask = (df['date'] >= '1/1/2016') & (df['date'] < '1/1/2017')
but there has to be a smarter way to do this. I am new to coding, so any help would be appreciated, thanks!
Maybe you can try converting the column to datetime by:
df['date'] = pd.to_datetime(df['date'])
#then filter by year
mask = df['date'].apply(lambda x: x.year == 2016)
A-Za-z's answer is more concise, but in case the column isn't already datetime, you can convert it with pd.to_datetime.
You can use the .dt accessor, given that the date column is pandas datetime. Otherwise, convert it to datetime first:
df.date = pd.to_datetime(df.date)
df[df.date.dt.year == 2016]
That should give you the required rows. If you can post a sample dataset, it would be easier to test.
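To finish the original task (total 2016 spend per customer, then the top 10,000), here is a hedged sketch building on the filter above; the column names are taken from the question:
df['date'] = pd.to_datetime(df['date'])
spend_2016 = df[df['date'].dt.year == 2016]
top_customers = (spend_2016.groupby('customerId')['amount']
                 .sum()
                 .nlargest(10000))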