Hello to the people of the web,
I have a dataframe with 'DATE' (datetime) as the index and 'TMAX' as a column of values:
tmax dataframe
What I'm trying to do is check, for every month (of each year), the number of samples (each 'TMAX' value counts as one sample). If a month has fewer than 28 samples, I want to drop that particular month (of that particular year) and all of its samples.
I have the following code:
import pandas as pd

if __name__ == '__main__':
    df = pd.read_csv("2961941.csv")
    # parse the 'DATE' column as datetime and set it as the index
    # (avoids repetition; infer_datetime_format=True speeds up parsing)
    df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
    df.set_index('DATE', inplace=True)
    # create a new table holding just 'TMAX' ('DATE' is already the index)
    tmax = df.filter(['TMAX'], axis=1)
    # erase rows with missing data (dropna returns a copy, so reassign)
    tmax = tmax.dropna()
    # create snow table & delete rows with missing info
    snow = df.filter(['SNOW']).dropna()
    # for index, row in tmax.iterrows():
Thanks for the help.
I can suggest trying the following. Here I collect the per-month sample counts into a variable a, broadcast back to every row, and then filter out the months that have fewer than 28 samples. It worked for me.
# count non-missing values per calendar month and broadcast to every row
a = df.groupby(pd.Grouper(level='DATE', freq="M")).transform('count')
# keep only rows whose month has at least 28 TMAX samples
print(df[a['TMAX'] >= 28])
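For reference, here is a minimal end-to-end sketch of the same idea on made-up data (the dates and values below are placeholders, not the asker's actual dataset):
import pandas as pd
import numpy as np

# hypothetical daily TMAX readings with one deliberately sparse month
idx = pd.date_range("2020-01-01", "2020-03-31", freq="D", name="DATE")
tmax = pd.DataFrame({"TMAX": np.random.rand(len(idx)) * 30}, index=idx)
tmax = tmax.drop(tmax.loc["2020-02-05":"2020-02-28"].index)  # leave February short

# count non-missing samples per month and broadcast back to each row
counts = tmax.groupby(pd.Grouper(level="DATE", freq="M")).transform("count")
filtered = tmax[counts["TMAX"] >= 28]  # February (4 samples) disappears entirely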
Hi, I have an index ['release_date'] in a month, date, year format. I was trying to split this column by doing
test['date_added'].str.split(' ',expand=True) #code_1
but it's creating 4 columns: for some rows the values get shifted into the wrong column, which creates the 4th column.
code_1
This is the error I am facing.
I tried splitting ['release_date'], expecting it to be split into 3 columns, but for some reason a few rows are being shifted into another column.
If someone wants to inspect the dataframe, you can fetch it in Google Colab with:
!gdown 1x-_Kq9qYrybB9-DxJHoeVlPabmAm6xbQ
you can use:
df['day'] = pd.DatetimeIndex(df['date_added']).day
df['Month'] = pd.DatetimeIndex(df['date_added']).month
df['year'] = pd.DatetimeIndex(df['date_added']).year
Or, if the column already holds datetime objects:
day, month, year = zip(*[(d.day, d.month, d.year) for d in df['date_added']])
df = df.assign(day=day, month=month, year=year)
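Both snippets assume df['date_added'] contains parseable dates; if it still holds raw strings, converting first is safest (a sketch; errors='coerce' turns unparseable values into NaT, which also helps you find the shifted rows):
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
print(df[df['date_added'].isna()])  # rows that failed to parse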
I have this dataset, wind_modified. In it, the columns are locations, the index is the date, and the values are wind speeds.
Let's say I want to find the average wind speed in January for each location. How do I use groupby, or any other method, to find the average?
Would it be possible without resetting the index?
Edit - [This][2] is the actual dataset. I have combined the three columns "Yr, Mo, Dy" into one, "DATE", and made it the index.
I imported the dataset using pd.read_fwf, and "DATE" is of type datetime64[ns].
Sure. If you want all Januaries across all years, first filter them with boolean indexing and then take the mean:
#if necessary convert index to DatetimeIndex
#df.index = pd.to_datetime(df.index)
df1 = df[df.index.month == 1].mean().to_frame().T
Or, if you need each January for each year separately, filter first, then use groupby with DatetimeIndex.year and aggregate with mean:
df2 = df[df.index.month == 1]
df3 = df2.groupby(df2.index.year).mean()
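As a quick illustration on made-up data (the location columns RPT and VAL and the values are placeholders, not your real file):
import pandas as pd

idx = pd.to_datetime(["1961-01-01", "1961-01-15", "1961-02-01",
                      "1962-01-03", "1962-02-07"])
df = pd.DataFrame({"RPT": [15.0, 14.0, 10.0, 16.0, 9.0],
                   "VAL": [13.0, 12.0, 11.0, 14.0, 8.0]}, index=idx)

# January mean across all years, as a one-row frame
df1 = df[df.index.month == 1].mean().to_frame().T

# January mean per year
df2 = df[df.index.month == 1]
df3 = df2.groupby(df2.index.year).mean()
print(df1)
print(df3)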
Given the following dataset, which I extracted from an Excel file via pandas:
…
[131124 rows x 2 columns]
date datetime64[ns]
places_occupees int64
dtype: object
Is there a way to sort this data by the hour of the day, no matter the date?
What I would like to do is get all the data between 9 and 10 o'clock in the morning, for instance.
You can find a sample of the dataset below.
https://ufile.io/jlilr
After converting to datetime with pd.to_datetime(df['date']), you can create a separate column holding the hour, e.g. df['Hour'] = df.date.dt.hour, and then sort by it:
df.sort_values('Hour')
EDIT:
Since you want to sort by time, instead of using just the hour you can put the time-of-day part into a 'time' column. To get times between 9 and 10, filter on hour == 9 and then sort by the time column, as below:
df['date'] = pd.to_datetime(df['date'])
#put the timestamp part of the datetime into a separate column
df['time'] = df['date'].dt.time
#filter by times between 9 and 10 and sort by timestamp
df.loc[df.date.dt.hour==9].sort_values('time')
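As an aside, if the datetime column can serve as the index, pandas' between_time does the same window selection in one call (a sketch, assuming 'date' parses cleanly; note both endpoints are inclusive by default):
df = df.set_index('date').sort_index()
nine_to_ten = df.between_time('09:00', '10:00')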
I want to read an Excel file where the second line holds dates in string format and the first line holds the weekday corresponding to each date, and then convert that second line from string to datetime. If I read only the second line as the header, skipping the weekday line entirely, I can convert it like this:
Receipts_tbl.columns = pd.to_datetime(Receipts_tbl.columns)
How do I do that with a multiindexed dataframe, where the first level of the column index stays as weekdays and I want the second level converted to datetime?
Thanks
You didn't give an example of what your data source looks like, so I'm inferring.
If you use pd.read_excel with header=None, it will treat the first two rows as data, and you can manipulate them to achieve your goal. Here's a minimal example, with an example "real" data row beneath:
df = pd.DataFrame([['Mon', 'Tues'],
                   ['10-02-1995', '11-23-1997'],
                   [12, 32]])
# 0 1
#0 Mon Tues
#1 10-02-1995 11-23-1997
#2 12 32
Next, convert the date row (label 1) to datetime, as you said in your question.
df.loc[1] = pd.to_datetime(df.loc[1])
Create a multi-index from the first two rows, and set it as the dataframe's columns
df.columns = df.T.set_index([0,1]).index.set_names(['DOW', 'Date'])
Lastly, select from the third row (label 2) down, since the first two rows now live in the columns.
df = df.loc[2:].reset_index(drop=True)
df
#DOW Mon Tues
#Date 812592000000000000 880243200000000000
#0 12 32
Note that DOW and Date are now a multilevel index for the columns, and the 'data' rows have been reindexed to start at 0.
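If the raw-nanosecond display of the Date level bothers you, one option (untested against your exact data, so treat it as a sketch) is to rebuild that level as datetimes after setting the columns:
df.columns = df.columns.set_levels(
    pd.to_datetime(df.columns.levels[1]), level='Date')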
Please let me know if I misunderstood your question.
Assuming you have this data in the clipboard
Day Date Data
Mo 2018-08-06 blah
Mo 2018-08-06 blah
Mo 2018-08-06 blah
Tu 2018-08-07 blah
Try
import pandas as pd
df = pd.read_clipboard().set_index(['Day', 'Date'])
to get a multiindexed example
Then convert the Date level to datetime:
df2 = df.reset_index()
df2.Date = pd.to_datetime(df2.Date, yearfirst=True)
Afterwards you can set the multiindex again, if you want.
Note: check out the documentation on to_datetime if your datetime string is formatted differently; it assumes month first unless you set dayfirst or yearfirst to True.
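Putting the pieces together, restoring the two-level index afterwards is one more call (a sketch using the example column names above):
df2 = df.reset_index()
df2.Date = pd.to_datetime(df2.Date, yearfirst=True)
df3 = df2.set_index(['Day', 'Date'])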
I have a dataframe with columns customerId, amount, and date; the date range is 1/1/2016 to 9/9/2017. I am trying to find the top 10,000 customers, determined by the total amount of money they spent in 2016. My plan was to sort the amount column in descending order and then restrict the date column to 2016 using
mask = (df['date'] >= '1/1/2016') & (df['date'] < '1/1/2017')
but there has to be a smarter way to do this. I am new to coding, so any help would be appreciated, thanks!
Maybe you can try converting the column to datetime by:
df['date'] = pd.to_datetime(df['date'])
#then filter by year
mask = df['date'].apply(lambda x: x.year == 2016)
A-Za-z's answer is more concise, but in case the column isn't already datetime, you can convert it with pd.to_datetime.
You can use the .dt accessor, given that the date column is pandas datetime. Otherwise, convert it to datetime first:
df.date = pd.to_datetime(df.date)
df[df.date.dt.year == 2016]
That should give you the required rows. If you can post a sample dataset, it would be easier to test.
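To finish the original task (total 2016 spend per customer, then the top 10,000), here is a hedged sketch building on the filter above; the column names are taken from the question:
df['date'] = pd.to_datetime(df['date'])
spend_2016 = df[df['date'].dt.year == 2016]
top_customers = (spend_2016.groupby('customerId')['amount']
                 .sum()
                 .nlargest(10000))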