I have a dataframe that has a column 'mon/yr' that has month and year stored in this format Jun/19 , Jan/22,etc.
I want to Extract only these from that column - ['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22']
and put them into a variable called 'dates' so that I can use it for plotting
My code which does not work -
dates = df["mon/yr"] == ['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22']
This is a python code
this is how to filter rows
df.loc[df['column_name'].isin(some_values)]
Using your dates list, if we wanted to extract just 'Jul/20' and 'Oct/20' we can do:
import pandas as pd
df = pd.DataFrame(['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22'], columns = ['dates'])
mydates = ['Jul/20','Oct/20']
df.loc[df['dates'].isin(mydates)]
which produces:
dates
4 Jul/20
5 Oct/20
So, for your actual use case, assuming that df is a pandas dataframe, and mon/yr is the name of the column, you can do:
dates = df.loc[df['mon/yr'].isin(['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22'])]
Related
1. Question
I have a dataframe, and the Year-Month column contains the year and month which I want to extract.
For example, an element in this column is "2022-10". And I want to extract year=2022, month=10 from it.
My current solution is to use apply and lambda function:
df['xx_month'] = df['Year-Month'].apply(lambda x: int(x.split('-')[1]))
But it's super slow on a huge dataframe.
How to do it more efficiently?
2. Solutions
Thanks for your wisdom, I summarized each one's solution with the code:
(1) split by '-' and join #Vitalizzare
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, then the frames stacked together horizontally are returned.
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})
df = df.join(
df['Year-Month']
.str.split('-', expand=True)
.set_axis(['year','month'], axis='columns')
)
(2) convert the datatype from object (str) into datetime format #Neele22
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
(3) use regex or datetime to extract year and month #mozway
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
# If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Or use datetime:
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
3. Follow up question
But there will be a problem if I want to subtract 'Year-Month' with other datetime columns after converting the incomplete 'Year-Month' column from string to datetime.
For example, if I want to get the data which is no later than 2 months after the timestamp of each record.
import dateutil # dateutil is a better package than datetime package according to my experience
df[(df['timestamp'] - df['Year-Month'])>= dateutil.relativedelta.relativedelta(months=0) and (df['timestamp'] - df['Year-Month'])<= datetime.timedelta(months=2)]
This code will have type error for subtracting the converted Year-Month column with actual datetime column.
TypeError: Cannot subtract tz-naive and tz-aware datetime-like objects
The types for these two columns are:
Year-Month is datetime64[ns]
timestamp is datetime64[ns, UTC]
Then, I tried to specify utc=True when changing Year-Month to datetime type:
df[["Year-Month"]] = pd.to_datetime(df[["Year-Month"]],utc=True,format="%Y-%m")
But I got Value Error.
ValueError: to assemble mappings requires at least that [year, month,
day] be specified: [day,month,year] is missing
4. Take away
If the [day,month,year] is not complete for the elements in a column. (like in my case, I only have year and month), we can't change this column from string type into datetime type to do calculations. But to use the extracted day and month to do the calculations.
If you don't need to do calculations between the incomplete datetime column and other datetime columns like me, you can change the incomplete datetime string into datetime type, and extract [day,month,year] from it. It's easier than using regex, split and join.
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})
df = df.join(
df['Year-Month']
.str.split('-', expand=True)
.set_axis(['year','month'], axis='columns')
)
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, then the frames stacked together horizontally are returned.
You can use a regex for that.
Creating a new DataFrame:
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Example input:
Year-Month
0 2022-10
output:
year month
0 2022 10
alternative using datetime:
You can also use a datetime intermediate
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
output:
Year-Month year month
0 2022-10 2022 10
You can also convert the datatype from object (str) into datetime format. This will make it easier to work with the dates.
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
I have weather data over a variety of years. In this I am trying to find the long term averages for the temperature of each month, which I achieved using the following.
mh3 = mh3.groupby([mh3.index.month, mh3.index.day])
mh3 = mh3[['dry_bulb_tmp_mean', 'global_horiz_radiation']].mean()
However, in doing this, I get two index's for the dataframe (both month and day which is fine). The issue is that both of these index columns are assigned the name date. Is there a way to manually add a name? This causes problems later in my code when I need to do some data analysis by month. Thank you
The name of the Series you group with becomes the name of the Index levels so rename them in the grouper.
mh3 = mh3.groupby([mh3.index.month.rename('month'), mh3.index.day.rename('day')])
Or if you don't want to type as much you can create the grouping with a list comprehension, getattr and renaming to the attribute.
import pandas as pd
df = pd.DataFrame(index=pd.date_range('2010-01-01', freq='4H', periods=10),
data={'col1': range(10)})
grpr = [getattr(df.index, attr).rename(attr) for attr in ['month', 'day']]
df.groupby(grpr).sum()
# col1
#month day
#1 1 15
# 2 30
here is my input df:
df:
date , name
1990-12-21, adam1
1990-12-22, adam2
1990-12-23, adam3
1990-12-24, adam4
1990-12-25, adam5
I want to select all dates above given date from list (always on fist place)
list = ['1990-12-23','name','22']
df = pd.to_datetime(df['date'))
df = df[df.date > list[0]]
And its working.
My question is, why its working without converting this first element of a list to datetime format?
Pandas has flexible Partial String Indexing. This allows dates and times that can be automatically parsed into a datetime or timestamp to be used as strings without first converting them.
Very simple query but did not find the answer on google.
df with timestamp in date column
Date
22/11/2019 22:30:10 etc. say which is of the form object on doing df.dtype()
Code:
df['Date']=pd.to_datetime(df['Date']).dt.date
Now I want the date to be converted to datetime using column number rather than column name. Column number in this case will be 0(I have very big column names and similar multipe files, so I want to change date column to datetime using its position '0' in this case).
Can anyone help?
Use DataFrame.iloc for column (Series) by position:
df.iloc[:, 0] = pd.to_datetime(df.iloc[:, 0]).dt.date
Or is also possible extract column name by indexing:
df[df.columns[0]] = pd.to_datetime(df[df.columns[0]]).dt.date
I have a dataframe created from a .csv document. Since one of the columns has dates, I have used pandas read_csv with parse_dates:
df = pd.read_csv('CSVdata.csv', encoding = "ISO-8859-1", parse_dates=['Dates_column'])
The dates range from 2012 to 2016. I want to crate a sub-dataframe, containing only the rows from 2014.
The only way I have managed to do this, is with two subsequent Boolean filters:
df_a = df[df.Dates_column>pd.Timestamp('2014')] # To create a dataframe from 01/Jan/2014 onwards.
df = df_a[df_a.Dates_column<pd.Timestamp('2015')] # To remove all the values after 01/jan/2015
Is there a way of doing this in one step, more efficiently?
Many thanks!
You can use the dt accessor:
df = df[df.Dates_column.dt.year == 2014]