1. Question
I have a dataframe whose Year-Month column contains the year and month, which I want to extract.
For example, an element in this column is "2022-10", and I want to extract year=2022, month=10 from it.
My current solution uses apply with a lambda function:
df['xx_month'] = df['Year-Month'].apply(lambda x: int(x.split('-')[1]))
But it's super slow on a huge dataframe.
How can I do this more efficiently?
2. Solutions
Thanks for your wisdom. I have summarized each person's solution along with its code:
(1) split by '-' and join #Vitalizzare
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, the frames are returned stacked together horizontally.
import pandas as pd
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})
df = df.join(
    df['Year-Month']
    .str.split('-', expand=True)
    .set_axis(['year','month'], axis='columns')
)
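Note that str.split returns strings, so the year and month columns produced above hold text such as '10' rather than integers. If integers are needed, as in the original apply version, a cast can be added. A minimal sketch, reusing the sample frame from above (the variable name parts is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Year-Month": ["2022-10", "2022-11", "2022-12"]})

# Split once, name the two resulting columns, then cast both to int.
parts = (
    df["Year-Month"]
    .str.split("-", expand=True)
    .set_axis(["year", "month"], axis="columns")
    .astype(int)
)

# The indices match, so join places the new columns next to the original one.
df = df.join(parts)
```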
(2) convert the datatype from object (str) into datetime format #Neele22
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
(3) use regex or datetime to extract year and month #mozway
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
# If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Or use datetime:
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
3. Follow up question
But there is a problem if I want to subtract 'Year-Month' from other datetime columns after converting the incomplete 'Year-Month' column from string to datetime.
For example, suppose I want to get the records whose timestamp is no later than 2 months after the Year-Month of that record.
import dateutil.relativedelta # in my experience, dateutil is a better package than the datetime package
df[((df['timestamp'] - df['Year-Month']) >= dateutil.relativedelta.relativedelta(months=0)) & ((df['timestamp'] - df['Year-Month']) <= dateutil.relativedelta.relativedelta(months=2))]
This code raises a TypeError when subtracting the converted Year-Month column from the actual datetime column:
TypeError: Cannot subtract tz-naive and tz-aware datetime-like objects
The types for these two columns are:
Year-Month is datetime64[ns]
timestamp is datetime64[ns, UTC]
Then, I tried to specify utc=True when changing Year-Month to datetime type:
df[["Year-Month"]] = pd.to_datetime(df[["Year-Month"]],utc=True,format="%Y-%m")
But I got a ValueError:
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
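One likely cause of this ValueError is the double brackets: df[["Year-Month"]] is a DataFrame, and pd.to_datetime treats a DataFrame as a set of year/month/day columns to assemble. Passing the Series df["Year-Month"] instead avoids that code path. The sketch below uses made-up sample data, and the names ym and within_two_months are hypothetical; it is one way to get both columns timezone-aware, not necessarily the best fit for every dataset.

```python
import pandas as pd

# Hypothetical sample data: a tz-aware timestamp column and a year-month string column.
df = pd.DataFrame({
    "Year-Month": ["2022-10", "2022-11"],
    "timestamp": pd.to_datetime(["2022-11-15 08:00", "2023-03-01 12:00"], utc=True),
})

# Single brackets select a Series, so the strings are parsed directly.
# utc=True makes the result tz-aware, matching the timestamp column.
ym = pd.to_datetime(df["Year-Month"], format="%Y-%m", utc=True)

# Keep rows whose timestamp falls within two calendar months after Year-Month.
within_two_months = (df["timestamp"] >= ym) & (
    df["timestamp"] <= ym + pd.DateOffset(months=2)
)
result = df[within_two_months]
```

A missing day is filled in as the first of the month, so "2022-10" becomes 2022-10-01 00:00 UTC.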
4. Take away
If a column does not contain the complete [day, month, year] (in my case, only year and month), converting it from string to datetime for calculations with other datetime columns did not work for me. Instead, I used the extracted year and month for the calculations.
If you don't need to do calculations between the incomplete datetime column and other datetime columns as I did, you can convert the incomplete datetime string into datetime type and extract the year and month from it. This is easier than using regex, split and join.
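As a rough illustration of doing the calculation with the extracted year and month, the difference in calendar months can be computed with plain integer arithmetic. This is a sketch under assumptions: the sample data is made up, months_after is a hypothetical name, and the comparison ignores the day of the month entirely.

```python
import pandas as pd

# Hypothetical sample data with a tz-aware timestamp column.
df = pd.DataFrame({
    "Year-Month": ["2022-10", "2022-11"],
    "timestamp": pd.to_datetime(["2022-11-15", "2023-03-01"], utc=True),
})

# Extract year and month as integers without converting to datetime.
parts = df["Year-Month"].str.split("-", expand=True).astype(int)
year, month = parts[0], parts[1]

# Number of calendar months from Year-Month to timestamp.
months_after = (df["timestamp"].dt.year - year) * 12 + (
    df["timestamp"].dt.month - month
)

# Keep rows that are 0 to 2 calendar months after Year-Month.
result = df[months_after.between(0, 2)]
```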
import pandas as pd
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})
df = df.join(
    df['Year-Month']
    .str.split('-', expand=True)
    .set_axis(['year','month'], axis='columns')
)
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, the frames are returned stacked together horizontally.
You can use a regex for that.
Creating a new DataFrame:
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Example input:
  Year-Month
0    2022-10
Output:
   year  month
0  2022     10
Alternatively, you can use a datetime intermediate:
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
Output:
  Year-Month  year  month
0    2022-10  2022     10
You can also convert the datatype from object (str) into datetime format. This will make it easier to work with the dates.
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
Related
I have a dataframe with a column 'mon/yr' that stores the month and year in this format: Jun/19, Jan/22, etc.
I want to extract only these values from that column: ['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22']
and put them into a variable called 'dates' so that I can use it for plotting.
My code, which does not work:
dates = df["mon/yr"] == ['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22']
This is Python code.
This is how to filter rows:
df.loc[df['column_name'].isin(some_values)]
Using your dates list, if we wanted to extract just 'Jul/20' and 'Oct/20' we can do:
import pandas as pd
df = pd.DataFrame(['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22'], columns = ['dates'])
mydates = ['Jul/20','Oct/20']
df.loc[df['dates'].isin(mydates)]
which produces:
    dates
4  Jul/20
5  Oct/20
So, for your actual use case, assuming that df is a pandas dataframe, and mon/yr is the name of the column, you can do:
dates = df.loc[df['mon/yr'].isin(['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22'])]
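Since the goal is a 'dates' variable for plotting, it may help to pull out just the column values, and optionally parse them so that a plot orders them chronologically rather than alphabetically. A small sketch with made-up data; the value column and the names wanted, mask and parsed are hypothetical, and the "%b/%y" format assumes English month abbreviations in the current locale.

```python
import pandas as pd

# Hypothetical data: the 'value' column stands in for whatever is being plotted.
df = pd.DataFrame({
    "mon/yr": ["Jun/19", "Jul/19", "Aug/19", "Oct/19"],
    "value": [1, 2, 3, 4],
})
wanted = ["Jul/19", "Oct/19"]

# Boolean mask marking the rows whose 'mon/yr' is in the wanted list.
mask = df["mon/yr"].isin(wanted)

# Just the matching labels, as a plain Python list.
dates = df.loc[mask, "mon/yr"].tolist()

# Optional: parse the labels into real datetimes (first day of each month).
parsed = pd.to_datetime(df.loc[mask, "mon/yr"], format="%b/%y")
```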
I have a dataset in CSV whose first column contains dates (not datetimes, just dates).
The CSV is like this:
date,text
2005-01-01,"FOO-BAR-1"
2005-01-02,"FOO-BAR-2"
If I do this:
df = pd.read_csv('mycsv.csv')
I get:
print(df.dtypes)
date object
text object
dtype: object
How can I get the date column as datetime.date?
Use:
df = pd.read_csv('mycsv.csv', parse_dates=[0])
This way the initial column will be of native pandasonic datetime type,
which is used in Pandas much more often than pythonic datetime.date.
It is a more natural approach than conversion of the column in question
after you read the DataFrame.
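To make this concrete, here is a small sketch that reads the sample CSV from the question out of an in-memory string (io.StringIO stands in for 'mycsv.csv'). If actual datetime.date objects are required, the .dt.date accessor can convert afterwards, at the cost of falling back to an object column.

```python
import datetime
import io

import pandas as pd

# In-memory stand-in for the CSV file from the question.
csv_text = 'date,text\n2005-01-01,"FOO-BAR-1"\n2005-01-02,"FOO-BAR-2"\n'

# parse_dates=[0] asks pandas to parse the first column as datetimes.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=[0])

# Optional: plain datetime.date objects (this column has object dtype).
as_dates = df["date"].dt.date
```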
You can use the pd.to_datetime function available in pandas.
For example, in a dataset about the scores of a cricket match, I can convert the MatchDate column to datetime objects by applying pd.to_datetime with the date format used in the data. (Refer to https://www.w3schools.com/python/python_datetime.asp to choose the format codes that match your data's date format.)
cricket["MatchDate"]=pd.to_datetime(cricket["MatchDate"], format= "%m-%d-%Y")
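When a few values do not match the given format, pd.to_datetime raises by default. Passing errors="coerce" turns those values into NaT instead, which can make them easier to find. A minimal sketch with made-up MatchDate values (the second one is deliberately invalid):

```python
import pandas as pd

# Hypothetical data: the second value is not a valid month-day-year date.
cricket = pd.DataFrame({"MatchDate": ["03-25-2021", "13-40-2021"]})

# Values that do not fit the format become NaT instead of raising an error.
cricket["MatchDate"] = pd.to_datetime(
    cricket["MatchDate"], format="%m-%d-%Y", errors="coerce"
)

# Rows that failed to parse can then be inspected separately.
bad_rows = cricket[cricket["MatchDate"].isna()]
```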
I am trying to create a datetime index in Python. I have an existing dataframe with a date column (CrimeDate); here is a snapshot of it:
The date is not in datetime format, though.
I intend to have an output similar to the format below, but with my existing dataframe's date column.
The CrimeDate column has approx. 334192 rows, with dates running from 2021-04-24 back to 1963-10-30 (all in sequence of months and years).
First you'll need to convert the date column to datetime:
df['CrimeDate'] = pd.to_datetime(df['CrimeDate'])
And after that set that column as the index:
df.set_index(['CrimeDate'], inplace=True)
Once set, you can access the datetime index directly:
df.index
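The steps above can be sketched end to end as follows. The sample rows and the Total column are made up for illustration. Sorting the index is optional, but since the data runs from newest to oldest it keeps date-range slicing well defined; selecting by a partial date string such as a year is one of the conveniences a DatetimeIndex provides.

```python
import pandas as pd

# Hypothetical sample in the same newest-to-oldest order as the question.
df = pd.DataFrame({
    "CrimeDate": ["2021-04-24", "2021-04-23", "1963-10-30"],
    "Total": [1, 2, 3],
})

# Convert the strings to datetimes, then use the column as the index.
df["CrimeDate"] = pd.to_datetime(df["CrimeDate"])
df = df.set_index("CrimeDate").sort_index()

# With a DatetimeIndex, a year string selects every row in that year.
in_2021 = df.loc["2021"]
```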
I have a 'myfile.csv' file which has a 'timestamp' column which starts at
(01/05/2015 11:51:00)
and finishes at
(07/05/2015 23:22:00)
A total span of 9,727 minutes
'myfile.csv' also has a column named 'A' which holds some numerical value. There are multiple values for 'A' within each minute, each with a unique timestamp to the nearest second.
I have code as follows
df = pd.read_csv('myfile.csv')
df = df.set_index('timestamp')
df.index = pd.to_datetime(df.index)  # Index.to_datetime() is not available in recent pandas versions
df.sort_index(inplace=True)
df = df['A'].resample('1Min').mean()
df.index = (df.index.map(lambda t: t.strftime('%Y-%m-%d %H:%M')))
My problem is that python seems to think 'timestamp' starts at
(01/05/2015 11:51:00)
-> 5th January
and finishes at
(07/05/2015 23:22:00)
-> 5th July
But really 'timestamp' starts at the
1st May
and finishes at the
7th of May
So the above code produces a dataframe with 261,332 rows, OMG, when it should really only have 9,727 rows.
Somehow Python is mixing up the month with the day and misinterpreting the dates. How do I sort this out?
There are many arguments to read_csv that can help you parse dates from a CSV straight into your pandas DataFrame. Here we can set parse_dates to the columns you want as dates and then use dayfirst, which defaults to False. The following should do what you want, assuming the dates are in the first column.
df = pd.read_csv('myfile.csv', parse_dates=[0], dayfirst=True)
If the dates column is not the first column, just change the 0 to the column number.
The format of the dates in your question doesn't seem to match your strftime format. Take a look at this to fix your string parameter.
It looks to me that it should be something in the lines of:
'%d/%m/%Y %H:%M:%S'
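The difference between the default parsing, dayfirst=True and an explicit format can be checked on the first timestamp from the question. This is a small sketch using pd.to_datetime on a single string; the same dayfirst and format options apply when converting a whole column.

```python
import pandas as pd

raw = "01/05/2015 11:51:00"

# Default parsing reads the ambiguous date month-first: 5th January.
default_parse = pd.to_datetime(raw)

# dayfirst=True reads it day-first: 1st May.
dayfirst_parse = pd.to_datetime(raw, dayfirst=True)

# An explicit format removes the ambiguity altogether.
explicit_parse = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")
```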
How can I drop rows from DataFrame df if the dates associated with df['maturity_dt'] are less than today's date?
I am currently doing the following:
import datetime
todays_date = datetime.date.today()
datenow = datetime.datetime.combine(todays_date, datetime.datetime.min.time()) #Converting to datetime
for (i,row) in df.iterrows():
    if datetime.datetime.strptime(row['maturity_dt'], '%Y-%m-%d %H:%M:%S.%f') < datenow:
        df.drop(df.index[i])
However, it's taking too long, and I was hoping to do something like: df = df[datetime.datetime.strptime(df['maturity_dt'], '%Y-%m-%d %H:%M:%S.%f') < datenow], but this results in the error TypeError: must be str, not Series
Thank You
Haven't tried it but maybe the pandas native functions will iterate faster. Something like:
df['dt'] = pandas.DatetimeIndex(df['maturity_dt'])
newdf = df.loc[df['dt'] >= pandas.Timestamp(todays_date)].copy()  # keep rows maturing today or later
Instead of parsing the date in each row, you could format your comparison date in the same format as these dates are stored and then you could just do a string comparison.
Also, if there is a way to drop multiple rows in a single call, you could use your loop just to gather the indices of those rows to be dropped, then use that call to drop them in bunches.
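For comparison, here is a sketch of a fully vectorized version that parses the whole column once and filters with a boolean mask, which avoids the row-by-row loop. The sample values are made up, and the format string is taken from the question; this is one possible approach, not a benchmarked one.

```python
import pandas as pd

# Hypothetical data: one maturity date in the past, one far in the future.
df = pd.DataFrame({
    "maturity_dt": ["2000-01-01 00:00:00.000000", "2100-01-01 00:00:00.000000"],
})

# Parse the whole column in one call instead of once per row.
maturity = pd.to_datetime(df["maturity_dt"], format="%Y-%m-%d %H:%M:%S.%f")

# Midnight today as a pandas Timestamp, so the comparison is datetime to datetime.
today = pd.Timestamp.today().normalize()

# Keep only rows that mature today or later; earlier rows are dropped.
df = df[maturity >= today]
```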