Find months between dates pandas [duplicate] - python

This question already has an answer here:
Create date range list with pandas
(1 answer)
Closed 2 years ago.
I have a large DataFrame with two columns, start_date and finish_date, holding dates in string format, e.g. "2018-06-01".
I want to create a third column with the list of months between the two dates.
So, if I have start_date "2018-06-01" and finish_date "2018-08-01", I expect ["2018-06-01", "2018-07-01", "2018-08-01"] in the third column. The day doesn't matter to me, so it can be dropped.
I have found many ways to do this for plain strings, but none for a pandas DataFrame.

Pandas has a function called apply which allows you to apply logic to every row of a dataframe.
We can use dateutil to get all months between the start and end date, then apply the logic to every row of your dataframe as a new column.
import pandas as pd
import datetime
from dateutil.rrule import rrule, MONTHLY
# DataFrame creation, just for the example; use the one you already have.
# If your columns hold date strings, convert them first with pd.to_datetime.
data = {'start': datetime.datetime.strptime("10-10-2020", "%d-%m-%Y"), 'end': datetime.datetime.strptime("10-12-2020", "%d-%m-%Y")}
df = pd.DataFrame(data, index=[0])
#df
# start end
#0 2020-10-10 2020-12-10
# Find all months between the start and end date, apply to every row in the dataframe. Result is a list.
df['months'] = df.apply(lambda x: [date.strftime("%m/%Y") for date in rrule(MONTHLY, dtstart=x.start, until=x.end)], axis = 1)
#df
# start end months
#0 2020-10-10 2020-12-10 [10/2020, 11/2020, 12/2020]
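As a possible alternative that stays within pandas, here is a minimal sketch that starts from string columns as in the question. It assumes the column names start_date and finish_date from the question and uses pd.period_range with monthly frequency; the output format "%Y-%m" is a choice made for this example, since the question says the day does not matter.

```python
import pandas as pd

# Example data in the string format described in the question.
df = pd.DataFrame({
    "start_date": ["2018-06-01", "2020-10-10"],
    "finish_date": ["2018-08-01", "2020-12-10"],
})

# Parse both string columns once, instead of parsing inside a per-row lambda.
start = pd.to_datetime(df["start_date"])
finish = pd.to_datetime(df["finish_date"])

# One monthly period range per row, formatted as year-month strings.
df["months"] = [
    pd.period_range(start=s, end=f, freq="M").strftime("%Y-%m").tolist()
    for s, f in zip(start, finish)
]
```

Because the ranges are built from monthly periods, the day of the month in either column does not affect the result.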

Related

How to extract year and month from string in a dataframe

1. Question
I have a dataframe, and the Year-Month column contains the year and month which I want to extract.
For example, an element in this column is "2022-10". And I want to extract year=2022, month=10 from it.
My current solution is to use apply and lambda function:
df['xx_month'] = df['Year-Month'].apply(lambda x: int(x.split('-')[1]))
But it's very slow on a huge dataframe.
How can I do this more efficiently?
2. Solutions
Thanks for your help. I have summarized each solution with its code:
(1) split by '-' and join #Vitalizzare
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, then the frames stacked together horizontally are returned.
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})
df = df.join(
df['Year-Month']
.str.split('-', expand=True)
.set_axis(['year','month'], axis='columns')
)
(2) convert the datatype from object (str) into datetime format #Neele22
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
(3) use regex or datetime to extract year and month #mozway
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
# If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Or use datetime:
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
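To check that the three approaches above agree, here is a small sketch on the same three-row example. The variable names (by_split, by_regex, parsed) are made up for this illustration; all three use vectorized pandas operations rather than a Python-level lambda.

```python
import pandas as pd

df = pd.DataFrame({"Year-Month": ["2022-10", "2022-11", "2022-12"]})

# (1) split on '-' into two columns (labelled 0 and 1), then cast to int
by_split = df["Year-Month"].str.split("-", expand=True).astype(int)

# (3a) regex with two unnamed capture groups (also labelled 0 and 1)
by_regex = df["Year-Month"].str.extract(r"(\d+)-(\d+)").astype(int)

# (2)/(3b) parse to datetime and read the components via the .dt accessor
parsed = pd.to_datetime(df["Year-Month"], format="%Y-%m")

months_split = by_split[1].tolist()
months_regex = by_regex[1].tolist()
months_dt = parsed.dt.month.tolist()
```

Which one is fastest depends on the data and the pandas version, so it is worth timing them on your own dataframe.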
3. Follow up question
However, a problem arises if I convert the incomplete 'Year-Month' column from string to datetime and then subtract it from another datetime column.
For example, suppose I want to keep the records whose timestamp is at most 2 months after their Year-Month.
import dateutil # dateutil is a better package than datetime package according to my experience
df[(df['timestamp'] - df['Year-Month'])>= dateutil.relativedelta.relativedelta(months=0) and (df['timestamp'] - df['Year-Month'])<= datetime.timedelta(months=2)]
This code raises a TypeError when subtracting the converted Year-Month column from the actual datetime column:
TypeError: Cannot subtract tz-naive and tz-aware datetime-like objects
The types for these two columns are:
Year-Month is datetime64[ns]
timestamp is datetime64[ns, UTC]
Then, I tried to specify utc=True when changing Year-Month to datetime type:
df[["Year-Month"]] = pd.to_datetime(df[["Year-Month"]],utc=True,format="%Y-%m")
But I got a ValueError:
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
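For what it's worth, that ValueError appears to come from the double brackets: df[["Year-Month"]] is a one-column DataFrame, and pd.to_datetime tries to assemble dates from DataFrame columns named year, month and day. Below is a minimal sketch that passes the Series instead. The sample data and the names ym, mask and result are made up for the example, and pd.DateOffset(months=2) is used here in place of the timedelta in the question, since datetime.timedelta has no months argument.

```python
import pandas as pd

# Made-up sample: a year-month string and a tz-aware (UTC) timestamp per record.
df = pd.DataFrame({
    "Year-Month": ["2022-10", "2022-11"],
    "timestamp": pd.to_datetime(["2022-11-15", "2023-03-01"], utc=True),
})

# Single brackets: a Series of strings, parsed as tz-aware UTC datetimes.
# The missing day defaults to the first of the month.
ym = pd.to_datetime(df["Year-Month"], format="%Y-%m", utc=True)

# Keep rows whose timestamp lies between the Year-Month start and two
# calendar months after it. Use & (not `and`) to combine boolean Series.
mask = (df["timestamp"] >= ym) & (df["timestamp"] <= ym + pd.DateOffset(months=2))
result = df[mask]
```

With both sides tz-aware, the comparison no longer mixes tz-naive and tz-aware values.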
4. Take away
The ValueError above is not caused by the missing day. Passing a DataFrame (df[["Year-Month"]], with double brackets) makes pd.to_datetime try to assemble dates from columns named year, month and day. Passing the Series df["Year-Month"] with format="%Y-%m" and utc=True parses the strings, with the day defaulting to 1, and the result can then be compared with other tz-aware datetime columns. Alternatively, the extracted year and month can be used for the calculations directly.
If you don't need to do calculations between the incomplete datetime column and other datetime columns, you can convert the incomplete datetime string to datetime type and extract the year and month from it. That is easier than using regex, split and join.
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})
df = df.join(
df['Year-Month']
.str.split('-', expand=True)
.set_axis(['year','month'], axis='columns')
)
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, then the frames stacked together horizontally are returned.
You can use a regex for that.
Creating a new DataFrame:
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Example input:
Year-Month
0 2022-10
output:
year month
0 2022 10
Alternative using datetime:
You can also use a datetime intermediate:
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
output:
Year-Month year month
0 2022-10 2022 10
You can also convert the datatype from object (str) into datetime format. This will make it easier to work with the dates.
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")

How can i take specific Months out from a Column in python

I have a dataframe with a column 'mon/yr' that stores the month and year in this format: Jun/19, Jan/22, etc.
I want to extract only these values from that column: ['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22']
and put them into a variable called 'dates' so that I can use it for plotting.
My code, which does not work:
dates = df["mon/yr"] == ['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22']
This is how to filter rows in pandas by a list of values:
df.loc[df['column_name'].isin(some_values)]
Using your dates list, if we wanted to extract just 'Jul/20' and 'Oct/20' we can do:
import pandas as pd
df = pd.DataFrame(['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22'], columns = ['dates'])
mydates = ['Jul/20','Oct/20']
df.loc[df['dates'].isin(mydates)]
which produces:
dates
4 Jul/20
5 Oct/20
So, for your actual use case, assuming that df is a pandas dataframe, and mon/yr is the name of the column, you can do:
dates = df.loc[df['mon/yr'].isin(['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22'])]
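Note that the line above gives back a DataFrame with all columns. If you only need the matching labels (and the corresponding y-values) for plotting, a sketch along these lines may be closer to what you want. The 'value' column and the shortened list of wanted months are made up for the example.

```python
import pandas as pd

# Made-up data: 'value' stands in for whatever you plan to plot.
df = pd.DataFrame({
    "mon/yr": ["Jun/19", "Jul/19", "Aug/19", "Oct/19", "Jan/20"],
    "value": [1, 2, 3, 4, 5],
})
wanted = ["Jul/19", "Oct/19", "Jan/20"]

# Boolean mask: True where the row's 'mon/yr' is one of the wanted labels.
mask = df["mon/yr"].isin(wanted)

# Select a single column with the mask and turn it into a plain list.
dates = df.loc[mask, "mon/yr"].tolist()
values = df.loc[mask, "value"].tolist()
```

The rows come back in the dataframe's own order, not in the order of the wanted list.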

How to convert pandas dataframe column to string and delete some text of column in pandas dataframe [duplicate]

This question already has answers here:
Extracting the hour from a time column in pandas
(3 answers)
Convert string to timedelta in pandas
(4 answers)
Closed 2 years ago.
I want to convert each value in a pandas dataframe column to a string and then delete some text. The values are times. For example, if the value is 11:21, I would like to delete everything to the right of the first colon in every element of the column, so 11:21 should be converted to 11.
Let's say you have the following dataset:
df = pd.DataFrame({
'time': ['09:30:00','09:40:01','09:50:02','10:00:03']
})
df.head()
If you want to work with the time column as a string, the following code may be used:
df['hour'] = df['time'].apply(lambda time : time.split(':')[0])
df.head()
Alternatively time can be converted to datetime and hour can be extracted:
df['hour'] = pd.to_datetime(df['time'], format='%H:%M:%S').dt.hour
df.head()
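The string version can also be written with the vectorized .str accessor instead of apply. A minimal sketch on the same data; the column names hour_str and hour_int are made up here to show the difference between the two results.

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["09:30:00", "09:40:01", "09:50:02", "10:00:03"]
})

# Split each string on ':' and keep the first piece; the result is still a string.
df["hour_str"] = df["time"].str.split(":").str[0]

# Cast to int if a numeric hour is needed (this drops the leading zero).
df["hour_int"] = df["hour_str"].astype(int)
```

The string result keeps the leading zero ('09'), while the integer result matches what .dt.hour returns (9).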

Pandas parsing dates when reading CSV file [duplicate]

This question already has answers here:
Can pandas automatically read dates from a CSV file?
(13 answers)
Closed 3 years ago.
I have a CSV file that contains a date column with dates in the format 'dd.mm.yy'. When pandas parses the dates, it treats the day as the month whenever the day is less than or equal to 12, so 05.01.05 becomes 01/05/2005.
How can I solve this issue?
Regards
This is one way to solve it, using pandas.to_datetime with the argument dayfirst=True. However, I've had to make assumptions about the format of your data since you are not sharing any code. In the case below, the original dtype of the date column is object.
import pandas as pd
df = pd.DataFrame({
'date': ['01.02.20', '25.12.19', '10.03.18'],
})
df = pd.to_datetime(df['date'], dayfirst=True)
df
0 2020-02-01
1 2019-12-25
2 2018-03-10
Name: date, dtype: datetime64[ns]
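Since the question is about a CSV file, here is a sketch of the same idea applied after reading the file, this time with an explicit format string rather than dayfirst=True. The CSV content and the 'value' column are made up, and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Made-up CSV content in the 'dd.mm.yy' format from the question.
csv_text = "date,value\n05.01.05,1\n25.12.19,2\n"

# Read the date column as plain strings first.
df = pd.read_csv(io.StringIO(csv_text))

# An explicit format leaves no room for day/month guessing.
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%y")
```

With the format spelled out, 05.01.05 is read as 5 January 2005 regardless of whether the day is 12 or less.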

using index in calculations, pandas [duplicate]

This question already has answers here:
How to directly use Pandas Date-time index in calculations?
(1 answer)
selecting from multi-index pandas
(7 answers)
Closed 4 years ago.
I have a df that contains a date index and another column which is a different date. I would like to add a column to my df that is the difference between these two dates in days. How can one use the index in the computation directly without having to bring it into the df as a column?
MWE:
df = pd.DataFrame(data = {"val": [1,2,3,4,5], "some_date": np.arange("2000-02-01", "2000-02-06", dtype="datetime64[D]")}, index = pd.date_range(start = "2000-01-01", end = "2000-01-05", periods = 5, name="date"))
#would like to do something like this
df["delta"] = df["some_date"] - df["date"] #produces an error
What's the best way to access the index in calculations of this type?
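One possible approach, sketched under the assumption that the index is a DatetimeIndex as in the MWE: the index is available as df.index, so it can be used in the subtraction directly. The example below rebuilds similar data with pd.date_range for both the column and the index; the delta_days column is added only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "val": [1, 2, 3, 4, 5],
        "some_date": pd.date_range("2000-02-01", periods=5, freq="D"),
    },
    index=pd.date_range("2000-01-01", periods=5, freq="D", name="date"),
)

# df["date"] fails because "date" is the index name, not a column.
# The index itself can be used in the arithmetic instead.
df["delta"] = df["some_date"] - df.index

# Whole days as integers, via the timedelta accessor.
df["delta_days"] = df["delta"].dt.days
```

The subtraction is elementwise by position, so this relies on the column and the index having the same length, which is always the case within one dataframe.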
