1. Question
I have a dataframe, and the Year-Month column contains the year and month which I want to extract.
For example, an element in this column is "2022-10". And I want to extract year=2022, month=10 from it.
My current solution is to use apply and lambda function:
df['xx_month'] = df['Year-Month'].apply(lambda x: int(x.split('-')[1]))
But this is very slow on a huge dataframe.
How can I do it more efficiently?
2. Solutions
Thanks for your help. I have summarized each answer together with its code:
(1) split by '-' and join #Vitalizzare
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, returns the frames stacked together horizontally.
import pandas as pd

df = pd.DataFrame({'Year-Month': ['2022-10', '2022-11', '2022-12']})
df = df.join(
    df['Year-Month']
    .str.split('-', expand=True)
    .set_axis(['year', 'month'], axis='columns')
)
(2) convert the datatype from object (str) into datetime format #Neele22
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
(3) use regex or datetime to extract year and month #mozway
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
# If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Or use datetime:
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
3. Follow-up question
But there is a problem if I want to subtract 'Year-Month' from another datetime column after converting the incomplete 'Year-Month' column from string to datetime.
For example, suppose I want the records whose timestamp is no later than 2 months after their 'Year-Month'.
import pandas as pd
delta = df['timestamp'] - df['Year-Month']
# Timedelta has no "months" unit, so 2 months is approximated as 61 days here
df[(delta >= pd.Timedelta(0)) & (delta <= pd.Timedelta(days=61))]
This code raises a TypeError when subtracting the converted Year-Month column from the actual datetime column:
TypeError: Cannot subtract tz-naive and tz-aware datetime-like objects
The types for these two columns are:
Year-Month is datetime64[ns]
timestamp is datetime64[ns, UTC]
Then, I tried to specify utc=True when changing Year-Month to datetime type:
df[["Year-Month"]] = pd.to_datetime(df[["Year-Month"]],utc=True,format="%Y-%m")
But I got a ValueError:
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
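The ValueError seems to come from the double brackets rather than from the missing day: df[["Year-Month"]] is a DataFrame, and pd.to_datetime then tries to assemble dates from columns named year, month and day. Below is a minimal sketch, using made-up sample data, that passes the Series instead. The filter with pd.DateOffset is my own choice for expressing "2 months", not something taken from the answers.

```python
import pandas as pd

# Made-up sample data; "timestamp" is tz-aware (UTC) as in the question
df = pd.DataFrame({
    "Year-Month": ["2022-10", "2022-11"],
    "timestamp": pd.to_datetime(["2022-11-15 08:00", "2023-03-01 12:00"], utc=True),
})

# Single brackets select a Series; the missing day defaults to the 1st
df["Year-Month"] = pd.to_datetime(df["Year-Month"], format="%Y-%m", utc=True)

# Both columns are now tz-aware, so they can be compared directly
start = df["Year-Month"]
end = start + pd.DateOffset(months=2)
within_two_months = df[(df["timestamp"] >= start) & (df["timestamp"] <= end)]
print(within_two_months)
```

With both columns in UTC, the subtraction df["timestamp"] - df["Year-Month"] also works and returns a timedelta column.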
4. Takeaway
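If the elements of a column hold only part of a date (in my case only year and month), pd.to_datetime can still parse them when it is given a Series and a matching format, and the missing day defaults to 1. The ValueError above comes from passing a DataFrame (df[["Year-Month"]], with double brackets), which makes pandas try to assemble dates from year, month and day columns. If you would rather not convert the column, use the extracted year and month for the calculations instead.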
If you don't need to do calculations between the incomplete datetime column and other datetime columns, you can convert the incomplete datetime string into datetime type and extract the year and month from it. That is easier than using regex, split and join.
df = pd.DataFrame({'Year-Month': ['2022-10', '2022-11', '2022-12']})
df = df.join(
    df['Year-Month']
    .str.split('-', expand=True)
    .set_axis(['year', 'month'], axis='columns')
)
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, returns the frames stacked together horizontally.
You can use a regex for that.
Creating a new DataFrame:
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Example input:
Year-Month
0 2022-10
output:
year month
0 2022 10
Alternative using datetime:
You can also use a datetime intermediate:
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
output:
Year-Month year month
0 2022-10 2022 10
You can also convert the datatype from object (str) into datetime format. This will make it easier to work with the dates.
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
I am currently working on multiple datasets with a Timestamp column in dd/mm/yyyy HH:MM format (daily data at 5-minute intervals).
I want to resample each dataset to fill in the missing dates and timestamps.
The issue is that in a few datasets some rows are in ddmmyy format, then the format abruptly
changes to mmddyyyy after, say, the first few hundred rows, and then back to ddmmyy, without any pattern.
I need a solution or help to correct this issue.
The code I am using:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Timestamp'] = df.Timestamp.dt.strftime('%d/%m/%y %H:%M')
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
start_dt = df.loc[0, "Timestamp"]
end_dt = df["Timestamp"].iloc[-1]
r = pd.date_range(start=start_dt, end=end_dt, freq="5min")
# Reindexing by adding missing dates
df = df.set_index('Timestamp').reindex(r).rename_axis("Timestamp").reset_index()
Use a regex to separate the rows in ddmmyy format from the rows in mmddyy format, then convert each group to datetime with its own format.
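A rough sketch of that idea, under a strong assumption taken from the wording of the question: day-first rows carry a two-digit year and month-first rows carry a four-digit year, so a regex on the year length can tell them apart. The sample data and the exact formats are made up. If both layouts use the same year length, rows where both fields are 12 or less cannot be told apart this way.

```python
import pandas as pd

# Made-up sample: rows 0 and 2 are dd/mm/yy, row 1 is mm/dd/yyyy (assumed layouts)
df = pd.DataFrame({"Timestamp": ["13/01/22 00:00", "01/13/2022 00:05", "13/01/22 00:10"]})

s = df["Timestamp"].str.strip()

# Regex mask: True where the date part ends in a four-digit year
four_digit_year = s.str.match(r"^\d{1,2}/\d{1,2}/\d{4}\s")

# Parse each group with its own explicit format; the other rows become NaT
month_first = pd.to_datetime(s.where(four_digit_year), format="%m/%d/%Y %H:%M")
day_first = pd.to_datetime(s.where(~four_digit_year), format="%d/%m/%y %H:%M")

# Merge the two partial results back into one column
df["Timestamp"] = month_first.fillna(day_first)
print(df)
```

After this step the column has a single datetime dtype, so the date_range and reindex code from the question can be applied to it.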
Please have a look at both these images, especially the dates from Sno 32. The month and day columns are not converted properly. How can I correct this? I have already looked at questions about time series but haven't found an answer to this kind of issue.
The problem is that pandas by default parses the month first if possible.
You can specify the format as DD/MM/YY:
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%y')
Or try using dayfirst=True parameter:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
Or, if you create the DataFrame from a file, use the parse_dates and dayfirst=True parameters:
df = pd.read_csv(file, parse_dates=['date'], dayfirst=True)
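As a small worked example of the first two options, here is a sketch on made-up day-first strings. The column name date follows the snippets above and the values are hypothetical.

```python
import pandas as pd

# Made-up day-first dates; "13/02/21" cannot be read month-first at all
df = pd.DataFrame({"date": ["01/02/21", "13/02/21"]})

# Explicit format: the first field is always treated as the day
explicit = pd.to_datetime(df["date"], format="%d/%m/%y")

# dayfirst=True: pandas infers the format but prefers day-first
inferred = pd.to_datetime(df["date"], dayfirst=True)

print(explicit.dt.strftime("%d %b %Y").tolist())
print(inferred.dt.strftime("%d %b %Y").tolist())
```

The explicit format is the stricter of the two: a row that does not match it raises an error instead of being reinterpreted silently.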
Two of the columns in my dataset are hour and mins as integers. Here's a snippet of the dataset.
I'm creating a timestamp through the following code:
TIME = pd.to_timedelta(df["hour"], unit='h') + pd.to_timedelta(df["mins"], unit='m')
#df['TIME'] = TIME
df['TIME'] = TIME.astype(str)
I convert TIME to string format because I'm exporting the dataframe to MS Excel which doesn't support timedelta format.
Now I want timestamps for every minute.
For that, I want to fill the missing minutes and add zero to the TOTAL_TRADE_RATE against them, for which I first have to set the TIME column as index. I'm applying this:
df = df.set_index('TIME')
df.index = pd.DatetimeIndex(df.index)
df.resample('60s').sum().reset_index()
but it's giving the following error:
Unknown string format: 0 days 09:33:00.000000000
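The error arises because pd.DatetimeIndex cannot parse timedelta strings such as "0 days 09:33:00". One possible approach, sketched here with made-up data and the column names from the question, is to keep TIME as a timedelta, resample on it, and convert it to a string only just before the Excel export.

```python
import pandas as pd

# Made-up sample; minute 09:34 is missing on purpose
df = pd.DataFrame({
    "hour": [9, 9, 9],
    "mins": [33, 35, 36],
    "TOTAL_TRADE_RATE": [1.5, 2.0, 0.5],
})

# Keep TIME as a timedelta so that pandas can resample on it
time = pd.to_timedelta(df["hour"], unit="h") + pd.to_timedelta(df["mins"], unit="m")

# One row per minute; minutes without data sum to 0
per_minute = (
    df.set_index(time)["TOTAL_TRADE_RATE"]
    .resample("60s")
    .sum()
    .rename_axis("TIME")
    .reset_index()
)

# Convert to string last, only for the Excel export
per_minute["TIME"] = per_minute["TIME"].astype(str)
print(per_minute)
```

Only TOTAL_TRADE_RATE is selected before resampling, so the hour and mins columns are not summed along with it.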
I have data in Excel like this
I want to combine the Date and Time columns using the following code:
import pandas as pd
df = pd.read_excel('selfmade.xlsx')
df['new'] = df['Date'].map(str) + df['Time'].map(str)
print(df)
but it prints the results like this.
I want the last column in a format like 2016-06-14 10:00:00
What should I change in my code to get the desired results?
I think you need to_datetime and to_timedelta. It is also necessary to convert the Time column to string with astype:
df['new'] = pd.to_datetime(df['Date']) + pd.to_timedelta(df['Time'].astype(str))
If dtype of Date column is already datetime:
df['new'] = df['Date'] + pd.to_timedelta(df['Time'].astype(str))
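A self-contained sketch of the first variant, on made-up data. It assumes the Time column arrives as datetime.time objects, which is a common result of read_excel but is only a guess about this particular file.

```python
import datetime as dt

import pandas as pd

# Made-up stand-in for the Excel sheet; Time holds datetime.time objects (assumption)
df = pd.DataFrame({
    "Date": ["2016-06-14", "2016-06-15"],
    "Time": [dt.time(10, 0), dt.time(11, 30)],
})

# str(dt.time(10, 0)) is "10:00:00", which to_timedelta can parse
df["new"] = pd.to_datetime(df["Date"]) + pd.to_timedelta(df["Time"].astype(str))
print(df)
```

The new column is a true datetime column, so it can be sorted, resampled or formatted with dt.strftime afterwards.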