A DataFrame has dates as its index. I need to add a column whose value is the number of days since the epoch. For a single date this can be calculated with
(date_value - datetime.datetime(1970,1,1)).days
How can this value be calculated for all rows in the DataFrame?
The following code demonstrates the operation on a sample DataFrame. Is there a better way of doing this?
import pandas as pd
date_range = pd.date_range(start='1/1/1970', end='12/31/2018', freq='D')
df = pd.DataFrame(date_range, columns=['date'])
df['days_since_epoch']=range(0,len(df))
df = df.set_index('date')
Note: this is an example; the dates in the DataFrame need not start from 1 January 1970.
Subtract a scalar Timestamp from the DatetimeIndex, then read the days attribute of the resulting TimedeltaIndex:
df['days_since_epoch1']= (df.index - pd.Timestamp('1970-01-01')).days
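As a minimal sketch of the same idea on an index that does not start at the epoch (the sample dates and the column name `value` are made up for illustration):

```python
import pandas as pd

# Sample index that deliberately does not start at 1970-01-01.
idx = pd.date_range(start="2018-12-29", end="2019-01-02", freq="D")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

epoch = pd.Timestamp("1970-01-01")

# DatetimeIndex minus a Timestamp gives a TimedeltaIndex;
# its .days attribute holds the whole-day component of each difference.
df["days_since_epoch"] = (df.index - epoch).days

print(df)
```

If the dates live in a regular column rather than the index, the equivalent should be `(df['date'] - epoch).dt.days`, since a Series exposes the same attribute through the `.dt` accessor.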
I have a pandas DataFrame with a Date column (string), which I converted and set as the index using the set_index and to_datetime functions:
usd2inr_df.set_index(pd.to_datetime(usd2inr_df['Date']), inplace=True)
but the resulting DataFrame has a time portion that I want to remove:
2023-02-14 00:00:00
I want it to be 2023-02-14.
How do I set up the call so that the index of my DataFrame holds the date without the time portion?
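usd2inr_df['Date'] = pd.to_datetime(usd2inr_df['Date']).dt.normalize()
# set_index returns a new DataFrame, so assign the result back; the column is named 'Date', not 'date'
usd2inr_df = usd2inr_df.set_index('Date')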
pd.to_datetime() converts a Series to pandas datetime values.
Series.dt.date returns the date part as Python datetime.date objects, displayed as 'yyyy-mm-dd'.
Assigning to DataFrame.index sets the index of the DataFrame.
import pandas as pd
# create a dataFrame as an example
df = pd.DataFrame({'Name': ['Example'],'Date': ['2023-02-14 10:01:11']})
print(df)
# convert 'yyyy-mm-dd hh:mm:ss' to 'yyyy-mm-dd'.
df['Date'] = pd.to_datetime(df['Date']).dt.date
# set 'Date' as index
df.index = df['Date']
print(df)
Output
      Name                 Date
0  Example  2023-02-14 10:01:11
-------------------------------------------------------
               Name        Date
Date
2023-02-14  Example  2023-02-14
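The two approaches in this thread leave different index types behind. The sketch below, reusing the sample row from the answer (the variable names `by_normalize` and `by_date` are only illustrative), puts them side by side:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Example"], "Date": ["2023-02-14 10:01:11"]})
parsed = pd.to_datetime(df["Date"])

# Option A: normalize() resets the time to midnight and keeps a datetime dtype,
# so the result is still a DatetimeIndex.
by_normalize = df.set_index(parsed.dt.normalize())

# Option B: .dt.date yields Python datetime.date objects,
# so the index is no longer a DatetimeIndex.
by_date = df.set_index(parsed.dt.date)

print(by_normalize.index)
print(by_date.index)
```

Option A is probably the better fit when later steps need datetime features such as resampling or partial-string slicing; Option B suits cases where only plain date values are wanted.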
1. Question
I have a dataframe, and the Year-Month column contains the year and month which I want to extract.
For example, an element in this column is "2022-10". And I want to extract year=2022, month=10 from it.
My current solution is to use apply and lambda function:
df['xx_month'] = df['Year-Month'].apply(lambda x: int(x.split('-')[1]))
But it's super slow on a huge dataframe.
How can I do it more efficiently?
2. Solutions
Thanks for your wisdom. I have summarized each person's solution with its code:
(1) split by '-' and join #Vitalizzare
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, then the frames stacked together horizontally are returned.
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})
df = df.join(
df['Year-Month']
.str.split('-', expand=True)
.set_axis(['year','month'], axis='columns')
)
(2) convert the datatype from object (str) into datetime format #Neele22
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
(3) use regex or datetime to extract year and month #mozway
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
# If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Or use datetime:
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
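One more vectorized variant, not taken from any of the answers above: if every value is known to have the fixed-width form YYYY-MM (an assumption), positional slicing through the `.str` accessor avoids both splitting and regex matching. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"Year-Month": ["2022-10", "2022-11", "2022-12"]})

# Assumes fixed-width 'YYYY-MM' strings: characters 0-3 are the year,
# characters 5-6 are the month.
df["year"] = df["Year-Month"].str[:4].astype(int)
df["month"] = df["Year-Month"].str[5:7].astype(int)

print(df)
```

Whether this is faster than the other options on a large frame would need to be measured on the actual data.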
3. Follow up question
However, a problem arises if I convert the incomplete 'Year-Month' column from string to datetime and then subtract it from other datetime columns.
For example, suppose I want the records whose timestamp is no later than 2 months after the Year-Month of each record.
import dateutil # dateutil is a better package than datetime package according to my experience
df[(df['timestamp'] - df['Year-Month'])>= dateutil.relativedelta.relativedelta(months=0) and (df['timestamp'] - df['Year-Month'])<= datetime.timedelta(months=2)]
This code raises a TypeError when subtracting the converted Year-Month column from the actual datetime column:
TypeError: Cannot subtract tz-naive and tz-aware datetime-like objects
The types for these two columns are:
Year-Month is datetime64[ns]
timestamp is datetime64[ns, UTC]
Then, I tried to specify utc=True when changing Year-Month to datetime type:
df[["Year-Month"]] = pd.to_datetime(df[["Year-Month"]],utc=True,format="%Y-%m")
But I got a ValueError:
ValueError: to assemble mappings requires at least that [year, month,
day] be specified: [day,month,year] is missing
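For what it is worth, both errors can probably be avoided. The sketch below uses made-up sample timestamps and assumes the goal is "timestamp falls within two calendar months after the start of the Year-Month". It passes the Series (single brackets) to `pd.to_datetime`, makes it tz-aware with `utc=True`, and uses `pd.DateOffset` for the two-month window:

```python
import pandas as pd

df = pd.DataFrame({
    "Year-Month": ["2022-10", "2022-11", "2022-12"],
    # Made-up tz-aware timestamps for illustration.
    "timestamp": pd.to_datetime(
        ["2022-11-15 08:00", "2023-03-01 12:00", "2022-12-31 23:59"], utc=True
    ),
})

# Single brackets give a Series; "2022-10" parses to 2022-10-01 00:00 UTC.
start = pd.to_datetime(df["Year-Month"], format="%Y-%m", utc=True)

# Calendar-aware offset: two months after the start of each period.
end = start + pd.DateOffset(months=2)

# Element-wise conditions are combined with &, not the Python keyword `and`.
mask = (df["timestamp"] >= start) & (df["timestamp"] <= end)
print(df[mask])
```

Comparing against `start` and `end` directly sidesteps the subtraction, so no month-sized timedelta is needed (`datetime.timedelta` has no `months` argument).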
4. Take away
A column that only contains year and month (as in my case) can still be converted from string to datetime: the missing day defaults to the first of the month. The ValueError above comes from passing a DataFrame (df[["Year-Month"]], with double brackets) to pd.to_datetime, which then tries to assemble dates from columns named year, month and day. Passing the Series df["Year-Month"] together with utc=True avoids both errors. Alternatively, use the extracted year and month for the calculations.
Either way, converting the incomplete datetime string to datetime type and extracting the year and month from it is easier than using regex, split and join.
You can use a regex for that.
Creating a new DataFrame:
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Example input:
Year-Month
0 2022-10
output:
year month
0 2022 10
alternative using datetime:
You can also use a datetime intermediate
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
output:
Year-Month year month
0 2022-10 2022 10
You can also convert the datatype from object (str) into datetime format. This will make it easier to work with the dates.
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
I would like to analyse time series data with several million entries.
The data has a granularity of one entry per minute.
By definition, no data exists during the weekend, nor for one hour on each weekday.
I want to check for missing data during the week (that is, whether one or more minutes are missing).
How would I do this with high performance in Python (e.g. with a pandas DataFrame)?
Probably the easiest would be to compare your DatetimeIndex with missing values to a reference DatetimeIndex covering the same range with all values.
Here's an example where I create an arbitrary DatetimeIndex and include some dummy values in a DataFrame.
import pandas as pd
import numpy as np
#dummy data
date_range = pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1Min')
df = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 1)))
df.index = date_range # set index
df_missing = df.drop(df.between_time('00:12', '00:14').index)
#check for missing datetimeindex values based on reference index (with all values)
missing_dates = df.index[~df.index.isin(df_missing.index)]
print(missing_dates)
Which will return:
DatetimeIndex(['2017-01-01 00:12:00', '2017-01-01 00:13:00',
'2017-01-01 00:14:00'],
dtype='datetime64[ns]', freq='T')
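To account for the weekend and the daily one-hour pause from the question, the reference index can be filtered before comparing. The sketch below is only illustrative: the pause hour (22:00 to 22:59) and the sample week are assumptions, and `Index.difference` stands in for the `isin` check used above.

```python
import pandas as pd

# Reference index: every minute of one week (Mon 2017-01-02 to Sun 2017-01-08).
full = pd.date_range("2017-01-02 00:00", "2017-01-08 23:59", freq="min")

# Keep weekdays only (dayofweek: Monday=0 ... Sunday=6).
expected = full[full.dayofweek < 5]

# Assumption for illustration: the daily pause is 22:00-22:59.
expected = expected[expected.hour != 22]

# Simulated observations with three minutes dropped on Tuesday.
gap = pd.date_range("2017-01-03 09:12", "2017-01-03 09:14", freq="min")
observed = expected.difference(gap)

# Expected timestamps that never show up in the observed index.
missing = expected.difference(observed)
print(missing)
```

With real data, `observed` would be the DataFrame's own index, and `full` would span from its first to its last timestamp.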
I have data in Excel with a Date column and a Time column.
I want to combine the Date and Time columns using the following code:
import pandas as pd
df = pd.read_excel('selfmade.xlsx')
df['new'] = df['Date'].map(str) + df['Time'].map(str)
print(df)
but the result is not in the format I want.
I want the last column in a format like 2016-06-14 10:00:00.
What should I change in my code to get the desired result?
I think you need to_datetime and to_timedelta; it is also necessary to convert the Time column to string with astype:
df['new'] = pd.to_datetime(df['Date']) + pd.to_timedelta(df['Time'].astype(str))
If the dtype of the Date column is already datetime:
df['new'] = df['Date'] + pd.to_timedelta(df['Time'].astype(str))
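A self-contained sketch of this approach, using a small in-memory frame in place of the Excel file. It assumes the Time column holds `datetime.time` values, which is a guess at what `read_excel` returned here; the sample values are made up.

```python
import datetime
import pandas as pd

# Stand-in for the Excel sheet (hypothetical sample values).
df = pd.DataFrame({
    "Date": ["2016-06-14", "2016-06-15"],
    "Time": [datetime.time(10, 0, 0), datetime.time(14, 30, 0)],
})

# str(datetime.time(10, 0)) is "10:00:00", a form that to_timedelta can parse.
df["new"] = pd.to_datetime(df["Date"]) + pd.to_timedelta(df["Time"].astype(str))

print(df["new"])
```

The result is a real datetime column rather than a concatenated string, so it can be sorted, resampled or used as an index.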
I have data that looks like this:
Column A (timestamp) Column B (Price) Column C(Volume)
20140804:10:00:13.281486,782.83,443355
20140804:10:00:13.400113,955.71,348603
20140804:10:00:13.555512,1206.38,467175
20140804:10:00:13.435677,1033.50,230056
I am trying to sort by timestamp using the following code:
sorted_time = pd.to_datetime(df.time, format="%Y%m%d:%H:%M:%S.%f").sort_values()
All I am getting back is the column of times, without the other columns. Any help will be appreciated.
You need to call sort_values on the data frame, where you can specify time as the sort by column:
df['time'] = pd.to_datetime(df['time'], format="%Y%m%d:%H:%M:%S.%f")
df = df.sort_values(by = 'time')
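Put together with the sample rows from the question, a runnable sketch looks like this (the column names `time`, `price` and `volume` are assumed, since the question only shows `df.time`):

```python
import io
import pandas as pd

raw = """time,price,volume
20140804:10:00:13.281486,782.83,443355
20140804:10:00:13.400113,955.71,348603
20140804:10:00:13.555512,1206.38,467175
20140804:10:00:13.435677,1033.50,230056
"""
df = pd.read_csv(io.StringIO(raw))

# Parse the custom timestamp layout, then sort the whole frame by that column.
df["time"] = pd.to_datetime(df["time"], format="%Y%m%d:%H:%M:%S.%f")
df = df.sort_values(by="time").reset_index(drop=True)

print(df)
```

Sorting the DataFrame rather than the parsed Series keeps the price and volume columns aligned with their timestamps.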