I want to read an Excel file in which the first row holds weekday names and the second row holds the corresponding dates as strings, and then convert the second row from string to datetime. If I read only the second row as the header and skip the weekday row entirely, I convert it to datetime like this:
Receipts_tbl.columns = pd.to_datetime(Receipts_tbl.columns)
How do I do that with a multi-indexed dataframe, where the first level of the column index stays as weekdays and only the second level is converted to datetime?
Thanx
You didn't give an example of what your data source looks like, so I'm inferring.
If you use pd.read_excel with header=None, it will treat the first two rows as data and you can manipulate them to achieve your goal. Here's a minimal example, with an example "real" data row beneath:
df = pd.DataFrame([['Mon', 'Tues'], ['10-02-1995', '11-23-1997'],
[12, 32]])
# 0 1
#0 Mon Tues
#1 10-02-1995 11-23-1997
#2 12 32
Next, convert the second row (label 1, the date strings) to datetime as you said in your question.
df.loc[1] = pd.to_datetime(df.loc[1])
Create a multi-index from the first two rows, and set it as the dataframe's columns
df.columns = df.T.set_index([0,1]).index.set_names(['DOW', 'Date'])
Lastly, select from second row down, as the first two rows are now in the columns.
df = df.loc[2:].reset_index(drop=True)  # drop=True discards the old row labels instead of adding them as a column
df
#DOW Mon Tues
#Date 812592000000000000 880243200000000000
#0 12 32
Note that DOW and Date are now a multilevel index for the columns, and the 'data' rows have been reindexed to start at 0.
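The output above shows the dates as integer nanoseconds, which suggests the datetime dtype was lost when the converted values were written back into a mixed-type row. A minimal sketch of one way to avoid that, assuming the same toy frame: build the column MultiIndex directly with pd.MultiIndex.from_arrays so the Date level stays a DatetimeIndex. The names raw, dates and data are illustrative only, and the format string assumes month-day-year input.

```python
import pandas as pd

# Same toy input as above: weekday row, date-string row, one data row.
raw = pd.DataFrame([['Mon', 'Tues'], ['10-02-1995', '11-23-1997'], [12, 32]])

# Parse the date strings on their own, so they never pass through an object-dtype row.
dates = pd.to_datetime(raw.iloc[1], format='%m-%d-%Y')

# Build the two-level column index from the weekday row and the parsed dates.
columns = pd.MultiIndex.from_arrays([raw.iloc[0], dates], names=['DOW', 'Date'])

# Keep only the real data rows and attach the new columns.
data = raw.iloc[2:].reset_index(drop=True)
data.columns = columns
```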
Please let me know if I misunderstood your question.
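If the frame has already been read with a two-level column header (for example with header=[0, 1]), another option is to convert only the second level in place. This is a sketch under that assumption; the toy frame and the level names DOW and Date are made up for illustration.

```python
import pandas as pd

# Toy frame whose columns are already a MultiIndex of (weekday, date string).
cols = pd.MultiIndex.from_arrays(
    [['Mon', 'Tues'], ['10-02-1995', '11-23-1997']],
    names=['DOW', 'Date'],
)
df = pd.DataFrame([[12, 32]], columns=cols)

# Convert the unique values of the Date level, then swap them in.
new_dates = pd.to_datetime(df.columns.levels[1], format='%m-%d-%Y')
df.columns = df.columns.set_levels(new_dates, level='Date')
```

The weekday level is untouched; only the labels of the Date level change.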
Assuming you have this data in the clipboard
Day Date Data
Mo 2018-08-06 blah
Mo 2018-08-06 blah
Mo 2018-08-06 blah
Tu 2018-08-07 blah
Try
import pandas as pd
df = pd.read_clipboard().set_index(['Day', 'Date'])
to get a multiindexed example
Then change the Date to Datetime
df2 = df.reset_index()
df2.Date = pd.to_datetime(df2.Date, yearfirst=True)
Afterwards you can set the multiindex again, if you want.
Note, check out the documentation on to_datetime if your
datetime string is formatted differently. It assumes
month first, unless you set dayfirst or yearfirst to True.
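As a rough sketch of the full round trip, here is the same idea without the clipboard, using a small inline frame that mirrors the sample data. The last two lines illustrate the month-first default on an ambiguous string; the string '06-08-2018' is a made-up example.

```python
import pandas as pd

# Inline stand-in for the clipboard data.
df = pd.DataFrame({
    'Day': ['Mo', 'Mo', 'Tu'],
    'Date': ['2018-08-06', '2018-08-06', '2018-08-07'],
    'Data': ['blah', 'blah', 'blah'],
}).set_index(['Day', 'Date'])

# Move the index levels back to columns, convert, then rebuild the MultiIndex.
df2 = df.reset_index()
df2['Date'] = pd.to_datetime(df2['Date'], format='%Y-%m-%d')
df2 = df2.set_index(['Day', 'Date'])

# Ambiguous strings are read month-first unless dayfirst=True is given.
month_first = pd.to_datetime('06-08-2018')
day_first = pd.to_datetime('06-08-2018', dayfirst=True)
```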
Dataset
Hi, I have a column ['release_date'] in month, day, year format. I was trying to split this column by doing
test['date_added'].str.split(' ',expand=True) #code_1
but it creates 4 columns instead of 3. For a few rows the values are shifted one column to the right, which is what produces the 4th column.
This is the error I am facing
I tried splitting ['release_date'] and expected it to be split into 3 columns, but for some reason a few rows are shifted into another column.
if someone wants to inspect that dataframe you can use google colab for it,
!gdown 1x-_Kq9qYrybB9-DxJHoeVlPabmAm6xbQ
you can use:
df['day'] = pd.DatetimeIndex(df['date_added']).day
df['Month'] = pd.DatetimeIndex(df['date_added']).month
df['year'] = pd.DatetimeIndex(df['date_added']).year
# Alternative: collect the parts in one pass (assumes date_added already has datetime dtype)
day, month, year = zip(*[(d.day, d.month, d.year) for d in df['date_added']])
df = df.assign(day=day, month=month, year=year)
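One plausible cause of the extra column, which is an assumption about the data rather than something confirmed in the question, is a leading space in some rows: splitting ' August 4, 2017' on ' ' yields an empty first piece. The sketch below reproduces that with two made-up rows, then strips the whitespace and parses with an explicit format before pulling out the parts.

```python
import pandas as pd

# Two illustrative rows; the second has a leading space (assumed, not taken from the real data).
df = pd.DataFrame({'date_added': ['September 25, 2021', ' August 4, 2017']})

# The leading space produces an empty first piece, hence a 4th column.
parts = df['date_added'].str.split(' ', expand=True)

# Strip whitespace first, then parse with an explicit format.
parsed = pd.to_datetime(df['date_added'].str.strip(), format='%B %d, %Y')
df = df.assign(day=parsed.dt.day, month=parsed.dt.month, year=parsed.dt.year)
```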
1. Question
I have a dataframe, and the Year-Month column contains the year and month which I want to extract.
For example, an element in this column is "2022-10". And I want to extract year=2022, month=10 from it.
My current solution is to use apply and lambda function:
df['xx_month'] = df['Year-Month'].apply(lambda x: int(x.split('-')[1]))
But it's super slow on a huge dataframe.
How to do it more efficiently?
2. Solutions
Thanks for your wisdom, I summarized each one's solution with the code:
(1) split by '-' and join #Vitalizzare
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, then the frames stacked together horizontally are returned.
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})
df = df.join(
df['Year-Month']
.str.split('-', expand=True)
.set_axis(['year','month'], axis='columns')
)
(2) convert the datatype from object (str) into datetime format #Neele22
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
(3) use regex or datetime to extract year and month #mozway
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
# If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Or use datetime:
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
3. Follow up question
But there will be a problem if I want to subtract 'Year-Month' with other datetime columns after converting the incomplete 'Year-Month' column from string to datetime.
For example, if I want to get the data which is no later than 2 months after the timestamp of each record.
import dateutil # dateutil is a better package than datetime package according to my experience
df[(df['timestamp'] - df['Year-Month'])>= dateutil.relativedelta.relativedelta(months=0) and (df['timestamp'] - df['Year-Month'])<= datetime.timedelta(months=2)]
This code will have type error for subtracting the converted Year-Month column with actual datetime column.
TypeError: Cannot subtract tz-naive and tz-aware datetime-like objects
The types for these two columns are:
Year-Month is datetime64[ns]
timestamp is datetime64[ns, UTC]
Then, I tried to specify utc=True when changing Year-Month to datetime type:
df[["Year-Month"]] = pd.to_datetime(df[["Year-Month"]],utc=True,format="%Y-%m")
But I got Value Error.
ValueError: to assemble mappings requires at least that [year, month,
day] be specified: [day,month,year] is missing
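The message about assembling [year, month, day] appears when pd.to_datetime receives a DataFrame, which is what df[["Year-Month"]] (double brackets) is. A minimal sketch, assuming a toy frame with the same two column names: pass the Series instead, parse with utc=True, and express the two-month window with pd.DateOffset. The sample timestamps are invented for illustration.

```python
import pandas as pd

# Toy data: 'timestamp' is tz-aware (UTC), 'Year-Month' is a plain string.
df = pd.DataFrame({
    'Year-Month': ['2022-10', '2022-11', '2022-12'],
    'timestamp': pd.to_datetime(['2022-11-15', '2023-03-01', '2022-12-20'], utc=True),
})

# Single brackets give a Series; the missing day defaults to the 1st.
ym = pd.to_datetime(df['Year-Month'], format='%Y-%m', utc=True)

# Keep rows whose timestamp is within two calendar months after the Year-Month start.
mask = (df['timestamp'] >= ym) & (df['timestamp'] <= ym + pd.DateOffset(months=2))
result = df[mask]
```

Element-wise conditions on Series are combined with &, not the Python keyword and.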
4. Take away
The ValueError above does not mean that an incomplete date (like in my case, only year and month) cannot be converted. It is raised because df[["Year-Month"]] with double brackets is a DataFrame, and pd.to_datetime treats a DataFrame as columns named year, month and day to assemble. Passing the Series df["Year-Month"] with format="%Y-%m" and utc=True parses the strings (the missing day defaults to 1) and gives a tz-aware column that can be compared with timestamp.
If you don't need to do calculations between the incomplete datetime column and other datetime columns like me, you can change the incomplete datetime string into datetime type and extract the year and month from it. It's easier than using regex, split and join.
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})
df = df.join(
df['Year-Month']
.str.split('-', expand=True)
.set_axis(['year','month'], axis='columns')
)
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, then the frames stacked together horizontally are returned.
You can use a regex for that.
Creating a new DataFrame:
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Example input:
Year-Month
0 2022-10
output:
year month
0 2022 10
alternative using datetime:
You can also use a datetime intermediate
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
output:
Year-Month year month
0 2022-10 2022 10
You can also convert the datatype from object (str) into datetime format. This will make it easier to work with the dates.
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
I have a dataframe that downloads as:
[image: the dataframe with the date header on a separate row]
If I export it to a CSV file and import it again, it has all the headers on the first row.
If I look up the first row via .iloc[0] I get:
bidopen 1.14140
bidclose 1.14143
bidhigh 1.14160
bidlow 1.14116
askopen 1.14153
askclose 1.14164
askhigh 1.14179
asklow 1.14127
tickqty 5204.00000
Name: 2022-01-14 21:00:00, dtype: float64
Resetting the index does not work.
Essentially I am trying to select the date column, i.e. df['date'] etc., but with no luck in its current form.
Any help would be greatly appreciated.
This code copies the date index into a regular column and then resets the index. You will need to import pandas.
df['Date'] = df.index
df.reset_index(drop=True, inplace=True)
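A possible reason "resetting the index does not work" is that reset_index returns a new DataFrame unless the result is assigned or inplace=True is used. A short sketch under that assumption, with a toy frame standing in for the downloaded data; the column name date is chosen for illustration.

```python
import pandas as pd

# Toy frame with a datetime index, similar in shape to the downloaded data.
df = pd.DataFrame(
    {'bidopen': [1.14140, 1.14150], 'bidclose': [1.14143, 1.14155]},
    index=pd.to_datetime(['2022-01-14 21:00:00', '2022-01-14 22:00:00']),
)

# Name the index, then turn it into a column; assign the result back.
df = df.rename_axis('date').reset_index()

first_date = df['date'].iloc[0]
```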
I have a 'myfile.csv' file which has a 'timestamp' column which starts at
(01/05/2015 11:51:00)
and finishes at
(07/05/2015 23:22:00)
A total span of 9,727 minutes
'myfile.csv' also has a column named 'A' which holds some numerical value. There are multiple values of 'A' within each minute, each with a unique timestamp to the nearest second.
I have code as follows
df = pd.read_csv('myfile.csv')
df = df.set_index('timestamp')
df.index = pd.to_datetime(df.index)  # Index.to_datetime() was removed from pandas; pd.to_datetime is the current call
df.sort_index(inplace=True)
df = df['A'].resample('1Min').mean()
df.index = (df.index.map(lambda t: t.strftime('%Y-%m-%d %H:%M')))
My problem is that python seems to think 'timestamp' starts at
(01/05/2015 11:51:00)
-> 5th January
and finishes at
(07/05/2015 23:22:00)
-> 5th July
But really 'timestamp' starts at the
1st May
and finishes at the
7th of May
So the above code produces a dataframe with 261,332 rows, OMG, when it should really only have 9,727 rows.
Somehow Python is mixing up the month with the day, misinterpreting the dates, how do I sort this out?
read_csv has several arguments that help you parse dates from a CSV straight into your pandas DataFrame. Here we can set parse_dates to the columns you want as dates and then use dayfirst. dayfirst defaults to False, which is why your dates were read month-first, so the following should do what you want, assuming the dates are in the first column.
df = pd.read_csv('myfile.csv', parse_dates=[0], dayfirst=True)
If the dates column is not the first column, just change the 0 to the column number.
The format of the dates in your question doesn't seem to match your strftime format string. Check the strftime format codes to fix your format string.
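A small self-contained check of the dayfirst behaviour, using an in-memory CSV instead of 'myfile.csv'. The three rows are made up to match the timestamp format in the question.

```python
import io
import pandas as pd

# In-memory stand-in for myfile.csv, with day/month/year timestamps.
csv_text = (
    "timestamp,A\n"
    "01/05/2015 11:51:00,1.0\n"
    "01/05/2015 11:51:30,3.0\n"
    "07/05/2015 23:22:00,5.0\n"
)

# parse_dates=[0] parses the first column; dayfirst=True reads 01/05 as 1 May.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=[0], dayfirst=True)

first, last = df['timestamp'].iloc[0], df['timestamp'].iloc[-1]
```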
It looks to me that it should be something in the lines of:
'%d/%m/%Y %H:%M:%S'
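A sketch of how that format string could be used when parsing, assuming the timestamps are plain strings like the ones in the question. An explicit format removes the day/month ambiguity entirely.

```python
import pandas as pd

# The two boundary timestamps from the question, as strings.
raw = pd.Series(['01/05/2015 11:51:00', '07/05/2015 23:22:00'])

# With an explicit format there is nothing left for pandas to guess.
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M:%S')

months = parsed.dt.month.tolist()
days = parsed.dt.day.tolist()
```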
When I enable parse_dates, it looks like column 0 is removed; that is, data.columns.values starts from 1 and not 0. How do I access the date column?
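# Modernized: unicode() is Python 2 only, and DataFrame.from_csv was removed from pandas in favor of read_csv
from io import StringIO
import pandas as pd
mytext = StringIO(mytext)
data = pd.read_csv(mytext,
parse_dates=True,
index_col=0,
header=None)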
index_col=0 means that column 0 is used as the index.
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Index.html
data.index
returns the first column from the original csv file.
If you prefer this to be a column,
data.reset_index()
returns a new DataFrame with the old index as a column.
index_col=False may also be passed to pd.read_csv() but this will chop off the last column of data if there are not enough column names.
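To make the index/column distinction concrete, here is a minimal sketch with a two-row in-memory CSV (the contents are invented). It shows that the dates end up in data.index, that the remaining columns are labelled 1 and 2, and that reset_index brings the dates back as a column.

```python
import io
import pandas as pd

# Headerless CSV text: a date followed by two values per row.
mytext = "2015-05-01,1.0,2.0\n2015-05-02,3.0,4.0\n"

# Column 0 becomes the (parsed) index, so the remaining columns are labelled 1 and 2.
data = pd.read_csv(io.StringIO(mytext), parse_dates=True, index_col=0, header=None)

dates = data.index              # the original first column
as_column = data.reset_index()  # new DataFrame with the old index as its first column
```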