Pandas, select dates using input from list - python

here is my input df:
df:
date , name
1990-12-21, adam1
1990-12-22, adam2
1990-12-23, adam3
1990-12-24, adam4
1990-12-25, adam5
I want to select all dates after a given date taken from a list (the date is always in first place):
list = ['1990-12-23','name','22']
df['date'] = pd.to_datetime(df['date'])
df = df[df.date > list[0]]
And it works.
My question is: why does it work without converting the first element of the list to datetime format?

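When a datetime64 column is compared with a string, pandas parses the string into a Timestamp before comparing. Any string that can be parsed automatically as a date or time can therefore be used directly, without converting it first. The same flexibility is behind pandas' Partial String Indexing on a DatetimeIndex.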
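A minimal sketch of the comparison from the question, assuming the sample data shown there. The names `values`, `cutoff`, `selected` and `explicit` are illustrative and not from the original post. It compares the implicit string comparison with an explicit pd.Timestamp conversion to show that both select the same rows.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "date": ["1990-12-21", "1990-12-22", "1990-12-23", "1990-12-24", "1990-12-25"],
    "name": ["adam1", "adam2", "adam3", "adam4", "adam5"],
})
df["date"] = pd.to_datetime(df["date"])

# The cutoff date is the first element of the input list (still a plain string)
values = ["1990-12-23", "name", "22"]
cutoff = values[0]

# Implicit: pandas parses the string before comparing
selected = df[df["date"] > cutoff]

# Explicit: convert the string yourself, for comparison
explicit = df[df["date"] > pd.Timestamp(cutoff)]

print(selected)
```

Both filters should keep only the rows dated after 1990-12-23.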

Related

Convert to datetime using column position/number in python pandas

Very simple query but did not find the answer on google.
df with timestamp in date column
Date
22/11/2019 22:30:10 etc., which shows up as object when I check df.dtypes
Code:
df['Date']=pd.to_datetime(df['Date']).dt.date
Now I want the date to be converted to datetime using the column number rather than the column name. The column number in this case is 0 (I have very long column names and multiple similar files, so I want to convert the date column by its position, 0 in this case).
Can anyone help?
Use DataFrame.iloc to select the column (a Series) by position:
df.iloc[:, 0] = pd.to_datetime(df.iloc[:, 0]).dt.date
It is also possible to look up the column name by position and index with it:
df[df.columns[0]] = pd.to_datetime(df[df.columns[0]]).dt.date
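A small self-contained sketch of the second variant. The column names and the day-first format string are assumptions based on the sample value in the question, and `date_col` is an illustrative name. This version keeps the column as datetime64; appending `.dt.date` as in the answer would store Python date objects instead.

```python
import pandas as pd

# Hypothetical frame with a long column name in position 0
df = pd.DataFrame({
    "A very long timestamp column name": ["22/11/2019 22:30:10", "23/11/2019 08:15:00"],
    "value": [1, 2],
})

# Look up the name by position, then convert that column
date_col = df.columns[0]
df[date_col] = pd.to_datetime(df[date_col], format="%d/%m/%Y %H:%M:%S")

print(df.dtypes)
```

The same two lines can be reused on every file, since only the position is hard-coded.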

Converting dtype: period[M] to string format

I have converted my dates to the period[M] dtype because I only care about the month and year, not the day. Unfortunately I cannot plot with this dtype, so I now want to convert it to strings.
I need to group my data so I can plot some graphs by month.
But I keep getting a JSON serialization error when my data is in period[M] dtype,
so I want to convert it to strings.
df['Date_Modified'] = pd.to_datetime(df['Collection_End_Date']).dt.to_period('M')
#Add a new column called Date Modified to show just month and year
df = df.groupby(["Date_Modified", "Entity"]).sum().reset_index()
#Group the data frame by the new column and then Company and sum the values
df["Date_Modified"].index = df["Date_Modified"].index.strftime('%Y-%m')
It returns a string of numbers, but I need it to return a string output.
Use Series.dt.strftime to convert the Series to strings in the last step:
df["Date_Modified"]= df["Date_Modified"].dt.strftime('%Y-%m')
Or convert to strings before the groupby; then converting to a month period is not necessary:
df['Date_Modified'] = pd.to_datetime(df['Collection_End_Date']).dt.strftime('%Y-%m')
df = df.groupby(["Date_Modified", "Entity"]).sum().reset_index()
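A runnable sketch of the first variant on made-up data. The `Amount` column and its values are assumptions for illustration; only `Collection_End_Date`, `Entity` and `Date_Modified` come from the question.

```python
import pandas as pd

# Hypothetical input data
df = pd.DataFrame({
    "Collection_End_Date": ["2021-01-15", "2021-01-20", "2021-02-03"],
    "Entity": ["A", "A", "B"],
    "Amount": [10, 5, 7],
})

# Month period, as in the question
df["Date_Modified"] = pd.to_datetime(df["Collection_End_Date"]).dt.to_period("M")

# Group by month and entity, summing the numeric column
grouped = df.groupby(["Date_Modified", "Entity"])["Amount"].sum().reset_index()

# Convert the period column to plain 'YYYY-MM' strings for plotting
grouped["Date_Modified"] = grouped["Date_Modified"].dt.strftime("%Y-%m")

print(grouped)
```

After the last step the column holds ordinary Python strings, which plotting libraries and JSON encoders can handle.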

Faster solution for date formatting

I am trying to change the format of the date in a pandas dataframe.
If I check the date in the beginning, I have:
df['Date'][0]
Out[158]: '01/02/2008'
Then, I use:
df['Date'] = pd.to_datetime(df['Date']).dt.date
To change the format to
df['Date'][0]
Out[157]: datetime.date(2008, 1, 2)
However, this takes a veeeery long time, since my dataframe has millions of rows.
All I want to do is change the date format from MM/DD/YYYY to YYYY-MM-DD.
How can I do it in a faster way?
You should first collapse by Date using the groupby method to reduce the dimensionality of the problem.
Then you parse the dates into the new format and merge the results back into the original DataFrame.
This requires some time because of the merging, but it takes advantage of the fact that many dates are repeated a large number of times. You want to convert each date only once!
You can use the following code:
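from datetime import datetime
# Parse one 'MM/DD/YYYY' string (pd.datetime was removed in pandas 2.0)
date_parser = lambda x: datetime.strptime(str(x), '%m/%d/%Y')
df['date_index'] = df['Date']
# Parse each distinct date only once
dates = df.groupby(['date_index']).first()['Date'].apply(date_parser)
df = df.set_index([ 'date_index' ])
# Index alignment copies the parsed dates back onto every row
df['New Date'] = dates
df = df.reset_index()
df.head()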
In my case, the execution time for a DataFrame with 3 million lines reduced from 30 seconds to about 1.5 seconds.
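The same "convert each distinct date only once" idea can also be sketched with `unique` and `map` instead of a groupby and index alignment. This is an alternative formulation, not the answer's code; the sample data and the names `unique_dates`, `parsed` and `lookup` are illustrative.

```python
import pandas as pd

# Hypothetical data with many repeated date strings
df = pd.DataFrame({"Date": ["01/02/2008", "01/02/2008", "03/15/2009", "01/02/2008"]})

# Parse each distinct string once, giving the known input format
unique_dates = df["Date"].unique()
parsed = pd.to_datetime(unique_dates, format="%m/%d/%Y")

# Build a string -> datetime.date lookup and apply it to every row
lookup = dict(zip(unique_dates, parsed.date))
df["New Date"] = df["Date"].map(lookup)

print(df)
```

How much this helps depends on how many distinct dates the column contains relative to its length.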
I'm not sure if this will help with the performance issue, as I haven't tested with a dataset of your size, but at least in theory, this should help. Pandas has a built in parameter you can use to specify that it should load a column as a date or datetime field. See the parse_dates parameter in the pandas documentation.
Simply pass in a list of columns that you want to be parsed as a date and pandas will convert the columns for you when creating the DataFrame. Then, you won't have to worry about looping back through the dataframe and attempting the conversion after.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=[0,2])
The above example would try to parse the 1st and 3rd (zero-based) columns as dates.
The type of each resulting column value will be a pandas timestamp and you can then use pandas to print this out however you'd like when working with the dataframe.
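A self-contained sketch of `parse_dates`, using an in-memory CSV in place of test.csv. The file contents and column names are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk
csv_text = "Date,Value\n01/02/2008,10\n01/03/2008,20\n"

# parse_dates accepts column positions or column names
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(df.dtypes)
```

The `Date` column arrives as datetime64, so no separate conversion pass over the DataFrame is needed.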
Following a lead in @pygo's comment, I found that my mistake was to try to read the data as
df['Date'] = pd.to_datetime(df['Date']).dt.date
This would be, as this answer explains:
This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).
As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True
With any of the date parsers from the answers above, pandas ends up in the slow element-by-element parsing loop. The same happens when the format passed to pd.to_datetime is the format we want, instead of the format the data is currently in.
Hence, instead of doing
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
or
df['Date'] = pd.to_datetime(df['Date']).dt.date
we should do
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.date
By supplying the current format of the data, it is read really fast into datetime format. Then, using .dt.date, it is fast to change it to the new format without the parser.
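A short sketch of this final approach on two sample values. The column names `as_date` and `as_text` are illustrative; the second shows how to get 'YYYY-MM-DD' strings if text output is what is needed rather than date objects.

```python
import pandas as pd

# Hypothetical input in MM/DD/YYYY form
df = pd.DataFrame({"Date": ["01/02/2008", "12/31/2009"]})

# Tell pandas the format the data is currently in
parsed = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Python date objects, as in the answer
df["as_date"] = parsed.dt.date

# Or plain strings in the target format
df["as_text"] = parsed.dt.strftime("%Y-%m-%d")

print(df)
```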
Thank you to everyone who helped!

Pandas - New Row for Each Day in Date Range

I have a Pandas df with one column (Reservation_Dt_Start) representing the start of a date range and another (Reservation_Dt_End) representing the end of a date range.
Rather than each row having a date range, I'd like to expand each row to have as many records as there are dates in the date range, with each new row representing one of those dates.
See the two pics below for an example input and the desired output.
The code snippet below works!! However, for every 250 rows in the input table, it takes 1 second to run. Given my input table is 120,000,000 rows in size, this code will take about one week to run.
pd.concat([pd.DataFrame({'Book_Dt': row.Book_Dt,
'Day_Of_Reservation': pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End),
'Pickup': row.Pickup,
'Dropoff' : row.Dropoff,
'Price': row.Price},
columns=['Book_Dt','Day_Of_Reservation', 'Pickup', 'Dropoff' , 'Price'])
for i, row in df.iterrows()], ignore_index=True)
There has to be a faster way to do this. Any ideas? Thanks!
pd.concat in a loop with a large dataset gets pretty slow as it will make a copy of the frame each time and return a new dataframe. You are attempting to do this 120m times. I would try to work with this data as a simple list of tuples instead then convert to dataframe at the end.
e.g.
Start with an empty list: rows = []
For each row in the dataframe:
get the dates in the range (pd.date_range still works here) and store them in a variable dates
for each date in dates, append a tuple to the list: rows.append((row.Book_Dt, date, row.Pickup, row.Dropoff, row.Price))
Finally you can convert the list of tuples to a dataframe:
df = pd.DataFrame(rows, columns = ['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])
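The steps above can be sketched as runnable code. The sample rows are made up, and using `itertuples` rather than `iterrows` is my own choice here, not something the answer specifies. The names `rows` and `expanded` are illustrative.

```python
import pandas as pd

# Hypothetical two-row input with the columns from the question
df = pd.DataFrame({
    "Book_Dt": ["2020-01-01", "2020-01-02"],
    "Reservation_Dt_Start": ["2020-02-01", "2020-03-10"],
    "Reservation_Dt_End": ["2020-02-03", "2020-03-10"],
    "Pickup": ["JFK", "LAX"],
    "Dropoff": ["LGA", "SFO"],
    "Price": [100.0, 80.0],
})

# Collect one tuple per reservation day in a plain Python list
rows = []
for row in df.itertuples(index=False):
    for day in pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End):
        rows.append((row.Book_Dt, day, row.Pickup, row.Dropoff, row.Price))

# Build the DataFrame once, at the end
expanded = pd.DataFrame(
    rows,
    columns=["Book_Dt", "Day_Of_Reservation", "Pickup", "Dropoff", "Price"],
)

print(expanded)
```

This still loops in Python, so whether it is fast enough for 120 million rows would need to be measured on the real data.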

Select date range from Pandas DataFrame

I have a list of dates in a DF that have been converted to a YYYY-MM format and need to select a range. This is what I'm trying:
#create dataframe
data = ['2016-01','2016-02','2016-09','2016-10','2016-11','2017-04','2017-05','2017-06','2017-07','2017-08']
df = pd.DataFrame(data, columns = ['date'])
#lookup range
df[df["date"].isin(pd.date_range('2016-01', '2016-06'))]
It doesn't seem to be working because the date column is no longer a datetime column. The format has to be in YYYY-MM. So I guess the question is, how can I make a datetime column with YYYY-MM? Can someone please help?
Thanks.
You do not need an actual datetime-type column or query values for this to work. Keep it simple:
df[df.date.between('2016-01', '2016-06')]
That gives:
date
0 2016-01
1 2016-02
It works because ISO 8601 date strings can be sorted as if they were plain strings. '2016-06' comes after '2016-05' and so on.
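A self-contained sketch using the data from the question, comparing the plain string filter with a datetime-based one. The datetime variant and the names `in_range`, `as_months` and `in_range_dt` are additions for illustration, not part of the answer.

```python
import pandas as pd

data = ['2016-01', '2016-02', '2016-09', '2016-10', '2016-11',
        '2017-04', '2017-05', '2017-06', '2017-07', '2017-08']
df = pd.DataFrame(data, columns=['date'])

# String comparison: works because 'YYYY-MM' sorts chronologically
in_range = df[df['date'].between('2016-01', '2016-06')]

# Equivalent filter on real datetimes (each string becomes the first of its month)
as_months = pd.to_datetime(df['date'], format='%Y-%m')
in_range_dt = df[as_months.between(pd.Timestamp('2016-01-01'), pd.Timestamp('2016-06-30'))]

print(in_range)
```

The string version only works for fixed-width, zero-padded formats ordered from largest to smallest unit; something like '1/2016' would not sort correctly.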
