Subset of Dataframe based on substring (python)

I have a pandas dataframe called Data filled with dates. An example date might look like: "2015-05-10 23:45:00". I want to look at the data in January only, so I want:
Data= Data[:][5:7]=="01"
This doesn't work though.
TL;DR: I'm wondering how to get a subset of a dataframe based on a substring.
Thanks!

Consider using a bracketed boolean filter on the datetime's month value. First, though, you will need to convert the string dates to datetime, which pandas' to_datetime() can handle:
import pandas as pd
...
Data['yourdatetimecolumn'] = pd.to_datetime(Data['yourdatetimecolumn'])
JanData = Data[Data['yourdatetimecolumn'].dt.month==1]
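As a small self-contained sketch of the approach above (the column name "when" and the sample values are made up for illustration), both a plain substring filter and the datetime filter pick out the January rows:

```python
import pandas as pd

# Hypothetical sample data; the column name "when" is an assumption.
Data = pd.DataFrame({"when": ["2015-01-03 08:00:00",
                              "2015-05-10 23:45:00",
                              "2016-01-20 12:30:00"]})

# Substring approach: characters 5 and 6 of "YYYY-MM-DD ..." hold the month.
jan_by_substring = Data[Data["when"].str[5:7] == "01"]

# Datetime approach: parse once, then filter on the month number.
parsed = pd.to_datetime(Data["when"])
jan_by_month = Data[parsed.dt.month == 1]
```

The substring version only works while every value follows the same fixed layout; the datetime version does not depend on character positions.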

Since your question is about dates, you may want to start by looking this up and giving it a try:
Parse a Pandas column to Datetime

Related

Pandas, select dates using input from list

here is my input df:
df:
date , name
1990-12-21, adam1
1990-12-22, adam2
1990-12-23, adam3
1990-12-24, adam4
1990-12-25, adam5
I want to select all dates after a given date taken from a list (always in first place)
list = ['1990-12-23','name','22']
df['date'] = pd.to_datetime(df['date'])
df = df[df.date > list[0]]
And it's working.
My question is: why does it work without converting the first element of the list to datetime format?
Pandas has flexible Partial String Indexing. This allows dates and times that can be automatically parsed into a datetime or timestamp to be used as strings without first converting them.
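A minimal sketch of this behaviour, rebuilding the frame from the question (the variable name params stands in for the question's list, to avoid shadowing the built-in): comparing the datetime column against the raw string gives the same rows as comparing against an explicit pd.Timestamp.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["1990-12-21", "1990-12-22", "1990-12-23",
             "1990-12-24", "1990-12-25"],
    "name": ["adam1", "adam2", "adam3", "adam4", "adam5"],
})
df["date"] = pd.to_datetime(df["date"])

params = ["1990-12-23", "name", "22"]

# The string is parsed to a timestamp by pandas during the comparison.
later = df[df["date"] > params[0]]

# Equivalent comparison with the conversion written out explicitly.
explicit = df[df["date"] > pd.Timestamp(params[0])]
```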

Converting dates when importing from CSV, OutOfBoundsDatetime: Out of bounds nanosecond timestamp. Pandas

I'm importing data from a csv, and I'm trying to set a specific date to today's date.
Data in the csv is formatted this way:
All data in that column are dates and are formatted exactly the same. I read in the data with df = pd.read_csv(r'<filepath.csv>') at the moment.
Then this is run to convert all instances of '7/21/2020' into today's date:
df['filedate'] = np.where(pd.to_datetime(df['filedate']) == '7/21/2020', pd.Timestamp('now').floor(freq='d'),df['filedate'])
I receive this error: pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-14 00:00:00
I don't want to use errors='coerce' because the column will always be 100% populated with real dates, and I will later need to filter the dataframe by date. There seems to be some "ghost" precision in the csv data I can't see. I cannot modify the csv column in this case and I can't use any packages outside of pandas and numpy.
...or alternatively .loc:
df.loc[df['filedate'] == '7/21/2020', 'filedate'] = pd.Timestamp('now').floor(freq='d')
Use the .replace() method and assign the result back, since it returns a new Series:
df['filedate'] = df['filedate'].replace({'7/21/2020':pd.Timestamp('now').floor(freq='d')})
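One way to read the suggestions above: if the comparison is done on the raw strings, pd.to_datetime is never called and the out-of-bounds value is never parsed. The sketch below assumes hypothetical sample data (including a made-up year-1 value like the one in the error message) and writes today's date back as a string in an assumed %m/%d/%Y layout so the column stays a single type.

```python
import pandas as pd

# Hypothetical data; "1/14/0001" mimics the out-of-range value from the error.
df = pd.DataFrame({"filedate": ["7/21/2020", "1/14/0001", "7/21/2020"]})

# Today's date as text; the zero-padded %m/%d/%Y layout is an assumption.
today = pd.Timestamp("now").floor("D").strftime("%m/%d/%Y")

# Plain string comparison: nothing is parsed, so nothing can overflow.
mask = df["filedate"] == "7/21/2020"
df.loc[mask, "filedate"] = today
```

The out-of-range row is left untouched here, so it would still need attention before any later conversion to datetime.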

Date Formatting Problem in pandas Dataframe

I have a Date column in my Dataframe. When I display the dates, the formats are mixed and seemingly random. How can I put them in one consistent format, like dd/mm/yyyy?
This is pseudo code since you did not give us your code. It assumes that the column date of a dataframe df is already correctly typed as datetime.
You can use the vectorized datetime function strftime() (see the docs):
df['date'].dt.strftime("%d/%m/%Y")
To keep the new format, assign the result back to the date column, like this:
df['date'] = df['date'].dt.strftime("%d/%m/%Y")
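A short sketch with made-up sample values shows the effect. Note that the result of strftime() is text, so the column no longer supports datetime operations afterwards.

```python
import pandas as pd

# Hypothetical sample data, parsed to datetime first.
df = pd.DataFrame({"date": ["2020-01-31", "2021-12-05"]})
df["date"] = pd.to_datetime(df["date"])

# Vectorized formatting to dd/mm/yyyy; the result holds strings.
formatted = df["date"].dt.strftime("%d/%m/%Y")
```

Keeping the datetime column and formatting only for display or export is often the more flexible choice.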

Efficient way to convert a column into datetime

I have a dataframe with a column (object) holding values like 18-JUN-18 12.00.00.000000000 AM.
I need to extract only the "18-JUN-18" part and then convert the column to datetime.
The code below takes a lot of time because my dataframe is huge:
frame['PURCHASE_DATE'] = frame['PURCHASE_DATE'].apply(lambda x: str(x)[:10])
frame['PURCHASE_DATE'] = pd.to_datetime(frame.PURCHASE_DATE)
Is there a way to optimize it?
You can use strptime to convert each value directly (it works on a single string, so it has to be applied per element):
from datetime import datetime
frame['PURCHASE_DATE'] = frame['PURCHASE_DATE'].apply(lambda x: datetime.strptime(str(x)[:9], '%d-%b-%y'))
I'm not sure this is more efficient though.
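A vectorized alternative, offered as a sketch rather than a measured improvement: slice with the .str accessor and give pd.to_datetime an explicit format so it does not have to guess. The sample rows are hypothetical, and the month abbreviations are assumed to be English.

```python
import pandas as pd

# Hypothetical sample rows in the layout described in the question.
frame = pd.DataFrame({"PURCHASE_DATE": [
    "18-JUN-18 12.00.00.000000000 AM",
    "03-JAN-19 12.00.00.000000000 AM",
]})

# Keep the first 9 characters ("18-JUN-18") and parse with a fixed format.
frame["PURCHASE_DATE"] = pd.to_datetime(
    frame["PURCHASE_DATE"].str[:9], format="%d-%b-%y"
)
```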

Faster solution for date formatting

I am trying to change the format of the date in a pandas dataframe.
If I check the date in the beginning, I have:
df['Date'][0]
Out[158]: '01/02/2008'
Then, I use:
df['Date'] = pd.to_datetime(df['Date']).dt.date
To change the format to
df['Date'][0]
Out[157]: datetime.date(2008, 1, 2)
However, this takes a veeeery long time, since my dataframe has millions of rows.
All I want to do is change the date format from MM-DD-YYYY to YYYY-MM-DD.
How can I do it in a faster way?
You should first collapse by Date using the groupby method to reduce the dimensionality of the problem.
Then you parse the dates into the new format and merge the results back into the original DataFrame.
This requires some time because of the merging, but it takes advantage of the fact that many dates are repeated a large number of times. You want to convert each date only once!
You can use the following code:
from datetime import datetime  # pd.datetime is no longer available in recent pandas
date_parser = lambda x: datetime.strptime(str(x), '%m/%d/%Y')
df['date_index'] = df['Date']
dates = df.groupby(['date_index']).first()['Date'].apply(date_parser)
df = df.set_index([ 'date_index' ])
df['New Date'] = dates
df = df.reset_index()
df.head()
In my case, the execution time for a DataFrame with 3 million lines reduced from 30 seconds to about 1.5 seconds.
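The same "convert each distinct date only once" idea can also be sketched without groupby, using unique() and map(). This is an alternative formulation under the same assumption that dates repeat often, and the sample data is made up.

```python
import pandas as pd

# Hypothetical data with many repeated date strings.
df = pd.DataFrame({"Date": ["01/02/2008", "01/02/2008",
                            "03/15/2009", "01/02/2008"]})

# Parse each distinct string once...
unique_dates = pd.Series(df["Date"].unique())
parsed = pd.to_datetime(unique_dates, format="%m/%d/%Y").dt.date

# ...then look the parsed value up for every row.
lookup = dict(zip(unique_dates, parsed))
df["New Date"] = df["Date"].map(lookup)
```

How much this saves depends on how many distinct dates there are compared with the number of rows.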
I'm not sure if this will help with the performance issue, as I haven't tested with a dataset of your size, but at least in theory, this should help. Pandas has a built in parameter you can use to specify that it should load a column as a date or datetime field. See the parse_dates parameter in the pandas documentation.
Simply pass in a list of columns that you want to be parsed as a date and pandas will convert the columns for you when creating the DataFrame. Then, you won't have to worry about looping back through the dataframe and attempting the conversion after.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=[0,2])
The above example would try to parse the 1st and 3rd (zero-based) columns as dates.
The type of each resulting column value will be a pandas timestamp and you can then use pandas to print this out however you'd like when working with the dataframe.
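A self-contained sketch of parse_dates, using an in-memory CSV in place of test.csv (the column names and values are invented for illustration):

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = (
    "Date,value,Updated\n"
    "2008-01-02,1,2008-02-03\n"
    "2008-01-05,2,2008-02-07\n"
)

# Columns 0 and 2 (zero-based) are parsed as dates while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=[0, 2])
```

After this, df["Date"] and df["Updated"] have a datetime dtype, while df["value"] stays numeric.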
Following a lead in @pygo's comment, I found that my mistake was to try to read the data as
df['Date'] = pd.to_datetime(df['Date']).dt.date
This is slow because, as this answer explains:
This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).
As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True
When using any of the date parsers from the answers above, we end up in the slow per-element loop. The same happens when the format passed to pd.to_datetime is the format we want rather than the format the data is currently in.
Hence, instead of doing
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
or
df['Date'] = pd.to_datetime(df['Date']).dt.date
we should do
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.date
By supplying the current format of the data, it is read really fast into datetime format. Then, using .dt.date, it is fast to change it to the new format without the parser.
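As a runnable sketch of that last line, with hypothetical sample values in the MM/DD/YYYY layout from the question:

```python
import pandas as pd

# Hypothetical sample data in the source layout MM/DD/YYYY.
df = pd.DataFrame({"Date": ["01/02/2008", "12/31/2009"]})

# format describes how the strings look now, not the desired output.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y").dt.date
```

Each value is then a datetime.date, which prints as YYYY-MM-DD.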
Thank you to everyone who helped!
