Python datetime conversion when dates may contain whitespace - python

I have a .csv file with a date column, and the date looks like this.
date
2016年 4月 1日 <-- there are whitespaces in this row
...
2016年10月10日
The dates are in Japanese date format. I'm trying to convert this column to 'YYYY-MM-DD', and the Python code I'm using is below.
data['date'] = [datetime.datetime.strptime(d, '%Y年%m月%d日').date() for d in data['date']]
There is one problem: the date column in the .csv may contain whitespace when the month or day is a single digit, and my code fails when it does.
Any solutions?

In pandas it is best to avoid list comprehensions when a vectorized solution exists, both for performance and because a list comprehension does not handle NaNs.
I think you need to remove the whitespace by replacing \s+ (one or more whitespace characters) with an empty string, convert to datetimes with pandas.to_datetime, and finally use .dt.date to get dates:
data['date'] = (pd.to_datetime(data['date'].str.replace(r'\s+', '', regex=True), format='%Y年%m月%d日')
                .dt.date)
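A small worked example of the vectorized approach, including a missing value, might look like the sketch below. The sample frame and the names cleaned and parsed are made up for illustration. regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical sample data: one padded date, one unpadded date, one missing value.
data = pd.DataFrame({"date": ["2016年 4月 1日", "2016年10月10日", None]})

# Remove every run of whitespace, then parse with an explicit format.
cleaned = data["date"].str.replace(r"\s+", "", regex=True)
parsed = pd.to_datetime(cleaned, format="%Y年%m月%d日")

# .dt.date turns the timestamps into datetime.date objects;
# the missing value stays missing instead of raising an error.
data["date"] = parsed.dt.date
print(data)
```

If you want to keep a real datetime column (for filtering or resampling later), stop at parsed and skip the .dt.date step.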
Performance:
The plot was created with perfplot:
import datetime
import pandas as pd
import perfplot

def list_compr(df):
    df['date1'] = [datetime.datetime.strptime(d.replace(" ", ""), '%Y年%m月%d日').date() for d in df['date']]
    return df

def vector(df):
    df['date2'] = (pd.to_datetime(df['date'].str.replace(r'\s+', '', regex=True), format='%Y年%m月%d日').dt.date)
    return df

def make_df(n):
    df = pd.DataFrame({'date':['2016年 4月 1日','2016年10月10日']})
    df = pd.concat([df] * n, ignore_index=True)
    return df

perfplot.show(
    setup=make_df,
    kernels=[list_compr, vector],
    n_range=[2**k for k in range(2, 13)],
    logx=True,
    logy=True,
    equality_check=False, # rows may appear in different order
    xlabel='len(df)')

I don't know Python actually, but wouldn't something like replacing d in strptime with d.replace(" ", "") do the trick?
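That suggestion does work with plain strptime. Here is a minimal sketch using only the standard library; the helper name parse_jp_date is made up for illustration, and it uses split/join rather than replace(" ", "") so that tabs and other whitespace characters are removed as well.

```python
import datetime

def parse_jp_date(text):
    """Parse a date like '2016年 4月 1日', ignoring any whitespace inside it."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes all of it.
    compact = "".join(text.split())
    return datetime.datetime.strptime(compact, "%Y年%m月%d日").date()

print(parse_jp_date("2016年 4月 1日").isoformat())
print(parse_jp_date("2016年10月10日").isoformat())
```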

Related

convert to datetime and format date in pandas in a single oneliner

I have a dataframe with two columns containing unformatted dates.
The data in these columns looks like this:
2011-06-10T00:00:00.000+02:00
I would like to get just the date and format it.
In a Jupyter notebook I do the following:
sections['produced'] = pd.to_datetime(sections['produced'])
sections['produced'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in sections['produced']]
sections['updated'] = pd.to_datetime(sections['updated'])
sections['updated'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in sections['updated']]
sections.info()
Then I print out the sections dataframe, and the dates are indeed printed correctly.
BUT:
sections.info()
still tells me that those columns are non-null objects and not datetime.
Why?
Secondly, my approach does not seem to work under the hood, i.e. the column types are not actually dates.
What should I do?
And lastly, the code is very verbose for something that should be a one-liner, shouldn't it? (i.e. pandas is powerful but has its limits)
EDIT 1: Answering some of the contributors: I expect a datetime that shows just the day, e.g. 2008-02-02.
So when doing:
sections['updated'] = pd.to_datetime(sections['updated'])
the column type is converted to datetime,
but it is lost again in the next step:
sections['produced'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in sections['produced']]
So the aim here is to a) convert to datetime format, b) get the date in the format 2008-01-02 (I don't care about seconds), and c) have it printed in the Jupyter notebook as such, i.e. as a date.
Just pass the errors parameter to the to_datetime() method and set it to 'coerce':
sections['produced'] = pd.to_datetime(sections['produced'],errors='coerce')
sections['updated'] = pd.to_datetime(sections['updated'],errors='coerce')
This should work as a one liner:
df[['produced','updated']] = df[['produced','updated']].apply(lambda x: pd.to_datetime(x,errors='coerce'))
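As for why info() reports object: strftime returns strings, so a column built from it is no longer a datetime column. A rough sketch of the difference is below. The sample values, the names as_text and as_dates, and the choice to drop the UTC offset with tz_localize(None) are assumptions for illustration, not part of the question.

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's data.
sections = pd.DataFrame({"produced": ["2011-06-10T00:00:00.000+02:00", None]})

parsed = pd.to_datetime(sections["produced"], errors="coerce")

# Option 1: format as text. Looks like a date when printed,
# but the column now holds strings, not datetimes.
as_text = parsed.dt.strftime("%Y-%m-%d")

# Option 2: keep a datetime dtype. Drop the offset (keeping local wall time)
# and reset the time of day to midnight.
as_dates = parsed.dt.tz_localize(None).dt.normalize()

print(as_text.dtype, as_dates.dtype)
```

With option 2 the column stays a datetime, and pandas prints only the date part when every time in the column is midnight.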

String operation on Series [duplicate]

Assuming that I have a pandas dataframe and I want to add thousand separators to all the numbers (integer and float), what is an easy and quick way to do it?
When formatting a number with a comma as the thousands separator, you can just use '{:,}'.format:
n = 10000
print('{:,}'.format(n))
n = 1000.1
print('{:,}'.format(n))
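For reference, here is what the ',' format option produces on a few values. This is a standard-library-only sketch; the helper name with_thousands is made up for illustration.

```python
def with_thousands(value):
    """Format an int or float with ',' as the thousands separator."""
    return "{:,}".format(value)

print(with_thousands(10000))        # 10,000
print(with_thousands(1000.1))       # 1,000.1
print(with_thousands(1234567.891))  # 1,234,567.891
```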
In pandas, you can use the formatters parameter to to_html as discussed here.
num_format = lambda x: '{:,}'.format(x)
def build_formatters(df, format):
    return {
        column:format
        for column, dtype in df.dtypes.items()
        if dtype in [ np.dtype('int64'), np.dtype('float64') ]
    }
formatters = build_formatters(data_frame, num_format)
data_frame.to_html(formatters=formatters)
Adding the thousands separator has actually been discussed quite a bit on stackoverflow. You can read here or here.
Use Series.map or Series.apply with one of these solutions:
df['col'] = df['col'].map('{:,}'.format)
df['col'] = df['col'].map(lambda x: f'{x:,}')
df['col'] = df['col'].apply('{:,}'.format)
df['col'] = df['col'].apply(lambda x: f'{x:,}')
Assuming you just want to display (or render to html) the floats/integers with a thousands separator you can use styling which was added in version 0.17.1:
import pandas as pd
df = pd.DataFrame({'int': [1200, 320], 'flt': [5300.57, 12000000.23]})
df.style.format('{:,}')
To render this output to HTML, use the render method on the Styler (Styler.to_html in newer pandas versions).
Steps
use df.applymap() to apply a function to every cell in your dataframe
check if cell value is of type int or float
format numbers using f'{x:,d}' for integers and f'{x:,f}' for floats
Here is a simple example for integers only:
df = df.applymap(lambda x: f'{x:,d}' if isinstance(x, int) else x)
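A version of these steps that handles both integers and floats could look like the sketch below. It is only an illustration: the function name add_separators and the choice of two decimal places for floats are assumptions, and it goes column by column with Series.map because applymap has been deprecated in recent pandas versions.

```python
import numbers
import pandas as pd

def add_separators(x):
    """Return x formatted with thousands separators if it is a number."""
    if isinstance(x, bool):
        return x  # bool is a subclass of int, so leave it untouched
    if isinstance(x, numbers.Integral):
        return f"{int(x):,d}"
    if isinstance(x, numbers.Real):
        return f"{float(x):,.2f}"  # two decimals is an arbitrary choice here
    return x  # strings and other objects pass through unchanged

df = pd.DataFrame({"int": [1200, 320],
                   "flt": [5300.57, 12000000.23],
                   "txt": ["a", "b"]})

# Apply the function to every cell, one column at a time.
formatted = df.apply(lambda col: col.map(add_separators))
print(formatted)
```

Note that the result holds strings, so do this only as a last step before display or export.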
If you want "." as the thousands separator and "," as the decimal separator, this will work (regex=False makes "." a literal dot rather than a regex wildcard):
Data = pd.read_excel(path)
Data[my_numbers] = Data[my_numbers].map('{:,.2f}'.format).str.replace(",", "~", regex=False).str.replace(".", ",", regex=False).str.replace("~", ".", regex=False)
If you want three decimals instead of two, change "2f" --> "3f":
Data[my_numbers] = Data[my_numbers].map('{:,.3f}'.format).str.replace(",", "~", regex=False).str.replace(".", ",", regex=False).str.replace("~", ".", regex=False)
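The three chained replaces with the "~" placeholder can also be done in one pass with str.translate. This is a standard-library sketch of the same swap; SWAP and european_format are made-up names for illustration.

```python
# Translation table that swaps ',' and '.' in a single pass,
# so no temporary placeholder character is needed.
SWAP = str.maketrans({",": ".", ".": ","})

def european_format(value, decimals=2):
    """Format a number with '.' for thousands and ',' for decimals."""
    return f"{value:,.{decimals}f}".translate(SWAP)

print(european_format(1234567.891))   # 1.234.567,89
print(european_format(1234.5, 3))     # 1.234,500
```

On a pandas column this could be used as Data[my_numbers].map(european_format).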
The formatters parameter in to_html will take a dictionary.
Click the example link for reference

Python Pandas filtering dataframe on date

I am trying to manipulate a CSV file based on a date in a certain column.
I am using pandas (total noob) for that and was pretty successful until I got to dates.
The CSV looks something like this (with more columns and rows of course):
Circuit    Status         Effective Date
XXXX001    Operational    31-DEC-2007
I tried DataFrame.query (which I use for everything else) without success.
I tried DataFrame.loc (which worked for everything else) without success.
How can I get all rows that are older or newer than a given date? If I have other conditions to filter the dataframe, how do I combine them with the date filter?
Here's my "raw" code:
import pandas as pd
# parse_dates = ['Effective Date']
# dtypes = {'Effective Date': 'str'}
df = pd.read_csv("example.csv", dtype=object)
# , parse_dates=parse_dates, infer_datetime_format=True
# tried lot of suggestions found on SO
cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df.columns = cols
status1 = 'Suppressed'
status2 = 'Order Aborted'
pool = '2'
region = 'EU'
date1 = '31-DEC-2017'
filt_df = df.query('Status != @status1 and Status != @status2 and Pool == @pool and Region_A == @region')
filt_df.reset_index(drop=True, inplace=True)
filt_df.to_csv('filtered.csv')
# this is working pretty well
supp_df = df.query('Status == @status1 and Effective_Date < @date1')
supp_df.reset_index(drop=True, inplace=True)
supp_df.to_csv('supp.csv')
# this is what is not working at all
I tried many approaches, but I was not able to put it together. This is just one of the many approaches I tried, so I know it is perhaps completely wrong, as no date parsing is used.
supp.csv is saved, but the dates in it are all over the place, so they do not match the "logic" in this code.
Thanks for any help!
Make sure you convert your date column to datetime and then filter on it.
df['Effective Date'] = pd.to_datetime(df['Effective Date'])
df[df['Effective Date'] < '2017-12-31']
# This returns all the rows with dates before the 31st of December, 2017.
# You can also use query.
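To combine the date filter with other conditions, something along these lines should work. The sample rows are invented, the column is assumed to be already renamed to Effective_Date as in the question's code, and the explicit format assumes English month abbreviations.

```python
import pandas as pd

# Hypothetical rows in the same shape as the question's CSV.
df = pd.DataFrame({
    "Circuit": ["XXXX001", "XXXX002", "XXXX003"],
    "Status": ["Suppressed", "Operational", "Suppressed"],
    "Effective_Date": ["31-DEC-2007", "15-JAN-2018", "01-MAR-2019"],
})

# Parse the strings into real datetimes; comparing the raw strings
# would sort them alphabetically, not chronologically.
df["Effective_Date"] = pd.to_datetime(df["Effective_Date"], format="%d-%b-%Y")

status1 = "Suppressed"
cutoff = pd.Timestamp("2017-12-31")

# Boolean masks: wrap each condition in parentheses and join with & (and) or | (or).
older = df[(df["Status"] == status1) & (df["Effective_Date"] < cutoff)]

# The same filter with query, referencing local variables via @.
older_q = df.query("Status == @status1 and Effective_Date < @cutoff")

print(older)
```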

Pandas: How to read ill-formatted time data?

The time in my dataframe consists of 2 columns: date and HrMn, like this:
How can I read them into time, and plot a time series plot? (There are other value columns, for example, speed).
I think I can get away with time.strptime('19900125'+'1200','%Y%m%d%H%M')
But the problem is that, when read from the csv, HrMn at 0000 would be parsed as 0, so
time.strptime('19900125'+'0','%Y%m%d%H%M') will fail.
UPDATE:
My current approach:
# When reading the data, parse HrMn as string
df = pd.read_csv(uipath,header=0, skipinitialspace=True, dtype={'HrMn': str})
df['time']=df.apply(lambda x:datetime.strptime("{0} {1}".format(x['date'],x['HrMn']), "%Y%m%d %H%M"),axis=1)# df.temp_date
df.index= df['time']
# Then parse it again as int
df['HrMn'] = df['HrMn'].astype(int)
You can use pd.to_datetime after you've transformed it into a string that looks like a date:
def to_date_str(r):
    d = r.date[: 4] + '-' + r.date[4: 6] + '-' + r.date[6: 8]
    d += ' '+ r.HrMn[: 2] + ':' + r.HrMn[2: 4]
    return d
>>> pd.to_datetime(df[['date', 'HrMn']].apply(to_date_str, axis=1))
0   1990-01-25 12:00:00
dtype: datetime64[ns]
Edit
As @EdChum comments, you can do this even more simply as
pd.to_datetime(df.date.astype(str) + df.HrMn)
which string-concatenates the columns.
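If HrMn was already read as an integer (so 0000 became 0), zero-padding it back to four digits before concatenating avoids the strptime failure described in the question. A minimal sketch, with made-up sample rows and an explicit format:

```python
import pandas as pd

# Hypothetical rows: HrMn was parsed as int, so midnight shows up as 0.
df = pd.DataFrame({"date": [19900125, 19900125],
                   "HrMn": [0, 1200],
                   "speed": [3.1, 4.2]})

# Restore the HHMM shape: 0 -> '0000', 1200 -> '1200'.
hrmn = df["HrMn"].astype(str).str.zfill(4)

# Concatenate the two string columns and parse them in one vectorized call.
df["time"] = pd.to_datetime(df["date"].astype(str) + hrmn, format="%Y%m%d%H%M")
df = df.set_index("time")

print(df)
```

With the datetime index in place, df['speed'].plot() gives the time series plot.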
You may parse the dates directly while reading the CSV, where HrMn is zero padded as HHMM, i.e. a value of 0 will represent 00:00:
from datetime import datetime  # pd.datetime is no longer available in recent pandas

df = pd.read_csv(
    uipath,
    header=0,
    skipinitialspace=True,
    dtype={'HrMn': str},
    parse_dates={'datetime': ['date', 'HrMn']},
    date_parser=lambda x, y: datetime.strptime('{0}{1:04.0f}'.format(x, int(y)),
                                               '%Y%m%d%H%M'),
    index_col='datetime'
)
I don't get why you call it "ill formatted"; that format is actually quite common and pandas can parse it as is. Just specify which columns you want to parse as timestamps.
df = pd.read_csv(uipath, skipinitialspace=True,
                 parse_dates=[['date', 'HrMn']])
