Getting a time index in Python for a pandas DataFrame

I'm having a bit of trouble getting the right time index for my pandas dataframe.
import pandas as pd
from datetime import datetime
import numpy as np
stockdata = pd.read_csv("/home/stff/symbol_2012-02.csv", parse_dates =[[0,1,2]])
stockdata.columns = ['date_time','ticker','exch','salcond','vol','price','stopstockind','corrind','seqnum','source','trf','symroot','symsuffix']
I think the problem is that the time information comes in the first three columns: year/month/day, hour/minute/second, and millisecond. Also, the hour/minute/second column drops the leading zero if it's before noon.
print(stockdata['date_time'][0])
20120201 41206 300
print(stockdata['date_time'][50000])
20120201 151117 770
Ideally, I would like to define my own function that could be called by the converters argument in the read_csv function.

Suppose you have a csv file that looks like this:
date,time,milliseconds,value
20120201,41206,300,1
20120201,151117,770,2
Then, using the parse_dates, index_col and date_parser parameters of the read_csv method, one can construct a pandas DataFrame with a time index like this:
import datetime as dt
import pandas as pd
parse = lambda x: dt.datetime.strptime(x, '%Y%m%d %H%M%S %f')
df = pd.read_csv('test.csv', parse_dates=[['date', 'time', 'milliseconds']],
                 index_col=0, date_parser=parse)
This yields:
                            value
date_time_milliseconds
2012-02-01 04:12:06.300000      1
2012-02-01 15:11:17.770000      2
And df.index:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-02-01 04:12:06.300000, 2012-02-01 15:11:17.770000]
Length: 2, Freq: None, Timezone: None
This answer is based on a similar solution proposed here.
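Applied to the original file, the same approach might look roughly like the sketch below (untested against the real data). It assumes, as the question describes, that the first three columns hold date (YYYYMMDD), time (HMMSS or HHMMSS) and milliseconds, and that the remaining columns can be renamed afterwards just as the question does; depending on whether the file has a header row, header=None may also be needed. Here date_parser plays the role the question wanted for converters, and %H copes with the dropped leading zero for pre-noon times because strptime accepts one- or two-digit fields.
import datetime as dt
import pandas as pd

parse = lambda x: dt.datetime.strptime(x, '%Y%m%d %H%M%S %f')

stockdata = pd.read_csv("/home/stff/symbol_2012-02.csv",
                        parse_dates=[[0, 1, 2]],   # combine the three leading columns
                        date_parser=parse,
                        index_col=0)               # use the combined column as the index
stockdata.columns = ['ticker', 'exch', 'salcond', 'vol', 'price', 'stopstockind',
                     'corrind', 'seqnum', 'source', 'trf', 'symroot', 'symsuffix']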

Related

Changing date column of csv with Python Pandas

I have a csv file like this:
Tarih, Şimdi, Açılış, Yüksek, Düşük, Hac., Fark %
31.05.2022, 8,28, 8,25, 8,38, 8,23, 108,84M, 0,61%
(more than a thousand lines)
I want to change it like this:
Tarih, Şimdi, Açılış, Yüksek, Düşük, Hac., Fark %
5/31/2022, 8.28, 8.25, 8.38, 8.23, 108.84M, 0.61%
Especially "Date" format is Day.Month.Year and I need to put it in Month/Day/Year format.
I wrote the code like this:
import pandas as pd
import numpy as np
import datetime
data=pd.read_csv("qwe.csv", encoding= 'utf-8')
df.Tarih=df.Tarih.str.replace(".","/")
df.Şimdi=df.Şimdi.str.replace(",",".")
df.Açılış=df.Açılış.str.replace(",",".")
df.Yüksek=df.Yüksek.str.replace(",",".")
df.Düşük=df.Düşük.str.replace(",",".")
for i in df['Tarih']:
    q = 1
    datetime_obj = datetime.datetime.strptime(i, "%d/%m/%Y")
    df['Tarih'].loc[df['Tarih'].values == q] = datetime_obj
But the "for" loop in my code doesn't work. I need help on this. Thank you
Just looking at converting the date: you can parse the column into datetime objects with arguments to pd.read_csv, then convert to your desired format by applying strftime to each entry.
If I have the following tmp.csv:
date, value
30.05.2022, 4.2
31.05.2022, 42
01.06.2022, 420
import pandas as pd
df = pd.read_csv('tmp.csv', parse_dates=['date'], dayfirst=True)
df['date'] = df['date'].dt.strftime('%m/%d/%Y')
print(df)
output:
date value
0 05/30/2022 4.2
1 05/31/2022 42.0
2 06/01/2022 420.0
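The question also asks about turning the comma decimal marks into dots. A possible extension, sketched under the assumption that the file actually parses into those columns (i.e. the numeric fields survive read_csv intact) and using the column names from the question, converts everything in vectorized steps and so also replaces the broken for loop:
import pandas as pd

df = pd.read_csv("qwe.csv", encoding='utf-8')
# Replace the decimal comma and cast the purely numeric columns to float
for col in ['Şimdi', 'Açılış', 'Yüksek', 'Düşük']:
    df[col] = df[col].str.replace(',', '.', regex=False).astype(float)
# Reformat the day-first dates to Month/Day/Year in one go
df['Tarih'] = pd.to_datetime(df['Tarih'], format='%d.%m.%Y').dt.strftime('%m/%d/%Y')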

Is it possible for a pandas dataframe column to have datetime.date type?

I am using cx_Oracle to fetch data from databases. I would like to put the fetched data into a pandas DataFrame. My problem is that the dates are converted to numpy.datetime64 objects, which I absolutely don't need.
I would like to have them as datetime.date objects. I have seen the dt.date method, but it still gives back numpy datetime types.
Edit: It appears that with pandas 0.21.0 or newer, there is no problem holding python datetime.dates in a DataFrame. date-like columns are not automatically converted to datetime64[ns] dtype.
import numpy as np
import pandas as pd
import datetime as DT
print(pd.__version__)
# 0.21.0.dev+25.g50e95e0
dates = [DT.date(2017,1,1)+DT.timedelta(days=2*i) for i in range(3)]
df = pd.DataFrame({'dates': dates, 'foo': np.arange(len(dates))})
print(all([isinstance(item, DT.date) for item in df['dates']]))
# True
df['dates'] = (df['dates'] + pd.Timedelta(days=1))
print(all([isinstance(item, DT.date) for item in df['dates']]))
# True
For older versions of Pandas:
There is a way to prevent a Pandas DataFrame from automatically converting
datelike values to datetime64[ns] by assigning an additional value such as an
empty string which is not datelike to the column. After the DataFrame is
formed, you can remove the non-datelike value:
import pandas as pd
import datetime as DT
dates = [DT.date(2017,1,1)+DT.timedelta(days=i) for i in range(10)]
df = pd.DataFrame({'dates':['']+dates})
df = df.iloc[1:]
print(all([isinstance(item, DT.date) for item in df['dates']]))
# True
Clearly, programming this kind of shenanigan into serious code feels entirely wrong since we're subverting the intent of the developers.
There are also computational speed advantages to using datetime64[ns]s over lists or object arrays of datetime.dates.
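As a rough illustration (a sketch only, with a made-up data size; actual timings will vary), one can build the same dates both ways and compare vectorized operations on them:
import datetime as DT
import pandas as pd

n = 100_000
dates = [DT.date(2017, 1, 1) + DT.timedelta(days=i % 365) for i in range(n)]

as_object = pd.Series(dates)         # object dtype holding python datetime.date values
as_dt64 = pd.to_datetime(as_object)  # datetime64[ns] dtype

# Reductions and comparisons on the datetime64[ns] Series are typically much faster:
# %timeit as_object.max()
# %timeit as_dt64.max()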
Moreover, if df[col] has dtype datetime64[ns] then df[col].dt.date.values returns an object NumPy array of python datetime.dates:
import pandas as pd
import datetime as DT
dates = [DT.datetime(2017,1,1)+DT.timedelta(days=2*i) for i in range(3)]
df = pd.DataFrame({'dates': dates})
print(repr(df['dates'].dt.date.values))
# array([datetime.date(2017, 1, 1), datetime.date(2017, 1, 3),
# datetime.date(2017, 1, 5)], dtype=object)
So you could perhaps enjoy the best of both worlds by keeping the column as datetime64[ns] and using df[col].dt.date.values to obtain datetime.dates when necessary.
On the other hand, the datetime64[ns]s and Python datetime.dates have different ranges of representable dates.
datetime64[ns]s can represent datetimes from 1678 AD to 2262 AD.
datetime.dates can represent dates from DT.date(1, 1, 1) to DT.date(9999, 12, 31).
If the reason why you want to use datetime.dates instead of datetime64[ns]s is to overcome the limited range of representable dates, then perhaps a better alternative is to use a pd.PeriodIndex:
import pandas as pd
import datetime as DT
dates = [DT.date(2017,1,1)+DT.timedelta(days=2*i) for i in range(10)]
df = pd.DataFrame({'dates':pd.PeriodIndex(dates, freq='D')})
print(df)
# dates
# 0 2017-01-01
# 1 2017-01-03
# 2 2017-01-05
# 3 2017-01-07
# 4 2017-01-09
# 5 2017-01-11
# 6 2017-01-13
# 7 2017-01-15
# 8 2017-01-17
# 9 2017-01-19

Pandas date_range: drop nanoseconds

I'm trying to create the following date range series with quarterly frequency:
import pandas as pd
pd.date_range(start = "1980", periods = 5, freq = 'Q-Dec')
Which returns
DatetimeIndex(['1980-03-31', '1980-06-30', '1980-09-30', '1980-12-31',
               '1981-03-31'],
              dtype='datetime64[ns]', freq='Q-DEC')
However, when I output this object to CSV I see the time portion of the series. e.g.
1980-03-31 00:00:00, 1980-06-30 00:00:00
Any idea how I can get rid of the time portion so when I export to a csv it just shows:
1980-03-31, 1980-06-30
You are not seeing nanoseconds; you are just seeing the time portion of the datetimes that pandas has created for the DatetimeIndex. You can extract only the date portion with .date().
Code:
import pandas as pd
dr = pd.date_range(start="1980", periods=5, freq='Q-Dec')
print(dr[0].date())
Results:
1980-03-31
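To strip the time portion from the whole range when exporting, a couple of options (sketched here with an assumed DataFrame built on that index; the file names are illustrative) are to pass date_format to to_csv, or to convert the index to plain datetime.date objects first:
import pandas as pd

dr = pd.date_range(start="1980", periods=5, freq='Q-DEC')
df = pd.DataFrame({'value': range(5)}, index=dr)

# Option 1: format the index while writing
df.to_csv('quarterly.csv', date_format='%Y-%m-%d')

# Option 2: convert the index to datetime.date objects before writing
df.index = dr.date
df.to_csv('quarterly_dates.csv')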

Convert csv column from Epoch time to human readable minutes

I have a pandas.DataFrame indexed by time, as seen below. The time is in Epoch time. When I graph the second column these time values display along the x-axis. I want a more readable time in minutes:seconds.
In [13]: print df.head()
Time
1481044277379 0.581858
1481044277384 0.581858
1481044277417 0.581858
1481044277418 0.581858
1481044277467 0.581858
I have tried some pandas functions and some methods for converting the whole column; I visited the pandas docs, this question and the cool site.
I am using pandas 0.18.1
If you read your data with read_csv you can use a custom dateparser:
import pandas as pd
#example.csv
'''
Time,Value
1481044277379,0.581858
1481044277384,0.581858
1481044277417,0.581858
1481044277418,0.581858
1481044277467,0.581858
'''
import datetime

def dateparse(time_in_secs):
    time_in_secs = float(time_in_secs) / 1000
    return datetime.datetime.fromtimestamp(time_in_secs)

dtype = {"Time": float, "Value": float}
df = pd.read_csv("example.csv", dtype=dtype, parse_dates=["Time"], date_parser=dateparse)
print(df)
You can convert an epoch timestamp to HH:MM with:
import datetime as dt
hours_mins = dt.datetime.fromtimestamp(1347517370).strftime('%H:%M')
Adding a column to your pandas.DataFrame can be done as:
df['H_M'] = pd.Series([dt.datetime.fromtimestamp(int(ts)).strftime('%H:%M')
                       for ts in df['timestamp']]).values
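Since the index in the question holds epoch milliseconds, another option (a sketch with a tiny made-up frame) is to convert the whole index in one vectorized call with pd.to_datetime and unit='ms':
import pandas as pd

df = pd.DataFrame({'value': [0.581858, 0.581858]},
                  index=[1481044277379, 1481044277384])

df.index = pd.to_datetime(df.index, unit='ms')  # vectorized epoch-ms conversion
print(df.index.strftime('%M:%S'))               # minutes:seconds labels for plotting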

Can pandas automatically read dates from a CSV file?

Today I was positively surprised by the fact that, while reading data from a data file (for example), pandas is able to recognize the types of values:
df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])
For example it can be checked in this way:
for i, r in df.iterrows():
    print(type(r['col1']), type(r['col2']), type(r['col3']))
In particular, integers, floats and strings were recognized correctly. However, I have a column that has dates in the following format: 2013-6-4. These dates were recognized as strings (not as python date objects). Is there a way to "teach" pandas to recognize dates?
You should add parse_dates=True or parse_dates=['column name'] when reading; that's usually enough to magically parse it. But there are always weird formats which need to be defined manually. In such a case you can also add a date parser function, which is the most flexible way possible.
Suppose you have a column 'datetime' with your string, then:
from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
This way you can even combine multiple columns into a single datetime column; the following merges a 'date' and a 'time' column into a single 'datetime' column:
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
You can find the directives (i.e. the letters to be used for different formats) for strptime and strftime on this page.
Perhaps the pandas interface has changed since @Rutger answered, but in the version I'm using (0.15.2), the date_parser function receives a list of dates instead of a single value. In this case, his code should be updated like so:
from datetime import datetime
import pandas as pd
dateparse = lambda dates: [datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]
df = pd.read_csv('test.dat', parse_dates=['datetime'], date_parser=dateparse)
Since the original question asker said he wants dates and the dates are in 2013-6-4 format, the dateparse function should really be:
dateparse = lambda dates: [datetime.strptime(d, '%Y-%m-%d').date() for d in dates]
You could use pandas.to_datetime() as recommended in the documentation for pandas.read_csv():
If a column or index contains an unparseable date, the entire column
or index will be returned unaltered as an object data type. For
non-standard datetime parsing, use pd.to_datetime after pd.read_csv.
Demo:
>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
date
0 2013-6-4
>>> df.dtypes
date object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
date
0 2013-06-04
>>> df.dtypes
date datetime64[ns]
dtype: object
When merging two columns into a single datetime column, the accepted answer generates an error (pandas version 0.20.3), since the columns are sent to the date_parser function separately.
The following works:
def dateparse(d, t):
    dt = d + " " + t
    return pd.datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')
df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
pandas read_csv method is great for parsing dates. Complete documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
You can even have the different date parts in different columns and pass the parameter:
parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
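As a small, self-contained illustration of the dict form (the column names and values here are made up for the sketch):
import io
import pandas as pd

csv = io.StringIO("id,date,value,time\n1,2013-06-04,3.14,12:30:00\n")

# Combine columns 1 and 3 ('date' and 'time') into one parsed column named 'when'
df = pd.read_csv(csv, parse_dates={'when': [1, 3]})
print(df.dtypes)  # 'when' comes out as datetime64[ns]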
The default sensing of dates works great, but it seems to be biased towards North American date formats. If you live elsewhere you might occasionally be caught out by the results. As far as I can remember, 1/6/2000 means 6 January in the USA, as opposed to 1 June where I live. It is smart enough to swing them around if dates like 23/6/2000 are used. It is probably safer to stay with YYYYMMDD variations of dates, though. Apologies to the pandas developers here, but I have not tested it with local dates recently. An example of forcing the day-first reading follows below.
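For day-first data, one way to force that interpretation (a sketch with a made-up column name) is to pass dayfirst=True:
import io
import pandas as pd

csv = io.StringIO("when,value\n1/6/2000,42\n")
df = pd.read_csv(csv, parse_dates=['when'], dayfirst=True)
print(df['when'][0])  # 2000-06-01, i.e. 1 June rather than 6 January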
You can use the date_parser parameter to pass a function to convert your format.
date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.
Yes - according to the pandas.read_csv documentation:
Note: A fast-path exists for iso8601-formatted dates.
So if your csv has a column named datetime and the dates look like 2013-01-01T01:01, for example, running this will make pandas (I'm on v0.19.2) pick up the date and time automatically:
df = pd.read_csv('test.csv', parse_dates=['datetime'])
Note that you need to explicitly pass parse_dates; it doesn't work without it.
Verify with:
df.dtypes
You should see the datatype of the column is datetime64[ns]
When loading a csv file that contains a date column, there are two approaches to make pandas recognize the date column:
Pandas explicitly recognizes the format via the arg date_parser=mydateparser
Pandas implicitly recognizes the format via the arg infer_datetime_format=True
Some of the date column data:
01/01/18
01/02/18
Here we don't know whether the first field is the month or the day, so in this case we have to use
Method 1: explicitly pass the format
mydateparser = lambda x: pd.datetime.strptime(x, "%m/%d/%y")
df = pd.read_csv(file_name, parse_dates=['date_col_name'],
                 date_parser=mydateparser)
Method 2: implicitly (automatically) recognize the format
df = pd.read_csv(file_name, parse_dates=['date_col_name'], infer_datetime_format=True)
In addition to what the other replies said, if you have to parse very large files with hundreds of thousands of timestamps, date_parser can prove to be a huge performance bottleneck, as it's a Python function called once per row. You can get a sizeable performance improvement by instead keeping the dates as text while parsing the CSV file and then converting the entire column into dates in one go:
# For a data column
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']})
df['mydatetime'] = pd.to_datetime(df['mydatetime'], exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
# For a DateTimeIndex
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col='mydatetime')
df.index = pd.to_datetime(df.index, exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
# For a MultiIndex
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'])
idx_mydatetime = df.index.get_level_values(0)
idx_num = df.index.get_level_values(1)
idx_mydatetime = pd.to_datetime(idx_mydatetime, exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
df.index = pd.MultiIndex.from_arrays([idx_mydatetime, idx_num])
For my use case on a file with 200k rows (one timestamp per row), that cut down processing time from about a minute to less than a second.
Read the existing string columns as date and time respectively:
pd.read_csv('CGMData.csv', parse_dates=['Date', 'Time'])
Resulting columns
Concatenate the date and time string columns into a new column of datetime type, removing the original columns.
If you want to rename the new column, pass a dictionary as shown in the example below; the new column name will be the key name.
If you pass a list of columns, the new column name will be the concatenation of the column names passed in the list, separated by _, e.g. Date_Time.
# parse_dates={'given_name': ['Date', 'Time']}
pd.read_csv("InsulinData.csv",low_memory=False,
parse_dates=[['Date', 'Time']])
pd.read_csv("InsulinData.csv",low_memory=False,
parse_dates={'date_time': ['Date', 'Time']})
Concatenate the date and time string columns into a new column of datetime type, keeping the original columns:
pd.read_csv("InsulinData.csv", low_memory=False,
            parse_dates=[['Date', 'Time']], keep_date_col=True)
To change the format of date and time when reading from csv:
parser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S')
pd.read_csv('path', date_parser=parser, parse_dates=['date', 'time'])
If performance matters to you, make sure you time it:
import sys
import timeit
import pandas as pd
print('Python %s on %s' % (sys.version, sys.platform))
print('Pandas version %s' % pd.__version__)
repeat = 3
numbers = 100
def time(statement, _setup=None):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))
print("Format %m/%d/%y")
setup = """import pandas as pd
import io
data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,07/29/15
x2,07/29/15
x3,07/29/15
x4,07/30/15
x5,07/29/15
x6,07/29/15
x7,07/29/15
y7,08/05/15
x8,08/05/15
z3,08/05/15
''' * 100)"""
time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%m/%d/%y")); data.seek(0)')
print("Format %Y-%m-%d %H:%M:%S")
setup = """import pandas as pd
import io
data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,2016-10-15 00:00:43
x2,2016-10-15 00:00:56
x3,2016-10-15 00:00:56
x4,2016-10-15 00:00:12
x5,2016-10-15 00:00:34
x6,2016-10-15 00:00:55
x7,2016-10-15 00:00:06
y7,2016-10-15 00:00:01
x8,2016-10-15 00:00:00
z3,2016-10-15 00:00:02
''' * 1000)"""
time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")); data.seek(0)')
prints:
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28)
[Clang 6.0 (clang-600.0.57)] on darwin
Pandas version 0.23.4
Format %m/%d/%y
0.19123052499999993
8.20691274
8.143124389
1.2384357139999977
Format %Y-%m-%d %H:%M:%S
0.5238807110000039
0.9202787830000005
0.9832778819999959
12.002349824999996
So with an ISO 8601-formatted date (%Y-%m-%d %H:%M:%S counts as ISO 8601 here; apparently the T can be dropped and replaced by a space) you should not specify infer_datetime_format (which does not seem to make a difference with the more common formats either), and passing in your own parser just cripples performance. On the other hand, date_parser does make a difference with less standard date formats. As usual, be sure to time before you optimize.
You can use the parameter date_parser with a function for converting a sequence of string columns to an array of datetime instances:
parser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S')
pd.read_csv('path', date_parser=parser, parse_dates=['date_col1', 'date_col2'])
Yes, this code works like a breeze. Here index 0 refers to the position of the date column.
df = pd.read_csv(filepath, parse_dates=[0], infer_datetime_format = True)
No, there is no way in pandas to automatically recognize date columns.
Pandas does a poor job at type inference. It basically leaves most columns as the generic object type, unless you manually work around it, e.g. using the above-mentioned parse_dates parameter.
If you want to automatically detect column types, you would have to use a separate data profiling tool, e.g. visions, and then cast or feed the inferred types back into your DataFrame constructor (e.g. for dates and from_csv, using the parse_dates parameter).
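A rough, pandas-only workaround (a heuristic sketch, not the visions library; the 0.9 threshold is an arbitrary choice) is to try converting each object column after reading and keep the conversion only if most values parse as dates:
import pandas as pd

def coerce_date_columns(df, threshold=0.9):
    # Convert object columns whose values mostly parse as dates
    for col in df.select_dtypes(include='object').columns:
        parsed = pd.to_datetime(df[col], errors='coerce')
        if parsed.notna().mean() >= threshold:
            df[col] = parsed
    return df

df = pd.read_csv('test.dat', delimiter=r"\s+", names=['col1', 'col2', 'col3'])
df = coerce_date_columns(df)
print(df.dtypes)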
