I face some confusion with the way pandas is handling time-related objects.
If I do
x = pd.datetime.fromtimestamp(1440502703064/1000.) # or
x = pd.datetime(1234,5,6)
then type(x) returns datetime.datetime in either of the cases. However if I have:
z = pd.DataFrame([
{'a': 'foo', 'ts': pd.datetime.fromtimestamp(1440502703064/1000.)}
])
then type(z['ts'][0]) returns pandas.tslib.Timestamp. When does this casting happen? Is it triggered by pandas, or maybe by numpy? What is this type that I obtain in the latter case, and where is it documented?
I'm not 100% sure, since I haven't studied the underlying code, but the conversion from datetime.datetime happens the moment the value is "incorporated" into a DataFrame.
Outside a DataFrame, pandas will try to do the smart thing and return something sensible from pd.datetime.fromtimestamp: it returns a Python datetime.datetime object.
Inside a DataFrame, it uses a type it can work with more efficiently internally. You can see the conversion occurring when creating a DataFrame by using a datetime.datetime object instead:
>>> from datetime import datetime
>>> z = pd.DataFrame([
{'a': 'foo', 'ts': datetime(2015,8,27)} ])
>>> type(z['ts'][0])
pandas.tslib.Timestamp
Perhaps even clearer:
>>> pd.datetime == datetime
True
So the conversion happens during the DataFrame initialisation.
As for documentation, I searched around and found the source (note: probably not a very stable link), whose docstring says:
TimeStamp is the pandas equivalent of python's Datetime and is
interchangable with it in most cases. It's the type used for the
entries that make up a DatetimeIndex, and other timeseries oriented
data structures in pandas.
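That interchangeability is easy to verify, because Timestamp is a subclass of datetime.datetime (a quick sketch, assuming a reasonably recent pandas):
>>> import pandas as pd
>>> from datetime import datetime
>>> isinstance(pd.Timestamp('2015-08-27'), datetime)
True
>>> pd.Timestamp('2015-08-27') == datetime(2015, 8, 27)
True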
Let me start off by saying that I'm fairly new to numpy and pandas. I'm trying to construct a pandas dataframe but I'm not sure that I'm doing things in an appropriate way.
My setting is that I have a large list of .NET objects (that I have very little control over) and I want to build a time series from this using a pandas DataFrame. In the example below I have replaced the .NET class with a simplified placeholder class just for demonstration. The listOfThings in the code is basically what I get from .NET, and I want to convert that into a pandas DataFrame.
My questions are:
I construct the dataframe by first constructing a numpy array. Is this necessary? Also, this array doesn't have the size 1000x2 as I expect. Is there a better way to use numpy here?
This code doesn't work because it doesn't seem to be able to cast the string to a datetime64. This confuses me, since the string is in ISO format and parsing it directly works: np.datetime64(str(np.datetime64('now','us'))).
Code sample:
import numpy as np
import pandas as pd
class PlaceholderClass:
def time(self):
return str(np.datetime64('now', 'us'))
def value(self):
return 100*np.random.random_sample()
listOfThings = [PlaceholderClass() for i in range(1000)]
arr = np.array([(x.time(), x.value()) for x in listOfThings], dtype=[('time', np.datetime64), ('value', np.float)])
dataframe = pd.DataFrame(data=arr['value'], index=arr['time'])
Thanks in advance
Q1:
I think it is not necessary to first make an np.array and then create the dataframe. This works perfectly fine, for example:
import datetime
from random import randint

rd = lambda: datetime.date(randint(2005, 2025), randint(1, 12), randint(1, 28))
df = pd.DataFrame([(rd(), rd()) for x in range(100)])
Added later:
df = pd.DataFrame((x.value() for x in listOfThings), index=(pd.to_datetime(x.time()) for x in listOfThings))
Q2:
I noticed that pd.to_datetime('some date') almost always gets it right, even without specifying the format. Perhaps this helps.
In [115]: pd.to_datetime('2008-09-22T13:57:31.2311892-04:00')
Out[115]: Timestamp('2008-09-22 17:57:31.231189200')
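Putting Q1 and Q2 together, a minimal sketch for the original setup (it assumes the listOfThings from the question and is just one possible way to do it):
values = [x.value() for x in listOfThings]
times = pd.to_datetime([x.time() for x in listOfThings])
dataframe = pd.DataFrame({'value': values}, index=times)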
First import:
import pandas as pd
import numpy as np
import hashlib
Next, consider the following:
np.random.seed(42)
arr = np.random.choice([41, 43, 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())
Multiple executions of this snippet yield the same hash for both objects every time: ddfee4572d380bef86d3ebe3cb7bfa7c68b7744f55f67f4e1ca5f6872c2c9ba1.
However, if we consider the following:
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())
Note that there are strings in the data now. The hash of arr is fixed (52db9328682317c44370b8186a5c6bae75f2a94c9d0d5b24d61f602857acd3de) across evaluations, but the hash of the pandas.DataFrame changes each time.
Is there any Pythonic way around this? Or even a non-Pythonic one?
Edit: Related links:
Hashable DataFrames
A pandas DataFrame or Series can be hashed using the pandas.util.hash_pandas_object function, starting in version 0.20.1.
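A minimal sketch of how it can be used to get a stable digest (assuming pandas >= 0.20.1):
import hashlib
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar', 42]})
# hash_pandas_object returns one uint64 per row; hash those bytes for a single digest
row_hashes = pd.util.hash_pandas_object(df, index=True)
print(hashlib.sha256(row_hashes.values.tobytes()).hexdigest())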
When you use strings as cell values, the column dtype of the DataFrame is object, as
df.dtypes
shows. That is why you get a different hash each time.
A naive workaround is to get a string representation of the whole dataframe and hash it. In particular, either of the following can work:
print(hashlib.sha256(df.to_json().encode()).hexdigest())
print(hashlib.sha256(df.to_csv().encode()).hexdigest())
Naturally, this is going to be very costly for big dataframes.
Still, the fact remains that pd.DataFrame(arr).values != arr, and this is counter-intuitive.
See a summary: https://gist.github.com/drorata/bfc5d956c4fb928dcc77510a33009691
I wrote a package with hashable subclasses of Series and DataFrame for my needs. Hope this helps.
The docs provide good examples of how metadata can be provided. However, I still feel unsure when it comes to picking the right dtypes for my dataframe.
Could I do something like meta={'x': int, 'y': float, 'z': float} instead of meta={'x': 'i8', 'y': 'f8', 'z': 'f8'}?
Could somebody point me to a list of possible values like 'i8'? What dtypes exist?
How can I specify a column that contains arbitrary objects? How can I specify a column that contains only instances of one class?
The available basic data types are the ones which are offered through numpy. Have a look at the documentation for a list.
Not included in this set are datetime-formats (e.g. datetime64), for which additional information can be found in the pandas and numpy documentation.
The meta-argument for dask dataframes usually expects an empty pandas dataframe holding definitions for columns, indices and dtypes.
One way to construct such a DataFrame is:
import pandas as pd
import numpy as np
meta = pd.DataFrame(columns=['a', 'b', 'c'])
meta.a = meta.a.astype(np.int64)
meta.b = meta.b.astype(np.datetime64)
There is also a way to provide a dtype to the constructor of the pandas DataFrame; however, I am not sure how to provide one for each individual column. As you can see, it is possible to provide not only the "name" of a datatype, but also the actual numpy dtype.
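One way around that (a sketch, not taken from the original answer): recent pandas versions accept a dict mapping column names to dtypes in DataFrame.astype, which sets all columns in one call:
import numpy as np
import pandas as pd

meta = pd.DataFrame(columns=['a', 'b', 'c']).astype(
    {'a': np.int64, 'b': 'datetime64[ns]', 'c': np.float64})
print(meta.dtypes)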
Regarding your last question, the datatype you are looking for is "object". For example:
import pandas as pd
class Foo:
def __init__(self, foo):
self.bar = foo
df = pd.DataFrame(data=[Foo(1), Foo(2)], columns=['a'], dtype='object')
df.a
# 0 <__main__.Foo object at 0x00000000058AC550>
# 1 <__main__.Foo object at 0x00000000058AC358>
Both Dask.dataframe and Pandas use NumPy dtypes. In particular, they accept anything that you can pass to np.dtype. This includes the following:
NumPy dtype objects, like np.float64
Python type objects, like float
NumPy dtype strings, like 'f8'
Here is a more extensive list taken from the NumPy docs: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#specifying-and-constructing-data-types
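To see that these spellings really are interchangeable, a quick check (plain numpy, nothing pandas-specific):
import numpy as np

print(np.dtype(np.float64))  # float64
print(np.dtype(float))       # float64
print(np.dtype('f8'))        # float64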
I have this line in my code which converts my data to numeric...
data["S1Q2I"] = data["S1Q2I"].convert_objects(convert_numeric=True)
The thing is that the new pandas release (0.17.0) says that this function is deprecated.
This is the warning:
FutureWarning: convert_objects is deprecated.
Use the data-type specific converters pd.to_datetime,
pd.to_timedelta and pd.to_numeric.
data["S3BD5Q2A"] = data["S3BD5Q2A"].convert_objects(convert_numeric=True)
So, I went to the new documentation and I couldn't find any examples of how to use the new function to convert my data...
It only says this:
"DataFrame.convert_objects has been deprecated in favor of type-specific functions pd.to_datetime, pd.to_timestamp and pd.to_numeric (new in 0.17.0) (GH11133)."
Any help would be nice!
As explained by @EvanWright in the comments,
data['S1Q2I'] = pd.to_numeric(data['S1Q2I'])
is now the preferred way of converting types. A detailed explanation of the change can be found in the github PR GH11133.
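For example, a small sketch of the behaviour (errors='coerce' turns unparsable values into NaN):
>>> import pandas as pd
>>> pd.to_numeric(pd.Series(['1', '2', 'three']), errors='coerce')
0    1.0
1    2.0
2    NaN
dtype: float64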
You can effect a replacement using apply as done here. An example would be:
>>> import pandas as pd
>>> a = pd.DataFrame([{"letter":"a", "number":"1"},{"letter":"b", "number":"2"}])
>>> a.dtypes
letter object
number object
dtype: object
>>> b = a.apply(pd.to_numeric, errors="ignore")
>>> b.dtypes
letter object
number int64
dtype: object
>>>
But it sucks in two ways:
You have to use apply rather than a non-native dataframe method
You have to copy to another dataframe; it can't be done in place. So much for use with "big data."
I'm not really loving the direction pandas is going. I haven't used R data.table much, but so far it seems superior.
I think a data table with native, in-place type conversion is pretty basic for a competitive data analysis framework.
It depends on which version of pandas you have.
If you have pandas version 0.18.0, this will work:
df['col name'] = df['col name'].apply(pd.to_numeric, errors='coerce')
For other versions:
df['col name'] = df['col name'].astype(float)
If you want to convert all columns to numeric at once, this code may work:
data = data.apply(pd.to_numeric, axis=0)
You can apply it to a particular column of a dataframe, without having to copy into a different dataframe, like this:
>>> import pandas as pd
>>> a = pd.DataFrame([{"letter":"a", "number":"1"},{"letter":"b", "number":"2"}])
>>> a.dtypes
letter object
number object
dtype: object
>>> a['number'] = a['number'].apply(pd.to_numeric, errors='coerce')
>>> a.dtypes
letter object
number int64
dtype: object
An example based on the original question above would be something like:
data['S1Q2I'] = data['S1Q2I'].apply(pd.to_numeric, errors='coerce')
This works the same way as your original:
data['S1Q2I'] = data['S1Q2I'].convert_objects(convert_numeric=True)
in my hands, anyway.
This doesn't address the point abalter made about inferring datatypes, which is a little above my head, I'm afraid!
So if I have a timestamp in pandas as such:
Timestamp('2014-11-07 00:05:00')
How can I create a new column that just has the 'time' component?
So I want
00:05:00
Currently, I'm using .apply as shown below, but this is slow (my dataframe is a couple million rows), and I'm looking for a faster way.
df['time'] = df['date_time'].apply(lambda x: x.time())
Instead of .apply, I tried using .astype(time), as I noticed .astype operations can be faster than .apply, but that apparently doesn't work on timestamps (AttributeError: 'Timestamp' object has no attribute 'astype')... any ideas?
You want .dt.time; see the docs for some more examples of things under the .dt accessor.
df['date_time'].dt.time
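A minimal end-to-end sketch (with a made-up two-row frame just for illustration):
import pandas as pd

df = pd.DataFrame({'date_time': pd.to_datetime(['2014-11-07 00:05:00',
                                                '2014-11-07 00:10:00'])})
df['time'] = df['date_time'].dt.time
# 0    00:05:00
# 1    00:10:00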
There are two DataFrames, df1 and df2, with a date column and a time column respectively.
The following code snippets are useful for converting the data types and comparing values.
type(df1['date'].iloc[0]), type(df2['time'].iloc[0])
>>> (datetime.date, pandas._libs.tslibs.timestamps.Timestamp)
type(df1['date'].iloc[0]), type(df2['time'].iloc[0].date())
>>> (datetime.date, datetime.date)
df1['date'].iloc[0] == df2['time'].iloc[0].date()
>>> False
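For comparing whole columns rather than single cells, the .dt accessor does the conversion element-wise; a minimal sketch (assuming the two frames share an aligned index):
df1['date'] == df2['time'].dt.date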