The docs provide good examples of how metadata can be provided. However, I still feel unsure when it comes to picking the right dtypes for my dataframe.
Could I do something like meta={'x': int, 'y': float,
'z': float} instead of meta={'x': 'i8', 'y': 'f8', 'z': 'f8'}?
Could somebody point me to a list of possible values like 'i8'? What
dtypes exist?
How can I specify a column that contains arbitrary objects? How can I specify a column that contains only instances of one class?
The available basic data types are the ones offered through NumPy. Have a look at the NumPy documentation for a list.
Not included in this set are the datetime formats (e.g. datetime64), for which additional information can be found in the pandas and NumPy documentation.
The meta argument for Dask dataframes usually expects an empty pandas DataFrame holding the definitions for columns, indices, and dtypes.
One way to construct such a DataFrame is:
import pandas as pd
import numpy as np

# Build an empty frame, then set each column's dtype individually
meta = pd.DataFrame(columns=['a', 'b', 'c'])
meta.a = meta.a.astype(np.int64)
meta.b = meta.b.astype('datetime64[ns]')  # np.datetime64 alone has no unit; use the explicit string
There is also a way to provide a dtype to the constructor of the pandas DataFrame; however, I am not sure how to provide a separate dtype for each column that way. As you can see, you can provide not only the string name of a datatype but also the actual NumPy dtype.
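If you prefer the dtype-string form throughout, here is a minimal sketch (assuming the same columns as above) that builds the meta frame from empty typed Series in one step:
import pandas as pd

meta = pd.DataFrame({
    'a': pd.Series(dtype='i8'),
    'b': pd.Series(dtype='datetime64[ns]'),
    'c': pd.Series(dtype='f8'),
})
print(meta.dtypes)  # a: int64, b: datetime64[ns], c: float64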
Regarding your last question, the datatype you are looking for is "object". For example:
import pandas as pd
class Foo:
    def __init__(self, foo):
        self.bar = foo
df = pd.DataFrame(data=[Foo(1), Foo(2)], columns=['a'], dtype='object')
df.a
# 0 <__main__.Foo object at 0x00000000058AC550>
# 1 <__main__.Foo object at 0x00000000058AC358>
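Note that "object" does not restrict what the column may contain; pandas will happily store mixed types in it. If you want to verify that a column holds only instances of one class, here is a minimal sketch of a manual check (my own suggestion, not a built-in pandas feature):
df.a.map(lambda v: isinstance(v, Foo)).all()  # True only if every entry is a Foo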
Both dask.dataframe and pandas use NumPy dtypes. In particular, they accept anything that you can pass to np.dtype. This includes the following:
NumPy dtype objects, like np.float64
Python type objects, like float
NumPy dtype strings, like 'f8'
Here is a more extensive list taken from the NumPy docs: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#specifying-and-constructing-data-types
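As a quick sanity check, all three spellings resolve to the same dtype (a sketch, assuming the common case where Python's float maps to a 64-bit double):
import numpy as np

assert np.dtype(np.float64) == np.dtype(float) == np.dtype('f8')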
Related
My DataFrame has a complex128 in one column. When I access another value via the .loc method it returns a complex128 instead of the stored dtype.
I encountered the problem when I was using some values from a DataFrame inside a class in a function.
Here is a minimal example:
import pandas as pd
arrays = [["f", "i", "c"], ["float", "int", "complex"]]
ind = pd.MultiIndex.from_arrays(arrays, names=("varname", "intended dtype"))
a = pd.DataFrame(columns=ind)
m1 = 1.33 + 1e-9j
parms1 = [1., 2, None]
a.loc["aa"] = parms1
a.loc["aa", "c"] = m1
print(a.dtypes)
print(a.loc["aa", "f"])
print("-----------------------------")
print(a.loc["aa", ("f", "float")])
print("-----------------------------")
print(a["f"])
If the MultiIndex is removed, this does not happen, so it seems to play a role. But accessing the value the MultiIndex way, with the full column tuple, does not help either.
I noticed that the dtype coercion happens because I did not specify an index when creating the DataFrame. Leaving the index out is necessary, because I do not know up front what will be filled in.
Is this normal behavior, or can I get rid of it?
pandas version is 0.24.2; the behaviour is also reproducible in 0.25.3.
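For comparison, a minimal sketch of the single-level case mentioned above (same values, plain columns instead of a MultiIndex; the exact inferred dtypes vary across pandas versions):
import pandas as pd

b = pd.DataFrame(columns=["f", "i", "c"])
b.loc["aa"] = [1., 2, None]
b.loc["aa", "c"] = 1.33 + 1e-9j
print(b.dtypes)
print(b.loc["aa", "f"])  # comes back as a plain float, not complex128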
import pandas as pd
import numpy as np
test_df = pd.DataFrame([[1, 2]] * 4, columns=['x', 'y'])
test_df.iloc[0,0] = '1'
test_df.iloc[0,0] = 1
test_df.select_dtypes(include=['number'])
I want to know why column x is not included in this case.
I can reproduce on Pandas v0.19.2. The issue is when, if at all, Pandas chooses to check and recast series. You first define the series as dtype object with this assignment:
test_df.iloc[0, 0] = '1'
Pandas stores any series with strings as object dtype. You then overwrite a value in the next line without explicitly changing the dtype of the series:
test_df.iloc[0, 0] = 1
But you should not assume this automatically triggers conversion to a numeric dtype for the entire series. As far as I am aware, this is not a documented behaviour. While it may work in more recent versions, it is not a behaviour you should assume for a production workflow.
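If you need column x to be treated as numeric, a safer route is to convert it explicitly rather than relying on implicit recasting. A minimal sketch using pd.to_numeric:
import pandas as pd

test_df = pd.DataFrame([[1, 2]] * 4, columns=['x', 'y'])
test_df.iloc[0, 0] = '1'  # column x becomes object dtype
test_df.iloc[0, 0] = 1    # the dtype may well stay object
test_df['x'] = pd.to_numeric(test_df['x'])  # explicit conversion
print(test_df.select_dtypes(include=['number']).columns.tolist())  # ['x', 'y']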
file.txt has a header and four columns, but the headers change all the time.
something like:
,'non_standard_header_1','non_standard_header_2','non_standard_header_3'
,kdfjlkjdf, sdfdfd,,
,kdfjlkjwwdf, sdfddffd,,
,kdfjlkjwwdf,, sdfddffd,
I want to import file.txt into pandas, and I want the columns to be imported as objects. The intuitive approach (to me) is dtype = [object, object, object], as in:
daily_file = pandas.read_csv('file.txt',
                             usecols=[1, 2, 3],
                             dtype=[object, object, object])
does not work. Running the above, I get:
data type not understood
How do I set column dtypes on import without referencing (existing) column names?
pd.read_csv(..., dtype=object) will globally apply the object dtype across all columns read in, if that's what you're looking for.
Otherwise, you'll need to pass a dict of the form {'col' : dtype} if you want to map dtypes to column names.
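A short sketch of both options (the header names below are placeholders taken from the question's example):
import pandas as pd

# Option 1: every column read in becomes object dtype
daily_file = pd.read_csv('file.txt', usecols=[1, 2, 3], dtype=object)

# Option 2: map dtypes to column names once the names are known
daily_file = pd.read_csv('file.txt', usecols=[1, 2, 3],
                         dtype={'non_standard_header_1': object,
                                'non_standard_header_2': object,
                                'non_standard_header_3': object})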
Let me start off by saying that I'm fairly new to NumPy and pandas. I'm trying to construct a pandas DataFrame, but I'm not sure that I'm doing things in an appropriate way.
My setting is that I have a large list of .Net objects (that I have very little control over) and I want to build a time series from them using a pandas DataFrame. I have an example where I have replaced the .Net class with a simplified placeholder class just for demonstration. The listOfThings in the code is basically what I get from .Net, and I want to convert that into a pandas DataFrame.
My questions are:
I construct the DataFrame by first constructing a NumPy array. Is this necessary? Also, this array doesn't have the 1000x2 shape I expect. Is there a better way to use NumPy here?
This code doesn't work, because I don't seem to be able to cast the string to a datetime64. This confuses me, since the string is in ISO format and parsing it like this works: np.datetime64(str(np.datetime64('now','us'))).
Code sample:
import numpy as np
import pandas as pd

class PlaceholderClass:
    def time(self):
        return str(np.datetime64('now', 'us'))

    def value(self):
        return 100 * np.random.random_sample()

listOfThings = [PlaceholderClass() for i in range(1000)]
# Fails here: the structured dtype uses np.datetime64 without a unit
arr = np.array([(x.time(), x.value()) for x in listOfThings],
               dtype=[('time', np.datetime64), ('value', np.float)])
dataframe = pd.DataFrame(data=arr['value'], index=arr['time'])
Thanks in advance
Q1:
I think it is not necessary to first make an np.array and then create the dataframe. This works perfectly fine, for example:
import datetime
from random import randint

rd = lambda: datetime.date(randint(2005, 2025), randint(1, 12), randint(1, 28))
df = pd.DataFrame([(rd(), rd()) for x in range(100)])
Added later:
df = pd.DataFrame((x.value() for x in listOfThings), index=(pd.to_datetime(x.time()) for x in listOfThings))
Q2:
I noticed that pd.to_datetime('some date') almost always gets it right, even without specifying the format. Perhaps this helps.
In [115]: pd.to_datetime('2008-09-22T13:57:31.2311892-04:00')
Out[115]: Timestamp('2008-09-22 17:57:31.231189200')
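Applied to the question's code, a hedged sketch (reusing the same ISO strings produced by PlaceholderClass.time) that parses them with pd.to_datetime instead of forcing np.datetime64 in a structured dtype:
import numpy as np
import pandas as pd

times = [str(np.datetime64('now', 'us')) for _ in range(3)]
idx = pd.to_datetime(times)  # a DatetimeIndex with dtype datetime64[ns]
print(idx.dtype)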
I face some confusion with the way pandas is handling time-related objects.
If I do
x = pd.datetime.fromtimestamp(1440502703064/1000.) # or
x = pd.datetime(1234,5,6)
then type(x) returns datetime.datetime in either case. However, if I have:
z = pd.DataFrame([
{'a': 'foo', 'ts': pd.datetime.fromtimestamp(1440502703064/1000.)}
])
then type(z['ts'][0]) returns pandas.tslib.Timestamp. When does this casting happen? Is it triggered by pandas or maybe by numpy? What is the type that I obtain in the latter case, and where is it documented?
I'm not 100% sure, since I haven't studied the underlying code, but the conversion from datetime.datetime happens the moment the value is "incorporated" into a DataFrame.
Outside a DataFrame, pandas will try to do the smart thing and return something sensible when using pd.datetime(.fromtimestamp): it returns a Python datetime.datetime object.
Inside, it uses something it can probably work better with internally. You can see the conversion occurring when creating a DataFrame by using a datetime.datetime object instead:
>>> from datetime import datetime
>>> z = pd.DataFrame([{'a': 'foo', 'ts': datetime(2015, 8, 27)}])
>>> type(z['ts'][0])
pandas.tslib.Timestamp
Perhaps even clearer:
>>> pd.datetime == datetime
True
So the conversion happens during the DataFrame initialisation.
As for documentation, I searched around and found the source (note: probably not a very time-safe link), which says (doc-string):
TimeStamp is the pandas equivalent of python's Datetime and is
interchangable with it in most cases. It's the type used for the
entries that make up a DatetimeIndex, and other timeseries oriented
data structures in pandas.
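Consistent with that doc-string, a small check (a sketch using the public pandas API) that Timestamp really is interchangeable with Python's datetime:
from datetime import datetime
import pandas as pd

ts = pd.Timestamp('2015-08-27')
print(isinstance(ts, datetime))  # True: Timestamp subclasses datetime.datetime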