Handling timestamps with timezones in Pandas and Rpy2 - python

I'm trying to understand how to add a row that contains a timestamp to a Pandas dataframe that has a column with a data type of datetime64[ns, UTC]. Unfortunately, when I add a row, the column datatype changes to object, which ends up breaking conversion to an R data frame via Rpy2.
Here are the interesting lines of code where I'm seeing the problem, with debug print statements around them whose output I'll share as well. The variable observation is a simple Python list whose first value is a timestamp. Code:
print('A: df.dtypes[0] = {}'.format(str(df.dtypes[0])))
print('observation[0].type = {}, observation[0].tzname() = {}'.format(str(type(observation[0])), observation[0].tzname()))
df.loc[len(df)] = observation
print('B: df.dtypes[0] = {}'.format(str(df.dtypes[0])))
Here is the output of the above code snippet:
A: df.dtypes[0] = datetime64[ns, UTC]
observation[0].type = <class 'datetime.datetime'>, observation[0].tzname() = UTC
B: df.dtypes[0] = object
What I'm observing is that the datatype of the column is being changed when I append the row. As far as I can tell, Pandas is storing the timestamp as an instance of a different class, and the rpy2 pandas2ri module seems to be unable to convert values of that class.
I've so far been unable to find an approach that lets me append a row to the data frame and preserve the column type for the timestamp column. Suggestions would be welcome.
==========================
Update
I've been able to work around the problem in a hacky way. I create a one-row temporary dataframe from the list of values, then set the types on the columns for this one-row dataframe. Then I append the row from this temporary dataframe to the one I'm working on. This is the only approach I was able to identify that preserves the column type of the dataframe I'm appending to. It's almost enough to make me pine for a strongly typed language.
I'd prefer a more elegant solution, so I'm leaving this open in case anyone can suggest one.
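For reference, here is a minimal sketch of that workaround (the column names ts and value are made up for illustration): build a one-row frame from the list, coerce its dtypes to match the target frame, then concatenate.

```python
import pandas as pd

# Existing frame with a tz-aware timestamp column (names are hypothetical).
df = pd.DataFrame({'ts': pd.to_datetime(['2020-01-01 00:00'], utc=True),
                   'value': [0.5]})

observation = [pd.Timestamp('2020-01-01 12:00', tz='UTC'), 1.5]

# One-row temporary frame, coerced to the target dtypes before appending.
row = pd.DataFrame([observation], columns=df.columns).astype(df.dtypes.to_dict())
df = pd.concat([df, row], ignore_index=True)

print(df.dtypes['ts'])  # datetime64[ns, UTC]
```

Because both frames have identical dtypes at the point of concatenation, pandas has no reason to fall back to object.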

Check this post for an answer, especially the answer by Wes McKinney:
Converting between datetime, Timestamp and datetime64
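In short, the conversions discussed in that post can be sketched as follows (note that to_datetime64() drops timezone information for tz-aware timestamps):

```python
from datetime import datetime

import pandas as pd

dt = datetime(2020, 1, 1, 12, 0)   # plain Python datetime
ts = pd.Timestamp(dt)              # datetime -> pandas Timestamp
dt64 = ts.to_datetime64()          # Timestamp -> numpy datetime64
back = pd.Timestamp(dt64)          # datetime64 -> Timestamp
py = back.to_pydatetime()          # Timestamp -> datetime

print(py == dt)  # True
```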

Related

Why is a cell value coming as a series when you do dataframe[dataframe[ColumnName]==some_value]?

I am having the same problem as described here: Cell value coming as series in pandas
While a solution is provided there, there is no explanation of why a cell would be a Series when I would expect it to be a string (which I understand is dtype=object).
My dataframe has columns as below
Serial Number object
Device ID int64
Device Name object
I am extracting a row as follows:
device=s_devices[s_devices['Device ID']==17177529]
print(device['Device ID'])
prints fine as I would expect
17177529
print(device['Device Name'])
prints like below, like a Series:
49 10.112.165.182
Name: Device Name, dtype: object
What can be done? I can see that I could use .values to get just the IP, 10.112.165.182, but I am wondering what causes the difference between dtype int64 and dtype object, at import or elsewhere. I am reading from Excel.
As far as I understand, your code should always output a Series, so the problem is probably in the code you are not showing. Also, the question you are referring to uses ix (which no longer exists in the latest version of pandas), which suggests the pandas version may also be an issue.
By the way, I don't think values is a good choice for your case, because it is used when you want an array, not an element. (Also, .values is no longer recommended.)
If you just want to extract an element, try:
# If there are multiple elements, the first one will be extracted.
print(device['Device Name'].iloc[0])
or
# If there are multiple elements, ValueError will be raised.
print(device['Device Name'].item())

Pandas one-line filtering for the entire dataset - how is it achieved?

I am just now diving into this wonderful library and am pretty baffled by how filtering, or even column manipulation, is done, and am trying to understand whether this is a feature of pandas or of Python itself. More precisely:
import pandas
df = pandas.read_csv('data.csv')
# Doing
df['Column'] # would display all values from Column for dataframe
# Even moreso, doing
df.loc[df['Column'] > 10] # would display all values from Column greater than 10
# and is the same with
df.loc[df.Column > 10]
So columns are both attributes and keys, so a DataFrame is both a dict and an object? Or perhaps I am missing some basic Python functionality that I don't know about... And does accessing a column basically loop over the whole dataset? How is this achieved?
Column filtering and column manipulation, and data manipulation in general, are features of the pandas library itself. Once you load your data using pd.read_csv, the data set is stored as a pandas DataFrame, a dictionary-like container in which every column is a pandas Series. You can access a column either as an attribute (df.columnname) or as a key (df['columnname']); methods like .head(), .tail(), .shape, or .isna() work on both. Accessing a column does not loop over the rows: the column name is looked up in the DataFrame's column index, and the matching Series is returned. If the name is not found, you get a KeyError or an AttributeError, depending on how you accessed it. A filtering expression like df['Column'] > 10 is a vectorized comparison that produces a boolean Series, which df.loc then uses as a mask to select rows.
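A small sketch of the two steps hidden inside df.loc[df['Column'] > 10] (the column name and values are made up):

```python
import pandas as pd

df = pd.DataFrame({'Column': [5, 15, 25]})

# Step 1: vectorized comparison -> boolean Series, one entry per row.
mask = df['Column'] > 10
print(mask.tolist())                 # [False, True, True]

# Step 2: the boolean Series is used as a mask to select rows.
filtered = df.loc[mask]
print(filtered['Column'].tolist())   # [15, 25]
```

df.Column works because DataFrame implements attribute access as a fallback to column lookup; it only works for column names that are valid Python identifiers and don't clash with existing attributes.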

Converting numpy64 objects to Pandas datetime

Question is pretty self-explanatory. I am finding that pd.to_datetime isn't changing anything about the object type, and using pd.Timestamp() directly is bombing out.
Before this is marked a duplicate of Converting between datetime, Timestamp and datetime64: I am struggling with changing an entire column of a dataframe, not just one datetime object. Perhaps that was covered in that post, but I didn't see it in the top answer.
I will add that my error occurs when I try to get unique values from the dataframe's column. Is using unique converting the dtype to something unwanted?
The method you mentioned, pandas.to_datetime(), works on scalars, Series, and whole DataFrames if you need, so:
dataFrame['column_date_converted'] = pd.to_datetime(dataFrame['column_to_convert'])
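A minimal sketch with made-up data, which also shows that unique() does not change the column's dtype (though it returns an array rather than a Series, which can be surprising downstream):

```python
import pandas as pd

# Strings load as dtype object; to_datetime converts the whole column.
df = pd.DataFrame({'column_to_convert': ['2020-01-01', '2020-01-02', '2020-01-01']})
df['column_date_converted'] = pd.to_datetime(df['column_to_convert'])
print(df['column_date_converted'].dtype)   # datetime64[ns]

# unique() returns the distinct values as an array, not a Series.
u = df['column_date_converted'].unique()
print(len(u))                              # 2
```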

HDFStore get column names

I have some problems with pandas' HDFStore being far too slow, and unfortunately I'm unable to put together a satisfying solution from other questions here.
Situation
I have a big DataFrame, containing mostly floats and sometimes integer columns, which goes through multiple processing steps (renaming, removing bad entries, aggregating by 30 min). Each row has a timestamp associated with it. I would like to save some intermediate steps to an HDF file, so that the user can redo a single step iteratively without starting from scratch each time.
Additionally, the user should be able to plot certain columns from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore.
Concretely, the user should get a list of all columns of all dataframes stored in the HDF file; then they select which columns they would like to see, after which I use matplotlib to present the corresponding data.
Data
shape == (5730000, 339) does not seem large at all, which is why I'm confused... (It might gain far more rows over time; the columns should stay fixed.)
In the first step I append rows and columns iteratively (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.
My approach
I do all manipulations in memory, since pandas seems to be rather fast and I/O is slower (the HDF file is on a different physical server, I think).
I use a datetime index and automatically selected float or integer columns.
I save the steps with hdf.put('/name', df, format='fixed'), since hdf.put('/name'.format(grp), df, format='table', data_columns=True) seemed to be far too slow.
I use e.g. df.groupby(df.index).first() and df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict) to process the data, where agg_dict is a dictionary with one function per column. This is incredibly slow as well.
For plotting, I have to read in the entire dataframe and then get the columns: hdfstore.get('/name').columns
Question
How can I retrieve all columns without reading any data from the HDFStore?
What would be the most efficient way of storing my data? Is HDF the right option? Table or fixed?
Does it matter in terms of efficiency whether the index is a datetime index? Does a more efficient format exist in general (e.g. all columns the same, fixed dtype)?
Is there a faster way to aggregate than groupby (df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict))?
similar questions
How to access single columns using .select
I see that I can use this to retrieve only certain columns but only after I know the column names, I think.
Thank you for any advice!
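To make the aggregation step concrete, here is a small sketch of the 30-minute grouping with made-up data; agg_dict maps each column to one function, as described in the question:

```python
import numpy as np
import pandas as pd

# Six rows, ten minutes apart, so they fall into two 30-minute buckets.
idx = pd.date_range('2021-01-01', periods=6, freq='10min')
df = pd.DataFrame({'x': np.arange(6.0), 'y': np.arange(6.0) * 2}, index=idx)

agg_dict = {'x': 'mean', 'y': 'max'}   # one function per column
out = df.groupby(pd.Grouper(freq='30min')).agg(agg_dict)
print(out)
```

With a datetime index, df.resample('30min').agg(agg_dict) is an equivalent spelling of the same operation.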
You may simply load 0 rows of the DataFrame by specifying the same start and stop attributes, and leave all the internal index/column processing to pandas itself:
idx = pd.MultiIndex.from_product([('A', 'B'), range(2)], names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=('I', 'II', 'III'))
df
>>> I II III
>>> Alpha Int
>>> A 0 -0.472412 0.436486 0.354592
>>> 1 -0.095776 -0.598585 -0.847514
>>> B 0 0.107897 1.236039 -0.196927
>>> 1 -0.154014 0.821511 0.092220
The following works for both fixed and table formats:
with pd.HDFStore('test.h5') as store:
    store.put('df', df, format='f')
    meta = store.select('df', start=1, stop=1)
meta
meta.index
meta.columns
>>> I II III
>>> Alpha Int
>>>
>>> MultiIndex(levels=[[], []],
>>> codes=[[], []],
>>> names=['Alpha', 'Int'])
>>>
>>> Index(['I', 'II', 'III'], dtype='object')
As for the other questions:
As long as your data is mostly homogeneous (mostly float columns, as you mentioned) and you are able to store it in a single file without needing to distribute data across machines, HDF is the first thing to try.
If you need to append/delete/query data, you must use the table format. If you only need to write once and read many times, fixed will improve performance.
As for the datetime index, I think the same idea applies as in the first point: if you are able to convert all data into a single type, it should increase your performance.
Nothing comes to mind beyond what was already proposed in the comments to your question.
For a HDFStore hdf and a key (from hdf.keys()) you can get the column names with:
# Table stored with hdf.put(..., format='table')
columns = hdf.get_node('{}/table'.format(key)).description._v_names
# Table stored with hdf.put(..., format='fixed')
columns = list(hdf.get_node('{}/axis0'.format(key)).read().astype(str))
Note that hdf.get(key).columns works as well, but it reads all the data into memory, while the approach above reads only the column names.
Full working example:
#!/usr/bin/env python
import pandas as pd

data = pd.DataFrame({'a': [1, 1, 1, 2, 3, 4, 5], 'b': [2, 3, 4, 1, 3, 2, 1]})
with pd.HDFStore(path='store.h5', mode='a') as hdf:
    hdf.put('/DATA/fixed_store', data, format='fixed')
    hdf.put('/DATA/table_store', data, format='table', data_columns=True)
    for key in hdf.keys():
        try:
            # column names of a table store
            print(hdf.get_node('{}/table'.format(key)).description._v_names)
        except AttributeError:
            try:
                # column names of a fixed store
                print(list(hdf.get_node('{}/axis0'.format(key)).read().astype(str)))
            except AttributeError:
                # e.g. a dataset created by h5py instead of pandas
                print('unknown node in HDF.')
To get the columns without reading any data:
store.get_storer('df').ncols # substitute 'df' with your key
# you can also access nrows and other useful fields
From the docs on the fixed and table formats:
[fixed] These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores.
[table] Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported.
You may try to use epochms (or epochns) (milliseconds or nanoseconds since epoch) in place of datetimes. This way, you are just dealing with integer indices.
You may have a look at this answer if what you need is grouping by on large data.
An advice: if you have 4 questions to ask, it may be better to ask 4 separate questions on SO. This way, you'll get a higher number of (higher quality) answers, since each one is easier to tackle. And each will deal with a specific topic, making it easier to search for people that are looking for specific answers.

What effect does changing the datatype in a pandas dataframe or series have?

Specifically,
If I don't need to change the datatype, is it better left alone? Does changing it copy the whole column of a dataframe? Does it copy the whole dataframe? Or does it just alter some setting in the dataframe so that the entries in that column are treated as a particular type?
Also, is there a way to set the type of the columns while the dataframe is getting created?
Here is one example: "2014-05-25 12:14:01.929000" is cast as np.datetime64 when the dataframe is created. Then I save the dataframe to a CSV. Then I read from the CSV, and the column becomes an arbitrary object. How would I avoid this? Or how can I re-cast this particular column as np.datetime64 while doing pd.read_csv?
Thanks.
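A minimal sketch of one way to do the re-cast at read time, via the parse_dates parameter of read_csv (the column name ts is made up; an in-memory StringIO stands in for the CSV file):

```python
import io

import pandas as pd

csv = io.StringIO("ts,value\n2014-05-25 12:14:01.929000,1\n")

# parse_dates asks read_csv to parse the column as datetime64 on load,
# instead of leaving it as a plain object column.
df = pd.read_csv(csv, parse_dates=['ts'])
print(df['ts'].dtype)  # datetime64[ns]
```

Alternatively, df['ts'] = pd.to_datetime(df['ts']) after reading achieves the same result.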