Reading Parquet File with Array<Map<String,String>> Column

Reading Parquet File with Array<Map<String,String>> Column - python

I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array<map<string,string>>'). An example of the df would be:
import pandas as pd
df = pd.DataFrame.from_records([
(1, [{'job_id': 1, 'started': '2019-07-04'}, {'job_id': 2, 'started': '2019-05-04'}], 100),
(5, [{'job_id': 3, 'started': '2015-06-04'}, {'job_id': 9, 'started': '2019-02-02'}], 540)],
columns=['uid', 'job_history', 'latency']
)
The when using engine='fastparquet, Dask reads all other columns fine but returns a column of Nones for the column with the complex type. When I set engine='pyarrow', I get the following exception:
ArrowNotImplementedError: lists with structs are not supported.
A lot of googling has made it clear that reading a column with a Nested Array just isn't really supported right now, and I'm not totally sure what the best way to handle this is. I figure my options are:
Some how tell dask/fastparquet to parse the column using the standard json library. The schema is simple and that would do the job if possible
See if I can possibily re-run the Spark job that generated the output and save it as something else, though this almost isn't an acceptable solution since my company uses parquet everywhere
Turn the keys of the map into columns and break the data up across several columns with dtype list and note that the data across these columns are related/map to each other by index (e.g. the elements in idx 0 across these keys/columns all came from the same source). This would work, but frankly, breaks my heart :(
I'd love to hear how others have navigated around this limitation. My company uses nested arrays in their parquest frequently, and I'd hate to have to let go of using Dask because of this.

It would be fairer to say that pandas does not support non-simple types very well (currently). It may be the case that pyarrow will, without conversion to pandas, and that as some future point, pandas will use these arrow structures directly.
Indeed, the most direct method that I can think for you to use, is to rewrite the columns as B/JSON-encoded text, and then load with fastparquet, specifying to load using B/JSON. You should get lists of dicts in the column, but the performance will be slow.
Note that the old project oamap and its successor awkward provides a way to iterate and aggregate over nested list/map/struct trees using python syntax, but compiled with Numba, such that you never need to instantiate the intermediate python objects. They were not designed for parquet, but had parquet compatibility, so might just be useful to you.

I'm dealing with pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported when I try to read using Pandas; however, when I read using pyspark and then convert to pandas, the data at least loads:
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.read.load(path)
pdf = df.toPandas()
and the offending field is now rendered as a pyspark Row object, which have some structured parsing but you would have to probably write custom pandas functions to extract data from them:
>>> pdf["user"][0]["sessions"][0]["views"]
[Row(is_search=True, price=None, search_string='ABC', segment='listing', time=1571250719.393951), Row(is_search=True, price=None, search_string='ZYX', segment='homepage', time=1571250791.588197), Row(is_search=True, price=None, search_string='XYZ', segment='listing', time=1571250824.106184)]
the individual record can be rendered as a dictionary, simply call .asDict(recursive=True) on the Row object you would like.
Unfortunately, it takes ~5 seconds to start the SparkSession context and every spark action also takes much longer than pandas operations (for small to medium datasets) so I would greatly prefer a more python-native option

Related

HDFStore get column names

I have some problems with pandas' HDFStore being far to slow and unfortunately I'm unable to put together a satisfying solution from other questions here.
Situation
I have a big DataFrame, containing mostly floats and sometimes integer columns which goes through multiple processing steps (renaming, removing bad entries, aggregating by 30min). Each row has a timestamp associated to it. I would like to save some middle steps to a HDF file, so that the user can do a single step iteratively without starting from scratch each time.
Additionally the user should be able to plot certain column from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore.
Concretely the user should get a list of all columns of all dataframes stored in the HDF then they should select which columns they would like to see whereafter I use matplotlib to present them the corresponding data.
Data
shape == (5730000, 339) does not seem large at all, that's why I'm confused... (Might get far more rows over time, columns should stay fixed)
In the first step I append iteratively rows and columns (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.
My approach
I do all manipulations in memory since pandas seems to be rather fast and I/O is slower (HDF is on different physical server, I think)
I use datetime index and automatically selected float or integer columns
I save the steps with hdf.put('/name', df, format='fixed') since hdf.put('/name'.format(grp), df, format='table', data_columns=True) seemed to be far too slow.
I use e.g. df.groupby(df.index).first() and df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict) to process the data, where agg_dict is a dictonary with one function per column. This is incredibly slow as well.
For plotting, I have to read-in the entire dataframe and then get the columns: hdfstore.get('/name').columns
Question
How can I retrieve all columns without reading any data from the HDFStore?
What would be the most efficient way of storing my data? Is HDF the right option? Table or fixed?
Does it matter in term of efficiency if the index is a datetime index? Does there exists a more efficient format in general (e.g. all columns the same, fixed dtype?)
Is there a faster way to aggregate instead of groupby (df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict))
similar questions
How to access single columns using .select
I see that I can use this to retrieve only certain columns but only after I know the column names, I think.
Thank you for any advice!

You may simply load 0 rows of the DataFrame by specifying same start and stop attributes. And leave all internal index/column processing for pandas itself:
idx = pd.MultiIndex.from_product([('A', 'B'), range(2)], names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=('I', 'II', 'III'))
df
>>> I II III
>>> Alpha Int
>>> A 0 -0.472412 0.436486 0.354592
>>> 1 -0.095776 -0.598585 -0.847514
>>> B 0 0.107897 1.236039 -0.196927
>>> 1 -0.154014 0.821511 0.092220
Following works both for fixed an table formats:
with pd.HDFStore('test.h5') as store:
store.put('df', df, format='f')
meta = store.select('df', start=1, stop=1)
meta
meta.index
meta.columns
>>> I II III
>>> Alpha Int
>>>
>>> MultiIndex(levels=[[], []],
>>> codes=[[], []],
>>> names=['Alpha', 'Int'])
>>>
>>> Index(['I', 'II', 'III'], dtype='object')
As for others question:
As long as your data is mostly homogeneous (almost float columns as you mentioned) and you are able to store it in single file without need to distribute data across machines - HDF is the first thing to try.
If you need to append/delete/query data - you must use table format. If you only need to write once and read many - fixed will improve performance.
As for datetime index, i think here we may use same idea as in 1 clause. If u are able to convert all data into single type it should increase your performance.
Nothing else that proposed in comment to your question comes to mind.

For a HDFStore hdf and a key (from hdf.keys()) you can get the column names with:
# Table stored with hdf.put(..., format='table')
columns = hdf.get_node('{}/table'.format(key)).description._v_names
# Table stored with hdf.put(..., format='fixed')
columns = list(hdf.get_node('{}/axis0'.format(key)).read().astype(str))
note that hdf.get(key).columns works as well, but it reads all the data into memory, while the approach above only reads the column names.
Full working example:
#!/usr/bin/env python
import pandas as pd
data = pd.DataFrame({'a': [1,1,1,2,3,4,5], 'b': [2,3,4,1,3,2,1]})
with pd.HDFStore(path='store.h5', mode='a') as hdf:
hdf.put('/DATA/fixed_store', data, format='fixed')
hdf.put('/DATA/table_store', data, format='table', data_columns=True)
for key in hdf.keys():
try:
# column names of table store
print(hdf.get_node('{}/table'.format(key)).description._v_names)
except AttributeError:
try:
# column names of fixed store
print(list(hdf.get_node('{}/axis0'.format(key)).read().astype(str)))
except AttributeError:
# e.g. a dataset created by h5py instead of pandas.
print('unknown node in HDF.')

Columns without reading any data:
store.get_storer('df').ncols # substitute 'df' with your key
# you can also access nrows and other useful fields
From the docs (fixed format, table format): (important points in bold)
[fixed] These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores.
[table] Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported.
You may try to use epochms (or epochns) (milliseconds or nanoseconds since epoch) in place of datetimes. This way, you are just dealing with integer indices.
You may have a look at this answer if what you need is grouping by on large data.
An advice: if you have 4 questions to ask, it may be better to ask 4 separate questions on SO. This way, you'll get a higher number of (higher quality) answers, since each one is easier to tackle. And each will deal with a specific topic, making it easier to search for people that are looking for specific answers.

Pandas - Appending 'table' format to HDF5Store with different dtypes: invalid combinate of [values_axes]

I recently started trying to use HDF5 format in python pandas to store data but encountered a problem where cant find a workaround for. Before i worked with CSV files and i had no trouble in regards to appending new data.
This is what i try:
store = pd.HDFStore('cdw.h5')
frame.to_hdf('cdw.h5','cdw/data_cleaned', format='table',append=True, data_columns=True,dropna=False)
And it throws:
ValueError: invalid combinate of [values_axes] on appending data [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->float64,kind->float,shape->(1, 176345)] vs current table [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->bytes128,kind->string,shape->None]
I get that it tells me i want to append different data type for a column but what buffles me is that i have wrote the same CSV file before with some other CSV Files from a Dataframe to that HDF5 file.
I'm doing analysis in the forwarding industry and the data there is very inconsistent - more often than not there are missing values or mixed dtypes in columns or other 'data dirt'.
Im looking for a way to append data to HDF5 file no matter what is inside the column as long as the column names are the same.
It would be beautiful to enforce appending data in HDF store independant of datatypes or another simple solution for my problem. The goal is to have an automation later on for the analysis therefore id not like to change datatypes everytime i have a missing value in a column of the total 62 columns i have.
Another question in my question is:
My file access for read_hdf consumes more time than my read_csv i have around 1.5 million rows with 62 columns. Is this because i have no SSD drive? Because i have read that the file access for read_hdf should be faster.
I question myself if I rather should stick with CSV files or with HDF5?
Help would be greatly appreciated.

Okay for anyone having the same issue with appending data where the dtype is not always secured to be the same: I finally found a solution. First convert every column to object with li = list(frame)
frame[li] = frame[li].astype(object)
frame.info() then try the method df.to_hdf(key,value, append=True) and wait for its error message. The error message TypeError: Cannot serialize the column [not_one_datatype] because its data contents are [mixed] object dtype will tell the columns it still doesnt like. Converting those columns to float worked for me! After that the error convert the mentioned column with df['not_one_datatype'].astype(float) only use integer if you are sure that a float will never occur in this column otherwise append method will bug again.
I decided to work parallel with CSV and HDF5 Files. If i get a problem with HDF5 where i have no workaround for i will simply change to CSV - this is what i can recommend personally.
Update: Okay it seems that the creators of this format have not thought about the reality when considering the HDF API: HDF5 min_itemsize error: ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]! occurs when trying to append data to an already existing file if some column happens to be longer than the initial write to HDF file.
Now the joke here is that the creators of this API expecting me to know the max column length of each possible data in a column at the first write? really? Another inconsistency is that df.to_hdf(append=True) do not have the parameter min_itemsize={'column1':1000}. This format is at best suited for storing self created data only but definately not for data where the dtypes and length of the entries in each column are NOT set in stone. The only solution left when you want to append data from pandas dataframes independent of the stubborn HDF5 API in Python is to insert in every dataframe before appending a row with very long strings except for the numeric columns. Just to be sure that you will always be able to append the data no matter how possible long it will get.
When doing this write process will take ages and slurp gigantic sizes of disc drive for saving the huge HDF5 file.
CSV definately wins against HDF5 in terms of performance, integration and especially usability.

Read Directory of Timeseries CSV data efficiently with Dask DataFrame and Pandas

I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame?
Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a question here on StackOverflow to which I can point people in the future.

Simple Solution
If you just want to get something quickly then simple use of dask.dataframe.read_csv using a globstring for the path should suffice:
import dask.dataframe as dd
df = dd.read_csv('2000-*.csv')
Keyword arguments
The dask.dataframe.read_csv function supports most of the pandas.read_csv keyword arguments, so you might want to tweak things a bit.
df = dd.read_csv('2000-*.csv', parse_dates=['timestamp'])
Set the index
Many operations like groupbys, joins, index lookup, etc. can be more efficient if the target column is the index. For example if the timestamp column is made to be the index then you can quickly look up the values for a particular range easily, or you can join efficiently with another dataframe along time. The savings here can easily be 10x.
The naive way to do this is to use the set_index method
df2 = df.set_index('timestamp')
However if you know that your new index column is sorted then you can make this much faster by passing the sorted=True keyword argument
df2 = df.set_index('timestamp', sorted=True)
Divisions
In the above case we still pass through the data once to find good breakpoints. However if your data is already nicely segmented (such as one file per day) then you can give these division values to set_index to avoid this initial pass (which can be costly for a large amount of CSV data.
import pandas as pd
divisions = tuple(pd.date_range(start='2000', end='2001', freq='1D'))
df2 = df.set_index('timestamp', sorted=True, divisions=divisions)
This solution correctly and cheaply sets the timestamp column as the index (allowing for efficient computations in the future).
Convert to another format
CSV is a pervasive and convenient format. However it is also very slow. Other formats like Parquet may be of interest to you. They can easily be 10x to 100x faster.

Is .loc the best way to build a pandas DataFrame?

I have a fairly large csv file (700mb) which is assembled as follows:
qCode Date Value
A_EVENTS 11/17/2014 202901
A_EVENTS 11/4/2014 801
A_EVENTS 11/3/2014 2.02E+14
A_EVENTS 10/17/2014 203901
etc.
I am parsing this file to get specific values, and then using DF.loc to populate a pre-existing DataFrame, i.e. the code:
for line in fileParse:
code=line[0]
for point in fields:
if(point==code[code.find('_')+1:len(code)]):
date=line[1]
year,quarter=quarter_map(date)
value=float(line[2])
pos=line[0].find('_')
ticker=line[0][0:pos]
i=ticker+str(int(float(year)))+str(int(float(quarter)))
df.loc[i,point]=value
else:
pass
the question I have is .loc the most efficient way to add values to a existing DataFrame? As this operation seems to take over 10 hours...
fyi fields are the col that are in the DF (values i'm interested in) and the index (i) is a string...
thanks

No, you should never build a dataframe row-by-row. Each time you do this the entire dataframe has to be copied (it's not extended inplace) so you are using n + (n - 1) + (n - 2) + ... + 1, O(n^2), memory (which has to be garbage collected)... which is terrible, hence it's taking hours!
You want to use read_csv, and you have a few options:
read in the entire file in one go (this should be fine with 700mb even with just a few gig of ram).
pd.read_csv('your_file.csv')
read in the csv in chunks and then glue them together (in memory)... tbh I don't think this will actually use less memory than the above, but is often useful if you are doing some munging at this step.
pd.concat(pd.read_csv('foo.csv', chunksize=100000))  # not sure what optimum value is for chunksize
read the csv in chunks and save them into pytables (rather than in memory), if you have more data than memory (and you've already bought more memory) then use pytables/hdf5!
store = pd.HDFStore('store.h5')
for df in pd.read_csv('foo.csv', chunksize=100000):
store.append('df', df)

If I understand correctly, I think it would be much faster to:
Import the whole csv into a dataframe using pandas.read_csv.
Select the rows of interest from the dataframe.
Append the rows to your other dataframe using df.append(other_df).
If you provide more information about what criteria you are using in step 2 I can provide code there as well.

A couple of options that come to mind
1) Parse the file as you are currently doing, but build a dict intend of appending to your dataframe. After you're done with that convert that dict to a Dataframe and then use concat() to combine it with the existing Dataframe
2) Bring that csv into pandas using read_csv() and then filter/parse on what you want then do a concat() with the existing dataframe

Sorting in pandas for large datasets

I would like to sort my data by a given column, specifically p-values. However, the issue is that I am not able to load my entire data into memory. Thus, the following doesn't work or rather works for only small datasets.
data = data.sort(columns=["P_VALUE"], ascending=True, axis=0)
Is there a quick way to sort my data by a given column that only takes chunks into account and doesn't require loading entire datasets in memory?

In the past, I've used Linux's pair of venerable sort and split utilities, to sort massive files that choked pandas.
I don't want to disparage the other answer on this page. However, since your data is text format (as you indicated in the comments), I think it's a tremendous complication to start transferring it into other formats (HDF, SQL, etc.), for something that GNU/Linux utilities have been solving very efficiently for the last 30-40 years.
Say your file is called stuff.csv, and looks like this:
4.9,3.0,1.4,0.6
4.8,2.8,1.3,1.2
Then the following command will sort it by the 3rd column:
sort --parallel=8 -t . -nrk3 stuff.csv
Note that the number of threads here is set to 8.
The above will work with files that fit into the main memory. When your file is too large, you would first split it into a number of parts. So
split -l 100000 stuff.csv stuff
would split the file into files of length at most 100000 lines.
Now you would sort each file individually, as above. Finally, you would use mergesort, again through (waith for it...) sort:
sort -m sorted_stuff_* > final_sorted_stuff.csv
Finally, if your file is not in CSV (say it is a tgz file), then you should find a way to pipe a CSV version of it into split.

As I referred in the comments, this answer already provides a possible solution. It is based on the HDF format.
About the sorting problem, there are at least three possible ways to solve it with that approach.
First, you can try to use pandas directly, querying the HDF-stored-DataFrame.
Second, you can use PyTables, which pandas uses under the hood.
Francesc Alted gives a hint in the PyTables mailing list:
The simplest way is by setting the sortby parameter to true in the
Table.copy() method. This triggers an on-disk sorting operation, so you
don't have to be afraid of your available memory. You will need the Pro
version for getting this capability.
In the docs, it says:
sortby :
If specified, and sortby corresponds to a column with an index, then the copy will be sorted by this index. If you want to ensure a fully sorted order, the index must be a CSI one. A reverse sorted copy can be achieved by specifying a negative value for the step keyword. If sortby is omitted or None, the original table order is used
Third, still with PyTables, you can use the method Table.itersorted().
From the docs:
Table.itersorted(sortby, checkCSI=False, start=None, stop=None, step=None)
Iterate table data following the order of the index of sortby column. The sortby column must have associated a full index.
Another approach consists in using a database in between. The detailed workflow can be seen in this IPython Notebook published at plot.ly.
This allows to solve the sorting problem, along with other data analyses that are possible with pandas. It looks like it was created by the user chris, so all the credit goes to him. I am copying here the relevant parts.
Introduction
This notebook explores a 3.9Gb CSV file.
This notebook is a primer on out-of-memory data analysis with
pandas: A library with easy-to-use data structures and data analysis tools. Also, interfaces to out-of-memory databases like SQLite.
IPython notebook: An interface for writing and sharing python code, text, and plots.
SQLite: An self-contained, server-less database that's easy to set-up and query from Pandas.
Plotly: A platform for publishing beautiful, interactive graphs from Python to the web.
Requirements
import pandas as pd
from sqlalchemy import create_engine # database connection
Import the CSV data into SQLite
Load the CSV, chunk-by-chunk, into a DataFrame
Process the data a bit, strip out uninteresting columns
Append it to the SQLite database
disk_engine = create_engine('sqlite:///311_8M.db') # Initializes database with filename 311_8M.db in current directory
chunksize = 20000
index_start = 1
for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
# do stuff
df.index += index_start
df.to_sql('data', disk_engine, if_exists='append')
index_start = df.index[-1] + 1
Query value counts and order the results
Housing and Development Dept receives the most complaints
df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints`'
'FROM data '
'GROUP BY Agency '
'ORDER BY -num_complaints', disk_engine)
Limiting the number of sorted entries
What's the most 10 common complaint in each city?
df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
'FROM data '
'GROUP BY `City` '
'ORDER BY -num_complaints '
'LIMIT 10 ', disk_engine)
Possibly related and useful links
Pandas: in memory sorting hdf5 files
ptrepack sortby needs 'full' index
http://pandas.pydata.org/pandas-docs/stable/cookbook.html#hdfstore
http://www.pytables.org/usersguide/optimization.html

Blaze might be the tool for you with the ability to work with pandas and csv files out of core.
http://blaze.readthedocs.org/en/latest/ooc.html
import blaze
import pandas as pd
d = blaze.Data('my-large-file.csv')
d.P_VALUE.sort() # Uses Chunked Pandas
For faster processing, load it into a database first which blaze can control. But if this is a one off and you have some time then the posted code should do it.

If your csv file contains only structured data, I would suggest approach using only linux commands.
Assume csv file contains two columns, COL_1 and P_VALUE:
map.py:
import sys
for line in sys.stdin:
col_1, p_value = line.split(',')
print "%f,%s" % (p_value, col_1)
then the following linux command will generate the csv file with p_value sorted:
cat input.csv | ./map.py | sort > output.csv
If you're familiar with hadoop, using the above map.py also adding a simple reduce.py will generate the sorted csv file via hadoop streaming system.

Here is my Honest sugg./ Three options you can do.
I like Pandas for its rich doc and features but I been suggested to
use NUMPY as it feel faster comparatively for larger datasets. You can think of using other tools as well for easier job.
In case you are using Python3, you can break your big data chunk into sets and do Congruent Threading. I am too lazy for this and it does nt look cool, you see Panda, Numpy, Scipy are build with Hardware design perspectives to enable multi threading I believe.
I prefer this, this is easy and lazy technique acc. to me. Check the document at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort.html
You can also use 'kind' parameter in your pandas-sort function you are using.
Godspeed my friend.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.