This is a question related to this post.
I am experimenting with Dask and Parquet files. I loaded the New York parking violations data I downloaded here.
I read the data files, find the common columns, apply the datatypes, and afterwards save everything as a Parquet collection:
from dask import dataframe as dd
from dask.diagnostics import ProgressBar
import numpy as np
base_url = 'origin/nyc-parking-tickets/'
fy14 = dd.read_csv(base_url + '*2014_.csv')
fy15 = dd.read_csv(base_url + '*2015.csv')
fy16 = dd.read_csv(base_url + '*2016.csv')
fy17 = dd.read_csv(base_url + '*2017.csv')
data = [fy14, fy15, fy16, fy17]
col_set = [set(d.columns) for d in data]
common_columns = list(set.intersection(*col_set))
# Set proper column types
dtype_tuples = [(x, str) for x in common_columns]
dtypes = dict(dtype_tuples)
floats = ['Feet From Curb', 'Issuer Code', 'Issuer Precinct', 'Law Section', 'Vehicle Year', 'Violation Precinct']
ints32 = ['Street Code1', 'Street Code2', 'Street Code3', 'Summons Number']
ints16 = ['Violation Code']
for item in floats: dtypes[item] = np.float32
for item in ints32: dtypes[item] = np.int32
for item in ints16: dtypes[item] = np.int16
# Read Data
data = dd.read_csv(base_url + '*.csv', dtype=dtypes, usecols=common_columns) # usecols not in Dask documentation, but from pandas
# Write data as parquet
target_url = 'target/nyc-parking-tickets-pq/'
with ProgressBar():
    data.to_parquet(target_url)
When I attempt to reload the data
data2 = dd.read_parquet(target_url, engine='pyarrow')
I get a ValueError saying that some of the partitions have a different file format. Looking at the output, I can see that the 'Violation Legal Code' column is interpreted as null in one partition, presumably because the data is too sparse for sampling.
In the post with the original question, two solutions are suggested: the first is to enter dummy values, the other is to supply column types when loading the data. I would like to do the latter, but I am stuck.
To dd.read_csv I can pass the dtype argument, for which I simply use the dtypes dictionary defined above, but dd.read_parquet does not accept that keyword. The documentation seems to suggest that categories takes over that role, but even when passing categories=dtypes, I still get the same error.
How can I pass type specifications to dask.dataframe.read_parquet?
You cannot pass dtypes to read_parquet because Parquet files know their own dtypes (with CSV it is ambiguous). Dask DataFrame expects that all files of a dataset have the same schema; as of 2019-03-26, there is no support for loading data with mixed schemas.
That being said, you could do this yourself using something like Dask Delayed: do whatever manipulations you need to do on a file-by-file basis, and then convert those into a Dask DataFrame with dd.from_delayed; a sketch follows the links below. More information about that is here:
https://docs.dask.org/en/latest/delayed.html
https://docs.dask.org/en/latest/delayed-collections.html
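For illustration, a minimal sketch of that per-file approach under the setup above. It assumes Dask's default part.*.parquet file naming and that forcing the sparse 'Violation Legal Code' column to string is the fix you want; adjust both to your data.
import glob
import dask
import pandas as pd
from dask import dataframe as dd
@dask.delayed
def load_and_fix(path):
    # Read one Parquet file with pandas and normalize the problematic column
    df = pd.read_parquet(path)
    df['Violation Legal Code'] = df['Violation Legal Code'].astype(str)
    return df
paths = sorted(glob.glob('target/nyc-parking-tickets-pq/part.*.parquet'))
data2 = dd.from_delayed([load_and_fix(p) for p in paths])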
It seems the problem was with the Parquet engine. When I changed the code to
data.to_parquet(target_url, engine='fastparquet')
and
data2 = dd.read_parquet(target_url, engine='fastparquet')
the writing and loading worked fine.
I'm using pandas to write a parquet file using the to_parquet function with partitions. Example:
df.to_parquet('gs://bucket/path', partition_cols=['key'])
The issue is that every time I run the code, it adds a new parquet file to the partition, and when you read the data, you get all the data from each time the script was run. Essentially, the data is appended each time.
Is there a way to overwrite the data every time you write using pandas?
I have found dask helpful for reading and writing parquet. It supplies a default file name on write (which you can alter) and will replace a parquet file if you use the same name, which I believe is what you are looking for. You can append data to the partition by setting append to True, which is more intuitive to me, or you can set overwrite to True, which removes all files in the partition/folder before writing the file (see the sketch after the example code below). Reading parquet also works well if you include the partition columns in the dataframe on read.
https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html
See below some code I used to satisfy myself of the behaviour of dask.dataframe.to_parquet:
import pandas as pd
from dask import dataframe as dd
import numpy as np
dates = pd.date_range("2015-01-01", "2022-06-30")
df_len = len(dates)
df_1 = pd.DataFrame(np.random.randint(0, 1000, size=(df_len, 1)), columns=["value"])
df_2 = pd.DataFrame(np.random.randint(0, 1000, size=(df_len, 1)), columns=["value"])
df_1["date"] = dates
df_1["YEAR"] = df_1["date"].dt.year
df_1["MONTH"] = df_1["date"].dt.month
df_2["date"] = dates
df_2["YEAR"] = df_2["date"].dt.year
df_2["MONTH"] = df_2["date"].dt.month
ddf_1 = dd.from_pandas(df_1, npartitions=1)
ddf_2 = dd.from_pandas(df_2, npartitions=1)
name_function = lambda x: f"monthly_data_{x}.parquet"
ddf_1.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
)
print(ddf_1.head())
ddf_first_write = dd.read_parquet("dask_test_folder/YEAR=2015/MONTH=1")
print(ddf_first_write.head())
ddf_2.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
)
print(ddf_2.head())
ddf_second_write = dd.read_parquet("dask_test_folder/YEAR=2015/MONTH=1")
print(ddf_second_write.head())
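For comparison, a sketch of the overwrite behaviour mentioned above (assuming a dask version recent enough to support the overwrite keyword): the same write with overwrite=True clears dask_test_folder before writing, so only the second dataset remains instead of two files per partition.
ddf_2.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
    overwrite=True,  # remove existing contents of the target folder first
)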
Yes, there is. If you read the pandas docs you'll see that to_parquet supports **kwargs and uses the pyarrow engine by default. That takes you to the pyarrow docs, where you'll see there are two ways of doing this. One is partition_filename_cb, which needs legacy support and will be deprecated.
The other is basename_template, which is the new way, introduced because of the performance cost of running a callable/lambda to name each partition. You pass a string such as "string_{i}"; it only works with legacy support off, and the saved files will be named "string_0", "string_1", and so on.
You can't use both at the same time.
def write_data(
    df: pd.DataFrame,
    path: str,
    file_format="csv",
    comp_zip="snappy",
    index=False,
    partition_cols: list[str] = None,
    basename_template: str = None,
    storage_options: dict = None,
    **kwargs,
) -> None:
    # Dispatch to df.to_csv / df.to_parquet / ... depending on file_format
    getattr(pd.DataFrame, f"to_{file_format}")(
        df,
        f"{path}.{file_format}",
        compression=comp_zip,
        index=index,
        partition_cols=partition_cols,
        basename_template=basename_template,
        storage_options=storage_options,
        **kwargs,
    )
Try this.
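Hypothetical usage of the wrapper above; the bucket path and partition column come from the question, while the template string and credentials are placeholders:
write_data(
    df,
    "gs://bucket/path",
    file_format="parquet",
    partition_cols=["key"],
    basename_template="part-{i}.parquet",
    storage_options={"token": "path/to/credentials.json"},
)
Note that the wrapper appends the format as a suffix, so this writes under "gs://bucket/path.parquet".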
I have a timestamp of 9999-12-31 23:59:59 stored in a parquet file as an int96. I read this parquet file using pyarrow.dataset and convert the resulting table into a pandas dataframe (using pyarrow.Table.to_pandas()). The conversion to a pandas dataframe turns my timestamp into 1816-03-30 05:56:07.066277376 (the pandas Timestamp probably has a smaller range of valid dates) without any complaining about the datatype or anything.
I then take this pandas dataframe, convert it back to a table and write it into a parquet dataset using pyarrow.dataset.write_dataset.
I am now left with different data than the data I started with, without seeing any warnings. (I found this out when I tried to create an Impala table from the parquet dataset and then couldn't query it properly.)
Is there a way to handle these large timestamps when converting from pyarrow table to pandas?
I've tried using the timestamp_as_object = True parameter, as in Table.to_pandas(timestamp_as_object = True), but it doesn't seem to do anything.
EDIT: providing a reproducible example. The problem is that pyarrow thinks the timestamps are nanoseconds while reading the file, although they were stored as microseconds:
import pyarrow as pa
import pyarrow.dataset as ds
non_legacy_hdfs_filesystem = # connect to a filesystem here
my_table = pa.Table.from_arrays([pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])
parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(use_deprecated_int96_timestamps = True, coerce_timestamps = 'us', allow_truncated_timestamps = True)
ds.write_dataset(data = my_table, base_dir = 'my_path', filesystem = non_legacy_hdfs_filesystem, format = parquet_format, file_options = write_options, partitioning= None)
dataset = ds.dataset('my_path', filesystem = non_legacy_hdfs_filesystem)
dataset.to_table().column('my_timestamps')
My understanding is that your data has been saved using use_deprecated_int96_timestamps=True.
import pyarrow as pa
import pyarrow.parquet as pq
my_table = pa.Table.from_arrays([pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])
pq.write_table(my_table, '/tmp/table.pq', use_deprecated_int96_timestamps=True)
In this mode, timestamps are saved as 96-bit integers with a (default/hardcoded) nanosecond resolution.
>>> pq.read_metadata('/tmp/table.pq').schema[0]
<ParquetColumnSchema>
name: my_timestamps
path: my_timestamps
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE
In the latest versions of arrow/parquet, timestamps are 64-bit integers with a configurable resolution.
It should be possible to convert the legacy 96-bit nanosecond timestamps to 64-bit integers with microsecond resolution without loss of information, but unfortunately there's no option in the parquet reader that would let you do that (as far as I can tell).
You may have to raise an issue with parquet/arrow, but I think they are trying hard to deprecate 96-bit integers.
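If you control how the files are written, one possible workaround (a sketch using the same toy table as above and only standard pyarrow options) is to drop the INT96 flag so the values stay 64-bit microseconds end to end:
import pyarrow as pa
import pyarrow.parquet as pq
my_table = pa.Table.from_arrays(
    [pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')],
    names=['my_timestamps'])
# No use_deprecated_int96_timestamps here: the column is written as a
# 64-bit microsecond timestamp and reads back unchanged.
pq.write_table(my_table, '/tmp/table_us.pq', coerce_timestamps='us',
               allow_truncated_timestamps=True)
print(pq.read_table('/tmp/table_us.pq').column('my_timestamps'))
# to_pandas(timestamp_as_object=True) should then also preserve the dates.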
I have a recording of a CAN bus transmission and I want to analyze it now. In the past I used Excel for this, but now I am faced with huge amounts of data (> 10 GB). With pd.read_csv I can load the data into a dataframe without trouble. But the hexadecimal numbers are stored as strings of the form "6E" rather than "0x6E". Furthermore, some columns are filled with "None".
I have already tested this with a for loop and an if check for None; that works, but the procedure takes a very long time:
def load_data(self, file_folder, file_type):
    df_local_list = []
    # Load filenames as strings into the list "all_files"
    full_path = glob.glob(file_folder + "/*." + file_type)
    self.all_files = natsort.natsorted(full_path)
    # Walk through all files and load the content into the list "df_local_list"
    for file in self.all_files:
        # Read file content into the data-frame variable "local_df"
        local_df = pd.read_csv(file, names=self.header_list,
                               delim_whitespace=True, skiprows=12, skipfooter=3, header=13,
                               error_bad_lines=False, engine='python')
        # Save the file content without the last two lines --> end header
        # self.df_list.append(local_df[:-2])
        df_local_list.append(local_df)
    self.df = pd.concat(df_local_list, axis=0, ignore_index=True)
    self.df['Byte0_int'] = ('0x' + self.df['Byte0']).apply(int, base=0)
I would like to have a fast function which converts selected columns from hex to int, skipping the "None" values.
I had a similar issue and I did this:
self.df['Byte0_int'] = self.df['Byte0_int'].dropna().map(lambda x:int(x, 16))
In short, I remove NaN first, then I convert everything else from hex to int
No need to prefix "0x" since they are processed the same:
>>> int("0x5e",16)
94
>>> int("5e",16)
94
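Put together on a tiny frame (a sketch with made-up values), the rows that were None simply stay NaN because the assignment aligns on the index:
import pandas as pd
df = pd.DataFrame({"Byte0": ["6E", None, "FF", "0A"]})
# dropna() removes the missing rows, map() converts the hex strings,
# and assigning back leaves NaN where Byte0 was None
df["Byte0_int"] = df["Byte0"].dropna().map(lambda x: int(x, 16))
print(df)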
I suggest profiling the timings of your code, because the concat could be costly. You could also have a look at dask to process the many files.
I am currently discovering the HDF5 library in Python and I have a problem. I have a dataset with this layout:
GROUP "GROUP1" {
DATASET "DATASET1" {
DATATYPE H5T_COMPOUND {
H5T_STD_I64LE "DATATYPE1";
H5T_STD_I64LE "DATATYPE2";
H5T_STD_I64LE "DATATYPE3";
}
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): {
1,
2,
3
I am trying to iterate over the dataset to get the values associated with each datatype and copy them into a text file. (For example, "1" is the value associated with "DATATYPE1".) The following script does work:
new_file = open('newfile.txt', 'a')
for i in range(len(dataset[...])):
    new_file.write('Ligne ' + str(i) + " " + ":" + " ")
    for j in range(len(dataset[i, ...])):
        new_file.write(str(dataset[i][j]) + "\n")
But it is not very clean... So I tried to get the values by calling the datatypes by name. The closest script I found is the following:
for attribute in group.attrs:
    print(group.attrs[attribute])
Unfortunately, despite my attempts, it does not work on datatypes:
# Checking the datatypes by iterating over the dataset's dtype
for data.dtype in dataset.dtype:
    # then print the datatypes
    print(dataset.dtype[data.dtype])
The resulting error message is "'numpy.dtype' object is not iterable".
Do you have any idea how to proceed? I hope my question is clear.
Without your data it's hard to offer specific solutions. Here is a very simple example that mimics your data schema using pytables (and numpy). First it creates the HDF5 file, with a table named DATASET1 under group GROUP1. DATASET1 has 3 int values in each row, named DATATYPE1, DATATYPE2, and DATATYPE3. The ds1.append() function adds rows of data to the table (1 row at a time).
After the data is created, walk_nodes() is used to traverse the HDF5 file structure and print node names and dtypes for tables.
import tables as tb
import numpy as np
with tb.open_file("SO_56545586.h5", mode="w") as h5f:
    ds1 = h5f.create_table('/GROUP1', 'DATASET1',
                           description=np.dtype([('DATATYPE1', int), ('DATATYPE2', int), ('DATATYPE3', int)]),
                           createparents=True)
    for row in range(5):
        row_vals = [(row, row+1, row*2), ]
        ds1.append(row_vals)
    ## This section walks the file structure (groups and datasets), printing node names and dtype for tables:
    for this_node in h5f.walk_nodes('/'):
        print(this_node)
        if isinstance(this_node, tb.Table):
            print(this_node.dtype)
Note: do not use mode = "w" when you open an existing file. It will create a new file (overwrite the existing file). Use mode = "a" or mode = "r+" if you need to append data, or mode = "r" if you only need to read the data.
To complement the solution added by kcw78, I also found this script, which works as well. Because I can't iterate over the dataset, I copied the dataset into a new array:
dataset = file['path_to_dataset']
data = np.array(dataset)  # Create a new numpy array filled with the dataset values.
print(data)
ls_column = list(data.dtype.names)  # Get a list of the datatype names associated with the data values.
print(ls_column)  # Show the layout of the datatypes associated with the previous data values.
# Extract one array per datatype (column) rather than per record.
for col in ls_column:
    k = data[col]  # example: k = data['DATATYPE1'], k = data['DATATYPE2']
    print(k)
Arnaud, OK, I see you are using h5py.
I don't understand what you mean by "I can't iterate over dataset". You can iterate over rows, or columns/fields.
Here is an example to demonstrate with h5py.
It shows 4 ways to extract data from the dataset (the last one iterates):
1. Read the entire HDF5 dataset into a NumPy array
2. Then read 1 column from that array into another array
3. Read 1 column from the HDF5 dataset as an array
4. Loop through the HDF5 dataset columns and read them 1 at a time as arrays
Note that the return from .dtype.names is iterable. You don't need to create a list (unless you need it for other purposes). Also, HDF5 supports mixed types in datasets, so you can get a dtype with int, float, and string values (it will be a record array).
import h5py
import numpy as np
with h5py.File("SO_56545586.h5", "w") as h5f:
    # create empty dataset 'DATASET1' in group '/GROUP1'
    # the dtype argument defines names and types
    ds1 = h5f.create_dataset('/GROUP1/DATASET1', (10,),
                             dtype=np.dtype([('DATATYPE1', int), ('DATATYPE2', int), ('DATATYPE3', int)]))
    for row in range(5):  # load some arbitrary data into the dataset
        row_vals = [(row, row+1, row*2), ]
        ds1[row] = row_vals
    # to read the entire dataset as an array
    ds1_arr = h5f['/GROUP1/DATASET1'][:]
    print(ds1_arr.dtype)
    # to read 1 column from ds1_arr as an array
    ds1_col1 = ds1_arr[:]['DATATYPE1']
    print('for DATATYPE1 from ds1_arr, dtype=', ds1_col1.dtype)
    # to read 1 HDF5 dataset column as an array
    ds1_col1 = h5f['/GROUP1/DATASET1'][:, 'DATATYPE1']
    print('for DATATYPE1 from HDF5, dtype=', ds1_col1.dtype)
    # to loop thru HDF5 dataset columns and read 1 at a time as an array
    for col in h5f['/GROUP1/DATASET1'].dtype.names:
        print('for ', col, ', dtype=', h5f['/GROUP1/DATASET1'][col].dtype)
        col_arr = h5f['/GROUP1/DATASET1'][col][:]
        print(col_arr.shape)
I'm doing a lot of cleaning, annotating and simple transformations on very large Twitter datasets (~50M messages). I'm looking for some kind of data structure that would contain column info the way pandas does, but work with iterators rather than reading the whole dataset into memory at once. I'm considering writing my own, but I wondered if there is something with similar functionality out there. I know I'm not the only one doing things like this!
Desired functionality:
>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds.columns
['id', 'message']
>>> ds.iterator.next()
[2385, "Hi it's me, Sally!"]
>>> ds = datastream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds_tok = get_tokens(ds)
>>> ds_tok.columns
['message_id', 'token', 'n']
>>> ds_tok.iterator.next()
[2385, "Hi", 0]
>>> ds_tok.iterator.next()
[2385, "it's", 1]
>>> ds_tok.iterator.next()
[2385, "me", 2]
>>> ds_tok.to_sql(db_info)
UPDATE: I've settled on a combination of dict iterators and pandas dataframes to satisfy these needs.
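For reference, one hypothetical shape that combination could take (an illustration, not the author's actual code): a generator yields dict rows, and chunks are materialized as DataFrames only when a columnar operation is needed.
import itertools
import pandas as pd
def tokenize_stream(rows):
    # rows is an iterator of dicts like {'id': ..., 'message': ...}
    for row in rows:
        for n, token in enumerate(row['message'].split()):
            yield {'message_id': row['id'], 'token': token, 'n': n}
def to_frames(stream, chunksize=100_000):
    # Materialize the dict stream into DataFrame chunks of bounded size
    while True:
        chunk = list(itertools.islice(stream, chunksize))
        if not chunk:
            break
        yield pd.DataFrame(chunk)
# e.g. for df in to_frames(tokenize_stream(rows)): df.to_sql(..., if_exists='append')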
As commented, there is a chunksize argument for read_sql, which means you can work on the SQL results piecemeal. I would probably use an HDFStore to save the intermediary results... or you could just append them back to another SQL table:
dfs = pd.read_sql(..., chunksize=100000)
store = pd.HDFStore("store.h5")
for df in dfs:
    clean_df = ...  # whatever munging you have to do
    store.append("df", clean_df)
(see the HDF5 section of the docs), or
dfs = pd.read_sql(..., chunksize=100000)
for df in dfs:
    clean_df = ...
    clean_df.to_sql(..., if_exists='append')
See the SQL section of the docs.