handling large timestamps when converting from pyarrow.Table to pandas - python

I have a timestamp of 9999-12-31 23:59:59 stored in a parquet file as an int96. I read this parquet file using pyarrow.dataset and convert the resulting table into a pandas dataframe (using pyarrow.Table.to_pandas()). The conversion to a pandas dataframe turns my timestamp into 1816-03-30 05:56:07.066277376 (the pandas timestamp probably has a smaller range of valid dates) without any complaint about the datatype or anything else.
I then take this pandas dataframe, convert it back to a table and write it into a parquet dataset using pyarrow.dataset.write_dataset.
I am now left with different data than I started with, without seeing any warnings. (I found this out when I tried to create an Impala table from the parquet dataset and then couldn't query it properly.)
Is there a way to handle these large timestamps when converting from pyarrow table to pandas?
I've tried using the timestamp_as_object=True parameter, as in Table.to_pandas(timestamp_as_object=True), but it doesn't seem to do anything.
EDIT: providing a reproducible example. The problem is that pyarrow treats the timestamps as nanoseconds when reading the file, even though they were stored as microseconds:
import pyarrow as pa
import pyarrow.dataset as ds
non_legacy_hdfs_filesystem = # connect to a filesystem here
my_table = pa.Table.from_arrays([pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])
parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(use_deprecated_int96_timestamps = True, coerce_timestamps = 'us', allow_truncated_timestamps = True)
ds.write_dataset(data = my_table, base_dir = 'my_path', filesystem = non_legacy_hdfs_filesystem, format = parquet_format, file_options = write_options, partitioning= None)
dataset = ds.dataset('my_path', filesystem = non_legacy_hdfs_filesystem)
dataset.to_table().column('my_timestamps')

My understanding is that your data has been saved using use_deprecated_int96_timestamps=True.
import pyarrow as pa
import pyarrow.parquet as pq
my_table = pa.Table.from_arrays([pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])
pq.write_table(my_table, '/tmp/table.pq', use_deprecated_int96_timestamps=True)
In this mode, timestamps are saved as 96-bit integers with a (default/hardcoded) nanosecond resolution.
>>> pq.read_metadata('/tmp/table.pq').schema[0]
<ParquetColumnSchema>
name: my_timestamps
path: my_timestamps
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE
In the latest versions of arrow/parquet, timestamps are 64-bit integers with a configurable resolution.
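For illustration, writing the same table without the legacy int96 flag stores a 64-bit microsecond timestamp, so the year 9999 value survives in the file (a minimal sketch based on the snippet above; the exact metadata printout may vary between pyarrow versions):
import pyarrow as pa
import pyarrow.parquet as pq

my_table = pa.Table.from_arrays(
    [pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')],
    names=['my_timestamps'])

# No use_deprecated_int96_timestamps here, so the column is written as a
# 64-bit INT64 timestamp with a microsecond logical type.
pq.write_table(my_table, '/tmp/table_us.pq')
print(pq.read_metadata('/tmp/table_us.pq').schema[0])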
It should be possible to convert the legacy 96-bit nanosecond timestamps to 64-bit integers with microsecond resolution without loss of information. Unfortunately, there is no option in the parquet reader that lets you do that (as far as I can tell).
You may have to raise an issue with parquet/arrow, but I think they are trying hard to deprecate 96-bit integers.
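That said, if upgrading is an option, later pyarrow releases added a coerce_int96_timestamp_unit read option for exactly this situation; a minimal sketch, assuming your installed pyarrow version supports the parameter (check your version's docs):
import pyarrow.parquet as pq

# Assumption: a pyarrow version where read_table accepts coerce_int96_timestamp_unit.
# It tells the reader to interpret legacy int96 timestamps at microsecond
# resolution instead of nanoseconds, so year 9999 no longer overflows.
table = pq.read_table('/tmp/table.pq', coerce_int96_timestamp_unit='us')
print(table.column('my_timestamps'))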

Related

MariaDB BLOB data format in Power BI vs. Python conversion

I have a MariaDB Table containing a MEDIUMBLOB Column. There are several entries in this table corresponding to one photo each.
When querying the data in Power BI using the MariaDB connector, I get the data in the format "Binary".
However, when querying the same data in Python (in an IDE or within Power BI) the format is different:
The bigger picture is to use this code to split the image into chunks, as Power BI has a character limit on its data elements:
let
    Source = MariaDB.Contents("XXX.XXX.XXX.XXX:YYYY", "ZZZZZ"),
    Query1 = Source{[Name="Query1",Kind="Table"]}[Data],
    #"Removed Top Rows" = Table.Skip(Query1, 1),
    // Remove unnecessary columns
    RemoveOtherColumns = Table.SelectColumns(#"Removed Top Rows", {"picture", "batchname"}),
    // Create splitter function
    SplitTextFunction = Splitter.SplitTextByRepeatedLengths(30000),
    // Convert table of files to list
    ListInput = Table.ToRows(RemoveOtherColumns),
    // Function to convert the binary of a photo to multiple text values
    ConvertOneFile = (InputRow as list) =>
        let
            BinaryIn = InputRow{0},
            FileName = InputRow{1},
            BinaryText = Binary.ToText(BinaryIn, BinaryEncoding.Base64),
            SplitUpText = SplitTextFunction(BinaryText),
            AddFileName = List.Transform(SplitUpText, each {FileName, _})
        in
            AddFileName,
    // Loop over all photos and call the above function
    ConvertAllFiles = List.Transform(ListInput, each ConvertOneFile(_)),
    // Combine lists together
    CombineLists = List.Combine(ConvertAllFiles),
    // Convert results to table
    ToTable = #table(type table [Name = text, Pic = text], CombineLists),
    // Add index column to output table
    AddIndexColumn = Table.AddIndexColumn(ToTable, "Index", 0, 1)
in
    AddIndexColumn
As I am a beginner on this topic, I am confident there is a straightforward conversion I am missing here, but I couldn't figure it out myself so far.
I greatly appreciate any help. Thank you!
What are you planning to do with this data? PBI doesn't support binary data although you can see it in Power Query. It must be converted to something else before it can be loaded to the PBI data model.
I suspect the Python version is just the binary already converted to text. If you click the two arrows in the top right of the picture column for the PBI version, do you not get the same output?
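To illustrate that suspicion in Python terms, here is a hedged sketch (the function and variable names are made up, not from the question) showing how the raw BLOB bytes Python returns become the same kind of Base64 text that Binary.ToText(BinaryIn, BinaryEncoding.Base64) produces, including the 30000-character split:
import base64

def blob_to_text_chunks(blob: bytes, chunk_size: int = 30000) -> list[str]:
    # Base64-encode the raw MEDIUMBLOB bytes, then split the text into
    # fixed-length pieces, mirroring Splitter.SplitTextByRepeatedLengths(30000)
    # in the Power Query code above.
    text = base64.b64encode(blob).decode("ascii")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Example with dummy bytes standing in for one photo's BLOB:
chunks = blob_to_text_chunks(b"\x89PNG..." * 10000)
print(len(chunks), len(chunks[0]))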

Is there a way to overwrite existing data using pandas to_parquet with partitions?

I'm using pandas to write a parquet file using the to_parquet function with partitions. Example:
df.to_parquet('gs://bucket/path', partition_cols=['key'])
The issue is that every time I run the code, it adds a new parquet file to the partition, and when you read the data you get all the data from every time the script was run. Essentially, the data appends each time.
Is there a way to overwrite the data every time you write using pandas?
I have found dask helpful for reading and writing parquet. It defaults the file name on write (which you can alter) and will replace the parquet file if you use the same name, which I believe is what you are looking for. You can append data to the partition by setting append to True, which is more intuitive to me, or you can set overwrite to True, which will remove all files in the partition/folder prior to writing the file (a short overwrite example follows the code below). Reading parquet works well too when you include the partition columns in the dataframe on read.
https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html
See below some code I used to satisfy myself of the behaviour of dask.dataframe.to_parquet:
import pandas as pd
from dask import dataframe as dd
import numpy as np
dates = pd.date_range("2015-01-01", "2022-06-30")
df_len = len(dates)
df_1 = pd.DataFrame(np.random.randint(0, 1000, size=(df_len, 1)), columns=["value"])
df_2 = pd.DataFrame(np.random.randint(0, 1000, size=(df_len, 1)), columns=["value"])
df_1["date"] = dates
df_1["YEAR"] = df_1["date"].dt.year
df_1["MONTH"] = df_1["date"].dt.month
df_2["date"] = dates
df_2["YEAR"] = df_2["date"].dt.year
df_2["MONTH"] = df_2["date"].dt.month
ddf_1 = dd.from_pandas(df_1, npartitions=1)
ddf_2 = dd.from_pandas(df_2, npartitions=1)
name_function = lambda x: f"monthly_data_{x}.parquet"
ddf_1.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
)
print(ddf_1.head())
ddf_first_write = dd.read_parquet("dask_test_folder/YEAR=2015/MONTH=1")
print(ddf_first_write.head())
ddf_2.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
)
print(ddf_2.head())
ddf_second_write = dd.read_parquet("dask_test_folder/YEAR=2015/MONTH=1")
print(ddf_second_write.head())
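For completeness, the overwrite behaviour mentioned above looks like this (a small addition to the same example, using the overwrite flag documented in the dask docs linked earlier):
# Re-running the second write with overwrite=True clears the target directory
# before writing, so afterwards the dataset contains only ddf_2's data instead
# of the accumulated output of both runs.
ddf_2.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
    overwrite=True,
)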
Yeah, there is. If you read the pandas docs you'll see that to_parquet supports **kwargs and uses engine='pyarrow' by default. That takes you to the pyarrow docs, where you'll see there are two ways of doing this. One is partition_filename_cb, which needs legacy support and will be deprecated.
The other is basename_template, which is the new way, introduced because of the performance cost of running a callable/lambda to name each partition. You pass a string like "string_{i}"; it only works with legacy support off, and the saved files will be "string_0", "string_1", ...
You can't use both at the same time. (A usage sketch follows the function below.)
import pandas as pd

def write_data(
    df: pd.DataFrame,
    path: str,
    file_format="csv",
    comp_zip="snappy",
    index=False,
    partition_cols: list[str] = None,
    basename_template: str = None,
    storage_options: dict = None,
    **kwargs,
) -> None:
    # Look up df.to_csv / df.to_parquet / ... by name and call it.
    getattr(pd.DataFrame, f"to_{file_format}")(
        df,
        f"{path}.{file_format}",
        compression=comp_zip,
        index=index,
        partition_cols=partition_cols,
        basename_template=basename_template,
        storage_options=storage_options,
        **kwargs,
    )
Try this.
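As a usage sketch (hedged: it assumes a pandas/pyarrow combination recent enough to forward basename_template through **kwargs to the dataset writer; the path and column names are made up):
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})

# "part-{i}.parquet" yields part-0.parquet, part-1.parquet, ... inside each
# partition folder, so a re-run writes to the same file names instead of
# piling up new GUID-named files.
df.to_parquet(
    "output_path",
    partition_cols=["key"],
    basename_template="part-{i}.parquet",
)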

Datetime storing in hd5 database

I have a list of np.datetime64 data that looks as follows:
times =[2015-03-26T16:02:42.000000Z,
2015-03-26T16:02:45.000000Z,...]
type(times) returns list
type(times[1]) returns obspy.core.utcdatetime.UTCDateTime
Now, I understand that h5py does not support date time data.
I have tried the following:
time_str = [n.encode("ascii", "ignore") for n in time_str]
time_str = [str(s) for s in time_str]
type(time_str[1]) returns bytes
I am okay with creating the dataset and storing these date time values as a string
However, when attempting to create the dataset, I get the following error:
with h5py.File('data_ML.hdf5', 'w') as f:
    f.create_dataset("time", data=time_str, maxshape=(None), chunks=True, dtype='str')
TypeError: No conversion path for dtype: dtype('<U')
Where am I messing up/ is there an alternative way to store these values as is so I can extract them later?
Ok, here we go. I couldn't get some of your code to work together (maybe you left some steps out, or changed variable names?). And I could not get the obspy.core.utcdatetime.UTCDateTime object you have.
So I created an example that does the following:
1. Starts with a list of np.datetime64() objects
2. Converts to a list of np.datetime_as_string() objects in UTC format (see note in item 4)
3. Converts to a np.array with dtype='S30'
4. Note: I included Step 2 to replicate your data. See the following section for a simpler version.
Code below:
import h5py
import numpy as np

times = [np.datetime64('2015-03-26T16:02:42.000000'),
         np.datetime64('2015-03-26T16:02:45.000000'),
         np.datetime64('2015-03-26T16:02:48.000000'),
         np.datetime64('2015-03-26T16:02:55.000000')]
utc_times = [np.datetime_as_string(n, timezone='UTC') for n in times]
utc_str_arr = np.array(utc_times, dtype='S30')

with h5py.File('data_ML.hdf5', 'w') as f:
    f.create_dataset("time", data=utc_str_arr, maxshape=(None), chunks=True)
You can simplify the process if you are starting with np.datetime64() objects, and don't have (and don't need or want) the intermediate list of string objects (variable utc_times in my code). The method below skips Step 2 above, and shows 2 ways to create a np.array() of properly encoded strings.
Code below:
times = [np.datetime64('2015-03-26T16:02:42.000000'),
         np.datetime64('2015-03-26T16:02:45.000000'),
         np.datetime64('2015-03-26T16:02:48.000000'),
         np.datetime64('2015-03-26T16:02:55.000000')]

# Create empty array with defined size and 'S#' dtype, then populate with for loop:
utc_str_arr1 = np.empty((len(times),), dtype='S30')
for i, n in enumerate(times):
    utc_str_arr1[i] = np.datetime_as_string(n, timezone='UTC')

# -OR- Create array and populate using loop comprehension:
utc_str_arr2 = np.array([np.datetime_as_string(n, timezone='UTC').encode('utf-8') for n in times])

with h5py.File('data_ML.hdf5', 'w') as f:
    f.create_dataset("time1", data=utc_str_arr1, maxshape=(None), chunks=True)
    f.create_dataset("time2", data=utc_str_arr2, maxshape=(None), chunks=True)
Final result looks similar with either method (the second method creates 2 identical datasets).
Image from HDFView:
To Read the Data:
Per request in Aug-02-2021 comment, here is the code to extract data from HDF5 and create Pandas timestamp objects (then saved to a dataframe). First the byte strings in the dataset are read and converted to NumPy Unicode strings with .astype(). Then the strings are converted to Pandas timestamp objects with pd.to_datetime() using the format= parameter.
import h5py
import numpy as np
import pandas as pd

with h5py.File('data_ML.hdf5', 'r') as h5f:
    ## returns a h5py dataset object:
    dts_ds = h5f["time"]
    longest_word = len(max(dts_ds, key=len))
    ## returns an array of byte strings representing np.datetime64:
    ## .astype() used to convert byte strings to unicode
    dts_arr = dts_ds[:].astype('U' + str(longest_word))
    ## create a new array to hold Pandas datetime objects
    ## then loop over first array to convert and populate new array
    pd_dts_arr = np.empty((dts_arr.shape[0],), dtype=object)
    for i, dts in enumerate(dts_arr):
        pd_dts_arr[i] = pd.to_datetime(dts, format='%Y-%m-%dT%H:%M:%S.%fZ')
    dts_df = pd.DataFrame(pd_dts_arr)
There are a lot of ways to represent dates and time using native Python, NumPy and Pandas objects. More details about working with them can be found at this answer:
Converting between datetime, Timestamp and datetime64

Convert a column (in a pandas DataFrame) from hexadecimal to int, skipping lines with NaN or None, for large data sets > 10GB

I have a recording of a CAN bus transmission and I want to analyze it now. In the past, I used Excel for it. But now I am faced with huge amounts of data (> 10GB). With pd.read_csv I can load the data wonderfully into a dataframe. But the hexadecimal numbers are stored as strings of the form "6E", not "0x6E". Furthermore, some columns are filled with "None".
I tested it with a for loop and an if check for None; this works, but the procedure takes a very long time.
import glob

import natsort
import pandas as pd


def load_data(self, file_folder, file_type):
    df_local_list = []
    # Load filenames as strings into list "all_files"
    full_path = glob.glob(file_folder + "/*." + file_type)
    self.all_files = natsort.natsorted(full_path)
    # Walk through all files and load the content into "df_local_list"
    for file in self.all_files:
        # Read file content into data-frame variable "local_df"
        local_df = pd.read_csv(file, names=self.header_list,
                               delim_whitespace=True, skiprows=12, skipfooter=3, header=13,
                               error_bad_lines=False, engine='python')
        # Save the file content without the last two lines --> end header
        # self.df_list.append(local_df[:-2])
        df_local_list.append(local_df)
    self.df = pd.concat(df_local_list, axis=0, ignore_index=True)
    self.df['Byte0_int'] = ('0x' + self.df['Byte0']).apply(int, base=0)
I would like to have a fast function which converts selected columns from hex to int, skipping the "None" values.
I had a similar issue and I did this :
self.df['Byte0_int'] = self.df['Byte0_int'].dropna().map(lambda x:int(x, 16))
In short, I remove NaN first, then I convert everything else from hex to int
No need to prefix "0x" since they are processed the same:
>>> int("0x5e",16)
94
>>> int("5e",16)
94
I suggest profiling the timings of your code, because the concat could be costly. You could also have a look at dask to process many files.
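If several columns need the same treatment, a small sketch along the lines of the answer above (the column names are just examples):
import pandas as pd

df = pd.DataFrame({"Byte0": ["6E", None, "5e"], "Byte1": ["FF", "0A", None]})

for col in ["Byte0", "Byte1"]:
    # dropna() keeps the original index, so the converted values align back
    # into the frame and the None/NaN rows simply stay NaN in the new column.
    df[f"{col}_int"] = df[col].dropna().map(lambda x: int(x, 16))

print(df)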

Dask Parquet loading files with data schema

This is a question related to this post.
I am experimenting with Dask and Parquet files. I loaded the New York parking violations data I downloaded here.
I read the data files, find common columns, apply datatypes, and afterwards save everything as a parquet collection.
from dask import dataframe as dd
from dask.diagnostics import ProgressBar
import numpy as np
base_url = 'origin/nyc-parking-tickets/'
fy14 = dd.read_csv(base_url + '*2014_.csv')
fy15 = dd.read_csv(base_url + '*2015.csv')
fy16 = dd.read_csv(base_url + '*2016.csv')
fy17 = dd.read_csv(base_url + '*2017.csv')
data = [fy14, fy15, fy16, fy17]
col_set = [set(d.columns) for d in data]
common_columns = list(set.intersection(*col_set))
# Set proper column types
dtype_tuples = [(x, np.str) for x in common_columns]
dtypes = dict(dtype_tuples)
floats = ['Feet From Curb', 'Issuer Code', 'Issuer Precinct', 'Law Section', 'Vehicle Year', 'Violation Precinct']
ints32 = ['Street Code1', 'Street Code2', 'Street Code3', 'Summons Number']
ints16 = ['Violation Code']
for item in floats: dtypes[item] = np.float32
for item in ints32: dtypes[item] = np.int32
for item in ints16: dtypes[item] = np.int16
# Read Data
data = dd.read_csv(base_url + '*.csv', dtype=dtypes, usecols=common_columns) # usecols not in Dask documentation, but from pandas
# Write data as parquet
target_url = 'target/nyc-parking-tickets-pq/'
with ProgressBar():
    data.to_parquet(target_url)
When I attempt to reload the data
data2 = dd.read_parquet(target_url, engine='pyarrow')
I get a ValueError, namely that some of the partitions have a different schema. Looking at the output, I can see that the 'Violation Legal Code' column is interpreted as null in one partition, presumably because the data is too sparse for sampling.
In the post with the original question, two solutions are suggested. The first is about entering dummy values, the other is supplying column types when loading the data. I would like to do the latter, and I am stuck.
In the dd.read_csv method I can pass the dtype argument, for which I just enter the dtypes dictionary defined above. dd.read_parquet does not accept that keyword. The documentation seems to suggest that categories is taking over that role, but even when passing categories=dtypes, I still get the same error.
How can I pass type specifications in dask.dataframe.read_parquet?
You cannot pass dtypes to read_parquet because Parquet files know their own dtypes (in CSV it is ambiguous). Dask DataFrame expects that all files of a dataset have the same schema; as of 2019-03-26, there is no support for loading data of mixed schemas.
That being said, you could do this yourself using something like Dask Delayed, do whatever manipulations you need to do on a file-by-file basis, and then convert those into a Dask DataFrame with dd.from_delayed. More information about that is here.
https://docs.dask.org/en/latest/delayed.html
https://docs.dask.org/en/latest/delayed-collections.html
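A minimal sketch of that delayed, file-by-file route (the file pattern and the forced column come from the question; everything else is illustrative and assumes pyarrow is installed):
import glob

import dask
import dask.dataframe as dd
import pyarrow.parquet as pq

@dask.delayed
def load_one(path):
    df = pq.read_table(path).to_pandas()
    # Force the sparsely populated column to one consistent dtype in every
    # piece so all partitions agree on a schema.
    df['Violation Legal Code'] = df['Violation Legal Code'].astype(str)
    return df

files = sorted(glob.glob('target/nyc-parking-tickets-pq/*.parquet'))
data2 = dd.from_delayed([load_one(f) for f in files])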
It seems the problem was with the parquet engine. When I changed the code to
data.to_parquet(target_url, engine = 'fastparquet')
and
data2 = dd.read_parquet(target_url, engine = 'fastparquet')
the writing and loading worked fine.
