When I try to load a Parquet dataset with many partitions, parts of the schema are inferred incorrectly because missing data fills some columns with nulls. I would have thought specifying the schema in pyarrow.parquet.ParquetDataset would fix this, but I don't know how to construct a schema of the required pyarrow.parquet.Schema type. Some example code:
import pyarrow as pa
import pyarrow.parquet as pq
test_schema = pa.schema([pa.field('field1', pa.string()), pa.field('field2', pa.float64())])
paths = ['test_root/partition1/file1.parquet', 'test_root/partition2/file2.parquet']
dataset = pq.ParquetDataset(paths, schema=test_schema)
And the error:
AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'
But I can't find any documentation on how to construct a pyarrow.parquet.Schema as referenced in the docs (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html); I have only managed to create a pyarrow.lib.Schema, which gives the above error.
There is not an API to construct a Parquet schema in Python yet. You can use one that you read from a particular file, though (see pq.ParquetFile(...).schema).
Could you open an issue on the ARROW JIRA project to request the feature to construct Parquet schemas in Python?
https://issues.apache.org/jira
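For reference, here is a minimal sketch of that workaround, borrowing the schema from one of the files listed in the question; whether it fixes the null-inference issue depends on picking a file whose schema is complete:
import pyarrow.parquet as pq

# Read the Parquet-level schema from an existing file and pass it to ParquetDataset
parquet_schema = pq.ParquetFile('test_root/partition1/file1.parquet').schema
dataset = pq.ParquetDataset(paths, schema=parquet_schema)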
So thank you, whoever you are, if you opened the ticket and landed the fix for this in the Arrow JIRA.
I was able to merge the schemas of the files in the dataset and then read the dataset:
import os

import pyarrow as pa
import pyarrow.parquet as pq

merged_schema = pa.schema([])
for filename in os.listdir(dataset_folder):
    schema_ = pq.read_table(os.path.join(dataset_folder, filename)).schema
    merged_schema = pa.unify_schemas([schema_, merged_schema])
Read dataset:
dset = pq.ParquetDataset(
    'my_dataset_folder',
    schema=merged_schema,
    use_legacy_dataset=False
).read()
I have a Python process that writes data into Parquet files using PyArrow 6.0.0. I initialize a PyArrow Dataset with:
import gcsfs
import pyarrow
import pyarrow.dataset as ds
from pyarrow.fs import FSSpecHandler, PyFileSystem

root_directory = "bucket_name_in_gcp"
fs = gcsfs.GCSFileSystem(project=project)
pa_fs = PyFileSystem(FSSpecHandler(fs))

partitions = ds.partitioning(pyarrow.schema([("item_id", pyarrow.string()), ("group", pyarrow.string())]), flavor="hive")
dataset = ds.dataset(source=root_directory, filesystem=fs, partitioning=partitions, format="parquet")
Later in the code I use:
item_id_condition = ds.field("item_id") == "xxx"
group_condition = ds.field("group") == "group_1"
filters = item_id_condition & group_condition
results = dataset.to_table(filter=filters).to_pandas()
to read data from the storage, and I get an empty dataframe, which is fine at this point.
Later I use:
import pyarrow
from pyarrow import parquet

file_path = f'{root_directory}/item_id=xxx/group=group_1'
with pyarrow.BufferOutputStream() as output_buffer:
    parquet.write_table(table, output_buffer)
    parquet_bytes = output_buffer.getvalue().to_pybytes()
    with pa_fs.open_output_stream(str(file_path)) as stream:
        stream.write(parquet_bytes)
to write a table of data to the storage; at that point I can view the file and its contents.
But if I try to use the read function (dataset.to_table) again, I still get an empty dataframe. Why doesn't the PyArrow dataset recognize the new files?
If I re-create the ds.dataset object, it recognizes all the existing data.
Am I missing anything? Is there a way to refresh the dataset, or do I need to initialize it each time?
I think you would need to run
dataset = ds.dataset(source=root_directory, filesystem=fs, partitioning=partitions, format="parquet")
again, and then it should recognize the new files. If I understand correctly, the connection to the gcsfs filesystem has to be "refreshed" in this way (by re-creating the ds.dataset, as you said).
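If that gets repetitive, a small (hypothetical) helper that simply wraps the call above keeps the refresh in one place:
import pyarrow.dataset as ds

def refresh_dataset():
    # Re-create the dataset object so newly written files are picked up;
    # root_directory, fs and partitions are the variables from the question.
    return ds.dataset(source=root_directory, filesystem=fs,
                      partitioning=partitions, format="parquet")

# Call this after writing new files and before reading again
dataset = refresh_dataset()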
I have exported an ArcGIS Desktop 10.7 table into a dbf file.
Now I want to do some GIS calculation in standalone Python.
Therefore I have started a PyCharm project referencing the ArcGIS Python interpreter and hence am able to import arcpy into my main.py.
Problem is: I don't want to pip install other modules, but I don't know how to correctly read the dbf table with arcpy.
#encoding=utf-8
import arcpy
path=r"D:\test.dbf"
sc=arcpy.SearchCursor(path) # Does not work: IOError exception str() failed
tv=arcpy.mapping.TableView(path) # Does not work either: StandaloneObject invalid data source or table
The dbf file is correct; it can be read into ArcGIS.
Can someone please give me an idea of how to read the file standalone with arcpy?
Using pandas
The Python installation that ships with ArcMap comes with some modules. You can load the data into a pandas.DataFrame and work with it in that format. Pandas is well documented, and there are plenty of already-answered questions about it all over the web. It also makes groupby and other table manipulations easy.
import types

import arcpy
import pandas as pd


def read_arcpy_table(self, table, fields='*', null_value=None):
    """
    Transform a table from ArcMap into a pandas.DataFrame object.

    table : path to the table
    fields : fields to load - '*' loads all fields
    null_value : value used to replace null values
    """
    fields_type = {f.name: f.type for f in arcpy.ListFields(table)}
    if fields == '*':
        fields = fields_type.keys()
    fields = [f.name for f in arcpy.ListFields(table) if f.name in fields]
    fields = [f for f in fields if f in fields_type and fields_type[f] != 'Geometry']  # Remove Geometry field if FeatureClass to avoid bug
    # Transform into a pd.DataFrame
    np_array = arcpy.da.FeatureClassToNumPyArray(in_table=table,
                                                 field_names=fields,
                                                 skip_nulls=False,
                                                 null_value=null_value)
    df = self.DataFrame(np_array)
    return df


# Add the function to the loaded pandas module so it can be called as pd.read_arcpy_table
pd.read_arcpy_table = types.MethodType(read_arcpy_table, pd)

df = pd.read_arcpy_table(table='path_to_your_table')
# Do whatever calculations need to be done
Using cursor
You can also use arcpy cursors and dicts for simple calculations; a minimal sketch follows the link below.
There are simple examples on this page showing how to use cursors correctly:
https://desktop.arcgis.com/fr/arcmap/10.3/analyze/arcpy-data-access/searchcursor-class.htm
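For example, here is a minimal sketch using arcpy.da.SearchCursor to sum a column of the dbf from the question; the field name 'VALUE' is hypothetical, so substitute a field that actually exists in your table:
import arcpy

path = r"D:\test.dbf"

# Accumulate values per object ID in a plain dict, then sum them
values = {}
with arcpy.da.SearchCursor(path, ['OID@', 'VALUE']) as cursor:
    for oid, value in cursor:
        values[oid] = value

total = sum(v for v in values.values() if v is not None)
print(total)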
My bad,
after reading the "Using cursor" approach, I figured out that using the
sc=arcpy.SearchCursor(path) # Does not work: IOError exception str() failed
approach was correct, but at around 3 AM I was a little exhausted and missed the typo in the path that caused the error. Nevertheless, a more descriptive error message, e.g. IOError: could not open file rather than IOError exception str() failed, would have helped me catch my mistake as an ArcGIS newbie. :/
I'm trying to overwrite my Parquet files in S3 with pyarrow. I've looked through the documentation and haven't found anything.
Here is my code:
import pandas as pd
from s3fs.core import S3FileSystem
import pyarrow as pa
import pyarrow.parquet as pq

s3 = S3FileSystem(anon=False)
output_dir = "s3://mybucket/output/my_table"

my_csv = pd.read_csv('file.csv')
my_table = pa.Table.from_pandas(my_csv, preserve_index=False)

pq.write_to_dataset(my_table,
                    output_dir,
                    filesystem=s3,
                    use_dictionary=True,
                    compression='snappy')
Is there something like a mode="overwrite" option in the write_to_dataset function?
I think the best way to do it is with AWS Data Wrangler, which offers three different write modes:
append
overwrite
overwrite_partitions
Example:
import awswrangler as wr

wr.s3.to_parquet(
    df=df,
    path="s3://...",
    mode="overwrite",
    dataset=True,
    database="my_database",  # Optional, only if you want it available in the Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])
Here's a solution using pyarrow.parquet (you need version 8+; see the docs for the existing_data_behavior argument) and S3FileSystem.
Now decide whether you want to overwrite partitions or the individual parquet part files that often make up those partitions.
Overwrite single .parquet file
pq.write_to_dataset(
    my_table,
    root_path='bucket/mydata/year=2022/data_part001.parquet',
    filesystem=s3,
    existing_data_behavior="overwrite_or_ignore"
)
Overwrite .parquet files with common basename within each partition
pq.write_to_dataset(
    my_table,
    root_path='bucket/mydata',
    partition_cols=['year'],
    basename_template='data_part001.parquet',
    filesystem=s3,
    existing_data_behavior="overwrite_or_ignore"
)
Overwriting existing partitions that match new records
If some of your new records belong to a partition that already exists, that entire partition will be overwritten and new partitions will be added with:
pq.write_to_dataset(
    my_table,
    root_path='bucket/mydata',
    partition_cols=['year'],
    filesystem=s3,
    existing_data_behavior="delete_matching"
)
Sorry, there's no such option yet, but the way I work around it is by using boto3 to delete the files before writing them.
import boto3
resource = boto3.resource('s3')
resource.Bucket('mybucket').objects.filter(Prefix='output/my_table').delete()
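Putting the two pieces together, here is a sketch of the delete-then-write workaround, reusing the variables from the question above:
import boto3
import pyarrow.parquet as pq

# Remove the existing objects under the prefix first...
resource = boto3.resource('s3')
resource.Bucket('mybucket').objects.filter(Prefix='output/my_table').delete()

# ...then write the new dataset as before
pq.write_to_dataset(my_table,
                    output_dir,
                    filesystem=s3,
                    use_dictionary=True,
                    compression='snappy')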
I am trying to use the BigQuery package to interact with Pandas DataFrames. In my scenario, I query a base table in BigQuery, use .to_dataframe(), then pass that to load_table_from_dataframe() to load it into a new table in BigQuery.
My original problem was that str(uuid.uuid4()) (for random IDs) was automatically being converted to bytes instead of a string, so I am forcing a schema instead of letting it auto-detect one.
Now, though, I passed a job_config dict that contained the schema, and I get this error:
File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/client.py", line 903, in load_table_from_dataframe
job_config.source_format = job.SourceFormat.PARQUET
AttributeError: 'dict' object has no attribute 'source_format'
I already had PyArrow installed, and I also tried installing FastParquet, but it didn't help; this didn't happen before I tried to force a schema.
Any ideas?
https://google-cloud-python.readthedocs.io/en/latest/bigquery/usage.html#using-bigquery-with-pandas
https://google-cloud-python.readthedocs.io/en/latest/_modules/google/cloud/bigquery/client.html#Client.load_table_from_dataframe
Looking into the actual package, it seems that it forces the Parquet format, but like I said, I had no issue before; it only happens now that I'm trying to provide a table schema.
EDIT: This only happens when I try to write to BigQuery.
Figured it out. After weeding through Google's documentation, I realized I had forgotten to put:
load_config = bigquery.LoadJobConfig()
load_config.schema = SCHEMA
Oops. I never built the config object with LoadJobConfig from the BigQuery package; I was passing a plain dict instead.
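For completeness, here is a sketch of how the pieces fit together; the schema fields and the destination table name are placeholders, and older client versions may require a TableReference instead of a table ID string:
from google.cloud import bigquery

client = bigquery.Client()

# Example schema; replace with the fields your table actually needs
SCHEMA = [
    bigquery.SchemaField('id', 'STRING'),
    bigquery.SchemaField('value', 'FLOAT'),
]

load_config = bigquery.LoadJobConfig()
load_config.schema = SCHEMA

# Pass the LoadJobConfig object (not a plain dict) to the load call
job = client.load_table_from_dataframe(df, 'my_dataset.my_table',
                                       job_config=load_config)
job.result()  # wait for the load job to finish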
I need to use xmatch from the astroquery package to cross match a large local catalogue with 2MASS.
I load my local FITS table with astropy as usual:
from astropy.io import fits
hdu = fits.open(root+'mycat.fits')
Then I try to use xmatch with that table (the table is hdu[2]), following the syntax described in the astroquery docs:
import astropy.units as u
from astroquery.xmatch import XMatch

table = XMatch.query(cat1=hdu[2],
                     cat2='vizier:II/246/out',
                     max_distance=1 * u.arcsec, colRA1='RA',
                     colDec1='Dec')
But get the following error:
AttributeError: 'BinTableHDU' object has no attribute 'read'
The examples in the astroquery docs only show how to pass a local CSV file. But my catalogue has about 7 million entries, so it is not convenient to pass it as an ASCII CSV file.
How should I pass my FITS table as input? Thanks!
While xmatch can accept a file object as input, that file object has to be a Vizier-style .csv table. You need to convert your FITS table to an astropy table first, e.g.
from astropy.table import Table
myTable = Table(data=hdu[2].data)
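The converted table can then be passed directly as cat1, reusing the call from the question (a sketch, assuming the same column names):
import astropy.units as u
from astropy.io import fits
from astropy.table import Table
from astroquery.xmatch import XMatch

hdu = fits.open(root + 'mycat.fits')
myTable = Table(data=hdu[2].data)

result = XMatch.query(cat1=myTable,
                      cat2='vizier:II/246/out',
                      max_distance=1 * u.arcsec,
                      colRA1='RA', colDec1='Dec')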