PyArrow dataset missing new data - python

I have a Python process that writes data into Parquet files, using PyArrow 6.0.0. I initialize a PyArrow Dataset with:
import gcsfs
import pyarrow
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler

root_directory = "bucket_name_in_gcp"
fs = gcsfs.GCSFileSystem(project=project)
pa_fs = PyFileSystem(FSSpecHandler(fs))
partitions = ds.partitioning(
    pyarrow.schema([("item_id", pyarrow.string()), ("group", pyarrow.string())]),
    flavor="hive",
)
dataset = ds.dataset(source=root_directory, filesystem=fs, partitioning=partitions, format="parquet")
Later in the code I use:
item_id_condition = ds.field("item_id") == "xxx"
group_condition = ds.field("group") == "group_1"
filters = item_id_condition & group_condition
results = dataset.to_table(filter=filters).to_pandas()
to read data from storage, and I get an empty dataframe, which is fine at this point.
Later I use:
import pyarrow.parquet as parquet

file_path = f'{root_directory}/item_id=xxx/group=group_1'
with pyarrow.BufferOutputStream() as output_buffer:
    parquet.write_table(table, output_buffer)
    parquet_bytes = output_buffer.getvalue().to_pybytes()
with pa_fs.open_output_stream(str(file_path)) as stream:
    stream.write(parquet_bytes)
to write a table of data to storage. At that point I can view the file and its content in the bucket.
But if I call the read function (dataset.to_table) again, I still get an empty dataframe. Why doesn't the PyArrow dataset recognize the new files?
If I re-create the ds.dataset object, it recognizes all the existing data.
Am I missing anything? Is there a way to refresh the dataset, or do I need to initialize it each time?

I think you would need to run
dataset = ds.dataset(source=root_directory, filesystem=fs, partitioning=partitions, format="parquet")
again, and then it should recognize the new files. If I understand correctly, the dataset's file listing is fixed when ds.dataset is called, so it has to be "refreshed" in this way (by re-creating the ds.dataset, as you said).
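To make that concrete, here is a minimal sketch of a helper that rebuilds the dataset just before each read, so Parquet files written since the last refresh are picked up. It reuses the root_directory, fs, and partitions objects from the question; the function name is just for illustration.
import pyarrow.dataset as ds

def load_filtered(root_directory, fs, partitions, item_id, group):
    # Re-create the dataset so file discovery runs again and picks up
    # anything written since the previous call.
    dataset = ds.dataset(
        source=root_directory,
        filesystem=fs,
        partitioning=partitions,
        format="parquet",
    )
    filters = (ds.field("item_id") == item_id) & (ds.field("group") == group)
    return dataset.to_table(filter=filters).to_pandas()

results = load_filtered(root_directory, fs, partitions, "xxx", "group_1")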

Related

Read xml file using RDD using local cluster pyspark

I have a very large XML file (around 100 MB) and I am using PySpark for the following reason:
to reduce the run time and convert the data into a dataframe.
Any idea how to read the XML file? (Modification is required in the code below.) One more thing: I took it from
Google, and to be precise, I did not understand how I can define get_values.
import os
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[2]").appName('finale').getOrCreate()
xml = os.path.join(self.violation_path, xml)  # inside a class method; xml is the incoming file name
file_rdd = spark.read.text(xml, wholetext=False)
# get_values is not defined yet -- that is part of the question
spark.udf.register(name="get_values", f=get_values, returnType=StringType())
myRDD = spark.sparkContext.parallelize(file_rdd.take(4), 4)
parsed_records = spark.sparkContext.runJob(myRDD, lambda part: [get_values(x) for x in part])
print(parsed_records)
Another method:
import xml.etree.ElementTree as ET
root = ET.parse(xml).getroot()
Is it the right approach to use PySpark as a local cluster? Will it be faster?
Can it only be run in a cloud container, or can it also be used on a local machine?
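Not an authoritative answer, but a minimal sketch of one way get_values could be defined: a plain function that parses an XML fragment with ElementTree, which can then be registered as a UDF or mapped over an RDD. The file name violations.xml and the "join all element texts" behaviour are assumptions for illustration.
import xml.etree.ElementTree as ET
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def get_values(xml_fragment):
    # Parse one XML document/fragment and return its text values as a comma-separated string
    try:
        root = ET.fromstring(xml_fragment)
        return ",".join(elem.text or "" for elem in root.iter())
    except ET.ParseError:
        return ""

spark = SparkSession.builder.master("local[2]").appName("finale").getOrCreate()

# wholetext=True keeps the whole file in a single row, so elements are not split across lines
df = spark.read.text("violations.xml", wholetext=True)

# Option 1: use it as a DataFrame UDF
get_values_udf = udf(get_values, StringType())
parsed_df = df.select(get_values_udf("value").alias("parsed"))
parsed_df.show(truncate=False)

# Option 2: map it over the underlying RDD, closer to the runJob attempt above
parsed_records = df.rdd.map(lambda row: get_values(row.value)).collect()
print(parsed_records)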

Read csv files recursively in all sub folders from a GCP bucket using python

I am trying to load all CSV files recursively from all subfolders of a GCP bucket using Python pandas.
Currently I am using Dask to load the data, but it's very slow.
import dask.dataframe as dd

path = "gs://mybucket/parent_path/" + "*/*.csv"
getAllDaysData = dd.read_csv(path).compute()
Can someone help me find a better way?
I would suggest reading the data into Parquet files instead,
and using pd.read_parquet(file, engine='pyarrow') to convert them into pandas dataframes.
Alternatively, you might want to consider loading the data into BigQuery first.
You can do something like this, as long as all the CSV files have the same structure.
from google.cloud import bigquery

client = bigquery.Client()

uri = "gs://mybucket/parent_path/*.csv"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV
)
load_job = client.load_table_from_uri(
    uri,
    'destination_table',
    job_config=job_config,
    location=GCP_LOCATION
)
load_job_result = load_job.result()
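If you want to stay with pandas directly, a minimal sketch (assuming the gcsfs package; the project name and bucket layout are placeholders) is to glob the files recursively and concatenate them:
import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem(project="my-project")

# "**" recurses through every subfolder under parent_path
csv_paths = fs.glob("mybucket/parent_path/**/*.csv")

frames = []
for path in csv_paths:
    with fs.open(path, "rb") as f:
        frames.append(pd.read_csv(f))

all_days_data = pd.concat(frames, ignore_index=True)
This still reads every file serially, so for a large number of files the BigQuery route above will usually be faster.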

Trying to move data from one Azure Blob Storage to another using a Python script

I have data in a zipped format in container A that I need to transform using a Python script, and I am trying to schedule this to run within Azure. But when writing the output to a new storage container (container B), it simply outputs a CSV with the name of the file inside rather than the data.
I've followed the tutorial on the Microsoft site exactly, but I can't get it to work. What am I missing?
https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
import pandas as pd

# blobService and containerName are set up earlier in the tutorial script
file_n = 'iris.csv'
# Load iris dataset from the task node
df = pd.read_csv(file_n)
# Subset records
df = df[df['Species'] == "setosa"]
# Save the subset of the iris dataframe locally in the task node
df.to_csv("iris_setosa.csv", index=False, encoding="utf-8")
# Upload iris dataset
blobService.create_blob_from_text(containerName, "iris_setosa.csv", "iris_setosa.csv")
Specifically, the final line seems to just give me a CSV called "iris_setosa.csv" whose contents are the string "iris_setosa.csv" in cell A1, rather than the actual data that it reads in.
Update:
replace create_blob_from_text with create_blob_from_path.
create_blob_from_text creates a new blob from str/unicode, or updates the content of an existing blob. So you will find the text "iris_setosa.csv" as the content of the new blob.
create_blob_from_path creates a new blob from a file path, or updates the content of an existing blob. It is what you want.
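For illustration, the corrected upload line (using the same blobService and containerName from the tutorial) passes the local file path as the third argument:
# Upload the local file iris_setosa.csv produced by df.to_csv above
blobService.create_blob_from_path(containerName, "iris_setosa.csv", "iris_setosa.csv")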
This workaround uses copy_blob and delete_blob to move an Azure blob from one container to another.
from azure.storage.blob import BlobService

def copy_azure_files(self):
    blob_service = BlobService(account_name='account_name', account_key='account_key')
    blob_name = 'iris_setosa.csv'
    copy_from_container = 'test-container'
    copy_to_container = 'demo-container'
    blob_url = blob_service.make_blob_url(copy_from_container, blob_name)
    # blob_url: https://demostorage.blob.core.windows.net/test-container/iris_setosa.csv
    blob_service.copy_blob(copy_to_container, blob_name, blob_url)
    # to move (rather than copy) the file, also delete the source blob
    blob_service.delete_blob(copy_from_container, blob_name)
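Note that BlobService comes from the legacy azure-storage SDK. A rough sketch of the same move with the current azure-storage-blob v12 package (the connection string, container names, and blob name are placeholders) might look like this:
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")

source_blob = service.get_blob_client("test-container", "iris_setosa.csv")
dest_blob = service.get_blob_client("demo-container", "iris_setosa.csv")

# Server-side copy, then delete the source to complete the "move".
# For large blobs you may want to poll the copy status before deleting.
dest_blob.start_copy_from_url(source_blob.url)
source_blob.delete_blob()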

writing to pysftp fileobject using pandas to_csv with compression doesn't actually compress

I have looked at many related answers here on Stack Overflow, and this question seems most related: How to Transfer Pandas DataFrame to .csv on SFTP using Paramiko Library in Python?. I want to do something similar; however, I want to compress the file when I send it to the SFTP location, so I end up with a .csv.gz file essentially. The files I am working with are 15-40 MB uncompressed, but there are sometimes lots of them, so I need to keep the footprint small.
I have been using code like this to move the dataframe to the destination, after pulling it from another location as a CSV and doing some transformations on the data itself:
import pandas as pd

fileList = source_sftp.listdir('/Inbox/')
dataList = []
for item in fileList:  # for each file in the list...
    print(item)
    if item[-3:] == u'csv':
        temp = pd.read_csv(source_sftp.open('/Inbox/' + item))  # read the csv directly from the sftp server into a pd DataFrame
    elif item[-3:] == u'zip':
        temp = pd.read_csv(source_sftp.open('/Inbox/' + item), compression='zip')
    elif item[-3:] == u'.gz':
        temp = pd.read_csv(source_sftp.open('/Inbox/' + item), compression='gzip')
    else:
        temp = pd.read_csv(source_sftp.open('/Inbox/' + item), compression='infer')
    dataList.append(temp)  # keep each

# ... Some transformations in here on the data

FL = [(x.replace('.csv', '')) + suffix  # just swap out to suffix, e.g. '.csv.gz'
      for x in fileList]
locpath = '{}/some/new/dir/'.format(dest_sftp.pwd)
i = 0
for item in dataList:
    with dest_sftp.open(locpath + FL[i], 'w') as f:
        item.to_csv(f, index=False, compression='gzip')
    i = i + 1
It seems like I should be able to get this to work, but I am guessing something is being skipped over when I use to_csv to convert the dataframe back and then compress it on the SFTP file object. Should I be streaming this somehow, or is there a solution I am missing in the documentation for pysftp or pandas?
If I can avoid saving the CSV file somewhere local first, I would like to, but I don't think I should have to, right? I am able to get the file compressed in the end if I just save it locally with temp.to_csv('/local/path/myfile.csv.gz', compression='gzip'), and after transferring this local file to the destination it is still compressed, so I don't think it has to do with the transfer, just with how pandas.DataFrame.to_csv and pysftp.Connection.open are used together.
I should probably add that I still consider myself a newbie to much of Python, but I have been working with local to sftp and sftp to local, and have not had to do much in the way of transferring (directly or indirectly) between them.
Make sure you have a recent version of pandas.
It supports compression with a file-like object only since 0.24:
GH21227: df.to_csv ignores compression when provided with a file handle
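If upgrading isn't an option, a version-independent sketch (reusing the dataList, dest_sftp, locpath, and FL names from the question) is to do the gzip compression explicitly and hand pandas a plain text handle:
import gzip
import io

for i, item in enumerate(dataList):
    # Open the remote file in binary mode and layer gzip on top of it,
    # so whatever pandas writes ends up gzip-compressed on the SFTP server.
    with dest_sftp.open(locpath + FL[i], 'wb') as remote_file:
        with gzip.GzipFile(fileobj=remote_file, mode='wb') as gz:
            with io.TextIOWrapper(gz, encoding='utf-8') as text_handle:
                item.to_csv(text_handle, index=False)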

Tablib Unbundle Databook into Datasets

I have successfully imported an Excel file into tablib as a Databook.
imported_data = tablib.Databook().load('xlsx',open('file.xlsx', 'rb').read())
Now that I have imported it, I don't seem to be able to do anything with the Databook. I guess I need to get a Dataset (equivalent to one of the Excel worksheets) but I cannot figure out how to unbundle the Databook (or better yet, extract a specific worksheet as a dataset).
Python 2.7.
Tablib reference: http://docs.python-tablib.org/en/latest/api/#tablib.Databook
>>> imported_data
<databook object>
>>> print imported_data
<databook object>
>>> imported_data.size
1
>>> print imported_data[0]
TypeError: 'Databook' object does not support indexing
>>> data = tablib.Dataset(imported_data)
TypeError: 'Databook' object is not iterable
Once I have a dataset, I can get to work on it.
Does anyone know how to do this?
Somehow I've only just started using tablib. In any case, I was stumbling through using databooks and came across this question. No doubt this is no longer a pressing issue, but for anyone else who finds themselves here, the Databook.sheets() method returns a list of Dataset objects:
In [2]: databook = tablib.Databook().load('xlsx', open('file.xlsx', 'rb').read())
In [3]: databook.sheets()
Out[3]: [<sheet1 dataset>, <sheet2 dataset>, <sheet3 dataset>]
This was the only way I could get the names and the data to come out correctly.
By declaring it as a Databook beforehand, and specifying what type of file I was importing, I was able to access the titles of the datasets and all the data within each dataset.
imported_data = tablib.Databook()  # declare the databook first
imported_data.xlsx = open(import_filename, 'rb').read()
for dataset in imported_data.sheets():
    print(dataset.title)  # prints each sheet's title
    print(dataset)        # prints the data in each sheet
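As a small follow-up sketch (the sheet name "Sheet1" is just a placeholder), once you have the list from sheets() you can pick a specific worksheet by its title and work with it as a regular Dataset:
# Pick one worksheet out of the databook by title and use it as a Dataset
target = next((sheet for sheet in imported_data.sheets() if sheet.title == "Sheet1"), None)
if target is not None:
    print(target.headers)  # column headers of that worksheet
    print(target.dict)     # rows as a list of dicts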
