pyarrow writting Parquet files keeps overriding existing data sets

pyarrow writting Parquet files keeps overriding existing data sets - python

I'm trying to write to an existing Parquet file stored on the local filesystem. But when writing multiple times, the previous one gets overridden instead of added.
from datetime import datetime
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def append_to_parquet_table(dataframe, filename):
full_path = os.path.join('.', filename)
table = pa.Table.from_pandas(dataframe)
writer = pq.ParquetWriter(full_path, table.schema)
writer.write_table(table=table)
def save(passed):
data = {'number': [1234],
'verified': [passed],
'date': datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
data_df = pd.DataFrame(data)
append_to_parquet_table(data_df, 'results.parquet')
save(True)
save(False)
Why is the first data set being "updated" instead of a new one written?

I'm trying to write to an existing Parquet file stored on the local filesystem.
This isn't supported by the file format. Parquet files are immutable after being written.

Related

How to read the files of Azure file share as csv that is pandas dataframe

I have few csv files in my Azure File share which I am accessing as text by following the code:
from azure.storage.file import FileService
storageAccount='...'
accountKey='...'
file_service = FileService(account_name=storageAccount, account_key=accountKey)
share_name = '...'
directory_name = '...'
file_name = 'Name.csv'
file = file_service.get_file_to_text(share_name, directory_name, file_name)
print(file.content)
The contents of the csv files are being displayed but I need to pass them as dataframe which I am not able to do. Can anyone please tell me how to read the file.content as pandas dataframe?

After reproducing from my end, I could able to read a csv file into dataframe from the contents of the file following the below code.
generator = file_service.list_directories_and_files('fileshare/')
for file_or_dir in generator:
print(file_or_dir.name)
file=file_service.get_file_to_text('fileshare','',file_or_dir.name)
df = pd.read_csv(StringIO(file.content), sep=',')
print(df)
RESULTS:

How to convert JSON file into EXCEL file in python

I have 2 questions
How to convert and extract JSON file into EXCEL file in python
How to combine all json file into one file?
Now, I have 30 json files. I would like to extract them all into EXCEL file (In readable format).
Lastly, I need to combine all of the result into one excel file. So, curious on how to do that too.

Converting JSON into EXCEL;
import pandas as pd
df = pd.read_json('./file1.json')
df.to_excel('./file1.xlsx')
Combining multiple EXCELs (two files are combined in the example);
import glob
import pandas as pd
excl_list_path = ["./file1.xlsx", "./file2.xlsx"]
excl_list = []
for file in excl_list_path:
excl_list.append(pd.read_excel(file))
excl_merged = pd.DataFrame()
for excl_file in excl_list:
excl_merged = excl_merged.append(
excl_file, ignore_index=True)
excl_merged.to_excel('file1-file2-merged.xlsx', index=False)
Note; Your specific JSON file structure is important for these examples...

And I have the perfect function just for that
import pandas as pd
def save_to_excel(json_file, filename):
df = pd.read_json(json_file).T
df.to_excel(filename)
json_data = {"a": "data A", "b": "data B"}
save_to_excel(json_data, "json_data.xlsx")
More info here

You can try to use this library,
https://pypi.org/project/tablib/0.9.3/
It provides a lot of features that can help you on this.

How to combine all json file into one file?
ans:
import json
import glob
import pprint as pp #Pretty printer
'
combined = []
for json_file in glob.glob("*.json"): #Assuming that your json files and .py file in the same directory
with open(json_file, "rb") as infile:
combined.append(json.load(infile))
pp.pprint(combined)

Pyarrow append and update modes

With the code below, I can write the file in parquet format from disk to hdfs. But when I run the code again, it overwrites it. I want it to append or update. How can I do that? I would be glad if you help.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs
file = "source_path"
target = "target_path"
hdfs = pa.fs.HadoopFileSystem("hdfs://okay")
df = pd.read_csv(file)
table = pa.Table.from_pandas(df)
pq.write_table(table, target, filesystem=hdfs)

Convert CSV to Parquet in S3 with Python

I need convert a CSV file to Parquet file in S3 path. I'm trying use the code below, but no error occurs, the code execute with success and dont convert the CSV file
import pandas as pd
import boto3
import pyarrow as pa
import pyarrow.parquet as pq
s3 = boto3.client("s3", region_name='us-east-2', aws_access_key_id='my key id',
aws_secret_access_key='my secret key')
obj = s3.get_object(Bucket='my bucket', Key='test.csv')
df = pd.read_csv(obj['Body'])
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table=table, root_path="test.parquet")

AWS CSV to Parquet Converter in Python
This Script gets files from Amazon S3 and converts it to Parquet Version for later query jobs and uploads it back to the Amazon S3.
import numpy
import pandas
import fastparquet
def lambda_handler(event,context):
#identifying resource
s3_object = boto3.client('s3', region_name='us-east-2')
#access file
get_file = s3_object.get_object(Bucket='ENTER_BUCKET_NAME_HERE', Key='CSV_FILE_NAME.csv')
get = get_file['Body']
df = pandas.DataFrame(get)
#convert csv to parquet function
def conv_csv_parquet_file(df):
converted_data_parquet = df.to_parquet('converted_data_parquet_version.parquet')
conv_csv_parquet_file(df)
print("File converted from CSV to parquet completed")
#uploading the parquet version file
s3_path = "/converted_to_parquet/" + converted_data_parquet
put_response = s3_resource.Object('ENTER_BUCKET_NAME_HERE',converted_data_parquet).put(Body=converted_data_parquet)
Python Library Boto3 allows the lambda to get the CSV file from S3 and then Fast-Parquet (or Pyarrow) converts the CSV file into Parquet.
From- https://github.com/ayshaysha/aws-csv-to-parquet-converter.py

Python: Read several json files from a folder

I would like to know how to read several json files from a single folder (without specifying the files names, just that they are json files).
Also, it is possible to turn them into a pandas DataFrame?
Can you give me a basic example?

One option is listing all files in a directory with os.listdir and then finding only those that end in '.json':
import os, json
import pandas as pd
path_to_json = 'somedir/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
print(json_files) # for me this prints ['foo.json']
Now you can use pandas DataFrame.from_dict to read in the json (a python dictionary at this point) to a pandas dataframe:
montreal_json = pd.DataFrame.from_dict(many_jsons[0])
print montreal_json['features'][0]['geometry']
Prints:
{u'type': u'Point', u'coordinates': [-73.6051013, 45.5115944]}
In this case I had appended some jsons to a list many_jsons. The first json in my list is actually a geojson with some geo data on Montreal. I'm familiar with the content already so I print out the 'geometry' which gives me the lon/lat of Montreal.
The following code sums up everything above:
import os, json
import pandas as pd
# this finds our json files
path_to_json = 'json/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
# here I define my pandas Dataframe with the columns I want to get from the json
jsons_data = pd.DataFrame(columns=['country', 'city', 'long/lat'])
# we need both the json and an index number so use enumerate()
for index, js in enumerate(json_files):
with open(os.path.join(path_to_json, js)) as json_file:
json_text = json.load(json_file)
# here you need to know the layout of your json and each json has to have
# the same structure (obviously not the structure I have here)
country = json_text['features'][0]['properties']['country']
city = json_text['features'][0]['properties']['name']
lonlat = json_text['features'][0]['geometry']['coordinates']
# here I push a list of data into a pandas DataFrame at row given by 'index'
jsons_data.loc[index] = [country, city, lonlat]
# now that we have the pertinent json data in our DataFrame let's look at it
print(jsons_data)
for me this prints:
country city long/lat
0 Canada Montreal city [-73.6051013, 45.5115944]
1 Canada Toronto [-79.3849008, 43.6529206]
It may be helpful to know that for this code I had two geojsons in a directory name 'json'. Each json had the following structure:
{"features":
[{"properties":
{"osm_key":"boundary","extent":
[-73.9729016,45.7047897,-73.4734865,45.4100756],
"name":"Montreal city","state":"Quebec","osm_id":1634158,
"osm_type":"R","osm_value":"administrative","country":"Canada"},
"type":"Feature","geometry":
{"type":"Point","coordinates":
[-73.6051013,45.5115944]}}],
"type":"FeatureCollection"}

Iterating a (flat) directory is easy with the glob module
from glob import glob
for f_name in glob('foo/*.json'):
...
As for reading JSON directly into pandas, see here.

Loads all files that end with * .json from a specific directory into a dict:
import os,json
path_to_json = '/lala/'
for file_name in [file for file in os.listdir(path_to_json) if file.endswith('.json')]:
with open(path_to_json + file_name) as json_file:
data = json.load(json_file)
print(data)
Try it yourself:
https://repl.it/#SmaMa/loadjsonfilesfromfolderintodict

To read the json files,
import os
import glob
contents = []
json_dir_name = '/path/to/json/dir'
json_pattern = os.path.join(json_dir_name, '*.json')
file_list = glob.glob(json_pattern)
for file in file_list:
contents.append(read(file))

If turning into a pandas dataframe, use the pandas API.
More generally, you can use a generator..
def data_generator(my_path_regex):
for filename in glob.glob(my_path_regex):
for json_line in open(filename, 'r'):
yield json.loads(json_line)
my_arr = [_json for _json in data_generator(my_path_regex)]

I am using glob with pandas. Checkout the below code
import pandas as pd
from glob import glob
df = pd.concat([pd.read_json(f_name, lines=True) for f_name in glob('foo/*.json')])

A simple and very easy-to-understand answer.
import os
import glob
import pandas as pd
path_to_json = r'\path\here'
# import all files from folder which ends with .json
json_files = glob.glob(os.path.join(path_to_json, '*.json'))
# convert all files to datafr`enter code here`ame
df = pd.concat((pd.read_json(f) for f in json_files))
print(df.head())

I feel a solution using pathlib is missing :)
from pathlib import Path
file_list = list(Path("/path/to/json/dir").glob("*.json"))

One more option is to read it as a PySpark Dataframe and then convert it to Pandas Dataframe (if really necessary, depending on the operation I'd suggest keeping as a PySpark DF). Spark natively handles using a directory with JSON files as the main path without the need of libraries for reading or iterating over each file:
# pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.json('/some_dir_with_json/*.json')
Next, in order to convert into a Pandas Dataframe, you can do:
df = spark_df.toPandas()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

pyarrow writting Parquet files keeps overriding existing data sets - python

I'm trying to write to an existing Parquet file stored on the local filesystem. This isn't supported by the file format. Parquet files are immutable after being written.

Related

How to read the files of Azure file share as csv that is pandas dataframe

How to convert JSON file into EXCEL file in python

Pyarrow append and update modes

Convert CSV to Parquet in S3 with Python

Python: Read several json files from a folder

Categories

Resources