With the code below, I can write a file in Parquet format from disk to HDFS. But when I run the code again, it overwrites the existing file. I want it to append or update instead. How can I do that? I would be glad of any help.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs
file = "source_path"
target = "target_path"
hdfs = pa.fs.HadoopFileSystem("hdfs://okay")
df = pd.read_csv(file)
table = pa.Table.from_pandas(df)
pq.write_table(table, target, filesystem=hdfs)
Related
I need to convert a CSV file to a Parquet file at an S3 path. I'm trying the code below. No error occurs and the code executes successfully, but it doesn't convert the CSV file.
import pandas as pd
import boto3
import pyarrow as pa
import pyarrow.parquet as pq
s3 = boto3.client("s3", region_name='us-east-2', aws_access_key_id='my key id',
                  aws_secret_access_key='my secret key')
obj = s3.get_object(Bucket='my bucket', Key='test.csv')
df = pd.read_csv(obj['Body'])
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table=table, root_path="test.parquet")
AWS CSV to Parquet Converter in Python
This script gets a CSV file from Amazon S3, converts it to Parquet for later query jobs, and uploads the result back to Amazon S3.
import io
import boto3
import pandas
def lambda_handler(event, context):
    # identify the resource
    s3_client = boto3.client('s3', region_name='us-east-2')
    # access the file
    get_file = s3_client.get_object(Bucket='ENTER_BUCKET_NAME_HERE', Key='CSV_FILE_NAME.csv')
    # parse the CSV body into a DataFrame
    df = pandas.read_csv(io.BytesIO(get_file['Body'].read()))
    # convert CSV to Parquet (pandas needs pyarrow or fastparquet installed);
    # /tmp is the only writable directory in Lambda
    local_path = '/tmp/converted_data_parquet_version.parquet'
    df.to_parquet(local_path)
    print("File converted from CSV to parquet completed")
    # upload the Parquet version of the file
    s3_key = 'converted_to_parquet/converted_data_parquet_version.parquet'
    s3_client.upload_file(local_path, 'ENTER_BUCKET_NAME_HERE', s3_key)
The Python library Boto3 lets the Lambda function get the CSV file from S3, and then fastparquet (or pyarrow) converts the CSV data into Parquet.
From: https://github.com/ayshaysha/aws-csv-to-parquet-converter.py
I'm trying to write to an existing Parquet file stored on the local filesystem. But when writing multiple times, the previous data gets overwritten instead of added to.
from datetime import datetime
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def append_to_parquet_table(dataframe, filename):
    full_path = os.path.join('.', filename)
    table = pa.Table.from_pandas(dataframe)
    writer = pq.ParquetWriter(full_path, table.schema)
    writer.write_table(table=table)
def save(passed):
    data = {'number': [1234],
            'verified': [passed],
            'date': datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
    data_df = pd.DataFrame(data)
    append_to_parquet_table(data_df, 'results.parquet')
save(True)
save(False)
Why is the first data set being "updated" instead of a new one written?
I'm trying to write to an existing Parquet file stored on the local filesystem.
This isn't supported by the file format. Parquet files are immutable after being written.
I want to read DOCX/PDF files from the Hadoop file system using PySpark. Currently I am using the pandas API, but pandas is limited here: it reads tabular formats such as CSV, JSON, XLSX and HDF5, but not DOCX or PDF.
Currently my code is :
import pandas as pd
from pyspark import SparkContext, SparkConf
from hdfs import InsecureClient
conf = SparkConf().setAppName("Random")
sc = SparkContext(conf = conf)
client_hdfs = InsecureClient('http://192.00.00.30:50070')
with client_hdfs.read('/user/user.name/sample.csv', encoding = 'utf-8') as reader:
    df = pd.read_csv(reader,index_col=0)
print(df)
I am able to read CSV files using the code above. Are there any other APIs that can solve this problem for DOC/PDF files?
I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3).
First, I can read a single parquet file locally like this:
import pyarrow.parquet as pq
path = 'parquet/part-r-00000-1e638be4-e31f-498a-a359-47d017a0059c.gz.parquet'
table = pq.read_table(path)
df = table.to_pandas()
I can also read a directory of parquet files locally like this:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('parquet/')
table = dataset.read()
df = table.to_pandas()
Both work like a charm. Now I want to achieve the same remotely with files stored in an S3 bucket. I was hoping that something like this would work:
dataset = pq.ParquetDataset('s3n://dsn/to/my/bucket')
But it does not:
OSError: Passed non-file path: s3n://dsn/to/my/bucket
After reading pyarrow's documentation thoroughly, this does not seem possible at the moment. So I came up with the following solution:
Reading a single file from S3 and getting a pandas dataframe:
import io
import boto3
import pyarrow.parquet as pq
buffer = io.BytesIO()
s3 = boto3.resource('s3')
s3_object = s3.Object('bucket-name', 'key/to/parquet/file.gz.parquet')
s3_object.download_fileobj(buffer)
table = pq.read_table(buffer)
df = table.to_pandas()
And here is my hacky, not-so-optimized solution to create a pandas dataframe from an S3 folder path:
import io
import boto3
import pandas as pd
import pyarrow.parquet as pq
bucket_name = 'bucket-name'
def download_s3_parquet_file(s3, bucket, key):
    buffer = io.BytesIO()
    s3.Object(bucket, key).download_fileobj(buffer)
    return buffer
client = boto3.client('s3')
s3 = boto3.resource('s3')
objects_dict = client.list_objects_v2(Bucket=bucket_name, Prefix='my/folder/prefix')
s3_keys = [item['Key'] for item in objects_dict['Contents'] if item['Key'].endswith('.parquet')]
buffers = [download_s3_parquet_file(s3, bucket_name, key) for key in s3_keys]
dfs = [pq.read_table(buffer).to_pandas() for buffer in buffers]
df = pd.concat(dfs, ignore_index=True)
Is there a better way to achieve this? Maybe some kind of connector for pandas using pyarrow? I would like to avoid using pyspark, but if there is no other solution, then I would take it.
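One caveat about the listing step in the code above: a single list_objects_v2 call returns at most 1,000 keys, and the 'Contents' entry is missing when nothing matches the prefix. A minimal sketch of a more robust listing uses a boto3 paginator. The helper name list_parquet_keys is hypothetical, and client is assumed to be a boto3 S3 client:

```python
def list_parquet_keys(client, bucket, prefix):
    """Return every key under `prefix` that ends with '.parquet'.

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client('s3').
    """
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    # The paginator issues follow-up requests until all pages are fetched.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page when no object matches.
        for item in page.get("Contents", []):
            if item["Key"].endswith(".parquet"):
                keys.append(item["Key"])
    return keys
```

The returned list could replace s3_keys in the code above, for example s3_keys = list_parquet_keys(client, bucket_name, 'my/folder/prefix').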
You should use the s3fs module as proposed by yjk21. However, as a result of calling ParquetDataset you'll get a pyarrow.parquet.ParquetDataset object. To get a pandas DataFrame, apply .read_pandas().to_pandas() to it:
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()
pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()
Thanks! Your question actually tells me a lot. This is how I do it now with pandas (0.21.1), which will call pyarrow, and boto3 (1.3.1).
import boto3
import io
import pandas as pd
# Read single parquet file from S3
def pd_read_s3_parquet(key, bucket, s3_client=None, **args):
    if s3_client is None:
        s3_client = boto3.client('s3')
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return pd.read_parquet(io.BytesIO(obj['Body'].read()), **args)
# Read multiple parquets from a folder on S3 generated by spark
def pd_read_s3_multiple_parquets(filepath, bucket, s3=None,
                                 s3_client=None, verbose=False, **args):
    if not filepath.endswith('/'):
        filepath = filepath + '/'  # Add '/' to the end
    if s3_client is None:
        s3_client = boto3.client('s3')
    if s3 is None:
        s3 = boto3.resource('s3')
    s3_keys = [item.key for item in s3.Bucket(bucket).objects.filter(Prefix=filepath)
               if item.key.endswith('.parquet')]
    if not s3_keys:
        print('No parquet found in', bucket, filepath)
    elif verbose:
        print('Load parquets:')
        for p in s3_keys:
            print(p)
    dfs = [pd_read_s3_parquet(key, bucket=bucket, s3_client=s3_client, **args)
           for key in s3_keys]
    return pd.concat(dfs, ignore_index=True)
Then you can read multiple parquets under a folder from S3 by
df = pd_read_s3_multiple_parquets('path/to/folder', 'my_bucket')
(One can simplify this code a lot I guess.)
It can also be done using boto3, without calling pyarrow directly:
import boto3
import io
import pandas as pd
# Read the parquet file
buffer = io.BytesIO()
s3 = boto3.resource('s3')
object = s3.Object('bucket_name','key')
object.download_fileobj(buffer)
df = pd.read_parquet(buffer)
print(df.head())
Probably the easiest way to read parquet data on the cloud into dataframes is to use dask.dataframe in this way:
import dask.dataframe as dd
df = dd.read_parquet('s3://bucket/path/to/data-*.parq')
dask.dataframe can read from Google Cloud Storage, Amazon S3, Hadoop file system and more!
Provided you have the right package setup
$ pip install pandas==1.1.0 pyarrow==1.0.0 s3fs==0.4.2
and your AWS shared config and credentials files configured appropriately
you can use pandas right away:
import pandas as pd
df = pd.read_parquet("s3://bucket/key.parquet")
In case of having multiple AWS profiles you may also need to set
$ export AWS_DEFAULT_PROFILE=profile_under_which_the_bucket_is_accessible
so you can access your bucket.
If you are open to it, you can also use AWS Data Wrangler:
import awswrangler as wr
df = wr.s3.read_parquet(path="s3://...")
You can use s3fs from dask, which implements a filesystem interface for S3. Then you can use the filesystem argument of ParquetDataset like so:
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()
dataset = pq.ParquetDataset('s3n://dsn/to/my/bucket', filesystem=s3)
Using pre-signed URLs:
import dask.dataframe as dd
import s3fs
s3 = s3fs.S3FileSystem(key='your_key', secret='your_secret', client_kwargs={"endpoint_url": 'your_end_point'})
df = dd.read_parquet(s3.url('your_bucket/your_filepath', expires=3600, client_method='get_object'))
I have tried @oya163's solution and it works after a small change:
import boto3
import io
import pandas as pd
# Read the parquet file
buffer = io.BytesIO()
s3 = boto3.resource('s3',aws_access_key_id='123',aws_secret_access_key= '456')
object = s3.Object('bucket_name','myoutput.parquet')
object.download_fileobj(buffer)
df = pd.read_parquet(buffer)
print(df.head())
I would like to know how to read several json files from a single folder (without specifying the files names, just that they are json files).
Also, is it possible to turn them into a pandas DataFrame?
Can you give me a basic example?
One option is listing all files in a directory with os.listdir and then finding only those that end in '.json':
import os, json
import pandas as pd
path_to_json = 'somedir/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
print(json_files) # for me this prints ['foo.json']
Now you can use pandas DataFrame.from_dict to read the JSON (a Python dictionary at this point) into a pandas DataFrame:
montreal_json = pd.DataFrame.from_dict(many_jsons[0])
print(montreal_json['features'][0]['geometry'])
Prints:
{'type': 'Point', 'coordinates': [-73.6051013, 45.5115944]}
In this case I had appended some jsons to a list many_jsons. The first json in my list is actually a geojson with some geo data on Montreal. I'm familiar with the content already so I print out the 'geometry' which gives me the lon/lat of Montreal.
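The step that fills the many_jsons list is not shown above. A minimal sketch of how such a list could be built is given here. The helper name load_json_files is hypothetical, and files are read in sorted name order, which is an assumption rather than something the original states:

```python
import json
import os


def load_json_files(path_to_json):
    """Parse every *.json file in a directory and return the results as a list."""
    many_jsons = []
    # Sort the names so the order of the list is predictable.
    for name in sorted(os.listdir(path_to_json)):
        if name.endswith(".json"):
            with open(os.path.join(path_to_json, name)) as json_file:
                many_jsons.append(json.load(json_file))
    return many_jsons
```

With this helper, many_jsons = load_json_files('somedir/') would give a list whose first element can be passed to pd.DataFrame.from_dict as shown above.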
The following code sums up everything above:
import os, json
import pandas as pd
# this finds our json files
path_to_json = 'json/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
# here I define my pandas Dataframe with the columns I want to get from the json
jsons_data = pd.DataFrame(columns=['country', 'city', 'long/lat'])
# we need both the json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
    # here you need to know the layout of your json and each json has to have
    # the same structure (obviously not the structure I have here)
    country = json_text['features'][0]['properties']['country']
    city = json_text['features'][0]['properties']['name']
    lonlat = json_text['features'][0]['geometry']['coordinates']
    # here I push a list of data into a pandas DataFrame at row given by 'index'
    jsons_data.loc[index] = [country, city, lonlat]
# now that we have the pertinent json data in our DataFrame let's look at it
print(jsons_data)
For me this prints:
  country           city                   long/lat
0  Canada  Montreal city  [-73.6051013, 45.5115944]
1  Canada        Toronto  [-79.3849008, 43.6529206]
It may be helpful to know that for this code I had two geojsons in a directory named 'json'. Each json had the following structure:
{"features":
[{"properties":
{"osm_key":"boundary","extent":
[-73.9729016,45.7047897,-73.4734865,45.4100756],
"name":"Montreal city","state":"Quebec","osm_id":1634158,
"osm_type":"R","osm_value":"administrative","country":"Canada"},
"type":"Feature","geometry":
{"type":"Point","coordinates":
[-73.6051013,45.5115944]}}],
"type":"FeatureCollection"}
Iterating over a (flat) directory is easy with the glob module:
from glob import glob
for f_name in glob('foo/*.json'):
    ...
As for reading JSON directly into pandas, see here.
Loads all files that end with .json from a specific directory, each into a dict:
import os, json
path_to_json = '/lala/'
for file_name in [file for file in os.listdir(path_to_json) if file.endswith('.json')]:
    with open(path_to_json + file_name) as json_file:
        data = json.load(json_file)
        print(data)
Try it yourself:
https://repl.it/@SmaMa/loadjsonfilesfromfolderintodict
To read the JSON files:
import os
import glob
import json
contents = []
json_dir_name = '/path/to/json/dir'
json_pattern = os.path.join(json_dir_name, '*.json')
file_list = glob.glob(json_pattern)
for file in file_list:
    # parse each file and collect the result
    with open(file) as json_file:
        contents.append(json.load(json_file))
If turning into a pandas dataframe, use the pandas API.
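More generally, you can use a generator. This version assumes each file holds one JSON document per line:
import glob
import json
def data_generator(my_path_pattern):
    # my_path_pattern is a glob pattern such as 'foo/*.json', not a regular expression
    for filename in glob.glob(my_path_pattern):
        with open(filename, 'r') as json_file:
            for json_line in json_file:
                yield json.loads(json_line)
my_arr = list(data_generator('foo/*.json'))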
I am using glob with pandas. Check out the code below:
import pandas as pd
from glob import glob
df = pd.concat([pd.read_json(f_name, lines=True) for f_name in glob('foo/*.json')])
A simple and very easy-to-understand answer.
import os
import glob
import pandas as pd
path_to_json = r'\path\here'
# import all files from folder which ends with .json
json_files = glob.glob(os.path.join(path_to_json, '*.json'))
# convert all files to dataframe
df = pd.concat((pd.read_json(f) for f in json_files))
print(df.head())
I feel a solution using pathlib is missing :)
from pathlib import Path
file_list = list(Path("/path/to/json/dir").glob("*.json"))
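The pathlib listing above only collects the paths. A minimal sketch of going one step further and parsing each file follows. The helper name load_json_dir is hypothetical, and the files are assumed to be UTF-8 encoded:

```python
import json
from pathlib import Path


def load_json_dir(json_dir):
    """Map each file's stem (its name without '.json') to its parsed content."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(json_dir).glob("*.json"))
    }
```

Keying the result by file stem keeps track of which file each record came from, which a plain list does not.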
One more option is to read it as a PySpark Dataframe and then convert it to Pandas Dataframe (if really necessary, depending on the operation I'd suggest keeping as a PySpark DF). Spark natively handles using a directory with JSON files as the main path without the need of libraries for reading or iterating over each file:
# pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.json('/some_dir_with_json/*.json')
Next, in order to convert into a Pandas Dataframe, you can do:
df = spark_df.toPandas()