I have a Dataflow pipeline using Apache Beam dataframe, and I'd like to write the csv to a GCS bucket. This is my code:
with beam.Pipeline(options=pipeline_options) as p:
    df = p | read_csv(known_args.input)
    df[column] = df.groupby(primary_key)[column].apply(lambda x: x.ffill().bfill())
    df.to_csv(known_args.output, index=False, encoding='utf-8')
However, although I pass a GCS path as known_args.output, the CSV written to GCS gets a shard suffix appended, like gs://path/to/file-00000-of-00001. For my project, I need the file name without the shard suffix. I've read the documentation, but there seems to be no option to remove the shard. I tried converting the dataframe back to a PCollection and using WriteToText, but it doesn't work either, and it isn't a desirable solution anyway.
It looks like you're right; in Beam 2.40 there's no way to customize the sharding of these dataframe write operations. Instead, you'll have to convert back to a PCollection and use WriteToText(..., shard_name_template='').
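A minimal sketch of that workaround, reusing the names from the question; the CSV formatting via str/join is a simplifying assumption (no quoting and no header row):
from apache_beam.dataframe.convert import to_pcollection

with beam.Pipeline(options=pipeline_options) as p:
    df = p | read_csv(known_args.input)
    df[column] = df.groupby(primary_key)[column].apply(lambda x: x.ffill().bfill())

    # Convert the deferred dataframe back to a PCollection of row tuples.
    rows = to_pcollection(df)
    (rows
     | 'ToCsvLine' >> beam.Map(lambda row: ','.join(str(v) for v in row))
     | 'Write' >> beam.io.WriteToText(
           known_args.output,
           shard_name_template=''))  # empty template -> no -00000-of-00001 suffix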
I filed BEAM-22923. When the relevant PR is merged, this fix will allow one to pass an explicit file naming parameter (which will allow customization of this as well as windowing information), e.g.
df.to_csv(
    output_dir,
    num_shards=1,
    file_naming=fileio.single_file_naming('out.csv'))
I have many binary files (.tdms format, similar to .wav) stored in S3 and I would like to read them with nptdms then process them in a distributed fashion with Dask on a cluster.
In PySpark there is pyspark.SparkContext.binaryFiles(), which produces an RDD with a bytearray for each input file - a simple solution to this problem.
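For reference, a minimal sketch of that PySpark approach; the bucket path is a placeholder and the nptdms parsing step is an illustrative assumption:
import io
from nptdms import TdmsFile
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# binaryFiles yields (path, bytes) pairs, one whole file per element.
raw = sc.binaryFiles("s3a://my-bucket/data/*.tdms")
tdms = raw.mapValues(lambda content: TdmsFile.read(io.BytesIO(content)))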
I have not found an equivalent function in Dask - is there one? If not, how could the equivalent functionality be achieved in Dask?
I noticed there's dask.bytes.read_bytes() if it's necessary to involve this; however, nptdms can't read a chunk of a file - it needs the entire file to be available, and I'm not sure how to accomplish that.
dask.bytes.read_bytes() will give you whole files if you use blocksize=None, i.e., exactly one block per file. The most common use case for that is compressed files (e.g., gzip), where you can't start mid-stream, but it should work for your use case too. Note that the delayed objects you get each return bytes, not open files.
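A minimal sketch of that, assuming s3fs is available and using a placeholder bucket path; wrapping the bytes in BytesIO before handing them to nptdms is an assumption:
import io
import dask
import dask.bytes
from nptdms import TdmsFile

# blocksize=None -> one delayed bytes object per file (no splitting).
sample, blocks = dask.bytes.read_bytes("s3://my-bucket/data/*.tdms",
                                       blocksize=None)

@dask.delayed
def parse_tdms(raw):
    # read_bytes hands us raw bytes, so wrap them in a file-like object.
    return TdmsFile.read(io.BytesIO(raw))

# Each entry in `blocks` is a list holding a single delayed value (the whole file).
tdms = [parse_tdms(b[0]) for b in blocks]
results = dask.compute(*tdms)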
Alternatively, you can use fsspec.open_files. This returns OpenFile objects, which are safe to serialise and so you can use them in dask.delayed calls such as
ofs = fsspec.open_files("s3://...", ...)
@dask.delayed
def read_a_file(of):
    with of as f:
        # entering the context actually touches storage
        return TdmsFile.read(f)
tdms = [read_a_file(of) for of in ofs]
I have CSV files in Azure which I read using the function with the following signature:
get_blob_to_stream(container_name, blob_name, stream, snapshot=None,
start_range=None, end_range=None, validate_content=False,
progress_callback=None, max_connections=2, lease_id=None,
if_modified_since=None, if_unmodified_since=None, if_match=None,
if_none_match=None, timeout=None)
The start_range and end_range are good parameters if you want to pull a range of bytes from the blob, but say I know my blob is a CSV and I want it to bring me precisely lines 1 to 1000, kind of like how I'd tell pandas pd.read_csv(..., nrows=1000, skiprows=range(0, 1)). How would I proceed?
Looking at the Azure documentation, it doesn't look like that function offers that functionality.
However, I found this answer, which seems promising. Maybe you can redirect the stream directly into pandas' read_csv function and continue from there.
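A minimal sketch of that idea, assuming the legacy azure-storage SDK's BlockBlobService (the account details, container, and blob names below are placeholders):
import io
import pandas as pd
from azure.storage.blob import BlockBlobService

service = BlockBlobService(account_name='myaccount', account_key='...')

# Pull the whole blob into an in-memory buffer...
stream = io.BytesIO()
service.get_blob_to_stream('my-container', 'my-file.csv', stream)
stream.seek(0)  # rewind before handing the buffer to pandas

# ...and let pandas do the row slicing.
df = pd.read_csv(stream, nrows=1000)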
I'm trying to read a file stored in Google Cloud Storage from Apache Beam using pandas, but I'm getting an error.
def Panda_a(self):
    import pandas as pd
    data = 'gs://tegclorox/Input/merge1.csv'
    df1 = pd.read_csv(data, names=['first_name', 'last_name', 'age',
                                   'preTestScore', 'postTestScore'])
    return df1

ip2 = p | 'Split WeeklyDueto' >> beam.Map(Panda_a)
ip7 = ip2 | 'print' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234')
When I execute the above code, the error says the path does not exist. Any idea why?
A bunch of things are wrong with this code; a sketch that pulls the fixes together follows the list.
Trying to get Pandas to read a file from Google Cloud Storage: Pandas does not support the Google Cloud Storage filesystem (as @Andrew pointed out, the documentation says the supported schemes are http, ftp, s3, and file). However, you can use the Beam FileSystems.open() API to get a file object, and give that object to Pandas instead of the file path.
p | ... >> beam.Map(...) - beam.Map(f) transforms every element of the input PCollection using the given function f; it can't be applied to the pipeline itself. It seems that in your case you want to simply run the Pandas code without any input. You can simulate that by supplying a bogus input, e.g. beam.Create(['ignored']).
beam.Map(f) requires f to return a single value (or more precisely: if it returns a list, that list is interpreted as a single value), but your code gives it a function that returns a Pandas dataframe. I strongly doubt you want to create a PCollection containing a single element where that element is the entire dataframe - more likely, you want one element for every row of the dataframe. For that, you need beam.FlatMap, and you need df.iterrows() or something like it.
In general, I'm not sure why you'd read the CSV file using Pandas at all. You can read it using Beam's ReadFromText with skip_header_lines=1 and then parse each line yourself - if you have a large amount of data, this will be a lot more efficient (and if you only have a small amount of data and don't anticipate it ever growing large enough to exceed the capabilities of a single machine - say, if it will never be above a few GB - then Beam is the wrong tool).
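A minimal sketch that pulls these points together; the bucket paths and column names come from the question, while the CSV-line formatting and the rest of the wiring are assumptions:
import pandas as pd
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

def read_rows(_):
    # Open the GCS object through Beam's filesystem layer and hand the
    # file object (not the path) to pandas.
    with FileSystems.open('gs://tegclorox/Input/merge1.csv') as f:
        df = pd.read_csv(f, names=['first_name', 'last_name', 'age',
                                   'preTestScore', 'postTestScore'])
    # Emit one element per dataframe row instead of one element per dataframe.
    for _, row in df.iterrows():
        yield ','.join(str(v) for v in row)

with beam.Pipeline() as p:
    (p
     | 'Start' >> beam.Create(['ignored'])          # bogus single-element input
     | 'ReadWithPandas' >> beam.FlatMap(read_rows)
     | 'Write' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234'))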
I'm working on a file parser for Spark that can basically read in n lines at a time and place all of those lines as a single row in a dataframe.
I know I need to use InputFormat to try and specify that, but I cannot find a good guide to this in Python.
Is there a method for specifying a custom InputFormat in Python or do I need to create it as a scala file and then specify the jar in spark-submit?
You can use Hadoop InputFormats directly with PySpark.
Quoting from the documentation,
PySpark can also read any Hadoop InputFormat or write any Hadoop
OutputFormat, for both ‘new’ and ‘old’ Hadoop MapReduce APIs.
Pass the Hadoop InputFormat class to whichever of these pyspark.SparkContext methods suits your case:
hadoopFile()
hadoopRDD()
newAPIHadoopFile()
newAPIHadoopRDD()
To read n lines, org.apache.hadoop.mapreduce.lib.input.NLineInputFormat can be used as the Hadoop InputFormat class with the newAPI methods.
I cannot find a good guide to this in Python
In the Spark docs, under "Saving and Loading Other Hadoop Input/Output Formats", there is an Elasticsearch example + links to an HBase example.
can basically read in n lines at a time... I know I need to use InputFormat to try and specify that
There is NLineInputFormat specifically for that.
This is a rough translation of some Scala code I have from NLineInputFormat not working in Spark
from pyspark import SparkContext

def nline(n, path):
    sc = SparkContext.getOrCreate()
    # Hadoop configuration values must be passed as strings
    conf = {
        "mapreduce.input.lineinputformat.linespermap": str(n)
    }
    hadoopIO = "org.apache.hadoop.io"
    return sc.newAPIHadoopFile(path,
        "org.apache.hadoop.mapreduce.lib.input.NLineInputFormat",
        hadoopIO + ".LongWritable",
        hadoopIO + ".Text",
        conf=conf).map(lambda x: x[1])  # strip out the file-offset key

n = 3
rdd = nline(n, "/file/input")
and place all of those lines as a single row in a dataframe
With NLineInputFormat, each string in the RDD is actually newline-delimited. You can rdd.map(lambda record: "\t".join(record.split('\n'))), for example, to make one line out of them.
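A sketch of the last step, turning each n-line record into a single dataframe row; the column name "lines" and the use of rdd.toDF (which needs an active SparkSession) are assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # needed so rdd.toDF is available

n = 3
rdd = nline(n, "/file/input")

df = (rdd
      .map(lambda record: (record.replace('\n', '\t'),))  # one tuple per n-line block
      .toDF(["lines"]))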
I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:
files = ['s3a://dev/2017/01/03/data.parquet',
's3a://dev/2017/01/02/data.parquet']
df = session.read.parquet(*files)
This works if all of the files exist on S3, but I would like to ask for a list of files to be loaded into a dataframe without breaking when some of the files in the list don't exist. In other words, I would like for sparkSql to load as many of the files as it finds into the dataframe, and return this result without complaining. Is this possible?
Yes, it's possible if you change the method of specifying input to a Hadoop glob pattern, for example:
files = 's3a://dev/2017/01/{02,03}/data.parquet'
df = session.read.parquet(files)
You can read more on patterns in Hadoop javadoc.
But in my opinion this isn't an elegant way of working with data partitioned by time (by day in your case). If you are able to rename the directories like this:
s3a://dev/2017/01/03/data.parquet --> s3a://dev/day=2017-01-03/data.parquet
s3a://dev/2017/01/02/data.parquet --> s3a://dev/day=2017-01-02/data.parquet
then you can take advantage of spark partitioning schema and read data by:
session.read.parquet('s3a://dev/') \
    .where(col('day').between('2017-01-02', '2017-01-03'))
This way will also omit empty/non-existing directories. Additionally, a day column will appear in your dataframe (it will be a string in Spark < 2.1.0 and a datetime in Spark >= 2.1.0), so you will know which directory each record came from.
Can I observe that, since glob-pattern matching includes a full recursive tree-walk and pattern match of the paths, it is an absolute performance killer against object stores, especially S3. There's a special shortcut in Spark to recognise when your path doesn't have any glob characters, in which case it makes a more efficient choice.
Similarly, a very deep partitioning tree, as in that year/month/day layout, means many directories scanned, at a cost of hundreds of milliseconds (or worse) per directory.
The layout suggested by Mariusz should be much more efficient, as it is a flatter directory tree; switching to it should have a bigger impact on performance on object stores than on real filesystems.
A solution using union
files = ['s3a://dev/2017/01/03/data.parquet',
's3a://dev/2017/01/02/data.parquet']
for i, file in enumerate(files):
    act_df = spark.read.parquet(file)
    if i == 0:
        df = act_df
    else:
        df = df.union(act_df)
An advantage is that it works regardless of any naming pattern; a variant that also skips missing paths is sketched below.
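A hedged variant of the same union idea that also skips paths which don't exist, which is what the question asked for; catching AnalysisException for a missing path is an assumption about how Spark reports it:
from functools import reduce
from pyspark.sql.utils import AnalysisException

def read_existing_parquet(spark, paths):
    dfs = []
    for path in paths:
        try:
            dfs.append(spark.read.parquet(path))
        except AnalysisException:
            # "Path does not exist" -> just skip this file
            pass
    return reduce(lambda a, b: a.union(b), dfs)

df = read_existing_parquet(spark, ['s3a://dev/2017/01/03/data.parquet',
                                   's3a://dev/2017/01/02/data.parquet'])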
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from awsglue.job import Job
import boto3
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
inputDyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://dev-test-laxman-new-bucket/"]},
    format="parquet")
I am able to read multiple (2) parquet files from s3://dev-test-laxman-new-bucket/ and write them out as csv files; a sketch of the write step follows below.
As you can see, I have 2 parquet files in my bucket.
Hope it will be helpful to others.
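The write-to-CSV step isn't shown in the code above; a hedged sketch of it, with a placeholder output prefix, might look like:
# Write the dynamic frame back to S3 as CSV files.
glueContext.write_dynamic_frame.from_options(
    frame=inputDyf,
    connection_type="s3",
    connection_options={"path": "s3://dev-test-laxman-new-bucket/output/"},
    format="csv")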