Applying a custom file parser on a PySpark RDD - python

I have a set of custom log files that I need to parse. I am currently working in Azure Databricks, but am quite new to PySpark. The log files are hosted in an Azure Blob Storage account, which is mounted to our Azure Databricks instance.
An example input log file:
Value_x: 1
Value_y: "Station"
col1;col2;col3;col4;
A1;B1;C1;D1;
A2;B2;C2;D2;
A3;B3;C3;D3;
The desired output is a list of strings, but I can also work with a list of lists:
['A1;B1;C1;D1;1;station',
'A2;B2;C2;D2;1;station',
'A3;B3;C3;D3;1;station']
Here is the snippet of code that applies these transformations:
def custom_parser(file, content):
    content_ = content.replace('"', '').replace('\r', '').split('\n')
    content_ = [line for line in content_ if len(line) > 0]
    x = content_[0].split('Value_x:')[-1].strip()
    y = content_[1].split('Value_y:')[-1].strip()
    content_ = content_[3:]
    content_ = [line + ';'.join([x, y]) for line in content_]
    return content_
from pyspark import SparkConf
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate(SparkConf())
files = sc.wholeTextFiles('spam/eggs/').collect()

parsed_content = []
for file, content in files:
    parsed_content += custom_parser(file, content)
I have developed a custom_parser function to handle the content of these log files. But I am left with some questions:
Can I apply this custom_parser action directly to the Spark RDD returned by sc.wholeTextFiles so I can use the parallelization features of Spark?
Is parsing the data in such an ad-hoc method the most performant method?

You cannot apply your custom_parser action directly on sc.wholeTextFiles, but what you can do is use custom_parser as a map function. After reading your files you get an RDD[(String, String)] of (path, content) pairs; apply custom_parser with rdd.map(custom_parser) and then write the result wherever you need it. That way the work runs in parallel on the executors, instead of entirely on the driver as you are doing now.
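For a concrete picture, here is a minimal sketch of that approach, assuming the mount point from the question and a hypothetical output path; flatMap rather than map is used because custom_parser returns a list of lines per file:
from pyspark import SparkConf
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate(SparkConf())

# RDD of (path, content) pairs; nothing is collected to the driver
rdd = sc.wholeTextFiles('spam/eggs/')

# Parse on the executors; flatMap flattens each file's list of lines into individual records
parsed = rdd.flatMap(lambda pair: custom_parser(pair[0], pair[1]))

# Write out in parallel (or collect() only if the parsed result is small)
parsed.saveAsTextFile('spam/eggs_parsed/')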

Related

How to save a PySpark dataframe as a CSV with custom file name?

Here is the spark DataFrame I want to save as a csv.
type(MyDataFrame)
--Output: <class 'pyspark.sql.dataframe.DataFrame'>
To save this as a CSV, I have the following code:
MyDataFrame.write.csv(csv_path, mode = 'overwrite', header = 'true')
When I save this, the file name is something like this:
part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv
Is there a way I can give this a custom name while saving it? Like "MyDataFrame.csv"
I have the same requirement. You can write to one path and then rename the resulting file. This is my solution:
def write_to_hdfs_specify_path(df, spark, hdfs_path, file_name):
    """
    :param df: dataframe which you want to save
    :param spark: sparkSession
    :param hdfs_path: target path (should not already exist)
    :param file_name: csv file name
    :return:
    """
    sc = spark.sparkContext
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
    df.coalesce(1).write.option("header", True).option("delimiter", "|").option("compression", "none").csv(hdfs_path)
    fs = FileSystem.get(Configuration())
    file = fs.globStatus(Path("%s/part*" % hdfs_path))[0].getPath().getName()
    full_path = "%s/%s" % (hdfs_path, file_name)
    result = fs.rename(Path("%s/%s" % (hdfs_path, file)), Path(full_path))
    return result
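A hypothetical call to this helper (the target path and file name below are just examples):
write_to_hdfs_specify_path(MyDataFrame, spark, '/tmp/mydataframe_out', 'MyDataFrame.csv')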
No. That's how Spark works (at least for now). You'd have MyDataFrame.csv as a directory name, and under that directory you'd have multiple files with names like part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv, part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c001.csv, etc.
It's not recommended, but if your data is small enough (what counts as "small enough" here is debatable), you can always convert it to pandas and save it to a single CSV file with any name you want.
.coalesce(1) will guarantee that there is only one file, but it will not guarantee the file name. Save to a temp directory, then rename and copy the file (using dbutils.fs functions if you use Databricks, or FileUtil from the Hadoop API).
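For the Databricks case, a minimal sketch of that temp-directory-then-rename approach could look like this (the mount paths are placeholders; dbutils is available in Databricks notebooks):
temp_dir = '/mnt/output/_tmp_mydataframe'
final_path = '/mnt/output/MyDataFrame.csv'

MyDataFrame.coalesce(1).write.csv(temp_dir, mode='overwrite', header=True)

# Find the single part file Spark wrote, copy it to the desired name, then clean up the temp directory
part_file = [f.path for f in dbutils.fs.ls(temp_dir) if f.name.startswith('part-')][0]
dbutils.fs.cp(part_file, final_path)
dbutils.fs.rm(temp_dir, True)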

Python: How to Extract Zip-Files in Google Cloud Storage Without Running Out of Memory?

I need to extract the files in a zip file in Google Cloud Storage. I'm using a python function to do this, but I keep running into memory issues even when using a Dask Cluster and each Dask worker has a 20GB memory limit.
How could I optimize my code so that it doesn't consume as much memory? Perhaps reading the zip file in chunks and streaming them to a temporary file and then sending this file to Google Cloud Storage?
Would appreciate any guidance here.
Here is my code:
import io
from zipfile import ZipFile, is_zipfile

from google.cloud import storage

#task
def unzip_files(
    bucket_name,
    zip_data
):
    file_date = zip_data['file_date']
    gcs_folder_path = zip_data['gcs_folder_path']
    gcs_blob_name = zip_data['gcs_blob_name']

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)

    destination_blob_pathname = f'{gcs_folder_path}/{gcs_blob_name}'
    blob = bucket.blob(destination_blob_pathname)
    zipbytes = io.BytesIO(blob.download_as_string())

    if is_zipfile(zipbytes):
        with ZipFile(zipbytes, 'r') as zipObj:
            extracted_file_paths = []
            for content_file_name in zipObj.namelist():
                content_file = zipObj.read(content_file_name)
                extracted_file_path = f'{gcs_folder_path}/hgdata_{file_date}_{content_file_name}'
                blob = bucket.blob(extracted_file_path)
                blob.upload_from_string(content_file)
                extracted_file_paths.append(f'gs://{bucket_name}/{extracted_file_path}')
        return extracted_file_paths
    else:
        return []
I do not quite follow your code, but in general, dask plays nicely with complex file operations like this, using the fsspec and gcsfs libraries. For example (and you don't need Dask for this):
import fsspec

with fsspec.open_files("zip://*::gcs://gcs_folder_path/gcs_blob_name") as open_files:
    for of in open_files:
        with fsspec.open("gcs://{something from fo}", "wb") as f:
            data = True
            while data:
                data = of.read(2**22)
                f.write(data)
You could instead do
open_files = fsspec.open_files(...)
and parallelise the loop with Dask.
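A hedged sketch of that parallelised variant with dask.delayed (the zip URL and destination bucket are placeholders, and of.path is assumed to give a usable name for each zip member):
import dask
import fsspec

def copy_member(of, destination_url):
    # Stream one zip member to GCS in ~4 MiB chunks so memory stays bounded
    with of as src, fsspec.open(destination_url, 'wb') as dst:
        while True:
            chunk = src.read(2**22)
            if not chunk:
                break
            dst.write(chunk)

open_files = fsspec.open_files('zip://*::gcs://gcs_folder_path/gcs_blob_name')
tasks = [dask.delayed(copy_member)(of, 'gcs://destination_bucket/' + of.path)
         for of in open_files]
dask.compute(*tasks)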

Writing files using spark and reading using python

Writing a file to S3 using Spark usually creates a directory with 11 files: a _SUCCESS marker and ten part files that hold the actual data. How can I load that data into a pandas dataframe, given that the part file names vary with each run?
For example, the write call looks like this:
df.coalesce(10).write.save("s3://testfolder.csv")
The files stored in the directory are:
- _SUCCESS
- part-00-*.parquet
I have a python job which reads the file into a pandas dataframe:
pd.read(s3\\..........what is the path to specify here.................)
When writing files with Spark, you cannot choose the file name (you can try, but you end up with what you described above). If you want a single file to later load into pandas, you would do something like this:
df.repartition(1).write.parquet(path="s3://testfolder/", mode='append')
The end result will be a single file in "s3://testfolder/" whose name starts with part-00- and ends with .parquet. You can simply read that file in, or rename it to something specific before reading it in with pandas.
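For instance, a minimal sketch of locating and reading that single part file (this assumes the s3fs package, which pandas also uses under the hood for s3:// paths; the bucket name is the one from the question):
import s3fs
import pandas as pd

fs = s3fs.S3FileSystem()
part_file = fs.glob('s3://testfolder/part-*.parquet')[0]  # glob returns paths without the scheme
df_pandas = pd.read_parquet('s3://' + part_file)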
Option 1: (Recommended)
You can use awswrangler. It's a lightweight tool that helps with the integration between pandas, S3 and Parquet, and it lets you read in multiple files from a directory.
pip install awswrangler
import awswrangler as wr

df = wr.s3.read_parquet(path='s3://testfolder/')
Option 2:
############################## RETRIEVE KEYS FROM THE BUCKET ##################################
import boto3
import pandas as pd

s3 = boto3.client('s3')
s3_bucket_name = 'your bucket name'
prefix = 'path where the files are located'

response = s3.list_objects_v2(
    Bucket=s3_bucket_name,
    Prefix=prefix
)

keys = []
for obj in response['Contents']:
    keys.append(obj['Key'])

##################################### READ IN THE FILES #######################################
df = []
for key in keys:
    df.append(pd.read_parquet(path='s3://' + s3_bucket_name + '/' + key, engine='pyarrow'))
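Note that the loop above leaves you with a list of DataFrames, one per key; if you want a single DataFrame you would still concatenate them, for example:
combined = pd.concat(df, ignore_index=True)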

Read Files from multiple folders in Apache Beam and map outputs to filenames

I am working on reading files from multiple folders and outputting the file contents together with the file name, as (filecontents, filename), to BigQuery in Apache Beam, using the Python SDK and a Dataflow runner.
I originally thought I could create a PCollection for each file and then map the file contents to the filename.
def read_documents(pipeline):
    """Read the documents at the provided uris and returns (uri, line) pairs."""
    pcolls = []
    count = 0
    with open(TESTIN) as uris:
        for uri in uris:
            # print str(uri).strip("[]/'")
            pcolls.append(
                pipeline
                | 'Read: uri' + str(uri) >> ReadFromText(str(uri).strip("[]/'"), compression_type='gzip')
                | 'WithKey: uri' + str(uri) >> beam.Map(lambda v, uri: (v, str(uri).strip("[]")), uri)
            )
    return pcolls | 'FlattenReadPColls' >> beam.Flatten()
This worked fine but was slow and wouldn't work on the Dataflow runner beyond roughly 10,000 files; it would suffer from a broken pipe.
Currently trying to overload the ReadAllFromText function from Text.io. Text.io is designed to read tons of files quickly from a pcollection of filenames or patterns. There is a bug in this module if reading from Google cloud storage and the file has content encoding. Google Cloud storage automatically gunzips files and transcodes them but for some reason ReadAllFromText doesn't work with it. You have to change the metadata of the file to remove content encoding and set the compression type on ReadAllFromText to gzip. I'm including this issue url in case anyone else has problems with ReadAllFromText
https://issues.apache.org/jira/browse/BEAM-1874
My current code looks like this:
class ReadFromGs(ReadAllFromText):

    def __init__(self):
        super(ReadFromGs, self).__init__(compression_type="gzip")

    def expand(self, pvalue):
        files = self._read_all_files
        return (
            pvalue
            | 'ReadAllFiles' >> files  # self._read_all_files
            | 'Map values' >> beam.Map(lambda v: (v, filename))  # filename is a placeholder for the input filename that I'm trying to figure out how to include in the output
        )
ReadAllFromText is contained in Text.io, calls ReadAllText from filebasedsource.py, and inherits from PTransform.
I believe I'm just missing something simple.
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filebasedsource.py
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py
As you found, ReadFromText doesn't currently support dynamic filenames, and you definitely don't want to create individual steps for each URL. From your first sentence I understand you want to get the filename and the file content out as one item. That means you won't need or benefit from any streaming of parts of the file; you can simply read the file contents. Something like:
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


def read_all_from_url(url):
    with FileSystems.open(url) as f:
        return f.read()


def read_from_urls(pipeline, urls):
    return (
        pipeline
        | beam.Create(urls)
        | 'Read File' >> beam.Map(lambda url: (
            url,
            read_all_from_url(url)
        ))
    )
You can customise it if you think you're having issues with metadata. The output will be a tuple of (url, file contents). If your file contents are very large, you might need a slightly different approach depending on your use case.
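A hypothetical way to wire the helper into a pipeline (the URLs and the final sink are placeholders; you would replace the print with your BigQuery write):
with beam.Pipeline() as pipeline:
    urls = ['gs://my-bucket/folder1/file1.gz', 'gs://my-bucket/folder2/file2.gz']
    results = read_from_urls(pipeline, urls)
    results | 'Print' >> beam.Map(print)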

How to get data from s3 and do some work on it? python and boto

I have a project task that uses some output data I have already produced on S3 in an EMR task. I previously ran an EMR job that produced output in one of my S3 buckets, in the form of multiple files named part-xxxx. Now I need to access those files from within my new EMR job, read their contents, and use that data to produce another output.
This is the local code that does the job:
def reducer_init(self):
    self.idfs = {}
    for fname in os.listdir(DIRECTORY):  # look through file names in the directory
        file = open(os.path.join(DIRECTORY, fname))  # open a file
        for line in file:  # read each line in json file
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
This runs just fine locally, but the problem is that the data I need is now on S3, and I need to access it somehow in the reducer_init function.
This is what I have so far, but it fails while executing on EC2:
def reducer_init(self):
    self.idfs = {}
    b = conn.get_bucket(bucketname)
    idfparts = b.list(destination)
    for key in idfparts:
        file = open(os.path.join(idfparts, key))
        for line in file:
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
AWS access info is defined as follows:
awskey = '*********'
awssecret = '***********'
conn = S3Connection(awskey, awssecret)
bucketname = 'mybucket'
destination = '/path/to/previous/output'
There are two ways of doing this:
Download the file to your local system and parse it (kinda simple, quick and easy).
Get the data stored on S3 into memory and parse it (a bit more complex in the case of huge files).
Step 1:
On S3, file names are stored as keys; if you have a file named "Demo" stored in a folder named "DemoFolder", then the key for that particular file is "DemoFolder/Demo".
Use the code below to download the file into a temp folder.
from boto.s3 import connect_to_region
from boto.s3.connection import Location

AWS_KEY = 'xxxxxxxxxxxxxxxxxx'
AWS_SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
BUCKET_NAME = 'DemoBucket'
fileName = 'Demo'

conn = connect_to_region(Location.USWest2,
                         aws_access_key_id=AWS_KEY,
                         aws_secret_access_key=AWS_SECRET_KEY,
                         is_secure=False,
                         host='s3-us-west-2.amazonaws.com')
source_bucket = conn.lookup(BUCKET_NAME)

''' Download the file '''
for name in source_bucket.list():
    if fileName in name.name:
        print("DOWNLOADING", fileName)
        name.get_contents_to_filename(tempPath)  # tempPath: local file path to download to
You can then work on the file in that temp path.
Step 2:
You can also fetch data as a string using data = name.get_contents_as_string(). In the case of huge files (> 1 GB) you may come across memory errors; to avoid them, you will have to write a lazy function which reads the data in chunks.
For example, you can use a Range header to fetch part of the file: data = name.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (0, 100000000)}).
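A minimal sketch of such a lazy chunked reader over a boto key (the function name and chunk size are my own; key.size is the object size boto reports for listed keys):
def read_in_chunks(key, chunk_size=100 * 1024 * 1024):
    # Yield the object's contents in chunk_size pieces using HTTP Range requests
    start = 0
    while start < key.size:
        end = min(start + chunk_size, key.size) - 1
        yield key.get_contents_as_string(headers={'Range': 'bytes=%d-%d' % (start, end)})
        start = end + 1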
I am not sure if I answered your question properly; I can write custom code for your requirement once I get some time. Meanwhile, please feel free to post any queries you have.
