Writing Files on Hadoop Line by line using python

I'm working with files whose lines have varying schemas, so I need to parse each line and decide what to do with it based on its content. That means I need to write files to HDFS line by line.
Is there a way to achieve that in Python?

You can get IOUtils from sc._gateway.jvm and use it to stream from one Hadoop (or local) file to another file on Hadoop.
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
fs = FileSystem.get(Configuration())
IOUtils = sc._gateway.jvm.org.apache.hadoop.io.IOUtils
f = fs.open(Path("/user/test/abc.txt"))
output_stream = fs.create(Path("/user/test/a1.txt"))
IOUtils.copyBytes(f, output_stream, Configuration())
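For the line-by-line case in the question, one possible shape for the routing logic is sketched below. It is only an illustration: `route_lines`, `classify`, `open_sink`, and `by_field_count` are hypothetical names, the field-count rule is an assumed stand-in for the real schema check, and the sinks here are in-memory buffers rather than HDFS streams.

```python
import io

def route_lines(lines, classify, open_sink):
    """Send each line to the sink chosen by classify(line).

    lines     -- any iterable of text lines
    classify  -- hypothetical function mapping a line to a sink name
    open_sink -- hypothetical factory returning an object with write()
    """
    sinks = {}
    for line in lines:
        name = classify(line)
        if name not in sinks:
            sinks[name] = open_sink(name)  # open each target lazily, once
        sinks[name].write(line.rstrip("\n") + "\n")
    return sinks

def by_field_count(line):
    # Assumed schema rule for the demo: route on the number of comma-separated fields.
    return "fields_%d" % len(line.split(","))

sample = ["a,b,c\n", "x,y\n", "d,e,f\n"]
routed = route_lines(sample, by_field_count, lambda name: io.StringIO())
```

On a cluster, `open_sink` would presumably wrap a stream such as the one returned by `fs.create(Path(...))` above, so that each schema ends up in its own HDFS file.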


Writing files using spark and reading using python

Writing a file to S3 using Spark usually creates a directory with 11 files: a success marker and 10 files whose names start with "part", which hold the actual data. How can I load this data into a pandas DataFrame, given that the names of the 10 part files change on every run, so the file path is never the same?
For example, the file path at the time of writing:
The files stored in the directory are:
- _SUCCESS
- part-00-*.parquet
I have a Python job which reads the file into a pandas DataFrame:
pd.read(s3\\..........what is the path to specify here.................)
When writing files with Spark, you cannot choose the name of the output file (you can pass one, but it becomes a directory and you end up with what you described above). If you want a single file to load into pandas later, you can do something like this:
df.repartition(1).write.parquet(path="s3://testfolder/", mode='append')
The end result will be a single file in "s3://testfolder/" whose name matches part-00-*.parquet. You can simply read that file in, or rename it to something specific before reading it in with pandas.
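Since the part file name is not known in advance, it has to be picked out of a directory listing by pattern. A minimal sketch of that step, assuming the listing is already available as a list of key strings (the function name `find_part_files` and the sample keys are made up for illustration):

```python
from fnmatch import fnmatch

def find_part_files(keys, pattern="part-*.parquet"):
    """Return the keys whose final path component matches the pattern."""
    return [k for k in keys if fnmatch(k.rsplit("/", 1)[-1], pattern)]

# Hypothetical listing of the output directory after a Spark write.
listing = ["testfolder/_SUCCESS", "testfolder/part-00000-abc123.snappy.parquet"]
data_keys = find_part_files(listing)
```

The matching keys can then be read or renamed, while marker files such as the success file are skipped.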
Option 1: (Recommended)
You can use awswrangler. It's a lightweight tool that helps with the integration between pandas, S3, and Parquet. It lets you read in multiple files from the directory.
pip install awswrangler
import awswrangler as wr
df = wr.s3.read_parquet(path='s3://testfolder/')
Option 2:
############################## RETRIEVE KEYS FROM THE BUCKET ##################################
import boto3
import pandas as pd

s3 = boto3.client('s3')
s3_bucket_name = 'your bucket name'
prefix = 'path where the files are located'
response = s3.list_objects_v2(
    Bucket=s3_bucket_name,
    Prefix=prefix
)
keys = []
for obj in response['Contents']:
    # keep only the data files; this skips the success marker
    if obj['Key'].endswith('.parquet'):
        keys.append(obj['Key'])
##################################### READ IN THE FILES #######################################
frames = []
for key in keys:
    frames.append(pd.read_parquet(path='s3://' + s3_bucket_name + '/' + key, engine='pyarrow'))
# combine the per-file frames into a single DataFrame
df = pd.concat(frames, ignore_index=True)

Iterating Over List of Parsed Files Python

This program scans through a log file and finds faults and the timestamps for those faults. My problem is finding a way to modify the program so that it can iterate over multiple files given on the command line with a wildcard. As the code stands now, it accepts a single file and successfully builds the dictionary with my desired info. I have been struggling to find a way to do this with multiple files. The goal is to be able to enter a filename with a wildcard on the command line and parse all the matching files. For example, after the executable I would enter -f filename.*txt. I have been able to pick up multiple files, and proved it by printing out the list of files found, but when it comes to running them through my fault finder and building the dictionary, I am stumped. I would like the program to produce the same result as it does when parsing a single file.
import sys
import argparse

# _TIME_STAMP_LENGTH and _FAULT_STRING_HEADER_LENGTH are module-level
# constants whose definitions were not included in the post.

class FaultList():
    fault_dict = {}
    fault_dict_counter = {}

    def __init__(self, file):
        self.file = file
        print(self.fault_dict)

    def find_faults(self):
        with open(self.file) as f:
            for line in f.readlines():
                fault_index = line.find("Fault Cache id")
                if(fault_index != -1):
                    time_stamp = line[:_TIME_STAMP_LENGTH]
                    fault_data = line[fault_index+_FAULT_STRING_HEADER_LENGTH:-11][:-1] #need the [:-1] to remove new line from string
                    self.handle_new_fault_found(fault_data, time_stamp)

    def handle_new_fault_found(self, fault, time_stamp):
        try:
            self.fault_dict[fault] = [fault]
            self.fault_dict_counter[fault][0] += 1
        except KeyError:
            self.fault_dict_counter[fault] = [1, [time_stamp]]

def main(file_names):
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", dest="file_names",
                        help="The binary file to be written to flash")
    args = parser.parse_args()
    fault_finder = FaultList(args.file_names)
    fault_finder.find_faults()

if __name__ == '__main__':
    main(sys.argv[1:])
Here is the dictionary output when parsing a single file:
{'fault_01_17_00 Type:Warning': ['fault_01_17_00 Type:Warning', 37993146319], 'fault_0E_00_00 Type:Warning': ['fault_0E_00_00 Type:Warning', 38304267561], 'fault_05_01_00 Typ': ['fault_05_01_00 Typ', 38500887160]}
You can use the os module for listing files.
import os
directory = 'path of the files'
# find all regular files in the directory (join each name with the directory before testing it)
files = [name for name in os.listdir(directory) if os.path.isfile(os.path.join(directory, name))]
# filter them, keeping the files that end with 'txt'
txt_files = [name for name in files if name.endswith('txt')]
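To connect this back to the wildcard in the question, the pattern can be expanded with the standard glob module and every matching file fed into one shared dictionary. The following is a rough sketch, not the asker's parser: `collect_faults` is a hypothetical helper, the parsing is simplified to "everything after the marker", and the two demo log files are invented.

```python
import glob
import os
import tempfile

def collect_faults(pattern, marker="Fault Cache id"):
    """Scan every file matching the wildcard and merge hits into one dict."""
    faults = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                index = line.find(marker)
                if index != -1:
                    # Simplified parsing: keep the text after the marker as the key.
                    fault = line[index + len(marker):].strip()
                    faults.setdefault(fault, []).append(os.path.basename(path))
    return faults

# Demo with two throwaway log files (contents are made up for illustration).
workdir = tempfile.mkdtemp()
for name, text in [("log.1.txt", "0001 Fault Cache id fault_A\n"),
                   ("log.2.txt", "0002 Fault Cache id fault_A\n0003 ok\n")]:
    with open(os.path.join(workdir, name), "w") as f:
        f.write(text)

result = collect_faults(os.path.join(workdir, "log.*txt"))
```

If the shell expands the wildcard before the script starts, declaring the argument with nargs='+' lets argparse receive the resulting list of names, and the same loop can run over that list instead of calling glob.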

Extract particular file from zip blob stored in azure container with python using Jupyter notebook

I have uploaded a zip file to my Azure account as a blob in an Azure container.
The zip file contains .csv and .ascii files, along with many other formats.
I need to read a specific file from it, say the data in one of the .ascii files. I am using Python for this.
How can I read a particular file from this zip without downloading it locally? I would like to handle this in memory only.
I am also trying this in the Jupyter notebook that Azure provides for its ML functionality.
I am using Python's ZipFile package for this.
Please help me read the file.
Here is my code snippet:
allBlobs = []
for blob in blob_list:
    # The loop body was lost from the post; presumably it collects each blob.
    allBlobs.append(blob)
sampleZipFile = allBlobs[0]
The below code should work. This example accesses an Azure Container using an Account URL and Key combination.
from azure.storage.blob import BlobServiceClient
from io import BytesIO
from zipfile import ZipFile
key = r'my_key'
service = BlobServiceClient(account_url="my_account_url", credential=key)
container_client = service.get_container_client('container_name')
zipfilename = 'myzipfile.zip'
blob_data = container_client.download_blob(zipfilename)
blob_bytes = blob_data.content_as_bytes()
inmem = BytesIO(blob_bytes)
myzip = ZipFile(inmem)
otherfilename = 'mycontainedfile.csv'
filetoread = BytesIO(myzip.read(otherfilename))
Now all you have to do is pass filetoread into whatever method you would normally use to read a local file (e.g. pandas.read_csv()).
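The in-memory part of this approach can be tried without an Azure account. The sketch below builds a small archive in memory as a stand-in for the downloaded blob bytes and reads one member from it; `read_member` is a hypothetical helper and the file contents are invented.

```python
import csv
import io
import zipfile

def read_member(zip_bytes, member_name):
    """Return the raw bytes of one member of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(member_name)

# Build a small archive in memory to stand in for the downloaded blob.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mycontainedfile.csv", "id,name\n1,alpha\n")
    archive.writestr("notes.ascii", "ignored\n")

payload = read_member(buffer.getvalue(), "mycontainedfile.csv")
rows = list(csv.reader(io.StringIO(payload.decode("utf-8"))))
```

Nothing is written to disk at any point: the archive bytes, the extracted member, and the parsed rows all live in memory.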
You could use the code below to read a file inside a .zip archive without extracting it in Python:
import zipfile
archive = zipfile.ZipFile('images.zip', 'r')
imgdata = archive.read('img_01.png')
For details, you can refer to the ZipFile documentation.
Alternatively, you can do something like this
# -*- coding: utf-8 -*-
# Created on Mon Apr 1 11:14:56 2019
# author: moverm
import zipfile

zfile = zipfile.ZipFile(r'C:\LAB\Pyt\sample.zip')
for finfo in zfile.infolist():
    # open each member directly from the archive, without extracting it
    ifile = zfile.open(finfo)
    line_list = ifile.readlines()
Hope it helps.

How to load data from HDFS sequencefile in python

I have a map reduce program running to read the HDFS file as below:
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar -Dmapred.reduce.tasks=1000 -file $homedir/mapper.py -mapper $homedir/mapper.py -file $homedir/reducer.py -reducer $homedir/reducer.py -input /user/data/* -output /output/ 2> output.text
One thing to confirm: the path /user/data/* contains folders that in turn contain files. Will /user/data/* iterate over all files under all subfolders?
The HDFS text file contains one JSON string per line, so the mapper reads the file as below:
for line in sys.stdin:
    object = json.loads(line)
But the owner of the HDFS data changed the files from text to SequenceFile format, and I found that the MapReduce program now outputs a lot of zero-sized files, which probably means it did not read the files from HDFS successfully.
What should I change in the code so that I can read from the sequence files? I also have a Hive external table that performs aggregation and sorting on the output of this MapReduce job. The Hive table was STORED AS TEXTFILE before; should I change it to STORED AS SEQUENCEFILE?
Run the Python script below before your MapReduce job.
Input: your sequence file
Output: the input to your MapReduce job
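import sys
from hadoop.io import SequenceFile

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print('usage: SequenceFileReader <filename> <output>')
        sys.exit(1)
    reader = SequenceFile.Reader(sys.argv[1])
    key_class = reader.getKeyClass()
    value_class = reader.getValueClass()
    key = key_class()
    value = value_class()
    position = reader.getPosition()
    f = open(sys.argv[2], 'w')
    while reader.next(key, value):
        # The loop body was lost from the post. Presumably each value is
        # written out as one line of text for the streaming mapper to read.
        f.write(value.toString() + '\n')
        position = reader.getPosition()
    f.close()
    reader.close()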
You won't have to change your original Python file now.

How to get data from s3 and do some work on it? python and boto

I have a project task that uses output data I have already produced on S3 in an EMR job. Previously I ran an EMR job that produced output in one of my S3 buckets in the form of multiple files named part-xxxx. Now I need to access those files from within my new EMR job, read their contents, and use that data to produce another output.
This is the local code that does the job:
def reducer_init(self):
    self.idfs = {}
    for fname in os.listdir(DIRECTORY): # look through file names in the directory
        file = open(os.path.join(DIRECTORY, fname)) # open a file
        for line in file: # read each line in json file
            term_idf = JSONValueProtocol().read(line)[1] # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
This runs just fine locally, but the problem is that the data I need is now on S3 and I need to access it somehow in the reducer_init function.
This is what I have so far, but it fails while executing on EC2:
def reducer_init(self):
    self.idfs = {}
    b = conn.get_bucket(bucketname)
    idfparts = b.list(destination)
    for key in idfparts:
        file = open(os.path.join(idfparts, key))
        for line in file:
            term_idf = JSONValueProtocol().read(line)[1] # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
AWS access info is defined as follows:
awskey = '*********'
awssecret = '***********'
conn = S3Connection(awskey, awssecret)
bucketname = 'mybucket'
destination = '/path/to/previous/output'
There are two ways of doing this:
1. Download the file to your local system and parse it (simple, quick, and easy).
2. Get the data stored on S3 into memory and parse it (a bit more complex in the case of huge files).
Step 1:
On S3, file names are stored as keys. If you have a file named "Demo" stored in a folder named "DemoFolder", then the key for that particular file would be "DemoFolder/Demo".
Use the below code to download the file into a temp folder.
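from boto.s3 import connect_to_region
from boto.s3.connection import Location

AWS_KEY = 'xxxxxxxxxxxxxxxxxx'
AWS_SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
BUCKET_NAME = 'DemoBucket'
fileName = 'Demo'
tempPath = '/tmp/' + fileName  # assumed download location; not in the original post

conn = connect_to_region(Location.USWest2, aws_access_key_id=AWS_KEY,
                         aws_secret_access_key=AWS_SECRET_KEY)
source_bucket = conn.lookup(BUCKET_NAME)

''' Download the file '''
for name in source_bucket.list():
    if name.name in fileName:
        # The loop body was lost from the post; saving the key to the
        # temp path matches the description that follows.
        name.get_contents_to_filename(tempPath)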
You can then work on the file in that temp path.
Step 2:
You can also fetch the data as a string using data = name.get_contents_as_string(). With huge files (> 1 GB) you may run into memory errors. To avoid them, you will have to write a lazy function that reads the data in chunks.
For example, you can use a Range header to fetch part of a file: data = name.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (0,100000000)}).
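The chunked reading described above mostly comes down to producing consecutive Range values. Here is a small sketch of that part only; `byte_ranges` is a hypothetical helper, and the sizes in the demo are arbitrary. Note that both ends of an HTTP byte range are inclusive.

```python
def byte_ranges(total_size, chunk_size):
    """Yield HTTP Range header values covering total_size bytes in order."""
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1  # Range ends are inclusive
        yield "bytes=%d-%d" % (start, end)
        start = end + 1

ranges = list(byte_ranges(total_size=250, chunk_size=100))
```

Each value could then be passed as headers={'Range': value} to get_contents_as_string, one chunk at a time, with total_size presumably taken from the size reported for the key in the bucket listing.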
I am not sure whether I answered your question properly. I can write custom code for your requirement once I get some time. Meanwhile, please feel free to post any questions you have.

