How to read docx/pdf file from HDFS using pyspark?

I want to read a DOCX/PDF file from the Hadoop file system using PySpark. Currently I am using the pandas API, but pandas is limited here: it can read formats such as CSV, JSON, XLSX and HDF5, and it does not support DOCX or PDF.
Currently my code is:
import pandas as pd
from pyspark import SparkContext, SparkConf
from hdfs import InsecureClient
conf = SparkConf().setAppName("Random")
sc = SparkContext(conf = conf)
client_hdfs = InsecureClient('http://192.00.00.30:50070')
with client_hdfs.read('/user/user.name/sample.csv', encoding = 'utf-8') as reader:
    df = pd.read_csv(reader,index_col=0)
print(df)
I am able to read CSV using the above code. Are there any other APIs that can solve this problem for DOC/PDF?
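
One possible direction, sketched under assumptions rather than taken from the question: a DOCX file is a ZIP archive whose main text lives in word/document.xml, so once the raw bytes are available (for example from client_hdfs.read(...) called without an encoding, or from PySpark's sc.binaryFiles, which yields (path, bytes) pairs), the text can be extracted with the standard library alone. The names docx_bytes_to_text and make_sample_docx are hypothetical, and the sample archive is a stripped-down stand-in for a real DOCX file, built only so the sketch runs on its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML elements inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_text(data):
    """Extract paragraph text from the raw bytes of a DOCX file."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)


def make_sample_docx(paragraph_texts):
    """Build a minimal in-memory stand-in for a DOCX file (demo only)."""
    ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    body = "".join(
        "<w:p><w:r><w:t>{}</w:t></w:r></w:p>".format(text)
        for text in paragraph_texts
    )
    xml = '<w:document xmlns:w="{}"><w:body>{}</w:body></w:document>'.format(ns, body)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


sample = make_sample_docx(["First paragraph", "Second paragraph"])
print(docx_bytes_to_text(sample))
```

With PySpark, the same function could be applied as sc.binaryFiles('hdfs:///user/user.name/*.docx').mapValues(docx_bytes_to_text), where the path is a placeholder. PDF has no comparable standard-library route, so a third-party parser such as pypdf would be needed for that format.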

Related

Pyarrow append and update modes

With the code below, I can write a file in Parquet format from disk to HDFS. But when I run the code again, it overwrites the target. I want it to append or update instead. How can I do that? I would be glad of any help.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs
file = "source_path"
target = "target_path"
hdfs = pa.fs.HadoopFileSystem("hdfs://okay")
df = pd.read_csv(file)
table = pa.Table.from_pandas(df)
pq.write_table(table, target, filesystem=hdfs)

store parquet files (in aws s3) into a spark dataframe using pyspark

I'm trying to read data from a specific folder in my s3 bucket. This data is in parquet format. To do that I'm using awswrangler:
import awswrangler as wr
# read data
data = wr.s3.read_parquet("s3://bucket-name/folder/with/parquet/files/", dataset = True)
This returns a pandas dataframe:
client_id   center  client_lat  client_lng  inserted_at  matrix_updated
0700292081  BFDR    -23.6077    -46.6617    2021-04-19   2021-04-19
7100067781  BFDR    -23.6077    -46.6617    2021-04-19   2021-04-19
7100067787  BFDR    -23.6077    -46.6617    2021-04-19   2021-04-19
However, instead of a pandas dataframe I would like to store the data retrieved from my S3 bucket in a Spark dataframe. I've tried the approach from another question of mine, but it does not seem to work correctly.
I was wondering if there is any way to store this data in a Spark dataframe using awswrangler. If you have an alternative, I would like to read about it.

I didn't use awswrangler. Instead I used the following code, which I found on GitHub:
myAccessKey = 'your key'
mySecretKey = 'your key'
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
import pyspark
sc = pyspark.SparkContext("local[*]")
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
df = sqlContext.read.parquet("s3://bucket-name/path/")

Loading pcap file in spark using python

I am new to Spark. I have a pcap file. How can Spark read that file using Python? How can I load a pcap file into Spark using Python, and how can Spark process it?
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
conf = SparkConf().setMaster("local").setAppName("SparkStreamingPcap")
sc = SparkContext(conf = conf)
sc.setLogLevel("ERROR")
spark = SparkSession(sc)
FileLog = sc.textFile("pcapFiles/ipv4frags.pcap")
df = FileLog.map(lambda line: line.split("\n"))
print("Helloo")
print(df.count())

You could also try using dpkt or scapy to parse pcap files in pyspark code.
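
Note that sc.textFile splits its input on newlines, which does not match pcap's binary layout. As an illustration of what a parser such as dpkt or scapy does at the lowest level, the following standard-library sketch walks the classic pcap structure: a 24-byte global header followed by 16-byte record headers, each with its packet bytes. It assumes the classic pcap format with microsecond timestamps, not pcapng. The name iter_pcap_packets and the synthetic capture are hypothetical, made up for the demo.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic pcap, microsecond timestamps


def iter_pcap_packets(data):
    """Yield (ts_sec, ts_usec, packet_bytes) from classic pcap file bytes."""
    if len(data) < 24:
        raise ValueError("too short to be a pcap file")
    # The magic number tells us which byte order the writer used
    if struct.unpack("<I", data[:4])[0] == PCAP_MAGIC:
        endian = "<"
    elif struct.unpack(">I", data[:4])[0] == PCAP_MAGIC:
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16]
        )
        offset += 16
        yield ts_sec, ts_usec, data[offset:offset + incl_len]
        offset += incl_len


# Synthetic two-packet capture built for the demo (not real traffic)
global_header = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
record_1 = struct.pack("<IIII", 100, 5, 3, 3) + b"abc"
record_2 = struct.pack("<IIII", 101, 7, 2, 2) + b"\n\x00"
capture = global_header + record_1 + record_2

for ts_sec, ts_usec, payload in iter_pcap_packets(capture):
    print(ts_sec, ts_usec, len(payload))
```

In Spark, such a function could be applied with sc.binaryFiles("pcapFiles/*.pcap").flatMapValues(lambda data: list(iter_pcap_packets(data))). Each file is then read whole by a single task, so this suits many moderately sized captures better than one very large one.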

Python on Hadoop read blocks

I have the following problem. I want to extract data from hdfs (a table called 'complaint'). I wrote the following script which actually works:
import pandas as pd
from hdfs import InsecureClient
import os
file = open ("test.txt", "wb")
print ("Step 1")
client_hdfs = InsecureClient ('http://XYZ')
N = 10
print ("Step 2")
with client_hdfs.read('/user/.../complaint/000000_0') as reader:
    print('new line')
    features = reader.read(1000000)
    file.write(features)
print('end')
file.close()
My problem now is that the folder "complaint" contains 4 files (I don't know which file type), and the read operation gives me back bytes which I can't use further (I saved them to a text file as a test).
My question now is:
Is it possible to get the data separated by column in a sensible way?
I only found solutions for .csv files and the like, and I am somehow stuck here... :-)
EDIT
I made changes to my solution and tried different approaches, but none of them really works. Here's the updated code:
import pandas as pd
from hdfs import InsecureClient
import os
import pypyodbc
import pyspark
from pyspark import SparkConf, SparkContext
from hdfs3 import HDFileSystem
import pyarrow.parquet as pq
import pyarrow as pa
from pyhive import hive
#Step 0: Configurations
#Connections with InsecureClient (this basically works)
#Notes: TMS1 doesn't work because of txt files
#insec_client_tms1 = InsecureClient ('http://some-adress:50070')
insec_client_tms2 = InsecureClient ('http://some-adress:50070')
#Connection with Spark (not working at the moment)
#Error: Java gateway process exited before sending its port number
#conf = SparkConf().setAppName('TMS').setMaster('spark://adress-of-node:7077')
#sc = SparkContext(conf=conf)
#Connection via PyArrow (not working)
#Error: File not found
#fs = pa.hdfs.connect(host='hdfs://node-adress', port =8020)
#print("FS: " + fs)
#Connection via HDFS3 (not working)
#Error: The module couldn't be loaded
#client_hdfs = HDFileSystem(host='hdfs://node-adress', port=8020)
#Connection via Hive (not working)
#no module named sasl -> I tried to install it, but it also fails
#conn = hive.Connection(host='hdfs://node-adress', port=8020, database='deltatest')
#Step 1: Extractions
print ("starting Extraction")
#Create file
file = open ("extraction.txt", "w")
#Extraction with Spark
#text = sc.textFile('/user/hive/warehouse/XYZ.db/baseorder_flags/000000_0')
#first_line = text.first()
#print (first_line)
#extraction with hive
#df = pd.read_sql ('select * from baseorder',conn)
#print ("DF: "+ df)
#extraction with hdfs3
#with client_hdfs.open('/home/deltatest/basedeviation/000000_0') as f:
# df = pd.read_parquet(f)
#Extraction with Webclient (not working)
#Error: Arrow error: IOError: seek -> fastparquet has a similar error
with insec_client_tms2.read('/home/deltatest/basedeviation/000000_0') as reader:
    features = pd.read_parquet(reader)
    print (features)
    #features = reader.read()
    #data = features.decode('utf-8', 'replace')
print("saving data to file")
file.write(data)
print('end')
file.close()
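
A possible explanation for the "IOError: seek" message, offered as an assumption: Parquet readers need random access because the file metadata sits at the end of the file, while the reader returned by InsecureClient.read is a forward-only HTTP stream. One workaround is to copy the stream into an in-memory io.BytesIO first. The sketch below shows only that buffering step, using a stand-in stream. ForwardOnlyStream and to_seekable are hypothetical names, and whether the 000000_0 files are actually Parquet is not established by the question.

```python
import io


class ForwardOnlyStream(io.RawIOBase):
    """Stand-in for an HTTP response body: readable but not seekable."""

    def __init__(self, payload):
        super().__init__()
        self._inner = io.BytesIO(payload)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        chunk = self._inner.read(len(buffer))
        buffer[:len(chunk)] = chunk
        return len(chunk)


def to_seekable(reader, chunk_size=1 << 20):
    """Copy a forward-only stream into a seekable in-memory buffer."""
    buffer = io.BytesIO()
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        buffer.write(chunk)
    buffer.seek(0)
    return buffer


stream = ForwardOnlyStream(b"header...body...footer")
seekable = to_seekable(stream)
seekable.seek(-6, io.SEEK_END)  # jump to the end, as a Parquet reader would
print(seekable.read())
```

In the question's code this would amount to features = pd.read_parquet(to_seekable(reader)) inside the with block, at the cost of holding the whole file in memory.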

Python: Read several json files from a folder

I would like to know how to read several json files from a single folder (without specifying the files names, just that they are json files).
Also, is it possible to turn them into a pandas DataFrame?
Can you give me a basic example?

One option is listing all files in a directory with os.listdir and then finding only those that end in '.json':
import os, json
import pandas as pd
path_to_json = 'somedir/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
print(json_files) # for me this prints ['foo.json']
Now you can use pandas DataFrame.from_dict to read in the json (a python dictionary at this point) to a pandas dataframe:
montreal_json = pd.DataFrame.from_dict(many_jsons[0])
print(montreal_json['features'][0]['geometry'])
Prints:
{'type': 'Point', 'coordinates': [-73.6051013, 45.5115944]}
In this case I had appended some jsons to a list many_jsons. The first json in my list is actually a geojson with some geo data on Montreal. I'm familiar with the content already so I print out the 'geometry' which gives me the lon/lat of Montreal.
The following code sums up everything above:
import os, json
import pandas as pd
# this finds our json files
path_to_json = 'json/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
# here I define my pandas Dataframe with the columns I want to get from the json
jsons_data = pd.DataFrame(columns=['country', 'city', 'long/lat'])
# we need both the json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
    # here you need to know the layout of your json and each json has to have
    # the same structure (obviously not the structure I have here)
    country = json_text['features'][0]['properties']['country']
    city = json_text['features'][0]['properties']['name']
    lonlat = json_text['features'][0]['geometry']['coordinates']
    # here I push a list of data into a pandas DataFrame at row given by 'index'
    jsons_data.loc[index] = [country, city, lonlat]
# now that we have the pertinent json data in our DataFrame let's look at it
print(jsons_data)
For me this prints:
  country           city                   long/lat
0  Canada  Montreal city  [-73.6051013, 45.5115944]
1  Canada        Toronto  [-79.3849008, 43.6529206]
It may be helpful to know that for this code I had two geojsons in a directory named 'json'. Each json had the following structure:
{"features":
[{"properties":
{"osm_key":"boundary","extent":
[-73.9729016,45.7047897,-73.4734865,45.4100756],
"name":"Montreal city","state":"Quebec","osm_id":1634158,
"osm_type":"R","osm_value":"administrative","country":"Canada"},
"type":"Feature","geometry":
{"type":"Point","coordinates":
[-73.6051013,45.5115944]}}],
"type":"FeatureCollection"}

Iterating a (flat) directory is easy with the glob module:
from glob import glob
for f_name in glob('foo/*.json'):
    ...
As for reading JSON directly into pandas, see pandas.read_json.

This loads every file ending in .json from a specific directory into a dict:
import os,json
path_to_json = '/lala/'
for file_name in [file for file in os.listdir(path_to_json) if file.endswith('.json')]:
    with open(path_to_json + file_name) as json_file:
        data = json.load(json_file)
        print(data)
Try it yourself:
https://repl.it/#SmaMa/loadjsonfilesfromfolderintodict

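To read the json files:
import os
import glob
import json
contents = []
json_dir_name = '/path/to/json/dir'
json_pattern = os.path.join(json_dir_name, '*.json')
file_list = glob.glob(json_pattern)
for file in file_list:
    # json.load is assumed here; the original called an undefined read()
    with open(file) as json_file:
        contents.append(json.load(json_file))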

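If you want a pandas dataframe, use the pandas API.
More generally, you can use a generator:
import glob
import json
def data_generator(my_path_regex):
    for filename in glob.glob(my_path_regex):
        with open(filename, 'r') as json_file:
            # each line of the file is expected to hold one JSON document
            for json_line in json_file:
                yield json.loads(json_line)
my_path_regex = '/path/to/json/dir/*.json'  # placeholder glob pattern, despite the name
my_arr = list(data_generator(my_path_regex))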

I am using glob with pandas. Check out the code below:
import pandas as pd
from glob import glob
df = pd.concat([pd.read_json(f_name, lines=True) for f_name in glob('foo/*.json')])

A simple and very easy-to-understand answer.
import os
import glob
import pandas as pd
path_to_json = r'\path\here'
# import all files from folder which ends with .json
json_files = glob.glob(os.path.join(path_to_json, '*.json'))
# convert all files to dataframe
df = pd.concat((pd.read_json(f) for f in json_files))
print(df.head())

I feel a solution using pathlib is missing :)
from pathlib import Path
file_list = list(Path("/path/to/json/dir").glob("*.json"))
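Building on the pathlib listing, here is a short sketch of loading each file. It assumes every file holds a single JSON document, and it uses a temporary directory with made-up files so that it runs on its own. load_json_dir is a hypothetical helper name.

```python
import json
import tempfile
from pathlib import Path


def load_json_dir(directory):
    """Return {file name: parsed JSON} for every *.json file in directory."""
    result = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            result[path.name] = json.load(handle)
    return result


# Demo data written to a temporary directory (contents are made up)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.json").write_text(json.dumps({"city": "Montreal"}), encoding="utf-8")
    Path(tmp, "b.json").write_text(json.dumps({"city": "Toronto"}), encoding="utf-8")
    Path(tmp, "notes.txt").write_text("ignored", encoding="utf-8")
    loaded = load_json_dir(tmp)

print(loaded)
```

If a DataFrame is wanted, the parsed values could then be passed to pandas, for example pd.DataFrame(list(loaded.values())).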

One more option is to read the files as a PySpark DataFrame and then convert it to a pandas DataFrame (only if really necessary; depending on the operation, I'd suggest keeping it as a PySpark DataFrame). Spark natively accepts a directory of JSON files as the input path, without the need for extra libraries to read or iterate over each file:
# pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.json('/some_dir_with_json/*.json')
Next, in order to convert into a Pandas Dataframe, you can do:
df = spark_df.toPandas()
