Python on Hadoop read blocks

Python on Hadoop read blocks - python

I have the following problem. I want to extract data from hdfs (a table called 'complaint'). I wrote the following script which actually works:
import pandas as pd
from hdfs import InsecureClient
import os
file = open ("test.txt", "wb")
print ("Step 1")
client_hdfs = InsecureClient ('http://XYZ')
N = 10
print ("Step 2")
with client_hdfs.read('/user/.../complaint/000000_0') as reader:
print('new line')
features = reader.read(1000000)
file.write(features)
print('end')
file.close()
My problem now is that the folder "complaint" contains 4 files ( i don't know which file type) and the read operation gives me back bytes which I can't use further (I saved it to a textfile as a test and it looks like that:
In HDFS it looks like this:
My question now is:
Is it possible to get the data separated for each column in a senseful way?
I only found solutions with .csv files and like that and somehow stuck here... :-)
EDIT
I made changes to my solution and tried different approaches but none of them is going to work really. Here's the updated code:
import pandas as pd
from hdfs import InsecureClient
import os
import pypyodbc
import pyspark
from pyspark import SparkConf, SparkContext
from hdfs3 import HDFileSystem
import pyarrow.parquet as pq
import pyarrow as pa
from pyhive import hive
#Step 0: Configurations
#Connections with InsecureClient (this basically works)
#Notes: TMS1 doesn't work because of txt files
#insec_client_tms1 = InsecureClient ('http://some-adress:50070')
insec_client_tms2 = InsecureClient ('http://some-adress:50070')
#Connection with Spark (not working at the moment)
#Error: Java gateway process exited before sending its port number
#conf = SparkConf().setAppName('TMS').setMaster('spark://adress-of-node:7077')
#sc = SparkContext(conf=conf)
#Connection via PyArrow (not working)
#Error: File not found
#fs = pa.hdfs.connect(host='hdfs://node-adress', port =8020)
#print("FS: " + fs)
#connection via HDFS3 (not working)
#The module couldn't be load
#client_hdfs = HDFileSystem(host='hdfs://node-adress', port=8020)
#Connection via Hive (not working)
#no module named sasl -> I tried to install it, but it also fails
#conn = hive.Connection(host='hdfs://node-adress', port=8020, database='deltatest')
#Step 1: Extractions
print ("starting Extraction")
#Create file
file = open ("extraction.txt", "w")
#Extraction with Spark
#text = sc.textFile('/user/hive/warehouse/XYZ.db/baseorder_flags/000000_0')
#first_line = text.first()
#print (first_line)
#extraction with hive
#df = pd.read_sql ('select * from baseorder',conn)
#print ("DF: "+ df)
#extraction with hdfs3
#with client_hdfs.open('/home/deltatest/basedeviation/000000_0') as f:
# df = pd.read_parquet(f)
#Extraction with Webclient (not working)
#Error: Arrow error: IOError: seek -> fastparquet has a similar error
with insec_client_tms2.read('/home/deltatest/basedeviation/000000_0') as reader:
features = pd.read_parquet(reader)
print (features)
#features = reader.read()
#data = features.decode('utf-8', 'replace')
print("saving data to file")
file.write(data)
print('end')
file.close()

Related

How to read a csvfile on FTP that is compressed on a zip/folder

I'm trying to :
read a .csv file (compressed in a zipfile that is stored on FTP) by using ftplib
store the .csv file on a virtual file on memory by using io
transform the virutal file to a dataframe by using pandas
For that I'm using the code below and it works really fine for the first scenario (path1, see image above) :
CODE :
import ftplib
import zipfile
import io
import pandas as pd
ftp = ftplib.FTP("theserver_name")
ftp.login("my_username","my_password")
ftp.encoding = "utf-8"
ftp.cwd('folder1/folder2')
filename = 'zipFile1.zip'
download_file = io.BytesIO()
ftp.retrbinary("RETR " + filename, download_file.write)
download_file.seek(0)
zfile = zipfile.ZipFile(download_file)
df = pd.read_csv(zfile.namelist()[0], delimiter=';')
display(df)
But in the second scenario (path2) and after changing my code, I get the error below :
CODE UPDATE :
ftp.cwd('folder1/folder2/')
filename = 'zipFile2.zip'
ERROR AFTER UPDATE :
FileNotFoundError: [Errno 2] No such file or directory:
'folder3/csvFile2.csv'
It seems like Python don't recognize the folder3 (contained in the zipFile2). Is there any explanation for that, please ? How can we fix that ? I tried with ftp.cwd('folder3') right before pd.read.csv() but it doesn't work..

Thanks to Serge Ballesta in his post here, I finally figure out how to transform csvFile2.csv to a DataFrame :
import ftplib
import zipfile
import io
import pandas as pd
ftp = ftplib.FTP("theserver_name")
ftp.login("my_username","my_password")
ftp.encoding = "utf-8"
flo = io.BytesIO()
ftp.retrbinary('RETR /folder1/folder2/zipFile2.zip', flo.write)
flo.seek(0)
with zipfile.ZipFile(flo) as archive:
with archive.open('folder3/csvFile2.csv') as fd:
df = pd.read_csv(fd, delimiter=';')
display(df)

How to perform a CName Lookup on a csv file

I am in the processes of automating the CName lookup process via python and would like some help / thought on my current draft.
The goal is for the script to take in each site under the column 'site' and provide the cname of the site in another column named 'CName'
Here is what I have now:
# pip install pandas
from tkinter.dnd import dnd_start
import pandas as pd
# pip install dnspython
from dns import resolver, reversename
# pip install xlrd, pip install xlsxwriter, pip install socket
from pandas.io.excel import ExcelWriter
import time
import socket
import dns.resolver
startTime = time.time()
# Import excel called logs.xlsx as dataframe
# if CSV change to pd.read_csv('logs.csv', error_bad_lines=False)
logs = pd.read_csv(path to file)
# Create DF with dupliate sites filtered for check
logs_filtered = logs.drop_duplicates(['site']).copy()
def cNameLookup(site):
name = str(site).strip()
try:
cname = socket.AddressInfo(site)[0]
for val in cname:
print('CNAME Record : ', val.target)
except:
return 'N/A'
# Create CName column with the CName Lookup result
logs_filtered['cname'] = logs_filtered['site'].apply(cNameLookup)
# Merge DNS column to full logs matching IP
logs_filtered = logs.merge(logs_filtered[['site', 'cname']], how='left', on=['site'])
# output as Excel
writer = ExcelWriter('validated_logs.xlsx', engine='xlsxwriter', options={
'strings_to_urls': False})
logs_filtered.to_excel(writer, index=False)
writer.save()
print('File Succesfully written as validated_logs.xlsx')
print('The script took {0} second !'.format(time.time() - startTime))
As of now, when I run the script, all I get for the CName colum is 'N/A' all the way down, it seems as though the cname lookup portion of the code is not working as intended.
Thank you in advance for any help / suggestions!

Difficulty trying to export query results as a CSV, uploaded to SharePoint (PySpark)

I am trying to run a query, with the result saved as a CSV that is uploaded to a SharePoint folder. This is within Databricks via Pyspark.
My code below is close to doing this, but the final line is not functioning correctly - the file generated in SharePoint does not contain any data, though the dataframe does.
I'm new to Python and Databricks, if anyone can provide some guidance on how to correct that final line I'd really appreciate it!
from shareplum import Site
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
df = spark.sql(Query)
pandasdf = df.toPandas()
folder.upload_file(pandasdf.to_csv(FileName, encoding = 'utf-8'), FileName)

Sure my code is still garbage, but it does work. I needed to convert the dataframe into a variable containing CSV formatted data prior to uploading it to SharePoint; effectively I was trying to skip a step before. Last two lines were updated:
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
df = (spark.sql(QueryAllocation)).toPandas().to_csv(header=True, index=False, encoding='utf-8')
folder.upload_file(df, FileName)

How to generalise the import script

I have a query to generate a CSV file from the data in a Postgres Table.The script is working fine.
But i have a situation where i need to create separate files using the data from a different table.
So basically only the below hardcoded one change and rest code is same.Now the situation is i have to create separate scripts for all CSV's.
Is there a way i can have one script and only change this parameters.
I'm using Jenkins to automate the CSV file creation.
filePath = '/home/jenkins/data/'
fileName = 'data.csv'
import csv
import os
import psycopg2
from pprint import pprint
from datetime import datetime
from utils.config import Configuration as Config
from utils.postgres_helper import get_connection
from utils.utils import get_global_config
# File path and name.
filePath = '/home/jenkins/data/'
fileName = 'data.csv'
# Database connection variable.
connect = None
# Check if the file path exists.
if os.path.exists(filePath):
try:
# Connect to database.
connect = get_connection(get_global_config(), 'dwh')
except psycopg2.DatabaseError as e:
# Confirm unsuccessful connection and stop program execution.
print("Database connection unsuccessful.")
quit()
# Cursor to execute query.
cursor = connect.cursor()
# SQL to select data from the google feed table.
sqlSelect = "SELECT * FROM data"
try:
# Execute query.
cursor.execute(sqlSelect)
# Fetch the data returned.
results = cursor.fetchall()
# Extract the table headers.
headers = [i[0] for i in cursor.description]
# Open CSV file for writing.
csvFile = csv.writer(open(filePath + fileName, 'w', newline=''),
delimiter=',', lineterminator='\r\n',
quoting=csv.QUOTE_ALL, escapechar='\\')
# Add the headers and data to the CSV file.
csvFile.writerow(headers)
csvFile.writerows(results)
# Message stating export successful.
print("Data export successful.")
print('CSV Path : '+ filePath+fileName)
except psycopg2.DatabaseError as e:
# Message stating export unsuccessful.
print("Data export unsuccessful.")
quit()
finally:
# Close database connection.
connect.close()
else:
# Message stating file path does not exist.
print("File path does not exist.")

How to read docx/pdf file from HDFS using pyspark?

I want to read DOCX/PDF file from Hadoop file system using pyspark, Currently I am using pandas API. But in pandas we have some limitation we can read only CSV, JSON, XLSX & HDF5. Its not support any other format.
Currently my code is :
import pandas as pd
from pyspark import SparkContext, SparkConf
from hdfs import InsecureClient
conf = SparkConf().setAppName("Random")
sc = SparkContext(conf = conf)
client_hdfs = InsecureClient('http://192.00.00.30:50070')
with client_hdfs.read('/user/user.name/sample.csv', encoding = 'utf-8') as reader:
df = pd.read_csv(reader,index_col=0)
print df
I am able to read CSV using above code, any other API's which can solve this problem for DOC/PDF?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python on Hadoop read blocks - python

Related

How to read a csvfile on FTP that is compressed on a zip/folder

How to perform a CName Lookup on a csv file

Difficulty trying to export query results as a CSV, uploaded to SharePoint (PySpark)

How to generalise the import script

How to read docx/pdf file from HDFS using pyspark?

Categories

Resources