I am new to Spark. I have a pcap file. How can Spark read that file using Python? How can I load a pcap file into Spark using Python, and how can it be processed?
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().setMaster("local").setAppName("SparkStreamingPcap")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
spark = SparkSession(sc)

# Attempt: read the pcap as a text file and split on newlines
FileLog = sc.textFile("pcapFiles/ipv4frags.pcap")
df = FileLog.map(lambda line: line.split("\n"))
print("Helloo")
print(df.count())
You could also try using dpkt or scapy to parse the pcap file inside your PySpark code, for example as sketched below.
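A minimal, untested sketch of that idea: pcap is a binary format, so sc.textFile will not split it into meaningful records. One hedged approach is to read the whole file with binaryFiles and parse the bytes on the executors with dpkt (assumed to be installed on the workers); the per-packet fields extracted here are just placeholders.

import io
import dpkt
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("SparkStreamingPcap")
sc = SparkContext(conf=conf)

def parse_pcap(path_and_bytes):
    path, raw = path_and_bytes
    # dpkt iterates over (timestamp, raw packet bytes) pairs
    for ts, buf in dpkt.pcap.Reader(io.BytesIO(raw)):
        yield (ts, len(buf))

packets = sc.binaryFiles("pcapFiles/ipv4frags.pcap").flatMap(parse_pcap)
print(packets.count())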
I have a set of custom log files that I need to parse. I am currently working in Azure Databricks, but I am quite new to using PySpark. The log files are hosted in an Azure Blob Storage account, which is mounted to our Azure Databricks instance.
Log file example for the input:
Value_x: 1
Value_y: "Station"
col1;col2;col3;col4;
A1;B1;C1;D1;
A2;B2;C2;D2;
A3;B3;C3;D3;
The desired output is a list of strings, but I can also work with a list of lists.
['A1;B1;C1;D1;1;station',
'A2;B2;C2;D2;1;station',
'A3;B3;C3;D3;1;station']
The snippet of code that applies these transformations:
def custom_parser(file, content):
    # Strip quotes and carriage returns, then keep the non-empty lines
    content_ = content.replace('"', '').replace('\r', '').split('\n')
    content_ = [line for line in content_ if len(line) > 0]
    # Value_x is on the first line, Value_y on the second
    x = content_[0].split('Value_x:')[-1].strip()
    y = content_[1].split('Value_y:')[-1].strip()
    # Skip the two header values and the column header row
    content_ = content_[3:]
    content_ = [line + ';'.join([x, y]) for line in content_]
    return content_
from pyspark import SparkConf
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate(SparkConf())
files = sc.wholeTextFiles('spam/eggs/').collect()
parsed_content = []
for file, content in files:
    parsed_content += custom_parser(file, content)
I have developed a custom_parser function to handle the content of these log files. But I am left with some questions:
Can I apply this custom_parser action directly to the Spark RDD returned by sc.wholeTextFiles so I can use the parallelization features of Spark?
Is parsing the data in such an ad-hoc way the most performant approach?
You cannot apply your custom_parser action directly to sc.wholeTextFiles, but you can use custom_parser as a map function. After reading your files you get an RDD[(String, String)] of (path, content) pairs; apply the parser with rdd.map(custom_parser) (or rdd.flatMap, since each file expands to several lines) and then write the result wherever you need. That way the work runs in parallel on the executors, instead of entirely on the driver as in your current collect-based approach; a sketch follows below.
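A minimal sketch of that suggestion, assuming the same custom_parser function and the placeholder path 'spam/eggs/' from the question; the output path is also a placeholder:

from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(SparkConf())

files_rdd = sc.wholeTextFiles('spam/eggs/')                 # RDD of (path, content) pairs
# flatMap because custom_parser returns a list of lines per file
parsed_rdd = files_rdd.flatMap(lambda kv: custom_parser(kv[0], kv[1]))
parsed_rdd.saveAsTextFile('spam/eggs_parsed/')              # placeholder output location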
I am working on a project that involves HDFS, and I want to store Arduino data into Hadoop HDFS every 3 seconds in a CSV file.
CSV file example:
'temp1','datetime1','location1'
'temp2','datetime2','location2'
'temp3','datetime3','location3'
Every 3 seconds I want to append a row to this CSV file.
I have already written a Python script that reads from the Arduino's serial port and writes into a NoSQL database, and I tried to do the same here, but I ran into problems with the HDFS path.
# Creating a simple Pandas DataFrame
liste_temp = [temp_string, datetime.datetime.now(), temperature_location]
df = pd.DataFrame(data={'temp': liste_temp})

# Writing Dataframe to hdfs
with client_hdfs.write('/test/temp.csv', encoding='utf-8') as writer:
    df.to_csv(writer)
Error:
File "templog.py", line 33, in <module> with client_hdfs.write('/test/temp.csv', encoding = 'utf-8') as writer: File "C:\Users\nouhl\AppData\Local\Programs\Python\Python37-32\lib\site-packages\hdfs\client.py", line 460, in write raise
InvalidSchema("No connection adapters were found for '%s'" % url) requests.exceptions.InvalidSchema: No connection adapters were found for 'hdfs://localhost:9870/webhdfs/v1/test/temp.csv
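For reference, a hedged sketch of what usually causes this error: the hdfs library's InsecureClient talks to WebHDFS over HTTP, so it expects an http:// URL rather than an hdfs:// one. The host and port below are taken from the error message and are otherwise assumptions, as are the sample sensor values.

import datetime
import pandas as pd
from hdfs import InsecureClient

# WebHDFS endpoint, not an hdfs:// URI (port 9870 as in the error message)
client_hdfs = InsecureClient('http://localhost:9870')

liste_temp = ['21.5', datetime.datetime.now(), 'room1']   # placeholder sensor values
df = pd.DataFrame(data={'temp': liste_temp})

# append=True would be the natural choice for adding a row every 3 seconds,
# but it requires the file to exist already and HDFS append to be enabled
with client_hdfs.write('/test/temp.csv', encoding='utf-8', overwrite=True) as writer:
    df.to_csv(writer)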
I have the following problem: I want to extract data from HDFS (a table called 'complaint'). I wrote the following script, which actually works:
import pandas as pd
from hdfs import InsecureClient
import os

file = open("test.txt", "wb")
print("Step 1")
client_hdfs = InsecureClient('http://XYZ')
N = 10
print("Step 2")
with client_hdfs.read('/user/.../complaint/000000_0') as reader:
    print('new line')
    features = reader.read(1000000)
    file.write(features)
print('end')
file.close()
My problem now is that the folder "complaint" contains 4 files (I don't know which file type), and the read operation gives me back bytes that I can't use further; I saved them to a text file as a test and the content is not readable, and the files look the same when I browse them in HDFS.
My question now is:
Is it possible to get the data separated for each column in a sensible way?
I have only found solutions for .csv files and similar formats, and I am somewhat stuck here... :-)
EDIT
I made changes to my solution and tried different approaches, but none of them really works. Here's the updated code:
import pandas as pd
from hdfs import InsecureClient
import os
import pypyodbc
import pyspark
from pyspark import SparkConf, SparkContext
from hdfs3 import HDFileSystem
import pyarrow.parquet as pq
import pyarrow as pa
from pyhive import hive
#Step 0: Configurations
#Connections with InsecureClient (this basically works)
#Notes: TMS1 doesn't work because of txt files
#insec_client_tms1 = InsecureClient ('http://some-adress:50070')
insec_client_tms2 = InsecureClient ('http://some-adress:50070')
#Connection with Spark (not working at the moment)
#Error: Java gateway process exited before sending its port number
#conf = SparkConf().setAppName('TMS').setMaster('spark://adress-of-node:7077')
#sc = SparkContext(conf=conf)
#Connection via PyArrow (not working)
#Error: File not found
#fs = pa.hdfs.connect(host='hdfs://node-adress', port =8020)
#print("FS: " + fs)
#connection via HDFS3 (not working)
#The module couldn't be load
#client_hdfs = HDFileSystem(host='hdfs://node-adress', port=8020)
#Connection via Hive (not working)
#no module named sasl -> I tried to install it, but it also fails
#conn = hive.Connection(host='hdfs://node-adress', port=8020, database='deltatest')
#Step 1: Extractions
print ("starting Extraction")
#Create file
file = open ("extraction.txt", "w")
#Extraction with Spark
#text = sc.textFile('/user/hive/warehouse/XYZ.db/baseorder_flags/000000_0')
#first_line = text.first()
#print (first_line)
#extraction with hive
#df = pd.read_sql ('select * from baseorder',conn)
#print ("DF: "+ df)
#extraction with hdfs3
#with client_hdfs.open('/home/deltatest/basedeviation/000000_0') as f:
# df = pd.read_parquet(f)
#Extraction with Webclient (not working)
#Error: Arrow error: IOError: seek -> fastparquet has a similar error
with insec_client_tms2.read('/home/deltatest/basedeviation/000000_0') as reader:
    features = pd.read_parquet(reader)
    print(features)
    #features = reader.read()
    #data = features.decode('utf-8', 'replace')

print("saving data to file")
file.write(data)
print('end')
file.close()
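Since the files under /user/hive/warehouse belong to Hive tables, one hedged alternative to pulling raw bytes over WebHDFS is to let Spark read the table itself, so the columns come back already separated. This is a sketch only: the database, table, and cluster addresses are taken from the commented-out attempts above, and enableHiveSupport requires Spark to be configured against your Hive metastore.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("TMS")
         .enableHiveSupport()        # needs hive-site.xml / metastore access on the cluster
         .getOrCreate())

# Option 1: read through the Hive metastore (database and table names are assumptions)
df = spark.sql("SELECT * FROM deltatest.basedeviation")

# Option 2: read the warehouse files directly if you know their format
# df = spark.read.parquet("hdfs://node-adress:8020/home/deltatest/basedeviation")

df.show(10)
pandas_df = df.limit(1000).toPandas()   # pull a sample into pandas if needed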
I want to read DOCX/PDF files from the Hadoop file system using PySpark. Currently I am using the pandas API, but pandas has a limitation: it can read only CSV, JSON, XLSX and HDF5, and it does not support any other format.
Currently my code is:
import pandas as pd
from pyspark import SparkContext, SparkConf
from hdfs import InsecureClient

conf = SparkConf().setAppName("Random")
sc = SparkContext(conf=conf)
client_hdfs = InsecureClient('http://192.00.00.30:50070')

with client_hdfs.read('/user/user.name/sample.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader, index_col=0)
    print(df)
I am able to read CSV using the above code. Are there any other APIs which can solve this problem for DOCX/PDF?
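One hedged possibility, not a definitive answer: Spark has no built-in DOCX/PDF reader, but you can load the raw bytes with sc.binaryFiles and parse them on the executors with a library such as python-docx (which must be installed on every worker). The path pattern below is a placeholder.

import io
from docx import Document           # pip install python-docx on all workers
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Random")
sc = SparkContext(conf=conf)

def docx_to_text(path_and_bytes):
    path, raw = path_and_bytes
    doc = Document(io.BytesIO(raw))
    return (path, "\n".join(p.text for p in doc.paragraphs))

# placeholder HDFS path pattern; a PDF library such as PyPDF2 could be used the same way
docs = sc.binaryFiles("hdfs:///user/user.name/docs/*.docx").map(docx_to_text)
print(docs.first())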
I'm working with files whose lines have varying schemas, so I need to parse each line and take decisions based on it, which means I need to write files to HDFS line by line.
Is there a way to achieve that in Python?
You can use IOUtils from sc._gateway.jvm to stream from one Hadoop file (or a local file) to a file on Hadoop.
# Access the Hadoop Java classes through the Py4J gateway
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
IOUtils = sc._gateway.jvm.org.apache.hadoop.io.IOUtils

fs = FileSystem.get(Configuration())

# Open the source file, create the destination, then copy the bytes across
f = fs.open(Path("/user/test/abc.txt"))
output_stream = fs.create(Path("/user/test/a1.txt"))
IOUtils.copyBytes(f, output_stream, Configuration())
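If the goal is to decide per line rather than copy the whole stream, an untested variation on the same gateway objects is to wrap the input stream in a java.io.BufferedReader and write back only the lines you keep; the filter condition and file names below are placeholders.

BufferedReader = sc._gateway.jvm.java.io.BufferedReader
InputStreamReader = sc._gateway.jvm.java.io.InputStreamReader

reader = BufferedReader(InputStreamReader(fs.open(Path("/user/test/abc.txt"))))
out = fs.create(Path("/user/test/filtered.txt"))     # placeholder output path

line = reader.readLine()
while line is not None:
    if "ERROR" in line:                              # placeholder per-line decision
        out.write(bytearray(line + "\n", "utf-8"))   # Py4J converts bytearray to byte[]
    line = reader.readLine()

reader.close()
out.close()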