I am trying to write code using Kafka, Python, and Spark.
The problem statement is: read data from XML via Kafka; the data consumed will be in binary format and has to be stored in a DataFrame.
I am getting the error below:
Error:
File "C:/Users/HP/PycharmProjects/xml_streaming/ConS.py", line 55, in
.format("console")
AttributeError: 'DataFrameWriter' object has no attribute 'start'
Here is my code for reference:
import os
import kafka
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import ArrayType, StructType, IntegerType, StringType
# Set spark environments
#os.environ['PYSPARK_PYTHON'] = <PATH>
#os.environ['PYSPARK_DRIVER_PYTHON'] = <PATH>
spark = SparkSession\
.builder\
.master("local[1]")\
.appName("Consumer")\
.getOrCreate()
topic_Name = 'XML_File_Processing3'
consumer = kafka.KafkaConsumer(topic_Name, bootstrap_servers=['localhost:9092'], auto_offset_reset='latest')
kafka_df = spark\
.read \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("kafka.security.protocol", "SSL") \
.option("failOnDataLoss", "false") \
.option("subscribe", topic_Name) \
.load()
#.option("startingOffsets", "earliest") \
print("Loaded to DataFrame kafka_df")
kafka_df.printSchema()
new_df = kafka_df.selectExpr("CAST(value AS STRING)")
schema = ArrayType(StructType()\
.add("book_id", IntegerType())\
.add("author", StringType())\
.add("title", StringType())\
.add("genre",StringType())\
.add("price",IntegerType())\
.add("publish_date", IntegerType())\
.add("description", StringType()))
book_DF = new_df.select(from_json(col("value"), schema).alias("dataf")) #.('data')).select("data.*")
book_DF.printSchema()
#book_DF.select("dataf.author").show()
book_DF.write\
.format("console")\
.start()
I don't have a lot of experience with Kafka, but at the end you're calling the start() method on the result of book_DF.write.format("console"), which is a DataFrameWriter object. A DataFrameWriter does not have a start() method.
Do you want to write this as a stream? Then you'll probably need to read with readStream and write with the writeStream method, something like:
book_DF.writeStream \
.format("kafka") \
.start()
More info and examples can be found in the Spark Structured Streaming programming guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
If you simply want to print your DataFrame to the console, you should be able to use the show method for that. So in your case: book_DF.show()
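For completeness, here is a minimal end-to-end sketch of the streaming approach, assuming the topic and broker from your question and a trimmed version of your schema; it reads with readStream and prints the parsed rows to the console:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import ArrayType, StructType, IntegerType, StringType

spark = SparkSession.builder.master("local[1]").appName("Consumer").getOrCreate()

# Same schema as in the question, shortened to three fields for brevity
schema = ArrayType(StructType()
                   .add("book_id", IntegerType())
                   .add("author", StringType())
                   .add("title", StringType()))

# readStream (not read) produces a streaming DataFrame, which is what writeStream/start() expects
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "XML_File_Processing3") \
    .load()

book_df = kafka_df.selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("dataf"))

query = book_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()

query.awaitTermination()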
The error is with PySpark: the DataFrameWriter doesn't have a .start() method; use .save() instead.
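If you stay with the batch read (spark.read), note that save() writes to a path rather than starting a query; for example, with a hypothetical output directory:
book_DF.write \
    .format("json") \
    .mode("overwrite") \
    .save("output/books")  # hypothetical path; pick any writable location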
This example is extracted from the Structured Streaming Programming Guide of Spark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession \
.builder \
.appName("StructuredNetworkWordCount") \
.getOrCreate()
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, " ")
).alias("word"),
lines.timestamp.alias('time')
)
# Generate running word count
wordCounts = words.groupBy("word").count() #line to modify
# Start running the query that prints the running counts to the console
query = wordCounts \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
query.awaitTermination()
I need to create a table with every word and its input time. The output table should be like this:
+-------+--------------------+
|word | time |
+-------+--------------------+
| car |2021-12-16 12:21:..|
+-------+--------------------+
How can I do it? I think the line marked with "#line to modify" is the only line that needs to change.
Try something like this:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.persist()
batchDF.write.format(...).save(...) // location 1
batchDF.write.format(...).save(...) // location 2
batchDF.unpersist()
}
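The snippet above is Scala; since the question uses PySpark, a rough Python equivalent would be the following sketch (the formats and paths are placeholders, as in the original):
def write_to_two_locations(batch_df, batch_id):
    batch_df.persist()
    batch_df.write.format("parquet").save("location1")  # placeholder format and path
    batch_df.write.format("parquet").save("location2")  # placeholder format and path
    batch_df.unpersist()

streamingDF.writeStream.foreachBatch(write_to_two_locations).start()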
You can do something like this:
writeStream
.format("parquet") // can be "orc", "json", "csv", etc.
.option("path", "path/to/destination/dir")
.start()
and then create an external table pointing at that path yourself, setting the path as needed.
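Applied to the words DataFrame from the example above, a minimal sketch with hypothetical paths could look like this; note that the file sink requires a checkpoint location and only supports "append" output mode (which is fine here, since words is not aggregated):
query = words.writeStream \
    .format("parquet") \
    .option("path", "output/words") \
    .option("checkpointLocation", "output/words_checkpoint") \
    .outputMode("append") \
    .start()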
See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
Delta also writes to a file location:
df.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/delta/df/_checkpoints/etl-from-json")
.start("/delta/df")
You may also want to think about the "complete" output mode here: the word-count example writes with "complete", but the file sink only supports "append", so a running aggregation cannot be written to it directly in that mode.
I'm very new to Spark. This example is extracted from the Structured Streaming Programming Guide of Spark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession \
.builder \
.appName("StructuredNetworkWordCount") \
.getOrCreate()
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, " ")
).alias("word")
)
# Generate running word count
wordCounts = words.groupBy("word").count()
# Start running the query that prints the running counts to the console
query = wordCounts \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
query.awaitTermination()
I need to modify this code to count only the words that start with the letter "B" and have a count greater than 6. How can I do it?
The solution is:
wordCountsDF = words.groupBy('word').count().where("word LIKE 'B%' AND count > 6")
Note that .where('word.startsWith("B")' and 'count > 6') does not do what it looks like: Python's and between two non-empty strings simply evaluates to the second string, so only the count condition would be applied, and word.startsWith("B") is not a valid Spark SQL expression anyway.
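An equivalent using the Column API keeps the two conditions explicit (a sketch; it needs col imported from pyspark.sql.functions):
from pyspark.sql.functions import col

wordCountsDF = words.groupBy("word").count() \
    .where(col("word").startswith("B") & (col("count") > 6))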
I have a big list in a JSON file; an example:
{"timestamp":"1600840155","name":"0.0.0.1","value":"subdomain.test.net","type":"hd"}
{"timestamp":"1600840155","name":"0.0.0.2","value":"test.net","type":"hd"}
{"timestamp":"1600846210","name":"0.0.0.3","value":"node-fwx.pool-1-0.dynamic.exmple4.net","type":"hd"}
{"timestamp":"1600846210","name":"0.0.0.4","value":"exmple4.net","type":"hd"}
{"timestamp":"1600848078","name":"0.0.0.5","value":"node-fwy.pool-1-0.dynamic.exmple5.net","type":"hd"}
{"timestamp":"1600848078","name":"0.0.0.6","value":"exmple5.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.7","value":"node-fwz.pool-1-0.dynamic.exmple6.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.8","value":"exmple6.net","type":"hd"}
{"timestamp":"1600879127","name":"0.0.0.9","value":"node-fx0.pool-1-0.dynamic.exmple7.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.10","value":"exmple7.net","type":"hd"}
{"timestamp":"1600874834","name":"0.0.0.11","value":"node-fx1.pool-1-0.dynamic.exmple8.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.12","value":"exmple8.net","type":"hd"}
{"timestamp":"1600825122","name":"0.0.0.13","value":"node-ftb.pool-1-0.dynamic.exmple9.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.14","value":"exmple9.net","type":"hd"}
{"timestamp":"1600849239","name":"0.0.0.15","value":"node-fx2.pool-1-0.dynamic.exmple10.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.16","value":"exmple10.net","type":"hd"}
{"timestamp":"1600820784","name":"0.0.0.17","value":"node-fx3.pool-1-0.dynamic.other11.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.18","value":"exmple11.net","type":"hd"}
{"timestamp":"1600840955","name":"0.0.0.19","value":"node-fx4.pool-1-0.dynamic.other12.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.20","value":"exmple12.net","type":"hd"}
{"timestamp":"1600860091","name":"0.0.0.21","value":"another -one.pool-1-0.dynamic.other13.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.22","value":"exmple13.net","type":"hd"}
and I would like to keep just the root domain entries and delete the others using PySpark,
so I want to select:
df.select("name","value","type").distinct() \
.write \
.save("mycleanlist",format="json")
I want this result:
"name":"0.0.0.22","value":"exmple13.net","type":"hd"}
"name":"0.0.0.2","value":"test.net","type":"hd"}
"name":"0.0.0.4","value":"exmple4.net","type":"hd"}
"name":"0.0.0.6","value":"exmple5.net","type":"hd"}
"name":"0.0.0.8","value":"exmple6.net","type":"hd"}
"name":"0.0.0.10","value":"exmple7.net","type":"hd"}
"name":"0.0.0.12","value":"exmple8.net","type":"hd"}
"name":"0.0.0.14","value":"exmple9.net","type":"hd"}
"name":"0.0.0.16","value":"exmple10.net","type":"hd"}
"name":"0.0.0.18","value":"exmple11.net","type":"hd"}
"name":"0.0.0.20","value":"exmple12.net","type":"hd"}
"name":"0.0.0.22","value":"exmple13.net","type":"hd"}
You can use a UDF to wrap a method that extracts the root domain:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Keep only the last two dot-separated labels, e.g. "node-fwx.pool-1-0.dynamic.exmple4.net" -> "exmple4.net"
def extract_domain(url):
    if url:
        parts = url.split('.')
        url = '.'.join(parts[-2:])
    return url

extract_domain_udf = udf(extract_domain, StringType())

# Alias the UDF output back to "value" so the JSON output keeps the original field name
df.select("name", extract_domain_udf("value").alias("value"), "type").distinct() \
    .write \
    .save("mycleanlist", format="json")
Below is my code. I have tried many different select variations, and yet the app runs but never shows the messages that are being written every second. I have a Spark Streaming example which, using pprint(), confirms Kafka is in fact receiving messages every second. The messages in Kafka are JSON-formatted; see the schema for the field/column labels:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import statistics
KAFKA_TOPIC = "vehicle_events_fast_testdata"
KAFKA_SERVER = "10.2.0.6:2181"
if __name__ == "__main__":
    print("NXB PySpark Structured Streaming with Kafka Demo Started")
    spark = SparkSession \
        .builder \
        .appName("PySpark Structured Streaming with Kafka Demo") \
        .master("local[*]") \
        .config("spark.jars", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar,/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .config("spark.executor.extraClassPath", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar:/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .config("spark.executor.extraLibrary", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar:/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .config("spark.driver.extraClassPath", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar:/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    schema = StructType() \
        .add("WheelAngle", IntegerType()) \
        .add("acceleration", IntegerType()) \
        .add("heading", IntegerType()) \
        .add("reading_time", IntegerType()) \
        .add("tractionForce", IntegerType()) \
        .add("vel_latitudinal", IntegerType()) \
        .add("vel_longitudinal", IntegerType()) \
        .add("velocity", IntegerType()) \
        .add("x_pos", IntegerType()) \
        .add("y_pos", IntegerType()) \
        .add("yawrate", IntegerType())
    # Construct a streaming DataFrame that reads from testtopic
    trans_det_df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", KAFKA_SERVER) \
        .option("subscribe", KAFKA_TOPIC) \
        .option("startingOffsets", "latest") \
        .load() \
        .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)","CAST(topic as STRING)")
    #(from_json(col("value").cast("string"),schema))
    #Q1 = trans_det_df.select(from_json(col("value"), schema).alias("parsed_value"), "timestamp")
    #Q2 = trans_det_d.select("parsed_value*", "timestamp")
    query = trans_det_df.writeStream \
        .format("console") \
        .option("truncate","false") \
        .start() \
        .awaitTermination()
kafka.bootstrap.servers should be the Kafka broker address (default port 9092), not Zookeeper (port 2181).
Also note your starting offsets are the latest, so you must produce data after starting the streaming application.
If you want to see existing topic data, use the earliest offsets.
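Putting that together, a sketch of the corrected read side, assuming the broker actually listens on the default port 9092 on that host and reusing the schema, topic name, and imports from the question:
KAFKA_SERVER = "10.2.0.6:9092"  # Kafka broker, not Zookeeper; assumed to be on the same host

# Parse the JSON value with the schema so the fields become real columns
trans_det_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_SERVER) \
    .option("subscribe", KAFKA_TOPIC) \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value"), "timestamp") \
    .select("parsed_value.*", "timestamp")

query = trans_det_df.writeStream \
    .format("console") \
    .option("truncate", "false") \
    .start()

query.awaitTermination()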