I am trying to write code using Kafka, Python, and Spark.
The problem statement is: read data from XML via Kafka; the data consumed will be in binary format and has to be stored in a DataFrame.
I am getting the error below:
Error:
File "C:/Users/HP/PycharmProjects/xml_streaming/ConS.py", line 55, in
.format("console")
AttributeError: 'DataFrameWriter' object has no attribute 'start'
Here is my code for reference:
import os
import kafka
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import ArrayType, StructType, IntegerType, StringType
# Set spark environments
#os.environ['PYSPARK_PYTHON'] = <PATH>
#os.environ['PYSPARK_DRIVER_PYTHON'] = <PATH>
spark = SparkSession\
.builder\
.master("local[1]")\
.appName("Consumer")\
.getOrCreate()
topic_Name = 'XML_File_Processing3'
consumer = kafka.KafkaConsumer(topic_Name, bootstrap_servers=['localhost:9092'], auto_offset_reset='latest')
kafka_df = spark\
.read \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("kafka.security.protocol", "SSL") \
.option("failOnDataLoss", "false") \
.option("subscribe", topic_Name) \
.load()
#.option("startingOffsets", "earliest") \
print("Loaded to DataFrame kafka_df")
kafka_df.printSchema()
new_df = kafka_df.selectExpr("CAST(value AS STRING)")
schema = ArrayType(StructType()\
.add("book_id", IntegerType())\
.add("author", StringType())\
.add("title", StringType())\
.add("genre",StringType())\
.add("price",IntegerType())\
.add("publish_date", IntegerType())\
.add("description", StringType()))
book_DF = new_df.select(from_json(col("value"), schema).alias("dataf")) #.('data')).select("data.*")
book_DF.printSchema()
#book_DF.select("dataf.author").show()
book_DF.write\
.format("console")\
.start()
I don't have a lot of experience with Kafka, but at the end you're calling the start() method on the result of book_DF.write.format("console"), which is a DataFrameWriter object. A DataFrameWriter does not have a start() method.
Do you want to write this as a stream? Then you'll probably need to read with readStream and write with the writeStream method, something like:
book_DF.writeStream \
.format("kafka") \
.start()
More info and examples can be found in the Spark Structured Streaming programming guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
If you simply want to print your DataFrame to the console, you should be able to use the show method for that. So in your case: book_DF.show()
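For completeness, here is a minimal end-to-end sketch of the streaming approach, assuming the topic and broker from your question and a trimmed version of your schema; it reads with readStream and prints the parsed rows to the console:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import ArrayType, StructType, IntegerType, StringType

spark = SparkSession.builder.master("local[1]").appName("Consumer").getOrCreate()

# Same schema as in the question, shortened to three fields for brevity
schema = ArrayType(StructType()
                   .add("book_id", IntegerType())
                   .add("author", StringType())
                   .add("title", StringType()))

# readStream (not read) produces a streaming DataFrame, which is what writeStream/start() expects
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "XML_File_Processing3") \
    .load()

book_df = kafka_df.selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("dataf"))

query = book_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()

query.awaitTermination()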
The error is with PySpark: the DataFrameWriter doesn't have a .start() method; use .save() instead.
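If you stay with the batch read (spark.read), note that save() writes to a path rather than starting a query; for example, with a hypothetical output directory:
book_DF.write \
    .format("json") \
    .mode("overwrite") \
    .save("output/books")  # hypothetical path; pick any writable location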
This example is extracted from the Structured Streaming Programming Guide of Spark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession \
.builder \
.appName("StructuredNetworkWordCount") \
.getOrCreate()
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, " ")
).alias("word"),
lines.timestamp.alias('time')
)
# Generate running word count
wordCounts = words.groupBy("word").count() #line to modify
# Start running the query that prints the running counts to the console
query = wordCounts \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
query.awaitTermination()
I need to create a table with every word and its input time. The output table should be like this:
+-------+--------------------+
|word | time |
+-------+--------------------+
| car |2021-12-16 12:21:..|
+-------+--------------------+
How can I do it? I think the line marked with "#line to modify" is the only line that needs to change.
Try something like this:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.persist()
batchDF.write.format(...).save(...) // location 1
batchDF.write.format(...).save(...) // location 2
batchDF.unpersist()
}
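The snippet above is Scala; since the question uses PySpark, a rough Python equivalent would be the following sketch (the formats and paths are placeholders, as in the original):
def write_to_two_locations(batch_df, batch_id):
    batch_df.persist()
    batch_df.write.format("parquet").save("location1")  # placeholder format and path
    batch_df.write.format("parquet").save("location2")  # placeholder format and path
    batch_df.unpersist()

streamingDF.writeStream.foreachBatch(write_to_two_locations).start()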
You can do something like this:
writeStream
.format("parquet") // can be "orc", "json", "csv", etc.
.option("path", "path/to/destination/dir")
.start()
and then create an external table pointing at that path yourself, setting the path as needed.
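Applied to the words DataFrame from the example above, a minimal sketch with hypothetical paths could look like this; note that the file sink requires a checkpoint location and only supports "append" output mode (which is fine here, since words is not aggregated):
query = words.writeStream \
    .format("parquet") \
    .option("path", "output/words") \
    .option("checkpointLocation", "output/words_checkpoint") \
    .outputMode("append") \
    .start()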
See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
Delta also writes to a file location:
df.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/delta/df/_checkpoints/etl-from-json")
.start("/delta/df")
You may also want to think about the "complete" output mode here: the word-count example writes with "complete", but the file sink only supports "append", so a running aggregation cannot be written to it directly in that mode.
I'm very new to Spark. This example is extracted from the Structured Streaming Programming Guide of Spark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession \
.builder \
.appName("StructuredNetworkWordCount") \
.getOrCreate()
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, " ")
).alias("word")
)
# Generate running word count
wordCounts = words.groupBy("word").count()
# Start running the query that prints the running counts to the console
query = wordCounts \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
query.awaitTermination()
I need to modify this code to count only the words that start with the letter "B" and have a count greater than 6. How can I do it?
The solution is:
wordCountsDF = words.groupBy('word').count().where("word LIKE 'B%' AND count > 6")
Note that .where('word.startsWith("B")' and 'count > 6') does not do what it looks like: Python's and between two non-empty strings simply evaluates to the second string, so only the count condition would be applied, and word.startsWith("B") is not a valid Spark SQL expression anyway.
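An equivalent using the Column API keeps the two conditions explicit (a sketch; it needs col imported from pyspark.sql.functions):
from pyspark.sql.functions import col

wordCountsDF = words.groupBy("word").count() \
    .where(col("word").startswith("B") & (col("count") > 6))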
I have a big list in a JSON file; an example:
{"timestamp":"1600840155","name":"0.0.0.1","value":"subdomain.test.net","type":"hd"}
{"timestamp":"1600840155","name":"0.0.0.2","value":"test.net","type":"hd"}
{"timestamp":"1600846210","name":"0.0.0.3","value":"node-fwx.pool-1-0.dynamic.exmple4.net","type":"hd"}
{"timestamp":"1600846210","name":"0.0.0.4","value":"exmple4.net","type":"hd"}
{"timestamp":"1600848078","name":"0.0.0.5","value":"node-fwy.pool-1-0.dynamic.exmple5.net","type":"hd"}
{"timestamp":"1600848078","name":"0.0.0.6","value":"exmple5.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.7","value":"node-fwz.pool-1-0.dynamic.exmple6.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.8","value":"exmple6.net","type":"hd"}
{"timestamp":"1600879127","name":"0.0.0.9","value":"node-fx0.pool-1-0.dynamic.exmple7.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.10","value":"exmple7.net","type":"hd"}
{"timestamp":"1600874834","name":"0.0.0.11","value":"node-fx1.pool-1-0.dynamic.exmple8.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.12","value":"exmple8.net","type":"hd"}
{"timestamp":"1600825122","name":"0.0.0.13","value":"node-ftb.pool-1-0.dynamic.exmple9.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.14","value":"exmple9.net","type":"hd"}
{"timestamp":"1600849239","name":"0.0.0.15","value":"node-fx2.pool-1-0.dynamic.exmple10.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.16","value":"exmple10.net","type":"hd"}
{"timestamp":"1600820784","name":"0.0.0.17","value":"node-fx3.pool-1-0.dynamic.other11.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.18","value":"exmple11.net","type":"hd"}
{"timestamp":"1600840955","name":"0.0.0.19","value":"node-fx4.pool-1-0.dynamic.other12.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.20","value":"exmple12.net","type":"hd"}
{"timestamp":"1600860091","name":"0.0.0.21","value":"another -one.pool-1-0.dynamic.other13.net","type":"hd"}
{"timestamp":"1600838189","name":"0.0.0.22","value":"exmple13.net","type":"hd"}
and I would like to keep just the root domain entries and delete the others using PySpark,
so I want to select:
df.select("name","value","type").distinct() \
.write \
.save("mycleanlist",format="json")
I want this result:
"name":"0.0.0.22","value":"exmple13.net","type":"hd"}
"name":"0.0.0.2","value":"test.net","type":"hd"}
"name":"0.0.0.4","value":"exmple4.net","type":"hd"}
"name":"0.0.0.6","value":"exmple5.net","type":"hd"}
"name":"0.0.0.8","value":"exmple6.net","type":"hd"}
"name":"0.0.0.10","value":"exmple7.net","type":"hd"}
"name":"0.0.0.12","value":"exmple8.net","type":"hd"}
"name":"0.0.0.14","value":"exmple9.net","type":"hd"}
"name":"0.0.0.16","value":"exmple10.net","type":"hd"}
"name":"0.0.0.18","value":"exmple11.net","type":"hd"}
"name":"0.0.0.20","value":"exmple12.net","type":"hd"}
"name":"0.0.0.22","value":"exmple13.net","type":"hd"}
You can use a UDF to wrap a method that extracts the root domain:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Keep only the last two dot-separated labels, e.g. "node-fwx.pool-1-0.dynamic.exmple4.net" -> "exmple4.net"
def extract_domain(url):
    if url:
        parts = url.split('.')
        url = '.'.join(parts[-2:])
    return url

extract_domain_udf = udf(extract_domain, StringType())

# Alias the UDF output back to "value" so the JSON output keeps the original field name
df.select("name", extract_domain_udf("value").alias("value"), "type").distinct() \
    .write \
    .save("mycleanlist", format="json")
Below is my code. I have tried many different select variations, and yet the app runs but never shows the messages that are being written every second. I have a Spark Streaming example which, using pprint(), confirms Kafka is in fact receiving messages every second. The messages in Kafka are JSON-formatted; see the schema for the field/column labels:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import statistics
KAFKA_TOPIC = "vehicle_events_fast_testdata"
KAFKA_SERVER = "10.2.0.6:2181"
if __name__ == "__main__":
    print("NXB PySpark Structured Streaming with Kafka Demo Started")
    spark = SparkSession \
        .builder \
        .appName("PySpark Structured Streaming with Kafka Demo") \
        .master("local[*]") \
        .config("spark.jars", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar,/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .config("spark.executor.extraClassPath", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar:/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .config("spark.executor.extraLibrary", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar:/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .config("spark.driver.extraClassPath", "/home/cldr/streams-dev/libs/spark-sql-kafka-0-10_2.11-2.4.4.jar:/home/cldr/streams-dev/libs/kafka-clients-2.0.0.jar") \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    schema = StructType() \
        .add("WheelAngle", IntegerType()) \
        .add("acceleration", IntegerType()) \
        .add("heading", IntegerType()) \
        .add("reading_time", IntegerType()) \
        .add("tractionForce", IntegerType()) \
        .add("vel_latitudinal", IntegerType()) \
        .add("vel_longitudinal", IntegerType()) \
        .add("velocity", IntegerType()) \
        .add("x_pos", IntegerType()) \
        .add("y_pos", IntegerType()) \
        .add("yawrate", IntegerType())
    # Construct a streaming DataFrame that reads from testtopic
    trans_det_df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", KAFKA_SERVER) \
        .option("subscribe", KAFKA_TOPIC) \
        .option("startingOffsets", "latest") \
        .load() \
        .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)","CAST(topic as STRING)")
    #(from_json(col("value").cast("string"),schema))
    #Q1 = trans_det_df.select(from_json(col("value"), schema).alias("parsed_value"), "timestamp")
    #Q2 = trans_det_d.select("parsed_value*", "timestamp")
    query = trans_det_df.writeStream \
        .format("console") \
        .option("truncate","false") \
        .start() \
        .awaitTermination()
kafka.bootstrap.servers should be the Kafka broker address (default port 9092), not Zookeeper (port 2181).
Also note your starting offsets are the latest, so you must produce data after starting the streaming application.
If you want to see existing topic data, use the earliest offsets.
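Putting that together, a sketch of the corrected read side, assuming the broker actually listens on the default port 9092 on that host and reusing the schema, topic name, and imports from the question:
KAFKA_SERVER = "10.2.0.6:9092"  # Kafka broker, not Zookeeper; assumed to be on the same host

# Parse the JSON value with the schema so the fields become real columns
trans_det_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_SERVER) \
    .option("subscribe", KAFKA_TOPIC) \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value"), "timestamp") \
    .select("parsed_value.*", "timestamp")

query = trans_det_df.writeStream \
    .format("console") \
    .option("truncate", "false") \
    .start()

query.awaitTermination()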