How to read a CSV using a schema - Python

I am trying to load a CSV file into a PySpark dataframe using:
spark.read.options(delimiter=';', header=True).csv(file)
But I get the following error:
AnalysisException: 'Unable to infer schema for CSV. It must be specified manually.;'
I tried to specify the schema manually, but it still doesn't load any values:
customSchema = StructType([
    StructField("aaa", StringType(), True),
    StructField("bbb", IntegerType(), True),
    StructField("ccc", IntegerType(), True)])
spark.read.option('header', 'true').option('delimiter', ';').schema(customSchema).csv(file)
spark.read.load(file, format="csv", header="true", sep=';', schema=customSchema)
I only get an empty dataframe with the column names.

For a CSV file delimited like the following:
aaa;bbb;ccc
john;12;14
peter;3;7
sally;8;27
you can read it into PySpark like so:
from pyspark.sql import types
customSchema = types.StructType([
    types.StructField("aaa", types.StringType(), True),
    types.StructField("bbb", types.IntegerType(), True),
    types.StructField("ccc", types.IntegerType(), True)])
df = spark.read.options(delimiter=';', header=True).csv(file, schema=customSchema)
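With the sample file above, a quick sanity check (not part of the original answer) is to print the schema and show the rows; if the delimiter and schema were applied, the result should look roughly like this. If the dataframe is still empty, double-check that file points at the right path and that the file really uses ';' as the delimiter.
df.printSchema()
# root
#  |-- aaa: string (nullable = true)
#  |-- bbb: integer (nullable = true)
#  |-- ccc: integer (nullable = true)

df.show()
# +-----+---+---+
# |  aaa|bbb|ccc|
# +-----+---+---+
# | john| 12| 14|
# |peter|  3|  7|
# |sally|  8| 27|
# +-----+---+---+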

Related

Unable to write the data to databricks delta table

I am trying to create a table in Databricks from data that is in Parquet format. Before that, I need to convert the data types to be Databricks-compatible, which I did, but I'm unable to write the converted data to Databricks Delta format.
Here is the code I wrote in Databricks to convert the data types and write the data in Delta format:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

# Define the schema for the input data
input_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("array_column", ArrayType(IntegerType()), True),
    StructField("set_column", StringType(), True),
    StructField("name", StringType(), True)
])

# Read the input data from S3
input_data = spark.read \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .schema(input_schema) \
    .parquet("/mnt/employee/employee_data/complex.parquet")

# Convert the data types to be compatible with Databricks
converted_data = input_data.withColumn("id", input_data["id"].cast(IntegerType())) \
    .withColumn("array_column", input_data["array_column"].cast(ArrayType(StringType()))) \
    .withColumn("set_column", input_data["set_column"].cast(StringType())) \
    .withColumn("name", input_data["name"].cast(StringType()))

# Write the converted data to Databricks Delta format
converted_data.write.format("delta").mode("overwrite").save("/mnt/data")
(The error message and a sample of the data were attached as screenshots and are not reproduced here.)
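Since the error text is not reproduced above, it is hard to say exactly what fails, but as a sanity check the written output can be read back with the standard Delta reader (a hypothetical verification snippet, not part of the original post):
# Read the Delta output back and inspect the schema; if the write succeeded,
# the casted types (e.g. array<string> for array_column) should show up here
written = spark.read.format("delta").load("/mnt/data")
written.printSchema()
written.show(5, truncate=False)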

How to Extract Columns From BinaryType Using pySpark Databricks?

Issue: extracting columns from a dataframe's binary-type column. The dataframe was loaded from an Azure Blob Storage account.
Environment:
Databricks 5.4 (includes Apache Spark 2.4.3)
Python 3.5.2
Process:
Get data from Avro files
Extract useful information and write back a more user-friendly version to Parquet
Avro schema:
SequenceNumber:long
Offset:string
EnqueuedTimeUtc:string
SystemProperties:map
key:string
value:struct
member0:long
member1:double
member2:string
member3:binary
Properties:map
key:string
value:struct
member0:long
member1:double
member2:string
member3:binary
Body:binary
I struggle to get data out of Body (binary). I managed to convert the column to a string using the code snippet below:
df = df.withColumn("Body", col("Body").cast("string"))
I managed to extract a list of the columns inside the Body column using the code below:
# The Body string looks like JSON
import json

dfBody = df.select(df.Body)
jsonList = dfBody.collect()
jsonString = jsonList[0][0]
columns = []
data = json.loads(jsonString)
for key, value in data.items():
    columns.append(key)
columns.sort()
print(columns)
The list has interesting columns such as ID, Status, Name.
Question:
How do I take the ID column that sits inside the Body binary column and add it to my current dataframe? In general, I want to flatten the binary column. The binary column might also contain arrays.
You don't want to collect the dataframe. Instead, you should be able to cast and flatten the Body field. From the looks of it, you are using Avro captures from Event Hubs. This is the code I use to handle this:
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
from pyspark.sql.functions import from_json, col

# Create a schema that describes the Body field
sourceSchema = StructType([
    StructField("Attribute1", StringType(), False),
    StructField("Attribute2", StringType(), True),
    StructField("Attribute3", StringType(), True),
    StructField("Attribute4", IntegerType(), True)])

# Convert Body to string and then to JSON, applying the schema
df = df.withColumn("Body", col("Body").cast("string"))
jsonOptions = {"dateFormat": "yyyy-MM-dd HH:mm:ss.SSS"}
df = df.withColumn("Body", from_json(df.Body, sourceSchema, jsonOptions))

# Flatten Body
for c in df.schema['Body'].dataType:
    df = df.withColumn(c.name, col("Body." + c.name))
I think the key bit you need is the from_json function.
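To tie this back to the original question: once Body has been parsed with from_json, the ID field can be promoted to a top-level column without collecting anything. A minimal sketch, assuming the Body JSON really contains ID, Status, and Name keys (as reported above) and that they can be read as strings:
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import from_json, col

# Hypothetical schema for the fields the question found inside Body
bodySchema = StructType([
    StructField("ID", StringType(), True),
    StructField("Status", StringType(), True),
    StructField("Name", StringType(), True)])

# Cast the binary Body to string, parse it as JSON, then lift ID out
df = df.withColumn("Body", from_json(col("Body").cast("string"), bodySchema))
df = df.withColumn("ID", col("Body.ID"))
If some of those fields are themselves arrays, declaring them as ArrayType(...) in the schema and using explode on the parsed column is the usual way to flatten them further.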

How to handle NullType in Spark Dataframe using Python?

I'm trying to load data from MapR DB into Spark DF.
Then I'm just trying to export the DF to CSV files.
But I'm getting this error:
"com.mapr.db.spark.exceptions.SchemaMappingException: Failed to parse a value for data type NullType (current token: STRING)"
I tried a couple of ways of casting the columns to StringType.
This is one of them:
df = spark.loadFromMapRDB(db_table).select(
    F.col('c_002.v_22').cast(T.StringType()).alias('aaa'),
    F.col('c_002.v_23').cast(T.StringType()).alias('bbb')
)
print(df.printSchema())
Output of PrintSchema:
root
|-- aaa: string (nullable = true)
|-- bbb: string (nullable = true)
Values in columns 'aaa' and 'bbb' can be null.
Then I try to export the DF to CSV files:
df = df.repartition(10)
df.write.csv(csvFile, compression='gzip', mode='overwrite', sep=',', header='true', quoteAll='true')
I was getting a similar issue with a MapR-DB JSON table and was able to resolve it by defining the table schema when loading into a DataFrame:
tableSchema = StructType([
    StructField("c_002.v_22", StringType(), True),  # True here signifies nullable: https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html?highlight=structfield#pyspark.sql.types.StructField
    StructField("c_002.v_23", StringType(), True),
])
df = spark.loadFromMapRDB(db_table, tableSchema).select(
    F.col('c_002.v_22').alias('aaa'),
    F.col('c_002.v_23').alias('bbb')
)
Another thing you could try is simply filling the null values with something:
https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna
df = df.na.fill('null')
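Putting the two suggestions together, a rough sketch of the full load-and-export flow (reusing the db_table and csvFile names from the question and the MapR-DB loader it already uses):
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

tableSchema = StructType([
    StructField("c_002.v_22", StringType(), True),
    StructField("c_002.v_23", StringType(), True),
])

df = spark.loadFromMapRDB(db_table, tableSchema).select(
    F.col('c_002.v_22').alias('aaa'),
    F.col('c_002.v_23').alias('bbb')
)

# Optionally replace nulls with a placeholder string so nothing is left
# as NullType when the CSV writer serializes the rows
df = df.na.fill('null')

df.repartition(10).write.csv(csvFile, compression='gzip', mode='overwrite',
                             sep=',', header='true', quoteAll='true')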

String encoding issue in Spark SQL/DataFrame

So I have this CSV file which has two columns: id (int), name (string). I read the file into PySpark through the following code:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)])
df = sqlContext.read.csv("file.csv",
                         header=False, schema=schema)
On executing df.first() I get the following output:
Row(artistid=1240105, artistname=u'Andr\xe9 Visior')
This is the original row from the file:
1240105,André Visior
How do I go about displaying the name as it is?
Save the CSV file by opening it and saving it again as CSV (UTF-8).
Not a very clean way, but here is a quick fix.
s = "1240105,André Visior"
s.decode('latin-1').encode("utf-8").replace("\xc3\xa9 ","e'")
>>
"1240105,Andre'Visior"
You may want to look at Latin-1 to Unicode / ASCII conversion here.
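Two things worth noting here. First, u'Andr\xe9 Visior' is just the Python 2 repr of the Unicode string 'André Visior' (\xe9 is é), so the data may already be fine; printing the value, rather than looking at the Row repr, will show the accent. Second, if the file really is in a non-UTF-8 encoding, the CSV reader lets you declare it at read time. A sketch, assuming the file is Latin-1/ISO-8859-1 encoded:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)])

# The CSV source accepts an "encoding" option (the default is UTF-8)
df = sqlContext.read.csv("file.csv", header=False, schema=schema,
                         encoding="ISO-8859-1")
print(df.first().name)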

How to "reduce" multiple json tables stored in a column of an RDD to a single RDD table as efficiently as possible

Will concurrent access that appends rows using union on a dataframe, as in the following code, work correctly? It currently shows a type error.
from pyspark.sql.types import *

schema = StructType([
    StructField("owreg", StringType(), True),
    StructField("we", StringType(), True),
    StructField("aa", StringType(), True),
    StructField("cc", StringType(), True),
    StructField("ss", StringType(), True),
    StructField("ss", StringType(), True),
    StructField("sss", StringType(), True)
])
f = sqlContext.createDataFrame(sc.emptyRDD(), schema)

def dump(l, jsid):
    if not l.startswith("<!E!>"):
        f = f.unionAll(sqlContext.read.json(l))

savedlabels.limit(10).foreach(lambda a: dump(a.labels, a.job_seq_id))
Assume that sqlContext.read.json(l) will read the JSON and output an RDD with the same schema.
The pattern is that I want to "reduce" multiple json tables stored in a column of an RDD to an RDD table as efficiently as possible.
def dump(l, jsid):
    if not l.startswith("<!E!>"):
        f = f.unionAll(sc.parallelize(json.loads(l)).toDF())
The above code will also not work, because sc.parallelize ends up being invoked from worker tasks (SparkContext is only usable on the driver). So how can this be solved?
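One way around this is to avoid calling SparkContext/SQLContext APIs inside foreach altogether: filter out the sentinel rows on the dataframe and hand the JSON strings to read.json as an RDD of strings, which keeps the parsing distributed and removes the need for a per-row union. A sketch, assuming each labels value is a JSON document whose records match the schema above:
from pyspark.sql import functions as F

# Keep only rows whose labels column actually contains JSON
jsonRows = savedlabels.filter(~F.col("labels").startswith("<!E!>")).select("labels")

# read.json accepts an RDD of JSON strings, so no driver-side loop,
# union, or call to sc.parallelize from workers is needed
f = sqlContext.read.json(jsonRows.rdd.map(lambda row: row.labels))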
