how to convert a string received from one dataframe to another dataframe - python

I have the following mapping stored in a parquet file, which I read into a variable using collect:
mapping_in_parquet = [('filial','filial','S','string'),('doc','numero_do_documento','S','string'),('serie','serie_do_documento','S','string')]
mapping = (df.select('mapping').distinct().collect()[0][0])
The problem is when I try to convert the string back to a dataframe:
schema = StructType([
    StructField("fieldName", StringType(), True),
    StructField("alias", StringType(), True),
    StructField("column_active", StringType(), True),
    StructField("typeField", StringType(), True)])
df = (spark.createDataFrame(mapping, schema))
print(mapping)
I get this error:
StructType can not accept object '[' in type
When I run the code directly in the console I don't get any error. The error only occurs when I extract the column value into a variable.

There are 2 issues in your code:
First, there is a typo when creating the dataframe: use df = (spark.createDataFrame(mapping_in_parquet, schema)).
Second, in mapping = (df.select('mapping').distinct().collect()[0][0]) there is no column named mapping; use one of fieldName, alias, column_active or typeField.
Full example code with correction:
from pyspark.sql.types import StructType, StructField, StringType

mapping_in_parquet = [('filial','filial','S','string'),('doc','numero_do_documento','S','string'),('serie','serie_do_documento','S','string')]
schema = StructType([
    StructField("fieldName", StringType(), True),
    StructField("alias", StringType(), True),
    StructField("column_active", StringType(), True),
    StructField("typeField", StringType(), True)])
df = (spark.createDataFrame(mapping_in_parquet, schema))
mapping = (df.select('fieldName').distinct().collect()[0][0])
print(mapping)
[Out]:
doc
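Note that if the mapping column actually stores the whole list as one string (which is what the original StructType can not accept object '[' in type error suggests), it needs to be parsed back into tuples before createDataFrame; a minimal sketch, assuming the stored string is a valid Python literal:
import ast

# 'mapping' is the single string collected from the parquet column;
# literal_eval turns "[('filial', 'filial', 'S', 'string'), ...]" back into a list of tuples
parsed_mapping = ast.literal_eval(mapping)
df = spark.createDataFrame(parsed_mapping, schema)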

Related

PySpark create a json string by combining columns

I have a dataframe.
from pyspark.sql.types import *
input_schema = StructType(
[
StructField("ID", StringType(), True),
StructField("Date", StringType(), True),
StructField("code", StringType(), True),
])
input_data = [
("1", "2021-12-01", "a"),
("2", "2021-12-01", "b"),
]
input_df = spark.createDataFrame(data=input_data, schema=input_schema)
I would like to perform a transformation that combines a set of columns into a json string. The columns to be combined are known ahead of time. The output should look something like what is shown below.
Is there any suggested method to achieve this?
Appreciate any help on this.
You can create a struct type and then convert to json:
from pyspark.sql import functions as F
col_to_combine = ['Date','code']
output = input_df.withColumn('combined',F.to_json(F.struct(*col_to_combine)))\
.drop(*col_to_combine)
output.show(truncate=False)
+---+--------------------------------+
|ID |combined                        |
+---+--------------------------------+
|1  |{"Date":"2021-12-01","code":"a"}|
|2  |{"Date":"2021-12-01","code":"b"}|
+---+--------------------------------+
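If the set of columns ever needs to be derived rather than hard-coded, the same pattern works with a computed list; a small sketch, assuming ID is the only column to keep out of the JSON:
# combine every column except the key into the JSON payload
col_to_combine = [c for c in input_df.columns if c != 'ID']
output = input_df.withColumn('combined', F.to_json(F.struct(*col_to_combine)))\
                 .drop(*col_to_combine)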

Pyspark: pyarrow.lib.ArrowInvalid: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

I have a dataframe like the following
df.show(5, False)
+------------------------------------+-------------------+--------+-------+--------+
|ID                                  |timestamp          |accuracy|lat    |lon     |
+------------------------------------+-------------------+--------+-------+--------+
|00000059-eb17-4db6-8e46-0739205a7ca1|2020-01-01 11:51:43|1.0     |41.3128|-81.8566|
|00000387-5804-40b2-9196-5cfead4dc55b|2020-01-01 18:05:24|11.7    |29.4241|-98.4936|
|00000387-5804-40b2-9196-5cfead4dc55b|2020-01-01 20:11:23|15.7    |29.4241|-98.4936|
|00000387-5804-40b2-9196-5cfead4dc55b|2020-01-01 18:05:10|14.4    |29.4241|-98.4936|
|00000387-5804-40b2-9196-5cfead4dc55b|2020-01-01 18:06:02|12.4    |29.4241|-98.4936|
+------------------------------------+-------------------+--------+-------+--------+
If I run code that keeps the ID column, I get this error:
pyarrow.lib.ArrowInvalid: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
This kind of error usually comes from an unmatched return type.
For example, it happens when returning a Python int for a PySpark StringType column, like below:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType(
    [
        StructField("a", StringType(), True),
        StructField("b", StringType(), True),
        StructField("c", IntegerType(), True),
        StructField("d", IntegerType(), True),
    ]
)

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def test(pdf):
    # columns 'a' and 'b' are declared as StringType but receive ints here,
    # which is the kind of mismatch that triggers the Arrow codec error
    return pd.DataFrame([range(4)], columns=['a', 'b', 'c', 'd'])

some_df.groupby(xxx).apply(test)
It also happens with datetime/date values returned for a StringType column, so I assume this kind of error comes from an unmatched return type. You should review your code.
In your case, I think there may be some non-string values (e.g. NaN) in the ID column.
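One quick way to rule that out, as a small sketch assuming the ID values should all be non-null strings, is to clean the column before the grouped pandas UDF runs:
from pyspark.sql import functions as F

# drop null IDs and force the column to a real string type,
# so Arrow never sees mixed or missing values in ID
clean_df = (df
            .filter(F.col("ID").isNotNull())
            .withColumn("ID", F.col("ID").cast("string")))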

How to calculate number of times an element has 2 fields the same after an RDD Join (Spark)

I started with 2 RDDs: one with the userID and "SHL ..", and one with the userID and the rest of the information.
So, after joining the 2 RDDs together I now have data in this format:
[(u'5839477', (u'SHL UNRESTRICTED', (u'AGBAMA,JAMES', u'MEDALLION TAXI DRIVER', u'12/27/2020', u'08/22/2019', u'13:20')))]
The first field is the userID and the next is information about them. I need to see how many users have both "SHL UNRESTRICTED" and 'MEDALLION TAXI DRIVER'. I believe I should perhaps have reformatted the data right after the .join. The issue I'm having is accessing the specific fields within the second field.
This would be easier to express with DataFrames, which have been available since Spark 1.3. Is there a specific reason you're using RDDs?
If not, work with DataFrames from the start, or convert your existing RDD to a DataFrame by specifying a schema, like this:
>>> rdd = spark.sparkContext.parallelize([("5839477", ("SHL UNRESTRICTED", (u'AGBAMA,JAMES', u'MEDALLION TAXI DRIVER',u'12/27/2020', u'08/22/2019', u'13:20')))])
>>> from pyspark.sql.types import *
>>> schema = StructType([
... StructField("user_id", StringType(), True),
... StructField("details", StructType([
... StructField("restrictions", StringType(), True),
... StructField("more_details", StructType([
... StructField("name", StringType(), True),
... StructField("function", StringType(), True),
... StructField("date1", StringType(), True),
... StructField("date2", StringType(), True),
... StructField("time_of_day", StringType(), True)
... ]), True) ]), True)])
...
>>> df = rdd.toDF(schema=schema)
>>> df.filter(
... (df.details.restrictions == "SHL UNRESTRICTED")
... & (df.details.more_details.function == "MEDALLION TAXI DRIVER")
... ).count()
...
1
Additionally, you could consider flattening the structure, so you won’t need to dig inside the nested columns.
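A minimal sketch of that flattening, reusing the schema above so the filter only touches top-level columns:
>>> # pull the nested fields up once, giving them readable top-level names
>>> flat_df = df.select(
...     "user_id",
...     df.details.restrictions.alias("restrictions"),
...     df.details.more_details.function.alias("function"))
>>> flat_df.filter(
...     (flat_df["restrictions"] == "SHL UNRESTRICTED")
...     & (flat_df["function"] == "MEDALLION TAXI DRIVER")
... ).count()
1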
Alternatively, if you really must stay with the RDD, then you would access the elements as if they were nested collections:
>>> rdd.filter(lambda x: (x[1][0] == "SHL UNRESTRICTED")
... and (x[1][1][1] == "MEDALLION TAXI DRIVER")
... ).count()
...
1
Notice how much less expressive this code is (what meaning does x[1][1][1] have to the reader of your code?). I definitely recommend adding names to your data structures.
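If you do stay with RDDs, one way to give those positions names (a small sketch built on the same rdd) is to map each record into a Row first:
>>> from pyspark.sql import Row
>>> # name the positions once, then filter by attribute instead of index
>>> named = rdd.map(lambda x: Row(user_id=x[0],
...                               restrictions=x[1][0],
...                               function=x[1][1][1]))
>>> named.filter(lambda r: r.restrictions == "SHL UNRESTRICTED"
...              and r.function == "MEDALLION TAXI DRIVER").count()
1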

pyspark dataframe "condition should be string or Column"

I am unable to use a filter on a data frame. I keep getting the error "TypeError("condition should be string or Column")".
I have tried changing the filter to use a col object, but it still does not work.
path = 'dbfs:/FileStore/tables/TravelData.txt'
data = spark.read.text(path)
from pyspark.sql.types import StructType, StructField, IntegerType , StringType, DoubleType
schema = StructType([
StructField("fromLocation", StringType(), True),
StructField("toLocation", StringType(), True),
StructField("productType", IntegerType(), True)
])
df = spark.read.option("delimiter", "\t").csv(path, header=False, schema=schema)
from pyspark.sql.functions import col
answerthree = df.select("toLocation").groupBy("toLocation").count().sort("count", ascending=False).take(10) # works fine
display(answerthree)
I add a filter to variable "answerthree" as follows:
answerthree = df.select("toLocation").groupBy("toLocation").count().filter(col("productType")==1).sort("count", ascending=False).take(10)
It throws the following errors:
"cannot resolve 'productType' given input columns" and "condition should be string or Column"
In gist, I am trying to solve problem 3 from the link below using PySpark instead of Scala. The dataset is also provided at that URL.
https://acadgild.com/blog/spark-use-case-travel-data-analysis?fbclid=IwAR0fgLr-8aHVBsSO_yWNzeyh7CoiGraFEGddahDmDixic6wmumFwUlLgQ2c
I should be able to get the desired result only for productType value 1
Note that the filter has to be applied before the select/groupBy, because the aggregation only keeps toLocation and count, which is why Spark reports it cannot resolve 'productType'. If you keep everything in one chained expression, the easiest is to use a string condition:
answerthree = df.filter("productType = 1")\
    .select("toLocation").groupBy("toLocation").count()\
    .sort(...
Alternatively, you can use a data frame variable and a column-based filter:
filtered_df = df.filter(df['productType'] == 1)
answerthree = filtered_df.select("toLocation").groupBy("toLocation").count()\
    .sort("count", ascending=False).take(10)

How to remove utf format in a dataframe in Pyspark and convert column from string to Integer

I need to remove utf formatting and convert a column to Integer type.
Below is what I have done to remove the utf format
>>auction_data = auction_raw_data.map(lambda line: line.encode("ascii","ignore").split(","))
>>auction_data.take(2)
>>[['8211480551', '52.99', '1.201505', 'hanna1104', '94', '49.99', '311.6'], ['8211480551', '50.99', '1.203843', 'wrufai1', '90', '49.99', '311.6']]
But when I create a dataframe for the same data using the schema and try to retrieve particular data, I get the data prefixed with a " u' ".
>>schema = StructType([ StructField("auctionid", StringType(), True),
StructField("bid", StringType(), True),
StructField("bidtime", StringType(), True),
StructField("bidder", StringType(), True),
StructField("bidderrate", StringType(), True),
StructField("openbid", StringType(), True),
StructField("price", StringType(), True)])`
>>xbox_df = sqlContext.createDataFrame(auction_data,schema)
>>xbox_df.registerTempTable("auction")
>>first_line = sqlContext.sql("select * from auction where auctionid=8211480551").collect()
>>for i in first_line:
>> print i
>>Row(auctionid=u'8211480551', bid=u'52.99', bidtime=u'1.201505', bidder=u'hanna1104', bidderrate=u'94', openbid=u'49.99', price=u'311.6')
>>Row(auctionid=u'8211480551', bid=u'50.99', bidtime=u'1.203843', bidder=u'wrufai1', bidderrate=u'90', openbid=u'49.99', price=u'311.6')
How do I remove the u' in front of the values? I also want to convert the bid value into an Integer. When I change the type directly in the schema definition, I get an error saying
"TypeError: IntegerType can not accept object in type".
I am loading JSON and not using a schema, so I don't know if there's a difference, but I have no issues converting fields to int using select. This is what I do:
from pyspark.sql.functions import *
...
df = df.select(col('intField').cast('int'))
df.show()
# prints Row(intField=123)
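For the question's dataframe specifically, here is a small sketch assuming bid holds decimal strings like '52.99'. The u'' prefix is just Python 2's unicode repr and is not stored in the data, and the TypeError appears because createDataFrame verifies that the Python objects already match the schema, so the conversion is best done as a cast afterwards:
from pyspark.sql.functions import col

# '52.99' is not a whole number, so cast to double
# (use 'int' only for columns that really hold integers)
xbox_df = xbox_df.withColumn("bid", col("bid").cast("double"))
xbox_df.printSchema()  # bid is now double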
