I have the following Python / PySpark code:
sql_command = ''' query '''
df = spark.sql(sql_command)
ls_colnames = df.schema.names
ls_colnames
['id', 'level1', 'level2', 'level3', 'specify_facts']
cSchema = StructType([
    StructField("colname", StringType(), False)
])
df_colnames = spark.createDataFrame(dataset_array, schema=cSchema)
File "/opt/mapr/spark/spark-2.1.0/python/pyspark/sql/types.py", line
1366, in _verify_type
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj))) TypeError: StructType can not accept object 'id'
in type class 'str'
What can I do to get a spark object of the colnames?
Not sure if I have understood your question correctly. But if you are trying to create a dataframe based on the given list, you can use the below code for the same.
from pyspark.sql import Row

l = ['id', 'level1', 'level2', 'level3', 'specify_facts']
rdd1 = sc.parallelize(l)                 # distribute the plain Python list
row_rdd = rdd1.map(lambda x: Row(x))     # wrap each name in a Row
sqlContext.createDataFrame(row_rdd, ['col_name']).show()
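Alternatively, if you want to keep the cSchema from your question, the original error goes away once each name is wrapped in a tuple, because a one-field StructType expects one value per row. A minimal sketch, assuming a SparkSession named spark:
# wrap each column name in a 1-tuple so it matches the single-field schema
df_colnames = spark.createDataFrame([(c,) for c in ls_colnames], schema=cSchema)
df_colnames.show()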
Hope it Helps.
Regards,
Neeraj
I wrote the following code using pandas:
df['last_two'] = df['text'].str[-2:]
df['before_hyphen'] = df['text'].str.split('-').str[0]
df['new_text'] = df['before_hyphen'].astype(str) + "-" + df['last_two'].astype(str)
But when I run it on a spark dataframe I get the following error:
TypeError: startPos and length must be the same type
I know I could just convert the df to pandas, run the code, and then convert it back to a spark df, but I wonder if there's a better way? Thanks
You can try the string functions below:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'last_two', F.expr('substring(text, -2)')
).withColumn(
    'before_hyphen', F.substring_index('text', '-', 1)
).withColumn(
    'new_text', F.concat_ws('-', 'before_hyphen', 'last_two')
)
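As a quick sanity check, here is a sketch on a made-up value (the sample string and the SparkSession named spark are assumptions, not from the question):
# made-up sample row just to illustrate the expected values
sample = spark.createDataFrame([('abc-def-gh',)], ['text'])
sample.withColumn('last_two', F.expr('substring(text, -2)')) \
      .withColumn('before_hyphen', F.substring_index('text', '-', 1)) \
      .withColumn('new_text', F.concat_ws('-', 'before_hyphen', 'last_two')) \
      .show()
# last_two = 'gh', before_hyphen = 'abc', new_text = 'abc-gh'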
I have imported an Excel file with some data and removed missing values.
df = pd.read_excel(r'file.xlsx', na_values=missing_values)
I'm trying to split the string values into lists for later actions.
df['GENRE'] = df['GENRE'].map(lambda x: x.split(','))
df['ACTORS'] = df['ACTORS'].map(lambda x: x.split(',')[:3])
df['DIRECTOR'] = df['DIRECTOR'].map(lambda x: x.split(','))
But it gives me the following error: AttributeError: 'list' object has no attribute 'split'
I've done the same with CSV and it worked... could it be because it's Excel?
I'm sure it's simple but I can't get my head around it.
Try using str.split, the Pandas way:
df['GENRE'] = df['GENRE'].str.split(',')
df['ACTORS'] = df['ACTORS'].str.split(',').str[:3]
df['DIRECTOR'] = df['DIRECTOR'].str.split(',')
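If the error persists, the cells may already contain lists (for example because the split was applied once before and the cell is no longer a string). A guarded version, sketched here rather than taken from the answer above, only splits values that are still strings:
# sketch: split only if the cell is still a string, leave lists (or NaN) untouched
def safe_split(value, limit=None):
    if isinstance(value, str):
        parts = value.split(',')
        return parts[:limit] if limit else parts
    return value  # already a list or a missing value

df['GENRE'] = df['GENRE'].map(safe_split)
df['ACTORS'] = df['ACTORS'].map(lambda v: safe_split(v, limit=3))
df['DIRECTOR'] = df['DIRECTOR'].map(safe_split)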
I am unable to use a filter on a data frame. I keep getting the error "TypeError("condition should be string or Column")".
I have tried changing the filter to use a col object. Still, it does not work.
path = 'dbfs:/FileStore/tables/TravelData.txt'
data = spark.read.text(path)
from pyspark.sql.types import StructType, StructField, IntegerType , StringType, DoubleType
schema = StructType([
    StructField("fromLocation", StringType(), True),
    StructField("toLocation", StringType(), True),
    StructField("productType", IntegerType(), True)
])
df = spark.read.option("delimiter", "\t").csv(path, header=False, schema=schema)
from pyspark.sql.functions import col
answerthree = df.select("toLocation").groupBy("toLocation").count().sort("count", ascending=False).take(10) # works fine
display(answerthree)
I add a filter to variable "answerthree" as follows:
answerthree = df.select("toLocation").groupBy("toLocation").count().filter(col("productType")==1).sort("count", ascending=False).take(10)
It is throwing errors as follows:
"cannot resolve 'productType' given input columns"
"condition should be string or Column"
In short, I am trying to solve problem 3 given in the link below using PySpark instead of Scala. The dataset is also provided at the URL below.
https://acadgild.com/blog/spark-use-case-travel-data-analysis?fbclid=IwAR0fgLr-8aHVBsSO_yWNzeyh7CoiGraFEGddahDmDixic6wmumFwUlLgQ2c
I should get the desired result only for rows with productType value 1.
As you don't have a variable referencing the intermediate (grouped) data frame, the easiest is to use a string condition:
answerthree = df.select("toLocation").groupBy("toLocation").count()\
    .filter("productType = 1")\
    .sort("count", ascending=False).take(10)
Alternatively, you can use a data frame variable and use a column-based filter:
count_df = df.select("toLocation").groupBy("toLocation").count()
answerthree = count_df.filter(count_df['productType'] == 1)\
.sort("count", ascending=False).take(10)
I'm attempting to convert a PipelinedRDD in PySpark to a DataFrame. This is the code snippet:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toDF()
When I run the code though, I receive this error:
'list' object has no attribute 'encode'
I've tried multiple other combinations, such as converting it to a Pandas dataframe using:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toPandas()
But then I end up receiving this error:
AttributeError: 'PipelinedRDD' object has no attribute 'toPandas'
Any help would be greatly appreciated. Thank you for your time.
rdd.toDF() only works when a SparkSession (or SQLContext) is active, and toPandas() is a DataFrame method, not an RDD method.
To fix your code, try the below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.textFile()   # pass your input path here
newRDD = rdd.map(...)                 # your existing map with tagScripts
df = newRDD.toDF()                    # requires an active SparkSession
pdf = df.toPandas()                   # toPandas() is then called on the DataFrame
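If the goal is to append a computed "tag" field to every Row, one way that avoids the positional Row factory is to go through asDict(). This is a sketch, not the only fix, and it reuses the tagScripts function and the RDD of Rows from the question:
from pyspark.sql import Row

def add_tag(row):
    values = row.asDict()              # existing fields as a dict
    values["tag"] = tagScripts(row)    # tagScripts comes from the question
    return Row(**values)

newRDD = rdd.map(add_tag)              # rdd is assumed to be an RDD of Rows
df = newRDD.toDF()                     # works once a SparkSession is active
pdf = df.toPandas()                    # pandas conversion happens on the DataFrame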
I am programming with PySpark on a Spark cluster. The data is large and in pieces, so it cannot be loaded into memory or sanity-checked easily. Basically, the wikipedia data looks like this:
af.b Current%20events 1 996
af.b Kategorie:Musiek 1 4468
af.b Spesiaal:RecentChangesLinked/Gebruikerbespreking:Freakazoid 1 5209
af.b Spesiaal:RecentChangesLinked/Sir_Arthur_Conan_Doyle 1 5214
I read it from AWS S3 and then try to construct a Spark DataFrame with the following Python code in the PySpark interpreter:
parts = data.map(lambda l: l.split())
wikis = parts.map(lambda p: (p[0], p[1],p[2],p[3]))
fields = [StructField("project", StringType(), True),
StructField("title", StringType(), True),
StructField("count", IntegerType(), True),
StructField("byte_size", StringType(), True)]
schema = StructType(fields)
df = sqlContext.createDataFrame(wikis, schema)
All looks fine; only createDataFrame gives me an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 404, in createDataFrame
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
File "/usr/lib/spark/python/pyspark/sql/context.py", line 298, in _createFromRDD
_verify_type(row, schema)
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1152, in _verify_type
_verify_type(v, f.dataType)
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1136, in _verify_type
raise TypeError("%s can not accept object in type %s" % (dataType, type(obj)))
TypeError: IntegerType can not accept object in type <type 'unicode'>
Why can I not set the third column, which should be count, to IntegerType?
How can I solve this?
As noted by ccheneson, you pass the wrong types.
Assuming your data looks like this:
data = sc.parallelize(["af.b Current%20events 1 996"])
After the first map you get an RDD[List[String]]:
parts = data.map(lambda l: l.split())
parts.first()
## ['af.b', 'Current%20events', '1', '996']
The second map converts it to a tuple (String, String, String, String):
wikis = parts.map(lambda p: (p[0], p[1], p[2],p[3]))
wikis.first()
## ('af.b', 'Current%20events', '1', '996')
Your schema states that the 3rd column is an integer:
[f.dataType for f in schema.fields]
## [StringType, StringType, IntegerType, StringType]
The schema is mostly used to avoid a full table scan when inferring types; it doesn't perform any type casting.
You can either cast your data during the last map:
wikis = parts.map(lambda p: (p[0], p[1], int(p[2]), p[3]))
Or define count as a StringType and cast the column:
from pyspark.sql.functions import col

fields[2] = StructField("count", StringType(), True)
schema = StructType(fields)
wikis.toDF(schema).withColumn("cnt", col("count").cast("integer")).drop("count")
On a side note, count is a reserved word in SQL and shouldn't be used as a column name. In Spark it will work as expected in some contexts and fail in others.
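For instance, quoting the name with backticks in Spark SQL sidesteps the ambiguity. A sketch, assuming a Spark 2.x SparkSession named spark and a hypothetical view name:
# sketch: backtick-quote the reserved name when referencing it in SQL
df = wikis.toDF(schema)                      # df has a column literally named "count"
df.createOrReplaceTempView("wiki_counts")    # hypothetical view name
spark.sql("SELECT project, `count` FROM wiki_counts").show()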
With Spark 2.0 you can let Spark infer the schema of your data. Overall, you'll still need to cast in your parser function, as argued above:
"When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of Row, or namedtuple, or dict."