Convert PipelinedRDD to dataframe - python

I'm attempting to convert a PipelinedRDD in pyspark to a dataframe. This is the code snippet:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toDF()
When I run the code though, I receive this error:
'list' object has no attribute 'encode'
I've tried multiple other combinations, such as converting it to a Pandas dataframe using:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toPandas()
But then I end up receiving this error:
AttributeError: 'PipelinedRDD' object has no attribute 'toPandas'
Any help would be greatly appreciated. Thank you for your time.

rdd.toDF() only works once a SparkSession (or SQLContext) is active, and toPandas() is a method of DataFrame, not of RDD. To fix your code, try something like the below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.textFile("path/to/input.txt")  # placeholder path
newRDD = rdd.map(...)
df = newRDD.toDF()
pandas_df = df.toPandas()  # call toPandas() on the DataFrame, not the RDD
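If you want to keep your original Row-based approach, the 'list' object has no attribute 'encode' error usually means Row() was handed one list instead of separate field-name strings. A minimal sketch of the fix, unpacking with * (and assuming tagScripts is your own function returning a tag value):
from pyspark.sql import Row

# Unpack the field names so Row() receives strings, not a single list,
# and unpack the values the same way when building each row.
newRDD = rdd.map(
    lambda row: Row(*(row.__fields__ + ["tag"]))(*(row + (tagScripts(row),)))
)
df = newRDD.toDF()
pandas_df = df.toPandas()  # pandas conversion happens on the DataFrame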

Related

How can I apply a python function to a pandas dataframe in Alteryx?

I have read data into a workflow successfully, but when I try to apply a function to a row of the data and output it as another row, I get an error.
The following is my current code:
from ayx import Alteryx
import math
from datetime import datetime
import pandas

data = Alteryx.read("#1")

def get_weekday_occurrences(date_string):
    year_day = datetime.strptime(date_string, '%Y_%m_%d').timetuple().tm_yday
    return math.ceil(year_day / 7)

data['Name'] = data['Name'].apply(lambda x: datetime.strptime(x, '%Y_%m_%d'))
data['Name'] = data.apply(lambda row: get_weekday_occurrences(data["Name"]), axis=1)
This code gives me the following error:
TypeError: ('strptime() argument 1 must be str, not Series', 'occurred at index 0')
What am I doing wrong, please?
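A likely fix, sketched under the assumption that Name holds strings like '2021_03_15': the lambda passes the whole data["Name"] Series to strptime instead of the current row's value, and the previous line has already converted the column to datetime, so strptime would fail either way. Applying the function directly to the string column avoids both problems (Occurrence is a made-up output column name):
from ayx import Alteryx
import math
from datetime import datetime

data = Alteryx.read("#1")

def get_weekday_occurrences(date_string):
    # Day of the year (1-366), then which 7-day block it falls into.
    year_day = datetime.strptime(date_string, '%Y_%m_%d').timetuple().tm_yday
    return math.ceil(year_day / 7)

# Apply to each value while it is still a string; no axis=1 needed.
data['Occurrence'] = data['Name'].apply(get_weekday_occurrences)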

pyspark - 'DataFrame' object has no attribute 'map'

I have the following summary for a dataset, using pyspark on Databricks:
OrderMonthYear                  SaleAmount
2012-11-01T00:00:00.000+0000    473760.5700000001
2010-04-01T00:00:00.000+0000    490967.0900000001
I'm getting a dataframe error from this map function, which is meant to convert OrderMonthYear into an integer type:
results = summary.map(lambda r: (int(r.OrderMonthYear.replace('-','')), r.SaleAmount)).toDF(["OrderMonthYear","SaleAmount"])
any ideas?
AttributeError: 'DataFrame' object has no attribute 'map'
Found a solution here: Pyspark date yyyy-mmm-dd conversion
from pyspark.sql.functions import col, unix_timestamp, from_unixtime, date_format

# Parse OrderMonthYear (the pattern must match the column's actual string
# format), then render it as yyyyMMdd.
df = summary.withColumn('date', from_unixtime(unix_timestamp("OrderMonthYear", 'yyyy-MMM')))
df2 = df.withColumn("new_date_str", date_format(col("date"), "yyyyMMdd"))
display(df2)
Thank you #mck for the help! Cheers.
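For completeness: the original map-based line also works if you go through the underlying RDD, since map lives on RDDs rather than DataFrames in Spark 2+. A sketch, assuming OrderMonthYear is stored as a string like the values shown above:
# DataFrame has no .map in Spark 2.x; .rdd exposes the RDD of Rows.
results = summary.rdd.map(
    lambda r: (int(r.OrderMonthYear[:10].replace('-', '')), r.SaleAmount)  # '2012-11-01...' -> 20121101
).toDF(["OrderMonthYear", "SaleAmount"])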

How to convert pandas code using .str and .split to Pyspark

I wrote the following code using pandas:
df['last_two'] = df['text'].str[-2:]
df['before_hyphen'] = df['text'].str.split('-').str[0]
df['new_text'] = df['before_hyphen'].astype(str) + "-" + df['last_two'].astype(str)
But when I run it on a spark dataframe I get the following error:
TypeError: startPos and length must be the same type
I know I could just convert the df to pandas, run the code, and then convert it back to a spark df, but I wonder if there's a better way? Thanks
You can try the string functions below:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'last_two', F.expr('substring(text, -2)')
).withColumn(
    'before_hyphen', F.substring_index('text', '-', 1)
).withColumn(
    'new_text', F.concat_ws('-', 'before_hyphen', 'last_two')
)
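A quick sanity check of the logic, with a made-up row:
df = spark.createDataFrame([('abcd-xyz',)], ['text'])
# After the three withColumn calls above, the row reads:
# text='abcd-xyz', last_two='yz', before_hyphen='abcd', new_text='abcd-yz'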

Attribute error while creating list from string values

I have imported an Excel file with some data and removed the missing values.
df = pd.read_excel (r'file.xlsx', na_values = missing_values)
I'm trying to split the string values into lists for later actions.
df['GENRE'] = df['GENRE'].map(lambda x: x.split(','))
df['ACTORS'] = df['ACTORS'].map(lambda x: x.split(',')[:3])
df['DIRECTOR'] = df['DIRECTOR'].map(lambda x: x.split(','))
But it gives me the following error: AttributeError: 'list' object has no attribute 'split'
I've done the same with a CSV and it worked. Could it be because it's Excel?
I'm sure it's simple but I can't get my head around it.
Try using str.split, the pandas way. Unlike map with a plain lambda, the .str accessor returns NaN for values that are not strings (for example, cells that were already split into lists) instead of raising:
df['GENRE'] = df['GENRE'].str.split(',')
df['ACTORS'] = df['ACTORS'].str.split(',').str[:3]
df['DIRECTOR'] = df['DIRECTOR'].str.split(',')
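One likely cause of the original error, sketched with made-up data: if the split already ran once (say, by re-executing a notebook cell), the column now holds lists, and calling .split on a list raises. The .str accessor sidesteps this:
import pandas as pd

s = pd.Series([['Action', 'Drama'], 'Comedy,Horror'])  # first cell already split
# s.map(lambda x: x.split(','))  # AttributeError: 'list' object has no attribute 'split'
print(s.str.split(','))          # the list cell becomes NaN instead of raising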

Converting string list to Python dataframe - pyspark python sparksql

I have the following Python / Pyspark code:
sql_command = ''' query '''
df = spark.sql(sql_command)
ls_colnames = df.schema.names
ls_colnames
['id', 'level1', 'level2', 'level3', 'specify_facts']
cSchema = StructType([
    StructField("colname", StringType(), False)
])
df_colnames = spark.createDataFrame(dataset_array, schema=cSchema)
File "/opt/mapr/spark/spark-2.1.0/python/pyspark/sql/types.py", line
1366, in _verify_type
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj))) TypeError: StructType can not accept object 'id'
in type class 'str'
What can I do to get a spark object of the colnames?
Not sure if I have understood your question correctly, but if you are trying to create a dataframe from the given list, you can use the code below.
from pyspark.sql import Row

l = ['id', 'level1', 'level2', 'level3', 'specify_facts']
rdd1 = sc.parallelize(l)
row_rdd = rdd1.map(lambda x: Row(x))  # wrap each name in a one-field Row
sqlContext.createDataFrame(row_rdd, ['col_name']).show()
Hope it Helps.
Regards,
Neeraj
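If you'd rather keep the createDataFrame-with-schema approach from the question, the StructType error goes away once each string is wrapped as a one-field record. A sketch assuming ls_colnames from above:
from pyspark.sql.types import StructType, StructField, StringType

cSchema = StructType([StructField("colname", StringType(), False)])
# Wrap each name in a one-element tuple so it matches the one-field schema.
df_colnames = spark.createDataFrame([(c,) for c in ls_colnames], schema=cSchema)
df_colnames.show()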
