How to convert a list of arrays to a Spark dataframe - Python

Suppose I have a list:
x = [[1,10],[2,14],[3,17]]
I want to convert x to a Spark dataframe with two columns id (1,2,3) and value (10,14,17).
How could I do that?
Thanks

x = [[1,10],[2,14],[3,17]]
df = sc.parallelize(x).toDF(['ID','VALUE'])
df.show()

Alternatively, you can create it directly using SparkSession:
x = [[1,10],[2,14],[3,17]]
df = spark.createDataFrame(data=x, schema = ["id","value"])
df.printSchema()
df.show()
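If you want to control the column types instead of relying on Spark's inference (Python ints come out as bigint), you can also pass an explicit StructType schema. A minimal sketch:
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", IntegerType(), True)
])
df = spark.createDataFrame(x, schema=schema)
df.printSchema()  # id and value are now int rather than the inferred bigint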

Related

extracted data from sql for processing using python

I have saved out a data column as follows:
[[A,1], [B,5], [C,18]....]
I was hoping to group A, B, C (as shown above) into Category and 1, 5, 18 into Values/Series for updating my PowerPoint chart using python-pptx.
Example:
Category  Values
A         1
B         5
Is there any way I can do it? Currently the above example is extracted as strings, so I believe I have to convert it to lists first?
Thanks in advance!
Try to parse your string (a list of lists in text form), then create your dataframe from the real list:
import pandas as pd
import re

s = '[[A,1], [B,5], [C,18]]'
cols = ['Category', 'Values']
# strip the outer brackets, capture each inner [...] group, then split on the comma
data = [row.split(',') for row in re.findall(r'\[([^]]+)\]', s[1:-1])]
df = pd.DataFrame(data, columns=cols)
print(df)
# Output:
  Category Values
0        A      1
1        B      5
2        C     18
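Note that everything parsed out of the string is still text; if the chart needs numeric values, a quick cast afterwards should do it (assuming the column parses cleanly):
df['Values'] = df['Values'].astype(int)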
You should be able to just use pandas.DataFrame and pass in your data, unless I'm misunderstanding the question. Anyway, try:
df = pandas.DataFrame(data=d, columns=['Category', 'Values'])
where d is your list of lists (or tuples).
from prettytable import PrettyTable

column = [["A", 1], ["B", 5], ["C", 18]]
columnname = []
columnvalue = []
t = PrettyTable(['Category', 'Values'])
for data in column:
    columnname.append(data[0])    # collect categories
    columnvalue.append(data[1])   # collect values
    t.add_row([data[0], data[1]])
print(t)
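If the end goal is the python-pptx chart update mentioned in the question, the two lists built above can feed the chart data directly. A sketch, assuming you already have a chart object from your slide and that the values are numeric:
from pptx.chart.data import CategoryChartData

chart_data = CategoryChartData()
chart_data.categories = columnname               # ['A', 'B', 'C']
chart_data.add_series('Series 1', columnvalue)   # [1, 5, 18]
chart.replace_data(chart_data)                   # chart comes from your existing slide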

Getting null values when converting pyspark.rdd.PipelinedRDD object into Pyspark dataframe

My dataset has one column called 'eventAction'.
It has values like 'conversion', 'purchase', 'check-out', etc. I want to convert this column so that 'conversion' maps to 1 and all other categories map to 0.
I used lambda function in this way:
e1 = event1.rdd.map(lambda x: 1 if x.eventAction == 'conversion' else 0)
where event1 is the name of my spark dataframe.
When printing e1 I get this:
print(e1.take(5))
[0, 0, 0, 0, 0]
So I think the lambda function worked properly. But when I convert it to a PySpark dataframe, I get null values, as shown:
schema1 = StructType([StructField('conversion',IntegerType(),True)])
df = spark.createDataFrame(data=[e1],schema=schema1)
df.printSchema()
df.show()
It will be great if you can help me with this.
Thanks!
spark.createDataFrame expects an RDD of Row, not an RDD of integers. You need to map the RDD to Row objects before converting to a dataframe. Note that there is no need to add square brackets around e1.
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType

e1 = event1.rdd.map(lambda x: 1 if x.eventAction == 'conversion' else 0).map(lambda x: Row(x))
schema1 = StructType([StructField('conversion', IntegerType(), True)])
df = spark.createDataFrame(data=e1, schema=schema1)
That said, what you're trying to do can be done more easily with the Spark SQL when function. There is no need to drop down to the RDD API with a custom lambda function, e.g.
import pyspark.sql.functions as F
df = events.select(F.when(F.col('eventAction') == 'conversion', 1).otherwise(0).alias('conversion'))
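As a quick, self-contained illustration (the sample rows below are made up to mirror the values mentioned in the question):
import pyspark.sql.functions as F

events = spark.createDataFrame([("conversion",), ("purchase",), ("check-out",)], ["eventAction"])
df = events.select(F.when(F.col("eventAction") == "conversion", 1).otherwise(0).alias("conversion"))
df.show()
# +----------+
# |conversion|
# +----------+
# |         1|
# |         0|
# |         0|
# +----------+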

how to use list comprehension variable names in Pyspark dataframes

I am trying to build a list comprehension that has an iteration built into it. However, I have not been able to get this to work. What am I doing wrong?
Here is a trivial representation of what I am trying to do.
dataframe columns = ["code_number_1", "code_number_2", "code_number_3", "code_number_4", "code_number_5", "code_number_6", "code_number_7", "code_number_8",
cols = [0,3,4]
result = df.select([code_number_{f"{x}" for x in cols])
Addendum:
my ultimate goal is to do something like this:
col_buckets = ["code_1", "code_2", "code_3"]
amt_buckets = ["code_1_amt", "code_2_amt", "code_3_amt" ]
result = df.withColumn("max_amt_{col_index}", max(df.select(max(**amt_buckets**) for col_indices of amt_buckets if ***any of col indices of col_buckets*** =='01')))
[code_number_{f"{x}" for x in cols] is not valid list comprehension syntax.
Instead, try ["code_number_"+str(x) for x in cols], which generates the list of column names ['code_number_0', 'code_number_3', 'code_number_4'].
.select accepts strings/columns as arguments to select the matching fields from the dataframe.
Example:
from pyspark.sql.functions import col

df = spark.createDataFrame([("a","b","c","d","e")],
                           ["code_number_0","code_number_1","code_number_2","code_number_3","code_number_4"])
cols = [0, 3, 4]
# passing strings to select
result = df.select(["code_number_"+str(x) for x in cols])
# or passing columns to select
result = df.select([col("code_number_"+str(x)) for x in cols])
result.show()
#+-------------+-------------+-------------+
#|code_number_0|code_number_3|code_number_4|
#+-------------+-------------+-------------+
#| a| d| e|
#+-------------+-------------+-------------+
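The addendum is harder to pin down, but if the ultimate goal is a row-wise maximum across the bucketed amount columns, a comprehension combined with greatest may be a starting point. This is only a sketch; the column names and the omitted filter condition are assumptions taken from the addendum:
from pyspark.sql import functions as F

amt_buckets = ["code_1_amt", "code_2_amt", "code_3_amt"]  # assumed names from the addendum
result = df.withColumn("max_amt", F.greatest(*[F.col(c) for c in amt_buckets]))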

Appending Entire Column into Dictionary

I'm working with a dataframe. If a column in the dataframe has less than a certain percentage of blanks, I want to append that column to a dictionary (and eventually turn that dictionary into a new dataframe).
features = {}
percent_is_blank = 0.4
for column in df:
    x = df[column].isna().mean()
    if x < percent_is_blank:
        features[column] = ??
new_df = pd.DataFrame.from_dict([features], columns=features.keys())
What would go in the "??"
I think it is better to filter with DataFrame.loc:
new_df = df.loc[:, df.isna().mean() < percent_is_blank]
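A small self-contained example of how that filter behaves (the sample frame here is made up for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [1, np.nan, np.nan, np.nan],
                   "c": [1, 2, np.nan, 4]})
percent_is_blank = 0.4
new_df = df.loc[:, df.isna().mean() < percent_is_blank]
print(new_df.columns.tolist())  # ['a', 'c'] -- 'b' is 75% blank, so it is dropped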
In your solution it is possible to use:
for column in df:
    x = df[column].isna().mean()
    if x < percent_is_blank:
        features[column] = df[column]
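Since features then holds whole columns (each value is a Series), the new dataframe can be built with the plain constructor rather than from_dict:
new_df = pd.DataFrame(features)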

pyspark - Chaining a .orderBy to a .read method

Say you have something like the following code:
df = sqlContext.read.parquet('s3://somebucket/some_parquet_file')
How would you chain an orderBy onto that object?
df = df.orderBy(df.some_col)
To make it something like:
df = sqlContext.read.parquet('s3://somebucket/some_parquet_file').orderBy(?.some_col)
You can give the column name as a string or a list of strings:
df = sqlContext.read.parquet('s3://somebucket/some_parquet_file').orderBy("some_col")
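If you need a descending sort or more than one key, orderBy also accepts column expressions and an ascending flag. A short sketch; the second column name is made up for illustration:
from pyspark.sql.functions import desc

df = sqlContext.read.parquet('s3://somebucket/some_parquet_file').orderBy(desc("some_col"))

# or several keys at once
df = sqlContext.read.parquet('s3://somebucket/some_parquet_file') \
    .orderBy(["some_col", "other_col"], ascending=[True, False])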
