Can data loaded in as newAPIHadoopRDD be converted into a DataFrame? - python

I'm using PySpark to load data from Google BigQuery.
I've loaded data by using:
dfRates = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
where conf is defined as in the example at https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example.
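For reference, the conf in that example is a plain Python dict of BigQuery connector settings; the rough sketch below is reconstructed from memory of the linked page, so the key names and values are placeholders/assumptions and should be checked against it:
# Rough sketch of the connector configuration (placeholder / assumed values).
conf = {
    "mapred.bq.project.id": "<your-project-id>",
    "mapred.bq.gcs.bucket": "<your-gcs-bucket>",
    "mapred.bq.temp.gcs.path": "gs://<your-gcs-bucket>/tmp",
    "mapred.bq.input.project.id": "<input-project-id>",
    "mapred.bq.input.dataset.id": "<input-dataset-id>",
    "mapred.bq.input.table.id": "<input-table-id>",
}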
I need this data as a DataFrame, so I tried,
row = Row(['userId','accoId','rating']) # or row = Row(('userId','accoId','rating'))
dataRDD = dfRates.map(row).toDF()
and
dataRDD = sqlContext.createDataFrame(dfRates,['userId','accoId','rating'])
But it does not convert the data into a DataFrame. Is there a way to convert it into a DataFrame?

As long as the types can be represented using Spark SQL types, there is no reason it couldn't be. The only problem here seems to be your code.
newAPIHadoopRDD returns an RDD of pairs (tuples of length two). In this particular context it looks like you'll get (int, str) in Python, which clearly cannot be unpacked into ['userId', 'accoId', 'rating'].
According to the doc you've linked, com.google.gson.JsonObject is represented as a JSON string, which can be parsed either on the Python side using the standard json module:
import json

from pyspark.sql import Row

def parse(v, fields=["userId", "accoId", "rating"]):
    row = Row(*fields)
    try:
        parsed = json.loads(v)
    except json.JSONDecodeError:
        parsed = {}
    return row(*[parsed.get(x) for x in fields])

# The RDD contains (key, json_string) pairs, so parse the values only.
dfRates.values().map(parse).toDF()
or on the Scala / DataFrame side using get_json_object:
from pyspark.sql.functions import col, get_json_object

dfRates.toDF(["id", "json_string"]).select(
    # This assumes you expect a userId field
    get_json_object(col("json_string"), "$.userId"),
    ...
)
Please note the differences in the syntax I've used to define and create rows.
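To make that difference concrete, here is a minimal sketch (assuming a plain PySpark session): Row(*fields) defines a reusable row "class" with named fields, while Row([...]) is just a single row whose only value is the list itself.
from pyspark.sql import Row

# Row(*fields): a row "class" with named fields, called later with values.
rating = Row("userId", "accoId", "rating")
print(rating(1, 2, 5.0))              # Row(userId=1, accoId=2, rating=5.0)

# Row([...]): one row whose single value is the list, so it cannot serve as a schema.
print(Row(['userId', 'accoId', 'rating']))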

hbase table rows:
hbase(main):008:0> scan 'test_hbase_table'
ROW COLUMN+CELL
dalin column=cf:age, timestamp=1464101679601, value=40
tangtang column=cf:age, timestamp=1464101649704, value=9
tangtang column=cf:name, timestamp=1464108263191, value=zitang
2 row(s) in 0.0170 seconds
here we go
import json
host = '172.23.18.139'
table = 'test_hbase_table'
conf = {"hbase.zookeeper.quorum": host, "zookeeper.znode.parent": "/hbase-unsecure", "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)
hbase_rdd1 = hbase_rdd.flatMapValues(lambda v: v.split("\n"))
and here are the results:
tt = sqlContext.jsonRDD(hbase_rdd1.values())
In [113]: tt.show()
+------------+---------+--------+-------------+----+------+
|columnFamily|qualifier| row| timestamp|type| value|
+------------+---------+--------+-------------+----+------+
| cf| age| dalin|1464101679601| Put| 40|
| cf| age|tangtang|1464101649704| Put| 9|
| cf| name|tangtang|1464108263191| Put|zitang|
+------------+---------+--------+-------------+----+------+
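Note that sqlContext.jsonRDD is no longer available in newer Spark releases; on Spark 2.x and later the same conversion can be done by passing the RDD of JSON strings to spark.read.json. A minimal sketch, assuming a SparkSession named spark:
# Spark 2.x+ equivalent of sqlContext.jsonRDD
tt = spark.read.json(hbase_rdd1.values())
tt.show()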

Related

Convert list of strings to list of json objects in pyspark

def tranform_doc(docs):
    json_list = []
    print(docs)
    for doc in docs:
        json_doc = {}
        json_doc["customKey"] = doc
        json_list.append(json_doc)
    return json_list

df.groupBy("colA") \
    .agg(custom_udf(collect_list(col("colB"))).alias("customCol"))
First Hurdle:
Input: ["str1","str2","str3"]
Output: [{"customKey":"str1"},{"customKey":"str2"},{"customKey":"str3"}]
Second Hurdle:
The columns in the agg collect_list change dynamically, so how do I adjust the schema dynamically?
When the number of elements in the list changes, I receive an error:
Input row doesn't have expected number of values required by the schema. 1 fields are required while 3 values are provided
What I did:
def tranform_doc(agg_docs):
    return json_list
    ## When I failed to get a list of JSON, I tried just returning the original list of strings

schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", StringType()),
    StructField("col3", StringType())
])
custom_udf = udf(tranform_doc, schema)

df.groupBy("colA") \
    .agg(custom_udf(collect_list(col("colB"))).alias("customCol"))
Output I got:
{"col2":"str1","col1":"str2","col3":"str3"}
I'm struggling to get the required list of JSON strings and to make it dynamic with respect to the number of elements in the list.
No UDF needed. You can convert colB to a struct before collect_list.
import pyspark.sql.functions as F

df2 = df.groupBy('colA').agg(
    F.to_json(
        F.collect_list(
            F.struct(F.col('colB').alias('customKey'))
        )
    ).alias('output')
)
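As a quick sanity check, a minimal sketch with made-up sample data (assuming an active SparkSession named spark); the order of elements within the collected array is not guaranteed:
# Hypothetical sample data to show the shape of the result.
df = spark.createDataFrame([('a', 'str1'), ('a', 'str2'), ('a', 'str3')], ['colA', 'colB'])
df2 = df.groupBy('colA').agg(
    F.to_json(F.collect_list(F.struct(F.col('colB').alias('customKey')))).alias('output')
)
df2.show(truncate=False)
# output: [{"customKey":"str1"},{"customKey":"str2"},{"customKey":"str3"}]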

Web scraping, reading data frames, and exporting to csv

The goal is to scrape PokemonDB, create a DataFrame of the Pokemon data (number, name, primary type, and secondary type), separate the two types into their own columns, and export it as a CSV file.
I'm stuck on accessing the 'dex' dataframe, specifically the contents of the columns. Am I using ".loc" correctly?
And then there is separating the two types into their own columns. I know I must use a space " " as a delimiter, but I'm not sure how. I'm new to Pandas.
This is what I have:
import pandas as pd
import requests
page = requests.get("https://pokemondb.net/pokedex/all")
dex = pd.read_html(page.text, attrs={'id': 'pokedex'}, index_col='#')
column_label_list = list(dex[0].columns)
NationalNo = column_label_list[0]
Name = column_label_list[1]
Type = column_label_list[2]
numbers_list = dex.loc["#"]
names_list = dex.loc["Name"]
types1_list = dex.loc["Type"]
pokemon_list = pd.DataFrame(
    {
        NationalNo: numbers_list,
        Name: names_list,
        Type: types1_list,
        # 'Type2': types2_list,
    })
print(pokemon_list)
# pokemon_list.to_csv('output.csv', encoding='utf-8-sig')
The result should look like this:
output.csv
# | Name | Type1 | Type2 |
__|_________|_______|_______|
0 |Bulbasaur|Grass |Poison |
__|_________|_______|_______|
.
.
.
etc...
I hope what I'm trying to accomplish makes sense.
dex is a list of all the tables that exist in that HTML page. Since there is only one table, select the first one; you don't need to map it to a DataFrame yourself, just export it directly since it is already a DataFrame. Please consider using the code below:
import pandas as pd
import requests
page = requests.get("https://pokemondb.net/pokedex/all")
dex = pd.read_html(page.text, attrs = {'id': 'pokedex'}, index_col = '#')
dex[0].to_csv("output.csv", encoding='utf-8')
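The question also asks how to split the combined type column into Type1 and Type2. A rough sketch with pandas str.split, assuming the scraped values are space-separated as stated in the question (adjust the column name if it differs):
# Split the combined type column into two columns; Type2 stays empty for single-type Pokemon.
poke = dex[0].copy()
poke[['Type1', 'Type2']] = poke['Type'].str.split(' ', n=1, expand=True)
poke = poke.drop(columns=['Type'])
poke.to_csv('output.csv', encoding='utf-8-sig')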

Pyspark merge multiple columns into a json column

I asked the question a while back for python, but now I need to do the same thing in PySpark.
I have a dataframe (df) like so:
|cust_id|address |store_id|email |sales_channel|category|
-------------------------------------------------------------------
|1234567|123 Main St|10SjtT |idk#gmail.com|ecom |direct |
|4567345|345 Main St|10SjtT |101#gmail.com|instore |direct |
|1569457|876 Main St|51FstT |404#gmail.com|ecom |direct |
and I would like to combine the last 4 fields into one metadata field that is a json like so:
|cust_id|address |metadata |
-------------------------------------------------------------------------------------------------------------------
|1234567|123 Main St|{'store_id':'10SjtT', 'email':'idk#gmail.com','sales_channel':'ecom', 'category':'direct'} |
|4567345|345 Main St|{'store_id':'10SjtT', 'email':'101#gmail.com','sales_channel':'instore', 'category':'direct'}|
|1569457|876 Main St|{'store_id':'51FstT', 'email':'404#gmail.com','sales_channel':'ecom', 'category':'direct'} |
Here's the code I used to do this in python:
cols = [
    'store_id',
    'store_category',
    'sales_channel',
    'email'
]
df1 = df.copy()
df1['metadata'] = df1[cols].to_dict(orient='records')
df1 = df1.drop(columns=cols)
but I would like to translate this to PySpark code to work with a spark dataframe; I do NOT want to use pandas in Spark.
Use the to_json function to create the JSON object!
Example:
from pyspark.sql.functions import *
#sample data
df=spark.createDataFrame([('1234567','123 Main St','10SjtT','idk#gmail.com','ecom','direct')],['cust_id','address','store_id','email','sales_channel','category'])
df.select("cust_id","address",to_json(struct("store_id","category","sales_channel","email")).alias("metadata")).show(10,False)
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","category":"direct","sales_channel":"ecom","email":"idk#gmail.com"}|
+-------+-----------+----------------------------------------------------------------------------------------+
to_json by passing a list of columns:
ll=['store_id','email','sales_channel','category']
df.withColumn("metadata", to_json(struct([x for x in ll]))).drop(*ll).show()
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","email":"idk#gmail.com","sales_channel":"ecom","category":"direct"}|
+-------+-----------+----------------------------------------------------------------------------------------+
#Shu gives a good answer, here's a variant that works out slightly better for my use case. I'm going from Kafka -> Spark -> Kafka and this one liner does exactly what I want. The struct(*) will pack up all the fields in the dataframe.
# Packup the fields in preparation for sending to Kafka sink
kafka_df = df.selectExpr('cast(id as string) as key', 'to_json(struct(*)) as value')
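For completeness, the packed-up frame can then be written back out to Kafka with the standard Kafka sink; a rough sketch, where the broker address, topic, and checkpoint path are placeholders rather than values from the original post:
# Write the key/value frame to a Kafka topic (placeholder options).
query = (kafka_df.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("topic", "output-topic")
    .option("checkpointLocation", "/tmp/checkpoints/output-topic")
    .start())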

Extract data out of ORC using Pyspark

I have an ORC file that I am able to read into a DataFrame using PySpark 2.2.0:
from pyspark.context import SparkContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = spark.read.orc("s3://leadid-sandbox/krish/lead_test/")
The above df has the schema below:
root
|-- item: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Sample data looks like this (just a sample, not the entire dataset):
item
{http_Accept-Language={"s":"en-US"}, Win64={"n":"1"}, geoip_region={"s":"FL"}, Platform={"s":"Win7"}, geoip_postal_code={"s":"33432"}, JavaApplets={"n":"1"}, http_Accept={"s":"*/*"}, Version={"s":"11.0"}, Cookies={"n":"1"}, Platform_Version={"s":"6.1"}, http_Content-Type={"s":"application/x-www-form-urlencoded"}}
{http_Accept-Language={"s":"en-US"}, Win64={"n":"1"}, IFrames={"n":"1"}, geoip_region={"s":"CA"}, Platform={"s":"Win7"}, Parent={"s":"IE 11.0"}, http_Dnt={"n":"1"}}
So I exploded "item" like below:
expDf = df.select(explode("item"))
The above DataFrame has the schema below, and show(2) gives the following details:
root
|-- key: string (nullable = false)
|-- value: string (nullable = true)
+------------+----------+
| key| value|
+------------+----------+
|geoip_region|
{
"s": "FL"
}
|
| Tables|
{
"n": "1"
}
|
+------------+----------+
How can I select data out of this DataFrame? I have tried different ways but to no avail.
I would need 'geoip_region' with value 'FL', and so on.
Any help is appreciated.
I am not sure about your full use case, but if it is just about having access to the keys and values inside "item", you can do it using the following sample code:
row = df.select(df.item).collect()
The above line will give you a list of Row objects like [Row(item={http_Accept-Language={"s":"en-US"}, Win64={"n":"1"},....})]
Then to select all the values inside the row you can do: row_item = row[0]['item']
row_item['http_Accept'] will give you access to u"{"s":"en-US"}"
and eval(row_item['http_Accept']) will give you a dictionary from which you can get its keys and values.
I have just outlined the process, which can be written with loops to get all the key/values in iteration.
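A minimal sketch of that loop, using json.loads rather than eval to parse the stored JSON strings (the field names come from the sample data above):
import json

rows = df.select(df.item).collect()
for r in rows:
    item = r['item']                      # the map column comes back as a Python dict
    for key, raw in item.items():
        parsed = json.loads(raw)          # e.g. {"s": "FL"} or {"n": "1"}
        value = parsed.get('s', parsed.get('n'))
        print(key, value)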
Thanks Joshi for your response. For some reason I was getting a row[0] not found error in my code; I was running this in an AWS Glue environment, which may be the reason.
I got what I wanted using the code below.
# Creating a DataFrame of the raw file
df = spark.read.orc("s3://leadid-sandbox/krish/lead_test/")

# Creating a temp view called "leads" for the above DataFrame
df.createOrReplaceTempView("leads")

# Extracting the data using normal SQL from the above created temp view
tblSel = spark.sql("""
    SELECT get_json_object(item['token'], '$.s') AS token,
           get_json_object(item['account_code'], '$.s') AS account_code
    FROM leads
""").show()

How to convert Spark Streaming data into Spark DataFrame

So far, Spark hasn't provided a DataFrame abstraction for streaming data, but when I am doing anomaly detection, it is more convenient and faster to use DataFrames for data analysis. I have done that part, but when I try to do real-time anomaly detection on streaming data, problems appeared. I tried several ways and still could not convert the DStream into a DataFrame, nor convert the RDDs inside the DStream into DataFrames.
Here's part of my latest version of the code:
import sys
import re
from pyspark import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import KMeans, KMeansModel, StreamingKMeans
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import operator

sc = SparkContext(appName="test")
ssc = StreamingContext(sc, 5)
sqlContext = SQLContext(sc)
model_inputs = sys.argv[1]

def streamrdd_to_df(srdd):
    sdf = sqlContext.createDataFrame(srdd)
    sdf.show(n=2, truncate=False)
    return sdf

def main():
    indata = ssc.socketTextStream(sys.argv[2], int(sys.argv[3]))
    inrdd = indata.map(lambda r: get_tuple(r))
    Features = Row('rawFeatures')
    features_rdd = inrdd.map(lambda r: Features(r))
    features_rdd.pprint(num=3)
    streaming_df = features_rdd.flatMap(streamrdd_to_df)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
As you can see in the main() function, when I read the input streaming data using the ssc.socketTextStream() method, it generates a DStream; then I tried to convert each individual record in the DStream into a Row, hoping I could convert the data into a DataFrame later.
If I use pprint() to print out features_rdd here, it works, which makes me think each element of features_rdd is a batch RDD while the whole features_rdd is a DStream.
Then I created the streamrdd_to_df() method and hoped to convert each batch RDD into a DataFrame, but it gives me this error:
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
Any thoughts on how I can do DataFrame operations on Spark streaming data?
Spark has provided us with structured streaming, which can solve such problems. It can generate a streaming DataFrame, i.e. a DataFrame that is appended to continuously. Please check the link below:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Read the error carefully. It says there are no output operations registered. Spark is lazy and executes the job/code only when it has something to produce as a result. In your program there is no "output operation", and that is what Spark is complaining about.
Define a foreach() or a raw SQL query over the DataFrame and then print the results. It will work fine.
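Applied to the code in the question, a minimal sketch would be to replace the flatMap call with foreachRDD, which registers an output operation (get_tuple is the helper from the question and is assumed to exist):
def process_rdd(srdd):
    # Skip empty micro-batches so createDataFrame doesn't fail on an empty RDD.
    if not srdd.isEmpty():
        sdf = sqlContext.createDataFrame(srdd)
        sdf.show(n=2, truncate=False)

# foreachRDD is an output operation, so the StreamingContext now has something to execute.
features_rdd.foreachRDD(process_rdd)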
Why don't you use something like this:
def socket_streamer(session):  # returns a streaming DataFrame
    streamer = session.readStream \
        .format("socket") \
        .option("host", "localhost") \
        .option("port", 9999) \
        .load()
    return streamer
The output of the function above (or of readStream in general) is already a DataFrame; you don't need to worry about creating the df, Spark creates it automatically.
See the Spark Structured Streaming Programming Guide
After a year, I started to explore Spark 2.0 streaming methods and finally solved my anomaly detection problem. Here's my code in IPython; you can also see what my raw data input looks like.
There is no need to convert the DStream into an RDD. By definition, a DStream is a collection of RDDs. Just use the DStream's foreachRDD() method to loop over each RDD and take action.
val conf = new SparkConf().setAppName("Sample")
val spark = SparkSession.builder.config(conf).getOrCreate()

sampleStream.foreachRDD(rdd => {
  val sampleDataFrame = spark.read.json(rdd)
})
The spark documentation has an introduction to working with DStream. Basically, you have to use foreachRDD on your stream object to interact with it.
Here is an example (ensure you create a spark session object):
def process_stream(record, spark):
    if not record.isEmpty():
        df = spark.createDataFrame(record)
        df.show()

def main():
    sc = SparkContext(appName="PysparkStreaming")
    spark = SparkSession(sc)
    ssc = StreamingContext(sc, 5)
    dstream = ssc.textFileStream(folder_path)
    transformed_dstream = # transformations
    transformed_dstream.foreachRDD(lambda rdd: process_stream(rdd, spark))
    #                                          ^^^^^^^^^^
    ssc.start()
    ssc.awaitTermination()
With Spark 2.3 / Python 3 / Scala 2.11 (using Databricks), I was able to use temporary tables and a Scala code snippet (using magics in notebooks):
Python Part:
ddf.createOrReplaceTempView("TempItems")
Then on a new cell:
%scala
import java.sql.DriverManager
import org.apache.spark.sql.{ForeachWriter, Row}

// Create the query to be persisted...
val tempItemsDF = spark.sql("SELECT field1, field2, field3 FROM TempItems")

val itemsQuery = tempItemsDF.writeStream.foreach(new ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = {
    // Initializing DB connection / etc...
    true
  }

  def process(value: Row): Unit = {
    val field1 = value(0)
    val field2 = value(1)
    val field3 = value(2)
    // Processing values ...
  }

  def close(errorOrNull: Throwable): Unit = {
    // Closing connections etc...
  }
})

val streamingQuery = itemsQuery.start()
