PySpark SQL query equivalent functions - Python

I'm just starting to dive into PySpark.
There's a dataset with some values, shown below, that I'll use to illustrate the query I'm not able to create.
This is a sample of the actual dataset, which contains roughly 20K rows. I'm reading this CSV file in the PySpark shell as a data frame and trying to convert some basic SQL queries on this data to get hands-on practice. Below is one such query I'm not able to write:
1. Which country has the least number of a given Government type (4th column)?
There are other queries I've created myself that I can write in SQL, but I'm stuck on this one. If I get an idea for it, it will be fairly easy to adapt for the others.
This is the only line I've managed to come up with after much debugging:
df.filter(df.Government=='Democratic').select('Country').show()
I'm not sure how to approach this problem statement. Any ideas?

Here is how you can do it:
from pyspark.sql import Row
from pyspark.sql import functions as f

Demography = Row("City", "Country", "Population", "Government")
demo1 = Demography("a","AD",1.2,"Democratic")
demo2 = Demography("b","AD",1.2,"Democratic")
demo3 = Demography("c","AD",1.2,"Democratic")
demo4 = Demography("m","XX",1.2,"Democratic")
demo5 = Demography("n","XX",1.2,"Democratic")
demo6 = Demography("o","XX",1.2,"Democratic")
demo7 = Demography("q","XX",1.2,"Democratic")
demographic_data = [demo1,demo2,demo3,demo4,demo5,demo6,demo7]
demographic_data_df = spark.createDataFrame(demographic_data)
demographic_data_df.show(10)
+----+-------+----------+----------+
|City|Country|Population|Government|
+----+-------+----------+----------+
| a| AD| 1.2|Democratic|
| b| AD| 1.2|Democratic|
| c| AD| 1.2|Democratic|
| m| XX| 1.2|Democratic|
| n| XX| 1.2|Democratic|
| o| XX| 1.2|Democratic|
| q| XX| 1.2|Democratic|
+----+-------+----------+----------+
new = demographic_data_df.groupBy('Country').count().select('Country', f.col('count').alias('n'))
max_n = new.agg(f.max('n').alias('n'))
new.join(max_n, on="n", how="inner").show()
+---+-------+
| n|Country|
+---+-------+
| 4| XX|
+---+-------+
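A shorter route to the same result (a sketch on the same DataFrame): sort the counts and take the top row. Use asc() instead of desc() to get the country with the fewest rows, which is what the original question asks for.
demographic_data_df.groupBy('Country').count() \
    .orderBy(f.col('count').desc()) \
    .limit(1).show()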
The other option is to register the DataFrame as a temporary table and run normal SQL queries against it. You can register it like this (on Spark 2.0+, createOrReplaceTempView is the non-deprecated equivalent):
demographic_data_df.registerTempTable("demographic_data_table")
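Once the temp table exists, the same aggregation can be written as plain SQL (a sketch; the table name is the one registered above):
spark.sql("""
    SELECT Country, COUNT(*) AS n
    FROM demographic_data_table
    GROUP BY Country
    ORDER BY n ASC
    LIMIT 1
""").show()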
Hope that helps

Related

PySpark Self Join without alias

I have a DF and I want to left_outer join it with itself, but I would like to do it with the PySpark API rather than aliases.
So it is something like:
df = ...
df2 = df
df.join(df2, [df['SomeCol'] == df2['SomeOtherCol']], how='left_outer')
Interestingly, this is incorrect. When I run it I get this warning:
WARN Column: Constructing trivially true equals predicate, 'CAMPAIGN_ID#62L = CAMPAIGN_ID#62L'. Perhaps you need to use aliases.
Is there a way to do this without using aliases? Or a clean way with aliases? Aliases really make the code a lot dirtier compared to using the PySpark API directly.
The cleanest way of using aliases is as follows.
Given the following DataFrame:
df.show()
+---+----+---+
| ID|NAME|AGE|
+---+----+---+
| 1|John| 50|
| 2|Anna| 32|
| 3|Josh| 41|
| 4|Paul| 98|
+---+----+---+
In the following example, I am simply appending "2" to each of the column names so that each column has a unique name after the join.
from pyspark.sql import functions
df2 = df.select([functions.col(c).alias(c + "2") for c in df.columns])
df = df.join(df2, on = df['NAME'] == df2['NAME2'], how='left_outer')
df.show()
+---+----+---+---+-----+----+
| ID|NAME|AGE|ID2|NAME2|AGE2|
+---+----+---+---+-----+----+
| 1|John| 50| 1| John| 50|
| 2|Anna| 32| 2| Anna| 32|
| 3|Josh| 41| 3| Josh| 41|
| 4|Paul| 98| 4| Paul| 98|
+---+----+---+---+-----+----+
If I simply did df.join(df).select("NAME"), PySpark would not know which column I want to select, as they both have exactly the same name. This leads to errors like the following:
AnalysisException: Reference 'NAME' is ambiguous, could be: NAME, NAME.
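For completeness, here is a sketch of the alias-based variant the question mentions. It keeps a single set of column names and disambiguates them through the alias prefixes; only the column names from the sample DataFrame above are assumed.
from pyspark.sql import functions as F
a = df.alias("a")
b = df.alias("b")
# Columns are referenced through the alias prefix, so there is no ambiguity.
joined = a.join(b, F.col("a.NAME") == F.col("b.NAME"), how="left_outer")
joined.select("a.ID", "a.NAME", "b.AGE").show()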

How to remove elements with a UDF function and Pandas instead of using a for loop in Python

I have a problem: how do I write a for loop like this as a UDF function?
import cld3

ind_err = []
cnt = 0
cnt_NOT = 0
for index, row in pandasDF.iterrows():
    lan, probability, is_reliable, proportion = cld3.get_language(row["content"])
    if lan != 'en':
        cnt_NOT += 1
        ind_err.append(index)
    elif lan == 'en' and probability < 0.85:
        cnt += 1
        ind_err.append(index)
pandasDF = pandasDF.drop(labels=ind_err, axis=0)
This loop goes over all the rows of the pandas data frame and uses cld3 to see which ones are English and which are not, in order to clean the data up. It saves the indexes in a list so they can be deleted with .drop(labels=ind_err, axis=0).
This is the data that I have:
+--------------------+-----+
| content|score|
+--------------------+-----+
| what sapp| 1|
| right| 5|
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
+--------------------+-----+
This is the data that I would remove:
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
And this is the dataframe that I would end up with:
+--------------------+-----+
| content|score|
+--------------------+-----+
| what sapp| 1|
| right| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
+--------------------+-----+
The problem with this loop is that it is very slow, since there are 1,119,778 rows.
I know PySpark's withColumn is much faster, but I honestly can't figure out how to select the rows to delete and get them deleted.
How can I turn that for loop into a function and make language detection a lot faster?
My environment is Google Colab.
Many thanks in advance!!
You can probably do something like this:
from pyspark.sql import functions as F, types as T
import cld3

# assuming df is your dataframe
@F.udf(T.BooleanType())
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

df.where(is_english(F.col("content")))
Actually, I do not really understand why you want to go through Spark for that. Using pandas properly should solve your problem:
# I used your example, so I only have partial text...
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

df.loc[df["content"].apply(is_english)]
              content
8  unable to update ...
# That's the only line from your truncated example that matches your criteria
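If you do stay with Spark and speed matters, a pandas UDF (Spark 3.0+) is a reasonable middle ground. This is only a sketch, not part of the answer above, and it assumes cld3 is installed on the executors:
from pyspark.sql import functions as F, types as T
import pandas as pd
import cld3

@F.pandas_udf(T.BooleanType())
def is_english_batch(content: pd.Series) -> pd.Series:
    # cld3 is still called per string, but rows arrive in Arrow batches,
    # which avoids the per-row overhead of a plain Python UDF.
    def check(text):
        pred = cld3.get_language(text)
        if pred is None:
            return False
        lan, probability, _, _ = pred
        return lan == "en" and probability >= 0.85
    return content.apply(check)

df = df.where(is_english_batch(F.col("content")))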

Remove words from pyspark dataframe based on words from another pyspark dataframe

I want to remove the words in the secondary data frame from the main data frame.
This is the main data frame:
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i want to go|
|2020-09-02|i need a line hold |
|2020-09-02|i have the 60 packs|
|2020-09-02|hello want you teach|
+----------+--------------------+
Below is the single-column secondary data frame. The words in the secondary data frame need to be removed from the main data frame's cust_text column wherever they occur. For example, 'want' will be removed from every row in which it shows up in the main data frame (in this example it will be removed from the 1st and 4th rows).
+-------+
|column1|
+-------+
| want|
|because|
| need|
| hello|
| a|
| have|
| go|
+-------+
The event_dt column and each row will remain as they are; only the secondary data frame's words are removed from the main data frame in the result, as shown below:
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i to |
|2020-09-02|i line hold |
|2020-09-02|i the 60 packs |
|2020-09-02|you teach |
+----------+--------------------+
Help is appreciated!!
This should be a working solution for you: use array_except() to eliminate the unwanted strings. However, in order to do that, we need to do a little bit of preparation.
Create the DataFrame here:
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.createDataFrame([("2020-09-02","hi fine i want to go"),("2020-09-02","i need a line hold"), ("2020-09-02", "i have the 60 packs"), ("2020-09-02", "hello want you teach")],[ "col1","col2"])
Split the column into an array for later use:
df = df.withColumn("col2", F.split("col2", " "))
df.show(truncate=False)
df_lookup = spark.createDataFrame([(1,"want"),(1,"because"), (1, "need"), (1, "hello"),(1, "a"),(1, "give"), (1, "go")],[ "col1","col2"])
df_lookup.show()
Output
+----------+---------------------------+
|col1 |col2 |
+----------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|
|2020-09-02|[i, need, , a, line, hold] |
|2020-09-02|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach] |
+----------+---------------------------+
+----+-------+
|col1| col2|
+----+-------+
| 1| want|
| 1|because|
| 1| need|
| 1| hello|
| 1| a|
| 1| give|
| 1| go|
+----+-------+
Now, just groupBy the lookup dataframe and collect all the lookup values into a variable, as below:
df_lookup_var = df_lookup.groupBy("col1").agg(F.collect_set("col2").alias("col2")).collect()[0][1]
print(df_lookup_var)
x = ",".join(df_lookup_var)
print(x)
df = df.withColumn("filter_col", F.lit(x))
df = df.withColumn("filter_col", F.split("filter_col", ","))
df.show(truncate=False)
This does the trick
df = df.withColumn("ArrayColumn", F.array_except("col2", "filter_col"))
df.show(truncate = False)
+----------+---------------------------+-----------------------------------------+---------------------------+
|col1 |col2 |filter_col |ArrayColumn |
+----------+---------------------------+-----------------------------------------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|[need, want, a, because, hello, give, go]|[hi, fine, i, to] |
|2020-09-02|[i, need, , a, line, hold] |[need, want, a, because, hello, give, go]|[i, , line, hold] |
|2020-09-02|[i, have, the, , 60, packs]|[need, want, a, because, hello, give, go]|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach] |[need, want, a, because, hello, give, go]|[you, teach] |
+----------+---------------------------+-----------------------------------------+---------------------------+
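If you need the result back as a single string column, as in the question's expected output, concat_ws can glue the remaining words together (a small addition to the answer above):
df = df.withColumn("cust_text", F.concat_ws(" ", "ArrayColumn"))
df.select("col1", "cust_text").show(truncate=False)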

Use spark function result as input of another function

In my Spark application I have a dataframe with information like:
+------------------+---------------+
| labels | labels_values |
+------------------+---------------+
| ['l1','l2','l3'] | 000 |
| ['l3','l4','l5'] | 100 |
+------------------+---------------+
What I am trying to achieve is to create, given a label name as input, a single_label_value column that takes the value for that label from the labels_values column.
For example, for label='l3' I would like to retrieve this output:
+------------------+---------------+--------------------+
| labels | labels_values | single_label_value |
+------------------+---------------+--------------------+
| ['l1','l2','l3'] | 000 | 0 |
| ['l3','l4','l5'] | 100 | 1 |
+------------------+---------------+--------------------+
Here's what I am attempting to use:
selected_label='l3'
label_position = F.array_position(my_df.labels, selected_label)
my_df = my_df.withColumn(
    "single_label_value",
    F.substring(my_df.labels_values, label_position, 1)
)
But I am getting an error because the substring function does not like the label_position argument.
Is there any way to combine these function outputs without writing a UDF?
Hope this will work for you.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.getOrCreate()
mydata = [[['l1','l2','l3'], '000'], [['l3','l4','l5'], '100']]
df = spark.createDataFrame(mydata, schema=["labels", "labels_values"])
selected_label = 'l3'
df2 = df.select(
    "*",
    (array_position(df.labels, selected_label) - 1).alias("pos_val"))
df2.createOrReplaceTempView("temp_table")
df3 = spark.sql("select *, substring(labels_values, pos_val, 1) as val_pos from temp_table")
df3.show()
+------------+-------------+-------+-------+
|      labels|labels_values|pos_val|val_pos|
+------------+-------------+-------+-------+
|[l1, l2, l3]|          000|      2|      0|
|[l3, l4, l5]|          100|      0|      1|
+------------+-------------+-------+-------+
array_position gives the 1-based location of the value; subtracting 1 from it, as done above, gives the exact 0-based index.
Edited answer -> worked with a temp view. Still looking for a solution using the withColumn option. I hope this helps for now.
Edit 2 -> answer using the DataFrame API.
df2 = df.select(
    "*",
    (array_position(df.labels, selected_label) - 1).astype("int").alias("pos_val")
)
df3 = df2.withColumn("asked_col", expr("substring(labels_values, pos_val, 1)"))
df3.show()
Try maybe:
import pyspark.sql.functions as f
from pyspark.sql.functions import *
selected_label='l3'
df=df.withColumn('single_label_value', f.substring(f.col('labels_values'), array_position(f.col('labels'), lit(selected_label))-1, 1))
df.show()
(for spark version >=2.4)
I think lit() was the function you were missing - you can use it to pass constant values to Spark DataFrames.
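A minimal illustration of lit(), assuming the same df and selected_label as above: it wraps a plain Python value in a Column expression so it can be mixed with other column expressions.
from pyspark.sql import functions as f
# withColumn expects a Column, so the constant has to be wrapped with lit()
df = df.withColumn("selected_label", f.lit(selected_label))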

How to modify a column for a join in a Spark dataframe when the join keys are given as a list?

I have been trying to join two dataframes, df_1 and df_2, using a list of join keys, and I want to add the ability to join on a subset of those keys when one of the key values is null.
data1 = [[1,'2018-07-31',215,'a'],
[2,'2018-07-30',None,'b'],
[3,'2017-10-28',201,'c']
]
df_1 = sqlCtx.createDataFrame(data1,
['application_number','application_dt','account_id','var1'])
and
data2 = [[1,'2018-07-31',215,'aaa'],
[2,'2018-07-30',None,'bbb'],
[3,'2017-10-28',201,'ccc']
]
df_2 = sqlCtx.createDataFrame(data2,
['application_number','application_dt','account_id','var2'])
The code I use to join is this:
key_a = ['application_number','application_dt','account_id']
new = df_1.join(df_2,key_a,'left')
The output for the same is:
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b|null|
+------------------+--------------+----------+----+----+
My concern here is that, in the case where account_id is null, the join should still work by comparing the other two keys.
The required output should be like this:
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b| bbb|
+------------------+--------------+----------+----+----+
I have found a similar approach to do so by using the statement:
join_elem = ("df_1.application_number == df_2.application_number"
             "|df_1.application_dt == df_2.application_dt"
             "|F.coalesce(df_1.account_id,F.lit(0)) == F.coalesce(df_2.account_id,F.lit(0))").split("|")
join_elem_column = [eval(x) for x in join_elem]
But the design considerations do not allow me to use a full join expression, and I am stuck with using the list of column names as the join key.
I have been trying to find a way to accommodate this coalesce logic into that list itself, but have not had any success so far.
I would call this solution a workaround.
The issue here is that we have a null value for one of the keys in the DataFrame, and the OP wants the rest of the key columns to be used instead. Why not assign an arbitrary value to this null and then apply the join? Effectively this is the same as joining on the remaining two keys.
from pyspark.sql.functions import when, col

# Let's replace null with an arbitrary value which has
# little chance of occurring in the dataset, e.g. -100000
df_1 = df_1.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))
df_2 = df_2.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))

# Do a FULL join
df = df_1.join(df_2, ['application_number', 'application_dt', 'account_id'], 'full')

# Replace the arbitrary value back with null
df = df.withColumn('account_id', when(col('account_id') == -100000, None).otherwise(col('account_id')))
df.show()
df.show()
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 2| 2018-07-30| null| b| bbb|
| 3| 2017-10-28| 201| c| ccc|
+------------------+--------------+----------+----+----+
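For reference (not part of the answer above), Spark also has a null-safe equality operator. If a list of join conditions is acceptable, Column.eqNullSafe (Spark 2.3+) treats null == null as true and avoids the sentinel value entirely; a sketch with the question's dataframes:
cond = [
    df_1.application_number == df_2.application_number,
    df_1.application_dt == df_2.application_dt,
    df_1.account_id.eqNullSafe(df_2.account_id),
]
new = df_1.join(df_2, cond, 'left').select(df_1['*'], df_2['var2'])
new.show()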
