How can we use the Round function with Group by in pyspark? i have a spark dataframe through which i need to generate a result by using group by and round function??
data1 = [{'Name':'Jhon','ID':21.528,'Add':'USA','ID_2':'30.90'},
{'Name':'Joe','ID':3.69,'Add':'USA','ID_2':'12.80'},
{'Name':'Tina','ID':2.48,'Add':'IND','ID_2':'11.07'},
{'Name':'Jhon','ID':22.22, 'Add':'USA','ID_2':'34.87'},
{'Name':'Joe','ID':5.33,'Add':'INA','ID_2':'56.89'}]
a = sc.parallelize(data1)
In SQL query will be like
select count(ID) as newid, count(ID_2) as secondaryid, round(([newid]+
[secondaryid])/[newid]* 200,1) AS [NEW_PERCENTAGE] FROM DATA1
groupby Name
You cannot use round inside a groupby, you need to create a new column afterwards:
import pyspark.sql.functions as F
df = spark.createDataFrame(a)
(df.groupby('Name')
.agg(
F.count('ID').alias('newid'),
F.count('ID_2').alias('secondaryid')
)
.withColumn('NEW_PERCENTAGE', F.round(200 * (F.col('newid') + F.col('secondaryid')) / F.col('newid'), 1))
).show()
+----+-----+-----------+--------------+
|Name|newid|secondaryid|NEW_PERCENTAGE|
+----+-----+-----------+--------------+
| Joe| 2| 2| 400.0|
|Tina| 1| 1| 400.0|
|Jhon| 2| 2| 400.0|
+----+-----+-----------+--------------+
Related
I am relatively new to pyspark. I want to generate a dataframe column with dates between two given dates (constants) and add this column to an existing dataframe. What will be the efficient way?
I tried this but it didn't work:
df_add_column = df.withColumn("repeat", expr("split(repeat(',', diffDays), ',')")).select("*", posexplode("repeat").alias('DATE', "val")) .drop("repeat", "val", "diffDays").withColumn('DATE', expr("date_add('2018-01-01', 'DATE')"))
You can use sequence function to generate the dates then explode. Example:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,)], ["id"])
df1 = df.withColumn(
"date",
F.explode(F.expr("sequence(to_date('2021-02-01'), to_date('2021-02-08'), interval 1 day)"))
)
df1.show()
#+---+----------+
#| id| date|
#+---+----------+
#| 1|2021-02-01|
#| 1|2021-02-02|
#| 1|2021-02-03|
#| 1|2021-02-04|
#| 1|2021-02-05|
#| 1|2021-02-06|
#| 1|2021-02-07|
#| 1|2021-02-08|
#+---+----------+
I have a pyspark dataframe with multiple columns. For example the one below.
from pyspark.sql import Row
l = [('Jack',"a","p"),('Jack',"b","q"),('Bell',"c","r"),('Bell',"d","s")]
rdd = sc.parallelize(l)
score_rdd = rdd.map(lambda x: Row(name=x[0], letters1=x[1], letters2=x[2]))
score_card = sqlContext.createDataFrame(score_rdd)
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack| a| p|
|Jack| b| q|
|Bell| c| r|
|Bell| d| s|
+----+--------+--------+
Now I want to group by "name" and concatenate the values in every row for both columns.
I know how to do it but let's say there are thousands of rows then my code becomes very ugly.
Here is my solution.
import pyspark.sql.functions as f
t = score_card.groupby("name").agg(
f.concat_ws("",collect_list("letters1").alias("letters1")),
f.concat_ws("",collect_list("letters2").alias("letters2"))
)
Here is the output I get when I save it in a CSV file.
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack| ab| pq|
|Bell| cd| rs|
+----+--------+--------+
But my main concern is about these two lines of code
f.concat_ws("",collect_list("letters1").alias("letters1")),
f.concat_ws("",collect_list("letters2").alias("letters2"))
If there are thousands of columns then I will have to repeat the above code thousands of times. Is there a simpler solution for this so that I don't have to repeat f.concat_ws() for every column?
I have searched everywhere and haven't been able to find a solution.
yes, you can use for loop inside agg function and iterate through df.columns. Let me know if it helps.
from pyspark.sql import functions as F
df.show()
# +--------+--------+----+
# |letters1|letters2|name|
# +--------+--------+----+
# | a| p|Jack|
# | b| q|Jack|
# | c| r|Bell|
# | d| s|Bell|
# +--------+--------+----+
df.groupBy("name").agg( *[F.array_join(F.collect_list(column), "").alias(column) for column in df.columns if column !='name' ]).show()
# +----+--------+--------+
# |name|letters1|letters2|
# +----+--------+--------+
# |Bell| cd| rs|
# |Jack| ab| pq|
# +----+--------+--------+
I am new to Data Science and I am working on a simple self project using Google Colab. I took a data from a something.csv file and the file's columns are encrypted with ####, so I don't know the names of the columns. I took
Here is my attempt to solve it using pyspark
df = spark.read.csv('something.csv', header=True)
col = df[df.columns[len(df.columns)-1]] #Taking last column of data-frame
Now I want to iterate through rows of 'col' column and print the rows that has a number less than 100. I searched other stackoverflow posts but didn't understand how to iterate through the column with no name.
In pyspark use .filter method on dataframe to filter records < 100.
#sample data po column is int
df.show()
#+---+----+---+
#| id|name| po|
#+---+----+---+
#| 1| 2|300|
#| 2| 1| 50|
#+---+----+---+
last_col = df[df.columns[len(df.columns)-1]]
from pyspark.sql.functions import *
df.filter(last_col < 100).show()
#+---+----+---+
#| id|name| po|
#+---+----+---+
#| 2| 1| 50|
#+---+----+---+
UPDATE:
#getting rows into list
lst=df.filter(last_col < 100).select(last_col).rdd.flatMap(lambda x:x)
lst.collect()
#[50]
to get all rows into list
lst=df.filter(last_col < 100).rdd.flatMap(lambda x:x)
lst.collect()
#[u'2', u'1', 50]
I have a pyspark dataframe with a column I am trying to extract information from. To give you an example, the column is a combination of 4 foreign keys which could look like this:
Ex 1: 12345-123-12345-4
Ex 2: 5678-4321-123-12
I am trying to extract the last piece of the string, in this case the 4 & 12. Any idea on how I can do this?
I've tried the following:
df.withColumn("result", sf.split(sf.col("column_to_split"), '\_')[1])\
.withColumn("result", sf.col("result").cast('integer'))
However, the result for double digit values is null, and it only returns an integer for single digits (0-9)
Thanks!
For spark2.4,You should use element_at -1 on your array after split
from pyspark.sql import functions as sf
df.withColumn("result", sf.element_at(sf.split("column_to_split","\-"),-1).cast("int")).show()
+-----------------+------+
| column_to_split|result|
+-----------------+------+
|12345-123-12345-4| 4|
| 5678-4321-123-12| 12|
+-----------------+------+
Mohammad's answer is very clean and a nice solution. However if you need a solution for Spark versions < 2.4, you can utilise the reverse string functionality and take the first element, reverse it back and turn into an Integer, f.e.:
import pandas as pd
import pyspark.sql.functions as f
import pyspark.sql.types as t
df = pd.DataFrame()
df['column_to_split'] = ["12345-123-12345-4", "5678-4321-123-12"]
df = spark.createDataFrame(df)
df.withColumn("result",
f.reverse(f.split(f.reverse("column_to_split"), "-")[0]). \
cast(t.IntegerType())).show(2, False)
+-----------------+------+
|column_to_split |result|
+-----------------+------+
|12345-123-12345-4|4 |
|5678-4321-123-12 |12 |
+-----------------+------+
This is how to get the last digits of the serial number above:
serial_no = '12345-123-12345-4'
last_digit = serial_no.split('-')[-1]
print(last_digit)
So in your case, try:
df.withColumn("result", int(sf.col("column_to_split").split('-')[-1]))
If it doesn't work, please share the result.
Adding up another ways:
You can use .regexp_extract() (or) .substring_index() function also:
Example:
df.show()
#+-----------------+
#| column_to_split|
#+-----------------+
#|12345-123-12345-4|
#| 5678-4321-123-12|
#+-----------------+
df.withColumn("result",regexp_extract(col("column_to_split"),"([^-]+$)",1).cast("int")).\
withColumn("result1",substring_index(col("column_to_split"),"-",-1).cast("int")).\
show()
#+-----------------+------+-------+
#| column_to_split|result|result1|
#+-----------------+------+-------+
#|12345-123-12345-4| 4| 4|
#| 5678-4321-123-12| 12| 12|
#+-----------------+------+-------+
I have a data frame df with columns "col1" and "col2". I want to create a third column which uses one of the columns as in an exponent function.
df = df.withColumn("col3", 100**(df("col1")))*df("col2")
However, this always results in:
TypeError: unsupported operand type(s) for ** or pow(): 'float' and 'Column'
I understand that this is due to the function taking df("col1") as a "Column" instead of the item at that row.
If I perform
results = df.map(lambda x : 100**(df("col2"))*df("col2"))
this works, but I can't append to my original data frame.
Any thoughts?
This is my first time posting, so I apologize for any formatting problems.
Since Spark 1.4 you can usepow function as follows:
from pyspark.sql import Row
from pyspark.sql.functions import pow, col
row = Row("col1", "col2")
df = sc.parallelize([row(1, 2), row(2, 3), row(3, 3)]).toDF()
df.select("*", pow(col("col1"), col("col2")).alias("pow")).show()
## +----+----+----+
## |col1|col2| pow|
## +----+----+----+
## | 1| 2| 1.0|
## | 2| 3| 8.0|
## | 3| 3|27.0|
## +----+----+----+
If you use an older version a Python UDF should do the trick:
import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
my_pow = udf(lambda x, y: math.pow(x, y), DoubleType())
Just to complement the accepted answer: one can now do something very similar to what the OP tried to do, i.e., use the ** operator, or even Python's builtin pow:
from pyspark.sql import SparkSession
from pyspark.sql.functions import pow as pow_
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ), (2, ), (3, ), (4, ), (5, ), (6, )], 'n: int')
df = df.withColumn('pyspark_pow', pow_(df['n'], df['n'])) \
.withColumn('python_pow', pow(df['n'], df['n'])) \
.withColumn('double_star_operator', df['n'] ** df['n'])
df.show()
+---+-----------+----------+--------------------+
| n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
| 1| 1.0| 1.0| 1.0|
| 2| 4.0| 4.0| 4.0|
| 3| 27.0| 27.0| 27.0|
| 4| 256.0| 256.0| 256.0|
| 5| 3125.0| 3125.0| 3125.0|
| 6| 46656.0| 46656.0| 46656.0|
+---+-----------+----------+--------------------+
As one can see, both PySpark's and Python's pow return the same result, as well as the ** operator. It also works when one of the arguments is a scalar:
df = df.withColumn('pyspark_pow', pow_(2, df['n'])) \
.withColumn('python_pow', pow(2, df['n'])) \
.withColumn('double_star_operator', 2 ** df['n'])
df.show()
+---+-----------+----------+--------------------+
| n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
| 1| 2.0| 2.0| 2.0|
| 2| 4.0| 4.0| 4.0|
| 3| 8.0| 8.0| 8.0|
| 4| 16.0| 16.0| 16.0|
| 5| 32.0| 32.0| 32.0|
| 6| 64.0| 64.0| 64.0|
+---+-----------+----------+--------------------+
I believe the reason Python's pow now work on PySpark columns, is the fact that pow is equivalent to the ** operator when used with only two arguments (see docs, here), and the ** operator uses the objects own implementation of the power operation, if it is defined for the object being operated on (see this SO response here).
Apparently, PySpark's Column has the proper definitions for __pow__ operator (see source code for Column).
I am not sure why the ** operator did not work originally, but I am assuming it is related to the fact that - at the time - Column was defined differently.
The stack used for testing was Python 3.8.5 and PySpark 3.1.1, but I have seen this behavior for PySpark >= 2.4 as well.