newlist = []
for column in new_columns:
    count12 = new_df.loc[new_df[column].diff() == 1]

new_df2 = new_df2.groupby(['my_id','friend_id','family_id','colleage_id']).apply(len)
There is no option available in PySpark for getting the length of each group like this.
How can I convert this code to PySpark?
Thanks in advance.
Literally, apply(len) is just an aggregation function that counts the grouped elements from groupby. You can do the very same thing in basic PySpark syntax:
import pyspark.sql.functions as F
(df
.groupBy('my_id','friend_id','family_id','colleage_id')
.agg(F.count('*'))
.show()
)
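As a quick illustration of the equivalence, here is a minimal sketch with a made-up tiny dataset, assuming an active SparkSession named spark (the column names are just the ones from the question):
import pyspark.sql.functions as F

# made-up sample data just to show that F.count('*') matches pandas groupby(...).apply(len)
data = [(1, 2, 3, 4), (1, 2, 3, 4), (5, 6, 7, 8)]
cols = ['my_id', 'friend_id', 'family_id', 'colleage_id']
sdf = spark.createDataFrame(data, cols)

(sdf
 .groupBy(*cols)
 .agg(F.count('*').alias('count'))  # one row per group with its size
 .show()
)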
I have a dataframe with a single row and multiple columns. I would like to convert it into multiple rows.
I found a similar question here on Stack Overflow.
That question shows how it can be done in Scala, but I want to do this in PySpark. I tried to replicate the code in PySpark but wasn't able to.
I am not able to convert the Scala code below to Python:
import org.apache.spark.sql.Column
var ColumnsAndValues: Array[Column] = df.columns.flatMap { c => {Array(lit(c), col(c))}}
val df2 = df1.withColumn("myMap", map(ColumnsAndValues: _*))
In PySpark you can use the create_map function to create the map column, and a list comprehension with itertools.chain to get the equivalent of Scala's flatMap:
import itertools
from pyspark.sql import functions as F
columns_and_values = itertools.chain(*[(F.lit(c), F.col(c)) for c in df1.columns])
df2 = df1.withColumn("myMap", F.create_map(*columns_and_values))
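Not part of the original answer, but since the goal was multiple rows: once the map column exists you can explode it, which gives one row per original column. This assumes the column values can be coerced to a common type; the column_name/value aliases below are just illustrative names.
# explode the map so each (column name, value) pair becomes its own row
df_long = df2.select(F.explode("myMap").alias("column_name", "value"))
df_long.show()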
I have the below code:
import pandas as pd

Orders = pd.read_excel(r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Orders')
Returns = pd.read_excel(r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Returns')
Sum_value = pd.DataFrame(Orders['Sales']).sum()
Orders_Year = pd.DatetimeIndex(Orders['Order Date']).year
Orders.merge(Returns, how="inner", on="Order ID")
which gives the output as below.
My requirement is to use groupby and to see the output as below.
Can someone please help me with how to use groupby in the above code? I would like to see everything in a single line by using groupby.
Regards,
Bharath
You can do this by selecting the columns and assigning them to a new dataframe:
grouped = pd.DataFrame()
groupby = ['Year','Segment','Sales']
for i in groupby:
    grouped[i] = Orders[i]
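If an actual aggregation is wanted (the expected output image is not shown in the question), a minimal pandas groupby sketch, assuming the Year, Segment and Sales columns mentioned above, might look like this:
# derive the year, then sum Sales per (Year, Segment) group
Orders['Year'] = pd.DatetimeIndex(Orders['Order Date']).year
summary = Orders.groupby(['Year', 'Segment'], as_index=False)['Sales'].sum()
print(summary)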
Can anyone give me some guidance on pivot tables, using a Spark dataframe in Python?
I am getting the following error: Column is not iterable
Does anyone have an idea?
The pivot function pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of pivot: one that requires the caller to specify the list of distinct values to pivot on, and one that does not.
With the column values specified (Scala docs example): df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
Without specifying the column values (more concise but less efficient): df.groupBy("year").pivot("course").sum("earnings")
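Since the question is about PySpark, the equivalent calls there take a Python list instead of a Seq; a sketch, assuming a DataFrame df with year, course and earnings columns as in the docs example:
# with the pivot values listed explicitly
df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings")

# letting Spark discover the distinct values (more concise, less efficient)
df.groupBy("year").pivot("course").sum("earnings")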
You are proceeding in the right direction. Sample working code (Python 2):
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import col
>>> spark = SparkSession.builder.master('local').appName('app').getOrCreate()
>>> df = spark.read.option('header', 'true').csv('pivot.csv')
>>> df = df.withColumn('value1', col('value1').cast("int"))
>>> pdf = df.groupBy('thisyear').pivot('month', ['JAN','FEB']).sum('value1')
>>> pdf.show(10)
+--------+---+---+
|thisyear|JAN|FEB|
+--------+---+---+
| 2019| 3| 2|
+--------+---+---+
//pivot.csv
thisyear,month,value1
2019,JAN,1
2019,JAN,1
2019,FEB,1
2019,JAN,1
2019,FEB,1
I have a pyspark dataframe with a column of numbers. I need to sum that column and then have the result returned as an int in a Python variable.
df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])
I do the following to sum the column.
df.groupBy().sum()
But I get a dataframe back.
+-----------+
|sum(Number)|
+-----------+
| 130|
+-----------+
I would like 130 returned as an int stored in a variable to be used elsewhere in the program:
result = 130
I think the simplest way:
df.groupBy().sum().collect()
will return a list.
In your example:
In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130
If you want a specific column :
import pyspark.sql.functions as F
df.agg(F.sum("my_column")).collect()[0][0]
The simplest way, really:
df.groupBy().sum().collect()
But it is a very slow operation: avoid groupByKey, you should use the RDD API and reduceByKey instead:
df.rdd.map(lambda x: (1,x[1])).reduceByKey(lambda x,y: x + y).collect()[0][1]
I tried this on a bigger dataset and measured the processing time:
RDD and reduceByKey: 2.23 s
groupByKey: 30.5 s
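For reference, a rough sketch of how such a comparison could be timed (not the exact benchmark used above; it assumes the two-column df from the question and simply wraps each job in time.time() calls):
import time

start = time.time()
df.groupBy().sum().collect()
print("DataFrame groupBy().sum():", round(time.time() - start, 2), "s")

start = time.time()
df.rdd.map(lambda x: (1, x[1])).reduceByKey(lambda a, b: a + b).collect()
print("RDD reduceByKey:", round(time.time() - start, 2), "s")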
This is another way you can do this, using agg and collect:
sum_number = df.agg({"Number":"sum"}).collect()[0]
result = sum_number["sum(Number)"]
Similar to other answers, but without the use of a groupby or agg. I just select the column in question, sum it, collect it, and then grab the first two indices to return an int. The only reason I chose this over the accepted answer is that I am new to pyspark and was confused that the 'Number' column was not explicitly summed in the accepted answer. If I had to come back after some time and try to understand what was happening, syntax such as the below would be easier for me to follow.
import pyspark.sql.functions as f
df.select(f.sum('Number')).collect()[0][0]
You can also try using the first() function. It returns the first row of the dataframe, and you can access values of the respective columns using indices.
df.groupBy().sum().first()[0]
In your case, the result is a dataframe with single row and column, so above snippet works.
Select the column as an RDD, abuse keys() to get the value out of each Row (or use .map(lambda x: x[0])), then use the RDD sum:
df.select("Number").rdd.keys().sum()
SQL sum using selectExpr:
df.selectExpr("sum(Number)").first()[0]
The following should work:
df.groupBy().sum().rdd.map(lambda x: x[0]).collect()
Sometimes when you read a CSV file into a PySpark DataFrame, a numeric column may come in as string type, like '23'. In that case you should use pyspark.sql.functions.sum to get the result as an int, not the built-in sum():
import pyspark.sql.functions as F
df.groupBy().agg(F.sum('Number')).show()
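If the column really did come in as strings, casting it before the aggregation and then collecting the scalar is one way to get the int back; a small sketch, assuming the Number column from the question:
import pyspark.sql.functions as F

# cast the string column to int, aggregate, then pull out the single value
result = (df
          .withColumn('Number', F.col('Number').cast('int'))
          .agg(F.sum('Number'))
          .collect()[0][0])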
I'm using PySpark (Python 2.7.9/Spark 1.3.1) and have a dataframe GroupObject which I need to filter and sort in descending order. I am trying to achieve that with this piece of code:
group_by_dataframe.count().filter("`count` >= 10").sort('count', ascending=False)
But it throws the following error.
sort() got an unexpected keyword argument 'ascending'
In PySpark 1.3 the sort method doesn't take an ascending parameter. You can use the desc method instead:
from pyspark.sql.functions import col
(group_by_dataframe
.count()
.filter("`count` >= 10")
.sort(col("count").desc()))
or the desc function:
from pyspark.sql.functions import desc
(group_by_dataframe
.count()
.filter("`count` >= 10")
.sort(desc("count")))
Both methods can be used with Spark >= 1.3 (including Spark 2.x).
Use orderBy:
df.orderBy('column_name', ascending=False)
Complete answer:
group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)
http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html
By far the most convenient way is using this:
df.orderBy(df.column_name.desc())
Doesn't require special imports.
You can also use groupBy and orderBy as follows:
from pyspark.sql.functions import desc
dataFrameWay = df.groupBy("firstName").count().withColumnRenamed("count","distinct_name").sort(desc("distinct_name"))
In pyspark 2.4.4
1) group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)
2) from pyspark.sql.functions import desc
group_by_dataframe.count().filter("`count` >= 10").orderBy('count').sort(desc('count'))
No import is needed in 1), and 1) is short and easy to read, so I prefer 1) over 2).
RDD.sortBy(keyfunc, ascending=True, numPartitions=None)
an example:
words = rdd2.flatMap(lambda line: line.split(" "))
counter = words.map(lambda word: (word,1)).reduceByKey(lambda a,b: a+b)
print(counter.sortBy(lambda a: a[1],ascending=False).take(10))
PySpark added Pandas style sort operator with the ascending keyword argument in version 1.4.0. You can now use
df.sort('<col_name>', ascending = False)
Or you can use the orderBy function:
df.orderBy(df['<col_name>'].desc())
You can use pyspark.sql.functions.desc instead.
from pyspark.sql.functions import desc
g.groupBy('dst').count().sort(desc('count')).show()