Sort in descending order in PySpark - python

I'm using PySpark (Python 2.7.9/Spark 1.3.1) and have a dataframe GroupObject which I need to filter and sort in descending order. I'm trying to achieve that with this piece of code.
group_by_dataframe.count().filter("`count` >= 10").sort('count', ascending=False)
But it throws the following error.
sort() got an unexpected keyword argument 'ascending'

In PySpark 1.3 the sort method doesn't take an ascending parameter. You can use the desc method instead:
from pyspark.sql.functions import col
(group_by_dataframe
.count()
.filter("`count` >= 10")
.sort(col("count").desc()))
or the desc function:
from pyspark.sql.functions import desc
(group_by_dataframe
.count()
.filter("`count` >= 10")
.sort(desc("count")))
Both methods can be used with Spark >= 1.3 (including Spark 2.x).

Use orderBy:
df.orderBy('column_name', ascending=False)
Complete answer:
group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)
http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html

By far the most convenient way is using this:
df.orderBy(df.column_name.desc())
Doesn't require special imports.
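For example, a minimal sketch (assuming Spark 2.x with a SparkSession; the data and column names are made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data with a "count" column
df = spark.createDataFrame([("a", 5), ("b", 12), ("c", 30)], ["name", "count"])

# bracket access avoids the clash with the DataFrame.count() method
df.orderBy(df["count"].desc()).show()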

You can also use groupBy and orderBy as follows:
from pyspark.sql.functions import desc
dataFrameWay = df.groupBy("firstName").count().withColumnRenamed("count", "distinct_name").sort(desc("distinct_name"))

In PySpark 2.4.4:
1) group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)
2) from pyspark.sql.functions import desc
group_by_dataframe.count().filter("`count` >= 10").sort(desc('count'))
Option 1) needs no import and is short and easy to read, so I prefer 1) over 2).

RDD.sortBy(keyfunc, ascending=True, numPartitions=None)
An example:
words = rdd2.flatMap(lambda line: line.split(" "))
counter = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
print(counter.sortBy(lambda a: a[1], ascending=False).take(10))
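Since rdd2 above comes from code that isn't shown, here is a self-contained sketch of the same idea (the input lines are made up):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# hypothetical input standing in for rdd2
rdd2 = sc.parallelize(["spark sorts data", "spark counts words", "data data data"])

words = rdd2.flatMap(lambda line: line.split(" "))
counter = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# sort the (word, count) pairs by count, descending, and take the top 10
print(counter.sortBy(lambda a: a[1], ascending=False).take(10))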

PySpark added a Pandas-style sort operator with the ascending keyword argument in version 1.4.0. You can now use
df.sort('<col_name>', ascending=False)
Or you can use the orderBy function:
df.orderBy(df['<col_name>'].desc())

You can use pyspark.sql.functions.desc instead.
from pyspark.sql.functions import desc
g.groupBy('dst').count().sort(desc('count')).show()

Related

Pandas to Pyspark environment

newlist = []
for column in new_columns:
    count12 = new_df.loc[new_df[column].diff() == 1]
new_df2 = new_df2.groupby(['my_id','friend_id','family_id','colleage_id']).apply(len)
There is no option available in PySpark for getting the length of each group. How can I achieve this code in PySpark? Thanks in advance.
Literally, apply(len) is just an aggregation function that counts the grouped elements from groupby. You can do the very same thing in basic PySpark syntax:
import pyspark.sql.functions as F
(df
.groupBy('my_id','friend_id','family_id','colleage_id')
.agg(F.count('*'))
.show()
)
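If you want the count column to carry a clearer name, comparable to the length that apply(len) produced, you can alias it (group_len here is just an illustrative name):
import pyspark.sql.functions as F

(df
.groupBy('my_id','friend_id','family_id','colleage_id')
.agg(F.count('*').alias('group_len'))  # same count per group as apply(len)
.show()
)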

How do I update an entire column in a pandas Dataframe [duplicate]

I have a dataframe with 10 columns. I want to add a new column 'age_bmi' which should be a calculated column multiplying 'age' * 'bmi'. age is an INT, bmi is a FLOAT.
That then creates the new dataframe with 11 columns.
Something I am doing isn't quite right. I think it's a syntax issue. Any ideas?
Thanks
df2['age_bmi'] = df(['age'] * ['bmi'])
print(df2)
Try df2['age_bmi'] = df.age * df.bmi.
You're trying to call the dataframe as a function, when you need to get the values of the columns. You can access them by key, like a dictionary, or as a property if the column name is lowercase, has no spaces, and doesn't clash with a built-in DataFrame method.
Someone linked this in a comment the other day and it's pretty awesome. I recommend giving it a watch, even if you don't do the exercises: https://www.youtube.com/watch?v=5JnMutdy6Fw
As pointed out by Cory, you're calling a dataframe as a function; that won't work as you expect. Here are 4 ways to multiply two columns; in most cases you'd use the first method.
In [299]: df['age_bmi'] = df.age * df.bmi
or,
In [300]: df['age_bmi'] = df.eval('age*bmi')
or,
In [301]: df['age_bmi'] = pd.eval('df.age*df.bmi')
or,
In [302]: df['age_bmi'] = df.age.mul(df.bmi)
You have combined age & bmi inside a bracket and are treating df as a function rather than a dataframe. Here df should be used to access the columns as keys (or properties) of the DataFrame:
df2['age_bmi'] = df['age'] * df['bmi']
You can also use assign:
df2 = df.assign(age_bmi = df['age'] * df['bmi'])
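A quick end-to-end sketch with made-up values, just to show the result all of the above produce:
import pandas as pd

# hypothetical data: age as an int, bmi as a float
df = pd.DataFrame({'age': [25, 40, 31], 'bmi': [22.5, 27.1, 24.8]})

df['age_bmi'] = df['age'] * df['bmi']  # element-wise product of the two columns
print(df)
#    age   bmi  age_bmi
# 0   25  22.5    562.5
# 1   40  27.1   1084.0
# 2   31  24.8    768.8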

pandas dataframe merge expression using a less-than operator?

I was trying to merge two dataframes using a less-than operator. But I ended up using pandasql.
Is it possible to do the same query below using pandas functions?
(Records may be duplicated, but that is fine as I'm looking for something similar to cumulative total later)
sql = '''select A.Name,A.Code,B.edate from df1 A
inner join df2 B on A.Name = B.Name
and A.Code=B.Code
where A.edate < B.edate '''
df4 = sqldf(sql)
The suggested answer seems similar, but I couldn't get the result I expected. Also, the answer below looks very crisp.
Use:
df = df1.merge(df2, on=['Name','Code']).query('edate_x < edate_y')[['Name','Code','edate_y']]
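A small sketch with made-up frames, assuming edate is a comparable column (dates or integers), shows the same inner-join-plus-filter idea:
import pandas as pd

# hypothetical inputs
df1 = pd.DataFrame({'Name': ['A', 'A', 'B'], 'Code': [1, 1, 2], 'edate': [1, 5, 3]})
df2 = pd.DataFrame({'Name': ['A', 'B'], 'Code': [1, 2], 'edate': [4, 10]})

# inner join on Name/Code, then keep only rows where df1.edate < df2.edate
out = (df1.merge(df2, on=['Name', 'Code'], suffixes=('_x', '_y'))
.query('edate_x < edate_y')[['Name', 'Code', 'edate_y']])
print(out)
#   Name  Code  edate_y
# 0    A     1        4
# 2    B     2       10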

PySpark - Sum a column in dataframe and return results as int

I have a PySpark dataframe with a column of numbers. I need to sum that column and then have the result returned as an int in a Python variable.
df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])
I do the following to sum the column.
df.groupBy().sum()
But I get a dataframe back.
+-----------+
|sum(Number)|
+-----------+
| 130|
+-----------+
I would like 130 returned as an int stored in a variable to be used elsewhere in the program.
result = 130
I think the simplest way:
df.groupBy().sum().collect()
will return a list.
In your example:
In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130
If you want a specific column:
import pyspark.sql.functions as F
df.agg(F.sum("my_column")).collect()[0][0]
The simplest way really:
df.groupBy().sum().collect()
But it is a very slow operation: avoid groupByKey; you should use the RDD API with reduceByKey instead:
df.rdd.map(lambda x: (1,x[1])).reduceByKey(lambda x,y: x + y).collect()[0][1]
I tried it on a bigger dataset and measured the processing time:
RDD and ReduceByKey : 2.23 s
GroupByKey: 30.5 s
This is another way you can do this, using agg and collect:
sum_number = df.agg({"Number":"sum"}).collect()[0]
result = sum_number["sum(Number)"]
Similar to other answers, but without the use of a groupby or agg: I just select the column in question, sum it, collect it, and then grab the first two indices to return an int. The only reason I chose this over the accepted answer is that I am new to PySpark and was confused that the 'Number' column was not explicitly summed in the accepted answer. If I had to come back after some time and try to understand what was happening, syntax such as the one below would be easier for me to follow.
import pyspark.sql.functions as f
df.select(f.sum('Number')).collect()[0][0]
You can also try using first() function. It returns the first row from the dataframe, and you can access values of respective columns using indices.
df.groupBy().sum().first()[0]
In your case, the result is a dataframe with single row and column, so above snippet works.
Select the column as an RDD, abuse keys() to get the value in each Row (or use .map(lambda x: x[0])), then use the RDD sum:
df.select("Number").rdd.keys().sum()
SQL sum using selectExpr:
df.selectExpr("sum(Number)").first()[0]
The following should work:
df.groupBy().sum().rdd.map(lambda x: x[0]).collect()
Sometimes when you read a CSV file into a PySpark DataFrame, a numeric column may come in as a string type (e.g. '23'). In that case you should use pyspark.sql.functions.sum to get the result as an int, not Python's built-in sum():
import pyspark.sql.functions as F
df.groupBy().agg(F.sum('Number')).show()
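For example, a minimal sketch, assuming the Number column came in as strings after reading a CSV (the data here is made up):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical DataFrame where Number was read as strings
df = spark.createDataFrame([("A", "20"), ("B", "30"), ("D", "80")], ["Letter", "Number"])

# cast to int before summing, then pull the value out as a Python int
result = df.agg(F.sum(F.col("Number").cast("int"))).collect()[0][0]
print(result)  # 130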

Pandas df.apply does not modify DataFrame

I am just starting pandas so please forgive if this is something stupid.
I am trying to apply a function to a column, but it's not working and I don't see any errors either.
capitalizer = lambda x: x.upper()
for df in pd.read_csv(downloaded_file, chunksize=2, compression='gzip', low_memory=False):
    df['level1'].apply(capitalizer)
    print df
    exit(1)
The print shows the level1 column values the same as in the original csv, not uppercased. Am I missing something here?
Thanks
apply is not an in-place function - it does not modify values in the original object, so you need to assign the result back:
df['level1'] = df['level1'].apply(capitalizer)
Alternatively, you can use str.upper, it should be much faster.
df['level1'] = df['level1'].str.upper()
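Put back into the chunked loop from the question, a sketch might look like this (downloaded_file is a hypothetical path standing in for the one in the question):
import pandas as pd

downloaded_file = 'data.csv.gz'  # hypothetical path

for df in pd.read_csv(downloaded_file, chunksize=2, compression='gzip', low_memory=False):
    # assign the result back; apply/str.upper do not modify the frame in place
    df['level1'] = df['level1'].str.upper()
    print(df)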
df['level1'] = df['level1'].map(lambda x: x.upper())
You can use the above code to make your column uppercase.
