Pyspark: Remove first N elements from an RDD - python

I have a RDD like this:
[('FRO11987', 104),('SNA90258', 550),('ELE91550', 23),('ELE52966', 380),('FRO90334', 63),('FRO84225', 74),('SNA80192', 258)]
and I want to remove first N elements from it.
For example if N = 3, then the new RDD should be like this:
[('ELE52966', 380),('FRO90334', 63),('FRO84225', 74),('SNA80192', 258)]
I had to do some maps and reduces, and as you might know, map and reduce are only available on RDDs. More importantly, I have a task that must be done using RDDs only.
I'm new to Pyspark and don't know how to do it. Besides, I've looked for an answer but didn't find anything.
I'd like to just remove the first elements and not iterate through all elements.
Thank you for any help you can offer.

I have never used PySpark before, but is it possible to take the first n elements first, and then filter them out?
Here is some code I tried to write, though I am not sure whether it will work. I referred to: how to delete elements from one rdd based on other rdd and create new rdd in pyspark?
from pyspark import SparkContext
sc = SparkContext('local')
n = 3
rdd = sc.parallelize([('FRO11987', 104),('SNA90258', 550),('ELE91550', 23),('ELE52966', 380),('FRO90334', 63),('FRO84225', 74),('SNA80192', 258)])
first_list = rdd.take(n)  # take() already returns a Python list, so no collect() is needed
filtered_rdd = rdd.filter(lambda x: x not in first_list)
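A minimal sketch of an index-based alternative, in case collecting the first n elements is undesirable (for example, duplicate values would all be dropped by the filter above); it assumes the same rdd and n as in the snippet:

# zipWithIndex pairs each element with its position; drop the first n positions
filtered_rdd = (rdd.zipWithIndex()
                .filter(lambda pair: pair[1] >= n)
                .map(lambda pair: pair[0]))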

Related

Get specific elements of a list using a list iteration

I'm trying to build the below dataframe
df = pd.DataFrame(columns=['Year','Revenue','Gross Profit','Operating Profit','Net Profit'])
rep_vals = ['year','net_sales','gross_income','operating_income','profit_to_equity_holders']
for i in range(len(yearly_reports)):
    df.loc[i] = [yearly_reports[i].x for x in rep_vals]
However I get an error: 'Report' object has no attribute 'x'.
The below (brute force version) of the code works:
for i in range(len(yearly_reports)):
    df.loc[i] = [yearly_reports[i].year, yearly_reports[i].net_sales,
                 yearly_reports[i].gross_income, yearly_reports[i].operating_income,
                 yearly_reports[i].profit_to_equity_holders]
My issue is that I want to add many more columns, and I don't want to fetch every item from my yearly_reports into the dataframe. How can I iterate over just the values I want more efficiently?
Instead of using .x, use [x].
yearly_reports[i][x]
Also, it is probably a bad idea / not necessary / slow to iterate over your dataframe like this. Have a look at join/merge which might be a lot faster.
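A minimal sketch of the same idea, assuming yearly_reports is a list of Report objects; getattr is shown here in case the objects expose attributes rather than item access:

import pandas as pd

cols = ['Year', 'Revenue', 'Gross Profit', 'Operating Profit', 'Net Profit']
rep_vals = ['year', 'net_sales', 'gross_income', 'operating_income', 'profit_to_equity_holders']

# Build all rows first, then construct the dataframe once instead of growing it row by row
rows = [[getattr(report, attr) for attr in rep_vals] for report in yearly_reports]
df = pd.DataFrame(rows, columns=cols)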

Pyspark 'for' loop not filtering correctly a pyspark-sql dataframe using .filter()

I am trying to create a for loop in which I first filter a PySpark SQL dataframe, then convert the filtered dataframe to pandas, apply a function to it, and append the result to a list called results. My list contains a sequence of strings (which act as ids in the dataframe); in each iteration, the loop should take one string from the list and filter the dataframe down to the rows whose id is that string. Sample code:
results = []
for x in list:
    aux = df.filter("id='x'")
    final = function(aux, "value")
    results.append(final)
results
However, when I do aux.show() inside the loop it shows an empty dataframe. The dataframe is a time-series, and outside the loop the aux = df.filter("id='x'") transformation and the function run without problem; the issue is in the loop itself.
Does anyone know why this may be happening?
Try the code below. In your version, x is not substituted into the filter expression, so "id='x'" matches the literal string 'x' rather than the current id.
results = []
for x in list:
    aux = df.filter("id = '%s'" % x)
    final = function(aux, "value")
    results.append(final)
results
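A sketch of an equivalent loop using a column expression instead of string formatting; it assumes df and function from the question, and uses id_list as a stand-in name to avoid shadowing the built-in list:

from pyspark.sql import functions as F

results = []
for x in id_list:
    aux = df.filter(F.col("id") == x)  # column expression, no string interpolation needed
    final = function(aux, "value")
    results.append(final)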

Test Anova on multiple groups

I have the following dataframe:
I would like to use this code to compare the means between my entire dataframe:
F_statistic, pVal = stats.f_oneway(percentage_age_ss.iloc[:,0:1],
percentage_age_ss.iloc[:,1:2],
percentage_age_ss.iloc[:,2:3],
percentage_age_ss.iloc[:,3:4]) etc...
However, I don't want to write .iloc for every column because it takes too much time. Is there another way to do it?
Thanks
Get one array per column with a generator expression, then use the star syntax to expand them into the argument list:
stats.f_oneway(*(percentage_age_ss[col] for col in percentage_age_ss.columns))
or, just
stats.f_oneway(*(percentage_age_ss.T.values))
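A minimal runnable sketch of the star-unpacking call, using a small hypothetical frame in place of percentage_age_ss:

import pandas as pd
from scipy import stats

# hypothetical stand-in for percentage_age_ss
percentage_age_ss = pd.DataFrame({
    'g1': [1.0, 2.0, 3.0],
    'g2': [2.0, 2.5, 3.5],
    'g3': [1.5, 2.2, 2.9],
})

# one 1-D sample per column, unpacked into f_oneway
F_statistic, pVal = stats.f_oneway(*(percentage_age_ss[c] for c in percentage_age_ss.columns))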

Count RDD Pairs As Dictionary Pyspark

I have been working with a dataset that has been reduced to a following structure:
10,47,110,296,318,356,364,377,454,527,539,590,593,597,648,858,1097,1197,1206,1214,1221,1265,1291,1721,1961,2571,2628,2706,2716,3147,3578,3717,3793,4306,4993,5952,6539,7153,7438
Where each row of the RDD has the above structure.
I am attempting to count each pair within the row and insert the value to a dictionary. A sample output for this dictionary would be:
(10,47): 1, (10, 110):1, (10,296):1 etc.
I was able to get a basic implementation working, but it was taking ten minutes longer on larger datasets than a simpler non-dictionary approach in PySpark (I am practicing the pairs and stripes MapReduce algorithms).
Previously, I was calling my own reduce function that would iterate through all the combination of pairs and then emit the counts for that. Is there a better way to be doing this?
The end goal is to count each row of an RDD and have a dictionary for (val1,val2): count
With the above data example as an rdd called dataRDD I have been performing the following
pairCount = dataRDD.map(combinePairs)
Where combinePairs is defined as
from collections import defaultdict
import itertools

goodDict = defaultdict(int)

def combinePairs(data):
    data = data.split(',')
    for v in itertools.combinations(data, 2):
        first = v[0]
        second = v[1]
        pair = (first, second)
        goodDict[pair] = goodDict[pair] + 1
    return goodDict
Any suggestions greatly appreciated
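A sketch of one common alternative that keeps the counting inside Spark instead of in a shared dictionary, assuming dataRDD holds comma-separated lines as above:

from itertools import combinations

pair_counts = (dataRDD
               .flatMap(lambda line: combinations(line.split(','), 2))
               .map(lambda pair: (pair, 1))
               .reduceByKey(lambda a, b: a + b))

# collectAsMap() returns the result as a {(val1, val2): count} dictionary on the driver
pair_dict = pair_counts.collectAsMap()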

Generate one row from many rows into an RDD

I need to convert an RDD with two rows into an RDD with one row. Example:
rdd1=a
b
I need:
rdd2=(a,b)
How can I do this step in pyspark?
The question could be stupid but I'm new in spark.
"UPDATE"
This is in order to perform a cartesian product between rdd2 and rdd3, starting from rdd1. Like:
rdd3:(k,l)
(c,g)
(f,x)
I want this output:
rddOut:[(a,b),(k,l)]
[(a,b),(c,g)]
[(a,b),(f,x)]
Thanks in advance
Update to my answer:
from pyspark.streaming import StreamingContext

initRDD = sc.parallelize(list('aeiou')).map(lambda x: (x, ord(x))).collect()
ssc = StreamingContext(sc, batchDuration=3)
lines = ssc.socketTextStream('localhost', 9999)
items = lines.flatMap(lambda x: x.split())
counts = items.countByValue().map(lambda x: ([x] + initRDD))
It looks like broadcast rather than cartesian.
Can you explain a little bit more on your need? Having an RDD with a single row is not a good idea as you lose all parallelism.
If you want to collect the data by key, you can convert the RDD into an RDD of pairs (key and value). Then you can do reduceByKey in order to collect everything by the key to a list simply by having the reduce function be a list concatenation.
If my understanding of your question is correct, using flatMap for this will get you the required output.
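A minimal sketch of the cartesian route described in the question, assuming sc is an active SparkContext; as noted above, collecting and broadcasting the pair may be more natural than building a one-row RDD:

rdd1 = sc.parallelize(['a', 'b'])
single_row = tuple(rdd1.collect())            # ('a', 'b') as one Python tuple
rdd2 = sc.parallelize([single_row])           # an RDD with a single row
rdd3 = sc.parallelize([('k', 'l'), ('c', 'g'), ('f', 'x')])

rddOut = rdd2.cartesian(rdd3)                 # (('a', 'b'), ('k', 'l')), (('a', 'b'), ('c', 'g')), ...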
