Generate one row from many rows in an RDD - python

I need to convert an RDD with two rows into an RDD with one row. Example:
rdd1=a
b
I need:
rdd2=(a,b)
How can I do this step in PySpark?
The question might be stupid, but I'm new to Spark.
"UPDATE"
This is to performing cartesian between rdd2 and rdd3, starting from rdd1. Like:
rdd3:(k,l)
(c,g)
(f,x)
I want this output:
rddOut:[(a,b),(k,l)]
[(a,b),(c,g)]
[(a,b),(f,x)]
Thanks in advance

Update to my answer:
from pyspark.streaming import StreamingContext

# small static lookup table, collected to a plain Python list on the driver
initRDD = sc.parallelize(list('aeiou')).map(lambda x: (x, ord(x))).collect()
ssc = StreamingContext(sc, batchDuration=3)
lines = ssc.socketTextStream('localhost', 9999)
items = lines.flatMap(lambda x: x.split())
# prepend each (word, count) pair to the static list
counts = items.countByValue().map(lambda x: [x] + initRDD)
It looks like a broadcast rather than a cartesian product.
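If that is the intent, here is a minimal sketch of the broadcast idea (this is an assumption on my part, and init_bc is an illustrative name): ship the small static list to the executors as a broadcast variable instead of closing over the collected list directly.
# hypothetical sketch: broadcast the small static table built above
init_bc = sc.broadcast(initRDD)   # initRDD is the collected list from the snippet above
counts = items.countByValue().map(lambda kv: [kv] + init_bc.value)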

Can you explain a little more about what you need? Having an RDD with a single row is not a good idea, as you lose all parallelism.
If you want to collect the data by key, you can convert the RDD into an RDD of pairs (key and value). Then you can do reduceByKey to collect everything for a key into a list, simply by having the reduce function be list concatenation.
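A minimal sketch of that reduceByKey idea, assuming an existing SparkContext sc and made-up keys and values:
pairs = sc.parallelize([('k1', 'a'), ('k1', 'b'), ('k2', 'c')])
grouped = (pairs
           .map(lambda kv: (kv[0], [kv[1]]))   # wrap each value in a one-element list
           .reduceByKey(lambda a, b: a + b))   # the reduce function is list concatenation
print(grouped.collect())
# [('k1', ['a', 'b']), ('k2', ['c'])]  (order may vary)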

If my understanding of your question is correct, using flatMap for this will get you the required output.
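A minimal sketch of that idea for the original, non-streaming example, assuming an existing SparkContext sc, with rdd1 holding the rows 'a' and 'b' and rdd3 the pairs from the question (a plain map would work just as well here):
rdd1 = sc.parallelize(['a', 'b'])
rdd3 = sc.parallelize([('k', 'l'), ('c', 'g'), ('f', 'x')])
# collapse rdd1's two rows into a single tuple on the driver
pair = tuple(rdd1.collect())   # ('a', 'b')
# emit [pair, row] for every row of rdd3
rddOut = rdd3.flatMap(lambda row: [[pair, row]])
print(rddOut.collect())
# [[('a', 'b'), ('k', 'l')], [('a', 'b'), ('c', 'g')], [('a', 'b'), ('f', 'x')]]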

Related

Assigning unique IDs to strings

I am trying to build an elegant solution for assigning IDs, starting from 0, to the following data:
My attempt at first creating IDs for the 'Person' column is like this:
import pandas as pd

df = pd.DataFrame(
    {'Person': ['Tom Jones', 'Bill Smeegle', 'Silvia Geerea'],
     'PersonFriends': [['Bill Smeegle', 'Silvia Geerea'], ['Tom Jones'], ['Han Solo']]})
df['PersonID'] = df['Person'].astype('category').cat.codes
which produces an integer PersonID for each person.
Now I want to follow the same process for the 'PersonFriends' column, so that each list of friends becomes a list of IDs. How can I apply the same approach when each entry is a list of friends?
I have been able to do this via the hash() function on each name, but the ID generated is long and not very readable. Any help appreciated. Thanks.
Create a dict mapping each person to their ID, then apply it to every friend list:
id_map = dict(zip(df["Person"], df["PersonID"]))
df["FriendsID"] = df["PersonFriends"].apply(lambda x: [id_map.get(y) for y in x])

Pyspark: Remove first N elements from a RDD

I have a RDD like this:
[('FRO11987', 104),('SNA90258', 550),('ELE91550', 23),('ELE52966', 380),('FRO90334', 63),('FRO84225', 74),('SNA80192', 258)]
and I want to remove first N elements from it.
For example if N = 3, then the new RDD should be like this:
[('ELE52966', 380),('FRO90334', 63),('FRO84225', 74),('SNA80192', 258)]
I had to do some maps and reduces, and as you might know, map and reduce are only available on RDDs. But the more important reason is that I have a task that should be done only with RDDs.
I'm new to Pyspark and don't know how to do it. Besides, I've looked for an answer but didn't find anything.
I'd like to just remove the first elements and not iterate through all elements.
Thank you for any help you can offer.
I have never used PySpark before, but is it possible to get the first n elements first, and then filter them out?
Here is some code that I tried to write to implement this, but I am not sure whether it will work. I referred to: how to delete elements from one RDD based on another RDD and create a new RDD in PySpark?
from pyspark import SparkContext
sc = SparkContext('local')
n = 3
rdd = sc.parallelize([('FRO11987', 104),('SNA90258', 550),('ELE91550', 23),('ELE52966', 380),('FRO90334', 63),('FRO84225', 74),('SNA80192', 258)])
# take() already returns a plain Python list on the driver, so no collect() is needed
first_list = rdd.take(n)
filtered_rdd = rdd.filter(lambda x: x not in first_list)
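One caveat with this sketch is that filtering by value also drops any later duplicates of those first n elements. If removal strictly by position is wanted, a hedged alternative is to index the rows with zipWithIndex and keep everything from position n onward:
# alternative sketch: drop the first n elements by position
filtered_rdd = (rdd.zipWithIndex()
                   .filter(lambda pair: pair[1] >= n)
                   .map(lambda pair: pair[0]))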

How to get distinct dicts with nested list of RDD in Pyspark?

I have a similar question to:
How can I get a distinct RDD of dicts in PySpark?
However, there is some difference.
I have a dict with a key as string and a list as value in the following shape:
{"link0":["link1","link2",...]}.
So that in each of my RDD partitions dicts are stored.
The collect function gives me back a list of them:
[{"link0":["link1","link2",...]}, {"link1":["link2","link3",...]}, ...]
Assuming for example in partition one of my RDD I store:
[{"link0":["link1","link2"]}, {"link1":["link2","link3"]}] and
in partition two:
[{"link0":["link1","link2"]}, {"link3":["link4","link5"]}]
What I actually want to do is to get all distinct dicts over the RDD, same as in the question above:
[{"link0":["link1","link2"]}, {"link1":["link2","link3"]},
{"link3":["link4","link5"]}]
Yet when it comes to the lists in the values, I struggle with how to handle them.
Do you have any recommendations how to handle it?
I tried to apply the dict_to_string() method mentioned there, but I am not sure if that is really the right way to handle this.
Also I thought about changing the data structure altogether to a better one.
Do you have any ideas what might fit my purpose better?
After I have all the distinct key:[] pairs, I want to filter out all the unique links that appear in the value lists but are not already a key in any dict, and store them in a new list:
["link2", "link4", "link5"]
If you have any idea, I'd be happy to hear it!
Constructive help appreciated.
Thanks.
As said in the comment, the dicts always contain a single key and a list as value. You can try the following approach:
rdd = sc.parallelize([
{"link0":["link1","link2"]}, {"link1":["link2","link3"]},
{"link0":["link1","link2"]}, {"link3":["link4","link5"]}])
Task-1: find unique RDD elements:
Use flatMap to convert each dict into a tuple, turning the value from a list into a tuple so that the RDD elements are hashable; take distinct() and then map the RDD elements back to their original data structure:
rdd.flatMap(lambda x: [ (k,tuple(v)) for k,v in x.items() ]) \
.distinct() \
.map(lambda x: {x[0]:list(x[1])}) \
.collect()
#[{'link0': ['link1', 'link2']},
# {'link1': ['link2', 'link3']},
# {'link3': ['link4', 'link5']}]
Task-2: find unique links in values but excluded from keys of dictionaries:
Retrieve all unique keys into rdd1 and all unique values into rdd2, then do rdd2.subtract(rdd1):
rdd1 = rdd.flatMap(lambda x: x.keys()).distinct()
# ['link0', 'link1', 'link3']
rdd2 = rdd.flatMap(lambda x: [ v for vs in x.values() for v in vs ]).distinct()
# ['link1', 'link2', 'link3', 'link4', 'link5']
rdd2.subtract(rdd1).collect()
# ['link2', 'link5', 'link4']

How to convert a list in [ ] format to ( ) format

I have a large dataframe with a few hundred million records. I only want 10% of the df, so I am filtering it while reading. The filter condition is dynamic and changes from one experiment to another.
There is another df from which I am getting the filter values:
filter = "filter_condition in" + tuple(df1.select("xxx").rdd.flatMap(lambda x: x).collect())
The above snippet gives a list, say for example [1].
I am using the below query to read the large file:
large_df = (sqlContext.read.parquet(path).filter(filter))
When the tuple has more than one element the query works fine, but when the filter has only one value the tuple comes out as (1,) or (10293,) etc., and this causes an error while reading the large df, since the filter condition ends up being:
(sqlContext.read.parquet(path).filter("filter_condition in (1,)"))
Is there a way to convert the list [1] to the (1) format? Thanks.
It needs to be like that: a single element in parentheses is just parsed as parentheses, and you need the trailing comma to make a one-item tuple.
You can solve this by making a custom stringifying method:
def tuple_to_str(t):
    t = tuple(t)
    if len(t) == 1:
        return '({!r})'.format(t[0])
    return repr(t)
And doing:
filter = "filter_condition in" + tuple_to_str(
df1.select("xxx").rdd.flatMap(lambda x: x).collect()
)
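A quick check of the helper on the two example values from the question, assuming integer filter values:
print(tuple_to_str([1]))         # (1)
print(tuple_to_str([1, 10293]))  # (1, 10293)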

Count RDD Pairs As Dictionary Pyspark

I have been working with a dataset that has been reduced to a following structure:
10,47,110,296,318,356,364,377,454,527,539,590,593,597,648,858,1097,1197,1206,1214,1221,1265,1291,1721,1961,2571,2628,2706,2716,3147,3578,3717,3793,4306,4993,5952,6539,7153,7438
Where each row of the RDD has the above structure.
I am attempting to count each pair within a row and insert the counts into a dictionary. A sample output for this dictionary would be:
(10,47): 1, (10, 110):1, (10,296):1 etc.
I was able to get a basic implementation working, but it was taking ten minutes longer on larger datasets than a simpler non-dictionary approach in PySpark (I am practicing the pairs and stripes MapReduce algorithms).
Previously, I was calling my own reduce function that would iterate through all the combinations of pairs and then emit the counts. Is there a better way to do this?
The end goal is to count each row of an RDD and have a dictionary for (val1,val2): count
With the above data example as an RDD called dataRDD, I have been doing the following:
pairCount = dataRDD.map(combinePairs)
Where combinePairs is defined as
import itertools
from collections import defaultdict

# note: this dict lives on each worker process, so it is not shared across partitions
goodDict = defaultdict(int)

def combinePairs(data):
    data = data.split(',')
    for v in itertools.combinations(data, 2):
        first = v[0]
        second = v[1]
        pair = (first, second)
        goodDict[pair] = goodDict[pair] + 1
    return goodDict
Any suggestions greatly appreciated
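Since no answer is recorded here, one hedged sketch of an alternative (names are illustrative, and it assumes dataRDD holds comma-separated strings as above): let Spark do the aggregation with flatMap and reduceByKey instead of mutating a driver-defined defaultdict inside map, and only pull the result back as a dict if it is small enough to fit on the driver.
import itertools
pairCounts = (dataRDD
              .flatMap(lambda line: itertools.combinations(line.split(','), 2))
              .map(lambda pair: (pair, 1))
              .reduceByKey(lambda a, b: a + b))
pairDict = pairCounts.collectAsMap()
# e.g. {('10', '47'): 1, ('10', '110'): 1, ...}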
