Is there a way to create keys based on counts in Spark - Python

Note: This question is related to Spark, and not just plain Scala or Python
As this is difficult to explain in words, I will show what I want. Let us say I have an RDD A with the following value:
A = ["word1", "word2", "word3"]
I want to have an RDD with the following value
B = [(1, "word1"), (2, "word2"), (3, "word3")]
That is, it gives a unique number to each entry as a key. Can we do such a thing with Python or Scala?

How about using zipWithIndex?
Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a Spark job when the RDD contains more than one partition.
Otherwise, zipWithUniqueId seems a good fit as well.
If the order of the index is important, you can always map a swap function on the RDD.
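For example, a minimal PySpark sketch (assuming an active SparkContext sc) that reproduces B from the question, swapping so the index becomes the key and the numbering starts at 1:
A = sc.parallelize(["word1", "word2", "word3"])
# zipWithIndex yields (item, index); swap and shift so the key starts at 1
B = A.zipWithIndex().map(lambda pair: (pair[1] + 1, pair[0]))
B.collect()  # [(1, 'word1'), (2, 'word2'), (3, 'word3')]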

Yes, one way is as below:
>>> A = ["word1", "word2", "word3"]
>>> B=[(idx+1,val) for idx,val in enumerate(A)]
>>> B
[(1, 'word1'), (2, 'word2'), (3, 'word3')]

Related

Creating paired nested list from a list in a column of pandas dataframe where the end element of first pair should be the start element of next

I have data in a GeoDataFrame, as shown in the image.
It contains a column named neighbourhood_list which holds the list of all the neighbourhood codes of a route. What I want is to create a nested list in which the end element of each pair is the start element of the next, because I want to generate an OD directed network (for generating edges), and order also matters here.
To make it a bit clearer, here is some code.
Here is, let's say, one record from the dataframe on which I tried a bodge way to get the desired result:
list = [15,30,9,7,8]
new_list = []
for i in range(len(list)-1):
    new_list.append(list[i])
    new_list.append(list[i+1])
So the above code gives the combined list, which I then broke into the pairs I needed:
chunks = [new_list[x:x+2] for x in range(0, len(new_list), 2)]
chunks
Actual data is [15,30,9,7,8]
and desired output is [[15, 30], [30, 9], [9, 7], [7, 8]]
I just figured out the above code from the answer here
Split a python list into other "sublists" i.e smaller lists
However, now the real issue is how to apply it in pandas.
So far I am trying to tweak something mentioned here:
https://chrisalbon.com/python/data_wrangling/pandas_list_comprehension/
Here is some incomplete code. I am not sure if it is correct, but I thought that if somehow I could get the length of the list in each row of the neighbourhood_list column, then maybe I could accomplish this:
for row in df['neighbourhood_list']:
    for i in range ??HOW TO GET range(len) of each row??
        new.append(row[i])
        new.append(row[i+1])
Note: as a layman I don't know how nested looping or lambda functions work, or whether there is an available pandas function to perform this task.
Another thing I am thinking of is something like the following, also mentioned on Stack Overflow, but then how do I get the length of the list in each row, even if I first create a function and then apply it to my column?
df[["YourColumns"]].apply(someFunction)
Apologies ahead if the question needs more clarification (I can give more details of the problem if needed).
Thanks so much.
My best guess is that you are trying to create a column containing a list of ordered pairs from a column of lists. If that is the case, something like this should work:
df['pairs'] = df['neighbourhood_list'].apply(lambda row: [[row[i], row[i+1]] for i in range(len(row)-1)])
Edit
From what you described, your 'neighbourhood_list' column is not a list yet but a string. Add this line first to turn the column items into lists, then run the pairs apply above:
df['neighbourhood_list'] = df['neighbourhood_list'].apply(lambda row: row.split(','))
If I have misunderstood, please let me know and I'll try to adjust accordingly.
From the description you posted, it seems that all you're trying to do is get that list of graph edges from an ordered list of nodes. First, it helps to use existing methods to reduce your pairing to a simple expression. In this case, I recommend zip:
stops = [15,30,9,7,8]
list(zip(stops, stops[1:]))
Output:
[(15, 30), (30, 9), (9, 7), (7, 8)]
Note that I changed your variable name: using a built-in type as a variable name is a baaaaaad idea. It disables some of your ability to reference that type.
Now you just need to wrap that in a simple column expression. Any pandas tutorial will include appropriate instructions on using df["neighbourhood_list"] as a series expression.
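A minimal sketch of that column expression, assuming neighbourhood_list already holds Python lists (the pairs column name is just an example):
df['pairs'] = df['neighbourhood_list'].apply(lambda stops: list(zip(stops, stops[1:])))
Note that zip yields tuples; use [list(p) for p in zip(stops, stops[1:])] inside the lambda if you need lists of lists as in the desired output.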

Sorting a list of Python tuples (first element in descending order, then second element in ascending order) [duplicate]

I have a list of tuples of k elements. I'd like to sort with respect to element 0, then element 1 and so on and so forth. I googled but I still can't quite figure out how to do it. Would it be something like this?
list.sort(key = lambda x : (x[0], x[1], ...., x[k-1]))
In particular, I'd like to sort using different criteria, for example, descending on element 0, ascending on element 1 and so on.
Since Python's sort has been guaranteed stable since version 2.3, the easiest implementation I can think of is a serial repetition of sort using a series of (index, reverse_value) tuples:
# Specify the index, and whether reverse should be True/False
sort_spec = ((0, True), (1, False), (2, False), (3, True))

# Sort repeatedly from last tuple to the first, to have final output be
# sorted by first tuple, and ties sorted by second tuple etc
for index, reverse_value in sort_spec[::-1]:
    list_of_tuples.sort(key=lambda x: x[index], reverse=reverse_value)
This does multiple passes, so it may be inefficient in terms of constant-factor cost, but it is still O(n log n) in asymptotic complexity.
If the sort order for the indices is truly 0, 1, ..., n-1 for a list of n-sized tuples as shown in your example, then all you need is a sequence of True and False values to denote whether you want reverse or not, and you can use enumerate to add the index:
sort_spec = (True, False, False, True)
for index, reverse_value in list(enumerate(sort_spec))[::-1]:
    list_of_tuples.sort(key=lambda x: x[index], reverse=reverse_value)
The original code above, however, allows the flexibility of sorting by the indices in any order.
Incidentally, this "sequence of sorts" method is recommended in the Python Sorting HOWTO with minor modifications.
Edit
If you didn't have the requirement to sort ascending by some indices and descending by others, then
from operator import itemgetter
list_of_tuples.sort(key = itemgetter(1, 3, 5))
will sort by index 1, then ties will be sorted by index 3, and further ties by index 5. However, changing the ascending/descending order of each index is non-trivial in one-pass.
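For example, with some hypothetical 6-element tuples:
from operator import itemgetter

rows = [(0, 'b', 0, 2, 0, 9), (0, 'a', 0, 2, 0, 1), (0, 'a', 0, 1, 0, 5)]
rows.sort(key=itemgetter(1, 3, 5))
# rows is now [(0, 'a', 0, 1, 0, 5), (0, 'a', 0, 2, 0, 1), (0, 'b', 0, 2, 0, 9)]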
list.sort(key = lambda x : (x[0], x[1], ...., x[k-1]))
This is actually using the tuple as its own sort key. In other words, the same thing as calling sort() with no argument.
If I assume that you simplified the question, and the actual elements are actually not in the same order you want to sort by (for instance, the last value has the most precedence), you can use the same technique, but reorder the parts of the key based on precedence:
list.sort(key = lambda x : (x[k-1], ...., x[1], x[0]))
In general, this is a very handy trick, even in other languages like C++ (if you're using libraries): when you want to sort a list of objects by several members with varying precedence, you can construct a sort key by making a tuple containing all the relevant members, in the order of precedence.
Final trick (this one is off topic, but it may help you at some point): when using a library that doesn't support the idea of "sort by" keys, you can usually get the same effect by building a list that contains the sort key. So, instead of sorting a list of Obj, you would construct and then sort a list of tuples: (ObjSortKey, Obj). Also, just inserting the objects into a sorted set will work, if the sort key is unique. (The sort key would be the index, in that case.)
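A tiny illustration of that decorate-then-sort trick, using hypothetical tuples as stand-ins for objects and their first element as the key:
objs = [('b', 2), ('a', 1)]                  # stand-ins for Obj instances
decorated = [(obj[0], obj) for obj in objs]  # (ObjSortKey, Obj) pairs
decorated.sort()
sorted_objs = [obj for _, obj in decorated]  # [('a', 1), ('b', 2)]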
So I am assuming you want to sort tuple_0 ascending, then tuple_1 descending, and so on. A bit verbose but this is what you might be looking for:
for i in range(len(list_of_tuples)):
    if i % 2 == 0:
        list_of_tuples[i] = sorted(list_of_tuples[i])
    else:
        list_of_tuples[i] = sorted(list_of_tuples[i], reverse=True)
print(list_of_tuples)

Pyspark RDD: find index of an element

I am new to PySpark. I am trying to convert a Python list to an RDD, and then I need to find an element's index using the RDD. For the first part I am doing:
list = [[1,2],[1,4]]
rdd = sc.parallelize(list).cache()
So now the RDD is effectively my list. The thing is that I want to find the index of any arbitrary element, something like the index() method that works for Python lists. I am aware of a function called zipWithIndex which assigns an index to each element, but I could not find a proper example in Python (there are examples with Java and Scala).
Thanks.
Use filter and zipWithIndex:
rdd.zipWithIndex() \
   .filter(lambda pair: pair[0] == [1, 2]) \
   .map(lambda pair: pair[1]) \
   .collect()
Note that [1,2] here can be easily changed to a variable name and this whole expression can be wrapped within a function.
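For example, a small sketch of that wrapping (find_index is a hypothetical name):
def find_index(rdd, element):
    # Returns the indices of all occurrences of element in the RDD
    return (rdd.zipWithIndex()
               .filter(lambda pair: pair[0] == element)
               .map(lambda pair: pair[1])
               .collect())

find_index(rdd, [1, 2])  # [0]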
How It Works
zipWithIndex simply returns a tuple of (item,index) like so:
rdd.zipWithIndex().collect()
> [([1, 2], 0), ([1, 4], 1)]
filter finds only those that match a particular criterion (in this case, that key equals a specific sublist):
rdd.zipWithIndex().filter(lambda pair: pair[0] == [1, 2]).collect()
> [([1, 2], 0)]
map is fairly obvious: we can just get back the index:
rdd.zipWithIndex() \
   .filter(lambda pair: pair[0] == [1, 2]) \
   .map(lambda pair: pair[1]).collect()
> [0]
Then we can simply take the first element by indexing with [0] if you want a single value.

convert python dictionary key string to key date

Say you have a dictionary:
a={'20101216':5,'20100216':1,'20111226':2,'20131216':5}
Two keys have the same value. How would I go about printing the maximum date key (which is a string) and its value? Like:
5 at 12/16/2013
I tried to loop over the keys and the values and print the max key and max value, but it's not working out.
Edit: I originally started out trying to convert an array of string dates [b] to date objects, but it fails:
import datetime

b = ['20101216','20100216','20111226','20131216']
c = [5,1,2,5]
z = []
for strDate in b:
    g = [datetime.datetime.strptime(strDate, '%Y%m%d')]
    if g not in z:
        z.append(g)
Then from there, if it worked, I would have done another for loop on my new array [z] to format each date element properly (m/d/y). Following that, I would have zipped both arrays into a dictionary.
Like:
d = dict(zip(z,c))
Which would have resulted in
d={12/16/2010:5,02/16/2010:1,12/26/2011:2,12/16/2013:5}
Finally, I would have attempted to find the max date key and max value, and printed it like so:
5 at 12/16/2013
But because of the failure converting array b, I was thinking maybe working with a dictionary from the start might yield better results.
TL;DR:
max(a.items(), key = lambda x: (x[1], x[0]))
Basically, the problem is that you can't take the maximum of the dict directly; you need to order the data by value and key together. dict.items() gives you a list of tuples, i.e.
a.items()
[('20101216', 5), ('20131216', 5), ('20111226', 2), ('20100216', 1)]
Then all you need is to get the maximum value of this list. The simple solution for getting a maximum is the max function. As your problem is slightly more complicated, you should leverage max's key argument (take a look at the docs) and use a "compound" sorting key. In such situations a lambda function is the answer: you can express pretty much anything that you need to sort by. So, sorting by the two values inside the tuple with the corresponding priority would be
max(l, key = lambda x: (x[1], x[0])) # where l is iterable with tuples
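Putting the TL;DR together with the date formatting the question asks for, a rough sketch using the a dict from the question:
import datetime

a = {'20101216': 5, '20100216': 1, '20111226': 2, '20131216': 5}
# max by (value, key): the largest value wins, ties broken by the later date string
key, value = max(a.items(), key=lambda x: (x[1], x[0]))
date = datetime.datetime.strptime(key, '%Y%m%d')
print('%d at %s' % (value, date.strftime('%m/%d/%Y')))
# 5 at 12/16/2013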

How to access values by their keys as if processed with num_to_word_dict (if it existed)?

Is there a way to access a value by a key using Apache Spark?
Consider the following simple example, where there are two lists of key-value pairs which I would like to join:
num_to_letter = sc.parallelize([(1,'a'),(2,'b'),(3,'c')])
num_to_word = sc.parallelize([(1, 'one'),(2,'two'),(3,'three')])
num_to_letter.join(num_to_word).map(lambda x: x[1]).collect()
The result matches the letters to the words of the numbers:
[('a', 'one'), ('b', 'two'), ('c', 'three')]
The example shows it being done using a join, but it should be much more efficient to actually do this as a map operation where num_to_word is a dictionary:
num_to_word_dict = dict(num_to_word.collect())
num_to_letter.map(lambda x: (x[1], num_to_word_dict[x[0]])).collect()
The question is, is there a way to create something that acts like num_to_word_dict without having to collect the values in num_to_word?
There's a def lookup(key: K): Seq[V] function defined on RDDs of pairs that resolves a key to the list of values associated with that key.
Nevertheless, it will not be helpful in this case because RDDs cannot be used inside closures, which is what would be needed to resolve values of a second RDD.
Given that both datasets are RDDs, join is a good way to proceed.
If the RDD that contains the resolution association is small enough to fit in the memory of the driver and of each executor, the most efficient way to do this kind of resolution in Spark would be to create a map as a broadcast variable and map the elements of the other RDD in each partition.
val numWordBC = sc.broadcast(numToWord.collectAsMap)
val letterToWord = numToLetter.mapPartitions { partition =>
  val numWord = numWordBC.value
  partition.map { case (k, v) => (numWord(k), v) }
}
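Since the question is in PySpark, a rough Python equivalent of the same broadcast idea (reusing num_to_word and num_to_letter from the question) might look like:
# Broadcast the small lookup table to every executor
num_to_word_bc = sc.broadcast(num_to_word.collectAsMap())

# Resolve each key through the broadcast dict instead of a join
letter_to_word = num_to_letter.map(lambda x: (x[1], num_to_word_bc.value[x[0]]))
letter_to_word.collect()
# [('a', 'one'), ('b', 'two'), ('c', 'three')]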
