Pyspark RDD: find index of an element - python

I am new to PySpark. I am trying to convert a Python list to an RDD and then find an element's index using the RDD. For the first part I am doing:
list = [[1,2],[1,4]]
rdd = sc.parallelize(list).cache()
So now the rdd holds my list. I want to find the index of an arbitrary element, something like the "index" method of Python lists. I am aware of a function called zipWithIndex, which assigns an index to each element, but I could not find a proper example in Python (there are examples in Java and Scala).
Thanks.

Use filter and zipWithIndex:
# Python 3 removed tuple parameters in lambdas, so index into each (key, index) pair.
(rdd.zipWithIndex()
    .filter(lambda pair: pair[0] == [1, 2])
    .map(lambda pair: pair[1])
    .collect())
Note that [1,2] here can be easily changed to a variable name and this whole expression can be wrapped within a function.
How It Works
zipWithIndex simply returns a tuple of (item,index) like so:
rdd.zipWithIndex().collect()
> [([1, 2], 0), ([1, 4], 1)]
filter finds only those that match a particular criterion (in this case, that key equals a specific sublist):
rdd.zipWithIndex().filter(lambda pair: pair[0] == [1, 2]).collect()
> [([1, 2], 0)]
map then keeps only the index from each matching pair:
(rdd.zipWithIndex()
    .filter(lambda pair: pair[0] == [1, 2])
    .map(lambda pair: pair[1])
    .collect())
> [0]
If you only want the first match, index the collected result with [0].
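The note above about wrapping the expression in a function could be sketched as follows. This is a minimal sketch rather than part of the original answer: the name find_indices is hypothetical, and rdd is assumed to be a PySpark RDD such as the one created with sc.parallelize in the question.

```python
def find_indices(rdd, target):
    """Return the positions of every element of `rdd` equal to `target`.

    `rdd` is assumed to be a PySpark RDD (any object offering zipWithIndex,
    filter, map and collect behaves the same way here).
    `find_indices` is a hypothetical helper name.
    """
    return (
        rdd.zipWithIndex()                        # (item, index) pairs
        .filter(lambda pair: pair[0] == target)   # keep matching items
        .map(lambda pair: pair[1])                # keep only the index
        .collect()                                # bring the result to the driver
    )
```

With the rdd from the question, find_indices(rdd, [1, 2]) should give [0]. An element that is absent yields an empty list, unlike list.index, which raises ValueError.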

Related

How to apply a function to a list element within a list of lists?

I have a list of lists. Here is an example of 2 of the lists inside a list:
global_tp_old = [[2, 1, 0.8333595991134644],[2, 1, 0.8530714511871338]]
I want to look up a DataFrame entry whose index is given by the first element of each inner list above. At the moment I have tried:
global_tp_new = []
for element in global_tp_old:
    element[:][0] = df_unique[element[:][0]]
    global_tp_new.append(element)
where df_unique is a pandas dataframe produced like this:
['img1.png', 'img2.png', 'img3.png']
I'm trying to match the first element from the list defined above to the number in df_unique.
I should get:
'img3.png'
as it's the 3rd element (0 indexing)
However, I get the incorrect output where it essentially returns the first element every time. It's probably obvious but what do I do to fix this?
Remember that element is a reference to an inner list of global_tp_old, not a copy. If you modify element in place, you modify global_tp_old as well.
Something like this, although you may need to change the dataframe indexing depending on whether you're looking for rows or columns.
global_tp_old = [[2, 1, 0.8333595991134644],[2, 1, 0.8530714511871338]]
global_tp_new = []
for element in global_tp_old:
    element = [df_unique.iloc[element[0]]] + element[1:]
    global_tp_new.append(element)
List comprehension might be useful to apply a function fun to the first element of each list in a list of lists (LoL).
LoL = [[61, 1, 0.8333595991134644],[44, 1, 0.8530714511871338]]
newL = [fun(l_loc[0]) for l_loc in LoL]
No need to use a Pandas DataFrame.
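If the rest of each inner list should be kept, the comprehension can rebuild every row instead of returning only the transformed first elements. A minimal sketch, assuming the lookup table is a plain Python list (names is a stand-in for df_unique, not the asker's actual object):

```python
# Stand-in for df_unique: a plain list of file names (an assumption).
names = ['img1.png', 'img2.png', 'img3.png']

global_tp_old = [[2, 1, 0.8333595991134644], [2, 1, 0.8530714511871338]]

# Build a new inner list per row, so global_tp_old is left untouched.
global_tp_new = [[names[row[0]]] + row[1:] for row in global_tp_old]

print(global_tp_new[0][0])   # img3.png
print(global_tp_old[0][0])   # 2 (the original list is unchanged)
```

Because each row is rebuilt with list concatenation, nothing is assigned back into the original inner lists.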

Select columns in Python based on a condition

I am new to Python!
I have an input vector p. I am trying to select the elements of p such that p(i)>2 and put them into a new vector y, e.g. something like the line below, which gives an error:
y=(p[i]>2)
If I understand correctly, your question is not about Pandas Dataframe, rather about regular Python List. If so, you can use list comprehension.
A list comprehension is a short syntax for iterating through a list and picking the elements that satisfy a certain condition.
Let's see first how you can accomplish what you want with a regular for loop (the non-pythonic way):
my_list = [1, 4, 6, 1, 0]
my_new_list = []
for n in my_list:
    if n > 2:
        my_new_list.append(n)
Now, Python makes such a selection of elements from a list very easy using the list comprehension syntax:
my_new_list = [n for n in my_list if n > 2]
where the first n refers to what we append to my_new_list, then comes the for loop and finally the filtering condition.
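If the positions of the matching elements are needed as well as the values, enumerate can be combined with the same kind of comprehension. A small sketch with made-up data (the names p, y and positions are illustrative):

```python
p = [1, 4, 6, 1, 0]

# Values greater than 2, using the list comprehension syntax.
y = [value for value in p if value > 2]

# Positions of those values in p.
positions = [i for i, value in enumerate(p) if value > 2]

print(y)          # [4, 6]
print(positions)  # [1, 2]
```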
If you are working with a pandas DataFrame, you have to give the column name inside the brackets to access that column; the condition is then applied to it. Like this:
y = dataframe[p['i'] > 80]
The result is itself a DataFrame, containing only the rows for which the condition holds.

Is there a way to create key based on counts in Spark

Note: This question is related to Spark, and not just plain Scala or Python
As it is difficult to explain this, I will show what I want. Let us say I have an RDD A with the following value
A = ["word1", "word2", "word3"]
I want to have an RDD with the following value
B = [(1, "word1"), (2, "word2"), (3, "word3")]
That is, it gives a unique number to each entry as a key. Can we do such a thing with Python or Scala?
How about using zipWithIndex?
Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions.
Otherwise, zipWithUniqueId seems a good fit as well.
If the order of the index is important, you can always map a swap function on the RDD.
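The zipWithIndex-plus-swap idea could be sketched as below. This is an illustration under assumptions, not the answerer's code: number_elements is a hypothetical name, and rdd is assumed to be a PySpark RDD. zipWithIndex starts counting at 0, so the sketch adds an offset to get the 1-based keys shown in the question.

```python
def number_elements(rdd, start=1):
    """Pair each element with a consecutive number, number first.

    `rdd` is assumed to be a PySpark RDD; `number_elements` and `start`
    are hypothetical names introduced for this sketch.
    """
    # zipWithIndex yields (item, index) with the index starting at 0;
    # the map swaps the pair and shifts the index to begin at `start`.
    return rdd.zipWithIndex().map(lambda pair: (pair[1] + start, pair[0]))
```

For an RDD holding "word1", "word2", "word3", number_elements(A).collect() should give the (1, "word1"), (2, "word2"), (3, "word3") pairs asked for.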
Yes, one way is as below:
>>> A = ["word1", "word2", "word3"]
>>> B=[(idx+1,val) for idx,val in enumerate(A)]
>>> B
[(1, 'word1'), (2, 'word2'), (3, 'word3')]

Python: Sorting a list by class objects

I am working on a project for CS1 and am close to cracking it, but this part of the code has stumped me! The object of the project is to create a list of the top 20 names in any given year by referencing a file with thousands of names in it. Each line in each file contains the name, the gender, and how many times the name occurs. The file is separated by gender (female names in order of their occurrences, followed by male names in order of their occurrences). I have gotten the code to a point where each entry is contained within a class in a list (so this list is a long list of memory entries). Here is the code I have up to this point.
class entry():
    __slots__ = ('name' , 'sex' , 'occ')
def mkEntry( name, sex, occ ):
    dat = entry()
    dat.name = name
    dat.sex = sex
    dat.occ = occ
    return dat
##test = mkEntry('Mary', 'F', '7065')
##print(test.name, test.sex, test.occ)
def readFile(fileName):
    fullset = []
    for line in open(fileName):
        val = line.split(",")
        sett = mkEntry(val[0] , val[1] , int(val[2]))
        fullset.append(sett)
    return fullset
fullset = readFile("names/yob1880.txt")
print(fullset)
What I am wondering is whether I can sort this list with sort() or another function, ordering the entries by their occurrences (dat.occ in each entry). The result would be a list sorted independently of gender, and I could then print the first entries, which should be the ones I am seeking. Is it possible to sort the list like this?
Yes, you can sort lists of objects using sort(). sort() takes a function as an optional argument key. The key function is applied to each element in the list before making the comparisons. For example, if you wanted to sort a list of integers by their absolute value, you could do the following
>>> a = [-5, 4, 6, -2, 3, 1]
>>> a.sort(key=abs)
>>> a
[1, -2, 3, 4, -5, 6]
In your case, you need a custom key that will extract the number of occurrences for each object, e.g.
def get_occ(d): return d.occ
fullset.sort(key=get_occ)
(you could also do this using an anonymous function: fullset.sort(key=lambda d: d.occ)). Then you just need to extract the top 20 elements from this list.
Note that sort() works in place and orders the elements ascending by default; pass reverse=True to get descending order, e.g. fullset.sort(key=get_occ, reverse=True)
This sorts the list by using the occ property in descending order:
fullset.sort(key=lambda x: x.occ, reverse=True)
Do you mean you want to sort the list only by occ? sort() has a parameter named key, which you can use like this:
fullset.sort(key=lambda x: x.occ)
I think you just want to sort on the value of the 'occ' attribute of each object, right? You just need to use the key keyword argument to any of the various ordering functions that Python has available. For example
getocc = lambda entry: entry.occ
sorted(fullset, key=getocc)
# or, for in-place sorting
fullset.sort(key=getocc)
or perhaps some may think it's more pythonic to use operator.attrgetter instead of a custom lambda:
import operator
getocc = operator.attrgetter('occ')
sorted(fullset, key=getocc)
But it sounds like the list is pretty big. If you only want the first few entries in the list, sorting may be an unnecessarily expensive operation. For example, if you only want the first value you can get that in O(N) time:
min(fullset, key=getocc) # Same getocc as above
If you want the first three, say, you can use a heap instead of sorting.
import heapq
heapq.nsmallest(3, fullset, key=getocc)
A heap is a useful data structure for getting a slice of ordered elements from a list without sorting the whole list. The above is equivalent to sorted(fullset, key=getocc)[:3], but faster if the list is large.
Hopefully it's obvious you can get the three largest with heapq.nlargest and the same arguments. Likewise you can reverse any of the sorts or replace min with max.
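Applied to the original goal of the 20 most frequent names, the heapq.nlargest variant might look like the sketch below. The Entry class is a hypothetical stand-in for the question's entry class, top_names is an illustrative helper name, and the sample values exist only to show the call.

```python
import heapq
from operator import attrgetter


class Entry:
    """Hypothetical stand-in for the question's entry class."""
    __slots__ = ('name', 'sex', 'occ')

    def __init__(self, name, sex, occ):
        self.name = name
        self.sex = sex
        self.occ = occ


def top_names(entries, n=20):
    """Return the n entries with the highest occ, largest first."""
    return heapq.nlargest(n, entries, key=attrgetter('occ'))


# Illustrative sample values, only to show the call.
sample = [Entry('Mary', 'F', 7065), Entry('Anna', 'F', 2604),
          Entry('John', 'M', 9655), Entry('William', 'M', 9532)]
print([e.name for e in top_names(sample, n=3)])  # ['John', 'William', 'Mary']
```

The result is ordered independently of gender, because only occ takes part in the comparison.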

Indexing According to Number in the Names of Objects in a List in Python

Apologies for my title not being the best. Here is what I am trying to accomplish:
I have a list:
list1 = [a0_something, a2_something, a1_something, a4_something, a3_something]
I have another list whose entries are tuples that include a name, such as:
list2 = [(x1,y1,z1,'bob'),(x2,y2,z2,'alex')...]
The name in the 0th entry of the second list corresponds to a0_something, and the name in the 1st entry corresponds to a1_something. Basically, the second list is in the right order but the first list isn't.
The program I am working with has a setName function I would like to do this
a0_something.setName(list2[0][4])
and so on with a loop.
So that I can really just say
for i in range(len(list1)):
    a(i)_something.setName(list2[i][4])
Is there any way I can refer to the number in a#_something so that I can iterate with a loop?
No.
Variable names have no meaning in run-time. (Unless you're doing introspection, which I guarantee you is something you should not be doing.)
Use a proper list such that:
lst = [a0_val, a1_val, a2_val, a3_val, a4_val]
and then address it by lst[0].
Alternatively, if those names have meanings, use a dict where:
dct = {
    'a0' : a0_val,
    'a1' : a1_val,
    # ...
}
and use it with dct['a0'].
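Applied to the question, the dict idea might be sketched like this, keyed by the number rather than by a string. Everything here is illustrative: Item is a hypothetical stand-in for the asker's objects with a setName method, and the tuples in list2 carry placeholder coordinates.

```python
class Item:
    """Hypothetical stand-in for the question's objects with setName."""

    def __init__(self):
        self.name = None

    def setName(self, name):
        self.name = name


a0, a1, a2, a3, a4 = (Item() for _ in range(5))

# Key each object by its number once, in whatever order they arrive.
by_number = {0: a0, 2: a2, 1: a1, 4: a4, 3: a3}

# Placeholder coordinates; only the last field (the name) matters here.
list2 = [(0, 0, 0, 'bob'), (1, 1, 1, 'alex'), (2, 2, 2, 'carol'),
         (3, 3, 3, 'dave'), (4, 4, 4, 'erin')]

# The i-th tuple names the object numbered i, regardless of dict order.
for i, record in enumerate(list2):
    by_number[i].setName(record[-1])

print(a1.name)  # alex
```

The number is stored as data (the dict key) instead of being read from the variable name, which is what makes the loop possible.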
The enumerate function lets you get the value and the index of the current item. So, for your example, you could do:
for i, asomething in enumerate(list1):
    asomething.setName(list2[i][3])
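Since each tuple in list2 has length 4, its final element is at index 3 (you could also use -1). Note that enumerate pairs items by position, so list1 has to be in the same order as list2 for this to work.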
