Accessing an individual element of a tuple in an RDD - python

Can we access individual elements of a tuple in an RDD in pyspark? In PIG we use $0, $1, etc., so is there something similar in pySpark?
If a tuple has 10 elements, how do I get the 5th and 7th elements? Which function should I use? How do I retrieve only the elements I need?

Is this what you want?
rdd57 = rdd.map(lambda x: (x[5], x[7]))
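If so, here is a minimal runnable sketch; the SparkContext setup and the sample data below are only assumptions for illustration. Remember that Python indexing is 0-based, so x[5] and x[7] are the 6th and 8th elements:
from pyspark import SparkContext

sc = SparkContext("local", "tuple-index-example")

# Sample RDD of 10-element tuples (made up for the example).
rdd = sc.parallelize([tuple(range(i, i + 10)) for i in range(3)])
rdd57 = rdd.map(lambda x: (x[5], x[7]))
print(rdd57.collect())  # [(5, 7), (6, 8), (7, 9)]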


How do I figure out what this code is doing?

I am trying to get my hands dirty by doing some data science experiments with Python and the Pandas library.
Recently I got my hands on a Jupyter notebook and stumbled upon a piece of code that I couldn't figure out.
This is the line
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
The dataset comes with a genres column that contains key-value pairs. The code above removes the keys and keeps only the values; if more than one value exists, a | is inserted as a separator between them, for instance
Comedy | Action | Drama
I want to know how the code actually works! Why does it need the literal_eval from ast? What is the lambda function doing?! Is there a more concise and clean way to write this?
Let's take this one step at a time:
md['genres'].fillna('[]')
This line fills all instances of NA or NaN in the series with '[]'.
.apply(literal_eval)
This applies literal_eval() from the ast package. We can infer from the fact that NA values have been replaced with '[]' that the original series contains string representations of lists, so literal_eval is used to convert these strings into actual lists.
.apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
This lambda function applies the following logic: If the value is a list, map to a list containing the ['name'] values for each element within the list, otherwise map to the empty list.
The result of the full function, therefore, is to map each element in the series, which in the original DF is a string representation of a list, to a list of the ['name'] values for each element within that list. If the element is either not a list, or NA, then it maps to the empty list.
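For a concrete picture, here is a small self-contained sketch; the sample rows are made up but mirror the structure of the genres column:
import pandas as pd
from ast import literal_eval

# Made-up rows mimicking the genres column: string representations of lists of dicts.
md = pd.DataFrame({'genres': [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
    "[{'id': 28, 'name': 'Action'}]",
    None,  # NaN -> '[]' -> [] after the pipeline
]})

md['genres'] = (md['genres']
                .fillna('[]')
                .apply(literal_eval)
                .apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else []))

print(md['genres'].tolist())  # [['Comedy', 'Drama'], ['Action'], []]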
You can look at it line by line:
md['genres'] = md['genres'].fillna('[]')
This first line ensures that NaN cells are replaced with a string representing an empty list, because the genres column is expected to contain lists.
.apply(literal_eval)
The method ast.literal_eval evaluates the strings so that you get actual lists and dictionaries instead of their string representations. Thanks to that, you can then access keys and values. See the documentation of the ast module for more.
.apply(
lambda x: [i['name'] for i in x]
if isinstance(x, list)
else []
)
Now you're just applying a function that transforms your lists. These lists contain dictionaries. If the input is a list, the function returns all dictionary values associated with the key name; otherwise, it returns an empty list.
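As a quick illustration of what the literal_eval step does in isolation (the sample string below is made up but matches the format of a genres cell):
from ast import literal_eval

cell = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"
parsed = literal_eval(cell)         # now a real list of dicts, not a string
print([d['name'] for d in parsed])  # ['Comedy', 'Drama']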

How to access an individual element of a tuple in an RDD in pyspark?

Let's say I have an RDD like
[(u'Some1', (u'ABC', 9989)),
(u'Some2', (u'XYZ', 235)),
(u'Some3', (u'BBB', 5379)),
(u'Some4', (u'ABC', 5379))]
I am using map to get one tuple at a time, but how can I access an individual element of a tuple, for example to see whether a tuple contains some string? Actually I want to filter for the tuples that contain some string; here, the tuples that contain ABC.
I was trying to do something like this but it's not helping:
def foo(line):
    if line[1] == "ABC":
        return line

new_data = data.map(foo)
I am new to Spark and Python as well, please help!
RDDs can be filtered directly. The filter below will give you all records that contain "ABC" in the 0th position of the 2nd element of the tuple.
new_data = data.filter(lambda x: x[1][0] == "ABC")
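For example, with the sample RDD from the question (this sketch assumes a SparkContext named sc is already available):
data = sc.parallelize([(u'Some1', (u'ABC', 9989)),
                       (u'Some2', (u'XYZ', 235)),
                       (u'Some3', (u'BBB', 5379)),
                       (u'Some4', (u'ABC', 5379))])

new_data = data.filter(lambda x: x[1][0] == "ABC")
print(new_data.collect())  # [('Some1', ('ABC', 9989)), ('Some4', ('ABC', 5379))]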

String replace in Spark RDD

I will explain the problem, starting with the code.
numPartitions = 2
rawData1 = sc.textFile('train_new.csv', numPartitions, use_unicode=False)
rawData1.take(1)
['1,0,0,0,0,0,0,0,0,0,0,1,0,0,5,0,0,0,0,0,0,0,0,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,9,0,0,0,0,0,Class_2']
Now I want to replace Class_2 with 2.
After the replacement the answer should be
['1,0,0,0,0,0,0,0,0,0,0,1,0,0,5,0,0,0,0,0,0,0,0,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,9,0,0,0,0,0,2']
Once I get it working for this row, I will perform the operation on the whole data set.
Thanks in Advance
Aashish
result = rawData1.map(lambda element: ','.join(element.split(',')[:-1] + ['2']))
should more than do it. It works by mapping each element in your RDD through the lambda function and returning a new RDD.
Each element is split into a list using the ',' delimiter, sliced to omit the last element, extended with the extra element '2', and then joined back together with ','.
More elaborate constructions can be made by modifying the lambda function appropriately.
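For instance, if the goal is to turn every Class_N label into the number N rather than hard-code '2', one possible generalization (only a sketch, assuming every line ends with a Class_<number> field) is:
def relabel(line):
    # Split off the last field, e.g. 'Class_2', and keep only what follows 'Class_'.
    fields = line.split(',')
    fields[-1] = fields[-1].replace('Class_', '')
    return ','.join(fields)

result = rawData1.map(relabel)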

Is there a way to create key based on counts in Spark

Note: This question is related to Spark, and not just plain Scala or Python
As it is difficult to explain this in words, I will show what I want. Let us say I have an RDD A with the following value:
A = ["word1", "word2", "word3"]
I want to have an RDD with the following value
B = [(1, "word1"), (2, "word2"), (3, "word3")]
That is, it gives each entry a unique number as its key. Can we do such a thing with Python or Scala?
How about using zipWithIndex?
Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions.
Otherwise, zipWithUniqueId seems a good fit as well.
If the order of the index is important, you can always map a swap function on the RDD.
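A sketch of what that looks like (assuming a SparkContext named sc; the +1 and the swap are only there because the question wants 1-based keys in the first position):
A = sc.parallelize(["word1", "word2", "word3"])

# zipWithIndex pairs each element with its 0-based index: ("word1", 0), ...
# Swap the pair and add 1 to get (1, "word1"), (2, "word2"), (3, "word3").
B = A.zipWithIndex().map(lambda pair: (pair[1] + 1, pair[0]))
print(B.collect())  # [(1, 'word1'), (2, 'word2'), (3, 'word3')]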
Yes, one way is as below:
>>> A = ["word1", "word2", "word3"]
>>> B=[(idx+1,val) for idx,val in enumerate(A)]
>>> B
[(1, 'word1'), (2, 'word2'), (3, 'word3')]

How to sort a list of lists (non integers)?

I have a list of lists that looks like this:
[['10.2100', '0.93956088E+01'],
['11.1100', '0.96414905E+01'],
['12.1100', '0.98638361E+01'],
['14.1100', '0.12764182E+02'],
['16.1100', '0.16235739E+02'],
['18.1100', '0.11399972E+02'],
['20.1100', '0.76444933E+01'],
['25.1100', '0.37823686E+01'],
['30.1100', '0.23552237E+01'],...]
(here it looks as if it is already ordered, but some of the remaining elements, not included here to avoid a huge list, are not in order)
and I want to sort it by the first element of each pair. I have seen several very similar questions, but in all of them the examples use integers. I don't know if that is why, when I use list.sort(key=lambda x: x[0]), or sorted(), or the version with operator.itemgetter(0), I get the following:
[['10.2100', '0.93956088E+01'],
['100.1100', '0.33752517E+00'],
['11.1100', '0.96414905E+01'],
['110.1100', '0.25774972E+00'],
['12.1100', '0.98638361E+01'],
['14.1100', '0.12764182E+02'],
['14.6100', '0.14123326E+02'],
['15.1100', '0.15451733E+02'],
['16.1100', '0.16235739E+02'],
['16.6100', '0.15351242E+02'],
['17.1100', '0.14040859E+02'],
['18.1100', '0.11399972E+02'], ...]
Apparently what it is doing is sorting the strings lexicographically, character by character, instead of by numeric value.
Is there a way of using list.sort or sorted() to order these pairs by the numeric value of the first element?
Don't use list as a variable name!
some_list.sort(key=lambda x: float(x[0]))
This will convert the first element to a float and compare it numerically instead of alphabetically.
(Note that the cast to float is only for comparing; the item is still a string in the list.)
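For example, with a few of the pairs from the question:
pairs = [['100.1100', '0.33752517E+00'],
         ['10.2100', '0.93956088E+01'],
         ['11.1100', '0.96414905E+01']]

pairs.sort(key=lambda x: float(x[0]))
print(pairs)
# [['10.2100', '0.93956088E+01'], ['11.1100', '0.96414905E+01'], ['100.1100', '0.33752517E+00']]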
