How to access an individual element in a tuple on an RDD in PySpark? - python

Let's say I have an RDD like
[(u'Some1', (u'ABC', 9989)),
(u'Some2', (u'XYZ', 235)),
(u'Some3', (u'BBB', 5379)),
(u'Some4', (u'ABC', 5379))]
I am using map to get one tuple at a time, but how can I access an individual element of a tuple, for example to check whether it contains a certain string? What I actually want is to keep only the tuples that contain that string, here the tuples that contain ABC.
I was trying to do something like this, but it's not working:
def foo(line):
    if(line[1]=="ABC"):
        return (line)
new_data = data.map(foo)
I am new to both Spark and Python, so please help!

RDDs can be filtered directly with filter. The line below keeps all records that have "ABC" at index 0 of the second element of the tuple.
new_data = data.filter(lambda x: x[1][0] == "ABC")
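To see why the map attempt in the question did not work, here is a minimal sketch that uses a plain Python list as a stand-in for the RDD (an assumption, so that it runs without Spark). For this purpose the built-in map and filter should behave like rdd.map and rdd.filter: map emits one output per input, while filter drops the records that fail the test.

```python
# Local stand-in for the RDD: a plain list with the records from the question.
data = [
    (u'Some1', (u'ABC', 9989)),
    (u'Some2', (u'XYZ', 235)),
    (u'Some3', (u'BBB', 5379)),
    (u'Some4', (u'ABC', 5379)),
]

def foo(line):
    # line[1] is the whole inner tuple, e.g. (u'ABC', 9989),
    # so comparing it with the string "ABC" is never True.
    if line[1] == "ABC":
        return line
    # Falling off the end returns None implicitly.

# map produces one result per record, so every record becomes None here.
mapped = list(map(foo, data))

# filter keeps only records whose inner tuple starts with "ABC".
filtered = list(filter(lambda x: x[1][0] == "ABC", data))

print(mapped)
print(filtered)
```

Even with the comparison fixed to line[1][0], map would still leave a None in place of every non-matching record, which is why filter is the better fit.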

Related

How do I figure out what this code is doing?

I am trying to get my hands dirty by doing some experiments on Data Science using Python and the Pandas library.
Recently I got my hands on a Jupyter notebook and stumbled upon a piece of code that I couldn't figure out.
This is the line
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
The dataset comes with a genres column that contains key-value pairs. The code above removes the keys and replaces everything with only the values. If more than one value exists, a | is inserted as a separator between them, for instance
Comedy | Action | Drama
I want to know how the code actually works! Why does it need the literal_eval from ast? What is the lambda function doing?! Is there a more concise and clean way to write this?
Let's take this one step at a time:
md['genres'].fillna('[]')
This line fills all instances of NA or NaN in the series with '[]'.
.apply(literal_eval)
This applies literal_eval() from the ast module. We can infer from the fact that NA values have been replaced with '[]' that the original series contains string representations of lists, so literal_eval is used to convert these strings to lists.
.apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
This lambda function applies the following logic: If the value is a list, map to a list containing the ['name'] values for each element within the list, otherwise map to the empty list.
The result of the full function, therefore, is to map each element in the series, which in the original DF is a string representation of a list, to a list of the ['name'] values for each element within that list. If the element is either not a list, or NA, then it maps to the empty list.
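The three steps described above can be sketched without pandas. This is only an illustration under assumptions: the sample strings and the helper name genre_names are hypothetical, and None stands in for a missing (NaN) cell.

```python
from ast import literal_eval

# Hypothetical raw cell values; None stands in for a missing (NaN) cell.
raw_genres = [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 28, 'name': 'Action'}]",
    None,
    "[{'id': 18, 'name': 'Drama'}]",
]

def genre_names(cell):
    # Step 1: treat a missing cell as the string '[]' (the role of fillna('[]')).
    text = '[]' if cell is None else cell
    # Step 2: parse the string into a real Python object (the role of literal_eval).
    parsed = literal_eval(text)
    # Step 3: keep only the 'name' values, or fall back to an empty list.
    return [d['name'] for d in parsed] if isinstance(parsed, list) else []

names = [genre_names(cell) for cell in raw_genres]
print(names)

# Joining with ' | ' is a separate step that the original line does not perform.
joined = ' | '.join(names[0])
print(joined)
```

With pandas, a named function like this could be passed to a single .apply call instead of chaining fillna, literal_eval, and a lambda, which may read more cleanly.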
You can go through it line by line:
md['genres'] = md['genres'].fillna('[]')
This first line ensures NaN cells are replaced with a string representing an empty list. That's because the genres column is expected to contain lists.
.apply(literal_eval)
The function ast.literal_eval parses each string into the Python object it represents (here, a list of dictionaries) instead of leaving it as a string. Thanks to that, you can then access keys and values. See more here.
.apply(
    lambda x: [i['name'] for i in x]
    if isinstance(x, list)
    else []
)
Now you're just applying some function that will filter your lists. These lists contain dictionaries. The function will return all dictionary values associated with key name within your inputs if they're lists. Otherwise, that'll be an empty list.

How can I extract values from a list?

I am new to Python and am trying to achieve something new. I have a list defined with some string values, like
col_names = 'ABC,DEF,XYZ'.
If I want to extract and use values individually, how can I do that in Python?
Ex: I want to use ABC in one scenario but DEF in another and so on.
Can I create the list as a dictionary, like below? Would that help?
col_names = {'ABC','DEF','XYZ'}
col_names is a string, not a list. You could use col_names.split(',') to separate each value.
FYI, your second definition of col_names is a set, not a dictionary.
To use values from a list, you reference each value by its index.
For example, in a list ls = ['ABC','DEF','XYZ'], ls[2] would be equal to 'XYZ'.
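As a small sketch of the points above: split turns the string into a list that can be indexed, and a dictionary gives lookup by a label. The keys 'primary', 'secondary', and 'fallback' are hypothetical names chosen only for illustration.

```python
col_names = 'ABC,DEF,XYZ'        # a single string, not a list

# split(',') produces a list of the comma-separated values
names = col_names.split(',')

first = names[0]                 # 'ABC'
third = names[2]                 # 'XYZ'

# A dictionary needs key: value pairs; the keys here are hypothetical labels.
by_role = {'primary': 'ABC', 'secondary': 'DEF', 'fallback': 'XYZ'}

print(names)
print(first)
print(by_role['secondary'])
```

A set such as {'ABC','DEF','XYZ'} has no order and no keys, so it supports membership tests but not access by index or by label.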

What does this anonymous split function do?

narcoticsCrimeTuples = narcoticsCrimes.map(lambda x:(x.split(",")[0], x))
I have a CSV I am trying to parse by splitting on commas and the first entry in each array of strings is the primary key.
I would like to get the key on a separate line (or just separate) from the value when calling narcoticsCrimeTuples.first()[1]
My current understanding is 'split x by commas, take the first part of each split [0], and return that as the new x', but I'm pretty sure that middle part is not right because the number inside the [] can be anything and returns the same result.
Your variable is named "narcoticsCrimeTuples", so you are presumably expected to get a "tuple".
Your two values of the tuple are the first column of the CSV x.split(",")[0] and the entire line x.
I would like to get the key on a separate line
Not really clear why you want that...
(or just separate) from the value when calling narcoticsCrimeTuples.first()[1]
Well, when you call .first(), you get the entire tuple. [0] is the first column, and [1] would be the corresponding line of the CSV, which also contains the [0] value.
If you call narcoticsCrimes.flatMap(lambda x: x.split(",")), then all the values will be separated.
For example, in the word count example...
textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
Judging by the syntax, it seems you are using PySpark. If so, you are mapping over your RDD and creating a (key, row) tuple for each row, the key being the first element in a comma-separated list of items. Calling narcoticsCrimeTuples.first() will just give you the first record.
See an example here:
https://gist.github.com/amirziai/5db698ea613c6857d72e9ce6189c1193
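To make the difference between the map and flatMap calls in the answers above concrete, here is a minimal sketch that uses plain Python lists in place of the RDD. The CSV lines are hypothetical, and the list comprehensions only approximate what the Spark calls do. The last step shows one possible way to keep the key separate from the remaining fields.

```python
from itertools import chain

# Hypothetical CSV lines standing in for the narcoticsCrimes RDD.
lines = ["1001,NARCOTICS,2015", "1002,NARCOTICS,2016"]

# Like rdd.map(lambda x: (x.split(",")[0], x)): one (key, whole line) pair per line.
pairs = [(x.split(",")[0], x) for x in lines]

# Like narcoticsCrimeTuples.first(): the first pair; [1] is the untouched line.
key, row = pairs[0]

# Like rdd.flatMap(lambda x: x.split(",")): every field becomes its own record.
flat = list(chain.from_iterable(x.split(",") for x in lines))

# One option for separating the key from the rest: slice the remaining fields.
key_rest = [(x.split(",")[0], x.split(",")[1:]) for x in lines]

print(key)
print(row)
print(flat)
print(key_rest[0])
```

Because the second element of each pair is the whole original line, changing the index inside x.split(",")[...] changes only the key, which is why first()[1] looked the same each time.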

Splitting up a list with all values sitting in the same index in Python

I'm pretty new to Python.
I have a list which looks like the following:
list = [('foo,bar,bash',)]
I grabbed it from an SQL table (someone created the most rubbish SQL table!), and I can't adjust it. This is literally the only format I can pull it in. I need to chop it up. I can't split it by index:
print list[0]
because that just literally gives me:
[('foo,bar,bash',)]
How can I split this up? I want to split it up and write it into another list.
Thank you.
list = [('foo,bar,bash',)] is a list which contains a tuple with 1 element. You should also use a different variable name instead of list, because list is a Python built-in.
You can split that one element using split:
lst = [('foo,bar,bash',)]
print lst[0][0].split(',')
Output:
['foo', 'bar', 'bash']
If the tuple contains more than one element, you can loop through it:
lst = [('foo,bar,bash','1,2,3')]
for i in lst[0]:
    print i.split(',')
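Since the question asks for the pieces to be written into another list, the loop above can also be expressed as a single list comprehension. This is a sketch only; the names rows and parts are hypothetical.

```python
# Same shape as the data in the answer: a list holding one tuple of strings.
rows = [('foo,bar,bash', '1,2,3')]

# Walk every tuple, every string in it, and every comma-separated piece.
parts = [piece for row in rows for field in row for piece in field.split(',')]

print(parts)
```

This also copes with more than one tuple in the outer list, which could happen if the query returned several rows.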

Accessing an individual element of a tuple in an RDD

Can we access an individual element of a tuple in an RDD in PySpark? In Pig we use $0, $1, etc. Is there something similar in PySpark?
If the tuple has 10 elements, how do I get the 5th and 7th elements? Which function should I use to retrieve only the elements I need?
Is this what you want?
rdd57 = rdd.map(lambda x: (x[5], x[7]))
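One thing to keep in mind is that Python indexing is zero-based, like Pig's $0, so x[5] and x[7] are the sixth and eighth elements when counting from one. The sketch below uses a hypothetical 10-element tuple and plain Python, so it runs without Spark. The same lambda or itemgetter could presumably be passed to rdd.map.

```python
from operator import itemgetter

# Hypothetical record with 10 elements: 'a' through 'j'.
record = tuple("abcdefghij")

# Indexes 5 and 7, as in the answer above (comparable to Pig's $5 and $7).
pig_style = (record[5], record[7])

# The 5th and 7th elements when counting from one.
fifth_and_seventh = (record[4], record[6])

# itemgetter builds a reusable function that picks the same positions.
pick = itemgetter(4, 6)
picked = pick(record)

print(pig_style)
print(fifth_and_seventh)
print(picked)
```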
