String replace in Spark RDD - python
I will explain the problem, starting with the code.
numPartitions = 2
rawData1 = sc.textFile('train_new.csv', numPartitions, use_unicode=False)
rawData1.take(1)
['1,0,0,0,0,0,0,0,0,0,0,1,0,0,5,0,0,0,0,0,0,0,0,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,9,0,0,0,0,0,Class_2']
Now I want to replace Class_2 with 2.
After the replacement, the row should be:
['1,0,0,0,0,0,0,0,0,0,0,1,0,0,5,0,0,0,0,0,0,0,0,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,9,0,0,0,0,0,2']
Once I get it working for this row, I will perform the operation on the whole data set.
Thanks in Advance
Aashish
result = rawData1.map(lambda element: ','.join(element.split(',')[:-1] + ['2']))
should more than do it. It works by applying the lambda function to each element of your RDD and returning a new RDD.
Each element is split into a list on the ',' delimiter, sliced to drop the last field, extended with the extra element ['2'], and then joined back together with ','.
More elaborate constructions can be made by modifying the lambda function appropriately.
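If you want the same replacement to work across the whole data set for any class label (Class_1, Class_3, ...), not just Class_2, a slightly more general map can strip the Class_ prefix instead of hard-coding '2'. A minimal sketch, assuming every row ends with a label of the form Class_<number>:

def strip_class_label(line):
    parts = line.split(',')
    parts[-1] = parts[-1].replace('Class_', '')  # 'Class_2' -> '2'
    return ','.join(parts)

rawData2 = rawData1.map(strip_class_label)
rawData2.take(1)  # the first row now ends in '2' instead of 'Class_2'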
Related
Is there a way to iterate through a specific column and replace cell values in pandas?
How do I replace the cell values in a column if they contain a number in general, or contain a specific thing like a comma? I want to replace the whole cell value with something else. For example, if a cell in the column contains a comma (meaning it holds more than one thing), I want it replaced with text like "ENM". If a cell in the column contains a number value, I want it replaced with "UNM".
As you have not provided examples of what your expected and current output look like, I'm making some assumptions below. What it seems like you're trying to do is iterate through every value in a column and, if the value meets certain conditions, change it to something else.

Just a general pointer: iterating through dataframes requires some important considerations for larger sizes. Read through this answer for more insight.

Start by defining a function you want to use to check the value:

def has_comma(value):
    if ',' in value:
        return True
    return False

Then use the pandas.DataFrame.replace method to make the change:

for i in df['column_name']:
    if has_comma(i):
        df['column_name'] = df['column_name'].replace([i], 'ENM')
    else:
        df['column_name'] = df['column_name'].replace([i], 'UNM')
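To see the effect end to end, here is a minimal runnable sketch of that loop on a toy DataFrame (the column name 'column_name' and the sample values are my own assumptions for illustration):

import pandas as pd

df = pd.DataFrame({'column_name': ['a,b', '3', 'x,y,z']})

def has_comma(value):
    return ',' in str(value)

for i in df['column_name']:
    if has_comma(i):
        df['column_name'] = df['column_name'].replace([i], 'ENM')
    else:
        df['column_name'] = df['column_name'].replace([i], 'UNM')

print(df['column_name'].tolist())  # ['ENM', 'UNM', 'ENM']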
Say you have a column, i.e. a pandas Series, called col. The following code can be used to map values containing a comma to "ENM", as per your example:

col.mask(col.str.contains(','), "ENM")

You can overwrite your original column with this result if that's what you want to do. This approach will be much faster than looping through each element.

For mapping floats to "UNM", as per your example, the following would work:

col.mask(col.apply(isinstance, args=(float,)), "UNM")

Hopefully you get the idea. See https://pandas.pydata.org/docs/reference/api/pandas.Series.mask.html for more info on masking.
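And a small self-contained sketch of the mask approach on a toy mixed-type column (the sample values are assumptions; .astype(str) is added here only so str.contains works when the column holds non-strings):

import pandas as pd

col = pd.Series(['a,b', 3.5, 'hello'])  # toy column: a comma value, a float, a plain string

result = col.mask(col.astype(str).str.contains(','), 'ENM')        # comma values -> "ENM"
result = result.mask(col.apply(isinstance, args=(float,)), 'UNM')  # floats -> "UNM"

print(result.tolist())  # ['ENM', 'UNM', 'hello']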
How to create a new column by manipulating another column? pandas
I am trying to make a new column depending on different criteria. I want to add characters to the string dependent on the starting characters of the column. An example of the data:

RH~111~header~120~~~~~~~ball
RL~111~detailed~12~~~~~hat
RA~111~account~13~~~~~~~~~car

I want to change those starting with RH and RL, but not the ones starting with RA, so the result would look like:

RH~111~header~120~~1~~~~~ball
RL~111~detailed~12~~cancel~~~ball
RA~111~account~12~~~~~~~~~ball

I have attempted to use str.split, but it doesn't seem to actually be splitting the string up:

(np.where(~df['1'].str.startswith('RH'), df['1'].str.split('~').str[5], df['1']))

This references the correct column but doesn't split it where I thought it would, and I can't seem to get further than this. I feel like I am not really going about this the right way.
Define a function to replace the element at position pos in the list arr:

def repl(arr, pos):
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)

Then perform the substitution:

df[0] = df[0].mask(df[0].str.match('^R[HL]'), df[0].str.split('~').apply(repl, pos=5))

Details:

str.match ensures that only the proper elements are substituted.
df[0].str.split('~') splits the column of strings into a column of lists (each resulting from splitting one string).
apply(repl, pos=5) computes the value to substitute.

I assumed that you have a DataFrame with a single column, so its column name is 0 (an integer), instead of '1' (a string). If this is not the case, change the column name in the code above.
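For reference, a minimal self-contained sketch of this approach on the sample rows from the question (the output shown in the comment is what I would expect, not verified against your full data):

import pandas as pd

df = pd.DataFrame(['RH~111~header~120~~~~~~~ball',
                   'RL~111~detailed~12~~~~~hat',
                   'RA~111~account~13~~~~~~~~~car'])

def repl(arr, pos):
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)

# Only rows starting with RH or RL are rewritten; RA rows are kept unchanged.
df[0] = df[0].mask(df[0].str.match('^R[HL]'), df[0].str.split('~').apply(repl, pos=5))

print(df[0].tolist())
# ['RH~111~header~120~~1~~~~~ball', 'RL~111~detailed~12~~cancel~~~hat', 'RA~111~account~13~~~~~~~~~car']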
What does this anonymous split function do?
narcoticsCrimeTuples = narcoticsCrimes.map(lambda x: (x.split(",")[0], x))

I have a CSV I am trying to parse by splitting on commas, and the first entry in each array of strings is the primary key. I would like to get the key on a separate line (or just separate) from the value when calling narcoticsCrimeTuples.first()[1]. My current understanding is 'split x by commas, take the first part of each split [0], and return that as the new x', but I'm pretty sure that middle part is not right, because the number inside the [] can be anything and returns the same result.
Your variable is named "narcoticsCrimeTuples", so you seem to be expected to get a "tuple". Your two values of the tuple are the first column of the CSV, x.split(",")[0], and the entire line, x.

"I would like to get the key on a separate line"

Not really clear why you want that...

"(or just separate) from the value when calling narcoticsCrimeTuples.first()[1]"

Well, when you call .first(), you get the entire tuple. [0] is the first column, and [1] would be the corresponding line of the CSV, which also contains the [0] value.

If you do narcoticsCrimes.flatMap(lambda x: x.split(",")), then all the values will be separated. For example, in the word count example...

textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
Judging by the syntax, it seems like you are using PySpark. If that's true, you're mapping over your RDD and for each row creating a (key, row) tuple, the key being the first element in a comma-separated list of items. Doing narcoticsCrimeTuples.first() will just give you the first record. See an example here: https://gist.github.com/amirziai/5db698ea613c6857d72e9ce6189c1193
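A minimal sketch of what that map produces, assuming an existing SparkContext sc and a couple of made-up CSV lines standing in for the real file:

narcoticsCrimes = sc.parallelize(['HX123,NARCOTICS,2015', 'HX456,NARCOTICS,2016'])

narcoticsCrimeTuples = narcoticsCrimes.map(lambda x: (x.split(",")[0], x))

first = narcoticsCrimeTuples.first()
print(first[0])  # 'HX123' -> the key (first CSV column)
print(first[1])  # 'HX123,NARCOTICS,2015' -> the whole original line, which still contains the key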
Pyspark tuple object has no attribute split
I am struggling with a Pyspark assignment. I am required to get a sum of all the viewing numbers per channel. I have 2 sets of files: one showing the show and views per show, the other showing the shows and what channel they are shown on (can be multiple). I have performed a join operation on the 2 files and the result looks like:

[(u'Surreal_News', (u'BAT', u'11')),
 (u'Hourly_Sports', (u'CNO', u'79')),
 (u'Hourly_Sports', (u'CNO', u'3')),

I now need to extract the channel as the key and then, I think, do a reduceByKey to get the sum of views for each channel. I have written the function below to extract the channel as the key with the views alongside, which I could then feed into reduceByKey to sum the results. However, when I try to display the results of this function with collect(), I get an "AttributeError: 'tuple' object has no attribute 'split'" error.

def extract_chan_views(show_chan_views):
    key_value = show_chan_views.split(",")
    chan_views = key_value[1].split(",")
    chan = chan_views[0]
    views = int(chan_views[1])
    return (chan, views)
Since this is an assignment, I'll try to explain what's going on rather than just giving the answer. Hopefully that will be more helpful!

This actually isn't anything to do with pySpark; it's just a plain Python issue. Like the error is saying, you're trying to split a tuple, when split is a string operation. Instead, access the parts by index. The object you're passing in:

[(u'Surreal_News', (u'BAT', u'11')), (u'Hourly_Sports', (u'CNO', u'79')), (u'Hourly_Sports', (u'CNO', u'3')),

is a list of tuples, where the first index is a unicode string and the second is another tuple. You can split them apart like this (I'll annotate each step with comments):

for item in your_list:
    #item = (u'Surreal_News', (u'BAT', u'11')) on iteration one
    first_index, second_index = item #this will unpack the two indices
    #now:
    #first_index = u'Surreal_News'
    #second_index = (u'BAT', u'11')
    first_sub_index, second_sub_index = second_index #unpack again
    #now:
    #first_sub_index = u'BAT'
    #second_sub_index = u'11'

Note that you never had to split on commas anywhere. Also note that the u'11' is a string, not an integer, in your data. It can be converted, as long as you're sure it's never malformed, with int(u'11').

Or, if you prefer specifying indices to unpacking, you can do the same thing:

first_index, second_index = item

is equivalent to:

first_index = item[0]
second_index = item[1]

Also note that this gets more complicated if you are unsure what form the data will take - that is, if sometimes the objects have two items in them, other times three. In that case, unpacking and indexing in a generalized way for a loop require a bit more thought.
I am not exactly fixing your code, but I faced the same error after applying a join transformation on two datasets. Let's say A and B are two RDDs:

c = A.join(B)

c is still an RDD, but each of its elements is now a tuple, not a string, so we cannot perform any split(",") kind of operation on those elements. We need to access the parts of each tuple by index instead. If D is such a tuple:

E = D[1]  # instead of E = D.split(",")[1]
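Putting this together for the original question, a minimal sketch of re-keying by channel and summing the views on the joined RDD (the sample records are the ones shown in the question; joined is just my name for the join result):

joined = sc.parallelize([(u'Surreal_News', (u'BAT', u'11')),
                         (u'Hourly_Sports', (u'CNO', u'79')),
                         (u'Hourly_Sports', (u'CNO', u'3'))])

# Each record is (show, (channel, views)): re-key by channel and sum the views.
chan_views = joined.map(lambda rec: (rec[1][0], int(rec[1][1]))) \
                   .reduceByKey(lambda a, b: a + b)

print(chan_views.collect())  # e.g. [(u'BAT', 11), (u'CNO', 82)] (order may vary)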
How to access an individual element of a tuple in an RDD in pyspark?
Let's say I have an RDD like:

[(u'Some1', (u'ABC', 9989)), (u'Some2', (u'XYZ', 235)), (u'Some3', (u'BBB', 5379)), (u'Some4', (u'ABC', 5379))]

I am using map to get one tuple at a time, but how can I access an individual element of a tuple, for example to see if a tuple contains some string? Actually, I want to filter out those that contain some string - here, the tuples that contain ABC. I was trying to do something like this, but it's not helping:

def foo(line):
    if(line[1]=="ABC"):
        return (line)

new_data = data.map(foo)

I am new to spark and python as well, please help!!
RDDs can be filtered directly. The following will give you all records that have "ABC" in the 0th position of the 2nd element of the tuple:

new_data = data.filter(lambda x: x[1][0] == "ABC")
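A short end-to-end sketch with the sample data from the question, assuming an existing SparkContext sc:

data = sc.parallelize([(u'Some1', (u'ABC', 9989)), (u'Some2', (u'XYZ', 235)),
                       (u'Some3', (u'BBB', 5379)), (u'Some4', (u'ABC', 5379))])

# Keep only the records whose inner tuple has "ABC" in position 0.
new_data = data.filter(lambda x: x[1][0] == "ABC")
print(new_data.collect())  # [(u'Some1', (u'ABC', 9989)), (u'Some4', (u'ABC', 5379))]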