What does this anonymous split function do? - python

narcoticsCrimeTuples = narcoticsCrimes.map(lambda x:(x.split(",")[0], x))
I have a CSV I am trying to parse by splitting on commas and the first entry in each array of strings is the primary key.
I would like to get the key on a separate line (or just separate) from the value when calling narcoticsCrimeTuples.first()[1]
My current understanding is 'split x by commas, take the first part of each split [0], and return that as the new x', but I'm pretty sure that middle part is not right because the number inside the [] can be anything and returns the same result.

Your variable is named "narcoticsCrimeTuples", so you presumably expect to get a tuple.
Your two values of the tuple are the first column of the CSV x.split(",")[0] and the entire line x.
"I would like to get the key on a separate line"
Not really clear why you want that...
"(or just separate) from the value when calling narcoticsCrimeTuples.first()[1]"
Well, when you call .first(), you get the entire tuple. [0] is the first column, and [1] would be the corresponding line of the CSV, which also contains the [0] value.
If you call narcoticsCrimes.flatMap(lambda x: x.split(",")) instead, then all the values will be separated.
For example, in the word count example...
textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
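The tuple that the lambda builds can be inspected in plain Python, without Spark. This is only a sketch: the sample line and its column values are invented for illustration, and make_pair is a hypothetical name for the lambda from the question. It shows why the index inside the lambda never changes what [1] returns: [1] is always the whole line.

```python
# Stand-in for one line of the CSV; the column values are invented.
line = "1001,NARCOTICS,2015-03-18"


def make_pair(x):
    # Same logic as the lambda in the question: (first column, whole line).
    return (x.split(",")[0], x)


pair = make_pair(line)
print(pair[0])  # the key: first column only
print(pair[1])  # the whole original line, key included

# Picking a different column changes element [0] of the tuple, never element [1].
other = (line.split(",")[2], line)
print(other[1] == pair[1])

# To separate the key from the remaining columns, split once and keep the rest.
key, rest = line.split(",", 1)
print(key)
print(rest)
```

If the goal is a (key, remaining columns) pair, the last two lines suggest one way to build it inside the map function.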

Judging by the syntax, it seems you are using PySpark. If that's true, you're mapping over your RDD and creating a (key, row) tuple for each row, the key being the first element in a comma-separated list of items. Calling narcoticsCrimeTuples.first() will just give you the first record.
See an example here:
https://gist.github.com/amirziai/5db698ea613c6857d72e9ce6189c1193

Related

Why is re not removing some values from my list?

I'm asking more out of curiosity at this point since I found a work-around, but it's still bothering me.
I have a list of dataframes (x) that all have the same column names. I'm trying to use pandas and re to make a list of the subset of column names that have the format
"D(number) S(number)"
so I wrote the following function:
def extract_sensor_columns(x):
    sensor_name = list(x[0].columns)
    for j in sensor_name:
        if bool(re.match('D(\d+)S(\d+)', j))==False:
            sensor_name.remove(j)
    return sensor_name
The list that I'm generating has 103 items (98 wanted items, 5 unwanted items). This function removes three of the five columns that I want to get rid of, but keeps the columns labeled 'Pos' and 'RH'. I generated the sensor_name list outside of the function and tested the truth value of
bool(re.match('D(\d+)S(\d+)', sensor_name[j]))
for all five of the items that I wanted to get rid of and they all gave the False value. The other thing I tried is changing the conditional to ==True, which even more strangely gave me 54 items (all of the unwanted column names and half of the wanted column names).
If I rewrite the function to add the column names that have a given format (rather than remove column names that don't follow the format), I get the list I want.
def extract_sensor_columns(x):
    sensor_name = []
    for j in list(x[0].columns):
        if bool(re.match('D(\d+)S(\d+)', j))==True:
            sensor_name.append(j)
    return sensor_name
Why is the first block of code acting so strangely?
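In general, do not modify a list while iterating over it. In the first (wrong) version you remove elements from the list being iterated, so every element after a removed item shifts down by one position and the loop skips the one that moves into the vacated slot. In the second (correct) version you iterate over an unchanged list and append the matching names to a new, initially empty list.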
Consider this:
arr = list(range(10))
for el in arr:
    print(el)
for i, el in enumerate(arr):
    print(el)
    arr.remove(arr[i+1])
The second loop prints only the even numbers, because each iteration removes the element that would have come next.
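The same skipping effect can be reproduced with column names. The sketch below uses a short, made-up list of names (not the asker's 103 columns) to show an unwanted name surviving when it directly follows a removed one, next to a list comprehension that avoids the problem.

```python
import re

# Hypothetical column names, chosen so that two unwanted names are adjacent.
columns = ["D1S1", "Pos", "RH", "D2S1", "Temp", "D2S2"]
pattern = re.compile(r"D(\d+)S(\d+)")

# Buggy: removing from the list that is being iterated.
buggy = list(columns)
for name in buggy:
    if not pattern.match(name):
        buggy.remove(name)  # the next element slides into this slot and is skipped

# Safe: build a new list and leave the iterated one untouched.
kept = [name for name in columns if pattern.match(name)]

print(buggy)
print(kept)
```

Here "RH" is never tested in the buggy loop because it moves into the position of "Pos" right after "Pos" is removed.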

How to create new column by manipulating another column? pandas

I am trying to make a new column depending on different criteria. I want to add characters to the string dependent on the starting characters of the column.
An example of the data:
RH~111~header~120~~~~~~~ball
RL~111~detailed~12~~~~~hat
RA~111~account~13~~~~~~~~~car
I want to change those starting with RH and RL, but not the ones starting with RA. So I want it to look like:
RH~111~header~120~~1~~~~~ball
RL~111~detailed~12~~cancel~~~ball
RA~111~account~12~~~~~~~~~ball
I have attempted to use str split, but it doesn't seem to actually be splitting the string up
(np.where(~df['1'].str.startswith('RH'),
df['1'].str.split('~').str[5],
df['1']))
This is referencing the correct column but not splitting it where I thought it would, and I can't seem to get further than this. I feel like I am not really going about this the right way.
Define a function that replaces the element at index pos of the list arr:
def repl(arr, pos):
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)
Then perform the substitution:
df[0] = df[0].mask(df[0].str.match('^R[HL]'),
df[0].str.split('~').apply(repl, pos=5))
Details:
str.match ensures that only the rows starting with RH or RL are substituted.
df[0].str.split('~') splits the column of strings into a column of lists (one list per string).
apply(repl, pos=5) computes the value to substitute.
I assumed that you have a DataFrame with a single column, so its column name is 0 (an integer) instead of '1' (a string). If this is not the case, change the column name in the code above.
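The per-row logic can be checked without pandas. This is a plain-Python sketch rather than the DataFrame code itself: it applies the same repl helper to the three sample rows from the question and uses a conditional in place of mask, so only rows starting with RH or RL are rewritten.

```python
def repl(arr, pos):
    # Same helper as in the answer: overwrite one field, then rebuild the string.
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)


# Sample rows copied from the question.
rows = [
    'RH~111~header~120~~~~~~~ball',
    'RL~111~detailed~12~~~~~hat',
    'RA~111~account~13~~~~~~~~~car',
]

# Plain-Python stand-in for the mask: RA rows pass through unchanged.
result = [
    repl(row.split('~'), 5) if row.startswith(('RH', 'RL')) else row
    for row in rows
]

for row in result:
    print(row)
```

Because the field at index 5 is overwritten rather than inserted, the number of '~' separators in each row stays the same.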

How to access an individual element of a tuple in an RDD in PySpark?

Let's say I have an RDD like
[(u'Some1', (u'ABC', 9989)),
(u'Some2', (u'XYZ', 235)),
(u'Some3', (u'BBB', 5379)),
(u'Some4', (u'ABC', 5379))]
I am using map to get one tuple at a time, but how can I access an individual element of a tuple, for example to see whether a tuple contains some string? Actually, I want to filter out those that contain some string. Here, the tuples that contain ABC.
I was trying to do something like this, but it's not helping:
def foo(line):
    if(line[1]=="ABC"):
        return (line)
new_data = data.map(foo)
I am new to both Spark and Python, so please help!
RDDs can be filtered directly. Below will give you all records that contain "ABC" in the 0th position of the 2nd element of the tuple.
new_data = data.filter(lambda x: x[1][0] == "ABC")
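The predicate can be tried on an ordinary Python list first. The sketch below is a stand-in for the RDD, not PySpark code: it uses the built-in filter on the sample records, and it also shows why the map attempt returns nothing useful (line[1] is the inner tuple, so the comparison with "ABC" is never true and foo returns None).

```python
# Sample records from the question, as a plain list instead of an RDD.
data = [
    ('Some1', ('ABC', 9989)),
    ('Some2', ('XYZ', 235)),
    ('Some3', ('BBB', 5379)),
    ('Some4', ('ABC', 5379)),
]


def has_abc(record):
    # record[1] is the inner tuple; record[1][0] is its first element.
    return record[1][0] == "ABC"


matching = list(filter(has_abc, data))                  # records that contain ABC
without_abc = [rec for rec in data if not has_abc(rec)]  # negate to drop them instead


def foo(line):
    # The attempt from the question: compares a tuple with a string.
    if line[1] == "ABC":
        return line


mapped = list(map(foo, data))  # every call falls through and returns None

print(matching)
print(without_abc)
print(mapped)
```

If the aim is to discard the ABC records rather than keep them, the negated predicate is the one to pass to filter.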

Python not returning specific list item by variable

I'm attempting to create a python script that chooses the next item from a list in a dict on each iteration and replaces the key in the text with the current item. The issue, however, is that it won't return the list item that matches an integer that is set by a variable.
The code:
for x in range(0,runs):
    for k,v in shortcodes.iteritems():
        description = description.replace('{'+k+'}', v[x])
    print description
It simply continues to replace the key in the text with the first list item, rather than incrementing. If I manually remove x and set it to a higher integer, it responds correctly. Similarly, for sanity's sake, printing x as it goes along confirms that x is increasing.
So, why is it ignoring x and staying with 0?
The problem is that your loop only replaces the keyword on the first iteration. On all later iterations there are no keywords left to replace, because every occurrence has already been replaced with the first element of the value list.
I think a more promising solution is something like this:
for k,v in shortcodes.iteritems():
    for x in v:
        description = description.replace('{'+k+'}', x, 1)
print description
Now you iterate over the value list for each key instead of the other way around. Since you want each occurrence of a keyword to get a different value, you replace only one occurrence at a time: the third argument to replace() limits it to the first remaining occurrence of '{'+k+'}', so each value in the list fills the next placeholder in the string.
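The cause described above is easy to reproduce. The following Python 3 sketch uses an invented template and shortcode dictionary (hypothetical names and values) to show the placeholders disappearing after the first pass. It also shows one alternative, which assumes the aim is a separately filled-in description per run: start each run from the untouched template.

```python
# Hypothetical template and shortcodes, for illustration only.
shortcodes = {"name": ["Ann", "Bob"], "city": ["Oslo", "Rome"]}
template = "{name} lives in {city}."
runs = 2

# Reusing one string: after the first pass no placeholders are left.
description = template
reused = []
for x in range(runs):
    for k, v in shortcodes.items():
        description = description.replace('{' + k + '}', v[x])
    reused.append(description)

# Starting from the untouched template on every run.
fresh = []
for x in range(runs):
    text = template
    for k, v in shortcodes.items():
        text = text.replace('{' + k + '}', v[x])
    fresh.append(text)

print(reused)
print(fresh)
```

In the first loop x does increase, but replace() finds nothing to substitute on the second run, which matches the behaviour reported in the question.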

finding first item in a list whose first item in a tuple is matched

I have a list of several thousand unordered tuples that are of the format
(mainValue, (value, value, value, value))
Given a main value (which may or may not be present), is there a 'nice' way, other than iterating through every item looking and incrementing a value, where I can produce a list of indexes of tuples that match like this:
index = 0;
for destEntry in destList:
    if destEntry[0] == sourceMatch:
        destMatches.append(index)
    index = index + 1
So I can compare the sub values against another set, and remove the best match from the list if necessary.
This works fine, but just seems like python would have a better way!
Edit:
As per the question, when writing the original question, I realised that I could use a dictionary instead of the first value (in fact this list is within another dictionary), but after removing the question, I still wanted to know how to do it as a tuple.
With list comprehension your for loop can be reduced to this expression:
destMatches = [i for i,destEntry in enumerate(destList) if destEntry[0] == sourceMatch]
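You can also use the built-in filter() function [1] to filter your data. Note that this returns the matching tuples themselves rather than their indexes:
destMatches = filter(lambda destEntry:destEntry[0] == sourceMatch, destList)
[1]: In Python 3, filter is a class and returns a lazy filter object; wrap the call in list() if you need a list.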
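The two approaches return different things, which matters if entries are to be removed afterwards. The sketch below uses made-up tuples in the format from the question. The reverse-order deletion at the end is an assumption about how the indexes might be used, not part of the original answer.

```python
# Made-up data in the (mainValue, (value, value, value, value)) format.
destList = [
    ("a", (1, 2, 3, 4)),
    ("b", (5, 6, 7, 8)),
    ("a", (9, 10, 11, 12)),
]
sourceMatch = "a"

# Indexes of the matching tuples.
indexes = [i for i, entry in enumerate(destList) if entry[0] == sourceMatch]

# The matching tuples themselves; list() is needed in Python 3.
entries = list(filter(lambda entry: entry[0] == sourceMatch, destList))

# Deleting by index from the end keeps the remaining indexes valid.
remaining = list(destList)
for i in reversed(indexes):
    del remaining[i]

print(indexes)
print(entries)
print(remaining)
```

If sourceMatch is not present, both expressions simply produce an empty list.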
