List Transformation With Lambdas in Spark - python

I am attempting to take an RDD containing pairs of integer ranges, and transform it so that each pair has a third term which iterates through the possible values in the range. Basically, I've got this:
[[1,10], [11,20], [21,30]]
And I'd like to end up with this:
[[1,1,10], [2,1,10], [3,1,10], [4,1,10], [5,1,10]...]
The file I'd like to transform is very large, which is why I'm looking to do this with PySpark rather than just Python on a local machine (I've got a way to do it locally on a CSV file, but the process takes several hours given the file's size). So far, I've got this:
a = [[1,10], [11,20], [21,30]]
b = sc.parallelize(a)
c = b.map(lambda x: [range(x[0], x[1]+1), x[0], x[1]])
c.collect()
Which yields:
>>> c.collect()
[[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1, 10], [[11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 11, 20], [[21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 21, 30]]
I can't figure out what the next step needs to be from here, to iterate over the expanded range, and pair each of those with the range delimiters.
Any ideas?
EDIT 5/8/2017 3:00PM
The local Python technique that works on a CSV input is:
import csv
import gzip
# Raw strings keep the backslashes in the Windows paths literal
csvfile_expanded = gzip.open(r'C:\output.csv', 'wb')
ranges_expanded = csv.writer(csvfile_expanded, delimiter=',', quotechar='"')
csvfile = open(r'C:\input.csv', 'rb')
ranges = csv.reader(csvfile, delimiter=',', quotechar='"')
for row in ranges:
    for i in range(int(row[0]), int(row[1]) + 1):
        ranges_expanded.writerow([i, row[0], row[1]])
The PySpark script I'm questioning begins with the CSV file already having been loaded into HDFS and cast as an RDD.

Try this:
c = b.flatMap(lambda x: ([y, x[0], x[1]] for y in xrange(x[0], x[1]+1)))
The flatMap() ensures that you get one output record per element of the range. Note also the outer ( ) in conjunction with the xrange -- this is a generator expression that avoids materialising the entire range in memory of the executor.
Note: xrange() exists only in Python 2. If you are running Python 3, use range(), which is already lazy there.
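To check the per-record logic without a Spark cluster, here is a minimal local sketch in Python 3. The helper names expand and local_flat_map are hypothetical; local_flat_map only imitates what flatMap() does (apply a function, then flatten the results) and is not part of PySpark.

```python
from itertools import chain


def expand(pair):
    """Return a generator of [value, start, end] for the inclusive range."""
    start, end = pair
    return ([y, start, end] for y in range(start, end + 1))


def local_flat_map(func, records):
    """Hypothetical local stand-in for RDD.flatMap: apply func, then flatten."""
    return list(chain.from_iterable(func(r) for r in records))


a = [[1, 3], [11, 12]]
result = local_flat_map(expand, a)
print(result)
```

On a real RDD the same function could be passed directly, as in b.flatMap(expand).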

Related

How to remove varying multiple strings from a string extracted from a csv file

I am quite new to programming and have a string with integrated list values. I am trying to isolate the numerical values in the string to be able to use them later.
I have tried to split the string, and change it back to a list and remove the EU variables with a loop. The initial definition produces the indexes of the duplicates and writes them in a list/string format that I am trying to change.
This is the csv file extract example:
Country,Population,Number,code,area
,,,,
Canada,8822267,83858,EU15,central
Denmark,11413058,305010,EU6,west
Southafrica,705034,110912,EU6,south
We are trying to add up repeating EU number populations.
def duplicates(listed, number):
    return [i for i,x in enumerate(listed) if x == number]
a=list((x, duplicates(EUlist, x)) for x in set(EUlist) if EUlist.count(x) > 1)
str1 = ''.join(str(e) for e in a)
for x in range (6,27):
    str2=str1.replace("EUx","")
    #split=str1.split("EUx")
    #Here is where I tried to split it as a list. Changing str1 back to a list. str1= [x for x in split]
This is what the code produces:
('EU6', [1, 9, 10, 14, 17, 19])('EU12', [21, 25])('EU25', [4, 5, 7, 12, 15, 16, 18, 20, 23, 24])('EU27', [2, 22])('EU9', [6, 13])('EU15', [0, 8, 26])
I am trying to isolate the numbers in the square brackets so it prints:
[1, 9, 10, 14, 17, 19]
[21, 25]
[4, 5, 7, 12, 15, 16, 18, 20, 23, 24]
[2, 22]
[6, 13]
[0, 8, 26]
This will allow me to isolate the indexes for further use.
I'm not sure without example data but I think this might do the trick:
def duplicates(listed, number):
    return [i for i,x in enumerate(listed) if x == number]
a=list((x, duplicates(EUlist, x)) for x in set(EUlist) if EUlist.count(x) > 1)
for item in a:
    print(item[1])
At least I think this should print what you asked for in the question.
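For longer lists, the index lists can also be collected in a single pass instead of calling count() and enumerate() once per code. This is a swapped-in technique using collections.defaultdict, not the approach above; the sample EUlist and the name duplicate_indexes are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample standing in for the question's EUlist
EUlist = ["EU15", "EU6", "EU6", "EU27", "EU15", "EU9"]


def duplicate_indexes(codes):
    """Map each code that occurs more than once to the list of its indexes."""
    positions = defaultdict(list)
    for i, code in enumerate(codes):
        positions[code].append(i)
    return {code: idx for code, idx in positions.items() if len(idx) > 1}


dupes = duplicate_indexes(EUlist)
for indexes in dupes.values():
    print(indexes)
```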
As an alternative, you can use the pandas module and save some typing. Remove the line of four commas (the second line of the file) and then:
import pandas as pd
csvfile = r'C:\Test\pops.csv'
df = pd.read_csv(csvfile)
df.groupby('membership')['Population'].sum()
Will output:
membership
Brexit 662307
EU10 10868
EU12 569219
EU15 8976639
EU25 17495803
EU27 900255
EU28 41053
EU6 13694963
EU9 105449

File of lists, import as individual lists in python 2.7

I have created a text file in one program, which outputted the numbers 1 to 25 in a pseudo-random order, for example like so:
[21, 19, 14, 22, 18, 23, 25, 10, 6, 9, 1, 13, 2, 7, 5, 12, 8, 20, 24, 15, 17, 4, 11, 3, 16]
Now I have another python file which is supposed to read the file I created earlier and use a sorting algorithm to sort the numbers.
The problem is that I can't seem to figure out how to read the list I created earlier into the file as a list.
Is there actually a way to do this? Or would I be better off rewriting my output program somehow, so that I can cast the input into a list?
If your file looks like:
21
19
14
22
18
23
...
use this:
with open('file') as f:
    mylist = [int(i.strip()) for i in f]
If it really looks like a list like [21, 19, 14, 22...], here is a simple way:
with open('file') as f:
    mylist = list(map(int, f.read().lstrip('[').rstrip(']\n').split(', ')))
And if your file does not strictly conform to that format, for example if it looks like [ 21,19, 14 , 22...], here is another way that uses a regex:
import re
with open('file') as f:
    mylist = list(map(int, re.findall(r'\d+', f.read())))
If you don't want to change the output of your current script, you may use ast.literal_eval():
import ast
with open("output.txt", "r") as f:
    array = ast.literal_eval(f.read())
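As a quick comparison of the approaches above, the following sketch runs three of them on an in-memory string rather than a file. The sample text is hypothetical and assumes the bracketed, comma-and-space format from the question.

```python
import ast
import re

# Hypothetical file contents, matching the format shown in the question
text = "[21, 19, 14, 22, 18, 23]\n"

# 1. Strip the brackets and split on the separator (needs an exact ", " format)
by_split = [int(s) for s in text.strip().strip("[]").split(", ")]

# 2. Pull out every run of digits (tolerates irregular spacing, ignores signs)
by_regex = [int(s) for s in re.findall(r"\d+", text)]

# 3. Parse the text as a Python literal (also handles nested lists)
by_literal = ast.literal_eval(text.strip())

print(by_split == by_regex == by_literal)
```

The regex version is the most forgiving about spacing but would silently drop minus signs, so it only suits non-negative integers like the ones here.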

Creating a new array of numbers based on an existing string array

I am trying to create a list of values that correspond to a string by comparing each character of my string to those in my "alpha_list". This is for an encoding procedure, so that the numerical values can be added later.
I keep getting errors from the numerous different ways I have tried to make this happen.
import string
alpha_list = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"
ints = "HELLO WORLD"
myotherlist = []
for idx, val in enumerate(ints):
    myotherlist[idx] = alpha_list.index(val)
print(myotherlist)
Right now this is my current error reading
Traceback (most recent call last):
  File "C:/Users/Derek/Desktop/Python/test2.py", line 11, in <module>
    myotherlist[idx] = alpha_list.index(val)
IndexError: list assignment index out of range
I am pretty new to Python, so if I am making a ridiculously obvious mistake please feel free to criticize.
The print(myotherlist) output that I am looking for should look something like this:
[8, 5, 12, 12, 15, 0, 23, 15, 18, 12, 4]
Just use append:
for val in ints:
    myotherlist.append(alpha_list.index(val))
print(myotherlist)
myotherlist is an empty list, so you cannot assign to myotherlist[idx]: there is no element 0 (or any other index) yet.
Or just use a list comprehension:
my_other_list = [alpha_list.index(val) for val in ints]
Or a functional approach using map:
map(alpha_list.index, ints)
Both output:
In [7]: [alpha_list.index(val) for val in ints]
Out[7]: [8, 5, 12, 12, 15, 0, 23, 15, 18, 12, 4]
In [8]: map(alpha_list.index,ints)
Out[8]: [8, 5, 12, 12, 15, 0, 23, 15, 18, 12, 4]
import string: you don't need it here. A number of books say it's better to use the built-in str instead.
myotherlist[idx] = alpha_list.index(val) is why you are getting the error. It says 'go to index idx and put alpha_list.index(val) there', but since the list is empty it cannot do that.
So if you replace
for idx, val in enumerate(ints):
    myotherlist[idx] = alpha_list.index(val)
with
for letter in ints:  # iterates over the 'HELLO WORLD' string
    index_to_append = alpha_list.index(letter)
    myotherlist.append(index_to_append)
you will get the expected result!
If there is something not clear please let me know!

Writing and reading floats and strings in a CSV file - python

I am a bit new to python and programming. In my code, I have developed a feature (which is a 1-D array of 39 elements) for each audio file. I want to write the name of the file, the feature and its target value {0,1} into a CSV file to train my SVM classifier. I used the CSV writer as follows.
with open('train.csv', 'a') as csvfile:
    albumwriter = csv.writer(csvfile, delimiter=' ')
    albumwriter.writerow(['1.03 I Want To Hold Your Hand'] + Final_feature + [0] )
I want to write the details of around 180 audio files to this CSV file and feed it to the SVM classifier. The code that I use to read the file is:
with open('train.csv', 'rb') as csvfile:
    albumreader = csv.reader(csvfile, delimiter=' ')
    data = list()
    for row in albumreader:
        data.append(row[0:])
    data = np.array(data)
I can access the name of the file in the first row as data[0][1] and the feature as data[0][2] but both of them are in <type 'numpy.string_'>. I want to convert the feature into a list of floats. The main problem seems to be the ',' that separates the elements in the list. I tried using .astype(np.float) but in vain.
Can anyone suggest a good method to convert the strings from the CSV file back to floats? Your help is very much appreciated as I have very little time to complete this project. Thanks in advance.
Edit: As per the comment, this is what my train.csv looks like:
"1.01 I saw her standing there" "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38]" 0
"1.02 I saw her" "[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]" 0
"1.03 I want to hold your hand" "[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]" 1
I don't understand exactly what you want to achieve, but assuming Final_feature is a Python list of floats, and going by your code for writing the CSV file, the list comes back as a string (in data[0][2]) that probably looks like this:
feature = '[3.14, 2.12, 4.5]' # 3 elements only for clarity
To convert this string to a list of floats, you can use:
map(float, feature[1:-1].split(','))
For reference, map applies its first argument to every element of its second argument, thus transforming every string into a float. In Python 2 it returns a list of floats; in Python 3 it returns an iterator, so wrap the call in list() if you need a list.
Another solution would be to write each element of your Final_feature in a separate column.
To convert string like "[1.0, 2.0, 3.0]" to list [1.0, 2.0, 3.0]:
# string to convert
s = '[1.0, 2.0, 3.0]'
lst = [float(x) for x in s[1: -1].split(',')]
# and result will be
[1.0, 2.0, 3.0]
This works both with standard python string type and with numpy.string type.
From what I can see, the variable Final_feature is a list of floats? In which case, based on how you wrote the file, the following will import the data:
with open('train.csv', 'rb') as csvfile:
    albumreader = csv.reader(csvfile, delimiter=' ')
    audio_file_names = []
    final_features = []
    target_values = []
    for row in albumreader:
        audio_file_names.append(row[0])
        final_features.append([float(s) for s in row[1:-1]])
        target_values.append([int(s) for s in row[-1]])
There are two list comprehensions to convert the data into floats and integers.
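For the file layout shown in the question's edit, where each feature list is stored as one quoted string, a minimal Python 3 sketch might look like the following. The two sample rows are shortened, made-up stand-ins for train.csv, and io.StringIO replaces the real file so the example is self-contained; using json.loads for the bracketed list is one option among those discussed above, not the asker's method.

```python
import csv
import io
import json

# Two rows in the layout shown in the question's train.csv (features shortened)
sample = (
    '"1.01 I saw her standing there" "[0, 1, 2.5]" 0\n'
    '"1.03 I want to hold your hand" "[3, 4, 5]" 1\n'
)

names, features, targets = [], [], []
# io.StringIO stands in for the opened file
for name, feature_text, target in csv.reader(io.StringIO(sample), delimiter=" "):
    names.append(name)
    # The bracketed list is valid JSON, so json.loads recovers the numbers
    features.append([float(x) for x in json.loads(feature_text)])
    targets.append(int(target))

print(features)
print(targets)
```

Writing each feature value to its own column, as suggested earlier, would avoid the string parsing step entirely.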

Get lists by reading file in Python

There is a problem that I can't solve in Python.
I'm trying to get lists by reading a file (like .xml or .txt).
I've put my lists in one big list in my file, like this:
[[48,49,39,7,13,1,11],[46,27,19,15,24,8,4],[35,5,41,10,31,5,9],[12,9,22,2,36,9,2],[50,47,25,6,42,3,1]]
Now I'm looking for code to read this big list as a list, not as a string. I've already tried some code using the open(), write() and read() functions, but Python returned:
'[[48,49,39,7,13,1,11],[46,27,19,15,24,8,4],[35,5,41,10,31,5,9],[12,9,22,2,36,9,2],[50,47,25,6,42,3,1]]'
And it isn't a list, just a string, so I can't use list methods to modify it.
Thanks to anyone who can answer my problem.
Well, a simple way is to parse it as a JSON string:
>>> import json
>>> l_str = '[[48,49,39,7,13,1,11],[46,27,19,15,24,8,4],[35,5,41,10,31,5,9],[12,9,22,2,36,9,2],[50,47,25,6,42,3,1]]'
>>> l = json.loads(l_str)
>>> print l
[[48, 49, 39, 7, 13, 1, 11], [46, 27, 19, 15, 24, 8, 4], [35, 5, 41, 10, 31, 5, 9], [12, 9, 22, 2, 36, 9, 2], [50, 47, 25, 6, 42, 3, 1]]
If you want to load a file that only contains that string, you can simply do it using the following:
>>> import json
>>> with open('myfile') as f:
...     l = json.load(f)
...
>>> print l
[[48, 49, 39, 7, 13, 1, 11], [46, 27, 19, 15, 24, 8, 4], [35, 5, 41, 10, 31, 5, 9], [12, 9, 22, 2, 36, 9, 2], [50, 47, 25, 6, 42, 3, 1]]
But if what you want is to serialize Python objects, then you should instead use pickle, which is more powerful at that task...
Of course, there are other ways that others may give you to parse your string through an eval()-like function, but I strongly advise you against that, as this is dangerous and leads to insecure code. Edit: after reading @kamikai's answer, I'm discovering ast.literal_eval(), which looks like a decent option as well, though json.loads() is more efficient.
If your example is truly representative of your data (i.e., your text file contains only a list of lists of integers), you can parse it as JSON:
import json
data = read_the_contents_of_the_file()
decoded = json.loads(data)
Replace data = read_the_contents_of_the_file() with your existing code for reading the contents as string.
As seen here, the inbuilt ast module is probably your best bet, assuming the text is still valid python.
import ast
ast.literal_eval("[[1,2,3], [4,5,6], [7,8,9]]") # Returns nested lists
Use json to load and parse the file:
import json
with open(my_file_path, "rb") as f:
    my_list = json.load(f)  # pass the open file object, not the path
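A self-contained way to try this is to round-trip a small nested list through a temporary file. This is only a sketch in Python 3: the temporary file is a hypothetical stand-in for the real data file, and the shortened nested list is made up for illustration.

```python
import json
import os
import tempfile

nested = [[48, 49, 39], [46, 27, 19], [35, 5, 41]]

# Write the list as text to a temporary file (stand-in for the real file)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    json.dump(nested, f)
    path = f.name

try:
    # json.load takes the open file object, not the path
    with open(path) as f:
        loaded = json.load(f)
finally:
    os.remove(path)

print(loaded == nested)
```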
