pyspark: keep a function in the lambda expression

I have the following working code:
def replaceNone(row):
    myList = []
    row_len = len(row)
    for i in range(0, row_len):
        if row[i] is None:
            myList.append("")
        else:
            myList.append(row[i])
    return myList

rdd_out = rdd_in.map(lambda row: replaceNone(row))
Here row is a Row (from pyspark.sql import Row).
However, it is kind of lengthy and ugly. Is it possible to avoid making the replaceNone function by writing everything in the lambda process directly? Or at least simplify replaceNone()? Thanks!

I'm not sure what your goal is. It seems like you're just trying to replace all the None values in each row of rdd_in with empty strings, in which case you can use a list comprehension:
rdd_out = rdd_in.map(lambda row: [r if r is not None else "" for r in row])
The call to map will build a new list for every row in rdd_in, and the list comprehension will replace all Nones with empty strings.
This works on a trivial example (with a small map helper defined, since map isn't defined as a method on plain lists):
def map(l, f):
    return [f(r) for r in l]

l = [[1,None,2],[3,4,None],[None,5,6]]
l2 = map(l, lambda row: [i if i is not None else "" for i in row])
print(l2)
>>> [[1, '', 2], [3, 4, ''], ['', 5, 6]]
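If you'd rather keep the result as Row objects instead of plain lists, a minimal sketch (assuming rdd_in contains pyspark.sql.Row objects, and rebuilding each row from its dict form) would be something like:
from pyspark.sql import Row
# sketch: rebuild each Row, swapping None values for empty strings
rdd_out = rdd_in.map(lambda row: Row(**{k: ("" if v is None else v) for k, v in row.asDict().items()}))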

Related

Extract a sub-array from an array according to a condition in python

I would like to extract some lines and columns (a kind of sub-array) based on conditions.
Here is an example of the input and desired output.
[["00:00:01","data_update","data1",10.5,"blabla"],
["00:00:02","proc_call","xxx","xxx","blalla"],
["00:00:15","data_update","data2",34.5,"blabla"],
["00:00:25","proc_call","xxx","xxx","blalla"]]
desired output (keep the "data_update" lines with columns 0, 2 and 3):
[["00:00:01","data1",10.5],
["00:00:15","data2",34.5]]
Is there a simple way to do that in Python?
You can either use a for loop, like this:
reduced_array = []
for i in range(len(full_array)):
    if full_array[i][1] == 'data_update':
        reduced_array.append([full_array[i][0], full_array[i][2], full_array[i][3]])
or a list comprehension:
reduced_array = [[i[0], i[2], i[3]] for i in full_array if i[1] == 'data_update']
If you need to handle more columns you could also use:
cols = [0, 2, 3]
reduced_array = [[i[col] for col in cols] for i in full_array if i[1] == 'data_update']
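With the sample input from the question, each of these produces [['00:00:01', 'data1', 10.5], ['00:00:15', 'data2', 34.5]].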
With regard to adnanmuttaleb's answer: using lambda functions with map and filter appears to be much faster than the list comprehension method I proposed, although it is also harder to follow if you are not familiar with the concept. For completeness, and without wanting to take credit for his answer, I add it here:
reduced_array = map(lambda sub: [sub[i] for i in cols], filter(lambda sub: "data_update" in sub, full_array))
Runtime comparison:
import random as rd
import time
full_array = [[rd.random(),"data_update" if rd.random()< 0.2 else "no",rd.random(),rd.random()] for i in range(1000000)]
cols = [0,2,3]
start1 = time.time()
reduced_array = map(lambda sub: [sub[i] for i in cols], filter(lambda sub: "data_update" in sub, full_array))
print(time.time()-start1)
start2 = time.time()
reduced_array2 = [[i[col] for col in cols] for i in full_array if i[1] == 'data_update']
print(time.time()-start2)
This results in:
#Lambda function:
0.004003286361694336
#List comprehension
0.254199743270874
(Note: in Python 3, map and filter return lazy iterators, so the first timing likely does not include actually materializing the result; wrap the map call in list() for a fairer comparison.)
For Inputs:
l = [["00:00:01","data_update","data1",10.5,"blabla"],
["00:00:02","proc_call","xxx","xxx","blalla"],
["00:00:15","data_update","data2",34.5,"blabla"],
["00:00:25","proc_call","xxx","xxx","blalla"]]
cols = (0, 2, 3)
Do:
result = map(lambda sub: [sub[i] for i in cols], filter(lambda sub: "data_update" in sub, l))
print(list(result))
Output:
[['00:00:01', 'data1', 10.5], ['00:00:15', 'data2', 34.5]]
result = filter(lambda x: "data_update" in x, l)
result = [[item[0], item[2], item[3]] for item in result]
The first line finds all rows that contain "data_update".
The second line rebuilds the result with the three columns you need.
What about looping through the list?
needle = 'data_update'
haystack = [
["00:00:01","data_update","data1",10.5,"blabla"],
["00:00:02","proc_call","xxx","xxx","blalla"],
["00:00:15","data_update","data2",34.5,"blabla"],
["00:00:25","proc_call","xxx","xxx","blalla"]
]
container = []
for x in range(len(haystack)):
    if needle in haystack[x]:
        container.append([haystack[x][0], haystack[x][2], haystack[x][3]])
This loops through each element in the list and tests whether your needle is present in that item. If it is, it appends the data to a new output container, made up of only the columns that you asked for.

While loop within for loop for list of lists

I'm trying to create a big list that will contain lists of strings. I iterate over the input list of strings and create a temporary list.
Input:
['Mike','Angela','Bill','\n','Robert','Pam','\n',...]
My desired output:
[['Mike','Angela','Bill'],['Robert','Pam']...]
What i get:
[['Mike','Angela','Bill'],['Angela','Bill'],['Bill']...]
Code:
for i in range(0, len(temp)):
    temporary = []
    while(temp[i] != '\n' and i < len(temp)-1):
        temporary.append(temp[i])
        i += 1
    bigList.append(temporary)
Use itertools.groupby
from itertools import groupby
names = ['Mike','Angela','Bill','\n','Robert','Pam']
[list(g) for k,g in groupby(names, lambda x:x=='\n') if not k]
#[['Mike', 'Angela', 'Bill'], ['Robert', 'Pam']]
Fixing your code, I'd recommend iterating over each element directly, appending to a nested list -
r = [[]]
for i in temp:
    if i.strip():
        r[-1].append(i)
    else:
        r.append([])
Note that if temp ends with a newline, r will have a trailing empty [] list. You can get rid of that though:
if not r[-1]:
    del r[-1]
Another option would be using itertools.groupby, which the other answerer has already mentioned, although your method is more performant.
Your for loop was scanning over the temp array just fine, but the while loop on the inside advanced that index, and the for loop then reset it on its next iteration. This caused the repetition.
temp = ['mike','angela','bill','\n','robert','pam','\n','liz','anya','\n']
# !make sure to include this '\n' at the end of temp!
bigList = []
temporary = []
for i in range(0, len(temp)):
    if(temp[i] != '\n'):
        temporary.append(temp[i])
        print(temporary)
    else:
        print(temporary)
        bigList.append(temporary)
        temporary = []
You could try:
a_list = ['Mike','Angela','Bill','\n','Robert','Pam','\n']
result = []
start = 0
end = 0
for indx, name in enumerate(a_list):
    if name == '\n':
        end = indx
        sublist = a_list[start:end]
        if sublist:
            result.append(sublist)
        start = indx + 1
>>> result
[['Mike', 'Angela', 'Bill'], ['Robert', 'Pam']]

Replace one item in a string with one item from a list

I have a string and a list:
seq = '01202112'
l = [(0,1,0),(1,1,0)]
I would like a pythonic way of replacing each '2' with the value at the corresponding index in the list l such that I obtain two new strings:
list_seq = ['01001110', '01101110']
By using .replace() I could iterate through l, but I wondered whether there is a more pythonic way to get list_seq.
I might do something like this:
out = [''.join(c if c != '2' else str(next(f, c)) for c in seq) for f in map(iter, l)]
The basic idea is that we call iter to turn the tuples in l into iterators. At that point every time we call next on them, we get the next element we need to use instead of the '2'.
If this is too compact, the logic might be easier to read as a function:
def replace(seq, to_replace, fill):
    fill = iter(fill)
    for element in seq:
        if element != to_replace:
            yield element
        else:
            yield next(fill, element)
giving
In [32]: list(replace([1,2,3,2,2,3,1,2,4,2], to_replace=2, fill="apple"))
Out[32]: [1, 'a', 3, 'p', 'p', 3, 1, 'l', 4, 'e']
Thanks to @DanD in the comments for noting that I had assumed I'd always have enough characters to fill from! We'll follow his suggestion to keep the original characters if we run out, but modifying this approach to behave differently is straightforward and left as an exercise for the reader. :-)
[''.join([str(next(digit, 0)) if x == '2' else x for x in seq])
 for digit in map(iter, l)]
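For the seq and l from the question, this likewise yields ['01001110', '01101110'].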
I don't know if this solution is 'more pythonic' but:
def my_replace(s, c=None, *other):
    return s if c is None else my_replace(s.replace('2', str(c), 1), *other)
seq = '01202112'
l = [(0,1,0),(1,1,0)]
list_seq = [my_replace(seq, *x) for x in l]
seq = '01202112'
li = [(0,1,0),(1,1,0)]
def grunch(s, tu):
    it = map(str, tu)
    return ''.join(next(it) if c == '2' else c for c in s)
list_seq = [grunch(seq,tu) for tu in li]

How do you remove duplicates from a list in Python whilst preserving order and length?

What I want to do is remove duplicates from the list and, every time a duplicate is removed, insert an empty item in its place.
I have code for removing duplicates; it also ignores empty list items:
import csv

# Create new output file
new_file = open('addr_list_corrected.csv', 'w')
new_file.close()

with open('addr_list.csv', 'r') as addr_list:
    csv_reader = csv.reader(addr_list, delimiter=',')
    for row in csv_reader:
        print row
        print "##########################"
        seen = set()
        seen_add = seen.add
        # empty cell/element evaluates to false
        new_row = [cell for cell in row if not (cell and cell in seen or seen_add(cell))]
        print new_row
        with open('addr_list_corrected.csv', 'a') as addr_list_corrected:
            csv_writer = csv.writer(addr_list_corrected, delimiter=',')
            csv_writer.writerow(new_row)
But I need to replace every removed item with an empty string.
I would do that with an iterator. Something like this:
def dedup(seq):
    seen = set()
    for v in seq:
        yield '' if v in seen else v
        seen.add(v)
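For example, a quick check with the same small sample list used further down:
row = ['to', 'be', 'or', 'not', 'to', 'be']
print(list(dedup(row)))
# ['to', 'be', 'or', 'not', '', '']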
Edit: I reversed the logic to make the meaning clearer.
Another alternative would be to do something like this:
seen = dict()
seen_setdefault = seen.setdefault
new_row = ["" if cell in seen else seen_setdefault(cell, cell) for cell in row]
To give an example:
>>> row = ["to", "be", "or", "not", "to", "be"]
>>> seen = dict()
>>> seen_setdefault = seen.setdefault
>>> new_row = ["" if cell in seen else seen_setdefault(cell, cell) for cell in row]
>>> new_row
['to', 'be', 'or', 'not', '', '']
Edit 2: Out of curiosity I ran a quick test to see which approach was fastest:
>>> from random import randint
>>> from statistics import mean
>>> from timeit import repeat
>>>
>>> def standard(seq):
...     """Trivial modification to standard method for removing duplicates."""
...     seen = set()
...     seen_add = seen.add
...     return ["" if x in seen or seen_add(x) else x for x in seq]
...
>>> def dedup(seq):
...     seen = set()
...     for v in seq:
...         yield '' if v in seen else v
...         seen.add(v)
...
>>> def pedro(seq):
...     """Pedro's iterator based approach to removing duplicates."""
...     my_dedup = dedup
...     return [x for x in my_dedup(seq)]
...
>>> def srgerg(seq):
...     """Srgerg's dict based approach to removing duplicates."""
...     seen = dict()
...     seen_setdefault = seen.setdefault
...     return ["" if cell in seen else seen_setdefault(cell, cell) for cell in seq]
...
>>> data = [randint(0, 10000) for x in range(100000)]
>>>
>>> mean(repeat("standard(data)", "from __main__ import data, standard", number=100))
1.2130275770426708
>>> mean(repeat("pedro(data)", "from __main__ import data, pedro", number=100))
3.1519048346103555
>>> mean(repeat("srgerg(data)", "from __main__ import data, srgerg", number=100))
1.2611971098676882
As can be seen from the results, making a relatively simple modification to the standard approach described in this other stack-overflow question is fastest.
You can use a set to keep track of seen items. Using the example list used above:
x = ['to', 'be', 'or', 'not', 'to', 'be']
seen = set()
for index, item in enumerate(x):
    if item in seen:
        x[index] = ''
    else:
        seen.add(item)
print x
You can create a new list and append each element if it is not already present in the new list; otherwise append None.
oldList = [3, 1, 'a', 2, 4, 2, 'a', 5, 1, 3]
newList = []
for i in oldList:
    if i in newList:
        newList.append(None)
    else:
        newList.append(i)
print newList
Output:
[3, 1, 'a', 2, 4, None, None, 5, None, None]

Nested lists python

Can anyone tell me how I can loop over the indices of a nested list?
Generally I just write:
for i in range (list)
but what if I have a list with nested lists as below:
Nlist = [[2,2,2],[3,3,3],[4,4,4]...]
and I want to go through the indexes of each one separately?
If you really need the indices you can just do what you said again for the inner list:
l = [[2,2,2],[3,3,3],[4,4,4]]
for index1 in xrange(len(l)):
    for index2 in xrange(len(l[index1])):
        print index1, index2, l[index1][index2]
But it is more pythonic to iterate through the list itself:
for inner_l in l:
    for item in inner_l:
        print item
If you really need the indices you can also use enumerate:
for index1, inner_l in enumerate(l):
    for index2, item in enumerate(inner_l):
        print index1, index2, item, l[index1][index2]
Try this setup:
a = [["a","b","c",],["d","e"],["f","g","h"]]
To print the 2nd element in the 1st list ("b"), use print a[0][1]. For the 2nd element in the 3rd list ("g"), use print a[2][1].
The first pair of brackets selects which nested list you're accessing, and the second pair selects the item within that list.
You can do this. Adapt it to your situation:
for l in Nlist:
    for item in l:
        print item
The question title is quite broad, while the author's need is more specific. In my case, I needed to extract all elements from a nested list, as in the example below:
Example:
input -> [1,2,[3,4]]
output -> [1,2,3,4]
The code below gives me the result, but I would like to know if anyone can create a simpler answer:
def get_elements_from_nested_list(l, new_l):
    if l is not None:
        e = l[0]
        if isinstance(e, list):
            get_elements_from_nested_list(e, new_l)
        else:
            new_l.append(e)
        if len(l) > 1:
            return get_elements_from_nested_list(l[1:], new_l)
        else:
            return new_l
Call of the method
l = [1,2,[3,4]]
new_l = []
get_elements_from_nested_list(l, new_l)
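Printing new_l afterwards gives the flattened result from the example above:
print(new_l)
# [1, 2, 3, 4]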
n = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]

def flatten(lists):
    results = []
    for numbers in lists:
        for numbers2 in numbers:
            results.append(numbers2)
    return results

print flatten(n)
Output: [1, 2, 3, 4, 5, 6, 7, 8, 9]
I think you want to access list values and their indices simultaneously and separately:
l = [[2,2,2],[3,3,3],[4,4,4],[5,5,5]]
l_len = len(l)
l_item_len = len(l[0])
for i in range(l_len):
    for j in range(l_item_len):
        print(f'List[{i}][{j}] : {l[i][j]}')
