Replace a for loop with parallel processing in PySpark

I am using a for loop in my script to call a function for each element of size_DF (a data frame), but it is taking a lot of time. I tried replacing the for loop with map, but I am not getting any output.
size_DF is a list of around 300 elements which I am fetching from a table.
Using For:
import call_functions
newObject = call_functions.call_functions_class()
size_RDD = sc.parallelize(size_DF)
if len(size_DF) == 0:
    print "No record present in the truncated list"
else:
    for row in size_DF:
        length = row[0]
        print "length: ", length
        insertDF = newObject.full_item(sc, dataBase, length, end_date)
Using Map:
if len(size_DF) == 0:
    print "No record present in the list"
else:
    size_RDD.mapPartition(lambda l: newObject.full_item(sc, dataBase, len(l[0]), end_date))
    newObject.full_item(sc, dataBase, len(l[0]), end_date)
In full_item(), I am doing some select operations, joining 2 tables, and inserting the data into a table.
Please help me and let me know what I am doing wrong.

pyspark.rdd.RDD.mapPartitions (note the plural; an RDD has no mapPartition method) is lazily evaluated.
Usually, to force an evaluation, you call an action on the lazy RDD that is returned, i.e. a method that returns a value to the driver, such as count() or collect().
There are also higher-level methods that take care of forcing the evaluation for you, e.g. pyspark.rdd.RDD.foreach.
Since you don't really care about the results of the operation, you can use pyspark.rdd.RDD.foreach (or foreachPartition, which like mapPartitions receives an iterator over a whole partition) instead of pyspark.rdd.RDD.mapPartitions.
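def first_of(it):
    # Return the first row of the partition, or an empty list if the partition is empty.
    for first in it:
        return first
    return []
def insert_first(it):
    first = first_of(it)
    item_count = len(first)
    # Caveat: a SparkContext (sc) can only be used on the driver, so full_item()
    # must not depend on sc when it is called from inside an executor like this.
    newObject.full_item(sc, dataBase, item_count, end_date)
if len(size_DF) == 0:
    print('No record present in the truncated list')
else:
    # size_DF is a plain list; the RDD built from it is size_RDD.
    # foreachPartition passes an iterator over each partition, which is what insert_first expects.
    size_RDD.foreachPartition(insert_first)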
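One caveat: full_item() receives the SparkContext sc, and a SparkContext can only be used on the driver, not inside functions that run on executors. If full_item() really needs sc, one possible alternative is to keep the calls on the driver and run them concurrently from a thread pool. The following is only a minimal sketch of that idea: full_item here is a hypothetical stand-in for newObject.full_item(sc, dataBase, length, end_date), and run_all and max_workers are names invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def full_item(length):
    # Hypothetical stand-in for newObject.full_item(sc, dataBase, length, end_date).
    return "inserted rows for length {}".format(length)

def run_all(rows, max_workers=8):
    # Each row is a tuple whose first field is the length, as in the original for loop.
    lengths = [row[0] for row in rows]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the input order and re-raises any worker exception here.
        return list(pool.map(full_item, lengths))

size_DF = [(10,), (20,), (30,)]
results = run_all(size_DF)
print(results)
```

Whether this actually speeds things up depends on how much of the time in full_item() is spent waiting on the cluster or the database rather than on the driver itself.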

Related

Creating Generator object from record list within a function

I'm trying to create a generator object for a list of records with data from a MySQL database, so I'm passing the MySQL cursor object to the function as a parameter.
My issue is that if the "if block" containing yield records is commented out, the cust_records function works perfectly fine, but if I uncomment it, the function stops working.
I'm not sure whether this is the wrong way to yield a list object in Python 3.
My code so far:
def cust_records(result_set):
    block_id = None
    records = []
    i = 0
    for row in result_set:
        records.append((row['id'], row, smaller_ids))
    if records:
        yield records

The point of generators is lazy evaluation, so storing all records in a list and yielding the list makes no sense at all. If you want to retain lazy evaluation (which is IMHO preferable, especially if you have to work on arbitrary datasets that might get huge), you want to yield each record, i.e.:
def cust_records(result_set):
    for row in result_set:
        yield (row['id'], row, smaller_ids)
# then
def example():
    cursor.execute(<your_sql_query_here>)
    for record in cust_records(cursor):
        print(record)
Otherwise (if you really want to consume as much memory as possible), just make cust_records a plain function:
def cust_records(result_set):
    records = []
    for row in result_set:
        records.append((row['id'], row, smaller_ids))
    return records
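To illustrate the difference, here is a small sketch showing that the generator version only pulls rows as they are requested. fake_cursor is a hypothetical stand-in for a database cursor that logs each fetch, and smaller_ids is left out because it is not defined in the snippets above.

```python
def cust_records(result_set):
    # Yield one tuple per row instead of building a list first.
    for row in result_set:
        yield (row["id"], row)

def fake_cursor(n, log):
    # Hypothetical stand-in for a database cursor: yields dict rows and records each fetch.
    for i in range(n):
        log.append(i)
        yield {"id": i}

fetched = []
records = cust_records(fake_cursor(1000, fetched))
first = next(records)  # pulls exactly one row from the cursor
print(first, len(fetched))
```

With the list-returning version, all 1000 rows would have been fetched before the first record became available.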

What's the pythonic way of fetching a single value from a single row in cassandra

The query can either return 0 or 1 rows. I'm trying to fetch the id if there is one.
This works but I'm hoping I can end up with something simpler:
result = session.execute("SELECT * FROM table_name WHERE email = %s", (email,))
if len(result.current_rows) != 1:
    raise SystemExit("Account not found for {}".format(email))
id = result[0].id

You could do something like what's outlined in the API docs here.
It is probably also worth checking that:
no empty results are returned (as you do above), and
not more than one result is returned.
So in your code:
if len(result.current_rows) > 1:
    raise SystemExit("More than one result found for {}".format(email))
elif len(result.current_rows) < 1:
    raise SystemExit("No results found for {}".format(email))
for row in result:
    # code to do things here with the end result
    pass
It's probably not a good idea to assume that element [0] will always contain the right result. Maybe have a boolean field to mark which user id is the current one, and then iterate through the results until you hit that flag. That's if you want to check over more than one result, in which case the >1 check above is not applicable.

Specify column name and access by column name:
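result = session.execute("SELECT id FROM table_name WHERE email = %s", (email,))
row = result.one()  # ResultSet.one() returns the first row, or None if there are no rows
id = row.id if row is not None else None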
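The "zero or one row" check can also be written once as a small helper and reused. The sketch below is plain Python and not part of the Cassandra driver: single_or_none is a hypothetical name, it assumes the result set can be iterated over like any other iterable, and Row is only a stand-in for the driver's row objects.

```python
from collections import namedtuple

_MISSING = object()

def single_or_none(rows):
    """Return the only row, or None if there are no rows; raise if there is more than one."""
    it = iter(rows)
    first = next(it, _MISSING)
    if first is _MISSING:
        return None
    if next(it, _MISSING) is not _MISSING:
        raise ValueError("expected at most one row")
    return first

# Stand-in for the rows a query would return.
Row = namedtuple("Row", ["id"])
row = single_or_none([Row(id=42)])
print(row.id if row is not None else "Account not found")
```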

Easiest way to implement multithreading in this function [Python]

So I have data known as id_list that is coming into the function in this format [(u'SGP-3630', 1202), (u'MTSCR-534', 1244)]. The format being two values paired together, there could be 1 pair or a hundred pairs.
This is the function:
def ListParser(id_list):
    list_length = len(id_list)
    count = 0
    table = ""
    while count < list_length:
        jira = id_list[count][0]
        stash = id_list[count][1]
        count = count + 1
        table = table + RetrieveFromAPI(stash, jira)
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table
This function goes through the list, extracts the pairs, and puts them through a function called RetrieveFromAPI(), which fetches information from a URL.
Does anyone have an idea of how to implement multithreading here? I've had a shot at splitting the pairs up into two separate lists and getting the pool to iterate through each list, but it hasn't quite worked.
def ListParser(id_list):
    pool = ThreadPool(4)
    list_length = len(id_list)
    count = 0
    table = ""
    jira_list = list()
    stash_list = list()
    while count < list_length:
        jira_list = jira_list.extend(id_list[count][0])
        print jira_list
        stash_list = stash_list.extend(id_list[count][1])
        print stash_list
        count = count + 1
    table = table + pool.map(RetrieveFromAPI, stash_list, jira_list)
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table
The error I'm getting for this attempt is TypeError: 'int' object is not iterable
EDIT 2: Okay so I've managed to get the first list with tuples split up into two different lists, but I'm unsure how to get multithreading working with it.
jira,stash= map(list,zip(*id_list))

You're working too hard! From help(multiprocessing.pool.ThreadPool)
map(self, func, iterable, chunksize=None)
    Apply `func` to each element in `iterable`, collecting the results
    in a list that is returned.
The second argument is an iterable of the arguments you want to pass to the worker threads. You have a list of lists and you want the first two items from the inner list for each call. id_list is already iterable, so we're close. A small function (in this case implemented as a lambda) bridges the gap.
I worked up a full mock solution just to make sure it works, so here it goes. As an aside, you can benefit from a fairly large pool size since they spend much of their time waiting on I/O.
from multiprocessing.pool import ThreadPool
def RetrieveFromAPI(stash, jira):
    # boring mock of api
    return '{}-{}.'.format(stash, jira)
def TableFormatter(table):
    # mock
    return table
def TableColouriser(table):
    # mock
    return table
def ListParser(id_list):
    if id_list:
        pool = ThreadPool(min(12, len(id_list)))
        table = ''.join(pool.map(lambda item: RetrieveFromAPI(item[1], item[0]),
                                 id_list, chunksize=1))
        pool.close()
        pool.join()
    else:
        table = ''
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table
id_list = [[0,1,'foo'], [2,3,'bar'], [4,5, 'baz']]
print(ListParser(id_list))
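As a possible variation on the lambda, on Python 3.3 and later the pool's starmap() method unpacks each argument tuple for you. This is a sketch rather than a drop-in replacement: retrieve_from_api is a hypothetical mock like the one above, and the pairs are reordered because the function takes (stash, jira) while id_list holds (jira, stash).

```python
from multiprocessing.pool import ThreadPool

def retrieve_from_api(stash, jira):
    # Hypothetical mock of the real API call.
    return "{}-{}.".format(stash, jira)

id_list = [("SGP-3630", 1202), ("MTSCR-534", 1244)]

with ThreadPool(4) as pool:
    # starmap() calls retrieve_from_api(*args) for every tuple and keeps the input order.
    parts = pool.starmap(retrieve_from_api, [(stash, jira) for jira, stash in id_list])

table = "".join(parts)
print(table)
```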

Web.py SQL query: why can we only traverse the result the first time?

Here is my code:
import web
user_db = web.database(dbn='mysql', ....)
info_list = user_db.query("select * from tablename where t_id= ")
for info in info_list:
    # work ok at first time, print the correct id
    print info.id
for info in info_list:
    # Code can't reach here
    print info.id
The second for loop does not seem to work. Why?

According to the source code, the underlying query() call returns an iterator, which is exhausted after the first loop.
If you need to iterate over it multiple times, convert it to a list:
info_list = list(user_db.query("select * from tablename where t_id= "))
Or, you can use itertools.tee() to create new iterators:
info_list1, info_list2 = itertools.tee(info_list)
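The behaviour is easy to reproduce without a database. In this sketch, query() is a hypothetical stand-in for user_db.query(...) that returns a plain one-shot iterator of rows:

```python
import itertools

def query():
    # Hypothetical stand-in for user_db.query(...): returns a one-shot iterator of rows.
    return iter([{"id": 1}, {"id": 2}])

rows = query()
first_pass = [row["id"] for row in rows]
second_pass = [row["id"] for row in rows]  # the iterator is already exhausted
print(first_pass, second_pass)

rows_list = list(query())      # materialise once, then iterate as often as needed
a, b = itertools.tee(query())  # or split one iterator into two independent ones
```

Note that tee() buffers items internally, so when both copies are consumed in full one after the other, list() is usually the simpler choice.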

SQLite output from query into Python script

I have this Python script:
s = stdscr.getstr(0,0, 20) #input length last number
c = db.execute("""SELECT "debit" FROM "members" WHERE "barcode" = '%s' LIMIT 1""" % (s,))
for row in c:
    print row
    if row == '(0,)':
        #display cross
        print 'Tick'
    else:
        #display tick
        print 'Cross'
Where it is asking for a barcode input, and matching the debit field in the database.
The "print row" command returns "(0,)" but when I try to match it, I always get "Cross" as the output, which is not the intended result. Is there a semantic I'm obviously not observing?
Many thanks!

The variable row is a tuple, and '(0,)' is its string representation. You are comparing a value with its string representation, which cannot work.
You need to compare it to the tuple value:
if row == (0,):
Simply remove the quote marks.
Alternatively, you can write
if row[0] == 0:
which will avoid the creation of a tuple just for the comparison. As noted by @CL., row will never be an empty tuple, so extracting row[0] is safe.
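A self-contained sketch of both comparisons, using an in-memory SQLite database. The table layout and the sample barcode are assumptions made for illustration, and has_no_debit is a hypothetical helper name. The sketch also binds the barcode with a "?" placeholder rather than formatting it into the SQL string, which avoids quoting problems with user input.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Assumed schema, for illustration only.
db.execute('CREATE TABLE "members" ("barcode" TEXT, "debit" INTEGER)')
db.execute('INSERT INTO "members" VALUES (?, ?)', ("12345", 0))

def has_no_debit(barcode):
    # "?" lets sqlite3 bind the value instead of formatting it into the SQL string.
    cur = db.execute('SELECT "debit" FROM "members" WHERE "barcode" = ? LIMIT 1', (barcode,))
    row = cur.fetchone()  # a tuple such as (0,), or None if nothing matched
    return row is not None and row[0] == 0

row = db.execute('SELECT "debit" FROM "members" LIMIT 1').fetchone()
print(row == (0,), row == "(0,)", has_no_debit("12345"))
```

The tuple comparison is true, the comparison against the string '(0,)' is false, and the helper returns True for the sample barcode.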
