Multiprocessing: Use the Win32 API to modify Excel cells from Python

I'm really looking for a good solution here; maybe the whole concept of how I did it (or at least tried to do it) is wrong?
I want to make my code capable of using all my cores. In the code I'm modifying Excel cells using the Win32 API. I wrote a small xls class which can check whether the desired file is already open (or open it if not) and set cell values. My stripped-down code looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import win32com.client as win32
from multiprocessing import Pool
from time import sleep

class xls:
    excel = None
    filename = None
    wb = None
    ws = None

    def __init__(self, file):
        self.filename = file

    def getNumOpenWorkbooks(self):
        return self.excel.Workbooks.Count

    def openExcelOrActivateWb(self):
        self.excel = win32.gencache.EnsureDispatch('Excel.Application')
        # Check whether one of the open files is the desired file (self.filename)
        if self.getNumOpenWorkbooks() > 0:
            for i in range(self.getNumOpenWorkbooks()):
                if self.excel.Workbooks.Item(i+1).Name == os.path.basename(self.filename):
                    self.wb = self.excel.Workbooks.Item(i+1)
                    break
        else:
            self.wb = self.excel.Workbooks.Open(self.filename)

    def setCell(self, row, col, val):
        self.ws.Cells(row, col).Value = val

    def setLastWorksheet(self):
        self.ws = self.wb.Worksheets(self.wb.Worksheets.Count)

if __name__ == '__main__':
    dat = zip(range(1, 11), [1]*10)
    # Create Object
    xls = xls('blaa.xls')
    xls.openExcelOrActivateWb()
    xls.setLastWorksheet()
    for (row, col) in dat:
        # Calculate some value here (only depending on row, col):
        # val = some_func(row, col)
        val = 'test'
        xls.setCell(row, col, val)
Now since the loop depends only on the two iterated variables, I wanted to run it in parallel on many cores. I've heard of threading and multiprocessing, and the latter seemed easier to me, so I gave it a go.
So I changed the code like this:
import os
import win32com.client as win32
from multiprocessing import Pool
from time import sleep

class xls:
    ### CLASS DEFINITION LIKE BEFORE ###

''' Define multiprocessing worker '''
def multiWorker((row, col)):
    xls.setCell(row, col, 'test')

if __name__ == '__main__':
    # Create Object
    xls = xls('StockDatabase.xlsm')
    xls.openExcelOrActivateWb()
    xls.setLastWorksheet()
    dat = zip(range(1, 11), [1]*10)
    p = Pool()
    p.map(multiWorker, dat)
This didn't seem to work: after some reading I learned that multiprocessing starts new processes, so xls is not known to the workers.
Unfortunately I can't pass xls to them as a third parameter either, since Win32 COM objects can't be pickled :( Like this:
def multiWorker((row, col, xls)):
    xls.setCell(row, col, 'test')

if __name__ == '__main__':
    # Create Object
    xls = xls('StockDatabase.xlsm')
    xls.openExcelOrActivateWb()
    xls.setLastWorksheet()
    dat = zip(range(1, 11), [1]*10, [xls]*10)
    p = Pool()
    p.map(multiWorker, dat)
The only way left would be to initialize the Win32 object for each process, right before the definition of multiWorker:
# Create Object
xls = xls('StockDatabase.xlsm')
xls.openExcelOrActivateWb()
xls.setLastWorksheet()

def multiWorker((row, col, xls)):
    xls.setCell(row, col, 'test')

if __name__ == '__main__':
    dat = zip(range(1, 11), [1]*10, [xls]*10)
    p = Pool()
    p.map(multiWorker, dat)
But I don't like that, because the constructor of xls has some more logic which automatically tries to find column IDs for known header substrings. That is more effort than I want (and I don't think each process should really open its own Win32 COM interface), and it also gives me an error, because gencache.EnsureDispatch may not tolerate being called that often...
What should I do? What is the solution?
Thanks!!

While Excel can use multiple cores when recalculating spreadsheets, its programmatic interface is strongly tied to the UI model, which is single-threaded. The active workbook, worksheet, and selection are all singleton objects; this is why you cannot interact with the Excel UI at the same time you're driving it through COM (or VBA, for that matter).
tl;dr
Excel doesn't work that way.
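One practical consequence (my sketch, not part of the original answer): parallelize only the computation of the values and funnel every COM call through a single process. The file name is taken from the question, and compute() is a stand-in for the some_func(row, col) mentioned there:

import os
from multiprocessing import Pool
import win32com.client as win32

def compute(args):
    # CPU-bound work only: the workers never touch a COM object
    row, col = args
    val = 'test'  # stand-in for some_func(row, col)
    return row, col, val

if __name__ == '__main__':
    dat = list(zip(range(1, 11), [1] * 10))
    p = Pool()
    results = p.map(compute, dat)  # values computed in parallel
    p.close()
    p.join()
    # A single COM session in the parent process does all the writing
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    wb = excel.Workbooks.Open(os.path.abspath('blaa.xls'))
    ws = wb.Worksheets(wb.Worksheets.Count)
    for row, col, val in results:
        ws.Cells(row, col).Value = val

Because Excel's COM interface serializes everything anyway, this only pays off when computing the values is the expensive part, not the writing.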

Related

Python code doesn't call class from another program when an import statement is used

I'm trying to write code that gets user input via a pop-up and uses that input in a different program.
Below is the code which gets the user input.
Excel_connection.py
import openpyxl
import tkinter as tk

class App(tk.Frame):
    def __init__(self, master=None, **kw):
        # Create a blank dictionary
        self.answers = {}
        tk.Frame.__init__(self, master=master, **kw)
        tk.Label(self, text="Give Input Sheet Path").grid(row=0, column=0)
        self.Input_From_User1 = tk.Entry(self)
        self.Input_From_User1.grid(row=0, column=1)
        tk.Button(self, text="Feed into Program",
                  command=self.collectAnswers).grid(row=2, column=1)

    def collectAnswers(self):
        self.answers['Input_Path'] = self.Input_From_User1.get()
        global Input_Path
        Input_Path = self.answers['Input_Path']
        functionThatUsesAnswers(self.answers)
        quit()

def quit():
    root.destroy()

if __name__ == '__main__':
    root = tk.Tk()
    App(root).grid()
    root.mainloop()

wb = openpyxl.load_workbook(Input_Path)  # trying to open the input sheet from the path given above
ws = wb["Sheet1"]
Below is the code where I'm importing the above program and doing some operations.
Execution.py
import pandas
from Excel_Connection import *
from Snowflake_Connection import *

all_rows = list(ws.rows)
cur = ctx.cursor()
# Pull information from specific cells.
for row in all_rows[1:400]:
    scenario = row[1].value
    query = row[2].value
    if_execute = row[3].value
    if if_execute == 'Y':
        try:
            cur.execute(query)
            df = cur.fetch_pandas_all()
        except:
            print(scenario, " Failed")
        else:
            print("CREATED", scenario, ".csv successfully")
print("ALL INDIVIDUAL REPORTS GENERATED")
When I execute Execution.py, the program does not produce the pop-up window; instead the code throws the error below:
wb = openpyxl.load_workbook(Input_Path)  # trying to open the input sheet from the path given above
NameError: name 'Input_Path' is not defined
I tried the following:
Running Excel_connection.py separately, and it worked just fine.
Placing the code directly in Execution.py instead of importing the program, and again it worked as expected.
The only time I face the issue is when I try to import Excel_connection.py into Execution.py.
Could somebody kindly help me out here?
When you import Excel_connection.py, Python runs the code in it.
So as you run Execution.py:
import pandas -> "runs" the pandas module to define its functions, etc.
from Excel_Connection import * -> imports everything from Excel_Connection. The interpreter opens this file, parses it, and runs it:
2.1 the class App is defined;
2.2 wb = openpyxl.load_workbook(Input_Path) runs, which fails. Input_Path is only defined inside App.collectAnswers(), which was never executed before this point, so there is no Input_Path to use. Your program terminates here and tells you exactly that.
If you run Excel_connection.py directly, it works, because if __name__ == '__main__' is True in that case and that section runs too. But if you import the file, it is False, so that part of the code is skipped.
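To see this mechanism in isolation, here is a minimal, hypothetical pair of files (the names are made up for illustration):
mod_demo.py
print("module-level code: runs on every import")

if __name__ == '__main__':
    print("guarded code: runs only when mod_demo.py is executed directly")
main_demo.py
import mod_demo  # prints only the module-level line; the guarded print never runs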
You should move this into the Execution.py file, before the all_rows = list(ws.rows) line:
wb = openpyxl.load_workbook(Input_Path)  # trying to open the input sheet from the path given above
ws = wb["Sheet1"]
It will still break, since we have no Input_Path, so you must define it somehow; how is up to you. You could create an App the way you do it in Excel_connection.py.
But I wouldn't do that, since it is a little ugly. I would do something like this:
Excel_Connection.py
import openpyxl
import tkinter as tk

class App(tk.Frame):
    def __init__(self, master=None, **kw):
        # Create a blank dictionary
        self.answers = {}
        tk.Frame.__init__(self, master=master, **kw)
        tk.Label(self, text="Give Input Sheet Path").grid(row=0, column=0)
        self.Input_From_User1 = tk.Entry(self)
        self.Input_From_User1.grid(row=0, column=1)
        tk.Button(self, text="Feed into Program",
                  command=self.collectAnswers).grid(row=2, column=1)

    def collectAnswers(self):
        self.answers['Input_Path'] = self.Input_From_User1.get()
        global Input_Path
        Input_Path = self.answers['Input_Path']
        functionThatUsesAnswers(self.answers)
        self.quit()

# def quit():
#     root.destroy()

def main():
    root = tk.Tk()
    App(root).grid()
    root.mainloop()
    wb = openpyxl.load_workbook(Input_Path)  # open the input sheet from the path given above
    return wb["Sheet1"]

if __name__ == '__main__':
    main()
and then
Execution.py
import pandas
import Excel_Connection
from Snowflake_Connection import *

# Now we call the main() function from Excel_Connection, which returns the worksheet for us.
ws = Excel_Connection.main()

all_rows = list(ws.rows)
cur = ctx.cursor()
# Pull information from specific cells.
for row in all_rows[1:400]:
    scenario = row[1].value
    query = row[2].value
    if_execute = row[3].value
    if if_execute == 'Y':
        try:
            cur.execute(query)
            df = cur.fetch_pandas_all()
        except:
            print(scenario, " Failed")
        else:
            print("CREATED", scenario, ".csv successfully")
print("ALL INDIVIDUAL REPORTS GENERATED")
Alternatively, you can simply use CSV files, or some similar file format, to store and transfer data between the two programs.
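A minimal sketch of that hand-off, assuming a made-up file name handoff.csv and the Input_Path variable from the question:

import csv

# Writer side (e.g. at the end of Excel_connection.py's GUI flow)
with open('handoff.csv', 'w', newline='') as f:
    csv.writer(f).writerow([Input_Path])

# Reader side (e.g. at the top of Execution.py)
with open('handoff.csv', newline='') as f:
    Input_Path = next(csv.reader(f))[0]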

How to convert the below tkinter code into multithreading?

I have a method in my tkinter application. This method is used to export data into a CSV from another application. The loop that exports the data is very heavy: it takes days to complete.
I just came across the multithreading concept. It's kind of difficult to understand; I spent an entire day on it but couldn't achieve anything. Below is the code I use in my loop. Can this be handled by multiple threads without freezing my tkinter UI?
I have a Label in the tkinter window that shows the number of records (cells) exported.
def export_cubeData(self):
    exportPath = self.entry_exportPath.get()
    for b in itertools.product(*(k.values())):
        self.update()
        if (self.flag == 0):
            list1 = list()
            for pair in zip(dims, b):
                list1.extend(pair)
            list1.append(self.box_value.get())
            mdx1 = mdx.format(*temp, *list1)
            try:
                data = tm1.cubes.cells.execute_mdx(mdx1)
                data1 = Utils.build_pandas_dataframe_from_cellset(data)
                final_df = final_df.append(data1)
                cellCount = tm1.cubes.cells.execute_mdx_cellcount(mdx1)
                finalcellCount = finalcellCount + cellCount
                self.noOfRecordsProcessed['text'] = finalcellCount
            except:
                pass
        else:
            tm.showinfo("Export Interrupted", "Data export has been cancelled")
            return
    final_df.to_csv(exportPath)
    print(time.time() - start)
    tm.showinfo("info", "Data export has been completed")
    self.noOfRecordsProcessed['text'] = '0'
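A common way to keep a tkinter UI responsive (a sketch of the standard pattern, not code from this thread): run the heavy loop in a worker thread, never touch widgets from that thread, and ferry progress counts back through a queue polled with after(). Widget and attribute names are borrowed from the question; the loop body is a stand-in:

import threading
import queue

def start_export(self):
    self._progress = queue.Queue()
    threading.Thread(target=self._export_worker, daemon=True).start()
    self.after(100, self._poll_progress)   # poll from the UI thread

def _export_worker(self):
    # The heavy loop runs here, off the UI thread. Only the queue is
    # touched from this thread, never a tkinter widget.
    count = 0
    for _ in range(100000):                # stand-in for the real export loop
        count += 1
        self._progress.put(count)

def _poll_progress(self):
    try:
        while True:
            count = self._progress.get_nowait()
            self.noOfRecordsProcessed['text'] = count
    except queue.Empty:
        pass
    self.after(100, self._poll_progress)   # keep polling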

How to use io to generate in-memory data streams as file-like objects?

I'd like to generate an in-memory (temp-file-like) data stream in Python. One thread fills the stream with data, and another one consumes it.
After checking "io - Core tools for working with streams", it seems to me that the io module is the best choice for this.
So I put together a simple example:
#!/usr/local/bin/python3
# encoding: utf-8
import io

if __name__ == '__main__':
    a = io.BytesIO()
    a.write("hello".encode())
    txt = a.read(100)
    txt = txt.decode("utf-8")
    print(txt)
My example does not work: "hello" is not written to a and cannot be read back afterwards. So where is my error? How do I have to alter my code to get a file-like object in memory?
Actually it is written; reading is the problem. A BytesIO has a single position shared by reads and writes, so after the write the position sits at the end of the buffer and read() returns nothing (you should be referring to the class io.BytesIO). You can get the whole value using getvalue():
import io
a = io.BytesIO()
a.write("hello".encode())
txt = a.getvalue()
txt = txt.decode("utf-8")
print(txt)
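Alternatively (not part of the original answer, but standard io behaviour), you can rewind the shared position with seek(0) before reading:

import io

a = io.BytesIO()
a.write("hello".encode())
a.seek(0)                          # rewind the shared read/write position
txt = a.read(100).decode("utf-8")
print(txt)                         # hello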
@dhilmathy and @ShadowRanger mentioned that io.BytesIO() does not have separate read and write pointers.
I overcame this problem by creating a simple class that implements a read pointer and remembers the number of bytes written. When the number of bytes read equals the number of bytes written, the buffer is truncated to save memory.
My solution so far:
#!/usr/local/bin/python3
# encoding: utf-8
import io
import threading

class memoryStreamIO(io.BytesIO):
    """
    memoryStreamIO
    an in-memory file-like stream object
    """
    def __init__(self):
        super().__init__()
        self._wIndex = 0
        self._rIndex = 0
        self._mutex = threading.Lock()

    def write(self, d: bytearray):
        self._mutex.acquire()
        r = super().write(d)
        self._wIndex += len(d)
        self._mutex.release()
        return r

    def read(self, n: int):
        self._mutex.acquire()
        super().seek(self._rIndex)
        r = super().read(n)
        self._rIndex += len(r)
        # now check whether we can shrink the buffer
        if self._rIndex == self._wIndex:
            super().truncate(0)
            super().seek(0)
            self._rIndex = 0
            self._wIndex = 0
        self._mutex.release()
        return r

    def seek(self, n):
        self._mutex.acquire()
        self._rIndex = n
        r = super().seek(n)
        self._mutex.release()
        return r

if __name__ == '__main__':
    a = memoryStreamIO()
    a.write("hello".encode())
    txt = (a.read(100)).decode()
    print(txt)
    a.write("abc".encode())
    txt = (a.read(100)).decode()
    print(txt)

Python / Django compare and update model objects

I've only just started Python but have learned a lot over the last few months. Now I have hit a wall regarding updating model objects at a good speed.
I have a model called Product which is populated from a CSV file. Every day this file gets updated with changes like cost and quantity. I can compare each line of the file with the Product model, but having 120k lines this takes 3-4 hours.
What approach can I take to make processing this file faster? I only want to modify the objects if cost or quantity have changed.
Any suggestions on how I tackle this?
Version 3 of what I have tried:
from django.core.management import BaseCommand
from multiprocessing import Pool
from django.contrib.auth.models import User
from pprint import pprint
from CentralControl.models import Product, Supplier
from CentralControl.management.helpers.map_ingram import *
from CentralControl.management.helpers.helper_generic import *
from tqdm import tqdm
from CentralControl.management.helpers.config import get_ingram
import os, sys, csv, zipfile, CentralControl

# Run script as 'SYSTEM'
user = User.objects.get(id=1)
# Get connection config.
SUPPLIER_CODE, FILE_LOCATION, FILE_NAME = get_ingram()

class Command(BaseCommand):
    def handle(self, *args, **options):
        list_in = get_file()
        list_current = get_current_list()
        pool = Pool(6)
        pool.map(compare_lists(list_in, list_current))
        pool.close()

def compare_lists(list_in, list_current):
    for row_current in tqdm(list_current):
        for row_in in list_in:
            if row_in['order_code'] == row_current['order_code']:
                # do more stuff here.
                pass

def get_current_list():
    try:
        supplier = Supplier.objects.get(code='440040')
        current_list = Product.objects.filter(supplier=supplier).values()
        return current_list
    except:
        print('Error: no products with supplier')
        exit()

def get_file():
    with zipfile.ZipFile(FILE_LOCATION + 'incoming/' + FILE_NAME, 'r') as zip:
        with zip.open('228688 .csv') as csvfile:
            reader = csv.DictReader(csvfile)
            list_in = (list(reader))
            for row in tqdm(list_in):
                row['order_code'] = row.pop('Ingram Part Number')
                row['order_code'] = (row['order_code']).lstrip("0")
                row['name'] = row.pop('Ingram Part Description')
                row['description'] = row.pop('Material Long Description')
                row['mpn'] = row.pop('Vendor Part Number')
                row['gtin'] = row.pop('EANUPC Code')
                row['nett_cost'] = row.pop('Customer Price')
                row['retail_price'] = row.pop('Retail Price')
                row['qty_at_supplier'] = row.pop('Available Quantity')
                row['backorder_date'] = row.pop('Backlog ETA')
                row['backorder_date'] = (row['backorder_date'])
                row['backorder_qty'] = row.pop('Backlog Information')
        zip.close()
    # commented out for dev process.
    # os.rename(FILE_LOCATION + 'incoming/' + FILE_NAME, FILE_LOCATION + 'processed/' + FILE_NAME)
    return list_in
I once faced a problem of slow data loading; I can tell you what I did, and maybe it can help you. I ran the execution in debug mode and tried to find out which column was causing the slow loading, and every time I saw that a column was causing the problem I added an index on it (in the DBMS, PostgreSQL in my case), and it worked. I hope you are facing the same problem, so that my answer can help you.
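In Django you can request such an index directly on the model field. A sketch, assuming the order_code lookup field from the question is a CharField:

from django.db import models

class Product(models.Model):
    # db_index=True makes the database create an index on this column,
    # which speeds up lookups such as filter(order_code=...)
    order_code = models.CharField(max_length=64, db_index=True)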
Here is a rough idea:
1. When reading the CSV, use pandas as suggested by @BearBrow, into array_csv.
2. Convert the object data from Django into a NumPy array array_obj.
3. Don't compare them one by one; use NumPy subtraction to flag the changes:
changed = (array_csv[['cost', 'quantity']] - array_obj[['cost', 'quantity']] != 0)
4. Find the rows that need updating:
obj_need_updated = array_obj[np.logical_or(changed['cost'], changed['quantity'])]
Then use Django bulk update (https://github.com/aykut/django-bulk-update) to update in bulk.
Hope this gives you hints to speed up your code.
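A fuller sketch of that idea using pandas and Django's built-in bulk_update (available since Django 2.2, an alternative to the linked library). Field names are taken from the question's get_file() mapping, and the CSV values are assumed to be normalized to types comparable with the database columns:

import pandas as pd
from CentralControl.models import Product

# Build one frame from the feed and one from the database.
csv_df = pd.DataFrame(get_file())
db_df = pd.DataFrame.from_records(
    Product.objects.values('id', 'order_code', 'nett_cost', 'qty_at_supplier'))

# One indexed merge instead of 120k x 120k Python-level comparisons.
merged = db_df.merge(csv_df, on='order_code', suffixes=('_db', '_csv'))
changed = merged[(merged['nett_cost_db'] != merged['nett_cost_csv']) |
                 (merged['qty_at_supplier_db'] != merged['qty_at_supplier_csv'])]

# Update only the changed rows, in a handful of queries.
products = {p.pk: p for p in Product.objects.filter(pk__in=changed['id'])}
to_update = []
for row in changed.itertuples():
    p = products[row.id]
    p.nett_cost = row.nett_cost_csv
    p.qty_at_supplier = row.qty_at_supplier_csv
    to_update.append(p)
Product.objects.bulk_update(to_update, ['nett_cost', 'qty_at_supplier'], batch_size=1000)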

IPython cluster and PicklingError

My problem seems to be similar to this thread; however, while I think I am following the advised method, I still get a PicklingError. When I run my process locally, without sending it to an IPython cluster engine, the function works fine.
I am using zipline with IPython's notebook, so I first create a class based on zipline.TradingAlgorithm.
Cell [ 1 ]
from IPython.parallel import Client
rc = Client()
lview = rc.load_balanced_view()
Cell [ 2 ]
%%px --local  # This ensures that the class and modules exist on each engine
import zipline as zpl
import numpy as np

class Agent(zpl.TradingAlgorithm):  # must define initialize and handle_data methods
    def initialize(self):
        self.valueHistory = None
        pass

    def handle_data(self, data):
        for security in data.keys():
            ## Just randomly buy/sell/hold for each security
            coinflip = np.random.random()
            if coinflip < .25:
                self.order(security, 100)
            elif coinflip > .75:
                self.order(security, -100)
        pass
Cell [ 3 ]
from zipline.utils.factory import load_from_yahoo

start = '2013-04-01'
end = '2013-06-01'
sidList = ['SPY', 'GOOG']
data = load_from_yahoo(stocks=sidList, start=start, end=end)

agentList = []
for i in range(3):
    agentList.append(Agent())

def testSystem(agent, data):
    results = agent.run(data)  #-- This is how the zipline based class is executed
    #-- next I'm just storing the final value of the test so I can plot later
    agent.valueHistory.append(results['portfolio_value'][len(results['portfolio_value'])-1])
    return agent

for i in range(10):
    tasks = []
    for agent in agentList:
        #agent = testSystem(agent, data)  ## On its own, this works!
        #-- To test, uncomment the above line and comment out the next two
        tasks.append(lview.apply_async(testSystem, agent, data))
    agentList = [ar.get() for ar in tasks]

for agent in agentList:
    plot(agent.valueHistory)
Here is the Error produced:
PicklingError                             Traceback (most recent call last)
/Library/Python/2.7/site-packages/IPython/kernel/zmq/serialize.pyc in serialize_object(obj, buffer_threshold, item_threshold)
    100         buffers.extend(_extract_buffers(cobj, buffer_threshold))
    101
--> 102     buffers.insert(0, pickle.dumps(cobj,-1))
    103     return buffers
    104
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
If I override the run() method from zipline.TradingAlgorithm with something like:
def run(self, data):
    return 1
then the hand-off to the engines works, but obviously the guts of the test are not performed. Trying something like this:
def run(self, data):
    return zpl.TradingAlgorithm.run(self, data)
results in the same PicklingError. As run is a method internal to zipline.TradingAlgorithm and I don't know everything that it does, how would I make sure it is passed through?
It looks like the zipline TradingAlgorithm object is not pickleable after it has been run:
import pickle
import zipline as zpl

class Agent(zpl.TradingAlgorithm):  # must define initialize and handle_data methods
    def handle_data(self, data):
        pass

agent = Agent()
pickle.dumps(agent)[:32]  # ok
agent.run(data)
pickle.dumps(agent)[:32]  # fails
But this suggests to me that you should be creating the Agents on the engines, and only passing data / results back and forth (ideally, not passing data across at all, or at most once).
Minimizing data transfers might look something like this:
define the class:
%%px
import zipline as zpl
import numpy as np

class Agent(zpl.TradingAlgorithm):  # must define initialize and handle_data methods
    def initialize(self):
        self.valueHistory = []

    def handle_data(self, data):
        for security in data.keys():
            ## Just randomly buy/sell/hold for each security
            coinflip = np.random.random()
            if coinflip < .25:
                self.order(security, 100)
            elif coinflip > .75:
                self.order(security, -100)
load the data
%%px
from zipline.utils.factory import load_from_yahoo

start = '2013-04-01'
end = '2013-06-01'
sidList = ['SPY', 'GOOG']
data = load_from_yahoo(stocks=sidList, start=start, end=end)
agent = Agent()
and run the code:
from IPython import parallel

def testSystem(agent, data):
    results = agent.run(data)  #-- This is how the zipline based class is executed
    #-- next I'm just storing the final value of the test so I can plot later
    agent.valueHistory.append(results['portfolio_value'][len(results['portfolio_value'])-1])

# create references to the remote agent / data objects
agent_ref = parallel.Reference('agent')
data_ref = parallel.Reference('data')

tasks = []
for i in range(10):
    for j in range(len(rc)):
        tasks.append(lview.apply_async(testSystem, agent_ref, data_ref))

# wait for the tasks to complete
[t.get() for t in tasks]
And plot the results, never fetching the agents themselves
%matplotlib inline
import matplotlib.pyplot as plt

for history in rc[:].apply_async(lambda: agent.valueHistory):
    plt.plot(history)
This is not quite the same code you shared: yours had three agents bouncing back and forth across all your engines, whereas this has one agent per engine. I don't know enough about zipline to say whether that's useful to you or not.
