I've only just started Python but have learned a lot over the last few months. Now I have hit a wall: updating objects on a model at a good speed.
I have a model called Products, and it is populated from a csv file. Every day this file gets updated with changes like cost and quantity. I can compare each line of the file with the Products model, but with 120k lines this takes 3-4 hours.
What process can I take to make processing this file faster? I only want to modify the objects if cost and quantity have changed.
Any suggestions on how I tackle this?
Ver 3 of what I have tried:
from django.core.management import BaseCommand
from functools import partial
from multiprocessing import Pool
from django.contrib.auth.models import User
from pprint import pprint
from CentralControl.models import Product, Supplier
from CentralControl.management.helpers.map_ingram import *
from CentralControl.management.helpers.helper_generic import *
from tqdm import tqdm
from CentralControl.management.helpers.config import get_ingram
import os, sys, io, csv, zipfile, CentralControl

# Run script as 'SYSTEM'.
user = User.objects.get(id=1)

# Get connection config.
SUPPLIER_CODE, FILE_LOCATION, FILE_NAME = get_ingram()


class Command(BaseCommand):
    def handle(self, *args, **options):
        list_in = get_file()
        list_current = list(get_current_list())
        pool = Pool(6)
        # map() needs a callable plus an iterable of arguments, so split
        # the current rows into chunks and compare each chunk to the csv.
        chunks = [list_current[i::6] for i in range(6)]
        pool.map(partial(compare_lists, list_in), chunks)
        pool.close()
        pool.join()


def compare_lists(list_in, list_current):
    for row_current in tqdm(list_current):
        for row_in in list_in:
            if row_in['order_code'] == row_current['order_code']:
                # do more stuff here.
                pass


def get_current_list():
    try:
        supplier = Supplier.objects.get(code='440040')
        return Product.objects.filter(supplier=supplier).values()
    except Supplier.DoesNotExist:
        print('Error: no products with supplier')
        exit()


def get_file():
    with zipfile.ZipFile(FILE_LOCATION + 'incoming/' + FILE_NAME, 'r') as zf:
        # ZipFile.open() yields bytes; wrap it so csv.DictReader gets text.
        with io.TextIOWrapper(zf.open('228688 .csv'), encoding='utf-8') as csvfile:
            list_in = list(csv.DictReader(csvfile))
    for row in tqdm(list_in):
        row['order_code'] = row.pop('Ingram Part Number').lstrip('0')
        row['name'] = row.pop('Ingram Part Description')
        row['description'] = row.pop('Material Long Description')
        row['mpn'] = row.pop('Vendor Part Number')
        row['gtin'] = row.pop('EANUPC Code')
        row['nett_cost'] = row.pop('Customer Price')
        row['retail_price'] = row.pop('Retail Price')
        row['qty_at_supplier'] = row.pop('Available Quantity')
        row['backorder_date'] = row.pop('Backlog ETA')
        row['backorder_qty'] = row.pop('Backlog Information')
    # commented out for dev process.
    # os.rename(FILE_LOCATION + 'incoming/' + FILE_NAME, FILE_LOCATION + 'processed/' + FILE_NAME)
    return list_in
I once faced a problem of slow data loading; I can tell you what I did and maybe it can help you somehow. I ran the execution in debug mode and tried to find out which column was causing the slow loading, and every time I saw that a column was causing the problem, I added an index on it (in the DBMS, PostgreSQL in my case), and it worked. I hope you are facing the same problem, so my answer can help you.
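For reference, in Django you can declare such an index on the model itself instead of adding it by hand in the database. A minimal sketch, assuming the lookup field is order_code as in the question's code (the max_length here is a guess):
from django.db import models

class Product(models.Model):
    # db_index=True makes the database (PostgreSQL here) build an index on
    # the column used for the per-row lookups, which speeds them up.
    order_code = models.CharField(max_length=64, db_index=True)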
Here is a rough idea:
1. When reading the csv, use pandas, as suggested by #BearBrow, into array_csv.
2. Convert the object data from Django into a NumPy array, array_obj.
3. Don't compare them one by one; use NumPy subtraction:
compare_index = (array_csv[['cost', 'quantity']] - array_obj[['cost', 'quantity']] != 0)
4. Find the updated rows:
obj_need_updated = array_obj[np.logical_or(compare_index['cost'], compare_index['quantity'])]
Then use django-bulk-update (https://github.com/aykut/django-bulk-update) to bulk update.
Hope this gives you hints to speed up your code.
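A minimal sketch of that idea using a pandas merge in place of raw NumPy subtraction; the field names nett_cost and qty_at_supplier come from the question's mapping, while the dtype conversion and Django's built-in bulk_update (available since Django 2.2) are my assumptions:
import pandas as pd
from CentralControl.models import Product, Supplier

supplier = Supplier.objects.get(code='440040')

# Load both sides into DataFrames keyed on order_code.
df_csv = pd.DataFrame(get_file())
df_db = pd.DataFrame(list(Product.objects.filter(supplier=supplier)
                          .values('id', 'order_code', 'nett_cost', 'qty_at_supplier')))

# Normalise csv strings to numbers so the comparison is apples to apples.
for col in ('nett_cost', 'qty_at_supplier'):
    df_csv[col] = pd.to_numeric(df_csv[col], errors='coerce')

# One merge replaces the 120k x 120k nested loop.
merged = df_db.merge(df_csv[['order_code', 'nett_cost', 'qty_at_supplier']],
                     on='order_code', suffixes=('_db', '_csv'))
changed = merged[(merged['nett_cost_db'] != merged['nett_cost_csv']) |
                 (merged['qty_at_supplier_db'] != merged['qty_at_supplier_csv'])]

# Update only the rows that actually changed, in one query batch.
products = Product.objects.in_bulk(changed['id'].tolist())
for _, row in changed.iterrows():
    products[row['id']].nett_cost = row['nett_cost_csv']
    products[row['id']].qty_at_supplier = row['qty_at_supplier_csv']
Product.objects.bulk_update(list(products.values()),
                            ['nett_cost', 'qty_at_supplier'])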
I want to remove 10% of the entries from a DynamoDB (DDB) table every time a script is run. So far, I have created a Python script using boto3 that will delete all items from a DDB table:
import boto3
import sys
src_region = sys.argv[1]
src_profile_name = sys.argv[2]
src_ddb_table = sys.argv[3]
# Create source session.
src_session = boto3.session.Session(profile_name=src_profile_name)
dynamoclient = src_session.client('dynamodb', region_name=src_region)
dynamoresponse = dynamoclient.get_paginator('scan').paginate(
    TableName=src_ddb_table,
    Select='ALL_ATTRIBUTES',
    ReturnConsumedCapacity='NONE',
    ConsistentRead=True
)

for page in dynamoresponse:
    for item in page['Items']:
        dynamoclient.delete_item(
            Key={'testTableDest': item['testTableDest']},
            TableName=src_ddb_table)
How can I modify this script to allow the user to select a percentage of entries they want to delete?
Thank you for any help!
If you want to delete them at random and are OK with a non-exact percentage, you can do this easily with Python's random module.
import boto3
import sys
import random
src_region = sys.argv[1]
src_profile_name = sys.argv[2]
src_ddb_table = sys.argv[3]
percent_delete = int(sys.argv[4]) # 20
# Create source session.
src_session = boto3.session.Session(profile_name=src_profile_name)
dynamoclient = src_session.client('dynamodb', region_name=src_region)
dynamoresponse = dynamoclient.get_paginator('scan').paginate(
    TableName=src_ddb_table,
    Select='ALL_ATTRIBUTES',
    ReturnConsumedCapacity='NONE',
    ConsistentRead=True
)

for page in dynamoresponse:
    for item in page['Items']:
        if random.random() * 100 < percent_delete:
            dynamoclient.delete_item(
                Key={'testTableDest': item['testTableDest']},
                TableName=src_ddb_table)
This isn't "perfectly random" but will suffice.
I am trying to use Python's InfluxDB package to upload a dataframe into the database.
I am using the write_points method to write points into the database, as given in the documentation (https://influxdb-python.readthedocs.io/en/latest/api-documentation.html).
Every time I try to use it, it only updates the last line of the dataframe instead of the complete dataframe.
Is this the usual behavior, or is there some problem here?
Given below is my script:
from influxdb import InfluxDBClient, DataFrameClient
import pathlib
import numpy as np
import pandas as pd
import datetime

db_client = DataFrameClient('dbserver', port, 'username', 'password', 'database',
                            ssl=True, verify_ssl=True)

today = datetime.datetime.now().strftime('%Y%m%d')
path = pathlib.Path('/dir1/dir/2').glob(f'pattern_to_match*/{today}.filename.csv')
for file in path:
    order_start = pd.read_csv(f'{file}')
    if not order_start.empty:
        order_start['data_line1'] = (order_start['col1'] -
                                     order_start['col2']) * 1000
        order_start['data_line2'] = (order_start['col3'] -
                                     order_start['col4']) * 1000
        d1 = round(order_start['data_line1'].quantile(np.arange(0, 1.1, 0.1)), 3)
        d2 = round(order_start['data_line2'].quantile(np.arange(0, 1.1, 0.1)), 3)
        out_file = pd.DataFrame()
        out_file = out_file.append(d1)
        out_file = out_file.append(d2)
        out_file = out_file.T
        out_file.index = out_file.index.set_names(['percentile'])
        out_file = out_file.reset_index()
        out_file['percentile'] = out_file.percentile.apply(lambda x: f'{100*x:.0f}%')
        out_file['tag_col'] = str(file).split('/')[2]
        out_file['time'] = pd.to_datetime('today').strftime('%Y%m%d')
        out_file = out_file.set_index('time')
        out_file.index = pd.to_datetime(out_file.index)
        db_client.write_points(out_file, 'measurement', database='database',
                               retention_policy='rp')
Can anyone please help?
I'm trying to use data from a csv file (https://www.kaggle.com/jingbinxu/sample-of-car-data). I only need the horsepower and weight columns as variables for the equation: 1/4-mile ET = 6.290 * (weight/hp) ** 0.33. But the program won't apply it. I don't know if the storage is working, or whether I shouldn't do it as a class. When I run the program it doesn't show any errors, but it doesn't show results either. Then I need to plot the results, but I don't think it's even calculating and storing them. Any help is appreciated. Thanks in advance.
Here's the current code I have:
import numpy as np

class car_race_analysis():
    def __init__(self, filename):
        import numpy as np
        self.data = np.genfromtxt(filename, delimiter=',', skip_header=1)

    def race_stats(self, w, h):
        # cars in data
        cars = np.unique(self.data[:, 0])
        # storage for output
        race_times = []
        # for each car
        for car in cars:
            # mask
            mask = self.data[:, 0] == car
            # get data
            w = self.data[mask, 12]
            h = self.data[mask, 18]
            # apply formula
            qrtr_mile = 6.290 * (w / h) ** .33
            race_times.append(qrtr_mile)
        # new attribute
        self.race_times = np.array(race_times)
        print(race_times)

    def trend_plotter(self):
        import matlib.pyplot as plt
        # inputs
        self.race_stats
        cars = np.unique(self.data[:, 0])
        # plot
        plt.plot(cars, self.race_times)
        plt.xlabel("Car")
        plt.ylabel("1/4 Mile Time")
        plt.savefig("trend_plot.png")

filename = 'car_data.csv'
Two problems:
I think you meant matplotlib instead of matlib. Make sure you install it (pip3 install matplotlib --user) and edit your code accordingly.
Your previous code wasn't working because you weren't instantiating the class or running any of its methods. The only "work" your program did was to define the class and then set a filename variable.
To solve #2, replace your filename=... line with the code below.
Here's what it does:
It checks whether the file is being run directly (i.e. from a command prompt, such as python3 <your_file_name>.py). If this class is instead imported and used from a different Python file, that code will not be executed. More reading: https://www.geeksforgeeks.org/what-does-the-if-name-main-do/
We instantiate an instance of your class and supply the filename variable, since that is what your class's __init__ method expects.
We invoke the trend_plotter method on the instance of the class.
if __name__ == '__main__':
    filename = 'car_data.csv'
    car_analysis = car_race_analysis(filename)
    car_analysis.trend_plotter()
Even with those changes, your program will not work, because it has other errors. I made a guess at fixing it, which I've pasted below, but I strongly encourage you to diff my changes to understand what I altered and be sure it does what you want.
import numpy as np
import matplotlib.pyplot as plt

class car_race_analysis():
    race_times = []
    cars = []

    def __init__(self, filename):
        import numpy as np
        self.data = np.genfromtxt(filename, delimiter=',', skip_header=1)

    def race_stats(self, w, h):
        # cars in data
        self.cars = np.unique(self.data[:, 0])
        # storage for output
        self.race_times = []
        # for each car
        for car in self.cars:
            # mask
            mask = self.data[:, 0] == car
            # get data
            w = self.data[mask, 12]
            h = self.data[mask, 18]
            # apply formula
            qrtr_mile = 6.290 * (w / h) ** .33
            self.race_times.append(qrtr_mile)
        # new attribute
        self.race_times = np.array(self.race_times)

    def trend_plotter(self):
        # inputs
        self.race_stats(len(self.cars), len(self.race_times))
        # plot
        plt.plot(self.cars, self.race_times)
        plt.xlabel("Car")
        plt.ylabel("1/4 Mile Time")
        plt.savefig("trend_plot.png")
        plt.show()

if __name__ == '__main__':
    filename = 'car_data.csv'
    car_analysis = car_race_analysis(filename)
    car_analysis.trend_plotter()
I'm really looking for a good solution here; maybe the complete concept of how I did it (or at least tried to do it) is wrong!?
I want to make my code capable of using all my cores. In the code I'm modifying Excel cells using the Win32 API. I wrote a small xls class which can check whether the desired file is already open (or open it if not) and set values on cells. My stripped-down code looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import win32com.client as win32
from multiprocessing import Pool
from time import sleep

class xls:
    excel = None
    filename = None
    wb = None
    ws = None

    def __init__(self, file):
        self.filename = file

    def getNumOpenWorkbooks(self):
        return self.excel.Workbooks.Count

    def openExcelOrActivateWb(self):
        self.excel = win32.gencache.EnsureDispatch('Excel.Application')
        # Check whether one of the open files is the desired file (self.filename)
        if self.getNumOpenWorkbooks() > 0:
            for i in range(self.getNumOpenWorkbooks()):
                if self.excel.Workbooks.Item(i + 1).Name == os.path.basename(self.filename):
                    self.wb = self.excel.Workbooks.Item(i + 1)
                    break
        else:
            self.wb = self.excel.Workbooks.Open(self.filename)

    def setCell(self, row, col, val):
        self.ws.Cells(row, col).Value = val

    def setLastWorksheet(self):
        self.ws = self.wb.Worksheets(self.wb.Worksheets.Count)

if __name__ == '__main__':
    dat = zip(range(1, 11), [1] * 10)
    # Create Object
    xls = xls('blaa.xls')
    xls.openExcelOrActivateWb()
    xls.setLastWorksheet()
    for (row, col) in dat:
        # Calculate some value here (only depending on row, col):
        # val = some_func(row, col)
        val = 'test'
        xls.setCell(row, col, val)
Now, as the loop depends ONLY on the two iterated vars, I wanted to make it run in parallel on many cores. I've heard of threading and multiprocessing; the latter seemed easier to me, so I gave it a go.
So I changed the code like this:
import os
import win32com.client as win32
from multiprocessing import Pool
from time import sleep

class xls:
    ### CLASS_DEFINITION LIKE BEFORE ###

''' Define Multiprocessing Worker '''
def multiWorker((row, col)):
    xls.setCell(row, col, 'test')

if __name__ == '__main__':
    # Create Object
    xls = xls('StockDatabase.xlsm')
    xls.openExcelOrActivateWb()
    xls.setLastWorksheet()
    dat = zip(range(1, 11), [1] * 10)
    p = Pool()
    p.map(multiWorker, dat)
This didn't seem to work: after some reading I learned that multiprocessing starts new processes, hence xls is not known to the workers.
Unfortunately, I can't pass xls to them as a third parameter either, since the Win32 object can't be pickled :( Like this:
def multiWorker((row, col, xls)):
    xls.setCell(row, col, 'test')

if __name__ == '__main__':
    # Create Object
    xls = xls('StockDatabase.xlsm')
    xls.openExcelOrActivateWb()
    xls.setLastWorksheet()
    dat = zip(range(1, 11), [1] * 10, [xls] * 10)
    p = Pool()
    p.map(multiWorker, dat)
The only way would be to initialize the Win32 object for each process, right before the definition of multiWorker:
# Create Object
xls = xls('StockDatabase.xlsm')
xls.openExcelOrActivateWb()
xls.setLastWorksheet()

def multiWorker((row, col, xls)):
    xls.setCell(row, col, 'test')

if __name__ == '__main__':
    dat = zip(range(1, 11), [1] * 10, [xls] * 10)
    p = Pool()
    p.map(multiWorker, dat)
But I don't like that, because my xls constructor has some more logic which automatically tries to find column ids for known header substrings. So that is rather more effort than I wanted (and I don't think each process should really open its own Win32 COM interface), and it also gives me an error, because gencache.EnsureDispatch might not tolerate being called that often...
What to do? How is the solution?
Thanks!!
While Excel can use multiple cores when recalculating spreadsheets, its programmatic interface is strongly tied to the UI model, which is single threaded. The active workbook, worksheet, and selection are all singleton objects; this is why you cannot interact with the Excel UI at the same time you're driving it using COM (or VBA, for that matter).
tl;dr
Excel doesn't work that way.
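If the expensive part is computing the cell values rather than writing them, a workaround that respects this constraint is to parallelize only the computation and keep every COM call in the parent process. A minimal sketch, reusing the question's xls class, where compute_value is a COM-free stand-in for the question's some_func:
from multiprocessing import Pool

def compute_value(args):
    # COM-free work only; a stand-in for the question's some_func(row, col).
    row, col = args
    return row, col, 'test'

if __name__ == '__main__':
    dat = list(zip(range(1, 11), [1] * 10))
    pool = Pool()
    results = pool.map(compute_value, dat)  # workers never touch Excel
    pool.close()
    pool.join()

    # A single COM connection, used only in the parent process.
    xls = xls('StockDatabase.xlsm')
    xls.openExcelOrActivateWb()
    xls.setLastWorksheet()
    for row, col, val in results:
        xls.setCell(row, col, val)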
I came across this snippet for uploading files in Jupyter; however, I don't know how to save the file on the machine that executes the code, or how to show the first 5 lines of the uploaded file. Basically, I am looking for the proper commands for accessing the file after it has been uploaded:
import io
from IPython.display import display
import fileupload

def _upload():
    _upload_widget = fileupload.FileUploadWidget()

    def _cb(change):
        decoded = io.StringIO(change['owner'].data.decode('utf-8'))
        filename = change['owner'].filename
        print('Uploaded `{}` ({:.2f} kB)'.format(
            filename, len(decoded.read()) / 2 ** 10))

    _upload_widget.observe(_cb, names='data')
    display(_upload_widget)

_upload()
_cb is called when the upload finishes. As described in the comment above, you can write to a file there, or store it in a variable. For example:
from IPython.display import display
import fileupload

uploader = fileupload.FileUploadWidget()

def _handle_upload(change):
    w = change['owner']
    with open(w.filename, 'wb') as f:
        f.write(w.data)
    print('Uploaded `{}` ({:.2f} kB)'.format(
        w.filename, len(w.data) / 2**10))

uploader.observe(_handle_upload, names='data')
display(uploader)
After the upload has finished, you can access the filename as:
uploader.filename
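To cover the "show the first 5 lines" part of the question, a small sketch (assuming the uploaded file is utf-8 text, as the question's snippet already does):
import io

# Decode the uploaded bytes and print the first 5 lines.
text = io.StringIO(uploader.data.decode('utf-8'))
for _ in range(5):
    print(text.readline(), end='')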
I am working on ML with Jupyter Notebook, and I was looking for a solution to select the local files containing the datasets by browsing the local file system. Admittedly, the question here refers more to uploading than to selecting a file, but I am posting a snippet that I found here, because when I was searching for a solution to my particular case, the results took me to this question several times.
import os
import ipywidgets as widgets

class FileBrowser(object):
    def __init__(self):
        self.path = os.getcwd()
        self._update_files()

    def _update_files(self):
        self.files = list()
        self.dirs = list()
        if os.path.isdir(self.path):
            for f in os.listdir(self.path):
                ff = os.path.join(self.path, f)
                if os.path.isdir(ff):
                    self.dirs.append(f)
                else:
                    self.files.append(f)

    def widget(self):
        box = widgets.VBox()
        self._update(box)
        return box

    def _update(self, box):
        def on_click(b):
            if b.description == '..':
                self.path = os.path.split(self.path)[0]
            else:
                self.path = os.path.join(self.path, b.description)
            self._update_files()
            self._update(box)

        buttons = []
        if self.files:
            button = widgets.Button(description='..', background_color='#d0d0ff')
            button.on_click(on_click)
            buttons.append(button)
        for f in self.dirs:
            button = widgets.Button(description=f, background_color='#d0d0ff')
            button.on_click(on_click)
            buttons.append(button)
        for f in self.files:
            button = widgets.Button(description=f)
            button.on_click(on_click)
            buttons.append(button)
        box.children = tuple([widgets.HTML("<h2>%s</h2>" % (self.path,))] + buttons)
And to use it:
f = FileBrowser()
f.widget()
# <interact with widget, select a path>
# in a separate cell:
f.path # returns the selected path
Four years later, this remains an interesting question, though FileUpload has slightly changed and now belongs to ipywidgets...
Here is a demo that shows how to get the file (or files) after the button click, and how to reset the button to accept more files...
from ipywidgets import FileUpload

def on_upload_change(change):
    if not change.new:
        return
    up = change.owner
    for filename, data in up.value.items():
        print(f'writing [{filename}] to ./')
        with open(filename, 'wb') as f:
            f.write(data['content'])
    up.value.clear()
    up._counter = 0

upload_btn = FileUpload()
upload_btn.observe(on_upload_change, names='_counter')
upload_btn
And here is a "debug" version that shows what is going on / why things work...
from ipywidgets import FileUpload

def on_upload_change(change):
    if change.new == 0:
        print('cleared')
        return
    up = change.owner
    print(type(up.value))
    for filename, data in up.value.items():
        print('==========================================================================================')
        print(filename)
        for k, v in data['metadata'].items():
            print(f'    -{k:13}:[{v}]')
        print(f'    -content len :[{len(data["content"])}]')
        print('==========================================================================================')
    up.value.clear()
    up._counter = 0

upload_btn = FileUpload()
upload_btn.observe(on_upload_change, names='_counter')
upload_btn
I stumbled into this thread about 2 years late. For those still confused about how to work with the fileupload widget, I have built off of the excellent answer posted by minrk, with some other usage examples below.
from IPython.display import display
import fileupload

uploader = fileupload.FileUploadWidget()

def _handle_upload(change):
    w = change['owner']
    with open(w.filename, 'wb') as f:
        f.write(w.data)
    print('Uploaded `{}` ({:.2f} kB)'.format(
        w.filename, len(w.data) / 2**10))

uploader.observe(_handle_upload, names='data')
display(uploader)
From the widget documentation:
class FileUploadWidget(ipywidgets.DOMWidget):
    '''File Upload Widget.
    This widget provides file upload using `FileReader`.
    '''
    _view_name = traitlets.Unicode('FileUploadView').tag(sync=True)
    _view_module = traitlets.Unicode('fileupload').tag(sync=True)
    label = traitlets.Unicode(help='Label on button.').tag(sync=True)
    filename = traitlets.Unicode(help='Filename of `data`.').tag(sync=True)
    data_base64 = traitlets.Unicode(help='File content, base64 encoded.'
                                    ).tag(sync=True)
    data = traitlets.Bytes(help='File content.')

    def __init__(self, label="Browse", *args, **kwargs):
        super(FileUploadWidget, self).__init__(*args, **kwargs)
        self._dom_classes += ('widget_item', 'btn-group')
        self.label = label

    def _data_base64_changed(self, *args):
        self.data = base64.b64decode(self.data_base64.split(',', 1)[1])
Get the data in bytestring format:
uploader.data
Get the data in a regular utf-8 string:
datastr = str(uploader.data, 'utf-8')
Make a new pandas dataframe from the utf-8 string (e.g. from a .csv input):
import pandas as pd
from io import StringIO

datatbl = StringIO(datastr)
newdf = pd.read_table(datatbl, sep=',', index_col=None)
You have to enable the fileupload extension to make the browse button appear in your notebook.
Run the following:
!jupyter nbextension enable fileupload --user --py
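As a sanity check (my assumption, and classic notebook only), listing the notebook extensions should then show fileupload as enabled:
!jupyter nbextension list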