I am trying to push some big files (around 4 million records) into a Mongo instance. What I basically want to achieve is to update the existing data with the data from the files. The algorithm looks something like:
import time

rowHeaders = ('orderId', 'manufacturer', 'itemWeight')

for row in dataFile:
    row = row.strip('\n').split('\t')
    row = dict(zip(rowHeaders, row))

    # look up the existing document for this order
    mongoRow = mongoCollection.find_one({'orderId': row['orderId']})
    if mongoRow is not None:
        if mongoRow['itemWeight'] != row['itemWeight']:
            row['tsUpdated'] = time.time()
    else:
        row['tsUpdated'] = time.time()

    mongoCollection.update({'orderId': row['orderId']}, row, upsert=True)
So: update the whole row except 'tsUpdated' if the weights are the same, insert a new row if the orderId is not in Mongo yet, or update the whole row including 'tsUpdated' if the weight changed. That is the algorithm.
The question is: can this be done faster, more easily, and more efficiently from Mongo's point of view (possibly with some kind of bulk insert)?
Combine a unique index on orderId with an update query that also checks for a change in itemWeight. The unique index prevents an insert with only a modified timestamp when the orderId is already present and the itemWeight is the same.
mongoCollection.ensure_index('orderId', unique=True)

mongoCollection.update({'orderId': row['orderId'],
                        'itemWeight': {'$ne': row['itemWeight']}},
                       row, upsert=True)
My benchmark shows a 5-10x performance improvement over your algorithm (depending on the ratio of inserts to updates).
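If you also want to batch the writes, recent PyMongo versions expose bulk_write. Here is a hedged sketch built on the same unique index and $ne filter; the database/collection names, batch size and file path are placeholders, and rowHeaders plus the tab-separated file come from your question:

import time
from pymongo import MongoClient, ReplaceOne
from pymongo.errors import BulkWriteError

client = MongoClient()                      # assumes a local mongod
mongoCollection = client.mydb.orders        # placeholder database/collection names
mongoCollection.create_index('orderId', unique=True)

def flush(requests):
    try:
        mongoCollection.bulk_write(requests, ordered=False)
    except BulkWriteError:
        pass   # duplicate-key errors are expected when nothing changed

rowHeaders = ('orderId', 'manufacturer', 'itemWeight')
requests = []
with open('orders.tsv') as dataFile:        # placeholder path
    for row in dataFile:
        row = dict(zip(rowHeaders, row.strip('\n').split('\t')))
        row['tsUpdated'] = time.time()
        # replace the document only when itemWeight changed; insert it when missing
        requests.append(ReplaceOne({'orderId': row['orderId'],
                                    'itemWeight': {'$ne': row['itemWeight']}},
                                   row, upsert=True))
        if len(requests) == 1000:           # flush in batches of 1000
            flush(requests)
            requests = []
if requests:
    flush(requests)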
I am trying to do this in Python: I have multiple prefixes to query in Bigtable, but I only want the first result of each row set defined by a prefix. In essence, I want to apply a limit of 1 to each row set, not to the entire scan.
Imagine you have records with the following row keys:
collection_1#item1#reversed_timestamp1
collection_1#item1#reversed_timestamp2
collection_1#item2#reversed_timestamp3
collection_1#item2#reversed_timestamp4
What if I want to retrieve just the latest entries for collection_1#item1# and collection_1#item2# at the same time?
The expected output should be the rows corresponding to:
collection_1#item1#reversed_timestamp1
collection_1#item2#reversed_timestamp3
Can this be done in Bigtable?
Thanks!
Is collection_1#item1#reversed_timestamp1 the row key, or is reversed_timestamp1 actually a cell timestamp?
If it is not part of the row key, you could use a filter such as the cells-per-column limit
(https://cloud.google.com/bigtable/docs/using-filters#cells-per-column-limit), e.g.

rows = table.read_rows(filter_=row_filters.CellsColumnLimitFilter(2))

or the cells-per-row limit
(https://cloud.google.com/bigtable/docs/using-filters#cells-per-row-limit), e.g.

rows = table.read_rows(filter_=row_filters.CellsRowLimitFilter(2))

depending on how your data is laid out.
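For context, here is a minimal sketch of how such a filter plugs into a read with the google-cloud-bigtable client. The project, instance and table names are placeholders, and the limit is set to 1 cell per column to keep only the newest cell:

from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project='my-project')             # placeholder project id
table = client.instance('my-instance').table('my-table')   # placeholder names

# keep only the newest cell per column of every row the scan returns
rows = table.read_rows(filter_=row_filters.CellsColumnLimitFilter(1))
for row in rows:
    print(row.row_key, row.cells)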
I currently have a .h5 file containing a table with three columns: a 64-character text column, a UInt32 column relating to the source of the text, and a UInt32 column holding the xxhash of the text. The table consists of ~2.5e9 rows.
I am trying to find and count the duplicates of each text entry in the table, essentially merging them into one entry while counting the instances. I have tried doing so by indexing the hash column and then looping through table.itersorted(hash), keeping track of the hash value and checking for collisions, very similar to finding a duplicate in a hdf5 pytable with 500e6 rows. I did not modify the table while looping through it; instead I wrote the merged entries to a new table (code at the bottom).
The problem is that the whole process takes far too long: it took me about 20 hours to get to iteration #5.4e5. I am working on an HDD, however, so it is entirely possible the bottleneck is there. Do you see any way I can improve my code, or can you suggest another approach? Thank you in advance for any help.
P.S. I promise I am not doing anything illegal, it is simply a large scale leaked password analysis for my Bachelor Thesis.
ref = 3  # manually checked first occurring hash, to simplify the code below
gen_cnt = 0
locs = {}

print("STARTING")
for row in table.itersorted('xhashx'):
    gen_cnt += 1  # so as not to flush after every iteration
    ps = row['password'].decode(encoding='utf-8', errors='ignore')

    if row['xhashx'] == ref:
        if ps in locs:
            locs[ps][0] += 1            # count another occurrence
            locs[ps][1] |= row['src']   # merge the source bitmask
        else:
            locs[ps] = [1, row['src']]
    else:
        # hash changed: write out the merged entries collected so far
        for p in locs:
            fill_password(new_password, locs[p])  # fills in the columns, with some fairly cheap statistics procedures
            new_password.append()

        if gen_cnt > 100:
            gen_cnt = 0
            new_table.flush()

        ref = row['xhashx']
        locs = {ps: [1, row['src']]}    # start a new group with the current row
Your dataset is 10x larger than the one in the referenced solution (2.5e9 vs 500e6 rows). Have you done any testing to identify where the time is spent? The table.itersorted() method may not be linear and might be resource-intensive. (I don't have any experience with itersorted.)
Here is a process that might be faster:
1. Extract a NumPy array of the hash field (column 'xhashx').
2. Find the unique hash values.
3. Loop through the unique hash values and extract a NumPy array of the rows that match each value.
4. Do your uniqueness tests against the rows in this extracted array.
5. Write the unique rows to your new file.
Code for this process below.
Note: this has not been tested, so it may have small syntax or logic gaps.
# Step 1: Get a NumPy array of the 'xhashx' field/column only:
hash_arr = table.read(field='xhashx')

# Step 2: Get a new array with unique values only:
hash_arr_u = np.unique(hash_arr)

# Alternately, combine the first 2 steps in a single step:
hash_arr_u = np.unique(table.read(field='xhashx'))

# Step 3a: Loop on the unique hash values
for hash_test in hash_arr_u:
    # Step 3b: Get an array with all rows that match this unique hash value
    match_row_arr = table.read_where('xhashx == hash_test')
    # Step 4: Check for rows with unique values
    # Check the hash row count.
    # If there is only 1 row, the uniqueness test is not required
    if match_row_arr.shape[0] == 1:
        pass  # only one row, so write it to new.table
    else:
        pass  # check for unique rows
              # then write unique rows to new.table

##################################################
# np.unique has an option to return the hash counts;
# these can be used as a test in the loop
(hash_arr_u, hash_cnts) = np.unique(table.read(field='xhashx'), return_counts=True)

# Loop on the array of unique hash values
for cnt in range(hash_arr_u.shape[0]):
    # Get an array with all rows that match this unique hash value
    hash_test = hash_arr_u[cnt]
    match_row_arr = table.read_where('xhashx == hash_test')
    # Check the hash row count.
    # If there is only 1 row, the uniqueness test is not required
    if hash_cnts[cnt] == 1:
        pass  # only one row, so write it to new.table
    else:
        pass  # check for unique rows
              # then write unique rows to new.table
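To flesh out the placeholder branches, one possible way to merge duplicates inside a hash group is another np.unique pass over the password field of the matched rows. This is only a sketch: it assumes the new table has password, src and count columns (count is an assumed column name; the others follow the question), and it reuses match_row_arr from the loop above:

import numpy as np

def merge_hash_group(match_row_arr, new_table):
    """Collapse duplicate passwords within one hash group into single merged rows."""
    new_row = new_table.row
    pw_u, pw_counts = np.unique(match_row_arr['password'], return_counts=True)
    for pw, count in zip(pw_u, pw_counts):
        same_pw = match_row_arr[match_row_arr['password'] == pw]
        new_row['password'] = pw
        new_row['count'] = count                               # assumed column name
        new_row['src'] = np.bitwise_or.reduce(same_pw['src'])  # merge the source bitmask
        new_row.append()
    new_table.flush()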
I have created a database with multiple columns and want to use the data stored in two of them (named 'cost' and 'Mwe') to create a new column, 'Dollar_per_KWh'. I have created two lists: one contains the rowids and the other contains the new values I want to put into the new Dollar_per_KWh column. As the code iterates through all the rows, the two lists are zipped together into a dictionary containing tuples. I then try to populate the new SQLite column. The code runs and I do not receive any errors, and when I print out the dictionary it looks correct.
Issue: the new column in my database is not being updated with the new data, and I am not sure why. The values in the new column all show NULL.
Thank you for your help. Here is my code:
import sqlite3

conn = sqlite3.connect('nuclear_builds.sqlite')
cur = conn.cursor()

cur.execute('''ALTER TABLE Construction
               ADD COLUMN Dollar_per_KWh INTEGER''')

cur.execute('SELECT _rowid_, cost, Mwe FROM Construction')
data = cur.fetchall()

dol_pr_kW = dict()
key = list()
value = list()
for row in data:
    id = row[0]
    cost = row[1]
    MWe = row[2]
    value.append(int((cost*10**6)/(MWe*10**3)))
    key.append(id)
    dol_pr_kW = list(zip(key, value))

cur.executemany('''UPDATE Construction SET Dollar_per_KWh = ? WHERE _rowid_ = ?''', (dol_pr_kW[1], dol_pr_kW[0]))
conn.commit()
Not sure why it isn't working. Have you tried just doing it all in SQL?
conn = sqlite3.connect('nuclear_builds.sqlite')
cur = conn.cursor()
cur.execute('''ALTER TABLE Construction
ADD COLUMN Dollar_per_KWh INTEGER;''')
cur.execute('''UPDATE Construction SET Dollar_per_KWh = cast((cost/MWe)*1000 as integer);''')
It's a lot simpler just doing the calculation in SQL than pulling data to Python, manipulating it, and pushing it back to the database.
If you need to do this in Python for some reason, testing whether this works will at least give you some hints as to what is going wrong with your current code.
Update: I see a few more problems now.
First I see you are creating an empty dictionary dol_pr_kW before the for loop. This isn't necessary as you are re-defining it as a list later anyway.
Then you are trying to create the list dol_pr_kW inside the for loop. This has the effect of over-writing it for each row in data.
I'll give a few different ways to solve it. It looks like you were trying a few different things at once (using a dict and a list, building two lists and zipping them into a third list, etc.), which is adding to your trouble, so I am simplifying the code to make it easier to understand. In each solution I create a list called data_to_insert; that is what you pass to the executemany function at the end.
The first option is to create your list before the for loop, then append to it for each row.
dol_pr_kW = list()
for row in data:
    id = row[0]
    cost = row[1]
    MWe = row[2]
    val = int((cost*10**6)/(MWe*10**3))
    dol_pr_kW.append((id, val))

# you can do this, or instead change the step above to dol_pr_kW.append((val, id))
data_to_insert = [(r[1], r[0]) for r in dol_pr_kW]
The second way would be to zip the key and value lists AFTER the for loop.
key = list()
value = list()
for row in data:
    id = row[0]
    cost = row[1]
    MWe = row[2]
    value.append(int((cost*10**6)/(MWe*10**3)))
    key.append(id)

dol_pr_kW = list(zip(key, value))

# you can do this, or instead change the step above to dol_pr_kW = list(zip(value, key))
data_to_insert = [(r[1], r[0]) for r in dol_pr_kW]
Third, if you would rather keep it as an actual dict you can do this.
dol_pr_kW = dict()
for row in data:
    id = row[0]
    cost = row[1]
    MWe = row[2]
    val = int((cost*10**6)/(MWe*10**3))
    dol_pr_kW[id] = val

# convert to a list of (value, id) tuples
data_to_insert = [(dol_pr_kW[id], id) for id in dol_pr_kW]
Then, to execute, call:
cur.executemany('''UPDATE Construction SET Dollar_per_KWh = ? WHERE _rowid_ = ?''', data_to_insert)
conn.commit()
I prefer the first option since it's easiest to understand what's happening at a glance: each iteration of the for loop just appends an (id, val) tuple to the end of the list. It's a little more cumbersome to build two lists independently and zip them together to get a third list.
Also note that even if the dol_pr_kW list had been created correctly, passing (dol_pr_kW[1], dol_pr_kW[0]) to executemany would pass only the first two rows of the list instead of reversing (key, value) to (value, key). You need a list comprehension to accomplish the swap in one line of code; I did this as a separate line and assigned it to the variable data_to_insert for readability.
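Putting the first option together end to end, here is a minimal sketch (table and column names are taken from the question, and the formula is the one from your original loop):

import sqlite3

conn = sqlite3.connect('nuclear_builds.sqlite')
cur = conn.cursor()

# assumes the Dollar_per_KWh column has already been added with ALTER TABLE
cur.execute('SELECT _rowid_, cost, Mwe FROM Construction')

data_to_insert = []
for rowid, cost, MWe in cur.fetchall():
    val = int((cost * 10**6) / (MWe * 10**3))    # same calculation as in the question
    data_to_insert.append((val, rowid))          # (value, key) order expected by the UPDATE below

cur.executemany('UPDATE Construction SET Dollar_per_KWh = ? WHERE _rowid_ = ?', data_to_insert)
conn.commit()
conn.close()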
Context
I have a function in Python that scores a row in my table. I would like to combine the scores of all the rows arithmetically (e.g. computing the sum, average, etc. of the scores).
def compute_score(row):
    # some complicated python code that would be painful to convert into SQL-equivalent
    return score
The obvious first approach is to simply read in all the data
import psycopg2
from psycopg2 import sql

def sum_scores(dbname, tablename):
    conn = psycopg2.connect(dbname)
    cur = conn.cursor()
    # table names cannot be passed as query parameters, so build the identifier safely
    cur.execute(sql.SQL('SELECT * FROM {}').format(sql.Identifier(tablename)))
    rows = cur.fetchall()
    sum = 0
    for row in rows:
        sum += compute_score(row)
    conn.close()
    return sum
Problem
I would like to be able to handle as much data as my database can hold. This could be larger that what would fit into Python's memory, so fetchall() seems to me like it would not function correctly in that case.
Proposed Solutions
I was considering 3 approaches, all with the aim of processing a couple of records at a time:
One-by-one record processing using fetchone()
def sum_scores(dbname, tablename):
    ...
    sum = 0
    for row_num in range(cur.rowcount):
        row = cur.fetchone()
        sum += compute_score(row)
    ...
    return sum
Batch-record processing using fetchmany(n)
def sum_scores(dbname, tablename):
    ...
    batch_size = 1000  # tunable
    sum = 0
    batch = cur.fetchmany(batch_size)
    while batch:
        for row in batch:
            sum += compute_score(row)
        batch = cur.fetchmany(batch_size)
    ...
    return sum
Relying on the cursor's iterator
def sum_scores(dbname, tablename):
    ...
    sum = 0
    for row in cur:
        sum += compute_score(row)
    ...
    return sum
Questions
Was my thinking correct that my 3 proposed solutions would only pull in manageable-sized chunks of data at a time? Or do they suffer from the same problem as fetchall()?
Which of the 3 proposed solutions would work (i.e. compute the correct score combination and not crash in the process) for LARGE datasets?
How does the cursor's iterator (proposed solution #3) actually pull data into Python's memory? One by one, in batches, or all at once?
All 3 solutions will work, and only bring a subset of the results into memory.
Iterating via the cursor (proposed solution #3) will work the same way as proposed solution #2 if you pass a name to the cursor: iterating over a named cursor fetches itersize records at a time (the default is 2000).
Solutions #2 and #3 will be much quicker than #1, because there is far less round-trip overhead.
http://initd.org/psycopg/docs/cursor.html#fetch
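For reference, here is a minimal sketch of the named (server-side) cursor that this relies on; the connection string and table name are placeholders, and compute_score is the function from the question:

import psycopg2

def sum_scores(dsn):
    conn = psycopg2.connect(dsn)
    # a *named* cursor is server-side: rows are streamed in chunks of itersize
    cur = conn.cursor(name='score_cursor')
    cur.itersize = 2000                      # the default; tune as needed
    cur.execute('SELECT * FROM my_table')    # placeholder table name
    total = 0
    for row in cur:                          # fetches itersize rows at a time
        total += compute_score(row)
    conn.close()
    return total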
I am writing a program that:
Reads the content of an Excel sheet row by row (90,000 rows in total)
Compares each of those rows with every row of another Excel sheet (600,000 rows in total)
If a match occurs, writes the matching entry into a new Excel sheet
I have written the script and everything works fine; however, the computational time is HUGE. In an hour it has processed just 200 rows from the first sheet, resulting in 200 different files being written.
I was wondering if there is a way to save the matches in a different way, as I am going to use them later on. Is there any way to save them in a matrix or something?
import xlrd
import xlsxwriter
import os, itertools
from datetime import datetime

# choose the incident excel sheet
book_1 = xlrd.open_workbook('D:/Users/d774911/Desktop/Telstra Internship/Working files/Incidents.xlsx')
# choose the trap excel sheet
book_2 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Traps.xlsx")
# choose the features sheet
book_3 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Features.xlsx")

# select the working sheet, either by name or by index
Traps = book_2.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Incidents = book_1.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Features_Numbers = book_3.sheet_by_name('Sheet1')

# return the total number of rows for the traps sheet
Total_Number_of_Rows_Traps = Traps.nrows
# return the total number of rows for the incident sheet
Total_Number_of_Rows_Incidents = Incidents.nrows

print(Total_Number_of_Rows_Traps, Total_Number_of_Rows_Incidents)

# open a file to write down the non-matching incidents' numbers
write_no_matching = open('C:/Users/d774911/PycharmProjects/GlobalData/No_Matching.txt', 'w')

# for loop to iterate over all the rows of the incident sheet
for Rows_Incidents in range(Total_Number_of_Rows_Incidents):

    # store content of the comparable cell of the incident sheet
    Incidents_Content_Affected_resources = Incidents.cell_value(Rows_Incidents, 47)
    # store content of the comparable cell of the incident sheet
    Incidents_Content_Product_Type = Incidents.cell_value(Rows_Incidents, 29)
    # convert the Excel date type into a python type
    Incidents_Content_Date = xlrd.xldate_as_tuple(Incidents.cell_value(Rows_Incidents, 2), book_1.datemode)
    # extract the year, month and day
    Incidents_Content_Date = str(Incidents_Content_Date[0]) + ' ' + str(Incidents_Content_Date[1]) + ' ' + str(Incidents_Content_Date[2])
    # store content of the comparable cell of the incident sheet
    Incidents_Content_Date = datetime.strptime(Incidents_Content_Date, '%Y %m %d')
    # extract the incident number
    Incident_Name = Incidents.cell_value(Rows_Incidents, 0)

    # create a workbook for the selected incident
    Incident_Name_Book = xlsxwriter.Workbook(os.path.join('C:/Users/d774911/PycharmProjects/GlobalData/Test/', Incident_Name + '.xlsx'))
    # create a sheet for the created workbook
    Incident_Name_Sheet = Incident_Name_Book.add_worksheet('Sheet1')
    # insert the first row that contains the features
    Incident_Name_Sheet.write_row(0, 0, Features_Numbers.row_values(0))

    Insert_Row_to_Incident_Sheet = 0

    # for loop to iterate over all the rows of the traps sheet
    for Rows_Traps in range(Total_Number_of_Rows_Traps):

        # store content of the comparable cell of the traps sheet
        Traps_Content_Node_Name = Traps.cell_value(Rows_Traps, 3)
        # store content of the comparable cell of the traps sheet
        Traps_Content_Event_Type = Traps.cell_value(Rows_Traps, 6)
        # extract the date temporarily
        Traps_Content_Date_temp = Traps.cell_value(Rows_Traps, 10)
        # store content of the comparable cell of the traps sheet
        Traps_Content_Date = datetime.strptime(Traps_Content_Date_temp[0:10], '%Y-%m-%d')

        # if the content matches partially or fully
        if len(str(Traps_Content_Node_Name)) * len(str(Incidents_Content_Affected_resources)) != 0 and \
                str(Incidents_Content_Affected_resources).lower().find(str(Traps_Content_Node_Name).lower()) != -1 and \
                len(str(Traps_Content_Event_Type)) * len(str(Incidents_Content_Product_Type)) != 0 and \
                str(Incidents_Content_Product_Type).lower().find(str(Traps_Content_Event_Type).lower()) != -1 and \
                len(str(Traps_Content_Date)) * len(str(Incidents_Content_Date)) != 0 and \
                Traps_Content_Date <= Incidents_Content_Date:

            # counter for writing inside the new incident sheet
            Insert_Row_to_Incident_Sheet = Insert_Row_to_Incident_Sheet + 1
            # write the incident information
            Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 0, Incidents.row_values(Rows_Incidents))
            # write the traps information
            Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 107, Traps.row_values(Rows_Traps))

    Incident_Name_Book.close()
Thanks
What you're doing is seeking/reading a little bit of data for each cell, which is very inefficient.
Try reading all the information in one go into a Python data structure that is as basic as sensible (lists, dicts, etc.), do your comparisons/operations on that data set in memory, and write all the results in one go. If not all the data fits into memory, try to partition it into sub-tasks.
Having to read the data set 10 times to extract a tenth of the data each time will likely still be hugely faster than reading each cell independently.
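For instance, here is a minimal sketch of reading a whole sheet into memory in one pass with xlrd (the path is shortened to a placeholder; the sheet name comes from your code):

import xlrd

book = xlrd.open_workbook('Traps.xlsx')    # placeholder path
sheet = book.sheet_by_name('Sheet1')

# pull every row into a plain list of lists in a single pass
trap_rows = [sheet.row_values(r) for r in range(sheet.nrows)]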
I don't see how your code could work as posted: the indentation suggested the second loop was not inside the first one, even though it works on variables that change for every row of the first loop. Assuming the loops are nested as shown above, comparing files this way has a complexity of O(N*M), which means the runtime explodes quickly; in your case that is 54,000,000,000 (54 billion) iterations.
If you run into this kind of problem, the solution is always a three-step process:
Transform the data to make it easier to process
Put the data into efficient structures (sorted lists, dict)
Search the data with the efficient structures
You have to find a way to get rid of the find(). Try to strip all the junk from the cells you want to compare so that you can use a simple equality check. Once you have that, you can put the rows into a dict to find matches (a sketch follows below). Or you could load the data into a SQL database and use SQL queries (don't forget to add indexes!).
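A tiny sketch of the dict idea, under the assumption that the cleaned-up comparison key is the node name (all field names here are hypothetical):

# hypothetical pre-cleaned rows: each is a dict with a 'node' key plus whatever else you need
trap_rows = [{'node': 'node-a', 'event': 'linkDown'}]
incident_rows = [{'node': 'node-a', 'product': 'router'}]

# index the 600,000 trap rows once, keyed on the cleaned node name
traps_by_node = {}
for trap in trap_rows:
    traps_by_node.setdefault(trap['node'], []).append(trap)

# each incident now costs one dict lookup instead of a scan over all trap rows
matches = []
for incident in incident_rows:
    for trap in traps_by_node.get(incident['node'], []):
        matches.append((incident, trap))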
One last trick is to use sorted lists. If you can sort the data in the same way, then you can use two lists:
Sort the data from the two sheets into two lists
Use two row counters (one per list)
If the current item from the first list is less than the current one from the second list, then there is no match and you have to advance the first row counter.
If the current item from the first list is bigger than the current one from the second list, then there is no match and you have to advance the second row counter.
If the items are the same, you have a match. Process the match and advance both counters.
This allows you to process all the items in one go.
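A minimal sketch of that two-counter walk, assuming each sheet has already been reduced to a sorted list of (key, row) pairs whose keys compare with plain equality:

def merge_matches(list_a, list_b):
    """Walk two sorted lists of (key, row) pairs and collect the matching pairs."""
    matches = []
    i, j = 0, 0
    while i < len(list_a) and j < len(list_b):
        key_a, key_b = list_a[i][0], list_b[j][0]
        if key_a < key_b:
            i += 1                                    # no match: advance the first counter
        elif key_a > key_b:
            j += 1                                    # no match: advance the second counter
        else:
            matches.append((list_a[i][1], list_b[j][1]))
            i += 1                                    # process the match and advance both counters
            j += 1
    return matches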
I would suggest that you use pandas. This module provides a huge number of functions for comparing datasets, and it also has very fast import/export routines for Excel files.
IMHO you should use the merge function with the arguments how='inner' and on=[list of the columns to compare]. That will create a new dataset containing only the rows that occur in both tables (having the same values in the defined columns). This new dataset can then be exported to your Excel file.
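A hedged sketch of that approach (the file paths and the join columns are placeholders; substitute the columns you actually compare on):

import pandas as pd

# placeholder paths and sheet names
incidents = pd.read_excel('Incidents.xlsx', sheet_name='Sheet1')
traps = pd.read_excel('Traps.xlsx', sheet_name='Sheet1')

# keep only the rows whose values match in both sheets
matches = pd.merge(incidents, traps, how='inner',
                   on=['node_name', 'event_type'])   # placeholder column names

matches.to_excel('Matches.xlsx', index=False)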