I currently have a .h5 file with a table in it consisting of three columns: a text column of 64 chars, a UInt32 column relating to the source of the text, and a UInt32 column which is the xxhash of the text. The table consists of ~2.5e9 rows.
I am trying to find and count the duplicates of each text entry in the table - essentially merge them into one entry, while counting the instances. I have tried doing so by indexing on the hash column and then looping through table.itersorted(hash), while keeping track of the hash value and checking for collisions - very similar to finding a duplicate in a hdf5 pytable with 500e6 rows. I did not modify the table as I was looping through it but rather wrote the merged entries to a new table - I am putting the code at the bottom.
Basically the problem I have is that the whole process takes far too long - it took me about 20 hours to get to iteration #5.4e5. I am working on an HDD, however, so it is entirely possible the bottleneck is there. Do you see any way I can improve my code, or can you suggest another approach? Thank you in advance for any help.
P.S. I promise I am not doing anything illegal, it is simply a large-scale leaked password analysis for my Bachelor's thesis.
ref = 3  # manually checked first occurring hash, to simplify the code below
gen_cnt = 0
locs = {}
print("STARTING")
for row in table.itersorted('xhashx'):
    gen_cnt += 1  # so as not to flush after every iteration
    ps = row['password'].decode(encoding='utf-8', errors='ignore')
    if row['xhashx'] == ref:
        if ps in locs:
            locs[ps][0] += 1
            locs[ps][1] |= row['src']
        else:
            locs[ps] = [1, row['src']]
    else:
        for p in locs:
            fill_password(new_password, locs[p])  # simply fills in the columns, with some fairly cheap statistics procedures
            new_password.append()
        if gen_cnt > 100:
            gen_cnt = 0
            new_table.flush()
        locs = {ps: [1, row['src']]}  # start collecting entries for the new hash value
        ref = row['xhashx']
Your dataset is 10x larger than the referenced solution (2.5e9 vs 500e6 rows). Have you done any testing to identify where the time is spent? The table.itersorted() method may not scale linearly and might be resource intensive. (I don't have any experience with itersorted.)
Here is a process that might be faster:
1. Extract a NumPy array of the hash field (column 'xhashx').
2. Find the unique hash values.
3. Loop through the unique hash values and extract a NumPy array of the rows that match each value.
4. Do your uniqueness tests against the rows in this extracted array.
5. Write the unique rows to your new file.
Code for this process below:
Note: this has not been tested, so it may have small syntax or logic gaps.
# Step 1: Get a NumPy array of the 'xhashx' field/column only:
hash_arr = table.read(field='xhashx')

# Step 2: Get a new array with the unique values only:
hash_arr_u = np.unique(hash_arr)

# Alternately, combine the first 2 steps in a single step:
hash_arr_u = np.unique(table.read(field='xhashx'))

# Step 3a: Loop on the unique hash values
for hash_test in hash_arr_u:
    # Step 3b: Get an array with all rows that match this unique hash value
    match_row_arr = table.read_where('xhashx == hash_test')
    # Step 4: Check for rows with unique values
    # Check the hash row count.
    # If there is only 1 row, no uniqueness test is required
    if match_row_arr.shape[0] == 1:
        pass  # only one row, so write it to the new table
    else:
        pass  # check for unique rows, then write the unique rows to the new table
##################################################
# np.unique has an option to return the hash counts;
# these can be used as a test in the loop
(hash_arr_u, hash_cnts) = np.unique(table.read(field='xhashx'), return_counts=True)

# Loop on the array of unique hash values
for cnt in range(hash_arr_u.shape[0]):
    # Get an array with all rows that match this unique hash value
    hash_test = hash_arr_u[cnt]
    match_row_arr = table.read_where('xhashx == hash_test')
    # Check the hash row count.
    # If there is only 1 row, no uniqueness test is required
    if hash_cnts[cnt] == 1:
        pass  # only one row, so write it to the new table
    else:
        pass  # check for unique rows, then write the unique rows to the new table
Related
I have code that works quite extensively with an Excel file (SAP download), doing data transformation and calculation steps.
I need to loop through all the lines (a couple thousand rows) a few times. I previously wrote code that added the DataFrame columns separately, so I could do everything in one for loop, which was of course quite quick; however, I had to change the data source, which meant a change in the raw data structure.
In the raw data structure the first 3 rows are empty, then a title row comes with column names, then 2 more rows are empty, and the 1st column is also empty. I decided to wipe these and assign column names to make them headers (steps below); however, since then, separately adding column names and later calculating everything in one for statement does not fill data into any of these specific columns.
How could I optimize this code?
I have deleted some calculation steps since they are quite long and would make the code even less readable.
#This function adds new columns to the dataframe
def NewColdfConverter(*args):
    for i in args:
        dfConverter[i] = ''  #previously used dfConverter[i] = NaN

#This function creates a dataframe from an excel file
def DataFrameCreator(path, sheetname):
    excelFile = pd.ExcelFile(path)
    global readExcel
    readExcel = pd.read_excel(excelFile, sheet_name=sheetname)

#calling my function to create the dataframe
DataFrameCreator(filePath, sheetName)
dfConverter = pd.DataFrame(readExcel)

#dropping NA values from Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)

#dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis=1, inplace=True)

#renaming columns from 'Unnamed: 1' etc. to proper names
dfConverter = dfConverter.rename(columns={'Unnamed: 1': 'propername1', 'Unnamed: 2': 'propername2'})  # etc.

#calling new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")

#example for loop that worked prior, but not working since new dataset and new header/column steps added:
for i in range(len(dfConverter)):
    #Day column -> floor Entry Date -1, if time is less than 5:00:00
    if dfConverter['Time'][i] <= time(hour=5, minute=0, second=0):
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i]) - timedelta(days=1)
    else:
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
The problem is that there are many columns that build on one another, so I cannot compute them in one for loop. For instance, in the example below I need to calculate reqsWoSetUpValue so that I can calculate requirementsValue, so that I can calculate otherReqsValue, but I am not able to do this within one for loop by assigning the values to the DataFrame column[i] row, because the value will just be missing, like nothing happened.
(dfSorted is the same as dfConverter, but a sorted version of it)
#example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
    reqsWoSetUpValue[i] = #calculationsteps...

#inserting column with value
dfSorted.insert(49, 'Reqs wo SetUp', reqsWoSetUpValue)

#getting requirements value with previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
    requirementsValue[i] = #calc
dfSorted.insert(50, 'Requirements', requirementsValue)

#Calculating Other Reqs value with previously calculated Requirements column
for i in range(len(dfSorted)):
    otherReqsValue[i] = #calc
dfSorted.insert(51, 'Other Reqs', otherReqsValue)
Does anyone have a clue why I can no longer do this in one for loop by first adding all the columns with the function, like:
NewColdfConverter('Reqs wo setup', 'Requirements', 'Other reqs')

#then in 1 for loop:
for i in range(len(dfSorted)):
    dfSorted['Reqs wo setup'][i] = #calculationsteps
    dfSorted['Requirements'][i] = #calculationsteps
    dfSorted['Other reqs'][i] = #calculationsteps
Thank you
General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time package
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
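If you are not working in Spyder, the standard library's cProfile gives the same information; a minimal sketch (process_data is just a hypothetical placeholder for the code you want to profile):

import cProfile
import pstats

def process_data():
    # hypothetical placeholder for your transformation / calculation steps
    pass

cProfile.run('process_data()', 'profile_output')
stats = pstats.Stats('profile_output')
stats.sort_stats('cumulative').print_stats(10)  # show the 10 most expensive calls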
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once:
# slow (don't do this):
for i in range(len(dfConverter)):
    dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])

# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
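Putting those two ideas together, the whole Day calculation from the question could look roughly like this (assuming the Time column holds datetime.time objects, as the original loop implies):

import pandas as pd
from datetime import time, timedelta

# convert the whole column once instead of row by row
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])

# subtract one day only for rows whose Time is at or before 05:00:00
mask = dfConverter['Time'] <= time(hour=5, minute=0, second=0)
dfConverter.loc[mask, 'Day'] -= timedelta(days=1)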
Not sure this would improve performance, but you could calculate the dependent columns at the same time, row by row, with DataFrame.iterrows():

for index, data in dfSorted.iterrows():
    # .at avoids the chained-assignment issues of dfSorted['col'][index] = ...
    dfSorted.at[index, 'Reqs wo setup'] = #calculationsteps
    dfSorted.at[index, 'Requirements'] = #calculationsteps
    dfSorted.at[index, 'Other reqs'] = #calculationsteps
I have a DataFrame loaded with orders. Some of them contain negative quantities, and the reason for that is that they are actually cancellations of prior orders.
The problem: there is no unique key that can help me find which order corresponds to which cancellation.
So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):
for i, item in cancelations.iterrows():
    #find a row similar to the cancelation we are currently studying;
    #item is the row (a Series) returned by iterrows()
    mask1 = (copy['CustomerID'] == item['CustomerID'])
    mask2 = (copy['Quantity'] == item['Quantity'])
    mask3 = (copy['Description'] == item['Description'])
    subset = copy[mask1 & mask2 & mask3]
    if subset.shape[0] > 0:  #if we find one or several corresponding orders:
        print('possible corresponding orders:', subset.index.tolist())
        copy = copy.drop(subset.index.tolist()[0])  #remove only the first of them from the copy of the data
So, this works, but:
first, it takes forever to run; and second, I read somewhere that whenever you find yourself writing complex code to manipulate dataframes, there's already a method for it.
So perhaps one of you knows something that could help me?
Thank you for your time!
Edit: note that sometimes we can have several orders that could correspond to the cancelation at hand. This is why I didn't use drop_duplicates with only some columns specified: it eliminates all duplicates (or all but one), whereas I need to drop only one of them.
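For what it's worth, "drop exactly one matching order per cancelation" can be expressed without the row-by-row loop by numbering the occurrences inside each (CustomerID, Quantity, Description) group and merging on that running number. A rough sketch, keeping the question's assumption that a match is exact equality on those three columns:

import pandas as pd

# number every order and every cancelation within its group of identical key columns
orders = copy.copy()
orders['occurrence'] = orders.groupby(['CustomerID', 'Quantity', 'Description']).cumcount()

cancels = cancelations.copy()
cancels['occurrence'] = cancels.groupby(['CustomerID', 'Quantity', 'Description']).cumcount()

# pair each cancelation with one (and only one) order carrying the same occurrence number;
# assumes 'copy' has a default unnamed index, which reset_index() exposes as a column named 'index'
matched = cancels.merge(
    orders.reset_index(),
    on=['CustomerID', 'Quantity', 'Description', 'occurrence'],
)

# drop the matched orders from the working copy
copy = copy.drop(matched['index'])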
df is a dataframe containing 12 million+ rows, unsorted.
Each row has a GROUP ID.
The end goal is to randomly select 1 row per unique GROUP ID, thus populating a new column named SELECTED, where 1 means selected and 0 means the opposite.
There may be 5000+ unique GROUP IDs.
I am seeking a better and faster solution than the following; potentially a multi-threaded solution?
for sec in df['GROUP'].unique():
    sz = df.loc[df.GROUP == sec, ['SELECTED']].size
    sel = [0]*sz
    sel[random.randint(0, sz-1)] = 1
    df.loc[df.GROUP == sec, ['SELECTED']] = sel
You could try a vectorized version, which will probably speed things up if you have many classes.
import numpy as np
import pandas as pd

# get fake data
df = pd.DataFrame(np.random.rand(10))
df['GROUP'] = df[0].astype(str).str[2]

# mark one element of each group as selected
df['selected'] = df.index.isin(    # Is the current index in the selected list?
    df.groupby('GROUP')            # Get a GroupBy object.
    .apply(pd.Series.sample)       # Select one row from each group.
    .index.levels[1]               # The index is a (group, old_id) pair; take the old_id.
).astype(int)                      # Convert booleans to ints.
Note that this may fail if duplicate indices are present.
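On pandas 1.1 or newer, DataFrameGroupBy.sample can do the per-group draw directly, which avoids the apply/index juggling (the same caveat about duplicate indices applies); a short sketch:

import pandas as pd

# pick one random row label per GROUP, then flag those rows
chosen = df.groupby('GROUP').sample(n=1).index
df['selected'] = df.index.isin(chosen).astype(int)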
I do not know pandas DataFrames well, but if you simply set SELECTED where it needs to be 1, and later assume that not having the attribute means not selected, you could avoid updating all elements.
You may also do something like this (collecting one randomly chosen row label per group):

import random

selected = []
for sec in df['GROUP'].unique():
    selected.append(random.choice(df.index[df['GROUP'] == sec].tolist()))

or with a list comprehension:

selected = [random.choice(df.index[df['GROUP'] == sec].tolist()) for sec in df['GROUP'].unique()]

Maybe this can speed things up, because you will not need to allocate new memory and update all elements of your dataframe.
If you really want multithreading, have a look at concurrent.futures: https://docs.python.org/3/library/concurrent.futures.html
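For completeness, a hedged sketch of what that could look like with ProcessPoolExecutor, handing each group's row labels to a worker; note that for a selection this cheap, the process overhead will very likely outweigh any gain, so the vectorized approaches above are probably the better route:

import random
from concurrent.futures import ProcessPoolExecutor

def pick_one(labels):
    # choose one random row label from the labels belonging to a single group
    return random.choice(list(labels))

group_labels = list(df.groupby('GROUP').groups.values())  # row labels per group
with ProcessPoolExecutor() as pool:
    chosen = list(pool.map(pick_one, group_labels))

df['SELECTED'] = df.index.isin(chosen).astype(int)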
I have a large-ish pandas dataframe with multiple columns (c1 ... c8) and ~32 mil rows. The dataframe is already sorted by c1. I want to grab other column values from rows that share a particular value of c1.
something like
keys = big_df['c1'].unique()
red = np.zeros(len(keys))

for i, key in enumerate(keys):
    inds = (big_df['c1'] == key)
    v1 = np.array(big_df.loc[inds]['c2'])
    v2 = np.array(big_df.loc[inds]['c6'])
    red[i] = reduce_fun(v1, v2)
However, this turns out to be very slow, I think because it checks the entire column for the matching criterion (even though there might only be 10 rows out of 32 million that are relevant). Since big_df is sorted by c1 and keys is just the list of all unique c1 values, is there a fast way to build the red[] array? (I.e. I know that the first row with the next key is the row right after the last row of the previous key, and that the last row for a key is the last row that matches it, since all subsequent rows are guaranteed not to match.)
Thanks,
Ilya
Edit: I am not sure what order the unique() method produces; I basically want a value of reduce_fun() for every key in keys, and I don't particularly care about the order (presumably the easiest order is the order c1 is already sorted in).
Edit2: I slightly restructured the code. Basically, is there an efficient way of constructing inds? big_df['c1'] == key takes 75.8% of the total time on my data, while creating v1 and v2 takes 21.6%, according to the line profiler.
Rather than a list, I chose a dictionary to hold the reduced values keyed on each item in c1.
red = {key: reduce_fun(frame['c2'].values, frame['c6'].values)
       for key, frame in big_df.groupby('c1')}
How about a groupby statement in a list comprehension? This should be especially efficient given the DataFrame is already sorted by c1:
Edit: Forgot that groupby returns a tuple. Oops!
red = [reduce_fun(g['c2'].values, g['c6'].values) for i, g in big_df.groupby('c1', sort=False)]
Seems to chug through pretty quickly for me (~2 seconds for 30 million random rows and a trivial reduce_fun).
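Since the question notes that big_df is already sorted by c1, another option is to compute each key's block boundaries once with np.unique and slice positionally, avoiding a 32-million-element boolean mask per key; a rough sketch:

import numpy as np

c1 = big_df['c1'].values
keys, starts = np.unique(c1, return_index=True)   # first row of each key's block (c1 is sorted)
stops = np.append(starts[1:], len(c1))            # one past the last row of each block

red = np.zeros(len(keys))
for i, (lo, hi) in enumerate(zip(starts, stops)):
    block = big_df.iloc[lo:hi]
    red[i] = reduce_fun(block['c2'].values, block['c6'].values)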
The Background:
I have been creating a script that, based on the input csv's that exist within an input directory, creates a 3-dimensional array to store the aggregated information. Each table within the array represents one of the pollution sources (e.g. one of the input csvs was Incinerators.csv, so the created table holds the aggregated information about the various pollutants released by incinerators on a watershed scale). Each row represents the aggregated information by watershed (row 0 = headers), and each column is the amount and toxic equivalent of each substance (col 0 = watershed ID).
For each substance in each watershed, the total released by all sources is calculated and stored in another array with the exact same layout, addressable as totals[wsid][substance] by index- or name-based dictionary lookups.
The Question:
With this table of totals, I need to calculate each watershed's relative rank for the amount of each substance released compared to what is released in other watersheds.
I could use a couple of nested loops to go through each substance column, convert it into a list, sort the list, and then relate this back to the watershed ID... but this would not be a very clean solution. Zero values also need to be omitted from ranking, and duplicate values should be given the same rank while decreasing the total number being ranked.
Is there a smarter way to do this? Or a module where this is already implemented? (I didn't see anything evident in PyTables.)
One of the requirements is that the solution remains simple enough that those with even less Python experience than I have will at least be able to understand the process. I can use Python up to 2.7.1.
The End Goal:
Generate HTML pages to be iframed from a Google Earth description bubble with the results. I have put a couple entirely unfinished sample outputs here.
For this I have created 2 functions
from operator import itemgetter

def sortTable(table, col):
    return sorted(table, key=itemgetter(col))
And
def buildRankTable(totalTable, fieldList, wsidList, subList, subDict, wsidDict):
    ## build rankTable to mimic other templates
    rankTable = newTemplateTable(wsidList, fieldList)
    ## add another row to track the total number ranked for each substance
    numRanked = [0 for i in range(len(fieldList))]
    numRanked[0] = "TotalNoRanked"
    rankTable.append(numRanked)

    for substance in subList:
        tempTable = sortTable(totalTable, subDict[substance])
        exportCsv(tempTable, outdir + os.sep + "rankT_" + substance + ".csv")
        rankList = []
        ## extract the low-to-high list of wsid's, skipping non-floats (no measurement)
        for row in tempTable:
            if type(row[subDict[substance]]) == float:
                rankList.append(row[0])  ## build wsid list in ranked order
        numRanked[subDict[substance]] = len(rankList)
        ## by default this ranks low to high; we want to rank high to low starting at 1
        rankList.reverse()
        ## with the list of ranked wsids, get the rank and save it to rankTable
        for rank, wsid in enumerate(rankList):
            rankTable[wsidDict[wsid]][subDict[substance]] = rank + 1

    ## any 0 (default) values become 'NR' - No Rank
    for rowI in range(len(rankTable)):
        for colI in range(len(rankTable[rowI])):
            if rankTable[rowI][colI] == 0:
                rankTable[rowI][colI] = "NR"
    return rankTable
fieldList = list of fields in first row
wsidList = list of wsid's (remaining 595 rows)
subList = list of substances to be ranked
subDict = dictionary to map each substance to its col index in totalTable
wsidDict = dictionary to map each wsid to its row index in totalTable
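As a footnote on the tie/zero requirement from the question: the enumerate() ranking inside buildRankTable gives tied totals different ranks. A small helper along these lines could be swapped in instead (a sketch only, written against plain lists and Python 2.7-compatible; the usage lines use hypothetical variable names):

def rankHighToLow(values):
    # distinct non-zero totals, largest first; equal totals share the same rank
    distinct = sorted(set(v for v in values if isinstance(v, float) and v != 0), reverse=True)
    return dict((value, rank + 1) for rank, value in enumerate(distinct))

# usage inside the substance loop (hypothetical variable names):
# totals_for_substance = [row[subDict[substance]] for row in totalTable]
# ranks = rankHighToLow(totals_for_substance)
# rankTable[wsidDict[wsid]][subDict[substance]] = ranks.get(total, "NR")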