Generate ranks based on data table - python

The Background:
I have been writing a script that reads the CSV files in an input directory and builds a 3-dimensional array to store the aggregated information. Each table in the array represents one pollution source (e.g. for the input file Incinerators.csv, the table holds the aggregated information about the various pollutants released by incinerators, on a watershed scale). Each row holds the aggregated information for one watershed (row 0 = headers), and each column holds the amount of, or the toxic equivalent of, one substance (col 0 = watershed ID).
For each substance in each watershed, the total released by all sources is calculated and stored in another array with exactly the same layout, addressable as totals[wsid][substance] either by index or through name-based dictionary lookups.
The Question:
With this table of totals, I need to calculate each watershed's relative rank for the amount of each substance released compared to what is released in other watersheds.
I could use a couple of nested loops to go through each substance column and convert this into a list, sort the list, and then relate this back to the watershed ID... but this would not be a very clean solution. Zero values also need to be omitted from ranking and duplicate values should be given the same rank while decreasing total number being ranked.
Is there a smarter way to do this? Or a module where this is already implemented? (didn't see anything evident in pyTables)
One of the requirements is that the solution also remain simple enough so that those with even less python experience than I will at least be able to understand the process. I can use up to 2.7.1
The End Goal:
Generate HTML pages to be iframed from a Google Earth description bubble with the results. I have put a couple entirely unfinished sample outputs here.

For this I have created 2 functions
import os
from operator import itemgetter

def sortTable(table, col):
    return sorted(table, key=itemgetter(col))

def buildRankTable(totalTable, fieldList, wsidList, subList, subDict, wsidDict):
    ## build rankTable to mimic other templates
    rankTable = newTemplateTable(wsidList, fieldList)
    ## add another row to track total number ranked for each substance
    numRanked = [0 for i in range(len(fieldList))]
    numRanked[0] = "TotalNoRanked"
    for substance in subList:
        tempTable = sortTable(totalTable, subDict[substance])
        exportCsv(tempTable, outdir + os.sep + "rankT_" + substance + ".csv")
        rankList = []
        ## extract the low to high list of wsid's, skipping non-floats (no measurement)
        for row in tempTable:
            if type(row[subDict[substance]]) == float:
                rankList.append(row[0]) ## build wsid list in ranked order
        numRanked[subDict[substance]] = len(rankList)
        ## by default, this ranks low to high, we want to rank high to low starting at 1
        ## with the list of ranked wsids, get the rank and save to rankTable
        for rank, wsid in enumerate(rankList):
            rankTable[wsidDict[wsid]][subDict[substance]] = rank + 1
    ## any 0 (default) values become 'NR' - No Rank
    for rowI in range(len(rankTable)):
        for colI in range(len(rankTable[rowI])):
            if rankTable[rowI][colI] == 0:
                rankTable[rowI][colI] = "NR"
    return rankTable
fieldList = list of fields in first row
wsidList = list of wsid's (remaining 595 rows)
subList = list of substances to be ranked
subDict = dictionary to map each substance to its col index in totalTable
wsidDict = dictionary to map each wsid to its row index in totalTable
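The ranking rules stated in the question (skip zero or missing values, give equal values the same rank, rank high to low starting at 1) can be sketched for a single substance column as below. This is only an illustration under assumptions: dense_rank and the sample_totals dictionary are hypothetical names, not part of the original script, and the totals are assumed to be available as a plain {wsid: value} mapping. It sticks to constructs available in Python 2.7.

```python
def dense_rank(totals_by_wsid):
    """Return {wsid: rank}, ranking high to low; unranked entries get 'NR'."""
    # Keep only positive numbers; the set drops duplicates so ties share a rank.
    distinct = set(
        v for v in totals_by_wsid.values()
        if isinstance(v, (int, float)) and v > 0
    )
    # The highest value gets rank 1, the next distinct value rank 2, and so on.
    rank_of = {}
    for position, value in enumerate(sorted(distinct, reverse=True)):
        rank_of[value] = position + 1
    # Anything that was filtered out (zero, blank, text) is marked 'NR'.
    ranks = {}
    for wsid, value in totals_by_wsid.items():
        ranks[wsid] = rank_of.get(value, "NR")
    return ranks


# Hypothetical totals for one substance, keyed by watershed ID.
sample_totals = {"ws1": 5.0, "ws2": 0.0, "ws3": 5.0, "ws4": 2.5, "ws5": ""}
ranks = dense_rank(sample_totals)
```

With this approach the number of ranks assigned for a substance is simply the number of distinct positive values, so ties reduce it automatically.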


Generate a pandas dataframe with for-loop

I have generated a dataframe (called 'sectors') that stores information from my brokerage account (sector/industry, sub sector, company name, current value, cost basis, etc).
I want to avoid hard coding a filter for each sector or sub sector to find specific data. I have achieved this with the following code (I know, not very pythonic, but I am new to coding):
for x in set(sectors_df['Sector']):
    x_filt = sectors_df['Sector'] == x
    # value_in_sect is the sum of all current values in a given sector
    value_in_sect = round(sectors_df.loc[x_filt]['Current Value'].sum(), 2)
    # pct_in_sect is the % of the sector in the overall portfolio (total equals the total value of all sectors)
    pct_in_sect = round((value_in_sect/total)*100 , 2)
    print(x, value_in_sect, pct_in_sect)
for sub in set(sectors_df['Sub Sector']):
    sub_filt = sectors_df['Sub Sector'] == sub
    value_of_subs = round(sectors_df.loc[sub_filt]['Current Value'].sum(), 2)
    pct_of_subs = round((value_of_subs/total)*100, 2)
    print(sub, value_of_subs, pct_of_subs)
My print statements produce the majority of the information I want, although I am still working through how to program for the % of a sub sector within its own sector. Anyways, I would now like to put this information (value_in_sect, pct_in_sect, etc) into dataframes of their own. What would be the best way or the smartest way or the most pythonic way to go about this? I am thinking a dictionary, and then creating a dataframe from the dictionary, but not sure.
The split-apply-combine process in pandas, specifically aggregation, is the best way to go about this. First I'll explain how this process would work manually, and then I'll show how pandas can do it in one line.
Manual split-apply-combine
First, divide the DataFrame into groups of the same Sector. This involves getting a list of Sectors and figuring out which rows belong to it (just like the first two lines of your code). This code runs through the DataFrame and builds a dictionary with keys as Sectors and a list of indices of rows from sectors_df that correspond to it.
sectors_index = {}
for ix, row in sectors_df.iterrows():
    if row['Sector'] not in sectors_index:
        sectors_index[row['Sector']] = [ix]
    else:
        sectors_index[row['Sector']].append(ix)
Run the same function, in this case summing of Current Value and calculating its percentage share, on each group. That is, for each sector, grab the corresponding rows from the DataFrame and run the calculations in the next lines of your code. I'll store the results as a dictionary of dictionaries: {'Sector1': {'value_in_sect': 1234.56, 'pct_in_sect': 11.11}, 'Sector2': ... } for reasons that will become obvious later:
sector_total_value = {}
total_value = sectors_df['Current Value'].sum()
for sector, row_indices in sectors_index.items():
    sector_df = sectors_df.loc[row_indices]
    current_value = sector_df['Current Value'].sum()
    sector_total_value[sector] = {'value_in_sect': round(current_value, 2),
                                  'pct_in_sect': round(current_value/total_value * 100, 2)
                                  }
(see footnote 1 for a note on rounding)
Finally, collect the function results into a new DataFrame, where the index is the Sector. pandas can easily convert this nested dictionary structure into a DataFrame:
sector_total_value_df = pd.DataFrame.from_dict(sector_total_value, orient='index')
split-apply-combine using groupby
pandas makes this process very simple using the groupby method.
The groupby method splits a DataFrame into groups by a column or multiple columns (or even another Series):
grouped_by_sector = sectors_df.groupby('Sector')
grouped_by_sector is similar to the index we built earlier, but the groups can be manipulated much more easily, as we can see in the following steps.
To calculate the total value in each group, select the column or columns to sum up, use the agg or aggregate method with the function you want to apply:
sector_total_value_df = grouped_by_sector['Current Value'].agg(value_in_sect=sum)
It's already done! The apply step already creates a DataFrame where the index is the Sector (the groupby column) and the value in the value_in_sect column is the result of the sum operation.
I've left out the pct_in_sect part because a) it can be more easily done after the fact:
sector_total_value_df['pct_in_sect'] = round(sector_total_value_df['value_in_sect'] / total_value * 100, 2)
sector_total_value_df['value_in_sect'] = round(sector_total_value_df['value_in_sect'], 2)
and b) it's outside the scope of this answer.
Most of this can be done easily in one line (see footnote 2 for including the percentage, and rounding):
sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(value_in_sect=sum)
For subsectors, there's one additional consideration, which is that grouping should be done by Sector and Subsector rather than just Subsector, so that, for example rows from Utilities/Gas and Energy/Gas aren't combined.
subsector_total_value_df = sectors_df.groupby(['Sector', 'Sub Sector'])['Current Value'].agg(value_in_sect=sum)
This produces a DataFrame with a MultiIndex with levels 'Sector' and 'Sub Sector', and a column 'value_in_sect'. For a final piece of magic, the percentage in Sector can be calculated quite easily:
subsector_total_value_df['pct_within_sect'] = round(subsector_total_value_df['value_in_sect'] / sector_total_value_df['value_in_sect'] * 100, 2)
which works because the 'Sector' index level is matched during division.
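As a rough, self-contained illustration of the sector and sub-sector calculation, here is a sketch on made-up data. The sample rows are hypothetical, and instead of relying on index alignment during division it uses a groupby/transform on the 'Sector' level, which is an alternative way to get the per-sector totals lined up with each sub-sector row.

```python
import pandas as pd

# Hypothetical sample portfolio; the real sectors_df comes from the brokerage data.
sectors_df = pd.DataFrame({
    "Sector": ["Energy", "Energy", "Utilities", "Utilities"],
    "Sub Sector": ["Gas", "Oil", "Gas", "Gas"],
    "Current Value": [100.0, 300.0, 200.0, 400.0],
})
total_value = sectors_df["Current Value"].sum()

# One row per sector, with its share of the whole portfolio.
sector_total_value_df = sectors_df.groupby("Sector")["Current Value"].agg(value_in_sect="sum")
sector_total_value_df["pct_in_sect"] = (
    sector_total_value_df["value_in_sect"] / total_value * 100
).round(2)

# One row per (sector, sub sector) pair.
subsector_total_value_df = sectors_df.groupby(["Sector", "Sub Sector"])["Current Value"].agg(
    value_in_sect="sum"
)
# Broadcast each sector's total back onto its sub-sector rows.
sector_sums = subsector_total_value_df.groupby(level="Sector")["value_in_sect"].transform("sum")
subsector_total_value_df["pct_within_sect"] = (
    subsector_total_value_df["value_in_sect"] / sector_sums * 100
).round(2)
```

Note that Utilities/Gas and Energy/Gas stay separate because the grouping uses both columns.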
Footnote 1. This deviates from your code slightly, because I've chosen to calculate the percentage using the unrounded total value, to minimize the error in the percentage. Ideally though, rounding is only done at display time.
Footnote 2. This one-liner generates the desired result, including percentage and rounding:
sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(
    value_in_sect = lambda c: round(sum(c), 2),
    pct_in_sect = lambda c: round(sum(c)/sectors_df['Current Value'].sum() * 100, 2),
)

Pytables duplicates 2.5 giga rows

I currently have a .h5 file with a table in it consisting of three columns: a text column of 64 chars, a UInt32 column relating to the source of the text, and a UInt32 column which is the xxhash of the text. The table consists of ~2.5e9 rows.
I am trying to find and count the duplicates of each text entry in the table - essentially merge them into one entry, while counting the instances. I have tried doing so by indexing on the hash column and then looping through table.itersorted(hash), while keeping track of the hash value and checking for collisions - very similar to finding a duplicate in a hdf5 pytable with 500e6 rows. I did not modify the table as I was looping through it but rather wrote the merged entries to a new table - I am putting the code at the bottom.
Basically the problem I have is that the whole process takes significantly too long - it took me about 20 hours to get to iteration #5 4e5. I am working on a HDD however, so it is entirely possible the bottleneck is there. Do you see any way I can improve my code, or can you suggest another approach? Thank you in advance for any help.
P.S. I promise I am not doing anything illegal, it is simply a large scale leaked password analysis for my Bachelor Thesis.
ref = 3  # manually checked first occurring hash, to simplify the code below
gen_cnt = 0
locs = {}
for row in table.itersorted('xhashx'):
    gen_cnt += 1  # so as not to flush after every iteration
    ps = row['password'].decode(encoding='utf-8', errors='ignore')
    if row['xhashx'] == ref:
        if ps in locs:
            locs[ps][0] += 1
            locs[ps][1] |= row['src']
        else:
            locs[ps] = [1, row['src']]
    else:
        for p in locs:
            fill_password(new_password, locs[ps])  # simply fills in the columns, with some fairly cheap statistics procedures
        if (gen_cnt > 100):
            gen_cnt = 0
        ref = row['xhashx']
Your dataset is 5x larger than the referenced solution (2.5e9 vs 500e6 rows). Have you done any testing to identify where the time is spent? The table.itersorted() method may not scale linearly, and it might be resource intensive. (I don't have any experience with itersorted.)
Here is a process that might be faster:
1. Extract a NumPy array of the hash field (column xhashx)
2. Find the unique hash values
3. Loop through the unique hash values and extract a NumPy array of rows that match each value
4. Do your uniqueness tests against the rows in this extracted array
5. Write the unique rows to your new file
Code for this process below:
Note: This has not been tested, so it may have small syntax or logic gaps.
import numpy as np

# Step 1: Get a NumPy array of the 'xhashx' field/column only:
hash_arr = table.read(field='xhashx')
# Step 2: Get a new array with unique values only:
hash_arr_u = np.unique(hash_arr)
# Alternatively, combine the first 2 steps in a single step:
hash_arr_u = np.unique(table.read(field='xhashx'))
# Step 3a: Loop over the unique hash values
for hash_test in hash_arr_u:
    # Step 3b: Get an array with all rows that match this unique hash value
    match_row_arr = table.read_where('xhashx==hash_test')
    # Step 4: Check for rows with unique values
    # Check the hash row count.
    # If there is only 1 row, a uniqueness test is not required
    if match_row_arr.shape[0] == 1:
        # only one row, so write it to new.table
        pass
    else:
        # check for unique rows
        # then write unique rows to new.table
        pass

# np.unique has an option to save the hash counts
# these can be used as a test in the loop
(hash_arr_u, hash_cnts) = np.unique(table.read(field='xhashx'), return_counts=True)
# Loop over the array of unique hash values
for cnt in range(hash_arr_u.shape[0]):
    # Get an array with all rows that match this unique hash value
    hash_test = hash_arr_u[cnt]
    match_row_arr = table.read_where('xhashx==hash_test')
    # Check the hash row count.
    # If there is only 1 row, a uniqueness test is not required
    if hash_cnts[cnt] == 1:
        # only one row, so write it to new.table
        pass
    else:
        # check for unique rows
        # then write unique rows to new.table
        pass
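
To make the "check for unique rows" step more concrete, here is a small in-memory sketch using only NumPy. It is an illustration under assumptions: the four sample rows are made up, the field names follow the question, and a real run would work on chunks read from the PyTables table rather than on one small array. The idea is to sort by hash once, use np.unique to find where each hash group starts and how long it is, and then merge identical passwords inside each group.

```python
import numpy as np

# Hypothetical sample rows: (password, src bit flag, xxhash of the password).
rows = np.array(
    [(b"hunter2", 1, 11), (b"letmein", 2, 42), (b"hunter2", 4, 11), (b"qwerty", 1, 42)],
    dtype=[("password", "S64"), ("src", "u4"), ("xhashx", "u4")],
)

# Sort once by hash so that rows sharing a hash are contiguous.
order = np.argsort(rows["xhashx"], kind="stable")
sorted_rows = rows[order]

# For each unique hash: index of its first row and number of rows in its group.
hashes, starts, counts = np.unique(
    sorted_rows["xhashx"], return_index=True, return_counts=True
)

merged = []  # (password, occurrence count, OR-ed sources)
for start, count in zip(starts, counts):
    group = sorted_rows[start:start + count]
    locs = {}  # per-hash dictionary, so hash collisions are kept apart
    for row in group:
        entry = locs.setdefault(bytes(row["password"]), [0, 0])
        entry[0] += 1
        entry[1] |= int(row["src"])
    merged.extend((pw, n, src) for pw, (n, src) in locs.items())
```

Here hash 42 is a deliberate collision between two different passwords, which end up as two separate merged entries.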

Assigning mean values of a numpy row to a variable to use for making a histogram

Sorry for yet another question but I'm very new at python.
I have reaction time data for my go/no go conditions. I have put them into a dictionary called rts and split with two keys (go) and (no-go). I have worked out how to separate each numpy array row within these conditions as each row is a participant (there are 20 participants). I've managed to print out the mean and standard deviation for each participant into a table. This is the code below:
for row in range(0,20):
    print ("{} {:.2f} {:.2f} {:.2f} {:.2f}".format (participant, \
        go_row.mean(), go_row.std(),nogo_row.mean(), nogo_row.std()))
What I'm struggling to do is make a variable with each of the mean values for each participant. I want to do this as I want to create a histogram showing the distribution in performance across participants. Any help would be appreciated.
IIUC, you want a list:
means_participant = []
for row in range(0,20):
Store the values for each row in a dictionary, then add the dictionaries to a list that can be looped over later. This can be condensed, but I left it spelled out for clarity.
values = []
for row in range(0,20):
    d = {}
    d['participant'] = participant
    d['go_row_mean'] = go_row.mean()
    d['go_row_std'] = go_row.std()
    d['nogo_row_mean'] = nogo_row.mean()
    d['nogo_row_std'] = nogo_row.std()
    values.append(d)
The dictionary would be unnecessary if you know you only want one of the values, such as the go_row.mean(), and if you didn't care about matching the means in the list back up with a participant.
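If only the per-participant means are needed for the histogram, a minimal sketch could look like the following. It assumes that rts['go'] and rts['no-go'] are 2-D NumPy arrays with one row per participant; the random data below is a hypothetical stand-in for the real reaction times, and the trial count of 50 is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 20 participants x 50 trials per condition.
rts = {
    "go": rng.normal(450, 60, size=(20, 50)),
    "no-go": rng.normal(520, 80, size=(20, 50)),
}

# axis=1 averages across trials, giving one mean per participant (row).
go_means = rts["go"].mean(axis=1)
nogo_means = rts["no-go"].mean(axis=1)

# Bin the participant means; counts[i] is the number of participants in bin i.
counts, bin_edges = np.histogram(go_means, bins=5)
```

The go_means array can then be handed to a plotting call such as matplotlib's plt.hist(go_means) to draw the distribution across participants.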

How to multi-thread large number of pandas dataframe selection calls on large dataset

df is an unsorted dataframe containing 12+ million rows.
Each row has a GROUP ID.
The end goal is to randomly select 1 row per unique GROUP ID, populating a new column named SELECTED where 1 means selected and 0 means not selected.
There may be 5000+ unique GROUP IDs.
I am seeking a better and faster solution than the following, potentially a multi-threaded one:
for sec in df['GROUP'].unique():
    sz = df.loc[df.GROUP == sec, ['SELECTED']].size
    sel = [0]*sz
    sel[random.randint(0,sz-1)] = 1
    df.loc[df.GROUP == sec, ['SELECTED']] = sel
You could try a vectorized version, which will probably speed things up if you have many classes.
import numpy as np
import pandas as pd
# get fake data
df = pd.DataFrame(np.random.rand(100))
df['GROUP'] = df[0].astype(str).str[2]
# mark one element of each group as selected
df['selected'] = df.index.isin(  # Is current index in a selected list?
    df.groupby('GROUP')          # Get a GroupBy object.
    .apply(pd.Series.sample)     # Select one row from each group.
    .index.levels[1]             # Access index - in this case (group, old_id) pair; select the old_id out of the two.
).astype(int)                    # Convert to ints.
Note that this may fail if duplicate indices are present.
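A shorter variant of the same idea, as a sketch: pandas 1.1 and later provide a sample method directly on GroupBy objects, which picks the random row per group without apply. The fake data below is hypothetical (10,000 rows, 50 groups), and like the version above it assumes the index has no duplicates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: 10,000 rows spread over 50 group IDs.
df = pd.DataFrame({"GROUP": rng.integers(0, 50, size=10_000)})

# Draw one random row per group; the result keeps the original index labels.
picked = df.groupby("GROUP").sample(n=1, random_state=0)

# Default everything to 0, then flag only the picked rows.
df["SELECTED"] = 0
df.loc[picked.index, "SELECTED"] = 1
```

The random_state argument is only there to make the sketch reproducible; drop it for a different draw on each run.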
I do not know pandas DataFrames, but if you simply set selected where it needs to be 1, and later treat a missing attribute as "not selected", you could avoid updating all elements.
You may also do something like this :
selected = []
for sec in df['GROUP'].unique():
    selected.append(random.choice(sec))
or with list comprehensions
selected = [random.choice(sec) for sec in df['GROUP'].unique()]
Maybe this can speed it up, because you will not need to allocate new memory and update all elements of your dataframe.
If you really want multithreading have a look at concurrent.futures

Append pandas dataframe with column of averaged values from string matches in another dataframe

Right now I have two dataframes (let's call them A and B) built from Excel imports. Both have different dimensions as well as some empty/NaN cells. Let's say A is data for individual model numbers and B is a set of order information. For every row (unique item) in A, I want to search B for the (possibly) multiple orders for that item number, average the corresponding prices, and append A with a column containing the average price for each item.
The item numbers are alphanumeric so they have to be strings. Not every item will have orders/pricing information for it and I'll be removing those at the next step. This is a large amount of data so efficiency is ideal so iterrows probably isn't the right choice. Thank you in advance!
Here's what I have so far:
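avgPrice = []
for index, row in dfA.iterrows():
    def avg_unit_price(item_no, unit_price):
        matchingOrders = []
        for item, price in zip(item_no, unit_price):
            if item == row['itemNumber']:
                matchingOrders.append(price)
        # items without any orders get NaN
        avgPrice.append(sum(matchingOrders) / len(matchingOrders) if matchingOrders else float('nan'))
    avg_unit_price(dfB['item_no'], dfB['unit_price'])
dfA['avgPrice'] = avgPrice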
In general, avoid loops, as they perform poorly. If you can't vectorise easily, then as a last resort you can try pd.Series.apply. In this case, neither was necessary.
import pandas as pd
# B: pricing data
df_b = pd.DataFrame([['I1', 34.1], ['I2', 541.31], ['I3', 451.3], ['I2', 644.3], ['I3', 453.2]],
                    columns=['item_no', 'unit_price'])
# create avg price dictionary
item_avg_price = df_b.groupby('item_no', as_index=False).mean().set_index('item_no')['unit_price'].to_dict()
# A: product data
df_a = pd.DataFrame([['I1'], ['I2'], ['I3'], ['I4']], columns=['item_no'])
# map price info to product data
df_a['avgPrice'] = df_a['item_no'].map(item_avg_price)
# remove unmapped items
df_a = df_a[pd.notnull(df_a['avgPrice'])]
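
A slightly more compact variant of the same approach, as a sketch: Series.map also accepts a Series, so the intermediate dictionary can be skipped. The sample data is hypothetical and includes one NaN price to show that the group mean skips missing values by default.

```python
import pandas as pd

# B: pricing data, with one missing price (hypothetical sample values).
df_b = pd.DataFrame(
    [["I1", 34.1], ["I2", 541.31], ["I3", 451.3], ["I2", 644.3], ["I3", float("nan")]],
    columns=["item_no", "unit_price"],
)
# A: product data; I4 has no orders at all.
df_a = pd.DataFrame({"item_no": ["I1", "I2", "I3", "I4"]})

# Average price per item, indexed by item_no; NaN prices are ignored.
avg_price = df_b.groupby("item_no")["unit_price"].mean()

# Look up each item's average; items without orders become NaN.
df_a["avgPrice"] = df_a["item_no"].map(avg_price)

# Remove the items that have no pricing information.
df_a = df_a.dropna(subset=["avgPrice"])
```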

