I currently have a .h5 file with a table in it consisting of three columns: a text column of 64 chars, a UInt32 column relating to the source of the text, and a UInt32 column which is the xxhash of the text. The table consists of ~2.5e9 rows.
I am trying to find and count the duplicates of each text entry in the table - essentially merge them into one entry, while counting the instances. I have tried doing so by indexing on the hash column and then looping through table.itersorted(hash), while keeping track of the hash value and checking for collisions - very similar to finding a duplicate in a hdf5 pytable with 500e6 rows. I did not modify the table as I was looping through it but rather wrote the merged entries to a new table - I am putting the code at the bottom.
Basically the problem I have is that the whole process takes far too long - it took me about 20 hours to get to iteration #5.4e5. I am working on an HDD, however, so it is entirely possible the bottleneck is there. Do you see any way I can improve my code, or can you suggest another approach? Thank you in advance for any help.
P.S. I promise I am not doing anything illegal - it is simply a large-scale leaked-password analysis for my Bachelor's thesis.
ref = 3  # manually checked first occurring hash, to simplify the below code
gen_cnt = 0
locs = {}
print("STARTING")
for row in table.itersorted('xhashx'):
    gen_cnt += 1  # so as not to flush after every iteration
    ps = row['password'].decode(encoding='utf-8', errors='ignore')
    if row['xhashx'] == ref:
        if ps in locs:
            locs[ps][0] += 1           # count another occurrence
            locs[ps][1] |= row['src']  # merge the source flags
        else:
            locs[ps] = [1, row['src']]
    else:
        # hash changed: write out the merged entries collected for the previous hash
        for p in locs:
            fill_password(new_password, locs[p])  # simply fills in the columns, with some fairly cheap statistics procedures
            new_password.append()
        if gen_cnt > 100:
            gen_cnt = 0
            new_table.flush()
        ref = row['xhashx']
        locs = {ps: [1, row['src']]}  # start collecting the new hash group

# write out the final hash group after the loop ends
for p in locs:
    fill_password(new_password, locs[p])
    new_password.append()
new_table.flush()
Your dataset is 5x larger than the referenced solution (2.5e9 vs 500e6 rows). Have you done any testing to identify where the time is spent? The table.itersorted() method may not scale linearly and might be resource-intensive. (I don't have any experience with itersorted.)
Here is a process that might be faster:
1. Extract a NumPy array of the hash field (column 'xhashx')
2. Find the unique hash values
3. Loop through the unique hash values and extract a NumPy array of the rows that match each value
4. Do your uniqueness tests against the rows in this extracted array
5. Write the unique rows to your new file
Code for this process below:
Note: this has not been tested, so it may have small syntax or logic gaps.
# Step 1: Get a NumPy array of the 'xhashx' field/column only:
hash_arr = table.read(field='xhashx')

# Step 2: Get a new array with the unique values only:
hash_arr_u = np.unique(hash_arr)

# Alternately, combine the first 2 steps in a single step:
hash_arr_u = np.unique(table.read(field='xhashx'))

# Step 3a: Loop on the unique hash values
for hash_test in hash_arr_u:
    # Step 3b: Get an array with all rows that match this unique hash value
    match_row_arr = table.read_where('xhashx == hash_test')
    # Step 4: Check for rows with unique values.
    # If there is only 1 row, no uniqueness test is required.
    if match_row_arr.shape[0] == 1:
        pass  # only one row, so write it to new.table
    else:
        pass  # check for unique rows, then write the unique rows to new.table

##################################################
# np.unique has an option to also return the counts of each hash value;
# these can be used as the test in the loop
hash_arr_u, hash_cnts = np.unique(table.read(field='xhashx'), return_counts=True)

# Loop on the array of unique hash values
for cnt in range(hash_arr_u.shape[0]):
    # Get an array with all rows that match this unique hash value
    hash_test = hash_arr_u[cnt]  # read_where picks the variable up from the local namespace
    match_row_arr = table.read_where('xhashx == hash_test')
    # If there is only 1 row, no uniqueness test is required.
    if hash_cnts[cnt] == 1:
        pass  # only one row, so write it to new.table
    else:
        pass  # check for unique rows, then write the unique rows to new.table
I have a dictionary "c" with 30000 keys and around 600000 unique values (around 20 unique values per key).
I want to create a new pandas Series 'DOC_PORTL_ID' that takes the value from each row of column 'image_keys', looks up its key in my dictionary, and returns it. So I wrote a function like this:
def find_match(row, c):
    for key, val in c.items():
        for item in val:
            if item == row['image_keys']:
                return key
and then I use .apply to create my new column like:
df_image_keys['DOC_PORTL_ID'] = df_image_keys.apply(lambda x: find_match(x, c), axis=1)
This takes a long time. I am wondering if I can improve my snippet to make it faster.
I googled a lot and was not able to find the best way of doing this. Any help would be appreciated.
You're using your dictionary as a reverse lookup. And frankly, you haven't given us enough information about the dictionary. Are the 600,000 values unique? If not, you're only returning the first one you find. Is that expected?
Assuming they are unique:
reverse_dict = {val: key for key, values in c.items() for val in values}
df_image_keys['DOC_PORTL_ID'] = df_image_keys['image_keys'].map(reverse_dict)
This does the same job as your function, but with a single dictionary lookup per row instead of a nested loop per row. If those values are not unique, you'll have to provide a better explanation of what you expect to happen.
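For concreteness, here is a minimal self-contained sketch of the reverse-lookup-and-map approach (the dictionary and dataframe contents below are made up):
import pandas as pd

# Hypothetical stand-ins for the real data
c = {'doc_1': ['img_a', 'img_b'], 'doc_2': ['img_c']}
df_image_keys = pd.DataFrame({'image_keys': ['img_c', 'img_a', 'img_x']})

# Invert the dictionary once: each image key points back to its document id
reverse_dict = {val: key for key, values in c.items() for val in values}

# One hashed lookup per row; unmatched keys become NaN
df_image_keys['DOC_PORTL_ID'] = df_image_keys['image_keys'].map(reverse_dict)
print(df_image_keys)
#   image_keys DOC_PORTL_ID
# 0      img_c        doc_2
# 1      img_a        doc_1
# 2      img_x          NaN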
I'm new to pandas, so I apologize in advance if the answer is obvious, but I can't find any answers on the topic.
I have a data set of about two million rows, and I'm trying to group by one column and create unique lists as the aggregated values of the other two columns:
keys = [...]  # list of keys from an S3 paginator
dfs = []
for key in keys:
    print(key)
    match = re.search(r'^[\d]{4}-[\d]{2}-[\d]{2}/(.*)/(.*)_batch_request.csv$', key, re.IGNORECASE)
    df = pd.read_csv('s3://{}/{}'.format(bucket, key), names=['id'])
    df['environment'] = match.group(1)
    df['request_id'] = match.group(2)
    dfs.append(df)
all = pd.concat(dfs).reset_index()
all.groupby('id').agg({'environment': 'unique', 'request_id': 'unique'})
The last line reduces the original set to about 300k rows, and it takes a couple of minutes. The time needed to group does not seem directly related to the initial size of the frame, as attempts with smaller inputs produce very similar timings.
Maybe I am mistaken, but I was expecting much faster performance on an i7 with 4 GB of RAM. If I use nunique instead of unique just to count the unique elements, it runs in ~15 seconds.
Am I doing something wrong or is this the expected performance?
Thanks
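For reference, here is a minimal self-contained way to reproduce the unique-vs-nunique comparison described above on synthetic data (the row count, cardinalities, and column values are made up):
import time
import numpy as np
import pandas as pd

# Synthetic frame: ~2M rows, ~300k distinct ids, a handful of environments/request ids
rng = np.random.default_rng(0)
n = 2_000_000
df = pd.DataFrame({
    'id': rng.integers(0, 300_000, n),
    'environment': rng.choice(['dev', 'stage', 'prod'], n),
    'request_id': rng.integers(0, 50, n).astype(str),
})

# Time both aggregations the question compares
for how in ('nunique', 'unique'):
    t0 = time.perf_counter()
    df.groupby('id').agg({'environment': how, 'request_id': how})
    print(how, round(time.perf_counter() - t0, 1), 's')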
In my pandas data frame, I need to check whether a column contains any of the words in the dictionary values; if it does, I should return the corresponding key.
my_dict = {'woodhill': ["woodhill"],
           'woodcocks': ["woodcocks"],
           'whangateau': ["whangateau", "whangate"],
           'whangaripo': ["whangaripo", "whangari", "whangar"],
           'westmere': ["westmere"],
           'western springs': ["western springs", "western springs", "western spring", "western sprin",
                               "western spri", "western spr", "western sp", "western s"]}
I can write a for loop for this; however, I have nearly 1.5 million records in my data frame, the dictionary has more than 100 items, and each may have up to 20 values in some cases. How do I do this efficiently? Can I reverse the dictionary, making the values keys and the keys values, to make it fast? Thanks.
You can reverse your dictionary:
reversed_dict = {val: key for key in my_dict for val in my_dict[key]}
and then map it onto your dataframe:
df = pd.DataFrame({'col1': ['western springs', 'westerns', 'whangateau', 'whangate']})
df['col1'] = df['col1'].map(reversed_dict)
Try this approach (sketched in code below); it may help you.
1st: reverse the dictionary items. # limited number of items, so this will be fast
2nd: create a dataframe from the reversed dictionary. # instead of searching all keys for each comparison against the dataframe, it is better to do a join, and a dataframe is needed for that
3rd: do a left join from the big dataframe to the small dataframe (in this case, the dictionary).
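A minimal sketch of this join-based approach, assuming a column named 'col1' as in the answer above (not tested on the real 1.5-million-row data):
import pandas as pd

# 1st: reverse the dictionary (value -> key)
reversed_dict = {val: key for key, vals in my_dict.items() for val in vals}

# 2nd: turn the reversed dictionary into a small lookup dataframe
lookup_df = pd.DataFrame(list(reversed_dict.items()), columns=['col1', 'key'])

# 3rd: left join the big dataframe onto the small lookup dataframe
df = pd.DataFrame({'col1': ['western springs', 'westerns', 'whangateau', 'whangate']})
result = df.merge(lookup_df, on='col1', how='left')
print(result)
#                col1              key
# 0   western springs  western springs
# 1          westerns              NaN
# 2        whangateau       whangateau
# 3          whangate       whangateau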
I am attempting to use pandas to perform data analysis on a flat source of data. Specifically, what I'm attempting to accomplish is the equivalent of a Union All query in SQL.
I am using the read_csv() method to input the data and the output has unique integer indices and approximately 30+ columns.
Of these columns, several contain identifying information, whilst others contain data.
In total, the first 6 columns contain identifying information which uniquely identifies an entry. Following these 6 columns there is a range of columns (A, B, ... etc.) which hold the values. Some of these columns are linked together in sets; for example, (A,B,C) belong together, as do (D,E,F).
However, (D,E,F) are also related to (A,B,C), pairwise: ((A,D),(B,E),(C,F)).
What I am attempting to do is take my data set, which is as follows:
(id1,id2,id3,id4,id5,id6,A,B,C,D,E,F)
and return the following
((id1,id2,id3,id4,id5,id6,A,B,C),
(id1,id2,id3,id4,id5,id6,D,E,F))
Here, as A and D are linked they are contained within the same column.
(Note: this is a simplification; there are approximately 12 million unique combinations in the total dataset.)
I have been attempting to use the merge, concat and join functions to no avail. I feel like I am missing something crucial as in an SQL database I can simply perform a union all query (which is quite slow admittedly) to solve this issue.
I have no working sample code at this stage.
Another way of writing this problem, based on some of the pandas docs:
left = key, lval
right = key, rval
merge(left, right, on=key) = key, lval, rval
Instead I want:
left = key, lval
right = key, rval
union(left, right) = key, lval
                     key, rval
I'm not sure if a new indexing key value would need to be created for this.
I have been able to accomplish what I initially asked for.
It did require a bit of massaging of column names however.
Solution (using pseudo code):
Set up dataframes with the relevant data. e.g.
left = (id1,id2,id3,id4,id5,id6,A,B,C)
right = (id1,id2,id3,id4,id5,id6,D,E,F)
middle = (id1,id2,id3,id4,id5,id6,G,H,I)
Note that for my dataset this resulted in non-unique indexing keys for each of the ids; that is, a key is present for each row in left and right.
Rename the columns:
col_names = [id1,id2,id3,id4,id5,id6,val1,val2,val3]
left.columns = col_names
right.columns = col_names
middle.columns = col_names
Concatenate these
pieces = [left, right, middle]
new_df = pd.concat(pieces)
Now, this will create a new dataframe which contains x unique indexing values and 3x entries. This isn't quite ideal, but it will do for now; the major shortfall is that you cannot uniquely access a single entry row anymore - they come in triples. To access the data you can create a new dataframe based on the unique id values.
e.g.
check_df = new_df[(new_df['id1'] == 'id1') & (new_df['id2'] == 'id2')]  # ... etc.
print(check_df)
key, id1, id2, id3, id4, id5, id6, A, B, C
key, id1, id2, id3, id4, id5, id6, D, E, F
key, id1, id2, id3, id4, id5, id6, G, H, I
Now, this isn't quite ideal but it's the format I needed for some of my other analysis. It may not be applicable for all parties.
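For reference, here is a minimal runnable sketch of the rename-and-concat approach described above, using made-up toy data with only two id columns and two linked column sets:
import pandas as pd

# Toy stand-in for the real data: 2 id columns and 2 linked sets of value columns
df = pd.DataFrame({'id1': [1, 2], 'id2': ['x', 'y'],
                   'A': [10, 11], 'B': [20, 21],
                   'D': [30, 31], 'E': [40, 41]})

# Slice out one frame per linked set of columns
left = df[['id1', 'id2', 'A', 'B']].copy()
right = df[['id1', 'id2', 'D', 'E']].copy()

# Rename so the value columns line up, then stack the pieces (the SQL UNION ALL)
col_names = ['id1', 'id2', 'val1', 'val2']
left.columns = col_names
right.columns = col_names
new_df = pd.concat([left, right])  # index values now repeat once per piece
print(new_df)
#    id1 id2  val1  val2
# 0    1   x    10    20
# 1    2   y    11    21
# 0    1   x    30    40
# 1    2   y    31    41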
If anyone has a better solution, please do share it; I'm relatively new to using pandas with Python.