Efficiently comparing two ~100M row data sets in Python?

Attempting to compare two ~100M row HDF5 datasets. The first dataset is the Master and the second is the result of the Master being mapped and run through a cluster to discern a specific result for each row.
I need to validate that all the intended rows from the Master are present, remove any duplicates, and create a list of any missing rows that need to be computed. Hash values would be generated from the common elements between the two datasets. I realize, though, that it likely wouldn't be practical to loop through them row by row in native Python.
Such being the case, what would be a more efficient means of running this task? Do you try to code something in Cython to offset the Python loop speed, or is there a "better" way?
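One way this is often tackled is to stream only the key columns out of each file, hash them, and then work with the much smaller key series. A minimal sketch, assuming both files are PyTables "table" format HDF5 stores sharing a common key column; the paths, the store name "data", and the key list are placeholders:

import pandas as pd

MASTER_PATH, RESULT_PATH = "master.h5", "results.h5"   # placeholder paths
KEY_COLS = ["key"]                                      # placeholder key column(s)

def row_keys(path, chunksize=5_000_000):
    # Stream only the key columns out of the HDF5 table and hash each row
    # to a 64-bit value, so the full tables never sit in memory at once.
    pieces = []
    with pd.HDFStore(path, mode="r") as store:
        for chunk in store.select("data", columns=KEY_COLS, chunksize=chunksize):
            pieces.append(pd.util.hash_pandas_object(chunk, index=False))
    return pd.concat(pieces, ignore_index=True)

master_keys = row_keys(MASTER_PATH)
result_keys = row_keys(RESULT_PATH)

# Results computed more than once.
duplicates = result_keys[result_keys.duplicated(keep="first")]

# Master rows with no corresponding result: these still need to be computed.
missing = master_keys[~master_keys.isin(result_keys)]

The per-row hashes are only ~800 MB per dataset at 100M rows, so the duplicate and membership checks above stay vectorized and in memory even though the full tables never are.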

Related

Generating a reproducible unique ID in Spark dataframe

We have a data lake containing tons of files, where I can read the contents of the files along with their paths:
sdf = spark.read.load(source)\
    .withColumn("_path", F.input_file_name())
I would like to generate a unique ID for each row, for easier downstream joining between tables, and I want this ID to be reproducible between runs.
The simplest approach is to use the _path column as the identifier:
sdf.withColumn("id", F.col("_path"))
However, it would be "prettier" and more compact to have some kind of integer representation. And for other tables the unique identifier could be a combination of a few columns, uglifying this a bit more.
Another approach is to use a monotonically increasing ID:
sdf.withColumn("id", F.monotonically_increasing_id())
However, with this solution there is no guarantee that a row with id=2 in one run will still have id=2 when the analysis is run a week later (when new data has arrived).
A third approach is to use a hashing function:
sdf.withColumn("id", F.hash("_path"))
This could be quite nice, because it is easy to hash a combination of columns, but it is not stable, since multiple inputs can give the same output:
Running such analysis on our actual data gave 396,702 hash-ids from a single origin _path, and 24 hash-ids originating from two paths. Hence a collision rate of 0.006%.
We could simply disregard this very small portion of the data, but there must be a more elegant way of achieving what I want to achieve?
You can try the xxhash64 hash in Spark SQL, which gives a 64-bit hash value and should be more robust to hash collisions:
sdf.withColumn("id", F.expr("xxhash64(_path)"))
or, for a more robust hashing algorithm:
sdf.withColumn("id", F.expr("conv(sha2(_path,256),16,10)"))

Does Dask/Pandas support removing rows in a group based on complex conditions that rely on other rows?

I'm processing a bunch of text-based records in CSV format using Dask, which I am learning in order to work around data that is too large to fit in memory, and I'm trying to filter records within groups that best match a complicated criterion.
The best approach I've identified so far is basically to use Dask to group records into bite-sized chunks and then write the applicable logic in Python:
import pandas as pd

def reduce_frame(partition):
    records = partition.to_dict('records')
    shortlisted_records = []
    # Use Python to locate promising looking records.
    # Some of the criteria can be cythonized; one criterion
    # revolves around whether a record is a parent or child
    # of records in shortlisted_records.
    for record in records:
        for other in shortlisted_records:
            if other['path'].startswith(record['path']) \
                    or record['path'].startswith(other['path']):
                ...  # keep one, possibly both
        ...
    return pd.DataFrame.from_dict(shortlisted_records)

df = df.groupby('key').apply(reduce_frame, meta={...})
In case it matters, the complicated criterion revolves around weeding out promising-looking links on a web page based on link URL, link text, and CSS selectors across the entire group. Think: given A and B already in the shortlist and C a new record, keep all three if each is very promising; otherwise prefer C over A and/or B if it is more promising than either or both; otherwise drop C. The resulting Pandas partition objects above are tiny. (The dataset in its entirety is not, hence my using Dask.)
Seeing how Pandas exposes inherently row- and column-based functionality, I'm struggling to imagine any vectorized approach to solving this, so I'm exploring writing the logic in Python.
Is the above the correct way to proceed, or are there more Dask/Pandas-idiomatic ways, or simply better ways, to approach this type of problem? Ideally one that allows me to parallelize the computations across a cluster, for instance by using Dask.bag or Dask.delayed and/or cytoolz, or something else I might have missed while learning Python?
I know nothing about Dask, but can tell a little about passing / blocking some rows using Pandas.
It is possible to use groupby(...).apply(...) to "filter" the source DataFrame.
Example: df.groupby('key').apply(lambda grp: grp.head(2)) returns the first 2 rows from each group.
In your case, write a function to be applied to each group, which:
contains some logic processing the current group,
generates the output DataFrame based on this logic, e.g. returning only some of the input rows.
The returned rows are then concatenated, forming the result of apply.
Another possibility is to use groupby(...).filter(...), but in this case the underlying function returns a decision to "pass" or "block" each whole group of rows.
Yet another possibility is to define a "filtering function", say filtFun, which returns True (pass the row) or False (block the row). Then:
run msk = df.apply(filtFun, axis=1) to generate a mask of which rows passed the filter,
and in further processing use df[msk], i.e. only those rows which passed the filter.
But in this case the underlying function has access only to the current row, not to the whole group of rows.
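A minimal sketch of the three variants described above, using a toy DataFrame and a hypothetical "score" criterion:

import pandas as pd

df = pd.DataFrame({
    "key":   ["a", "a", "a", "b", "b"],
    "score": [0.9, 0.2, 0.7, 0.1, 0.3],
})

# 1. groupby().apply(): the group function returns only some of its rows.
top2 = df.groupby("key", group_keys=False).apply(lambda grp: grp.nlargest(2, "score"))

# 2. groupby().filter(): keep or drop whole groups at once.
good_groups = df.groupby("key").filter(lambda grp: grp["score"].max() > 0.5)

# 3. Row-wise mask: the function sees only one row at a time.
msk = df.apply(lambda row: row["score"] > 0.5, axis=1)
kept_rows = df[msk]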

Memory Error when parsing two large data frames

I have two dataframes, each around 400k rows; call them a and b. What I want to do is, for every row in dataframe b, look up that row's account number in dataframe a. If it exists, I want to drop that row from dataframe a. The problem is, when I try to run this code, I keep getting memory errors. Initially I was using iterrows, but that seems to be bad when working with large datasets, so I switched to apply, but I'm running into the same error. Below is simplified pseudocode of what I'm trying:
def reduceAccount(accountID):
    idx = frameA.loc[frameA["AccountID"] == accountID].index
    frameB.drop(idx, inplace=True)

frameB["AccountID"].apply(reduceAccount)
I've even tried some shennanigans like iterating thru the first few hundred/thousand rows, but after a cycle, i still hit the memory error, which makes me think im still loading things into memory rather than clearing thru. Is there a better way to reduce dataframeA than what im trying? Note that I do not want to merge the frames (yet) just remove any row in dataframe a that has a duplicate key in dataframe b.
The issue is that, in order to see all the values to filter on, you will need to store both DFs in memory at some point. You can improve efficiency somewhat by not using apply(), which still iterates row by row under the hood. The following code is a more efficient, vectorized approach using boolean masking directly:
dfB[~dfB["AccountID"].isin(dfA["AccountID"])]
However, if storage is the problem, then this may still not work. Some approaches to consider are chunking the data, as you say you've already tried, or some of the options described in the pandas documentation on enhancing performance.
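If one of the frames is loaded from disk, the isin() mask combines naturally with chunked reading so that only one slice is in memory at a time. A sketch, assuming a CSV source for dfB; the file name and chunk size are placeholders:

import pandas as pd

# Build the lookup set once; a plain Python set keeps the footprint small.
known_ids = set(dfA["AccountID"])

# Stream dfB in chunks and keep only rows whose AccountID is unknown.
kept = []
for chunk in pd.read_csv("frame_b.csv", chunksize=100_000):
    kept.append(chunk[~chunk["AccountID"].isin(known_ids)])

dfB_reduced = pd.concat(kept, ignore_index=True)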
So basically you want every row in A whose 'AccountID' is not in B.
This can be done with a left join: frameA = frameA.join(frameB, on='AccountID', how='left')
I think this is best in terms of memory efficiency, since you'll be leveraging the power of pandas' built-in optimized code.
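Note that a plain left join keeps every row of frameA; to actually drop the matching rows, one common variant is a left anti-join via merge with an indicator column, roughly:

# Left merge with an indicator, then keep only rows that found no match
# in frameB (a "left anti-join").
merged = frameA.merge(frameB[["AccountID"]].drop_duplicates(),
                      on="AccountID", how="left", indicator=True)
frameA_reduced = merged[merged["_merge"] == "left_only"].drop(columns="_merge")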

Is sorting a DataFrame memory efficient?

Is sorting a DataFrame in pandas memory efficient? I.e., can I sort the dataframe without reading the whole thing into memory?
Internally, pandas relies on numpy.argsort to do all the sorting.
That being said: pandas DataFrames are backed by numpy arrays, which have to be present in memory as a whole. So, to answer your question: No, pandas needs the whole dataset in memory for sorting.
Additional thoughts:
You can of course implement such a disk-based external sort in multiple steps: load a chunk of your dataset, sort it, and save the sorted version; repeat for all chunks; then load a part of each sorted subset, join them into one DataFrame, and sort that. You'll have to be careful here about how much to load from each source. For example, if your 1000-element dataset is already sorted, getting the top 10 results from each of the 10 subsets won't get you the correct top 100. It will, however, give you the correct top 10.
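A rough sketch of that external-sort idea with pandas, sorting a large CSV on one column; the file names, chunk size, and sort key are placeholders, and the key is assumed numeric:

import csv
import heapq
import pandas as pd

SORT_COL = "value"                 # hypothetical sort key
chunk_files = []

# Pass 1: sort manageable chunks in memory and spill each one to disk.
for i, chunk in enumerate(pd.read_csv("big.csv", chunksize=1_000_000)):
    path = f"sorted_chunk_{i}.csv"
    chunk.sort_values(SORT_COL).to_csv(path, index=False)
    chunk_files.append(path)

# Pass 2: k-way merge of the pre-sorted chunks, streaming row by row.
def rows(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

merged = heapq.merge(*(rows(p) for p in chunk_files),
                     key=lambda r: float(r[SORT_COL]))

with open("big_sorted.csv", "w", newline="") as out:
    writer = None
    for row in merged:
        if writer is None:
            writer = csv.DictWriter(out, fieldnames=row.keys())
            writer.writeheader()
        writer.writerow(row)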
Without further information about your data, I suggest you let some (relational) database handle all that stuff. They're made for this kind of thing, after all.

Python: Find duplicate rows in large boolean matrix

I have a boolean matrix in Python and need to find out which rows are duplicates. The representation can also be a list of bitarrays, as I am using that for other purposes anyway. Comparing all rows with all rows is not an option, as this would yield 12500^2 comparisons and I can only do about 500 per second. Also, converting each row into an integer is not possible, as each row is about 5000 bits long. Still, it seems to me that the best way would be to sort the list of bitarrays and then compare only consecutive rows. Does anyone have an idea how to map bitarrays to sortable values, or how to sort a list of bitarrays in the first place? Or is there a different approach that is more promising? Also, since I only have to do this once, I prefer less code over efficiency.
OK, so a list of bitarrays is quickly sortable by sort() or sorted(). Furthermore, probably a better way to solve this problem is indicated in Find unique rows in numpy.array.
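A minimal sketch with NumPy, assuming the matrix is (or can be converted to) a 2-D boolean array: pack each row into bytes and group identical rows via a dictionary.

import numpy as np

# Stand-in data: 12,500 rows of 5,000 bits each.
rng = np.random.default_rng(0)
mat = rng.integers(0, 2, size=(12500, 5000)).astype(bool)

# 5,000 bits -> 625 bytes per row; identical rows pack to identical bytes.
packed = np.packbits(mat, axis=1)

seen = {}                      # bytes -> index of first occurrence
duplicates = []                # (first_index, duplicate_index) pairs
for i, row in enumerate(packed):
    key = row.tobytes()
    if key in seen:
        duplicates.append((seen[key], i))
    else:
        seen[key] = i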
