Python runtime: 3x difference for 32- vs 34-character IDs

I am running an aggregation script that relies heavily on grouping by an identifier column. Each identifier in this column is 32 characters long, produced by a hashing function, so an entry of the ID column used in the pandas groupby looks like
e667sad2345...1238a
I tried to add a prefix "ID" to some of the samples, for easier separation afterwards. Thus, I had some identifiers with 34 characters and others still with 32 characters.
e667sad2345...1238a
IDf7901ase323...1344b
Now the aggregation script takes three times as long (6000 vs. 2000 seconds), and the change to the ID column (adding the prefix) is the only thing that happened. Note that I generate the data separately and save it as a pickle file, which my aggregation script reads as input, so adding the prefix is not part of the runtime I am talking about.
So now I am stunned why this particular change has such a huge impact. Can someone explain?
EDIT: I replaced the prefix with a suffix, so now it is
e667sad2345...1238a
f7901ase323...1344bID
and now it runs in 2000 seconds again. Does groupby use a binary search or something, so that the IDs starting with the character 'I' are overrepresented?

OK, I had a revelation about what is going on.
My entries are sorted using quicksort, which has an expected runtime of O(n log n) but a worst-case runtime of O(n²). By making my entries imbalanced (20% of the data starts with "I", while the other 80% is spread roughly uniformly over alphanumeric characters), I shifted the data towards a bad case for quicksort.
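For anyone who wants to check this on their own setup, here is a rough reproduction sketch (not the OP's script; the column contents, sizes and prefix ratio are made up, and whether the 3x gap shows up will depend on the pandas version and its internal sorting):

import hashlib
import time

import numpy as np
import pandas as pd

n = 1_000_000
ids = [hashlib.md5(str(i).encode()).hexdigest() for i in range(n)]      # 32-char hash IDs
prefixed = ["ID" + s if i % 5 == 0 else s for i, s in enumerate(ids)]   # ~20% get the "ID" prefix

rng = np.random.default_rng(0)
values = rng.random(n)

for name, col in [("plain 32-char", ids), ("20% prefixed", prefixed)]:
    df = pd.DataFrame({"id": col, "x": values})
    t0 = time.perf_counter()
    df.groupby("id")["x"].sum()
    print(f"{name}: {time.perf_counter() - t0:.2f} s")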

Related

Data structures with fast lookup (as close as possible to sets/hashtables) but with lower memory usage? [duplicate]

I'm looking for a set-like data structure in Python that allows fast lookup (O(1), as for sets) for 100 million short strings (or byte strings) of length ~10.
With 10M strings, this already takes 750 MB of RAM on Python 3.7 or 3.10.2 (or 900 MB if we replace the byte strings with strings):
S = set(b"a%09i" % i for i in range(10_000_000)) # { b"a000000000", b"a000000001", ... }
whereas the "real data" here is 10 bytes * 10M ~ 100 MB. So there is a 7.5x memory consumption factor because of the set structure, pointers, buckets, etc. (for a study of this in the case of a list, see the answer to Memory usage of a list of millions of strings in Python).
When working with "short" strings, the pointers to the strings (64 bits = 8 bytes each) in the internal structure are probably already responsible for a 2x factor, on top of the bucket structure of the hash table, etc.
Are there some "short string optimizations" techniques allowing to have a memory-efficient set of short bytes-strings in Python? (or any other structure allowing fast lookup/membership test)
Maybe without pointers to strings, but rather storing the strings directly in the data structure if string length <= 16 characters, etc.
Or would using bisect or a sorted list help (a lookup in O(log n) might be OK), while keeping memory usage small (smaller than the 7.5x factor of a set)?
Here are the methods I have tested so far, thanks to the comments, which seem to work.
Sorted list + bisection search (+ bloom filter)
Insert everything in a standard list L, in sorted order. This takes a lot less memory than a set.
(optional) Create a Bloom filter; here is a very small piece of code to do it.
(optional) First test membership with Bloom filter (fast).
Check if it really is a match (and not a false positive) with the fast in_sorted_list() from this answer using bisect, much faster than a standard lookup b"hello" in L.
If the bisection search is fast enough, we can even bypass the bloom filter (steps 2 and 3). It will be O(log n).
In my test with 100M strings, even without the Bloom filter, the lookup took 2 µs on average.
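A minimal sketch of this approach, assuming the strings fit in RAM as a plain sorted list (much cheaper than a set of the same strings); the in_sorted_list helper below is a re-implementation along the lines of the linked answer:

from bisect import bisect_left

L = sorted(b"a%09i" % i for i in range(10_000_000))   # build once, kept sorted

def in_sorted_list(sorted_list, x):
    """O(log n) membership test on a sorted list."""
    i = bisect_left(sorted_list, x)
    return i != len(sorted_list) and sorted_list[i] == x

print(in_sorted_list(L, b"a000004242"))   # True
print(in_sorted_list(L, b"b000000000"))   # False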
Sqlite3
As suggested by @tomalak's comment, inserting all the data in a Sqlite3 database works very well.
Querying if a string exists in the database was done in 50 µs on average on my 8 GB database, even without any index.
Adding an index made the DB grow to 11 GB, but then the queries were still done in ~50 µs on average, so no gain here.
Edit: as mentioned in a comment, using CREATE TABLE t(s TEXT PRIMARY KEY) WITHOUT ROWID; even made the DB smaller: 3.3 GB, and the queries are still done in ~50 µs on average. Sqlite3 is (as always) really amazing.
In this case, it's even possible to load it totally in RAM with the method from How to load existing db file to memory in Python sqlite3?, and then it's ~9 µs per query!
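For reference, a minimal sketch of this Sqlite3 variant (the file name and data are made up; the WITHOUT ROWID table stores the strings directly in the primary-key B-tree, and the optional backup call is the load-into-RAM trick mentioned above):

import sqlite3

con = sqlite3.connect("strings.db")
con.execute("CREATE TABLE IF NOT EXISTS t(s TEXT PRIMARY KEY) WITHOUT ROWID")
con.executemany("INSERT OR IGNORE INTO t VALUES (?)",
                (("a%09i" % i,) for i in range(10_000_000)))
con.commit()

def in_db(connection, s):
    return connection.execute("SELECT 1 FROM t WHERE s = ?", (s,)).fetchone() is not None

print(in_db(con, "a000004242"))   # True

# optional (Python >= 3.7): copy the whole DB into RAM for faster queries
mem = sqlite3.connect(":memory:")
con.backup(mem)
print(in_db(mem, "a000004242"))   # True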
Bisection in file with sorted lines
Working, and with very fast queries (~ 35 µs per query), without loading the file in memory! See
Bisection search in the sorted lines of an opened file (not loaded in memory)
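A sketch of the idea in its simplest form, assuming every line of the sorted file has the same fixed width (10 data bytes plus a newline); with fixed-width records the offset arithmetic is exact, while variable-length lines need the re-alignment trick from the linked question:

import os

LINE_LEN = 11  # 10 data bytes + b"\n" -- an assumption for this sketch

def in_sorted_file(f, target, line_len=LINE_LEN):
    """Binary search for `target` (bytes, without newline) among the sorted fixed-width lines of f."""
    n_lines = os.fstat(f.fileno()).st_size // line_len
    lo, hi = 0, n_lines
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * line_len)
        line = f.read(line_len).rstrip(b"\n")
        if line == target:
            return True
        if line < target:
            lo = mid + 1
        else:
            hi = mid
    return False

# usage:
# with open("sorted_strings.txt", "rb") as f:
#     print(in_sorted_file(f, b"a000004242"))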
Dict with prefixes as keys and concatenation of suffixes as values
This is the solution described here: Set of 10-char strings in Python is 10 times bigger in RAM as expected.
The idea is: we have a dict D and, for a given word,
prefix, suffix = word[:4], word[4:]
D[prefix] += suffix + b' '
With this method, the RAM used is even smaller than the actual data (I tested with 30M strings of average length 14, and it used 349 MB). The queries seem very fast (2 µs), but the initial creation time of the dict is a bit high.
I also tried with dict values = list of suffixes, but it's much more RAM-consuming.
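A minimal sketch of this prefix-dict idea (4-byte prefixes and space-terminated suffixes, as above; the leading b" " stored in each blob makes the membership test exact, so the tail of one suffix cannot accidentally match the start of another):

from collections import defaultdict

D = defaultdict(lambda: b" ")

def add(word):
    prefix, suffix = word[:4], word[4:]
    D[prefix] += suffix + b" "

def contains(word):
    prefix, suffix = word[:4], word[4:]
    return b" " + suffix + b" " in D.get(prefix, b"")

add(b"a000004242")
print(contains(b"a000004242"))   # True
print(contains(b"a000004243"))   # False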
Maybe without pointers to strings, but rather storing the strings directly in the data structure if string length <= 16 characters, etc.
While it is not a set data structure but rather a list, I think pyarrow has a quite optimized way of storing a large number of small strings. There is a pandas integration as well, which should make it easy to try out:
https://pythonspeed.com/articles/pandas-string-dtype-memory/
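As a rough illustration of the storage side (assuming pyarrow is installed; the numbers are indicative only): a pyarrow binary array keeps the string bytes in one contiguous buffer plus a 32-bit offset per value, so the per-string overhead is a few bytes rather than a full Python object.

import pyarrow as pa

data = [b"a%09i" % i for i in range(10_000_000)]
arr = pa.array(data, type=pa.binary())
print(arr.nbytes / 1e6, "MB")   # roughly 10 data bytes + 4 offset bytes per string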

Infer the length of a sequence using the CIGAR

To give you a bit of context: I am trying to convert a sam file to bam
samtools view -bT reference.fasta sequences.sam > sequences.bam
which exits with the following error
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 102
[main_samview] truncated file
and the offending line looks like this:
SRR808297.2571281 99 gi|309056|gb|L20934.1|MSQMTCG 747 80 101M = 790 142 TTGGTATAAAATTTAATAATCCCTTATTAATTAATAAACTTCGGCTTCCTATTCGTTCATAAGAAATATTAGCTAAACAAAATAAACCAGAAGAACAT ##CFDDFD?HFDHIGEGGIEEJIIJJIIJIGIDGIGDCHJJCHIGIJIJIIJJGIGHIGICHIICGAHDGEGGGGACGHHGEEEFDC#=?CACC>CCC NM:i:2 MD:Z:98A1A
My sequence is 98 characters long, but a probable bug when creating the sam file reported 101 in the CIGAR. I can give myself the luxury of losing a couple of reads, and I don't have access at the moment to the source code that produced the sam files, so there is no opportunity to hunt down the bug and re-run the alignment. In other words, I need a pragmatic solution to move on (for now). Therefore, I devised a python script that counts the length of my string of nucleotides, compares it with what is registered in the CIGAR, and saves the "sane" lines to a new file.
#!/usr/bin/python
import itertools
from cigar import Cigar

with open('myfile.sam', 'r') as f:
    for line in itertools.islice(f, 3, None):  # loop through the file, skipping the first three (header) lines
        cigar_string = line.split("\t")[5]
        cigarlength = len(Cigar(cigar_string))  # use the cigar module to obtain the length implied by the CIGAR string
        seqlength = len(line.split("\t")[9])
        if cigarlength == seqlength:
            ...  # preserve the line in a new file
As you can see, to translate the CIGAR into an integer showing the length, I am using the module CIGAR. To be honest, I am a bit wary of its behavior. This module seems to miscalculate the length in very obvious cases. Is there another module or a more explicit strategy to translate the CIGAR into the length of the sequence?
Sidenote: it is interesting, to say the least, that this problem has been widely reported but no pragmatic solution can be found on the internet. See the links below:
https://github.com/COMBINE-lab/RapMap/issues/9
http://seqanswers.com/forums/showthread.php?t=67253
http://seqanswers.com/forums/showthread.php?t=21120
https://groups.google.com/forum/#!msg/snap-user/FoDsGeNBDE0/nRFq-GhlAQAJ
The SAM spec offers us this table of CIGAR operations which indicates which ones "consume" the query or the reference, complete with explicit instructions on how to calculate sequence length from a CIGAR string:
Op  BAM  Description                                            Consumes query  Consumes reference
M   0    alignment match (can be a sequence match or mismatch)  yes             yes
I   1    insertion to the reference                             yes             no
D   2    deletion from the reference                            no              yes
N   3    skipped region from the reference                      no              yes
S   4    soft clipping (clipped sequences present in SEQ)       yes             no
H   5    hard clipping (clipped sequences NOT present in SEQ)   no              no
P   6    padding (silent deletion from padded reference)        no              no
=   7    sequence match                                         yes             yes
X   8    sequence mismatch                                      yes             yes
“Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
...
Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.
This lets us trivially calculate the length of a sequence from its CIGAR by adding up the lengths of all the "consumes query" ops in the CIGAR. This is exactly what happens in the cigar module (see https://github.com/brentp/cigar/blob/754cfed348364d390ec1aa40c951362ca1041f7a/cigar.py#L88-L93), so I don't know why the OP here reckoned that module's implementation was wrong.
If we extract out the relevant code from the (already very short) cigar module, we're left with something like this as a short Python implementation of the summing operation described in the quote above:
from itertools import groupby

def query_len(cigar_string):
    """
    Given a CIGAR string, return the number of bases consumed from the
    query sequence.
    """
    read_consuming_ops = ("M", "I", "S", "=", "X")
    result = 0
    cig_iter = groupby(cigar_string, lambda chr: chr.isdigit())
    for _, length_digits in cig_iter:
        length = int(''.join(length_digits))
        op = next(next(cig_iter)[1])
        if op in read_consuming_ops:
            result += length
    return result
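For example (a quick check of the function above, not part of the original answer):

print(query_len("3M1I3M1D5M"))  # 12 = 3 + 1 + 3 + 5; the 1D is skipped because D does not consume the query
print(query_len("101M"))        # 101, the length reported in the offending line above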
I suspect the reason there isn't a tool to fix this problem is because there is no general solution, aside from performing the alignment again using software that does not exhibit this problem. In your example, the query sequence aligns perfectly to the reference, so the CIGAR string is not very interesting (just a single Match operation prefixed by the overall query length), and the fix simply requires changing 101M to 98M.
However, for more complex CIGAR strings (e.g. those that include Insertions, Deletions, or any other operations), you would have no way of knowing which part of the CIGAR string is too long. If you subtract from the wrong part of the CIGAR string, you'll be left with a misaligned read, which is probably worse for your downstream analysis than just leaving the whole read out.
That said, if it happens to be trivial to get it right (perhaps your broken alignment procedure always adds extra bases to the first or last CIGAR operation), then what you need to know is the correct way to calculate the query length according to the CIGAR string, so that you know what to subtract from it.
samtools calculates this using the htslib function bam_cigar2qlen.
The other functions that bam_cigar2qlen calls are defined in sam.h, including a helpful comment showing the truth table for which operations consume query sequence vs reference sequence.
In short, to calculate the query length of a CIGAR string the way that samtools (really htslib) does it, you should add the given length for CIGAR operations M, I, S, =, or X and ignore the length of CIGAR operations for any of the other operations.
The current version of the python cigar module seems to be using the same set of operations, and the algorithm for calculating the query length (which is what len(Cigar(cigar)) would return) looks right to me. What makes you think that it isn't giving the correct results?
It looks like you should be able to use the cigar python module to hard clip from either the left or right end using the mask_left or mask_right method with mask="H".
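Putting the pieces together, here is a sketch of the pragmatic filter the question asks for, without the cigar module (file names are placeholders; header lines are detected by their leading '@' instead of skipping a fixed number of lines, and reads whose CIGAR-implied query length disagrees with len(SEQ) are dropped):

import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")

def cigar_query_len(cigar):
    """Sum the lengths of the CIGAR operations that consume the query (M/I/S/=/X)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in CONSUMES_QUERY)

with open("sequences.sam") as src, open("sequences.filtered.sam", "w") as dst:
    for line in src:
        if line.startswith("@"):          # header lines have no CIGAR
            dst.write(line)
            continue
        fields = line.rstrip("\n").split("\t")
        cigar, seq = fields[5], fields[9]
        if cigar == "*" or cigar_query_len(cigar) == len(seq):
            dst.write(line)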

Can I group graphite results by regex?

I've been using graphite for some time now in order to power our backend Python program. As part of my usage of it, I need to sum (using sumSeries) different metrics using wildcards.
Thing is, I need to group them according to a pattern; say I have the following range of metric names:
group.*.item.*
I need to sum the values of all items, for a given group (meaning: group.1.item.*, group.2.item.*, etc)
Unfortunately, I do not know in advance the set of existing group values, so what I do right now is query metrics/index.json, parse the list, and generate the desired query (manually creating sumSeries(group.NUMBER.item.*) for every NUMBER I find in the metrics index).
I was wondering if there is a way to have graphite do this for me and save the first roundtrip, as the communication and pre-processing are costly (they take more than half of the total runtime).
Thanks in advance!
If you want a separate line for each group you could use the groupByNode function.
groupByNode(group.*.item.*, 1, "sumSeries")
Where '1' is the node you're selecting (indexed by 0) and "sumSeries" is the function you are feeding each group into.
You can read more about this here: http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.groupByNode
If you want to restrict the second node to only numeric values you can use a character range. You do this by specifying the range in square brackets [...]. A character range is indicated by 2 characters separated by a dash (-).
group.[0-9].item.*
You can read more about this here:
http://graphite.readthedocs.io/en/latest/render_api.html#paths-and-wildcards
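If the queries are issued from the Python backend anyway, a single render call with such a target replaces the metrics/index.json roundtrip entirely. A rough sketch, assuming the requests library and a placeholder Graphite host:

import requests

resp = requests.get(
    "http://graphite.example.com/render",
    params={
        "target": 'groupByNode(group.*.item.*, 1, "sumSeries")',
        "from": "-1h",
        "format": "json",
    },
)
for series in resp.json():                       # one entry per group
    print(series["target"], series["datapoints"][:3])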

Big data File: Read and Create structured file

I have a 20+GB dataset that is structured as follows:
1 3
1 2
2 3
1 4
2 1
3 4
4 2
(Note: the repetition is intentional and there is no inherent order in either column.)
I want to construct a file in the following format:
1: 2, 3, 4
2: 3, 1
3: 4
4: 2
Here is my problem: I have tried writing scripts in both Python and C++ to load in the file, build long strings, and write to a file line by line. It seems, however, that neither script can handle the task at hand. Does anyone have any suggestions on how to tackle this problem? Specifically, is there a particular method/program that is optimal for this? Any help or guided directions would be greatly appreciated.
You can try this using Hadoop by running a stand-alone MapReduce program. The mapper outputs the first column as the key and the second column as the value; all outputs with the same key go to one reducer, so you end up with a key and the list of values seen for that key. You can then run through the list of values and output the (key, valueString) pair, which is the final output you desire (a sketch is below). You can start from a simple Hadoop tutorial and write the mapper and reducer as suggested. However, I've not tried to scale 20 GB of data on a stand-alone Hadoop system, so you may have to experiment. Hope this helps.
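A minimal Hadoop Streaming sketch of that mapper/reducer pair (script names are illustrative; Streaming delivers the mapper output to the reducer sorted by key, which is what makes the grouping work):

# mapper.py -- emit "key<TAB>value" for every input line
import sys

for line in sys.stdin:
    parts = line.split()
    if len(parts) == 2:
        print(f"{parts[0]}\t{parts[1]}")

# reducer.py -- consecutive lines share a key, so they can be grouped directly
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    print(f"{key}: {', '.join(value for _, value in group)}")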
Have you tried using a std::vector of std::vector?
The outer vector represents each row. Each slot in the outer vector is a vector containing all the possible values for each row. This assumes that the row # can be used as an index into the vector.
Otherwise, you can try std::map<unsigned int, std::vector<unsigned int> >, where the key is the row number and the vector contains all values for the row.
A std::list of values would also work.
Does your program run out of memory?
Edit 1: Handling large data files
You can handle your issue by treating it like a merge sort.
Open a file for each row number.
Append the 2nd column values to the file.
After all data is read, close all files.
Open each file and read the values and print them out, comma separated.
In short: open an output file for each key; while iterating over the lines of the source file, append each value to the output file for its key; finally join the per-key output files. (A Python sketch of this approach follows.)
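A rough Python sketch of this file-per-key approach (file names are made up; it assumes the number of distinct keys stays below the OS limit on open file descriptors, otherwise the handles would have to be opened and closed in batches):

import glob
import os

handles = {}
with open("edges.txt") as src:                    # lines of "key value"
    for line in src:
        key, value = line.split()
        if key not in handles:
            handles[key] = open(f"key_{key}.tmp", "w")
        handles[key].write(value + " ")
for h in handles.values():
    h.close()

with open("grouped.txt", "w") as out:
    # sorts numerically; drop the int() if the keys are arbitrary strings
    for path in sorted(glob.glob("key_*.tmp"), key=lambda p: int(p[4:-4])):
        key = path[4:-4]
        with open(path) as f:
            values = f.read().split()
        out.write(f"{key}: {', '.join(values)}\n")
        os.remove(path)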
An interesting thought, also found on Stack Overflow:
If you want to persist a large dictionary, you are basically looking at a database.
As recommended there, use Python's sqlite3 module to write to a table where the primary key is auto incremented, with a field called "key" (or "left") and a field called "value" (or "right").
Then SELECT the MIN(key) and MAX(key) from the table, and with that information you can SELECT all rows that have the same "key" (or "left") value, in sorted order, and print that information to an output file (if the database itself is not a suitable output for you).
I have described this approach on the assumption that you call this problem "big data" because the number of keys does not fit well into memory (otherwise, a simple Python dictionary would be enough). However, IMHO this question is not correctly tagged as "big data": to require distributed computation on Hadoop or similar, your input data should be much larger than what you can hold on a single hard drive, or your computations should be much more costly than simple hash-table lookups and insertions.
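A minimal sketch of the sqlite3 route (a simplified variant of the suggestion above, with made-up file names: instead of walking key by key between MIN(key) and MAX(key), it lets GROUP BY / GROUP_CONCAT do the per-key aggregation inside the database):

import sqlite3

con = sqlite3.connect("edges.db")
con.execute("CREATE TABLE IF NOT EXISTS edges(k INTEGER, v INTEGER)")
with open("edges.txt") as src:                    # lines of "key value"
    con.executemany("INSERT INTO edges VALUES (?, ?)",
                    (line.split() for line in src))
con.commit()
con.execute("CREATE INDEX IF NOT EXISTS idx_k ON edges(k)")

with open("grouped.txt", "w") as out:
    rows = con.execute("SELECT k, GROUP_CONCAT(v, ', ') FROM edges GROUP BY k ORDER BY k")
    for k, vals in rows:
        out.write(f"{k}: {vals}\n")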

finding a duplicate in a hdf5 pytable with 500e6 rows

Problem
I have a large (> 500e6 rows) dataset that I've put into a pytables database.
Let's say the first column is an ID and the second column is a counter for each ID. Each ID-counter combination has to be unique. There is one non-unique row among the 500e6 rows that I'm trying to find.
As a starter I've done something like this:
index1 = th.cols.id.create_index()
index2 = th.cols.counts.create_index()

for row in th:
    query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
    result = th.readWhere(query)
    if len(result) > 1:
        print(row)
It's a brute force method I'll admit. Any suggestions on improvements?
update
current brute force runtime is 8421 minutes.
solution
Thanks for the input everyone. I managed to get the runtime down to 2364.7 seconds using the following method:
ex = tb.Expr('(x * 65536) + y', uservars={"x": th.cols.id, "y": th.cols.counts})
ex.setOutput(th.cols.hash)
ex.eval()
indexrows = th.cols.hash.create_csindex(filters=filters)

ref = None
dups = []
for row in th.itersorted(sortby=th.cols.hash):
    if row['hash'] == ref:
        dups.append(row['hash'])
    ref = row['hash']

print("ids:    ", np.right_shift(np.array(dups, dtype=np.int64), 16))
print("counts: ", np.array(dups, dtype=np.int64) & (65536 - 1))
I can generate a perfect hash because my maximum values are less than 2^16. I am effectively bit packing the two columns into a 32 bit int.
Once the csindex is generated it is fairly trivial to iterate over the sorted values and do a neighbor test for duplicates.
This method can probably be tweaked a bit, but I'm testing a few alternatives that may provide a more natural solution.
Two obvious techniques come to mind: hashing and sorting.
A) define a hash function to combine ID and Counter into a single, compact value.
B) count how often each hash code occurs
C) select from your data everything that has hash collisions (this should be a much smaller data set)
D) sort this data set to find duplicates.
The hash function in A) needs to be chosen such that the structure fits into main memory while still providing enough selectivity. Maybe use two bitsets of 2^30 bits or so for this. You can afford 5-10% collisions; this should still reduce the data set enough to allow fast in-memory sorting afterwards.
This is essentially a Bloom filter.
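A rough in-memory sketch of steps A-C against the OP's table handle th (not from the original answer). Because the OP's ids and counts are both below 2^16, the packed key fits in 32 bits, so a 2^32-bit array (512 MB) acts as a perfect hash and the collision list contains exactly the duplicate rows; with wider keys you would reduce the key modulo the bit-array size and accept the 5-10% false positives mentioned above.

import numpy as np

NBITS = 1 << 32                       # one bit per possible packed (id, counts) key
bits = np.zeros(NBITS // 8, dtype=np.uint8)
duplicates = []

for row in th.iterrows():
    key = (int(row['id']) << 16) | int(row['counts'])
    byte, bit = divmod(key, 8)
    if bits[byte] & (1 << bit):       # seen before: a duplicate (no false positives in this setup)
        duplicates.append((int(row['id']), int(row['counts'])))
    else:
        bits[byte] |= (1 << bit)

print(duplicates)

Iterating row by row in pure Python is slow for 500e6 rows; reading the two columns in large chunks (e.g. with th.read(start, stop)) and doing the bit tests with vectorised numpy operations would speed this up considerably.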
The brute force approach that you've taken appears to require that you execute 500e6 queries, one for each row of the table. Although I think that the hashing and sorting approaches suggested in another answer are essentially correct, it's worth noting that pytables is already supposedly built for speed, and should already be expected to have these kinds of techniques effectively included "under the hood", so to speak.
I contend that the simple code you have written most likely does not yet take best advantage of the capabilities that pytables already makes available to you.
In the documentation for create_index(), it says that the default settings are optlevel=6 and kind='medium'. It mentions that you can increase the speed of each of your 500e6 queries by decreasing the entropy of the index, and you can decrease the entropy of your index to its minimum possible value (zero) either by choosing non-default values of optlevel=9 and kind='full', or equivalently, by generating the index with a call to create_csindex() instead. According to the documentation, you have to pay a little more upfront by taking a longer time to create a better optimized index to begin with, but then it pays you back later by saving you time on the series of queries that you have to repeat 500e6 times.
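Concretely, a short sketch of that suggestion against the OP's code (th is the OP's table handle and the query values are placeholders; if a plain index already exists, it has to be removed before a CSI index can be created on the same column):

th.cols.id.remove_index()             # only needed if a plain index is already in place
th.cols.counts.remove_index()
th.cols.id.create_csindex()           # equivalent to optlevel=9, kind='full'
th.cols.counts.create_csindex()
result = th.read_where('(id == %d) & (counts == %d)' % (1234, 5))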
If optimizing your pytables column indices fails to speed up your code sufficiently, and you want to just simply perform a massive sort on all of the rows, and then just search for duplicates by looking for matches in adjacent sorted rows, it's possible to perform a merge sort in O(N log(N)) time using relatively modest amounts of memory by sorting the data in chunks and then saving the chunks in temporary files on disk. Examples here and here demonstrate in principle how to do it in Python specifically. But you should really try optimizing your pytables index first, as that's likely to provide a much simpler and more natural solution in your particular case.
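Since the linked examples are not reproduced here, a sketch of that chunked external sort (using the same packed key as the OP's solution; chunk size and file handling are illustrative): sort fixed-size chunks, spill each to a temporary .npy file, then stream the memory-mapped chunks back through heapq.merge and compare neighbours.

import heapq
import itertools
import tempfile

import numpy as np

def sorted_chunk_files(keys_iter, chunk_size=50_000_000):
    """Sort the keys in chunks and spill each sorted chunk to a .npy temp file."""
    names = []
    while True:
        chunk = np.fromiter(itertools.islice(keys_iter, chunk_size), dtype=np.int64)
        if chunk.size == 0:
            break
        chunk.sort()
        with tempfile.NamedTemporaryFile(suffix=".npy", delete=False) as f:
            np.save(f, chunk)
            names.append(f.name)
    return names

keys = ((row['id'] << 16) | row['counts'] for row in th.iterrows())
chunks = [np.load(name, mmap_mode="r") for name in sorted_chunk_files(keys)]

prev = None
for key in heapq.merge(*chunks):          # streams the memory-mapped chunks
    if key == prev:
        print("duplicate id/counts:", int(key) >> 16, int(key) & 0xFFFF)
    prev = key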
