GurobiError: Name too long (maximum name length is 255 characters) - python

I define a parameter t[i,s] as follows:
for i in Trucks:
    for s in Slots:
        t[i,s] = m.addVar(vtype=GRB.CONTINUOUS, name="t[%s,%s]" % (i, s))
I read the values of t[i,s] from an Excel file. Trucks is a list of numbers from 0 to 263, and Slots is a list from 1 to 24. When I run the code, the following error occurs:
GurobiError: Name too long (maximum name length is 255 characters)
How can I fix that?

In case the items in Trucks and Slots are just strings, you can limit the length that is passed into the variable name like this:
maxlen = 250
for i in Trucks:
    for s in Slots:
        t[i,s] = m.addVar(vtype=GRB.CONTINUOUS, name="t[%s,%s]" % (i[:maxlen], s[:maxlen]))
Strings in Python support slicing just like any other sequence.
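If the indices are not strings (in the question they are integers, which cannot be sliced), a more general safeguard is to build the name first and truncate the finished string. A minimal sketch, reusing m, t, Trucks and Slots from the question:
MAX_NAME_LEN = 255  # Gurobi's documented limit on name length
for i in Trucks:
    for s in Slots:
        name = "t[%s,%s]" % (i, s)
        t[i,s] = m.addVar(vtype=GRB.CONTINUOUS, name=name[:MAX_NAME_LEN])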

It is unfortunate that there is no great way to deal with this, or that the limit exists at all. The gurobipy variable constructors work well when using tuplelists to define the sets/indices over which a variable exists.
But then the variable names, which are strings, become a function of the tuplelist, are subject to the 255-character limit, and there does not seem to be any good way to control this (unless you start manipulating your tuples, in which case you may be losing information you wanted to keep).
Why is there no way to adjust the 255-character limit if needed? If there is, I have not found it.
Given that there is a limit, why doesn't gurobipy just truncate the name at 255 characters? As far as I can tell, this would not really affect anything besides the way the variable appears when printed to a model file (e.g. .lp format).
It is frustrating that the answer is just "unfortunately it works this way". It can cause a model to fail when in fact there is no great reason for it to fail.

Related

Remove A Specific Instance of a Partially Duplicated Entry In a List In Python 3

I am relatively new to Python. However, my needs generally only involve simple string manipulation of rigidly formatted data files. I have a specific situation that I have scoured the web trying to solve and have come up blank.
This is the situation. I have a simple list of two-part entries, formatted like this:
name = ['PAUL;25', 'MARY;60', 'PAUL;40', 'NEIL;50', 'MARY;55', 'HELEN;25', ...]
And, I need to keep only one instance of any repeated name (ignoring the number to the right of the ' ; '), keeping only the entry with the highest number, along with that highest value still attached. So the answer would look like this:
ans = ['MARY;60', 'PAUL;40', 'HELEN;25', 'NEIL;50', ...]
The order of the elements in the list is irrelevant, but the format of the ans list entries must remain the same.
I can probably figure out a way to brute force it. I have looked at 2D lists, sets, tuples, etc. But, I can't seem to find the answer. The name list has about a million entries, so I need something that is efficient. I am sure it will be painfully easy for some of you.
Thanks for any input you can provide.
Cheers.
alkemyst
Probably the best data structure for this would be a dictionary, with the entries split up (and converted to integer) and later re-joined.
Something like this:
max_score = {}
for n in name:
    person, score_str = n.split(';')
    score = int(score_str)
    if person not in max_score or max_score[person] < score:
        max_score[person] = score

ans = [
    '%s;%s' % (person, score)
    for person, score in max_score.items()
]
This is a fairly common structure for many functions and programs: first convert the input to an internal representation (in this case, split and convert to integer), then do the logic or calculation (in this case, uniqueness and maximum), then convert to the required output representation (in this case, string separated with ;).
In terms of efficiency, this code looks at each input item once, then at each output item once; there's unlikely to be any approach that can do better than that (certainly not formally, and likely not in practice). All of the per-item operations are constant-time and fast. It accumulates the intermediate answer in memory (in max_score), but again that is unavoidable; if memory is an issue, the input and output could be changed to iterators/generators, but the whole intermediate answer has to be accumulated in max_score before any items can be output.
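A minimal sketch of that streaming variant, assuming the entries live one per line in a hypothetical names.txt and the result should go to a hypothetical deduped.txt; only the max_score dict is held in memory:
max_score = {}
with open('names.txt') as f:           # hypothetical input file, one "NAME;SCORE" per line
    for line in f:
        person, score_str = line.strip().split(';')
        score = int(score_str)
        if person not in max_score or max_score[person] < score:
            max_score[person] = score

with open('deduped.txt', 'w') as out:  # hypothetical output file
    for person, score in max_score.items():
        out.write('%s;%s\n' % (person, score))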

Divide and Conquer Lists in Python (to read sav files using pyreadstat)

I am trying to read sav files using pyreadstat in Python, but for some rare scenarios I get a UnicodeDecodeError because a string variable contains special characters.
To handle this, instead of loading the entire variable set I plan to load only the variables that do not produce this error.
Below is the pseudo-code I have. It is not very efficient, since it checks for the error on each item of the list using try and except.
import pyreadstat

# Reads only the metadata to get information about the variables
df, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
columns = meta.column_names  # All variables are stored in this list
result = []
for var in columns:
    print(var)
    try:
        df, meta = pyreadstat.read_sav('Test.sav', usecols=[str(var)])
        # If there is no error, we can store this variable in result
        result.append(var)
    except Exception:
        pass

# This will finally load the sav file for the non-error variables
df, meta = pyreadstat.read_sav('Test.sav', usecols=result)
For a sav file with 1000+ variables it takes a long time to process this.
I was wondering if there is a way to use a divide and conquer approach to do it faster. Below is my suggested approach, but I am not very good at implementing recursion algorithms. Could someone please help me with pseudo code? It would be very helpful.
1. Take the list and try to read the sav file.
2. If there is no error, the output can be stored in result and we then read the sav file.
3. If there is an error, split the list into 2 parts and run these again...
4. Step 3 needs to run again until we have a list that does not give any error.
Using the second approach, 90% of my sav files will get loaded on the first pass itself, hence I think recursion is a good method.
You can try to reproduce the issue with the sav file here
For this specific case I would suggest a different approach: you can pass an "encoding" argument to pyreadstat.read_sav to set the encoding manually. If you don't know which one it is, you can iterate over the list of encodings here to find out which one makes sense: https://gist.github.com/hakre/4188459. For example:
import pyreadstat as p

# here codes is a list with all the encodings from the link mentioned above
for c in codes:
    try:
        df, meta = p.read_sav("Test.sav", encoding=c)
        print(c)
        print(df.head())
    except Exception:
        pass
I did that, and there were a few encodings that could potentially make sense, assuming the string is in a non-Latin alphabet. However, the most promising one is not in that list: encoding="UTF8" (the list contains UTF-8, with a dash, and that one fails). Using UTF8 (no dash) I get this:
నేను గతంలో వాడిన బ
which according to Google Translate means "I used to come b" in Telugu. Not sure if that fully makes sense, but it's a way forward.
The advantage of this approach is that if you find the right encoding you will not be losing data, and reading the data will be fast. The disadvantage is that you may not find the right encoding.
If you cannot find the right encoding, you would at least be reading the problematic columns very quickly, and you can discard them later in pandas by inspecting which character columns do not contain Latin characters. This will be much faster than the algorithm you were suggesting.
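If you do end up falling back to the bisection approach described in the question, here is a minimal sketch of the recursion. The helper name readable_columns is made up for illustration, and it assumes pyreadstat.read_sav raises whenever the requested subset contains a bad column:
import pyreadstat

def readable_columns(path, cols):
    """Return the subset of cols that read_sav can load from the file at path."""
    if not cols:
        return []
    try:
        pyreadstat.read_sav(path, usecols=list(cols))
        return list(cols)                 # the whole chunk reads fine
    except Exception:
        if len(cols) == 1:
            return []                     # a single bad column: drop it
        mid = len(cols) // 2              # otherwise split in two and recurse
        return (readable_columns(path, cols[:mid]) +
                readable_columns(path, cols[mid:]))

# usage:
# _, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
# good = readable_columns('Test.sav', meta.column_names)
# df, meta = pyreadstat.read_sav('Test.sav', usecols=good)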

Infer the length of a sequence using the CIGAR

To give you a bit of context: I am trying to convert a sam file to bam
samtools view -bT reference.fasta sequences.sam > sequences.bam
which exits with the following error
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 102
[main_samview] truncated file
and the offending line looks like this:
SRR808297.2571281 99 gi|309056|gb|L20934.1|MSQMTCG 747 80 101M = 790 142 TTGGTATAAAATTTAATAATCCCTTATTAATTAATAAACTTCGGCTTCCTATTCGTTCATAAGAAATATTAGCTAAACAAAATAAACCAGAAGAACAT ##CFDDFD?HFDHIGEGGIEEJIIJJIIJIGIDGIGDCHJJCHIGIJIJIIJJGIGHIGICHIICGAHDGEGGGGACGHHGEEEFDC#=?CACC>CCC NM:i:2 MD:Z:98A1A
My sequence is 98 characters long, but a probable bug when creating the sam file reported 101 in the CIGAR. I can afford to lose a couple of reads, and I don't have access at the moment to the source code that produced the sam files, so there is no opportunity to hunt down the bug and re-run the alignment. In other words, I need a pragmatic solution to move on (for now). Therefore, I devised a Python script that counts the length of my string of nucleotides, compares it with what is registered in the CIGAR, and saves the "sane" lines in a new file.
#!/usr/bin/python
import itertools
from cigar import Cigar

with open('myfile.sam', 'r') as f:
    for line in itertools.islice(f, 3, None):  # loop through the file and skip the first three lines
        cigar_string = line.split("\t")[5]
        cigarlength = len(Cigar(cigar_string))  # use the cigar module to obtain the length reported in the CIGAR string
        seqlength = len(line.split("\t")[9])
        if cigarlength == seqlength:
            ...  # preserve the line in a new file
As you can see, to translate the CIGAR into an integer showing the length, I am using the cigar module. To be honest, I am a bit wary of its behaviour: it seems to miscalculate the length in very obvious cases. Is there another module, or a more explicit strategy, to translate the CIGAR into the length of the sequence?
Sidenote: it is interesting, to say the least, that this problem has been widely reported but no pragmatic solution can be found on the internet. See the links below:
https://github.com/COMBINE-lab/RapMap/issues/9
http://seqanswers.com/forums/showthread.php?t=67253
http://seqanswers.com/forums/showthread.php?t=21120
https://groups.google.com/forum/#!msg/snap-user/FoDsGeNBDE0/nRFq-GhlAQAJ
The SAM spec offers us this table of CIGAR operations which indicates which ones "consume" the query or the reference, complete with explicit instructions on how to calculate sequence length from a CIGAR string:
Op  BAM  Description                                             Consumes query  Consumes reference
M   0    alignment match (can be a sequence match or mismatch)   yes             yes
I   1    insertion to the reference                              yes             no
D   2    deletion from the reference                             no              yes
N   3    skipped region from the reference                       no              yes
S   4    soft clipping (clipped sequences present in SEQ)        yes             no
H   5    hard clipping (clipped sequences NOT present in SEQ)    no              no
P   6    padding (silent deletion from padded reference)         no              no
=   7    sequence match                                          yes             yes
X   8    sequence mismatch                                       yes             yes
“Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
...
Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.
This lets us trivially calculate the length of a sequence from its CIGAR by adding up the lengths of all the "consumes query" ops in the CIGAR. This is exactly what happens in the cigar module (see https://github.com/brentp/cigar/blob/754cfed348364d390ec1aa40c951362ca1041f7a/cigar.py#L88-L93), so I don't know why the OP here reckoned that module's implementation was wrong.
If we extract out the relevant code from the (already very short) cigar module, we're left with something like this as a short Python implementation of the summing operation described in the quote above:
from itertools import groupby

def query_len(cigar_string):
    """
    Given a CIGAR string, return the number of bases consumed from the
    query sequence.
    """
    read_consuming_ops = ("M", "I", "S", "=", "X")
    result = 0
    cig_iter = groupby(cigar_string, lambda char: char.isdigit())
    for _, length_digits in cig_iter:
        length = int(''.join(length_digits))
        op = next(next(cig_iter)[1])
        if op in read_consuming_ops:
            result += length
    return result
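For example, calling query_len on the CIGAR from the question and on a more mixed one, as a quick sanity check of the function above:
print(query_len("101M"))        # 101, the length claimed by the offending SAM line
print(query_len("3M1I3M1D5M"))  # 12: only M/I/S/=/X consume the query, so 3 + 1 + 3 + 5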
I suspect the reason there isn't a tool to fix this problem is because there is no general solution, aside from performing the alignment again using software that does not exhibit this problem. In your example, the query sequence aligns perfectly to the reference and so in that case the CIGAR string is not very interesting (just a single Match operation prefixed by the overall query length). In that case the fix simply requires changing 101M to 98M.
However, for more complex CIGAR strings (e.g. those that include Insertions, Deletions, or any other operations), you would have no way of knowing which part of the CIGAR string is too long. If you subtract from the wrong part of the CIGAR string, you'll be left with a misaligned read, which is probably worse for your downstream analysis than just leaving the whole read out.
That said, if it happens to be trivial to get it right (perhaps your broken alignment procedure always adds extra bases to the first or last CIGAR operation), then what you need to know is the correct way to calculate the query length according to the CIGAR string, so that you know what to subtract from it.
samtools calculates this using the htslib function bam_cigar2qlen.
The other functions that bam_cigar2qlen calls are defined in sam.h, including a helpful comment showing the truth table for which operations consume query sequence vs reference sequence.
In short, to calculate the query length of a CIGAR string the way that samtools (really htslib) does it, you should add the given length for CIGAR operations M, I, S, =, or X and ignore the length of CIGAR operations for any of the other operations.
The current version of the Python cigar module seems to be using the same set of operations, and its algorithm for calculating the query length (which is what len(Cigar(cigar)) would return) looks right to me. What makes you think it isn't giving the correct results?
It looks like you should be able to use the cigar module to hard clip from either the left or the right end using the mask_left or mask_right method with mask="H".

using strings as python dictionaries (memory management)

I need to find identical sequences of characters in a collection of texts. Think of it as finding identical/plagiarized sentences.
The naive way is something like this:
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    ht[s] += 1
I usually use Python, but I'm beginning to think that Python is not the best choice for this task. Am I wrong about that? Is there a reasonable way to do it with Python?
If I understand correctly, Python dictionaries use open addressing, which means that the key itself is also saved in the array. If this is indeed the case, it means that a Python dictionary allows efficient lookup but is VERY bad in memory usage. Thus, if I have millions of sentences, they are all saved in the dictionary, which is horrible since it exceeds the available memory, making the Python dictionary an impractical solution.
Can someone confirm the previous paragraph?
One solution that comes to mind is to explicitly use a hash function (either the builtin hash function, one I implement myself, or the hashlib module) and, instead of inserting ht[s]+=1, insert:
ht[hash(s)]+=1
This way the key stored in the array is an int (which will be hashed again) instead of the full sentence.
Will that work? Should I expect collisions? Any other Pythonic solutions?
Thanks!
Yes, dicts store the keys in memory. If your data fits in memory, this is the easiest approach.
Hashing should work. Try MD5: it produces a 16-byte digest, so collisions are unlikely.
Try BerkeleyDB for a disk-based approach.
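A minimal sketch of the MD5 idea, assuming sentences is an iterable of Unicode strings as in the question:
import hashlib
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    key = hashlib.md5(s.encode('utf-8')).digest()  # 16-byte digest stored instead of the full sentence
    ht[key] += 1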
Python dicts are indeed memory monsters. You can hardly operate on millions of keys when storing anything larger than integers. Consider the following code:
import random
d = {}
for x in xrange(5000000):  # that's 5 million
    d[x] = random.getrandbits(BITS)
For BITS = 64 it takes 510MB of my RAM, for BITS = 128 550MB, for BITS = 256 650MB, for BITS = 512 830MB. Increasing the number of iterations to 10 million roughly doubles the memory usage. However, consider this snippet:
for x in xrange(5000000):  # it's 5 million
    d[x] = (random.getrandbits(64), random.getrandbits(64))
It takes 1.1GB of my memory. Conclusion? If you want to keep two 64-bit integers, pack them into one 128-bit integer, like this:
for x in xrange(5000000):  # it's still 5 million
    d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)
This will roughly halve the memory usage.
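To recover the two original 64-bit values from such a packed entry, a quick sketch (packed here stands for any stored value from the loop above):
packed = d[0]                  # any stored 128-bit value
lo = packed & ((1 << 64) - 1)  # the first 64-bit integer
hi = packed >> 64              # the second 64-bit integer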
It depends on your actual memory limit and the number of sentences, but you should be safe using dictionaries with 10-20 million keys when using just integers. Your idea of using hashes is good, but you probably want to keep a pointer to the sentence, so that in case of a collision you can investigate (compare the sentences character by character and probably print them out). You could create the pointer as an integer, for example by encoding the file number and the offset in it. If you don't expect a massive number of collisions, you can simply set up another dictionary for storing only the collisions, for example:
hashes = {}
collisions = {}
for s in sentences:
    ptr_value = pointer(s)  # make it an integer, e.g. encode file number and offset
    hash_value = hash(s)    # make it an integer
    if hash_value in hashes:
        collisions.setdefault(hashes[hash_value], []).append(ptr_value)
    else:
        hashes[hash_value] = ptr_value
So at the end you will have collisions dictionary where key is a pointer to sentence and value is an array of pointers the key is colliding with. It sounds pretty hacky, but working with integers is just fine (and fun!).
Perhaps pass the keys through md5: http://docs.python.org/library/md5.html
I'm not sure exactly how large the data set you are comparing is, but I would recommend looking into Bloom filters (be careful of false positives): http://en.wikipedia.org/wiki/Bloom_filter. Another avenue to consider would be something simple like cosine similarity or edit distance between documents, but if you are trying to compare one document against many, I would suggest looking into Bloom filters; you can encode them however you find most efficient for your problem.
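A minimal, standard-library-only Bloom filter sketch (the bit-array size and number of hash functions here are illustrative, not tuned for a particular false-positive rate):
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # derive num_hashes bit positions from salted MD5 digests of the item
        for i in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (i, item)).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # may return a false positive, never a false negative
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# bf = BloomFilter()
# bf.add("some sentence")
# "some sentence" in bf   # True; unseen sentences are almost always False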

C data structures

Is there a C data structure equivalent to the following Python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it a pre-defined string and have it come out with an integer.
The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping of an integer (usually derived from a string) to another value, just like the dict from Python, which your sample code instantiates.
It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.
This system makes it extremely quick to access and change your information, even if you have tons of it. It also means that the data is unordered, because a hash function returns a uniformly random result and scatters your data unpredictably all over the map (in a perfect world).
Also note that if you're doing a quick one-off lookup over two or three static keys, look at gperf, which generates a perfect hash function and simple code for that hash.
The above data structure is a dict type.
In C/C++ parlance, a hashmap would be the equivalent; Google for hashmap implementations.
There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is to probably just have an array of structures along the lines of:
typedef struct {
    char *key;
    int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
    create g.key[100] as string
    create g.val[100] as integer
    set g.size to 0

def add (key, val):
    if lookup(key) != not_found:
        return already_exists
    if g.size == 100:
        return no_space
    g.key[g.size] = key
    g.val[g.size] = val
    g.size = g.size + 1
    return okay

def del (key):
    pos = lookup(key)
    if pos == not_found:
        return no_such_key
    if pos < g.size - 1:
        g.key[pos] = g.key[g.size-1]
        g.val[pos] = g.val[g.size-1]
    g.size = g.size - 1

def lookup (key):
    for pos goes from 0 to g.size-1:
        if g.key[pos] == key:
            return pos
    return not_found
Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.
Now this isn't the most efficient method in the world, but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient sort wasn't available, because it's massively quick to code up and the performance is irrelevant.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This gives O(1) lookup, but the aim is to have a hash function that allocates only one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different from an array or linked list.
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc) in to month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.
C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/
A 'trie' or a 'hashmap' should do. The simplest implementation is an array of struct { char *s; int i; } pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.
Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.
