Biopython: adding a section in the middle of a sequence and keeping features aligned - python

I want to insert a section of sequence into the middle of an existing sequence (in a GenBank file) and have all features still point at the same segments of the old sequence.
For example:
previous sequence: ATAGCCATTGAATGTGTGTGTGTCCTAGAGGGCCTAAAA
feature: misc_feature complement(20..27)
/gene="Py_ori+A"
I add TTTTTT at position 10.
new sequence: ATAGCCATTGTTTTTTAATGTGTGTGTGTCCTAGAGGGCCTAAAA
feature: misc_feature complement(26..33)
/gene="Py_ori+A"
The indices of the feature changed because the feature must still refer to the segment TGTCCTA. I want to save the new sequence in a new gb file.
Is there any Biopython function or method that can insert a segment into the middle of the old sequence and add the length of the inserted segment to the indices of all features that come after it?

TL;DR
Call + on your sliced segments (e.g. a + b). As long as you don't slice into a feature, you should be OK.
The long version:
Biopython supports feature joining. It is done simply by calling a + b on the respective SeqRecord objects (the features are part of the SeqRecord object, not the Seq class).
There is a quirk to be aware of when slicing a sequence with features: if the slice cuts into a feature, that feature will not be present in the resulting SeqRecord.
I've tried to illustrate the behaviour in the following code.
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# THIS IS OK
a = SeqRecord(
    Seq('ACGTA'),
    id='a',
    features=[
        SeqFeature(FeatureLocation(2, 4, 1), id='f1')
    ]
)
b = SeqRecord(
    Seq('ACGTA'),
    id='b',
    features=[
        SeqFeature(FeatureLocation(2, 4, 1), id='f2')
    ]
)
c = a + b

print('seq a')
print(a.seq)
print(a.features)

print('\nseq b')
print(b.seq)
print(b.features)

print("\n two distinct features joined in seq c")
print(c.seq)
print(c.features)
print("notice how the second feature has now indices (7,9), instead of 2,4\n")

# BEWARE
# slicing into the feature will remove the feature !
print("\nsliced feature removed")
d = a[:3]
print(d.seq)
print(d.features)
# Seq('ACG')
# []

# However slicing around the feature will preserve it
print("\nslicing out of the feature will preserve it")
e = c[1:6]
print(e.seq)
print(e.features)
OUTPUT
seq a
ACGTA
[SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(4), strand=1), id='f1')]
seq b
ACGTA
[SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(4), strand=1), id='f2')]
two distinct features joined in seq c
ACGTAACGTA
[SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(4), strand=1), id='f1'), SeqFeature(FeatureLocation(ExactPosition(7), ExactPosition(9), strand=1), id='f2')]
notice how the second feature has now indices (7,9), instead of 2,4
sliced feature removed
ACG
[]
slicing out of the feature will preserve it
CGTAA
[SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(3), strand=1), id='f1')]
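Applied to the original question, a minimal sketch might look like the following (the file names are made up, and it assumes the insertion point does not fall inside any feature):
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqIO.read("old.gb", "genbank")   # hypothetical input file
insert = SeqRecord(Seq("TTTTTT"))          # the segment to insert

# Slice around position 10 and rejoin: features entirely downstream of the cut
# are shifted by len(insert); a feature spanning the cut would be dropped.
new_record = record[:10] + insert + record[10:]

new_record.id = record.id
new_record.description = record.description
new_record.annotations = dict(record.annotations)  # sliced records drop annotations such as molecule_type, which GenBank output needs

SeqIO.write(new_record, "new.gb", "genbank")
Depending on your Biopython version you may need to carry over other record-level metadata by hand, so double-check the resulting GenBank file.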

Related

Updating dictionary and TFIDF for new documents using Gensim

My use case is that I have a corpus of documents, and when new docs come in I update Dictionary and vectorize. The result should be a sparse matrix of TFIDF vectors, which I'm using corpus2csc for.
I think I have a solution, but I have seen other answers that suggest my solution is impossible, and I've seen some unexpected behavior. I'm seeking a gut-check on the approach, with some specific questions below.
Overall approach:
Use Dictionary.doc2bow with allow_update=True
Construct TFIDF model using that updated Dictionary.
Questions and example code below.
from gensim import corpora
from gensim import models
from gensim import matutils
docs = [['definition', 'addition', 'term', 'defined'],
        ['term', 'sweet', 'generally', 'subject'],
        ['gene', 'extent', 'provided', 'gene'],
        ['additional', 'cost', 'gene', 'adequacy'],
        ['initial', 'condition', 'sweet', 'effectiveness']]
gensim_dict = corpora.Dictionary([], prune_at=None) # start with empty dict because I don't know what the corpus looks like at this point
BoW_corpus0 = [gensim_dict.doc2bow(d, allow_update=True) for d in docs] # Add to the dictionary by setting allow_update=True
# documentation says: corpus (iterable of iterable of (int, number)) – Input corpus in BoW format
matutils.corpus2csc(corpus=BoW_corpus0) # Expected. A 16x5 matrix with 19 stored elements
matutils.corpus2csc(corpus=BoW_corpus0, num_terms=len(gensim_dict.keys()), num_docs=len(BoW_corpus0), num_nnz=sum([len(doc) for doc in BoW_corpus0])) # Also expected for the 'efficient' path
model = models.TfidfModel(BoW_corpus0)
vecs = model[BoW_corpus0]
matutils.corpus2csc(corpus=vecs) # Unexpected - asks for a BoW format, but this is TransformedCorpus. But ignoring that, it's expected -- 16x5 with 19 stored elements.
matutils.corpus2csc(corpus=vecs, num_terms=len(gensim_dict.keys()), num_docs=vecs.obj.num_docs, num_nnz=vecs.obj.num_nnz) # Expected
corpus2csc documentation asks for a BoW format, not a TransformedCorpus. Does it accept a TransformedCorpus just by chance?
Now I add new docs:
new_docs = [['provided', 'communication', 'provided', 'writing']]
BoW_corpus1 = [gensim_dict.doc2bow(d, allow_update=True) for d in new_docs]
len(gensim_dict.cfs) # 18
gensim_dict.num_nnz # 22
len(BoW_corpus1) # 1
matutils.corpus2csc(corpus=BoW_corpus1) # Expected. An 18x1 matrix, with 3 stored elements
matutils.corpus2csc(corpus=BoW_corpus1, num_terms=len(gensim_dict.keys()), num_docs=len(BoW_corpus1), num_nnz=sum([len(doc) for doc in BoW_corpus1])) # Expected. An 18x1 matrix, with 3 stored elements
# Let's use the original model. Ideally we would update the model with the new document but I'm not sure of the best way to do that.
new_vecs = model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs) # Unexpected. Using the original model, a 10x1 with 1 stored elements. Why not 18x1?
Why does this approach result in a 10x1 matrix?
# let's try using the Dictionary that's been updated on the fly
dict_based_model = models.TfidfModel(dictionary=gensim_dict)
new_vecs2 = dict_based_model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs2) # Expected. 18x1 with 3 stored elements.
matutils.corpus2csc(corpus=new_vecs2, num_terms=len(gensim_dict.keys()), num_docs=len(new_vecs2), num_nnz=sum([len(v) for v in new_vecs2])) # Also expected. 18x1 with 3 stored elements.
# But why are these not the same?
assert new_vecs2.obj.num_docs == len(new_vecs)
assert new_vecs2.obj.num_nnz == sum([len(v) for v in new_vecs2])
# Finally, let's make a model based on the new corpus, I know this isn't right but curious why the output is what it is
new_corpus_based_model = models.TfidfModel(BoW_corpus1)
new_vecs3 = new_corpus_based_model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs3) # 0x1 with 0 elements. Kinda expected. But I would have thought it would have produced a 2x1 or 3x1 matrix
Can you confirm that dict_based_model is the right approach?
What is the new_vecs2.obj all about?
Why do I get a 0x1 instead of a 2x1?
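Not an answer to the three questions above, but for reference, a minimal sketch of the dictionary-based pattern (reusing the docs and new_docs lists above; vectorize is just an illustrative helper name) that keeps matrix shapes consistent across batches by fixing num_terms:
from gensim import corpora, models, matutils

dictionary = corpora.Dictionary(docs)              # or grow it incrementally with doc2bow(..., allow_update=True)
tfidf = models.TfidfModel(dictionary=dictionary)   # idf statistics come from the dictionary itself

def vectorize(batch, dictionary, tfidf):
    bow = [dictionary.doc2bow(d) for d in batch]   # no allow_update here, so term ids stay frozen
    # Fixing num_terms pins the row count to the dictionary size, even when a
    # batch does not happen to use every term.
    return matutils.corpus2csc(tfidf[bow], num_terms=len(dictionary))

X0 = vectorize(docs, dictionary, tfidf)        # len(dictionary) rows x 5 columns
X1 = vectorize(new_docs, dictionary, tfidf)    # len(dictionary) rows x 1 column, same row count
Unknown words in a new batch are simply dropped by doc2bow unless the dictionary was updated first.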

Speeding up fuzzy match on large list

I am working on a project that uses fuzzy logic on a list of names that could grow to about 100,000 unique records. In a recent screening we conducted, the function we use takes about 2.20 seconds per name on average. This means that for a list of 10,000 names, the process could take 6 hours, which is really too long.
Is there a way we can speed up our process? Here's the snippet of the script that we use.
# Importing packages
import pandas as pd
import Levenshtein as lev
# Reading cleaned datasets
df_name_reference = pd.read_csv('path_to_file')
df_name_to_screen = pd.read_csv('path_to_file')
# Function used in name screening
def get_similarity_score(s1, s2):
    '''Return match percentage between 2 strings disregarding name swapping

    Parameters
    -----------
    s1 : str : name from df_name_reference (to be used within pandas apply)
    s2 : str : name from df_name_to_screen (ref_name variable)

    Return
    -----------
    float
    '''
    # Get sorted names
    s1_sort = ' '.join(sorted(s1.split(' '))).strip() if type(s1) == str else ''
    s2_sort = ' '.join(sorted(s2.split(' '))).strip() if type(s2) == str else ''
    # Get ratios and return the max value
    # THIS COULD BE THE BOTTLENECK OF OUR SCRIPT: MORE DETAILS BELOW
    return max([
        lev.ratio(s1, s2),
        lev.ratio(s1_sort, s2),
        lev.ratio(s1, s2_sort),
        lev.ratio(s1_sort, s2_sort)
    ])

# Returning file
screening_results = []
for row in range(df_name_to_screen.shape[0]):
    # Get name to screen
    ref_name = df_name_to_screen.loc[row, 'fullname']
    # Get scores
    scores = df_name_reference.fullname.apply(get_similarity_score, args=(ref_name,))
    # Append results
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
I take four scores from lev.ratio to address variations in the arrangement of names, i.e. firstname-lastname and lastname-firstname formats. I know that the fuzzywuzzy package has token_sort_ratio, but I've noticed that it just splits the name parts and sorts them alphabetically, which leads to lower scores. Plus, fuzzywuzzy is slower than Levenshtein. So I had to capture the similarity score of sorted and unsorted names manually.
Can anyone suggest an approach that I could try? Thanks!
EDIT: Here's a sample dataset that you may try. This is in Google Drive.
In case you don't need scores for all entries in the reference data but just the top N, you can use difflib.get_close_matches to filter out the others before calculating any scores:
import difflib

screening_results = []
for row in range(df_name_to_screen.shape[0]):
    ref_name = df_name_to_screen.loc[row, 'fullname']
    skimmed = pd.DataFrame({
        'fullname': difflib.get_close_matches(
            ref_name,
            df_name_reference.fullname,
            N_RESULTS,  # how many close matches to keep
            0           # cutoff of 0, so the N_RESULTS best matches are always returned
        )
    })
    scores = skimmed.fullname.apply(lev.ratio, args=(ref_name,))
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
This takes about 50ms per row using the file you provided.
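A different direction from the answer above: the rapidfuzz package (a separate library, not python-Levenshtein) can score a whole batch in one C-level call across all cores. A rough sketch, assuming the same two dataframes and reusing the name-sorting trick from the question:
from rapidfuzz import fuzz, process

def presort(name):
    # same idea as in get_similarity_score: neutralise firstname/lastname swaps
    return ' '.join(sorted(name.split())) if isinstance(name, str) else ''

queries = df_name_to_screen['fullname'].map(presort).tolist()
choices = df_name_reference['fullname'].map(presort).tolist()

# One vectorised call scoring every query against every choice, on all CPU cores.
# Note that rapidfuzz's fuzz.ratio is scaled 0-100, unlike Levenshtein.ratio's 0-1.
scores = process.cdist(queries, choices, scorer=fuzz.ratio, workers=-1)
# scores[i, j] = similarity of the i-th screened name vs the j-th reference name
This only scores the sorted forms; reproducing the max over sorted and unsorted variants from get_similarity_score would take a few more cdist calls, and the top matches or a cutoff can then be taken from the resulting matrix with numpy.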

How to split a dataset with multiple variables into train and test sets with the same composition using Python?

I have a list of brain metastasis MRIs that I want to use for training and testing purposes.
These images are all similar, but the original tumor sites differ. See the following example:
From Lungs:
"Image01.1"
"Image01.2"
"Image01.3"
"Image01.4"
From Breasts:
"Image02.1"
"Image02.2"
"Image02.3"
"Image02.4"
"Image02.5"
From Skin:
"Image03.1"
"Image03.2"
From Lung Tissue:
"Image04.1"
"Image04.2"
"Image04.3"
From Bone Marrow:
"Image05.1"
"Image05.2"
I want the testing and validation sets to contain the same number of images without losing a similar composition (both lists containing the same number of each subtype).
For this purpose, can I create lists for each subtype, randomly split those 50/50, and then add all these lists together?
If you want to get specific rows from a pandas DataFrame that meet certain criteria, you can filter. In your case, something like:
reader_lung = reader[reader["Image_Title"] == "Lung"]
You need to change "Image_Title" to the name of the column in which you're looking for your keyword (e.g., Lung). This needs to be an exact match.
For something that doesn't require an exact match, you could also do the following:
reader_lung = reader[reader["Image_Title"].str.contains("Lung")]
Could you create a list of lists (one for each type) and then take the first N and put them into training and the last N and put them in test?
Something like this pseudocode:
import csv

with open(r"B:/.../excell.csv", newline='') as f:
    reader = csv.reader(f, dialect="excel", delimiter=';')
    test = []
    training = []
    type_map = {}
    for row in reader:
        if row[33] in type_map:
            # If the type has already been seen, append to the existing list of those images
            type_map[row[33]].append(row)
        else:
            # If this type is seen for the first time, create a new list with that row in it
            type_map[row[33]] = [row]

# Now you should have a map like: {"Lung": ["image1", "image2", ...], "Heart": ["imageA", ...]}
for image_type in type_map:
    type_images = type_map[image_type]
    half_way_index = len(type_images) // 2  # For an odd count, e.g. 13 elems, this gives 6 (integer division)
    test += type_images[0:half_way_index]  # First half of the type_images are test
    training += type_images[half_way_index:(half_way_index * 2)]  # Second half are training
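If the rows are already in a pandas DataFrame, another option (not used in the answers above) is scikit-learn's train_test_split with stratify, which keeps the subtype proportions equal in both halves. A sketch, where images.csv and the primary_site column are hypothetical names for your data:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("images.csv", sep=";")   # hypothetical file with one row per image

train_df, test_df = train_test_split(
    df,
    test_size=0.5,                   # 50/50 split, as in the question
    stratify=df["primary_site"],     # same number of each subtype in both halves
    random_state=0,                  # reproducible shuffling
)
Note that stratify needs at least two images per subtype, which the example data satisfies.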

Reading a binary file using np.fromfile()

I have a binary file that has numerous sections. Each section has its own pattern (i.e. the placement of integers, floats, and strings).
The pattern of each section is known, but the number of times that pattern repeats within the section is not. Each record is sandwiched between two identical integers that indicate the size of the record. The section name is itself a record whose two length markers are 8 and 8. Also, within each section there are multiple records (whose layouts are known).
Header
---------------------
Known header pattern
---------------------
8 Section One 8
---------------------
Section One pattern repeating i times
---------------------
8 Section Two 8
---------------------
Section Two pattern repeating j times
---------------------
8 Section Three 8
---------------------
Section Three pattern repeating k times
---------------------
Here was my approach:
Loop through and read each record using f.read(record_length); if the record is 8 bytes long, convert it to a string: this will be the section name.
Then I call np.fromfile(file, dtype=section_pattern, count=n) for each section.
The issue I am having is twofold:
How do I determine n for each section without doing a first-pass read?
Reading every record just to find a section name seems rather inefficient. Is there a more efficient way to accomplish this?
The section names are always between two integer record length markers: 8 and 8.
Here is some sample code; note that in this case I do not have to specify count, since the OES section is the last section:
import os
import numpy as np

# unpack_int and unpack_string are small helper functions defined elsewhere in
# the script that convert raw bytes to an int and to a string, respectively.

record_num = 0
with open('m13.op2', "rb") as f:
    filesize = os.fstat(f.fileno()).st_size
    f.seek(108, 1)  # skip header
    while True:
        rec_len_1 = unpack_int(f.read(4))
        record_bytes = f.read(rec_len_1)
        rec_len_2 = unpack_int(f.read(4))
        record_num = record_num + 1
        if rec_len_1 == 8:
            tablename = unpack_string(record_bytes).strip()
            if tablename == 'OES':
                OES = [
                    # Top keys
                    ('1', 'i4', 1), ('op2key7', 'i4', 1), ('2', 'i4', 1),
                    ('3', 'i4', 1), ('op2key8', 'i4', 1), ('4', 'i4', 1),
                    ('5', 'i4', 1), ('op2key9', 'i4', 1), ('6', 'i4', 1),
                    # Record 2 -- IDENT
                    ('7', 'i4', 1), ('IDENT', 'i4', 1), ('8', 'i4', 1),
                    ('9', 'i4', 1),
                    ('acode', 'i4', 1),
                    ('tcode', 'i4', 1),
                    ('element_type', 'i4', 1),
                    ('subcase', 'i4', 1),
                    ('LSDVMN', 'i4', 1),      # Load set number
                    ('UNDEF(2)', 'i4', 2),    # Undefined
                    ('LOADSET', 'i4', 1),     # Load set number or zero or random code identification number
                    ('FCODE', 'i4', 1),       # Format code
                    ('NUMWDE(C)', 'i4', 1),   # Number of words per entry in DATA record
                    ('SCODE(C)', 'i4', 1),    # Stress/strain code
                    ('UNDEF(11)', 'i4', 11),  # Undefined
                    ('THERMAL(C)', 'i4', 1),  # =1 for heat transfer and 0 otherwise
                    ('UNDEF(27)', 'i4', 27),  # Undefined
                    ('TITLE(32)', 'S1', 32 * 4),    # Title
                    ('SUBTITL(32)', 'S1', 32 * 4),  # Subtitle
                    ('LABEL(32)', 'S1', 32 * 4),    # Label
                    ('10', 'i4', 1),
                    # Record 3 -- Data
                    ('11', 'i4', 1), ('KEY1', 'i4', 1), ('12', 'i4', 1),
                    ('13', 'i4', 1), ('KEY2', 'i4', 1), ('14', 'i4', 1),
                    ('15', 'i4', 1), ('KEY3', 'i4', 1), ('16', 'i4', 1),
                    ('17', 'i4', 1), ('KEY4', 'i4', 1), ('18', 'i4', 1),
                    ('19', 'i4', 1),
                    ('EKEY', 'i4', 1),  # Element key = 10*EID + Device Code. EID = (Element key)//10
                    ('FD1', 'f4', 1),
                    ('EX1', 'f4', 1),
                    ('EY1', 'f4', 1),
                    ('EXY1', 'f4', 1),
                    ('EA1', 'f4', 1),
                    ('EMJRP1', 'f4', 1),
                    ('EMNRP1', 'f4', 1),
                    ('EMAX1', 'f4', 1),
                    ('FD2', 'f4', 1),
                    ('EX2', 'f4', 1),
                    ('EY2', 'f4', 1),
                    ('EXY2', 'f4', 1),
                    ('EA2', 'f4', 1),
                    ('EMJRP2', 'f4', 1),
                    ('EMNRP2', 'f4', 1),
                    ('EMAX2', 'f4', 1),
                    ('20', 'i4', 1)]
                nparr = np.fromfile(f, dtype=OES)
        if f.tell() == filesize:
            break
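On the first question, one way to avoid a counting pass is to derive count from the bytes remaining before the section ends and the itemsize of the compound dtype. A rough sketch (read_section and end_offset are made-up names; end_offset would come from wherever you spot the next 8-byte section-name record, or EOF for the last section):
import os
import numpy as np

def read_section(f, section_pattern, end_offset=None):
    """Read all repeating records of one section in a single np.fromfile call."""
    dt = np.dtype(section_pattern)                  # e.g. np.dtype(OES)
    if end_offset is None:
        end_offset = os.fstat(f.fileno()).st_size   # last section: read to EOF
    n = (end_offset - f.tell()) // dt.itemsize      # whole records that fit in the span
    return np.fromfile(f, dtype=dt, count=n)
On the second question, once rec_len_1 has been read you can f.seek(rec_len_1, 1) past record bodies you do not care about instead of reading them, so only the 4-byte length markers and the short name records are actually copied.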

Put stock prices into groups when they are within 0.5% of each other

Thanks for the answers. I have not used Stack Overflow before, so I was surprised by the number of answers and the speed of them - it's fantastic.
I have not been through the answers properly yet, but thought I should add some information to the problem specification. See the image below.
I can't post an image in this post because I don't have enough points, but you can see an image
at http://journal.acquitane.com/2010-01-20/image003.jpg
This image may describe more closely what I'm trying to achieve. The horizontal lines across the page are price points on the chart. Where you get a clustering of lines within 0.5% of each other, this is considered to be a good thing, which is why I want to identify those clusters automatically. You can see on the chart that there is a cluster at S2 & MR1, and another at R2 & WPP1.
So every day I produce these price points, and then I can manually identify those that are within 0.5% - but the purpose of this question is how to do it with a Python routine.
I have reproduced the list again (see below) with labels. Just be aware that the list's price points don't match the price points in the image because they are from two different days.
[YR3,175.24,8]
[SR3,147.85,6]
[YR2,144.13,8]
[SR2,130.44,6]
[YR1,127.79,8]
[QR3,127.42,5]
[SR1,120.94,6]
[QR2,120.22,5]
[MR3,118.10,3]
[WR3,116.73,2]
[DR3,116.23,1]
[WR2,115.93,2]
[QR1,115.83,5]
[MR2,115.56,3]
[DR2,115.53,1]
[WR1,114.79,2]
[DR1,114.59,1]
[WPP,113.99,2]
[DPP,113.89,1]
[MR1,113.50,3]
[DS1,112.95,1]
[WS1,112.85,2]
[DS2,112.25,1]
[WS2,112.05,2]
[DS3,111.31,1]
[MPP,110.97,3]
[WS3,110.91,2]
[50MA,110.87,4]
[MS1,108.91,3]
[QPP,108.64,5]
[MS2,106.37,3]
[MS3,104.31,3]
[QS1,104.25,5]
[SPP,103.53,6]
[200MA,99.42,7]
[QS2,97.05,5]
[YPP,96.68,8]
[SS1,94.03,6]
[QS3,92.66,5]
[YS1,80.34,8]
[SS2,76.62,6]
[SS3,67.12,6]
[YS2,49.23,8]
[YS3,32.89,8]
I did make a mistake with the original list in that Group C is wrong and should not be included. Thanks for pointing that out.
Also, the 0.5% is not fixed; this value will change from day to day, but I have just used 0.5% as an example for spec'ing the problem.
Thanks Again.
Mark
PS. I will get cracking on checking the answers now.
Hi:
I need to do some manipulation of stock prices. I have just started using Python (but I think I would have trouble implementing this in any language). I'm looking for some ideas on how to implement this nicely in Python.
Thanks
Mark
Problem:
I have a list of lists (FloorLevels, see below) where each sublist has two items (stockprice, weight). I want to put the stock prices into groups when they are within 0.5% of each other. A group's strength will be determined by its total weight. For example:
Group-A
115.93,2
115.83,5
115.56,3
115.53,1
-------------
TotalWeight:12
-------------
Group-B
113.50,3
112.95,1
112.85,2
-------------
TotalWeight:6
-------------
FloorLevels = [
[175.24,8]
[147.85,6]
[144.13,8]
[130.44,6]
[127.79,8]
[127.42,5]
[120.94,6]
[120.22,5]
[118.10,3]
[116.73,2]
[116.23,1]
[115.93,2]
[115.83,5]
[115.56,3]
[115.53,1]
[114.79,2]
[114.59,1]
[113.99,2]
[113.89,1]
[113.50,3]
[112.95,1]
[112.85,2]
[112.25,1]
[112.05,2]
[111.31,1]
[110.97,3]
[110.91,2]
[110.87,4]
[108.91,3]
[108.64,5]
[106.37,3]
[104.31,3]
[104.25,5]
[103.53,6]
[99.42,7]
[97.05,5]
[96.68,8]
[94.03,6]
[92.66,5]
[80.34,8]
[76.62,6]
[67.12,6]
[49.23,8]
[32.89,8]
]
I suggest a repeated use of k-means clustering -- let's call it KMC for short. KMC is a simple and powerful clustering algorithm... but it needs to "be told" how many clusters, k, you're aiming for. You don't know that in advance (if I understand you correctly) -- you just want the smallest k such that no two items "clustered together" are more than X% apart from each other. So, start with k equal 1 -- everything bunched together, no clustering pass needed;-) -- and check the diameter of the cluster (a cluster's "diameter", from the use of the term in geometry, is the largest distance between any two members of a cluster).
If the diameter is > X%, set k += 1, perform KMC with k as the number of clusters, and repeat the check, iteratively.
In pseudo-code:
def markCluster(items, threshold):
    k = 1
    clusters = [items]
    maxdist = diameter(items)
    while maxdist > threshold:
        k += 1
        clusters = Kmc(items, k)
        maxdist = max(diameter(c) for c in clusters)
    return clusters
assuming of course we have suitable diameter and Kmc Python functions.
Does this sound like the kind of thing you want? If so, then we can move on to show you how to write diameter and Kmc (in pure Python if you have a relatively limited number of items to deal with, otherwise maybe by exploiting powerful third-party add-on frameworks such as numpy) -- but it's not worthwhile to go to such trouble if you actually want something pretty different, whence this check!-)
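For what it's worth, here is a hedged sketch of those two helpers, clustering on price alone and letting scikit-learn's KMeans stand in for the KMC step (the names simply mirror the pseudo-code above):
import numpy as np
from sklearn.cluster import KMeans

def Kmc(items, k):
    """Partition [price, weight] pairs into k clusters using price only."""
    prices = np.array([price for price, _ in items]).reshape(-1, 1)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(prices)
    clusters = [[] for _ in range(k)]
    for item, label in zip(items, labels):
        clusters[label].append(item)
    return clusters

def diameter(cluster):
    """Largest relative spread within one cluster, as a fraction (0.005 == 0.5%)."""
    if not cluster:
        return 0.0
    prices = [price for price, _ in cluster]
    return max(prices) / min(prices) - 1.0
With these, markCluster(FloorLevels, 0.005) would correspond to the 0.5% case.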
A stock s belongs in a group G if for each stock t in G, s * 1.05 >= t and s / 1.05 <= t, right?
How do we add the stocks to each group? If we have the stocks 95, 100, 101, and 105, and we start a group with 100, then add 101, we will end up with {100, 101, 105}. If we did 95 after 100, we'd end up with {100, 95}.
Do we just need to consider all possible permutations? If so, your algorithm is going to be inefficient.
You need to specify your problem in more detail. Just what does "put the stockprices into groups when they are within 0.5% of each other" mean?
Possibilities:
(1) each member of the group is within 0.5% of every other member of the group
(2) sort the list and split it where the gap is more than 0.5%
Note that 116.23 is within 0.5% of 115.93 -- abs((116.23 / 115.93 - 1) * 100) < 0.5 -- but you have put one number in Group A and one in Group C.
Simple example: a, b, c = (0.996, 1, 1.004) ... Note that a and b fit, b and c fit, but a and c don't fit. How do you want them grouped, and why? Is the order in the input list relevant?
Possibility (1) produces ab,c or a,bc ... tie-breaking rule, please
Possibility (2) produces abc (no big gaps, so only one group)
You won't be able to classify them into hard "groups". If you have prices (1.0, 1.05, 1.1), then the first and second should be in the same group, and the second and third should be in the same group, but not the first and third.
A quick and dirty way to do something that you might find useful:
def make_group_function(tolerance=0.05):
    from math import log10, floor
    # I forget why this works.
    tolerance_factor = -1.0 / (-log10(1.0 + tolerance))
    # well ... since you might ask
    # we want: log(x)*tf - log(x*(1+t))*tf = -1,
    # so every 5% change has a different group. The minus is just so groups
    # are ascending .. it looks a bit nicer.
    #
    # tf = -1/(log(x)-log(x*(1+t)))
    # tf = -1/(log(x/(x*(1+t))))
    # tf = -1/(log(1/(1*(1+t)))) # solved .. but let's just be more clever
    # tf = -1/(0-log(1*(1+t)))
    # tf = -1/(-log((1+t)))
    def group_function(value):
        # don't just use int - it rounds up below zero, and down above zero
        return int(floor(log10(value) * tolerance_factor))
    return group_function
Usage:
group_function = make_group_function()

import random

groups = {}
for i in range(50):
    v = random.random() * 500 + 1000
    group = group_function(v)
    if group in groups:
        groups[group].append(v)
    else:
        groups[group] = [v]

for group in sorted(groups):
    print('Group', group)
    for v in sorted(groups[group]):
        print(v)
    print()
For a given set of stock prices, there is probably more than one way to group stocks that are within 0.5% of each other. Without some additional rules for grouping the prices, there's no way to be sure an answer will do what you really want.
Apart from the proper way to pick which values fit together, this is a problem where a little object orientation dropped in can make it a lot easier to deal with.
I made two classes here, with a minimum of desirable behaviors, which can make the classification a lot easier -- you get a single point to play with on the Group class.
I can see the code below is incorrect, in the sense that the limits for group inclusion vary as new members are added -- even if the separation criteria remain the same, you have to rewrite the get_groups method to use a multi-pass approach. It should not be hard -- but the code would be too long to be helpful here, and I think this snippet is enough to get you going:
from copy import copy


class Group(object):
    def __init__(self, data=None, name=""):
        if data:
            self.data = data
        else:
            self.data = []
        self.name = name

    def get_mean_stock(self):
        return sum(item[0] for item in self.data) / len(self.data)

    def fits(self, item):
        if 0.995 < abs(item[0]) / self.get_mean_stock() < 1.005:
            return True
        return False

    def get_weight(self):
        return sum(item[1] for item in self.data)

    def __repr__(self):
        return "Group-%s\n%s\n---\nTotalWeight: %d\n\n" % (
            self.name,
            "\n".join("%.02f, %d" % tuple(item) for item in self.data),
            self.get_weight())


class StockGrouper(object):
    def __init__(self, data=None):
        if data:
            self.floor_levels = data
        else:
            self.floor_levels = []

    def get_groups(self):
        groups = []
        floor_levels = copy(self.floor_levels)
        name_ord = ord("A") - 1
        while floor_levels:
            seed = floor_levels.pop(0)
            name_ord += 1
            group = Group([seed], chr(name_ord))
            groups.append(group)
            to_remove = []
            for i, item in enumerate(floor_levels):
                if group.fits(item):
                    group.data.append(item)
                    to_remove.append(i)
            for i in reversed(to_remove):
                floor_levels.pop(i)
        return groups
testing:
floor_levels = [ [stock, weight], ... ]  # <paste the data above>
s = StockGrouper(floor_levels)
s.get_groups()
For the grouping element, could you use itertools.groupby()? As the data is sorted, a lot of the work of grouping it is already done; you could then test whether the current value in the iteration differs from the last by less than 0.5%, and have itertools.groupby() break into a new group every time your function returns False.
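A rough sketch of that suggestion, assuming floor_levels is the [price, weight] list from the question, already sorted in descending price order (this is the "split at gaps" reading, so members of a group are not guaranteed to all be within 0.5% of each other):
import itertools

class GapKey:
    """Key for itertools.groupby: bump the group id whenever the gap to the
    previous price exceeds the tolerance."""
    def __init__(self, tolerance=0.005):
        self.tolerance = tolerance
        self.prev = None
        self.group_id = 0

    def __call__(self, item):
        price = item[0]
        if self.prev is not None and (self.prev / price - 1) > self.tolerance:
            self.group_id += 1
        self.prev = price
        return self.group_id

levels = sorted(floor_levels, key=lambda lvl: lvl[0], reverse=True)
for gid, members in itertools.groupby(levels, key=GapKey()):
    members = list(members)
    print('Group', gid, 'total weight:', sum(weight for _, weight in members))
    for price, weight in members:
        print(price, weight)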
