Error in sorting operation on dictionary - python

I am trying to sort a file of sequences according to a certain parameter. The data looks as follows:
ID1 ID2 32
MVKVYAPASSANMSVGFDVLGAAVTP ...
ID1 ID2 18
MKLYNLKDHNEQVSFAQAVTQGLGKN ...
....
There are about 3000 sequences like this, i.e. the first line contains two ID fields and one rank field (the sorting key), while the second line contains the sequence. My approach is to open the file, convert the file object to a list, separate the annotation lines (ID1, ID2, rank) from the actual sequences (annotation lines always occur at even indices, sequence lines at odd indices), merge them into a dictionary, and sort the dictionary using the rank field. The code reads like so:
#!/usr/bin/python
with open("unsorted.out","rb") as f:
    f = f.readlines()
    assert type(f) == list, "ERROR: file object not converted to list"
annot=[]
seq=[]
for i in range(len(f)):
    # IDs
    if i%2 == 0:
        annot.append(f[i])
    # Sequences
    elif i%2 != 0:
        seq.append(f[i])
# Make dictionary
ids_seqs = {}
ids_seqs = dict(zip(annot,seq))
# Solub rankings are the third field of the annot list, i.e. annot[i].split()[2]
# Use this index notation to rank sequences according to solubility measurements
sorted_niwa = sorted(ids_seqs.items(), key = lambda val: val[0].split()[2], reverse=False)
# Save to file
with open("sorted.out","wb") as out:
    out.write("".join("%s %s" % i for i in sorted_niwa))
The problem I have encountered is that when I open the sorted file to inspect manually, as I scroll down I notice that some sequences have been wrongly sorted. For example, I see the rank 9 placed after rank 89. Up until a certain point the sorting is correct, but I don't understand why it hasn't worked throughout.
Many thanks for any help!

Sounds like you're comparing strings instead of numbers. "9" > "89" because the character '9' comes lexicographically after the character '8'. Try converting to integers in your key.
sorted_niwa = sorted(ids_seqs.items(), key = lambda val: int(val[0].split()[2]), reverse=False)
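To make the difference concrete, here is a minimal sketch, assuming the file lines are already in a list. The function name sort_records and the sample IDs are made up for illustration. It pairs the lines by slicing instead of going through a dictionary, which is my own substitution rather than part of the original approach: a dictionary keyed on the annotation line would silently drop records whose annotation lines are identical.

```python
def sort_records(lines):
    """Pair each annotation line with its sequence line and sort by numeric rank."""
    # Even indices hold annotation lines, odd indices hold sequence lines.
    records = list(zip(lines[0::2], lines[1::2]))
    # int() makes the comparison numeric instead of character by character.
    return sorted(records, key=lambda rec: int(rec[0].split()[2]))


# Hypothetical sample data in the same layout as the question's file.
lines = [
    "ID1 ID2 89\n", "MVKV\n",
    "ID3 ID4 9\n", "MKLY\n",
    "ID5 ID6 32\n", "MSEQ\n",
]

ranked = sort_records(lines)
print([int(annot.split()[2]) for annot, _ in ranked])  # [9, 32, 89]

# For comparison, sorting the ranks as strings puts "9" last.
print(sorted(["9", "89", "32"]))  # ['32', '89', '9']
```

Writing the result back out would then just be a matter of joining each (annotation, sequence) pair, as in the question.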

Related

Clean up irregularities in dictionary values using regex

I need to create a dictionary from a text file that contains coordinates for named polygons. The output needs to be a dictionary where the polygon name is the key and corresponding x and y coordinates are the values. Most of the entries in the file follow a standard layout as follows:
Name of polygon
(12.345, 1.2567)
(5.6789, 2.9876)
(9.0345, 3.7654)
(3.4556, 2.3445)
Name of next polygon
(x, y values)
However there are some entries that have irregularities such as all the values are on one line or have extra characters between the parentheses. I need to loop over the values and split the values contained in parentheses.
So far I have created the dictionary in an initial pass over the file and am trying to use regex to split the values based on contents of parentheses:
with open(fpath, 'r') as infile:
    d = {}
    #split the data into keys and values
    for group in infile.read().split('\n\n'):
        entry = group.split('\n')
        key, *val = entry
        d[key] = val
for value in d.values():
    value = re.split("*[\(.+$\)]*", str(value))
print(d)
I was hoping that this would clean up the values and create individual values for each set of coordinates contained in the parentheses, however I am getting the following error:
re.error: nothing to repeat at position 0
I think I've found a solution to my problem. I needed to account for multiple values per key in the loop and use re.findall() instead of re.split(). So my final loop looks like:
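for key, *value in d.items():
    # [^()]* keeps each match inside one pair of parentheses; a greedy .+ would
    # run from the first "(" to the last ")" in the string
    d[key] = re.findall(r"\([^()]*\)", str(value))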
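For anyone who wants the coordinates as numbers rather than strings, here is a rough sketch of the same idea. It assumes polygons are separated by blank lines (as the split on '\n\n' in the question implies) and that each parenthesised group should contain exactly two numbers. The names parse_polygons, GROUP and NUMBER, and the sample text, are made up for illustration.

```python
import re

# One parenthesised group; the capture excludes the parentheses themselves.
GROUP = re.compile(r"\(([^()]*)\)")
# A plain signed decimal number such as 12.345 or -1.0.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_polygons(text):
    """Return {polygon name: [(x, y), ...]} from blank-line separated blocks."""
    polygons = {}
    for block in text.strip().split("\n\n"):
        name, *rest = block.split("\n")
        coords = []
        # Joining the lines first handles entries with all values on one line.
        for inner in GROUP.findall(" ".join(rest)):
            numbers = NUMBER.findall(inner)
            # Groups that do not hold exactly two numbers are skipped.
            if len(numbers) == 2:
                coords.append((float(numbers[0]), float(numbers[1])))
        polygons[name.strip()] = coords
    return polygons


# Hypothetical input: one regular entry, one with irregularities.
sample = "Square\n(0.0, 0.0)\n(1.0, 0.0)\n\nLine\n(x=2.5, y=3.5) (4.0, -1.0)\n"
polygons = parse_polygons(sample)
print(polygons)
```

Extracting the numbers in a second step is what tolerates extra characters inside the parentheses; whether silently skipping malformed groups is acceptable depends on your data.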

Python Code line by line meaning

I have some Python code and need a line-by-line explanation of what it does.
marksheet = []
for i in range(0,int(input())):
    marksheet.append([raw_input(), float(input())])
second_highest = sorted(list(set([marks for name, marks in marksheet])))[1]
print('\n'.join([a for a,b in sorted(marksheet) if b == second_highest]))
I highly recommend that you go through the Python tutorial.
To help you understand this code, I've added comments.
# initialise an empty list
marksheet = []
# loop from 0 up to (but not including) a count read from user input and converted to int
for i in range(0,int(input())):
    # append a [name, mark] pair to marksheet: the name is read as a string, the mark as a float
    marksheet.append([raw_input(), float(input())])
# [marks for name, marks in marksheet] - get all marks from the list
# set([marks for name, marks in marksheet]) - keep only the unique marks
# list(set([marks for name, marks in marksheet])) - convert the set back to a list
# sort the result in descending order with reverse=True and take index 1, which is the second-largest mark
# (the code in the question sorts in ascending order, so there index 1 is the second-smallest mark)
second_highest = sorted(list(set([marks for name, marks in marksheet])),reverse=True)[1]
# sorted(marksheet) orders the [name, mark] pairs by name
# [a for a,b in sorted(marksheet) if b == second_highest] collects the name of every student whose mark equals second_highest
# join those names with \n (newline) and print them, one per line
print('\n'.join([a for a,b in sorted(marksheet) if b == second_highest]))
Hope it helps!
Would do this in a comment but I don't have 50 reputation yet:
Converting the set to a list sometimes gives values that already look sorted, but a set has no guaranteed order, so it is not a good habit to rely on that; keep the sorted call. Calling sorted on an already sorted list doesn't use a lot of resources anyway.
second_highest = sorted(list(set([marks for name, marks in marksheet])))[1]
Also, if the marks are something like [1,3,2,5,3,2,1], this gives 2 as the result and not 1, since a set removes all duplicates.
If you want to keep duplicates, use:
second_highest = sorted([marks for name, marks in marksheet])[1]
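A quick way to see the difference, using the list from above (the variable names here are just for illustration):

```python
marks = [1, 3, 2, 5, 3, 2, 1]

# Duplicates removed first: the sorted unique values are [1, 2, 3, 5].
second_unique = sorted(set(marks))[1]

# Duplicates kept: the sorted values are [1, 1, 2, 2, 3, 3, 5].
second_with_duplicates = sorted(marks)[1]

print(second_unique, second_with_duplicates)  # 2 1
```

Note that sorted accepts the set directly, so the intermediate list() call is optional.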

Python issue summing dict into list

I have a dictionary (with multiple objects), and I am trying to create a list that sums some of the values for each object. So far I have:
import csv,os,re
#numpy.corrcoef(list1, list2)[0, 1]
input_dict = csv.DictReader(open("./MCPlayerData/AllPlayerData2.csv"))
npi_scores=[]
for person in input_dict:
    #print person
    i=0
    for key in person:
        if re.match(r'npi[0-9]+', key):
            #print key,'=',person[key] #returns npi0=1,npi1=3,npi3=2,etc
            try:
                i+=person[key]
                #print(person[key])
            except TypeError:
                i="NA" #returns NA because one of the values wasn't filled out with an integer
                break
    npi_scores.append(i)
    break
print npi_scores #returns sum of npi scores for one person
print('DONE')
When I run this code I get NA based on the first element, which is what I would expect if the value wasn't an integer, but all are definitely integers. Any ideas?
csv.DictReader returns every field as a string, so convert person[key] with int(person[key]) before adding it. Otherwise i += person[key] tries to add a string to an integer and raises the TypeError that your except clause turns into "NA".
digit_str = person[key]
# checks value only consists of digits
if digit_str.isdigit():
    # converts digits-only value to integer
    i += int(digit_str)
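Putting that together, a minimal sketch might look like the following. It assumes the columns to sum are the ones whose names start with "npi" followed by digits, as in the question; the function name npi_totals and the in-memory sample CSV are invented for illustration, and it is written for Python 3.

```python
import csv
import io
import re


def npi_totals(rows):
    """Sum the npi* columns per row; a row with a non-integer value gets "NA"."""
    totals = []
    for person in rows:
        total = 0
        for key, value in person.items():
            if re.match(r"npi[0-9]+", key):
                # DictReader values are strings, so check before converting.
                if value is not None and value.strip().isdigit():
                    total += int(value)
                else:
                    total = "NA"
                    break
        totals.append(total)
    return totals


# Hypothetical stand-in for the CSV file: bob left npi1 blank.
sample = io.StringIO("name,npi0,npi1,npi2\nann,1,3,2\nbob,2,,1\n")
scores = npi_totals(csv.DictReader(sample))
print(scores)  # [6, 'NA']
```

Checking isdigit() before converting keeps the "NA" behaviour for blank or malformed cells without relying on an exception.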

Sorting on list values read into a list from a file

I am trying to write a routine to read values (names and scores) from a text file and then sort them, e.g. A-Z by name or highest to lowest by score. I am able to sort the data, but only by character position in the string, which is no good when the names have different lengths. This is the code I have written so far:
ClassChoice = input("Please choose a class to analyse Class 1 = 1, Class 2 = 2")
if ClassChoice == "1":
    Classfile = open("Class1.txt",'r')
else:
    Classfile = open("Class2.txt",'r')
ClassList = [line.strip() for line in Classfile]
ClassList.sort(key=lambda s: s[x])
print(ClassList)
This is an example of one of the data files (Each piece of data is on a separate line):
Bob,8,7,5
Fred,10,9,9
Jane,7,8,9
Anne,6,4,8
Maddy,8,5,5
Jim, 4,6,5
Mike,3,6,5
Jess,8,8,6
Dave,4,3,8
Ed,3,3,4
I can sort on the name, but not on score 1, 2 or 3. Something obvious probably but I have not been able to find an example that works in the same way.
Thanks
How about something like this?
indexToSortOn = 0 # will sort on the first integer value of each line
classChoice = ""
while not classChoice.isdigit():
    # input() as in the question (Python 3); on Python 2 this would be raw_input()
    classChoice = input("Please choose a class to analyse (Class 1 = 1, Class 2 = 2) ")
classFile = "Class%s.txt" % classChoice
with open(classFile, 'r') as fileObj:
    classList = [line.strip() for line in fileObj]
classList.sort(key=lambda s: int(s.split(",")[indexToSortOn+1]))
print(classList)
The key is to specify in the key function that you pass in what part of each string (the line) you want to be sorting on:
classList.sort(key=lambda s: int(s.split(",")[indexToSortOn+1]))
The conversion to an integer is important because it makes the sort numeric rather than lexicographic (e.g. 100 > 2, but "100" < "2").
I think I understand what you are asking. I am not a sort expert, but here goes:
Assuming you would like the ability to sort the lines by either the name, the first int, second int or third int, you have to realize that when you are creating the list, you aren't creating a two dimensional list, but a list of strings. Due to this, you may wish to consider changing your lambda to something more like this:
ClassList.sort(key=lambda s: str(s).split(',')[x])
This assumes that the x is defined as one of the fields in the line with possible values 0-3.
The one issue I see with this is that the fields are still strings, so list.sort() compares them character by character: Fred's score of 10 sorts before 2 (but after 0). Integers do not behave this way, so convert the score fields with int() when sorting on them.
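Combining the two answers, here is a small sketch that sorts on either the name or one of the scores. It assumes every line has the layout name,score1,score2,score3; the function name sort_by_field is made up for illustration, and the sample lines are taken from the data file in the question.

```python
def sort_by_field(lines, field):
    """Sort comma-separated lines by one field: 0 = name, 1 to 3 = a score."""
    def key(line):
        parts = [part.strip() for part in line.split(",")]
        # Names sort as text; scores are converted so they sort numerically.
        return parts[field] if field == 0 else int(parts[field])
    return sorted(lines, key=key)


class_list = ["Bob,8,7,5", "Fred,10,9,9", "Jim, 4,6,5", "Ed,3,3,4"]

by_name = sort_by_field(class_list, 0)
by_first_score = sort_by_field(class_list, 1)

print(by_name)         # ['Bob,8,7,5', 'Ed,3,3,4', 'Fred,10,9,9', 'Jim, 4,6,5']
print(by_first_score)  # ['Ed,3,3,4', 'Jim, 4,6,5', 'Bob,8,7,5', 'Fred,10,9,9']
```

For highest to lowest, pass reverse=True to sorted in the same way.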

PYTHON problem with negative decimals

I have a list of negative floats. I want to make a histogram with them. As far as I know, Python can't do operations with negative numbers. Is this correct? The list is like [-0.2923998, -1.2394875, -0.23086493, etc.]. I'm trying to find the maximum and minimum number so I can find out what the range is. My code is giving an error:
setrange = float(maxv) - float(minv)
TypeError: float() argument must be a string or a number
And this is the code:
f = open('clusters_scores.out','r')
#first, extract all of the sim values
val = []
for line in f:
    lineval = line.split()
    print lineval
    val.append(lineval)
print val
#val = map(float,val)
maxv = max(val)
minv = min(val)
setrange = float(maxv) - float(minv)
All the values that are being put into the 'val' list are negative decimals. What is the error referring to, and how do I fix it?
The input file looks like:
-0.0783532095182 -0.99415440702 -0.692972552716 -0.639273674023 -0.733029194040.765257900121 -0.755438339963
-0.144140594077 -1.06533353638 -0.366278118372 -0.746931508538 -1.02549039392 -0.296715961215
-0.0915937502791 -1.68680560936 -0.955147543358
-0.0488457137771 -0.0943080192383 -0.747534412969 -1.00491121699
-1.43973471463
-0.0642611118901 -0.0910684525497
-1.19327387414 -0.0794696449245
-1.00791366035 -0.0509749096549
-1.08046507281 -0.957339914505 -0.861495748259
split() returns a list of the split values, so val ends up as a list of lists, which is probably why you are getting that error.
For example, if you do '-0.2'.split(), you get back a list with a single value ['-0.2'].
EDIT: Aha! With your input file provided, it looks like this is the problem: -0.733029194040.765257900121. I think you mean to make that two separate floats?
Assuming a corrected file like this:
-0.0783532095182 -0.99415440702 -0.692972552716 -0.639273674023 -0.733029194040 -0.765257900121 -0.755438339963
-0.144140594077 -1.06533353638 -0.366278118372 -0.746931508538 -1.02549039392 -0.296715961215
-0.0915937502791 -1.68680560936 -0.955147543358
-0.0488457137771 -0.0943080192383 -0.747534412969 -1.00491121699
-1.43973471463
-0.0642611118901 -0.0910684525497
-1.19327387414 -0.0794696449245
-1.00791366035 -0.0509749096549
-1.08046507281 -0.957339914505 -0.861495748259
The following code will no longer throw that exception:
f = open('clusters_scores.out','r')
#first, extract all of the sim values
val = []
for line in f:
    linevals = line.split()
    print linevals
    val += linevals
print val
val = map(float, val)
maxv = max(val)
minv = min(val)
setrange = float(maxv) - float(minv)
I have changed it to take the list result from split() and concatenate it to the list, rather than append it, which will work provided there are valid inputs in your file.
"All the values that are being put into the 'val' list are negative decimals."
No, they aren't; they're lists of strings that represent negative decimals, since the .split() call produces a list. maxv and minv are lists of strings, which can't be fed to float().
"What is the error referring to, and how do I fix it?"
It's referring to the fact that the contents of val aren't what you think they are. The first step in debugging is to verify your assumptions. If you try this code out at the REPL, then you could inspect the contents of maxv and minv and notice that you have lists of strings rather than the expected strings.
I assume you want to put all the lists of strings (from each line of the file) together into a single list of strings. Use val.extend(lineval) rather than val.append(lineval).
That said, you'll still want to map the strings into floats before calling max or min because otherwise you will be comparing the strings as strings rather than floats. (It might well work, but explicit is better than implicit.)
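Here is a small sketch of the append versus extend difference, using a few made-up lines in place of the real file (the variable names are only for illustration):

```python
# Hypothetical stand-in for three lines of the input file.
lines = ["-0.07 -0.99", "-1.43", "-0.06 -0.09"]

appended = []
extended = []
for line in lines:
    appended.append(line.split())  # adds one list per line: a list of lists
    extended.extend(line.split())  # adds each string: one flat list

print(appended)  # [['-0.07', '-0.99'], ['-1.43'], ['-0.06', '-0.09']]
print(extended)  # ['-0.07', '-0.99', '-1.43', '-0.06', '-0.09']

# max(appended) is itself a list, which float() cannot convert.
print(type(max(appended)))

# With the flat list, convert to float first and then take the range.
values = [float(x) for x in extended]
value_range = max(values) - min(values)
print(value_range)
```

Negative numbers are handled like any other floats here; the original error came only from passing a list to float().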
Simpler yet, just read the entire file at once and split it; .split() without arguments splits on whitespace, and a newline is whitespace. You can also do the mapping at the same point as the reading, with careful application of a list comprehension. I would write:
with open('clusters_scores.out') as f:
    val = [float(x) for x in f.read().split()]
result = max(val) - min(val)
