I am trying to select unique datasets from a very large, quite inconsistent list.
My dataset RawData consists of lists of strings of different lengths.
Some items occur many times, for example: ['a','b','x','15/30']
The key used to compare items is always the last string, for example '15/30'.
The goal is to get a list UniqueData in which each key occurs only once, keeping the first occurrence. (I want to keep the order.)
Dataset:
RawData = [['a','b','x','15/30'],['d','e','f','g','h','20/30'],['w','x','y','z','10/10'],['a','x','c','15/30'],['i','j','k','l','m','n','o','p','20/60'],['x','b','c','15/30']]
My desired solution Dataset:
UniqueData = [['a','b','x','15/30'],['d','e','f','g','h','20/30'],['w','x','y','z','10/10'],['i','j','k','l','m','n','o','p','20/60']]
I tried many possible solutions, for instance:
for index, elem in enumerate(RawData): and appending to a new list if.....
for element in list does not work, because the items are not exactly the same.
Can you help me find a solution to my problem?
Thanks!
The best way to remove duplicates is to keep track of the values you have already seen in a set. Here, add the last element of each list to a set named unique. If that value is already in the set, do nothing; otherwise add it to unique and append the list to the result list, here called new.
Try this.
new = []
unique = set()
for lst in RawData:
    if lst[-1] not in unique:
        unique.add(lst[-1])
        new.append(lst)
print(new)
# [['a', 'b', 'x', '15/30'],
#  ['d', 'e', 'f', 'g', 'h', '20/30'],
#  ['w', 'x', 'y', 'z', '10/10'],
#  ['i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', '20/60']]
You could set up one new list for the unique data and another to track the keys you have seen so far. Then, as you loop through the data, if you have not seen the last element of a sublist before, append the sublist to the unique data and add its last element to the seen list.
RawData = [['a', 'b', 'x', '15/30'], ['d', 'e', 'f', 'g', 'h', '20/30'], ['w', 'x', 'y', 'z', '10/10'],
           ['a', 'x', 'c', '15/30'], ['i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', '20/60'], ['x', 'b', 'c', '15/30']]
seen = []
UniqueData = []
for data in RawData:
    if data[-1] not in seen:
        UniqueData.append(data)
        seen.append(data[-1])
print(UniqueData)
OUTPUT
[['a', 'b', 'x', '15/30'], ['d', 'e', 'f', 'g', 'h', '20/30'], ['w', 'x', 'y', 'z', '10/10'], ['i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', '20/60']]
RawData = [['a','b','x','15/30'],['d','e','f','g','h','20/30'],['w','x','y','z','10/10'],['a','x','c','15/30'],['i','j','k','l','m','n','o','p','20/60'],['x','b','c','15/30']]
seen = []
seen_indices = []
for _, i in enumerate(RawData):
    # _ -> index
    # i -> individual lists
    if i[-1] not in seen:
        seen.append(i[-1])
    else:
        seen_indices.append(_)
for index in sorted(seen_indices, reverse=True):
    del RawData[index]
print(RawData)
Using a set to filter out entries for which the key has already been seen is the most efficient way to go.
Here's a one liner example using a list comprehension with internal side effects:
UniqueData = [rd for seen in [set()] for rd in RawData if not(rd[-1] in seen or seen.add(rd[-1])) ]
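If the side effect inside the comprehension feels too dense, the same idea can be written with a dict, which remembers insertion order in Python 3.7 and later. This is only a sketch: the helper name unique_by_key and its key parameter are made up for illustration and do not come from the question.

```python
RawData = [['a', 'b', 'x', '15/30'], ['d', 'e', 'f', 'g', 'h', '20/30'],
           ['w', 'x', 'y', 'z', '10/10'], ['a', 'x', 'c', '15/30'],
           ['i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', '20/60'],
           ['x', 'b', 'c', '15/30']]


def unique_by_key(rows, key=lambda row: row[-1]):
    """Return the first row seen for each key, in the original order.

    unique_by_key is a hypothetical helper; by default the key is the
    last element of each row, as in the question.
    """
    first_seen = {}
    for row in rows:
        # setdefault stores the row only if this key is not present yet
        first_seen.setdefault(key(row), row)
    return list(first_seen.values())


UniqueData = unique_by_key(RawData)
print(UniqueData)
```

Like the set-based answers, this needs only one pass over the data, and the keys must be hashable (strings are).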
I'm having trouble understanding how 'for loop' works in Python. I want to remove a character from a list using for loop to iterate through the list but the output is not as expected.
In the following code I want to remove the character 'e':
lista = ['g', 'e', 'e', 'k', 'e', 's', 'e', 'e']
for x in lista:
    if x == 'e':
        lista.remove(x)
print(lista)
It prints ['g', 'k', 's', 'e', 'e'] when I was expecting ['g', 'k', 's'].
Thank you.
Removing items from a list while you iterate over it gives surprising results. The for loop tracks its position by index, and when you remove an item the list shrinks and every later item moves one place to the left. So when you encounter an 'e' and remove it, the loop advances to the next index, but the item that slid into the current position is never examined: you are actually jumping over an item.
To solve your problem, iterate over a copy of your list.
lista = ['g', 'e', 'e', 'k', 'e', 's', 'e', 'e']
for x in lista.copy():
    if x == 'e':
        lista.remove(x)
print(lista)
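To see the skipping for yourself, you can record which positions the loop actually visits. This is just an illustrative trace of the code from the question; the name visited is introduced here for the demonstration.

```python
lista = ['g', 'e', 'e', 'k', 'e', 's', 'e', 'e']
visited = []  # (position, value) pairs the loop really looked at

for position, x in enumerate(lista):
    visited.append((position, x))
    if x == 'e':
        # remove() deletes the first 'e', so later items shift left
        lista.remove(x)

print(visited)  # far fewer than the original 8 items
print(lista)    # the 'e's that were never visited are still there
```

The loop stops as soon as its position reaches the (now shorter) length of the list, so several of the original items are never examined.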
You can use the following code
lista = ['g', 'e', 'e', 'k', 'e', 's', 'e', 'e']
# walk backwards so that deleting an item never shifts an index we still have to visit
for i in range(len(lista) - 1, -1, -1):
    if lista[i] == 'e':
        del lista[i]
The above approach will modify the original list.
A far simpler and better way is as follows:
list(filter(('e').__ne__, lista))
Both methods give
['g', 'k', 's']
A solution to your problem may be this:
while 1:
    try:
        lista.remove("e")
    except ValueError:
        break
A simple list comprehension works. The idea is to not update the list while iterating.
lista = ['g', 'e', 'e', 'k', 'e','s', 'e', 'e']
lista = [i for i in lista if i!='e']
I think the most pythonic way is to do it using list comprehension as shown below:
lista = ['g', 'e', 'e', 'k', 'e','s', 'e', 'e']
lista = [x for x in lista if x != 'e']
print(lista)
Your method does not work because removing items from a list while iterating over it changes the indexes of the remaining items, so the loop skips some of them.
When you remove an item from a list, the list is updated in place and shrinks.
Thus, in your case, after some of the 'e's have been removed the list has shrunk and the loop ends before the last two elements are considered.
What you need to do is to check whether the element still exists:
lista = ['g', 'e', 'e', 'k', 'e', 's', 'e', 'e']
while 'e' in lista:
    lista.remove('e')
print(lista)
UPDATE:
As @mad_ pointed out, you can reduce the complexity with a list comprehension:
lista = ['g', 'e', 'e', 'k', 'e', 's', 'e', 'e']
print([i for i in lista if i != 'e'])
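One thing to keep in mind: a list comprehension builds a new list. If other variables refer to the same list and should see the change too, slice assignment can replace the contents in place. A small sketch, where alias is just an illustrative second name for the list:

```python
lista = ['g', 'e', 'e', 'k', 'e', 's', 'e', 'e']
alias = lista  # a second reference to the same list object

# Assigning to the full slice replaces the contents of the existing list
# instead of rebinding the name to a new list.
lista[:] = [x for x in lista if x != 'e']

print(lista)
print(alias is lista)
```

The comprehension is evaluated completely before the assignment happens, so nothing is removed while the list is being iterated.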
Thank you for your help and patience.
I am new to python and am attempting to calculate the number of times a particular atomic symbol appears divided by the total number of atoms. So that the function accepts a list of strings as argument and returns a list containing the fraction of 'C', 'H', 'O' and 'N'. But I keep on getting one result instead of getting all for each of my atoms. My attempt is below:
Atoms = ['N', 'C', 'C', 'O', 'H', 'H', 'C', 'H', 'H', 'H', 'H', 'O', 'H']
def count_atoms (atoms):
    for a in atoms:
        total = atoms.count(a)/len(atoms)
        return total
Then
faa = count_atoms(atoms)
print(faa)
However I only get one result which is 0.07692307692307693. I was supposed to get a list starting with [0.23076923076923078,..etc], but I don't know how to. I was supposed to calculate the fraction of 'C', 'H', 'O' and 'N' atomic symbols in the molecule using a for loop and a return statement. :( Please help, it will be appreciated.
@ganderson's comment explains the problem. As for an alternative implementation, here is one using collections.Counter:
from collections import Counter
atoms = ['N', 'C', 'C', 'O', 'H', 'H', 'C', 'H', 'H', 'H', 'H', 'O', 'H']
def count_atoms(atoms):
    num = len(atoms)
    return {atom: count / num for atom, count in Counter(atoms).items()}
print(count_atoms(atoms))
Well, you return the variable total on the first iteration of your loop. Why don't you use a list to store your values? Like this:
atoms = ['N', 'C', 'C', 'O', 'H', 'H', 'C', 'H', 'H', 'H', 'H', 'O', 'H'] #python is case sensitive!
def count_atoms (atoms):
    return_list = [] #empty list
    for a in atoms:
        total = atoms.count(a)/len(atoms)
        return_list.append(total) #we add a new item
    return return_list #we return everything and leave the function
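Note that looping over atoms itself produces one fraction per atom in the list, so repeated symbols give repeated values. If you want exactly one fraction for each of 'C', 'H', 'O' and 'N', you could loop over the symbols instead. A minimal sketch, still using a for loop and a return statement; the symbols parameter and its default order are assumptions made here, and the list is assumed to be non-empty:

```python
atoms = ['N', 'C', 'C', 'O', 'H', 'H', 'C', 'H', 'H', 'H', 'H', 'O', 'H']


def count_atoms(atoms, symbols=('C', 'H', 'O', 'N')):
    """Return the fraction of each symbol, in the order given by symbols."""
    total = len(atoms)  # assumed to be greater than zero
    fractions = []
    for symbol in symbols:
        fractions.append(atoms.count(symbol) / total)
    return fractions  # returned only after the loop has finished


faa = count_atoms(atoms)
print(faa)
```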
It would be better to return a dictionary so you know which element the fraction corresponds to:
>>> fractions = {element: Atoms.count(element)/len(Atoms) for element in Atoms}
>>> fractions
{'N': 0.07692307692307693, 'C': 0.23076923076923078, 'O': 0.15384615384615385, 'H': 0.5384615384615384}
You can, then, even lookup the fraction for a particular element like:
>>> fractions['N']
0.07692307692307693
However, if you must use a for loop and a return statement, then the answer from @not_a_bot_no_really_82353 would be the right one.
A simple one liner should do
[atoms.count(a)/float(len(atoms)) for a in set(atoms)]
Or better create a dictionary using comprehension
{a:atoms.count(a)/float(len(atoms)) for a in set(atoms)}
Output
{'C': 0.23076923076923078,
'H': 0.5384615384615384,
'N': 0.07692307692307693,
'O': 0.15384615384615385}
If you still want to loop over the atoms, I would suggest using map, which would be a lot cleaner:
atoms = ['N', 'C', 'C', 'O', 'H', 'H', 'C', 'H', 'H', 'H', 'H', 'O', 'H']
def count_atoms (a):
    total = atoms.count(a)/float(len(atoms))
    return total
list(map(count_atoms, atoms))
I'm trying to sort a list containing only lowercase letters by using the string:
alphabet = "abcdefghijklmnopqrstuvwxyz"
that is, without using sort, and with O(n) complexity only.
I got here:
def sort_char_list(lst):
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    new_list = []
    length = len(lst)
    for i in range(length):
        new_list.insert(alphabet.index(lst[i]),lst[i])
        print (new_list)
    return new_list
for this input :
m = list("emabrgtjh")
I get this:
['e']
['e', 'm']
['a', 'e', 'm']
['a', 'b', 'e', 'm']
['a', 'b', 'e', 'm', 'r']
['a', 'b', 'e', 'm', 'r', 'g']
['a', 'b', 'e', 'm', 'r', 'g', 't']
['a', 'b', 'e', 'm', 'r', 'g', 't', 'j']
['a', 'b', 'e', 'm', 'r', 'g', 't', 'h', 'j']
['a', 'b', 'e', 'm', 'r', 'g', 't', 'h', 'j']
It looks like something goes wrong along the way, and I can't understand why. If anyone can enlighten me, that would be great.
You are looking for a bucket sort. Here:
def sort_char_list(lst):
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    # Here, create the 26 buckets
    new_list = [''] * len(alphabet)
    for letter in lst:
        # This is the bucket index
        # You could use `ord(letter) - ord('a')` in this specific case, but it is not mandatory
        index = alphabet.index(letter)
        new_list[index] += letter
    # Assemble the buckets
    return ''.join(new_list)
As for complexity, since alphabet is a pre-defined fixed-size string, searching for a letter in it requires at most 26 operations, which qualifies as O(1). The overall complexity is therefore O(n).
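The function above returns a string. If you need a list back, as in the question, a counting variant of the same idea could look like the sketch below. It assumes the input contains only the lowercase letters a to z. As a side note, the original attempt goes wrong because list.insert with an index beyond the end of the list simply appends the item, so the alphabet position is not a reliable position in the partially built list.

```python
def sort_char_list(lst):
    """Sort a list of lowercase letters in O(n) by counting each letter."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    counts = [0] * len(alphabet)  # one counter per letter
    for letter in lst:
        # ord() arithmetic gives the bucket index in constant time
        counts[ord(letter) - ord('a')] += 1
    result = []
    for index, count in enumerate(counts):
        # emit each letter as many times as it was seen
        result.extend(alphabet[index] * count)
    return result


m = list("emabrgtjh")
print(sort_char_list(m))
```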
I have a nested list that looks like this:
li = [['m', 'z', 'asdgwergerwhwre'],
['j', 'h', 'asdgasdgasdgasdgas'],
['u', 'a', 'asdgasdgasdgasd'],
['i', 'o', 'sdagasdgasdgdsag']]
I would like to sort this list alphabetically, BUT using either the first or second element in each sublist. For the above example, the desired output would be:
['a', 'u', 'asdgasdgasdgasd']
['h', 'j', 'asdgasdgasdgasdgas']
['i', 'o', 'sdagasdgasdgdsag']
['m', 'z', 'asdgwergerwhwre']
What is the best way to achieve this sort?
As a first step we perform a transformation (swapping the first two items if needed), and as a second step we apply a simple sort:
>>> sorted(map(lambda x: sorted(x[:2]) + [x[2]], li))
[['a', 'u', 'asdgasdgasdgasd'],
['h', 'j', 'asdgasdgasdgasdgas'],
['i', 'o', 'sdagasdgasdgdsag'],
['m', 'z', 'asdgwergerwhwre']]
You can make use of the built-in function sorted() to accomplish this. You first have to swap the first two elements of any sublist in which they are out of order, but that's not too difficult to do.
def rev(li):
    for l in li:
        # swap only when the first two elements are out of order
        if l[0] > l[1]:
            l[0], l[1] = l[1], l[0]
    return li
new_list = sorted(rev(li))
If you wanted to sort the list based on a specific index, you can use sorted(li, key=lambda li: li[index]).
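The key argument can also compute a value instead of picking a fixed index. As a sketch, the following orders the sublists by whichever of their first two elements comes first alphabetically, without modifying them. Note that this differs from the desired output in the question, because the first two elements are not swapped; the name by_smaller is just illustrative.

```python
li = [['m', 'z', 'asdgwergerwhwre'],
      ['j', 'h', 'asdgasdgasdgasdgas'],
      ['u', 'a', 'asdgasdgasdgasd'],
      ['i', 'o', 'sdagasdgasdgdsag']]

# The key is the smaller of the first two elements of each sublist.
by_smaller = sorted(li, key=lambda row: min(row[0], row[1]))

for row in by_smaller:
    print(row)
```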
import pprint
li = [['m', 'z', 'asdgwergerwhwre'],
      ['j', 'h', 'asdgasdgasdgasdgas'],
      ['u', 'a', 'asdgasdgasdgasd'],
      ['i', 'o', 'sdagasdgasdgdsag']]
for _list in li:
    _list[:2] = sorted(_list[:2])
pprint.pprint(sorted(li))
>>>
[['a', 'u', 'asdgasdgasdgasd'],
['h', 'j', 'asdgasdgasdgasdgas'],
['i', 'o', 'sdagasdgasdgdsag'],
['m', 'z', 'asdgwergerwhwre']]
I have to write a recursive selection sort for the following exercise.
def selsort(l):
    """
    sorts l in-place.
    PRE: l is a list.
    POST: l is a sorted list with the same elements; no return value.
    """
l1 = list("sloppy joe's hamburger place")
vl1 = l1
print l1 # should be: """['s', 'l', 'o', 'p', 'p', 'y', ' ', 'j', 'o', 'e', "'", 's', ' ', 'h', 'a', 'm', 'b', 'u', 'r', 'g', 'e', 'r', ' ', 'p', 'l', 'a', 'c', 'e']"""
ret = selsort(l1)
print l1 # should be """[' ', ' ', ' ', "'", 'a', 'a', 'b', 'c', 'e', 'e', 'e', 'g', 'h', 'j', 'l', 'l', 'm', 'o', 'o', 'p', 'p', 'p', 'r', 'r', 's', 's', 'u', 'y']"""
print vl1 # should be """[' ', ' ', ' ', "'", 'a', 'a', 'b', 'c', 'e', 'e', 'e', 'g', 'h', 'j', 'l', 'l', 'm', 'o', 'o', 'p', 'p', 'p', 'r', 'r', 's', 's', 'u', 'y']"""
print ret # should be "None"
I know how to get this by using key → l.sort(key=str.lower). But the question wants me to extract the maximum element, instead of the minimum, only to .append(...) it on to a recursively sorted sublist.
If I could get any help I would greatly appreciate it.
So. Do you understand the problem?
Let's look at what you were asked to do:
extract the maximum element, instead of the minimum, only to .append(...) it on to a recursively sorted sublist.
So, we do the following things:
1) Extract the maximum element. Do you understand what "extract" means here? Do you know how to find the maximum element?
2) Recursively sort the sublist. Here, "the sublist" consists of everything else after we extract the maximum element. Do you know how recursion works? You just call your sort function again with the sublist, relying on it to do the sorting. After all, the purpose of your function is to sort lists, so this is supposed to work, right? :)
3) .append() the maximum element onto the result of sorting the sublist. This should not require any explanation.
Of course, we need a base case for the recursion. When do we have a base case? When we can't follow the steps exactly as written. When does that happen? Well, why would it happen? Answer: we can't extract the maximum element if there are no elements, because then there is no maximum element to extract.
Thus, at the beginning of the function we check if we were passed an empty list. If we were, we just return an empty list, because sorting an empty list results in an empty list. (Do you see why?) Otherwise, we go through the other steps.
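The steps above can be sketched as follows. This is one possible reading, not the only one: because the docstring in the question asks for an in-place sort with no return value, this version modifies the list it is given and the base case simply returns instead of returning an empty list. It is written for Python 3, so print is a function.

```python
def selsort(l):
    """Sort l in place by recursive selection sort; returns None."""
    if not l:
        # base case: an empty list has no maximum and is already sorted
        return
    largest = max(l)    # 1) find the maximum element ...
    l.remove(largest)   #    ... and extract it from the list
    selsort(l)          # 2) recursively sort everything that is left
    l.append(largest)   # 3) put the maximum back at the end


l1 = list("sloppy joe's hamburger place")
vl1 = l1
ret = selsort(l1)
print(l1)
print(vl1)
print(ret)
```

Each call recurses once per remaining element, so a very long list would hit Python's recursion limit; for the short test string in the question this is not a concern.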
The sort method should do what you want. If you want the reverse order, just use list.reverse().
If your job is to write your own sort function, that can be done.
Maybe try something like this:
def sort(l):
    li = l[:]  # make a new copy so the original is left untouched
    newlist = []  # sorted list will be stored here
    while len(li) != 0:  # while there is stuff to be sorted
        bestindex = -1  # the index of the lowest element found so far
        bestchar = -1  # the ord value of the lowest character (-1 means none yet)
        bestcharrep = -1  # the lowest character itself
        i = 0
        for v in li:
            if ord(v) < bestchar or bestchar == -1:  # check if character is lower than old best
                bestindex = i  # update best records
                bestchar = ord(v)
                bestcharrep = v
            i += 1
        del li[bestindex]  # delete retrieved element from list
        newlist.append(bestcharrep)  # add element to new list
    return newlist  # return the sorted list