Removing duplicates from the list of unicode strings

I am trying to remove duplicates from a list of unicode strings without changing the order in which the elements appear (so I don't want to use a set).
Program:
result = [u'http://google.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html', u'http://amazon.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://yahoo.com']
result.reverse()
for e in result:
    count_e = result.count(e)
    if count_e > 1:
        for i in range(0, count_e - 1):
            result.remove(e)
result.reverse()
print result
Output:
[u'http://google.com', u'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html', u'http://amazon.com', u'http://yahoo.com']
Expected Output:
[u'http://google.com', u'http://catb.org/~esr/faqs/hacker-howto.html', u'http://amazon.com', u'http://yahoo.com']
So, is there a way of doing this as simply as possible?

You actually don't have duplicates in your list. One time you have http://catb.org while another time you have http://www.catb.org.
You'll have to figure out a way to determine whether the URL has www. in front or not.

You can create a new list and add items to it if they're not already in it.
result = [ /some list items/]
uniq = []
for item in result:
    if item not in uniq:
        uniq.append(item)
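The test item not in uniq scans the whole list on every iteration, so the loop above gets slow for long lists. A minimal sketch of a faster variant, assuming Python 3.7+ (where dicts keep insertion order) and hashable items; the helper name dedupe_keep_order is made up for illustration:

```python
def dedupe_keep_order(items):
    """Return a new list with duplicates removed, keeping first occurrences."""
    # dict keys are unique and (since Python 3.7) remember insertion order
    return list(dict.fromkeys(items))


result = ['http://google.com', 'http://amazon.com', 'http://google.com', 'http://yahoo.com']
print(dedupe_keep_order(result))
# ['http://google.com', 'http://amazon.com', 'http://yahoo.com']
```

The original list is left untouched; assign the return value back to result if you want to replace it.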

You could use a set and then sort it by the original index:
sorted(set(result), key=result.index)
This works because index returns the position of the first occurrence, so the items stay in the order of their first appearance in the original list.
I also notice that one of the strings in your original isn't a unicode string. So you might want to do something like:
u = [unicode(s) for s in result]
return sorted(set(u), key=u.index)
EDIT: 'http://google.com' and 'http://www.google.com' are not string duplicates. If you want to treat them as such, you could do something like:
def remove_www(s):
    s = unicode(s)
    prefix = u'http://'
    # 'http://www.' is 11 characters long, 'http://' is 7
    suffix = s[11:] if s.startswith(u'http://www.') else s[7:]
    return prefix + suffix
And then replace the earlier code with
u = [remove_www(s) for s in result]
return sorted(set(u), key=u.index)
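If slicing by fixed character positions feels fragile, the same idea can be sketched with the standard urllib.parse module. This assumes Python 3 and that stripping a leading "www." from the host is the only normalisation wanted; strip_www and dedupe_urls are hypothetical helper names, not part of any library:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url):
    """Return url with a leading 'www.' removed from its host part."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith('www.'):
        host = host[len('www.'):]
    return urlunsplit(parts._replace(netloc=host))


def dedupe_urls(urls):
    """Keep the first URL for each www-insensitive form, in original order."""
    seen = set()
    unique = []
    for url in urls:
        key = strip_www(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)  # keep the URL as it was originally written
    return unique


urls = ['http://google.com',
        'http://catb.org/~esr/faqs/hacker-howto.html',
        'http://www.catb.org/~esr/faqs/hacker-howto.html',
        'http://amazon.com']
print(dedupe_urls(urls))
```

Unlike the version above, this keeps each surviving URL in its original spelling and only uses the stripped form as the comparison key.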

Here is a method that modifies result in place:
result = [u'http://google.com', u'http://catb.org/~esr/faqs/hacker-howto.html', u'http://www.catb.org/~esr/faqs/hacker-howto.html',u'http://amazon.com', 'http://www.catb.org/esr/faqs/hacker-howto.html', u'http://yahoo.com']
seen = set()
i = 0
while i < len(result):
    if result[i] not in seen:
        seen.add(result[i])
        i += 1
    else:
        del result[i]

Related

Which item in list - Python

I am making a console game using python and I am checking if an item is in a list using:
if variable in list:
I want to find out which position in the list the item was found at, like list[0] for example. Any help would be appreciated :)

You can do it using the list method index, as follows:
list.index(variable)
index returns an integer giving the position of the first appearance of the value you are looking for, and it raises a ValueError if the value is not found.
If you are already checking whether the value is in the list, then within the if statement you can get the index with:
if variable in list:
    variable_at = list.index(variable)
Example:
foo = ['this','is','not','This','it','is','that','This']
if 'This' in foo:
    print(foo.index('This'))
Outputs:
3
Take a look at the answer below, which has more complete information.
Finding the index of an item in a list

We can take inspiration from other languages such as JavaScript and write a function that returns the index if the item exists, or -1 otherwise.
from typing import Any
list_ = [5, 6, 7, 8]
def check_element(alist: list, item: Any):
    if item in alist:
        return alist.index(item)
    else:
        return -1
and the usage is
check1 = check_element(list_, 5)
check2 = check_element(list_, 9)
and this one is for one line lovers
check_element_one_liner = lambda alist, item: alist.index(item) if item in alist else -1
alternative_check1 = check_element_one_liner(list_, 5)
alternative_check2 = check_element_one_liner(list_, 9)
and a bit shorter version :)
check_shorter = lambda a, i: a.index(i) if i in a else -1

Using a library, you could use NumPy's np.where(arr == variable), provided arr is a NumPy array rather than a plain list.
In vanilla Python, I can think of something like:
idx = [idx for idx, item in enumerate(list) if item == variable][0]
But this solution is not foolproof: if there are no matching results, it raises an IndexError. You could guard against this with an if beforehand:
if variable in list:
    idx = [idx for idx, item in enumerate(list) if item == variable][0]
else:
    idx = None
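If the value may appear more than once, the same enumerate pattern can return every position rather than only the first. A small sketch, where find_all is a made-up helper name:

```python
def find_all(items, target):
    """Return the indices of every element equal to target (possibly empty)."""
    return [index for index, item in enumerate(items) if item == target]


foo = ['this', 'is', 'not', 'This', 'it', 'is', 'that', 'This']
print(find_all(foo, 'This'))   # [3, 7]
print(find_all(foo, 'nope'))   # []
```

An empty result list doubles as the "not found" signal, so no separate membership check is needed.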

I understand that you want to get a sublist containing only the elements of the original list that match a certain condition (in your example case, you want to extract all the elements that are equal to the first element of the list).
You can do that by using the built-in filter function which allows you to produce a new list containing only the elements that match a specific condition.
Here's an example:
a = [1,1,1,3,4]
variable = a[0]
b = list(filter(lambda x : x == variable, a)) # [1,1,1]

This answer assumes that you only search for one (the first) matching element in the list.
Using the index method of a list should be the way to go. You just have to wrap it in a try-except statement. Here is an alternative version using next.
def get_index(data, search):
    return next((index for index, value in enumerate(data) if value == search), None)
my_list = list('ABCDEFGH')
print(get_index(my_list, 'C'))
print(get_index(my_list, 'X'))
The output is
2
None

Assuming that you want to check that the item exists and get its index, the most efficient way is to use list.index. It returns the index of the first matching item and raises a ValueError otherwise, so it can be used as follows:
items = [1,2,3,4,5]
item_index = None
try:
    item_index = items.index(3) # look for 3 in the list
except ValueError:
    # do item not found logic
    print("item not found") # example
else:
    # do item found logic knowing item_index
    print(items[item_index]) # example, prints 3
Also, please avoid naming variables list, as that shadows the built-in list type.

If you simply want to check whether the number is in the list and print it or print its index, you could try this:
ls = [1,2,3]
num = 2
if num in ls:
    # to print the num
    print(num)
    # to print the index of num
    print(ls.index(num))
else:
    print('Number not in the list')

animals = ['cat', 'dog', 'rabbit', 'horse']
index = animals.index('dog')
print(index)

How to remove the first occurrence of a repeated character with Python

I have been given the string 'abcdea' and I need to find the repeated character and remove its first occurrence, so the result must be 'bcdea'. I have tried the following, but only get this result:
def remove_rep(x):
    new_list = []
    for i in x:
        if i not in new_list:
            new_list.append(i)
    new_list = ''.join(new_list)
    print(new_list)
remove_rep('abcdea')
and the result is 'abcde', not the 'bcdea' I was looking for.

You could make use of str.find(), which returns the index of the first occurrence within the string:
def remove_rep(oldString):
    newString = ''
    for i in oldString:
        if i in newString:
            # Character used previously, .find() returns the first position within string
            first_position_index = newString.find(i)
            newString = newString[:first_position_index] + newString[first_position_index + 1:]
        newString += i
    print(newString)
remove_rep('abcdea')
remove_rep('abcdeaabcdea')
Out:
bcdea
bcdea

One approach can be to iterate in reverse order over the string, and keep track of all the characters seen in the string. If a character is repeated, we don't add it to the new_list.
def remove_rep(x: str):
    new_list = []
    seen = set()
    for char in reversed(x):
        if char not in seen:
            new_list.append(char)
            seen.add(char)
    return ''.join(reversed(new_list))
print(remove_rep('abcdea'))
Result: 'bcdea'
Note that the above solution doesn't quite work as desired when a character occurs more than twice: it removes every occurrence except the last one, whereas you only want to remove the first. To resolve that, you can instead do something like the following:
def remove_rep(x: str):
    new_list = []
    first_seen = set()
    for char in x:
        freq = x.count(char)
        if char in first_seen or freq == 1:
            new_list.append(char)
        elif freq > 1:
            first_seen.add(char)
    return ''.join(new_list)
Now for the given input:
print(remove_rep('abcdeaca'))
We get the desired result, with only the first a and the first c removed:
bdeaca
Test for a more complicated input:
print(remove_rep('abcdeaabcdea'))
We do get the correct result:
aabcdea
Do you see what happened in that last one? The first abcde sequence got removed, as all characters are repeated in this string. So our result is actually correct, even though it doesn't look so at an initial glance.

One of the approaches with one small change in the if condition:
def remove_rep(x):
    new_list = []
    visited = []
    for i, item in enumerate(x):
        if item not in x[i+1:] or item in visited:
            new_list.append(item)
        else:
            visited.append(item)
    new_list = ''.join(new_list)
    print(new_list)
remove_rep('abcdeaa')
remove_rep('abcdeaabcdea')
Output:
bcdeaa
aabcdea

str.replace() does that :
https://docs.python.org/3/library/stdtypes.html#str.replace
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old
replaced by new. If the optional argument count is given, only the
first count occurrences are replaced.
So basically :
"abcabc".replace('b', '', 1)
# output : 'acabc'
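To apply this to the original question, str.replace with count=1 can be combined with a character count. A minimal sketch, assuming that "repeated" means "appears more than once anywhere in the string"; remove_first_of_repeated is a hypothetical name:

```python
from collections import Counter


def remove_first_of_repeated(text):
    """Drop the first occurrence of every character that appears more than once."""
    counts = Counter(text)  # counts are taken from the original string
    for char, count in counts.items():
        if count > 1:
            # the third argument (1) limits the replacement to the first occurrence
            text = text.replace(char, '', 1)
    return text


print(remove_first_of_repeated('abcdea'))     # bcdea
print(remove_first_of_repeated('abcdeaca'))   # bdeaca
```

Each distinct character is handled at most once, so later occurrences of a repeated character are left in place.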

Change
new_list = ''.join(new_list)
to
new_list = ''.join(new_list[1:]+[i])
(and figure out why! Hint: what's the condition of your if block? What are you checking for and why?)

If else fill variable if empty list

I have lists, some empty and some filled with data. I am trying to store the last element of the list in a variable. If there are elements in the list, it works fine. However, when I pass in an empty list [], I get an error like IndexError: list index out of range. Which syntax should I use for []?
ids = [
    'abc123',
    'ab233',
    '23231ad',
    'a23r2d23'
]
ids = []
# I tried these for empty
final = [ids if [] else ids[-1]] #error
# final = [ids if ids == None else ids == ids[-1]] # error
# final = [ids if ids == [] else ids == ids[-1]] # gives [[]] instead of []
print(final)
Basically, if an empty list is in ids, I need it to give []. If there are elements, then give the last element, which is working.

Here is one way to do this:
final = ids[-1] if ids else None
(Replace None with the value you'd like final to take when the list is empty.)
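If this pattern comes up often, it can be wrapped in a small helper. A sketch only; last_or_default is a made-up name, and passing [] as the default reproduces what the question asks for:

```python
def last_or_default(items, default=None):
    """Return the last element of items, or default if items is empty."""
    # an empty list is falsy, so the conditional never indexes into it
    return items[-1] if items else default


print(last_or_default(['abc123', 'ab233', '23231ad', 'a23r2d23']))  # a23r2d23
print(last_or_default([], default=[]))                              # []
```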

You can check for an empty list with the expression below.
data = []
if not data:  # an empty list is falsy, so "not data" is True here
    print("list is empty")
else:
    print("list has elements")
So what you can do is:
final = data[-1] if data else []
print(final)

final = ids[-1] if len(ids) > 0 else []
This will handle the immediate problem. Please work through class materials or tutorials a little more for individual techniques. For instance, your phrase ids if [] doesn't do what you (currently) seem to think: it does not check ids against the empty list. All it does is see whether that empty list is "truthy", and an empty list evaluates to False.

You are getting the error because you can't select the last item of an empty list, so Python rightfully raises an IndexError.
Try this example:
ids = [[i for i in range(10)] for x in range(3)]
ids.append([])
last_if_not_empty = [i[-1] for i in ids if i]
Here the condition if i keeps only the non-empty lists, and from each of those you pick out the last element.

a = list[-1] if not len(list)==0 else 0

Python string function that removes one duplicate pair from multiple duplicates

I'm looking for a string function that removes one duplicate pair from multiple duplicates.
What I'd like the function to do:
input = ['a','a','a','b','b','c','d','d','d','d']
output = ['a','c']
Here's what I have so far:
def double(lijst):
    """
    returns all duplicates in the list as a set
    """
    res = set()
    zien = set()
    for x in lijst:
        if x in zien or zien.add(x):
            res.add(x)
    return(res)
def main():
    list_1 = ['a','a','a','b','b','c']
    list_2 = set(list_1)
    print(list_2 - double(list_1))
main()
The problem being that it removes all duplicates, and doesn't leave the 'a'. Any ideas how to approach this problem?
For those interested in why I need this: I want to track when a Levenshtein function is processing vowel steps. If a vowel is being inserted or deleted, I want to assign a different value to that step (first I need to track whether a vowel has passed on either side of the matrix before the current step, though), hence I need to remove duplicate pairs from a vowel list (as explained in the input/output example).

This solves your problem. Take a look.
lsit = ['a','a','a','b','b','c']
# iterate over a copy, because removing from a list while looping over it skips elements
for i in list(lsit):
    temp = lsit.count(i)
    if temp%2==0:
        for x in range(temp):
            lsit.remove(i)
    else:
        for x in range(temp-1):
            lsit.remove(i)
print(lsit)
Output:
['a','c']

Just iterate through the list. If an element is not yet in the set, add it. If it is already in the set, remove it, so that the two occurrences cancel out.
The code is simple:
def double(l):
    """
    returns the elements that occur an odd number of times, as a set
    """
    res = set()
    for x in l:
        if x in res:
            res.remove(x)
        else:
            res.add(x)
    return res
input = ['a','a','a','b','b','c','d','d','d','d']
print(double(input))
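A set does not keep the elements in any particular order. If the order of first appearance matters, the same "pairs cancel out" idea can be sketched with collections.Counter, assuming Python 3.7+; unpaired is a hypothetical name:

```python
from collections import Counter


def unpaired(items):
    """Return the elements left over after cancelling equal pairs, in first-seen order."""
    counts = Counter(items)
    # Counter is a dict subclass, so its keys follow first-appearance order (Python 3.7+)
    return [item for item, count in counts.items() if count % 2 == 1]


print(unpaired(['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'd']))  # ['a', 'c']
```

An element survives exactly when it occurs an odd number of times, which matches the toggle logic of the set-based version.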

Python string extraction from array of strings

I am having trouble figuring out the following:
Suppose I have a list of strings
strings = ["and","the","woah"]
I want the output to be a list of strings where the ith position of every string becomes a new string item in the array like so
["atw","nho","dea","h"]
I am playing with the following list comprehension
u = [[]]*4
c = [u[i].append(stuff[i]) for i in range(0,4) for stuff in strings]
but it's not working out. Can anyone help? I know you can use other tools to accomplish this, but I am particularly interested in making this happen with for loops and list comprehensions. This may be asking a lot; let me know if I am.

Using just list comprehensions and for loops you can:
strings = ["and","the","woah"]
# Build a list of empty strings, one per position of the longest word
new = ["" for x in range(max([len(m) for m in strings]))]
# Cycle through the new list
for index, item in enumerate(new):
    for w in strings:
        try:
            item += w[index]
            new[index] = item
        except IndexError:
            # this word is too short to have a character at this position
            pass
print(new)
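The same result can also be sketched as a single nested comprehension, which avoids both the try/except and the shared-list pitfall in the question (u = [[]]*4 creates four references to one and the same list). The function name interleave_columns is made up for illustration:

```python
def interleave_columns(strings):
    """Join the i-th characters of every string, for each position i."""
    longest = max((len(s) for s in strings), default=0)
    return [
        # skip strings that are too short to have a character at position i
        ''.join(s[i] for s in strings if i < len(s))
        for i in range(longest)
    ]


print(interleave_columns(["and", "the", "woah"]))  # ['atw', 'nho', 'dea', 'h']
```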

My idea would be to use itertools.izip_longest (named zip_longest in Python 3) and a list comprehension.
>>> from itertools import izip_longest
>>> strings = ["and","the","woah"]
>>> [''.join(x) for x in izip_longest(*strings, fillvalue='')]
['atw', 'nho', 'dea', 'h']
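The session above is Python 2. On Python 3 the function is called zip_longest, so an equivalent sketch would look like this (transpose_strings is a made-up wrapper name):

```python
from itertools import zip_longest


def transpose_strings(strings):
    """Join the characters found at each position across all strings."""
    # fillvalue='' pads the shorter strings so they contribute nothing
    return [''.join(chars) for chars in zip_longest(*strings, fillvalue='')]


print(transpose_strings(["and", "the", "woah"]))  # ['atw', 'nho', 'dea', 'h']
```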

Try
array = ["and","the","woah"]
array1 = []
longest_item = 0
for i in range(len(array)):
    if len(array[i]) > longest_item:
        longest_item = len(array[i]) #find longest string
for i in range(longest_item):
    s = ""
    for i1 in range(len(array)):
        # slicing past the end of a short string yields '', so no length check is needed
        s += array[i1][i:i+1]
    array1.append(s)
I didn't actually try this code out, I just improvised it. Please leave a comment ASAP if you find a bug.
