Remove similar duplicates from a list of strings - Python

I'm trying to remove the similar duplicates from my list.
Here is my code:
l = ["shirt", "shirt", "shirt len", "pant", "pant cotton", "len pant", "watch"]
res = [*set(l)]
print(res)
This removes only "shirt", which is an exact duplicate, but I also want to drop the similar entries such as "shirt len", "pant cotton" and "len pant".
Expected output:
shirt, pant, watch

It sounds like you want to check whether any of the single-word strings is contained in another string, and if so drop that longer string as a duplicate. I would go about it this way:
Separate the list into single-word strings and all other strings.
For each longer string, check if any of the single-word strings is contained in it.
If so, skip it as a duplicate. Otherwise, add it to the result.
Finally, add all the single-word strings to the result.
l = ["shirt", "shirt", "shirt len", "pant", "pant cotton", "len pant", "watch"]
single, longer = set(), set()
for s in l:
if len(s.split()) == 1:
single.add(s)
else:
longer.add(s)
res = set()
for s in longer:
if not any(word in s for word in single):
res.add(s)
res |= single
print(res)
This example will give:
{'shirt', 'watch', 'pant'}

You can try something like the following: select only the single-word elements from the list and then apply set.
lst = ["shirt", "shirt", "shirt len", "pant cotton", "len pant", "watch"]
set([ls for ls in lst if ' 'not in ls])
#Output {'pant', 'shirt', 'watch'}
Note that if your input is ["shirt", "shirt", "shirt len", "pant cotton", "len pant", "watch"] (with no standalone "pant"), the output will be {'shirt', 'watch'}.
If in that case you would still like to include pant and cotton, you can try
set(sum([ls.split(' ') for ls in lst], []))
# Output: {'cotton', 'len', 'pant', 'shirt', 'watch'}
and later filter words out by whatever conditions your requirements call for, as sketched below.
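For instance, a minimal sketch of one such filter, assuming you only want to keep words that also occur as standalone single-word entries in the original list:
lst = ["shirt", "shirt", "shirt len", "pant", "pant cotton", "len pant", "watch"]

# every individual word that occurs anywhere in the list
all_words = set(sum([ls.split(' ') for ls in lst], []))

# keep only the words that also appear as standalone single-word entries
standalone = set(ls for ls in lst if ' ' not in ls)
print(all_words & standalone)  # {'shirt', 'pant', 'watch'}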

Related

Python Split Strings While Preserving Order?

I have a list of strings in Python, where I need to preserve the order and split some of the strings.
The condition for splitting a string is that after the first match of : there is a non-space/newline/tab character.
For example, this must be split:
'example: Test' to ['example:', 'Test']
While this stays the same: 'example: ', 'IGNORE_ME_EXAMPLE'
Given an input like this:
['example: Test', 'example: ', 'IGNORE_ME_EXAMPLE']
I'm expecting:
['example:', 'Test', 'example: ', 'IGNORE_ME_EXAMPLE']
Please note that the split parts stay next to each other and follow the original order.
Also, whenever I split a string I don't want to check the split parts again. In other words, I don't want to check 'Test' after I split it.
To make it more clear, Given an input like this:
['example: Test::YES']
I'm expecting:
['example:', 'Test::YES']
You can use regular expressions for that:
import re

# sample input from the question
lines = ['example: Test', 'example: ', 'IGNORE_ME_EXAMPLE']

pattern = re.compile(r"(.+:)\s+([^\s].+)")
result = []
for line in lines:
    match = pattern.match(line)
    if match:
        result.append(match.group(1))
        result.append(match.group(2))
    else:
        result.append(line)
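With that sample input, result comes out as ['example:', 'Test', 'example: ', 'IGNORE_ME_EXAMPLE'], which matches the expected output in the question.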
You can use a nested list comprehension for the input list:
l = ['example: Test::YES']
l1 = [j.strip() for i in l for j in i.split(":", 1) if j.strip() != '']
print(l1)
Output:
['example', 'Test::YES']
You need to iterate over your list of words and, for each word, check whether ':' is present. If it is, split the word into two parts, the part up to and including ':' and the part after it, and append both to the final list. If there is no ':' in the word, add the word to the result list as-is and skip the remaining operations for that word.
# your code goes here
wordlist = ['example:', 'Test', 'example: ', 'IGNORE_ME_EXAMPLE']
result = []
for word in wordlist:
    index = -1
    part1, part2 = None, None
    if ':' in word:
        index = word.index(':')
    else:
        result.append(word)
        continue
    part1, part2 = word[:index + 1], word[index + 1:]
    if part1 is not None and len(part1) > 0:
        result.append(part1)
    if part2 is not None and len(part2) > 0:
        result.append(part2)
print(result)
output
['example:', 'Test', 'example:', ' ', 'IGNORE_ME_EXAMPLE']

How to remove characters in a string AFTER a given character?

I have a list of strings that I want to remove the url extensions from. Here's what it looks like
['google.com', 'google.ru', 'google.ca']
Basically, I want to remove everything after the "." in each one so that I'm returned with something like this
['google', 'google', 'google']
My instructions specifically tell me to use the split() function, but I'm confused with that as well. If it's also possible, I need to remove duplicates, so my final result would be:
['google']
Thanks for the help, sorry if my specifications are odd.
This def removes url extensions:
def removeurlextensions(L):
    L2 = []
    for x in range(len(L)):
        L2.append(L[x].split('.')[0])
    return L2
To print your list:
L = ['google.com', 'google.ru', 'google.ca']
print(removeurlextensions(L))
#prints ['google', 'google', 'google']
To remove duplicates you can use list(set()):
L = ['google.com', 'google.ru', 'google.ca']
print(list(set(removeurlextensions(L))))
#prints ['google']
You can simply use split.
ls = ['google.com', 'google.ru', 'google.ca']
print([i.split('.', 1)[0] for i in ls])
# result = ['google', 'google', 'google']
And to remove the duplicate, you might want to use set.
mod = [i.split('.', 1)[0] for i in ls]
print(list(set(mod)))
# result = ['google']
This will only work if all items are strings:
my_list = ['google.com', 'google.ru', 'google.ca']
for i in range(len(my_list)):
    my_list[i] = my_list[i].split('.')[0]

# keep each item only once, preserving order
already_in_list = []
for item in my_list:
    if item not in already_in_list:
        already_in_list.append(item)
my_list = already_in_list
print(my_list)  # ['google']
I did do this from memory so if there is a bug please let me know.

Extracting the first word from every value in a list

So I have a long list of column headers. All are strings, some are several words long. I've yet to find a way to write a function that extracts the first word from each value in the list and returns a list of just those single words.
For example, this is what my list looks like:
['Customer ID', 'Email','Topwater -https:', 'Plastics - some uml']
And I want it to look like:
['Customer', 'Email', 'Topwater', 'Plastics']
I currently have this:
def first_word(cur_list):
    my_list = []
    for word in cur_list:
        my_list.append(word.split(' ')[:1])
and it returns None when I run it on a list.
You can use a list comprehension that takes the first element after splitting each string on spaces.
my_list = [x.split()[0] for x in your_list]
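For example, applied to the list from the question (your_list stands in for the header list):
your_list = ['Customer ID', 'Email', 'Topwater -https:', 'Plastics - some uml']
my_list = [x.split()[0] for x in your_list]
print(my_list)  # ['Customer', 'Email', 'Topwater', 'Plastics']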
To address "and it returns None when I run it on a list."
You didn't return my_list. The function builds a new list (it doesn't change the original cur_list), and since my_list is never returned, the call evaluates to None.
To extract the first word from every value in a list
From #dfundako, you can simplify it to
my_list = [x.split()[0] for x in cur_list]
The final code would be
def first_word(cur_list):
    my_list = [x.split()[0] for x in cur_list]
    return my_list
Here is a demo. Note that some punctuation may be left attached, especially if it comes right after the last letter of the first word:
names = ["OMG FOO BAR", "A B C", "Python Strings", "Plastics: some uml"]
first_word(names) would be ['OMG', 'A', 'Python', 'Plastics:']
>>> l = ['Customer ID', 'Email','Topwater -https://karls.azureedge.net/media/catalog/product/cache/1/image/627x470/9df78eab33525d08d6e5fb8d27136e95/f/g/fgh55t502_web.jpg', 'Plastics - https://www.bass.co.za/1473-thickbox_default/berkley-powerbait-10-power-worm-black-blue-fleck.jpg']
>>> list(next(zip(*map(str.split, l))))
['Customer', 'Email', 'Topwater', 'Plastics']
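Here map(str.split, l) splits each header into its words, zip(*...) regroups them so that the first tuple holds all the first words (zip stops at the shortest word list, but only that first group is needed), and next(...) pulls out that tuple, which list(...) then turns into a list.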
[column.split(' ')[0] for column in my_list] should do the trick.
and if you want it in a function:
def first_word(my_list):
    return [column.split(' ')[0] for column in my_list]
(?<=\d\d\d)\d* try using this in a loop to extract the words using regex

Populate dictionary from list

I have a list of strings (from a .tt file) that looks like this:
list1 = ['have\tVERB', 'and\tCONJ', ..., 'tree\tNOUN', 'go\tVERB']
I want to turn it into a dictionary that looks like:
dict1 = { 'have':'VERB', 'and':'CONJ', 'tree':'NOUN', 'go':'VERB' }
I was thinking of substitution, but it doesn't work that well. Is there a way to tag the tab string '\t' as a divider?
Try the following:
dict1 = dict(item.split('\t') for item in list1)
Output:
>>> dict1
{'and': 'CONJ', 'go': 'VERB', 'tree': 'NOUN', 'have': 'VERB'}
Since str.split also splits on '\t' by default ('\t' is considered white space), you could get a functional approach by feeding dict with a map that looks quite elegant:
d = dict(map(str.split, list1))
With the dictionary d now being in the wanted form:
print(d)
{'and': 'CONJ', 'go': 'VERB', 'have': 'VERB', 'tree': 'NOUN'}
If you need a split only on '\t' (while ignoring ' ' and '\n') and still want to use the map approach, you can create a partial object with functools.partial that only uses '\t' as the separator:
from functools import partial
# only splits on '\t' ignoring new-lines, white space e.t.c
tabsplit = partial(str.split, sep='\t')
d = dict(map(tabsplit, list1))
this, of course, yields the same result for d using the sample list of strings.
You can do that with a simple dict comprehension and str.split (without arguments, split splits on any whitespace, including '\t'):
list1 = ['have\tVERB', 'and\tCONJ', 'tree\tNOUN', 'go\tVERB']
dict1 = {x.split()[0]:x.split()[1] for x in list1}
result:
{'and': 'CONJ', 'go': 'VERB', 'tree': 'NOUN', 'have': 'VERB'}
EDIT: x.split()[0]:x.split()[1] splits each string twice, which is not optimal. Other answers here do it better without a dict comprehension.
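If you still prefer a comprehension, a sketch that splits each string only once can feed pre-split pairs into it:
list1 = ['have\tVERB', 'and\tCONJ', 'tree\tNOUN', 'go\tVERB']
# split once per item, then unpack the resulting pair inside the comprehension
dict1 = {word: tag for word, tag in (item.split('\t') for item in list1)}
print(dict1)  # {'have': 'VERB', 'and': 'CONJ', 'tree': 'NOUN', 'go': 'VERB'}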
A short way to solve the problem, since the split method splits on '\t' by default (as pointed out by Jim Fasarakis-Hilliard), could be:
dictionary = dict(item.split() for item in list1)
print dictionary
I also wrote down a simpler, more classic approach.
Not very pythonic but easy to understand for beginners:
list1 = ['have\tVERB', 'and\tCONJ', 'tree\tNOUN', 'go\tVERB']
dictionary1 = {}
for item in list1:
    splitted_item = item.split('\t')
    word = splitted_item[0]
    word_type = splitted_item[1]
    dictionary1[word] = word_type
print dictionary1
Here I wrote the same code with very verbose comments:
# Let's start with our word list, we'll call it 'list1'
list1 = ['have\tVERB', 'and\tCONJ', 'tree\tNOUN', 'go\tVERB']

# Here's an empty dictionary, 'dictionary1'
dictionary1 = {}

# Let's start to iterate using variable 'item' through 'list1'
for item in list1:
    # Here I split item in two parts, passing the '\t' character
    # to the split function, and put the resulting list of two elements
    # into the 'splitted_item' variable.
    # If you want to know more about the split function, check the link
    # available at the end of this answer.
    splitted_item = item.split('\t')

    # Just to make the code more readable, here I put the 1st part
    # of the split item (part 0, because we start counting
    # from number 0) in the 'word' variable
    word = splitted_item[0]

    # I use the same approach to save the 2nd part of the
    # split item into the 'word_type' variable.
    # Yes, you're right: we use 1 because we start counting from 0
    word_type = splitted_item[1]

    # Finally I add to 'dictionary1' the 'word' key with a value of 'word_type'
    dictionary1[word] = word_type

# After the for loop has completed I print the now
# complete dictionary1 to check if the result is correct
print dictionary1
Useful links:
You can quickly copy and paste this code here to check how it works and tweak it if you like: http://www.codeskulptor.com
If you want to learn more about split and string functions in general: https://docs.python.org/2/library/string.html

Search a list within a list and more, Python 2.7. Not sure what I need to change

I have a list containing lists of words, called words. I have a function called random_sentence which can be called with any sentence. I want to search the random sentence for any word that is in spot [0] of one of the inner lists and then switch it with the corresponding word in that list. Hope that makes sense.
words = [["I", "you"], ["i", "you"], ["we", "you"], ["my", "your"], ["our", "your"]]
def random_sentence(sentence):
    list = sentence.split()
    string = sentence
    for y in list:
        for i in words:
            for u in i:
                if y == u:
                    mylist = i[1]
                    string = string.replace(y, mylist)
    return string
So random_sentence("I have a my pet dog")
should return "you have your pet dog".
My function works sometimes, but other times it does not.
For example, random_sentence("I went and we")
produces "you yount and you", which does not make sense.
How do I fix my function to produce the right outcome?
First, your code, as pasted, does not even run. You have a space instead of an underscore in your function definition, and you never return anything.
But, after fixing that, your code does exactly what you describe.
To figure out why, try adding prints to see what it's doing at each step, or running it through a visualizer, like this one.
When you get to the point where y is "we", you'll end up doing this:
string = string.replace("we", "you")
But that will replace every we in string, including the one in went.
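A quick interactive check makes this visible (by this point string is already "you went and we" from the earlier "I" replacement):
>>> "you went and we".replace("we", "you")
'you yount and you'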
If you want to do things this way, you probably want to modify each y in list, and then join them back together at the end, like this:
def random_sentence(sentence):
    list = sentence.split()
    for index, y in enumerate(list):
        for i in words:
            for u in i:
                if y == u:
                    mylist = i[1]
                    list[index] = mylist
    return ' '.join(list)
If you find this hard to understand… well, so do I. All of your variable names are either a single letter or a misleading name (like mylist, which isn't even a list). Also, you're looping over all of i when you really only want to check the first element. See if this is more readable:
replacements = [["I", "you"], ["i", "you"], ["we", "you"], ["my", "your"], ["our", "your"]]
def random_sentence(sentence):
    words = sentence.split()
    for index, word in enumerate(words):
        for replacement in replacements:
            if word == replacement[0]:
                words[index] = replacement[1]
    return ' '.join(words)
However, there's a much better way to solve this problem.
First, instead of having a list of word-replacement pairs, just use a dictionary. Then you can get rid of a whole loop and make it much easier to read (and faster, too):
replacements = {"I": "you", "i": "you", "we": "you", "my": "your", "our": "your"}
def random_sentence(sentence):
    words = sentence.split()
    for index, word in enumerate(words):
        replacement = replacements.get(word, word)
        words[index] = replacement
    return ' '.join(words)
And then, instead of trying to modify the original list in place, just build up a new one:
def random_sentence(sentence):
    result = []
    for word in sentence.split():
        result.append(replacements.get(word, word))
    return ' '.join(result)
Then, this result = [], for …: result.append(…) is exactly what a list comprehension is for:
def random_sentence(sentence):
    result = [replacements.get(word, word) for word in sentence.split()]
    return ' '.join(result)
… or, since you don't actually need the list for any purpose but to serve it to join, you can use a generator expression instead:
def random_sentence(sentence):
    return ' '.join(replacements.get(word, word) for word in sentence.split())
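Used on the problematic sentence from the question, this word-by-word approach gives a sensible result:
>>> random_sentence("I went and we")
'you went and you'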
A dictionary/map makes more sense here than a list of lists. Define your dictionary words as:
words = {"I":"you", "i":"you", "we":"you","my":"your","our":"your"}
And then, use it as:
def randomsentence(text):
    result = []
    for word in text.split():
        if word in words:  # check if the current word exists in our dictionary
            result.append(words[word])  # append the value stored for that word
        else:
            result.append(word)
    return " ".join(result)
OUTPUT:
>>> randomsentence("I have a my pet dog")
'you have a your pet dog'
>>> words = {'I': 'you', 'i': 'you', 'we': 'you', 'my': 'your', 'our': 'your'}
>>> def random_sentence(sentence):
...     return ' '.join([words.get(word, word) for word in sentence.split()])
...
>>> random_sentence('I have a my pet dog')
'you have a your pet dog'
The problem is that string.replace replaces substrings that are parts of other words. You can manually build an answer like this:
def random_sentence(sentence):
    list = sentence.split()
    result = []
    for y in list:
        for i in words:
            if i[0] == y:
                result.append(i[1])
                break
        else:
            result.append(y)
    return " ".join(result)
Note that the else clause belongs to the for loop, not to the if: it only runs when the loop finishes without hitting break.
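A minimal illustration of that for/else behavior, with made-up values:
for item in [1, 2, 3]:
    if item == 99:
        break
else:
    # runs because the loop finished without hitting break
    print("no match found")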
