Remove short overlapping string from list of string

I have a list of strings, mylist = ["Hanks", "Tom Hanks","Tom","Tom Can"], and I need to remove the shorter strings that are substrings of another string in the list.
For example, in the case above the output should be: ["Tom Hanks","Tom Can"].
What I have done in Python:
mylist = ["Hanks", "Tom Hanks","Tom","Tom Can"]
newlst = []
for x in mylist:
    noexist = True
    for j in mylist:
        if x==j:continue
        noexist = noexist and not(x in j)
    if (noexist==True):
        newlst.append(x)
print(newlst)
The code works fine. How can I make it efficient?

If the order of the output does not matter (replace the ',' character with a character that doesn't occur in the strings of your list):
mylist = ["Hanks", "Tom Hanks","Tom","Tom Can"]
mylist.sort(key = len)
newlst = []
for i,x in enumerate(mylist):
    if x not in ','.join(mylist[i+1:]):
        newlst.append(x)
A list comprehension alternative (less readable):
mylist = ["Hanks", "Tom Hanks","Tom","Tom Can"]
mylist.sort(key = len)
newlst = [x for i,x in enumerate(mylist) if x not in ','.join(mylist[i+1:])]
output:
['Tom Can', 'Tom Hanks']
And if you want to keep the order:
mylist = ["Hanks", "Tom Hanks","Tom","Tom Can"]
mylist_sorted = mylist.copy()
mylist_sorted.sort(key = len)
newlst = [x for i,x in enumerate(mylist_sorted) if x not in ','.join(mylist_sorted[i+1:])]
newlst = [x for x in mylist if x in newlst]
output:
['Tom Hanks', 'Tom Can']
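The same idea can be written without a separator character. This is only a sketch, and the function name remove_contained is my own, not something from the question. It sorts the strings longest first, compares each one only against the strings already kept, and then restores the original order:

```python
def remove_contained(strings):
    """Return the strings that are not a substring of another string, in original order."""
    # Longest first: a string can only be contained in one that is at least as long.
    by_length = sorted(set(strings), key=len, reverse=True)
    kept = []
    for s in by_length:
        # If s sits inside a dropped string, it also sits inside the kept
        # string that caused the drop, so checking `kept` is enough.
        if not any(s in longer for longer in kept):
            kept.append(s)
    kept_set = set(kept)
    return [s for s in strings if s in kept_set]


mylist = ["Hanks", "Tom Hanks", "Tom", "Tom Can"]
print(remove_contained(mylist))  # ['Tom Hanks', 'Tom Can']
```

Because no strings are joined, it does not matter which characters the strings contain. It is still quadratic in the worst case.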

See if this helps. The answer is based on the sample list from the question:
mylist = ["Hanks", "Tom Hanks","Tom","Tom Can"]
newlist = []
newstring = "|".join(mylist)
for a in mylist:
    if newstring.count(a) == 1:
        print("Big string: ",a)
        newlist.append(a)
    else:
        print("Small String: ",a)
print(newlist)
The if/else print statements are there to show how the loop traverses the list and which condition each string meets.

A pretty minor improvement that doesn't change the overall algorithm: once you find another element that contains the current element, you can break out of the inner loop, since the remaining comparisons cannot change the result.
mylist = ["Hanks", "Tom Hanks","Tom","Tom Can"]
newlist = []
for elem in mylist:
    for candidate in mylist:
        if elem == candidate:
            continue
        elif elem in candidate:
            break
    else:
        newlist.append(elem)
print(newlist)
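The early exit can also be expressed with any(), which stops at the first match just like the break does. A minimal sketch of the same algorithm using the question's sample list:

```python
mylist = ["Hanks", "Tom Hanks", "Tom", "Tom Can"]

# Keep elem only if no *other* string in the list contains it.
# any() short-circuits, so the inner scan stops at the first container found.
newlist = [
    elem
    for elem in mylist
    if not any(elem != other and elem in other for other in mylist)
]
print(newlist)  # ['Tom Hanks', 'Tom Can']
```

This is still quadratic in the number of strings. It only saves the comparisons after the first hit.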

If your strings are always words, you can just split on the words and filter by set operations, which should be quite fast.
from collections import Counter
items = ["Hanks", "Tom Hanks","Tom","Tom Can"]
items = set(items) # Don't want to think about uniqueness
item_words = {} # {item: all_words}
word_counts = Counter() # {word: item_counts}
word_lookups = {} # {word: {all_words: {item, ...}, ...}, ...}
for item in items:
    words = frozenset(item.split())
    item_words[item] = words
    for word in words:
        word_lookups.setdefault(word, {}).setdefault(words, set()).add(item)
        word_counts[word] += 1
def is_ok(item):
    words = item_words[item]
    min_word = min(words, key=word_counts.__getitem__)
    if word_counts[min_word] == 1:
        return True # This item has a unique word
    for all_words, others in word_lookups[min_word].items():
        if not words.issubset(all_words):
            continue # Not all words present
        for other in others:
            if item == other:
                continue # Don't remove yourself
            if item in other:
                return False
    return True # No matches
final = [item for item in items if is_ok(item)]
If you want to be very fast, consider a variation on the Aho–Corasick algorithm, where you would construct patterns for all your entries, and match them against all your inputs, and discard any patterns that have more than one match. This could potentially be linear in time.

Related

python list of lists contain substring

I have list_of_lists and, for each sublist, I need to get the string that contains 'height'. If a sublist has no height at all, I need to get 'nvt' for that whole sublist.
I have tried the following:
list_of_lists = [['width=9','length=3'],['width=6','length=4','height=4']]
_lists = []
for list in list_of_lists:
    list1 = []
    for st in list:
        if ("height" ) in st:
            list1.append(st)
        else:
            list1.append('nvt')
    _lists.append(list1)
OUT = _lists
The result I need is:
_lists = ['nvt', 'height=4']
What I'm getting is:
_lists = [['nvt','nvt'],['nvt','nvt','height=4']]

This is a good case for a for/else construct, as follows:
list_of_lists = [['width=9','length=3'],['width=6','length=4','height=4']]
result = []
for e in list_of_lists:
    for ss in e:
        if ss.startswith('height'):
            result.append(ss)
            break
    else:
        result.append('nvt')
print(result)
Output:
['nvt', 'height=4']
Note:
This could probably be done with a list comprehension, but I think this is more obvious, and there is probably no significant difference in performance.

This should work: assign height to the first value in the sublist for which s.startswith("height") is True, and if nothing matches this filter, assign 'nvt' to height instead.
_lists = []
for sublist in list_of_lists:
    height = next(filter(lambda s: s.startswith("height"), sublist), 'nvt')
    _lists.append(height)
And if you wish to be crazy, you can use a list comprehension to reduce the code to:
_lists = [next(filter(lambda s: s.startswith("height"), sublist), 'nvt') for sublist in list_of_lists]

Try this (Python 3.x):
import re
list_of_lists = [['width=9','length=3'],['width=6','length=4','height=4']]
_lists = []
r = re.compile("height=")
for li in list_of_lists:
    match = list(filter(r.match, li))
    if len(match) > 0:
        _lists.extend(match)
    else:
        _lists.append('nvt')
OUT = _lists
print(OUT)

How to Find the Length of max item in List? python

In my list, I want the function to return only the longest strings.
my_list = ["and", "the", "plus", "from", "i" , "09"]
How can I get the output to be the longest strings ("plus", "from")?
I tried something like this, but could not continue:
my_list = ["and", "the", "plus", "from", "i" , "09"]
list1 = str(my_list)
list2 = list1.split(',')
for i in list2:
    if len(i) == max(len(i)):
        print(i)
Thanks for the help

def myFunction(words: list):
    # Sort once, longest first; the first element then has the maximum length.
    byLength = sorted(words, key=len, reverse=True)
    maxLen = len(byLength[0]) if byLength else 0
    output = []
    for word in byLength:
        if len(word) < maxLen:
            break  # every remaining word is shorter
        output.append(word)
    return {
        "words": output,
        "length": maxLen
    }
It takes the words list and sorts it from greatest to least length, then sets the maxLen variable to the length of the first item in that sorted list. Then, for each word in the for loop, it checks whether the word's length is less than maxLen, and if it is, it stops the loop.

If you would like to get the length of the longest string in the list, you can use the shorthand version of a for loop, like max([len(item) for item in my_list]). But as far as I understood, you would like to get the strings themselves, even if several strings share the same length. You can try this:
def longest_items(my_list):
    max_len = 0
    items = []
    for item in my_list:
        if len(item) > max_len:
            max_len = len(item)
            items.clear()
            items.append(item)
        elif len(item) == max_len:
            items.append(item)
    return tuple(items)
This function will return longest string or strings in a tuple.

You can quickly use some list comprehension:
>>> my_list = ["and", "the", "plus", "from", "i" , "09"]
>>> lengths = [len(s) for s in my_list]
>>> max_lengths = [s for s in my_list if len(s) == max(lengths)]
>>> max_lengths
['plus', 'from']
The list in lengths stores the length of each string in my_list. Then, by using the max() function, we can quickly obtain the maximum value in lengths, which corresponds to the maximum length of a string. Finally, we iterate over my_list again and compare the length of each string to max(lengths), keeping the strings that have the maximum length.
Note that, if desired, this all can be condensed into one list comprehension:
>>> max_lengths = [i for i in my_list if len(i) == max([len(s) for s in my_list])]
>>> max_lengths
['plus', 'from']
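One thing worth noting about the comprehensions above is that max(...) sits inside the condition, so it is evaluated again for every element. A sketch that computes the maximum once (the variable name longest is my own):

```python
my_list = ["and", "the", "plus", "from", "i", "09"]

# Compute the maximum length a single time.
# default=0 keeps max() from raising ValueError on an empty list.
longest = max(map(len, my_list), default=0)
max_lengths = [s for s in my_list if len(s) == longest]
print(max_lengths)  # ['plus', 'from']
```

For a six-element list the difference is not measurable, but for long lists this turns a quadratic scan into two linear passes.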

my_list = ["and", "the", "plus", "from", "i", "09"]
length = 0
arr = []
for i in my_list:
    if len(i) > length:
        length = len(i)
        arr = []
    if len(i) == length:
        arr.append(i)
        length = len(i)
print(arr)

list2 = list1.split(',')
split is a string method; it turns a string into a list. For example, "john, tim, and paul".split(',') would turn into ["john", " tim", " and paul"].
max takes an iterable and returns the highest value. You did max(len(i)), but len(i) just returns an integer, so you effectively did something like max(6), which raises a TypeError. Instead, you want to make a list of the lengths: max([len(x) for x in my_list]) would give you the length of the longest item in my_list.
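To make the points about split and max concrete, here is a short sketch. It also shows why calling str() on the list is not useful here, because the result keeps the brackets and quotes:

```python
text = "john, tim, and paul"
print(text.split(','))  # ['john', ' tim', ' and paul']

my_list = ["and", "the", "plus", "from", "i", "09"]

# str() gives the printed form of the whole list, punctuation included.
as_text = str(my_list)
print(as_text)  # ['and', 'the', 'plus', 'from', 'i', '09']

# max() needs an iterable of values to compare, not a single integer.
lengths = [len(x) for x in my_list]
print(max(lengths))  # 4
```

Working on my_list directly avoids the round trip through a string.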

How to extract a string contained in nested list?

Please help me extract the strings that contain particular text. I have tried the following:
lst = [['abc', 'abgoodhj', 'rygbadkk'], ['jhjbadnm'], ['hjhj', 'iioytu'], ['hjjh', 'ghjgood1hj', 'jjkkbadgghhj', 'hjhgkll']]
for lst1 in lst:
    good_wrd = [txt for txt in lst1 if txt.contains('good')]
    bad_wrd = [txt for txt in lst1 if txt.contains('bad')]
I want the words that contain good and bad.

Use a list comprehension to create a new list:
good_wrd = [
    word
    for sub_lst in lst
    for word in sub_lst
    if "good" in word
]
bad_wrd = [
    word
    for sub_lst in lst
    for word in sub_lst
    if "bad" in word
]
Alternatively using for loops:
good_wrd = []
bad_wrd = []
for sub_lst in lst:
    for word in sub_lst:
        if "bad" in word:
            bad_wrd.append(word)
        elif "good" in word:
            good_wrd.append(word)
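If the same search is needed for several targets, the nested loop can be wrapped in a small helper. This is a sketch only, and words_containing is a name I made up for it:

```python
from itertools import chain


def words_containing(nested, target):
    """Return every string in the nested list that contains target."""
    # chain.from_iterable walks the sublists as one flat stream of words.
    return [word for word in chain.from_iterable(nested) if target in word]


lst = [['abc', 'abgoodhj', 'rygbadkk'], ['jhjbadnm'], ['hjhj', 'iioytu'],
       ['hjjh', 'ghjgood1hj', 'jjkkbadgghhj', 'hjhgkll']]
good_wrd = words_containing(lst, 'good')
bad_wrd = words_containing(lst, 'bad')
print(good_wrd)  # ['abgoodhj', 'ghjgood1hj']
print(bad_wrd)   # ['rygbadkk', 'jhjbadnm', 'jjkkbadgghhj']
```

A word that contains both targets would show up in both results, which may or may not be what you want.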

This would work:
lst = [['abc', 'abgoodhj', 'rygbadkk'], ['jhjbadnm'], ['hjhj', 'iioytu'], ['hjjh', 'ghjgood1hj', 'jjkkbadgghhj', 'hjhgkll']]
good_wrd = []
bad_wrd = []
for lst1 in lst:
    good_wrd.extend([txt for txt in lst1 if 'good' in txt])
    bad_wrd.extend([txt for txt in lst1 if 'bad' in txt])
print(good_wrd)
print(bad_wrd)

target1 = 'good'
target2 = 'bad'
goods = []
bads = []
for lis in lst:
    for txt in lis:
        if target1 in txt:
            goods.append(txt)
        elif target2 in txt:
            bads.append(txt)

Sorting a list based on upper and lower case

I have a list:
List1 = ['name','is','JOHN','My']
I want to append the pronoun as the first item in a new list and append the names at last. Other items should be in the middle and their positions can change.
So far I have written:
my_list = ['name','is','JOHN','My']
new_list = []
for i in my_list:
    if i.isupper():
        my_list.remove(i)
        new_list.append(i)
print(new_list)
Here, I can't check if an item is completely upper case or only its first letter is upper case.
Output I get:
['name','is','JOHN','My']
Output I want:
['My','name','is','JOHN']
or:
['My','is','name','JOHN']
EDIT: I have seen this post and it doesn’t have answers to my question.

i.isupper() will tell you if it's all uppercase.
To test if just the first character is uppercase and the rest lowercase, you can use i.istitle()
To make your final result, you can append to different lists based on the conditions.
all_cap = []
init_cap = []
non_cap = []
for i in my_list:
    if i.isupper():
        all_cap.append(i)
    elif i.istitle():
        init_cap.append(i)
    else:
        non_cap.append(i)
new_list = init_cap + non_cap + all_cap
print(new_list)
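The three-bucket idea can also be written as a sort key. This is a sketch under the same assumptions as the answer above, and case_rank is a hypothetical helper name. Because sorted() is stable, the words in the middle keep their relative order:

```python
def case_rank(word):
    """Hypothetical sort key: title case first, other words next, all caps last."""
    if word.isupper():   # checked first, as in the loop above
        return 2
    if word.istitle():
        return 0
    return 1


my_list = ['name', 'is', 'JOHN', 'My']
print("My".isupper(), "My".istitle())      # False True
print("JOHN".isupper(), "JOHN".istitle())  # True False

new_list = sorted(my_list, key=case_rank)
print(new_list)  # ['My', 'name', 'is', 'JOHN']
```

Note that a single capital letter such as "I" is both isupper() and istitle(), so the order of the two checks decides where it ends up.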

How about this:
s = ['name', 'is', 'JOHN', 'My']
pronoun = ''
name = ''
for i in s:
    if i.isupper():
        name = i
    if i.istitle():
        pronoun = i
result = [pronoun, s[0], s[1], name]
print(result)

Don't @ me pls XD. Try this.
my_list = ['name','is','JOHN','My']
new_list = ['']
for i in range(len(my_list)):
    if my_list[i][0].isupper() and my_list[i][1].islower():
        new_list[0] = my_list[i]
    elif my_list[i].islower():
        new_list.append(my_list[i])
    elif my_list[i].isupper():
        new_list.append(my_list[i])
print(new_list)

How to delete item and his info in a list?

I have the following list:
lst= ['Jason', 999999999, 'jason#live.com', 'Curt', 333333333, 'curt#job.com']
I want to delete Jason and the 2 entries that follow it, so I'm thinking of this:
for i in range(len(lst)):
    if "Jason" in lst:
        del lst[0]
        del lst[1]
        del lst[2]
    else:
        print("Jason not in lst")
Is this correct?
What I'm working with, thanks to Tigerhawk, is the following:
Original list:
lst = [['Curt', 333333333, 'curt#job.com'], ['Jason', 999999999, 'jason#live.com']]
def clean_lst(lst):
    name=str(input("Name you want to delete:")) #With this I get the lst on the 1st paragraph
    lst = sum(lst, [])
    if len(lst)==0:
        print("Empty List")
    elif name in lst:
        idx = lst.index(name)
        del lst[idx:idx+3]
    else:
        print("Name is not on the list")
The end result should look like this:
lst = [['Curt', 333333333, 'curt#job.com']]

If there can be more than one, start from the end of the list and delete l[i:i+3] whenever l[i] is equal to "Jason":
l = ['Jason', 999999999, 'jason#live.com', 'Curt', 333333333, 'curt#job.com', "Jason", "foo", "bar"]
for i in range(len(l) - 1, -1, -1):
    if l[i] == "Jason":
        del l[i:i+3]
Output:
['Curt', 333333333, 'curt#job.com']
As far as your own code goes, it presumes that "Jason" is always the first element, even after removing any previous entries, which seems unlikely, but only you know for sure.
The most efficient way to do this is to either create a new list or update the original using a generator function:
def rem_jas(l):
    it = iter(l)
    for ele in it:
        if ele == "Jason":
            # skip two elements
            next(it,"")
            next(it, "")
        else:
            yield ele
Output:
In [30]: l = ['Jason', 999999999, 'jason#live.com', 'Curt', 333333333, 'curt#job.com', "Jason", "foo", "bar"]
In [31]: l[:] = rem_jas(l)
In [32]: l
Out[32]: ['Curt', 333333333, 'curt#job.com']
If you can possibly have a Jason within two elements of another Jason, then you need to decide what the appropriate thing to do is. If consecutive Jasons are always at least 3 positions apart, it will be fine.
Based on your edit and the fact that you have a list of lists, not a flat list, it seems you want to remove each sublist where the name appears, which makes the code a lot simpler:
lst = [['Curt', 333333333, 'curt#job.com'], ['Jason', 999999999, 'jason#live.com']]
from itertools import chain
lst[:] = chain(*(sub for sub in lst if "Jason" not in sub))
print(lst)
Output:
['Curt', 333333333, 'curt#job.com']
sum is not a good way to flatten a list, itertools.chain is far more efficient.
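As a quick illustration of the two ways to flatten, here is a sketch with the question's data. Both give the same result. The difference is that sum creates a new list for every addition, while chain.from_iterable walks each sublist once:

```python
from itertools import chain

lst = [['Curt', 333333333, 'curt#job.com'], ['Jason', 999999999, 'jason#live.com']]

flat_sum = sum(lst, [])                      # repeated list concatenation
flat_chain = list(chain.from_iterable(lst))  # single pass over the sublists

print(flat_sum == flat_chain)  # True
```

With two sublists the cost is irrelevant. It starts to matter when there are many sublists.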
If you want to keep the sublists then don't flatten:
lst[:] = (sub for sub in lst if "Jason" not in sub)
print(lst)
Or a hybrid if you have multiple Jasons and need to add a few prints based on conditions:
def rem_jas(l, name):
    it = iter(l)
    for ele in it:
        if ele == name:
            # skip two elements
            next(it,"")
            next(it, "")
        else:
            yield ele
def clean_lst(l):
    name = "Jason"
    for sub in l:
        tmp = list(rem_jas(sub, name))
        if tmp:
            yield tmp
        if len(tmp) == len(sub):
            print("{} not in sublist".format(name))
lst[:] = clean_lst(lst)
print(lst)
Demo:
In [5]: lst = [['Curt', 333333333, 'curt#job.com'], ['Jason', 999999999, 'jason#live.com']]
In [6]: lst[:] = clean_lst(lst)
Jason not in sublist
In [7]: print(lst)
[['Curt', 333333333, 'curt#job.com']]
And lastly if you want to let the user know which sublist was missing the name:
def clean_lst(l):
    name = "Jason"
    for ind, sub in enumerate(l):
        tmp = list(rem_jas(sub, name))
        if tmp:
            yield tmp
        if len(tmp) == len(sub):
            print("{} not in sublist {}".format(name, ind))

You can simply search for the appropriate index and then delete a slice of three entries:
lst = ['Jason', 999999999, 'jason#live.com', 'Curt', 333333333, 'curt#job.com']
if 'Jason' in lst:
    idx = lst.index('Jason')
    del lst[idx:idx+3]
Result:
>>> lst
['Curt', 333333333, 'curt#job.com']
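If the flat list is always made of complete three-item records (name, number, email), which is an assumption on my part, another option is to walk it one record at a time and rebuild it. A sketch, with remove_record and width as made-up names:

```python
def remove_record(flat, name, width=3):
    """Return a new flat list without the records whose first field equals name.

    Assumes flat is a sequence of complete records, each `width` items long.
    """
    result = []
    for start in range(0, len(flat), width):
        record = flat[start:start + width]
        if record[0] != name:
            result.extend(record)
    return result


lst = ['Jason', 999999999, 'jason#live.com', 'Curt', 333333333, 'curt#job.com',
       'Jason', 'foo', 'bar']
print(remove_record(lst, 'Jason'))  # ['Curt', 333333333, 'curt#job.com']
```

Because only the first field of each record is compared, a number or email that happens to equal the name is never treated as the start of a record. It also handles several matching records without deleting from the list while looping over it.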
