Search a list of strings with a list of substrings

I have a list of strings and currently I can search for one substring at a time:
str = ['abc', 'efg', 'xyz']
[s for s in str if "a" in s]
which correctly returns
['abc']
Now let's say I have a list of substrings instead:
subs = ['a', 'ef']
I want a command like
[s for s in str if anyof(subs) in s]
which should return
['abc', 'efg']

>>> s = ['abc', 'efg', 'xyz']
>>> subs = ['a', 'ef']
>>> [x for x in s if any(sub in x for sub in subs)]
['abc', 'efg']
Don't use str as a variable name, it's a builtin.
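If the list of substrings is long, one alternative is to compile it into a single regular expression. This is only a sketch: the helper name filter_by_substrings is made up for illustration, and re.escape is used so that characters such as '.' are matched literally.

```python
import re

def filter_by_substrings(strings, subs):
    """Return the strings that contain at least one of the substrings."""
    if not subs:
        # An empty alternation would match everything; mirror any() and match nothing.
        return []
    pattern = re.compile("|".join(re.escape(sub) for sub in subs))
    return [x for x in strings if pattern.search(x)]

print(filter_by_substrings(['abc', 'efg', 'xyz'], ['a', 'ef']))  # ['abc', 'efg']
```

For a handful of substrings the any() version is likely just as fast and easier to read.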

Gets a little convoluted but you could do
[s for s in str if any([sub for sub in subs if sub in s])]

Simply use them one after the other:
[s for s in str for r in subs if r in s]
>>> r = ['abc', 'efg', 'xyz']
>>> s = ['a', 'ef']
>>> [t for t in r for x in s if x in t]
['abc', 'efg']
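One caveat with the nested-loop form: a string is emitted once per matching substring, so it can appear more than once in the result. A small check, using a made-up subs list in which both entries occur in 'abc':

```python
strings = ['abc', 'efg', 'xyz']
subs = ['a', 'b']  # both occur in 'abc'

# Nested loops: one output item per (string, substring) match.
nested = [t for t in strings for x in subs if x in t]
# any(): one output item per string, however many substrings match.
with_any = [t for t in strings if any(x in t for x in subs)]

print(nested)    # ['abc', 'abc']
print(with_any)  # ['abc']
```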

I still like map and filter, despite the arguments against them and the fact that a comprehension can always replace them. Hence, here is a map + filter + lambda version:
print(list(filter(lambda x: any(map(x.__contains__, subs)), s)))
which reads:
filter elements of s that contain any element from subs
I like how this uses words that carry a strong semantic meaning, rather than only if, for, in

Related

How to filter out strings that do not start with specific chars from a list [python]? [duplicate]

Given the list ['a','ab','abc','bac'], I want to compute a list with strings that have 'ab' in them. I.e. the result is ['ab','abc']. How can this be done in Python?

This simple filtering can be achieved in many ways with Python. The best approach is to use "list comprehensions" as follows:
>>> lst = ['a', 'ab', 'abc', 'bac']
>>> [k for k in lst if 'ab' in k]
['ab', 'abc']
Another way is to use the filter function. In Python 2:
>>> filter(lambda k: 'ab' in k, lst)
['ab', 'abc']
In Python 3, it returns an iterator instead of a list, but you can convert it:
>>> list(filter(lambda k: 'ab' in k, lst))
['ab', 'abc']
Though it's better practice to use a comprehension.

[x for x in L if 'ab' in x]

# To match only at the beginning of each string, not anywhere in it:
items = ['a', 'ab', 'abc', 'bac']
prefix = 'ab'
list(filter(lambda x: x.startswith(prefix), items))  # list() is needed in Python 3
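If several prefixes should be accepted, str.startswith also takes a tuple of prefixes. A minimal sketch, where the second prefix 'ba' is an assumption added for illustration:

```python
items = ['a', 'ab', 'abc', 'bac']
prefixes = ('ab', 'ba')  # hypothetical accepted prefixes; must be a tuple, not a list

matches = [x for x in items if x.startswith(prefixes)]
print(matches)  # ['ab', 'abc', 'bac']
```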

Tried this out quickly in the interactive shell:
>>> l = ['a', 'ab', 'abc', 'bac']
>>> [x for x in l if 'ab' in x]
['ab', 'abc']
>>>
Why does this work? Because the in operator is defined for strings to mean: "is substring of".
Also, you might want to consider writing out the loop as opposed to using the list comprehension syntax used above:
l = ['a', 'ab', 'abc', 'bac']
result = []
for s in l:
    if 'ab' in s:
        result.append(s)

mylist = ['a', 'ab', 'abc']
assert 'ab' in mylist

Trailing empty string after re.split()

I have two strings where I want to isolate sequences of digits from everything else.
For example:
import re
s = 'abc123abc'
print(re.split(r'(\d+)', s))
s = 'abc123abc123'
print(re.split(r'(\d+)', s))
The output looks like this:
['abc', '123', 'abc']
['abc', '123', 'abc', '123', '']
Note that in the second case, there's a trailing empty string.
Obviously I can test for that and remove it if necessary but it seems cumbersome and I wondered if the RE can be improved to account for this scenario.

You can use filter to drop the empty string, like this:
>>> s = 'abc123abc123'
>>> re.split(r'(\d+)', s)
['abc', '123', 'abc', '123', '']
>>> list(filter(None, re.split(r'(\d+)', s)))
['abc', '123', 'abc', '123']
Thanks to @chepner, you can also use a list comprehension:
>>> [x for x in re.split(r'(\d+)', s) if x]
['abc', '123', 'abc', '123']
This also works if the string contains symbols or other characters:
>>> s = '&^%123abc123$##123'
>>> list(filter(None, re.split(r'(\d+)', s)))
['&^%', '123', 'abc', '123', '$##', '123']

This has to do with the implementation of re.split() itself: you can't change it. When the function splits, it doesn't check anything that comes after the capture group, so it can't choose for you to either keep or discard the empty string that is left after splitting. It just splits there and leaves the rest of the string (which can be empty) to the next cycle.
If you don't want that empty string, you can get rid of it in various ways before collecting the results into a list. user1740577's is one example, but personally I prefer a list comprehension, since it's more idiomatic for simple filter/map operations:
parts = [part for part in re.split(r'(\d+)', s) if part]
I recommend against checking and getting rid of the element after the list has already been created, because it involves more operations and allocations.
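The same thing happens at the start of the string: whenever the input begins or ends with a match, re.split leaves an empty string on that side. A short demonstration with made-up inputs, followed by the filtering comprehension:

```python
import re

pattern = re.compile(r'(\d+)')

print(pattern.split('123abc'))  # ['', '123', 'abc']
print(pattern.split('abc123'))  # ['abc', '123', '']
print(pattern.split('123'))     # ['', '123', '']

# Filtering while building the list removes the empty strings at either end.
parts = [part for part in pattern.split('123abc123') if part]
print(parts)                    # ['123', 'abc', '123']
```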

A simple way to use regular expressions for this would be re.findall:
def bits(s):
    return re.findall(r"(\D+|\d+)", s)
bits("abc123abc123")
# ['abc', '123', 'abc', '123']
But it seems easier and more natural with itertools.groupby. After all, you are chunking an iterable based on a single condition:
from itertools import groupby
def bits(s):
    return ["".join(g) for _, g in groupby(s, key=str.isdigit)]
bits("abc123abc123")
# ['abc', '123', 'abc', '123']
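Because groupby also yields the key for each run, the same idea can tag every chunk as digits or not. A small sketch; the function name labelled_bits and the labels "digits"/"other" are made up for illustration:

```python
from itertools import groupby

def labelled_bits(s):
    """Split s into runs of digits and non-digits, tagging each run."""
    return [("digits" if is_digit else "other", "".join(group))
            for is_digit, group in groupby(s, key=str.isdigit)]

print(labelled_bits("abc123abc123"))
# [('other', 'abc'), ('digits', '123'), ('other', 'abc'), ('digits', '123')]
```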

delete the second item that starts with the same substring

I have a list l = ['abcdef', 'abcd', 'ghijklm', 'ghi', 'xyz', 'pqrs']
I want to delete the elements that start with the same sub-string if they exist (in this case 'abcd' and 'ghi').
N.B: in my situation, I know that the 'repeated' elements, if they exist, can be only 'abcd' or 'ghi'.
To delete them, I used this:
>>> l.remove('abcd') if ('abcdef' in l and 'abcd' in l) else l
>>> l.remove('ghi') if ('ghijklm' in l and 'ghi' in l) else l
>>> l
['abcdef', 'ghijklm', 'xyz', 'pqrs']
Is there a more efficient (or more automated) way to do this?

You can do it in time linear in the number of words, using O(n*m²) memory (where m is the length of your elements):
prefixes = set()
for word in l:
    for x in range(1, len(word)):
        prefixes.add(word[:x])
result = [word for word in l if word not in prefixes]
Iterate over each word and collect its proper prefixes in a set: the first character, then the first two characters, then three, all the way up to all the characters of the word except the last one. Then iterate over the list again; if a word appears in that set, it is a prefix of some other word in the list.

l = ['abcdef', 'abcd', 'ghijklm', 'ghi', 'xyz', 'pqrs']
for a in l[:]:
    for b in l[:]:
        if a.startswith(b) and a != b:
            l.remove(b)
print(l)
Output
['abcdef', 'ghijklm', 'xyz', 'pqrs']

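The following code finds every pair in which one element starts with another. Note that, as written, it removes the longer element of each pair (see the output), which is the reverse of what the question asks.
your_list = ['abcdef', 'abcd', 'ghijklm', 'ghi', 'xyz', 'pqrs']
print("Original list: %s" % your_list)
for element in your_list[:]:  # iterate over copies so removal is safe
    for element2 in your_list[:]:
        if element.startswith(element2) and element != element2:
            print("%s starts with %s" % (element, element2))
            print("Remove: %s" % element)
            your_list.remove(element)
            break  # element is gone; do not try to remove it again
print("Removed list: %s" % your_list)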
Output:
Original list: ['abcdef', 'abcd', 'ghijklm', 'ghi', 'xyz', 'pqrs']
abcdef starts with abcd
Remove: abcdef
ghijklm starts with ghi
Remove: ghijklm
Removed list: ['abcd', 'ghi', 'xyz', 'pqrs']
That said, I think there is a simpler solution: you can solve it with a list comprehension if you want.
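One way such a list comprehension might look, keeping the longer strings as the question asks. This is a sketch rather than the answerer's own code; it compares every pair, so it is quadratic in the length of the list:

```python
words = ['abcdef', 'abcd', 'ghijklm', 'ghi', 'xyz', 'pqrs']

# Keep a word only if no *other* word in the list starts with it.
result = [w for w in words
          if not any(other != w and other.startswith(w) for other in words)]
print(result)  # ['abcdef', 'ghijklm', 'xyz', 'pqrs']
```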

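# Andrew Allen's way
l = ['abcdef', 'abcd', 'ghijklm', 'ghi', 'xyz', 'pqrs']
i = 0
l = sorted(l)  # after sorting, a prefix sits directly before the words that extend it
while True:
    try:
        if l[i + 1].startswith(l[i]):
            del l[i]
            continue
        i += 1
    except IndexError:  # l[i + 1] ran past the end of the list
        break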
print(l)
#['abcdef', 'ghijklm', 'pqrs', 'xyz']

Try this:
l = ['abcdef', 'abcd', 'ghijklm', 'ghi', 'xyz', 'pqrs']
for i in l[:]:  # iterate over copies so that removing from l is safe
    for j in l[:]:
        if len(i) > len(j) and i.startswith(j):
            l.remove(j)

You can use
l = ['abcdef', 'abcd', 'ghijklm', 'ghi', 'xyz', 'pqrs']
if "abcdef" in l:  # only 1 check for containment instead of 2
    l = [x for x in l if x != "abcd"]  # to remove _all_ abcd
    # or, if you know there is only one abcd in it (remove() works in place and returns None):
    # l.remove("abcd")
This might be slightly faster (if you have far more elements than you show) because you only need to check once for "abcdef", and then scan the list once up to the first "abcd" (or through all of it) for the removal.
>>> l.remove('abcd') if ('abcdef' in l and 'abcd' in l) else l
checks l twice over its full size for containment (if unlucky) and then still needs to remove something from it.
DISCLAIMER:
If this is NOT a proven, measured bottleneck or security critical etc., I would not bother to do it unless I had measurements suggesting this is the biggest timesaver/optimization of all the code overall. With lists of up to some dozens/hundreds of elements (gut feeling; your data does not support any analysis), the estimated gain from it is negligible.

Create new list of substrings from list of strings

Is there an easy way in python of creating a list of substrings from a list of strings?
Example:
original list: ['abcd','efgh','ijkl','mnop']
list of substrings: ['bc','fg','jk','no']
I know this could be achieved with a simple loop but is there an easier way in python (Maybe a one-liner)?

Use slicing and list comprehension:
>>> lis = ['abcd','efgh','ijkl','mnop']
>>> [ x[1:3] for x in lis]
['bc', 'fg', 'jk', 'no']
Slicing:
>>> s = 'abcd'
>>> s[1:3]  # returns the substring from index 1 to index 2 (index 3 is not included)
'bc'

With a mix of slicing and list comprehensions you can do it like this
listy = ['abcd','efgh','ijkl','mnop']
[item[1:3] for item in listy]
>> ['bc', 'fg', 'jk', 'no']

You can use a one-liner list-comprehension.
Using slicing, and relative positions, you can then trim the first and last character in each item.
>>> l = ['abcd','efgh','ijkl','mnop']
>>> [x[1:-1] for x in l]
['bc', 'fg', 'jk', 'no']
If you are doing this many times, consider using a function:
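def trim(string, trim_left=1, trim_right=1):
    # Compute the end index explicitly so that trim_right=0 keeps the whole tail
    # (string[trim_left:-0] would return an empty string).
    return string[trim_left:len(string) - trim_right]
def trim_list(lst, trim_left=1, trim_right=1):
    return [trim(x, trim_left, trim_right) for x in lst]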
>>> trim_list(['abcd','efgh','ijkl','mnop'])
['bc', 'fg', 'jk', 'no']

If you want to do this in one line you could try this (in Python 3, wrap map in list to get a list back):
>>> list(map(lambda s: s[1:-1], ['abcd','efgh','ijkl','mnop']))
['bc', 'fg', 'jk', 'no']

Python: function that gets n-th element of a list

I have a list of strings like
['ABC', 'DEF', 'GHIJ']
and I want a list of strings containing the first letter of each string, i.e.
['A', 'D', 'G'].
I thought about doing that using map and the function that returns the first element of a list: my_list[0]. But how can I pass this to map?
Thanks.

you can try
In [14]: l = ['ABC', 'DEF', 'GHIJ']
In [15]: [x[0] for x in l]
Out[15]: ['A', 'D', 'G']

You should use a list comprehension, as @avasal suggests, since it's more Pythonic, but here's how to do it with map:
>>> from operator import itemgetter
>>> L = ['ABC', 'DEF', 'GHIJ']
>>> list(map(itemgetter(0), L))
['A', 'D', 'G']
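Since the title asks about the n-th element, the itemgetter approach generalizes by passing the index as a parameter. A minimal sketch; the helper name nth_of_each is made up for illustration, and it raises IndexError if any item is too short:

```python
from operator import itemgetter

def nth_of_each(items, n):
    """Return the n-th element (0-based) of every item."""
    return list(map(itemgetter(n), items))

L = ['ABC', 'DEF', 'GHIJ']
print(nth_of_each(L, 0))  # ['A', 'D', 'G']
print(nth_of_each(L, 2))  # ['C', 'F', 'I']
```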

use list comprehension like so:
results = [i[0] for i in mySrcList]

One way:
l1=['ABC', 'DEF', 'GHIJ']
l1=list(map(lambda x:x[0], l1))  # list() is needed in Python 3, where map returns an iterator

a=['ABC','DEF','GHI']
b=[]
for i in a:
    b.append(i[0])
b is the list you need.

Try this.
>>> myArray=['ABC', 'DEF', 'GHIJ']
>>> newArray=[]
>>> for i in map(lambda x:x[0],myArray):
...     newArray.append(i)
...
>>> print(newArray)
['A', 'D', 'G']
