Asserting equal length of string elements in a list

A function I created takes a list of strings (a long list of long sequences) as an argument. First, I want to make sure all strings are of equal length. Of course, I could do it by iterating over all sequences in a loop and checking the length. But I am wondering: is there any way to do it faster or more efficiently?
I've tried looking at the unittest module, but I am not sure whether it would suit here. Alternatively, I was thinking about creating a list of len(string) for all strings using a list comprehension and then checking whether all elements are the same. However, this seems like a lot of effort.

my_list = [ ... ]
FIXED_SIZE = 100 # Length that every string should have
result = all(len(my_string) == FIXED_SIZE for my_string in my_list)

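This may help you. If all strings are the same length, the output will be True; otherwise False.
str_list = ['ilo', 'jak']
# list() is needed in Python 3, where map() returns an iterator that cannot be indexed
str_len = list(map(len, str_list))
all(each_len == str_len[0] for each_len in str_len)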
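If the expected length is not known in advance, the same check can be written by collecting the distinct lengths into a set, which also avoids indexing into an empty list. This is only a minimal sketch; the helper name assert_equal_lengths is made up for illustration and is not from the answers above.

```python
def assert_equal_lengths(strings):
    """Raise ValueError if the strings do not all share one length."""
    lengths = {len(s) for s in strings}  # set of distinct lengths
    if len(lengths) > 1:
        raise ValueError(f"strings have differing lengths: {sorted(lengths)}")


assert_equal_lengths(['ilo', 'jak'])  # passes silently
assert_equal_lengths([])              # an empty list also passes
```

Like the generator versions above, this still looks at every string once; there is no way to verify all lengths without visiting each element.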

Related

Split string and take only part of it (python)

QUESTION
I have a list of strings, let's call it input_list, and every string in this list is made of five words divided only by a "%" character, like
"<word1>%<word2>%<word3>%<word4>%<word5>"
My goal is, for every element of input_list to make a string made only by <word3> and <word4> divided by the "%" sign, like this "<word3>%<word4>", and create a new list made by these strings.
So for example, if:
input_list = ['the%quick%brown%fox%jumps', 'over%the%lazy%dog%and']
then the new list will look like this
new_list = ['brown%fox', 'lazy%dog']
IMPORTANT NOTES AND POSSIBLE ANSWERS
The length of each word is random, so I can't just use string slicing or guess in any way how <word3> and <word4> start.
A possible way to answer this would be the following, but I want to know if there is a better and maybe (computationally) faster way, without having to create a new variable (current_list) and/or without having to consider/split the whole string (maybe using regex?)
input_list = ['the%quick%brown%fox%jumps', 'over%the%lazy%dog%and']
new_list = []
for element in input_list:
    current_list = element.split('%')
    # Join <word3> and <word4> back into one string
    final_element = '%'.join(current_list[2:4])
    new_list.append(final_element)
EDIT:
I tried to compare the running time of @Pac0's answer with that of @bb1's answer. With an input list of 100 strings, @Pac0's has a running time of 92.28286 seconds and @bb1's has a running time of 42.6106374 seconds, so I will consider @bb1's as the answer.

new_list = ['%'.join(w.split('%')[2:4]) for w in input_list]

You can use a regular expression (regex) with a capture group:
import re
pattern = re.compile('[^%]*%[^%]*%([^%]*%[^%]*)%[^%]*')
input_list = ['the%quick%brown%fox%jumps', 'over%the%lazy%dog%and']
result = [pattern.search(s).group(1) for s in input_list]
print(result)
Note: the "compile" part is not strictly needed, but can help performance if you have a lot of strings to process.

How about this?
input_list = ['the%quick%brown%fox%jumps', 'over%the%lazy%dog%and']
new_list = ['%'.join(x.split('%')[2:4]) for x in input_list]
print (new_list)
Output
['brown%fox', 'lazy%dog']
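To compare the split-based and regex-based answers on your own data, a timing sketch along these lines may help. The function names with_split and with_regex, the list size, and the repetition count are arbitrary choices for illustration; actual timings depend on the machine and the input.

```python
import re
import timeit

# 100 strings, built by repeating the two examples from the question
input_list = ['the%quick%brown%fox%jumps', 'over%the%lazy%dog%and'] * 50
pattern = re.compile(r'[^%]*%[^%]*%([^%]*%[^%]*)%[^%]*')


def with_split(strings):
    # Split on '%', keep the third and fourth words, join them again
    return ['%'.join(s.split('%')[2:4]) for s in strings]


def with_regex(strings):
    # Capture group 1 holds "<word3>%<word4>"
    return [pattern.search(s).group(1) for s in strings]


# Both approaches must agree before timing them
assert with_split(input_list) == with_regex(input_list)

for func in (with_split, with_regex):
    seconds = timeit.timeit(lambda: func(input_list), number=1000)
    print(f"{func.__name__}: {seconds:.4f} s for 1000 runs")
```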

Is order ensured in list iteration in Python?

Suppose we have a list of strings in Python, named strings, and we execute this line:
lengths = [ len(value) for value in strings ]
Is the strings list order kept? I mean, can I be sure that lengths[i] corresponds to strings[i]?
I've tried many times and it works, but I'm not sure if my experiments were special cases or the rule.
Thanks in advance

For lists, yes. That is one of the fundamental properties of lists: that they're ordered.
It should be noted, though, that what you're doing is known as "parallel arrays" (having several "arrays" to maintain a linked state), and is often considered to be poor practice. If you change one list, you must change the other in the same way, or they'll be out of sync, and then you have real problems.
A dictionary would likely be the better option here:
lengths_dict = {value:len(value) for value in strings}
print(lengths_dict["some_word"]) # Prints its length
Or maybe if you want lookups by index, a list of tuples:
lengths = [(value, len(value)) for value in strings]
word, length = lengths[1]

Yes. Since lists in Python are sequences, you can be sure that each entry in the list of lengths corresponds to the string at the same index,
as the following code shows:
a = ['a', 'ab', 'abc', 'abcd']
print([len(i) for i in a])
Output
[1, 2, 3, 4]
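The correspondence can also be checked directly. The sketch below uses zip, which pairs the i-th element of one list with the i-th element of the other; the sample data is the same as in the answer above.

```python
strings = ['a', 'ab', 'abc', 'abcd']
lengths = [len(value) for value in strings]

# zip pairs the i-th string with the i-th length
for value, length in zip(strings, lengths):
    assert len(value) == length

print(lengths)  # [1, 2, 3, 4]
```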

Count number of entries in a list when the list has one entry

After a lot of searching and trying stuff out, I haven't been able to find an answer. My problem is that I want to count the number of entries in a list, but my list is "dynamic", so it can contain either a lot of entries or only one. The problem is that the len() function does not return 1 when there is only one entry in the list; it returns the number of characters. When the number of entries is above 1, len() works fine. But I need it to return 1, not the length of the string in the list. How can I accomplish this?
So it does something like this:
List = ('abcdefg')
len(List) # returns 7 instead of 1
but this works fine:
List = ('abcdefg','aqwedfd','foobar')
len(List) # returns 3

Python makes lists with []
>>> my_list = ['abcdefg']
>>> len(my_list)
1

If you want to count the number of elements in a list, use len(). You have probably accidentally counted something that is not a list, e.g. a string.
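The likely cause is that parentheses alone do not create a tuple; the comma does. A short sketch of the difference, with variable names chosen only for illustration:

```python
not_a_tuple = ('abcdefg')   # just a parenthesized string
one_tuple = ('abcdefg',)    # the trailing comma makes a one-element tuple
one_list = ['abcdefg']      # a one-element list

print(type(not_a_tuple).__name__, len(not_a_tuple))  # str 7
print(type(one_tuple).__name__, len(one_tuple))      # tuple 1
print(type(one_list).__name__, len(one_list))        # list 1
```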

Iterating a string and replacing an element in python

I am attempting to search through two strings looking for matching elements. If the strings have two elements in common that are in different positions, I want to make that element in the 'guess' string a COW. If the strings have two elements in the same position, the element is a BULL.
Here is what I have:
if index(number,i) in guess and not index(guess,i) == index(guess,i):
    replace(index(guess,i),'COW')
if index(guess,i) == index(number,i):
    replace(index(guess,i),'BULL')
I'm not sure if I'm using index correctly.

First off, you need to be using index() and replace() as string methods, like Martijn said in a comment.
This would be like so: guess.index(i) to find the index of i in the string guess.
You might want to check out find() which will do the same as index() but won't raise an exception when the substring is not found.
Also note that you are seeing if the result of index() is in the string guess. That is an error, since an integer cannot be in a string! index() returns an integer!
Then consider that you are stating ... and not guess.index(i) == guess.index(i): (I fixed the index code) which makes no sense, since of course they are equal! They are the same thing!
Lastly, you are using replace incorrectly.
From the documentation, replace takes a string as the first argument - not an index! Try using it like so: guess = guess.replace(i, 'BULL'). That will change guess to have all occurrences of i replaced by the string 'BULL'.
I wasn't concerned with your actual algorithm here, just with your basic errors.
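The points above about index(), find() and replace() can be tried out in a few lines. This is only an illustration of the string methods, with a made-up guess value, not a solution to the game itself.

```python
guess = '1234'

print(guess.index('3'))  # 2: position of the first '3'
print(guess.find('9'))   # -1: find() reports a missing substring without raising

try:
    guess.index('9')
except ValueError:
    print('index() raises ValueError when the substring is missing')

# Strings are immutable, so replace() returns a new string
guess = guess.replace('3', 'BULL')
print(guess)             # 12BULL4
```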

I wouldn't use the string index() method. Instead, I would turn the string's elements into a list, then say:
listOne = ['hello', 'goodbye', 'adios', 'shalom']
listTwo = ['hello', 'adios', 'arrivaderci']
def cowbull(L1, L2):
    for i in range(len(L1)):
        if L1[i] in L2:
            if i < len(L2) and L1[i] == L2[i]:
                # Same element in the same position
                L1[i] = 'BULL'
                L2[i] = 'BULL'
            else:
                # Same element in a different position: look up where it sits
                # in L2 before overwriting L1[i]
                j = L2.index(L1[i])
                L1[i] = 'COW'
                L2[j] = 'COW'
This is just how I would do it; your approach and William's code may work well too. I am used to doing it this way, and although it may not be as efficient as his, it usually works well.
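For the string case in the question, a sketch that builds a new list of marks instead of modifying the inputs could look like this. The function name mark_guess and the sample values are assumptions for illustration, and the rules are simplified: repeated characters are not treated specially.

```python
def mark_guess(number, guess):
    """Return a list with 'BULL', 'COW', or the original character per position."""
    marks = []
    for position, char in enumerate(guess):
        if position < len(number) and number[position] == char:
            marks.append('BULL')  # same character, same position
        elif char in number:
            marks.append('COW')   # character occurs elsewhere in number
        else:
            marks.append(char)    # no match: keep the character
    return marks


print(mark_guess('1234', '1325'))  # ['BULL', 'COW', 'COW', '5']
```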

Check if string in strings

I have a huge list containing many strings like:
['xxxx','xx','xy','yy','x',......]
Now I am looking for an efficient way to remove all strings that are present within another string. For example, 'xx' and 'x' both fit in 'xxxx'.
As the dataset is huge, I was wondering if there is an efficient method for this besides
if a in b:
The complete code, with some attempted optimizations:
for x in range(len(taxlistcomplete)):
    if delete == True:
        x = x - 1
        delete = False
    for y in range(len(taxlistcomplete)):
        if taxlistcomplete[x] in taxlistcomplete[y]:
            if x != y:
                print(x, y)
                print(taxlistcomplete[x])
                del taxlistcomplete[x]
                delete = True
                break
    print(x, len(taxlistcomplete))
An updated version of the code:
for x in enumerate(taxlistcomplete):
    if delete == True:
        # If an element is removed, I need to step 1 back and continue looping.....
        delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                print(x[1], y[1])
                print(taxlistcomplete[x[0]])
                del taxlistcomplete[x[0]]
                delete = True
                break
    print(x, len(taxlistcomplete))
This is now implemented with enumerate, but I am wondering whether it is more efficient, and how to implement the delete step so that I have less to search in as well.
Just a short thought...
Basically, what I would like to see:
if an element does not match any other element in the list, write it to a file.
Thus if 'xxxxx' not in 'xx','xy','wfirfj',etc... print/save
A new, simple version, as I don't think I can optimize it much further anyway...
print('comparison')
file = open('output.txt','a')
for x in enumerate(taxlistcomplete):
    delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                taxlistcomplete[x[0]] = ''
                delete = True
                break
    if delete == False:
        file.write(str(x))

x in <string> is fast, but checking each string against all other strings in the list will take O(n^2) time. Instead of shaving a few cycles by optimizing the comparison, you can achieve huge savings by using a different data structure so that you can check each string in just one lookup: For two thousand strings, that's two thousand checks instead of four million.
There's a data structure called a "prefix tree" (or trie) that allows you to very quickly check whether a string is a prefix of some string you've seen before. Google it. Since you're also interested in strings that occur in the middle of another string x, index all substrings of the form x, x[1:], x[2:], x[3:], etc. (So: only n substrings for a string of length n). That is, you index substrings that start in position 0, 1, 2, etc. and continue to the end of the string. That way you can just check if a new string is an initial part of something in your index.
You can then solve your problem in O(n) time like this:
1. Order your strings in order of decreasing length. This ensures that no string could be a substring of something you haven't seen yet. Since you only care about length, you can do a bucket sort in O(n) time.
2. Start with an empty prefix tree and loop over your ordered list of strings. For each string x, use your prefix tree to check whether it is a substring of a string you've seen before. If not, add its substrings x, x[1:], x[2:] etc. to the prefix tree.
Deleting in the middle of a long list is very expensive, so you'll get a further speedup if you collect the strings you want to keep into a new list (the actual string is not copied, just the reference). When you're done, delete the original list and the prefix tree.
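The prefix-tree steps described above could be sketched roughly as follows. This is a naive illustration, not a tuned implementation: the trie is built from nested dicts, the name remove_contained is made up, exact duplicates are collapsed into one entry, and inserting every suffix costs time and memory that grow quadratically with the length of each kept string.

```python
def remove_contained(strings):
    """Return the strings that are not substrings of another string."""
    trie = {}  # nested dicts: character -> child node
    kept = []
    # dict.fromkeys drops exact duplicates while keeping the first occurrence
    for s in sorted(dict.fromkeys(strings), key=len, reverse=True):
        # s is a substring of a kept string exactly when it is a prefix
        # of one of the suffixes already stored in the trie
        node = trie
        found = True
        for ch in s:
            if ch not in node:
                found = False
                break
            node = node[ch]
        if found and kept:
            continue  # contained in something seen before, so skip it
        kept.append(s)
        # Index the suffixes s, s[1:], s[2:], ...
        for start in range(len(s)):
            node = trie
            for ch in s[start:]:
                node = node.setdefault(ch, {})
    return kept


print(remove_contained(['xxxx', 'xx', 'xy', 'yy', 'x']))  # ['xxxx', 'xy', 'yy']
```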
If that's too complicated for you, at least don't compare everything with everything. Sort your strings by size (in decreasing order), and only check each string against the ones that have come before it. This will give you a 50% speedup with very little effort. And do make a new list (or write to a file immediately) instead of deleting in place.
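The simpler fallback, sorting by decreasing length and checking each string only against the ones already kept, fits in a few lines. The function name drop_substrings is an assumption for illustration; the approach is still quadratic in the worst case.

```python
def drop_substrings(strings):
    """Keep only strings that are not contained in an earlier (longer) kept string."""
    kept = []  # new list, so nothing is deleted in place
    for s in sorted(strings, key=len, reverse=True):
        # Anything that could contain s is at least as long, so it was seen already
        if not any(s in longer for longer in kept):
            kept.append(s)
    return kept


print(drop_substrings(['xxxx', 'xx', 'xy', 'yy', 'x']))  # ['xxxx', 'xy', 'yy']
```

Note that the result comes back ordered by length, not in the original order of the input list.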

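Here is a simple approach, assuming you can identify a character (I will use '$' in my example) that is guaranteed not to be in any of the original strings. The strings are processed longest first, so each one can only be contained in something that has already been added:
result = ''
# Longest first: a string can only be a substring of an equal or longer one
for substring in sorted(taxlistcomplete, key=len, reverse=True):
    if substring not in result:
        result += '$' + substring
# result starts with '$', so drop the empty first piece
taxlistcomplete = result.split('$')[1:]
This leverages Python's internal optimizations for substring searching by just making one big string to substring-search :)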

Here is my suggestion. First I sort the elements by length, because the shorter a string is, the more likely it is to be a substring of another string. Then I have two for loops, where I run through the list and remove every element el that is a substring of another element. Note that the first for loop only passes each element once.
By sorting the list first, we destroy the order of elements in the list. So if the order is important, then you can't use this solution.
Edit: I assume there are no identical elements in the list, so that when el == el2, it's because it's the same element.
a = ["xyy", "xx", "zy", "yy", "x"]
a.sort(key=len)
for el in a[:]:  # iterate over a copy, because a is modified inside the loop
    for el2 in a:
        if el in el2 and el != el2:
            a.remove(el)  # el is contained in el2, so drop el
            break

Using a list comprehension with the in operator is the fastest and most Pythonic way of solving your problem:
[element for element in arr if 'xx' in element]
