Pythons fastest way of randomising case of a string - python

I want to randomise the case of a string, heres what I have:
word="This is a MixeD cAse stRing"
word_cap=''
for x in word:
if random.randint(0,1):
word_cap += x.upper()
else:
word_cap += x.lower()
word = word_cap
print word
Im wondering if you could use list comprehension to make it faster.
I couldnt seem to use the lower() and upper() functions in randomchoice
i tried to do something like
''.join(randomchoice(x.upper(),x.lower()) for x in word)
but i think thats wrong. something like that though is possible?

import random
s = 'this is a lower case string'
''.join(random.choice((str.upper,str.lower))(x) for x in s)
random.choice randomly selects one from two functions str.upper, str.lower.
Then this function is applied to x for each letter in the input string s.
If initial string has all the letters in lowercase, that I would use this code:
''.join(x.upper() if random.randint(0,1) else x for x in s)
because the initial code would use redundant str.lowercase on half of the letters in the case of lowercase initial string.
By the way, look at the other answer by Michael J. Barber. Python have hefty charges for function calls. In his code he calls str.upper only once. In my code str.upper is called for about half the symbols of the initial string. So still a temporary upper-cased string is created in memory, the time efficiency of his code may be much greater.
Lo and behold:
Code timing comparisons: https://ideone.com/eLygn

Try this:
word="this is a lower case string"
caps = word.upper()
''.join(x[random.randint(0,1)] for x in zip(word, caps))
This should outperform your version because it makes many fewer calls to upper and because, more importantly, it avoids the O(N^2) successive appends you used in the version with the loops.
With the modification to the question, you'll need to create both the lowercase and uppercase versions:
word="This is a MixeD cAse stRing"
caps = word.upper()
lowers = word.lower()
''.join(random.choice(x) for x in zip(caps, lowers))
As suggested by Tim Pietzcker in the comments, I've used random.choice to select the letters from the tuples created by the zip call.
Since the question has been changed to focus more on speed, the fastest approach is likely to be using Numpy:
''.join(numpy.where(numpy.random.randint(2, size=len(caps)), list(caps), list(lowers)))

Related

Find all the possible N-length anagrams - fast alternatives

I am given a sequence of letters and have to produce all the N-length anagrams of the sequence given, where N is the length of the sequence.
I am following a kinda naive approach in python, where I am taking all the permutations in order to achieve that. I have found some similar threads like this one but I would prefer a math-oriented approach in Python. So what would be a more performant alternative to permutations? Is there anything particularly wrong in my attempt below?
from itertools import permutations
def find_all_anagrams(word):
pp = permutations(word)
perm_set = set()
for i in pp:
perm_set.add(i)
ll = [list(i) for i in perm_set]
ll.sort()
print(ll)
If there are lots of repeated letters, the key will be to produce each anagram only once instead of producing all possible permutations and eliminating duplicates.
Here's one possible algorithm which only produces each anagram once:
from collections import Counter
def perm(unplaced, prefix):
if unplaced:
for element in unplaced:
yield from perm(unplaced - Counter(element), prefix + element)
else:
yield prefix
def permutations(iterable):
yield from perm(Counter(iterable), "")
That's actually not much different from the classic recursion to produce all permutations; the only difference is that it uses a collections.Counter (a multiset) to hold the as-yet-unplaced elements instead of just using a list.
The number of Counter objects produced in the course of the iteration is certainly excessive, and there is almost certainly a faster way of writing that; I chose this version for its simplicity and (hopefully) its clarity
This is very slow for long words with many similar characters. Slow compared to theoretical maximum performance that is. For example, permutations("mississippi") will produce a much longer list than necessary. It will have a length of 39916800, but but the set has a size of 34650.
>>> len(list(permutations("mississippi")))
39916800
>>> len(set(permutations("mississippi")))
34650
So the big flaw with your method is that you generate ALL anagrams and then remove the duplicates. Use a method that only generates the unique anagrams.
EDIT:
Here is some working, but extremely ugly and possibly buggy code. I'm making it nicer as you're reading this. It does give 34650 for mississippi, so I assume there aren't any major bugs. Warning again. UGLY!
# Returns a dictionary with letter count
# get_letter_list("mississippi") returns
# {'i':4, 'm':1, 'p': 2, 's':4}
def get_letter_list(word):
w = sorted(word)
c = 0
dd = {}
dd[w[0]]=1
for l in range(1,len(w)):
if w[l]==w[l-1]:
d[c]=d[c]+1
dd[w[l]]=dd[w[l]]+1
else:
c=c+1
d.append(1)
dd[w[l]]=1
return dd
def sum_dict(d):
s=0
for x in d:
s=s+d[x]
return s
# Recursively create the anagrams. It takes a letter list
# from the above function as an argument.
def create_anagrams(dd):
if sum_dict(dd)==1: # If there's only one letter left
for l in dd:
return l # Ugly hack, because I'm not used to dics
a = []
for l in dd:
if dd[l] != 0:
newdd=dict(dd)
newdd[l]=newdd[l]-1
if newdd[l]==0:
newdd.pop(l)
newl=create(newdd)
for x in newl:
a.append(str(l)+str(x))
return a
>>> print (len(create_anagrams(get_letter_list("mississippi"))))
34650
It works like this: For every unique letter l, create all unique permutations with one less occurance of the letter l, and then append l to all these permutations.
For "mississippi", this is way faster than set(permutations(word)) and it's far from optimally written. For instance, dictionaries are quite slow and there's probably lots of things to improve in this code, but it shows that the algorithm itself is much faster than your approach.
Maybe I am missing something, but why don't you just do this:
from itertools import permutations
def find_all_anagrams(word):
return sorted(set(permutations(word)))
You could simplify to:
from itertools import permutations
def find_all_anagrams(word):
word = set(''.join(sorted(word)))
return list(permutations(word))
In the doc for permutations the code is detailled and it seems already optimized.
I don't know python but I want to try to help you: probably there are a lot of other more performant algorithm, but I've thought about this one: it's completely recursive and it should cover all the cases of a permutation. I want to start with a basic example:
permutation of ABC
Now, this algorithm works in this way: for Length times you shift right the letters, but the last letter will become the first one (you could easily do this with a queue).
Back to the example, we will have:
ABC
BCA
CAB
Now you repeat the first (and only) step with the substring built from the second letter to the last one.
Unfortunately, with this algorithm you cannot consider permutation with repetition.

Reverse vowels in a String Python [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 5 years ago.
Improve this question
Im a beginner to your world. I have seen many attempts to answer this question using fancy builtin function and have found the answer on how to do this a thousand times but NONE using two for loops method.
I would like for the code to only reverse a vowel when it finds one in a string, using the two for loop method.
I coded something that didnt seem to work and I dont quite understand why,
for x in newWord:
for y in vowels:
if x == y:
newWord[x] = newWord[7]
print newWord
with vowels being a list of vowels, with newWord also being a list.
This code is currently not working, like most others who have tried the two for loop method. Any help is appreciated. Thanks
Roughly the approach you want to use is to make two passes over the list of characters. In the first pass you find the index of each vowel, building a list of work to be done (locations of vowels to be swapped).
Then you prepare your work list by matching the first and last items until you have less than two items left in the work list. (An odd number of vowels means that the one in the middle doesn't have to be swapped with anything).
Now you simply iterate over the work list, tuples/pairs of indexes. Swap the character at the first offset with the character at the other one for each pair. Done.
(This is assuming you want transform the list in place. If not then you can either just start with a copy: new_word = word[:] or can iterate over enumerate(word) and conditionally either append the character at each point (if the offset isn't in your work list) ... or append the offset character (if this index matches one of those in your list). I the latter case you might make your work list a dictionary instead).
Here's the code to demonstrate:
def rev_vowels(word):
word = list(word)
results = word[:]
vowel_locations = [index for index, char in enumerate(word) if char in 'aeiou']
work = zip(vowel_locations[:int(len(vowel_locations)/2)], reversed(vowel_locations))
for left, right in work:
results[left], results[right] = word[right], word[left]
return results
This does use a list comprehension, the zip() and reversed() builtins, a complex slice for the first argument to zip(), and the Python tuple packing idiom for swapping variables. So you might have to replace those with more verbose constructs to fulfill your "no fancy builtins" constraint.
Fundamentally, however, a list comprehension is just syntactic sugar around a for loop. So, overall, this demonstrates the approach using two for loops over the data. While I'm returning a copy of the data as my results the code would work without that. (That's why I'm using the tuple packing idiom on line seven (just before the return statement).
If this question is being asked in an interview context I'm reasonably confident that this would be a reasonably good answer. You can easily break down how to implement zip, how to expand a list comprehension into a traditional for suite (block), and the swap line could be two separate assignments when returning a copy rather than performing an in-place transformation on the data.
These variable names are very verbose. But that's to make the intentions especially clear.
This code should solve your problem without using any "fancy builtin functions".
def f(word):
vowels = "aeiou"
string = list(word)
i = 0
j = len(word)-1
while i < j:
if string[i].lower() not in vowels:
i += 1
elif string[j].lower() not in vowels:
j -= 1
else:
string[i], string[j] = string[j], string[i]
i += 1
j -= 1
return "".join(string)

Is there a better way to find all the contiguous substrings of a string that appear in a given dictionary

Is there a more efficient algorithm to find all the substrings that are part of a given language over alphabet than the following:
import string.ascii_lowercase as alphabet
languge = {'aa', 'bc', 'wxyz', 'uz'};
for i in xrange(len(alphabet)):
for j in xrange(i,len(alphabet)):
substirng = alphabet[i:j+1]
if substirng in languge:
print substirng
If I understand your question correctly. You have an alphabet, or string. In this case a string of 26 characters, a-z. You wish to check if any of the strings given to you are substrings of the aforementioned "alphabet string".
If this is indeed the case, there is a better way.
Your current approach amounts to computing all possible substrings from the alphabet, which is O(N^2) in the general case of an alphabet of size N and 26^2 in your particular case and then checking if the substring belongs to your predefined set. A much better approach would be to simply loop over your given strings and check if they are substrings of your alphabet. This is an O(N) operation for each string in your predefined set. This brings the complexity down to O(NM).
This is better if M is noticeably smaller than N.
There might be even better ways, but this is a good start.
Use Aho-Corasick or Rabin-Karp algorithms intended for this purpose:
It is a kind of dictionary-matching algorithm that locates elements of
a finite set of strings (the "dictionary") within an input text. It
matches all strings simultaneously
There are numerous Python implementations for these algorithms.
Complexity for Aho-Corasick searching is O(TextLength + AnswerLength), preprocessing O(n*σ), where n is overall length of all words in the dictionary, σ is alphabet size
For Rabin-Karp average time is O(TextLength + AnswerLength) too, but the worst time is O(TextLength * AnswerLength)
It is nicer if you use
from string import ascii_lowercase as alphabet instead
language = {'aa', 'bc', 'wxyz', 'uz'}
for item in language:
if item in alphabet:
print item
this works but a list comprehension is preferred
substrings = [item for item in language if item in alphabet]

Python searching a large list speed

I have run into a speed issue searching through a very large list. I have a file with a lot of errors and very strange words in it. I am trying to use difflib to find the closest match in a dictionary file I have that has 650,000 words in it. This approach below works really well but is very very slow and I was wondering if there is a better way to approach this problem. This is the code:
from difflib import SequenceMatcher
headWordList = [ #This is a list of 650,000 words]
openFile = open("sentences.txt","r")
for line in openFile:
sentenceList.append[line]
percentage = 0
count = 0
for y in sentenceList:
if y not in headwordList:
for x in headwordList:
m = SequenceMatcher(None, y.lower(), x)
if m.ratio() > percentage:
percentage = m.ratio()
word = x
if percentage > 0.86:
sentenceList[count] = word
count=count+1
Thanks for the help, software engineering is not even close to my strong suit. Much appreciated.
Two things that might provide some small help:
1) Use the approach in this SO answer to read through your large file the most efficiently.
2) Change your code from
for x in headwordList:
m = SequenceMatcher(None, y.lower(), 1)
to
yLower = y.lower()
for x in headwordList:
m = SequenceMatcher(None, yLower, 1)
You're converting each sentence to lower 650,000 times. No need for that.
You should change headwordList into a set.
The test word in headwordList will be very slow. It must do a string comparison on each word in headwordList, one word at a time. It will take time proportional to the length of the list; if you double the length of the list, you will double the amount of time it takes to do the test (on average).
With a set, it always takes the same amount of time to do the in test; it doesn't depend on the number of elements in the set. So that will be a huge speedup.
Now, this whole loop can be simplified:
for x in headwordList:
m = SequenceMatcher(None, y.lower(), x)
if m.ratio() > percentage:
percentage = m.ratio()
word = x
if percentage > 0.86:
sentenceList[count] = word
All this does is find the word from headwordList that has the highest ratio, and keep it (but only keep it if the ratio is over 0.86). Here's a faster way to do this. I'm going to change the name headwordList to just headwords as I want you to make it be a set and not a list.
def check_ratio(m):
return m.ratio()
y = y.lower() # do the .lower() call one time
m, word = max((SequenceMatcher(None, y, word), word) for word in headwords, key=check_ratio)
percentage = max(percentage, m.ratio()) # remember best ratio
if m.ratio() > 0.86:
setence_list.append(word)
This might seem a bit tricky but it is the fastest way to do this in Python. We will call the built-in max() function to find the SequenceMatcher result that has the highest ratio. First, we build a "generator expression" that tries all the words in headwords, calling SequenceMatcher() on each. But when we are done, we also want to know what the word was. So the generator expression produces tuples, where the first value in the tuple is the SequenceMatcher result and the second value is the word. The max() function cannot know that what we care about is the ratio, so we have to tell it that; we do this by making a function that tests what we care about, then passing that function as the key= argument. Now max() finds the value with the highest ratio for us. max() consumes all the values produced by the generator expression and returns a single value, which we then unpack into the varaibles m and word.
In Python, it is best practice to use variable names like sentence_list rather than sentenceList. Please see these guidelines: http://www.python.org/dev/peps/pep-0008/
It is not good practice to use an incrementing index variable and assign into indexed positions in a list. Rather, start with an empty list and use the .append() method function to append values.
Also, you might do better to build a dictionary of words and their ratios.
Note that your original code seems to have a bug: as soon as any word has a percentage over 0.86, all words are saved in sentenceList no matter what their ratio is. The code I wrote, above, only saves words where the word's own ratio was high enough.
EDIT: This is to answer a question about generator expressions needing to be parenthesized.
Whenever I get that error message, I usually split out the generator expression by itself and assign it to a variable. Like this:
def check_ratio(m):
return m.ratio()
y = y.lower() # do the .lower() call one time
genexp = ((SequenceMatcher(None, y, word), word) for word in headwords)
m, word = max(genexp, key=check_ratio)
percentage = max(percentage, m.ratio()) # remember best ratio
if m.ratio() > 0.86:
setence_list.append(word)
That's what I suggest. But if you don't mind a complicated line looking even busier, you can simply add an extra pair of parentheses as the error message suggests, so the generator expression is fully parenthesized. Like so:
m, word = max(((SequenceMatcher(None, y, word), word) for word in headwords), key=check_ratio)
Python lets you omit the explicit parentheses around a generator expression when you pass the expression to a function, but only if it is the only argument to that function. As we are also passing a key= argument, we need a fully parenthesized generator expression.
But I think it's easier to read if you split out the genexp on its own line.
EDIT: #Peter Wood pointed out that the documentation suggests reusing a SequenceMatcher for speed. I don't have time to test this, but I think this is the right way to do it.
Happily, the code got simpler! Always a good sign.
EDIT: I just tested the code. This code works for me; see if it works for you.
from difflib import SequenceMatcher
headwords = [
# This is a list of 650,000 words
# Dummy list:
"happy",
"new",
"year",
]
def words_from_file(filename):
with open(filename, "rt") as f:
for line in f:
for word in line.split():
yield word
def _match(matcher, s):
matcher.set_seq2(s)
return (matcher.ratio(), s)
ratios = {}
best_ratio = 0
matcher = SequenceMatcher()
for word in words_from_file("sentences.txt"):
matcher.set_seq1(word.lower())
if word not in headwords:
ratio, word = max(_match(matcher, word.lower()) for word in headwords)
best_ratio = max(best_ratio, ratio) # remember best ratio
if ratio > 0.86:
ratios[word] = ratio
print(best_ratio)
print(ratios)
1) I would store headwordList as a set, not a list, allowing for faster access as it is a hashed data structure.
2) You have sentenceList defined as a list then attempt to use it as a dictionary with sentenceList[x] = y. I would define a different structure specifically for counts.
3) You construct sentenceList which doesn't need to be done.
for line in file:
if line not in headwordList...
4) You never tokenize line which means you store the entire line before the newline character in sentenceList and see if it is in a wordlist
This is a data structures question. What you want to do, is to turn your list into something with faster element lookup speed, for example a binary search tree would work great here: time complexity is only O (log n) as opposed to O (n) in a list (which is in comparison incredibly fast).
There's a fairly simple explanation here:
http://interactivepython.org/runestone/static/pythonds/Trees/balanced.html
But if you are not familiar with tree concepts, you might want to start few chapters earlier:
http://interactivepython.org/runestone/static/pythonds/Trees/trees.html

What possible improvements can be made to a palindrome program?

I am just learning programming in Python for fun. I was writing a palindrome program and I thought of how I can make further improvements to it.
First thing that came to my mind is to prevent the program from having to go through the entire word both ways since we are just checking for a palindrome. Then I realized that the loop can be broken as soon as the first and the last character doesn't match.
I then implemented them in a class so I can just call a word and get back true or false.
This is how the program stands as of now:
class my_str(str):
def is_palindrome(self):
a_string = self.lower()
length = len(self)
for i in range(length/2):
if a_string[i] != a_string[-(i+1)]:
return False
return True
this = my_str(raw_input("Enter a string: "))
print this.is_palindrome()
Are there any other improvements that I can make to make it more efficient?
I think the best way to improvise write a palindrome checking function in Python is as follows:
def is_palindrome(s):
return s == s[::-1]
(Add the lower() call as required.)
What about something way simpler? Why are you creating a class instead of a simple method?
>>> is_palindrome = lambda x: x.lower() == x.lower()[::-1]
>>> is_palindrome("ciao")
False
>>> is_palindrome("otto")
True
While other answers have been given talking about how best to approach the palindrome problem in Python, Let's look at what you are doing.
You are looping through the string by using indices. While this works, it's not very Pythonic. Python for loops are designed to loop over the objects you want, not simply over numbers, as in other languages. This allows you to cut out a layer of indirection and make your code clearer and simpler.
So how can we do this in your case? Well, what you want to do is loop over the characters in one direction, and in the other direction - at the same time. We can do this nicely using the zip() and reversed() builtins. reversed() allows us to get the characters in the reversed direction, while zip() allows us to iterate over two iterators at once.
>>> a_string = "something"
>>> for first, second in zip(a_string, reversed(a_string)):
... print(first, second)
...
s g
o n
m i
e h
t t
h e
i m
n o
g s
This is the better way to loop over the characters in both directions at once. Naturally this isn't the most effective way of solving this problem, but it's a good example of how you should approach things differently in Python.
Building on Lattyware's answer - by using the Python builtins appropriately, you can avoid doing things like a_string[-(i+1)], which takes a second to understand - and, when writing more complex things than palindromes, is prone to off-by-one errors.
The trick is to tell Python what you mean to do, rather than how to achieve it - so, the most obvious way, per another answer, is to do one of the following:
s == s[::-1]
list(s) == list(reversed(s))
s == ''.join(reversed(s))
Or various other similar things. All of them say: "is the string equal to the string backwards?".
If for some reason you really do need the optimisation that you know you've got a palindrome once you're halfway (you usually shouldn't, but maybe you're dealing with extremely long strings), you can still do better than index arithmetic. You can start with:
halfway = len(s) // 2
(where // forces the result to an integer, even in Py3 or if you've done from __future__ import division). This leads to:
s[:halfway] == ''.join(reversed(s[halfway:]))
This will work for all even-length s, but fail for odd length s because the RHS will be one element longer. But, you don't care about the last character in it, since it is the middle character of the string - which doesn't affect its palindromeness. If you zip the two together, it will stop after the end of the short one - and you can compare a character at a time like in your original loop:
for f,b in zip(s[:half], reversed(s[half:])):
if f != b:
return False
return True
And you don't even need ''.join or list or such. But you can still do better - this kind of loop is so idiomatic that Python has a builtin function called all just to do it for you:
all(f == b for f,b in zip(s[:half], reversed(s[half:])))
Says 'all the characters in the front half of the list are the same as the ones in the back half of the list written backwards'.
One improvement I can see would be to use xrange instead of range.
It probably isn't a faster implementation but you could use a recursive test. Since you're learning, this construction is very useful in many situation :
def is_palindrome(word):
if len(word) < 2:
return True
if word[0] != word[-1]:
return False
return is_palindrome(word[1:-1])
Since this is a rather simple (light) function this construction might not be the fastest because of the overhead of calling the function multiple times, but in other cases where the computation are more intensive it can be a very effective construction.
Just my two cents.

Categories

Resources