python unique string creation

I've looked at several other SO questions (and google'd tons) that are 'similar'-ish to this, but none of them seem to fit my question right.
I am trying to make a non fixed length, unique text string, only containing characters in a string I specify. E.g. made up of capital and lower case a-zA-Z characters. (for this example I use only a, b, and c lower case)
Something like this (broken code below)
def next(index, validCharacters = 'abc'):
    return uniqueShortAsPossibleString
The index argument would be an index (integer) that relates to a text string, for instance:
next(1) == 'a'
next(2) == 'b'
next(3) == 'c'
next(4) == 'aa'
next(5) == 'ab'
next(6) == 'ac'
next(7) == 'ba'
next(8) == 'bb'
next(9) == 'bc'
next(10) == 'ca'
next(11) == 'cb'
next(12) == 'cc'
And so forth. The string:
Must be unique, I'll be using it as an identifier, and it can only be a-zA-Z chars
As short as possible, with lower index numbers being shortest (see above examples)
Contain only the characters specified in the given argument string validCharacters
In conclusion, how could I write the next() function to relate an integer index value to a unique short string with the characters specified?
P.S. I'm new to SO, this site has helped me tons throughout the years, and while I've never made an account or asked a question (till now), I really hope I've done an okay job explaining what I'm trying to accomplish with this.

What you are trying to do is write the parameter of the next function in another base.
Let's suppose validCharacters contains k characters: then the job of the next function will be to transform parameter p into base k by using the characters in validCharacters.
In your example, you can write the numbers in base 3, using the digits 1, 2, 3 rather than 0, 1, 2 (so that no string has to stand for a leading zero), and then associate each digit with one letter:
next(1) -> 1 -> 'a'
next(2) -> 2 -> 'b'
next(4) -> 11 -> 'aa'
next(7) -> 21 -> 'ba'
And so forth.
With this method, you can call next(x) without knowing or computing any next(x-i), which you can't do with iterative methods.
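The mapping described above (digits run from 1 to k, with no zero digit) is usually called bijective base-k numeration. A minimal sketch of it follows; the name index_to_id and the ValueError check are illustrative assumptions, not part of the original answer.

```python
def index_to_id(index, valid_characters="abc"):
    """Map 1, 2, 3, ... to 'a', 'b', 'c', 'aa', ... (bijective base-k)."""
    if index < 1:
        raise ValueError("index must be a positive integer")
    k = len(valid_characters)
    digits = []
    while index > 0:
        # Shift to 0-based before dividing so that no digit ever means "zero".
        index, remainder = divmod(index - 1, k)
        digits.append(valid_characters[remainder])
    # Digits were produced least significant first.
    return "".join(reversed(digits))


print([index_to_id(i) for i in range(1, 13)])
# ['a', 'b', 'c', 'aa', 'ab', 'ac', 'ba', 'bb', 'bc', 'ca', 'cb', 'cc']
```

Subtracting 1 before each division is the only difference from an ordinary base conversion; it is what makes 'a' and 'aa' distinct identifiers.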

You're trying to convert a number to a number in another base, but using arbitrary characters for the digits of that base.
import string
chars = string.ascii_lowercase + string.ascii_uppercase
def identifier(x, chars):
    output = []
    base = len(chars)
    while x:
        output.append(chars[x % base])
        x //= base  # integer division; a plain / would produce a float in Python 3
    return ''.join(reversed(output))
print(identifier(1, chars))
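This lets you jump to any position. Because you are simply counting, the identifiers are guaranteed to be unique, any character set of two or more characters works, and lower numbers give shorter identifiers. Note that chars[0] plays the role of the digit zero here, so identifier(1, chars) is 'b' and identifier(0, chars) is the empty string.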

itertools can always give you obfuscated one-liner iterators:
from itertools import combinations_with_replacement, chain
chars = 'abc'
a = chain(*(combinations_with_replacement(chars, i) for i in range(1, len(chars) + 1)))
Basically, this code creates an iterator that chains together all combinations (with repetition, order ignored) of chars of lengths 1, 2, ..., len(chars).
The output of for x in a: print(x) is:
('a',)
('b',)
('c',)
('a', 'a')
('a', 'b')
('a', 'c')
('b', 'b')
('b', 'c')
('c', 'c')
('a', 'a', 'a')
('a', 'a', 'b')
('a', 'a', 'c')
('a', 'b', 'b')
('a', 'b', 'c')
('a', 'c', 'c')
('b', 'b', 'b')
('b', 'b', 'c')
('b', 'c', 'c')
('c', 'c', 'c')

You can't really "associate" the index with anything, but the following is a generator that will yield the output you're asking for:
from itertools import combinations_with_replacement
def uniquenames(chars):
    for i in range(1, len(chars)):
        for j in combinations_with_replacement(chars, i):
            yield ''.join(j)
print(list(uniquenames('abc')))
# ['a', 'b', 'c', 'aa', 'ab', 'ac', 'bb', 'bc', 'cc']

As far as I understand, we shouldn't specify a maximum length for the output string, so range is not enough:
>>> from itertools import combinations_with_replacement, count
>>> def u(chars):
...     for i in count(1):
...         for k in combinations_with_replacement(chars, i):
...             yield "".join(k)
...
>>> g = u("abc")
>>> next(g)
'a'
>>> next(g)
'b'
>>> next(g)
'c'
>>> next(g)
'aa'
>>> next(g)
'ab'
>>> next(g)
'ac'
>>> next(g)
'bb'
>>> next(g)
'bc'
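Note that combinations_with_replacement ignores order, so the session above goes from 'ac' straight to 'bb' and never yields 'ba'. A possible variant, sketched here with itertools.product swapped in (the name unique_names is an assumption for illustration), produces the exact sequence from the question:

```python
from itertools import count, islice, product


def unique_names(chars):
    """Yield every string over chars, shortest first, in the order of chars."""
    for length in count(1):
        # product keeps order-sensitive tuples such as ('b', 'a').
        for combo in product(chars, repeat=length):
            yield "".join(combo)


first_twelve = list(islice(unique_names("abc"), 12))
print(first_twelve)
# ['a', 'b', 'c', 'aa', 'ab', 'ac', 'ba', 'bb', 'bc', 'ca', 'cb', 'cc']
```

Being a generator, this still has to walk through every earlier string to reach a given index, unlike the base-conversion approach.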

So it seems like you are trying to enumerate through all the strings generated by the language {'a','b','c'}. This can be done using finite state automata (though you don't want to do that). One simple way to enumerate through the language is to start with a list and append all the strings of length 1 in order (so a then b then c). Then append each letter in the alphabet to each string of length n-1. This will keep it in order as long as you append all the letters in the alphabet to a given string before moving on to the lexicographically next string.
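The list-based enumeration described above can be sketched with a queue. This is only one reading of that description; the function name enumerate_language is hypothetical.

```python
from collections import deque
from itertools import islice


def enumerate_language(alphabet):
    """Yield all non-empty strings over alphabet, shortest first."""
    queue = deque(alphabet)  # start with every string of length 1, in order
    while queue:
        current = queue.popleft()
        yield current
        # Append every letter to the current string before moving on to the
        # next string, which keeps each length in lexicographic order.
        for letter in alphabet:
            queue.append(current + letter)


print(list(islice(enumerate_language("abc"), 12)))
# ['a', 'b', 'c', 'aa', 'ab', 'ac', 'ba', 'bb', 'bc', 'ca', 'cb', 'cc']
```

The queue grows with every string produced, so this is better suited to illustrating the idea than to generating very large index values.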

Related

Create string combination based on replacement

Given a word and a dictionary of replacement characters, I need to form a Combination of characters based on the replacement
Input
word = 'accompanying'
substitutions={'c':['$'], 'a': ['4'], 'g': ['9']}
Output
{'a$$ompanyin9', 'ac$ompanyin9','a$companyin9','4ccomp4nying', '4$$omp4nying',
'4$comp4nying','4c$omp4nying', '4ccomp4nyin9', 'a$$ompanying', 'a$companying', 'ac$ompanying',
'accompanyin9', 'accompanying', '4$$omp4nyin9', '4$comp4nyin9', '4c$omp4nyin9','etc.,'}
I wrote some code, but it does not give me all the combinations I am expecting.
Sample Code
from itertools import product
substitutions={'c':['$'], 'a': ['4'], 'g': ['9']}
for key in substitutions.keys():
    if key not in substitutions[key]:
        substitutions[key].append(key)
wordPossibilities = []
word = 'accompanying'
for substitute in [zip(substitutions.keys(),ch) for ch in product(*substitutions.values())]:
    temp=word
    for replacement in substitute:
        temp=temp.replace(*replacement)
    wordPossibilities.append(temp)
print(set(wordPossibilities))
My Output
{'4$$omp4nyin9', 'a$$ompanyin9', 'a$$ompanying', 'accompanyin9',
'accompanying', '4ccomp4nyin9', '4$$omp4nying', '4ccomp4nying'}
My code replaces every occurrence of a character in the string whenever a replacement exists. How do I make replacements per index so that I get all possible combinations?

It is clean and straightforward to use a generator with recursion:
word = 'accompanying'
subs={'c':['$'], 'a': ['4'], 'g': ['9']}
def get_subs(d, c = []):
    if not d:
        yield ''.join(c)
    else:
        for i in [d[0], *subs.get(d[0], [])]:
            yield from get_subs(d[1:], c+[i])
print(list(get_subs(word)))
Output:
['accompanying', 'accompanyin9', 'accomp4nying', 'accomp4nyin9', 'ac$ompanying', 'ac$ompanyin9', 'ac$omp4nying', 'ac$omp4nyin9', 'a$companying', 'a$companyin9', 'a$comp4nying', 'a$comp4nyin9', 'a$$ompanying', 'a$$ompanyin9', 'a$$omp4nying', 'a$$omp4nyin9', '4ccompanying', '4ccompanyin9', '4ccomp4nying', '4ccomp4nyin9', '4c$ompanying', '4c$ompanyin9', '4c$omp4nying', '4c$omp4nyin9', '4$companying', '4$companyin9', '4$comp4nying', '4$comp4nyin9', '4$$ompanying', '4$$ompanyin9', '4$$omp4nying', '4$$omp4nyin9']
However, itertools.product can be used for a shorter solution:
from itertools import product as prod
s = ''.join('{}' if i in subs else i for i in word)
result = [s.format(*i) for i in prod(*[[i, *subs[i]] for i in word if i in subs])]
Output:
['accompanying', 'accompanyin9', 'accomp4nying', 'accomp4nyin9', 'ac$ompanying', 'ac$ompanyin9', 'ac$omp4nying', 'ac$omp4nyin9', 'a$companying', 'a$companyin9', 'a$comp4nying', 'a$comp4nyin9', 'a$$ompanying', 'a$$ompanyin9', 'a$$omp4nying', 'a$$omp4nyin9', '4ccompanying', '4ccompanyin9', '4ccomp4nying', '4ccomp4nyin9', '4c$ompanying', '4c$ompanyin9', '4c$omp4nying', '4c$omp4nyin9', '4$companying', '4$companyin9', '4$comp4nying', '4$comp4nyin9', '4$$ompanying', '4$$ompanyin9', '4$$omp4nying', '4$$omp4nyin9']

Obviously, you need to rewrite your logic to consider individual instances of the desired letters, rather than each unique letter. Find all occurrences of desired letters; use itertools to get the power set; make the indicated substitutions for each element of the power set. power_set comes from this SO answer. I've left the code "exploded" in some places to show the logic more readily. You will likely want to wrap the final loop into a one-line return expression.
from itertools import chain, combinations
def power_set(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
substitutions={'c':['$'], 'a': ['4', 'a'], 'g': ['9']}
word = 'accordingly'
# Get index of each desired letter and its possible substitutions
sub_idx = [(pos, letter, sub_letter) for pos, letter in enumerate(word)
           if letter in list(substitutions.keys()) for sub_letter in substitutions[letter]]
print("Replacement set", sub_idx)
for possibility in power_set(sub_idx):
    # Make each of the substitutions indicated in the power set
    new_word = list(word)
    for pos, _, sub_letter in possibility:
        new_word[pos] = sub_letter
    print(''.join(new_word))
Output:
Replacement set [(0, 'a', '4'), (0, 'a', 'a'), (1, 'c', '$'), (2, 'c', '$'), (8, 'g', '9')]
accordingly
4ccordingly
accordingly
a$cordingly
ac$ordingly
accordin9ly
accordingly
4$cordingly
4c$ordingly
4ccordin9ly
a$cordingly
ac$ordingly
accordin9ly
a$$ordingly
a$cordin9ly
ac$ordin9ly
a$cordingly
ac$ordingly
accordin9ly
4$$ordingly
4$cordin9ly
4c$ordin9ly
a$$ordingly
a$cordin9ly
ac$ordin9ly
a$$ordin9ly
a$$ordingly
a$cordin9ly
ac$ordin9ly
4$$ordin9ly
a$$ordin9ly
a$$ordin9ly

Get sequences from a file and store them into a list in python

Here is the code (I took it from the discussion "Translation DNA to Protein", but here I'm using an RNA file instead of a DNA file):
from itertools import takewhile
def translate_rna(sequence, d, stop_codons=('UAA', 'UGA', 'UAG')):
    start = sequence.find('AUG')
    # Take sequence from the first start codon
    trimmed_sequence = sequence[start:]
    # Split it into triplets
    codons = [trimmed_sequence[i:i + 3] for i in range(0, len(trimmed_sequence), 3)]
    # Take all codons until first stop codon
    coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3, codons)
    # Translate and join into string (d is the codon table passed in)
    protein_sequence = ''.join([d[codon] for codon in coding_sequence])
    # This line assumes there is always stop codon in the sequence
    return "{0}".format(protein_sequence)
Calling the translate_rna function:
sequence = ''
for line in open("to_rna", "r"):
    sequence += line.strip()
translate_rna(sequence, d)
My to_rna file looks like:
CCGCCCCUCUGCCCCAGUCACUGAGCCGCCGCCGAGGAUUCAGCAGCCUCCCCCUUGAGCCCCCUCGCUU
CCCGACGUUCCGUUCCCCCCUGCCCGCCUUCUCCCGCCACCGCCGCCGCCGCCUUCCGCAGGCCGUUUCC
ACCGAGGAAAAGGAAUCGUAUCGUAUGUCCGCUAUCCAG.........
The function translates only the first protein (from the first AUG to the first stop codon).
I think the problem is in this line:
# Take all codons until first stop codon
coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3 , codons)
My question is: how can I tell Python, after it has found the first AUG and stored the coding sequence, to search for the next AUG in the RNA file and store that sequence in the next position of a list?
As a result, I want to have a list like this:
['here_is_the_1st_coding_sequence', 'here_is_the_2nd_coding_sequence', ...]
PS: This is homework, so I can't use Biopython.
EDIT:
A simple way to describe the problem:
From this code:
from itertools import takewhile
lst = ['N', 'A', 'B', 'Z', 'C', 'A', 'V', 'V', 'Z', 'X']
ch = ''.join(lst)
stop = 'Z'
start = ch.find('A')
seq = takewhile(lambda x: x not in stop, ch)
I want to get this:
['AB', 'AVV']
EDIT 2:
For instance, from this string:
UUUAUGCGCCGCUAACCCAUGGUUCCCUAGUGGUCCUGACGCAUGUGA
I should get as result:
['AUGCGCCGC', 'AUGGUUCCC', 'AUG']

Looking at your simplified example (I couldn't quite follow the main code), it looks like you want to take the substring that starts at the first occurrence of one string and then split it on every occurrence of another string. If that is wrong, please tell me and I can update accordingly.
To achieve this, Python has a built-in str.split(sub), which splits a string at every occurrence of sub, and str.index(sub), which returns the index of the first occurrence of sub. Example:
>>> ch = 'NABZCAVZX'
>>> ch[ch.index('A'):].split('Z')
['AB', 'CAV', 'X']
you can also specify sub strings that aren't just one char:
>>> ch = 'NACBABQZCVEZTZCGE'
>>> ch[ch.index('AB'):].split('ZC')
['ABQ', 'VEZT', 'GE']
Using multiple delimiters:
>>> import re
>>> stop_codons = ['UAA','UGA','UAG']
>>> delim = re.compile('|'.join(stop_codons))
>>> ch = 'CCHAUAABEGTAUAAVEGTUGAVKEGUAABEGEUGABRLVBUAGCGGA'
>>> delim.split(ch)
['CCHA', 'BEGTA', 'VEGT', 'VKEG', 'BEGE', 'BRLVB', 'CGGA']
Note that there is no order preference to the split, i.e. if there is a UGA string ahead of a UAA, it will still split on the UGA. I am not sure if that's what you want, but that's how it behaves.
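Splitting on the stop codons ignores the reading frame and does not restart at each AUG. As a rough sketch of what the question's EDIT 2 asks for, one could instead search for each AUG and walk forward three letters at a time. The function name coding_sequences is hypothetical, and the sketch assumes that a sequence which runs out before any stop codon is kept up to its last complete codon.

```python
def coding_sequences(rna, start="AUG", stops=("UAA", "UGA", "UAG")):
    """Return every stretch from a start codon up to the next in-frame stop."""
    result = []
    pos = rna.find(start)
    while pos != -1:
        end = pos
        # Advance codon by codon until a stop codon or the end of the data.
        while end + 3 <= len(rna) and rna[end:end + 3] not in stops:
            end += 3
        result.append(rna[pos:end])
        # Resume the search for the next start codon after this sequence.
        pos = rna.find(start, end)
    return result


rna = "UUUAUGCGCCGCUAACCCAUGGUUCCCUAGUGGUCCUGACGCAUGUGA"
print(coding_sequences(rna))
# ['AUGCGCCGC', 'AUGGUUCCC', 'AUG']
```

Each element could then be translated separately with the codon table.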

How to separate uppercase and lowercase letters in a string?

I have written code that separates the characters at 'even' and 'odd' indices, and I would like to modify it so that it separates characters by upper/lower case.
I can't figure out how to do this for a string such as "AbBZxYp". I have tried using .lower and .upper but I think I'm using them incorrectly.
def upperLower(string):
    odds=""
    evens=""
    for index in range(len(string)):
        if index % 2 == 0:
            evens = evens + string[index]
        if not (index % 2 == 0):
            odds = odds + string[index]
    print("Odds: ", odds)
    print("Evens: ", evens)

Are you looking to get two strings, one with all the uppercase letters and another with all the lowercase letters? Below is a function that will return two strings, the upper then the lowercase:
def split_upper_lower(input):
    upper = ''.join([x for x in input if x.isupper()])
    lower = ''.join([x for x in input if x.islower()])
    return upper, lower
You can then call it with the following:
upper, lower = split_upper_lower('AbBZxYp')
which gives you two variables, upper and lower. Use them as necessary.

>>> ''.join(filter(str.isupper, "AbBZxYp"))
'ABZY'
>>> ''.join(filter(str.islower, "AbBZxYp"))
'bxp'
Btw, for odd/even index you could just do this:
>>> "AbBZxYp"[::2]
'ABxp'
>>> "AbBZxYp"[1::2]
'bZY'

There is an itertools recipe called partition that can do this. Here is the implementation:
From itertools recipes:
from itertools import filterfalse, tee
def partition(pred, iterable):
    'Use a predicate to partition entries into false entries and true entries'
    # partition(is_odd, range(10)) --> 0 2 4 6 8 and 1 3 5 7 9
    t1, t2 = tee(iterable)
    return filterfalse(pred, t1), filter(pred, t2)
Upper and Lowercase Letters
You can manually implement the latter recipe, or install a library that implements it for you, e.g. pip install more_itertools:
import more_itertools as mit
iterable = "AbBZxYp"
pred = lambda x: x.islower()
children = mit.partition(pred, iterable)
[list(c) for c in children]
# [['A', 'B', 'Z', 'Y'], ['b', 'x', 'p']]
Here partition uses a predicate function to determine if each item in an iterable is lowercase. If not, it is filtered into the false group. Otherwise, it is filtered into the group of true items. We iterate to expose these groups.
Even and Odd Indices
You can modify this to work for odd and even indices as well:
import itertools as it
import more_itertools as mit
iterable = "AbBZxYp"
pred = lambda x: x[0] % 2 != 0
children = mit.partition(pred, tuple(zip(it.count(), iterable)))
[[i[1] for i in list(c)] for c in children]
# [['A', 'B', 'x', 'p'], ['b', 'Z', 'Y']]
Here we zip an itertools.count() object to enumerate the iterable. Then we iterate the children so that the sub items yield the letters only.
See also more_itertools docs for more tools.

How to maintain a strict alternating pattern of item "types" in a list?

Given a list of strings, where each string is in the format "A - something" or "B - somethingelse", and list items mostly alternate between pieces of "A" data and "B" data, how can irregularities be removed?
Irregularities being any sequence that breaks the A B pattern.
If there are multiple A's, the next B should also be removed.
If there are multiple B's, the preceding A should also be removed.
After removal of these invalid sequences, list order should be kept.
Example: A B A B A A B A B A B A B A B B A B A B A A B B A B A B
In this case, AAB (see rule 2), ABB (see rule 3) and AABB should be removed.

I'll give it a try with regexp returning indexes of sequences to be removed
>>> import re
>>> data = 'ABABAABABABABABBABABAABBABAB'
>>> [(m.start(0), m.end(0)) for m in re.finditer('(AA+B+)|(ABB+)', data)]
[(4, 7), (13, 16), (20, 24)]
or result of stripping
>>> re.sub('(AA+B+)|(ABB+)', '', data)
'ABABABABABABABABAB'

The drunk-on-itertools solution:
>>> s = 'ABABAABABABABABBABABAABBABAB'
>>> from itertools import groupby, takewhile, islice, repeat, chain
>>> groups = (list(g) for k,g in groupby(s))
>>> pairs = takewhile(bool, (list(islice(groups, 2)) for _ in repeat(None)))
>>> kept_pairs = (p for p in pairs if len(p[0]) == len(p[1]) == 1)
>>> final = list(chain(*chain(*kept_pairs)))
>>> final
['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
(Unfortunately I'm now in no shape to think about corner cases and trailing As etc..)

I'd write it as a generator. Repeat:
read as many A's as possible,
read as many B's as possible,
if you've read exactly 1 A and 1 B, yield them; otherwise ignore and proceed.
Also this needs an additional special case in case you want to allow the input to end with an A.
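The steps above might look roughly like this. The name keep_alternating, the kind callback (which assumes each item starts with "A" or "B", as in "A - something") and the allow_trailing_a flag are all assumptions made for illustration.

```python
def keep_alternating(items, kind=lambda item: item[0], allow_trailing_a=False):
    """Yield only the items that belong to a clean single-A, single-B pair."""
    items = list(items)
    i, n = 0, len(items)
    while i < n:
        a_start = i
        while i < n and kind(items[i]) == "A":  # read as many A's as possible
            i += 1
        b_start = i
        while i < n and kind(items[i]) == "B":  # read as many B's as possible
            i += 1
        if i == a_start:
            raise ValueError("item is neither an A nor a B: %r" % (items[i],))
        a_run, b_run = items[a_start:b_start], items[b_start:i]
        if len(a_run) == 1 and len(b_run) == 1:
            yield from a_run + b_run
        elif allow_trailing_a and len(a_run) == 1 and not b_run:
            # Special case: the input ends with a single unmatched A.
            yield from a_run


print("".join(keep_alternating("ABABAABABABABABBABABAABBABAB")))
# ABABABABABABABABAB
```

Leading B's are read as a run with zero A's in front of it and are therefore dropped.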

Using itertools.groupby:
from itertools import groupby
def solve(strs):
    drop_next = False
    ans = []
    for k, g in groupby(strs):
        lis = list(g)
        if drop_next:
            #if True then don't append the current set to `ans`
            drop_next = False
        elif len(lis) > 1 and k == 'A':
            #if current group contains more than 1 'A' then skip the next set of 'B'
            drop_next = True
        elif len(lis) > 1 and k == 'B':
            #if current group contains more than 1 'B' then pop the last appended item
            if ans:
                ans.pop(-1)
        else:
            ans.append(k)
    return ''.join(ans)
strs = 'ABABAABABABABABBABABAABBABAB'
print(solve(strs))
#ABABABABABABABABAB

String Replacement Combinations

So I have a string '1xxx1' and I want to replace a certain number (maybe all maybe none) of x's with a character, let's say '5'. I want all possible combinations (...maybe permutations) of the string where x is either substituted or left as x. I would like those results stored in a list.
So the desired result would be
>>> myList = GenerateCombinations('1xxx1', '5')
>>> print myList
['1xxx1','15xx1','155x1','15551','1x5x1','1x551','1xx51']
Obviously I'd like it to be able to handle strings of any length with any amount of x's as well as being able to substitute any number. I've tried using loops and recursion to figure this out to no avail. Any help would be appreciated.

How about:
from itertools import product
def filler(word, from_char, to_char):
    options = [(c,) if c != from_char else (from_char, to_char) for c in word]
    return (''.join(o) for o in product(*options))
which gives
>>> filler("1xxx1", "x", "5")
<generator object <genexpr> at 0x8fa798c>
>>> list(filler("1xxx1", "x", "5"))
['1xxx1', '1xx51', '1x5x1', '1x551', '15xx1', '15x51', '155x1', '15551']
(Note that you seem to be missing 15x51.)
Basically, first we make a list of every possible target for each letter in the source word:
>>> word = '1xxx1'
>>> from_char = 'x'
>>> to_char = '5'
>>> [(c,) if c != from_char else (from_char, to_char) for c in word]
[('1',), ('x', '5'), ('x', '5'), ('x', '5'), ('1',)]
And then we use itertools.product to get the Cartesian product of these possibilities and join the results together.
For bonus points, modify to accept a dictionary of replacements. :^)
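One possible take on that bonus exercise, as a sketch: the name fill_all and the shape of the replacements argument (a dict mapping a character to a list of substitutes) are assumptions, not something the answer specifies.

```python
from itertools import product


def fill_all(word, replacements):
    """Return every variant of word with each character kept or substituted."""
    # For each position: the original character plus any substitutes for it.
    options = [(c, *replacements.get(c, ())) for c in word]
    return ["".join(combo) for combo in product(*options)]


print(fill_all("1xxx1", {"x": ["5"]}))
# ['1xxx1', '1xx51', '1x5x1', '1x551', '15xx1', '15x51', '155x1', '15551']
```

Characters without an entry in the dictionary contribute a single option and pass through unchanged.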
