Python one-liner to standardise input string

I'm trying to standardise user input into the format dddL, where d is a digit and L is a capital letter. If the user doesn't enter enough digits, I want to fill in the missing digits with leading zeroes; if they enter anything other than at most 3 digits followed by a letter, the input should just be rejected:
input/standardised example:
input: '1a'
output: '001A'
input: '0a'
output: '000A'
input: '05c'
output: '005C'
input: '001F'
output: '001F' (unchanged)
input: '10x'
output: '010X'
input: '110x'
output: '110X'
My broken attempt right now returns nothing for some reason: (doesn't yet deal with rejecting invalid input)
>>> x = ['0a', '05c', '001F', '10x']
>>> [i.upper() if len(i)==4 else ('0' * j) + i.upper() for j in range(4-len(i)) for i in x]
[]
I'm not necessarily looking for list processing; I only want it to work for a single variable as input.

One implementation:
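import re

acceptableInput = re.compile(r"\d{3}[A-Z]")
paddedInput = i.upper().zfill(4)  # i is the input string
# fullmatch (Python 3.4+) also rejects trailing characters such as '123AB'
if acceptableInput.fullmatch(paddedInput):
    pass  # do something
else:
    pass  # reject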

For zero padding:
i.zfill(4)
To check for invalid input:
import re
re.fullmatch(r"\d{1,3}[A-Z]", i)
Put together:
[i.zfill(4) for i in x if re.fullmatch(r"\d{1,3}[A-Z]", i)]
Compiling the pattern once up front can make the code slightly faster, so:
x = ['0A', '05C', '001F', '10x']
import re
matcher = re.compile(r"\d{1,3}[A-Z]")
out = [i.zfill(4) for i in x if matcher.fullmatch(i)]
out == ['000A', '005C', '001F']  # '10x' is rejected because its letter is lowercase; call i.upper() first to accept it
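Since the question asks about a single value rather than a list, the pieces above could be combined into one small helper. This is only a minimal sketch: the name `standardise`, the use of `re.ASCII` to restrict `\d` to 0-9, and the choice of returning `None` for rejected input are all assumptions, not part of the original answers.

```python
import re

# 1 to 3 ASCII digits followed by exactly one capital letter
_PATTERN = re.compile(r"\d{1,3}[A-Z]", re.ASCII)

def standardise(text):
    """Return text in dddL form, or None if it cannot be standardised."""
    candidate = text.upper()
    if _PATTERN.fullmatch(candidate) is None:
        return None
    # zfill pads on the left with zeroes up to a total width of 4
    return candidate.zfill(4)

for raw in ['1a', '0a', '05c', '001F', '10x', '110x', '1234a', 'ab']:
    print(repr(raw), '->', standardise(raw))
```

Upper-casing before matching is what lets '10x' through as '010X'; validating before padding keeps the pattern close to the wording of the question ("at most 3 digits and a letter").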

Related

Struggling with Regex for adjacent letters differing by case

I want to recursively remove adjacent letters in a string that differ only in their case. For example, if s = AaBbccDd, I want to remove Aa, Bb and Dd but leave cc.
I can do this recursively using lists (shown further down).
I think it ought to be possible with a regex, but I am struggling:
With the test string 'fffAaaABbe' the answer should be 'fffe', but the regex I am using gives 'fe'.
import re

def test(line):
    res = re.compile(r'(.)\1{1}', re.IGNORECASE)
    #print(res.search(line))
    while res.search(line):
        line = res.sub('', line, count=1)
    print(line)
The way that works is:
def test(line):
    result = ''
    chr = list(line)
    cnt = 0
    i = len(chr) - 1
    while i > 0:
        if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
            cnt += 1
            chr.pop(i)
            chr.pop(i - 1)
            i -= 2
        else:
            i -= 1
    if cnt > 0:  # until we can't find any duplicates.
        return test(''.join(chr))
    result = ''.join(chr)
    print(result)
Is it possible to do this using a regex?

re.IGNORECASE is not the way to solve this problem, as it treats aa, Aa, aA and AA the same way. It is technically possible using re.sub, as follows.
import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub) # fffe
Note that I explicitly listed all possible letter pairs because, as far as I know, there is no way to say "inverted-case letter" in a plain re pattern. Maybe another user will be able to provide a more concise re-based solution.
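A more compact pattern may be possible on Python 3.6+, which supports scoped inline flags such as `(?i:...)`. The sketch below is an alternative under that assumption, not the answerer's method: it captures one ASCII letter, uses a case-sensitive negative lookahead to rule out an identical next character, then matches the same letter with case ignored. The names `PAIR` and `remove_case_pairs` are hypothetical.

```python
import re

# ([A-Za-z])  capture one ASCII letter
# (?!\1)      the next character must not be the identical character
# (?i:\1)     ...but it must be the same letter when case is ignored
PAIR = re.compile(r'([A-Za-z])(?!\1)(?i:\1)', re.ASCII)

def remove_case_pairs(line):
    # Repeat until a pass removes nothing, so newly adjacent pairs are caught
    while True:
        reduced = PAIR.sub('', line)
        if reduced == line:
            return line
        line = reduced

print(remove_case_pairs('fffAaaABbe'))  # fffe
print(remove_case_pairs('AaBbccDd'))    # cc
```

The loop matters for nested input such as 'abBA', where removing 'bB' brings 'a' and 'A' together.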

I suggest a different approach which uses groupby to group adjacent similar letters:
from itertools import groupby
def test(line):
    res = []
    for k, g in groupby(line, key=lambda x: x.lower()):
        g = list(g)
        if all(x == x.lower() for x in g):
            res.append(''.join(g))
    print(''.join(res))
Sample run:
>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe
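One caveat with grouping by lowercase letter: a whole run is dropped as soon as it contains any capital, so inputs such as 'CC' or 'aaA' may not come out the way the question describes. A stack-based scan is one way to handle those cases without a regex. This is a rough sketch under the assumption that a "pair" means the same letter in opposite case; the name `reduce_pairs` is made up for illustration.

```python
def reduce_pairs(line):
    stack = []
    for ch in line:
        # A pair is two different characters that are equal once lower-cased,
        # e.g. 'a' next to 'A'; identical characters ('cc', 'CC') are kept
        if stack and stack[-1] != ch and stack[-1].lower() == ch.lower():
            stack.pop()
        else:
            stack.append(ch)
    return ''.join(stack)

print(reduce_pairs('fffAaaABbe'))  # fffe
print(reduce_pairs('AaBbccDd'))    # cc
print(reduce_pairs('aaA'))         # a
```

Because each character is pushed and popped at most once, a single left-to-right pass is enough and no recursion is needed.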

r'(.)\1{1}' is wrong because it matches any character that is repeated twice, including non-letter characters. If you want to restrict the match to letters, you can't use it.
However, even r'([A-Za-z])\1' with re.IGNORECASE would still be wrong, because it matches any letter followed by the same letter in either case, so it would also catch xx and XX. As you said in the original question, consecutive characters with matching case should not be removed.
There is no dedicated shorthand for "the same letter in the opposite case", but it is still possible; you can write a small function that generates the pattern for you.
Building on @Daweo's answer, you can generate the regex pattern that matches pairs of the same letter in opposite case, giving the final pattern aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz:
import re
import string

def consecutiveLettersNonMatchingCase():
    # Get all 'xX|Xx' with a list comprehension
    # and join them with '|'
    return '|'.join(['{0}{1}|{1}{0}'.format(s, t)
                     # Iterate through the upper/lowercase characters
                     # in lock-step
                     for s, t in zip(
                         string.ascii_lowercase,
                         string.ascii_uppercase)])

def test(line):
    res = re.compile(consecutiveLettersNonMatchingCase())
    print(res.search(line))
    while res.search(line):
        line = res.sub('', line, count=1)
    print(line)

print(consecutiveLettersNonMatchingCase())

How can I invert a slice?

My code right now
sentence = "Sentence!"
print(*sentence[::3], sep="--")
Output: S--t--c
How can I invert the slice so that the same input results in -en-en-e!
I've tried -3 and other numbers in place of the 3 in ::3, but none of them work.

Like this:
sentence = 'Sentence!'
import re
tokens = re.findall(r'.(..)', sentence)
print('', '-'.join(tokens), sep='-') # prints: -en-en-e!
Edit: Addressing the question in the comments:
This works, although how can I get this to start on the 3rd letter?
You could try this:
tokens = re.findall(r'(..).?', sentence[2:])
print(*tokens, sep='-')
This will output: nt-nc
Is this what you wanted?

What you're trying to achieve isn't possible using a slice, because the indices you want to keep (1, 2, 4, 5, 7, 8) are not an arithmetic progression.
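To make the point about the indices concrete, the short sketch below lists the positions a `[::3]` slice selects and the positions left over; the variable names are only illustrative. It also shows that the leftover characters can be collected by deleting the extended slice from a list copy, although that by itself does not put a `-` where each removed character was.

```python
sentence = "Sentence!"

picked = list(range(len(sentence)))[::3]
kept = [i for i in range(len(sentence)) if i % 3 != 0]
print(picked)  # [0, 3, 6]
print(kept)    # [1, 2, 4, 5, 7, 8]

# The gaps between kept indices alternate, so no single step describes them
gaps = [b - a for a, b in zip(kept, kept[1:])]
print(gaps)    # [1, 2, 1, 2, 1]

# Deleting the extended slice from a list copy leaves the complement
chars = list(sentence)
del chars[::3]
print(''.join(chars))  # enene!
```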
Since the goal is to replace the first character of every three with a - symbol, the simplest solution I can think of is using a regex:
>>> import re
>>> re.sub(".(.{0,2})", r"-\1", "Sentence!")
'-en-en-e!'
>>> re.sub(".(.{0,2})", r"-\1", "Hello, world!")
'-el-o,-wo-ld-'
The {0,2} means the pattern will match even if the last group doesn't have three letters.

If you want to truly invert the range, then take the indices not in that range:
''.join(sentence[i] if i not in range(0, len(sentence), 3) else '-'
        for i in range(len(sentence)))
Output
'-en-en-e!'
Personally, I prefer the regex solutions.

Another attempt:
sentence = ("Sentence!")
print(''.join(ch if i % 3 else '-' for i, ch in enumerate(sentence)))
Prints:
-en-en-e!
If sentence='Hello, world!':
-el-o,-wo-ld-

You can use slice assignment:
def invert(string, step, sep):
    sentence = list(string)
    sentence[::step] = len(sentence[::step]) * [sep]
    return ''.join(sentence)
print(invert('Sentence!', 3, '*'))
# *en*en*e!
print(invert('Hallo World!', 4, '$'))
# $all$ Wo$ld!

Python : Convert Integers into a Count (i.e. 3 --> 1,2,3)

This might be more information than necessary to explain my question, but I am trying to combine 2 scripts (I wrote for other uses) together to do the following.
TargetString (input_file) 4FOO 2BAR
Result (output_file) 1FOO 2FOO 3FOO 4FOO 1BAR 2BAR
My first script finds the pattern and copies to file_2
pattern = r"\d[A-Za-z]{3}"
matches = re.findall(pattern, input_file.read())
f1.write('\n'.join(matches))
My second script opens the output_file and, using re.sub, replaces and alters the target string(s) using capturing groups and back-references. But I am stuck on how to turn, e.g., 3 into 1 2 3.
Any ideas?

This simple example doesn't need a regular expression, but if you want to use re anyway, here's an example (note: you have a minor error in your pattern; it should be A-Z, not A-A):
text_input = '4FOO 2BAR'
import re
matches = re.findall(r"(\d)([A-Za-z]{3})", text_input)
for (count, what) in matches:
    for i in range(1, int(count)+1):
        print(f'{i}{what}', end=' ')
print()
Prints:
1FOO 2FOO 3FOO 4FOO 1BAR 2BAR
Note: If you want to support multiple digits, you can use (\d+) - note the + sign.
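As a quick illustration of that note, here is a minimal sketch that collects the results in a list and uses `(\d+)` so that counts above 9 work. The function name `expand_counts` is an assumption for the example, not something from the question.

```python
import re

def expand_counts(text):
    """Turn '4FOO 2BAR' into ['1FOO', '2FOO', ..., '2BAR']."""
    result = []
    # (\d+) captures one or more digits, so '10BAR' is handled too
    for count, what in re.findall(r"(\d+)([A-Za-z]{3})", text):
        for i in range(1, int(count) + 1):
            result.append(f"{i}{what}")
    return result

print(' '.join(expand_counts('4FOO 2BAR')))  # 1FOO 2FOO 3FOO 4FOO 1BAR 2BAR
print(expand_counts('10BAR')[-1])            # 10BAR
```

Returning a list instead of printing makes it straightforward to write the result to the output file with '\n'.join(...), as the first script in the question already does.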

Assuming your numbers are between 1 and 9, without regex, you can use a list comprehension with f-strings (Python 3.6+):
L = ['4FOO', '2BAR']
res = [f'{j}{i[1:]}' for i in L for j in range(1, int(i[0])+1)]
['1FOO', '2FOO', '3FOO', '4FOO', '1BAR', '2BAR']
Reading from and writing to CSV files is covered elsewhere.
More generalised, to account for numbers greater than 9, you can use itertools.groupby:
from itertools import groupby
L = ['4FOO', '10BAR']
def make_var(x, int_flag):
    return int(''.join(x)) if int_flag else ''.join(x)
vals = ((make_var(b, a) for a, b in groupby(i, str.isdigit)) for i in L)
res = [f'{j}{k}' for num, k in vals for j in range(1, num+1)]
print(res)
['1FOO', '2FOO', '3FOO', '4FOO', '1BAR', '2BAR', '3BAR', '4BAR',
'5BAR', '6BAR', '7BAR', '8BAR', '9BAR', '10BAR']

Delete a certain number of zeros from right of a string

I'm trying to delete a certain number of zeros from right. For example:
"10101000000"
I want to remove 4 zeros... And get:
"1010100"
I tried string.rstrip("0") and string.strip("0"), but these remove all of the zeros from the right. How can I do that?
The question is not a duplicate because I can't use imports.

You can use a regex
>>> import re
>>> mystr = "10101000000"
>>> numzeros = 4
>>> mystr = re.sub("0{{{}}}$".format(numzeros), "", mystr)
>>> mystr
'1010100'
This will leave the string as is if it doesn't end in four zeros.
You could also check and then slice:
if mystr.endswith("0" * numzeros):
    mystr = mystr[:-numzeros]

For a known number of zeros you can use slicing:
s = "10101000000"
zeros = 4
if s.endswith("0" * zeros):
    s = s[:-zeros]

rstrip deletes all characters from the end that are in passed set of characters. You can delete trailing zeros like this:
s = s[:-4] if s[-4:] == "0"*4 else s
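On Python 3.9 and later, the built-in `str.removesuffix` does the same check-then-slice in one call and needs no imports. A brief sketch, assuming that Python version is available; `strip_zeros` is a hypothetical helper name:

```python
def strip_zeros(s, count):
    # removesuffix returns s unchanged when s does not end with the suffix
    return s.removesuffix("0" * count)

print(strip_zeros("10101000000", 4))  # 1010100
print(strip_zeros("1010", 4))         # 1010
print(strip_zeros("1000", 0))         # 1000
```

One small difference from the slicing versions: with a count of 0, `s[:-0]` evaluates to an empty string, whereas `removesuffix("")` leaves the string untouched.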

Here's my solution:
number = "10101000000"
def my_rstrip(number, char, count=4):
    for x in range(count):
        if number.endswith(char):
            number = number[0:-1]
        else:
            break
    return number
print(my_rstrip(number, '0', 4))

>>> s[:-4]+s[-4:].replace('0000','')

Don't forget to convert to str
import re
a = 10101000000
re.sub("0000$","", str(a))

You can drop the last 4 characters of the string this way (note that this removes them whether or not they are zeros):
string[:-4]

Python split string by pattern

I have strings like "aaaaabbbbbbbbbbbbbbccccccccccc".
The number of characters can differ, and sometimes there can be a dash inside the string, like "aaaaa-bbbbbbbbbbbbbbccccccccccc".
Is there any smart way to either split it into "aaaaa", "bbbbbbbbbbbbbb", "ccccccccccc" and get the indices of where it is split, or just get the indices, without looping through every string? If the dash is between two patterns, it can end up in either the left or the right one, as long as it is always handled the same way.
Any idea?

Regular expression match objects include the indices of the match. What remains is to match repeating characters:
import re
repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')
This matches only if a given letter (a-z) is repeated at least once:
>>> for match in repeat.finditer("aaaaabbbbbbbbbbbbbbccccccccccc"):
...     print(match.group(), match.start(), match.end())
...
aaaaa 0 5
bbbbbbbbbbbbbb 5 19
ccccccccccc 19 30
The .start() and .end() methods on the match result give you the exact positions in the input string.
Dashes are included in the matches, but not non-repeating characters:
>>> for match in repeat.finditer("a-bb-cccccccc"):
...     print(match.group(), match.start(), match.end())
...
bb- 2 5
cccccccc 5 13
If you want the a- part to be a match, simply replace the + with a * multiplier:
repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')
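Putting the `*` variant to work, the following sketch gathers each piece together with its start and end index in one list. The helper name `split_runs` is an assumption for illustration; the pattern is the one given just above.

```python
import re

# One letter, any number of repeats of it, then an optional dash
repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')

def split_runs(text):
    """Return (substring, start, end) for each run of a repeated letter."""
    return [(m.group(), m.start(), m.end()) for m in repeat.finditer(text)]

print(split_runs("aaaaa-bbbbbbbbbbbbbbccccccccccc"))
# [('aaaaa-', 0, 6), ('bbbbbbbbbbbbbb', 6, 20), ('ccccccccccc', 20, 31)]
```

With this pattern a dash always ends up attached to the run on its left, which satisfies the "always handled the same" requirement from the question.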

What about using itertools.groupby?
>>> s = 'aaaaabbbbbbbbbbbbbbccccccccccc'
>>> from itertools import groupby
>>> [''.join(v) for k,v in groupby(s)]
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
This will put the - as their own substrings which could easily be filtered out.
>>> s = 'aaaaa-bbbbbbbbbbbbbb-ccccccccccc'
>>> [''.join(v) for k,v in groupby(s) if k != '-']
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
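The question also asks for the split indices, which groupby does not report directly. One possible way to recover them is to keep a running position while iterating over the groups. This is a sketch only; `runs_with_indices` is a made-up name, and dashes are skipped but still counted towards the position.

```python
from itertools import groupby

def runs_with_indices(s):
    result = []
    pos = 0
    for key, group in groupby(s):
        length = sum(1 for _ in group)  # size of this run
        if key != '-':
            result.append((key * length, pos, pos + length))
        pos += length  # dashes advance the position as well
    return result

print(runs_with_indices('aaaaa-bbbbbbbbbbbbbb-ccccccccccc'))
# [('aaaaa', 0, 5), ('bbbbbbbbbbbbbb', 6, 20), ('ccccccccccc', 21, 32)]
```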

s = "aaaaabbbbbbbbbbbbbbccccccccccc"
p = [0]
for i, c in enumerate(zip(s, s[1:])):
    if c[0] != c[1]:
        p.append(i + 1)
print(p)
# [0, 5, 19]
