String Formatting/Template/Regular Expressions - python

I have a string format where A = alphanumeric and N = integer, so the template is "AAAAAA-NNNN". The user will sometimes omit the dash, and sometimes the "NNNN" part is only three digits, in which case I need to pad it with a 0. The first digit of "NNNN" has to be 0, so a nonzero digit in that position is the last character of "AAAAAA" rather than the first digit of "NNNN". In essence, given the following inputs I want the following results:
Sample Inputs:
"SAMPLE0001"
"SAMPL1-0002"
"SAMPL3003"
"SAMPLE-004"
Desired Outputs:
"SAMPLE-0001"
"SAMPL1-0002"
"SAMPL3-0003"
"SAMPLE-0004"
I know how to check for this using regular expressions, but essentially I want to do the opposite. I was wondering if there is an easy way to do this other than a nested conditional checking for all these variations. I am using Python and pandas, but either will suffice.
The regex pattern would be:
"[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]-\d\d\d\d"
or in abbreviated form:
"[a-zA-Z0-9]{6}-[\d]{4}"

It would be possible through two re.sub functions.
>>> import re
>>> s = '''SAMPLE0001
SAMPL1-0002
SAMPL3003
SAMPLE-004'''
>>> print(re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s)))
SAMPLE-0001
SAMPL1-0002
SAMPL3-0003
SAMPLE-0004
Explanation:
re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s) is processed first. It inserts a hyphen after the 6th character of each line, but only if the following character is not already a hyphen.
re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s)) takes the output of the command above as its input and inserts a 0 right after the hyphen, but only when exactly 3 digits follow the hyphen up to the end of the line.
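The two substitutions can also be written as separate, named steps, which may be easier to read than the nested call. This is only a sketch of the same approach; the names ADD_HYPHEN, PAD_ZERO and normalize are made up for illustration and are not part of the answer.

```python
import re

# Step 1: insert '-' after the first 6 characters of a line unless one is already there
ADD_HYPHEN = re.compile(r'(?m)(?<=^[A-Z\d]{6})(?!-)')
# Step 2: insert '0' after the hyphen when exactly 3 digits remain on the line
PAD_ZERO = re.compile(r'(?m)(?<=-)(?=\d{3}$)')

def normalize(text):
    """Apply both substitutions in order (hypothetical helper)."""
    with_hyphen = ADD_HYPHEN.sub('-', text)
    return PAD_ZERO.sub('0', with_hyphen)

print(normalize('SAMPL3003'))
```

Like the original pattern, this assumes the alphanumeric part uses uppercase letters only.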

An alternative solution that uses str.join:
import re
inputs = ['SAMPLE0001', 'SAMPL1-0002', 'SAMPL3003', 'SAMPLE-004']
outputs = []
for input_ in inputs:
    # group 1: the six leading characters, group 2: the last three digits
    m = re.match(r'(\w{6})-?\d?(\d{3})', input_)
    outputs.append('-0'.join(m.groups()))
print(outputs)
# ['SAMPLE-0001', 'SAMPL1-0002', 'SAMPL3-0003', 'SAMPLE-0004']
We are matching the regex (\w{6})-?\d?(\d{3}) against the input strings and joining the captured groups with the string '-0'. This is very simple and fast.
Let me know if you need a more in-depth explanation of the regex itself.
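Note that re.match returns None when a string does not fit the pattern, in which case m.groups() raises AttributeError. A minimal sketch of a guarded version, assuming unrecognised values should simply be passed through unchanged (the helper name normalize_code and that pass-through behaviour are assumptions, not part of the answer):

```python
import re

CODE = re.compile(r'(\w{6})-?\d?(\d{3})')

def normalize_code(value):
    """Return value as 'AAAAAA-0NNN', or unchanged if it does not fit the pattern."""
    m = CODE.fullmatch(value)
    if m is None:
        return value  # assumption: leave unexpected input alone
    return '-0'.join(m.groups())

print([normalize_code(v) for v in ['SAMPLE0001', 'SAMPL3003', 'oops']])
```

Because \d? accepts any digit, a value such as 'SAMPLE-9004' comes out as 'SAMPLE-0004'; if that is not wanted, 0? could be used in its place. With pandas, a function like this could be passed to Series.apply.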

Related

Splitting a string every 2 digits

I have a column existing of rows with different strings (Python). ex.
5456656352
435365
46765432
...
I want to separate the strings every 2 digits with a comma, so that I get the following result:
54,56,65,63,52
43,53,65
46,76,54,32
...
Can someone help me please.
Try:
text = "5456656352"
print(",".join(text[i:i + 2] for i in range(0, len(text), 2)))
output:
54,56,65,63,52
You can wrap it into a function if you want to apply it to a DF or ...
Note: this splits from the left, so if the length is odd, a single digit is left over at the end.
Not sure about the structure of desired output (pandas and dataframes, pure strings, etc.). But, you can always use a regex pattern like:
import re
re.findall(r"\d{2}", "5456656352")
Output
['54', '56', '65', '63', '52']
You can have this output as a string too:
",".join(re.findall(r"\d{2}", "5456656352"))
Output
54,56,65,63,52
Explanation
\d{2} is a regex pattern that matches two consecutive digits. re.findall returns every non-overlapping match, so the string is divided into two-digit elements. If the length is odd, the trailing single digit is not matched and is dropped.
Edit
Based on your comment, you want to APPLY this on a column. In this case, you should do something like:
df["my_column"] = df["my_column"].apply(split_it)
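The answer does not show what split_it is. A minimal sketch, assuming it simply wraps the slicing approach from the first answer (the size and sep parameters are additions for illustration):

```python
def split_it(text, size=2, sep=","):
    """Join consecutive chunks of `size` characters with `sep`."""
    text = str(text)  # assumption: the column may hold integers rather than strings
    return sep.join(text[i:i + size] for i in range(0, len(text), size))

# Stand-in for the column values
column = ["5456656352", "435365", "46765432"]
print([split_it(value) for value in column])
```

Unlike the \d{2} pattern, slicing keeps a trailing single digit when the length is odd.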

How I can use regex to remove repeated characters from string

I have a string as follows where I tried to remove similar consecutive characters.
import re
input = "abccbcbbb"
for i in input:
    input = re.sub("(.)\\1+", "", input)
print(input)
Now I need to let the user specify the value of k.
I am using the following python code to do it, but I got the error message TypeError: can only concatenate str (not "int") to str
import re
input = "abccbcbbb"
k = 3
for i in input:
    input = re.sub("(.)\\1+{"+(k-1)+"}", "", input)
print(input)
The for i in input : does not do what you need. i is each character in the input string, and your re.sub is supposed to take the whole input as a char sequence.
If you plan to match a specific amount of chars you should get rid of the + quantifier after \1. The limiting {min,} / {min,max} quantifier should be placed right after the pattern it modifies.
Also, it is more convenient to use raw string literals when defining regexps.
You can use
import re
input_text = "abccbcbbb";
k=3
input_text = re.sub(fr"(.)\1{{{k-1}}}", "", input_text)
print(input_text)
# => abccbc
The fr"(.)\1{{{k-1}}}" raw f-string literal will translate into (.)\1{2} pattern. In f-strings, you need to double curly braces to denote a literal curly brace and you needn't escape \1 again since it is a raw string literal.
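To let the user choose k, the same f-string can be wrapped in a small function. This is a sketch only; the name remove_runs is hypothetical, and it assumes k is an integer of at least 2.

```python
import re

def remove_runs(text, k):
    """Remove each run of exactly k identical consecutive characters, in one pass."""
    # For k = 3 the pattern becomes (.)\1{2}: one character plus two repeats of it
    pattern = fr"(.)\1{{{k - 1}}}"
    return re.sub(pattern, "", text)

print(remove_runs("abccbcbbb", 3))  # abccbc
print(remove_runs("abccbcbbb", 2))  # abbcb
```

The substitution runs once from left to right, so characters that become adjacent after a removal are not merged into a new run.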
If I were you, I would do it as suggested above. But since I've already spent time answering this question, here is my handmade solution.
The pattern below defines a named group called "letter" that matches a single lowercase letter, followed by a lookahead that requires at least one more copy of that same letter. As the regex engine scans the string, the group is tried at each position in turn: first a, then b, and so on.
So every letter that is immediately followed by a copy of itself is replaced with an empty string, which leaves one letter from each run of repeats.
import re
input = 'abccbcbbb'
result = 'abcbcb'
pattern = r'(?P<letter>[a-z])(?=(?P=letter)+)'
substituted = re.sub(pattern, '', input)
assert substituted == result
Just to make sure I have the question right: you mean to turn "abccbcbbb" into "abcbcb", removing only consecutive duplicate characters? Is there a reason you need to use regex? You could likely do this with a simple loop. This is a really quick and dirty way to do it, but you could just write
input = "abccbcbbb"
input = list(input)
previous = input.pop(0)
result = [previous]
for letter in input:
    # keep a letter only when it differs from the one before it
    if letter != previous:
        result.append(letter)
    previous = letter
result = "".join(result)
With a method like this, you could make it easier to read and faster with a bit of modification, I'd assume.
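One possible modification along those lines is itertools.groupby, which groups consecutive equal items. This is a sketch of an alternative, not the answerer's code, and the name collapse_repeats is made up here.

```python
from itertools import groupby

def collapse_repeats(text):
    """Keep one character from each run of consecutive duplicates."""
    # groupby yields (key, group) pairs; only the key of each run is needed
    return "".join(key for key, _run in groupby(text))

print(collapse_repeats("abccbcbbb"))  # abcbcb
```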

Extract a portion of string from another string using regex

Let's assume I have a string as follows:
s = '23092020_indent.xlsx'
I want to extract only indent from the above string. Now there are many approaches:
#Via re.split() operation
s_f = re.split('_ |. ',s) <---This is returning 's' ONLY. Not the desired output
#Via re.findall() operation
s_f = re.findall(r'[A-Za-z]',s,re.I)
s_f
['i','n','d','e','n','t','x','l','s','x']
s_f = ''.join(s_f) <----This is returning 'indentxlsx'. Not the desired output
Am I missing out anything? Or do I need to use regex at all?
P.S. In the whole of s, only the '.' delimiter is constant. All the other delimiters can change.
Use os.path.splitext and then str.split:
import os
s = '23092020_indent.xlsx'
name, ext = os.path.splitext(s)  # name = '23092020_indent', ext = '.xlsx'
name.split("_")[1] # If the position is always fixed
Output:
"indent"
I LOVE regex's, so that's definitely the way I'd go.
The exactly right answer requires more information as to all possible input strings and what the right thing to extract is for each of them. Here's a solution that assumes:
1. one or more digits, then
2. a single underscore, then
3. a group of chars not containing a '.', then
4. a '.', then
5. anything besides a '.', but at least one char
The #3 part is captured.
import re
s = '23092020_indent.xlsx'
exp = re.compile(r"^\d+_(.*?)\.[^.]+$")
m = exp.match(s)
if m:
    print(m.group(1))
Result:
indent

re.findall Find numbers with dashes(-) and commas(,)

I have the value
x = '970.11 - 1,003.54'
I've tried many types of re.findall for example
re.findall(r'\d+',x)
['970', '11', '1', '003', '54']
although I would like for it to show
['970.11', '1,003.54']
\d is only digits. It won't match other characters even if we think they are part of numbers. You need to do that manually with something like:
import re
x = '970.11 - 1,003.54'
re.findall(r'[\d\.,]+',x) # match runs of digits, . or ,
result:
['970.11', '1,003.54']
This is a pretty forgiving regex: it will match a lot of things that probably aren't numbers (like ..,,4). Numbers can be tricky to match with a regex if you want something that works in the general case (like .45, 11,000.2, 22., etc.). The more consistent your input, the easier it will be. And sometimes it's easier to match the non-number parts (like your -).
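One way to read that last point is to split on the separator instead of describing the numbers. A minimal sketch, assuming the values are never negative and a dash only ever appears as the separator:

```python
import re

x = '970.11 - 1,003.54'
# Split on a dash together with any whitespace around it
parts = re.split(r'\s*-\s*', x)
print(parts)  # ['970.11', '1,003.54']
```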
Try this one, it also works:
import re
re.findall(r'\d+\,?\d+\.*\d*',x)
Output:
['970.11', '1,003.54']
Here the , is optional: it is matched when it appears between digits and skipped otherwise.
If you want the . to be optional as well, you can write it like this:
In [48]: x
Out[48]: '970.11 - 1,003.54 2345'
In [49]: re.findall(r'\d+\,?\d+\.?\d+',x)
Out[49]: ['970.11', '1,003.54', '2345']
To get this you can use regex groups. Using your example:
x = '970.11 - 1,003.54'
y = re.findall('([0-9.,]+)([ -]+)([0-9.,]+)',x)
print(y[0]) #prints ('970.11', ' - ', '1,003.54')
z = [y[0][0],y[0][2]]
print(z) #prints ['970.11', '1,003.54']
The regular expression here consists of 3 groups: the first and last each match one or more digits, dots or commas, and the middle one matches one or more spaces or dashes.

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile(r"O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
By wrapping everything between 2, and ,X in (), the whole run of numbers is captured rather than just its last repetition. I then used (?: ) to make the inner group non-capturing, so it no longer shows up in the results.
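Since the question also asks for the shortest of many matches, the list returned by findall can be passed to min. A sketch under the assumption that "shortest" means the fewest comma-separated numbers; the sample data below is invented for illustration.

```python
import re

# Hypothetical sample data: two lines fit "O,2,...,X", one does not
s = """O,2,9,6,7,11,8,X
O,2,5,3,X
X,3,2,7,1,9,4,6,X"""

pattern = re.compile(r"O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s)

# Pick the match with the fewest comma-separated numbers
shortest = min(matches, key=lambda m: len(m.split(",")))
print(matches)
print(shortest)
```

min raises ValueError on an empty list, so a real search would need to handle the case where nothing matches.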
You don't have to use regex:
split the string into a list on ','
check item 0 == 'O' and item 1 == '2'
check the last item == 'X'
check that each of items[2:-1] is a number (str.isdigit)
that's all
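The steps above can be sketched as follows. The function name moves_after and its parameters are hypothetical, and returning None for a non-matching line is an assumption.

```python
def moves_after(line, player="O", first_move="2", winner="X"):
    """Return the numbers between the first move and the winner, or None."""
    items = line.split(",")
    if len(items) < 3:
        return None
    # First item, second item and last item must match the requested values
    if items[0] != player or items[1] != first_move or items[-1] != winner:
        return None
    middle = items[2:-1]
    # Every remaining item must consist of digits only
    if not all(item.isdigit() for item in middle):
        return None
    return middle

print(moves_after("O,2,9,6,7,11,8,X"))
print(moves_after("O,4,1,8,6,7,9,5,3,X"))
```

Applied line by line, this avoids the regex entirely and keeps multi-digit numbers such as 11 as single items.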
