How to insert space between alphabet characters and numeric character using regex? - python

I'm trying to insert a space between numeric and alphabetic characters so that I can later convert the numbers to words, for example:
Input :
subject101
street45
Output :
subject 101
street 45
I tried this one
re.sub('[a-z][\d]|[\d][a-z]',' ','subject101')
but the output was like this :
subjec 01
How can I do it using python?

Try this Regex:
(?i)(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)
Replace each match with a space
Explanation:
(?i) - modifier to make the matches case-insensitive
(?<=\d)(?=[a-z]) - finds the position just preceded by a digit and followed by a letter
| - OR
(?<=[a-z])(?=\d) - finds the position just preceded by a letter and followed by a digit
Code output
import re
regex = r"(?i)(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)"
test_str = ("subject101\n"
" street45")
subst = " "
result = re.sub(regex, subst, test_str)
if result:
    print(result)
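To see why the attempt in the question lost characters while the lookaround version does not, here is a minimal sketch comparing the two. The helper name insert_spaces and the extra input "Street45B" are illustrative assumptions, not part of the answer.

```python
import re

# Zero-width pattern from the answer: it matches the position between a digit
# and a letter (in either order) without consuming any characters.
BOUNDARY = re.compile(r"(?i)(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)")


def insert_spaces(text):
    """Insert a space at every letter/digit boundary (hypothetical helper)."""
    return BOUNDARY.sub(" ", text)


# The pattern from the question consumes the two characters it matches,
# so replacing each match with a space deletes those characters.
consumed = re.sub(r"[a-z]\d|\d[a-z]", " ", "subject101")

print(consumed)                     # subjec 01
print(insert_spaces("subject101"))  # subject 101
print(insert_spaces("Street45B"))   # Street 45 B
```

Because the lookbehind and lookahead only assert what surrounds a position, the replacement adds a space and leaves every original character in place.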

You can use a conditional group, (?(group)yes|no), in the regex to check whether the preceding character was a letter or a digit.
Regex: (?<=([a-z])|\d)(?=(?(1)\d|[a-z]))
Python code:
import re
def addSpace(text):
    return re.sub(r'(?<=([a-z])|\d)(?=(?(1)\d|[a-z]))', ' ', text)
Output:
addSpace('subject101')
>>> subject 101
addSpace('101subject')
>>> 101 subject

A way to do this would be to pass a callable to re.sub. This allows you to reuse the matched substring to generate the replacement value.
subject = '101subject101'
s = re.sub(r'[a-zA-Z]\d|\d[a-zA-Z]', lambda m: ' '.join(m.group()), subject )
# s: '101 subject 101'
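One caveat with the callable approach is that each match consumes two characters, so a single letter sitting between two digits only gets a space on one side. The sketch below illustrates this with an input ("1a1") that is not in the original answer, next to a zero-width alternative; both function names are hypothetical.

```python
import re


def join_with_space(text):
    # Callable approach from the answer: every two-character match is
    # consumed and re-emitted with a space between its two characters.
    return re.sub(r"[a-zA-Z]\d|\d[a-zA-Z]", lambda m: " ".join(m.group()), text)


def split_at_boundaries(text):
    # Zero-width alternative: lookarounds match positions, so the same
    # character can border a match on both of its sides.
    return re.sub(r"(?<=[a-zA-Z])(?=\d)|(?<=\d)(?=[a-zA-Z])", " ", text)


print(join_with_space("101subject101"))  # 101 subject 101
print(join_with_space("1a1"))            # 1 a1
print(split_at_boundaries("1a1"))        # 1 a 1
```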

Related

Python Regex: Find character within matched string

I want to use a regex in python to find all words which start with \.
Afterwards, the regex should look for a [ within the matched word and then replace it with an underscore.
Here is an example:
input_string = SomeText \Word[0] Word[3] \SomeText[123] SomeText[10] SomeText
output_string = SomeText \Word_0] Word[3] \SomeText_123] SomeText[10] SomeText
The following python code replaces the square bracket with the underscore:
output_string = re.sub(<regex>, '_', input_string)
I have written this regex to find words that start with \ :
\\[^\s]+
https://regex101.com/r/d4YO9K/1
But now I don't know how to find the square bracket.
Can someone please contribute some ideas how to solve this problem?
You can define a function that takes the match object and returns the replacement string:
def rep(m):
    return m.group(0).replace("[", "_")
And pass it as the replacement parameter to re.sub:
re.sub(r"\\\S+", rep, "abc \\xyz[0] def")
'abc \\xyz_0] def'
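Applied to the input string from the question, the same idea can be sketched as follows. Limiting the replacement to the first "[" with a count of 1 is an assumption added here; the answer's version replaces every "[" inside the matched word.

```python
import re


def rep(m):
    # Replace only the first "[" inside the matched backslash word
    # (count=1 is an assumption; the answer replaces all of them).
    return m.group(0).replace("[", "_", 1)


# Raw string so that the backslashes in the sample text are kept literally.
s = r"SomeText \Word[0] Word[3] \SomeText[123] SomeText[10] SomeText"

# \\ matches a literal backslash, \S+ the rest of the word.
result = re.sub(r"\\\S+", rep, s)
print(result)
# SomeText \Word_0] Word[3] \SomeText_123] SomeText[10] SomeText
```

Words without a leading backslash are never matched, so their brackets stay untouched.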
You want to match:
r'((?:^| )\\\w*?)\['
And replace with:
r'\1_'
which is whatever is in group 1 followed by a '_'.
( - Start of group 1
(?:^| ) - Match start of string or a space
\\ - Match a backslash
\w*? - Match 0 or more word characters non-greedily
) - End of group 1
\[ - Match a [
See Regex Demo
Note that you should not be naming your variable input, which is the name of a built-in function.
import re
s = r'SomeText \Word[0] Word[3] \SomeText[123] SomeText[10] SomeText'
output = re.sub(r'((?:^| )\\\w*?)\[', r'\1_', s)
print(output)
Prints:
SomeText \Word_0] Word[3] \SomeText_123] SomeText[10] SomeText

Regex pattern to find n non-space characters of x length after a certain substring

I am using the regex pattern pattern = r'cig[\s:.]*(\w{10})' to extract the 10 characters after 'cig' in each line of my dataframe. This pattern handles all cases except those where the 10-character substring itself contains spaces.
For example, I am trying to extract Z9F27D2198 from the string
/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031
Stack Overflow collapsed the whitespace in the string above; there should be 17 whitespace characters between F and 2, after CIG.
Could you help me to edit the regex pattern in order to account for the white spaces in that 10-characters substring? I am also using flags=re.I to ignore the case of the strings in my re.findall calls.
To give an example string for which this pattern works:
CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E
and it outputs what I want: 7826328A2B.
Thanks in advance.
You can use
r'(?i)cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
See the regex demo. Details:
cig - a cig string
[\s:.]* - zero or more whitespaces, : or .
(\S(?:\s*\S){9}) - Group 1: a non-whitespace char and then nine occurrences of zero or more whitespaces followed by a non-whitespace char
(?!\S) - immediately to the right, there must be a whitespace or end of string.
In Python, you can use
import re
text = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031"
pattern = r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
matches = re.finditer(pattern, text, re.I)
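for match in matches:
    print(re.sub(r'\s+', '', match.group(1)), ' found at ', match.span(1))
# => Z9F27D2198  found at  (32, 43) for the single space shown above; the original answer, run on the string with its full run of spaces, reported (32, 57)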
See the Python demo.
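The pattern above can be wrapped in a small helper that returns the cleaned codes directly. This is only a sketch: the name extract_cig is hypothetical, and the 17 spaces are rebuilt programmatically from the question's description because the page collapsed them.

```python
import re

# Pattern from the answer, compiled case-insensitively.
CIG_PATTERN = re.compile(r"cig[\s:.]*(\S(?:\s*\S){9})(?!\S)", re.I)


def extract_cig(text):
    """Return each 10-character code after 'cig' with inner whitespace removed."""
    return [re.sub(r"\s+", "", m.group(1)) for m in CIG_PATTERN.finditer(text)]


# Rebuild the run of 17 spaces that the question says was lost in formatting.
spaced = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 17 + "27D2198 01762-0000031"
compact = "CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E"

print(extract_cig(spaced))   # ['Z9F27D2198']
print(extract_cig(compact))  # ['7826328A2B']
```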
What about:
# removes all white spaces with replace()
x = 'CIG7826328A2B FORNITURA ENERGIA ELETTRICA U'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = '7826328A2B'
x = '/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = 'Z9F27D2198'
This works as long as the string contains "CIG" only once, in upper case.

Extracting a complex substring using regex with data from a string in python

I have a string say
text = "i have on 31-Dec-08 USD 5234765 which I gave it in the donation"
I tried:
pattern = r"^[\d]{2}.*,[\d]{3}$"
data = re.findall(pattern, text)
for s in data:
    print(s)
My desired output:
[31-Dec-08, USD, 5234765]
You can do it this way:
import re
regex = r"(\w+-\w+-\w+)|([A-Z]{3})|(\d+)"
test_str = "i have on 31-Dec-08 USD 5234765 which I gave it in the donation"
matches = re.findall(regex, test_str)
temp = [_ for tupl in matches for _ in tupl if _]
print(temp) #['31-Dec-08', 'USD', '5234765']
\w matches any word character (equivalent to [a-zA-Z0-9_] for ASCII text)
+ matches the previous token one or more times, as many times as possible, giving back as needed (greedy)
- matches the character - literally
[A-Z]{3} matches exactly three uppercase letters
\d matches a digit (equivalent to [0-9] for ASCII text)
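If the three fields always appear in the order date, currency, amount, a stricter variant with named groups avoids flattening tuples. This is a sketch under that assumption; the group names and the date shape (two digits, three letters, two digits) are guesses based on the single example, not something the answer specifies.

```python
import re

# Assumed layout: DD-Mon-YY, then a 3-letter uppercase code, then digits.
FIELDS = re.compile(
    r"(?P<date>\d{2}-[A-Za-z]{3}-\d{2})\s+"
    r"(?P<currency>[A-Z]{3})\s+"
    r"(?P<amount>\d+)"
)

text = "i have on 31-Dec-08 USD 5234765 which I gave it in the donation"

m = FIELDS.search(text)
# Fall back to an empty list when the text does not contain the three fields.
fields = [m.group("date"), m.group("currency"), m.group("amount")] if m else []
print(fields)  # ['31-Dec-08', 'USD', '5234765']
```

Unlike the alternation in the answer, this pattern does not pick up stray numbers or three-letter uppercase words elsewhere in the text, at the cost of requiring the fields to be adjacent.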

remove n before a string

I want to remove unrequired r and n at beginning of each upper-case word and number in this string. I tried regex. Not sure if regex or some other method would be helpful here.
This is the code I am trying to use:
text = "nFamily n49 new nTom"
regex_pattern = re.compile(r'.*n[A-Z][a-z]*|[0-9]*\s')
matches = regex_pattern.findall(text)
for match in matches:
    text = text.replace(match, " ")
print(text)
Expected output:
Family 49 new Tom
You can use
text = re.sub(r'\bn(?=[A-Z0-9])', '', text)
See the regex demo.
Details:
\b - here, start of a word
n - the letter n
(?=[A-Z0-9]) - a positive lookahead that requires an uppercase ASCII letter or a digit to be present immediately to the right of the current location.
See the Python demo:
import re
rx = r"\bn(?=[A-Z0-9])"
text = "nFamily n49 new nTom"
print( re.sub(rx, '', text) )
# => Family 49 new Tom
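The question also mentions a stray r. Assuming it shows up in the same position as the n, the lookahead pattern can be generalized with a character class. The helper name strip_prefix and the second sample string are illustrative assumptions.

```python
import re


def strip_prefix(text, letters="n"):
    # Build a character class from the given letters; re.escape keeps
    # characters such as "-" or "]" from changing the class.
    pattern = r"\b[" + re.escape(letters) + r"](?=[A-Z0-9])"
    return re.sub(pattern, "", text)


print(strip_prefix("nFamily n49 new nTom"))        # Family 49 new Tom
print(strip_prefix("rFamily n49 new rTom", "rn"))  # Family 49 new Tom
```

The word "new" survives in both cases because its n is followed by a lowercase letter, which the lookahead rejects.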

Python regular expression to find letters and numbers

Given an input string, I used findall to find the words that consist only of letters and digits (the number of words to find is not fixed).
I created:
words = re.findall(r"\w*\s", x)  # x is the input string
If I enter "asdf1234 cdef11dfe a = 1 b = 2",
the string is split into asdf1234, cdef11dfe, a =, 1, b =, 2.
I would like to pick out only asdf1234 and cdef11dfe.
How should I write the regular expression?
Try [a-zA-Z0-9]{2,}.
This looks for any alphanumeric character ([a-zA-Z0-9]) at least 2 times in a row ({2,}). Requiring two or more characters is what filters out the one-character words of the string.
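A quick check of this character class on the string from the question, together with a small comparison against \w on a made-up input ("ab_cd 12") that shows how the two treat underscores:

```python
import re

x = "asdf1234 cdef11dfe a = 1 b = 2"

# Runs of two or more ASCII letters or digits; single characters are skipped.
words = re.findall(r"[a-zA-Z0-9]{2,}", x)
print(words)  # ['asdf1234', 'cdef11dfe']

# \w also accepts the underscore, so it keeps "ab_cd" together,
# while the explicit class splits it at the underscore.
print(re.findall(r"\w{2,}", "ab_cd 12"))           # ['ab_cd', '12']
print(re.findall(r"[a-zA-Z0-9]{2,}", "ab_cd 12"))  # ['ab', 'cd', '12']
```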
The problem with \w is that it includes underscores.
This one should work : (?<![\"=\w])(?:[^\W_]+)(?![\"=\w])
Explanation
(?:[^\W_]+) One or more characters that are neither non-word characters nor underscores, i.e. letters and digits (non-capturing group)
(?<![\"=\w]) not preceded by ", = or a word character
(?![\"=\w]) not followed by ", = or a word character
Sample code:
import re
regex = r"(?<![\"=\w])(?:[^\W_]+)(?![\"=\w])"
test_str = "a01a b02 c03 e dfdfd abcdef=2 b=3 e=4 c=\"a b\" aaa=2f f=\"asdf 12af\""
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    print(match.group())
