Python regex - substitute until certain character - python

I want to replace spaces with commas, but only up to the first /. I tried the following:
import re
txt = "usera 28935 28876 0 Apr25 ? 00:07:20 /xxx/yyyy/foo/bar/zzzzz/Java/jdk-1.8.0_101/xxx/xxx -cp /xxx/yyyy/foo/bar/zzzzz"
rem = (re.sub(' +', ' ', txt)) # convert multiple spaces into single
print(re.sub(' ', ',', rem.lstrip()))
But this inserts a comma after every space:
usera,28935,28876,0,Apr25,?,00:07:20,/xxx/yyyy/foo/bar/zzzzz/Java/jdk-1.8.0_101/xxx/xxx,-cp,/xxx/yyyy/foo/bar/zzzzz
Expected Output:
usera,28935,28876,0,Apr25,?,00:07:20,/xxx/yyyy/foo/bar/zzzzz/Java/jdk-1.8.0_101/xxx/xxx -cp /xxx/yyyy/foo/bar/zzzzz
That is, commas should replace spaces only up to the first /.
I have tried lookahead and lookbehind but was unable to work this out.
Could someone advise me on how to achieve this please?

Whenever you have a problem like this, consider splitting the string before reaching for a regex:
# split the text once at the first /
a, b = txt.split("/", 1)
# do the replacement in the first half
a = re.sub(" +", ",", a)
# join 'em back up
result = "{}/{}".format(a,b)
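The split above raises a ValueError when the line contains no / at all, because there is then only one piece to unpack. A minimal sketch of the same idea using str.partition, which tolerates that case, might look like this (the helper name commas_before_first_slash and the shortened sample line are made up for illustration):

```python
import re

def commas_before_first_slash(line):
    # partition() returns (head, sep, tail); sep and tail are empty
    # strings when the line contains no "/".
    head, sep, tail = line.partition("/")
    # Only the part before the first "/" has its space runs replaced.
    return re.sub(r" +", ",", head) + sep + tail

txt = "usera 28935 0 Apr25 ? 00:07:20 /xxx/yyyy/Java/xxx -cp /xxx/yyyy"
print(commas_before_first_slash(txt))
```

Everything after the first / is passed through untouched, so the spaces around -cp survive.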

You can use a lookbehind, but it needs to be variable-length, which the standard re module does not support. So you'll need the third-party regex module:
>>> import regex
>>> txt = "usera 28935 28876 0 Apr25 ? 00:07:20 /xxx/yyyy/foo/bar/zzzzz/Java/jdk-1.8.0_101/xxx/xxx -cp /xxx/yyyy/foo/bar/zzzzz"
>>> regex.sub(r'(?<!/.*) +', ',', txt)
'usera,28935,28876,0,Apr25,?,00:07:20,/xxx/yyyy/foo/bar/zzzzz/Java/jdk-1.8.0_101/xxx/xxx -cp /xxx/yyyy/foo/bar/zzzzz'
# or you can use \G
>>> regex.sub(r'\G([^/ ]*+) +', r'\1,', txt)
'usera,28935,28876,0,Apr25,?,00:07:20,/xxx/yyyy/foo/bar/zzzzz/Java/jdk-1.8.0_101/xxx/xxx -cp /xxx/yyyy/foo/bar/zzzzz'
The first one replaces spaces only if no / character occurs earlier in the string.
The second one uses \G to anchor each match where the previous one ended (or at the start of the string). It matches a run of characters other than space or /, followed by spaces, as many times as possible from the start of the string, and stops at the first /.
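If installing the regex module is not an option, a similar effect can probably be had with the standard re module by matching the whole prefix before the first / and rewriting it in a callback. This is a sketch of an alternative technique, not what the answer above does, and the sample line is shortened for illustration:

```python
import re

txt = "usera 28935 0 Apr25 ? 00:07:20 /xxx/yyyy/Java/xxx -cp /xxx/yyyy"

# ^[^/]* matches everything before the first "/" (or the whole string if
# there is no "/"). The callback replaces space runs inside that prefix only.
result = re.sub(r"^[^/]*", lambda m: re.sub(r" +", ",", m.group()), txt, count=1)
print(result)
```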

Related

Python: replace two or more spaces followed by a specified character with a single space followed by that character

How do I replace two or more spaces followed by a specified character with a single space followed by that character, so that, for example, "   &" becomes " &"? I could repeatedly run
str = str.replace("  &", " &")
but that is slow.
Use a regex:
import re
pattern = re.compile(r' +&')
string = '   &    & h'
print(pattern.sub(' &', string))
Output:
 & & h
To replace one or more whitespace characters with a single space:
import re
# match 1 or more spaces and then any non space character
pattern = re.compile(r'(\s+)([^\s]+)')
# replace with single space and keep second group
res = pattern.sub(r' \2', string)
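Another way to express "two or more spaces followed by a specific character" is a lookahead, which leaves the character itself out of the match so it does not need to be repeated in the replacement. A minimal sketch, assuming the character is & and using a made-up sample string:

```python
import re

s = "a   &    & h"

# " {2,}" matches a run of two or more spaces; the lookahead (?=&) requires
# the next character to be "&" without consuming it.
collapsed = re.sub(r" {2,}(?=&)", " ", s)
print(repr(collapsed))
```

Runs of spaces that are not followed by & are left alone, as are single spaces.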

Splitting a string by multiple possible delimiters

I want to parse a str into a list of float values, however I want to be flexible regarding my delimiters. Specifically, I would like to be able to use any of these
s = '3.14; 42.2' # delimiter is '; '
s = '3.14;42.2' # delimiter is ';'
s = '3.14, 42.2' # delimiter is ', '
s = '3.14,42.2' # delimiter is ','
s = '3.14 42.2' # delimiter is ' '
I thought about removing all spaces, but this would disable the last version; I tried the re.split()-function by doing re.split('[;, ]', s) which would work using a single character as delimiter but fails otherwise.
I can however do
s.replace('; ', ';').replace(', ', ';').replace(',', ';').replace(' ', ';')
s.split(';')
which works but seems not really like a good practice or useful - especially if I would add even more delimiters in the future. What would be a good approach to do this?
You can use re.split and split on the following pattern (the [ ] is a space; the brackets are only there to make it visible):
[;,] ?|[ ]
The pattern matches
[;,] ? Match either ; or , followed by an optional space
| or
[ ] Match a single space
A slightly stricter pattern could use lookarounds to assert a digit on both sides of the delimiter.
(?<=\d)(?:[;,] ?| )(?=\d)
The pattern matches:
(?<=\d) Positive lookbehind, assert a digit to the left
(?: Non capture group for the alternation
[;,] ? Match either ; or , followed by an optional space
| Or
' ' Match a single space
) Close non capture group
(?=\d) Positive lookahead, assert a digit to the right
Example code
import re
strings = [
"3.14; 42.2",
"3.14;42.2",
"3.14, 42.2",
"3.14,42.2",
"3.14 42.2"
]
for s in strings:
    print(re.split(r"[;,] ?| ", s))
Output
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
I think you can account for the space(s) after the separator like this:
re.split(r'[;,]\s*|\s+', s)
Here \s* matches the whitespace after the separator, if any, and the \s+ alternative covers the case where whitespace alone is the delimiter.
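You can also list the delimiters explicitly, as long as the longer ones come first (otherwise ',' matches before ', ' and leaves an empty string in the result):
res = re.split('; |, |;|,| ', s)
See https://www.geeksforgeeks.org/python-split-multiple-characters-from-string/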
Assuming you know the delimiter of the input ahead of time, you could write a function that takes the delimiter as an argument, replaces it with a space, and splits the result:
def split_on_delim(strng, delim):
    return strng.replace(delim, ' ').split()
For example:
>>> s = '3.14; 42.2'
>>> split_on_delim(s, '; ')
['3.14', '42.2']
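Since the goal in the question is a list of float values, the split can be combined with the conversion. A minimal sketch, assuming the only delimiters are ; and , with optional surrounding whitespace, or whitespace alone (the names DELIM and parse_floats are made up for illustration):

```python
import re

# A delimiter is ";" or "," with optional whitespace around it,
# or else a run of whitespace on its own.
DELIM = re.compile(r"\s*[;,]\s*|\s+")

def parse_floats(s):
    # strip() avoids empty fields caused by leading or trailing whitespace.
    return [float(part) for part in DELIM.split(s.strip())]

for s in ["3.14; 42.2", "3.14;42.2", "3.14, 42.2", "3.14,42.2", "3.14 42.2"]:
    print(parse_floats(s))
```

Adding another delimiter later should only require extending the character class [;,].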

Merging three regex patterns used for text cleaning to improve efficiency

Given a text, I want to make some modifications:
convert the word at the beginning of each sentence to lowercase
remove chars like ’ or ' (without adding whitespace)
replace unwanted chars, for example ³ or ? , ! . with whitespace
import re

def multiple_replace(text):
    # first sub so words like can't will change to cant and not can t
    first_strip = re.sub("[’']", '', text)

    def cap(match):
        return match.group().lower()

    p = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
    # second sub to change all words that begin a sentence to lowercase
    second_strip = p.sub(cap, first_strip)
    # third sub removes every . from the text unless it is used in a decimal number
    third_strip = re.sub(r'(?<!\d)\.|\.(?!\d)', '', second_strip)
    # fourth sub replaces unexpected chars such as ! , ? ³ with whitespace
    fourth_strip = re.sub('[^A-Za-z0-9##_$&%]+', ' ', third_strip)
    return fourth_strip
Is there a more efficient way of doing this? I am going over the text 4 times just to get it into the right format for parsing, which seems like a lot, especially if there are millions of documents.
You could make use of an alternation to match either an uppercase char A-Z at the start of the string, or after . ? or ! followed by a whitespace char.
I think you can also add a . to the negated character class [^A-Za-z0-9##_$&%.]+ to not remove the dot for a decimal value and change the order of operations to use cap first before removing any dots.
import re
def cap(match):
    return match.group().lower()
p = re.compile(r'(?<=[.?!]\s)[A-Z]|^[A-Z]', re.M)
text = "A test here. this `` (*)is. Test, but keep 1.2"
first_strip = p.sub(cap, text)
second_strip = re.sub(r"[`']+|(?<!\d)\.|\.(?!\d)", '', first_strip)
third_strip = re.sub('[^A-Za-z0-9##_$&%.]+', ' ', second_strip)
print(third_strip)
Output
a test here this is test but keep 1.2
You could also use a lambda with all 3 patterns and 2 capturing groups, checking the group values in the callback, but I think that would not improve readability or make the code easier to change or test.
import re
p = re.compile(r"(?:((?<=[.?!]\s)[A-Z]|^[A-Z])|([`']+|(?<!\d)\.|\.(?!\d))|[^A-Za-z0-9##_$&%.]+)", re.M)
text = "A test here. this `` (*)is. Test, but keep 1.2"
result = re.sub(p, lambda x: x.group(1).lower() if x.group(1) else ('' if x.group(2) else ' '), text)
print(result)
Output
a test here this is test but keep 1.2
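For many documents, one option is to keep the three passes but give the patterns names and compile them once at module level. The re module already caches compiled patterns, so the gain is likely more about readability than raw speed. A minimal sketch, with made-up names (SENTENCE_START, DROP, TO_SPACE, clean) and a made-up sample sentence:

```python
import re

# Compiled once and reused for every document.
SENTENCE_START = re.compile(r"(?<=[.?!]\s)[A-Z]|^[A-Z]", re.M)
# Apostrophes and backticks, plus any dot that is not inside a decimal number.
DROP = re.compile(r"[’'`]+|(?<!\d)\.|\.(?!\d)")
# Anything outside the allowed set; the dot is allowed so decimals survive.
TO_SPACE = re.compile(r"[^A-Za-z0-9#_$&%.]+")

def clean(text):
    text = SENTENCE_START.sub(lambda m: m.group().lower(), text)
    text = DROP.sub("", text)
    return TO_SPACE.sub(" ", text)

print(clean("Can't stop. Won't stop! Pi is 3.14."))
```

Removing the apostrophes before the final pass is what keeps can't as cant rather than can t.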

python regex - replace newline (\n) to something else

I'm trying to convert multiple consecutive newline characters followed by a capital letter to "____" so that I can parse them.
For example,
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
In [25]: i
Out [25]: 'Inc____Contact'
This string works fine. I can parse them using ____ later.
However it doesn't work on this particular string.
i = "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
Out [31]: '(2 months)____L'
It ate capital M.
What am I missing here?
EDIT: To replace multiple consecutive newline characters (\n) with ____, this should do:
>>> import re
>>> i = "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'
(?=[A-Z]) is a lookahead that asserts the newline characters are followed by a capital letter, without consuming that letter.
Well, let's take a look at your regex ([\n]+)([A-Z])+. The first part, ([\n]+), is fine: it matches multiple occurrences of a newline into one group (note that this won't match the carriage return \r). The second part, ([A-Z])+, leads to your error. It matches a single uppercase letter into a capturing group, and repeats that as many times as there are uppercase letters. Each repetition resets the group, so it ends up holding only the last matched uppercase letter, which is then used for the replacement.
Try the following and see what happens
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
You could simply place the + inside the capturing group, so that all the uppercase letters are matched into it. You could also just leave the + out, since it makes no difference how many uppercase letters follow.
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)
You could also use ([\r\n]+) as pattern, if you want to consider carriage returns
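The behaviour of the repeated group can be checked directly. The short sketch below reuses the ABRAXAS string from above, shows that group 2 ends up holding only the last letter, and contrasts that with a lookahead, which checks for the capital letter without consuming it:

```python
import re

s = "Inc\n\nABRAXAS"

# A quantified group keeps only its last repetition, so group 2 is "S".
m = re.search(r"(\n+)([A-Z])+", s)
print(m.group(2))

# The original substitution therefore drops every capital except the last.
broken = re.sub(r"(\n+)([A-Z])+", r"____\2", s)
print(repr(broken))

# With a lookahead, nothing after the newlines is consumed.
fixed = re.sub(r"\n+(?=[A-Z])", "____", s)
print(repr(fixed))
```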
Try:
import re
p = re.compile(r'\r?\n')
test_str = "(2 months)\n\nML"
subst = "_"
result = re.sub(p, subst, test_str)
This replaces each line break with a single underscore, reducing the string to
(2 months)__ML

How to replace .. in a string in python

I am trying to replace the double dot in this string with a space:
import re
s = "haha..hehe.hoho"
s = re.sub('[..+]+',' ', s)
The output I get:
haha hehe hoho
Desired output:
haha hehe.hoho
What am I doing wrong?
Test on sites like regexpal: http://regexpal.com/
It's easier to get the output and check if the regex is right.
You should change your regex to something like: '\.\.' if you want to remove only double dots.
If you want to remove when there's at least 2 dots you can use '\.{2,}'.
Every character you put inside [] is one of the characters the class is allowed to match.
The dot character has a special meaning in a regex outside a character class. To avoid this meaning, prefix it with the escape character: \
You can read more about regular expressions metacharacters here: https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php
[a-z] A range of characters. Matches any character in the specified range.
. Matches any single character except "\n".
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
Your new code:
import re
s = "haha..hehe.hoho"
# pattern = r'\.\.'  # if you want to replace exactly 2 dots
pattern = r'\.{2,}'  # if you want to replace runs of at least 2 dots
s = re.sub(pattern, ' ', s)
Unless you are constrained to use a regex, I find the replace() method much simpler:
s = "haha..hehe.hoho"
print(s.replace('..', ' '))
gives your desired output:
haha hehe.hoho
Change:
re.sub('[..+]+',' ', s)
to:
re.sub(r'\.\.+', ' ', s)
[..+]+ is a character class followed by +, so it means "one or more of any character in the list". It therefore matches the single . as well as the .. in your input. Make the change as below:
s = re.sub(r'\.\.+', ' ', s)
[] is a character class and will match any one character listed in it (here, any single .).
I'm guessing you used it because a plain . wouldn't work, since it is a metacharacter meaning any character. You can simply escape it with a \ to mean a literal dot, like this:
s = re.sub(r'\.\.', ' ', s)
Your regex [..+]+ allows for 1 or more literal periods or plus symbols, which is not what you want.
You do not have to repeat the same symbol when looking for it, you can use quantifiers, like {2}, which means "exactly 2 occurrences".
You can use split and join, see sample working program:
import re
s = "haha..hehe.hoho"
s = " ".join(re.split(r'\.{2}', s))
print(s)
Output:
haha hehe.hoho
Or you can use the sub with the regex, too:
s = re.sub(r'\.{2}', ' ', "haha..hehe.hoho")
In case you have cases with more than 2 periods, you should use \.{2,} regex.
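The difference between the character class and the quantified literal dot is easy to see side by side. A short sketch using the string from the question:

```python
import re

s = "haha..hehe.hoho"

# [..+]+ is a character class: one or more characters that are "." or "+".
# It therefore also matches the single dot.
with_class = re.sub(r"[..+]+", " ", s)
print(with_class)

# \.{2,} is a literal dot repeated at least twice, so the single dot stays.
with_quantifier = re.sub(r"\.{2,}", " ", s)
print(with_quantifier)
```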
