Regex extract strings - python

Original text: This is the first variable "${abc}" and this is the second variable "${def}"
Desired output: This is the first variable and this is the second variable
I want to get rid of "${abc}" and "${def}" using regex. Currently, I am using this regex command: \".*\" but the output I am getting is "${abc}" and this is the second variable "${def}"

The .* in your pattern is greedy, so it runs from the first quote all the way to the last one. Match only characters that are not " instead: replace .* with [^"]*.
>>> import re
>>> in_str = 'This is the first variable "${abc}" and this is the second variable "${def}"'
>>> re.sub(r'"[^"]*"', '', in_str)
'This is the first variable  and this is the second variable '
Better yet, constrain it further so that you only match "${...}" rather than anything enclosed in quotes (r'"\$\{[^}]*\}"'):
>>> re.sub(r'"\$\{[^}]*\}"', '', in_str)
'This is the first variable  and this is the second variable '
Explanation:
"\$\{[^}]*\}"
-------------
" " : Quotes
\$ : A literal $ sign
\{ \} : Braces
[^}]* : Zero or more characters that are not }
To get rid of the trailing spaces after the match, add an optional \s? at the end of the regex.
>>> re.sub(r'"\$\{[^}]*\}"\s?', '', in_str)
'This is the first variable and this is the second variable '
Of course, this leaves behind a trailing space if the last word was a match, but you can just .strip() that out.
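The two steps above (substitute, then strip) can be combined in one small function. This is only a sketch: the helper name drop_placeholders is made up for illustration, and it assumes every placeholder has the exact form "${...}" with the quotes included.

```python
import re

# A quoted ${...} placeholder, plus at most one whitespace character after it.
PLACEHOLDER = re.compile(r'"\$\{[^}]*\}"\s?')

def drop_placeholders(text):
    # Hypothetical helper: remove the placeholders, then strip whatever
    # whitespace is left over at either end of the string.
    return PLACEHOLDER.sub('', text).strip()

in_str = 'This is the first variable "${abc}" and this is the second variable "${def}"'
print(drop_placeholders(in_str))
```

Compiling the pattern once is optional; it mainly helps readability when the same pattern is applied to many strings.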

Related

Replacing everything with a backslash till next white space

As part of preprocessing my data, I want to replace anything that starts with a backslash, up to the next space, with an empty string. For example, \fs24 or \qc23424 should be replaced with nothing. There can be multiple occurrences of such backslash tags, and I want to remove them all. I have created a "tags to be eradicated" list which I aim to use in a regular expression to clean the extracted text.
Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.
Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.
I am using the regular expression based replace function in Python:
udpated = re.sub(r'/\fs\d+', '')
However, this does not give the desired result. Alternatively, I have built an eradicate list and replace each entry in a loop, but this is a performance killer.

Assuming a 'tag' can also occur at the very beginning of your string, and to avoid false positives, you could use:
\s?(?<!\S)\\[a-z\d]+
And replace with nothing.
\s? - Optionally match a whitespace character (if a tag is mid-string and therefore preceded by a space);
(?<!\S) - Assert position is not preceded by a non-whitespace character (to allow a position at the start of your input);
\\ - A literal backslash.
[a-z\d]+ - 1+ (Greedy) Characters as per given class.
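As a quick check, the pattern above can be applied to the input string from the question. This is a minimal sketch; the names TAG and cleaned are chosen here for illustration. Note that [a-z\d] only covers lowercase letters and digits, so tags containing uppercase letters would need re.IGNORECASE or a wider class.

```python
import re

# Optional leading whitespace, then a backslash tag that is not glued
# to a preceding non-whitespace character.
TAG = re.compile(r'\s?(?<!\S)\\[a-z\d]+')

s = r"This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."
cleaned = TAG.sub('', s)
print(cleaned)

# A backslash in the middle of a word is left alone by the lookbehind.
print(TAG.sub('', r"C:\temp stays as it is"))
```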

First, the / doesn't belong in the regular expression at all.
Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+'.) The \ before f is meant to be used literally; the \ before d is part of the character class matching the digits.
Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.
>>> import re
>>> re.sub(r'\\fs\d+', '', r"This is a string \fs24 and it contains...")
'This is a string  and it contains...'

Does that work for you?
re.sub(
r"\\\w+\s*", # a backslash followed by alphanumerics and optional spacing;
'', # replace it with an empty string;
input_string # in your input string
)
>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'

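'\\' matches a literal '\' and '\w+' matches the word characters that follow it (letters, digits, and underscores), stopping at the first character that is not one of those, such as a space or a period.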
import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)
output:
'This is a string  and it contains some texts and tags . which I want to remove from my string.'

I tried this and it worked fine for me:
def remover(text, state):
    removable = text.split("\\")[1]
    removable = removable.split(" ")[0]
    removable = "\\" + removable + " "
    text = text.replace(removable, "")
    state = "\\" in text
    return text, state

text = "hello \\I'm new here \\good luck"
state = True
while state:
    text, state = remover(text, state)
print(text)

surrounding a pattern in python string with brackets

I have a string looks like this:
oldString="this is my {{string-d}}" => "this is my {{(string-d)}}"
oldString2="this is my second{{ new_string-d }}" => "this is my second{{ (new_string-d) }}"
oldString2="this is my second new_string-d " => "this is my second (new_string-d) "
oldString2="this is my second new[123string]-d " => "this is my second (new[123string]-d) "
Whenever I see "-d", I want to wrap the word it is attached to, together with the "-d" itself, in parentheses.
I wrote code that looks for the pattern "-d" in a string and partitions the string into three parts: the text before "-d", the "-d" itself, and the text after "-d". I then walk backwards through the part before "-d" until I find whitespace or "{", stop there, and add the parentheses.
P.S. I read many files and try to modify the strings in them; the example above just demonstrates what I'm trying to do.
My code looks like this:
if ('-d') in oldString:
    p = oldString.partition('-d')
    v = p[p.index('-d')-1]
    beforeString=''
    for i in reversed(v):
        if i != ' ' or i != '{':
            beforeString=i+beforeString
    indexNew = v.index(i)
    outPutLine = v[:indexNew]+'('+v[indexNew:]
    newString = outPutLine + '-d' + ' )'
    print newString
the result of running the code will be:
newString = "(this is my {{string-d )"
As you can see, the opening bracket is placed before "this" instead of before "string". Why is this happening? Also, I'm not sure this is the best way to do this kind of find and replace, so any suggestions would be much appreciated.

>>> import re
>>> oldString = "this is my {{string-d}}"
>>> oldString2 = "this is my second{{ new_string-d }}"
>>> re.sub(r"(\w*-d)", r"(\1)", oldString)
'this is my {{(string-d)}}'
>>> re.sub(r"(\w*-d)", r"(\1)", oldString2)
'this is my second{{ (new_string-d) }}'
Note that this matches "words" assuming that a word is composed of only letters, numbers, and underscores.
Here's a more thorough breakdown of what's happening:
An r before a string literal means the string is a "raw string". It prevents Python from interpreting backslashes as the start of an escape sequence. For instance, r"\n" is a backslash followed by the letter n, rather than a single newline character. I like to use raw strings for my regex patterns, even though it's not always necessary.
The parentheses surrounding \w*-d form a capturing group. It indicates to the regex engine that the contents of the group should be saved for later use.
the sequence \w means "any alphanumeric character or underscore".
* means "zero or more of the preceding item". \w* together means "zero or more alphanumeric characters or underscores".
-d means "a hyphen followed by the letter d".
All together, (\w*-d) means "zero or more alphanumeric characters or underscores, followed by a hyphen and the letter d. Save all of these characters for later use."
The second string describes what the matched data should be replaced with. "\1" means "the contents of the first captured group". The parentheses are just regular parentheses. All together, (\1) in this context means "take the saved content from the captured group, surround it in parentheses, and put it back into the string".
If you want to match more characters than just alphanumeric and underscore, you can replace \w with whatever collection of characters you want to match.
>>> re.sub(r"([\w\.\[\]]*-d)", r"(\1)", "{{startingHere[zero1].my_string-d }}")
'{{(startingHere[zero1].my_string-d) }}'
If you also want to match words ending with "-d()", you can match a pair of parentheses with \(\) and mark it as optional using ?.
>>> re.sub(r"([\w\.\[\]]*-d(\(\))?)", r"(\1)", "{{startingHere[zero1].my_string-d() }}")
'{{(startingHere[zero1].my_string-d()) }}'

If you want the bracketing to only take place inside double curly braces, you need something like this:
re.sub(r'({{\s*)([^}]*-d)(\s*}})', r'\1(\2)\3', s)
Breaking that down a bit:
# the target pattern
r'({{\s*)([^}]*-d)(\s*}})'
# ^^^^^^^                  capture group 1, opening {{ plus optional space
#        ^^^^^^^^^         capture group 2, non-braces plus -d
#                 ^^^^^^^  capture group 3, spaces plus closing }}
The replacement r'\1(\2)\3' just reassembles the groups, with parentheses around the middle one.
Putting it together:
import re

def quote_string_d(s):
    return re.sub(r'({{\s*)([^}]*-d)(\s*}})', r'\1(\2)\3', s)

print(quote_string_d("this is my {{string-d}}"))
print(quote_string_d("this is my second{{ new_string-d }}"))
print(quote_string_d("this should not be quoted other_string-d "))
Output:
this is my {{(string-d)}}
this is my second{{ (new_string-d) }}
this should not be quoted other_string-d
Note the third instance does not get the parentheses, because it's not inside {{ }}.

Remove String spaces with regular expression

I want to remove the leading and trailing spaces from a string using Python's re module.
I tried:
re.sub(r"\s+", "", " hello world ")  # remove the leading space
But it does not remove anything.

>>> re.sub(r'^\s+|\s+$', '', ' hello world ')
'hello world'
That will remove all leading and trailing whitespace, though, not necessarily only the first and last characters.
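If the goal really is to drop at most one whitespace character from each end, dropping the + quantifiers is one way to do it. The sketch below assumes a single-line string; the variable names are only illustrative. For the common case of removing all leading and trailing whitespace, the built-in str.strip() does the same job as the regex above.

```python
import re

s = '  hello world  '

# Without the +, each alternative consumes exactly one whitespace character.
one_each_side = re.sub(r'^\s|\s$', '', s)
print(repr(one_each_side))

# All leading and trailing whitespace: regex version and built-in version.
all_regex = re.sub(r'^\s+|\s+$', '', s)
all_strip = s.strip()
print(repr(all_regex), repr(all_strip))
```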

Replacing spaces after any digit using re.sub

I am using the following code to try and replace any spaces found after a digit in a string derived from a regex with a comma:
mystring = re.sub('\d ', '\d,',mystring)
However, given an input of "6 ", this replaces it with "\d,". What is the correct syntax to get "6," as my output?

You need to use a capturing group to capture the digit that precedes the space, so that you can refer to that digit in the replacement.
mystring = re.sub(r'(\d) ', r'\1,',mystring)
Or use a positive lookbehind, which checks for the digit without consuming it:
mystring = re.sub(r'(?<=\d) ', r',',mystring)
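Both variants should give the same result. The sample string below is made up for illustration and is not from the question.

```python
import re

mystring = "6 apples and 12 pears"

# Capturing group: the digit is matched, then put back via \1.
with_group = re.sub(r'(\d) ', r'\1,', mystring)

# Lookbehind: only the space is matched, so nothing needs to be put back.
with_lookbehind = re.sub(r'(?<=\d) ', ',', mystring)

print(with_group)
print(with_lookbehind)
```

Spaces that do not follow a digit (such as the one after "and") are left untouched in both cases.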

Python regex : adding space after comma only if not followed by a number

I want to add spaces before and after commas in a string, but only if the following character isn't a digit (0-9). I tried the following code:
newLine = re.sub(r'([,]+[^0-9])', r' \1 ', newLine)
But it seems like the \1 is taking the 2 matching characters and not just the comma.
Example:
>>> newLine = "abc,abc"
>>> newLine = re.sub(r'([,]+[^0-9])', r' \1 ', newLine)
"abc ,a bc"
Expected Output:
"abc , abc"
How can I tell the sub to take only the 'comma' ?

Use this one:
newLine = re.sub(r'[,]+(?![0-9])', r' , ', newLine)
Here the negative lookahead (?![0-9]) checks that the comma(s) are not followed by a digit, without consuming the next character.
Your regex didn't work because the group ([,]+[^0-9]) captures both the comma and the character after it, so the spaces were placed around that pair.
UPDATE: If you need to handle characters other than the comma as well, place them inside the character class [] and capture them as group \1 using ():
newLine = re.sub(r'([,/\\]+)(?![0-9])', r' \1 ', newLine)
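A short sketch of both patterns in action. The function names and the sample inputs other than "abc,abc" are assumptions made for illustration.

```python
import re

def space_commas(line):
    # Pad commas with spaces unless a digit follows.
    return re.sub(r',+(?![0-9])', ' , ', line)

def space_separators(line):
    # Same idea for commas, slashes and backslashes; \1 keeps the
    # separator that was actually matched.
    return re.sub(r'([,/\\]+)(?![0-9])', r' \1 ', line)

print(space_commas("abc,abc"))
print(space_commas("1,000 items,total"))
print(space_separators("a/b,c"))
```

The comma in "1,000" is skipped because the lookahead sees a digit right after it.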
