Regex expression with special chars and text

Regex expression with special chars and text - python

Given :
str_var ='host="dsa.asd.dsc"port="1234"service_nameORdbName="dsa"pass="dsa"user="ewq"'
How to match for example in host's case a stirng that can have abc.dfg.ewq.asd and so on? The data can contain only '.' as special character.
The expression that i got can only match text because w+.
result = re.findall('(\w+)="(\w+)"', str_var)
Expected result :
[('host':'dsa.asd.dsc'), ('port', '1234'), ('service_nameORdbName', 'dsa'), ('pass', 'dsa'), ('user', 'ewq')]

You may either add a . to \w:
result = re.findall('(\w+)="([\w.]+)"', str_var)
Or, match . delimited words with \w+(?:\.\w+)* (one or more word chars followed with 0 or more repetitions of a dot and then one or more word chars):
result = re.findall('(\w+)="(\w+(?:\.\w+)*)"', str_var)
Or, match values in-between double quotes that may contain anything but a double quote inside (with "[^"]*" that matches a ", then zero or more chars other than a double quote and then a ") :
result = re.findall('(\w+)="([^"]+)"', str_var))
See the Python demo.

Related

How to split a string with parentheses and spaces into a list

I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code that I tried is below. In the site regexr, the regex expression match the parts that I want but gives a warning, so... I'm not a expert in regex, I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))

Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
Regex demo
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']

For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', str)
Python regex<¯\(ツ)/¯>Python code
This does not work, however, for string defined in the OP's code, which contains a question mark, which is contrary to the statement of the question in terms of the two examples.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of two space characters in a character class merely to make it visible.

Replacing everything with a backslash till next white space

As part of preprocessing my data, I want to be able to replace anything that comes with a slash till the occurrence of space with empty string. For example, \fs24 need to be replaced with empty or \qc23424 with empty. There could be multiple occurrences of tags with slashes which I want to remove. I have created a "tags to be eradicated" list which I aim to consume in a regular expression to clean the extracted text.
Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.
Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.
I am using the regular expression based replace function in Python:
udpated = re.sub(r'/\fs\d+', '')
However, this is not fetching the desired result. Alternately, I have built an eradicate list and replacing that from a loop from top to lower number but this is a performance killer.

Assuming a 'tag' can also occur at the very beginning of your string, and avoid selecting false positives, maybe you could use:
\s?(?<!\S)\\[a-z\d]+
And replace with nothing. See an online demo.
\s? - Optionally match a whitespace character (if a tag is mid-string and therefor preceded by a space);
(?<!\S) - Assert position is not preceded by a non-whitespace character (to allow a position at the start of your input);
\\ - A literal backslash.
[a-z\d]+ - 1+ (Greedy) Characters as per given class.

First, the / doesn't belong in the regular expression at all.
Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+'.) The \ before f is meant to be used literally; the \ before d is part of the character class matching the digits.
Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.
>>> re.sub(r'\\fs\d+', '', r"This is a string \fs24 and it contains...")
'This is a string and it contains...'

Does that work for you?
re.sub(
r"\\\w+\s*", # a backslash followed by alphanumerics and optional spacing;
'', # replace it with an empty string;
input_string # in your input string
)
>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'

'\\' matches '\' and 'w+' matches a word until space
import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)
output:
'This is a string and it contains some texts and tags . which I want to remove from my string.'

I tried this and it worked fine for me:
def remover(text, state):
removable = text.split("\\")[1]
removable = removable.split(" ")[0]
removable = "\\" + removable + " "
text = text.replace(removable, "")
state = True if "\\" in text else False
return text, state
text = "hello \\I'm new here \\good luck"
state = True
while state:
text, state = remover(text, state)
print(text)

how to match either word or sentence in this Python regex?

I have a decent familiarity with regex but this is tricky. I need to find instances like this from a SQL case statement:
when col_name = 'this can be a word or sentence'
I can match the above when it's just one word, but when it's more than one word it's not working.
s = """when col_name = 'a sentence of words'"""
x = re.search("when\s(\w+)\s*=\s*\'(\w+)", s)
if x:
print(x.group(1)) # this returns "col_name"
print(x.group(2)) # this returns "a"
I want group(2) to return "a sentence of words" but I'm just getting the first word. That part could either be one word or several. How to do it?
When I add in the second \', then I get no match:
x = re.search("when\s(\w+)\s*=\s*\'(\w+)\'", s)

You may match all characters other than single quotation mark rather than matching letters, digits and connector punctuation ("word" chars) with the Group 2 pattern:
import re
s = """when col_name = 'a sentence of words'"""
x = re.search(r"when\s+(\w+)\s*=\s*'([^']+)", s)
if x:
print(x.group(1)) # this returns "col_name"
print(x.group(2)) # this returns "a sentence of words"
See the Python demo
The [^'] is a negated character class that matches any char but a single quotation mark, see the regex demo.
In case the string can contain escaped single quotes, you may consider replacing [^'] with
If the escape char is ': ([^']*(?:''[^']*)*)
If the escape char is \: ([^\\']*(?:\\.[^'\\]*)*).
Note the use of the raw string literal to define the regex pattern (all backslashes are treated as literal backslashes inside it).

Python RegEx to get words after a specific string

having a string
string=
""""$deletedFields":["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,3)","name":"Finance","$type":"voyager.identity.profile.Skill"},{"$deletedFields":["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,22)","name":"Financial ["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,34)","name":"Due
Diligence","name":"Strategy""""
What reguar expression can i use to retrieve values after "name": to get Due Dilligence, Financial, and Finance
i have tried
match = re.compile(r'"name"\:(.\w+)')
match.findall(string)
but it returns
['"Finance', '"Financial', '"Due', '"Financial', '"Strategy']
The Due Diligence is split and i want both words as one.

Your whitespace is not detected by regex because /w only searches for non-special characters.
"name"\:(.\w+\s*\w*) accounts for any possible spaces with an extra word (Will not work for three words, but will in your situation)
"name"\:(.\w+\s*\w*"?) accounts for the quotations " at the end of each one but doesn't get Financial.
Example
Edit: Fixed second regex for "Financial

I would use the non-hungry .*? expression with a trailing quote:
import re
string = """$deletedFields":["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,3)","name":"Finance","$type":"voyager.identity.profile.Skill"},{"$deletedFields":["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,22)","name":"Financial ["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,34)","name":"Due Diligence","name":"Strategy"""
# With the leading double quote
match = re.compile(r'"name"\:(".*?)["\[]')
a = match.findall(string)
print a
# Stripping out the leading double quote
match = re.compile(r'"name"\:"(.*?)["\[]')
b = match.findall(string)
print b
And the final output is:
['"Finance', '"Financial ', '"Due Diligence']
['Finance', 'Financial ', 'Due Diligence']

Match charactes and whitespaces, but not numbers

I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.

Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
See regex in use here
^(?:(?!\d)[\w ])+$
Note: This regex uses the mu flags for multiline and Unicode (multiline only necessary if input is separated by newline characters)
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)`
$ Assert position at the end of the line

You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regex expression with special chars and text - python

Related

How to split a string with parentheses and spaces into a list

Replacing everything with a backslash till next white space

how to match either word or sentence in this Python regex?

Python RegEx to get words after a specific string

Match charactes and whitespaces, but not numbers

Categories

Resources