Python RegEx to get words after a specific string - python

I have a string:
string=
""""$deletedFields":["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,3)","name":"Finance","$type":"voyager.identity.profile.Skill"},{"$deletedFields":["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,22)","name":"Financial ["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,34)","name":"Due
Diligence","name":"Strategy""""
What regular expression can I use to retrieve the values after "name": to get Due Diligence, Financial, and Finance?
I have tried
match = re.compile(r'"name"\:(.\w+)')
match.findall(string)
but it returns
['"Finance', '"Financial', '"Due', '"Financial', '"Strategy']
Due Diligence is split, and I want both words as one value.

Your whitespace is not matched because \w only matches word characters (letters, digits, and the underscore), not spaces.
"name"\:(.\w+\s*\w*) accounts for any possible spaces with an extra word (Will not work for three words, but will in your situation)
"name"\:(.\w+\s*\w*"?) accounts for the quotations " at the end of each one but doesn't get Financial.
Example
Edit: Fixed second regex for "Financial
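A minimal sketch of a pattern that does not depend on the number of words, assuming the values never contain an escaped quote. The sample string is a shortened, made-up stand-in for the question's data, not the original input.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's data.
sample = '"name":"Finance","x":1,"name":"Due Diligence","name":"Risk Management Basics"'

# [^"]+ matches every character up to the closing quote, spaces included,
# so one-, two- and three-word values are all captured whole.
names = re.findall(r'"name":"([^"]+)"', sample)
print(names)
```

Because the character class excludes only the double quote, the capture stops at the closing quote regardless of how many words come before it.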

I would use the non-greedy .*? expression with a trailing quote:
import re
string = """$deletedFields":["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,3)","name":"Finance","$type":"voyager.identity.profile.Skill"},{"$deletedFields":["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,22)","name":"Financial ["standardizedSkillUrn","standardizedSkill"],"entityUrn":"urn:li:fs_skill:(ACoAAAIv9SQBMzclPm3CZzL1QceTH5W0VrsdxbE,34)","name":"Due Diligence","name":"Strategy"""
# With the leading double quote
match = re.compile(r'"name"\:(".*?)["\[]')
a = match.findall(string)
print(a)
# Stripping out the leading double quote
match = re.compile(r'"name"\:"(.*?)["\[]')
b = match.findall(string)
print(b)
And the final output is:
['"Finance', '"Financial ', '"Due Diligence']
['Finance', 'Financial ', 'Due Diligence']

Related

Replacing everything with a backslash till next white space

As part of preprocessing my data, I want to replace anything that starts with a backslash, up to the next space, with an empty string. For example, \fs24 needs to be replaced with an empty string, as does \qc23424. There could be multiple occurrences of such backslash tags which I want to remove. I have created a "tags to be eradicated" list which I aim to consume in a regular expression to clean the extracted text.
Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.
Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.
I am using the regular expression based replace function in Python:
udpated = re.sub(r'/\fs\d+', '')
However, this does not give the desired result. Alternatively, I have built an eradicate list and replace its entries in a loop, from the top number down to the lower ones, but this is a performance killer.
Assuming a 'tag' can also occur at the very beginning of your string, and to avoid false positives, maybe you could use:
\s?(?<!\S)\\[a-z\d]+
And replace with nothing. See an online demo.
\s? - Optionally match a whitespace character (if a tag is mid-string and therefore preceded by a space);
(?<!\S) - Assert position is not preceded by a non-whitespace character (to allow a position at the start of your input);
\\ - A literal backslash.
[a-z\d]+ - 1+ (Greedy) Characters as per given class.
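The pattern explained above can be tried in Python roughly like this. This is only a sketch: the first input is the string from the question, and the C:\temp line is a made-up example of a backslash that should not be treated as a tag.

```python
import re

text = r"This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."

# Optional leading space, then a backslash tag that starts a
# whitespace-delimited token (or starts the string).
tag = re.compile(r"\s?(?<!\S)\\[a-z\d]+")

cleaned = tag.sub("", text)
print(cleaned)

# A backslash in the middle of a token is left alone (hypothetical input).
untouched = tag.sub("", r"C:\temp stays")
print(untouched)
```

Consuming the optional space in front of the tag is what avoids the doubled spaces that a plain tag-only pattern leaves behind.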
First, the / doesn't belong in the regular expression at all.
Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+'.) The \ before f is meant to be used literally; the \ before d is part of the character class matching the digits.
Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.
>>> re.sub(r'\\fs\d+', '', r"This is a string \fs24 and it contains...")
'This is a string  and it contains...'
Does that work for you?
re.sub(
r"\\\w+\s*", # a backslash followed by alphanumerics and optional spacing;
'', # replace it with an empty string;
input_string # in your input string
)
>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'
\\ matches a literal backslash and \w+ matches the word characters that follow it, stopping at the first non-word character such as a space
import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)
output:
'This is a string  and it contains some texts and tags . which I want to remove from my string.'
I tried this and it worked fine for me:
def remover(text, state):
    # Take the text after the first backslash, up to the next space
    removable = text.split("\\")[1]
    removable = removable.split(" ")[0]
    # Rebuild the full tag, including its trailing space, and remove it
    removable = "\\" + removable + " "
    text = text.replace(removable, "")
    state = True if "\\" in text else False
    return text, state

text = "hello \\I'm new here \\good luck"
state = True
while state:
    text, state = remover(text, state)
print(text)
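If only the tags from a known list should go (the question mentions a "tags to be eradicated" list), one compiled alternation avoids the per-tag loop. A rough sketch, assuming the list holds tag names without the backslash or the digits; tags_to_remove and the sample text are made up for illustration.

```python
import re

# Hypothetical list of tag names to remove (not from the original post).
tags_to_remove = ["fs", "qc"]

# One alternation, longest names first, each escaped so it is matched literally.
alternation = "|".join(
    re.escape(name) for name in sorted(tags_to_remove, key=len, reverse=True)
)
tag_pattern = re.compile(r"\s?\\(?:" + alternation + r")\d*")

text = r"This is a string \fs24 and it contains some tags \qc23424. Keep \other."
cleaned = tag_pattern.sub("", text)
print(cleaned)
```

The text is scanned once no matter how many names are in the list, and tags that are not listed (here \other) are left in place.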

Python Regex: add restriction that it should NOT be a match if there is a quotation mark somewhere before the match AND somewhere after

I would like to:
remove any leading whitespace in the example matches below and
exclude matches that have at some position before the string a quotation mark AND at some position after the string another quotation mark (i.e. the strings that should NOT be matched are enclosed by quotation marks but the quotation marks will not necessarily be directly before and after the string).
I tried with combining negative lookbehind & lookahead but I somehow cannot figure it out.
Thank you so much!
My current Python Regex is as follows:
r'''(?<=[#])\s*[A-Z0-9.]+(?=\()'''
#STRING1( # Result: 'STRING1' --> works
# STRI..NG2( # Result: ' STRI..NG2' --> okay, but excl. whitespace would be preferred
# STRING.3( # Result: ' STRING.3' --> okay, but excl. whitespace would be preferred
Example Text:
#STRING4("#STRING5( and maybe "another #STRING6("__"maybe here is text") and #STRING7( " maybe & even another # STRING8( " --- " and a "last" one " #STRING9( &"maybe some more "text"
Right now this returns (including leading whitespaces):
'STRING4'
'STRING5'
'STRING6'
'STRING7'
' STRING8'
'STRING9'
Desired return:
'STRING4'
'STRING6'
'STRING8'
One way to do it is with the following regex:
reg = r'(?:\"[^\"]+\")|(?:#\s*([A-Z0-9.]+)\()'
You then need to check whether the match has a group 1. If it does, it's a real match; otherwise it's a false match (a quoted span).
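The group-1 check described above could look like the following sketch. It uses the example text from the question and the regex from this answer unchanged; only the variable names are new.

```python
import re

text = ('#STRING4("#STRING5( and maybe "another #STRING6("__"maybe here is text") '
        'and #STRING7( " maybe & even another # STRING8( " --- " and a "last" one '
        '" #STRING9( &"maybe some more "text"')

reg = re.compile(r'(?:\"[^\"]+\")|(?:#\s*([A-Z0-9.]+)\()')

# Quoted spans are consumed by the first alternative and leave group 1 as None;
# only the second alternative fills group 1, without the leading whitespace.
hits = [m.group(1) for m in reg.finditer(text) if m.group(1) is not None]
print(hits)
```

Since \s* sits outside the capture group, the leading whitespace in front of STRING8 is dropped as well.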
I can't do everything you want, due to Python not allowing non-fixed width lookbehinds, but this:
reg = r"(?<=#)\s*([A-Z\d.]+)(?=(?:[^\"]*\"[^\"]*\")*$)"
Should work. Note that it expects the quotes to be properly balanced throughout the string. It also doesn't account for any escaped quotes (\").
Edit: I added a capture group that you can use to remove the leading whitespace.
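A sketch of the lookahead idea, i.e. accept a match only if an even number of quotes follows it. Note that this is a variant of the pattern above: the [^"]* in front of $ is added here as an assumption, so that text after the last quote does not make the lookahead fail. The second input is a made-up example of that case.

```python
import re

text = ('#STRING4("#STRING5( and maybe "another #STRING6("__"maybe here is text") '
        'and #STRING7( " maybe & even another # STRING8( " --- " and a "last" one '
        '" #STRING9( &"maybe some more "text"')

# Lookahead: the rest of the string consists of complete quote pairs,
# optionally followed by quote-free text.
outside_quotes = re.compile(r'(?<=#)\s*([A-Z\d.]+)(?=(?:[^"]*"[^"]*")*[^"]*$)')

names = outside_quotes.findall(text)
print(names)

# Hypothetical input that does not end with a quote.
trailing = outside_quotes.findall('#ABC( trailing "quoted" text')
print(trailing)
```

The lookahead rescans the rest of the string for every candidate, so this is best kept to short inputs with balanced, unescaped quotes.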

Regex expression with special chars and text

Given :
str_var ='host="dsa.asd.dsc"port="1234"service_nameORdbName="dsa"pass="dsa"user="ewq"'
How can I match, for example in the host's case, a string that can be abc.dfg.ewq.asd and so on? The data can contain only '.' as a special character.
The expression that I have can only match plain text because of \w+.
result = re.findall('(\w+)="(\w+)"', str_var)
Expected result :
[('host', 'dsa.asd.dsc'), ('port', '1234'), ('service_nameORdbName', 'dsa'), ('pass', 'dsa'), ('user', 'ewq')]
You may either add a . to \w:
result = re.findall(r'(\w+)="([\w.]+)"', str_var)
Or, match . delimited words with \w+(?:\.\w+)* (one or more word chars followed by 0 or more repetitions of a dot and then one or more word chars):
result = re.findall(r'(\w+)="(\w+(?:\.\w+)*)"', str_var)
Or, match values in-between double quotes that may contain anything but a double quote inside (with "[^"]+" that matches a ", then one or more chars other than a double quote and then a "):
result = re.findall(r'(\w+)="([^"]+)"', str_var)
See the Python demo.
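As a small usage sketch: since findall returns (key, value) tuples here, the result can be passed straight to dict() for lookups by key. The name settings is just an illustrative choice.

```python
import re

str_var = 'host="dsa.asd.dsc"port="1234"service_nameORdbName="dsa"pass="dsa"user="ewq"'

# Each match yields a (key, value) tuple.
pairs = re.findall(r'(\w+)="([^"]+)"', str_var)

# dict() accepts an iterable of 2-tuples directly.
settings = dict(pairs)
print(settings["host"])
```

If a key could appear twice in the input, dict() keeps the last value, so the list of tuples is the safer form in that case.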

Regex find content in between single quotes, but only if contains certain word

I want to get the content between single quotes, but only if it contains a certain word (i.e 'sample_2'). It additionally should not match ones with white space.
Input example: (The following should match and return only: ../sample_2/file and sample_2/file)
['asdf', '../sample_2/file', 'sample_2/file', 'example with space', sample_2, sample]
Right now I only have a pattern that matches the first 3 items in the list:
'(.\S*?)'
I can't seem to find the right regex that would return those containing the word 'sample_2'
If you want specific words/characters you need to have them in the regular expression and not use the '\S'. The \S is the equivalent to [^\r\n\t\f\v ] or "any non-whitespace character".
import re
teststr = "['asdf', '../sample_2/file', 'sample_2/file', 'sample_2 with spaces','example with space', sample_2, sample]"
matches = re.findall(r"'([^\s']*sample_2[^\s]*?)',", teststr)
# ['../sample_2/file', 'sample_2/file']
Based on your wording, the desired word can change. In that case, I would build the pattern string dynamically and pass it to re.compile(). If the word may contain regex metacharacters, wrap it in re.escape() first.
import re
word = 'sample_2'
teststr = "['asdf', '../sample_2/file', 'sample_2/file', ' sample_2 with spaces','example with space', sample_2, sample]"
regex = re.compile("'([^'\\s]*"+word+"[^\\s]*?)',")
matches = regex.findall(teststr)
# ['../sample_2/file', 'sample_2/file']
Also if you haven't heard of this tool yet, check out regex101.com. I always build my regular expressions here to make sure I get them correct. It gives you the references, explanation of what is happening and even lets you test it right there in the browser.
Explanation of regex
regex = r"'([^\s']*sample_2[^\s]*?)',"
Find first apostrophe, start group capture. Capture anything except a whitespace character or the corresponding ending apostrophe. It must see the letters "sample_2" before accepting any non-whitespace character. Stop group capture when you see the closing apostrophe and a comma.
Note: In Python, a string literal prefixed with the character r is a raw string: backslashes are kept as-is rather than treated as escape sequences, so the pattern does not need doubled '\' characters. The prefix does not compile anything; re.compile() and the other re functions do that.
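For a search word supplied at run time, a sketch along these lines may help. It is a variation on the pattern above, not the same regex: it drops the trailing comma requirement and excludes the apostrophe on both sides of the word, and find_quoted_with_word is a made-up helper name.

```python
import re

def find_quoted_with_word(text, word):
    """Return single-quoted, whitespace-free items that contain `word`."""
    # re.escape keeps characters such as '.' or '+' in `word` literal.
    pattern = re.compile(r"'([^\s']*" + re.escape(word) + r"[^\s']*)'")
    return pattern.findall(text)

teststr = ("['asdf', '../sample_2/file', 'sample_2/file', "
           "'sample_2 with spaces', 'example with space', sample_2, sample]")

matches = find_quoted_with_word(teststr, "sample_2")
print(matches)

# With escaping, the dot in the word only matches a literal dot.
dotted = find_quoted_with_word("['xa.b', 'xaXb']", "a.b")
print(dotted)
```

Without re.escape, a word such as a.b would also match xaXb, because an unescaped dot matches any character.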

How to replace .. in a string in python

I am trying to replace the double dot in this string with a space:
import re
s = "haha..hehe.hoho"
s = re.sub('[..+]+',' ', s)
The output I get is: haha hehe hoho
Desired output:
haha hehe.hoho
What am I doing wrong?
Test on sites like regexpal: http://regexpal.com/
It's easier to get the output and check if the regex is right.
You should change your regex to something like: '\.\.' if you want to remove only double dots.
If you want to remove when there's at least 2 dots you can use '\.{2,}'.
Every character you put inside [] becomes a member of a character class, and the class matches any single one of those characters.
The dot character has a special meaning in a regex; to avoid this meaning, prefix it with an escape character: \
You can read more about regular expressions metacharacters here: https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php
[a-z] A range of characters. Matches any character in the specified range.
. Matches any single character except "\n".
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
Your new code:
import re
s = "haha..hehe.hoho"
#pattern = '\.\.' #If you want to remove when there's 2 dots
pattern = '\.{2,}' #If you want to remove when there's at least 2 dots
s = re.sub(pattern, ' ', s)
Unless you are constrained to use regex, I find the replace() method much simpler:
s = "haha..hehe.hoho"
print(s.replace('..',' '))
gives your desired output:
haha hehe.hoho
Change:
re.sub('[..+]+',' ', s)
to:
re.sub('\.\.+',' ', s)
[..+]+ is a character class followed by +, so it matches one or more characters that are each either a . or a +. It therefore matches the single . as well as the .. in your input. Make the change as below:
s = re.sub('\.\.+',' ', s)
[] is a character class and will match on anything in it (meaning any 1 .).
I'm guessing you used it because a simple . wouldn't work, because it's a meta character meaning any character. You can simply escape it to mean a literal dot with a \. As such:
s = re.sub('\.\.',' ', s)
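Here is what your regex means: [..+] is a character class containing a literal period (listed twice) and a plus sign, and the trailing + repeats it one or more times.
So, you allow for 1 or more literal periods or plus symbols, which is not what you want.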
You do not have to repeat the same symbol when looking for it, you can use quantifiers, like {2}, which means "exactly 2 occurrences".
You can use split and join, see sample working program:
import re
s = "haha..hehe.hoho"
s = " ".join(re.split(r'\.{2}', s))
print(s)
Output:
haha hehe.hoho
Or you can use the sub with the regex, too:
s = re.sub(r'\.{2}', ' ', "haha..hehe.hoho")
In case you have cases with more than 2 periods, you should use \.{2,} regex.
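To see the difference between the three patterns discussed in this thread side by side, here is a quick sketch. The input is the question's string with a made-up "...hihi" appended so that the {2} and {2,} cases differ.

```python
import re

s = "haha..hehe.hoho...hihi"

# Character class: any run of '.' or '+' characters, single dots included.
class_result = re.sub(r"[..+]+", " ", s)

# Exactly two dots per match: a run of three leaves one dot behind.
exact_two = re.sub(r"\.{2}", " ", s)

# Two or more dots per match: the whole run is replaced, single dots are kept.
two_or_more = re.sub(r"\.{2,}", " ", s)

print(class_result)
print(exact_two)
print(two_or_more)
```

Only the {2,} version both keeps the single dot and swallows longer runs in one go.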
