Python : How to ignore a delimited part of a sentence? - python

I have the following line:
CommonSettingsMandatory = #<Import Project="[\\.]*Shared(\\vc10\\|\\)CommonSettings\.targets," />#,true
and I want the following output:
['commonsettingsmandatory', '<Import Project="[\\\\.]*Shared(\\\\vc10\\\\|\\\\)CommonSettings\\.targets," />', 'true']
If I simply split on the comma with a regex, the value is also split wherever it contains a comma itself, for example at the comma I wrote after targets.
So I want to ignore the text between the two # signs to make sure no splitting happens there.
I really don't know how to do this!

http://docs.python.org/library/re.html#re.split
import re
string = 'CommonSettingsMandatory = #toto,tata#, true'
splitlist = re.split(r'\s?=\s?#(.*?)#,\s?', string)
Then splitlist contains ['CommonSettingsMandatory', 'toto,tata', 'true'].

While you might be able to use split with a lookbehind, I would use the groups captured by this expression:
(\S+)\s*=\s*#([^#]+)#,\s*(.*)
m = re.search(expression, myString). Use m.group(1) for the first string, m.group(2) for the second, etc.
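Here is a minimal sketch of the capture-group approach. It assumes the value is wrapped in single # characters, as in the question's line, and never contains a # itself. The helper name split_setting is hypothetical, and the key is lowercased only because the desired output in the question shows it that way.

```python
import re

# Assumed layout: key = #value#,flag  (the value may contain commas but no '#').
SETTING_RE = re.compile(r'(\S+)\s*=\s*#([^#]+)#,\s*(.*)')

def split_setting(line):
    """Return [key, value, flag], or None if the line does not fit the layout."""
    m = SETTING_RE.match(line)
    if m is None:
        return None
    key, value, flag = m.groups()
    return [key.lower(), value, flag.strip()]

line = r'CommonSettingsMandatory = #<Import Project="[\\.]*Shared(\\vc10\\|\\)CommonSettings\.targets," />#,true'
print(split_setting(line))
```

Because the value is captured as a whole group, the comma inside it never acts as a separator.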

If I understand you correctly, you're trying to split the string using spaces as delimiters, but you want to also remove any text between pound signs?
If that's correct, why not simply remove the pound sign-delimited text before splitting the string?
import re
myString = re.sub(r'#.*?#', '', myString)
myArray = myString.split(' ')
EDIT: (based on revised question)
import re
myArray = re.findall(r'^(.*?) = #(.*?)#,(.*?)$', myString)
For the line in the question, that returns a list containing one tuple of matches. The key keeps its original case, so call .lower() on it if you need it lowercased:
[
    (
        'CommonSettingsMandatory',
        '<Import Project="[\\\\.]*Shared(\\\\vc10\\\\|\\\\)CommonSettings\\.targets," />',
        'true'
    )
]
(spacing added to illustrate the format better)

Related

replace before and after a string using re in python

I have a string like this: 'approved:rakeshc#IAD.GOOGLE.COM'
I would like to extract the text after ':' and before '#'.
In this case the text to be extracted is rakeshc.
It can be done using the split method: 'approved:rakeshc#IAD.GOOGLE.COM'.split(':')[1].split('#')[0]
but I want to do it with a regular expression.
This is what I have tried so far:
import re
iptext = 'approved:rakeshc#IAD.GOOGLE.COM'
re.sub('^(.*approved:)', "", iptext)  # gives everything after ':'
re.sub('(#IAD.GOOGLE.COM)$', "", iptext)  # gives everything before '#'
I would like to get the result with a single expression. The expression would be used to replace the string with only the middle part.
Here is a regex one-liner:
inp = "approved:rakeshc#IAD.GOOGLE.COM"
output = re.sub(r'^.*:|#.*$', '', inp)
print(output) # rakeshc
The above approach strips all text from the start up to and including the :, and all text from the # to the end. This leaves behind only the ID, rakeshc.
Use a capture group to copy the part between the matches to the result.
result = re.sub(r'.*approved:(.*)#IAD\.GOOGLE\.COM$', r'\1', iptext)
Hope this works for you:
import re
input_text = "approved:rakeshc#IAD.GOOGLE.COM"
out = re.search(':(.+?)#', input_text)
if out:
    found = out.group(1)
    print(found)
You can use this one-liner:
re.sub(r'^.*:(\w+)#.*$', r'\1', iptext)
Output:
rakeshc
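One thing to keep in mind with the re.sub one-liners above: when the pattern does not match, re.sub returns the input unchanged, so a malformed string passes through silently. A rough sketch that makes the "no match" case explicit follows. The helper name extract_between and its default delimiters are assumptions for illustration only.

```python
import re

def extract_between(text, start=':', end='#'):
    """Return the text between the first `start` and the next `end`, or None."""
    pattern = re.escape(start) + r'(.+?)' + re.escape(end)
    m = re.search(pattern, text)
    return m.group(1) if m else None

print(extract_between('approved:rakeshc#IAD.GOOGLE.COM'))  # rakeshc
print(extract_between('no delimiters here'))               # None
```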

replace a string after substring found in jython/python

I have a string like this:
ABC/AAAA DEF/78kkk OBJ/89KKK KLE/67899
I pass in a substring to find and replace. So if I pass DEF/00012, the original string
should become:
ABC/AAAA DEF/00012 OBJ/89KKK KLE/67899
I have tried string.replace('DEF', 'DEF/00012'),
but I get this output:
ABC/AAAA DEF/00012/78kkk OBJ/89KKK KLE/67899
Any suggestions would be highly appreciated.
Thanks
I would do:
txt = 'ABC/AAAA DEF/78kkk OBJ/89KKK KLE/67899'
change = 'DEF'
changeto = 'DEF/00012'
newtxt = ' '.join(changeto if i.startswith(change) else i for i in txt.split(' '))
print(newtxt)
Output:
ABC/AAAA DEF/00012 OBJ/89KKK KLE/67899
I split the string at spaces and replaced the part beginning with DEF.
string.replace('DEF/78kkk', 'DEF/00012')
If by "substring" you mean that the characters following "DEF" are not fixed to a specific value, use a regular expression instead.
result = re.sub(r"DEF/\w+", "DEF/00012", string)
Assuming there really is a blank space after every "substring", you can use re:
import re
your_string = re.sub(r"DEF/\S*", "DEF/00012", your_string)
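If the key is not always DEF, the same idea can be wrapped in a small function that derives the key from the replacement itself. This is only a sketch under the assumption that entries look like KEY/value and are separated by whitespace. The name replace_entry is made up for this example.

```python
import re

def replace_entry(text, new_entry):
    """Replace the KEY/value entry in text whose KEY matches new_entry's key."""
    key = new_entry.split('/', 1)[0]
    # \S* consumes the old value up to the next whitespace (or end of string).
    pattern = r'\b' + re.escape(key) + r'/\S*'
    # A function replacement keeps backslashes in new_entry from being read as escapes.
    return re.sub(pattern, lambda m: new_entry, text)

txt = 'ABC/AAAA DEF/78kkk OBJ/89KKK KLE/67899'
print(replace_entry(txt, 'DEF/00012'))
```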

Replacing when a word is in another word but with special circumstances

My program replaces tokens with values when they appear in a file. When reading in a certain line it gets stuck. Here is an example:
1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a
The two tokens in the example are Token100 and Token100a. I need a way to only replace Token100 with its data and not replace Token100a with Token100's data with an a afterwards. I can't look for spaces before and after because sometimes they are in the middle of lines. Any thoughts are appreciated. Thanks.
You can use regex:
import re
line = "1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a"
match = re.sub("Token100a", "data", line)
print(match)
Outputs:
1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1data
More about regex here:
https://www.w3schools.com/python/python_regex.asp
You can use a regular expression with a negative lookahead to ensure that the following character is not an "a":
>>> import re
>>> test = '1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a'
>>> re.sub(r'Token100(?!a)', 'data', test)
'1.1.1.1.1.1.1.1.1.1 data.1 1.1.1.1.1.1.1Token100a'
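The (?!a) lookahead only guards against a trailing "a". If tokens can be extended by any letter, digit or underscore, a slightly more general sketch is to refuse a match whenever a word character follows. The function name replace_token is hypothetical, and whether \w is the right notion of "token character" for your file is an assumption.

```python
import re

def replace_token(line, token, data):
    """Replace token with data unless the token is followed by a word character."""
    pattern = re.escape(token) + r'(?!\w)'
    # A function replacement avoids escape processing of backslashes in data.
    return re.sub(pattern, lambda m: data, line)

line = '1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a'
print(replace_token(line, 'Token100', 'data'))
```

No lookbehind is used, so a token glued to the preceding text (as in 1.1.1.1.1.1.1Token100a) is still found.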

How to use a regex variable in a regular expression?

I am using the following pattern to clean a piece of text (replacing the matches with null):
{\s{\s\"[A-Za-z0-9.,\-:]*(?<!\bbecause\b)(?<!\bsince\b)\"\s}\s\"[A-Za-z0-9.,\-:]*\"\s}
I have a list of relators like "because" and "since" that could change every time. So I created a separate string which is a regex itself like:
lookahead_string = (?<!\bbecause\b)(?<!\bsince\b)
And put it in my original regex pattern and changed it like the following:
{\s{\s\"[A-Za-z0-9.,\-:]*'+lookahead_string+r'\"\s}\s\"[A-Za-z0-9.,\-:]*\"\s}
But the new pattern does not match the parts of the input text that could be matched using the original regex pattern. The code I am using is:
lookahead_string = ''
relators = ["because", "since"]
for rel in relators:
    lookahead_string += '(?<!\b'+rel+'\b)'
text = re.sub(r'{\s{\s\"[A-Za-z0-9.,\-:]*'+lookahead_string+r'\"\s}\s\"[A-Za-z0-9.,\-:]*\"\s}', "", text)
text = ' '.join(text.split())
What should I do to make it work?! I have already tried using re.escape and format string but none of them works in my case.
Edit: I removed the input and output text because I thought it was a little confusing. However, I thank @DYZ for the good suggestion.
A suggestion: instead of wrestling with the complex string syntax, convert the string to a Python list.
import ast
l = ast.literal_eval("[" + s.replace("}", "],").replace("{", "[") + "]")
#[[[[['I'], 'PRP'], 'NP'], [[[[['did'], 'VBD'], [['not'], 'RB'], 'VP'],
# ..., 'S'], '']
Now you can apply simple list functions to your data and, when done, transform the list to a bracketed string.
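On the regex itself: a likely cause of the mismatch in the question's code is that the pieces '(?<!\b' and '\b)' are ordinary string literals, where \b means the backspace character rather than a word boundary. A sketch of building the same lookbehinds from raw strings follows. The sample input is made up, and the braces are escaped here only for readability.

```python
import re

relators = ["because", "since"]  # may change between runs

# In a normal string literal '\b' is a backspace; in a raw string it stays backslash + b.
assert '\b' == '\x08' and r'\b' == '\\b'

# Build one negative lookbehind per relator from raw-string pieces.
lookbehinds = ''.join(r'(?<!\b' + re.escape(rel) + r'\b)' for rel in relators)

pattern = re.compile(
    r'\{\s\{\s"[A-Za-z0-9.,\-:]*' + lookbehinds + r'"\s\}\s"[A-Za-z0-9.,\-:]*"\s\}'
)

sample = '{ { "hello" } "NN" } { { "because" } "IN" }'  # made-up input
print(pattern.sub('', sample))
```

With raw strings the concatenated pattern is character-for-character the same as the hand-written one, so it should match the same text.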

how to extract portion of a string between two substrings in a multiline string in python

I'm trying to extract the portion of a string between two string identifiers. The technique works if the search is made in the first line, but it does not work for substrings in other lines.
The string is like this:
mystring="""abc jhfshf iztzrtzoi hjge);
kjsyh ldjfsj sjsdgj sodfsd);
sjfhsdvh isdjgdfg sdgjhg isjdgg);
ghdcbnv jgdfkjg fdjgjfdgj);
vgdfnkvgfd dfgjfdjgjöfd);
end"""
Until now I have the following code.
startString='jhfshf'
endString=';'
search_var=mystring[mystring.find(startString)+len(startString):mystring.find(endString)]
print(search_var)
I get the correct output: iztzrtzoi hjge)
But if I search for a string in the second line (for example startString='ldjfsj'), it does not work. Can anybody suggest a correction?
Using Regex.
Demo:
import re
mystring="""abc jhfshf iztzrtzoi hjge);
kjsyh ldjfsj sjsdgj sodfsd);
sjfhsdvh isdjgdfg sdgjhg isjdgg);
ghdcbnv jgdfkjg fdjgjfdgj);
vgdfnkvgfd dfgjfdjgjöfd);
end"""
m = re.search(r"(?<=jhfshf).*?(?=;)", mystring)
if m:
    print(m.group())
Output:
iztzrtzoi hjge)
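The original slicing code fails for later lines because mystring.find(endString) always returns the first ';' in the whole string, which lies before the start marker, so the slice is empty. A sketch without regex that searches for the end marker only after the start marker is shown below. The helper name between is hypothetical, and the sample is shortened to two lines of the question's data.

```python
def between(text, start, end):
    """Return the text between `start` and the first `end` after it, or None."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)  # begin the search after the start marker
    if j == -1:
        return None
    return text[i:j].strip()  # strip the surrounding whitespace

mystring = """abc jhfshf iztzrtzoi hjge);
kjsyh ldjfsj sjsdgj sodfsd);
end"""

print(between(mystring, 'jhfshf', ';'))
print(between(mystring, 'ldjfsj', ';'))
```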
