Replacing unwanted characters with regex

I have this string which is on one line:
https[:]//sometest[.]com,http[:]//differentt,est.net,https://lololo.com
Note that I purposely placed a , inside the second URL. I am trying to replace only the , where one URL ends and the next http(s) begins. So far I have tried this:
pattern_src = r"http(.*)"
for i, line_src in enumerate(open("/Users/test/Documents/tools/dump/email.txt")):
    for match in re.finditer(pattern_src, line_src):
        mal_url = (match.group())
        source_ = mal_url
        string = source_
        for ch in ["[" , "]"]:
            for c in [","]:
                string = string.replace(c,"\n")
                string = string.replace(ch,"")
        with open("/Users/test/Documents/tools/dump/urls.txt", 'w') as file:
            file.write(string)
        print(string)
But as you can clearly see, it replaces every , in the string. So my question is: how would I go about replacing just the , before each http, so that every http URL ends up on a new line?

>>> s = 'https[:]//sometest[.]com,http[:]//differentt,est.net,https://lololo.com'
>>> print(re.sub(r',(?=http)', '\n', s))
https[:]//sometest[.]com
http[:]//differentt,est.net
https://lololo.com
,(?=http) will match , only if it is followed by http. Here (?=http) is a positive lookahead assertion, which lets you check for a condition without consuming those characters.
See Reference - What does this regex mean? for details on lookarounds or my book: https://learnbyexample.github.io/py_regular_expressions/lookarounds.html
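As a minimal sketch of how the lookahead could fit into the original task (one URL per line, with the [ and ] brackets removed), the snippet below works on an in-memory string rather than on the files from the question. The helper name split_urls is made up for illustration.

```python
import re

def split_urls(line):
    """Split a comma-joined line of URLs and strip the [ ] brackets.

    split_urls is a hypothetical helper, not part of the original answer.
    """
    # Replace only the commas that are immediately followed by "http".
    separated = re.sub(r',(?=http)', '\n', line)
    # Remove the square brackets wrapped around ":" and ".".
    cleaned = re.sub(r'[\[\]]', '', separated)
    return cleaned.splitlines()

s = 'https[:]//sometest[.]com,http[:]//differentt,est.net,https://lololo.com'
for url in split_urls(s):
    print(url)
```

The comma inside the second URL survives because it is not followed by http.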

Related

Capture ALL strings within a Python script with regex

This question was inspired by my failed attempts after trying to adapt this answer: RegEx: Grabbing values between quotation marks
Consider the following Python script (t.py):
print("This is also an NL test")
variable = "!\n"
print('And this has an escaped quote "don\'t" in it ', variable,
      "This has a single quote ' but doesn\'t end the quote as it" + \
      " started with double quotes")
if "Foo Bar" != '''Another Value''':
    """
    This is just nonsense
    """
    aux = '?'
    print("Did I \"failed\"?", f"{aux}")
I want to capture all strings in it, as:
This is also an NL test
!\n
And this has an escaped quote "don\'t" in it
This has a single quote ' but doesn\'t end the quote as it
started with double quotes
Foo Bar
Another Value
This is just nonsense
?
Did I \"failed\"?
{aux}
I wrote another Python script using the re module and, of my regex attempts, the one that finds the most strings is:
import re
pattern = re.compile(r"""(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)""")
with open('t.py', 'r') as f:
    msg = f.read()
x = pattern.finditer(msg, re.DOTALL)
for i, s in enumerate(x):
    print(f'[{i}]',s.group(0))
with the following result:
[0] And this has an escaped quote "don\'t" in it
[1] This has a single quote ' but doesn\'t end the quote as it started with double quotes
[2] Foo Bar
[3] Another Value
[4] Did I \"failed\"?
To add to my failures, I also couldn't fully replicate what I found with regex101.com:
I'm using Python 3.6.9, by the way, and I'm asking for more insights into regex to crack this one.

Because you want to match ''' or """ or ' or " as the delimiter, put all of that into the first group:
('''|"""|["'])
Don't put \b after it, because then it won't match strings when those strings start with something other than a word character.
Because you want to make sure that the final delimiter isn't treated as a starting delimiter when the engine starts the next iteration, you'll need to fully match it (not just lookahead for it).
The middle part to match anything but the delimiter can be:
((?:\\.|.)*?)
Put it all together:
('''|"""|["'])((?:\\.|.)*?)\1
and the result you want will be in the second capture group:
import re
pattern = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""")
with open('t.py', 'r') as f:
    msg = f.read()
x = pattern.finditer(msg)
for i, s in enumerate(x):
    print(f'[{i}]',s.group(2))
https://regex101.com/r/dvw0Bc/1
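To try the pattern without a t.py file on disk, here is a small self-contained sketch that runs it against an in-memory sample. The function name extract_strings and the sample text are assumptions made for illustration.

```python
import re

# Same pattern as in the answer: group 1 is the delimiter, group 2 the body.
# (?s) lets "." match newlines so triple-quoted strings can span lines.
STRING_RE = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""")

def extract_strings(source):
    """Return the body of every quoted string found in source (hypothetical helper)."""
    return [m.group(2) for m in STRING_RE.finditer(source)]

# A tiny stand-in for the contents of a script.
sample = (
    'x = "a \\"b\\" c"\n'
    "y = 'it'\n"
    'z = """two\nlines"""\n'
)

for body in extract_strings(sample):
    print(repr(body))
```

Keep in mind that a regex like this knows nothing about Python syntax: a quote character inside a comment would also start a match. The standard library's tokenize module is likely a more robust choice for real source files.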

How to parse values appear after the same string in python?

I have an input text like this (the actual text file also contains tons of garbage characters surrounding these two strings):
(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)
I am trying to parse the text to store something like this:
value1="xxx" and value2="yyy".
I wrote python code as follows:
value1_start = content.find('value')
value1_end = content.find(';', value1_start)
value2_start = content.find('value')
value2_end = content.find(';', value2_start)
print "%s" %(content[value1_start:value1_end])
print "%s" %(content[value2_start:value2_end])
But it always returns:
value=xxx
value=xxx
Could anyone tell me how I can parse the text so that the output is:
value=xxx
value=yyy

Use a regex approach:
re.findall(r'\bvalue=[^;]*', s)
Or - if value can be any 1+ word (letter/digit/underscore) chars:
re.findall(r'\b\w+=[^;]*', s)
See the regex demo
Details:
\b - word boundary
value= - a literal char sequence value=
[^;]* - zero or more chars other than ;.
See the Python demo:
import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)
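If only the part after value= is needed (the question mentions storing value1="xxx" and value2="yyy"), a capture group is one way to get it. This is a sketch that assumes exactly two occurrences, as in the sample input; the variable names are illustrative.

```python
import re

s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"

# With a capture group, findall returns only the captured part of each match.
values = re.findall(r"\bvalue=([^;]*)", s)

# Unpacking assumes exactly two matches; it raises ValueError otherwise.
value1, value2 = values
print(value1, value2)
```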

Use regex to filter the data you want from the "junk characters":
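>>> import re
>>> _input = '#4#5%value=xxx38u952035983049;3^&^*(^%$3value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx38u952035983049', '3value=yyy']
>>> for match in matches:
...     print(match)
...
value=xxx38u952035983049
3value=yyy
Summary of the regular expression:
[a-zA-Z0-9]+: One or more alphanumeric characters
=: literal equal sign
[a-zA-Z0-9]+: One or more alphanumeric characters
Note that alphanumeric junk directly adjacent to a pair becomes part of the match, so this pattern isolates exactly value=xxx and value=yyy only when the surrounding garbage is non-alphanumeric.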

For this input:
content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'
use a simple regex and manually strip off the first and last two characters:
import re
values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
    print(value)
Output:
value=xxx
value=yyy
Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.

You already have good answers based on the re module. That would certainly be the simplest way.
If for any reason (performance?) you prefer to use str methods, that is indeed possible. But you must start the search for the second string past the end of the first one:
value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
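The same idea generalizes to any number of occurrences by carrying the search position forward in a loop. The sketch below uses only str.find; the function name find_all and its parameters are assumptions made for illustration.

```python
def find_all(content, key="value", terminator=";"):
    """Collect every slice from key up to terminator using only str.find.

    find_all is a hypothetical helper, not part of the original answer.
    """
    found = []
    pos = 0
    while True:
        start = content.find(key, pos)
        if start == -1:
            break  # no further occurrence of key
        end = content.find(terminator, start)
        if end == -1:
            end = len(content)  # no terminator: take the rest of the text
        found.append(content[start:end])
        # Resume past this occurrence so the same key is not found again.
        pos = max(end, start + len(key))
    return found

content = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
print(find_all(content))
```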

I would like to extract the words from ":" to slash

I asked this question before, and I am editing it now because I found some lines that don't match the format I gave earlier.
Here's an example of the lines:
data = "09:55:04.125 mta Messages I Doc O:SERVER (NVS:SMTP/me#domain.com) R:NVS:FAXG3.I0.0101 mid:6393"
data2= "09:55:05.045 mta Messages I Doc O:SERVER (NVS:SMTP/me#domain.com) R:ADMIN (NVS:SMTP.0/me#domain.fr) mid:6397"
At first I matched what's between the colon and the slash, but I've noticed that in some lines, like the first one, the type "FAXG3.I0.0101" isn't followed by a slash.
Here's the regex I use:
exp = result = re.findall(r'[\w\.]+(?=:*)',data) # type S & D
The result I want is 'SMTP', 'FAXG3.I0.0101' for the first line and 'SMTP', 'SMTP.0' for the second.
Can someone help me correct my regex to get that?

You just need to change the regex such that it also accepts '.' as a valid character, e.g.:
import re
data = "This is a test message I Res O:Myself (KTP:SMTP/me#domain.com) R:KTP:SMS.CLASS/+345854595 id:21"
result = re.findall(r'[\w\.]+(?=:*/)',data)
print(result)
['SMTP', 'SMS.CLASS']
The [\w\.]+ part accepts a sequence of one or more characters, each of which is either a word character (\w: a letter, digit, or underscore) or a literal . (written \. here; inside a character class the escape is optional, but outside one an unescaped . means 'any character').

That should work:
result = re.findall(r'(?<=:)[\w.]+(?=/)',data)
This reads as "a sequence of alphanumeric characters (or underscores or dots) between a : and a /".
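A lookahead for / cannot match a type such as FAXG3.I0.0101, which is followed by a space rather than a slash. One possible variant anchors on the prefix instead. It assumes every type is introduced by the literal prefix NVS:, as in the two sample lines from the question; that is an assumption, since other logs may use a different prefix.

```python
import re

# Assumption: the wanted token always directly follows the literal "NVS:".
TYPE_RE = re.compile(r'(?<=NVS:)[\w.]+')

data = "09:55:04.125 mta Messages I Doc O:SERVER (NVS:SMTP/me#domain.com) R:NVS:FAXG3.I0.0101 mid:6393"
data2 = "09:55:05.045 mta Messages I Doc O:SERVER (NVS:SMTP/me#domain.com) R:ADMIN (NVS:SMTP.0/me#domain.fr) mid:6397"

print(TYPE_RE.findall(data))   # ['SMTP', 'FAXG3.I0.0101']
print(TYPE_RE.findall(data2))  # ['SMTP', 'SMTP.0']
```

Because [\w.]+ stops at the first character that is neither a word character nor a dot, it ends at the slash in one case and at the space in the other.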

Python : How to ignore a delimited part of a sentence?

I have the following line :
CommonSettingsMandatory = #<Import Project="[\\.]*Shared(\\vc10\\|\\)CommonSettings\.targets," />#,true
and i want the following output:
['commonsettingsmandatory', '<Import Project="[\\\\.]*Shared(\\\\vc10\\\\|\\\\)CommonSettings\\.targets," />', 'true']
If I do a simple regex split on the comma, it also splits inside the value whenever the value itself contains a comma; for example, the comma I wrote after targets causes a split there.
So I want to ignore the text between the two # signs to make sure no splitting happens there.
I really don't know how to do this!

http://docs.python.org/library/re.html#re.split
import re
string = 'CommonSettingsMandatory = #toto,tata#, true'
splitlist = re.split(r'\s?=\s?#(.*?)#,\s?', string)
Then splitlist contains ['CommonSettingsMandatory', 'toto,tata', 'true'].

While you might be able to use split with a lookbehind, I would use the groups captured by this expression:
(\S+)\s*=\s*#([^#]+)#,\s*(.*)
m = re.search(expression, myString). Use m.group(1) for the first string, m.group(2) for the second, etc.
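Here is a runnable sketch of the capture-group approach applied to the line from the question. It assumes a single # on each side of the protected text, as in the sample line, and lowercases the key to match the output the question asks for.

```python
import re

# Raw string so the backslashes in the sample line are kept literally.
line = r'CommonSettingsMandatory = #<Import Project="[\\.]*Shared(\\vc10\\|\\)CommonSettings\.targets," />#,true'

# key, then "=", then text between the two # signs, then a comma and the rest.
pattern = re.compile(r'(\S+)\s*=\s*#([^#]+)#,\s*(.*)')

m = pattern.search(line)
if m:
    key, value, flag = m.groups()
    result = [key.lower(), value, flag]
    print(result)
```

The comma after targets stays inside the second element because [^#]+ consumes everything up to the closing #.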

If I understand you correctly, you're trying to split the string using spaces as delimiters, but you want to also remove any text between pound signs?
If that's correct, why not simply remove the pound sign-delimited text before splitting the string?
import re
myString = re.sub(r'#.*?#', '', myString)
myArray = myString.split(' ')
EDIT: (based on revised question)
import re
myArray = re.findall(r'^(.*?) = #(.*?)#,(.*?)$', myString)
That will actually return a list of tuples containing your matches, in the form of:
[
    (
        'CommonSettingsMandatory',
        '<Import Project="[\\\\.]*Shared(\\\\vc10\\\\|\\\\)CommonSettings\\.targets," />',
        'true'
    )
]
(spacing added to illustrate the format better; call .lower() on the first element if you need it in lowercase)

Remove characters from beginning and end or only end of line

I want to remove some symbols from a string using a regular expression, for example:
== (that occur both at the beginning and at the end of a line),
* (at the beginning of a line ONLY).
def some_func():
    clean = re.sub(r'= {2,}', '', clean) #Removes 2 or more occurrences of = at the beg and at the end of a line.
    clean = re.sub(r'^\* {1,}', '', clean) #Removes 1 or more occurrences of * at the beginning of a line.
What's wrong with my code? It seems like expressions are wrong. How do I remove a character/symbol if it's at the beginning or at the end of the line (with one or more occurrences)?

If you only want to remove characters from the beginning and the end, you could use the str.strip() method. This would give some code like this:
>>> s1 = '== foo bar =='
>>> s1.strip('=')
' foo bar '
>>> s2 = '* foo bar'
>>> s2.lstrip('*')
' foo bar'
The strip method removes the characters given in the argument from the beginning and the end of the string, lstrip removes them only from the beginning, and rstrip removes them only from the end.
If you really want to use a regular expression, they would look something like this:
clean = re.sub(r'(^={2,})|(={2,}$)', '', clean)
clean = re.sub(r'^\*+', '', clean)
But IMHO, using strip/lstrip/rstrip would be the most appropriate for what you want to do.
Edit: On Nick's suggestion, here is a solution that would do all this in one line:
clean = clean.lstrip('*').strip('= ')
(A common mistake is to think that these methods remove the characters in the order they're given in the argument. In fact, the argument is just a set of characters to remove, whatever their order, which is why .strip('= ') removes every '=' and ' ' from the beginning and the end, and not just the string '= '.)
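The two approaches are not strictly equivalent: strip('=') also removes a single =, while the {2,} regex requires at least two. The sketch below wraps the regex version in a hypothetical helper, clean_line, to make the difference visible.

```python
import re

def clean_line(line):
    """Regex variant: drop leading '*' runs, then '=' runs of length >= 2
    at either end of the string. clean_line is a hypothetical helper."""
    line = re.sub(r'^\*+', '', line)
    line = re.sub(r'^={2,}|={2,}$', '', line)
    return line

print(repr(clean_line('== foo bar ==')))  # ' foo bar '
print(repr(clean_line('* foo bar')))      # ' foo bar'
print(repr(clean_line('a == b')))         # 'a == b' (middle is untouched)

# A single '=' survives the regex but not the strip-based one-liner.
print(repr(clean_line('=foo=')))                   # '=foo='
print(repr('=foo='.lstrip('*').strip('= ')))       # 'foo'
```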

You have extra spaces in your regexes. Even a space counts as a character.
r'^(?:\*|==)|==$'

First of all, pay attention to the spaces before "{": they are meaningful, so the quantifier in your example applies to the space.
To remove "=" (two or more) only at the beginning or the end, you also need a different regexp, for example:
clean = re.sub(r'^(==+)?(.*?)(==+)?$', r'\2', s)
If you don't include either "^" or "$", the expression can match anywhere (i.e. even in the middle of the string).

And what about keeping the wanted part instead of substituting?
tu = ('======constellation==' , '==constant=====' ,
      '=flower===' , '===bingo=' ,
      '***seashore***' , '*winter*' ,
      '====***conditions=**' , '=***trees====***' ,
      '***=information***=' , '*=informative***==' )
import re
RE = r'((===*)|\**)?(([^=]|=(?!=+\Z))+)'
pat = re.compile(RE)
for ch in tu:
    print(ch, ' ', pat.match(ch).group(3))
Result:
======constellation==   constellation
==constant=====         constant
=flower===              =flower
===bingo=               bingo=
***seashore***          seashore***
*winter*                winter*
====***conditions=**    ***conditions=**
=***trees====***        =***trees====***
***=information***=     =information***=
*=informative***==      =informative***
Do you in fact want, at the beginning of the string,
====***conditions=** to give conditions=** ?
***====hundred====*** to give hundred====*** ?

I think that the following code will do the job:
tu = ('======constellation==' , '==constant=====' ,
      '=flower===' , '===bingo=' ,
      '***seashore***' , '*winter*' ,
      '====***conditions=**' , '=***trees====***' ,
      '***=information***=' , '*=informative***==' )
import re,codecs
with codecs.open('testu.txt', encoding='utf-8', mode='w') as f:
    pat = re.compile(r'(?:==+|\*+)?(.*?)(?:==+)?\Z')
    xam = max(map(len,tu)) + 3
    res = '\n'.join(ch.ljust(xam) + pat.match(ch).group(1)
                    for ch in tu)
    f.write(res)
    print(res)
Where was my brain when I wrote the RE in my earlier post ??! O!O
The non-greedy quantifier .*? before ==+\Z is the real solution.
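For a quick check without writing testu.txt, the same pattern can be wrapped in a small function and run on a few of the samples. The name strip_markers is made up for illustration, and the sketch assumes single-line input, since . does not match a newline here.

```python
import re

# Optional leading run of '=' (2+) or '*', lazy body, optional trailing '=' run (2+).
MARKERS_RE = re.compile(r'(?:==+|\*+)?(.*?)(?:==+)?\Z')

def strip_markers(text):
    """Return text without its leading/trailing markers (hypothetical helper).

    Assumes text contains no newline; otherwise match() would return None.
    """
    return MARKERS_RE.match(text).group(1)

samples = ('======constellation==', '=flower===', '***seashore***', '===bingo=')
for sample in samples:
    print(sample.ljust(24) + strip_markers(sample))
```

The lazy .*? grows one character at a time and stops as soon as the rest of the string is an optional run of two or more = followed by the end, which is why a single trailing = (as in ===bingo=) is kept.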
