regex search&replace a variable string including a regex statement - python

I want to use re.sub to replace a part of a string I know exactly what looks like. relevant part of code:
print "Regex statement: ", foundStatements[iterator]
print "string to replace with : \n", latexPreparedString
print "string to search&replace in: \n", fileAsString
processedString = re.sub(foundStatements[iterator], latexPreparedString, fileAsString)
print "processed string: \n", processedString
In my testing case, foundStatements[iterator] is "%#import script_example.py ( *out =(.|\n)*?return out)" But even though processedString contains foundStatements[iterator], processedString looks exactly like fileAsString, so it hasn't accomplished the re.sub task. What am I doing wrong?
EDIT: Ok, it definitely has something to do with the string I'm searching to replace containing regex code. Is there a way to make it just interpret it foundStatements[iterator] as a raw string to search for? The only solution I can think of is to create a function that replaces any regex symbols in a string with \regexsymbol (e.g. * -> \*), but it'd make sense for there to be a way to solve this with inbuilt functions. It'd also be a bit overkill since I'd have to make sure it works with every single regex symbol, of which there are quite a few :/
EDIT2: Well, just changing it to re.sub(re.escape(foundStatements[iterator]), latexPreparedString, fileAsString) seems to work. except when the regex statement doesn't hit anything in the original file. To explain, latexPreparedString is generated by using the regex-part of the foundStatements[iterator]. While it's logical that it shouldn't be able to set latexPreparedString to anything when the regex statement doesn't hit anything, I set latexPreparedString = "" by default, so in that case it should re.sub replace it with a blank string if it doesn't hit anything. Here's how to code looks at the moment: pastebin.com/wUedK3LN

First, for replacing an exact match in a string, you should use [string.replace()][1]:
processedString = fileAsString(foundStatements[iterator], latexPreparedString)
However, this will still fail in your case, because foundStatements[iterator] has a newline character in it. To escape it, you need to use the r prefix when declaring foundStatements[iterator].
If you still want to use re.sub, you have to both prefix the string with r and use re.escape(foundStatements[iterator]) instead of foundStatements[iterator]. You can read more about re.escape here.

Related

Escaping missing parenthesis using pandas str.match

I'm having trouble with regex. I'm trying to check if my database fully matches with the item name I'm working. The problem is that sometimes the data is incomplete and I'll get errors. I would like to ignore regex completely as it is not necessary at this point.
For example the code below returns re.error: missing ), unterminated subpattern at position 10 as the last item on the list is missing a parenthesis. I've tried using if database['Item Name'].str.match(item, regex=False).any(): but it's not enough as the items can be named quite similarly and I would need perfect match. I've also tried to read re module documentation but I do not understand it well enough to get rid of the problem.
Any ideas how could I bypass the issue?
database = pd.read_csv("database.csv", sep=";")
list = ["Test Name !", "Test Name (2020)", "Test name ("]
for item in list:
if database['Item Name'].str.match(item).any():
# do something
pass
else:
#do something else
pass
If I understand your post correctly, you are trying to use the data read to create a regex. Since you don't want these treated as regexes, you might simply use string comparisons.
However, if your application requires the use of regex, you can use re.escape() render the string as literal so the paren won’t be magic.
For example:
import re
string1 = 'this is a magic ( that will break your regex'
string2 = re.escape(string1) # escapes your string
re.match(string2, "this won't cause issues")
#re.match(string1, "this will cause issues")

Replace string with quotes, brackets, braces, and slashes in python

I have a string where I am trying to replace ["{\" with [{" and all \" with ".
I am struggling to find the right syntax in order to do this, does anyone have a solid understanding of how to do this?
I am working with JSON, and I am inserting a string into the JSON properties. This caused it to put a single quotes around my inserted data from my variable, and I need those single quotes gone. I tried to do json.dumps() on the data and do a string replace, but it does not work.
Any help is appreciated. Thank you.
You can use the replace method.
See documentation and examples here
I would recommend maybe posting more of your code below so we can suggest a better answer. Just based on the information you have provided, I would say that what you are looking for are escape characters. I may be able to help more once you provide us with more info!
Use the target/replacement strings as arguments to replace().
The general format is mystring = mystring.replace("old_text", "new_text")
Since your target strings have backslashes, you also probably want to use raw strings to prevent them from being interpreted as special characters.
mystring = "something"
mystring = mystring.replace(r'["{\"', '[{"')
mystring = mystring.replace(r'\"', '"')
if its two characters you want to replace then you have to first check for first character and then the second(which should be present just after the first one and so on) and shift(shorten the whole array by 3 elements in first case whenever the condition is satisfied and in the second case delete \ from the array.
You can also find particular substring by using inbuilt function and replace it by using replace() function to insert the string you want in its place

Search a delimited string in a file - Python

I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and python script :
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
if re.search('JOL":"(.+?).tr', text):
print >> needed, text,
I want it to find what's between two words (JOL":" and .tr) and then print it. But all it does is printing all the text set in "read.json".
You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
match = re.search('JOL":"(.+?).tr', text)
if match:
print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
d = json.loads(text[1:])
if 'JOL' in d:
print >> needed, d['JOL']
Finally, you don't actually have anything named needed in your code; you opened a file named 'needed.txt', but you called the file object love. If your real code has a similar bug, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time…
If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Lets say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
//Transform the string into list
for i in l:
if "E" in l:
p = l.index("E")
if isinstance((p+1), int () is True:
if isinstance((p+2), int () is True:
delp = p+3
a = p-3
del l[delp:]
new = "".join(l)
new = new.replace("."," ")
print (new)
get in index where "E" and check if after "E" there are 2 integers.
Then delete everything after the second integer.
However this will not work if there is an "E" anyplace else.
at the moment the result I get is:
this is tEst
because it is finding index for the first "E" on the list and deleting everything after index+3
I guess my question is how do I get the index in the list if a combination of strings exists.
but I can't seem to find how.
thanks for everyone answers.
I was going in other direction but it is also not working.
if someone could see why it would be awesome. It is much better to learn by doing then just coping what others write :)
this is what I came up with:
for i in l:
if i=="E" and isinstance((i+1), int ) is True:
p = l.index(i)
print (p)
anyone can tell me why this isn't working. I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods for str.replace to swap the periods for spaces, and str.title to put them in Title Case
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match('(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the the re module. Regex patterns tend to use a lot of special characters, so it's common in Python to prepend the regex pattern string with 'r', which tells the Python interpreter not to interpret the special characters as escape characters. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
# The '0' means we want the first match found
index_of_Exx = match_object.end(0)
truncated_filename = filename[:index_of_Exx]
# Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
if i=="E" and isinstance((i+1), int ) is True:
p = l.index(i)
print (p)
...is because 'i' contains a character from the string 'l', not an integer. You compare it with 'E' (which works), but then try to add 1 to it, which errors out.

Python: regular expressions in control structures [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to check if text is “empty” (spaces, tabs, newlines) in Python?
I am trying to write a short function to process lines of text in a file. When it encounters a line with significant content (meaning more than just whitespace), it is to do something with that line. The control structure I wanted was
if '\S' in line: do something
or
if r'\S' in line: do something
(I tried the same combinations with double quotes also, and yes I had imported re.) The if statement above, in all the forms I tried, always returns False. In the end, I had to resort to the test
if re.search('\S', line) is not None: do something
This works, but it feels a little clumsy in relation to a simple if statement. My question, then, is why isn't the if statement working, and is there a way to do something as (seemingly) elegant and simple?
I have another question unrelated to control structures, but since my suspicion is that it is also related to a possibly illegal use of regular expressions, I'll ask it here. If I have a string
s = " \t\tsome text \t \n\n"
The code
s.strip('\s')
returns the same string complete with spaces, tabs, and newlines (r'\s' is no different). The code
s.strip()
returns "some text". This, even though strip called with no character string supposedly defaults to stripping whitespace characters, which to my mind is exactly what the expression '\s' is doing. Why is the one stripping whitespace and the other not?
Thanks for any clarification.
Python string functions are not aware of regular expressions, so if you want to use them you have to use the re module.
However if you are only interested in finding out of a string is entirely whitespace or not, you can use the str.isspace() function:
>>> 'hello'.isspace()
False
>>> ' \n\t '.isspace()
True
This is what you're looking for
if not line.isspace(): do something
Also, str.strip does not use regular expressions.
If you are really just want to find out if the line only consists of whitespace characters regex is a little overkill. You should got for the following instead:
if text.strip():
#do stuff
which is basically the same as:
if not text.strip() == "":
#do stuff
Python evaluates every non-empty string to True. So if text consists only of whitespace-characters, text.strip() equals "" and therefore evaluates to False.
The expression '\S' in line does the same thing as any other string in line test; it tests whether the string on the left occurs inside the string on the right. It does not implicitly compile a regular expression and search for a match. This is a good thing. What if you were writing a program that manipulated regular expressions input by the user and you actually wanted to test whether some sub-expression like \S was in the input expression?
Likewise, read the documentation of str.strip. Does it say that will treat it's input as a regular expression and remove matching strings? No. If you want to do something with regular expressions, you have to actually tell Python that, not expect it to somehow guess that you meant a regular expression this time while other times it just meant a plain string. While you might think of searching for a regular expression as very similar to searching for a string, they are completely different operations as far as the language implementation is concerned. And most str methods wouldn't even make sense when applied to a regular expression.
Because re.match objects are "truthy" in boolean context (like most class instances), you can at least shorten your if statement by dropping the is not None test. The rest of the line is necessary to actually tell Python what you want. As for your str.strip case (or other cases where you want to do something similar to a string operation but with a regular expression), have a look at the functions in the re module; there are a number of convenience functions on there that can be helpful. Or else it should be pretty easy to implement a re_split function yourself.

Categories

Resources