Escaping missing parenthesis using pandas str.match - python

I'm having trouble with regex. I'm trying to check if my database fully matches with the item name I'm working. The problem is that sometimes the data is incomplete and I'll get errors. I would like to ignore regex completely as it is not necessary at this point.
For example the code below returns re.error: missing ), unterminated subpattern at position 10 as the last item on the list is missing a parenthesis. I've tried using if database['Item Name'].str.match(item, regex=False).any(): but it's not enough as the items can be named quite similarly and I would need perfect match. I've also tried to read re module documentation but I do not understand it well enough to get rid of the problem.
Any ideas how could I bypass the issue?
import pandas as pd
database = pd.read_csv("database.csv", sep=";")
list = ["Test Name !", "Test Name (2020)", "Test name ("]
for item in list:
    if database['Item Name'].str.match(item).any():
        # do something
        pass
    else:
        # do something else
        pass

If I understand your post correctly, you are trying to use the data read to create a regex. Since you don't want these treated as regexes, you might simply use string comparisons.
However, if your application requires a regex, you can use re.escape() to render the string as a literal so the paren won’t be magic.
For example:
import re
string1 = 'this is a magic ( that will break your regex'
string2 = re.escape(string1) # escapes your string
re.match(string2, "this won't cause issues")
#re.match(string1, "this will cause issues")
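Applied to the exact-match check from the question, the idea might look like the sketch below. It uses only the standard library: the item names are made-up sample data standing in for the 'Item Name' column, and has_exact_match is a hypothetical helper, not part of pandas or re. It also shows that plain equality gives the same answer without any regex.

```python
import re

# Made-up sample data standing in for the 'Item Name' column.
item_names = ["Test Name !", "Test Name (2020)", "Test name (", "Test Name (2020) v2"]
items = ["Test Name !", "Test Name (2020)", "Test name (", "Test Name"]


def has_exact_match(item, names):
    """Return True if any name equals item, treating item as literal text."""
    pattern = re.escape(item)  # "(" and friends lose their regex meaning
    return any(re.fullmatch(pattern, name) for name in names)


for item in items:
    by_regex = has_exact_match(item, item_names)
    by_equality = item in item_names  # plain string comparison, no regex
    print(repr(item), by_regex, by_equality)
```

With a pandas Series, the plain-comparison variant would be (database['Item Name'] == item).any(), which never interprets item as a pattern.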

Related

How do I remove everything from a string except what I want?

Okay, so basically I want the user to be able to input something, like "quote python syntax and semantics", remove the word 'quote' and anything else (for example, the command could be, 'could you quote for me Python syntax and semantics') then format it in a way that I can pass it to the Wikipedia article URL (in this case 'https://en.wikipedia.org/wiki/Python_syntax_and_semantics'), request it and scrape the element(s) I want.
Any answer would be greatly appreciated.
Here's a simple example of doing this:
import re
msg = input() # Here give as input "quote python syntax and semantics"
repMsg = re.sub("quote", "", msg).strip() # Erase "quote" and space at the start
repMsg = re.sub(" ", "_", repMsg) # Replace spaces with _
print(repMsg) # prints "python_syntax_and_semantics"
Python's re module is very handy for this sort of thing. Note that you'll probably need to fine-tune your code, e.g. decide when to replace the first occurrence versus all occurrences, at which point to strip whitespace, etc.
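For the longer command in the question ('could you quote for me Python syntax and semantics'), one possible approach is sketched below. It assumes the topic is everything after the word "quote", optionally followed by "for me"; that rule, the wiki_url helper and the WIKI_BASE constant are illustrative assumptions, not something the question specifies.

```python
import re
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"  # base URL taken from the question


def wiki_url(command):
    """Build an article URL from a 'quote ...' command, or return None."""
    # Assumption: the topic is whatever follows "quote" (and an optional "for me").
    match = re.search(r"\bquote\b(?:\s+for\s+me)?\s+(.+)", command, flags=re.IGNORECASE)
    if match is None:
        return None
    topic = match.group(1).strip()
    if not topic:
        return None
    title = re.sub(r"\s+", "_", topic)    # spaces become underscores
    title = title[0].upper() + title[1:]  # article titles start with a capital
    return WIKI_BASE + quote(title)       # percent-encode anything unsafe


print(wiki_url("quote python syntax and semantics"))
print(wiki_url("could you quote for me Python syntax and semantics"))
```

Both calls produce the URL from the question; a command without the word "quote" returns None so the caller can decide what to do.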

regex search&replace a variable string including a regex statement

I want to use re.sub to replace a part of a string whose exact content I know. Relevant part of the code:
print "Regex statement: ", foundStatements[iterator]
print "string to replace with : \n", latexPreparedString
print "string to search&replace in: \n", fileAsString
processedString = re.sub(foundStatements[iterator], latexPreparedString, fileAsString)
print "processed string: \n", processedString
In my testing case, foundStatements[iterator] is "%#import script_example.py ( *out =(.|\n)*?return out)" But even though processedString contains foundStatements[iterator], processedString looks exactly like fileAsString, so it hasn't accomplished the re.sub task. What am I doing wrong?
EDIT: Ok, it definitely has something to do with the string I'm searching to replace containing regex code. Is there a way to make it just interpret foundStatements[iterator] as a raw string to search for? The only solution I can think of is to create a function that replaces any regex symbols in a string with \regexsymbol (e.g. * -> \*), but it'd make sense for there to be a way to solve this with inbuilt functions. It'd also be a bit overkill since I'd have to make sure it works with every single regex symbol, of which there are quite a few :/
EDIT2: Well, just changing it to re.sub(re.escape(foundStatements[iterator]), latexPreparedString, fileAsString) seems to work, except when the regex statement doesn't hit anything in the original file. To explain, latexPreparedString is generated by using the regex part of foundStatements[iterator]. While it's logical that it shouldn't be able to set latexPreparedString to anything when the regex statement doesn't hit anything, I set latexPreparedString = "" by default, so in that case re.sub should replace it with a blank string. Here's how the code looks at the moment: pastebin.com/wUedK3LN
First, for replacing an exact match in a string, you should use str.replace():
processedString = fileAsString.replace(foundStatements[iterator], latexPreparedString)
However, this will still fail in your case, because foundStatements[iterator] has a newline character in it. To escape it, you need to use the r prefix when declaring foundStatements[iterator].
If you still want to use re.sub, you have to both prefix the string with r and use re.escape(foundStatements[iterator]) instead of foundStatements[iterator]. The re module documentation has more on re.escape.
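A small sketch of both options follows. The file content and the LaTeX replacement are invented for illustration. One extra point it shows: re.sub also processes backslash escapes in the replacement string, so a replacement containing LaTeX commands is passed through a function here to keep it literal.

```python
import re

# The search text from the question, declared as a raw string so that
# the backslash and "n" stay two literal characters.
needle = r"%#import script_example.py ( *out =(.|\n)*?return out)"

# Invented sample data for illustration.
file_as_string = "before\n" + needle + "\nafter"
replacement = r"\begin{verbatim}out = 1\end{verbatim}"

# Option 1: plain string replacement, no regex involved.
via_replace = file_as_string.replace(needle, replacement)

# Option 2: re.sub with an escaped pattern. The function keeps the
# backslashes in the replacement from being read as escapes.
via_sub = re.sub(re.escape(needle), lambda match: replacement, file_as_string)

print(via_replace == via_sub)
print(via_sub)
```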

regEx matching a curly brace don't matched any way I try

While solving the trivial task of finding the start of the body of a PHP function, I can't get a regex match no matter what I try. Here's what I expected to do the job:
import re
print re.search(r"addToHead(){", "addToHead(){\n\tcode...").group()
# addToHead is the function I'm looking for.
# --> AttributeError: 'NoneType' object has no attribute 'group'
print re.search(r"addToHead()\{", "addToHead(){\n\tcode...").group()
# Neither a backslash nor a double backslash works.
print re.search(r"addToHead()[\{]", "addToHead(){\n\tcode...").group()
print re.search(r"addToHead()[\x7b]", "addToHead(){\n\tcode...").group()
# Nothing works...am I missing something??
I also tried re.DOTALL, with the same unpleasant result. Am I missing something obvious, or is this a bug?
Parentheses () group part of a pattern in a regular expression, so they have special meaning. An unescaped () is an empty group that matches the empty string, which means your pattern looks for addToHead{ rather than addToHead(){. To match the parentheses literally, escape them as \(\).
print re.search(r"addToHead\(\){", "addToHead(){\n\tcode...").group()
Output
addToHead(){
Oh, now, just a minute after I posted the question I found it, it's not with the curly brace, but the standard brackets...well, I should likely delete my question, but [Meta-question] would I be able to access it as a record of my past blindness?
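To make the cause visible, here is a short sketch in Python 3 syntax. It checks what the unescaped pattern really matches and shows re.escape as an alternative to escaping by hand; the variable names are only for illustration.

```python
import re

source = "addToHead(){\n\tcode..."

# Unescaped "()" is an empty group, so this pattern means "addToHead{".
unescaped = r"addToHead(){"
print(re.search(unescaped, source))        # no match in the PHP source
print(re.search(unescaped, "addToHead{"))  # matches text without parentheses

# Escaping by hand, as in the answer above.
by_hand = re.search(r"addToHead\(\){", source)
print(by_hand.group())

# Letting re.escape do the escaping for a literal search string.
escaped = re.search(re.escape("addToHead(){"), source)
print(escaped.group())
```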

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Lets say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
# Transform the string into a list
for i in l:
    if "E" in l:
        p = l.index("E")
        if isinstance((p+1), int) is True:
            if isinstance((p+2), int) is True:
                delp = p+3
                a = p-3
                del l[delp:]
                new = "".join(l)
                new = new.replace("."," ")
                print (new)
The idea is to get the index of "E" and check whether two integers follow it.
Then delete everything after the second integer.
However, this will not work if there is an "E" anywhere else.
at the moment the result I get is:
this is tEst
because it is finding index for the first "E" on the list and deleting everything after index+3
I guess my question is how to get the index in the list when a combination of strings exists,
but I can't seem to find out how.
Thanks for everyone's answers.
I was going in another direction, but it is also not working.
If someone could see why, it would be awesome. It is much better to learn by doing than by just copying what others write :)
This is what I came up with:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
Can anyone tell me why this isn't working? I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out Python's re module.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods for str.replace to swap the periods for spaces, and str.title to put them in Title Case
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match(r'(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() function of the re module. Regex patterns tend to use a lot of special characters, so it's common in Python to prefix the regex pattern string with 'r', which tells the Python interpreter not to interpret backslashes as escape characters. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
    # Group 0 is the whole match; end(0) is the index just past it
    index_of_Exx = match_object.end(0)
    truncated_filename = filename[:index_of_Exx]
    # Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
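As a quick illustration of the re.I flag, the sketch below uses a made-up file name with a lower-case 'e'. The same pattern finds nothing without the flag and matches with it.

```python
import re

pattern = r'E\d\d'
lower_name = 'show.name.e05.extra.tags'  # made-up file name for illustration

# Case-sensitive search: there is no upper-case 'E' followed by two digits.
print(re.search(pattern, lower_name))

# Case-insensitive search: 'e05' now matches.
match = re.search(pattern, lower_name, re.I)
truncated = lower_name[:match.end()]
print(truncated)
```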
The reason why this isn't working:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
...is because 'i' contains a character from the list 'l', not an integer. You compare it with 'E' (which works), but then try to add 1 to it, which raises a TypeError because a string and an integer can't be added.
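If you would rather keep the loop-based approach than switch to a regex, one way to repair it is sketched below. It walks the string with enumerate so that both the index and the character are available, and checks the two following characters with str.isdigit. The function name truncate_after_episode is made up for this example.

```python
def truncate_after_episode(name):
    """Cut name right after the first 'E' that is followed by two digits."""
    for index, char in enumerate(name):
        following = name[index + 1:index + 3]  # the next two characters, if any
        if char == "E" and len(following) == 2 and following.isdigit():
            return name[:index + 3]
    return name  # no 'E' plus two digits found, leave the name alone


filename = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
result = truncate_after_episode(filename).replace(".", " ").title()
print(result)
```

The 'E' in 'tEst3' is skipped because it is followed by 'st', not by two digits, so the cut happens after 'E00'.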

Repeating a python regular expression until a certain char

I want to get all of the text until a ! appears. Example
some textwfwfdsfosjtortjk\n
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf\n
sfsgdfgdfgdgdfgdg\n
!
The number of lines before the ! changes so I can't hardcode a reg exp like this
"+\n^.+\n^.+"
I am using re.MULTILINE, but should I be using re.DOTALL?
Thanks
Why does this need a regular expression?
index = text.find('!') # text is your input string
if index > -1:
    text = text[:index] # or text[:index + 1] to keep the '!', too
So you want to match everything from the beginning of the input up to (but not including) the first ! character? This should do it:
re.match(r'[^!]*', input)
If there are no exclamation points this will match the whole string. If you want to match only strings with ! in them, add a lookahead:
re.match(r'[^!]*(?=!)', input)
The MULTILINE flag is not needed because there are no anchors (^ and $), and DOTALL isn't needed because there are no dots.
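The difference between the two patterns shows up only when the input has no exclamation point. A short check, using invented sample strings:

```python
import re

with_bang = "first line\nsecond line\n!rest"  # invented sample input
without_bang = "no separator here"

plain = r"[^!]*"
with_lookahead = r"[^!]*(?=!)"

# Both patterns return the text before the first "!".
print(repr(re.match(plain, with_bang).group()))
print(repr(re.match(with_lookahead, with_bang).group()))

# Without a "!", the plain pattern matches everything...
print(repr(re.match(plain, without_bang).group()))
# ...while the lookahead version refuses to match at all.
print(re.match(with_lookahead, without_bang))
```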
Following the Python philosophy of "Easier to Ask Forgiveness Than Permission" (EAFP), I suggest you create a subroutine which is easy to understand and later maintain, should your separator change.
SEPARATOR = u"!"
def process_string(s):
    try:
        return s[:s.index(SEPARATOR)]
    except ValueError:
        return s
This function will return the string from the beginning up to, and not including, whatever you defined as separator. If the separator is not found, it will return the whole string. The function works regardless of new lines. If your separator changes, simply change SEPARATOR and you are good to go.
ValueError is the exception raised when you request the index of a character not in the string (try it in the command line: "Hola".index("1") will raise ValueError: substring not found). The workflow then assumes that most of the time you expect the SEPARATOR character to be in the string, so you attempt that first without asking for permission (testing if SEPARATOR is in the string); if you fail (the index method raises ValueError) then you ask forgiveness (return the string as originally received). This approach (EAFP) is considered Pythonic when it applies, as it does in this case.
No regular expressions needed; this is a simple problem.
Look into a 'lookahead' for that particular character you're reading, and match the whole first part as a pattern instead.
I'm not sure exactly how Python's regex reader is different from Ruby, but you can play with it in rubular.com
Maybe something like:
([^!]*(?=\!))
(Just tried this, seems to work)
It should do the job.
re.compile('(.*?)!', re.DOTALL).match(yourString).group(1)
I think you're making this more complex than it needs to be. Your reg exp just needs to say "repeat(any character except !) followed by !". Remember [^!] means "any character except !".
So, like this:
>>> import re
>>> rexp = re.compile("([^!]*)!")
>>> test = """sdasd
... asdasdsa
... asdasdasd
... asdsadsa
... !"""
>>> rexp.findall(test)
['sdasd\nasdasdsa\nasdasdasd\nasdsadsa\n']
>>>
re.DOTALL should be sufficient:
import re
text = """some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
!"""
rExp = re.compile(r"(.*)\!", re.S)
print rExp.search(text).groups()[0]
some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
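One caveat when choosing between the DOTALL patterns on this page: the greedy (.*) runs to the last ! in the text, while the non-greedy (.*?) stops at the first one. They agree only when there is a single !. A small check with an invented input containing two exclamation points:

```python
import re

text = "a\nb\n!c\n!d"  # invented input with two "!" characters

greedy = re.search(r"(.*)!", text, re.S).group(1)
non_greedy = re.search(r"(.*?)!", text, re.S).group(1)

print(repr(greedy))      # everything up to the last "!"
print(repr(non_greedy))  # everything up to the first "!"
```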
