Trying to find text in an article that may contain quotation marks - python

I'm using Python's findall function with a regular expression that should work, but I can't get it to return results that contain quotation marks (' or ").
This is what I tried:
Description = findall('<p>([A-Za-z ,\.\—'":;0-9]+).</p>\n', text)
The quotation marks inside the regular expression are causing the problem, and I have no idea how to get around it.

Placing a backslash before the single quote, as Sachith Rukshan suggested, makes it work.
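A minimal sketch of that fix, assuming Python 3. The sample text is hypothetical (the asker's real input is not shown), and the dash from the original character class is written as \u2014. A raw triple-quoted literal is shown as an alternative that needs no quote escaping.

```python
import re

# Hypothetical sample input; the asker's real `text` is not shown.
text = '<p>He said "it\'s 5:30"; fine.</p>\n'

# Option 1: keep the single-quoted literal and escape the inner single quote.
escaped = '<p>([A-Za-z ,.\u2014\'":;0-9]+).</p>\n'

# Option 2: a raw triple-quoted literal, so neither quote needs escaping.
triple = r'''<p>([A-Za-z ,.\u2014'":;0-9]+).</p>\n'''

descriptions = re.findall(escaped, text)
same_result = re.findall(triple, text)
print(descriptions)
```

Both patterns describe the same regex, so both calls return the same list.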

Related

Grabbing text between either double/single quote in Python regex

I have a bunch (thousands) of old unit testing scripts written with the Selenium RC interface in JavaScript. Since we're upgrading to Selenium 3, I want to try and get rid of some of the RC methods in an automated fashion using Python scripts. I'm iterating through these scripts line by line, picking up the Selenese methods, deconstructing them then attempting to rebuild with the WebDriver interface. For example:
selenium.type("xpath=//*[text()='test, xpath']", "test, text");
Would be output as...
driver.findElement(By.xpath("//*[text()='test, xpath']")).sendKeys("test, text");
I have a system for automatically identifying the Selenese methods, storing whitespace and separating the method from the parameters, so what I'm left with is the following string:
("xpath=//*[text()='test, xpath']", "test, text")
A problem I'm running into is, these aren't always consistent. Sometimes there are double-quotes nested in single-quotes, or vice-versa, or escaped double-quotes nested in double-quotes, etc. For example:
("xpath=//*[text()=\"test, xpath\"]", "test, text")
('xpath=//*[text()=\'test, xpath\']', 'test, text')
('xpath=//*[text()="test, xpath"]', 'test, text')
These are all valid. I want to be able to always match the arguments passed into the method, whether double-quotes are used or single-quotes, plus ignore nested quotes opposite of what's used to open the string as well as escaped quotes, then return them as lists.
['xpath=//*[text()="test, xpath"]', 'test, text']
...etc. I've attempted to use re.findall with the following expression.
([\"'])(?:(?=(\\?))\2.)*?\1
What I'm getting back is this.
>>> print arguments
[('"', ''), ('"', '')]
Is there something I'm missing?
I would not make this so complex with lookbehind or lookahead. I would rather build a case-specific regex. In your case you have something like this:
("param1", "param2")
('param1', 'param2')
Inside these params you may have additional escaped quotes, single quotes, and so on. But notice one thing: if you split on ", " or ', ', these exact patterns will rarely occur inside param1 and param2.
So the simplest non-regex solution would be to split on ", " or ', '. But there may be extra spaces, or no spaces, around the comma, so we use a pattern:
^\(\s*["']\s*(?P<first_param>.*?)("\s*,\s*"|'\s*,\s*')(?P<second_param>.*?)\s*["']\s*\)$
^\(\s*["']\s* to match the opening bracket and the starting quote
(?P<first_param>.*?) to match the first parameter
("\s*,\s*"|'\s*,\s*') to match our split pattern
(?P<second_param>.*?) to match the second parameter
\s*["']\s*\)$ to match the end.
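The breakdown above can be sketched in Python roughly as follows. Python's re module spells named groups (?P<name>...). The helper name split_two_args is made up for illustration, and the sketch assumes exactly two parameters, as in the question's examples.

```python
import re

# The answer's pattern, split across two raw literals for readability.
ARGS = re.compile(
    r"""^\(\s*["']\s*(?P<first_param>.*?)("\s*,\s*"|'\s*,\s*')"""
    r"""(?P<second_param>.*?)\s*["']\s*\)$"""
)

def split_two_args(call_args):
    """Return [first, second] for a two-argument call string, else None."""
    m = ARGS.match(call_args)
    if m is None:
        return None
    return [m.group("first_param"), m.group("second_param")]

samples = [
    '''("xpath=//*[text()='test, xpath']", "test, text")''',
    r'''("xpath=//*[text()=\"test, xpath\"]", "test, text")''',
    '''('xpath=//*[text()="test, xpath"]', 'test, text')''',
]
results = [split_two_args(s) for s in samples]
```

Escaped quotes inside a parameter are returned as written, backslashes included, so they may still need unescaping afterwards.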
This is not perfect, but it will work in 95%+ of your cases.
You can check the regex fiddle at the link below:
https://regex101.com/r/z9PytD/1/
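On the output shown in the question: when a pattern contains capturing groups, re.findall returns the groups rather than the whole match, which is why only quote characters and empty strings came back. A hedged sketch that keeps the asker's own expression and reads the full match through re.finditer instead (the helper name quoted_args is hypothetical):

```python
import re

# The expression from the question, unchanged.
QUOTED = re.compile(r"""(["'])(?:(?=(\\?))\2.)*?\1""")

def quoted_args(s):
    """Return each quoted argument without its surrounding quotes."""
    # group(0) is the whole match, quotes included; slice one quote off each end.
    return [m.group(0)[1:-1] for m in QUOTED.finditer(s)]

double_quoted = '''("xpath=//*[text()='test, xpath']", "test, text")'''
single_quoted = '''('xpath=//*[text()="test, xpath"]', 'test, text')'''

groups_only = QUOTED.findall(double_quoted)  # tuples of groups, not the text
args_double = quoted_args(double_quoted)
args_single = quoted_args(single_quoted)
```

Unlike the split-based pattern, this approach is not tied to exactly two parameters.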

Replace text between parentheses in python

My string will contain () in it. What I need to do is to change the text between the brackets.
Example string: "B.TECH(CS,IT)".
In my string I need to change the content present inside the brackets to something like this.. B.TECH(ECE,EEE)
What I tried to resolve this problem is as follows..
reg = r'(()([\s\S]*?)())'
a = 'B.TECH(CS,IT)'
re.sub(reg,"(ECE,EEE)",a)
But I got output like this..
'(ECE,EEE)B(ECE,EEE).(ECE,EEE)T(ECE,EEE)E(ECE,EEE)C(ECE,EEE)H(ECE,EEE)((ECE,EEE)C(ECE,EEE)S(ECE,EEE),(ECE,EEE)I(ECE,EEE)T(ECE,EEE))(ECE,EEE)'
Valid output should be like this..
B.TECH(ECE,EEE)
What am I missing, and how do I replace the text correctly?
The problem is that you're using parentheses, which have another meaning in RegEx. They're used as grouping characters, to catch output.
You need to escape the () where you want them as literal tokens. You can escape characters using the backslash character: \(.
Here is an example:
import re
reg = r'\([\s\S]*\)'
a = 'B.TECH(CS,IT)'
re.sub(reg, '(ECE,EEE)', a)
# == 'B.TECH(ECE,EEE)'
The reason your regex does not work is because you are trying to match parentheses, which are considered meta characters in regex. () actually captures a null string, and will attempt to replace it. That's why you get the output that you see.
To fix this, you'll need to escape those parens – something along the lines of
\(...\)
For your particular use case, might I suggest a simpler pattern?
In [268]: re.sub(r'\(.*?\)', '(ECE,EEE)', 'B.TECH(CS,IT)')
Out[268]: 'B.TECH(ECE,EEE)'
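A small sketch of both points, assuming Python 3: an unescaped () is an empty group that matches at every position, while escaped parentheses match the literal characters. The last two lines show how the greedy and lazy forms differ once a string has more than one bracketed part; the string multi is a made-up example.

```python
import re

a = 'B.TECH(CS,IT)'

# An empty group matches the empty string at every position.
empty_hits = re.sub(r'()', '*', 'ab')

# Escaped parentheses match literal ( and ).
fixed = re.sub(r'\(.*?\)', '(ECE,EEE)', a)

# Greedy vs lazy on a string with two bracketed parts.
multi = 'A(1) B(2)'
greedy = re.sub(r'\(.*\)', 'X', multi)   # spans from the first ( to the last )
lazy = re.sub(r'\(.*?\)', 'X', multi)    # stops at the nearest )
```

For the single pair of brackets in the question both forms give the same result.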

Python regular expression r prefix followed by three single (or double) quotes

What does r prefix followed by three single (or double) quotes, namely r''' ''', mean in Python? And when to use it? Could you explain further the following examples?
foo = r'''foo'''
punctuation = r'''['“".?!,:;]'''
Related SO posts:
The difference between three single quote'd and three double quote'd docstrings in python
r prefix in Python regex
This question is not a duplicate. A direct and concise answer should be there in SO knowledge-base to make the community better.
If your pattern is surrounded by triple quotes, the quotes inside the regex don't need escaping.
The simple one:
r'''foo"'b'a'r"buzz'''
The tough one, which needs escaping:
r'foo"\'b\'a\'r"buzz'
This is most helpful when your regex contains many quotes.
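A quick check of the two spellings, assuming Python 3. Note that in a raw string the backslashes stay in the string, so the two literals are not equal as strings, yet they behave the same as regexes because \' in a pattern matches a plain single quote.

```python
import re

target = 'foo"\'b\'a\'r"buzz'        # the text both patterns should match

triple = r'''foo"'b'a'r"buzz'''      # no escaping needed
escaped = r'foo"\'b\'a\'r"buzz'      # backslashes are kept in the string

same_string = (triple == escaped)
triple_ok = re.fullmatch(triple, target) is not None
escaped_ok = re.fullmatch(escaped, target) is not None
```

The escaped literal is three characters longer, one backslash per single quote.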
Not just in regular expressions, if you want to declare multiline string you need to use triple quote notation.
E.g.:-
paragraph = """ It is a paragraph
It is a graph
It is para
"""
Strings delimited by ''' or """ can span multiple lines.
For example:
x="""hey
hi"""
Here x will contain \n even though you didn't type it. You can also include ' and " inside it.
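A short sketch of what the triple quotes and the r prefix each contribute, assuming Python 3: the triple quotes allow real line breaks and unescaped quotes, and the r prefix only stops Python from interpreting backslash escapes.

```python
x = """hey
hi"""

# The line break typed in the source becomes a newline character.
has_newline = (x == "hey\nhi")

# With the r prefix, \n stays as two characters: a backslash and the letter n.
raw = r"""a\nb"""
plain = """a\nb"""

# Both kinds of quote can appear inside without escaping.
quotes = '''it's "quoted"'''
```

So r'''...''' is simply both features at once, which suits regexes that contain quotes and backslashes.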

Python regular expression to pull text inside of HTML quotation marks

I'm attempting to pull ticker symbols from corporations' 10-K filings on EDGAR. The ticker symbol typically appears between a pair of HTML quotation marks, e.g., "&#8216;" or "&#8217;". An example of a typical portion of relevant text:
Our common stock has been listed on the New York Stock Exchange (“NYSE”) under the symbol “RXN”
At this point I am just trying to figure out how to deal with the occurrence of one or more of a variety of quotation marks. I can write a regex that matches one particular type of quotation mark:
re.findall(r'under[^<]*the[^<]*symbol[^<]*&#8220;*[^<]*\n',fileText)
However, I can't write a regex that looks for more than one type of quotation mark. This regex produces nothing:
re.findall(r'under[^<]*the[^<]*symbol[^<]*&#8220;*&#8216;*&#8217;*&#8220;*[^<]*\n',fileText)
Any help would be appreciated.
Your regex looks for all of the quotes occurring together. If you're looking for any one of the possibilities, you need to put parentheses around each string and OR them together:
(?:&#8220;)*|(?:&#8216;)*|(?:&#8217;)*|(?:&#8220;)*
The ?: makes the paren groups non-capturing. I.e., the parser won't save each one as important text. As an aside, you'll probably want to use group-capturing to save the ticker symbol -- what you're actually looking for. Very quick-and-dirty (and ugly) expression that will return ['NYSE', 'RXN'] from the given string:
re.findall(r'(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))(.+?)(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))', fileText)
You'd probably want to only include left-quotes in the first group and right-quotes in the last group. Plus either-or quotes in both.
You can use
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))), text)
This works because re.sub accepts a callable as the replacement. The number after "#" is the Unicode code point of the character, and Python's chr function converts it to text.
For example:
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))),
"this is a &#8220;test&#8220;")
results in
'this is a “test“'
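Putting the two answers together, one possible approach is to decode the entities first and only then look for quoted text. This is a sketch assuming Python 3 and assuming the filing encodes the quotes as numeric entities; the sample string is adapted from the question. The standard library's html.unescape is shown next to the re.sub approach as an alternative.

```python
import html
import re

raw = ("listed on the New York Stock Exchange (&#8220;NYSE&#8221;) "
       "under the symbol &#8220;RXN&#8221;")

# Decode numeric entities with the callable-replacement approach...
by_regex = re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), raw)
# ...or let the standard library do it.
by_stdlib = html.unescape(raw)

# Opening quote, then anything that is not a closing quote, then a closing quote.
tickers = re.findall(
    r'[\u201c\u2018"]([^\u201d\u2019"]+)[\u201d\u2019"]', by_stdlib
)
print(tickers)
```

A right single quote used as an apostrophe would confuse this character class, so real filings may need a stricter pattern.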

Weird Python Regex Issues

whitespace_pattern = u"\s" # bug: tried to use unicode \u0020, broke regex
time_sig_pattern = \
"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
time_sig = compile(time_sig_pattern, U|M)
For some reason, adding the Verbose flag, X, to compile breaks the pattern.
Also, I wanted to use unicode for whitespace_pattern recognition (supposedly, we'll get patterns that use non-unicode spaces and we need to explicitly check for that one unicode character as a valid space), but the pattern keeps breaking.
VERBOSE gives you the ability to write comments in your regex to document it.
In order to do so, it ignores spaces, since you need to use line breaks to write comments.
Replace all spaces in your regex with \s to specify that they are spaces you want to match in your pattern, and not just spaces used to format your comments.
What's more, you may want to use the r prefix for the string you use as a pattern. It tells Python not to interpret special notations such as \n and let you use backslashes without escaping them.
Always define regexes with the r prefix to indicate they are raw strings.
r"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
When creating a regex to match unicode characters you do not want to use a Python unicode string. In your example the regular expression engine needs to see the literal characters \u0020, so you should use whitespace_pattern = r"\u0020" instead of u"\u0020".
As other answers have mentioned, you should also use the r prefix for time_sig_pattern. After those two changes your code should work fine.
For VERBOSE to work correctly you need to escape all whitespace in the pattern, so towards the beginning of the pattern replace the space in time signature with "\ " (quotes for clarity), \s, or [ ], as described in the re.VERBOSE documentation.
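The advice above can be sketched as follows, assuming Python 3. The first pattern shows the failure mode: under re.X the literal space is dropped. The second is a verbose rewrite of the question's pattern that keeps the space with [ ]; the sample input is made up.

```python
import re

# Under VERBOSE the unescaped space is ignored, so this is really "timesignature".
loose = re.compile(r"time signature", re.X)

ws = r"\s"  # raw string: the regex engine, not Python, interprets \s
time_sig = re.compile(r"""
    ^ %(ws)s* time[ ]signature: %(ws)s*   # [ ] keeps the space literal
    (?P<top>\d+) %(ws)s* / %(ws)s*        # numerator, then the slash
    (?P<bottom>\d+) %(ws)s* $             # denominator
""" % {"ws": ws}, re.U | re.M | re.X)

sample = "tempo: 120\n  time signature: 3 / 4  \n"  # hypothetical input
m = time_sig.search(sample)
top, bottom = m.group("top"), m.group("bottom")
```

Since the comments pass through % formatting before compilation, they must not contain a stray % character.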
