Python regular expression to pull text inside of HTML quotation marks - python

I'm attempting to pull ticker symbols from corporations' 10-K filings on EDGAR. The ticker symbol typically appears between a pair of HTML quotation marks, e.g., "‘" or "’". An example of a typical portion of relevant text:
Our common stock has been listed on the New York Stock Exchange (“NYSE”) under the symbol “RXN”
At this point I am just trying to figure out how to deal with the occurrence of one or more of a variety of quotation marks. I can write a regex that matches one particular type of quotation mark:
re.findall(r'under[^<]*the[^<]*symbol[^<]*“*[^<]*\n',fileText)
However, I can't write a regex that looks for more than one type of quotation mark. This regex produces nothing:
re.findall(r'under[^<]*the[^<]*symbol[^<]*“*‘*’*“*[^<]*\n',fileText)
Any help would be appreciated.

Your regex looks for all of the quotes occurring together, one after another. If you're looking for any one of the possibilities, you need to put parentheses around each string and OR them together with |:
(?:“)*|(?:‘)*|(?:’)*|(?:“)*
The ?: makes the paren groups non-capturing. I.e., the parser won't save each one as important text. As an aside, you'll probably want to use group-capturing to save the ticker symbol -- what you're actually looking for. Very quick-and-dirty (and ugly) expression that will return ['NYSE', 'RXN'] from the given string:
re.findall(r'(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))(.+?)(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))', fileText)
You'd probably want to only include left-quotes in the first group and right-quotes in the last group. Plus either-or quotes in both.
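As a rough sketch of the left-quote/right-quote idea above, assuming the entities have already been decoded to Unicode characters: the sample text and the names OPEN, CLOSE and tickers are made up for illustration, and the character classes may need more quote variants for real filings.

```python
import re

# Hypothetical sample; the quote characters are already decoded to Unicode.
sample = "listed on the New York Stock Exchange (“NYSE”) under the symbol “RXN”"

# Left quotes are only allowed before the ticker, right quotes only after it.
# The plain ASCII double quote is accepted on both sides.
OPEN = '“‘"'
CLOSE = '”’"'

# A character class matches any ONE of the listed quote characters.
pattern = re.compile(rf'under\s+the\s+symbol\s+[{OPEN}]([^{CLOSE}]+)[{CLOSE}]')

tickers = pattern.findall(sample)
print(tickers)  # ['RXN']
```

Anchoring on "under the symbol" skips the “NYSE” hit that the quick-and-dirty expression also returns.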

You can use
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))), text)
This works because re.sub accepts a callable as the replacement argument. The number after "#" is the Unicode code point of the character, and Python's chr function converts it to text.
For example:
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))),
       "this is a &#8220;test&#8220;")
results in
'this is a “test“'
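An alternative worth considering is the standard library's html.unescape, which decodes numeric and named character references in one call. The sketch below compares it with the re.sub approach; the sample strings are invented for illustration.

```python
import html
import re

# Hypothetical snippet of filing text with numeric character references.
raw = "under the symbol &#8220;RXN&#8221;"

# The re.sub approach from above: turn each number into its character.
decoded_with_sub = re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), raw)

# The standard-library helper gives the same result for these references.
decoded_with_html = html.unescape(raw)

# Legacy references such as &#147; differ: chr(147) is a control character,
# while html.unescape applies the HTML5 remapping to a curly quote.
legacy = html.unescape("&#147;RXN&#148;")

print(decoded_with_sub)
print(decoded_with_html)
print(legacy)
```

If the filings use the &#145;-&#148; style references, the html.unescape route is probably the safer one, since the plain chr conversion would leave control characters in the text.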

Related

Extracting Key Value pairs from a String using Regex

I have a web-scraped string containing key-value pairs, e.g. firstName:"Quaran", lastName:"McPherson"
st = '{"accountId":405266,"firstName":"Quaran","lastName":"McPherson","accountIdentifier":"StudentAthlete","profilePicUrl":"https://pbs.twimg.com/profile_images/1331475329014181888/4z19KrCf.jpg","networkProfileCode":"quaran-mcpherson","hasDeals":true,"activityMin":11,"sports":["Men\'s Basketball","Basketball"],"currentTeams":["Nebraska Cornhuskers"],"previousTeams":[],"facebookReach":null,"twitterReach":619,"instagramReach":0,"linkedInReach":null},{"accountId":375964,"firstName":"Micole","lastName":"Cayton","accountIdentifier":"StudentAthlete","profilePicUrl":"https://opendorsepr.blob.core.windows.net/media/375964/20220622223838_46dbe3fd-a683-436b-84d4-90c84a5af35f.jpg","networkProfileCode":"micole-cayton","hasDeals":true,"activityMin":16,"sports":["Basketball","Women\'s Basketball"],"currentTeams":["Minnesota Golden Gophers"],"previousTeams":["Cal Berkeley Golden Bears"],"facebookReach":0,"twitterReach":1273,"instagramReach":5700,"linkedInReach":null}'
I am trying to extract the first name, last name and a few other parameters from this string as lists, so that I end up with, for example, a first_name list holding every first name in the string.
I tried using re.findall('"firstName":'"(.*)\S$",st) to access the text "Quaran", but the result comes back in the following format:
'"Quaran","lastName":"McPherson","accountIdentifier":"StudentAthlete","profilePicUrl":"https://pbs.twimg.com/profile_images/1331475329014181888/4z19KrCf.jpg","networkProfileCode":"quaran-mcpherson","hasDeals":true,"activityMin":11,"sports":["Men\'s Basketball","Basketball"],"currentTeams":["Nebraska Cornhuskers"],"previousTeams":[],"facebookReach":null,"twitterReach":619,"instagramReach":0,"linkedInReach":null}
How do I specify within the regex that the match should end at the closing quote of the name?
TIA
Your string looks like the contents of a JSON array, and valid JSON is easy to parse in any language. To make the string valid, add '[' at the start and ']' at the end, then parse it. For example:
JavaScript:
JSON.parse('[' + st + ']')
Python:
import json
data = json.loads('[' + st + ']')
Regular expression:
If you strictly want to parse it with a regular expression, use:
/(?:\"|\')(?<key>[\w\d]+)(?:\"|\')(?:\:\s*)(?:\"|\')?(?<value>[\w\s-]*)(?:\"|\')?/gm
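The JSON route can be sketched roughly as follows. The string here is a shortened, hypothetical version of the one in the question, and records, first_names and last_names are illustrative names.

```python
import json

# Shortened stand-in for the scraped string: objects separated by commas,
# without the surrounding square brackets.
st = ('{"accountId":405266,"firstName":"Quaran","lastName":"McPherson"},'
      '{"accountId":375964,"firstName":"Micole","lastName":"Cayton"}')

# Wrapping the text in [ ] turns it into a valid JSON array.
records = json.loads("[" + st + "]")

# Each record is now a plain dict, so building the lists is straightforward.
first_names = [record["firstName"] for record in records]
last_names = [record["lastName"] for record in records]

print(first_names)  # ['Quaran', 'Micole']
print(last_names)   # ['McPherson', 'Cayton']
```

This avoids having to decide where a value ends, because the JSON parser already handles quoting and escaping.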
Try this:
(?<="firstName":")[^"\r\n]+
(?<="firstName":") is a lookbehind: it moves to the point right after "firstName":" appears in the string.
[^"\r\n]+ then matches one or more characters other than ", \r and \n, so the match does not run past the closing double quote of the firstName value or across a newline.
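A minimal sketch of that pattern with re.findall, using a shortened, hypothetical version of the string from the question:

```python
import re

# Shortened stand-in for the scraped string in the question.
st = ('{"accountId":405266,"firstName":"Quaran","lastName":"McPherson"},'
      '{"accountId":375964,"firstName":"Micole","lastName":"Cayton"}')

# The lookbehind has a fixed width, which Python's re module requires.
first_names = re.findall(r'(?<="firstName":")[^"\r\n]+', st)

print(first_names)  # ['Quaran', 'Micole']
```

Because the pattern has no capturing group, findall returns the whole match, which is just the name.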
Try this regex: (?<=\"firstName\":\").*?(?=\"). The ? after .* makes it a lazy match, so it stops matching as soon as it finds a " character.
I want to add that with backtracking there can be an exponential performance penalty. I can instead use something like "firstName":"(.*?)" and extract the capturing group, so that there is only a linear performance penalty.
Regex: https://regex101.com/r/uM2l8M/1
Python code
import re
st = '{"accountId":405266,"firstName":"Quaran","lastName":"McPherson","accountIdentifier":"StudentAthlete","profilePicUrl":"https://pbs.twimg.com/profile_images/1331475329014181888/4z19KrCf.jpg","networkProfileCode":"quaran-mcpherson","hasDeals":true,"activityMin":11,"sports":["Men\'s Basketball","Basketball"],"currentTeams":["Nebraska Cornhuskers"],"previousTeams":[],"facebookReach":null,"twitterReach":619,"instagramReach":0,"linkedInReach":null},{"accountId":375964,"firstName":"Micole","lastName":"Cayton","accountIdentifier":"StudentAthlete","profilePicUrl":"https://opendorsepr.blob.core.windows.net/media/375964/20220622223838_46dbe3fd-a683-436b-84d4-90c84a5af35f.jpg","networkProfileCode":"micole-cayton","hasDeals":true,"activityMin":16,"sports":["Basketball","Women\'s Basketball"],"currentTeams":["Minnesota Golden Gophers"],"previousTeams":["Cal Berkeley Golden Bears"],"facebookReach":0,"twitterReach":1273,"instagramReach":5700,"linkedInReach":null}'
pattern = re.compile('"firstName":"(.*?)"')
for match in pattern.finditer(st):
    print(match.group(1))

Substring between known two markers extraction with problem markers

#miernic asked long ago how do you extract an arbitrary string which is located between two known markers in another string.
My problem is that the two markers include regular expression metacharacters. Specifically, I need to extract ABCD from the string ('ABCD',), where the parentheses, single quotes and comma are all part of the source string. The extracted string itself might include single and double quotes, dots, parentheses, and white space. The markers are always (' and ',).
I tried to use r' strings and lots of escape characters and nothing works.
Pleeeease....
Converting my comment to answer so that solution is easy to find for future visitors.
You may use this regex, with " as the string delimiter so the single quotes need no escaping:
r"\('(.+?)',\)"
Use the above regex in re.findall so that only the captured group is returned.
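For example, a small sketch of both the hand-escaped pattern and a more general version that lets re.escape handle the metacharacters. The helper name between and the sample strings are made up for illustration.

```python
import re

source = "('ABCD',)"

# Hand-escaped version: only the parentheses need a backslash.
extracted = re.findall(r"\('(.+?)',\)", source)
print(extracted)  # ['ABCD']


def between(text, start, end):
    """Return every substring found between the literal markers start and end."""
    # re.escape neutralises any regex metacharacters inside the markers.
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    return re.findall(pattern, text, flags=re.DOTALL)


# The payload may itself contain quotes, dots, parentheses and spaces.
tricky = between("('a (b) \"c\"',)", "('", "',)")
print(tricky)  # ['a (b) "c"']
```

The lazy (.+?) stops at the first occurrence of the closing marker, so a payload that itself contains ',) would be cut short.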

Trying to find text in an article that may contain quotation marks

I'm using Python's findall function with a regular expression that should work, but I can't get the function to output results that contain quotation marks (' or ").
This is what I tried:
Description = findall('<p>([A-Za-z ,\.\—'":;0-9]+).</p>\n', text)
The quotation marks inside the regular expression are causing the trouble, and I have no idea how to get around it.
Placing a backslash before the single quote, as Sachith Rukshan suggested, makes it work.
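A sketch of what the escaped version might look like; the sample text is invented, and the pattern is otherwise the one from the question with the redundant backslashes inside the character class dropped.

```python
import re

# Hypothetical article fragment containing both kinds of quotation marks.
text = '<p>He said "it\'s fine"; done.</p>\n'

# In a single-quoted raw string, \' keeps the string literal open, and the
# regex engine then reads \' as a literal single quote.
pattern = r'<p>([A-Za-z ,.—\'":;0-9]+).</p>\n'

descriptions = re.findall(pattern, text)
print(descriptions)  # ['He said "it\'s fine"; done']
```

Note that the final full stop is consumed by the unescaped . after the group, so it is not part of the captured text.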

Pandas extract text notation

I'm new to Pandas, using it for a class, and I can't for the life of me find a resource that shows the notation used in pandas when representing text in the extract function. For example:
movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)
I know this is telling the extract function to extract everything inside the parentheses from examples done in class, but I don't understand which symbols mean what inside the extract function. Is there a resource that can explain what these symbols mean? Thank you.
In General
The string argument of the .str.extract is a Regular Expression (regex), which is a language used for pattern matching and feature extraction in strings. If you go to the section called "Regular Expression Patterns" in the previous link you can find the meaning of the special control characters.
This Example
What specifically that regex string means is:
match any character, ., zero or more times, *, until a parenthesis, \(, then extract all the content in the parentheses, (.*), then close parenthesis, \), then any character zero or more times, .*, again.
Essentially this will match any string like: 'xxx(message)xxxx' or '(message)' or 'xx(message)' or '(message)x' and extract the 'message'.
Notes on Pandas and Regex
An important part of regular expressions (in general, but particularly for use in pandas with .str.extract) is capturing groups. You can 'capture' or grab part of a string by enclosing the pattern for that part in parentheses. Note that these are the unescaped parentheses in the regex (the inner set, with no preceding backslash) and not the literal parentheses that appear in the string itself, e.g. in 'xxx(message)xxx'.
Check out the docs on .str.extract for a few examples of using regex with capturing groups in pandas:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html
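To see what the pattern captures without needing a DataFrame, here is a sketch using Python's re module directly, since .str.extract applies the same regular expression to each element. The titles are made-up examples, and the tighter pattern at the end is only a suggestion.

```python
import re

# The pattern from the question, written as a raw string.
pattern = r'.*\((.*)\).*'

# Hypothetical movie titles, one with two parenthesised parts.
titles = [
    "Toy Story (1995)",
    "City of Lost Children, The (Cité des enfants perdus, La) (1995)",
]

# group(1) is the content of the unescaped (capturing) parentheses.
years = [re.match(pattern, title).group(1) for title in titles]
print(years)  # ['1995', '1995']

# A stricter alternative that only accepts four digits inside the parentheses.
strict_years = [re.search(r'\((\d{4})\)', title).group(1) for title in titles]
print(strict_years)  # ['1995', '1995']
```

The leading .* is greedy, so when a title has several parenthesised parts the capture comes from the last pair.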

regex: replace hyphens with en-dashes with re.sub

I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).
The function I use adds some regex flavor to a solution in a related problem (how to replace a character INSIDE the text content of many files automatically?)
def mychanger(fileName):
    with open(fileName,'r') as file:
        str = file.read()
        str = str.decode("utf-8")
    str = re.sub(r"[^{]{1,4}(-)","–", str).encode("utf-8")
    with open(fileName,'wb') as file:
        file.write(str)
I used the regular expression [^{]{1,4}(-) because the search is actually performed on latex regression tables and I only want to replace the hyphens that occur around numbers.
To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine latex code such as \cmidrule(lr){2-4}.
In this case there is a { close (within 3-4 characters max) to the hyphen and to the left of it. Of course, this hyphen should not be changed into an en-dash otherwise the latex code will break.
I think the left part condition of the exclusion is important to write the correct exception in regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { on the close right of the hyphen) and in that case I do want to replace the hyphen.
A typical line in my table is
variable & -2.061\sym{***}& 4.032\sym{**} & 1.236 \\
& (-2.32) & (-2.02) & (-0.14)
However, my regex does not appear to be correct. For instance, a (-1.2) will be replaced as –1.2, dropping the parenthesis.
What is the problem here?
Thanks!
I can offer the following two step replacement:
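import re
text = r"-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
# Step 1: ranges such as 1-5 that are not inside curly braces
text = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])", "\\1$\\2", text)
# Step 2: numbers preceded by a non-digit or by the start of the string
text = re.sub(r"(^|[^0-9])-(\d+)", "\\1$\\2", text)
print(text)
The first replacement targets all ranges which are not of the LaTeX form {1-9}, i.e. are not contained within curly braces. The second replacement targets all numbers preceded by a non-digit or by the start of the string. The $ in the replacement strings marks where the hyphen was replaced; put the en-dash there for the real substitution.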
re.sub replaces the entire match. In this case that includes the non-{ characters preceding your -. You can wrap that part in parentheses to create a \1 group and include it in your substitution (you also don't need parentheses around the -):
re.sub(r"([^{]{1,4})-",r"\1–", str)
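A quick sketch of the difference on the (-1.2) case from the question; the variable names are illustrative.

```python
import re

cell = "(-1.2)"

# Original pattern: the whole match, including "(", is replaced by the en-dash.
broken = re.sub(r"[^{]{1,4}(-)", "–", cell)

# Grouped pattern: the preceding characters are captured and written back.
fixed = re.sub(r"([^{]{1,4})-", r"\1–", cell)

print(broken)  # –1.2)
print(fixed)   # (–1.2)
```

This only addresses the dropped parenthesis. A hyphen at the very start of the text has no preceding character to match, and the 2-4 in \cmidrule(lr){2-4} is still matched because the 2 itself is a non-{ character, so the pattern needs further tightening for those cases.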
