Extracting Key Value pairs from a String using Regex


I have a web-scraped string containing key-value pairs, e.g. firstName:"Quaran", lastName:"McPherson":
st = '{"accountId":405266,"firstName":"Quaran","lastName":"McPherson","accountIdentifier":"StudentAthlete","profilePicUrl":"https://pbs.twimg.com/profile_images/1331475329014181888/4z19KrCf.jpg","networkProfileCode":"quaran-mcpherson","hasDeals":true,"activityMin":11,"sports":["Men\'s Basketball","Basketball"],"currentTeams":["Nebraska Cornhuskers"],"previousTeams":[],"facebookReach":null,"twitterReach":619,"instagramReach":0,"linkedInReach":null},{"accountId":375964,"firstName":"Micole","lastName":"Cayton","accountIdentifier":"StudentAthlete","profilePicUrl":"https://opendorsepr.blob.core.windows.net/media/375964/20220622223838_46dbe3fd-a683-436b-84d4-90c84a5af35f.jpg","networkProfileCode":"micole-cayton","hasDeals":true,"activityMin":16,"sports":["Basketball","Women\'s Basketball"],"currentTeams":["Minnesota Golden Gophers"],"previousTeams":["Cal Berkeley Golden Bears"],"facebookReach":0,"twitterReach":1273,"instagramReach":5700,"linkedInReach":null}'
I am trying to extract the first name, last name and a few other parameters from this string as lists, so that I end up with, for example, a first_name list holding every first name in the string.
I tried using re.findall('"firstName":'"(.*)\S$",st) to get the text "Quaran", but the result comes back in the following format:
'"Quaran","lastName":"McPherson","accountIdentifier":"StudentAthlete","profilePicUrl":"https://pbs.twimg.com/profile_images/1331475329014181888/4z19KrCf.jpg","networkProfileCode":"quaran-mcpherson","hasDeals":true,"activityMin":11,"sports":["Men\'s Basketball","Basketball"],"currentTeams":["Nebraska Cornhuskers"],"previousTeams":[],"facebookReach":null,"twitterReach":619,"instagramReach":0,"linkedInReach":null}
How do I specify within the regex that the match should stop at the end of the quoted name?
TIA

Your string looks like the contents of a JSON array, and valid JSON is easy to parse in any language. To make the string valid, add '[' at the start and ']' at the end, then parse it. For example:
JavaScript:
JSON.parse('[' + st + ']')
Python:
import json
data = json.loads('[' + st + ']')
Regular expression:
If you strictly want to parse it with a regular expression, use:
/(?:\"|\')(?<key>[\w\d]+)(?:\"|\')(?:\:\s*)(?:\"|\')?(?<value>[\w\s-]*)(?:\"|\')?/gm
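A minimal sketch of the JSON approach in Python, assuming the scraped text is a comma-separated run of valid JSON objects as in the question. The sample string is shortened, and the names records, first_names and last_names are illustrative only.

```python
import json

# Shortened sample in the same shape as the question's string:
# JSON objects separated by commas, without enclosing brackets.
st = ('{"accountId":405266,"firstName":"Quaran","lastName":"McPherson"},'
      '{"accountId":375964,"firstName":"Micole","lastName":"Cayton"}')

# Wrapping the text in [ ] turns it into a valid JSON array.
records = json.loads("[" + st + "]")

# One list per field, in the order the objects appear.
first_names = [record["firstName"] for record in records]
last_names = [record["lastName"] for record in records]

print(first_names)  # ['Quaran', 'Micole']
print(last_names)   # ['McPherson', 'Cayton']
```

Parsing first and indexing afterwards also copes with values that contain escaped quotes or commas, which a hand-written pattern may not.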

Try this:
(?<="firstName":")[^"\r\n]+
(?<="firstName":") moves to the point where "firstName":" appears in the string,
[^"\r\n]+ then matches one or more characters other than ", \r and \n, so the match cannot run past the closing double quote of the firstName value or across a newline.
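As a rough illustration, the lookbehind pattern can be passed to re.findall to build one list per key. The sample string below is a shortened stand-in for the question's data, and the variable names are hypothetical.

```python
import re

# Shortened stand-in for the scraped string in the question.
st = ('{"firstName":"Quaran","lastName":"McPherson"},'
      '{"firstName":"Micole","lastName":"Cayton"}')

# The lookbehind anchors the match right after the key; the negated
# class then consumes the value up to (not including) the closing quote.
first_names = re.findall(r'(?<="firstName":")[^"\r\n]+', st)
last_names = re.findall(r'(?<="lastName":")[^"\r\n]+', st)

print(first_names)  # ['Quaran', 'Micole']
print(last_names)   # ['McPherson', 'Cayton']
```

Python's re module only accepts fixed-width lookbehinds, which is fine here because the key text has a constant length.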

Try this regex: (?<=\"firstName\":\").*?(?=\"). The ? after .* makes it a lazy match, so it stops matching as soon as it finds a " character.
I want to add that lookarounds combined with backtracking can carry a performance penalty in some patterns. Alternatively, I can use "firstName":"(.*?)" and extract the capturing group, which avoids the lookarounds.
Regex: https://regex101.com/r/uM2l8M/1
Python code
import re
st = '{"accountId":405266,"firstName":"Quaran","lastName":"McPherson","accountIdentifier":"StudentAthlete","profilePicUrl":"https://pbs.twimg.com/profile_images/1331475329014181888/4z19KrCf.jpg","networkProfileCode":"quaran-mcpherson","hasDeals":true,"activityMin":11,"sports":["Men\'s Basketball","Basketball"],"currentTeams":["Nebraska Cornhuskers"],"previousTeams":[],"facebookReach":null,"twitterReach":619,"instagramReach":0,"linkedInReach":null},{"accountId":375964,"firstName":"Micole","lastName":"Cayton","accountIdentifier":"StudentAthlete","profilePicUrl":"https://opendorsepr.blob.core.windows.net/media/375964/20220622223838_46dbe3fd-a683-436b-84d4-90c84a5af35f.jpg","networkProfileCode":"micole-cayton","hasDeals":true,"activityMin":16,"sports":["Basketball","Women\'s Basketball"],"currentTeams":["Minnesota Golden Gophers"],"previousTeams":["Cal Berkeley Golden Bears"],"facebookReach":0,"twitterReach":1273,"instagramReach":5700,"linkedInReach":null}'
pattern = re.compile('"firstName":"(.*?)"')
for match in pattern.finditer(st):
    print(match.group(1))

Related

Why can't I scoop out some ID's of some strings using regex?

I'm trying to scoop out some IDs from some strings. The portion I would like to grab from each string is between id- and ?. The latter is not always present, so I want to make the ? optional. I know I can achieve the same using string manipulation, but I want to do it with a regex.
I've tried with:
import re
content = """
id-HTRY098WE
id-KNGT371WE?witkl
id-ZXV555NQE?phnu
eh-VCBG075LK
"""
for item in re.findall(r'id-(.*)\??',content):
    print(item)
Output it yields:
HTRY098WE
KNGT371WE?witkl
ZXV555NQE?phnu
Expected output:
HTRY098WE
KNGT371WE
ZXV555NQE
How can I scrape ID's out of some strings?

You could use a capturing group with a negated character class that matches any character except a question mark or whitespace.
The pattern that you tried first matches until the end of the line using .* (the dot does not match a newline by default). Then, at the end of the line, it tries to match an optional question mark \??. This succeeds because it is optional, so the whole rest of the line is captured for the first 3 examples.
id-([^?\s]+)
For example
import re
content = """
id-HTRY098WE
id-KNGT371WE?witkl
id-ZXV555NQE?phnu
eh-VCBG075LK
"""
for item in re.findall(r'id-([^?\s]+)',content):
    print(item)
Result
HTRY098WE
KNGT371WE
ZXV555NQE
Or match only alphanumerics:
id-([A-Z0-9]+)
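To make the greedy-match explanation concrete, here is a small comparison sketch. The lazy variant is not from the answer; it is an assumed alternative that keeps the asker's optional-? idea by giving the lazy group an explicit stopping point (a ? or the end of the line).

```python
import re

content = """
id-HTRY098WE
id-KNGT371WE?witkl
id-ZXV555NQE?phnu
eh-VCBG075LK
"""

# Greedy: .* consumes the rest of the line, so the optional \? matches nothing.
greedy = re.findall(r'id-(.*)\??', content)

# Lazy alternative (assumption, not from the answer): stop at the first ?
# or at the end of the line; re.MULTILINE lets $ match before each newline.
lazy = re.findall(r'id-(.*?)(?:\?|$)', content, flags=re.MULTILINE)

print(greedy)  # ['HTRY098WE', 'KNGT371WE?witkl', 'ZXV555NQE?phnu']
print(lazy)    # ['HTRY098WE', 'KNGT371WE', 'ZXV555NQE']
```

The negated character class is usually the simpler choice, since it needs neither a flag nor an alternation.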

search substring + integer from a string in python using regular expression

I have a string
str="TMOUT=1800; export TMOUT"
I want to extract only TMOUT=1800 from above string, but 1800 is not constant it can be any integer value. For example TMOUT=18 or TMOUT=201 etc. I'm very new to regular expression.
I tried using code below
re.search("TMOUT=\d",str).
It is not working. Please help

\d matches a single digit. You want to match one or more digits, so you have to add a + quantifier:
re.search(r"TMOUT=\d+", text)
If you then want to extract the number, you have to create a group using parentheses ():
match = re.search(r"TMOUT=(\d+)", text)
number = int(match.group(1))
Or you may want to use the named group syntax (?P<name>):
match = re.search(r"TMOUT=(?P<num>\d+)", text)
number = int(match.group("num"))
I suggest you use regex101 to test your regexes and get an explanation of what they do. Also read Python's re docs to learn about the methods of the various objects and functions available.
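Putting the pieces together, a minimal sketch might look like this. Because re.search returns None when nothing matches, the example checks for that before calling group; the names setting and number are illustrative.

```python
import re

text = "TMOUT=1800; export TMOUT"

match = re.search(r"TMOUT=(\d+)", text)
if match is not None:
    setting = match.group(0)      # the whole match, e.g. 'TMOUT=1800'
    number = int(match.group(1))  # just the digits, as an int
else:
    setting, number = None, None

print(setting)  # TMOUT=1800
print(number)   # 1800
```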

Python regular expression to pull text inside of HTML quotation marks

I'm attempting to pull ticker symbols from corporations' 10-K filings on EDGAR. The ticker symbol typically appears between a pair of HTML quotation-mark character references, e.g., &#147; or &#8220;. An example of a typical portion of relevant text:
Our common stock has been listed on the New York Stock Exchange (“NYSE”) under the symbol “RXN”
At this point I am just trying to figure out how to deal with the occurrence of one or more of a variety of quotation marks. I can write a regex that matches one particular type of quotation mark:
re.findall(r'under[^<]*the[^<]*symbol[^<]*“*[^<]*\n',fileText)
However, I can't write a regex that looks for more than one type of quotation mark. This regex produces nothing:
re.findall(r'under[^<]*the[^<]*symbol[^<]*“****[^<]*\n',fileText)
Any help would be appreciated.

Your regex looks for all of the quotes occurring together. If you're looking for any one of the possibilities, you need to put parentheses around each string and join them with the alternation operator |:
(?:“)*|(?:)*|(?:)*|(?:)*
The ?: makes the paren groups non-capturing. I.e., the parser won't save each one as important text. As an aside, you'll probably want to use group-capturing to save the ticker symbol -- what you're actually looking for. Very quick-and-dirty (and ugly) expression that will return ['NYSE', 'RXN'] from the given string:
re.findall(r'(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))(.+?)(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))', fileText)
You'd probably want to only include left-quotes in the first group and right-quotes in the last group. Plus either-or quotes in both.

You can use
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))), text)
This works because re.sub accepts a callable as the replacement. The number after "#" is the Unicode code point of the character, and Python's chr function converts it to text.
For example:
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))),
"this is a &#8220;test&#8220;")
results in
'this is a “test“'
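One caveat with the chr approach: references in the 128 to 159 range, such as &#147;, are Windows-1252 leftovers, and chr(147) gives a control character rather than a curly quote. A possible alternative, sketched here, is the standard library's html.unescape, which remaps those references the way browsers do. The sample text is an assumed reconstruction of the raw filing text, and the variable names are hypothetical.

```python
import html
import re

# Illustrative input: the question's sentence with the quotes written as
# numeric character references (an assumption about the raw filing text).
file_text = ("Our common stock has been listed on the New York Stock Exchange "
             "(&#147;NYSE&#148;) under the symbol &#8220;RXN&#8221;")

# html.unescape maps both &#147; and &#8220; to the left curly quote U+201C,
# and both &#148; and &#8221; to the right curly quote U+201D.
decoded = html.unescape(file_text)

# A left curly or straight quote opens the match; a right curly or
# straight quote closes it.
quoted = re.findall(r'[\u201c"]([^\u201d"]+)[\u201d"]', decoded)

# Narrower search that only accepts the quoted text after "under the symbol".
ticker = re.search(r'under\s+the\s+symbol\s+[\u201c"]([^\u201d"]+)[\u201d"]', decoded)

print(quoted)           # ['NYSE', 'RXN']
print(ticker.group(1))  # RXN
```

Decoding first keeps the pattern itself small, since it only has to know about two or three actual quote characters.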

Referencing a RegEx Variable

I'm using python to loop through a large list of self reported locations to try to match them to their home states. The RegEx expression I'm using is:
/^"[^\s]+,\s*([a-zA-Z]{2})"$/
Basically, I'm trying to find a pattern that looks like XXXCITYXXX, [Statecode], where statecode is only two letters.
My issue is that I don't know how to reference the varying state code once I find a matching string. I know in Perl that I could use:
$state = uc($1)
However, I don't know the equivalent Python syntax. Anyone know?

You can do it with re.search, which returns a match object (or None if the regex does not match at all) whose groups() method returns the captured groups:
import re
match = re.search(r'^[^\s]+,\s*([a-zA-Z]{2})$', my_string)
if match:
    print(match.groups()[0])
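For the Perl idiom $state = uc($1), a rough Python equivalent is match.group(1).upper(). The sketch below wraps that in a helper; the function name state_code and the sample locations are made up for illustration, and the pattern is the one from the answer (without the surrounding quotes from the question).

```python
import re

# Same pattern as in the answer, compiled once for reuse in a loop.
STATE_RE = re.compile(r'^[^\s]+,\s*([a-zA-Z]{2})$')

def state_code(location):
    """Return the upper-cased two-letter code, or None if there is no match."""
    match = STATE_RE.search(location)
    # group(1) plays the role of Perl's $1; upper() plays the role of uc().
    return match.group(1).upper() if match else None

# Hypothetical sample data.
locations = ["Austin, tx", "Boston,MA", "somewhere else"]
codes = [state_code(loc) for loc in locations]

print(codes)  # ['TX', 'MA', None]
```

Note that [^\s]+ rejects city names containing spaces, so an input like "New York, NY" would need a looser pattern.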

Python split by regular expression

In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile("^[a-zA-Z0-9_\.-]+#[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$");
for bit in split:
    result = pattern.match(bit)
    if(result != None):
        emails.append(bit)
And this works, as long as there is a space in between the emails. But this might not always be the case. For example:
Hello, foo#foo.com
would return:
foo#foo.com
but, take the following string:
I know my best friend mailto:foo#foo.com!
This would return null. So the question is: how can I make it so that a regex is the delimiter to split? I would want to get
foo#foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.

I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo#foo.com!')
['foo#foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo#foo.com, text text, baz#baz.com!')
['foo#foo.com', 'baz#baz.com']

Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove/replace the anchors ^ and $ (for example with \b), eg:
r"\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"

The problem I see in your regex is your use of ^ which matches the start of a string and $ which matches the end of your string. If you remove it and then run it with your sample test case it will work
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo#foo.com!")
['foo#foo.com']
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo#foo.com")
['foo#foo.com']
