How to find JSON objects without actually parsing the JSON (regex) - python

I'm working on filtering a website's data and looking for keywords. The website uses a long JSON body and I only need to parse everything before a base64-encoded image. I cannot parse the JSON object regularly as the structure changes often and sometimes it's cut off.
Here is a snippet of code I'm parsing:
<script id="__APP_DATA" type="application/json">{"routeProps":{"b723":{"navDataResource":[{"catalogId":48,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/bbjy2x.png","catalogName":"New Crypto Listings","total":762,"articles":[{"id":54572,"code":"0ef69e1d334c4d8c9ffbd088843bf2dd","title":"Binance Will List GYEN"},{"id":54548,"code":"e5607624f4614c3f9fd2562c8beb8660","title":"BTG, DEXE \u0026 SHIB Enabled on Binance Isolated Margin"},{"id":54394,"code":"a176d4cfd4c74a7fb8238e63d71c062a","title":"Binance Futures Will Launch USDT-Margined ICP Perpetual Contracts with Up to 25X Leverage"},{"id":54392,"code":"4fa91d953fd0484ab9a48cca0a41c192","title":"Binance Will Open Trading for Internet Computer (ICP)"},{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}],"catalogs":[]},{"catalogId":49,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/zxgg2x.png","catalogName":"Latest News","total":1164,"articles":[{"id":54649,"code":"2291f02b964f45b195fd6d4685db80bb","title":"Update on Trading Suspension for GYEN"},{"id":54646,"code":"724346d139b041198a441dc149133c7d","title":"Binance Liquid Swap Adds RAMP/BUSD Liquidity Pool"},{"id":54643,"code":"bc9f313c04cc40d2b7e598c831fd721f","title":"Notice on Trading Suspension for GYEN"},{"id":54591,"code":"b3c6998066af43078c63a5498bfd80b1","title":"Binance P2P Supports New Payment Methods for Mongolia"},{"id":54586,"code":"d4418be0b9ea4d1b8e92cbbfe8468a17","title":"Dual Investment (42nd Phase) - Earn Up to 56% APY"}]
As you can see, I'm trying to weed out everything except for these:
{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}
As the JSON is really long and it wouldn't be smart to parse the entire thing, is there a way to find strings like these without actually parsing the JSON object? Ideally, I'd like for everything to be in an array. Will regular expressions work?
The ID is 5 digits long, the code is 32 characters long, and there is a title.
Thanks a lot in advance

The following uses str.find() to step through the string and, if it finds both the start AND end of your target string, extracts it as a dictionary. If it only finds the start but not the end, it assumes it's a broken or interrupted string and breaks out of the loop, as there's nothing further to do.
I'm using the ast module to convert the string to dictionary. This isn't strictly needed to answer the question but I think it makes the end result more usable.
import ast

testdata = '{"routeProps":{"b723":{"navDataResource":[{"catalogId":48,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/bbjy2x.png","catalogName":"New Crypto Listings","total":762,"articles":[{"id":54572,"code":"0ef69e1d334c4d8c9ffbd088843bf2dd","title":"Binance Will List GYEN"},{"id":54548,"code":"e5607624f4614c3f9fd2562c8beb8660","title":"BTG, DEXE \u0026 SHIB Enabled on Binance Isolated Margin"},{"id":54394,"code":"a176d4cfd4c74a7fb8238e63d71c062a","title":"Binance Futures Will Launch USDT-Margined ICP Perpetual Contracts with Up to 25X Leverage"},{"id":54392,"code":"4fa91d953fd0484ab9a48cca0a41c192","title":"Binance Will Open Trading for Internet Computer (ICP)"},{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}],"catalogs":[]},{"catalogId":49,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/zxgg2x.png","catalogName":"Latest News","total":1164,"articles":[{"id":54649,"code":"2291f02b964f45b195fd6d4685db80bb","title":"Update on Trading Suspension for GYEN"},{"id":54646,"code":"724346d139b041198a441dc149133c7d","title":"Binance Liquid Swap Adds RAMP/BUSD Liquidity Pool"},{"id":54643,"code":"bc9f313c04cc40d2b7e598c831fd721f","title":"Notice on Trading Suspension for GYEN"},{"id":54591,"code":"b3c6998066af43078c63a5498bfd80b1","title":"Binance P2P Supports New Payment Methods for Mongolia"},{"id":54586,"code":"d4418be0b9ea4d1b8e92cbbfe8468a17","title":"Dual Investment (42nd Phase) - Earn Up to 56% APY"}]'

# Create a list to hold the dictionary objects
itemlist = []
# Keep track of our position in the string
strMarker = 0
while True:
    # Find the next occurrence of the beginning of a target string
    strStart = testdata.find('{"id":', strMarker)
    if strStart != -1:
        # If we've found the start, look for the end marker of the string,
        # starting from the location we identified as the beginning
        strEnd = testdata.find('}', strStart)
        # If it does not exist, this suggests it's an interrupted string,
        # so we do nothing further with it and allow the loop to break
        if strEnd != -1:
            # Save this marker as the starting point for the next search cycle
            strMarker = strEnd
            # Extract the substring based on the start and end positions, +1 to
            # capture the final '}'; as this string is already formatted like a
            # dictionary literal, ast.literal_eval() turns it into an actual
            # usable dictionary object
            itemlist.append(ast.literal_eval(testdata[strStart:strEnd + 1]))
            # We're happy to keep searching, so jump to the next loop
            continue
    # Nothing triggered a jump to the next loop, so break out of the while loop
    break

# Print out the first entry in the list as a demo
print(itemlist[0])
print(itemlist[0]["title"])
Output from this code should be a usable dict:
{'id': 54572, 'code': '0ef69e1d334c4d8c9ffbd088843bf2dd', 'title': 'Binance Will List GYEN'}
Binance Will List GYEN
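As an aside (not part of the original answer): the standard library's json.JSONDecoder.raw_decode can parse a single object starting at a given offset and report where it ended, which handles nested braces and also copes with data that is cut off mid-object. A minimal sketch under those assumptions, using a shortened, deliberately truncated version of the question's data:

```python
import json

# shortened version of the question's data; the last object is cut off
testdata = ('{"routeProps":{"articles":[{"id":54572,'
            '"code":"0ef69e1d334c4d8c9ffbd088843bf2dd",'
            '"title":"Binance Will List GYEN"},{"id":54382,'
            '"code":"33b6e8116ce54705ac89e898d1a05510",'
            '"title":"Binance Will List Internet Computer (ICP)"},'
            '{"id":54392,"code":"4fa91d953fd')

decoder = json.JSONDecoder()
items = []
pos = testdata.find('{"id":')
while pos != -1:
    try:
        # raw_decode parses one JSON value starting at pos and returns
        # (value, index just past the value's closing brace)
        obj, end = decoder.raw_decode(testdata, pos)
    except json.JSONDecodeError:
        break  # truncated or broken object at the end of the data
    items.append(obj)
    pos = testdata.find('{"id":', end)

print(items[0]["title"])  # Binance Will List GYEN
```

Unlike a find('}') scan, this would keep working even if a title itself contained a brace.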

Regular expressions should work here. Try matching with the following pattern; it matches the desired sections when I try it in https://regexr.com/. regexr also helps you understand the regular expression, in case you are new to them.
(\{"id":\d{5},"code":".{32}","title":"[^"]*"\})
Here is a small sample python script to find all of the sections.
import re

pattern = r'(\{"id":\d{5},"code":".{32}","title":"[^"]*"\})'
string_to_parse = '...'  # the JSON body from the page
sections = re.findall(pattern, string_to_parse, re.DOTALL)
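As a quick sanity check of that pattern (the sample string here is shortened from the question's data; each match is itself valid JSON, so json.loads can turn it into a dict):

```python
import re
import json

pattern = r'\{"id":\d{5},"code":".{32}","title":"[^"]*"\}'
string_to_parse = ('{"articles":[{"id":54572,'
                   '"code":"0ef69e1d334c4d8c9ffbd088843bf2dd",'
                   '"title":"Binance Will List GYEN"},{"id":54382,'
                   '"code":"33b6e8116ce54705ac89e898d1a05510",'
                   '"title":"Binance Will List Internet Computer (ICP)"}]')

# findall returns the matched substrings; each one parses as JSON
sections = [json.loads(m) for m in re.findall(pattern, string_to_parse)]
print(sections[0]["title"])  # Binance Will List GYEN
```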

Related

How can I split the useful information from this string and output it separately?

I am using the steam api with python and returning the recently played games of a user. It returns this (though it grows with however many games the user has launched in the past two weeks, so I will also need to limit how much ends up in the string): [<SteamApp "Blender" (365670)>, <SteamApp "fpsVR" (908520)>, <SteamApp "OVR Toolkit" (1068820)>, <SteamApp "SteamVR" (250820)>]
My code is the following:
async def recently_played(self, ctx, userurl):
    steam_user = steamapi.user.SteamUser(userurl=userurl)
    await ctx.send(steam_user.recently_played)
I tried to remove the quotation marks, square brackets, angle brackets and commas, but then I thought: what if a game has these in its name? I also could not figure out how to get in between each set of parentheses to remove the SteamApp ID.
What you have there isn't a string or list of strings, it's a list of SteamApp objects. Simply access each SteamApp object's name property to construct a list of names, or to simply print them. There is no need to do all this string manipulation:
names = [app.name for app in user.recently_played] # a list of strings
Or:
for app in user.recently_played:
    print(app.name)
Take a look at the source code of SteamUser.recently_played and SteamApp.name.
If you are just trying to get an easy-to-read list of the games you could do something like this if the data you are receiving is a string:
# get this string from steam_user.recently_played
unformatted = '[<SteamApp "Blender" (365670)>, <SteamApp "fpsVR" (908520)>, <SteamApp "OVR Toolkit" (1068820)>, <SteamApp "SteamVR" (250820)>]'
gameList = unformatted.split('"')[1::2] # returns a list of the games. This is also possible with a regular expression
games = ", ".join(gameList)
print(games)
# returns Blender, fpsVR, OVR Toolkit, SteamVR
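To sketch the regular-expression route mentioned in that comment (like the split() approach, this assumes game names never contain a double quote):

```python
import re

unformatted = '[<SteamApp "Blender" (365670)>, <SteamApp "fpsVR" (908520)>, <SteamApp "OVR Toolkit" (1068820)>, <SteamApp "SteamVR" (250820)>]'

# capture the quoted name from each <SteamApp "..." (id)> entry
names = re.findall(r'<SteamApp "([^"]*)" \(\d+\)>', unformatted)
print(", ".join(names))  # Blender, fpsVR, OVR Toolkit, SteamVR
```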

python3.6 How do I regex a url from a .txt?

I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there anyway I could grab everything after the = char up until the . in '.net'?
Could I use the re module?
The code below catches all URLs in the text and returns them in a list.
import re

text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com"""

urls = re.findall(r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
I don't have much information, but I will try to help with what I've got. I'm assuming that URL= is part of the string; in that case you can do this:
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)
Let me go into more detail about (.*?): the dot means any character (except the newline character), the star means zero or more occurrences, and the ? after the star makes the repetition non-greedy, so it matches as little as possible. The parentheses place it all into a group. All of this together means it will capture everything in between URL= and the first dot that follows, which is why that final dot is escaped as \. (an unescaped . would match any character).
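A quick check of that idea (with the final dot escaped as \., so the lazy group stops at the first literal dot):

```python
import re

line = 'URL=http://example.net'
match = re.findall(r'URL=(.*?)\.', line)
print(match)  # ['http://example']
```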
You don't need regexes (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course, this is just going to extract one URL. Assuming that you're working with a larger text from which you'd want to extract all URLs, you'll want to put this logic into a function so you can reuse it and build around it (achieving iteration via while or for loops and, depending on how you iterate, keeping track of the position of the last extracted URL, and so on).
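As one possible sketch of that idea (the 'URL=' prefix and the terminating dot are assumptions carried over from the question):

```python
def extract_urls(text, prefix='URL=', terminator='.'):
    """Yield every substring found between `prefix` and the next `terminator`."""
    position = 0
    while True:
        start = text.find(prefix, position)
        if start == -1:
            break  # no more occurrences of the prefix
        start += len(prefix)
        end = text.find(terminator, start)
        if end == -1:
            break  # prefix without a terminator - nothing more to extract
        yield text[start:end]
        position = end  # continue searching after this match

print(list(extract_urls('URL=http://example.net and URL=http://test.org')))
# ['http://example', 'http://test']
```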
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that you'd be amazed. And these are not all, I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill, that you can't do without in the world of programming.
Remember, that whatever problem it is that you are encountering, there is a very high chance that somebody on this forum had already encountered it, and received an answer, you just need to find it.
Please try this. It worked for me.
import re

s = 'url=http://example.net'
print(re.findall(r"=(.*)\.", s)[0])  # http://example

Scraping data from a http & javaScript site

I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, lets take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found opening the source code and searching for 'variationValues'.
There we can see sort of a dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimentionIndexMap we can see
"B01KWIUH5M":[0,0]
Which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in variationValues size_name section) and the color 'Teal' (same idea as before)
I want to scrape both the variationValues and the asinToDimentionIndexMap, so I can associate the IndexMap numbers with the variationValues ones.
Another person in the site (thanks for the help btw) suggested doing it this way.
script = response.xpath('//script/text()').extract_first()
import re
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)
import json
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of json is not that great and reading some stuff about it didn't help that much.
Is there a way to get, from that data, two dictionaries or lists with the variationValues and the asinToDimentionIndexMap (maybe using some regular expressions in the middle to pull the data out of a big string)? Or could you explain a little what happens in the json part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to serialize object data as text so it can be read back into objects, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming where you may be finding things a challenge is when errors come up while accessing a particular "box" inside your JSON object.
Your code format looks correct, but the access path within "each box" may look different.
Eg. If your 'asinToDimentionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimentionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular json file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another" - which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert each of them with json.loads and combine them as you wish.
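To illustrate how the extracted pieces combine (the script string below is invented for illustration; only the key names are taken from the question):

```python
import re
import json

# a made-up stand-in for the page source; the key names come from the
# question, but the values here are invented
script = ('... "variationValues" : {"size_name":["8M US","9M US"],'
          '"color_name":["Teal","Black"]} , '
          '"asinToDimensionIndexMap" : {"B01KWIUH5M":[0,0],'
          '"B01KWIUHAM":[1,1]} ...')

variation_values = json.loads(
    re.findall(r'"variationValues" : ({.*?})', script)[0])
index_map = json.loads(
    re.findall(r'"asinToDimensionIndexMap" : ({.*?})', script)[0])

# each ASIN's pair of indexes selects a size and a color
for asin, (size_i, color_i) in index_map.items():
    print(asin, variation_values["size_name"][size_i],
          variation_values["color_name"][color_i])
```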

How Do I Start Pulling Apart This Block of JSON Data?

I'd like to make a program that makes offline copies of math questions from Khan Academy. I have a huge 21.6MB text file that contains data on all of their exercises, but I have no idea how to start analyzing it, much less start pulling the questions from it.
Here is a pastebin containing a sample of the JSON data. If you want to see all of it, you can find it here. Warning for long load time.
I've never used JSON before, but I wrote up a quick Python script to try to load up individual "sub-blocks" (or equivalent, correct term) of data.
import sys
import json

exercises = open("exercises.txt", "r")
byte = 0
frontbracket = 0
backbracket = 0
while byte < 1000:  # while byte < character we want to read up to
                    # keep at 1000 for testing purposes
    char = exercises.read(1)
    sys.stdout.write(char)
    # Here we decide what to do based on what char we have
    if char == "{":
        frontbracket = byte
        while True:
            char = exercises.read(1)
            if char == "}":
                backbracket = byte
                break
        exercises.seek(frontbracket)
        block = exercises.read(backbracket - frontbracket)
        print("Block is " + str(backbracket - frontbracket) + " bytes long")
        jsonblock = json.loads(block)
        sys.stdout.write(block)
        print(jsonblock["translated_display_name"])
        print("\nENDBLOCK\n")
    byte = byte + 1
Ok, the repeated pattern appears to be this: http://pastebin.com/4nSnLEFZ
To get an idea of the structure of the response, you can use JSONlint to copy/paste portions of your string and 'validate'. Even if the portion you copied is not valid, it will still format it into something you can actually read.
First, I have used the requests library to pull the JSON for you. It's a super-simple library when you're dealing with things like this. The API is slow to respond because it seems you're pulling everything, but it should work fine.
Once you get a response from the API, you can convert that directly to python objects using .json(). What you have is essentially a mixture of nested lists and dictionaries that you can iterate through and pull specific details. In my example below, my_list2 has to use a try/except structure because it would seem that some of the entries do not have two items in the list under translated_problem_types. In that case, it will just put 'None' instead. You might have to use trial and error for such things.
Finally, since you haven't used JSON before, it's also worth noting that it can behave like a dictionary itself; you are not guaranteed the order in which you receive details. However, in this case, it seems the outermost structure is a list, so in theory it's possible that there is a consistent order but don't rely on it - we don't know how the list is constructed.
import requests

api_call = requests.get('https://www.khanacademy.org/api/v1/exercises')
json_response = api_call.json()

# Assume we first want to list "author name" with "author key".
# This should loop through the repeated pattern in the pastebin,
# accessing items as a dictionary
my_list1 = []
for item in json_response:
    my_list1.append([item['author_name'], item['author_key']])
print(my_list1[0:5])

# Now let's assume we want the 'sha' of the SECOND entry in translated_problem_types
# to also be listed with author name
my_list2 = []
for item in json_response:
    try:
        the_second_entry = item['translated_problem_types'][0]['items'][1]['sha']
    except IndexError:
        the_second_entry = 'None'
    my_list2.append([item['author_name'], item['author_key'], the_second_entry])
print(my_list2[0:5])

Parsing a string for nested patterns

What would be the best way to do this?
The input string is
<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust.</135_3></133_3>
the expected output is
{'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit \
using either camera now they are just sitting and collecting dust.':[133, 135],
'The other system worked for about 1 month': [116],
'on it then it started doing the same thing as the first one':[137]
}
That seems like a recursive regexp search, but I can't figure out exactly how.
I can think of a tedious recursive function as of now, but have a feeling that there should be a better way.
Related question: Can regular expressions be used to match nested patterns?
Use expat or another XML parser; it's more explicit than anything else, considering you're dealing with XML data anyway.
However, note that XML element names can't start with a number as your example has them.
Here's a parser that will do what you need, although you'll need to tweak it to combine duplicate elements into one dict key:
from xml.parsers.expat import ParserCreate

open_elements = {}
result_dict = {}

def start_element(name, attrs):
    open_elements[name] = True

def end_element(name):
    del open_elements[name]

def char_data(data):
    for element in open_elements:
        cur = result_dict.setdefault(element, '')
        result_dict[element] = cur + data

if __name__ == '__main__':
    p = ParserCreate()
    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data
    p.Parse('<_133_3><_135_3><_116_2>The other system worked for about 1 month</_116_2> got some good images <_137_3>on it then it started doing the same thing as the first one</_137_3> so then I quit using either camera now they are just sitting and collecting dust.</_135_3></_133_3>', 1)
    print(result_dict)
Take an XML parser, make it generate a DOM (Document Object Model) and then build a recursive algorithm that traverses all the nodes, calls "text()" in each node (that should give you the text in the current node and all children) and puts that as a key in the dictionary.
from io import BytesIO
from collections import defaultdict
####from xml.etree import cElementTree as etree
from lxml import etree

xml = b"<e133_3><e135_3><e116_2>The other system worked for about 1 month</e116_2> got some good images <e137_3>on it then it started doing the same thing as the first one</e137_3> so then I quit using either camera now they are just sitting and collecting dust. </e135_3></e133_3>"

d = defaultdict(list)
for event, elem in etree.iterparse(BytesIO(xml)):
    d[''.join(elem.itertext())].append(int(elem.tag[1:-2]))

print(dict(d))
Output:
{'on it then it started doing the same thing as the first one': [137],
'The other system worked for about 1 month': [116],
'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
either camera now they are just sitting and collecting dust. ': [133, 135]}
I think a grammar would be the best option here. I found a link with some information:
http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html
Note that you can't actually solve this by a regular expression, since they don't have the expressive power to enforce proper nesting.
Take the following mini-language:
A certain number of "(" followed by the same number of ")", no matter what the number.
You could very easily make a regular expression to represent a super-language of this mini-language (one that doesn't enforce that the number of opening and closing parentheses is equal). You could also easily make a regular expression for any finite sub-language (where you limit yourself to some maximum nesting depth). But you can never represent this exact language with a regular expression.
So you'd have to use a grammar, yes.
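To make the limitation concrete, here is a recognizer for that mini-language written with an explicit counter, something a plain regular expression has no way to keep:

```python
def is_balanced(s):
    """Recognise the language '(' * n + ')' * n for some n >= 1."""
    depth = 0
    seen_close = False
    for ch in s:
        if ch == '(':
            if seen_close:
                return False  # an open after a close breaks the shape
            depth += 1
        elif ch == ')':
            seen_close = True
            depth -= 1
            if depth < 0:
                return False  # more closes than opens so far
        else:
            return False  # only parentheses are allowed
    return depth == 0 and seen_close

print(is_balanced('((()))'))  # True
print(is_balanced('(()'))     # False
```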
Here's an unreliable, inefficient recursive regexp solution:
import re

re_tag = re.compile(r'<(?P<tag>[^>]+)>(?P<content>.*?)</(?P=tag)>', re.S)

def iterparse(text, tag=None):
    if tag is not None:
        yield tag, text
    for m in re_tag.finditer(text):
        for tag, text in iterparse(m.group('content'), m.group('tag')):
            yield tag, text

def strip_tags(content):
    nested = lambda m: re_tag.sub(nested, m.group('content'))
    return re_tag.sub(nested, content)

txt = "<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust. </135_3></133_3>"

d = {}
for tag, text in iterparse(txt):
    d.setdefault(strip_tags(text), []).append(int(tag[:-2]))
print(d)
Output:
{'on it then it started doing the same thing as the first one': [137],
'The other system worked for about 1 month': [116],
'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
either camera now they are just sitting and collecting dust. ': [133, 135]}
