How to scrape a string representation of a nested list? - python

I am trying to record the DataCamp courses I have completed by using a web scraper. First, kudos to this guy, who built something close to what I need.
However, DataCamp has recently changed their website, and the comprehensive course data is no longer in JSON; it now seems to be stored as a string representation of a nested list.
If you take a look at the source of one of the chapter pages, the first element in the body is:
<body><script>window.PRELOADED_STATE = "["~#iM",["preFetchedData",["^0",["course",["^0",["status","SUCCESS","data",["^ ","id",58,"title","Introduction to R ...
So the original scraper was able to rely on JSON and extract the information via the dict keys. There is an id field, so I should be able to extract the data once I have the underlying list of lists.
I tried extracting the string representation via ast.literal_eval, but that did not work. Any idea how I could make this list usable?
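For concreteness, here is a sketch of the extraction I am attempting. The ~# markers look like transit-style encoding, which serializes to plain JSON, so json.loads may be more appropriate than ast.literal_eval. The course URL is just an example, and the escaping of the embedded string may differ from the snippet above:
import json
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical course URL -- substitute the chapter page being scraped.
html = requests.get("https://www.datacamp.com/courses/free-introduction-to-r").text
soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", string=re.compile("PRELOADED_STATE"))

# Grab the quoted string assigned to window.PRELOADED_STATE.
match = re.search(r'window\.PRELOADED_STATE\s*=\s*(".*")', script.string, re.DOTALL)
state = json.loads(match.group(1))  # decode the outer JSON string literal
data = json.loads(state)            # parse the payload into nested lists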

Related

Parse JSON within Python

I just wanted to ask what I can do to solve an issue I have.
Essentially I am making a stock checker for sneakers from Adidas. I know the endpoint to obtain the stock, but while the JSON data given back to me is readable and contains what I need, it also contains a bunch of other information that is unnecessary for what I am trying to do.
Example of a link to an endpoint:
http://production.store.adidasgroup.demandware.net/s/adidas-GB/dw/shop/v16_9/products/(BZ0221)?client_id=c1f3632f-6d3a-43f4-9987-9de920731dcb&expand=availability,variations,prices
This is a link to the JSON containing the stock of the shoe, its price, and its availability. However, if you try to open it you'll see that it responds with a bunch of info that I do not need, such as the description of the shoe.
A github repository that I was using to try and get to grips with the requests I am trying to make is:
https://github.com/yzyio/adidas-stock-checker/blob/master/assets/index.js
I can get it to give me the JSON response; I am just trying to strip out what I don't need and keep what I do, which I am finding very difficult, especially in Python.
Many Thanks!
Since you've said you can get a JSON response from the server, the first thing you need to do is tell Python to load it as JSON.
import json
data = json.loads(response_from_server)
After doing this you can now access the values in your JSON object the way you would access them via a Python dict.
data["artist"]["id"]

Excel data to Python via web service, data structure for variable data

I have an Excel spreadsheet which basically acts as a UI. It lets the user enter some parameters, which are then passed, along with a whole tab full of data, to some Python code on a server via a web service.
I am by far no VBA expert, but I managed to get my data and individual variables submitted. My question is: what is the best-suited VBA data structure to use? Ideally I would like something like a dictionary, where the keys would be my defined Names for the Excel cells, plus the data, which in some cases will be a single value and in others a Variant array.
I have to be able to distinguish between keys and their corresponding values in python eventually.
So far I have been playing around with Collections:
Dim Main_tab_vars As Collection
Set Main_tab_vars = New Collection
Main_tab_vars.Add Range("Start_Date").Value, "Start_Date_var"
Main_tab_vars.Add Range("Definitions").Value, "Definitions_var"
If I look at the collection in my Watches window I can see the values correctly stored in Item1 and Item2, but it looks like my key information gets lost.
I would recommend either JSON or XML when sending data to a web service; these are the industry standards. If choosing JSON, you'd use nested dictionaries and then build a string (there's plenty of code on the internet) when ready. If using XML, you could build up the XML document as you go.
I do not know how well Python handles JSON so probably I'd opt for XML.
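For what it's worth, Python handles JSON natively through its standard-library json module. A minimal sketch of the receiving side, assuming the workbook posts a body whose keys mirror the defined Names above:
import json

def handle_request(body):
    # Assumed body shape: {"Start_Date": "2014-01-01", "Definitions": [...]}
    params = json.loads(body)
    start_date = params["Start_Date"]    # a single value
    definitions = params["Definitions"]  # a Variant array arrives as a list
    return start_date, definitions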

How to iterate over everything in a python-docx document?

I am using python-docx to convert a Word docx to a custom HTML equivalent. The document that I need to convert has images and tables, but I haven't been able to figure out how to access the images and the tables within a given run. Here is what I am thinking...
for para in doc.paragraphs:
    for run in para.runs:
        # How to tell if this run has images or tables?
...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?
Thanks!
There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.
python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you:
https://github.com/python-openxml/python-docx/issues/40
There is no exact counterpart I know of to deal with inline items, but I expect you could get pretty far with paragraph.runs. All inline content will be within a paragraph. If you got most of the way there and were just hung up on getting pictures or something, you could go down to the lxml level and decode some of the XML to get what you needed. If you get that far along and are still keen, post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" and I can probably provide you with some similar code to get you what you need.
This requirement comes up from time to time so we'll definitely want to add it at some point.
Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.
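python-docx doesn't provide this directly yet, but here is a sketch in the spirit of the example code linked above (issue #40): a recursive generator that yields paragraphs and tables in document order, descending into table cells. It relies on internal attributes (._tc, .element), so treat it as a workaround rather than supported API:
from docx import Document
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table, _Cell
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    # Yield each paragraph and table child of parent, in document order,
    # recursing into table cells (tables can nest).
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("unsupported parent type")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            table = Table(child, parent)
            yield table
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)

# usage: for block in iter_block_items(Document("example.docx")): ...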
Assuming doc is of type Document, what you want to do is run three separate iterations:
One for the paragraphs, as you have in your code
One for the tables, via doc.tables
One for the shapes, via doc.inline_shapes
The reason your code wasn't working is that paragraphs don't hold references to the tables or shapes within the document; those are stored on the Document object.
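A minimal sketch of the three iterations (the filename is hypothetical):
from docx import Document

doc = Document("mydoc.docx")

# 1. paragraphs and their runs
for para in doc.paragraphs:
    for run in para.runs:
        print(run.text)

# 2. tables, via doc.tables
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)

# 3. inline shapes (images), via doc.inline_shapes
for shape in doc.inline_shapes:
    print(shape.type, shape.width, shape.height)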
Here is the documentation for more info: python-docx

Using Beautifulsoup and regex to traverse javascript in page

I'm fetching webpages with a bunch of javascript on it, and I'm interested in parsing through the javascript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script',{'type' : 'text/javascript'})
which yields a list of scripts that I can loop over to search for the text I'd like:
for script in scriptResults:
    for block in script:
        if *patterniwant* in block:
            **extract pattern from line using regex**
(Text in asterisks is pseudocode, of course.)
I was wondering if there is a better way to use regex to find the pattern in the soup itself, searching only through the scripts. My implementation works, but it seems really clunky, so I wanted something more elegant and/or efficient and/or Pythonic.
Thanks in advance!
A lot of websites have client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it with Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
Another option is to check in your browser what sort of AJAX requests the application makes. Quite often these also return structured data as JSON.
I would also check whether the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of websites include these to improve a page's SEO and to integrate better with social network sharing.
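A hedged sketch of the first suggestion, searching only the script blocks for an embedded JSON literal. The variable name "pageData" is a placeholder; adapt the pattern to whatever the target page actually assigns:
import json
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# "pageData" is a hypothetical variable name -- adjust to the real page.
pattern = re.compile(r"var\s+pageData\s*=\s*(\{.*?\});", re.DOTALL)
for script in soup.find_all("script", {"type": "text/javascript"}):
    if script.string and (match := pattern.search(script.string)):
        data = json.loads(match.group(1))  # now a standard dict
        break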

Python libraries that can tokenize wikipedia pages

I'd like to tokenise Wikipedia pages of interest with a Python library or libraries. I'm most interested in tables and listings. I then want to be able to import this data into Postgres or Neo4j.
For example, here are three data sets that I'd be interested in:
How many points each country awarded one another in the Eurovision Song contest of 2008:
http://en.wikipedia.org/wiki/Eurovision_Song_Contest_2008#Final
List of currencies and the countries in which they circulate (a many-to-many relationship):
http://en.wikipedia.org/wiki/List_of_circulating_currencies
Lists of solar plants around the world: http://en.wikipedia.org/wiki/List_of_solar_thermal_power_stations
The source of each of these is written in Wikipedia's brand of markup, which is used to render the pages. There are many Wikipedia-specific tags and much syntax in the raw data form. Working from the HTML might almost be easier, since I could just use BeautifulSoup.
Does anyone know of a better way of tokenizing? I feel that I'd be reinventing the wheel if I took the final HTML and parsed it with BeautifulSoup. Also, if I could find a way to output these pages as XML, the table data might not be tokenized enough and would require further processing.
Since Wikipedia is built on MediaWiki, there is an API you can exploit. There is also Special:Export that you can use.
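For instance, a minimal sketch of fetching the raw wikitext of one of the pages above through the standard action=query endpoint:
import requests

# Fetch the raw wikitext of one page via the MediaWiki API.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": "List_of_circulating_currencies",
        "format": "json",
    },
)
page = next(iter(resp.json()["query"]["pages"].values()))
wikitext = page["revisions"][0]["*"]  # raw markup, ready for a wikitext parser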
Once you have the raw data, you can run it through mwlib to parse it.
This goes more in the semantic web direction, but DBpedia allows querying parts of Wikipedia's data (a community conversion effort) with SPARQL. That makes it theoretically straightforward to extract the needed data, though dealing with RDF triples might be cumbersome.
Furthermore, I don't know whether DBpedia yet contains any data that is of interest to you.
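A hedged sketch of querying the public DBpedia SPARQL endpoint from Python; dbo:currency is a guess at the relevant predicate, so verify it against the DBpedia ontology:
import requests

# dbo:currency is an assumed predicate -- check the DBpedia ontology.
query = """
SELECT ?country ?currency WHERE {
    ?country dbo:currency ?currency .
} LIMIT 20
"""
resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["country"]["value"], row["currency"]["value"])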
