Scrapy to crawl LD JSON data - python

I have done some research but can't seem to find any information on whether it is possible to crawl something like JSON schema data from a URL. An example I just found, as I was looking at the product anyway, would be:
https://www.reevoo.com/p/panasonic-nn-e271wmbpq
<script class="microdata-snippet" type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "Product",
  "name": "PANASONIC NN-E271WMBPQ",
  "image": "https://images.reevoo.com/products/3530/3530797/550x550.jpg?fingerprint=73ed91807dac7eb8f899757a348c735446d0a1fe&gravity=Center",
  "category": {
    "@type": "Thing",
    "name": "Microwave",
    "url": "https://www.reevoo.com/browse/product_type/microwaves"
  },
  "description": "Auto weight programs will automatically calculate the cooking time, once the weight has been entered. Acrylic lining makes cleaning easy, simply wipe after use. Child lock provides extra security to prevent little fingers interfering with the programming of the oven. \nAll our compact microwave ovens are packed with flexible features to make everyday cooking simple. Auto weight programs will automatically calculate the cooking time, once the weight has been entered. Acrylic lining makes cleaning easy, simply wipe after use. Child lock provides extra security to prevent little fingers interfering with the programming of the oven.",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "8.7",
    "ratingCount": 636,
    "worstRating": "1",
    "bestRating": "10"
  }
}
</script>
So would it be possible to extract, say, the rating data?
Thanks in advance.

You can. The JSON-LD block is just text inside a script tag, so extract it with XPath and parse it with the json module. First add the import:
import json
Then, in your spider callback:
microdata_content = response.xpath('//script[@type="application/ld+json"]/text()').extract_first()
microdata = json.loads(microdata_content)
ratingValue = microdata["aggregateRating"]["ratingValue"]
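Putting it together, a minimal self-contained spider might look like the sketch below; the spider name and the hard-coded start URL are illustrative assumptions, not requirements:
import json
import scrapy

class ProductSpider(scrapy.Spider):
    # Hypothetical spider name and start URL, for illustration only
    name = "reevoo_product"
    start_urls = ["https://www.reevoo.com/p/panasonic-nn-e271wmbpq"]

    def parse(self, response):
        # Grab the first JSON-LD block on the page and parse it as JSON
        raw = response.xpath('//script[@type="application/ld+json"]/text()').extract_first()
        if raw is None:
            return
        data = json.loads(raw)
        rating = data.get("aggregateRating", {})
        yield {
            "name": data.get("name"),
            "ratingValue": rating.get("ratingValue"),
            "ratingCount": rating.get("ratingCount"),
        }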

Related

Set context from custom XBRL file

I'm able to read a custom XBRL file. The problem is that the parsed object has the amounts of the initial period (last December) and not the latest accounting period.
from xbrl import XBRLParser, GAAP, GAAPSerializer
# xbrl comes from the python-xbrl package
xbrl_parser = XBRLParser()
with open('filename.xbrl') as file:
    xbrl = xbrl_parser.parse(file)
custom_obj = xbrl_parser.parseCustom(xbrl)
print(custom_obj.cashandcashequivalents)
This prints the cash as of 2021/12, not 2022/06 as expected.
Current output: 100545101000
Expected: 81518021000
I think those numbers are the ones you can see on lines 9970 and 9972 of the XBRL file.
These are the lines:
9970: <ifrs-full:CashAndCashEquivalents decimals="-3" contextRef="CierreTrimestreActual" unitRef="CLP">81518021000</ifrs-full:CashAndCashEquivalents>
9972: <ifrs-full:CashAndCashEquivalents decimals="-3" contextRef="SaldoActualInicio" unitRef="CLP">100545101000</ifrs-full:CashAndCashEquivalents>
How can I set the context/contextRef so the custom_obj has the numbers of the latest periods?
XBRL file: https://www.cmfchile.cl/institucional/inc/inf_financiera/ifrs/safec_ifrs_verarchivo.php?auth=&send=&rut=70016160&mm=06&aa=2022&archivo=70016160_202206_C.zip&desc_archivo=Estados%20financieros%20(XBRL)&tipo_archivo=XBRL
I've never used python-xbrl, but from a quick look at the source code it looks very basic and makes lots of unwarranted assumptions about the structure of the document. It doesn't appear to have any support for XBRL Dimensions, which the report you're using makes use of.
The module isn't built on a proper model of the XBRL data which would give you easy access to each fact's properties such as the period, and allow you to easily filter down to just the facts that you want.
I don't think the module will allow you to do what you want. Looking at this code it just iterates over all the facts, and sticks them onto properties on an object, so whichever fact it hits last in the document will be the one that you get, and given that order isn't important in XBRL files, it's just going to be pot luck which one you get.
I'd strongly recommend switching to a better XBRL library. Arelle is probably the most widely used, although you could also use my own pxp.
As an example, either tool can be used to convert the XBRL to JSON format, and will give you facts like this:
"f126928": {
"value": "81518021000",
"decimals": -3,
"dimensions": {
"concept": "ifrs-full:CashAndCashEquivalents",
"entity": "scheme:70016160-9",
"period": "2022-07-01T00:00:00",
"unit": "iso4217:CLP"
}
},
"f126930": {
"value": "100545101000",
"decimals": -3,
"dimensions": {
"concept": "ifrs-full:CashAndCashEquivalents",
"entity": "scheme:70016160-9",
"period": "2022-01-01T00:00:00",
"unit": "iso4217:CLP"
}
},
With this, you can then sort the facts by period, and then select the most recent one. Of course, you can do the same directly via the Python interfaces in these tools, rather than going via JSON.
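For example, here is a minimal sketch that picks the most recent CashAndCashEquivalents fact out of such an xBRL-JSON export; the filename is an assumption, and the "facts" layout is the one shown above:
import json

# Load an xBRL-JSON export, e.g. as produced by Arelle or pxp
with open("report.json") as f:
    report = json.load(f)

# Keep only the facts for the concept we care about
facts = [
    fact
    for fact in report["facts"].values()
    if fact["dimensions"]["concept"] == "ifrs-full:CashAndCashEquivalents"
]

# ISO 8601 periods sort correctly as strings, so the latest comes last
facts.sort(key=lambda fact: fact["dimensions"]["period"])
print(facts[-1]["value"])  # expected: 81518021000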

Google Docs API programmatically adding a table of contents

I have a python script which does some analysis and outputs the results as text (paragraphs) in a Google Doc. I know how to insert text and update paragraph and text style through batchUpdate.
doc_service.documents().batchUpdate(documentId=<ID>,body={'requests': <my_request>}).execute()
where, for instance, "my_request" takes the form of something like:
request = [
    {
        "insertText": {
            "location": {
                "index": <index_position>,
                "segmentId": <id>
            },
            "text": <text>
        }
    },
    {
        "updateParagraphStyle": {
            "paragraphStyle": {
                "namedStyleType": <paragraph_type>
            },
            "range": {
                "segmentId": <id>,
                "startIndex": <index_position>,
                "endIndex": <index_position>
            },
            "fields": "namedStyleType"
        }
    },
]
However, once the script is done updating the document, it would be fantastic if a table of contents could be added at the top.
I am very new to the Google Docs API, though, and I am not entirely sure how to do that. I know I should use "TableOfContents" as a StructuralElement. I also know this option currently does not update automatically after each modification made to the document (which is why I would like to create it AFTER the document has finished updating, and place it at the top of the document).
How can I do this with Python? I am unclear where to call "TableOfContents" in my request.
Thank you so very much!
After your comment, I was able to understand better what you are trying to do, but I came across these two Issue Tracker posts:
Add the ability to generate and update the TOC of a doc.
Getting a link to a heading paragraph.
These are well-known feature requests that unfortunately haven't been implemented yet. You can hit the ☆ next to the issue number in the top left on this page as it lets Google know more people are encountering this and so it is more likely to be seen faster.
Therefore, it's not possible to insert/update a table of contents programmatically.

Storing dynamic user generated fields in mongodb

I'm storing data in a collection for rules. These rules are user defined and can vary, and they will be used to check data from an API. For example, a user may define that if the field spend (taken from an API) is higher than x, do y (e.g. notify Slack).
Each "campaign" can have different rules, and users can set these based on a handful of parameters, with a logic comparison against a defined value. A campaign can have many rules, and rules can be used across many campaigns.
It's for an internal tool written in Python 3.5.
My first intuition was something like:
{
    "rulename": "test123",
    "campaigns": [
        "123456",
        "765434"
    ],
    "triggers": [
        {
            "impressions": "500",
            "comparison": ">"
        },
        {
            "cost": "1.5",
            "comparison": ">"
        }
    ],
    "action": "notify"
}
Here's a picture of user input to better illustrate: https://www.dropbox.com/s/ai745inl2quwdh8/Screenshot%202017-07-09%2022.50.20.png?dl=0
The rules will be used with API requests, and if the rules are triggered (i.e. the API says impressions are above 500 AND cost is higher than 1.5), then do something.
I hope it makes sense. Thanks in advance.
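For illustration, here is a minimal sketch of how a rule document shaped like the one above could be evaluated against API data; the operator mapping and the float coercion are assumptions, not part of the original design:
import operator

# Comparison strings as stored in the rule document; the supported
# set is an assumption based on the example above
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

def rule_triggers(rule, api_data):
    # Return True if every trigger in the rule matches the API data
    for trigger in rule["triggers"]:
        # Each trigger holds one metric field plus the comparison key
        field, threshold = next(
            (k, v) for k, v in trigger.items() if k != "comparison"
        )
        if field not in api_data:
            return False
        if not OPS[trigger["comparison"]](float(api_data[field]), float(threshold)):
            return False
    return True

# Example: impressions above 500 AND cost above 1.5 -> rule fires
api_data = {"impressions": 600, "cost": 2.0}
rule = {
    "triggers": [
        {"impressions": "500", "comparison": ">"},
        {"cost": "1.5", "comparison": ">"},
    ],
    "action": "notify",
}
print(rule_triggers(rule, api_data))  # True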

Request array of json documents (disable item reference) from MongoDB using python eve

Using the Python Eve framework, is there any way to get the response shown in the first JSON example, i.e. an array of objects? I have tried to disable HATEOAS as described here. Some view applications fetch directly on the model and collections based on it, such as a Backbone/NodeJS data handler.
[
    {
        "_id": "526c0e21977a67d6966dc763",
        "question": "1",
        "uk": "I heard a bloke on the train say that tomorrow's trains will be delayed.",
        "us": "I heard a guy on the train say that tomorrow's trains will be delayed."
    },
    {
        "_id": "526c0e21977a67d6966dc764",
        "question": "2",
        "uk": "Tom went outside for a fag. I think he smokes too much!",
        "us": "Tom went outside for a cigarette. I think he smokes too much!"
    }
]
Instead of the JSON object with the _items key that it currently returns:
{
    "_items": [
        {
            "_id": "526c0e21977a67d6966dc763",
            "question": "1",
            "uk": "I heard a bloke on the train",
            "us": "I heard a guy on the train"
        },
        {
            "_id": "526c0e21977a67d6966dc764",
            "question": "2",
            "uk": "Tom went outside for a fag. I think he smokes too much!",
            "us": "Tom went outside for a cigarette. I think he smokes too much!"
        }
    ]
}
This is currently not possible, as the response payload is built as a dictionary in which several keys might appear (pagination data, HATEOAS links, and the actual documents).
In theory we could add a new configuration option which would switch to a list-formatted (and simplified) layout. Should consider all the consequences though, so no promises, but consider opening a ticket.
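In the meantime, one possible workaround (not an official Eve feature) is to rewrite the payload in a post-GET event hook; note that this sketch also throws away pagination and HATEOAS metadata, which may or may not be acceptable:
import json
from eve import Eve

# Assumes the usual Eve settings.py with your domain is in place
app = Eve()

def unwrap_items(resource, request, payload):
    # payload is a Flask Response; rewrite its body so collections
    # come back as a bare list instead of {"_items": [...]}
    doc = json.loads(payload.get_data(as_text=True))
    if "_items" in doc:
        payload.set_data(json.dumps(doc["_items"]))

# Attach the hook to every GET; per-resource hooks
# (app.on_post_GET_<resource>) also exist
app.on_post_GET += unwrap_items

if __name__ == "__main__":
    app.run()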

From JSON to JSON-LD without changing the source

There are 'duplicates' to my question, but they don't answer it.
Consider the following JSON-LD example, as described in section 6.13 (Named Graphs) of http://www.w3.org/TR/json-ld/:
{
    "@context": {
        "generatedAt": {
            "@id": "http://www.w3.org/ns/prov#generatedAtTime",
            "@type": "http://www.w3.org/2001/XMLSchema#date"
        },
        "Person": "http://xmlns.com/foaf/0.1/Person",
        "name": "http://xmlns.com/foaf/0.1/name",
        "knows": "http://xmlns.com/foaf/0.1/knows"
    },
    "@id": "http://example.org/graphs/73",
    "generatedAt": "2012-04-09",
    "@graph": [
        {
            "@id": "http://manu.sporny.org/about#manu",
            "@type": "Person",
            "name": "Manu Sporny",
            "knows": "http://greggkellogg.net/foaf#me"
        },
        {
            "@id": "http://greggkellogg.net/foaf#me",
            "@type": "Person",
            "name": "Gregg Kellogg",
            "knows": "http://manu.sporny.org/about#manu"
        }
    ]
}
Question:
What if you start with only the JSON part without the semantic layer:
[
    {
        "name": "Manu Sporny",
        "knows": "http://greggkellogg.net/foaf#me"
    },
    {
        "name": "Gregg Kellogg",
        "knows": "http://manu.sporny.org/about#manu"
    }
]
and you link the @context from a separate file or location using an HTTP Link header or rdflib parsing, then you are still left without the @id and @type in the rest of the document. Injecting those missing key-value pairs into the JSON string is not a clean option. The idea is to go from JSON to JSON-LD without changing the original JSON part.
The way I see it, to define a triple subject, one has to use an @id to map to an IRI. It's very unlikely that plain JSON data has @id key-values. So does this mean JSON files cannot be parsed as JSON-LD without adding the keys first? I wonder how others do it.
Does someone have an idea to point me in the right direction?
Thank you.
No, unfortunately that's not possible. There exist, however, libraries and tools that have been created exactly for this reason. JSON-LD Macros is such a library. It allows declarative transformations of JSON objects to make them usable as JSON-LD. So, effectively, all you need is a very thin layer on top of an off-the-shelf JSON-LD processor.
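To make the limitation concrete, here is a minimal sketch with the PyLD library that supplies the context out of band and wraps the untouched JSON at processing time; the context values are taken from the example above, while the wrapper itself is an assumption. Nodes without an @id simply expand to blank nodes, which is exactly the gap described in the question:
from pyld import jsonld  # pip install PyLD

# The plain JSON, exactly as received; it is never modified
data = [
    {"name": "Manu Sporny", "knows": "http://greggkellogg.net/foaf#me"},
    {"name": "Gregg Kellogg", "knows": "http://manu.sporny.org/about#manu"},
]

# Context supplied out of band, e.g. loaded from a separate file
context = {
    "name": "http://xmlns.com/foaf/0.1/name",
    "knows": {"@id": "http://xmlns.com/foaf/0.1/knows", "@type": "@id"},
}

# Wrap rather than inject: the original objects stay untouched,
# but without @id each subject becomes a blank node
expanded = jsonld.expand({"@context": context, "@graph": data})
print(expanded)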
