Clean API results to get the headlines of news articles?

I have been having trouble finding a way to pull out specific text from the Guardian API for my dissertation. I have managed to load all the text into Python, but how do I then clean it to get, say, just the headlines of the news articles?
This is a snippet of the API result that I want to pull out info from:
{
"response": {
"status":"ok",
"userTier":"developer",
"total":1869990,
"startIndex":1,
"pageSize":10,
"currentPage":1,
"pages":186999,
"orderBy":"newest",
"results":[
{
"id":"sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
"type":"liveblog",
"sectionId":"sport",
"sectionName":"Sport",
"webPublicationDate":"2016-07-09T13:21:36Z",
"webTitle":"Tour de France 2016: stage eight – live!",
"webUrl":"https://www.theguardian.com/sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
"apiUrl":"https://content.guardianapis.com/sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
"isHosted":false
},
{
"id":"sport/live/2016/jul/09/serena-williams-v-angelique-kerber-wimbledon-womens-final-live",
"type":"liveblog",
"sectionId":"sport",
"sectionName":"Sport",
"webPublicationDate":"2016-07-09T13:21:02Z",
"webTitle":"Serena Williams v Angelique Kerber: Wimbledon women's final –
...

Hoping the OP adds the used code to the question.
One solution in Python: whatever you get back (from the methods offered by the requests module?) will either already be a deeply nested structure of dicts and lists that you can index into, or can easily be mapped to one via json.loads(the_string_you_displayed).
Sample:
import json
d = json.loads(the_string_you_displayed)
head_line = d['response']['results'][0]['webTitle']
This stores in head_line the webTitle value of the first dict (index 0) in the results "array" of the response entry. (The question was updated, so the full path is now visible.)
This assumes I read the sample snippet correctly and that it was cut off during copy and paste, since the sample as given is invalid JSON.
If the text is not valid JSON, you are left with sifting through it via substring or pattern matching, which may well be very brittle ...
Update: So assuming the full response structure is stored inside a variable named data:
result_seq = data['response']['results'] # yields a list here
headlines = [result['webTitle'] for result in result_seq]
The last line is a list comprehension: it builds a list by taking the value of the key webTitle from each dict result in result_seq.
An equivalent explicit for loop would be:
result_seq = data['response']['results']
headlines = []
for result in result_seq:
    headlines.append(result['webTitle'])
This does not check for errors such as result dicts without a webTitle key, but Python will raise a matching exception (here a KeyError), and you can decide whether to wrap the processing in a try/except block or hope for the best ...
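A minimal sketch of a more defensive variant, assuming the parsed response has the shape shown in the question. The helper name extract_headlines and the sample payload are made up for illustration and are not part of the Guardian API:

```python
def extract_headlines(data):
    """Collect webTitle values, skipping results that lack the key."""
    # .get() with defaults avoids a KeyError when a level is missing
    results = data.get('response', {}).get('results', [])
    return [result['webTitle'] for result in results if 'webTitle' in result]


# Hypothetical payload shaped like the snippet in the question
sample = {
    'response': {
        'status': 'ok',
        'results': [
            {'id': 'a', 'webTitle': 'Tour de France 2016: stage eight – live!'},
            {'id': 'b'},  # no webTitle, so it is skipped
        ],
    }
}

headlines = extract_headlines(sample)
print(headlines)
```

Whether silently skipping incomplete results is acceptable depends on the analysis; raising or logging may be the better choice when every article is expected to have a title.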

Related

how to get nested data with pandas and request

I'm going crazy trying to get data through an API call using requests and pandas. It looks like it's nested data, but I can't get the data I need.
https://xorosoft.docs.apiary.io/#reference/sales-orders/get-sales-orders
Above is the API documentation. I'm just trying to keep it simple and get the ItemNumber and QtyRemainingToShip, but I can't even figure out how to access the nested data. I'm trying to use a DataFrame to get it, but I am just lost. Any help would be appreciated. I keep getting stuck at the 'Data' level.
type(json['Data'])
df = pd.DataFrame(['Data'])
df.explode('SoEstimateHeader')
df.explode('SoEstimateHeader')
Cell In [64], line 1
df.explode([0:])
^
SyntaxError: invalid syntax

I used the link to grab a sample response from the API documentation page you provided. From the code you provided it looks like you are already able to get the data, and I'm assuming that you already have it as a dictionary.
From what I can tell, I don't think you should be using pandas, unless it's a downstream requirement of the task you are doing. To get the ItemNumber and QtyRemainingToShip you can use the code below.
# get the interesting part of the data out of the api response
data_list = json['Data']
#the data_list is only one element long, so grab the first element which is of type dictionary
data = data_list[0]
# the dictionary has two keys at the top level
so_estimate_header = data['SoEstimateHeader']
# similar to the data list the value associated with "SoEstimateItemLineArr" is of type list and has 1 element in it, so we grab the first & only element.
so_estimate_item_line_arr = data['SoEstimateItemLineArr'][0]
# now we can grab the pieces of information we're interested in out of the dictionary
qtyremainingtoship = so_estimate_item_line_arr["QtyRemainingToShip"]
itemnumber = so_estimate_item_line_arr["ItemNumber"]
print("QtyRemainingToShip: ", qtyremainingtoship)
print("ItemNumber: ", itemnumber)
Output
QtyRemainingToShip: 1
ItemNumber: BC
Side Note
As a side note, I wouldn't name any variable json, because that's also the name of Python's standard library module for parsing JSON. It will confuse future readers and will shadow the module if you end up having to import json.
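If a response holds more than one order, or more than one line per order, indexing with [0] only reads the first of each. A hedged sketch that walks every entry, assuming the same Data / SoEstimateItemLineArr layout as the documentation sample; the variable name payload and the second line item are invented for illustration:

```python
# Hypothetical response shaped like the documentation sample
payload = {
    'Data': [
        {
            'SoEstimateHeader': {},  # header fields omitted here
            'SoEstimateItemLineArr': [
                {'ItemNumber': 'BC', 'QtyRemainingToShip': 1},
                {'ItemNumber': 'XY', 'QtyRemainingToShip': 4},  # invented line
            ],
        }
    ]
}

rows = []
for order in payload['Data']:
    # each order carries a list of item lines
    for line in order['SoEstimateItemLineArr']:
        rows.append((line['ItemNumber'], line['QtyRemainingToShip']))

print(rows)
```

If a DataFrame is needed downstream, a list of tuples like rows can be passed to pd.DataFrame together with a columns argument.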

Scraping data from a http & javaScript site

I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, let's take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I want to scrape every variant of shoe size and color. That data can be found by opening the page source and searching for 'variationValues'.
There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimensionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimensionIndexMap we can see
"B01KWIUH5M":[0,0]
which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the size_name section of variationValues) and the color 'Teal' (same idea as before).
I want to scrape both variationValues and asinToDimensionIndexMap, so I can match the index map numbers to the variationValues entries.
Another person in the site (thanks for the help btw) suggested doing it this way.
script = response.xpath('//script/text()').extract_first()
import re
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)
import json
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of json is not that great and reading some stuff about it didn't help that much.
Is there a way to get, from that data, two dictionaries or lists with the variationValues and asinToDimensionIndexMap (maybe using some regular expressions in the middle to get some data out of a big string)? Or could someone explain a little of what happens in the json part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap

I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a widely used text format for storing structured data. json.loads parses a JSON string into Python objects (dicts, lists, strings, numbers), regardless of the platform that produced the string.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming the challenge is errors that come up when you access a particular "box" inside your JSON object.
Your code format looks correct, but the access path within "each box" may look different.
E.g. if your 'asinToDimensionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimensionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another", which is precisely what you need to know to access what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.

variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can parse each of them as JSON (for example with json.loads) and combine them as you wish.
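As a rough illustration of that last step, the sketch below parses two of the extracted strings with json.loads and joins them. The literal strings stand in for what the regular expressions would return; their contents are invented from the example in the question (size '8M US', color 'Teal'), and the key name color_name and the second product entry are assumptions about the page:

```python
import json

# Stand-ins for the strings captured by the regular expressions above
variation_values_text = '{"size_name": ["8M US", "9M US"], "color_name": ["Teal", "Black"]}'
index_map_text = '{"B01KWIUH5M": [0, 0], "B01KWIUHAM": [1, 1]}'

variation_values = json.loads(variation_values_text)
index_map = json.loads(index_map_text)

# Assumes the positions in each index list follow the key order of variationValues
dimensions = list(variation_values)

variants = {}
for asin, indexes in index_map.items():
    # look up each index in the value list of the matching dimension
    variants[asin] = {
        dim: variation_values[dim][i] for dim, i in zip(dimensions, indexes)
    }

print(variants['B01KWIUH5M'])
```

The result is one dict per product code, which can then be looked up directly instead of decoding the index pairs by hand.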

Scrape a webpage using scrapy into tab-delimited format

I would like to scrape and parse the data on these two pages: here and here into a tab-delimited format using scrapy. I did these commands:
scrapy shell
fetch("https://www.drugbank.ca/drugs/DB04899")
print response.text
My questions:
1. For example, for this page, when I type:
response.css(".sequence::text").extract()
[u'>DB04899: Natriuretic peptides B\nSPKMVQGSGCFGRKMDRISSSSGLGCKVLRRH']
But then when I type:
>>> response.css(".synonyms::text").extract()
[]
>>> response.css(".Synonyms::text").extract()
[]
But you can see that there are synonyms listed on the webpage and so the output should not be empty. Can someone demonstrate what I'm doing wrong? (I also tried other tags such as synonym, Synonym) etc.
When I type: response.css(".targets::text").extract(), the output is [u'Targets (3)']. I'm wondering how I can actually parse the data within this list, but I guess this is related to not using the right tags and question 1 above.
This is a vague question/advanced for me at the minute, is it possible to just scrape the whole page in one go, instead of having to know each individual tag? So my output would be a dictionary called 'identification' with Name, accession number, type etc as keys. Then a dictionary called pharmacology with indication, structured indication etc as keys, then another dictionary called interactions, and another called pharmaeconomics etc, one dictionary per page section?
Thanks

There are in fact no elements on the page with a class attribute value of synonyms or Synonyms.
You can get to the synonyms by "going to the right" of the dt element with the "Synonyms" text using following-sibling:
In [2]: response.xpath("//dt[. = 'Synonyms']/following-sibling::dd/ul/li/text()").extract()
Out[2]:
['BNP',
'Brain natriuretic peptide 32',
'Natriuretic peptides B',
'Nesiritide recombinant']

How to get all text inside XML tags

xml file snapshot
From the above .xml file I am extracting article-id, article-title, abstract and keywords. For plain text inside a single tag I get correct results, but not for text that contains nested tags, such as:
<title-group>
<article-title>
Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
<italic>Rapidithrix thailandica</italic>
</article-title>
</title-group>
The same applies to the abstract.
I got this output:
OrderedDict([(u'italic', u'Rapidithrix thailandica'), ('#text', u'Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,')])
The code has treated the tag as text, and the output is also out of sequence.
How can I simply extract the text from such input as "Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium, Rapidithrix thailandica"?
I am using the Python code below to perform the above task:
import xmltodict
import os
from os.path import basename
import re
with open('2630847.nxml') as fd:
    doc = xmltodict.parse(fd.read())
pmc_id = doc['article']['front']['article-meta']['article-id'][1]['#text']
article_title = doc['article']['front']['article-meta']['title-group']['article-title']
y = doc['article']['front']['article-meta']['abstract']
y = y.items()[0]
article_abstract = [g.encode('ascii','ignore') for g in y][1]
z = doc['article']['front']['article-meta']['kwd-group']['kwd']
zz = [g.encode('ascii','ignore') for g in z]
article_keywords = ",".join(zz).replace(","," ")
fout = open(str(pmc_id)+".txt","w")
fout.write(str(pmc_id)+"\n"+str(article_title)+". "+str(article_abstract)+". "+str(article_keywords))
Can somebody please suggest corrections..

xmltodict will likely be hard to use for your data. PMC journal articles are definitely not what the authors could have had in mind. Putting any but the most trivial XML into xmltodict is pounding a round peg into a square hole -- you might succeed, but it won't be pretty. I explain this further below under "tldr"....
Instead, I suggest you use a library whose data model fits your data better, such as xml.dom.minidom or a recent version of BeautifulSoup. In many such libraries you load the document with one call and then call something like innerText() to get all of its text content (BeautifulSoup, for example, offers get_text()). You could even load the document into a browser and read the JavaScript innerText property to get what you want. If the tool you choose doesn't already provide an innertext(), it is easy to write for a DOM such as minidom:
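from xml.dom.minidom import Element, Text

def innertext(node):
    """Return the concatenated text of node and all its descendants."""
    t = ""
    for curNode in node.childNodes:
        if isinstance(curNode, Text):
            t += curNode.nodeValue
        elif isinstance(curNode, Element):
            # recurse so nested elements contribute their text in document order
            t += innertext(curNode)
    return t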
You could tweak that to put spaces between the text nodes, depending on your data.
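For comparison, here is a short sketch of the same idea using a different standard library module, xml.etree.ElementTree, whose itertext() method walks the text of an element and its descendants in document order. It is not the DOM approach described above, and the XML literal is just the fragment from the question:

```python
import xml.etree.ElementTree as ET

xml_text = """<title-group>
<article-title>
Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
<italic>Rapidithrix thailandica</italic>
</article-title>
</title-group>"""

root = ET.fromstring(xml_text)
title_elem = root.find('article-title')

# itertext() yields the text of the element and all descendants in document order
raw_title = "".join(title_elem.itertext())

# collapse the line breaks and runs of whitespace left over from the markup
article_title = " ".join(raw_title.split())
print(article_title)
```

The whitespace collapsing at the end is one way to handle the spacing question; titles where spacing inside the markup is significant would need a gentler cleanup.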
Hope that helps.
==tldr==
xmltodict makes an admirable attempt at making XML "as simple as possible"; but IMHO it errs in making it simpler than possible.
xmltodict basically works by turning every element into a dict, with its children as the dict items, keyed by their element names. But in many cases (such as yours), XML data isn't very much like that at all. For example, an element can have many children with the same name, but a dict can't.
So xmltodict has to do something special. It turns adjacent instances of the same element type into an array (without the element type). Here's an example excerpted from https://github.com/martinblech/xmltodict:
<and>
  <many>elements</many>
  <many>more elements</many>
</and>
becomes:
"and": {
  "many": [
    "elements",
    "more elements"
  ]
},
First off, this means that xmltodict always loses the ordering information about child elements unless they are of the same type. So a section that contains a mix of paragraphs, lists, blockquotes, and so on, will either fail to load in xmltodict, or have all the scattered instances of each kind of child gathered together, completely losing their order.
The xmltodict approach also introduces frequent special-cases -- for example, you can't just get a list of all the children, or use len() to find out how many there are, etc. etc., because at every step you have to check whether you're really at a child element, or at a list of them.
Looking at xmltodict's own examples, you'll see that they mostly consist of walking down the tree by element names, but every now and then there's an integer subscript -- that's for the cases where these arrays are needed. But unless the data is unusually simple (which yours isn't), you won't know where that is. For example, if one DIV in an HTML document happens to contain only one P, the code to access the P needs one fewer subscript than with another DIV that happens to have more than one P.
It seems to me undesirable that the number of subscripts to get to something depends on how many siblings it has, and their types.
Alas, the structure still isn't good enough. Since child elements may have their own child elements, just making them strings in that extra array won't be enough. Sometimes they'll have to be dicts again, with some of their items in turn perhaps being arrays, some of whose items may be dicts, and so on. Writing the correct traversal algorithm to gather up the text is significantly harder than the DOM one shown above.
To be completely fair, there is some XML in which the order doesn't matter logically -- for example, you could export a SQL table into an XML file, using a container element for each record with a child element for each field. The order of fields is not information, so if you load such XML into xmltodict, losing the order doesn't matter. Likewise if you serialized Python data that was already just a dict. But those are very specialized edge cases. xmltodict might be an excellent choice for a case like that -- but the articles you're looking at are very far from that.

json change dictionary item to a list with one dictionary

I'm working with a Rest Api for finding address details. I pass it an address and it passes back details for that address: lat/long, suburb etc. I'm using the requests library with the json() method on the response and adding the json response to a list to analyse later.
What I'm finding is that when there is a single match for an address the 'FoundAddress' key in the json response contains a dictionary but when more than one match is found the 'FoundAddress' key contains a list of dictionaries.
The returned json looks something like:
For a single match:
{
'FoundAddress': {AddressDetails...}
}
For multiple matches:
{
'FoundAddress': [{Address1Details...}, {Address2Details...}]
}
I don't want to write code to handle a single match and then multiple matches.
How can I modify the 'FoundAddress' so that when there is a single match it changes it to a list with a single dictionary entry? Such that I get something like this:
{
'FoundAddress': [{AddressDetails...}]
}

If it's the external API sending responses in that format then you can't really change FoundAddress itself, since it will always arrive in that format.
You can change the response if you want to, since you have full control over what you've received:
r = response.json()  # parsed body, as returned by requests
fixed = r['FoundAddress'] if isinstance(r['FoundAddress'], list) else [r['FoundAddress']]
r['FoundAddress'] = fixed
Alternatively you can do the distinction at address usage time:
def func(foundAddress):
    # work with a single dictionary instance here
    ...
then:
result = map(func, r['FoundAddress']) if (type(r['FoundAddress']) is list) else [func(r['FoundAddress'])]
But honestly I'd take a clear:
if isinstance(r['FoundAddress'], list):
    result = map(func, r['FoundAddress'])
else:
    result = [func(r['FoundAddress'])]
or the response fix-up over the a if b else c one-liner any day.

If you can, I would just change the API. If you can't there's nothing magical you can do. You just have to handle the special case. You could probably do this in one place in your code with a function like:
def handle_found_addresses(found_addresses):
    if not isinstance(found_addresses, list):
        found_addresses = [found_addresses]
    ...
and then proceed from there to do whatever you do with found addresses as if the value is always a list with one or more items.
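A small self-contained sketch of that normalization step, assuming the only two shapes are a single dict or a list of dicts. The helper name as_list and the sample responses are hypothetical:

```python
def as_list(value):
    """Return value unchanged if it is a list, otherwise wrap it in one."""
    return value if isinstance(value, list) else [value]


# Hypothetical responses: one match arrives as a dict, several as a list
single = {'FoundAddress': {'Suburb': 'A'}}
multiple = {'FoundAddress': [{'Suburb': 'A'}, {'Suburb': 'B'}]}

for response in (single, multiple):
    # after this, FoundAddress is always a list of dicts
    response['FoundAddress'] = as_list(response['FoundAddress'])

print(len(single['FoundAddress']), len(multiple['FoundAddress']))
```

Applying the helper once, right after each response is parsed, keeps the special case out of the later analysis code.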
