Parallelize openalex API - python

I am using the OpenAlex API (see here for an example) to get all the papers published in 2020.
The way I am doing this is the following:
import requests
import pandas as pd
from collections import defaultdict

# url with a placeholder for the cursor
example_url_with_cursor = 'https://api.openalex.org/works?filter=publication_year:2020&cursor={}'

dfs = defaultdict(dict)
paper_id = []
title_lst = []
year_lst = []
lev_lst = []
page = 1
cursor = '*'

# loop through pages
while cursor:
    # set cursor value and request page from OpenAlex
    url = example_url_with_cursor.format(cursor)
    print("\n" + url)
    page_with_results = requests.get(url).json()

    # loop through partial list of results
    results = page_with_results['results']
    for i, work in enumerate(results):
        openalex_id = work['id'].replace("https://openalex.org/", "")
        if work['display_name'] is not None and len(work['display_name']) > 0:
            openalex_title = work['display_name']
        else:
            openalex_title = 'No title'
        #openalex_author = work['authorships'][0]['author']['display_name']
        openalex_year = work['publication_year']
        if work['concepts'] is not None and len(work['concepts']) > 0:
            openalex_lev = work['concepts'][0]['display_name']
        else:
            openalex_lev = 'None'
        paper_id.append(openalex_id)
        title_lst.append(openalex_title)
        year_lst.append(openalex_year)
        lev_lst.append(openalex_lev)
        #print(openalex_id, end='\t' if (i+1)%5!=0 else '\n')

    # constructing a pandas DataFrame from the accumulated lists:
    df = pd.DataFrame(paper_id, columns=['paper_id'])
    df['title'] = title_lst
    df['level'] = lev_lst
    df['pub_year'] = year_lst

    # update cursor to meta.next_cursor
    #page += 1
    #dfs[page] = df
    cursor = page_with_results['meta']['next_cursor']
This method, however, is very slow and I would like to speed it up through parallelization. Since I am quite new to parallelizing while loops, I would kindly ask whether there is a way to do so without messing up the final results.
Specifically, the code above loops through the data pages, picks up the cursor of the next page, saves the data in a DataFrame (df) and moves on to the next page. The process is repeated until the last page that still carries a cursor.
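Cursor paging is inherently sequential, because each request needs the next_cursor returned by the previous one, so the usual way to parallelize is to split the query itself into independent slices and walk each slice's cursor chain in its own worker. Below is a minimal sketch that splits 2020 into months and fetches the slices with a thread pool; it assumes OpenAlex accepts from_publication_date/to_publication_date filters and a per-page parameter (check the API docs for the exact names), and the worker count is illustrative; stay within the API's rate limits.
import calendar
import requests
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# assumed filter/parameter names; verify against the OpenAlex documentation
SLICE_URL = ('https://api.openalex.org/works'
             '?filter=from_publication_date:{start},to_publication_date:{end}'
             '&per-page=200&cursor={cursor}')

def fetch_slice(start, end):
    """Walk one date slice to the end of its cursor chain and return its rows."""
    rows = []
    cursor = '*'
    while cursor:
        page = requests.get(SLICE_URL.format(start=start, end=end, cursor=cursor)).json()
        for work in page['results']:
            rows.append({
                'paper_id': work['id'].replace('https://openalex.org/', ''),
                'title': work['display_name'] or 'No title',
                'level': work['concepts'][0]['display_name'] if work['concepts'] else 'None',
                'pub_year': work['publication_year'],
            })
        cursor = page['meta']['next_cursor']
    return rows

# one slice per month of 2020; each slice paginates independently of the others
slices = [('2020-{:02d}-01'.format(m),
           '2020-{:02d}-{:02d}'.format(m, calendar.monthrange(2020, m)[1]))
          for m in range(1, 13)]

with ThreadPoolExecutor(max_workers=4) as pool:
    all_rows = [row for rows in pool.map(lambda s: fetch_slice(*s), slices) for row in rows]

df = pd.DataFrame(all_rows)
Each worker returns an independent list of rows, so the final DataFrame is assembled once at the end instead of being rebuilt on every page.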

Related

scrapy returning too many rows in tables

Feels like I'm not grasping some concepts here, or trying to fly before I can crawl (pun intended).
There are indeed 5 tables on the page, with the one I'm interested in being the 3rd. But executing this:
#!/usr/bin/python
# python 3.x
import sys
import os
import re
import requests
import scrapy

class iso3166_spider(scrapy.Spider):
    name = "countries"

    def start_requests(self):
        urls = ["https://en.wikipedia.org/wiki/ISO_3166-1"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        title = response.xpath('//title/text()').get()
        print("-- title -- {0}".format(title))
        list_table_selector = response.xpath('//table')  # gets all tables on the page
        print("-- table count -- {0}".format(len(list_table_selector)))
        table_selector = response.xpath('//table[2]')    # inspect to figure out which one you want
        table_selector_text = table_selector.getall()    # got the right table, starts with Afghanistan
        # print(table_selector_text)
        #
        # here is where things go wrong
        list_row_selector = table_selector.xpath('//tr')
        print("number of rows in table: {0}".format(len(list_row_selector)))  # gives 302, should be close to 247
        for i in range(0, 20):
            row_selector = list_row_selector[i]
            row_selector_text = row_selector.getall()
            print("i={0}, getall:{1}".format(i, row_selector_text))
prints the getall() of each row in EVERY table - I see the row for Afghanistan as row 8 not row 2
Changing
list_row_selector = table_selector.xpath('//tr')
to
list_row_selector = table_selector.xpath('/tr')
results in zero rows found where I'd expect roughly 247
Ultimately I want the name and three codes for each country, should be straightforward.
What am I doing wrong?
TIA,
kerchunk
tbl = response.xpath("//th[starts-with(text(),'English short name')]/ancestor::table/tbody/tr[position()>1]")  # try this xpath. I checked the source of the web page; the header row (the "th" elements) also sits under tbody.
You can also try replacing "//tr" with ".//tr".
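The underlying issue is that an XPath beginning with // always restarts at the document root, even when it is evaluated on a selector for a single table; prefixing it with a dot anchors the search to the current node. A small sketch of the difference, using the same table index as in the question:
table = response.xpath('//table[2]')
rows_absolute = table.xpath('//tr')    # '//' restarts at the document root: rows from every table
rows_relative = table.xpath('.//tr')   # '.' anchors the search to this table only
print(len(rows_absolute), len(rows_relative))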

How to add all the pages of an api into a pandas dataframe

I know that pandas has the read_json function to get data from an API into a DataFrame, but is there any way to read through all the pages of the API and load them into the same DataFrame?
import requests
import pandas as pd
import config
api_key = config.api_key
url = " http://api.themoviedb.org/3/discover/movie?release_date.gte=2017-12-
01&release_date.lte=2017-12-31&api_key=" + api_key
payload = "{}"
response = requests.request("GET", url, data=payload)
print(response.text.encode("utf-8"))
I tried the requests method, but this only gives me the first page of the API. I wanted to see if I can do it with the DataFrame approach below, but I cannot figure out how to write a loop that goes over all the pages and puts everything into one DataFrame for further analysis.
df = pd.read_json('http://api.themoviedb.org/3/discover/movie'
                  '?release_date.gte=2017-12-01&release_date.lte=2017-12-31'
                  '&api_key=''&page=%s' % page)
You can read each page into a dataframe and concatenate them:
page = 0
df = []
while True:
    try:
        next_page = pd.read_json('http://api.themoviedb.org/3/discover/movie'
                                 '?release_date.gte=2017-12-01&release_date.lte=2017-12-31'
                                 '&api_key=''&page=%s' % page)
        if len(next_page) == 0:
            # didn't get any content, stop
            break
        else:
            # keep this page and move on to the next one
            df.append(next_page)
            page += 1
    except:
        # if the API call errored out, the URL for that page probably doesn't exist, so stop
        break
df = pd.concat(df, axis=0)
Documentation for pd.concat here. Hope it helps :)
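If you would rather not probe page numbers until an error, the discover response itself reports how many pages there are (as far as the TMDb docs describe, the payload carries page, total_pages and a results list), so you can read the first page and then loop up to total_pages. A rough sketch under that assumption, using pd.json_normalize to flatten each page's results:
import requests
import pandas as pd

def discover_movies(api_key):
    base = ('http://api.themoviedb.org/3/discover/movie'
            '?release_date.gte=2017-12-01&release_date.lte=2017-12-31'
            '&api_key={key}&page={page}')
    first = requests.get(base.format(key=api_key, page=1)).json()
    frames = [pd.json_normalize(first['results'])]
    # 'total_pages' comes back in the discover response, so we know when to stop
    for page in range(2, first['total_pages'] + 1):
        data = requests.get(base.format(key=api_key, page=page)).json()
        frames.append(pd.json_normalize(data['results']))
    return pd.concat(frames, ignore_index=True)

# usage: df = discover_movies(config.api_key)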

Python 3.6 API while loop to json script not ending

I'm trying to loop over an API call that returns a JSON string, since each call is limited to 200 rows. When I tried the code below, the loop didn't seem to end, even after I left it running for an hour or so. The maximum I'm looking to pull is about ~200k rows from the API.
import requests

bookmark = ''
urlbase = 'https://..../?'
alldata = []
while True:
    url = urlbase
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    if len(data['rows']) < 200:
        break
Also, I'm looking to filter the loop to only output if json value 'pet.type' is "Puppies" or "Kittens." Haven't been able to figure out the syntax.
Any ideas?
Thanks
The break condition for your loop is incorrect. Notice that it checks len(data["rows"]), where data only contains the rows from the most recent request.
Instead, you should be looking at the total number of rows you've collected so far: len(alldata).
bookmark = ''
urlbase = 'https://..../?'
alldata = []
while True:
    url = urlbase
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    # Check `alldata` instead of `data["rows"]`,
    # and set the limit to 200k instead of 200.
    if len(alldata) >= 200000:
        break
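For the second part of the question (keeping only puppies and kittens), the rows can be filtered after they are collected. A sketch assuming each row is a dict with a nested 'pet' object that has a 'type' key; the exact shape of the payload isn't shown, so adjust the keys to match it:
wanted_types = {'Puppies', 'Kittens'}
filtered = [row for row in alldata
            if row.get('pet', {}).get('type') in wanted_types]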

Very fast webpage scraping (Python)

So I'm trying to filter through a list of URLs (potentially in the hundreds) and filter out every article whose body is less than X number of words (ARTICLE_LENGTH). But when I run my application, it takes an unreasonable amount of time, so much so that my hosting service times out. I'm currently using Goose (https://github.com/grangier/python-goose) with the following filter function:
def is_news_and_data(url):
    """A function that returns a list of the form
    [True, title, meta_description]
    or
    [False]
    """
    # g is a goose.Goose() instance created elsewhere
    result = []
    if url is None:
        return False
    try:
        article = g.extract(url=url)
        if len(article.cleaned_text.split()) < ARTICLE_LENGTH:
            result.append(False)
        else:
            title = article.title
            meta_description = article.meta_description
            result.extend([True, title, meta_description])
    except:
        result.append(False)
    return result
This is called in the context of the following; don't mind the debug prints and messiness (tweepy is my Twitter API wrapper):
def get_links(auth):
    """Returns a list of t.co links from a list of given tweets"""
    api = tweepy.API(auth)
    page_list = []
    tweets_list = []
    links_list = []
    news_list = []
    regex = re.compile('http://t.co/.[a-zA-Z0-9]*')
    for page in tweepy.Cursor(api.home_timeline, count=20).pages(1):
        page_list.append(page)
    for page in page_list:
        for status in page:
            tweet = status.text.encode('utf-8', 'ignore')
            tweets_list.append(tweet)
    for tweet in tweets_list:
        links = regex.findall(tweet)
        links_list.extend(links)
    #print 'The length of the links list is: ' + str(len(links_list))
    for link in links_list:
        news_and_data = is_news_and_data(link)
        if True in news_and_data:
            news_and_data.append(link)
            #[True, title, meta_description, link]
            news_list.append(news_and_data[1:])
    print 'The length of the news list is: ' + str(len(news_list))
Can anyone recommend a perhaps faster method?
This code is probably causing your slow performance:
len(article.cleaned_text.split())
This is performing a lot of work, most of which is discarded. I would profile your code to see if this is the culprit, if so, replace it with something that just counts spaces, like so:
article.cleaned_text.count(' ')
That won't give you exactly the same result as your original code, but will be very close. To get closer you could use a regular expression to count words, but it won't be quite as fast.
I'm not saying this is the absolute best you can do, but it will be faster. You'll have to redo some of your code to fit this new function; it will at least give you fewer function calls. You'll have to pass it the whole URL list.
def is_news_in_data(listings):
    new_listings = {}
    is_news = {}
    for url in listings:
        is_news[url] = 0
        article = g.extract(url=url).cleaned_text
        tmp_listing = ''
        for s in article:           # walks the extracted text character by character
            is_news[url] += 1
            tmp_listing += s
        if is_news[url] > ARTICLE_LENGTH:
            new_listings[url] = tmp_listing
            del is_news[url]
    return new_listings
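Beyond trimming the word count, with hundreds of URLs most of the time is spent waiting on the network, so fetching the pages concurrently is usually the bigger win. A rough sketch that runs the question's is_news_and_data over many links with a thread pool (concurrent.futures is standard in Python 3 and available for Python 2 via the futures backport; the worker count is illustrative and the target sites' rate limits still apply):
from concurrent.futures import ThreadPoolExecutor

def filter_links(links_list, max_workers=10):
    """Run is_news_and_data over many links at once; the work is network-bound."""
    news_list = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for link, news_and_data in zip(links_list, pool.map(is_news_and_data, links_list)):
            if True in news_and_data:
                news_and_data.append(link)
                news_list.append(news_and_data[1:])  # [title, meta_description, link]
            # entries that came back as [False] are simply dropped
    return news_list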

Google App Engine Search API cursor not updating

I am using cursors to get results from the GAE Full Text Search API. The problem is that the cursor remains the same in each iteration:
cursor = search.Cursor()
files_options = search.QueryOptions(
    limit=5,
    cursor=cursor,
    returned_fields='state'
)
files_dict = {}
query = search.Query(query_string=text_to_search, options=files_options)
index = search.Index(name='title')
while cursor != None:
    results = index.search(query)
    cursor = results.cursor
The cursor never becomes None, even though the search returns only 18 results.
The problem is that you are getting the same 5 results over and over again. Every time you do results = index.search(query) inside your loop, you retrieve the first five results, because your query options specify a limit of 5 and an empty cursor. You need to create a new query starting at the new cursor on every iteration.
cursor = search.Cursor()
index = search.Index(name='title')
while cursor is not None:
    options = search.QueryOptions(limit=5, cursor=cursor, returned_fields='state')
    results = index.search(search.Query(query_string=text_to_search, options=options))
    cursor = results.cursor
Take a look at the introduction section of this page: https://developers.google.com/appengine/docs/python/search/queryclass
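To actually collect something across the pages (the files_dict in the question is never filled), the documents from each batch can be accumulated as the cursor advances. A sketch that simply gathers the returned documents, assuming SearchResults exposes the batch via its results attribute as described in the linked reference:
all_docs = []
cursor = search.Cursor()
index = search.Index(name='title')
while cursor is not None:
    options = search.QueryOptions(limit=5, cursor=cursor, returned_fields='state')
    results = index.search(search.Query(query_string=text_to_search, options=options))
    all_docs.extend(results.results)   # keep this batch of ScoredDocuments
    cursor = results.cursor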
