How can I optimize a JSON web scrape? - python

I recently got interested in Bitcoin and the whole blockchain thing. Since every transaction is public by design, I thought it would be interesting to investigate the number of wallets, the size of transactions and such. But the current block height of Bitcoin is 732,324, which is quite a lot of blocks to walk through one after another. Thus, I want to obtain the hash of each block so that I can multi-thread grabbing the transactions.
The blockchain links each block to the next, so if I start at the first block (the genesis block) and simply follow the chain to the end, I should have what I need. I am quite new to Python, but below is my code for obtaining the hashes and saving them to a file. However, at the current rate it would take 30-40 hours to complete on my machine. Is there a more efficient way to solve the problem?
#imports
from urllib.request import urlopen
from datetime import datetime
import json

#Setting start parameters
genesisBlock = "000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f"
baseurl = "https://blockchain.info/rawblock/"
i = 0 #counter for tracking progress

#Set HASH
blockHASH = genesisBlock

#Open file to save results
filePath = "./blocklist.tsv"
fileObject = open(filePath, 'a')

#Write header, if first line
if i == 0:
    fileObject.write("blockHASH\theight\ttime\tn_tx\n")

#Start walking through each block
while blockHASH != "":
    #Print progress
    if i % 250 == 0:
        print(str(i)+"|"+datetime.now().strftime("%H:%M:%S"))
    # store the response of URL
    url = baseurl+blockHASH
    response = urlopen(url)
    # storing the JSON response in data
    data_json = json.loads(response.read().decode())
    #Write result to file
    fileObject.write(blockHASH+"\t"+
                     str(data_json["height"])+"\t"+
                     str(data_json["time"])+"\t"+
                     str(data_json["n_tx"])+"\t"+
                     "\n")
    #increment counter
    i = i + 1
    #Set new hash
    blockHASH = data_json["next_block"][0]
    if i > 1000: break #or just let it run until completion

# Close the file
fileObject.close()

While this doesn't comment directly on the efficiency of your approach, using orjson or rapidjson will definitely speed up your results, since both are quite a bit faster than the standard json library.
rapidjson can be swapped in as easily as doing import rapidjson as json, whereas with orjson you have to make a couple of changes, as described on its GitHub page, but nothing too hard.
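As a rough sketch of what either swap looks like against the loop above (untested, and reusing the same urlopen call as the question's code):

from urllib.request import urlopen

blockHASH = "000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f"  # genesis block, as in the question

# Option 1: rapidjson as a drop-in replacement for the stdlib json module
import rapidjson as json
response = urlopen("https://blockchain.info/rawblock/" + blockHASH)
data_json = json.loads(response.read().decode())

# Option 2: orjson; its loads() accepts bytes directly, so the decode() step can be dropped
import orjson
response = urlopen("https://blockchain.info/rawblock/" + blockHASH)
data_json = orjson.loads(response.read())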

Related

How can I optimize a web-scraping code snippet to run faster?

I wrote this piece of code and it is currently running, scraping a massive amount of data. So far the loop has run 800 times. It will have to run ~16,000 times to grab all of the data.
Generally speaking, how can I optimize web-scraping code, or am I at the mercy of requests.get?
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize

headers = {}
p = {}
a = int(p['page'])
df = pd.DataFrame()

while True:
    p['page'] = str(a)
    try:
        a += 1
        r = requests.get('URL', headers=headers, params=p)
        complete_json = r.json()
        print('success')
        df_data = pd.DataFrame.from_dict(json_normalize(complete_json['explore_vintage']['matches']), orient='columns')
        df = df.append(df_data)
    except:
        False

df.to_excel('output.xlsx', encoding='utf8')
df.to_csv("output.csv")
print(df.head)
There are a couple of optimizations I can see right off the bat.
The first thing you could add here is parallel processing via async requests. The requests library is synchronous, so as you are seeing, it blocks until each page is fully processed. There are a number of async libraries that the requests project officially recommends. If you go this route you'll need to define an explicit terminating condition rather than relying on a try/except block inside an infinite while loop.
This is all pseudo-code primarily ripped from their examples, but you can see how this might work:
from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed
import json
import time

def response_hook(resp, *args, **kwargs):
    with open(f'tmp/{time.thread_time_ns()}.json', 'wb') as fp:
        parsed = resp.json()
        fp.write(json.dumps(parsed).encode('utf-8'))

futures_session = FuturesSession()
futures_session.hooks['response'] = response_hook

with futures_session as session:
    futures = [
        session.get(f'https://jsonplaceholder.typicode.com/todos/{i}', hooks={'response': response_hook})
        for i in range(16000)
    ]
    for future in as_completed(futures):
        resp = future.result()
The parsing of the data into a dataframe is an obvious bottleneck, and it will keep slowing down as the dataframe grows larger and larger. I don't know the size of these JSON responses, but if you're fetching 16k of them I imagine this would quickly grind to a halt once you've eaten through your memory. If possible, I would recommend decoupling the scraping and transforming operations: save each scraped response into its own, independent JSON file (as in the example above). Once the scraping completes, you can then loop over all of the saved contents, parse them, and output to Excel and CSV. Depending on the size of the JSON files you may still run into memory issues, but at least you won't block the scraping process and can deal with the output processing separately.
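For that second phase, a rough sketch under the assumption that each file saved by the hook above holds one raw response with the same explore_vintage / matches structure as in the question:

import glob
import json
import pandas as pd

# Phase 2: parse every saved response and build the dataframe once,
# after the scraping has finished.
records = []
for path in glob.glob('tmp/*.json'):
    with open(path, 'r', encoding='utf-8') as fp:
        payload = json.load(fp)
    # 'explore_vintage' / 'matches' mirror the keys used in the question
    records.extend(payload['explore_vintage']['matches'])

df = pd.json_normalize(records)
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)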

create loop to extract urls to json and csv

I set up a loop to scrape 37,900 records. Due to the way the URL/server is set up, only 200 records are displayed per URL. Each URL ends with 'skip=200', or a multiple of 200, to move on to the next page where the following 200 records are displayed. Eventually I want to loop through all the URLs and append the results into one table. (Related post: unable to loop the last url with paging limits.)
I created the loops shown below: one to build a URL with skip= for every 200 records, another to get the response for each of these URLs, and then another loop to read the JSON and append it all to a single dataframe.
I'm not sure what's missing in my second loop: so far it only produces JSON for the first URL page but not the subsequent pages. I have the feeling that the per-URL JSONs are not appended to the list json = [], which prevents looping and appending the JSONs to the CSV. Any suggestions on modifying the loops and improving this code are appreciated!
import pandas as pd
import requests
import json

records = range(37900)
skip = records[0::200]

Page = []
for i in skip:
    endpoint = "https://~/Projects?&$skip={}".format(i)
    Page.append(endpoint)

jsnlist = []
for j in Page:
    response = session.get(j) #session here refers to requests.Session() I had to set up to authenticate my access to these urls
    responsejs = response.json()
    responsejsval = responsejs['value'] #I only want to extract header called 'value' in each json
    with open('response2jsval.json', 'w') as outfile:
        json.dump(jsnlist, outfile)

concat = pd.DataFrame()
for k in jsnlist:
    df = pd.DataFrame(k) #list to df
    concat = concat.append(df, ignore_index = True)
print(concat)
I have nothing to test against
I think you massively over-complicated this. You've since edited the question, but there are a couple of points to make:
You define jsnlist = [] but never use it. Why?
You called your own object json (now gone, but I'm not sure whether you understand why). Calling your own object json will just shadow the actual module, and the whole code will grind to a halt before you even get into a loop
There is no reason at all to save this data to disk before trying to create a dataframe
Opening the .json file in write mode ('w') will wipe all existing data on each iteration of your loop
Appending JSON to a file will not give a valid format to be parsed when read back in. At best, it might be JSONLines
Appending DataFrames in a loop has terrible complexity because it requires copying of the original data each time.
Your approach will be something like this:
import pandas as pd
import requests
import json

records = range(37900)
skip = records[0::200]

Page = []
for i in skip:
    endpoint = "https://~/Projects?&$skip={}".format(i)
    Page.append(endpoint)

jsnlist = []
for j in Page:
    response = session.get(j) #session here refers to requests.Session() I had to set up to authenticate my access to these urls
    responsejs = response.json()
    responsejsval = responsejs['value'] #I only want to extract header called 'value' in each json
    jsnlist.append(responsejsval)

df = pd.DataFrame(jsnlist)
df = pd.DataFrame(jsnlist) might take some work, but you'll need to show what we're up against. I'd need to see responsejs['value'] to answer fully.
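If responsejs['value'] turns out to be a list of records (an assumption, since its structure isn't shown here), one possible refinement is to flatten the per-page lists before building the frame:

from itertools import chain
import pandas as pd

# jsnlist holds one 'value' list per page; chain them into a single flat
# list of records so the dataframe is built in one call rather than per page.
flat_records = list(chain.from_iterable(jsnlist))
df = pd.DataFrame(flat_records)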

How to download file rather than HTML with requests

Okay, so I decided I'd like a program to download osu maps based on the map number (for lack of a better term). After doing some testing with the links to understand the redirecting, I got a program which reaches the .../download page; when I visit that page in a browser, the map downloads. However, when trying to download it via requests, I get HTML instead.
def grab(self, identifier=None):
    if not identifier:
        print("Missing Argument: 'identifier'")
        return
    mapLink = f"https://osu.ppy.sh/beatmaps/{identifier}"
    dl = requests.get(mapLink, allow_redirects=True)
    if not dl:
        print("Error: map not found!")
        return
    mapLink2 = dl.url
    mapLink2 = f"https://osu.ppy.sh/beatmapsets/{self.parseLink(mapLink2)}/download"
    dl = requests.get(mapLink2)
    with open(f"{identifier}.osz", "wb") as f:
        f.write(dl.content)
And, in case it is necessary, here is self.parseLink:
def parseLink(self, mapLink=None):
    if not mapLink:
        return None
    id = mapLink.replace("https://osu.ppy.sh/beatmapsets/", "")
    id = id.split("#")
    return id[0]
Ideally, when I open the file at the end of grab(), it should save a usable .osz file - one which is NOT html, and can be dragged into the actual game and used. Of course, this is still extremely early in my testing, and I will figure out a way to make the filename the song name for convenience.
edit: example of an identifier is: OsuMaps().grab("1385415") in case you wanted to test
There is a very quick way to get around:
Needing to be logged in
Needing a specific element
This workaround comes in the form of https://bloodcat.com/osu/ - to get a download link directly to a map, all you need is: https://bloodcat.com/osu/s/<beatmap set number>.
Here is an example:
id = "653534" # this map is ILY - Panda Eyes
mapLink = f"https://bloodcat.com/osu/s/{id}" # adds id to the link
dl = requests.get(mapLink)
if len(dl.content) > 330: # see below for explanation
with open(f"{lines[i].rstrip()}.osz", "wb") as f:
f.write(dl.content)
else:
print("Map doesn't exist")
The line if len(dl.content) > 330 is my workaround for a link that doesn't resolve to a map. .osz files contain thousands upon thousands of bytes of binary data, whereas the site's "not found" page comes in at fewer than 330 bytes, so we can use this to check whether the response is too short to be a beatmap.
That's all! Feel free to use the code if you'd like.
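A slightly sturdier check than counting bytes, assuming the site returns an HTML "not found" page with a normal Content-Type header (an assumption, not something verified against bloodcat.com), is to look at the status and Content-Type instead:

import requests

id = "653534"
mapLink = f"https://bloodcat.com/osu/s/{id}"
dl = requests.get(mapLink)

# An HTML response means the map wasn't found; a real beatmap is served as binary .osz data
content_type = dl.headers.get("Content-Type", "")
if dl.ok and "text/html" not in content_type:
    with open(f"{id}.osz", "wb") as f:
        f.write(dl.content)
else:
    print("Map doesn't exist")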

Optimizing web-scraper python loop

I'm scraping articles from a news site on behalf of the owner. I have to keep it to <= 5 requests per second, or ~100k articles in 6 hrs (overnight), but I'm getting ~30k at best.
Using a Jupyter notebook, it runs fine at first, but becomes less and less responsive. After 6 hrs, the kernel is normally uninterruptible and I have to restart it. Since I'm storing each article in memory, this is a problem.
So my question is: is there a more efficient way to do this to reach ~100k articles in 6 hours?
The code is below. For each valid URL in a Pandas dataframe column, the loop:
downloads the webpage
extracts the relevant text
cleans out some encoding garbage from the text
writes that text to another dataframe column
every 2000 articles, it saves the dataframe to a CSV (overwriting the last backup), to handle the eventual crash of the script.
Some ideas I've considered:
Write each article to a local SQL server instead of in-mem (speed concerns?)
save each article text in a csv with its url, then build a dataframe later
delete all "print()" functions and rely solely on logging (my logger config doesn't seem to perform awesome, though--i'm not sure it's logging everything I tell it to)
i = 0
#lots of NaNs in the column, hence the subsetting
for u in unique_urls[unique_urls['unique_suffixes'].isnull() == False]\
        .unique_suffixes.values[:]:
    i = i + 1
    if pd.isnull(u):
        continue
    #save our progress every 2k articles just in case
    if i % 2000 == 0:
        unique_urls.to_csv('/backup-article-txt.csv', encoding='utf-8')
    try:
        #pull the data
        html_r = requests.get(u).text
        #the phrase "TX:" indicates start of article
        #text, so if it's not present, URL must have been bad
        if html_r.find("TX:") == -1:
            continue
        #capture just the text of the article
        txt = html_r[html_r.find("TX:")+5:]
        #fix encoding/formatting quirks
        txt = txt.replace('\n', ' ')
        txt = txt.replace('[^\x00-\x7F]', '')
        #wait 200 ms to spare site's servers
        time.sleep(.2)
        #write our article to our dataframe
        unique_urls.loc[unique_urls.unique_suffixes == u, 'article_text'] = txt
        logging.info("done with url # %s -- %s remaining", i, (total_links-i))
        print "done with url # " + str(i)
        print total_links-i
    except:
        logging.exception("Exception on article # %s, URL: %s", i, u)
        print "ERROR with url # " + str(i)
        continue
This is the logging config I'm using. I found it on SO, but w/ this particular script it doesn't seem to capture everything.
logTime = "{:%d %b-%X}".format(datetime.datetime.now())
logger = logging.getLogger()
fhandler = logging.FileHandler(filename=logTime + '.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.INFO)
eta: some details in response to answers/comments:
script is only thing running on a 16 GB/ram EC2 instance
articles are ~100-800 words apiece
I'm going to take an educated guess and say that, going by your description, your script turns your machine into a swap storm by the time you reach around 30k articles. I don't see anywhere in your code where you could easily free up memory using:
some_large_container = None
Setting something that you know has a large allocation to None tells Python's memory manager that it's available for garbage collection. You also might want to explicitly call gc.collect(), but I'm not sure that would do you much good.
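A minimal illustration, using the names from the question's loop:

import gc

html_r = None   # drop the raw page once the article text has been extracted
txt = None      # and the cleaned copy once it has been written to the dataframe
gc.collect()    # optionally ask the collector to run a pass now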
Alternatives you could consider:
sqlite3: Instead of a remote SQL database, use sqlite3 as intermediate storage; the sqlite3 module ships with Python's standard library.
Keep appending to the CSV checkpoint file.
Compress your strings with zlib.compress().
Whichever way you decide to go, you're probably best off doing the collection as phase 1 and constructing the Pandas dataframe as phase 2. It never pays to be too clever by half; the other half tends to hang you.
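As a rough illustration of the sqlite3 route (the table and column names here are made up for the example):

import sqlite3

# Phase 1: append each article to a local database as it is scraped,
# instead of holding everything in the dataframe.
conn = sqlite3.connect("articles.db")
conn.execute("CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, text TEXT)")

def save_article(url, text):
    conn.execute("INSERT OR REPLACE INTO articles (url, text) VALUES (?, ?)", (url, text))
    conn.commit()

# Phase 2, after scraping finishes: build the dataframe in one pass.
# import pandas as pd
# df = pd.read_sql_query("SELECT url, text FROM articles", conn)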

Multithreaded screen scraping help needed

I'm relatively new to Python, and I'm working through a screen-scraping application that gathers data from multiple financial sites. I have four procedures for now. Two run in just a couple of minutes, and the other two... hours each. Those two look up information on particular stock symbols that I have in a csv file; there are 4,000+ symbols that I'm using. I know enough to know that the vast majority of the time is spent on IO over the wire. It's essential that I get these down to 1/2 hour each (or better; is that too ambitious?) for this to be of any practical use to me. I'm using Python 3 and BeautifulSoup.
I have the general structure of what I'm doing below; I've abbreviated the conceptually non-essential sections. I'm reading many threads on making multiple calls/threads at once to speed things up, and it seems like there are a lot of options. Can anyone point me in the right direction to pursue, based on the structure of what I have so far? It'd be a huge help. I'm sure it's obvious, but this procedure gets called along with the other data-download procedures in a main driver module. Thanks in advance...
from bs4 import BeautifulSoup
import misc modules

class StockOption:
    def __init__(self, DateDownloaded, OptionData):
        self.DateDownloaded = DateDownloaded
        self.OptionData = OptionData

    def ForCsv(self):
        return [self.DateDownloaded, self.Optiondata]

def extract_options(TableRowsFromBeautifulSoup):
    optionsList = []
    for opt in range(0, len(TableRowsFromBeautifulSoup)):
        optionsList.append(StockOption(data parsed from TableRows arg))
    return optionsList

def run_proc():
    symbolList = read in csv file of tickers
    for symb in symbolList:
        webStr = #write the connection string
        try:
            with urllib.request.urlopen(webStr) as url: page = url.read()
            soup = BeautifulSoup(page)
            if soup.text.find('There are no All Markets results for') == -1:
                tbls = soup.findAll('table')
                if len(tbls[9]) > 1:
                    expStrings = soup.findAll('td', text=True, attrs={'align': 'right'})[0].contents[0].split()
                    expDate = datetime.date(int(expStrings[6]), int(currMonth), int(expStrings[5].replace(',', '')))
                    calls = extract_options(tbls[9], symb, 'Call', expDate)
                    puts = extract_options(tbls[13], symb, 'Put', expDate)
                    optionsRows = optionsRows + calls
                    optionsRows = optionsRows + puts
        except urllib.error.HTTPError as err:
            if err.code == 404:
                pass
            else:
                raise

    opts = [0] * (len(optionsRows))
    for option in range(0, len(optionsRows)):
        opts[option] = optionsRows[option].ForCsv()
    #Write to the csv file.
    with open('C:/OptionsChains.csv', 'a', newline='') as fp:
        a = csv.writer(fp, delimiter=',')
        a.writerows(opts)

if __name__ == '__main__':
    run_proc()
There are some mistakes in the abbreviated code you have given, so it is a little hard to follow. If you could show more of the code and check it, it would be easier to understand your problem.
From the code and problem description, I have some advice to share with you:
In the run_proc() function you read a webpage for every symbol. If the URLs are the same or some URLs are repeated, how about reading those webpages just once, caching them in memory or on disk, and then analyzing the page contents for every symbol? It will save a lot of time on network I/O.
BeautifulSoup is easy to write code with, but a little slow in performance. If lxml can do your work, it will save a lot of time on analyzing webpage contents.
Hope it will help.
I was pointed in the right direction from the following post (thanks to the authors btw):
How to scrape more efficiently with Urllib2?
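For reference, a minimal sketch of the thread-pool direction (the URL pattern, ticker list, and worker count below are placeholders, not values from the question):

from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

def fetch(symbol):
    # placeholder connection string; run_proc() would build the real one
    url = "https://example.com/options?symbol=" + symbol
    with urllib.request.urlopen(url) as resp:
        return symbol, resp.read()

symbolList = ["AAPL", "MSFT", "GOOG"]  # in practice, the tickers read from the csv file

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch, symb) for symb in symbolList]
    for future in as_completed(futures):
        symb, page = future.result()
        # parse `page` with BeautifulSoup/lxml here, as run_proc() does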
