Create a loop to extract URLs to JSON and CSV - Python

I set up a loop to scrape 37,900 records. Because of the way the URL/server is set up, each URL displays at most 200 records. Each URL ends with 'skip=200', or a multiple of 200, to move on to the next page where the next 200 records are displayed. Eventually I want to loop through all of the URLs and append the results as one table. Related post: unable to loop the last url with paging limits.
I created the loops shown below: one builds the URLs with skip= stepping every 200 records, another gets the response for each of those URLs, and a third reads the JSON and appends it to a single dataframe.
I'm not sure what's missing in my second loop; so far it only produces JSON for the first URL page and not the subsequent pages. I have the feeling the URL JSONs are not appended to the list json = [], which prevents looping over and appending the JSONs into the CSV. Any suggestions for modifying the loops and improving this code are appreciated!
import pandas as pd
import requests
import json
records = range(37900)
skip = records[0::200]
Page = []
for i in skip:
    endpoint = "https://~/Projects?&$skip={}".format(i)
    Page.append(endpoint)

jsnlist = []
for j in Page:
    response = session.get(j)  # session here refers to requests.Session() I had to set up to authenticate my access to these urls
    responsejs = response.json()
    responsejsval = responsejs['value']  # I only want to extract header called 'value' in each json
    with open('response2jsval.json', 'w') as outfile:
        json.dump(jsnlist, outfile)

concat = pd.DataFrame()
for k in jsnlist:
    df = pd.DataFrame(k)  # list to df
    concat = concat.append(df, ignore_index=True)
print(concat)

I have nothing to test against, but I think you've massively over-complicated this. You've since edited the question, but there are a couple of points to make:
You define jsnlist = [] but never use it. Why?
You called your own object json (now gone, but I'm not sure whether you understand why). Calling your own object json just supersedes the actual module, and the whole code will grind to a halt before you even get into a loop
There is no reason at all to save this data to disk before trying to create a dataframe
Opening the .json file in write mode ('w') will wipe all existing data on each iteration of your loop
Appending JSON to a file will not give a valid format that can be parsed when read back in. At best, it might be JSON Lines
Appending DataFrames in a loop has terrible complexity because it requires copying the original data on every iteration.
Your approach will be something like this:
import pandas as pd
import requests
import json
records = range(37900)
skip = records[0::200]
Page = []
for i in skip:
    endpoint = "https://~/Projects?&$skip={}".format(i)
    Page.append(endpoint)

jsnlist = []
for j in Page:
    response = session.get(j)  # session here refers to requests.Session() I had to set up to authenticate my access to these urls
    responsejs = response.json()
    responsejsval = responsejs['value']  # I only want to extract header called 'value' in each json
    jsnlist.append(responsejsval)

df = pd.DataFrame(jsnlist)
df = pd.DataFrame(jsnlist) might take some work, but you'll need to show what we're up against. I'd need to see responsejs['value'] to answer fully.
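If responsejs['value'] turns out to be a list of flat record dicts for each page (a common shape for endpoints that page with $skip), a minimal sketch of that last step, under that assumption, would flatten the pages and build the frame in a single call, writing the CSV once at the end (the output file name is just a placeholder):
from itertools import chain
import pandas as pd

# jsnlist is assumed to be a list of pages, each page being a list of record dicts
records = list(chain.from_iterable(jsnlist))
df = pd.DataFrame(records)
df.to_csv('projects.csv', index=False)  # placeholder output file name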

Related

How to make infinite Rest API requests and store the information?

I want to make one request per second and store all the data in a txt file.
However, saving the tuples to the txt file doesn't seem to work.
import bybit
import pandas as pd
import time
# request data
client = bybit.bybit(test=True, api_key="", api_secret="")
data = client.Market.Market_orderbook(symbol="BTCUSD").result()

# create request loop
h = 1
while h > 0:
    time.sleep(1)
    final = data + data
    # save in txt
    with open('orderbookdata.txt', 'w') as f:
        for tup in final:
            f.write(u" ".join(map(unicode, tup)) + u"\n")
PS: the keys given are read-only, from a test net.
You could simply save each piece of data on a new line, which requires opening the file for appending, not writing:
with open('file', 'a')  # open for appending
Appending creates the file if it doesn't exist, but if the file does exist it doesn't wipe it the way write mode does; it adds to the end of the file.
First of all, an infinite while should be while True:. Second, you are reading the value once and then saving that same data on every line. Also, making a request every second may not be permitted by the site; check that in the docs.
import bybit
import pandas as pd
import time
client = bybit.bybit(test=True, api_key="DzX0ObRek383f7BP4f", api_secret="wjZKH8MKJehLXv4iTplJiSxn1bg8rw49Vlbt")
while True:
    data = client.Market.Market_orderbook(symbol="BTCUSD").result()  # read data on each run
    with open('orderbookdata.txt', 'a') as f:  # open for appending
        f.write('{}\n'.format(data))  # write data onto a new line
    time.sleep(1)  # sleep
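If the file needs to be parsed back later, a hedged variant writes one JSON document per line (JSON Lines) instead of the raw Python representation. This is only a sketch: it assumes result() returns a (body, response) pair whose body is JSON-serialisable, which may differ between bybit client versions:
import json
import time

import bybit

client = bybit.bybit(test=True, api_key="", api_secret="")

while True:
    body = client.Market.Market_orderbook(symbol="BTCUSD").result()[0]  # assumed: the body is the first element
    with open('orderbookdata.jsonl', 'a') as f:
        f.write(json.dumps(body) + '\n')  # one JSON document per line
    time.sleep(1)
Each line can then be read back individually with json.loads.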

Is there a way to store the value of all the GET requests my program does?

So I have the following program:
client = Socrata("www.datos.gov.co", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("www.datos.gov.co",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# First results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("gt2j-8ykr", limit=800000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
Now, every time I run the code the variable results has a new, updated value, as expected, and so does the dataframe results_df. What I want to do is save a record of all the GET requests my program makes (more precisely, I just want len(results_df)). Some people have suggested making a list and appending len(results_df), but that obviously does not work: it only appends the current value of len(results_df) and does not keep the values from previous runs, so every time I run the code I end up with a list containing the single current value. What I want is a list that keeps the values of len(results_df) from previous program executions.
I'm sorry if this is a silly question, but I'm new to coding and I could not find a solution anywhere. Thanks.
Use persistent file storage and store the result length on disk:
def write_log_of_lengths(dataframelength):
    from datetime import datetime
    import os.path

    log_name = "my_request_log"
    if not os.path.isfile(log_name):
        with open(log_name, "w") as f:
            f.write("datetime,length_data\n")
    with open(log_name, "a") as f:
        f.write(f"{datetime.now()},{dataframelength}\n")
and then use
# your code
results_df = pd.DataFrame.from_records(results)
write_log_of_lengths(len(results_df))
Example:
write_log_of_lengths(5)
write_log_of_lengths(7)
write_log_of_lengths(22)
to get a file with
datetime,length_data
2020-09-07 07:37:17.889504,5
2020-09-07 07:37:17.892475,7
2020-09-07 07:37:17.895424,22
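To inspect the history later, the log file can be read straight back into pandas (a usage sketch; the file name matches the helper above):
import pandas as pd

log = pd.read_csv("my_request_log", parse_dates=["datetime"])
print(log.tail())  # the most recent request lengths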

Scraping only select fields from a JSON file

I'm trying to extract only the following JSON data fields, but for some reason it writes the entire page to the .html file. What am I doing wrong? It should only produce the fields referenced, e.g. title, audioSource URL, medium-sized image, etc.
r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))
for post in data['posts']:
    # data.append([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
    ([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])

with io.open('criminal-json.html', 'w', encoding='utf-8') as r:
    r.write(json.dumps(data, ensure_ascii=False))
You want to differentiate between your input data and your output data. In your for loop, you are referencing the same variable data for both input and output. You want to add the selected fields from the input to a separate list that holds the output.
Don't re-use the same variable names. Here is what you want:
import urllib
import json
import io
url = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(url.read().decode('utf-8'))
posts = []
for post in data['posts']:
    posts.append([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])

with io.open('criminal-json.html', 'w', encoding='utf-8') as r:
    r.write(json.dumps(posts, ensure_ascii=False))
You are loading the whole JSON into the variable data and then dumping it without changing it; that's why this is happening. What you need to do is put whatever you want into a new variable and then dump that.
Look at this line:
([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
It does nothing, so data remains unchanged. Do what Mark Tolonen suggested and it'll be fine.
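If the end goal is a spreadsheet-friendly file rather than an .html dump, a hedged variation on the same idea writes the selected fields to a CSV instead. This is a Python 3 sketch (the original question uses Python 2's urllib.urlopen) and the column names are assumptions:
import csv
import json
from urllib.request import urlopen

url = 'https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1'
data = json.loads(urlopen(url).read().decode('utf-8'))

with open('criminal.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'audio_url', 'image_medium', 'excerpt'])  # assumed column names
    for post in data['posts']:
        writer.writerow([post['title'], post['audioSource'],
                         post['image']['medium'], post['excerpt']['long']])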

Using xlwings, how do I print JSON data into multiple columns from multiple URLs?

I am using grequests to pull JSON data from multiple URLs. Now I want to print those results to Excel using xlwings. Here is the code I have now.
import xlwings as xw
import grequests
import json
urls = [
    'https://bittrex.com/api/v1.1/public/getorderbook?market=BTC-1ST&type=both&depth=50',
    'https://bittrex.com/api/v1.1/public/getorderbook?market=BTC-AMP&type=both&depth=50',
    'https://bittrex.com/api/v1.1/public/getorderbook?market=BTC-ARDR&type=both&depth=50',
]

requests = (grequests.get(u) for u in urls)
responses = grequests.map(requests)

for response in responses:
    BQuantity = [response.json()['result']['buy'][0]['Quantity'],
                 response.json()['result']['buy'][1]['Quantity'],
                 response.json()['result']['buy'][2]['Quantity'],
                 response.json()['result']['buy'][3]['Quantity'],
                 response.json()['result']['buy'][4]['Quantity']
                 ]

    wb = xw.Book('Book2')
    sht = wb.sheets['Sheet1']
    sht.range('A1:C5').options(transpose=True).value = BQuantity
This works just fine, but only if I comment out all but one URL; otherwise the results from the first URL are overwritten by the results from the second URL, which is expected behaviour. However, this is not what I want. In the end I want the results from the first URL to go into column A, the results from the second URL into column B, and so on.
I am able to pull in each individual link with requests (instead of grequests) one by one; however, this operation is going to involve a couple of hundred URLs, so pulling the data in one by one is very time-consuming. grequests pulls in 200 URLs and dumps to a JSON file in about 8 seconds, compared to about 2 minutes with plain requests.
Any help would be appreciated.
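No answer is recorded here, but one hedged sketch of the per-column idea (untested; the Bittrex v1.1 endpoints may no longer be live) is to open the workbook once and let enumerate pick the target column for each response:
import grequests
import xlwings as xw

urls = [
    'https://bittrex.com/api/v1.1/public/getorderbook?market=BTC-1ST&type=both&depth=50',
    'https://bittrex.com/api/v1.1/public/getorderbook?market=BTC-AMP&type=both&depth=50',
    'https://bittrex.com/api/v1.1/public/getorderbook?market=BTC-ARDR&type=both&depth=50',
]

responses = grequests.map(grequests.get(u) for u in urls)

wb = xw.Book('Book2')  # open the workbook once, outside the loop
sht = wb.sheets['Sheet1']

for col, response in enumerate(responses, start=1):
    buys = response.json()['result']['buy'][:5]
    quantities = [b['Quantity'] for b in buys]
    # write the list down column `col` (1 = A, 2 = B, ...), starting in row 1
    sht.range((1, col)).options(transpose=True).value = quantities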

How do I repeatedly call an API using IDs from a CSV and write the output to a new CSV?

I'm trying to pull song data for a hundred different songs from the Echonest API. I have the IDs for each song in a CSV file - I'm trying to write a script that reads the IDs, appends them to the API url, and writes the data to a new CSV, but I'm having a little trouble.
Is there a good way to pull the ID codes and append them to the URL in a loop? This is what I have so far; I'm not sure how or where to put the part that adds the IDs to the URL.
import urllib2
import json
import csv
from time import sleep
outfile_path='/Users/path/to/file.csv'
api_url = 'http://developer.echonest.com/api/v4/song/profile?'
API_KEY = ''
writer = csv.writer(open(outfile_path))
with open('/Users/path/to/file.csv') as f:
    for row in csv.DictReader(f):
        song_id = row['id']
        qs = urllib.urlencode({"api_key": API_KEY,
                               "bucket": "audio_summary",
                               "id": song_id})
        url = '{}?{}'.format(API_URL, qs)
        parsed_json = json.load(resource)
        for song in parsed_json['results']:
            row = []
            writer.writerow({k: v.encode('utf-8') for k, v in song.items()})
            sleep(5)
I'm not sure which part you're stuck on (there are a lot of problems with your posted code that will prevent it from even compiling, much less getting to your real problem, and you haven't described the problem), but there seem to be two likely places.
First, I'm not sure you know how to open a CSV file and get values out of it. You're trying to open a directory rather than a file, and you're not doing anything with the rows, and then you're trying to do the inner loop 100 times for each of your 100 rows when I'm 99% sure you just want to do it once for each of your 100 rows.
If you use csv.reader, you have to know which column number the IDs are in; it's a lot easier with a csv.DictReader, because you only have to know what that column name is. So, let's do that:
with open('/path/to/inputfile.csv') as f:
    for row in csv.DictReader(f):
        song_id = row['id']
        # make and process request with song_id
If your CSV file doesn't have a header row, then just use a reader, and put the column number (e.g., 0 for the first column) in place of 'id'.
Now, what you want to do with that ID is to stick each one in a URL. You can do that by using string formatting. For example:
URL_TEMPLATE = 'http://developer.echonest.com/api/v4/song/profile?api_key=&bucket=audio_summary&id={}'
# ... inside the for loop ...
song_id = row['id']
url = URL_TEMPLATE.format(song_id)
resource = urllib2.urlopen(url)
parsed_json = json.load(resource)
You're also going to need to fill in your api_key, or EchoNest won't accept your query, so:
URL_TEMPLATE = 'http://developer.echonest.com/api/v4/song/profile?api_key={}&bucket=audio_summary&id={}'
API_KEY = "<your API key goes here>"
# ... inside the for loop ...
url = URL_TEMPLATE.format(API_KEY, song_id)
However, it's usually better to use urlencode to generate a query string, instead of trying to do it through string methods. Besides being more readable, that will take care of things you probably haven't even thought of, like encoding any URL-unfriendly characters in your values. So:
API_URL = 'http://developer.echonest.com/api/v4/song/profile'
# ... inside the for loop ...
qs = urllib.urlencode({"api_key": API_KEY,
                       "bucket": "audio_summary",
                       "id": song_id})
url = '{}?{}'.format(API_URL, qs)
And then you just need the part that loops over parsed_json['results'] and writes out rows, which you've already written. But two notes on that.
First, str(foo.encode('utf-8')) is unnecessary; encode already returns a str.
Second, you've got a whole lot of unnecessary repeated code to build up that row. You're doing the same thing for each key in the song dict, so why not just use a DictWriter and leave it as a dict:
writer.writerow({k: v.encode('utf-8') for k, v in song.items()})
… or, if you prefer to stick with a plain csv.writer, just use operator.itemgetter to fetch them all at once into a list:
writer.writerow([v.encode('utf-8') for v in itemgetter(*headers)(song)])
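Putting the pieces together, a minimal sketch of the whole loop might look like the following. The input/output paths, the 'id' column name, and the output header list are all assumptions, and the Echo Nest API may no longer be available:
import csv
import json
import urllib
import urllib2
from time import sleep

API_URL = 'http://developer.echonest.com/api/v4/song/profile'
API_KEY = '<your API key goes here>'
HEADERS = ['id', 'title', 'artist_name']  # assumed output columns

# assumed input/output paths
with open('/Users/path/to/input.csv') as infile, open('/Users/path/to/output.csv', 'wb') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=HEADERS, extrasaction='ignore')
    writer.writeheader()
    for row in csv.DictReader(infile):
        qs = urllib.urlencode({"api_key": API_KEY,
                               "bucket": "audio_summary",
                               "id": row['id']})
        resource = urllib2.urlopen('{}?{}'.format(API_URL, qs))
        parsed_json = json.load(resource)
        for song in parsed_json['results']:
            writer.writerow({k: v.encode('utf-8') if isinstance(v, unicode) else v
                             for k, v in song.items()})
        sleep(5)  # stay well under the API rate limit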
