Save a string to reuse in code and avoid fetching the same information multiple times - python

I am working on some code where I fetch information from an online database through the API they provide. The information I receive is a large block of data arranged in square and curly brackets. At the moment, I fetch the whole block of data every time I want a specific parameter, which is a problem because I am limited in how many queries I can send through their API. I would therefore like to fetch the large block of data once and save it as a variable in the code that I then reference whenever I want specific parts of it. That way I can get all the specific parts of the data with only one request.
As of now the code I have looks like this:
from monkeylearn import MonkeyLearn
ml = MonkeyLearn('my personal API-key')
model_id = 'the id of the model'
text1 = 'This is the text that is analyzed by monkeylearn'
data = ['first text', {'text': text1, 'external_id': 'ANY_ID'}, '']
response = ml.extractors.extract(model_id, data).body
company_name_tag = response[1]['extractions'][0]['tag_name']
company_name = response[1]['extractions'][0]['extracted_text']
response contains all the information I get from the request, so I would like to only have to fetch it once. As it is now, if I were to print(company_name_tag) and print(company_name), it would fetch the data through the API two times. This leads to me reaching my API limit a lot faster than necessary.
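To make the goal concrete, here is a minimal sketch of what I have in mind (untested; it is essentially the code above, with comments marking where the single request happens):
from monkeylearn import MonkeyLearn

ml = MonkeyLearn('my personal API-key')
model_id = 'the id of the model'
text1 = 'This is the text that is analyzed by monkeylearn'
data = ['first text', {'text': text1, 'external_id': 'ANY_ID'}, '']

# One request: the whole block of data is stored in the `response` variable.
response = ml.extractors.extract(model_id, data).body

# These lookups should read from the stored variable, not trigger new requests.
company_name_tag = response[1]['extractions'][0]['tag_name']
company_name = response[1]['extractions'][0]['extracted_text']

print(company_name_tag)
print(company_name)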
I appreciate all help with this issue!

Related

python: parsing a SOAP protocol response header and value

Trying to write the question one more time because the first time the problem was not clear.
I'm trying to extract output from a SOAP protocol using Python.
The output is structured with a lot of subcategories, and this is where I can't extract the information properly.
The output I have is very long, so here is a short example from soapenv:Body.
The original response (SOAP body):
<mc:ocb-rule id="boic">
  <mc:id>boic</mc:id>
  <mc:ocb-conditions>
    <mc:rule-deactivated>true</mc:rule-deactivated>
    <mc:international>true</mc:international>
  </mc:ocb-conditions>
  <mc:cb-actions>
    <mc:allow>false</mc:allow>
  </mc:cb-actions>
</mc:ocb-rule>
As you can see, I also used the command
xmltodict.parse(soap_response)
to turn the output into a dictionary, which gives:
OrderedDict([('#id', 'boic'),
             ('mc:id', 'boic'),
             ('mc:ocb-conditions',
              OrderedDict([('mc:rule-deactivated', 'true'),
                           ('mc:international', 'true')])),
             ('mc:cb-actions',
              OrderedDict([('mc:allow', 'false')]))])
If there is an easier way, I would appreciate guidance
As I mentioned, my goal is ultimately to get each category as its own value; if there is a subcategory, it should be added as a separate category.
In the end I will have to put all the values into a DataFrame and display all the categories and their values, for example:
[table example]
I really hope that this time I was able to explain myself properly
Thanks in advance
I am trying to parse the SOAP response and insert all the values and headers into a DataFrame.
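One possible starting point (not from the original thread, just a sketch): recursively flatten the parsed dict so each subcategory becomes its own dotted column name, then load the result into a one-row pandas DataFrame. Repeated elements, which xmltodict returns as lists, are not handled here.
import pandas as pd
import xmltodict

soap_body = (
    '<mc:ocb-rule id="boic"><mc:id>boic</mc:id>'
    '<mc:ocb-conditions><mc:rule-deactivated>true</mc:rule-deactivated>'
    '<mc:international>true</mc:international></mc:ocb-conditions>'
    '<mc:cb-actions><mc:allow>false</mc:allow></mc:cb-actions></mc:ocb-rule>'
)
parsed = xmltodict.parse(soap_body)

def flatten(d, parent_key=''):
    # Join nested keys with a dot, e.g. 'mc:ocb-conditions.mc:rule-deactivated'.
    items = {}
    for key, value in d.items():
        new_key = f'{parent_key}.{key}' if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key))
        else:
            items[new_key] = value
    return items

flat = flatten(parsed['mc:ocb-rule'])
df = pd.DataFrame([flat])
print(df)  # one row; every (sub)category is its own column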

Scraping data from an HTTP & JavaScript site

I currently want to scrape some data from an Amazon page and I'm kind of stuck.
For example, let's take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found opening the source code and searching for 'variationValues'.
There we can see sort of a dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimentionIndexMap we can see
"B01KWIUH5M":[0,0]
Which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in variationValues size_name section) and the color 'Teal' (same idea as before)
I want to scrape both the variationValues and the asinToDimentionIndexMap, so I can associate the IndexMap numbers with the variationValues ones.
Another person on the site (thanks for the help, btw) suggested doing it this way:
import re
import json

# the text of the first <script> tag, as one string
script = response.xpath('//script/text()').extract_first()
# capture everything between curly brackets
data = re.findall(r'(\{.+?\})', script)
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of JSON is not that great and reading some stuff about it didn't help that much.
Is there a way to get, from that data, two dictionaries or lists with the variationValues and asinToDimentionIndexMap (maybe using some regular expressions in the middle to get some data out of a big string)? Or could someone explain a little bit what happens in the JSON part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to turn string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming where you may be finding things a challenge is when there are errors accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
Eg. If your 'asinToDimentionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimentionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another" - which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to JSON and combine them however you wish.
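As a rough follow-up (assuming the regexes above matched and the captured substrings are valid JSON), each string can be loaded with json.loads and the two mappings combined. The key names 'size_name' and 'color_name' and the two-element index lists are assumptions based on the example in the question:
import json

variation_values = json.loads(variationValues)    # e.g. {"size_name": [...], "color_name": [...]}
index_map = json.loads(asinToDimensionIndexMap)   # e.g. {"B01KWIUH5M": [0, 0], ...}

variants = {}
for asin, (size_idx, color_idx) in index_map.items():
    variants[asin] = {
        'size': variation_values['size_name'][size_idx],
        'color': variation_values['color_name'][color_idx],
    }

print(variants.get('B01KWIUH5M'))  # e.g. {'size': '8M US', 'color': 'Teal'}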

Discogs API => How to retrieve genre?

I've crawled a tracklist of 36,000 songs that have been played on the Danish national radio station P3. I want to do some statistics on how frequently each of the genres has been played within this period, so I figured the Discogs API might help with labeling each track with a genre. However, the documentation for the API doesn't seem to include an example for querying the genre of a particular song.
I have a CSV file with 3 columns: Artist, Title & Test (Test is where I want the API to label each song with the genre).
Here's a sample of the script I've built so far:
import json
import pandas as pd
import requests
import discogs_client

d = discogs_client.Client('ExampleApplication/0.1')
d.set_consumer_key('key-here', 'secret-here')

input = pd.read_csv('Desktop/TEST.csv', encoding='utf-8', error_bad_lines=False)
df = input[['Artist', 'Title', 'Test']]
df.columns = ['Artist', 'Title', 'Test']

for i in range(0, len(list(df.Artist))):
    x = df.Artist[i]
    g = d.artist(x)
    df.Test[i] = str(g)

df.to_csv('Desktop/TEST2.csv', encoding='utf-8', index=False)
This script has been working with a dummy file with 3 records in it so far, for mapping the artist of a given ID#. But as soon as the file gets larger (e.g. 2000 records), it returns an HTTPError when it cannot find the artist.
I have some questions regarding this approach:
1) Would you recommend using the search query function in the API for retrieving a variable such as 'Genre'? Or do you think it is possible to retrieve Genre with a 'd.' function from the API?
2) Will I need to acquire an API key? I have successfully mapped the 3 records without an API key so far. It looks like the key is free, though.
Here's the guide I have been following:
https://github.com/discogs/discogs_client
And here's the documentation for the API:
https://www.discogs.com/developers/#page:home,header:home-quickstart
Maybe you need to re-read the discogs_client examples; I am not an expert myself, just a newbie trying to use this API.
AFAIK, g = d.artist(x) fails because x must be an integer, not a string.
So you must first do a search, then get the artist id, then call d.artist(artist_id).
Sorry for not providing an example, I am a Python newbie right now ;)
Also have you checked acoustid for
It's probably a rate limit.
Read the status code of your response; you should find a 429 Too Many Requests.
Unfortunately, if that's the case, the only solution is to add a sleep in your code to make one request per second.
Check out the API doc:
http://www.discogs.com/developers/#page:home,header:home-rate-limiting
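For illustration, a sketch of throttling the loop from the question to roughly one request per second (the lookup itself is kept exactly as in the question; a broad except is used only to keep the sketch short):
import time

for i in range(len(df)):
    try:
        g = d.artist(df.Artist[i])       # same lookup as in the question
        df.loc[i, 'Test'] = str(g)
    except Exception as exc:             # e.g. an HTTPError / 429 Too Many Requests
        print(f'Row {i} failed: {exc}')
    time.sleep(1)                        # stay at roughly one request per second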
I found this guide:
https://github.com/neutralino1/discogs_client.
Access the API with your key and try something like:
d = discogs_client.Client('something.py', user_token=auth_token)
release = d.release(774004)
genre = release.genres
If you found a better solution please share.
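Building on that snippet, here is a hedged sketch of going from artist/title to genres via the client's search; the method and attribute names follow the discogs_client README, and the first search hit may not always be the right release:
import discogs_client

d = discogs_client.Client('ExampleApplication/0.1', user_token='token-here')

def genre_for(artist, title):
    # Search for a release matching the track and take the first hit, if any.
    results = d.search(title, artist=artist, type='release')
    try:
        return results[0].genres   # a list of genre strings, e.g. ['Electronic']
    except IndexError:
        return None                # no matching release found

print(genre_for('Some Artist', 'Some Title'))  # placeholder artist/title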

Flask template streaming with Jinja

I have a Flask application. On a particular view, I show a table with about 100k rows in total. It's understandably taking a long time for the page to load, and I'm looking for ways to improve it. So far I've determined that the database query returns a result fairly quickly, so I think the problem lies in rendering the actual page. I've found this page on streaming and am trying to work with that, but keep running into problems. I've tried the stream_template solution provided there with this code:
@app.route('/thing/matches', methods=['GET', 'POST'])
@roles_accepted('admin', 'team')
def r_matches():
    matches = Match.query.filter(
        Match.name == g.name).order_by(Match.name).all()
    return Response(stream_template('/retailer/matches.html',
                                    dashboard_title=g.name,
                                    match_show_option=True,
                                    match_form=form,
                                    matches=matches))

def stream_template(template_name, **context):
    app.update_template_context(context)
    t = app.jinja_env.get_template(template_name)
    rv = t.stream(context)
    rv.enable_buffering(5)
    return rv
The Match query is the one that returns 100k+ items. However, whenever I run this the page just shows up blank with nothing there. I've also tried the solution of streaming the data to a JSON file and loading it via AJAX, but nothing seems to be in the JSON file either! Here's what that solution looks like:
@app.route('/large.json')
def generate_large_json():
    def generate():
        app.logger.info("Generating JSON")
        matches = Product.query.join(Category).filter(
            Product.retailer == g.retailer,
            Product.match != None).order_by(Product.name)
        for match in matches:
            yield json.dumps(match)
    app.logger.info("Sending file response")
    return Response(stream_with_context(generate()))
Another solution I was looking at was pagination. That solution works well, except I need to be able to sort the entire dataset by column headers, and I couldn't find a way to do that without rendering the whole dataset in the table and then using jQuery for sorting/pagination.
The file I get by going to /large.json is always empty. Please help or recommend another way to display such a large data set!
Edit: I got the generate() part to work and updated the code.
The problem in both cases is almost certainly that you are hanging while building 100K+ Match items and storing them in memory. You will want to stream the results from the DB as well, using yield_per. However, only Postgres+psycopg2 support the necessary stream_results argument (here's a way to do it with MySQL):
matches = Match.query.filter(
    Match.name == g.name).order_by(Match.name).yield_per(10)
# Stream ten results at a time
An alternative
If you are using Flask-SQLAlchemy you can make use of its Pagination class to paginate your query server-side and not load all 100K+ entries into the browser. This has the added advantage of not requiring the browser to manage all of the DOM entries (assuming you are doing the HTML streaming option).
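For instance, a rough sketch of the server-side pagination idea, using parameter and attribute names from Flask-SQLAlchemy's paginate(); the template, sorting and role checks from the question are left out:
from flask import g, render_template, request

# Match and app come from the application in the question.
@app.route('/thing/matches')
def r_matches():
    page = request.args.get('page', 1, type=int)
    pagination = (Match.query
                  .filter(Match.name == g.name)
                  .order_by(Match.name)
                  .paginate(page=page, per_page=100, error_out=False))
    # pagination.items holds only this page's rows; pagination.has_prev,
    # pagination.has_next and pagination.pages can drive the navigation links.
    return render_template('/retailer/matches.html',
                           matches=pagination.items,
                           pagination=pagination)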
See also
SQLAlchemy: Scan huge tables using ORM?
How to Use SQLAlchemy Magic to Cut Peak Memory and Server Costs in Half

Unable to iteratively call yahoo's term extractor api using python

I am trying to loop through some 50-odd files in a directory. Each file has some text for which I am trying to find the keywords using the Yahoo Term Extractor. I am able to extract text from each file, but I am not able to iteratively call the API using the text as input. Only the keywords for the first file are displayed.
Here is my code snippet:
In the 'comments' list, I have extracted and stored the text from each file.
for c in comments:
    print "building query"
    dataDict = [('appid', appid), ('context', c)]
    queryData = urllib.urlencode(dataDict)
    request.add_data(queryData)
    print "fetching result"
    result = OPENER.open(request).read()
    print result
    time.sleep(1)
Well I don't know anything about the Yahoo Term Extractor, but I'd presume that your call request.add_data(queryData) simply tacks on another data set with each iteration of your loop. And then the call to OPENER.open(request).read() would probably only process the results of the first data set. So either your request object can only hold one query, or your OPENER object's inner workings can only process one query, it's as simple as that.
Actually a third reason comes to mind now that I read the documentation provided at your link, and this is probably the true one:
RATE LIMITS
The Term Extraction service is limited to 5,000 queries per IP address per day and to noncommercial use. See information on rate limiting.
So it would make sense that the API would limit your usage to one query at a time, and not allow you to flood a bunch of queries in a single request.
In any event, I'd assume you could fix your problem in a "naive" way by having many request variables instead of just one, or maybe just creating a new request with every iteration of your loop. If you're not worried about storing your results, and just trying to debug, you could try:
for c in comments:
    print "building query"
    dataDict = [('appid', appid), ('context', c)]
    queryData = urllib.urlencode(dataDict)
    request = urllib2.Request()  # I don't know how to initialize this variable, do it yourself
    request.add_data(queryData)
    print "fetching result"
    result = OPENER.open(request).read()
    print result
    time.sleep(1)
Again, I don't know about the Yahoo Term Extractor (nor do I really have time to research it), so there may very well be a better, more native way to do this. If you post more details of your code (i.e. what classes the request and OPENER objects come from) then I might be able to elaborate on this.
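For completeness, a sketch of the "new request each iteration" idea in the same Python 2 style as the original code. SERVICE_URL is a placeholder, not the real Term Extraction endpoint, and appid, comments and OPENER come from the question:
import time
import urllib
import urllib2

SERVICE_URL = 'http://termextract.example/query'  # placeholder endpoint

for c in comments:
    print "building query"
    queryData = urllib.urlencode([('appid', appid), ('context', c)])
    request = urllib2.Request(SERVICE_URL, data=queryData)  # a fresh Request per file
    print "fetching result"
    print OPENER.open(request).read()
    time.sleep(1)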
