I have a list that I am constantly appending elements to, and while that happens I want to check that it isn't empty and pull elements from it at the same time. Normally we wait until all the elements have been appended and only then read from the list and do something with each element, which wastes the time spent waiting for the list to fill up. What do I need to learn to make this happen (multiprocessing, multiprocessing.dummy, asynchronous code)? Sorry, I am still new to this. I think it's better to explain why I want this kind of effect: the problem comes from a web crawler.
import requests
from model import Document


def add_concrete_content(input_list):
    """input_list data structure [{'url': 'xxx', 'title': 'xxx'}...]"""
    for e in input_list:
        r = requests.get(e['url'])
        html = r.content
        e['html'] = html
    return input_list


def save(input_list):
    for e in input_list:
        Document.create(**e)


if __name__ == '__main__':
    res = add_concrete_content(list)
    save(res)
"""this is I normally do, I save data to mysql or whatever database,but
I think the drawback is I have to wait all the html add to dict and then
save to database, what if I have to deal with tons of data? Can I save
the dict with the html first? Can I save some time? A friend of mine
said this is a typical producer consumer problem, probably gonna use
at least two threads and lock, because without lock, data probably
gonna fall into disorder"""
You're being a bit vague, and I think there's a misconception in the way you want things to happen.
You don't need any extra Python rocket science to do what you want:
you can check whether the list is empty simply with if list_: (where list_ is your list);
you can read any element with list_[idx] (where idx is the index of the element). For example, list_[0] gets you the first element of the list, while list_[-1] gets the last one.
That said, you don't have to wait for all the elements to be added to the list if you need to process them as you go. You might be looking for something like this:
def push(list_):
    count = 0
    while True:
        list_.append(count)
        f()
        count += 1
        if count == 1000:
            break


def f():
    print('First element: {}'.format(list_[0]))
    print('Last element: {}'.format(list_[-1]))


if __name__ == '__main__':
    list_ = []
    push(list_)
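If you do want the download and the save to overlap, as your friend suggested, the usual tool is queue.Queue, which does the locking for you, so you don't need an explicit lock. Here is a minimal producer/consumer sketch along the lines of your crawler; it reuses the Document model and the list-of-dicts structure from your code, and the example URL is just a placeholder:

import threading
import queue  # named Queue in Python 2

import requests
from model import Document

SENTINEL = None  # marks the end of the work


def producer(input_list, q):
    """Fetch each page and put the enriched dict on the queue as soon as it is ready."""
    for e in input_list:
        e['html'] = requests.get(e['url']).content
        q.put(e)
    q.put(SENTINEL)  # tell the consumer there is nothing more to come


def consumer(q):
    """Save each dict as soon as it arrives, without waiting for the rest."""
    while True:
        e = q.get()
        if e is SENTINEL:
            break
        Document.create(**e)


if __name__ == '__main__':
    input_list = [{'url': 'http://example.com', 'title': 'example'}]  # your real data here
    q = queue.Queue()
    t = threading.Thread(target=producer, args=(input_list, q))
    t.start()
    consumer(q)  # runs in the main thread while the producer keeps fetching
    t.join()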
I am pretty new to Python, and am more used to JS, so I am a little lost on how to do this.
Basically I have JSON from a Google API, and the first result isn't always valid for what I need, but I only need the first result that is valid.
I'm pretty sure I have the syntax wrong in more than one area, but I need the first imageUrl where [gi]['pagemap'] is true.
item_len = len(deserialized_output['items'])

for gi in range(item_len):
    def loop_tgi():
        if deserialized_output['items'][gi]['pagemap'] is True:
            imageUrl = deserialized_output['items'][gi]['pagemap']['cse_image'][0]['src']
            break
    loop_tgi()
You could iterate over the items directly, without using an index.
In Python a for loop leaks its variable, so when you break out of the loop, gi will hold what you need (or the last value if nothing matched).
To deal with the "or the last value" case we can use the loop's else clause to detect that we went through the whole loop without a break:
for gi in deserialized_output['items']:
    if gi['pagemap'] is True:
        break
else:
    gi = None  # or throw some sort of exception when there is no good element

if gi:  # checking that the returned element is good
    print(gi)  # now we can use that element to do what you want!
    imageUrl = gi['pagemap']['cse_image'][0]['src']
I am a bit worried about your gi['pagemap'] is True, because later you access gi['pagemap']['cse_image']. That means gi['pagemap'] is not a boolean but some sort of object.
If it is a dict, you could check if gi['pagemap']:, which is truthy if the dict is not empty; but gi['pagemap'] is True would be False when gi['pagemap'] is {'cse_image': ...}.
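A quick illustration of the difference (the dict value here is just made up for the example):

pagemap = {'cse_image': [{'src': 'http://example.com/img.png'}]}

print(pagemap is True)  # False: a non-empty dict is not the object True
print(bool(pagemap))    # True: but it is truthy, so `if pagemap:` passes

print(bool({}))         # False: an empty dict is falsy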
I am extremely new to Python and programming in general (I basically started a few days ago) so forgive me if I use the wrong terms or if I'm asking a silly question.
I’m writing a web scraper to get some data from a job vacancy website. I've written some code that first of all downloads the data from the main search results page, parses it and extracts from it the headings which contain a link to each of the vacancy pages where the details of each specific vacancy can be found. Then I’ve written code that opens each link and parses the html from each vacancy page.
Now this all works fine. The issue I have is with the following. I want to scrape some data from each of these vacancy pages and save the data for each vacancy in a separate list so that later I can put all these lists in a data frame. I’ve therefore been looking for a way to number or ‘index’ (if that is the right term to use) each list so that I can refer to them later. Below is the code I have at the moment. Following the advice I found by reading existing answers on Stackoverflow I’ve tried to use enumerate to create an index which I can assign to each list, as follows:
vacancy_headings = resultspage1_soup.body.findAll("a", class_="vacancy-link")
vacancydetails = []

for index, vacancy in enumerate(vacancy_headings, start=0):
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk", vacancy["href"])
    vacancypage_client = urlopen(vacancypage_url)
    vacancypage_html = vacancypage_client.read()
    vacancypage_soup = soup(vacancypage_html, "html.parser")
    vacancydetails[index] = []

    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            if p["itemprop"] == "employmentType" or p["itemprop"] == "streetAddress" or p["itemprop"] == "addressLocality" or p["itemprop"] == "addressRegion" or p["itemprop"] == "postalCode":
                cells = p.text
                vacancydetails[index].append(cells)
But I get the following error message:
IndexError                                Traceback (most recent call last)
<ipython-input-10-b8a75df16395> in <module>()
      9     vacancypage_html = vacancypage_client.read()
     10     vacancypage_soup = soup(vacancypage_html, "html.parser")
---> 11     vacancydetails[index]=[]
     12
     13     for p in vacancypage_soup.select("p"):

IndexError: list assignment index out of range
Could someone explain to me (in easy-to-understand language if possible!) what is going wrong, and how I can fix this problem?
Thanks!!
Since vacancydetails is a list, trying to access a position in the list that doesn't exist is an error. And, when you first create it, the list is empty. So, before accessing any elements from the list, you'll need to first create those elements.
Thus, instead of this:
vacancydetails[index]=[]
...you want to append a new item to the list (and that new item happens to be an empty list itself), like this:
vacancydetails.append([])
The list vacancydetails is empty until you append to it (or assign to it from somewhere else). Because index is counting up from 0, you just want to manipulate the currently-final entry in vacancydetails in the for p loop.
So, rather than vacancydetails[index]=[] you want vacancydetails.append([]). But then the more pythonic thing to do is work with the last entry in vacancydetails, i.e., vacancydetails[-1], in which case you never need the index variable.
for vacancy in vacancy_headings:
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk", vacancy["href"])
    ### ...
    vacancydetails.append([])

    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            ### ...
            vacancydetails[-1].append(cells)
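Since you mentioned putting the lists into a data frame later, here is a rough sketch of that step with pandas; the column names are only a guess based on the itemprop values you filter on, so adjust them to match the order the fields actually appear in:

import pandas as pd

# One row per vacancy; these column names are assumptions based on the
# itemprop values collected above -- rename them to suit your data.
columns = ["employmentType", "streetAddress", "addressLocality", "addressRegion", "postalCode"]
df = pd.DataFrame(vacancydetails, columns=columns)
print(df.head())

Note that if some pages are missing one of those fields, the rows will have different lengths, and you may want to collect dicts instead of lists before building the data frame.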
I'd like to make a program that makes offline copies of math questions from Khan Academy. I have a huge 21.6MB text file that contains data on all of their exercises, but I have no idea how to start analyzing it, much less start pulling the questions from it.
Here is a pastebin containing a sample of the JSON data. If you want to see all of it, you can find it here. Warning for long load time.
I've never used JSON before, but I wrote up a quick Python script to try to load up individual "sub-blocks" (or equivalent, correct term) of data.
import sys
import json

exercises = open("exercises.txt", "r+b")
byte = 0
frontbracket = 0
backbracket = 0

while byte < 1000:  #while byte < character we want to read up to
                    #keep at 1000 for testing purposes
    char = exercises.read(1)
    sys.stdout.write(char)

    #Here we decide what to do based on what char we have
    if str(char) == "{":
        frontbracket = byte
        while True:
            char = exercises.read(1)
            if str(char) == "}":
                backbracket = byte
                break
        exercises.seek(frontbracket)
        block = exercises.read(backbracket - frontbracket)
        print "Block is " + str(backbracket - frontbracket) + " bytes long"
        jsonblock = json.loads(block)
        sys.stdout.write(block)
        print jsonblock["translated_display_name"]
        print "\nENDBLOCK\n"
    byte = byte + 1
Ok, the repeated pattern appears to be this: http://pastebin.com/4nSnLEFZ
To get an idea of the structure of the response, you can use JSONlint to copy/paste portions of your string and 'validate'. Even if the portion you copied is not valid, it will still format it into something you can actually read.
First, I used the requests library to pull the JSON for you. It's a super-simple library for things like this. The API is slow to respond because it seems you're pulling everything, but it should work fine.
Once you get a response from the API, you can convert that directly to python objects using .json(). What you have is essentially a mixture of nested lists and dictionaries that you can iterate through and pull specific details. In my example below, my_list2 has to use a try/except structure because it would seem that some of the entries do not have two items in the list under translated_problem_types. In that case, it will just put 'None' instead. You might have to use trial and error for such things.
Finally, since you haven't used JSON before, it's worth noting how ordering works: JSON objects become Python dicts, and you shouldn't rely on the order of their keys; JSON arrays become lists and do keep their element order. The outermost structure here is a list, so the entries stay in the order the API sent them, but we don't know how the API builds that list, so don't rely on the order being consistent between calls.
import requests

api_call = requests.get('https://www.khanacademy.org/api/v1/exercises')
json_response = api_call.json()

# Assume we first want to list "author name" with "author key"
# This should loop through the repeated pattern in the pastebin
# access items as a dictionary
my_list1 = []
for item in json_response:
    my_list1.append([item['author_name'], item['author_key']])

print my_list1[0:5]

# Now let's assume we want the 'sha' of the SECOND entry in translated_problem_types
# to also be listed with author name
my_list2 = []
for item in json_response:
    try:
        the_second_entry = item['translated_problem_types'][0]['items'][1]['sha']
    except IndexError:
        the_second_entry = 'None'
    my_list2.append([item['author_name'], item['author_key'], the_second_entry])

print my_list2[0:5]
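If you would rather work from the 21.6MB file you already have instead of hitting the API, you can usually load the whole thing in one go. This assumes the file is a single valid JSON document (a list of exercises), which you can confirm by pasting a portion into JSONlint first:

import json

# Load the entire file at once; 21.6MB fits in memory comfortably.
with open("exercises.txt") as f:
    exercises = json.load(f)

# exercises should now be the same list of dicts the API returns.
print(exercises[0]["translated_display_name"])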
What is the easiest way to loop through a series of URLs until no more results are returned?
If the number of URLs were fixed, e.g. 9, something like the following code would work:
for i in range(1, 10):
    print('http://www.trademe.co.nz/browse/categorylistings.aspx?v=list&rptpath=4-380-50-7145-&mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=' + str(i) + '&sort_order=default')
However, the number of URLs is dynamic, and I get a page saying "Sorry, there are currently no listings in this category." when I overshoot. Example below.
http://www.trademe.co.nz/browse/categorylistings.aspx?v=list&rptpath=4-380-50-7145-&mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=10&sort_order=default
What is the easiest way to only return pages with results?
Cheers
Steve
# count is an iterator that just keeps going
# from itertools import count
# but I'm not going to use it, because you want to set a reasonable limit
# otherwise you'll loop endlessly if your end condition fails

# requests is third party but generally better than the standard libs
import requests

base_url = 'http://www.trademe.co.nz/browse/categorylistings.aspx?v=list&rptpath=4-380-50-7145-&mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page={}&sort_order=default'

for i in range(1, 30):
    result = requests.get(base_url.format(i))
    if result.status_code != 200:
        break
    content = result.content.decode('utf-8')
    # Note, this is actually quite fragile
    # For example, they have 2 spaces between 'no' and 'listings'
    # so looking for 'no listings' would break
    # for a more robust solution be more clever.
    if 'Sorry, there are currently no' in content:
        break
    # do stuff with your content here
    print(i)
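As the comment above says, matching on the exact wording is fragile. A slightly more robust option is to parse the page and check whether any listing elements are present. This is only a sketch: the 'listing-card' class name below is hypothetical, so inspect the real markup and substitute whatever element actually wraps each listing.

from bs4 import BeautifulSoup

import requests

# base_url is the same URL template defined in the snippet above
for i in range(1, 30):
    result = requests.get(base_url.format(i))
    if result.status_code != 200:
        break
    page = BeautifulSoup(result.content, 'html.parser')
    listings = page.select('.listing-card')  # hypothetical selector -- check the real page
    if not listings:
        break  # no listings on this page, so we have gone past the last one
    print('page {} has {} listings'.format(i, len(listings)))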
I have a little script that monitors the RSS feed for 'new questions tagged with python', specifically on SO. It stores the feed in a variable on the first iteration of the loop, and then constantly checks the feed against the one stored in the variable. If the feed changes, it updates the variable, outputs the newest entry to the console, and plays a sound file to alert me that there are new questions. All in all it's quite handy, as I don't have to keep an eye on anything. However, there are time discrepancies between new questions actually being posted and my script detecting the feed update. These discrepancies vary in length, but generally the alert isn't instant and tends not to reach me until there has been enough action on a question that it has pretty much been dealt with. Not always, but generally. Is there a way for me to get much faster updates/alerts? Or is this as good as it gets? (It's crossed my mind that this particular feed is only updated when there is actually action on a question.. anyone know if that's the case?)
Have I misunderstood the way RSS actually works?
import urllib2
import mp3play
import time
from xml.dom import minidom


def SO_notify():
    """ play alarm when rss is updated """
    rss = ''
    filename = "path_to_soundfile"
    mp3 = mp3play.load(filename)
    mp3.volume(25)

    while True:
        html = urllib2.urlopen("http://stackoverflow.com/feeds/tag?tagnames=python&sort=newest")
        new_rss = html.read()
        if new_rss == rss:
            continue
        rss = new_rss
        feed = minidom.parseString(rss)
        new_entry = feed.getElementsByTagName('entry')[0]
        title = new_entry.getElementsByTagName('title')[0].childNodes[0].nodeValue
        print title
        mp3.play()
        time.sleep(30)  #Edit - thanks to all who suggested this

SO_notify()
Something like:
import requests
import mp3play
import time

curr_ids = []
filename = "path_to_soundfile"
mp3 = mp3play.load(filename)
mp3.volume(25)

while True:
    api_json = requests.get("http://api.stackoverflow.com/1.1/questions/unanswered?order=desc&tagged=python").json()
    new_questions = []
    all_questions = []
    for q in api_json["questions"]:
        all_questions.append(q["question_id"])
        if q["question_id"] not in curr_ids:
            new_questions.append(q["question_id"])
    if new_questions:
        print(new_questions)
        mp3.play()
    curr_ids = all_questions
    time.sleep(30)
Used the requests package here because urllib gives me some encoding troubles.
IMHO, you have two possible approaches here, depending on which you prefer:
Use the JSON API - this will give you a nice dict with all the entries.
Use the RSS (XML) feed. In this case you'd need something like feedparser to process the XML.
Either way, the code should be something like:
import time

import mp3play

# curr_ids holds the ids seen on the previous poll
# (a set would make the lookup faster)
curr_ids = []
filename = "path_to_soundfile"
mp3 = mp3play.load(filename)
mp3.volume(25)

# Loop
while True:
    # Get the list of entries as objects
    entries = get_list_of_entries()
    new_ids = []
    for entry in entries:
        # Check if we reached the most recent entry from the previous poll
        if entry.id in curr_ids:
            # Force loop end if we did
            break
        new_ids.append(entry.id)
        # Do whatever operations
        print entry.title
    if len(new_ids) > 0:
        mp3.play()
        curr_ids = new_ids
    else:
        # No updates in the meantime
        pass
    time.sleep(30)
Several notes:
I'd order the entries by "oldest" instead so the printed entries look like a stream, with the most recent one being the last printed out.
The new_ids list is there to keep the stored set of ids to a minimum; otherwise the lookup would get slower over time.
get_list_of_entries() is a placeholder for whatever fetches the entries from the source (objects from XML, or a dict from JSON). Depending on which approach you pick, the way you refer to the fields differs, but the principle is the same. A sketch of it for the RSS route follows below.
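For the RSS route, get_list_of_entries() could be a thin wrapper around feedparser. This is a minimal sketch, assuming the same feed URL as in the question; feedparser entries expose .id and .title, which is how the skeleton above uses them:

import feedparser

FEED_URL = "http://stackoverflow.com/feeds/tag?tagnames=python&sort=newest"


def get_list_of_entries():
    # feedparser downloads and parses the feed in one call;
    # in this feed the entries come newest first, each with .id and .title
    feed = feedparser.parse(FEED_URL)
    return feed.entries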