I am using a spider to crawl https://nj.zu.ke.com/zufang. I found that the URL composition is roughly https://{0}.zu.ke.com/zufang/pg{1}, and each listing has a unique listing number, which forms the link to that listing's detail page.
So I set up two MySQL columns: 1. an auto-increment ID as the primary key; 2. the listing number, which is unique and not null.
However, during crawling, as I change the page number in pg{1}, the proportion of repeated listing numbers is extremely high: at 30 items per page over roughly 100 pages, I end up with only a little more than 300 rows. (At first I thought my code was wrong. Later I ran the loop single-threaded to check the responses, and by printing the IDs I found that many IDs returned on different pages were duplicates.)
Later I suspected the site's recommendation system was the problem, so I logged in and sent my cookie with the requests, but the result was roughly the same.
Also, if you refresh the page right after visiting it, the results are different.
How can I solve this problem? Thanks.
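For reference, here is a rough sketch of the single-threaded check described above (parse_listing_ids is a placeholder standing in for my real extraction code, and the page range mirrors the roughly 100 pages mentioned):

import requests

def parse_listing_ids(html):
    # Placeholder: extract the unique listing numbers from one result page.
    return []

seen = set()
total = 0
for page in range(1, 101):
    url = 'https://{0}.zu.ke.com/zufang/pg{1}'.format('nj', page)
    html = requests.get(url).text
    ids = parse_listing_ids(html)
    total += len(ids)
    seen.update(ids)
    print(page, len(ids), 'unique so far:', len(seen))

print('fetched', total, 'items,', len(seen), 'unique listing numbers')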
I would like to get information about all the reviews on my server. Here is the code I used to try to achieve that:
from rbtools.api.client import RBClient
client = RBClient('http://my-server.net/')
root = client.get_root()
reviews = root.get_review_requests()
The variable reviews contains just 25 review requests (I expected many, many more). What's even stranger, I tried something a bit different:
count = root.get_review_requests(counts_only=True)
Now count.count is equal to 17164. How can I extract the rest of my reviews? I checked the official documentation but haven't found anything related to my problem.
According to the documentation (https://www.reviewboard.org/docs/manual/dev/webapi/2.0/resources/review-request-list/#webapi2.0-review-request-list-resource), counts_only is just a Boolean flag that indicates the following:
If specified, a single count field is returned with the number of results, instead of the results themselves.
What you could do, though, is also provide it with status, so:
count = root.get_review_requests(counts_only=True, status='all')
should return you all the requests.
Keep in mind that I didn't test this part of the code locally; I referred to the test example in their repo (https://github.com/reviewboard/rbtools/blob/master/rbtools/utils/tests/test_review_request.py#L643) and the documentation link posted above.
You have to use pagination (unfortunately I can't provide exact code without being able to reproduce your setup):
The maximum number of results to return in this list. By default, this is 25. There is a hard limit of 200; if you need more than 200 results, you will need to make more than one request, using the “next” pagination link.
It looks like a pagination helper class is also available.
If you want to get 200 results per request, you can set max_results:
requests = root.get_review_requests(max_results=200)
Anyway, HERE is a good example of how to iterate over the results.
Also, I don't recommend fetching all 17164 results in one request even if it were possible, because the total response data would be huge (say each result is about 10KB; the total would then be more than 171MB).
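As a rough, untested sketch of that pagination loop: I'm assuming the list resource returned by get_review_requests exposes a get_next() that follows the "next" link and raises StopIteration when there are no more pages, so check this against your rbtools version.

from rbtools.api.client import RBClient

client = RBClient('http://my-server.net/')
root = client.get_root()

# First page: include closed/discarded requests and ask for the maximum page size.
page = root.get_review_requests(status='all', max_results=200)

all_requests = []
while True:
    all_requests.extend(page)
    try:
        page = page.get_next()   # follow the "next" pagination link
    except StopIteration:
        break                    # no more pages

print(len(all_requests))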
I am interested in parsing all of the entries from the federal job site: https://www.usajobs.gov/ for data analysis.
I have read through the API and in this section: https://developer.usajobs.gov/Guides/Rate-Limiting, it says the following:
Maximum of 5,000 job records per query* (I am actually getting 10,000 job records in my output)
Maximum of 500 job records returned per request
Here is the rest of the API reference: https://developer.usajobs.gov/API-Reference
So here is my question:
How can I go to the next 10,000 until all records are found?
What I am doing:
response = requests.get('https://data.usajobs.gov/api/Search?Page=20&ResultsPerPage=500', headers=headers)
That gives me 500 results per page as JSON, which I dump into one .json file while incrementing the page number in a loop up to page 20, which ends up being all 10,000. I'm not sure what to do to get the next 10,000 until all entries are found.
Another idea is to run a separate query for each state, but the downside is that I would lose everything outside of the U.S.
If someone can point me in the right direction for a better, simpler, and more efficient way to get all the entries than my proposed ideas, I would appreciate that too.
The server likely gives some error when it can't find more pages. Try something like
"...?Page=25000&..."
just to see what it gives, then use a while loop with a manually incremented counter instead of a for loop. The stopping condition for the while loop is to check whether the server returns the error page.
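A rough sketch of that loop: the headers follow the USAJobs documentation (fill in your own credentials), and the JSON keys I check (SearchResult / SearchResultItems) are my reading of the API reference, so verify them against an actual response.

import requests

headers = {
    'Host': 'data.usajobs.gov',
    'User-Agent': 'your-email@example.com',   # placeholder
    'Authorization-Key': 'YOUR-API-KEY',      # placeholder
}

all_jobs = []
page = 1
while True:
    response = requests.get(
        'https://data.usajobs.gov/api/Search',
        params={'Page': page, 'ResultsPerPage': 500},
        headers=headers,
    )
    # Assumed response shape: SearchResult -> SearchResultItems.
    items = response.json().get('SearchResult', {}).get('SearchResultItems', [])
    if not items:        # stopping condition: an empty page means no more results
        break
    all_jobs.extend(items)
    page += 1

print(len(all_jobs), 'records collected')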
I am trying to find all pages that contain a certain string in the name, on a certain domain. For example:
www.example.com/section/subsection/406751371-some-string
www.example.com/section/subsection/235824297-some-string
www.example.com/section/subsection/146783214-some-string
What would be the best way to do it?
The numbers before "-some-string" can be any 9-digit number. I can write a script that loops through all possible 9-digit numbers and tries to access the resulting url, but I keep thinking that there should be a more efficient way to do this, especially since I know that overall there are only about 1000 possible pages that end with that string.
I understand your situation: the numeric value before -some-string is a kind of object id for that website (for example, this question has an id of 39594926, and the URL is stackoverflow.com/questions/39594926/python-find-all-urls-which-contain-string).
I don't think there is a way to find all the valid numbers unless you have a listing (or parent) page on that website that lists them. Taking Stack Overflow as an example again: on the question list page you can see all of these question ids.
If you can tell me the website, I could have a look and try to find the 'pattern' of these numbers. For some simple websites, that number is just an increment used to identify objects (which could be users, questions or anything else).
If these articles are all linked to from one page, you could parse the HTML of that index page, since all the links will be contained in href attributes.
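A minimal sketch of that approach, assuming such an index page exists (the URL below is a placeholder); a real HTML parser would be more robust than the regex used here.

import re
import urllib.request

# Placeholder: a page on the site that links out to the articles in question.
index_url = 'http://www.example.com/section/subsection/'

html = urllib.request.urlopen(index_url).read().decode('utf-8', errors='ignore')

# Pull every href out of the page and keep the ones ending with the string.
hrefs = re.findall(r'href="([^"]+)"', html)
matching = sorted(set(h for h in hrefs if h.endswith('-some-string')))

for url in matching:
    print(url)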
I am building a web crawler which has to crawl hundreds of websites. My crawler keeps a list of URLs that have already been crawled. Whenever the crawler is about to crawl a new page, it first searches that list, and if the URL is already listed it skips to the next URL, and so on. Once a URL has been crawled, it is added to the list.
Currently I am using binary search to search the URL list, but the problem is that once the list grows large, searching becomes very slow. So my question is: what algorithm can I use to search a list of URLs (the list grows by about 20k to 100k entries daily)?
The crawler is currently written in Python, but I am going to port it to C++ or another, faster language.
You have to decide at some point just how large you want your crawled list to become. Up to a few tens of millions of items, you can probably just store the URLs in a hash map or dictionary, which gives you O(1) lookup.
In any case, with an average URL length of about 80 characters (that was my experience five years ago when I was running a distributed crawler), you're only going to get about 10 million URLs per gigabyte. So you have to start thinking about either compressing the data or allowing re-crawl after some amount of time. If you're only adding 100,000 URLs per day, then it would take you 100 days to crawl 10 million URLs. That's probably enough time to allow re-crawl.
If those are your limitations, then I would suggest a simple dictionary or hash map that's keyed by URL. The value should contain the last crawl date and any other information that you think is pertinent to keep. Limit that data structure to 10 million URLs. It'll probably eat up close to 2 GB of space, what with dictionary overhead and such.
You will have to prune it periodically. My suggestion would be to have a timer that runs once per day and cleans out any URLs that were crawled more than X days ago. In this case, you'd probably set X to 100. That gives you 100 days of 100,000 URLs per day.
If you start talking about high capacity crawlers that do millions of URLs per day, then you get into much more involved data structures and creative ways to manage the complexity. But from the tone of your question, that's not what you're interested in.
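A minimal sketch of that structure (the names are mine), assuming all you need per URL is the last crawl date and a daily prune:

import time

# Dict keyed by URL; the value keeps the last crawl time plus whatever else
# is worth remembering about the page.
crawled = {}

def mark_crawled(url):
    crawled[url] = {'last_crawl': time.time()}

def needs_crawl(url):
    return url not in crawled          # O(1) average-case lookup

def prune(max_age_days=100):
    # Run once a day: forget URLs crawled more than max_age_days ago.
    cutoff = time.time() - max_age_days * 86400
    stale = [u for u, info in crawled.items() if info['last_crawl'] < cutoff]
    for url in stale:
        del crawled[url]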
I would suggest hashing your values before putting them into your binary-searched list; this gets rid of the likely bottleneck of string comparisons, swapping them for integer equality checks, while keeping the O(log2(n)) binary search time. Note that you may not get consistent results between runs if you use Python's built-in hash(), since it is implementation-specific; within a single run it will be consistent. There is always the option of implementing your own hash that is also consistent between sessions.
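A sketch of that idea, using hashlib so the keys stay stable across runs (the function names are mine):

import bisect
import hashlib

# Sorted list of integer keys: we binary-search integers instead of comparing strings.
seen = []

def url_key(url):
    # hashlib is stable across runs, unlike the built-in hash().
    return int(hashlib.md5(url.encode('utf-8')).hexdigest(), 16)

def has_seen(url):
    key = url_key(url)
    i = bisect.bisect_left(seen, key)
    return i < len(seen) and seen[i] == key

def add_seen(url):
    key = url_key(url)
    i = bisect.bisect_left(seen, key)
    if i == len(seen) or seen[i] != key:
        seen.insert(i, key)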
I'm writing a web crawler with the ultimate goal of creating a map of the path the crawler has taken. While I haven't a clue at what rate other, and most definitely better, crawlers pull down pages, mine clocks in at about 2,000 pages per minute.
The crawler works on a recursive backtracking algorithm which I have limited to a depth of 15.
Furthermore, in order to prevent my crawler from endlessly revisiting pages, it stores the URL of each page it has visited in a list, and checks that list for the next candidate URL.
for href in tempUrl:
    ...
    if href not in urls:
        collect(href, parent, depth + 1)
This method seems to become a problem by the time it has pulled down around 300,000 pages; at that point the crawler is averaging about 500 pages per minute.
So my question is: what is another method of achieving the same functionality while improving its efficiency?
I thought that decreasing the size of each entry might help, so instead of appending the entire URL, I append only the first two and the last two characters of each URL as a string. This, however, hasn't helped.
Is there a way I could do this with sets or something?
Thanks for the help
edit: As a side note, my program is not yet multithreaded. I figured I should resolve this bottleneck before I get into learning about threading.
Perhaps you could use a set instead of a list for the urls that you have seen so far.
Simply replace your "list of crawled URLs" with a "set of crawled URLs". Sets are optimised for random access (using the same hashing algorithm that dictionaries use) and they're a heck of a lot faster. A lookup operation on a list is done using a linear search, so it's not particularly fast. You won't need to change the actual code that does the lookup.
Check this out.
In [3]: timeit.timeit("500 in t", "t = list(range(1000))")
Out[3]: 10.020853042602539
In [4]: timeit.timeit("500 in t", "t = set(range(1000))")
Out[4]: 0.1159818172454834
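For concreteness, a sketch of the swap using the names from the question (the fetch/parse step is stubbed out; only the bookkeeping changes from a list to a set):

urls = set()                       # was: urls = []

def fetch_links(url):
    # Placeholder standing in for the real fetch/parse step that builds tempUrl.
    return []

def collect(url, parent, depth, max_depth=15):
    urls.add(url)                  # was: urls.append(url)
    if depth >= max_depth:
        return
    for href in fetch_links(url):
        if href not in urls:       # same membership test, now O(1) on average
            collect(href, url, depth + 1)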
I had a similar problem and ended up profiling various methods (list/file/set/sqlite) for memory vs. time. In the end, sqlite was the best choice; you can also store a hash of each URL to reduce the size. See these two posts:
Searching for a string in a large text file - profiling various methods in python
sqlite database design with millions of 'url' strings - slow bulk import from csv
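A minimal sketch of the sqlite approach with hashed URLs (the table and column names are mine):

import hashlib
import sqlite3

conn = sqlite3.connect('crawled.db')
conn.execute('CREATE TABLE IF NOT EXISTS seen (url_hash TEXT PRIMARY KEY)')

def url_hash(url):
    # Fixed-size key keeps the table and its index small.
    return hashlib.md5(url.encode('utf-8')).hexdigest()

def already_crawled(url):
    row = conn.execute('SELECT 1 FROM seen WHERE url_hash = ?',
                       (url_hash(url),)).fetchone()
    return row is not None

def mark_crawled(url):
    conn.execute('INSERT OR IGNORE INTO seen (url_hash) VALUES (?)',
                 (url_hash(url),))
    conn.commit()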
Use a dict with the urls as keys instead (O(1) access time).
But a set will also work. See
http://wiki.python.org/moin/TimeComplexity