What's the most efficient way to extract data in this case? - python

So, I'm working on a Python web application, a search engine for sporting goods (sports outfits, tools, etc.). Basically it should search for a given keyword on multiple stores and compare the results to return the 20 best ones.
I was thinking that the best and easiest way to do this is to write a JSON file which contains rules telling the scraper how to extract data on each website. For example:
[{"www.decathlon.com" : { "rules" : { "productTag" : "div['.product']",
"priceTag" : "span[".price"]" } }]
So for Decathlon, to get a product item we search for div tags with the product class.
I have a list of around 10-15 websites to scrape. So for each website the scraper goes to rules.json, reads the related rules, and uses them to extract the data.
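A minimal sketch of that rules-driven extraction with BeautifulSoup; the site key and the CSS selectors below are illustrative, not real Decathlon markup:

```python
# Sketch: map each site to selector rules and extract (name, price) pairs.
# The rules and selectors here are illustrative assumptions.
import json
from bs4 import BeautifulSoup

RULES = json.loads("""
{"www.decathlon.com": {"productTag": "div.product", "priceTag": "span.price"}}
""")

def extract(site, html, rules=RULES):
    """Return (product, price) pairs using the site's selector rules."""
    soup = BeautifulSoup(html, "html.parser")
    site_rules = rules[site]
    products = soup.select(site_rules["productTag"])
    prices = soup.select(site_rules["priceTag"])
    return list(zip((p.get_text(strip=True) for p in products),
                    (p.get_text(strip=True) for p in prices)))
```

Adding a new store then means adding one entry to the JSON, with no code changes.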
Pros of this method:
Very easy to write. We only need a minimal Python script for the logic of mapping URLs to their rules and extracting the data through BeautifulSoup, and it's also very easy to add or delete URLs and their rules.
Cons of this method: for each search we launch a request to each website, making around 10 requests at the same time, then compare results. So if 20 users search at the same time we will have around 200 requests, which will slow down our app a lot!
Another method:
I thought that we could have a huge list of keywords; then at 00:00 a script launches requests to all the URLs for each keyword in the list, compares them, and stores the results in CouchDB to be used throughout the day, updated daily. The only problem with this method is that it's nearly impossible to have a list of all possible keywords.
So please, how should I proceed with this, given that I don't have a lot of time?

Along the lines of your "keyword" list: rather than keeping a list of all possible keywords, perhaps you could maintain a priority queue of keywords whose importance is based on how often a keyword is searched. When a new keyword is encountered, add it to the queue; otherwise, update its importance every time it's searched. Launch a script to request URLs for the top, say, 30 keywords each day (more or less depending on how often words are searched and what you want to do).
This doesn't necessarily solve your problem of having too many requests, but may decrease the likelihood of it becoming too much of a problem.
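The importance-tracking part of this idea can be sketched with the standard library's Counter; the cutoff of 30 is the example number from above:

```python
# Sketch of keyword "importance" tracking using collections.Counter.
# record_search() would be called on every incoming user query.
from collections import Counter

searches = Counter()

def record_search(keyword):
    searches[keyword.lower()] += 1

def top_keywords(n=30):
    """Keywords worth pre-crawling in the nightly job."""
    return [kw for kw, _ in searches.most_common(n)]
```

A real priority queue is unnecessary here since the nightly job only needs the top-n snapshot once a day.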

HTTP requests can be very expensive. That's why you want to make sure you parallelize your requests, and for that you can use something like Celery. This way you reduce the total time to that of the slowest-responding website.
It may be a good idea to set the request timeout to a shorter value (5 seconds?) in case one of the websites is not responding to your request.
Have the ability to flag a domain as "down/not responding" and be able to handle those exceptions.
Another optimization would be to cache page contents for some time after each search, so that if the same search keyword comes in again you can skip the expensive requests.
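The fan-out with a timeout and a "down" flag can also be sketched without Celery, using the standard library's thread pool; the URLs and the fetch placeholder below are illustrative:

```python
# Sketch: query all stores concurrently; a site that fails or times out
# is flagged as None instead of blocking the whole search.
import concurrent.futures

def fetch(url, timeout=5):
    # Placeholder for a real HTTP call, e.g. with the requests library:
    #   return requests.get(url, timeout=timeout).text
    ...

def fetch_all(urls, fetcher=fetch, max_workers=10):
    """Return {url: body or None}; None marks a down/not-responding site."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetcher, url): url for url in urls}
        for fut in concurrent.futures.as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception:
                results[url] = None  # flag domain as down / not responding
    return results
```

Total wall-clock time per search is then roughly the slowest single response, not the sum of all of them.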

Related

How do I scrape content from a dynamically generated page using selenium and python?

I have tried many attempts and all fail to record the data I need in a reliable and complete manner. I understand the extreme basics of python and selenium for automating simple tasks but in this case the content is dynamically generated and I am unable to find the correct way to access and subsequently record all the data I need.
The URL I am looking to scrape content from is structured similar to the following:
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
In particular I am trying to grab all the info using something like -
browser.find_elements_by_xpath('//*[@id="products-container"]')
Is this the right approach? How do I access specific sub-elements of this element (and all elements on the same path)?
I have read that I might need beautifulsoup4, but I am unsure of the best way to approach this.
Would the best approach be to use XPaths? If so, is there a way to iterate through all elements and record all the data within, or do I have to specify each and every data point that I am after?
Any assistance to point me in the right direction would be extremely helpful as I am still learning and have hit a roadblock in my progress.
My end goal is a list of all product names, prices and any other data points that I deem relevant based on the specific exercise at hand. If I could find the correct way to access the data points I could then store them and compare/report on them as needed.
Thank you!
I think you are looking for something like
browser.find_elements_by_css_selector('[class*="product-information__Title"]')
This should find all elements with a class beginning with that string.
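If the goal is to iterate over everything and record it, one option (an assumption, not verified against the real site) is to let Selenium render the page and then hand browser.page_source to BeautifulSoup, matching classes by prefix the same way:

```python
# Sketch: parse the rendered page_source with BeautifulSoup and collect
# the text of every element whose class contains the given prefix.
# The default prefix is taken from the question; it is not verified markup.
from bs4 import BeautifulSoup

def product_titles(page_source, prefix="product-information__Title"):
    soup = BeautifulSoup(page_source, "html.parser")
    # CSS attribute selector: class attribute containing the prefix
    return [el.get_text(strip=True)
            for el in soup.select(f'[class*="{prefix}"]')]
```

The same function with a different prefix would collect prices or any other repeated field.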

Twitter scraping in Python

I have to scrape tweets from Twitter for a specific user (#salvinimi), from January 2018. The issue is that there are a lot of tweets in this range of time, and so I am not able to scrape all the ones I need!
I tried multiple solutions:
1)
pip install twitterscraper
from twitterscraper import query_tweets_from_user as qtfu
tweets = qtfu(user='matteosalvinimi')
With this method, I get only a few tweets (500-600, more or less) instead of all the tweets... Do you know why?
2)
!pip install twitter_scraper
from twitter_scraper import get_tweets
tweets = []
for i in get_tweets('matteosalvinimi', pages=100):
    tweets.append(i)
With this method I get an error -> "ParserError: Document is empty"...
If I set "pages=40", I get the tweets without errors, but not all of them. Do you know why?
Three things for the first issue you encounter:
First of all, every API has its limits, and one like Twitter's would be expected to monitor its use and eventually stop a user from retrieving data if they ask for more than the limits allow. Trying to overcome the limitations of the API might not be the best idea and might result in being banned from accessing the site, among other things (I'm guessing here, as I don't know Twitter's policy on the matter). That said, the documentation of the library you're using states:
With Twitter's Search API you can only sent 180 Requests every 15 minutes. With a maximum number of 100 tweets per Request this means you can mine for 4 x 180 x 100 = 72.000 tweets per hour.
By using TwitterScraper you are not limited by this number but by your internet speed/bandwith and the number of instances of TwitterScraper you are willing to start.
Then, the function you're using, query_tweets_from_user(), has a limit argument which you can set to an integer. One thing you can try is changing that argument and seeing whether you get what you want or not.
Finally, if the above does not work, you could subset your time range into two, three or more subsets if needed, collect the data separately, and merge it together afterwards.
The second issue you mention might be due to many different things, so I'll just take a broad guess here. Either pages=100 is too high and for one reason or another the program or the API is unable to retrieve the data, or you're trying to look at a hundred pages when there are fewer than a hundred available, which results in the program trying to parse an empty document.
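The range-subsetting idea from the last point can be sketched with the standard library; the commented query call is a hypothetical use of twitterscraper's begindate/enddate arguments, not tested code:

```python
# Sketch: split a date range into contiguous chunks so each chunk can be
# queried separately and the results merged afterwards.
import datetime as dt

def split_range(begin, end, parts):
    """Split [begin, end) into `parts` roughly equal (begin, end) pairs."""
    step = (end - begin).days // parts
    edges = [begin + dt.timedelta(days=step * i) for i in range(parts)]
    edges.append(end)
    return list(zip(edges[:-1], edges[1:]))

# Hypothetical usage with twitterscraper (untested assumption):
# for b, e in split_range(dt.date(2018, 1, 1), dt.date(2018, 2, 1), 4):
#     tweets += query_tweets('from:matteosalvinimi', begindate=b, enddate=e)
```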

How do I go to a random website? - python

How can I generate a random yet valid website link, regardless of language? Actually, the more diverse the languages of the websites it generates, the better.
I've been doing it by using other people's scripts on their webpages. How can I stop relying on these random-site forwarding scripts and make my own? I've been doing it like this:
import webbrowser
from random import choice
random_page_generator = ['http://www.randomwebsite.com/cgi-bin/random.pl',
                         'http://www.uroulette.com/visit']
webbrowser.open(choice(random_page_generator), new=2)
I've been doing it by using other people's scripts on their webpages. How can I stop relying on these random-site forwarding scripts and make my own?
There are two ways to do this:
Create your own spider that amasses a huge collection of websites, and pick from that collection.
Access some pre-existing collection of websites, and pick from that collection. For example, DMOZ/ODP lets you download their entire database;* Google used to have a customized random site URL;** etc.
There is no other way around it (short of randomly generating and testing valid strings of arbitrary characters, which would be a ridiculously bad idea).
Building a web spider for yourself can be a fun project. Link-driven scraping libraries like Scrapy can do a lot of the grunt work for you, leaving you to write the part you care about.
* Note that ODP is a pretty small database compared to something like Google's or Yahoo's, because it's primarily a human-edited collection of significant websites rather than an auto-generated collection of everything anyone has put on the web.
** Google's random site feature was driven by both popularity and your own search history. However, by feeding it an empty search history, you could remove that part of the equation. Anyway, I don't think it exists anymore.
A conceptual explanation, not a code one.
Their scripts are likely very large and comprehensive. If it's a random website selector, they have a huge, huge list of websites line by line, and the script just picks one. If it's a random URL generator, it probably generates a string of letters (e.g. "asljasldjkns"), plugs it between http:// and .com, tries to see if it is a valid URL, and if it is, sends you that URL.
The easiest way to design your own might be to ask to have a look at theirs, though I'm not certain of the success you'd have there.
The best way as a programmer is simply to decipher the nature of URL language. Practice the building of strings and testing them, or compile a huge database of them yourself.
As a hybridization, you might try building two things. One script that, while you're away, searches for/tests URLs and adds them to a database. Another script that randomly selects a line out of this database to send you on your way. The longer you run the first, the better the second becomes.
EDIT: Do Abarnert's thing about spiders, that's much better than my answer.
The other answers suggest building large databases of URLs; there is another method, which I've used in the past and documented here:
http://41j.com/blog/2011/10/find-a-random-webserver-using-libcurl/
The idea is to create a random IP address and then try to grab a site from port 80 of that address. This method is not perfect with modern virtual-hosted sites, and of course it only fetches the top page, but it can be an easy and effective way of getting random sites. The code linked above is C, but it should be easily callable from Python, or the method could easily be adapted to Python.
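The address-generation half of that method translates directly to Python's standard library; the commented probe is illustrative, and a real run needs a short timeout since most random addresses won't answer on port 80:

```python
# Sketch: generate a random IPv4 address, rejecting private/reserved and
# multicast ranges, then (hypothetically) probe port 80 of that address.
import ipaddress
import random

def random_public_ip():
    while True:
        ip = ipaddress.IPv4Address(random.getrandbits(32))
        if ip.is_global and not ip.is_multicast:
            return str(ip)

# Hypothetical probe (untested assumption, needs the requests library):
# requests.get("http://" + random_public_ip() + "/", timeout=3)
```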

Word count statistics on a web page

I am looking for a way to extract basic stats (total count, density, count in links, hrefs) for words on an arbitrary website, ideally a Python based solution.
While it is easy to parse a specific website using, say, BeautifulSoup, and determine where the bulk of the content is, that requires you to define the location of the content in the DOM tree ahead of processing. This is easy for, say, hrefs or any arbitrary tag, but gets more complicated when determining where the rest of the data (not enclosed in well-defined markers) is.
If I understand correctly, robots used by the likes of Google (Googlebot?) are able to extract data from any website to determine keyword density. My scenario is similar: obtain the info related to the words that define what the website is about (i.e. after removing js, links and fillers).
My question is, are there any libraries or web APIs that would allow me to get statistics of meaningful words from any given page?
There are no APIs, but there are a few libraries that you can use as tools.
You should count the meaningful words and record them over time.
You can also start from something like this:
string url = "http://www.website.com/news/Default.asp";
string itemToSearch = "Word";
// download the page first, then count occurrences in its content
string content = new WebClient().DownloadString(url);
int count = new Regex(itemToSearch).Matches(content).Count;
MessageBox.Show(count.ToString());
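Since the question asks for a Python-based solution, here is a hedged sketch of the same counting idea with BeautifulSoup and Counter, stripping script/style content first as the question suggests:

```python
# Sketch: basic word statistics (counts and density) for a page,
# after removing <script> and <style> content.
import re
from collections import Counter
from bs4 import BeautifulSoup

def word_stats(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop js and css text entirely
    words = re.findall(r"[a-z']+", soup.get_text().lower())
    counts = Counter(words)
    total = sum(counts.values())
    density = {w: c / total for w, c in counts.items()}
    return counts, density
```

Counting words in links only would be the same idea applied to soup.find_all("a") instead of the whole document.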
There are multiple libraries that deal with more advanced processing of web articles, this question should be a duplicate of this one.

Efficient Alternative to "in"

I'm writing a web crawler with the ultimate goal of creating a map of the path the crawler has taken. While I haven't a clue at what rate other, and most definitely better crawlers pull down pages, mine clocks about 2,000 pages per minute.
The crawler works on a recursive backtracking algorithm which I have limited to a depth of 15.
Furthermore, in order to prevent my crawler from endlessly revisiting pages, it stores the URL of each page it has visited in a list, and checks that list for each new candidate URL.
for href in tempUrl:
    ...
    if href not in urls:
        collect(href, parent, depth+1)
This method seems to become a problem by the time it has pulled down around 300,000 pages. At this point the crawler on average has been clocking 500 pages per minute.
So my question is: what is another method of achieving the same functionality while improving efficiency?
I thought that decreasing the size of each entry might help, so instead of appending the entire URL, I append the first two and the last two characters of each URL as a string. This, however, hasn't helped.
Is there a way I could do this with sets or something?
Thanks for the help
edit: As a side note, my program is not yet multithreaded. I figured I should resolve this bottleneck before I get into learning about threading.
Perhaps you could use a set instead of a list for the urls that you have seen so far.
Simply replace your "list of crawled URLs" with a "set of crawled URLs". Sets are optimised for random access (using the same hashing algorithm that dictionaries use) and they're a heck of a lot faster. A lookup operation on a list is done using a linear search, so it's not particularly fast. You won't need to change the actual code that does the lookup.
Check this out.
In [3]: timeit.timeit("500 in t", "t = list(range(1000))")
Out[3]: 10.020853042602539
In [4]: timeit.timeit("500 in t", "t = set(range(1000))")
Out[4]: 0.1159818172454834
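The change in the crawler itself is minimal; a sketch of the membership check factored into a helper (the helper name is illustrative, not from the question's code):

```python
# Sketch: track visited URLs in a set for O(1) average membership tests,
# instead of the list's O(n) linear scan.
visited = set()

def should_crawl(href):
    """Return True the first time a URL is seen, False afterwards."""
    if href in visited:
        return False
    visited.add(href)
    return True
```

The crawl loop then becomes `if should_crawl(href): collect(href, parent, depth+1)`.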
I had a similar problem. I ended up profiling various methods (list/file/set/sqlite) for memory vs. time; sqlite was the best choice in the end. You can also hash the URLs to reduce their size. See these two posts:
Searching for a string in a large text file - profiling various methods in python
sqlite database design with millions of 'url' strings - slow bulk import from csv
Use a dict with the urls as keys instead (O(1) access time).
But a set will also work. See
http://wiki.python.org/moin/TimeComplexity
