Scraping vs Google Trends API using Python

I'm trying to collect the top five search queries for each trend for the past year by category on Google Trends.
I don't know whether I should do this using a Python library such as pytrends, which according to its docs requires a keyword to query Google Trends. The problem is that I don't have any specific keyword; I want to fetch the search queries for every category that can be found.
Or should I use a scraping library such as Selenium or BeautifulSoup4 to collect this information directly from the Google Trends website?
The goal of this is to be able to retrieve the top 5 websites for each query later ...
Which direction should I take?

It is better to use one of the unofficial API wrappers.
These talk to the internal Google APIs that power the Trends UI and return structured data. Scraping, by contrast, returns mostly unstructured HTML, and you would have to extract the structured data yourself; the result will not be as reliable or as complete.
It is the difference between talking through an API that is intended for machine-to-machine communication and a web UI that is intended for machine-to-human interaction.
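If you take the pytrends route, a minimal sketch looks roughly like this. Note that pytrends does still need a seed keyword per request (as the question points out), and the keyword and category id below are placeholders; check the pytrends docs for the category list and exact return shapes.

```python
# Minimal sketch using the unofficial pytrends wrapper.
# "python" (seed keyword) and cat=0 (all categories) are placeholders.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)

# Build a payload for the past 12 months; a non-zero cat id restricts
# results to one Google Trends category.
pytrends.build_payload(kw_list=["python"], cat=0, timeframe="today 12-m")

# related_queries() returns {keyword: {"top": DataFrame, "rising": DataFrame}}.
related = pytrends.related_queries()
top_queries = related["python"]["top"]
print(top_queries.head(5))  # the top five related search queries
```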

Related

Convert Python to HTML using Django

I have a Python program that lets customers query a price. Each time, a customer inputs some necessary information and the program calculates and returns the price. Note: during the calculation, the program also needs to query a third-party map service web API (such as the Google Maps API or a similar service) to get some information.
I have a website developed using web development tools such as Wix and Strikingly. They offer the ability to customize a web page by simply pasting in a block of HTML code. So I want to study the possibility of using Django to convert my Python program into HTML (including adding some user interface elements such as a text box and a button), which could then be pasted into the website to form a unique webpage.
I am not sure whether this is doable, especially the part that connects to the third-party map service API. Would Django be able to convert this part to HTML automatically as well? (How does it deal with the API key and the connection?)
Python runs on the server and is meant to be the backend in site development, whereas HTML is only the frontend: it does no calculation or data fetching on its own. Wix is a frontend tool with some content management; it offers customization, but only on the frontend (HTML/CSS), and there's not much more you can do with its content management beyond the built-in table-like features. Trying to reuse the HTML generated by Wix is painful because of its CSS class-name optimization, which makes it quite unscalable.
If you don't wish to learn frontend development at all, you could look for another HTML generator tool to produce the frontend code. From there, Django itself is capable of building the entire website, using the HTML you generated as templates and passing the data you've computed into those templates; that is exactly what Django is meant to do. In this case you would need to learn Django itself, which I would recommend if you intend to showcase your project as an interactive application rather than just console output. A minimal view is sketched below.
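A rough sketch of that Django approach, assuming a pricing formula and a map provider of your own: the view name, template name, API URL, and "MAPS_API_KEY" are all hypothetical placeholders. The important point is that the API key and the third-party call stay on the server; only the rendered HTML reaches the browser.

```python
# views.py - minimal sketch of the "compute in Python, render via a template" flow.
import requests
from django.shortcuts import render

def price_quote(request):
    price = None
    if request.method == "POST":
        origin = request.POST.get("origin", "")
        destination = request.POST.get("destination", "")
        # Query a third-party map service; the key never appears in the HTML.
        resp = requests.get(
            "https://maps.example.com/distance",   # placeholder for a real map API
            params={"from": origin, "to": destination, "key": "MAPS_API_KEY"},
            timeout=10,
        )
        distance_km = resp.json().get("distance_km", 0)
        price = 2.5 + 1.2 * distance_km            # your own pricing formula goes here
    # The template (quote.html) contains the text boxes and button.
    return render(request, "quote.html", {"price": price})
```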
Another alternative is to rewrite your Python code in JavaScript, which is capable of doing calculations and fetching from APIs, and which you can include directly in the HTML with a <script> tag.

How to extract information (citation, h-index, currently working institution etc) about all professors in a specific field from Google scholar?

I want to compare different information (citations, h-index, etc.) about professors in a specific field at different institutions all over the world, using data mining and analysis techniques. But I have no idea how to extract this data for hundreds (or even thousands) of professors, since Google does not provide an official API for it. So I am wondering whether there are any other ways to do that.
This Google Code tool will calculate an individual's h-index. If you use it on demand for a limited number of people in a particular field, you will not break the terms of use: they don't specifically refer to limits on access, but they do refer to disruption of service (e.g. bulk requests may potentially cause this). The export questions state:
I wrote a program to download lots of search results, but you blocked my computer from accessing Google Scholar. Can you raise the limit?
Err, no, please respect our robots.txt when you access Google Scholar using automated software. As the wearers of crawler's shoes and webmaster's hat, we cannot recommend adherence to web standards highly enough.
Web of Science does have an API available and a collaboration agreement with Google Scholar, but Web of Science access is only available to certain individuals.
A solution could be to request the user's Web of Science credentials (or use your own) to return the information on demand - perhaps only for the top people in the field - and then store it as you planned. Google Scholar only updates a few times per week, so this would not be excessive use.
The other option is to request permission from Google, which is mentioned in the terms of use, although it seems unlikely to be granted.
I've done a project exactly on this.
You provide the script with an input text file containing the names of the professors you'd like to retrieve information about, and the script crawls Google Scholar and collects the info you are interested in.
The project also provides functionality for automatically downloading the profile pictures of the researchers/professors.
In order to respect the constraints imposed by the portal, you can set a delay between requests. If you have >1k profiles to crawl it might take a while, but it works.
A concurrency-enabled script has also been implemented, and it runs much faster than the basic sequential approach.
Note: in order to specify the information you need, you have to know either the id or the class name of the HTML elements generated by Google Scholar.
good luck!
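A minimal sketch of that crawl-with-delay approach. The profile URL pattern and the element names ("gsc_prf_in", "gsc_rsb_std") reflect Google Scholar's markup at the time of writing and are assumptions; verify them against the live HTML before relying on them, and mind the robots.txt caveat quoted above.

```python
# Sequential crawl with a delay between requests, parsing by id/class name.
import time
import requests
from bs4 import BeautifulSoup

def fetch_profile(user_id: str) -> dict:
    url = f"https://scholar.google.com/citations?user={user_id}&hl=en"
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    name = soup.select_one("#gsc_prf_in").get_text(strip=True)
    # The stats table lists citations, h-index and i10-index (all / recent).
    stats = [td.get_text(strip=True) for td in soup.select("td.gsc_rsb_std")]
    return {"name": name, "citations": stats[0], "h_index": stats[2]}

profiles = {}
for user_id in ["USER_ID_1", "USER_ID_2"]:   # read these from your input file
    profiles[user_id] = fetch_profile(user_id)
    time.sleep(5)                            # delay between requests
```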

BigQuery vs Custom Search for High-throughput Google (Scholar) Searches?

For a list of ~30 thousand keywords, I'm trying to find out how many Google search hits there exist for each keyword, similar to this website but on a larger scale: http://azich.org/google/.
I am using Python to query and was originally planning to use pygoogle. Unfortunately, Google has a limit of ~100 searches a day for a free account. I am willing to use a paid service, but I am not sure which Google service makes more sense - BigQuery or Custom Search. BigQuery seems to be for searches over a provided set of data, whereas Custom Search seems to be a website search solution for a small "slice" of the internet.
Would someone refer me to the appropriate service that will allow me to perform the above task? It doesn't need to be a free service - I am willing to pay.
Two more things, if possible: I'd like the searches to be from Google Scholar, but this is not necessary. Second, I'd like to save the text from the front page, such as the blurbs from each search result, to text-mine the front page results.
BigQuery is not a tool for interacting with Google Search in any way. BigQuery is a tool into which you feed your own data and then run analytical queries over it, but you first need to ingest that data.
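For contrast, this is roughly what BigQuery usage looks like once data has been ingested into a table you own; the project, dataset, and table names below are placeholders, and the snippet assumes the google-cloud-bigquery client library with credentials already configured.

```python
# Query data previously loaded into your own BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
    SELECT keyword, COUNT(*) AS hits
    FROM `my-project.my_dataset.search_results`
    GROUP BY keyword
    ORDER BY hits DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.keyword, row.hits)
```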

Python 3 - way to interact with a web page

I have experience with reading and extracting HTML source 'as given' (via urllib.request), but now I would like to perform browser-like actions (like filling a form, or selecting a value from an option menu) and then, of course, read the resulting HTML code as usual. I did come across some modules that seemed promising, but they turned out not to support Python 3.
So I'm asking for the name of a library/module that does what I want, or a pointer to a solution within the standard library if it's there and I failed to see it.
Many websites (like Twitter, Facebook or Wikipedia) provide APIs to let developers hook into their app and perform activities programmatically. For whatever website you wish to drive through code, first look for API support.
If you need to do web scraping, you can use Scrapy, but it currently only supports Python 2.7.x. Alternatively, you can use requests as the HTTP client and Beautiful Soup for HTML parsing.
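A minimal sketch of the requests + Beautiful Soup combination for the form-filling case: the URL and form field names below are placeholders, so inspect the target page's <form> element to find the real action URL and input names.

```python
# Submit a form the way a browser would, then parse the resulting HTML.
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# A POST with the form's field names simulates filling and submitting the form.
response = session.post(
    "https://example.com/search",
    data={"q": "python 3", "category": "books"},
    timeout=10,
)

# Read the resulting HTML as usual.
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a"):
    print(link.get("href"), link.get_text(strip=True))
```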

Should I use Screen Scrapers or API to read data from websites

I am building a web application as a college project (using Python), where I need to read content from websites. It could be any website on the internet.
At first I thought of using screen scrapers like BeautifulSoup or lxml to read the content (the data written by the authors), but I am unable to extract the content with a single piece of logic, because each website is built to different standards.
Then I thought of using RSS/Atom (via Universal Feed Parser), but I could only get the content summary - and I want all the content, not just a summary.
So, is there a way to have one piece of logic by which we can read any website's content using libraries like BeautifulSoup, lxml, etc.?
Or should I use the APIs provided by the websites?
My job becomes easy if it's a Blogger blog, as I can use the Google Data API, but the trouble is: would I need to write code against every different API for the same job?
What is the best solution?
Using the website's public API, when it exists, is by far the best solution. That is precisely why the API exists; it is the way the website administrators say "use our content". Scraping may work one day and break the next, and it does not imply the website administrators' consent to have their content reused.
You could look into content extraction libraries - I've used Full Text RSS (PHP) and Boilerpipe (Java). Both have web services available, so you can easily test whether they meet your requirements. You can also download and run them yourself and further modify their behavior on individual sites.
