I have been trying to scrape Google search data.
Let me explain what I have done so far.
I used the google module (together with Beautiful Soup) to get the search results. Below is the sample search I made:
>>> from google import search
>>>
>>> for i in search("tom and jerry", tld="co.in", num=10, stop=1): print i
https://www.youtube.com/watch?v=mugo5LoG8Ws
https://en.wikipedia.org/wiki/Tom_and_Jerry
http://www.dailymail.co.uk/debate/article-2390792/How-sense-humour-censor-Tom-Jerry-racist-By-Mail-TV-critic-CHRISTOPHER-STEVENS.html
http://edp.wikia.com/wiki/Tom_and_Jerry
https://www.youtube.com/watch?v=gSK5curwV_o
https://www.youtube.com/watch?v=xb8jTvSwJbw
https://www.youtube.com/watch?v=Kj8VuTr5q9g
https://www.youtube.com/watch?v=iIprJoPTJoI
https://www.youtube.com/watch?v=UaX3hvrZDJA
http://www.cartoonnetwork.com/games/tomjerry/
https://www.facebook.com/TomandJerry/
http://www.dailymotion.com/video/x2mn36a
http://www.dailymotion.com/video/x2p0k8j
>>>
But this result actually differs from what I get with a manual search in the browser.
How exactly does it differ, and can we get more accurate results by changing the __init__.py file of the google library?
Please point me towards a possible way.
Thanks in advance.
[Note]: I have already searched previous discussions on Stack Overflow. If this is a duplicate, I apologize... :)
EDIT 1: I also sometimes get duplicate links; the first link is repeated a few times in the generator output I get from the google.search(*args) call. Please advise me on how to get rid of this.
I figured out where the duplicates come from: they are the sub-links shown under popular websites on the Google results page.
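For now, a simple workaround I am considering (just a sketch that keeps the first occurrence of each URL coming out of the generator) is:
from google import search

seen = set()
for url in search("tom and jerry", tld="co.in", num=10, stop=1):
    if url not in seen:        # skip URLs we have already printed
        seen.add(url)
        print(url)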
I am researching more on the API output and the way it is parsed. Thanks to everyone who thought of helping me. :)
I am trying to get a response from NLTK.
I don't have any idea how to do this.
If you have source code or a reference link, please share it.
I tried several times myself, but it failed.
I found one link for this question:
https://www.nltk.org/_modules/nltk/chat/eliza.html
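Based on that module, a minimal sketch that should produce a response (it uses the eliza_chatbot object defined in the linked source) would be:
from nltk.chat.eliza import eliza_chatbot

# Ask the built-in Eliza chatbot for a single reply.
print(eliza_chatbot.respond("I feel a little sad today"))
# eliza_chat() from the same module starts a full interactive conversation instead.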
I was wondering if anyone has sample code for finding recently posted tweets that contain a certain keyword and have a certain number of likes within a certain timeframe,
preferably in Python. Anything related to this would help a lot if you have it. Thank you!
I have personally not done this before, but a simple google search yielded this (a python wrapper for the Twitter API):
https://python-twitter.readthedocs.io/en/latest/index.html
and a GitHub with examples that they linked from their getting started page:
https://github.com/bear/python-twitter/tree/master/examples
There you can find example code for getting all of a user's tweets and much more.
Iterating through a user's tweets might be able to do the job here, but if that doesn't cut it, I recommend searching the docs linked above for what you need.
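For the specific case of a keyword with a minimum number of likes in a timeframe, a rough sketch with that wrapper might look like the following. I haven't run this myself: the credentials are placeholders, and the parameter names (term, result_type, since, until, count) are taken from my reading of the python-twitter docs, so double-check them against the current version.
import twitter

# Placeholder credentials -- create an app on the Twitter developer site to get real ones.
api = twitter.Api(consumer_key="...",
                  consumer_secret="...",
                  access_token_key="...",
                  access_token_secret="...")

# Recent tweets containing the keyword inside a date window (YYYY-MM-DD).
results = api.GetSearch(term="python", result_type="recent",
                        since="2021-01-01", until="2021-01-07", count=100)

MIN_LIKES = 50
for status in results:
    if status.favorite_count >= MIN_LIKES:
        print(status.created_at, status.favorite_count, status.text)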
I am new to Python and am trying to create a program that will read changing information from a webpage. I'm not sure whether what I want to do is simple or even possible, but in my head it seems do-able and relatively straightforward. Specifically, I am interested in pulling in the song names from Pandora as they change. I have tried looking into just reading information from a webpage using something like:
>>> import urllib
>>> import re
>>> page = urllib.urlopen("http://google.com").read()
>>> re.findall("Shopping", page)
['Shopping']
>>> page.find("Shopping")
However, this isn't really what I want, since it only picks up information that doesn't change. Any advice, or a link to helpful information about reading changing info from a webpage, would be greatly appreciated.
The only way this is possible (without some type of advanced algorithm) is if there are some elements of the page that do NOT change, which you can tell your program to look for. Otherwise, I believe you will need some sort of advanced logic. After all, computers can only do what we instruct them to do. Sorry :)
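If the song title does live in the page's HTML inside an element that keeps a stable id or class, the general pattern would be to poll the page and compare against the last value you saw. Here is a rough Python 3 sketch; the URL and the CSS selector are made up, and note that Pandora actually fills in the title with JavaScript, so a plain HTML fetch may not contain it at all.
import time
import requests                    # assumed installed: pip install requests
from bs4 import BeautifulSoup      # assumed installed: pip install beautifulsoup4

URL = "http://example.com/now-playing"        # hypothetical page
SELECTOR = "div.now-playing span.title"       # hypothetical stable element

last_seen = None
while True:
    html = requests.get(URL, timeout=10).text
    node = BeautifulSoup(html, "html.parser").select_one(SELECTOR)
    current = node.get_text(strip=True) if node else None
    if current and current != last_seen:
        print("Now playing:", current)     # the element's text changed
        last_seen = current
    time.sleep(30)   # poll every 30 seconds; be polite to the server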
I've searched the whole afternoon, but I'm still stuck.
I need to google keywords and save the rank of a given domain name for each keyword.
I tried several libraries: xgoogle, google, and pygoogle. However, pygoogle just doesn't work, and the google and pygoogle calls always end up raising "HTTP Error: Service Unavailable".
So I suppose I should use the Google AJAX Search API instead, via the urllib2 and simplejson libraries and the URL "http://ajax.googleapis.com/ajax/services/search/web?v=1.0".
I have several questions :
How do I choose the top-level domain?
How do I choose the language of the results?
How do I choose how many results are shown?
Are the results ranked the same way I would find them in my own Google search? I ask because I'm under the impression that they are not.
Are photo URLs taken into account?
How do I choose the starting point? Is it possible to start from the 10th result?
Thank you for your help,
Sebi81
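A rough Python 2 sketch (since you mention urllib2 and simplejson) of how those options might map onto that URL follows. The parameter names below (hl, gl, rsz, start) are my recollection of the old AJAX Search API documentation, and that API has long been deprecated and shut down, so treat this purely as an illustration of the pattern.
import json
import urllib
import urllib2

def ajax_search(query, lang="en", country="us", num=8, start=0):
    params = urllib.urlencode({
        "v": "1.0",      # API version
        "q": query,      # keyword(s) to search for
        "hl": lang,      # language of the results
        "gl": country,   # country bias (roughly plays the role of the tld)
        "rsz": num,      # results per page; the old API capped this at 8
        "start": start,  # offset, e.g. start=10 to begin at the 11th result
    })
    url = "http://ajax.googleapis.com/ajax/services/search/web?" + params
    data = json.load(urllib2.urlopen(url))
    return [r["url"] for r in data["responseData"]["results"]]

# Example: find the rank of a domain for a keyword across the first pages.
for offset in range(0, 32, 8):
    results = ajax_search("tom and jerry", start=offset)
    for rank, link in enumerate(results, offset + 1):
        if "wikipedia.org" in link:
            print "rank %d: %s" % (rank, link)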
I am a social scientist and a complete newbie/noob when it comes to coding. I have searched through the other questions/tutorials but am unable to get the gist of how to crawl a news website, specifically targeting the comments section. Ideally, I'd like to tell Python to crawl a number of pages and return all the comments as a .txt file. I've tried
from bs4 import BeautifulSoup
import urllib2
url="http://www.xxxxxx.com"
and that's as far as I can go before I get an error message saying bs4 is not a module. I'd appreciate any kind of help on this, and please, if you decide to respond, DUMB IT DOWN for me!
I can run wget in the terminal and get all kinds of text from websites, which would be awesome IF I could actually figure out how to combine the individual output HTML files into one big .txt file. I will take a response to either question.
Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
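As a starting point, a minimal spider might look like this; the start URL and the div.comment selector are placeholders that you would have to adapt to the real site's markup:
import scrapy

class CommentSpider(scrapy.Spider):
    name = "comments"
    start_urls = ["http://www.example.com/article-1"]   # hypothetical article pages

    def parse(self, response):
        # Pull the text out of every element matching the (assumed) comment selector.
        for text in response.css("div.comment ::text").extract():
            text = text.strip()
            if text:
                yield {"comment": text}
You can run it with scrapy runspider comment_spider.py -o comments.json and then join the extracted strings into one .txt file afterwards.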
You will most likely encounter this as you go, but in some cases, if the site is employing 3rd party services for comments, like Disqus, you will find that you will not be able to pull the comments down in this manner. Just a heads up.
I've gone down this route before and have had to tailor the script to a particular site's layout/design/etc.
I've found libcurl to be extremely handy, if you don't mind doing the post-processing with Python's string-handling functions.
If you don't need to implement it purely in Python, you can make use of wget's recursive mirroring option to handle the content pull, then write your python code to parse the downloaded files.
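For example, something along these lines: first mirror the site with wget --recursive --level=2 --no-parent --directory-prefix=mirror http://www.example.com/ (the depth, directory name, and URL are placeholders), then use a short Python 3 script to walk the downloaded files and dump their visible text into one .txt file:
import os
from bs4 import BeautifulSoup   # assumes beautifulsoup4 is installed

with open("all_pages.txt", "w", encoding="utf-8") as out:
    for root, _dirs, files in os.walk("mirror"):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue                        # skip images, CSS, etc.
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                soup = BeautifulSoup(f.read(), "html.parser")
            out.write(soup.get_text(separator="\n"))    # visible text only
            out.write("\n\n===== " + path + " =====\n\n")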
I'll add my two cents here as well.
The first things to check are that you have installed Beautiful Soup, and that it lives somewhere it can be found. There are all kinds of things that can go wrong here.
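A quick sanity check is the two lines below; if the import fails, install the package first (e.g. pip install beautifulsoup4):
import bs4
print(bs4.__version__)   # prints the installed Beautiful Soup 4 version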
My experience is similar to yours: I work at a web startup, and we have a bunch of users who register, but give us no information about their job (which is actually important for us). So my idea was to scrape the homepage and the "About us" page from the domain in their email address, and try to put a learning algorithm around the data that I captured to predict their job. The results for each domain are stored as a text file.
Unfortunately (for you...sorry), the code I ended up with was a bit complicated. The problem is that you'll end up getting a lot of garbage when you do the scraping, and you'll have to filter it out. You'll also end up with encoding issues, and (assuming you want to do some learning here) you'll have to get rid of low-value words. The total code is about 1000 lines, and I'll post some important pieces that may help you out here, if you're interested.
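Roughly, the skeleton looks like this. It is a heavily simplified sketch, not the real thousand-line script: the domain, the page paths, and the stop-word list are placeholders.
import re
import requests                    # assumes requests is installed
from bs4 import BeautifulSoup      # assumes beautifulsoup4 is installed

STOPWORDS = {"the", "and", "for", "you", "with", "our"}   # tiny illustrative list

def visible_words(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):   # drop the usual garbage
        tag.decompose()
    # Keep only plain lower-case words of 3+ letters, then filter low-value words.
    words = re.findall(r"[a-z]{3,}", soup.get_text(separator=" ").lower())
    return [w for w in words if w not in STOPWORDS]

domain = "example.com"     # e.g. taken from the user's email address
for page in ("http://" + domain, "http://" + domain + "/about"):
    with open(domain + ".txt", "a", encoding="utf-8") as f:
        f.write(" ".join(visible_words(page)) + "\n")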