I've been trying for hours with requests and urllib. I'm completely lost, and Google hasn't helped either. Any tips, or really anything, would be useful. Thank you.
Goal: POST a country code and phone number, then get back the mobile carrier, etc.
Problem: nothing useful prints; the variable "name" comes out as None.
def do_ccc(self):  # part of a bigger class
    """Phone Number Scan"""
    # prefix = input("Phone Number Prefix: ")
    # number = input("Phone Number: ")
    url = "https://freecarrierlookup.com/index.php"
    from bs4 import BeautifulSoup
    import urllib.request
    import json, sys
    data = {'cc': "COUNTRY CODE",
            'phonenum': "PHONE NUMBER"}  # .encode('ascii')
    data = json.dump(data, sys.stdout)
    page = urllib.request.urlopen(url, data)
    soup = BeautifulSoup(page, 'html.parser')
    name = soup.find('div', attrs={'class': 'col-sm-6 col-md-8'})
    # ^^^ test (should print the phone number)
    print(name)
As Zags pointed out, it is not a good idea to scrape a website in violation of its terms of service, especially when the site offers a cheap API.
But answering your original question:
You are using json.dump instead of json.dumps, which writes the JSON to sys.stdout and returns None, leaving you with an empty data object.
If you look at the page, you will see that the URL for POST requests is different, getcarrier.php instead of index.php.
You would also need to convert the str from json.dumps to bytes, and even then the site will reject your calls, since a hidden token is added to each request submitted through the website to prevent automated scraping.
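Putting those pieces together, a minimal sketch of the corrected call might look like the following. Note I'm using urllib.parse.urlencode rather than json.dumps, since a form POST is normally URL-encoded; and because of the hidden token, the site will most likely still reject the request, so treat this as illustration only.

from bs4 import BeautifulSoup
import urllib.parse
import urllib.request

url = "https://freecarrierlookup.com/getcarrier.php"    # the POST endpoint, not index.php
payload = {'cc': "COUNTRY CODE", 'phonenum': "PHONE NUMBER"}
data = urllib.parse.urlencode(payload).encode('ascii')  # POST bodies must be bytes

page = urllib.request.urlopen(url, data)
soup = BeautifulSoup(page, 'html.parser')
print(soup.find('div', attrs={'class': 'col-sm-6 col-md-8'}))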
The problem is with what your code is trying to do. freecarrierlookup.com is just a website. In order to do what you want, you'd need to do web scraping, which is complicated, unmaintainable, and usually a violation of the site's terms of service.
What you need to do is find an API that provides the data you're looking for. A good API will usually have either sample code or a Python library that you can use to make requests.
Related
I am looking for a way in Python to input a phone number and get the caller's carrier. I am looking for a free and simple way; I have used Telnyx and it returns CELLCO PARTNERSHIP DBA VERIZON instead of simply 'verizon', which does not work for me. I have tried Twilio as well and it has not worked for me. Has anyone found success doing this? Thanks in advance. Code for the Telnyx lookup:
import requests
import json

def getcarrier(number):
    url = 'https://api.telnyx.com/v1/phone_number/1' + number
    html = requests.get(url).text
    data = json.loads(html)
    data = data["carrier"]
    print(data["name"])
    global carrier
What I have done in the past is to isolate the number prefix and match it against the prefix database available on the Wikipedia page. I did this only for my own country (Bangladesh), so the code was relatively simple (just a series of if/else checks). To work for any number, I believe you'll need to consider the country code as well.
You can do it in two ways.
One: store the data locally as a CSV taken from the Wikipedia page (scraping that page should be easy to do), and then use pandas or a similar CSV-handling package as the database for your program; a rough sketch of this follows below.
Or, two, you can write a small program that scrapes the page on demand and finds the operator then.
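A rough sketch of option one, assuming you have already saved the Wikipedia table as a CSV; the file name and the prefix/operator column names are placeholders you'd adjust to your own data:

import pandas as pd

# Assumed CSV layout (placeholder file and column names): prefix,operator
prefixes = pd.read_csv("bd_prefixes.csv", dtype={"prefix": str})

def find_operator(number):
    """Return the operator whose prefix the number starts with, or 'unknown'."""
    for _, row in prefixes.iterrows():
        if number.startswith(row["prefix"]):
            return row["operator"]
    return "unknown"

print(find_operator("01712345678"))  # example local number; add country-code handling as needed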
Good Luck.
I am looking to parse the Wikipedia talk page (e.g., https://en.wikipedia.org/wiki/Talk:Elon_Musk). I would like to loop through the text contributed by each editor, but I'm not sure how to do it. For now, I have the following code:
import pywikibot as pw

wikiPage = "Elon Musk"
page = pw.Page(pw.Site('en'), wikiPage)
talkpage = page.toggleTalkPage()   # Talk:Elon Musk
s = talkpage.text                  # full wikitext of the talk page
cs = talkpage.contributors()       # contributor name -> edit count
It seems pretty hard to parse the text (i.e., s) and find the talk text made by each contributor. Not sure where the talk begins and ends for a contributor and what talk text is in response to a talk text made by others. Is there a way that talk page returns segments that I can loop through?
Many thanks for your help!
I don't know about pywikibot, but you can do this via the normal API. This will fetch the revisions: https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Talk:Elon%20Musk&rvlimit=500&rvprop=timestamp|user|comment|ids
Then you can pass the revision ids to get the change in each edit: e.g. https://en.wikipedia.org/w/api.php?action=compare&fromrev=944235185&torev=944237256
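A rough requests-based sketch of those two calls, using exactly the parameters from the URLs above:

import requests

API = "https://en.wikipedia.org/w/api.php"

# 1. List revisions of the talk page (timestamp, user, comment, ids).
resp = requests.get(API, params={
    "action": "query",
    "prop": "revisions",
    "titles": "Talk:Elon Musk",
    "rvlimit": 500,
    "rvprop": "timestamp|user|comment|ids",
    "format": "json",
}).json()
revs = next(iter(resp["query"]["pages"].values()))["revisions"]  # newest first
print(revs[0]["user"], revs[0]["timestamp"])

# 2. Diff two revisions to see what the most recent edit changed (returns an HTML diff).
diff = requests.get(API, params={
    "action": "compare",
    "fromrev": revs[1]["revid"],
    "torev": revs[0]["revid"],
    "format": "json",
}).json()
print(diff["compare"]["*"])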
According to their documentation:
This should be enough to get the hottest new reddit submissions:
r = client.get(r'http://www.reddit.com/api/hot/', data=user_pass_dict)
But it doesn't and I get a 404 error. Am I getting the url for data request wrong?
http://www.reddit.com/api/login works though.
Your question specifically asks what you need to do to get the "hottest new" submissions. "Hottest new" doesn't really make sense as there is the "hot" view and a "new" view. The URLs for those two views are http://www.reddit.com/hot and http://www.reddit.com/new respectively.
To make those URLs more code-friendly, you can append .json to the end of the URL (any reddit URL for that matter) to get a json-representation of the data. For instance, to get the list of "hot" submissions make a GET request to http://www.reddit.com/hot.json.
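With plain requests that looks roughly like this; reddit expects a descriptive User-Agent (the one below is a placeholder), otherwise you tend to get rate-limited:

import requests

resp = requests.get(
    "https://www.reddit.com/hot.json",
    headers={"User-Agent": "my-script/0.1 by your_username"},  # placeholder user agent
)
resp.raise_for_status()
for child in resp.json()["data"]["children"]:
    print(child["data"]["title"])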
For completeness, in your example, you attempt to pass in data=user_pass_dict. That's definitely not going to work the way you expect it to. While logging in is not necessary for what you want to do, if you happen to need more complicated access to reddit's API from Python, I strongly suggest using PRAW. With PRAW you can iterate over the "hot" submissions via:
import praw

r = praw.Reddit('<REPLACE WITH A UNIQUE USER AGENT>')
for submission in r.get_frontpage():
    # do something with the submission
    print(vars(submission))
According to the docs, use /hot rather than /api/hot:
r = client.get(r'http://www.reddit.com/hot/', data=user_pass_dict)
I am a social scientist and a complete newbie/noob when it comes to coding. I have searched through the other questions/tutorials but am unable to get the gist of how to crawl a news website targeting the comments section specifically. Ideally, I'd like to tell python to crawl a number of pages and return all the comments as a .txt file. I've tried
from bs4 import BeautifulSoup
import urllib2
url="http://www.xxxxxx.com"
and that's as far as I can go before I get an error message saying bs4 is not a module. I'd appreciate any kind of help on this, and please, if you decide to respond, DUMB IT DOWN for me!
I can run wget in the terminal and get all kinds of text from websites, which is awesome IF I could actually figure out how to save the individual output HTML files into one big .txt file. I will take a response to either question.
Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
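As a rough sketch, assuming a hypothetical news site where each comment lives in a div with class "comment" (you would swap in the real URLs and selector):

import scrapy

class CommentSpider(scrapy.Spider):
    name = "comments"
    start_urls = ["http://www.example.com/article-1"]  # placeholder pages to crawl

    def parse(self, response):
        # "div.comment" is a placeholder selector; adjust it to the site's real markup.
        for text in response.css("div.comment::text").getall():
            yield {"comment": text.strip()}

Saving that as, say, comments_spider.py and running scrapy runspider comments_spider.py -o comments.json collects everything into one file, which you can then flatten into a .txt.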
You will most likely encounter this as you go, but in some cases, if the site uses a third-party service for comments, like Disqus, you will find that you cannot pull the comments down in this manner. Just a heads up.
I've gone down this route before and have had to tailor the script to a particular site's layout/design/etc.
I've found libcurl to be extremely handy, if you don't mind doing the post-processing using Python's string-handling functions.
If you don't need to implement it purely in Python, you can make use of wget's recursive mirroring option to handle the content pull, then write your python code to parse the downloaded files.
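If you go the wget route, the Python side can be as small as this; again, the div.comment selector and the mirror directory are placeholders for the real site's markup and your download location:

import glob
from bs4 import BeautifulSoup

with open("all_comments.txt", "w", encoding="utf-8") as out:
    for path in glob.glob("mirror/**/*.html", recursive=True):  # wherever wget mirrored the site
        with open(path, encoding="utf-8", errors="ignore") as fh:
            soup = BeautifulSoup(fh, "html.parser")
        for comment in soup.find_all("div", class_="comment"):
            out.write(comment.get_text(strip=True) + "\n")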
I'll add my two cents here as well.
The first things to check are that you actually installed Beautiful Soup and that it lives somewhere it can be found. There are all kinds of things that can go wrong here.
My experience is similar to yours: I work at a web startup, and we have a bunch of users who register, but give us no information about their job (which is actually important for us). So my idea was to scrape the homepage and the "About us" page from the domain in their email address, and try to put a learning algorithm around the data that I captured to predict their job. The results for each domain are stored as a text file.
Unfortunately (for you...sorry), the code I ended up with was a bit complicated. The problem is that you'll end up getting a lot of garbage when you do the scraping, and you'll have to filter it out. You'll also end up with encoding issues, and (assuming you want to do some learning here) you'll have to get rid of low-value words. The total code is about 1000 lines, and I'll post some important pieces that may help you out here, if you're interested.
I was recently asked by a client to build a website for their insurance business. As part of this, they want to do some screen scraping of the quote site for one of their providers. They asked if there was an API to do this, and were told there wasn't one, but that if they could get the data from the provider's engine they could use it as they wanted to.
My question: is it even possible to perform screen scraping on the response to a form submission to another site? If so, what are the gotchas I should look out for? Obvious legal/ethical issues aside, since they already asked for permission to do what we're planning to do.
As an aside, I would prefer to do any processing in python.
Thanks
A really nice library for screen scraping is mechanize, which I believe is a clone of an original library written in Perl. Anyway, that in combination with the ClientForm module, plus some additional help from BeautifulSoup, and you should be away.
I've written loads of screen-scraping code in Python and these modules turned out to be the most useful. Most of the stuff that mechanize does could in theory be done by simply using the urllib2 or httplib modules from the standard library, but mechanize makes this stuff a breeze: essentially it gives you a programmatic browser (note, it does not require a browser to work, but merely provides you with an API that behaves like a completely customisable browser).
For post-processing, I've had a lot of success with BeautifulSoup, but lxml.html is a good choice too.
Basically, you will be able to do this in Python for sure, and your results should be really good with the range of tools out there.
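A rough sketch of that workflow; the URL, form index, field names, and result element below are placeholders for the provider's actual quote form:

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)                    # many quote engines disallow crawling via robots.txt
br.open("https://example-insurer.com/quote")   # placeholder URL

br.select_form(nr=0)                           # pick the first form on the page
br["postcode"] = "AB1 2CD"                     # placeholder field names and values
br["vehicle_reg"] = "XX00XXX"
response = br.submit()

soup = BeautifulSoup(response.read(), "html.parser")
print(soup.find("span", class_="quote-total"))  # placeholder element holding the quote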
You can pass a data parameter to urllib.urlopen to send POST data with the request, just as if you had filled out the form. You'll obviously have to take a look at what data exactly the form contains.
Also, if the form has method="GET", the request data is just part of the url given to urlopen.
The pretty much standard choice for parsing the returned HTML data is BeautifulSoup.
I see the other two answers already mention all the major libraries of choice for the purpose... as long as the site being scraped does not make extensive use of Javascript, that is. If it IS a Javascript-heavy site and depends on JS for the data it fetches and displays (e.g. via AJAX), your problem is an order of magnitude harder; in that case, I might suggest starting with crowbar, some customization of diggstripper, or selenium, etc.
You'll have to do substantial work in Javascript, and probably dedicated work to deal with the specifics of the (hypothetically JS-heavy) site in question, depending on the JS frameworks it uses; that's why the job is so much harder in that case. But in any case you might end up with (at least in part) local HTML copies of the site's pages as displayed, and finish by scraping those copies with the other tools already recommended. Good luck: may the sites you scrape always be Javascript-light!-)
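If the site does turn out to be JS-heavy, a small selenium sketch like this gets you the rendered HTML, which you can then hand to the parsers above (the URL is a placeholder):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()              # or webdriver.Chrome()
try:
    driver.get("https://example-insurer.com/quote")   # placeholder URL
    html = driver.page_source             # the HTML after JavaScript has run
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)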
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
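For comparison, the lxml.html version of the usual parse-and-extract step looks like this; the inline HTML and the class name are placeholders standing in for a fetched page:

import lxml.html

html = "<html><body><div class='price'>123</div></body></html>"   # placeholder for a downloaded page
doc = lxml.html.fromstring(html)
for el in doc.xpath("//div[@class='price']"):
    print(el.text_content())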