I am trying to get the names of the first 200 Facebook users.
I am using Python and BeautifulSoup.
Instead of using the Graph API, my approach is to take each name from the title of the profile page (the title of a profile page is the person's name).
The first user is Zuckerberg (id: 4). I want the names up to id 200.
Here's what I've tried:
import urllib2
from BeautifulSoup import BeautifulSoup

x = 4
while x <= 200:
    # The <title> of each profile page is the user's name
    print BeautifulSoup(urllib2.urlopen("https://www.facebook.com/" + str(x))).title.string
    x += 1
Can anyone help?
Well, I concur with the other commenters that there is barely enough information to figure out what the problem is. But reading between the lines a bit, I imagine the OP is expecting that results for pages such as
https://www.facebook.com/4
which gets redirected to https://www.facebook.com/zuck, Mark Zuckerberg's page, and https://www.facebook.com/5
which gets redirected to https://www.facebook.com/ChrisHughes, another early Facebook employee, will continue to work for arbitrary further user IDs the OP plugs in. In fact, I believe this trick did work in the past... until someone posted a spreadsheet of the first 2000 Facebook users somewhere and Facebook closed the hole (this is from memory; I bet there's a news story out there if someone feels like digging).
Anyway, trying further user IDs in the URL such as:
https://www.facebook.com/7 now gives a "Sorry, this page isn't available" response. To the OP, I don't think there's any easy way you can code around this -- Zuck obviously doesn't care that you're harvesting his own page, but I guess he's not keen on letting you scrape the entire Facebook user list. Sorry.
Update: you might still be able to get away with such harvesting using Facebook's Graph API -- it appears that GETs of pages like https://graph.facebook.com/100 work for most user IDs. You should be able to script up what you need from there (if I were Facebook, I would have rate limiting in place to prevent mass harvesting, but you'll have to try it and see for yourself). Here's a script similar to what you're trying to do.
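A minimal sketch of that approach, assuming the unauthenticated https://graph.facebook.com/<id> endpoint still answers (the current Graph API generally requires an access token, so treat this as illustrative rather than definitive):

import json
import urllib2

# Walk Graph API user IDs and print each name.
# Assumes no access token is needed, which may no longer hold;
# expect rate limiting either way.
for uid in range(4, 201):
    try:
        data = json.load(urllib2.urlopen("https://graph.facebook.com/%d" % uid))
        print uid, data.get("name")
    except urllib2.HTTPError as e:
        print uid, "unavailable (HTTP %d)" % e.code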
Related
A web-scraping-adjacent question about URLs acting wacky.
If I go to Glassdoor's job search and enter six fields (Austin, "engineering manager", full-time, exact city, etc.), I get a results page with 38 results. This is the link I get; ideally I'd like to save this link with its search criteria and reference it later.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&sc.locationSeoString=austin&locId=1139761&locT=C?jobType=fulltime&fromAge=30&radius=0&minRating=4.00
However, if I copy that exact link and paste it into a new tab, it doesn't behave as desired.
It redirects to the different link below, keeping some of the criteria but losing the location, which brings up thousands of results from around the country instead of just Austin.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&fromAge=30&radius=0&minRating=4.0
I understand I could use Selenium to select all six fields; I'd just like to understand what's going on here and whether there is a solution involving just a URL.
The change of URL seems to happen on the server that is handling the request: the server-side endpoint appears to be configured to trim out the extra parameters and redirect you to another URL. There's nothing you can do about this from the client side; however you pass it, it will always resolve to the second URL format.
I also tried a URL shortener, but the same behavior persists.
The only way around this is to use automation such as Selenium to fill in the same fields and reproduce the results from the first URL.
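One way to confirm the trimming happens server-side is to send the request without following redirects and inspect the Location header that comes back. A sketch with urllib2 (the exact redirect behavior is the assumption being tested here):

import urllib2

class NoRedirect(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib2 raise HTTPError on a 3xx
        # instead of silently following it
        return None

url = ("https://www.glassdoor.com/Job/jobs.htm?"
       "sc.generalKeyword=%22engineering+manager%22"
       "&sc.locationSeoString=austin&locId=1139761&locT=C")
opener = urllib2.build_opener(NoRedirect)
try:
    opener.open(url)
except urllib2.HTTPError as e:
    print e.code                  # 301/302 if the server redirects
    print e.hdrs.get("Location")  # the trimmed URL it sends you to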
I'm trying to scrape a website that provides individual access to court cases in New Jersey county courts, and I'm having a lot of trouble figuring out how to start. I've scraped quite a few websites before, but I've usually been able to start by adapting the URL to pass in the search parameters. However, when I access this data the URL does not change, so I'm at a bit of a loss.
Additionally, there is a test to prove that I am not a robot (which occasionally turns into a reCAPTCHA).
On the website linked above, the inputs would be, for example:
Case County==Bergen, Docket Type==Landlord Tenant (LT), Docket Number==000001, and Docket Year==19.
I would then like to be able to extract the Defendant Name or anything from the subsequent page.
Does anyone have any advice on how I should proceed with this?
Thanks in advance
Websites which "require input" can be scraped using Selenium, which evaluates the JavaScript: your Python code then drives the page more like a "user" would (click here, type there). It's slow.
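A sketch of that route; the URL and element IDs here are hypothetical, so inspect the real search form and substitute the actual ones:

from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get("https://portal.njcourts.gov/")  # hypothetical entry page

# Fill the search form the way a user would; the IDs below are made up
Select(driver.find_element_by_id("county")).select_by_visible_text("Bergen")
Select(driver.find_element_by_id("docketType")).select_by_visible_text("Landlord Tenant (LT)")
driver.find_element_by_id("docketNumber").send_keys("000001")
driver.find_element_by_id("docketYear").send_keys("19")
driver.find_element_by_id("searchButton").click()

# The rendered results page is now available for parsing
print driver.page_source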
Alternatively, if you look at the page details, you may see what happens to the input and simply execute the resulting GET or POST yourself, properly formed. For example, forms will often do a POST with the parameters: look at the code, figure out what parameters get posted and to what URL, and then execute that POST from Python -- you'll probably need a cookiejar to maintain session info.
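A sketch of the POST route with a cookiejar. The endpoint and field names here are pure assumptions -- read the real ones out of the form's HTML (the <form action=...> and <input name=...> attributes) first:

import urllib
import urllib2
import cookielib

# Keep cookies across requests so the server sees one session
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# First GET picks up the session cookie(s)
opener.open("https://portal.njcourts.gov/")  # hypothetical entry page

# Then POST the fields the search form would have submitted (names made up)
data = urllib.urlencode({
    "county": "Bergen",
    "docketType": "LT",
    "docketNumber": "000001",
    "docketYear": "19",
})
response = opener.open("https://portal.njcourts.gov/caseSearch", data)
print response.read()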
HOWEVER: as a website maintainer, my advice to you is not to attempt to scrape this site. It doesn't want to be scraped, and repeated attempts only escalate defensive activity on the part of the website owner. You may also be violating the usage policy and state and/or federal laws.
Instead, look for an alternative API or an alternative source. (The NJ Courts may have an API designed for computer usage: send them an email!)
This morning I wanted to create a little piece of software/a script in Python. It was 6 am when I started, and now I'm about to go crazy because it's 10 pm and I have nothing that works.
So basically, I want to do this: given an Instagram username, scrape the name, the number of followers, and the business contact email.
I found out that going to the page source will give me this info (let's consider only the email for now): https://imgur.com/a/jYQ2FtR
Any idea how I can do that? I've tried many different things and nothing is working; I don't know what to do. I tried downloading the page and parsing the text looking for "business_email", but I have no idea how to implement that and extract the data I'm looking for. I know it's a simple task, but I'm a total noob and I haven't been coding for years.
Can someone tell me how to do it? Or at least point me in the right direction.
There are different ways to approach this problem. If the data you want is visible on the rendered page, then you can scrape it with Beautiful Soup. If not, it's a little trickier, but you can extract the info from the page source using regular expressions with the re module.
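A minimal sketch of the regex route, assuming "business_email" is still embedded as a JSON key in the profile's page source (as in the screenshot) and that the page is reachable without logging in -- neither is guaranteed:

import re
import urllib2

username = "some_username"  # hypothetical
req = urllib2.Request("https://www.instagram.com/" + username + "/",
                      headers={"User-Agent": "Mozilla/5.0"})
html = urllib2.urlopen(req).read()

# Look for "business_email":"..." anywhere in the raw source
match = re.search(r'"business_email"\s*:\s*"([^"]*)"', html)
if match:
    print match.group(1)
else:
    print "business_email not found in page source"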
I've been trying for several days now (unsuccessfully) to scrape cities from about 500 Facebook URLs. However, Facebook handles its data in a very strange way, and I can't figure out what's going on under the hood well enough to know what I need to do.
Essentially the problem is that Facebook displays very different amounts of data depending on who is logged in and what the privacy settings of the account are. For instance, try opening each of the following three links, once in a browser where you are logged into Facebook and once in one where you are not:
[REDACTED LINKS DUE TO PRIVACY CONCERNS]
As you can see, Facebook loads the data in both cases for the first link, but only shows data for the second link if you are logged in (to ANY account). The third link displays the city when you are logged in, but only displays other information when you are not.
The reason this is extremely problematic (and related to Python) is that when trying to scrape the page with Beautiful Soup or Mechanize, I cannot figure out how to get the program to "pretend" that I am logged into an account. This means I can easily grab data off the first type of link (of which there are fewer than 10), but I cannot get the city off the second or third type. So far I've tried a number of solutions with little success.
Here's some sample code that works correctly for the first type, but not for other types:
import mechanize
import re

fb_url = 'http://www.facebook.com/100004210542493'

# Fetch the profile page (ignore robots.txt so mechanize doesn't refuse)
br = mechanize.Browser()
br.set_handle_robots(False)
br.open(fb_url)
all_html = br.response().get_data()
# print all_html  # uncomment to inspect the raw HTML

# Pull the city out of the "About" section markup
match = re.search('fsl fwb fcb">(.+?)</a></div><div class="aboutSubtitle fsm fwn fcg', all_html)
city = match.group(1) if match else None

user_info = [fb_url, city]
print user_info
I also have a version that uses Beautiful Soup. If anyone has any ideas on how to get around this, I would be extremely grateful. Thank you!
The right way to do this is to use the Facebook API. For various business, security, and privacy reasons, they go out of their way to make scraping data tricky.
If you insist on scraping, I would try to log in first by using mechanize to submit the login form. I've never tried this with Facebook, but a lot of websites have easier-to-parse versions intended for mobile users at m.site.com.
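A sketch of that login step with mechanize. The form index and control names ("email", "pass") are assumptions -- inspect the actual login page (the m.facebook.com version tends to be simpler) and adjust:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("https://m.facebook.com/login.php")

br.select_form(nr=0)             # assume the login form is the first form
br["email"] = "you@example.com"  # control names are assumptions
br["pass"] = "your-password"
br.submit()                      # cookies stay on the Browser object

# Subsequent requests go out with the logged-in session attached
br.open("http://www.facebook.com/100004210542493")
print br.response().get_data()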
You should look into using facepy by Johannes Gorset. He has done a brilliant job. I used it when I worked on a small Facebook app for a personal project.
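If memory serves, basic facepy usage looks roughly like this (you supply an access token from Facebook's developer tools; treat the exact calls as a sketch, not gospel):

from facepy import GraphAPI

# Hypothetical token -- obtain one from Facebook's Graph API Explorer
graph = GraphAPI("YOUR_ACCESS_TOKEN")

print graph.get("me")          # basic profile fields
print graph.get("me/friends")  # subject to the token's permissions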
I think scraping data from Facebook is illegal; it's in Facebook's terms of use. Every activity is registered with your login details, even when you use a bot to scrape. If caught, you can be banned from Facebook for life, and if they decide you pose a potential threat to any of their assets, they can pursue further penalties.
You can try using Selenium together with the Facebook API. I also had to scrape similar data from a list of test Facebook accounts, and Selenium WebDriver helped me emulate a real user and scrape the required data.
If you visit this link right now, you will probably get a VBScript error.
On the other hand, if you visit this link first and then the above link (in the same session), the page comes through.
The way this application is set up, the first page is meant to serve as a frame in the second (main) page. If you click around a bit, you'll see how it works.
My question: How do I scrape the first page with Python? I've tried everything I can think of -- urllib, urllib2, mechanize -- and all I get is 500 errors or timeouts.
I suspect the answers lies with mechanize, but my mechanize-fu isn't good enough to crack this. Can anyone help?
It always comes down to the request/response model. You just have to craft a series of http requests such that you get the desired responses. In this case, you also need the server to treat each request as part of the same session. To do that, you need to figure out how the server is tracking sessions. It could be a number of things, from cookies to hidden inputs to form actions, post data, or query strings. If I had to guess I'd put my money on a cookie in this case (I haven't checked the links). If this holds true, you need to send the first request, save the cookie you get back, and then send that cookie along with the 2nd request.
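Assuming the cookie guess is right, here is a sketch of carrying one session across both requests with urllib2 and cookielib (the entry-page URL is an assumption based on the example link below; adjust to the real pages):

import urllib2
import cookielib

# One cookiejar shared by both requests = one session as far as the server cares
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# Hit the frame page first so any session cookie lands in the jar...
opener.open("http://cad.chp.ca.gov/")  # assumed entry page

# ...then the inner page goes out with that cookie attached automatically
page = opener.open("http://cad.chp.ca.gov/iiqr.asp?Center=RDCC&LogNumber=0197D0820")
print page.read()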
It could also be that the initial page has buttons and links that get you to the second page. Those links will look something like <A href="http://cad.chp.ca.gov/iiqr.asp?Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b="> where a lot of the gobbledygook is generated by the first page.
The "Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b=" part encodes some session information that you must get from the first page.
And, of course, you might even need to do both.
You might also try BeautifulSoup in addition to mechanize. I'm not positive, but you should be able to parse the DOM and drill down to the framed page.
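For instance, a sketch of pulling the frame URLs out of the outer page with BeautifulSoup (assuming the outer page really uses frame/iframe tags; the entry URL is an assumption):

import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen("http://cad.chp.ca.gov/").read()  # assumed outer page
soup = BeautifulSoup(html)
for frame in soup.findAll(["frame", "iframe"]):
    print frame.get("src")  # inner page URL(s) to fetch next, in the same session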
I also find Tamper Data to be a rather useful plugin when I'm writing scrapers.