I am trying to access Facebook from Python :D
I want to fetch some data that requires me to be logged in to view. I know I will need cookies and such to view that data with Python, but I am entirely clueless when it comes to cookies.
How can I use Python to log in to Facebook, navigate to multiple pages and retrieve some data?
Okay. Potentially this is a very large question. Instead of using the standard API to retrieve information, you wish to screen scrape?
It's possible, although not recommended, as screen scraping relies on the HTML format not changing. It's not an impossible task, however.
To get started, you want to look at opening a URL:
http://docs.python.org/library/urllib2.html
It's super easy. The example on that page shows you something like this:
>>> import urllib2  # Python 2 only; on Python 3, use urllib.request instead
>>> f = urllib2.urlopen('http://facebook.com/')
>>> print f.read()
And you see you have HTML.
Now, Facebook will be smarter than your average site at blocking this type of automated login (I hope).
So you may want to look at handling the session manually:
import urllib2

req = urllib2.Request('http://www.facebook.com/')
req.add_header('Referer', 'http://www.lastpage.com/')  # tell the server which page we supposedly came from
r = urllib2.urlopen(req)
All snipped from the Python docs.
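Since logging in means carrying cookies from the login response into later requests, the usual next step is a cookie jar. Here is a minimal sketch using Python 3's urllib.request and http.cookiejar (on Python 2, the modules are urllib2 and cookielib); the login URL and form field names are assumptions, not necessarily Facebook's real ones:

import urllib.parse
import urllib.request
import http.cookiejar

# A CookieJar keeps the session cookies between requests
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Hypothetical login endpoint and field names; inspect the real login form to find them
form = urllib.parse.urlencode({'email': 'you@example.com', 'pass': 'secret'}).encode()
opener.open('https://www.facebook.com/login.php', form)

# Later requests through the same opener send the stored cookies automatically
page = opener.open('https://www.facebook.com/some/other/page')
print(page.read()[:200])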
I have a question with a probably well-known answer; however, I couldn't articulate it well enough to find answers on Google.
Let's say you are using the developer interface of the Chrome browser (press F12). If you click on the Network tab and go to any website, a lot of files will be queried there, for example images, stylesheets and JSON responses.
I now want to parse these JSON responses using Python.
Thanks in advance!
You can save the network requests to a .har file (JSON format) and analyze that.
In your network tools panel, there is a download button to export as HAR format.
import json
with open('myrequests.har') as f:
network_data = json.load(f)
print(network_data)
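The responses themselves sit under log.entries in the HAR file, so pulling out just the JSON bodies looks roughly like this (field names follow the HAR format; a sketch, not tested against your particular capture):

import json

with open('myrequests.har') as f:
    har = json.load(f)

# Each captured request/response pair is one entry under log.entries
for entry in har['log']['entries']:
    content = entry['response']['content']
    if 'json' in content.get('mimeType', '') and content.get('text'):
        print(entry['request']['url'])
        print(json.loads(content['text']))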
Or, as Jack Deeth answered, you can make the requests with Python instead of your browser and get the response JSON data that way.
This can sometimes be difficult, though, depending on the website and the nature of the request(s) (for example, needing to log in and/or figuring out how to get all the proper arguments to make the request).
I use requests to get the data, and it comes back as a Python dictionary:
import requests
r = requests.get("url/spotted/with/devtools")
r.json()["keys_observed_in_devtools"]
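If the endpoint refuses a bare request, it usually wants the same headers or cookies your browser sent; you can copy those out of the request shown in devtools. A sketch with placeholder values:

import requests

# Placeholder values; copy the real ones from the request headers in devtools
headers = {'User-Agent': 'Mozilla/5.0', 'X-Requested-With': 'XMLHttpRequest'}
cookies = {'sessionid': 'value-from-devtools'}

r = requests.get("url/spotted/with/devtools", headers=headers, cookies=cookies)
print(r.json())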
Perhaps you can try using Selenium.
Maybe the answers on this question can help you.
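If you do try Selenium, a minimal sketch looks like this (recent Selenium versions can locate a matching browser driver automatically; the URL is a placeholder):

from selenium import webdriver

driver = webdriver.Chrome()  # needs Chrome installed; Selenium 4.6+ fetches the driver itself
driver.get('https://example.com/')
html = driver.page_source  # the rendered HTML, after any JavaScript has run
driver.quit()
print(html[:200])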
from facebook_scraper import get_group_info
cookie = '?'
print(get_group_info('lebanon', cookie=cookie))
I am trying to scrape Facebook groups, but it is telling me facebook_scraper.exceptions.LoginRequired: A login (cookies) is required to see this page. How do I get this cookie?
I ran into this problem recently and it was actually quite easy to fix. All you need to do is add the cookies parameter to the function (for example get_group_info or get_posts).
The documentation states that you need to include:
The path to a file containing cookies in Netscape or JSON format. You can extract cookies from your browser after logging into Facebook with an extension like Get Cookies.txt (Chrome) or Cookie Quick Manager (Firefox). Make sure that you include both the c_user cookie and the xs cookie, you will get an InvalidCookies exception if you don't.
So, simply add the extension based on which browser you use, visit Facebook and log in, and then use the extension to extract and save your cookies in Netscape format. Save the file somewhere and then just read it into your code, for example:
from facebook_scraper import get_group_info

get_group_info("nintendo", cookies="/path/to/cookies/file")
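The same cookies argument works for the other scraper functions, so the group scrape from the question becomes (cookie file path is a placeholder):

from facebook_scraper import get_posts

# Same Netscape-format cookie file as above
for post in get_posts(group='lebanon', cookies='/path/to/cookies/file'):
    print(post['text'][:50])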
The solution is actually super simple: you just need to type in your credentials, like this:
from facebook_scraper import get_posts

for post in get_posts(group='EatsleeprepeatESR/', credentials={'email': '6969#gmail.com', 'pass': '*****************'}):
    print(post['text'][:50])
I took a programming class in Python, so I know the basics of the language. A project I'm currently attempting involves submitting a form repeatedly until the request is successful. To achieve success faster, I thought cutting the browser out of the process by directly sending and receiving data from the server would be quicker. Also, the website I'm creating the program for has a tendency to crash, but I'm pretty sure I could still send requests to and receive responses from the server. Currently, I'm just researching different resources I could use to complete the task. I understand mechanize makes it easy to fill out forms and submit them, but it requires a browser. So my question is: what would be the best resource to use within Python to communicate directly with the server, without a browser?
I apologize if any of my knowledge is flawed. I did take the class, but I'm still relatively new to the language.
Yes, there are plenty of ways to do this, but the easiest is the third-party library called requests.
With that installed, you can do for example:
import requests

requests.post("https://mywebsite/path", data={"key": "value"})
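Since the goal is to resubmit until it succeeds, you would typically wrap that call in a retry loop; a sketch with placeholder URL and form fields:

import requests
import time

url = "https://mywebsite/path"   # placeholder
payload = {"key": "value"}       # placeholder form fields

while True:
    try:
        r = requests.post(url, data=payload, timeout=10)
        if r.ok:  # any status code below 400 counts as success here
            break
    except requests.RequestException:
        pass  # site crashed or unreachable; just try again
    time.sleep(1)  # small pause so we don't hammer the server

print(r.status_code)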
You can try the example below.
from urllib.parse import urlencode
from urllib.request import Request, urlopen
url = 'https://httpbin.org/post' # Set destination URL here
post_fields = {'foo': 'bar'} # Set POST fields here
request = Request(url, urlencode(post_fields).encode())
json = urlopen(request).read().decode()
print(json)
I see from your tags that you've already decided to use requests.
Here's how to perform a basic POST request using requests:
Typically, you want to send some form-encoded data — much like an HTML form. To do this, simply pass a dictionary to the data argument. Your dictionary of data will automatically be form-encoded when the request is made.
import requests
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post("http://httpbin.org/post", data=payload)
print(response.text)
I took this example from the official requests documentation.
I suggest you read it, and also try the other examples available, in order to become more confident and decide which approach best suits your task.
I recently wanted to extract data from a website that seems to use cookies to grant me access. I do not know very much about those procedures, but apparently this interferes with my method of getting the HTML content of the website via Python and its requests module.
The code I am running to extract the information contains the following lines:
import requests
#...
response = requests.get(url, proxies=proxies)
content = response.text
where the website I am referring to is http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1 and proxies is a valid dict of my proxy servers (I tested those settings on websites that seemed to work fine). However, instead of the content of the article on this site, I receive the HTML content of the page that you get when you do not accept cookies in your browser.
As I am not really aware of what the website is actually doing and I lack real web development experience, I could not find a solution so far, even though a similar question might have been asked before. Is there any way to access the content of this website via Python?
import requests

startr = requests.get('https://viennaairport.com/login/')  # the first request collects the site's cookies
secondr = requests.post('http://xxx/', cookies=startr.cookies)  # pass them along on the next request
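A requests.Session does the same thing without passing cookies around by hand, which fits the IEEE page from the question (the proxy settings here are a placeholder for your own dict):

import requests

session = requests.Session()  # persists cookies between requests automatically
session.proxies = {'http': 'http://myproxy:8080'}  # placeholder; use your own proxies

url = 'http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1'
session.get(url)  # the first hit lets the site set its cookies
content = session.get(url).text  # the second hit sends them back, like a browser would
print(content[:300])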
I'm trying to get a redirected URL from another URL without using a Selenium object. I have a URL like:
http://registry.theknot.com/track/View?lt=RetailerGVR&r=325404419&rt=12160&a=994&st=RegistryProfile&ss=LinkedRegistries&sp=Logo
and it gets redirected to:
http://www.target.com/RegistryGiftGiverCmd?isPreview=false&status=completePageLink&registryType=WD&isAjax=false&listId=NjPO_i-DoIafZPZSFhaBRw&clkid=2gTTqGRwsXS4x%3AexW%3ATGBxiqUkWXSi0It0P5VM0&lnm=Online+Tracking+Link&afid=The+Knot%2C+Inc.+and+Subsidiaries&ref=tgt_adv_xasd0002
when it is opened in a browser.
I want to avoid instantiating a Selenium object and spawning a Firefox/Chrome process just to get the redirected URL. Is there any better way?
Thanks!
If this is just an HTTP redirect, urllib.request/urllib2 in the standard library can follow redirects just fine, as can third-party HTTP client libraries like requests and PycURL. In fact, in the simplest use cases, they do so automatically.
So, just:
>>> import urllib.request
>>> original_url = 'http://registry.theknot.com/track/View?lt=RetailerGVR&r=325404419&rt=12160&a=994&st=RegistryProfile&ss=LinkedRegistries&sp=Logo'
>>> u = urllib.request.urlopen(original_url)
>>> print(u.url)
http://www.target.com/RegistryGiftGiverCmd?isPreview=false&status=completePageLink&registryType=WD&isAjax=false&listId=NjPO_i-DoIafZPZSFhaBRw&clkid=0b5XTmU%3A5WbqRETSYD20AQKOUkWXSGQgQSquVU0&lnm=Online+Tracking+Link&afid=The+Knot%2C+Inc.+and+Subsidiaries&ref=tgt_adv_xasd0002
But if you just want the data, you don't even need that:
>>> data = u.read()
That's the content of the page you were redirected to.
(For Python 2.x, just replace urllib.request with urllib2 and it works the same.)
The only reason you'd need to use Selenium (or another browser automation and/or JS-environment library) is if the redirect is done through in-page JavaScript. Which it usually isn't, and isn't in this case. There's no reason to go outside the standard library, talk to another app, etc. for simple things like this.
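For completeness, requests follows redirects by default too and records the intermediate hops, so the same check looks like this:

import requests

r = requests.get('http://registry.theknot.com/track/View?lt=RetailerGVR&r=325404419&rt=12160&a=994&st=RegistryProfile&ss=LinkedRegistries&sp=Logo')
print(r.url)  # final URL after all redirects
for hop in r.history:  # each intermediate 3xx response
    print(hop.status_code, hop.url)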