I am just a beginner in Python, so I don't know much about it.
For my research project I have to get the friend lists of users (already defined) from Facebook and Twitter by crawling webpages with Python.
I don't know how to start: log in to an account, go to a friend, save their page, then go to another page and do the same.
Can anyone please tell me how to do it?
Use the Facebook Graph API.
https://towardsdatascience.com/how-to-use-facebook-graph-api-and-extract-data-using-python-1839e19d6999
Or use this link for code:
https://codereview.stackexchange.com/questions/167486/parser-for-facebook-friend-list
You can use this Python code for that task, taken from the link above...
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class FacebookCrawler:
    LOGIN_URL = 'https://www.facebook.com/login.php?login_attempt=1&lwv=111'

    def __init__(self, login, password):
        chrome_options = webdriver.ChromeOptions()
        # disable browser notification pop-ups
        prefs = {"profile.default_content_setting_values.notifications": 2}
        chrome_options.add_experimental_option("prefs", prefs)

        self.driver = webdriver.Chrome(chrome_options=chrome_options)
        self.wait = WebDriverWait(self.driver, 10)
        self.login(login, password)

    def login(self, login, password):
        self.driver.get(self.LOGIN_URL)

        # wait for the login page to load
        self.wait.until(EC.visibility_of_element_located((By.ID, "email")))

        self.driver.find_element_by_id('email').send_keys(login)
        self.driver.find_element_by_id('pass').send_keys(password)
        self.driver.find_element_by_id('loginbutton').click()

        # wait for the main page to load
        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a#findFriendsNav")))

    def _get_friends_list(self):
        return self.driver.find_elements_by_css_selector(".friendBrowserNameTitle > a")

    def get_friends(self):
        # navigate to "friends" page
        self.driver.find_element_by_css_selector("a#findFriendsNav").click()

        # continuous scroll until no more new friends loaded
        num_of_loaded_friends = len(self._get_friends_list())
        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            try:
                self.wait.until(lambda driver: len(self._get_friends_list()) > num_of_loaded_friends)
                num_of_loaded_friends = len(self._get_friends_list())
            except TimeoutException:
                break  # no more friends loaded

        return [friend.text for friend in self._get_friends_list()]

if __name__ == '__main__':
    crawler = FacebookCrawler(login='login', password='password')

    for friend in crawler.get_friends():
        print(friend)
You can use Facebook's Graph API to fetch the friend lists of people who granted permission to your website, but only if Facebook approves your website to access this data (you need to request the permission). I think the chances of getting approval for a personal website are not very high.
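If you do get that approval, a minimal sketch of calling the friends edge with the requests library might look like the following (the access token, the API version and the user_friends permission are assumptions on my part, and the endpoint only returns friends who also authorized your app):
import requests

ACCESS_TOKEN = "your_user_access_token"  # hypothetical token granted after app review

def get_friends(token):
    """Page through /me/friends and collect friend names."""
    url = "https://graph.facebook.com/v2.12/me/friends"
    params = {"access_token": token, "limit": 100}
    friends = []
    while url:
        payload = requests.get(url, params=params).json()
        friends.extend(item["name"] for item in payload.get("data", []))
        url = payload.get("paging", {}).get("next")  # present while more pages exist
        params = {}  # the "next" URL already contains the query string
    return friends

print(get_friends(ACCESS_TOKEN))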
Another way of getting this data is to crawl friend lists with an automated script or application. For this to work:
That person should be your friend on Facebook
That person's friend list should be visible to friends (some people change the privacy setting to "Only me")
You need their Facebook profile URLs
If everything is set up, the crawler visits the profile URLs one by one and opens each friend list to collect the data (see the rough sketch below).
Please note that crawling data on Facebook may cause legal issues depending on where you live.
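A rough sketch of that crawl, assuming you are already logged in with a driver like the FacebookCrawler above (the profile URLs and the CSS selector are placeholders; Facebook's markup changes frequently, so the selector will almost certainly need adjusting):
import time

# hypothetical list of friends' profile URLs collected beforehand
profile_urls = [
    "https://www.facebook.com/friend.one",
    "https://www.facebook.com/friend.two",
]

for url in profile_urls:
    driver.get(url + "/friends")  # open that person's friend list
    time.sleep(5)                 # crude wait; an explicit WebDriverWait is better
    names = [a.text for a in driver.find_elements_by_css_selector(".friendBrowserNameTitle > a")]
    print(url, names)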
You want to start by looking at the requests library.
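For example, fetching a page and getting its HTML takes only a few lines (the URL here is just a placeholder):
import requests

response = requests.get("https://example.com/some-profile-page")  # placeholder URL
response.raise_for_status()  # stop early if the request failed
html = response.text         # raw HTML you would then parse, e.g. with BeautifulSoup
print(len(html))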
I was going to use Selenium to crawl the web
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome('./chromedriver', options=options)
driver.get('https://steamdb.info/tag/1742/?all')
driver.implicitly_wait(3)

li = []
games = driver.find_elements_by_xpath('//*[@class="table-products.text-center.dataTable"]')
for i in games:
    time.sleep(5)
    li.append(i.get_attribute("href"))
print(li)
After accessing the Steam URL that I was looking for, I tried to find something called an appid.
The picture below is the HTML I'm looking for.
I'm trying to find the number next to "data-appid=".
But if I run my code, nothing is saved in "games".
Correct me if I'm wrong, but from what I can see this Steam page requires you to log in. Are you sure that when the webdriver opens the page the same data is available to you?
Additionally, when using By, the correct syntax would be games = driver.find_elements(By.XPATH, '//*[@class="table-products.text-center.dataTable"]')
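If what you are after is the number in data-appid, something along these lines may work once the table actually renders for you (the tr.app row markup is an assumption based on the screenshot in the question, so double-check it in the dev tools):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome('./chromedriver')
driver.get('https://steamdb.info/tag/1742/?all')
driver.implicitly_wait(10)

# assumes every game row looks like <tr class="app" data-appid="..."> as in the question's screenshot
rows = driver.find_elements(By.CSS_SELECTOR, "tr.app")
appids = [row.get_attribute("data-appid") for row in rows]
print(appids)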
I am trying to make an Instagram bot that can perform various functions - InstaPy kept timing out on me, so I decided to use Selenium. BUT the issue is: I can't seem to get past the first hurdle of actually logging into IG.
I am not getting any errors on the console, but it won't get me past the additional cookies acceptance page. I have played with the xpath and made a few tweaks but still nothing - any ideas on a fix here?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time, urllib.request
import requests
PATH = r"/Users/PycharmProjects/pythonProject13/chromedriver"
driver = webdriver.Chrome(PATH)
driver.get('https://www.instagram.com')
#login
time.sleep(5)
notnow = driver.find_element_by_xpath("/html/body/div[4]/div/div/button[2], 'Allow Essential and Optional Cookies')]").click()
username=driver.find_element_by_css_selector("input[name='username']") #arialabelondevtools = #Phone number, username or email address
password=driver.find_element_by_css_selector("input[name='password']")
username.clear()
password.clear()
username.send_keys("testacct1")
password.send_keys("testpassword123")
login = driver.find_element_by_css_selector("button[type='submit']").click()
One of the most common mistakes people make is writing absolute xpaths, or copying the xpath straight from the browser. Instead, write smarter xpaths using id, class and other attributes.
I recently did a login to Instagram, and here is a simple way to do it:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('https://www.instagram.com/')
wait = WebDriverWait(driver, 30)
wait.until(EC.visibility_of_element_located((By.XPATH, '//input[@name="username"]')))
driver.find_element_by_xpath('//input[@name="username"]').send_keys('your_login')
driver.find_element_by_xpath('//input[@type="password"]').send_keys('your_password')
driver.find_element_by_xpath('//input[@type="password"]').submit()
Once you're past the login page you can
driver.get('https://instagram.com/')
and it will reload to your home page...
I've created a script in Python, in combination with Selenium, implementing proxies within it to log in to Facebook and scrape the name of the user whose post is at the top of my feed. I would like the script to do this every five minutes, indefinitely.
As this continuous logging in may lead to my account being banned, I thought to implement proxies within the script to do the whole thing anonymously.
This is what I've written so far:
import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_first_user(random_proxy):
    options = webdriver.ChromeOptions()
    prefs = {"profile.default_content_setting_values.notifications": 2}
    options.add_experimental_option("prefs", prefs)
    options.add_argument(f'--proxy-server={random_proxy}')

    with webdriver.Chrome(options=options) as driver:
        wait = WebDriverWait(driver, 10)
        driver.get("https://www.facebook.com/")
        driver.find_element_by_id("email").send_keys("username")
        driver.find_element_by_id("pass").send_keys("password", Keys.RETURN)
        user = wait.until(EC.presence_of_element_located((By.XPATH, "//h4[@id][@class][./span[./a]]/span/a"))).text
        return user

if __name__ == '__main__':
    proxies = [`list of proxies`]
    while True:
        random_proxy = proxies.pop(random.randrange(len(proxies)))
        print(get_first_user(random_proxy))
        time.sleep(60000*5)
How to stay undetected while scraping data continuously from a site that requires authentication?
I'm not sure why you would want to continuously log into your Facebook account every 5 minutes to scrape content, and using a random proxy address for each login would likely raise a red flag with Facebook's security rules.
Instead of logging into Facebook every 5 minutes, I would recommend staying logged in. Selenium has a method that refreshes a webpage being controlled by automation. By using this method you could refresh your Facebook feed at a predefined interval, such as 5 minutes.
The code below uses this refresh method to reload the page. The code also checks for the user post at the top of your feed.
In testing I noted that Facebook uses some randomized tagging, which is likely used to mitigate scraping. I also noted that Facebook changes the username format on posts linked to groups, so more testing is required if you want the usernames linked to these posts. I highly recommend conducting more tests to determine which user elements aren't being scraped correctly.
from time import sleep
from random import randint

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

chrome_options = Options()
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-popup-blocking")

# disable the banner "Chrome is being controlled by automated test software"
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_experimental_option("excludeSwitches", ['enable-automation'])

# global driver
driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=chrome_options)
driver.get('https://www.facebook.com')
driver.implicitly_wait(20)
driver.find_element_by_id("email").send_keys("your_username")
driver.find_element_by_id("pass").send_keys("your_password")
driver.implicitly_wait(10)
driver.find_element_by_xpath("//button[text()='Log In']").click()

# this function checks for a standard username tag
def user_element_exist():
    try:
        if driver.find_element_by_xpath("//h4[@id][@class][./span[./a]]/span/a"):
            return True
    except NoSuchElementException:
        return False

# this function looks for a username linked to Facebook Groups at the top of your feed
def group_element():
    try:
        if driver.find_element_by_xpath("//*[starts-with(@id, 'jsc_c_')]/span[1]/span/span/a/b"):
            poster_name = driver.find_element_by_xpath("//*[starts-with(@id, 'jsc_c_')]/span[1]/span/span/a/b").text
            return poster_name
        if driver.find_element_by_xpath("//*[starts-with(@id, 'jsc_c_')]/strong[1]/span/a/span/span"):
            poster_name = driver.find_element_by_xpath("//*[starts-with(@id, 'jsc_c_')]/strong[1]/span/a/span/span").text
            return poster_name
    except NoSuchElementException:
        return "No user information found"

while True:
    element_exists = user_element_exist()
    if not element_exists:
        user_name = group_element()
        print(user_name)
        driver.refresh()
    elif element_exists:
        user_name = driver.find_element_by_xpath("//h4[@id][@class][./span[./a]]/span/a").text
        print(user_name)
        driver.refresh()

    # set the sleep timer to fit your needs
    sleep(300)  # This sleeps for 300 seconds, which is 5 minutes.

    # I would likely use a random sleep function
    # sleep(randint(180, 360))
Instead of logging in every 5 minutes, try to move that part out of the loop so you log in only once:
if __name__ == '__main__':
    with webdriver.Chrome(options=options) as driver:
        wait = WebDriverWait(driver, 10)
        driver.get("https://www.facebook.com/")
        driver.find_element_by_id("email").send_keys("username")
        driver.find_element_by_id("pass").send_keys("password", Keys.RETURN)

        while True:
            user = wait.until(EC.presence_of_element_located((By.XPATH, "//h4[@id][@class][./span[./a]]/span/a"))).text
            print(user)
            time.sleep(60 * 5)  # 5 minutes
Also consider using a random interval instead of a hardcoded sleep:
import random
time.sleep(random.randint(240, 360)) # wait for 4~6 minutes
I'm working on automating a game I want to get ahead in called Pokemon Vortex. When I log in using Selenium it works just fine, but when I try to load a page that requires a user to be logged in, I am sent right back to the login page (I have tried it outside of Selenium with the same browser, Chrome).
This is what I have:
import time
from selenium import webdriver
from random import randint
driver = webdriver.Chrome(r'C:\Program Files (x86)\SeleniumDrivers\chromedriver.exe')
driver.get('https://zeta.pokemon-vortex.com/dashboard/');
time.sleep(5) # Let the user actually see something!
usernameLoc = driver.find_element_by_id('myusername')
passwordLoc = driver.find_element_by_id('mypassword')
usernameLoc.send_keys('mypassword')
passwordLoc.send_keys('12345')
submitButton = driver.find_element_by_id('submit')
submitButton.submit()
time.sleep(3)
driver.get('https://zeta.pokemon-vortex.com/map/10')
time.sleep(10)
I'm using Python 3.6+ and I literally just installed Selenium today, so it's up to date. How do I force Selenium to hold onto cookies?
Using a pre-defined user profile might solve your problem. This way your cache will be saved and will not be deleted.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--user-data-dir=C:/Users/user_name/AppData/Local/Google/Chrome/User Data")
driver = webdriver.Chrome(options=options)
driver.get("xyz.com")
I have written many scrapers, but I am not really sure how to handle infinite scrollers. These days most websites, e.g. Facebook and Pinterest, have infinite scrollers.
You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.
Step 1 : Install Selenium using pip
pip install selenium
Step 2 : use the code below to automate infinite scroll and extract the source code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import sys
import unittest, time, re

class Sel(unittest.TestCase):

    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"
        self.verificationErrors = []
        self.accept_next_alert = True

    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get(self.base_url + "/search?q=stckoverflow&src=typd")
        driver.find_element_by_link_text("All").click()
        for i in range(1, 100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')

if __name__ == "__main__":
    unittest.main()
Step 3 : Print the data if required.
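For example, you could dump the collected source to a file for later parsing (this assumes you keep the data variable from step 2, e.g. by writing it out at the end of test_sel):
# inside test_sel, right after data = html_source.encode('utf-8')
with open("scrolled_page.html", "wb") as output_file:
    output_file.write(data)  # save the UTF-8 encoded page source for later parsing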
Most sites that have infinite scrolling do (as Lattyware notes) have a proper API as well, and you will likely be better served by using this rather than scraping.
But if you must scrape...
Such sites use JavaScript to request additional content from the server when you reach the bottom of the page. All you need to do is figure out the URL of that additional content and you can retrieve it. Figuring out the required URL can be done by inspecting the script, by using the Firefox Web Console, or by using a debug proxy.
For example, open the Firefox Web Console, turn off all the filter buttons except Net, and load the site you wish to scrape. You'll see all the files as they are loaded. Scroll the page while watching the Web Console and you'll see the URLs being used for the additional requests. Then you can request that URL yourself and see what format the data is in (probably JSON) and get it into your Python script.
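Once you have spotted that URL in the console, a minimal sketch with the requests library could look like this (the endpoint, parameter names and JSON layout are purely illustrative; substitute whatever the site actually requests):
import requests

# hypothetical paginated endpoint discovered in the browser's network view
url = "https://example.com/api/feed"
params = {"offset": 0, "limit": 20}

while True:
    payload = requests.get(url, params=params).json()
    items = payload.get("items", [])
    if not items:
        break  # no more content to load
    for item in items:
        print(item)
    params["offset"] += len(items)  # ask for the next page on the following request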
Finding the URL of the AJAX source is the best option, but it can be cumbersome for certain sites. Alternatively, you could use a headless browser like QWebKit from PyQt and send keyboard events while reading the data from the DOM tree. QWebKit has a nice and simple API.