Scraping Instagram title using Beautiful Soup


Using BeautifulSoup, I want to get the title tag of an Instagram page for a given username. The user enters the username they want to search for, the page is fetched and parsed, and the title tag is checked to see whether it contains that username. Instead, I get a TypeError mentioning 'NoneType', even though the page does exist. Below is the code.
from bs4 import BeautifulSoup
import requests

def main():
    username = input("Enter username: ")
    instagram(username)

def instagram(username):
    url = "https://www.instagram.com/" + username
    results = requests.get(url)
    doc = BeautifulSoup(results.text, "html.parser")
    name = doc.title
    if username in name:
        print(name)
    else:
        print("Not found")
        print(name)

if __name__ == "__main__":
    main()
Strangely, when I first tested this it was working completely fine but now it only returns errors or 'None'. Thanks for the help.

First of all, note that scraping data is against the terms of service of many websites, including Instagram. Additionally, the structure of the HTML can change at any time, so this code might stop working if the website changes its structure.
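As for the TypeError itself: Beautiful Soup's `doc.title` is `None` whenever the parsed HTML has no `<title>` tag, and `username in None` then fails. That can happen if the server answers with a login or script-only page (an assumption about what Instagram returned here). A minimal sketch of a guard, using a hypothetical helper `title_containing` and made-up inline HTML in place of a live request:

```python
from bs4 import BeautifulSoup


def title_containing(html, username):
    """Return the page title text if it mentions username, else None.

    Hypothetical helper for illustration; it is not part of bs4.
    """
    doc = BeautifulSoup(html, "html.parser")
    # doc.title is None when the document has no <title> tag.
    if doc.title is None:
        return None
    # get_text() yields a plain string, so `in` is a substring test.
    text = doc.title.get_text()
    return text if username.lower() in text.lower() else None


# Made-up sample pages standing in for results.text.
with_title = "<html><head><title>Jane (@jane_doe) on Instagram</title></head></html>"
without_title = "<html><head></head><body></body></html>"

print(title_containing(with_title, "jane_doe"))
print(title_containing(without_title, "jane_doe"))
```

Checking for `None` before the membership test turns the crash into an ordinary "Not found" case.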

Related

Web Scraping Emails using Python

I'm new to web scraping (using Python) and ran into a problem trying to get an email address from a university's athletic department site.
I've managed to navigate to the email I want to extract but don't know where to go from here. When I print what I have, all I get is '' and not the actual text of the email.
I'm attaching what I have so far; let me know if it needs a better explanation.
The website is https://goheels.com/staff-directory
Thanks!
Here's my code:
from bs4 import BeautifulSoup
import requests

urls = ''
with open('websites.txt', 'r') as f:
    for line in f.read():
        urls += line
urls = list(urls.split())
print(urls)

for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    try:
        body = soup.find(headers="col-staff_email category-0")
        links = body.a
        print(links)
    except Exception as e:
        print(f'"This url didn\'t work:" {url}')

The emails are hidden inside a <script> element. With a little pushing, shoving, css selecting and string splitting you can get there:
for em in soup.select('td[headers*="col-staff_email"] script'):
    target = em.text.split('var firstHalf = "')[1]
    fh = target.split('";')[0]
    lh = target.split('var secondHalf = "')[1].split('";')[0]
    print(fh + '@' + lh)
Output:
bubba.cunningham@unc.edu
molly.dalton@unc.edu
athgallo@unc.edu
dhollier@unc.edu
etc.
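The chained splits above raise an IndexError if a script element lacks either variable. A slightly more defensive sketch of the same idea uses regular expressions and joins the two halves with '@'. The helper name `email_from_script` and the sample script text are made up for illustration; the variable names `firstHalf` and `secondHalf` come from the splits above:

```python
import re

# Patterns mirror the string splits above: each half sits between
# `var <name> = "` and the closing quote.
FIRST_HALF = re.compile(r'var firstHalf = "([^"]*)"')
SECOND_HALF = re.compile(r'var secondHalf = "([^"]*)"')


def email_from_script(script_text):
    """Rebuild an address from script text, or return None if a half is missing.

    Hypothetical helper for illustration.
    """
    first = FIRST_HALF.search(script_text)
    second = SECOND_HALF.search(script_text)
    if first is None or second is None:
        return None
    return first.group(1) + "@" + second.group(1)


# Made-up script body in the shape the splits above imply.
sample = 'var firstHalf = "jane.doe"; var secondHalf = "example.edu"; document.write(firstHalf);'
print(email_from_script(sample))
print(email_from_script("var unrelated = 1;"))
```

In the loop above, this would be called as `email_from_script(em.text)`, skipping any result that is `None`.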

When I scrape data from a website it only returns a newline

I've tried the code with different websites and elements, but nothing was working.
import requests
from lxml import html
page = requests.get('https://www.instagram.com/username.html')
tree = html.fromstring(page.content)
follow = tree.xpath('//span[@class="g47SY"]/text()')
print(follow)
input()
Above is the code I tried to use to acquire the number of Instagram followers someone has.

One issue with web scraping Instagram is that a lot of content, including tag attribute values, is rendered dynamically. So the class you are using to fetch followers may change.
If you are able to use the Beautiful Soup library in Python, you might have an easier time parsing the page and getting the data. You can install it using pip install bs4. You can then search for the og:description meta tag, which follows the Open Graph protocol, and parse its content to get follower counts.
Here's an example script that should get the follower count for a particular user:
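import requests
from bs4 import BeautifulSoup

username = 'google'
html = requests.get('https://www.instagram.com/' + username)
# The 'lxml' parser requires the lxml package to be installed.
bs = BeautifulSoup(html.text, 'lxml')
item = bs.select_one("meta[property='og:description']")
if item is None:
    # select_one returns None when the page has no og:description tag.
    print("og:description not found")
else:
    name = item.find_previous_sibling().get("content").split("•")[0]
    # The first comma-separated field is expected to be the follower count.
    follower_count = item.get("content").split(",")[0]
    print(follower_count)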
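One caveat: taking the first comma-separated field breaks when the count itself contains a comma (for example "1,234 Followers" would yield "1"). A hedged sketch that pulls the count out with a regular expression instead. The shape of the content string is an assumption inferred from the split above, and `follower_count_from` is a hypothetical helper:

```python
import re

# Assumed shape of the og:description content:
# "<count> Followers, <count> Following, <count> Posts - ..."
FOLLOWERS = re.compile(r"([\d.,]+[kKmM]?)\s+Followers")


def follower_count_from(description):
    """Return the follower count as displayed, or None if absent."""
    match = FOLLOWERS.search(description)
    return match.group(1) if match else None


# Made-up description strings for illustration.
print(follower_count_from("1,234 Followers, 56 Following, 78 Posts - See photos"))
print(follower_count_from("12.5m Followers, 31 Following, 2,000 Posts"))
print(follower_count_from("No counts here"))
```

The value is returned as displayed text; converting abbreviations such as "12.5m" to a number would need an extra step.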

Scrape Facebook friends with BeautifulSoup

I've already done some basic web scraping with BeautifulSoup. For my next project I've chosen to scrape the Facebook friend list of a specified user. The problem is, Facebook lets you see friend lists of people only if you are logged in. So my question is: can I somehow bypass this and, if not, can I make BeautifulSoup act as if it were logged in?
Here's my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = input("enter url: ")
try:
    page = urlopen(url)
except:
    print("Error opening the URL")
soup = BeautifulSoup(page, 'html.parser')
content = soup.find('div', {"class": "_3i9"})
friends = ''
for i in content.findAll('a'):
    friends = friends + ' ' + i.text
print(friends)

BeautifulSoup doesn't require that you use an URL. Instead:
Inspect the friends list
Copy the parent tag containing the list to a new file (ParentTag.html)
Open the file as a string, and pass it to BeautifulSoup()
with open("path/to/ParentTag.html", encoding="utf8") as html:
    soup = BeautifulSoup(html, "html.parser")
Then, "you make-a the soup-a."
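As a rough sketch of what comes after those steps, the snippet below parses a saved fragment and collects the link text. The markup and names are invented for illustration (the real page's structure will differ), and an inline string stands in for the contents of ParentTag.html:

```python
from bs4 import BeautifulSoup

# Invented fragment standing in for the saved ParentTag.html contents.
saved_html = """
<div class="friend-list">
  <a href="/alice">Alice Example</a>
  <a href="/bob">Bob Example</a>
  <a href="/empty"></a>
</div>
"""

soup = BeautifulSoup(saved_html, "html.parser")
# Collect the visible text of each link, skipping links with no text.
friends = [a.get_text(strip=True) for a in soup.find_all("a") if a.get_text(strip=True)]
print(friends)
```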

"The problem is, Facebook lets you see friend lists of people only if you are logged in"
You can overcome this using Selenium. You'll need it to authenticate yourself, then you can find the user. Once you have found them, you can proceed in two ways:
You can get the HTML source with driver.page_source and from there use Beautiful Soup
Use the methods that Selenium provides to scrape the friends

Extract the username with beautifulsoup from Facebook

I want to extract the username from Facebook posts without using the API. I've already succeeded in extracting the timestamp, but the same algorithm is not working for the username.
As input I have a list of links like these:
https://www.facebook.com/barackobama/photos/a.10155401589571749/10156901908101749/?type=3&theater
https://www.facebook.com/photo.php?fbid=391679854902607&set=gm.325851774772841&type=1&theater
https://www.facebook.com/FisherHouse/photos/pcb.10157433176029134/10157433170239134/?type=3&theater
I've already tried searching with pageTitle, but it is not working as expected because it contains a lot of irrelevant information.
facebook = BeautifulSoup(req.text, "html.parser")
facebookusername = str(facebook.select('[id="pageTitle"]'))
My code now is:
req = requests.get(url)
facebook = BeautifulSoup(req.text, "html.parser")
divs = facebook.find_all('div', class_="_title")
for iteration in range(len(divs)):
    if 'title' in str(divs[iteration]):
        print(divs[iteration])
I need only the username as output.

As WizKid said, you should use the API. But to give you an answer: The name of the page seems to be nested within the h5-title. Extract the h5 first and then get the name.
x = facebook.find('h5')
title = x.find('a').getText()
I can't try it at the moment but that should do the trick.

Scrape password protected website with no token

(Sorry for my English; I'll do my best.)
I'm a newbie in Python and I'm looking for help with some web scraping. I already have working code to get the links I want, but the website is protected by a password.
With the help of a lot of questions I read, I managed to get working code to scrape the website after the login, but the links I want are on another page:
the login page is http://fantasy.trashtalk.co/login.php
the landing page (the one I scrape with this code) after login is http://fantasy.trashtalk.co/
and the page I want is http://fantasy.trashtalk.co/?tpl=classement&t=1
So I have this code (some imports are probably useless; they come from another script):
from bs4 import BeautifulSoup
import requests
from lxml import html
import urllib.request
import re

username = 'myusername'
password = 'mypass'
url = "http://fantasy.trashtalk.co/?tpl=classement&t=1"
log = "http://fantasy.trashtalk.co/login.php"
values = {'email': username,
          'password': password}
r = requests.post(log, data=values)
# Not sure about the code below but it works.
data = r.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
I understand that this code only lets me access the login page and then scrape what comes next (the landing page), but I can't figure out how to "save" my login info to access the page I want to scrape.
I think I should add something like this after the login code, but when I do, it only scrapes the links from the login page:
s = requests.get(url)
I also read some topics here using a "with session" approach, but I didn't manage to make it work.
Any help would be appreciated. Thank you for your time.

The issue is that the login has to be posted through a session object rather than with a standalone requests call: the session keeps the cookies set at login and sends them with later requests. I've modified your code below, and you should now have access to the HTML tags located in the scrape_url page. Good luck!
import requests
from bs4 import BeautifulSoup

username = 'email'
password = 'password'
scrape_url = 'http://fantasy.trashtalk.co/?tpl=classement&t=1'
login_url = 'http://fantasy.trashtalk.co/login.php'
login_info = {'email': username, 'password': password}

# Start a session so the login cookies persist across requests.
session = requests.session()
# Log in using your authentication information.
session.post(url=login_url, data=login_info)
# Request the page you want to scrape.
url = session.get(url=scrape_url)
soup = BeautifulSoup(url.content, 'html.parser')
for link in soup.findAll('a'):
    # .get() avoids a KeyError for anchors without an href attribute.
    print('\nLink href: ' + str(link.get('href')))
    print('Link text: ' + link.text)
