Python3 - Urllib & BeautifulSoup4 to Extract Specific Text

I am new to Python and I am trying to get a script working with urllib and BeautifulSoup4 to collect the tweets that are streamable via the emojitracker API. It outputs the tweets for a specific emoji as JSON. An example is this link (opens in Chrome):
http://emojitracker.com/api/details/1F52B
I can get all the text from the JSON, but I only want the tweet itself (the value after "text":). I had a look around and found an example that gets all the links on a page using soup.findAll("a", class_="classname").
I used inspect element and found that the tweet I need is stored as <span class="type-string">tweet goes here</span>. So I tried the following:
from bs4 import BeautifulSoup
import urllib.request
url = "http://emojitracker.com/api/details/1F52B"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
tweets = soup.findAll("span", class_="type-string")
for tweet in tweets:
    print(tweet.string)
Running this, it did not print anything. How can I make it print only the tweets?

The page you linked to is not an HTML page; it is formatted as JSON, so you won't be able to treat it as HTML.
As I understand it, what you want is to retrieve all of the recent tweets.
To do that, get the response as you already do, then parse the response string into a Python dictionary using the json library (which does not require installation, as it is part of the standard library).
The code looks like this:
import json
import urllib.request
url = "http://emojitracker.com/api/details/1F52B"
page = urllib.request.urlopen(url)
# Decode the raw bytes and parse them into a dictionary.
# Don't name this variable "json": it would shadow the module.
data = json.loads(page.read().decode("utf-8"))
for tweet in data["recent_tweets"]:
    print(tweet["text"])
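As a side note, urlopen() returns a file-like object, so on Python 3.6+ (where the json module can detect the encoding of raw bytes itself) the read-and-decode step can be folded into json.load; a minimal sketch:
import json
import urllib.request
with urllib.request.urlopen("http://emojitracker.com/api/details/1F52B") as page:
    data = json.load(page)  # json.load accepts any file-like object
for tweet in data["recent_tweets"]:
    print(tweet["text"])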
Hope it helps,

Related

scraping inside script tag with beautifulsoup

I'm scraping data from an e-commerce site and I need the model number of each laptop. But in the div tags there are no model numbers. I found the model number inside a script tag as "productCode". For this example it's:
"productCode":"MGND3TU/A"
How can I gather the "productCode" data? I couldn't work it out from other posts.
Edit: I found the "productCode" inside a script tag, but I don't know how to get it. You can check from the page source.
Since the JSON is hidden in the <head>, it can be parsed, but it takes some custom logic.
Unfortunately the script tag exports the JSON to a window variable, so we'll need to remove that before we can parse it. The steps, implemented in the code below:
Get the URL
Get all <script> tags
Check if PRODUCT_DETAIL_APP_INITIAL_STAT exists in the string (valid JSON)
Remove the prefix (hardcoded)
Find the index of the next key (hardcoded)
Remove everything after the suffix
Try to parse the remainder as JSON
Print json['product']['productCode'] if it exists
import json
import requests
from bs4 import BeautifulSoup
reqs = requests.get("https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132")
soup = BeautifulSoup(reqs.text, 'html.parser')
for sc in soup.findAll('script'):
    if len(sc.contents) > 0 and "PRODUCT_DETAIL_APP_INITIAL_STAT" in sc.contents[0]:
        # Strip the hardcoded 44-character prefix
        # "window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=".
        withoutBegin = sc.contents[0][44:]
        # Cut everything from the next window assignment onwards.
        endIndex = withoutBegin.find('window.TYPageName=') - 1
        withoutEnd = withoutBegin[:endIndex]
        try:
            j = json.loads(withoutEnd)
            if j['product']['productCode']:
                print(j['product']['productCode'])
        except Exception:
            print("Unable to parse JSON")
            continue
Output:
MGND3TU/A
In this case BeautifulSoup is not needed, because the response can be searched directly with a regex:
json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
Example
import requests, re, json
r = requests.get('https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132').text
json_data = json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
print(json_data['product']['productCode'])
Output
MGND3TU/A
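One caveat with this pattern: a bare . does not cross newlines, so it relies on the embedded JSON being minified onto one line. If that ever changes, a slightly more defensive variant (same assumptions about the page structure, with dots escaped and re.DOTALL added) might look like:
import requests, re, json
r = requests.get('https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132').text
m = re.search(r"window\.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*?}});window", r, re.DOTALL)
if m:
    print(json.loads(m.group(1))['product']['productCode'])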
That's because those tags are generated using JavaScript. When you send a request to that URL, you get back a response containing the information (technically, JSON) that a JS script then uses to build the DOM for you.
To see what the returned response actually is, either print the value of r.text (where r is returned from requests.get()) or use "view page source" in the browser (not the inspect-element panel, which shows the rendered DOM).
To solve it, you can use something that can render JS, just like your browser does, for example Selenium. The requests module is not capable of rendering JS; it is just for sending and receiving requests.
Alternatively, manually extract the JSON text from the returned text (using a regex, as in the answer above) and create a Python dictionary from it.
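Since Selenium was mentioned, here is a minimal sketch of that route (assuming Selenium 4+ with a Chrome driver available; the CSS selector is hypothetical and would need to be taken from the rendered page):
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()  # Selenium 4+ can locate the driver itself
driver.get("https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132")
# Once the JS has run, the rendered DOM can be queried directly.
# NOTE: ".product-code" is a hypothetical selector, for illustration only.
element = driver.find_element(By.CSS_SELECTOR, ".product-code")
print(element.text)
driver.quit()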

When I scrape data from a website it only returns a newline

I've tried the code with different websites and elements, but nothing was working.
import requests
from lxml import html
page = requests.get('https://www.instagram.com/username.html')
tree = html.fromstring(page.content)
follow = tree.xpath('//span[@class="g47SY"]/text()')
print(follow)
input()
Above is the code I tried to use to acquire the number of Instagram followers someone had.
One issue with web scraping Instagram is that a lot of content, including tag attribute values, is rendered dynamically. So the class you are using to fetch followers may change.
If you are able to use the Beautiful Soup library in Python, you might have an easier time parsing the page and getting the data. You can install it using pip install bs4. You can then search for the og:description descriptor, which follows the Open Graph protocol, and parse it to get follower counts.
Here's an example script that should get the follower count for a particular user:
import requests
from bs4 import BeautifulSoup
username = 'google'
html = requests.get('https://www.instagram.com/' + username)
bs = BeautifulSoup(html.text, 'lxml')
# The og:description meta tag contains e.g. "12m Followers, 33 Following, 1,224 Posts - ..."
item = bs.select_one("meta[property='og:description']")
# The sibling meta tag holds the account name, should you need it.
name = item.find_previous_sibling().get("content").split("•")[0]
follower_count = item.get("content").split(",")[0]
print(follower_count)
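If you need the count as a number rather than the raw "12m Followers" string, a small helper can normalize it; a sketch, assuming Instagram keeps the "<count> Followers, ..." format with optional k/m suffixes:
import re
def parse_count(text):
    # Turn '12.5m Followers' or '1,234 Followers' into an int (approximate for k/m suffixes).
    match = re.search(r'([\d.,]+)\s*([km]?)', text, re.IGNORECASE)
    if not match:
        return None
    number = float(match.group(1).replace(',', ''))
    multiplier = {'k': 1_000, 'm': 1_000_000}.get(match.group(2).lower(), 1)
    return int(number * multiplier)
print(parse_count('12.5m Followers'))  # 12500000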

How to loop through a vector of URLs and scrape some basic tags from each

I am trying to loop through a list of URLs and scrape some data from each link. Here is my code.
from bs4 import BeautifulSoup as bs
import webbrowser
import requests
url_list = ['https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy',
            'https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy_history']
for link in url_list:
    File = webbrowser.open(link)
    File = requests.get(link)
    data = File.text
    soup = bs(data, "lxml")
    tspans = soup.find_all("tspan")
    tspans
I think this is pretty close, but I'm getting nothing for the 'tspans' variable. I get no error; 'tspans' just shows [].
This is an internal corporate intranet, so I can't share the exact details, but I think it's just a matter of grabbing all the HTML elements named 'tspans' and writing all of them to a text file or a CSV file. That's my ultimate goal. I want to collate everything into a large list and write it all to a file.
As an aside, I was going to use Selenium to log into this site, which requires creds, but it seems like the code I'm testing now lets you open new tabs in a browser, and everything loads fine if you are already logged in. Is this best practice, or should I use the full login creds + Selenium? I'm just trying to keep things simple.
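A minimal sketch of the collate-and-write step described above (assuming the session is already authenticated, e.g. by passing the intranet's cookies to requests; the output filename is arbitrary):
import csv
import requests
from bs4 import BeautifulSoup as bs
url_list = ['https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy',
            'https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy_history']
all_tspans = []
for link in url_list:
    resp = requests.get(link)  # add auth/cookies here if the intranet requires them
    soup = bs(resp.text, "lxml")
    # Collect the text of every <tspan> on the page, tagged with its source URL.
    all_tspans.extend((link, t.get_text(strip=True)) for t in soup.find_all("tspan"))
with open("tspans.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "tspan_text"])
    writer.writerows(all_tspans)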

BeautifulSoup's "find" acting inconsistently (bs4)

I'm scraping the NFL's website for player statistics. I'm having an issue when parsing the web page and trying to get to the HTML table which contains the actual information I'm looking for. I successfully downloaded the page and saved it into the directory I'm working in. For reference, the page I've saved can be found here.
# import relevant libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("1998.html"))
result = soup.find(id="result")
print result
At one point I ran the code and result printed the correct table I was looking for. Every other time, it doesn't contain anything! I'm assuming this is user error, but I can't figure out what I'm missing. Using "lxml" returned nothing, and I can't get html5lib to work (is that a parsing library?).
Any help is appreciated!
First, you should read the contents of your file before passing it to BeautifulSoup.
soup = BeautifulSoup(open("1998.html").read())
Second, verify manually that the table in question exists in the HTML by printing the contents to screen. The .prettify() method makes the data easier to read.
print soup.prettify()
Lastly, if the element does in fact exist, the following will be able to find it:
table = soup.find('table',{'id':'result'})
A simple test script I wrote cannot reproduce your results.
import urllib
from bs4 import BeautifulSoup
def test():
    # The URL of the page you're scraping.
    url = 'http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING&conference=null&season=1998&seasonType=REG&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1'
    # Make a request to the URL.
    conn = urllib.urlopen(url)
    # Read the contents of the response.
    html = conn.read()
    # Close the connection.
    conn.close()
    # Create a BeautifulSoup object and find the table.
    soup = BeautifulSoup(html)
    table = soup.find('table', {'id': 'result'})
    # Find all rows in the table.
    trs = table.findAll('tr')
    # Print to screen the number of rows found in the table.
    print len(trs)
test()
This outputs 51 every time.
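For anyone on Python 3, a rough equivalent of the same test (urllib.urlopen moved to urllib.request.urlopen; assuming the page is still reachable):
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = ('http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING'
       '&conference=null&season=1998&seasonType=REG'
       '&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1')
with urlopen(url) as conn:
    html = conn.read()
soup = BeautifulSoup(html, 'html.parser')  # an explicit parser avoids the bs4 warning
table = soup.find('table', {'id': 'result'})
print(len(table.findAll('tr')))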

Python XML Parser with BeautifulSoup. How do I remove tags?

For a project I decided to make an app that helps people find friends on Twitter.
I have been able to grab usernames from XML pages. So, for example, with my current code I can get <uri>http://twitter.com/username</uri> from an XML page, but I want to remove the <uri> and </uri> tags using Beautiful Soup.
Here is my current code:
import urllib
from BeautifulSoup import BeautifulStoneSoup
doc = urllib.urlopen("http://search.twitter.com/search.atom?q=travel").read()
soup = BeautifulStoneSoup(''.join(doc))
data = soup.findAll("uri")
Don't use BeautifulSoup to parse Twitter, use their API (also, don't use BeautifulSoup, use lxml). To answer your question:
import urllib
from BeautifulSoup import BeautifulSoup
resp = urllib.urlopen("http://search.twitter.com/search.atom?q=travel")
soup = BeautifulSoup(resp.read())
# extract() removes each <uri> element from the parse tree entirely.
for uri in soup.findAll('uri'):
    uri.extract()
To answer your question about BeautifulSoup: .text is what you need to grab the contents of each <uri> tag. Here I extract the information with a list comprehension:
>>> uris = [uri.text for uri in soup.findAll('uri')]
>>> len(uris)
15
>>> print uris[0]
http://twitter.com/MarieJeppesen
But, as zeekay says, Twitter's REST API is a better approach for querying Twitter.
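Since lxml came up, a minimal sketch of the same extraction with it (Python 2 era, to match the rest of this answer; the Atom endpoint has long since been retired):
import urllib
from lxml import etree
resp = urllib.urlopen("http://search.twitter.com/search.atom?q=travel")
tree = etree.parse(resp)
# The Atom feed uses a default namespace, so match <uri> by local name.
uris = tree.xpath('//*[local-name()="uri"]/text()')
print uris[:5]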
