I need to extract the text of embedded tweets on a webpage separately. The code below works OK, but I need to get rid of the start and end lines such as "Skip Twitter post by..." and "End Twitter post by...", as well as the date and "Report", leaving only the tweets. I can't even see where these lines come from or which tag to target. I would really appreciate your help!
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.bbc.co.uk/news/uk-44496876')
soup = BeautifulSoup(r.content, "html.parser")
article_soup = [s.get_text() for s in soup.find_all( 'div', {'class': 'social-embed'})]
tweets = '\n'.join(article_soup)
print(tweets)
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.bbc.co.uk/news/uk-44496876')
soup = BeautifulSoup(r.content, "html.parser")
article_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(article_soup)
print(tweets)
If you also want to get the author of the tweets, it's a bit trickier since there is no dedicated tag for the author. So I used Python to strip everything around the author from the blockquote's text, like this:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.bbc.co.uk/news/uk-44496876')
soup = BeautifulSoup(r.content, "html.parser")
articles_soup = [s for s in soup.find_all('blockquote', {'class': 'twitter-tweet'})]
tweets = []
for article_soup in articles_soup:
    tweet = article_soup.find('p').get_text()
    # The last <a href='...'></a> is the date; the others are part of the tweet
    date = article_soup.find_all('a')[-1].get_text()
    tweet_author = article_soup.get_text()[len(tweet):-len(date)].strip()
    tweets.append((tweet_author, tweet))
print(tweets)
Note 1: if you only want parts of the tweet_author, you can easily take the first element of the tuple and tweak it to get the value you want.
Note 2: the question's code example does not always return tweets; the issue is with the HTML page, since from time to time several elements are missing from the response. The quick workaround is to run the requests.get call once more - I suggest you look into this issue.
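A minimal retry sketch of that workaround (the number of attempts is an arbitrary choice here, and the check simply looks for the embedded-tweet blockquote):

import requests
from bs4 import BeautifulSoup

# Retry a few times, since the page occasionally comes back without the
# embedded-tweet markup; three attempts is an arbitrary choice.
for _ in range(3):
    r = requests.get('https://www.bbc.co.uk/news/uk-44496876')
    soup = BeautifulSoup(r.content, 'html.parser')
    if soup.find('blockquote', {'class': 'twitter-tweet'}):
        break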
Once I had run the original question's code, I found the right tags and got the tweets you expected, each tweet on a separate line.
I am new to Python and I am looking for a way to extract, with Beautiful Soup, existing open-source books that are available on gutenberg-de, such as this one.
I need to use them for further analysis and text mining.
I tried this code, found in a tutorial, and it extracts metadata, but instead of the body content it gives me a list of the "pages" I need to scrape the text from.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get("https://www.projekt-gutenberg.org/keller/heinrich/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_title, page_head)
I suppose I could use that as a second step to extract it then? I am not sure how, though.
Ideally I would like to store them in a tabular way and be able to save them as CSV, preserving the metadata author, title, year, and chapter. Any ideas?
What happens?
First of all, you get a list of pages because you are not requesting the right URL. Change it to:
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
If you are looping over all the URLs, I recommend storing the content in a list of dicts and then pushing it to CSV, pandas, or similar.
Example
import requests
from bs4 import BeautifulSoup
data = []
# Make a request
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
soup = BeautifulSoup(page.content, 'html.parser')
data.append({
    'title': soup.title,
    'chapter': soup.h2.get_text(),
    'text': ' '.join([p.get_text(strip=True) for p in soup.select('body p')[2:]])
})
data
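If you do loop over all the chapter pages, a sketch of that list-of-dicts-to-CSV idea could look like the following. The startswith('hein') link filter is an assumption based on the chapter file names (hein101.html, ...), the author/title/year metadata is filled in by hand as placeholders, and pandas is used only for the CSV export:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.projekt-gutenberg.org/keller/heinrich/'
index = BeautifulSoup(requests.get(base).content, 'html.parser')

# Collect the relative chapter links from the index page; the startswith('hein')
# filter is an assumption you may need to adapt.
chapter_urls = sorted({urljoin(base, a['href'])
                       for a in index.find_all('a', href=True)
                       if a['href'].startswith('hein')})

data = []
for url in chapter_urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    data.append({
        'author': 'Keller',   # placeholder, fill in or parse the metadata yourself
        'title': 'Heinrich',  # placeholder
        'year': '',           # placeholder
        'chapter': soup.h2.get_text(strip=True) if soup.h2 else '',
        'text': ' '.join(p.get_text(strip=True) for p in soup.select('body p')[2:]),
    })

pd.DataFrame(data).to_csv('book.csv', index=False)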
Basically, I want to find all links that contain certain key terms. In my case, the titles of these links that I want come in this form: abc... (common text), dce... (common text), ... I want to take all of the links containing "(common text)" and put them in the list. I got the code working and I understand how to find all links. However, I converted the links to strings to find the "(common text)". I know that this isn't good practice and I am not sure how to use Beautiful Soup to find this common element without converting to a string. The issue here is that the titles I am searching for are not all the same. Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import webbrowser
url = 'website.com'
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")
links = soup.find_all('a', limit=4000)
links_length = len(links)
string_links = []
targetlist = []
for a in range(links_length):
    string_links.append(str(links[a]))
    if '(common text)' in string_links[a]:
        targetlist.append(string_links[a])
NOTE: I am looking for the simplest method using Beautiful Soup to accomplish this. Any help will be appreciated.
Without the actual website and the actual output you want, it's very difficult to say exactly what you need, but here is a "cleaner" version of your approach using a list comprehension.
from bs4 import BeautifulSoup
import requests
import webbrowser
url = 'website.com'
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")
links = soup.find_all('a', limit=4000)
targetlist = [str(link) for link in links if "(common text)" in str(link)]
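If you'd rather not convert the tags to strings at all, BeautifulSoup can also filter on the link text directly. A sketch, keeping the placeholder URL from the question and assuming "(common text)" appears in the visible link text:

import re
import requests
from bs4 import BeautifulSoup

url = 'website.com'  # placeholder from the question
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# string= matches links whose text is a single string; the get_text() filter
# also catches links that wrap the text in nested tags.
by_string = soup.find_all('a', string=re.compile(r'\(common text\)'))
by_text = [a for a in soup.find_all('a') if '(common text)' in a.get_text()]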
I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from a jQuery call. I have managed to write the code below, which gets a large amount of text, where index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and print "redshift":"0.06", but I don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value > b').text
If you view the page source of the URL, you will find that there are two script elements containing CDATA, but the one you are interested in also has jQuery in it, so you have to select the script element based on that. After that, you need to do some cleaning to get rid of the CDATA markers and the jQuery wrapper. Then, with the help of the json library, convert the JSON data into a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
I'm trying to obtain the latest quarter's operating income/loss from a quarterly filing.
Desired output highlighted in green: financial statement
Here's the URL of the document that I'm trying to scrape: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm
If you'd like to see the data point visually, it is in PART I, Item 1. Financial Statements, Operating income.
The HTML code for the figure that I'm trying to get:
<ix:nonfraction id="fact-identifier-125" name="us-gaap:OperatingIncomeLoss" contextref="FD2019Q3QTD" unitref="usd" decimals="-6" scale="6" format="ixt:numdotdecimal" data-original-id="d305292495e1903-wk-Fact-6250FB76089207E7F73CB52756E0D8D0" continued-taxonomy="false" enabled-taxonomy="true" highlight-taxonomy="false" selected-taxonomy="false" hover-taxonomy="false" onclick="Taxonomies.clickEvent(event, this)" onkeyup="Taxonomies.clickEvent(event, this)" onmouseenter="Taxonomies.enterElement(event, this);" onmouseleave="Taxonomies.leaveElement(event, this);" tabindex="18" isadditionalitemsonly="false">11,544</ix:nonfraction>
The code that I used to obtain this data point (11,544):
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm'
response = requests.get(url)
content = BeautifulSoup(response.content, 'html.parser')
operatingincomeloss = content.find('ix:nonfraction', attrs={"name": "us-gaap:OperatingIncomeLoss", "contextref":"FD2019Q3QTD"})
print (operatingincomeloss)
I also tried:
operatingincomeloss = content.find('ix:nonfraction', attrs={"name": "us-gaap:OperatingIncomeLoss"})
Eventually, I want to loop through all the relevant filings to pull this data point. Currently, I'm just getting None. When I Ctrl+F through content, I can't find the ix:nonfraction tag either.
The page is loaded via JavaScript; I've used the XHR request it makes and extracted the required data from that document instead.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.select("#d305292495e1903-wk-Fact-6250FB76089207E7F73CB52756E0D8D0"):
    print(item.text)
Output:
11,544
Updated:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll("ix:nonfraction", {'contextref': 'FD2019Q3QTD', 'name': 'us-gaap:OperatingIncomeLoss'}):
    print(item.text)
As #αԋɱҽԃ αмєяιcαη said, the page is loaded via JavaScript.
I have used the XHR request URL for this code.
Considering the attributes you used, I have kept only the name attribute, since contextref changes for each element.
You could also change the name attribute if you want to loop through other elements.
Since you said you want to loop through this tag, I have printed everything the code below returns.
Code:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm')
soup = BeautifulSoup(res.text, 'html.parser')
for data in soup.find_all('ix:nonfraction', {'name': 'us-gaap:OperatingIncomeLoss'}):
    print(data.text)
Output:
11,544
12,612
48,305
54,780
7,442
7,496
26,329
26,580
3,687
3,892
14,371
15,044
3,221
3,414
12,142
15,285
1,795
1,765
7,199
7,193
1,155
1,127
4,811
4,980
17,300
17,694
64,852
69,082
11,544
12,612
48,305
54,780
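If you later want to repeat this for several filings, a minimal sketch could look like the one below. The first URL is taken from the question, the second is just a commented-out placeholder, and the User-Agent header is included because sec.gov may reject requests without one:

import requests
from bs4 import BeautifulSoup

filing_urls = [
    'https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm',
    # 'https://www.sec.gov/Archives/edgar/data/.../another-10q.htm',  # placeholder
]

results = {}
for url in filing_urls:
    res = requests.get(url, headers={'User-Agent': 'your-name your@email.example'})
    soup = BeautifulSoup(res.text, 'html.parser')
    # Take the first OperatingIncomeLoss fact on the page; adjust the filter
    # (e.g. add contextref) if you need a specific period.
    tag = soup.find('ix:nonfraction', {'name': 'us-gaap:OperatingIncomeLoss'})
    results[url] = tag.text if tag else None

print(results)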
Python/web-scraping beginner here, so please bear with me. I'm trying to grab all product names from this URL.
Unfortunately, nothing gets returned when I run my code. The same code works fine for most other sites but I've tried dozens of variations and I can't make it work for this site.
Is this URL even scrapable using Bsoup? Any feedback is appreciated.
import bs4
import requests
url = 'http://www.rakuten.com/sr/searchresults.aspx?qu'
payload = {'q': 'Python',}
r = requests.get(url % payload)
soup = bs4.BeautifulSoup(r.text)
titles = [a.attrs.get('href') for a in soup.findAll('div.productscontainer a[href^=/prod]')]
for t in titles:
    print(t)
import bs4
import requests
url = 'http://www.rakuten.com/sr/searchresults.aspx?qu'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text)
titles = [td.text for td in soup.findAll('td', attrs={'class': 'searchlist'})]
for t in titles:
    print(t)
If this formatting is correct, is the JS for sure preventing me from pulling anything?
First of all, your string formatting is likely wrong. Look at this:
>>> url = 'http://www.rakuten.com/sr/searchresults.aspx?qu'
>>> payload = {'q': 'Python',}
>>> url % payload
'http://www.rakuten.com/sr/searchresults.aspx?qu'
I guess that's not what you want. You should look up how string formatting works in Python, and then come up with a proper way to construct your URL.
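For example, requests can build the query string for you if you pass the payload via params. This is only a sketch; the parameter name 'qu' is guessed from the URL above and may not be what the site actually expects:

import requests

url = 'http://www.rakuten.com/sr/searchresults.aspx'
payload = {'qu': 'Python'}  # assumed query parameter name; verify it in the browser
r = requests.get(url, params=payload)
print(r.url)  # e.g. http://www.rakuten.com/sr/searchresults.aspx?qu=Python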
Secondly, that "search engine" makes heavy use of JavaScript. Probably you will not be able to retrieve the information you want by just looking at the initially retrieved HTML content.