I'm doing a project where I need to store the date a YouTube video was published.
The problem is that I'm having trouble finding this data in the HTML source code.
Here's my code attempt:
import requests
from bs4 import BeautifulSoup as BS
url = "https://www.youtube.com/watch?v=XQgXKtPSzUI&t=915s"
response = requests.get(url)
soup = BS(response.content, "html.parser")
response.close()
dia = soup.find_all('span',{'class':'date'})
print(dia)
Output:
[]
I know that the arguments I'm passing to .find_all() are wrong.
I'm saying this because I was able to retrieve other information from the video with the same code, such as the title and the view count.
I've tried different arguments with .find_all() but haven't figured out how to find the date.
If you use Python with pafy, the object you'll get has the published date easily accessible.
Install pafy: "pip install pafy"
import pafy
vid = pafy.new("www.youtube.com/watch?v=2342342whatever")
published_date = vid.published
print(published_date) #Python3 print statement
Check out the pafy docs for more info:
https://pythonhosted.org/Pafy/
The reason I leave the doc link is that it's a really neat module: it fetches the data without external request libraries and also exposes a bunch of other useful properties of the video, like the best-format download link, etc.
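For example, a few of the other properties and helpers described in those docs, continuing from the vid object above (a quick sketch; attribute names per the pafy docs):
print(vid.title)       # video title
print(vid.viewcount)   # view count
best = vid.getbest()   # best-quality stream pafy can find
print(best.url)        # direct download URL for that stream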
It seems that YouTube adds the date with JavaScript, so that information is not present as a rendered element in the static HTML. You could try scraping with Selenium, or pull the date out of the JavaScript data, since that is embedded directly in the page source.
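If you want to stay with requests + BeautifulSoup, here is a minimal sketch of that second idea. Note that the <meta itemprop="datePublished"> tag and the "publishDate" JSON key are assumptions about YouTube's current markup, which changes without notice:
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/watch?v=XQgXKtPSzUI"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

# Option 1: a <meta itemprop="datePublished"> tag, present in some responses
meta = soup.find("meta", itemprop="datePublished")
if meta:
    print(meta["content"])
else:
    # Option 2: look for a "publishDate" key inside the embedded player JSON
    match = re.search(r'"publishDate":"([^"]+)"', html)
    print(match.group(1) if match else "date not found")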
Try passing the attributes via the attrs keyword, as shown below:
dia = soup.find_all('span', attrs={'class': 'date'})
I need to download the Net Income of the S&P 500 companies from this website: https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement
I wrote this piece of code following an online guide (this one: https://towardsdatascience.com/web-scraping-for-accounting-analysis-using-python-part-1-b5fc016a1c9a), but I can't figure out how to conclude it and, more specifically, how to export the extracted Net Income to an Excel file.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement'
response = requests.get(url)
response  # in a notebook this just displays the Response object; it has no effect in a script
soup = BeautifulSoup(response.text, 'html.parser')
income_statement = soup.findAll('a')[19]
link = income_statement['href']
download_url = 'https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement/'+ link
Any suggestion would be very appreciated, thanks!
I think the correct way to tackle this task is to use a stock market API instead of web scraping with BS4.
I recommend having a look at the following article, which also includes some practical examples:
https://towardsdatascience.com/best-5-free-stock-market-apis-in-2019-ad91dddec984
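For instance, a rough sketch using the yfinance package (one concrete option, not taken from the article; row labels depend on the yfinance version, so inspect the DataFrame first):
import yfinance as yf  # pip install yfinance

ticker = yf.Ticker("MMM")
income_stmt = ticker.financials              # annual income statement as a DataFrame
net_income = income_stmt.loc["Net Income"]   # row label assumed; check income_stmt.index
print(net_income)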
Edit:
If you decide to stick with the plan of using the exact URL you mentioned, I think you should try pandas; try something like this:
import pandas as pd
data = pd.read_html('https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement', skiprows=1)
You'll have to play with the encoding a little, as the table contains some non-ASCII characters.
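From there, exporting the extracted Net Income to an Excel file could look roughly like this (a sketch: the table index and the "Net Income" row label are assumptions, to_excel needs openpyxl or xlsxwriter installed, and if the site rejects the default client you may need to fetch the page with requests and pass response.text to read_html instead):
import pandas as pd

url = 'https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement'
tables = pd.read_html(url, skiprows=1)

# read_html returns a list of DataFrames; the income statement is assumed to be the first
income_statement = tables[0]

# keep only the row whose first column mentions "Net Income" (label assumed)
mask = income_statement.iloc[:, 0].astype(str).str.contains('Net Income', na=False)
net_income = income_statement[mask]

# write the result to an Excel file
net_income.to_excel('net_income_mmm.xlsx', index=False)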
I'm trying to make a Python script that can scrape the RAW Paste Data section of saved pastebin pages. But I'm running into AttributeError: 'NoneType' object has no attribute 'text'. I'm using BeautifulSoup in my project. I also tried to install spider-egg with pip so I could use that as well, but there were issues downloading the package from the server.
I need to be able to grab multiple different lines from the RAW Paste Data section and then print them back out to me.
first_string = raw_box.text.strip()
second_string = raw_box2.text.strip()
From the pastebin page I have the element and class names for the RAW Paste Data section, which is:
<textarea id="paste_code" class="paste_code" name="paste_code" onkeydown="return catchTab(this,event)">
Taking the class name paste_code, I then have this:
raw_box = soup.find('first_string ', attrs={'class': 'paste_code'})
raw_box2 = soup.find('second_string ', attrs={'class': 'paste_code'})
I thought that should have been it, but apparently not, because I get the error mentioned above. After parsing the stripped data I need to be able to redirect it into a file, after printing what it got. I also want to try to make this Python 3 compatible, but that would take a little more work, I think, since there are a lot of differences between Python 2.7.12 and 3.5.2.
The following approach should help to get you started:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://pastebin.com/hGeHMBQf')
soup = BeautifulSoup(r.text, "html.parser")
raw = soup.find('textarea', id='paste_code').text
print raw
Which for this example should display:
hello world
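And to redirect what you scraped into a file, as you mentioned, something simple like this should work (the filename is just an example):
# assuming 'raw' holds the text extracted above
with open('paste_output.txt', 'w') as f:
    f.write(raw)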
I'm trying to write some code which downloads the two latest publications of the Outage Weeks found at the bottom of http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/
They are xlsx files, which I'm going to load into Excel afterwards.
It doesn't matter which programming language the code is written in.
My first idea was to use the direct URLs, like http://www.eirgridgroup.com/site-files/library/EirGrid/Outage-Weeks_36(2016)-51(2016)_31%20August.xlsx, and then write some code that guesses the URLs of the two latest publications.
But I have noticed some inconsistencies in the URL names, so that solution wouldn't work.
Instead it might be a solution to scrape the website and use XPath to locate the files to download. I found out that the two latest publications always have the following XPaths:
/html/body/div[3]/div[3]/div/div/p[5]/a
/html/body/div[3]/div[3]/div/div/p[6]/a
This is where I need help. I'm new to both XPath and Web Scraping. I have tried stuff like this in Python
from lxml import html
import requests
page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)
v = tree.xpath('/html/body/div[3]/div[3]/div/div/p[5]/a')
But v seems to be empty.
Any ideas would be greatly appreciated!
Just use contains to find the hrefs and slice the first two:
tree.xpath('//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")]/@href')[:2]
Or do it all in the XPath using [position() < 3]:
tree.xpath('(//p/a[contains(@href, "site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/@href')
The files are ordered from latest to oldest so getting the first two gives you the two newest.
To download the files you just need to join each href to the base url and write the content to a file:
from lxml import html
import requests
import os
from urlparse import urljoin # from urllib.parse import urljoin
page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)
v = tree.xpath('(//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/@href')
for href in v:
    # os.path.basename(href) -> Outage-Weeks_35(2016)-50(2016).xlsx
    with open(os.path.basename(href), "wb") as f:
        f.write(requests.get(urljoin("http://www.eirgridgroup.com", href)).content)
I'm currently working on a school project whose goal is to analyze scam mails with the Natural Language Toolkit package. Basically, what I want to do is compare scams from different years and try to find a trend - how has their structure changed over time?
I found a scam-database: http://www.419scam.org/emails/
I would like to download the content of the links with python, but I am stuck.
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll('a')
links2 = soup.findAll(href=re.compile("index"))
print links2
So I can fetch the links, but I don't know yet how I can download their content. Any ideas? Thanks a lot!
You've got a good start, but right now you're simply retrieving the index page and loading it into the BeautifulSoup parser. Now that you have the hrefs from the links, you essentially need to open all of those links and load their contents into data structures that you can then use for your analysis.
This essentially amounts to a very simple web-crawler. If you can use other people's code, you may find something that fits by googling "python Web crawler." I've looked at a few of those, and they are straightforward enough, but may be overkill for this task. Most web-crawlers use recursion to traverse the full tree of a given site. It looks like something much simpler could suffice for your case.
Given my unfamiliarity with BeautifulSoup, this basic structure will hopefully get you on the right path, or give you a sense of how the web crawling is done:
from BeautifulSoup import BeautifulSoup
import urllib2, re

emailContents = []

def analyze_emails():
    # this function and any sub-routines would analyze the emails after they are
    # loaded into a data structure, e.g. emailContents
    pass

def parse_email_page(link):
    print "opening " + link
    # open, soup, and parse the page.
    # The email itself looks like it is in a "blockquote" tag, so that may be the starting place.
    # From there you'll need to create arrays and/or dictionaries of the emails'
    # contents to do your analysis on, e.g. emailContents

def parse_list_page(link):
    print "opening " + link
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    # add your own code here to filter the list-page soup down to the relevant
    # links to actual email pages
    email_page_links = []
    for link in email_page_links:
        parse_email_page(link['href'])

def main():
    html = urllib2.urlopen('http://www.419scam.org/emails/').read()
    soup = BeautifulSoup(html)
    # '20' filters the links since all the relevant ones seem to contain a 20XX year
    links = soup.findAll(href=re.compile("20"))
    for link in links:
        parse_list_page(link['href'])
    analyze_emails()

if __name__ == "__main__":
    main()
I am thinking of trying to extend Pinry, a self-hostable Pinterest "clone". One of the key features Pinry lacks is that it currently accepts image URLs only. I'm wondering if there is any suggested way to do that in Python?
Yes, there are lots of ways to do that; BeautifulSoup could be an option. Or, even more simply, you could grab the HTML with the requests library and then use a regex to match
<img src=""> tags.
A full example using BeautifulSoup4 and requests is below
import requests
from bs4 import BeautifulSoup
r = requests.get('http://goodnewshackney.com')
soup = BeautifulSoup(r.text, 'html.parser')  # specify a parser explicitly
for img in soup.find_all('img'):
print(img.get('src'))
Will print out:
http://24.media.tumblr.com/avatar_69da5d8bb161_128.png
http://24.media.tumblr.com/avatar_69da5d8bb161_128.png
....
http://25.media.tumblr.com/tumblr_m07jjfqKj01qbulceo1_250.jpg
http://27.media.tumblr.com/tumblr_m05s9b5hyc1qbulceo1_250.jpg
You will then need to present these images to the user somehow and let them pick one. Should be quite simple.
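If you prefer the quick regex route mentioned above instead of BeautifulSoup, here is a minimal sketch (this naive pattern only catches double-quoted src attributes):
import re
import requests

r = requests.get('http://goodnewshackney.com')
# naive pattern: only matches double-quoted src attributes
srcs = re.findall(r'<img[^>]+src="([^"]+)"', r.text)
for src in srcs:
    print(src)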
http://www.crummy.com/software/BeautifulSoup/ ?