I am thinking of trying to extend Pinry, a self-hostable Pinterest "clone". One key feature Pinry currently lacks is that it only accepts direct image URLs, so I would like to be able to pull the images out of an arbitrary page. Is there a suggested way to do that in Python?
Yes, there are lots of ways to do that. BeautifulSoup could be an option, or even more simply you could grab the HTML with the requests library and then use a regex to match <img src=""> tags.
A more robust example using BeautifulSoup4 and requests is below:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://goodnewshackney.com')
soup = BeautifulSoup(r.text, 'html.parser')  # pass a parser explicitly to avoid the bs4 warning

# every <img> tag's src attribute is a candidate image URL
for img in soup.find_all('img'):
    print(img.get('src'))
This will print out:
http://24.media.tumblr.com/avatar_69da5d8bb161_128.png
http://24.media.tumblr.com/avatar_69da5d8bb161_128.png
....
http://25.media.tumblr.com/tumblr_m07jjfqKj01qbulceo1_250.jpg
http://27.media.tumblr.com/tumblr_m05s9b5hyc1qbulceo1_250.jpg
You will then need to present these images to the user somehow and let them pick one. Should be quite simple.
Have you looked at http://www.crummy.com/software/BeautifulSoup/ ?
I'm doing a project where I need to store the date that a YouTube video was published.
The problem is that I'm having difficulty finding this data in the middle of the HTML source code.
Here's my code attempt:
import requests
from bs4 import BeautifulSoup as BS
url = "https://www.youtube.com/watch?v=XQgXKtPSzUI&t=915s"
response = requests.get(url)
soup = BS(response.content, "html.parser")
response.close()
dia = soup.find_all('span',{'class':'date'})
print(dia)
Output:
[]
I know that the arguments I'm sending to .find_all() are wrong.
I'm saying this because I was able to store other information from the video using the same code, such as the title and the views.
I've tried different arguments with .find_all() but haven't figured out how to find it.
If you use Python with pafy, the object you'll get has the published date easily accessible.
Install pafy: "pip install pafy"
import pafy
vid = pafy.new("www.youtube.com/watch?v=2342342whatever")
published_date = vid.published
print(published_date) #Python3 print statement
Check out the pafy docs for more info:
https://pythonhosted.org/Pafy/
The reason I leave the doc link is that it's a really neat module: it handles fetching the data without external request modules and also exposes a bunch of other useful properties of the video, like the best format download link, etc.
It seems that YouTube adds the date with JavaScript, so that information is not in the static HTML that requests sees. You could use Selenium to render the page and scrape it, or pull the date out of the JavaScript data that is embedded directly in the page source.
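If you want to avoid Selenium, a rough sketch of the second approach is to search the page source for the publish date in the embedded player JSON. The "publishDate" field name is an assumption about YouTube's current page structure, so treat this as fragile:

import re
import requests

url = "https://www.youtube.com/watch?v=XQgXKtPSzUI"
html = requests.get(url).text

# Look for a "publishDate":"..." entry in the embedded player JSON.
# If YouTube changes its markup (or serves a consent page), this will fail.
match = re.search(r'"publishDate"\s*:\s*"([^"]+)"', html)
if match:
    print(match.group(1))
else:
    print("publishDate not found; the page structure may have changed")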
Try passing the attrs argument as shown below:
dia = soup.find_all('span', attrs={'class': 'date'})
I'm trying to write some code which download the two latest publications of the Outage Weeks found at the bottom of http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/
They are xlsx files, which I'm going to load into Excel afterwards.
It doesn't matter which programming language the code is written in.
My first idea was to use the direct URLs, like http://www.eirgridgroup.com/site-files/library/EirGrid/Outage-Weeks_36(2016)-51(2016)_31%20August.xlsx, and then write some code that guesses the URLs of the two latest publications.
But I have noticed some inconsistencies in the URL names, so that solution wouldn't work.
Instead, a solution might be to scrape the website and use XPath to download the files. I found out that the two latest publications always have the following XPaths:
/html/body/div[3]/div[3]/div/div/p[5]/a
/html/body/div[3]/div[3]/div/div/p[6]/a
This is where I need help. I'm new to both XPath and web scraping. I have tried something like this in Python:
from lxml import html
import requests
page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)
v = tree.xpath('/html/body/div[3]/div[3]/div/div/p[5]/a')
But v seems to be empty.
Any ideas would be greatly appreciated!
Just use contains to find the hrefs and slice the first two:
tree.xpath('//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")]/@href')[:2]
Or do it all in the XPath using [position() < 3]:
tree.xpath('(//p/a[contains(@href, "site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/@href')
The files are ordered from latest to oldest so getting the first two gives you the two newest.
To download the files you just need to join each href to the base url and write the content to a file:
from lxml import html
import requests
import os
from urlparse import urljoin  # Python 2; on Python 3 use: from urllib.parse import urljoin

page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)
v = tree.xpath('(//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/@href')

for href in v:
    # os.path.basename(href) -> Outage-Weeks_35(2016)-50(2016).xlsx
    with open(os.path.basename(href), "wb") as f:
        f.write(requests.get(urljoin("http://www.eirgridgroup.com", href)).content)
What I am trying to do is the following. There is this web page: http://xml.buienradar.nl.
From that, I want to extract a value every n minutes, preferably with Python. Let's say the windspeed at the Gilze-Rijen station. That is located on this page at:
<buienradarnl>.<weergegevens>.<actueel_weer>.<weerstations>.<weerstation id="6350">.<windsnelheidMS>4.80</windsnelheidMS>
Now, I can find loads of questions with answers that use Python to read a local XML file. But, I would rather not need to wget or curl this page every couple of minutes.
Obviously, I'm not very familiar with this.
There must be a very easy way to do this. The answer either escapes me or is drowned in all the answers that solve problems with a local file.
I would use urllib2 and BeautifulSoup.
from urllib2 import Request, urlopen  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

req = Request("http://xml.buienradar.nl/")
response = urlopen(req)
output = response.read()

soup = BeautifulSoup(output, 'html.parser')  # pass a parser explicitly to avoid the bs4 warning
print soup.prettify()
Then you can traverse the output like you were suggesting:
soup.buienradarnl.weergegevens (etc)
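A minimal sketch of grabbing just that wind-speed value and repeating it every n minutes (Python 3 with requests here; the weerstation and windsnelheidMS tag names are taken from the XML path in the question, and note that html.parser lowercases them):

import time
import requests
from bs4 import BeautifulSoup

while True:
    xml = requests.get("http://xml.buienradar.nl/").content
    soup = BeautifulSoup(xml, "html.parser")

    # Find the Gilze-Rijen station (id 6350) and read its wind speed in m/s.
    station = soup.find("weerstation", id="6350")
    if station is not None:
        wind = station.find("windsnelheidms")  # html.parser lowercases <windsnelheidMS>
        if wind is not None:
            print(wind.text)

    time.sleep(10 * 60)  # wait 10 minutes before fetching again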
I would like to get the plain text (i.e. no HTML tags or entities) from a given URL.
What library should I use to do that as quickly as possible?
I've tried (maybe there is something faster or better than this):
import re
import mechanize
br = mechanize.Browser()
br.open("myurl.com")
vh = br.viewing_html
# <bound method Browser.viewing_html of <mechanize._mechanize.Browser instance at 0x01E015A8>>
Thanks
You can use HTML2Text. If the site isn't working for you, you can go to the HTML2Text GitHub repo and get it for Python.
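A minimal sketch of the html2text route (Python 3, assuming you've done pip install html2text; the URL is a placeholder):

import html2text
from urllib.request import urlopen

# Fetch the page and decode the bytes to a string.
html = urlopen('http://myurl.com').read().decode('utf-8')

# html2text converts HTML into Markdown-flavoured plain text.
print(html2text.html2text(html))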
or maybe try this:
import urllib  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

html = urllib.urlopen('http://myurl.com').read()
soup = BeautifulSoup(html, 'html.parser')

# get_text() strips all the tags and returns only the visible text
text = soup.get_text()
print text
I don't know if it gets rid of all the JS and such, but it does get rid of the HTML.
Do some Google searches; there are multiple other questions similar to this one.
You could also take a look at Read2Text.
In Python 3, you can fetch the HTML as bytes and then decode it to a string:
from urllib import request

text = request.urlopen('http://myurl.com').read().decode('utf8')
I'm currently working on a school project whose goal is to analyze scam mails with the Natural Language Toolkit package. Basically what I want to do is compare scams from different years and try to find a trend: how has their structure changed over time?
I found a scam database: http://www.419scam.org/emails/
I would like to download the content of the links with Python, but I am stuck.
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll('a')
links2 = soup.findAll(href=re.compile("index"))
print links2
So I can fetch the links, but I don't know yet how I can download the content. Any ideas? Thanks a lot!
You've got a good start, but right now you're simply retrieving the index page and loading it into the BeautifulSoup parser. Now that you have the hrefs from the links, you essentially need to open all of those links and load their contents into data structures that you can then use for your analysis.
This essentially amounts to a very simple web-crawler. If you can use other people's code, you may find something that fits by googling "python Web crawler." I've looked at a few of those, and they are straightforward enough, but may be overkill for this task. Most web-crawlers use recursion to traverse the full tree of a given site. It looks like something much simpler could suffice for your case.
Given my unfamiliarity with BeautifulSoup, this basic structure will hopefully get you on the right path, or at least give you a sense of how the web crawling is done:
from BeautifulSoup import BeautifulSoup
import urllib2, re

emailContents = []

def analyze_emails():
    # this function and any sub-routines would analyze the emails after they
    # are loaded into a data structure, e.g. emailContents
    pass

def parse_email_page(link):
    print "opening " + link
    # Open, soup, and parse the page.
    # Looks like the email itself is in a "blockquote" tag, so that may be the starting place.
    # From there you'll need to create arrays and/or dictionaries of the emails'
    # contents to do your analysis on, e.g. emailContents
    pass

def parse_list_page(link):
    print "opening " + link
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    # Add your own code here to filter the list page soup and collect all the
    # relevant links to actual email pages.
    email_page_links = []
    for email_link in email_page_links:
        parse_email_page(email_link['href'])

def main():
    html = urllib2.urlopen('http://www.419scam.org/emails/').read()
    soup = BeautifulSoup(html)
    # I use '20' to filter links since all the relevant links seem to have a 20XX year in them. Seemed to work.
    links = soup.findAll(href=re.compile("20"))
    for link in links:
        parse_list_page(link['href'])
    analyze_emails()

if __name__ == "__main__":
    main()
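For the parse_email_page placeholder, here is a rough sketch of what its body might look like. It is meant to slot into the skeleton above (so it reuses its imports and the emailContents list), and it assumes the email text really does live inside blockquote tags and that the listing links may be relative, neither of which I've verified against the site:

import urlparse  # Python 2, matching the skeleton above; on Python 3 use urllib.parse

def parse_email_page(link):
    print "opening " + link
    # The listing pages may use relative links, so resolve them against the base URL.
    url = urlparse.urljoin('http://www.419scam.org/emails/', link)
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Collect the text of every blockquote on the page, on the assumption that
    # that is where the scam email body lives, and stash it for later analysis.
    for quote in soup.findAll('blockquote'):
        emailContents.append(''.join(quote.findAll(text=True)))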