Scraping text from HTML5 website using Python - python

I need a way to scrape just the text from a website using Python. I have installed BeautifulSoup 4, HTML Requests, and NLTK, but I can't figure out how to do the scraping.
I really need a simple snippet of code that I can plug any URL into and get the plain text. I'm trying to get it from this website

BeautifulSoup can easily extract all the text from a page. The following example extracts the text inside the <body>...</body> section.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://developer.valvesoftware.com/wiki/Hammer_Selection_Tool'
# In Python 3 the response is a context manager, so closing() is not needed
with urlopen(url) as h:
    soup = BeautifulSoup(h.read(), 'html.parser')
# get_text() returns the text with all tags removed
print(soup.body.get_text())
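If the goal is a snippet that works on any page, the same idea can be wrapped in a small function. This is only a sketch: the name html_to_text and the sample markup are made up for illustration, and it works on an HTML string so it can be tried without a network connection. It also drops <script> and <style> contents, which get_text() would otherwise include.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    soup = BeautifulSoup(html, "html.parser")
    # Script and style contents are not visible text, so remove them first.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Fall back to the whole document if there is no <body> element.
    root = soup.body or soup
    # Join the text chunks with newlines and strip surrounding whitespace.
    return root.get_text(separator="\n", strip=True)


# Hypothetical sample page, used here instead of a live URL.
sample = """
<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><h1>Heading</h1><p>First paragraph.</p>
<script>var x = 1;</script></body></html>
"""
text = html_to_text(sample)
print(text)
```

To use it on a real site, pass it the body of a downloaded page, for example the bytes returned by h.read() above.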

Related

Parsing an internal network webpage with the BeautifulSoup library gives different HTML than the site

I'd like to write an auto-login program for a website on an internal network.
So I tried to parse the site using the requests and BeautifulSoup libraries.
It works, but the HTML I get is a lot shorter than the site's actual HTML.
What's the problem? Could it be a security issue?
Please help me.
import requests
from bs4 import BeautifulSoup as bs
page = requests.get("http://test.com")
soup = bs(page.text, "html.parser")
print(soup)  # The HTML I get is a lot shorter than the site's HTML

How to scrape text from a hidden div and class using Python?

I'm working on a script that scrapes video titles from this webpage:
https://www.google.com.eg/trends/hotvideos
The problem is that the titles are hidden in the HTML source of the page, but I can see them when I use the browser inspector.
Here is my code. It works fine with "class":"wrap",
but when I use it with the hidden one, "class":"hotvideos-single-trend-title-container", it gives me no output.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.google.com.eg/trends/hotvideos').read()
soup = BeautifulSoup(html, 'html.parser')
# Works with "wrap", but prints an empty list for the class below
print(soup.find_all('div', {"class": "hotvideos-single-trend-title-container"}))
The page is populated by JavaScript after it loads, so the titles are not in the HTML that urlopen downloads.
BeautifulSoup on its own won't help you here. You need a tool that can render JavaScript-generated HTML pages; see here for a list, or have a look at Selenium.
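A small offline illustration of why the search comes back empty. The markup below is invented for this sketch (including the class name title-container); it only mimics a page whose container is filled in by a script. BeautifulSoup sees the raw source, so it finds the div that is in the source and nothing that the script would add in a browser.

```python
from bs4 import BeautifulSoup

# Hypothetical static source: the "trends" container is empty until the
# script runs in a browser.
static_html = """
<div class="wrap">Visible in the page source</div>
<div id="trends"></div>
<script>
  var el = document.createElement("div");
  el.className = "title-container";
  el.textContent = "Some video title";
  document.getElementById("trends").appendChild(el);
</script>
"""

soup = BeautifulSoup(static_html, "html.parser")
in_source = soup.find_all("div", {"class": "wrap"})
added_by_js = soup.find_all("div", {"class": "title-container"})

print(len(in_source))    # elements present in the raw HTML
print(len(added_by_js))  # elements that would only exist after the script runs
```

A browser-driving tool such as Selenium runs the script first, which is why the inspector shows elements that the downloaded source does not contain.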

How to scrape all the product image links on Flipkart

I am trying to scrape the URLs of all the different images on this page: https://www.flipkart.com/samsung-galaxy-nxt-gold-32-gb/p/itmemzd4gepexjya?pid=MOBEMZD4KHRF5VZX. I am using Python's BeautifulSoup module, but it hasn't worked. I don't understand the code structure of flipkart.com or why it is not returning the required data.
The code I am trying is as follows:
import requests
from bs4 import BeautifulSoup
url = "https://www.flipkart.com/samsung-galaxy-nxt-gold-32-gb/p/itmemzd4gepexjya?pid=MOBEMZD4KHRF5VZX"
x = requests.get(url).content
soup2 = BeautifulSoup(x, 'html.parser')
# Collect every <img> tag that has the class "sfescn"
data = []
for j in soup2.find_all('img', attrs={'class': "sfescn"}):
    data.append(j)
print(data)
I can clearly see that the page source contains no links to the phone images.
So I would recommend using a tool such as Fiddler, or your browser's developer console, to track where the actual data comes from. Most probably it arrives in the response to a separate request that returns JSON.
I am not familiar with BeautifulSoup; I have been working with Scrapy.
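Once the request that carries the data has been found in the network tab, pulling the image URLs out of its JSON body is usually straightforward. The sketch below is purely illustrative: the response body and its key names ("product", "images", "url") are invented, since the real structure of the Flipkart response is not known here and has to be read off the actual response.

```python
import json

# Hypothetical response body, standing in for whatever the network tab shows.
# The key names are assumptions made for this sketch.
response_body = """
{"product": {"title": "Example phone",
             "images": [{"url": "https://example.com/img/1.jpeg"},
                        {"url": "https://example.com/img/2.jpeg"}]}}
"""

payload = json.loads(response_body)
# Walk down to the list of images and keep only the URL of each one.
image_urls = [image["url"] for image in payload["product"]["images"]]
print(image_urls)
```

With requests, the same dictionary can be obtained from a live response via its .json() method instead of json.loads.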

How to extract ids and classes from a webpage using python?

This is my code so far:
from urllib.request import urlopen
with urlopen("https://quora.com") as response:
    html = response.read()
I am new to Python and have managed to fetch the webpage. Now, how do I extract the ids and classes from it?
A better way would be to use the BeautifulSoup (bs4) web-scraping library together with requests.
After installing both with pip, you can start like so:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://quora.com")
soup = BeautifulSoup(r.content, "html.parser")
To find an element with a specific id:
soup.find(id="your_id")
To find all elements with the "Answer" class:
soup.find_all(class_="Answer")
You can then use .get_text() to strip the HTML tags and use Python string operations to organize your data.
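If the aim is to list every id and class on the page rather than look up one you already know, one possible approach is to loop over all tags and collect their attributes. This is a sketch on made-up markup, not on the real Quora page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<div id="header" class="bar top"><span class="Answer">One</span></div>
<p id="intro" class="Answer">Two</p>
"""

soup = BeautifulSoup(html, "html.parser")

ids = set()
classes = set()
# find_all(True) matches every tag in the document.
for tag in soup.find_all(True):
    if tag.get("id"):
        ids.add(tag["id"])
    # "class" is multi-valued, so BeautifulSoup returns it as a list.
    classes.update(tag.get("class", []))

print(sorted(ids))
print(sorted(classes))
# Text of every element that has the "Answer" class, in document order.
answers = [t.get_text() for t in soup.find_all(class_="Answer")]
print(answers)
```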
You may try to parse the HTML code using a dedicated library, for instance BeautifulSoup.
You can do it easily with XML parsing:
from lxml import html
import requests
page = requests.get('http://google.com')
# Save the raw response body to a file
with open('/home/Desktop/test.txt', 'wb') as f:
    f.write(page.content)

Difficulty scraping a specific field using lxml or BeautifulSoup in Python

I am new to scraping. Can you guide me on how to get the price ("299.00") from the following HTML using Python with lxml or BeautifulSoup?
An image of the HTML code is attached.
from lxml import html
import requests
url="https://world.taobao.com/item/537221576985.htm?fromSite=main&spm=a21ct.7779917.1441024229133.8.AvoEs0"
page=requests.get(url)
tree=html.fromstring(page.text)
price=tree.xpath('//strong[@class="tb-rmb-num"]/text()')
print(price)
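For reference, XPath selects attributes with "@", so the predicate has the form [@class="..."]. The sketch below runs that expression on a made-up page; the real Taobao markup may differ, and the element and class names are taken only from the XPath in the question.

```python
from lxml import html

# Hypothetical page; only the <strong class="tb-rmb-num"> part is taken
# from the question, the rest is invented for this sketch.
page_source = """
<html><body>
  <div class="tb-price"><strong class="tb-rmb-num">299.00</strong></div>
</body></html>
"""

tree = html.fromstring(page_source)
# text() returns the text nodes directly inside the matched element.
prices = tree.xpath('//strong[@class="tb-rmb-num"]/text()')
print(prices)
```

If the same expression returns an empty list on the live page, the price is probably filled in by JavaScript and is not present in the HTML that requests downloads.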
