I am trying to read a table from a web-page. Generally, my company has strict authentication policies restricting us in the way we can scrape the data.
The following code is what I am currently using to do this:
from urllib.request import urlopen
from requests_kerberos import HTTPKerberosAuth, OPTIONAL
import os
import lxml.html as LH
import requests
import pandas as pd
cert = r"C:\\Users\\name\\Desktop\\cacert.pem"
os.environ["REQUESTS_CA_BUNDLE"] = cert
kerberos = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
session = requests.Session()
link = 'weblink'
data = session.get(link, auth=kerberos, verify=False).content.decode("latin-1")
And that leaves me with the entire HTML of the webpage in "data".
How do I convert this into a dataframe?
Note: I couldn't provide the actual link due to privacy concerns. I was just wondering if there is a general way I can use to tackle this situation.
It looks like you're looking for something like this, using BeautifulSoup?
From there, you'll still have to create the dataframe itself, but you will have gotten past the 'convert the HTML into a data structure' step (that is, reading the HTML table into a list or dictionary, and then transforming it into a dataframe).
Edit 1
Actually, you can use Pandas' read_html. You might still need BeautifulSoup to get exactly what you want, but depending on what the source HTML looks like, read_html might be enough on its own.
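For example, a minimal sketch built on your snippet (assuming the page really contains a static &lt;table&gt; element and is not rendered by JavaScript):

import pandas as pd

# read_html parses every <table> in the HTML string and returns a list of DataFrames.
# `data` is the decoded HTML you already fetched with your Kerberos session.
tables = pd.read_html(data)

df = tables[0]   # pick whichever table you actually need
print(df.head())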
Related
I need to download the Net Income of the S&P 500 companies from this website: https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement
I wrote this part of the code following an online guide (this one: https://towardsdatascience.com/web-scraping-for-accounting-analysis-using-python-part-1-b5fc016a1c9a), but I can't figure out how to conclude it and, more specifically, how to download the extracted Net Income into an Excel file.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement'
response = requests.get(url)
response
soup = BeautifulSoup(response.text, 'html.parser')
income_statement = soup.findAll('a')[19]
link = income_statement['href']
download_url = 'https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement/'+ link
Any suggestion would be very appreciated, thanks!
I think the correct way to tackle this task is to use a stock market API instead of web scraping with BS4.
I'd recommend having a look at the following article, which also includes some practical examples:
https://towardsdatascience.com/best-5-free-stock-market-apis-in-2019-ad91dddec984
Edit:
If you decide to stick with the exact URL you mentioned, I think you should give pandas a try, with something like this:
import pandas as pd
data = pd.read_html('https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement', skiprows=1)
You'll have to play with the encoding a little, as the table contains some non-ASCII characters.
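For getting the result into an Excel file, something along these lines might work (an untested sketch: the table index is a guess, so inspect the list that read_html returns, and to_excel needs the openpyxl package installed):

import pandas as pd

url = 'https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement'
tables = pd.read_html(url, skiprows=1)

# Inspect the returned list to find the income-statement table; index 0 is only a guess.
income_statement = tables[0]
print(income_statement.head())

# Write the table (or just the Net Income rows you filter out) to an Excel file.
income_statement.to_excel('net_income_MMM.xlsx', index=False)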
I am currently trying to automatically scrape some stock exchange data in 1 or 5 minute ticks from the website stooq.com. I've tried to get it using BeautifulSoup from bs4, but can neither find it in the tables of the website, nor can I manage to get the underlying data of the html5 chart.
This is the link to the website containing the html5 chart:
dax_link = 'https://stooq.com/q/a2/?s=^dax&i=1&t=l&a=lg&z=500&ft=201808141221&l=0&d=1&ch=0&f=0&lt=57&r=0&o=1'
I've tried the following using beautifulsoup:
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup as bs
stooq_dax = ('https://stooq.com/q/a2/?s=^dax&i=1&t=l&a=lg'
'&z=500&ft=201808141221&l=0&d=1&ch=0&f=0&lt=57&r=0&o=1')
reqstdax = requests.get(stooq_dax)
stdax = reqstdax.content
soupstdax = bs(stdax, 'html.parser')
tbls_dax = soupstdax.table
df = pd.read_html(str(tbls_dax))
But none of the 31 dataframes contains any useful data.
I also tried finding some specific values in the website, like for example
soupstdax.find_all(text=re.compile('12368'))
which is the "open" value at the time 2018-08-14,15:24:00, but none can be found.
I can of course get these values by clicking on the CSV button in the bottom right corner, but this can't be automated since the link for the CSV generation is hidden (and I did not manage to reconstruct it).
Is there any way I could get the underlying data of the chart or find the correct link for generating the csv files?
Thanks in advance!
If you inspect the web page in Chrome or Firefox, you can see it making an XHR to:
https://stooq.com/q/a2/d/?s=^dax&i=1&l=201808141633
You can access this directly to get the data that the page is updated with:
20180814,163200,12349.80,12350.10,12348.50,12348.50
20180814,163300,12348.5,12350,12348.1,12349.5
Is that the data you want?
Update
It looks like the initial data comes from here:
https://stooq.com/q/a2/d/?s=^dax&i=1
And the 201808141633 is a timestamp (2018/08/14 16:33)
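If that is the data you want, here is a rough sketch for pulling it straight into pandas (the column names are my assumption based on the sample rows, i.e. date, time, open, high, low, close):

import io
import requests
import pandas as pd

# Initial data; append &l=<timestamp> (e.g. &l=201808141633) for incremental updates.
url = 'https://stooq.com/q/a2/d/?s=^dax&i=1'
resp = requests.get(url)

# The response is plain comma-separated rows; the column names below are assumptions.
cols = ['date', 'time', 'open', 'high', 'low', 'close']
df = pd.read_csv(io.StringIO(resp.text), header=None, names=cols)
print(df.head())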
I've been reading up on parsing XML with Python all day, but looking at the site I need to extract data from, I'm not sure if I'm barking up the wrong tree. Basically, I want to get the 13-digit barcodes from a supermarket website (they are found in the names of the images). For example:
http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985
has 11 items and 11 images; the barcode for the first item is 0000003235676. However, when I look at the page source (I assume this is the best way to extract all of the barcodes in one go with Python, urllib and BeautifulSoup), all of the barcodes are on one line (line 12), and the data doesn't seem to be structured as I would expect in terms of elements and attributes.
new TESCO.sites.UI.entities.Product({name:"Lb Mens Mattifying Dust 7G",xsiType:"QuantityOnlyProduct",productId:"275303365",baseProductId:"72617958",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/805/5021320051805/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"g",unitPrice:3.58,catchWeight:"0",shelfName:"Mens Styling",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Thickening Shampoo 250Ml",xsiType:"QuantityOnlyProduct",productId:"275301223",baseProductId:"72617751",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/225/5021320051225/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:1,catchWeight:"0",shelfName:"Mens Shampoo ",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Sculpting Puty 75Ml",xsiType:"QuantityOnlyProduct",productId:"275301557",baseProductId:"72617906",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/287/5021320051287/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:3.34,catchWeight:"0",shelfName:"Pastes, Putty, Gums, Pomades",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
Maybe something like BeautifulSoup is overkill? I understand the DOM tree is not the same thing as the raw source, but why are they so different? When I go to Inspect Element in Firefox, the data seems structured as I would expect.
Apologies if this comes across as totally stupid; thanks in advance.
Unfortunately, the barcode is not given in the HTML as structured data; it only appears embedded as part of a URL. So we'll need to isolate the URL and then pick off the barcode with string manipulation:
import urllib2
import bs4 as bs
import re
import urlparse
url = 'http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985'
response = urllib2.urlopen(url)
content = response.read()
# with open('/tmp/test.html', 'w') as f:
#     f.write(content)

# Useful for debugging off-line:
# with open('/tmp/test.html', 'r') as f:
#     content = f.read()
soup = bs.BeautifulSoup(content, 'html.parser')
barcodes = set()
for tag in soup.find_all('img', {'src': re.compile(r'/pi/')}):
    href = tag['src']
    scheme, netloc, path, query, fragment = urlparse.urlsplit(href)
    # The 13-digit code is the directory just before the image file name.
    barcodes.add(path.split('/')[-2])
print(barcodes)
yields
set(['0000003222737', '0000010039670', '0000010036297', '0000010008393', '0000003050453', '0000010062951', '0000003239438', '0000010078402', '0000010016312', '0000003235676', '0000003203132'])
Since the site uses JavaScript to render its content, you might find it useful to switch from urllib to a tool like Selenium. That way you can crawl pages as they render for a real user in a web browser. This GitHub project seems to solve your task.
Another option is to filter the product data out of the page's inline scripts and get the barcodes directly from there.
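For example, a quick-and-dirty sketch of that second option, pulling the 13-digit code out of each imageURL in the inline scripts (the regex is based only on the snippets posted in the question, so treat it as a starting point):

import re

# `content` is the raw page HTML already downloaded above.
# Each product line contains imageURL:"http://img.tesco.com/Groceries/pi/.../<13 digits>/IDShot_90x90.jpg",
# so grab the 13-digit path segment from every imageURL.
pattern = re.compile(r'imageURL:"[^"]*/(\d{13})/[^"]*"')
barcodes = set(pattern.findall(content))
print(barcodes)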
I would like to get the plain text (e.g. no html tags and entities) from a given URL.
What library should I use to do that as quickly as possible?
I've tried (maybe there is something faster or better than this):
import re
import mechanize
br = mechanize.Browser()
br.open("myurl.com")
vh = br.viewing_html
//<bound method Browser.viewing_html of <mechanize._mechanize.Browser instance at 0x01E015A8>>
Thanks
You can use HTML2Text; if the site isn't working for you, you can go to the HTML2Text GitHub repo and get it for Python.
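A rough sketch of using the library (assuming `html` already holds the page source you downloaded; option names may vary slightly between html2text versions):

import html2text

converter = html2text.HTML2Text()
converter.ignore_links = True    # drop link targets, keep the visible text
converter.ignore_images = True

text = converter.handle(html)    # `html` is the page source you fetched earlier
print(text)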
or maybe try this:
import urllib
from bs4 import BeautifulSoup

html = urllib.urlopen('http://myurl.com').read()
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()
print text
I don't know if it gets rid of all the JS and such, but it does get rid of the HTML.
Do some Google searches; there are multiple other questions similar to this one.
Also, maybe take a look at Read2Text.
In Python 3, you can fetch the HTML as bytes and then convert to a string representation:
from urllib import request
text = request.urlopen('http://myurl.com').read().decode('utf8')
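If you also need the tags stripped (which is what the question asks for), you can pass the result through BeautifulSoup's get_text(); a small sketch, assuming the bs4 package is installed:

from urllib import request
from bs4 import BeautifulSoup

html = request.urlopen('http://myurl.com').read().decode('utf8')
text = BeautifulSoup(html, 'html.parser').get_text()
print(text)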
Here,
http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500
there is a table. My goal is to extract the table and save it to a CSV file. I wrote this code:
import urllib
import os
web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
web.close()
ff = open(r"D:\ex\python_ex\urllib\output.txt", "w")
ff.write(s)
ff.close()
I'm lost from here. Can anyone help with this? Thanks!
Pandas can do this right out of the box, saving you from having to parse the html yourself. read_html() extracts all tables from your html and puts them in a list of dataframes. to_csv() can be used to convert each dataframe to a csv file. For the web page in your example, the relevant table is the last one, which is why I used df_list[-1] in the code below.
import requests
import pandas as pd
url = 'http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')
It's simple enough to do in one line, if you prefer:
pd.read_html(requests.get(<url>).content)[-1].to_csv(<csv file>)
P.S. Just make sure you have the lxml, html5lib, and BeautifulSoup4 packages installed in advance.
So essentially you want to parse the HTML file and get elements out of it. You can use BeautifulSoup or lxml for this task.
You already have solutions using BeautifulSoup. I'll post a solution using lxml:
from lxml import etree
import urllib.request
web = urllib.request.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
html = etree.HTML(s)
## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')
## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]
## Get text from rest all 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
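From there, writing the result out as CSV is straightforward; a small sketch using the csv module (the output file name is just an example):

import csv

with open('report.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)        # the header row extracted above
    writer.writerows(td_content)   # the data rows extracted above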
I would recommend BeautifulSoup as it has the most functionality. I modified a table parser that I found online that can extract all tables from a webpage, as long as there are no nested tables. Some of the code is specific to the problem I was trying to solve, but it should be pretty easy to modify for your usage. Here is the Pastebin link.
http://pastebin.com/RPNbtX8Q
You could use it as follows:
from urllib2 import Request, urlopen, URLError
from TableParser import TableParser
import csv

url_addr = 'http://foo/bar'
req = Request(url_addr)
url = urlopen(req)

tp = TableParser()
tp.feed(url.read())

# NOTE: Here you need to know exactly how many tables are on the page and which one
# you want. Let's say it's the first table.
my_table = tp.get_tables()[0]

filename = 'table_as_csv.csv'
with open(filename, 'wb') as f:
    writer = csv.writer(f)
    for row in my_table:
        writer.writerow(row)
The code above is an outline, but if you use the table parser from the Pastebin link you should be able to get to where you want to go.
You need to parse the table into an internal data structure and then output it in CSV form.
Use BeautifulSoup to parse the table. This question is about how to do that (the accepted answer uses version 3.0.8 which is out of date by now, but you can still use it, or convert the instructions to work with BeautifulSoup version 4).
Once you have the table in a data structure (probably a list of lists in this case), you can write it out with csv.writer.
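A minimal sketch of both steps with BeautifulSoup 4 and csv.writer (the table id is taken from the lxml answer above; adjust the selector if the page differs):

import csv
import requests
from bs4 import BeautifulSoup

url = ('http://www.ffiec.gov/census/report.aspx'
       '?year=2011&state=01&report=demographic&msa=11500')
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Build a list of lists: one inner list of cell texts per table row.
table = soup.find('table', id='Report1_dgReportDemographic')
rows = [[cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')]

with open('report.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)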
Look at the BeautifulSoup module. In the documentation you will find many examples of parsing HTML.
Also, for CSV you have a ready-made solution: the csv module.
It should be quite easy.
Look at this answer on parsing a table with BeautifulSoup and writing it to a text file.
Also try a Google search for "python beautifulsoup".