I want to scrape a table from a website after filling in the required date range and other search parameters. The problem is that the table generated on the page is missing a few columns that I need for my reporting.
However, when I export the table as a CSV by clicking the "export" button manually, the file contains all the desired columns.
So now I want to know how I can fetch this CSV file automatically with Python, so that I can later produce the required dataframe by reading it from the download path.
Here's my code:
import datetime
import urllib.request
from requests import Session
from bs4 import BeautifulSoup as bs
import pandas as pd

......

with Session() as s:
    Today = datetime.datetime.now().date()
    Yesterday = Today - datetime.timedelta(days=1)
    Search_Data = {"dp_type": "all", "from_date[date]": Yesterday, "from_date[time]": "00:00",
                   "to_date[date]": Today, "to_date[time]": "00:00",
                   "salesorder_schedule": "singledelivery", "op": "List Export",
                   "selectall_oid": "1", "overview_delivery_cost": "0",
                   "form_token": Form_token, "form_id": "clc_saleorder_report_form"}
    s.post(url, Search_Data)
Here op is the form option that controls what the URL does with the request, and I chose the export action.
The code runs with zero errors, but the problem is that I don't see any file downloaded anywhere after it finishes.
Can anyone tell me what the missing trick is here?
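For what it's worth, requests never saves anything to disk by itself; s.post(...) just returns a Response object. If the export endpoint answers with the CSV in the response body (an assumption worth verifying in your browser's network tab), you have to write it out yourself. A minimal sketch, reusing url, Search_Data, and the pandas import from the snippet above:

with Session() as s:
    response = s.post(url, Search_Data)  # url and Search_Data as defined above
    response.raise_for_status()
    # Save the response body to disk, then read it back as a dataframe.
    with open("export.csv", "wb") as f:
        f.write(response.content)

df = pd.read_csv("export.csv")
print(df.columns)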
I need to download the Net Income of the S&P 500 companies from this website: https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement
I wrote this piece of code following an online guide (this one: https://towardsdatascience.com/web-scraping-for-accounting-analysis-using-python-part-1-b5fc016a1c9a), but I can't figure out how to conclude it and, more specifically, how to download the extracted Net Income into an Excel file.
import requests
from bs4 import BeautifulSoup

url = 'https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
income_statement = soup.find_all('a')[19]  # fragile: relies on the link's position in the page
link = income_statement['href']
download_url = 'https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement/' + link
Any suggestion would be much appreciated, thanks!
I think the correct way to tackle this task is to use a stock market API instead of web scraping with BS4.
I'd recommend having a look at the following article, which also includes some practical examples:
https://towardsdatascience.com/best-5-free-stock-market-apis-in-2019-ad91dddec984
Edit:
If you decide to stick to the plan of using this exact URL, I think you should try pandas. Try something like this:
import pandas as pd
data = pd.read_html('https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement', skiprows=1)
You'll have to play with the encoding a little, as the table contains some non-ASCII characters.
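To actually get the Net Income into an Excel file, here is a hedged sketch building on read_html. It assumes one of the returned tables has a row labelled "Net Income" in its first column, which may not hold if the site renders the table with JavaScript; inspect the tables list and adjust accordingly (to_excel also needs openpyxl installed):

import pandas as pd

url = 'https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement'
tables = pd.read_html(url, skiprows=1)  # read_html returns a list of DataFrames

# Look for a table with a "Net Income" row in its first column.
for df in tables:
    labels = df.iloc[:, 0].astype(str)
    mask = labels.str.contains('Net Income', na=False)
    if mask.any():
        df[mask].to_excel('net_income.xlsx', index=False)
        break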
I am trying to read a table from a web page. My company has strict authentication policies that restrict how we can scrape data.
The following code is what I am using to do it:
from requests_kerberos import HTTPKerberosAuth, OPTIONAL
import os
import requests
import pandas as pd

cert = r"C:\Users\name\Desktop\cacert.pem"
os.environ["REQUESTS_CA_BUNDLE"] = cert

kerberos = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
session = requests.Session()
link = 'weblink'
data = session.get(link, auth=kerberos, verify=False).content.decode("latin-1")
That leaves me with the entire HTML of the web page in "data".
How do I convert this into a dataframe?
Note: I couldn't provide the web link due to privacy concerns. I was just wondering if there is a general way to tackle this situation.
It looks like you're looking for something like this, using BeautifulSoup?
From there, you'll still have to create the dataframe itself, but you will have gotten past the step of converting the HTML into a data structure (that is, reading the HTML table into a list or dictionary, and then transforming it into a dataframe).
Edit 1
Actually, you can use pandas' read_html. You might still need BeautifulSoup to get exactly what you want, but depending on what the source HTML looks like, read_html might be enough on its own.
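For example, since "data" above already holds the decoded HTML, something like this might be enough (a minimal sketch, assuming the page contains a literal <table> element rather than one rendered by JavaScript):

import pandas as pd
from io import StringIO

# 'data' is the decoded HTML string from the authenticated session above.
tables = pd.read_html(StringIO(data))  # one DataFrame per <table> found
df = tables[0]  # pick the table you need
print(df.head())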
I'm doing a project where I need to store the date a YouTube video was published.
The problem is that I'm having trouble finding this piece of data in the middle of the HTML source code.
Here's my code attempt:
import requests
from bs4 import BeautifulSoup as BS
url = "https://www.youtube.com/watch?v=XQgXKtPSzUI&t=915s"
response = requests.get(url)
soup = BS(response.content, "html.parser")
response.close()
dia = soup.find_all('span',{'class':'date'})
print(dia)
Output:
[]
I know that the arguments I'm passing to .find_all() are wrong.
I say this because I was able to scrape other information from the video with the same code, such as the title and the view count.
I've tried different arguments with .find_all(), but I haven't figured out how to find the date.
If you use Python with pafy, the object you'll get has the published date easily accessible.
Install pafy: "pip install pafy"
import pafy
vid = pafy.new("www.youtube.com/watch?v=2342342whatever")
published_date = vid.published
print(published_date)  # Python 3 print
Check out the pafy docs for more info:
https://pythonhosted.org/Pafy/
The reason I'm leaving the doc link is that it's a really neat module: it fetches the data without external request modules and also exposes a bunch of other useful properties of the video, like the best-format download link, etc.
It seems that YouTube uses JavaScript to add the date, so that information is not in the static HTML elements you're searching. You should try scraping with Selenium, or pull the date out of the JavaScript data, since it is embedded directly in the page source.
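A rough sketch of that second option: at the time of writing, the watch page appears to embed a publishDate field in the JSON inside a <script> tag, so a regex over the raw source can pull it out. The key name is an assumption and may change:

import re
import requests

url = "https://www.youtube.com/watch?v=XQgXKtPSzUI&t=915s"
html = requests.get(url).text

# YouTube embeds video metadata as JSON inside a <script> tag;
# the exact key name may change, so treat this as an assumption.
match = re.search(r'"publishDate":"([^"]+)"', html)
if match:
    print(match.group(1))
else:
    print("publishDate not found; the embedded JSON may have changed")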
Try passing the attributes via the attrs keyword, as shown below:
dia = soup.find_all('span', attrs={'class': 'date'})
I've been reading up on parsing XML with Python all day, but looking at the site I need to extract data from, I'm not sure if I'm barking up the wrong tree. Basically, I want to get the 13-digit barcodes from a supermarket website (they're found in the names of the images). For example:
http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985
has 11 items and 11 images; the barcode for the first item is 0000003235676. However, when I look at the page source (I assume this is the best way to extract all of the barcodes in one go with Python, urllib, and BeautifulSoup), all of the barcodes sit on one line (line 12 of the source), and the data doesn't seem to be structured as I would expect in terms of elements and attributes:
new TESCO.sites.UI.entities.Product({name:"Lb Mens Mattifying Dust 7G",xsiType:"QuantityOnlyProduct",productId:"275303365",baseProductId:"72617958",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/805/5021320051805/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"g",unitPrice:3.58,catchWeight:"0",shelfName:"Mens Styling",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Thickening Shampoo 250Ml",xsiType:"QuantityOnlyProduct",productId:"275301223",baseProductId:"72617751",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/225/5021320051225/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:1,catchWeight:"0",shelfName:"Mens Shampoo ",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Sculpting Puty 75Ml",xsiType:"QuantityOnlyProduct",productId:"275301557",baseProductId:"72617906",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/287/5021320051287/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:3.34,catchWeight:"0",shelfName:"Pastes, Putty, Gums, Pomades",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
Maybe something like BeautifulSoup is overkill? I understand that the DOM tree is not the same thing as the raw source, but why are they so different here? When I go to "inspect element" in Firefox, the data seems structured as I would expect.
Apologies if this comes across as totally stupid; thanks in advance.
Unfortunately, the barcode is not given in the HTML as structured data; it only appears embedded as part of a URL. So we'll need to isolate the URL and then pick off the barcode with string manipulation:
import urllib2
import bs4 as bs
import re
import urlparse

url = 'http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985'
response = urllib2.urlopen(url)
content = response.read()

# with open('/tmp/test.html', 'w') as f:
#     f.write(content)
# Useful for debugging off-line:
# with open('/tmp/test.html', 'r') as f:
#     content = f.read()

soup = bs.BeautifulSoup(content)
barcodes = set()
for tag in soup.find_all('img', {'src': re.compile(r'/pi/')}):
    href = tag['src']
    scheme, netloc, path, query, fragment = urlparse.urlsplit(href)
    # The barcode is the second-to-last segment of the image URL's path.
    barcodes.add(path.split('/')[-2])
print(barcodes)
yields
set(['0000003222737', '0000010039670', '0000010036297', '0000010008393', '0000003050453', '0000010062951', '0000003239438', '0000010078402', '0000010016312', '0000003235676', '0000003203132'])
Since your site uses JavaScript to render its content, you might find it useful to switch from urllib to a tool like Selenium. That way you can crawl pages as they render for a real user in a web browser. This GitHub project seems to solve your task.
Another option is to filter the JSON-like data out of the page's JavaScript and get the values directly from there, as in the sketch below.
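Here's a rough sketch of that second option, assuming the Product({...}) calls look like the excerpt in the question. The keys are unquoted, so json.loads won't parse them directly; a regex per field is a simple workaround:

import re
import requests

url = ('http://www.tesco.com/groceries/SpecialOffers/'
       'SpecialOfferDetail/Default.aspx?promoId=A31033985')
content = requests.get(url).text

# Pull the fields straight out of the Product({...}) calls in the page source.
names = re.findall(r'name:"([^"]*)"', content)
image_urls = re.findall(r'imageURL:"([^"]*)"', content)

# The 13-digit barcode is the second-to-last segment of the image URL path.
barcodes = [u.split('/')[-2] for u in image_urls]
for name, code in zip(names, barcodes):
    print(name, code)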