I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(url.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!
The reason is that the response you get from requests.get() does not contain table data in it.
It might be loaded on client-side(by javascript).
What can you do about this? Using a selenium webdriver is a possible solution. You can "wait" until the table is loaded and becomes interactive, then get the page content with selenium, pass the context to bs4 to do the scraping.
You can check the response by writing it to a file:
f = open("demofile.html", "w", encoding='utf-8')
f.write(soup.prettify())
f.close()
and you will be able to see "...Loading..." where the table is expected.
I believe the data is loaded from a script tag. I have to go to work so can't spend more time working out how to appropriately recreate the a dataframe from the "|" delimited data at present, but the following may serve as a starting point for others, as it extracts the relevant entries from the script tag for the table body.
import requests, re
import ast
r = requests.get('https://cryptoli.st/lists/fixed-supply').text
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags = re.S).group(1)
data = ast.literal_eval(s)
data = [i.split('|') for i in data]
print(data)
Related
I'm new to python (and coding in general) and am trying to write a script that will scrape all of the <p> tags from a given URL and then create a CSV file with them. It seems to run through okay, but the CSV file it creates doesn't have any data in it. Below is my code:
import requests
r = requests.get('https://seekingalpha.com/amp/article/4420423-chipotle-mexican-grill-inc-s-cmg-ceo-brian-niccol-on-q1-2021-results-earnings-call-transcript')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('<p>')
records = []
for result in results:
Comment = result.find('<p>').text
records.append((Comment))
import pandas as pd
df = pd.DataFrame(records, columns=['Comment'])
df.to_csv('CMG_test.csv', index=False, encoding='utf-8')
print('finished')
Any help greatly appreciated!
First, you need to pass CSS selectors to BeautifulSoup methods. <p> isn't a selector. p is. So, in order to find all p tags, you need to use the find_all method on the soup like so:
results = soup.find_all('p')
Take a look at this page for more info on the CSS selectors.
Secondly, in your iteration over results, you don't need to find the tag all over again. You can simply extract the text by result.text.
So, if you rewrite your code like the following:
import requests
r = requests.get('https://seekingalpha.com/amp/article/4420423-chipotle-mexican-grill-inc-s-cmg-ceo-brian-niccol-on-q1-2021-results-earnings-call-transcript')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('p')
records = []
for result in results:
Comment = result.text
records.append(Comment)
import pandas as pd
df = pd.DataFrame(records, columns=['Comment'])
df.to_csv('CMG_test.csv', index=False, encoding='utf-8')
print('finished')
You'll find your csv well-populated with the content you're looking for.
I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from a jQuery. I have managed to write the below code which gets a large amount of text, where the index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but dont know how. what is the best way to solve this.
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print soup.find_all('script')[21]
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
If you view the Page Source of the URL, you will find that there are two script elements that are having CDATA. But the script element in which you are interested has jQuery in it. So you have to select the script element based on this knowledge. After that, you need to do some cleaning to get rid of CDATA tags and jQuery. Then with the help of json library, convert JSON data to Python Dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
if 'CDATA' in script.text and 'jQuery' in script.text:
scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
Trying get a table from the website SGX.
The page is saved to local drive and I am using BeautifulSoup to parse it:
soup = BeautifulSoup(open(pages), "lxml")
soup.prettify()
list_0 = soup.find_all('table')[0]
print list_0
What it returned, is not the first row on the page:
[<tr><td>Zhongmin Baihui</td><td>5SR</td><td class="nowrap">09:44 AM</td><td class="nowrap">09:49 AM</td><td>0.615</td><td>0.675</td><td>0.555</td></tr>]
What's the right way to retrieve this table?
Thank you.
Data are being fetched after page loads using AJAX request, by inspecting the page you can find the API URL (the Url below), and then you can use something like that:
import pandas as pd
import requests
import json
response = requests.get('https://api.sgx.com/securities/v1.1?excludetypes=bonds¶ms=nc%2Cadjusted-vwap%2Cb%2Cbv%2Cp%2Cc%2Cchange_vs_pc%2Cchange_vs_pc_percentage%2Ccx%2Ccn%2Cdp%2Cdpc%2Cdu%2Ced%2Cfn%2Ch%2Ciiv%2Ciopv%2Clt%2Cl%2Co%2Cp_%2Cpv%2Cptd%2Cs%2Csv%2Ctrading_time%2Cv_%2Cv%2Cvl%2Cvwap%2Cvwap-currency')
data = json.loads(response.content)["data"]["prices"]
df = pd.DataFrame(data)
print(df)
If your requirement are complex and your crawling done in regular basis I recommend using scrapy.
I am a beginner in Python and web scraping but I am really interested. What I want to do is to extract the total number of search results per day.
If you open it, you will see here:
Used Cars for Sale
Results 1 - 20 of 30,376
What I want is only the number 30,376. Is there any way to extract it on a daily basis automatically and save it to an excel file please? I have played around some packages in Python but all I got is error messages and something not relevant like below:
from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url = "..."
def make_soup(url):
html = urlopen(url).read()
return BeautifulSoup(html, "lxml")
make_soup(base_url)
Can someone show me how to extract that particular number please? Thanks!
Here is the one way through requests module and soup.select function.
from bs4 import BeautifulSoup
import requests
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
def make_soup(url):
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
txt = soup.select('#result-header .result-count')[0].text
print txt.split()[-1]
make_soup(base_url)
soup.select accepts an css selector as argument. This #result-header .result-count selector means find the element having result-count class which was inside an element having result-header as id.
from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
html = urlopen(base_url).read()
soup = BeautifulSoup(html, 'lxml')
result_count = soup.find(class_="result-count").text.split('of ')[-1]
print(result_count)
out:
30,376
from bs4 import BeautifulSoup
import requests, re
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
a = BeautifulSoup(requests.get(base_url).content).select('div#result-header p.result-count')[0].text
num = re.search('([\w,]+)$',a)
print int(num.groups(1)[0].replace(',',''))
Output:
30378
Will get any other number also which is at the end of the statement.
Appending new rows to an Existing Excel File
Script to append today's date and the extracted number to existing excel file:
!!!Important!!!: Don't run this code directly on your main file. Instead make a copy of it first and run on that file. If it works properly then you can run it on your main file. I'm not responsible if you loose your data :)
import openpyxl
import datetime
wb = openpyxl.load_workbook('/home/yusuf/Desktop/data.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
a = sheet.get_highest_row()
sheet.cell(row=a,column=0).value=datetime.date.today()
sheet.cell(row=a,column=1).value=30378 # use a variable here from the above (previous) code.
wb.save('/home/yusuf/Desktop/data.xlsx')
I'm scraping the NFL's website for player statistics. I'm having an issue when parsing the web page and trying to get to the HTML table which contains the actual information I'm looking for. I successfully downloaded the page and saved it into the directory I'm working in. For reference, the page I've saved can be found here.
# import relevant libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("1998.html"))
result = soup.find(id="result")
print result
I found that at one point, I ran the code and result printed the correct table I was looking for. Every other time, it doesn't contain anything! I'm assuming this is user error, but I can't figure out what I'm missing. Using "lxml" returned nothing and I can't get html5lib to work (parsing library??).
Any help is appreciated!
First, you should read the contents of your file before passing it to BeautifulSoup.
soup = BeautifulSoup(open("1998.html").read())
Second, verify manually that the table in question exists in the HTML by printing the contents to screen. The .prettify() method makes the data easier to read.
print soup.prettify()
Lastly, if the element does in fact exist, the following will be able to find it:
table = soup.find('table',{'id':'result'})
A simple test script I wrote cannot reproduce your results.
import urllib
from bs4 import BeautifulSoup
def test():
# The URL of the page you're scraping.
url = 'http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING&conference=null&season=1998&seasonType=REG&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1'
# Make a request to the URL.
conn = urllib.urlopen(url)
# Read the contents of the response
html = conn.read()
# Close the connection.
conn.close()
# Create a BeautifulSoup object and find the table.
soup = BeautifulSoup(html)
table = soup.find('table',{'id':'result'})
# Find all rows in the table.
trs = table.findAll('tr')
# Print to screen the number of rows found in the table.
print len(trs)
This outputs 51 every time.