Can anybody help, please?
I am finding it difficult to grab the value of currency units per SDR using my Python code below. The code works with other URLs, but for this URL I always get a null result, and I don't understand what is wrong.
I'm using a Python Scrapy spider.
URL: https://www.imf.org/external/np/fin/data/rms_five.aspx
I reviewed the content on the website and found that the element value contains some leading/trailing spaces.
Whether I query the response with CSS selectors or XPath, I get the same result, i.e. null.
def start_requests(self):
    urls = [
        'https://www.imf.org/external/np/fin/data/rms_five.aspx'
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

def parse(self, response):
    print(response.url)
    #for i in range (1, 24):
    yield {
        'Kurs_imf2': response.xpath('//*[#id="content"]/center/table/tbody/tr[3]/td/div/table/tbody/tr[3]/td[3]/text()').getall()
    }
Well, looking only at your ultimate goal (and setting aside your choice of tools), here is one way of getting the data from that table as a dataframe. Once you have the data, you can manipulate it as needed (stripping the leading/trailing spaces, etc.):
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.imf.org/external/np/fin/data/rms_five.aspx'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
table_w_data = soup.select_one('div.fancy').select_one('table')
df = pd.read_html(str(table_w_data))[0]
print(df)
This will display in terminal:
(all columns sit under the top-level header 'SDRs per Currency unit 2'; the currency-name column is unnamed, followed by five date columns)

                          September 05, 2022  September 02, 2022  September 01, 2022     August 31, 2022     August 30, 2022
0   Chinese yuan                         NaN            0.111437            0.111266            0.111474            0.110906
1   Euro                                 NaN            0.768586            0.768411            0.768436            0.769353
2   Japanese yen                         NaN          0.00549178          0.00550612          0.00554387          0.00553607
3   U.K. pound                           NaN            0.889646            0.887851            0.892615            0.898473
4   U.S. dollar                          NaN            0.769124            0.768104            0.768436            0.766746
5   Algerian dinar                       NaN           0.0054812          0.00547939          0.00547416          0.00547027
6   Australian dollar                    NaN            0.522235            0.524922            0.530375            0.529438
7   Botswana pula                        NaN           0.0593764           0.0595281           0.0600917           0.0600362
8   Brazilian real                       NaN            0.148273            0.147709            0.148393            0.151498
9   Brunei dollar                        NaN            0.548669            0.548215             0.55085             0.54893
10  Canadian dollar                      NaN            0.586178              0.5834              0.5861            0.586377
11  Chilean peso                         NaN         0.000856257         0.000855312         0.000871134         0.000860642
12  Czech koruna                         NaN           0.0313774           0.0313768           0.0313289           0.0312983
13  Danish krone                         NaN            0.103346             0.10332            0.103325            0.103441
14  Indian rupee                         NaN           0.0096395          0.00967422                 NaN          0.00961806
15  Israeli New Shekel                   NaN            0.227889            0.228331            0.230002            0.231996
16  Korean won                           NaN         0.000569131         0.000571039         0.000570268         0.000568929
17  Kuwaiti dinar                        NaN                 NaN             2.49425             2.49654             2.49105
18  Malaysian ringgit                    NaN            0.171526            0.171356                 NaN            0.170977
19  Mauritian rupee                      NaN           0.0172087                 NaN           0.0171443           0.0171741
20  Mexican peso                         NaN           0.0385038           0.0379361           0.0382379           0.0380585
21  New Zealand dollar                   NaN            0.466781            0.468236            0.471167             0.47105
22  Norwegian krone                      NaN           0.0768317           0.0767413           0.0773168           0.0788655
23  Omani rial                           NaN                 NaN             1.99767             1.99853             1.99414
24  Peruvian sol                         NaN            0.198946            0.199042            0.200166                 NaN
25  Philippine peso                      NaN                 NaN           0.0136744           0.0136628           0.0136826
26  Polish zloty                         NaN            0.162688            0.163569            0.162254            0.162412
27  Qatari riyal                         NaN                 NaN            0.211018            0.211109            0.210645
28  Russian ruble                        NaN           0.0127399           0.0127514           0.0127565           0.0127013
29  Saudi Arabian riyal                  NaN                 NaN            0.204828            0.204916            0.204466
30  Singapore dollar                     NaN            0.548669            0.548215             0.55085             0.54893
31  South African rand                   NaN           0.0444661           0.0448438           0.0450817           0.0455618
32  Swedish krona                        NaN           0.0715711           0.0718351           0.0720278           0.0718782
33  Swiss franc                          NaN            0.782903             0.78458            0.784038            0.789442
34  Thai baht                            NaN           0.0208978            0.020927           0.0210548           0.0210465
35  Trinidadian dollar                   NaN            0.114475            0.114048                 NaN            0.114091
36  U.A.E. dirham                        NaN                 NaN             0.20915            0.209241             0.20878
37  Uruguayan peso                       NaN           0.0188128           0.0188288           0.0187606           0.0188043
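As a quick follow-up sketch (assuming the df produced above): the scraped table comes back with a two-level header and, as you noticed, stray whitespace in some cells, so a couple of cleanup lines make it easier to work with:

# keep only the second header level (the dates), dropping the shared top level
df.columns = [str(level2) for _, level2 in df.columns]
# strip leading/trailing spaces from any string cells
df = df.applymap(lambda v: v.strip() if isinstance(v, str) else v)
print(df)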
Relevant documentation:
Requests: https://requests.readthedocs.io/en/latest/
Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Related
Here's the code. I'm trying to parse a spreadsheet with over 2000 items, but whenever I run this script I only get the last one hundred or so. What could I do to fix this? I have tried different parsers and haven't found any solutions.
from bs4 import BeautifulSoup
import requests

url = "https://backpack.tf/spreadsheet"
sourse = requests.get("https://backpack.tf/spreadsheet").text
soup = BeautifulSoup(sourse, "html.parser")

try:
    for name in soup.find_all("tr"):
        header = name.text
        print(header)
except:
    pass
I couldn't get the HTML to render here, sorry, so please go to https://backpack.tf/spreadsheet
The simplest way to read the table from this page is with pd.read_html:
import requests
import pandas as pd

url = "https://backpack.tf/spreadsheet"

r = requests.get(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0"
    },
)
df = pd.read_html(r.text)[0]
print(df)
Prints:
Name Type Genuine Vintage Unique Strange Haunted Collector's
0 A Brush with Death Cosmetic NaN NaN 5.11–5.22 ref 18.4 keys NaN 200 keys
1 A Color Similar to Slate Tool NaN NaN 33.11–33.22 ref NaN NaN NaN
2 A Color Similar to Slate (Non-Craftable) Tool NaN NaN 30 ref NaN NaN NaN
...
2851 Zepheniah's Greed (Non-Craftable) Tool NaN NaN 12 ref NaN NaN NaN
2852 Zipperface (Non-Craftable) Cosmetic NaN NaN 10.55 ref NaN NaN NaN
2853 Zipperface Cosmetic NaN NaN 1.65 keys NaN 13–14.66 ref NaN
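One likely reason you saw only the last hundred or so rows is that your terminal's scrollback truncated the printed output; the parse itself gets every row. Writing the dataframe to a file keeps everything (a minimal follow-up sketch, assuming the df from above):

df.to_csv("spreadsheet.csv", index=False)  # all ~2,850 rows end up in the file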
I am trying to scrape data from the following website. For the year 1993, for example, this is the link:
https://www.ugc.ac.in/jobportal/search.aspx?tid=MTk5Mw==
Firstly, I am not sure how to navigate between the pages, as the URL for every page is the same.
Secondly, I wrote the following code to scrape information on any given page.
url = "https://www.ugc.ac.in/jobportal/search.aspx?tid=MTk5Mw=="
File = []
response = requests.get(url)
soup = bs(response.text,"html.parser")
entries = soup.find_all('tr',{'class': 'odd'})
for entry in entries:
columns = {}
Cells = entry.find_all("td")
columns['Gender'] = Cells[3].get_text()
columns['Category'] = Cells[4].get_text()
columns['Subject'] = Cells[5].get_text()
columns['NET Qualified'] = Cells[6].get_text()
columns['Month/Year'] = Cells[7].get_text()
File.append(columns)
df = pd.DataFrame(File)
df
I am not getting any error while running the code, but I am not getting any output either. I can't figure out what mistake I am making here. Would appreciate any inputs. Thanks!
All data is stored inside a <script> tag on that HTML page. To read it into a pandas dataframe you can use the following example:
import re
import requests
import pandas as pd
url = "https://www.ugc.ac.in/jobportal/search.aspx?tid=MTk5Mw=="
html_doc = requests.get(url).text
data = re.search(r'"aaData":(\[\{.*?\}]),', html_doc, flags=re.S).group(1)
df = pd.read_json(data)
print(df)
df.to_csv("data.csv", index=False)
Prints:
ugc_net_ref_no candidate_name roll_no subject gender jrf_lec cat vh_ph dob fname mname address exam_date result_date emailid mobile_no subject_code year
0 1/JRF (DEC.93) SHRI VEERENDRA KUMAR SHARMA N035010 ECONOMICS Male JRF GEN NaN NaN SHRI SATYA NARAIN SHARMA None 27 E KARANPUR (PRAYAG) ALLAHABAD U.P. 19th DEC.93 NULL NaN NaN NaN 1993
1 1/JRF (JUNE,93) SH MD TARIQUE R020005 ECONOMICS Male JRF GEN NaN NaN MD. ZAFIR ALAM None D-32, R.M. HALL, A.M.U. ALIGARH 20th June,93 NULL NaN NaN NaN 1993
2 10/JRF (DEC.93) SHRI ARGHYA GHOSH A245015 ECONOMICS Male JRF GEN NaN NaN SHRI BHOLANATH GHOSH None C/O SH.B. GHOSH 10,BAMACHARAN GHOSH LANE P.O.-BHADRESWAR,DIST.HOOGHLY CALCUTTA-24. 19th DEC.93 NULL NaN NaN NaN 1993
3 10/JRF (JUNE,93) SH SANTADAS GHOSH T210024 ECONOMICS Male JRF GEN NaN NaN SHRI HARIDAS GHOSH None P-112, MOTILAL COLONY, NO.-1 P.O. DUM DUM, CALCUTTA - 700 028 20th June,93 NULL NaN NaN NaN 1993
...
and saves data.csv.
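From there, the fields the question asks about can be pulled straight out of the dataframe; a small follow-up sketch (column names taken from the aaData keys visible in the printout above):

subset = df[["candidate_name", "gender", "cat", "subject", "jrf_lec", "year"]]
subset.to_csv("subset.csv", index=False)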
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
from datetime import datetime

def extract_source(url):
    agent = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=agent).text
    return source

html_text = extract_source('https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids')
soup = BeautifulSoup(html_text, 'lxml')

for a in soup.find_all('a', class_='button button--link button--fluid catalog-list-item__actions-primary-button', href=True):
    # print("Found the URL:", a['href'])
    urlof = a['href']
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, 'lxml')
    table_rows = soup.find_all('tr')
    first_columns = []
    third_columns = []
    for row in table_rows:
        # for row in table_rows[1:]:
        first_columns.append(row.findAll('td')[0])
        third_columns.append(row.findAll('td')[1])
    for first, third in zip(first_columns, third_columns):
        print(first.text, third.text)
Basically I am trying to scrape table data from multiple links on the website, and I want to insert that data into a single CSV file in the following format:
SKU 07DE9922
Analyte / Target Corticosterone
Base Catalog Number DE9922
Diagnostic Platforms EIA/ELISA
Diagnostic Solutions Endocrinology
Disease Screened Corticosterone
Evaluation Quantitative
Pack Size 96 Wells
Sample Type Plasma, Serum
Sample Volume 10 uL
Species Reactivity Mouse, Rat
Usage Statement For Research Use Only, not for use in diagnostic procedures.
into the format below in the CSV file:
SKU Analyte/Target Base Catalog Number Pack Size Sample Type
data data data data data
I am having difficulty converting the data into the proper format.
I made a small modification to your code. Instead of printing the data, I created a dictionary and added it to a list, then used that list to create a DataFrame:
import pandas as pd
import requests
import time
from datetime import datetime
from bs4 import BeautifulSoup

def extract_source(url):
    agent = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=agent).text
    return source

html_text = extract_source(
    "https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids"
)
soup = BeautifulSoup(html_text, "lxml")

data = []
for a in soup.find_all(
    "a",
    class_="button button--link button--fluid catalog-list-item__actions-primary-button",
    href=True,
):
    urlof = a["href"]
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, "lxml")
    table_rows = soup.find_all("tr")
    first_columns = []
    third_columns = []
    for row in table_rows:
        first_columns.append(row.findAll("td")[0])
        third_columns.append(row.findAll("td")[1])
    # create dictionary with values and add to the list
    d = {}
    for first, third in zip(first_columns, third_columns):
        d[first.get_text(strip=True)] = third.get_text(strip=True)
    data.append(d)

df = pd.DataFrame(data)
print(df)
df.to_csv("data.csv", index=False)
Prints:
SKU Alternate Names Base Catalog Number CAS # EC Number Format Molecular Formula Molecular Weight Personal Protective Equipment Usage Statement Application Notes Beilstein Registry Number Optical Rotation Purity UV Visible Absorbance Hazard Statements RTECS Number Safety Symbol Auto Ignition Biochemical Physiological Actions Density Melting Point pH pKa Solubility Vapor Pressure Grade Boiling Point Isoelectric Point
0 02100078-CF 2-Acetamido-5-Guanidinovaleric acid 100078 210545-23-6 205-846-6 Powder C8H16N4O3· 2H2O 216.241 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 02100142-CF Acetyltryptophan; DL-α-Acetylamino-3-indolepro... 100142 87-32-1 201-739-3 Powder C13H14N2O3 246.266 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... N-acetyl-DL-tryptophan, is used as stabilizer ... 89478 0° ± 2° (c=1, 1N NaOH, 24 hrs.) ~99% λ max (water)=280 ± 2 nm NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 02100421-CF L-2,5-Diaminopentanoic acid; 2,5-Diaminopentan... 100421 3184-13-2 221-678-6 Powder C5H12N2O2 • HCl 168.621 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN 3625847 NaN ~99% NaN H319 RM2985000 GHS07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 02100520-CF Phosphocreatine Disodium Salt Tetrahydrate; So... 100520 922-32-7 213-074-6 Powder C4H8N3Na2O5P·4H2O 255.077 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN NaN NaN ≥98% NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 02100769-CF Vitamin C; Ascorbate; Sodium ascorbate; L-Xylo... 100769 50-81-7 200-066-2 NaN C6H8O6 176.124 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... L-Ascorbic Acid is used as an Antimicrobial an... 84272 +18° to +32° (c=1, water) ≥98% NaN NaN CI7650000 NaN 1220° F (NTP, 1992) Ascorbic Acid, also known as Vitamin C, is a s... 1.65 (NTP, 1992) 374 to 378° F (decomposes) (NTP, 1992) Between 2,4 and 2,8 (2 % aqueous solution) pK1: 4.17; pK2: 11.57 greater than or equal to 100 mg/mL at 73° F (... 9.28X10-11 mm Hg at 25 deg C (est) NaN NaN NaN
5 02101003-CF Lycine; Oxyneurine; (Carboxymethyl)trimethylam... 101003 107-43-7 203-490-6 Powder C5H11NO2 117.148 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... Betaine is a reagent that is used in soldering... 3537113 NaN NaN NaN NaN DS5900000 NaN NaN End-product of oxidative metabolism of choline... NaN Decomposes around 293 deg C NaN 1.83 (Lit.) Solubility (g/100 g solvent): <a class="pubche... 1.36X10-8 mm Hg at 25 deg C (est) Anhydrous NaN NaN
6 02101806-CF (S)-2,5-Diamino-5-oxopentanoic acid; L-Glutami... 101806 56-85-9 200-292-1 NaN C5H10N2O3 146.146 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... L-glutamine is an essential amino acid, which ... 1723797 +30 ± 5° (c = 3.5, 1N HCl) ≥99% NaN NaN MA2275100 NaN NaN L-Glutamine is an essential amino acid that is... 1.364 g/cu cm 185.5 dec °C pH = 5-6 at 14.6 g/L at 25 deg C NaN Water Solubility41300 mg/L (at 25 °C) 1.9X10-8 mm Hg at 25 deg C (est) NaN NaN NaN
7 02102158-CF L-2-Amino-4-methylpentanoic acid; Leu; L; α-am... 102158 61-90-5 200-522-0 Powder C6H13NO2 131.175 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... Leucine has been used as a molecular marker in... 1721722 +14.5 ° to +16.5 ° (Lit.) NaN NaN NaN OH2850000 NaN NaN NaN 1.293 g/cu cm at 18 deg C 293 °C NaN 2.33 (-COOH), 9.74 (-NH2)(Lit.) Water Solubility21500 mg/L (at 25 °C) 5.52X10-9 mm Hg at 25 deg C (est) NaN Sublimes at 145-148 deg C. Decomposes at 293-2... 6.04(Lit.)
8 02102576-CF 4-Hydroxycinnamic acid; 3-(4-Hydroxphenyl)-2-p... 102576 7400-08-0 231-000-0 Powder C9H8O3 164.16 g/mol Dust mask, Eyeshields, Gloves Unless specified otherwise, MP Biomedical's pr... p-Coumaric acid was used as a substrate to stu... NaN NaN ≥98% NaN NaN GD9094000 NaN NaN NaN NaN 211.5 °C NaN NaN NaN NaN NaN NaN NaN
9 02102868-CF DL-2-Amino-3-hydroxypropionic acid; (±)-2-Amin... 102868 302-84-1 206-130-6 Powder C3H7NO3 105.093 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN 1721405 -1° to + 1° (c = 5, 1N HCl) ≥98% NaN NaN NaN NaN NaN NMDA agonist acting at the glycine site; precu... 1.6 g/cu cm # 22 deg C 228 deg C (decomposes) NaN NaN SOL IN <a class="pubchem-internal-link CID-962... NaN NaN NaN NaN
...and so on.
And saves data.csv.
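If you only want the handful of columns from your target layout, you can subset before saving; a follow-up sketch (the names below are copied from your desired format and kept only if they actually appear in the scraped table):

wanted = ["SKU", "Analyte / Target", "Base Catalog Number", "Pack Size", "Sample Type"]
df[[c for c in wanted if c in df.columns]].to_csv("data.csv", index=False)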
Try this -
import pandas as pd
import requests
import time
from datetime import datetime
from bs4 import BeautifulSoup

def extract_source(url):
    agent = {"User-Agent": "Mozilla/5.0"}
    return requests.get(url, headers=agent).text

html_text = extract_source(
    'https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids')
soup = BeautifulSoup(html_text, 'lxml')

result = []
for index, a in enumerate(soup.find_all('a', class_='button button--link button--fluid catalog-list-item__actions-primary-button', href=True)):
    # if index >= 10:
    #     break
    # print("Found the URL:", a['href'])
    urlof = a['href']
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, 'lxml')
    table_rows = soup.find_all('tr')
    first_columns = []
    third_columns = []
    for row in table_rows:
        # for row in table_rows[1:]:
        first_columns.append(row.findAll('td')[0])
        third_columns.append(row.findAll('td')[1])
    temp = {}
    for first, third in zip(first_columns, third_columns):
        third = str(third.text).strip('\n')
        first = str(first.text).strip('\n')
        temp[first] = third
    result.append(temp)

df = pd.DataFrame(result)
# please drop useless column before saving the output to csv.
df.to_csv('out.csv', index=False)
This will give the output below (the first three rows, shown transposed as column: value pairs, with empty cells omitted).

Columns: SKU, Alternate Names, Base Catalog Number, CAS #, EC Number, Format, Molecular Formula, Molecular Weight, Personal Protective Equipment, Usage Statement, Application Notes, Beilstein Registry Number, Optical Rotation, Purity, UV Visible Absorbance, Hazard Statements, RTECS Number, Safety Symbol

Row 1:
SKU: 02100078-CF
Alternate Names: 2-Acetamido-5-Guanidinovaleric acid
Base Catalog Number: 100078
CAS #: 210545-23-6
EC Number: 205-846-6
Format: Powder
Molecular Formula: C8H16N4O3· 2H2O
Molecular Weight: 216.241 g/mol
Personal Protective Equipment: Eyeshields, Gloves, respirator filter
Usage Statement: Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department.

Row 2:
SKU: 02100142-CF
Alternate Names: Acetyltryptophan; DL-α-Acetylamino-3-indolepropionic acid
Base Catalog Number: 100142
CAS #: 87-32-1
EC Number: 201-739-3
Format: Powder
Molecular Formula: C13H14N2O3
Molecular Weight: 246.266 g/mol
Personal Protective Equipment: Eyeshields, Gloves, respirator filter
Usage Statement: Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department.
Application Notes: N-acetyl-DL-tryptophan, is used as stabilizer in the human blood-derived therapeutic products normal serum albumin and plasma protein fraction.
Beilstein Registry Number: 89478
Optical Rotation: 0° ± 2° (c=1, 1N NaOH, 24 hrs.)
Purity: ~99%
UV Visible Absorbance: λ max (water)=280 ± 2 nm

Row 3:
SKU: 02100421-CF
Alternate Names: L-2,5-Diaminopentanoic acid; 2,5-Diaminopentanoic acid monohydrochloride
Base Catalog Number: 100421
CAS #: 3184-13-2
EC Number: 221-678-6
Format: Powder
Molecular Formula: C5H12N2O2 • HCl
Molecular Weight: 168.621 g/mol
Personal Protective Equipment: Eyeshields, Gloves, respirator filter
Usage Statement: Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department.
Beilstein Registry Number: 3625847
Purity: ~99%
Hazard Statements: H319
RTECS Number: RM2985000
Safety Symbol: GHS07
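A side note on both versions: time is imported but never used, and the loop fetches one product page per iteration. A small sketch of a throttled fetch helper that could stand in for extract_source (the one-second delay is an assumption, not something the site requires):

import time
import requests

def extract_source(url, delay=1.0):
    # fetch with a browser-like User-Agent, then pause so the
    # loop doesn't hammer the server (delay value is assumed)
    agent = {"User-Agent": "Mozilla/5.0"}
    text = requests.get(url, headers=agent).text
    time.sleep(delay)
    return text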
I am trying to figure out an elegant way to scrape tables from a website. However, when running the script below I get a ValueError: No tables found error.
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome(executable_path=r'C:...\chromedriver.exe')
driver.implicitly_wait(30)
driver.get("https://www.gallop.co.za/#meeting#20201125#3")
df_list=pd.read_html(driver.find_element_by_id("eventTab_4").get_attribute('outerHTML'))
When I look at the site elements, I notice that the code works if the <table> tag lies neatly within the <div id="...">. However, in this case I think the code is not working for two reasons:
There is a <div> within a <div>, and only then the <table> tag.
The site builds the tables with JavaScript.
I would be grateful for advice on how to pull the tables for all races. That is, there are several tables, each made visible as the user clicks on a tab (race), and I need to extract all of them into separate dataframes.
from selenium import webdriver
import time
import pandas as pd

pd.set_option('display.max_column', None)

driver = webdriver.Chrome(executable_path='C:/bin/chromedriver.exe')
driver.get("https://www.gallop.co.za/#meeting#20201125#3")
time.sleep(5)

tab = driver.find_element_by_id('tabs')  # All tabs are here
li_list = tab.find_elements_by_tag_name('li')  # They are in a "li"
a_list = []
for li in li_list[1:]:  # First tab has nothing.. We skip it
    a = li.find_element_by_tag_name('a')  # extract the "a" element from the "li"
    a_list.append(a)

df = []
for a in a_list:
    a.click()  # Next Tab
    time.sleep(8)  # Tables take some time to load fully
    page = driver.page_source  # Get the HTML of the new Tab page
    source = pd.read_html(page)
    table = source[1]  # Get 2nd table
    df.append(table)
print(df)
Output
[ Silk No Horse Unnamed: 3 ACS SH CFR Trainer \
0 NaN 1 JEM ROCK NaN 4bC A NaN Eric Sands
1 NaN 2 MAISON MERCI NaN 3chC A NaN Brett Crawford
2 NaN 3 WORDSWORTH NaN 3bC AB NaN Greg Ennion
3 NaN 4 FOUND THE DREAM NaN 3bG A NaN Adam Marcus
4 NaN 5 IZAPHA NaN 3bG A NaN Andre Nel
5 NaN 6 JACKBEQUICK NaN 3grG A NaN Glen Kotzen
6 NaN 7 MHLABENI NaN 3chG A NaN Eric Sands
7 NaN 8 ORLOV NaN 3bG A NaN Adam Marcus
8 NaN 9 T'CHALLA NaN 3bC A NaN Justin Snaith
9 NaN 10 WEST COAST WONDER NaN 3bG A NaN Piet Steyn
Jockey Wgt MR Dr Odds Last 3 Runs
0 D Dillon 60 0 7 NaN NaN
continued
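Since df ends up as a list of dataframes (one per race tab), a short follow-up sketch writes each one to its own CSV, which matches the goal of keeping the races separate:

for i, race_table in enumerate(df, start=1):
    race_table.to_csv(f"race_{i}.csv", index=False)  # one file per race tab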
I am scraping a table from a website and have had no problems getting the data, but I am having issues printing the final output. In the code I've provided, everything prints fine if I print inside the for loop (see the commented-out print commands). However, if I print later in the code, outside the for loop, I only get the first row. I'd like to put this code into a larger project where this output (among others) goes into a single email. How do I get the entire output to appear?
I've tried appending each table row to a list, but I think I am doing it wrong: it just prints the same row over and over, or individual letters from the first row.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
#print('Scraping Iowa Dept of Banking...')
url = 'https://www.idob.state.ia.us/bank/docs/applica/app_status.aspx'
r = requests.get(url, headers = headers)
soup = BeautifulSoup(r.text, 'html.parser')
mylist = []
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    if len(tds[5].text) == 1:
        edate = "NA"
    else:
        edate = ""
    if len(tds[6].text) == 1:
        loc = "NA"
    else:
        loc = ""
    output5 = ("Bank: %s, City: %s, Type: %s, Effective Date: %s, Location: %s, Comment: %s \r\n" % (tds[0].text, tds[1].text, tds[2].text.replace(" ", ""), tds[5].text+edate, tds[6].text.replace(" ", "")+loc, tds[7].text))
    global outputs5
    outputs5 = output5
    #print(outputs5) #The whole table prints if printed here
if outputs5 is None:
    outputs5 = "No information available"
    print(outputs5)
print(outputs5) #only prints the first line
I would use pandas, a Python library, to extract the table and export it to CSV.
import pandas as pd
tables=pd.read_html("https://www.idob.state.ia.us/bank/docs/applica/app_status.aspx")
tables[1].to_csv('output.csv')
The CSV will look like that.
It is easy to install pandas. Just type this in the command prompt:
pip install pandas
Try it like this: you need to append the outputs to the list and then join the list together before printing it.
The reason the print inside your loop worked is that it actually printed many times (once per row), not just once.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
#print('Scraping Iowa Dept of Banking...')
url = 'https://www.idob.state.ia.us/bank/docs/applica/app_status.aspx'
r = requests.get(url, headers = headers)
soup = BeautifulSoup(r.text, 'html.parser')
mylist = []
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    if len(tds[5].text) == 1:
        edate = "NA"
    else:
        edate = ""
    if len(tds[6].text) == 1:
        loc = "NA"
    else:
        loc = ""
    output5 = ("Bank: %s, City: %s, Type: %s, Effective Date: %s, Location: %s, Comment: %s \r\n" % (tds[0].text, tds[1].text, tds[2].text.replace(" ", ""), tds[5].text+edate, tds[6].text.replace(" ", "")+loc, tds[7].text))
    mylist.append(output5)  # append each row's line (output5, not outputs5)
if mylist == []:
    print("No information available")
print('\n'.join(mylist))
Have you tried using pandas' .read_html()?
import pandas as pd
url ='https://www.idob.state.ia.us/bank/docs/applica/app_status.aspx'
tables = pd.read_html(url)
Output:
print (tables[-1].to_string())
Bank City Type Accepted Approved Effective # Location Comment
0 State Bank New Hampton Merge With and Into 05/21/2019 NaN NaN NaN Application to merge State Bank, New Hampton, ...
1 Farmers and Traders Savings Bank Douds Merge With and Into 05/20/2019 NaN NaN NaN Application to merge Farmers and Traders Savin...
2 City State Bank Norwalk Establish a Bank Office 05/15/2019 05/29/2019 NaN Mesa, AZ Application by City State Bank, Norwalk, to es...
3 Availa Bank Carroll Purchase and Assumption 04/16/2019 04/30/2019 NaN NaN Application by Availa Bank, Carroll, to purcha...
4 West Bank West Des Moines Establish a Bank Office 04/16/2019 05/02/2019 05/10/2019 Mankato, MN Application by West Bank, West Des Moines, to ...
5 West Bank West Des Moines Establish a Bank Office 04/10/2019 05/02/2019 05/07/2019 Owatonna, MN Application by West Bank, West Des Moines, to ...
6 West Bank West Des Moines Establish a Bank Office 04/09/2019 05/02/2019 05/07/2019 NaN Application by West Bank, West Des Moines, to ...
7 Iowa State Bank Algona Establish a Bank Office 03/15/2019 NaN NaN Phoenix, AZ Application by Iowa State Bank, Algona, to est...
8 Peoples Savings Bank Elma Merge With and Into 03/13/2019 04/24/2019 NaN NaN Application to merge Peoples State Bank, Elma,...
9 Two Rivers Bank & Trust Burlington Purchase and Assumption 01/25/2019 01/31/2019 05/31/2019 NaN Application by Two Rivers Bank & Trust, Burlin...
10 Westside State Bank Westside Establish a Bank Office 01/25/2019 02/06/2019 NaN Bellevue, NE Application by Westside State Bank, Westside, ...
11 Northwest Bank Spencer Relocate a Bank Office 11/29/2018 12/12/2018 NaN Ankeny Application by Northwest Bank, Spencer, to rel...
12 City State Bank Norwalk Establish a Bank Office 11/21/2018 12/12/2018 NaN Norwalk Application by City State Bank, Norwalk, to es...
13 First Security Bank and Trust Company Charles City Relocate a Bank Office 06/21/2018 06/29/2018 NaN Manly Application by First Security Bank and Trust C...
14 Lincoln Savings Bank Cedar Falls Establish a Bank Office 06/04/2018 06/25/2018 NaN Des Moines Application by Lincoln Savings Bank, Cedar Fal...
15 Raccoon Valley Bank Perry Establish a Bank Office 02/12/2018 03/02/2018 NaN Grimes Application by Raccoon Valley Bank, Perry, to ...
16 Community Savings Bank Edgewood Relocate a Bank Office 01/25/2018 01/25/2018 NaN Manchester Application by Community Savings Bank, Edgewoo...
17 Luana Savings Bank Luana Establish a Bank Office 06/05/2017 08/16/2017 NaN Norwalk Application by Luana Savings Bank, Luana, to e...
18 Fort Madison Financial Company Fort Madison Change of Ownership NaN 10/19/2017 NaN NaN Application for Linda Sue Baier, Fort Madison,...
19 Lincoln Bancorp Reinbeck Change of Ownership NaN 12/10/2018 NaN NaN Application for Lincoln Bancorp Employee Stock...
20 Emmetsburg Bank Shares, Inc. Emmetsburg Change of Ownership NaN 01/17/2019 NaN NaN Application for Charles and Maryanna Sarazine,...
21 Albrecht Financial Services, Inc. Norwalk Change of Control NaN 03/27/2019 05/10/2019 NaN Application for Dean L. Albrecht 2014 Family T...
22 Solon Financial, Inc. Solon Change of Ownership NaN 03/05/2019 NaN NaN Application for Cordelia A. Cosgrove, Bruce A....
23 How-Win Development Co. Cresco Change of Ownership NaN 03/28/2019 NaN NaN Application for John Scott Thomson, as trustee...
24 Lee Capital Corp. Fort Madison Change of Ownership NaN 04/15/2019 NaN NaN Application by Jean M. Humphrey, Kathleen A. M...
25 FNB BanShares, Inc. West Union Change of Ownership NaN 05/02/2019 NaN NaN Application for James L. Moss, individually an...
26 Old O'Brien Banc Shares, Inc. Sutherland Change of Ownership NaN 03/06/2019 NaN NaN Application for James J. Johnson and Colleen D...
27 Pella Financial Group, Inc. Pella Change of Control NaN 03/15/2019 NaN NaN Application for Pella Financial Group, Inc., P...
28 BANK Wapello Amend or Restate Articles of Incorporation NaN 05/07/2019 05/07/2019 NaN Restatement of Articles of Incorporation.
29 Security Agency, Inc. Decorah Change of Ownership NaN 11/28/2018 NaN NaN Application for the 2018 Grantor Trust FBO Rac...
30 Arendt's Inc. Montezuma Change of Ownership NaN 05/29/2019 NaN NaN Application for C. W. Bolen, Montezuma, indivi...
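And if you still want the question's "Bank: ..., City: ..." line format for the email body, the rows can be formatted from the dataframe directly; a follow-up sketch (column names taken from the printed header above; the date/location headers are ambiguous in the printout, so this sticks to the unambiguous ones):

lines = []
for _, row in tables[-1].iterrows():
    # build one line per application, mirroring the original output5 format
    lines.append("Bank: %s, City: %s, Type: %s, Comment: %s"
                 % (row["Bank"], row["City"], row["Type"], row["Comment"]))
print("\n".join(lines))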