I am scraping a table from a website and I have not had any problems getting the data, but I am having issues printing the final output. In the code I've provided, everything prints fine if I print within my 'for' statement (see the commented-out print command). However, if I print later in the code, outside of the 'for' statement, I only get the first row. I'd like to take this code and put it in a larger project where this output (among others) goes into a single email. How do I get the entire output to appear?
I've tried appending each table row to a list (I think I am doing it wrong), but it just prints the same row over and over, or individual letters from the first row.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
#print('Scraping Iowa Dept of Banking...')
url = 'https://www.idob.state.ia.us/bank/docs/applica/app_status.aspx'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

mylist = []
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    if len(tds[5].text) == 1:
        edate = "NA"
    else:
        edate = ""
    if len(tds[6].text) == 1:
        loc = "NA"
    else:
        loc = ""
    output5 = ("Bank: %s, City: %s, Type: %s, Effective Date: %s, Location: %s, Comment: %s \r\n" % (tds[0].text, tds[1].text, tds[2].text.replace(" ", ""), tds[5].text + edate, tds[6].text.replace(" ", "") + loc, tds[7].text))
    global outputs5
    outputs5 = output5
    #print(outputs5)  # The whole table prints if printed here

if outputs5 is None:
    outputs5 = "No information available"
    print(outputs5)

print(outputs5)  # only prints the first line
I would use pandas, a Python library, to extract the table and export it to CSV.
import pandas as pd
tables=pd.read_html("https://www.idob.state.ia.us/bank/docs/applica/app_status.aspx")
tables[1].to_csv('output.csv')
The CSV will look like the table on the page.
Pandas is easy to install; just type this at the command prompt:
pip install pandas
Try it like this: you need to append each output to the list and then join the list together before printing it.
The reason your print inside the loop appeared to work is that it actually printed once per table row, not just once in total.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
#print('Scraping Iowa Dept of Banking...')
url = 'https://www.idob.state.ia.us/bank/docs/applica/app_status.aspx'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

mylist = []
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    if len(tds[5].text) == 1:
        edate = "NA"
    else:
        edate = ""
    if len(tds[6].text) == 1:
        loc = "NA"
    else:
        loc = ""
    # no trailing \r\n needed; the join below adds the line breaks
    output5 = ("Bank: %s, City: %s, Type: %s, Effective Date: %s, Location: %s, Comment: %s" % (tds[0].text, tds[1].text, tds[2].text.replace(" ", ""), tds[5].text + edate, tds[6].text.replace(" ", "") + loc, tds[7].text))
    mylist.append(output5)  # append the row string, not the overwritten outputs5

if not mylist:
    print("No information available")
print('\n'.join(mylist))
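Since the end goal is a single email, it may also help to keep the joined text in a variable instead of printing it directly; a minimal sketch (email_body is my own name, not from the original code):

email_body = '\n'.join(mylist) if mylist else "No information available"
print(email_body)  # later, pass email_body to whatever code assembles the email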
Have you tried using pandas' .read_html()?
import pandas as pd
url ='https://www.idob.state.ia.us/bank/docs/applica/app_status.aspx'
tables = pd.read_html(url)
Output:
print (tables[-1].to_string())
Bank City Type Accepted Approved Effective # Location Comment
0 State Bank New Hampton Merge With and Into 05/21/2019 NaN NaN NaN Application to merge State Bank, New Hampton, ...
1 Farmers and Traders Savings Bank Douds Merge With and Into 05/20/2019 NaN NaN NaN Application to merge Farmers and Traders Savin...
2 City State Bank Norwalk Establish a Bank Office 05/15/2019 05/29/2019 NaN Mesa, AZ Application by City State Bank, Norwalk, to es...
3 Availa Bank Carroll Purchase and Assumption 04/16/2019 04/30/2019 NaN NaN Application by Availa Bank, Carroll, to purcha...
4 West Bank West Des Moines Establish a Bank Office 04/16/2019 05/02/2019 05/10/2019 Mankato, MN Application by West Bank, West Des Moines, to ...
5 West Bank West Des Moines Establish a Bank Office 04/10/2019 05/02/2019 05/07/2019 Owatonna, MN Application by West Bank, West Des Moines, to ...
6 West Bank West Des Moines Establish a Bank Office 04/09/2019 05/02/2019 05/07/2019 NaN Application by West Bank, West Des Moines, to ...
7 Iowa State Bank Algona Establish a Bank Office 03/15/2019 NaN NaN Phoenix, AZ Application by Iowa State Bank, Algona, to est...
8 Peoples Savings Bank Elma Merge With and Into 03/13/2019 04/24/2019 NaN NaN Application to merge Peoples State Bank, Elma,...
9 Two Rivers Bank & Trust Burlington Purchase and Assumption 01/25/2019 01/31/2019 05/31/2019 NaN Application by Two Rivers Bank & Trust, Burlin...
10 Westside State Bank Westside Establish a Bank Office 01/25/2019 02/06/2019 NaN Bellevue, NE Application by Westside State Bank, Westside, ...
11 Northwest Bank Spencer Relocate a Bank Office 11/29/2018 12/12/2018 NaN Ankeny Application by Northwest Bank, Spencer, to rel...
12 City State Bank Norwalk Establish a Bank Office 11/21/2018 12/12/2018 NaN Norwalk Application by City State Bank, Norwalk, to es...
13 First Security Bank and Trust Company Charles City Relocate a Bank Office 06/21/2018 06/29/2018 NaN Manly Application by First Security Bank and Trust C...
14 Lincoln Savings Bank Cedar Falls Establish a Bank Office 06/04/2018 06/25/2018 NaN Des Moines Application by Lincoln Savings Bank, Cedar Fal...
15 Raccoon Valley Bank Perry Establish a Bank Office 02/12/2018 03/02/2018 NaN Grimes Application by Raccoon Valley Bank, Perry, to ...
16 Community Savings Bank Edgewood Relocate a Bank Office 01/25/2018 01/25/2018 NaN Manchester Application by Community Savings Bank, Edgewoo...
17 Luana Savings Bank Luana Establish a Bank Office 06/05/2017 08/16/2017 NaN Norwalk Application by Luana Savings Bank, Luana, to e...
18 Fort Madison Financial Company Fort Madison Change of Ownership NaN 10/19/2017 NaN NaN Application for Linda Sue Baier, Fort Madison,...
19 Lincoln Bancorp Reinbeck Change of Ownership NaN 12/10/2018 NaN NaN Application for Lincoln Bancorp Employee Stock...
20 Emmetsburg Bank Shares, Inc. Emmetsburg Change of Ownership NaN 01/17/2019 NaN NaN Application for Charles and Maryanna Sarazine,...
21 Albrecht Financial Services, Inc. Norwalk Change of Control NaN 03/27/2019 05/10/2019 NaN Application for Dean L. Albrecht 2014 Family T...
22 Solon Financial, Inc. Solon Change of Ownership NaN 03/05/2019 NaN NaN Application for Cordelia A. Cosgrove, Bruce A....
23 How-Win Development Co. Cresco Change of Ownership NaN 03/28/2019 NaN NaN Application for John Scott Thomson, as trustee...
24 Lee Capital Corp. Fort Madison Change of Ownership NaN 04/15/2019 NaN NaN Application by Jean M. Humphrey, Kathleen A. M...
25 FNB BanShares, Inc. West Union Change of Ownership NaN 05/02/2019 NaN NaN Application for James L. Moss, individually an...
26 Old O'Brien Banc Shares, Inc. Sutherland Change of Ownership NaN 03/06/2019 NaN NaN Application for James J. Johnson and Colleen D...
27 Pella Financial Group, Inc. Pella Change of Control NaN 03/15/2019 NaN NaN Application for Pella Financial Group, Inc., P...
28 BANK Wapello Amend or Restate Articles of Incorporation NaN 05/07/2019 05/07/2019 NaN Restatement of Articles of Incorporation.
29 Security Agency, Inc. Decorah Change of Ownership NaN 11/28/2018 NaN NaN Application for the 2018 Grantor Trust FBO Rac...
30 Arendt's Inc. Montezuma Change of Ownership NaN 05/29/2019 NaN NaN Application for C. W. Bolen, Montezuma, indivi...
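If you still want the per-row "Bank: ..., City: ..." strings for your email, they can be rebuilt from the same DataFrame. A sketch, assuming the column names shown in the printed header above (read_html may parse them slightly differently, so check df.columns first):

df = tables[-1].fillna('NA')  # replace NaN so the strings read cleanly
lines = ["Bank: %s, City: %s, Type: %s, Effective Date: %s, Location: %s, Comment: %s"
         % (row['Bank'], row['City'], row['Type'], row['Effective'], row['Location'], row['Comment'])
         for _, row in df.iterrows()]
print('\n'.join(lines))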
I have a dataframe with 6 columns, the first two are an id and a name column, the remaining 4 are potential matches for the name column.
id  name                                            match1                                          match2                            match3                                          match4
1   NXP Semiconductors                              NaN                                             NaN                               NaN                                             NaN
2   Cincinnati Children's Hospital Medical Center   Montefiore Medical center                       Children's Hospital Los Angeles   Cincinnati Children's Hospital Medical Center   SSM Health SLU Hospital
3   Seminole Tribe of Florida                       The State Board of Administration of Florida    NaN                               NaN                                             NaN
4   Miami-Dade County                               County of Will                                  County of Orange                  NaN                                             NaN
5   University of California                        California Teacher's Association                Yale University                   University of Toronto                           University System of Georgia
6   Bon Appetit Management                          Waste Management                                Sculptor Capital                  NaN                                             NaN
I'd like to use SequenceMatcher to compare the name column with each match column if there is a value and return the match value with the highest ratio, or closest match, in a new column at the end of the dataframe.
So the output would be something like this:
id  name                                            match1                                          match2                            match3                                          match4                         best match
1   NXP Semiconductors                              NaN                                             NaN                               NaN                                             NaN                            NaN
2   Cincinnati Children's Hospital Medical Center   Montefiore Medical center                       Children's Hospital Los Angeles   Cincinnati Children's Hospital Medical Center   SSM Health SLU Hospital        Cincinnati Children's Hospital Medical Center
3   Seminole Tribe of Florida                       The State Board of Administration of Florida    NaN                               NaN                                             NaN                            The State Board of Administration of Florida
4   Miami-Dade County                               County of Will                                  County of Orange                  NaN                                             NaN                            County of Orange
5   University of California                        California Teacher's Association                Yale University                   University of Toronto                           University System of Georgia   California Teacher's Association
6   Bon Appetit Management                          Waste Management                                Sculptor Capital                  NaN                                             NaN                            Waste Management
I've gotten the data into the dataframe and have been able to compare one column to a single other column using the apply method:
df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x[0].strip(), x[1].strip()).ratio(), axis=1)
However, I'm not sure how to loop over multiple columns in the same row. I also thought about reformatting my data so that the method above would work, something like this:
name match
name1 match1
name1 match2
name1 match3
However, I was running into issues dealing with the NaN values. Open to suggestions on the best route to accomplish this.
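For the row-wise comparison described above, here is one possible sketch (best_match is a hypothetical helper name; it assumes the standard difflib module and the column names shown in the example):

import difflib
import pandas as pd

def best_match(row):
    # collect the non-null candidates from the four match columns
    candidates = [m for m in row[['match1', 'match2', 'match3', 'match4']] if pd.notna(m)]
    if not candidates:
        return pd.NA
    # return the candidate whose similarity ratio against 'name' is highest
    return max(candidates,
               key=lambda m: difflib.SequenceMatcher(None, row['name'].strip(), m.strip()).ratio())

df['best match'] = df.apply(best_match, axis=1)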
I ended up solving this using the second idea of reformatting the table. Using the melt function I was able to get a two column table of the name field with each possible match. From there I used the original lambda function to compare the two columns and output a ratio. From there it was relatively easy to go through and see the most likely matches, although it did require some manual effort.
import pandas as pd
import difflib as diff  # SequenceMatcher lives in the difflib module

df = pd.read_csv('output.csv')
df1 = df.melt(id_vars=['id', 'name'], var_name='match').dropna().drop(columns='match').sort_values('name')
# after melt, the columns are id, name, value
df1['diff'] = df1.apply(lambda x: diff.SequenceMatcher(None, x['name'].strip(), x['value'].strip()).ratio(), axis=1)
df1.to_csv('comparison-output.csv', encoding='utf-8')
Can anybody help, please?
I am finding it difficult to grab the value of currency units per SDR using my Python code below.
The code works with other URLs, but for this URL I always get a null result, and I don't understand what is wrong.
I'm using a Python Scrapy spider.
URL: https://www.imf.org/external/np/fin/data/rms_five.aspx
I reviewed the content on the website and found that the element values contain some leading/trailing spaces.
Querying the response with XPath, I get the same result, i.e. null.
def start_requests(self):
    urls = [
        'https://www.imf.org/external/np/fin/data/rms_five.aspx'
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

def parse(self, response):
    print(response.url)
    #for i in range(1, 24):
    yield {
        'Kurs_imf2': response.xpath('//*[@id="content"]/center/table/tbody/tr[3]/td/div/table/tbody/tr[3]/td[3]/text()').getall()
    }
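One thing worth checking (an assumption on my part, since the raw HTML isn't shown): browser dev tools insert tbody elements that often don't exist in the HTML the server actually sends, so an XPath copied from the browser that contains tbody can match nothing in Scrapy. Dropping those segments may behave differently:

# hypothetical: the same path with the browser-inserted tbody steps removed
response.xpath('//*[@id="content"]/center/table//tr[3]/td/div/table//tr[3]/td[3]/text()').getall()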
Well, looking only at your ultimate goal (and setting aside your choice of tools), here is one way of getting the data from that table as a DataFrame. Once you have the data, you can manipulate it as needed (stripping the leading/trailing spaces, etc.):
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.imf.org/external/np/fin/data/rms_five.aspx'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
table_w_data = soup.select_one('div.fancy').select_one('table')
df = pd.read_html(str(table_w_data))[0]
print(df)
This will display in terminal:
                              SDRs per Currency unit 2
                          September 05, 2022  September 02, 2022  September 01, 2022  August 31, 2022  August 30, 2022
0   Chinese yuan                         NaN            0.111437            0.111266         0.111474         0.110906
1   Euro                                 NaN            0.768586            0.768411         0.768436         0.769353
2   Japanese yen                         NaN          0.00549178          0.00550612       0.00554387       0.00553607
3   U.K. pound                           NaN            0.889646            0.887851         0.892615         0.898473
4   U.S. dollar                          NaN            0.769124            0.768104         0.768436         0.766746
5   Algerian dinar                       NaN           0.0054812          0.00547939       0.00547416       0.00547027
6   Australian dollar                    NaN            0.522235            0.524922         0.530375         0.529438
7   Botswana pula                        NaN           0.0593764           0.0595281        0.0600917        0.0600362
8   Brazilian real                       NaN            0.148273            0.147709         0.148393         0.151498
9   Brunei dollar                        NaN            0.548669            0.548215          0.55085          0.54893
10  Canadian dollar                      NaN            0.586178              0.5834           0.5861         0.586377
11  Chilean peso                         NaN         0.000856257         0.000855312      0.000871134      0.000860642
12  Czech koruna                         NaN           0.0313774           0.0313768        0.0313289        0.0312983
13  Danish krone                         NaN            0.103346             0.10332         0.103325         0.103441
14  Indian rupee                         NaN           0.0096395          0.00967422              NaN       0.00961806
15  Israeli New Shekel                   NaN            0.227889            0.228331         0.230002         0.231996
16  Korean won                           NaN         0.000569131         0.000571039      0.000570268      0.000568929
17  Kuwaiti dinar                        NaN                 NaN             2.49425          2.49654          2.49105
18  Malaysian ringgit                    NaN            0.171526            0.171356              NaN         0.170977
19  Mauritian rupee                      NaN           0.0172087                 NaN        0.0171443        0.0171741
20  Mexican peso                         NaN           0.0385038           0.0379361        0.0382379        0.0380585
21  New Zealand dollar                   NaN            0.466781            0.468236         0.471167          0.47105
22  Norwegian krone                      NaN           0.0768317           0.0767413        0.0773168        0.0788655
23  Omani rial                           NaN                 NaN             1.99767          1.99853          1.99414
24  Peruvian sol                         NaN            0.198946            0.199042         0.200166              NaN
25  Philippine peso                      NaN                 NaN           0.0136744        0.0136628        0.0136826
26  Polish zloty                         NaN            0.162688            0.163569         0.162254         0.162412
27  Qatari riyal                         NaN                 NaN            0.211018         0.211109         0.210645
28  Russian ruble                        NaN           0.0127399           0.0127514        0.0127565        0.0127013
29  Saudi Arabian riyal                  NaN                 NaN            0.204828         0.204916         0.204466
30  Singapore dollar                     NaN            0.548669            0.548215          0.55085          0.54893
31  South African rand                   NaN           0.0444661           0.0448438        0.0450817        0.0455618
32  Swedish krona                        NaN           0.0715711           0.0718351        0.0720278        0.0718782
33  Swiss franc                          NaN            0.782903             0.78458         0.784038         0.789442
34  Thai baht                            NaN           0.0208978            0.020927        0.0210548        0.0210465
35  Trinidadian dollar                   NaN            0.114475            0.114048              NaN         0.114091
36  U.A.E. dirham                        NaN                 NaN             0.20915         0.209241          0.20878
37  Uruguayan peso                       NaN           0.0188128           0.0188288        0.0187606        0.0188043
Relevant documentation:
Requests: https://requests.readthedocs.io/en/latest/
Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
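To actually strip the stray spaces and pull a single value out of that DataFrame, a minimal sketch (the column flattening assumes the two-level header shown above, which is how read_html happened to parse this page):

# flatten the two-level header ('SDRs per Currency unit 2', date) down to the date part
df.columns = [col[1] for col in df.columns]
df = df.rename(columns={'Unnamed: 0_level_1': 'Currency'})
df['Currency'] = df['Currency'].astype(str).str.strip()  # remove the leading/trailing spaces
print(df.loc[df['Currency'] == 'U.S. dollar'])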
I have data with current names of companies, old names, and the date of name changes. It looks like this:
name                               former_name1                   name_change_date1
ACMAT CORP                         nan                            NaT
ACME ELECTRIC CORP                 nan                            NaT
ACME UNITED CORP                   nan                            NaT
COLUMBIA ACORN TRUST               LIBERTY ACORN TRUST            2003-10-20
MULTIGRAPHICS INC                  AM INTERNATIONAL INC           1997-03-17
MILLER LLOYD I III                 nan                            NaT
AFFILIATED COMPUTER SERVICES INC   nan                            NaT
ADAMS RESOURCES & ENERGY, INC.     ADAMS RESOURCES & ENERGY INC   2005-04-01
BK Technologies Corp               BK Technologies, Inc.          2019-03-28
I want to figure out what the name of each company was at a particular date. Let's say I want to figure out the name of a company as of January 1st 2002. Then I could create a new column called say, edited_name, which would contain the current name of the company unless the company has changed names since 1/1/2002, in which case it would contain the historical name (i.e. former_name1) of the company. So the output should look something like this:
name                               former_name1                   name_change_date1   edited_name
ACMAT CORP                         nan                            NaT                 ACMAT CORP
ACME ELECTRIC CORP                 nan                            NaT                 ACME ELECTRIC CORP
ACME UNITED CORP                   nan                            NaT                 ACME UNITED CORP
COLUMBIA ACORN TRUST               LIBERTY ACORN TRUST            2003-10-20          LIBERTY ACORN TRUST
MULTIGRAPHICS INC                  AM INTERNATIONAL INC           1997-03-17          MULTIGRAPHICS INC
MILLER LLOYD I III                 nan                            NaT                 MILLER LLOYD I III
AFFILIATED COMPUTER SERVICES INC   nan                            NaT                 AFFILIATED COMPUTER SERVICES INC
ADAMS RESOURCES & ENERGY, INC.     ADAMS RESOURCES & ENERGY INC   2005-04-01          ADAMS RESOURCES & ENERGY INC
BK Technologies Corp               BK Technologies, Inc.          2019-03-28          BK Technologies, Inc.
In Stata (with which I am much more familiar) this could be easily accomplished with:
gen edited_name = name
replace edited_name = former_name1 if name_change_date1 > date("2002-01-01", "YMD") & name_change_date1 != .
Unfortunately I am at a loss of how to accomplish this in Python/Pandas.
Data:
{'name': ['ACMAT CORP', 'ACME ELECTRIC CORP', 'ACME UNITED CORP', 'COLUMBIA ACORN TRUST',
'MULTIGRAPHICS INC', 'MILLER LLOYD I III', 'AFFILIATED COMPUTER SERVICES INC',
'ADAMS RESOURCES & ENERGY, INC.', 'BK Technologies Corp'],
'former_name1': [nan, nan, nan, 'LIBERTY ACORN TRUST', 'AM INTERNATIONAL INC', nan, nan,
'ADAMS RESOURCES & ENERGY INC', 'BK Technologies, Inc.'],
'name_change_date1': [NaT, NaT, NaT, '2003-10-20', '1997-03-17', NaT, NaT,
'2005-04-01', '2019-03-28']}
You could use numpy.where to select values depending on whether a name change occurred:
import numpy as np
df['edited_name'] = np.where(df['name_change_date1'].notna() &
df['name_change_date1'].gt(pd.to_datetime('1/1/2002')),
df['former_name1'], df['name'])
or with mask:
df['edited_name'] = df['name'].mask(df['name_change_date1'].notna() &
df['name_change_date1'].gt(pd.to_datetime('1/1/2002')),
df['former_name1'])
Output:
name former_name1 \
0 ACMAT CORP NaN
1 ACME ELECTRIC CORP NaN
2 ACME UNITED CORP NaN
3 COLUMBIA ACORN TRUST LIBERTY ACORN TRUST
4 MULTIGRAPHICS INC AM INTERNATIONAL INC
5 MILLER LLOYD I III NaN
6 AFFILIATED COMPUTER SERVICES INC NaN
7 ADAMS RESOURCES & ENERGY, INC. ADAMS RESOURCES & ENERGY INC
8 BK Technologies Corp BK Technologies, Inc.
name_change_date1 edited_name
0 NaT ACMAT CORP
1 NaT ACME ELECTRIC CORP
2 NaT ACME UNITED CORP
3 2003-10-20 LIBERTY ACORN TRUST
4 1997-03-17 MULTIGRAPHICS INC
5 NaT MILLER LLOYD I III
6 NaT AFFILIATED COMPUTER SERVICES INC
7 2005-04-01 ADAMS RESOURCES & ENERGY INC
8 2019-03-28 BK Technologies, Inc.
Use:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'fname': [np.nan, 'h', 's', np.nan], 'dc': [np.nan, '2003-10-20', '1997-03-17', np.nan]})
df['dc'] = pd.to_datetime(df['dc'])
df['nname'] = df['fname'][df['dc'] > '1/1/2002']
res = df['name'][df['nname'].isna()]
temp = df['fname'][df['nname'].notna()]
res = pd.concat([res, temp])  # Series.append was removed in pandas 2.0
df['res'] = res
Output:
  name fname         dc nname res
0    a   NaN        NaT   NaN   a
1    b     h 2003-10-20     h   h
2    c     s 1997-03-17   NaN   c
3    d   NaN        NaT   NaN   d
Write code to get a list of tickers for all S&P 500 stocks from Wikipedia. As of 2/24/2021, there are 505 tickers in that list. You can use any method you want as long as the code actually queries the following website to get the list:
https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
One way would be to use the requests module to get the HTML code and then use the re module to extract the tickers. Another option would be the .read_html function in pandas and then export the tickers column to a list.
You should save the tickers in a list with the name sp500_tickers
This will grab the data in the table named 'constituents'.
# find a specific table by table count
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
print(df[0].to_json(orient='records'))
Result:
[{"Symbol":"MMM","Security":"3M Company","SEC filings":"reports","GICS Sector":"Industrials","GICS Sub-Industry":"Industrial Conglomerates","Headquarters Location":"St. Paul, Minnesota","Date first added":"1976-08-09","CIK":66740,"Founded":"1902"},{"Symbol":"ABT","Security":"Abbott Laboratories","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Health Care Equipment","Headquarters Location":"North Chicago, Illinois","Date first added":"1964-03-31","CIK":1800,"Founded":"1888"},{"Symbol":"ABBV","Security":"AbbVie Inc.","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Pharmaceuticals","Headquarters Location":"North Chicago, Illinois","Date first added":"2012-12-31","CIK":1551152,"Founded":"2013 (1888)"},{"Symbol":"ABMD","Security":"Abiomed","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Health Care Equipment","Headquarters Location":"Danvers, Massachusetts","Date first added":"2018-05-31","CIK":815094,"Founded":"1981"},{"Symbol":"ACN","Security":"Accenture","SEC filings":"reports","GICS Sector":"Information Technology","GICS Sub-Industry":"IT Consulting & Other Services","Headquarters Location":"Dublin, Ireland","Date first added":"2011-07-06","CIK":1467373,"Founded":"1989"},{"Symbol":"ATVI","Security":"Activision Blizzard","SEC filings":"reports","GICS Sector":"Communication Services","GICS Sub-Industry":"Interactive Home Entertainment","Headquarters Location":"Santa Monica, California","Date first added":"2015-08-31","CIK":718877,"Founded":"2008"},{"Symbol":"ADBE","Security":"Adobe Inc.","SEC filings":"reports","GICS Sector":"Information Technology","GICS Sub-Industry":"Application Software","Headquarters Location":"San Jose, California","Date first added":"1997-05-05","CIK":796343,"Founded":"1982"},
Etc., etc., etc.
That's JSON. If you want a table, kind of like what you would use in Excel, simply print the df.
Result:
[ Symbol Security SEC filings GICS Sector \
0 MMM 3M Company reports Industrials
1 ABT Abbott Laboratories reports Health Care
2 ABBV AbbVie Inc. reports Health Care
3 ABMD Abiomed reports Health Care
4 ACN Accenture reports Information Technology
.. ... ... ... ...
500 YUM Yum! Brands Inc reports Consumer Discretionary
501 ZBRA Zebra Technologies reports Information Technology
502 ZBH Zimmer Biomet reports Health Care
503 ZION Zions Bancorp reports Financials
504 ZTS Zoetis reports Health Care
GICS Sub-Industry Headquarters Location \
0 Industrial Conglomerates St. Paul, Minnesota
1 Health Care Equipment North Chicago, Illinois
2 Pharmaceuticals North Chicago, Illinois
3 Health Care Equipment Danvers, Massachusetts
4 IT Consulting & Other Services Dublin, Ireland
.. ... ...
500 Restaurants Louisville, Kentucky
501 Electronic Equipment & Instruments Lincolnshire, Illinois
502 Health Care Equipment Warsaw, Indiana
503 Regional Banks Salt Lake City, Utah
504 Pharmaceuticals Parsippany, New Jersey
Date first added CIK Founded
0 1976-08-09 66740 1902
1 1964-03-31 1800 1888
2 2012-12-31 1551152 2013 (1888)
3 2018-05-31 815094 1981
4 2011-07-06 1467373 1989
.. ... ... ...
500 1997-10-06 1041061 1997
501 2019-12-23 877212 1969
502 2001-08-07 1136869 1927
503 2001-06-22 109380 1873
504 2013-06-21 1555280 1952
[505 rows x 9 columns]]
Alternatively, you can export the df to a CSV file.
df[0].to_csv('constituents.csv')  # df is a list of DataFrames, so index it first
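Since the assignment asks for a list named sp500_tickers, the Symbol column of the first table can be exported directly; a minimal sketch using the same read_html approach:

import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
sp500_tickers = tables[0]['Symbol'].tolist()  # 'Symbol' is the ticker column shown above
print(len(sp500_tickers))  # 505 as of the snapshot shown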
I'm learning how to scrape using BeautifulSoup with Selenium, and I found a website that has multiple tables (my first time dealing with table tags). I'm trying to scrape the text from each table and append each element to its respective list. First I'm trying to scrape the first table, and I want to do the rest on my own, but I cannot access the tag for some reason.
I also incorporated Selenium to access the site, because when I copy the link into another tab, the list of tables disappears for some reason.
My code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.support.ui import Select
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []
try:
    page = requests.get(targetSite)
    soup = BeautifulSoup(page.text, 'html.parser')
    items = soup.find_all('table', {"class": "popdetail"})
    for i in items:
        event_title.append(item.find('b', {'class': "text"})).text.strip()
        name.append(item.find('td', {'class': "text"})).text.strip()
        address.append(item.find('td', {'class': "text"})).text.strip()
        city.append(item.find('td', {'class': "text"})).text.strip()
        state.append(item.find('td', {'class': "text"})).text.strip()
        zipCode.append(item.find('td', {'class': "text"})).text.strip()
Can someone let me know if I am doing this correctly? This is my first time dealing with a site whose elements disappear when the URL is copied into a new tab and/or window.
So far, I am unable to append any information to any of the lists.
One issue is with the for loop.
You have for i in items:, but then you are calling item instead of i. There is also a misplaced parenthesis: event_title.append(item.find(...)).text.strip() calls .text on the return value of append(), which is None, so the .text.strip() needs to move inside the append() call.
And secondly, if you are using selenium to render the page, then you should probably use selenium to get the html. They also have some embedded tables within tables, so it's not as straight forward as iterating through the <table> tags. What I ended up doing was having pandas read in the tables (returns a list of dataframes), then iterating through those as there is a pattern of how the dataframes are constructed.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []
dfs = pd.read_html(driver.page_source)
driver.close()  # close() is a method; the original was missing the parentheses

for idx, table in enumerate(dfs):
    if table.iloc[0, 0] == 'Event Title':
        event_title.append(table.iloc[-1, 0])
        tempA = dfs[idx + 1]
        tempA.index = tempA[0]
        tempB = dfs[idx + 4]
        tempB.index = tempB[0]
        tempC = dfs[idx + 5]
        tempC.index = tempC[0]
        name.append(tempA.loc['Name', 1])
        address.append(tempA.loc['Address', 1])
        city.append(tempA.loc['City', 1])
        state.append(tempA.loc['State', 1])
        zipCode.append(tempA.loc['Zip', 1])
        location.append(tempA.loc['Location', 1])
        webSite.append(tempA.loc['Web Site', 1])
        fee.append(tempB.loc['Fee', 1])
        event_dates.append(tempB.loc['Dates', 1])
        opening_dates.append(tempB.loc['Opening Days', 1])
        description.append(tempC.loc['Event Description', 1])
df = pd.DataFrame({'event_title':event_title,
'name':name,
'address':address,
'city':city,
'state':state,
'zipCode':zipCode,
'location':location,
'webSite':webSite,
'fee':fee,
'event_dates':event_dates,
'opening_dates':opening_dates,
'description':description})
Output:
print (df.to_string())
event_title name address city state zipCode location webSite fee event_dates opening_dates description
0 The San Diego Museum of Art Welcomes a Special... San Diego Museum of Art 1450 El Prado, Balboa Park San Diego CA 92101 Central San Diego https://www.sdmart.org/ NaN Starts On 6-18-2020 Ends On 1-10-2021 Opens virtually on June 18. The work will beco... The San Diego Museum of Art is launching its f...
1 New Exhibit: Miller Dairy Remembered Lemon Grove Historical Society 3185 Olive Street, Treganza Heritage Park Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Children 12 and under free and must be accompa... Starts On 6-27-2020 Ends On 12-4-2020 Exhibit on view Saturdays 11 am to 2 pm; close... From 1926 there were cows smack in the midst o...
2 Gizmos and Shivelight Distinction Gallery 317 E. Grand Ave Escondido CA 92025 North County Inland http://www.distinctionart.com NaN Starts On 7-14-2020 Ends On 9-5-2020 08/08/20 - 09/05/20 Distinction Gallery is proud to present our so...
3 Virtual Opening - July Exhibitions Vision Art Museum 2825 Dewey Rd. Suite 100 San Diego CA 92106 Central San Diego http://www.visionsartmuseum.org Free Starts On 7-18-2020 Ends On 10-4-2020 NaN Join Visions Art Museum for a virtual exhibiti...
4 Laying it Bare: The Art of Walter Redondo and ... Fresh Paint Gallery 1020-B Prospect Street La Jolla CA 92037 Central San Diego http://freshpaintgallery.com/ NaN Starts On 8-1-2020 Ends On 9-27-2020 Tuesday through Sunday. Mondays closed. A two-person exhibit of new abstract expressio...
5 Online oil painting lessons with Concetta Antico NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 8-10-2020 Ends On 8-31-2020 NaN Anyone can learn to paint like the masters! Ov...
6 MOMENTUM: A Creative Industry Symposium Vanguard Culture Via Zoom San Diego California 92101 Virtual https://www.eventbrite.com/e/momentum-a-creati... $10 suggested donation Starts On 8-17-2020 Ends On 9-7-2020 NaN MOMENTUM: A Creative Industry Symposium Monday...
7 Virtual Locals Invitational Show Art & Frames of Coronado 936 ORANGE AVE Coronado CA 92118 0 https://www.artsteps.com/view/5eed0ad62cd0d65b... free Starts On 8-21-2020 Ends On 8-1-2021 NaN Art and Frames of Coronado invites you to our ...
8 HERE & Now R.B. Stevenson Gallery 7661 Girard Avenue, Suite 101 La Jolla California 92037 Central San Diego http://www.rbstevensongallery.com Free Starts On 8-22-2020 Ends On 9-25-2020 Tuesday through Saturday R.B.Stevenson Gallery is pleased to announce t...
9 Art Unites Learning: Normal 2.0 Art Unites NaN San Diego NaN 92116 Central San Diego https://www.facebook.com/events/956878098104971 Free Starts On 8-25-2020 Ends On 8-25-2020 NaN Please join us on Tuesday, August 25th as we: ...
10 Image Quest Sojourn; Visual Journaling for Per... Pamela Underwood Studios Virtual NaN NaN NaN Virtual http://www.pamelaunderwood.com/event/new-onlin... $595.00 Starts On 8-26-2020 Ends On 11-11-2020 NaN Create a personal Image Quest resource journal...
11 Behind The Exhibition: Southern California Con... Oceanside Museum of Art 704 Pier View Way Oceanside California 92054 Virtual https://oma-online.org/events/behind-the-exhib... No fee required. Donations recommended. Starts On 8-27-2020 Ends On 8-27-2020 NaN Join curator Beth Smith and exhibitions manage...
12 Lay it on Thick, a Virtual Art Exhibition San Diego Watercolor Society 2825 Dewey Rd Bldg #202 San Diego California 92106 0 https://www.sdws.org NaN Starts On 8-30-2020 Ends On 9-26-2020 NaN The San Diego Watercolor Society proudly prese...
13 The Forum: Marketing & Branding for Creatives Vanguard Culture Via Zoom San Diego CA 92101 South San Diego http://vanguardculture.com/ $5 suggested donation Starts On 9-1-2020 Ends On 9-1-2020 NaN Attention creative industry professionals! Joi...
14 Write or Die Solo Exhibition You Belong Here 3619 EL CAJON BLVD San Diego CA 92104 Central San Diego http://www.youbelongsd.com/upcoming-events/wri... $10 donation to benefit You Belong Here Starts On 9-4-2020 Ends On 9-6-2020 NaN Write or Die is an immersive installation and ...
15 SDVAN presents Art San Diego at Bread and Salt San Diego Visual Arts Network 1955 Julian Avenue San Digo CA 92113 Central San Diego http://www.sdvisualarts.net and https://www.br... Free Starts On 9-5-2020 Ends On 10-24-2020 NaN We are pleased to announce the four artist rec...
16 The Coming of Treganza Heritage Park Lemon Grove Historical Society 3185 Olive Street Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Free for all ages Starts On 9-10-2020 Ends On 9-10-2020 The park is open daily, 8 am to 8 pm. Covid 19... Lemon Grove\'s central city park will be renam...
17 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 9-14-2020 Ends On 10-5-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
18 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 10-12-2020 Ends On 11-2-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
19 36th Annual Mission Fed ArtWalk Mission Fed ArtWalk Ash Street San Diego California 92101 Central San Diego www.missionfedartwalk.org Free Starts On 11-7-2020 Ends On 11-8-2020 Sat and Sun Nov 7 and 8 Mission Fed ArtWalk returns to San Diego’s Lit...
20 Mingei Pop Up Workshop: My Daruma Doll New Childrens Museum 200 West Island Avenue San Diego California 92101 Central San Diego http://thinkplaycreate.org/ Free with admission Starts On 11-13-2020 Ends On 11-13-2020 NaN Join Mingei International Museum at The New Ch...