BeautifulSoup: joining two tables (rows) to generate a CSV file - Python

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

html = urlopen(
    "https://www.accessdata.fda.gov/scripts/drugshortages/default.cfm")
bsObj = BeautifulSoup(html, "lxml")
table = bsObj.find('table', id="cont")
rows = table.findAll("tr")

links = [a['href'] for a in table.find_all('a', href=True) if a.text]
new_links = []
for link in links:
    new_links.append(("https://www.accessdata.fda.gov/scripts/drugshortages/" + link).replace(" ", "%20"))

href_rows = []
for link in new_links:
    link = link.replace("®", "%C2%AE")
    html = urlopen(link)
    bsObj_href = BeautifulSoup(html, "lxml")
    #bsObj_href = BeautifulSoup(html.decode('utf-8', 'ignore'))
    div_href = bsObj_href.find("div", {"id": "accordion"})
    href_rows.append(div_href.findAll("tr"))

csvFile = open("drug_shortage.csv", 'wt', newline='')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()
Hello, so I created the two sets of rows like that. If you go to this website https://www.accessdata.fda.gov/scripts/drugshortages/default.cfm, it has a drug name column and a status column, and when you click the drug name you find four more columns. I would like to combine them (joined on drug name), so the columns would be: drug name, status, Presentation, Availability and Estimated Shortage Duration, Related Information, Shortage Reason (per FDASIA).
But the current code only generates the first two (drug name, status). I tried
for row in rows, rows_href:
but then I get AttributeError: ResultSet object has no attribute 'findAll'. I get the same error for
for row in rows_href:
Any suggestion on how to generate what I want?

Your code is too chaotic.
You get all rows, then all links, and then you try to get all the other information, but this way you can't control which values get joined into a row. The biggest problem comes when some row has no data on its subpage: all the data after it shifts up by one row.
You should get all rows from the table on the main page and then use a for-loop to work with every row separately, getting the other elements only for this single row - read the link only for this row, get the data from the subpage only for this row, etc. - and put all the data for this row on a list as a sublist [name, status, link, presentation, availability, related, reason]. Only after that do you move on and work with the data for the next row.
BTW: because a subpage may have many rows, I create many rows in data with the same name and status but with different other values:
[name, status, values from first row on subpage]
[name, status, values from second row on subpage]
[name, status, values from third row on subpage]
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

html = urlopen("https://www.accessdata.fda.gov/scripts/drugshortages/default.cfm")
bsObj = BeautifulSoup(html, "lxml")

# list for all rows with all values
data = []

# get table on main page
table = bsObj.find('table', {'id': 'cont'})

# work with every row separately
for row in table.find_all("tr")[1:]:  # use `[1:]` to skip header

    # get columns only in this row
    cols = row.find_all('td')

    # get name and url from first column
    link = cols[0].find('a', href=True)
    name = link.text.strip()
    url = link['href']
    url = "https://www.accessdata.fda.gov/scripts/drugshortages/" + url
    url = url.replace(" ", "%20").replace("®", "%C2%AE")
    print('name:', name)
    print('url:', url)

    # get status from second column
    status = cols[1].text.strip()
    print('status:', status)

    # subpage
    html = urlopen(url)
    bsObj_href = BeautifulSoup(html, "lxml")
    subtable = bsObj_href.find("table")

    if not subtable:
        data.append([name, status, url, '', '', '', ''])  # append the url string, not the `<a>` Tag
        print('---')
    else:
        for subrows in subtable.find_all('tr')[1:]:  # use `[1:]` to skip header
            #print(subrows)
            subcols = subrows.find_all('td')
            presentation = subcols[0].text.strip()
            availability = subcols[1].text.strip()
            related = subcols[2].text.strip()
            reason = subcols[3].text.strip()
            data.append([name, status, url, presentation, availability, related, reason])
            print(presentation, availability, related, reason)
            print('---')

    print('----------')

with open("drug_shortage.csv", 'wt', newline='') as csvfile:
    writer = csv.writer(csvfile)  # use the `with ... as csvfile` variable, not `csvFile`
    # write header - one row - using `writerow` without `s` at the end
    #writer.writerow(['Name', 'Status', 'Link', 'Presentation', 'Availability', 'Related', 'Reason'])
    # write data - many rows - using `writerows` with `s` at the end
    writer.writerows(data)
    # no need to close the file because `with` handles it

Related

Beautiful Soup to Scrape Data from Static Webpages

I am trying to get values from a table on multiple static webpages. It is the verb conjugation data for Korean verbs here: https://koreanverb.app/
My Python script uses Beautiful Soup. The goal is to grab all conjugations from multiple URL inputs and output the data to a CSV file.
Conjugations are stored on the page in a table with class "table-responsive", in table rows with class "conjugation-row". There are multiple "conjugation-row" table rows on each page. My script is somehow only grabbing the first table row with class "conjugation-row".
Why isn't the for loop grabbing all the tr elements with class "conjugation-row"? I would appreciate a solution that grabs all tr with class "conjugation-row". I tried using job_elements = results.find("tr", class_="conjugation-row"), but I get the following error:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Furthermore, when I do get the data and output it to a CSV file, the data is in separate rows as expected but leaves empty spaces: it places the data rows for the second URL at the index after all the data rows for the first URL. See example output here:
See code here:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

# create csv file
outfile = open("scrape.csv", "w", newline='')
writer = csv.writer(outfile)

## define first URL to grab conjugation names
url1 = 'https://koreanverb.app/?search=%ED%95%98%EB%8B%A4'

# define dataframe columns
df = pd.DataFrame(columns=['conjugation name'])

# get URL content
response = requests.get(url1)
soup = BeautifulSoup(response.content, 'html.parser')

# get table with all verb conjugations
results = soup.find("div", class_="table-responsive")

##### GET CONJUGATIONS AND APPEND TO CSV

# define URLs
urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']

# loop to get data
for url in urls:
    response = requests.get(url)
    soup2 = BeautifulSoup(response.content, 'html.parser')
    # get table with all verb conjugations
    results2 = soup2.find("div", class_="table-responsive")
    # get dictionary form of verb/adjective
    verb_results = soup2.find('dl', class_='dl-horizontal')
    verb_title = verb_results.find('dd')
    verb_title_text = verb_title.text
    job_elements = results2.find_all("tr", class_="conjugation-row")
    for job_element in job_elements:
        conjugation_name = job_element.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        conjugation_name_text = conjugation_name.text
        conjugation_korean_text = conjugation_korean.text
        data_column = pd.DataFrame({'conjugation name': [conjugation_name_text],
                                    verb_title_text: [conjugation_korean_text],
                                    })
        #data_column = pd.DataFrame({verb_title_text: [conjugation_korean_text]})
        df = df.append(data_column, ignore_index=True)

# save to csv
df.to_csv('scrape.csv')
outfile.close()
print('Verb Conjugations Collected and Appended to CSV, one per column')
Get all the job_elements using find_all(), since find() only returns the first occurrence, and iterate over them in a for loop like below.
job_elements = results.find_all("tr", class_="conjugation-row")
for job_element in job_elements:
    conjugation_name = job_element.find("td", class_="conjugation-name")
    conjugation_korean = conjugation_name.find_next_sibling("td")
    conjugation_name_text = conjugation_name.text
    conjugation_korean_text = conjugation_korean.text
    # append element to data
    df2 = pd.DataFrame([[conjugation_name_text, conjugation_korean_text]],
                       columns=['conjugation_name', 'conjugation_korean'])
    df = df.append(df2)
The error comes from trying to use find() on a ResultSet, which is a list of elements rather than a single element.
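To see the difference, here is a minimal standalone snippet (the two-row fragment is made up purely for illustration):

from bs4 import BeautifulSoup

# hypothetical two-row fragment, just to show find() vs find_all()
html = ("<tr class='conjugation-row'><td>first</td></tr>"
        "<tr class='conjugation-row'><td>second</td></tr>")
soup = BeautifulSoup(html, "html.parser")

row = soup.find("tr", class_="conjugation-row")       # a single Tag
rows = soup.find_all("tr", class_="conjugation-row")  # a ResultSet (a list of Tags)

print(row.find("td").text)   # works on a Tag -> 'first'
# rows.find("td")            # AttributeError: ResultSet object has no attribute 'find'
for r in rows:               # iterate the ResultSet instead
    print(r.find("td").text)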
As your script is growing big, I made some modifications, like using a get_conjugations() function and some proper names that are easy to understand. First, conjugation_names and conjugation_korean_names are added as pandas DataFrame columns, and then the other columns (korean0, korean1, ...) are added subsequently.
import requests
from bs4 import BeautifulSoup
import pandas as pd

# function to parse the html data & get conjugations
def get_conjugations(url):
    # set return lists
    conjugation_names = []
    conjugation_korean_names = []
    # get html text
    html = requests.get(url).text
    # parse the html text
    soup = BeautifulSoup(html, 'html.parser')
    # get table
    table = soup.find("div", class_="table-responsive")
    table_rows = table.find_all("tr", class_="conjugation-row")
    for row in table_rows:
        conjugation_name = row.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        conjugation_names.append(conjugation_name.text)
        conjugation_korean_names.append(conjugation_korean.text)
    # return both lists
    return conjugation_names, conjugation_korean_names

# create csv file
outfile = open("scrape.csv", "w", newline='')

urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']

# define dataframe columns
df = pd.DataFrame(columns=['conjugation_name', 'conjugation_korean', 'korean0', 'korean1'])

conjugation_names, conjugation_korean_names = get_conjugations(urls[0])
df['conjugation_name'] = conjugation_names
df['conjugation_korean'] = conjugation_korean_names

for index, url in enumerate(urls[1:]):
    conjugation_names, conjugation_korean_names = get_conjugations(url)
    # set column name
    column_name = 'korean' + str(index)
    df[column_name] = conjugation_korean_names

# save to csv
df.to_csv('scrape.csv')
outfile.close()

# print DONE
print('Export to CSV Complete')
Output:
,conjugation_name,conjugation_korean,korean0,korean1
0,declarative present informal low,해,먹어,마셔
1,declarative present informal high,해요,먹어요,마셔요
2,declarative present formal low,한다,먹는다,마신다
3,declarative present formal high,합니다,먹습니다,마십니다
...
Note: this assumes that the conjugation rows appear in the same order on every URL.
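If you can't rely on that ordering, a safer sketch (reusing the get_conjugations() function and urls list from above, and assuming conjugation names are unique on each page) is to merge on the conjugation name instead of relying on row position:

import pandas as pd

frames = []
for index, url in enumerate(urls):
    conjugation_names, conjugation_korean_names = get_conjugations(url)
    frames.append(pd.DataFrame({'conjugation_name': conjugation_names,
                                'korean' + str(index): conjugation_korean_names}))

# outer-join every frame on the conjugation name, so per-URL row order no longer matters
df = frames[0]
for frame in frames[1:]:
    df = df.merge(frame, on='conjugation_name', how='outer')
df.to_csv('scrape.csv')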

Python Web Scraping: Output to csv

I'm making some progress with web scraping, however I still need some help to perform some operations:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
# soup = BeautifulSoup(requests.get(converturl).content, 'html.parser')
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
out = []
for tr in soup.select('.col-md-4 tbody tr'):
Within the class col-md-4 I know there are 3 tables. I want to generate a csv whose output has three values per row: first name, last name, and, as the last value, the header name of the table:
first name, last name, table header
Any help would be appreciated.
This is what I have done on my own:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
filename = url.rsplit('/', 1)[1] + '.csv'
tables = soup.select('.col-md-4 table')
rows = []
for tr in tables:
    t = tr.get_text(strip=True, separator='|').split('|')
    rows.append(t)

df = pd.DataFrame(rows)
print(df)
df.to_csv(filename)
Thanks,
This might work:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
tables = soup.select('.col-md-4 table')
rows = []
for table in tables:
    cleaned = list(table.stripped_strings)
    header, names = cleaned[0], cleaned[1:]
    data = [name.split(', ') + [header] for name in names]
    rows.extend(data)

result = pd.DataFrame.from_records(rows, columns=['surname', 'name', 'table'])
You need to first iterate through each table you want to scrape, then for each table, get its header and rows of data. For each row of data, you want to parse out the First Name and Last Name (along with the header of the table).
Here's a verbose working example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out = []

# Iterate through each of the three tables
for table in soup.select(".col-md-4 table"):
    # Grab the header and rows from the table
    header = table.select("thead th")[0].text.strip()
    rows = [s.text.strip() for s in table.select("tbody tr")]

    t = []  # This list will contain the rows of data for this table

    # Iterate through rows in this table
    for row in rows:
        # Split by comma (last_name, first_name)
        split = row.split(",")
        last_name = split[0].strip()
        first_name = split[1].strip()

        # Create the row of data
        t.append([first_name, last_name, header])

    # Convert list of rows to a DataFrame
    df = pd.DataFrame(t, columns=["first_name", "last_name", "table_name"])

    # Append to list of DataFrames
    out.append(df)

# Write to CSVs...
out[0].to_csv("first_table.csv", index=None)  # etc...
Whenever you're web scraping, I highly recommend using strip() on all of the text you parse to make sure you don't have superfluous spaces in your data.
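For instance (a made-up cell, just to illustrate), BeautifulSoup can also do the stripping for you via get_text(strip=True):

from bs4 import BeautifulSoup

cell = BeautifulSoup("<td>  Garcia, Marc \n</td>", "html.parser").td
print(repr(cell.text))                  # '  Garcia, Marc \n'
print(repr(cell.get_text(strip=True)))  # 'Garcia, Marc'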
I hope this helps!

Add title in the first column of scraped table

I am currently working on a school project where I am scraping results from a cycling website. I managed to build the scraper to loop through all the urls containing the results. I would like to add the event title to the first column of every table but am facing some difficulties.
Here is my code:
# list of needed packages
import requests
from bs4 import BeautifulSoup
import time
import csv

# create list of urls to scrape
urls = ['https://cqranking.com/men/asp/gen/race.asp?raceid=36151', 'https://cqranking.com/men/asp/gen/race.asp?raceid=36151']

# Generates a csv-file named cycling_results.csv, with wanted headers
with open('cycling_results.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';')
    writer.writerow(['Start', 'Rank', '', '', '', 'Name', '', 'Team', '', 'Time', '', 'Points'])

    # loop through all urls in the array
    for url in urls:
        time.sleep(2)
        response = requests.get(url)
        data = response.content
        soup = BeautifulSoup(data, 'html.parser')

        # Find the title of the racing event
        titles = soup.find('title')
        for title in titles:
            writer.writerow(title)

        tables = soup.find_all('table')
        for table in tables:
            rows = table.find_all('tr')
            for row in rows:
                csv_row = []
                columns = row.find_all('td')
                for column in columns:
                    csv_row.append(column.get_text())
                writer.writerow(csv_row)
In the next phase I will add code to remove empty rows.
Thank you
Regards
Kevin
This code should be changed from
titles = soup.find('title')
for title in titles:
    writer.writerow(title)
---->
title = soup.find('title')
writer.writerow([title.text])
find() returns just a single element, not a list of elements. Write the element's text (or whatever info you want from it), but not the full element.
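If the goal is to have the event title in the first column of every data row (rather than on a row of its own), a sketch of the body of your url loop (reusing the writer and soup from your code, and skipping rows without td cells) could be:

title = soup.find('title')
title_text = title.text.strip() if title else ''

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        columns = row.find_all('td')
        if not columns:  # skip header/empty rows
            continue
        # prepend the event title to every data row
        writer.writerow([title_text] + [column.get_text() for column in columns])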

formatting data to csv file

I wrote this page scraper using Python and Beautiful Soup to extract data from a table, and now I want to save it. The area I scraped is the table on the right-hand side of the website. I need the bold part on the left side to correspond to the right side, so "Key people" corresponds to the CEO, for example. I'm new to this and need some advice on the best way to format it. Thank you.
import requests
import csv
from datetime import datetime
from bs4 import BeautifulSoup

# download the page
myurl = requests.get("https://en.wikipedia.org/wiki/Goodyear_Tire_and_Rubber_Company")

# create BeautifulSoup object
soup = BeautifulSoup(myurl.text, 'html.parser')

# pull the class containing the tire name
name = soup.find(class_ = 'logo')

# pull the div in the class
nameinfo = name.find('div')

# just grab text in between the div
nametext = nameinfo.text

# print information about goodyear logo on wiki page
#print(nameinfo)

# now, print type of company, private or public
#status = soup.find(class_ = 'category')
#for link in soup.select('td.category a'):
#    print link.text

# now get the ceo information
#for employee in soup.select('td.agent a'):
#    print employee.text

# print area served
#area = soup.find(class_ = 'infobox vcard')
#print(area)

# grab information in bold on the left hand side
vcard = soup.find(class_ = 'infobox vcard')
rows = vcard.find_all('tr')
for row in rows:
    cols = row.find_all('th')
    cols = [x.text.strip() for x in cols]
    print cols

# grab information in bold on the right hand side
vcard = soup.find(class_ = 'infobox vcard')
rows = vcard.find_all('tr')
for row in rows:
    cols2 = row.find_all('td')
    cols2 = [x.text.strip() for x in cols2]
    print cols2

# save to csv file named index
with open('index.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)  # actually write to the file
    writer.writerow([cols, cols2, datetime.now()])  # append time
You need to reorder your code a bit. It is also possible to find both th and td at the same time, which solves your problem of the two columns needing to stay in sync:
import requests
import csv
from datetime import datetime
from bs4 import BeautifulSoup

myurl = requests.get("https://en.wikipedia.org/wiki/Goodyear_Tire_and_Rubber_Company")
soup = BeautifulSoup(myurl.text, 'html.parser')
vcard = soup.find(class_='infobox vcard')

with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)

    for row in vcard.find_all('tr')[1:]:
        cols = row.find_all(['th', 'td'])
        csv_output.writerow([x.text.strip().replace('\n', ' ').encode('ascii', 'ignore') for x in cols] + [datetime.now()])
This would create an output.csv file such as:
Type,Public,2018-03-27 17:12:45.146000
Tradedas,NASDAQ:GT S&P 500 Component,2018-03-27 17:12:45.147000
Industry,Manufacturing,2018-03-27 17:12:45.147000
Founded,"August29, 1898; 119 years ago(1898-08-29) Akron, Ohio, U.S.",2018-03-27 17:12:45.147000
Founder,Frank Seiberling,2018-03-27 17:12:45.147000
Headquarters,"Akron, Ohio, U.S.",2018-03-27 17:12:45.148000
Area served,Worldwide,2018-03-27 17:12:45.148000
Key people,"Richard J. Kramer (Chairman, President and CEO)",2018-03-27 17:12:45.148000
Products,Tires,2018-03-27 17:12:45.148000
Revenue,US$ 15.158 billion[1](2016),2018-03-27 17:12:45.149000
Operating income,US$ 1.52 billion[1](2016),2018-03-27 17:12:45.149000
Net income,US$ 1.264 billion[1](2016),2018-03-27 17:12:45.149000
Total assets,US$ 16.511 billion[1](2016),2018-03-27 17:12:45.150000
Total equity,US$ 4.507 billion[1](2016),2018-03-27 17:12:45.150000
Number of employees,"66,000[1](2017)",2018-03-27 17:12:45.150000
Subsidiaries,List of subsidiaries,2018-03-27 17:12:45.151000
Website,goodyear.com,2018-03-27 17:12:45.151000
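Note that the code above is Python 2 (binary 'wb' mode and ascii-encoding each cell). On Python 3, a minimal adaptation assuming the same page structure would be:

import csv
import requests
from datetime import datetime
from bs4 import BeautifulSoup

myurl = requests.get("https://en.wikipedia.org/wiki/Goodyear_Tire_and_Rubber_Company")
soup = BeautifulSoup(myurl.text, 'html.parser')
vcard = soup.find(class_='infobox vcard')

# Python 3: text mode with newline='' and an explicit encoding, no manual .encode()
with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    for row in vcard.find_all('tr')[1:]:
        cols = row.find_all(['th', 'td'])
        csv_output.writerow([x.text.strip().replace('\n', ' ') for x in cols] + [datetime.now()])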

Syntax issues when scraping data

import requests
from bs4 import BeautifulSoup
import csv
from urlparse import urljoin
import urllib2
from lxml import html

base_url = 'http://www.pro-football-reference.com'  # base url for concatenation
data = requests.get("http://www.pro-football-reference.com/years/2014/games.htm")  # website for scraping
soup = BeautifulSoup(data.content)

list_of_cells = []
for link in soup.find_all('a'):
    if link.has_attr('href'):
        if link.get_text() == 'boxscore':
            url = base_url + link['href']
            for x in url:
                response = requests.get('x')
                html = response.content
                soup = BeautifulSoup(html)
                table = soup.find('table', attrs={'class': 'stats_table x_large_text'})
                for row in table.findAll('tr'):
                    for cell in row.findAll('td'):
                        text = cell.text.replace(' ', '')
                        list_of_cells.append(text)
print list_of_cells
I am using the code in order to get all the boxscore urls from http://www.pro-football-reference.com/years/2014/games.htm. After I get these boxscore urls, I would like to loop through them to scrape the quarter-by-quarter data for each team, but my syntax always seems to be off no matter how I format the code.
If it is possible, I would also like to scrape more than just the scoring data, by also getting the Game Info, officials, and expected points per game.
If you modify your loop slightly to:
for link in soup.find_all('a'):
    if not link.has_attr('href'):
        continue

    if link.get_text() != 'boxscore':
        continue

    url = base_url + link['href']
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)

    # Scores
    table = soup.find('table', attrs={'id': 'scoring'})
    for row in table.findAll('tr'):
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
    print list_of_cells
That returns each of the cells for each row in the scoring table for each page linked to with the 'boxscore' text.
The issues I found with the existing code were:
You were attempting to loop through each character in the href returned for the 'boxscore' link.
You were always requesting the literal string 'x' rather than the url you built.
Not so much an issue, but I changed the table selector to identify the table by its id 'scoring' rather than by class. Ids at least should be unique within the page (though there is no guarantee).
I'd recommend that you find each table (or HTML element) containing the data you want in the main loop (e.g. score_table = soup.find('table'...)), but that you move the code that parses that data, e.g.:

for row in table.findAll('tr'):
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
print list_of_cells

...into a separate function that returns said data (one for each type of data you are extracting), just to keep the code slightly more manageable. The more the code indents to handle if tests and for loops, the more difficult it tends to be to follow the flow. For example:
score_table = soup.find('table', attrs={'id': 'scoring'})
score_data = parse_score_table(score_table)
other_table = soup.find('table', attrs={'id': 'other'})
other_data = parse_other_table(other_table)
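For example, such a helper might look like this (a sketch; parse_other_table would follow the same pattern for whatever other table you extract):

def parse_score_table(table):
    # return one list of cell texts per row, skipping rows without <td> cells
    parsed_rows = []
    for row in table.findAll('tr'):
        cells = [cell.text.strip() for cell in row.findAll('td')]
        if cells:
            parsed_rows.append(cells)
    return parsed_rows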
