I want to scrape a site and save the data in a csv file that can be opened in Excel. I've managed to retrieve the information, but I have trouble transferring it to a csv document. When I open the document, the headers are there and in different columns, but the actual contents all end up in the same column, with the names first and the prices after them.
I have tried putting file.writerow([Name, Price]) at the end of the code, but, probably because I've used span.find for name, only the last name value is displayed. I figured file.writerow has to be in the loop to work, but I can't move the data to another column.
Here's the code:
import requests
from bs4 import BeautifulSoup
import csv
file = csv.writer(open('GPU.csv', 'w'))
file.writerow(['Name','Price'])
url = 'link'
page = requests.get(url)
soup = BeautifulSoup(page.text,'html.parser')
for span in soup.findAll('span', attrs={'class':'details'}):
    name = span.find('a').string
    file.writerow([name])
for span in soup.findAll('span', attrs={'class':'price'}):
    price = span.findAll(text=True)
    file.writerow([price])
If there is nothing I can do with file.writerow, looping could be the issue. I have no experience with coding and would appreciate any advice.
The csv module only writes rows sequentially. However, you can gather the names and prices into separate lists up front, then use the zip() function to iterate over them in pairs, like so:
import requests
from bs4 import BeautifulSoup
import csv
url = "link"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
names = []
prices = []
for span in soup.findAll("span", attrs={"class": "details"}):
names.append(span.find("a").string)
for span in soup.findAll("span", attrs={"class": "price"}):
prices.append(span.findAll(text=True))
file = csv.writer(open("GPU.csv", "w"))
file.writerow(["Name", "Price"])
for name, price in zip(names, prices):
file.writerow([name, price])
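As a side note, zip() pairs items purely by position and stops at the shorter list, so this only works if the details and price spans appear on the page in the same order and number. Under that same assumption you could also skip the intermediate lists and write each row as you go; a rough sketch:

# Sketch: pair the two result lists directly (assumes matching order and count)
detail_spans = soup.findAll("span", attrs={"class": "details"})
price_spans = soup.findAll("span", attrs={"class": "price"})
for d, p in zip(detail_spans, price_spans):
    file.writerow([d.find("a").string, " ".join(p.findAll(text=True)).strip()])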
I am new to Python and I am looking for a way to extract, with Beautiful Soup, existing open source books that are available on gutenberg-de, such as this one
I need to use them for further analysis and text mining.
I tried this code, found in a tutorial, and it extracts metadata, but instead of the body content it gives me a list of the "pages" I need to scrape the text from.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get("https://www.projekt-gutenberg.org/keller/heinrich/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_title, page_head)
I suppose I could use that as a second step to extract it then? I am not sure how, though.
Ideally I would like to store them in a tabular way and be able to save them as csv, preserving the metadata author, title, year, and chapter. Any ideas?
What happens?
First of all, you get a list of pages because you are not requesting the right url. Change it to:
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
If you are looping over all the urls, I recommend storing the content in a list of dicts and pushing it to csv or pandas or ... (see the sketch after the example below).
Example
import requests
from bs4 import BeautifulSoup
data = []
# Make a request
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
soup = BeautifulSoup(page.content, 'html.parser')
data.append({
    'title': soup.title,
    'chapter': soup.h2.get_text(),
    'text': ' '.join([p.get_text(strip=True) for p in soup.select('body p')[2:]])
})

print(data)
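To loop over all the urls and end up with a csv, a rough sketch could follow the chapter links from the index page and dump the list of dicts with csv.DictWriter. The link selector and the assumption that every chapter is a relative .html page are guesses, so check them against the actual index page:

import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.projekt-gutenberg.org/keller/heinrich/'
index = BeautifulSoup(requests.get(base).content, 'html.parser')

data = []
# Assumption: the index page links each chapter as a relative .html file
for link in index.select('a[href$=".html"]'):
    chapter_url = urljoin(base, link['href'])
    soup = BeautifulSoup(requests.get(chapter_url).content, 'html.parser')
    data.append({
        'title': soup.title.get_text(strip=True) if soup.title else '',
        'chapter': soup.h2.get_text(strip=True) if soup.h2 else '',
        'text': ' '.join(p.get_text(strip=True) for p in soup.select('body p')[2:]),
    })

with open('chapters.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'chapter', 'text'])
    writer.writeheader()
    writer.writerows(data)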
I want to extract a list of names from multiple pages of a website.
The website has over 200 pages and I want to save all the names to a text file. I have written some code, but it's giving me an index error.
CODE:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://hamariweb.com/names/muslim/boy/page-'
#for page in range(1, 203):
page = 1
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
row = soup.find('div', attrs={'class', 'row'})
books = row.find_all('a')
for book in books:
    data = book.find_all('b')[0].get_text()
    print(data)
OUTPUT:
Aabbaz
Aabid
Aabideen
Aabinus
Aadam
Aadeel
Aadil
Aadroop
Aafandi
Aafaq
Aaki
Aakif
Aalah
Aalam
Aalamgeer
Aalif
Traceback (most recent call last):
File "C:\Users\Mujtaba\Documents\names.py", line 15, in <module>
data = book.find_all('b')[0].get_text()
IndexError: list index out of range
>>>
The reason for getting the error is that it can't find a <b> tag in one of the matched elements.
Try this code to request each page and save the data to a file:
import requests
from bs4 import BeautifulSoup as bs
MAIN_URL = "https://hamariweb.com/names/muslim/boy/"
URL = "https://hamariweb.com/names/muslim/boy/page-{}"
with open("output.txt", "a", encoding="utf-8") as f:
for page in range(203):
if page == 0:
req = requests.get(MAIN_URL.format(page))
else:
req = requests.get(URL.format(page))
soup = bs(req.text, "html.parser")
print(f"page # {page}, Getting: {req.url}")
book_name = (
tag.get_text(strip=True)
for tag in soup.select(
"tr.bottom-divider:nth-of-type(n+2) td:nth-of-type(1)"
)
)
f.seek(0)
f.write("\n".join(book_name) + "\n")
I suggest changing your parser to html5lib (pip install html5lib); I just think it handles messy markup better. Second, it's better not to call .find() directly on your soup object, since tags and classes can be duplicated elsewhere on the page and you might end up matching an element that doesn't contain your data at all. So inspect the elements first, check which block of the page the tags you want live in, and scrape from that block; it's easier that way and avoids more errors.
What I did here is inspect the elements first and find the block that holds the data: it is a div whose class is mb-40 content-box, and that is where all the names you are trying to get are. Luckily that class is unique, with no other element sharing the same tag and class, so we can .find() it directly.
The value of trs is then simply the tr tags inside that block
(note that those <tr> tags sit inside a <table> tag, but since they are the only <tr> tags there, there is no risk of picking up rows from another table),
and those <tr> tags contain the names you want. The [1:] slice starts at index 1 so the header row of the table on the website is not included.
Then just loop through those tr tags and get the text. As for your error: the IndexError happens because you try to access an item of a .find_all() result list that is out of bounds, which happens when no such data was found. That in turn can happen when you call .find() directly on your soup variable, because sometimes tags and class values are the same but hold different content, so you end up scraping a different part of the page than you expected, get no data, and wonder why.
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://hamariweb.com/names/muslim/boy/page-'
#for page in range(1, 203):
page = 1
req = requests.get(URL + str(page))
soup = bs(req.content, 'html5lib')
div_container = soup.find('div', class_='mb-40 content-box')
trs = div_container.find_all("tr",class_="bottom-divider")[1:]
for tr in trs:
    text = tr.find("td").find("a").text
    print(text)
The IndexError you're getting means that in this case the element you found doesn't contain the b-tag with the information that you are looking for.
You can simply wrap that piece of code in a try-except clause.
for book in books:
    try:
        data = book.find_all('b')[0].get_text()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass  # There was no element available
This will catch the error and move on without breaking the code.
Below I have also added some extra lines to save your titles to a text file.
Take a look at the inline comments.
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://hamariweb.com/names/muslim/boy/page-'
# This is where your titles will be saved. Change as needed
PATH = '/tmp/title_file.txt'
page = 1
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
row = soup.find('div', attrs={'class', 'row'})
books = row.find_all('a')
# Here your title will be stored before writing to file
all_titles = []
for book in books:
    try:
        # Add strip() to clean up the input
        data = book.find_all('b')[0].get_text().strip()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass  # There was no element available

# Open path to write
with open(PATH, 'w') as f:
    # Write all titles on a new line
    f.write('\n'.join(all_titles))
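If you want all 200+ pages rather than just page 1, the same pattern can be wrapped in the loop that is commented out in your code; a rough sketch, assuming the pages really are numbered 1 through 202:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'
PATH = '/tmp/title_file.txt'

all_titles = []
for page in range(1, 203):  # assumption: pages are numbered 1..202
    req = requests.get(URL + str(page))
    soup = bs(req.text, 'html.parser')
    row = soup.find('div', class_='row')
    if row is None:
        continue  # page doesn't have the expected layout, skip it
    for book in row.find_all('a'):
        try:
            all_titles.append(book.find_all('b')[0].get_text().strip())
        except IndexError:
            pass  # link without a <b> tag, e.g. navigation

with open(PATH, 'w') as f:
    f.write('\n'.join(all_titles))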
I've been given a project to make a covid tracker. I decided to scrape some elements from the site (https://www.worldometers.info/coronavirus/). I'm very new to Python so I decided to go with BeautifulSoup. I was able to scrape the basic elements, like the total cases, active cases and so on. However, whenever I try to grab the country names or the numbers, it returns an empty list. Even though there exists a class 'sorting_1', it still returns an empty list. Could someone point out where I am going wrong?
This is something which I am trying to grab:
<td style="font-weight: bold; text-align:right" class="sorting_1">4,918,420</td>
Here is my current code:
import requests
import bs4
#making a request and a soup
res = requests.get('https://www.worldometers.info/coronavirus/')
soup = bs4.BeautifulSoup(res.text, 'lxml')
#scraping starts here
total_cases = soup.select('.maincounter-number')[0].text
total_deaths = soup.select('.maincounter-number')[1].text
total_recovered = soup.select('.maincounter-number')[2].text
active_cases = soup.select('.number-table-main')[0].text
country_cases = soup.find_all('td', {'class': 'sorting_1'})
You can't get the sorting_1 class because it is not present in the page source.
Instead, find all the rows of the table and then read the information from the required columns.
So, to get total cases for each country, you can use following code:
import requests
import bs4
res = requests.get('https://www.worldometers.info/coronavirus/')
soup = bs4.BeautifulSoup(res.text, 'lxml')
country_cases = soup.find_all('td', {'class': 'sorting_1'})  # stays empty: sorting_1 is added by JavaScript
rows = soup.select('table#main_table_countries_today tr')
for row in rows[8:18]:
    tds = row.find_all('td')
    print(tds[1].text.strip(), '=', tds[2].text.strip())
Welcome to SO!
Looking at their website, it seems that the sorting_X classes are added by JavaScript, so they don't exist in the raw HTML.
The table does exist, however, so I'd advise looping over the table rows, similar to this:
table_rows = soup.find("table", id="main_table_countries_today").find_all("tr")
for row in table_rows:
    name = "unknown"
    # Find country name
    for td in row.find_all("td"):
        if td.find("a", class_="mt_a"):  # this kind of link apparently only exists in the "name" column
            name = td.find("a").text
    # Do some more scraping
Warning: I haven't worked with soup for a while, so this may not be 100% correct, but you get the idea.
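For example, here is a rough sketch that pulls the country name and total cases from each data row; the mt_a link class and the column positions are assumptions based on the current table layout (and on the indexing in the other answer), so verify them:

import requests
import bs4

res = requests.get('https://www.worldometers.info/coronavirus/')
soup = bs4.BeautifulSoup(res.text, 'lxml')

table = soup.find("table", id="main_table_countries_today")
for row in table.find_all("tr"):
    tds = row.find_all("td")
    link = row.find("a", class_="mt_a")  # country rows are assumed to carry this link
    if link is None or len(tds) < 3:
        continue  # header, "World" or continent aggregate rows
    name = link.text.strip()
    total_cases = tds[2].text.strip()  # assumption: third column holds total cases
    print(name, '=', total_cases)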
I have a project for one of my college classes that requires me to pull all URLs from a page on the U.S. census bureau website and store them in a CSV file. For the most part I've figured out how to do that, but for some reason when the data gets appended to the CSV file, all the entries are being inserted horizontally. I would expect the data to be arranged vertically, meaning row 1 has the first item in the list, row 2 has the second item and so on. I have tried several approaches but the data always ends up as a horizontal representation. I am new to python and obviously don't have a firm enough grasp on the language to figure this out. Any help would be greatly appreciated.
I am parsing the website using Beautifulsoup4 and the requests library. Pulling all the 'a' tags from the website was easy enough, and getting the URLs from those 'a' tags into a list was pretty clear as well. But when I append the list to my CSV file with a writerow function, all the data ends up in one row as opposed to one separate row for each URL.
import requests
import csv
from bs4 import BeautifulSoup
from pprint import pprint
page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')
## Create Link to append web data to
links = []
# Pull text from all instances of <a> tag within BodyText div
AllLinks = soup.find_all('a')
for link in AllLinks:
    links.append(link.get('href'))

with open("htmlTable.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(links)

pprint(links)
Try this:
import requests
import csv
from bs4 import BeautifulSoup
page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')
## Create Link to append web data to
links = []
# Pull text from all instances of <a> tag within BodyText div
AllLinks = soup.find_all('a')
for link in AllLinks:
    links.append(link.get('href'))

with open("htmlTable.csv", "w") as f:
    writer = csv.writer(f)
    for link in links:
        if isinstance(link, str):
            f.write(link + "\n")
I changed it to check whether a given link was indeed a string and if so, add a newline after it.
Try making a list of lists, by appending the url inside a list
links.append([link.get('href')])
Then the csv writer will put each list on a new line with writerows
writer.writerows(links)
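Put together, a minimal sketch of that list-of-lists approach looks like this (same URL as above):

import requests
import csv
from bs4 import BeautifulSoup

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')

# Each href goes into its own single-item list, i.e. one CSV row per link
links = [[link.get('href')] for link in soup.find_all('a') if link.get('href')]

with open("htmlTable.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(links)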
I am trying to scrape data from https://www.wsj.com/market-data/bonds/treasuries.
There are two tables on this website which get switched when we select the options:
1. Treasury Notes and Bond
2. Treasury Bills
I want to scrape the data for Treasury Bills, but the link, attributes, and everything else stay the same when I click that option. I have tried a lot of things, but every time I end up scraping the data for Treasury Notes and Bonds.
Can someone help me with that?
Following is my code:
import re
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup
mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('tr')
list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = re.sub(clean, '', str_cells)
    list_rows.append(clean2)
df = pd.DataFrame(list_rows)
df1 = df[0].str.split(',', expand=True)
All the data on the site is loaded once, and JS is then used to swap the values shown in the table.
Here is some quickly written working code:
import requests
from bs4 import BeautifulSoup
import json
mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('script') # we get all the script tags
importantJson = ''
for r in rows:
    text = r.text
    if 'NOTES_AND_BONDS' in text:  # the script tag containing the data; probably you can do this better
        importantJson = text
        break
# remove the non json stuff
importantJson = importantJson\
    .replace('window.__STATE__ =', '')\
    .replace(';', '')\
    .strip()
#parse the json
jsn = json.loads(importantJson)
print(jsn) #json object containing all the data you need
How did I get to this conclusion?
First I noticed that switching between the two tables makes no http requests to the server, meaning the data is already there.
Then I inspected the table html and noticed that there is only one table and its contents are dynamically changing, which lead me to the conclusion that this data is already on the page.
Then with simple search in the source I found the script tag containing the json.
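As for getting the Treasury Bills specifically, the exact layout of that JSON object isn't documented here, so this is only a sketch: continuing from jsn above, recursively walk the parsed object and collect whatever sits under keys that mention BILLS (the key name is an assumption, mirrored from NOTES_AND_BONDS):

def find_keys(obj, needle, found=None):
    # Recursively collect values stored under keys whose name contains `needle`
    if found is None:
        found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if needle in str(key):
                found.append(value)
            find_keys(value, needle, found)
    elif isinstance(obj, list):
        for item in obj:
            find_keys(item, needle, found)
    return found

bills = find_keys(jsn, 'BILLS')  # assumption: the bills table lives under a key containing 'BILLS'
print(bills)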