I have a basic bs4 web scraper. I have no trouble getting the scraped data, but I run into problems when I try to write it to a .csv file: I am unable to write my data to more than one column. In the tutorial I am loosely following, the author separates the columns with "," easily, but when I open my CSV in Excel, neither the header nor the data is separated. What am I missing?
import requests
from bs4 import BeautifulSoup
url="myurl"
page=requests.get(url)
soup=BeautifulSoup(page.content,'html.parser')
items=soup.find_all('a', class_='listing-card')
filename = 'data.csv'
f = open(filename, "w")
header = "name, price\n"
f.write(header)
for item in items:
    title = item.find('span', class_='title').text
    price = item.find('span', class_='price').text
    f.write(title.replace(",","|") + ',' + price + "\n")
f.close()
Another method.
from simplified_scrapy import SimplifiedDoc, utils, req
url = "myurl"
html = req.get(url)
rows = []
rows.append(['name', 'price']) # Add header
doc = SimplifiedDoc(html)
items = doc.getElements('a', attr='class', value='listing-card') # Get all nodes a according to the class
for item in items:
    title = item.getElement('span', value='title').text
    price = item.getElement('span', value='price').text
    rows.append([title, price])
utils.save2csv('data.csv', rows) # Save to CSV file
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
I have found that the easiest way to get your data into a CSV file is to put the data into a pandas DataFrame then use the to_csv method to write the file.
Using your example the code would be as follows:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url="myurl"
page=requests.get(url)
soup=BeautifulSoup(page.content,'html.parser')
items=soup.find_all('a', class_='listing-card')
filename = 'data.csv'
#
# Create an empty list to store entries
mylist = []
for item in items:
    title = item.find('span', class_='title').text
    price = item.find('span', class_='price').text
    #
    # Create the dictionary item to be appended to the list
    entry = {'name' : title, 'price' : price}
    mylist.append(entry)
myDataframe = pd.DataFrame(mylist)
# to_csv opens and writes the file itself, so no manual open()/write() is needed
myDataframe.to_csv(filename, index=False)
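If pandas feels heavy for this, the standard-library csv module also handles quoting for you, so a comma inside a title no longer needs to be replaced by hand. A minimal sketch, assuming the scraped values are already collected as (name, price) pairs; the sample rows and the in-memory buffer are made up for illustration:

```python
import csv
import io

# Hypothetical scraped values; in the real script these come from item.find(...).text
scraped = [("Desk, oak", "120"), ("Chair", "45")]

buffer = io.StringIO()  # stands in for open('data.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(["name", "price"])  # header row, no stray space after the comma
writer.writerows(scraped)           # fields containing commas are quoted automatically

csv_text = buffer.getvalue()
print(csv_text)
```

If Excel still shows everything in one column, its list separator may not be a comma in your locale; passing delimiter=';' to csv.writer is one thing to try.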
I am trying to scrape values from a table on multiple static webpages. It is the verb conjugation data for Korean verbs here: https://koreanverb.app/
My Python script uses Beautiful Soup. The goal is to grab all conjugations from multiple URL inputs and output the data to a CSV file.
Conjugations are stored on the page in a table with class "table-responsive", in table rows with class "conjugation-row". There are multiple "conjugation-row" table rows on each page. My script is somehow only grabbing the first table row with class "conjugation-row".
Why isn't the for loop grabbing all the tr elements with class "conjugation-row"? I would appreciate a solution that grabs all tr with class "conjugation-row". I tried using job_elements = results.find("tr", class_="conjugation-row"), but I get the following error:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Furthermore, when I do get the data and output it to a CSV file, the data is in separate rows as expected, but there are empty cells: the data rows for the second URL are placed at the index after all data rows for the first URL. See example output here:
See code here:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
# create csv file
outfile = open("scrape.csv","w",newline='')
writer = csv.writer(outfile)
## define first URL to grab conjugation names
url1 = 'https://koreanverb.app/?search=%ED%95%98%EB%8B%A4'
# define dataframe columns
df = pd.DataFrame(columns=['conjugation name'])
# get URL content
response = requests.get(url1)
soup = BeautifulSoup(response.content, 'html.parser')
# get table with all verb conjugations
results = soup.find("div", class_="table-responsive")
##### GET CONJUGATIONS AND APPEND TO CSV
# define URLs
urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4',
'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']
# loop to get data
for url in urls:
    response = requests.get(url)
    soup2 = BeautifulSoup(response.content, 'html.parser')
    # get table with all verb conjugations
    results2 = soup2.find("div", class_="table-responsive")
    # get dictionary form of verb/adjective
    verb_results = soup2.find('dl', class_='dl-horizontal')
    verb_title = verb_results.find('dd')
    verb_title_text = verb_title.text
    job_elements = results2.find_all("tr", class_="conjugation-row")
    for job_element in job_elements:
        conjugation_name = job_element.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        conjugation_name_text = conjugation_name.text
        conjugation_korean_text = conjugation_korean.text
        data_column = pd.DataFrame({ 'conjugation name': [conjugation_name_text],
                                     verb_title_text: [conjugation_korean_text],
                                   })
        #data_column = pd.DataFrame({verb_title_text: [conjugation_korean_text]})
        df = df.append(data_column, ignore_index = True)
# save to csv
df.to_csv('scrape.csv')
outfile.close()
print('Verb Conjugations Collected and Appended to CSV, one per column')
Get all the job_elements using find_all() since find() only returns the first occurrence and iterate over them in a for loop like below.
job_elements = results.find_all("tr", class_="conjugation-row")
for job_element in job_elements:
    conjugation_name = job_element.find("td", class_="conjugation-name")
    conjugation_korean = conjugation_name.find_next_sibling("td")
    conjugation_name_text = conjugation_name.text
    conjugation_korean_text = conjugation_korean.text
    # append element to data
    df2 = pd.DataFrame([[conjugation_name_text,conjugation_korean_text]],columns=['conjugation_name','conjugation_korean'])
    df = df.append(df2)
The error occurs because you are calling find() on a ResultSet, which is the list of elements returned by find_all(), rather than on a single element.
As your script is growing big, I made some modifications, such as moving the parsing into a get_conjugations() function and using names that are easier to understand. First, conjugation_names and conjugation_korean_names are added as pandas DataFrame columns; the remaining verbs are then added as further columns (korean0, korean1, ...).
import requests
from bs4 import BeautifulSoup
import pandas as pd
# function to parse the html data & get conjugations
def get_conjugations(url):
    #set return lists
    conjugation_names = []
    conjugation_korean_names = []
    #get html text
    html = requests.get(url).text
    #parse the html text
    soup = BeautifulSoup(html, 'html.parser')
    #get table
    table = soup.find("div", class_="table-responsive")
    table_rows = table.find_all("tr", class_="conjugation-row")
    for row in table_rows:
        conjugation_name = row.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        conjugation_names.append(conjugation_name.text)
        conjugation_korean_names.append(conjugation_korean.text)
    #return both lists
    return conjugation_names, conjugation_korean_names
# create csv file
outfile = open("scrape.csv", "w", newline='')
urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4',
'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']
# define dataframe columns
df = pd.DataFrame(columns=['conjugation_name', 'conjugation_korean', 'korean0', 'korean1'])
conjugation_names, conjugation_korean_names = get_conjugations(urls[0])
df['conjugation_name'] = conjugation_names
df['conjugation_korean'] = conjugation_korean_names
for index, url in enumerate(urls[1:]):
    conjugation_names, conjugation_korean_names = get_conjugations(url)
    #set column name
    column_name = 'korean' + str(index)
    df[column_name] = conjugation_korean_names
#save to csv
df.to_csv('scrape.csv')
outfile.close()
# Print DONE
print('Export to CSV Complete')
Output:
,conjugation_name,conjugation_korean,korean0,korean1
0,declarative present informal low,해,먹어,마셔
1,declarative present informal high,해요,먹어요,마셔요
2,declarative present formal low,한다,먹는다,마신다
3,declarative present formal high,합니다,먹습니다,마십니다
...
Note:
This assumes that the elements on the different URLs are in the same order.
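If that assumption does not hold, one option is to key each verb's forms by conjugation name and build the rows from those keys instead of relying on position. A rough sketch with hand-written sample data; `merge_by_name` and the sample dictionaries are hypothetical and not part of the answer above:

```python
def merge_by_name(per_verb):
    """Combine {verb: {conjugation_name: form}} into rows keyed by conjugation name."""
    # Collect every conjugation name once, in first-seen order
    names = []
    for forms in per_verb.values():
        for name in forms:
            if name not in names:
                names.append(name)
    verbs = list(per_verb)
    header = ["conjugation_name"] + verbs
    # Missing forms become empty strings instead of shifting the column
    rows = [[name] + [per_verb[v].get(name, "") for v in verbs] for name in names]
    return header, rows


# Hypothetical sample: the second verb lists its forms in a different order
sample = {
    "하다": {"declarative present informal low": "해",
             "declarative present informal high": "해요"},
    "먹다": {"declarative present informal high": "먹어요",
             "declarative present informal low": "먹어"},
}
header, rows = merge_by_name(sample)
```

The header and rows can then be passed to csv.writer or pandas.DataFrame(rows, columns=header).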
I'm making some progress with web scraping; however, I still need some help to perform some operations:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
# soup = BeautifulSoup(requests.get(converturl).content, 'html.parser')
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
out = []
for tr in soup.select('.col-md-4 tbody tr'):
Under the class col-md-4 I know there are 3 tables. I want to generate a CSV whose output has three values: first name, last name, and, as the last value, the header name of the table.
first name, last name, header table
Any help would be appreciated.
This is what I have done on my own:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
filename = url.rsplit('/', 1)[1] + '.csv'
tables = soup.select('.col-md-4 table')
rows = []
for tr in tables:
    t = tr.get_text(strip=True, separator='|').split('|')
    rows.append(t)
df = pd.DataFrame(rows)
print(df)
df.to_csv(filename)
Thanks,
This might work:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
tables = soup.select('.col-md-4 table')
rows = []
for table in tables:
    cleaned = list(table.stripped_strings)
    header, names = cleaned[0], cleaned[1:]
    data = [name.split(', ') + [header] for name in names]
    rows.extend(data)
result = pd.DataFrame.from_records(rows, columns=['surname', 'name', 'table'])
You need to first iterate through each table you want to scrape, then for each table, get its header and rows of data. For each row of data, you want to parse out the First Name and Last Name (along with the header of the table).
Here's a verbose working example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
out = []
# Iterate through each of the three tables
for table in soup.select(".col-md-4 table"):
    # Grab the header and rows from the table
    header = table.select("thead th")[0].text.strip()
    rows = [s.text.strip() for s in table.select("tbody tr")]
    t = []  # This list will contain the rows of data for this table
    # Iterate through rows in this table
    for row in rows:
        # Split by comma (last_name, first_name)
        split = row.split(",")
        last_name = split[0].strip()
        first_name = split[1].strip()
        # Create the row of data
        t.append([first_name, last_name, header])
    # Convert list of rows to a DataFrame
    df = pd.DataFrame(t, columns=["first_name", "last_name", "table_name"])
    # Append to list of DataFrames
    out.append(df)
# Write to CSVs...
out[0].to_csv("first_table.csv", index=None) # etc...
Whenever you're web scraping, I highly recommend using strip() on all of the text you parse to make sure you don't have superfluous spaces in your data.
I hope this helps!
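One caveat with the comma split above: a row without a comma raises an IndexError on split[1]. A defensive variant, as a sketch only; `split_name` is a hypothetical helper, and the sample strings are invented rather than taken from the page:

```python
def split_name(cell_text):
    """Split 'LAST, FIRST' into (first_name, last_name), tolerating stray whitespace."""
    # partition never raises: without a comma, first_name is simply empty
    last_name, _, first_name = cell_text.partition(",")
    return first_name.strip(), last_name.strip()


samples = ["  GARCIA LOPEZ , MARC ", "SOLER,PAU", "NOCOMMA"]
parsed = [split_name(s) for s in samples]
print(parsed)
```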
I have a 45k+ rows CSV file, each one containing a different path of the same domain - which are structurally identical to each other - and every single one is clickable. I managed to use BeautifulSoup to scrape the title and content of each one and through the print function, I was able to validate the scraper. However, when I try to export the information gathered to a new CSV file, I only get the last URL's street name and description, and not all of them as I expected.
from bs4 import BeautifulSoup
import requests
import csv
with open('URLs.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        site = requests.get(row['addresses']).text
        soup = BeautifulSoup(site, 'lxml')
        StreetName = soup.find('div', class_='hist-title').text
        Description = soup.find('div', class_='hist-content').text
with open('OutputList.csv','w', newline='') as output:
    Header = ['StreetName', 'Description']
    writer = csv.DictWriter(output, fieldnames=Header)
    writer.writeheader()
    writer.writerow({'StreetName' : StreetName, 'Description' : Description})
How can the output CSV have on each row the street name and description for the respective URL row in the input CSV file?
You need to open both files on the same level and then read and write on each iteration. Something like this:
from bs4 import BeautifulSoup
import requests
import csv
with open('URLs.csv', newline='') as a, open('OutputList.csv', 'w', newline='') as b:
    reader = csv.reader(a)
    writer = csv.writer(b, quoting=csv.QUOTE_ALL)
    writer.writerow(['StreetName', 'Description'])
    # Skip the header row of the input file ('addresses' in the question)
    next(reader, None)
    # Assuming url is the first field in the CSV
    for url, *_ in reader:
        r = requests.get(url)
        if r.ok:
            soup = BeautifulSoup(r.text, 'lxml')
            street_name = soup.find('div', class_='hist-title').text.strip()
            description = soup.find('div', class_='hist-content').text.strip()
            # Write one output row per URL, inside the loop
            writer.writerow([street_name, description])
I hope it helps.
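To see the pattern without hitting the network, here is a small sketch that uses in-memory files and a stand-in for the request and parsing step. `fetch_fields` is hypothetical; in the real script it would wrap requests.get and BeautifulSoup, and the example.com URLs are placeholders.

```python
import csv
import io


def fetch_fields(url):
    """Stand-in for the request + parsing step; returns (street_name, description)."""
    return f"Street for {url}", f"Description for {url}"


# In-memory stand-ins for URLs.csv and OutputList.csv
source = io.StringIO("addresses\nhttp://example.com/a\nhttp://example.com/b\n")
target = io.StringIO()

reader = csv.DictReader(source)
writer = csv.DictWriter(target, fieldnames=["StreetName", "Description"])
writer.writeheader()  # written once, before the loop
for row in reader:
    street_name, description = fetch_fields(row["addresses"])
    # One output row per input row, written inside the loop
    writer.writerow({"StreetName": street_name, "Description": description})

output_lines = target.getvalue().splitlines()
```

The key difference from the question's code is that writerow runs once per input row rather than once after the loop has finished.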
I'm attempting to scrape multiple websites for specific products and I'm sure there is a way to optimize my code. As of right now, the code does its job, but this is really not the Pythonic way to go about it (I am a Python novice, so please excuse my lack of knowledge).
The goal of this program is to get the prices of the products from the URLs provided and write them to a .csv file. Each website has a different structure, but I am always using the same 3 websites. This is an example of my current code:
import requests
import csv
import io
import os
from datetime import datetime
from bs4 import BeautifulSoup
timeanddate=datetime.now().strftime("%Y%m%d-%H%M%S")
folder_path = 'my_folder_path'
file_name = 'product_prices_'+timeanddate+'.csv'
full_name = os.path.join(folder_path, file_name)
with io.open(full_name, 'w', newline='', encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["ProductTitle", "Website1", "Website2", "Website3"])
    #---Product 1---
    #Website1 price
    website1product1 = requests.get('website1product1URL')
    website1product1Data = BeautifulSoup(website1product1.text, 'html.parser')
    website1product1Price = website1product1Data.find('div', attrs={'class': 'price-final'}).text.strip()
    print(website1product1Price)
    #Website2 price
    website2product1 = requests.get('website2product1URL')
    website2product1Data = BeautifulSoup(website2product1.text, 'html.parser')
    website2product1Price = website2product1Data.find('div', attrs={'class': 'price_card'}).text.strip()
    print(website2product1Price)
    #Website3 price
    website3product1 = requests.get('website3product1URL')
    website3product1Data = BeautifulSoup(website3product1.text, 'html.parser')
    website3product1Price = website3product1Data.find('strong', attrs={'itemprop': 'price'}).text.strip()
    print(website3product1Price)
    writer.writerow(["ProductTitle", website1product1Price, website2product1Price, website3product1Price])
file.close()
It saves the ProductTitles and Prices to a .csv in this format and I'd like to keep this format:
#Header
ProductTitle Website1 Website2 Website3
#Scraped data
Product1 $23 $24 $52
This is manageable for a few products, but I'd like to have hundreds and copying the same lines of code and changing variable names is confusing, tedious and is bound to be riddled with human error.
Can I create a function that takes 3 URLs as arguments and outputs the website1product1Price, website2product1Price and website3product1Price, and call that function once per product? Can it then be wrapped in a loop to go through a list of URLs and still keep the original formatting?
Any help is appreciated.
Could this be a solution for you?
Assuming you have a list of dicts for your products:
products = [
    {
        'name': 'product1',
        'url1': 'https://url1',
        'url2': 'https://url2',
        'url3': 'https://url3'
    }
]
Your code could be something like this:
import requests
import csv
import io
import os
from datetime import datetime
from bs4 import BeautifulSoup
def get_product_prices(product):
    #Website1 price
    website1product1 = requests.get(product['url1'])
    website1product1Data = BeautifulSoup(website1product1.text, 'html.parser')
    website1product1Price = website1product1Data.find('div', attrs={'class': 'price-final'}).text.strip()
    #Website2 price
    website2product1 = requests.get(product['url2'])
    website2product1Data = BeautifulSoup(website2product1.text, 'html.parser')
    website2product1Price = website2product1Data.find('div', attrs={'class': 'price_card'}).text.strip()
    #Website3 price
    website3product1 = requests.get(product['url3'])
    website3product1Data = BeautifulSoup(website3product1.text, 'html.parser')
    website3product1Price = website3product1Data.find('strong', attrs={'itemprop': 'price'}).text.strip()
    return website1product1Price, website2product1Price, website3product1Price

timeanddate=datetime.now().strftime("%Y%m%d-%H%M%S")
folder_path = 'my_folder_path'
file_name = 'product_prices_'+timeanddate+'.csv'
full_name = os.path.join(folder_path, file_name)
with io.open(full_name, 'w', newline='', encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["ProductTitle", "Website1", "Website2", "Website3"])
    for product in products:
        price1, price2, price3 = get_product_prices(product)
        # writerow takes one sequence per row; the with block closes the file
        writer.writerow([product['name'], price1, price2, price3])
You can create a function and pass everything as parameters: the URL, tag name, attribute name, and attribute value. See if this helps.
def price_text(url_text,ele_tag,ele_attr,attrval):
    website1product1 = requests.get(url_text)
    website1product1Data = BeautifulSoup(website1product1.text, 'html.parser')
    # Pass the tag name and attribute filter as real values, not as quoted strings
    website1product1Price=website1product1Data.find(ele_tag, attrs={ele_attr: attrval}).text.strip()
    print(website1product1Price)
    # Return the price so the caller can store it
    return website1product1Price
website1product1Price=price_text("url","div","class","price-final")
website2product1Price=price_text("url","div","class","price_card")
website3product1Price=price_text("url","strong","itemprop","price")
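Combining both suggestions, the per-site details can live in one table, so adding a product means adding data rather than code. A sketch under assumptions: `SITE_SELECTORS`, `build_row`, and the `get_price` callable are hypothetical names, and the stub below stands in for a real request.

```python
# Tag and attribute filter for each site, taken from the question's code
SITE_SELECTORS = [
    ("div", {"class": "price-final"}),
    ("div", {"class": "price_card"}),
    ("strong", {"itemprop": "price"}),
]


def build_row(product, get_price):
    """Return [name, price1, price2, price3] for one product.

    get_price(url, tag, attrs) is supplied by the caller, e.g. a wrapper
    around requests.get and BeautifulSoup(...).find(tag, attrs=attrs).
    """
    urls = [product["url1"], product["url2"], product["url3"]]
    prices = [get_price(url, tag, attrs)
              for url, (tag, attrs) in zip(urls, SITE_SELECTORS)]
    return [product["name"]] + prices


def fake_get_price(url, tag, attrs):
    # Stub used only to demonstrate the call shape without network access
    return f"{tag}@{url}"


row = build_row(
    {"name": "product1", "url1": "u1", "url2": "u2", "url3": "u3"},
    fake_get_price,
)
print(row)
```

Each returned row can be handed straight to writer.writerow, which keeps the ProductTitle, Website1, Website2, Website3 layout from the question.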
I am trying to scrape from the first page to page 14 of this website: https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All
Here is my code:
import requests as r
from bs4 import BeautifulSoup as soup
import pandas
#make a list of all web pages' urls
webpages=[]
for i in range(15):
    root_url = 'https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All&page='+ str(i)
    webpages.append(root_url)
print(webpages)
#start looping through all pages
for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')
    #find targeted info and put them into a list to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
    title = [el.replace('\n', '') for el in title_list]
    #export to csv file via pandas
    dataset = {'Title': title}
    df = pandas.DataFrame(dataset)
    df.index.name = 'ArticleID'
    df.to_csv('example31.csv',encoding="utf-8")
The output csv file only contains targeted info of the last page. When I print "webpages", it shows that all the pages' urls have been properly put into the list. What am I doing wrong? Thank you in advance!
You are simply overwriting the same output CSV file for every page. You can call .to_csv() in "append" mode so that the new data is added to the end of the existing file:
df.to_csv('example31.csv', mode='a', encoding="utf-8", header=False)
Or, even better, collect the titles into a single list and then dump them into a CSV once:
#start looping through all pages
titles = []
for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')
    #find targeted info and put them into a list to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
    titles += [el.replace('\n', '') for el in title_list]
# export to csv file via pandas
dataset = [{'Title': title} for title in titles]
df = pandas.DataFrame(dataset)
df.index.name = 'ArticleID'
df.to_csv('example31.csv', encoding="utf-8")
Another way, in addition to what alexce posted, would be to keep appending each page's dataframe to a new dataframe and then write that to the CSV.
Declare finalDf as a dataframe outside the loops:
finalDf = pandas.DataFrame()
Later do this:
for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')
    #find targeted info and put them into lists to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
    title = [el.replace('\n', '') for el in title_list]
    #export to csv file via pandas
    dataset = {'Title': title}
    df = pandas.DataFrame(dataset)
    finalDf = finalDf.append(df)
    #df.index.name = 'ArticleID'
    #df.to_csv('example31.csv', mode='a', encoding="utf-8", header=False)
finalDf = finalDf.reset_index(drop = True)
finalDf.index.name = 'ArticleID'
finalDf.to_csv('example31.csv', encoding="utf-8")
Notice the lines with finalDf
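Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent install the same idea is usually written with pandas.concat. A minimal sketch, with made-up page data standing in for the scraped titles:

```python
import pandas

# Hypothetical titles per page; in the real loop these come from page_soup.find_all(...)
pages = [["Title A", "Title B"], ["Title C"]]

frames = []
for titles in pages:
    frames.append(pandas.DataFrame({"Title": titles}))  # one small frame per page

# Concatenate once after the loop and renumber the rows from 0
final_df = pandas.concat(frames, ignore_index=True)
final_df.index.name = "ArticleID"
csv_text = final_df.to_csv()  # pass a filename instead to write to disk
```

Collecting the frames in a list and concatenating once also avoids copying the growing frame on every iteration.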