Python Web Scraping: Output to CSV

I'm making some progress with web scraping; however, I still need some help to perform some operations:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
# soup = BeautifulSoup(requests.get(converturl).content, 'html.parser')
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
out = []
for tr in soup.select('.col-md-4 tbody tr'):
Within the class col-md-4 I know there are 3 tables. I want to generate a CSV whose output has three values per row: first name, last name, and, as the last value, the header name of the table.
first name, last name, header table
Any help would be appreciated.

This is what I have done on my own:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
filename = url.rsplit('/', 1)[1] + '.csv'
tables = soup.select('.col-md-4 table')
rows = []
for tr in tables:
    t = tr.get_text(strip=True, separator='|').split('|')
    rows.append(t)
df = pd.DataFrame(rows)
print(df)
df.to_csv(filename)
Thanks,

This might work:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
tables = soup.select('.col-md-4 table')
rows = []
for table in tables:
    cleaned = list(table.stripped_strings)
    header, names = cleaned[0], cleaned[1:]
    data = [name.split(', ') + [header] for name in names]
    rows.extend(data)
result = pd.DataFrame.from_records(rows, columns=['surname', 'name', 'table'])
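If you also want this written to a CSV as the question asks, one extra line should do it (a small addition, reusing the filename idea from your own attempt):
result.to_csv(url.rsplit('/', 1)[1] + '.csv', index=False)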

You need to first iterate through each table you want to scrape, then for each table, get its header and rows of data. For each row of data, you want to parse out the First Name and Last Name (along with the header of the table).
Here's a verbose working example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
out = []
# Iterate through each of the three tables
for table in soup.select(".col-md-4 table"):
    # Grab the header and rows from the table
    header = table.select("thead th")[0].text.strip()
    rows = [s.text.strip() for s in table.select("tbody tr")]
    t = []  # This list will contain the rows of data for this table
    # Iterate through rows in this table
    for row in rows:
        # Split by comma (last_name, first_name)
        split = row.split(",")
        last_name = split[0].strip()
        first_name = split[1].strip()
        # Create the row of data
        t.append([first_name, last_name, header])
    # Convert list of rows to a DataFrame
    df = pd.DataFrame(t, columns=["first_name", "last_name", "table_name"])
    # Append to list of DataFrames
    out.append(df)
# Write to CSVs...
out[0].to_csv("first_table.csv", index=None) # etc...
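If you prefer a single CSV with all three tables (matching the "first name, last name, header table" layout from the question), you could also concatenate the per-table DataFrames before writing; a small variation on the snippet above:
# Stack the three per-table DataFrames and write one combined CSV
combined = pd.concat(out, ignore_index=True)
combined.to_csv("all_tables.csv", index=False)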
Whenever you're web scraping, I highly recommend using strip() on all of the text you parse to make sure you don't have superfluous spaces in your data.
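For example, BeautifulSoup can do the stripping for you at extraction time (cell here is just a stand-in for any td Tag you have already found):
text = cell.get_text(strip=True)  # same result as cell.text.strip() for a simple cell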
I hope this helps!

Need help getting tr values when scraping

I have the following code
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
# Get the content for tab_Co id
temp_table = soup.find('table', id='tab_Co')
# Create Headers
headers = []
for i in temp_table.find_all('th'):
    title = i.text
    headers.append(title)
# Create DataFrame with the headers as columns
mydata = pd.DataFrame(columns = headers)
# This is where the script goes wrong
# Create loop that retrieves information and appends it to the DataFrame
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
What am I doing wrong? The final purpose is to have a dataframe where I can extract the top 4 values for each of these columns:
['Temperatura Max (ºC)',
 'Temperatura Min (ºC)',
 'Prec. acumulada (mm)',
 'Rajada máxima (km/h)',
 'Humidade Max (%)',
 'Humidade Min (%)',
 'Pressão atm. (hPa)']
and then use those to generate a daily image.
Any ideas? Thank you in advance!
Disclaimer: This is for a not-for-profit project and no commercial use will be made of the solution.
So this worked, based on this solution by Falsovsky on GitHub:
# Import libraries
import requests
import pandas as pd
import re
import json
# Define target URL
url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'
# Get URL information
page = requests.get(url)
# After inspecting the page apply a regex search
search = re.search('var observations = (.*?);', page.text, re.DOTALL)
# Create dict by loading the json information
json_data = json.loads(search.group(1))
# Create Dataframe from json result
df1 = pd.concat({k: pd.DataFrame(v).T for k, v in json_data.items()}, axis=0)
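The question also asks for the top 4 values per column; once df1 exists, something along these lines might work (the column name is a placeholder, the real keys depend on the JSON the page serves):
# 'temp_max' is a hypothetical column name; check df1.columns for the real ones
top4 = pd.to_numeric(df1['temp_max'], errors='coerce').nlargest(4)
print(top4)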
From the page source (view-source:https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp), it is clear that the data is in th elements, so try scraping with row_data = j.find_all('th').
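If that is the case, the question's loop might look something like this (an untested sketch; it assumes the header row and the data rows all use th cells with matching counts, and it drops the undefined table1 name):
all_rows = temp_table.find_all('tr')
headers = [th.text.strip() for th in all_rows[0].find_all('th')]
rows = [[th.text.strip() for th in tr.find_all('th')] for tr in all_rows[1:]]
mydata = pd.DataFrame(rows, columns=headers)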

Beautiful Soup to Scrape Data from Static Webpages

I am trying to scrape values from a table on multiple static webpages. It is the verb conjugation data for Korean verbs here: https://koreanverb.app/
My Python script uses Beautiful Soup. The goal is to grab all conjugations from multiple URL inputs and output the data to a CSV file.
Conjugations are stored on the page in a table with class "table-responsive", under table rows with class "conjugation-row". There are multiple "conjugation-row" table rows on each page. My script is somehow only grabbing the first table row with class "conjugation-row".
Why isn't the for loop grabbing all the tr elements with class "conjugation-row"? I would appreciate a solution that grabs all tr with class "conjugation-row". I tried using job_elements = results.find("tr", class_="conjugation-row"), but I get the following error:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Furthermore, when I do get the data and output it to a CSV file, the data is in separate rows as expected, but leaves empty spaces: it places the data rows for the second URL at the index after all data rows for the first URL.
See code here:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
# create csv file
outfile = open("scrape.csv","w",newline='')
writer = csv.writer(outfile)
## define first URL to grab conjugation names
url1 = 'https://koreanverb.app/?search=%ED%95%98%EB%8B%A4'
# define dataframe columns
df = pd.DataFrame(columns=['conjugation name'])
# get URL content
response = requests.get(url1)
soup = BeautifulSoup(response.content, 'html.parser')
# get table with all verb conjugations
results = soup.find("div", class_="table-responsive")
##### GET CONJUGATIONS AND APPEND TO CSV
# define URLs
urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4',
'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']
# loop to get data
for url in urls:
    response = requests.get(url)
    soup2 = BeautifulSoup(response.content, 'html.parser')
    # get table with all verb conjugations
    results2 = soup2.find("div", class_="table-responsive")
    # get dictionary form of verb/adjective
    verb_results = soup2.find('dl', class_='dl-horizontal')
    verb_title = verb_results.find('dd')
    verb_title_text = verb_title.text
    job_elements = results2.find_all("tr", class_="conjugation-row")
    for job_element in job_elements:
        conjugation_name = job_element.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        conjugation_name_text = conjugation_name.text
        conjugation_korean_text = conjugation_korean.text
        data_column = pd.DataFrame({'conjugation name': [conjugation_name_text],
                                    verb_title_text: [conjugation_korean_text],
                                    })
        #data_column = pd.DataFrame({verb_title_text: [conjugation_korean_text]})
        df = df.append(data_column, ignore_index = True)
# save to csv
df.to_csv('scrape.csv')
outfile.close()
print('Verb Conjugations Collected and Appended to CSV, one per column')
Get all the job_elements using find_all(), since find() only returns the first occurrence, and iterate over them in a for loop like below.
job_elements = results.find_all("tr", class_="conjugation-row")
for job_element in job_elements:
    conjugation_name = job_element.find("td", class_="conjugation-name")
    conjugation_korean = conjugation_name.find_next_sibling("td")
    conjugation_name_text = conjugation_name.text
    conjugation_korean_text = conjugation_korean.text
    # append element to data
    df2 = pd.DataFrame([[conjugation_name_text, conjugation_korean_text]], columns=['conjugation_name', 'conjugation_korean'])
    df = df.append(df2)
The error comes from trying to call find() on a ResultSet (a list of elements) rather than on a single element.
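To make the distinction concrete, a tiny illustration (table_div is a stand-in for the div found in the question's code):
table_div = soup2.find("div", class_="table-responsive")    # a single Tag (or None)
rows = table_div.find_all("tr", class_="conjugation-row")   # a ResultSet: index it or loop over it
first_row = rows[0]
# rows.find(...) would raise AttributeError, because a ResultSet has no .find()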
As your script is growing, I made some modifications, like using a get_conjugations() function and some descriptive names that are easy to understand. First, conjugation_names and conjugation_korean_names are added as pandas DataFrame columns, and then the other columns are added subsequently (korean0, korean1, ...).
import requests
from bs4 import BeautifulSoup
import pandas as pd
# function to parse the html data & get conjugations
def get_conjugations(url):
    #set return lists
    conjugation_names = []
    conjugation_korean_names = []
    #get html text
    html = requests.get(url).text
    #parse the html text
    soup = BeautifulSoup(html, 'html.parser')
    #get table
    table = soup.find("div", class_="table-responsive")
    table_rows = table.find_all("tr", class_="conjugation-row")
    for row in table_rows:
        conjugation_name = row.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        conjugation_names.append(conjugation_name.text)
        conjugation_korean_names.append(conjugation_korean.text)
    #return both lists
    return conjugation_names, conjugation_korean_names
# create csv file
outfile = open("scrape.csv", "w", newline='')
urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4',
'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']
# define dataframe columns
df = pd.DataFrame(columns=['conjugation_name', 'conjugation_korean', 'korean0', 'korean1'])
conjugation_names, conjugation_korean_names = get_conjugations(urls[0])
df['conjugation_name'] = conjugation_names
df['conjugation_korean'] = conjugation_korean_names
for index, url in enumerate(urls[1:]):
    conjugation_names, conjugation_korean_names = get_conjugations(url)
    #set column name
    column_name = 'korean' + str(index)
    df[column_name] = conjugation_korean_names
#save to csv
df.to_csv('scrape.csv')
outfile.close()
# Print DONE
print('Export to CSV Complete')
Output:
,conjugation_name,conjugation_korean,korean0,korean1
0,declarative present informal low,해,먹어,마셔
1,declarative present informal high,해요,먹어요,마셔요
2,declarative present formal low,한다,먹는다,마신다
3,declarative present formal high,합니다,먹습니다,마십니다
...
Note:
This assumes that elements in different URLs are in same order.
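If that ordering assumption worries you, one way around it (an untested sketch, with slightly different column names than above) is to merge each verb's frame on the conjugation name instead of assigning columns by position:
# Build one frame per URL and merge them on the conjugation name
frames = []
for i, url in enumerate(urls):
    names, korean = get_conjugations(url)
    frames.append(pd.DataFrame({'conjugation_name': names, 'korean' + str(i): korean}))

merged = frames[0]
for frame in frames[1:]:
    merged = merged.merge(frame, on='conjugation_name', how='outer')
merged.to_csv('scrape.csv', index=False)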

Appending results with Panda and BeautifulSoup

PROBLEM: I have a list of sites that I want BS and Pandas to grab a data table for. I want to add all the iterative results to the same xlsx or csv file.
My current code below will iterate over each of the 3 sites, but the final product is just the last page scraped. Removing my export function and just printing df, I can see all 3 pages of data, so I'm not sure how to correctly append each iteration into my output file.
from bs4 import BeautifulSoup
import requests
import pandas as pd
from time import gmtime, strftime
#Pass in the URL
url = ["https://www.nfl.com/standings/league/2021/reg", "https://www.nfl.com/standings/league/2020/reg", "https://www.nfl.com/standings/league/2019/reg"]
for site in url:
    #Load the page html
    page = requests.get(site)
    soup = BeautifulSoup(page.text, 'lxml')
    # Get all the table data
    table = soup.find('table', {'summary':'Standings - Detailed View'})
    headers = []
    for i in table.find_all('th'):
        title = i.text.strip()
        headers.append(title)
    #Dataframe the headers into columns
    df = pd.DataFrame(columns = headers)
    # TR for the rows, TD for the values
    for row in table.find_all('tr')[1:]:
        data = row.find_all('td')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        df.loc[length] = row_data
    #Write the collected data out to an Excel file
    dateTime = strftime("%d%b%Y_%H%M", gmtime())
    writer = pd.ExcelWriter(dateTime + "Z" + ".xlsx")
    df.to_excel(writer)
    writer.save()
    print('[*] Data successfully written to Excel File.')
Try the following. You need to capture all the dataframes from each url, then concatenate them, then write the new df to Excel. This should work, but it is untested. See comments inline.
from bs4 import BeautifulSoup
import requests
import pandas as pd
from time import gmtime, strftime
#Pass in the URL
url = ["https://www.nfl.com/standings/league/2021/reg", "https://www.nfl.com/standings/league/2020/reg", "https://www.nfl.com/standings/league/2019/reg"]
df_hold_list = [] #collect each dataframe separately
for site in url:
    #Load the page html
    page = requests.get(site)
    soup = BeautifulSoup(page.text, 'lxml')
    # Get all the table data
    table = soup.find('table', {'summary':'Standings - Detailed View'})
    headers = []
    for i in table.find_all('th'):
        title = i.text.strip()
        headers.append(title)
    #Dataframe the headers into columns
    df = pd.DataFrame(columns = headers)
    # TR for the rows, TD for the values
    for row in table.find_all('tr')[1:]:
        data = row.find_all('td')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        df.loc[length] = row_data
    df_hold_list.append(df) # add the dfs to the list

final_df = pd.concat(df_hold_list, axis=1) # put them together-check that axis=1 is correct, otherwise axis=0

# moved out of loop
#Write the collected data out to an Excel file
dateTime = strftime("%d%b%Y_%H%M", gmtime())
writer = pd.ExcelWriter(dateTime + "Z" + ".xlsx")
final_df.to_excel(writer) # write final_df to excel
writer.save()
print('[*] Data successfully written to Excel File.')
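On the axis question in that comment: since each page here is one season of the same standings table, stacking the frames row-wise with axis=0 (the default) is most likely what you want; tagging each frame with its season first keeps the combined sheet readable. A hedged sketch, assuming the season can be read from the URL:
# Inside the loop, tag each season's rows before collecting them
df["Season"] = site.split("/")[-2]   # e.g. '2021' from the standings URL
df_hold_list.append(df) # add the dfs to the list

# After the loop, stack all seasons row-wise
final_df = pd.concat(df_hold_list, axis=0, ignore_index=True)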

Scraping a table: IndexError: list index out of range

I am new to Python. I am using it in Jupyter notebooks to scrape a table from Wikipedia. All the code I wrote works, except when I want to put the information into a CSV file. The error that appears is "IndexError: list index out of range".
Here is the code:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
import csv
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
s = requests.Session()
response = s.get(url, timeout=10)
response
table_id = 'main'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify().encode('UTF-8'))
table = soup.find('table', attrs={'id': table_id})
for row in table.find_all('tr'):
    print(row)
table = soup.find('table', attrs={'id': table_id})
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    print(col[0].find('a').contents[0])
    print(col[1].string) #name
    print(col[2].string)
    print(col[3].string)
    print(col[4].string)
    print(col[5].find(text=True))
csvfile = open('population.csv', 'w')
csvwriter = csv.writer(csvfile, delimiter=',')
headers = ('COUNTRY','CONTINENT','SUBREGION', 'POPULATION_2018', 'POPULATION_2019', 'CHANGE')
csvwriter.writerow(headers)
table = soup.find('table', attrs={'id': table_id})
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    country = col[0].find('a').contents[0]
    continent = col[1].string
    subregion = col[2].string
    population_2018 = col[3].string
    population_2019 = col[4].string
    change = col[5].find(text=True)
    parsed_row = (country, continent, subregion, population_2018, population_2019, change)
    csvwriter.writerow(parsed_row)
csvfile.close()
Thank you very much!
I have a two-part answer: the easiest way to accomplish your task, and where in your code the error is.
Let pandas handle the requests, BeautifulSoup and csv for you.
import pandas as pd
URI = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
df = pd.read_html(URI)[3]
df.to_csv('population.csv', index=False)
pandas has .read_html that returns a list of all tables in the webpage. Your table was at index 3. With that, I saved it with .to_csv.
With .read_html, you can pass the attributes of a specific table e.g. attrs = {'id': 'table'}
# the table is now at index 0
df = pd.read_html(URI, attrs={'id':'main'})[0]
You can also specify the parser that will be used by BeautifulSoup that .read_html calls:
df = pd.read_html(URI, attrs={'id':'main'}, flavor='lxml')[0]
# 'lxml' is known for speed. But you can use `html.parser` if `lxml` or `html5lib` are not installed.
See the documentation for .read_html for more.
Update: Debugging Your Code
The error in your code comes from an empty col; using an if condition solves the problem:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
import csv
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
s = requests.Session()
response = s.get(url, timeout=10)
response
table_id = 'main'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
#print(soup.prettify().encode('UTF-8'))
csvfile = open('population.csv', 'w')
csvwriter = csv.writer(csvfile, delimiter=',')
headers = ('COUNTRY','CONTINENT','SUBREGION', 'POPULATION_2018', 'POPULATION_2019', 'CHANGE')
csvwriter.writerow(headers)
table = soup.find('table', attrs={'id': table_id})
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    # this is all that was missing
    if col:
        country = col[0].find('a')['title']
        continent = col[1].string
        subregion = col[2].string
        population_2018 = col[3].string
        population_2019 = col[4].string
        change = col[5].find(text=True)
        parsed_row = (country, continent, subregion, population_2018, population_2019, change)
        csvwriter.writerow(parsed_row)
csvfile.close()
Prayson W. Daniel has already given the answer, and I offer another way.
import requests
from simplified_scrapy import SimplifiedDoc, utils, req
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
s = requests.Session()
res = s.get(url, timeout=10)
rows = []
headers = ('COUNTRY','CONTINENT','SUBREGION', 'POPULATION_2018', 'POPULATION_2019', 'CHANGE')
rows.append(headers)
table_id = 'main'
doc = SimplifiedDoc(res.text)
table = doc.select('table#'+table_id) # Get the table by id.
trs = table.tbody.children.children[1:] # Get all data rows
for tr in trs:
    row = [tr[0].a.text] # First col, get first link
    row.extend(tr.text[1:]) # Remaining cols
    rows.append(row)
utils.save2csv('test_wiki.csv', rows) # Save data to csv

I am trying to scrape multiple tables from 30 similar links using Python

I have 10 links of companies.
https://www.zaubacorp.com/company/ASHRAFI-MEDIA-NETWORK-PRIVATE-LIMITED/U22120GJ2019PTC111757,
https://www.zaubacorp.com/company/METTLE-PUBLICATIONS-PRIVATE-LIMITED/U22120MH2019PTC329729,
https://www.zaubacorp.com/company/PRINTSCAPE-INDIA-PRIVATE-LIMITED/U22120MH2020PTC335354,
https://www.zaubacorp.com/company/CHARVAKA-TELEVISION-NETWORK-PRIVATE-LIMITED/U22121KA2019PTC126665,
https://www.zaubacorp.com/company/BHOOKA-NANGA-FILMS-PRIVATE-LIMITED/U22130DL2019PTC353194,
https://www.zaubacorp.com/company/WHITE-CAMERA-SCHOOL-OF-PHOTOGRAPHY-PRIVATE-LIMITED/U22130JH2019PTC013311,
https://www.zaubacorp.com/company/RLE-PRODUCTIONS-PRIVATE-LIMITED/U22130KL2019PTC059208,
https://www.zaubacorp.com/company/CATALIZADOR-MEDIA-PRIVATE-LIMITED/U22130KL2019PTC059793,
https://www.zaubacorp.com/company/TRIPPLED-MEDIAWORKS-OPC-PRIVATE-LIMITED/U22130MH2019OPC333171,
https://www.zaubacorp.com/company/KRYSTAL-CINEMAZ-PRIVATE-LIMITED/U22130MH2019PTC330391
Now I am trying to scrape tables from these links and save the data to well-formatted CSV columns. I want to scrape the tables "Company Details", "Share Capital & Number of Employees", "Listing and Annual Compliance Details", "Contact Details", and "Director Details". If any table has no data or if any column is missing, I want that column to be blank in the output CSV file. I have written some code but can't get the output. I am doing something wrong here. Please help.
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import csv
import lxml
url_file = "Zaubalinks.txt"
with open(url_file, "r") as url:
    url_pages = url.read()
# we need to split the urls into a list to make it iterable
pages = url_pages.split("\n") # Split by lines using \n
# now we run a for loop to visit the urls one by one
data = []
for single_page in pages:
    r = requests.get(single_page)
    soup = BeautifulSoup(r.content, 'html5lib')
    table = soup.find_all('table') # finds all tables
    table_top = pd.read_html(str(table))[0] # the top table
    try: # try to get the other tables if they exist
        table_capital = pd.read_html(str(table))[5]
        table_listing = pd.read_html(str(table))[6]
        table_contact = pd.read_html(str(table))[7]
        table_director = pd.read_html(str(table))[8]
    except:
        table_capital = pd.DataFrame()
        table_listing = pd.DataFrame()
        table_contact = pd.DataFrame()
        table_director = pd.DataFrame()
    result = pd.concat([table_top, table_capital, table_listing, table_contact, table_director])
    data.append(result)
    print(data)
pd.concat(data).to_csv('ZaubaAll.csv')
import requests
from bs4 import BeautifulSoup
import pandas as pd
companies = {
'ASHRAFI-MEDIA-NETWORK-PRIVATE-LIMITED/U22120GJ2019PTC111757',
'METTLE-PUBLICATIONS-PRIVATE-LIMITED/U22120MH2019PTC329729',
'PRINTSCAPE-INDIA-PRIVATE-LIMITED/U22120MH2020PTC335354',
'CHARVAKA-TELEVISION-NETWORK-PRIVATE-LIMITED/U22121KA2019PTC126665',
'BHOOKA-NANGA-FILMS-PRIVATE-LIMITED/U22130DL2019PTC353194',
'WHITE-CAMERA-SCHOOL-OF-PHOTOGRAPHY-PRIVATE-LIMITED/U22130JH2019PTC013311',
'RLE-PRODUCTIONS-PRIVATE-LIMITED/U22130KL2019PTC059208',
'CATALIZADOR-MEDIA-PRIVATE-LIMITED/U22130KL2019PTC059793',
'TRIPPLED-MEDIAWORKS-OPC-PRIVATE-LIMITED/U22130MH2019OPC333171',
'KRYSTAL-CINEMAZ-PRIVATE-LIMITED/U22130MH2019PTC330391'
}
def main(url):
    with requests.Session() as req:
        goal = []
        for company in companies:
            r = req.get(url.format(company))
            df = pd.read_html(r.content)
            target = pd.concat([df[x].T for x in [0, 3, 4]], axis=1)
            goal.append(target)
        new = pd.concat(goal)
        new.to_csv("data.csv")

main("https://www.zaubacorp.com/company/{}")
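Since the question also wants blanks when a table is missing on a page, a defensive variant of main() could check how many tables read_html actually found before indexing. A sketch only, reusing the companies set above and not tested against every company page:
def main(url):
    with requests.Session() as req:
        goal = []
        for company in companies:
            r = req.get(url.format(company))
            dfs = pd.read_html(r.content)
            # Substitute an empty frame when a table index is missing on this page
            parts = [dfs[i].T if i < len(dfs) else pd.DataFrame() for i in [0, 3, 4]]
            goal.append(pd.concat(parts, axis=1))
        pd.concat(goal, sort=False).to_csv("data.csv")

main("https://www.zaubacorp.com/company/{}")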
Fortunately, it seems you can get there with simpler methods. Taking one random link as an example, it should be something like:
url = 'https://www.zaubacorp.com/company/CHARVAKA-TELEVISION-NETWORK-PRIVATE-LIMITED/U22121KA2019PTC126665'
import pandas as pd
tables = pd.read_html(url)
From here, your tables are in tables[0], tables[3], tables[4], tables[15], etc. Just use a for loop to rotate through all the urls.
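A short sketch of that loop, assuming the table positions stay the same on every company page (they may not, so the indices below are taken from the sample page above):
import pandas as pd

urls = [
    'https://www.zaubacorp.com/company/CHARVAKA-TELEVISION-NETWORK-PRIVATE-LIMITED/U22121KA2019PTC126665',
    # ... the other company URLs from the question
]

frames = []
for u in urls:
    tables = pd.read_html(u)
    # Indices observed on the sample page; adjust if a page lays out its tables differently
    frames.append(pd.concat([tables[i] for i in (0, 3, 4, 15)], sort=False))

pd.concat(frames, sort=False).to_csv('zauba_tables.csv', index=False)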
