I am having trouble getting a usable data frame from scraping a website. I know I need to turn my list into a list of lists, and that's easy to do with a static data frame. But here's the rub: my scraped data changes daily, and I want to automate the data frame creation. First, I scrape the data:
### Libraries/packages
import pandas as pd
import numpy as np
import re
import requests
import datetime
from datetime import datetime
import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
### Function 1
def strava_page():
    urllist = ['https://www.strava.com/login',
               'https://www.strava.com/clubs/roosevelt-island-dc-parkrun']
    return urllist
### Function 2
def strava_login(urllist):
    # navigate to page
    driver = webdriver.Chrome(executable_path=r"/Users/user/Documents/chromedriver")
    driver.get(urllist[1])
    # last week's leaderboard
    last_week = driver.find_element_by_css_selector('body > div.view > div.page.container > div:nth-child(4) > div.spans11 > div > div:nth-child(2) > ul > li:nth-child(1) > span')
    last_week.click()
    # getting rows from leaderboard
    table_rows = []
    myrow = []
    totalrows = len(driver.find_elements_by_xpath("//div[@class='leaderboard']/table/tbody//tr"))
    print("[Number of Rows in Leaderboard]:", totalrows)
    # gets individual rows, and puts each one into its own list
    for i in range(totalrows):
        myrow.clear()
        for items in driver.find_elements_by_xpath("//div[@class='leaderboard']/table/tbody//tr[" + str(i + 1) + "]/td"):
            myrow.append(items.text)
        table_rows.append(myrow)
        print(myrow)
    driver.close()
    # myrow variable is a list
    print(type(myrow))
    # column names
    my_columns = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
    # PROBLEM AREA *************
    new_table = pd.DataFrame(np.array(myrow).reshape(1, 7), columns=my_columns)
    return new_table
### Calling functions
one = strava_page()
two = strava_login(one)
two
I keep getting "cannot reshape array" size errors. I know the numpy reshape is the correct way to go, but I cannot get the myrow output into a full frame - i.e. it only returns the last row of the table, when I want EVERY row in the table from the Strava webpage. How do I dynamically get every row into a data frame (with the number of rows varying from day to day), without having to set the .reshape() by hand every time I run the script?
For reference, here's a screenshot of the table. There are 7 columns, and the number of rows should reflect the number of rows in the table, even when that number changes daily:
Relatively simple fix, all it took was for me to ignore work for a bit, and play around with numpy outside of the loop:
new_table = np.array(myrow).reshape(-1, 7)
previous_week = pd.DataFrame(new_table, columns = my_columns)
I got rid of the myrow.clear() call and returned previous_week. Worked like a charm once I discovered the -1 argument to np.reshape().
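Putting it together, here is a minimal sketch of the fixed loop (assuming the same leaderboard XPath and the 7 column names above; not verified against the live page). Every cell ends up in one flat list, and reshape(-1, 7) infers the row count, so it adapts when the number of rows changes day to day:
# Sketch of the corrected row collection, assuming `driver` is already on the leaderboard page
cells = []
totalrows = len(driver.find_elements_by_xpath("//div[@class='leaderboard']/table/tbody//tr"))
for i in range(totalrows):
    for item in driver.find_elements_by_xpath("//div[@class='leaderboard']/table/tbody//tr[" + str(i + 1) + "]/td"):
        cells.append(item.text)  # flat list holding every cell of every row

my_columns = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
# -1 lets numpy work out the number of rows from the data
previous_week = pd.DataFrame(np.array(cells).reshape(-1, 7), columns=my_columns)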
I'm new to Python. I've been attempting to download the table from this website: https://tradereport.moc.go.th/Report/ReportEng.aspx?Report=HarmonizeCommodity&Lang=Eng&ImExType=1&Option=1
It looks complicated, because the report only appears after a button is clicked and the resulting table is not in the page source.
So, before getting to that complicated table structure, I kept it simple and tested only how to get the result table after clicking the "ReviewReport" button. The table shows up on the webpage, but there is nothing I can scrape out of the page source. Please help me get the table data.
Thank you very much.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
import pandas as pd
import time
Year = ('2021')
Month = ('11')
HScode = ('721990')
options = Options()
#options.headless = True
#driver = webdriver.Chrome('C:\\Python\Scripts\chromedriver.exe', options=options)
driver = webdriver.Firefox(executable_path=r'C:\Python\geckodriver.exe')
driver.get("https://tradereport.moc.go.th/Report/ReportEng.aspx? Report=HarmonizeCommodity&Lang=Eng&ImExType=1&Option=1")
time.sleep(5)
driver.get("https://tradereport.moc.go.th/Report/ReportEng.aspx? Report=HarmonizeCommodity&Lang=Eng&ImExType=1&Option=1") #reenter the url again to get the table page
time.sleep(5)
driver.find_element_by_id("ddlYear").send_keys(Year) #Working fine
driver.find_element_by_id("ddlMonth").send_keys(Month) #Working fine
driver.find_element_by_id("txtHsCode").send_keys(HScode) #Working fine
submitbttn = driver.find_element_by_id("btnSubmit") #Working fine
submitbttn.click()
time.sleep(5)
f=open("d:\\page.txt","w")
f.write(driver.page_source)
f.close()
print ("************* Scrapping data done**********************")
driver.quit()
By inspecting the HTML I noticed that the table is contained in an iframe, so the first thing to do is to switch to it so that Selenium can find elements inside it:
iframe_id = 'ASPxDocumentViewer1_Splitter_Viewer_ContentFrame'
driver.switch_to.frame(iframe_id)
Then we scrape the header of the table, which is composed of two rows (I called them header1 and header2). For the sake of simplicity we squeeze them into a single list, called header in the code. This is what it looks like:
['No',
'Country',
'Quanity (Dec.2022)',
'Value (Dec.2022)',
'Share (Dec.2022)',
'Quanity (Jan. - Dec. 2022)',
'Value (Jan. - Dec. 2022)',
'Share (Jan. - Dec. 2022)']
Then we can start scraping the values of the table. You can do it in two ways: by rows or by columns. In our case (actually, in almost all cases) there are more rows than columns (28 vs 8), so it is faster to do it by columns. At the end of the loop the variable columns will be a list containing 8 lists, each one containing 28 elements. So by using the header as keys and the columns as values we can create a dictionary, which we then pass to pd.DataFrame to create a table, which we then save to a csv named tradereport_data.csv.
header1 = [td.text for td in driver.find_elements(By.XPATH, "//div[@id='report_div']//tr[6]/td[@class]")]
header2 = [td.text for td in driver.find_elements(By.XPATH, "//div[@id='report_div']//tr[7]/td[@class]")]
header = header1[:2] + [f'{h} ({header1[2]})' for h in header2[2:5]] + [f'{h} ({header1[3]})' for h in header2[5:8]]
columns_number = 8
columns = []
for p in range(1, columns_number + 1):
    columns.append([x.text.replace('\n', '').strip() for x in driver.find_elements(By.XPATH, f"//div[@id='report_div']//tr[(position()>7) and (position()<last()-1)]/td[@class][{p}]")])
df = pd.DataFrame(dict(zip(header,columns)))
df.to_csv('tradereport_data.csv', index=False)
and this is what df looks like
As a final note, what the xpath tr[(position()>7) and (position()<last()-1)] does is to select all the tr elements excluding the first seven and the last two.
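The answer notes the table could also be scraped by rows instead of by columns; here is a minimal sketch of that slower alternative (my own illustration, assuming the same report_div markup and the header list built above):
# Row-wise alternative (a sketch, not the approach used above):
# walk the data rows and read the classed cells of each one
rows = []
for tr in driver.find_elements(By.XPATH, "//div[@id='report_div']//tr[(position()>7) and (position()<last()-1)]"):
    cells = [td.text.replace('\n', '').strip() for td in tr.find_elements(By.XPATH, "./td[@class]")]
    rows.append(cells)

df = pd.DataFrame(rows, columns=header)
df.to_csv('tradereport_data.csv', index=False)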
I want to retrieve data from the website below. I want to get the table information from all the pages and put it into an Excel file. The script goes through all the pages, but each time it erases the Excel sheet and rewrites it, so at the end I only have the last page rather than the tables from all the pages. Would you help me?
site: https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/imp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-01-01&r9=2022-01-01
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Chrome('C:\Webdriver\chromedriver.exe')
driver.get('https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/imp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-01-01&r9=2022-01-01')
time.sleep(2)
for J in range(20):
    commodities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[2]/a')
    Countries = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[4]')
    quantities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[7]')
    weights = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[8]/abbr')
    Canada_Result = []
    for i in range(25):
        temporary_data = {'Commodity': commodities[i].text, 'Country': Countries[i].text, 'quantity': quantities[i].text, 'weight': weights[i].text}
        Canada_Result.append(temporary_data)
    df_data = pd.DataFrame(Canada_Result)
    df_data
    df_data.to_excel('Canada_scrapping_result.xlsx', index=False)
    # click on the Next button
    driver.find_element_by_xpath('//*[@id="report_results_next"]').click()
    time.sleep(1)
I would suggest creating an Excel file with the desired name and column names before running the code; after that, do not create a new DataFrame, but instead read the existing Excel file.
Your problem is caused by the fact that every time the loop repeats, it deletes the previous data and replaces the already existing Excel file.
so instead of
df = pd.DataFrame(Canada_Result)
I would recommend just reading the excel with
df = pd.read_excel('Canada_scrapping_result.xlsx')
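A minimal sketch of that idea (my own illustration, assuming the Excel file already exists with the same columns and that openpyxl is available for pandas' Excel I/O): read what was saved so far, append the rows scraped from the current page, and write the file back out.
# Hypothetical helper sketching the read-then-append approach
def append_page_to_excel(page_rows, path='Canada_scrapping_result.xlsx'):
    existing = pd.read_excel(path)                                   # rows saved so far
    updated = pd.concat([existing, pd.DataFrame(page_rows)], ignore_index=True)
    updated.to_excel(path, index=False)                              # rewrite with old + new rows

# inside the page loop, instead of overwriting the file each time:
# append_page_to_excel(Canada_Result)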
You have to initialize the Canada_Result list before entering the outer for loop, then just append new data; when the outer loop has ended, convert the list to a DataFrame and export it to a file.
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Chrome('C:\Webdriver\chromedriver.exe')
driver.get('https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/imp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-01-01&r9=2022-01-01')
time.sleep(2)
Canada_Result = []
for J in range(20):
    commodities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[2]/a')
    Countries = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[4]')
    quantities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[7]')
    weights = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[8]/abbr')
    for i in range(25):
        temporary_data = {'Commodity': commodities[i].text, 'Country': Countries[i].text, 'quantity': quantities[i].text, 'weight': weights[i].text}
        Canada_Result.append(temporary_data)
    # click on the Next button
    driver.find_element_by_xpath('//*[@id="report_results_next"]').click()
    time.sleep(1)
df_data = pd.DataFrame(Canada_Result)
df_data.to_excel('Canada_scrapping_result.xlsx', index=False)
I'm new to web scraping. I'm trying to scrape data from the news site.
I have this code:
from bs4 import BeautifulSoup as soup
import pandas as pd
import requests
detik_url = "https://news.detik.com/indeks/2"
detik_url
html = requests.get(detik_url)
bsobj = soup(html.content, 'lxml')
bsobj
for link in bsobj.findAll("h3"):
    print("Headline : {}".format(link.text.strip()))

links = []
for news in bsobj.findAll('article', {'class': 'list-content__item'}):
    links.append(news.a['href'])

for link in links:
    page = requests.get(link)
    bsobj = soup(page.content)
    div = bsobj.findAll('div', {'class': 'detail__body itp_bodycontent_wrapper'})
    for p in div:
        print(p.find('p').text.strip())
How do I utilize a Pandas Dataframe to store the obtained content into a CSV file?
You can store your content in a pandas dataframe, and then write the structure to a csv file.
Suppose you want to save all the text from p.find('p').text.strip(), along with the headline, in a csv file; you can store your headline in any variable (say head):
So, from your code:
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content)
    div = bsobj.findAll('div', {'class': 'detail__body itp_bodycontent_wrapper'})
    for p in div:  # <----- Here we make the changes
        print(p.find('p').text.strip())
Starting from the line marked above, we make the following changes:
import pandas as pd

# Create an empty list to store all the data
generated_text = []

for link in links:
    page = requests.get(link)
    bsobj = soup(page.content)
    div = bsobj.findAll('div', {'class': 'detail__body itp_bodycontent_wrapper'})
    for p in div:
        # print statement here if you want to see the output
        generated_text.append(p.find('p').text.strip())  # <---- save the data in the list

# then write this into a csv file using pandas; first we create a
# dataframe from our list
df = pd.DataFrame(generated_text, columns=[head])

# save this into a csv file
df.to_csv('csv_name.csv', index=False)
Also, instead of the for loop, you can use a list comprehension directly and save to your CSV.
# instead of the above snippet, replace the whole `for p in div` loop by
# So from your code above:
.....
bsobj = soup(page.content)
div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
# Remove the whole `for p in div:` and instead use this:
df = pd.DataFrame([p.find('p').text.strip() for p in div], columns = [head])
....
df.to_csv('csv_name.csv', index = False)
Also, you can convert the list generated by the list comprehension to a numpy array and write it directly to a csv file:
import numpy as np
import pandas as pd

# On a side note:
# you can convert your normal list to a numpy array (or build one with a list
# comprehension); there are also faster conversion routes you can explore.
# From there you can write to a csv:
nparray = np.array(generated_text)
pd.DataFrame(nparray).to_csv('csv_name.csv')
I'm new to coding and I finally got the data I want from the website. The problem is that I can't figure out how to get these rows into one DataFrame. I can't concat because they aren't assigned to a variable; they just come straight from the scraper.
Here's the code:
import pandas as pd
import numpy as np
import requests
from csv import writer
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://app.hedgeye.com/feed_items/all?page=1&with_category=33-risk-ranges")
#login
import requests
import sys
url = 'https://accounts.hedgeye.com/users/sign_in'
driver.get(url)
username = driver.find_element_by_id("user_email")
password = driver.find_element_by_id("user_password")
username.send_keys("")
password.send_keys("")
driver.find_element_by_name("commit").click()
#end login
for tr in driver.find_elements_by_tag_name("tr"):
    data = tr.get_attribute("innerText")
    data2 = data.split()[-3:]
    # makes the list rows, not columns
    df = pd.DataFrame(np.array(data2).reshape(-1, len(data2)))
    print(df)
driver.quit()
Here's what the dataframe looks like:
Here is what the scraper looks like before I put it into a dataframe and what the webpage looks like:
Try initializing df outside of the for loop first. Then df can be extended on each iteration with pd.concat.
So, outside the for loop you'd have something like:
df = pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
then after the data2 variable assignment:
df = pd.concat([df, pd.DataFrame(np.array(data2).reshape(-1, len(data2)))])
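Put together, a minimal sketch of that approach (assuming the same page, login, and innerText splitting as in the question; the column handling is left as in the original):
# Sketch only: accumulate a single DataFrame across the loop instead of recreating it
df = pd.DataFrame()  # start empty, outside the loop

for tr in driver.find_elements_by_tag_name("tr"):
    data = tr.get_attribute("innerText")
    data2 = data.split()[-3:]                      # last three whitespace-separated fields
    row = pd.DataFrame(np.array(data2).reshape(-1, len(data2)))
    df = pd.concat([df, row], ignore_index=True)   # append this row to the running table

print(df)
driver.quit()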
Link to website: http://www.tennisabstract.com/cgi-bin/player-classic.cgi?p=RafaelNadal
I am trying to write code which goes through each row in a table and extracts each element from that row.
I am aiming for an output in the following layout:
Row1Element1, Row1Element2, Row1Element3
Row2Element1, Row2Element2, Row2Element3
Row3Element1, Row3Element2, Row3Element3
I have had two major attempts at coding this.
Attempt 1:
rows = driver.find_elements_by_xpath('//table//body//tr')
elements = rows.find_elements_by_xpath('//td')
#this gets all rows in the table, but then gets all elements on the page,
#not just the table
Attempt 2:
driver.find_elements_by_xpath('//table//body//tr//td')
#this gets all the elements that I want, but makes no distinction to which
#row each element belongs to
Any help is appreciated
You can get the table headers and use their indexes to pull the right values from each row.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.tennisabstract.com/cgi-bin/player-classic.cgi?p=RafaelNadal")
table_headers = [th.text.strip() for th in driver.find_elements_by_css_selector("#matchheader th")]
rows = driver.find_elements_by_css_selector("#matches tbody > tr")
date_index = table_headers.index("Date")
tournament_index = table_headers.index("Tournament")
score_index = table_headers.index("Score")
for row in rows:
    table_data = row.find_elements_by_tag_name("td")
    print(table_data[date_index].text, table_data[tournament_index].text, table_data[score_index].text)
This is the locator for each row of the table you mean:
XPATH: //table[@id="matches"]//tbody//tr
First, the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Each row:
driver.get('http://www.tennisabstract.com/cgi-bin/player-classic.cgi?p=RafaelNadal')
rows = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, '//table[@id="matches"]//tbody//tr')))
for row in rows:
    print(row.text)
Or each cell:
for row in rows:
    cols = row.find_elements_by_tag_name('td')
    for col in cols:
        print(col.text)
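To get the row-by-row layout asked for in the question, here is a small sketch building on the cells loop above (assuming the rows collected by the WebDriverWait call); each row's cells are kept together in their own list, which can also be fed to pandas:
# Sketch: group the cells by row, then print them in the requested layout
table = []
for row in rows:
    cells = [col.text for col in row.find_elements_by_tag_name('td')]
    table.append(cells)            # one list per row

for cells in table:
    print(', '.join(cells))        # Row1Element1, Row1Element2, Row1Element3, ...

# optionally, as a DataFrame (no column names assumed here)
import pandas as pd
df = pd.DataFrame(table)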