My code below opens a website, scrapes values into an array, and plots them. Notice at the bottom that one can comment out driver.quit() so that, when the Python script stops, the webpage of interest stays open. A short time later I would like to soft-start the script and continue reading from that same page. My attempt was to print out the value of the driver object and jump straight to that value without opening a new page: once I am on the welcome page it takes a lot of time and effort to get to the desired page, and I would like to avoid that. Look at the third line of code, where I have pasted the value of the driver for the session that is currently open. Python does not accept that. Is there a way to continue with that session from Python while it is still open?
import time

import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")
driver.get("https://my_example.com/welcome")
# driver = <selenium.webdriver.chrome.webdriver.WebDriver (session="1a636e51f3d40bd9b66996e3d52d945b")>

my_name = ["nm25", "nm26", "nm27", "nm33", "nm38", "nm41", "nm45"]
data_points = 450
my_file = np.zeros((13, 7, data_points))
x = []
y = []

soup = "chicken"  # initialization value so the first .find() call works
while soup.find(my_name[0]) == -1:  # wait until the first name shows up in the page text
    source = driver.page_source
    soup = BeautifulSoup(source, "lxml")
    soup = soup.get_text()
    time.sleep(5)

tim = time.time()
my_cntr = 0

plt.title("Title")
plt.xlabel("Time")
plt.ylabel("y axis amplitude")

for i in range(data_points):
    source = driver.page_source
    soup = BeautifulSoup(source, "lxml")  # was "html.parser"
    soup3 = soup.get_text()
    soup2 = soup.get_text("**", strip=True)  # adds "**" between values for ease of reading
    x_pos = soup2.find(my_name[0])  # remove all but the first name
    soup2 = soup2[x_pos - 2:]       # data starts with "**"
    for j in range(len(my_name)):   # go through all of the names
        for k in range(0, 13):      # the number of values to read per name
            soup2 = soup2[2:]       # remove "**"
            x_pos = soup2.find('**')
            if k < 2 or k == 6:
                my_file[k, j, i] = time.time() - tim  # adds time stamp
            else:
                my_file[k, j, i] = soup2[:x_pos]
            soup2 = soup2[x_pos:]
            if k == 7 and j == 0:
                x.append(my_file[0, 0, i])
                y.append(my_file[7, 0, i])
    my_cntr = my_cntr + 1
    if my_cntr % 20 == 0:  # replot every 20 samples
        plt.plot(x, y)
        plt.pause(0.2)

plt.show()
driver.quit()  # remove this line to leave the browser open
Try getting the session id and setting it again later. You should probably save the session id in a config.py file, a JSON file, a text file, etc., and access it afterwards:
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")
driver.get("https://my_example.com/welcome")

# do stuff and then save the session id at the end
session_id = driver.session_id

import json

# save the session id in a json file
with open("sample.json", "w") as f:
    f.write(json.dumps({'session_id': session_id}))
Once that code finishes running and the Selenium browser is still open, you can run another .py file that reads the session id back out of the JSON file so you can get at the open browser again:
import json

# read the session id back from the json file
with open('sample.json', 'r') as f:
    session_id = json.load(f)['session_id']

driver.session_id = session_id
# do more stuff
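As written, the second file never creates a driver object to assign the saved session id to. A commonly cited workaround (this is only a sketch of that idea, not something spelled out above) is to also save the command executor URL from the first run and reattach through webdriver.Remote; note that Remote() first opens an extra blank session, which you then discard:

from selenium import webdriver

# hypothetical values; in practice load both from the same sample.json,
# where the first script would also have saved driver.command_executor._url
executor_url = "http://127.0.0.1:51234"
session_id = "1a636e51f3d40bd9b66996e3d52d945b"

# attach a new client to the chromedriver server that is still running
driver = webdriver.Remote(command_executor=executor_url, desired_capabilities={})
driver.close()                  # close the blank window that Remote() just opened
driver.session_id = session_id  # point the client at the session that is still alive
print(driver.current_url)       # now talks to the page left open by the first script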
Related
I am trying to iterate through player seasons on NBA.com and pull shooting statistics after each click of the season dropdown menu. After each click, I get the error message "list index out of range" for:
headers = table[1].findAll('th')
It seems to me that the page doesn't load all the way before the source data is saved.
Looking at other similar questions, I have tried using browser.implicitly_wait() for each loop, but I am still getting the same error. It also doesn't seem that the browser waits for more than the first iteration of the loop.
from selenium.webdriver import Chrome
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import pandas as pd

player_id = str(1629216)
url = 'https://www.nba.com/stats/player/' + player_id + "/shooting/"

browser = Chrome(executable_path='/usr/local/bin/chromedriver')
browser.get(url)

select = Select(browser.find_element_by_xpath('/html/body/main/div/div/div/div[4]/div/div/div/div/div[1]/div[1]/div/div/label/select'))
options = select.options

for index in range(0, len(options)):
    select.select_by_index(index)
    browser.implicitly_wait(5)
    src = browser.page_source
    parser = BeautifulSoup(src, "lxml")
    table = parser.findAll("div", attrs={"class": "nba-stat-table__overflow"})

    headers = table[1].findAll('th')
    headerlist = [h.text.strip() for h in headers[1:]]
    headerlist = [a for a in headerlist if '\n' not in a]
    headerlist.append('AST%')
    headerlist.append('UAST%')

    row_labels = table[1].findAll("td", {"class": "first"})
    row_labels_list = [r.text.strip() for r in row_labels[0:]]

    rows = table[1].findAll('tr')[1:]
    player_stats = [[td.getText().strip() for td in rows[i].findAll('td')[1:]] for i in range(len(rows))]

    df = pd.DataFrame(data=player_stats, columns=headerlist, index=row_labels_list)
    print(df)
I found my own answer. I used time.sleep(1) at the top of the loop to give the browser a second to load all the way. Without this delay, the page's source code did not contain the table that I am scraping.
Responding to those who answered: I did not want to go the API route, but I have seen people scrape nba.com that way. table[1] is the correct table; the source code just needed a chance to load after I loop through the season dropdown.
import time

for index in range(0, len(options)):
    select.select_by_index(index)
    time.sleep(1)
    src = browser.page_source
    parser = BeautifulSoup(src, "lxml")
    table = parser.findAll("div", attrs={"class": "nba-stat-table__overflow"})

    headers = table[1].findAll('th')
    headerlist = [h.text.strip() for h in headers[1:]]
    headerlist = [a for a in headerlist if '\n' not in a]
    headerlist.append('AST%')
    headerlist.append('UAST%')

    row_labels = table[1].findAll("td", {"class": "first"})
    row_labels_list = [r.text.strip() for r in row_labels[0:]]

    rows = table[1].findAll('tr')[1:]
    player_stats = [[td.getText().strip() for td in rows[i].findAll('td')[1:]] for i in range(len(rows))]

    df = pd.DataFrame(data=player_stats, columns=headerlist, index=row_labels_list)
    print(df)
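A fixed sleep works here, but as an alternative (a sketch only, not what was used above) an explicit wait can block until the stat tables are actually present after each dropdown selection, rather than guessing a delay:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

for index in range(0, len(options)):
    select.select_by_index(index)
    # wait up to 10 seconds for the stat tables to show up in the DOM
    WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "nba-stat-table__overflow"))
    )
    src = browser.page_source
    parser = BeautifulSoup(src, "lxml")
    table = parser.findAll("div", attrs={"class": "nba-stat-table__overflow"})
    # ... rest of the parsing as above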
import time

import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
url = 'www.mywebsite.com'
driver.get(url)
response = requests.get(url)
markup = driver.page_source
soup = BeautifulSoup(markup, 'lxml')

for _ in range(50):
    driver.find_element_by_tag_name('body').send_keys(Keys.END)  # move the page down
    element = driver.find_element_by_class_name('prevnext')
    master_list = []
    for name in soup.find_all(itemprop='name'):
        data_dict = {}
        data_dict['company name'] = name.get_text(strip=True, separator='\n')
        master_list.append(data_dict)
    df = pd.DataFrame(master_list)
    print('Page scraped')
    time.sleep(5)
    print('Sleeping for 2..')
    print('Is the button enabled : ' + str(element.is_enabled()))
    print('Is the button visible : ' + str(element.is_displayed()))
    element.click()
    print('Clicked Next')
    driver.implicitly_wait(2)
    # for _ in range(1):
    #     print('waiting 10')
    # driver.find_element_by_class_name('submit-btn').click()

print('Finished Scraping')
I need this to run through 50 pages. It scrapes the first one and flips through the others, but at the end only the first page is scraped and added to df. Every page has 20 records. I believe my indentation is wrong. Any help is appreciated.
It seems you made a small mistake.
markup = driver.page_source
soup = BeautifulSoup(markup, 'lxml')
Remove these lines from where they are and move them to the start of your for loop: you need to get the page source again every time you click, because new content is loaded each time. A rough sketch of the reworked loop is shown below.
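Roughly how the loop might look with the parsing moved inside it (a sketch based on the code above; master_list is also created once before the loop here, on the assumption that rows from all 50 pages should end up in one DataFrame):

master_list = []  # accumulate rows from every page

for _ in range(50):
    driver.find_element_by_tag_name('body').send_keys(Keys.END)  # move the page down
    element = driver.find_element_by_class_name('prevnext')

    # re-read and re-parse the page source on every pass so the newly
    # loaded records are what BeautifulSoup sees
    markup = driver.page_source
    soup = BeautifulSoup(markup, 'lxml')

    for name in soup.find_all(itemprop='name'):
        master_list.append({'company name': name.get_text(strip=True, separator='\n')})

    element.click()
    time.sleep(5)  # give the next page a moment to load

df = pd.DataFrame(master_list)
print('Finished Scraping')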
I am using the Selenium webdriver to open the URLs in the for loop. Once a URL opens, file_total_l.append(get_file_total) stores the total in a list. How can I check that the value in 'missing_amount' is actually present on the webpage before appending get_file_total to the list file_total_l?
What's happening:
It inserts file_total into my table a second time if I run the script twice. The missing_amount from my table is 165,757.06, so why isn't that one being inserted?
print(icl_dollar_amount):
['627,418.07', '6,986,500.57', '165,757.06']
print(missing_amount[i])
'165,757.06'
code:
missing_amount = []

for rps_amount2 in rps_amount_l:
    if rps_amount2 not in bpo_file_total_l:
        rps_table_q_2 = f"""select * from rps..sendfile where processingdate = '{cd}' and datasetname like '%ICL%' and paymenttotamt = '{rps_amount2}' """
        rps_table_results = sql_server_cursor.execute(rps_table_q_2).fetchall()
        file_missing = True
        for rps in rps_table_results:
            rps_amount_f = str(rps[18]).rstrip('0')
            rps_amount_f = "{:,}".format(float(rps_amount_f))
            missing_amount.append(rps_amount_f)

file_total_l
for link in url_list:
    print(link)
    options = Options()
    browser = webdriver.Chrome(options=options,
                               executable_path=r'\\test\user$\test\Documents\driver\chromedriver.exe')
    browser.get(link)
    body = browser.find_element_by_xpath("//*[contains(text(), 'Total:')]").text
    body_l.append(body)
    icl_dollar_amount = re.findall('(?:[\£\$\€]{1}[,\d]+.?\d*)', body)[0].split('$', 1)[1]
    icl_dollar_amount_l.append(icl_dollar_amount)

if not missing_amount:
    logging.info("List is empty")
    print("List is empty")

count = 0
for i in range(len(missing_amount)):
    if missing_amount[i] in icl_dollar_amount_l:
        body = body_l[i]
        get_file_total = re.findall('(?:[\£\$\€]{1}[,\d]+.?\d*)', body)[0].split('$', 1)[1]
        file_total_l.append(get_file_total)
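One way to get the check asked about above (just a sketch, not an answer from the thread; it assumes body_l[i] still holds the page text that goes with missing_amount[i]) is to test that the amount actually appears in the scraped page text before appending:

for i in range(len(missing_amount)):
    if missing_amount[i] in icl_dollar_amount_l:
        body = body_l[i]
        if missing_amount[i] in body:  # only keep the total when the amount is really on the page
            get_file_total = re.findall('(?:[\£\$\€]{1}[,\d]+.?\d*)', body)[0].split('$', 1)[1]
            file_total_l.append(get_file_total)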
I am a student working on a scraping project, and I am having trouble completing my script because it fills my computer's memory with all of the data it stores.
It currently holds everything until the end, so my solution would be to break the scrape into smaller chunks and write the data out periodically, instead of building one big list and writing it all out at the end.
To do this I would need to stop my scroll method, scrape the loaded profiles, write out the data I have collected, and then repeat the process without duplicating my data. It would be appreciated if someone could show me how to do this. Thank you for your help :)
Here's my current code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.common.exceptions import NoSuchElementException

Data = []

driver = webdriver.Chrome()
driver.get("https://directory.bcsp.org/")
count = int(input("Number of Pages to Scrape: "))

body = driver.find_element_by_xpath("//body")
profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")

while len(profile_count) < count:  # Get links up to "count"
    body.send_keys(Keys.END)
    sleep(1)
    profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")

for link in profile_count:  # Calling up links
    temp = link.get_attribute('href')
    driver.execute_script("window.open('');")          # open new tab
    driver.switch_to.window(driver.window_handles[1])  # focus new tab
    driver.get(temp)

    # scrape code
    Name = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[1]/div[2]/div').text
    IssuedBy = "Board of Certified Safety Professionals"
    CertificationorDesignaationNumber = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[1]/td[3]/div[2]').text
    CertfiedorDesignatedSince = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[3]/td[1]/div[2]').text
    try:
        AccreditedBy = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[3]/div[2]/a').text
    except NoSuchElementException:
        AccreditedBy = "N/A"
    try:
        Expires = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/div[2]').text
    except NoSuchElementException:
        Expires = "N/A"

    info = Name, IssuedBy, CertificationorDesignaationNumber, CertfiedorDesignatedSince, AccreditedBy, Expires + "\n"
    Data.extend(info)

    driver.close()                                      # close the profile tab
    driver.switch_to.window(driver.window_handles[0])   # back to the results tab

with open("Spredsheet.txt", "w") as output:
    output.write(','.join(Data))

driver.close()
Try the approach below using requests and BeautifulSoup. In the script I have used the API URL fetched from the website itself.
On the first iteration it builds the first URL, then writes the headers and data to a .csv file.
On each later iteration it builds the URL again with two extra params, start_on_page and show_per_page, where start_on_page starts at 20 and is incremented by 20 on each iteration, and show_per_page is set to 100 so that 100 records are extracted per iteration, and so on until all the data has been dumped into the .csv file.
The script dumps four things: number, name, location and profile URL.
On each iteration the data is appended to the .csv file, so your memory issue is resolved by this approach.
Do not forget to set the directory where you want the .csv file created in the file_path variable before running the script.
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs
import csv


def scrap_directory_data():
    list_of_credentials = []
    file_path = ''
    file_name = 'credential_list.csv'
    count = 0
    page_number = 0
    page_size = 100
    create_url = ''
    main_url = 'https://directory.bcsp.org/search_results.php?'
    first_iteration_url = 'first_name=&last_name=&city=&state=&country=&certification=&unauthorized=0&retired=0&specialties=&industries='
    number_of_records = 0
    csv_headers = ['#', 'Name', 'Location', 'Profile URL']

    while True:
        if count == 0:
            create_url = main_url + first_iteration_url
            print('-' * 100)
            print('1 iteration URL created: ' + create_url)
            print('-' * 100)
        else:
            create_url = main_url + 'start_on_page=' + str(page_number) + '&show_per_page=' + str(page_size) + '&' + first_iteration_url
            print('-' * 100)
            print('Other than first iteration URL created: ' + create_url)
            print('-' * 100)

        page = requests.get(create_url, verify=False)
        extracted_text = bs(page.text, 'lxml')
        result = extracted_text.find_all('tr')

        if len(result) > 0:
            for idx, data in enumerate(result):
                if idx > 0:
                    number_of_records += 1
                    name = data.contents[1].text
                    location = data.contents[3].text
                    profile_url = data.contents[5].contents[0].attrs['href']
                    list_of_credentials.append({
                        '#': number_of_records,
                        'Name': name,
                        'Location': location,
                        'Profile URL': profile_url
                    })
                print(data)
                with open(file_path + file_name, 'a+') as cred_CSV:
                    csvwriter = csv.DictWriter(cred_CSV, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
                    if idx == 0 and count == 0:
                        print('Writing CSV header now...')
                        csvwriter.writeheader()
                    else:
                        for item in list_of_credentials:
                            print('Writing data rows now..')
                            print(item)
                            csvwriter.writerow(item)
                        list_of_credentials = []

        count += 1
        page_number += 20


scrap_directory_data()
I have tried many times, but it does not work:
import requests
from lxml import html, etree
from selenium import webdriver
import time, json

# how many pages do you want to scan
page_numnotint = input("how many page do you want to scan")
page_num = int(page_numnotint)
file_name = 'jd_goods_data.json'

url = 'https://list.jd.com/list.html?cat=1713,3264,3414&page=1&delivery=1&sort=sort_totalsales15_desc&trans=1&JL=4_10_0#J_main'
driver = webdriver.Chrome()
driver.get(url)
base_html = driver.page_source
selctor = etree.HTML(base_html)

date_info = []
name_data, price_data = [], []
jd_goods_data = {}

for q in range(page_num):
    i = int(1)
    while True:
        name_string = '//*[@id="plist"]/ul/li[%d]/div/div[3]/a/em/text()' % (i)
        price_string = '//*[@id="plist"]/ul/li[%d]/div/div[2]/strong[1]/i/text()' % (i)
        if i == 60:
            break
        else:
            i += 1
        name = selctor.xpath(name_string)[0]
        name_data.append(name)
        price = selctor.xpath(price_string)[0]
        price_data.append(price)
        jd_goods_data[name] = price
        print(name_data)

    with open(file_name, 'w') as f:
        json.dump(jd_goods_data, f)

    time.sleep(2)
    driver.find_element_by_xpath('//*[@id="J_bottomPage"]/span[1]/a[10]').click()
    time.sleep(2)

# for k, v in jd_goods_data.items():
#     print(k, v)
I am trying to download some product details, but it does not work. If you type 2 as the number of pages to scan, it only downloads the details of one page, but twice!
OK, you define q but you do not actually use it. In that case, the convention is to name the unused variable _. That is, instead of
for q in range(page_num):
you would write
for _ in range(page_num):
so other programmers know right away that you do not use q and only want the operation repeated.
That said, for some reason the line driver.find_element_by_xpath('//*[@id="J_bottomPage"]/span[1]/a[10]').click() does not do what you expect. There is surely a way to make it work, but I notice that your URL contains a parameter named page. I recommend using that parameter instead, which means actually using the variable q, as follows:
import requests
from lxml import html, etree
from selenium import webdriver
import time, json

# how many pages do you want to scan
page_numnotint = input("how many page do you want to scan")
page_num = int(page_numnotint)
file_name = 'jd_goods_data.json'

driver = webdriver.Chrome()
date_info = []
name_data, price_data = [], []
jd_goods_data = {}

for q in range(page_num):
    url = 'https://list.jd.com/list.html?cat=1713,3264,3414&page={page}&delivery=1&sort=sort_totalsales15_desc&trans=1&JL=4_10_0#J_main'.format(page=q)
    driver.get(url)
    base_html = driver.page_source
    selctor = etree.HTML(base_html)
    i = 1
    while True:
        name_string = '//*[@id="plist"]/ul/li[%d]/div/div[3]/a/em/text()' % (i)
        price_string = '//*[@id="plist"]/ul/li[%d]/div/div[2]/strong[1]/i/text()' % (i)
        if i == 60:
            break
        else:
            i += 1
        name = selctor.xpath(name_string)[0]
        name_data.append(name)
        price = selctor.xpath(price_string)[0]
        price_data.append(price)
        jd_goods_data[name] = price
        print(name_data)

    with open(file_name, 'w') as f:
        json.dump(jd_goods_data, f)

driver.quit()