I am trying to write a Python script using Selenium for a horse racing software site.
The site shows a table in which horses' names and odds appear when they become an 'arb'.
When a horse is an 'arb' it will show up on the table and when it is no longer an arb it will disappear.
The script needs to make a list of all the horses that have come up throughout the day with their name and odds.
So far I have managed to get it to write out the selected 'name' and 'odds' values when I first run the script.
However, I am unsure how I would code it to iterate over time, updating and adding to the list.
Any help/advice would be greatly appreciated.
from selenium import webdriver
import time
#Start Driver
driver = webdriver.Chrome("/Users/username/PycharmProjects/SoftwareBot/drivers/chromedriver")
driver.get("https://software.com/members/user/software2")
#Get Title
title = driver.title
print(title)
#Login To Page
driver.find_element_by_name("email").send_keys("johndoe#hotmail.com")
driver.find_element_by_name("password").send_keys("Pass123")
driver.find_element_by_xpath('//button[text()="Continue"]').click()
#Sleep
time.sleep(1.5)
#Find Num Of Rows
rows = len(driver.find_elements_by_xpath('//*[@id="data_body"]/tr'))
print(rows)
#Find Num Of Columns
cols = len(driver.find_elements_by_xpath('//*[@id="data_body"]/tr[1]/td'))
print(cols)
#Open Text File
f= open("horses.txt","w+")
#Write out needed table values
for r in range(1, rows + 1):
    for c in range(3, 7, 3):
        value = driver.find_element_by_xpath('//*[@id="data_body"]/tr[' + str(r) + ']/td[' + str(c) + ']').text
        print(value)
        f.write(value)
        f.write("\n")
time.sleep(1.5)
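For reference, one way to handle the "iterate over time" part is to poll the table in a loop and only record rows that have not been seen before. This is only a sketch: it assumes the login code above has already run, that columns 3 and 6 really are the name and odds, and that appending to the file (mode "a") is acceptable instead of overwriting it.
# Sketch only: keep a dict of every horse seen so far and poll the table repeatedly.
seen = {}  # name -> odds for every horse that has appeared during the day

def scan_table():
    rows = len(driver.find_elements_by_xpath('//*[@id="data_body"]/tr'))
    for r in range(1, rows + 1):
        name = driver.find_element_by_xpath('//*[@id="data_body"]/tr[' + str(r) + ']/td[3]').text
        odds = driver.find_element_by_xpath('//*[@id="data_body"]/tr[' + str(r) + ']/td[6]').text
        if name and name not in seen:
            seen[name] = odds
            with open("horses.txt", "a") as f:  # append so earlier entries are kept
                f.write(name + "," + odds + "\n")

while True:
    scan_table()
    time.sleep(30)  # poll every 30 seconds; adjust as needed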
I am trying to do web scraping with Selenium to make a dataset from this website. What I want to achieve is to get every possible combination of "pain is", "pain located in", etc. and then save the result (possible causes) in a dataframe (csv file). I've figured out how to select the checkboxes, but I have no idea how to try all combinations of those checkboxes automatically.
I found this answer, but I don't know how to write it in Python because I'm not familiar with Java.
Any kind of help will be much appreciated. Thank you.
from selenium import webdriver
import itertools
import time
def create_subset(t):
    r = []
    for L in range(0, t+1):
        for subset in itertools.combinations(range(1, t+1), L):
            if subset != ():
                r.append(subset)
    return r
driver = webdriver.Chrome()
driver.get("https://www.mayoclinic.org/symptom-checker/abdominal-pain-in-adults-adult/related-factors/itt-20009075")
pain_count = len(driver.find_elements_by_class_name("frm_options")[0].find_elements_by_tag_name("li"))
pain_located_count = len(driver.find_elements_by_class_name("frm_options")[1].find_elements_by_tag_name("li"))
pain_checkbox = create_subset(pain_count)
location_checkbox = create_subset(pain_located_count)
for pain_option in pain_checkbox:
    for number in pain_option:
        driver.find_elements_by_class_name("frm_options")[0].find_elements_by_tag_name("li")[number - 1].click()
        time.sleep(0.05)
    for pain_location in location_checkbox:
        for number in pain_location:
            driver.find_elements_by_class_name("frm_options")[1].find_elements_by_tag_name("li")[number - 1].click()
        # this is just for visual presentation
        for number in pain_location:
            driver.find_elements_by_class_name("frm_options")[1].find_elements_by_tag_name("li")[number - 1].click()
    # this is just for visual presentation
    for number in pain_option:
        driver.find_elements_by_class_name("frm_options")[0].find_elements_by_tag_name("li")[number - 1].click()
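One way to generate the same non-empty subsets more directly, and to write each combination to a CSV as you go, is sketched below. The causes_text value is hypothetical; it stands for whatever would be scraped from the results page.
import csv
import itertools

def non_empty_subsets(n):
    # every non-empty subset of {1, ..., n}; same result as create_subset(n) above
    return list(itertools.chain.from_iterable(
        itertools.combinations(range(1, n + 1), size) for size in range(1, n + 1)))

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["pain_is", "pain_located_in", "possible_causes"])
    # inside the nested loops above you would write one row per combination, e.g.
    # writer.writerow([pain_option, pain_location, causes_text])  # causes_text is hypothetical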
I want to scrape every page on the following website: https://www.top40.nl/top40/2020/week-34 (for each year and week number) by clicking on the song, then moving to 'songinfo' and then scraping all data in the table listed there. For this question, I only scraped the title so far.
This is the URL I use:
url = 'https://www.top40.nl/top40/'
However, when I print the songs_list, it will only return the last title on the website. As such, I believe I am overwriting.
Hopefully someone can explain to me which mistake(s) I am making; if there is an easier way to scrape the table on each page, I'm very happy to hear it.
Please find my Python code below:
for year in range(2015,2016):
    for week in range(1,2):
        page_url = url + str(year) + '/' + 'week-' + str(week)
        driver.get(page_url)
        lists = driver.find_elements_by_xpath("//a[@data-linktype='title']")
        links = []
        for l in lists:
            print(l.get_attribute('href'))
            links.append(l.get_attribute('href'))
        for link in links:
            driver.get(link)
            driver.find_element_by_xpath("//a[@href='#songinfo']").click()
            songs = driver.find_elements_by_xpath(""".//*[@id="songinfo"]/table/tbody/tr[2]/td""")
            songs_list = []
            for s in songs:
                print(s.get_attribute('innerHTML'))
                songs_list.append(s.get_attribute('innerHTML'))
The line songs_list = [] is inside the for link in links loop, so each new iteration sets it back to an empty list (and then you append to this new, empty list). Once all the loops finish, you only see the songs_list created in the last iteration.
The simplest fix is to place the songs_list = [] line outside all for loops, ex:
songs_list = []
for year in range(2015,2016):
for week in range(1,2):
# etc
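For completeness, a sketch of how the whole block from the question might look with the list created once up front (assuming driver and url are set up as in the question):
songs_list = []
for year in range(2015, 2016):
    for week in range(1, 2):
        page_url = url + str(year) + '/week-' + str(week)
        driver.get(page_url)
        links = [l.get_attribute('href')
                 for l in driver.find_elements_by_xpath("//a[@data-linktype='title']")]
        for link in links:
            driver.get(link)
            driver.find_element_by_xpath("//a[@href='#songinfo']").click()
            songs = driver.find_elements_by_xpath('//*[@id="songinfo"]/table/tbody/tr[2]/td')
            for s in songs:
                songs_list.append(s.get_attribute('innerHTML'))
print(songs_list)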
(Code below)
I'm scraping a website and the data I'm getting back is in 2 multi-dimensional arrays. I'm wanting everything to be in a JSON format because I want to save this and load it in again later when I add "tags".
So, less vague: I'm writing a program which takes in data like what characters you have and what the missions require you to do (you can complete multiple at once if the attributes align), and then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays, one for the headers of the table and one for the rows of the table. The rows contain the "answers" to the headers' "questions"/"titles"; i.e. Maximum Level, 50.
This is true for everything except the first entry, which is the Name/Pronunciation (and I just want to store the name, of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (While Iterations <= that)
HeaderArray[0] gives me the name
RowArray[Iterations + 1] gives me data type 2
RowArray[Iterations + 2] gives me data type 3
Repeat until Array[Iterations + 8]
Iterations +=9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not? Because my end goal here is to send "CharacterName" and get stuff back based on that AND be able to send in "DesiredTraits" and get "CharacterNames who fit that trait" back. Which means I also need to figure out how to store that category data semi-efficiently. There's over 80 possible categories and most only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - don't want a file for each character.
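As an illustration only, one possible single-file JSON shape is a dictionary keyed by character name, with the stats and the category tags stored together (the values below are placeholders, not real data):
characters = {
    "Mickey": {
        "Series": "Mickey & Friends",      # placeholder values only
        "SkillDescription": "...",
        "Categories": ["Black", "White Gloves", "Mickey & Friends"],
    },
}

# Then "which characters fit this trait?" becomes a simple lookup:
blue_tsums = [name for name, info in characters.items()
              if "Blue" in info["Categories"]]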
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os
target_dir = r"D:\00Coding\Js\WebScraper" #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir,'TsumData.txt')
StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'
def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
    links = [urljoin(URLPrefix, a['href']) for a in a_tags] # convert relative url to absolute url
    return links
def get_tds(link):
    soup = make_soup(link)
    #tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
    tds = soup.find_all('table', class_="wikia-infobox")
    RowArray = []
    HeaderArray = []
    if tds:
        for td in tds:
            #print(td.text.strip()) #This is everything
            rows = td.findChildren('tr')#[0]
            headers = td.findChildren('th')#[0]
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub( '\s+', ' ', cell_content).strip()
                    if clean_content:
                        RowArray.append(clean_content)
            for row in rows:
                cells = row.findChildren('th')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub( '\s+', ' ', cell_content).strip()
                    if clean_content:
                        HeaderArray.append(clean_content)
    print(HeaderArray)
    print(RowArray)
    return(RowArray, HeaderArray)
    #Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
    #print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
    #TempFile = open(fullname, 'w') #Read only, Write Only, Append
    #TempFile.write("EHLLO")
    #TempFile.close()
    #print(td.tbody.Series)
    #print(td.tbody[Series])
    #print(td.tbody["Series"])
    #print(td.data-name)
    #time.sleep(1)
if __name__ == '__main__':
    links = get_links(StartURL)
    MainHeaderArray = []
    MainRowArray = []
    MaxIterations = 60
    Iterations = 0
    for link in links: #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
        #print("Getting tds calling")
        if Iterations > 38: #There are this many webpages it'll first look at that don't have the data I need
            TempRA, TempHA = get_tds(link)
            MainHeaderArray.append(TempHA)
            MainRowArray.append(TempRA)
            MaxIterations -= 1
        Iterations += 1
        #print(MaxIterations)
        if MaxIterations <= 0: #I don't want to scrape the entire website for a prototype
            break
    #print("This is the end ??")
    #time.sleep(3)
    #jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
    print(MainHeaderArray)
    #time.sleep(2.5)
    #print(MainRowArray)
    #time.sleep(2.5)
    #print(zip())
    TsumName = []
    TsumSeries = []
    TsumBoxType = []
    TsumSkillDescription = []
    TsumFullCharge = []
    TsumMinScore = []
    TsumScoreIncreasePerLevel = []
    TsumMaxScore = []
    TsumFullUpgrade = []
    Iterations = 0
    MaxIterations = len(MainRowArray)
    while Iterations <= MaxIterations: #This will fire 1 time per Tsum
        print(Iterations)
        print(MainHeaderArray[Iterations][0]) #Holy this gives us Mickey ;
        print(MainHeaderArray[Iterations+1][0])
        print(MainHeaderArray[Iterations+2][0])
        print(MainHeaderArray[Iterations+3][0])
        TsumName.append(MainHeaderArray[Iterations][0])
        print(MainRowArray[Iterations][1])
        #At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
        #Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
        TsumSeries[Iterations] = MainRowArray[Iterations+1]
        TsumBoxType[Iterations] = MainRowArray[Iterations+2]
        TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
        TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
        TsumMinScore[Iterations] = MainRowArray[Iterations+5]
        TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
        TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
        TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
        Iterations += 9
        print(Iterations)
    print("It's Over")
    time.sleep(3)
    print(TsumName)
    print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back for a short break. With some minor looking over I see what happened - my append code is saying "Append this list to the array" meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my google searches or wondering why the list stuff isn't working (as it's expecting 1 value and gets a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use with open(file) as json_file and write/read (super easy).
Ultimately stored 3 json files. No big deal. Much easier than appending into one big file.
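A minimal sketch of that read/write pattern (the data structure here is just a placeholder):
import json

data = {"Mickey": ["Black", "White Gloves", "Mickey & Friends"]}  # whatever structure you build

with open("tsum_data.json", "w") as json_file:   # write
    json.dump(data, json_file, indent=2)

with open("tsum_data.json") as json_file:        # read back
    data = json.load(json_file)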
I have a BeautifulSoup problem that hopefully you can help me out with.
Currently, I have a website with a lot of links on it. The links lead to pages that contain the data of that item that is linked. If you want to check it out, it's this one: http://ogle.astrouw.edu.pl/ogle4/ews/ews.html. What I ultimately want to accomplish is to print out the links of the data that are labeled with an 'N'. It may not be apparent at first, but if you look closely on the website, some of the data have 'N' after their Star No, and others do not. Afterwards, I use that link to download a file containing the information I need on that data. The website is very convenient because the download URLs only change a bit from data to data, so I only need to change a part of the URL, as you'll see in the code below.
I currently have accomplished the data downloading part. However, this is where you come in. Currently, I need to put in the identification number of the BLG event that I desire. (This will become apparent after you view the code below.) However, the website is consistently updated over time, and having to manually search for 'N' events takes up unnecessary time. I want the Python code to be able to do it for me. My original thought was that I could have BeautifulSoup search through the text for all N's, but I ran into some issues accomplishing that. I feel like I am not familiar enough with BeautifulSoup to get done what I wish to get done. Some help would be appreciated.
The code I have currently is below. I have put in a range of BLG events that have the 'N' label as an example.
#Retrieve .gz files from URLs
from urllib.request import urlopen
import urllib.request
from bs4 import BeautifulSoup
#Access website
URL = 'http://ogle.astrouw.edu.pl/ogle4/ews/ews.html'
soup = BeautifulSoup(urlopen(URL))
#Select the desired data numbers
numbers = list(range(974,998))
x=0
for i in numbers:
    numbers[x] = str(i)
    x += 1
print(numbers)
#Get all links and put into list
allLinks = []
for link in soup.find_all('a'):
    list_links = link.get('href')
    allLinks.append(list_links)
#Remove None datatypes from link list
while None in allLinks:
    allLinks.remove(None)
#print(allLinks)
#Remove all links but links to data pages and gets rid of the '.html'
list_Bindices = [i for i, s in enumerate(allLinks) if 'b' in s]
print(list_Bindices)
bLinks = []
for x in list_Bindices:
    bLinks.append(allLinks[x])
bLinks = [s.replace('.html', '') for s in bLinks]
#print(bLinks)
#Create a list of indices for accessing those pages
list_Nindices = []
for x in numbers:
    list_Nindices.append([i for i, s in enumerate(bLinks) if x in s])
#print(type(list_Nindices))
#print(list_Nindices)
nindices_corrected = []
place = 0
while place < (len(list_Nindices)):
    a = list_Nindices[place]
    nindices_corrected.append(a[0])
    place = place + 1
#print(nindices_corrected)
#Get the page names (without the .html) from the indices
nLinks = []
for x in nindices_corrected:
    nLinks.append(bLinks[x])
#print(nLinks)
#Form the URLs for those pages
final_URLs = []
for x in nLinks:
    y = "ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/"+ x + "/phot.dat"
    final_URLs.append(y)
#print(final_URLs)
#Retrieve the data from the URLs
z = 0
for x in final_URLs:
    name = nLinks[z] + ".dat"
    #print(name)
    urllib.request.urlretrieve(x, name)
    z += 1
#hrm = urllib.request.urlretrieve("ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/blg-0974.tar.gz", "practice.gz")
This piece of code has taken me quite some time to write, as I am not a professional programmer, nor an expert in BeautifulSoup or URL manipulation in any way. In fact, I use MATLAB more than Python. As such, I tend to think in terms of MATLAB, which translates into less efficient Python code. However, efficiency is not what I am searching for in this problem. I can wait the extra five minutes for my code to finish if it means that I understand what is going on and can accomplish what I need to accomplish. Thank you for any help you can offer! I realize this is a fairly multi-faceted problem.
This should do it:
from urllib.request import urlopen
import urllib.request
from bs4 import BeautifulSoup
#Access website
URL = 'http://ogle.astrouw.edu.pl/ogle4/ews/ews.html'
soup = BeautifulSoup(urlopen(URL), 'html5lib')
Here, I'm using html5lib to parse the URL content.
Next, we'll look through the table, extracting links if the star names have a 'N' in them:
table = soup.find('table')
links = []
for tr in table.find_all('tr', {'class' : 'trow'}):
    td = tr.findChildren()
    if 'N' in td[4].text:
        links.append('http://ogle.astrouw.edu.pl/ogle4/ews/' + td[1].a['href'])
print(links)
Output:
['http://ogle.astrouw.edu.pl/ogle4/ews/blg-0974.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0975.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0976.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0977.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0978.html',
...
]
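From there, those links can be fed straight back into the download step from the question; a sketch, assuming the same 2017 FTP layout used there:
import urllib.request

for link in links:
    name = link.rsplit('/', 1)[-1].replace('.html', '')   # e.g. 'blg-0974'
    data_url = "ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/" + name + "/phot.dat"
    urllib.request.urlretrieve(data_url, name + ".dat")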
I'm writing a script that scans through a set of links. Within each link the script searches a table for a row. Once found, it increments the variable total_rank, which is the sum of the ranks found on each web page. The rank is equal to the row number.
The code looks like this and is outputting zero:
import requests
from bs4 import BeautifulSoup
import time
url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")
stat_links = []
for a in soup.select(".chooser-list ul"):
    list_entry = a.findAll('li')
    relative_link = list_entry[0].find('a')['href']
    link = "https://www.teamrankings.com" + relative_link
    stat_links.append(link)
total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    team_rows = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer table")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank
    # time.sleep(1)
print total_rank
Debugging shows that team_rows is empty after the select() call. The thing is, I've also tried different tags: for example soup.select(".scroll-wrapper div") and soup.select("#DataTables_Table_0_wrapper div"); all of them return nothing.
The selector
".tr-table datatable scrollable dataTable no-footer tr"
selects a <tr> element anywhere under a <no-footer> element, anywhere under a <dataTable> element, and so on.
I think "datatable scrollable dataTable no-footer" are really classes on your .tr-table element? In that case, they should be joined to the first class with periods. So I believe the final correct selector is:
".tr-table.datatable.scrollable.dataTable.no-footer tr"
UPDATE: the new selector looks like this:
".tr-table.datatable.scrollable.dataTable.no-footer table"
The problem here is that the first part, .tr-table.datatable... refers to the table itself. Assuming you're trying to get the rows of this table:
<table class="tr-table datatable scrollable dataTable no-footer" id="DataTables_Table_0" role="grid">
The proper selector remains the one I originally suggested.
@audiodude's answer is correct, though the suggested selector is not working for me.
You don't need to check every single class of the table element. Here is the working selector:
team_rows = soup.select("table.datatable tr")
Also, if you need to find Oklahoma inside the table, you don't have to iterate over every row and cell in the table. Just search directly for the specific cell and get the previous sibling cell containing the rank:
rank = soup.find("td", {"data-sort": "Oklahoma"}).find_previous_sibling("td").get_text()
total_rank += int(rank) # it is important to convert the row number to int
Also note that you are extracting more stats links than you should - looks like the Player Stats links should not be followed since you are focused specifically on the Team Stats. Here is one way to get Team Stats links only:
links_list = soup.find("h2", text="Team Stats").find_next_sibling("ul")
stat_links = ["https://www.teamrankings.com" + a["href"]
              for a in links_list.select("ul.expand-content li a[href]")]
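Putting those pieces together, the main loop might look roughly like this (a sketch based on the snippets above, not tested against the live site):
total_rank = 0
for link in stat_links:
    soup = BeautifulSoup(requests.get(link).text, "html.parser")
    cell = soup.find("td", {"data-sort": "Oklahoma"})
    if cell:  # the team may not appear in every table
        rank = cell.find_previous_sibling("td").get_text()
        total_rank += int(rank)
print(total_rank)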