How to get the positions of get_text in Beautiful Soup - python

I am trying to store the result of my get_text in variables.
I am filtering my HTML to find the information I need. For example, if I want to extract a case number ('número de radicado'), several of them may be present. This is how the information I get is displayed:
<span cetxt\"="" class='\"rSpnValor' vidc0='\"74922\"'>74922</span>
<span cetxt\"="" class='\"rSpnValor' vidc0='\"75005\"'>75005</span>
With get_text it would look like this:
74922
75005
I share a bit of my code:
def getValBySpanName(name):
    dataArray = soup.find_all('div', {'class': '\\"rDivDatosAseg'})
    for data in dataArray:
        data_container = data
        spans_data = data_container.find_all("span")
        info = []
        if spans_data[0].get_text() == name:
            container_values = spans_data[1].get_text()
            return container_values

file_number = getValBySpanName('Número de radicado')
print(file_number)
The problem is that I only get the first value, "74922", as a result. I need a way to store each value (I will later insert this data into SQL), so I need to save them one by one.
I tried iterating over the result with a for loop, but it iterates over the characters of the first value, giving something like '7, 4, 9, 2, 2'.

If I understand you correctly, you are probably looking for something like this:
dataArray = soup.select('div.\\"rSpnValor span')
container_values = []
for data in dataArray:
    container_values.append(data.text)
print(container_values)
Output:
['74922', '75005']
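If you also want to keep the lookup-by-name approach from the question, the function can collect every matching value into a list instead of returning on the first hit. A minimal sketch of that change, assuming the same soup object and class names as in the question:
def getValsBySpanName(name):
    # collect every matching value instead of returning on the first hit
    values = []
    for data in soup.find_all('div', {'class': '\\"rDivDatosAseg'}):
        spans_data = data.find_all("span")
        # span[0] holds the label, span[1] the value, as in the original function
        if len(spans_data) > 1 and spans_data[0].get_text() == name:
            values.append(spans_data[1].get_text())
    return values

file_numbers = getValsBySpanName('Número de radicado')
print(file_numbers)  # e.g. ['74922', '75005']
Returning a list keeps each value separate, so they can be inserted into SQL one row at a time.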

Related

BeautifulSoup4: Extracting tables, now how do I exclude certain tags and bits of information I do not want

Trying to extract coin names, prices, and market caps from coinmarketcap.com. I first tried using soup.find_all to search for certain tags with a specific class, but it always picked up information I didn't need or want. So instead I used find_all to search for 'td', planning to use a for loop to look for specific class names, append those to a new list, and then print that list, but it returns a data type for some reason.
coin_table = soup.find_all('td')
class_value = 'sc-1eb5slv-0 iJjGCS'
for i in coin_table:
    if class_value in coin_table:
        list.append(i)
print(list)
But this returns:
<class 'list'>
to the console, even though I'm not asking to see the data type. I'm very new to BeautifulSoup and coding in general, so sorry if this is a very basic question. Still trying to get my head around all of this.
As @RJAdriaansen mentioned, you don't need to scrape the website when it provides an API. Here is how you do it with the requests library:
import requests
url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,tags,platform,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
response = requests.get(url)
data = response.json()
This will give you JSON data. Now you can grab everything you need by accessing the correct keys:
final_list = []
temp = []
for each_crypto in data['data']['cryptoCurrencyList']:
    temp.append(each_crypto['name'])
    # each_crypto['quotes'] gives you a list with the price and market cap of each crypto
    for quote in each_crypto['quotes']:
        # assuming you want the USD price of each crypto
        if quote['name'] == "USD":
            temp.append(quote['price'])
            temp.append(quote['marketCap'])
    final_list.append(temp)
    temp = []
Final result would look like this:
[
['Bitcoin', 34497.01819639692, 646704595579.0485],
['Ethereum', 2195.11816422801, 255815488972.87268],
['Tether', 1.0003936138399, 62398426501.02027],
['Binance Coin', 294.2550537711805, 45148405357.003],
...
]
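For completeness, if you did want to stay with scraping, the original loop has two bugs: it checks class_value in coin_table (the whole result list) instead of inspecting each tag, and it appends to the built-in name list. A minimal corrected sketch, assuming soup was built from the page as in the question and that the class name there is still current:
coin_table = soup.find_all('td')
class_value = 'sc-1eb5slv-0 iJjGCS'

coins = []  # a real list, not the built-in name `list`
for tag in coin_table:
    # tag.get('class', []) returns the classes as a list, e.g. ['sc-1eb5slv-0', 'iJjGCS']
    if ' '.join(tag.get('class', [])) == class_value:
        coins.append(tag.get_text())
print(coins)
That said, the API route above is more reliable, since these generated class names change whenever the site is rebuilt.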

Why can't I use a variable in a results.find_all?

I'm trying to do a pretty basic web scrape but I'd like to be able to use variables so I don't have to keep repeating code for multiple pages.
example line of code:
elems = results.find_all("span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P")
I would like it to read as:
elems = results.find_all(<variable>)
and then have my variable be:
'"span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"'
However, when I do this, I get no results. I've included the rest of the function below. Does anyone know why this will not work?
EDIT:
I've also tried splitting it up like the example below, but I still get the same issue:
elems = results.find_all(variable1, class_=variable2)
variable1 = '"span"'
variable2 = '"st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"'
code:
def get_name(sym, result, elem, name):
    url = URL + sym
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(id=result)
    elems = results.find_all(elem)
    for elem in elems:
        name_elem = elem.find(name)
        print(name_elem.text)
get_name('top',"app",'"span",class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"','"span","st_3lrv4Jo"')
The find_all method takes more than one parameter. You are passing everything as a single string in the first argument, which will struggle to find anything.
You need to split your variable '"span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"' into two variables:
elem = "span" and class_name = "st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"
and in your code it will look like:
elems = results.find_all(elem, class_=class_name)
Just a few more things: according to the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) and what I can find online, multiple class values can also be passed as a dict with a list of strings, so your call could look like:
find_all(elem, {'class': ['st_2XVIMK7', 'st_8u0ePN3', 'st_2oUi2Vb', 'st_3kXJm4P']})
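Putting this together, here is a hedged sketch of how the original get_name function could take the tag name and classes as separate arguments. The base URL below is a placeholder (the question never shows the real one), and the id and class names are taken from the question and may have changed:
import requests
from bs4 import BeautifulSoup

# Hypothetical base URL for illustration only
URL = 'https://example.com/symbol/'

def get_name(sym, result_id, elem, class_names, name_elem, name_class):
    page = requests.get(URL + sym)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(id=result_id)
    if results is None:
        return  # the id wasn't found on the page
    # tag name and classes are passed as separate arguments, not one string
    for e in results.find_all(elem, class_=class_names):
        name_tag = e.find(name_elem, class_=name_class)
        if name_tag:
            print(name_tag.text)

get_name('top', 'app', 'span', 'st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P',
         'span', 'st_3lrv4Jo')
Passing a space-separated string to class_ matches tags whose class attribute is exactly that string, which is what the original single-string attempt was trying to express.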

Take only the second element that has the same name in BeautifulSoup

I'm scraping a website for college work, and I am having trouble getting only the second text in a span. I have seen that you can use the code below to get the text:
gross = container.find_all('span', attrs = {'name':'nv'})
print(gross)
I have as a result this:
[<span data-value="845875" name="nv">845.875</span>, <span data-value="335.451.311" name="nv">$335.45M</span>]
How do I get only the value contained within the second data-value, in a way that can be replicated for other spans?
Try this.
gross = container.find_all('span', attrs = {'name':'nv', 'data-value':'335.451.311'})
print(gross)
If this data-value keeps changing, then you don't have any other choice but to use gross[1].
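In other words, since find_all returns a list, indexing is the simplest way to take the second match. A minimal self-contained sketch using the HTML from the question:
from bs4 import BeautifulSoup

html = '''<span data-value="845875" name="nv">845.875</span>
<span data-value="335.451.311" name="nv">$335.45M</span>'''
container = BeautifulSoup(html, 'html.parser')

spans = container.find_all('span', attrs={'name': 'nv'})
if len(spans) > 1:
    print(spans[1]['data-value'])  # '335.451.311'
    print(spans[1].get_text())     # '$335.45M'
The len check guards against containers that only have one matching span.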

Handle Key-Error whilst scraping

I am currently working on a script to scrape data from ClinicalTrials.gov. To do this I have written the following script:
def clinicalTrialsGov(id):
    url = "https://clinicaltrials.gov/ct2/show/" + id + "?displayxml=true"
    data = BeautifulSoup(requests.get(url).text, "lxml")
    studyType = data.study_type.text
    if studyType == 'Interventional':
        allocation = data.allocation.text
        interventionModel = data.intervention_model.text
        primaryPurpose = data.primary_purpose.text
        masking = data.masking.text
    enrollment = data.enrollment.text
    officialTitle = data.official_title.text
    condition = data.condition.text
    minAge = data.eligibility.minimum_age.text
    maxAge = data.eligibility.maximum_age.text
    gender = data.eligibility.gender.text
    healthyVolunteers = data.eligibility.healthy_volunteers.text
    armType = []
    intType = []
    for each in data.findAll('intervention'):
        intType.append(each.intervention_type.text)
    for each in data.findAll('arm_group'):
        armType.append(each.arm_group_type.text)
    citedPMID = tryExceptCT(data, '.results_reference.PMID')
    citedPMID = data.results_reference.PMID
    print(citedPMID)
    return officialTitle, studyType, allocation, interventionModel, primaryPurpose, masking, enrollment, condition, minAge, maxAge, gender, healthyVolunteers, armType, intType
However, this script won't always work, as not all studies contain all of these items (i.e. a KeyError will occur). To resolve this, I could simply wrap each statement in a try-except block, like this:
try:
    studyType = data.study_type.text
except:
    studyType = ""
but that seems like a bad way to implement this. What's a better/cleaner solution?
This is a good question. Before I address it, let me say that you should consider changing the second parameter of the BeautifulSoup (BS) constructor from 'lxml' to 'xml'. Otherwise, BS does not flag the parsed markup as XML (to verify this for yourself, access the is_xml attribute on the data variable in your code).
You can avoid generating an error when attempting to access a non-existent element by passing a list of desired element names to the find_all() method:
subset = ['results_reference','allocation','interventionModel','primaryPurpose','masking','enrollment','eligibility','official_title','arm_group','condition']
tag_matches = data.find_all(subset)
Then, if you want to get a specific element from the list of Tags without iterating through it, you can convert it to a dict using the Tag names as keys:
tag_dict = dict((tag_matches[i].name, tag_matches[i]) for i in range(0, len(tag_matches)))
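To illustrate how that dict can replace the per-field try-excepts, here is a hedged sketch that continues from tag_matches above; field() is a hypothetical helper name, and the element names come from the question's code:
# continuing from tag_matches built by find_all(subset) above
tag_dict = dict((tag_matches[i].name, tag_matches[i]) for i in range(0, len(tag_matches)))

def field(name, default=''):
    # hypothetical helper: return the element's text if it was found, else a default
    tag = tag_dict.get(name)
    return tag.text if tag is not None else default

officialTitle = field('official_title')
enrollment = field('enrollment')
condition = field('condition')  # empty string when the study omits this element
Missing elements simply fall back to the default, so no exception handling is needed per field.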

Scraping a URL host and primary path plus query string to produce a list of all possible additional extensions for that query

I am working on a project to to scrape table data from a url.
The main web domain is https://www.pro-football-reference.com. I have already written the code to scrape the table data from this domain.
A search for statistical data on this site starts with a query with set parameters. For example, here is the URL with table data for all players in the National Football League who have thrown at least 25 passes in their careers:
input_url = 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td'
But this URL only contains the table data for players 1-100 on the list, so there are seven additional URLs with 100 players each and one more with 81 players.
The second URL from this query, containing players 101-200, is here:
url_passing2 = 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=100'
Notice that these are exactly the same until the very last part, where there is the additional extension string '&offset=100'. Each additional page has the same host/path/query string plus '&offset=200', '&offset=300', '&offset=400', and so on up to '&offset=800'.
My question is this: starting with a url like this, how can I create a Python function that will collect a list of all of the possible url iterations from this host/path/query string, so that I can get the entire list of players who match this query?
My desired output would be a list that looks something like this:
list_or_urls: ['https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=100', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=200', 
'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=300', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=400', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=500', 
'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=600', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=700', 'https://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=combined&year_min=1920&year_max=2016&season_start=1&season_end=-1&pos%5B%5D=qb&pos%5B%5D=rb&pos%5B%5D=wr&pos%5B%5D=te&pos%5B%5D=e&pos%5B%5D=t&pos%5B%5D=g&pos%5B%5D=c&pos%5B%5D=ol&pos%5B%5D=dt&pos%5B%5D=de&pos%5B%5D=dl&pos%5B%5D=ilb&pos%5B%5D=olb&pos%5B%5D=lb&pos%5B%5D=cb&pos%5B%5D=s&pos%5B%5D=db&pos%5B%5D=k&pos%5B%5D=p&draft_year_min=1936&draft_year_max=2017&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=pick_overall&conference=any&draft_pos%5B%5D=qb&draft_pos%5B%5D=rb&draft_pos%5B%5D=wr&draft_pos%5B%5D=te&draft_pos%5B%5D=e&draft_pos%5B%5D=t&draft_pos%5B%5D=g&draft_pos%5B%5D=c&draft_pos%5B%5D=ol&draft_pos%5B%5D=dt&draft_pos%5B%5D=de&draft_pos%5B%5D=dl&draft_pos%5B%5D=ilb&draft_pos%5B%5D=olb&draft_pos%5B%5D=lb&draft_pos%5B%5D=cb&draft_pos%5B%5D=s&draft_pos%5B%5D=db&draft_pos%5B%5D=k&draft_pos%5B%5D=p&c1stat=pass_att&c1comp=gt&c1val=25&c5val=1.0&order_by=pass_td&offset=800']
Or, more concisely:
list_of_urls = ['&offset=0', '&offset=100', '&offset=200', '&offset=300', '&offset=400', '&offset=500', '&offset=600', '&offset=700', '&offset=800']
The following is my attempt at creating the function so far. My approach is to iterate through the URLs and check whether there is a table on each one: if there is, append the URL to my output list; if there is not, exit the function. But this only produces a list of the first two URLs; it's not looping back to append the last seven.
def get_url_list(frontpage_url):
    offset_extension = ''
    output_list = [frontpage_url]
    x = 0
    for url in output_list:
        results_table = pd.read_html(url)
        table_results = pd.DataFrame(results_table)
        if table_results.empty == False:
            x += 1
            offset_extension = '&offset=' + '%d' % (100 * x)
            output_list.append(frontpage_url + offset_extension)
        else:
            exit
    return output_list[1:-1]

urls_list_output = get_url_list(sports_url_starter)
Your approach looks okay, but your for loop is incorrect. When using for you don't need to index into the list you are looping over, i.e. you shouldn't use output_list[x].
Try replacing it with something like:
results = []
for url in output_list:
    try:
        # read_html returns a list of dataframes, so add all of them to the results
        results.extend(pd.read_html(url))
        ...
        output_list.append(new_url)
    except ValueError:
        # if there are no tables on the page, return what you have so far
        return results
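Since the offsets follow a fixed pattern, another option is to build the URL list directly, without fetching any pages. A minimal sketch, assuming the query pages in steps of 100 up to a known maximum offset:
def get_url_list(frontpage_url, max_offset=800, step=100):
    # the first page has no offset; the rest append &offset=100, &offset=200, ...
    urls = [frontpage_url]
    for offset in range(step, max_offset + step, step):
        urls.append(frontpage_url + '&offset=' + str(offset))
    return urls

urls_list_output = get_url_list(input_url)  # 9 URLs: offsets 0 through 800
If the total number of pages is unknown, the probe-until-ValueError approach above is the way to go; this version is only for when you already know how many results the query returns.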
