I'm having a problem with a for loop. In the script, I use a text list to build a URL and then run a for loop over each element of the list. Once I have all the URLs, I want to extract information from the website. That's where I have the problem.
I checked the program and it is building the correct URLs, but I don't know how to extract the information for all elements of the list; it only ever uses the 1st URL.
Does anyone have an idea where I'm going wrong?
import urllib2
import re
from bs4 import BeautifulSoup
import time

date = (time.strftime('%Y%m%d'))
symbolslist = open('pistas.txt').read().split()
for symbol in symbolslist:
    url = "http://trackinfo.com/entries-race.jsp?raceid=" + symbol + "$" + date + "A01"
    htmltext = BeautifulSoup(urllib2.urlopen(url).read())
    names = soup.findAll('a', {'href': re.compile("dog")})
    for name in names:
        results = ' '.join(name.string.split())
        print results
And this is the text list:
GBM
GBR
GCA
GDB
GSP
GDQ
GEB
Hey, try this: you assign the parsed page to htmltext but then query soup, which doesn't exist; use the same name for both:
import urllib2
import re
from bs4 import BeautifulSoup
import time

date = (time.strftime('%Y%m%d'))
symbolslist = open('pistas.txt').read().split()
for symbol in symbolslist:
    url = "http://trackinfo.com/entries-race.jsp?raceid=" + symbol + "$" + date + "A01"
    htmltext = BeautifulSoup(urllib2.urlopen(url).read())
    names = htmltext.findAll('a', {'href': re.compile("dog")})
    for name in names:
        results = ' '.join(name.string.split())
        print results
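As a side note, newer versions of BeautifulSoup warn when no parser is specified; a hedged variant of the parsing line that silences the warning:

htmltext = BeautifulSoup(urllib2.urlopen(url).read(), "html.parser")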
Python code:
import string
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/players/'
initial = list(string.ascii_lowercase)
initial_url = [url + i for i in initial]
html_initial = [urllib.request.urlopen(i).read() for i in initial_url]
soup_initial = [BeautifulSoup(i, 'html.parser') for i in html_initial]
tags_initial = [i('a') for i in soup_initial]
print(tags_initial[0][50])
Results example:
Shareef Abdur-Rahim
From the example above, I want to extract the name of the player, which is 'Shareef Abdur-Rahim', but I want to do it for all of the tags_initial lists.
Does anyone have an idea?
Could you modify your post by adding your code so that we can help you better?
Maybe this could help you:
name = soup.findAll(YOUR_SELECTOR)[0].string
UPDATE
import re
import string
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'https://www.basketball-reference.com/players/'

# Alphabet
initial = list(string.ascii_lowercase)
datas = []

# URLs
urls = [url + i for i in initial]

for url in urls:
    # Soup object
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    # Players links
    url_links = soup.findAll("a", href=re.compile("players"))
    for link in url_links:
        # Player name
        datas.append(link.string)

print("datas : ", datas)
Then, "datas" contains all the names of the players, but I advise you to do a little processing afterwards to remove some erroneous information like "..." or perhaps duplicates
There are probably better ways but I'd do it like this:
html = "a href=\"/teams/LAL/2021.html\">Los Angeles Lakers</a"
index = html.find("a href")
index = html.find(">", index) + 1
index_end = html.find("<", index)
print(html[index:index_end])
If you're using a scraper library it probably has a similar function built-in.
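With BeautifulSoup, for example, the equivalent would be something like:

from bs4 import BeautifulSoup

html = "<a href=\"/teams/LAL/2021.html\">Los Angeles Lakers</a>"
print(BeautifulSoup(html, "html.parser").a.get_text())  # Los Angeles Lakers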
I'm trying to parse all the query strings present in a page, so that using those query strings I can navigate to a specific page. The code I tried is below:
import requests
from bs4 import BeautifulSoup
import datetime
import dateutil.parser
import time
import pytz

"""python espncricinfo library module https://github.com/dwillis/python-espncricinfo """
from espncricinfo.match import Match
from espncricinfo.exceptions import MatchNotFoundError, NoScorecardError

"""----time-zone-calculation----"""
time_zone = pytz.timezone("Asia/Kolkata")
datetime_today = datetime.datetime.now(time_zone)
datestring_today = datetime_today.strftime("%Y-%m-%d")

"""------URL of page to parse-------with a date of today-----"""
url = "http://www.espncricinfo.com/ci/engine/match/index.html?date=" + datestring_today
"""eg. url = http://www.espncricinfo.com/ci/engine/match/index.html?date=2018-02-12"""

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

"""------parsing for matchno------"""
match_no = [x['href'].split('/', 4)[4].split('.')[0] for x in
            soup.findAll('a', href=True, text='Scorecard')]

for p in match_no:
    """where p is a match no, e.g. p = '1122282'"""
    m = Match(p)
    m.latest_batting
    print(m.latest_batting)
When I print match_no, I get this output:
['8890/scorecard/1118760/andhra-vs-tamil-nadu-group-c-vijay-hazare-trophy-2017-18/', '8890/scorecard/1118743/assam-vs-odisha-group-a-vijay-hazare-trophy-2017-18/', '8890/scorecard/1118745/bengal-vs-delhi-group-b-vijay-hazare-trophy-2017-18/', '8890/scorecard/1118763/chhattisgarh-vs-vidarbha-group-d-vijay-hazare-trophy-2017-18/']
This page (http://www.espncricinfo.com/ci/engine/match/index.html?date=datestring_today) contains all the match_nos of the games happening on that day. I want to trim this down to just the 7-digit match_no [1118760, 1118743, 1118745, ...]. How can I do this? Then, using that match_no, I can pass it to Match() to get the details of the particular match happening that day.
PS: if no match is going on that day, then match_no returns none.
First, your code is very hard to read. You need to let your code breathe and make it appealing for others to read.
Second, what is causing the issue is probably this line:
match_no = [x['href'].split('/',4)[4].split('.')[0] for x in soup.findAll('a', href=True, text='Scorecard')]
It is hard to read too. There are far better and more readable ways of parsing a match id from a URL.
Here is an example of what should be working. I took a provisional date for the matches:
import re
import pytz
import requests
import datetime
from bs4 import BeautifulSoup

"""python espncricinfo library module https://github.com/dwillis/python-espncricinfo """
from espncricinfo.exceptions import MatchNotFoundError, NoScorecardError
from espncricinfo.match import Match


def get_match_id(link):
    match_id = re.search(r'([0-9]{7})', link)
    if match_id is None:
        return None
    return match_id.group()


# ----time-zone-calculation----
time_zone = pytz.timezone("Asia/Kolkata")
datetime_today = datetime.datetime.now(time_zone)
datestring_today = datetime_today.strftime("%Y-%m-%d")

# ------URL of page to parse-------with a date of today-----
url = "http://www.espncricinfo.com/ci/engine/match/index.html?date=" + datestring_today

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

spans = soup.findAll('span', {"class": "match-no"})

matches_ids = []
for s in spans:
    # guard against anchors that have no href at all
    for a in s.findAll('a', href=lambda href: href and 'scorecard' in href):
        match_id = get_match_id(a['href'])
        if match_id is None:
            continue
        matches_ids.append(match_id)

# ------parsing for matchno------
for p in matches_ids:
    # where p is a match no, e.g. p = '1122282'
    m = Match(p)
    m.latest_batting
    print(m.latest_batting)
Now, I didn't have every lib that you are using here, but this should give you an idea of how to do it.
Once again, my advice is that empty lines are your friends. They are the reader's friends for sure. Make your code 'breathe'.
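As a quick sanity check, running get_match_id on one of the hrefs from your printed output picks out the 7-digit id:

link = '8890/scorecard/1118760/andhra-vs-tamil-nadu-group-c-vijay-hazare-trophy-2017-18/'
print(get_match_id(link))  # 1118760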
I was making a program that would collect the value of the cryptocurrency Verge. This script did the trick:
import urllib2
from bs4 import BeautifulSoup

url = "https://coinmarketcap.com/currencies/verge/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
find_value = soup.find('span', attrs={'class': 'text-large2'})
price = find_value.text
The issue, though, was that the result was in USD and I live in Australia. So I then put that value into a USD-to-AUD converter. I tried the following code:
url2="http://www.xe.com/currencyconverter/convert/?
Amount="+price+"&From=USD&To=AUD"
print url2
page2=urllib2.urlopen(url2)
soup2=BeautifulSoup(page2,"html.parser")
find_value2=soup.find('span',attrs={'class':'uccResultAmount'})
print find_value2
The result was that I would get the right URL but the wrong result. Could anybody tell me where I am going wrong? Thank you.
You can use regular expressions to scrape the currency converter:
import urllib
import re
from bs4 import BeautifulSoup


def convert(**kwargs):
    url = "http://www.xe.com/currencyconverter/convert/?Amount={amount}&From={from_curr}&To={to_curr}".format(**kwargs)
    data = str(urllib.urlopen(url).read())
    val = map(float, re.findall("(?<=uccResultAmount'>)[\d\.]+", data))
    return val[0]


url = "https://coinmarketcap.com/currencies/verge/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
find_value = soup.find('span', attrs={'class': 'text-large2'})

print convert(amount=float(find_value.text), from_curr='USD', to_curr='AUD')
Output:
0.170358
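Incidentally, the wrong result in your own snippet comes from querying the first page's soup instead of soup2; a minimal fix sketch of your original approach:

# query the converter page (soup2), not the coinmarketcap page (soup)
find_value2 = soup2.find('span', attrs={'class': 'uccResultAmount'})
print find_value2.text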
The script used to work, but it no longer does and I can't figure out why. I am trying to go to the link and extract/print the religion field. Using Firebug, the religion field entry is within the 'tbody' and then 'td' tag structure. But now the script finds "none" when searching for these tags. I also looked at the lxml output via 'print Soup_FamSearch', and I couldn't see any of the 'tbody' and 'td' tags that appeared in Firebug.
Please let me know what I am missing.
import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time
from unicodedata import normalize

FamSearchURL = 'https://familysearch.org/pal:/MM9.1.1/KH21-211'
OpenFamSearchURL = urllib2.urlopen(FamSearchURL)
Soup_FamSearch = BeautifulSoup(OpenFamSearchURL, 'lxml')
OpenFamSearchURL.close()

tbodyTags = Soup_FamSearch.find('tbody')
trTags = tbodyTags.find_all('tr', class_='result-item ')

for trTag in trTags:
    tdTags_label = trTag.find('td', class_='result-label ')
    if tdTags_label:
        tdTags_label_string = tdTags_label.get_text(strip=True)
        if tdTags_label_string == 'Religion: ':
            print trTag.find('td', class_='result-value ')
Find the Religion: label by text and get the next td sibling:
soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)
Demo:
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> response = requests.get('https://familysearch.org/pal:/MM9.1.1/KH21-211')
>>> soup = BeautifulSoup(response.content, 'lxml')
>>>
>>> soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)
Methodist
Then, you can make a nice reusable function and reuse it:
def get_field_value(soup, field):
    return soup.find(text='%s:' % field).parent.find_next_sibling('td').get_text(strip=True)

print get_field_value(soup, 'Religion')
print get_field_value(soup, 'Nationality')
print get_field_value(soup, 'Birthplace')
Prints:
Methodist
Canadian
Ontario
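If a field can be absent from the page, a more defensive variant (my own sketch, not part of the original answer) returns None instead of raising an AttributeError:

def get_field_value_safe(soup, field):
    # return None when the label is missing instead of raising AttributeError
    label = soup.find(text='%s:' % field)
    if label is None:
        return None
    value = label.parent.find_next_sibling('td')
    return value.get_text(strip=True) if value else None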
I am parsing HTML from the following website: http://www.asusparts.eu/partfinder/Asus/All In One/E Series. I was just wondering if there is any way I could explore a parsed attribute in Python?
For example, the code below outputs the following:
datas = s.find(id='accordion')
a = datas.findAll('a')
for data in a:
    if(data.has_attr('onclick')):
        model_info.append(data['onclick'])
        print data
[OUTPUT]
<a onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>
These are the values I would like to retrieve:
nCategoryID = Bracket
nModelID = ET10B
family = E Series
As the page is rendered with AJAX, they are using a script source, resulting in the following URL from the script file:
url = 'http://json.zandparts.com/api/category/GetCategories/' + country + '/' + currency + '/' + nModelID + '/' + family + '/' + nCategoryID + '/' + brandName + '/' + null
How can I retrieve only the 3 values listed above?
[EDIT]
import string, urllib2, urlparse, csv, sys
from urllib import quote
from urlparse import urljoin
from bs4 import BeautifulSoup
from ast import literal_eval

changable_url = 'http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series'
page = urllib2.urlopen(changable_url)
base_url = 'http://www.asusparts.eu'
soup = BeautifulSoup(page)

# Array to hold all options
redirects = []
# Array to hold all data
model_info = []

print "FETCHING OPTIONS"
select = soup.find(id='myselectListModel')
#print select.get_text()
options = select.findAll('option')

for option in options:
    if(option.has_attr('redirectvalue')):
        redirects.append(option['redirectvalue'])

for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    #print s

    print "FETCHING MAIN TITLE"
    # Finding all the headings for each specific Model
    maintitle = s.find(id='puffBreadCrumbs')
    print maintitle.get_text()

    # Find entire HTML container holding all data, rendered by AJAX
    datas = s.find(id='accordion')
    # Find all 'a' tags inside data container
    a = datas.findAll('a')
    # Find all 'span' tags inside data container
    content = datas.findAll('span')

    print "FETCHING CATEGORY"
    # Find all 'a' tags which have an attribute of 'onclick'
    # Error: (doesn't display anything, can't seem to find the 'onclick' attr)
    if(hasattr(a, 'onclick')):
        arguments = literal_eval('(' + a['onclick'].replace(', this', '').split('(', 1)[1])
        model_info.append(arguments)
        print arguments  # arguments[1] + " " + arguments[3] + " " + arguments[4]

    print "FETCHING DATA"
    for complete in content:
        # Find all 'class' attributes inside 'span' tags
        if(complete.has_attr('class')):
            model_info.append(complete['class'])
            print complete.get_text()

    # Find all 'table data cells' inside table held in data container
    print "FETCHING IMAGES"
    img = s.find('td')
    # Find all 'img' tags held inside these 'td' cells and print out
    images = img.findAll('img')
    print images
I have added an Error comment in the code where the problem lies...
Similar to Martijn's answer, but this makes primitive use of pyparsing (i.e., it could be refined to recognise the function and only take the quoted strings within the parentheses):
from bs4 import BeautifulSoup
from pyparsing import QuotedString
from itertools import chain

s = '''<a onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>'''
soup = BeautifulSoup(s)

for a in soup('a', onclick=True):
    print list(chain.from_iterable(QuotedString("'", unquoteResults=True).searchString(a['onclick'])))
    # ['Asus', 'Bracket', 'ET10B', '7138', 'E Series']
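A sketch of that refinement, with a grammar of my own (an assumption, not from the original answer): anchor on the function call and keep only the quoted arguments inside its parentheses, dropping the bare this:

from pyparsing import Suppress, Word, alphas, alphanums, delimitedList, QuotedString, Keyword

# match name(...) and keep only the quoted arguments, suppressing `this`
call = (Suppress(Word(alphas, alphanums + '_')) + Suppress('(')
        + delimitedList(QuotedString("'") | Suppress(Keyword('this')))
        + Suppress(')'))

onclick = "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')"
print list(call.parseString(onclick))
# ['Asus', 'Bracket', 'ET10B', '7138', 'E Series']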
You could parse that as a Python literal, if you remove the this part from it and only take everything between the parentheses:
from ast import literal_eval

if data.has_attr('onclick'):
    arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
    model_info.append(arguments)
    print arguments
We remove the this argument because it is not a valid Python string literal, and you don't want to have it anyway.
Demo:
>>> literal_eval('(' + "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')".replace(', this', '').split('(', 1)[1])
('Asus', 'Bracket', 'ET10B', '7138', 'E Series')
Now you have a Python tuple and can pick out any value you like.
You want the values at indices 1, 2 and 4, for example:
nCategoryID, nModelID, family = arguments[1], arguments[2], arguments[4]
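From there, a sketch of plugging those values into the AJAX URL from the question (country, currency and brandName are placeholder values I made up; the question doesn't show what they should be):

# hypothetical values -- not shown in the question
country, currency, brandName = 'NL', 'EUR', 'Asus'

nCategoryID, nModelID, family = arguments[1], arguments[2], arguments[4]
url = ('http://json.zandparts.com/api/category/GetCategories/'
       + country + '/' + currency + '/' + nModelID + '/' + family
       + '/' + nCategoryID + '/' + brandName + '/null')
print url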