I'm trying to scrape multiple tables with the same class name using BeautifulSoup 4 and Python.
import requests
from bs4 import BeautifulSoup
import csv
standingsURL = "https://efl.network/index/efl/Standings.html"
standingsPage = requests.get(standingsURL)
standingsSoup = BeautifulSoup(standingsPage.content, 'html.parser')
standingTable = standingsSoup.find_all('table', class_='Grid')
standingTitles = standingTable.find_all("tr", class_='hilite')
standingHeaders = standingTable.find_all("tr", class_="alt")
However, when I run this, it gives me the error:
Traceback (most recent call last):
File "C:/Users/user/Desktop/program.py", line 15, in <module>
standingTitles = standingTable.find_all("tr", class_='hilite')
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\bs4\element.py", line 2128, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
If I replace standingTable = standingsSoup.find_all('table', class_='Grid') with
standingTable = standingsSoup.find('table', class_='Grid')
it works, but it only gives me the data from one of the tables, while I'm trying to get the data from both.
Try this.
from simplified_scrapy import SimplifiedDoc,req,utils
standingsURL = "https://efl.network/index/efl/Standings.html"
standingsPage = req.get(standingsURL)
doc = SimplifiedDoc(standingsPage)
standingTable = doc.selects('table.Grid')
standingTitles = standingTable.selects("tr.hilite")
standingHeaders = standingTable.selects("tr.alt")
print(standingTitles.tds.text)
Result:
[[['Wisconsin Brigade', '10', '3', '0', '.769', '386', '261', '6-1-0', '4-2-0', '3-2-0', '3-2-0', 'W4'], ...
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
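If you would rather stay with BeautifulSoup, the error message itself points at the fix: find_all() returns a ResultSet (a list of tables), so loop over it and call find_all() on each table. A minimal sketch, assuming a made-up two-table page in place of the real standings HTML; SAMPLE_HTML and rows_by_class are hypothetical names, not part of the original code:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the standings page: two tables sharing class "Grid".
SAMPLE_HTML = """
<table class="Grid">
  <tr class="hilite"><td>Team A</td><td>10</td></tr>
  <tr class="alt"><td>Team B</td><td>8</td></tr>
</table>
<table class="Grid">
  <tr class="hilite"><td>Team C</td><td>7</td></tr>
  <tr class="alt"><td>Team D</td><td>5</td></tr>
</table>
"""

def rows_by_class(html, row_class):
    """Return the cell text of every <tr class=row_class> in every Grid table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # find_all() returns a list-like ResultSet, so iterate over the tables
    for table in soup.find_all("table", class_="Grid"):
        # each table is a single Tag, so find_all() is available here
        for tr in table.find_all("tr", class_=row_class):
            rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows

standing_titles = rows_by_class(SAMPLE_HTML, "hilite")
standing_headers = rows_by_class(SAMPLE_HTML, "alt")
print(standing_titles)
print(standing_headers)
```

With the real page you would pass standingsPage.content instead of SAMPLE_HTML; the row classes are taken from the question and may not match every table on the site.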
Related
I'm trying to get text from "Redeemed Highlight My Message" twitch chat, here is my code.
from selenium import webdriver
driver = webdriver.Chrome(r'D:\Project\Project\Rebot Router\chromedriver11.exe')
driver.get("https://www.twitch.tv/nightblue3")
while True:
    text11= driver.find_elements_by_xpath('//*[@id="6583f0b7722e3be4537e78903686d3b4"]/div/div[1]/div/div/section/div/div[3]/div[2]/div[3]/div/div/div[116]/div[2]/span[4]/span')
    text44= driver.find_elements_by_class_name("chat-line--inline chat-line__message")
    print(str(text11))
    print(str(text44))
But when I run it, this is what I get:
[]
[]
[]
[]
[]
[]
[]
[]
[]
And when I use .text like this:
while True:
    text11= driver.find_elements_by_xpath('//*[@id="6583f0b7722e3be4537e78903686d3b4"]/div/div[1]/div/div/section/div/div[3]/div[2]/div[3]/div/div/div[116]/div[2]/span[4]/span').text
    text44= driver.find_elements_by_class_name("chat-line--inline chat-line__message").text
    print(str(text11))
    print(str(text44))
this is what I get:
Traceback (most recent call last):
File "D:/Project/Project/Rebot Router/test.py", line 7, in <module>
text11= driver.find_elements_by_xpath('//*[@id="6583f0b7722e3be4537e78903686d3b4"]/div/div[1]/div/div/section/div/div[3]/div[2]/div[3]/div/div/div[116]/div[2]/span[4]/span').text
AttributeError: 'list' object has no attribute 'text'
Any help would be appreciated.
By the way, text11 and text44 target the same element; text11 uses an XPath and text44 uses the class name.
while True:
    # find_elements_* returns a list, so read .text from each element in turn
    Texts = driver.find_elements_by_xpath("//span[@class='text-fragment']")
    for element in Texts:
        print(element.text)
I'm trying to pull some data. I have a function that collects all the tags into a list and returns it.
My code:
def getfontCats(soup):
    cats_name = soup.find("div", {"class": ["fontTagsMain"]}).find_all("a")
    cat_list = []
    for cats in cats_name:
        cat_list.append(cats.get_text())
    return cat_list
for sk in set(getListing(soup)):
    print(sk)
    print(getfontCats(sk))
    print("###################################################")
    time.sleep(1)
HTML Content (Soup):
<div class="fontTagsMain">
AnimalComic CartoonCartoonComicFunFunnyComic Book </div>
Output:
Traceback (most recent call last):
File "/Users/evilslab/Documents/Websites/www.futurepoint.dev.cc/dobuyme/socket/fonts.py", line 109, in <module>
print(getfontCats(sk))
File "/Users/evilslab/Documents/Websites/www.futurepoint.dev.cc/dobuyme/socket/fonts.py", line 40, in getfontCats
cats_name = soup.find("div", {"class": ["fontTagsMain"]}).find_all("a")
TypeError: slice indices must be integers or None or have an __index__ method
The same code works when I run it outside the function, but when I call it through the function it gives me this error.
Yes, fixed. Sorry, I was initializing soup inside the function.
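For anyone who lands here with the same traceback: that TypeError is what str.find raises when its second argument is not an integer, which suggests the function was handed a plain string rather than a BeautifulSoup object. A minimal sketch of that reading, assuming getListing returns strings; the sample value of sk below is made up:

```python
# Hypothetical value standing in for one item yielded by getListing(soup).
sk = '<div class="fontTagsMain"><a>Animal</a><a>Comic</a></div>'

error_message = None
try:
    # On a str, find(sub, start) is a substring search; the dict ends up in
    # the integer "start" slot, so Python raises TypeError.
    sk.find("div", {"class": ["fontTagsMain"]})
except TypeError as exc:
    error_message = str(exc)

print(type(sk).__name__, "->", error_message)
```

Parsing the string first, for example with BeautifulSoup(sk, 'html.parser'), gives find() the tag-search behavior the function expects.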
I'm trying to use a crawler to get IEEE paper keywords, but I now get an error.
How can I fix my crawler?
My code is here:
import re
import requests
import json
from bs4 import BeautifulSoup
ieee_content = requests.get("http://ieeexplore.ieee.org/document/8465981", timeout=180)
soup = BeautifulSoup(ieee_content.text, 'xml')
tag = soup.find_all('script')
for i in tag[9]:
    s = json.loads(re.findall('global.document.metadata=(.*;)', i)[0].replace("'", '"').replace(";", ''))
And here is the error:
Traceback (most recent call last):
File "G:/github/爬蟲/redigg-leancloud/crawlers/sup_ieee_keywords.py", line 90, in <module>
a.get_es_data(offset=0, size=1)
File "G:/github/爬蟲/redigg-leancloud/crawlers/sup_ieee_keywords.py", line 53, in get_es_data
self.get_data(link=ieee_link, esid=es_id)
File "G:/github/爬蟲/redigg-leancloud/crawlers/sup_ieee_keywords.py", line 65, in get_data
s = json.loads(re.findall('global.document.metadata=(.*;)', i)[0].replace(";", '').replace("'", '"'))
IndexError: list index out of range
Here's another answer. I don't know what you do with 's' in your code after the load, so my code only does the replace.
The code below doesn't throw an error, but again, how are you using 's'?
import re
import requests
import json
from bs4 import BeautifulSoup
ieee_content = requests.get("http://ieeexplore.ieee.org/document/8465981", timeout=180)
soup = BeautifulSoup(ieee_content.text, 'xml')
tag = soup.find_all('script')
# each i is a child node of the tenth <script> tag
for i in tag[9]:
    metadata_format = re.compile(r'global.document.metadata=.*', re.MULTILINE)
    metadata = re.findall(metadata_format, i)
    if len(metadata) != 0:
        # convert the list
        convert_to_json = json.dumps(metadata)
        x = json.loads(convert_to_json)
        s = x[0].replace("'", '"').replace(";", '')
        ###########################################
        # I don't know what you plan to do with 's'
        ###########################################
        print (s)
Apparently, on line 65, some of the data in i did not match the regex pattern you're using. Therefore [0] fails, because re.findall returned an empty list.
Solution:
x = re.findall('global.document.metadata=(.*;)', i)
if x:
    s = json.loads(x[0].replace("'", '"').replace(";", ''))
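The same guard can be wrapped in a small helper so that scripts without the metadata assignment are skipped instead of raising IndexError. A minimal sketch, assuming the metadata is already valid JSON; extract_metadata and the sample strings are hypothetical and are not real IEEE page content:

```python
import json
import re

# Capture the object literal assigned to global.document.metadata.
METADATA_RE = re.compile(r"global\.document\.metadata=(\{.*\});")

def extract_metadata(script_text):
    """Return the parsed metadata dict, or None if the script has none."""
    match = METADATA_RE.search(script_text)
    if match is None:
        # No match: skip this script rather than indexing an empty list
        return None
    return json.loads(match.group(1))

# Made-up script bodies: one unrelated, one with the metadata assignment.
scripts = [
    "var unrelated = 1;",
    'global.document.metadata={"title": "Example", "keywords": ["a", "b"]};',
]
results = [extract_metadata(text) for text in scripts]
print(results)
```

Capturing only the braces avoids the blanket replace(";", ''), which would also strip semicolons that appear inside string values.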
I'm trying to extract the value from a line of HTML that looks like this:
<input type="hidden" name="_ref_ck" value="41d875b47692bb0211ada153004a663f">
To get the value, I'm doing:
self.ref = soup.find("input",{"name":"_ref_ck"}).get("value")
It works fine for me, but I gave the program to a friend to beta test and he gets an error like this:
Traceback (most recent call last):
File "C:\Users\Daniel\AppData\Local\Temp\Rar$DI85.192\Invent Manager.py", line 262, in onOK
self.main = GUI(None, -1, 'Inventory Manager')
File "C:\Users\Daniel\AppData\Local\Temp\Rar$DI85.192\Invent Manager.py", line 284, in __init__
self.inv.Login(log.user)
File "C:\Users\Daniel\AppData\Local\Temp\Rar$DI85.192\Invent Manager.py", line 34, in Login
self.get_ref_ck()
File "C:\Users\Daniel\AppData\Local\Temp\Rar$DI85.192\Invent Manager.py", line 43, in get_ref_ck
self.ref = soup.find('input',{'name':'_ref_ck'}).get("value")
AttributeError: 'NoneType' object has no attribute 'get'
This means that BeautifulSoup's find() is returning None for some reason.
I asked him to send me the HTML that the request returns, and it was fine. Then I asked him for the soup, and it contained only the top part of the page. I can't figure out why.
So BeautifulSoup is returning only part of the HTML it receives.
My question is: why does this happen, and is there an easy way to do this with regex or something else? Thanks!
Here's a quick pyparsing-based solution walkthrough:
Import HTML parsing helpers from pyparsing
>>> from pyparsing import makeHTMLTags, withAttribute
Define your desired tag expression (makeHTMLTags returns starting and ending tag matching expressions, you just want a starting expression, so we just take the 0'th returned value).
>>> inputTag = makeHTMLTags("input")[0]
We only want input tags whose name attribute is "_ref_ck"; use withAttribute to do this filtering.
>>> inputTag.setParseAction(withAttribute(name="_ref_ck"))
Now define your sample input, and use the inputTag expression definition to search for a match.
>>> html = '''<input type="hidden" name="_ref_ck" value="41d875b47692bb0211ada153004a663f">'''
>>> tagdata = inputTag.searchString(html)[0]
Call tagdata.dump() to see all parsed tokens and available named results.
>>> print (tagdata.dump())
['input', ['type', 'hidden'], ['name', '_ref_ck'], ['value', '41d875b47692bb0211ada153004a663f'], False]
- empty: False
- name: _ref_ck
- startInput: ['input', ['type', 'hidden'], ['name', '_ref_ck'], ['value', '41d875b47692bb0211ada153004a663f'], False]
- empty: False
- name: _ref_ck
- tag: input
- type: hidden
- value: 41d875b47692bb0211ada153004a663f
- tag: input
- type: hidden
- value: 41d875b47692bb0211ada153004a663f
Use tagdata.value to get the value attribute:
>>> print (tagdata.value)
41d875b47692bb0211ada153004a663f
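On the regex part of the question, here is a minimal sketch using only the standard library. It assumes double-quoted attributes with name appearing before value, as in the sample line; find_ref_ck is a hypothetical helper name. It returns None when the tag is absent instead of raising AttributeError:

```python
import re

# Assumes double-quoted attributes with name appearing before value,
# as in the sample line from the question.
REF_CK_RE = re.compile(r'<input\b[^>]*\bname="_ref_ck"[^>]*\bvalue="([^"]*)"')

def find_ref_ck(html):
    """Return the _ref_ck value, or None if the input tag is missing."""
    match = REF_CK_RE.search(html)
    return match.group(1) if match else None

sample = '<input type="hidden" name="_ref_ck" value="41d875b47692bb0211ada153004a663f">'
ref = find_ref_ck(sample)
missing = find_ref_ck("<html><head></head></html>")
print(ref, missing)
```

A regex like this is brittle against attribute reordering or single quotes, so treat it as a fallback rather than a replacement for a real parser.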
I want to serialize a dictionary to JSON in Python. I get the error 'str' object has no attribute '__dict__'. Here is my code:
from django.utils import simplejson
class Person(object):
    a = ""
person1 = Person()
person1.a = "111"
person2 = Person()
person2.a = "222"
list = {}
list["first"] = person1
list["second"] = person2
s = simplejson.dumps([p.__dict__ for p in list])
And the exception is:
Traceback (most recent call last):
File "/base/data/home/apps/py-ide-online/2.352580383594527534/shell.py", line 380, in post
exec(compiled_code, globals())
File "<string>", line 17, in <module>
AttributeError: 'str' object has no attribute '__dict__'
How about
s = simplejson.dumps([p.__dict__ for p in list.itervalues()])
What do you think [p.__dict__ for p in list] does?
Since list is not a list but a dictionary, for p in list iterates over the keys of the dictionary. The keys are strings, and strings have no __dict__.
Never use names like list or dict for variables.
And never lie about a data type. Your list variable is a dictionary. Call it "person_dict" and you'll be happier.
You are using a dictionary, not a list, as your list. For your code to work, change it to a list, e.g.:
list = []
list.append(person1)
list.append(person2)
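For reference, a minimal Python 3 sketch of the dictionary approach using the standard json module in place of django.utils.simplejson. The variable name people and the __init__ method are illustrative changes, not part of the original code:

```python
import json

class Person:
    def __init__(self, a=""):
        # instance attribute, so it shows up in __dict__
        self.a = a

# A dict keyed by label; the name says what the type is.
people = {"first": Person("111"), "second": Person("222")}

# Iterate over the values (the Person objects), not the string keys.
serialized = json.dumps([p.__dict__ for p in people.values()])

# Alternatively, keep the keys in the JSON output.
serialized_by_key = json.dumps({key: p.__dict__ for key, p in people.items()})

print(serialized)
print(serialized_by_key)
```

In Python 3, values() replaces the Python 2 itervalues() call.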