I'm trying to extract the value from a line of HTML that looks like this:
<input type="hidden" name="_ref_ck" value="41d875b47692bb0211ada153004a663f">
To get the value, I'm doing:
self.ref = soup.find("input", {"name": "_ref_ck"}).get("value")
It works fine for me, but I gave the program to a friend to beta test and he is getting an error like this:
Traceback (most recent call last):
File "C:\Users\Daniel\AppData\Local\Temp\Rar$DI85.192\Invent Manager.py", line 262, in onOK
self.main = GUI(None, -1, 'Inventory Manager')
File "C:\Users\Daniel\AppData\Local\Temp\Rar$DI85.192\Invent Manager.py", line 284, in __init__
self.inv.Login(log.user)
File "C:\Users\Daniel\AppData\Local\Temp\Rar$DI85.192\Invent Manager.py", line 34, in Login
self.get_ref_ck()
File "C:\Users\Daniel\AppData\Local\Temp\Rar$DI85.192\Invent Manager.py", line 43, in get_ref_ck
self.ref = soup.find('input',{'name':'_ref_ck'}).get("value")
AttributeError: 'NoneType' object has no attribute 'get'
which means that BeautifulSoup is returning None for some reason.
So I told him to send me the HTML that the request returns, and it was fine. Then I told him to send me the soup, and it contained only the top part of the page, and I can't figure out why.
This means BeautifulSoup is parsing only part of the HTML it receives.
My question is why, or whether there is an easy way I could do this with a regex or something else. Thanks!
Here's a quick pyparsing-based solution walkthrough:
Import HTML parsing helpers from pyparsing
>>> from pyparsing import makeHTMLTags, withAttribute
Define your desired tag expression (makeHTMLTags returns opening- and closing-tag matching expressions; you just want the opening one, so take the 0th returned value).
>>> inputTag = makeHTMLTags("input")[0]
We only want input tags having a name attribute equal to "_ref_ck"; use withAttribute to do this filtering:
>>> inputTag.setParseAction(withAttribute(name="_ref_ck"))
Now define your sample input, and use the inputTag expression definition to search for a match.
>>> html = '''<input type="hidden" name="_ref_ck" value="41d875b47692bb0211ada153004a663f">'''
>>> tagdata = inputTag.searchString(html)[0]
Call tagdata.dump() to see all parsed tokens and available named results.
>>> print (tagdata.dump())
['input', ['type', 'hidden'], ['name', '_ref_ck'], ['value', '41d875b47692bb0211ada153004a663f'], False]
- empty: False
- name: _ref_ck
- startInput: ['input', ['type', 'hidden'], ['name', '_ref_ck'], ['value', '41d875b47692bb0211ada153004a663f'], False]
  - empty: False
  - name: _ref_ck
  - tag: input
  - type: hidden
  - value: 41d875b47692bb0211ada153004a663f
- tag: input
- type: hidden
- value: 41d875b47692bb0211ada153004a663f
Use tagdata.value to get the value attribute:
>>> print (tagdata.value)
41d875b47692bb0211ada153004a663f
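Since the question also asks about regex: for this one specific, known line a plain-regex fallback is possible. This is a brittle sketch, not general HTML parsing; it assumes the attribute order and double quoting match the page exactly:

```python
import re

html = '<input type="hidden" name="_ref_ck" value="41d875b47692bb0211ada153004a663f">'

# Capture the value attribute that follows name="_ref_ck".
match = re.search(r'name="_ref_ck"\s+value="([^"]+)"', html)
if match:
    print(match.group(1))  # 41d875b47692bb0211ada153004a663f
```

If the page's attribute order ever changes, the regex silently stops matching, which is why an HTML parser is the safer default.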
Related
I am trying to extract the value of an attribute from a tag (in this case, td). The code is as follows (the HTML document loads correctly; self.data contains a string of HTML data, and this method is part of a class):
def getLine(self):
    dat = BeautifulSoup(self.data, "html.parser")
    tags = dat.find_all("tr")
    for current in tags:
        line = current.findChildren("td", recursive=False)
        for currentLine in line:
            # print(currentLine)
            clase = currentLine["class"]  # <-- PROBLEMATIC LINE
            if clase is not None and "result" in clase:
                valor = Line()
                valor.name = line.text
The error is in the line clase = currentLine["class"]. I just need to check whether the tag element has this attribute and act when it has the value "result".
File "C:\DataProgb\urlwrapper.py", line 43, in getLine
clase = currentLine["class"] #Trying to extract attribute class
\AppData\Local\Programs\Python\Python39\lib\site-packages\bs4\element.py", line 1519, in __getitem__
return self.attrs[key]
KeyError: 'class'
It should work, because it's just an element. I don't understand this error. Thanks.
The main issue is that you access the attribute key directly, which raises a KeyError if the attribute is not present:
currentLine["class"]
Instead, use get(), which returns None for a missing attribute:
currentLine.get("class")
From the docs - get(key\[, default\]):
Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.
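A minimal sketch of the difference, using a made-up two-cell row (note that BeautifulSoup returns the class attribute as a list):

```python
from bs4 import BeautifulSoup

# Hypothetical row: only the second cell has a class attribute.
html = '<tr><td>plain</td><td class="result">match</td></tr>'
soup = BeautifulSoup(html, "html.parser")

for td in soup.find_all("td"):
    clase = td.get("class")  # None when the attribute is missing, no KeyError
    if clase is not None and "result" in clase:
        print(td.text)  # only the cell with class="result"
```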
I'm trying to scrape multiple tables with the same class name using BeautifulSoup 4 and Python.
import requests
from bs4 import BeautifulSoup
import csv
standingsURL = "https://efl.network/index/efl/Standings.html"
standingsPage = requests.get(standingsURL)
standingsSoup = BeautifulSoup(standingsPage.content, 'html.parser')
standingTable = standingsSoup.find_all('table', class_='Grid')
standingTitles = standingTable.find_all("tr", class_='hilite')
standingHeaders = standingTable.find_all("tr", class_="alt")
However, when running this, it gives me the error:
Traceback (most recent call last):
File "C:/Users/user/Desktop/program.py", line 15, in <module>
standingTitles = standingTable.find_all("tr", class_='hilite')
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\bs4\element.py", line 2128, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
If I change standingTable = standingsSoup.find_all('table', class_='Grid') to
standingTable = standingsSoup.find('table', class_='Grid')
it works, but gives me the data of only one of the tables, while I'm trying to get the data of both.
Try this.
from simplified_scrapy import SimplifiedDoc,req,utils
standingsURL = "https://efl.network/index/efl/Standings.html"
standingsPage = req.get(standingsURL)
doc = SimplifiedDoc(standingsPage)
standingTable = doc.selects('table.Grid')
standingTitles = standingTable.selects("tr.hilite")
standingHeaders = standingTable.selects("tr.alt")
print(standingTitles.tds.text)
Result:
[[['Wisconsin Brigade', '10', '3', '0', '.769', '386', '261', '6-1-0', '4-2-0', '3-2-0', '3-2-0', 'W4'], ...
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
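If you would rather stay with BeautifulSoup, remember that find_all() returns a ResultSet, which is just a list of tags: loop over it and call find_all() on each table individually. A sketch with two made-up tables standing in for the real page:

```python
from bs4 import BeautifulSoup

# Stand-in for the fetched page: two tables sharing the same class.
html = """
<table class="Grid"><tr class="hilite"><td>Team A</td></tr></table>
<table class="Grid"><tr class="hilite"><td>Team B</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

standingTitles = []
for table in soup.find_all("table", class_="Grid"):  # iterate the ResultSet
    standingTitles.extend(table.find_all("tr", class_="hilite"))

print([tr.td.text for tr in standingTitles])  # rows from both tables
```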
I am trying to retrieve tags using boto3, but I constantly run into a "list index out of range" error.
My Code:
import boto3

rds = boto3.client('rds', region_name='us-east-1')
rdsinstances = rds.describe_db_instances()
for rdsins in rdsinstances['DBInstances']:
    rdsname = rdsins['DBInstanceIdentifier']
    arn = "arn:aws:rds:%s:%s:db:%s" % (reg, account_id, rdsname)
    rdstags = rds.list_tags_for_resource(ResourceName=arn)
    if 'MyTag' in rdstags['TagList'][0]['Key']:
        print "Tags exist and the value is: %s" % rdstags['TagList'][0]['Value']
The error that I have is:
Traceback (most recent call last):
File "rdstags.py", line 49, in <module>
if 'MyTag' in rdstags['TagList'][0]['Key']:
IndexError: list index out of range
I also tried using a for loop with an explicit range; that didn't seem to work either.
for i in range(0, 10):
    print rdstags['TagList'][i]['Key']
Any help is appreciated. Thanks!
You should iterate over the list of tags and compare 'MyTag' with each item independently.
Something like this:
if 'MyTag' in [tag['Key'] for tag in rdstags['TagList']]:
    print "Tags exist and.........."
or better:
for tag in rdstags['TagList']:
    if tag['Key'] == 'MyTag':
        print "......"
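Another option (a sketch with sample data standing in for a real list_tags_for_resource response) is to convert the TagList into a plain dict first; an empty list then simply produces an empty dict instead of an IndexError:

```python
# Sample TagList in the shape boto3 returns it.
tag_list = [
    {'Key': 'MyTag', 'Value': 'production'},
    {'Key': 'Owner', 'Value': 'daniel'},
]

# Key/value pairs become an ordinary dict lookup.
tags = {tag['Key']: tag['Value'] for tag in tag_list}

if 'MyTag' in tags:
    print("Tags exist and the value is: %s" % tags['MyTag'])
```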
I use a function have_tag to find a tag in any boto3 module:
import boto3

client = boto3.client('rds')
instances = client.describe_db_instances()['DBInstances']
if instances:
    for i in instances:
        arn = i['DBInstanceArn']
        # arn:aws:rds:ap-southeast-1::db:mydbrafalmarguzewicz
        tags = client.list_tags_for_resource(ResourceName=arn)['TagList']
        print(have_tag(tags, 'MyTag'))
        print(tags)
The function that searches the tags (it takes the tag list and the key to look for):
def have_tag(tag_list: list, tag_key: str):
    """Search for a tag key and return its value, or None."""
    keys = (tag_key.capitalize(), tag_key.lower())
    if tag_list is not None:
        matching = [tag for tag in tag_list if tag['Key'] in keys]
        if matching:
            return matching[0]['Value']
    return None
I'm trying to build a dictionary of keywords and put it into a scrapy item.
'post_keywords':{1: 'midwest', 2: 'i-70',}
The point is that this will all go inside a json object later on down the road. I've tried initializing a new blank dictionary first, but that doesn't work.
Pipeline code:
tag_count = 0
for word, tag in blob.tags:
if tag == 'NN':
tag_count = tag_count+1
nouns.append(word.lemmatize())
keyword_dict = dict()
key = 0
for item in random.sample(nouns, tag_count):
word = Word(item)
key=key+1
keyword_dict[key] = word
item['post_keywords'] = keyword_dict
Item:
post_keywords = scrapy.Field()
Output:
Traceback (most recent call last):
File "B:\Mega Sync\Programming\job_scrape\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "B:\Mega Sync\Programming\job_scrape\cl_tech\cl_tech\pipelines.py", line 215, in process_item
item['post_noun_phrases'] = noun_phrase_dict
TypeError: 'unicode' object does not support item assignment
It seems like pipelines behave oddly: they don't run all the code in the pipeline unless every item assignment checks out, so my initialized dictionaries are never created, or something like that.
Thanks to MarkTolonen for the help.
My mistake was using the variable name 'item' for two different things.
This works:
for thing in random.sample(nouns, tag_count):
    word = Word(thing)
    key = key + 1
    keyword_dict[key] = word
item['post_keywords'] = keyword_dict
As a part of schoolwork we have been given this code:
>>> IN = re.compile(r'.*\bin\b(?!\b.+ing)')
>>> for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
... for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
... corpus='ieer', pattern = IN):
... print(nltk.sem.rtuple(rel))
We are asked to try it out with some sentences of our own to see the output, so I decided to define a function:
def extract(sentence):
    import re
    import nltk
    IN = re.compile(r'.*\bin\b(?!\b.+ing)')
    for rel in nltk.sem.extract_rels('ORG', 'LOC', sentence, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))
When I try and run this code:
>>> from extract import extract
>>> extract("The Whitehouse in Washington")
I get the following error:
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
extract("The Whitehouse in Washington")
File "C:/Python34/My Scripts\extract.py", line 6, in extract
for rel in nltk.sem.extract_rels('ORG', 'LOC', sentence, corpus='ieer', pattern = IN):
File "C:\Python34\lib\site-packages\nltk\sem\relextract.py", line 216, in extract_rels
pairs = tree2semi_rel(doc.text) + tree2semi_rel(doc.headline)
AttributeError: 'str' object has no attribute 'text'
Can anyone help me understand where I am going wrong in my function?
The correct output for the test sentence should be:
[ORG: 'Whitehouse'] 'in' [LOC: 'Washington']
If you look at the method definition of extract_rels, it expects a parsed document as its third argument, but here you are passing a plain string. To overcome this error, you can do the following:
import re
import nltk

sentence = "The Whitehouse in Washington"
tokens = [nltk.word_tokenize(sentence)]
tagged_sentences = [nltk.pos_tag(token) for token in tokens]

class doc():
    pass

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
doc.headline = ["test headline for sentence"]
for i, sent in enumerate(tagged_sentences):
    doc.text = nltk.ne_chunk(sent)
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))  # adapt the output as needed
Try it out!