Get every div class and div id from a website (Python)

I have made several attempts, but I cannot find a way to get the name and content of every
div class
div id
I'm using lxml and BeautifulSoup in my project, but I can't find a way to locate divs whose names I don't know in advance.
Can someone show me a method, or give me any tips on how to do this?
Thanks in advance.

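You can use the find_all method to find all tags of a certain type, then look at their attributes via their attrs dict, e.g.:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # html holds the page source
for div in soup.find_all('div'):
    print(div.attrs)  # dict mapping attribute names (id, class, ...) to values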
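To pull out just the id and class of each div rather than the whole attrs dict, something like the following might work. It is a minimal sketch: the sample markup and the helper name describe_divs are made up for illustration, and 'html.parser' is used only so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<div id="header" class="top wide"><p>Title</p></div>
<div class="content">Body text</div>
<div>No attributes</div>
"""

soup = BeautifulSoup(html, "html.parser")


def describe_divs(soup):
    """Return an (id, classes, text) tuple for every div in the document."""
    rows = []
    for div in soup.find_all("div"):
        div_id = div.get("id")          # None when the div has no id
        classes = div.get("class", [])  # class is multi-valued, so it is a list
        rows.append((div_id, classes, div.get_text(strip=True)))
    return rows


for row in describe_divs(soup):
    print(row)
```

Using .get() instead of indexing avoids a KeyError on divs that lack an id or a class.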

Related

Python Beautiful Soup (Select only the one class with same name)

I am using Beautiful Soup to parse the elements of an email, and I have successfully extracted the links from a button in the email. However, the class name on the button appears twice in the email HTML, so two links are extracted and printed. I only need the first link, that is, the first element with that class name.
This is my code:
soup = BeautifulSoup(msg.html, 'html.parser')
for link in soup('a', class_='mcnButton', href=True):
    print(link['href'])
The 'mcnButton' class matches two HTML buttons in the email, each containing a separate link. I only need the first element with the 'mcnButton' class and the link it contains.
The above code prints two links (again, I only need the first):
https://leemarpet.us10.list-manage.com/track/XXXXXXX1
https://leemarpet.us10.list-manage.com/track/XXXXXXX2
I figured there should be a way to index into the results and access only the first element with that class and its link. Any assistance would be greatly appreciated. Thanks!
I tried select_one, find, and indexing the class, but my attempts resulted in a syntax error.
To find only the first element matching your pattern use .find():
soup.find('a', class_='mcnButton', href=True)
and to get its href:
soup.find('a', class_='mcnButton', href=True).get('href')
For more information, check the Beautiful Soup documentation.
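As a rough illustration of the answer above, the sketch below runs .find() on a made-up email body (the example.com links are placeholders, not the asker's real URLs). It also shows two lookups that appear to be equivalent, select_one and find_all with limit=1, and guards against the case where nothing matches, since .find() returns None then.

```python
from bs4 import BeautifulSoup

# Hypothetical email body with two buttons that share one class.
html = """
<a class="mcnButton" href="https://example.com/first">One</a>
<a class="mcnButton" href="https://example.com/second">Two</a>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match and returns None if there is none.
first = soup.find("a", class_="mcnButton", href=True)
first_href = first["href"] if first is not None else None
print(first_href)

# The same element via a CSS selector.
via_select = soup.select_one("a.mcnButton[href]")

# find_all() with limit=1 returns a list holding at most one element.
via_limit = soup.find_all("a", class_="mcnButton", href=True, limit=1)
```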

Using Python + Selenium to select a tag within divs of a class within a div?

What I'm trying to do is find the first a tags within the divs of a specific class, inside a div with a specific ID, using Python + Selenium of course.
Right now my code is:
newest_elements = driver.find_elements_by_css_selector("div.elements > a")
This searches every div on the page with class "elements" and takes the topmost link from those divs. But I do not want to search all of the divs on the entire page with the class "elements". I only want to search the "elements" divs that are inside another, larger div with the specific id "list-all".
How do I achieve this? Thanks in advance for your help.
According to your description, instead of
newest_elements = driver.find_elements_by_css_selector("div.elements > a")
you should use
newest_elements = driver.find_elements_by_css_selector("div#list-all div.elements > a")
You may also need to add waits or delays so that the elements have loaded before you query them.
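The scoping effect of the longer selector can be checked offline. The sketch below swaps Selenium for Beautiful Soup, which accepts the same CSS selector syntax, and runs both selectors against made-up markup. In Selenium 4, where the find_elements_by_* helpers were removed, the same selector string would be passed as driver.find_elements(By.CSS_SELECTOR, ...). The :first-of-type variant is an extra suggestion, not part of the original answer, for keeping only the first link in each div.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two "elements" divs inside #list-all, one outside it.
html = """
<div id="list-all">
  <div class="elements"><a href="/a1">first</a><a href="/a2">second</a></div>
  <div class="elements"><span><a href="/nested">nested</a></span></div>
</div>
<div class="elements"><a href="/outside">outside</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Original selector: direct-child links of every div.elements on the page.
all_links = [a["href"] for a in soup.select("div.elements > a")]

# Scoped selector: only div.elements that are descendants of div#list-all.
scoped = [a["href"] for a in soup.select("div#list-all div.elements > a")]

# Assumed refinement: keep only the first link in each matching div.
first_per_div = [
    a["href"]
    for a in soup.select("div#list-all div.elements > a:first-of-type")
]

print(all_links, scoped, first_per_div)
```

Note that the child combinator (>) skips the link wrapped in a span, because that link is not a direct child of the div.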

BeautifulSoup Python fetch syntax

I'm attempting to gather some data from a Wikipedia page and I can't seem to narrow down my fetch to the ul and li items within the div. Here's what I have so far:
soup.findAll('div', attrs={'class': "mw-parser-output"})
I'm reading through the documentation and I can't seem to find where or how I can drill down to the ul or li within div class = mw-parser-output.
This is the first time I've used BeautifulSoup, so please excuse my ignorance.
Thanks in advance!
BeautifulSoup supports CSS selectors with the .select(selector) syntax. You could use something like soup.select('div.mw-parser-output ul li')
Try this:
items = soup.find_all('div', attrs={'class': "mw-parser-output"})
for item in items:
    print(item.find_all('li'))  # only the li tags inside this div
I got what I was looking for by simply using items = soup.findAll('li'). If there is a better way, please let me know.
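One caveat with soup.findAll('li') is that it returns every li on the page, not only those inside the target div. The sketch below contrasts the two approaches on made-up markup; the sidebar div is a hypothetical stand-in for navigation lists that a real page might also contain.

```python
from bs4 import BeautifulSoup

# Hypothetical page with a content list and an unrelated sidebar list.
html = """
<div class="mw-parser-output">
  <ul><li>Alpha</li><li>Beta</li></ul>
</div>
<div id="sidebar"><ul><li>Navigation</li></ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Scoped: only li items under div.mw-parser-output.
scoped = [li.get_text(strip=True)
          for li in soup.select("div.mw-parser-output ul li")]

# Unscoped: every li in the document, sidebar included.
everything = [li.get_text(strip=True) for li in soup.find_all("li")]

print(scoped)
print(everything)
```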

How can I get a tag's (e.g. div or other) value by parameter name?

I'm new to Python and I've run into a problem.
There is a website with a complete structure. I know how to find a div or other tag, but once I have found that tag (e.g. by class name), I would like to collect all of its parameters (attributes) and their values, and I can't.
So my question is: how can I collect all the parameters and values of an arbitrary tag after I have found it?
This is how I find the tag:
from bs4 import BeautifulSoup as BS
.
.
.
soup = BS(page.content, 'html.parser')
soup.prettify()
div = soup.find('div', {'class':'abc'})
If you want to get all the element's attributes, simply use the .attrs property:
print(div.attrs)
This would print out a dictionary where keys are attribute names and values are attribute values.
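A small sketch of what that dictionary looks like, using a made-up div (the id and data-role attributes are assumptions added for illustration). Multi-valued attributes such as class come back as lists, while the others are plain strings.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with several attributes.
html = '<div class="abc wide" id="main" data-role="panel">Hello</div>'
soup = BeautifulSoup(html, "html.parser")

# Searching by class matches a tag that has "abc" among its classes.
div = soup.find("div", {"class": "abc"})

attrs = dict(div.attrs)  # copy, so later edits do not alter the tag
for name, value in attrs.items():
    print(name, "=", value)

# A single attribute by name; get() returns None if it is absent.
role = div.get("data-role")
title = div.get("title")
```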

Pull Tag Value using BeautifulSoup

Can someone direct me on how to pull the value of a tag using BeautifulSoup? I read the documentation but had a hard time navigating through it. For example, if I had:
<span title="Funstuff" class="thisClass">Fun Text</span>
How would I pull just "Funstuff" using BeautifulSoup/Python?
Edit: I am using version 3.2.1
You need to have something to identify the element you're looking for, and it's hard to tell what it is in this question.
For example, both of these will print out 'Funstuff' in BeautifulSoup 3. One looks for a span element and gets the title, another looks for spans with the given class. Many other valid ways to get to this point are possible.
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html><body><span title="Funstuff" class="thisClass">Fun Text</span></body></html>')
print soup.html.body.span['title']
print soup.find('span', {"class": "thisClass"})['title']
A tag's children are available via .contents
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
In your case you can find the tag by using its CSS class and then extract the contents
from bs4 import BeautifulSoup
soup = BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>', 'html.parser')
soup.select('.thisClass')[0].contents[0]  # the tag's text, 'Fun Text'
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors has all the details necessary
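Since the question asks for the title attribute ("Funstuff") rather than the tag's text, a bs4 version might look like the sketch below. It assumes Beautiful Soup 4 rather than the 3.2.1 release mentioned in the question, and the lang lookup is only there to show what happens when an attribute is missing.

```python
from bs4 import BeautifulSoup

html = '<span title="Funstuff" class="thisClass">Fun Text</span>'
soup = BeautifulSoup(html, "html.parser")

span = soup.select_one(".thisClass")

title = span["title"]       # attribute value; raises KeyError if absent
text = span.get_text()      # text inside the tag
missing = span.get("lang")  # get() returns None instead of raising

print(title)
print(text)
```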
