bs4: Differentiating between text and HTML elements - Python

I'm trying to separate XSS payloads to analyze their structure with different methods.
An example payload looks like this:
<picture><source srcset="x"><img onerror="***payload***"></picture>
Now I need to separate the different parts to get the following output:
picture source srcset x img onerror ***payload***
My problem is that the payloads sometimes contain text content and sometimes contain another HTML element, as in the example. If I simply appended the content of the outer HTML element, the output would be wrong, since I would iterate over that element a second time.
My code looks something like this:
for x in self.normalized_payloads:
    tmp = []
    soup = BeautifulSoup(x, 'html.parser')
    elements = soup.find_all()
    for y in elements:
        tmp.append(y.name)
        for u in y.attrs.keys():
            tmp.append(u)
            tmp.append(y.attrs[u])
    seperated_payloads.append(tmp)
How can I differentiate between text and another HTML element as the content of an HTML element? Is there another way of producing this output without iterating through every HTML element of the payload?

So, as it turns out, I came up with a working solution:
contents = y.decode_contents(formatter="html")
if len(BeautifulSoup(contents, 'html.parser').find_all()) == 0 and contents != "":
    tmp.append(contents)
This code checks whether an element's contents parse as further HTML elements or are just text.
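For what it's worth, bs4 also distinguishes the two cases at the type level: plain text nodes are NavigableString instances, while nested tags are Tag instances, so an isinstance check avoids re-parsing each element's contents. A minimal sketch of that alternative, using the example payload from the question:

from bs4 import BeautifulSoup, NavigableString

payload = '<picture><source srcset="x"><img onerror="***payload***"></picture>'
soup = BeautifulSoup(payload, 'html.parser')

tokens = []
for tag in soup.find_all():
    tokens.append(tag.name)
    for attr, value in tag.attrs.items():
        tokens.append(attr)
        tokens.append(value)
    # direct children that are NavigableStrings are text, not nested tags
    for child in tag.children:
        if isinstance(child, NavigableString) and child.strip():
            tokens.append(str(child))

print(tokens)
# ['picture', 'source', 'srcset', 'x', 'img', 'onerror', '***payload***']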

Related

Processing all data in a for loop instead of only one element

I wrote some code to scrape some data from a website. When I run the code manually I can get all the information for all the shoes, but when I run my script it only gives me one result for each variable.
What can I change to get all the results I want?
For example, when I run the following, I only get one result for marque and one for modele, but when I do it in my terminal I can see that vignette contains multiple values.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.sarenza.com/store/product/gender-type/list/view?gender=1&type=76&index=0&count=99')
soup = BeautifulSoup(r.text, 'lxml')
vignette = soup.find_all('li', class_='vignette')
for i in range(len(vignette)):
    marque = vignette[i].contents[3].text
    modele = vignette[i].contents[5].contents[3].text
You're updating your marque and modele variables on each iteration of the loop, overwriting their previous values. At the end of the loop, they will only contain the last values that were assigned to them.
If you want to extract all the values, you need to use two lists, and append values to them like this:
marques = []
modeles = []
for i in range(len(vignette)):
    marques.append(vignette[i].contents[3].text)
    modeles.append(vignette[i].contents[5].contents[3].text)
Or, in a more Pythonic way, with list comprehensions:
marques = [v.contents[3].text for v in vignette]
modeles = [v.contents[5].contents[3].text for v in vignette]
Now you'll have all the values you need, and you can process them or print them out, like this:
for marque, modele in zip(marques, modeles):
    print('Marque:', marque, 'Modèle:', modele)
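One caution, not from the original answer: indexing into .contents by position is brittle, because whitespace text nodes count as children and any markup change shifts the indices. If the brand and model carry their own classes, a selector-based lookup is safer. A sketch, where the .marque and .modele class names are purely hypothetical:

# hypothetical class names; inspect the real markup before relying on them
marques = [v.select_one('.marque').text for v in vignette]
modeles = [v.select_one('.modele').text for v in vignette]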

retrieved URLs, trouble building payload to use requests module

I'm a Python novice, thanks for your patience.
I retrieved a web page using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). Then I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical but realistic example with just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
          'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into hundreds of lines of code, I understand what to do to retrieve each page:
from lxml import html
import requests

url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01': '910', 'pf7331': '11'}
r = requests.get(url, params=payload)
Then get the second page:
payload = {'PT001F01': '910', 'pf7331': '12'}
r = requests.get(url, params=payload)
And so on, typing in payload objects. But not all of the hrefs I'm dealing with are sequential, and the payloads don't all differ simply in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I found urisplit, which can give me the part I need to parse into a payload:
>>> [urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
 'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object and I don't understand what to do.
I'm using Python 3, and I used uritools because it appears to be the standards-compliant replacement for urltools.
I fell back on a shell script to get the pages with wget, which does work, but it is so un-Pythonic that I'm asking here for what to do. I mean, this does work:
import subprocess
for i in hrefs2:
    subprocess.call(["wget", i])
You can pass the full url to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
If for some reason you don't want to do that, you'll need to split up the parameters somehow. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# list to store all url parameters after they're converted to dicts
urldata = []

# iterate over the list of parameter strings
for param in a:
    data = {}
    # split the string into key=value pairs
    for kv in param.split('&'):
        # split each pair up
        b = kv.split('=')
        # the first part is the key, the second is the value
        data[b[0]] = b[1]
    # after converting every kv pair in the parameter, add the result to the list
    urldata.append(data)
You could do this with less code, but I wanted to make clear what was going on. I'm sure there is already a module out there that does this for you, too; see the sketch below.
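Indeed, the standard library's urllib.parse already does exactly this: urlsplit separates the URL parts, and parse_qsl turns a query string into key/value pairs. A sketch along those lines, using the hypothetical hrefs2 array from the question:

from urllib.parse import urlsplit, parse_qsl

import requests

hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
          'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']

for href in hrefs2:
    parts = urlsplit(href)
    # parse_qsl('PT001F01=910&pf7331=11') -> [('PT001F01', '910'), ('pf7331', '11')]
    payload = dict(parse_qsl(parts.query))
    base_url = '{}://{}{}'.format(parts.scheme, parts.netloc, parts.path)
    r = requests.get(base_url, params=payload)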

Creating Properly-Nested XML Output in Python

I'm attempting to save data from several lists in XML format, but I cannot understand how to make the XML display properly. An example of my code right now is as follows:
from lxml import etree

# Create XML root
articles = etree.Element('root')

# Create lists & data
t_list = ['title1', 'title2', 'title3', 'title4', 'title5']
c_list = ['content1', 'content2', 'content3', 'content4', 'content5']
sum_list = ['summary1', 'summary2', 'summary3', 'summary4', 'summary5']
s_list = ['source1', 'source2', 'source3', 'source4', 'source5']

i = 0
for t in t_list:
    for i in range(len(t_list)):
        # Create SubElements of the XML root
        article = etree.SubElement(articles, 'Article')
        titles = etree.SubElement(article, 'Title')
        summary = etree.SubElement(article, 'Summary')
        source = etree.SubElement(article, 'Source')
        content = etree.SubElement(article, 'Content')
        # Add list data to the SubElements
        titles.text = t_list[i]
        summary.text = sum_list[i]
        source.text = s_list[i]
        content.text = c_list[i]

print(etree.tostring(articles, pretty_print=True))
My current output is jumbled, written all on a single line as follows:
b'<root>\n <Article>\n <Title>title1</Title>\n <Summary>summary1</Summary>\n <Source>source1</Source>\n <Content>content1</Content>\n </Article>\n
It looks like pretty_print within lxml is adding the proper indentation, as well as the \n breaks I would want, but they aren't being interpreted correctly during output; everything is written on a single line.
The output I'm trying to get is as follows:
<root>
<Article>
<Title>title1</Title>
<Summary>summary1</Summary>
<Source>source1</Source>
<Content>content1</Content>
</Article>
Ideally, I'd like for my output to be viewed as a valid XML document, and display in proper nested format.
Your "Current Output" is the representation (internal python representation) of the bytestring generated by etree.tostring(), and seems that in Python3 print(somebytestring) prints the representation instead of the actual string.
Fortunately, the solution is quite simple: just pass the desired encoding to etree.tostring(), i.e.:
xml = etree.tostring(articles, encoding="unicode", pretty_print=True)
print(xml)
I've only used the base ElementTree module in Python and can't find an lxml download for Python 3.5 (which I'm on) to test this, but the b before the output indicates bytes, and a quick glance at the documentation shows that tostring() has an encoding keyword, so you should just need to set that to "unicode" or "utf-8".
I'll also mention that you don't need to set "i" before your for-loop (Python creates the "i" it needs for the for-loop), though I, personally, would zip the lists and iterate over the items themselves (not that it will have any real impact on the code in this situation); see the sketch below.
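As a hedged sketch of that zip suggestion (not the original answerer's code): zipping pairs up the i-th items of all four lists, which also avoids the question's accidental nesting of two loops over t_list (which would create len(t_list)² Article elements):

from lxml import etree

articles = etree.Element('root')

t_list = ['title1', 'title2', 'title3', 'title4', 'title5']
c_list = ['content1', 'content2', 'content3', 'content4', 'content5']
sum_list = ['summary1', 'summary2', 'summary3', 'summary4', 'summary5']
s_list = ['source1', 'source2', 'source3', 'source4', 'source5']

# zip yields one (title, content, summary, source) tuple per article
for t, c, summ, s in zip(t_list, c_list, sum_list, s_list):
    article = etree.SubElement(articles, 'Article')
    etree.SubElement(article, 'Title').text = t
    etree.SubElement(article, 'Summary').text = summ
    etree.SubElement(article, 'Source').text = s
    etree.SubElement(article, 'Content').text = c

print(etree.tostring(articles, encoding="unicode", pretty_print=True))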

How do I preserve new lines when extracting text from html using lxml.text_content()

I am trying to learn to use Whoosh. I have a large collection of HTML documents I want to search. I discovered that the text_content() method creates some interesting problems. For example, I might have some text organized in a table that looks like:
<html><table><tr><td>banana</td><td>republic</td></tr><tr><td>stateless</td><td>person</td></table></html>
When I take the original string and get the tree, then use text_content() to extract the text in the following manner:
mytree = html.fromstring(myString)
text = mytree.text_content()
The results have no spaces (as should be expected)
'bananarepublicstatelessperson'
I tried to insert newlines using str.replace():
myString = myString.replace('</tr>','</tr>\n')
I confirmed that the new line was present
'<html><table><tr><td>banana</td><td>republic</td></tr>\n<tr><td>stateless</td><td>person</td></table></html>'
but when I run the same code from above, the line feeds are not present, and the resulting text_content() output looks just like the one above.
This is a problem for me because I need to be able to separate words. I thought I could add non-breaking spaces after each td close and line breaks after rows, as well as line breaks after body elements, etc., to get text that reasonably conforms to my original source.
I will note that in further testing I found that line breaks inserted after closing paragraph tags were preserved. But there is a lot of text in the tables that I need to be able to search.
Thanks for any assistance.
You could use this solution:
import re

def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

>>> striphtml('I Want This <b>text!</b>')
'I Want This text!'
Found here: using python, Remove HTML tags/formatting from a string
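Another option that stays within lxml rather than falling back to regex, and not from the original answer: itertext() yields each text node separately, so you can join them with a separator of your choice instead of relying on how the markup serializes. A minimal sketch using the table from the question:

from lxml import html

myString = ('<html><table><tr><td>banana</td><td>republic</td></tr>'
            '<tr><td>stateless</td><td>person</td></table></html>')

mytree = html.fromstring(myString)
# itertext() walks the text nodes one by one, so word boundaries survive
text = ' '.join(t.strip() for t in mytree.itertext() if t.strip())
print(text)  # banana republic stateless person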

how to convert a bs4.element.ResultSet to strings? Python

I have some simple code like:
p = soup.find_all("p")
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I am trying to convert a list I obtained from XML into strings. I want to keep each element with its original tags so I can reuse some of the text, which is why I am appending it this way. But the list contains over 6000 items, and a recursion error occurs because of the str() call:
"RuntimeError: maximum recursion depth exceeded while calling a Python object"
I read that you can raise the maximum recursion depth, but that it's not wise to do so. My next idea was to split the conversion into batches of 500, but I am sure there has to be a better way. Does anyone have any advice?
The problem here is probably that some of the binary graphic data at the bottom of the document contains the sequence of characters <P, which Beautiful Soup is trying to repair into an actual HTML tag. I haven't managed to pinpoint which text is causing the "recursion depth exceeded" error, but it's somewhere in there. It's p[6053] for me, but since you seem to have modified the file a bit (or maybe you're using a different parser for Beautiful Soup), it'll be different for you, I imagine.
Assuming you don't need the binary data at the bottom of the document to extract whatever you need from the actual <p> tags, try this:
# boot out the last `<document>`, which contains the binary data
soup.find_all('document')[-1].extract()

p = soup.find_all('p')
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I believe the issue is that the BeautifulSoup object is not built iteratively, so the recursion limit is reached before p = soup.find_all('p') can finish. Note that a RecursionError is similarly thrown when building soup.prettify().
For my solution, I used the re module to gather all <p>...</p> tags (see the code below). My final result was len(p) = 5571. This count is lower than yours because the regex did not match any text within the binary graphic data.
import re
from urllib.request import urlopen

url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'
response = urlopen(url).read()

# findall returns a tuple per match because the pattern has two groups;
# the paragraph body is the first group of each match
p = re.findall(r'<P((.|\s)+?)</P>', str(response))
paragraphs = []
for x in p:
    paragraphs.append(str(x[0]))
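A related option, not from the original answers: if you would rather stay inside Beautiful Soup, SoupStrainer restricts parsing to the tags you name and discards everything else as the document is parsed. Whether that sidesteps the recursion error depends on how the broken <P sequences in the binary section get repaired, so treat this as a sketch to experiment with rather than a guaranteed fix:

from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'
markup = urlopen(url).read()

# parse_only keeps <p> tags and skips everything else during parsing
soup = BeautifulSoup(markup, 'html.parser', parse_only=SoupStrainer('p'))
paragraphs = [str(tag) for tag in soup.find_all('p')]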
