I want to crawl the following HTML code using Scrapy:
<tbody id="pageData11">
<tr>
<td>
<div style="border-left:3px solid #1A8CFF !important; float: left; padding-right: 5px;"> </div>
2018-May-29 Tuesday
</td>
Strictly speaking, the answer to your question is response.xpath('/html/body/tbody/tr/td/div/following::text()').extract_first().strip(), but, in this case, it's also the text in the td. Thus you can also do something like "".join(i.strip() for i in response.css('td::text').extract()).
Considering only the example given in your question:
response.css('td::text').extract()
I have a PyQt6 application that features a custom text editor.
When user hovers some word in this editor, a custom QToolTip is displayed.
I would have liked to make it fancier than the default one, with the following structure:
******* TITLE
* *
* IMG * - some text
* * - some other text
*******
I'm quite new to HTML. I tried some code using <div> and <p> blocks; it rendered the way I wanted when loaded in a browser, but the result was not as expected in the application.
From what it seems, despite the documentation stating that Qt supports HTML blocks, what I want to achieve might be impossible.
Do you have any idea what I could do to make it work? Below is an example of what I tried.
<div style="background-color: #2F3135;font-family: Franklin Gothic;font-size: 12;">
<div style="float: left;background-color: #2F3135;padding: 30px 20px 30px 30px;"><img src=MY_IMAGE width="64" height="64"/>
</div>
<p style="color: #FFFFFF;line-height:135%"><b><span style="background-color: #009900">TITLE:</b></span><br>
<span style="background-color: #009900;">some text<br></span>
<span style="background-color: #009900;">some other text<br></span>
</p>
</div>
EDIT
Following the answer from musicamante, I tried to do a table rather than using div blocks.
As I said in my reply to him, it works unless the title word in the second column is too long. Below is a reproducible QToolTip example:
<table style="background-color: #454850;">
<tr>
<th rowspan=7 style="vertical-align: middle;padding-left: 20px;padding-right: 15px"><img src=MYIMAGE width="64" height="64"/></th>
<th><b><span style="color: #CECED7;font-family: Verdana, sans-serif;font-size: 10;">MAIN_TITLE</span></b></th>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td><b><span style="background-color: #307D30;color: #BACABA;font-family: Verdana, sans-serif;font-size: 10;">Inputs:</b></span></td>
</tr>
<tr>
<td><span style="background-color: #913131;color: #D0BFBF;font-family: Verdana, sans-serif;font-size: 10;">blabla</span></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td><b><span style="background-color: #913131;color: #D0BFBF;font-family: Verdana, sans-serif;font-size: 10;">Outputs:</b></td>
</tr>
<tr>
<td><span style="background-color: #913131;color: #D0BFBF;font-family: Verdana, sans-serif;font-size: 10;">blabla</span></td>
</tr>
</table>
So if MAIN_TITLE > 32 characters in my case, the 2nd column will not display properly in the QToolTip.
Maybe I messed up in the HTML code (I'm very new to HTML, never really worked with it).
Any tips are welcome!
I am trying to round the corners of my table, but border-radius doesn't seem to work when I convert the HTML below to PDF using the xhtmltopdf PDF generator. Below is the HTML for the content; the file name is sticker_print.html:
<div class="sticker" style="height:196px">
<table class="sticker_box" align="left">
<tr>
<td style="border: 1px solid #222;background-color: #ffffff;">
<h3 style="border-bottom: 1px solid #222222;">Batch Sticker</h3>
<h5 style="padding: 0 0 0 10px;">Batch ID</h5>
<p>MFG Date</p>
<p style="padding-bottom:0px;"><img src="http://www.computalabel.com/Images/C128ff#2x.png" width="195px" height="26px"><span> Bar Code </span></p>
<p style="text-align: left; padding-bottom: 0px;">
<img src="https://www.kaspersky.com/content/en-global/images/repository/isc/2020/9910/a-guide-to-qr-codes-and-how-to-scan-qr-codes-2.png" width="65px" height="65px">
<span style="display: block;margin-top: 0px;">QR Code</span>
</p>
</td>
</tr>
</table>
</div>
PDF CODE
pdf = render_to_pdf('sticker_print.html')
return HttpResponse(pdf, content_type='application/pdf')
Even though I'm not using the same PDF engine as you (and your question is 6 months old), I solved this issue by using corner-radius instead of border-radius on a table cell or div.
The structure of the website I'm trying to parse looks like this:
<table border="0" cellpadding="3" cellspacing="0" width="100%">
<tr height="25">
<td class="th" style="border:none" width="2%"> </td>
<td class="th">movie</td>
<td class="th"> </td>
<td class="th"> </td>
</tr>
<tr id="place_1">
<td style="color: #555; vertical-align: top; padding: 6px">
<a name="1"></a>1.
</td>
<td style="height: 27px; vertical-align: middle; padding: 6px 30px 6px 0">
<a class="all" href="/326/">MOVIE TITLE IN SPANISH</a>
<br/>
<span class="text-grey">MOVIE TITLE IN ENGLISH</span>
</td>
<td style="width: 85px">
<div style="width: 85px; position: relative">
<a class="continue" href="/326/votes/">
9.191
</a>
<span style="color: #777">
(592 184)
</span>
</div>
</td>
</tr>
...
...
...
The problem is that I can't get the text inside the span tag. I've tried .text, as for the a tag, and also .get_text(), but neither worked. My Python code:
for row in table.find_all('tr')[1:]:
    info = row.find_all('td')
    movies.append({
        'spn_title' : info[1].a.text,
        'eng_title' : info[1].span.text,
    })
The errors I get:
AttributeError: 'NoneType' object has no attribute 'get_text'
or
'eng_title' : info[1].span.text AttributeError: 'NoneType' object has
no attribute 'text'
Try the following. Also, check your soup variable because I can run your code without problem. I suspect that somewhere later in the HTML you don't have one of these present in a row.
If the class names are consistent, you could keep only the qualifying rows, i.e. those containing elements of the appropriate type with those classes. Using bs4 4.7.1.
for row in table.select('tr :has(span.text-grey):has(a.all)'):
    movies.append({
        'spn_title' : row.select_one('.all').text,
        'eng_title' : row.select_one('.text-grey').text
    })
print(movies)
Otherwise, you want a way to handle if not present. For example,
for row in table.find_all('tr')[1:]:
    movies.append({
        'spn_title' : row.select_one('.all').text if row.select_one('.all') is not None else 'None',
        'eng_title' : row.select_one('.text-grey').text if row.select_one('.text-grey') is not None else 'None'
    })
print(movies)
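The "handle it if not present" idea can also be factored into a small helper so each selector is evaluated only once. This is a minimal sketch, not the original poster's code: the helper name text_or_default, the sample HTML (including the second row without a span) and the use of the built-in "html.parser" are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second data row has no span.text-grey.
HTML = """
<table>
  <tr><td class="th">#</td><td class="th">movie</td></tr>
  <tr id="place_1">
    <td>1.</td>
    <td><a class="all" href="/326/">MOVIE TITLE IN SPANISH</a><br/>
        <span class="text-grey">MOVIE TITLE IN ENGLISH</span></td>
  </tr>
  <tr id="place_2">
    <td>2.</td>
    <td><a class="all" href="/327/">TITLE WITHOUT TRANSLATION</a></td>
  </tr>
</table>
"""


def text_or_default(parent, selector, default=None):
    """Return the stripped text of the first match, or default if none."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


def parse_movies(html):
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    # Skip the header row, as in the question.
    for row in soup.find("table").find_all("tr")[1:]:
        movies.append({
            "spn_title": text_or_default(row, ".all"),
            "eng_title": text_or_default(row, ".text-grey"),
        })
    return movies


if __name__ == "__main__":
    print(parse_movies(HTML))
```

Returning None rather than the string 'None' keeps missing titles distinguishable from a film that is literally called "None"; pass a different default if a placeholder string is preferred.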
I think that you should use innerHTML.
info[1].getElementsByTagName('span')[0].innerHTML
should work.
I have the same issue but I was able to resolve it.
example
<span class="a-offscreen">$10.99</span>
instead of Elem.FindElementByCss("span.a-offscreen").Text
use:
Elem.FindElementByCss("span.a-offscreen").FindElementByXPath("parent::*").Text
The trick is to get the text of the parent.
Btw, I am using VBA so you need to change it to Python Syntax.
I am trying to get a certain paragraph of text from a website, but my current methodology is not working.
I want the paragraph at the bottom. Thank you for your help, and I apologize for being a novice. I tried reading the docs but could not decipher much.
from bs4 import BeautifulSoup
import requests
url = "https://pwcs.edu/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
container = soup.find("div",attrs={'class': 'alertWrapper'})
paragraph = container.find("p")
When I print paragraph.getText() I get a bunch of blank space but no errors.
The html is :
<div id="page">
<div id="em-alerts" role="alert">
<div class="alertWrapper">
<div class="container">
<span class="icon dom-bg">
<em class="fa fa-bell">
<!---->
</em>
</span>
<span id="alert">ALERT</span>
<p>All PWCS will open two hours late on Thursday, February 8, due to icy road conditions in certain areas. SACC will open two hours late. Parents always have the option to keep children home if they have safety concerns.
</p>
<p></p>
</div>
</div>
</div>
First you can get as close as you can to the paragraphs:
container = soup.find('div', attrs={'class':'container'})
Then you look for all the <p> tags in the container and join them.
'\n'.join([x.text for x in container.find_all('p') if x.text != ""])
This will put all the paragraphs together, linked by a newline between each paragraph if they're not blank.
Output:
'All PWCS will open two hours late on Thursday, February 8, due to icy
road conditions in certain areas. SACC will open two hours late.
Parents always have the option to keep children home if they have
safety concerns.\n '
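A variant of the same approach that also drops paragraphs containing only whitespace and trims the trailing newline seen in the output above. This is a sketch under assumptions: the sample markup is shortened from the question, the function name alert_text is made up here, and "html.parser" is used instead of "lxml" only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup shown in the question.
SAMPLE = """
<div class="alertWrapper">
  <div class="container">
    <span id="alert">ALERT</span>
    <p>All PWCS will open two hours late on Thursday, February 8.
    </p>
    <p></p>
  </div>
</div>
"""


def alert_text(html):
    """Join the non-blank <p> texts inside div.container with newlines."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"class": "container"})
    if container is None:
        return ""
    # get_text(strip=True) trims each paragraph, so blank ones become "".
    paragraphs = (p.get_text(strip=True) for p in container.find_all("p"))
    return "\n".join(text for text in paragraphs if text)


if __name__ == "__main__":
    print(alert_text(SAMPLE))
```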
soup = BeautifulSoup(data, "lxml")
container = soup.find("div",attrs={'class': 'alertWrapper'})
paragraph = container.find("p")
In your code above you get only the first "p" tag, because container.find("p") returns only the first match.
And the first tag you are getting is empty one.
You can check page source of that website.
But actually container has multiple "p" tags in it.
What you need to do is:
for p in container.find_all("p"):
    print(p.text)
Following is the Html content in alertWrapper class present on your website.
<div class="alertWrapper">
<div class="container"><span class="icon dom-bg"><em class="fa fa-bell"><!-- --></em></span>
<!--First "p" tag which is empty-->
<p>
</p>
<table align="center" cellpadding="2" cellspacing="2" class="" style="border: 3px solid rgb(0, 176, 240);">
<tbody>
<tr>
<td class=""
style="margin: 2px; padding: 2px; border-image-source: none; border-image-slice: initial; border-image-width: initial; border-image-outset: initial; border-image-repeat: initial; background-color: rgb(255, 255, 255);">
<ul>
<!--Second "p" tag which you want-->
<p style="text-align: left; margin-left: 120px;"><strong><span
style='font-size: medium; letter-spacing: normal; font-family: "Times New Roman"; color: rgb(0, 112, 192);'>The PWCS Parent Divisionwide surveys, sent on January 9, were unexpectedly delayed at the US Post Office distribution center. The deadline for the parent survey, both paper and online, has been extended to Friday, February 9, 2018. </span></strong>
</p>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
</div>
If you right-click and check the page source, the text you want is not there. The HTML you've provided and the page source don't match.
<div class="alertWrapper">
<div class="container"><span class="icon dom-bg"><em class="fa fa-bell"><!----></em></span><p>
<table style="border: 3px solid rgb(0, 176, 240);" align="center" cellpadding="2" cellspacing="2" class="">
<tbody>
This is happening because the content you want is generated dynamically by JavaScript. You won't be able to scrape that using requests module.
You'll have to use other tools like Selenium.
As of now, there are multiple divs on that page with class "container". Therefore you could use find_all() method instead of find(). For example, like this:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://pwcs.edu/")
soup = BeautifulSoup(r.text, "lxml")
n = 0
for container in soup.find_all("div",attrs={'class': 'container'}):
    n += 1
    print('==',n,'==')
    for paragraph in container.find_all("p"):
        print(paragraph)
Alternatively, you can use .next_sibling:
for span in soup.find_all("span",attrs={'id': 'alert'}):
    if span.next_sibling:
        print('ALERT',span.next_sibling)
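One caveat with .next_sibling: when the markup has a line break between the span and the paragraph, the immediate sibling is a whitespace text node rather than the <p> tag. A small sketch of the difference, assuming markup shaped like the question's snippet (the sample string and variable names are illustrative only); find_next_sibling("p") skips ahead to the first following <p> sibling.

```python
from bs4 import BeautifulSoup

# Illustrative markup with a newline between the span and the paragraph.
SAMPLE = """
<div class="container">
  <span id="alert">ALERT</span>
  <p>All PWCS will open two hours late.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
span = soup.find("span", attrs={"id": "alert"})

# The immediate sibling is the whitespace between the two tags.
raw_sibling = span.next_sibling

# find_next_sibling("p") returns the first following sibling that is a <p> tag.
paragraph = span.find_next_sibling("p")
alert_message = paragraph.get_text(strip=True) if paragraph is not None else ""

if __name__ == "__main__":
    print(repr(raw_sibling))
    print("ALERT", alert_message)
```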
I am trying to insert two pictures side by side in one Markdown cell on a notebook. The way I do it was:
<img src="pic/scan_concept.png" alt="Drawing" style="width: 250px;"/>
in order to be able to size the included picture. Can anyone suggest how to build on this to place two pictures side by side?
Thanks,
You can create tables using pipes and dashes like this.
A | B
- | -
![alt](yourimg1.jpg) | ![alt](yourimg2.jpg)
see Tables syntax
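If there are many images, the pipe table can be generated instead of typed by hand. A minimal sketch using only the standard library; the function name image_table and its parameters are made up for illustration, and the output still has to be pasted into (or rendered as) a Markdown cell.

```python
def image_table(paths, headers):
    """Build a Markdown pipe table with one image per cell.

    The number of columns is taken from the number of headers.
    """
    columns = len(headers)
    if columns == 0:
        raise ValueError("at least one header is required")
    cells = [f"![alt]({path})" for path in paths]
    # Pad the last row with empty cells so every row has the same width.
    cells += [""] * (-len(cells) % columns)
    lines = [" | ".join(headers), " | ".join("-" for _ in headers)]
    for start in range(0, len(cells), columns):
        lines.append(" | ".join(cells[start:start + columns]))
    return "\n".join(lines)


if __name__ == "__main__":
    print(image_table(["yourimg1.jpg", "yourimg2.jpg"], ["A", "B"]))
```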
I don't have enough reputation to add comments, so I'll just put my 2 cents as a separate answer. I also found that JMann's solution didn't work, but if you wrap his implementation with table tags:
<table><tr>
<td> <img src="Nordic_trails.jpg" alt="Drawing" style="width: 250px;"/> </td>
<td> <img src="Nordic_trails.jpg" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>
then it works.
JMann's solution didn't work for me. But this one worked
from IPython.display import HTML, display
display(HTML("<table><tr><td><img src='img1'></td><td><img src='img2'></td></tr></table>"))
I took the idea from this notebook
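The same HTML table can be built from a list of paths, which helps when the number of images varies. A sketch assuming nothing beyond the standard library; the helper name side_by_side and the width parameter are illustrative, and the resulting string would be passed to display(HTML(...)) exactly as in the answer above.

```python
from html import escape


def side_by_side(paths, width=250):
    """Return a one-row HTML table with one <img> per cell."""
    cells = "".join(
        '<td><img src="{src}" alt="Drawing" style="width: {w}px;"/></td>'.format(
            src=escape(path, quote=True),  # keep quotes in paths from breaking the attribute
            w=int(width),
        )
        for path in paths
    )
    return f"<table><tr>{cells}</tr></table>"


if __name__ == "__main__":
    # In a notebook: from IPython.display import HTML, display
    #                display(HTML(side_by_side(["img1.jpg", "img2.jpg"])))
    print(side_by_side(["img1.jpg", "img2.jpg"]))
```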
I found the following works in a Markdown cell:
<tr>
<td> <img src="Nordic_trails.jpg" alt="Drawing" style="width: 250px;"/> </td>
<td> <img src="Nordic_trails.jpg" alt="Drawing" style="width: 250px;"/> </td>
</tr>
Table of pictures :
|![alt](pathToImage1.jpg) |![alt](pathToImage2.jpg)|
|-|-|
|![alt](pathToImage3.jpg) | ![alt](pathToImage4.jpg)
|![alt](pathToImage5.jpg) | ![alt](pathToImage6.jpg)
View :
<table><tr>
<td>
<p align="center" style="padding: 10px">
<img alt="Forwarding" src="images/IMG_20201012_183152_(2).jpg" width="320">
<br>
<em style="color: grey">Forwarding (Anahtarlama)</em>
</p>
</td>
<td>
<p align="center">
<img alt="Routing" src="images/IMG_20201012_183158_(2).jpg" width="515">
<br>
<em style="color: grey">Routing (yönlendirme)</em>
</p>
</td>
</tr></table>
I'm using VSCode, with native markdown and that solution works for me in terms ...
![alt](yourimg1.jpg) | ![alt](yourimg2.jpg)
It's because I need to insert a lot of images on my website.
Like this:
So, it works on the first two pictures and the others, it doesn't work =/
I find that I need to add some space between image tags
So, I did this and works fine, like the attached picture:
![alt](yourimg1.jpg) | ![alt](yourimg2.jpg)
// space 1
// space 2
// space 3
![alt](yourimg1.jpg) | ![alt](yourimg2.jpg)
// space 1
// space 2
// space 3
And it worked properly for me!
I hope that helped you!