scrapy for table content in pdf file - python

I am working on scraping tables from PDF files using Python.
Can someone suggest a good module that fetches only the required table?
I have tried pypdf, pdf2html, OCR, and slate, but nothing works.
Thanks

First, convert PDF to HTML. See Converting PDF to HTML with Python.
And then, using an HTML parsing library, parse the HTML generated from the PDF. See BeautifulSoup HTML table parsing
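To illustrate that second step, here is a minimal standard-library sketch of pulling the rows out of a `<table>` in the generated HTML. The sample markup and cell values are invented for the demonstration; a real converter's output would be fed in instead:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects the text of each <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# stand-in for HTML produced by the PDF-to-HTML conversion
html_from_pdf = "<table><tr><th>Item</th><th>Qty</th></tr><tr><td>Bolt</td><td>42</td></tr></table>"
parser = TableExtractor()
parser.feed(html_from_pdf)
print(parser.rows)  # [['Item', 'Qty'], ['Bolt', '42']]
```

BeautifulSoup's `find_all('tr')` / `find_all('td')` would do the same job with less code if the library is available.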

Related

how to parse a website if the url of the search request is changing

I have a stock website from which I want to parse some content into a CSV file for now. I'm using Python with requests-HTML or BeautifulSoup.
The problem is that the URL keeps changing.
The website: https://www.tadawul.com.sa/
An example of a stock: https://www.tadawul.com.sa/wps/portal/tadawul/market-participants/issuers/issuers-directory/company-details/!ut/p/z1/04_Sj9CPykssy0xPLMnMz0vMAfIjo8zi_Tx8nD0MLIy83V1DjA0czVx8nYP8PI0MDAz0I4EKzBEKDEJDLYEKjJ0DA11MjQzcTfXDCSkoyE7zBAC-SKhH/?companySymbol=6002
This "04_Sj9CPykssy0xPLM...." part changes periodically.
Why is it changing? How can I parse the webpage?
I found the same data in the page source.
Is there a better way to parse pages on this website?
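The long "04_Sj9..." token looks like server-generated portal navigation state (the /wps/portal/ path suggests IBM WebSphere Portal), which the server regenerates periodically, so it is probably not meant to be stable. The part that does stay stable is the companySymbol query parameter. As a hedged sketch, you could key your scraper on that parameter rather than on the opaque path segment; extracting it with the standard library looks like this (the opaque segment is abbreviated below):

```python
from urllib.parse import urlparse, parse_qs

# the opaque portal-state segment is abbreviated here
url = ("https://www.tadawul.com.sa/wps/portal/tadawul/market-participants/"
       "issuers/issuers-directory/company-details/"
       "!ut/p/z1/04_Sj9CPykssy0xPLM/?companySymbol=6002")

# the query string survives the periodic changes to the path
symbol = parse_qs(urlparse(url).query)["companySymbol"][0]
print(symbol)  # 6002
```

Since you found the same data in the page source, fetching any current URL for the symbol and parsing that source with BeautifulSoup should keep working even as the path segment rotates.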

How to parse HTML code saved as text?

I have multiple .txt files containing HTML code (HTML from web pages was copied and saved as .txt).
I want to parse these files as HTML. Are there any libraries with functionality similar to the requests+bs4 bundle that can treat input from text files as the result of usual web parsing?
Thank you for your help.
As many of the comments stated, it is possible to feed a .txt file to BeautifulSoup():
from bs4 import BeautifulSoup

path = 'path/to/file.txt'
with open(path) as f:
    text = f.read()
soup = BeautifulSoup(text, 'lxml')
You may be looking for Beautiful Soup, which can parse and read text from HTML quite easily:
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Python Save XML Webpage to .mht

I have a single diagnostic webpage on a device, with charts, that is in XML format and made up of XSL and GIF files. Is there a way with Python to download the entire page and save it as a single .mht file rather than as separate files?
This is essentially a combination of those two problems:
How to save "complete webpage" not just basic html using Python
https://stackoverflow.com/a/44815028/679240
AFAIK, you could download the page with urllib, parse the HTML with Beautiful Soup, find the images and other dependencies in the parsed HTML, download those, rewrite the image URLs in the parsed HTML to point to the local copies (Beautiful Soup can do this), save the modified HTML back to disk, and use MHTifier to generate the MHT.
Perhaps Scrapy could help you, too.
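The last step of that pipeline can be sketched with the standard library alone, since an .mht file is just a MIME multipart/related message written to disk. The HTML string, GIF bytes, and file name below are placeholders standing in for the downloaded page and its images:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.image import MIMEImage

# placeholders for the downloaded page and one of its images
page_html = "<html><body><img src='chart.gif'></body></html>"
gif_bytes = b"GIF89a\x01\x00\x01\x00"  # stub bytes, not a real chart

msg = MIMEMultipart("related")
msg["Subject"] = "Device diagnostics"

# the page itself goes in first, as the root part
msg.attach(MIMEText(page_html, "html"))

img = MIMEImage(gif_bytes, "gif")
# Content-Location ties the part back to the URL used in the HTML
img.add_header("Content-Location", "chart.gif")
msg.attach(img)

# an .mht file is simply this MIME message written out verbatim
with open("diagnostics.mht", "wb") as f:
    f.write(msg.as_bytes())
```

Browsers that support MHT resolve the img src against each part's Content-Location header, which is why the rewrite step in the answer above can be skipped when the original URLs are recorded there instead.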
Hi, I was able to convert both a web page and a local HTML file to .mht using win32com.
You can have a look at this:
https://stackoverflow.com/a/59321911/5290876.
You can share a sample XML with XSL and images for testing.

python html parser which doesn't modify actual markup?

I want to parse HTML code in Python and have already tried Beautiful Soup and PyQuery. The problem is that those parsers modify the original code, e.g. they insert some tag, etc. Is there any parser out there that does not change the code?
I tried HTMLParser, but no success! :(
It doesn't modify the code and just tells me where tags are placed, but it fails when parsing web pages like mail.live.com.
Any idea how to parse a web page just like a browser?
You can use BeautifulSoup to extract just the text without modifying the tags. It's in their documentation.
Same question here:
How to extract text from beautiful soup
No, to this moment there is no such HTML parser, and every parser has its own limitations.
Have you tried the webkit engine with Python bindings?
See this: https://github.com/niwibe/phantompy
You can traverse the real DOM of the parsed web page and do what you need to do.
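For the narrower goal of locating tags without touching the source at all, the stdlib HTMLParser mentioned in the question really is a read-only reporter: it never rewrites the input, it only fires callbacks with positions (though, as the question notes, heavily script-driven pages like mail.live.com still need a real browser engine). A small sketch:

```python
from html.parser import HTMLParser

class TagLocator(HTMLParser):
    """Records each start tag with its (line, column); the input string is never modified."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        line, col = self.getpos()
        self.tags.append((tag, line, col))

src = "<div>\n  <p>hello</p>\n</div>"
locator = TagLocator()
locator.feed(src)
print(locator.tags)  # [('div', 1, 0), ('p', 2, 2)]
```

Because you only ever read src and slice it by the reported positions, the original markup stays byte-for-byte intact.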

html to text conversion using python language

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.
I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.
Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
You would need to use the urllib2 Python library to get the HTML from the website and then parse through the HTML to grab the text that you want.
Use BeautifulSoup to parse through the HTML:
import urllib2
from BeautifulSoup import BeautifulSoup

resp = urllib2.urlopen("http://stackoverflow.com")
rawhtml = resp.read()
# parse the HTML so the text can be pulled out of it
soup = BeautifulSoup(rawhtml)
I don't think "copy-paste from browser" is a well-defined operation. For instance, what would happen if the entire page were covered with a transparent floating div? What if it had tables? What about dynamic content?
BeautifulSoup is a powerful parser; you just need to know how to use it (it is easy, for instance, to remove the script tags from the page). Fortunately, it has a lot of documentation.
You can use xml.sax.saxutils.unescape to unescape HTML entities.
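A note on that suggestion: by default xml.sax.saxutils.unescape only handles &amp;, &lt; and &gt;. For numeric references such as the &#39; the question mentions, the stdlib html.unescape (Python 3) covers the full entity set:

```python
import html

# numeric and named entities are both resolved
text = html.unescape("It&#39;s 5 &gt; 3 &amp; 2 &lt; 4")
print(text)  # It's 5 > 3 & 2 < 4
```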
