Storing HTML snippets with Python

I'm scraping pages using Beautiful Soup and I would like to save some HTML snippets offline and use them for comparison every time I scrape again, to check whether there has been any change to the page.
Aside from directly writing out an HTML file, what would be the best strategy for saving a lot of HTML snippets offline (which format?) for comparison later on?
Thank you

This is a classic use for a hash function. Algorithms like MD5 and SHA-256 boil any amount of text down to a few bytes. You can store just the hash for any file you parse; when you get a new file, calculate its hash and compare the two.
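For example, a minimal sketch using Python's hashlib (the snippet strings here are just placeholders):

import hashlib

def snippet_hash(html):
    # Boil the snippet down to a fixed-size SHA-256 hex digest.
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

old_html = "<div>previously scraped snippet</div>"  # placeholder
new_html = "<div>freshly scraped snippet</div>"     # placeholder

stored = snippet_hash(old_html)  # store only this, e.g. keyed by URL

if snippet_hash(new_html) != stored:
    print("Page changed since the last scrape")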

Related

Ripping video links out of HTML pages using Python

I have a bunch of HTML pages with video players embedded in them via various HTML tags: the <video> tag, but others too. What all of these different approaches to holding video files have in common is that they link to various common video hosting websites, such as
YouTube
Rumble
Bitchute
Brighteon
Odysee
Vimeo
Dailymotion
Videos originating from different hosting websites may be embedded in different ways. For example, they may or may not use the <video> tag.
Correction: I am not dealing with a bunch of websites, I am dealing with a bunch of HTML pages stored locally. I want to use Python to rip the links to these videos out of the HTML pages and save them into some JSON file, which could later be read and fed into youtube-dl for downloading all these videos.
My question is, how exactly would I go about doing this? What kind of strategy should I pursue? Should I just read the HTML file as a plain text file in Python and then use some kind of algorithm or regular expression to look for links to these video hosting websites? If that is the case, then I am bad at regular expressions and would like some assistance with how to find links to the video websites in the text using regular expressions in Python.
Alternatively, I could make use of HTML's DOM structure. I do not know if this is possible in Python or not, but the idea is to read the HTML not as a simple text file but as a DOM tree, traversing up and down the tree to pick up only the tags that have videos embedded in them. I do not know how to do that either.
I guess what I'm trying to say is that I first need some kind of strategy to achieve my big goal, and then I need to know what kind of code or APIs to use in order to achieve it: ripping links to video files out of the HTML and saving them somewhere.
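One hedged sketch of the DOM-tree route uses BeautifulSoup; here the pages/ directory, the host list, and the output file name are all placeholder assumptions:

import json
from pathlib import Path
from bs4 import BeautifulSoup

VIDEO_HOSTS = ("youtube.com", "youtu.be", "rumble.com", "bitchute.com",
               "brighteon.com", "odysee.com", "vimeo.com", "dailymotion.com")

links = set()
for page in Path("pages").glob("*.html"):  # assumed location of the local HTML files
    soup = BeautifulSoup(page.read_text(encoding="utf-8"), "html.parser")
    # Check the tags that commonly carry an embedded video or a plain link.
    for tag in soup.find_all(["video", "source", "iframe", "a"]):
        url = tag.get("src") or tag.get("href") or ""
        if any(host in url for host in VIDEO_HOSTS):
            links.add(url)

Path("video_links.json").write_text(json.dumps(sorted(links), indent=2))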

Grabbing data from separate links of the same website

Thank you for your time to read this
I wanted to know if there's any way that I can get a specific piece of data from different links that are all on the same domain. I mean, if I put in many Facebook page links, it gets all their names into a text file, each one on a different line.
If I understood correctly, you need the user's name from the link:
facebook.com/zuck
facebook.com/moskov
You can fetch each page and extract the page title, though this may not always be accurate:
> <title id="pageTitle">Mark Zuckerberg</title>
> <title id="pageTitle">Dustin Moskovitz</title>
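A rough sketch of that idea with requests and BeautifulSoup (whether Facebook still serves the name in the title to a plain request is not guaranteed):

import requests
from bs4 import BeautifulSoup

urls = ["https://facebook.com/zuck", "https://facebook.com/moskov"]

with open("names.txt", "w", encoding="utf-8") as out:
    for url in urls:
        html = requests.get(url, timeout=10).text
        title = BeautifulSoup(html, "html.parser").title
        # The <title id="pageTitle"> usually holds the profile name, one per line.
        out.write((title.get_text(strip=True) if title else url) + "\n")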
html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).
https://github.com/Alir3z4/html2text
If you want to read the HTML from a URL, check the explanation below:
How to read html from a url in python 3

Copy a table from a website using selenium and flask

For my local sports club website it would be nice to copy and sync the rankings and results from the official league website. I'm using flask and selenium for python 3.5.
So far I'm using
driver.find_element_by_class_name("table")
to locate the tables. Is there an efficient way to store this and pass this on to the jinja templates all at once? Or do I have to store and process all the different parts of the table (header, rows, elements) separately?
As you have the information in a <table>, you should just extract the information based on <tr>, <td> (and possibly <th>), store that in a CSV or other structured file (YAML, JSON), and take the data for the Jinja templates from there.
If you only update your file when the data changes, this is one of the more efficient ways; you can check e.g. every hour whether the input (the official league table) has changed.
This decoupling is also important for when the league site changes to use e.g. <div> and <span> and your input processing needs to be adapted.
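A minimal sketch of that extraction with BeautifulSoup and the csv module (the inline HTML is a stand-in for driver.page_source from the question):

import csv
from bs4 import BeautifulSoup

html = """<table>
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>Club A</td><td>42</td></tr>
</table>"""  # stand-in for driver.page_source

soup = BeautifulSoup(html, "html.parser")
with open("rankings.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for row in soup.find("table").find_all("tr"):
        # <th> cells become the header row, <td> cells the data rows.
        writer.writerow(cell.get_text(strip=True)
                        for cell in row.find_all(["th", "td"]))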
Will's suggestion to use BeautifulSoup is a good one, especially if the data to process is large: a one-time retrieve of the HTML from Selenium followed by processing with BeautifulSoup is much faster. If you are not willing to invest time into that, look at using full CSS selectors (not just a class) for selecting elements in Selenium (using .find_element_by_css_selector()); that is most easily translated to BeautifulSoup (using .select()) once you are going to make the transition.
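For example, roughly the same selector works on both sides (a sketch; driver is the Selenium 3.x WebDriver from the question and the class name is illustrative):

from bs4 import BeautifulSoup

# Selenium side (old 3.x API, as used in the question):
table = driver.find_element_by_css_selector("table.ranking")

# BeautifulSoup side, after a one-time hand-over of the HTML:
soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.select("table.ranking")[0]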

Best way to import a Python list into an HTML table?

I wrote a script that scrapes various things from around the web and stores them in a Python list, and I have a few questions about the best way to get it into an HTML table to display on a web page.
First off, should my data be in a list? It will be at most a 25-by-9 list.
I'm assuming I should write the list to a file for the web site to import? Is a plain text file preferred, or something like a CSV or XML file?
What's the standard way to import a file into a table? In my quick look around the web I didn't see an obvious answer (major web design beginner). Is JavaScript the best thing to use? Or can Python write out something that can easily be rendered as HTML?
Thanks
Store everything in a database, e.g. sqlite, mysql, mongodb, redis ...
then query the db every time you want to display the data.
This is good for changing it later or feeding it from multiple sources.
Store everything in a "flat file": sqlite, xml, json, msgpack;
again, open and read the file whenever you want to use the data,
or read it in completely on startup.
Simple and often fast enough.
Generate an HTML file from your list with a template engine, e.g. Jinja, and save it as an HTML file.
Good for simple hosters.
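A minimal sketch of that last option with Jinja2 (the list contents and the template are placeholders):

from jinja2 import Template

rows = [["Alpha", 1, 2], ["Beta", 3, 4]]  # placeholder for the scraped list

template = Template(
    "<table>\n"
    "{% for row in rows %}<tr>"
    "{% for col in row %}<td>{{ col }}</td>{% endfor %}"
    "</tr>\n{% endfor %}"
    "</table>\n")

with open("table.html", "w", encoding="utf-8") as out:
    out.write(template.render(rows=rows))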
There are some good Python web frameworks out there; some I have used:
Flask, Bottle, Django, Twisted, Tornado.
They all more or less output HTML.
Feel free to use HTML5/DHTML/JavaScript.
You could use a web framework to create an "API" on the backend which serves JSON or XML.
Then your JavaScript callback will display it on your site.
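A minimal sketch of that backend idea with Flask (the route and the data are placeholders):

from flask import Flask, jsonify

app = Flask(__name__)
DATA = [["Alpha", 1], ["Beta", 2]]  # placeholder for the scraped list

@app.route("/api/table")
def table_data():
    # Front-end JavaScript can fetch() this and render the table client-side.
    return jsonify(DATA)

if __name__ == "__main__":
    app.run()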
The most direct way to create an HTML table is to loop through your list and print out the rows.
print('<table><tr><th>Column1</th><th>Column2</th>...</tr>')
for row in my_list:
    print('<tr>')
    for col in row:
        print('<td>%s</td>' % col)  # consider html.escape(col) if values may contain markup
    print('</tr>')
print('</table>')
Adjust the code as needed for your particular table.

Using Beautifulsoup and regex to traverse javascript in page

I'm fetching webpages with a bunch of JavaScript on them, and I'm interested in parsing through the JavaScript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script', {'type': 'text/javascript'})
which yields a list of matching script tags, which I can loop over to search for the text I'd like:
import re

pattern = re.compile(r"patterniwant(.*)")  # stand-in for the real pattern
for script in scriptResults:
    for block in script:
        match = pattern.search(block)
        if match:
            extracted = match.group(1)  # pull the interesting piece out of the line
(The pattern here is just a stand-in, of course.)
I was wondering if there was a better way for me to just use regex to find the pattern in the soup itself, searching only through the scripts themselves? My implementation works, but it just seems really clunky so I wanted something more elegant and/or efficient and/or Pythonic.
Thanks in advance!
A lot of websites have client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it using Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
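A sketch of that extraction (the pageData variable name and the script body are hypothetical):

import json
import re

script = 'var pageData = {"title": "Example", "views": 42};'  # placeholder script body

# Pull out the JSON literal assigned to the hypothetical pageData variable.
match = re.search(r'pageData\s*=\s*(\{.*?\});', script)
if match:
    data = json.loads(match.group(1))
    print(data["title"])  # -> Example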
Another option is to check with your browser what sort of AJAX requests the application makes. Quite often these also return structured data as JSON.
I would also check if the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of websites include this to improve page SEO and integrate better with social network sharing.
