Thank you for taking the time to read this.
I wanted to know if there's any way to extract a specific piece of text from different links that all belong to the same domain. For example, if I supply several Facebook page links, it should get all of their names and write them to a text file, one name per line.
I think, if I understood correctly, you need the user's name from the link.
facebook.com/zuck
facebook.com/moskov
You can fetch each page and extract the page title, though this may not always be accurate:
> <title id="pageTitle">Mark Zuckerberg</title>
> <title id="pageTitle">Dustin Moskovitz</title>
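A minimal sketch of that approach, assuming the pages are publicly fetchable (Facebook often blocks unauthenticated scrapers, so treat the URLs and the output file name as illustrative):

```python
import requests
from bs4 import BeautifulSoup

def title_from_html(html):
    """Return the page title text, or None if the page has no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("title")
    return tag.get_text(strip=True) if tag else None

def names_from_links(urls, outfile="names.txt"):
    """Fetch each URL and write one page title per line to outfile."""
    names = []
    for url in urls:
        resp = requests.get(url, timeout=10)
        name = title_from_html(resp.text)
        if name:
            names.append(name)
    with open(outfile, "w", encoding="utf-8") as f:
        f.write("\n".join(names) + "\n")
    return names
```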
html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).
https://github.com/Alir3z4/html2text
If you want to read the HTML from a URL, check the explanation below:
How to read html from a url in python 3
I have a bunch of HTML pages with video players embedded in them, via various different HTML tags: the <video> tag, but also other ones too. What all of these different approaches to embedding video have in common is that they link to various common video hosting websites, such as:
YouTube
Rumble
Bitchute
Brighteon
Odysee
Vimeo
Dailymotion
Videos originating from different hosting websites may be embedded in different ways. For example, they may or may not use the <video> tag.
Correction: I am not dealing with a bunch of websites, I am dealing with a bunch of HTML pages stored locally. I want to use Python to rip the links to these videos from the HTML pages and save them into some JSON file, which could later be read and fed into youtube-dl to download all of the videos.
My question is: how exactly would I go about doing this? What kind of strategy should I pursue? Should I just read the HTML file as plain text in Python and then use some algorithm or regular expression to look for links to these video hosting websites? If so, I am bad at regular expressions and would like some assistance on how to find links to the video websites in the text using regular expressions in Python.
Alternatively, I could make use of HTML's DOM structure. I do not know if this is possible in Python, but the idea is to read the HTML not as a plain text file but as a DOM tree, traversing up and down the tree to pick up only the tags that have videos embedded in them. I do not know how to do that either.
I guess what I'm trying to say is that I first need some kind of strategy to achieve my goal, and then I need to know what kind of code or APIs to use in order to achieve it: ripping links to video files out of the HTML and saving them somewhere.
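For what it's worth, the DOM-based strategy can be sketched with BeautifulSoup. The host list, the set of tags scanned, and the JSON layout below are assumptions to illustrate the approach, not a fixed recipe:

```python
import json
import re
from bs4 import BeautifulSoup

# Hosts named in the question; extend as needed.
VIDEO_HOSTS = re.compile(
    r"(youtube\.com|youtu\.be|rumble\.com|bitchute\.com|"
    r"brighteon\.com|odysee\.com|vimeo\.com|dailymotion\.com)",
    re.IGNORECASE,
)

def video_links(html):
    """Return all links in the HTML that point at a known video host."""
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    # Check the tags that commonly carry video URLs, not just <video>.
    for tag in soup.find_all(["a", "iframe", "embed", "source", "video"]):
        url = tag.get("href") or tag.get("src")
        if url and VIDEO_HOSTS.search(url):
            links.add(url)
    return sorted(links)

def dump_links(paths, outfile="videos.json"):
    """Scan local HTML files and save the found links as JSON."""
    found = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            found[path] = video_links(f.read())
    with open(outfile, "w", encoding="utf-8") as f:
        json.dump(found, f, indent=2)
```

The resulting JSON maps each file path to its list of video URLs, which could then be fed to youtube-dl.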
I'm scraping pages using Beautiful Soup, and I would like to save some HTML snippets offline and compare against them every time I scrape again, to check whether there has been any change to the page.
Aside from directly writing out an HTML file, what would be the best strategy (and format) for saving a lot of HTML snippets offline for later comparison?
Thank you
This is a classic use for a hash function. Algorithms like md5 and sha256 boil any amount of text down to a few bytes. You can store just the hashes for any file you parse, and then when you get a new file, calculate the hash of that and compare the two hashes.
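A minimal sketch of that idea with Python's hashlib:

```python
import hashlib

def snippet_hash(html):
    """Return a short, stable fingerprint of an HTML snippet."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

# Store only the hash of the old snippet; on the next scrape,
# hash the new snippet and compare.
old = snippet_hash("<div>price: 10</div>")
new = snippet_hash("<div>price: 12</div>")
changed = old != new  # True here, since the content differs
```

This way you store 64 hex characters per snippet instead of the full HTML, at the cost of not knowing *what* changed, only *that* it changed.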
I am doing a digital signage project using a Raspberry Pi. The R-Pi will be connected to an HDMI display and to the internet. There will be one XML file and one self-designed HTML webpage on the R-Pi. The XML file will be frequently updated from a remote terminal.
My idea is to parse the XML file using Python (lxml) and pass the parsed data to my local HTML webpage so that it can be displayed in the R-Pi's web browser. The webpage will reload frequently to reflect the changed data.
I was able to parse the XML file using Python (lxml). But what tools should I use to display the parsed contents (mostly strings) in a local HTML webpage?
This question might sound trivial, but I am very new to this field and could not find a clear answer anywhere. There are methods that use PHP to parse the XML and pass it to an HTML page, but per my other requirements I am bound to use Python.
I think there are 3 steps you need to make it work.
1. Extract only the data you want from the given XML file.
2. Use a simple template engine to insert the extracted data into an HTML file.
3. Use a web server to serve the file created above.
Step 1) You are already using lxml, which is a good library for this, so I don't think you need help there.
Step 2) There are many Python templating engines out there, but for a simple purpose you just need an HTML file created in advance with some special markup, such as {{0}}, {{1}} or whatever works for you. This is your template. Take the data from step 1, do a find-and-replace in the template, and save the output to a new HTML file.
Step 3) To make that file accessible from a browser on a different device or PC, you need to serve it using a simple HTTP web server. Python provides the http.server library, or you can use a 3rd-party web server; just make sure it can access the file created in step 2.
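Steps 2 and 3 can be sketched like this; the placeholder scheme, sample values, and file names are just examples:

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

def render(template, values):
    """Replace {{0}}, {{1}}, ... in the template with extracted values."""
    for i, value in enumerate(values):
        template = template.replace("{{%d}}" % i, value)
    return template

# Step 2: fill the template and save the result.
html = render("<h1>{{0}}</h1><p>{{1}}</p>", ["Now showing", "Item 42"])
with open("index.html", "w", encoding="utf-8") as f:
    f.write(html)

# Step 3: serve the current directory on port 8000 (blocks forever):
# HTTPServer(("", 8000), SimpleHTTPRequestHandler).serve_forever()
```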
Instead of passing the parsed data (parsed from an XML file) to specific components in the HTML page, I've written Python code that rewrites the entire HTML webpage periodically.
Suppose we have an XML file, a Python script, and an HTML webpage.
XML file: contains certain values that are updated periodically and are to be parsed.
Python script: parses the XML file (whenever the XML file changes) and updates the HTML page with the newly parsed values.
HTML webpage: shown on the R-Pi screen and reloaded periodically (to reflect any changes in the browser).
The Python code will have a string (say, page) that contains the code of the HTML page, for example:
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
Then, if we would like to update My first paragraph with a value parsed from the XML, we can use Python's string replacement. Note that str.replace returns a new string rather than modifying in place, so the result must be assigned back:
page = page.replace("My first paragraph.", root[0][1].text)
After the replacement is done, write the entire string (page) back to the HTML file. The HTML file now has the new code, and once the page is reloaded, the updated content shows up in the browser on the R-Pi.
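Putting the pieces together, here is a minimal sketch of that refresh cycle. It uses the stdlib xml.etree.ElementTree (lxml's etree offers the same parse/indexing interface for this purpose), and the root[0][1] indexing is an assumption about the XML layout:

```python
import xml.etree.ElementTree as ET

TEMPLATE = """<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""

def refresh(xml_path, html_path="index.html"):
    """Re-render the HTML file from the current XML values."""
    root = ET.parse(xml_path).getroot()
    # root[0][1] mirrors the indexing in the answer above; adjust it
    # to match your real XML structure.
    page = TEMPLATE.replace("My first paragraph.", root[0][1].text)
    with open(html_path, "w", encoding="utf-8") as f:
        f.write(page)
```

Calling refresh() from a timer (or whenever the XML changes) keeps the served page up to date.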
I'm fetching webpages with a bunch of javascript on it, and I'm interested in parsing through the javascript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script',{'type' : 'text/javascript'})
which yields a list of scripts, which I can loop over to search for the text I'd like:
for script in scriptResults:
    for block in script:
        if *patterniwant* in block:
            **extract pattern from line using regex**
(Text in asterisks is pseudocode, of course.)
I was wondering if there is a better way to find the pattern with regex in the soup itself, searching only through the scripts. My implementation works, but it seems clunky, and I wanted something more elegant, efficient, or Pythonic.
Thanks in advance!
A lot of websites have client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it using Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
Another option is to check in your browser what sort of AJAX requests the application makes. Quite often these also return structured data as JSON.
I would also check whether the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of websites include this to improve the page's SEO and integrate better with social network sharing.
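As a sketch of the first suggestion, assuming the page's script assigns a JSON literal to a variable (the name pageData and the surrounding pattern are hypothetical):

```python
import json
import re

def json_from_script(script_text, var="pageData"):
    """Extract `var pageData = {...};` from JS and return it as a dict."""
    m = re.search(r"%s\s*=\s*(\{.*?\})\s*;" % re.escape(var),
                  script_text, re.S)
    return json.loads(m.group(1)) if m else None

js = 'var pageData = {"title": "Example", "views": 42};'
data = json_from_script(js)  # a plain Python dict
```

The non-greedy `{.*?}` works for flat objects like this one; nested objects need a more careful extraction (e.g. cutting at a balanced brace).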
I want to extract the introduction part of a Wikipedia article, ignoring everything else, including tables, images and other sections. I looked at the HTML source of the articles, but I don't see any special tag that this part is wrapped in.
Can anyone give me a quick solution to this? I'm writing python scripts.
thanks
You may want to check mwlib to parse the wikipedia source
Alternatively, use the wikidump lib
HTML screen scraping through BeautifulSoup
Ah, there are questions already on SO on this topic:
Parsing a Wikipedia dump
How to parse/extract data from a mediawiki marked-up article via python
I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first sequence of <p>...</p> blocks after the <!-- bodytext --> marker. That last bit would be this regex:
/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/
With the re.S flag set so that . also matches newlines...
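In Python that regex would look roughly like this; the <!-- bodytext --> marker is an assumption about the (older) Wikipedia page layout, so check it against the actual HTML you are parsing:

```python
import re

# re.S lets `.` match newlines, as noted above.
INTRO = re.compile(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", re.S)

def intro_paragraphs(html):
    """Return the first run of <p> blocks after the bodytext marker."""
    m = INTRO.search(html)
    return m.group(1) if m else None
```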