How exactly to programmatically copy and paste content? - python

I am trying to create an automated web browser to post something, but I don't know how to copy and paste programmatically. The reason I want to copy and paste is that it copies all of the text, including anchors. Take reddit as an example: if I type an HTML anchor tag like
<a href="...">Example Anchor</a>
it is not rendered as the link "Example Anchor"; the raw markup is output instead:
<a href="...">Example Anchor</a>
But if I copy and paste, as in the example below, the link is preserved:
I'm an Example Anchor.
It works.

The main question is why you need a browser that can copy and paste on your behalf. I have never written copy-and-paste automation software myself, but I know of a library called Simple HTML DOM for PHP that can do this work: with it you can extract text and even download photos. After getting the text you can easily paste it on whichever site you want, but most sites can easily detect that you are using a script, and the server will refuse to connect. You can use the cURL library to accomplish this task; with cURL you can log in to sites and do your work, though of course there are limits. I do not know whether you know PHP or not, but this is the easiest thing you can do without developing a whole browser.
Here is the link to the Simple HTML DOM documentation: https://simplehtmldom.sourceforge.io/
and here is the download link for the Simple HTML DOM library:
https://sourceforge.net/projects/simplehtmldom/
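Since the question itself is tagged python, here is a minimal sketch of the same extract-the-text idea using the requests and BeautifulSoup libraries instead of PHP; the URL is a placeholder, not one from the question.

import requests
from bs4 import BeautifulSoup

# Fetch a page and pull out its text and anchors, roughly what
# Simple HTML DOM does in PHP. The URL below is a placeholder.
resp = requests.get("https://example.com/some-post")
soup = BeautifulSoup(resp.text, "html.parser")

print(soup.get_text())  # all visible text
for a in soup.find_all("a"):
    print(a.get_text(), a.get("href"))  # anchor text and its link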
I hope this helps you and future readers and googlers.

Related

Python - Download Excel file from clickable "XLS" button of website without using Selenium

I am trying to write a program that goes to the following site and downloads the Excel file that you get by clicking the XLS button at the bottom of the page. To be honest, I am quite new to programming.
Site: https://echa.europa.eu/assessment-regulatory-needs
At first I tried using Selenium to let the program click its way through the browser and actually press the button. However, I think the website detects the use of automated software, and I cannot bypass the disclaimer that appears when opening the website.
Then I read some answers on the same topic saying that it is possible using the requests module. However, I could not get it to work.
One thing I think I understood from this answer was that you need to find the site/server the data is requested from by inspecting the button with F12 in the browser. I tried this, and thought I had it, however I cannot get the code to work. I learnt from this answer that you need to supply the referer as well, but I think the referer from this file is only partially written out, as it is "https://echa.europa.eu/assessment-regulatory-needs".
This answer explained the network process in more detail, however I am not able to recreate it. Also, to be honest, I do not fully understand what the API is and how to search for it.
I also found this answer, but it does not work for me either.
So I am asking for help on this, as I think that my HTML, Java, website, and Python knowledge is too limited to see what I have to change to be able to download the Excel file.
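For reference, the requests-based approach those answers describe looks roughly like the sketch below. The export URL is a hypothetical placeholder: the real one has to be copied from the browser's Network tab (F12) when the XLS button is clicked.

import requests

# EXPORT_URL is hypothetical -- copy the real request URL from DevTools.
EXPORT_URL = "https://echa.europa.eu/<endpoint-from-devtools>"
headers = {
    "User-Agent": "Mozilla/5.0",  # present as a normal browser
    "Referer": "https://echa.europa.eu/assessment-regulatory-needs",
}

resp = requests.get(EXPORT_URL, headers=headers)
resp.raise_for_status()
with open("export.xls", "wb") as f:
    f.write(resp.content)  # binary content of the spreadsheet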

Issues downloading full HTML of webpage with Python

I'm working on a project where I require all of the game IDs found in the current scores section of http://www.nhl.com/ in order to download content and parse stats for each game. I want to be able to get all current game IDs in one go, but for some reason I'm unable to download the full HTML of the page, no matter how I try. I'm using requests and beautifulsoup4.
Here's my problem:
I've determined that the particular tags I'm interested in are div's where the CSS class = 'scrblk'. So, I wrote a function to pass into BeautifulSoup.find_all() to give me, specifically, blocks with that CSS class. It looks like this:
def find_scrblk(css_class):
    # matches tags whose CSS class is exactly 'scrblk'
    return css_class == 'scrblk'
So, when I actually went to the web page in Firefox and saved it, then loaded the saved file into beautifulsoup4, I did the following:
>>> soup = bs(open('nhl.html'))
>>> soup.find_all(class_=find_scrblk)
[<div class="scrblk" id="hsb2015010029"> <div class="defaultState"....]
and everything was all fine and dandy; I had all the info I needed. However, when I tried to download the page using any of several automated methods I know, this returned simply an empty list. Here's what I tried:
using requests.get() and saving the .text attribute to a file
using the iter_content() and iter_lines() methods of the request object to write to the file piece by piece
using wget to download the page (through subprocess.call()) and opening the resultant file. For this option, I was sure to use the --page-requisites and --convert-links flags, so I downloaded (or so I thought) all the necessary data.
With all of the above, I was unable to parse out the data I need from the HTML files; it's as if they weren't being completely downloaded or something, but I have no idea what that something is or how to fix it. What am I doing wrong or missing here? I'm using Python 2.7.9 on Ubuntu 15.04.
All of the files can be downloaded here:
https://www.dropbox.com/s/k6vv8hcxbkwy32b/nhl_html_examples.zip?dl=0
As the comments on your question state, you have to re-think your approach. What you see in the browser is not what the response contains. The site uses JavaScript to load the information you are after, so you should look more carefully at the response you actually receive to find what you are looking for.
In the future, to diagnose such problems, open Chrome's developer tools, disable JavaScript, and load the site that way. You will then see whether you are facing JS-rendered content or whether the raw page contains the values you are looking for.
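You can run the same check from Python; this is just a rough sketch using requests:

import requests

# If 'scrblk' never appears in the raw HTML, the score blocks are being
# injected by JavaScript and requests/wget alone will never see them.
html = requests.get("http://www.nhl.com/").text
print("scrblk" in html)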
And by the way, what you are doing is against the Terms of Service of the NHL website (according to Section 2, Prohibited Content and Activities):
Engage in unauthorized spidering, scraping, or harvesting of content or information, or use any other unauthorized automated means to compile information;

Suggestions on creating an app for getting information from a webpage

First I want to say that I have experience with Python and some web libraries like mechanize, Beautiful Soup, and urllib2.
The idea is to create an app that will grab information from the webpage I am currently looking at in my web browser, and then store it.
For example:
I manually go to the website and create a user.
Then I run my app, which will grab some details from the webpage I'm currently looking at, like user name, first name, last name and so on.
Problems:
I don't know how to make a program run on top of my web browser. I can't simply make a script to log in to this webpage and do the rest with Beautiful Soup, because the site has very good protection against web crawlers and bots.
I need some place to start. So the main question is: is it possible to grab information that is currently in my web browser? If yes, I hope to hear some suggestions on how to make my program look at the browser.
Please feel free to ask me if you don't quite understand what I'm asking, or if you have some suggestions or libraries that I can use.
The easiest thing to do is probably to save the HTML content of the current page to a file (using File -> Save Page As or whatever it is in your browser) and then run Beautiful Soup / lxml.html / whatever on that file.
You could probably also get Selenium to do what you want, though I've never used it and am not sure.
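A rough sketch of that save-then-parse workflow, assuming the page was saved as saved_page.html; the class name used in the selector is hypothetical and depends on the actual site.

from bs4 import BeautifulSoup

# Parse a page saved via File -> Save Page As.
with open("saved_page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

user = soup.find(class_="username")  # hypothetical class name
if user is not None:
    print(user.get_text(strip=True))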

Is there a python module which scrapes the image, title and description of any link?

What I'm looking for should give me something like this -> [example image of a link preview]
There are many APIs available that can accomplish your task (more precisely, the task you describe in your question, not the image :) ). I personally use diffbot, which I discovered after reading this. Beware, though: this kind of "content" extraction does not always succeed, because of the nature of web pages. It relies on heuristics and training, and thus may not suffice for your specific purposes...
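If you would rather not depend on a third-party API, one common lightweight alternative (a sketch, not something from the answer above) is to read the page's title and Open Graph meta tags yourself, since most link previews are built from them:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://example.com").text, "html.parser")

def og(prop):
    # Open Graph tags look like <meta property="og:title" content="...">
    tag = soup.find("meta", property="og:" + prop)
    return tag["content"] if tag else None

title = og("title") or (soup.title.string if soup.title else None)
print(title, og("description"), og("image"))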
If you want an entire screenshot of the page, then something like https://stackoverflow.com/questions/1041371/alexa-api may help you.
Otherwise, if you just want to get a few key images from the page,
you could use mechanize to assist you. When you connect to a webpage, you can iterate over all the links on the page using:
for link in br.links():
where br is your browser object.
You can see an example here:
Download all the links(related documents) on a webpage using Python
If you print dir(link), it will show you various properties such as link.text and link.url. Furthermore, you can import urlparse.urlsplit and use it on the URL. You can direct the browser to the URL and scrape the images as shown in the example above.
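Putting that together, a minimal mechanize sketch (the URL is a placeholder):

import mechanize

# Iterate over every link on a page; mechanize is a Python 2-era library.
br = mechanize.Browser()
br.open("http://example.com")
for link in br.links():
    print(link.text, link.url)  # anchor text and target URL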
You could also use a search engine's interpretation of the page and the images in it.
You could use the Python wrapper for the Bing API, or the xGoogle library.
Beware that the xGoogle library pretends to Google that it is a browser, and may not be an endorsed way to consume Google's data.
This one should help: http://palewi.re/posts/2008/04/20/python-recipe-grab-a-page-scrape-a-table-download-a-file/
It teaches you how to scrape content and images and store them.

Grabbing text from a webpage

I would like to write a program that will find bus stop times and update my personal webpage accordingly.
If I were to do this manually I would
Visit www.calgarytransit.com
Enter a stop number, e.g. 9510
Click the button "next bus"
The results may look like the following:
10:16p Route 154
10:46p Route 154
11:32p Route 154
Once I've grabbed the time and routes then I will update my webpage accordingly.
I have no idea where to start. I know diddly squat about web programming but can write some C and Python. What are some topics/libraries I could look into?
Beautiful Soup is a Python library designed for parsing web pages. Between it and urllib2 (urllib.request in Python 3) you should be able to figure out what you need.
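As a rough sketch of how those two fit together, using the Python 3 spelling urllib.request (the query URL and stop parameter are hypothetical; inspect the site's form to find the real ones):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# The URL/parameter below is hypothetical -- check the real form action.
html = urlopen("https://www.calgarytransit.com/nextride?stop=9510").read()
soup = BeautifulSoup(html, "html.parser")
for entry in soup.find_all(string=lambda t: "Route" in t):
    print(entry.strip())  # e.g. '10:16p Route 154'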
What you're asking about is called "web scraping." I'm sure if you google around you'll find some stuff, but the core notion is that you want to open a connection to the website, slurp in the HTML, parse it and identify the chunks you want.
The Python Wiki has a good lot of stuff on this.
Since you write in C, you may want to check out cURL; in particular, take a look at libcurl. It's great.
You can use the mechanize library that is available for Python http://wwwsearch.sourceforge.net/mechanize/
You can use Perl to help you complete your task.
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;                   # create a user agent
my $response = $browser->get("http://google.com");   # fetch the page
die $response->status_line unless $response->is_success;
print $response->content;                            # print the raw HTML
Your response object can tell you whether the request succeeded, as well as returning the content of the page. You can also use this same library to post to a page.
Here is some documentation: http://metacpan.org/pod/LWP::UserAgent
That site doesn't offer an API for you to get the appropriate data that you need. In that case you'll need to parse the actual HTML page returned by, for example, a cURL request.
This is called Web scraping, and it even has its own Wikipedia article where you can find more information.
Also, you might find more details in this SO discussion.
As long as the layout of the web page you're trying to 'scrape' doesn't change regularly, you should be able to parse the HTML with any modern-day programming language.
