I'm a beginner in Python, trying to start my very first project, which revolves around creating a program to automatically fill in pre-defined values in forms on various websites.
Currently, I'm struggling to find a way to identify web elements using the text shown on the website. For example, website A's email field shows "Email:" while another website might show "Fill in your email". In such cases, finding the element using ID or name would not be possible (unless I write a different set of code for each website) as they vary from website to website.
So, my question is: is it possible to write code that will scan all the fields -> check the text -> then fill in the values based on the text associated with each field?
It is possible if you know the markup of the page and can write code to parse it. In that case you can use XPath, lxml, Beautiful Soup, Selenium, etc. There are plenty of tutorials on Google and YouTube; just search for "python scraping".
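For instance, here is a minimal sketch of that approach with Selenium (the URL, the label wording, and the reliance on label "for" attributes are assumptions for illustration, not taken from your sites):

# Find fields by the visible label text near them, so "Email:" and
# "Fill in your email" both match the same rule.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/signup")  # placeholder URL

# translate() lower-cases the relevant letters so the match is case-insensitive.
labels = driver.find_elements(
    By.XPATH, "//label[contains(translate(., 'EMAIL', 'email'), 'email')]")
for label in labels:
    field_id = label.get_attribute("for")  # the input this label describes
    if field_id:
        driver.find_element(By.ID, field_id).send_keys("user@example.com")

This only works when the page uses proper label elements; sites that put the prompt in placeholder text or a neighboring div need a different XPath.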
But if you want to write a program that can understand a random page on a random site and figure out what it should do, that is very difficult; it's a complex task that would involve machine learning. I'd say that task is definitely not for beginners.
I'm creating a "data catalog" on SharePoint that consists of a page of links to other pages. Objects are clickable and take the user to a page where related Attributes are listed. (I know that I'd be better off using data cataloging software, but it's a non-starter atm.)
I know that SharePoint has a "search box" web part, but it's apparently available for classic view only, and we are using modern view (and can't switch to classic). I really need a search box web part. Is this possible? I'm Python-literate, so if this can be coded in Python or something comparable, I could probably manage it.
Has anybody else solved this issue?
You could use the SharePoint Framework modern search web parts on a modern page.
Get to know about the SharePoint Framework (SPFx).
I am trying to grab a bunch of numbers that are presented in a table on a web page that I've accessed using Python and Selenium running headless on a Raspberry Pi. The numbers are not in the page source; rather, they are deeply embedded in complex HTML served by several URLs called by the main page (the numbers update every few seconds). I know I could parse the HTML to get the numbers I want, but the numbers are already sitting on the front page in perfect format, all in one place. I can select and copy the numbers when I view the web page in Chrome on my PC.
How can I use Python and Selenium WebDriver to get those numbers? Can Selenium simply provide all the visible text on a page? How? (I've tried driver.page_source, but the text returned does not contain the numbers.) Or is there a way to essentially copy text and numbers from a table visible on the screen using Python and Selenium? (I've looked into xdotool but didn't find enough documentation to help.) I'm just learning Selenium, so any suggestions will be much appreciated!
Well, I figured out the answer to my question. It's embarrassingly easy. The following gets just what I need, namely all the text that is visible on the web page:
from selenium.webdriver.common.by import By
page_text = driver.find_element(By.TAG_NAME, 'body').text
There are a few different reasons why you may not be able to get some information from a page:
The information hasn't loaded yet. You have to wait some time for it to be ready; you can look at this thread for a better understanding. Sometimes page elements are added dynamically with JavaScript and load very slowly; an explicit wait handles this (see the sketch after this list).
The information may be a different type of data than you expect. For example, you are waiting for text containing numbers, but the page actually shows the numbers in a picture. In that situation you have to change tactics and use other functions to get what you need.
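For the first case, a minimal explicit-wait sketch (the URL and the CSS selector are assumptions for illustration):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/live-numbers")  # placeholder URL

# Block for up to 10 seconds until the JavaScript-rendered table exists,
# instead of reading the page before the data has arrived.
table = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table.numbers")))
print(table.text)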
I am trying to read review data from the alexaskillstore.com website using BeautifulSoup. For this, I am specifying the target URL as https://www.alexaskillstore.com/Business-Leadership-Series/B078LNGS5T, where the string after Business-Leadership-Series/ keeps changing for all the different skills.
I want to know how I can use a regular expression or similar code for my input URL so that I am able to read every link that starts with https://www.alexaskillstore.com/Business-Leadership-Series/.
You can't. The web is client-server based, so unless the server is kind enough to map the content for you, you have no way to know which URLs will be responsive and which won't.
You may be able to scrape some index page(s) to find the keys (B078LNGS5T and the like) you need. Once you have them all, actually generating the URLs is a simple matter of string substitution.
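As a hedged sketch of that two-step approach (the index page path and the link pattern are assumptions; inspect the real site first):

import re
import requests
from bs4 import BeautifulSoup

BASE = "https://www.alexaskillstore.com"
# Assumed index page that links to the individual skills.
index_html = requests.get(BASE + "/Business-Leadership-Series").text
soup = BeautifulSoup(index_html, "html.parser")

keys = set()
for a in soup.find_all("a", href=True):
    # Collect keys from links like /Business-Leadership-Series/B078LNGS5T
    match = re.search(r"/Business-Leadership-Series/([A-Z0-9]+)$", a["href"])
    if match:
        keys.add(match.group(1))

urls = [BASE + "/Business-Leadership-Series/" + key for key in keys]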
The site I am trying to scrape has drop-down menus that end up producing a link to a document. The end documents are what I want. I have no experience with web scraping, so I don't know where to start. I have tried adapting this to my needs, but I couldn't get it working. I also tried to adapt this.
I know basically I need to:
for state in states:
    select state
    for type in types:
        select type
        select wage_area_radio button
        for area in wage_area:
            select area
            for locality in localities:
                select locality
                for date in dates:
                    select date
                    get_document
I just haven't found anything that works for me yet. Is there a tool better suited to this than Selenium? I am currently trying to bend it to my will using the code from my second example as a starting point.
Depending on your coding skills and knowledge of HTTP, I would try one of two things. Note that scraping this site appears slightly non-trivial because of the different form options that appear based on what was previously selected, and the fact that there are a lot of AJAX calls happening.
1) Follow the HTTP requests (especially the AJAX ones) that are being made in something like Chrome DevTools. You'll get a good understanding of how the final URL is being formed and how to construct it yourself. In particular, it looks like the last POST to AFWageScheduleYearSelected is the one that generates the final URL. Then, you can make these calls yourself in a Python HTTP library to get the documents.
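As a rough sketch of this option (the host and the form field names below are made-up placeholders; copy the real ones from the recorded POST in DevTools):

import requests

session = requests.Session()  # keeps cookies across the sequence of calls
# Hypothetical field names; use the exact ones from the captured request.
payload = {"state": "TX", "scheduleType": "wage", "year": "2016"}
response = session.post(
    "https://example.gov/AFWageScheduleYearSelected",  # placeholder host
    data=payload)
print(response.url)  # should lead to the final document URL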
2) Use something like PhantomJS (http://phantomjs.org/) which is a headless browser. I don't have experience scraping with Selenium, but my understanding is that it is more of a testing/automation tool. In any case, PhantomJS is pretty easy to get up and running and you can basically click page elements, fill out forms, etc.
If you do end up using PhantomJS (or any other browser-like tool), you'll run into issues with the AJAX calls that populate the forms. Basically, you'll end up trying to fill out forms that don't yet exist on the page because the data is still being sent over the network. The easiest way to get around this is to just set timeouts (of say 2 seconds) in between each form field that you fill out. The alternative to using timeouts (which may be unreliable and slow) is to continuously poll the page until the AJAX call is finished.
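Since you're already using Selenium, here is a minimal polling sketch (the element IDs are assumptions for illustration):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.gov/wage-schedules")  # placeholder URL

Select(driver.find_element(By.ID, "state")).select_by_visible_text("Texas")

# Poll until the dependent dropdown has been repopulated by the AJAX call,
# rather than sleeping a fixed two seconds.
WebDriverWait(driver, 10).until(
    lambda d: len(Select(d.find_element(By.ID, "locality")).options) > 1)
Select(driver.find_element(By.ID, "locality")).select_by_index(1)

The same lambda-based wait works for any "is the form ready yet?" condition, which makes it more reliable than fixed timeouts.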
I was just wondering if it is possible to scrape information from this website that is contained in a Flash file (http://www.tomtom.com/lib/doc/licensing/coverage/).
I am trying to get all the text from the different components of this website.
Can anyone suggest a good starting point in Python, or any simpler method?
I believe the following blog post answers your question well. The author had the same need, scraping Flash content with Python, and ran into the same problem. He realized that he just needed to instantiate a browser (even an in-memory one that never displays to the screen) and then scrape its output. I think this could be a successful approach for what you need, and he makes it easy to understand.
http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/