Spiders to iteratively find an HTML page


I want to access pages on a website whose URLs have the form example.com/<num>-<num>.html,
but I don't know the exact numbers in the URL.
Each number could range from 0 to 10000 or more.
I wrote a small Python script to do this, but it is very slow.
Are there existing tools that could do the job?
import requests

# `url` (the site prefix) and `headers` are assumed to be defined earlier.
for n in range(0, 10000):
    print("n", n)
    for m in range(0, 10000):
        r = url + str(n) + '-' + str(m) + '.html'
        html = requests.get(r, headers=headers)
        try:
            html.raise_for_status()
        except requests.exceptions.HTTPError:
            # print(r + " doesn't exist")
            continue
        print(r)
Also, this code ignores the possibility of zero-padded numbers such as 0012, which is a problem.

I think you should make a folder containing the code for the thing you want to make.
To get the numbers, you have to name each folder with that number. I commonly do this for websites with certain sub-pages. But if this is something like a social media site with numbered posts, I don't know what to do.

Related

Getting all review requests from Review Board Python Web API

I would like to get information about all review requests on my server. This is the code I used:
from rbtools.api.client import RBClient
client = RBClient('http://my-server.net/')
root = client.get_root()
reviews = root.get_review_requests()
The variable reviews contains just 25 review requests (I expected many more). Even stranger, I then tried something slightly different:
count = root.get_review_requests(counts_only=True)
Now count.count is equal to 17164. How can I retrieve the rest of my review requests? I checked the official documentation but found nothing related to my problem.

According to the documentation (https://www.reviewboard.org/docs/manual/dev/webapi/2.0/resources/review-request-list/#webapi2.0-review-request-list-resource), counts_only is only a Boolean flag that indicates the following:
If specified, a single count field is returned with the number of results, instead of the results themselves.
What you could do, though, is pass a status as well:
count = root.get_review_requests(counts_only=True, status='all')
This should take requests of every status into account.
Keep in mind that I didn't test this part of the code locally. I referred to their repo test example -> https://github.com/reviewboard/rbtools/blob/master/rbtools/utils/tests/test_review_request.py#L643 and the documentation link posted above.

You have to use pagination (unfortunately, I can't provide exact code without being able to reproduce your setup). The documentation says:
The maximum number of results to return in this list. By default, this is 25. There is a hard limit of 200; if you need more than 200 results, you will need to make more than one request, using the “next” pagination link.
It looks like a pagination helper class is also available.
If you want to get 200 results at a time, you can set max_results:
requests = root.get_review_requests(max_results=200)
Anyway, here is a good example of how to iterate over results.
Also, I don't recommend fetching all 17164 results in one request, even if it were possible, because the response would be huge (if one result is, say, 10 KB, the total would be more than 171 MB).
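The page-by-page iteration described above can be sketched as a small generator. This is an untested illustration rather than the official RBTools recipe: it assumes that the list resource returned by get_review_requests() is iterable, and that its get_next() method raises StopIteration once there is no "next" link. The helper name iter_all_items is hypothetical.

```python
def iter_all_items(first_page):
    """Yield every item of a paginated list resource, one page at a time.

    Assumes each page is iterable and that page.get_next() returns the next
    page or raises StopIteration when the last page has been reached.
    """
    page = first_page
    while True:
        for item in page:
            yield item
        try:
            page = page.get_next()
        except StopIteration:
            return


# Hypothetical usage against a real server (not run here):
#
#   from rbtools.api.client import RBClient
#   root = RBClient('http://my-server.net/').get_root()
#   first = root.get_review_requests(max_results=200, status='all')
#   for review_request in iter_all_items(first):
#       print(review_request.id)
```

Because the generator only holds one page at a time, memory use stays at roughly 200 results no matter how many review requests the server has.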

Python easy way to get phone number's carrier

I am looking for a way in Python to input a phone number and get the caller's carrier. I am looking for a free and simple way. I have used TELNYX, but it returns CELLCO PARTNERSHIP DBA VERIZON instead of simply 'verizon', which does not work for me. I have tried Twilio as well, and it has not worked for me either. Has anyone had success doing this? Thanks in advance. Code for TELNYX:
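import json
import requests

def getcarrier(number):
    # Look up the number via the Telnyx endpoint and print the carrier name.
    url = 'https://api.telnyx.com/v1/phone_number/1' + number
    html = requests.get(url).text
    data = json.loads(html)
    data = data["carrier"]
    print(data["name"])
    global carrier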

What I have done in the past is to isolate the number prefix and match it against the prefix database available HERE. I did this only for my own country (Bangladesh), so the code was relatively simple (just a series of if/else statements). To make it work for any number, I believe you'll need to consider the country code as well.
You can do it in two ways.
One: store the data locally as a CSV taken from the Wikipedia page (scraping the page should be easy), then use pandas or a similar CSV-handling package to use it as your program's database.
Or, two: write a small program that scrapes the page on demand and finds the operator then.
Good luck.
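The first option (a locally stored CSV of prefixes) could look roughly like the sketch below. It uses the standard csv module instead of pandas to stay dependency-free. The column names prefix and carrier, the helper names, and the sample rows are all made up for illustration; real prefixes would have to come from the scraped table, and number portability means a prefix match is only a best guess.

```python
import csv
import io

# Placeholder data: these prefixes and carrier names are NOT real.
SAMPLE_CSV = """prefix,carrier
99901,Carrier A
99902,Carrier B
999021,Carrier C
"""


def load_prefixes(csv_text):
    """Parse CSV text with 'prefix' and 'carrier' columns into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["prefix"]: row["carrier"] for row in reader}


def lookup_carrier(number, prefixes):
    """Return the carrier whose prefix is the longest match for `number`."""
    digits = "".join(ch for ch in number if ch.isdigit())
    # Try the longest possible prefix first so specific entries win.
    for length in range(len(digits), 0, -1):
        carrier = prefixes.get(digits[:length])
        if carrier:
            return carrier
    return None
```

The number is expected to include its country code, so that prefixes from different countries cannot collide.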

How to create pdfs from rows in a dataset and save them

Very general question here; I need ideas. I have a dataset with about 20 rows. I want to use Python or R to automatically take each of these rows and create one PDF per row. The PDF is formatted in a particular way that I need to be able to adjust.
Imagine each row is a student's name, and I need to make a PDF "Report Card" for each student. The report card will have a designated spot that says "Math Grade", and the value will come from the dataset.
I want to be able to hit run and have all 20 PDFs saved to a folder on my machine. Eventually, I may try to run this on a server so that it is fully automatic. The PDFs ultimately get emailed out to a distribution list.
I am pretty familiar with R and mildly familiar with Python. I have no experience with HTML, but is that what I need here?
Any tutorials, ideas, or explanations of the process I should use would be appreciated.
I thought about using Plotly Dash, but I think that is mostly for viewing in a web browser. I want PDFs, so I don't know if that will work.
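As one possible starting point for the row-to-document step, the sketch below fills an HTML template once per CSV row using only the Python standard library. It stops at HTML: turning each file into a PDF would need a separate converter, which is not shown here. The column names name and math_grade, the template text, and the function name write_reports are assumptions for illustration.

```python
import csv
import html
from pathlib import Path
from string import Template

# Hypothetical layout; edit this template to change how every report looks.
REPORT_TEMPLATE = Template("""<html><body>
<h1>Report Card: $name</h1>
<p>Math Grade: $math_grade</p>
</body></html>
""")


def write_reports(csv_path, out_dir):
    """Write one HTML report per CSV row into out_dir and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            # Escape values so characters like & or < cannot break the markup.
            safe = {key: html.escape(value) for key, value in row.items()}
            target = out_dir / f"report_{i:02d}.html"
            target.write_text(REPORT_TEMPLATE.substitute(safe), encoding="utf-8")
            written.append(target)
    return written
```

Keeping the layout in one template means the formatting can be adjusted without touching the loop that reads the dataset.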

How do I go to a random website? - python

How can I generate a random yet valid website link, regardless of language? Actually, the more diverse the languages of the websites it generates, the better.
I've been doing it by using other people's scripts on their webpages. How can I stop relying on these random-site forwarding scripts and make my own? This is what I've been doing:
import webbrowser
from random import choice

random_page_generator = ['http://www.randomwebsite.com/cgi-bin/random.pl',
                         'http://www.uroulette.com/visit']
webbrowser.open(choice(random_page_generator), new=2)

I've been doing it by using other people's scripts on their webpages. How can I stop relying on these random-site forwarding scripts and make my own?
There are two ways to do this:
Create your own spider that amasses a huge collection of websites, and pick from that collection.
Access some pre-existing collection of websites, and pick from that collection. For example, DMOZ/ODP lets you download their entire database;* Google used to have a customized random site URL;** etc.
There is no other way around it (short of randomly generating and testing valid strings of arbitrary characters, which would be a ridiculously bad idea).
Building a web spider for yourself can be a fun project. Link-driven scraping libraries like Scrapy can do a lot of the grunt work for you, leaving you to write the part you care about.
* Note that ODP is a pretty small database compared to something like Google's or Yahoo's, because it's primarily a human-edited collection of significant websites rather than an auto-generated collection of everything anyone has put on the web.
** Google's random site feature was driven by both popularity and your own search history. However, by feeding it an empty search history, you could remove that part of the equation. Anyway, I don't think it exists anymore.
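To make the first option concrete, here is a minimal standard-library sketch of a spider that follows links and collects site roots. It is far cruder than what Scrapy provides: it ignores robots.txt, rate limiting and character encodings, all of which a real crawler should handle. The names extract_sites and crawl and the max_pages limit are hypothetical.

```python
import http.client
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_sites(html_text, base_url):
    """Return the set of http(s) site roots linked from a page."""
    parser = _LinkParser()
    parser.feed(html_text)
    sites = set()
    for href in parser.hrefs:
        parts = urlsplit(urljoin(base_url, href))
        if parts.scheme in ("http", "https") and parts.netloc:
            sites.add(f"{parts.scheme}://{parts.netloc}/")
    return sites


def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl from the seeds; return every site root seen."""
    queue, seen = deque(seed_urls), set(seed_urls)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                text = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError, http.client.HTTPException):
            continue  # unreachable or malformed: skip it
        for site in extract_sites(text, url) - seen:
            seen.add(site)
            queue.append(site)
    return seen
```

Picking a random site is then random.choice(sorted(crawl(["http://example.com/"]))), with the seed URL chosen by you.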

A conceptual explanation, not a code one.
Their scripts are likely very large and comprehensive. If it's a random website selector, they have a huge, huge list of websites line by line, and the script just picks one. If it's a random URL generator, it probably generates a string of letters (e.g. "asljasldjkns"), plugs it between http:// and .com, tries to see if it is a valid URL, and if it is, sends you that URL.
The easiest way to design your own might be to ask to have a look at theirs, though I'm not certain of the success you'd have there.
The best way as a programmer is simply to decipher the nature of URL language. Practice the building of strings and testing them, or compile a huge database of them yourself.
As a hybridization, you might try building two things. One script that, while you're away, searches for/tests URLs and adds them to a database. Another script that randomly selects a line out of this database to send you on your way. The longer you run the first, the better the second becomes.
EDIT: Do Abarnert's thing about spiders, that's much better than my answer.
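The two-script hybrid described above needs little more than a shared store. The sketch below uses SQLite from the standard library; the table name sites and the function names are made up for illustration, and how the collector script discovers and validates URLs is left open.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the URL store; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS sites (url TEXT PRIMARY KEY)")
    return conn


def add_site(conn, url):
    """Collector side: record a URL that was found to work (duplicates ignored)."""
    with conn:  # commits on success
        conn.execute("INSERT OR IGNORE INTO sites (url) VALUES (?)", (url,))


def random_site(conn):
    """Picker side: return one stored URL at random, or None if empty."""
    row = conn.execute(
        "SELECT url FROM sites ORDER BY RANDOM() LIMIT 1"
    ).fetchone()
    return row[0] if row else None
```

With a file path instead of ":memory:", the collector can keep running in the background while the picker opens the same file and passes the result to webbrowser.open.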

The other answers suggest building large databases of URLs. There is another method, which I've used in the past and documented here:
http://41j.com/blog/2011/10/find-a-random-webserver-using-libcurl/
The idea is to create a random IP address and then try to grab a site from port 80 of that address. This method is not perfect with modern virtual-hosted sites, and of course it only fetches the top page, but it can be an easy and effective way of getting random sites. The code linked above is C, but it should be easily callable from Python, or the method could easily be adapted to Python.
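A rough Python adaptation of that idea might look like the following. It is a sketch under assumptions, not a port of the linked C code: it uses the standard ipaddress and urllib modules, treats only an HTTP 200 answer as a hit, and the names random_public_ipv4 and probe are hypothetical. Most random addresses will not answer, so expect many attempts per hit.

```python
import http.client
import ipaddress
import random
import urllib.request


def random_public_ipv4(rng=random):
    """Draw random 32-bit addresses until one is globally routable."""
    while True:
        candidate = ipaddress.IPv4Address(rng.getrandbits(32))
        if candidate.is_global:  # skips private, loopback, reserved ranges
            return candidate


def probe(ip, timeout=3):
    """Return 'http://<ip>/' if port 80 answers with HTTP 200, else None."""
    url = f"http://{ip}/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return url if response.status == 200 else None
    except (OSError, http.client.HTTPException):
        return None


def find_random_site(max_attempts=100):
    """Probe random addresses until one responds or attempts run out."""
    for _ in range(max_attempts):
        hit = probe(random_public_ipv4())
        if hit:
            return hit
    return None
```

As the answer notes, virtual-hosted sites usually need a Host name, so many servers will return a default or error page when addressed by IP alone.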

Word count statistics on a web page

I am looking for a way to extract basic stats (total count, density, count in links, hrefs) for words on an arbitrary website, ideally a Python based solution.
While it is easy to parse a specific website using, say, BeautifulSoup and determine where the bulk of the content is, it requires you to define the location of the content in the DOM tree ahead of processing. This is easy for, say, hrefs or any arbitrary tag, but it gets more complicated when determining where the rest of the data (not enclosed in well-defined markers) is.
If I understand correctly, robots used by the likes of Google (GoogleBot?) are able to extract data from any website to determine the keyword density. My scenario is similar: obtain information about the words that define what the website is about (i.e. after removing JS, links and fillers).
My question is, are there any libraries or web APIs that would allow me to get statistics of meaningful words from any given page?

There are no APIs, but there may be a few libraries you can use as tools.
You should count the meaningful words and record them over time.
You can also start from something like this:
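using System.Net;
using System.Text.RegularExpressions;
string link = "http://www.website.com/news/Default.asp";
string itemToSearch = "Word";
// Download the page first: the regex must run on its content, not on the URL.
string pageContent = new WebClient().DownloadString(link);
int count = new Regex(Regex.Escape(itemToSearch)).Matches(pageContent).Count;
MessageBox.Show(count.ToString());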

There are multiple libraries that deal with more advanced processing of web articles, this question should be a duplicate of this one.
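Since the question asks for a Python-based solution, here is a minimal standard-library sketch of the basic statistics it lists (total count, density, count inside links, hrefs). It does not try to separate meaningful words from fillers; it only skips script and style content. The function name word_stats and the shape of its return value are assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Count words in visible text and, separately, inside <a> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.words = Counter()
        self.link_words = Counter()
        self.hrefs = []
        self._skip_depth = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore JavaScript and CSS
        found = re.findall(r"\w+", data.lower())
        self.words.update(found)
        if self._link_depth:
            self.link_words.update(found)


def word_stats(html_text):
    """Return ({word: {count, density, in_links}}, [hrefs]) for a page."""
    parser = _TextCollector()
    parser.feed(html_text)
    parser.close()
    total = sum(parser.words.values())
    stats = {
        word: {"count": n, "density": n / total,
               "in_links": parser.link_words[word]}
        for word, n in parser.words.items()
    }
    return stats, parser.hrefs
```

Removing stop words or boilerplate such as navigation menus would be a further step on top of these raw counts.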
