I have a list of URLs to crawl: ['http://domain1.com','http://domain1.com/page1','http://domain2.com']
Code:
    prev_domain = ''
    while urls:
        url = urls.pop()
        if base_url(url) == prev_domain:  # base_url is a custom function that returns the domain of a URL
            urls.append(url)  # is this possible?
            continue
        else:
            crawl(url)
Basically, I don't want to crawl pages of the same domain back to back. Crawling one domain continuously returns HTTP status code 429 Too Many Requests: the user has sent too many requests in a given amount of time ("rate limiting"). To work around this, I'm planning to use the logic below.
Loop through all items in the list and compare the base URL of the current element with the base URL of the previously processed element.
If the base URLs differ, process the element; otherwise, do not process it and just append it back to the same list.
Note: if the URLs in the list are all from the same domain, add a delay before processing each element.
Please share your thoughts.
Your algorithm is almost correct, but not the implementation:
>>> L = [1,2,3]
>>> L.pop()
3
>>> L.append(3)
>>> L
[1, 2, 3]
That's why your program loops forever: if the domain is the same as the previous domain, you append, then pop the same element, then append it again, and so on. What you need is not a stack but a round robin:
>>> L.pop()
3
>>> L.insert(0, 3)
>>> L
[3, 1, 2]
Let's take a shuffled list of permutations of "abcd":
>>> L = [('b', 'c', 'd', 'a'), ('d', 'c', 'b', 'a'), ('a', 'c', 'd', 'b'), ('c', 'd', 'a', 'b'), ('b', 'd', 'a', 'c'), ('b', 'a', 'd', 'c'), ('b', 'c', 'a', 'd'), ('a', 'b', 'd', 'c'), ('d', 'a', 'b', 'c'), ('a', 'b', 'c', 'd'), ('d', 'c', 'a', 'b'), ('a', 'd', 'c', 'b'), ('d', 'a', 'c', 'b'), ('c', 'd', 'b', 'a'), ('d', 'b', 'c', 'a'), ('d', 'b', 'a', 'c'), ('a', 'd', 'b', 'c'), ('b', 'd', 'c', 'a'), ('c', 'b', 'd', 'a'), ('c', 'a', 'b', 'd'), ('b', 'a', 'c', 'd')]
The first letter is the domain. Here's a slightly modified version of your code:
>>> prev = None
>>> while L:
...     e = L.pop()
...     if L and e[0] == prev:
...         L.insert(0, e)
...     else:
...         print(e)
...         prev = e[0]
('b', 'a', 'c', 'd')
('c', 'a', 'b', 'd')
('b', 'd', 'c', 'a')
('a', 'd', 'b', 'c')
('d', 'b', 'a', 'c')
('c', 'd', 'b', 'a')
('d', 'a', 'c', 'b')
('a', 'd', 'c', 'b')
('d', 'c', 'a', 'b')
('a', 'b', 'c', 'd')
('d', 'a', 'b', 'c')
('a', 'b', 'd', 'c')
('b', 'c', 'a', 'd')
('c', 'd', 'a', 'b')
('a', 'c', 'd', 'b')
('d', 'c', 'b', 'a')
('b', 'c', 'd', 'a')
('c', 'b', 'd', 'a')
('d', 'b', 'c', 'a')
('b', 'a', 'd', 'c')
('b', 'd', 'a', 'c')
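The modification is the `if L and` test: without it, if the last remaining element has the same domain as prev, you loop forever on a one-element list: pop, same as prev, insert, pop, ... (as with pop/append). Note that this test only covers a single leftover element; if several remaining elements all share the previous domain, the loop still never terminates.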
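The same round-robin idea can be sketched with real URLs. This is only an illustration written for Python 3: `base_url` here is a hypothetical stand-in for the asker's function (built on `urllib.parse.urlsplit`), and the `skipped` counter is an extra guard, not part of the code above, that stops rotating once every remaining URL has been seen to share the previous domain.

```python
from collections import deque
from urllib.parse import urlsplit


def base_url(url):
    """Hypothetical stand-in for the asker's base_url(): the host part of the URL."""
    return urlsplit(url).netloc


def crawl_order(urls):
    """Yield URLs so that consecutive ones come from different domains when possible."""
    queue = deque(urls)
    prev = None
    skipped = 0  # URLs rotated in a row without yielding anything
    while queue:
        url = queue.pop()
        # Rotate only while an unchecked URL with another domain may still exist.
        if queue and base_url(url) == prev and skipped < len(queue):
            queue.appendleft(url)
            skipped += 1
            continue
        skipped = 0
        prev = base_url(url)
        yield url


order = list(crawl_order(['http://b.com/1', 'http://a.com/1', 'http://a.com/2']))
print(order)
```

When only one domain is left, the generator simply yields the rest in sequence, so that is the point where a delay between requests would be needed.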
Here's another option: create a dict domain -> list of urls:
>>> d = {}
>>> for e in L:
...     d.setdefault(e[0], []).append(e)
...
>>> d
{'b': [('b', 'c', 'd', 'a'), ('b', 'd', 'a', 'c'), ('b', 'a', 'd', 'c'), ('b', 'c', 'a', 'd'), ('b', 'd', 'c', 'a'), ('b', 'a', 'c', 'd')], 'd': [('d', 'c', 'b', 'a'), ('d', 'a', 'b', 'c'), ('d', 'c', 'a', 'b'), ('d', 'a', 'c', 'b'), ('d', 'b', 'c', 'a'), ('d', 'b', 'a', 'c')], 'a': [('a', 'c', 'd', 'b'), ('a', 'b', 'd', 'c'), ('a', 'b', 'c', 'd'), ('a', 'd', 'c', 'b'), ('a', 'd', 'b', 'c')], 'c': [('c', 'd', 'a', 'b'), ('c', 'd', 'b', 'a'), ('c', 'b', 'd', 'a'), ('c', 'a', 'b', 'd')]}
Now take one element from every domain, drop the domains whose lists are empty, and repeat until the dict is empty:
>>> while d:
...     for k, vs in d.items():
...         e = vs.pop()
...         print(e)
...     d = {k: vs for k, vs in d.items() if vs} # drop domains with no elements left
...
('b', 'a', 'c', 'd')
('d', 'b', 'a', 'c')
('a', 'd', 'b', 'c')
('c', 'a', 'b', 'd')
('b', 'd', 'c', 'a')
('d', 'b', 'c', 'a')
('a', 'd', 'c', 'b')
('c', 'b', 'd', 'a')
('b', 'c', 'a', 'd')
('d', 'a', 'c', 'b')
('a', 'b', 'c', 'd')
('c', 'd', 'b', 'a')
('b', 'a', 'd', 'c')
('d', 'c', 'a', 'b')
('a', 'b', 'd', 'c')
('c', 'd', 'a', 'b')
('b', 'd', 'a', 'c')
('d', 'a', 'b', 'c')
('a', 'c', 'd', 'b')
('b', 'c', 'd', 'a')
('d', 'c', 'b', 'a')
The output is more uniform.
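Applied to actual URLs, the dict approach could look roughly like this. It is a sketch for Python 3; the function name `interleave_by_domain` is made up for the example, and the domain is taken from `urllib.parse.urlsplit` instead of the asker's `base_url`.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def interleave_by_domain(urls):
    """Return the URLs reordered so each pass takes one URL per domain."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).netloc].append(url)

    ordered = []
    while groups:
        # One URL from every domain that still has some left.
        for domain_urls in groups.values():
            ordered.append(domain_urls.pop())
        # Drop the domains that are now exhausted.
        groups = {domain: us for domain, us in groups.items() if us}
    return ordered


urls = ['http://domain1.com', 'http://domain1.com/page1', 'http://domain2.com']
print(interleave_by_domain(urls))
```

If one domain has far more URLs than the others, the tail of the result will still be a run of that single domain.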
Check the following code snippet:
    urls = ['http://domain1.com','http://domain1.com/page1','http://domain2.com']
    crawl_for_urls = {}
    for url in urls:
        domain = base_url(url)
        if domain not in crawl_for_urls:
            crawl_for_urls[domain] = url
            crawl(url)
crawl() will be called only once per unique domain.
Or you can use:
    urls = ['http://domain1.com','http://domain1.com/page1','http://domain2.com']
    crawl_for_urls = {}
    for url in urls:
        domain = base_url(url)
        if domain not in crawl_for_urls:
            crawl_for_urls[domain] = [url]
            crawl(url)
        else:
            crawl_for_urls[domain].append(url)
This way you can categorize the URLs by domain and still call crawl() only once per unique domain.
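For the note in the question about delaying requests to the same domain, one possible sketch is to remember when each domain was last hit and sleep only if it is requested again too soon. Everything here is an assumption made for illustration: the name `crawl_politely`, the `min_delay` value, and the injectable `clock` and `sleep` parameters (which exist only so the logic can be tried without actually waiting).

```python
import time
from urllib.parse import urlsplit


def crawl_politely(urls, crawl, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
    """Call crawl(url) for each URL, waiting so that a domain is hit at most
    once every min_delay seconds."""
    last_hit = {}  # domain -> time of the most recent request
    for url in urls:
        domain = urlsplit(url).netloc
        if domain in last_hit:
            wait = min_delay - (clock() - last_hit[domain])
            if wait > 0:
                sleep(wait)
        crawl(url)
        last_hit[domain] = clock()


# Dry run with a fake clock instead of real waiting.
now = [0.0]
slept = []
seen = []


def fake_sleep(seconds):
    slept.append(seconds)
    now[0] += seconds


crawl_politely(
    ['http://a.com/1', 'http://a.com/2', 'http://b.com/1'],
    seen.append,
    min_delay=2.0,
    clock=lambda: now[0],
    sleep=fake_sleep,
)
print(seen, slept)
```

Combined with a reordering that alternates domains, the sleep should rarely trigger unless a single domain dominates the list.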
I get an error when trying to print the permutation/combination of a user generated list of names.
I tried a couple of things from itertools, but can't get either permutations or combinations to work. I ran into some other errors along the way about concatenating strings, but currently I'm getting: TypeError: 'list' object is not callable.
I know I'm making a simple mistake, but can't sort it out. Please help!
    from itertools import combinations
    name_list = []
    for i in range(0,20):
        name = input('Add up to 20 names.\nWhen finished, enter "Done" to see all first and middle name combinations.\nName: ')
        name_list.append(name)
        if name != 'Done':
            print(name_list)
        else:
            name_list.remove('Done')
            print(name_list(combinations))
I expect:
1) the user adds a name to list
2) the list prints showing user contents of list
3) when finished the user inputs "Done":
a) 'Done' is removed from the list
b) all the combinations of the remaining items on the list printed
Permutations and combinations are two different beasts. Look:
>>> from itertools import permutations,combinations
>>> from pprint import pprint
>>> l = ['a', 'b', 'c', 'd']
>>> pprint(list(combinations(l, 2)))
[('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
>>> pprint(list(permutations(l)))
[('a', 'b', 'c', 'd'),
('a', 'b', 'd', 'c'),
('a', 'c', 'b', 'd'),
('a', 'c', 'd', 'b'),
('a', 'd', 'b', 'c'),
('a', 'd', 'c', 'b'),
('b', 'a', 'c', 'd'),
('b', 'a', 'd', 'c'),
('b', 'c', 'a', 'd'),
('b', 'c', 'd', 'a'),
('b', 'd', 'a', 'c'),
('b', 'd', 'c', 'a'),
('c', 'a', 'b', 'd'),
('c', 'a', 'd', 'b'),
('c', 'b', 'a', 'd'),
('c', 'b', 'd', 'a'),
('c', 'd', 'a', 'b'),
('c', 'd', 'b', 'a'),
('d', 'a', 'b', 'c'),
('d', 'a', 'c', 'b'),
('d', 'b', 'a', 'c'),
('d', 'b', 'c', 'a'),
('d', 'c', 'a', 'b'),
('d', 'c', 'b', 'a')]
>>>
To use combinations, you need to pass r (the size of each combination) as the second argument. This code prints all combinations for every size from 0 to the length of the list:
    from itertools import combinations
    name_list = []
    for i in range(0,20):
        # input() as in the question (Python 3); on Python 2 use raw_input() instead
        name = input('Add up to 20 names.\nWhen finished, enter "Done" to see all first and middle name combinations.\nName: ')
        name_list.append(name)
        if name != 'Done':
            print(name_list)
        else:
            name_list.remove('Done')
            break
    for i in range(len(name_list) + 1):
        print(list(combinations(name_list, i)))
        print("\n")
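If the goal is specifically first and middle name pairs, the order of the two names probably matters, in which case permutations of length 2 may be closer to what is wanted than combinations. A small comparison, with made-up sample names instead of user input:

```python
from itertools import combinations, permutations

names = ['Ann', 'Bea', 'Cy']  # sample data standing in for the user's input

# Unordered pairs: ('Ann', 'Bea') appears, ('Bea', 'Ann') does not.
pairs = list(combinations(names, 2))

# Ordered pairs: both ('Ann', 'Bea') and ('Bea', 'Ann') appear.
ordered_pairs = list(permutations(names, 2))

print(pairs)
print(ordered_pairs)
```

The original TypeError comes from `name_list(combinations)`, which tries to call the list; the function has to be called with the list as its argument, as in `combinations(names, 2)`.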
I have this code:
number = 2
size = 5
list_b = [("b","b","b")]
list_a = [("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a")]
for i in range(number):
    list_a.insert(size,list_b)
print list_a
it gives me this:
[('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('b', 'b', 'b'),
('b', 'b', 'b'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a')]
Basically, it inserts list_b twice at the position defined by size.
I want a loop that repeats this, so that list_b is inserted number times after every size elements of list_a. It's difficult to explain, so here is the result that I want:
[('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('b', 'b', 'b'),
('b', 'b', 'b'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('b', 'b', 'b'),
('b', 'b', 'b'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('b', 'b', 'b'),
('b', 'b', 'b'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('b', 'b', 'b'),
('b', 'b', 'b'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('a', 'a', 'a'),
('b', 'b', 'b'),
('b', 'b', 'b'),...and so on]
EDIT
And if I had this:
list_a = [a, ] * 15
list_b = [b,]
s = 5
n = 2
I want to obtain this:
[b,b,a,a,a,a,a,b,b,b,b,a,a,a,a,a,b,b,b,b,a,a,a,a,a,b,b]
Since this is just an example, and list_a, s and n will vary, how can I do this in one or two loops?
Thanks,
Favolas
For the sake of brevity, I'll write "a" for ('a', 'a', 'a') and "b" for ('b', 'b', 'b').
number=2
size=5
list_a=["a"]*20
list_b=["b"]
workfor=len(list_a)+(len(list_a)/size)*number*len(list_b)
i=0
while i<workfor:
    i+=size
    for times in range(number):
        for elem in list_b:
            list_a.insert(i,elem)
        i+=len(list_b)
print list_a
Results in =>
['a', 'a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'b', 'b']
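The same result can also be built without tracking insertion indices, by copying list_a in chunks of size elements and appending the filler after each full chunk. This is a sketch for Python 3, not the code above; the function name `insert_every` is made up, and it returns a new list instead of modifying list_a in place.

```python
def insert_every(items, filler, size, number):
    """Return a new list with `number` copies of `filler` after every
    full run of `size` elements of `items`."""
    result = []
    for start in range(0, len(items), size):
        chunk = items[start:start + size]
        result.extend(chunk)
        # Only pad after complete chunks; a short tail is left as it is.
        if len(chunk) == size:
            result.extend([filler] * number)
    return result


print(insert_every(['a'] * 20, 'b', size=5, number=2))
```

Building a new list also avoids the repeated `list.insert` calls, each of which has to shift the rest of the list.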
#!/usr/bin/python
number = 2
size = 5
list_b = [("b","b","b")]
list_a = [("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a"),("a","a","a")]
if __name__ == '__main__':
    insertion_count = len(list_a) / size
    for j in xrange(insertion_count):
        # compute insertion position
        pos = (j+1)*size + j * number
        for i in range(number):
            list_a.insert(pos,list_b)
    print list_a
from itertools import chain, izip, repeat
list_a = [('a', 'a', 'a')] * 15
list_b = [('b', 'b', 'b')]
a5b2s = [iter(list_a)] * 5 + [repeat(*list_b)] * 2
list_a[:] = chain.from_iterable(izip(*a5b2s))
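The snippet above is Python 2 (`izip`). A rough Python 3 equivalent, using the built-in `zip`, is sketched below with plain strings standing in for the tuples. It relies on the usual grouping idiom: the same iterator repeated 5 times makes `zip` pull 5 consecutive elements per row, and `repeat` supplies the filler for the last 2 columns.

```python
from itertools import chain, repeat

list_a = ['a'] * 15
filler = 'b'

# Five references to ONE iterator over list_a, plus two endless filler columns.
columns = [iter(list_a)] * 5 + [repeat(filler)] * 2

# Each zip row is (a, a, a, a, a, b, b); chain flattens the rows.
result = list(chain.from_iterable(zip(*columns)))
print(result)
```

One caveat: `zip` stops at the first exhausted iterator, so if the length of list_a is not a multiple of 5, the leftover elements are silently dropped.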
>>> s,n=5,2
>>> a=[1,]*17
>>> b=2
>>> for i in range(len(a)//s*s,0,-s):
...     for j in range(n):
...         a.insert(i,b)
...
>>> a
[1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1]
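For the variant in the question's EDIT, where the expected output has n copies of b on both sides of every run of s elements, one reading is "wrap each chunk in fillers". The sketch below follows that interpretation only; the function name `wrap_chunks` is made up, and it assumes a short final chunk should be wrapped as well, which the question does not specify.

```python
def wrap_chunks(items, filler, size, number):
    """Return a new list where every run of `size` elements of `items`
    is preceded and followed by `number` copies of `filler`."""
    padding = [filler] * number
    result = []
    for start in range(0, len(items), size):
        result.extend(padding)
        result.extend(items[start:start + size])
        result.extend(padding)
    return result


print(wrap_chunks(['a'] * 15, 'b', size=5, number=2))
```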