As part of the Python Twitter crawler I'm creating, I am attempting to make a "hash table" of sorts to ensure that I don't crawl any user more than once. The code is below. However, I am running into a problem: when I start crawling at the user NYTimesKrugman, I seem to crawl some users more than once, but when I start crawling at the user cleversallie (in a completely independent crawl), I don't crawl any user more than once. Any insight into this behavior would be greatly appreciated!
from BeautifulSoup import BeautifulSoup
import re
import urllib2
import twitter
start_follower = "cleversallie"
depth = 3
U = list()
api = twitter.Api()
def add_to_U(user):
    U.append(user)
def user_crawled(user):
    L = len(L)
    for x in (0, L):
        a = L[x]
        if a != user:
            return False
        else:
            return True
def turn_to_names(users):
    names = list()
    for u in users:
        x = u.screen_name
        names.append(x)
    return names
def test_users(users):
    new = list()
    for u in users:
        if (user_crawled):
            new.append(u)
    return new
def crawl(follower,in_depth): #main method of sorts
    if in_depth > 0:
        add_to_U(follower)
        users = api.GetFriends(follower)
        names = turn_to_names(users)
        select_users = test_users(names)
        for u in select_users[0:5]:
            crawl(u, in_depth - 1)
crawl(start_follower, depth)
for u in U:
    print u
print("Program done.")
EDIT
Based on your suggestions (thank you all very much!) I have rewritten the code as the following:
import re
import urllib2
import twitter
start_follower = "NYTimesKrugman"
depth = 4
searched = set()
api = twitter.Api()
def crawl(follower, in_depth):
    if in_depth > 0:
        searched.add(follower)
        users = api.GetFriends(follower)
        names = set([str(u.screen_name) for u in users])
        names -= searched
        for name in list(names)[0:5]:
            crawl(name, in_depth-1)
crawl(start_follower, depth)
for x in searched:
    print x
print "Program is completed."
You have a bug where you set L to len(L), not len(U). You also return False as soon as the first user does not match, rather than only after every user has failed to match. In Python, the same function may be written as either of the following:
def user_crawled(user):
    for a in U:
        if a == user:
            return True
    return False
def user_crawled(user):
    return user in U
The test_users function refers to user_crawled as if it were a variable; it never actually calls it. Also, it seems you are doing the inverse of what you intend: you want new to be populated with untested users, not tested ones. This is that function with the errors corrected:
def test_users(users):
    new = list()
    for u in users:
        if not user_crawled(u):
            new.append(u)
    return new
Using a generator function, you can further simplify the function (provided you intend to loop over the results):
def test_users(users):
    for u in users:
        if not user_crawled(u):
            yield u
You can also use the filter function:
def test_users(users):
    return filter(lambda u: not user_crawled(u), users)
You're using a list to store users, not a hash-based structure. Python provides sets for when you need a collection that can never contain duplicates and that supports fast membership tests. One set can also be subtracted from another to remove all of its elements from the other.
Also, your list (U) is of users, but you are matching it against user names. You need to store just the user name of each added user. In addition, you are using u to represent a user at one point in the program and a user name at another. You should use more meaningful variable names.
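As a small illustration of the set operations described above, here is a sketch with made-up user names (none of them come from the question). It shows duplicates collapsing, set subtraction, and a membership test:

```python
# Names already crawled (hypothetical values for illustration).
crawled = {"alice", "bob"}

# Friend names as they might come back from an API call, with a duplicate.
friends = ["bob", "carol", "dave", "carol"]

# Converting to a set drops the duplicate "carol"; subtracting removes
# everyone who has already been crawled.
new_names = set(friends) - crawled

print(sorted(new_names))      # ['carol', 'dave']
print("alice" in crawled)     # True
```

Because a set is hash-based, the `in` test does not have to scan every element the way it does for a list.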
The syntactic sugar of Python ends up eliminating the need for all of your helper functions. This is how I would rewrite the entire program:
import twitter
start_follower = "cleversallie"
MAX_DEPTH = 3
searched = set()
api = twitter.Api()
def crawl(follower, in_depth=MAX_DEPTH):
    if in_depth > 0:
        # follower is a screen name (a string), so store it directly
        searched.add(follower)
        users = api.GetFriends(follower)
        # keep only the screen names, as in the question's turn_to_names
        names = set([u.screen_name for u in users])
        names -= searched
        for name in list(names)[:5]:
            crawl(name, in_depth - 1)
crawl(start_follower)
print("\n".join(searched))
print("Program done.")
The code sample you've given just plain doesn't work, for starters, but I would guess your problem has something to do with not actually using a hash table (a dictionary or a set).
You call L = len(L), but I cannot see L defined anywhere else. You then have a loop,
for x in (0, L):
    a = L[x]
    if a != user:
        return False
    else:
        return True
which will execute at most twice, once with x = 0 and once with x = L, where L is the length. Needless to say, indexing with x = L would fail, since that is one past the last valid index. It never even gets that far, though: the if-else returns on the first iteration either way, and L is not defined in the first place.
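To make the difference concrete, here is a short sketch (the list contents are made up) comparing iteration over the tuple (0, n) with iteration over range(0, n):

```python
items = ["a", "b", "c"]
n = len(items)

# Iterating over the tuple (0, n) visits exactly two values: 0 and n.
tuple_indices = [x for x in (0, n)]

# range(0, n) visits every valid index: 0, 1, ..., n - 1.
range_indices = [x for x in range(0, n)]

print(tuple_indices)   # [0, 3]
print(range_indices)   # [0, 1, 2]
```

Note that 3 is not a valid index into a three-element list, which is why the tuple form would eventually raise an IndexError.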
What you are most likely looking for is a set: check whether the user is in it, do some work if they're absent, then add the user. This might look like:
first_user = 'cleversallie'
crawled_users = {first_user} #set literal
def crawl(user, depth, max_depth):
    friends = get_friends(user)  # look up the current user, not always first_user
    for friend in friends:
        if friend not in crawled_users and depth < max_depth:
            crawled_users.add(friend)
            crawl(friend, depth + 1, max_depth)
crawl(first_user, 0, 5)
You can fill in the details of what happens in get_friends. I haven't tested this, so pardon any syntax errors, but it should be a strong start for you.
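For anyone who wants to try the idea without network access, here is a runnable sketch of the same set-based approach. It assumes a hypothetical in-memory follow graph (FAKE_GRAPH and all names in it are invented) in place of the real Twitter API, and it passes the set in as an argument rather than using a global:

```python
# Hypothetical follow graph standing in for the Twitter API (made-up names).
FAKE_GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "carol", "dave"],
    "carol": ["alice"],
    "dave": ["bob", "erin"],
    "erin": [],
}

def get_friends(user):
    """Stand-in for an API call; returns the names `user` follows."""
    return FAKE_GRAPH.get(user, [])

def crawl(user, depth, max_depth, crawled, order):
    """Depth-first crawl that records each user at most once."""
    crawled.add(user)
    order.append(user)
    if depth >= max_depth:
        return
    for friend in get_friends(user):
        if friend not in crawled:
            crawl(friend, depth + 1, max_depth, crawled, order)

crawled_users = set()
visit_order = []
crawl("alice", 0, 5, crawled_users, visit_order)
print(visit_order)  # ['alice', 'bob', 'carol', 'dave', 'erin']
```

Even though "alice" and "carol" appear in several friend lists, each name is visited once, because the membership check happens immediately before every recursive call.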
Let's start by saying there are lots of errors in this code, and a lot of non-Python-isms.
For instance:
def user_crawled(user):
    L = len(U)
    for x in (0, L):
        a = L[x]
        if a != user:
            return False
        else:
            return True
This never gets past the first iteration of the loop, since both branches return. So you really meant something like the following, adding range() and the ability to check all the users.
def user_crawled(user) :
    L = len(U)
    for x in range(0, L) :
        a = U[x]
        if a == user :
            return True
    return False
Now of course a slightly more Pythonic way would be to skip the range and just iterate over the list.
def user_crawled(user) :
    for a in U :
        if a == user :
            return True
    return False
Which is nice and simple, but in true Python you would jump on the "in" operator and write:
def user_crawled(user) :
    return user in U
A few more Python thoughts: list comprehensions.
def test_users(users) :
    return [u for u in users if not user_crawled(u)]
Which could also be applied to turn_to_names() - left as an exercise to the reader.
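For readers who want to check their answer to that exercise, one possible version is sketched below. The User namedtuple is a hypothetical stand-in for the objects returned by the Twitter library; the only thing assumed about them is the screen_name attribute used in the question.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's user objects; only the
# screen_name attribute from the question is assumed.
User = namedtuple("User", ["screen_name"])

def turn_to_names(users):
    """Return the screen name of every user, in order."""
    return [u.screen_name for u in users]

users = [User("alice"), User("bob")]
print(turn_to_names(users))  # ['alice', 'bob']
```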
Related
I'm new to Python and can't figure out how to get these functions to call each other. The program asks for input, but no matter what I enter it gives 0 as the output. Can someone help debug?
userinput = input("Enter three numbers: ")
userinput = userinput.split(',')
finalsum = 0
finaldata = []
def formatinput(x):
    sqrdata = []
    for element in x:
        sqrdata.append(int(element))
    return(sqrdata)
def findsquare(x):
    return (x*x)
def sumthesquares(y):
    for element in y:
        temp = findsquare(element)
        finaldata.append(int(temp))
        finalsum = finalsum + temp
    return finalsum
def findthesquares(userinput):
    finalsum = sumthesquares(formatinput(userinput))
    print(finalsum)
Have you actually tried running your code? From what you've posted, it looks like you never actually call your functions...
They're defined, but you're missing the actual calls, like formatinput(userinput).
For future reference, if you put something like print("Got here!") into your functions, you can test that they're being called.
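As a rough sketch of what a working version might look like, here is the same pipeline with the call added at the end. The function names are snake_case variants of the ones in the question, and the input is hard-coded as "1,2,3" instead of being read with input() so that the example runs unattended. The running total is kept in a local variable, because assigning to a module-level name like finalsum inside a function would otherwise raise an UnboundLocalError.

```python
def format_input(items):
    """Convert a list of numeric strings to a list of ints."""
    return [int(item) for item in items]

def find_square(x):
    return x * x

def sum_the_squares(numbers):
    # Keep the running total in a local variable; assigning to a
    # module-level name inside a function would need a `global` statement.
    total = 0
    for number in numbers:
        total += find_square(number)
    return total

def find_the_squares(raw_text):
    return sum_the_squares(format_input(raw_text.split(',')))

# The step missing from the question: actually calling the function.
print(find_the_squares("1,2,3"))  # 14
```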
In one of my projects, I'm trying to parse parcel numbers that sometimes do, and sometimes don't have lot extensions (a three digit code at the end). I could obviously make an if elif structure to handle cases where lot extensions aren't present, but I was hoping to satisfy my curiosity and get some feedback on more efficient ways to write the code.
In its current state, I end up with an unwanted trailing dash on parcels without a lot extension: '00-000-0000-'
Final parcel number formats should be:
00-000-0000
00-000-0000-000
and the input pins look like:
pin_that_wont_work1 = '00000000'
pin_that_wont_work2 = '000000000'
pin_that_works1 = '00000000000'
pin_that_works2 = '000000000000'
import re
pattern = r'^(\d{1,2})(\d{3})(\d{4})(\d{3})?$'
def parse_pins(pattern, pin):
    L = [x for x in re.search(pattern, pin).groups()]
    return '{dist}-{map_sheet}-{lot}-{lot_ext}'.format(dist=L[0] if len(L[0]) == 2 else '0'+L[0],
                                                        map_sheet=L[1],
                                                        lot=L[2],
                                                        lot_ext=L[3] if L[3] else '')
import re
pin_pattern = re.compile(r'^(\d{1,2})(\d{3})(\d{4})(\d{3})?$')
pin_formats = {
    3: '{0:02d}-{1:03d}-{2:04d}',
    4: '{0:02d}-{1:03d}-{2:04d}-{3:03d}'
}
def parse_pin(s):
    groups = [int(d) for d in pin_pattern.search(s).groups() if d is not None]
    return pin_formats[len(groups)].format(*groups)
Maybe I'm missing something, but couldn't you just put the dash inside the format call?
def parse_pins(pattern, pin):
    L = [x for x in re.search(pattern, pin).groups()]
    return '{dist}-{map_sheet}-{lot}{lot_ext}'.format(dist=L[0] if len(L[0]) == 2 else '0'+L[0],
                                                       map_sheet=L[1],
                                                       lot=L[2],
                                                       lot_ext='-{0}'.format(L[3]) if L[3] else '')
Throw them in a list and call '-'.join(list_). The list should have 3 or 4 values.
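A minimal sketch of the join-based suggestion, assuming the same regular expression as in the question. The names PIN_PATTERN and parse_pin are chosen here for illustration, and raising ValueError on a non-matching input is an added assumption rather than something the question asks for:

```python
import re

PIN_PATTERN = re.compile(r'^(\d{1,2})(\d{3})(\d{4})(\d{3})?$')

def parse_pin(pin):
    """Format a raw parcel number, with or without a lot extension."""
    match = PIN_PATTERN.match(pin)
    if match is None:
        raise ValueError("unrecognized parcel number: %r" % pin)
    # Drop the optional lot extension when it did not match (None).
    parts = [group for group in match.groups() if group is not None]
    # Left-pad a one-digit district to two digits.
    parts[0] = parts[0].zfill(2)
    return '-'.join(parts)

print(parse_pin('00000000'))      # 00-000-0000
print(parse_pin('000000000000'))  # 00-000-0000-000
```

Since join only puts separators between items, the trailing dash disappears without any special case for the missing extension.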
Hi, I'm trying to create a search function in Python that goes through a list and searches for an element in it.
So far I've got:
def search_func(list, x)
    if list < 0:
        return("failure")
    else:
        x = list[0]
        while x > list:
            x = list [0] + 1 <---- how would you tell python to go to the next element in the list ?
            if (x = TargetValue):
                return "success"
            else
                return "failure"
Well, your current code isn't very Pythonic, and there are several mistakes! You have to use indexes to access an element in a list. Correcting your code, it looks like this:
def search_func(lst, x):
    if len(lst) <= 0: # this is how you test if the list is empty
        return "failure"
    i = 0 # we'll use this as index to traverse the list
    while i < len(lst): # this is how you test to see if the index is valid
        if lst[i] == x: # this is how you check the current element
            return "success"
        i += 1 # this is how you advance to the next element
    else: # this executes only if the loop didn't find the element
        return "failure"
... But notice that in Python you rarely use while to traverse a list. A much more natural and simpler approach is to use for, which automatically binds a variable to each element, without having to use indexes:
def search_func(lst, x):
    if not lst: # shorter way to test if the list is empty
        return "failure"
    for e in lst: # look how easy is to traverse the list!
        if e == x: # we no longer care about indexes
            return "success"
    else:
        return "failure"
But we can be even more Pythonic! The functionality you want to implement is so common that it's already built into lists. Just use in to test if an element is inside a list:
def search_func(lst, x):
    if lst and x in lst: # test for emptiness and for membership
        return "success"
    else:
        return "failure"
Are you saying you want to see if an element is in a list? If so, there is no need for a function like that. Just use in:
>>> lst = [1, 2, 3]
>>> 1 in lst
True
>>> 4 in lst
False
>>>
This method is a lot more efficient.
If you have to do it without in, I suppose this will work:
def search_func(lst, x):
    return "success" if lst.count(x) else "failure"
You don't need to write a function for searching; just use
x in llist
Update:
def search_func(llist,x):
    for i in llist:
        if i==x:
            return True
    return False
You are making the problem more complex than it needs to be; think the problem through before starting to code. You are using a while loop, which can easily turn into an infinite loop. A for loop is a better fit here, and once you have the right condition inside it you are almost done.
def search_func(lst,x):
    for e in lst: #here e defines elements in the given list
        if e==x: #if condition checks whether element is equal to x
            return True
    else: # runs only if the loop finished without finding x
        return False
def search(query, result_set):
    if isinstance(query, str):
        query = query.split()
    assert isinstance(query, list)
    results = []
    for i in result_set:
        if all(quer.casefold() in str(i).casefold() for quer in query):
            results.append(i)
    return results
Works best.
I have this code I'm trying to get to work. I can create a set of random numbers, but I need to make the max value show up. I'm trying not to use python's built in max command, BUT, I will ask for an example if I can't find a solution.
import random
def randomNumbers(number):
    myList = []
    numbersToCreate = number
    while numbersToCreate > 0:
        randomNumber = int(random.random() * 100)
        myList.append(randomNumber)
        numbersToCreate = numbersToCreate -1
    return myList
One piece of code I've tried to enter is this:
theList = []
theList.sort()
biggest = theList [-1:][0]
print (theList)
When I try to run that with it I get an error telling me the list isn't defined. Any help would be appreciated.
Here's a solution.
import random
def randomNumbers(number):
    theList = []
    numbersToCreate = number
    while numbersToCreate > 0:
        randomNumber = int(random.random() * 100)
        theList.append(randomNumber)
        numbersToCreate -= 1
    return theList
outList = randomNumbers(100)
outList.sort()
print(outList[-1]) # No reason to slice the list, which is what you were doing.
You really should use Python's max() function, at least for readability's sake.
If not, you can always check how it is implemented in CPython, since the source is open.
theList = randomNumbers(30)
biggest = max(theList)
print (biggest)
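Since the question asks for a way to do this without max(), here is one possible hand-written version, offered as a sketch rather than as how the built-in works internally. The names random_numbers and find_biggest are made up for this example, and raising ValueError on an empty list is an assumption about the desired behavior:

```python
import random

def random_numbers(count):
    """Return `count` random integers in the range 0..99."""
    return [random.randrange(100) for _ in range(count)]

def find_biggest(numbers):
    """Return the largest item without calling max() or sorting."""
    if not numbers:
        raise ValueError("cannot take the largest item of an empty list")
    biggest = numbers[0]
    for number in numbers[1:]:
        # Remember the largest value seen so far.
        if number > biggest:
            biggest = number
    return biggest

the_list = random_numbers(30)
print(the_list)
print(find_biggest(the_list))
```

This makes a single pass over the list, whereas sorting first does more work than is needed just to find one value.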
First of all, if you want int for your list, you can use random.randint(min, max) instead of int(random.random()*100).
Second, you need to call your function and pass the return list to theList
def randomNumberList(n):
    theList = []
    for i in range(n):
        theList.append(random.randint(0,100))
    return theList
theRealList = randomNumberList(n)
Then you will be able to use the actual list.
theRealList.sort()
theBiggest = theRealList[-1]
So I'm trying to learn Python on my own, and am doing coding puzzles. I came across one that pretty much asks for the best position to stand in line to win a contest. The person running the contest repeatedly gets rid of the people standing in odd-numbered positions.
So for example if 1, 2, 3, 4, 5
It would get rid of the odd positions leaving 2, 4
Would get rid of the remaining odd positions leaving 4 as the winner.
When I'm debugging the code seems to be working, but it's returning [1,2,3,4,5] instead of the expected [4]
Here is my code:
def findWinner(contestants):
    if (len(contestants) != 1):
        remainingContestants = []
        for i, contestant in enumerate(contestants, 1):
            if (isEven(i)):
                remainingContestants.append(contestant)
        findWinner(remainingContestants)
    return contestants
Am I not seeing a logic error or is there something else that I'm not seeing?
You must return the value of the recursive call to the caller:
return findWinner(remainingContestants)
Otherwise you just return the original value without any changes.
def findWinner(contestants):
    if (len(contestants) != 1):
        remainingContestants = []
        for i, contestant in enumerate(contestants, 1):
            if (isEven(i)):
                remainingContestants.append(contestant)
        return findWinner(remainingContestants) # here the value must be returned
    return contestants # without the return above, it will just return this value (original)
How about this:
import math
def findWinner(contestants):
    return [contestants[2**int(math.log(len(contestants),2))-1]]
I know it's not what the question is really about, but I had to =P. I can't just look at all that work for finding the greatest power of 2 not exceeding the number of contestants and not point it out.
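A variant of the same closed-form idea, sketched with integer arithmetic only so that floating-point rounding in the logarithm cannot affect the result. It relies on int.bit_length(); the name find_winner and the ValueError for an empty list are choices made for this example, and it returns the winner itself rather than a one-element list:

```python
def find_winner(contestants):
    """Return the contestant at the largest power-of-two position."""
    if not contestants:
        raise ValueError("need at least one contestant")
    # For n >= 1, 1 << (n.bit_length() - 1) is the largest power of two <= n.
    winning_position = 1 << (len(contestants).bit_length() - 1)
    return contestants[winning_position - 1]

print(find_winner([1, 2, 3, 4, 5]))  # 4
```

The reasoning is that each round keeps the even positions, so a position survives k rounds only if it is divisible by 2**k, and the last survivor is the largest power of two that fits in the line.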
Or, if you don't like the 'artificial' solution and would like to actually perform the process:
def findWinner2(c):
    while len(c) > 1:
        c = [obj for index, obj in enumerate(c, 1) if index % 2 == 0] #or c = c[1::2] thanks desfido
    return c
You should use
return findWinner(remainingContestants)
otherwise, of course, the filtered list is never passed back, so your function will always return contestants.
However, see PEP 8 for the style guide on Python code: http://www.python.org/dev/peps/pep-0008/
The function isEven is probably overkill... just write
if not num % 2
Finally, deep recursion isn't recommended in Python (there is a recursion limit and no tail-call optimization); write something like
def find_winner(alist):
    while len(alist) > 1:
        to_get_rid = []
        for pos, obj in enumerate(alist, 1):
            if pos % 2:
                to_get_rid.append(obj)
        alist = [x for x in alist if not (x in to_get_rid)]
    return alist
Is there a reason you're iterating over the list instead of using a slice? Not using slices doesn't seem very Pythonic to me.
Additionally, you might want to do something sensible in the case of an empty list. You'll currently recurse without end, until Python hits its recursion limit.
I'd write your function as
def findWinner(contestants):
    if not contestants:
        raise Exception
    if len(contestants)==1:
        return contestants[0]
    return findWinner(contestants[1::2])
(much like @jon_darkstar's point, this is a bit tangential to the question you are explicitly asking, but still a better practice to engage in than what you're doing)
You are missing a return at the line where you call "findWinner"