Finding links in file, keeps repeating same link - python

I'm a bit new to Python, but I have taken a HS level Java class. I'm trying to write a Python script that will take all the torrent links in my Humble Bundle downloads page and spit them out into a .txt file. I'm currently trying to get it to read all of them and print them, but I can't seem to get it to look past the first one. I've tried some different loops, and some of them spit it out once, others continuously spit out the same one over and over. Here is my code.
f = open("Humble Bundle.htm").read()
pos = f.find('torrents.humblebundle.com') #just to initialize it for the loop
end = f.find('.torrent') #same here
pos1 = f.find('torrents.humblebundle.com') #first time it appears
end1 = f.rfind('.torrent') #last time it appears
while pos >= pos1 and end <= end1:
    pos = f.find('torrents.humblebundle.com')
    end = f.find('.torrent')
    link = f[pos:end+8] #the link in String form
    print(link)
I would like help in both my current issue and on how to continue to the final script. This is my first post here, but I've researched what I could before giving up and asking for help. Thanks for your time.

You can find more information about find method at http://docs.python.org/2/library/string.html#string.find
The problem is that these two lines always return the same values for pos and end, because find is called with the same arguments on every iteration:
pos = f.find('torrents.humblebundle.com')
end = f.find('.torrent')
The find method has an optional second parameter, start, which tells it where to begin searching for the given string. So if you change your code to:
pos = f.find('torrents.humblebundle.com', pos+1)
end = f.find('.torrent', end+1)
it should work.
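For reference, a minimal sketch of the whole corrected loop, using an inline sample string instead of the "Humble Bundle.htm" file from the question (the sample links are made up):

```python
# Inline sample standing in for the downloaded HTML file.
f = ('<a href="http://torrents.humblebundle.com/a.torrent">A</a> '
     '<a href="http://torrents.humblebundle.com/b.torrent">B</a>')

links = []
pos = f.find('torrents.humblebundle.com')
while pos != -1:
    end = f.find('.torrent', pos)       # look for the suffix after this match
    links.append(f[pos:end + 8])        # len('.torrent') == 8
    pos = f.find('torrents.humblebundle.com', pos + 1)  # advance past this match

print(links)
```

find returns -1 when no further match exists, which is what ends the loop; that is the same -1 the original while condition never checked for.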

You can try a regular expression here:
import re
f = open('Humble Bundle.htm').read()
pattern = re.compile(r'torrents\.humblebundle\.com.*?\.torrent')  # non-greedy .*? so each match stops at its own .torrent
print re.findall(pattern, f)
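As a quick sanity check of the regex approach, here is a sketch against an inline sample (made-up links) rather than the real file; note the non-greedy `.*?`, without which a greedy `.*` would swallow everything up to the last `.torrent` on a line:

```python
import re

# Inline sample standing in for the HTML file.
html = ('<a href="http://torrents.humblebundle.com/a.torrent">A</a> '
        '<a href="http://torrents.humblebundle.com/b.torrent">B</a>')

pattern = re.compile(r'torrents\.humblebundle\.com.*?\.torrent')
matches = pattern.findall(html)
print(matches)
```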

Related

Is there a way to grab all the links on a page except ones containing a specific word in selenium?

I've been trying for hours to find a way to do this, and so far I've found nothing. I've tried using find element by css, xpath, and partial text using the not function. I'm trying to scan a webpage for all the links that don't contain the word 'google', and append them to an array.
Keep in mind speak and get_audio are separate functions I have not included.
driver = webdriver.Chrome(executable_path='mypath')
url = "https://www.google.com/search?q="
driver.get(url + text.lower())
speak("How many articles should I pull?")
n = get_audio()
speak(f"I'll grab {n} articles")
url_array = []
for a in driver.find_elements_by_xpath("//*[not(contains(text(), 'google'))]"):
    url_array.append(a.get_attribute('href'))
print(url_array)
I always get something along the lines of find_elements_* can't take (whatever I put in here), or it works but it adds everything to the array, even the ones with google in them. Anyone have any ideas? Thanks!
I finally got it by defining a new function and filtering the list after it was made, instead of trying to get selenium to do it.
def Filter(string, substr):
    return [str for str in string if
            any(sub not in str for sub in substr)]
Then using that and a filter to get rid of the None entries:
url_array_2 = []
for a in driver.find_elements_by_xpath('.//a'):
    url_array_2.append(a.get_attribute('href'))
url_array_1 = list(filter(None, url_array_2))
flist = ['google']
url_array = Filter(url_array_1, flist)
print(url_array)
Worked perfectly :)
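The same post-filtering can be sketched without Selenium at all. Given a list of hrefs (a hypothetical sample below; real ones would come from get_attribute('href')), drop the None entries and any URL containing a blocked substring in one pass:

```python
# Hypothetical hrefs as Selenium would return them; anchors without an
# href attribute come back as None.
hrefs = [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://www.google.com/preferences",
    None,
    "https://example.com/article",
]

flist = ['google']  # substrings to exclude, as in the question
url_array = [h for h in hrefs
             if h is not None and not any(sub in h for sub in flist)]
print(url_array)
```

Alternatively, the filtering could in principle be pushed into the XPath itself by testing the attribute rather than the text, e.g. `.//a[not(contains(@href, "google"))]` — the original XPath failed because it tested text(), not the href.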

Using .find(" ") method without cutting last character if the substring is at the very end

I am trying to find a substring which is a basically a link to any website. The idea is that if a user posts something, the link will be extracted and assigned to a variable called web_link. My current code is following:
post = ("You should watch this video https://www.example.com if you have free time!")
web_link = post[post.find("http" or "www"):post.find(" ", post.find("http" or "www"))]
The code works perfectly if there is a space after the link; however, it fails if the link is at the very end of the post. For example:
post = ("You should definitely watch this video https://www.example.com")
Then post.find(" ") cannot find a space, and returns -1, which results in web_link being "https://www.example.co"
I am trying to find a solution that does not involve an if statement if possible.
Use a regex. I've made a small change to the solution here.
import re
def func(post):
    return re.search(r"[(http|ftp|https)://]*([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?", post).group(0)
print(func("You should watch this video www.example.com if you have free time!"))
print(func("You should watch this video https://www.example.com"))
Output:
www.example.com
https://www.example.com
But I should say, using an "if" is simpler and more obvious:
def func(post):
    start = post.find("http" or "www")
    finish = post.find(" ", start)
    return post[start:] if finish == -1 else post[start:finish]
The reason this doesn't work is that when the substring isn't found, find returns -1, and the slice interprets that as "up to one character from the end of the string".
As ifma pointed out, the best way to achieve this would be with a regular expression. Something like:
re.search(r"(https?://|www)\S+", post).group(0)
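A quick check of that regex idea as a sketch (extract_link is just a name chosen here, not from the question):

```python
import re

def extract_link(post):
    # A link starts with http(s):// or www and runs until the next whitespace.
    m = re.search(r"(https?://|www)\S+", post)
    return m.group(0) if m else None

print(extract_link("You should watch this video https://www.example.com if you have free time!"))
print(extract_link("You should definitely watch this video https://www.example.com"))
```

Both calls return the full link, whether or not a space follows it, which is exactly the case the slicing approach mishandled.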

How Do I Start Pulling Apart This Block of JSON Data?

I'd like to make a program that makes offline copies of math questions from Khan Academy. I have a huge 21.6MB text file that contains data on all of their exercises, but I have no idea how to start analyzing it, much less start pulling the questions from it.
Here is a pastebin containing a sample of the JSON data. If you want to see all of it, you can find it here. Warning for long load time.
I've never used JSON before, but I wrote up a quick Python script to try to load up individual "sub-blocks" (or equivalent, correct term) of data.
import sys
import json
exercises = open("exercises.txt", "r+b")
byte = 0
frontbracket = 0
backbracket = 0
while byte < 1000: #while byte < character we want to read up to
                   #keep at 1000 for testing purposes
    char = exercises.read(1)
    sys.stdout.write(char)
    #Here we decide what to do based on what char we have
    if str(char) == "{":
        frontbracket = byte
        while True:
            char = exercises.read(1)
            if str(char) == "}":
                backbracket = byte
                break
        exercises.seek(frontbracket)
        block = exercises.read(backbracket - frontbracket)
        print "Block is " + str(backbracket - frontbracket) + " bytes long"
        jsonblock = json.loads(block)
        sys.stdout.write(block)
        print jsonblock["translated_display_name"]
        print "\nENDBLOCK\n"
    byte = byte + 1
Ok, the repeated pattern appears to be this: http://pastebin.com/4nSnLEFZ
To get an idea of the structure of the response, you can use JSONlint to copy/paste portions of your string and 'validate'. Even if the portion you copied is not valid, it will still format it into something you can actually read.
First, I used the requests library to pull the JSON for you. It's a super-simple library for things like this. The API is slow to respond because it seems you're pulling everything, but it should work fine.
Once you get a response from the API, you can convert that directly to python objects using .json(). What you have is essentially a mixture of nested lists and dictionaries that you can iterate through and pull specific details. In my example below, my_list2 has to use a try/except structure because it would seem that some of the entries do not have two items in the list under translated_problem_types. In that case, it will just put 'None' instead. You might have to use trial and error for such things.
Finally, since you haven't used JSON before, it's worth noting that JSON objects behave like dictionaries: you are not guaranteed the order in which you receive their keys. In this case the outermost structure is a list, so in theory the order could be consistent, but don't rely on it; we don't know how the list is constructed.
import requests
api_call = requests.get('https://www.khanacademy.org/api/v1/exercises')
json_response = api_call.json()
# Assume we first want to list "author name" with "author key"
# This should loop through the repeated pattern in the pastebin
# access items as a dictionary
my_list1 = []
for item in json_response:
    my_list1.append([item['author_name'], item['author_key']])
print my_list1[0:5]
# Now let's assume we want the 'sha' of the SECOND entry in translated_problem_types
# to also be listed with author name
my_list2 = []
for item in json_response:
    try:
        the_second_entry = item['translated_problem_types'][0]['items'][1]['sha']
    except IndexError:
        the_second_entry = 'None'
    my_list2.append([item['author_name'], item['author_key'], the_second_entry])
print my_list2[0:5]
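The same nested-access pattern can be tried offline against a small inline sample. The structure below is hypothetical, merely mirroring the fields discussed above, so you can see the try/except fallback fire without hitting the API:

```python
import json

# Hypothetical sample shaped like the API response described above.
sample = '''
[
  {"author_name": "Ann", "author_key": "k1",
   "translated_problem_types": [{"items": [{"sha": "abc"}, {"sha": "def"}]}]},
  {"author_name": "Bob", "author_key": "k2",
   "translated_problem_types": [{"items": [{"sha": "xyz"}]}]}
]
'''
data = json.loads(sample)

rows = []
for item in data:
    try:
        # second entry's sha, if it exists
        second_sha = item['translated_problem_types'][0]['items'][1]['sha']
    except IndexError:
        second_sha = None  # fewer than two items for this exercise
    rows.append([item['author_name'], item['author_key'], second_sha])

print(rows)
```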

Not picking up all XML elementree sub-sub elements in python

I am trying to capture all the claim text in a bunch of XML patent files, but I'm having trouble with tags nested inside <claim-text>. Sometimes there's another <claim-text>, and sometimes a <claim-ref> interrupts the text. In my output, the text gets cut off. Usually there are over 10 claims. I am trying to get only the text inside the claim text elements.
I've already looked and tried the following but these don't work:
xml elementree missing elements python and
How to get all sub-elements of an element tree with Python ElementTree?
I've included a snippet here as it does get quite long to capture all.
My code for this is below (where fullname is the file name and directory).
for _, elem in iterparse(fullname):
    description = '' # reset to empty string at beginning of each loop
    abtext = '' # reset to empty string at beginning of each loop
    claimtext = '' # reset to empty string
    if elem.tag == 'claims':
        for node4 in tree.findall('.//claims/claim/claim-text'):
            claimtext = claimtext + node4.text
        f.write('\n\nCLAIMTEXT\n\n\n')
        f.write(smart_str(claimtext) + '\n\n')
        #put row in df
        row = dict(zip(['PATENT_ID', 'CLASS', 'ABSTRACT', 'DESCRIPTION', 'CLAIMS'], [data, cat, abtext, description, claimtext]))
        row_s = pd.Series(row)
        row_s.name = i
        df = df.append(row_s)
So the resulting problem is twofold: a) only one of the texts gets printed to file, and b) nothing comes into the dataframe at all. I'm not sure if that's part of the same problem or two separate problems. I can get the claims to print into a file, and that works fine, but it skips some of the text.
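One likely cause of the truncation is that node4.text only returns text up to the first nested child. A sketch of a fix, assuming the patent structure described above (the XML snippet here is hypothetical): ElementTree's itertext() yields the text of an element and all of its descendants, so nested <claim-text> and <claim-ref> tags no longer cut the text short.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet mirroring the nested structure described in the question.
xml = '''<claims>
  <claim>
    <claim-text>A device comprising
      <claim-ref idref="CLM-1">claim 1</claim-ref>
      <claim-text>wherein the widget is blue.</claim-text>
    </claim-text>
  </claim>
</claims>'''

root = ET.fromstring(xml)
claimtext = ''
for node in root.findall('.//claim/claim-text'):
    # itertext() walks the element and every nested child, collecting all text.
    claimtext += ''.join(node.itertext())

# Collapse the leftover indentation whitespace for display.
print(' '.join(claimtext.split()))
```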

python Search for text in a gtk textview

I have looked around and I would think this to be really simple, but for some reason I have only found parts of what I need.
I have made a text editor with a search box: whatever is typed in it gets found. The problem is that it only finds the first occurrence in the text view, and I can't get it to search on to the next line.
Like a find function in a text document.
def search(found):
    search_str = findentry.get_text()
    start_iter = textbuffer.get_start_iter()
    found = start_iter.forward_search(search_str, 0, None)
    if found:
        match_start, match_end = found
        textbuffer.select_range(match_start, match_end)
I thought I could add a "search next" button that forward-searches again, advancing by a variable plus one.
How can I make it search forwards and backwards?
You are using get_start_iter(), which returns the first position in the text buffer. You probably want to start from match_end instead, the position where the word ends in the previous match.
Assuming you return found and call search again with that parameter, you can replace the line:
start_iter = textbuffer.get_start_iter()
by
start_iter = found[1] if found else textbuffer.get_start_iter()
The first time, or whenever you want to reset the search, you can pass found=None.
