I wrote a script that parses a webpage and gets the number of links ('a' tags) on it:
import urllib
import lxml.html
connection = urllib.urlopen('http://test.com')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print link
The output of the script:
./01.html
./52.html
./801.html
http://www.blablabla.com/1.html
#top
How can I convert this to a list so I can count the number of links? I used link.split(), but that gives me:
['./01.html']
['./52.html']
['./801.html']
['http://www.blablabla.com/1.html']
['#top']
But I want to get:
[./01.html, ./52.html, ./801.html, http://www.blablabla.com/1.html, #top]
Thanks!
link.split() tries to split link itself, but you need to work with the object that represents all of the links. In your case that is dom.xpath('//a/@href').
So this should help you:
links = list(dom.xpath('//a/@href'))
And you can get the length with the built-in len function:
print len(links)
list(dom.xpath('//a/@href'))
This takes the sequence of results that dom.xpath returns and puts every item into a list.
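Putting it together, here is a minimal Python 3 sketch of the whole script (it assumes lxml is installed; http://test.com stands in for your real URL):
import urllib.request
import lxml.html

# fetch the page and parse it into an element tree
connection = urllib.request.urlopen('http://test.com')
dom = lxml.html.fromstring(connection.read())
# xpath already returns a list of href values here
links = dom.xpath('//a/@href')
print(len(links))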
Please help me fix this; below is the code I've already tried.
I really appreciate your help.
import urllib.request
import re
search_keyword="ill%20wiat"
html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
video_ids = re.findall(r"watch?v=(\S{11})", html.read().decode())
print("https://www.youtube.com/watch?v=" + video_ids[0])
First of all, check the page you are trying to parse. You wrote:
r"watch?v=(\S{11})"
Just remember that the ? character here is parsed as a regex operator (zero or one of the preceding token) and not the literal string you want,
so first of all you need to write it like:
/watch[?]v=(\S{11})
so your regex will be parsed properly.
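A quick throwaway demonstration of the difference (the sample string is made up):
import re

sample = "watch?v=abc watcv=def"
# 'h?' makes the h optional, so the unescaped pattern matches 'watcv=' instead
print(re.findall(r"watch?v=", sample))    # ['watcv=']
# with the ? in a character class it matches the literal text
print(re.findall(r"watch[?]v=", sample))  # ['watch?v=']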
Second: it is good practice to print your list to see what you actually got, and to iterate over it with a for loop instead of accessing index [0] directly.
In your case you get this error just because your list of IDs is empty.
The following code works for me:
import urllib.request
import re

search_keyword = "ill%20wiat"
url = "https://www.youtube.com/results?search_query=" + search_keyword
with urllib.request.urlopen(url) as response:
    video_ids = re.findall(r"/watch[?]v=(\S{11})", response.read().decode())
for video in video_ids:
    print("https://www.youtube.com/watch?v=" + video)
P.S. Don't wrap your code in a try/except just to swallow errors like this one.
urllib may not give you the data you expect here.
Use requests instead:
import requests

html = requests.get('https://www.youtube.com/results?search_query=' + search_keyword)
text = html.text
text now holds all the HTML data, so run your search against text.
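Putting that together with the corrected regex from the answer above (a sketch; it assumes YouTube still serves /watch?v= links in the raw HTML):
import re
import requests

search_keyword = "ill%20wiat"
html = requests.get('https://www.youtube.com/results?search_query=' + search_keyword)
# same corrected pattern as above, run against the requests response body
video_ids = re.findall(r"/watch[?]v=(\S{11})", html.text)
for video_id in video_ids:
    print("https://www.youtube.com/watch?v=" + video_id)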
I'm using newspaper3k to extract URLs from news.google.com, but the problem is that I keep getting all the URLs (I've disabled memoization because I need the full list). I would like to print only the top 5 links, or 5 random links; it doesn't really matter which.
I've tried setting a max, but that didn't work. Any ideas?
import newspaper
news = newspaper.build('https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pWVXlnQVAB?oc=3&ceid=US:en', memoize_articles=False)
for article in news.articles:
    print(article.url)
This code snippet should be exactly what you want. It doesn't use a newspaper function but rather the random module to select a certain number of URLs. The output from newspaper isn't a list, so the URLs have to be collected into one with append first. Enjoy!
import newspaper
import random

business_news = newspaper.build('https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pWVXlnQVAB?hl=en-US&gl=US&ceid=US%3Aen', language='en', memoize_articles=False)
myList = []
for article in business_news.articles:
    myList.append(str(article.url))
print(myList)  # not necessary, just for display purposes
randarticles = random.sample(myList, 5)
print(randarticles)
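If you want the first five rather than a random five, plain list slicing on the same myList is enough:
top_five = myList[:5]  # first five collected URLs, in crawl order
print(top_five)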
I want to print only the "words" that start with "/watch" from the string, and then add all of the '/watch...' entries to a list. Thanks in advance!
# Take a random video from my youtube recommended and add it to watch2gether
import requests
from bs4 import BeautifulSoup as BS
import time
import random
# Importing libraries
num = random.randint(1, 20)
recommended = requests.get('https://www.youtube.com/results?search_query=svenska+youtube+klassiker&sp=EgIQAQ%253D%253D')
recommended_soup = BS(recommended.content, features='lxml')
recommended_vid = recommended_soup.find_all('a', href=True)
for links in recommended_vid:
    print(links['href'])
Output:
/
//www.youtube.com/upload
/
/feed/trending
/feed/history
/premium
/channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ
/channel/UCEgdi0XIXXZ-qJOFPf4JSKw
/gaming
/feed/guide_builder
/watch?v=PbVt_O1kFpA
/watch?v=PbVt_O1kFpA
/user/thedjdoge
/watch?v=1lcksCjvuSs
/watch?v=1lcksCjvuSs
/channel/UCn-puiDqHNMhRvq6wsU3nsQ
/watch?v=AKj_pxp2l1c
/watch?v=AKj_pxp2l1c
/watch?v=QNnEqTQD6DM
/watch?v=QNnEqTQD6DM
/channel/UCDuOAYzgiZzqqlXd2G3GAwg
....
Maybe I can use something like .remove or .replace; I don't know what to do, so I'd appreciate any help.
Yeah, re is definitely overkill here. This is a perfect use case for filter:
a_list = ["/watch/blah", "not/watch"]
new_list = filter(lambda x: x.startswith("/watch"), a_list)
print(list(new_list))
['/watch/blah']
Just be aware that filter returns a lazy iterator, so wrap it in list() if you want an actual list.
http://book.pythontips.com/en/latest/map_filter.html is good if you want more information on functions that do this kind of data cleaning. If you need to get really fancy with your data cleaning look into using pandas. It has a steep learning curve, but it's fantastic for complicated data cleaning.
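An equivalent list comprehension, if you prefer to get a list back directly:
a_list = ["/watch/blah", "not/watch"]
new_list = [x for x in a_list if x.startswith("/watch")]
print(new_list)  # ['/watch/blah']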
You can do the following:
for links in recommended_vid:
    if "/watch" in links['href']:
        print(links['href'])
This should help you find all the /watch links.
import re

pattern = re.compile(r"/watch")
# pattern = re.compile(r"/watch\?v=[a-zA-Z_0-9]{11}") -- this pattern finds the full links as well
matches = pattern.finditer(<your_string>)
for m in matches:
    print(m)  # prints a Match object for every place /watch occurs
You can collect all the URLs in a list and proceed. Good Luck!!
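For example, to collect the matched substrings rather than Match objects (<your_string> is the same placeholder as above):
watch_links = [m.group(0) for m in pattern.finditer(<your_string>)]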
Looking at your code, a simple if statement with str.startswith() should suffice to get what you want.
Assuming links['href'] contains a str:
for links in recommended_vid:
    href = links['href']  # 'href' should be of type str
    if href.startswith('/watch'):
        print(href)
Note: .startswith() will only work if /watch is really at the start of the href; you could also try if '/watch' in href:, which will match if that string appears anywhere in href.
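The difference matters for hrefs like the channel and feed paths in your output; a tiny demonstration with a made-up href:
href = "/feed/watch_later"          # hypothetical href
print(href.startswith("/watch"))    # False: '/watch' is not at the start
print("/watch" in href)             # True: it occurs in the middle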
So I'm new to Python and am working on a simple program that will read a text file of protein names (PDB IDs) and create a URL to search a database (the PDB) for that protein and some associated data.
Unfortunately, as a newbie, I forgot to save my script, so I can't recall what I did to make my code work!
Below is my code so far:
import urllib
import urllib.parse
import urllib.request
import os
os.chdir("C:\\PythonProjects\\Samudrala Lab Projects")
protein_file = open("protein_list.txt","r")
protein_list = protein_file.read()
for item in protein_list:
    item = item[0:4]
    query_string = urlencode('customReportColumns','averageBFactor','resolution','experimentalTechnique','service=wsfile','format=csv')
    **final_URL = url + '?pdbid={}{}'.format(url, item, query_string)**
    print(final_URL)
The line of code I'm stuck on is starred.
The object "final_url" within the loop is missing some modification to indicate that I'd like the URL to search for the item as a pdbid. Can anyone give me a hint as to how I can tell the URL to plug in each item on the list as a PDBID?
I'm getting a type error indicating that it's not a valid non-string sequence or mapping object. (The original post was edited to add this info.)
Please let me know if this is an unclear question, or if you need any additional info.
Thanks!
How about something like this?
final_URL = "{}?pdbids={}{}".format(url, item, query_string)
I have some simple code:
p = soup.find_all("p")
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I am trying to convert a list I obtained from XML into strings. I want to keep each element with its original tags so I can reuse some of the text, which is why I am appending it like this. But the list contains over 6000 observations, so a recursion error occurs because of the str() call:
"RuntimeError: maximum recursion depth exceeded while calling a Python object"
I read that you can raise the maximum recursion depth, but that it's not wise to do so. My next idea was to split the conversion into batches of 500, but I am sure there has to be a better way. Does anyone have any advice?
The problem here is probably that some of the binary graphic data at the bottom of the document contains the sequence of characters <P, which Beautiful Soup is trying to repair into an actual HTML tag. I haven't managed to pinpoint which text is causing the "recursion depth exceeded" error, but it's somewhere in there. It's p[6053] for me, but since you seem to have modified the file a bit (or maybe you're using a different parser for Beautiful Soup), it'll be different for you, I imagine.
Assuming you don't need the binary data at the bottom of the document to extract whatever you need from the actual <p> tags, try this:
# boot out the last `<document>`, which contains the binary data
soup.find_all('document')[-1].extract()
p = soup.find_all('p')
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I believe the issue is that the BeautifulSoup object p is not built iteratively, so the method call limit is reached before you can finish constructing p = soup.find_all('p'). Note that the RecursionError is similarly thrown when building soup.prettify().
For my solution I used the re module to gather all <p>...</p> tags (see the code below). My final result was len(p) = 5571. This count is lower than yours because the regex did not match any text within the binary graphic data.
import re
import urllib.request

url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'
response = urllib.request.urlopen(url).read()
p = re.findall(r'<P((.|\s)+?)</P>', str(response))  # (pattern, string)
paragraphs = []
for x in p:
    paragraphs.append(str(x))
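One gotcha worth noting with this pattern: because it contains two capturing groups, re.findall returns 2-tuples, so str(x) stringifies a tuple rather than the paragraph markup. Making the inner group non-capturing yields the paragraph text directly (same setup as the snippet above):
# (?: ... ) is non-capturing, so findall now returns plain strings
p = re.findall(r'<P((?:.|\s)+?)</P>', str(response))
paragraphs = p  # each element is already the paragraph text as a str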