limiting the URL output from newspaper - python

I'm using newspaper3k to extract URLs from news.google.com, but the problem is that I keep getting all of the URLs (I've disabled memoization because I need the full list). I would like to print only the top 5 links, or 5 random links; it doesn't really matter which.
I've tried setting a max, but that didn't work. Any ideas?
import newspaper
news = newspaper.build('https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pWVXlnQVAB?oc=3&ceid=US:en', memoize_articles=False)
for article in news.articles:
    print(article.url)

This code snippet should be exactly what you want. It doesn't use a newspaper function; instead it uses the random module to select a certain number of URLs. The output from newspaper isn't a plain list of URL strings, so each URL is first appended to a list. Enjoy!
import newspaper
business_news = newspaper.build('https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pWVXlnQVAB?hl=en-US&gl=US&ceid=US%3Aen', language='en', memoize_articles = False)
import random

myList = []
for article in business_news.articles:
    myList.append(str(article.url))

print(myList)  # not necessary, just for display purposes

randarticles = random.sample(myList, 5)
print(randarticles)
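If you want the first 5 links rather than a random sample, a plain slice of the list works. A minimal sketch along the lines of the question's code, reusing the same news.google.com topic URL:

import newspaper

news = newspaper.build('https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pWVXlnQVAB?oc=3&ceid=US:en',
                       memoize_articles=False)

# Build the full list of URLs, then keep only the first five.
urls = [article.url for article in news.articles]
for url in urls[:5]:
    print(url)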

Related

Get all names from wikipedia-site?

I'm trying to extract all names from this site:
https://en.wikipedia.org/wiki/Category:Masculine_given_names
(I want all the names listed on this site and its following pages, but also those in the subcategories listed at the top, such as Afghan masculine given names, African masculine given names, etc.)
I tried this with the following code:
import pywikibot
from pywikibot import pagegenerators
site = pywikibot.Site()
cat = pywikibot.Category(site,'Category:Masculine_given_names')
gen = pagegenerators.CategorizedPageGenerator(cat)
for idx, page in enumerate(gen):
    text = page.text
    print(idx)
    print(text)
This generally works fine and gives me at least the detail page of a single name. But how can I get all the names from all the subpages of this category, and also from its subcategories?
How do I find subcategories and subpages on Wikipedia using pywikibot?
This is already answered here using Category methods, but you can also use the pagegenerators CategorizedPageGenerator function. All you need is to set the recurse option:
>>> gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
Refer to its documentation. You may also include pagegenerators options within your script in the way described in this example and call your script with the -catr option:
pwb.py <yourscript> -catr:Masculine_given_names
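Plugged back into the question's script, a minimal sketch might look like this (printing page titles rather than full page text just to keep the output short; recurse=True makes the generator descend into the subcategories):

import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site()
cat = pywikibot.Category(site, 'Category:Masculine_given_names')

# recurse=True also yields pages from subcategories such as
# "Afghan masculine given names", "African masculine given names", etc.
gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)

for idx, page in enumerate(gen):
    print(idx, page.title())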

How do you generate a valid YouTube URL with Python?

Is there any way to generate a valid YouTube URL with Python?
import random
import requests
from string import ascii_uppercase, ascii_lowercase, digits

charset = list(ascii_uppercase) + list(ascii_lowercase) + list(digits)

def gen_id():
    res = ""
    for i in range(11):
        res += random.choice(charset)
    return res

youtube_url = "https://www.youtube.com/watch?v=" + gen_id()
resp = requests.get(youtube_url)
print(resp.status_code)
I am using this example to generate a random YouTube URL.
I get response code 200, but no video is found when I try to open the URL in the browser.
I looked at this method but it does not work.
IDs are generated randomly and are not predictable. They are all supposedly Base64-style, though, which helps limit the character set (you will probably want to add dashes and underscores to your random generation, since codes like gbhDL8BT_w0 are possible). The only real approach known is to generate and then test, and as some commenters mentioned, this might get rate-limited by YouTube.
There are some additional details provided in this answer to a similar question that might help in doing the generation, or satiating curiosity.
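As a rough sketch of that generate-and-test idea: the ID alphabet below includes - and _, and the existence check goes through YouTube's oEmbed endpoint, which (unlike the watch page, which tends to return 200 for anything) responds with a non-200 status for videos that don't exist. The oEmbed check is an assumption added here, not something from the original answer.

import random
import requests
from string import ascii_uppercase, ascii_lowercase, digits

# Full 64-character alphabet used by YouTube-style IDs.
charset = ascii_uppercase + ascii_lowercase + digits + "-_"

def gen_id():
    return "".join(random.choice(charset) for _ in range(11))

def video_exists(video_id):
    # Assumption: the oEmbed endpoint returns 200 only for real videos.
    resp = requests.get(
        "https://www.youtube.com/oembed",
        params={"url": "https://www.youtube.com/watch?v=" + video_id,
                "format": "json"},
    )
    return resp.status_code == 200

video_id = gen_id()
print(video_id, video_exists(video_id))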
It's not possible to always pick a valid random URL out of all the videos YouTube has, because not every well-formed ID actually corresponds to a video. You have to check yourself that the URLs you want to choose from randomly are valid. Pick some videos and put them in a list.
import random

myUrls = [
    "https://www.youtube.com/watch?v=...",
    "https://www.youtube.com/watch?v=...",
    ...
]
youtube_url = random.choice(myUrls)

Is there a way to grab all the links on a page except ones containing a specific word in selenium?

I've been trying for hours to find a way to do this, and so far I've found nothing. I've tried using find element by CSS, XPath, and partial text with the not() function. I'm trying to scan a webpage for all the links that don't contain the word 'google' and append them to an array.
Keep in mind that speak and get_audio are separate functions I have not included.
driver = webdriver.Chrome(executable_path='mypath')
url = "https://www.google.com/search?q="
driver.get(url + text.lower())
speak("How many articles should I pull?")
n = get_audio()
speak(f"I'll grab {n} articles")
url_array = []
for a in driver.find_elements_by_xpath("//*[not(contains(text(), 'google'))]"):
    url_array.append(a.get_attribute('href'))
print(url_array)
I always get something along the lines of "find_elements_* can't take ..." (whatever I put in there), or it runs but adds everything to the array, even the links with google in them. Anyone have any ideas? Thanks!
I finally got it working by defining a new function and filtering the list after it was built, instead of trying to get Selenium to do it.
def Filter(string, substr):
    # Keep only the entries that do not contain the unwanted substring(s).
    return [s for s in string if
            any(sub not in s for sub in substr)]
Then using that, plus a filter to get rid of the None entries:
url_array_2 = []
for a in driver.find_elements_by_xpath('.//a'):
    url_array_2.append(a.get_attribute('href'))

url_array_1 = list(filter(None, url_array_2))
flist = ['google']
url_array = Filter(url_array_1, flist)
print(url_array)
Worked perfectly :)
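If you do want Selenium itself to do the filtering, the condition can be expressed directly in the XPath by matching a elements whose href does not contain 'google' (the original attempt tested text() rather than the href attribute). A minimal sketch, using the same find_elements_by_xpath API as the question:

url_array = []
# Only anchors that have an href and whose href does not contain 'google'.
for a in driver.find_elements_by_xpath("//a[@href and not(contains(@href, 'google'))]"):
    url_array.append(a.get_attribute('href'))
print(url_array)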

How to print only specific words of a string

I want to print only the "words" from the string that start with "/watch", and then add all of those '/watch...' entries to a list. Thanks in advance!
# Take a random video from my youtube recommended and add it to watch2gether
import requests
from bs4 import BeautifulSoup as BS
import time
import random
# Importing libraries
num = random.randint(1, 20)
recommended = requests.get('https://www.youtube.com/results?search_query=svenska+youtube+klassiker&sp=EgIQAQ%253D%253D')
recommended_soup = BS(recommended.content, features='lxml')
recommended_vid = recommended_soup.find_all('a', href=True)
for links in recommended_vid:
    print(links['href'])
Output:
/
//www.youtube.com/upload
/
/feed/trending
/feed/history
/premium
/channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ
/channel/UCEgdi0XIXXZ-qJOFPf4JSKw
/gaming
/feed/guide_builder
/watch?v=PbVt_O1kFpA
/watch?v=PbVt_O1kFpA
/user/thedjdoge
/watch?v=1lcksCjvuSs
/watch?v=1lcksCjvuSs
/channel/UCn-puiDqHNMhRvq6wsU3nsQ
/watch?v=AKj_pxp2l1c
/watch?v=AKj_pxp2l1c
/watch?v=QNnEqTQD6DM
/watch?v=QNnEqTQD6DM
/channel/UCDuOAYzgiZzqqlXd2G3GAwg
....
Maybe I can use something like .remove or .replace, but I don't know what to do, so I'd appreciate any help.
Yeah, re is definitely overkill here; this is a perfect use case for filter:
a_list = ["/watch/blah", "not/watch"]
new_list = filter(lambda x: x.startswith("/watch"), a_list)
print(list(new_list))
['/watch/blah']
Just be aware that filter returns a lazy iterator, so wrap it in list() if you want an actual list.
http://book.pythontips.com/en/latest/map_filter.html is good if you want more information on functions that do this kind of data cleaning. If you need to get really fancy with your data cleaning look into using pandas. It has a steep learning curve, but it's fantastic for complicated data cleaning.
You can do the following:
for links in recommended_vid:
    if "/watch" in links['href']:
        print(links['href'])
This should help you find all the /watch links.
import re

pattern = re.compile(r"/watch")
# pattern = re.compile(r"/watch\?v=[a-zA-Z_0-9]{11}")  # this pattern finds the full links as well
matches = pattern.finditer(<your_string>)
for m in matches:
    print(m)  # will print all the locations at which /watch occurs
You can collect all the URLs in a list and proceed. Good Luck!!
Looking at your code, a simple if statement with str.startswith() should suffice to get what you want.
Assuming the links['href'] contains a str, then:
for links in recommended_vid:
    href = links['href']  # I think 'href' will be of type 'str'
    if href.startswith('/watch'):
        print(href)
Note: .startswith() will only work if /watch is really at the start of the href; you could also try if '/watch' in href:, which will match if that string appears anywhere in href.
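To also end up with the list the question asks for, the same startswith check can feed a list comprehension over the soup built in the question (recommended_soup is assumed to exist as above):

# Collect every href that starts with /watch into a list.
watch_links = [a['href'] for a in recommended_soup.find_all('a', href=True)
               if a['href'].startswith('/watch')]
print(watch_links)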

How to convert output to a list to count the number of items?

I wrote a script that parses a webpage and gets the number of links (a tags) on it:
import urllib
import lxml.html

connection = urllib.urlopen('http://test.com')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print link
The output of a script:
./01.html
./52.html
./801.html
http://www.blablabla.com/1.html
#top
How can I convert it to a list so I can count the number of links? I used link.split(), but it gave me:
['./01.html']
['./52.html']
['./801.html']
['http://www.blablabla.com/1.html']
['#top']
But I want to get:
[./01.html, ./52.html, ./801.html, http://www.blablabla.com/1.html, #top]
Thanks!
link.split() tries to split each individual link. You need to work with the object that represents all the links, which in your case is dom.xpath('//a/@href').
So this should help you:
links = list(dom.xpath('//a/@href'))
And getting length with a built-in len function:
print len(links)
list(dom.xpath('//a/@href'))
This takes what dom.xpath returns and puts every item into a new list.
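The question's code is Python 2 (urllib.urlopen, print statement). For reference, an equivalent sketch in Python 3, keeping the same placeholder URL:

import urllib.request
import lxml.html

connection = urllib.request.urlopen('http://test.com')
dom = lxml.html.fromstring(connection.read())

links = dom.xpath('//a/@href')   # already a list of href strings
print(len(links))                # number of links on the page
print(links)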
