How can I get some data from a YouTube video? (Python)

I want to know how I can get some data from a YouTube video, like the views, thumbnails, or comments it has. I have been looking at Google's API but I can't understand it.
Thank you!

A different approach would be using urllib2 and getting the HTML code from the page and then filtering it.
import urllib2
source = 'https://www.youtube.com/watch?v=wDjeBNv6ip0'
response = urllib2.urlopen(source)
html = response.read() #Done, you have the whole HTML file in a gigantic string.
After that, all you have to do is filter it as you would any other string.
Getting the number of views, for instance:
wordBreak = ['<','>']
html = list(html)
i = 0
while i < len(html):
    if html[i] in wordBreak:
        html[i] = ' '
    i += 1
#The block above is just to make the html.split() easier.
html = ''.join(html)
html = html.split()
dataSwitch = False
numOfViews = ''
for element in html:
    if element == '/div':
        dataSwitch = False
    if dataSwitch:
        numOfViews += str(element)
    if element == 'class="watch-view-count"':
        dataSwitch = True
print (numOfViews)
>>> 45.608.212 views
This was a simple example of getting the number of views, but you can do the same for everything on the page, including the number of comments, likes, the content of the comments themselves, etc.
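If you prefer not to walk the string by hand, here is a short sketch of the same idea with a regular expression; it assumes the page still serves the class="watch-view-count" markup that the loop above looks for, and it runs on the raw page source (before the character-replacement block above):
import re
# 'page' is the raw source from response.read(), not the split-up version above
page = urllib2.urlopen(source).read()
match = re.search(r'class="watch-view-count"[^>]*>([^<]+)<', page)
if match:
    print (match.group(1).strip())  # e.g. "45.608.212 views"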

I think this is the part you are looking for (source):
def get_video_localization(youtube, video_id, language):
    results = youtube.videos().list(
        part="snippet",
        id=video_id,
        hl=language
    ).execute()
    localized = results["items"][0]["snippet"]["localized"]
localized will now contain the video's title, description, etc.
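The snippet part above won't return view or comment counts, though. Here is a hedged sketch of getting the numbers the question asks about, assuming you have an API key and the google-api-python-client package installed; the fields come from the statistics and snippet parts of videos.list:
from googleapiclient.discovery import build

# "YOUR_API_KEY" is a placeholder; create a key in the Google API console.
youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

response = youtube.videos().list(
    part="statistics,snippet",
    id="wDjeBNv6ip0"  # the video used in the first answer
).execute()

item = response["items"][0]
print(item["statistics"].get("viewCount"))           # views
print(item["statistics"].get("commentCount"))        # number of comments
print(item["snippet"]["thumbnails"]["high"]["url"])  # a thumbnail URL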

Related

Comparing results with Beautiful Soup in Python

I've got the following code that filters a particular search on an auction site.
I can display the titles of each value & also the len of all returned values:
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
soup = BeautifulSoup(url.text, "html.parser")
listings = soup.findAll("div", attrs={"class":"tm-marketplace-search-card__title"})
print(len(listings))
for listing in listings:
    print(listing.text)
This prints out the following:
#print(len(listings))
3
#for listing in listings:
# print(listing.text)
PRS. Ten Top Custom 24, faded Denim, Piezo.
PRS SE CUSTOM 22
PRS Tremonti SE *With Seymour Duncan Pickups*
I know what I want to do next, but don't know how to code it. Basically I want to only display new results. I was thinking of storing the len of the listings (3 at the moment) as a variable and then comparing it with another GET request (a second variable) that maybe runs first thing in the morning. Alternatively, I could compare both text values instead of the len. If they don't match, then it shows the new listings. Is there a better or different way to do this? Any help appreciated, thank you.
With length-comparison, there is the issue of some results being removed between checks, so it might look like there are no new results even if there are; and text-comparison does not account for results with similar titles.
I can suggest 3 other methods. (The 3rd uses my preferred approach.)
Closing time
A comment suggested using the closing time, which can be found in the tag before the title; you can define a function to get the days until closing
from datetime import date
import dateutil.parser

def get_days_til_closing(lSoup):
    cTxt = lSoup.previous_sibling.find('div', {'tmid':'closingtime'}).text
    cTime = dateutil.parser.parse(cTxt.replace('Closes:', '').strip())
    return (cTime.date() - date.today()).days
and then filter by the returned value
min_dtc = 3 # or as preferred
# your current code up to listings = soup.findAll....
new_listings = [l for l in listings if get_days_til_closing(l) > min_dtc]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing.text)
However, I don't know if sellers are allowed to set their own closing times or if they're set at a fixed offset; also, I don't see the closing time text when inspecting with the browser dev tools [even though I could extract it with the code above], and that makes me a bit unsure of whether it's always available.
JSON list of Listing IDs
Each result is in a "card" with a link to the relevant listing, and that link contains a number that I'm calling the "listing ID". You can save those IDs as a list in a JSON file and check against it on every new scrape
from bs4 import BeautifulSoup
import requests
import json

lFilename = 'listing_ids.json' # or as preferred
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")

try:
    prev_listings = json.load(open(lFilename, 'r'))
except Exception as e:
    prev_listings = []
print(len(prev_listings), 'saved listings found')

soup = BeautifulSoup(url.text, "html.parser")
listings = soup.select("div.o-card > a[href*='/listing/']")

new_listings = [
    l for l in listings if
    l.get('href').split('/listing/')[1].split('?')[0]
    not in prev_listings
]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings:
    print(listing.select_one('div.tm-marketplace-search-card__title').text)

with open(lFilename, 'w') as f:
    json.dump(prev_listings + [
        l.get('href').split('/listing/')[1].split('?')[0]
        for l in new_listings
    ], f)
As long as they don't tend to recycle the listing IDs, this should be fairly reliable. (Even then, every once in a while, after checking the new listings for the day, you can just delete the JSON file and re-run the program once; that will also keep the file from getting too big...)
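If deleting the file by hand feels clumsy, one option (just a sketch; MAX_SAVED is an arbitrary number) is to cap the list when saving it, so only the most recent IDs are kept:
MAX_SAVED = 500  # arbitrary cap; adjust as preferred
saved_ids = prev_listings + [
    l.get('href').split('/listing/')[1].split('?')[0] for l in new_listings
]
# same json import and lFilename as in the block above
with open(lFilename, 'w') as f:
    json.dump(saved_ids[-MAX_SAVED:], f)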
CSV Logging [including Listing IDs]
Instead of just saving the IDs, you can save pretty much all the details from each result
from bs4 import BeautifulSoup
import requests
from datetime import date
import pandas

lFilename = 'listings.csv' # or as preferred
max_days = 60 # or as preferred
date_today = date.today()
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")

try:
    prev_listings = pandas.read_csv(lFilename).to_dict(orient='records')
    prevIds = [str(l['listing_id']) for l in prev_listings]
except Exception as e:
    prev_listings, prevIds = [], []
print(len(prev_listings), 'saved listings found')

def get_listing_details(lSoup, prevList, lDate=date_today):
    selectorsRef = {
        'title': 'div.tm-marketplace-search-card__title',
        'location_time': 'div.tm-marketplace-search-card__location-and-time',
        'footer': 'div.tm-marketplace-search-card__footer',
    }
    lId = lSoup.get('href').split('/listing/')[1].split('?')[0]
    lDets = {'listing_id': lId}
    for k, sel in selectorsRef.items():
        s = lSoup.select_one(sel)
        lDets[k] = None if s is None else s.text
    lDets['listing_link'] = 'https://www.trademe.co.nz/a/' + lSoup.get('href')
    lDets['new_listing'] = lId not in prevList
    lDets['last_scraped'] = lDate.isoformat()
    return lDets

soup = BeautifulSoup(url.text, "html.parser")
listings = [
    get_listing_details(s, prevIds) for s in
    soup.select("div.o-card > a[href*='/listing/']")
]
todaysIds = [l['listing_id'] for l in listings]
new_listings = [l for l in listings if l['new_listing']]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing['title'])

prev_listings = [
    p for p in prev_listings if str(p['listing_id']) not in todaysIds
    and (date_today - date.fromisoformat(p['last_scraped'])).days < max_days
]
pandas.DataFrame(prev_listings + listings).to_csv(lFilename, index=False)
You'll end up with a spreadsheet log of your scraping history that you can check anytime, and, depending on what you set max_days to, the oldest data will be cleared automatically.
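To review the log later, a quick sketch (assuming the same listings.csv written above):
import pandas

log = pandas.read_csv('listings.csv')
# show the listings that were flagged as new on their most recent scrape
print(log[log['new_listing']][['listing_id', 'title', 'last_scraped']])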
Fixed it with the following:
allGuitars = ["",]
latestGuitar = soup.select("#-title")[0].text.strip()
if latestGuitar in allGuitars[0]:
    print("No change. The latest listing is still: " + allGuitars[0])
elif not latestGuitar in allGuitars[0]:
    print("New listing detected! - " + latestGuitar)
    allGuitars.clear()
    allGuitars.insert(0, latestGuitar)

Output non JSON data from regex web scraping to a JSON file

I'm using requests and regex to scrape data from an entire website and then save it to a JSON file hosted on GitHub, so I and anyone else can access the data from other devices.
The first thing I tried was just to open every single page on the website and get all the data I want, but I found that to be unnecessary, so I decided to make two scripts: the first one finds the URL of every page on the site, and the second one is the one called, which then scrapes the called URL. What I'm having trouble with right now is getting my data formatted correctly for the JSON file. Currently this is a sample of what the output looks like:
{
    "Console":"/neo-geo-aes",
    "Call ID":"62815",
    "URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"
}{
    "Console":"/neo-geo-cd",
    "Call ID":"62817",
    "URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"
}{
    "Console":"/neo-geo-pocket-color",
    "Call ID":"62578",
    "URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman"
}{
    "Console":"/playstation",
    "Call ID":"62580",
    "URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman-forever"
}
I've looked into this a lot and can't find a solution. Here's the code in question:
import re
import requests
import json

##The base URL
URL = "https://www.pricecharting.com/"
r = requests.get(URL)
htmltext = r.text

##Find all system URLs
dataUrl = re.findall('(?<=<li><a href="\/console).*(?=">)', htmltext)
print(dataUrl)

##For each Item(number of consoles) find games
for i in range(len(dataUrl)):
    ##make console URL
    newUrl = ("https://www.pricecharting.com/console" + dataUrl[i])
    req = requests.get(newUrl)
    newHtml = req.text
    ##Get item URLs
    urlOne = re.findall('(?<=<a href="\/game).*(?=">)', newHtml)
    itemId = re.findall('(?<=tr id="product-).*(?=" data)', newHtml)
    ##For every item in list(items per console)
    out_list = []
    for i in range(len(urlOne)):
        ##Make item URL
        itemUrl = ("https://www.pricecharting.com/game" + urlOne[i])
        callId = (itemId[i])
        ##Format for JSON
        json_file_content = {}
        json_file_content['Console'] = dataUrl[i]
        json_file_content['Call ID'] = callId
        json_file_content['URL'] = itemUrl
        out_list.append(json_file_content)
    data_json_filename = 'docs/result.json'
    with open(data_json_filename, 'a') as data_json_file:
        json.dump(out_list, data_json_file, indent=4)
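One likely cause of the broken output (a hedged observation, as the thread shows no accepted fix): opening the file in append mode and calling json.dump once per console writes several JSON documents back to back, so the file as a whole is not valid JSON, which is exactly what the sample above shows. A minimal sketch of one way around it is to collect every record into a single list for the whole crawl and dump it once at the end:
import json

all_items = []  # one list for the whole crawl

# ... inside the inner loop, append to all_items instead of out_list:
# all_items.append({'Console': dataUrl[i], 'Call ID': callId, 'URL': itemUrl})

# after every loop has finished, write the file once, in 'w' mode
with open('docs/result.json', 'w') as data_json_file:
    json.dump(all_items, data_json_file, indent=4)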

Find Multiple Tags in BeautifulSoup4 and insert them into one string

I have some code that gets your Pastebin data
import urllib.request
import urllib.parse

def user_key():
    user_key_data = {'api_dev_key': 'my-dev-key',
                     'api_user_name': 'my-username',
                     'api_user_password': 'my-password'}
    req = urllib.request.urlopen('https://pastebin.com/api/api_login.php',
                                 urllib.parse.urlencode(user_key_data).encode('utf-8'),
                                 timeout=7)
    return req.read().decode()

def user_pastes():
    data = {'api_dev_key': 'my_dev_key',
            'api_user_key': user_key(),
            'api_option': 'list'}
    req = urllib.request.urlopen('https://pastebin.com/api/api_post.php',
                                 urllib.parse.urlencode(data).encode('utf-8'), timeout=7)
    return req.read().decode()
Every paste has a unique HTML tag, e.g. url, title, paste key, etc.
The above code will print these out per paste.
I made some code that only takes certain tags: the paste url, the paste title, and the paste key.
my_pastes = []
src = user_pastes()
soup = BeautifulSoup(src, 'html.parser')
for paste in soup.findAll(['paste_url', 'paste_title', 'paste_key']):
    my_pastes.append(paste.text)
print(my_pastes)
What I want is to join the url, title, and key per paste together into one string.
I tried using the .join method, but it only joins the characters (might not make sense, but you'll see when you try it).
Unrelated to the problem: once they're joined, I'll split them again and put them in a PyQt5 table.
So this is kind of the answer, but I'm still looking for a simpler solution
title = []
key = []
url = []
src = user_pastes()
soup = BeautifulSoup(src, 'html.parser')

for paste_title in soup.findAll('paste_title'):
    title.append(paste_title.text)
for paste_key in soup.findAll('paste_key'):
    key.append(paste_key.text)
for paste_url in soup.findAll('paste_url'):
    url.append(paste_url.text)

for i in range(len(title)):
    print(title[i], key[i], url[i])
Maybe from this answer you'll get the idea of what I want to achieve, since the original post was kind of confusing and I couldn't really express what I want.
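One possible simplification (a sketch, assuming user_pastes() returns the usual Pastebin list XML where each paste sits inside its own <paste> element): iterate over the <paste> elements so the three fields are naturally grouped, then join them into one string per paste.
from bs4 import BeautifulSoup

src = user_pastes()  # the helper defined above
soup = BeautifulSoup(src, 'html.parser')

joined = []
for paste in soup.findAll('paste'):
    # pull the three fields out of this paste only, then glue them together
    parts = [paste.find(tag).text for tag in ('paste_title', 'paste_key', 'paste_url')]
    joined.append(' '.join(parts))
print(joined)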

Parsing HTML using LXML Python

I'm trying to parse Oxford Dictionary in order to obtain the etymology of a given word.
import lxml.html
from urllib.request import urlopen

class SkipException (Exception):
    def __init__(self, value):
        self.value = value

try:
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''

if doc:
    table = []
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")
I cannot seem to work out how to obtain the string of text I need. I know I'm missing some lines compared to the code I copied from, but I don't fully understand how HTML or lxml works. I would much appreciate it if someone could show me the correct way to solve this.
You don't want to do web scraping here, especially when almost every dictionary has an API. In the case of Oxford, create an account at https://developer.oxforddictionaries.com/, get the API credentials from your account, and do something like this:
import requests
import json

api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'
headers = {
    'app_id': '',
    'app_key': ''
}
url = api_base.format(language, word)
reply = requests.get(url, headers=headers)
if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))
Here's a sample to get you started scraping Oxford dictionary pages:
import lxml.html as lh
from urllib.request import urlopen

url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
elements = body.xpath("//span[@class='ind']")
for element in elements:
    print(element.text)
To find the correct search string you need to format the html so you can see the structure. I used the html formatter at https://www.freeformatter.com/html-formatter.html. Looking at the formatted HTML, I could see the definitions were in the span elements with the 'ind' class attribute.

Beautifulsoup can't find text

I'm trying to write a scraper in Python using urllib and Beautiful Soup. I have a CSV of URLs for news stories, and the scraper works for ~80% of the pages, but when there is a picture at the top of the story the script no longer pulls the time or the body text. I am mostly confused because soup.find and soup.find_all don't seem to produce different results. I have tried a variety of different tags that should capture the text, as well as 'lxml' and 'html.parser'.
Here is the code:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

testcount = 0
titles1 = []
bodies1 = []
times1 = []

data = pd.read_csv('URLsALLjun27.csv', header=None)

for url in data[0]:
    try:
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")
        titlemess = soup.find(id="title").get_text() #getting the title
        titlestring = str(titlemess) #make it a string
        title = titlestring.replace("\n", "").replace("\r","")
        titles1.append(title)

        bodymess = soup.find(class_="article").get_text() #get the body with markup
        bodystring = str(bodymess) #make body a string
        body = bodystring.replace("\n", "").replace("\u3000","") #scrub markup
        bodies1.append(body) #add to list for export

        timemess = soup.find('span',{"class":"time"}).get_text()
        timestring = str(timemess)
        time = timestring.replace("\n", "").replace("\r","").replace("年", "-").replace("月","-").replace("日", "")
        times1.append(time)

        testcount = testcount +1 #counter
        print(testcount)
    except Exception as e:
        print(testcount, e)
And here are some of the results I get (those marked 'NoneType' are the ones where the title was successfully pulled but the body/time is empty):
1 http://news.xinhuanet.com/politics/2016-06/27/c_1119122255.htm
2 http://news.xinhuanet.com/politics/2016-05/22/c_129004569.htm 'NoneType' object has no attribute 'get_text'
Any help would be much appreciated! Thanks.
EDIT: I don't have '10 reputation points' so I can't post more links to test but will comment with them if you need more examples of pages.
The issue is that there is no class="article" on the pages that have a picture in them, and the same goes for "class":"time". Consequently, it seems that you'll have to detect whether there's a picture on the page and, if there is, search for the date and text as follows:
For the date, try:
timemess = soup.find(id="pubtime").get_text()
For the body text, it seems that the article is rather just the caption for the picture. Consequently, you could try the following:
bodymess = soup.find('img').findNext().get_text()
In brief, the soup.find('img') finds the image and findNext() goes to the next block which, coincidentally, contains the text.
Thus, in your code, I would do something as follows:
try:
    bodymess = soup.find(class_="article").get_text()
except AttributeError:
    bodymess = soup.find('img').findNext().get_text()

try:
    timemess = soup.find('span',{"class":"time"}).get_text()
except AttributeError:
    timemess = soup.find(id="pubtime").get_text()
As a general workflow for web scraping, I usually open the website in a browser first and locate the elements I need in the page source using the browser's developer tools.
