Scrapping with request_HTML

Scrapping with request_HTML - python

I am trying to scrape this website down below: https://www.kayak-polo.info/kphistorique.php?Group=CE&lang=en
down below is my code. I am trying to actually get the text inside the caption element (as shown on the screenshot). However I believe I cannot find the tag because it has no closing tag and that's why I think it's not returning the text.
For clarity purposes. I already have the tournament name. But I would like the category too which is "men" in the screenshot below
def grab_ranking():
tournament_list = grab_tournament_metadata()
for item in tournament_list:
url_to_scrape = f'https://www.kayak-polo.info/kphistorique.php?Group={item[1]}&lang=en'
response = session.get(url_to_scrape)
print(url_to_scrape)
season_data = response.html.find('body > div.container-fluid > div > article')
for season in season_data:
season_year_raw = find_extract(season, selector='h3 > div.col-md-6.col-sm-6')
season_year = season_year_raw.replace('Season ', '')
print(season_year)
# TODO Figure out how to deal with the n1h and n2h and other french national categories being togheter in one place.
category_table = season.find('div.col-md-3.col-sm-6.col-xs-12', first=True)
umbrella_competition_name = find_extract(category_table, selector='caption')
competition_name = umbrella_competition_name + " " + season_year
I tried multiple things, such as trying to get the HTML of that element and then wanting to a do .split on certain things. However it seems when I do .html I get the entire page's html which doesn't help my case.
I also tried .attrs in the hopes of finding the right tag, but it returns nothing.

Here is one possible solution:
from time import time
from typing import Generator
from requests_html import HTMLSession
from requests_html import HTMLResponse
def get_competition_types(html: HTMLResponse) -> Generator[None, None, str]:
return (i.attrs.get('value') for i in html.html.find('select[name="Group"] option'))
def get_competition_urls(url: str, comp_types: Generator[None, None, str]) -> Generator[None, None, str]:
return (f'{url}?Group={_type}&lang=en' for _type in comp_types)
def get_data(competition_url: str, session: HTMLSession) -> None:
response = session.get(competition_url)
print(competition_url)
article_data = response.html.find('article.tab-pane')
for article in article_data:
for data in (i.text.split('\n') for i in article.find('div caption')):
if len(data) > 1:
print(f"{data[0]} {article.find('h3')[0].text.split()[1]} {data[1]}\n")
else:
print(f"{data[0]} {article.find('h3')[0].text.split()[1]}\n")
session = HTMLSession()
url = 'https://www.kayak-polo.info/kphistorique.php'
html = session.get(url)
start = time()
competition_types = get_competition_types(html)
competition_urls = get_competition_urls(url, competition_types)
for url in competition_urls:
get_data(url, session)
print(f"Total time: {round(time()-start, 3)}")
The performance of this solution(processing all 4960 elements) is 55 sec
Output:
ECA European Championships - Catania (ITA) 2021 Men
ECA European Championships - Catania (ITA) 2021 Women
ECA European Championships - Catania (ITA) 2021 U21 Men
Solution based on ThreadPoolExecutor:
from time import time
from itertools import repeat
from typing import Generator
from requests_html import HTMLSession
from requests_html import HTMLResponse
from concurrent.futures import ThreadPoolExecutor
def get_competition_types(html: HTMLResponse) -> Generator[None, None, str]:
return (i.attrs.get('value') for i in html.html.find('select[name="Group"] option'))
def get_competition_urls(url: str, comp_types: Generator[None, None, str]) -> Generator[None, None, str]:
return (f'{url}?Group={_type}&lang=en' for _type in comp_types)
def get_data(competition_url: str, session: HTMLSession) -> None:
response = session.get(competition_url)
print(competition_url)
article_data = response.html.find('article.tab-pane')
for article in article_data:
for data in (i.text.split('\n') for i in article.find('div caption')):
if len(data) > 1:
print(f"{data[0]} {article.find('h3')[0].text.split()[1]} {data[1]}\n")
else:
print(f"{data[0]} {article.find('h3')[0].text.split()[1]}\n")
session = HTMLSession()
url = 'https://www.kayak-polo.info/kphistorique.php'
html = session.get(url)
start = time()
competition_types = get_competition_types(html)
competition_urls = get_competition_urls(url, competition_types)
with ThreadPoolExecutor() as executor:
executor.map(get_data, list(competition_urls), repeat(session))
print(f"Total time: {round(time()-start, 3)}")
The performance of this solution(processing all 4960 elements) is ~35 sec
And of course, since in this solution we work with threads all data will be mixed
Output:
European Championships - Sheffield (GBR) 1993 Women
Coupe d'Europe des Nations - Strasbourg (FRA) 1990 Men
European Club Championship - Duisbourg (GER) 2021 Men

Related

How to optimize my performances, using asynchronous python code

I'm looking to optimize my code in order to process the info faster. First time playing with asynchronous requests. And also still new to Python. I hope my code makes sense.
I'm using FastAPI as a framework. And aiohttp to send my requests.
Right now, I'm only interested in getting the total of results per word searched. I will be dumping the json into a DB afterwards.
My code is sending requests to the public crossref API (crossref)
As an example, I'm searching for the terms from 2022-06-02 to 2022-06-03 (inclusive). The terms being searched are: 'paper' (3146 results), 'ammonium' (1430 results) and 'bleach' (23 results). Example:
https://api.crossref.org/works?rows=1000&sort=created&mailto=youremail#domain.com&query=paper&filter=from-index-date:2022-06-02,until-index-date:2022-06-03&cursor=*
This returns 3146 rows. I need to search for only one term at a time. I did not try to split it per day as well to see if it's faster.
There is also a recursive context in this. This is where I feel like I'm mishandling the asynchronous concept. Here is why I need a recursive call.
Deep paging requests
Deep paging using cursors can be used to iterate over large result sets, without any limits on their size.
To use deep paging make a query as normal, but include the cursor parameter with a value of *, for example:
https://api.crossref.org/works?rows=1000&sort=created&mailto=youremail#domain.com&query=ammonium&filter=from-index-date:2022-06-02,until-index-date:2022-06-03&cursor=*
A next-cursor field will be provided in the JSON response. To get the next page of results, pass the value of next-cursor as the cursor parameter. For example:
https://api.crossref.org/works?rows=1000&sort=created&mailto=youremail#domain.com&query=ammonium&filter=from-index-date:2022-06-02,until-index-date:2022-06-03&cursor=<value of next-cursor parameter>
Advice from the CrossRef doc
Clients should check the number of returned items. If the number of returned items is equal to the number of expected rows then the end of the result set has been reached. Using next-cursor beyond this point will result in responses with an empty items list.
My processing time is still through the roof with just 3 words (and 7 requests), it's over 15sec. I'm trying to turn that down to under 5 seconds if possible? Using postman, the longest request took about 4 seconds to come back
This is what I have so far if you want to try it out.
schema.py
class CrossRefSearchRequest(BaseModel):
keywords: List[str]
date_from: Optional[datetime] = None
date_to: Optional[datetime] = None
controler.py
import time
from fastapi import FastAPI, APIRouter, Request
app = FastAPI(title="CrossRef API", openapi_url=f"{settings.API_V1_STR}/openapi.json")
api_router = APIRouter()
service = CrossRefService()
#api_router.post("/search", status_code=201)
async def search_keywords(*, search_args: CrossRefSearchRequest) -> dict:
fixed_search_args = {
"sort": "created",
"rows": "1000",
"cursor": "*"
}
results = await service.cross_ref_request(search_args, **fixed_search_args)
return {k: len(v) for k, v in results.items()}
# sets the header X-Process-Time, in order to have the time for each request
#app.middleware("http")
async def add_process_time_header(request: Request, call_next):
start_time = time.time()
response = await call_next(request)
process_time = time.time() - start_time
response.headers["X-Process-Time"] = str(process_time)
return response
app.include_router(api_router)
if __name__ == "__main__":
# Use this for debugging purposes only
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8001, log_level="debug")
service.py
from datetime import datetime, timedelta
def _setup_date_default(date_from_req: datetime, date_to_req: datetime):
yesterday = datetime.utcnow()- timedelta(days=1)
date_from = yesterday if date_from_req is None else date_from_req
date_to = yesterday if date_to_req is None else date_to_req
return date_from.strftime(DATE_FORMAT_CROSS_REF), date_to.strftime(DATE_FORMAT_CROSS_REF)
class CrossRefService:
def __init__(self):
self.client = CrossRefClient()
# my recursive call for the next cursor
async def _send_client_request(self ,final_result: dict[str, list[str]], keywords: [str], date_from: str, date_to: str, **kwargs):
json_responses = await self.client.cross_ref_request_date_range(keywords, date_from, date_to, **kwargs)
for json_response in json_responses:
message = json_response.get('message', {})
keyword = message.get('query').get('search-terms')
next_cursor = message.get('next-cursor')
total_results = message.get('total-results')
search_results = message.get('items', [{}]) if total_results > 0 else []
if final_result[keyword] is None:
final_result[keyword] = search_results
else:
final_result[keyword].extend(search_results)
if total_results > int(kwargs['rows']) and len(search_results) == int(kwargs['rows']):
kwargs['cursor'] = next_cursor
await self._send_client_request(final_result, [keyword], date_from, date_to, **kwargs)
async def cross_ref_request(self, request: CrossRefSearchRequest, **kwargs) -> dict[str, list[str]]:
date_from, date_to = _setup_date(request.date_from, request.date_to)
results: dict[str, list[str]] = dict.fromkeys(request.keywords)
await self._send_client_request(results, request.keywords, date_from, date_to, **kwargs)
return results
client.py
import asyncio
from aiohttp import ClientSession
async def _send_request_task(session: ClientSession, url: str):
try:
async with session.get(url) as response:
await response.read()
return response
# exception handler to come
except Exception as e:
print(f"exception for {url}")
print(str(e))
class CrossRefClient:
base_url = "https://api.crossref.org/works?" \
"query={}&" \
"filter=from-index-date:{},until-index-date:{}&" \
"sort={}&" \
"rows={}&" \
"cursor={}"
def __init__(self) -> None:
self.headers = {
"User-Agent": f"my_app/v0.1 (example.com/; mailto:youremail#domain.com) using FastAPI"
}
async def cross_ref_request_date_range(
self, keywords: [str], date_from: str, date_to: str, **kwargs
) -> list:
async with ClientSession(headers=self.headers) as session:
tasks = [
asyncio.create_task(
_send_request_task(session, self.base_url.format(
keyword, date_from, date_to, kwargs['sort'], kwargs['rows'], kwargs['cursor']
)),
name=TASK_NAME_BASE.format(keyword, date_from, date_to)
)
for keyword in keywords
]
responses = await asyncio.gather(*tasks)
return [await response.json() for response in responses]
How to optimize this better and use asynchronous calls better? Also this recursive loop might not be the best way to do it neither. Any ideas on that too?
I implemented a solution for synchronous calls and it's even slower. So I guess I'm not too far away.
Thanks!

Your code looks fine and you are not misusing the asynchronous concept.
Perhaps you are limited by the number of client session, which is limited to 100 connections at a time. Take a look at https://docs.aiohttp.org/en/stable/client_reference.html#aiohttp.BaseConnector
Maybe the server upstream is just answering slowly to a massive amount of requests.

Flask loop takes long time to complete

I have this loop in my app.py. For some reason it extends the load time by over 3 seconds. Are there any solutions?
import dateutil.parser as dp
# Converts date from ISO-8601 string to formatted string and returns it
def dateConvert(date):
return dp.parse(date).strftime("%H:%M # %e/%b/%y")
def nameFromID(userID):
if userID is None:
return 'Unknown'
else:
response = requests.get("https://example2.org/" + str(userID), headers=headers)
return response.json()['firstName'] + ' ' + response.json()['lastName']
logs = []
response = requests.get("https://example.org", headers=headers)
for response in response.json():
logs.append([nameFromID(response['member']), dateConvert(response['createdAt'])])

It extends the load time by over 3 seconds because it does a lot of unnecessary work, that's why.
You're not using requests Sessions. Each request will require creating and tearing down an HTTPS connection. That's slow.
You're doing another HTTPS request for each name conversion. (See above.)
You're parsing the JSON you get in that function twice.
Whatever dp.parse() is (dateutil?), it's probably doing a lot of extra work parsing from a free-form string. If you know the input format, use strptime.
Here's a rework that should be significantly faster. Please see the TODO points first, of course.
Also, if you are at liberty to knowing the member id -> name mapping doesn't change, you can make name_cache a suitably named global variable too (but remember it may be persisted between requests).
import datetime
import requests
INPUT_DATE_FORMAT = "TODO_FILL_ME_IN" # TODO: FILL ME IN.
def dateConvert(date: str):
return datetime.datetime.strptime(date, INPUT_DATE_FORMAT).strftime(
"%H:%M # %e/%b/%y"
)
def nameFromID(sess: requests.Session, userID):
if userID is None:
return "Unknown"
response = sess.get(f"https://example2.org/{userID}")
response.raise_for_status()
data = response.json()
return "{firstName} {lastName}".format_map(data)
def do_thing():
headers = {} # TODO: fill me in
name_cache = {}
with requests.Session() as sess:
sess.headers.update(headers)
logs = []
response = sess.get("https://example.org")
for response in response.json():
member_id = response["member"]
name = name_cache.get(member_id)
if not name:
name = name_cache[member_id] = nameFromID(sess, member_id)
logs.append([name, dateConvert(response["createdAt"])])

How to conect data scraping with google sheets

i have this code from a web scraping
from smart_sensor_client.smart_sensor_client import SmartSensorClient
import json
import requests
import pprint
DEFAULT_SETTINGS_FILE = 'settings.yaml'
def run_task(settings_file=DEFAULT_SETTINGS_FILE) -> bool:
# Create the client instance
client = SmartSensorClient(settings_file=settings_file)
# Authenticate
if not client.authenticate():
print('Authentication FAILED')
return False
# Get list of plants
plants = client.get_plant_list()
# Iterate the plant list and print all assets therein
for plant in plants:
# Get list of assets
response = client.get_asset_list(organization_id=client.organization_id)
if len(response) == 0:
print('No assets in this plant')
else:
for asset in response:
print(asset["assetName"],':',asset["lastSyncTimeStamp"])
...
And the following answer. json that i filter with the information i am interested in.
Ventilador FCH 11 : 2020-06-11T09:48:45Z
VTL SVA1 : 2020-06-11T10:20:43Z
Dardelet PV-1 : 2020-06-11T09:58:14Z
CANDLOT 1 (MOTOR N°2) : 2020-06-11T10:37:39Z
PC N°1 S1/2A : 2020-06-11T10:57:34Z
VTL SVA2 : 2020-06-11T11:31:08Z
Ventilador FCH 6 : 2020-06-11T11:43:28Z
Vibrotamiz Tolva Tampon : 2020-06-11T11:44:43Z
Ventilador FCH 10 : 2020-06-11T11:08:03Z
Task SUCCESS
I would like to paste this on a google sheet but i dont know how if its possible. Thank you!

Why are the videos on the most_recent standard feed so out of date?

I'm trying to grab the most recently uploaded videos. There's a standard feed for that - it's called most_recent. I don't have any problems grabbing the feed, but when I look at the entries inside, they're all half a year old, which is hardly recent.
Here's the code I'm using:
import requests
import os.path as P
import sys
from lxml import etree
import datetime
namespaces = {"a": "http://www.w3.org/2005/Atom", "yt": "http://gdata.youtube.com/schemas/2007"}
fmt = "%Y-%m-%dT%H:%M:%S.000Z"
class VideoEntry:
"""Data holder for the video."""
def __init__(self, node):
self.entry_id = node.find("./a:id", namespaces=namespaces).text
published = node.find("./a:published", namespaces=namespaces).text
self.published = datetime.datetime.strptime(published, fmt)
def __str__(self):
return "VideoEntry[id='%s']" % self.entry_id
def paginate(xml):
root = etree.fromstring(xml)
next_page = root.find("./a:link[#rel='next']", namespaces=namespaces)
if next_page == None:
next_link = None
else:
next_link = next_page.get("href")
entries = [VideoEntry(e) for e in root.xpath("/a:feed/a:entry", namespaces=namespaces)]
return entries, next_link
prefix = "https://gdata.youtube.com/feeds/api/standardfeeds/"
standard_feeds = set("top_rated top_favorites most_shared most_popular most_recent most_discussed most_responded recently_featured on_the_web most_viewed".split(" "))
feed_name = sys.argv[1]
assert feed_name in standard_feeds
feed_url = prefix + feed_name
all_video_ids = []
while feed_url is not None:
r = requests.get(feed_url)
if r.status_code != 200:
break
text = r.text.encode("utf-8")
video_ids, feed_url = paginate(text)
all_video_ids += video_ids
all_upload_times = [e.published for e in all_video_ids]
print min(all_upload_times), max(all_upload_times)
As you can see, it prints the min and max timestamps for the entire feed.
misha#misha-antec$ python get_standard_feed.py most_recent
2013-02-02 14:40:02 2013-02-02 14:54:00
misha#misha-antec$ python get_standard_feed.py top_rated
2006-04-06 21:30:53 2013-07-28 22:22:38
I've glanced through the downloaded XML and it appears to match the output. Am I doing something wrong?
Also, on an unrelated note, the feeds I'm getting are all about 100 entries (I'm paginating through them 25 at a time). Is this normal? I expected the feeds to be a bit bigger.

Regarding the "Most-Recent-Feed"-Topic: There is a ticket for this one here. Unfortunately, the YouTube-API-Teams doesn't respond or solved the problem so far.
Regarding the number of entries: That depends on the type of standardfeed, but for the most-recent-Feed it´s usually around 100.
Note: You could try using the "orderby=published" parameter to get recents videos, although I don´t know how "recent" they are.
https://gdata.youtube.com/feeds/api/videos?orderby=published&prettyprint=True
You can combine this query with the "category"-parameter or other ones (region-specific queries - like for the standard feeds - are not possible, afaik).

Moving from multiprocessing to threading

In my project, I use the multiprocessing class in order to run tasks parallely. I want to use threading instead, as it has better performance (my tasks are TCP/IP bound, not CPU or I/O bound).
multiprocessing has wonderful functions, as Pool.imap_unordered and Pool.map_async, that does not exist in the threading class.
What is the right way to convert my code to use threading instead? The documentation introduces the multiprocessing.dummy class, that is a wrapper for the threading class. However that raises lots of errors (at least on python 2.7.3):
pool = multiprocessing.Pool(processes)
File "C:\python27\lib\multiprocessing\dummy\__init__.py", line 150, in Pool
return ThreadPool(processes, initializer, initargs)
File "C:\python27\lib\multiprocessing\pool.py", line 685, in __init__
Pool.__init__(self, processes, initializer, initargs)
File "C:\python27\lib\multiprocessing\pool.py", line 136, in __init__
self._repopulate_pool()
File "C:\python27\lib\multiprocessing\pool.py", line 199, in _repopulate_pool
w.start()
File "C:\python27\lib\multiprocessing\dummy\__init__.py", line 73, in start
self._parent._children[self] = None
AttributeError: '_DummyThread' object has no attribute '_children'
Edit: What actually happens is that I have a GUI that runs a different thread (to prevent the GUI from gettint stuck). That thread runs the specific search function that has the ThreadPool that fails.
Edit 2: The bugfix was fixed and will be included in future releases.
Great to see a crasher fixed!
import urllib2, htmllib, formatter
import multiprocessing.dummy as multiprocessing
import xml.dom.minidom
import os
import string, random
from urlparse import parse_qs, urlparse
from useful_util import retry
import config
from logger import log
class LinksExtractor(htmllib.HTMLParser):
def __init__(self, formatter):
htmllib.HTMLParser.__init__(self, formatter)
self.links = []
self.ignoredSites = config.WebParser_ignoredSites
def start_a(self, attrs):
for attr in attrs:
if attr[0] == "href" and attr[1].endswith(".mp3"):
if not filter(lambda x: (x in attr[1]), self.ignoredSites):
self.links.append(attr[1])
def get_links(self):
return self.links
def GetLinks(url, returnMetaUrlObj=False):
'''
Function gather links from a url.
#param url: Url Address.
#param returnMetaUrlObj: If true, returns a MetaUrl Object list.
Else, returns a string list. Default is False.
#return links: Look up.
'''
htmlparser = LinksExtractor(formatter.NullFormatter())
try:
data = urllib2.urlopen(url)
except (urllib2.HTTPError, urllib2.URLError) as e:
log.error(e)
return []
htmlparser.feed(data.read())
htmlparser.close()
links = list(set(htmlparser.get_links()))
if returnMetaUrlObj:
links = map(MetaUrl, links)
return links
def isAscii(s):
"Function checks is the string is ascii."
try:
s.decode('ascii')
except (UnicodeEncodeError, UnicodeDecodeError):
return False
return True
#retry(Exception, logger=log)
def parse(song, source):
'''
Function parses the source search page and returns the .mp3 links in it.
#param song: Search string.
#param source: Search website source. Value can be dilandau, mp3skull, youtube, seekasong.
#return links: .mp3 url links.
'''
source = source.lower()
if source == "dilandau":
return parse_dilandau(song)
elif source == "mp3skull":
return parse_Mp3skull(song)
elif source == "SeekASong":
return parse_SeekASong(song)
elif source == "youtube":
return parse_Youtube(song)
log.error('no source "%s". (from parse function in WebParser)')
return []
def parse_dilandau(song, pages=1):
"Function connects to Dilandau.eu and returns the .mp3 links in it"
if not isAscii(song): # Dilandau doesn't like unicode.
log.warning("Song is not ASCII. Skipping on dilandau")
return []
links = []
song = urllib2.quote(song.encode("utf8"))
for i in range(pages):
url = 'http://en.dilandau.eu/download_music/%s-%d.html' % (song.replace('-','').replace(' ','-').replace('--','-').lower(),i+1)
log.debug("[Dilandau] Parsing %s... " % url)
links.extend(GetLinks(url, returnMetaUrlObj=True))
log.debug("[Dilandau] found %d links" % len(links))
for metaUrl in links:
metaUrl.source = "Dilandau"
return links
def parse_Mp3skull(song, pages=1):
"Function connects to mp3skull.com and returns the .mp3 links in it"
links = []
song = urllib2.quote(song.encode("utf8"))
for i in range(pages):
# http://mp3skull.com/mp3/how_i_met_your_mother.html
url = 'http://mp3skull.com/mp3/%s.html' % (song.replace('-','').replace(' ','_').replace('__','_').lower())
log.debug("[Mp3skull] Parsing %s... " % url)
links.extend(GetLinks(url, returnMetaUrlObj=True))
log.debug("[Mp3skull] found %d links" % len(links))
for metaUrl in links:
metaUrl.source = "Mp3skull"
return links
def parse_SeekASong(song):
"Function connects to seekasong.com and returns the .mp3 links in it"
song = urllib2.quote(song.encode("utf8"))
url = 'http://www.seekasong.com/mp3/%s.html' % (song.replace('-','').replace(' ','_').replace('__','_').lower())
log.debug("[SeekASong] Parsing %s... " % url)
links = GetLinks(url, returnMetaUrlObj=True)
for metaUrl in links:
metaUrl.source = "SeekASong"
log.debug("[SeekASong] found %d links" % len(links))
return links
def parse_Youtube(song, amount=10):
'''
Function searches a song in youtube.com and returns the clips in it using Youtube API.
#param song: The search string.
#param amount: Amount of clips to obtain.
#return links: List of links.
'''
"Function connects to youtube.com and returns the .mp3 links in it"
song = urllib2.quote(song.encode("utf8"))
url = r"http://gdata.youtube.com/feeds/api/videos?q=%s&max-results=%d&v=2" % (song.replace(' ', '+'), amount)
urlObj = urllib2.urlopen(url, timeout=4)
data = urlObj.read()
videos = xml.dom.minidom.parseString(data).getElementsByTagName('feed')[0].getElementsByTagName('entry')
links = []
for video in videos:
youtube_watchurl = video.getElementsByTagName('link')[0].attributes.item(0).value
links.append(get_youtube_hightest_quality_link(youtube_watchurl))
return links
def get_youtube_hightest_quality_link(youtube_watchurl, priority=config.youtube_quality_priority):
'''
Function returns the highest quality link for a specific youtube clip.
#param youtube_watchurl: The Youtube Watch Url.
#param priority: A list represents the qualities priority.
#return MetaUrlObj: MetaUrl Object.
'''
video_id = parse_qs(urlparse(youtube_watchurl).query)['v'][0]
youtube_embedded_watchurl = "http://www.youtube.com/embed/%s?autoplay=1" % video_id
d = get_youtube_dl_links(video_id)
for x in priority:
if x in d.keys():
return MetaUrl(d[x][0], 'youtube', d['VideoName'], x, youtube_embedded_watchurl)
log.error("No Youtube link has been found in get_youtube_hightest_quality_link.")
return ""
#retry(Exception, logger=log)
def get_youtube_dl_links(video_id):
'''
Function gets the download links for a youtube clip.
This function parses the get_video_info format of youtube.
#param video_id: Youtube Video ID.
#return d: A dictonary of qualities as keys and urls as values.
'''
d = {}
url = r"http://www.youtube.com/get_video_info?video_id=%s&el=vevo" % video_id
urlObj = urllib2.urlopen(url, timeout=12)
data = urlObj.read()
data = urllib2.unquote(urllib2.unquote(urllib2.unquote(data)))
data = data.replace(',url', '\nurl')
data = data.split('\n')
for line in data:
if 'timedtext' in line or 'status=fail' in line or '<AdBreaks>' in line:
continue
try:
url = line.split('&quality=')[0].split('url=')[1]
quality = line.split('&quality=')[1].split('&')[0]
except:
continue
if quality in d:
d[quality].append(url)
else:
d[quality] = [url]
try:
videoName = "|".join(data).split('&title=')[1].split('&')[0]
except Exception, e:
log.error("Could not parse VideoName out of get_video_info (%s)" % str(e))
videoName = ""
videoName = unicode(videoName, 'utf-8')
d['VideoName'] = videoName.replace('+',' ').replace('--','-')
return d
class NextList(object):
"A list with a 'next' method."
def __init__(self, l):
self.l = l
self.next_index = 0
def next(self):
if self.next_index < len(self.l):
value = self.l[self.next_index]
self.next_index += 1
return value
else:
return None
def isEOF(self):
" Checks if the list has reached the end "
return (self.next_index >= len(self.l))
class MetaUrl(object):
"a url strecture data with many metadata"
def __init__(self, url, source="", videoName="", quality="", youtube_watchurl=""):
self.url = str(url)
self.source = source
self.videoName = videoName # Youtube Links Only
self.quality = quality # Youtube Links Onlys
self.youtube_watchurl = youtube_watchurl # Youtube Links Onlys
def __repr__(self):
return "<MetaUrl '%s' | %s>" % (self.url, self.source)
def search(song, n, processes=config.search_processes):
'''
Function searches song and returns n valid .mp3 links.
#param song: Search string.
#param n: Number of songs.
#param processes: Number of processes to launch in the subprocessing pool.
'''
linksFromSources = []
pool = multiprocessing.Pool(processes)
args = [(song, source) for source in config.search_sources]
imapObj = pool.imap_unordered(_parse_star, args)
for i in range(len(args)):
linksFromSources.append(NextList(imapObj.next(15)))
pool.terminate()
links = []
next_source = 0
while len(links) < n and not all(map(lambda x: x.isEOF(), linksFromSources)):
nextItem = linksFromSources[next_source].next()
if nextItem:
log.debug("added song %.80s from source ID %d (%s)" % (nextItem.url.split('/')[-1], next_source, nextItem.source))
links.append(nextItem)
if len(linksFromSources) == next_source+1:
next_source = 0
else:
next_source += 1
return links
def _parse_star(args):
return parse(*args)

I can't reproduce your problem on my machine. What's in your processes variable? Is it an int?
Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing.dummy as multiprocessing
>>> pool = multiprocessing.Pool(5)
>>> pool
<multiprocessing.pool.ThreadPool object at 0x00C7DF90>
>>>
----Edit----
You probably also want to double check if you had messed up your standard library, try an clean install of python 2.7.3 in a different folder.
----Edit 2----
You can quickly patch it like this:
import multiprocessing.dummy
import weakref
import threading
class Worker(threading.Thread):
def __init__(self):
threading.Thread.__init__(self)
def run(self):
poll = multiprocessing.dummy.Pool(5)
print str(poll)
w = Worker()
w._children = weakref.WeakKeyDictionary()
w.start()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scrapping with request_HTML - python

Related

How to optimize my performances, using asynchronous python code

Flask loop takes long time to complete

How to conect data scraping with google sheets

Why are the videos on the most_recent standard feed so out of date?

Moving from multiprocessing to threading

Categories

Resources