Using supported sites for video processing - Python

I am trying to change my code to support video processing from multiple sites (YouTube, Vimeo, etc.) using the youtube-dl extractors. I don't want to import youtube-dl (unless necessary); I would prefer to call a function. My understanding is that this: youtube-dl http://vimeo.com/channels/YOUR-CHANNEL is a command line tool. Please help!
import pymongo
import get_media
import configparser as ConfigParser

# shorten list to first 10 items
def shorten_list(mylist):
    return mylist[:10]

def main():
    config = ConfigParser.ConfigParser()
    config.read('settings.cfg')
    youtubedl_filename = config.get('media', 'youtubedl_input')
    print('creating file: %s - to be used as input for youtubedl' % youtubedl_filename)

    db = get_media.connect_to_media_db()
    items = db.raw

    url_list = []
    cursor = items.find()
    records = dict((record['_id'], record) for record in cursor)

    # iterate through records in media items collection
    # if 'Url' field exists and starts with youtube, add url to list
    for item in records:
        item_dict = records[item]
        #print(item_dict)
        if 'Url' in item_dict['Data']:
            url = item_dict['Data']['Url']
            if url.startswith('https://www.youtube.com/'):
                url_list.append(url)

    # for testing purposes
    # shorten list to only download a few files at a time
    url_list = shorten_list(url_list)

    # save list of youtube media file urls
    with open(youtubedl_filename, 'w') as f:
        for url in url_list:
            f.write(url + '\n')

if __name__ == "__main__":
    main()
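One way to avoid running the command line by hand is to drive youtube-dl from the script itself. A minimal sketch of two options, neither of which is part of the code above; it assumes the file written by main() (youtubedl_filename) or the collected url_list is available at the call site:

import subprocess

# Option 1: shell out to the youtube-dl CLI without importing it;
# --batch-file reads one URL per line from the generated file.
subprocess.run(['youtube-dl', '--batch-file', youtubedl_filename], check=True)

# Option 2: if importing it is acceptable, use youtube-dl's embedding API.
# import youtube_dl
# with youtube_dl.YoutubeDL({}) as ydl:
#     ydl.download(url_list)

Both routes go through youtube-dl's extractors, so the same call works for YouTube, Vimeo, and the other supported sites.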

Related

Run multiple terminals from python script and execute commands (Ubuntu)

What I have is a text file containing all items that need to be deleted from an online app. Every item that needs to be deleted has to be sent one at a time. To make the deletion process faster, I divide the items in the text file into multiple text files and run the script in multiple terminals (~130, for the deletion time to be under 30 minutes for ~7000 items).
This is the code of the deletion script:
from fileinput import filename
from WitApiClient import WitApiClient
import os

dirname = os.path.dirname(__file__)
parent_dirname = os.path.dirname(dirname)
token = input("Enter the token")
file_name = os.path.join(parent_dirname, 'data/deletion_pair.txt')

with open(file_name, encoding="utf-8") as file:
    templates = [line.strip() for line in file.readlines()]

for template in templates:
    entity, keyword = template.split(", ")
    print(entity, keyword)
    resp = WitApiClient(token).delete_keyword(entity, keyword)
    print(resp)
So, I divide the items in deletion_pair.txt and run this script multiple times in new terminals (~130 terminals). Is there a way to automate this process or do it in a more efficient manner?
I used threading to run multiple functions simultaneously:
from fileinput import filename
from WitApiClient import WitApiClient
import os
from threading import Thread

dirname = os.path.dirname(__file__)
parent_dirname = os.path.dirname(dirname)
token = input("Enter the token")
file_name = os.path.join(parent_dirname, 'data/deletion_pair.txt')

with open(file_name, encoding="utf-8") as file:
    templates = [line.strip() for line in file.readlines()]

batch_size = 20
chunks = [templates[i: i + batch_size] for i in range(0, len(templates), batch_size)]

def delete_function(templates, token):
    for template in templates:
        entity, keyword = template.split(", ")
        print(entity, keyword)
        resp = WitApiClient(token).delete_keyword(entity, keyword)
        print(resp)

for chunk in chunks:
    thread = Thread(target=delete_function, args=(chunk, token))
    thread.start()
It worked! If anyone has any other solution, please post it, or if the same code can be written more efficiently, please do tell. Thanks.
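A hedged alternative sketch: concurrent.futures gives the same parallelism with a bounded worker pool and waits for every deletion to finish. It assumes the same WitApiClient, token, and templates as in the code above:

from concurrent.futures import ThreadPoolExecutor, as_completed

def delete_pair(template, token):
    # one deletion request per "entity, keyword" line
    entity, keyword = template.split(", ")
    return WitApiClient(token).delete_keyword(entity, keyword)

with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(delete_pair, t, token) for t in templates]
    for future in as_completed(futures):
        print(future.result())

max_workers caps how many requests run at once, so there is no need to split the file or open extra terminals.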

Script to convert multiple URLs or files to individual PDFs and save to a specific location

I have written a script where I take hardcoded URLs as input and also give their filenames hardcoded, whereas I want to take the URLs from a saved text file and save their names automatically, in chronological order, to a specific folder.
My code (works) :
import requests

# input urls and filenames
urls = ['https://www.northwestknowledge.net/metdata/data/pr_1979.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1980.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1981.nc']
fns = [r'C:\Users\HBI8\Downloads\pr_1979.nc',
       r'C:\Users\HBI8\Downloads\pr_1980.nc',
       r'C:\Users\HBI8\Downloads\pr_1981.nc']

# defining the inputs
inputs = zip(urls, fns)

# define download function
def download_url(args):
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print('Failed:', e)

# loop through all inputs and run download function
for i in inputs:
    result = download_url(i)
Trying to fetch the links from text (error in code):
import requests

# getting all URLs from textfile
file = open('C:\\Users\\HBI8\\Downloads\\testing.txt', 'r')
#for each_url in enumerate(f):
list_of_urls = [(line.strip()).split() for line in file]
file.close()

# input urls and filenames
urls = list_of_urls
fns = [r'C:\Users\HBI8\Downloads\pr_1979.nc',
       r'C:\Users\HBI8\Downloads\pr_1980.nc',
       r'C:\Users\HBI8\Downloads\pr_1981.nc']

# defining the inputs
inputs = zip(urls, fns)

# define download function
def download_url(args):
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print('Failed:', e)

# loop through all inputs and run download function
for i in inputs:
    result = download_url(i)
testing.txt has those 3 links pasted in it on each line.
Error :
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1979.nc']"
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1980.nc']"
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1981.nc']"
PS:
I am new to Python and it would be helpful if someone could advise me on how to loop through the files from a text file and save them individually, in chronological order, as opposed to hardcoding the names (as I have done).
When you do list_of_urls = [(line.strip()).split() for line in file], you produce a list of lists. (For each line of the file, you produce the list of URLs on that line, and then you make a list of those lists.)
What you want is a flat list of URLs.
You could do
list_of_urls = [url for line in file for url in (line.strip()).split()]
Or :
list_of_urls = []
for line in file:
    list_of_urls.extend((line.strip()).split())
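Putting that fix together with the download function from the question, a minimal sketch that also derives each local filename from the last path segment of the URL (an assumption, since the question asks for automatic, ordered naming; the paths are the ones from the question):

import os
import requests

download_dir = r'C:\Users\HBI8\Downloads'

with open(r'C:\Users\HBI8\Downloads\testing.txt') as f:
    list_of_urls = [url for line in f for url in line.split()]

# sorting keeps the files in chronological order, since the year is in the name
for url in sorted(list_of_urls):
    fn = os.path.join(download_dir, url.split('/')[-1])
    r = requests.get(url)
    with open(fn, 'wb') as out:
        out.write(r.content)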
By far the simplest method in this simple case is to use an OS command:
go to the work directory C:\Users\HBI8\Downloads
invoke cmd (you can simply put that in the address bar)
write/paste your list using >notepad testing.txt (if you don't already have it there)
Note: .nc HDF files are NOT .pdf
https://www.northwestknowledge.net/metdata/data/pr_1979.nc
https://www.northwestknowledge.net/metdata/data/pr_1980.nc
https://www.northwestknowledge.net/metdata/data/pr_1981.nc
then run
for /F %i in (testing.txt) do curl -O %i
92 seconds later
I have inserted ',' as a delimiter by using the split function.
In order to give an automated file name, I used the index number of the stored list.
Data is saved in the following manner in the txt file:
FileName | Object ID | Base URL
url_file = open('C:\\Users\\HBI8\\Downloads\\testing.txt', 'r')
fns = []
list_of_urls = []
for line in url_file:
    # strip the newline before splitting so the last field stays clean
    stripped_line = line.strip().split(',')
    print(stripped_line)
    list_of_urls.append(stripped_line[2] + stripped_line[1])
    fns.append(stripped_line[0])
url_file.close()

Limit page depth on scrapy crawler

I have a scraper that takes in a list of URLs and scans them for additional links, which it then follows to find anything that looks like an email address (using regex), and returns a list of URLs/email addresses.
I currently have it set up in a Jupyter Notebook, so I can easily view the output while testing. The problem is, it takes forever to run - because I'm not limiting the depth of the scraper (per URL).
Ideally, the scraper would go a max of 2-5 pages deep from each start url.
Here's what I have so far:
First, I'm importing my dependencies:
import os, re, csv, scrapy, logging
import pandas as pd
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from googlesearch import search
from time import sleep
from Urls import URL_List
And I turn off logs and warnings for running Scrapy inside the Jupyter Notebook:
logging.getLogger('scrapy').propagate = False
From there, I extract the URLs from my URL file:
def get_urls():
    urls = URL_List['urls']
    # return the list so get_info can pass it to the spider
    return urls
Then, I set up my spider:
class MailSpider(scrapy.Spider):
    name = 'email'

    def parse(self, response):
I search for links inside URLs.
        links = LxmlLinkExtractor(allow=()).extract_links(response)
Then I take in a list of URLs as input, reading their source code one by one.
        links = [str(link.url) for link in links]
        links.append(str(response.url))
I send links from one parse method to another, setting the callback argument that defines which method the request URL must be sent to.
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)
I then pass the URLs to the parse_link method; this method applies re.findall to look for emails:
    def parse_link(self, response):
        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False)
The google_urls list is passed as an argument when we call the process method to run the spider, and path defines where to save the CSV file.
Then, I save those emails in a CSV file:
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False

def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return
    with open(path, 'wb') as file:
        file.close()
For each website, I make a data frame with columns: [email, link], and append it to a previously created CSV file.
Then, I put it all together:
def get_info(root_file, path):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting urls...')
    google_urls = get_urls()

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=google_urls, path=path)
    process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)

    return df
get_urls()
Lastly, I define a keyword and run the scraper:
keyword = input("Who is the client? ")
df = get_info(f'{keyword}_urls.py', f'{keyword}_emails.csv')
On a list of 100 URLs, I got back 44k results with an email address syntax.
Anyone know how to limit the depth?
Set DEPTH_LIMIT in your Spider like this
class MailSpider(scrapy.Spider):
    name = 'email'

    custom_settings = {
        "DEPTH_LIMIT": 5
    }

    def parse(self, response):
        pass
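Since the question already builds a CrawlerProcess, the same limit can also be passed there as a project-wide setting instead of per-spider custom_settings; a small sketch assuming the get_info setup above:

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'DEPTH_LIMIT': 3,  # follow links at most 3 hops from each start URL
})
process.crawl(MailSpider, start_urls=google_urls, path=path)
process.start()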

Running a background task for checking regular updates in python-flask

I have one web application with a REST API; in that application I have some video files. Now I am creating one intermediate server for it. I am accessing my web content using the API, but here I need to check for updates at regular intervals, and if new updates are there I need to download them.
The files here are video files and I'm using Flask.
I tried this, but I'm not getting it to work:
from flask import Flask, render_template, json, jsonify
import schedule
import time
import requests, json
from pathlib import Path
import multiprocessing
import sys
import wget, os, errno, shutil, config

def get_videos():
    response = requests.get('my api here')
    data = response.json()
    files = list()  # collecting my list of video files
    l = len(data)
    for i in range(l):
        files.append(data[i]['filename'])
    return files

def checkfor_file(myfiles):
    for i in range(len(myfiles)):
        url = 'http://website.com/static/Video/' + myfiles[i]
        # remove the file if it already exists in my folder
        if os.path.exists(url):
            os.remove(url)
        else:
            pass

def get_newfiles(myfiles):
    for i in range(len(myfiles)):
        url = config.videos + myfiles[i]
        filename = wget.download(url)  # downloading video files here

def move_files(myfiles):
    for i in range(len(myfiles)):
        file = myfiles[i]
        shutil.move(config.source_files + file, config.destinatin)  # moving from download folder to another folder

def videos():
    files = set(get_videos())  # keeping only unique files
    myfiles = list(files)
    checkfor_file(myfiles)
    get_newfiles(myfiles)
    move_files(myfiles)

def job():
    videos()

schedule.every(10).minutes.do(job)  # running every ten minutes

while True:
    schedule.run_pending()
    time.sleep(1)

pi = Flask(__name__)

@pi.route('/')
def index():
    response = requests.get('myapi')
    data = response.json()
    return render_template('main.html', data=data)
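As written, the while True loop runs before the Flask app is ever defined, so the server never starts. A hedged sketch of one common workaround, running the schedule loop in a daemon thread next to Flask; job() and the pi app are assumed to be the ones defined above:

from threading import Thread

def run_scheduler():
    schedule.every(10).minutes.do(job)  # re-check the API every ten minutes
    while True:
        schedule.run_pending()
        time.sleep(1)

if __name__ == '__main__':
    Thread(target=run_scheduler, daemon=True).start()  # background updates
    pi.run()  # Flask keeps serving requests in the main thread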

Download a file from the Web when the file name changes every time

I have a task to download the McAfee virus definition file daily from the download site to a local machine. The name of the file changes every day. The path to the download site is "http://download.nai.com/products/licensed/superdat/english/intel/7612xdat.exe".
This ID, 7612, will change every day to something different, hence I can't hard-code it. I have to find a way to either list the file name before providing it as an argument, or something similar.
On the Stack Overflow site I found someone's script which will work for me if someone could advise how to handle the changing file name.
Here is the script that I'm going to use:
def download(url):
    """Copy the contents of a file from a given URL
    to a local file.
    """
    import urllib
    webFile = urllib.urlopen(url)
    localFile = open(url.split('/')[-1], 'w')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()

if __name__ == '__main__':
    import sys
    if len(sys.argv) == 2:
        try:
            download(sys.argv[1])
        except IOError:
            print 'Filename not found.'
    else:
        import os
        print 'usage: %s http://server.com/path/to/filename' % os.path.basename(sys.argv[0])
Could someone advise me?
Thanks in advance.
It's a two-step process. First, scrape the index page looking for files. Second, grab the latest and download it.
import urllib
import lxml.html
import os
import shutil

# index page
pattern_files_url = "http://download.nai.com/products/licensed/superdat/english/intel"
# relative url references based here
pattern_files_base = '/'.join(pattern_files_url.split('/')[:-1])

# scrape the index page for the latest file list
doc = lxml.html.parse(pattern_files_url)
pattern_files = [ref for ref in doc.xpath("//a/@href") if ref.endswith('xdat.exe')]

if pattern_files:
    pattern_files.sort()
    newest = pattern_files[-1]
    local_name = newest.split('/')[-1]

    # grab it if we don't already have it
    if not os.path.exists(local_name):
        url = pattern_files_base + '/' + newest
        print("downloading %s to %s" % (url, local_name))
        remote = urllib.urlopen(url)
        with open(local_name, 'wb') as local:
            shutil.copyfileobj(remote, local, length=65536)
