Translate a youtube_dl command line into a Python .py file

This is about getting a list of URLs using youtube_dl. Although I have been trying all day, I couldn't work it out, so I would like to ask for help translating the following command lines (the second one is Linux-specific) into Python code, i.e. into a .py file.
To get the JSON data, use this command line: youtube-dl -j --flat-playlist 'https://www.youtube.com/c/3blue1brown/videos'
To parse it, use this command line on Linux: youtube-dl -j --flat-playlist 'https://www.youtube.com/c/3blue1brown/videos' | jq -r '.id' | sed 's_^_https://youtube.com/v/_'
The commands above are from: https://web.archive.org/web/20180309061900/https://archive.zhimingwang.org/blog/2014-11-05-list-youtube-playlist-with-youtube-dl.html (the YouTube link there was removed, so I substituted the link above).

You can run the same command from inside a .py file using the os module, as follows:
import os
os.system("youtube-dl -j --flat-playlist 'https://www.youtube.com/c/3blue1brown/videos'")
You can pipe the output of the above command to a file and then process the JSON file in Python.
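If you want to stay entirely inside Python, a sketch using subprocess (so the output can be captured instead of piped to a file) that also reproduces the jq '.id' / sed step could look like this:

import json
import subprocess

# Run youtube-dl and capture its output: one JSON object per line.
result = subprocess.run(
    ["youtube-dl", "-j", "--flat-playlist",
     "https://www.youtube.com/c/3blue1brown/videos"],
    capture_output=True, text=True, check=True)

# Equivalent of: jq -r '.id' | sed 's_^_https://youtube.com/v/_'
urls = ["https://youtube.com/v/" + json.loads(line)["id"]
        for line in result.stdout.splitlines()]
print("\n".join(urls))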

Related

Wikipedia Extractor as a parser for Wikipedia Data Dump File

I've tried to convert bz2 to text with Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor). I downloaded a Wikipedia dump with the bz2 extension, then on the command line ran this:
WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles.xml.bz2
This gave me the expected extracted output. However, further down it is stated:
In order to combine the whole extracted text into a single file one can issue:
find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml
rm -rf extracted
I get the following error:
File not found - '*bz2'
What can I do?
Please go through this; it should help:
Error using the 'find' command to generate a collection file on opencv
The commands mentioned on the WikiExtractor page are for Unix/Linux systems and won't work on Windows.
The find command you ran on Windows works in a different way than the one on Unix/Linux.
The extraction part works fine in both Windows and Linux environments, as long as you run it with the python prefix:
python WikiExtractor.py -cb 250K -o extracted your_bz2_file
You will see an extracted folder created in the same directory as your script.
After that, the find command is supposed to work like this, but only on Linux:
find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml
It finds everything in the extracted folder that matches *bz2, executes the bzip2 command on those files, and puts the result in the text.xml file.
Also, if you run the bzip2 -help command (the one find is supposed to invoke above), you will see that it won't work on Windows, while on Linux you get the following output:
gaurishankarbadola@ubuntu:~$ bzip2 -help
bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
usage: bzip2 [flags and input files in any order]
-h --help print this message
-d --decompress force decompression
-z --compress force compression
-k --keep keep (don't delete) input files
-f --force overwrite existing output files
-t --test test compressed file integrity
-c --stdout output to standard out
-q --quiet suppress noncritical error messages
-v --verbose be verbose (a 2nd -v gives more)
-L --license display software version & license
-V --version display software version & license
-s --small use less memory (at most 2500k)
-1 .. -9 set block size to 100k .. 900k
--fast alias for -1
--best alias for -9
If invoked as `bzip2', default action is to compress.
as `bunzip2', default action is to decompress.
as `bzcat', default action is to decompress to stdout.
If no file names are given, bzip2 compresses or decompresses
from standard input to standard output. You can combine
short flags, so `-v -4' means the same as -v4 or -4v, &c.
As shown above, bzip2's default action is to compress, so use bzcat for decompression.
The modified command, which again works only on Linux, looks like this:
find extracted -name '*bz2' -exec bzcat -c {} \; > text.xml
It works on my Ubuntu system.
EDIT:
For Windows:
BEFORE YOU TRY ANYTHING, PLEASE GO THROUGH THE INSTRUCTIONS FIRST
1. Create a separate folder and put the files in it. Files --> WikiExtractor.py and itwiki-latest-pages-articles1.xml-p1p277091.bz2 (in my case, since it is a small file I could find).
2. Open a command prompt in the current directory and run the following command to extract all the files.
python WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles1.xml-p1p277091.bz2
It will take time depending on the file size, but afterwards an extracted folder will appear in the directory.
CAUTION: If you already have the extracted folder, move it to the current directory so the layout matches and you don't have to do the extraction again.
Copy and paste the code below and save it in a file named bz2_Extractor.py.
import argparse
import bz2
import logging
from datetime import datetime
from os import listdir
from os.path import isfile, join, isdir

FORMAT = '%(levelname)s: %(message)s'
logging.basicConfig(format=FORMAT)
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def get_all_files_recursively(root):
    # Collect files in this directory, then recurse into subdirectories.
    files = [join(root, f) for f in listdir(root) if isfile(join(root, f))]
    dirs = [d for d in listdir(root) if isdir(join(root, d))]
    for d in dirs:
        files.extend(get_all_files_recursively(join(root, d)))
    return files


def bzip_decompress(list_of_files, output_file):
    start_time = datetime.now()
    with open(output_file, 'w+', encoding="utf8") as out:
        for file in list_of_files:
            # bz2.open in 'rt' mode decompresses and decodes on the fly.
            with bz2.open(file, 'rt', encoding="utf8") as bz2_file:
                logger.info(f"Reading/Writing file ---> {file}")
                out.writelines(bz2_file.read())
                out.write('\n')
    stop_time = datetime.now()
    print(f"Total time taken to write out {len(list_of_files)} files = "
          f"{(stop_time - start_time).total_seconds()}")


def main():
    parser = argparse.ArgumentParser(description="Input fields")
    parser.add_argument("-r", required=True)
    parser.add_argument("-n", required=False)
    parser.add_argument("-o", required=True)
    args = parser.parse_args()
    all_files = get_all_files_recursively(args.r)
    # If -n is not provided, write out all the files.
    n = int(args.n) if args.n else len(all_files)
    bzip_decompress(all_files[:n], args.o)


if __name__ == "__main__":
    main()
Now the current directory also contains bz2_Extractor.py next to the extracted folder.
Open a cmd window in the current directory and run the following command.
Please read what each argument does before running it:
python bz2_Extractor.py -r extracted -o output.txt -n 10
-r : The root directory you have bz2 files in.
-o : Output file name
-n : Number of files to write out. [If not provided, it writes out all the files inside root directory]
CAUTION: I can see that your file is gigabytes in size and has more than half a million articles. If you try to put all of that into a single file with the command above, I'm not sure your system can survive it, and even if it does, the output file, extracted from a 2.8 GB archive, would be so large that I don't think Windows could open it directly.
So my suggestion would be to process 10,000 files at a time.
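For example, reusing the command above (the output name here is only an illustration; note that -n as written always takes files from the start of the list, so processing later batches would need an extra offset option):

python bz2_Extractor.py -r extracted -o output_batch1.txt -n 10000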
Let me know if this works for you.
PS: For the command above, the log output lists each file as it is read and written (Reading/Writing file ---> ...), followed by the total time taken.

Python: use curl with subprocess, write output into file

If I use the following command in Git Bash, it works fine: the output from curl is written into the file output.txt.
curl -k --silent "https://gitlab.myurl.com/api/v4/groups?page=1&per_page=1&simple=yes&private_token=mytoken&all?page=1&per_page=1" > output.txt
Python Code:
import subprocess, shlex
command = shlex.split("curl -k --silent https://gitlab.myurl.com/api/v4/groups?page=1&per_page=1&simple=yes&private_token=mytoken&all?page=1&per_page=1 > output.txt")
subprocess.Popen(command)
The Python code writes nothing into my file output.txt.
How can I write to output.txt, or get the output directly in Python?
You cannot use shell redirection (the > part) with subprocess this way, because redirection is a shell feature. Use check_output:
import subprocess
command = ["curl", "-k", "--silent", "https://gitlab.myurl.com/api/v4/groups?page=1&per_page=1&simple=yes&private_token=mytoken&all?page=1&per_page=1"]
output = subprocess.check_output(command)
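If you still want the result in output.txt, you can write the captured bytes out yourself; a minimal sketch:

import subprocess

command = ["curl", "-k", "--silent", "https://gitlab.myurl.com/api/v4/groups?page=1&per_page=1&simple=yes&private_token=mytoken&all?page=1&per_page=1"]
output = subprocess.check_output(command)

# check_output returns bytes, so write the file in binary mode.
with open("output.txt", "wb") as f:
    f.write(output)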
You could also invoke curl in a way that writes to output.txt itself:
curl -k --silent "https://gitlab.myurl.com/api/v4/groups?page=1&per_page=1&simple=yes&private_token=mytoken&all?page=1&per_page=1" --output output.txt
Also, consider that output.txt might not be saved in the directory you expect, so I would advise naming it in another, unique way, updating the Linux locate database (see the updatedb command), and then searching for the file with locate.
PS: all this makes sense when you need the output written to output.txt (your question allows for this situation too, so I hope it helps).

How can I store the result of my python script into a new csv file?

Basically I have a bash script that wgets the HTML of a page, converts it to XHTML using TagSoup, and then extracts elements via the Python script. Nothing seems to work; I just need the output of that Python script to be stored in a new $d.csv file.
#!/bin/bash
while sleep 10s
do
d=`date '+%Y-%m-%d-%H-%M-%S'`;
wget -O $d.html http://wsj.com/mdc/public/page/2_3021-activnyse-actives.html
java -jar tagsoup-1.2.1.jar --files $d.html
python3 idk.py $d.xhtml
done
From what I understand, you just have to redirect the script's output on that command line:
python3 idk.py $d.xhtml > $d.csv

Translate wget from Bash to Python with flags

I have the following bash wget command
wget -rHncp --cut-dirs=1 -A .txt -e robots=off -l1 -i ./ids.txt -B 'http://archive.org/download/'
I would like to be able to run the exact same command in Python. However, this Python file will have to run on a Windows machine without bash or the wget command. Would I be able to use the wget Python package for this? I've tried, but I haven't figured out how to pass flags to it yet.
Edit: March 31, 2017
The purpose of the command is to download multiple files from archive.org's website. Given a file type (in this instance .txt) and a file with a list of identifiers (in this instance the file is called ids.txt), this command will download every txt file associated with the given identifiers.
One such identifier is aeneid_391 and the resulting file from this identifier is virgiletext95anide10_djvu.txt.
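Since no answer is recorded here, a pure-Python sketch of the same idea might look like this. It assumes the archive.org download pages are simple HTML directory listings and uses the third-party requests library rather than the wget package (which, as far as I know, only wraps single-file downloads and takes no such flags):

import re
import requests

BASE = "http://archive.org/download/"

def download_txt_files(ids_file="ids.txt"):
    # One identifier per line, as in the -i ids.txt input file.
    with open(ids_file) as f:
        identifiers = [line.strip() for line in f if line.strip()]
    for identifier in identifiers:
        listing_url = BASE + identifier + "/"
        listing = requests.get(listing_url).text
        # Assumption: the listing is plain HTML with href attributes;
        # collect only links ending in .txt (the -A .txt filter).
        for name in re.findall(r'href="([^"]+\.txt)"', listing):
            response = requests.get(listing_url + name)
            response.raise_for_status()
            with open(name.split("/")[-1], "wb") as out:
                out.write(response.content)

if __name__ == "__main__":
    download_txt_files()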

How to copy a GIF to the clipboard on a Mac using Python

I would like to write a Python function that, given the URL, obtains the GIF file and stores it in the Mac clipboard. Could anyone please help me figure out how to copy a GIF into the clipboard?
(Let's say the URL is very simple and direct to the GIF I want, "a.com/this.gif".)
You can use the following AppleScript to copy a GIF to the clipboard:
osascript -e 'set the clipboard to (read (POSIX file "/Users/auser/yourgif.gif") as GIF picture)'
You can point the clipboard directly at the file instead of reading its contents:
osascript -e "set the clipboard to (POSIX file \"$f\")"
You can combine this in a script to automatically do the work:
#!/usr/bin/env bash
set -euo pipefail
f="$(mktemp).gif"
curl -o "$f" --fail "${1?Usage: $0 <url-of-gif-file>}"
osascript -e "set the clipboard to (POSIX file \"$f\")"
# Don't delete the tempfile: it holds the GIF. It will be cleaned up on reboot.
Inspired by https://stackoverflow.com/a/51447031/4359699
This would work for any file type by the way, not just GIF.
In Mac OS X, you can use the pbcopy command to copy content to the system clipboard. To achieve this from your Python script, I imagine you would want to fetch the resource (using urlopen, say) and then use the subprocess module to invoke pbcopy with the content of the GIF file you fetched.
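Putting that together in Python, a minimal sketch (assuming macOS with osascript available, and reusing the POSIX file trick from the earlier answer rather than pbcopy, which is geared toward plain text):

import os
import subprocess
import tempfile
import urllib.request

def copy_gif_to_clipboard(url):
    # Download the GIF to a temp file; keep the file around, because
    # the clipboard will hold a reference to it on disk.
    fd, path = tempfile.mkstemp(suffix=".gif")
    os.close(fd)  # we only need the path; urlretrieve reopens it
    urllib.request.urlretrieve(url, path)
    # Point the macOS clipboard at the downloaded file via AppleScript.
    subprocess.run(
        ["osascript", "-e",
         'set the clipboard to (POSIX file "%s")' % path],
        check=True)

copy_gif_to_clipboard("http://a.com/this.gif")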
