Wikipedia Extractor as a parser for Wikipedia Data Dump File - python

I've tried to convert a bz2 dump to text with Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor). I downloaded a Wikipedia dump with the bz2 extension, then ran this on the command line:
WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles.xml.bz2
This produced the extracted output files as expected (screenshot omitted).
However, the instructions go on to state:
In order to combine the whole extracted text into a single file one can issue:
> find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml
> rm -rf extracted
I get the following error:
File not found - '*bz2'
What can I do?

Please go through this related question; it should help: Error using the 'find' command to generate a collection file on opencv
The commands mentioned on the WikiExtractor page are for Unix/Linux systems and won't work on Windows.
The find command on Windows works differently from the one on Unix/Linux.
The extraction step works fine on both Windows and Linux as long as you run it with the python prefix:
python WikiExtractor.py -cb 250K -o extracted your_bz2_file
You will see an extracted folder created in the same directory as your script.
After that, the find command is supposed to work like this, but only on Linux:
find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml
It finds everything in the extracted folder that matches *bz2, executes the bzip2 command on each of those files, and puts the result in the text.xml file.
Also, if you run the bzip2 --help command (bzip2 is what the find command above invokes), you will see that it won't work on Windows; on Linux you get the following output.
gaurishankarbadola@ubuntu:~$ bzip2 --help
bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
usage: bzip2 [flags and input files in any order]
-h --help print this message
-d --decompress force decompression
-z --compress force compression
-k --keep keep (don't delete) input files
-f --force overwrite existing output files
-t --test test compressed file integrity
-c --stdout output to standard out
-q --quiet suppress noncritical error messages
-v --verbose be verbose (a 2nd -v gives more)
-L --license display software version & license
-V --version display software version & license
-s --small use less memory (at most 2500k)
-1 .. -9 set block size to 100k .. 900k
--fast alias for -1
--best alias for -9
If invoked as `bzip2', default action is to compress.
as `bunzip2', default action is to decompress.
as `bzcat', default action is to decompress to stdout.
If no file names are given, bzip2 compresses or decompresses
from standard input to standard output. You can combine
short flags, so `-v -4' means the same as -v4 or -4v, &c.
As mentioned above, bzip2's default action is to compress, so use bzcat for decompression.
The modified command, which works only on Linux, looks like this:
find extracted -name '*bz2' -exec bzcat -c {} \; > text.xml
It works on my Ubuntu system.
EDIT :
For Windows :
BEFORE YOU TRY ANYTHING, PLEASE GO THROUGH THE INSTRUCTIONS FIRST
1. Create a separate folder and put the files in it: WikiExtractor.py and itwiki-latest-pages-articles1.xml-p1p277091.bz2 (in my case, since that is a small file I could find).
2. Open a command prompt in the current directory and run the following command to extract all the files.
python WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles1.xml-p1p277091.bz2
It will take time depending on the file size; afterwards the directory will contain an extracted folder (screenshot omitted).
CAUTION: If you already have the extracted folder, move it into the current directory so you don't have to run the extraction again.
3. Copy the code below and save it as bz2_Extractor.py.
import argparse
import bz2
import logging
from datetime import datetime
from os import listdir
from os.path import isfile, join, isdir

FORMAT = '%(levelname)s: %(message)s'
logging.basicConfig(format=FORMAT)
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def get_all_files_recursively(root):
    # Collect the plain files in root, then recurse into each subdirectory.
    files = [join(root, f) for f in listdir(root) if isfile(join(root, f))]
    dirs = [d for d in listdir(root) if isdir(join(root, d))]
    for d in dirs:
        files.extend(get_all_files_recursively(join(root, d)))
    return files


def bzip_decompress(list_of_files, output_path):
    start_time = datetime.now()
    with open(output_path, 'w+', encoding="utf8") as output_file:
        for file in list_of_files:
            # bz2.open in 'rt' mode decompresses and decodes on the fly.
            with bz2.open(file, 'rt', encoding="utf8") as bz2_file:
                logger.info(f"Reading/Writing file ---> {file}")
                output_file.write(bz2_file.read())
                output_file.write('\n')
    stop_time = datetime.now()
    print(f"Total time taken to write out {len(list_of_files)} files = "
          f"{(stop_time - start_time).total_seconds()}")


def main():
    parser = argparse.ArgumentParser(description="Input fields")
    parser.add_argument("-r", required=True)
    parser.add_argument("-n", required=False)
    parser.add_argument("-o", required=True)
    args = parser.parse_args()

    all_files = get_all_files_recursively(args.r)
    # If -n is omitted, write out every file found under the root directory.
    count = int(args.n) if args.n else len(all_files)
    bzip_decompress(all_files[:count], args.o)


if __name__ == "__main__":
    main()
The current directory now contains bz2_Extractor.py alongside the extracted folder (screenshot omitted).
4. Open a command prompt in the current directory and run the following command.
Please read what each flag does before running it.
python bz2_Extractor.py -r extracted -o output.txt -n 10
-r : The root directory containing the bz2 files.
-o : Output file name.
-n : Number of files to write out. [If not provided, all files under the root directory are written.]
CAUTION: Your file is gigabytes in size and holds more than half a million articles. If you try to put all of that into a single file with the command above, I'm not sure your system can handle it, and even if it does, the output extracted from a 2.8 GB archive will be so large that Windows probably won't be able to open it directly.
So my suggestion would be to process 10000 files at a time.
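For example, a small wrapper along these lines would write one output file per 10,000 inputs; this is a sketch that reuses the two helpers from bz2_Extractor.py above, and the batch size and output naming are my own choices.
# batch_extract.py - a hypothetical wrapper around bz2_Extractor.py
from bz2_Extractor import get_all_files_recursively, bzip_decompress

BATCH_SIZE = 10000  # assumed batch size; adjust to taste

all_files = get_all_files_recursively("extracted")
for i in range(0, len(all_files), BATCH_SIZE):
    # Each batch goes to its own numbered file, e.g. output_0.txt.
    bzip_decompress(all_files[i:i + BATCH_SIZE],
                    f"output_{i // BATCH_SIZE}.txt")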
Let me know if this works for you.
PS: For the command above, the script logs each file as it is read and prints the total time taken at the end (screenshot omitted).

Related

Running a bash command with input/output parameters but providing them as variables not in the disk

I want to run a command in bash, say ./command -i INPUT_FILE -o OUTPUT_FILE, which takes two parameters: 1) a path to an input file and 2) a path to an output file. It processes the INPUT_FILE and writes the results to the OUTPUT_FILE. Is there any way I can provide INPUT_FILE and OUTPUT_FILE as variables, so that instead of files stored on disk, they are fed/stored as variables in memory? Note that the output is written to the provided file path, not stdout (otherwise it would be obvious). At its core, the command opens an ofstream in C++ to write the results to the OUTPUT_FILE.
I searched and found a solution for the input part, which works, but not for the output part. Here is the suggested solution:
./command -i <(cat <<< "$INPUT_VARIABLE") -o OUTPUT_FILE
Is there any suggestion for the output part? My end goal is to use this in Python, but it seems the subprocess module doesn't have this feature.
You can use tmpfs for that; it is a virtual filesystem located in RAM.
For your current user you can use the /run/user/$UID/ directory, so something like
./command -i <(cat <<< "$INPUT_VARIABLE") -o /run/user/$UID/OUTPUT_FILE
can totally work. Just rm that file if you don't need it after your script has stopped. This file won't survive a reboot, obviously.
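Since the end goal is Python, here is a minimal sketch of the same tmpfs idea; ./command, the INPUT_FILE/OUTPUT_FILE names, and the variable contents are placeholders from the question, and /run/user/$UID is Linux-specific.
import os
import subprocess

input_variable = "example input data"  # stand-in for the in-memory input

uid = os.getuid()                          # Linux-only
in_path = f"/run/user/{uid}/INPUT_FILE"    # both paths live on tmpfs (RAM)
out_path = f"/run/user/{uid}/OUTPUT_FILE"

with open(in_path, "w") as f:
    f.write(input_variable)

subprocess.check_call(["./command", "-i", in_path, "-o", out_path])

with open(out_path) as f:
    output_variable = f.read()

os.remove(in_path)   # tmpfs files still occupy RAM until removed
os.remove(out_path)
print(output_variable)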

sh script to find files (only xml) in a directory, execute a python script (.py) on every file found, and source the changed files into a target directory

I have a python script (as a .py file) that modifies xml file headers. I want to find all the files with extension .xml in a directory, copy them to a different directory, and run the python script on all the *.xml files that are copied. I would like to source all the changed files through the python script into a different directory.
Currently I am at this step:
find . -type f -name "*.xml" -exec cp {} tempdir \;
I am new to shell and I do not know how to execute the python program on each file that's sourced to tempdir and output to a new targetdir.
My python command to execute looks something like this: python "xmlchange.py" -i "tempdir" -t "targetdir"
I am thinking of a foreach or a for loop, whichever is appropriate in this context.
Any inputs appreciated. TIA!
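For what it's worth, a minimal Python sketch of the whole pipeline; glob, shutil, and subprocess are standard library, and it assumes xmlchange.py accepts directories via -i/-t exactly as the command above suggests.
import glob
import shutil
import subprocess

# Copy every .xml file found under the current directory into tempdir
# (tempdir and targetdir are assumed to exist already) ...
for xml_file in glob.glob("**/*.xml", recursive=True):
    shutil.copy(xml_file, "tempdir")

# ... then run the conversion script once over the whole directory.
subprocess.check_call(["python", "xmlchange.py", "-i", "tempdir", "-t", "targetdir"])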

Translate youtube_dl command line into python .py file

It's about how to get a list of URLs using youtube_dl. Although I have been trying all day long, I couldn't work it out, so I would like to ask for help translating the following command lines (some of which are Linux-specific) into Python code, I mean in a .py file.
To get JSON data, use command line: youtube-dl -j --flat-playlist 'https://www.youtube.com/c/3blue1brown/videos'
To parse it, use this command line on Linux: youtube-dl -j --flat-playlist 'https://www.youtube.com/c/3blue1brown/videos' | jq -r '.id' | sed 's_^_https://youtube.com/v/_'
The codes above are from: https://web.archive.org/web/20180309061900/https://archive.zhimingwang.org/blog/2014-11-05-list-youtube-playlist-with-youtube-dl.html (The youtube link there was removed so I replaced the youtube link above)
You can run the same command inside a .py file using os as follows:
import os
os.system("youtube-dl -j --flat-playlist 'https://www.youtube.com/c/3blue1brown/videos'")
You can pipe the output of the above commands to a file and then process the JSON file in Python.
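Alternatively, a sketch that captures the output directly and reproduces the jq/sed step natively; subprocess.run with capture_output needs Python 3.7+, and the URL is the one from the question.
import json
import subprocess

# Each output line of youtube-dl -j --flat-playlist is one JSON object.
result = subprocess.run(
    ["youtube-dl", "-j", "--flat-playlist",
     "https://www.youtube.com/c/3blue1brown/videos"],
    capture_output=True, text=True, check=True)

# Equivalent of: jq -r '.id' | sed 's_^_https://youtube.com/v/_'
urls = [f"https://youtube.com/v/{json.loads(line)['id']}"
        for line in result.stdout.splitlines()]
print("\n".join(urls))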

subprocess cp leaves some files empty

I'm trying to copy some files from one directory to another. I want all files in one directory to end up in the root of another directory.
This command does exactly what I want when I run it in the terminal:
cp -rv ./src/CopyPasteIntoBuildDir/* ./build-root/src/
This line of python, however, copies most of the files just like the above command, but it leaves some of the new files empty. Specifically, files in subdirectories are left empty.
subprocess.check_call("cp -rv ./src/CopyPasteIntoBuildDir/* ./build-root/src/", shell=True)
It creates the files if they're not there, and it truncates them if they are.
What is going on?
Assuming that you've decided to use cp rather than native Python operations (for the latter, see the shutil sketch further down):
This code will be much more reliable if you write it to not invoke any shell whatsoever. To avoid the need for /* on the source (and its side effects, i.e. refusal to copy when the expanded list of names exceeds ARG_MAX, the combined environment and command-line size limit), use . as the last element of the name of the directory whose contents are to be copied, instead of passing a wildcard that needs to be expanded by a shell.
subprocess.check_call(["cp", "-R", "--", '%s/.' % src, dest])
The use of cp -R rather than cp -rv is on account of -R, but not -r, being POSIX-standardized (and thus portable across all compliant UNIX-like platforms).
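If native Python operations are an option after all, here is a sketch of the same copy without any subprocess; shutil.copytree with dirs_exist_ok requires Python 3.8+, and the paths are the ones from the question.
import shutil

# Recursively copy the directory's contents into the destination,
# merging with anything already present there (Python 3.8+).
shutil.copytree("./src/CopyPasteIntoBuildDir", "./build-root/src/",
                dirs_exist_ok=True)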
Demonstrating In Action (copy/pasteable code)
tempdir=$(mktemp -d -t testdir.XXXXXX)
trap 'rm -rf "$tempdir"' EXIT
cd "$tempdir"
mkdir -p ./src/CopyPasteIntoBuildDir/subdir-1 ./build-root/src/
touch ./src/CopyPasteIntoBuildDir/file-1
touch ./src/CopyPasteIntoBuildDir/subdir-1/file-2
script='
import sys, shutil, subprocess
src = sys.argv[1]
dest = sys.argv[2]
subprocess.check_call(["cp", "-R", "--", "%s/." % src, dest])
'
python -c "$script" ./src/CopyPasteIntoBuildDir ./build-root/src/
find ./build-root -type f -print
rm -rf "$tempdir"
...emits output akin to:
./build-root/src/file-1
./build-root/src/subdir-1/file-2
...showing that content was correctly recursively copied with no prefix.
So apparently this is a problem with sh. Using bash instead worked.
subprocess.check_call("cp -rv ./src/CopyPasteIntoBuildDir/* ./build-root/src/", shell=True, executable="/bin/bash")
EDIT: See accepted answer!

Re-write write-protected file

Every 4 hours files are updated with new information if needed - i.e. if any new information has been processed for that particular file (files correspond to people).
I'm running this command to convert my .stp files (those being updated every 4 hours) to .xml files.
rule convert_waveform_stp:
    input: '/data01/stpfiles/{file}.Stp'
    output: '/data01/workspace/bm_data/xmlfiles/{file}.xml'
    shell:
        '''
        mono /data01/workspace/bm_software/convert.exe {input} -o {output}
        '''
My script is in Snakemake (Python-based), but I'm running convert.exe through a shell command.
I'm getting an error on the files already processed by convert.exe. They are saved by convert.exe as write-protected, and there is no option to bypass this within the executable itself.
Error Message:
ProtectedOutputException in line 14 of /home/Snakefile:
Write-protected output files for rule convert_waveform_stp:
/data01/workspace/bm_data/xmlfiles/PID_1234567.xml
I'd still like them to be write-protected but would also like to be able to update them as needed.
Is there something I can add to my shell command to write over the write-protected files?
Take a look at the os standard library package:
https://docs.python.org/3.5/library/os.html?highlight=chmod#os.chmod
It allows for chmod with the following caveat:
Although Windows supports chmod(), you can only set the file’s read-only flag with it (via the stat.S_IWRITE and stat.S_IREAD constants or a corresponding integer value). All other bits are ignored.
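A minimal sketch of how that could look around the conversion step; stat.S_IWRITE is the owner-write bit, and the path is the one from the error message above.
import os
import stat

path = "/data01/workspace/bm_data/xmlfiles/PID_1234567.xml"

# Add the owner-write bit so the file can be overwritten ...
os.chmod(path, os.stat(path).st_mode | stat.S_IWRITE)

# ... re-run the conversion here ...

# ... then restore read-only protection afterwards.
os.chmod(path, os.stat(path).st_mode & ~stat.S_IWRITE)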
@VickiT05, I thought you wanted it in Python. Try this:
Check the original file permission with
ls -l [your file name]
stat -c %a [your file name]
Change the protection with
chmod 777 [your file name]
then change back to the original file mode (or whatever mode you want):
chmod [original file protection mode] [your file name]
