Search large tar.gz file for keywords, copy and delete - python

What is the best way, with large log tar.gz files (some are 20 GB), to open and search them for a keyword, copy the found files to a directory, and then delete each file so it doesn't consume disk space?
I have some code below. It was working, but then it suddenly stopped extracting files for some reason. If I remove the -O option from tar, it extracts files again.
mkdir -p found
tar tf "$1" | while read -r FILE
do
    if tar xf "$1" "$FILE" -O | grep -l "$2"; then
        echo "found pattern in: $FILE"
        cp "$FILE" "found/$(basename "$FILE")"
        rm -f "$FILE"
    fi
done
$1 is the tar.gz file, $2 is the keyword
UPDATE
I'm doing the below, which works, but one small file I have contains 2 million plus compressed files, so it will take hours to look through them all. Is there a Python solution or similar that can do it faster?
#!/bin/sh
# tarmatch.sh
if grep -l "$1"; then
    echo "Found keyword in ${TAR_FILENAME}"
    tar -zxvf "$2" "${TAR_FILENAME}"
else
    echo "Not found in ${TAR_FILENAME}"
fi
true
tar -zxf 20130619.tar.gz --to-command "./tarmatch.sh '#gmail' 20130619.tar.gz "
UPDATE 2
I'm using Python now and the speed seems to have increased: it was doing about 4000 records a second, while the bash version was doing about 5. I'm not that strong in Python, so this code could probably be optimized; please let me know if it can be.
import tarfile
import time
import os
import ntpath, sys

if len(sys.argv) < 3:
    print "Please provide the tar.gz file and keyword to search on"
    print "USAGE: tarfind.py example.tar.gz keyword"
    sys.exit()

t = tarfile.open(sys.argv[1], 'r:gz')
cnt = 0
foundCnt = 0
now = time.time()
directory = 'found/'
if not os.path.exists(directory):
    os.makedirs(directory)

for tar_info in t:
    cnt += 1
    if tar_info.isdir():
        continue
    if cnt % 1000 == 0:
        print "Processed " + str(cnt) + " files"
    f = t.extractfile(tar_info)
    if f is None:  # skip links and other non-regular members
        continue
    data = f.read()  # read once instead of read/seek/read
    if sys.argv[2] in data:
        foundCnt += 1
        newFile = open(directory + ntpath.basename(tar_info.name), 'wb')  # binary mode
        newFile.write(data)
        newFile.close()
        print "found in file " + tar_info.name

future = time.time()
timeTaken = future - now
print "Found " + str(foundCnt) + " records"
print "Time taken " + str(int(timeTaken / 60)) + " mins " + str(int(timeTaken % 60)) + " seconds"
print str(int(cnt / timeTaken)) + " records per second"
t.close()
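One thing to watch with f.read() is that it loads each member wholly into RAM, which can hurt with very large members. A possible refinement (a Python 3 sketch, not the asker's code; the chunk size is arbitrary) scans each member a block at a time, keeping a small overlap so a keyword straddling two blocks is still found:

```python
import tarfile

def search_tgz(archive_path, keyword, chunk_size=1 << 20):
    """Yield names of regular members containing `keyword`,
    reading each member in chunks to bound memory use."""
    kw = keyword.encode() if isinstance(keyword, str) else keyword
    overlap = len(kw) - 1
    with tarfile.open(archive_path, 'r:gz') as tar:
        for info in tar:
            if not info.isfile():
                continue
            f = tar.extractfile(info)
            if f is None:  # links and other non-regular members
                continue
            tail = b''
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                # check tail+chunk so matches across a boundary are found
                if kw in tail + chunk:
                    yield info.name
                    break
                tail = chunk[-overlap:] if overlap else b''
```

Opening with mode 'r|gz' (streaming) instead of 'r:gz' can also help on huge archives, at the cost of losing random access to members.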

If you are searching the files for a keyword and extracting only those that match, then since your archives are huge it can take a long time if the keyword is somewhere in the middle.
The best advice I can give is probably to use a powerful combination of an inverted-index lookup tool such as Solr (based on the Lucene index) and Apache Tika, a content analysis toolkit.
Using these tools you can index the tar.gz files, and when you search for a keyword, the relevant documents containing it will be returned.

If the file is really 20 GB it will take very long to grep in any case. The only advice I can give is to use zgrep, which saves you from having to explicitly uncompress the archive. Note that on a tarball this greps the decompressed stream as a whole, so it tells you whether the pattern occurs but not which member file contains it.
zgrep PATTERN your.tgz


How to cut videos automatically using Python with FFmpeg?

I'm trying to cut videos automatically using Python with FFmpeg so I don't have to type out the command every time I want a new video cut. I'm not really sure what I'm doing wrong, though; here's the code:
import os
path = r'C:/Users/user/Desktop/folder'
for filename in os.listdir(path):
    if (filename.endswith(".mp4")):
        command = "ffmpeg - i" + filename + "-c copy -map 0 -segment_time 00:00:06 -f segment -reset_timestamps 1 output%03d.mp4"
        os.system(command)
    else:
        continue
Typos
First of all, there's a typo in the syntax: you wrote - i while the correct syntax is -i.
The syntax " + filename + " is correct; however, there must be a space before and after the filename:
command = "ffmpeg -i " + filename + " -c copy -map 0 -segment_time 00:00:06 -f segment -reset_timestamps 1 output%03d.mp4"
otherwise you would get an error like
Unrecognized option 'iC:\Users\user\Desktop\folder\filename.mp4-c'.
Error splitting the argument list: Option not found
Solution
I assumed every other argument is correct. For me it didn't work at first; I just had to add
-fflags +discardcorrupt
but maybe that's just my file.
Here's the corrected code; however, I recommend that you read this.
Note: I used os.path.join() to save the output file in that same directory, because my Python file was in a different one.
import os
path = r'C:\Users\user\Desktop\folder'
for filename in os.listdir(path):
    if filename.endswith(".mp4"):
        command = "ffmpeg -fflags +discardcorrupt -i " + os.path.join(path, filename) + " -c copy -map 0 -segment_time 00:00:03 -f segment -reset_timestamps 1 " + os.path.join(path, "output%03d.mp4")
        os.system(command)
    else:
        continue
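String concatenation with os.system also breaks as soon as a filename contains spaces. Passing an argument list to subprocess avoids both the quoting problem and the original `- i` typo; here is a Python 3 sketch (paths and the segment time are illustrative):

```python
import os
import subprocess

def build_segment_cmd(input_path, output_dir, segment_time="00:00:06"):
    """Build the ffmpeg argument list; each element is passed to ffmpeg
    as-is, so spaces in paths need no quoting."""
    return [
        "ffmpeg", "-i", input_path,
        "-c", "copy", "-map", "0",
        "-segment_time", segment_time,
        "-f", "segment", "-reset_timestamps", "1",
        os.path.join(output_dir, "output%03d.mp4"),
    ]

def segment_all(path):
    """Segment every .mp4 in `path`, writing the pieces next to it."""
    for filename in os.listdir(path):
        if filename.endswith(".mp4"):
            cmd = build_segment_cmd(os.path.join(path, filename), path)
            subprocess.run(cmd, check=True)  # raises if ffmpeg fails
```

check=True turns a non-zero ffmpeg exit status into an exception instead of silently continuing.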
You can cut a video file with ffmpeg.
ffmpeg -i 1.mp4 -vcodec copy -acodec copy -ss 00:03:20 -t 00:10:24 out.mp4
-ss : start time
-to : end time
-t  : duration

Copy files from one folder to another with matching names in .txt file

I want to copy files from one big folder to another folder based on matching file names in a .txt file.
My list.txt file contains file names:
S001
S002
S003
and another big folder contains many files, for example S001, S002, S003, S004, S005.
I only want to copy the files from this big folder that match the file names in my list.txt file.
I have tried Bash and Python - not working.
for /f %%f in list.txt do robocopy SourceFolder/ DestinationFolder/ %%f
is not working either.
My logic in Python is not working:
import os
import shutil

def main():
    destination = "DestinationFolder/copy"
    source = "SourceFolder/MyBigData"
    with open(source, "r") as lines:
        filenames_to_copy = set(line.rstrip() for line in lines)
        for filenames in os.walk(destination):
            for filename in filenames:
                if filename in filenames_to_copy:
                    shutil.copy(source, destination)
Any answers in Bash, Python or R?
Thanks
I think the issue with your Python code is that with os.walk() your filename will be a list every time, which will not be found in your filenames_to_copy.
I'd recommend trying os.listdir() instead, as it returns a list of the names of files/folders as strings - easier to compare against your filenames_to_copy.
One other note: perhaps you want to do os.listdir() (or os.walk()) on the source instead of the destination. Currently you only copy files from the source to the destination if the file already exists in the destination.
os.walk() will return a tuple of three elements: the name of the current directory inspected, the list of folders in it, and the list of files in it. You are only interested in the latter. So you should iterate with:
for _, _, filenames in os.walk(destination):
As pointed out by JezMonkey, os.listdir() is easier to use, as it returns the list of files and folders in the requested directory. However, you lose the recursive search that os.walk() enables. If all your files are in the same folder and not hidden in subfolders, you'd rather use os.listdir().
The second problem I see in you code is that you copy source when I think you want to copy os.path.join(source, filename).
Can you publish the exact error you have with the Python script so that we can better help you.
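To make the os.walk() tuple structure concrete, here is a minimal Python 3 illustration against a throwaway directory (the tree and file names are made up):

```python
import os
import tempfile

# Build a tiny tree: root/ with one file, and root/sub/ with another.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub"))
open(os.path.join(root, "S001"), "w").close()
open(os.path.join(root, "sub", "S002"), "w").close()

collected = []
for dirpath, dirnames, filenames in os.walk(root):
    # dirpath:   the directory currently being visited
    # dirnames:  its immediate sub-directories
    # filenames: its plain files (strings, not paths)
    collected.extend(filenames)

print(sorted(collected))  # both files, found recursively
```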
UPDATE
You actually don't need to list all the files in the source folder. With os.path.exists you can check that a file exists and copy it if it does.
import os
import shutil

def main():
    destination = "DestinationFolder/copy"
    source = "SourceFolder/MyBigData"
    # adapt the name of the file to open to your exact location
    with open("list.txt", "r") as lines:
        filenames_to_copy = set(line.rstrip() for line in lines)
    for filename in filenames_to_copy:
        source_path = os.path.join(source, filename)
        if os.path.exists(source_path):
            print("copying {} to {}".format(source_path, destination))
            shutil.copy(source_path, destination)

if __name__ == "__main__":
    main()
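As a quick end-to-end sanity check of this approach, here is a self-contained Python 3 variant (the function name and paths are illustrative, not from the question) that copies only the listed names and reports what it copied:

```python
import os
import shutil

def copy_listed(list_file, source, destination):
    """Copy every file named in list_file from source to destination,
    silently skipping names that don't exist in source.
    Returns the list of names actually copied."""
    copied = []
    with open(list_file) as lines:
        wanted = set(line.strip() for line in lines if line.strip())
    os.makedirs(destination, exist_ok=True)
    for name in wanted:
        src = os.path.join(source, name)
        if os.path.exists(src):
            shutil.copy(src, destination)
            copied.append(name)
    return copied
```

Names in the list that have no match in the source folder are simply ignored, which matches the question's "only copy the files that match" requirement.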
Thank you #PySaad and #Guillaume for your contributions; my script is working now. I added:
if os.path.exists(copy_to):
    shutil.rmtree(copy_to)
shutil.copytree(file_to_copy, copy_to)
to the script and it's working like a charm :)
Thanks a lot for your help!
You can try the code below:
import glob
import os
import shutil

def copy_(source_file, dest_dir):
    # create the destination directory if needed, then copy the file into it
    if not os.path.exists(dest_dir):
        os.mkdir(dest_dir)
    shutil.copy2(source_file, dest_dir)

big_dir = "~\big_dir"
copy_to = "~\copy_to"
copy_ref = "~\copy_ref.txt"
big_dir_files = [os.path.basename(f) for f in glob.glob(os.path.join(big_dir, '*'))]
print 'big_dir', big_dir_files  # all filenames from the big directory

with open(copy_ref, "r") as lines:
    filenames_to_copy = set(line.rstrip() for line in lines)
print filenames_to_copy  # the filenames you have in the .txt file

for file in filenames_to_copy:
    if file in big_dir_files:  # match filenames from ref.txt against the big dir
        file_to_copy = os.path.join(big_dir, file)
        copy_(file_to_copy, copy_to)
Reference:
https://docs.python.org/3/library/glob.html
If you want an overkill bash script/tool, check https://github.com/jordyjwilliams/copy_filenames_from_txt out.
This can be invoked by ./copy_filenames_from_txt.sh -f ./text_file_with_names -d search_dir -o output_dir
The script can be summarised (without error handling/args etc.) to:
cat $SEARCH_FILE | while read i; do
    find $SEARCH_DIR -name "*$i*" | while read file; do
        cp -r $file $OUTPUT_DIR/
    done
done
The second while loop here is not even strictly necessary; one could just pass the find results straight to cp (e.g. as a list of files when there are multiple matches). I just wanted each file handled separately for the tool I was writing...
To make this a bit nicer and add error handling...
The active part of my tool in the repo is (ignore the color markers):
cat $SEARCH_FILE | while read i; do
    # Handle non-matching file
    if [[ -z $(find $SEARCH_DIR -name "*$i*") && $VERBOSE ]]; then
        echo -e "❌: ${RED}$i${RESET} ||${YELLOW} No matching file found${RESET}"
        continue
    fi
    # Handle case of more than one matching file
    find $SEARCH_DIR -name "*$i*" | while read file; do
        if [[ -n "$file" ]]; then # Check file matching
            FILENAME=$(basename ${file}); DIRNAME=$(dirname "${file}")
            if [[ -f $OUTPUT_DIR/"$FILENAME" ]]; then
                if [[ $DIRNAME/ != $OUTPUT_DIR && $VERBOSE ]]; then
                    echo -e "📁: ${CYAN}$FILENAME ${GREEN}already exists${RESET} in ${MAGENTA}$OUTPUT_DIR${RESET}"
                fi
            else
                if [[ $VERBOSE ]]; then
                    echo -e "✅: ${GREEN}$i${RESET} >> ${CYAN}$file${RESET} >> ${MAGENTA}$OUTPUT_DIR${RESET}"
                fi
                cp -r $file $OUTPUT_DIR/
            fi
        fi
    done
done

Modify script to read from file to try URL in list

I could use some help. I've got this script:
#!/usr/bin/python
import time
import os
showname="Offsides"
station="http://myurl.com"
timestr = time.strftime("%B_%d_%Y-%H%M%p_%A")
directory = "/mnt/data/data/radio/current/"
filename = directory + showname +"_" + timestr + ".mp4"
command= "-q -O " + filename + " " + station +" &"
os.system("wget " + command)
#Record for an hour then kill this process based on grep for filename
time.sleep(3600)
pid=os.system("ps -ef | grep wget | grep -vi grep | grep " + filename + " | awk '{print $2}'| head -1")
os.system("kill -9 " + str(pid))
This records a streaming radio station for one hour and then kills the process. I've noticed recently that some of my recordings fail because a station is unavailable (404 error). The program is available on multiple URLs. I would like to make this more robust.
I have a file (stations.txt) that has a list of URLs, one per line. What I would like to do is modify my script so that it tries the first in the list, sends the wget command to retrieve it, waits a second or two, and then checks whether the file (variable 'filename' from above) is growing. If the file is not growing, it would try the next URL. This should help avoid missing these recordings.
I'm a novice with all this and would appreciate any help that you might be able to provide.
Thanks
You could just add a for loop after opening the file containing the URLs.
Please refer to https://docs.python.org/2/library/stdtypes.html#bltin-file-objects for the basic file-opening structure.
# open file
f = open("your_file")
try:
    for line in f:
        # line contains a url from the file
        print line
        station = line
        # continue with whatever you want to do
finally:
    f.close()
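The question also asks for the "is the file growing?" fallback, which the loop above doesn't cover. One way to sketch it (Python 3; the function names and the callback are assumptions, with the wget launch left to the caller) is to compare the output file's size across a short interval and move to the next URL when it has not grown:

```python
import os
import time

def is_growing(path, interval=2.0):
    """Return True if path exists and its size increases over `interval` seconds."""
    if not os.path.exists(path):
        return False
    before = os.path.getsize(path)
    time.sleep(interval)
    return os.path.getsize(path) > before

def pick_working_station(stations_file, filename, start_download):
    """Try each URL in stations_file until the output file starts growing.
    start_download(url, filename) is assumed to kick off the download in
    the background (e.g. the wget call from the script above).
    Returns the URL that worked, or None."""
    with open(stations_file) as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            start_download(url, filename)
            if is_growing(filename):
                return url
    return None
```

A dead URL either never creates the file or stops writing to it, so both cases fall through to the next station in the list.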

zip error: Invalid command arguments (cannot write zip file to terminal)

I am learning from the book A Byte of Python. After typing in the example from the book
import os
import time

# 1. The files and directories to be backed up are
# specified in a list.
# Example on Windows:
# source = ['"C:\\MY Documents"', 'C:\\Code']
# Example on Mac OS X and Linux:
source = ['/home/username/Downloads/books']
# Notice we had to use double quotes inside the string
# for names with spaces in them.

# 2. The backup must be stored in a
# main backup directory
# Example on Windows:
# target_dir = 'E:\\Backup'
# Example on Mac OS X and Linux:
target_dir = '/home/username/Downloads/backup'
# Remember to change this to the folder you will be using

# 3. The files are backed up into a zip file.
# 4. The name of the zip archive is the current date and time
target = target_dir + os.sep + \
         time.strftime('%Y%m%d%H%M%S') + '.zip'

# Create target directory if it is not present
if not os.path.exists(target_dir):
    os.mkdir(target_dir)  # make directory

# 5. We use the zip command to put the files in a zip archive
zip_command = "zip - r {0} {1}".format(target, ' '.join(source))

# Run the backup
print "Zip command is:"
print zip_command
print "Running:"
if os.system(zip_command) == 0:
    print 'Successful backup to', target
else:
    print 'Backup FAILED'
I got the message zip error: Invalid command arguments (cannot write zip file to terminal).
I cannot figure out where it goes wrong; I typed in the same code as in the book.
Does anyone know why this happens?
The issue is in the zip command you are creating: there is an extra space between - and r. Example -
zip_command = "zip - r {0} {1}".format(target, ' '.join(source))
                    ^
Notice the extra space
There should be no space between - and r. Example -
zip_command = "zip -r {0} {1}".format(target, ' '.join(source))
Also, I would suggest that it would be better to use subprocess.call() rather than os.system, giving it a list of arguments for the command. Example -
import subprocess
zip_command = ['zip', '-r', target]
for s in source:
    zip_command.append(s)
subprocess.call(zip_command)
It would have been easier to see the error this way.
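An alternative that avoids shelling out entirely is the standard library's shutil.make_archive, which zips a directory tree without needing an external zip binary at all. A sketch (the function name and paths are placeholders, not from the book):

```python
import os
import shutil
import time

def backup(source_dir, target_dir):
    """Zip source_dir into target_dir, named by the current timestamp.
    Returns the path of the created archive."""
    if not os.path.exists(target_dir):
        os.mkdir(target_dir)
    base = os.path.join(target_dir, time.strftime('%Y%m%d%H%M%S'))
    # make_archive appends the .zip extension itself
    return shutil.make_archive(base, 'zip', source_dir)
```

This sidesteps the whole class of quoting/spacing errors that command strings invite, at the cost of only handling one root directory per call.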

Need a script to iterate over files and execute a command

Please bear with me; I've not used Python before, and I'm trying to get some rendering done as quickly as possible, but I'm getting stopped in my tracks by this.
I'm outputting the .ifd files to a network drive (Z:), and they are stored in a folder structure like:
Z:
- \0001
- \0002
- \0003
I need to iterate over the ifd files within a single folder, but the number of files is not static, so there also needs to be a definable range (1-300, 1-2500, etc.). The script therefore has to take an additional two arguments for the start and end of the range.
On each iteration it executes something called 'mantra' using this statement:
mantra -f file.FRAMENUMBER.ifd outputFile.FRAMENUMBER.png
I've found a script on the internet that is supposed to do something similar:
import sys, os

# import command line args
args = sys.argv
# get args as strings
szEndRange = args.pop()
szStartRange = args.pop()
# convert args to int
nStartRange = int(szStartRange, 10)
nEndRange = int(szEndRange, 10)
nOrd = len(szStartRange)
# generate ID range
arVals = range(nStartRange, nEndRange + 1)

for nID in arVals:
    szFormat = 'mantra -V a -f testDebris.%%(id)0%(nOrd)dd.ifd' % {"nOrd": nOrd}
    line = szFormat % {"id": nID}
    os.system(line)
The problem I'm having is that I can't get it to work. It seems to iterate and do something, but it looks like it's just spitting out ifds into a different folder somewhere.
TLDR;
I need a script which will at least take two arguments:
startFrame
endFrame
and from those create a frame range, which is then used to iterate over all the ifd files, executing the following command:
mantra -f fileName.currentframe.ifd fileName.currentFrame.png
If I were able to specify the filename, the files' directory, and the output directory, that'd be great too. I've tried doing that manually, but there must be some convention to it I don't know, as it was coming up with errors when I tried (stopping at the colon).
If anyone could hook me up or point me in the right direction, that'd be swell. I know I should try to learn Python, but I'm at my wits' end with the rendering and need a helping hand.
import os, subprocess, sys

if len(sys.argv) != 3:
    print('Must have 2 arguments!')
    print('Correct usage is "python answer.py input_dir output_dir"')
    exit()

input_dir = sys.argv[1]
output_dir = sys.argv[2]

# iterate over the contents of the directory
for f in os.listdir(input_dir):
    # index of last period in string
    fi = f.rfind('.')
    # separate filename from extension
    file_name = f[:fi]
    file_ext = f[fi:]
    # create args
    input_str = '%s.currentframe.ifd' % os.path.join(input_dir, file_name)
    output_str = '%s.currentframe.png' % os.path.join(output_dir, file_name)
    cli_args = ['mantra', '-f', input_str, output_str]
    # call mantra; subprocess.call returns a non-zero exit code on failure
    if subprocess.call(cli_args):
        print('An error has occurred with command "%s"' % ' '.join(cli_args))
This should be sufficient for you to use either as is or with slight modification.
Instead of specifically inputting a start and end range you could just do:
import os
path, dirs, files = os.walk("/Your/Path/Here").next()
nEndRange = len(files)
# generate ID range
arVals = range(1, nEndRange + 1)
The os.walk() call counts the number of files in the folder that you specified.
An even easier way of getting your desired output is this:
import os
for filename in os.listdir('dirname'):
    szFormat = 'mantra -f ' + filename + ' outputFile.FRAMENUMBER.png'
    line = szFormat % {"id": filename}  # you might need to play around with this formatting
    os.system(line)
os.listdir() iterates through the specified directory and filename takes the value of every file in it, so you don't even need to count them.
And a little help building the command:
for nID in arVals:
    command = 'mantra -V a -f '
    infile = '{0}.{1:04d}.ifd '.format(filename, nID)
    outfile = '{0}.{1:04d}.png '.format(filename, nID)
    os.system(command + infile + outfile)
and definitely use os.walk or os.listdir like #logic recommends
for file in os.listdir("Z:"):
    filebase = os.path.splitext(file)[0]
    command = 'mantra -V a -f {0}.ifd {0}.png'.format(filebase)
    os.system(command)
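Putting the TLDR requirements together: a sketch (Python 3; the four-digit frame padding and the path layout are assumptions) that takes a base filename, a start and end frame, and input/output directories, and issues one mantra command per frame:

```python
import os
import subprocess
import sys

def build_commands(base_name, start_frame, end_frame, input_dir, output_dir, pad=4):
    """Return one mantra argument list per frame in [start_frame, end_frame]."""
    commands = []
    for frame in range(start_frame, end_frame + 1):
        stem = '%s.%0*d' % (base_name, pad, frame)  # e.g. testDebris.0042
        commands.append([
            'mantra', '-f',
            os.path.join(input_dir, stem + '.ifd'),
            os.path.join(output_dir, stem + '.png'),
        ])
    return commands

if __name__ == '__main__' and len(sys.argv) >= 6:
    # e.g. python render.py testDebris 1 300 Z:/0001 Z:/0001
    name, start, end, in_dir, out_dir = sys.argv[1:6]
    for cmd in build_commands(name, int(start), int(end), in_dir, out_dir):
        subprocess.call(cmd)
```

Passing an argument list keeps paths with spaces or colons (the "stopping at the colon" problem) intact, since nothing is re-parsed by a shell.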
