Searching string among 5Gb of text files - python

I have several CSV files (~25k in total) with a total size of ~5Gb. These files are on a network path, and I need to search for several strings inside all of them and save the names of the files where these strings are found (in an output file, for example).
I've already tried two things:
On Windows I've used findstr: findstr /s "MYSTRING" *.csv > Output.txt
With Windows PowerShell: gci -r "." -filter "*.csv" | Select-String "MYSTRING" -list > .\Output.txt
I could also use Python, but I don't really think it would be faster.
Is there any other way to speed up this search?
To be more precise: the structure of the files differs from one to the next. They are CSV files, but they could just as well be plain TXT files.

You can use pandas to work through large CSV files: read each file with read_csv(), filter the rows you are interested in with query(), and export the matches to a separate CSV file with to_csv().
import pandas as pd

# Read one CSV file into a DataFrame
df = pd.read_csv('csv_file.csv')

# Keep only the rows whose column matches the string you are looking for
result = df.query('column_name == "filtered_strings"')

# Write the matching rows to a new CSV file
result.to_csv('filtered_result.csv', index=False)
Hopefully this helps you.
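Since the files don't all share the same structure and you only need the names of the files that contain the strings, a plain line-by-line scan may be simpler than loading each file into pandas. Here is a minimal sketch (the search terms and the network path are example values; it assumes the files are readable as text):

import os

SEARCH_TERMS = ("MYSTRING", "ANOTHERSTRING")  # example search strings

def files_containing(root, terms):
    # Walk the directory tree and yield each CSV file that contains any of the terms.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".csv"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "r", encoding="utf-8", errors="replace") as f:
                for line in f:
                    if any(term in line for term in terms):
                        yield path
                        break  # one hit is enough; move on to the next file

with open("Output.txt", "w") as out:
    for path in files_containing(r"\\server\share\folder", SEARCH_TERMS):
        out.write(path + "\n")

As the PowerShell answer below points out, the network transfer is the real bottleneck, so this is unlikely to beat findstr unless it runs on the machine that hosts the files.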

One of the fastest ways to search text files using PowerShell is a switch statement with the parameters -File FILENAME -Regex.
This won't make a big difference, though, unless you avoid the network I/O bottleneck by running the search code on the server itself, e.g. using Invoke-Command. Of course, you need permission to run scripts on the remote server.
Invoke-Command -ComputerName TheRemoteMachine {
    Get-ChildItem C:\Location\Of\Files -Recurse -Filter *.csv -PV file | ForEach-Object {
        switch -File $_.FullName -Regex {
            'MYSTRING|ANOTHERSTRING' { $file.FullName; break }
        }
    }
} | Set-Content output.txt
This outputs the full paths of the files that contain the substring "MYSTRING" or "ANOTHERSTRING"; the output is received on the local machine and stored in a local file.
switch -File $_.FullName -Regex reads the current file line by line, applying the regular expression to each line. We use break to stop searching as soon as the first match has been found.
The parameter -PV file (an alias of -PipelineVariable) for Get-ChildItem is used so we have access to the current file path inside the switch statement. In the switch statement $_ denotes the current regex match, so it hides the $_ of the ForEach-Object command; with -PV we provide another name for ForEach-Object's current item.

Related

arranging text files side by side using python

I have 3000 text files in a directory, and each .txt file contains a single column of data. I want to arrange them side by side to make an m x n matrix file.
For example: paste 1.txt 2.txt 3.txt 4.txt .............3000.txt in Linux
For this I tried
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
However, it gives the error paste: filename.txt: Too many open files
Please suggest a better solution for the same using Python.
Based on the inputs received, please follow the steps below.
# Change ulimit to increase the no of open files at a time
$ ulimit -n 4096
# Remove blank lines from all the files
$ sed -i '/^[[:space:]]*$/d' *.txt
# Join all files side by side to form a matrix view
$ paste $(ls -v *.txt) > matrix.txt
# Fill the blank values in the matrix view with 0's using awk inplace
$ awk -i inplace 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = 0 }; 1' matrix.txt
You don't need python for this; if you first increase the number of open files a process can have using ulimit, it becomes easy to get columns in the right order in bash, zsh, or ksh93 shells, using paste and brace expansion to generate the filenames in the desired order instead of having to sort the results of filename expansion:
% ulimit -n 4096
% paste {1..3000}.txt > matrix.txt
(I tested this in all three shells I mentioned on a Linux box, and it works with all of them with no errors about the command line being too long or anything else.)
You could also arrange to have the original files use a different naming scheme that sorts naturally, like 0001.txt, 0002.txt, ..., 3000.txt and then just paste [0-9]*.txt > matrix.txt.
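If you would rather do this in Python, as the question asks, here is a minimal sketch that reads each single-column file one at a time and writes the columns side by side. It assumes the files are named 1.txt through 3000.txt and that tab-separated output (like paste produces) is acceptable:

import itertools

n_files = 3000
filenames = ["{}.txt".format(i) for i in range(1, n_files + 1)]

# Read each file as a list of stripped lines (one column per file)
columns = []
for name in filenames:
    with open(name) as f:
        columns.append([line.strip() for line in f])

# Write the columns side by side; shorter columns are padded with "0"
with open("matrix.txt", "w") as out:
    for row in itertools.zip_longest(*columns, fillvalue="0"):
        out.write("\t".join(row) + "\n")

Because the files are read one at a time, this avoids the "Too many open files" limit without touching ulimit.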

findstr function doesn't work on original file

I have code that uses PowerShell to export the domain controller's policy using Get-GPOReport. However, I can never get findstr to work on this exported HTML file. The only way it works is if I change the extension of the HTML file to .txt and then copy all of its content into another newly created .txt file (e.g. test.txt).
Only then does findstr work. Does anyone know why it doesn't work on the original file?
import os, subprocess

power_shell = os.path.join(os.environ["SYSTEMROOT"], "System32", "WindowsPowerShell", "v1.0", "powershell.exe")

subprocess.Popen(["powershell", r"Get-GPOReport -Name 'Default Domain Controllers Policy' -ReportType HTML -Path 'D:\Downloads\Project\GPOReport.html'"], stdout=subprocess.PIPE)

policyCheck = subprocess.check_output([power_shell, "-Command", 'findstr /c:"Minimum password age"', r"D:\Downloads\Project\GPOReport.html"]).decode('utf-8')
print(policyCheck)

# However, if I copy all the content of D:\Downloads\Project\GPOReport.html into a newly created test.txt file
# (MANUALLY - I've tried to do it programmatically; findstr wouldn't work there either) under the same directory and use:

policyCheck = subprocess.check_output([power_shell, "-Command", 'findstr /c:"Minimum password age"', r"D:\Downloads\Project\test.txt"]).decode('utf-8')
print(policyCheck)
# The correct output shows here
What I got:
subprocess.CalledProcessError: Command '['C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe', '-Command', 'findstr /c:"Minimum password age"', 'D:\Downloads\Project\GPOReport.html']' returned non-zero exit status 1.
Expected Output:
<tr><td>Minimum password age</td><td>1 days</td></tr>
I'm not a Python guy, but I think this may be an encoding issue, given that findstr is not Unicode compatible. As @iRon suggested, Select-String should do the trick, though you may have to reference the .Line property to get the expected output you mentioned; otherwise it will return match objects.
I'll leave it to you to transpose this into the Python code, but the Select-String command should look something like:
(Select-String -Path "D:\Downloads\Project\GPOReport.html" -Pattern "Minimum password age" -SimpleMatch).Line
If there are multiple matches, this will return an array of strings: the lines where the matches were found. Let me know if that's helpful.
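Another option is to skip findstr entirely and search the report directly in Python, which sidesteps the encoding problem. A minimal sketch, under the assumption that the exported report is UTF-16 with a BOM (else UTF-8); verify the actual encoding of your export:

def find_lines(path, needle):
    # Peek at the first bytes to guess the encoding; exported reports are often UTF-16 with a BOM.
    with open(path, "rb") as f:
        head = f.read(2)
    encoding = "utf-16" if head in (b"\xff\xfe", b"\xfe\xff") else "utf-8"
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return [line.rstrip("\r\n") for line in f if needle in line]

for line in find_lines(r"D:\Downloads\Project\GPOReport.html", "Minimum password age"):
    print(line)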

Find files with same name but different content

I need to find files with the same name but different content in a Linux folder structure with a lot of files.
Something like this does the job partially; how do I weed out the groups where the content is actually the same?
#!/bin/sh
dirname=/path/to/directory
find $dirname -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    find $dirname -type f | grep "$fileName"
done
(How to find duplicate filenames (recursively) in a given directory? BASH)
Thanks so much!
The first question is, how can you determine whether two files have the same content?
One obvious possibility is to read (or mmap) both files and compare them a block at a time. On some platforms a stat is a lot faster than a read, so you may want to compare sizes first. And there are other optimizations that might be useful, depending on what you're actually doing (e.g., if you're going to run this thousands of times, and most of the files are the same every time, you could hash them and cache the hashes, and only check the actual files when the hashes match). But I doubt you're too worried about that kind of performance tweak if your existing code is acceptable (since it searches the whole tree once for every file in the tree), so let's just do the simplest thing.
Here's one way to do it in Python:
#!/usr/bin/env python3
import sys

def readfile(path):
    with open(path, 'rb') as f:
        return f.read()

contents = [readfile(fname) for fname in sys.argv[1:]]
sys.exit(all(content == contents[0] for content in contents[1:]))
This will exit with code 1 if all files are identical, code 0 if any pair of files are different. So, save this as allequal.py, make it executable, and your bash code can just run allequal.py on the results of that grep, and use the exit value (e.g., via $?) to decide whether to print those results for you.
I am facing the same problem as described in the question. In a large directory tree, some files have the same name and either same content or different content. The ones where the content differs need human attention to decide how to fix the situation in each case. I need to create a list of these files to guide the person doing this.
The code in the question and the code in abernet's response are both helpful. Here is how one would combine them: store the Python code from abernet's response in a file, e.g. /usr/local/bin/do_these_files_have_different_content:
sudo tee /usr/local/bin/do_these_files_have_different_content <<EOF
#!/usr/bin/env python3
import sys

def readfile(path):
    with open(path, 'rb') as f:
        return f.read()

contents = [readfile(fname) for fname in sys.argv[1:]]
sys.exit(all(content == contents[0] for content in contents[1:]))
EOF
sudo chmod a+x /usr/local/bin/do_these_files_have_different_content
Then extend the bash code from Illusionist's question to call this program when needed, and react to its outcome:
#!/bin/sh
dirname=$1
find $dirname -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    if do_these_files_have_different_content $(find $dirname -type f | grep "$fileName")
    then
        find $dirname -type f | grep "$fileName"
        echo
    fi
done
This will write to stdout the paths of all files with the same name but different content; groups of such files are separated by empty lines. I store the shell script in /usr/local/bin/find_files_with_same_name_but_different_content and invoke it as
find_files_with_same_name_but_different_content /path/to/my/storage/directory
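If you prefer to stay entirely in Python and walk the tree only once (the shell version above re-runs find for every duplicate name), here is a sketch that groups paths by filename and prints only the groups whose contents differ. It uses a content hash instead of pairwise byte comparison, one of the shortcuts mentioned in the first answer, and the output format mirrors the shell script above:

#!/usr/bin/env python3
import hashlib
import os
import sys
from collections import defaultdict

def content_key(path):
    # The file size plus a SHA-256 digest identifies the content for comparison purposes.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return (os.path.getsize(path), h.hexdigest())

root = sys.argv[1]
by_name = defaultdict(list)
for dirpath, _dirs, files in os.walk(root):
    for name in files:
        by_name[name].append(os.path.join(dirpath, name))

for name, paths in sorted(by_name.items()):
    if len(paths) < 2:
        continue
    # Report the group only if at least two files with this name differ in content.
    if len({content_key(p) for p in paths}) > 1:
        print('\n'.join(paths))
        print()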

Copy the nth column of all the files in a directory into a single file

I've a directory containing many .csv files. How can I extract the nth column of every file into a new file column-wise?
For example:
File A:
111,222,333
111,222,333
File B:
AAA,BBB,CCC
AAA,BBB,CCC
File C:
123,456,789
456,342,122
and so on...
If n = 2, I want my resultant file to be:
222,BBB,456,...
222,BBB,342,...
where ... represents that there will be as many columns as the number of files in the directory.
My try so far:
#!/bin/bash
for i in `find ./ -iname "*.csv"`
do
    awk -F, '{ print $2 }' < $i >> result.csv  ## This would append row-wise, not column-wise.
done
UPDATE:
I'm not trying to just join two files. There are 100 files in a particular directory, and I want to copy the nth column of all of them into a single file. I gave two files as an example to show how I want the data to look if there were only two files.
As pointed out in the comments, joining two files is trivial, but joining multiple files may not be that easy, which is the whole point of my question. Would Python help to do this job?
Building on triplee's solution, here's a generic version which uses eval:
eval paste -d, $(printf "<(cut -d, -f2 %s) " *.csv)
I'm not too fond of eval (always be careful when using it), but it has its uses.
Hmm. My first thought is to have both an outer and inner loop. The outer loop would be a counter on line number. The inner loop would go through the csv files. You'd need to use head/tail in the inner loop to get the correct line number so you could grab the right field.
An alternative is to use the one loop you have now but write each line to a separate file and then merge them.
Neither of these seems ideal. Quite honestly, I'd do this in Perl so you could use an actual in-memory data structure and avoid the need for complex logic.
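The same in-memory idea is straightforward in Python, if that is an option. A minimal sketch, assuming comma-separated files and that you want the second column of every *.csv in the current directory:

import csv
import glob
import itertools

n = 2  # 1-based column number to extract

# Collect the nth field of every row of every file (one column per file)
columns = []
for name in sorted(glob.glob("*.csv")):
    with open(name, newline="") as f:
        columns.append([row[n - 1] for row in csv.reader(f)])

# Write the collected columns side by side into a single CSV file
with open("result.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in itertools.zip_longest(*columns, fillvalue=""):
        writer.writerow(row)

Note that sorted() orders the names lexicographically; if your files are numbered 1.csv, 2.csv, ..., sort them numerically instead.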
This one-liner should work:
awk -F, -v OFS="," 'NR==FNR{a[NR]=$2;next}{print a[FNR],$2}' file1 file2
Assuming Bash process substitutions are acceptable (i.e. you don't require the solution to be portable to systems where Bash is not available);
paste -d, <(cut -d, -f2 file1) <(cut -d, -f2 file2) <(cut -d, -f2 file3) # etc
A POSIX solution requires temporary files instead.

Python Script to Change Folder Names

I'm on OS X and I'm fed up with our labeling system where I work. The labels are mm/dd/yy and I think that they should be yy/mm/dd. Is there a way to write a script to do this? I understand a bit of Python with lists and how to change the position of characters.
Any suggestions or tips?
What I have now:
083011-HalloweenBand
090311-ViolaClassRecital
090411-JazzBand
What I want:
110830-HalloweenBand
110903-ViolaClassRecital
110904-JazzBand
Thanks
Assuming the script is in the same directory as the files you want to rename, and you already have the list of files that you want to rename, you can do this:
import os

for file in rename_list:
    os.rename(file, file[4:6] + file[:2] + file[2:4] + file[6:])
There is a Q&A with information on traversing directories with Python that you could modify to do this. The key method is walk(), but you'll need to add the appropriate calls to rename().
As a beginner it is probably best to start by traversing the directories and writing out the new directory names before attempting to change the directory names. You should also make a backup and notify anyone who might care about this change before doing it.
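A dry-run sketch along those lines, which prints the proposed new names before anything is renamed (the six-digit mmddyy prefix check is an assumption about the label format):

import os
import re

ROOT = "."  # directory to traverse (example value)
PATTERN = re.compile(r"^(\d{2})(\d{2})(\d{2})-")  # matches an mmddyy- prefix

for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in dirnames + filenames:
        m = PATTERN.match(name)
        if not m:
            continue
        mm, dd, yy = m.groups()
        new_name = yy + mm + dd + name[6:]
        print(os.path.join(dirpath, name), "->", os.path.join(dirpath, new_name))
        # Once the output looks right, replace the print with:
        # os.rename(os.path.join(dirpath, name), os.path.join(dirpath, new_name))

If the dated entries are directories rather than files, rename them bottom-up (os.walk(ROOT, topdown=False)) so the traversal isn't confused by paths that change underneath it.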
I know you asked for Python, but I would do it from the shell. This is a simple one-liner.
ls | awk '{print "mv " $0 FS substr($1,5,2) substr($1,1,4) substr($1,7) }' | bash
I do not use OS X, but I think it uses a bash shell. You may need to change bash to sh, or awk to gawk.
What that line does is pipe the directory listing to awk, which prints "mv", then $0 (the whole line), a space (FS is the field separator, which defaults to a space), and then the rearranged substrings.
substr(s,c,n). This returns the substring from string s starting from character position c up to a maximum length of n characters. If n is not supplied, the rest of the string from c is returned.
Lastly, this is piped to the shell, allowing it to be executed. This works without problems on Ubuntu, and I use variations of this command quite a bit. A version of awk (awk, nawk, gawk) should be installed on OS X, which I believe uses bash.
