How to split csv files into multiple files using the delimiter?

I have a tab delimited file as such:
this is a sentence. abb
what is this foo bar. bev
hello foo bar blah black sheep. abb
I could use cut -f1 and cut -f2 in a Unix terminal to split it into two files:
this is a sentence.
what is this foo bar.
hello foo bar blah black sheep.
and:
abb
bev
abb
But is it possible to do the same in Python? Would it be faster?
I've been doing it as such:
[i.split('\t')[0] for i in open('in.txt', 'r')]

But is it possible to do the same in python?
Yes, you can:
l1, l2 = [], []
with open('in.txt', 'r') as f:
    for i in f:
        # will loudly fail if a line does not have exactly two tab-separated columns
        left, right = i.rstrip('\n').split('\t')
        l1.append(left)
        l2.append(right)
print("\n".join(l1))
print("\n".join(l2))
Would it be faster?
It's not likely: cut is a C program optimized for exactly that kind of processing, whereas Python is a general-purpose language with great flexibility but not necessarily great speed.
The one advantage you may get from an algorithm such as the one above is that the file is read only once, whereas with cut you read it twice. That could make the difference.
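For reference, here is a minimal one-pass sketch that writes the two columns straight to two output files instead of printing them (the output names out1.txt and out2.txt are placeholders of mine, not from the question):
import sys

with open('in.txt') as f, open('out1.txt', 'w') as o1, open('out2.txt', 'w') as o2:
    for line in f:
        # split on the first tab only, so any extra tabs stay in the right-hand column
        left, right = line.rstrip('\n').split('\t', 1)
        o1.write(left + '\n')
        o2.write(right + '\n')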
Still, we'd need to run a benchmark to be sure.
Here's a small benchmark, on my laptop, for what it's worth:
>>> timeit.timeit(stmt=lambda: t("file_of_606251_lines"), number=1)
1.393364901014138
vs
% time cut -d' ' -f1 file_of_606251_lines > /dev/null
cut -d' ' -f1 file_of_606251_lines > /dev/null 0.74s user 0.02s system 98% cpu 0.775 total
% time cut -d' ' -f2 file_of_606251_lines > /dev/null
cut -d' ' -f2 file_of_606251_lines > /dev/null 1.18s user 0.02s system 99% cpu 1.215 total
which is 1.990 seconds.
So the python version is indeed faster, as expected ;-)


Bash use of temporary file pipes disallows redirection to dependent files? [duplicate]

Basically I want to take text from a file as input, remove a line from that file, and send the output back to the same file. Something along these lines, if that makes it any clearer:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name > file_name
However, when I do this I end up with a blank file.
Any thoughts?
Use sponge for this kind of task. It's part of moreutils.
Try this command:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | sponge file_name
You cannot do that because bash processes the redirections first, then executes the command. So by the time grep looks at file_name, it is already empty. You can use a temporary file though.
#!/bin/sh
tmpfile=$(mktemp)
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name > "${tmpfile}"
cat "${tmpfile}" > file_name
rm -f "${tmpfile}"
Something like that. The script uses mktemp to create the temporary file, but note that mktemp is not POSIX.
Use sed instead:
sed -i '/seg[0-9]\{1,\}\.[0-9]\{1\}/d' file_name
Try this simple one:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | tee file_name
Your file will not be blank this time :) and your output is also printed to your terminal.
You can't use a redirection operator (> or >>) to the same file, because the shell performs the redirection and creates/truncates the file before the command is even invoked. To avoid that, you should use an appropriate tool such as tee, sponge, sed -i, or any other tool that can write results to the file (e.g. sort file -o file).
Basically, redirecting output back to the original input file doesn't make sense, and you should use an appropriate in-place editor instead, for example the Ex editor (part of Vim):
ex '+g/seg[0-9]\{1,\}\.[0-9]\{1\}/d' -scwq file_name
where:
'+cmd'/-c - run any Ex/Vim command
g/pattern/d - remove lines matching a pattern using global (help :g)
-s - silent mode (man ex)
-c wq - execute :write and :quit commands
You may use sed to achieve the same (as already shown in other answers); however, in-place editing (-i) is a non-standard FreeBSD extension (it may work differently between Unix/Linux), and sed is basically a stream editor, not a file editor. See: Does Ex mode have any practical use?
One-liner alternative: set the content of the file as a variable:
VAR=`cat file_name`; echo "$VAR"|grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' > file_name
Since this question is the top result in search engines, here's a one-liner based on https://serverfault.com/a/547331 that uses a subshell instead of sponge (which often isn't part of a vanilla install like OS X):
echo "$(grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name)" > file_name
The general case is:
echo "$(cat file_name)" > file_name
Edit: the above solution has some caveats:
printf '%s' <string> should be used instead of echo <string> so that files containing -n don't cause undesired behavior.
Command substitution strips trailing newlines (this is a bug/feature of shells like bash) so we should append a postfix character like x to the output and remove it on the outside via parameter expansion of a temporary variable like ${v%x}.
Using a temporary variable $v stomps the value of any existing variable $v in the current shell environment, so we should nest the entire expression in parentheses to preserve the previous value.
Another bug/feature of shells like bash is that command substitution strips unprintable characters like null from the output. I verified this by calling dd if=/dev/zero bs=1 count=1 >> file_name and viewing it in hex with cat file_name | xxd -p. But echo $(cat file_name) | xxd -p is stripped. So this answer should not be used on binary files or anything using unprintable characters, as Lynch pointed out.
The general solution (albeit slightly slower, more memory-intensive, and still stripping unprintable characters) is:
(v=$(cat file_name; printf x); printf '%s' "${v%x}" > file_name)
Test from https://askubuntu.com/a/752451:
printf "hello\nworld\n" > file_uniquely_named.txt && for ((i=0; i<1000; i++)); do (v=$(cat file_uniquely_named.txt; printf x); printf '%s' ${v%x} > file_uniquely_named.txt); done; cat file_uniquely_named.txt; rm file_uniquely_named.txt
Should print:
hello
world
Whereas calling cat file_uniquely_named.txt > file_uniquely_named.txt in the current shell:
printf "hello\nworld\n" > file_uniquely_named.txt && for ((i=0; i<1000; i++)); do cat file_uniquely_named.txt > file_uniquely_named.txt; done; cat file_uniquely_named.txt; rm file_uniquely_named.txt
Prints an empty string.
I haven't tested this on large files (probably over 2 or 4 GB).
I have borrowed this answer from Hart Simha and kos.
This is very much possible, you just have to make sure that by the time you write the output, you're writing it to a different file. This can be done by removing the file after opening a file descriptor to it, but before writing to it:
exec 3<file ; rm file; COMMAND <&3 >file ; exec 3>&-
Or line by line, to understand it better:
exec 3<file # open a file descriptor reading 'file'
rm file # remove file (but fd3 will still point to the removed file)
COMMAND <&3 >file # run command, with the removed file as input
exec 3>&- # close the file descriptor
It's still a risky thing to do, because if COMMAND fails to run properly, you'll lose the file contents. That can be mitigated by restoring the file if COMMAND returns a non-zero exit code:
exec 3<file ; rm file; COMMAND <&3 >file || cat <&3 >file ; exec 3>&-
We can also define a shell function to make it easier to use:
# Usage: replace FILE COMMAND
replace() { exec 3<"$1"; rm "$1"; "${@:2}" <&3 >"$1" || cat <&3 >"$1"; exec 3>&-; }
Example:
$ echo aaa > test
$ replace test tr a b
$ cat test
bbb
Also, note that this will keep a full copy of the original file (until the third file descriptor is closed). If you're using Linux, and the file you're processing is too big to fit twice on the disk, you can check out this script, which will pipe the file to the specified command block by block while deallocating the already-processed blocks. As always, read the warnings in the usage page.
The following will accomplish the same thing that sponge does, without requiring moreutils:
shuf --output=file --random-source=/dev/zero
The --random-source=/dev/zero part tricks shuf into doing its thing without doing any shuffling at all, so it will buffer your input without altering it.
However, it is true that using a temporary file is best, for performance reasons. So, here is a function that I have written that will do that for you in a generalized way:
# Pipes a file into a command, and pipes the output of that command
# back into the same file, ensuring that the file is not truncated.
# Parameters:
# $1: the file.
# $2: the command. (With $3... being its arguments.)
# See https://stackoverflow.com/a/55655338/773113
siphon()
{
    local tmp file rc=0
    [ "$#" -ge 2 ] || { echo "Usage: siphon filename [command...]" >&2; return 1; }
    file="$1"; shift
    tmp=$(mktemp -- "$file.XXXXXX") || return
    "$@" <"$file" >"$tmp" || rc=$?
    mv -- "$tmp" "$file" || rc=$(( rc | $? ))
    return "$rc"
}
There's also ed (as an alternative to sed -i):
# cf. http://wiki.bash-hackers.org/howto/edit-ed
printf '%s\n' H 'g/seg[0-9]\{1,\}\.[0-9]\{1\}/d' wq | ed -s file_name
You can slurp the file with POSIX awk:
!/seg[0-9]\{1,\}\.[0-9]\{1\}/ {
    q = q ? q RS $0 : $0
}
END {
    print q > ARGV[1]
}
This does the trick pretty nicely in most of the cases I faced:
cat <<< "$(do_stuff_with f)" > f
Note that while $(…) strips trailing newlines, <<< ensures a final newline, so generally the result is magically satisfying.
(Look for “Here Strings” in man bash if you want to learn more.)
Full example:
#! /usr/bin/env bash
get_new_content() {
sed 's/Initial/Final/g' "${1:?}"
}
echo 'Initial content.' > f
cat f
cat <<< "$(get_new_content f)" > f
cat f
This does not truncate the file and yields:
Initial content.
Final content.
Note that I used a function here for the sake of clarity and extensibility, but that’s not a requirement.
A common use case is JSON editing:
echo '{ "a": 12 }' > f
cat f
cat <<< "$(jq '.a = 24' f)" > f
cat f
This yields:
{ "a": 12 }
{
"a": 24
}
Try this:
echo -e "AAA\nBBB\nCCC" > testfile
cat testfile
AAA
BBB
CCC
echo "$(grep -v 'AAA' testfile)" > testfile
cat testfile
BBB
CCC
I usually use the tee program to do this:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | tee file_name
It creates and removes a tempfile by itself.

How to efficiently find small typos in source code files?

I would like to recursively search a large code base (mostly python, HTML and javascript) for typos in comments, strings and also variable/method/class names. Strong preference for something that runs in a terminal.
The problem is that spell checkers like aspell or scspell find almost only false positives (e.g. programming terms, camelcased terms) while I would be happy if it could help me primarily find simple typos like scrambled or missing letters e.g. maintenane vs. maintenance, resticted vs. restricted, dpeloyment vs. deployment.
What I was playing with so far is:
for f in **/*.py ; do echo $f ; aspell list < $f | uniq -c ; done
but it will find anything like: assertEqual, MyTestCase, lifecycle
This solution of my own focuses on Python files, but in the end it also found typos in the HTML and JS. It still needed manual sorting out of false positives, but that only took a few minutes' work, and it identified about 150 typos in comments that could then also be found in the non-comment parts.
Save this as an executable file, e.g. extractcomments:
#!/usr/bin/env python3
import argparse
import io
import tokenize

if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('filename')
    args = parser.parse_args()
    with io.open(args.filename, "r", encoding="utf-8") as sourcefile:
        for t in tokenize.generate_tokens(sourcefile.readline):
            if t.type == tokenize.COMMENT:
                print(t.string.lstrip("#").strip())
Collect all comments for further processing:
for f in **/*.py ; do ~/extractcomments $f >> ~/comments.txt ; done
Run aspell over the collected comments with one or more dictionaries, keep everything it flags as a typo, and count the occurrences:
aspell <~/comments.txt --lang=en list|aspell --lang=de list | sort | uniq -c | sort -n > ~/typos.txt
Produces something like:
10 availabe
8 assignement
7 hardwird
Take the list without the leading numbers, clean out the false positives, copy it to a second file correct.txt, and run aspell on it to get the desired replacement for each typo: aspell -c correct.txt
Now paste the two files together to get a typo;correction format: paste -d";" typos.txt correct.txt > known_typos.csv
Now we want to recursively replace those in our codebase:
#!/bin/bash
root_dir=$(git rev-parse --show-toplevel)
while IFS=";" read -r typo fix ; do
    git grep -l -z -w "${typo}" -- "*.py" "*.html" | xargs -r --null sed -i "s/\b${typo}\b/${fix}/g"
done < "$root_dir/known_typos.csv"
My bash skills are poor so there is certainly space for improvement.
Update: I could find more typos in method names by running this:
grep -r def --include \*.py . | cut -d ":" -f 2- |tr "_" " " | aspell --lang=en list | sort -u
Update 2: I managed to fix typos that sit inside underscored names or strings and therefore have no word boundaries as such, e.g. i_am_a_typpo3:
#!/bin/bash
root_dir=$(git rev-parse --show-toplevel)
while IFS=";" read -r typo fix ; do
    echo "${typo}"
    find "$root_dir" \( -name '*.py' -or -name '*.html' \) -print0 | xargs -0 perl -pi -e "s/(?<![a-zA-Z])${typo}(?![a-zA-Z])/${fix}/g"
done < "$root_dir/known_typos.csv"
If you're using TypeScript, you could use the gulp plugin I created for spell checking:
https://www.npmjs.com/package/gulp-ts-spellcheck
If you are developing in JavaScript or TypeScript, then you can use this spell check plugin for ESLint:
https://www.npmjs.com/package/eslint-plugin-spellcheck
I found it to be very useful.
Another option is scspell:
https://github.com/myint/scspell
It is language-agnostic and claims to "usually catch many errors without an annoying false positive rate."

How to get one line from a print output in linux?

I'm trying to pull one line from a subprocess.check_output but so far I have no luck. I'm running a Python script and this is my code:
output = subprocess.check_output("sox /home/pi/OnoSW/data/opsoroassistant/rec.wav -n stat", shell=True)
and this is what I get back when I run the script:
Samples read: 80000
Length (seconds): 5.000000
Scaled by: 2147483647.0
Maximum amplitude: 0.001129
Minimum amplitude: -0.006561
Midline amplitude: -0.002716
Mean norm: 0.000291
Mean amplitude: -0.000001
RMS amplitude: 0.000477
Maximum delta: 0.002930
Minimum delta: 0.000000
Mean delta: 0.000052
RMS delta: 0.000102
Rough frequency: 272
Volume adjustment: 152.409
Now I want to get the 9th line (RMS amplitude) out of this list. I already tried something with sed, but it didn't give anything back:
output = subprocess.check_output("sox /home/pi/OnoSW/data/opsoroassistant/rec.wav -n stat 2>&1 | sed -n 's#^RMS amplitude:[^0-9]*\([0-9.]*\)$#\1#p0'",stderr= subprocess.STDOUT, shell=True)
Thank You
What about grepping the line?
output = subprocess.check_output("sox /home/pi/OnoSW/data/opsoroassistant/rec.wav -n stat 2>&1 | grep 'RMS amplitude:'",stderr= subprocess.STDOUT, shell=True)
I think the problem might be that you're matching for spaces, but sox is actually outputting tab characters to do the column spacing. Your terminal is likely expanding the tab to spaces, so when you copy/paste the output, you see spaces. Try matching for [[:space:]] (any whitespace character) instead of literal spaces:
output = subprocess.check_output("sox /home/pi/OnoSW/data/opsoroassistant/rec.wav -n stat | sed -n 's#^RMS[[:space:]]*amplitude:[[:space:]]*\([0-9.]*\)$#\1#p'", shell=True)
I also had to remove the 0 after the p at the end of your sed replace command.
You could also do the output processing in Python (using re) rather than spinning off another subprocess to execute the sed command. That would probably be easier to debug.
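For example, a minimal sketch of that idea (the regular expression is my own guess at matching the stat output; the file path is copied from the question):
import re
import subprocess

# sox writes its stat report to stderr, so merge it into the captured output
output = subprocess.check_output(
    ["sox", "/home/pi/OnoSW/data/opsoroassistant/rec.wav", "-n", "stat"],
    stderr=subprocess.STDOUT)

match = re.search(r"RMS\s+amplitude:\s*([0-9.]+)", output.decode())
if match:
    rms_amplitude = float(match.group(1))
    print(rms_amplitude)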
In order to fix your line, you need to remove the 0 at the end and escape the '\1':
output = subprocess.check_output("sox /home/pi/OnoSW/data/opsoroassistant/rec.wav -n stat 2>&1 | sed -n 's#^RMS amplitude:[^0-9]*\([0-9.]*\)$#\\1#p'",stderr= subprocess.STDOUT, shell=True)
Also, using a shell pipe isn't really advisable, security-wise; I'd suggest changing this to:
# sox's stat report goes to stderr, so merge it into stdout instead of passing a literal '2>&1' argument
p = subprocess.Popen(('sox', '/home/pi/OnoSW/data/opsoroassistant/rec.wav', '-n', 'stat'),
                     stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output = subprocess.check_output(('sed', '-n', 's#^RMS amplitude:[^0-9]*\([0-9.]*\)$#\\1#p'), stdin=p.stdout)
p.wait()

Better way in Python

Is there a better / simpler way to accomplish this in Python?
I have a bash script that calculates CPS (calls per second). It runs fine on small files but poorly on large ones. It basically takes the file that we are calculating the CPS for and extracts field 7 which is the INVITING time, sorts, and only gets the unique values. This is all put in a tmp.file. The script then cats the original file and greps for each of the values in the tmp.file, counts them, and outputs the time and count to a final file.
#!/bin/bash
cat $1 | cut -d "," -f 7 | sort | uniq > /tmp/uniq.time.txt;
list="/tmp/uniq.time.txt";
while read time
do
    VALUE1=`cat $1 | grep "$time" | wc -l`;
    echo $VALUE1 >> /tmp/cps.tmp;
done < $list;
rm /tmp/cps.tmp;
I think what you're trying to do is simply:
cat $1 |cut -d "," -f 7 | sort | uniq -c
note: if you want to swap the order of the fields:
| awk -F " *" '{print $3, $2}'
This can certainly be done easier and more efficiently in Python:
import sys
from itertools import groupby

with open(sys.argv[1]) as f:
    times = [line.split(",")[6] for line in f]
times.sort()
for time, occurrences in groupby(times):
    print time, len(list(occurrences))
The problem with your approach is that you have to search the whole file for each unique time. You could write this more efficiently even in bash, but I think it's more convenient to do this in Python.
Reading CSV files:
http://docs.python.org/library/csv.html
Uniquifying:
set(nonUniqueItems)
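Tying those two pointers to the task at hand, a minimal sketch (using collections.Counter for the counting step, which is my choice rather than the answer's):
import csv
import sys
from collections import Counter

with open(sys.argv[1]) as f:
    # field 7 (index 6) holds the INVITING time, per the question
    counts = Counter(row[6] for row in csv.reader(f) if len(row) > 6)

for time, count in sorted(counts.items()):
    print(time, count)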

How do I grep for words coming from a file in files listed in a file?

Searching a single file for a word is easy:
grep stuff file.txt
But I have many files, each is a line in files.txt, and many words I want to find, each is a line in words.txt. The output should be a file with each line a => b with a being the line number in words.txt, b being the line number in files.txt.
I need to run it on OSX, so preferably something simple in shell, but any other language would be fine. I haven't had much experience with shell scripts myself, and I'm more used to languages that aren't useful for string searching (namely C - I'm guessing Perl or Python may be helpful, but I've not used them).
You might find this to be faster, more Pythonic, and easier to understand:
with open("words.txt") as words:
wlist=[(ln,word.strip()) for ln,word in enumerate(words,1)]
with open("files.txt") as files:
flist=[(ln,file.strip()) for ln,file in enumerate(files,1)]
for filenum, filename in flist:
with open(filename) as fdata:
for fln,line in enumerate(fdata,1):
for wln, word in wlist:
if word in line:
print "%d => %d" % (wln, fln)
This is a two-parter with awk:
1. scan each file in files.txt, and map the word number to the name of the file
2. map the filename to the line number in files.txt
awk '
NR == FNR {word[$1] = NR; next}
{for (i=1; i<=NF; i++) {if ($i in word) {print word[$i] " => " FILENAME; break}}}
' words.txt $(<files.txt) |
sort -u |
awk '
NR == FNR {filenum[$1] = NR; next}
{$3 = filenum[$3]; print}
' files.txt -
First, learn to specify the files of interest. In one directory or more than one directory? The Unix find utility will do that.
At the Bash prompt:
$ cd [the root directory where your files are]
$ find . -name "*.txt"
You did not say, but presumably the files can be described with "star dot something"; if so, find will find them.
Next, pipe the files names to what you want to do to them:
$ find . -name "*.txt" -print0 | xargs -0 egrep 'stuff'
That will run egrep on each file with the search pattern stuff.
Google find plus xargs for literally thousands of examples. Once you are comfortable finding the files -- rephrase your question so that it is a bit more obvious what you want to do to them. Then I can help you with Perl to do it.
The following Python script does it. This is my first attempt at Python, so I'd appreciate any comments.
flist = open('files.txt')
filenum = 0
for filename in flist:
    filenum = filenum + 1
    filenamey = filename.strip()
    filedata = open(filenamey)
    for fline in filedata:
        wordnum = 0
        wlist = open('words.txt')
        for word in wlist:
            wordnum = wordnum + 1
            sword = word.strip()
            if sword in fline:
                s = repr(filenum) + ' => ' + repr(wordnum)
                print s
Here's something that will do what you want, but the only thing is that it will not print out the matched word, instead just prints out the line matched, the file name, and the line number. However, if you use --color=auto on grep, it will highlight the matched words using whatever you have set in ${GREP_COLOR}, the default is red.
cat files.txt | xargs grep -nf words.txt --color=auto
This command dumps the contents of files.txt line by line and pipes the file names to xargs, which runs grep to search each file for every word that matches in words.txt. Like files.txt, words.txt should contain all the search terms, delimited by newlines.
If your grep was built with the perl regular expression engine, then, you can use Perl regular expressions if you pass the -P option to grep like so:
grep -Pnf words.txt --color=auto
Hope this helps.
Update: At first, I wasn't really sure what @Zeophlite was asking, but after he posted his example, I see what he wanted. Here's a Python implementation of what he wants to do:
from contextlib import nested  # Python 2; in Python 3 a single with statement takes both context managers

def search_file(line_num, filename):
    with nested(open(filename), open('words.txt')) as managers:
        open_filename, word_file = managers
        # read the word list once; otherwise the file iterator is exhausted after the first line of the searched file
        words = [w.strip() for w in word_file]
        for line in open_filename:
            for wordfile_line_number, word in enumerate(words, 1):
                if word in line:
                    print "%s => %s" % (line_num, wordfile_line_number)

with open('files.txt') as filenames_file:
    for filenames_line_number, fname in enumerate(filenames_file, 1):
        search_file(filenames_line_number, fname.strip())
Doing it in pure shell, I'm close:
$ grep -n $(tr '\n' '|' < words.txt | sed 's/|$//') $(cat files.txt)
(Tried to figure out how to remove the $(cat files.txt), but couldn't)
This prints out the words in each file, and prints out the lines where they occur, but it doesn't print out the line in words.txt where that word was located.
There's probably some really ugly (if you didn't think this was ugly enough) stuff I could do, but your real answer is to use a higher-level language. The awk solution is shellish, since most people now consider awk to be just part of the Unix environment. However, if you're using awk, you might as well use Perl, Python, or Ruby.
The only advantage awk has is that it is automatically included in a Linux/Unix distro even if the user who created the distro didn't include any of the development packages. It's rare, but it happens.
In response to your request for comments:
Your code:
flist = open('files.txt')
filenum = 0
for filename in flist:
    filenum = filenum + 1
    filenamey = filename.strip()
    filedata = open(filenamey)
    for fline in filedata:
        wordnum = 0
        wlist = open('words.txt')
        for word in wlist:
            wordnum = wordnum + 1
            sword = word.strip()
            if sword in fline:
                s = repr(filenum) + ' => ' + repr(wordnum)
                print s
You open 'files.txt' but don't close it.
with open('files.txt') as flist: is preferable because it is textually cleaner and it closes the file on its own.
Instead of filenum = filenum + 1, use enumerate().
From now on, never forget enumerate(): it is an extremely useful function, and it is fast, too.
fline isn't a good name for a line variable, IMO; isn't line a better one?
The statement wlist = open('words.txt') isn't in a good place: it is executed not just once for each file opened, but every time a line is analysed.
Moreover, the processing of the words listed in wlist is performed each time wlist is iterated, that is to say for every line. You should move this processing out of all the loops.
wordnum is nothing other than the index of word in wlist. You can again use enumerate(), or simply loop over an index i and use wlist[i] instead of word.
Each time a sword from wlist is found in the line, you do
print repr(filenum) + ' => ' + repr(wordnum)
It would be better to do print repr(filenum) + ' => ' + repr(all_wordnum), in which all_wordnum would be the list of all the swords found in one line.
You keep your list of words in a file. You would do better to serialise this list of words; see the pickle module.
There is also something to improve in how the results are recorded, because executing the instruction
print repr(filenum) + ' => ' + repr(wordnum)
every time is not good practice. The same goes if you want to record to a file: you shouldn't call write() repeatedly. It is better to collect all the results in a list and print or write them once the process is over, using "\n".join(results) or something like that.
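Putting those suggestions together, a rough sketch of how the revised script might look (this is my own illustration of the review points, not the reviewer's code):
results = []

# read the word list once, up front, together with its line numbers
with open('words.txt') as wlist:
    words = [(wordnum, word.strip()) for wordnum, word in enumerate(wlist, 1)]

with open('files.txt') as flist:
    for filenum, filename in enumerate(flist, 1):
        with open(filename.strip()) as filedata:
            for line in filedata:
                # collect every matching word number for this line
                found = [wordnum for wordnum, sword in words if sword in line]
                if found:
                    results.append('%d => %s' % (filenum, found))

# print everything in one go, once the processing is over
print('\n'.join(results))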
A pure sh answer, assuming that words or filenames do not contain any shell metacharacters such as blanks:
nw=0; while read w; do nw=`expr $nw + 1`; nf=0; { while read f; do nf=`expr $nf + 1`; fgrep -n $w $f | sed 's/:.*//' | while read n; do echo $nw =\> $nf; done; done < /tmp/files.txt;}; done < /tmp/words.txt
But I prefer Perl for this kind of thing.
And the Perl script won't be quite as short or readable as carrrot-top's Python code, unless you use IO::All.
