How to efficiently find small typos in source code files?


I would like to recursively search a large code base (mostly Python, HTML and JavaScript) for typos in comments, strings and also variable/method/class names. I strongly prefer something that runs in a terminal.
The problem is that spell checkers like aspell or scspell report almost nothing but false positives (e.g. programming terms, camel-cased terms). I would be happy with a tool that primarily finds simple typos such as scrambled or missing letters, e.g. maintenane vs. maintenance, resticted vs. restricted, dpeloyment vs. deployment.
What I have tried so far is:
for f in **/*.py ; do echo "$f" ; aspell list < "$f" | sort | uniq -c ; done
but it reports things like: assertEqual, MyTestCase, lifecycle

This solution of my own focuses on Python files, but in the end it also found typos in HTML and JS. It still needed manual sorting out of false positives, but that took only a few minutes of work, and it identified about 150 typos in comments that could then also be found in the non-comment parts.
Save this as an executable file, e.g. extractcomments:
#!/usr/bin/env python3
import argparse
import io
import tokenize
if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('filename')
    args = parser.parse_args()
    with io.open(args.filename, "r", encoding="utf-8") as sourcefile:
        for t in tokenize.generate_tokens(sourcefile.readline):
            if t.type == tokenize.COMMENT:
                print(t.string.lstrip("#").strip())
Collect all comments for further processing (in bash, ** only recurses after shopt -s globstar):
for f in **/*.py ; do ~/extractcomments "$f" >> ~/comments.txt ; done
Run the collected comments through aspell with one or more dictionaries, collect everything it identifies as a typo, and count the occurrences:
aspell --lang=en list < ~/comments.txt | aspell --lang=de list | sort | uniq -c | sort -n > ~/typos.txt
Produces something like:
10 availabe
8 assignement
7 hardwird
Take the list without the leading numbers, clean out the false positives, copy it to a second file, correct.txt, and run aspell on it to get the desired replacement for each typo: aspell -c correct.txt
Now paste the two files together to get lines in the format typo;correction: paste -d";" typos.txt correct.txt > known_typos.csv
Now we want to recursively replace those in our code base:
#!/bin/bash
root_dir=$(git rev-parse --show-toplevel)
while IFS=";" read -r typo fix ; do
    git grep -l -z -w "${typo}" -- "*.py" "*.html" | xargs -r --null sed -i "s/\b${typo}\b/${fix}/g"
done < "$root_dir/known_typos.csv"
My bash skills are poor, so there is certainly room for improvement.
Update: I found more typos in method names by running this:
grep -r def --include \*.py . | cut -d ":" -f 2- |tr "_" " " | aspell --lang=en list | sort -u
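The same idea (turning identifiers into plain words before spell checking) can be taken a little further to cover camel-cased names such as assertEqual or MyTestCase. The following is a minimal Python sketch, not part of the original workflow; the function name identifier_words and the splitting rules are assumptions made for illustration.

```python
import re

def identifier_words(name):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    # First split on underscores, digits and any other non-word characters.
    for part in re.split(r"[_\W\d]+", name):
        # Then split camelCase: an acronym run, or an optional capital plus lowercase letters.
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part)
    return [w.lower() for w in words]

print(identifier_words("assertEqual"))         # ['assert', 'equal']
print(identifier_words("MyTestCase"))          # ['my', 'test', 'case']
print(identifier_words("run_maintenane_job"))  # ['run', 'maintenane', 'job']
```

The resulting words could then be written one per line and piped into aspell list, so that only genuinely misspelled parts such as maintenane are reported.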
Update 2: I managed to fix typos that sit inside underscored names or strings and therefore have no word boundaries as such, e.g. i_am_a_typpo3:
#!/bin/bash
root_dir=$(git rev-parse --show-toplevel)
while IFS=";" read -r typo fix ; do
    echo "${typo}"
    find "$root_dir" \( -name '*.py' -or -name '*.html' \) -print0 | xargs -0 perl -pi -e "s/(?<![a-zA-Z])${typo}(?![a-zA-Z])/${fix}/g"
done < "$root_dir/known_typos.csv"
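The letter-boundary replacement can also be sketched in Python, which may be easier to extend than the perl one-liner. This is an illustrative sketch only: load_typos and fix_typos are hypothetical helper names, and the input is assumed to use the typo;correction format of known_typos.csv.

```python
import re

def load_typos(csv_text):
    """Parse 'typo;correction' lines into a dict, ignoring blank lines."""
    pairs = {}
    for line in csv_text.splitlines():
        if line.strip():
            typo, fix = line.split(";", 1)
            pairs[typo] = fix
    return pairs

def fix_typos(text, pairs):
    """Replace each typo unless it is directly surrounded by ASCII letters."""
    for typo, fix in pairs.items():
        pattern = r"(?<![a-zA-Z])" + re.escape(typo) + r"(?![a-zA-Z])"
        # A function replacement keeps backslashes in `fix` literal.
        text = re.sub(pattern, lambda match, fix=fix: fix, text)
    return text

pairs = load_typos("typpo;typo\nresticted;restricted\n")
print(fix_typos("def i_am_a_typpo3(): pass  # resticted access", pairs))
# def i_am_a_typo3(): pass  # restricted access
```

Because the lookarounds only exclude letters, a typo next to an underscore or digit is still replaced, while the same letters inside a longer word are left alone.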

If you're using TypeScript, you could use the gulp plugin I created for spellchecking:
https://www.npmjs.com/package/gulp-ts-spellcheck

If you are developing in JavaScript or TypeScript, you can use this spell-check plugin for ESLint:
https://www.npmjs.com/package/eslint-plugin-spellcheck
I found it to be very useful.
Another option is scspell:
https://github.com/myint/scspell
It is language-agnostic and claims to "usually catch many errors without an annoying false positive rate."

Related

arranging text files side by side using python

I have 3000 text files in a directory, and each .txt file contains a single column of data. I want to arrange them side by side to make an m x n matrix file.
For example: paste 1.txt 2.txt 3.txt 4.txt ... 3000.txt in Linux
For this I tried
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
However, it gives the error paste: filename.txt: Too many open files
Please suggest a better solution for this using Python.

Based on the inputs received, please follow the steps below.
# Raise the ulimit to increase the number of files that can be open at a time
$ ulimit -n 4096
# Remove blank lines from all the files
$ sed -i '/^[[:space:]]*$/d' *.txt
# Join all files side by side to form a matrix view
$ paste $(ls -v *.txt) > matrix.txt
# Fill the blank values in the matrix view with 0s using awk in place
$ awk -i inplace 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = 0 }; 1' matrix.txt

You don't need python for this; if you first increase the number of open files a process can have using ulimit, it becomes easy to get columns in the right order in bash, zsh, or ksh93 shells, using paste and brace expansion to generate the filenames in the desired order instead of having to sort the results of filename expansion:
% ulimit -n 4096
% paste {1..3000}.txt > matrix.txt
(I tested this in all three shells I mentioned on a Linux box, and it works with all of them with no errors about the command line being too long or anything else.)
You could also arrange to have the original files use a different naming scheme that sorts naturally, like 0001.txt, 0002.txt, ..., 3000.txt and then just paste [0-9]*.txt > matrix.txt.
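Since the question asks for Python, here is a minimal sketch that sidesteps the open-file limit by reading one file at a time. The helper names paste_columns and read_column are made up for this example, and padding short columns with 0 is an assumption borrowed from the awk step in the other answer.

```python
from itertools import zip_longest

def paste_columns(columns, fill="0", sep="\t"):
    """Join single-column lists side by side, padding short columns with `fill`."""
    return [sep.join(row) for row in zip_longest(*columns, fillvalue=fill)]

def read_column(path):
    """Read one file fully and close it again, so only one file is open at a time."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]

# Example with in-memory data. For real files, `columns` could be built with
# [read_column(f"{i}.txt") for i in range(1, 3001)] (file names as in the question).
rows = paste_columns([["1", "2", "3"], ["4", "5"], ["6"]])
print("\n".join(rows))
```

This holds all columns in memory, which should be fine for 3000 small single-column files but is a trade-off to keep in mind for very large inputs.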

Grouping and dividing files whose names contain numbers into separate folders

I want to move files in groups of 30, in sequence starting from image_1, image_2, ..., from the current folder to a new folder.
The file name pattern is as follows:
image_1.png
image_2.png
.
.
.
image_XXX.png
I want to move image_[1-30].png to folder fold30
and image_[31-60].png to fold60, and so on.
I have the following code to do this and it works, but I would like to know whether there is a shortcut,
or any shorter code that I could write for the same task.
#!/bin/bash
counter=0
folvalue=30
totalFiles=$(ls -1 image_*.png | sort -V | wc -l)
foldernames=fold$folvalue
for file in $(ls -1 image_*.png | sort -V )
do
    ((counter++))
    mkdir -p "$foldernames"
    mv "$file" "./$foldernames/"
    if [[ "$counter" -eq "$folvalue" ]];
    then
        let folvalue=folvalue+30
        foldernames="fold${folvalue}"
        echo "$foldernames"
    fi
done
The above code moves image_1, image_2, ..., image_30 to folder fold30
and image_31, ..., image_60 to folder fold60.

I really recommend using sed all the time. It's hard on the eyes, but once you get used to it you can do all these jarring tasks in no time.
What it does is simple. Running sed -e "s/regex/substitution/" <(cat file) goes through each line, replacing text that matches regex with substitution.
With it you can just transform your input into commands and pipe them to bash.
If you want to know more there's good documentation here. (also not easy on the eyes though)
Anyway here's the code:
while FILE_GROUP=$(find . -maxdepth 1 -name "image_*.png" | sort -V | head -30) && [ -n "$FILE_GROUP" ]
do
    FOLDER="${YOUR_PREFIX}$(sed -e "s/^.*image_//" -e "s/\.png//" <(echo "$FILE_GROUP" | tail -1))"
    mkdir -p "$FOLDER"
    sed -e "s/\.\///" -e "s|.*|mv & $FOLDER|" <(echo "$FILE_GROUP") | bash
done
And here's what it should do:
- while loop grabs the first 30 files.
- take the number out of the last of those files and name the directory
- mkdir FOLDER
- go through each line and turn $FILE into mv $FILE $FOLDER then execute those lines (pipe to bash)
Note: replace $YOUR_PREFIX with your folder prefix.
EDIT: Surprisingly, the code did not work out of the box (who would have thought...), but I've done some fixing and testing and it should work now.

The simplest way to do that is with rename, a.k.a. Perl rename. It will:
- let you run any amount of code of arbitrary complexity to figure out a new name,
- let you do a dry run telling you what it would do without doing anything,
- warn you if any files would be overwritten,
- automatically create intermediate directory hierarchies.
So the command you want is:
rename -n -p -e '(my $num = $_) =~ s/\D//g; $_ = ($num+29)-(($num-1)%30) . "/" . $_' *png
Sample Output
'image_1.png' would be renamed to '30/image_1.png'
'image_10.png' would be renamed to '30/image_10.png'
'image_100.png' would be renamed to '120/image_100.png'
'image_101.png' would be renamed to '120/image_101.png'
'image_102.png' would be renamed to '120/image_102.png'
'image_103.png' would be renamed to '120/image_103.png'
'image_104.png' would be renamed to '120/image_104.png'
...
...
If that looks correct, you can run it again without the -n switch to do it for real.
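The arithmetic in the rename expression rounds each image number up to the next multiple of 30. As a rough Python sketch of the same calculation (the function name folder_for and the fold prefix, taken from the question, are assumptions for illustration):

```python
import re

def folder_for(filename, group_size=30, prefix="fold"):
    """Return the folder for image_N.png: N rounded up to a multiple of group_size."""
    match = re.search(r"(\d+)", filename)
    if match is None:
        raise ValueError(f"no number in {filename!r}")
    number = int(match.group(1))
    # Equivalent to (number + 29) - ((number - 1) % 30) when group_size is 30.
    upper = ((number - 1) // group_size + 1) * group_size
    return f"{prefix}{upper}"

for name in ["image_1.png", "image_30.png", "image_31.png", "image_100.png"]:
    print(name, "->", folder_for(name))
# image_1.png -> fold30
# image_30.png -> fold30
# image_31.png -> fold60
# image_100.png -> fold120
```

To actually move files, the result could be passed to os.makedirs(folder, exist_ok=True) followed by shutil.move; that part is left out here to keep the sketch free of side effects.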

How do I write a for-loop so a program reiterates itself for a set of 94 DNA samples?

I have written some code in a bash shell (so I can submit it to my university's supercomputer) to edit out contaminant sequences from a batch of DNA extracts I have. Essentially what this code does is take the sequences from the negative extraction blank I did (A1-BLANK) and subtract it from all of the other samples.
I have figured out how to get this to work with individual samples, but I'm attempting to write a for loop so that the little chunks of code will repeat for each sample, with the outcome being a .sam file with a unique name for each sample, where both the forward and reverse reads for the sample are merged and edited for contamination. I have checked Stack Overflow extensively for help with this specific problem, but haven't been able to apply related answered questions to my code.
Here's an example of part of what I'm trying to do for an individual sample, named F10-61C-3-V4_S78_L001_R1_001.fastq:
bowtie2 -q --end-to-end --very-sensitive \ ##bowtie2 is a program that examines sequence similarity compared to a standard
-N 0 -L 31 --time --reorder \
-x A1-BlankIndex \ ##This line compares the sample to the negative extraction blank
-1 /file directory/F10-61C-3-V4_S78_L001_R1_001.fastq
-2 /file directory/F10-61C-3-V4_S78_L001_R2_001.fastq \ ##These two lines above merge the forward and reverse reads of the DNA sequences within the individual files into one file
-S 61C-3.sam ##This line renames the merged and edited file and transforms it into a .sam file
Here's what I've got so far for this little step of the process:
for file in /file directory/*.fastq
do
bowtie2 -q --end-to-end --very-sensitive \
-N 0 -L 31 --time --reorder \
-x A1-BlankIndex \
-1 /file directory/*.fastq
-2 /file directory/*.fastq \
-S *.sam
done
In my resulting slurm file, the error I'm getting right now has to do with the -S option. I'm not sure how to give each merged and edited sample a unique name for the .sam file. I'm new to writing for loops in bash (my only experience is in R) and I'm sure it's a simple fix, but I haven't been able to find any specific answers to this question.

Here's a first try. Note I assume the entire fragment between do and done is one command, and therefore needs continuation markers (\).
Also note that in my example "$file" occurs twice. I feel a bit uneasy about this, but you seem to explicitly need this in your described example.
And finally note I am giving the sam file just a numeric name, because I don't really know what you would like that name to be.
I hope this provides enough information to get you started.
#!/bin/bash
i=0
for file in /file/directory/*.fastq
do
    bowtie2 -q --end-to-end --very-sensitive \
        -N 0 -L 31 --time --reorder \
        -x A1-BlankIndex \
        -1 "$file" \
        -2 "$file" \
        -S "$i".sam
    i=$((i+1))
done

This may work like your example, but it automatically selects the output file name reference with a regex:
#!/usr/bin/env bash
input_samples='/input_samples_directory'
output_samples='/output_merged_samples_directory'
while IFS= read -r -d '' R1_fastq; do
    # Deduce R2 sample from R1 sample file name
    R2_fastq="${R1_fastq/_R1_/_R2_}"
    # RegEx match capture group in () for the output sample reference
    [[ $R1_fastq =~ [^-]+-([[:digit:]]+[[:alpha:]]-[[:digit:]]).* ]]
    # Construct the output sample file path with the reference captured
    # by the RegEx above
    sam="$output_samples/${BASH_REMATCH[1]}.sam"
    # Perform the merging
    bowtie2 -q --end-to-end --very-sensitive \
        -N 0 -L 31 --time --reorder \
        -x A1-BlankIndex \
        -1 "$R1_fastq" \
        -2 "$R2_fastq" \
        -S "$sam"
done < <(find "$input_samples" -maxdepth 1 -type f -name '*_R1_*.fastq' -print0)
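To check the name handling without running bowtie2, the two string steps of the script (swapping _R1_ for _R2_ and capturing the sample reference) can be tried in isolation. The following Python sketch is only an illustration: plan_alignment is a hypothetical helper, and the /data and /out directories are placeholders.

```python
import re
from pathlib import PurePosixPath

# Same idea as the bash regex: skip up to the first '-', then capture e.g. "61C-3".
SAMPLE_RE = re.compile(r"^[^-]+-(\d+[A-Za-z]-\d)")

def plan_alignment(r1_path, output_dir):
    """Return (r1, r2, sam) paths for one forward-read file."""
    r1 = PurePosixPath(r1_path)
    match = SAMPLE_RE.match(r1.name)
    if match is None:
        raise ValueError(f"unexpected file name: {r1.name}")
    r2 = r1.with_name(r1.name.replace("_R1_", "_R2_", 1))
    sam = PurePosixPath(output_dir) / f"{match.group(1)}.sam"
    return str(r1), str(r2), str(sam)

print(plan_alignment("/data/F10-61C-3-V4_S78_L001_R1_001.fastq", "/out"))
# ('/data/F10-61C-3-V4_S78_L001_R1_001.fastq', '/data/F10-61C-3-V4_S78_L001_R2_001.fastq', '/out/61C-3.sam')
```

Unlike the bash version, this sketch matches against the file name only, so a hyphen in a directory name cannot confuse the capture group.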

Bash use of temporary file pipes disallows redirection to dependent files? [duplicate]

Basically I want to take as input text from a file, remove a line from that file, and send the output back to the same file. Something along these lines if that makes it any clearer.
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name > file_name
however, when I do this I end up with a blank file.
Any thoughts?

Use sponge for this kind of task. It's part of moreutils.
Try this command:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | sponge file_name

You cannot do that because bash processes the redirections first, then executes the command. So by the time grep looks at file_name, it is already empty. You can use a temporary file though.
#!/bin/sh
tmpfile=$(mktemp)
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name > ${tmpfile}
cat ${tmpfile} > file_name
rm -f ${tmpfile}
This uses mktemp to create the temporary file; note that mktemp is not specified by POSIX.
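For comparison, the temporary-file approach can be sketched in Python as well. This is a minimal illustration under a few assumptions: filter_in_place is a made-up name, the grep pattern is rewritten in Python regex syntax, and the temporary file is created next to the original so that the final rename stays on one file system.

```python
import os
import re
import tempfile

def filter_in_place(path, pattern):
    """Drop lines matching `pattern`, writing via a temp file in the same directory."""
    regex = re.compile(pattern)
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, encoding="utf-8") as src, tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False
    ) as tmp:
        tmp.writelines(line for line in src if not regex.search(line))
    os.replace(tmp.name, path)  # the original is only replaced after a full write

# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as workdir:
    demo = os.path.join(workdir, "file_name")
    with open(demo, "w", encoding="utf-8") as handle:
        handle.write("keep\nseg12.3 drop\nalso keep\n")
    filter_in_place(demo, r"seg[0-9]+\.[0-9]")
    with open(demo, encoding="utf-8") as handle:
        print(handle.read(), end="")
```

One difference from the shell version: the replaced file takes on the permissions of the temporary file, so they may need to be restored afterwards.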

Use sed instead:
sed -i '/seg[0-9]\{1,\}\.[0-9]\{1\}/d' file_name

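Try this simple one:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | tee file_name
Your output is also printed to your terminal. Be aware that this is a race, though: tee truncates file_name as soon as it starts, which can happen before grep has read the file, so the file may still end up blank.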

You can't use redirection operator (> or >>) to the same file, because it has a higher precedence and it will create/truncate the file before the command is even invoked. To avoid that, you should use appropriate tools such as tee, sponge, sed -i or any other tool which can write results to the file (e.g. sort file -o file).
Basically redirecting input to the same original file doesn't make sense and you should use appropriate in-place editors for that, for example Ex editor (part of Vim):
ex '+g/seg[0-9]\{1,\}\.[0-9]\{1\}/d' -scwq file_name
where:
'+cmd'/-c - run any Ex/Vim command
g/pattern/d - remove lines matching a pattern using global (help :g)
-s - silent mode (man ex)
-c wq - execute :write and :quit commands
You may use sed to achieve the same (as already shown in other answers); however, in-place editing (-i) is a non-standard extension whose syntax differs between GNU and BSD sed, and sed is basically a stream editor, not a file editor. See: Does Ex mode have any practical use?

One liner alternative - set the content of the file as variable:
VAR=`cat file_name`; echo "$VAR"|grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' > file_name

Since this question is the top result in search engines, here's a one-liner based on https://serverfault.com/a/547331 that uses a subshell instead of sponge (which often isn't part of a vanilla install like OS X):
echo "$(grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name)" > file_name
The general case is:
echo "$(cat file_name)" > file_name
Edit: the above solution has some caveats:
printf '%s' <string> should be used instead of echo <string> so that files containing -n don't cause undesired behavior.
Command substitution strips trailing newlines (this is a bug/feature of shells like bash) so we should append a postfix character like x to the output and remove it on the outside via parameter expansion of a temporary variable like ${v%x}.
Using a temporary variable $v stomps the value of any existing variable $v in the current shell environment, so we should nest the entire expression in parentheses to preserve the previous value.
Another bug/feature of shells like bash is that command substitution strips unprintable characters like null from the output. I verified this by calling dd if=/dev/zero bs=1 count=1 >> file_name and viewing it in hex with cat file_name | xxd -p. But echo $(cat file_name) | xxd -p is stripped. So this answer should not be used on binary files or anything using unprintable characters, as Lynch pointed out.
The general solution (albeit slightly slower, more memory intensive and still stripping unprintable characters) is:
(v=$(cat file_name; printf x); printf '%s' "${v%x}" > file_name)
Test from https://askubuntu.com/a/752451:
printf "hello\nworld\n" > file_uniquely_named.txt && for ((i=0; i<1000; i++)); do (v=$(cat file_uniquely_named.txt; printf x); printf '%s' "${v%x}" > file_uniquely_named.txt); done; cat file_uniquely_named.txt; rm file_uniquely_named.txt
Should print:
hello
world
Whereas calling cat file_uniquely_named.txt > file_uniquely_named.txt in the current shell:
printf "hello\nworld\n" > file_uniquely_named.txt && for ((i=0; i<1000; i++)); do cat file_uniquely_named.txt > file_uniquely_named.txt; done; cat file_uniquely_named.txt; rm file_uniquely_named.txt
Prints an empty string.
I haven't tested this on large files (probably over 2 or 4 GB).
I have borrowed this answer from Hart Simha and kos.

This is very much possible, you just have to make sure that by the time you write the output, you're writing it to a different file. This can be done by removing the file after opening a file descriptor to it, but before writing to it:
exec 3<file ; rm file; COMMAND <&3 >file ; exec 3>&-
Or line by line, to understand it better :
exec 3<file # open a file descriptor reading 'file'
rm file # remove file (but fd3 will still point to the removed file)
COMMAND <&3 >file # run command, with the removed file as input
exec 3>&- # close the file descriptor
It's still a risky thing to do, because if COMMAND fails to run properly, you'll lose the file contents. That can be mitigated by restoring the file if COMMAND returns a non-zero exit code :
exec 3<file ; rm file; COMMAND <&3 >file || cat <&3 >file ; exec 3>&-
We can also define a shell function to make it easier to use :
# Usage: replace FILE COMMAND
replace() { exec 3<"$1" ; rm "$1"; "${@:2}" <&3 >"$1" || cat <&3 >"$1" ; exec 3>&- ; }
Example :
$ echo aaa > test
$ replace test tr a b
$ cat test
bbb
Also, note that this will keep a full copy of the original file (until the third file descriptor is closed). If you're using Linux, and the file you're processing on is too big to fit twice on the disk, you can check out this script that will pipe the file to the specified command block-by-block while unallocating the already processed blocks. As always, read the warnings in the usage page.
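The open-then-unlink trick is not specific to the shell. As a rough Python sketch for POSIX systems (replace_via_unlinked_handle is a hypothetical name, and the approach is assumed not to work on Windows, where an open file cannot be removed):

```python
import os
import tempfile

def replace_via_unlinked_handle(path, transform):
    """Open `path`, unlink its name, then write transform(line) to a new file of that name."""
    with open(path, encoding="utf-8") as original:  # the handle keeps the old data readable
        os.unlink(path)                             # the name is gone, the open file is not
        with open(path, "w", encoding="utf-8") as fresh:
            for line in original:
                fresh.write(transform(line))

# Demonstration mirroring the `tr a b` example.
with tempfile.TemporaryDirectory() as workdir:
    demo = os.path.join(workdir, "test")
    with open(demo, "w", encoding="utf-8") as handle:
        handle.write("aaa\n")
    replace_via_unlinked_handle(demo, lambda line: line.replace("a", "b"))
    with open(demo, encoding="utf-8") as handle:
        print(handle.read(), end="")  # bbb
```

The same risk applies as in the shell version: if the transformation fails halfway, the original contents are only reachable through the still-open handle.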

The following will accomplish the same thing that sponge does, without requiring moreutils:
shuf --output=file --random-source=/dev/zero
The --random-source=/dev/zero part tricks shuf into doing its thing without doing any shuffling at all, so it will buffer your input without altering it.
However, it is true that using a temporary file is best, for performance reasons. So, here is a function that I have written that will do that for you in a generalized way:
# Pipes a file into a command, and pipes the output of that command
# back into the same file, ensuring that the file is not truncated.
# Parameters:
# $1: the file.
# $2: the command. (With $3... being its arguments.)
# See https://stackoverflow.com/a/55655338/773113
siphon()
{
    local tmp file rc=0
    [ "$#" -ge 2 ] || { echo "Usage: siphon filename [command...]" >&2; return 1; }
    file="$1"; shift
    tmp=$(mktemp -- "$file.XXXXXX") || return
    "$@" <"$file" >"$tmp" || rc=$?
    mv -- "$tmp" "$file" || rc=$(( rc | $? ))
    return "$rc"
}

There's also ed (as an alternative to sed -i):
# cf. http://wiki.bash-hackers.org/howto/edit-ed
printf '%s\n' H 'g/seg[0-9]\{1,\}\.[0-9]\{1\}/d' wq | ed -s file_name

You can use slurp with POSIX Awk:
!/seg[0-9]\{1,\}\.[0-9]\{1\}/ {
q = q ? q RS $0 : $0
}
END {
print q > ARGV[1]
}

This does the trick pretty nicely in most of the cases I faced:
cat <<< "$(do_stuff_with f)" > f
Note that while $(…) strips trailing newlines, <<< ensures a final newline, so generally the result is magically satisfying.
(Look for “Here Strings” in man bash if you want to learn more.)
Full example:
#! /usr/bin/env bash
get_new_content() {
    sed 's/Initial/Final/g' "${1:?}"
}
echo 'Initial content.' > f
cat f
cat <<< "$(get_new_content f)" > f
cat f
This does not truncate the file and yields:
Initial content.
Final content.
Note that I used a function here for the sake of clarity and extensibility, but that’s not a requirement.
A common use case is editing JSON:
echo '{ "a": 12 }' > f
cat f
cat <<< "$(jq '.a = 24' f)" > f
cat f
This yields:
{ "a": 12 }
{
  "a": 24
}

Try this
echo -e "AAA\nBBB\nCCC" > testfile
cat testfile
AAA
BBB
CCC
echo "$(grep -v 'AAA' testfile)" > testfile
cat testfile
BBB
CCC

I usually use the tee program to do this:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | tee file_name
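Note that tee does not go through a temporary file: it truncates file_name when it starts, so this only works if grep has already read its input by then, which is not guaranteed.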

bach linux file rename - how to rename multiple files in linux console

I would like to rename about 1000 files that are named like:
66-123123.jpg -> abc-123123-66.jpg. So in general the file format is:
xx-yyyyyy.jpg -> abc-yyyyyy-xx.jpg, where xx and yyyyyy are numbers and abc is a string.
Can someone help me with a bash or Python script?

Try doing this :
rename 's/(\d{2})-(\d{6})\.jpg/abc-$2-$1.jpg/' *.jpg
There are other tools with the same name which may or may not be able to do this, so be careful.
If you run the following command (linux)
$ file $(readlink -f $(type -p rename))
and you have a result like
.../rename: Perl script, ASCII text executable
then this seems to be the right tool =)
If not, to make it the default (usually already the case) on Debian and derivatives like Ubuntu:
$ sudo update-alternatives --set rename /path/to/rename
(replace /path/to/rename with the path of your Perl rename command).
If you don't have this command, search your package manager to install it or do it manually.
Last but not least, this tool was originally written by Larry Wall, the creator of Perl.

for file in ??-??????.jpg ; do
    [[ $file =~ (..)-(......)\.jpg ]]
    mv "$file" "abc-${BASH_REMATCH[2]}-${BASH_REMATCH[1]}.jpg" ;
done
This requires bash 3.0 or later for the regex support. For POSIX-like shells, this will do:
for f in ??-??????.jpg ; do
    g=${f%.jpg} # remove the extension
    a=${g%-*} # remove the trailing "-yyyyyy"
    b=${g#*-} # remove the leading "xx-"
    mv "$f" "abc-$b-$a.jpg" ;
done

You could use the rename command, which renames multiple files using regular expressions. In this case you would like to write
rename 's/(\d\d)-(\d\d\d\d\d\d)/abc-$2-$1/' *
where \d means a digit, and $1 and $2 refer to the values matched by the first and second parenthesized groups.
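Since the question also allows a Python script, the same capture-group swap can be sketched with the re module. This is a minimal illustration; new_name and its prefix parameter are hypothetical, and the actual renaming is left to the caller.

```python
import re

# Two digits, a hyphen, six digits, then the .jpg extension.
PATTERN = re.compile(r"^(\d{2})-(\d{6})\.jpg$")

def new_name(old_name, prefix="abc"):
    """Map xx-yyyyyy.jpg to prefix-yyyyyy-xx.jpg; return None for other names."""
    match = PATTERN.match(old_name)
    if match is None:
        return None
    xx, yyyyyy = match.groups()
    return f"{prefix}-{yyyyyy}-{xx}.jpg"

print(new_name("66-123123.jpg"))  # abc-123123-66.jpg
print(new_name("notes.txt"))      # None
```

To apply it, one could loop over os.listdir(".") and call os.rename(old, new) for every name where new_name does not return None, ideally after printing the planned renames as a dry run.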

Being able to do things like this easily is why I name my files the way I do. Using a + sign lets me cut them all up into variables, and then I can just rearrange them with echo.
#!/usr/bin/env bash
set -x
find *.jpg -type f | while read files
do
    newname=$(echo "${files}" | sed s'#-#+#'g | sed s'#\.jpg#+.jpg#'g)
    field1=$(echo "${newname}" | cut -d'+' -f1)
    field2=$(echo "${newname}" | cut -d'+' -f2)
    field3=$(echo "${newname}" | cut -d'+' -f3)
    finalname=$(echo "abc-${field2}-${field1}.${field3}")
    mv "${files}" "${finalname}"
done
