traversing daily dump directories - python

I have 6 months of data to go through, looking like this
0101
0102
.
.
0131
0201
0202
.
.
all the way to
0630
I want to go through each directory and execute an awk file on the contents, or do it in a weekly manner (each 7 directories will make one week of data).
Is there an easy way to do this in awk or python?
many thanks

You can use find to walk your tree and xargs to apply your awk script:
find . -type f | xargs awk -f awkfile
EDIT: awk syntax corrected thanks to input from @nya. I Am Not An AWK Expert.

Why not use plain bash? You can try this:
find . -type f -exec awk -f 'your_awk_script.awk' {} \;
find traverses the directory tree, and the -exec option makes it execute the given command (in this case awk -f your_awk_script.awk) on each file ({} is the default placeholder for the argument).
To run this tiny script every seven days, look into cron.
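Neither answer covers the weekly grouping the question asks about. A minimal Python sketch of that part (the MMDD directory layout is from the question; the awk script name awkfile is a placeholder assumption):

```python
import glob
import subprocess

def weekly_chunks(dirs, size=7):
    """Split a sorted list of day directories into weeks of `size` days."""
    return [dirs[i:i + size] for i in range(0, len(dirs), size)]

def process_weeks(root="."):
    # Day directories are named MMDD, so a lexicographic sort is chronological.
    days = sorted(glob.glob(root + "/[0-9][0-9][0-9][0-9]"))
    for week in weekly_chunks(days):
        files = [f for d in week for f in sorted(glob.glob(d + "/*"))]
        if files:
            # One awk invocation per week of data; drop the chunking
            # to process day by day instead.
            subprocess.run(["awk", "-f", "awkfile"] + files, check=True)
```

The last chunk is simply shorter than seven days when the month doesn't divide evenly.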

Related

grouping and dividing files which contain numbers into separate folders

I wanted to move the files in group of 30 in sequence starting from image_1,image_2... from current folder to the new folder.
the file name pattern is like below
image_1.png
image_2.png
.
.
.
image_XXX.png
I want to move image_[1-30].png to folder fold30
and image_[31-60].png to fold60, and so on.
I have the following code to do this and it works. I wanted to know: is there any shortcut, or any smaller code that I can write for the same?
#!/bin/bash
counter=0
folvalue=30
totalFiles=$(ls -1 image_*.png | sort -V | wc -l)
foldernames=fold$folvalue
for file in $(ls -1 image_*.png | sort -V); do
    ((counter++))
    mkdir -p "$foldernames"
    mv "$file" "./$foldernames/"
    if [[ "$counter" -eq "$folvalue" ]]; then
        let folvalue=folvalue+30
        foldernames="fold${folvalue}"
        echo "$foldernames"
    fi
done
The above code moves image_1, image_2, ... image_30 into folder
fold30
and image_31, ... image_60 into folder
fold60
I really recommend using sed all the time. It's hard on the eyes, but once you get used to it you can do all these jarring tasks in no time.
What it does is simple. Running sed -e "s/regex/substitution/" file goes through each line, replacing patterns matching regex with substitution.
With it you can just transform your input into commands and pipe it to bash.
If you want to know more, there's good documentation here (also not easy on the eyes, though).
Anyway here's the code:
while FILE_GROUP=$(find . -maxdepth 1 -name "image_*.png" | sort -V | head -30) && [ -n "$FILE_GROUP" ]
do
    FOLDER="${YOUR_PREFIX}$(sed -e "s/^.*image_//" -e "s/\.png//" <(echo "$FILE_GROUP" | tail -1))"
    mkdir -p "$FOLDER"
    sed -e "s/\.\///" -e "s|.*|mv & $FOLDER|" <(echo "$FILE_GROUP") | bash
done
And here's what it should do:
- while loop grabs the first 30 files.
- take the number out of the last of those files and name the directory
- mkdir FOLDER
- go through each line and turn $FILE into mv $FILE $FOLDER then execute those lines (pipe to bash)
note: replace $YOUR_PREFIX with your folder prefix
EDIT: surprisingly, the code did not work out of the box (who would have thought...), but I've done some fixing and testing and it should work now.
The simplest way to do that is with rename, a.k.a. Perl rename. It will:
let you run any amount of code of arbitrary complexity to figure out a new name,
let you do a dry run telling you what it would do without doing anything,
warn you if any files would be overwritten,
automatically create intermediate directory hierarchies.
So the command you want is:
rename -n -p -e '(my $num = $_) =~ s/\D//g; $_ = ($num+29)-(($num-1)%30) . "/" . $_' *png
Sample Output
'image_1.png' would be renamed to '30/image_1.png'
'image_10.png' would be renamed to '30/image_10.png'
'image_100.png' would be renamed to '120/image_100.png'
'image_101.png' would be renamed to '120/image_101.png'
'image_102.png' would be renamed to '120/image_102.png'
'image_103.png' would be renamed to '120/image_103.png'
'image_104.png' would be renamed to '120/image_104.png'
...
...
If that looks correct, you can run it again without the -n switch to do it for real.
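The arithmetic in that rename expression maps each file number to the top of its block of 30. If you would rather avoid the Perl rename dependency, the same logic can be sketched in Python (folder names follow the question's fold30/fold60 convention, unlike the bare 30/120 names in the sample output above):

```python
import os
import re
import shutil

def bucket(num, size=30):
    """Top of num's block of `size`: 1..30 -> 30, 31..60 -> 60, 100 -> 120."""
    return (num + size - 1) - ((num - 1) % size)

def move_in_batches(folder="."):
    for name in os.listdir(folder):
        m = re.fullmatch(r"image_(\d+)\.png", name)
        if m:
            dest = os.path.join(folder, "fold%d" % bucket(int(m.group(1))))
            os.makedirs(dest, exist_ok=True)
            shutil.move(os.path.join(folder, name), dest)
```

Because the bucket is computed from the number in the name, no sorting is needed at all.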

How to efficiently find small typos in source code files?

I would like to recursively search a large code base (mostly python, HTML and javascript) for typos in comments, strings and also variable/method/class names. Strong preference for something that runs in a terminal.
The problem is that spell checkers like aspell or scspell find almost only false positives (e.g. programming terms, camelcased terms) while I would be happy if it could help me primarily find simple typos like scrambled or missing letters e.g. maintenane vs. maintenance, resticted vs. restricted, dpeloyment vs. deployment.
What I was playing with so far is:
for f in **/*.py ; do echo $f ; aspell list < $f | uniq -c ; done
but it will find anything like: assertEqual, MyTestCase, lifecycle
This solution of my own focuses on Python files, but in the end it also found typos in HTML and JS. It still needed manual sorting out of false positives, but that only took a few minutes' work, and it identified about 150 typos in comments that could then also be found in the non-comment parts.
Save this as an executable file, e.g. extractcomments:
#!/usr/bin/env python3
import argparse
import io
import tokenize

if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('filename')
    args = parser.parse_args()
    with io.open(args.filename, "r", encoding="utf-8") as sourcefile:
        for t in tokenize.generate_tokens(sourcefile.readline):
            if t.type == tokenize.COMMENT:
                print(t.string.lstrip("#").strip())
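The question also asks about typos in strings. The tokenizer approach extends naturally to string literals; a sketch of that variant (not part of the original script):

```python
import io
import tokenize

def comments_and_strings(filename):
    """Yield comment text and string-literal text from a Python source file."""
    with io.open(filename, "r", encoding="utf-8") as sourcefile:
        for t in tokenize.generate_tokens(sourcefile.readline):
            if t.type == tokenize.COMMENT:
                yield t.string.lstrip("#").strip()
            elif t.type == tokenize.STRING:
                # Strip the surrounding quotes; f/r/b prefixes would
                # need extra handling.
                yield t.string.strip("\"'").strip()
```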
Collect all comments for further processing:
for f in **/*.py ; do ~/extractcomments $f >> ~/comments.txt ; done
Run it recursively on your code base, filter through one or more aspell dictionaries, and count the occurrences of everything identified as a typo:
aspell < ~/comments.txt --lang=en list | aspell --lang=de list | sort | uniq -c | sort -n > ~/typos.txt
Produces something like:
10 availabe
8 assignement
7 hardwird
Take the list without the leading numbers, clean out the false positives, copy it to a second file correct.txt, and run aspell on it to get the desired replacement for each typo: aspell -c correct.txt
Now paste the two files to get a typo;correction format with paste -d";" typos.txt correct.txt > known_typos.csv
Now we want to recursively replace those in our codebase:
#!/bin/bash
root_dir=$(git rev-parse --show-toplevel)
while IFS=";" read -r typo fix ; do
    git grep -l -z -w "${typo}" -- "*.py" "*.html" | xargs -r --null sed -i "s/\b${typo}\b/${fix}/g"
done < "$root_dir/known_typos.csv"
My bash skills are poor so there is certainly space for improvement.
Update: I could find more typos in method names by running this:
grep -r def --include \*.py . | cut -d ":" -f 2- |tr "_" " " | aspell --lang=en list | sort -u
Update 2: Managed to fix typos that are inside underscored names or strings and thus have no word boundaries as such, e.g. i_am_a_typpo3:
#!/bin/bash
root_dir=$(git rev-parse --show-toplevel)
while IFS=";" read -r typo fix ; do
    echo "${typo}"
    find "$root_dir" \( -name '*.py' -or -name '*.html' \) -print0 | xargs -0 perl -pi -e "s/(?<![a-zA-Z])${typo}(?![a-zA-Z])/${fix}/g"
done < "$root_dir/known_typos.csv"
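The lookaround pattern in that perl one-liner can be checked in isolation. A small Python sketch of the same idea (the typo dictionary stands in for known_typos.csv):

```python
import re

def fix_typos(text, typos):
    """Replace typos even inside underscored names, where \\b offers no boundary."""
    for typo, fix in typos.items():
        # (?<![a-zA-Z]) / (?![a-zA-Z]): no letter on either side; digits and
        # underscores are allowed, so i_am_a_typpo3 is still caught, while
        # a typo embedded inside a longer word is left alone.
        pattern = r"(?<![a-zA-Z])" + re.escape(typo) + r"(?![a-zA-Z])"
        text = re.sub(pattern, fix, text)
    return text
```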
If you're using TypeScript, you could use the gulp plugin I created for spell checking:
https://www.npmjs.com/package/gulp-ts-spellcheck
If you are developing in JavaScript or TypeScript, then you can use this spell check plugin for ESLint:
https://www.npmjs.com/package/eslint-plugin-spellcheck
I found it to be very useful.
Another option is scspell:
https://github.com/myint/scspell
It is language-agnostic and claims to "usually catch many errors without an annoying false positive rate."

bash: remove all files except last version in file name

I have 10K+ files like below. The filesystem date (export time) is the same for all files.
YYY101R1.corp.company.org-RUNNINGCONFIG-2015-07-10-23-10-15.config
YYY101R1.corp.company.org-RUNNINGCONFIG-2015-07-11-22-11-10.config
YYY101R1.corp.company.org-RUNNINGCONFIG-2015-10-01-10-05-08.config
LLL101S1.corp.company.org-RUNNINGCONFIG-2015-08-10-23-10-15.config
LLL101S1.corp.company.org-RUNNINGCONFIG-2015-09-11-20-11-10.config
LLL101S1.corp.company.org-RUNNINGCONFIG-2015-10-02-19-05-07.config
How can I delete all files except the last version (latest date) of each file name, and rename it to
YYY101R1.corp.company.org.config
LLL101S1.corp.company.org.config
Thank you.
The UNIX shell command
ls -t YYY101R1.corp.company.org*
will list files in order of age, newest first. Grab the first line as "latest" and make a symbolic ("soft") link to it:
ln -s $latest YYY101R1.corp.company.org.config
Repeat for each file group.
Does that get you going? If not, please post your code and explanation of the specific problem. See https://stackoverflow.com/help/mcve
I got something similar: first get a list of all files sorted ascending by modification time, then count them, list all but the last two, and pass that list to the command removing files.
ls -tr | wc -l
ls -tr | head -n -2 | xargs rm
(with GNU head, head -n -2 prints every line except the last two, so you don't have to substitute the count by hand)
Was it helpful?
FLast=$(ls -tr YYY101R1.corp.company.org* | tail -n 1)
mv "${FLast}" YYY101R1.corp.company.org.config
rm -f YYY101R1.corp.company.org-RUNNINGCONFIG-*
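All three answers handle one host prefix at a time. With 10K+ files it may be easier to group by prefix in one pass; a Python sketch (file-name pattern taken from the question's examples):

```python
import os
import re

PATTERN = re.compile(r"(.+)-RUNNINGCONFIG-(\d{4}(?:-\d{2}){5})\.config$")

def keep_latest(folder="."):
    groups = {}  # prefix -> list of (timestamp, file name)
    for name in os.listdir(folder):
        m = PATTERN.match(name)
        if m:
            prefix, stamp = m.groups()
            groups.setdefault(prefix, []).append((stamp, name))
    for prefix, versions in groups.items():
        # Timestamps are zero-padded YYYY-MM-DD-hh-mm-ss, so string
        # comparison is chronological.
        versions.sort()
        for stamp, name in versions[:-1]:
            os.remove(os.path.join(folder, name))
        os.rename(os.path.join(folder, versions[-1][1]),
                  os.path.join(folder, prefix + ".config"))
```

Print instead of remove/rename first if you want a dry run over the real data.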

Utility To Count Number Of Lines Of Code In Python Or Bash

Is there a quick and dirty way in either python or bash script, that can recursively descend a directory and count the total number of lines of code? We would like to be able to exclude certain directories though.
For example:
start at: /apps/projects/reallycoolapp
exclude: lib/, frameworks/
The excluded directories should be recursive as well. For example:
/app/projects/reallycool/lib SHOULD BE EXCLUDED
/app/projects/reallycool/modules/apple/frameworks SHOULD ALSO BE EXCLUDED
This would be a really useful utility.
Found an awesome utility CLOC. https://github.com/AlDanial/cloc
Here is the command we ran:
perl cloc.pl /apps/projects/reallycoolapp --exclude-dir=lib,frameworks
And here is the output
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
PHP                             32            962           1352           2609
Javascript                       5            176            225            920
Bourne Again Shell               4             45             70            182
Bourne Shell                    12             52            113            178
HTML                             1              0              0             25
-------------------------------------------------------------------------------
SUM:                            54           1235           1760           3914
-------------------------------------------------------------------------------
The find and wc commands alone can solve your problem.
With find you can specify very complex logic like this:
find /apps/projects/reallycoolapp -type f -iname '*.py' ! -path '*/lib/*' ! -path '*/frameworks/*' | xargs wc -l
Here the ! inverts the condition, so this command will count the lines of every Python file not in a lib/ or frameworks/ directory.
Just don't forget the '*' or it will not match anything.
find ./apps/projects/reallycool -type f | \
    grep -v -e /app/projects/reallycool/lib \
           -e /app/projects/reallycool/modules/apple/frameworks | \
    xargs wc -l | \
    cut -d '.' -f 1 | \
    awk 'BEGIN{total=0} {total += $1} END{print total}'
A few notes...
the . after the find is important, since that's how the cut command can separate the count from the file name
this is a multiline command, so make sure there aren't spaces after the escaping slashes
you might need to exclude other files, like .svn files. Also, this will give funny values for binary files, so you might want to use grep to whitelist the specific file types you are interested in, i.e.: grep -e .html$ -e .css$
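If you do want to stay in Python, a sketch with os.walk gives the same recursive exclusion (directory names from the question; pruning dirnames in place stops the descent):

```python
import os

def count_lines(root, exclude=("lib", "frameworks"), exts=(".py",)):
    """Count lines in matching files under root, skipping excluded dirs."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place keeps os.walk out of excluded subtrees
        # at any depth, matching the two SHOULD BE EXCLUDED examples.
        dirnames[:] = [d for d in dirnames if d not in exclude]
        for name in filenames:
            if name.endswith(tuple(exts)):
                with open(os.path.join(dirpath, name), errors="replace") as f:
                    total += sum(1 for _ in f)
    return total
```

Unlike the wc pipelines, this counts binary-ish files gracefully thanks to errors="replace".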

Count lines of code in a Django Project

Is there an easy way to count the lines of code you have written for your django project?
Edit: The shell stuff is cool, but how about on Windows?
Yep:
shell]$ find /my/source -name "*.py" -type f -exec cat {} + | wc -l
Job's a good 'un.
You might want to look at CLOC -- it's not Django specific but it supports Python. It can show you lines counts for actual code, comments, blank lines, etc.
Starting with Aiden's answer, and with a bit of help in a question of my own, I ended up with this god-awful mess:
# find the combined LOC of files
# usage: loc Documents/fourU py html
function loc {
    #find $1 -name $2 -type f -exec cat {} + | wc -l
    namelist=''
    let i=2
    while [ $i -le $# ]; do
        namelist="$namelist -name \"*.${!i}\""
        if [ $i != $# ]; then
            namelist="$namelist -or "
        fi
        let i=i+1
    done
    #echo $namelist
    #echo "find $1 $namelist" | sh
    #echo "find $1 $namelist" | sh | xargs cat
    echo "find $1 $namelist" | sh | xargs cat | wc -l
}
which allows you to specify any number of extensions you want to match. As far as I can tell, it outputs the right answer, but... I thought this would be a one-liner, else I wouldn't have started in bash, and it just kinda grew from there.
I'm sure that those more knowledgable than I can improve upon this, so I'm going to put it in community wiki.
Check out the wc command on unix.
Get wc command on Windows using GnuWin32 (http://gnuwin32.sourceforge.net/packages/coreutils.htm)
wc *.py
