I have a collection of large (~100,000,000 line) text files in the format:
0.088293 1.3218e-32 2.886e-07 2.378e-02 21617 28702
0.111662 1.1543e-32 3.649e-07 1.942e-02 93804 95906
0.137970 1.2489e-32 4.509e-07 1.917e-02 89732 99938
0.149389 8.0725e-32 4.882e-07 2.039e-02 71615 69733
...
And I'd like to find the mean and sum of column 2, the maximum and minimum values of columns 3 and 4, and the total number of lines. How can I do this efficiently using NumPy? Because of their size, loadtxt and genfromtxt are no good (they take a long time to execute) since they attempt to read the whole file into memory. In contrast, Unix tools like awk:
awk '{ total += $2 } END { print total/NR }' <filename>
work in a reasonable amount of time.
Can Python/NumPy do the job of awk for such big files?
You can say something like:
awk '{ total2 += $2
       for (i=2;i<=3;i++) {
           max[i] = (length(max[i]) && max[i]>$i) ? max[i] : $i
           min[i] = (length(min[i]) && min[i]<$i) ? min[i] : $i
       }
     } END {
       print "items", "average2", "min2", "min3", "max2", "max3"
       print NR, total2/NR, min[2], min[3], max[2], max[3]
     }' file
Test
With your given input:
$ awk '{total2 += $2; for (i=2;i<=3;i++) {max[i]=(length(max[i]) && max[i]>$i)?max[i]:$i; min[i]=((length(min[i]) && min[i]<$i)?min[i]:$i)}} END {print "items", "average2", "min2", "min3", "max2", "max3"; print NR, total2/NR, min[2], min[3], max[2], max[3]}' a | column -t
items average2 min2 min3 max2 max3
4 2.94938e-32 1.1543e-32 2.886e-07 8.0725e-32 4.882e-07
Loop through the lines and apply a regex to extract the data you are looking for, appending it to an initially empty list for each column you want.
Once you have each column in list form, you can apply max(list), min(list), and an average (e.g. sum(list)/len(list)) to get whatever statistics you are interested in.
Note: you may need to revise where you add the data to the list and convert the numbers from str to float so that the max, min, and average functions can operate on them.
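With ~100,000,000 lines, holding full per-column lists in memory can be as problematic as loadtxt, so one variation on the approach above is to keep running aggregates while looping. A rough sketch only; the file name data.txt and the use of plain split() instead of a regex are assumptions:

import math

count = 0
total2 = 0.0
min3 = min4 = math.inf
max3 = max4 = -math.inf

with open("data.txt") as f:  # placeholder file name
    for line in f:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines
        c2, c3, c4 = float(fields[1]), float(fields[2]), float(fields[3])
        count += 1
        total2 += c2
        min3, max3 = min(min3, c3), max(max3, c3)
        min4, max4 = min(min4, c4), max(max4, c4)

print("lines:", count)
print("col2 sum:", total2, "col2 mean:", total2 / count)
print("col3 min/max:", min3, max3)
print("col4 min/max:", min4, max4)

This makes a single pass over the file and keeps only a handful of scalars in memory, which is the same trick the awk one-liner relies on.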
I have spent a lot of time on this; any help would be appreciated.
I have two files as below. What I want to do is search for every item of f1_col1 and f1_col2 separately inside f2_col3; if an item exists, save it and add its related f2_col1 value as a new column in the new df.
f1 (two columns):
f1_col1,f1_col2
kctd,Malat1
Gas5,Snhg6
f2 (three columns):
f2_col1,f2_col2,f2_col3
chr7,snRNA,Gas5
chr1,protein_coding,Malat1
chr2,TEC,Snhg6
chr1,TEC,kctd
So based on the two files mentioned the desired output should be:
new_df:
f1_col1,f1_col2,f2_col1,f2_col1
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2
note: f2_col2 is not important.
I do not have a strong programming background and found this very difficult. I have checked multiple sources but have not been able to develop a solution. Any help is appreciated. Thanks.
Based on one possible interpretation of your requirements and the one sunny-day example you provided, where every key field always matches on every line, this MAY be what you're trying to do:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    if ( FNR == 1 ) {
        hdr = $1
    }
    map[$3] = $1
    next
}
{ print $0, ( FNR>1 ? map[$1] OFS map[$2] : hdr OFS hdr ) }
$ awk -f tst.awk f2 f1
f1_col1,f1_col2,f2_col1,f2_col1
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2
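For completeness, the same lookup idea in Python, as a sketch only (it uses the csv module; the output file name new_df.csv is an assumption, the input names f1 and f2 are as in the question):

import csv

# Build a lookup from f2_col3 (name) to f2_col1 (chromosome), like the awk map
lookup = {}
with open("f2", newline="") as f2:
    for row in csv.DictReader(f2):
        lookup[row["f2_col3"]] = row["f2_col1"]

with open("f1", newline="") as f1, open("new_df.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["f1_col1", "f1_col2", "f2_col1", "f2_col1"])
    for row in csv.DictReader(f1):
        writer.writerow([row["f1_col1"], row["f1_col2"],
                         lookup.get(row["f1_col1"], ""),
                         lookup.get(row["f1_col2"], "")])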
I want to clean out all the "waste" (everything that makes the files unsuitable for analysis) in unstructured text files.
In this specific situation, one option to retain only the wanted information is to keep only the numbers of 250 and above (the text is a combination of strings, numbers, ...).
For a large number of text files, I want to perform the following action in R:
x <- x[which(x >= "250"),]
The code above works perfectly for one text file; when I try to do the same in a loop over the large number of text files, it fails (error: incorrect number of dimensions).
for(i in 1:length(files)){
  i <- i[which(i >= "250"),]
}
Does anyone have any idea how to solve this in R (or Python)?
Picture: a very simplified example of a text file; I want to retain everything between (START) and (END).
This makes no sense: if it is 10K files, why are you even trying to do this in R or Python? Why not just a simple awk or bash command? Moreover, your image shows parsing the info between START and END from the text files; it is not clear whether it is a data frame with columns across (try to post a simple dput rather than images).
All you are trying to do is grep between START and END across 10K files. I would do that in bash.
Something like this in bash should work:
for i in *.txt
do
    sed -n '/START/,/END/{//!p}' "$i" > "$i.edited.txt"
done
If the columns are standard across files, in R you can do the following (but I would not read 10K files into R memory): read the files as a list of data frames, then simply do an lapply:
a = data.frame(col1 = c(100,250,300))
b = data.frame(col1 = c(250,450,100,346))
c = data.frame(col1 = c(250,123,122,340))
df_list <- list(a = a ,b = b,c = c)
lapply(df_list, subset, col1 >= 250)
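If you would rather do it in Python, here is a rough sketch that combines the two filters above (keep only what sits between START and END, then keep only numbers of 250 and above); the *.txt pattern and the .edited.txt output names are assumptions:

import glob

for path in glob.glob("*.txt"):  # assumed input pattern
    kept = []
    inside = False
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if "START" in stripped:
                inside = True
                continue
            if "END" in stripped:
                inside = False
                continue
            if inside:
                try:
                    if float(stripped) >= 250:  # keep only numbers >= 250
                        kept.append(stripped)
                except ValueError:
                    pass  # drop non-numeric lines
    with open(path + ".edited.txt", "w") as out:  # assumed output name
        out.write("\n".join(kept) + "\n")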
I am a beginner in bash script programming. I wrote a small script in Python that builds a list of values from an initial value up to a final value, increasing in steps of 0.5. The Python script is:
# A script for grid corner calculation.
#The goal of the script is to figure out the longitude values which increase with 0.5
# The first and the last longitude values according to CDO sinfo
L1 = -179.85
L2 = 179.979
# The list of the longitude values
n = []
while (L1 < L2):
    L1 = L1 + 0.5
    n.append(L1)
print "The longitude values are:", n
print "The number of longitude values:", len(n)
I would like to create the same script in bash. I tried the following:
!#/bin/bash
L1=-180
L2=180
field=()
while [ $L1 -lt $L2 ]
do
    scale=2
    L1=$L1+0.5 | bc
    field+=("$L1")
done
echo ${#field[@]}
but it does not work. Could someone tell me what I did wrong?
I would appreciate it if someone helped me.
You aren't correctly assigning the value to L1. Also, -lt expects integers, so the comparison will fail after the first iteration.
while [ "$(echo "$L1 < $L2" | bc)" = 1 ]; do
do
L1=$(echo "scale=2; $L1+0.5" | bc)
field+=("$L1")
done
If you have seq available, you can use that instead, e.g.,
field=( $(seq -f %.2f -180 0.5 179.5) )
I have about 50 CSV files with 60,000 rows in each, and a varying number of columns. I want to merge all the CSV files by column. I've tried doing this in MATLAB by transposing each csv file and re-saving it to disk, and then using the command line to concatenate them. This took my computer over a week and the final result needs to be transposed once again! I have to do this again, and I'm looking for a solution that won't take another week. Any help would be appreciated.
[...] transposing each csv file and re-saving to disk, and then using the command line to concatenate them [...]
Sounds like Transpose-Cat-Transpose. Use paste for joining files horizontally.
paste -d ',' a.csv b.csv c.csv ... > result.csv
The Python csv module can be set up so that each record is a dictionary with the column names as keys. That way you should be able to read in all the files as dictionaries, and write them to an out-file that has all the columns.
Python is easy to use, so this should be fairly trivial for a programmer of any language.
If your csv files don't have column headings, this will be quite a lot of manual work, though, so then it's perhaps not the best solution.
Since these files are fairly big, it's best not to read all of them into memory at once. I'd recommend that you first open them only to collect all column names into a list, and use that list to create the output file. Then you can concatenate each input file to the output file without having to hold all of the files in memory.
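A minimal sketch of that two-pass idea, assuming every file has a header row; the *.csv input pattern and the merged.csv output name are placeholders:

import csv
import glob

paths = sorted(glob.glob("*.csv"))  # placeholder input pattern

# First pass: collect the union of all column names, in first-seen order
fieldnames = []
for path in paths:
    with open(path, newline="") as f:
        for name in csv.DictReader(f).fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)

# Second pass: copy every row into one output file with the full header,
# filling columns a file doesn't have with empty strings
with open("merged.csv", "w", newline="") as out:  # placeholder output name
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                writer.writerow(row)

Only one row is held in memory at a time during the second pass, so the 50 files never have to fit in RAM together.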
Horizontal concatenation really is trivial. Considering you know C++, I'm surprised you used MATLAB. Processing a GB or so of data in the way you're describing should take on the order of seconds, not days.
By your description, no CSV processing is actually required. The easiest approach is to just do it in RAM.
vector< vector<string> > data( num_files );
for( int i = 0; i < num_files; i++ ) {
    ifstream input( filename[i] );
    string line;
    while( getline(input, line) ) data[i].push_back(line);
}
(Do obvious sanity checks, such as making sure all vectors are the same length...)
Now you have everything, dump it:
ofstream output("concatenated.csv");
for( int row = 0; row < num_rows; row++ ) {
    for( int f = 0; f < num_files; f++ ) {
        if( f > 0 ) output << ",";   // comma between fields, not before the first
        output << data[f][row];
    }
    output << "\n";
}
If you don't want to use all that RAM, you can do it one line at a time. You should be able to keep all files open at once, and just store the ifstream objects in a vector/array/list. In that case, you just read one line at a time from each file and write it to the output.
import csv
import itertools
# put files in the order you want concatenated
csv_names = [...whatever...]
readers = [csv.reader(open(fn, 'rb')) for fn in csv_names]
writer = csv.writer(open('result.csv', 'wb'))
for row_chunks in itertools.izip(*readers):
    writer.writerow(list(itertools.chain.from_iterable(row_chunks)))
Concatenates horizontally. Assumes all files have the same length. Has low memory overhead and is speedy.
Answer applies to Python 2. In Python 3, opening csv files is slightly different:
readers = [csv.reader(open(fn, 'r', newline='')) for fn in csv_names]
writer = csv.writer(open('result.csv', 'w', newline=''))
Use Go: https://github.com/chrislusf/gleam
Assume file "a.csv" has fields "a1, a2, a3, a4, a5".
And assume file "b.csv" has fields "b1, b2, b3".
We want to join the rows where a1 = b2. And the output format should be "a1, a4, b3".
package main

import (
    "os"

    "github.com/chrislusf/gleam"
    "github.com/chrislusf/gleam/source/csv"
)

func main() {
    f := gleam.New()
    a := f.Input(csv.New("a.csv")).Select(1, 4)      // a1, a4
    b := f.Input(csv.New("b.csv")).Select(2, 3)      // b2, b3
    a.Join(b).Fprintf(os.Stdout, "%s,%s,%s\n").Run() // a1, a4, b3
}
I am attempting to rewrite some of my old bash scripts that I think are very inefficient (not to mention inelegant) and use some horrid piping... Perhaps somebody with real Python skills can give me some pointers...
The script makes use of multiple temp files... another thing I think is bad style and can probably be avoided...
It essentially manipulates INPUT-FILE by first cutting out certain number of lines from the top (discarding heading).
Then it pulls out one of the columns and:
calculates the number of rows = N;
throws out all duplicate entries from this single-column file (I use sort -u -n FILE > S-FILE).
After that, I create a sequential integer index from 1 to N and paste this new index column into the original INPUT-FILE using the paste command.
My bash script then generates percentile ranks for the values we wrote into S-FILE.
I believe Python could leverage scipy.stats for this, while in bash I determine the number of duplicate lines (dupline) for each unique entry in S-FILE and then calculate per-rank=$((100*($counter+$dupline/2)/$length)), where $length is the length of FILE and not S-FILE. I then print the results into a separate one-column file (and repeat the same per-rank as many times as there are duplines).
I would then paste this new column of percentile ranks back into INPUT-FILE (since I sort INPUT-FILE by the column used for the calculation of the percentile ranks, everything lines up perfectly in the result).
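For reference, the percentile-rank step described above could look roughly like this with scipy.stats. This is only a sketch: the file name and the choice of column are assumptions, and kind='mean' mirrors the 100*(counter + dupline/2)/length formula.

import numpy as np
from scipy import stats

# Column used for the percentile ranks (placeholder: the 4th column, index 3)
values = np.loadtxt("INPUT-FILE", usecols=3)

# Percentile rank of each value against the whole column;
# kind='mean' averages the strict and weak definitions,
# matching 100 * (counter + dupline/2) / length
ranks = [stats.percentileofscore(values, v, kind='mean') for v in values]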
After this, it goes into the ugliness below...
sort -o $INPUT-FILE $INPUT-FILE
awk 'int($4)>2000' $INPUT-FILE | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 2000-$INPUT-FILE
diff $INPUT-FILE 2000-$INPUT-FILE | sed '/^[0-9][0-9]*/d; s/^. //; /^---$/d' | awk 'int($4)>1000' | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 1000-$INPUT-FILE
cat 2000-$INPUT-FILE 1000-$INPUT-FILE | sort > merge-$INPUT-FILE
diff merge-$INPUT-FILE $INPUT-FILE | sed '/^[0-9][0-9]*/d; s/^. //; /^---$/d' | awk 'int($4)>500' | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 500-$INPUT-FILE
rm merge-$INPUT-FILE
Essentially, this is a very inelegant bash way of doing the following:
RANDOMLY select 500 lines from $INPUT-FILE where the value in column 4 is greater than 2000 and write them out to file 2000-$INPUT-FILE
For all REMAINING lines in $INPUT-FILE, randomly select 500 lines where the value in column 4 is greater than 1000 and write them out to file 1000-$INPUT-FILE
For all REMAINING lines in $INPUT-FILE after 1) and 2), randomly select 500 lines where the value in column 4 is greater than 500 and write them out to file 500-$INPUT-FILE
Again, I am hoping somebody can help me in reworking this ugly piping thing into a thing of python beauty! :) Thanks!
Two crucial points in the comments:
(A) The file is ~50k lines of ~100 characters. Small enough to comfortably fit in memory on modern desktop/server/laptop systems.
(B) The author's main question is about how to keep track of lines that have already been chosen, and not choose them again.
I suggest three steps.
(1) Go through the file, making three separate lists -- call them u, v, w -- of the line numbers which satisfy each of the criteria. These lists may have more than 500 entries, and they will overlap (a line whose column 4 is above 2000 also qualifies for the other two lists), but we will get rid of these problems in step (2).
u = []
v = []
w = []
with open(filename, "r") as f:
    for linenum, line in enumerate(f):
        x = int(line.split()[3])
        # record the line number (not the value) so step (3) can match lines
        if x > 2000:
            u.append(linenum)
        if x > 1000:
            v.append(linenum)
        if x > 500:
            w.append(linenum)
(2) Choose line numbers. You can use the built-in random.sample() (or the sample() method of a random.Random instance) to pick a sample of k elements from a population. We want to exclude elements that have previously been chosen, so keep track of such elements in a set. (The "chosen" collection is a set instead of a list because the test "if x not in chosen" is O(1) on average for a set, but O(n) for a list. Change it to a list and you'll see the slowdown if you measure the timings precisely, though it might not be a noticeable delay for a data set of "only" 50k data points / 500 samples / 3 categories.)
import random
rand = random.Random() # change to random.Random(1234) for repeatable results
chosen = set()
s0 = rand.sample(u, 500)
chosen.update(s0)
s1 = rand.sample([x for x in v if x not in chosen], 500)
chosen.update(s1)
s2 = rand.sample([x for x in w if x not in chosen], 500)
chosen.update(s2)
(3) Do another pass through the input file, putting lines whose numbers are in s0 into your first output file, lines whose numbers are in s1 into your second output file, and lines whose numbers are in s2 into your third output file. It's pretty trivial in any language, but here's an implementation which uses Python "idioms":
linenum2sample = dict([(x, 0) for x in s0] + [(x, 1) for x in s1] + [(x, 2) for x in s2])
outfile = [open(x + "-" + filename, "w") for x in ["2000", "1000", "500"]]
try:
    with open(filename, "r") as f:
        for linenum, line in enumerate(f):
            s = linenum2sample.get(linenum)
            if s is not None:
                outfile[s].write(line)
finally:
    for f in outfile:
        f.close()
Break it up into easy pieces.
Read the file using csv.DictReader, or csv.reader if the headers are unusable. As you're iterating through the lines, check the value of column 4 and insert each line into a dictionary of lists, where the dictionary keys are something like 'gt_2000', 'gt_1000', 'gt_500'.
Iterate through your dictionary keys and, for each one, create a file and loop 500 times; on each iteration, use random.randint(0, len(the_list)-1) to get a random index into the list, write that line to the file, then delete the item at that index from the list. If any bucket could ever hold fewer than 500 items, this will need a little more care.
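A minimal sketch of that bucketing approach, using plain str.split() instead of csv.DictReader for brevity; the input name INPUT-FILE, the output names, and the use of the 4th whitespace-separated column are assumptions:

import random

buckets = {'gt_2000': [], 'gt_1000': [], 'gt_500': []}

# Assign each line to the highest threshold its 4th column exceeds
with open("INPUT-FILE") as f:  # placeholder file name
    for line in f:
        value = float(line.split()[3])
        if value > 2000:
            buckets['gt_2000'].append(line)
        elif value > 1000:
            buckets['gt_1000'].append(line)
        elif value > 500:
            buckets['gt_500'].append(line)

# Draw up to 500 lines from each bucket, without replacement
for key, lines in buckets.items():
    with open(key + ".txt", "w") as out:  # placeholder output names
        for _ in range(min(500, len(lines))):
            i = random.randint(0, len(lines) - 1)
            out.write(lines.pop(i))  # remove the line so it can't be drawn again

Popping the chosen index from the list is what guarantees a line is never written twice, which is the same job the "chosen" set does in the answer above.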