I have spent a lot of time on this; any help would be appreciated.
I have two files, shown below. What I want to do is search for every item of f1_col1 and f1_col2 separately inside f2_col3; if an item is found, keep it and add the corresponding f2_col1 value as a new column in the new data frame.
f1:(two columns)
f1_col1,f1_col2
kctd,Malat1
Gas5,Snhg6
f2:(three columns)
f2_col1,f2_col2,f2_col3
chr7,snRNA,Gas5
chr1,protein_coding,Malat1
chr2,TEC,Snhg6
chr1,TEC,kctd
So based on the two files mentioned the desired output should be:
new_df:
f1_col1,f1_col2,f2_col1,f2_col1
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2
note: f2_col2 is not important.
I do not have a strong programming background and found this very difficult. I have checked multiple sources but have not been able to develop a solution. Any help is appreciated. Thanks.
Based on 1 possible interpretation of your requirements and the 1 sunny-day example you provided where every key field always matches on every line, this MAY be what you're trying to do:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {                   # first file on the command line (f2)
    if ( FNR == 1 ) {
        hdr = $1            # remember the f2_col1 header
    }
    map[$3] = $1            # map the name in f2_col3 -> its f2_col1 value
    next
}
{ print $0, ( FNR>1 ? map[$1] OFS map[$2] : hdr OFS hdr ) }   # second file (f1)
$ awk -f tst.awk f2 f1
f1_col1,f1_col2,f2_col1,f2_col1
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2
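If you are more comfortable in Python than awk, a minimal pandas sketch of the same lookup might look like this (the file names and the two new column names are assumptions; your desired output reuses the f2_col1 header twice, which is easier to handle with distinct names):

import pandas as pd

# Hypothetical file names; both files are assumed to be comma-separated with headers.
f1 = pd.read_csv("f1")   # f1_col1, f1_col2
f2 = pd.read_csv("f2")   # f2_col1, f2_col2, f2_col3

# Map each name in f2_col3 to its f2_col1 value (f2_col2 is ignored).
lookup = f2.set_index("f2_col3")["f2_col1"]

new_df = f1.copy()
new_df["f2_col1_from_f1_col1"] = f1["f1_col1"].map(lookup)
new_df["f2_col1_from_f1_col2"] = f1["f1_col2"].map(lookup)
print(new_df.to_csv(index=False))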
The NBT file is created by the Schematica mod for Minecraft.
With content.find(f'\xYY\xZZ..') and content = content[lower:upper] I was able to slice the data down to the most important part: the block data, which has the following form when translated (my data is NOT translated!):
TAG_List("blocks (Compound)"): (xdim*ydim*zdim) entries of type TAG_Compound
{
TAG_Compound:
{
TAG_List("pos (Int)"): 3 entries of type TAG_Int
{
TAG_Int: X
TAG_Int: Y
TAG_Int: Z
}
TAG_Int("state"): S
}
//...
TAG_Compound:
{
TAG_Compound("nbt"):
{
//Some uninteresting stuff
}
TAG_List("pos (Int)"): 3 entries of type TAG_Int
{
TAG_Int: X
TAG_Int: Y
TAG_Int: Z
}
TAG_Int("state"): S
}
}
My goal is to extract all "state" information from the raw byte data as fast as possible while ignoring all "pos" and "nbt" data.
Since not every "block" compound has an "nbt" subcompound, I can't just save every Xth byte, because the length of each "block" compound varies.
So my current solution is to search for the keyword "state" (data.find(b'\x73\x74\x61\x74\x65')), save the following 4 bytes, cut off the word "state" from the data (data = data[index+4:]) and iterate.
And I'm by no means an expert, but I feel like there has to be a better way.
To give you a better idea of the data I'm dealing with:
b'\t\x00\x06blocks\n\x00\x00\x00\x08\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x05state\x00\x00\x00\x00\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x05state\x00\x00\x00\x01\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x03\x00\x05state\x00\x00\x00\x02\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x00\x03\x00\x05state\x00\x00\x00\x03\x00\n\x00\x03nbt\t\x00\x05Items\x00\x00\x00\x00\x00\x08\x00\x02id\x00\x0fminecraft:chest\x08\x00\x04Lock\x00\x00\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x03\x00\x05state\x00\x00\x00\x04\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x01\x03\x00\x05state\x00\x00\x00\x05\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x05state\x00\x00\x00\x03\x00\t\x00\x03pos\x03\x00\x00\x00\x03\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x05state\x00\x00\x00\x03\x00'
(I want the 4 bytes after each "state")
//Edit: I just realized slicing takes a lot of time, so if I just searched for all occurrences, I'd be a lot faster.
But I still hope there's a better way.
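Building on the edit above: one way to avoid the repeated slicing entirely is to scan the buffer once for every occurrence of the literal b'state' and decode the 4 bytes that follow each match. A minimal sketch, assuming the payload is a big-endian TAG_Int and that the word "state" only occurs as that key (the file name is hypothetical):

import re
import struct

with open("schematic.nbt", "rb") as fh:   # hypothetical file name, already decompressed
    data = fh.read()

# One pass over the buffer: grab the 4 bytes after every literal b'state'
# and decode them as a big-endian signed 32-bit integer (NBT TAG_Int).
states = [
    struct.unpack(">i", m.group(1))[0]
    for m in re.finditer(rb"state(.{4})", data, re.DOTALL)
]
print(states)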
I have an application which writes/concatenates data into JSON, and then displays/graphs it via dygraphs. At times, various events can cause the values to go out of range. That range is user-subjective, so clamping that range at run-time is not the direction I am wishing to go.
I believe jq can help here - ideally I would be able to search for a field > x and if it is > x, replace it with x. I've gone searching for jq examples and not really found anything that's making sense to me yet.
I have spent a bit of time on this but not been able to make anything do what I think it should do ... at all. Like, I don't have bad code to show you because I've not made it do anything yet. I sincerely hope what I am asking is narrowed down enough for someone to be able to show me, in context, so I can extend it for the larger project.
Here's a line which I would expect to be able to modify:
{"cols":[{"type":"datetime","id":"Time","label":"Time"},{"type":"number","id":"Room1Temp","label":"Room One Temp"},{"type":"number","id":"Room1Set","label":"Room One Set"},{"type":"string","id":"Annot1","label":"Room One Note"},{"type":"number","id":"Room2Temp","label":"Room Two Temp"},{"type":"number","id":"Room2Set","label":"Room Two Set"},{"type":"string","id":"Annot2","label":"Room Two Note"},{"type":"number","id":"Room3Temp","label":"Room Three Temp"},{"type":"number","id":"State","label":"State"},{"type":"number","id":"Room4Temp","label":"Room Four Temp"},{"type":"number","id":"Quality","label":"Quality"}],"rows":[
{"c":[{"v":"Date(2019,6,4,20,31,13)"},{"v":68.01},{"v":68.0},null,{"v":62.02},{"v":55.89},null,null,{"v":4},{"v":69.0},{"v":1.052}]}]}
I'd want to do something like:
if JSONFile.Room2Set < 62
set Room2Set = 62
Here's a larger block of JSON which is the source of the chart shown below:
[Example Chart]
With clamp functions defined like so (in your ~/.jq file or inline):
def clamp_min($minInc): if . < $minInc then $minInc else . end;
def clamp_max($maxInc): if . > $maxInc then $maxInc else . end;
def clamp($minInc; $maxInc): clamp_min($minInc) | clamp_max($maxInc);
With that data, you'll want to find the corresponding cell in each row and modify its value:
$ jq --arg col "Room2Set" --argjson max '62' '
def clamp_max($maxInc): if . > $maxInc then $maxInc else . end;
(INDEX(.cols|to_entries[]|{id:.value.id,index:.key};.id)) as $cols
| .rows[].c[$cols[$col].index] |= (objects.v |= clamp_max($max))
' input.json
With an invocation such as:
jq --arg col Room2Set --argjson mx 72 --argjson mn 62 -f clamp.jq input.json
where clamp.jq contains:
def clamp: if . > $mx then $mx elif . < $mn then $mn else . end;
(.cols | map(.id) | index($col)) as $ix
| .rows[].c[$ix].v |= clamp
the selected cells should be "clamped".
I want to clean out all the "waste" (which makes the files unsuitable for analysis) in unstructured text files.
In this specific situation, one option to retain only the wanted information is to keep all numbers above 250 (the text is a combination of strings, numbers, ...).
For a large number of text files, I want to perform the following action in R:
x <- x[which(x >= "250"),]
The code above works perfectly for one text file, but when I try to do the same in a loop over the large number of text files, it fails (error: incorrect number of dimensions).
for(i in 1:length(files)){
i<- i[which(i >= "250"),]
}
Does anyone have any idea how to solve this in R (or Python)?
Picture: a very simplified example of a text file; I want to retain everything between (START) and (END).
This makes little sense if it is 10K files; why are you even trying to do it in R or Python? Why not just a simple awk or bash command? Moreover, your image shows parsing info between START and END from the text files; I'm not sure whether it is a data frame with columns across (try to post a simple dput rather than images).
All you are trying to do is grep between START and END across 10K files. I would do that in bash.
Something like this in bash should work:
for i in *.txt
do
    sed -n '/START/,/END/{//!p}' "$i" > "$i.edited.txt"
done
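If you would rather stay in Python than use bash, a rough equivalent of that sed loop could look like the sketch below (the *.txt pattern and the .edited.txt suffix are assumptions; only the lines strictly between the START and END markers are kept):

import glob

for path in glob.glob("*.txt"):
    keep, inside = [], False
    with open(path) as fh:
        for line in fh:
            if "START" in line:
                inside = True      # start keeping from the next line
                continue
            if "END" in line:
                inside = False     # stop keeping; marker itself is dropped
                continue
            if inside:
                keep.append(line)
    with open(path + ".edited.txt", "w") as out:
        out.writelines(keep)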
If the columns are standard across files, in R you can do the following (but I would not read 10K files into R memory):
Read the files as a list of data frames, then simply do an lapply:
a = data.frame(col1 = c(100,250,300))
b = data.frame(col1 = c(250,450,100,346))
c = data.frame(col1 = c(250,123,122,340))
df_list <- list(a = a, b = b, c = c)
lapply(df_list, subset, col1 >= 250)
I have a collection of large (~100,000,000 line) text files in the format:
0.088293 1.3218e-32 2.886e-07 2.378e-02 21617 28702
0.111662 1.1543e-32 3.649e-07 1.942e-02 93804 95906
0.137970 1.2489e-32 4.509e-07 1.917e-02 89732 99938
0.149389 8.0725e-32 4.882e-07 2.039e-02 71615 69733
...
And I'd like to find the mean and sum of column 2, the maximum and minimum values of columns 3 and 4, and the total number of lines. How can I do this efficiently using NumPy? Because of their size, loadtxt and genfromtxt are no good (they take a long time to execute) since they attempt to read the whole file into memory. In contrast, Unix tools like awk:
awk '{ total += $2 } END { print total/NR }' <filename>
work in a reasonable amount of time.
Can Python/NumPy do the job of awk for such big files?
You can say something like:
awk '{ total2 += $2
       for (i=2;i<=3;i++) {
           max[i] = (length(max[i]) && max[i]>$i) ? max[i] : $i
           min[i] = (length(min[i]) && min[i]<$i) ? min[i] : $i
       }
     } END {
       print "items", "average2", "min2", "min3", "max2", "max3"
       print NR, total2/NR, min[2], min[3], max[2], max[3]
     }' file
Test
With your given input:
$ awk '{total2 += $2; for (i=2;i<=3;i++) {max[i]=(length(max[i]) && max[i]>$i)?max[i]:$i; min[i]=((length(min[i]) && min[i]<$i)?min[i]:$i)}} END {print "items", "average2", "min2", "min3", "max2", "max3"; print NR, total2/NR, min[2], min[3], max[2], max[3]}' a | column -t
items average2 min2 min3 max2 max3
4 2.94938e-32 1.1543e-32 2.886e-07 8.0725e-32 4.882e-07
Loop through the lines and apply a regex to extract the data you are looking for, appending it to an initially empty list for each column you want.
Once you have each column in list form, you can apply max(list), min(list), and an average (e.g. sum(list)/len(list)) to get whatever calculations you are interested in.
Note: you may need to convert the numbers from str to float when adding them to the lists so that max, min, and the average operate on numeric values.
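A sketch of that idea in plain Python, but keeping running aggregates instead of whole lists so a 100-million-line file never has to sit in memory (the file name is hypothetical; columns are assumed to be whitespace-separated, matching the question: sum/mean of column 2, min/max of columns 3 and 4, plus the line count):

import math

count = 0
total2 = 0.0
min3 = min4 = math.inf
max3 = max4 = -math.inf

with open("data.txt") as fh:              # hypothetical file name
    for line in fh:
        parts = line.split()
        if len(parts) < 4:
            continue                      # skip blank or malformed lines
        c2, c3, c4 = float(parts[1]), float(parts[2]), float(parts[3])
        count += 1
        total2 += c2
        min3, max3 = min(min3, c3), max(max3, c3)
        min4, max4 = min(min4, c4), max(max4, c4)

print("lines:", count)
print("col2 sum:", total2, "col2 mean:", total2 / count)
print("col3 min/max:", min3, max3)
print("col4 min/max:", min4, max4)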
I have about 50 CSV files with 60,000 rows in each, and a varying number of columns. I want to merge all the CSV files by column. I've tried doing this in MATLAB by transposing each CSV file and re-saving it to disk, and then using the command line to concatenate them. This took my computer over a week and the final result still needs to be transposed once again! I have to do this again, and I'm looking for a solution that won't take another week. Any help would be appreciated.
[...] transposing each csv file and re-saving to disk, and then using the command line to concatenate them [...]
Sounds like Transpose-Cat-Transpose. Use paste for joining files horizontally.
paste -d ',' a.csv b.csv c.csv ... > result.csv
The Python csv module can be set up so that each record is a dictionary with the column names as keys. That way you should be able to read all the files as dictionaries and write them to an output file that has all the columns.
Python is easy to use, so this should be fairly trivial for a programmer of any language.
If your CSV files don't have column headings, though, this will be quite a lot of manual work, so perhaps it's not the best solution then.
Since these files are fairly big, it's best not to read all of them into memory at once. I'd recommend that you first open them only to collect all column names into a list, and use that list to create the output file. Then you can concatenate each input file to the output file without having to hold all of the files in memory.
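A minimal sketch of that two-pass idea with csv.DictReader and csv.DictWriter (the *.csv glob and the merged.csv output name are assumptions). As described above, this appends the rows of each file under the union of all column names rather than pasting files side by side:

import csv
import glob

paths = sorted(glob.glob("*.csv"))          # hypothetical input selection

# First pass: collect the union of all column names, preserving order.
fieldnames = []
for path in paths:
    with open(path, newline="") as fh:
        for name in csv.DictReader(fh).fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)

# Second pass: stream every file into one output, one row at a time.
with open("merged.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for path in paths:
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                writer.writerow(row)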
Horizontal concatenation really is trivial. Considering you know C++, I'm surprised you used MATLAB. Processing a GB or so of data in the way you're doing should be on the order of seconds, not days.
By your description, no CSV processing is actually required. The easiest approach is to just do it in RAM.
// Needs <vector>, <string>, <fstream>; assumes num_files and filename[]
// are already defined and that this runs inside main().
vector< vector<string> > data( num_files );
for( int i = 0; i < num_files; i++ ) {
    ifstream input( filename[i] );
    string line;
    while( getline(input, line) ) data[i].push_back(line);
}
(Do obvious sanity checks, such as making sure all vectors are the same length...)
Now you have everything, dump it:
ofstream output("concatenated.csv");
for( int row = 0; row < num_rows; row++ ) {      // num_rows == data[0].size()
    for( int f = 0; f < num_files; f++ ) {
        if( f > 0 ) output << ",";               // separator between files, not before the first
        output << data[f][row];
    }
    output << "\n";
}
If you don't want to use all that RAM, you can do it one line at a time. You should be able to keep all files open at once, and just store the ifstream objects in a vector/array/list. In that case, you just read one line at a time from each file and write it to the output.
import csv
import itertools
# put files in the order you want concatenated
csv_names = [...whatever...]
readers = [csv.reader(open(fn, 'rb')) for fn in csv_names]
writer = csv.writer(open('result.csv', 'wb'))
for row_chunks in itertools.izip(*readers):
writer.writerow(list(itertools.chain.from_iterable(row_chunks)))
Concatenates horizontally. Assumes all files have the same length. Has low memory overhead and is speedy.
This answer applies to Python 2. In Python 3, itertools.izip is gone (use the built-in zip instead) and opening CSV files is slightly different:
readers = [csv.reader(open(fn, 'r'), newline='') for fn in csv_names]
writer = csv.writer(open('result.csv', 'w'), newline='')
Use Go: https://github.com/chrislusf/gleam
Assume file "a.csv" has fields "a1, a2, a3, a4, a5".
And assume file "b.csv" has fields "b1, b2, b3".
We want to join the rows where a1 = b2. And the output format should be "a1, a4, b3".
package main
import (
    "os"
    "github.com/chrislusf/gleam"
    "github.com/chrislusf/gleam/source/csv"
)

func main() {
    f := gleam.New()
    a := f.Input(csv.New("a.csv")).Select(1, 4)      // a1, a4
    b := f.Input(csv.New("b.csv")).Select(2, 3)      // b2, b3
    a.Join(b).Fprintf(os.Stdout, "%s,%s,%s\n").Run() // a1, a4, b3
}