Improve python code in terms of speed - python

I have a very big file (1.5 billion lines) in the following form:
1 67108547 67109226 gene1$transcript1 0 + 1 0
1 67108547 67109226 gene1$transcript1 0 + 2 1
1 67108547 67109226 gene1$transcript1 0 + 3 3
1 67108547 67109226 gene1$transcript1 0 + 4 4
.
.
.
1 33547109 33557650 gene2$transcript1 0 + 239 2
1 33547109 33557650 gene2$transcript1 0 + 240 0
.
.
.
1 69109226 69109999 gene1$transcript1 0 + 351 1
1 69109226 69109999 gene1$transcript1 0 + 352 0
What I want to do is to reorganize/sort this file based on the identifier in column 4. The file consists of blocks. If you concatenate columns 4, 1, 2 and 3 you create the unique identifier for each block. This is the key for the dictionary all_exons, and the value is a numpy array containing all the values of column 8. Then I have a second dictionary unique_identifiers that has as key the attribute from column 4 and as value the set of corresponding block identifiers. As output I write a file in the following form:
>gene1
0
1
3
4
1
0
>gene2
2
0
I already wrote some code (see below) that does this, but my implementation is very slow. It takes around 18 hours to run.
import os
import sys
import time
from contextlib import contextmanager
import pandas as pd
import numpy as np

def parse_blocks(bedtools_file):
    unique_identifiers = {}  # Dictionary with key: gene, value: set of block identifiers
    all_exons = {}           # Dictionary containing all exons
    # Parse file and ...
    with open(bedtools_file) as fp:
        sp_line = []
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]
            identifier = "$".join([sp_line[3], sp_line[0], sp_line[1], sp_line[2]])
            if identifier in all_exons:
                item = float(sp_line[7])
                all_exons[identifier] = np.append(all_exons[identifier], item)
            else:
                all_exons[identifier] = np.array([sp_line[7]], float)
            if current_id in unique_identifiers:
                unique_identifiers[current_id].add(identifier)
            else:
                unique_identifiers[current_id] = set([identifier])
    return unique_identifiers, all_exons

# options.bed and options.out come from argument parsing not shown here
identifiers, introns = parse_blocks(options.bed)
w = open(options.out, 'w')
for gene in sorted(list(identifiers)):
    w.write(">" + str(gene) + "\n")
    for intron in sorted(list(identifiers[gene])):
        for base in introns[intron]:
            w.write(str(base) + "\n")
w.close()
How can I improve the above code so that it runs faster?

Since you already import pandas, I provide a pandas solution which requires basically only two lines of code.
However, I do not know how it performs on large data sets and whether that is faster than your approach (but I am pretty sure it is).
In the example below, the data you provide is stored in table.txt. I then use groupby to get all the values in your 8th column, store them in a list for the respective identifier in your column 4 (note that my indices start at 0) and convert this data structure into a dictionary which can then be printed easily.
import pandas as pd
df = pd.read_csv("table.txt", header=None, sep=r"\s+")  # replace the separator by e.g. '\t' if needed
op = dict(df.groupby(3)[7].apply(lambda x: x.tolist()))
So in this case op looks like this:
{'gene1$transcript1': [0, 1, 3, 4, 1, 0], 'gene2$transcript1': [2, 0]}
Now you can print the output like this and redirect it into a file:
for k, v in op.iteritems():
    print k.split('$')[0]
    for val in v:
        print val
This gives you the desired output:
gene1
0
1
3
4
1
0
gene2
2
0
Maybe you can give it a try and let me know how it compares to your solution!?
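If reading the full 1.5 billion lines in one go is too much for memory, the same idea can be sketched with chunked reading. This is only a sketch, assuming the aggregated per-identifier lists still fit in RAM, and the chunksize value is arbitrary:
import pandas as pd
from collections import defaultdict

op = defaultdict(list)
# Read the file in pieces and extend each identifier's list chunk by chunk,
# so the whole file never has to sit in a single DataFrame.
for chunk in pd.read_csv("table.txt", header=None, sep=r"\s+", chunksize=10000000):
    for key, values in chunk.groupby(3)[7]:
        op[key].extend(values.tolist())
The resulting op can then be printed exactly as above.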
Edit2:
In the comments you mentioned that you would like to print the genes in the correct order. You can do this as follows:
# add some fake genes to op from above
op['gene0$stuff'] = [7,9]
op['gene4$stuff'] = [5,9]
# print using 'sorted'
for k, v in sorted(op.iteritems()):
    print k.split('$')[0]
    for val in v:
        print val
which gives you:
gene0
7
9
gene1
0
1
3
4
1
0
gene2
2
0
gene4
5
9
EDIT1:
I am not sure whether duplicates are intended but you could easily get rid of them by doing the following:
op2 = dict(df.groupby(3)[7].apply(lambda x: set(x)))
Now op2 would look like this:
{'gene1$transcript1': {0, 1, 3, 4}, 'gene2$transcript1': {0, 2}}
You print the output as before:
for k, v in op2.iteritems():
    print k.split('$')[0]
    for val in v:
        print val
which gives you
gene1
0
1
3
4
gene2
0
2

Let me simplify your problem; my solution works like this:
First, scan over the big file. For every different current_id, open a temporary file and append the value of column 8 to that file.
After the full scan, concatenate all the chunks into one result file.
Here's the code:
# -*- coding: utf-8 -*-
import os
import tempfile
import subprocess

class ChunkBoss(object):
    """Boss for file chunks"""
    def __init__(self):
        self.opened_files = {}

    def write_chunk(self, current_id, value):
        if current_id not in self.opened_files:
            self.opened_files[current_id] = open(tempfile.mktemp(), 'wb')
            self.opened_files[current_id].write('>%s\n' % current_id)
        self.opened_files[current_id].write('%s\n' % value)

    def cat_result(self, filename):
        """Concatenate chunks into one big file"""
        # Sort the chunks by id
        chunk_file_list = []
        for current_id in sorted(self.opened_files.keys()):
            chunk_file_list.append(self.opened_files[current_id].name)
        # Flush chunks
        for chunk in self.opened_files.values():
            chunk.flush()
        # By calling the cat command
        with open(filename, 'wb') as fp:
            subprocess.call(['cat'] + chunk_file_list, stdout=fp, stderr=fp)

    def clean_up(self):
        for chunk in self.opened_files.values():
            chunk.close()
            os.unlink(chunk.name)

def main():
    boss = ChunkBoss()
    with open('bigfile.data') as fp:
        for line in fp:
            data = line.strip().split()
            current_id = data[3].split("$")[0]
            value = data[7]
            # Write value to temp chunk
            boss.write_chunk(current_id, value)
    boss.cat_result('result.txt')
    boss.clean_up()

if __name__ == '__main__':
    main()
I tested the performance of my script, with bigfile.data containing about 150k lines. It took about 0.5s to finish on my laptop. Maybe you can give it a try.

Related

calculate median of a list of values parallely using Hadoop map-reduce

I'm new to Hadoop mrjob. I have a text file in which each line consists of "id groupId value". I am trying to calculate the median of all values in the text file using Hadoop map-reduce. But I'm stuck when it comes to calculating only the median value. What I get is a median value for each id, like:
"123213" 5.0
"123218" 2
"231532" 1
"234634" 7
"234654" 2
"345345" 9
"345445" 4.5
"345645" 2
"346324" 2
"436324" 6
"436456" 2
"674576" 10
"781623" 1.5
The output should be like "median value of all values is: ####". I was influenced by this article: https://computehustle.com/2019/09/02/getting-started-with-mapreduce-in-python/
My python file median-mrjob.py :
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRMedian(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_stats, combiner=self.reducer_count_stats),
            MRStep(reducer=self.reducer_sort_by_values),
            MRStep(reducer=self.reducer_retrieve_median)
        ]

    def mapper_get_stats(self, _, line):
        line_arr = line.split(" ")
        values = int(float(line_arr[-1]))
        id = line_arr[0]
        yield id, values

    def reducer_count_stats(self, key, values):
        yield str(sum(values)).zfill(2), key

    def reducer_sort_by_values(self, values, ids):
        for id in ids:
            yield id, values

    def reducer_retrieve_median(self, id, values):
        valList = []
        median = 0
        for val in values:
            valList.append(int(val))
        N = len(valList)
        # find the median
        if N % 2 == 0:
            # if N is even
            m1 = N / 2
            m2 = (N / 2) + 1
            # Convert to integer, match position
            m1 = int(m1) - 1
            m2 = int(m2) - 1
            median = (valList[m1] + valList[m2]) / 2
        else:
            m = (N + 1) / 2
            # Convert to integer, match position
            m = int(m) - 1
            median = valList[m]
        yield (id, median)

if __name__ == '__main__':
    MRMedian.run()
My original text files are about 1 million and 1 billion lines of data, but I have created a test file with arbitrary data. It has the name input.txt:
781623 2 2.3243
781623 1 1.1243
234654 1 2.122
123218 8 2.1245
436456 22 2.26346
436324 3 6.6667
346324 8 2.123
674576 1 10.1232
345345 1 9.56135
345645 7 2.1231
345445 10 6.1232
231532 1 1.1232
234634 6 7.124
345445 6 3.654376
123213 18 8.123
123213 2 2.1232
What I care about is the values; keep in mind that there might be duplicates. I run the code from the terminal with python median-mrjob.py input.txt.
Update: The point of the assignment is not to use any libraries, so I need to sort the list manually (or at least part of it, as I understand it) and calculate the median by hand. Otherwise the goal of using MapReduce disappears. Using PySpark is not allowed in this assignment. Check this link for more inspiration: Computing median in map reduce
The output should be like "median value of all values is: ####"
Then you need to force all data to one reducer first (effectively defeating the purpose of using MapReduce).
You'd do that by not using the ID as the key and discarding it
def mapper_get_stats(self, _, line):
    line_arr = line.split()
    if line_arr:  # prevent empty lines
        value = float(line_arr[-1])
        yield None, value
After that, sort and find the median (I fixed your parameter order)
def reducer_retrieve_median(self, key, values):
    import statistics
    yield None, f"median value of all values is: {statistics.median(values)}"  # automatically sorts the data
So, only two steps
class MRMedian(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_stats),
            MRStep(reducer=self.reducer_retrieve_median)
        ]
For the given file, you should see
null "median value of all values is: 2.2938799999999997"
original text files are about 1 million and 1 billion lines of data
Not that it matters, but which is it?
You should upload the file to HDFS first, then you can use better tools than MrJob for this like Hive or Pig.

Python & Pandas: appending data to new column

With Python and Pandas, I'm writing a script that passes text data from a csv through the pylanguagetool library to calculate the number of grammatical errors in a text. The script successfully runs, but appends the data to the end of the csv instead of to a new column.
The csv contains a user_id column and a text column, among others.
The working code is:
import pandas as pd
from pylanguagetool import api

df = pd.read_csv("Streamlit\stack.csv")
text_data = df["text"].fillna('')
length1 = len(text_data)

for i, x in enumerate(range(length1)):
    # this is the pylanguagetool operation
    errors = api.check(text_data, api_url='https://languagetool.org/api/v2/', lang='en-US')
    result = str(errors)
    # this pulls the error count "message" from the pylanguagetool json
    error_count = result.count("message")
    output_df = pd.DataFrame({"error_count": [error_count]})
    output_df.to_csv("Streamlit\stack.csv", mode="a", header=(i == 0), index=False)
The output I get has the error counts appended as extra rows at the bottom of the csv. The expected output is a new error_count column holding the count for each row of text. What changes are necessary to append the output as a new column like this?
Instead of using a loop, you might consider lambda which would accomplish what you want in one line:
df["error_count"] = df["text"].fillna("").apply(lambda x: len(api.check(x, api_url='https://languagetool.org/api/v2/', lang='en-US')["matches"]))
>>> df
user_id ... error_count
0 10 ... 2
1 11 ... 0
2 12 ... 0
3 13 ... 0
4 14 ... 0
5 15 ... 2
Edit:
You can write the above to a .csv file with:
df.to_csv("Streamlit\stack.csv", index=False)
You don't want to use mode="a" as that opens the file in append mode whereas you want (the default) write mode.
My strategy would be to keep the error counts in a list then create a separate column in the original database and finally write that database to csv:
text_data = df["text"].fillna('')
length1 = len(text_data)
error_count_lst = []

for i in range(length1):
    # check each row's text individually
    errors = api.check(text_data[i], api_url='https://languagetool.org/api/v2/', lang='en-US')
    result = str(errors)
    error_count = result.count("message")
    error_count_lst.append(error_count)

# add the counts as a new column of the original dataframe and save it
df['error_count'] = error_count_lst
df.to_csv('file.csv', index=False)

Splitting a CSV file into multiple csv by target columns values

I'm fairly new to programming and Python in general. I have a big CSV file that I need to split into multiple CSV files based on the values of the target column (the last column).
Here's a simplified version of the CSV file data that I want to split.
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
I want to split so that the output extracts the data in different csv files like below:
sample1.csv
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
sample2.csv
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
sample3.csv
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
sample4.csv
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
I tried pandas and some groupby functions, but they merge all the 1s and 0s together into two files, one containing all rows with 1 and another with 0, which is not the output I need.
Any help would be appreciated.
What you can do is read the value of the last column in each row. If it is the same as in the previous row, add that row to the current list; if it is not, write out the current list and start a new one with that row. For the data structure, a list of lines per group is enough.
Assume the file 'input.csv' contains the original data.
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
code below
target = None
counter = 0

with open('input.csv', 'r') as file_in:
    lines = file_in.readlines()
    tmp = []
    for idx, line in enumerate(lines):
        _target = line.split(' ')[-1].strip()
        if idx == 0:
            # first line starts the first group
            tmp.append(line)
            target = _target
            continue
        else:
            last_line = idx + 1 == len(lines)
            if _target != target or last_line:
                if last_line:
                    tmp.append(line)
                counter += 1
                # write the finished group to its own file
                with open('sample{}.csv'.format(counter), 'w') as file_out:
                    file_out.writelines(tmp)
                tmp = [line]
            else:
                tmp.append(line)
        target = _target
Perhaps you want something like this:
from itertools import groupby
from operator import itemgetter

sep = ' '
with open('data.csv') as f:
    data = f.read()

split_data = [row.split(sep) for row in data.split('\n')]
gb = groupby(split_data, key=itemgetter(4))

for index, (key, group) in enumerate(gb):
    with open('sample{}.csv'.format(index), 'w') as f:
        write_data = '\n'.join(sep.join(cell) for cell in group)
        f.write(write_data)
Unlike pd.groupby, itertools.groupby doesn't sort the source beforehand. This parses the input CSV into a list of lists and performs a groupby on the outer list based on the 5th column, which contains the target. The groupby object is an iterator over the groups; by writing each group to a different file, the result you want can be achieved.
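If you would rather stay in pandas, a similar consecutive-grouping effect can be sketched by building a group id that increases whenever the value of the last column changes, and grouping on that id instead of on the column itself (this assumes the whitespace-separated layout shown in the question):
import pandas as pd

df = pd.read_csv("data.csv", sep=r"\s+", header=None)
# The id grows by one every time column 4 (the target) differs from the previous row,
# so consecutive runs of the same target value stay in the same group.
group_id = (df[4] != df[4].shift()).cumsum()
for i, (_, chunk) in enumerate(df.groupby(group_id), start=1):
    chunk.to_csv("sample{}.csv".format(i), sep=" ", header=False, index=False, float_format="%.2f")
This writes sample1.csv through sample4.csv in the order shown in the question.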
I propose to use a function to do what was asked for. We could leave the file objects that we open for writing unreferenced, so that they are closed automatically when garbage collected, but here I prefer to explicitly close every output file before opening another one. The script is heavily commented, so no further explanation is needed:
def split_data(data_fname, key_len=1, basename='file%03d.txt'):
    data = open(data_fname)
    current_output = None  # because we have not yet opened an output file
    prev_key = int(1)      # because a string is always different from an int
    count = 0              # because we want to count the output files
    for line in data:
        # line has a trailing newline, so to extract the key
        # we have to take that into account
        key = line[-key_len-1:-1]
        if key != prev_key:  # key has changed!
            count += 1       # a new file is going to be opened
            prev_key = key   # remember the new key
            if current_output:  # if a file was opened, close it
                current_output.close()
            # open a new output file, its name derived from the variable count
            current_output = open(basename % count, 'w')
        # now we can write to the output file
        current_output.write(line)
        # note that line is already newline terminated
    # clean up what is still open
    current_output.close()
This answer has a history.

Python import text file where each line has different number of columns

I'm new to python and I'm trying to figure out how to load a data file that contains blocks of data on a per timestep basis, such as like this:
TIME:,0
Q01 : A:,-10.7436,0.000536907,-0.00963283,0.00102934
Q02 : B:,0,0.0168694,-0.000413983,0.00345921
Q03 : C:,0.0566665
Q04 : D:,0.074456
Q05 : E:,0.077456
Q06 : F:,0.0744835
Q07 : G:,0.140448
Q08 : H:,-0.123968
Q09 : I:,0
Q10 : J:,0.00204377,0.0109621,-0.0539183,0.000708574
Q11 : K:,-2.86115e-17,0.00947104,0.0145645,1.05458e-16,-1.90972e-17,-0.00947859
Q12 : L:,-0.0036781,0.00161254
Q13 : M:,-0.00941257,0.000249692,-0.0046302,-0.00162387,0.000981709,-0.0135982,-0.0223496,-0.00872062,0.00548815,0.0114075,.........,-0.00196206
Q14 : N:,3797, 66558
Q15 : O:,0.0579981
Q16 : P:,0
Q17 : Q:,625
TIME:,0.1
Q01 : A:,-10.563,0.000636907,-0.00963283,0.00102934
Q02 : B:,0,0.01665694
Q03 : C:,0.786,-0.000666,0.6555
Q04 : D:,0.87,0.96
Q05 : E:,0.077456
Q06 : F:,0.07447835
Q07 : G:,0.140448
Q08 : H:,-0.123968
Q09 : I:,0
Q10 : J:,0.00204377,0.0109621,-0.0539183,0.000708574
Q11 : K:,-2.86115e-17,0.00947104,0.0145645,1.05458e-16,-1.90972e-17,-0.00947859
Q12 : L:,-0.0036781,0.00161254
Q13 : M:,-0.00941257,0.000249692,-0.0046302,-0.00162387,0.000981709,-0.0135982,-0.0223496,-0.00872062,0.00548815,0.0114075,.........,-0.00196206
Q14 : N:,3797, 66558
Q15 : O:,0.0579981
Q16 : P:,0,2,4
Q17 : Q:,786
Each block contains a number of variables that may have very different numbers of columns of data in them. The number of columns per variable may change in each timestep block, but the number of variables per block is the same in every timestep, and it is always known how many variables were exported. There is no information on the number of blocks of data (timesteps) in the data file.
When the data has been read, it should be loaded in a variable-per-timestep format:
Time: | A: | B:
0 | -10.7436,0.000536907,-0.00963283,0.00102934 | ........
0.1 | -10.563,0.000636907,-0.00963283,0.00102934 | ........
0.2 | ...... | ........
If the number of columns of data were the same in every timestep and the same for every variable, this would be a very simple problem.
I guess I need to read the file line by line, in two loops: one per block and then one inside each block, storing the inputs in an array (append?). The changing number of columns per line has me a little stumped at the moment since I'm not very familiar with Python and numpy yet.
If someone could point me in the right direction, such as what functions I should be using to do this relatively efficiently, that would be great.
import pandas as pd

res = {}
TIME = None
# by default lazy line read
for line in open('file.txt'):
    parts = line.strip().split(':')
    parts = [p.strip() for p in parts]
    if len(parts) and parts[0] == 'TIME':
        TIME = parts[1].strip(',')
        res[TIME] = {}
        print('New time section start {}'.format(TIME))
        # here you can stop and work with data from the previous period
        continue
    if len(parts) <= 1:
        continue
    res[TIME][parts[1].lstrip()] = parts[2].strip(',').split(',')

df = pd.DataFrame.from_dict(res, 'columns')
# for example for TIME 0
dfZero = df['0']
print(dfZero)

df = pd.DataFrame.from_dict(res, 'index')
dfA = df['A']
print(dfA)
File test.csv:
1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
Handling data:
my_cols = ["A", "B", "C", "D", "E"]
pd.read_csv("test.csv", names=my_cols, engine='python')
Output:
A B C D E
0 1 2 3 NaN NaN
1 1 2 3 4 NaN
2 1 2 3 4 5
3 1 2 NaN NaN NaN
4 1 2 3 4 NaN
Or you can use the names parameter.
For example:
1,2,1
2,3,4,2,3
1,2,3,3
1,2,3,4,5,6
If you read it, you'll receive the following error:
>>> pd.read_csv(r'D:/Temp/test.csv')
Traceback (most recent call last):
...
Expected 5 fields in line 4, saw 6
But if you pass names parameters, you'll get result:
>>> pd.read_csv(r'D:/Temp/test.csv', names=list('ABCDEF'))
Output:
A B C D E F
0 1 2 1 NaN NaN NaN
1 2 3 4 2 3 NaN
2 1 2 3 3 NaN NaN
3 1 2 3 4 5 6
Hope it helps.
A very non-polished way to accomplish that is by reading your text file and creating a dict structure as you sweep through. Here's an example that might achieve your goal (based on the input you've provided):
time = 0
output = {}
with open('path_to_file', 'r') as input_file:
    for line in input_file:
        line = line.strip('\n')
        if 'TIME' in line:
            time = line.split(',')[1]
            output[time] = {}
        else:
            col_name = line.split(':')[1].strip()
            col_value = line.split(':')[2].strip(',')
            output[time][col_name] = col_value
This will deliver an output object which is a dictionary with the following structure:
output = {
    '0': {'A': '-10.7436,0.000536907,-0.00963283,0.00102934',
          'B': '0,0.0168694,-0.000413983,0.00345921',
          ...
          'Q': '625'},
    '0.1': {'A': '-10.563,0.000636907,-0.00963283,0.00102934',
            'B': '0,0.01665694',
            ...
            'Q': '786'}
}
Which I think matches what you are looking for. To access one value inside this dictionary you should use value = output['0.1']['A'] which would yield '-10.563,0.000636907,-0.00963283,0.00102934'
This reader is similar to @Lucas's: each block is a dictionary saved in a meta dictionary keyed by time. It could have been a list instead.
import numpy as np

blocks = {}
with open('stack37354745.txt') as f:
    for line in f:
        line = line.strip()
        if len(line) == 0:
            continue  # blank line
        d = line.split(':')
        if len(d) == 2 and d[0] == 'TIME':  # new block
            time = float(d[1].strip(','))
            blocks[time] = data = {}
        else:
            key = d[1].strip()  # e.g. A, B, C
            value = d[2].strip(',').split(',')
            value = np.array(value, dtype=float)  # assume valid numeric list
            data[key] = value
Values can be fetched, displayed and reorganized with iterations like:
for time in blocks:
    b = blocks[time]
    print('TIME: %s' % time)
    for k in b:
        print('%4s: %s' % (k, b[k]))
produces:
TIME: 0.0
C: [ 0.0566665]
G: [ 0.140448]
A: [ -1.07436000e+01 5.36907000e-04 -9.63283000e-03 1.02934000e-03]
...
K: [ -2.86115000e-17 9.47104000e-03 1.45645000e-02 1.05458000e-16
-1.90972000e-17 -9.47859000e-03]
TIME: 0.1
C: [ 7.86000000e-01 -6.66000000e-04 6.55500000e-01]
G: [ 0.140448]
A: [ -1.05630000e+01 6.36907000e-04 -9.63283000e-03 1.02934000e-03]
...
K: [ -2.86115000e-17 9.47104000e-03 1.45645000e-02 1.05458000e-16
-1.90972000e-17 -9.47859000e-03]
(I removed the .... from one of the data lines)
Or in a quasi table format
fmt = '%10s | %s | %s | %s'
print(fmt % ('Time', 'B', 'D', 'E'))
for time in blocks:
    b = blocks[time]
    # print(list(b.keys()))
    print(fmt % (time, b['B'], b['D'], b['E']))
producing:
Time | B | D | E
0.0 | [ 0. 0.0168694 -0.00041398 0.00345921] | [ 0.074456] | [ 0.077456]
0.1 | [ 0. 0.01665694] | [ 0.87 0.96] | [ 0.077456]
Since variables like B can have different lengths, it is hard to collect values across time as some sort of 2d array.
In general it is easiest to focus first on loading the file into some sort of Python structure. That kind of action almost has to be written in Python, iterating line by line (unless you let pandas do it for you).
Once that's done you can reorganize the data in many different ways to suit your needs. With something this variable it doesn't make sense to aim for rectangular numpy arrays right away.
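If a table like the one sketched in the question is still wanted, one option is to let pandas hold the ragged arrays as object cells, reusing the blocks dictionary built above (a sketch, not a polished solution):
import pandas as pd

# Rows become timesteps and columns become variables; each cell keeps the
# (possibly different-length) array that was parsed for that variable.
table = pd.DataFrame(blocks).T.sort_index()
print(table[['A', 'B']])
Each cell is still a numpy array, so per-variable processing stays easy, but anything that expects a rectangular layout will not apply directly.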

Average data based on specific columns - python

I have a data file with multiple rows and 8 columns. I want to average column 8 of the rows that have the same data in columns 1, 2 and 5. For example, my file can look like this:
564645 7371810 0 21642 1530 1 2 30.8007
564645 7371810 0 21642 8250 1 2 0.0103
564645 7371810 0 21643 1530 1 2 19.3619
I want to average the last column of the first and third row since columns 1-2-5 are identical;
I want the output to look like this:
564645 7371810 0 21642 1530 1 2 25.0813
564645 7371810 0 21642 8250 1 2 0.0103
my files (text files) are pretty big (~10000 lines) and the redundant data (based on the above rule) do not occur at regular intervals, so I want the code to find the redundant data and average them...
in response to larsks comment - here are my 4 lines of code...
import os
import numpy as np
datadirectory = input('path to the data directory, ')
os.chdir( datadirectory)
##READ DATA FILE AND CREATE AN ARRAY
dataset = open(input('dataset_to_be_used, ')).readlines()
data = np.loadtxt(dataset)
##Sort the data based on common X, Y and frequency
datasort = np.lexsort((data[:,0],data[:,1],data[:,4]))
datasorted = data[datasort]
you can use pandas to do this quickly:
import pandas as pd
from StringIO import StringIO
data = StringIO("""564645 7371810 0 21642 1530 1 2 30.8007
564645 7371810 0 21642 8250 1 2 0.0103
564645 7371810 0 21643 1530 1 2 19.3619
""")
df = pd.read_csv(data, sep="\\s+", header=None)
df.groupby(["X.1","X.2","X.5"])["X.8"].mean()
the output is:
X.1 X.2 X.5
564645 7371810 1530 25.0813
8250 0.0103
Name: X.8
if you don't need index, you can call:
df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
this will give the result as:
X.1 X.2 X.5 X.8
0 564645 7371810 1530 25.0813
1 564645 7371810 8250 0.0103
Ok, based on Hury's input I updated the code -
import os #needed system utils
import numpy as np# for array data processing
import pandas as pd #import the pandas module
datadirectory = input('path to the data directory, ')
working = os.environ.get("WORKING_DIRECTORY", datadirectory)
os.chdir( working)
##READ DATA FILE AND and convert it to string
dataset = open(input('dataset_to_be_used, ')).readlines()
data = ''.join(dataset)
df = pd.read_csv(data, sep="\\s+", header=None)
sorted_data = df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
tuple_data = [tuple(x) for x in sorted_data.values]
datas = np.asarray(tuple_data)
This worked with the test data as posted by hury, but when I use my file the df = ... line does not seem to work (I get an output like this):
Traceback (most recent call last):
File "/media/DATA/arxeia/Programming/MyPys/data_refine_average.py", line 31, in
df = pd.read_csv(data, sep="\s+", header=None)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
return _read(TextParser, filepath_or_buffer, kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 141, in _read
f = com._get_handle(filepath_or_buffer, 'r', encoding=encoding)
File "/usr/lib64/python2.7/site-packages/pandas/core/common.py", line 673, in _get_handle
f = open(path, mode)
IOError: [Errno 36] File name too long: '564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\r\n564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\r\n564645\t7371810\t0\t21642\t20370\t1\t2\t0.0042\r\n564645\t7371810\t0\t21642\t33030\t1\t2\t0.0026\r\n564645\t7371810\t0\t21642\t47970\t1\t2\t0.0018\r\n564645\t7371810\t0\t21642\t63090\t1\t2\t0.0013\r\n564645\t7371810\t0\t21642\t93090\t1\t2\t0.0009\r\n564645\t7371810\t0\t216..........
any ideas?
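The traceback shows that pd.read_csv is being handed the whole file contents where it expects a file name (hence the "File name too long" error). A minimal sketch of a fix is to pass the path itself (or wrap the text in a StringIO) instead of joining the lines into one string; note that recent pandas versions label headerless columns with the integer positions 0..7, while the X.1-style names in the answer above come from a much older release:
import pandas as pd

filename = 'full_location_of_data_file'  # hypothetical placeholder; use the file chosen above
# Let pandas open and parse the file itself instead of receiving its contents as a "name".
df = pd.read_csv(filename, sep=r"\s+", header=None)
sorted_data = df.groupby([0, 1, 4])[7].mean().reset_index()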
It's not the most elegant of answers, and I have no idea how fast/efficient it is, but I believe it gets the job done based on the information you provided:
import numpy

data_file = "full_location_of_data_file"
data_dict = {}
for line in open(data_file):
    line = line.rstrip()
    columns = line.split()
    entry = [columns[0], columns[1], columns[4]]
    entry = "-".join(entry)
    try:  # valid if have already seen combination of 1,2,5
        x = data_dict[entry].append(float(columns[7]))
    except (KeyError):  # KeyError the first time you see a combination of columns 1,2,5
        data_dict[entry] = [float(columns[7])]

for entry in data_dict:
    value = numpy.mean(data_dict[entry])
    output = entry.split("-")
    output.append(str(value))
    output = "\t".join(output)
    print output
I'm unclear whether you want/need columns 3, 6, or 7, so I omitted them. In particular, you do not make clear how you want to deal with the different values that may exist within them. If you can elaborate on what behavior you want (i.e. default to a certain value, or keep the first occurrence), I'd suggest either filling in default values or storing the first instance in a dictionary of dictionaries rather than a dictionary of lists, as sketched below.
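For example, a rough sketch of that dictionary-of-dictionaries variant, keeping the first-seen row next to the running list of column-8 values (file name placeholder as in the snippet above):
data_file = "full_location_of_data_file"
data_dict = {}
for line in open(data_file):
    columns = line.rstrip().split()
    entry = "-".join([columns[0], columns[1], columns[4]])
    if entry not in data_dict:
        # remember the full first occurrence so columns 3, 6 and 7 can be reported later
        data_dict[entry] = {"first_row": columns, "values": []}
    data_dict[entry]["values"].append(float(columns[7]))
Averaging then works on data_dict[entry]["values"] as before, while data_dict[entry]["first_row"] supplies the remaining columns.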
import os  # needed system utils
import numpy as np  # for array data processing

datadirectory = '/media/DATA/arxeia/Dimitris/Testing/12_11'
working = os.environ.get("WORKING_DIRECTORY", datadirectory)
os.chdir(working)

##HERE I WAS TRYING TO READ THE FILE, AND THEN USE THE NAME OF THE STRING IN THE FOLLOWING LINE - THAT RESULTED IN THE SAME ERROR DESCRIBED BELOW (ERROR # 42 (I think) - too large name)
data_dict = {}  # Create empty dictionary
for line in open('/media/DATA/arxeia/Dimitris/Testing/12_11/1a.dat'):  ##above error resolved when used this
    line = line.rstrip()
    columns = line.split()
    entry = [columns[0], columns[1], columns[4]]
    entry = "-".join(entry)
    try:  # valid if have already seen combination of 1,2,5
        x = data_dict[entry].append(float(columns[7]))
    except (KeyError):  # KeyError the first time you see a combination of columns 1,2,5
        data_dict[entry] = [float(columns[7])]

for entry in data_dict:
    value = np.mean(data_dict[entry])
    output = entry.split("-")
    output.append(str(value))
    output = "\t".join(output)
    print output
My other problem now is getting the output in string format (or any format) - then I believe I can get to the save part and manipulate the final format:
np.savetxt('sorted_data.dat', sorted, fmt='%s', delimiter='\t') #Save the data
I still have to figure out how to add the other columns - I am working on that too.
