Reading sections from a large text file in Python efficiently

I have a large text file containing several million lines of data. The very first column contains position coordinates. I need to create another file from this original data, but that only contains specified non-contiguous intervals based on the position coordinates. I have another file containing the coordinates for each interval. For instance, my original file is in a format similar to this:
Position Data1 Data2 Data3 Data4
55 a b c d
63 a b c d
68 a b c d
73 a b c d
75 a b c d
82 a b c d
86 a b c d
Then let's say I have my file containing intervals that looks something like this...
name1 50 72
name2 78 93
Then I want my new file to look something like this...
Position Data1 Data2 Data3 Data4
55 a b c d
63 a b c d
68 a b c d
82 a b c d
86 a b c d
So far I have created a function to write the data from the original file contained within a specific interval to my new file. My code is as follows:
def get_block(beg, end):
    output = open(output_table, 'a')
    with open(input_table, 'r') as f:
        for line in f:
            line = line.strip("\r\n")
            line = line.split("\t")
            position = int(line[0])
            if position <= beg:
                pass
            elif position >= end:
                break
            else:
                for i in line:
                    output.write("%s\t" % i)
                output.write("\n")
I then create a list containing the pairs of my intervals and then loop through my original file using the above function like this:
# coords = [[start1,stop1],[start2,stop2],[start3,stop3]...etc]
for i in coords:
    start_p = int(i[0]); stop_p = int(i[1])
    get_block(start_p, stop_p)
This does what I want; however, it gets progressively slower as it moves along my coordinate list, because each time through the loop I have to read through the entire file until I reach the specified start coordinate. Is there a more efficient way of accomplishing this? Is there a way to skip to a specific line each time instead of reading over every line?

Thanks for the suggestions to use pandas. Previously, my original code had been running for about 18 hours and was only halfway finished. Using pandas, it created my desired file in under 5 minutes. For future reference, and in case anyone else has a similar task, here is the code that I used.
import pandas as pd
data = pd.io.parsers.read_csv(input_table, delimiter="\t")
for i in coords:
    start_p = int(i[0]); stop_p = int(i[1])
    df = data[(data.POSITION >= start_p) & (data.POSITION <= stop_p)]
    df.to_csv(output_table, index=False, sep="\t", header=False, cols=None, mode='a')
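For what it's worth, the loop above still filters the whole DataFrame once per interval. A minimal sketch of a single-pass variant (reusing the input_table, output_table, coords and POSITION names from the code above) would combine all intervals into one boolean mask and write the selection once:

import pandas as pd

data = pd.read_csv(input_table, delimiter="\t")
# start with an all-False mask and OR in one condition per interval
mask = pd.Series(False, index=data.index)
for start_p, stop_p in coords:
    mask |= (data.POSITION >= int(start_p)) & (data.POSITION <= int(stop_p))
data[mask].to_csv(output_table, index=False, sep="\t")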

I'd just use the built-in csv module to simplify reading the input. To further speed things up, all the coord ranges could be read in at once, which would allow the selection process to occur in one pass through the data file.
import csv

# read all coord ranges into memory
with open('ranges', 'r', newline='') as ranges:
    range_reader = csv.reader(ranges, delimiter='\t')
    coords = [(int(start), int(stop)) for name, start, stop in range_reader]

# make one pass through the input file and extract the positions specified
with open('output_table', 'w') as outf, open('input_table', 'r', newline='') as inf:
    input_reader = csv.reader(inf, delimiter='\t')
    outf.write('\t'.join(next(input_reader)) + '\n')  # copy header row
    for row in input_reader:
        for coord in coords:
            if coord[0] <= int(row[0]) <= coord[1]:
                outf.write('\t'.join(row) + '\n')
                break
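If the intervals are sorted and non-overlapping (as they appear to be in the example), the inner loop over coords can also be replaced with a binary search. This is only a sketch under that assumption, keeping the same file names and inclusive boundaries as above:

import bisect
import csv

# read the intervals and sort them by start position
with open('ranges', 'r', newline='') as ranges:
    coords = sorted((int(start), int(stop))
                    for name, start, stop in csv.reader(ranges, delimiter='\t'))
starts = [start for start, stop in coords]

with open('output_table', 'w') as outf, open('input_table', 'r', newline='') as inf:
    input_reader = csv.reader(inf, delimiter='\t')
    outf.write('\t'.join(next(input_reader)) + '\n')  # copy header row
    for row in input_reader:
        pos = int(row[0])
        i = bisect.bisect_right(starts, pos) - 1  # last interval whose start is <= pos
        if i >= 0 and pos <= coords[i][1]:
            outf.write('\t'.join(row) + '\n')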

Related

Python pandas read_csv merge every two columns and read them as a dataframe

I am a beginner in Python and pandas and am trying to figure out how to read from a CSV in a particular way.
My datafile
01 AAA1234 AAA32452 AAA123123 0 -9 C C A A T G A G .......
01 AAA1334 AAA12452 AAA125123 1 -9 C A T G T G T G .......
...
...
...
So I have 100,000 columns in this file and I want to merge every two columns into one. The merging needs to occur after the first 6 columns. I would prefer to do this while reading the file, if possible, instead of manipulating this huge datafile.
Desired outcome
01 AAA1234 AAA32452 AAA123123 0 -9 CC AA TG AG .......
01 AAA1334 AAA12452 AAA125123 1 -9 CA TG TG TG .......
...
...
...
That will result in a dataframe with half the columns. My datafile has no column names; the names reside in a different CSV, but that is another subject.
I'd appreciate a solution, thanks in advance!
Separate the data frame initially; I created one for experimental purposes. Then I defined a function and passed the dataframe that needs manipulation as an argument to it:
def columns_joiner(data):
    new_data = pd.DataFrame()
    for i in range(0, 11, 2):  # You can change the range to your wish
        # Here, I had only 10 columns to concatenate (therefore the range ends at 11)
        ser = data[i] + data[i + 1]
        new_data = pd.concat([new_data, ser], axis=1)
    return new_data
I don't think this is an efficient solution. But it worked for me.
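A vectorized alternative is possible. Here is a rough sketch only: the file name and whitespace separator are assumptions, and it assumes an even number of genotype columns after the first six. It keeps the first 6 columns and concatenates the remaining columns pairwise:

import pandas as pd

# read everything as strings so that '+' concatenates rather than adds
df = pd.read_csv('datafile.txt', sep=r'\s+', header=None, dtype=str)
fixed = df.iloc[:, :6]
# element-wise string concatenation of columns 6&7, 8&9, 10&11, ...
pairs = df.iloc[:, 6::2].values + df.iloc[:, 7::2].values
merged = pd.concat([fixed, pd.DataFrame(pairs, index=df.index)], axis=1)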

df apply function in loop overrides prior values

so my df looks like this:
x y group
0 53 10 csv1
1 53 10 csv1
2 48 9 csv0
3 48 9 csv0
4 48 9 csv0
... ... ... ...
I have some files whose names depend on the group name, and I want to use them in a function along with the x and y values.
What I am doing so far is the following:
dfGrouped = df.groupby('group')  # group the dataframe
df['newcol'] = np.nan  # create new empty col
# use a for loop to load the file depending on the group; note the file is very large,
# that's why I want to load it only once per group
for name, group in dfGrouped:
    file = open(name + '.txt')  # open the file
    df['newcol'] = df[df['group'] == name].apply(lambda row: newValueFromFile(row.x, row.y, file), axis=1)
It seemed to work at first; unfortunately, newcol only holds the values from the last loop iteration, and the values created earlier seem to be overridden with NaN. Does anybody have an idea?
Instead of file = open(...), use
with open('filename.txt', 'a') as file:
and then call file.write(...) in the lambda expression.
The 'a' in the open call means the data is appended to the existing file content; I guess you are currently overwriting the content of the file.
with open() also takes care of closing the file automatically once you're done with it.
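As a side note on the question code itself: assigning with df['newcol'] = df[df['group'] == name].apply(...) replaces the whole column on every iteration, so rows outside the current group become NaN again. A minimal sketch (reusing the asker's dfGrouped, df and newValueFromFile names) that assigns only to the matching rows would be:

for name, group in dfGrouped:
    with open(name + '.txt') as f:
        mask = df['group'] == name
        # write results only into the rows of the current group, keeping earlier values
        df.loc[mask, 'newcol'] = df.loc[mask].apply(
            lambda row: newValueFromFile(row.x, row.y, f), axis=1)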

Adding values from a CSV file

I am beginning to learn Python and am struggling with the syntax.
I have a simple CSV file that looks like this
0.01,10,20,0.35,40,50,60,70,80,90,100
2,22,32,42,52,62,72,82,92,102,112
3,33,43,53,63,5647,83,93,103,113,123
I want to look for the highest and lowest value in all the data in the csv file except in the first value of each row.
So effectively the answer here would be
highestValue=5647
lowestValue=0.35
because the data that is looked at is as follows (it ignores the first value of each row)
10,20,0.35,40,50,60,70,80,90,100
22,32,42,52,62,72,82,92,102,112
33,43,53,63,5647,83,93,103,113,123
I would like my code to work for ANY row length.
I really have to admit I'm struggling, but here's what I've tried. I usually program in PHP, so this is all new to me. I have been working on this simple task for a day and can't figure it out. I think I'm getting confused by the terminology, 'lists' for example.
import numpy as np

test_data_file = open("Anaconda3JamesData/james_test_3.csv", "r")
test_data_list = test_data_file.readlines()
test_data_file.close()

for record in test_data_list:
    all_values = record.split(',')
    maxvalue = np.max(np.asfarray(all_values[1:]))
    print(maxvalue)
With the test data (the CSV file shown at the very top of this question) I would expect the answer to be
highestValue=5647
lowestValue=0.35
If you're using numpy, you can read your csv file as a numpy.ndarray using numpy.genfromtxt() and then use the array's .max() and .min() methods
import numpy
array = numpy.genfromtxt('Anaconda3JamesData/james_test_3.csv', delimiter=',')
array[:, 1:].max()
array[:, 1:].min()
The [:, 1:] part is using numpy's array indexing. It's saying take all the rows (the [:, part), and for each row take all but the first column (the 1:] part). This doesn't work with Python's built-in lists.
You're overwriting maxvalue each time through the loop, so you're just getting the max value from the last line, not the whole file. You need to compare with the previous maximum.
maxvalue = None
for record in test_data_list:
    all_values = record.split(',')
    if maxvalue is None:
        maxvalue = np.max(np.asfarray(all_values[1:]))
    else:
        maxvalue = max(maxvalue, np.max(np.asfarray(all_values[1:])))
You do not need the power of numpy for this problem. A simple CSV reader is good enough:
with open("Anaconda3JamesData/james_test_3.csv") as infile:
r = csv.reader(infile)
rows = [list(map(float, line))[1:] for line in r]
max(map(max, rows))
# 5647.0
min(map(min, rows))
# 0.35
I think using numpy is unneeded for this task. First of all, this:
test_data_file = open ("Anaconda3JamesData/james_test_3.csv","r")
test_data_list = test_data_file.readlines()
test_data_file.close()
for record in test_data_list:
can be simplified into this:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
for record in test_data_file:
We can use a list comprehension to read in all of the values:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
values = [float(val) for line in test_data_file for val in line.split(",")[1:]]
values now contains all relevant numbers, so we can just do:
highest_value = max(values)
lowest_value = min(values)
Here's a pandas solution that can give the desired results:
import pandas as pd
df = pd.read_csv('test1.csv', header=None)
# df:
# 0 1 2 3 4 5 6 7 8 9 10
# 0 0.01 10 20 0.35 40 50 60 70 80 90 100
# 1 2.00 22 32 42.00 52 62 72 82 92 102 112
# 2 3.00 33 43 53.00 63 5647 83 93 103 113 123
df = df.iloc[:, 1:]
print("Highest value: {}".format(df.values.max()))
print("Lowest value: {}".format(df.values.min()))
#Output:
Highest value: 5647.0
Lowest value: 0.35

add computed column to a csv file

I hope this isn't a classic beginner question; however, I have read around and spent days trying to save my CSV data without success.
I have a function that takes an input parameter that I supply manually. The function generates 3 columns that I save to a CSV file. When I use the function with other inputs and try to save the new data to the right of the previously computed columns, pandas instead appends the new 3 columns below the existing ones, repeating the headers.
I'm using the next code to save my data:
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',',mode='a')
and the result is:
dot lake mock
1 42 11.914558
2 41 42.446977
3 40 89.188668
dot lake mock
1 42 226.266513
2 41 317.768887
dot lake mock
3 42 560.171830
4 41 555.005333
What I want is:
dot lake mock mock mock
0 42 11.914558 226.266513 560.171830
1 41 42.446977 317.768887 555.005533
2 40 89.188668
UPDATE:
My DataFrame was generated using a function like this:
First I opened a csv file:
df1=pd.read_csv('current_state.csv')
def my_function(df1, photos, coords=['X', 'Y']):
    Hzs = t.copy()
    shifts = np.floor(Hzs / t_step).astype(np.int)
    ms = np.zeros(shifts.size)
    delta_inv = np.arange(N+1)
    dot = delta_inv[N:0:-1]
    lake = np.arange(1, N+1)
    for i, shift in enumerate(shifts):
        diffs = df1[coords] - df1[coords].shift(-shift)
        sqdist = np.square(diffs).sum(axis=1)
        ms[i] = sqdist.sum()
    mock = np.divide(ms, dot)
    msds = pd.DataFrame({'dot': dot, 'lake': lake, 'mock': mock})
    return msds
data = my_function(df1, photos, coords=['X', 'Y'])
print(data)
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',', mode='a')
I looked for several days for a way to write computed columns into a CSV file, each set right next to the previous one (despite the unpleasant comments of some guys!). I finally found how to do this. If someone needs something similar:
First I save my data using to_csv:
data.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',',mode='a', index=False)
After the file has already been generated with the headers, I remove the index that I don't need and simply run the following at the end:
b = data
a = pd.read_csv('data_new.csv')
c = pd.concat ([a,b],axis=1, ignore_index=True)
c.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',', index=False)
As a result I got the desired CSV file, and it is possible to call the function as many times as you want!
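If the new columns should also stay aligned on the dot and lake keys (as in the desired output above), a merge-based sketch might look like the following; photo_sets is a hypothetical name for the list of inputs, while my_function and df1 come from the question:

import pandas as pd

results = []
for i, p in enumerate(photo_sets):  # photo_sets is an assumed list of inputs to the function
    out = my_function(df1, p, coords=['X', 'Y'])
    results.append(out.rename(columns={'mock': 'mock_%d' % i}))

combined = results[0]
for extra in results[1:]:
    # align on the shared key columns instead of relying on the row order
    combined = combined.merge(extra, on=['dot', 'lake'], how='outer')

combined.to_csv('data_new.csv', sep=',', index=False)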

how to save a column vector in an iteration process to a text file

I need to save a column vector obtained in an iteration to a text file using python.
This is what I have been using until now:
from numpy import savetxt
savetxt('displacement{0}.out'.format(globdat.cycle), a, delimiter=',', fmt='%10.4e')
globdat.cycle is used as a count so that in each iteration a separate file is made.
Requirement: I do not want separate files, but a single file which contains all the vectors corresponding to each iteration.
E.g. iteration 1 values = [ 1 2 3 4 5 6 ]' and iteration 2 values = [ a c v b f h ]'.
My text file should look something similar to
1,a
2,c
3,v
4,b
5,f
6,h
I would much appreciate some help.
Thanks
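One possible approach (a sketch only, assuming every iteration produces a numeric vector of the same length and that collecting them in memory is acceptable) is to append each iteration's vector to a list and write them side by side once at the end with numpy.column_stack; the loop below only stands in for the real iteration:

import numpy as np

columns = []
for cycle in range(3):                 # stand-in for the real iteration loop
    a = np.arange(1, 7) * (cycle + 1)  # stand-in for the vector computed each iteration
    columns.append(a)

# write all vectors side by side, one column per iteration, in a single file
np.savetxt('displacement_all.out', np.column_stack(columns), delimiter=',', fmt='%10.4e')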
