Matching line numbers with strings in a table - Python

I have a file with a list of columns describing particular parameters:
size magnitude luminosity
I need only particular data (particular lines and columns) from this file. So far I have code in Python where I have appended the necessary line numbers to a list. I just need to know how I can match those line numbers against the text file to get the right rows, keeping only the magnitude and luminosity columns. Any suggestions on how I could approach this?
Here is a sample of my code (#comments describe what I have done and what I want to do):
temp_ListMatch = (point[5]).strip()
if temp_ListMatch:
    ListMatchaddress = (point[5]).strip()
    ListMatchaddress = re.sub(r'\s', '_', ListMatchaddress)
    ListMatch_dirname = '/projects/XRB_Web/apmanuel/499/Lists/' + ListMatchaddress
    #print ListMatch_dirname+"\n"
    try:
        file5 = open(ListMatch_dirname, 'r')
    except IOError:
        print 'Cannot open: '+ListMatch_dirname
    Optparline = []
    for line in file5:
        point5 = line.split()
        j = int(point5[1])
        Optparline.append(j)
    #Basically file5 contains the line numbers I need,
    #and I have appended these numbers to the list Optparline.
temp_others = (point[4]).strip()
if temp_others:
    othersaddress = (point[4]).strip()
    othersaddress = re.sub(r'\s', '_', othersaddress)
    othersbase_dirname = '/projects/XRB_Web/apmanuel/499/Lists/' + othersaddress
    try:
        file6 = open(othersbase_dirname, 'r')
    except IOError:
        print 'Cannot open: '+othersbase_dirname
    gmag = []
    z = []
    rh = []
    gz = []
    for line in file6:
        point6 = line.split()
        f = float(point6[2])
        g = float(point6[4])
        h = float(point6[6])
        i = float(point6[9])
    # So now I have opened file6, where this list of data is, and have
    # identified the columns of elements that I need.
    # I only need the particular rows (provided by line number)
    # with these elements chosen. That is where I'm stuck!

Load the whole data file into a pandas DataFrame (assuming that the data file has a header from which we can get the column names):
import pandas as pd
df = pd.read_csv('/path/to/file')
Load the file of line numbers into a pandas Series (assuming there's one per line):
# header=None stops the first number being read as a column name, and
# .squeeze('columns') turns the one-column DataFrame into a Series
row_numbers = pd.read_csv('/path/to/rows_file', header=None).squeeze('columns')
Return only those lines which are in the row-number file, and only the columns magnitude and luminosity (this assumes the first row is numbered 0; .loc replaces the old .ix indexer, which has been removed from pandas):
relevant_rows = df.loc[row_numbers, ['magnitude', 'luminosity']]
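Putting those pieces together, a minimal sketch, assuming the data file is whitespace-separated with the header 'size magnitude luminosity' and the rows file holds one 0-based line number per line (both paths are placeholders):
import pandas as pd

# whitespace-separated data file with a header row
df = pd.read_csv('/path/to/file', sep=r'\s+')
# one 0-based row number per line, no header
row_numbers = pd.read_csv('/path/to/rows_file', header=None).squeeze('columns')
# pick out the wanted rows and the two wanted columns
relevant_rows = df.loc[row_numbers, ['magnitude', 'luminosity']]
print(relevant_rows)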

Related

Add one column to a text file

I have multiple txt files, and each of these txt files has 6 columns. What I want to do: add just one column as the last column, so that in the end each txt file has at most 7 columns, and if I run the script again it shouldn't add a new one.
At the beginning each file has six columns:
637.39 718.53 155.23 -0.51369 -0.18539 0.057838 3.209840789730089
636.56 720 155.57 -0.51566 -0.18487 0.056735 3.3520643559939938
635.72 721.52 155.95 -0.51933 -0.18496 0.056504 3.4997850701290125
What I want is to add a new column of zeros only if the current number of columns is 6; after that, it should prevent adding a new column when I run the script again (7 is the total number of columns, where the last one is zeros):
637.39 718.53 155.23 -0.51369 -0.18539 0.057838 3.209840789730089 0
636.56 720 155.57 -0.51566 -0.18487 0.056735 3.3520643559939938 0
635.72 721.52 155.95 -0.51933 -0.18496 0.056504 3.4997850701290125 0
My code works but adds one additional column each time I run the script, and I want it to add the column just once, when the number of columns is 6. Here (a) gives me the number of columns, and if the condition is fulfilled a new one should be added:
from glob import glob
import numpy as np

new_column = [0] * 20

def get_new_line(t):
    l, c = t
    return '{} {}\n'.format(l.rstrip(), c)

def writecolumn(filepath):
    # Load data from file
    with open(filepath) as datafile:
        lines = datafile.readlines()
    a = np.loadtxt(lines, dtype='str').shape[1]
    print(a)
    #if a==6:  (here is the problem)
    n, r = divmod(len(lines), len(new_column))
    column = new_column * n + new_column[:r]
    new_lines = list(map(get_new_line, zip(lines, column)))
    with open(filepath, "w") as f:
        f.writelines(new_lines)

if __name__ == "__main__":
    filepaths = glob("/home/experiment/*.txt")
    for path in filepaths:
        writecolumn(path)
When I check the number of columns with if a==6 and shift the content inside the if statement, I get an error. Without shifting the content inside the if, everything works fine, but it still adds one column each time I run it.
Any help is appreciated.
To test the code, create one or two txt files with six columns of random numbers.
It could be an indentation problem, i.e. the block below the if: writing the new lines should be indented properly.
This works:
def writecolumn(filepath):
    # Load data from file
    with open(filepath) as datafile:
        lines = datafile.readlines()
    a = np.loadtxt(lines, dtype='str').shape[1]
    print(a)
    if int(a) == 6:
        n, r = divmod(len(lines), len(new_column))
        column = new_column * n + new_column[:r]
        new_lines = list(map(get_new_line, zip(lines, column)))
        with open(filepath, "w") as f:
            f.writelines(new_lines)
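For what it's worth, the column count can also be checked without loading the whole file through np.loadtxt; a minimal sketch, assuming whitespace-separated fields and that every row has the same number of columns:
def count_columns(filepath):
    # number of whitespace-separated fields in the first line
    with open(filepath) as datafile:
        return len(datafile.readline().split())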
Use pandas to read your text file:
import pandas as pd
df = pd.read_csv("whitespace.csv", header=None, delimiter=" ")
Add a column or more as needed:
df['somecolname'] = 0
Save the DataFrame with no header.
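As a hedged end-to-end sketch of that pandas approach ('data.txt' is a placeholder path; sep=r'\s+' tolerates runs of spaces):
import pandas as pd

df = pd.read_csv('data.txt', header=None, sep=r'\s+')
if df.shape[1] == 6:   # only append the zero column once
    df[6] = 0
df.to_csv('data.txt', sep=' ', header=False, index=False)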

How to iterate over files in Python and export several output files

I have a piece of code that I want to put in a for loop. I want to feed it some data stored as files and, based on each input, generate an output automatically. At the moment the code works for one input file and consequently gives one output. My input file is named model000.msh, but in fact I have a series of these input files named model000.msh, model001.msh, and so on. In the code I do some calculations on the imported file and finally compare it to a numpy array (my_data) that is generated from another numpy array (ID) having one column and thousands of rows. ID makes my_data through np.concatenate, and I want to use each column of ID to make my_data (my_data = np.concatenate((ID[:, iterator], gr), axis=1)). So I want to iterate over several files, extract arrays from each file (extracted), generate my_data from the corresponding column of ID, do the calculations on my_data and extracted, and finally export the results of each iteration with a dynamic naming scheme (changed_000, changed_001, and so on). This is my code for one single input and one single my_data array (made from an ID that has only one column), which I want to change so it iterates over several input files and several my_data arrays and produces several outputs:
from itertools import islice
import numpy as np

with open('model000.msh') as lines:
    nodes = np.genfromtxt(islice(lines, 0, 1000))
with open('model000.msh', "r") as f:
    saved_lines = np.array([line.split() for line in f if len(line.split()) == 9])
saved_lines[saved_lines == ''] = 0.0
elem = saved_lines.astype(np.int)
# following lines extract some data from my file
extracted = np.c_[elem[:,:-4], nodes[elem[:,-4]-1, 1:], nodes[elem[:,-3]-1, 1:],
                  nodes[elem[:,-2]-1, 1:], nodes[elem[:,-1]-1, 1:]]
…
extracted = np.concatenate((extracted, avs), axis=1)  # each input file ('model000.msh') will make this numpy array
# another data set, stored as a numpy array, is compared to the data extracted from the file
ID = np.array([[… ..., …, …]])  # for now it has one column, but it should have several; each iteration, one column will make a my_data array
my_data = np.concatenate((ID, gr), axis=1)  # I think it should be something like my_data = np.concatenate((ID[:, iterator], gr), axis=1)
from scipy.spatial import distance
distances = distance.cdist(extracted[:, 17:20], my_data[:, 1:4])
ind_min_dis = np.argmin(distances, axis=1).reshape(-1, 1)
z = np.array([])
for i in ind_min_dis:
    u = my_data[i, 0]
    z = np.array([np.append(z, u)]).reshape(-1, 1)
final_merged = np.concatenate((extracted, z), axis=1)
new_vol = final_merged[:, -1].reshape(-1, 1)
new_elements = np.concatenate((elements, new_vol), axis=1)
new_elements[:, [4, -1]] = new_elements[:, [-1, 4]]
# The next block is the output block
chunk_size = 3
buffer = ""
i = 0
relavent_line = 0
with open('changed_00', 'a') as fout:
    with open('model000.msh', 'r') as fin:
        for line in fin:
            if len(line.split()) == 9:
                aux_string = ' '.join([str(num) for num in new_elements[relavent_line]])
                buffer += '%s\n' % aux_string
                relavent_line += 1
            else:
                buffer += line
            i += 1
            if i == chunk_size:
                fout.write(buffer)
                i = 0
                buffer = ""
        if buffer:
            fout.write(buffer)
            i = 0
            buffer = ""
I appreciate any help in advance.
I'm not very sure about your question, but it seems like you are asking for something like:
for idx in range(10):
    with open('changed_{:0>3d}'.format(idx), 'a') as fout:
        with open('model{:0>3d}.msh'.format(idx), 'r') as fin:
            # read something from fin...
            # calculate something...
            # write something to fout...
            pass  # placeholder so the block is valid Python
If so, you could search for str.format() for more details.
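A hedged sketch of the looping structure the question describes, with each input file paired with one column of ID and one output file; the arrays here are placeholders for the question's real data, and the extraction and distance calculation would go where the comments indicate:
import numpy as np

n_files = 3                                # assumption: how many model files exist
ID = np.arange(12.0).reshape(4, n_files)   # placeholder for the real ID array
gr = np.ones((4, 3))                       # placeholder for the real gr array

for idx in range(n_files):
    in_path = 'model{:03d}.msh'.format(idx)
    out_path = 'changed_{:03d}'.format(idx)
    # keep the ID column 2-D (idx:idx+1) so concatenate lines up with gr
    my_data = np.concatenate((ID[:, idx:idx+1], gr), axis=1)
    # ... read in_path, build 'extracted', compute 'new_elements' ...
    # ... then write new_elements to out_path as in the output block above ...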

Averaging columns in a text file with row and column headers

I'm new to the group, and to Python. I have a very specific type of input file that I'm working with. It is a text file with one header row of text. In addition there is a column of text, too, which makes things more annoying. What I want to do is read in this file and then perform operations on the columns of numbers (like average, stdev, etc.), but reading in the file and parsing out the text column is giving me trouble.
I've played with many different approaches and got close, but figured I'd reach out to the group here. If this were MATLAB I'd have had it done hours ago. As of now, if I use fixed widths to define my columns I think it will work, but I thought there is likely a more efficient way to read in the lines and ignore the strings properly.
Here is the file format. As you can see, row one is the header, so it can be ignored. Column 1 contains text.
postraw.txt
I think I figured it out. My code is probably very crude, but it works for now:
CTlist = []
CLlist = []
CDlist = []
CMZlist = []
LDelist = []
loopout = {'a1': CTlist, 'a2': CLlist, 'a3': CDlist, 'a4': CMZlist, 'a5': LDelist}
#Specify number of headerlines
headerlines = 1
#set initial index to 0
i = 0
#begin loop to process input file, avoiding any header lines
with open('post.out', 'r') as file:
    for row in file:
        if i > (headerlines - 1):
            rowvars = row.split()
            # use j for the inner loop: reusing i would clobber the line counter
            for j in range(2, len(rowvars)):
                #print(rowvars[j]) #JUST A CHECK/DEBUG LINE
                loopout['a{0}'.format(j - 1)].append(float(rowvars[j]))
        i = i + 1
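Once loopout is populated, the averages (and other statistics) the question asks about fall out directly; a minimal sketch using numpy:
import numpy as np

for name, values in loopout.items():
    if values:  # skip any column that stayed empty
        arr = np.asarray(values)
        print(name, arr.mean(), arr.std())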

How can I calculate the sum of the values in a field less than a certain value

I have a CSV file separated by commas. I need to read the file and determine the sum of the values in the field [reading] that are less than (say) 406.2.
My code so far is as follows:
myfile = open('3517315a.csv', 'r')
myfilecount = 0
linecount = 0
firstline = True
for line in myfile:
    if firstline:
        firstline = False
        continue
    fields = line.split(',')
    linecount += 1
    count = int(fields[0])
    colour = str(fields[1])
    channels = int(fields[2])
    code = str(fields[3])
    correct = str(fields[4])
    reading = float(fields[5])
How can I set this condition?
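Staying with the plain-Python structure of the loop above, the condition is just an if with a running total; a minimal sketch:
total = 0.0
with open('3517315a.csv', 'r') as myfile:
    next(myfile)  # skip the header line
    for line in myfile:
        fields = line.split(',')
        reading = float(fields[5])
        if reading < 406.2:
            total += reading
print(total)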
Use np.genfromtxt to read the CSV.
import numpy as np
#data = np.genfromtxt('3517315a.csv', delimiter=',')
data = np.random.random(10).reshape(5,2) * 600 # exemplary data
# since I don't have your CSV
threshold = 406.2
print(np.sum(data * (data<threshold)))
I haven't tested this (I don't have example data or your file) but something like this should do it:
import numpy as np

#import data from file, give each column a name, skipping the header row
data = np.genfromtxt('3517315a.csv', delimiter=',', skip_header=1,
                     names=['count', 'colour', 'channels', 'code', 'correct', 'reading'])
#move to a normal array to make it easier to follow (not necessary)
readingdata = data['reading']
#find the values less than your limit (np.where())
#extract only those values (readingdata[])
#then sum those extracted values (np.sum())
total = np.sum(readingdata[np.where(readingdata < 406.2)])
You can write an iterator that extracts the reading field and casts it to a float. Wrap that in another iterator that tests your condition and sum the result.
import csv

with open('3517315a.csv', newline='') as fp:
    next(fp)  # discard header
    reading_sum = sum(reading for reading in
                      (float(row[5]) for row in csv.reader(fp))
                      if reading < 406.2)

Get number of rows from .csv file

I am writing a Python module where I read a .csv file with 2 columns and an arbitrary number of rows. I then go through these rows until column 1 > x. At this point I need the data from the current row and the previous row to do some calculations.
Currently, I am using 'for i in range(rows)', but each csv file will have a different number of rows, so this won't work.
The code can be seen below:
rows = 73
for i in range(rows):
    c_level = Strapping_Table[Tank_Number][i, 0]   # Current level
    c_volume = Strapping_Table[Tank_Number][i, 1]  # Current volume
    if c_level > level:
        p_level = Strapping_Table[Tank_Number][i-1, 0]   # Previous level
        p_volume = Strapping_Table[Tank_Number][i-1, 1]  # Previous volume
        x = level - p_level  # Intermediate values
        if x < 0:
            x = 0
        y = c_level - p_level
        z = c_volume - p_volume
        volume = p_volume + ((x / y) * z)
        return volume
When playing around with arrays, I used:
for row in Tank_data:
    print row[c]  # print column c
    time.sleep(1)
This goes through all the rows, but I cannot access the previous row's data with this method.
I have thought about storing previous row and current row in every loop, but before I do this I was wondering if there is a simple way to get the amount of rows in a csv.
Store the previous line:
with open("myfile.txt", "r") as file:
    previous_line = next(file)
    for line in file:
        print(previous_line, line)
        previous_line = line
Or you can use it with generators:
def prev_curr(file_name):
    with open(file_name, "r") as file:
        previous_line = next(file)
        for line in file:
            yield previous_line, line
            previous_line = line

# usage
for prev, curr in prev_curr("myfile"):
    do_your_thing()
You should use enumerate.
for i, row in enumerate(tank_data):
    if i:  # guard: at i == 0 there is no previous row (tank_data[-1] would wrap around)
        print row[c], tank_data[i-1][c]
Since the size of the csv is unknown until it's read, you'll have to do an initial pass through if you want to find the number of rows, e.g.:
numberOfRows = sum(1 for row in file)
However, that would mean your code reads the csv twice, which, if it's very big, you may not want to do; the simple option of storing the previous row in a variable on each iteration may be the best option in that case.
An alternative route could be to read the file in and analyse it with e.g. a pandas DataFrame (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), but again this could lead to slowness if your csv is too big.
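If the row count itself is still wanted, reading the file into a list makes it a len() call; a minimal sketch assuming the file fits in memory ('myfile.csv' is a placeholder path):
import csv

with open('myfile.csv', newline='') as fp:
    rows = list(csv.reader(fp))
print(len(rows))  # number of rows, including any header
This also keeps all rows indexable, so rows[i-1] gives the previous row without any extra bookkeeping.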
