I am writing a Python module where I read a .csv file with 2 columns and an arbitrary number of rows. I then go through these rows until column 1 > x. At this point I need the data from the current row and the previous row to do some calculations.
Currently, I am using 'for i in range(rows)', but each csv file will have a different number of rows, so this won't work.
The code can be seen below:
rows = 73
for i in range(rows):
    c_level = Strapping_Table[Tank_Number][i,0]   # Current level
    c_volume = Strapping_Table[Tank_Number][i,1]  # Current volume
    if c_level > level:
        p_level = Strapping_Table[Tank_Number][i-1,0]   # Previous level
        p_volume = Strapping_Table[Tank_Number][i-1,1]  # Previous volume
        x = level - p_level  # Intermediate values
        if x < 0:
            x = 0
        y = c_level - p_level
        z = c_volume - p_volume
        volume = p_volume + ((x / y) * z)
        return volume
When playing around with arrays, I used:
for row in Tank_data:
    print row[c]  # print column c
    time.sleep(1)
This goes through all the rows, but I cannot access the previous row's data this way.
I have thought about storing the previous row and the current row on every loop, but before I do that I was wondering if there is a simple way to get the number of rows in a csv.
Store the previous line
with open("myfile.txt", "r") as file:
    previous_line = next(file)
    for line in file:
        print(previous_line, line)
        previous_line = line
Or you can do it with a generator:
def prev_curr(file_name):
    with open(file_name, "r") as file:
        previous_line = next(file)
        for line in file:
            yield previous_line, line
            previous_line = line

# usage
for prev, curr in prev_curr("myfile"):
    do_your_thing()
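Applied to the interpolation in the question, the same previous/current pattern might look like this; a minimal sketch (the function name is made up), assuming the strapping-table rows are (level, volume) pairs sorted by ascending level:

def interpolate_volume(rows, level):
    # rows: iterable of (level, volume) pairs in ascending level order (assumption)
    rows = iter(rows)
    p_level, p_volume = next(rows)            # previous row
    for c_level, c_volume in rows:            # current row
        if c_level > level:
            x = max(level - p_level, 0)       # clamp negative x, as in the question
            y = c_level - p_level
            z = c_volume - p_volume
            return p_volume + (x / y) * z
        p_level, p_volume = c_level, c_volume
    return None                               # level lies beyond the last row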
You should use enumerate. (Note that at i == 0, tank_data[i-1] is the last row, so guard against that if it matters.)
for i, row in enumerate(tank_data):
    print(row[c], tank_data[i-1][c])
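On Python 3.10+, itertools.pairwise gives the same previous/current pairing directly; a minimal sketch, assuming tank_data is any iterable of rows and c is the column index from the question:

from itertools import pairwise

for prev_row, curr_row in pairwise(tank_data):
    print(prev_row[c], curr_row[c])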
Since the number of rows in the csv is unknown until it's read, you'll have to do an initial pass through if you want to count them, e.g.:
numberOfRows = sum(1 for row in file)
However, that means your code will read the csv twice, which you may not want to do if it's very big - in that case, the simple option of storing the previous row in a variable on each iteration may be the best one.
An alternative route could be to read the file into e.g. a pandas DataFrame (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) and analyse it from there, but again this could be slow if your csv is very big.
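For instance, a minimal sketch of that route, assuming a headerless two-column file and made-up file and column names:

import pandas as pd

df = pd.read_csv("myfile.csv", header=None, names=["level", "volume"])
print(len(df))      # number of rows
prev = df.shift(1)  # each row's previous row, aligned with the current one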
Related
I have multiple txt files, and each of these txt files has 6 columns. What I want to do: add just one column as a last column, so at the end each txt file has at most 7 columns, and if I run the script again it shouldn't add a new one.
At the beginning each file has six columns:
637.39 718.53 155.23 -0.51369 -0.18539 0.057838 3.209840789730089
636.56 720 155.57 -0.51566 -0.18487 0.056735 3.3520643559939938
635.72 721.52 155.95 -0.51933 -0.18496 0.056504 3.4997850701290125
What I want is to add a new column of zeros only if the current number of columns is 6; after that, it should prevent adding a new column when I run the script again (7 columns is the total, where the last one is zeros):
637.39 718.53 155.23 -0.51369 -0.18539 0.057838 3.209840789730089 0
636.56 720 155.57 -0.51566 -0.18487 0.056735 3.3520643559939938 0
635.72 721.52 155.95 -0.51933 -0.18496 0.056504 3.4997850701290125 0
My code works, but it adds one additional column each time I run the script; I want it to add the column just once, when the number of columns is 6. Here (a) gives me the number of columns, and if the condition is fulfilled a new column should be added:
from glob import glob
import numpy as np

new_column = [0] * 20

def get_new_line(t):
    l, c = t
    return '{} {}\n'.format(l.rstrip(), c)

def writecolumn(filepath):
    # Load data from file
    with open(filepath) as datafile:
        lines = datafile.readlines()
    a = np.loadtxt(lines, dtype='str').shape[1]
    print(a)
    #if a==6:  (here is the problem)
    n, r = divmod(len(lines), len(new_column))
    column = new_column * n + new_column[:r]
    new_lines = list(map(get_new_line, zip(lines, column)))
    with open(filepath, "w") as f:
        f.writelines(new_lines)

if __name__ == "__main__":
    filepaths = glob("/home/experiment/*.txt")
    for path in filepaths:
        writecolumn(path)
When I check the number of columns with if a==6 and shift the content inside the if statement, I get an error. Without shifting the content inside the if, everything works fine, but it still adds one column each time I run it.
Any help is appreciated.
To test the code, create one or two txt files with a random number of rows and six columns.
Could be an indentation problem, i.e. the block below the 'if': writing the new lines should be indented properly.
This works:
def writecolumn(filepath):
    # Load data from file
    with open(filepath) as datafile:
        lines = datafile.readlines()
    a = np.loadtxt(lines, dtype='str').shape[1]
    print(a)
    if int(a) == 6:
        n, r = divmod(len(lines), len(new_column))
        column = new_column * n + new_column[:r]
        new_lines = list(map(get_new_line, zip(lines, column)))
        with open(filepath, "w") as f:
            f.writelines(new_lines)
Use pandas to read your text file:
import pandas as pd
df = pd.read_csv("whitespace.csv", header=None, delimiter=" ")
Add a column or more as needed:
df['somecolname'] = 0
Save the DataFrame with no header.
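Putting those pieces together, a minimal sketch of the save step that also applies the question's only-add-once rule (the file name is reused from above, the column label is arbitrary):

# assuming df from above; only add the zero column if it isn't there yet
if df.shape[1] == 6:
    df[6] = 0
df.to_csv("whitespace.csv", sep=" ", header=False, index=False)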
I'm new to the group, and to Python. I have a very specific type of input file that I'm working with. It is a text file with one header row of text. In addition, there is a column of text too, which makes things more annoying. What I want to do is read in this file and then perform operations on the columns of numbers (like average, stdev, etc.), but reading in the file and parsing out the text column is giving me trouble.
I've played with many different approaches and got close, but figured I'd reach out to the group here. If this were MATLAB I'd have had it down hours ago. As of now, if I use fixed widths to define my columns I think it will work, but there is likely a more efficient way to read in the lines and ignore the strings properly.
Here is the file format. As you can see, row one is the header, so it can be ignored. Column 1 contains text.
postraw.txt
....I think I figured it out. My code is probably very crude, but it works for now:
CTlist = []
CLlist = []
CDlist = []
CMZlist = []
LDelist = []
loopout = {'a1':CTlist, 'a2':CLlist, 'a3':CDlist, 'a4':CMZlist, 'a5':LDelist}
#Specify number of headerlines
headerlines = 1
#set initial index to 0
i = 0
#begin loop to process input file, avoiding any header lines
with open('post.out', 'r') as file:
    for row in file:
        if i > (headerlines - 1):
            rowvars = row.split()
            for j in range(2, len(rowvars)):  # j, so the line counter i isn't clobbered
                #print(rowvars[j]) #JUST A CHECK/DEBUG LINE
                loopout['a{0}'.format(j-1)].append(float(rowvars[j]))
        i = i + 1
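Once the lists are filled, the averages and standard deviations the question asks about come straight from the standard library; a minimal sketch using the names above:

import statistics

for name, values in loopout.items():
    print(name, statistics.mean(values), statistics.stdev(values))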
I've got a .csv file with different data formats, and I'm trying to operate on values from the same column.
My csv file is something like this:
"int","float","string", more data
Example:
"2","1.378","Johnny"
"1","1.379","Walker"
"5","1.380","Jack"
"8","1.700","Daniels"
"8","1.710","Baileys"
"8","1.381","Monkey"
"8","1.711","Shoulder"
"8","1.383","Captain"
"8","1.385","Morgan"
"8","1.392","Drinks"
More rows
I would like to subtract values in the second column if their difference is > x (only those, I don't care about the others).
My code so far:
import csv

with open('input.csv', 'r') as file, open('output.csv', 'w') as f_out:
    readCSV = csv.reader(file)
    writeCSV = csv.writer(f_out, lineterminator='\n')
    last = None
    for row in readCSV:
        datalat = float(row[1])
        if last is not None:
            #print("difference -> %f" % (datalat - last))
            outp = (datalat - last)
            if outp <= 0.02:
                writeCSV.writerow(row)
        last = datalat
The output looks like:
5,1.380,Jack
8,1.710,Baileys
8,1.381,Monkey
8,1.383,Captain
8,1.385,Morgan
8,1.392,Drinks
But I would like it to be:
"2","1.378","Johnny"
"1","1.379","Walker"
"5","1.380","Jack"
"8","1.381","Monkey"
"8","1.383","Captain"
"8","1.385","Morgan"
"8","1.392","Drinks"
So what it should do is only write rows that have less than 0.02 difference; if there is a row with a bigger difference, discard it, then compare the next row to the last written row, as opposed to the last discarded row.
Two things:
You should take the absolute value (using abs) of the difference, as you don't know a priori which of the two is greater.
Only update last if the condition is fulfilled, so last is never a discarded value.
last = float(next(readCSV)[1])  # assign first reference value
file.seek(0)                    # rewind the input so the first row is compared (and written) too
for row in readCSV:
    datalat = float(row[1])
    diff = abs(datalat - last)
    if diff <= 0.02:
        writeCSV.writerow(row)
        last = datalat
I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file. My current algorithm does the following:
Scan each row of the second file (two integers each), go to the position of those two integers in a 2D array (so if the integers are 2 and 3, I'll go to position [2,3]) and assign a value of 1.
Go through each row of the first file, check if the position of the two integers of each row has a value of 1 in the array, and then print the according output to a third csv file.
Unfortunately the csv files are very large, so I instantly get "MemoryError:" when running this. What is an alternative for scanning through large csv files?
I am using Jupyter Notebook. My code:
import csv
import numpy

def SNP():
    thelines = numpy.ndarray((6639, 524525))
    tempint = 0
    tempint2 = 0
    with open("SL05_AO_RO.tab") as tsv:
        for line in csv.reader(tsv, dialect="excel-tab"):
            tempint = int(line[0])
            tempint2 = int(line[1])
            thelines[tempint, tempint2] = 1
    return thelines

def common_sites():
    tempint = 0
    tempint2 = 0
    temparray = SNP()
    print('Checkpoint.')
    with open('output_SL05.csv', 'w', newline='') as fp:
        with open("covbreadth_common_sites.csv") as tsv:
            for line in csv.reader(tsv, dialect="excel-tab"):
                tempint = int(line[0])
                tempint2 = int(line[1])
                if temparray[tempint, tempint2] == 1:
                    a = csv.writer(fp, delimiter=',')
                    data = [['','']]
                    a.writerows(data)
                else:
                    a = csv.writer(fp, delimiter=',')
                    data = [['R','R']]
                    a.writerows(data)
    print('Done.')
    return

common_sites()
Files:
https://drive.google.com/file/d/0B5v-nJeoVouHUjlJelZtV01KWFU/view?usp=sharing and https://drive.google.com/file/d/0B5v-nJeoVouHSDI4a2hQWEh3S3c/view?usp=sharing
Your dataset really isn't that big, but it is relatively sparse. You aren't using a sparse structure to store the data, which is what causes the problem.
Just use a set of tuples to store the seen data; lookup in a set is O(1), e.g.:
In [1]:
import csv
with open("SL05_AO_RO.tab") as tsv:
    seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))
with open("covbreadth_common_sites.csv") as tsv:
    common = [line for line in csv.reader(tsv, dialect="excel-tab") if tuple(line) in seen]
common[:10]
Out[1]:
[['1049', '7280'], ['1073', '39198'], ['1073', '39218'], ['1073', '39224'], ['1073', '39233'],
 ['1098', '661'], ['1098', '841'], ['1103', '15100'], ['1103', '15107'], ['1103', '28210']]
10 loops, best of 3: 150 ms per loop
In [2]:
len(common), len(seen)
Out[2]:
(190, 138205)
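To produce the output file the question describes, the same set can then drive a csv writer; a minimal sketch along the same lines, reusing the file names from the question:

import csv

with open("SL05_AO_RO.tab") as tsv:
    seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))

with open("covbreadth_common_sites.csv") as tsv, \
     open("output_SL05.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    for line in csv.reader(tsv, dialect="excel-tab"):
        writer.writerow(['', ''] if tuple(line) in seen else ['R', 'R'])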
I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file.
import numpy as np

f1 = np.loadtxt('SL05_AO_RO.tab')
f2 = np.loadtxt('covbreadth_common_sites.csv')

# sort whole rows lexicographically (sort(axis=0) would sort each column
# independently and break the row pairing)
f1 = f1[np.lexsort((f1[:, 1], f1[:, 0]))]
f2 = f2[np.lexsort((f2[:, 1], f2[:, 0]))]

i, j = 0, 0
while i < f1.shape[0]:
    while j < f2.shape[0] and f1[i][0] > f2[j][0]:
        j += 1
    while j < f2.shape[0] and f1[i][0] == f2[j][0] and f1[i][1] > f2[j][1]:
        j += 1
    if j < f2.shape[0] and np.array_equal(f1[i], f2[j]):
        print()
    else:
        print('R,R')
    i += 1
Load the data into ndarrays to optimize memory usage.
Sort the data.
Find matches in the sorted arrays.
Total complexity is O(n*log(n) + m*log(m)), where n and m are the sizes of the input files.
Using set() will not reduce memory usage per unique entry, so I do not recommend it with large datasets.
Since CSV is just a DB dump, import it into any SQL DB and then run your query on it. This is a very efficient way.
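For example, a minimal sketch of that route with the standard library's sqlite3, reusing the file names from the question (the table and column names are made up):

import csv
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path instead if the data exceeds RAM
con.execute("CREATE TABLE seen (a INTEGER, b INTEGER)")

with open("SL05_AO_RO.tab") as tsv:
    con.executemany("INSERT INTO seen VALUES (?, ?)",
                    csv.reader(tsv, dialect="excel-tab"))
con.execute("CREATE INDEX idx_seen ON seen (a, b)")  # make the lookups fast

with open("covbreadth_common_sites.csv") as tsv, \
     open("output_SL05.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    for a, b in csv.reader(tsv, dialect="excel-tab"):
        hit = con.execute("SELECT 1 FROM seen WHERE a=? AND b=?", (a, b)).fetchone()
        writer.writerow(['', ''] if hit else ['R', 'R'])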
I have a csv file with 2 columns (titles are value, image). The value list contains values in ascending order (0, 25, 30, ...), and the image list contains pathways to images (e.g. X.jpg). Total lines are 81 including the titles (that is, there are 80 values and 80 images).
What I want is to divide this list 4 ways. Basically, the idea is to have a spread of pairs of images.
For the first group I took the image part of every two adjacent rows (2+3, 4+5, ...) and wrote them to a new csv file, each image in a different column. Here's the code:
import csv

f = open('random_sorted.csv')
csv_f = csv.reader(f)
i = 0
prev = ""

#open csv file for writing
with open('first_group.csv', 'wb') as test_file:
    csv_writer = csv.writer(test_file)
    csv_writer.writerow(["image1"] + ["image2"])
    for row in csv_f:
        if i % 2 == 0 and i != 0:
            #print prev + "," + row[1]
            csv_writer.writerow([prev] + [row[1]])
        else:
            prev = row[1]
        i = i + 1
Here's the output of this:
I want to keep the concept similar for the other 3 groups (write the paired images into a new csv file with two columns), but just increase the spread. That is, pair together rows 5 apart (i.e. 2+7 etc.), 7 apart (i.e. 2+9 etc.), and 9 apart. I would love to get some directions on how to execute this. I was lucky with the first group (I just learned about the remainder/divisor option in the CodeAcademy course), but I can't think of ideas for the other groups.
First collect all the rows in the csv file in a list:
import csv

with open('random_sorted.csv') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=';')
    headers = next(csv_reader)
    rows = [row for row in csv_reader]
Then set your required step size (5, 7 or 9) and identify the rows on the basis of their index in the list of rows:
with open('first_group.csv', 'wb') as test_file:
    csv_writer = csv.writer(test_file)
    csv_writer.writerow(["image1"] + ["image2"])
    step_size = 7  # set step size here
    seen = set()   # here we remember images we've already seen
    for x in range(0, len(rows) - step_size):
        img1 = rows[x][1]
        img2 = rows[x + step_size][1]
        if not (img1 in seen or img2 in seen):
            csv_writer.writerow([img1, img2])
            seen.add(img1)
            seen.add(img2)
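To produce all of the remaining groups, the same block can be wrapped in a helper and run once per spread; a minimal sketch (the function name and output file names are made up, and Python 3 file modes are assumed):

import csv

def write_pairs(rows, step_size, out_name):
    # pair each row's image with the image step_size rows later, skipping reused images
    with open(out_name, 'w', newline='') as test_file:
        csv_writer = csv.writer(test_file)
        csv_writer.writerow(["image1", "image2"])
        seen = set()
        for x in range(0, len(rows) - step_size):
            img1, img2 = rows[x][1], rows[x + step_size][1]
            if img1 not in seen and img2 not in seen:
                csv_writer.writerow([img1, img2])
                seen.add(img1)
                seen.add(img2)

for step_size, out_name in [(5, "second_group.csv"),
                            (7, "third_group.csv"),
                            (9, "fourth_group.csv")]:
    write_pairs(rows, step_size, out_name)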