How do I search through a very large csv file? - python

I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file. My current algorithm does the following:
Scan each row of the second file (two integers each), go to the position of those two integers in a 2D array (so if the integers are 2 and 3, I'll go to position [2,3]) and assign a value of 1.
Go through each row of the first file, check if the position of the two integers of each row has a value of 1 in the array, and then print the according output to a third csv file.
Unfortunately the csv files are very large, so I instantly get "MemoryError:" when running this. What is an alternative for scanning through large csv files?
I am using Jupyter Notebook. My code:
import csv
import numpy

def SNP():
    thelines = numpy.ndarray((6639,524525))
    tempint = 0
    tempint2 = 0
    with open("SL05_AO_RO.tab") as tsv:
        for line in csv.reader(tsv, dialect="excel-tab"):
            tempint = int(line[0])
            tempint2 = int(line[1])
            thelines[tempint,tempint2] = 1
    return thelines

def common_sites():
    tempint = 0
    tempint2 = 0
    temparray = SNP()
    print('Checkpoint.')
    with open('output_SL05.csv', 'w', newline='') as fp:
        with open("covbreadth_common_sites.csv") as tsv:
            for line in csv.reader(tsv, dialect="excel-tab"):
                tempint = int(line[0])
                tempint2 = int(line[1])
                if temparray[tempint,tempint2] == 1:
                    a = csv.writer(fp, delimiter=',')
                    data = [['','']]
                    a.writerows(data)
                else:
                    a = csv.writer(fp, delimiter=',')
                    data = [['R','R']]
                    a.writerows(data)
    print('Done.')
    return

common_sites()
Files:
https://drive.google.com/file/d/0B5v-nJeoVouHUjlJelZtV01KWFU/view?usp=sharing and https://drive.google.com/file/d/0B5v-nJeoVouHSDI4a2hQWEh3S3c/view?usp=sharing

Your dataset really isn't that big, but it is relatively sparse, and you aren't using a sparse structure to store it, which is what causes the MemoryError.
Just use a set of tuples to store the seen data; lookup in a set is O(1), e.g.:
In [1]:
import csv
with open("SL05_AO_RO.tab") as tsv:
    seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))
with open("covbreadth_common_sites.csv") as tsv:
    common = [line for line in csv.reader(tsv, dialect="excel-tab") if tuple(line) in seen]
common[:10]
Out[1]:
[['1049', '7280'], ['1073', '39198'], ['1073', '39218'], ['1073', '39224'], ['1073', '39233'],
 ['1098', '661'], ['1098', '841'], ['1103', '15100'], ['1103', '15107'], ['1103', '28210']]
10 loops, best of 3: 150 ms per loop
In [2]:
len(common), len(seen)
Out[2]:
(190, 138205)
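For completeness, the same set can also be used to produce the output file the question describes (a blank row for matches, 'R,R' otherwise). A minimal sketch, using the file names from the question:

import csv

# build a set of (col1, col2) tuples from the first file for O(1) membership tests
with open("SL05_AO_RO.tab") as tsv:
    seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))

with open("covbreadth_common_sites.csv") as tsv, \
     open("output_SL05.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for line in csv.reader(tsv, dialect="excel-tab"):
        # blank row if this pair was seen in the other file, 'R,R' otherwise
        writer.writerow(["", ""] if tuple(line) in seen else ["R", "R"])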

I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file.
import numpy as np

f1 = np.loadtxt('SL05_AO_RO.tab')
f2 = np.loadtxt('covbreadth_common_sites.csv')

# sort rows lexicographically by (first column, second column),
# keeping each row's pair of values together
f1 = f1[np.lexsort((f1[:, 1], f1[:, 0]))]
f2 = f2[np.lexsort((f2[:, 1], f2[:, 0]))]

i, j = 0, 0
while i < f1.shape[0]:
    while j < f2.shape[0] and f1[i][0] > f2[j][0]:
        j += 1
    while j < f2.shape[0] and f1[i][0] == f2[j][0] and f1[i][1] > f2[j][1]:
        j += 1
    if j < f2.shape[0] and np.array_equal(f1[i], f2[j]):
        print()
    else:
        print('R,R')
    i += 1
1. Load the data into ndarrays to optimize memory usage.
2. Sort both arrays.
3. Find matches by walking through the two sorted arrays in step.
Total complexity is O(n*log(n) + m*log(m)), where n and m are the sizes of the input files.
Using a set() will not reduce memory usage per unique entry, so I do not recommend it for large datasets.

Since a CSV is just a DB dump, import it into any SQL database and run your query there. That is a very efficient way to do it.
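For example, with Python's built-in sqlite3 module (a sketch only; the table and column names here are made up for illustration):

import csv
import sqlite3

conn = sqlite3.connect(":memory:")  # or a file path for a persistent database
conn.execute("CREATE TABLE sites (a INTEGER, b INTEGER)")

# load a tab-separated file into the table
with open("SL05_AO_RO.tab") as tsv:
    conn.executemany("INSERT INTO sites VALUES (?, ?)",
                     csv.reader(tsv, dialect="excel-tab"))
conn.commit()

# query it like any other SQL table
for row in conn.execute("SELECT * FROM sites WHERE a = ?", (1073,)):
    print(row)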

Related

how to iterate over files in python and export several output files

I have a piece of code that I want to put in a for loop. I want to feed a series of input files into the code and, for each input, generate an output automatically. At the moment the code works for a single input file and therefore produces a single output. My input file is named model000.msh, but in fact I have a whole series of these input files: model000.msh, model001.msh, and so on.
In the code I do some calculations on the imported file and finally compare it to a numpy array (my_data) that is generated from another numpy array (ID) with one column and thousands of rows. The ID array is the second variable I want to iterate over; ID builds my_data through np.concatenate, and I want to use each column of ID to make my_data (my_data = np.concatenate((ID[:, iterator], gr), axis=1)).
So I want to iterate over several files, extract arrays from each file (extracted), then, within the same loop, generate my_data from each column of ID, do the calculations on my_data and extracted, and finally export the results of each iteration with a dynamic naming scheme (changed_000, changed_001 and so on). This is my code for one single input file and one single my_data array (made from an ID that has only one column), but I want to iterate over several input files and several my_data arrays and produce several outputs:
from itertools import islice
with open('model000.msh') as lines:
    nodes = np.genfromtxt(islice(lines, 0, 1000))
with open('model000.msh', "r") as f:
    saved_lines = np.array([line.split() for line in f if len(line.split()) == 9])
saved_lines[saved_lines == ''] = 0.0
elem = saved_lines.astype(np.int)
# following lines extract some data from my file
extracted = np.c_[elem[:,:-4], nodes[elem[:,-4]-1, 1:], nodes[elem[:,-3]-1, 1:], nodes[elem[:,-2]-1, 1:], nodes[elem[:,-1]-1, 1:]]
…
extracted = np.concatenate((extracted, avs), axis=1)  # each input file ('model000.msh') will make this numpy array
# another data set, stored as a numpy array, is compared to the data extracted from the file
ID = np.array [[… ..., …, …]]  # now it has one column, but it should have several columns; each iteration, one column will make a my_data array
my_data = np.concatenate((ID, gr), axis=1)  # I think it should be something like my_data = np.concatenate((ID[:, iterator], gr), axis=1)
from scipy.spatial import distance
distances = distance.cdist(extracted[:, 17:20], my_data[:, 1:4])
ind_min_dis = np.argmin(distances, axis=1).reshape(-1, 1)
z = np.array([])
for i in ind_min_dis:
    u = my_data[i, 0]
    z = np.array([np.append(z, u)]).reshape(-1, 1)
final_merged = np.concatenate((extracted, z), axis=1)
new_vol = final_merged[:, -1].reshape(-1, 1)
new_elements = np.concatenate((elements, new_vol), axis=1)
new_elements[:, [4, -1]] = new_elements[:, [-1, 4]]
# The next block is the output block
chunk_size = 3
buffer = ""
i = 0
relavent_line = 0
with open('changed_00', 'a') as fout:
    with open('model000.msh', 'r') as fin:
        for line in fin:
            if len(line.split()) == 9:
                aux_string = ' '.join([str(num) for num in new_elements[relavent_line]])
                buffer += '%s\n' % aux_string
                relavent_line += 1
            else:
                buffer += line
            i += 1
            if i == chunk_size:
                fout.write(buffer)
                i = 0
                buffer = ""
if buffer:
    fout.write(buffer)
    i = 0
    buffer = ""
I appreciate any help in advance.
I'm not very sure about your question. But it seems like you are asking for something like:
for idx in range(10):
    with open('changed_{:0>2d}'.format(idx), 'a') as fout:
        with open('model0{:0>2d}.msh'.format(idx), 'r') as fin:
            # read something from fin...
            # calculate something...
            # write something to fout...
            pass  # placeholder so the block is valid Python
If so, have a look at str.format() for more details.
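Putting that together with the ID columns the question mentions, the outer loop might look roughly like this. This is only a sketch: ID, gr, and process_file are placeholders standing in for the arrays and the per-file calculation in your existing code.

import numpy as np

n_files = 10                      # however many model###.msh files you have
ID = np.zeros((1000, n_files))    # placeholder: your real ID array, one column per file
gr = np.zeros((1000, 3))          # placeholder: whatever you already concatenate with ID

def process_file(in_name, my_data, out_name):
    # placeholder: the extraction / distance / output code from the question,
    # reading from in_name and writing the result to out_name
    pass

for idx in range(n_files):
    in_name = 'model{:03d}.msh'.format(idx)    # model000.msh, model001.msh, ...
    out_name = 'changed_{:03d}'.format(idx)    # changed_000, changed_001, ...
    my_data = np.concatenate((ID[:, idx].reshape(-1, 1), gr), axis=1)
    process_file(in_name, my_data, out_name)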

How to count lines in a text file with specified values?

I'm working with a .csv file that lists Timestamps in one column and Wind Speeds in the second column. I need to read through this .csv file and calculate the percent of time where wind speed was above 2m/s. Here's what I have so far.
txtFile = r"C:\Data.csv"

line = o_txtFile.readline()[:-1]
while line:
    line = oTextfile.readline()

for line in txtFile:
    line = line.split(",")[:-1]
How do I get a count of the lines where the 2nd element in the line is greater than 2?
CSV File Sample
You will probably have to slightly update your CSV first, depending on the option you choose (for options 1 and 2 you will want to remove all header rows, whereas for option 3 you keep only the middle one, i.e. the one that starts with TIMESTAMP).
You actually have three options:
Option 1: Vanilla Python
count = 0
with open('data.csv', 'r') as file:
    for line in file:
        # wind speed is the second comma-separated field; it can be fractional, so use float
        value = float(line.split(',')[1])
        if value > 100:
            count += 1
# Now you have the value in the ``count`` variable
Option 2: CSV module
Here I use Python's csv module (you could also use the DictReader, but I'll let you look that up yourself).
import csv

count = 0
with open('data.csv', 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        if float(row[1]) > 100:
            count += 1
# Now you have the value in the ``count`` variable
Option 3: Pandas
Pandas is a really cool, awesome library used by a lot of people to do data analysis. Doing what you want to do would look like:
import pandas as pd
df = pd.read_csv('data.csv')
# Here you are
count = len(df[df['WindSpd_ms'] > 100])
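Since the question ultimately asks for a percentage rather than a raw count, that is one more line with pandas. A small sketch, assuming (as above) the column is named WindSpd_ms and using the 2 m/s threshold from the question:

import pandas as pd

df = pd.read_csv('data.csv')
# the mean of a boolean series is the fraction of True values
percent_above = 100 * (df['WindSpd_ms'] > 2).mean()
print(percent_above)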
You can read the file in line by line and, if there is something in the line, split it.
You count the lines read and how many of them are above 10 m/s, then calculate the percentage:
# create a data file for processing with random data
import random
random.seed(42)

with open("data.txt", "w") as f:
    f.write("header\n")
    f.write("header\n")
    f.write("header\n")
    f.write("header\n")
    for sp in random.choices(range(10), k=200):
        f.write(f"some date,{sp+3.5}, data,data,data\n")

# open/read/calculate the percentage of datapoints with speeds above 10 m/s
days = 0
speedGreater10 = 0
with open("data.txt", "r") as f:
    for _ in range(4):
        next(f)  # skip the first 4 rows containing headers
    for line in f:
        if line:  # not empty
            _, speed, *p = line.split(",")
            # _ and *p are ignored (they take 'some date' + [data,data,data])
            days += 1
            if float(speed) > 10:
                speedGreater10 += 1

print(f"{days} datapoints, of which {speedGreater10} "
      f"got more than 10 m/s: {speedGreater10/days:.1%}")
Output:
200 datapoints, of which 55 got more than 10 m/s: 27.5%
Datafile:
header
header
header
header
some date,9.5, data,data,data
some date,3.5, data,data,data
some date,5.5, data,data,data
some date,5.5, data,data,data
some date,10.5, data,data,data
[... some more ...]
some date,8.5, data,data,data
some date,3.5, data,data,data
some date,12.5, data,data,data
some date,11.5, data,data,data

Loop within loop when comparing csv files in Python

I have two csv files. I am trying to look up a value from the first column of one file (file 1) in the first column of the other file (file 2). If they match, then print the row from file 2.
Pseudo code:
read file1.csv
read file2.csv
loop through file1
    compare each row with each row of file 2 in turn
    if file1[0] == file2[0]:
        print row of file 2
file1:
45,John
46,Fred
47,Bill
File2:
46,Roger
48,Pete
49,Bob
I want it to print :
46 Roger
EDIT - these are examples, the actual file is much bigger (5,000 rows, 7 columns)
I have the following:
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv1)
However I am getting no output.
I am aware there are other ways of doing it (with a dict, pandas), but I am keen to know why my approach is not working.
EDIT: I now see that it is only iterating through the first row of file 1 and then closing, but I am unclear how to stop it closing (I also understand that this is not the best way to do do it).
You open csv2reader = csv.reader(csvfile2), then iterate all the way through it against the first row of csv1reader; at that point it has reached end of file and will not produce any more data.
So for the second through last rows of csv1reader you are comparing against an exhausted reader, i.e. no comparison takes place.
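If you do want to keep the nested-loop structure, one way to make it work (shown here only as a sketch) is to rewind the second file at the start of every outer iteration with csvfile2.seek(0), at the cost of re-reading file 2 once per row of file 1:

import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    for rowcsv1 in csv.reader(csvfile1):
        csvfile2.seek(0)                      # rewind file 2 for each row of file 1
        for rowcsv2 in csv.reader(csvfile2):
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv2)                # print the matching row from file 2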
In any case, this is a very inefficient method; unless you are working on very large files, it would be much better to do
import csv

# load second file as lookup table
data = {}
with open("csv2file.csv") as inf2:
    for row in csv.reader(inf2):
        data[row[0]] = row

# now process first file against it
with open("csv1file.csv") as inf1:
    for row in csv.reader(inf1):
        if row[0] in data:
            print(data[row[0]])
See Hugh Bothwell's answer for why your code isn't working. For a fast way of doing what you stated you want to do in your question, try this:
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

duplicates = {a[0] for a in csv1} & {a[0] for a in csv2}
for row in csv2:
    if row[0] in duplicates:
        print(row)
It gets the keys that appear in both csv files, then loops through the second csv file, printing the row if the value at index 0 also appears in the first csv file. This is a much faster algorithm than what you were attempting.
If order matters, as @hugh-bothwell mentioned in @will-da-silva's answer, you could do:
import csv
from collections import OrderedDict

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

d = {row[0]: row for row in csv2}
ordered_keys = OrderedDict.fromkeys([a[0] for a in csv1]).keys()
duplicate_keys = [k for k in ordered_keys if k in d]
for k in duplicate_keys:
    print(d[k])
I'm pretty sure there's a better way to do this, but try out this solution, it should work.
counter = 0
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[counter] == rowcsv2[counter]:
                print(rowcsv1)
            counter += 1  # increment it out of the IF statement.

How can I calculate the sum of the values in a field less than a certain value?

I have a CSV file separated by commas. I need to read the file, determine the sum of the values in the field [reading] less than (say 406.2).
My code so far is as follows:
myfile = open('3517315a.csv', 'r')

myfilecount = 0
linecount = 0
firstline = True

for line in myfile:
    if firstline:
        firstline = False
        continue
    fields = line.split(',')
    linecount += 1
    count = int(fields[0])
    colour = str(fields[1])
    channels = int(fields[2])
    code = str(fields[3])
    correct = str(fields[4])
    reading = float(fields[5])
How can I set this condition?
Use np.genfromtxt to read the CSV.
import numpy as np

#data = np.genfromtxt('3517315a.csv', delimiter=',')
data = np.random.random(10).reshape(5, 2) * 600  # exemplary data, since I don't have your CSV

threshold = 406.2
print(np.sum(data * (data < threshold)))
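Note that data * (data < threshold) works by zeroing out the entries at or above the threshold before summing; a boolean-mask index does the same thing and is arguably easier to read:

import numpy as np

data = np.random.random(10).reshape(5, 2) * 600  # same exemplary data as above
threshold = 406.2
# select only the values below the threshold, then sum them
print(data[data < threshold].sum())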
I haven't tested this (I don't have example data or your file) but this should do it
import numpy as np

# import data from file, give each column a name
data = np.genfromtxt('3517315a.csv', delimiter=',',
                     names=['count', 'colour', 'channels', 'code', 'correct', 'reading'])

# move the reading column to a normal array to make it easier to follow (not necessary)
readingdata = data['reading']

# find the values less than your limit (np.where()),
# extract only those values (readingdata[]),
# then sum those extracted values (np.sum())
total = np.sum(readingdata[np.where(readingdata < 406.2)])
You can write an iterator that extracts the reading field and casts it to a float. Wrap that in another iterator that tests your condition and sum the result.
import csv

with open('3517315a.csv', newline='') as fp:
    next(fp)  # discard the header row
    reading_sum = sum(reading for reading in
                      (float(row[5]) for row in csv.reader(fp))
                      if reading < 406.2)

Get number of rows from .csv file

I am writing a Python module where I read a .csv file with 2 columns and an arbitrary number of rows. I then go through these rows until column 1 > x. At this point I need the data from the current row and the previous row to do some calculations.
Currently I am using 'for i in range(rows)', but each csv file will have a different number of rows, so this won't work.
The code can be seen below:
rows = 73

for i in range(rows):
    c_level = Strapping_Table[Tank_Number][i, 0]   # Current level
    c_volume = Strapping_Table[Tank_Number][i, 1]  # Current volume
    if c_level > level:
        p_level = Strapping_Table[Tank_Number][i-1, 0]   # Previous level
        p_volume = Strapping_Table[Tank_Number][i-1, 1]  # Previous volume
        x = level - p_level  # Intermediate values
        if x < 0:
            x = 0
        y = c_level - p_level
        z = c_volume - p_volume
        volume = p_volume + ((x / y) * z)
        return volume
When playing around with arrays, I used:
for row in Tank_data:
    print row[c]  # print column c
    time.sleep(1)
This goes through all the rows, but I cannot access the previous rows data with this method.
I have thought about storing previous row and current row in every loop, but before I do this I was wondering if there is a simple way to get the amount of rows in a csv.
Store the previous line
with open("myfile.txt", "r") as file:
previous_line = next(file)
for line in file:
print(previous_line, line)
previous_line = line
Or you can do the same with a generator:
def prev_curr(file_name):
    with open(file_name, "r") as file:
        previous_line = next(file)
        for line in file:
            yield previous_line, line
            previous_line = line

# usage
for prev, curr in prev_curr("myfile"):
    do_your_thing()
You should use enumerate.
for i, row in enumerate(tank_data):
    print row[c], tank_data[i-1][c]
Since the number of rows in the csv is unknown until the file has been read, you'll have to do an initial pass through it if you want to find the number of rows, e.g.:
numberOfRows = sum(1 for row in file)
However, that means your code will read the csv twice, which you may not want to do if it is very big; in that case the simple option of storing the previous row in a variable on each iteration is probably the best approach.
An alternative route would be to read the file into e.g. a pandas DataFrame (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) and analyse it from there, but again this could be slow if your csv is very big.
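For the row count itself, a quick sketch of that pandas route (the file name here is hypothetical, and header=None assumes the file has no header row):

import pandas as pd

df = pd.read_csv('strapping_table.csv', header=None)  # hypothetical file name
number_of_rows = len(df)
print(number_of_rows)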
