How to count lines in a text file with specified values? - python

I'm working with a .csv file that lists Timestamps in one column and Wind Speeds in the second column. I need to read through this .csv file and calculate the percent of time where wind speed was above 2m/s. Here's what I have so far.
txtFile = r"C:\Data.csv"
line = o_txtFile.readline()[:-1]
while line:
line = oTextfile.readline()
for line in txtFile:
line = line.split(",")[:-1]
How do I get a count of the lines where the 2nd element in the line is greater than 2?

You actually have three options. You will probably have to adjust your CSV slightly depending on the one you choose: for options 1 and 2 you will want to remove all header rows, whereas for option 3 you should keep only the middle one, i.e. the one that starts with TIMESTAMP.
Option 1: Vanilla Python
count = 0
with open('data.csv', 'r') as file:
    for line in file:
        value = int(line.split(',')[1])
        if value > 100:
            count += 1
# Now you have the value in the ``count`` variable
Option 2: CSV module
Here I use Python's csv module (you could also use DictReader; a minimal DictReader sketch follows this example).
import csv

count = 0
with open('data.csv', 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        if int(row[1]) > 100:
            count += 1
# Now you have the value in the ``count`` variable
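For reference, a minimal DictReader sketch. It assumes you keep the row with the column names as a header and that the wind-speed column is named WindSpd_ms (adjust to your actual header), and it uses the question's 2 m/s threshold:

import csv

count = 0
with open('data.csv', 'r', newline='') as file:
    # DictReader takes the field names from the first row of the file
    reader = csv.DictReader(file)
    for row in reader:
        # 'WindSpd_ms' is an assumed column name; change it to match your header
        if float(row['WindSpd_ms']) > 2:
            count += 1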
Option 3: Pandas
Pandas is a really popular library for data analysis. Doing what you want would look like:
import pandas as pd
df = pd.read_csv('data.csv')
# Here you are
count = len(df[df['WindSpd_ms'] > 100])
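Since your actual goal is the percent of time the wind speed was above 2 m/s, you can get that directly from a boolean mask. A minimal sketch, reusing the WindSpd_ms column name from the snippet above (adjust it to your real header):

import pandas as pd

df = pd.read_csv('data.csv')
# the mean of a boolean Series is the fraction of True values
pct_above_2 = (df['WindSpd_ms'] > 2).mean() * 100
print(f"Wind speed was above 2 m/s {pct_above_2:.1f}% of the time")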

You can read the file line by line and, if a line is not empty, split it.
Count the lines read and how many are above 10 m/s, then calculate the percentage:
# create a data file for processing with random data
import random

random.seed(42)
with open("data.txt", "w") as f:
    f.write("header\n")
    f.write("header\n")
    f.write("header\n")
    f.write("header\n")
    for sp in random.choices(range(10), k=200):
        f.write(f"some date,{sp+3.5}, data,data,data\n")

# open/read/calculate percentage of datapoints with speeds above 10 m/s
days = 0
speedGreater10 = 0
with open("data.txt", "r") as f:
    for _ in range(4):
        next(f)  # ignore first 4 rows containing headers
    for line in f:
        if line:  # not empty
            _, speed, *p = line.split(",")
            # _ and *p are ignored (they take 'some date' + [data,data,data])
            days += 1
            if float(speed) > 10:
                speedGreater10 += 1

print(f"{days} datapoints, of which {speedGreater10} "
      f"got more than 10 m/s: {speedGreater10/days:.1%}")
Output:
200 datapoints, of which 55 got more than 10 m/s: 27.5%
Datafile:
header
header
header
header
some date,9.5, data,data,data
some date,3.5, data,data,data
some date,5.5, data,data,data
some date,5.5, data,data,data
some date,10.5, data,data,data
[... some more ...]
some date,8.5, data,data,data
some date,3.5, data,data,data
some date,12.5, data,data,data
some date,11.5, data,data,data


Quickly Remove Header from Large .csv Files

My question is not how to open a .csv file, detect which rows I want to omit, and write a new .csv file with my desired lines. I'm already doing that successfully:
import csv

def sanitize(filepath):  # Removes header information, leaving only column names and data. Outputs "sanitized" file.
    with open(filepath) as unsan, open(dirname + "/" + newname + '.csv', 'w', newline='') as san:
        writer = csv.writer(san)
        line_count = 0
        headingrow = 0
        datarow = 0
        safety = 1
        for row in csv.reader(unsan, delimiter=','):
            # Detect data start
            if "DATA START" in str(row):
                safety = 0
                headingrow = line_count + 1
                datarow = line_count + 4
            # Detect data end
            if "DATA END" in str(row):
                safety = 1
            # Write data
            if safety == 0:
                if line_count == headingrow or line_count >= datarow:
                    writer.writerow(row)
            line_count += 1
I have .csv data files that are megabytes, sometimes gigabytes (up to 4Gb) in size. Out of 180,000 lines in each file, I only need to omit about 50 lines.
Example pseudo-data (rows I want to keep are indented):
[Header Start]
...48 lines of header data...
[Header End]
Blank Line
[Data Start]
    Row with Column Names
Column Units
Column Variable Type
    ...180,000 lines of data...
I understand that I can't edit a .csv file as I iterate over it (Learned here:
How to Delete Rows CSV in python). Is there a quicker way to remove the header information from the file, like maybe writing the remaining 180,000 lines as a block instead of iterating through and writing each line?
Maybe one solution would be to append all the data rows to a list of lists and then use writer.writerows(list of lists) instead of writing them one at a time (Batch editing of csv files with Python, https://docs.python.org/3/library/csv.html)? However, wouldn't that mean I'm loading essentially the whole file (up to 4Gb) into my RAM?
UPDATE:
I've got a pandas import working, but when I time it, it takes about twice as long as the code above. Specifically, the to_csv portion takes about 10s for a 26Mb file.
import csv
import pandas as pd

filepath = r'input'
with open(filepath) as unsan:
    line_count = 0
    headingrow = 0
    datarow = 0
    safety = 1
    row_count = sum(1 for row in csv.reader(unsan, delimiter=','))
    unsan.seek(0)  # rewind; counting the rows above consumed the file
    for row in csv.reader(unsan, delimiter=','):
        # Detect data start
        if "DATA START" in str(row):
            safety = 0
            headingrow = line_count + 1
            datarow = line_count + 4
        # Write data
        if safety == 0:
            if line_count == headingrow:
                colnames = row
                line_count += 1
                break
        line_count += 1

badrows = [*range(0, 55, 1), row_count - 1]
df = pd.read_csv(filepath, names=[*colnames], skiprows=[*badrows], na_filter=False)
df.to_csv(r'output', index=None, header=True)
Here's the research I've done:
Deleting rows with Python in a CSV file
https://intellipaat.com/community/18827/how-to-delete-only-one-row-in-csv-with-python
https://www.reddit.com/r/learnpython/comments/7tzbjm/python_csv_cleandelete_row_function_doesnt_work/
https://nitratine.net/blog/post/remove-columns-in-a-csv-file-with-python/
Delete blank rows from CSV?
If it is not important that the file is read in Python, or with a CSV reader/writer, you can use other tools. On *nix you can use sed:
sed -n '/DATA START/,/DATA END/p' myfile.csv > headerless.csv
This will be very fast for millions of lines.
perl is more multi-platform:
perl -F -lane "print if /DATA START/ .. /DATA END/;" myfile.csv
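If you would rather stay in pure Python but still write a new file, a rough equivalent of the sed range above is a streaming copy that keeps only one line in memory at a time (a sketch, assuming the same DATA START/DATA END markers):

# stream only the lines between the DATA START and DATA END markers
with open('myfile.csv') as src, open('headerless.csv', 'w') as dst:
    copying = False
    for line in src:
        if 'DATA START' in line:
            copying = True  # like sed, include the marker line itself
        if copying:
            dst.write(line)
        if 'DATA END' in line:
            copying = False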
To avoid editing the file, and read the file with headers straight into Python and then into Pandas, you can wrap the file in your own file-like object.
Given an input file called myfile.csv with this content (note the blank line separating the header block from the data, which is what the wrapper below looks for):
HEADER
HEADER
HEADER
HEADER
HEADER
HEADER

now, some, data
1,2,3
4,5,6
7,8,9
You can read that file in directly using a wrapper class:
import io

class HeaderSkipCsv(io.TextIOBase):
    def __init__(self, filename):
        """ create an iterator from the filename """
        self.data = self.yield_csv(filename)

    def readable(self):
        """ here for compatibility """
        return True

    def yield_csv(self, filename):
        """ open filename and read past the first empty line.
        Then yield characters one by one. This reads just one
        line at a time into memory.
        """
        with open(filename) as f:
            for line in f:
                if line.strip() == "":
                    break
            for line in f:
                for char in line:
                    yield char

    def read(self, n=None):
        """ called by Pandas with some 'n', this returns
        the next 'n' characters since the last read as a string
        """
        data = ""
        for i in range(n):
            try:
                data += next(self.data)
            except StopIteration:
                break
        return data

WANT_PANDAS = True  # set to False to just write the file
if WANT_PANDAS:
    import pandas as pd
    df = pd.read_csv(HeaderSkipCsv('myfile.csv'))
    print(df.head(5))
else:
    with open('myoutfile.csv', 'w') as fo:
        with HeaderSkipCsv('myfile.csv') as fi:
            c = fi.read(1024)
            while c:
                fo.write(c)
                c = fi.read(1024)
which outputs:
now some data
0 1 2 3
1 4 5 6
2 7 8 9
Because Pandas allows any file-like object, we can provide our own! Pandas calls read on the HeaderSkipCsv object as it would on any file object. Pandas just cares about reading valid csv data from a file object when it calls read on it. Rather than providing Pandas with a clean file, we provide it with a file-like object that filters out the data Pandas does not like (i.e. the headers).
The yield_csv generator iterates over the file without reading it in, so only as much data as Pandas requests is loaded into memory. The first for loop in yield_csv advances f to beyond the first empty line. f represents a file pointer and is not reset at the end of a for loop while the file remains open. Since the second for loop receives f under the same with block, it starts consuming at the start of the csv data, where the first for loop left it.
Another way of writing the first for loop would be
next((line for line in f if line.isspace()), None)
which is more explicit about advancing the file pointer, but arguably harder to read.
Because we skip the lines up to and including the empty line, Pandas just gets the valid csv data. For the headers, no more than one line is ever loaded.

How can I calculate the sum of the values in a field less than a certain value

I have a CSV file separated by commas. I need to read the file and determine the sum of the values in the [reading] field that are less than a certain value (say 406.2).
My code so far is as follows:
myfile = open('3517315a.csv', 'r')
myfilecount = 0
linecount = 0
firstline = True
for line in myfile:
    if firstline:
        firstline = False
        continue
    fields = line.split(',')
    linecount += 1
    count = int(fields[0])
    colour = str(fields[1])
    channels = int(fields[2])
    code = str(fields[3])
    correct = str(fields[4])
    reading = float(fields[5])
How can I set this condition?
Use np.genfromtxt to read the CSV.
import numpy as np
#data = np.genfromtxt('3517315a.csv', delimiter=',')
data = np.random.random(10).reshape(5,2) * 600 # exemplary data
# since I don't have your CSV
threshold = 406.2
print(np.sum(data * (data<threshold)))
I haven't tested this (I don't have example data or your file) but this should do it
import numpy as np

# import data from the file, giving each column a name
data = np.genfromtxt('3517315a.csv', delimiter=',',
                     names=['count', 'colour', 'channels', 'code', 'correct', 'reading'])
# pull out the 'reading' column to make it easier to follow (not necessary)
readingdata = data['reading']
# find the values below your limit (np.where())
# extract only those values (readingdata[])
# then sum those extracted values (np.sum())
total = np.sum(readingdata[np.where(readingdata < 406.2)])
You can write an iterator that extracts the reading field and casts it to a float. Wrap that in another iterator that tests your condition and sum the result.
import csv

with open('3517315a.csv', newline='') as fp:
    next(fp)  # discard header
    reading_sum = sum(reading for reading in
                      (float(row[5]) for row in csv.reader(fp))
                      if reading < 406.2)

Get number of rows from .csv file

I am writing a Python module where I read a .csv file with 2 columns and an arbitrary number of rows. I then go through these rows until column 1 > x. At this point I need the data from the current row and the previous row to do some calculations.
Currently, I am using 'for i in range(rows)', but each csv file will have a different number of rows, so this won't work.
The code can be seen below:
rows = 73
for i in range(rows):
    c_level = Strapping_Table[Tank_Number][i, 0]   # Current level
    c_volume = Strapping_Table[Tank_Number][i, 1]  # Current volume
    if c_level > level:
        p_level = Strapping_Table[Tank_Number][i-1, 0]   # Previous level
        p_volume = Strapping_Table[Tank_Number][i-1, 1]  # Previous volume
        x = level - p_level  # Intermediate values
        if x < 0:
            x = 0
        y = c_level - p_level
        z = c_volume - p_volume
        volume = p_volume + ((x / y) * z)
        return volume
When playing around with arrays, I used:
for row in Tank_data:
    print row[c]  # print column c
    time.sleep(1)
This goes through all the rows, but I cannot access the previous rows data with this method.
I have thought about storing the previous row and current row on every loop, but before I do this I was wondering if there is a simple way to get the number of rows in a csv.
Store the previous line:
with open("myfile.txt", "r") as file:
    previous_line = next(file)
    for line in file:
        print(previous_line, line)
        previous_line = line
Or you can use it with a generator:
def prev_curr(file_name):
    with open(file_name, "r") as file:
        previous_line = next(file)
        for line in file:
            yield previous_line, line
            previous_line = line

# usage
for prev, curr in prev_curr("myfile"):
    do_your_thing()
You should use enumerate.
for i, row in enumerate(tank_data):
    print row[c], tank_data[i-1][c]
Since the number of rows in the csv is unknown until it's read, you'll have to do an initial pass through if you want to find it, e.g.:
numberOfRows = sum(1 for row in file)
However, that would mean your code reads the csv twice, which you may not want to do if it's very big - in that case, the simple option of storing the previous row in a variable on each iteration may be the best one.
An alternative route could be to read the file into e.g. a pandas DataFrame (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) and analyse it from there, but again this could be slow if your csv is too big.
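If you do go the pandas route, a minimal sketch might look like the following; the file name and column names are placeholders for whatever your strapping table actually uses:

import pandas as pd

# len(df) gives the number of rows without a manual counting pass
df = pd.read_csv("strapping_table.csv", names=["level", "volume"])
print(len(df))

# the previous row is then simply df.iloc[i - 1] while iterating by index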

How do I search through a very large csv file?

I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file. My current algorithm does the following:
Scan each row of the second file (two integers each), go to the position of those two integers in a 2D array (so if the integers are 2 and 3, I'll go to position [2,3]) and assign a value of 1.
Go through each row of the first file, check if the position of the two integers of each row has a value of 1 in the array, and then print the according output to a third csv file.
Unfortunately the csv files are very large, so I instantly get "MemoryError:" when running this. What is an alternative for scanning through large csv files?
I am using Jupyter Notebook. My code:
import csv
import numpy

def SNP():
    thelines = numpy.ndarray((6639, 524525))
    tempint = 0
    tempint2 = 0
    with open("SL05_AO_RO.tab") as tsv:
        for line in csv.reader(tsv, dialect="excel-tab"):
            tempint = int(line[0])
            tempint2 = int(line[1])
            thelines[tempint, tempint2] = 1
    return thelines

def common_sites():
    tempint = 0
    tempint2 = 0
    temparray = SNP()
    print('Checkpoint.')
    with open('output_SL05.csv', 'w', newline='') as fp:
        with open("covbreadth_common_sites.csv") as tsv:
            for line in csv.reader(tsv, dialect="excel-tab"):
                tempint = int(line[0])
                tempint2 = int(line[1])
                if temparray[tempint, tempint2] == 1:
                    a = csv.writer(fp, delimiter=',')
                    data = [['', '']]
                    a.writerows(data)
                else:
                    a = csv.writer(fp, delimiter=',')
                    data = [['R', 'R']]
                    a.writerows(data)
    print('Done.')
    return

common_sites()
Files:
https://drive.google.com/file/d/0B5v-nJeoVouHUjlJelZtV01KWFU/view?usp=sharing and https://drive.google.com/file/d/0B5v-nJeoVouHSDI4a2hQWEh3S3c/view?usp=sharing
Your dataset really isn't that big, but it is relatively sparse, and the problem is that you aren't using a sparse structure to store the data.
Just use a set of tuples to store the seen data; the lookup on a set is O(1), e.g.:
In [1]:
import csv
with open("SL05_AO_RO.tab") as tsv:
    seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))
with open("covbreadth_common_sites.csv") as tsv:
    common = [line for line in csv.reader(tsv, dialect="excel-tab") if tuple(line) in seen]
common[:10]
Out[1]:
[['1049', '7280'], ['1073', '39198'], ['1073', '39218'], ['1073', '39224'], ['1073', '39233'],
 ['1098', '661'], ['1098', '841'], ['1103', '15100'], ['1103', '15107'], ['1103', '28210']]
10 loops, best of 3: 150 ms per loop
In [2]:
len(common), len(seen)
Out[2]:
(190, 138205)
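To also produce the output file described in the question (a blank row for a match, 'R,R' otherwise), the same set can drive a second streaming pass. A sketch along those lines, reusing output_SL05.csv from the question as the output name:

import csv

# build the lookup set from the second file
with open("SL05_AO_RO.tab") as tsv:
    seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))

# stream the first file and write one output row per input row
with open("covbreadth_common_sites.csv") as tsv, \
     open("output_SL05.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for line in csv.reader(tsv, dialect="excel-tab"):
        writer.writerow(["", ""] if tuple(line) in seen else ["R", "R"])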
I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file.
import numpy as np

f1 = np.loadtxt('SL05_AO_RO.tab')
f2 = np.loadtxt('covbreadth_common_sites.csv')

# sort rows lexicographically (by column 0, then column 1), keeping each row intact
f1 = f1[np.lexsort((f1[:, 1], f1[:, 0]))]
f2 = f2[np.lexsort((f2[:, 1], f2[:, 0]))]

i, j = 0, 0
while i < f1.shape[0]:
    while j < f2.shape[0] and f1[i][0] > f2[j][0]:
        j += 1
    while j < f2.shape[0] and f1[i][0] == f2[j][0] and f1[i][1] > f2[j][1]:
        j += 1
    if j < f2.shape[0] and np.array_equal(f1[i], f2[j]):
        print()
    else:
        print('R,R')
    i += 1
Load the data into ndarrays to optimize memory usage
Sort the data
Find matches in the sorted arrays
Total complexity is O(n*log(n) + m*log(m)), where n and m are the sizes of the input files.
Using a set() will not reduce memory usage per unique entry, so I do not recommend it for large datasets.
Since a CSV is just a DB dump, you can import it into any SQL database and then query it. This is a very efficient approach.
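For example, a minimal sketch with Python's built-in sqlite3 module; the database file, table, and column names here are made up for illustration, and both files are assumed to have exactly two columns as described:

import csv
import sqlite3

conn = sqlite3.connect("sites.db")  # on-disk DB keeps RAM usage low
conn.execute("CREATE TABLE IF NOT EXISTS seen (a INTEGER, b INTEGER)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_seen ON seen (a, b)")

# load the second file into the table
with open("SL05_AO_RO.tab") as tsv:
    conn.executemany("INSERT INTO seen VALUES (?, ?)",
                     csv.reader(tsv, dialect="excel-tab"))
conn.commit()

# stream the first file and query for each row
with open("covbreadth_common_sites.csv") as tsv, \
     open("output_SL05.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for a, b in csv.reader(tsv, dialect="excel-tab"):
        match = conn.execute("SELECT 1 FROM seen WHERE a=? AND b=?",
                             (a, b)).fetchone()
        writer.writerow(["", ""] if match else ["R", "R"])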

String Replace for Multiple Lines In A CSV

Below is a snippet from a csv file. The first column is the product number, 2 is the stock level, 3 is the target level, and 4 is the distance from target (target minus stock level.)
34512340,0,95,95
12395675,3,95,92
56756777,70,95,25
90673412,2,95,93
When the stock level gets to 5 or below, I want to have the stock levels updated from python when a user requests it.
I am currently using this piece of code which I have adapted from just updating one line in the CSV. It isn't working though. The first line is written back to the file as 34512340,0,95,95 and the rest of the file is deleted.
choice = input("\nTo update the stock levels of the above products, type 1. To cancel, enter anything else.")
if choice == '1':
    with open('stockcontrol.csv', newline='') as f:
        for line in f:
            data = line.split(",")
            productcode = int(data[0])
            target = int(data[2])
            stocklevel = int(data[1])
            if stocklevel <= 5:
                target = str(target)
                import sys
                import csv
                data = []
                newval = target
                newtlevel = "0"
                f = open("stockcontrol.csv")
                reader = csv.DictReader(f, fieldnames=['code', 'level', 'target', 'distancefromtarget'])
                for line in reader:
                    line['level'] = newval
                    line['distancefromtarget'] = newtlevel
                    data.append('%s,%s,%s,%s' % (line['code'], line['level'], line['target'], line['distancefromtarget']))
                f.close()
                f = open("stockcontrol.csv", "w")
                f.write("\n".join(data))
                f.close()
                print("The stock levels were updated successfully")
else:
    print("Goodbye")
Here is the code that I had changing one line in the CSV file and works:
with open('stockcontrol.csv', newline='') as f:
    for line in f:
        if code in line:
            data = line.split(",")
            target = (data[2])
            newlevel = stocklevel - quantity
            updatetarget = int(target) - int(newlevel)
            stocklevel = str(stocklevel)
            newlevel = str(newlevel)
            updatetarget = str(updatetarget)
            import sys
            import csv
            data = []
            code = code
            newval = newlevel
            newtlevel = updatetarget
            f = open("stockcontrol.csv")
            reader = csv.DictReader(f, fieldnames=['code', 'level', 'target', 'distancefromtarget'])
            for line in reader:
                if line['code'] == code:
                    line['level'] = newval
                    line['distancefromtarget'] = newtlevel
                data.append('%s,%s,%s,%s' % (line['code'], line['level'], line['target'], line['distancefromtarget']))
            f.close()
            f = open("stockcontrol.csv", "w")
            f.write("\n".join(data))
            f.close()
What can I change to make the code work? I basically want the program to loop through each line of the CSV file, and if the stock level (column 2) is equal to or less than 5, update the stock level to the target number in column 3, and then set the number in column 4 to zero.
Thanks,
The below code reads each line and checks the value of column 2. If it is less than or equal to 5, the value of column 2 is changed to the value of column 3 and the last column is changed to 0; otherwise, all the columns are left unchanged.
import sys
import csv

data = []
f = open("stockcontrol.csv")
reader = csv.DictReader(f, fieldnames=['code', 'level', 'target', 'distancefromtarget'])
for line in reader:
    if int(line['level']) <= 5:
        line['level'] = line['target']
        line['distancefromtarget'] = 0
    data.append("%s,%s,%s,%s" % (line['code'], line['level'], line['target'], line['distancefromtarget']))
f.close()
f = open("stockcontrol.csv", "w")
f.write("\n".join(data))
f.close()
Coming to the issues in your code:
You first read the file without using the csv module and get the values in each column by splitting each line, and then you use the csv module's DictReader to read values you already had.
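If you prefer to let the csv module handle the writing as well, here is a sketch of the same logic using csv.reader and csv.writer; it is not part of the original answer and assumes the four-column layout shown above:

import csv

# read everything once, adjusting rows with a stock level of 5 or less
with open("stockcontrol.csv", newline="") as f:
    rows = []
    for code, level, target, distance in csv.reader(f):
        if int(level) <= 5:
            level, distance = target, "0"
        rows.append([code, level, target, distance])

# write the updated rows back with csv.writer
with open("stockcontrol.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)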
