How to select data rows - python

I have a data file of 4 columns x 180,000 rows. I'd like to select entire rows of data to be saved to a new file, based on the criterion that the value in column 3 is within a specific interval, i.e. min value < column 3 value < max value.
Any ideas how to do this?

Use the csv module to read and write, then just filter:
import csv

with open(inputfilename, newline='') as inputfile, open(outputfilename, 'w', newline='') as outputfile:
    reader = csv.reader(inputfile)
    writer = csv.writer(outputfile)
    for row in reader:
        # use float() instead of int() if column 3 holds non-integer values
        if minval <= int(row[2]) <= maxval:
            writer.writerow(row)
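Here inputfilename, outputfilename, minval and maxval are placeholders you need to define yourself, e.g.:

inputfilename, outputfilename = 'data.txt', 'filtered.txt'
minval, maxval = 10, 100  # the interval bounds from your criterion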

This can be done with a simple CSV read/write, but it can be done more elegantly and in vectorized form using NumPy; moreover, since the number of rows is huge, NumPy may be a lot quicker.
import numpy as np

# Load the file into a 2-D array (pass delimiter=',' if the file is comma-delimited)
data = np.loadtxt('name_of_delimited_file.txt')
# Build a boolean mask of the rows where the condition on column 3 is met
condition_met = (data[:, 2] > min_val) & (data[:, 2] < max_val)
np.savetxt('output.txt', data[condition_met], delimiter=',')
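For completeness, the same filter can be written with pandas; a minimal sketch, assuming a comma-delimited file with no header row (min_val and max_val are placeholder bounds):

import pandas as pd

min_val, max_val = 10.0, 100.0  # placeholder interval bounds
df = pd.read_csv('name_of_delimited_file.txt', header=None)
df[(df[2] > min_val) & (df[2] < max_val)].to_csv('output.txt', index=False, header=False)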

Related

How to delete a row in a CSV file if a cell is empty using Python

I want to go through large CSV files and, if there is missing data, remove that row completely. This is row specific: if there is a cell that equals 0 or has no value, I want to remove the entire row. I want this to happen for all the columns, so if any column has a blank cell it should delete the row, and return the corrected data in a corrected CSV.
import csv
with open('data.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print(row)
        if not row[0]:
            print("12")
This is what I found and tried, but it doesn't seem to be working and I don't have any ideas about how to approach this problem. Help please?
Thanks!
Due to the way in which CSV reader presents rows of data, you need to know how many columns there are in the original CSV file. For example, if the CSV file content looks like this:
1,2
3,
4
Then the lists returned by iterating over the reader would look like this:
['1','2']
['3','']
['4']
As you can see, the third row only has one column, whereas the first and second rows have 2 columns, albeit that one is (effectively) empty.
This function allows you to either specify the number of columns (if you know it beforehand) or let the function figure it out. If not specified, it is assumed that the number of columns is the greatest number of columns found in any row.
So...
import csv

DELIMITER = ','

def valid_column(col):
    try:
        return float(col) != 0
    except ValueError:
        pass
    return len(col.strip()) > 0

def fix_csv(input_file, output_file, cols=0):
    if cols == 0:
        with open(input_file, newline='') as indata:
            cols = max(len(row) for row in csv.reader(indata, delimiter=DELIMITER))
    with open(input_file, newline='') as indata, open(output_file, 'w', newline='') as outdata:
        writer = csv.writer(outdata, delimiter=DELIMITER)
        for row in csv.reader(indata, delimiter=DELIMITER):
            if len(row) == cols:
                if all(valid_column(col) for col in row):
                    writer.writerow(row)

fix_csv('original.csv', 'fixed.csv')
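As a quick sanity check (assumed file contents, not from the original question), given an original.csv of:

1,2
3,
4
5,6

fix_csv('original.csv', 'fixed.csv') detects 2 columns and keeps only the rows where both columns are populated and non-zero, so fixed.csv contains:

1,2
5,6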
Maybe like this:
import csv

with open('data.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    data = list(csvreader)
data = [x for x in data if '' not in x and '0' not in x]
You can then rewrite the CSV file if you like.
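A minimal write-back sketch (note this overwrites data.csv with the filtered rows):

import csv

with open('data.csv', 'w', newline='') as csvfile:
    csv.writer(csvfile).writerows(data)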
Instead of using csv, you should use the Pandas module, something like this:
import pandas as pd
df = pd.read_csv('file.csv')
print(df)
index = 1  # index of the row that you want to remove
df = df.drop(index)
print(df)
df.to_csv('file.csv')
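Since the goal is to drop rows containing blank cells or zeros, here is a sketch of how that looks in pandas (assuming blank cells are read in as NaN):

import pandas as pd

df = pd.read_csv('file.csv')
df = df.dropna()                 # drop rows with any missing value
df = df[(df != 0).all(axis=1)]   # drop rows containing any zero
df.to_csv('file.csv', index=False)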

Read csv file with empty lines

Analysis software I'm using outputs many groups of results in 1 csv file and separates the groups with 2 empty lines.
I would like to break the results in groups so that I can then analyse them separately.
I'm sure there is a built-in function in Python (or one of its libraries) that does this; I tried this piece of code that I found somewhere, but it doesn't seem to work.
import csv
results = open('03_12_velocity_y.csv').read().split("\n\n")
# Feed first csv.reader
first_csv = csv.reader(results[0], delimiter=',')
# Feed second csv.reader
second_csv = csv.reader(results[1], delimiter=',')
Update:
The original code actually works, but my python skills are pretty limited and I did not implement it properly.
The .split('\n\n\n') method does work, but csv.reader returns an iterator object; to get the data into a list (or something similar), you need to iterate through all the rows and append them to the list.
I then used Pandas to remove the header and convert the scientific-notation values to float. Code is below. Thanks everyone for the help.
import csv
import pandas as pd

# Open the csv file, read it and split it where it encounters 2 empty lines ('\n\n\n')
results = open('03_12_velocity_y.csv').read().split('\n\n\n')
# Define the output - create an empty multi-dimensional list
output1 = [[], []]
# Create a csv.reader object that is used to iterate over rows of the first group
csv_reader1 = csv.reader(results[0].splitlines(), delimiter=',')
# Iterate through the rows and append the data to the list
for row in csv_reader1:
    output1.append(row)
df = pd.DataFrame(output1)
# Remove the first 7 rows of data (the start position of the slice is always included)
df = df.iloc[7:]
# Convert all data from string to float
df = df.astype(float)
If your row counts are inconsistent across groups, you'll need a little state machine to check when you're between groups and do something with the last group.
#!/usr/bin/env python3
import csv

def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as out_f:
        csv.writer(out_f).writerows(group)

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    group_i = 1
    group = []
    last_row = []
    for row in reader:
        if row == [] and last_row == [] and group != []:
            write_group(group, group_i)
            group = []
            group_i += 1
            continue
        if row == []:
            last_row = row
            continue
        group.append(row)
        last_row = row
    # flush the remaining group
    if group != []:
        write_group(group, group_i)
I mocked up this sample CSV:
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3


g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3


g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
And when I run the program above I get three CSV files:
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
group_2.csv
g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3
group_3.csv
g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
If your row counts are consistent, you can do this with fairly vanilla Python or using the Pandas library.
Vanilla Python
Define your group size and the size of the break (in "rows") between groups.
Loop over all the rows adding each row to a group accumulator.
When the group accumulator reaches the pre-defined group size, do something with it, reset the accumulator, and then skip break-size rows.
Here, I'm writing each group to its own numbered file:
import csv

group_sz = 5
break_sz = 2

def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as f_out:
        csv.writer(f_out).writerows(group)

with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    group_i = 1
    group = []
    for row in reader:
        group.append(row)
        if len(group) == group_sz:
            write_group(group, group_i)
            group_i += 1
            group = []
            for _ in range(break_sz):
                try:
                    next(reader)
                except StopIteration:
                    # gracefully ignore an expected StopIteration (at the end of the file)
                    break
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
g1r4c1,g1r4c2,g1r4c3
g1r5c1,g1r5c2,g1r5c3
With Pandas
I'm new to Pandas, and learning this as I go, but it looks like Pandas will automatically trim blank rows/records from a chunk of data.
With that in mind, all you need to do is specify the size of your group, and tell Pandas to read your CSV file in "iterator mode", where you can ask for a chunk (your group size) of records at a time:
import pandas as pd

group_sz = 5
with pd.read_csv("input.csv", header=None, iterator=True) as reader:
    i = 1
    while True:
        try:
            df = reader.get_chunk(group_sz)
        except StopIteration:
            break
        df.to_csv(f"group_{i}.csv")
        i += 1
Pandas adds an index column and a default header when it writes out the CSV:
group_1.csv
,0,1,2
0,g1r1c1,g1r1c2,g1r1c3
1,g1r2c1,g1r2c2,g1r2c3
2,g1r3c1,g1r3c2,g1r3c3
3,g1r4c1,g1r4c2,g1r4c3
4,g1r5c1,g1r5c2,g1r5c3
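If you don't want that index column and header in the output files, to_csv accepts flags to suppress both:

df.to_csv(f"group_{i}.csv", index=False, header=False)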
TRY this out with your output:
import pandas as pd

# csv file name to be read in
in_csv = 'input.csv'
# get the number of lines of the csv file to be read
number_lines = sum(1 for row in open(in_csv))
# size of rows of data to write to the csv,
# you can change the row size according to your need
rowsize = 500
# start looping through data, writing it to a new file for each set
for i in range(1, number_lines, rowsize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=rowsize,   # number of rows to read at each loop
                     skiprows=i)      # skip rows that have already been read
    # csv to write data to a new file with indexed name: input_1.csv etc.
    out_csv = 'input' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a')  # append data to csv file
I updated the question with the last details that answer my question.

Write and recode from one csv file to another

I am trying to select specific columns from a large tab-delimited CSV file and output only certain columns to a new CSV file. Furthermore, I want to recode the data as this happens. If the cell has a value of 0 then just output 0. However, if the cell has a value of greater than 0, then just output 1 (i.e., all values greater than 0 are coded as 1).
Here's what I have so far:
import csv
outputFile = open('output.csv', 'wb')
outputWriter = csv.writer(outputFile)
included_cols = range(9, 2844)
with open('source.txt', 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        content = list(row[i] for i in included_cols)
        outputWriter.writerow(content)
The first issue I am having is that I want to also take from column 6. I wasn't sure how to write column 6 and then columns 9-2844.
Second, I wasn't sure how to do the recoding on the fly as I write the new CSV.
I wasn't sure how to write column 6 and then columns 9-2844.
included_cols = [6] + list(range(9,2844))
This works because you can add two lists together. Note that in Python3, range doesn't return a list, so we have to coerce it.
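A quick demonstration of the concatenation (range shortened for readability):

>>> [6] + list(range(9, 13))
[6, 9, 10, 11, 12]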
I wasn't sure how to do the recoding on the fly
content = [(1 if int(row[i]) > 0 else 0) for i in included_cols]
This works because of the conditional expression: 1 if int(row[i]) > 0 else 0. The general form A if cond else B evaluates to either A or B, depending upon the condition. Note the int() conversion: csv yields strings, and comparing a string to an int raises a TypeError in Python 3.
Another form, which I think is "too clever by half", is content = [(int(row[i]) and 1) for i in included_cols]. This works because the and operator always returns one or the other of its operands, and 0 is the only falsy non-negative integer.
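A quick illustration of both forms on a sample row:

>>> row = ['0', '7', '0', '3']
>>> [(1 if int(v) > 0 else 0) for v in row]
[0, 1, 0, 1]
>>> [(int(v) and 1) for v in row]
[0, 1, 0, 1]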
This should work:
import csv

outputFile = open('output.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
included_cols = [5] + list(range(8, 2844))  # you can just merge two lists
with open('source.txt', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    outputWriter.writerow(next(reader))  # write the header row unchanged
    for row in reader:  # the reader continues after the header row
        content = [int(row[i]) if i == 5 else (0 if int(row[i]) == 0 else 1) for i in included_cols]
        outputWriter.writerow(content)
outputFile.close()

write 5M data to a csv file using file write operation

I'm writing some data to a single column of a CSV file using a file write operation, but I am only able to write values to 1,048,576 rows. I have 5 million integer data values and I want them saved in a single CSV file. Below is my code.
with open(path, 'w') as fp:
    for i in range(0, len(values)):
        fp.write(values[i] + '\n')
fp.close()
Is it possible to continue writing values after 1,048,576 rows into a 3rd/4th column of the CSV file? Or is it possible to write the values in a sequential way so that I can have all the values in a single file?
Note that 1,048,576 is Excel's maximum row count; the file itself has no such limit, so your write most likely succeeded and Excel is simply truncating the display. To lay the values out across columns instead, you can use itertools.zip_longest (izip_longest in Python 2) to "chunk" the values into "columns", then use the csv module to write those rows to the file. eg:
import csv
from itertools import zip_longest

N = 5  # adapt as needed
values = range(1, 23)  # use real values here
with open(path, 'w', newline='') as fout:
    csvout = csv.writer(fout)
    rows = zip_longest(*[iter(values)] * N, fillvalue='')
    csvout.writerows(rows)
This will give you the following output:
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
16,17,18,19,20
21,22,,,
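The rows = zip_longest(*[iter(values)] * N, fillvalue='') line is the standard "grouper" idiom: [iter(values)] * N repeats the same iterator N times, so zip_longest pulls N consecutive items for each row. A small illustration:

>>> from itertools import zip_longest
>>> it = iter([1, 2, 3, 4, 5, 6, 7])
>>> list(zip_longest(it, it, it, fillvalue=''))
[(1, 2, 3), (4, 5, 6), (7, '', '')]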
You can also "transpose" the data so it "runs the other way round", eg:
import csv
from itertools import zip_longest

N = 5  # adapt as needed
values = range(1, 23)  # use real values here
with open(path, 'w', newline='') as fout:
    csvout = csv.writer(fout)
    rows = zip_longest(*[iter(values)] * N, fillvalue='')
    transposed = zip(*rows)
    csvout.writerows(transposed)
This will give you:
1,6,11,16,21
2,7,12,17,22
3,8,13,18,
4,9,14,19,
5,10,15,20,
As an alternative, you can use islice to give you the required number of columns per row as follows:
from itertools import islice
import csv

path = 'output.txt'
values = range(105)  # create sample 'values' data
columns = 10
ivalues = iter(values)
with open(path, 'w', newline='') as fp:
    csv_output = csv.writer(fp)
    for row in iter(lambda: list(islice(ivalues, columns)), []):
        csv_output.writerow(row)
Giving you the following:
0,1,2,3,4,5,6,7,8,9
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
100,101,102,103,104
Note: in Python 2 you should convert range to xrange in your example, to avoid Python creating a huge list of numbers to iterate over (in Python 3, range is already lazy).

Deleting Multiple Rows in a Matrix

I'm working on writing a program that takes data from a CSV and turns it into a table to be exported to a PDF. The CSV I am working with has a bunch of empty rows, so when I create my Matrix in Python, I have a bunch of empty rows. I want to delete all rows beginning with ''. The code I wrote is:
i = 0
x = rows - empty  # where empty has been defined and is the number of rows I need to delete
for i in range(x):
    if Matrix[i][0] == '':
        del Matrix[i]
    i += 1
The issue I'm having is if there are two consecutive empty rows only one is deleted. Any ideas on how to get rid of both lines?
I create and fill the Matrix using the following code:
##creates empty matrix
with open(filename) as csvfile:
serverinfo=csv.reader(csvfile, delimiter=",", quotechar="|")
rows=0
for row in serverinfo:
NumColumns = len(row)
rows += 1
Matrix=[[0 for x in xrange(9)] for x in xrange(rows)]
csvfile.close()
##fills Matrix
with open(filename) as csvfile:
serverinfo=csv.reader(csvfile, delimiter=",", quotechar="|")
rows=0
for row in serverinfo:
colnum = 0
for col in row:
Matrix[rows][colnum] = col
if col==0:
del col
colnum += 1
rows += 1
csvfile.close()
The reason consecutive empty rows survive is that deleting while iterating shifts the list: after del Matrix[i], the row that was at i+1 slides into index i, but the loop moves on to i+1 and skips it. So instead of deleting them after, just don't load them to start with. You can also shortcut reading the file twice: as it appears you have a column limit of 9 set, for each row, just pad it out with 0's to that size... eg:
import csv
from itertools import chain, islice, repeat

COLS = 9  # or pre-scan file to get max columns
FILL_VALUE = 0  # or None, or blank for instance

with open(filename) as fin:
    csvin = csv.reader(fin)  # use appropriate delimiter/dialect settings
    # filter out rows that are completely empty or have a blank 1st column
    non_blanks = (row for row in csvin if row and row[0])
    # pad each remaining row with FILL_VALUE and trim it to exactly COLS columns
    matrix = [list(islice(chain(row, repeat(FILL_VALUE)), COLS)) for row in non_blanks]
Depending what you're doing with the data, you may also wish to look at the numpy module and the available loadtxt() method.
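For the NumPy route, a minimal sketch (assumptions: the file is comma-delimited, the empty rows are fully blank lines, and all remaining rows have the same number of columns, since loadtxt requires a rectangular layout):

import numpy as np

# loadtxt skips fully blank lines on its own; dtype=str keeps the cells as text
matrix = np.loadtxt(filename, delimiter=',', dtype=str)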
