Selecting rows in a CSV file and writing them to another CSV file

I have a CSV file with two columns (titled value and image). The value column contains values in ascending order (0, 25, 30, ...), and the image column contains paths to images (e.g. X.jpg). There are 81 lines in total including the header (that is, 80 values and 80 images).
I want to divide this list four ways. The idea is to have a spread of pairs of images.
For the first group I took the image from every two adjacent rows (2+3, 4+5, ...) and wrote them to a new CSV file, each image in a different column. Here's the code:
import csv
f = open('random_sorted.csv')
csv_f = csv.reader(f)
i = 0
prev = ""
# open csv file for writing (Python 3: text mode with newline=''; Python 2 used 'wb')
with open('first_group.csv', 'w', newline='') as test_file:
    csv_writer = csv.writer(test_file)
    csv_writer.writerow(["image1"] + ["image2"])
    for row in csv_f:
        if i%2 == 0 and i!=0:
            #print prev + "," + row[1]
            csv_writer.writerow([prev] + [row[1]])
        else:
            prev = row[1]
        i = i+1
Here's the output of this:
I want to keep the same concept for the other three groups (write the paired images into a new CSV file with two columns), but increase the spread. That is, pair rows that are 5 apart (i.e. 2+7 etc.), 7 apart (i.e. 2+9 etc.), and 9 apart. I would love to get some directions as to how to do this. I was lucky with the first group (I had just learned about the remainder operator in the Codecademy course), but I can't think of ideas for the other groups.

First collect all the rows in the csv file in a list:
import csv
with open('random_sorted.csv') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=';')  # adjust the delimiter to match your file
    headers = next(csv_reader)
    rows = [row for row in csv_reader]
Then set your required step size (5, 7 or 9) and identify the rows on the basis of their index in the list of rows:
with open('first_group.csv', 'w', newline='') as test_file:  # use 'wb' on Python 2
    csv_writer = csv.writer(test_file)
    csv_writer.writerow(["image1"] + ["image2"])
    step_size = 7 # set step size here
    seen = set() # here we remember images we've already seen
    for x in range(0, len(rows)-step_size):
        img1 = rows[x][1]
        img2 = rows[x+step_size][1]
        if not (img1 in seen or img2 in seen):
            csv_writer.writerow([img1, img2])
            seen.add(img1)
            seen.add(img2)
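The same loop can be wrapped in a small helper so that the three remaining groups are produced in one run. This is only a sketch, not part of the original answer: the function name pair_with_step and the sample image names are made up for illustration, and the step sizes 5, 7 and 9 are taken from the question.

```python
def pair_with_step(images, step_size):
    """Pair images[x] with images[x + step_size], using each image at most once."""
    seen = set()
    pairs = []
    for x in range(len(images) - step_size):
        img1, img2 = images[x], images[x + step_size]
        if img1 not in seen and img2 not in seen:
            pairs.append((img1, img2))
            seen.update((img1, img2))
    return pairs


# Hypothetical stand-in for the 80 image paths read from the CSV
images = [f"{n}.jpg" for n in range(1, 81)]
groups = {step: pair_with_step(images, step) for step in (5, 7, 9)}
```

Each list in groups can then be written out with csv_writer.writerows. Depending on the step size, a few images near the end of the list can stay unpaired.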

Related

Read csv file with empty lines

The analysis software I'm using outputs many groups of results in one CSV file and separates the groups with two empty lines.
I would like to break the results into groups so that I can then analyse them separately.
I'm sure there is a built-in function in Python (or one of its libraries) that does this. I tried this piece of code that I found somewhere, but it doesn't seem to work.
import csv
results = open('03_12_velocity_y.csv').read().split("\n\n")
# Feed first csv.reader
first_csv = csv.reader(results[0], delimiter=',')
# Feed second csv.reader
second_csv = csv.reader(results[1], delimiter=',')
Update:
The original code actually works, but my Python skills are pretty limited and I did not implement it properly.
The .split('\n\n\n') method does work, but csv.reader returns an object, and to get the data into a list (or something similar) you need to iterate through all the rows and append them to the list.
I then used Pandas to remove the header and convert the values in scientific notation to float. The code is below. Thanks everyone for the help.
import csv
import pandas as pd
# Open the csv file, read it and split it when it encounters 2 empty lines (\n\n\n)
results = open('03_12_velocity_y.csv').read().split('\n\n\n')
# Create csv.reader objects that are used to iterate over rows in a csv file
# Define the output - create an empty multi-dimensional list
output1 = [[],[]]
# Iterate through the rows in the csv file and append the data to the empty list
# Feed first csv.reader
csv_reader1 = csv.reader(results[0].splitlines(), delimiter=',')
for row in csv_reader1:
    output1.append(row)
df = pd.DataFrame(output1)
# remove first 7 rows of data (the start position of the slice is always included)
df = df.iloc[7:]
# Convert all data from string to float
df = df.astype(float)
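As the update notes, csv.reader expects an iterable of lines, so passing it a plain string makes it iterate character by character. Below is a minimal sketch of the split-then-parse idea on an in-memory string. The sample text and the helper name split_groups are invented for illustration, and real files may use different line endings or separators.

```python
import csv


def split_groups(text, separator="\n\n\n"):
    """Split text on the blank-line separator and parse each chunk as CSV rows."""
    groups = []
    for chunk in text.split(separator):
        lines = chunk.strip().splitlines()
        if lines:  # ignore empty chunks, e.g. after a trailing separator
            groups.append(list(csv.reader(lines)))
    return groups


sample = "a,b\n1,2\n\n\nc,d\n3,4\n"
groups = split_groups(sample)
# groups[0] holds the rows of the first block, groups[1] those of the second
```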

If your row counts are inconsistent across groups, you'll need a little state machine to check when you're between groups and do something with the last group.
#!/usr/bin/env python3
import csv
def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as out_f:
        csv.writer(out_f).writerows(group)
with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    group_i = 1
    group = []
    last_row = []
    for row in reader:
        if row == [] and last_row == [] and group != []:
            write_group(group, group_i)
            group = []
            group_i += 1
            continue
        if row == []:
            last_row = row
            continue
        group.append(row)
        last_row = row
    # flush remaining group
    if group != []:
        write_group(group, group_i)
I mocked up this sample CSV (groups separated by two empty lines):
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3


g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3


g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
And when I run the program above I get three CSV files:
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
group_2.csv
g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3
group_3.csv
g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3

If your row counts are consistent, you can do this with fairly vanilla Python or using the Pandas library.
Vanilla Python
Define your group size and the size of the break (in "rows") between groups.
Loop over all the rows adding each row to a group accumulator.
When the group accumulator reaches the pre-defined group size, do something with it, reset the accumulator, and then skip break-size rows.
Here, I'm writing each group to its own numbered file:
import csv
group_sz = 5
break_sz = 2
def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as f_out:
        csv.writer(f_out).writerows(group)
with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    group_i = 1
    group = []
    for row in reader:
        group.append(row)
        if len(group) == group_sz:
            write_group(group, group_i)
            group_i += 1
            group = []
            for _ in range(break_sz):
                try:
                    next(reader)
                except StopIteration: # gracefully ignore an expected StopIteration (at the end of the file)
                    break
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
g1r4c1,g1r4c2,g1r4c3
g1r5c1,g1r5c2,g1r5c3
With Pandas
I'm new to Pandas, and learning this as I go, but it looks like Pandas will automatically skip blank rows/records when reading a chunk of data (read_csv has skip_blank_lines=True by default).
With that in mind, all you need to do is specify the size of your group, and tell Pandas to read your CSV file in "iterator mode", where you can ask for a chunk (your group size) of records at a time:
import pandas as pd
group_sz = 5
with pd.read_csv("input.csv", header=None, iterator=True) as reader:
    i = 1
    while True:
        try:
            df = reader.get_chunk(group_sz)
        except StopIteration:
            break
        df.to_csv(f"group_{i}.csv")
        i += 1
Pandas adds an index column and a default header row when it writes out the CSV:
group_1.csv
,0,1,2
0,g1r1c1,g1r1c2,g1r1c3
1,g1r2c1,g1r2c2,g1r2c3
2,g1r3c1,g1r3c2,g1r3c3
3,g1r4c1,g1r4c2,g1r4c3
4,g1r5c1,g1r5c2,g1r5c3

Try this out with your input file:
import pandas as pd
# csv file name to be read in
in_csv = 'input.csv'
# get the number of lines of the csv file to be read
number_lines = sum(1 for row in (open(in_csv)))
# number of rows of data to write to each csv;
# you can change the row size according to your need
rowsize = 500
# start looping through data writing it to a new file for each set
for i in range(1,number_lines,rowsize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows = rowsize,#number of rows to read at each loop
                     skiprows = i)#skip rows that have been read
    #csv to write data to a new file with indexed name, e.g. input1.csv, input501.csv
    out_csv = 'input' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a', #append data to csv file
              )

I updated the question with the last details that answer my question.

Python: How to iterate every third row starting with the second row of a csv file

I'm trying to write a program that iterates through a CSV file row by row. It will create three new CSV files and write data from the source CSV file to each of them, covering every row of the source file.
For the first if statement, I want it to copy every third row starting at the first row and save it to a new CSV file (the next rows it copies would be row 4, row 7, row 10, etc.).
For the second if statement, I want it to copy every third row starting at the second row and save it to a new CSV file (the next rows it copies would be row 5, row 8, row 11, etc.).
For the third if statement, I want it to copy every third row starting at the third row and save it to a new CSV file (the next rows it copies would be row 6, row 9, row 12, etc.).
The second "if" statement I wrote, which creates the first file "agentList1.csv", works exactly the way I want it to, but I can't figure out how to get the first "elif" statement to start from the second row and the second "elif" statement to start from the third row. Any help would be much appreciated!
Here's my code:
for index, row in Sourcedataframe.iterrows(): #going through each row line by line
    #this for loop counts the amount of times it has gone through the csv file. If it has gone through it more than three times, it resets the counter back to 1.
    for column in Sourcedataframe:
        if count > 3:
            count = 1
        #if program is on its first count, it opens the 'Sourcedataframe', reads/writes every third row to a new csv file named 'agentList1.csv'.
        if count == 1:
            with open('blankAgentList.csv') as infile:
                with open('agentList1.csv', 'w') as outfile:
                    reader = csv.DictReader(infile)
                    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
                    writer.writeheader()
                    for row in reader:
                        count2 += 1
                        if not count2 % 3:
                            writer.writerow(row)
        elif count == 2:
            with open('blankAgentList.csv') as infile:
                with open('agentList2.csv', 'w') as outfile:
                    reader = csv.DictReader(infile)
                    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
                    writer.writeheader()
                    for row in reader:
                        count2 += 1
                        if not count2 % 3:
                            writer.writerow(row)
        elif count == 3:
            with open('blankAgentList.csv') as infile:
                with open('agentList3.csv', 'w') as outfile:
                    reader = csv.DictReader(infile)
                    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
                    writer.writeheader()
                    for row in reader:
                        count2 += 1
                        if not count2 % 3:
                            writer.writerow(row)
        count = count + 1 #counts how many times it has run through the main for loop.

Read the CSV into a DataFrame (for example with pd.read_csv(..., header=0), so that the first line is treated as the header and indexing starts from the second line),
then pass the row/record number to the iloc indexer to fetch a particular record, e.g.
df.iloc[3, :]

You are opening your CSV file from the beginning in each if clause. I believe you have already loaded your file into Sourcedataframe, so just get rid of reader = csv.DictReader(infile) and read the data like this:
Sourcedataframe.iloc[column]
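One way to express "every third row starting at row k" is to number the data rows and use the remainder of the row index. The following is a sketch under the assumption that the rows are already in a list. The name split_round_robin and the sample rows are hypothetical.

```python
def split_round_robin(rows, n):
    """Distribute rows into n lists: rows 1, n+1, ... go to the first list, and so on."""
    buckets = [[] for _ in range(n)]
    for index, row in enumerate(rows):
        buckets[index % n].append(row)
    return buckets


rows = [f"row{k}" for k in range(1, 8)]
first, second, third = split_round_robin(rows, 3)
```

Each of the three lists can then be written to its own output file.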

Using plain python we can create a solution that works for any number of interleaved data rows, let's call it NUM_ROWS, not just three.
Nota bene: the solution does not need to read and keep all the input data in memory. It processes one line at a time, grouping only the last few lines needed, and works fine for a very large input file.
Assuming your input file contains a number of data rows which is a multiple of NUM_ROWS, i.e. the rows can be split evenly to the output files:
NUM_ROWS = 3
outfiles = [open(f'blankAgentList{i}.csv', 'w') for i in range(1,NUM_ROWS+1)]
with open('blankAgentList.csv') as infile:
    header = infile.readline() # read/skip the header
    for f in outfiles: # repeat header in all output files if needed
        f.write(header)
    row_groups = zip(*[iter(infile)]*NUM_ROWS)
    for rg in row_groups:
        for f, r in zip(outfiles, rg):
            f.write(r)
for f in outfiles:
    f.close()
Otherwise, for any number of data rows we can use
import itertools as it
NUM_ROWS = 3
outfiles = [open(f'blankAgentList{i}.csv', 'w') for i in range(1,NUM_ROWS+1)]
with open('blankAgentList.csv') as infile:
    header = infile.readline() # read/skip the header
    for f in outfiles: # repeat header in all output files if needed
        f.write(header)
    row_groups = it.zip_longest(*[iter(infile)]*NUM_ROWS)
    for rg in row_groups:
        for f, r in it.zip_longest(outfiles, rg):
            if r is None:
                break
            f.write(r)
for f in outfiles:
    f.close()
which, for example, with an input file of
A,B,C
r1a,r1b,r1c
r2a,r2a,r2c
r3a,r3b,r3c
r4a,r4b,r4c
r5a,r5b,r5c
r6a,r6b,r6c
r7a,r7b,r7c
produces (output copied straight from the terminal)
(base) SO $ cat blankAgentList.csv
A,B,C
r1a,r1b,r1c
r2a,r2a,r2c
r3a,r3b,r3c
r4a,r4b,r4c
r5a,r5b,r5c
r6a,r6b,r6c
r7a,r7b,r7c
(base) SO $ cat blankAgentList1.csv
A,B,C
r1a,r1b,r1c
r4a,r4b,r4c
r7a,r7b,r7c
(base) SO $ cat blankAgentList2.csv
A,B,C
r2a,r2a,r2c
r5a,r5b,r5c
(base) SO $ cat blankAgentList3.csv
A,B,C
r3a,r3b,r3c
r6a,r6b,r6c
Note: I understand the line
row_groups = zip(*[iter(infile)]*NUM_ROWS)
may be intimidating at first (it was for me when I started).
All it does is simply to group consecutive lines from the input file.
If your objective includes learning Python, I recommend studying it thoroughly via a book or a course or both and practising a lot.
One key subject is the iteration protocol, along with all the other protocols. And namespaces.
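To see what the grouping idiom does in isolation, here is a small sketch on a plain list instead of a file. The list items and the variable names are only for illustration.

```python
from itertools import zip_longest

items = list(range(1, 8))  # seven items, not a multiple of three

# The list holds the SAME iterator three times, so zip pulls three
# consecutive items from it for every tuple it builds.
complete_groups = list(zip(*[iter(items)] * 3))

# zip_longest keeps the incomplete last group and pads it with None.
all_groups = list(zip_longest(*[iter(items)] * 3))
```

With plain zip the leftover seventh item is dropped, which is why the second version of the program switches to zip_longest and skips the None padding.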

Filtering specific lines in csv keeping track of last value

I've got a .csv file with different data formats and I'm trying to operate with the values on the same column.
My CSV file is something like this:
"int","float","string", more data
Example:
"2","1.378","Johnny"
"1","1.379","Walker"
"5","1.380","Jack"
"8","1.700","Daniels"
"8","1.710","Baileys"
"8","1.381","Monkey"
"8","1.711","Shoulder"
"8","1.383","Captain"
"8","1.385","Morgan"
"8","1.392","Drinks"
More rows
I would like to subtract values in the second column if their difference is >x.
(only those, I don't care about the others).
My code so far:
import csv
with open('input.csv', 'r') as file, open('output.csv', 'w') as f_out:
    readCSV = csv.reader(file)
    writeCSV = csv.writer(f_out, lineterminator='\n')
    last = None
    for row in readCSV:
        datalat = float(row[1])
        if last is not None:
            #print("difference -> %f" %(datalat-last))
            outp = (datalat-last)
            if outp <= 0.02:
                writeCSV.writerow(row)
        last = datalat
The output looks like:
5,1.380,Jack
8,1.710,Baileys
8,1.381,Monkey
8,1.383,Captain
8,1.385,Morgan
8,1.392,Drinks
But I would like it to be:
"2","1.378","Johnny"
"1","1.379","Walker"
"5","1.380","Jack"
"8","1.381","Monkey"
"8","1.383","Captain"
"8","1.385","Morgan"
"8","1.392","Drinks"
So what it should do is only write rows that have less than 0.02 difference, IF there is a row with a bigger difference discard it, then compare the next row to the last written row, as opposed to the last discarded row.

Two things:
You should take the absolute value (using abs) of the difference, as you don't know a priori which of the two is greater.
Only update last if the condition is fulfilled, so last is never a discarded value.
last = float(next(readCSV)[1]) # assign first reference value
file.seek(0) # rewind the input file so the first row is processed (and written) too
for row in readCSV:
    datalat = float(row[1])
    diff = abs(datalat-last)
    if diff <= 0.02:
        writeCSV.writerow(row)
        last = datalat
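The same idea can also be written without rewinding the file, by treating the first row as the initial reference and always keeping it. This is a self-contained sketch on an in-memory copy of part of the sample data. The function name filter_close_rows and its parameters are assumptions made for illustration.

```python
import csv
import io


def filter_close_rows(rows, column=1, max_diff=0.02):
    """Keep rows whose value is within max_diff of the last KEPT row's value."""
    kept = []
    last = None
    for row in rows:
        value = float(row[column])
        if last is None or abs(value - last) <= max_diff:
            kept.append(row)
            last = value  # only updated when the row is kept
    return kept


sample = io.StringIO(
    '"2","1.378","Johnny"\n'
    '"1","1.379","Walker"\n'
    '"5","1.380","Jack"\n'
    '"8","1.700","Daniels"\n'
    '"8","1.710","Baileys"\n'
    '"8","1.381","Monkey"\n'
)
kept = filter_close_rows(csv.reader(sample))
names = [row[2] for row in kept]
```

The rows for Daniels and Baileys are discarded, and Monkey is compared against Jack, the last row that was kept.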

Get number of rows from .csv file

I am writing a Python module where I read a .csv file with 2 columns and an arbitrary number of rows. I then go through these rows until column 1 > x. At this point I need the data from the current row and the previous row to do some calculations.
Currently, I am using 'for i in range(rows)', but each CSV file will have a different number of rows, so this won't work.
The code can be seen below:
rows = 73
for i in range(rows):
    c_level = Strapping_Table[Tank_Number][i,0] # Current level
    c_volume = Strapping_Table[Tank_Number][i,1] # Current volume
    if c_level > level:
        p_level = Strapping_Table[Tank_Number][i-1,0] # Previous level
        p_volume = Strapping_Table[Tank_Number][i-1,1] # Previous volume
        x = level - p_level # Intermediate values
        if x < 0:
            x = 0
        y = c_level - p_level
        z = c_volume - p_volume
        volume = p_volume + ((x / y) * z)
        return volume
When playing around with arrays, I used:
for row in Tank_data:
    print(row[c]) # print column c
    time.sleep(1)
This goes through all the rows, but I cannot access the previous row's data with this method.
I have thought about storing the previous row and the current row in every loop, but before I do this I was wondering if there is a simple way to get the number of rows in a CSV file.

Store the previous line:
with open("myfile.txt", "r") as file:
    previous_line = next(file)
    for line in file:
        print(previous_line, line)
        previous_line = line
Or you can wrap it in a generator:
def prev_curr(file_name):
    with open(file_name, "r") as file:
        previous_line = next(file)
        for line in file:
            yield previous_line, line
            previous_line = line
# usage
for prev, curr in prev_curr("myfile.txt"):
    do_your_thing()

You should use enumerate.
for i, row in enumerate(tank_data):
    if i > 0: # skip the first row: tank_data[-1] would wrap around to the last row
        print(row[c], tank_data[i-1][c])

Since the length of each row in the CSV is unknown until it's read, you'll have to do an initial pass through the file if you want to find the number of rows, e.g.:
numberOfRows = sum(1 for row in file)
However, that would mean your code reads the CSV twice, which you may not want if the file is very big. The simple option of storing the previous row in a variable on each iteration may be the best option in that case.
An alternative route could be to read in the file and analyse it from e.g. a pandas DataFrame (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html),
but again this could be slow if your CSV is too big.
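To illustrate the "store the previous row" option on the interpolation from the question, here is a sketch that needs no row count at all. It assumes the table is an iterable of (level, volume) pairs in ascending order of level. The function name interpolate_volume and the sample table are hypothetical.

```python
def interpolate_volume(table, level):
    """Linearly interpolate the volume for level from (level, volume) rows."""
    previous = None
    for current in table:
        c_level, c_volume = current
        if c_level > level and previous is not None:
            p_level, p_volume = previous
            x = max(level - p_level, 0)  # same clamp as in the question
            return p_volume + (x / (c_level - p_level)) * (c_volume - p_volume)
        previous = current
    return None  # no bracketing pair of rows was found


# Hypothetical strapping table: (level, volume) pairs in ascending order
table = [(0.0, 0.0), (10.0, 100.0), (20.0, 300.0)]
volume = interpolate_volume(table, 15.0)
```

Because only the previous and the current row are kept, the same loop works on rows streamed from csv.reader (after converting the strings to float) as well as on an array already in memory.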

Deleting Multiple Rows in a Matrix

I'm working on writing a program that takes data from a CSV and turns it into a table to be exported to a PDF. The CSV I am working with has a bunch of empty rows so when I create my Matrix in Python, I have a bunch of empty rows. I want to delete all rows beginning with ''. The code I wrote is:
i=0
x=rows-empty ##where empty has been defined and the number of rows I need to delete.
for i in range(x):
    if Matrix[i][0] == '':
        del Matrix[i]
    i+=1
The issue I'm having is if there are two consecutive empty rows only one is deleted. Any ideas on how to get rid of both lines?
I create and fill the Matrix using the following code:
##creates empty matrix
with open(filename) as csvfile:
    serverinfo=csv.reader(csvfile, delimiter=",", quotechar="|")
    rows=0
    for row in serverinfo:
        NumColumns = len(row)
        rows += 1
    Matrix=[[0 for x in range(9)] for x in range(rows)]
csvfile.close()
##fills Matrix
with open(filename) as csvfile:
    serverinfo=csv.reader(csvfile, delimiter=",", quotechar="|")
    rows=0
    for row in serverinfo:
        colnum = 0
        for col in row:
            Matrix[rows][colnum] = col
            if col==0:
                del col
            colnum += 1
        rows += 1
csvfile.close()

Instead of deleting them afterwards, just don't load them to start with. You can also avoid reading the file twice: it appears you have a column limit of 9, so just pad each row out with 0s to that size, e.g.:
import csv
from itertools import chain, islice, repeat
COLS = 9 # or pre-scan file to get max columns
FILL_VALUE = 0 # or None, or blank for instance
with open(filename) as fin:
    csvin = csv.reader(fin) # use appropriate delimiter/dialect settings
    non_blanks = (row for row in csvin if row and row[0]) # filter out empty rows and rows with a blank 1st col
    matrix = [list(islice(chain(row, repeat(FILL_VALUE)), COLS)) for row in non_blanks]
Depending on what you're doing with the data, you may also wish to look at the numpy module and its loadtxt() function.
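The loop in the question skips a row because deleting Matrix[i] shifts the following rows down by one while i still moves forward. If the matrix has already been built, one option is to create a filtered copy instead of deleting in place. A small sketch with made-up data:

```python
matrix = [
    ["server1", "10.0.0.1"],
    ["", ""],
    ["", ""],
    ["server2", "10.0.0.2"],
]

# Keep only rows that are non-empty and whose first cell is not blank.
cleaned = [row for row in matrix if row and row[0] != ""]
```

Both consecutive empty rows are dropped, since no indices shift while the new list is being built.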
