Deleting useless columns and rows in a CSV file and saving the result using Python

Here's a sample CSV file:
out_gate,uless_col,in_gate,n_con
p,x,x,1
p,x,y,1
p,x,z,1
a_a,u,b,1
a_a,s,b,3
a_b,e,a,2
a_b,l,c,4
a_c,e,a,5
a_c,s,b,5
a_c,s,b,3
a_c,c,a,4
a_d,o,c,2
a_d,l,c,3
a_d,m,b,2
p,y,x,1
p,y,y,1
p,y,z,3
I want to remove the useless column (the 2nd column) and the useless rows (the first three and last three rows), then save the result as a new CSV file. How can I also handle a CSV file that has more than 10 useless columns and rows?
(Assume the useless rows are located only at the top or the bottom of the file, not scattered in the middle.)
(Also assume the first element of every row we want to keep starts with 'a_'.)
Can I get a solution that doesn't use NumPy or pandas as well? Thanks!

Assuming that you have one or more unwanted columns and that the wanted rows start with "a_":
import csv
with open('filename.csv', newline='') as infile:
    reader = csv.reader(infile)
    header = next(reader)
    data = list(reader)
useless = {'uless_col', 'n_con'}  # Let's say there are 2 useless columns
mask, new_header = zip(*[(i, name) for i, name in enumerate(header)
                         if name not in useless])
# mask is (0, 2) - the column mask
# new_header is ('out_gate', 'in_gate') - the new column headers
new_data = [[row[i] for i in mask] for row in data]  # Remove unwanted columns
new_data = [row for row in new_data if row[0].startswith("a_")]  # Remove unwanted rows
# Write to a new file (the name is just an example) so the original stays untouched
with open('new_filename.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(new_header)
    writer.writerows(new_data)
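If the file is too large to load with list(reader), the same column mask can be applied one row at a time. This is only a minimal sketch, assuming (as in the question) that wanted rows start with "a_"; the function name strip_csv and its parameters are made up for illustration, and the demo uses in-memory text instead of real files:

```python
import csv
import io


def strip_csv(infile, outfile, useless, prefix="a_"):
    """Copy CSV rows from infile to outfile, dropping the named columns
    and every data row whose first kept field does not start with prefix."""
    reader = csv.reader(infile)
    writer = csv.writer(outfile, lineterminator="\n")
    header = next(reader)
    # Indexes of the columns to keep
    mask = [i for i, name in enumerate(header) if name not in useless]
    writer.writerow([header[i] for i in mask])
    for row in reader:
        kept = [row[i] for i in mask]
        if kept and kept[0].startswith(prefix):
            writer.writerow(kept)


# Demo on in-memory text; real code would pass files opened with newline=""
sample = "out_gate,uless_col,in_gate,n_con\np,x,x,1\na_a,u,b,1\na_b,e,a,2\np,y,z,3\n"
result = io.StringIO()
strip_csv(io.StringIO(sample), result, useless={"uless_col"})
print(result.getvalue())
```

With ten or more unwanted columns, only the useless set grows; the rest stays the same.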

You can try this:
import csv
with open('filename.csv', newline='') as f:
    data = list(csv.reader(f))
header = [data[0][0]] + data[0][2:]  # drop the 2nd column from the header
# drop the 2nd column from every row, then the first and last three rows
final_data = [[i[0]] + i[2:] for i in data[1:]][3:-3]
with open('filename.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerows([header] + final_data)
Output:
out_gate,in_gate,n_con
a_a,b,1
a_a,b,3
a_b,a,2
a_b,c,4
a_c,a,5
a_c,b,5
a_c,b,3
a_c,a,4
a_d,c,2
a_d,c,3
a_d,b,2
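If you would rather trim by position without reading the whole file into a list first, the first and last N rows can be skipped while streaming. A rough sketch of that idea; trim_rows is a hypothetical helper, not part of the csv module, and the sample text is invented:

```python
import csv
import io
from collections import deque
from itertools import islice


def trim_rows(rows, head, tail):
    """Yield rows, skipping the first `head` and the last `tail` of them."""
    it = iter(rows)
    # Discard the leading rows
    for _ in islice(it, head):
        pass
    # Hold back `tail` rows so the final ones are never yielded
    held = deque()
    for row in it:
        held.append(row)
        if len(held) > tail:
            yield held.popleft()


sample = "h1,h2\np,1\na_a,2\na_b,3\np,4\n"
reader = csv.reader(io.StringIO(sample))
header = next(reader)
body = list(trim_rows(reader, head=1, tail=1))
print(header, body)
```

For the sample file in the question you would call it with head=3 and tail=3.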

The solution below uses Pandas.
As the documentation for the pandas DataFrame.drop function suggests, you can do the following:
import pandas as pd
df = pd.read_csv("csv_name.csv")
df = df.drop(columns=['uless_col'])  # drop() returns a new DataFrame, so assign it
The code above drops columns; you can drop rows by index label like this:
df = df.drop([0, 1])
Alternatively, don't read in the column in the first place:
df = pd.read_csv("csv_name.csv",
                 usecols=["out_gate", "in_gate", "n_con"])

Related

Read csv file with empty lines

The analysis software I'm using outputs many groups of results in one CSV file and separates the groups with 2 empty lines.
I would like to break the results into groups so that I can then analyse them separately.
I'm sure there is a built-in function in Python (or one of its libraries) that does this. I tried this piece of code that I found somewhere, but it doesn't seem to work.
import csv
results = open('03_12_velocity_y.csv').read().split("\n\n")
# Feed first csv.reader
first_csv = csv.reader(results[0], delimiter=',')
# Feed second csv.reader
second_csv = csv.reader(results[1], delimiter=',')
Update:
The original code actually works, but my Python skills are pretty limited and I did not implement it properly.
The .split('\n\n\n') method does work, but csv.reader returns an object; to get the data into a list (or something similar), you need to iterate through all the rows and append them to the list.
I then used Pandas to remove the header and convert the values in scientific notation to float. The code is below. Thanks, everyone, for the help.
import csv
import pandas as pd
# Open the csv file, read it and split it when it encounters 2 empty lines (\n\n\n)
with open('03_12_velocity_y.csv') as f:
    results = f.read().split('\n\n\n')
# Create csv.reader objects that are used to iterate over rows in a csv file
# Define the output - create an empty multi-dimensional list
output1 = [[],[]]
# Iterate through the rows in the csv file and append the data to the empty list
# Feed first csv.reader
csv_reader1 = csv.reader(results[0].splitlines(), delimiter=',')
for row in csv_reader1:
    output1.append(row)
df = pd.DataFrame(output1)
# remove first 7 rows of data (the start position of the slice is always included)
df = df.iloc[7:]
# Convert all data from string to float
df = df.astype(float)
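The key point of the update is that csv.reader expects an iterable of lines, not one big string. Here is a small sketch of just that idea, using a made-up in-memory string rather than the real file, and assuming the groups are separated by exactly two empty lines with "\n" line endings:

```python
import csv

# Hypothetical data: two groups separated by two empty lines
text = "a,b\n1,2\n\n\nc,d\n3,4\n"

groups = []
for block in text.split("\n\n\n"):
    # splitlines() hands csv.reader one line per item; passing the raw
    # string would make it iterate character by character instead
    rows = list(csv.reader(block.splitlines()))
    groups.append(rows)

print(groups)
```

Each entry in groups is then a plain list of rows that can be passed to pd.DataFrame or processed directly.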
If your row counts are inconsistent across groups, you'll need a little state machine to check when you're between groups and do something with the last group.
#!/usr/bin/env python3
import csv
def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as out_f:
        csv.writer(out_f).writerows(group)
with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    group_i = 1
    group = []
    last_row = []
    for row in reader:
        if row == [] and last_row == [] and group != []:
            write_group(group, group_i)
            group = []
            group_i += 1
            continue
        if row == []:
            last_row = row
            continue
        group.append(row)
        last_row = row
    # flush remaining group
    if group != []:
        write_group(group, group_i)
I mocked up this sample CSV (groups are separated by two empty lines):
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3


g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3


g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
And when I run the program above I get three CSV files:
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
group_2.csv
g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3
group_3.csv
g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
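A more compact variation on the same idea uses itertools.groupby to split on runs of empty rows. Note the difference in behavior: this sketch treats any run of empty rows (even a single one) as a separator, whereas the state machine above waits for two in a row. The helper name split_groups and the sample data are made up for illustration:

```python
import csv
import io
from itertools import groupby


def split_groups(rows):
    """Yield lists of consecutive non-empty rows; empty rows act as separators."""
    # csv.reader returns [] for a blank line, and bool([]) is False
    for has_data, chunk in groupby(rows, key=bool):
        if has_data:
            yield list(chunk)


sample = "g1r1c1,g1r1c2\ng1r2c1,g1r2c2\n\n\ng2r1c1,g2r1c2\n"
groups = list(split_groups(csv.reader(io.StringIO(sample))))
print(len(groups))
```

Each group could then be handed to a writer function like write_group.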
If your row counts are consistent, you can do this with fairly vanilla Python or using the Pandas library.
Vanilla Python
Define your group size and the size of the break (in "rows") between groups.
Loop over all the rows adding each row to a group accumulator.
When the group accumulator reaches the pre-defined group size, do something with it, reset the accumulator, and then skip break-size rows.
Here, I'm writing each group to its own numbered file:
import csv
group_sz = 5
break_sz = 2
def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as f_out:
        csv.writer(f_out).writerows(group)
with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    group_i = 1
    group = []
    for row in reader:
        group.append(row)
        if len(group) == group_sz:
            write_group(group, group_i)
            group_i += 1
            group = []
            for _ in range(break_sz):
                try:
                    next(reader)
                except StopIteration:  # gracefully ignore an expected StopIteration (at the end of the file)
                    break
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
g1r4c1,g1r4c2,g1r4c3
g1r5c1,g1r5c2,g1r5c3
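The accumulate-and-skip loop can also be written with itertools.islice, which avoids the explicit StopIteration handling. This is a sketch rather than a drop-in replacement: unlike the loop above, it also yields a shorter final group if the file ends early. The name fixed_groups and the sample data are invented for the example:

```python
import csv
import io
from itertools import islice


def fixed_groups(reader, group_sz, break_sz):
    """Yield lists of up to group_sz rows, discarding break_sz rows after each."""
    while True:
        group = list(islice(reader, group_sz))
        if not group:
            return
        yield group
        # Consume the separator rows; islice stops quietly at end of file
        for _ in islice(reader, break_sz):
            pass


sample = "a,1\nb,2\n\n\nc,3\nd,4\n"
groups = list(fixed_groups(csv.reader(io.StringIO(sample)), group_sz=2, break_sz=2))
print(groups)
```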
With Pandas
I'm new to Pandas, and learning this as I go, but it looks like Pandas will automatically trim blank rows/records from a chunk of data.
With that in mind, all you need to do is specify the size of your group, and tell Pandas to read your CSV file in "iterator mode", where you can ask for a chunk (your group size) of records at a time:
import pandas as pd
group_sz = 5
with pd.read_csv("input.csv", header=None, iterator=True) as reader:
    i = 1
    while True:
        try:
            df = reader.get_chunk(group_sz)
        except StopIteration:
            break
        df.to_csv(f"group_{i}.csv")
        i += 1
Pandas adds an "ID" column and a default header when it writes out the CSV:
group_1.csv
,0,1,2
0,g1r1c1,g1r1c2,g1r1c3
1,g1r2c1,g1r2c2,g1r2c3
2,g1r3c1,g1r3c2,g1r3c3
3,g1r4c1,g1r4c2,g1r4c3
4,g1r5c1,g1r5c2,g1r5c3
Try this out with your output:
import pandas as pd
# csv file name to be read in
in_csv = 'input.csv'
# get the number of lines of the csv file to be read
with open(in_csv) as f:
    number_lines = sum(1 for row in f)
# number of rows of data to write to each csv;
# you can change the row size according to your needs
rowsize = 500
# start looping through the data, writing it to a new file for each set
# (starting the range at 1 skips the first line of the input)
for i in range(1, number_lines, rowsize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=rowsize,  # number of rows to read at each loop
                     skiprows=i)  # skip rows that have been read
    # csv to write data to a new file with an indexed name: input1.csv etc.
    out_csv = 'input' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a',  # append data to csv file
              )
I updated the question with the last details that answer my question.

Python read columns until end of line

I have to read all columns from a CSV starting from the fourth column, for example from row[4] until the end of the line. Any ideas how I can achieve that?
for col in range(4,row):
    covid.addCell(row[col])
csv.reader returns rows as lists, so you can slice each row to get a list made of the columns from the fourth to the final column:
import csv
with open('mycsv.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        chunk = row[3:]
        # do something with chunk (which is a list)
What have you tried so far? Where is your code/attempt?
import pandas as pd
df = pd.read_csv("XXX.csv")
df_selected = df.iloc[ :, 3: ] # from 4th column, reminder: in python we count from 0

Delete specific rows and columns from csv using Python in one step

I have a CSV file where I need to delete the second and third rows and the 3rd to 18th columns. I was able to get it to work in two steps, which produced an interim file. I am thinking that there must be a better and more compact way to do this. Any suggestions would be really appreciated.
Also, if I want to remove multiple ranges of columns, how do I specify that in this code? For example, if I want to remove columns 25 to 29, in addition to columns 3 to 18 already specified, how would I add that to the code? Thanks
import csv
remove_from = 2
remove_to = 17
with open('file_a.csv', newline='') as infile, open('interim.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        del row[remove_from:remove_to]
        writer.writerow(row)
with open('interim.csv', newline='') as infile, open('file_b.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(next(reader))
    next(reader)  # skip the second row
    next(reader)  # skip the third row
    for row in reader:
        writer.writerow(row)
Here is a pandas approach:
Step 1, creating a sample dataframe
import numpy as np
import pandas as pd
# Create sample CSV-file (100x100)
df = pd.DataFrame(np.arange(10000).reshape(100,100))
df.to_csv('test.csv', index=False)
Step 2, doing the magic
import pandas as pd
import numpy as np
# Read only the header row to determine the number of columns
size = pd.read_csv('test.csv',nrows=0).shape[1]
# We want to remove columns 25 to 29, in addition to columns 3 to 18 already specified.
# So let's create an array of all column positions and delete those ranges from it
ranges = np.r_[3:19,25:30]
ar = np.delete(np.arange(size),ranges)
# Now let's read the dataframe
# and also skip rows 2 and 3 (skiprows takes 0-based line numbers; the header is line 0)
df = pd.read_csv('test.csv', skiprows=[2,3], usecols=ar)
# And output
df.to_csv('output.csv', index=False)
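If pandas is not available, rows and several column ranges can also be dropped in one pass with the csv module alone, with no interim file. This is a minimal sketch; drop_rows_and_columns and its parameters are hypothetical names, indexes are 0-based, and column ranges are end-exclusive like Python slices:

```python
import csv
import io


def drop_rows_and_columns(infile, outfile, skip_rows, column_ranges):
    """Copy a CSV, omitting the 0-based row indexes in skip_rows and the
    0-based, end-exclusive column ranges in column_ranges."""
    drop_cols = set()
    for start, stop in column_ranges:
        drop_cols.update(range(start, stop))
    writer = csv.writer(outfile, lineterminator="\n")
    for i, row in enumerate(csv.reader(infile)):
        if i in skip_rows:
            continue
        writer.writerow([v for j, v in enumerate(row) if j not in drop_cols])


# Tiny demo: drop the 2nd and 3rd rows, plus the 2nd-3rd and 5th columns
sample = "a,b,c,d,e\n1,2,3,4,5\n6,7,8,9,10\n11,12,13,14,15\n"
out = io.StringIO()
drop_rows_and_columns(io.StringIO(sample), out, skip_rows={1, 2},
                      column_ranges=[(1, 3), (4, 5)])
print(out.getvalue())
```

For the case in the question (rows 2 and 3, columns 3 to 18 and 25 to 29, counted from 1), the arguments would be skip_rows={1, 2} and column_ranges=[(2, 18), (24, 29)].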

How to perform a simple calculation in a CSV and append the results to the file

I have a CSV which contains 38 columns of data. All I want to find out is how to divide column 11 by column 38 and append the result to the end of each row, skipping the title row of the CSV (row 1).
If I can get a snippet of code that does this, I will be able to adapt the same code to perform lots of similar functions.
My attempt involved editing some code that was designed for something else.
See below:
from collections import defaultdict
class_col = 11
data_col = 38
# Read in the data
with open('test.csv', 'r') as f:
    # if you have a header on the file
    # header = f.readline().strip().split(',')
    data = [line.strip().split(',') for line in f]
# Append the relevant sum to the end of each row
for row in xrange(len(data)):
    data[row].append(int(class_col)/int(data_col))
# Write the results to a new csv file
with open('testMODIFIED2.csv', 'w') as nf:
    nf.write('\n'.join(','.join(row) for row in data))
Any help will be greatly appreciated. Thanks SMNALLY
import csv
with open('test.csv', newline='') as old_csv:
    csv_reader = csv.reader(old_csv)
    with open('testMODIFIED2.csv', 'w', newline='') as new_csv:
        csv_writer = csv.writer(new_csv)
        for i, row in enumerate(csv_reader):
            if i != 0:  # leave the title row unchanged
                # columns 11 and 38 are indexes 10 and 37
                row.append(float(row[10]) / float(row[37]))
            csv_writer.writerow(row)
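Real data often contains blanks or zeros in the divisor column, which would make the loop above raise an exception. One possible way to guard against that is sketched below, assuming an empty cell is an acceptable placeholder for rows that cannot be computed. The function name append_ratio, the "ratio" column name, and the sample data are all invented for the example:

```python
import csv
import io


def append_ratio(infile, outfile, num_col, den_col, name="ratio"):
    """Append row[num_col] / row[den_col] to each data row (0-based indexes).
    The header gets an extra column name; bad or zero values give an empty cell."""
    reader = csv.reader(infile)
    writer = csv.writer(outfile, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + [name])
    for row in reader:
        try:
            ratio = float(row[num_col]) / float(row[den_col])
        except (ValueError, ZeroDivisionError, IndexError):
            ratio = ""
        writer.writerow(row + [ratio])


sample = "x,y\n6,3\n1,0\n"
out = io.StringIO()
append_ratio(io.StringIO(sample), out, num_col=0, den_col=1)
print(out.getvalue())
```

For the 38-column file in the question the call would use num_col=10 and den_col=37.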
Use pandas:
import pandas
df = pandas.read_csv('test.csv') #assumes header row exists
df['FRACTION'] = 1.0*df['CLASS']/df['DATA'] #by default new columns are appended to the end
df.to_csv('out.csv')

how to write a matrix to a csv file in python with adding static headers in first row and first column?

I have a matrix which is generated after running a correlation: mat = Statistics.corr(result, method="pearson"). Now I want to write this matrix to a CSV file, but I want to add headers to the first row and first column of the file so that the output looks like this:
index,col1,col2,col3,col4,col5,col6
col1,1,0.005744233,0.013118052,-0.003772589,0.004284689
col2,0.005744233,1,-0.013269414,-0.007132092,0.013950261
col3,0.013118052,-0.013269414,1,-0.014029249,-0.00199437
col4,-0.003772589,-0.007132092,-0.014029249,1,0.022569309
col5,0.004284689,0.013950261,-0.00199437,0.022569309,1
I have a list which contains the column names: colmn = ['col1','col2','col3','col4','col5','col6']. The word "index" in the format above is a static string that labels the column of row names. I wrote this code, but it only adds the header to the first row; I am unable to get the header into the first column as well:
import csv
with open("file1", "w", newline="") as f:
    writer = csv.writer(f,delimiter=",")
    writer.writerow(['col1','col2','col3','col4','col5','col6'])
    writer.writerows(mat)
How can I write the matrix to a CSV file with static headers in both the first row and the first column?
You could use pandas. DataFrame.to_csv() defaults to writing both the column headers and the index.
import pandas as pd
headers = ['col1','col2','col3','col4','col5','col6']
df = pd.DataFrame(mat, columns=headers, index=headers)
df.to_csv('file1')
If on the other hand this is not an option, you can add your index with a little help from enumerate:
import csv
with open("file1", "w", newline="") as f:
    writer = csv.writer(f,delimiter=",")
    headers = ['col1','col2','col3','col4','col5','col6']
    writer.writerow(['index'] + headers)
    # If your mat is already a python list of lists, you can skip wrapping
    # the rows with list()
    writer.writerows(headers[i:i+1] + list(row) for i, row in enumerate(mat))
You can use a first variable to indicate the first line, write the header row before it, and then add each row name to the row as it is written:
import csv
cols = ["col1", "col2", "col3", "col4", "col5"]
with open("file1", "w", newline="") as f:
    writer = csv.writer(f)
    first = True
    for i, line in enumerate(mat):
        if first:
            writer.writerow(["Index"] + cols)
            first = False
        # write every data row, including the first one
        writer.writerow(["Row"+str(i)] + list(line))
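Another way to pair each row with its label is zip, which avoids both the index arithmetic and the first-line flag. A small sketch with a made-up 2x2 matrix; write_labelled_matrix is a hypothetical helper name, and the demo writes to an in-memory buffer instead of "file1":

```python
import csv
import io


def write_labelled_matrix(outfile, matrix, labels, corner="index"):
    """Write matrix as CSV with labels as both the header row and first column."""
    writer = csv.writer(outfile, lineterminator="\n")
    writer.writerow([corner] + list(labels))
    for label, row in zip(labels, matrix):
        # list(row) also copes with rows that are tuples or arrays
        writer.writerow([label] + list(row))


mat = [[1, 0.5], [0.5, 1]]
out = io.StringIO()
write_labelled_matrix(out, mat, ["col1", "col2"])
print(out.getvalue())
```

Note that zip stops at the shorter of the two sequences, so the labels list should have one entry per matrix row.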
