The analysis software I'm using outputs many groups of results in one CSV file and separates the groups with two empty lines.
I would like to break the results into groups so that I can then analyse them separately.
I'm sure there is a built-in function in Python (or one of its libraries) that does this. I tried this piece of code that I found somewhere, but it doesn't seem to work.
import csv
results = open('03_12_velocity_y.csv').read().split("\n\n")
# Feed first csv.reader
first_csv = csv.reader(results[0], delimiter=',')
# Feed second csv.reader
second_csv = csv.reader(results[1], delimiter=',')
Update:
The original code actually works, but my Python skills are pretty limited and I did not implement it properly.
The .split('\n\n\n') call does work, but csv.reader returns an iterator object, so to get the data into a list (or something similar) you have to iterate over all the rows and append them to the list.
I then used Pandas to remove the header and convert the values in scientific notation to float. Code is below. Thanks everyone for the help.
import csv
import pandas as pd
# Open the csv file, read it and split it when it encounters 2 empty lines (\n\n\n)
results = open('03_12_velocity_y.csv').read().split('\n\n\n')
# Create csv.reader objects that are used to iterate over rows in a csv file
# Define the output - create an empty multi-dimensional list
output1 = [[],[]]
# Iterate through the rows in the csv file and append the data to the empty list
# Feed first csv.reader
csv_reader1 = csv.reader(results[0].splitlines(), delimiter=',')
for row in csv_reader1:
    output1.append(row)
df = pd.DataFrame(output1)
# remove first 7 rows of data (the start position of the slice is always included)
df = df.iloc[7:]
# Convert all data from string to float
df = df.astype(float)
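The update above boils down to two steps: split the raw text on the blank-line separator, then run each chunk through csv.reader. A minimal sketch of that as a reusable helper follows. The function name split_groups and the sample text are hypothetical, and it assumes Unix line endings with exactly two empty lines between groups.

```python
import csv

def split_groups(text, separator="\n\n\n"):
    """Split CSV text into groups and parse each group into a list of rows."""
    groups = []
    for chunk in text.split(separator):
        # csv.reader needs an iterable of lines, not a single string
        rows = [row for row in csv.reader(chunk.splitlines()) if row]
        if rows:  # ignore empty chunks, e.g. from a trailing separator
            groups.append(rows)
    return groups

# Hypothetical sample: two groups separated by two empty lines
sample = "a,1.5e-3\nb,2.0e+1\n\n\nc,3\nd,4\n"
groups = split_groups(sample)
```

Python's float() accepts scientific notation directly, so float(groups[0][0][1]) would convert '1.5e-3' even without Pandas.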
If your row counts are inconsistent across groups, you'll need a little state machine to check when you're between groups and do something with the last group.
#!/usr/bin/env python3
import csv
def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as out_f:
        csv.writer(out_f).writerows(group)
with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    group_i = 1
    group = []
    last_row = []
    for row in reader:
        if row == [] and last_row == [] and group != []:
            write_group(group, group_i)
            group = []
            group_i += 1
            continue
        if row == []:
            last_row = row
            continue
        group.append(row)
        last_row = row
    # flush remaining group
    if group != []:
        write_group(group, group_i)
I mocked up this sample CSV:
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3


g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3


g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
And when I run the program above I get three CSV files:
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
group_2.csv
g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3
group_3.csv
g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
If your row counts are consistent, you can do this with fairly vanilla Python or using the Pandas library.
Vanilla Python
Define your group size and the size of the break (in "rows") between groups.
Loop over all the rows adding each row to a group accumulator.
When the group accumulator reaches the pre-defined group size, do something with it, reset the accumulator, and then skip break-size rows.
Here, I'm writing each group to its own numbered file:
import csv
group_sz = 5
break_sz = 2
def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as f_out:
        csv.writer(f_out).writerows(group)
with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    group_i = 1
    group = []
    for row in reader:
        group.append(row)
        if len(group) == group_sz:
            write_group(group, group_i)
            group_i += 1
            group = []
            for _ in range(break_sz):
                try:
                    next(reader)
                except StopIteration:  # gracefully ignore an expected StopIteration (at the end of the file)
                    break
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
g1r4c1,g1r4c2,g1r4c3
g1r5c1,g1r5c2,g1r5c3
With Pandas
I'm new to Pandas, and learning this as I go, but it looks like Pandas will automatically trim blank rows/records from a chunk of data (read_csv skips blank lines by default, skip_blank_lines=True).
With that in mind, all you need to do is specify the size of your group, and tell Pandas to read your CSV file in "iterator mode", where you can ask for a chunk (your group size) of records at a time:
import pandas as pd
group_sz = 5
with pd.read_csv("input.csv", header=None, iterator=True) as reader:
    i = 1
    while True:
        try:
            df = reader.get_chunk(group_sz)
        except StopIteration:
            break
        df.to_csv(f"group_{i}.csv")
        i += 1
Pandas adds an "ID" (index) column and a default header when it writes out the CSV (pass index=False and header=False to to_csv to omit them):
group_1.csv
,0,1,2
0,g1r1c1,g1r1c2,g1r1c3
1,g1r2c1,g1r2c2,g1r2c3
2,g1r3c1,g1r3c2,g1r3c3
3,g1r4c1,g1r4c2,g1r4c3
4,g1r5c1,g1r5c2,g1r5c3
Try this out with your output:
import pandas as pd
# csv file name to be read in
in_csv = 'input.csv'
# get the number of lines of the csv file to be read
number_lines = sum(1 for row in (open(in_csv)))
# size of rows of data to write to the csv,
# you can change the row size according to your need
rowsize = 500
# start looping through data writing it to a new file for each set
for i in range(1, number_lines, rowsize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=rowsize,  # number of rows to read at each loop
                     skiprows=i)  # skip rows that have been read
    # csv to write data to a new file with indexed name. input_1.csv etc.
    out_csv = 'input' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a',  # append data to csv file
              )
I updated the question with the last details that answer my question.
So I have a CSV file with a column called content. However, the contents of that column look like JSON and therefore hold more columns. I would like to split these contents into multiple columns, or extract the final part after "value". See the picture below for an example of the file. Any ideas how to do this? I would prefer using Python. I don't have any experience with JSON.
Using pandas, you could do this in a simpler way.
EDIT updated to handle the single quotes:
import pandas as pd
import json
data = pd.read_csv('test.csv', delimiter="\n")["content"]
res = [json.loads(row.replace("'", '"')) for row in data]
result = pd.DataFrame(res)
result.head()
# Export result to CSV
result.to_csv("result.csv")
my csv:
result:
This script will create a new csv file with the 'value' added to the csv as an additional column
(make sure that the input_csv and output_csv are different filenames)
import csv
import json
input_csv = "data.csv"
output_csv = "data_updated.csv"
values = []
with open(input_csv) as f_in:
    dr = csv.DictReader(f_in)
    for row in dr:
        value = json.loads(row["content"].replace("'", '"'))["value"]
        values.append(value)
with open(input_csv) as f_in:
    with open(output_csv, "w+") as f_out:
        w = csv.writer(f_out, lineterminator="\n")
        r = csv.reader(f_in)
        all = []
        row = next(r)
        row.append("value")
        all.append(row)
        i = 0
        for row in r:
            row.append(values[i])
            all.append(row)
            i += 1
        w.writerows(all)
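One caveat with the replace("'", '"') trick used above: it breaks as soon as a value itself contains an apostrophe. If the content column really holds Python-style dict literals (an assumption, since the original file is not shown), ast.literal_eval from the standard library can parse them as they are. The helper name parse_content and the sample strings below are hypothetical.

```python
import ast
import json

def parse_content(text):
    """Parse a cell that holds either JSON or a Python-style dict literal."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax (single quotes, True/None)
        return ast.literal_eval(text)

# Hypothetical cells: one single-quoted with an apostrophe, one valid JSON
single_quoted = "{'type': 'text', 'value': \"it's here\"}"
double_quoted = '{"type": "text", "value": "plain"}'
values = [parse_content(cell)["value"] for cell in (single_quoted, double_quoted)]
```

In the script above, parse_content(row["content"])["value"] could stand in for the json.loads(...replace(...)) expression.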
I'm trying to create a script that selects some columns from a CSV file and saves them into another one (ideally specifying the column header).
This is the code I'm starting from, which copies all columns. How can I change it to copy just a selection of them?
# importing openpyxl module
import openpyxl as xl;
# opening the source excel file
filename ="C:\\Users\\...\\input.clv"
wb1 = xl.load_workbook(filename)
ws1 = wb1.worksheets[0]
# opening the destination excel file
filename1 ="C:\\Users\\...\\output.clv"
wb2 = xl.load_workbook(filename1)
ws2 = wb2.active
# calculate total number of rows and
# columns in source excel file
mr = ws1.max_row
mc = ws1.max_column
# copying the cell values from source
# excel file to destination excel file
for i in range(1, mr + 1):
    for j in range(1, mc + 1):
        # reading cell value from source excel file
        c = ws1.cell(row=i, column=j)
        # writing the read value to destination excel file
        ws2.cell(row=i, column=j).value = c.value
# saving the destination excel file
wb2.save(str(filename1))
Thank you in advance!
This is how I might do it using Python file reading/writing.
def readCsv(fileName):
    data = []
    myFile = open(fileName, "r")
    for line in myFile:
        lineList = line.split(",")
        lineList[len(lineList)-1] = lineList[len(lineList) - 1].replace("\n", "")
        data.append(lineList)
    myFile.close()
    return data
def writeCsv(data):
    dataString = ""
    for line in data:
        dataString = dataString + ','.join(line) + "\n"
    myNewFile = open("output.csv", "w")
    myNewFile.write(dataString)
    myNewFile.close()
data = readCsv("yourCsv.csv")
# Remove the data you don't need
writeCsv(dataAfterRemovingColumns)
My readCsv function produces a 2D List where each item is a list of the data in one row of the CSV file. So, where I have commented # Remove the data you don't need you would iterate through the 2D list, removing the item from each row that makes up part of the column you want to delete. Hope that makes sense!
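The "remove the data you don't need" step described above could look roughly like this. It is only a sketch: the helper name drop_columns and the sample rows are made up for illustration, and columns are identified by zero-based position.

```python
def drop_columns(data, columns_to_drop):
    """Return a copy of a 2D list without the given column indexes."""
    drop = set(columns_to_drop)
    return [
        [cell for index, cell in enumerate(row) if index not in drop]
        for row in data
    ]

# Hypothetical data in the shape readCsv returns: one list per row
data = [
    ["id", "name", "email"],
    ["1", "Ann", "ann@example.com"],
    ["2", "Bob", "bob@example.com"],
]
dataAfterRemovingColumns = drop_columns(data, [1])
```

Building new rows instead of deleting items in place leaves the original list untouched, which avoids index shifts when several columns are dropped at once.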
You can use the csv module from the standard library:
#!/usr/bin/env python
import csv
inputCsvFilePath = 'input.csv'
outputCsvFilePath = 'output.csv'
inputCsvColumnNumbers = [1,3,5]
outputCsvColumnHeaders = ['one', 'three', 'five']
# reading/writing row by row (high IO, low memory):
with open(inputCsvFilePath, newline='') as inputCsv:
    inputCsvReader = csv.reader(inputCsv)
    with open(outputCsvFilePath, 'w', newline='') as outputCsv:
        outputCsvWriter = csv.writer(outputCsv)
        # write custom csv header:
        outputCsvWriter.writerow(outputCsvColumnHeaders)
        # skip input file header:
        next(inputCsvReader)
        for inputRow in inputCsvReader:
            outputCsvWriter.writerow([inputRow[i] for i in inputCsvColumnNumbers])
Personally, I'd use sqlite for that:
#!/bin/bash
sqlite3 <<EOF
-- input:
.separator ',' "\n"
.import 'input.csv' inputData
-- output:
.mode csv
.header on
.once 'output.csv'
select
user_id as "one"
, login_id as "three"
, password as "five"
from inputData
;
EOF
Since you want to save some columns from one CSV file to another, you can use the pandas library as follows:
import pandas as pd
def save_csv(df, path, cols):
    df[cols].to_csv(path, index=False)
with open('path/to/csv', 'r') as f:
    df = pd.read_csv(f)
# Assuming you want to save columns colA and colB
save_csv(df, 'path/to/dest/csv', ['colA', 'colB'])
You can also use csv DictReader, DictWriter as another approach which is longer in code, but faster in time (based on my timing):
import csv
def use_csv():
    def new_dict(d, cols):
        new_dict = {}
        for col in cols:
            new_dict[col] = d[col]
        return new_dict
    with open('path/to/csv', 'r') as f:
        df = csv.DictReader(f)
        with open('path/to/dest/csv', 'w') as csvfile:
            fieldnames = ['colA', 'colB']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            for row in df:
                data = new_dict(row, fieldnames)
                writer.writerow(data)
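As a possible shortcut for the DictReader/DictWriter approach, csv.DictWriter accepts extrasaction="ignore", which drops any keys that are not listed in fieldnames, so a filtering helper like new_dict becomes optional. The sketch below uses in-memory text instead of real files. The function name select_columns and the sample data are hypothetical.

```python
import csv
import io

def select_columns(in_file, out_file, fieldnames):
    """Copy only the named columns from in_file to out_file."""
    reader = csv.DictReader(in_file)
    # extrasaction="ignore" silently drops keys that are not in fieldnames
    writer = csv.DictWriter(out_file, fieldnames=fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)

# Hypothetical three-column input held in memory
source = io.StringIO("colA,colB,colC\n1,2,3\n4,5,6\n")
target = io.StringIO()
select_columns(source, target, ["colA", "colB"])
result = target.getvalue()
```

With real files, the two StringIO objects would be replaced by open(..., newline='') handles.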
I have my data in this form
and the required form of data is
Can anybody help me in this regard?
The content of the initial CSV file as text is:
var1,var2,col1,col2,col3
a,f,1,2,3
b,g,4,5,6
c,h,7,8,9
d,i,10,11,12
You can do it directly with the csv module. You just read from the initial file, and write up to 3 rows per initial row into the resulting file:
import csv
with open('in.csv') as fdin, open('out.csv', 'w', newline='') as fdout:
    rd = csv.reader(fdin)
    wr = csv.writer(fdout)
    header = next(rd)  # read and process header
    _ = wr.writerow(header[:2] + ['columns', ''])
    for row in rd:  # loop on rows
        for i in range(3):  # loop on the 3 columns
            try:
                row2 = row[:2] + ['col{}'.format(i+1), row[2 + i]]
                _ = wr.writerow(row2)
            except IndexError:  # prevent error on shorter line
                break
If you intend to do heavy data processing, you should consider using the Pandas module.
With the data sample, it gives:
var1,var2,columns,
a,f,col1,1
a,f,col2,2
a,f,col3,3
b,g,col1,4
b,g,col2,5
b,g,col3,6
c,h,col1,7
c,h,col2,8
c,h,col3,9
d,i,col1,10
d,i,col2,11
d,i,col3,12
I have a CSV which contains 38 columns of data. All I want to find out is how to divide column 11 by column 38 and append the result to the end of each row, skipping the title row of the CSV (row 1).
If I am able to get a snippet of code that can do this, I will be able to manipulate the same code to perform lots of similar functions.
My attempt involved editing some code that was designed for something else.
See below:
from collections import defaultdict
class_col = 11
data_col = 38
# Read in the data
with open('test.csv', 'r') as f:
    # if you have a header on the file
    # header = f.readline().strip().split(',')
    data = [line.strip().split(',') for line in f]
# Append the relevant sum to the end of each row
for row in xrange(len(data)):
    data[row].append(int(class_col)/int(data_col))
# Write the results to a new csv file
with open('testMODIFIED2.csv', 'w') as nf:
    nf.write('\n'.join(','.join(row) for row in data))
Any help will be greatly appreciated. Thanks SMNALLY
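import csv
# Python 3: open CSV files in text mode with newline='' (the 'rb'/'wb' modes only work on Python 2)
with open('test.csv', newline='') as old_csv:
    csv_reader = csv.reader(old_csv)
    with open('testMODIFIED2.csv', 'w', newline='') as new_csv:
        csv_writer = csv.writer(new_csv)
        for i, row in enumerate(csv_reader):
            if i != 0:  # leave the title row unchanged
                row.append(float(row[10]) / float(row[37]))
            csv_writer.writerow(row)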
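Since the goal is a snippet that can be adapted to similar calculations, the same loop can be wrapped in a function that takes the column positions as parameters. This is a sketch under a few assumptions that are not in the question: the function name append_ratio is made up, the new column gets a "RATIO" label in the title row, and a zero denominator produces an empty cell instead of an error.

```python
import csv
import io

def append_ratio(in_file, out_file, numerator_col, denominator_col, header="RATIO"):
    """Append row[numerator_col] / row[denominator_col] to every data row."""
    reader = csv.reader(in_file)
    writer = csv.writer(out_file, lineterminator="\n")
    for i, row in enumerate(reader):
        if i == 0:
            row.append(header)  # label the new column in the title row
        else:
            denominator = float(row[denominator_col])
            # Leave the cell empty rather than crash on a zero denominator
            row.append(float(row[numerator_col]) / denominator if denominator else "")
        writer.writerow(row)

# Hypothetical two-column input held in memory
source = io.StringIO("a,b\n6,3\n1,0\n")
target = io.StringIO()
append_ratio(source, target, 0, 1)
result = target.getvalue()
```

For the file in the question, the zero-based positions would be 10 and 37, passed together with two file handles opened with newline=''.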
Use pandas:
import pandas
df = pandas.read_csv('test.csv') #assumes header row exists
df['FRACTION'] = 1.0*df['CLASS']/df['DATA'] #by default new columns are appended to the end
df.to_csv('out.csv')