How can I select columns by name in Python

I have hundreds of CSV files. They do not necessarily share the same headers, as below.
CSV1:
G,B,C,D
1,2,3,4
2,4,5,6
CSV2:
A,C,D
1,2,6
2,5,7
I'd like to select each column by its name, like this:
if the file has column A: select that column
else: skip to the next required column (which could be B) and repeat the same process for each file until all the required columns have been referenced. I'd really appreciate any help with this.

Call the following function with the desired column name and it returns a list of all values belonging to that column:
import csv

file = 'c:\\temp\\test.csv'

def GetValuesFromColumn(title):
    values = []
    rownum = 0
    with open(file, 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            if rownum == 0:
                index = row.index(title)
                rownum = 1
            else:
                values.append(row[index])
    return values

values = GetValuesFromColumn('D')

Solution 1: use the csv module's DictReader.
Solution 2: if your data is really numerical, as in your example, you could use numpy.genfromtxt to produce structured arrays: http://docs.scipy.org/doc/numpy/user/basics.rec.html
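For instance, a minimal genfromtxt sketch, assuming the CSV1.csv layout from the question (names=True turns the header row into the field names of a structured array):
import numpy as np

data = np.genfromtxt('CSV1.csv', delimiter=',', names=True)
print(data['D'])                     # -> [4. 6.]
if 'A' in (data.dtype.names or ()):  # is the column present in this file?
    print(data['A'])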

If you'd like a pandas approach, here's an option. It opens each file, grabs one row, checks whether any of the column names are in the desired list, and if so reads the CSV into a pandas DataFrame.
example data:
import pandas as pd

df = pd.DataFrame([(2014, 30, 15), (2015, 10, 20), (2007, 5, 3)])
df.columns = ['year', 'a', 'b']
df.set_index('year', inplace=True)
df.to_csv('tst.csv')
df.columns = ['c', 'z']
df.to_csv('tst2.csv')
Do the work:
import glob
import pandas as pd

wanted = ['year', 'a', 'z']
path = '.'
allFiles = glob.glob(path + "/*.csv")
for file in allFiles:
    # grab only one row for testing
    df = pd.read_csv(file, nrows=1)
    includedCols = []
    for x in wanted:
        if x in df.columns:
            includedCols.append(x)
    if len(includedCols) > 0:
        df = pd.read_csv(file, usecols=includedCols)
        print(df)
        ## do something with df here
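As a side note, usecols also accepts a callable that is evaluated against each header name, which collapses the probe-then-read into a single pass; a sketch under the same wanted list:
import glob
import pandas as pd

wanted = ['year', 'a', 'z']
for file in glob.glob('./*.csv'):
    # only columns whose name is in wanted are parsed
    df = pd.read_csv(file, usecols=lambda c: c in wanted)
    if len(df.columns) == 0:  # no wanted columns in this file
        continue
    print(df)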

Use csv.DictReader()
Create an object which operates like a regular reader but maps the information read into a dict whose keys are given by the optional fieldnames parameter.
If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
For example
import csv

def get_values_for_column(csvfile, col):
    with open(csvfile, 'r', newline='') as f:
        reader = csv.DictReader(f)
        values = [row[col] for row in reader]
    return values

# Usage
>>> get_values_for_column('CSV1.csv', 'D')
['4', '6']  # csv hands back strings
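If you need numbers rather than strings, convert the result, e.g. [int(v) for v in get_values_for_column('CSV1.csv', 'D')].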
Based on the answers from both Alan and Aldervan

Related

Read csv file with empty lines

Analysis software I'm using outputs many groups of results in one CSV file and separates the groups with two empty lines.
I would like to break the results into groups so that I can analyse them separately.
I'm sure there is a built-in function in Python (or one of its libraries) that does this. I tried this piece of code that I found somewhere, but it doesn't seem to work:
import csv
results = open('03_12_velocity_y.csv').read().split("\n\n")
# Feed first csv.reader
first_csv = csv.reader(results[0], delimiter=',')
# Feed second csv.reader
second_csv = csv.reader(results[1], delimiter=',')
Update:
The original code actually works, but my python skills are pretty limited and I did not implement it properly.
The .split('\n\n\n') method does work, but csv.reader returns an iterator object; to get the data into a list (or something similar), you need to iterate through all the rows and append them to the list.
I then used pandas to remove the header and convert the scientific-notation values to float. Code is below. Thanks everyone for the help.
import csv
import pandas as pd

# Open the csv file, read it and split it when it encounters 2 empty lines (\n\n\n)
results = open('03_12_velocity_y.csv').read().split('\n\n\n')
# Create csv.reader objects that are used to iterate over rows in a csv file
# Define the output - create an empty multi-dimensional list
output1 = [[], []]
# Iterate through the rows in the csv file and append the data to the empty list
# Feed first csv.reader
csv_reader1 = csv.reader(results[0].splitlines(), delimiter=',')
for row in csv_reader1:
    output1.append(row)
df = pd.DataFrame(output1)
# remove first 7 rows of data (the start position of the slice is always included)
df = df.iloc[7:]
# Convert all data from string to float
df = df.astype(float)
If your row counts are inconsistent across groups, you'll need a little state machine to check when you're between groups and do something with the last group.
#!/usr/bin/env python3
import csv

def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as out_f:
        csv.writer(out_f).writerows(group)

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    group_i = 1
    group = []
    last_row = []
    for row in reader:
        if row == [] and last_row == [] and group != []:
            write_group(group, group_i)
            group = []
            group_i += 1
            continue
        if row == []:
            last_row = row
            continue
        group.append(row)
        last_row = row
    # flush remaining group
    if group != []:
        write_group(group, group_i)
I mocked up this sample CSV, with two empty lines between groups:
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3


g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3


g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
And when I run the program above I get three CSV files:
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
group_2.csv
g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3
group_3.csv
g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
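As a follow-up, itertools.groupby can replace the hand-rolled state machine: grouping consecutive rows by truthiness splits on any run of blank lines. A sketch over the same input.csv:
import csv
from itertools import groupby

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    # an empty row ([]) is falsy; keep only the runs of non-empty rows
    groups = [list(rows) for nonblank, rows in groupby(reader, key=bool) if nonblank]

for i, group in enumerate(groups, start=1):
    with open(f"group_{i}.csv", "w", newline="") as out_f:
        csv.writer(out_f).writerows(group)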
If your row counts are consistent, you can do this with fairly vanilla Python or using the Pandas library.
Vanilla Python
Define your group size and the size of the break (in "rows") between groups.
Loop over all the rows adding each row to a group accumulator.
When the group accumulator reaches the pre-defined group size, do something with it, reset the accumulator, and then skip break-size rows.
Here, I'm writing each group to its own numbered file:
import csv

group_sz = 5
break_sz = 2

def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as f_out:
        csv.writer(f_out).writerows(group)

with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    group_i = 1
    group = []
    for row in reader:
        group.append(row)
        if len(group) == group_sz:
            write_group(group, group_i)
            group_i += 1
            group = []
            for _ in range(break_sz):
                try:
                    next(reader)
                except StopIteration:  # gracefully ignore an expected StopIteration (at the end of the file)
                    break
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
g1r4c1,g1r4c2,g1r4c3
g1r5c1,g1r5c2,g1r5c3
With Pandas
I'm new to pandas and learning this as I go, but it looks like pandas will automatically skip blank rows/records when reading a chunk of data.
With that in mind, all you need to do is specify the size of your group and tell pandas to read your CSV file in "iterator mode", where you can ask for a chunk (your group size) of records at a time:
import pandas as pd

group_sz = 5
with pd.read_csv("input.csv", header=None, iterator=True) as reader:
    i = 1
    while True:
        try:
            df = reader.get_chunk(group_sz)
        except StopIteration:
            break
        df.to_csv(f"group_{i}.csv")
        i += 1
Pandas adds an index column and a default header when it writes out the CSV:
group_1.csv
,0,1,2
0,g1r1c1,g1r1c2,g1r1c3
1,g1r2c1,g1r2c2,g1r2c3
2,g1r3c1,g1r3c2,g1r3c3
3,g1r4c1,g1r4c2,g1r4c3
4,g1r5c1,g1r5c2,g1r5c3
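If you don't want the added index column or the default 0,1,2 header, both can be switched off when writing:
df.to_csv(f"group_{i}.csv", index=False, header=False)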
Try this out with your data:
import pandas as pd

# csv file name to be read in
in_csv = 'input.csv'
# get the number of lines of the csv file to be read
number_lines = sum(1 for row in open(in_csv))
# size of rows of data to write to the csv,
# you can change the row size according to your need
rowsize = 500
# start looping through data, writing it to a new file for each set
for i in range(1, number_lines, rowsize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=rowsize,  # number of rows to read at each loop
                     skiprows=i)     # skip rows that have been read
    # csv to write data to a new file with indexed name. input_1.csv etc.
    out_csv = 'input' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a')  # append data to csv file
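As an aside, pandas' chunksize parameter achieves the same splitting without re-scanning the file from the top on every iteration; a minimal sketch (the input_{i}.csv naming is just illustrative):
import pandas as pd

# each chunk arrives as a DataFrame of up to 500 rows, read sequentially
for i, chunk in enumerate(pd.read_csv('input.csv', header=None, chunksize=500), start=1):
    chunk.to_csv(f'input_{i}.csv', index=False, header=False)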
I updated the question with the final details that answer it.

I need to read a CSV with an unknown number of columns and then write the data to a CSV with a set number of columns

So I have a file that looks like this:
name,number,email,job1,job2,job3,job4
I need to convert it to one that looks like this:
name,number,email,job1
name,number,email,job2
name,number,email,job3
name,number,email,job4
How would I do this in Python?
As said in a comment, you can use pandas to read, write and manipulate CSV files.
Here is one example of how you can solve your problem with pandas in Python:
import pandas as pd
# df = pd.read_csv("filename.csv") # read csv file from disk
# comment out the line below when reading from disk
df = pd.DataFrame([['ss','0152','ss#','student','others']],columns=['name','number','email','job1','job2'])
print(df)
The output of this line is:
name number email job1 job2
0 ss 0152 ss# student others
Now we need to know how many columns there are:
x = len(df.columns)
print(x)
It will store the number of columns in x:
5
Now let's create an empty DataFrame with columns=['name', 'number', 'email', 'job']:
c = pd.DataFrame(columns=['name','number','email','job'])
print(c)
output:
Columns: [name, number, email, job]
Index: []
Now we loop from 3 to the end of the columns and concat each slice with our empty dataframe:
for i in range(3, x):
    df1 = df.iloc[:, 0:3].copy()  # take the first 3 columns
    df2 = df.iloc[:, [i]].copy()  # take the ith column
    df1['job'] = df2              # add the ith column to df1
    c = pd.concat([df1, c])       # concat df1 and c
print(c)
output:
name number email job
0 ss 0152 ss# others
0 ss 0152 ss# student
DataFrame c has your desired output. Now you can save it using
c.to_csv('output.csv')
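As an aside, pandas' melt was built for exactly this wide-to-long reshape; a sketch using the same example frame (melt adds a 'variable' column recording which jobN each value came from, which we drop):
import pandas as pd

df = pd.DataFrame([['ss', '0152', 'ss#', 'student', 'others']],
                  columns=['name', 'number', 'email', 'job1', 'job2'])
long_df = (df.melt(id_vars=['name', 'number', 'email'], value_name='job')
             .drop(columns='variable'))
print(long_df)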
Let's assume this is the dataframe:
import pandas as pd

df = pd.DataFrame(columns=['name', 'number', 'email', 'job1', 'job2', 'job3', 'job4'])
df.loc[0] = ['jon', 123, 'smth#smth.smth', 'a', 'b', 'c', 'd']
We define a new dataframe:
new_df = pd.DataFrame(columns=['name', 'number', 'email', 'job'])
Now, we loop over the old one to split it based on the jobs. I assume you have 4 jobs to split. (DataFrame.append was removed in pandas 2.0, so the rows are collected in a list and turned into a frame at the end.)
rows = []
for i, row in df.iterrows():
    for job in range(1, 5):
        job_col = "job" + str(job)
        rows.append({'name': row['name'], 'number': row['number'],
                     'email': row['email'], 'job': row[job_col]})
new_df = pd.DataFrame(rows, columns=new_df.columns)
You can use the csv module and Python's unpacking syntax to get the data from the input file and write it to the output file.
import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    # Skip header row, if necessary
    next(reader)
    # Use sequence unpacking to get the fixed variables and an
    # arbitrary number of "jobs".
    for name, number, email, *jobs in reader:
        for job in jobs:
            writer.writerow([name, number, email, job])
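Note that the *jobs extended unpacking requires Python 3; a row containing only the three fixed fields yields an empty jobs list and simply produces no output rows.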
Below is a plain-Python take:
with open('input.csv') as f_in:
    lines = [l.strip() for l in f_in.readlines()]

with open('output.csv', 'w') as f_out:
    for idx, line in enumerate(lines):
        if idx > 0:
            fields = line.split(',')
            for i in range(3, len(fields)):
                f_out.write(','.join(fields[:3]) + ',' + fields[i] + '\n')
input.csv
header row
name,number,email,job1,job2,job3,job4
name1,number1,email1,job11,job21,job31,job41
output.csv
name,number,email,job1
name,number,email,job2
name,number,email,job3
name,number,email,job4
name1,number1,email1,job11
name1,number1,email1,job21
name1,number1,email1,job31
name1,number1,email1,job41

How do I write a CSV based on comparing two CSV files (column-based)?

I have two CSV files:
csv1
csv2
(note: headers can differ)
csv1 has a single key column and csv2 has 5 columns.
Column 1 of csv1 has some values matching column 2 of csv2.
My concern is how to write a CSV of the rows where column 2 of csv2 has NO MATCHING VALUE in column 1 of csv1.
I have attached the three files: csv1, csv2 and the expected output.
Expected Output:
ProfileID,id,name,class ,rollnumber
1,lkha,prince,sfasd,DAS
2,hgfhfk,kabir,AD,AD
5,jlh,antriskh,ASDA,AD
CSV 1:
id,name
10927,prince
109582,kabir
f546416,rahul
g44674,saini
r7341,antriskh
CSV 2:
ProfileID,id,name,class ,rollnumber
1,lkha,prince,sfasd,DAS
2,hgfhfk,kabir,AD,AD
3,f546416,rahul,AD,FF
44,g44674,saini,DD,FF
5,jlh,antriskh,ASDA,AD
I tried converting them into dictionaries and matching csv1 keys to csv2 values, but it is not working as expected:
def read_csv1(filename):
    prj_structure = {}
    f = open(filename, "r")
    data = f.read()
    f.close()
    lst = data.split("\n")
    prj = ""
    for i in range(0, len(lst)):
        val = lst[i].split(",")
        if len(val) > 0:
            prj = val[0]
            if prj != "":
                if prj not in prj_structure.keys():
                    prj_structure[prj] = []
                prj_structure[prj].append([val[1], val[2], val[3], val[4]])
    return prj_structure

def read_csv2(filename):
    prj_structure = {}
    f = open(filename, "r")
    data = f.read()
    f.close()
    lst = data.split("\n")
    prj = ""
    for i in range(0, len(lst)):
        val = lst[i].split(",")
        if len(val) > 0:
            prj = val[0]
            if prj != "":
                if prj not in prj_structure.keys():
                    prj_structure[prj] = []
                prj_structure[prj].append([val[0]])
    return prj_structure

csv1_data = read_csv1("csv1.csv")
csv2_data = read_csv2("csv2.csv")
for k, v in csv1_data.items():
    for ks, vs in csv2_data.items():
        if k == vs[0][0]:
            # here it is not working
            sublist = []
            sublist.append(k)
Use the DictReader from the csv package.
import csv

f1 = open('csv1.csv')
csv_1 = csv.DictReader(f1)
f2 = open('csv2.csv')
csv_2 = csv.DictReader(f2)

first_dict = {}
for row in csv_1:
    first_dict[row['name']] = row
f1.close()

f_out = open('output.csv', 'w')
csv_out = csv.DictWriter(f_out, csv_2.fieldnames)
csv_out.writeheader()
for second_row in csv_2:
    if second_row['name'] in first_dict:
        first_row = first_dict[second_row['name']]
        if first_row['id'] != second_row['id']:
            csv_out.writerow(second_row)
f2.close()
f_out.close()
If you have the option, I have always found pandas a great tool for importing and manipulating CSV files.
import pandas as pd

# Read in both the CSV files
df_1 = pd.read_csv('csv1.csv')
df_2 = pd.read_csv('csv2.csv')

# Iterate over both DataFrames and if any ids in df_2 match
# df_1, remove those rows from df_2
for num1, row1 in df_1.iterrows():
    for num2, row2 in df_2.iterrows():
        if row1['id'] == row2['id']:
            df_2.drop(num2, inplace=True)
df_2.head()
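The nested iterrows loop is O(n*m); a vectorized equivalent using isin (assuming both files keep the 'id' column name) is usually preferable:
import pandas as pd

df_1 = pd.read_csv('csv1.csv')
df_2 = pd.read_csv('csv2.csv')
# keep only the rows of df_2 whose id does not appear anywhere in df_1
result = df_2[~df_2['id'].isin(df_1['id'])]
result.to_csv('output.csv', index=False)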
For any kind of CSV processing, the built-in csv module makes most of the error-prone work trivial. Given your example values, the following code should produce the desired results. I use comprehensions to do the filtering.
import csv
import io

# example data, as StringIO objects that behave like file objects
raw_csv_1 = io.StringIO('''\
id,name
10927,prince
109582,kabir
f546416,rahul
g44674,saini
r7341,antriskh''')

raw_csv_2 = io.StringIO('''\
ProfileID,id,name,class,rollnumber
1,lkha,prince,sfasd,DAS
2,hgfhfk,kabir,AD,AD
3,f546416,rahul,AD,FF
44,g44674,saini,DD,FF
5,jlh,antriskh,ASDA,AD''')

# with your actual data, you would use real file contents instead, like
# with open('location/of/your/csv_1') as file_1:
#     raw_csv_1 = io.StringIO(file_1.read())
# with open('location/of/your/csv_2') as file_2:
#     raw_csv_2 = io.StringIO(file_2.read())
Then we need to transform them into csv.reader objects and materialise the rows:
csv_1 = csv.reader(raw_csv_1)
next(csv_1)  # consume once to skip the header
csv_2 = csv.reader(raw_csv_2)
header = next(csv_2)  # consume once to skip the header, but store it
vals_1 = list(csv_1)
vals_2 = list(csv_2)
Last but not least, collect the ids of the first csv in a set to use as a lookup table, filter the second csv with it, and write the result back to your file system as 'result.csv':
skip_keys = {id_ for id_, name in vals_1}
result = [row for row in vals_2 if row[1] not in skip_keys]
# at this point, result contains
# [['1', 'lkha', 'prince', 'sfasd', 'DAS'],
#  ['2', 'hgfhfk', 'kabir', 'AD', 'AD'],
#  ['5', 'jlh', 'antriskh', 'ASDA', 'AD']]
with open('result.csv', 'w', newline='') as result_file:
    csv.writer(result_file).writerows([header] + result)

Mapping CSV Header using a Dictionary

I have a reference file that looks like this:
Experiment,Array,Drug
8983,Genechip,Famotidine
8878,Microarray,Dicyclomine
8988,Genechip,Etidronate
8981,Microarray,Flunarizine
I successfully created a dictionary mapping the Experiment numbers to the Drug name using the following:
import csv
reader = csv.reader(open(r'C:\Users\Troy\Documents\ExPSRef.txt'))
# Configure dictionary
result = {}
for row in reader:
    key = row[0]
    result[key] = row[2]
di = result
I want to map this dictionary to the header of another file which consists of the experiment number. It currently looks like this:
Gene,8988,8981,8878,8983
Vcp,0.011,-0.018,-0.032,-0.034
Ube2d2,0.034,0.225,-0.402,0.418
Becn1,0.145,-0.108,-0.421,-0.048
Lypla2,-0.146,-0.026,-0.101,-0.011
But it should look like this:
Gene,Etidronate,Flunarizine,Dicyclomine,Famotidine
Vcp,0.011,-0.018,-0.032,-0.034
Ube2d2,0.034,0.225,-0.402,0.418
Becn1,0.145,-0.108,-0.421,-0.048
Lypla2,-0.146,-0.026,-0.101,-0.011
I tried using:
import csv
import pandas as pd

reader = csv.reader(open(r'C:\Users\Troy\Documents\ExPSRef.txt'))
result = {}
for row in reader:
    key = row[0]
    result[key] = row[2]
di = result

df = pd.read_csv(r'C:\Users\Troy\Documents\ExPS2.txt')
df['row[0]'].replace(di, inplace=True)
but it returned a KeyError: 'row[0]'.
I tried the following as well, even transposing in order to merge:
import pandas as pd

df1 = pd.read_csv(r'C:\Users\Troy\Documents\ExPS2.txt').transpose()
df2 = pd.read_csv(r'C:\Users\Troy\Documents\ExPSRef.txt', delimiter=',', engine='python')
df3 = df1.merge(df2)
df4 = df3.set_index('Drug').drop(['Experiment', 'Array'], axis=1)
df4.index.name = 'Drug'
print(df4)
and this time received MergeError('No common columns to perform merge on').
Is there a simpler way to map my dictionary to the header that would work?
One thing to keep in mind is that the keys of the mapper dictionary and the header values they are mapped onto must be of the same data type.
Here, one is a string and the other an integer, so while reading the reference DF we keep pandas from inferring dtypes by setting dtype to str:
df1 = pd.read_csv(r'C:\Users\Troy\Documents\ExPS2.txt')               # Original
df2 = pd.read_csv(r'C:\Users\Troy\Documents\ExPSRef.txt', dtype=str)  # Reference
Convert the columns of the original DF to their series representation, then replace the old values (the Experiment numbers) with the new Drug names retrieved from the reference DF:
df1.columns = df1.columns.to_series().replace(df2.set_index('Experiment').Drug)
df1
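Equivalently, DataFrame.rename accepts a mapping, so the same lookup can be applied in a single call; a sketch reusing df1 and df2 from above:
mapping = df2.set_index('Experiment')['Drug'].to_dict()
df1 = df1.rename(columns=mapping)  # headers missing from the mapping are left untouched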
I used csv for the whole script. This fixes the header as you wanted and saves the result into a new file. The new filename can be replaced with the original one if that's what you prefer. This program is written for Python 3.
import csv

with open('sample.txt', 'r') as ref:
    reader = csv.reader(ref)
    # skip header line
    next(reader)
    # make dictionary
    di = dict([(row[0], row[2]) for row in reader])

data = []
with open('sample1.txt', 'r') as df:
    reader = csv.reader(df)
    header = next(reader)
    new_header = [header[0]] + [di[i] for i in header if i in di]
    data = list(reader)

# used to make a new file; can also be replaced with the same file name
with open('new_sample1.txt', 'w', newline='') as df_new:
    writer = csv.writer(df_new)
    writer.writerow(new_header)
    writer.writerows(data)

How to write a matrix to a CSV file in Python, adding static headers to the first row and first column?

I have a matrix which is generated after running a correlation - mat = Statistics.corr(result, method="pearson"). Now I want to write this matrix to a CSV file, but I want to add headers to the first row and first column of the file so that the output looks like this:
index,col1,col2,col3,col4,col5,col6
col1,1,0.005744233,0.013118052,-0.003772589,0.004284689
col2,0.005744233,1,-0.013269414,-0.007132092,0.013950261
col3,0.013118052,-0.013269414,1,-0.014029249,-0.00199437
col4,-0.003772589,-0.007132092,-0.014029249,1,0.022569309
col5,0.004284689,0.013950261,-0.00199437,0.022569309,1
I have a list which contains the column names - colmn = ['col1','col2','col3','col4','col5','col6']. The index in the above format is a static string to label the index column. I wrote this code, but it only adds the header to the first row; I am unable to get the header into the first column as well:
with open("file1", "wb") as f:
writer = csv.writer(f,delimiter=",")
writer.writerow(['col1','col2','col3','col4','col5','col6'])
writer.writerows(mat)
How can I write the matrix to a csv file with heading static headers to the first row and 1st column?
You could use pandas. DataFrame.to_csv() defaults to writing both the column headers and the index.
import pandas as pd
headers = ['col1','col2','col3','col4','col5','col6']
df = pd.DataFrame(mat, columns=headers, index=headers)
df.to_csv('file1')
If on the other hand this is not an option, you can add your index with a little help from enumerate:
with open("file1", "wb") as f:
writer = csv.writer(f,delimiter=",")
headers = ['col1','col2','col3','col4','col5','col6']
writer.writerow(['index'] + headers)
# If your mat is already a python list of lists, you can skip wrapping
# the rows with list()
writer.writerows(headers[i:i+1] + list(row) for i, row in enumerate(mat))
You can write the header row first, then prepend the matching column name to each row as it is written:
import csv

cols = ["col1", "col2", "col3", "col4", "col5", "col6"]
with open("file1", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index"] + cols)
    for i, line in enumerate(mat):
        writer.writerow([cols[i]] + list(line))
