Input file:
$ cat test.csv
company,spread,cat1,cat2,cat3
A,XYZ,32,67,0
B,XYZ,43,0,432
C,XYZ,32,76,32
D,XYZ,454,87,43
E,XYZ,0,0,65
F,XYZ,0,0,7
Expected CSV output (Sum columns cat1, cat2 and cat3 and append the sum.):
$ cat test.csv
company,spread,cat1,cat2,cat3
A,XYZ,32,67,0
B,XYZ,43,0,432
C,XYZ,32,76,32
D,XYZ,454,87,43
E,XYZ,0,0,65
F,XYZ,0,0,7
,,561,230,579
Code:
import csv
all_keys = ['cat1', 'cat2', 'cat3']
default_values = {i: 0 for i in all_keys}
def read_csv():
    with open('test.csv', 'r') as f:
        reader = csv.DictReader(f)
        yield from reader
for row in read_csv():
    for i in all_keys:
        default_values[i] += int(row[i])
with open('test.csv', 'a') as w:
    writer = csv.DictWriter(w, fieldnames=all_keys)
    writer.writerow(default_values)
Actual Output:
$ cat test.csv
company,spread,cat1,cat2,cat3
A,XYZ,32,67,0
B,XYZ,43,0,432
C,XYZ,32,76,32
D,XYZ,454,87,43
E,XYZ,0,0,65
F,XYZ,0,0,7
561,230,579
Question:
The csv.DictWriter is not appending the row with the correct column alignment. I understand that I have 5 columns but am providing values for only 3 of them. However, I thought that, since this is a DictWriter, it would write each value only under its matching column header. If I open my actual output CSV, it is clearly visible that the columns are not aligned.
You should include the names of the first two columns in fieldnames:
with open('test.csv', 'a') as w:
    writer = csv.DictWriter(w, fieldnames=['company', 'spread'] + all_keys)
    writer.writerow(default_values)
Blank values will be written to the first two columns if the keys are not available in the dictionary.
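A minimal sketch of that behaviour, using an in-memory buffer as a hypothetical stand-in for test.csv so it can be run without touching a file. The totals are the ones from the question; the buffer and variable names are illustrative only.

```python
import csv
import io

# Hypothetical in-memory stand-in for test.csv (header plus one data row).
buffer = io.StringIO()
buffer.write("company,spread,cat1,cat2,cat3\r\nA,XYZ,32,67,0\r\n")

totals = {"cat1": 561, "cat2": 230, "cat3": 579}

# List all five columns; keys absent from `totals` are written as empty fields.
writer = csv.DictWriter(
    buffer, fieldnames=["company", "spread", "cat1", "cat2", "cat3"]
)
writer.writerow(totals)

last_line = buffer.getvalue().splitlines()[-1]
print(last_line)  # ,,561,230,579
```

DictWriter does not read the header already in the file; it positions values purely by the order of fieldnames, which is why the list has to match the file's columns.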
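You can also declare your writer with restval:
with open('test.csv', 'a') as w:
    writer = csv.DictWriter(w, fieldnames=['company', 'spread'] + all_keys, restval=' ')
    writer.writerow(default_values)
restval is the value written for every key that is listed in fieldnames but missing from the dictionary (it defaults to an empty string), so you don't have to add the missing keys to the dictionary yourself. Note that fieldnames must still list all five columns; with only the three cat columns there are no missing keys to fill and the row is written with three fields. https://docs.python.org/3/library/csv.html#csv.DictWriter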
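To see what restval does and does not do, here is a small sketch. The helper totals_row is a hypothetical name introduced only for this illustration; it writes one row to an in-memory buffer and returns the resulting line.

```python
import csv
import io

def totals_row(totals, fieldnames, restval=""):
    """Return the CSV line DictWriter produces for `totals` (illustrative helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval=restval)
    writer.writerow(totals)
    return buffer.getvalue().rstrip("\r\n")

columns = ["company", "spread", "cat1", "cat2", "cat3"]
totals = {"cat1": 561, "cat2": 230, "cat3": 579}

print(totals_row(totals, columns))                   # ,,561,230,579
print(totals_row(totals, columns, restval="TOTAL"))  # TOTAL,TOTAL,561,230,579
# restval only fills columns named in fieldnames, so this row still has 3 fields:
print(totals_row(totals, ["cat1", "cat2", "cat3"], restval="TOTAL"))  # 561,230,579
```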
Related
I have a huge CSV file with approximately 992 rows * 992 columns.
The file for example looks like this:
I need to create an output file that essentially contains a header and looks like below:
I tried to use csv.reader and csv.DictReader too, but I am getting stuck on removing the NA columns and on getting the column names into one list (or column) and the corresponding values into another.
I am not at all good at pandas and clueless in that aspect.
I tried:
def csv_reader():
    with open("/Users/svadali/Downloads/test_1.csv") as csv_infile, open("/Users/svadali/Downloads/result_file.txt", "w+") as outfile:
        reader = csv.reader(csv_infile, delimiter=',')
        file_writer = csv.writer(outfile, delimiter="\t")
        file_writer.writerow(["SPC", "SPCs_within_0.2_phylo_distance", "Phylo_Distances"])
        for row in reader:
            for column in reader:
                print("this is row", row)
                print("this is column", column)
                if column == 'NA':
                    print("this non NA", column)
                    print("this is supposed to be non NA row", row)
                    break
I also tried transposing the data, but that did not yield the results I need.
You can extract the names from the header, zip them with the distances in each row, filter those with invalid distances, and then zip them again to produce names and distances in separate columns:
import csv
with open("test_1.csv") as infile, open("result_file.txt", "w+") as outfile:
    reader = csv.reader(infile, delimiter=',')
    writer = csv.writer(outfile, delimiter="\t")
    writer.writerow(["SPC", "SPCs_within_0.2_phylo_distance", "Phylo_Distances"])
    # the first header cell is the row-label column; the rest are the names
    _, *names = next(reader)
    for name, *distances in reader:
        writer.writerow((
            name,
            *map(
                ','.join,
                zip(*((n, d) for n, d in zip(names, distances) if d != 'NA'))
            )
        ))
Demo: https://replit.com/#blhsing/OutrageousInvolvedProtools
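Since the sample input is not shown here, the following sketch runs the same zip, filter, zip idea on a small made-up distance matrix. The data, the names s1 to s3, and the helper collapse_rows are all assumptions for illustration, not the asker's actual file.

```python
import csv
import io

# Hypothetical distance matrix: first column is the row name, "NA" marks
# distances that should be dropped.
sample = io.StringIO(
    ",s1,s2,s3\n"
    "s1,0,0.1,NA\n"
    "s2,0.1,0,0.15\n"
    "s3,NA,0.15,0\n"
)

def collapse_rows(infile):
    """Yield [row_name, joined_names, joined_distances] for each data row."""
    reader = csv.reader(infile)
    _, *names = next(reader)  # skip the empty corner cell of the header
    for name, *distances in reader:
        kept = [(n, d) for n, d in zip(names, distances) if d != "NA"]
        # zip(*kept) regroups the pairs into (names...), (distances...).
        yield [name, *map(",".join, zip(*kept))]

rows = list(collapse_rows(sample))
for row in rows:
    print("\t".join(row))
```

One thing to be aware of: if every distance in a row is NA, zip(*kept) produces nothing and that row is written with the name only.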
I want to go through large CSV files and, if there is missing data, remove that row completely. The check applies to every column: if any cell in a row equals 0 or has no value (a blank cell), the entire row should be deleted, and the corrected data should be written to a new, corrected CSV.
import csv
with open('data.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print(row)
        if not row[0]:
            print("12")
This is what I found and tried, but it does not seem to be working and I don't have any ideas about how to approach this problem. Help, please?
Thanks!
Because of the way the CSV reader presents rows of data, you need to know how many columns there are in the original CSV file. For example, if the CSV file content looks like this:
1,2
3,
4
Then the lists returned by iterating over the reader would look like this:
['1','2']
['3','']
['4']
As you can see, the third row has only one column, whereas the first and second rows have two, although one of those is (effectively) empty.
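This is easy to check with an in-memory copy of the three-line sample; the variable names are illustrative only.

```python
import csv
import io

# The three-line sample from above, held in memory instead of a file.
sample = io.StringIO("1,2\n3,\n4\n")

rows = list(csv.reader(sample))
widths = [len(row) for row in rows]

print(rows)         # [['1', '2'], ['3', ''], ['4']]
print(max(widths))  # 2
```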
This function lets you either specify the number of columns (if you know it beforehand) or let the function figure it out. If not specified, the number of columns is taken to be the greatest number of columns found in any row.
So...
import csv
DELIMITER = ','
def valid_column(col):
    # numeric cells are valid unless they equal zero
    try:
        return float(col) != 0
    except ValueError:
        pass
    # non-numeric cells are valid unless they are blank
    return len(col.strip()) > 0
def fix_csv(input_file, output_file, cols=0):
    if cols == 0:
        with open(input_file, newline='') as indata:
            cols = max(len(row) for row in csv.reader(indata, delimiter=DELIMITER))
    with open(input_file, newline='') as indata, open(output_file, 'w', newline='') as outdata:
        writer = csv.writer(outdata, delimiter=DELIMITER)
        for row in csv.reader(indata, delimiter=DELIMITER):
            if len(row) == cols:
                if all(valid_column(col) for col in row):
                    writer.writerow(row)
fix_csv('original.csv', 'fixed.csv')
Maybe something like this:
import csv
with open('data.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    data = list(csvreader)
data = [x for x in data if '' not in x and '0' not in x]
You can then rewrite the CSV file if you like.
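A rough sketch of the filter plus the write-back step, run on made-up sample data held in memory. The helper name drop_incomplete_rows and the sample rows are assumptions for illustration.

```python
import csv
import io

def drop_incomplete_rows(rows):
    """Keep only rows with no empty cell and no cell equal to the string '0'."""
    return [row for row in rows if "" not in row and "0" not in row]

# Hypothetical sample data standing in for data.csv.
source = io.StringIO("a,b,c\n1,2,3\n4,,6\n7,0,9\n10,11,12\n")
kept = drop_incomplete_rows(csv.reader(source))

# Write the surviving rows back out; with a real file, open it with newline=''.
target = io.StringIO()
csv.writer(target, lineterminator="\n").writerows(kept)
print(target.getvalue())
```

The membership test compares whole cells, so "10" survives, but so would "0.0" or " 0"; a numeric check is needed if zeros can appear in those forms.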
Instead of using csv, you could use the pandas module, something like this:
import pandas as pd
df = pd.read_csv('file.csv')
print(df)
index = 1 #index of the row that you want to remove
df = df.drop(index)
print(df)
df.to_csv('file.csv')
I need to re-order columns in a csv but I'll need to call each column from a dictionary.
EXAMPLE:
Sample input csv File:
$ cat file.csv
A,B,C,D,E
a1,b1,c1,d1,e1
a2,b2,c2,d2,e2
Code
import csv
with open('file.csv', 'r') as infile, open('reordered.csv', 'a') as outfile:
    order_of_headers_should_be = ['A', 'C', 'D', 'E', 'B']
    dictionary = {'A':'X1','B':'Y1','C':'U1','D':'T1','E':'K1'}
    writer = csv.DictWriter(outfile)
    # reorder the header first
    writer.writeheader()
    for row in csv.DictReader(infile):
        # writes the reordered rows to the new file
        writer.writerow(row)
The Output csv file needs to look like this:
$ cat reordered.csv
X1,U1,T1,K1,Y1
a1,c1,d1,e1,b1
a2,c2,d2,e2,b2
Trying to make a variable to call the dictionary
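You can do this by mapping each key through the dictionary just before you write the row, like so:
for row in csv.DictReader(infile):
    # writes the reordered rows to the new file
    writer.writerow({dictionary[i]: row[i] for i in row})
Note the use of a dictionary comprehension. For this to work, the writer's fieldnames must be the renamed headers in the desired order, i.e. [dictionary[h] for h in order_of_headers_should_be].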
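Putting the pieces together, a complete version might look like the sketch below. It uses in-memory buffers as stand-ins for file.csv and reordered.csv, and passes lineterminator="\n" only to keep the printed result tidy; both choices are assumptions made for the example.

```python
import csv
import io

order_of_headers_should_be = ["A", "C", "D", "E", "B"]
dictionary = {"A": "X1", "B": "Y1", "C": "U1", "D": "T1", "E": "K1"}

# In-memory stand-ins for file.csv and reordered.csv.
infile = io.StringIO("A,B,C,D,E\na1,b1,c1,d1,e1\na2,b2,c2,d2,e2\n")
outfile = io.StringIO()

# fieldnames fixes both the output order and the renamed header row.
writer = csv.DictWriter(
    outfile,
    fieldnames=[dictionary[h] for h in order_of_headers_should_be],
    lineterminator="\n",
)
writer.writeheader()
for row in csv.DictReader(infile):
    # Rename each key; DictWriter then places values according to fieldnames.
    writer.writerow({dictionary[key]: value for key, value in row.items()})

result = outfile.getvalue()
print(result)
```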
My dictionaries that include names of characters and their offsets look like this:
{'Amestris': [(2247, 2255)],
 'Beda': [(3266, 3270)],
 'Fuery': [(2285, 2290),
           (2380, 2385),
           (2686, 2691),
           (2723, 2728)],
 'Gloria': [(1565, 1571)],
 'Hawkeye': [(22, 29),
             (832, 839),
             (999, 1006),
             (1119, 1126),
             (1927, 1934),
             (3007, 3014),
             (4068, 4075)]}
My desired output would be the keys of the dictionary (the character names).
In addition, the first column should be enumerating the character names as their Ids, starting from 0.
The tab-delimited csv file should look like this:
Id Label
0 Amestris
1 Beda
2 Fuery
3 Gloria
4 Hawkeye
So far I've reached this point:
import csv
import pickle
exmen = pickle.load(open("exmen.p", "rb"))
with open('mycsvfile.csv', 'w') as f:
    i = 0
    fieldnames = ['Id', 'Label']
    w = csv.writer(f, delimiter=' ', fieldnames=fieldnames)
    w.writeheader()
    w.writerow(i(dict(fieldnames, (key))) for key in exmen)
I'm getting this error message:
line 28, in <module>
w = csv.writer(f, delimiter=' ', fieldnames=fieldnames)
TypeError: 'fieldnames' is an invalid keyword argument for this function
I'm not sure how to include the headers Id and Label other than by using fieldnames, or how to implement the enumeration of the rows. I tried setting i = 0 and looked for somewhere in the last line to apply i += 1, but that gave me a syntax error.
Any ideas for improving the code?
Thanks in advance!
fieldnames is only an argument for csv.DictWriter, which you do not need here. You could try something along the following lines, using csv.writer:
with open('mycsvfile.csv', 'w') as f:
    w = csv.writer(f, delimiter='\t') # tab delimited
    w.writerow(['Id', 'Label']) # just write header row
    w.writerows(enumerate(exmen)) # write (id, key) rows
If exmen is a plain dict, the order of its keys is not guaranteed on Python versions before 3.7 (from 3.7 on, dicts preserve insertion order). You can do:
w.writerows(enumerate(sorted(exmen)))
to enforce alphabetical order.
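As a quick check of the enumerate and sorted combination, here is a sketch that writes to an in-memory buffer. The exmen dictionary below is a hypothetical three-entry subset of the one in the question, deliberately listed out of alphabetical order.

```python
import csv
import io

# Hypothetical subset of the exmen dictionary from the question.
exmen = {
    "Hawkeye": [(22, 29), (832, 839)],
    "Amestris": [(2247, 2255)],
    "Beda": [(3266, 3270)],
}

buffer = io.StringIO()
w = csv.writer(buffer, delimiter="\t", lineterminator="\n")
w.writerow(["Id", "Label"])
# Iterating a dict yields its keys; sorted() fixes the order and
# enumerate() numbers the keys starting from 0.
w.writerows(enumerate(sorted(exmen)))

output = buffer.getvalue()
print(output)
```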
I am attempting to merge two CSV files based on a specific field in each file.
file1.csv
id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"
file2.csv
id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False
This is the code I am using:
import csv
from collections import OrderedDict
with open('file2.csv','r') as f2:
    reader = csv.reader(f2)
    fields2 = next(reader,None) # Skip headers
    dict2 = {row[0]: row[1:] for row in reader}
with open('file1.csv','r') as f1:
    reader = csv.reader(f1)
    fields1 = next(reader,None) # Skip headers
    dict1 = OrderedDict((row[0], row[1:]) for row in reader)
result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.iteritems():
        result.setdefault(key, []).extend(value)
with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for key, value in result.iteritems():
        w.writerow([key] + value)
I get output like this, which merges appropriately, but does not have the same number of attributes for all rows:
1,True,7,Purple
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure
file2 will not have a record for every id in file1. I'd like the output to have empty fields from file2 in the merged file. For example, id 1 would look like this:
1,True,7,Purple,,,
How can I add the empty fields to records that don't have data in file2 so that all of my records in the merged CSV have the same number of attributes?
If we're not using pandas, I'd refactor to something like
import csv
from collections import OrderedDict
filenames = "file1.csv", "file2.csv"
data = OrderedDict()
fieldnames = []
for filename in filenames:
    with open(filename, "rb") as fp: # python 2
        reader = csv.DictReader(fp)
        fieldnames.extend(reader.fieldnames)
        for row in reader:
            data.setdefault(row["id"], {}).update(row)
fieldnames = list(OrderedDict.fromkeys(fieldnames))
with open("merged.csv", "wb") as fp:
    writer = csv.writer(fp)
    writer.writerow(fieldnames)
    for row in data.itervalues():
        writer.writerow([row.get(field, '') for field in fieldnames])
which gives
id,attr1,attr2,attr3,attr4,attr5,attr6
1,True,7,Purple,,,
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope,,,
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure,,,
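The csv-based code above targets Python 2 ("rb"/"wb" modes, itervalues). A possible Python 3 adaptation of the same idea is sketched below; the function name merge_on_id is hypothetical, and the inputs are in-memory excerpts of file1.csv and file2.csv rather than the full files.

```python
import csv
import io

def merge_on_id(sources):
    """Merge CSV sources on their 'id' column; return (fieldnames, rows)."""
    data = {}        # id -> merged row dict (dicts keep insertion order in 3.7+)
    fieldnames = []
    for source in sources:
        reader = csv.DictReader(source)
        fieldnames.extend(reader.fieldnames)
        for row in reader:
            data.setdefault(row["id"], {}).update(row)
    fieldnames = list(dict.fromkeys(fieldnames))  # de-duplicate, keep order
    # Missing attributes become empty strings so every row has the same width.
    rows = [[row.get(field, "") for field in fieldnames] for row in data.values()]
    return fieldnames, rows

# Hypothetical in-memory excerpts of file1.csv and file2.csv.
file1 = io.StringIO('id,attr1,attr2,attr3\n1,True,7,"Purple"\n2,False,19.8,"Cucumber"\n')
file2 = io.StringIO('id,attr4,attr5,attr6\n2,"python",500000.12,False\n')

fieldnames, rows = merge_on_id([file1, file2])
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(fieldnames)
writer.writerows(rows)
merged = out.getvalue()
print(merged)
```

With real files in Python 3, open them in text mode with newline='' instead of the binary modes used in the Python 2 version.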
For comparison, the pandas equivalent would be something like
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
merged = df1.merge(df2, on="id", how="outer").fillna("")
merged.to_csv("merged.csv", index=False)
which is much simpler to my eyes, and means you can spend more time dealing with your data and less time reinventing wheels.
You can use pandas to do this:
import pandas
csv1 = pandas.read_csv('file1.csv')
csv2 = pandas.read_csv('file2.csv')
merged = csv1.merge(csv2, on='id', how='left')
merged.to_csv("output.csv", index=False)
I haven't tested this yet, but it should put you on the right track until I can try it out. The code is fairly self-explanatory: first you import the pandas library so that you can use it. Then, using pandas.read_csv, you read the two CSV files and use the merge method to merge them. The on parameter specifies which column should be used as the "key". The how='left' argument keeps every row of file1.csv even when file2.csv has no matching id; the default is an inner join, which would drop those rows. Finally, the merged CSV is written to output.csv.
Use a dict of dicts, then update it. Like this (Python 2):
import csv
from collections import OrderedDict
with open('file2.csv','r') as f2:
    reader = csv.reader(f2)
    lines2 = list(reader)
with open('file1.csv','r') as f1:
    reader = csv.reader(f1)
    lines1 = list(reader)
dict1 = {row[0]: dict(zip(lines1[0][1:], row[1:])) for row in lines1[1:]}
dict2 = {row[0]: dict(zip(lines2[0][1:], row[1:])) for row in lines2[1:]}
#merge
updatedDict = OrderedDict()
mergedAttrs = OrderedDict.fromkeys(lines1[0][1:] + lines2[0][1:], "?")
for id, attrs in dict1.iteritems():
    d = mergedAttrs.copy()
    d.update(attrs)
    updatedDict[id] = d
for id, attrs in dict2.iteritems():
    updatedDict[id].update(attrs)
#out
with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for id, rest in sorted(updatedDict.iteritems()):
        w.writerow([id] + rest.values())