Python: appending/merging multiple csv files, respecting headers, and writing to csv

[Using Python 3] I'm very new to (Python) programming, but I am nonetheless writing a script that scans a folder for certain csv files, reads them all, appends them, and writes the result into another csv file.
Along the way, rows should be kept only where the values in a certain column match a set of criteria.
All csv files have the same columns, and would look somewhere like this:
header1 header2 header3 header4 ...
string float string float ...
string float string float ...
string float string float ...
string float string float ...
... ... ... ... ...
The code I'm working with right now is the following (below); however, it just keeps overwriting the data from the previous file. That makes sense to me, I just cannot figure out how to get it working.
Code:
import csv
import datetime
import sys
import glob
import itertools
from collections import defaultdict
# Raw data files are named like '2013-06-04.csv'. To keep this script usable
# throughout 2013, the glob searches for the pattern '2013-*.csv'
files = glob.glob('2013-*.csv')
# Output file looks like '20130620-filtered.csv'
outfile = '{:%Y%m%d}-filtered.csv'.format(datetime.datetime.now())
# List of 'Header4' values to keep when writing the output
header4 = ['string1', 'string2', 'string3', 'string4']
for f in files:
    with open(f, 'r') as f_in:
        dict_reader = csv.DictReader(f_in)
        with open(outfile, 'w') as f_out:
            dict_writer = csv.DictWriter(f_out, lineterminator='\n',
                                         fieldnames=dict_reader.fieldnames)
            dict_writer.writeheader()
            for row in dict_reader:
                if row['Header4'] in header4:
                    dict_writer.writerow(row)
I also tried something like readers = list(itertools.chain(*map(lambda f: csv.DictReader(open(f)), files))) and iterating over the readers, but then I cannot figure out how to work with the headers (I get an error that the itertools.chain object does not have a fieldnames attribute).
Any help is very much appreciated!

You keep re-opening the file and overwriting it.
Open outfile once, before your loops start. For the first file you read, write the header and the rows. For the rest of the files, just write the rows.
Something like
with open(outfile, 'w') as f_out:
    dict_writer = None
    for f in files:
        with open(f, 'r') as f_in:
            dict_reader = csv.DictReader(f_in)
            if not dict_writer:
                dict_writer = csv.DictWriter(f_out, lineterminator='\n',
                                             fieldnames=dict_reader.fieldnames)
                dict_writer.writeheader()
            for row in dict_reader:
                if row['Header4'] in header4:
                    dict_writer.writerow(row)
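If you prefer the itertools.chain approach from the question, you can keep the header by building the readers first and taking fieldnames from the first one. A minimal sketch, under the same assumption that every file shares one header:

import csv
import itertools

f_ins = [open(f, 'r') for f in files]
readers = [csv.DictReader(f_in) for f_in in f_ins]

with open(outfile, 'w') as f_out:
    # chain() has no fieldnames attribute, but the individual readers do
    dict_writer = csv.DictWriter(f_out, lineterminator='\n',
                                 fieldnames=readers[0].fieldnames)
    dict_writer.writeheader()
    for row in itertools.chain(*readers):
        if row['Header4'] in header4:
            dict_writer.writerow(row)

for f_in in f_ins:
    f_in.close()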

Related

Create a csv file from a python list

I have a list like the following, with \n separating the lines:
['Data,9,record,timestamp,"896018545",s,position_lat,"504719750",semicircles,position_long,"-998493490",semicircles,distance,"10.87",m,altitude,"285.79999999999995",m,speed,"1.773",m/s,unknown,"3929",,unknown,"1002",,enhanced_altitude,"285.79999999999995",m,enhanced_speed,"1.773",m/s,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\n', 'Data,9,record,timestamp,"896018560",s,position_lat,"504717676",semicircles,position_long,"-998501870",semicircles,distance,"71.85",m,altitude,"285.0",m,speed,"5.533",m/s,unknown,"3924",,unknown,"1001",,enhanced_altitude,"285.0",m,enhanced_speed,"5.533",m/s,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\n']
This is hard to read, so I need to extract the following fields from this list into a CSV file in Python, in the format below, without using pandas:
timestamp,position_lat,altitude
"896018545","504719750","285.79999999999995"
"896018560","504717676","285.0"
I have the following, but I am confused about how to add the data into the CSV file:
import csv

header = ['timestamp', 'latitude', 'altitude']
with open('target.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)
    # write the header
    writer.writerow(header)
    # write the data
If I'm understanding your question correctly, all you need to do is write additional rows that include your data.
...
...
writer.writerow(["896018545", "504719750", "285.79999999999995"])
writer.writerow(["896018560", "504717676", "285.0"])

# alternatively,
data = [["896018545", "504719750", "285.79999999999995"],
        ["896018560", "504717676", "285.0"]]
...
...
for row in data:
    writer.writerow(row)
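If you'd rather not hard-code the values, you can parse each string in the list and pick out the fields you need. A sketch, assuming every line follows the name,value,unit triple layout shown in the question (the first three fields being Data,9,record); note that csv.writer only adds quotes where needed, so the values come out unquoted:

import csv

raw = [...]  # the list of strings from the question

wanted = ['timestamp', 'position_lat', 'altitude']

with open('target.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(wanted)
    for line in raw:
        fields = next(csv.reader([line]))  # csv handles the quoted values
        # fields after the first three come in (name, value, unit) triples
        record = {fields[i]: fields[i + 1] for i in range(3, len(fields) - 1, 3)}
        writer.writerow([record.get(name, '') for name in wanted])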

Python: efficient way of combining 2 differently delimited csv files

I have written the following code to combine a tab-separated csv file with another comma-separated csv file (which also has a header). The final output is a tab-separated csv, without a header.
import csv

with open('train.csv', 'r') as infile1, open('test.csv', 'r') as infile2, \
        open('final.csv', 'a') as outfile:
    reader1 = csv.reader(infile1, delimiter='\t')
    reader2 = csv.reader(infile2)
    next(reader2, None)  # skip the header
    writer = csv.writer(outfile, delimiter='\t')
    for row in reader1:
        writer.writerow(row)
    for row in reader2:
        writer.writerow(row)
Below are sample files for train.csv and test.csv, respectively:
main-captions MSRvid 2012 0001 5.000 A plane. is taking off.
main-captions MSRvid 2012 0004 3.800 A man. is playing a flute.
Domain,Task Name,Year,Index,Score,Sentence 1,Sentence 2
Exp,Exp,2020,1,5,product,damage
Exp,Exp,2020,2,5,product,broken
The above code works fine.
But is there a shorter way to achieve this? Say, one that makes use of newer packages or features within the csv module?
Your code is already efficient, but it can be shortened further using writer.writerows:
from itertools import chain
writer.writerows(chain(reader1, reader2))
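Put together with the rest of the original code (a sketch using the same file names as the question):

import csv
from itertools import chain

with open('train.csv', 'r') as infile1, open('test.csv', 'r') as infile2, \
        open('final.csv', 'a') as outfile:
    reader1 = csv.reader(infile1, delimiter='\t')
    reader2 = csv.reader(infile2)
    next(reader2, None)  # skip the header
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerows(chain(reader1, reader2))  # one call instead of two loops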

Combine two rows into one in a csv file with Python

I am trying to combine multiple rows in a csv file together. I could easily do it in Excel, but I want to do this for hundreds of files, so I need it to be code. I have tried storing the rows in arrays, but it doesn't seem to work. I am using Python.
So let's say I have a csv file:
1,2,3
4,5,6
7,8,9
All I want is a csv file like this:
1,2,3,4,5,6,7,8,9
The code I have tried is this:
fin = open("C:\\1.csv", 'r+')
fout = open("C:\\2.csv", 'w')
for line in fin.xreadlines():
    new = line.replace(',', ' ', 1)
    fout.write(new)
fin.close()
fout.close()
Could you please help?
You should be using the csv module for this, as splitting CSV manually on commas is very error-prone (a single column can contain a string with commas, which you would then incorrectly split into multiple columns). The csv module uses lists of values to represent single rows.
import csv

def return_contents(file_name):
    with open(file_name) as infile:
        reader = csv.reader(infile)
        return list(reader)

data1 = return_contents('csv1.csv')
data2 = return_contents('csv2.csv')

print(data1)
print(data2)

combined = []
for row in data1:
    combined.extend(row)
for row in data2:
    combined.extend(row)

with open('csv_out.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(combined)
That code gives you the basis of the approach, but it would be ugly to extend it to hundreds of files. Instead, you probably want os.listdir to pull all the files in a single directory, one by one, and add them to your output. This is the reason I packed the reading code into the return_contents function: we can repeat the same process on any number of files with only one set of code doing the actual reading. Something like this:
import csv
import os

def return_contents(file_name):
    with open(file_name) as infile:
        reader = csv.reader(infile)
        return list(reader)

all_files = os.listdir('my_csvs')
combined_output = []

for file in all_files:
    data = return_contents('my_csvs/{}'.format(file))
    for row in data:
        combined_output.extend(row)

with open('csv_out.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(combined_output)
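If the directory can contain files that aren't CSVs, glob filters by extension directly, and it returns paths already joined with the directory name (a small variation, not part of the original answer):

import glob

combined_output = []
for path in glob.glob('my_csvs/*.csv'):  # only .csv files
    for row in return_contents(path):
        combined_output.extend(row)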
If you are dealing specifically with the CSV file format, I recommend using the csv package for the file operations. If you also use a with...as statement, you don't need to worry about closing the file, etc. You just need to define PATH, and the program will iterate over all .csv files.
Here is what you can do:
import csv
import os

PATH = "your folder path"

def order_list():
    data_list = []
    for filename in os.listdir(PATH):
        if filename.endswith(".csv"):
            # open the file we just found, not a hard-coded name
            with open(os.path.join(PATH, filename)) as csvfile:
                read_csv = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_NONNUMERIC)
                for row in read_csv:
                    data_list.extend(row)
    print(data_list)

if __name__ == '__main__':
    order_list()
Store your data in a pandas DataFrame:
import pandas as pd
df = pd.read_csv('file.csv')
Store the modified DataFrame in a new one:
df_2 = df.groupby('Column_Name').agg(lambda x: ' '.join(x)).reset_index()  # write the name of your column
Write the DataFrame to a new csv:
df_2.to_csv("file_modified.csv")
You could also do it like this:
fIn = open("test.csv", "r")
fOut = open("output.csv", "w")
fOut.write(",".join([line for line in fIn]).replace("\n",""))
fIn.close()
fOut.close()
If you now want to run it on multiple files, you can run it as a script with arguments:
import sys
fIn = open(sys.argv[1], "r")
fOut = open(sys.argv[2], "w")
fOut.write(",".join([line for line in fIn]).replace("\n",""))
fIn.close()
fOut.close()
Now, assuming you use some Linux system and the script is called csvOnliner.py, you could call it with:
for i in *.csv; do python csvOnliner.py $i changed_$i; done
On Windows you could do it like this:
FOR %i IN (*.csv) DO csvOnliner.py %i changed_%i

Python script to turn input csv columns into output csv row values

I have an input csv that looks like:
email,trait1,trait2,trait3
foo#gmail,biz,baz,buzz
bar#gmail,bizzy,bazzy,buzzy
foobars#gmail,bizziest,bazziest,buzziest
and I need the output format to look like
Indv,AttrName,AttrValue,Start,End
foo#gmail,"trait1",biz,,,
foo#gmail,"trait2",baz,baz,,
foo#gmail,"trait3",buzz,,,
For each row in my input file I need to write one output row per non-email column, i.e. N-1 rows per input row. The Start and End fields in the output file can be empty in some cases.
I'm trying to read in the data using a DictReader. So far I've been able to read in the data with:
import unicodecsv
import os
import codecs

with open('test.csv') as csvfile:
    reader = unicodecsv.csv.DictReader(csvfile)
    outfile = codecs.open("test-write", "w", "utf-8")
    outfile.write("Indv", "ATTR", "Value", "Start", "End\n")
    for row in reader:
        outfile.write([row['email'], "trait1", row['trait1'], '', ''])
        outfile.write([row['email'], "trait2", row['trait2'], row['trait2'], ''])
        outfile.write([row['email'], "trait3", row['trait3'], '', ''])
Which doesn't work (I think I need to cast the list to a string), and it is also very brittle, as I'm hardcoding the column names for each row. The bigger issue is that the data within the for loop isn't written to "test-write". Only the line
outfile.write("Indv", "ATTR", "Value", "Start","End\n") actually writes out to the file. Is DictReader the appropriate class to use in my case?
This uses a unicodecsv.DictWriter and the zip() function to do what you want, and the code is fairly readable in my opinion.
import unicodecsv
import codecs

with open('test.csv') as infile, \
     codecs.open('test-write.csv', 'w', 'utf-8') as outfile:
    reader = unicodecsv.DictReader(infile)
    fieldnames = 'Indv,AttrName,AttrValue,Start,End'.split(',')
    writer = unicodecsv.DictWriter(outfile, fieldnames)
    writer.writeheader()
    for row in reader:
        email = row['email']
        trait1, trait2, trait3 = row['trait1'], row['trait2'], row['trait3']
        writer.writerows([  # writes three rows of output from each row of input
            dict(zip(fieldnames, [email, 'trait1', trait1])),
            dict(zip(fieldnames, [email, 'trait2', trait2, trait2])),
            dict(zip(fieldnames, [email, 'trait3', trait3]))])
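This works because zip() stops at its shorter argument, so Start and End are simply left out of each dict, and DictWriter fills any missing keys with its restval parameter, which defaults to an empty string.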
Here's the contents of the test-write.csv file it produced from your example input csv file:
Indv,AttrName,AttrValue,Start,End
foo#gmail,trait1,biz,,
foo#gmail,trait2,baz,baz,
foo#gmail,trait3,buzz,,
bar#gmail,trait1,bizzy,,
bar#gmail,trait2,bazzy,bazzy,
bar#gmail,trait3,buzzy,,
foobars#gmail,trait1,bizziest,,
foobars#gmail,trait2,bazziest,bazziest,
foobars#gmail,trait3,buzziest,,
I may be completely off since I don't do a lot of work with unicode, but it seems to me that the following should work:
import csv

with open('test.csv', 'r') as csvin, open('test-write', 'w', newline='') as csvout:
    reader = csv.DictReader(csvin)
    writer = csv.DictWriter(csvout, fieldnames=['Indv', 'AttrName',
                                                'AttrValue', 'Start', 'End'])
    writer.writeheader()
    for row in reader:
        for traitnum in range(1, 4):
            key = "trait{}".format(traitnum)
            # Start and End are omitted; DictWriter fills them with restval ('')
            writer.writerow({'Indv': row['email'], 'AttrName': key,
                             'AttrValue': row[key]})
import pandas as pd

pd1 = pd.read_csv('input_csv.csv')
pd2 = (pd.melt(pd1, id_vars=['email'],
               value_vars=['trait1', 'trait2', 'trait3'],
               var_name='AttrName', value_name='AttrValue')
         .rename(columns={'email': 'Indv'})
         .sort_values(['Indv', 'AttrName'])  # .sort() was removed from newer pandas
         .reset_index(drop=True))
pd2.to_csv('output_csv.csv', index=False)
Unclear on what the Start and End fields represent, but this gets you everything else.

Parsing a pipe-delimited file in Python

I'm trying to parse a pipe-delimited file and pass the values into a list, so that later I can print selective values from the list.
The file looks like:
name|age|address|phone|||||||||||..etc
It has more than 100 columns.
Use the 'csv' library.
First, register your dialect:
import csv
csv.register_dialect('piper', delimiter='|', quoting=csv.QUOTE_NONE)
Then, use your dialect on the file:
with open(myfile, newline='') as csvfile:
    for row in csv.DictReader(csvfile, dialect='piper'):
        print(row['name'])
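QUOTE_NONE tells the parser not to treat any character as a quote character, which suits pipe-delimited exports that never quote their fields.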
Use Pandas:
import pandas as pd
pd.read_csv(filename, sep="|")
This will store the file in a dataframe. For each column, you can apply conditions to select the required values to print. It takes a very short time to execute. I tried with 111,047 rows.
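For example, to keep only certain rows and print a single column (a sketch; the name and age columns come from the sample line above, and the age threshold is made up):

import pandas as pd

df = pd.read_csv(filename, sep="|")
adults = df[df["age"] > 30]  # condition on one column...
print(adults["name"])        # ...then print only the values you need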
If you're parsing a very simple file that won't contain any | characters in the actual field values, you can use split:
with open('file', 'r') as fileHandle:
    for line in fileHandle:
        fields = line.split('|')
        print(fields[0])  # prints the first field's value
        print(fields[1])  # prints the second field's value
A more robust way to parse tabular data would be to use the csv library as mentioned in Spencer Rathbun's answer.
In 2022, with Python 3.8 or above, you can simply do:
import csv

with open(file_path, "r", newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='|')
    for row in reader:
        print(row[0], row[1])
