I need to parse a fixed-width file (FWF) into a DataFrame, but the system that exports the file writes &amp;amp; instead of the & symbol. This breaks the fixed-width layout, because lines containing &amp;amp; now have extra characters, so I can't use read_fwf directly. I can import it with the code below, but this gives me a single column that I then have to split.
import pandas as pd
cnv = lambda txt: txt.replace('&amp;amp;', '&amp;')
df = pd.read_csv('AccountStatement1.txt', skiprows=4, skipfooter=3, engine='python', sep='$%^#~&amp;', converters={i: cnv for i in range(1)})
df
I use sep='$%^#~&amp;' so that I get only one column, and I correct the text using the converter.
What is the proper solution to this?
Sample of the text file:
=======================================================================================================================================================================================================================================================================================================================================================
From: Transaction Date: 21/05/2021 To: Transaction Date: 23/06/2021 IBAN:: CYxxxxxx0000000000000 Currency: EUR Previous Statement Balance: 1,111.10 BIC: xxxxxxxx
=======================================================================================================================================================================================================================================================================================================================================================
Transaction Date Transaction details Reference No. Description 1 Description 2 Value date Debit Credit Balance
27/05/2021 CHQ: 12568987 26/05/2021 645.00 9,708.70
27/05/2021 DEBIT EB2021057554434221149 xxxx xxxxxxxxx xxxxxx 27/05/2021 0,888.36 3,723.74
28/05/2021 I2456787437452 B/O: xxxxxxxxxxxxxxx LTD TRANSFER xxxxxxxxx xxxxxxxxxx 27/05/2021 19,002.00 13,755.74
28/05/2021 INWARD TRANSFER COMMISSION CY21jhjh884786 I2107675689452 28/05/2021 10.00 15,723.74
31/05/2021 ATM/POS DEBIT jhgjhkjhjk jkh f4 1211 xxxxxx xxxxx & xx xxxxx 27/05/2021 60.00 52,680.74
31/05/2021 Service Charges MONTHLY MAINTENANCE FEE 31/05/2021 35.00 73,645.74
01/06/2021 Service Charges STATEMENT FEE - MAY 2021 31/05/2021 5.00 19,645.74
02/06/2021 ATM/POS DEBIT POS 567521 124454 1211 xxxxxxxxxxxx & Exxxxxxx 31/05/2021 170.00 09,320.74
03/06/2021 CHQ: 13456784 02/06/2021 80.00 10,230.74
04/06/2021 ATM/POS DEBIT POS 345671 124258 1278 xxxxxxxxxxxx & xxxxxxxx 02/06/2021 940.00 23,960.74
08/06/2021 ATM/POS DEBIT POS 345671 125678 1278 xxxxxxx xxxxx xxxxx 04/06/2021 13.20 13,347.54
15/06/2021 ATM/POS DEBIT POS 145671 156612 1671 xxxx xxxxxxxxxxxxxx680 11/06/2021 25.53 13,322.01
=======================================================================================================================================================================================================================================================================================================================================================
Number of records: 22 IBAN:: xxxx234567898765434567876545 Currency: EUR Current Statement Balance: 0,000.00
=======================================================================================================================================================================================================================================================================================================================================================
Maybe you could load the file, replace the problematic characters, then read it as fixed width with pd.read_fwf using io.StringIO to make an in-memory buffer:
>>> import io, pandas as pd
>>> with open('test.csv') as f:
...     lines = f.readlines()
>>> pd.read_fwf(io.StringIO(''.join(lines[4:-3]).replace('&amp;amp;', '&amp;')))
    a    b  c
0  11  fo&amp;  0
This is the file’s content, unaligned by &amp;amp; as you indicate:
>>> print(''.join(lines))
foo
bar
baz
qux
a b c
11 fo&amp;amp; 0
ig
nore
me
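Applied end to end, the whole round trip might look like the following sketch. The miniature sample here is hypothetical; for the real file you would read AccountStatement1.txt and slice off its four header and three footer lines instead:

```python
import io

import pandas as pd

# Hypothetical miniature of the exported statement: a header line, a
# fixed-width table whose alignment breaks wherever '&' was exported
# as '&amp;', and a footer line.
raw = (
    "=== header ===\n"
    "a    b       c\n"
    "11   fo&amp;     0\n"
    "12   bar     1\n"
    "=== footer ===\n"
)

lines = raw.splitlines(keepends=True)
# Drop header/footer and undo the '&amp;' -> '&' substitution; the columns
# then line up again and read_fwf can infer the field widths.
fixed = "".join(lines[1:-1]).replace("&amp;", "&")
df = pd.read_fwf(io.StringIO(fixed))
print(df)
```

The key point is that the replacement happens on the raw text, before read_fwf sees it, so the width inference operates on properly aligned lines.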
Related
I was trying to read a text file into a pandas DataFrame, but instead of 4 columns I got a lot of columns. How can I read the following text as a pandas DataFrame?
txt="""2020-09-12, Budget, GD-0032-DD-XP, Ford,\\n 2020-04-22, Avis, D143123, Toyota,\\n 2020-04-03, Herz, 331029411, Jeep,\\n 2020-10-31, HERZ, , Hyundai,\\n 2020-09-10, Budget, Gd-1932-Ee-Rm, Chevrolet,\\n 2020-12-01, National, 9890001, Ford,\\n 2020-05-13, Alamo, W***, Hyundai,\\n 2020-01-21, Enterprise, GD-8888-TL-MP, Jeep,\\n"""
My attempt:
txt="""2020-09-12, Budget, GD-0032-DD-XP, Ford,\n 2020-04-22, Avis, D143123, Toyota,\n 2020-04-03, Herz, 331029411, Jeep,\n 2020-10-31, HERZ, , Hyundai,\n 2020-09-10, Budget, Gd-1932-Ee-Rm, Chevrolet,\n 2020-12-01, National, 9890001, Ford,\n 2020-05-13, Alamo, W***, Hyundai,\n 2020-01-21, Enterprise, GD-8888-TL-MP, Jeep,\n"""
# input file with only ONE row
with open('input000.txt','w') as fo:
    txt = txt.replace('\n','\\n')
    fo.write(txt)
# read the data file
import io
import pandas as pd
df = pd.read_csv('input000.txt', lineterminator='\n')
df
My output
2020-09-12 Budget GD-0032-DD-XP Ford \n 2020-04-22 Avis D143123 Toyota \n 2020-04-03 Herz ... Ford.1 \n 2020-05-13 Alamo W*** Hyundai.1 \n 2020-01-21 Enterprise GD-8888-TL-MP Jeep.1 \n
0 rows × 33 columns
Required output
0 1 2 3
0 2020-09-12 Budget GD-0032-DD-XP Ford
1 2020-04-22 Avis D143123 Toyota
2 2020-04-03 Herz 331029411 Jeep
3 2020-10-31 HERZ Hyundai
4 2020-09-10 Budget Gd-1932-Ee-Rm Chevrolet
5 2020-12-01 National 9890001 Ford
6 2020-05-13 Alamo W*** Hyundai
7 2020-01-21 Enterprise GD-8888-TL-MP Jeep
You have to pass in the separator as well, using the sep parameter. Since the last element in each row also has a trailing comma, this results in a NaN column, which you can drop using drop:
df = pd.read_csv(io.StringIO(txt), lineterminator='\n', sep=',').drop(columns='Unnamed: 4')
On a text file, \n seems to be converted to \\n, and since the C engine of read_csv doesn't support \\n as a separator, perhaps it's better to open it as a text file and build the DataFrame:
with open("input000.txt") as file:
    df = pd.DataFrame([line.split(',') for line in file.read().split('\\n')])
Output:
2020-09-12 Budget GD-0032-DD-XP Ford
0 2020-04-22 Avis D143123 Toyota
1 2020-04-03 Herz 331029411 Jeep
2 2020-10-31 HERZ Hyundai
3 2020-09-10 Budget Gd-1932-Ee-Rm Chevrolet
4 2020-12-01 National 9890001 Ford
5 2020-05-13 Alamo W*** Hyundai
6 2020-01-21 Enterprise GD-8888-TL-MP Jeep
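Another option, since the file stores literal \n sequences rather than real newlines, is to turn them back into newlines first and let read_csv do the rest. A sketch on a shortened, hypothetical two-record version of the text:

```python
import io

import pandas as pd

# Hypothetical two-record fragment: records are separated by a literal
# backslash-n sequence, not a real newline.
txt = "2020-09-12, Budget, GD-0032-DD-XP, Ford,\\n 2020-04-22, Avis, D143123, Toyota,\\n"

# Turn the literal '\n' back into real line breaks, then parse normally.
fixed = txt.replace("\\n", "\n")
df = pd.read_csv(io.StringIO(fixed), sep=",", header=None, skipinitialspace=True)
df = df.drop(columns=df.columns[-1])  # trailing comma produces an empty last column
print(df)
```

This keeps the parsing itself in read_csv, which also handles quoting and type inference for free.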
I have a dataset that I download daily from Amazon AWS. The problem is that some lines are downloaded badly (see the sample below). The two lines that start with "ref" should be appended to the previous row, the one that starts with "001ec214-97e6-4d84-a64a-d1bee0079fd8", to make that row correct. I have about a hundred cases like this to solve, so I wonder if there's a method to do this kind of append with pandas' read_csv function. I also can't just ignore those rows, because I'd be missing useful data.
This one-liner should do it:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(open('filename.csv').read().replace('\n\\\nref :', ' ref :')), sep="\t", header=None)
It reads the file and replaces the redundant newlines before loading the string representation of the csv file into pandas with StringIO.
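On a tiny, hypothetical fragment the same replacement can be seen end to end:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: the middle record was split onto extra lines, the
# continuation starting with 'ref :' after a line holding a lone backslash.
raw = (
    "id1\tgood row\n"
    "id2\tbroken row\n"
    "\\\n"
    "ref : continuation\n"
    "id3\tanother row\n"
)
# Stitch the continuation back onto its record before parsing.
fixed = raw.replace("\n\\\nref :", " ref :")
df = pd.read_csv(StringIO(fixed), sep="\t", header=None)
print(df)
```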
Given the text from the Google doc in the question:
001ea9a6-2c30-4ce6-b4ee-f445803303db 596d89bf-e641-4e9c-9374-695241694a6d 45640 MUNDO JURIDICO 26204 20350 \N A0000000031010 10530377 MATEO FLORES ELMER ASOC INTEGRACION LOS OLIVOS AV NAPOLES MZ I LTE 31 visanet 650011581 1 0 mpos contactless 0 0 0 visa 421410 5969 12 24 c6d06de1-c9f0-4b4a-992b-ad3a5f263039 000066 666407 2021-07-20 09:31:23 000 301201522830426 995212014496392 158348 \N \N \N PEN 200.00 200.00 0.00 1 0 0.00 200026214 5999 ASOC INTEGRACION LOS OLIVOS AV NAPOLES MZ I LTE 31 MUNDO JURIDICO 947761525 AUTHORIZED 2021-07-20 09:31:24 2021-07-20 09:32:00 210720 093123
001ec0fd-8d0e-4332-851a-bcd93fdf0a37 ee32d094-8fc4-4d92-b788-58ca2ae2a590 36750 Chifa Maryori 18923 25313 \N A0000000031010 46753818 Chifa Maryori San isidro Peru visanet 650011581 1 0 mpos contactless 0 0 0 visa 421355 5765 04 26 7d00f10e-d0b7-40ec-9b97-d20590cb7710 000708 620744 2021-06-27 16:52:21 000 301178787424243 996211783887688 100513 \N \N \N PEN 17.00 17.00 0.00 0 0 0.00 400034782 5499 San isidro Peru Chifa Maryori +51 01 5179805 AUTHORIZED 2021-06-27 16:52:23 2021-06-27 16:52:23 210627 165221
001ec214-97e6-4d84-a64a-d1bee0079fd8 3c4a98cc-d279-4f9e-af8b-5198647889c5 33214 Inversiones Polleria el rey MPOS 15699 23053 \N \N 88846910264 JoseOrbegoso Puelle D 5 Lt 2 urb Mariscal
\
ref : altura de elektra visanet 650011581 1 0 mpos chip 0 0 0 visa 455788 2123 09 22 8b3fd975-140a-42d5-8601-1b4bcc16bd60 000022 170790 2020-10-20 11:32:44 687 \N \N \N \N \N \N PEN 1.00 1.00 0.00 0 0 0.00 200020168 5999 D 5 Lt 2 urb Mariscal
\
ref : altura de elektra Inversiones Polleria el rey MPOS 964974226 DENIED 2020-10-20 11:32:44 2020-10-20 11:32:44 201020 113244
001ec66f-6350-4a33-a1fe-b9375dac2161 34169a7a-a66f-4258-80c2-8c0d4512aa36 27044 ABRAHAM LORENZO MELGAREJO 10353 13074 \N \N 99944748991 ABRAHAM LORENZO MELGAREJO JR RENOVACION 399 visanet 650011581 1 0 mpos chip 0 0 0 visa 455788 4712 08 24 83915520-1a1f-4d5e-b118-f0c57a9d96ae 000367 161286 2020-10-14 15:59:51 000 300288755920408 995202889603642 242407 \N \N \N PEN 15.00 15.00 0.00 1 0 0.00 200012523 5811 JR RENOVACION 399 ABRAHAM LORENZO MELGAREJO 957888909 AUTHORIZED 2020-10-14 15:59:52 2020-10-14 16:00:21 201014 155951
001eebaf-bccc-47a7-87a3-be8b09eb971b c5e14889-d61c-4f18-8000-d3bac0564cfb 27605 Polleria Arroyo Express 10792 21904 \N A0000000031010 41429707 Polleria Arroyo Express San isidro Peru visanet 650011581 1 0 mpos contactless 0 0 0 visa 421355 9238 09 25 3c5268e8-5731-4fea-905d-45b03ed623d2 000835 379849 2021-01-30 19:43:36 000 301031026163928 995210303271986 928110 \N \N \N PEN 24.00 24.00 0.00 0 0 0.00 400026444 5499 San isidro Peru Polleria Arroyo Express +51 01 5179805 AUTHORIZED 2021-01-30 19:43:37 2021-01-30 19:43:37 210130 194336
...and using the code from Concatenate lines with previous line based on number of letters in first column: if the "good" lines all start with 001 (that's what the regex below checks for), you can try...
Code:
import re
import pandas as pd
all_the_data = ""
pattern = r'^001.*$'
with open('aws.txt') as f, open('aws_out.csv','w') as output:
    all_the_data = ""
    for line in f:
        if not re.search(pattern, line):
            all_the_data = re.sub("\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
    output.write(all_the_data)
df = pd.read_csv('aws_out.csv', header=None, sep='\t')
print(df)
I need to automate the validations performed on text files. I have two text files, and I need to check whether a row in file one, identified by a unique combination of two columns, is present in file two with the same combination of columns; if it is, the extra column from file two needs to be written into file one.
Text file 1 has thousands of records, and text file 2 serves as a reference for text file 1.
This is what I have written so far. Please help me solve this.
import pandas as pd
df = pd.read_csv("C:\\Users\\hp\\Desktop\\py\\sample2.txt", delimiter=',')
print(df)
# uniquecal = df[['vehicle_Brought_City','Vehicle_Brand']]
# print(uniquecal)
df1 = pd.read_csv("C:\\Users\\hp\\Desktop\\py\\sample1.txt", delimiter=',')
print(df1)
# uniquecal1 = df1[['vehicle_Brought_City','Vehicle_Brand']]
# print(uniquecal1)
How can I put the vehicle price in dataframe one and save it to text file1?
Below is my sample dataset:
File1:
fname lname vehicle_Brought_City Vehicle_Brand Vehicle_price
0 aaa xxx pune honda NaN
1 aaa yyy mumbai tvs NaN
2 aaa xxx hyd maruti NaN
3 bbb xxx pune honda NaN
4 bbb aaa mumbai tvs NaN
File2:
vehicle_Brought_City Vehicle_Brand Vehicle_price
0 pune honda 50000
1 mumbai tvs 40000
2 hyd maruti 45000
Drop the empty Vehicle_price column from file one's DataFrame, then merge on the two key columns:
del df['Vehicle_price']
print(df)
dd = pd.merge(df, df1, on=['vehicle_Brought_City', 'Vehicle_Brand'])
print(dd)
output:
fname lname vehicle_Brought_City Vehicle_Brand Vehicle_price
0 aaa xxx pune honda 50000
1 aaa yyy mumbai tvs 40000
2 bbb aaa mumbai tvs 40000
3 aaa xxx hyd maruti 45000
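If rows of file one without a match in the reference file should survive (with an empty price) rather than disappear, a left merge does that. A sketch with made-up in-memory data standing in for the two files (the file names and the extra unmatched row are hypothetical):

```python
import io

import pandas as pd

# Stand-ins for the two text files; the third row of file1 deliberately
# has no match in the reference data.
file1 = io.StringIO(
    "fname,lname,vehicle_Brought_City,Vehicle_Brand\n"
    "aaa,xxx,pune,honda\n"
    "aaa,yyy,mumbai,tvs\n"
    "zzz,qqq,goa,bmw\n"
)
file2 = io.StringIO(
    "vehicle_Brought_City,Vehicle_Brand,Vehicle_price\n"
    "pune,honda,50000\n"
    "mumbai,tvs,40000\n"
)
df = pd.read_csv(file1)
ref = pd.read_csv(file2)

# how='left' keeps every row of file one; unmatched rows get NaN for the price
out = df.merge(ref, on=["vehicle_Brought_City", "Vehicle_Brand"], how="left")
out.to_csv("sample2_priced.txt", index=False)  # write the result back out
print(out)
```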
I have a large tsv with sep="|" that has an address field containing values like the following:
...xxx|yyy|Level 1 2 xxx Street\(MYCompany)|...
This ends up as:
line1) ...xxx|yyy|Level 1 2 xxx Street\
line2) (MYCompany)|...
I tried quoting=2 in read_table with pandas to turn non-numeric fields into strings, but it still treats the backslash as a line break. What is an efficient way to handle rows where a field contains a backslash that escapes the newline; is there a way to ignore the line break after \?
Ideally this would prepare the data file so it can be read into a DataFrame in pandas.
Update: here are 5 lines, with the breakage on the 3rd line.
1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie
Here is another solution using regex:
import pandas as pd
import re
with open('input.tsv') as f:
    fl = f.read()

# Remove the line break after each trailing backslash ('\' + newline -> '\')
fl = re.sub(r'\\\n', r'\\', fl)

with open('input_fix.tsv','w') as o:
    o.write(fl)

cols = range(1,17)
# Prime the number of columns by specifying names for each column.
# This takes care of the issue of a variable number of columns.
df = pd.read_csv('input_fix.tsv', sep='|', names=cols)
This rejoins the broken lines, so each record is parsed as a single row.
I think you can first try read_csv with a sep that is NOT in the values; it then seems to read correctly:
import pandas as pd
import io
temp=u"""
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep="^", header=None)
print(df)
0
0 49 XXX Ave|Australia
1 u7 38-46 South Street|Australia
2 XXX Margaret StreetNew South Wales|Australia
3 Po box ZZZ|Australia
Then you can create a new file with to_csv and read it back with read_csv and sep="|":
df.to_csv('myfile.csv', header=False, index=False)
print(pd.read_csv('myfile.csv', sep="|", header=None))
0 1
0 49 XXX Ave Australia
1 u7 38-46 South Street Australia
2 XXX Margaret StreetNew South Wales Australia
3 Po box ZZZ Australia
The next solution creates no new file; it writes to the variable output and then uses read_csv with io.StringIO:
import pandas as pd
import io
temp=u"""
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
0
0 49 XXX Ave|Australia
1 u7 38-46 South Street|Australia
2 XXX Margaret StreetNew South Wales|Australia
3 Po box ZZZ|Australia
output = df.to_csv(header=False, index=False)
print(output)
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret StreetNew South Wales|Australia
Po box ZZZ|Australia
print pd.read_csv(io.StringIO(u""+output), sep="|", header=None)
0 1
0 49 XXX Ave Australia
1 u7 38-46 South Street Australia
2 XXX Margaret StreetNew South Wales Australia
3 Po box ZZZ Australia
If I test it on your data, it seems that rows 1 and 2 have 14 fields and the next two have 15 fields.
So I removed the last item from both rows (3 and 4); maybe this is only a typo (I hope it is):
import pandas as pd
import io
temp=u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
0
0 1788768|1831171|208434489|2014-08-14 13:40:02|...
1 1788772|1831177|202234489|2014-08-14 13:41:37|...
2 1788776|1831182|205234489|2014-08-14 13:42:41|...
3 1788780|1831186|202634489|2014-08-14 13:43:46|...
output = df.to_csv(header=False, index=False)
print pd.read_csv(io.StringIO(u""+output), sep="|", header=None)
0 1 2 3 4 5 6 7 \
0 1788768 1831171 208434489 2014-08-14 13:40:02 108 c NaN Desktop
1 1788772 1831177 202234489 2014-08-14 13:41:37 108 c NaN iOS
2 1788776 1831182 205234489 2014-08-14 13:42:41 108 c NaN Desktop
3 1788780 1831186 202634489 2014-08-14 13:43:46 108 c NaN Desktop
8 9 10 11 \
0 coupon 49 XXX Ave Australia Victoria
1 NaN u7 38-46 South Street Australia New South Wales
2 NaN Level XXX Margaret Street(My Company) Australia New South Wales
3 NaN Po box ZZZ Australia New South Wales
12 13
0 3025 Melbourne
1 2116 Sydney
2 2000 Sydney
3 2444 NSW Other
But if the data are correct, add the parameter names=range(15) to read_csv:
print(pd.read_csv(io.StringIO(u"" + output), sep="|", names=range(15)))
0 1 2 3 4 5 6 7 \
0 1788768 1831171 208434489 2014-08-14 13:40:02 108 c NaN Desktop
1 1788772 1831177 202234489 2014-08-14 13:41:37 108 c NaN iOS
2 1788776 1831182 205234489 2014-08-14 13:42:41 108 c NaN Desktop
3 1788780 1831186 202634489 2014-08-14 13:43:46 108 c NaN Desktop
8 9 10 11 \
0 coupon 49 XXX Ave Australia Victoria
1 NaN u7 38-46 South Street Australia New South Wales
2 NaN Level XXX Margaret Street(My Company) Australia New South Wales
3 NaN Po box ZZZ Australia New South Wales
12 13 14
0 3025 Melbourne NaN
1 2116 Sydney NaN
2 2000 Sydney Sydney
3 2444 NSW Other Port Macquarie
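The two-step round trip above can also be collapsed by fixing the backslash-broken lines directly in memory, assuming a trailing backslash plus newline is the only kind of breakage (hypothetical miniature data below):

```python
import io

import pandas as pd

# Hypothetical miniature: the second record is split by a trailing backslash.
raw = (
    "1|49 XXX Ave|Australia\n"
    "2|Level XXX Margaret Street\\\n"
    "(My Company)|Australia\n"
    "3|Po box ZZZ|Australia\n"
)
# Join any line that ends in a backslash with the line after it.
fixed = raw.replace("\\\n", "")
df = pd.read_csv(io.StringIO(fixed), sep="|", header=None)
print(df)
```

Note that this drops the backslash itself, matching the "Level XXX Margaret Street(My Company)" value shown in the outputs above.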
I need to convert a huge number of files in a structured text format into Excel (CSV would work) to be able to merge them with some other data I have.
Here is a sample of the text:
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: NORTHQUEST CAPITAL FUND INC
CENTRAL INDEX KEY: 0001142728
IRS NUMBER: 223772454
STATE OF INCORPORATION: NJ
FISCAL YEAR END: 1231
FILING VALUES:
FORM TYPE: NSAR-A
SEC ACT: 1940 Act
SEC FILE NUMBER: 811-10419
FILM NUMBER: 03805344
BUSINESS ADDRESS:
STREET 1: 16 RIMWOOD LANE
CITY: COLTS NECK
STATE: NJ
ZIP: 07722
BUSINESS PHONE: 7328423504
FORMER COMPANY:
FORMER CONFORMED NAME: NORTHPOINT CAPITAL FUND INC
DATE OF NAME CHANGE: 20010615
</SEC-HEADER>
<DOCUMENT>
<TYPE>NSAR-A
<SEQUENCE>1
<FILENAME>answer.fil
<DESCRIPTION>ANSWER.FIL
<TEXT>
<PAGE> PAGE 1
000 A000000 06/30/2003
000 C000000 0001142728
000 D000000 N
000 E000000 NF
000 F000000 Y
000 G000000 N
000 H000000 N
000 I000000 6.1
000 J000000 A
001 A000000 NORTHQUEST CAPITAL FUND, INC.
001 B000000 811-10493
001 C000000 7328921057
002 A000000 16 RIMWOOD LANE
002 B000000 COLTS NECK
002 C000000 NJ
002 D010000 07722
003 000000 N
004 000000 N
005 000000 N
006 000000 N
007 A000000 N
007 B000000 0
007 C010100 1
007 C010200 2
007 C010300 3
007 C010400 4
007 C010500 5
007 C010600 6
007 C010700 7
007 C010800 8
007 C010900 9
007 C011000 10
008 A000001 EMERALD RESEARCH CORP.
008 B000001 A
008 C000001 801-60455
008 D010001 BRICK
008 D020001 NJ
008 D030001 08724
013 A000001 SANVILLE & COMPANY
013 B010001 ABINGTON
013 B020001 PA
013 B030001 19001
015 A000001 FLEET BANK
015 B000001 C
015 C010001 POINT PLEASANT BEACH
015 C020001 NJ
015 C030001 08742
015 E030001 X
018 000000 Y
019 A000000 N
019 B000000 0
<PAGE> PAGE 2
020 A000001 SCHWAB
020 B000001 94-1737782
020 C000001 0
020 A000002 BESTVEST BROOKERAGE
020 B000002 23-1452837
020 C000002 0
and it continues to page 8 with the same structure.
The information about the company's name should go into its own columns; for the rest, the first two values are the column names and the third value is the value of the row.
I was trying to work it out with pyparsing but haven't been able to do so successfully.
Any comment on the approach would be helpful.
The way you describe it, these are like key:value pairs for each file. I would handle the parsing part like this:
import sys
import re
import csv
colonseparated = re.compile(r' *(.+) *: *(.+) *')
# keys look like '000 A000000' or '003 000000' (second token is 6-7 chars)
fixedfields = re.compile(r'(\d{3} \w{6,7}) +(.*)')
matchers = [colonseparated, fixedfields]
outfile = csv.writer(open('out.csv', 'w'))
outfile.writerow(['Filename', 'Key', 'Value'])
for filename in sys.argv[1:]:
    for line in open(filename):
        line = line.strip()
        for matcher in matchers:
            match = matcher.match(line)
            if match:
                outfile.writerow([filename] + list(match.groups()))
You can call this something like parser.py and call it with python parser.py *.infile or whatever your filename convention is. It will create a csv file with three columns: a filename, a key and a value. You can open this in excel and then use a pivot table to get the values into the correct format.
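The pivot step can also stay in pandas instead of Excel. A sketch, assuming the three-column Filename/Key/Value CSV produced above (the file names and keys here are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical extract in the shape written by the parser above.
long_df = pd.read_csv(io.StringIO(
    "Filename,Key,Value\n"
    "a.txt,FORM TYPE,NSAR-A\n"
    "a.txt,STATE OF INCORPORATION,NJ\n"
    "b.txt,FORM TYPE,NSAR-A\n"
))
# One row per file, one column per key; keys missing for a file become NaN
wide = long_df.pivot(index="Filename", columns="Key", values="Value")
print(wide)
```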
Alternatively you can use this:
import csv

headers = []
rows = {}
filenames = []
outfile = csv.writer(open('flat.csv', 'w'))
infile = csv.reader(open('out.csv'))
next(infile)  # skip the header row
for filename, key, value in infile:
    if filename not in rows:
        rows[filename] = {}
        filenames.append(filename)
    if key not in headers:
        headers.append(key)
    rows[filename][key] = value
outfile.writerow(headers)
for filename in filenames:
    outfile.writerow([rows[filename].get(header, '') for header in headers])