I'm trying to parse two files, one pipe-separated and one comma-separated, and when a particular field matches between them, write a new entry to a third file.
Code as follows:
#! /usr/bin/python
fo = open("c-1.txt", "r")
for line in fo:
    #print line
    fields = line.split('|')
    src = fields[0]
    f1 = open("Airport.txt", 'r')
    f2 = open("b.txt", "a")
    #with open('c.csv', 'r') as f1:
    #    line1 = f1.read()
    for line1 in f1:
        reader = line1.split(',')
        hi = False
        target = reader[0]
        if target == src and fields[1] == 'ZHT':
            print target
            hi = True
            f2.write(fields[0])
            f2.write("|")
            f2.write(fields[1])
            f2.write("|")
            f2.write(fields[2])
            f2.write("|")
            f2.write(fields[3])
            f2.write("|")
            f2.write(fields[4])
            f2.write("|")
            f2.write(fields[5])
            f2.write("|")
            f2.write(reader[2])
        if hi == False:
            f2.write(line)
    f2.close()
    f1.close()
fo.close()
The matching field gets printed twice in the new file. What could be the reason?
The problem seems to be that you reset hi to False in each iteration of the inner loop. Let's say the second line matches but the third does not: you set hi to True on the second line, then back to False on the third, and then write the original line again.
Try like this:
hi = False
for line1 in f1:
    reader = line1.split(',')
    target = reader[0]
    if target == src and fields[1] == 'ZHT':
        hi = True
        f2.write(stuff)
if hi == False:
    f2.write(line)
Or, assuming that only one line will ever match, you could use for/else:
for line1 in f1:
    reader = line1.split(',')
    target = reader[0]
    if target == src and fields[1] == 'ZHT':
        f2.write(stuff)
        break
else:
    f2.write(line)
Also note that you could replace that whole series of f2.write statements with a single one, joining the parts with |:
f2.write('|'.join(fields[0:6] + [reader[2]]))
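For instance, with the flag fix above folded in, the whole inner loop shrinks to something like this (a sketch reusing the names from your code):
hi = False
for line1 in f1:
    reader = line1.split(',')
    if reader[0] == src and fields[1] == 'ZHT':
        hi = True
        f2.write('|'.join(fields[0:6] + [reader[2]]))
if not hi:
    f2.write(line)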
As mentioned already, you reset the flag within each iteration of the loop, so you are liable to write multiple lines.
If there is definitely only one row that will ever match, it might be worth breaking out of the loop once that row has been found.
And finally, check your data to make sure there aren't identical matching rows.
Other than that, I have a couple of suggestions to clean up your code and make it easier to debug:
1) Use the csv library.
2) If the files can be held in memory, it is better to read them once than to repeatedly open and close them.
3) Use with to handle the files (I note you have already tried this in your comments).
Something like the following should work.
#! /usr/bin/python
import csv

data_0 = {}
data_1 = {}
with open("c-1.txt", "r") as fo, open("Airport.txt", "r") as f1:
    fo_reader = csv.reader(fo, delimiter="|")
    f1_reader = csv.reader(f1)  # default delimiter is ','
    for line in fo_reader:
        if line[1] == 'ZHT':
            try:  # Collect rows in a list here in case keys are duplicated.
                data_0[line[0]].append(line)
            except KeyError:
                data_0[line[0]] = [line]
    for line in f1_reader:
        data_1[line[0]] = line[2]  # We only need the third column of this row to append to the data.
with open("b.txt", "a") as f2:
    writer = csv.writer(f2, delimiter="|")  # I would be tempted not to make this a pipe, but it's probably too late if you've got a pre-made file.
    for key in data_0:
        if key in data_1:
            for row in data_0[key]:
                writer.writerow(row[:6] + [data_1[key]])  # take the first six columns and append the value from the other file
        else:
            for row in data_0[key]:
                writer.writerow(row)
That should also avoid the extra rows, since there is no True/False flag to rely on.
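To illustrate the shapes involved, with a hypothetical c-1.txt holding two ZHT rows for the same code and a matching row in Airport.txt, the two dicts would end up like this:
# hypothetical contents after the reading phase
data_0 = {'ZRH': [['ZRH', 'ZHT', 'a', 'b', 'c', 'd'],
                  ['ZRH', 'ZHT', 'e', 'f', 'g', 'h']]}  # one list per key, so duplicates survive
data_1 = {'ZRH': 'Zurich'}  # airport code -> third column of Airport.txt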
I want to achieve this specific task. I have 2 files; the first one has emails and credentials:
xavier.desprez@william.com:Xavier
xavier.locqueneux@william.com:vocojydu
xaviere.chevry@pepe.com:voluzigy
Xavier.Therin@william.com:Pussycat5
xiomara.rivera@william.com:xrhj1971
xiomara.rivera@william-honduras.william.com:xrhj1971
and the second one, with emails and location:
xavier.desprez@william.com:BOSNIA
xaviere.chevry@pepe.com:ROMANIA
I want that whenever an email from the first file is found in the second file, the row is replaced by EMAIL:CREDENTIAL:LOCATION, and when it is not found, it ends up as EMAIL:CREDENTIAL:BLANK.
so the final file must be like this:
xavier.desprez@william.com:Xavier:BOSNIA
xavier.locqueneux@william.com:vocojydu:BLANK
xaviere.chevry@pepe.com:voluzigy:ROMANIA
Xavier.Therin@william.com:Pussycat5:BLANK
xiomara.rivera@william.com:xrhj1971:BLANK
xiomara.rivera@william-honduras.william.com:xrhj1971:BLANK
I have made several attempts in Python, but none of them are worth posting because I am not really close to the solution.
Regards!
EDIT:
This is what I tried:
import os
import sys

with open("test.txt", "r") as a_file:
    for line_a in a_file:
        stripped_email_a = line_a.strip().split(':')[0]
        with open("location.txt", "r") as b_file:
            for line_b in b_file:
                stripped_email_b = line_b.strip().split(':')[0]
                location = line_b.strip().split(':')[1]
                if stripped_email_a == stripped_email_b:
                    a = line_a + ":" + location
                    print(a.replace("\n",""))
                else:
                    b = line_a + ":BLANK"
                    print(b.replace("\n",""))
This is the result I get:
xavier.desprez@william.com:Xavier:BOSNIA
xavier.desprez@william.com:Xavier:BLANK
xaviere.chevry@pepe.com:voluzigy:BLANK
xaviere.chevry@pepe.com:voluzigy:ROMANIA
xavier.locqueneux@william.com:vocojydu:BLANK
xavier.locqueneux@william.com:vocojydu:BLANK
Xavier.Therin@william.com:Pussycat5:BLANK
Xavier.Therin@william.com:Pussycat5:BLANK
xiomara.rivera@william.com:xrhj1971:BLANK
xiomara.rivera@william.com:xrhj1971:BLANK
xiomara.rivera@william-honduras.william.com:xrhj1971:BLANK
xiomara.rivera@william-honduras.william.com:xrhj1971:BLANK
I am very close but I get duplicates ;)
Regards
The duplication issue comes from reading the two files in a nested way: once a line from test.txt is read, you open location.txt and process it in full; then you read the second line from test.txt, re-open location.txt, and process it all over again.
Instead, gather all the necessary data from location.txt into a dictionary first, and then use it while reading test.txt:
email_loc_dict = {}
with open("location.txt", "r") as b_file:
    for line_b in b_file:
        splits = line_b.strip().split(':')
        email_loc_dict[splits[0]] = splits[1]

with open("test.txt", "r") as a_file:
    for line_a in a_file:
        line_a = line_a.strip()
        stripped_email_a = line_a.split(':')[0]
        if stripped_email_a in email_loc_dict:
            a = line_a + ":" + email_loc_dict[stripped_email_a]
            print(a)
        else:
            b = line_a + ":BLANK"
            print(b)
Output:
xavier.desprez@william.com:Xavier:BOSNIA
xavier.locqueneux@william.com:vocojydu:BLANK
xaviere.chevry@pepe.com:voluzigy:ROMANIA
Xavier.Therin@william.com:Pussycat5:BLANK
xiomara.rivera@william.com:xrhj1971:BLANK
xiomara.rivera@william-honduras.william.com:xrhj1971:BLANK
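The same lookup-with-default can be written more compactly with dict.get, which collapses the if/else (a small sketch reusing the names above):
with open("test.txt", "r") as a_file:
    for line_a in a_file:
        line_a = line_a.strip()
        email = line_a.split(':')[0]
        # .get falls back to "BLANK" when the email has no known location
        print(line_a + ":" + email_loc_dict.get(email, "BLANK"))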
I am trying to analyze a text file with tabular data: columns and records.
My file:
Name Surname Age Sex Grade
Chris M. 14 M 4
Adam A. 17 M
Jack O. M 8
The text file has some empty fields, as shown above.
The user wants to show Name and Grade:
import csv

with open('launchlog.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split() for line in stripped if line)
    with open('log.txt', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('Name', 'Surname', 'Age', 'Sex', 'Grade'))
        writer.writerows(lines)
log.txt :
Chris,M.,14,M,4
Adam,A.,17,M
Jack,O.,M,8
How can I insert the string "None" for the empty fields?
For example:
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
What would be the best way to do this in Python?
Use pandas:
import pandas
data = pandas.read_fwf("file.txt")
To get your dictionary:
data.set_index("Name")["Grade"].to_dict()
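For the "None" placeholders specifically, a minimal sketch (assuming the fixed-width layout of the sample and the file names from the question):
import pandas

data = pandas.read_fwf("file.txt")  # column boundaries are inferred from the header row
# empty cells are read as NaN; replace them with the literal string "None"
data.fillna("None").to_csv("log.txt", index=False, header=False)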
Here's something in pure Python that seems to do what you want, at least on the sample data file in your question.
In a nutshell, it first determines where each of the field names in the column header line starts and ends. Then, for each remaining line of the file, it computes the same spans and uses them to determine which column each data item sits under, placing the item in its proper position in the row written to the output file.
import csv

def find_words(line):
    """ Return a list of (start, stop) tuples with the indices of the
        first and last characters of each "word" in the given string.
        Any sequence of consecutive non-space characters is considered
        as comprising a word.
    """
    line_len = len(line)
    indices = []
    i = 0
    while i < line_len:
        start, count = i, 0
        while line[i] != ' ':
            count += 1
            i += 1
            if i >= line_len:
                break
        indices.append((start, start+count-1))
        while i < line_len and line[i] == ' ':  # advance to start of next word
            i += 1
    return indices

# convert text file with missing fields to csv
with open('name_grades.txt', 'rt') as in_file, open('log.csv', 'wt', newline='') as out_file:
    writer = csv.writer(out_file)
    header = next(in_file)  # read first line
    fields = header.split()
    writer.writerow(fields)
    # determine the indices of where each field starts and stops based on header line
    field_positions = find_words(header)
    for line in in_file:
        line = line.rstrip('\r\n')  # remove trailing newline
        row = ['None' for _ in range(len(fields))]
        value_positions = find_words(line)
        for (vstart, vstop) in value_positions:
            # determine what field the value is underneath
            for i, (hstart, hstop) in enumerate(field_positions):
                if vstart <= hstop and hstart <= vstop:  # overlap?
                    row[i] = line[vstart:vstop+1]
                    break  # stop looking
        writer.writerow(row)
Here's the contents of the log.csv file it created:
Name,Surname,Age,Sex,Grade
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
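As a quick sanity check of the helper, here is what find_words returns for the sample header line (spans computed for that exact single-spaced string):
>>> find_words('Name Surname Age Sex Grade')
[(0, 3), (5, 11), (13, 15), (17, 19), (21, 25)]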
I would use baloo's answer over mine, but if you just want to get a feel for where your code went wrong, the solution below mostly works (there is a formatting issue with the Grade field, but I'm sure you can get through that). Add some print statements to your code and to mine and you should be able to pick up the differences.
import csv
<Old Code removed in favor of new code below>
EDIT: I see your difficulty now. Please try the code below; I'm out of time today, so you will have to fill in the writer parts where the print statement is, but this will fulfill your request to replace empty fields with None.
import csv

with open('Test.txt', 'r') as in_file:
    with open('log.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        lines = [line for line in in_file]
        name_and_grade = dict()
        for line in lines[1:]:
            parts = line[0:10], line[11:19], line[20:24], line[25:31], line[32:]
            new_line = list()
            for part in parts:
                val = part.replace('\n', '')  # note: '\n', not '/n', or the newline is never removed
                val = val.strip()
                val = val if val != '' else 'None'
                new_line.append(val)
            print(new_line)
Without using pandas:
Edited based on your comment. I hard-coded this solution to your data, so it will not work for rows that don't have the Surname column.
I'm writing out only Name and Grade, since you need just those two columns.
o = open("out.txt", 'w')
with open("inFIle.txt") as f:
    for lines in f:
        lines = lines.strip("\n").split(",")
        try:
            grade = int(lines[-1])
            if (lines[-2][-1]) != '.':
                o.write(lines[0] + "," + str(grade) + "\n")
        except ValueError:
            print(lines)
o.close()
I have a CSV file, which I created using an HTML export from a Check Point firewall policy.
In some cases, a rule is represented as several lines. That occurs when a rule has several address sources, destinations, or services.
I need the output to have each rule described in only one line.
It's easy to distinguish when each rule begins. In the first column, there's the rule ID, which is a number.
Here's an example. In green are marked the strings that should be moved:
http://i.imgur.com/i785sDi.jpg
Let me show you an example:
NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;
1;;fwgcluster;mcast_vrrp;vrrp;accept;
;;;;igmp;;
2;Testing;fwgcluster;fwgcluster;FireWall;accept;
;;fwmgmpe;fwmgmpe;ssh;;
;;fwmgm;fwmgm;;;
What I need, explained in pseudo code, is this:
Read the first column of the next line. If there's a number:
    Evaluate the first column of the line after it. If there's no number there, concatenate (separating with a comma) the strings in its columns with those of the numbered line, and eliminate the text in the current one.
The output should be something like this:
NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;
1;;fwgcluster;mcast_vrrp;vrrp-igmp;accept;
;;;;;;
2;Testing;fwgcluster-fwmgmpe-fwmgm;fwgcluster-fwmgmpe-fwmgm;FireWall-ssh;accept;
;;;;;;
The empty lines are there only to be more clear, I don't actually need them.
Thanks!
This should get you started:
import csv

with open('data.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=';')
    for r in reader:
        print r
EDIT: Given your required output, this should get you nearly there. It's a bit crude, but it does the majority of what you need: it checks the 'NO.' key, and if that has a value it starts a new record; if not, it joins any other data in the row onto the equivalent data in the current record. Finally, whenever a new record is created the old one is appended to the result, and this also happens at the end to catch the last item.
import csv

result, record = [], None
with open('data2.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=';', lineterminator='\n')
    for r in reader:
        if r['NO.']:
            if record:
                result.append(record)
            record = r
        else:
            for key in r.keys():
                if r[key]:
                    record[key] = '-'.join([record[key], r[key]])
if record:
    result.append(record)
print result
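To write result back out to a file, a csv.DictWriter sketch along these lines should work (merged.txt is just an example name; extrasaction='ignore' skips the empty trailing column that the header's final ';' creates, so the rewritten rows lose their trailing semicolon):
with open('merged.txt', 'w') as out:
    fieldnames = ['NO.', 'NAME', 'SOURCE', 'DESTINATION', 'SERVICE', 'ACTION']
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter=';',
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()  # reproduce the NO.;NAME;... header line
    writer.writerows(result)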
Graeme, thanks again, just before your edit I solved it with the following code.
But you got me looking in the right direction!
If anyone needs it, here it is:
import csv

# adjust these 3 lines
WRITE_EMPTIES = False
INFILE = "input.csv"
OUTFILE = "output.csv"

with open(INFILE, "r") as in_file:
    r = csv.reader(in_file, delimiter=";")
    with open(OUTFILE, "wb") as out_file:
        previous = None
        empties_to_write = 0
        out_writer = csv.writer(out_file, delimiter=";")
        for i, row in enumerate(r):
            first_val = row[0].strip()
            if first_val:
                if previous:
                    out_writer.writerow(previous)
                    if WRITE_EMPTIES and empties_to_write:
                        out_writer.writerows(
                            [["" for _ in previous]] * empties_to_write
                        )
                        empties_to_write = 0
                previous = row
            else:  # append sub-portions to each other
                previous = [
                    "|".join(
                        subitem
                        for subitem in existing.split(",") + [new]
                        if subitem
                    )
                    for existing, new in zip(previous, row)
                ]
                empties_to_write += 1
        if previous:  # take care of the last row
            out_writer.writerow(previous)
            if WRITE_EMPTIES and empties_to_write:
                out_writer.writerows(
                    [["" for _ in previous]] * empties_to_write
                )
I want to get data from a table in a text file into a python array. The text file that I am using as an input has 7 columns and 31 rows. Here is an example of the first two rows:
10672 34.332875 5.360831 0.00004035881220 0.00000515052523 4.52E-07 6.5E-07
12709 40.837833 19.429158 0.00012010938453 -0.00000506426720 7.76E-06 2.9E-07
The code that I have tried to write isn't working, as it does not read one line at a time when it goes through the for loop.
data = []
f = open('hyadeserr.txt', 'r')
while True:
    eof = "no"
    array = []
    for i in range(7):
        line = f.readline()
        word = line.split()
        if len(word) == 0:
            eof = "yes"
        else:
            array.append(float(word[0]))
    print array
    if eof == "yes": break
    data.append(array)
Any help would be greatly appreciated.
A file with space-separated values is just a dialect of the classic comma-separated values (CSV) file where the delimiter is a space (' '), possibly followed by more spaces, which can be ignored.
Happily, Python comes with a csv.reader class that understands dialects.
You should use this:
Example:
#!/usr/bin/env python
import csv

csv.register_dialect('ssv', delimiter=' ', skipinitialspace=True)

data = []
with open('hyadeserr.txt', 'r') as f:
    reader = csv.reader(f, 'ssv')
    for row in reader:
        floats = [float(column) for column in row]
        data.append(floats)

print data
If you don't want to use csv here, since you don't really need it:
data = []
with open("hyadeserr.txt") as file:
    for line in file:
        data.append([float(f) for f in line.strip().split()])
Or, if you know for sure that the only extra chars are spaces and line ending \n, you can turn the last line into:
data.append([float(f) for f in line[:-1].split()])
I am trying to match elements 0, 2, 3, and 4 of an array storing the columns of one tab-delimited file against elements 0, 2, 3, and 4 of another array storing the columns of another tab-delimited file, and to print out element 5 (column 6) from both input files in Python.
Here is the code I worked on, but it matches the two files line by line. What I want is to match each line of file1 against any line in file2.
#!/usr/bin/python
import sys
import itertools
import csv, pprint
from array import *

#print len(sys.argv)
if len(sys.argv) != 4:
    print 'Usage: python scores.py <infile1> <infile2> <outfile>'
    sys.exit(1)

f1 = open("/home/user/ab/ab/ab/file1.txt", "r")
f2 = open("/home/user/ab/ab/ab/file2.txt", "r")
f3 = open("out.txt", "w")
lines1 = f1.readlines()
lines2 = f2.readlines()
for f1line, f2line in zip(lines1, lines2):  # for loop to read lines line by line simultaneously from two files
    #for f1line, f2line in itertools.izip(lines1, lines2):
    row1 = f1line.split('\t')  # split on tab
    row2 = f2line.split('\t')  # split on tab
    if (row1[0:1] + row1[2:5]) == (row2[0:1] + row2[2:5]):  # columns 0,2,3,4 matching between two infiles
        writer = csv.writer(f3, delimiter='\t')
        writer.writerow((row1[0:1] + row1[2:5]) + row1[5:6] + (row2[0:1] + row2[2:5]) + row2[5:6])
For each line in file 1 that should be matched:
import operator

op = operator.itemgetter(0, 2, 3, 4)
f2 = file2.readlines()  # read once up front, otherwise the inner loop is exhausted after the first pass
for line1 in file1:
    ...  # split 1
    for line2 in f2:
        ...  # split 2
        if op(row1) == op(row2):
            ...
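Filled in, a runnable version of that sketch might look like this (the file names, the tab delimiter, and column 5 holding the value to print are all assumptions carried over from the question):
import operator

op = operator.itemgetter(0, 2, 3, 4)
with open("file2.txt") as f2:
    lines2 = f2.readlines()  # read once so the inner loop can run repeatedly
with open("file1.txt") as file1:
    for line1 in file1:
        row1 = line1.rstrip('\n').split('\t')
        for line2 in lines2:
            row2 = line2.rstrip('\n').split('\t')
            if op(row1) == op(row2):
                # print the matched key columns plus column 5 from each file
                print '\t'.join(op(row1)) + '\t' + row1[5] + '\t' + row2[5]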
So, just do what you said: for each line of file1, match each line of file2
for f1line in lines1:
    row1 = f1line.split('\t')  # split on tab
    for f2line in lines2:
        row2 = f2line.split('\t')  # split on tab
        if (row1[0:1] + row1[2:5]) == (row2[0:1] + row2[2:5]):
            ...
This assumes that each key value (row[0,3,4,5]) is unique per file:
import sys
import csv

datalen = 12
keyfn = lambda row: tuple(row[0:1] + row[3:6])
datafn = lambda row: row[8:datalen]

def load_dict(fname, keyfn, datafn):
    with open(fname, 'rb') as inf:
        data = (row.split() for row in inf if not row.startswith('##'))
        return {keyfn(row): datafn(row) for row in data if len(row) >= datalen}

def main(fname1, fname2, outfname):
    data1 = load_dict(fname1, keyfn, datafn)
    data2 = load_dict(fname2, keyfn, datafn)
    common_keys = sorted(set(data1).intersection(data2))
    with open(outfname, 'wb') as outf:
        outcsv = csv.writer(outf, delimiter='\t')
        outcsv.writerows(list(key) + data1[key] + data2[key] for key in common_keys)

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print 'Usage: python scores.py <infile1> <infile2> <outfile>'
        sys.exit(1)
    else:
        main(*sys.argv[1:4])
Edit: problems found:
I made one mistake: the return value from the key function was a list; a list is not hashable, therefore cannot be a dictionary key. I have made the return value a tuple instead.
You, on the other hand, failed to mention that:
1) your files begin with several lines of comments (I have modified the script to ignore comment rows, meaning anything starting with ##);
2) your files are NOT tab-delimited (or at least the examples you provided are not). They actually seem to be columnar, separated by multiple spaces, which the csv module cannot handle. Luckily, the data looks simple enough to use .split() instead;
3) you are matching on the wrong columns: column 2 in your data files does not appear to match between files at all. I think you need to key on columns 0, 3, 4, 5 instead, and I have updated keyfn to reflect this.
Columns 3 and 4 appear to be identical, but I am not certain of this. If columns 3 and 4 are always identical, you could save some memory and speed things up a bit by only keying on columns 0, 4, 5: keyfn = lambda row: tuple(row[0:1] + row[4:6])
I am guessing that columns 8,9,10,11 are the desired data; I have changed datafn to reflect this. The script should now work as required.