I have a text file which contains the lines below:
Cycle 0 DUT 2 Bad Block : 2,4,6,7,8,10,12,14,16,18,20,22,24,26,28
Cycle 0 DUT 3 Bad Block : 4,6,8,10,12,14,16,18,20,22,24,26
Cycle 0 DUT 4 Bad Block : 4,6,8,10,12,14,16,18,20,22,24,26
Cycle 1 DUT 2 Bad Block : 2,4,6,7,8,10,12,14,16,18,20,22,24,26,28
Cycle 1 DUT 3 Bad Block : 4,6,8,10,12,14,16,18,20,22,24,26,28,30,32
I want to compare the Cycle 0 DUT 2 line (the comma-separated numbers after the colon) to the Cycle 1 DUT 2 line and get the differences, then compare the Cycle 0 DUT 3 line to the Cycle 1 DUT 3 line the same way, i.e. get the values unique to either cycle.
I guess you want to key things to the DUT digit:
import re

dut_data = {}
cycle_dut = re.compile(r'^Cycle\s+(\d+)\s+DUT\s+(\d+)\s+Bad Block\s*:\s*(.*)$')

with open(inputfile, 'r') as infile:
    for line in infile:
        match = cycle_dut.search(line)
        if match:
            cycle, dut, data = match.groups()
            data = [int(v) for v in data.split(',')]
            if cycle == '0':
                # Store cycle 0 DUT values keyed on the DUT number
                dut_data[dut] = data
            else:
                # Compare against cycle 0 data, if the same DUT number was present
                cycle_0_data = dut_data.get(dut)
                if cycle_0_data is not None:
                    # Symmetric difference: values unique to either cycle
                    diffs = sorted(set(cycle_0_data).symmetric_difference(data))
                    print('DUT {} differences: {}'.format(dut, ','.join(str(v) for v in diffs)))
I used a quick symmetric set difference to print the values unique to either cycle; this may require refining.
For your sample data, this prints:
DUT 2 differences:
DUT 3 differences: 28,30,32
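If you only care about blocks that are newly bad in cycle 1 (rather than values unique to either cycle), a one-directional set difference is a drop-in replacement for the comparison above; a small sketch, not something the question strictly asks for:

new_blocks = sorted(set(data) - set(cycle_0_data))  # in cycle 1 but not in cycle 0
print('DUT {} new bad blocks: {}'.format(dut, ','.join(str(v) for v in new_blocks)))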
I have a file with lines in this format:
CALSPHERE 1
1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996
2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319
CALSPHERE 2
1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990
2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421
..etc.
I would like to parse this into a dictionary of the format:
{CALSPHERE 1:(1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996, 2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319),
CALSPHERE 2:(1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990, 2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421),...}
I'm puzzled as to how to parse this so that every third line is the key, with the following two lines forming a tuple for the value. What would be the best way to do this in Python?
I've attempted to add some logic for "every third line" though it seems kind of convoluted; something like
with open(r"file") as f:
i = 3
for line in f:
if i%3=0:
key = line
else:
#not sure what to do with the next lines here
If your file always has the same structure (the 'CALSPHERE' line, or whatever other line you want as your dictionary key, followed by two data lines), you can achieve what you want as follows:
with open(filename) as file:
    lines = file.read().splitlines()

d = dict()
for i in range(0, len(lines), 3):
    d[lines[i].strip()] = (lines[i + 1], lines[i + 2])
Output:
{
'CALSPHERE 1': ('1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996', '2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319'),
'CALSPHERE 2': ('1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990', '2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421')
}
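An equivalent way to take the lines three at a time, avoiding the index arithmetic, is the grouper idiom of zipping an iterator with itself; a minimal sketch assuming the same fixed three-line layout:

with open(filename) as file:
    it = iter(file.read().splitlines())
    # zipping the same iterator three times yields consecutive triples
    d = {name.strip(): (line1, line2) for name, line1, line2 in zip(it, it, it)}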
Assuming that your content is in file.txt, you can use the following.
It works for any number of CALSPHERE occurrences, and for a varying number of lines between them.
with open('file.txt') as inp:
    buffer = []
    for line in inp:
        # remove newline
        copy = line.replace('\n', '')
        # check if this line starts the next entry
        if 'CALSPHERE' in copy:
            buffer.append([])
        # add line to the current entry
        buffer[-1].append(copy)

# put the output into a dictionary
res = {}
for chunk in buffer:
    # safety check: skip entries with no data lines
    if len(chunk) > 1:
        res[chunk[0]] = tuple(chunk[1:])

print(res)
I have a .txt file with 3 different columns. The first is just numbers. The second is numbers from 0 to 7. The last is sentence-like text. I want to keep the lines in different lists, matched by the number in the first column, and I want to write a function for this. How can I separate them into different lists without disrupting them?
The example of .txt:
1234 0 my name is
6789 2 I am coming
2346 1 are you new?
1234 2 Who are you?
1234 1 how's going on?
And I want to keep them like this:
----1----
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
----2----
2346 1 are you new?
----3----
6789 2 I am coming
What I've tried so far:
inputfile=open('input.txt','r').read()
m_id=[]
p_id=[]
packet_mes=[]
input_file=inputfile.split(" ")
print(input_file)
input_file=line.split()
m_id=[int(x) for x in input_file if x.isdigit()]
p_id=[x for x in input_file if not x.isdigit()]
With your current approach, you are reading the entire file as a single string and splitting it on whitespace (you'd much rather split on newlines, because each line is separated by a newline). Furthermore, you're not segregating your data into separate columns properly.
You have 3 columns. You can split each line into 3 parts using str.split(None, 2); passing None splits on any run of whitespace, and the 2 limits it to two splits so the sentence stays intact. Each group is stored as a key-list pair inside a dictionary. Here I use an OrderedDict in case you need to maintain order, but you can just as easily declare o = {} as a normal dictionary with the same grouping (but no order!).
from collections import OrderedDict

o = OrderedDict()
with open('input.txt', 'r') as f:
    for line in f:
        i, j, k = line.strip().split(None, 2)
        o.setdefault(i, []).append([int(i), int(j), k])

print(dict(o))
{'1234': [[1234, 0, 'my name is'],
[1234, 2, 'Who are you?'],
[1234, 1, "how's going on?"]],
'6789': [[6789, 2, 'I am coming']],
'2346': [[2346, 1, 'are you new?']]}
Always use the with...as context manager when working with file I/O - it makes for clean code. Also, note that for larger files, iterating over each line is more memory efficient.
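If you also want to print the ----N---- report shown in the question, a small follow-up sketch using the o dictionary built above (the formatting is mine):

for n, key in enumerate(sorted(o), start=1):
    print('----{}----'.format(n))
    for m_id, p_id, text in sorted(o[key], key=lambda row: row[1]):
        print(m_id, p_id, text)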
Maybe you want something like this:
import re

# Collect data from input file
h = {}
with open('input.txt', 'r') as f:
    for line in f:
        res = re.match(r"^(\d+)\s+(\d+)\s+(.*)$", line)
        if res:
            if res.group(1) not in h:
                h[res.group(1)] = []
            h[res.group(1)].append((res.group(2), res.group(3)))

# Output result
for i, x in enumerate(sorted(h.keys())):
    print("-------- %s -----------" % (i + 1))
    for y in sorted(h[x]):
        print("%s %s %s" % (x, y[0], y[1]))
The result is as follows (add more newlines if you like):
-------- 1 -----------
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
-------- 2 -----------
2346 1 are you new?
-------- 3 -----------
6789 2 I am coming
It's based on regexes (module re in Python), which are a good tool when you want to match simple line-based patterns.
Here the regex relies on whitespace as the column separator, but it can easily be adapted to fixed-width columns, as sketched below.
The results are collected in a dictionary of lists, each list containing (position, text) pairs, and the items are sorted for output.
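For illustration, a hypothetical fixed-width variant of the parsing step (the column positions are assumptions, not taken from the question):

# Hypothetical layout: chars 0-3 = id, char 5 = position, chars 7+ = text.
with open('input.txt') as f:
    for line in f:
        key, pos, text = line[0:4], line[5:6], line[7:].rstrip()
        # ...then group into h exactly as in the regex version above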
It's quite ugly code, but it's quite easy to understand.
raw = []
with open("input.txt", "r") as file:
    for x in file:
        raw.append(x.strip().split(None, 2))

raw = sorted(raw)
title = raw[0][0]
refined = []
cluster = []
for x in raw:
    if x[0] == title:
        cluster.append(x)
    else:
        refined.append(cluster)
        cluster = []
        title = x[0]
        cluster.append(x)
refined.append(cluster)

for number, group in enumerate(refined):
    print("-" * 10 + str(number) + "-" * 10)
    for line in group:
        print(*line)
I have a data.dat file that has 3 columns; the 3rd column is just the numbers 1 to 6 repeated again and again:
(In reality, column 3 has numbers from 1 to 1917, but for a minimal working example, let's stick to 1 to 6.)
# Title
127.26 134.85 1
127.26 135.76 2
127.26 135.76 3
127.26 160.97 4
127.26 160.97 5
127.26 201.49 6
125.88 132.67 1
125.88 140.07 2
125.88 140.07 3
125.88 165.05 4
125.88 165.05 5
125.88 203.06 6
137.20 140.97 1
137.20 140.97 2
137.20 148.21 3
137.20 155.37 4
137.20 155.37 5
137.20 184.07 6
I would like to:
1) extract the lines that contain 1 in the 3rd column and save them to a file called mode_1.dat.
2) extract the lines that contain 2 in the 3rd column and save them to a file called mode_2.dat.
3) extract the lines that contain 3 in the 3rd column and save them to a file called mode_3.dat.
.
.
.
6) extract the lines that contain 6 in the 3rd column and save them to a file called mode_6.dat.
In order to accomplish this, I have:
a) defined a variable factor = 6
b) created a one_to_factor list that has the numbers 1 to 6
c) used a re.search statement to extract the lines for each value of one_to_factor (the %s is filled in with each i from the one_to_factor list)
d) appended these results to an empty LINES list.
However, this does not work. I cannot manage to extract the lines that contain i in the 3rd column and save them to a file called mode_i.dat
I would appreciate if you could help me.
import re

factor = 6
one_to_factor = range(1, factor + 1)
LINES = []
f_2 = open('data.dat', 'r')
for line in f_2:
    for i in one_to_factor:
        if re.search(r' \b%s$' % i, line):
            print 'line = ', line
            LINES.append(line)
print 'LINES =', LINES
I would do it like this:
no regexes, just use str.split() to split according to whitespace
use last item (the digit) of the current line to generate the filename
use a dictionary to open the file the first time, and reuse the handle for subsequent matches (write title line at file open)
close all handles in the end
code:
title_line = "# Vol \t Freq \t Mod \n"
handles = dict()
next(f_2)  # skip title
for line in f_2:
    toks = line.split()
    filename = "mode_{}.dat".format(toks[-1])
    # create the file the first time its id is encountered
    if filename not in handles:
        handles[filename] = open(filename, "w")
        handles[filename].write(title_line)  # write title
    handles[filename].write(line)

# close all files
for v in handles.values():
    v.close()
EDIT: that's the fastest way, but the problem is that if you have too many suffixes (as in your real example), you'll get a "too many open files" exception. So for that case there's a slightly less efficient method which works too:
import glob
import os

# pre-processing: clean up old files if any
for f in glob.glob("mode_*.dat"):
    os.remove(f)

next(f_2)  # skip title
s = set()
title_line = "# Vol \t Freq \t Mod \n"
for line in f_2:
    toks = line.split()
    filename = "mode_{}.dat".format(toks[-1])
    with open(filename, "a") as f:
        if filename not in s:
            s.add(filename)
            f.write(title_line)
        f.write(line)
It basically opens the file in append mode, writes the line, and closes the file.
(The set is used to detect the first write to each file, so the title can be written before the data.)
There's a directory cleanup first to ensure that no data is left over from a previous computation: append mode expects that no file exists, and if the input data set changes, an identifier not present in the new dataset would leave an "orphan" file remaining from the previous run.
First, instead of looping over your one_to_factor, you can get the index in one step:
index = line.split()[-1]  # last whitespace-separated field on the line
Then, you can check whether index is in your one_to_factor list.
You should create a dictionary of lists to store your lines.
Something like:
{ "1" : [line1, line7, ...],
"2" : ....
}
And then you can use the keys of the dictionary to create the files and populate them with lines, as sketched below.
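A minimal sketch of that approach (the variable names are mine, and it assumes the data.dat layout from the question):

lines_by_mode = {}
with open('data.dat') as f:
    next(f)  # skip the "# Title" line
    for line in f:
        index = line.split()[-1]  # e.g. '1' .. '6'
        lines_by_mode.setdefault(index, []).append(line)

for index, lines in lines_by_mode.items():
    with open('mode_{}.dat'.format(index), 'w') as out:
        out.writelines(lines)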
I have a bunch of files from a LAMMPS simulation, each made of a 9-line header and an Nx5 data array (N is large, of order 10000). A file looks like this:
ITEM: TIMESTEP
1700000
ITEM: NUMBER OF ATOMS
40900
ITEM: BOX BOUNDS pp pp pp
0 59.39
0 59.39
0 59.39
ITEM: ATOMS id type xu yu zu
1 1 -68.737755560980844 1.190046376093027 122.754819323806714
2 1 -68.334493269859621 0.365731265115530 122.943111038981527
3 1 -68.413018326512173 -0.456802254452782 123.436843456292138
4 1 -68.821350328206080 -1.360098170077123 123.314784135612115
5 1 -67.876948635447775 -1.533699833382506 123.072964235308660
6 1 -67.062910322675322 -2.006415676993953 123.431518511867381
7 1 -67.069984116148134 -2.899068427170739 123.057125785834685
8 1 -66.207325578729183 -3.292545155979909 123.377770523297343
...
I would like to open every file, perform a certain operation on the numerical data and save the file with a different name leaving the header unchanged. My script is:
for f in files:
filename=path+"/"+f
with open(filename) as myfile:
header = ' '.join([next(myfile) for x in xrange(9)])
data=np.loadtxt(filename,skiprows=9)
data[:,2:5]%=L #Put everything inside the box...
savetxt(filename.replace("lammpstrj","fold.lammpstrj"),data,header=header,comments="",fmt="%d %d %.15f %.15f %.15f")
The output, though, looks like this:
ITEM: TIMESTEP
1700000
ITEM: NUMBER OF ATOMS
40900
ITEM: BOX BOUNDS pp pp pp
0 59.39
0 59.39
0 59.39
ITEM: ATOMS id type xu yu zu
1 1 50.042244439019157 1.190046376093027 3.974819323806713
2 1 50.445506730140380 0.365731265115530 4.163111038981526
3 1 50.366981673487828 58.933197745547218 4.656843456292137
4 1 49.958649671793921 58.029901829922878 4.534784135612114
5 1 50.903051364552226 57.856300166617494 4.292964235308659
6 1 51.717089677324680 57.383584323006048 4.651518511867380
7 1 51.710015883851867 56.490931572829261 4.277125785834684
8 1 52.572674421270818 56.097454844020092 4.597770523297342
...
The header is not exactly the same: there are spaces at the beginning of every line except the first, and a newline after the last line of the header. I need to get rid of those, but I don't know how.
What am I doing wrong?
The issue is in the ' '.join(a):
>>> a = ['sadf\n', 'sdfg\n']
>>> ' '.join(a)
'sadf\n sdfg\n'
Note the space at the start of the second line. Instead:
>>> ''.join(a)
'sadf\nsdfg\n'
You will also need to trim the last '\n' in your header to prevent the empty line:
>>> ''.join(a).rstrip()
'sadf\nsdfg'
The header parameter will add a newline after it automatically, so you can eliminate the original last '\n' as a redundant newline.
header = header.rstrip('\n')
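Applied to the script from the question, the two fixes combine into one line (keeping the question's Python 2 xrange):

header = ''.join([next(myfile) for x in xrange(9)]).rstrip('\n')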
The leading spaces occur because you join the lines with an extra space character. You can solve it with the command below.
header = ''.join([next(myfile) for x in xrange(9)])
I have previously found a way to count prefixes, as shown below. Is there a similar way to count suffixes that is so obvious I'm missing it completely?
for i in range(0, len(hardprefix)):
    if len(word) > len(hardprefix[i]):
        if word.startswith(hardprefix[i]):
            hardprefixcount += 1
            break
I need this code to use the first column of the file and count how many of those words end with one of a set array of suffixes.
This is what I have so far:
for i in range(0, len(easysuffix)):
    if len(word) > len(easysuffix[i]):
        if word.endswith(easysuffix[i]):
            easysuffixcount += 1
            break
Below is a sample of my data from the CSV file, with the suffix arrays after that:
on 1
only 4
our 1
own 1
part 7
piece 4
pieces 4
place 1
pressed 1
riot 1
september 1
shape 3
hardsuffix = ['ism']
easysuffix = ['ity', 'esome', 'ece']
Your input file is tab-delimited CSV, so you can use the csv module to process it.
import csv

suffixes = ['ity', 'esome', 'ece']

with open('input.csv') as words:
    suffix_count = 0
    reader = csv.reader(words, delimiter='\t')
    for word, _ in reader:
        if any(word.endswith(suffix) for suffix in suffixes):
            suffix_count += 1

print("Found {} suffix(es)".format(suffix_count))