I have a big CSV file with 1740 rows and I want to compare every line in the file with every other line in the same file.
The problem is that I can't figure out how to send the lines to the algorithm while making sure a line is not compared with itself.
import csv

# Read the CSV file line by line and collect each row
with open("pcr_data.csv", "r") as file:
    csv_reader = csv.reader(file)
    data = []
    for row in csv_reader:
        data.append(row)
# print(data)
def naive_string_matching(text, pattern):
    n = len(text)
    m = len(pattern)
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i
    return -1
def search_in_file(file_path, pattern):
    with open(file_path, "r") as file:
        text = file.readline()
        index = naive_string_matching(text, pattern)
        if index != -1:
            print("The pattern found at index: ", index)
        else:
            print("The pattern was not found in the file")

for r in range(len(data)):
    file_path = data
    pattern = r
    search_in_file(file_path, pattern)
Please help me: how can I send the lines to the algorithm and make sure a line is not compared with itself?
Create all combinations of the lines, taken two at a time, using itertools.combinations. This prevents comparing a line to itself.
from itertools import combinations

for line_one, line_two in combinations(data, r=2):
    # assert line_one != line_two
    # send line_one and line_two to the algorithm
data = list('abcd')
for one, two in combinations(data, r=2):
    assert one != two
    print((one, two), end=' | ')
>>>
('a', 'b') | ('a', 'c') | ('a', 'd') | ('b', 'c') | ('b', 'd') | ('c', 'd') |
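Putting the two pieces together, here is a sketch of feeding each distinct pair into the matching function; the data list below is a hypothetical stand-in for the real CSV rows:

```python
from itertools import combinations

def naive_string_matching(text, pattern):
    # return the first index where pattern occurs in text, or -1
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i
    return -1

# hypothetical sample data standing in for the CSV rows
data = ["abcdef", "cde", "xyz"]

# compare every distinct pair of lines, never a line with itself
for line_one, line_two in combinations(data, r=2):
    idx = naive_string_matching(line_one, line_two)
    if idx != -1:
        print(f"{line_two!r} found in {line_one!r} at index {idx}")
```

With 1740 rows this produces 1740 * 1739 / 2 comparisons, which is still manageable, and no pair is ever (line, itself).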
Related
I have 2 files: fileA consists of 1 row and fileB consists of 2 rows.
fileA (1 row):
*****s**e**********************************************q*
fileB (2 rows):
Row 1 is the subject
Row 2 is the query
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
I need to produce an output file where, if the fileA string contains an s or * at some position, the subject character at the corresponding index position will be written to the output file. If there is a q or e, the query character will be written instead.
Output:
AAAAAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABA
my code:
ff = open("filea.txt")
gg = open("fileb.txt")
file_as_list = ff.readline()
file_as_last = gg.readlines()
query = file_as_last[0]
subject = file_as_last[1]
for i in file_as_list:
    z = -1
    while z <= len(file_as_list):
        if i == "*":
            f = open('output.txt', 'a+', encoding='utf-8')
            f.write(subject[z])
            z += 1
        elif i == "s":
            f = open('output.txt', 'a+', encoding='utf-8')
            f.write(subject[z])
            z += 1
        elif i == "e":
            f = open('output.txt', 'a+', encoding='utf-8')
            f.write(query[z])
            z += 1
        elif i == "q":
            f = open('output.txt', 'a+', encoding='utf-8')
            f.write(query[z])
            z += 1
        break
Things work more or less, but not properly: the loop only ever takes the first branch and produces an output that is just a copy of the subject.
with open is used, so all files will be closed automatically
convert each string into a list, then .strip() to remove \n and \r
load the lists into a pandas.DataFrame
pandas.DataFrame.apply with axis=1, for row-wise operations
np.where to return the correct value
write out to a list and convert it into a str
write the result to the output.txt file
Code:
import pandas as pd
import numpy as np

with open('fileA.txt', 'r') as filA:
    with open('fileB.txt', 'r') as filB:
        with open('output.txt', 'w', newline='\n') as output:
            fil_a = filA.readline()
            fil_b = filB.readlines()
            sub = [x for x in fil_b[0].strip()]
            que = [x for x in fil_b[1].strip()]
            line = [x for x in fil_a.strip()]
            df = pd.DataFrame({'A': line, 'sub': sub, 'que': que})
            df['out'] = df.apply(lambda x: str(np.where(x[0] in ['*', 's'], x[1], x[2])), axis=1)
            out = df.out.to_list()
            out = ''.join(x for x in out)
            output.write(out)
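For what it's worth, the same character-by-character selection can also be sketched in pure Python with zip; the short strings below are stand-ins for the real file contents:

```python
# a minimal pure-Python sketch of the same mapping (no pandas needed);
# the sample strings below are hypothetical stand-ins for the file contents
mask    = "**e*"    # fileA: '*'/'s' -> subject char, 'e'/'q' -> query char
subject = "AAAA"    # fileB row 1
query   = "BBBB"    # fileB row 2

# walk the three strings in lockstep and pick the right character
out = ''.join(s if m in ('*', 's') else q
              for m, s, q in zip(mask, subject, query))
print(out)  # -> AABA
```

Because zip pairs up the characters by index, no manual position counter (like z in the question's code) is needed.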
I have files with hundreds or thousands of rows of data, but without any columns.
I am going through every file, reading it row by row, and storing the rows in a list; after that I want to assign values to columns. But here I am confused about what to do, because there are around 60 values in every row, plus some extra columns with assigned values that should be added to every row.
Code so far:
import re
import glob

filenames = glob.glob("/home/ashfaque/Desktop/filetocsvsample/inputfiles/*.txt")
columns = []
with open("/home/ashfaque/Downloads/coulmn names.txt", encoding="ISO-8859-1") as f:
    file_data = f.read()
    lines = file_data.splitlines()
    for l in lines:
        columns.append(l.rstrip())
total = {}
for name in filenames:
    modified_data = []
    with open(name, encoding="ISO-8859-1") as f:
        file_data = f.read()
        lines = file_data.splitlines()
        for l in lines:
            if len(l) >= 1:
                modified_data.append(re.split(': |,', l))
    rows = []
    i = len(modified_data)
    x = 0
    while i > 60:
        r = lines[x:x+59]
        x = x + 60
        i = i - 60
        rows.append(r)
    z = len(modified_data)
    while z >= 60:
        z = z - 60
    if z > 1:
        last_columns = modified_data[-z:]
        x = []
        for l in last_columns:
            if len(l) > 1:
                del l[0]
                x.append(l)
            elif len(l) == 1:
                x.append(l)
        for row in rows:
            for vl in x:
                row.append(vl)
    for r in rows:
        for i in range(0, len(r)):
            if len(r) >= 60:
                total.setdefault(columns[i], []).append(r[i])
In another script I have separated both the rows with 60 values and the last 5 to 15 columns that should be added to each row, but again I am confused about how to bind all the data together.
Data Should look like this after binding.
outputdata.xlsx
Data Input file:
inputdata.txt
What am I missing here? Any tool?
I believe that your issue can be resolved by turning the input file into a CSV file, which you can then import into whatever program you like.
I wrote a small generator that reads a file one line at a time and returns a row after a certain number of lines, in this case 60. In that generator, you can make whatever modifications to the data you need.
Then, with each generated row, I write it directly to the CSV. This should keep the memory requirements for the process pretty low.
I didn't understand what you were doing with the regex split, but it would be simple enough to add it to the generator.
import csv

OUTPUT_FILE = "/home/ashfaque/Desktop/File handling/outputfile.csv"
INPUT_FILE = "/home/ashfaque/Desktop/File handling/inputfile.txt"

# This is a generator that will pull only num number of items into
# memory at a time, before it yields the row.
def get_rows(path, num):
    row = []
    with open(path, "r", encoding="ISO-8859-1") as f:
        for n, l in enumerate(f):
            # apply whatever transformations that you need to here.
            row.append(l.rstrip())
            if (n + 1) % num == 0:
                # if rows need padding then do it here.
                yield row
                row = []

with open(OUTPUT_FILE, "w") as output:
    csv_writer = csv.writer(output)
    for r in get_rows(INPUT_FILE, 60):
        csv_writer.writerow(r)
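One caveat worth noting: as written, the generator only yields complete rows, so if the line count is not an exact multiple of num, the leftover lines are silently dropped. A small variant sketch (taking an iterable of lines rather than a path, so it is easy to test) that also flushes the final partial row:

```python
def get_rows(lines, num):
    # variant sketch: also yields a final, possibly shorter, row
    # (the original generator drops leftover lines when the line
    # count is not an exact multiple of num)
    row = []
    for l in lines:
        row.append(l.rstrip())
        if len(row) == num:
            yield row
            row = []
    if row:  # flush the remainder, if any
        yield row

# e.g. 7 lines grouped in rows of 3 -> two full rows plus one of length 1
rows = list(get_rows([f"v{i}\n" for i in range(7)], 3))
```

If every row is expected to have exactly num values, this is also the place to pad the short final row before yielding it.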
I have a mlt.ctl file in which the text is arranged like this:
znrmi_001/znrmi_001_001
znrmi_001/znrmi_001_002
znrmi_001/znrmi_001_003
zntoy_001/zntoy_001_001
zntoy_001/zntoy_001_002
zntoy_001/zntoy_001_003
zntoy_001/zntoy_001_004
.......................
zntoy_001/zntoy_001_160
....................
zntoy_002/zntoy_002_001
zntoy_002/zntoy_002_002
.......................
zntoy_002/zntoy_002_149
I need to save the output to the newmlt.ctl file in the desired format, which is shown below:
znrmi_001 znrmi_001_001 znrmi_001_002 znrmi_001_003
zntoy_001 zntoy_001_001 zntoy_001_002..................zntoy_001_160
zntoy_002 zntoy_002_001 zntoy_002_002..................zntoy_002_149
....................................................................
I am trying hard in Python, but I get errors every time.
#!/usr/bin/env python
fi = open("mlt.ctl", "r")
y_list = []
for line in fi.readlines():
    a1 = line[0:9]
    a2 = line[10:19]
    a3 = line[20:23]
    if a3 in xrange(1, 500):
        y = a1 + " ".join(line[20:23].split())
        print(y)
    elif int(a3) < 2:
        fo.write(lines + "\n")
    else:
        stop
    y_list.append(y)
    print(y)
fi.close()
fo = open("newmlt.ctl", "w")
for lines in y_list:
    fo.write(lines + "\n")
fo.close()
I am getting an error at the elif and the code is not running properly. Kindly provide your inputs.
Using regular expressions and saving the matches to a dictionary:
import re

REGEX = r"\d.\s(\S+)/(\S+)"  # group 1: the unique index; group 2: the value
finder = re.compile(REGEX)   # compile the regular expression

with open('mlt.ctl', 'r') as f:
    data = f.read()  # read the entire file into data

matches = re.finditer(finder, data)  # find all matches (one for each line)
d = {}
indices = []
for match in matches:  # loop through the matches
    key = match.group(1)  # the index
    val = match.group(2)  # the value
    if key in d.keys():  # the key has already been processed; just append the value
        d[key].append(val)
    else:  # the key is new; create a new dict entry and record the index
        d[key] = [val]
        indices.append(key)

with open("newmlt.ctl", "w") as out:
    for i, idx in enumerate(indices):
        vals = " ".join(d[idx])  # join the values into a space-delimited string
        to_string = "{} {}\n".format(idx, vals)
        out.write(to_string)
A little more Pythonic:
from collections import defaultdict

d = defaultdict(list)
with open('mlt.ctl') as f:
    for line in f:
        grp, val = line.strip().split('/')
        d[grp].append(val)

with open('newmlt.ctl', 'w') as f:
    for k in sorted(d):
        oline = ' '.join([k] + d[k]) + '\n'
        f.write(oline)
Maybe it is not related, but it seems you forgot a ')' on line 11:
y = a1+ " ".join(line[20:23].split()
should be
y = a1+ " ".join(line[20:23].split())
You are also missing the ':' at the else on line 14 and at the for on line 20.
Also, on line 12 you are probably comparing a string and an integer.
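A quick illustration of that last point, with a hypothetical value for a3: the slice line[20:23] is a string, so membership in a range of integers never matches until it is converted.

```python
# line[20:23] is a string, so membership in a range of ints never matches
# (Python 3 range shown; xrange behaves the same way in Python 2)
a3 = "003"
print(a3 in range(1, 500))        # False: comparing str against ints
print(int(a3) in range(1, 500))   # True once converted with int()
```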
I have to input a text file that contains comma-separated and line-separated data in the following format:
A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585,05-21-11,04:00:00,REGULAR,003169415,001097588,05-21-11,08:00:00,REGULAR,003169431,001097607
Multiple sets of such data are present in the text file.
I need to print all of this on new lines with the condition:
the first 3 elements of every set, followed by 5 parameters, on each new line. So the solution for the above set would be:
A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585
A002,R051,02-00-00,05-21-11,04:00:00,REGULAR,003169415,001097588
A002,R051,02-00-00,05-21-11,08:00:00,REGULAR,003169431,001097607
My function to achieve it is given below:
def fix_turnstile_data(filenames):
    for name in filenames:
        f_in = open(name, 'r')
        reader_in = csv.reader(f_in, delimiter=',')
        f_out = open('updated_' + name, 'w')
        writer_out = csv.writer(f_out, delimiter=',')
        array = []
        for line in reader_in:
            i = 0
            j = -1
            while i < len(line):
                if i % 8 == 0:
                    i += 2
                    j += 1
                    del array[:]
                    array.append(line[0])
                    array.append(line[1])
                    array.append(line[2])
                elif (i + 1) % 8 == 0:
                    array.append(line[i - 3 * j])
                    writer_out.writerow(array)
                else:
                    array.append(line[i - 3 * j])
                i += 1
        f_in.close()
        f_out.close()
The output is wrong, and there is a gap of 3 lines at the end of those lines whose length is 8. I suspect writer_out.writerow(array) might be to blame.
Can anyone please help me out?
Hmm, the logic you use ends up being fairly confusing. I'd do it more along these lines (this replaces your for loop), which is also more Pythonic:
for line in reader_in:
    header = line[:3]
    for i in xrange(3, len(line), 5):
        writer_out.writerow(header + line[i:i+5])
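For reference, a self-contained sketch of that slicing applied to the sample row from the question (using Python 3's range in place of xrange, and print in place of the CSV writer):

```python
# the sample row from the question, split on commas as csv.reader would do
line = ("A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585,"
        "05-21-11,04:00:00,REGULAR,003169415,001097588,"
        "05-21-11,08:00:00,REGULAR,003169431,001097607").split(',')

header = line[:3]                    # A002, R051, 02-00-00
for i in range(3, len(line), 5):     # one output row per group of 5 fields
    print(','.join(header + line[i:i+5]))
```

Each iteration re-emits the 3-element header followed by the next 5 fields, which matches the three expected output lines above.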
I have an input file:
3
PPP
TTT
QPQ
TQT
QTT
PQP
QQQ
TXT
PRP
I want to read this file and group these cases into proper boards.
To read the count (the number of boards) I have this code:
board = []
count = ''

def readcount():
    fp = open("input.txt")
    for i, line in enumerate(fp):
        if i == 0:
            count = int(line)
            break
    fp.close()
But I don't have any idea of how to parse these blocks into a list:
TQT
QTT
PQP
I tried using
def readboard():
    fp = open('input.txt')
    for c in (1, count):  # to run the loop over the total no. of boards available
        for k in (c + 1, c + 3):  # to group the boards into board[]
            board[c].append(fp.readlines)
But it's the wrong way. I know the basics of lists, but here I am not able to parse the file.
These boards are on lines 2 to 4, 6 to 8 and so on. How do I get them into lists?
I want to parse the file into the count and the boards so that I can process them further.
Please suggest.
I don't know if I understand your desired outcome. I think you want a list of lists.
Assuming that you want boards to be:
[[data,data,data],[data,data,data],[data,data,data]], then you would need to define how to parse your input file... specifically:
line 1 is the count number
data is entered per line
boards are separated by white space.
If that is the case, this should parse your files correctly:
board = []
count = 0
currentBoard = 0
fp = open('input.txt')
for i, line in enumerate(fp.readlines()):
    if i == 0:
        count = int(line)
        board.append([])
    else:
        if len(line[:-1]) == 0:  # a blank line separates boards
            currentBoard += 1
            board.append([])
        else:  # this line has board data
            board[currentBoard].append(line[:-1])
fp.close()

import pprint
pprint.pprint(board)
If my assumptions are wrong, then this can be modified to accommodate them.
Personally, I would use a dictionary (or ordered dict) and get the count from len(boards):
from collections import OrderedDict

currentBoard = 0
board = OrderedDict()
board[currentBoard] = []
fp = open('input.txt')
lines = fp.readlines()
fp.close()
for line in lines[1:]:
    if len(line[:-1]) == 0:  # a blank line separates boards
        currentBoard += 1
        board[currentBoard] = []
    else:
        board[currentBoard].append(line[:-1])
count = len(board)
print(count)

import pprint
pprint.pprint(board)
If you just want to take specific line numbers and put them into a list:
line_nums = [3, 4, 5, 1]
fp = open('input.txt')
lines = [line for i, line in enumerate(fp) if i in line_nums]
fp.close()