import re

def find_string(header, file_1, counter):
    ab = re.compile(str(header))
    for line in file_1:
        if re.search(ab, line) != None:
            print line
            counter += 1
    return counter

file_1 = open("text_file_with_headers.txt", 'r')
header_array = []
header_array.append("header1")
header_array.append("header2")
# ...
counter = 0
for header in header_array:
    counter = find_string(header, file_1, counter)
Every time I run this it searches for only one of the headers and I cannot figure out why.
When the loop for line in file_1: has finished for the first header, the file's pointer is at the end of the file. You must move this pointer back to the file's beginning, which is done with the seek() method. Add file_1.seek(0,0) like this:
counter = 0
for header in header_array:
    counter = find_string(header, file_1, counter)
    file_1.seek(0,0)
EDIT
1) ab is a compiled regex, so you can write ab.search(line)
2) bool(None) is False, so you can write if ab.search(line): with no need for != None
3)
def find_string(header, file_1, counter):
    lwh = re.compile('^.*?' + header + '.*$', re.MULTILINE)
    lines_with_header = lwh.findall(file_1.read())
    print '\n'.join(lines_with_header)
    return counter + 1
and even
def find_string(header, file_1, counter):
    lwh = re.compile('^.*?' + header + '.*$', re.MULTILINE)
    print '\n'.join(matline.group() for matline in lwh.finditer(file_1.read()))
    return counter + 1
4)
def find_string(header, file_1):
    lwh = re.compile('^.*?' + header + '.*$', re.MULTILINE)
    lines_with_header = lwh.findall(file_1.read())
    print '\n'.join(lines_with_header)

file_1 = open("text_file_with_headers.txt", 'r')
header_list = ["header1", "header2", ....]
for counter, header in enumerate(header_list):
    find_string(header, file_1)
    file_1.seek(0,0)
counter += 1  # because counter began at 0
5) You run through file_1 as many times as there are headers in header_list.
You should run through it only once, recording each line that contains one of the headers in a list kept as a value of a dictionary whose keys are the headers. It would be faster.
6) What you call an array is called a list in Python.
The file object keeps track of your position in the file, and after you've gone through the outer loop once, you're at the end of the file and there are no more lines to read.
If I were you, I would reverse the order in which your loops are nested: I would iterate through the file line by line, and for each line, iterate through the list of strings you want to find. That way, I would only have to read each line from the file once.
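For illustration, a minimal sketch of that single-pass approach, combined with the dictionary idea from point 5 above; the file name and headers are the ones from the question:

import re

file_1 = open("text_file_with_headers.txt", 'r')
headers = ["header1", "header2"]
patterns = dict((h, re.compile(h)) for h in headers)
matches = dict((h, []) for h in headers)

for line in file_1:  # a single pass over the file
    for h, pat in patterns.items():
        if pat.search(line):
            matches[h].append(line)

for h in headers:
    print '%s: %d line(s)' % (h, len(matches[h]))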
Related
I have a 4-column tab-separated text file. I also have a list of values that need to be iterated through and searched for in the text file to get the value of one of the columns:
Here's my code (Python 2.7):
def populate_data():
    file = open('file.txt', 'r')
    values = ['value1', 'value2', 'value3']
    secondary_values = ['second_value1', 'second_value2', 'second_value3']
    os = 'iOS'
    i = 0
    outputs = []
    while i < len(values):
        value = values[i]
        secondary_value = secondary_values[i]
        output = lookup(file, os, value, secondary_value)
        if output != None:
            outputs.append(output)
        i += 1
def lookup(file, input_os, input_value, input_secondary_value):
    for line in file:
        columns = line.strip().split('\t')
        if len(columns) != 4:
            continue
        else:
            value = columns[0]
            secondary_value = columns[1]
            os = columns[2]
            output = columns[3]
            if input_os == os and input_value == value and input_secondary_value == secondary_value:
                return output
The search basically should work as this SQL statement:
SELECT output FROM data_set WHERE os='os' AND value='value' and secondary_value='secondary_value'
The problem I'm experiencing is that the lookup method is called inside the while loop and itself runs a for loop, and the parent while loop apparently doesn't wait for the inner loop to finish and return the value before continuing. The result is that the data is not returned despite the fact that there is a match. If this were JavaScript I would solve it with Promises, but I'm not sure how to achieve that in Python.
Any clues how this could be solved?
The correct thing to do here was to read the file and insert all of the rows into a dict like so:
dc = dict()
dc[value+secondary_value+os] = output
Then access the values in the main while loop.
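A hedged sketch of that approach, reading the file once and indexing rows by a tuple key (a tuple is used here instead of the concatenated string to avoid ambiguous joins; file.txt and the column order are from the question). The underlying cause is presumably the same file-pointer issue as in the question above: after the first lookup() call the file iterator is exhausted, so later calls see no lines.

def build_index(path):
    # Read the whole file once and index rows by (value, secondary_value, os).
    index = {}
    with open(path, 'r') as f:
        for line in f:
            columns = line.strip().split('\t')
            if len(columns) != 4:
                continue
            value, secondary_value, os_name, output = columns
            index[(value, secondary_value, os_name)] = output
    return index

index = build_index('file.txt')
output = index.get(('value1', 'second_value1', 'iOS'))  # None if no match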
I have a list of sorted data arranged so that each item in the list is a csv line to be written to file.
The final step of the script checks the contents of each field and if all but the last field match then it will copy the current line's last field onto the previous line's last field.
Once I've found and processed one of these matches, I would like to skip the current line that the field was copied from, thus leaving only one of the lines.
Here's an example set of data
field1,field2,field3,field4,something
field1,field2,field3,field4,else
Desired output
field1,field2,field3,field4,something else
This is what I have so far
output_csv = ['field1,field2,field3,field4,something',
              'field1,field2,field3,field4,else']

# run through the output
# open and create a csv file to save output
with open('output_table.csv', 'w') as f:
    previous_line = None
    part_duplicate_line = None
    part_duplicate_flag = False
    for line in output_csv:
        part_duplicate_flag = False
        if previous_line is not None:
            previous = previous_line.split(',')
            current = line.split(',')
            if (previous[0] == current[0]
                    and previous[1] == current[1]
                    and previous[2] == current[2]
                    and previous[3] == current[3]):
                print(previous[0], current[0])
                previous[4] = previous[4].replace('\n', '') + ' ' + current[4]
                part_duplicate_line = ','.join(previous)
                part_duplicate_flag = True
                f.write(part_duplicate_line)
        if part_duplicate_flag is False:
            f.write(previous_line)
        previous_line = line
At the moment the script adds the merged line but doesn't skip the next line. I've tried various renditions of continue statements after part_duplicate_line is written to file, but to no avail.
Looks like you want one entry for each combination of the first 4 fields.
You can use a dict to aggregate data -
# First we extract the keys and values
output_csv_keys = list(map(lambda x: ','.join(x.split(',')[:-1]), output_csv))
output_csv_values = list(map(lambda x: x.split(',')[-1], output_csv))

# Then we construct a dictionary with these keys and combine the values into a list
from collections import defaultdict
output_csv_dict = defaultdict(list)
for key, value in zip(output_csv_keys, output_csv_values):
    output_csv_dict[key].append(value)

# Then we extract the key/value combinations from this dictionary into a list
for_printing = [','.join([k, ' '.join(v)]) for k, v in output_csv_dict.items()]
print(for_printing)
# Output is ['field1,field2,field3,field4,something else']
# Each entry of this list can be output to the csv file
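Following up on that last comment, writing the aggregated entries out might look like this sketch, reusing for_printing from above (output_table.csv is the filename from the question):

# Each aggregated entry becomes one line of the csv file.
with open('output_table.csv', 'w') as f:
    for entry in for_printing:
        f.write(entry + '\n')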
I propose to encapsulate what you want to do in a function where the important part obeys this logic:
either join the new info onto the old record,
or output the old record and forget it.
Of course, at the end of the loop there is in any case a dangling old record to output.
def join(inp_fname, out_fname):
    '''Input file contains sorted records; when two (or more) records differ
    only in the last field, we join the last fields with a space
    and output only once, otherwise output the record as-is.'''
    ######################### Prepare for action ##########################
    from csv import reader, writer
    with open(inp_fname) as finp, open(out_fname, 'w') as fout:
        r, w = reader(finp), writer(fout)
        ######################### Important Part starts here ##############
        old = next(r)
        for new in r:
            if old[:-1] == new[:-1]:
                old[-1] += ' ' + new[-1]
            else:
                w.writerow(old)
                old = new
        w.writerow(old)
To check what I've proposed, you can use these two snippets (note that these records are shorter than yours, but it's an example and it doesn't matter, because we only use -1 to index our records).
The 1st one has a "regular" last record:
open('a0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n3,3,0\n')
join('a0.csv', 'a1.csv')
while the 2nd has a last record that must be joined to the previous one.
open('b0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n')
join('b0.csv', 'b1.csv')
If you run the snippets, as I have done before posting, in the environment where you have defined join you should get what you want.
I am trying to match parentheses from lines in a file. When I use the code below without reading the data from a file, entering it directly instead, it works and matches the parentheses. I don't know how to make it work with numbers and letters in between as well.
I have tried many different ways, but this has worked the best so far. I think there is firstly something wrong with what I am printing, but I have tried everything that I know to fix that. I am also new to Python, so it might not be the best code.
class Stack:
    def __init__(self):
        self._items = []

    def isEmpty(self):
        return self._items == []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

stack = Stack()
open_list = ["[", "{", "("]
close_list = ["]", "}", ")"]

def open_file():
    file = open("testing.txt", "r")
    testline = file.readline()
    count = 1
    while testline != "":
        testline = testline[:-1]
        check(testline, count)
        testline = file.readline()
        count = count + 1

def check(testline, count):
    stack = []
    for i in testline:
        if i in open_list:
            stack.append(i)
        elif i in close_list:
            pos = close_list.index(i)
            if ((len(stack) > 0) and
                    (open_list[pos] == stack[len(stack) - 1])):
                stack.pop()
            else:
                print("Unbalanced")
                print(count)
    if len(stack) == 0:
        print("Balanced")
        print(count)

def main():
    open_file()

if __name__ == "__main__":
    main()
output:
if the file contains
dsf(hkhk[khh])
ea{jhkjh[}}
hksh[{(]
sd{hkh{hkhk[hkh]}}]
the output is
Balanced
1
Unbalanced
2
Unbalanced
2
Unbalanced
3
Unbalanced
4
Balanced
4
The first four are correct, but it adds an extra 2 and I have no idea where it is coming from. I need the count for later purposes when I am printing (i.e. line 1 is balanced).
Time to learn the basics of debugging...
@emilanov has given hints for the open_file function, so I will focus on the check function.
for i in range(0, len(testline), 1):

is probably not what you want: i will take integer values from 0 to len(testline) - 1. The rule is: when things go wrong, use a debugger or add trace prints. Here,

for i in range(0, len(testline), 1):
    print(i)  # trace - comment out for production code

would have made the problem evident.
What you want is probably:
for i in testline:
There are some problems with your open_file() function.
The while loop finishes only when testline == "" is true. So when you later do check(testline), you actually give the function an empty string, so it can't really do its job.
I assume the purpose of the while loop is to remove the newline character \n from each line in the file? The problem is that you're not saving the intermediate lines anywhere. Then, when file.readline() returns "" because the file doesn't have any more lines, you give that empty string to the function.
Some suggestions
# A way to remove newlines
testline = testline.replace("\n", "")

# Check all the lines
lines = file.readlines()
count = len(lines)
for testline in lines:
    testline = testline.replace("\n", "")
    check(testline)

# And if you're sure that the file will have only one line
testline = file.readline()[:-1]  # read line and remove '\n'
check(testline)
Remember, a string is essentially a list of characters, so you can use len(string) to get its length. You can also use len(file.readlines()) to see how many lines a file has. Either way, you can get rid of the count variable.
Printing
When you call print(check()) it first calls the check() function with no parameters, so it can't actually check anything. That's why you can't see the right print statement.
A suggested edit would be to move the print statement to the end of your open_file() function, so that you have print(check(testline))
Another possible solution would be to put a return statement in your open_file() function.
def open_file():
    # Some code...
    return check(testline)

def check():
    # Some code...

print(open_file())
The easiest will probably be to replace the return statements in check() with print statements though.
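For illustration, a hedged sketch of that restructuring, where check() returns its verdict instead of printing inside the loop and the caller prints it together with the line number (open_list, close_list, and testing.txt are from the question):

open_list = ["[", "{", "("]
close_list = ["]", "}", ")"]

def check(testline):
    # Return the verdict instead of printing inside the loop.
    stack = []
    for ch in testline:
        if ch in open_list:
            stack.append(ch)
        elif ch in close_list:
            if stack and stack[-1] == open_list[close_list.index(ch)]:
                stack.pop()
            else:
                return "Unbalanced"
    return "Balanced" if not stack else "Unbalanced"

def open_file():
    with open("testing.txt") as f:
        for count, line in enumerate(f, start=1):
            print("line", count, "is", check(line.strip()))

open_file()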
The text file contains two columns: an index number (5 spaces) and characters (30 spaces).
It is arranged in lexicographic order, and I want to perform a binary search to look up a keyword.
Here's an interesting way to do it with Python's built-in bisect module.
import bisect
import os

class Query(object):
    def __init__(self, query, index=5):
        self.query = query
        self.index = index

    def __lt__(self, comparable):
        return self.query < comparable[self.index:]

class FileSearcher(object):
    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)

if __name__ == '__main__':
    with open('data.dat') as file_to_search:
        query = raw_input('Query: ')
        wrapped_query = Query(query)
        searchable_file = FileSearcher(file_to_search)
        print "Located @ line: ", bisect.bisect(searchable_file, wrapped_query)
Do you need to do a binary search? If not, try converting your flat file into a cdb (constant database). This will give you very speedy hash lookups to find the index for a given word:
import cdb

# convert the corpus file to a constant database one time
db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
for line in open('largecorpus.txt', 'r'):
    index, word = line.split()
    db.add(word, index)
db.finish()
In a separate script, run queries against it:
import cdb
db = cdb.init('corpus.db')
db.get('chaos')
12345
If you need to find a single keyword in a file:
line_with_keyword = next((line for line in open('file') if keyword in line), None)
if line_with_keyword is not None:
    print line_with_keyword  # found
To find multiple keywords you could use set() as @kriegar suggested:
def extract_keyword(line):
    return line[5:35]  # assuming the keyword starts at position 6 and has length 30

with open('file') as f:
    keywords = set(extract_keyword(line) for line in f)  # O(n) creation

if keyword in keywords:  # O(1) search
    print(keyword)
You could use dict() above instead of set() to preserve index information.
Here's how you could do a binary search on a text file:
import bisect

lines = open('file').readlines()  # O(n) list creation
keywords = map(extract_keyword, lines)
i = bisect.bisect_left(keywords, keyword)  # O(log(n)) search
if keyword == keywords[i]:
    print(lines[i])  # found
There is no advantage compared to the set() variant.
Note: all variants except the first one load the whole file into memory. The FileSearcher() suggested by @Mahmoud Abdelkader doesn't require loading the whole file into memory.
I wrote a simple Python 3.6+ package that can do this. (See its github page for more information!)
Installation: pip install binary_file_search
Example file:
1,one
2,two_a
2,two_b
3,three
Usage:
from binary_file_search.BinaryFileSearch import BinaryFileSearch
with BinaryFileSearch('example.file', sep=',', string_mode=False) as bfs:
    # assert bfs.is_file_sorted()  # test if the file is sorted.
    print(bfs.search(2))
Result: [[2, 'two_a'], [2, 'two_b']]
It is quite possible, with a slight loss of efficiency, to perform a binary search on a sorted text file with records of unknown length, by repeatedly bisecting the range and reading forward past the line terminator. Here's what I do to look through a csv file with 2 header lines for a numeric value in the first field. Give it an open file and the first field to look for. It should be fairly easy to modify this for your problem. A match on the very first line at offset zero will fail, so this may need to be special-cased. In my circumstance, the first 2 lines are headers and are skipped.
Please excuse my lack of polished Python below. I use this function, and a similar one, to perform GeoCity Lite latitude and longitude calculations directly from the CSV files distributed by Maxmind.
Hope this helps
import os

# See if the input loc is in file
def look1(f, loc):
    # Compute filesize of open file sent to us
    hi = os.fstat(f.fileno()).st_size
    lo = 0
    lookfor = int(loc)
    # print "looking for: ", lookfor
    while hi - lo > 1:
        # Find midpoint and seek to it
        loc = int((hi + lo) / 2)
        # print " hi = ", hi, " lo = ", lo
        # print "seek to: ", loc
        f.seek(loc)
        # Skip to beginning of line
        while f.read(1) != '\n':
            pass
        # Now skip past lines that are headers
        while 1:
            # read line
            line = f.readline()
            # print "read_line: ", line
            # Crude csv parsing: remove quotes, and split on ,
            row = line.replace('"', "")
            row = row.split(',')
            # Make sure 1st field is numeric
            if row[0].isdigit():
                break
        s = int(row[0])
        if lookfor < s:
            # Split into lower half
            hi = loc
            continue
        if lookfor > s:
            # Split into higher half
            lo = loc
            continue
        return row  # Found
    # If not found
    return False
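A hedged usage sketch under the same assumptions (sorted.csv is a hypothetical sorted file whose first field is numeric):

# look1 returns the parsed row on success, or False when the key is absent.
f = open('sorted.csv', 'r')
row = look1(f, 12345)
if row:
    print "found: ", row
else:
    print "not found"
f.close()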
Consider using a set instead of a binary search for finding a keyword in your file.
Set:
O(n) to create, O(1) to find, O(1) to insert/delete
If your input file is separated by a space then:
f = open('file')
keywords = set(line.strip().split(" ")[1] for line in f.readlines())
f.close()

my_word in keywords  # returns True or False
Dictionary:
f = open('file')
keywords = dict([(pair[1], pair[0]) for pair in [line.strip().split(" ") for line in f.readlines()]])
f.close()

keywords[my_word]  # returns index of my_word
Binary Search is:
O(n log n) create, O(log n) lookup
Edit: for your case of 5 characters and 30 characters, you can just use string slicing:
f = open('file')
keywords = set(line[5:-1] for line in f.readlines())
f.close()

my_word in keywords
or
f = open('file')
keywords = dict( [(line[5:-1],line[:5]) for line in f.readlines()] )
f.close()
keywords[my_word]
I am using a GM862 module and I want to write the coordinates as-is to a file "cordinates.txt", but I'm getting wrong output. This is the code I wrote:
import MDM

cordlist = []
f = open("cordinates.txt", 'w')

def AcquiredPosition():
    res = MDM.send('AT$GPSACP\r', 0)
    res = MDM.receive(30)
    if (res.find('OK') != -1):
        tmp = res.split("\r\n")
        res = tmp[1]
        tmp = res.split(" ")
        return tmp[1]
    else:
        return ""

while (1):
    res = MDM.receive(60)
    p = AcquiredPosition()
    cordlist.append(p)
    cordlist.append("\r\n")
    f.writelines(cordlist)
The problem is that the coordinates are repeated in the file each time the append happens.
Here is an example of the content of the file "cordinates.txt":
160439.246,2612.7206N,05027.6068E,3.0,23.6,2,339.34,4.21,2.27,181109,03 first time
160439.246,2612.7206N,05027.6068E,3.0,23.6,2,339.34,4.21,2.27,181109,03 repeated 1
160445.246,2612.7305N,05027.6079E,3.0,23.6,2,161.61,6.37,3.43,181109,03 first time
160439.246,2612.7206N,05027.6068E,3.0,23.6,2,339.34,4.21,2.27,181109,03 repeated 2
160445.246,2612.7305N,05027.6079E,3.0,23.6,2,161.61,6.37,3.43,181109,03 repeated 1
160451.246,2612.7634N,05027.5939E,3.0,23.6,2,143.18,1.36,0.73,181109,03 first time
160439.246,2612.7206N,05027.6068E,3.0,23.6,2,339.34,4.21,2.27,181109,03 repeated 3
160445.246,2612.7305N,05027.6079E,3.0,23.6,2,161.61,6.37,3.43,181109,03
160451.246,2612.7634N,05027.5939E,3.0,23.6,2,143.18,1.36,0.73,181109,03
160458.246,2612.7471N,05027.5979E,3.0,23.6,2,333.97,7.66,4.13,181109,03
160439.246,2612.7206N,05027.6068E,3.0,23.6,2,339.34,4.21,2.27,181109,03 and so on...
160445.246,2612.7305N,05027.6079E,3.0,23.6,2,161.61,6.37,3.43,181109,03
160451.246,2612.7634N,05027.5939E,3.0,23.6,2,143.18,1.36,0.73,181109,03
160458.246,2612.7471N,05027.5979E,3.0,23.6,2,333.97,7.66,4.13,181109,03
160504.246,2612.7496N,05027.5961E,3.0,47.2,3,316.66,3.16,1.70,181109,04
160439.246,2612.7206N,05027.6068E,3.0,23.6,2,339.34,4.21,2.27,181109,03
160445.246,2612.7305N,05027.6079E,3.0,23.6,2,161.61,6.37,3.43,181109,03
160451.246,2612.7634N,05027.5939E,3.0,23.6,2,143.18,1.36,0.73,181109,03
160458.246,2612.7471N,05027.5979E,3.0,23.6,2,333.97,7.66,4.13,181109,03
160504.246,2612.7496N,05027.5961E,3.0,47.2,3,316.66,3.16,1.70,181109,04
160510.000,2612.7446N,05027.5996E,3.0,53.7,3,162.56,0.50,0.27,181109,04
Thanks for any help.
You are appending to your list and then writing the full list to the file each time through the loop.
You need to clear down the list in each pass through the loop.
Put cordlist = [] as the first line under while(1)
Why not open the file in append mode ('a' instead of 'w') and just writelines to that?
Because that's what you've asked it to do. On every iteration, you append an item to the list, then write out all the lines so far. So each time you'll repeat everything you've already done, plus the one new line.
Since your function only returns a single line I don't know why you're bothering with a list at all - just write the result of the function straight to the file.
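A minimal sketch of that simplification, assuming the MDM module and the AcquiredPosition() function from the question:

import MDM

f = open("cordinates.txt", 'w')
while 1:
    res = MDM.receive(60)
    p = AcquiredPosition()
    f.write(p + "\r\n")  # write only the newly acquired fix
    f.flush()            # assumption: flush so the file stays current on the device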