Comparing all contents of two files

I am trying to compare two files. One file has a list of stores. The other file has the same list of stores, except that it is missing a few because of a filter I ran against it from another script. I would like to compare the two files: if a store in file 1 cannot be found anywhere in file 2, I want to print it out or append it to a list; I am not too picky on that part. Below are partial examples of both files:
file 1:
Store: 00377
Main number: 8033056238
Store: 00525
Main number: 4075624470
Store: 00840
Main number: 4782736996
Store: 00920
Main number: 4783337031
Store: 00998
Main number: 9135631751
Store: 02226
Main number: 3107501983
Store: 02328
Main number: 8642148700
Store: 02391
Main number: 7272645342
Store: 02392
Main number: 9417026237
Store: 02393
Main number: 4057942724
File 2:
00377
00525
00840
00920
00998
02203
02226
02328
02391
02392
02393
02394
02395
02396
02397
02406
02414
02425
02431
02433
02442
Here is what I built to try and make this work, but it just keeps spewing all stores in the file.
def comparesitestest():
    with open("file_1.txt", "r") as pairsin:
        pairs = pairsin.readlines()
        pairsin.close
    with open("file_2.txt", "r") as storesin:
        stores = storesin.readlines()
        storesin.close
    for pair in pairs:
        for store in stores:
            if store not in pair:
                print(store)

When you read your first file, add the store number to a set.
store_nums_1 = set()
with open("file_1.txt") as f:
    for line in f:
        line = line.strip() # Remove surrounding whitespace
        if line.startswith("Store"):
            store_nums_1.add(line[7:]) # Add only store number to set
Next, read the other file and add those numbers to another set:
store_nums_2 = set()
with open("file_2.txt") as f:
    for line in f:
        line = line.strip() # Remove surrounding whitespace
        store_nums_2.add(line) # The entire line is the store number, so no need to slice.
Finally, find the set difference between the two sets.
file1_extras = store_nums_1 - store_nums_2
This gives a set containing only the store numbers that are in file 1 but not in file 2. (I cut your file_2 down to its first three lines, because the file you've shown contains every store number in file_1 and more, so file1_extras was empty with your input.)
{'00920', '00998', '02226', '02328', '02391', '02392', '02393'}
This is more efficient than using lists, because checking whether something exists in a list is an O(N) operation. When you do it once for each of the M items in your first list, you end up with an O(N*M) operation. Membership checks in a set, on the other hand, are O(1) on average, so the entire set-difference operation is O(M) instead of O(N*M).
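The three steps above can be folded into one small function. This is only a minimal sketch: the name missing_stores is made up for illustration, and it takes lines instead of file names (an assumption) so that it runs without file_1.txt and file_2.txt being present.

```python
def missing_stores(file1_lines, file2_lines):
    """Return the store numbers found in file 1 but not in file 2, sorted."""
    prefix = "Store:"
    # File 1: keep only the "Store: NNNNN" lines and take the text after the prefix.
    in_file1 = {
        line.strip()[len(prefix):].strip()
        for line in file1_lines
        if line.strip().startswith(prefix)
    }
    # File 2: every non-empty line is a store number.
    in_file2 = {line.strip() for line in file2_lines if line.strip()}
    return sorted(in_file1 - in_file2)


file1 = ["Store: 00377\n", "Main number: 8033056238\n",
         "Store: 00525\n", "Main number: 4075624470\n",
         "Store: 00840\n", "Main number: 4782736996\n"]
file2 = ["00377\n", "00840\n"]
print(missing_stores(file1, file2))  # ['00525']
```

With real files you can pass the open file objects directly, since iterating over a file yields its lines.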

You are getting that output because your condition does not check what you want. Try changing your for loop to something like this:
stores = [s.strip() for s in stores]  # drop the trailing newline from each store number
for pairline in pairs:
    pairline = pairline.strip()
    if pairline:
        name, number = pairline.split(': ')
        if name == "Store":
            if number not in stores:
                print(number)
Explanation is as follows:
You start with a File 1 of pairs, and a File 2 of stores (store numbers, really). Your file 2 is in decent shape. After you read it in, you've got a list of store numbers. You don't need to put that through a second loop. In fact, it's wasteful and unnecessary.
Your File 1 is a little more complicated. Although you refer to the info as pairs, each pair spans two lines: one with a store number and one with what I assume is a phone number. So, for each line in File 1, I would check whether the line starts with "Store:", knowing I can ignore all the other lines. If the line starts with "Store:", the next part of the line is the store number I actually want to look for in the list from File 2.
So, the program above does a little more checking to see whether it is reading a line it needs to act on, and then acts on it if necessary by checking whether the store number is in the store number list.
Also, as a side note, it's great to use the with structure. It's good coding practice. But when you do that, you do not need to explicitly close the file. That happens automatically with that context structure. Once you leave the context, the close happens automatically.
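To see the automatic close for yourself, here is a small sketch. It writes a throwaway file in a temporary directory (an assumption made so the example does not depend on your files) and checks the closed attribute inside and after the with block. Note also that pairsin.close without parentheses only refers to the method and never calls it.

```python
import os
import tempfile

# Write a throwaway file so the example does not depend on file_1.txt existing.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("Store: 00377\n")
    inside = f.closed      # still open inside the block

after = f.closed           # closed as soon as the block ends
print(inside, after)       # False True
```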
As another side note, there are usually multiple good ways and bad ways to solve a problem. Another possible reasonable solution/version is:
stores = [s.strip() for s in stores]  # drop the trailing newline from each store number
for pairline in pairs:
    if pairline and pairline.startswith("Store:"):
        store = pairline.split()[1]
        if store not in stores:
            print(store)
It's different. Not necessarily better or worse, just different.

Related

Python - How to create a delete function for user input?

I've been trying to create a program which allows users to view a text file's contents and delete some or all of a single entry block.
An example of the text's file contents can be seen below:
Special Type A Sunflower
2016-10-12 18:10:40
Asteraceae
Ingredient in Sunflower Oil
Brought to North America by Europeans
Requires fertile and moist soil
Full sun
Pine Tree
2018-12-15 13:30:45
Pinaceae
Evergreen
Tall and long-lived
Temperate climate
Tropical Sealion
2019-01-20 12:10:05
Otariidae
Found in zoos
Likes fish
Likes balls
Likes zookeepers
Big Honey Badger
2015-06-06 10:10:25
Mustelidae
Eats anything
King of the desert
As such, an entry block is a run of consecutive lines with no blank line between them; in the file, entries are separated from each other by a blank line.
Currently, my progress is at:
import time
import os
global o
global dataset
global database
from datetime import datetime
MyFilePath = os.getcwd()
ActualFile = "creatures.txt"
FinalFilePath = os.path.join(MyFilePath, ActualFile)
def get_dataset():
    database = []
    shown_info = []
    with open(FinalFilePath, "r") as textfile:
        sections = textfile.read().split("\n\n")
        for section in sections:
            lines = section.split("\n")
            database.append({
                "Name": lines[0],
                "Date": lines[1],
                "Information": lines[2:]
            })
    return database
def delete_creature():
    dataset = get_dataset()
    delete_question = str(input("Would you like to 1) delete a creature or 2) only some of its information from the dataset or 3) return to main page? Enter 1, 2 or 3: "))
    if delete_question == "1":
        delete_answer = str(input("Enter the name of the creature: "))
        for line in textfile:
            if delete_answer in line:
                line.clear()
    elif delete_question == "2":
        delete_answer = str(input("Enter the relevant information of the creature: "))
        for line in textfile:
            if delete_answer in line:
                line.clear()
    elif delete_question == "3":
        break
    else:
        raise ValueError
    except ValueError:
        print("\nPlease try again! Your entry is invalid!")
while True:
    try:
        option = str(input("\nGood day, This is a program to save and view creature details.\n" +
            "1) View all creatures.\n" +
            "2) Delete a creature.\n" +
            "3) Close the program.\n" +
            "Please select from the above options: "))
        if option == "1":
            view_all()
        elif option == "2":
            delete()
        elif option == "3":
            break
        else:
            print("\nPlease input one of the options 1, 2 or 3.")
    except:
        break
The delete_creature() function is meant to delete a creature by:
Name, which deletes the entire text block associated with the name
Information, which deletes only the line of information
I can't seem to get the delete_creature() function to work, however, and I am unsure of how to get it to work.
Does anyone know how to get it to work?
Many thanks!

The difficulty with removing lines from a section is that you have hardcoded which line represents what. Removing a whole section will be easy in your case. Removing a single line will, unless you change your concept, mean setting that line to empty or to some placeholder character that stands for the empty string.
Another question is whether the sections must stay in the order in which they were entered, or whether they can be written back to the file in some other order.
What I would do is to change the input file format to e.g. INI file format. Then you can use the configparser module to parse and edit them in an easy manner.
The INI file would look like:
[plant1]
name="Some plant's English name"
species="The plant's Latin species part"
subspecies="The plant's subspecies in Latin of course"
genus="etc."
[animal1]
# Same as above for the animal
# etc. etc. etc.
configparser.ConfigParser() lets you load the file in a dictionary-like manner and edit sections and values. You can name the sections animal1, plant1, or anything else; I prefer to keep the name inside the section, usually under a name key, and then use configparser to build a normal dictionary keyed by name, where each value is another dictionary holding the key-value pairs of that section. I reverse the process when saving the results, either manually or using configparser again.
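As a rough sketch of that workflow: the section names follow the INI example above, while the option names family, note0 and note1 are assumptions chosen for illustration, and the values are borrowed from the question's sample data. The text is read from a string so the example runs without a file.

```python
import configparser
import io

INI_TEXT = """\
[plant1]
name = Sunflower
family = Asteraceae
note0 = Ingredient in Sunflower Oil
note1 = Full sun

[animal1]
name = Honey Badger
family = Mustelidae
note0 = Eats anything
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

# Build a plain dict keyed by the "name" value of each section.
by_name = {config[s]["name"]: dict(config[s]) for s in config.sections()}

# Delete one line of information, then a whole entry.
config.remove_option("plant1", "note1")
config.remove_section("animal1")

# Serialize back to text (write() accepts any text file object).
buffer = io.StringIO()
config.write(buffer)
print(buffer.getvalue())
```

To save to disk, the same write() call works with a file opened in "w" mode in place of the StringIO buffer.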
The other format you might consider is JSON, using the json module.
Using its dumps() function with separators and indentation set appropriately, you get a fairly human-readable and editable output format. The nice thing is that you save the data structure you are working with, e.g. a dictionary, and when you load it, it comes back as you saved it, with none of the extra conversion that configparser needs. On the other hand, an INI file is less confusing to write for a user who is not used to JSON, and leads to fewer errors, while JSON must be strictly formatted: any mistake in opening and closing brackets or in separators makes the whole file fail to load or load incorrectly. That happens easily when the file is big.
Both formats allow users to put empty lines wherever they want without changing the way the file is loaded, while your method is strict with regard to empty lines.
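For the JSON route, a minimal sketch might look like this. The dictionary layout (name as key, with Date and Information fields) is an assumption that mirrors the question's get_dataset(), and the values come from the question's sample data.

```python
import json

creatures = {
    "Pine Tree": {
        "Date": "2018-12-15 13:30:45",
        "Information": ["Pinaceae", "Evergreen", "Tall and long-lived"],
    },
    "Big Honey Badger": {
        "Date": "2015-06-06 10:10:25",
        "Information": ["Mustelidae", "Eats anything"],
    },
}

text = json.dumps(creatures, indent=4)   # human-readable, editable text
restored = json.loads(text)              # comes back as the same structure

restored["Pine Tree"]["Information"].remove("Evergreen")  # delete one line
del restored["Big Honey Badger"]                          # delete a whole entry
print(json.dumps(restored, indent=4))
```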
If you expect your database to be edited only by your program, use the pickle module and save yourself the mess.
Otherwise you can:
def getdata(stringfromfile):
    end = {}
    l = []  # lines in the current section
    for x in stringfromfile.strip().splitlines():
        x = x.strip()
        if not x:  # Blank line: the current section is finished
            if l:  # Ignore repeated blank lines
                end[l[0].lower()] = l[1:]
            l = []
            continue
        l.append(x)  # Collect the line into the current section
    if l:
        end[l[0].lower()] = l[1:]  # Add last section
    # Connect numbers to keys in the same dict(), so that users can choose by number too
    for n, key in enumerate(sorted(end)):
        end[n] = key
    return end
# You define some constants for which line is what in a dict():
values = {"species": 0, "subspecies": 1, "genus": 2}
# You load the file and parse the data
with open("creatures.txt") as f:
    data = getdata(f.read())
def edit(name_or_number, edit_what, new_value):
    if isinstance(name_or_number, int):
        key = data[name_or_number]
    else:
        key = name_or_number.lower().strip()
    if isinstance(edit_what, str):
        edit_what = values[edit_what.strip().lower()]
    data[key][edit_what] = new_value.strip()
def add(name, list_of_lines):
    n = len(data) // 2  # Number for the new entry (half of the keys are numeric aliases)
    name = name.strip().lower()
    data[name] = list_of_lines
    data[n] = name
def remove(name):
    name = name.lower().strip()
    del data[name]
    # Well, this part is bad and clumsy
    # It would make more sense to keep numeric mappings in separate list
    # which will do this automatically, especially if the database file is big-big-big...
    # But I started this way, so, keeping it simple and stupid, just remap everything after removing the item (inefficient as hell itself)
    for x in list(data):  # Iterate over a copy: a dict cannot change size while being iterated
        if isinstance(x, int):
            del data[x]
    for n, key in enumerate(sorted(data)):
        data[n] = key
def getstring(d):
    # Serialize for saving
    end = []
    for l0, ls in d.items():
        if isinstance(l0, int):
            continue  # Skip numeric mappings
        lines = l0+"\n"+"\n".join(ls)
        end.append(lines)
    return "\n\n".join(end)
I didn't test the code. There might be bugs.
If you do not need specific lines, you can easily modify my code to search the lines using the list.index() method, or just refer to lines by number, if they exist, when you need to get to them. To do the same with configparser, use generic keys in a section, like answer0, answer1, ..., or just 0, 1, 2, ..., then ignore the keys and load the answers as a list or however you like. If you use configparser to work on the file, you will sometimes be left with gaps such as answer0, answer3, ... after removing items.
And a warning: if you want to keep the order in which the input file lists the creatures, use collections.OrderedDict instead of a normal dictionary (since Python 3.7 a plain dict also keeps insertion order).
Also, editing the opened file in place is, of course, possible, but complicated and inadvisable, so just don't. Load and save back. There are very rare situations when you want to change the file directly, and for those you would use the mmap module. Just don't!
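Here is a minimal sketch of the "load, change, save back" idea applied to the question's original text format. All function names are hypothetical, and it assumes that the first line of each block is the name and the second is the date. It works on a string so it can run without creatures.txt.

```python
def parse_blocks(text):
    """Split the file text into a list of blocks; each block is a list of lines."""
    blocks = []
    for chunk in text.strip().split("\n\n"):
        lines = [line.strip() for line in chunk.splitlines() if line.strip()]
        if lines:
            blocks.append(lines)
    return blocks


def delete_entry(blocks, name):
    """Return the blocks whose first line (the name) is not `name`."""
    return [b for b in blocks if b[0].lower() != name.strip().lower()]


def delete_info(blocks, name, info):
    """Remove one information line (any line after name and date) from one entry."""
    result = []
    for b in blocks:
        if b[0].lower() == name.strip().lower():
            b = b[:2] + [line for line in b[2:] if line != info.strip()]
        result.append(b)
    return result


def serialize(blocks):
    """Turn the blocks back into text, with one blank line between entries."""
    return "\n\n".join("\n".join(b) for b in blocks) + "\n"


text = """Pine Tree
2018-12-15 13:30:45
Pinaceae
Evergreen

Big Honey Badger
2015-06-06 10:10:25
Mustelidae
Eats anything
"""
blocks = parse_blocks(text)
blocks = delete_info(blocks, "Pine Tree", "Evergreen")
blocks = delete_entry(blocks, "big honey badger")
print(serialize(blocks))
```

In a real program you would read the text with open(path).read() and write serialize(blocks) back with the file opened in "w" mode, rather than editing the open file in place.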

How can I simplify this Python code (assignment from a book)?

I am studying the book "Python for Everybody" by Charles R. Severance and I have a question about Exercise 2 from Chapter 7.
The task is to go through the mbox-short.txt file and "When you encounter a line that starts with “X-DSPAM-Confidence:” pull apart the line to extract the floating-point number on the line. Count these lines and then compute the total of the spam confidence values from these lines. When you reach the end of the file, print out the average spam confidence."
Here is my way of doing this task:
fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
count = 0
values = list()
for line in fhand:
    if line.startswith('X-DSPAM-Confidence:'):
        string = line
        count = count + 1
        colpos = string.find(":")
        portion = string[colpos+1:]
        portion = float(portion)
        values.append(portion)
print('Average spam confidence:', sum(values)/count)
I know this code works because I get the same result as in the book, however, I think this code can be simpler. The reason I think so is because I used a list in this code (declared it and then stored values in it). However, "Lists" is the next topic in the book and when solving this task I didn't know anything about lists and had to google them. I solved this task this way, because this is what I'd do in the R language (which I am already quite familiar with), I'd make a vector in which I'd store the values from my iteration.
So my question is: Can this code be simplified? Can I do the same task without using list? If yes, how can I do it?

You could change the "values" object to a float. The overhead of a list is not really needed for this problem.
values = 0.0
Then in the loop use
values += portion
Otherwise, there really is not a simpler way, as this problem consists of several tasks and you must complete all of them to solve it:
Open File
Check For Error
Loop Through Lines
Find certain lines
Total up said lines
Print average
If you can do it in 3 lines of code, great, but that doesn't necessarily make what goes on in the background any simpler. It will also probably look ugly.
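Put together, the list-free version could look like the sketch below. The function name average_confidence and the sample lines are made up for illustration; the function takes any iterable of lines, so an open file handle such as fhand would work too.

```python
def average_confidence(lines, prefix="X-DSPAM-Confidence:"):
    """Average the numbers that follow `prefix`, keeping only a total and a count."""
    total = 0.0
    count = 0
    for line in lines:
        if line.startswith(prefix):
            total += float(line[len(prefix):])  # float() ignores surrounding whitespace
            count += 1
    # Avoid dividing by zero when no line matched.
    return total / count if count else None


# Made-up sample lines, not taken from mbox-short.txt.
sample = [
    "From: someone@example.com\n",
    "X-DSPAM-Confidence: 0.5\n",
    "X-DSPAM-Confidence: 1.0\n",
]
print(average_confidence(sample))  # 0.75
```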

You could filter the file's lines before the loop; then you can collapse the other variables into one and get the values using a list comprehension. The count is then simply the length of that list.
interesting_lines = (line for line in fhand if line.startswith('X-DSPAM-Confidence:'))
values = [float(line[(line.find(":")+1):]) for line in interesting_lines]
count = len(values)
Can I do the same task without using list?
If the output only needs to be an average, yes: you can accumulate the sum and the count in their own variables, and then you do not need a list to call sum(values) on.
Note that open(fname) is giving you an iterable collection anyway, and you're looping over the "list of lines" in the file.

List-comprehensions can often replace for-loops that add to a list:
fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
values = [float(l[l.find(":")+1:]) for l in fhand if l.startswith('X-DSPAM-Confidence:')]
print('Average spam confidence:', sum(values)/len(values))
The inner part is simply your code combined, so perhaps less readable.
EDIT: Without using lists, it can be done with "reduce":
from functools import reduce
fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
# The accumulator is named total so that it does not shadow the built-in sum()
total, count = reduce(
    lambda acc, l: (acc[0] + float(l[l.find(":")+1:]), acc[1] + 1)
    if l.startswith('X-DSPAM-Confidence:') else acc,
    fhand, (0, 0))
print('Average spam confidence:', total / count)
Reduce is often called "fold" in other languages, and it basically allows you to iterate over a collection with an "accumulator". Here, I iterate the collection with an accumulator which is a tuple of (sum, count). With each item, we add to the sum and increment the count. See Reduce documentation.
All this being said, "simplify" does not necessarily mean as little code as possible, so I would stick with your own code if you're not comfortable with these shorthand notations.

max min and average looking up file python

I'm trying to create a program that asks for the name of a file, opens the file, determines the maximum and minimum values in the file, and computes the average of the numbers in it. I want to print the max and min values and return the average of the values in the file. The file has only one number per line, with many different numbers from top to bottom. Here is my program so far:
def summaryStats():
    fileName = input("Enter the file name: ") #asking user for input of file
    file = open(fileName)
    highest = 1001
    lowest = 0
    sum = 0
    for element in file:
        if element.strip() > (lowest):
            lowest = element
        if element.strip() < (highest):
            highest = element
        sum += element
    average = sum/(len(file))
    print("the maximum number is ") + str(highest) + " ,and the minimum is " + str(lowest)
    file.close()
    return average
When I run my program, it is giving me this error:
summaryStats()
Enter the file name: myFile.txt
Traceback (most recent call last):
File "/Applications/Wing101.app/Contents/MacOS/src/debug/tserver/_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "/Applications/Wing101.app/Contents/MacOS/src/debug/tserver/_sandbox.py", line 8, in summaryStats
builtins.TypeError: unorderable types: str() > int()
I think I'm struggling determining which part to make a string. What do you guys think?

You are comparing two incompatible types, str and int. You need to make sure you are comparing similar types. You may want to rewrite your for loop to convert each line to an int before comparing:
count = 0
for element in file:
    element_value = int(element.strip())
    if element_value > (lowest):
        lowest = element_value
    if element_value < (highest):
        highest = element_value
    sum += element_value
    count += 1
average = sum / count  # a file object has no len(), so count the lines yourself
When Python reads a file, each line comes in as a str. You call strip to remove surrounding whitespace and newline characters. You then need to parse the remaining str into the correct type (int) for comparison and arithmetic.
You should read through your error messages; they are there to tell you where and why your code failed to run. The traceback shows where the error took place. The line
File "/Applications/Wing101.app/Contents/MacOS/src/debug/tserver/_sandbox.py", line 8, in summaryStats
tells you to examine line 8, which is where the error occurs.
The next line:
builtins.TypeError: unorderable types: str() > int()
tells you what is going wrong. A quick search through the Python docs locates the description of the error. An easy way to find advice is to look in the language documentation and perhaps search for the entire error message. You are probably not the first person with this problem, and there is likely a discussion with a solution that fits your specific error.
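For reference, a complete version of the function with the conversion applied might look like this. It is a sketch under a few assumptions: the name summary_stats is made up, it takes an iterable of lines (an open file works) instead of prompting for a file name, and it starts the running maximum and minimum at None rather than at 1001 and 0 so that no range of values is assumed.

```python
def summary_stats(lines):
    """Print the max and min, and return the average, of one-number-per-line input."""
    highest = lowest = None
    total = 0
    count = 0
    for line in lines:
        text = line.strip()
        if not text:           # skip blank lines
            continue
        value = int(text)      # convert before comparing: int vs int, not str vs int
        if highest is None or value > highest:
            highest = value
        if lowest is None or value < lowest:
            lowest = value
        total += value
        count += 1
    if count == 0:
        raise ValueError("no numbers found")
    print("the maximum number is " + str(highest) +
          ", and the minimum is " + str(lowest))
    return total / count


average = summary_stats(["12\n", "7\n", "1001\n", "\n"])
print(average)  # 340.0
```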

Lines like these:
if element.strip() > (lowest):
should probably convert explicitly to a number. Currently you're comparing a str to an int. Converting with int takes surrounding whitespace into account, so int(' 1 ') is 1:
if int(element.strip()) > lowest:
Also, you could do it like so:
# Assuming test.txt is a file with a number on each line.
with open('test.txt') as f:
    nums = [int(x) for x in f.readlines()]
print('Max: {0}'.format(max(nums)))
print('Min: {0}'.format(min(nums)))
print('Average: {0}'.format(sum(nums) / float(len(nums))))

When you call open(filename), you are constructing a file object. You can iterate over it in a for loop, but each item is a line of text (a str), and the file object itself has no length, so len(file) fails.
If each value is on its own line, after creating the file object, call:
lines = file.readlines()
Then loop through those lines and convert each one to int:
for line in lines:
    value = int(line)

Binary search over a huge file with unknown line length

I'm working with huge CSV data files. Each file contains millions of records, and each record has a key. The records are sorted by their key. I don't want to go over the whole file when searching for certain data.
I've seen this solution: Reading Huge File in Python
But it assumes that all lines in the file have the same length, which is not true in my case.
I thought about padding each line to keep a fixed line length, but I'd like to know if there is a better way to do it.
I'm working with Python.

You don't have to have a fixed width record because you don't have to do a record-oriented search. Instead you can just do a byte-oriented search and make sure that you realign to keys whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:
bytes = 24935502 # file size in bytes
for i, search in enumerate(list): # list contains the list of search keys
    left, right = 0, bytes - 1
    key = None
    while key != search and left <= right:
        mid = (left + right) // 2 # integer division, because seek() needs an int
        fin.seek(mid)
        # now realign to a record
        if mid:
            fin.readline()
        line = fin.readline()
        if not line: # mid fell inside the last record, so look further left
            right = mid - 1
            continue
        key, value = map(int, line.split())
        if search > key:
            left = mid + 1
        else:
            right = mid - 1
    if key != search:
        value = None # for when search key is not found
    search.result = value # store the result of the search

To solve it, you can also use binary search, but you need to change it a bit:
1. Get the file size.
2. Seek to the middle of the current range with File.seek.
3. Search for the first EOL character; a new line starts right after it.
4. Check this line's key; if it is not the one you want, update the range and go to 2.
Here is some sample code:
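fp = open('your file', 'rb')  # binary mode, so that seek() can use arbitrary byte offsets
fp.seek(0, 2)
begin = 0
end = fp.tell()
while (begin < end):
    mid = (end + begin) // 2
    fp.seek(mid, 0)
    fp.readline()  # skip the rest of the line that mid landed in
    line_start = fp.tell()
    line_key = get_key(fp.readline())
    if (key == line_key):
        break # found what you want
    elif (key > line_key):
        begin = line_start
    else:
        end = mid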
The code may have bugs (for one, it never checks the very first line of the file); verify it yourself. And check the performance if you really want the fastest way.
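For comparison, here is a self-contained sketch of the same idea that can be run against an in-memory file. The names find_record and line_at_or_after are hypothetical, as are the comma-separated layout and the integer key in the first field. It searches for the smallest byte offset at which the "first full line starting at or after this offset" has a key that is not smaller than the target, which also covers the first line and the end of the file.

```python
import io


def line_at_or_after(fp, pos):
    """Return the first full line that starts at an offset >= pos (b'' at EOF)."""
    if pos > 0:
        fp.seek(pos - 1)
        fp.readline()      # finish the line that contains byte pos - 1
    else:
        fp.seek(0)
    return fp.readline()


def find_record(fp, target, get_key):
    """Binary-search a key-sorted file opened in binary mode; return the line or None."""
    fp.seek(0, io.SEEK_END)
    lo, hi = 0, fp.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        line = line_at_or_after(fp, mid)
        if line and get_key(line) < target:
            lo = mid + 1   # the wanted line starts further right
        else:
            hi = mid       # the wanted line starts at or before this point
    line = line_at_or_after(fp, lo)
    if line and get_key(line) == target:
        return line
    return None


def first_field(line):
    """Hypothetical key extractor: the integer before the first comma."""
    return int(line.split(b",")[0])


# Lines of different lengths, sorted by key.
sample_file = io.BytesIO(b"1,a\n5,bb\n12,ccc\n100,d\n")
print(find_record(sample_file, 12, first_field))   # b'12,ccc\n'
print(find_record(sample_file, 7, first_field))    # None
```

Each probe costs one seek and at most two readline calls, so a lookup touches only O(log size) lines of the file.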

The answer on the referenced question that says binary search only works with fixed-length records is wrong. And you don't need to do a search at all, since you have multiple items to look up. Just walk through the entire file one line at a time, build a dictionary of key:offset for each line, and then for each of your search items jump to the record of interest by seeking (file.seek, or os.lseek on a file descriptor) to the offset corresponding to each key.
Of course, if you don't want to read the entire file even once, you'll have to do a binary search. But if building the index can be amortized over several lookups, perhaps saving the index if you only do one lookup per day, then a search is unnecessary.
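A minimal sketch of the key:offset index follows. The names build_index, lookup and first_field are hypothetical, as are the comma-separated layout and the integer key. It uses file.seek on a binary file object rather than os.lseek, and an in-memory file so that it runs as-is.

```python
import io


def build_index(fp, get_key):
    """Scan the file once and map each key to the byte offset where its line starts."""
    index = {}
    fp.seek(0)
    while True:
        offset = fp.tell()
        line = fp.readline()
        if not line:
            break
        index[get_key(line)] = offset
    return index


def lookup(fp, index, key):
    """Jump straight to the record for `key`, or return None if it is not indexed."""
    offset = index.get(key)
    if offset is None:
        return None
    fp.seek(offset)
    return fp.readline()


def first_field(line):
    """Hypothetical key extractor: the integer before the first comma."""
    return int(line.split(b",")[0])


records = io.BytesIO(b"1,a\n5,bb\n12,ccc\n")
index = build_index(records, first_field)
print(index)                       # {1: 0, 5: 4, 12: 9}
print(lookup(records, index, 12))  # b'12,ccc\n'
```

The index can be saved, for example with pickle, and reused across runs, which is what makes the single full scan worthwhile.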

index a list in a Python for loop

I'm making a for loop within a for loop. I'm looping through a list and finding a specific string that contains a regular expression pattern. Once I find the line, I need to search to find the next line of a certain pattern. I need to store both lines to be able to parse out the time for them. I've created a counter to keep track of the index number of the list as the outer for loop works. Can I use a construction like this to find the second line I need?
index = 0
for lineString in summaryList:
    match10secExp = re.search('taking 10 sec. exposure', lineString)
    if match10secExp:
        startPlate = lineString
        for line in summaryList[index:index+10]:
            matchExposure = re.search('taking \d\d\d sec. exposure', line)
            if matchExposure:
                endPlate = line
                break
    index = index + 1
The code runs, but I'm not getting the result I'm looking for.
Thanks.

matchExposure = re.search('taking \d\d\d sec. exposure', lineString)
should probably be
matchExposure = re.search('taking \d\d\d sec. exposure', line)

Depending on your exact needs, you can just use an iterator on the list, or two of them as made by itertools.tee. I.e., if you want to search lines following the first pattern only for the second pattern, a single iterator will do:
theiter = iter(thelist)
for aline in theiter:
    if re.search(somestart, aline):
        for another in theiter:
            if re.search(someend, another):
                yield aline, another # or print, whatever
                break
This will not search lines from aline to the ending another for somestart, only for someend. If you need to search them for both purposes, i.e., leave theiter itself intact for the outer loop, that's where tee can help:
for aline in theiter:
    if re.search(somestart, aline):
        _, anotheriter = itertools.tee(iter(thelist))
        for another in anotheriter:
            if re.search(someend, another):
                yield aline, another # or print, whatever
                break
This is an exception to the general rule about tee which the docs give:
Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed.
because the advancing of theiter and that of anotheriter occur in disjoint parts of the code, and anotheriter is always rebuilt afresh when needed (so the advancement of theiter in the meantime is not relevant).
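As a runnable illustration of the single-iterator variant, the sketch below wraps it in a generator function. The name paired_lines and the log lines are made up; the two patterns are the ones from the question, written as raw strings with the dot escaped.

```python
import re


def paired_lines(lines, start_pattern, end_pattern):
    """Yield (start_line, end_line) pairs using one shared iterator."""
    it = iter(lines)
    for line in it:
        if re.search(start_pattern, line):
            # The inner loop advances the same iterator, so the outer loop
            # resumes right after the matching end line.
            for later in it:
                if re.search(end_pattern, later):
                    yield line, later
                    break


# Made-up log lines for illustration.
log = [
    "12:00:01 taking 10 sec. exposure",
    "12:00:15 reading out",
    "12:00:20 taking 300 sec. exposure",
    "12:05:30 taking 10 sec. exposure",
]
pairs = list(paired_lines(log,
                          r"taking 10 sec\. exposure",
                          r"taking \d\d\d sec\. exposure"))
print(pairs)
```

The last log line starts an exposure that never ends, so it produces no pair: the inner loop simply runs out of lines.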
