I have two CSV files that have been renamed to text files. I need to compare a column in each one (a date) to confirm they have been updated.
For example, c:\temp\oldfile.txt has 6 columns and the last one is called version. I need to make sure that c:\temp\newfile.txt has a different value for version. It doesn't need to do any date verification of any kind; as long as the comparison sees that they're different, the script can proceed. If possible, I would prefer to stick with 'standard' libraries, as I'm just learning and don't want to start creating dictionaries or learning pandas and numpy just yet.
Edit
Here's a copy of oldfile.txt and newfile.txt.
oldfile.txt:
feed_publisher_name,feed_publisher_url,feed_lang,feed_start_date,feed_end_date,feed_version
MyStuff,http://www.mystuff.com,en,20220103,20220417,22APR_20220401
newfile.txt:
feed_publisher_name,feed_publisher_url,feed_lang,feed_start_date,feed_end_date,feed_version
MyStuff,http://www.mystuff.com,en,20220103,20220417,22APR_20220414
In this case the comparison would note that the last column has a different value and would know to proceed with the rest of the script. Otherwise, if the values are the same, it will know that it was not updated and I'll have the program exit.
You can do this with the csv module in the standard library, since that's the format of your files:
import csv

with open('oldfile.txt', 'r', newline='') as oldfile, \
     open('newfile.txt', 'r', newline='') as newfile:
    old_reader = csv.DictReader(oldfile)
    new_reader = csv.DictReader(newfile)
    old_row = next(old_reader)
    new_row = next(new_reader)
    same = old_row['feed_version'] == new_row['feed_version']

print(f"The files are {'the same' if same else 'different'}.")
If you are only interested in checking whether the two files are equal at all (essentially "updated"), you can compute the hash of one file and compare it with the hash of the other.
To compute a hash (for example, SHA-256), you can use the following function:
import hashlib

def sha256sum(filename):
    # Read the whole file in binary mode and hash its contents
    with open(filename, 'rb') as file:
        content = file.read()
    hasher = hashlib.sha256()
    hasher.update(content)
    return hasher.hexdigest()
hashlib is part of the standard library, so it will be available with a default Python installation.
For example, if you write "v1.0" in a text document, sha256sum will give "fa8b919c909d5eb9e373d090928170eb0e7936ac20ccf413332b96520903168e".
If you later change it to "v1.1", it will give "eb79768c42dbbf9f10733e525a06ea9eb08f28b7b8edf9c6dcacb63940aedcb0".
These are two different hexdigest values, so the two files are different.
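A minimal sketch of using the function above on the files from the question. Note that this flags any difference between the files, not just a change in feed_version:

if sha256sum('c:/temp/oldfile.txt') == sha256sum('c:/temp/newfile.txt'):
    raise SystemExit("Files are identical; nothing was updated.")
print("Files differ; proceeding.")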
Reading the file-
We don't need any libraries for this: just open the file, read it, then do a little parsing:
a, b = "", ""  # set the globals for the comparison
with open("c:/temp/oldfile.txt") as f:  # open the file as f
    text = f.read().split('\n')[1]  # read the contents, then keep just the second line
    a = text.split(',')[5]  # split the line by ',' into a list, then take the 6th element
Then opening the other one:
with open("c:/temp/newfile.txt") as f:
text = f.read().split('\n')[1]
b = text.split(',')[5]
Comparing the lines-
if a == b:
    print("The date is the same!")
else:
    print("The date is different...")
Of course, you can turn this into a function that returns whether or not the values are equal, then use that result to determine the future of the program.
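For example, one way that refactoring might look (the function name and the exit behaviour are just suggestions):

def version_changed(old_path, new_path):
    # read the 6th field of the second line of each file, then compare
    def read_version(path):
        with open(path) as f:
            return f.read().split('\n')[1].split(',')[5]
    return read_version(old_path) != read_version(new_path)

if not version_changed("c:/temp/oldfile.txt", "c:/temp/newfile.txt"):
    raise SystemExit("The date is the same, exiting.")
# ...otherwise continue with the rest of the script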
Hope this helps!
Related
I am new here, trying to solve one of my interesting questions about World of Tanks. I have heard that every battle's data is saved on the client's disk in the Wargaming.net folder, and I want to do a batch analysis of our clan's battle performance.
It is said that these .dat files are a kind of JSON file, so I tried a couple of lines of Python code to read one, but failed.
import json
f = open('ex.dat', 'r', encoding='unicode_escape')
content = f.read()
a = json.loads(content)
print(type(a))
print(a)
f.close()
The code is very simple and clearly fails to parse it. Could anyone tell me what these files really contain?
Added on Feb. 9th, 2022
After trying another snippet via Jupyter Notebook, it seems like something can be read from the .dat files:
import struct
import numpy as np
import matplotlib.pyplot as plt
import io

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    fbuff = io.BufferedReader(f)
    N = len(fbuff.read())
    print('byte length: ', N)

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    data = struct.unpack('b' * N, f.read(1 * N))
The result is a tuple of byte values, but I have no idea how to deal with it now.
Here's how you can parse some parts of it.
import pickle
import zlib
file = '4402905758116487.dat'
cache_file = open(file, 'rb') # This can be improved to not keep the file opened.
# To load pickle items written by Python 2 under Python 3, you need to use the "bytes" encoding or "latin1".
legacyBattleResultVersion, brAllDataRaw = pickle.load(cache_file, encoding='bytes', errors='ignore')
arenaUniqueID, brAccount, brVehicleRaw, brOtherDataRaw = brAllDataRaw
# The data stored inside the pickled file will be a compressed pickle again.
vehicle_data = pickle.loads(zlib.decompress(brVehicleRaw), encoding='latin1')
account_data = pickle.loads(zlib.decompress(brAccount), encoding='latin1')
brCommon, brPlayersInfo, brPlayersVehicle, brPlayersResult = pickle.loads(zlib.decompress(brOtherDataRaw), encoding='latin1')
# Lastly you can print all of these and see a lot of data inside.
The response contains a mixture of more binary files as well as some data captured from the replays.
This is not a complete solution but it's a decent start to parsing these files.
First you can look at the replay file itself in a text editor. But it won't show the code at the beginning of the file that has to be cleaned out. Then there is a ton of info that you have to read in and figure out but it is the stats for each player in the game. THEN it comes to the part that has to do with the actual replay. You don't need that stuff.
You can grab the player IDs and tank IDs from WoT developer area API if you want.
After loading the pickle files as gabzo mentioned, you will see that each is simply a list of values, and without knowing what each value refers to, it's hard to make sense of it. The identifiers for the values can be extracted from your game installation:
import zipfile

WOT_PKG_PATH = "Your/Game/Path/res/packages/scripts.pkg"
BATTLE_RESULTS_PATH = "scripts/common/battle_results/"

archive = zipfile.ZipFile(WOT_PKG_PATH, 'r')
for file in archive.namelist():
    if file.startswith(BATTLE_RESULTS_PATH):
        archive.extract(file)
You can then decompile the Python files (e.g. with uncompyle6) and go through the code to see the identifiers for the values.
One thing to note is that the list of values for the main pickle objects (like brAccount from gabzo's code) always has a checksum as the first value. You can use this to check whether you have the right order and the correct identifiers for the values. The way these checksums are generated can be seen in the decompiled python files.
I have been tackling this problem for some time (albeit in Rust): https://github.com/dacite/wot-battle-results-parser/tree/main/datfile_parser.
This is my function to build a record of a user's actions in a CSV file. It gets the username from a global and adds the given amount to the cell at the intersection of the user's row and the current date's column.
In brief, the function reads the CSV into a list, makes any modifications to the data, then rewrites the whole list back into the CSV file.
The first item of each row is the username, and the header row holds the dates.
Accs\Dates,12/25/2016,12/26/2016,12/27/2016
user1,217,338,653
user2,261,0,34
user3,0,140,455
However, I'm not sure why sometimes the header gets pushed down to the second row, and the data gets wiped entirely when it crashes.
Also, I need to point out that there may be multiple scripts running this function and writing to the same file; I'm not sure if that is causing the issue.
I'm thinking maybe I could write the stats separately and uniquely for each user and combine them later, eliminating the possible clash of writes. Although it would be great if I could just improve what I have here and keep reading/writing everything in one file.
Any fail-safe way to do what I'm trying to do here?
# Search for the current user in the first column and update the count in
# today's date column; 'amount' will be added at that position.
# (Assumes `import csv`, `import datetime`, `import re` at module level.)
def dailyStats(self, amount, code=None):
    def initStats():
        # prepping table
        with open(self.stats, 'r') as f:
            reader = csv.reader(f)
            for row in reader:
                if row:
                    self.statsTable.append(row)
                    self.statsNames.append(row[0])

    def getIndex(list, match):
        # get the index of the matched date or user
        for i, j in enumerate(list):
            if j == match:
                return i

    self.statsTable = []
    self.statsNames = []
    self.statsDates = None
    initStats()

    today = datetime.datetime.now().strftime('%m/%d/%Y')
    user_index = None
    today_index = None

    # append header if the csv is empty
    if len(self.statsTable) == 0:
        self.statsTable.append([r'Accs\Dates'])
        # rebuild updated table
        initStats()

    # add new user/date if not found in first column/row
    self.statsDates = self.statsTable[0]
    if getIndex(self.statsNames, self.username) is None:
        self.statsTable.append([self.username])
    if getIndex(self.statsDates, today) is None:
        self.statsDates.append(today)

    # rebuild statsNames after the table was appended to
    self.statsNames = []
    for row in self.statsTable:
        self.statsNames.append(row[0])

    # getting the index of the user (row) and date (column)
    user_index = getIndex(self.statsNames, self.username)
    today_index = getIndex(self.statsDates, today)

    # in the matched user's row, if there are dates before today with no
    # data, append 0 (e.g. user1,0,0,0) until reaching today's date column
    if len(self.statsTable[user_index]) < today_index + 1:
        for i in range(0, today_index + 1 - len(self.statsTable[user_index])):
            self.statsTable[user_index].append(0)

    # insert pv or tb code if given
    if code is None:
        self.statsTable[user_index][today_index] = amount + int(re.match(r'\b\d+?\b', str(self.statsTable[user_index][today_index])).group(0))
    else:
        self.statsTable[user_index][today_index] = str(re.match(r'\b\d+?\b', str(self.statsTable[user_index][today_index])).group(0)) + ' - ' + code

    # write the final table back out
    with open(self.stats, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(self.statsTable)

    # return the sum of the user's total count
    total_follow = 0
    for i in range(1, len(self.statsTable[user_index])):
        total_follow += int(re.match(r'\b\d+?\b', str(self.statsTable[user_index][i])).group(0))
    return total_follow
As David Z says, concurrency is more likely the cause of your problem.
I will add that the CSV format is not suitable for database-style storage, indexing, or sorting, because it is plain text and sequential.
You could handle this with an RDBMS for storing and updating your data, periodically processing your stats; CSV then becomes just an import/export format.
Python offers a SQLite binding in its standard library. If you build a connector that imports/updates the CSV content into a SQLite schema and then dumps the results back out as CSV, you will be able to handle concurrency and keep your native format, without worrying about installing a database server or new Python packages.
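A minimal sketch of that idea (the schema and names here are illustrative, and the UPSERT syntax assumes the bundled SQLite is version 3.24 or newer):

import csv
import sqlite3

conn = sqlite3.connect('stats.db')  # SQLite serializes concurrent writers for us
conn.execute('CREATE TABLE IF NOT EXISTS stats ('
             'user TEXT, date TEXT, count INTEGER, PRIMARY KEY (user, date))')

def add_amount(user, date, amount):
    # each call runs as its own transaction, so concurrent scripts stay safe
    with conn:
        conn.execute(
            'INSERT INTO stats VALUES (?, ?, ?) '
            'ON CONFLICT(user, date) DO UPDATE SET count = count + excluded.count',
            (user, date, amount))

def export_csv(path):
    # dump the table back out as CSV (one user/date/count triple per row)
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(
            conn.execute('SELECT user, date, count FROM stats ORDER BY user, date'))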
"Also, I need to point out that there may be multiple scripts running this function and writing to the same file; I'm not sure if that is causing the issue."
More likely than not that is exactly your issue. When two things are trying to write to the same file at the same time, the outputs from the two sources can easily get mixed up together, resulting in a file full of gibberish.
An easy way to fix this is just what you mentioned in the question, have each different process (or thread) write to its own file and then have separate code to combine all those files in the end. That's what I would probably do.
If you don't want to do that, what you can do is have different processes/threads send their information to an "aggregator process", which puts everything together and writes it to the file - the key is that only the aggregator ever writes to the file. Of course, doing that requires you to build in some method of interprocess communication (IPC), and that in turn can be tricky, depending on how you do it. Actually, one of the best ways to implement IPC for simple programs is by using temporary files, which is just the same thing as in the previous paragraph.
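A rough sketch of the separate-files approach (the naming scheme and the combining step are just one possibility):

import csv
import glob
import os

def write_own_stats(row):
    # each process appends only to its own file, e.g. stats.12345.csv,
    # so no two writers ever touch the same file
    with open('stats.{}.csv'.format(os.getpid()), 'a', newline='') as f:
        csv.writer(f).writerow(row)

def combine_stats(pattern='stats.*.csv', out='stats_combined.csv'):
    # run once, by a single process, after the workers are done
    with open(out, 'w', newline='') as f_out:
        writer = csv.writer(f_out)
        for fname in sorted(glob.glob(pattern)):
            with open(fname, newline='') as f_in:
                writer.writerows(csv.reader(f_in))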
So, I started doing some Python recently, and I have always liked lifting weights as well. Therefore, I was thinking about a little program where I can put in my training progress (as a kind of Python exercise).
I do something like the following as an example:
from sys import argv
file = argv[1]
target_file = open(file, 'w')
weigth = raw_input("Enter what you lifted today: ")
weigth_list = []
weigth_list.append(weigth)
file.write(weigth_list)
file.close()
Now, I know that a lot is wrong here, but this is just to get across the idea I had in mind. What I was hoping to do was create a file, store the raw_input() in a list, and save that list to the file. Then the next time I run the script (say, after the next training), I want to save another number and add it to the list. Additionally, I want to do some plotting with the data stored in the list and the file.
Now, I know I could simply do that in Excel but I would prefer to do it in python. Hopefully, someone understood what I mean.
Unsure what exactly your weight_list looks like, or whether you're planning this for one specific workout or the general case, but you'll probably want to use something like a CSV (comma-separated values) format to save the info and be able to easily plot it (for the general case of N different workout types). See below for what I mean:
$ ./record-workout saved-workouts.csv
where the record-form is
<workout type>,<number of sets>,<number of reps>,<weight>
and saved-workouts.csv is the file we'll save to
then, modifying your script ever-so-slightly:
# even though this is a small example, it's usually preferred
# to import the modules from a readability standpoint [1]
import sys
# we'll import time so we can get todays date, so you can record
# when you worked out
import time

# you'll likely want to check that the user provided arguments
if len(sys.argv) != 2:
    # we'll print a nice message that will show the user
    # how to use the script
    print "usage: {} <workout_file>".format(sys.argv[0])
    # after printing the message, we'll exit with an error-code
    # because we can't do anything else!
    sys.exit(1)

# `sys.argv[1]` should contain the first command line argument,
# which in this case is the name of the data file we want
# to write to (and subsequently read from when we're plotting);
# therefore, the type of `filename` below is `str` (string).
#
# Note: I changed the name from `file` to `filename` because although `file`
# is not a reserved word, it's the name of a built-in type (and constructor) [2]
filename = sys.argv[1]

# in Python, it's recommended to use a `with` statement
# to safely open a file. [3]
#
# Also, note that we're using 'a' as the mode with which
# to open the file, which means `append` rather than `write`;
# 'w' would overwrite the file every time, but in this case we want to append.
#
# Lastly, note that `target_file` is the name of the file object,
# which is the object to which you'll be able to read or write or append.
with open(filename, 'a') as target_file:
    # you'd probably want the csv-form to look like
    #
    #   benchpress,2,5,225
    #
    # so for the general case, let's build this up
    workout = raw_input("Enter what workout you did today: ")
    num_sets = raw_input("Enter the number of sets you did today: ")
    num_reps = raw_input("Enter the number of reps per set you did today: ")
    weight = raw_input("Enter the weight you lifted today: ")
    # you might also want to record the day and time you worked out [4]
    todays_date = time.strftime("%Y-%m-%d %H:%M:%S")
    # this says "join each element in the passed-in tuple/list
    # as a string separated by a comma"
    workout_entry = ','.join((workout, num_sets, num_reps, weight, todays_date))
    # you don't need to save all the entries to a list,
    # you can simply write the workout out to the file object `target_file`
    # (with a trailing newline, so each run adds its own line)
    target_file.write(workout_entry + '\n')

# Note: I removed the `target_file.close()` because the file closes when the
# program reaches the end of the `with` statement.
The structure of saved-workouts.csv would thus be:
workout,sets,reps,weight,date
benchpress,2,5,225,2016-12-25 09:30:00
This would also allow you to easily parse the data when you're getting ready to plot it. In this case, you'd want another script (or another function in the above script) to read the file using something like below:
import sys
# since we're reading the csv file, we'll want to use the `csv` module
# to help us parse it
import csv

if len(sys.argv) < 2:
    print "usage: {} <workout_file>".format(sys.argv[0])
    sys.exit(1)

filename = sys.argv[1]

# now that we're reading the file, we'll use `r`
with open(filename, 'r') as data_file:
    # to use `csv`, you need to create a csv-reader object by
    # passing in the `data_file` `file` object
    reader = csv.reader(data_file)
    # now `reader` is a parsed, iterable version of the file
    for row in reader:
        # here's where you'll want to investigate different plotting
        # libraries and such; you'll be accessing the various
        # fields in each line as follows:
        workout_name = row[0]
        num_sets = row[1]
        num_reps = row[2]
        weight = row[3]
        workout_time = row[4]
        # optionally, if your csv file contains a header row, you can use
        # `csv.DictReader(data_file)` instead and access fields by name:
        #
        #   row['weight'] or row['workout'], etc.
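For the plotting itself, here is one possible direction using matplotlib (a third-party library, not the standard library; the 'benchpress' filter and the column positions simply follow the example layout above):

import csv
import matplotlib.pyplot as plt

dates, weights = [], []
with open('saved-workouts.csv', 'r') as data_file:
    for row in csv.reader(data_file):
        if row and row[0] == 'benchpress':  # plot a single workout type
            weights.append(float(row[3]))
            dates.append(row[4])

plt.plot(range(len(weights)), weights, marker='o')
plt.xticks(range(len(weights)), dates, rotation=45)
plt.ylabel('weight')
plt.title('benchpress progress')
plt.show()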
Sources:
[1] https://softwareengineering.stackexchange.com/questions/187403/import-module-vs-from-module-import-function
[2] https://docs.python.org/2/library/functions.html#file
[3] http://effbot.org/zone/python-with-statement.htm
[4] How to get current time in Python
I'm trying to write a script in Python for sorting through files (photos, videos), checking the metadata of each, and finding and moving all duplicates to a separate directory. I got stuck with the metadata-checking part. I tried os.stat, but it doesn't return equal results for duplicate files. Ideally, I should be able to do something like:
if os.stat("original.jpg")== os.stat("duplicate.jpg"):
shutil.copy("duplicate.jpg","C:\\Duplicate Folder")
Pointers anyone?
There are a few things you can do. You can compare the contents or hash of each file, or you can check a few select properties from the os.stat result, e.g.:
import os

def is_duplicate(file1, file2):
    stat1, stat2 = os.stat(file1), os.stat(file2)
    return stat1.st_size == stat2.st_size and stat1.st_mtime == stat2.st_mtime
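For example, tied back to the snippet from the question (using shutil.move rather than shutil.copy, since the goal is to move the duplicates):

import shutil

if is_duplicate("original.jpg", "duplicate.jpg"):
    shutil.move("duplicate.jpg", "C:\\Duplicate Folder")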
A basic loop using a set to keep track of already encountered files:
import glob
import hashlib

uniq = set()
for fname in glob.glob('*.txt'):
    with open(fname, "rb") as f:
        sig = hashlib.sha256(f.read()).digest()
    if sig not in uniq:
        uniq.add(sig)
        print(fname)
    else:
        print(fname, "(duplicate)")
Please note that, as with any hash function, there is a slight chance of collision, that is, two different files having the same digest. Depending on your needs, this is acceptable or not.
According to Thomas Pornin in another answer:
"For instance, with SHA-256 (n=256) and one billion messages (p=10^9) then the probability [of collision] is about 4.3*10^-60."
Given your need, if you have to check for additional properties in order to identify "true" duplicates, change the sig = ... line to whatever suits you. For example, if you need to check for "same content" and "same owner" (st_uid as returned by os.stat()), write:
sig = (hashlib.sha256(f.read()).digest(),
       os.stat(fname).st_uid)
If two files have the same MD5 hash, they are almost certainly exact duplicates.
from hashlib import md5

with open(file1, "r") as original:
    original_md5 = md5(original.read()).hexdigest()

with open(file2, "r") as duplicate:
    duplicate_md5 = md5(duplicate.read()).hexdigest()

if original_md5 == duplicate_md5:
    do_stuff()
In your example you're using jpg files; in that case you want to call open with its second argument equal to 'rb'. For that, see the documentation for open.
os.stat offers information about a file's metadata, including its creation time. That is not a good approach for finding out whether two files are the same.
For instance, two files can have identical contents but different creation times, so comparing stats will fail here. Sylvain Leroux's approach is the best one when combining performance and accuracy, since it is very rare for two different files to have the same hash.
So, unless you have an incredibly large amount of data and a repeated file would cause a system fatality, this is the way to go.
If that is your case (it does not seem to be), well... the only way you can be 100% sure two files are the same is iterating and performing a comparison byte by byte.
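A sketch of that byte-by-byte comparison, reading in chunks so large files are never loaded whole (the standard library's filecmp.cmp(a, b, shallow=False) does essentially this for you):

def same_contents(path1, path2, chunk_size=8192):
    # compare two files byte for byte, one chunk at a time
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        while True:
            chunk1, chunk2 = f1.read(chunk_size), f2.read(chunk_size)
            if chunk1 != chunk2:
                return False
            if not chunk1:  # both files exhausted at the same point: equal
                return True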
Could I please be advised on the following problem? I have CSV files which I would like to compare. The first contains coordinates of specific points in the genome (e.g. chr3: 987654 – 987654). The other CSV files contain coordinates of genomic regions (e.g. chr3: 135596 – 123456789). I would like to cross-compare my first file with my other files to see if any point locations in the first file overlap with any regional coordinates in the other files, and to write this set of overlaps into a separate file. To make things simple for a start, I have drafted a simple piece of code to cross-compare between 2 CSV files. Strangely, my code runs and prints the coordinates but does not write the point coordinates into a separate file. My first question is whether my approach (from my code) to comparing these two files is optimal, or is there a better way of doing this? Secondly, why is it not writing into a separate file?
import csv

Region = open('Region_test1.csv', 'rt', newline='')
reader_Region = csv.reader(Region, delimiter=',')
DMC = open('DMC_test.csv', 'rt', newline='')
reader_DMC = csv.reader(DMC, delimiter=',')
DMC_testpoint = open('DMC_testpoint.csv', 'wt', newline='')
writer_Exon = csv.writer(DMC_testpoint, delimiter=',')

for col in reader_Region:
    Chr_region = col[0]
    Start_region = int(col[1])
    End_region = int(col[2])
    for col in reader_DMC:
        Chr_point = col[0]
        Start_point = int(col[1])
        End_point = int(col[2])
        if Chr_region == Chr_point and Start_region <= Start_point and End_region >= End_point:
            print(True, col)
        else:
            print(False, col)
            writer_Exon.writerow(col)

Region.close()
DMC.close()
A couple of things are wrong, not the least of which is that you never check to see if your files opened successfully. The most glaring is that you never close your writer.
That said, this is an incredibly non-optimal way to go about the program. File I/O is slow, and you don't want to keep rereading everything in a factorial fashion. Given how your search requires all possible comparisons, you'll want to store at least one of the two files completely in memory, and potentially use a generator/iterator over the other if you don't wish to store both complete sets of data in memory.
Once you have both sets loaded, proceed to do your intersection checks.
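A sketch of that approach using the filenames from the question (assuming, as the original code does, three columns of chromosome/start/end and no header row):

import csv

# load every region into memory once
with open('Region_test1.csv', newline='') as f:
    regions = [(row[0], int(row[1]), int(row[2])) for row in csv.reader(f)]

with open('DMC_test.csv', newline='') as f_in, \
     open('DMC_testpoint.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    for row in csv.reader(f_in):  # stream the points one row at a time
        chrom, start, end = row[0], int(row[1]), int(row[2])
        if any(c == chrom and s <= start and e >= end for c, s, e in regions):
            writer.writerow(row)  # overlapping point: write it out
# the with blocks close (and flush) all three files automatically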
I'd suggest you take a look at http://docs.python.org/2/library/csv.html for how to use a csv reader, because what you are doing doesn't appear to make much sense: col[0], col[1] and col[2] aren't going to be what you think they are.
These are style and readability things, but:
The names of some iteration variables seem a bit off: for col in ... should probably be for row in ..., because the csv reader hands you one row at a time, not one column at a time.
Additionally, it would be nice to pick a consistent convention for your variable names; sometimes you start with an uppercase letter, sometimes you save the uppercase for after the underscore.
Putting a space between some function names and their arguments but not others is also very odd. But again, these don't change the functionality of your code.