I have around 2000 JSON files which I'm trying to run through a Python program. A problem occurs when a JSON file is not in the correct format. (Error: ValueError: No JSON object could be decoded) In turn, I can't read it into my program.
I am currently doing something like the below:
for files in folder:
with open(files) as f:
data = json.load(f); # It causes an error at this part
I know there's offline methods to validating and formatting JSON files but is there a programmatic way to check and format these files? If not, is there a free/cheap alternative to fixing all of these files offline i.e. I just run the program on the folder containing all the JSON files and it formats them as required?
SOLVED using #reece's comment:
invalid_json_files = []
read_json_files = []
def parse():
for files in os.listdir(os.getcwd()):
with open(files) as json_file:
try:
simplejson.load(json_file)
read_json_files.append(files)
except ValueError, e:
print ("JSON object issue: %s") % e
invalid_json_files.append(files)
print invalid_json_files, len(read_json_files)
Turns out that I was saving a file which is not in JSON format in my working directory which was the same place I was reading data from. Thanks for the helpful suggestions.
The built-in JSON module can be used as a validator:
import json
def parse(text):
try:
return json.loads(text)
except ValueError as e:
print('invalid json: %s' % e)
return None # or: raise
You can make it work with files by using:
with open(filename) as f:
return json.load(f)
instead of json.loads and you can include the filename as well in the error message.
On Python 3.3.5, for {test: "foo"}, I get:
invalid json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
and on 2.7.6:
invalid json: Expecting property name: line 1 column 2 (char 1)
This is because the correct json is {"test": "foo"}.
When handling the invalid files, it is best to not process them any further. You can build a skipped.txt file listing the files with the error, so they can be checked and fixed by hand.
If possible, you should check the site/program that generated the invalid json files, fix that and then re-generate the json file. Otherwise, you are going to keep having new files that are invalid JSON.
Failing that, you will need to write a custom json parser that fixes common errors. With that, you should be putting the original under source control (or archived), so you can see and check the differences that the automated tool fixes (as a sanity check). Ambiguous cases should be fixed by hand.
Yes, there are ways to validate that a JSON file is valid. One way is to use a JSON parsing library that will throw exceptions if the input you provide is not well-formatted.
try:
load_json_file(filename)
except InvalidDataException: # or something
# oops guess it's not valid
Of course, if you want to fix it, you naturally cannot use a JSON loader since, well, it's not valid JSON in the first place. Unless the library you're using will automatically fix things for you, in which case you probably wouldn't even have this question.
One way is to load the file manually and tokenize it and attempt to detect errors and try to fix them as you go, but I'm sure there are cases where the error is just not possible to fix automatically and would be better off throwing an error and asking the user to fix their files.
I have not written a JSON fixer myself so I can't provide any details on how you might go about actually fixing errors.
However I am not sure whether it would be a good idea to fix all errors, since then you'd have assume your fixes are what the user actually wants. If it's a missing comma or they have an extra trailing comma, then that might be OK, but there may be cases where it is ambiguous what the user wants.
Here is a full python3 example for the next novice python programmer that stumbles upon this answer. I was exporting 16000 records as json files. I had to restart the process several times so I needed to verify that all of the json files were indeed valid before I started importing into a new system.
I am no python programmer so when I tried the answers above as written, nothing happened. Seems like a few lines of code were missing. The example below handles files in the current folder or a specific folder.
verify.py
import json
import os
import sys
from os.path import isfile,join
# check if a folder name was specified
if len(sys.argv) > 1:
folder = sys.argv[1]
else:
folder = os.getcwd()
# array to hold invalid and valid files
invalid_json_files = []
read_json_files = []
def parse():
# loop through the folder
for files in os.listdir(folder):
# check if the combined path and filename is a file
if isfile(join(folder,files)):
# open the file
with open(join(folder,files)) as json_file:
# try reading the json file using the json interpreter
try:
json.load(json_file)
read_json_files.append(files)
except ValueError as e:
# if the file is not valid, print the error
# and add the file to the list of invalid files
print("JSON object issue: %s" % e)
invalid_json_files.append(files)
print(invalid_json_files)
print(len(read_json_files))
parse()
Example:
python3 verify.py
or
python3 verify.py somefolder
tested with python 3.7.3
It was not clear to me how to provide path to the file folder, so I'd like to provide answer with this option.
path = r'C:\Users\altz7\Desktop\your_folder_name' # use your path
all_files = glob.glob(path + "/*.json")
data_list = []
invalid_json_files = []
for filename in all_files:
try:
df = pd.read_json(filename)
data_list.append(df)
except ValueError:
invalid_json_files.append(filename)
print("Files in correct format: {}".format(len(data_list)))
print("Not readable files: {}".format(len(invalid_json_files)))
#df = pd.concat(data_list, axis=0, ignore_index=True) #will create pandas dataframe
from readable files, if you like
Related
This question already has answers here:
How to create an encrypted ZIP file?
(8 answers)
Closed 11 months ago.
I have an encrypted ZIP file and for some reason, any password I feed it doesn't seem to matter as it can add files to the archive regardless. I checked for any ignored exceptions or anything, but nothing seems to be fairly obvious.
I posted the minimalist code below:
import zipfile
z = zipfile.ZipFile('test.zip', 'a') #Set zipfile object
zipPass = str(input("Please enter the zip password: "))
zipPass = bytes(zipPass, encoding='utf-8')
z.setpassword(zipPass) #Set password
z.write("test.txt")
I am not sure what I am missing here, but I was looking around for anything in zipfile that can handle encrypted zipfiles and add files into them using the password, as the only thing I have is the ``z.setpassword()` function that seems to not work here.
TL;DR: z.write() doesn't throw an exception and neither does z.setpassword() or anything zipfile related when fed the incorrect password, and willingly adds files no matter what. I was expecting to get BadPasswordForFile.
Is there any way to do this?
What I found in the documentation for zipfile is that the library supports decryption only with a password. It cannot encrypt. So you won't be able to add files with a password.
It supports decryption of encrypted files in ZIP archives, but it currently cannot create an encrypted file.
https://docs.python.org/3/library/zipfile.html
EDIT: Further, looking into python bugs Issue 34546: Add encryption support to zipfile it appears that in order to not perpetuate a weak password scheme that is used in zip, they opted to not include it.
Something that you could do is utilize subprocess to add files with a password.
Further, if you wanted to "validate" the entered password first, you could do something like this but you'd have to know the contents of the file because decrypt will happily decrypt any file with any password, the plaintext result will just be not correct.
Issues you'll have to solved:
Comparing file contents to validate password
Handling when a file exists already in the zip file
handling when the zipfile already exists AND when it doesn't.
import subprocess
import zipfile
def zip_file(zipFilename, filename):
zipPass = str(input("Please enter the zip password: "))
zipPass = bytes(zipPass, encoding='utf-8')
#If there is a file that we know the plain-text (or original binary)
#TODO: handle fipFilename not existing.
validPass=False
with zipfile.ZipFile(zipFilename, 'r') as zFile:
zFile.setpassword(zipPass)
with zFile.open('known.txt') as knownFile:
#TODO: compare contents of known.txt with actual
validPass=True
#Next to add file with password cannot use zipfile because password not supported
# Note this is a linux only solution, os dependency will need to be checked
#if compare was ok, then valid password?
if not validPass:
print('Invalid Password')
else:
#TODO: handle zipfile not-exist and existing may have to pass
# different flags.
#TODO: handle filename existing in zipFilename
#WARNING the linux manual page for 'zip' states -P is UNSECURE.
res = subprocess.run(['zip', '-e', '-P', zipPass, zipFilename, filename])
#TODO: Check res for success or failure.
EDIT:
I looked into fixing the whole "exposed password" issue with -P. Unfortunately, it is non trivial. You cannot simply write zipPass into the stdin of the subprocess.run with input=. I think something like pexpect might be a solution for this, but I haven't spent the time to make that work. See here for example of how to use pexpect to accomplish this: Use subprocess to send a password_
After all of the lovely replies, I did find a workaround for this just in case someone needs the answer!
I did first retry the z.testzip() and it does actually catch the bad passwords, but after seeing that it wasn't reliable (apparently hash collisions that allow for bad passwords to somehow match a small hash), I decided to use the password, extract the first file it sees in the archive, and then extract it. If it works, remove the extracted file, and if it doesn't, no harm done.
Code works as below:
try:
z = zipfile.ZipFile(fileName, 'a') #Set zipfile object
zipPass = bytes(zipPass, encoding='utf-8') #Str to Bytes
z.setpassword(zipPass) #Set password
filesInArray = z.namelist() #Get all files
testfile = filesInArray[0] #First archive in list
z.extract(testfile, pwd=zipPass) #Extract first file
os.remove(testfile) #remove file if successfully extracted
except Exception as e:
print("Exception occurred: ",repr(e))
return None #Return to mainGUI - this exits the function without further processing
Thank you guys for the comments and answers!
I am working on an app that will be something like a text editor, using kivy.
I included a FileChooser to choose a file to edit, I included a try except to catch the problems of files readability like videos and executable files.The problem is that this doesn't work and it raises an error of decode so the try except didn't work as expected.
I am thinking now of something like precising the extentions of files to open.But I would like to request from you a list of extensions that python can read.
Please Help
I think no need to share my code because there is nothing specific in it and I think my request is clear:
A list of extensions that python can read.
Ok so here is my check function that is called when a user click a button after choosing a file(This is in kivy by the way):
def check(self):
app= App.get_running_app()
with open(os.path.join(sys.path[0], "data.txt"), "r") as f:
dataa= f.read()
data= dataa.split('\n')
if self.namee.text in dataa:
for i in data:
ii = i.split(':')
if ii[1] == self.namee.text and ii[3]== self.password.text:
self.btn.text="Identified successfully!!"
time.sleep(0.5)
break
else:
self.btn.text="Not identified successfully!!"
elif self.password.text.strip() =="":
try:
with codecs.open(self.namee.text, 'r', encoding="utf-8", errors='ignore') as file:
try:
test = file.read()
print(test)
self.btn.text=="Identified successfully!!"
except:
self.btn.text ="Oops! Couldn't access the content of this file, try again later or verify your typing."
self.btn.text=="Identified successfully!!"
except:
self.btn.text ="Oops! Couldn't open the file, try again later or verify your typing."
else:
self.btn.text = "Cannot find any secret file with this name!"
if self.btn.text=="Identified successfully!!":
app.root.current = "control"
self.btn.text = "open"
self.namee.text = ""
self.password.text = ""
File extensions don't actually really work that way, there is no simple list that Python can read, and nor does the extension actually mean anything - for instance, you could give a video file the extension .py if you wanted.
You initial idea of catching errors is the right way to do things, and you'll want to have that functionality even if you also limit the available file extensions, since a file with a valid extension may contain invalid data. There's no reason it shouldn't work so probably you had some bug you can resolve.
Also, your request isn't clear, since "how to limit the type of files to open in kivy" != "what extensions can python read". It sounds like you already know the answer to the former question.
To my understanding python can read any file. Extensions don't matter. What matters is the way you handle the file.
I have a simple JSON file that I was supposed to use as a configuration file, it contains the default directories for whoever is running the script using their MacBooks:
{
"main_sheet_path": "/Users/jammer/Documents/Studios/CAT/000-WeeklyReports/2020/",
"reference_sheet_path": "/Users/jammer/Documents/DownloadedFiles/"
}
I read the JSON file and obtain the values using this code:
with open('reportconfig.json','r') as j:
config_data = json.load(j)
main_sheet_path = str(config_data.get('main_sheet_path'))
reference_sheet_path = str(config_data.get('reference_sheet_path'))
I use the path to check for a source file's existence before doing anything with it:
source_file = 'source.xlsx'
source_file = main_sheet_path + filename
if not os.path.isfile(source_file) :
print ('ERROR: Source file \'' + source_file + '\' NOT FOUND!')
return
Note that the filename is inputted as a parameter when the script is run (there are multiple files, the script has to know which one to target).
The file is there for sure but the script never seems to "see" it so I get that "ERROR" that I printed in the above code. Why do I think there are invisible characters? Because when I copy and paste from what was printed in the "error" notice above into the terminal, the last few characters of the file name always gets substituted by some invisible characters and hitting backspace erases characters where the cursor isn't supposed to be.
How do I know for sure that the file is there and that my problem is with reading the JSON file and not in the Directory names or anywhere else in the code? Because I finally gave up on using a JSON config file and went with a configuration file like this instead:
#!/usr/local/bin/python3.7
# -*- coding: utf-8 -*-
file_paths = { "main_sheet_path": "/Users/jammer/Documents/Studios/CAT/000-WeeklyReports/2020/",
"reference_sheet_path": "/Users/jammer/Documents/DownloadedFiles/"
}
I then just import the file and obtain the values like this:
import reportconfig as cfg
main_sheet_path = cfg.file_paths['main_sheet_path']
reference_sheet_path = cfg.file_paths['reference_sheet_path']
...
This workaround works perfectly — I don't get the "error" that the file isn't there when it is and the rest of the script is executed as expected. When the file isn't there, I get the proper "error" I expect and copying-and-pasting the full path and filename from the "error message" gives me the complete file name and hitting the backspace erases the right characters (no funny behavior, no invisible characters).
But could anyone please tell me how read the JSON file properly without getting those pesky invisible characters? I've spent hours trying to figure it out including searching seemingly related questions in stackoverflow but couldn't find the answer. TIA!
I think there is just a typo error in this code:
source_file = 'source.xlsx'
source_file = main_sheet_path + filename
Maybe filename is set to some other file which is not present hence it is giving you error.
Try to set filename='source.xlsx'
Maybe it will help
I recently made a Twitter-bot that takes a specified .txt file and tweets it out, line by line. A lot of the features I built into the program to troubleshoot some formatting issues actually allows the program to work with pretty much any text file.
I would like build in a feature where I can "import" a .txt file to use.
I put that in quotes because the program runs in the command line at them moment.
I figured there are two ways I can tackle this problem but need some guidance on each:
A) I begin the program with a prompt asking which file the user want to use. This is stored as a string (lets say variable string) and the code looks like this-
file = open(string,'r')
There are two main issues with. The first is I'm unsure how to keep the program from crashing if the program specified is misspelled or does not exist. The second is that it won't mesh with future development (eventually I'd like to build app functionality around this program)
B) Somehow specify the desired file somehow in the command line. While the program will still occasionally crash, it isn't as inconvenient to the user. Also, this would lend itself to future development, as it'll be easier to pass a value in through the command line than an internal prompt.
Any ideas?
For the first part of the question, exception handling is the way to go . Though for the second part you can also use a module called argparse.
import argparse
# creating command line argument with the name file_name
parser = argparse.ArgumentParser()
parser.add_argument("file_name", help="Enter file name")
args = parser.parse_args()
# reading the file
with open(args.file_name,'r') as f:
file = f.read()
You can read more about the argparse module on its documentation page.
Regarding A), you may want to investigate
try:
with open(fname) as f:
blah = f.read()
except Exception as ex:
# handle error
and for B) you can, e.g.
import sys
fname = sys.argv[1]
You could also combine the both to make sure that the user has passed an argument:
#!/usr/bin/env python
# encoding: utf-8
import sys
def tweet_me(text):
# your bot goes here
print text
if __name__ == '__main__':
try:
fname = sys.argv[1]
with open(fname) as f:
blah = f.read()
tweet_me(blah)
except Exception as ex:
print ex
print "Please call this as %s <name-of-textfile>" % sys.argv[0]
Just in case someone wonders about the # encoding: utf-8 line. This allows the source code to contain utf-8 characters. Otherwise only ASCII is allowed, which would be ok for this script. So the line is not necessary. I was, however, testing the script on itself (python x.py x.py) and, as a little test, added a utf-8 comment (# ä). In real life, you will have to care a lot more for character encoding of your input...
Beware, however, that just catching any Exception that may arise from the whole program is not considered good coding style. While Python encourages to assume the best and try it, it might be wise to catch expectable errors right where they happen. For example , accessing a file which does not exist will raise an IOError. You may end up with something like:
except IndexError as ex:
print "Syntax: %s <text-file>" % sys.argv[0]
except IOError as ex:
print "Please provide an existing (and accessible) text-file."
except Exception as ex:
print "uncaught Exception:", type(ex)
I have a python program that just needs to save one line of text (a path to a specific folder on the computer).
I've got it working to store it in a text file and read from it; however, I'd much prefer a solution where the python file is the only one.
And so, I ask: is there any way to save text in a python program even after its closed, without any new files being created?
EDIT: I'm using py2exe to make the program an .exe file afterwards: maybe the file could be stored in there, and so it's as though there is no text file?
You can save the file name in the Python script and modify it in the script itself, if you like. For example:
import re,sys
savefile = "widget.txt"
x = input("Save file name?:")
lines = list(open(sys.argv[0]))
out = open(sys.argv[0],"w")
for line in lines:
if re.match("^savefile",line):
line = 'savefile = "' + x + '"\n'
out.write(line)
This script reads itself into a list then opens itself again for writing and amends the line in which savefile is set. Each time the script is run, the change to the value of savefile will be persistent.
I wouldn't necessarily recommend this sort of self-modifying code as good practice, but I think this may be what you're looking for.
Seems like what you want to do would better be solved using the Windows Registry - I am assuming that since you mentioned you'll be creating an exe from your script.
This following snippet tries to read a string from the registry and if it doesn't find it (such as when the program is started for the first time) it will create this string. No files, no mess... except that there will be a registry entry lying around. If you remove the software from the computer, you should also remove the key from the registry. Also be sure to change the MyCompany and MyProgram and My String designators to something more meaningful.
See the Python _winreg API for details.
import _winreg as wr
key_location = r'Software\MyCompany\MyProgram'
try:
key = wr.OpenKey(wr.HKEY_CURRENT_USER, key_location, 0, wr.KEY_ALL_ACCESS)
value = wr.QueryValueEx(key, 'My String')
print('Found value:', value)
except:
print('Creating value.')
key = wr.CreateKey(wr.HKEY_CURRENT_USER, key_location)
wr.SetValueEx(key, 'My String', 0, wr.REG_SZ, 'This is what I want to save!')
wr.CloseKey(key)
Note that the _winreg module is called winreg in Python 3.
Why don't you just put it at the beginning of the code. E.g. start your code:
import ... #import statements should always go first
path = 'what you want to save'
And now you have path saved as a string