What is this JSON Decoder piece of code doing? - python

I have been using this piece of code:
import json
import glob

def read_text_files(filename):
    # Creates JSON Decoder
    decoder = json.JSONDecoder()
    with open(filename, 'r') as inputfile:
        # Returns next item in input file, removes whitespace from it and saves it in line
        line = next(inputfile).strip()
        while line:
            try:
                # Returns 2-tuple of Python representation of data and index where data ended
                obj, index = decoder.raw_decode(line)
                # Yield the decoded object
                yield obj
                # Remove already scanned part of line from rest of file
                line = line[index:]
            except ValueError:
                line += next(inputfile).strip()
                if not line:
                    line += next(inputfile).strip()
                global count
                count += 1
                print str(count)
all_files = glob.glob('Documents/*')
for filename in all_files:
    for data in read_text_files(filename):
        rawTweet = data['text']
        print 'Here'
It reads in a JSON file and decodes it. However, what I realise is that when I place the count and print statements inside the except ValueError block, I'm losing almost half of the documents being scanned here - they never make it back to the main method.
Could somebody explain to me exactly what the try statement is doing and why I'm losing documents in the except part? Is it due to bad JSON?
Edit: Including more code
Currently, with the code posted, the machine prints:
"Here"
2
3 etc...
199
Here
200
Here (alternating like this until)...
803
804
805 etc...
1200
Is this happening because some of the JSON is corrupt? Is it because some of the documents are duplicates (and some definitely are)?
Edit 2:
Interesting, deleting:
line = next(inputfile).strip()
while line:
and replacing it with:
for line in inputfile:
appears to have fixed the problem. Is there a reason for this?

The try statement specifies a block of statements for which exceptions are handled by the following except blocks (only one in your case).
My impression is that with your modifications you are triggering a second exception inside the exception handler itself. An exception raised there is not caught by the same except clause, so control passes to a higher-level exception handler, possibly outside the read_text_files function altogether. If no exception occurs in the exception handler, the loop can continue.
Please check that count exists and has been initialized with an integer value (say, 0).
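For example, if one of the next(inputfile) calls inside the except block reaches the end of the file, it raises StopIteration, which the except ValueError clause does not catch. A minimal sketch of that effect (not your exact code):

try:
    int('not a number')   # raises ValueError
except ValueError:
    next(iter([]))        # raises StopIteration inside the handler;
                          # nothing here catches it, so it propagates upward

On Python 2, a stray StopIteration escaping a generator is simply treated as the end of the iteration, so the caller's for loop stops early and any documents not yet yielded are never seen - which would also be consistent with the for line in inputfile: rewrite behaving differently.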

Related

Json.load raising "json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)" even with an "anti_empty" condition

I already posted about my problem and I thought it was solved, but after a while the error came back. I'm going to explain my program from the beginning.
I have a JSON file that contains values permanently updated by another program. I want an overlay that displays those values, which means I have to open and read my JSON file every second (or more often) with the "after()" method. (I'm using tkinter for my overlay.)
When I run my tkinter window without the other program that updates the values, everything works perfectly; I can update a value manually and it is updated on the overlay.
When I run both programs together, after some amount of time I get the empty JSON error - sometimes after 5 minutes, sometimes after 45 minutes. It's random.
I tried the following approaches:
Issue 1:
def is_json():
    with open('my_json') as my_file:
        myjson = my_file.read()
        try:
            json_object = json.loads(myjson)
        except json.JSONDecodeError as e:
            return False
        return True

if is_json():
    with open('my_json') as my_file:
        data = json.load(my_file)
else:
    time.sleep(0.1)
Issue 2:
while True:
    if os.path.getsize("/my_json") > 0:
        with open('my_json') as my_file:
            myjson = my_file.read()
    else:
        time.sleep(0.2)
I tried another approach too, but I don't want to code it again: it was a function that allowed one program to read/write the JSON only in "even" seconds, while the other one could only do so in "odd" seconds.
I tried this to avoid concurrent access, because I think that's my problem, but none of those solutions worked.
You should return the parsed JSON in the same function that has the try/except. Otherwise, the file could change between calling is_json() and json.load().
def json_load_retry(filename, sleeptime=0.1):
    while True:
        with open(filename) as f:
            try:
                return json.load(f)
            except json.JSONDecodeError:
                time.sleep(sleeptime)
myjson = json_load_retry('my_json', sleeptime=0.2)

Trying to check if a word is in a text file and I encountered a StopIteration error. I'm not sure why (closed)

Here is the function I'm trying to run:
def parsefile(file_name, d_name):
    with open(file_name) as fn:
        if "Station" in fn.read():
            next(fn)
            for line in fn:
                (stat, north, east) = line.split()
                d_name[stat] = (north, east)
        else:
            print("Please input correct file with header: Stations, Northings, Eastings")
    return d_name
Here is a snippet of my text file:
Station Northings Eastings
1 10001.00 10001.00
2 10070.09 10004.57
3 10105.80 10001.70
The result of open (i.e. fn) is an iterator. An iterator gives values via next(), which raises StopIteration when the iterator is exhausted. The for ... in ... construct (and many functions that handle iterators) handles this exception for you.
fn.read() reads the whole file, after which there is nothing more to read, effectively exhausting the iterator; requesting the next value from fn then raises StopIteration. If you want to be able to read the file from the start again, you can use fn.seek(0) to rewind the file pointer; however, note that some file handles cannot be rewound (notably, standard input).
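If you only need the header check, here is a minimal sketch (keeping your names) that inspects just the first line instead of reading the whole file:

def parsefile(file_name, d_name):
    with open(file_name) as fn:
        header = next(fn, '')   # read only the header line
        if "Station" in header:
            for line in fn:     # iterate over the remaining data lines
                stat, north, east = line.split()
                d_name[stat] = (north, east)
        else:
            print("Please input correct file with header: Stations, Northings, Eastings")
    return d_name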

Why does this code throw an IndexError?

I am trying to read values from a file, data.txt, with this code. However, it throws an IndexError when I run it. Why would this be happening?
def main():
    myfile = open('data.txt', 'r')
    line = myfile.readline()
    while line != '':
        line = line.split()
        age = line[1]
        line = myfile.readline()
    myfile.close()

main()
If line happens to contain exactly one fragment, line.split() returns a list of exactly one element, and accessing its second element (at index 1) raises an IndexError.
Also, to make your code better, avoid reusing a variable for values of different types (here line is first a string, then a list). It hampers readers, and code is written mostly to be read, especially by yourself.
I'd use a simpler loop:
for line in myfile:  # this iterates over the lines of the file
    fragments = line.split()
    if len(fragments) >= 2:
        age = fragments[1]
        ...
Also, the idiomatic way to open the file for a particular duration and close it automatically is the use of with:
with open(...) as myfile:
    for line in myfile:
        ...
# At this point, the file will be automatically closed.
Python starts indexing at 0.
In your age = line[1] part, if there is only one word in the line, Python will throw an IndexError to tell you that. Seeing your data would be helpful, but the following is the generally accepted and much easier way of reading a file:
with open('data.txt', 'r') as myfile:
    for line in myfile:
        # '' is a false value, so this is the same as: if line != ''
        if line:
            line = line.split()
            # if age is always the first thing on the line:
            age = line[0]
            # if age can be somewhere else on the line, you need to do something more complicated

Note that, because you used with, you don't need to close the file yourself; the with statement does that.
def main():
    myfile = open('data.txt', 'r')
    line = myfile.readline()
    while line != '':
        line = line.split()
        try:
            age = line[1]
        except IndexError:
            age = None
        line = myfile.readline()
    myfile.close()

main()
The try statement works as follows.
First, the try clause (the statement(s) between the try and except keywords) is executed.
If no exception occurs, the except clause is skipped and execution of the try statement is finished.
If an exception occurs during execution of the try clause, the rest of the clause is skipped. Then if its type matches the exception named after the except keyword, the except clause is executed, and then execution continues after the try statement.
If an exception occurs which does not match the exception named in the except clause, it is passed on to outer try statements; if no handler is found, it is an unhandled exception and execution stops with a message.
For more details, see https://docs.python.org/2/tutorial/errors.html#handling-exceptions
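A minimal illustration of that flow (hypothetical values):

try:
    age = int('twenty')     # raises ValueError; the rest of the try clause is skipped
    print('never reached')
except ValueError:
    age = None              # the matching except clause runs instead
print(age)                  # execution continues after the try statement; prints None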

Python: File Remains Empty After Writing to It

I am trying to read URLs directly from a MySQL table, use tldextract to get the domain from each URL, and look up the SPF (Sender Policy Framework) record for that domain.
When I try to write out the SPF record of each domain I scan, my Ouput_SPF_Records.txt does not contain any of the records I write.
I'm not sure what the issue is. Any suggestions, please?
import sys
import socket
import dns.resolver
import re
import MySQLdb
import tldextract
from django.utils.encoding import smart_str, smart_unicode

def getspf(domain):
    answers = dns.resolver.query(domain, 'TXT')
    for rdata in answers:
        for txt_string in rdata.strings:
            if txt_string.startswith('v=spf1'):
                return txt_string.replace('v=spf1', '')

db = MySQLdb.connect("x.x.x.x", "username", "password", "db_table")
cursor = db.cursor()
cursor.execute("SELECT application_id,url FROM app_info.app_urls")
data = cursor.fetchall()
x = 0
while x < len(data):
    c = tldextract.extract(data[x][1])
    #print c
    app_id = data[x][0]
    #print app_id
    d = str(app_id) + ',' + c[1] + '.' + c[2]
    #with open('spfout.csv','a') as out:
    domain = smart_str(d)
    #print domain
    with open('Ouput_SPF_Records.txt', 'w') as g:
        full_spf = ""
        spf_rec = ""
        y = domain.split(',')
        #print "y===",y,y[0],y[1]
        app_id = y[0]
        domains = y[1]
        try:
            full_spf = getspf(domains.strip()) + "\n"
            spf_rec = app_id + "," + full_spf
            print spf_rec
        except Exception:
            pass
        g.write(spf_rec)
    x = x + 1
g.close()
Try opening the file in append mode (a) instead of w mode. w mode overwrites the file on each iteration. Example -
with open('Ouput_SPF_Records.txt','a') as g:
Most probably, the last time you open the file in write mode you do not write anything to it, because you are catching and ignoring all exceptions, which leaves the file empty.
Also, if you know the error you are expecting, you should use except <Error>: instead of except Exception:. Example -
try:
    full_spf = getspf(domains.strip()) + "\n"
    spf_rec = app_id + "," + full_spf
    print spf_rec
except <Error you want to catch>:
    pass
Your problem is that you open the file many times, once on each pass through the loop, in w mode, which erases the contents and starts writing from the beginning.
Either open the file once before the loop, or open it in append mode (a), so you don't delete the previously written data.
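A minimal sketch of the first option, reusing data, tldextract and getspf from your script (error handling kept deliberately simple):

with open('Ouput_SPF_Records.txt', 'w') as g:   # opened once, before the loop
    for app_id, url in data:                    # each row is (application_id, url)
        c = tldextract.extract(url)
        domain = c[1] + '.' + c[2]
        try:
            g.write(str(app_id) + ',' + getspf(domain) + '\n')
        except Exception:
            pass   # better: log which domains failed instead of discarding them silently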
You can use:
import pdb; pdb.set_trace()
to debug your code and try to figure out the problem.
Also note that:
1. You shouldn't just write pass in the try/except block. Deal with the exception.
2. With
with open('Ouput_SPF_Records.txt','w') as g:
the file is closed automatically, so there is no need to call g.close() explicitly.
I think this is the result of getspf returning None by default.
The problem is that Python can't concatenate str and NoneType (the type of None), which raises an exception that you silently discard.
You may try this instead:
def getspf(domain):
    answers = dns.resolver.query(domain, 'TXT')
    for rdata in answers:
        for txt_string in rdata.strings:
            if txt_string.startswith('v=spf1'):
                return txt_string.replace('v=spf1', '')
    return ""  # or "Error"
Probably you should check for the exception; my guess is that the statements inside the try are not executed, and spf_rec is left as "".
As per the POSIX definition (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206), every line you write should end with "\n".
You might consider initialising spf_rec with "\n" rather than "".
Also, as Anand S Kumar said, without append mode the file is overwritten on every iteration of the while x < len(data) loop.
I think that if you open the Ouput_SPF_Records.txt file with vi, you will see the last line written (unless an exception occurred on the last pass of the loop, leaving the file as just "").
In other words, the problem is that many programs may not read a line that doesn't respect the POSIX standard, and because your file probably consists of a single line that doesn't respect this standard, the file won't be read at all.

How to troubleshoot code in the case of big data

I'm trying to implement this python solution to count the number of lines with identical content in the first few columns of a table. Here is my code:
#count occurrences of reads
import pandas as pd
#pd.options.display.large_repr = 'info'
#pd.set_option('display.max_rows', 100000000)
#pd.set_option('display.width',50000)
import sys
file1 = sys.argv[1]
file2 = file1[:4] + '_multi_nobidir_count.soap'
df = pd.read_csv(file1,sep='\t',header=None)
df.columns = ['v0','v1','v2','v3','v4','v5','v6','v7','v8','v9','v10','v11']
df['v3']=df.groupby(['v0','v1','v2']).transform(sum).v3
df.to_csv(file2,sep='\t',index=False,header=False)
It worked fine with the test data (200 lines) but gives me the following error when I apply it to the real data (20 million lines):
Traceback (most recent call last):
  File "count_same_reads.py", line 14, in <module>
    df['v3']=df.groupby(['v0','v1','v2']).transform(sum).v3
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2732, in transform
    return self._transform_item_by_item(obj, fast_path)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2799, in _transform_item_by_item
    raise TypeError('Transform function invalid for data types')
TypeError: Transform function invalid for data types
How do I go about troubleshooting, to find out why I am getting this error?
[EDIT] Uncommenting the pd.options. and pd.set_option lines did not change the outcome.
[EDIT2] Taking into consideration some of the replies below, I ran the following code on my data to output any lines of data that do not have a number in the 4th column:
#test data type
import sys

file1 = sys.argv[1]

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

with open(file1, 'r') as data:
    for row in data:
        a = row.strip().split()[3]
        if is_number(a) == False:
            print row.strip()
This worked on the test data, in which I changed one row's fourth-column value from 1 to e: it output only the line containing the letter instead of a number. But when I ran it on the original big data, no lines were returned.
The exception you received is a TypeError, which hints at problems with the file. But with large files it is always possible that there are, e.g., memory problems in the code handling the comparisons. So there are two possibilities:
the file is broken
the code (yours or pandas's) is broken
In order to debug this, you may try to feed your file into your code in pieces. At some point you will have isolated the problem, and it will be one of the two:
no matter which n lines you take, it throws an exception (but not with n-1 lines): memory management or something else is broken
the problem can be isolated to a single line or a few lines of your data file: the data file is broken
I second merlin2011's guess: there is something unexpected in your file. It is unlikely that pandas would choke on only 200,000,000 records.
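A minimal sketch of that chunk-by-chunk approach, assuming the same tab-separated layout as in the question (the chunk size is arbitrary):

import sys
import pandas as pd

file1 = sys.argv[1]
cols = ['v0','v1','v2','v3','v4','v5','v6','v7','v8','v9','v10','v11']
chunks = pd.read_csv(file1, sep='\t', header=None, names=cols, chunksize=100000)
for i, chunk in enumerate(chunks):
    try:
        chunk.groupby(['v0','v1','v2']).transform(sum).v3
    except TypeError:
        print('transform failed somewhere in rows %d-%d' % (i * 100000, (i + 1) * 100000 - 1))

The chunk that fails narrows down where the problematic rows are, so you can then inspect that slice of the file directly.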
Open the file /usr/local/lib/python2.7/dist-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/groupby.py and go to line 2799.
Right before the following statement, at the same indentation level, add a line that prints the value of the offending data.
raise TypeError('Transform function invalid for data types')
Now, right before the TypeError is thrown, you will know what data caused the error.
Given that you are trying to sum, I would speculate that you have a non-numeric value in your column, but I do not have your data, so that is pure speculation.
I have taken a quick look at the code region around where the error occurs, and it appears that in this case you should inspect the object obj before the TypeError is raised.
for i, col in enumerate(obj):
    try:
        output[col] = self[col].transform(wrapper)
        inds.append(i)
    except Exception:
        pass

if len(output) == 0:  # pragma: no cover
    raise TypeError('Transform function invalid for data types')
Here's how to troubleshoot something like this:
Create a function to wrap the operation (this will be a fair bit slower as it's not cythonized), but it should catch your error.
def f(x):
    try:
        return x.sum()
    except:
        import pdb; pdb.set_trace()

df['v3'] = df.groupby(['v0','v1','v2']).transform(f).v3
