Is there a way to log auto generated messages on Python console? - python

I'm using pandas to load a CSV file that has a few bad lines. This means that a few lines contain extra commas, which is why pandas cannot parse them. That is fine by me; I'm using error_bad_lines=False to ignore those lines. When pandas skips a bad line, it prints a message like this to the console:
b'Skipping line 3: expected 3 fields, saw 4\n'
What I want is to be able to load the data but record the skipped line numbers in a log file. I went through a lot of tutorials on logging but couldn't find a way to log this auto-generated message that pandas prints when it skips a line while loading the data.
This is the simple piece of code I'm using to load a file.
import pandas as pd
import os

def main():
    filename = "test_data3.csv"
    data = pd.read_csv(filename, error_bad_lines=False)
    print(data.head())

if __name__ == "__main__":
    main()
Here is the sample data I'm using
Col1,Col2,Col3
a,3,g4
b,4,s5,r
c,5,p9
f,6,v4,7
x,65,h5
As you can see, lines 2 and 4 should be skipped, but they need to be recorded in a log file.

You can use a context manager to temporarily intercept calls to sys.stderr.write and write the messages to a file:
import pandas as pd
import sys

class CaptureErrors:
    def __init__(self, stderr, output_name):
        self.stderr = stderr
        self.output_name = output_name
        self.output_file = None

    def __enter__(self):
        self.output_file = open(self.output_name, "w")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if self.output_file:
            self.output_file.close()
        sys.stderr = self.stderr

    def write(self, message):
        self.stderr.write(message)
        self.output_file.write(message)

def main():
    filename = "test_data3.csv"
    with CaptureErrors(sys.stderr, 'error.txt') as sys.stderr:
        data = pd.read_csv(filename, error_bad_lines=False)
        print(data.head())

if __name__ == "__main__":
    main()
If this isn't what you are looking for, you may need to add more information to your question.
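As an aside, if the pandas version in use is 1.4 or newer, error_bad_lines has been deprecated in favour of on_bad_lines, which (with the Python engine) can take a callable; that gives a more direct hook for logging than intercepting stderr. A minimal sketch under that assumption (the log file name and format are just placeholders):
import logging
import pandas as pd

logging.basicConfig(filename="skipped_lines.log", level=logging.WARNING)

def log_bad_line(bad_line):
    # pandas passes the offending row as a list of parsed fields;
    # returning None tells it to drop the row
    logging.warning("Skipping bad line: %s", bad_line)
    return None

data = pd.read_csv("test_data3.csv", engine="python", on_bad_lines=log_bad_line)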

You can also redirect the output into a file when running the script:
python script.py > out.txt
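Note that pandas writes these "Skipping line" messages to stderr rather than stdout, so to capture them in a file you would redirect stderr as well:
python script.py > out.txt 2> errors.txt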

Related

Create a separate logger for each process when using concurrent.futures.ProcessPoolExecutor in Python

I am cleaning up a massive CSV data dump. I was able to split the single large file into smaller ones using gawk, following a Unix SE answer, with this flow:
BIG CSV file -> use gawk script + bash -> Small CSV files based on columns
I have about 12 split CSV files created using the above-mentioned flow, each with ~170K lines.
I am using Python 3.7.7 on a Windows 10 machine.
Code
def convert_raw_data(incoming_line, f_name, line_counter):
    # do some decoding magic
    # catch exceptions and try to log them into a logger file under `f_name.log`
    ...

def convert_files(dir_name, f_name, dest_dir_name):
    # Open the CSV file
    # Open the destination CSV file to store decoded data
    line_counter = 1
    for line in csv_reader:
        # convert raw HEX to floating point values using the `convert_raw_data` function call
        line_counter = line_counter + 1
        status = convert_raw_data(line, f_name, line_counter)
    if status:
        return f'All good for {f_name}.'
    else:
        return f'Failed for {f_name}'

def main():
    # Parse Arguments Logic here
    # get CSV files and their respective paths
    csv_files = get_data_files_list(args.datasets)
    # decode raw data from each split csv file as an individual process
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = [executor.submit(convert_files, dir_name, f_name, dest_dir)
                   for dir_name, f_name in csv_files]
        for f in concurrent.futures.as_completed(results):
            print(f.result())
Requirements
I wish to set up a logger named f_name.log within each process spawned by the ProcessPoolExecutor and store the logs under the respective parsed file name. I am not sure whether I should use something like:
def convert_raw_data(...., logger):
    logger.exception(raw_data_here)

def convert_files(....):
    logger = logging.basicConfig(filename=f_name, level=logging.EXCEPTION)
or are there caveats for using logging modules in a multiprocessing environment?
Found out a simple way to achieve this task:
import logging

def create_log_handler(fname):
    logger = logging.getLogger(name=fname)
    logger.setLevel(logging.ERROR)
    fileHandler = logging.FileHandler(fname + ".log")
    fileHandler.setLevel(logging.ERROR)
    formatter = logging.Formatter('%(name)s %(levelname)s: %(message)s')
    fileHandler.setFormatter(formatter)
    logger.addHandler(fileHandler)
    return logger
I called create_log_handler within my convert_files(.....) function and then used logger.info and logger.error accordingly.
By passing the logger as a parameter to convert_raw_data, I was able to log even the erroneous data points in each of my CSV files in each process.
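For reference, a minimal sketch of how that wiring can look inside the worker function (the file-handling details and the extra logger parameter are assumptions; create_log_handler is the helper above):
import os

def convert_files(dir_name, f_name, dest_dir_name):
    # each worker process builds its own logger writing to <f_name>.log,
    # so no handler is shared between processes
    logger = create_log_handler(f_name)
    line_counter = 1
    with open(os.path.join(dir_name, f_name)) as src:
        for line in src:
            try:
                convert_raw_data(line, f_name, line_counter, logger)
            except Exception:
                logger.exception("failed on line %d of %s", line_counter, f_name)
            line_counter += 1
    return f'All good for {f_name}.'
Because every process logs to its own file, the usual multiprocessing caveat (several processes interleaving writes through one shared handler) does not apply here.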

how to Improve execution time of importing data in python

The code below takes 2.5 seconds to import a log file with 1 million lines.
Is there a better way to write the code and also decrease the execution time?
""" This code is used to read the log file into the memory and convert into the data frame
Once the log file is loaded ,every item in the IPQuery file checked if exist and result is print onto the console"""
#importing python modules required for this script to perform operations
import pandas as pd
import time
import sys
#code to check the arguments passed """
if len(sys.argv)!= 3:
raise ValueError(""" PLEASE PASS THE BOTH LOG FILE AND IPQUERY FILE AS INPUT TO SCRIPT
ex: python program.py log_file query_file """)
# extracting file names from command line """
log_file_name=sys.argv[1]
query_file_name = sys.argv[2]
start = time.time()#capturing time instance
#Reading the content from the log file into dataframe log_df """
log_df = pd.read_csv(log_file_name," ",header=None ,names = ['DATE','TIME', 'IPADDR','URL','STATUS'],skip_blank_lines = True)
#Reading the content from the IPquery file into the data frame query_df """
query_df = pd.read_csv(query_file_name," ",header=None,skip_blank_lines=True )
#Cheking if the IP address exists in the log file"""
Ipfound = query_df.isin(log_df.IPADDR).astype(int)
#print all the results to the Query results onto the stdout"""
for items in Ipfound[0]:
print items
print "Execution Time of this script is %f" %(time.time() - start)
# importing python modules required for this script to perform operations
import time
import sys

start = time.time()  # capturing time instance

class IpQuery:
    """The methods below contain the functionality to read file paths, import log and query data,
    and print the result to the console."""
    def __init__(self):
        self.log_file_name = ""
        self.query_file_name = ""
        self.logset = {}
        self.IPlist = []

    def Inputfiles(self):
        """code to check the arguments passed and throw an error"""
        if len(sys.argv) != 3:
            raise ValueError(""" PLEASE PASS THE BOTH LOG FILE AND IPQUERY FILE AS INPUT TO SCRIPT
            ex: python program.py log_file query_file """)
        # extracting file names from command line
        self.log_file_name = sys.argv[1]
        self.query_file_name = sys.argv[2]

    def read_logfile(self):
        # Reading the log data
        with open(self.log_file_name, 'r') as f:
            self.logset = {line.split(' ')[2] for line in f if not line.isspace()}

    def read_Queryfile(self):
        # Reading the query file into a list
        with open(self.query_file_name, 'r') as f:
            self.IPlist = [line.rstrip('\n') for line in f if not line.isspace()]

    def CheckIpAdress(self):
        # IP addresses from the query file are checked against the log file
        dummy = self.logset.intersection(set(self.IPlist))
        for element in self.IPlist:
            if element in dummy:
                print "1"
            else:
                print "0"

try:
    # Create an instance of the IpQuery class
    msd = IpQuery()
    # Extracting the input file information
    msd.Inputfiles()
    # Importing the IP information from the log files
    msd.read_logfile()
    # Importing the IP query information from the query file
    msd.read_Queryfile()
    # Searching for the IPs in the log file
    msd.CheckIpAdress()
except IOError:
    print "Error: can't find file or read data"
except ValueError:
    print "PLEASE PASS THE BOTH LOG FILE AND IPQUERY FILE AS INPUT TO SCRIPT "

python is not logging all content to file

I want to log the script output to a file while still displaying the output to the screen.
It works fine, except that in some cases not all the content is written to the file (one or two lines can be missing if the output is long).
Below is my code:
import sys

class Tee(object):
    def __init__(self, *files):
        self.files = files
    def write(self, obj):
        for f in self.files:
            f.write(obj)
            f.flush()

write_log = open("log.txt", 'a', 0)
sys.stdout = Tee(sys.stdout, write_log)
sys.stderr = Tee(sys.stderr, write_log)
Tried all the following options at the end of the code, but the result is the same:
os.fsync(write_log.fileno())
write_log.flush()
write_log.close()
Try using the with statement, or use try/finally and explicitly close the file so everything is flushed.
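One way to do that, reusing the Tee class from the question, is to wrap the redirection in a small context manager so the streams are restored and the file is closed (and therefore flushed) on every exit path. A sketch, assuming nothing beyond the code above:
import sys

class TeeToFile(object):
    def __init__(self, path):
        self.path = path
    def __enter__(self):
        self.log = open(self.path, "a")
        self.old_stdout, self.old_stderr = sys.stdout, sys.stderr
        sys.stdout = Tee(self.old_stdout, self.log)
        sys.stderr = Tee(self.old_stderr, self.log)
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        # put the real streams back and close the file so buffers are flushed
        sys.stdout, sys.stderr = self.old_stdout, self.old_stderr
        self.log.close()

# usage:
# with TeeToFile("log.txt"):
#     run_the_script_body()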

Getting HTML body with cgitb

I'm using cgitb (Python 2.7) to create HTML documents server side. I have one file that does a bunch of queries and then produces HTML. I'd like to be able to link just the HTML, so if I could print the HTML to a new file and link to that, it would work.
Is there a way to get the html the page will generate at the end of processing so that I can put it in a new file without keeping track of everything I've done so far along the way?
Edit: Found a snippet here: https://stackoverflow.com/a/616686/1576740
import sys

class Tee(object):
    def __init__(self, name, mode):
        self.file = open(name, mode)
        self.stdout = sys.stdout
        sys.stdout = self
    def __del__(self):
        sys.stdout = self.stdout
        self.file.close()
    def write(self, data):
        self.file.write(data)
        self.stdout.write(data)
You have to call it after you import cgi, as cgi overrides stdout in what appears to be a less friendly way. But it works like a charm.
I just did import cgi;.......
Tee(filename, "w") and then I have a link to the file.
From the Python Documentation
Optionally, you can save this information to a file instead of sending it to the browser.
In this case you would want to use
cgitb.enable(display=1, logdir=directory)
import cgitb
import sys

try:
    ...
except:
    with open("/path/to/file.html", "w") as fd:
        fd.write(cgitb.html(sys.exc_info()))

how to create log files for test execution

I am trying to create a test controller and want the execution of the tests to be collected in a file.
I know about using tee and redirecting the test script execution to a certain file, but I am interested in doing it with Python on Linux.
So, in this case, whenever a test is executed, a log file should get created, and all the execution logs including stdin, stdout and stderr should get collected into this file.
Requesting somebody to suggest how to implement this kind of idea!
Thanks
There are several good logging modules, starting with the built-in logging; here is the official cookbook. Among the more interesting 3rd party libraries is Logbook; here is a pretty bare example just scratching the surface of its very cool features:
import logbook

def f(i, j):
    return i + j

logger = logbook.Logger('my application logger')
log = logbook.FileHandler('so.log')
log.push_application()

try:
    f(1, '2')
    logger.info('called ' + f.__name__)
except:
    logger.warn('failed on ')

try:
    f(1, 2)
    logger.info('called ' + f.__name__)
except:
    logger.warn('choked on, ')
so.log then looks like this:
[2011-05-19 07:40] WARNING: my application logger: failed on
[2011-05-19 07:40] INFO: my application logger: called f
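For comparison, roughly the same thing with the built-in logging module mentioned above; a minimal sketch (f is the same toy function as in the Logbook example):
import logging

def f(i, j):
    return i + j

logging.basicConfig(filename="so.log", level=logging.INFO,
                    format="[%(asctime)s] %(levelname)s: %(name)s: %(message)s")
logger = logging.getLogger("my application logger")

try:
    f(1, "2")
    logger.info("called " + f.__name__)
except TypeError:
    logger.warning("failed on ")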
Try this:
import sys
# Save the current stream
save_out = sys.stdout
# Define the log file
f = "a_log_file.log"
# Append to existing log file.
# Change 'a' to 'w' to recreate the log file each time.
fsock = open(f, 'a')
# Set stream to file
sys.stdout = fsock
###
# do something here
# any print function calls will send the stream to file f
###
# Reset back the stream to what it was
# any print function calls will send the stream to the previous stream
sys.stdout = save_out
fsock.close()
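On Python 3.4+, contextlib can handle the save/restore for you; a shorter equivalent of the snippet above (same hypothetical file name):
import contextlib

with open("a_log_file.log", "a") as fsock, contextlib.redirect_stdout(fsock):
    print("any print call in this block goes to the log file")
# sys.stdout is restored automatically here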
Open and write to a file:
mylogfile = 'bla.log'
f = open(mylogfile, 'a')
f.write('i am logging! logging logging!....loggin? timber!....')
f.close()
Look in the root directory of the script for 'bla.log' and read it. Enjoy.
You can write a function like this:
def writeInLog(msg):
    with open("log", "a") as f:
        f.write(msg + "\n")
It will open the file "log", and append ("a") the message followed by a newline, then close the file.
import sys

# Save the current stream
saveout = sys.stdout
f = "a_log_file.log"
fsock = open(f, 'w')
# Set stream to file
sys.stdout = fsock
###
# do something here
# any print function will send the stream to file f
###
# Reset back the stream to what it was
sys.stdout = saveout
fsock.close()
