I have a SQL dump in txt format; it looks like this:
"Date:","8/21/2015","","Time:","16:18:38","","Name:","NC.S.RHU10.BRD"
"System Name:","NC.S.RHU10.BRD"
"Operator:","SYSTEM"
"Action:","Trend data loss"
"Comment:"," trend definition data loss occurred at 10:21:05 AM on 8/21/2015"
"Revision:","6"
"Location:",""
"Seq Number:","1278738"
" ********************************************************************************"
"Date:","8/21/2015","","Time:","16:17:17","","Name:","SC.L.SIDESHOWBOB.MBC009"
"System Name:","SC.L.SIDESHOWBOB.MBC009"
"Operator:","SYSTEM"
"Action:","FLN device return from failure"
"Comment:","Z8 RETURN from failure in Cabinet 9, Lan 3, Drop 1."
"Revision:","81"
"Location:","SC.L.SIDESHOWBOB.MBC009"
"Seq Number:","1278737"
" ********************************************************************************"
"Date:","8/21/2015","","Time:","16:17:17","","Name:","NC.S.EHU07.EAT"
"System Name:","NC.S.EHU07.EAT"
"Operator:","ITWVSIEMP01\InsightSCH"
"Action:","Trend data collection The target object could not be found on the Field"
"Panel."
"Comment:","Trend COV (0.000) Failed - The target object could not be found on the"
"Field Panel"
"Revision:","1318"
"Location:","ITWVSIEMP01"
"Seq Number:","1278735"
" ********************************************************************************"
"Date:","8/21/2015","","Time:","16:17:15","","Name:","NC.S.EHU03.TCFM"
"System Name:","NC.S.EHU03.TCFM"
"Operator:","ITWVSIEMP01\InsightSCH"
"Action:","Trend data collection"
"Comment:","COV Data Loss Detected"
"Revision:","1481"
"Location:","ITWVSIEMP01"
"Seq Number:","1278734"
" ********************************************************************************
I want to convert it into columns using Python, with the following fields:
"Date","Time","Name","System Name","Operator","Action","Comment","Type","Revision","Location","Seq Number"
Is there a ready-made function in Python that does this? This is what I have so far:
import csv

c = csv.writer(open('out.csv', 'w'), delimiter=',')
file = open('myfile.txt')
for col in file:
    data = col.split('\t')
    # find index "Date=0","Time=1","Name=2","System Name=3","Operator=4","Action=5","Comment=6","Type=7","Revision=8","Location=9","Seq Number=10"
    c.writerow([data[0], data[1], data[2], data[3], data[4], data[5], data[6], data[7], data[8], data[9], data[10]])
file.close()
I've just written a little utility here; maybe it can help you.
I think the last line of your input file is missing a ". Please add it at the end so the record delimiter is uniform.
import operator
import csv

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    data = {}
    writer = csv.writer(outfile, delimiter=',')
    writer.writerow(["Date", "Time", "Name", "System Name", "Operator", "Action", "Comment", "Revision", "Location", "Seq Number"])
    fields = operator.itemgetter("Date", "Time", "Name", "System Name", "Operator", "Action", "Comment", "Revision", "Location", "Seq Number")
    for line in infile:
        if line.startswith('" *'):
            # end of a record: write out the collected fields
            try:
                writer.writerow(fields(data))
            except KeyError:
                print('malformed input')
                raise
            data = {}
            continue
        parts = line.split(',')
        if line.startswith('"Date'):
            # the Date line also carries the Time and Name fields
            data['Date'] = parts[1].strip('"')
            data['Time'] = parts[4].strip('"')
            data['Name'] = parts[-1].strip().strip('"')
            continue
        if len(parts) < 2:
            # continuation of a value that wrapped onto a new line; skip it
            continue
        name = parts[0].strip('"').rstrip(":")
        value = parts[1].strip().strip('"')
        data[name] = value
The following script should work. It generates your header fields automatically and preserves their order in the CSV file, so it should still work if the format changes a bit:
import csv

with open("sqldump.txt", "r") as f_input, open("output.csv", "wb") as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    # Build the header row from the field names of the first record
    headers = []
    for cols in csv_input:
        if len(cols) > 1:
            headers.extend([header.strip(":") for header in cols if header.endswith(':')])
        else:
            break
    csv_output.writerow(headers)

    # Rewind and collect one output row per record
    f_input.seek(0)
    entry = []
    for cols in csv_input:
        if cols[0] == 'Date:':
            entry.extend([cols[1], cols[4], cols[-1]])
        elif len(cols) > 1:
            entry.append(cols[1])
        elif cols[0].startswith(' *'):
            csv_output.writerow(entry)
            entry = []
This would give you an output CSV file looking like:
Date,Time,Name,System Name,Operator,Action,Comment,Revision,Location,Seq Number
8/21/2015,16:18:38,NC.S.RHU10.BRD,NC.S.RHU10.BRD,SYSTEM,Trend data loss, trend definition data loss occurred at 10:21:05 AM on 8/21/2015,6,,1278738
8/21/2015,16:17:17,SC.L.SIDESHOWBOB.MBC009,SC.L.SIDESHOWBOB.MBC009,SYSTEM,FLN device return from failure,"Z8 RETURN from failure in Cabinet 9, Lan 3, Drop 1.",81,SC.L.SIDESHOWBOB.MBC009,1278737
8/21/2015,16:17:17,NC.S.EHU07.EAT,NC.S.EHU07.EAT,ITWVSIEMP01\InsightSCH,Trend data collection The target object could not be found on the Field,Trend COV (0.000) Failed - The target object could not be found on the,1318,ITWVSIEMP01,1278735
8/21/2015,16:17:15,NC.S.EHU03.TCFM,NC.S.EHU03.TCFM,ITWVSIEMP01\InsightSCH,Trend data collection,COV Data Loss Detected,1481,ITWVSIEMP01,1278734
Tested using Python 2.7. If you are using Python 3, change the output file's open call to open("output.csv", "w", newline="").
Note: there is no 'Type' field in your example data.
Related
I am trying to extract values from JSON-LD to CSV, as they are in the file. There are a couple of issues I am facing:
1. The values being read for different fields are getting truncated in most cases. In the remaining cases, the value of one field appears under another field.
2. I am also getting an error, 'Additional data', after some 4,000 lines.
The file is quite big (half a GB). I am attaching a shortened version of my code; please tell me where I am going wrong.
The input file: I have shortened it and kept it here, since there was no way of putting it inline:
https://github.com/Architsi/json-ld-issue
I tried writing this script, and I tried multiple online converters too:
import csv, sys, math, operator, re, os, json, ijson
from pprint import pprint

filelist = []
for file in os.listdir("."):
    if file.endswith(".json"):
        filelist.append(file)

for input in filelist:
    newCsv = []
    splitlist = input.split(".")
    output = splitlist[0] + '.csv'
    newFile = open(output, 'w', newline='')  # wb for windows, else you'll see newlines added to csv
    # initialize csv writer
    writer = csv.writer(newFile)
    # Name of the columns
    header_row = ('Format', 'Description', 'Object', 'DataProvider')
    writer.writerow(header_row)
    with open(input, encoding="utf8") as json_file:
        data = ijson.items(json_file, 'item')
        # passing all the values through try except
        for s in data:
            source = s['_source']
            try:
                source_resource = source['sourceResource']
            except:
                print("Warning: No source resource in record ID: " + id)
            try:
                data_provider = source['dataProvider'].encode()
            except:
                data_provider = "N/A"
            try:
                _object = source['object'].encode()
            except:
                _object = "N/A"
            try:
                descriptions = source_resource['description']
                string = ""
                for item in descriptions:
                    if len(descriptions) > 1:
                        description = item.encode()  # + " | "
                    else:
                        description = item.encode()
                    string = string + description
                description = string.encode()
            except:
                description = "N/A"
            created = ""
            # writing it to csv
            write_tuple = ('format', description, _object, data_provider)
            writer.writerow(write_tuple)
    print("File written to " + output)
    newFile.close()
The error that I am getting is: raise common.JSONError('Additional data')
The expected result is a CSV file with all the columns and correct values.
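For what it's worth, ijson raises 'Additional data' when the file contains more than one top-level JSON document (e.g. concatenated or newline-delimited JSON). A minimal sketch of the workaround, assuming a recent version of ijson and that 'item' is the right prefix for your records:

import ijson

# Hedged sketch: multiple_values=True tells ijson to keep parsing past the
# end of the first top-level JSON value instead of raising "Additional data".
with open('input.json', encoding='utf8') as json_file:
    for record in ijson.items(json_file, 'item', multiple_values=True):
        print(record.get('_source', {}).get('dataProvider', 'N/A'))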
I'm trying to convert a text file to an Excel sheet in Python. The txt file contains data in the below specified format:
Column names: reg no, zip code, loc id, emp id, lastname, first name. Each record has one or more error numbers, and each record has its column names listed above its values. I would like to create an Excel sheet containing reg no, firstname, lastname, and the errors listed in separate rows for each record.
How can I put the records into an Excel sheet? Should I be using regular expressions? And how can I insert the error numbers in different rows for the corresponding record?
Here is the link to the input file:
https://github.com/trEaSRE124/Text_Excel_python/blob/master/new.txt
Any code snippets or suggestions are greatly appreciated.
Here is a draft of the code. Let me know if any changes are needed:
# import pandas as pd
from collections import OrderedDict
from datetime import date
import csv

with open('in.txt') as f:
    with open('out.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        # Remove initial clutter
        while "INPUT DATA" not in f.readline():
            continue
        header = ["REG NO", "ZIP CODE", "LOC ID", "EMP ID", "LASTNAME", "FIRSTNAME", "ERROR"]
        data = list()
        errors = list()
        spamwriter.writerow(header)
        print header
        while True:
            line = f.readline()
            errors = list()
            if "END" in line:
                exit()
            try:
                int(line.split()[0])
                data = line.strip().split()
                f.readline()  # get rid of \n
                line = f.readline()
                while "ERROR" in line:
                    errors.append(line.strip())
                    line = f.readline()
                spamwriter.writerow(data + errors)
                csvfile.flush()  # csv writers have no flush; flush the underlying file
            except:
                continue
Use Python 2 to run it. The errors are appended as subsequent columns. Getting them into separate rows, the way you want, is slightly more complicated; I can fix it if still needed.
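For reference, a rough sketch of the 'separate rows' variant, assuming the data and errors lists from the loop above: write the record with its first error, then put each remaining error on its own row under the ERROR column.

if errors:
    # first error shares the record's row
    spamwriter.writerow(data + [errors[0]])
    for err in errors[1:]:
        # remaining errors: blank out the record columns, one error per row
        spamwriter.writerow([""] * len(data) + [err])
else:
    spamwriter.writerow(data)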
You can do this using the openpyxl library, which is capable of depositing items directly into a spreadsheet. This code shows how to do that for your particular situation.
from openpyxl import Workbook

NEW_PERSON, ERROR_LINE = 1, 2

def Line_items():
    with open('katherine.txt') as katherine:
        for line in katherine:
            line = line.strip()
            if not line:
                continue
            items = line.split()
            if items[0].isnumeric():
                yield NEW_PERSON, items
            elif items[:2] == ['ERROR', 'NUM']:
                yield ERROR_LINE, line
            else:
                continue

wb = Workbook()
ws = wb.active
ws['A2'] = 'REG NO'
ws['B2'] = 'LASTNAME'
ws['C2'] = 'FIRSTNAME'
ws['D2'] = 'ERROR'
row = 2
for kind, data in Line_items():
    if kind == NEW_PERSON:
        row += 2
        ws['A{:d}'.format(row)] = int(data[0])
        ws['B{:d}'.format(row)] = data[-2]
        ws['C{:d}'.format(row)] = data[-1]
        first = True
    else:
        if first:
            first = False
        else:
            row += 1
        ws['D{:d}'.format(row)] = data
wb.save(filename='katherine.xlsx')
I am trying to recreate this analysis: https://rstudio-pubs-static.s3.amazonaws.com/203258_d20c1a34bc094151a0a1e4f4180c5f6f.html
I could not get the shell script to work on my computer, so I wrote this code to do essentially the same thing:
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

in_fp = open(input_file, "r")
out_fp = open(output_file, "w")

count = 0
for line in in_fp:
    if count == 1:
        out_fp.write(line + "\n")
    elif count > 1:
        elems = line.split(",")
        loan = elems[16].upper()
        if loan == "FULLY PAID" or loan == "LATE (31-120 DAYS)" or loan == "DEFAULT" or loan == "CHARGED OFF":
            out_fp.write(line + "\n")
    count += 1

in_fp.close()
out_fp.close()
While this code works for the year 2015 data, when I run it for 2012-2013 data I get the error message:
File "ShellScript.py", line 16, in <module>
loan = elems[16].upper()
IndexError: list index out of range
Can someone please tell me how to fix this error so I can get the data sorted? Thank you.
One of your lines doesn't have 17 elements, so elems[16] fails. This is usually caused by a blank line in your data, but it can also be caused by a quoted field with embedded newlines; in that case you will need to use the csv module.
Here is a rewrite using the csv module. It reports and skips short lines, and I have changed the code to be more Pythonic.
import sys
import csv

input_file = sys.argv[1]
output_file = sys.argv[2]

ncolumns = 17  # IS THIS RIGHT?
keep_loans = {"FULLY PAID", "LATE (31-120 DAYS)", "DEFAULT", "CHARGED OFF"}

# with statement automatically closes files after block
with open(input_file, "rb") as in_fp, open(output_file, "wb") as out_fp:
    reader = csv.reader(in_fp)
    writer = csv.writer(out_fp)
    # you are currently skipping line 0
    next(reader)
    # copy headers
    writer.writerow(next(reader))
    # you are currently adding an extra newline to headers
    # writer.writerow([])  # uncomment if you want that extra newline
    for row_num, row in enumerate(reader, start=2):
        if len(row) < ncolumns:
            # report and skip short rows
            print "row %s shorter than expected. skipping row. row: %s" % (row_num, row)
            continue
        # use `in` rather than multiple == comparisons
        if row[16].upper() in keep_loans:
            writer.writerow(row)
I have an application that works, but in the interest of understanding functions and Python better, I am trying to split it out into various functions.
I'm stuck on the file_IO function. I'm sure the reason it does not work is that the main part of the application does not understand reader or writer. To better explain, here is a full copy of the application.
Also, I'm curious about csv.DictReader and csv.DictWriter: does either provide any advantages or disadvantages over the current code?
I suppose another way of doing this is via classes, which honestly I would like to know how to do as well.
#!/usr/bin/python
"""Description: This script will take a csv file and parse it looking for specific criteria.
A new file is then created based off the original file name containing only the desired parsed criteria.
"""
import csv
import re
import sys

searched = ['aircheck', 'linkrunner at', 'onetouch at']

def find_group(row):
    """Return the group index of a row
    0 if the row contains searched[0]
    1 if the row contains searched[1]
    etc
    -1 if not found
    """
    for col in row:
        col = col.lower()
        for j, s in enumerate(searched):
            if s in col:
                return j
    return -1

# Prompt for File Name
def file_IO():
    print "Please Enter a File Name, (Without .csv extension): ",
    base_Name = raw_input()
    print "You entered: ", base_Name
    in_Name = base_Name + ".csv"
    out_Name = base_Name + ".parsed.csv"
    print "Input File: ", in_Name
    print "OutPut Files: ", out_Name
    # Opens Input file for read and output file to write.
    in_File = open(in_Name, "rU")
    reader = csv.reader(in_File)
    out_File = open(out_Name, "wb")
    writer = csv.writer(out_File, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    return (reader, writer)

file_IO()

# Read header
header = reader.next()
stored = []
writer.writerow([header[0], header[3]])
for i, row in enumerate(reader):
    g = find_group(row)
    if g >= 0:
        stored.append((g, i, row))
stored.sort()
for g, i, row in stored:
    writer.writerow([row[0], row[3]])

# Closing Input and Output files.
in_File.close()
out_File.close()
If I were you, I'd only separate find_group.
import csv

def find_group(row):
    GROUPS = ['aircheck', 'linkrunner at', 'onetouch at']
    for idx, group in enumerate(GROUPS):
        if group in map(str.lower, row):
            return idx
    return -1

def get_filenames():
    # this might be the only other thing you'd want to factor
    # into a function, and frankly I don't really like getting
    # user input this way anyway....
    basename = raw_input("Enter a base filename (no extension): ")
    infilename = basename + ".csv"
    outfilename = basename + ".parsed.csv"
    return infilename, outfilename

# notice that I don't open the files yet -- let main handle that
infilename, outfilename = get_filenames()
with open(infilename, 'rU') as inf, open(outfilename, 'wb') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter=',',
                        quotechar='"', quoting=csv.QUOTE_ALL)
    header = next(reader)
    writer.writerow([header[0], header[3]])
    stored = sorted([(find_group(row), idx, row)
                     for idx, row in enumerate(reader)
                     if find_group(row) >= 0])
    for _, _, row in stored:
        writer.writerow([row[0], row[3]])
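As for csv.DictReader and csv.DictWriter: they map each row to a dict keyed by the header row, which makes column selection read better at the cost of a little speed. A rough sketch of the same column-picking in that style; the header names 'Device' and 'Type' are made up here, since I don't know your actual CSV headers:

import csv

# Hypothetical header names ('Device', 'Type') -- substitute your real ones.
with open('input.csv', 'rU') as inf, open('output.parsed.csv', 'wb') as outf:
    reader = csv.DictReader(inf)
    writer = csv.DictWriter(outf, fieldnames=['Device', 'Type'],
                            quoting=csv.QUOTE_ALL, extrasaction='ignore')
    writer.writeheader()
    for row in reader:
        # extrasaction='ignore' drops every column not listed in fieldnames
        writer.writerow(row)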
I have a text file consisting of 100 records like:
fname,lname,subj1,marks1,subj2,marks2,subj3,marks3
I need to extract and print lname and marks1+marks2+marks3 in Python. How do I do that? I am a beginner in Python, so please help.
When I used split, I got an error saying:
TypeError: Can't convert 'type' object to str implicitly
The code was:
import sys

file_name = sys.argv[1]
file = open(file_name, 'r')
for line in file:
    fname = str.split(str=",", num=line.count(str))
    print fname
If you want to do it that way, you were close. Is this what you were trying to do?
file = open(file_name, 'r')
for line in file.readlines():
    fname = line.rstrip().split(',')  # using rstrip to remove the \n
    print fname
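If you then want the lname and the total of the three marks from that list, here is a small untested extension of the same loop; it assumes the marks sit in columns 3, 5 and 7 and are whole numbers:

file = open(file_name, 'r')
for line in file.readlines():
    fields = line.rstrip().split(',')
    # fields: fname, lname, subj1, marks1, subj2, marks2, subj3, marks3
    total = int(fields[3]) + int(fields[5]) + int(fields[7])
    print fields[1], total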
Note: this is not tested code, but it tries to solve your problem. Please give it a try:
import csv

with open(file_name, 'rb') as csvfile:
    marksReader = csv.reader(csvfile)
    for row in marksReader:
        if len(row) < 8:  # 8 is the number of columns in your file.
            # row has some missing columns or is empty
            continue
        # Unpack columns of row; you can also do fname = row[0], lname = row[1] and so on ...
        (fname, lname, subj1, marks1, subj2, marks2, subj3, marks3) = row
        # you can use float in place of int if marks contain decimals
        totalMarks = int(marks1) + int(marks2) + int(marks3)
        print '%s %s scored: %s' % (fname, lname, totalMarks)

print 'End.'
"""
sample file content
poohpool#signet.com; meixin_kok#hotmail.com; ngai_nicole#hotmail.com; isabelle_gal#hotmail.com; michelle-878#hotmail.com;
valerietan98#gmail.com; remuskan#hotmail.com; genevieve.goh#hotmail.com; poonzheng5798#yahoo.com; burgergirl96#hotmail.com;
insyirah_powergals#hotmail.com; little_princess-angel#hotmail.com; ifah_duff#hotmail.com; tweety_butt#hotmail.com;
choco_ela#hotmail.com; princessdyanah#hotmail.com;
"""
import pandas as pd
file = open('emaildump.txt', 'r')
for line in file.readlines():
fname = line.split(';') #using split to form a list
#print(fname)
df1 = pd.DataFrame(fname,columns=['Email'])
print(df1)
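Note that the loop above builds (and prints) a new one-line DataFrame per input line. If the goal is a single DataFrame for the whole file, a sketch along the same lines, still assuming 'emaildump.txt':

import pandas as pd

# Sketch: collect every address from every line first, then build one DataFrame.
emails = []
with open('emaildump.txt') as f:
    for line in f:
        emails.extend(e.strip() for e in line.split(';') if e.strip())
df = pd.DataFrame(emails, columns=['Email'])
print(df)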