Generating Log Event Objects from Text Log File - python

I have a text log file that looks like this:
Line 1 - Date/User Information
Line 2 - Type of LogEvent
Line 3-X, variable number of lines with additional information,
could be 1, could be hundreds
Then the sequence repeats.
There are around 20K lines of log, 50+ types of log events, approx. 15K separate user/date events. I would like to parse this in Python and make this information queryable.
So I thought I'd create a class LogEvent that records user, date (which I extract and convert to datetime), action, description... something like:
class LogEvent():
    def __init__(self, date, user):
        self.date = date  # string converted to datetime object
        self.user = user
        self.content = ""
Such an event is created each time a line of text with user/date information is parsed.
To add the log event information and any descriptive content, there could be something like:
    def classify(self, logevent):
        self.logevent = logevent

    def addContent(self, lineoftext):
        self.content += lineoftext
To process the text file, I would use readline() and proceed one line at a time. If the line is user/date, I instantiate a new object and add it to a list...
newevent = LogEvent(date,user)
eventlist.append(newevent)
and start adding action/content until I encounter a new object.
eventlist[-1].classify(logevent)
eventlist[-1].addContent(line)
All this makes sense (unless you convince me there is a smarter way to do it or a useful Python module I am not aware of). I'm trying to decide how best to classify the log event type when working with a set list of possible log event types that might hold more than 50 possible types, and I don't just want to accept the entire line of text as the log event type. Instead I need to compare the start of the line against a list of possible values...
What I don't want to do is have 50 of these:
if line.startswith("ABC"):
    logevent = "foo"
if line.startswith("XYZ"):
    logevent = "boo"
I thought about using a dict as lookup table but I am not sure how to implement that with the "startswith"... Any suggestions would be appreciated, and my apologies if I was way too long winded.

If you have a dictionary of your logEvent types as keys and whatever you want to go into the logevent attribute as values, you can do this,
logEvents = {"ABC":"foo", "XYZ":"boo", "Data Load":"DLtag"}
and the line from your log file is this,
line = "Data Load: 127 row uploaded"
you can check if any of the keys above are at the beginning of the line,
for k in logEvents:
    if line.startswith(k):
        logevent = logEvents[k]
This will loop over all the keys in logEvents and check if line starts with one of them. You can do whatever you like after the if conditional. You could put this into a function that is called after a line of text with user/date information is parsed. If you want to do something if no keys are found you can do this,
for k in logEvents:
    if line.startswith(k):
        logevent = logEvents[k]
        return
raise ValueError("logEvent not recognized.\n line = " + line)
Note, the exact type of exception you raise is not super important. I chose one of the builtin exceptions to avoid subclassing. Here you can see all the builtin exceptions.
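A minimal sketch of that lookup wrapped in a function, as suggested above. The name classify_line and the sample prefixes are only illustrative. Checking longer prefixes first is an added assumption, so that a more specific prefix wins when one prefix is the start of another.

```python
LOG_EVENTS = {"ABC": "foo", "XYZ": "boo", "Data Load": "DLtag"}


def classify_line(line, log_events=LOG_EVENTS):
    """Return the tag whose prefix starts the line, or raise ValueError."""
    # Longer prefixes are checked first so the most specific one wins.
    for prefix in sorted(log_events, key=len, reverse=True):
        if line.startswith(prefix):
            return log_events[prefix]
    raise ValueError("logEvent not recognized.\n line = " + line)


print(classify_line("Data Load: 127 row uploaded"))
```

With 50+ event types the dictionary is the only place that needs editing when a new type appears.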

Since I didn't do a good job posing my question, I have given it more thought and come up with this answer, which is similar to this thread.
I would like a clean, easily manageable solution to process each line of text differently, based on whether certain conditions are met. I didn't want to use a bunch of if/else clauses. So I tried instead to move both condition and consequence (processing) into a decisionDict = {}.
### RESPONSES WHEN CERTAIN CONDITIONS ARE MET - simple examples
def shorten(line):
    return line[:25]

def abc_replace(line):
    return line.replace("xyz", "abc")

### CONDITIONAL CHECKS FOR CONTENTS OF LINES OF TEXT - simple examples
def check_if_string_in_line(line):
    response = False
    if "xyz" in line:
        response = True
    return response

def check_if_longer_than25(line):
    response = False
    if len(line) > 25:
        response = True
    return response
### DECISION DICTIONARY - could be extended for any number of condition/response
decisionDict = {check_if_string_in_line:abc_replace, check_if_longer_than25:shorten}
### EXAMPLE LINES OF SILLY TEXT
lines = ["Alert level raised to xyz",
"user 5 just uploaded duplicate file",
"there is confusion between xyz and abc"]
for line in lines:
    for k in decisionDict.keys():
        if k(line):
            print(decisionDict[k](line))
            break  # stop at the first condition that matches
This keeps all the conditions and actions neatly separated. It also currently does not allow for more than one condition to apply to any one line of text: once the first condition resolves to True, we move on to the next line of text.
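Which condition counts as "first" depends on the dictionary's iteration order, which older Python versions do not guarantee. A hedged alternative sketch keeps the same idea but stores the rules as an ordered list of (condition, action) pairs. The names RULES and process are hypothetical.

```python
def shorten(line):
    return line[:25]


def abc_replace(line):
    return line.replace("xyz", "abc")


# Ordered (condition, action) pairs: earlier rules take priority.
RULES = [
    (lambda line: "xyz" in line, abc_replace),
    (lambda line: len(line) > 25, shorten),
]


def process(line, rules=RULES):
    """Apply the action of the first rule whose condition holds."""
    for condition, action in rules:
        if condition(line):
            return action(line)
    return line  # no rule matched: leave the line unchanged


for text in ["Alert level raised to xyz",
             "user 5 just uploaded duplicate file",
             "short line"]:
    print(process(text))
```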

Related

Can I put file content checks in switches?

I'm checking whether several strings are present in a file's contents at the same time, and I'm curious whether this could be done with a switch-like construct, which I'm moving the code to right now, because I have a lot of lines that look like the one below.
As larsks asked in the comments: yes, I mean something like the match statement. I'm aiming for results like that statement gives, but I've also found another solution that works for me in cases where I am looking for only one substring.
My current code looks like this:
f = open('somesortoffilename')
if "string" in f.read() and "otherstring" in f.read(): variable = 'value'
And I would like something like this:
f = open('somesortoffilename')
def f(variable):
    return {
        'string' and 'otherstring' in f.read(): 'value'
    }
Is it possible in any way?
First, we need to make sure that we read the file stream only once. The first call to f.read() consumes the whole stream, so a second call returns an empty string.
Let's store the contents in a string instead.
with open('somesortoffilename') as file:
    contents = file.read()
The with form makes sure that the file stream is closed after we fetch its contents.
The “switch” pattern can be implemented in Python with dictionaries (the dict type).
switches = {
    'term1': ['string', 'other_string'],
    'term2': ['another_string']
}
We can use this lookup to check if any string corresponding to a term is found in the file.
def f(contents):
    for term, values in switches.items():
        if any(x in contents for x in values):
            return term
    return None
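The question asked for several strings being present at the same time, so here is a small sketch of the same lookup with the test made configurable. The names SWITCHES, first_matching_term and the require parameter are assumptions for illustration only.

```python
SWITCHES = {
    "term1": ["string", "other_string"],
    "term2": ["another_string"],
}


def first_matching_term(contents, switches=SWITCHES, require=all):
    """Return the first term whose strings are found in contents, else None.

    require=all demands every string of a term; require=any accepts one.
    """
    for term, values in switches.items():
        if require(value in contents for value in values):
            return term
    return None


sample = "a string and other_string appear here"
print(first_matching_term(sample))
print(first_matching_term("just a string"))
print(first_matching_term("just a string", require=any))
```

Keep in mind that these are plain substring checks, so text containing "another_string" also contains "other_string" and "string".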

Extract text from a config file [duplicate]

I'm using a config file to inform my Python script of a few key-values, for use in authenticating the user against a website.
I have three variables: the URL, the user name, and the API token.
I've created a config file with each key on a different line, so:
url:<url string>
auth_user:<user name>
auth_token:<API token>
I want to be able to extract the text after the key words into variables, also stripping any "\n" that exist at the end of the line. Currently I'm doing this, and it works but seems clumsy:
with open(argv[1], mode='r') as config_file:
    lines = config_file.readlines()
    for line in lines:
        url_match = match('jira_url:', line)
        if url_match:
            jira_url = line[9:].split("\n")[0]
        user_match = match('auth_user:', line)
        if user_match:
            auth_user = line[10:].split("\n")[0]
        token_match = match('auth_token', line)
        if token_match:
            auth_token = line[11:].split("\n")[0]
Can anybody suggest a more elegant solution? Specifically it's the ... = line[10:].split("\n")[0] lines that seem clunky to me.
I'm also slightly confused why I can't reuse my match object within the for loop, and have to create new match objects for each config item.
You could use a .yml file and read the values with the yaml.load() function (YAML needs a space after the colon, e.g. url: <url string>):
import yaml
with open('settings.yml') as file:
    settings = yaml.load(file, Loader=yaml.FullLoader)
Now you can access elements like settings["url"] and so on.
If the format is always <tag>:<value> you can easily parse it by splitting the line at the colon and filling up a custom dictionary:
settings = dict()
with open(filename, "r") as config_file:
    for l in config_file:
        # rstrip('\n') drops the line ending even if the last line lacks one
        elements = l.rstrip('\n').split(':')
        # re-join the rest so values containing colons (e.g. URLs) survive
        settings[elements[0]] = ':'.join(elements[1:])
So, you get a dictionary that has the tags as keys and the values as values. You can then just refer to these dictionary entries in your program
(e.g. if you need the auth_token, just use settings["auth_token"]).
If you can add one line to the config file, configparser is a good choice:
https://docs.python.org/3/library/configparser.html
[1] config file : 1.cfg
# configparser needs a section header
[DEFAULT]
url:<url string>
auth_user:<user name>
auth_token:<API token>
[2] python scripts
import configparser
config = configparser.ConfigParser()
config.read('1.cfg')
print(config.get('DEFAULT','url'))
print(config.get('DEFAULT','auth_user'))
print(config.get('DEFAULT','auth_token'))
[3] output
<url string>
<user name>
<API token>
configparser's methods are also useful when you can't guarantee that the config file is always complete.
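As a hedged illustration of the incomplete-file case: ConfigParser.get() accepts a fallback keyword that is returned when an option is missing. The config text below is made up for the example and leaves out auth_token on purpose.

```python
import configparser

# Hypothetical config text with auth_token missing on purpose.
CONFIG_TEXT = """
[DEFAULT]
url: https://example.invalid
auth_user: alice
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

url = config.get("DEFAULT", "url")
# fallback= is returned instead of raising NoOptionError when the key is absent.
auth_token = config.get("DEFAULT", "auth_token", fallback=None)

print(url)
print(auth_token)
```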
You have a couple of great answers already, but I wanted to step back and provide some guidance on how you might approach these problems in the future. Getting quick answers sometimes prevents you from understanding how those people knew about the answers in the first place.
When you zoom out, the first thing that strikes me is that your task is to provide config, using a file, to your program. Software has the remarkable property of solve-once, use-anywhere. Config files have been a problem worth solving for at least 40 years, so you can bet your bottom dollar you don't need to solve this yourself. And already-solved means someone has already figured out all the little off-by-one and edge-case dramas like stripping line endings and dealing with expected input. The challenge of course, is knowing what solution already exists. If you haven't spent 40 years peeling back the covers of computers to see how they tick, it's difficult to "just know". So you might have a poke around on Google for "config file format" or something.
That would lead you to one of the most prevalent config file systems on the planet - the INI file. Just as useful now as it was 30 years ago, and as a bonus, looks not too dissimilar to your example config file. Then you might search for "read INI file in Python" or something, and come across configparser and you're basically done.
Or you might see that sometime in the last 30 years, YAML became the more trendy option, and wouldn't you know it, PyYAML will do most of the work for you.
But none of this gets you any better at using Python to extract from text files in general. So zooming in a bit, you want to know how to extract parts of lines in a text file. Again, this problem is an age-old problem, and if you were to learn about this problem (rather than just be handed the solution), you would learn that this is called parsing and often involves tokenisation. If you do some research on, say "parsing a text file in python" for example, you would learn about the general techniques that work regardless of the language, such as looping over lines and splitting each one in turn.
Zooming in one more step closer, you're looking to strip the new line off the end of the string so it doesn't get included in your value. Once again, this ain't a new problem, and with the right keywords you could dig up the well-trodden solutions. This is often called "chomping" or "stripping", and with some careful search terms, you'd find rstrip() and friends, and not have to do awkward things like splitting on the '\n' character.
Your final question is about re-using the match object. This is much harder to research. But again, the "solution" won't necessarily show you where you went wrong. What you need to keep in mind is that the statements in the for loop are sequential. To think them through you should literally execute them in your mind, one after another, and imagine what's happening. Each time you call match, it either returns None or a Match object. You never use the object, except to check for truthiness in the if statement. And next time you call match, you do so with different arguments so you get a new Match object (or None). Therefore, you don't need to keep the object around at all. You can simply do:
if match('jira_url:', line):
    jira_url = line[9:].split("\n")[0]
if match('auth_user:', line):
    auth_user = line[10:].split("\n")[0]
and so on. Not only that, if the first if triggered then you don't need to bother calling match again - it will certainly not trigger any of the other matches for the same line. So you could do:
if match('jira_url:', line):
    jira_url = line[9:].rstrip()
elif match('auth_user:', line):
    auth_user = line[10:].rstrip()
and so on.
But then you can start to think - why bother doing all these matches on the colon, only to then manually split the string at the colon afterwards? You could just do:
# maxsplit=1 keeps any colons inside the value (e.g. in a URL)
tokens = line.rstrip().split(':', 1)
if tokens[0] == 'jira_url':
    jira_url = tokens[1]
elif tokens[0] == 'auth_user':
    auth_user = tokens[1]
If you keep making these improvements (and there's lots more to make!), eventually you'll end up re-writing configparser, but at least you'll have learned why it's often a good idea to use an existing library where practical!
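To show where a few more of those improvements might lead, here is a minimal sketch that collects every key into a dictionary instead of one variable per key. The function name parse_config and the sample values are hypothetical.

```python
def parse_config(lines):
    """Turn 'key:value' lines into a dict, skipping blank or malformed lines."""
    settings = {}
    for line in lines:
        # partition splits at the first colon only, so URLs keep theirs.
        key, sep, value = line.rstrip("\n").partition(":")
        if sep and key.strip():
            settings[key.strip()] = value.strip()
    return settings


sample = ["jira_url:https://example.invalid/jira\n",
          "auth_user:alice\n",
          "\n",
          "auth_token:s3cr3t"]
settings = parse_config(sample)
print(settings["jira_url"])
```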

How to efficiently detect an XML schema without having the entire file in python

I have a very large feed file that is sent as an XML document (5GB). What would be the fastest way to parse the structure of the main item node without previously knowing its structure? Is there a means in Python to do so 'on-the-fly' without having the complete xml loaded in memory? For example, what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
Update: I've included an example XML fragment here: https://hastebin.com/uyalicihow.xml. I'm looking to extract something like a dataframe (or list or whatever other data structure you want to use) similar to the following:
Items/Item/Main/Platform    Items/Item/Info/Name
iTunes                      Chuck Versus First Class
iTunes                      Chuck Versus Bo
How could this be done? I've added a bounty to encourage answers here.
Several people have misinterpreted this question, and re-reading it, it's really not at all clear. In fact there are several questions.
How to detect an XML schema
Some people have interpreted this as saying you think there might be a schema within the file, or referenced from the file. I interpreted it as meaning that you wanted to infer a schema from the content of the instance.
What would be the fastest way to parse the structure of the main item node without previously knowing its structure?
Just put it through a parser, e.g. a SAX parser. A parser doesn't need to know the structure of an XML file in order to split it up into elements and attributes. But I don't think you actually want the fastest parse possible (in fact, I don't think performance is that high on your requirements list at all). I think you want to do something useful with the information (you haven't told us what): that is, you want to process the information, rather than just parsing the XML.
Is there a python utility that can do so 'on-the-fly' without having the complete xml loaded in memory?
Yes, according to this page which mentions 3 event-based XML parsers in the Python world: https://wiki.python.org/moin/PythonXml (I can't vouch for any of them)
what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
I'm not sure you know what the verb "to parse" actually means. Your phrase certainly suggests that you expect the file to contain a schema, which you want to extract. But I'm not at all sure you really mean that. And in any case, if it did contain a schema in the first 5MB, you could find it just by reading the file sequentially; there would be no need to "save" the first part of the file first.
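For illustration, a hedged sketch of event-based parsing with the standard library's xml.etree.ElementTree.iterparse. It records each distinct element path as it streams through the input and keeps whatever it has seen if the input turns out to be truncated. The function name collect_paths and the sample bytes are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def collect_paths(source):
    """Return the distinct element paths seen, in first-seen order."""
    stack, paths = [], []
    try:
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "start":
                stack.append(elem.tag)
                path = "/".join(stack)
                if path not in paths:
                    paths.append(path)
            else:
                stack.pop()
                elem.clear()  # drop content already processed to save memory
    except ET.ParseError:
        pass  # truncated input: keep what was seen so far
    return paths


truncated = (b"<Items><Item><Main><Platform>iTunes</Platform></Main>"
             b"<Info><Name>Chuck</Name><Stud")
print(collect_paths(io.BytesIO(truncated)))
```

The same function accepts a filename or an open binary file, so nothing has to be saved separately first.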
Question: way to parse the structure of the main item node without previously knowing its structure
The class TopSequenceElement parses an XML file to find all sequence elements.
By default, it breaks at the FIRST closing </...> of the topmost element.
Therefore, it is independent of the file size and even works on truncated files.
from lxml import etree
from collections import OrderedDict

class TopSequenceElement(etree.iterparse):
    """
    Read XML File
    results: .seq == OrderedDict of Sequence Element
             .element == topmost closed </..> Element
             .xpath == XPath to top_element
    """
    class Element:
        """
        Classify a Element
        """
        SEQUENCE = (1, 'SEQUENCE')
        VALUE = (2, 'VALUE')

        def __init__(self, elem, event):
            if len(elem):
                self._type = self.SEQUENCE
            else:
                self._type = self.VALUE
            self._state = [event]
            self.count = 0
            self.parent = None
            self.element = None

        @property
        def state(self):
            return self._state

        @state.setter
        def state(self, event):
            self._state.append(event)

        @property
        def is_seq(self):
            return self._type == self.SEQUENCE

        def __str__(self):
            return "Type:{}, Count:{}, Parent:{:10} Events:{}"\
                .format(self._type[1], self.count, str(self.parent), self.state)

    def __init__(self, fh, break_early=True):
        """
        Initialize 'iterparse' only to callback at 'start'|'end' Events
        :param fh: File Handle of the XML File
        :param break_early: If True, break at FIRST closing </..> of the topmost Element
                            If False, run until EOF
        """
        super().__init__(fh, events=('start', 'end'))
        self.seq = OrderedDict()
        self.xpath = []
        self.element = None
        self.parse(break_early)

    def parse(self, break_early):
        """
        Parse the XML Tree, doing
            classify the Element, process only SEQUENCE Elements
            record, count of end </...> Events,
                    parent from this Element
                    element Tree of this Element
        :param break_early: If True, break at FIRST closing </..> of the topmost Element
        :return: None
        """
        parent = []
        try:
            for event, elem in self:
                tag = elem.tag
                _elem = self.Element(elem, event)
                if _elem.is_seq:
                    if event == 'start':
                        parent.append(tag)
                        if tag in self.seq:
                            self.seq[tag].state = event
                        else:
                            self.seq[tag] = _elem
                    elif event == 'end':
                        parent.pop()
                        if parent:
                            self.seq[tag].parent = parent[-1]
                        self.seq[tag].count += 1
                        self.seq[tag].state = event
                        if self.seq[tag].count == 1:
                            self.seq[tag].element = elem
                        if break_early and len(parent) == 1:
                            break
        except etree.XMLSyntaxError:
            pass
        finally:
            """
            Find the topmost completed '<tag>...</tag>' Element
            Build .seq.xpath
            """
            for key in list(self.seq):
                self.xpath.append(key)
                if self.seq[key].count > 0:
                    self.element = self.seq[key].element
                    break
            self.xpath = '/'.join(self.xpath)

    def __str__(self):
        """
        String Representation of the Result
        :return: .xpath and list of .seq
        """
        return "Top Sequence Element:{}\n{}"\
            .format(self.xpath,
                    '\n'.join(["{:10}:{}".format(key, elem)
                               for key, elem in self.seq.items()]))

if __name__ == "__main__":
    with open('../test/uyalicihow.xml', 'rb') as xml_file:
        tse = TopSequenceElement(xml_file)
        print(tse)
Output:
Top Sequence Element:Items/Item
Items     :Type:SEQUENCE, Count:0, Parent:None       Events:['start']
Item      :Type:SEQUENCE, Count:1, Parent:Items      Events:['start', 'end', 'start']
Main      :Type:SEQUENCE, Count:2, Parent:Item       Events:['start', 'end', 'start', 'end']
Info      :Type:SEQUENCE, Count:2, Parent:Item       Events:['start', 'end', 'start', 'end']
Genres    :Type:SEQUENCE, Count:2, Parent:Item       Events:['start', 'end', 'start', 'end']
Products  :Type:SEQUENCE, Count:1, Parent:Item       Events:['start', 'end']
... (omitted for brevity)
Step 2: Now, you know there is a <Main> Tag, you can do:
print(etree.tostring(tse.element.find('Main'), pretty_print=True).decode())
<Main>
  <Platform>iTunes</Platform>
  <PlatformID>353736518</PlatformID>
  <Type>TVEpisode</Type>
  <TVSeriesID>262603760</TVSeriesID>
</Main>
Step 3: Now, you know there is a <Platform> Tag, you can do:
print(etree.tostring(tse.element.find('Main/Platform'), pretty_print=True).decode())
<Platform>iTunes</Platform>
Tested with Python:3.5.3 - lxml.etree:3.7.1
For very big files, reading is always a problem. I would suggest a simple algorithmic behavior for the reading of the file itself. The key point is always the xml tags inside the files. I would suggest you read the xml tags and sort them inside a heap and then validate the content of the heap accordingly.
Reading the file should also happen in chunks:
import xml.etree.ElementTree as etree
for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
    store_in_heap(event, elem)
This will parse the XML file a chunk at a time and hand you each event along the way. start triggers when a tag is first encountered; at this point elem will be empty except for elem.attrib, which contains the attributes of the tag. end triggers when the closing tag is encountered and everything in between has been read.
You can also benefit from the namespace events start-ns and end-ns, which ElementTree provides to gather all the namespaces in the file. For start-ns events, elem is a (prefix, uri) tuple rather than an element.
Refer to this link for more information about namespaces
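A small sketch of the namespace part, assuming only the standard library: with events=("start-ns",), iterparse yields a (prefix, uri) tuple for every namespace declaration. The function name collect_namespaces and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def collect_namespaces(source):
    """Map each declared prefix to its URI using 'start-ns' events."""
    namespaces = {}
    for event, payload in ET.iterparse(source, events=("start-ns",)):
        prefix, uri = payload  # for 'start-ns' the payload is a (prefix, uri) tuple
        namespaces[prefix] = uri
    return namespaces


doc = b'<root xmlns:a="http://example.invalid/a"><a:item/></root>'
print(collect_namespaces(io.BytesIO(doc)))
```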
My interpretation of your needs is that you want to be able to parse the partial file and build up the structure of the document as you go. I've taken some assumptions from the file you uploaded:
Fundamentally you want to be parsing collections of things which have similar properties - I'm inferring this from the way you presented your desired output as a table with rows containing the values.
You expect these collections of things to have the same number of values.
You need to be able to parse partial files.
You don't worry about the properties of elements, just their contents.
I'm using xml.sax as this deals with arbitrarily large files and doesn't need to read the whole file into memory. Note that the strategy I'm following now doesn't actually scale that well as I'm storing all the elements in memory to build the dataframe, but you could just as well output the paths and contents.
In the sample file there is a problem with having one row per Item since there are multiples of the Genre tag and there are also multiple Product tags. I've handled the repeated Genre tags by appending them. This relies on the Genre tags appearing consecutively. It is not at all clear how the Product relationships can be handled in a single table.
import xml.sax
from collections import defaultdict
class StructureParser(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.text = ''
        self.path = []
        self.datalist = defaultdict(list)
        self.previouspath = ''

    def startElement(self, name, attrs):
        self.path.append(name)

    def endElement(self, name):
        strippedtext = self.text.strip()
        path = '/'.join(self.path)
        if strippedtext != '':
            if path == self.previouspath:
                # This handles the "Genre" tags in the sample file
                self.datalist[path][-1] += f',{strippedtext}'
            else:
                self.datalist[path].append(strippedtext)
        self.path.pop()
        self.text = ''
        self.previouspath = path

    def characters(self, content):
        self.text += content
You'd use this like this:
parser = StructureParser()
try:
    xml.sax.parse('uyalicihow.xml', parser)
except xml.sax.SAXParseException:
    print('File probably ended too soon')
This will read the example file just fine.
Once this has run (and possibly printed "File probably ended too soon"), you have the parsed contents in parser.datalist.
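If only the beginning of a huge file should be looked at, one option is to feed just the first bytes to an incremental SAX parser. This is a sketch under assumptions: PathCollector is a hypothetical minimal handler (not the StructureParser above), parse_prefix and max_bytes are made-up names, and the parser returned by xml.sax.make_parser() is taken to be the default expat-based one, which supports feed() and close().

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class PathCollector(ContentHandler):
    """Hypothetical minimal handler: records each element path once."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        path = "/".join(self.stack)
        if path not in self.paths:
            self.paths.append(path)

    def endElement(self, name):
        self.stack.pop()


def parse_prefix(stream, handler, max_bytes):
    """Feed at most max_bytes from a binary stream to a SAX parser."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    try:
        parser.feed(stream.read(max_bytes))
        parser.close()  # raises if the document is incomplete
    except xml.sax.SAXParseException:
        pass  # expected when the prefix cuts the document short


xml_bytes = b"<Items><Item><Main><Platform>iTunes</Platform></Main></Item></Items>"
handler = PathCollector()
parse_prefix(io.BytesIO(xml_bytes), handler, max_bytes=40)
print(handler.paths)
```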
You obviously want to have just the parts which read successfully, so you can figure out the shortest list and build a DataFrame with just those paths:
import pandas as pd
smallest_items = min(len(e) for e in parser.datalist.values())
df = pd.DataFrame({key: value for key, value in parser.datalist.items() if len(value) == smallest_items})
This gives something similar to your desired output:
   Items/Item/Main/Platform  Items/Item/Main/PlatformID  Items/Item/Main/Type
0                    iTunes                   353736518             TVEpisode
1                    iTunes                   495275084             TVEpisode
The columns for the test file which are matched here are
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL'],
dtype='object')
Based on your comments, it appears as though it is more important to you to have all the elements represented, but perhaps just showing a preview, in which case you can perhaps use only the first elements from the data. Note that in this case the Products entries won't match the Item entries.
df = pd.DataFrame({key: value[:smallest_items] for key, value in parser.datalist.items()})
Now we get all the paths:
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL',
'Items/Item/Products/Product/Offers/Offer/Price',
'Items/Item/Products/Product/Offers/Offer/Currency'],
dtype='object')
There are a number of tools around that will generate a schema from a supplied instance document. I don't know how many of them will work on a 5Gb input file, and I don't know how many of them can be invoked from Python.
Many years ago I wrote a Java-based, fully streamable tool to generate a DTD from an instance document. It hasn't been touched in years but it should still run: https://sourceforge.net/projects/saxon/files/DTDGenerator/7.0/dtdgen7-0.zip/download?use_mirror=vorboss
There are other tools listed here: Any tools to generate an XSD schema from an XML instance document?
As I see it your question is very clear. I give it a plus one up vote for clearness. You are wanting to parse text.
Write a little text parser, we can call that EditorB, that reads in chunks of the file or at least line by line. Then edit or change it as you like and re-save that chunk or line.
It can be easy in Windows from 98SE on. It should be easy in other operating systems.
The process is (1) Adjust (manually or via program), as you currently do, we can call this EditorA, that is editing your XML document, and save it; (2) stop EditorA; (3) Run your parser or editor, EditorB, on the saved XML document either manually or automatically (started via detecting that the XML document has changed via date or time or size, etc.); (4) Using EditorB, save manually or automatically the edits from step 3; (5) Have your EditorA reload the XML document and go on from there; (6) do this as often as is necessary, making edits with EditorA and automatically adjusting them outside of EditorA by using EditorB.
Edit this way before you send the file.
It is a lot of typing to explain, but XML is just a glorified text document. It can be easily parsed and edited and saved, either character by character or by larger amounts line by line or in chunks.
As a further note, this can be applied via entire directory contained documents or system wide documents as I have done in the past.
Make certain that EditorA is stopped before EditorB is allowed to start making its changes. Then stop EditorB before restarting EditorA. If you set this up as I described, then EditorB can be run continually in the background, but put in it an automatic notifier (maybe a message box with options, or a little button that is set foremost on the screen when activated) that allows you to turn off (or continue with) EditorA before using EditorB. Or, as I would do it, put in a detector to keep EditorB from executing its own edits as long as EditorA is running.
B Lean

How to merge few lines with filtering some text

I have a text file with the following format.
The first line includes "USERID"=12345678 and the other lines include the user groups for each application:
For example:
The user with T-number T12345 has WRITE access to APP1 and APP2 and READ-ONLY access to APP1.
T-Number is just some other kind of ID.
00001, 00002 and so on are sequence numbers and can be ignored.
T12345;;USERID;00001;12345678;
T12345;APPLICATION;WRITE;00001;APP1
T12345;APPLICATION;WRITE;00002;APP2
T12345;APPLICATION;READ-ONLY;00001;APP1
I need to do some filtering and merge the line containing USERID with all the lines having user groups, matching t-number with userid (T12345 = 12345678)
So the output should look like this.
12345678;APPLICATION;WRITE;APP1
12345678;APPLICATION;WRITE;APP2
12345678;APPLICATION;READ-ONLY;APP1
Should I use csv python module to accomplish this?
I do not see any advantage in using the csv module for reading and parsing the input text file. The number of fields varies: 6 fields in the USERID line, with 2 of them empty, but 5 non-empty fields in the other lines. The fields look very simple, so there is no need for csv's handling of the separator character hidden away in quotes and the like. There is no header line as in a csv file, but rather many headers sprinkled in among the data lines.
A simple routine that reads each line, splits each on the semicolon character, and parses the line, and combines related lines would suffice.
The output file is another matter. The lines have the same format, with the same number of fields. So creating that output may be a good use for csv. However, the format is so simple that the file could also be created without csv.
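The simple routine described above could be sketched like this. The function name merge_user_lines is hypothetical, and the sketch assumes every T-number has a USERID line somewhere in the input (a missing one would raise KeyError).

```python
def merge_user_lines(lines):
    """Replace the T-number with the USERID found on the matching USERID line."""
    user_ids = {}   # T-number -> USERID
    pending = []    # (t_number, kind, access, app) in input order
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        t_number, kind, access, _seq, value = fields[:5]
        if access == "USERID":
            user_ids[t_number] = value
        else:
            pending.append((t_number, kind, access, value))
    # The sequence number (_seq) is ignored, as the question allows.
    return [";".join([user_ids[t], kind, access, app])
            for t, kind, access, app in pending]


sample = """T12345;;USERID;00001;12345678;
T12345;APPLICATION;WRITE;00001;APP1
T12345;APPLICATION;WRITE;00002;APP2
T12345;APPLICATION;READ-ONLY;00001;APP1"""

merged = merge_user_lines(sample.splitlines())
print("\n".join(merged))
```

Because the USERID mapping is applied only after all lines are read, the USERID line does not have to come first.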
I am not so sure you should use the csv module here: the file has mixed data, possibly more than just users and user group rights. In the case of a user declaration, you only need to retrieve its group and id, while for the application rights you need to extract the group, app name and right. The more differing data you have, the more issues you will encounter; with manual parsing of the data you can always just continue when a line meets certain criteria.
So far I would say you are better off with manual, line-by-line parsing: structure the data into something meaningful, then output it. For instance:
from io import StringIO
from pprint import pprint

feed = """T12345;;USERID;00001;12345678;
T12345;;USERID;00001;2345678;
T12345;;USERID;00002;345678;
T12345;;USERID;00002;45678;
T12345;APPLICATION;WRITE;00001;APP1
T12345;APPLICATION;WRITE;00002;APP2
T12345;APPLICATION;READ-ONLY;00001;APP1
T12345;APPLICATION;WRITE;00002;APP1
T12345;APPLICATION;WRITE;00002;APP2"""
buf = StringIO(feed)
groups = {}
# Read all data into a dict of dicts
for line in buf:
    values = line.strip().split(";")
    if values[3] not in groups:
        groups[values[3]] = {"users": [], "apps": {}}
    if values[2] == "USERID":
        groups[values[3]]['users'].append(values[4])
        continue
    if values[1] == "APPLICATION":
        if values[4] not in groups[values[3]]["apps"]:
            groups[values[3]]["apps"][values[4]] = []
        groups[values[3]]["apps"][values[4]].append(values[2])
print("Structured data with group as root")
pprint(groups)
print("Output data")
for group_id, group in groups.items():
    # Order by user, app
    for user in group["users"]:
        for app_name, rights in group["apps"].items():
            for right in rights:
                print(";".join([user, "APPLICATION", right, app_name]))
Online demo here

Using Python to split a Unicode file object into dictionary Keys and values

Hi and thanks for reading. I’ll admit that this is a progression on from a previous question I asked earlier, after I partially solved the issue. I am trying to process a block of text (file_object) in an earlier working function. The text or file_object happens to be in Unicode, but I have managed to convert to ascii text and split on a line by line basis. I am hoping to then further split the text on the ‘=’ symbol so that I can drop the text into a dictionary. For example Key: Value as ‘GPS Time’:’ 14:18:43’ so removing the trailing '.000' from the time (though this is a second issue).
Here’s the file_object format…
2015 Jan 01 20:07:16.047 GPS Info #Log packet ID
GPS Time = 14:18:43.000
Longitude = 000.65341
Latitude = +41.25385
Altitude = +111.400
This is my partially working function…
def process_data(file_object):
    file_object = file_object.encode('ascii','ignore')
    split = file_object.split('\n')
    for i in range(len(split)):
        while '=' in split[i]:
            processed_data = (split[i].split('=', 1) for _ in xrange(len(split)))
            return {k.strip(): v.strip() for k, v in processed_data}
This is the initial section of the main script that prompts the above function, and then sets GPS Time as the Dictionary key…
while (mypkt.Next()): #mypkt.Next is an API function in the log processor app I am using – essentially it grabs the whole GPS Info packet shown above
    data = process_data(mypkt.Text, 1)
    packets[data['GPS Time']] = data
The code above has no problem splitting the first instance, 'GPS Time', but it ignores Longitude, Latitude etc. To make matters worse, there is sometimes a blank line between each packet item too. I guess I need to store the previous dictionary-related splits before the 'return', but I am having difficulty trying to find out how to do this.
The dict output I am currently getting is…
'14:19:09.000': {'GPS Time': '14:19:09.000'},
But What I am hoping for is…
'14:19:09': {'GPS Time': '14:19:09',
             'Longitude': '000.65341',
             'Latitude': '+41.25385',
             'Altitude': '+111.400'},
Thanks in advance for any help.
MikG
All this use of range(len(whatever)) is nonsense. You almost never need to do that in Python. Just iterate through the thing.
Your problem however is more fundamental: you return from inside the while loop. That means you only ever get one element, because as soon as that first line is processed, you return and the function ends.
Also, you have a while loop which means that processing will end as soon as the program encounters a line without an equals; but you have blank lines between each data line, so again execution would never proceed past the first one.
So all you need is:
def process_data(file_object):
    split_data = file_object.split('\n')
    result = {}
    for line in split_data:
        if '=' in line:
            key, value = line.split('=', 1)
            result[key.strip()] = value.strip()
    return result
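The question also asked about dropping the trailing '.000' from the time. A minimal sketch of one way to do that, reusing the same loop and assuming the fractional part always follows the first dot in the 'GPS Time' value:

```python
def process_data(text):
    """Parse 'key = value' lines into a dict, ignoring other lines."""
    result = {}
    for line in text.split("\n"):
        if "=" in line:
            key, value = line.split("=", 1)
            result[key.strip()] = value.strip()
    # Drop fractional seconds, e.g. '14:18:43.000' -> '14:18:43'.
    if "GPS Time" in result:
        result["GPS Time"] = result["GPS Time"].split(".", 1)[0]
    return result


packet = """2015 Jan 01 20:07:16.047 GPS Info #Log packet ID
GPS Time = 14:18:43.000

Longitude = 000.65341
Latitude = +41.25385
Altitude = +111.400"""

packets = {}
data = process_data(packet)
packets[data["GPS Time"]] = data
print(packets)
```

The header line and the blank line contain no '=', so they are skipped without any special handling.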
