I have a huge (2 GB) mixed log file which I want to split/group by the CMS that made each log entry.
Right now I run over the whole file, filter for the different CMS tags, and export all logs grouped by CMS tag.
Since I know how many CMSs I have, I can easily create the correct number of lists to fill, e.g.:
all_wordpress_logs = []
all_cms2_logs = []
...
and fill them with all_wordpress_logs.append(x)
So far so good.
Now I want to group/filter by the class that threw whatever got logged.
But since I don't know in advance how many lists I need, I can't prepare them like above.
So my question is: how can I create lists "on demand", with the correct names, to fill with data?
E.g:
wordpress_class1 = []
wordpress_class1.append(x)
wordpress_class2 = []
wordpress_class2.append(x)
...
wordpress_classN = []
wordpress_classN.append(x)
Any help would be appreciated.
You can use a dictionary whose keys are the classes you want to group by; when a new class appears, you just create and fill a new list and store it under the matching key.
A dictionary is a good storage tool here, where you don't know in advance which classes you might have. It can expand easily, and you can have lists inside it mapped to keys. Below is just an example of this. You can then access your lists by class name.
data="""class1 some log info
class1 more log info
class2 other log info
class3 different log into
class1 last log info
"""
dynamic_lists = {}
for line in data.splitlines():
    line_data = line.split()
    class_name = line_data[0]
    if class_name in dynamic_lists:
        dynamic_lists[class_name].append(" ".join(line_data[1:]))
    else:
        dynamic_lists[class_name] = [" ".join(line_data[1:])]
print(f'lines logged for class1 are: {dynamic_lists["class1"]}')
OUTPUT
lines logged for class1 are: ['some log info', 'more log info', 'last log info']
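As a side note, the standard library's collections.defaultdict removes the need for the explicit key check, since missing keys simply start out as empty lists. A minimal sketch of the same grouping, reusing the data string from above:

from collections import defaultdict

dynamic_lists = defaultdict(list)  # missing keys start as empty lists
for line in data.splitlines():
    # split off the class tag; the rest of the line is the message
    class_name, _, message = line.partition(" ")
    dynamic_lists[class_name].append(message)

print(dynamic_lists["class1"])  # ['some log info', 'more log info', 'last log info']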
Please help!
Well, first of all, I will explain what this should do. I'm trying to store Discord users and servers in lists (users that use the bot and servers the bot is in), together with their IDs.
for example:
class User(object):
    name = ""
    uid = 0
Now, all Discord IDs are very long, and I want to store lots of users and servers in my lists (one list for each). But suppose I get 10,000 users in my list and I want to get the last one (without knowing it's the last one); that would take a lot of time. Instead, I thought I could make a directory system for storing users in the list and finding them quickly. This is how it works:
I can get the id easily so imagine my id is 12345.
Now I convert it into a string using Python's str() function and store it in a variable, strId.
Then I use each digit of that string as an index into the nested users list, like this:
The User() is where the user is stored
users_list = [[[], [[], [], [[], [], [], [User()]]]]]
actual_dir = users_list
for digit in strId:
    actual_dir = actual_dir[int(digit)]
user = actual_dir[0]
And that's how I reach the user (or something like that)
Now, here is where my problem is. I know I can get the user easily by looking it up by ID, but when I want to save the changes I would have to do something like users_list[1][2][3][4][5] = changed_user_variable, and as far as I know I cannot do something like list[1] += [2].
Is there any way to reach the user and save the changes?
Thanks in advance
You can use a Python dictionary with the user ID as the key and the user object as the value. I ran a test on my own computer and found that finding 100,000 random users in a dictionary with 10 million users only took 0.3s. This method is much simpler, and I would guess it's just as fast, if not faster.
You can create a dictionary and add users with:
users = {}
users[userID] = some_user
(many other ways of doing this)
By using a dictionary you can easily change a user's field:
users[userID].some_field = "Some value"
or overwrite the same way you add users in the first place.
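Putting it together, a minimal self-contained sketch (the User class and the IDs here are made up for illustration):

class User(object):
    def __init__(self, name, uid):
        self.name = name
        self.uid = uid

users = {}

# Add users keyed by their (long) Discord-style IDs.
users[123456789012345678] = User("alice", 123456789012345678)
users[987654321098765432] = User("bob", 987654321098765432)

# Lookup by key is a hash lookup, so it stays fast no matter
# how many users are stored.
users[123456789012345678].name = "alice_renamed"
print(users[123456789012345678].name)  # alice_renamed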
I have a text file with the following format.
The first line includes "USERID"=12345678 and the other lines include the user groups for each application:
For example:
The user with T-number T12345 has WRITE access to APP1 and APP2 and READ-ONLY access to APP1.
T-Number is just some other kind of ID.
00001, 00002 and so on are sequence numbers and can be ignored.
T12345;;USERID;00001;12345678;
T12345;APPLICATION;WRITE;00001;APP1
T12345;APPLICATION;WRITE;00002;APP2
T12345;APPLICATION;READ-ONLY;00001;APP1
I need to do some filtering and merge the line containing USERID with all the lines containing user groups, matching the T-number with the USERID (T12345 = 12345678).
So the output should look like this.
12345678;APPLICATION;WRITE;APP1
12345678;APPLICATION;WRITE;APP2
12345678;APPLICATION;READ-ONLY;APP1
Should I use csv python module to accomplish this?
I do not see any advantage in using the csv module for reading and parsing the input text file. The number of fields varies: 6 fields in the USERID line, with 2 of them empty, but 5 non-empty fields in the other lines. The fields look very simple, so there is no need for csv's handling of the separator character hidden away in quotes and the like. There is no header line as in a csv file, but rather many headers sprinkled in among the data lines.
A simple routine that reads each line, splits it on the semicolon character, parses the fields, and combines related lines would suffice.
The output file is another matter. The lines have the same format, with the same number of fields. So creating that output may be a good use for csv. However, the format is so simple that the file could also be created without csv.
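For illustration, a minimal sketch of such a routine (the file name and variable names are my own; it assumes, as in the sample, that a T-number's USERID line appears before that T-number's application lines):

output_lines = []
userids = {}  # maps T-number -> numeric user id

with open("input.txt") as f:
    for line in f:
        fields = line.strip().split(";")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        if fields[2] == "USERID":
            # e.g. T12345;;USERID;00001;12345678;
            userids[fields[0]] = fields[4]
        elif fields[1] == "APPLICATION":
            # e.g. T12345;APPLICATION;WRITE;00001;APP1
            userid = userids[fields[0]]
            output_lines.append(";".join([userid, fields[1], fields[2], fields[4]]))

print("\n".join(output_lines))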
I am not so sure you should use the csv module here - the file has mixed data, possibly more than just users and user group rights? For a user declaration you only need to retrieve its group and ID, while for the application rights you need to extract the group, app name and right. The more differing data you have, the more issues you will encounter - with manual parsing of the data you can always just continue when a line meets certain criteria.
So I'd say you are better off with manual, line-by-line parsing of the lines: structure them into something meaningful, then output the data. For instance:
from StringIO import StringIO
from pprint import pprint
feed = """T12345;;USERID;00001;12345678;
T12345;;USERID;00001;2345678;
T12345;;USERID;00002;345678;
T12345;;USERID;00002;45678;
T12345;APPLICATION;WRITE;00001;APP1
T12345;APPLICATION;WRITE;00002;APP2
T12345;APPLICATION;READ-ONLY;00001;APP1
T12345;APPLICATION;WRITE;00002;APP1
T12345;APPLICATION;WRITE;00002;APP2"""
buf = StringIO(feed)
groups = {}
# Read all data into a dict of dicts
for line in buf:
    values = line.strip().split(";")
    if values[3] not in groups:
        groups[values[3]] = {"users": [], "apps": {}}
    if values[2] == "USERID":
        groups[values[3]]['users'].append(values[4])
        continue
    if values[1] == "APPLICATION":
        if values[4] not in groups[values[3]]["apps"]:
            groups[values[3]]["apps"][values[4]] = []
        groups[values[3]]["apps"][values[4]].append(values[2])
print("Structured data with group as root")
pprint(groups)
print("Output data")
for group_id, group in groups.iteritems():
    # Order by user, app
    for user in group["users"]:
        for app_name, rights in group["apps"].iteritems():
            for right in rights:
                print(";".join([user, "APPLICATION", right, app_name]))
I am trying to make a function that will reorder headers along with their fields:
import dbf
table = dbf.Table('somefile.dbf', default_data_types = {'C': dbf.Char})
headerlist = [head0, head1, head2, head3]
def _reorder(x, y):
    z = table._meta.user_fields
    with table:
        z.insert(z.index(x), z.pop(z.index(y)))

while headerlist != table.field_names:
    _reorder(2, 0)
    _reorder(1, 3)  # along with any others needed
but it doesn't seem that _meta.user_fields is the correct way to manipulate that data. I'm wondering if it would be easier to write the data to a new .dbf file in the specific order I need.
I am likely way off with the code here; I'm just beginning to learn Python. Could someone point me in the right direction?
The easiest way to reorder the fields is indeed to just create a new database. I did not make provisions in my dbf module to allow for reordering fields.
If you are trying to present or process the fields in a certain order you can do something like:
field_order = ['head3', 'head0', 'head2', 'head1']
for field in field_order:
    print_or_process(my_table[field])
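If you do go the new-database route, a rough sketch might look like the following. Treat the API details as assumptions rather than gospel: it assumes table.structure() returns one field spec per field (e.g. 'head0 C(25)'), that dbf.Table accepts a semicolon-joined field-spec string for a new table, and that append() accepts a tuple of values in field order - check the dbf module's docs for the exact signatures.

import dbf

old = dbf.Table('somefile.dbf')
new_order = ['head3', 'head0', 'head2', 'head1']  # desired field order

with old:
    # map field names to their specs (spec format assumed, see note above)
    specs = dict(zip(old.field_names, old.structure()))
    new = dbf.Table('reordered.dbf', ';'.join(specs[name] for name in new_order))
    with new:
        for record in old:
            # copy each record's values across in the new field order
            new.append(tuple(record[name] for name in new_order))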
I have a sysadmin-type CLI app that reads in info from a config file. I cannot change the format of the config file below.
TYPE_A = "value1,value2,value2"
TYPE_A = "value3,value4,value5"
TYPE_B = "valuex,valuey,valuez"
Based on the TYPE, I'll need to do some initial processing with each one. After I'm done with that step for all, I need to do some additional processing and depending on the options chosen either print the state and intended action(s) or execute those action(s).
I'd like to do the initial parsing of the config into a dict of lists of dicts and update every instance of TYPE_A, TYPE_B, TYPE_C, etc. with all the pertinent info about it. Then either print the full state or execute the actions (or fail if the state of something was incorrect).
My thought is it would look something like:
dict
    TYPE_A_list
        dict_A[0] key:value, key:value, key:value
        dict_A[1] key:value, key:value, key:value
    TYPE_B_list
        dict_B[0] key:value, key:value, key:value
        dict_B[1] key:value, key:value, key:value
I think I'd want to read the config into that and then add keys and values or update values as the app progresses and reprocesses each TYPE.
Finally, my questions:
I'm not sure how to iterate over each list of dicts, or how to add list elements and add or update key:value pairs.
Is what I describe above the best way to go about this?
I'm fairly new to Python, so I'm open to any advice. FWIW, this will be python 2.6.
A little clarification on the config file lines
CAR_TYPE = "Ford,Mustang,Blue,2005"
CAR_TYPE = "Honda,Accord,Green,2009"
BIKE_TYPE = "Honda,VTX,Black,2006"
BIKE_TYPE = "Harley,Sportster,Red,2010"
TIRE_TYPE = "170R15,whitewall"
Each type will have the same order and number of values.
No need to "remember" there are two different TYPE_A assignments - you can combine them.
TYPE_A = "value1,value2,value2"
TYPE_A = "value3,value4,value5"
would be parsed as only one of them, or as both combined, depending on the implementation of your sysadmin CLI app.
Then the data model should be:
dict
    TYPE_A: list(value1, value2, value2, value3, value4, value5)
    TYPE_B: list(valuex, valuey, valuez)
That way, you can iterate through dict.items() pretty easily:
for _type, values in dict.items():
    for value in values:
        print "%s: %s" % (_type, value)
        # or whatever you wish to do
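A minimal sketch of the parsing itself, under the combine-duplicates model above (the file name and the exact quoting rules are assumptions based on the sample lines):

types = {}

with open("app.conf") as f:
    for line in f:
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blanks and anything that isn't an assignment
        # e.g. CAR_TYPE = "Ford,Mustang,Blue,2005"
        key, _, raw = line.partition("=")
        values = raw.strip().strip('"').split(",")
        types.setdefault(key.strip(), []).extend(values)

# types["CAR_TYPE"] -> ['Ford', 'Mustang', 'Blue', '2005',
#                       'Honda', 'Accord', 'Green', '2009']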
Original Question:
I have a flat file with each row representing text associated with an application. I would like to cluster applications based on the words associated with each application. Is there free code available for text mining a single flat file? Thank you.
Update 1:
There are 30,000 applications. I am trying to figure out what behaviors (of customers) are associated with each cluster. I don't have a predefined set of words to start with. I could inspect a random few and determine some words, but that would not give me an exhaustive list of words. I would like to capture the majority of the behaviors in a systematic way.
I tried converting the text file into an XML file and clustering with the carrot2 workbench, but that didn't work. I haven't used carrot2 before, so I may be doing something wrong there.
My understanding is that you have a file like:
game Solitaire
productivity OpenOffice
game MineSweeper
...
And you want to categorize everything based on their tag word, as in putting applications in buckets based on their associated tag/description/...
I think you can use a dictionary of lists for this purpose, e.g.:
out = {}
with open('input.txt') as f:
    for inline in f:
        # skip lines without a space separating tag and name
        if ' ' not in inline:
            continue
        tag, appname = inline.strip('\n').split(' ', 1)
        if tag not in out:
            out[tag] = []
        out[tag].append(appname)
print(out['game'])
This iterates through the input once and clusters application names based on their tags very efficiently.
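With the three sample lines above, out['game'] would come back as ['Solitaire', 'MineSweeper'].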