Sanitizing a Piece of Text in Python

I am fairly new to Python, but I would like to get started and learn a bit by scripting features I will actually use. I have some text that is retrieved by typing "status" into Team Fortress 2's console. What I would like to achieve is to reduce the text below to only the STEAM_X:X:XXXXXXXX parts, which are the Steam IDs.
# userid name uniqueid connected ping loss state
# 31 "Atonement -Ai-" STEAM_0:1:27464943 00:48 103 0 active
# 10 "?loop?" STEAM_0:0:31072991 40:48 62 0 active
# 11 "爱 -Ai-" STEAM_0:0:41992530 40:46 68 0 active
# 12 "MrKateUpton -Ai-" STEAM_0:1:10894538 40:25 81 0 active
# 13 "Tacet -Ai-" STEAM_0:1:52131782 39:59 83 0 active
# 14 "CottonBonbon-Ai-" STEAM_0:1:47812003 39:39 51 0 active
# 15 "belt -Ai-" STEAM_0:1:4941202 38:43 123 0 active
# 16 "boutros :3" STEAM_0:0:32271324 38:21 65 0 active
# 17 "[tilt] Xikkari" STEAM_0:1:41148798 38:14 92 0 active
# 24 "ElenaWitch" STEAM_0:0:17495028 31:30 73 0 active
# 19 "[tilt] Batcan #boutros" STEAM_0:1:41205650 38:10 63 0 active
# 20 "[?l??]whatupmydiggas" STEAM_0:1:50559125 37:58 112 0 active
# 21 "[tilt] musicman" STEAM_0:1:37758467 37:31 89 0 active
# 22 "Jack Frost" STEAM_0:0:24206189 37:28 90 0 active
# 28 "[tilt-sub]deaf ears #best safet" STEAM_0:1:29612138 19:05 94 0 active
# 25 "? notez ?ai" STEAM_0:1:29663879 31:23 113 0 active
# 27 "-Ai- Lord English" STEAM_0:1:44114633 24:08 116 0 active
# 29 "1.prototypes" STEAM_0:0:42256202 17:41 83 0 active
# 30 "SourceTV // name for SourceTV" BOT active
# 32 "PUT ME IN COACH" STEAM_0:1:48004781 00:36 173 0 spawning
Are there any built-in functions in Python that implement the following algorithm?
For everything that is not (!) STEAM_X:X:XXXXXXXX, delete/remove it.
I have done a fair amount of Googling, but nothing really gets specific. If someone could get me started with a built-in Python function, I would be very grateful to start coding.
P.S. The output would be like this
STEAM_0:1:27464943
STEAM_0:0:31072991
STEAM_0:1:10894538
etc
etc

Sounds like an easy case for a regex. Assuming they're always digits like that:
>>> import re
>>> with open('/tmp/spam.txt') as f:
...     for steam64id in re.findall(r'STEAM_\d:\d:\d+', f.read()):
...         print steam64id
...
STEAM_0:1:27464943
STEAM_0:0:31072991
STEAM_0:0:41992530
STEAM_0:1:10894538
STEAM_0:1:52131782
STEAM_0:1:47812003
STEAM_0:1:4941202
STEAM_0:0:32271324
STEAM_0:1:41148798
STEAM_0:0:17495028
STEAM_0:1:41205650
STEAM_0:1:50559125
STEAM_0:1:37758467
STEAM_0:0:24206189
STEAM_0:1:29612138
STEAM_0:1:29663879
STEAM_0:1:44114633
STEAM_0:0:42256202
STEAM_0:1:48004781
The usual recipe for removing lines is not to delete them from the original file, but print out the lines you want to keep to a new file (and then, optionally, copy it overwriting the original file, if the processing was successful).
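For completeness, a minimal sketch of that recipe, assuming the console dump is saved to a hypothetical status.txt and the matches should end up in a new file:
import re

# Hypothetical filenames; adjust to wherever the console dump actually lives.
with open('status.txt') as src, open('steam_ids.txt', 'w') as dst:
    for steam_id in re.findall(r'STEAM_\d:\d:\d+', src.read()):
        dst.write(steam_id + '\n')
If that succeeds, you can then replace the original file with the new one (for example with shutil.move), rather than editing the original in place.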

Related

How to obtain the first 4 rows for every 20 rows from a CSV file

I've read the CSV file using pandas and have managed to print the 1st, 2nd, 3rd and 4th rows for every 20 rows using .iloc.
Prem_results = pd.read_csv("../data sets analysis/prem/result.csv")
Prem_results.iloc[:320:20,:]
Prem_results.iloc[1:320:20,:]
Prem_results.iloc[2:320:20,:]
Prem_results.iloc[3:320:20,:]
Is there a way using iloc to print the first 4 rows of every 20 rows together, rather than separately like I do now? Apologies if this is worded badly; I'm fairly new to both Python and pandas.
Using groupby.head (with numpy imported as np):
Prem_results.groupby(np.arange(len(Prem_results)) // 20).head(4)
You can concat slices together like this:
pd.concat([df[i::20] for i in range(4)]).sort_index()
MCVE:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.arange(1000)})
pd.concat([df[i::20] for i in range(4)]).sort_index().head(20)
Output:
col1
0 0
1 1
2 2
3 3
20 20
21 21
22 22
23 23
40 40
41 41
42 42
43 43
60 60
61 61
62 62
63 63
80 80
81 81
82 82
83 83
Start at 0 and take every 20th row,
start at 1 and take every 20th row,
start at 2 and take every 20th row,
and start at 3 and take every 20th row.
You can also do this while reading the CSV itself.
df = pd.DataFrame()
for chunk in pd.read_csv(file_name, chunksize=20):
    df = pd.concat((df, chunk.head(4)))
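Concatenating inside the loop copies the growing frame on every iteration; a lighter variant (just a sketch, reusing the same hypothetical file_name) collects the chunk heads first and concatenates once:
import pandas as pd

file_name = 'result.csv'  # hypothetical path, as in the snippet above

pieces = []
for chunk in pd.read_csv(file_name, chunksize=20):
    pieces.append(chunk.head(4))  # keep the first 4 rows of every 20-row chunk
df = pd.concat(pieces)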
More resources:
You can read more about the usage of chunksize in Pandas official documentation here.
I also have a post about its usage here.

Facebook Prophet: Providing different data sets to build a better model

My data frame looks like this. My goal is to predict event_id 3 based on the data of event_id 1 and event_id 2.
ds tickets_sold y event_id
3/12/19 90 90 1
3/13/19 40 130 1
3/14/19 13 143 1
3/15/19 8 151 1
3/16/19 13 164 1
3/17/19 14 178 1
3/20/19 10 188 1
3/20/19 15 203 1
3/20/19 13 216 1
3/21/19 6 222 1
3/22/19 11 233 1
3/23/19 12 245 1
3/12/19 30 30 2
3/13/19 23 53 2
3/14/19 43 96 2
3/15/19 24 120 2
3/16/19 3 123 2
3/17/19 5 128 2
3/20/19 3 131 2
3/20/19 25 156 2
3/20/19 64 220 2
3/21/19 6 226 2
3/22/19 4 230 2
3/23/19 63 293 2
I want to predict sales for the next 10 days of that data:
ds tickets_sold y event_id
3/24/19 20 20 3
3/25/19 30 50 3
3/26/19 20 70 3
3/27/19 12 82 3
3/28/19 12 94 3
3/29/19 12 106 3
3/30/19 12 118 3
So far my model is the one below. However, I am not telling the model that these are two separate events, even though it would be useful to treat the data from the different events as related: they belong to the same organizer and therefore provide more information than a single event would. Is that kind of fitting possible with Prophet?
# Load data
df = pd.read_csv('event_data_prophet.csv')
df.drop(columns=['tickets_sold'], inplace=True, axis=0)
df.head()
# The important things to note are that cap must be specified for every row in the dataframe,
# and that it does not have to be constant. If the market size is growing, then cap can be an increasing sequence.
df['cap'] = 500
# growth: String 'linear' or 'logistic' to specify a linear or logistic trend.
m = Prophet(growth='linear')
m.fit(df)
# periods is the amount of days that I look in the future
future = m.make_future_dataframe(periods=20)
future['cap'] = 500
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
fig1 = m.plot(forecast)
Start dates of events seem to cause peaks. You can use holidays for this by setting the starting date of each event as a holiday, which informs Prophet about the events (and their peaks). I noticed events 1 and 2 overlap, and I think you have a few options to deal with that. You need to ask yourself what the predictive value of each event is relative to event 3. You don't have much data, and that will be the main issue. If the events have equal value, you could shift the dates of one event, for example 11 days earlier. If they have unequal value, you could drop one event.
events = pd.DataFrame({
    'holiday': 'events',
    'ds': pd.to_datetime(['2019-03-24', '2019-03-12', '2019-03-01']),
    'lower_window': 0,
    'upper_window': 1,
})
m = Prophet(growth='linear', holidays=events)
m.fit(df)
Also, I noticed you forecast on the cumulative sum. I think your events are stationary, so Prophet probably benefits from forecasting on the daily ticket sales rather than on the cumulative sum.
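As a rough sketch of that last point (this assumes the same CSV layout as the question; the only real change is using tickets_sold as y, and cap is omitted since it is only needed for logistic growth):
import pandas as pd
from fbprophet import Prophet  # the package is named `prophet` in newer releases

df = pd.read_csv('event_data_prophet.csv')
# Fit on the daily ticket sales instead of the cumulative column `y`.
daily = df[['ds', 'tickets_sold']].rename(columns={'tickets_sold': 'y'})

m = Prophet(growth='linear', holidays=events)  # `events` as defined above
m.fit(daily)

future = m.make_future_dataframe(periods=10)
forecast = m.predict(future)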

How to print DataFrame on single line

With:
import pandas as pd
df = pd.read_csv('pima-data.csv')
print df.head(2)
the print is automatically formatted across multiple lines:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred \
0 6 148 72 35 0 33.6 0.627
1 1 85 66 29 0 26.6 0.351
age skin diabetes
0 50 1.3790 True
1 31 1.1426 False
I wonder if there is a way to avoid the multi-line formatting. I would rather have it printed in a single line like so:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age skin diabetes
0 6 148 72 35 0 33.6 0.627 50 1.3790 True
1 1 85 66 29 0 26.6 0.351 31 1.1426 False
You need to set:
pd.set_option('expand_frame_repr', False)
The option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block:
# temporarily set expand_frame_repr
with pd.option_context('expand_frame_repr', False):
    print(df)
Pandas documentation.
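As an aside, how much fits on one line also depends on the console width pandas assumes; a small sketch combining the two options (the width value is just an example):
import pandas as pd

df = pd.read_csv('pima-data.csv')

pd.set_option('expand_frame_repr', False)  # never split the frame across blocks
pd.set_option('display.width', 200)        # assumed console width (example value)
print(df.head(2))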

Loading Large File in Python

I'm using Python 2.6.2 [GCC 4.3.3] running on Ubuntu 9.04. I need to read a big datafile (~1GB, >3 million lines), line by line, using a Python script.
I tried the methods below, and I find they use a very large amount of memory (~3GB):
for line in open('datafile','r').readlines():
    process(line)
or,
for line in file(datafile):
    process(line)
Is there a better way to load a large file line by line, say
a) by explicitly mentioning the maximum number of lines the file could load at any one time in memory ? Or
b) by loading it by chunks of size, say, 1024 bytes, provided the last line of the said chunk loads completely without being truncated?
Several suggestions point to the methods I mentioned above and have already tried; I'm trying to see if there is a better way to handle this. My search has not been fruitful so far. I appreciate your help.
p/s I have done some memory profiling using Heapy and found no memory leaks in the Python code I am using.
Update 20 August 2012, 16:41 (GMT+1)
Tried both approaches suggested by J.F. Sebastian, mgilson and IamChuckB (datafile is a variable):
with open(datafile) as f:
    for line in f:
        process(line)
Also,
import fileinput
for line in fileinput.input([datafile]):
    process(line)
Strangely, both of them use ~3GB of memory; my datafile in this test is 765.2MB and consists of 21,181,079 lines. I see the memory grow over time (in roughly 40-80MB steps) before stabilizing at 3GB.
An elementary doubt: is it necessary to flush the line after usage?
I did memory profiling using Heapy to understand this better.
Level 1 Profiling
Partition of a set of 36043 objects. Total size = 5307704 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15934 44 1301016 25 1301016 25 str
1 50 0 628400 12 1929416 36 dict of __main__.NodeStatistics
2 7584 21 620936 12 2550352 48 tuple
3 781 2 590776 11 3141128 59 dict (no owner)
4 90 0 278640 5 3419768 64 dict of module
5 2132 6 255840 5 3675608 69 types.CodeType
6 2059 6 247080 5 3922688 74 function
7 1716 5 245408 5 4168096 79 list
8 244 1 218512 4 4386608 83 type
9 224 1 213632 4 4600240 87 dict of type
<104 more rows. Type e.g. '_.more' to view.>
============================================================
Level 2 Profiling for Level 1-Index 0
Partition of a set of 15934 objects. Total size = 1301016 bytes.
Index Count % Size % Cumulative % Referred Via:
0 2132 13 274232 21 274232 21 '.co_code'
1 2132 13 189832 15 464064 36 '.co_filename'
2 2024 13 114120 9 578184 44 '.co_lnotab'
3 247 2 110672 9 688856 53 "['__doc__']"
4 347 2 92456 7 781312 60 '.func_doc', '[0]'
5 448 3 27152 2 808464 62 '[1]'
6 260 2 15040 1 823504 63 '[2]'
7 201 1 11696 1 835200 64 '[3]'
8 188 1 11080 1 846280 65 '[0]'
9 157 1 8904 1 855184 66 '[4]'
<4717 more rows. Type e.g. '_.more' to view.>
Level 2 Profiling for Level 1-Index 1
Partition of a set of 50 objects. Total size = 628400 bytes.
Index Count % Size % Cumulative % Referred Via:
0 50 100 628400 100 628400 100 '.__dict__'
Level 2 Profiling for Level 1-Index 2
Partition of a set of 7584 objects. Total size = 620936 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1995 26 188160 30 188160 30 '.co_names'
1 2096 28 171072 28 359232 58 '.co_varnames'
2 2078 27 157608 25 516840 83 '.co_consts'
3 261 3 21616 3 538456 87 '.__mro__'
4 331 4 21488 3 559944 90 '.__bases__'
5 296 4 20216 3 580160 93 '.func_defaults'
6 55 1 3952 1 584112 94 '.co_freevars'
7 47 1 3456 1 587568 95 '.co_cellvars'
8 35 0 2560 0 590128 95 '[0]'
9 27 0 1952 0 592080 95 '.keys()[0]'
<189 more rows. Type e.g. '_.more' to view.>
Level 2 Profiling for Level 1-Index 3
Partition of a set of 781 objects. Total size = 590776 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1 0 98584 17 98584 17 "['locale_alias']"
1 29 4 35768 6 134352 23 '[180]'
2 28 4 34720 6 169072 29 '[90]'
3 30 4 34512 6 203584 34 '[270]'
4 27 3 33672 6 237256 40 '[0]'
5 25 3 26968 5 264224 45 "['data']"
6 1 0 24856 4 289080 49 "['windows_locale']"
7 64 8 20224 3 309304 52 "['inters']"
8 64 8 17920 3 327224 55 "['galog']"
9 64 8 17920 3 345144 58 "['salog']"
<84 more rows. Type e.g. '_.more' to view.>
============================================================
Level 3 Profiling for Level 2-Index 0, Level 1-Index 0
Partition of a set of 2132 objects. Total size = 274232 bytes.
Index Count % Size % Cumulative % Referred Via:
0 2132 100 274232 100 274232 100 '.co_code'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 1
Partition of a set of 50 objects. Total size = 628400 bytes.
Index Count % Size % Cumulative % Referred Via:
0 50 100 628400 100 628400 100 '.__dict__'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 2
Partition of a set of 1995 objects. Total size = 188160 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1995 100 188160 100 188160 100 '.co_names'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 3
Partition of a set of 1 object. Total size = 98584 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1 100 98584 100 98584 100 "['locale_alias']"
Still troubleshooting this.
Do share with me if you have faced this before.
Thanks for your help.
Update 21 August 2012, 01:55 (GMT+1)
mgilson, the process function is used to post-process a Network Simulator 2 (NS2) trace file. Some of the lines in the trace file are shared below. I am using numerous objects, counters, tuples, and dictionaries in the Python script to learn how a wireless network performs.
s 1.231932886 _25_ AGT --- 0 exp 10 [0 0 0 0 Y Y] ------- [25:0 0:0 32 0 0]
s 1.232087886 _25_ MAC --- 0 ARP 86 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776108 _42_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776625 _34_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776633 _9_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776658 _0_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232856942 _35_ MAC --- 0 ARP 28 [0 ffffffff 64 806 Y Y] ------- [REQUEST 100/25 0/0]
s 1.232871658 _0_ MAC --- 0 ARP 86 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
r 1.233096712 _29_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097047 _4_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097050 _26_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097051 _1_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233109522 _25_ MAC --- 0 ARP 28 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
s 1.233119522 _25_ MAC --- 0 ACK 38 [0 1 67 0 Y Y]
r 1.233236204 _17_ MAC --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
r 1.233236463 _20_ MAC --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
D 1.233236694 _18_ MAC COL 0 ARP 86 [0 ffffffff 65 806 67 1] ------- [REQUEST 101/25 0/0]
The aim of doing 3-level profiling with Heapy is to help me narrow down which object(s) are eating up most of the memory. As you can see, unfortunately I could not tell which one specifically needs tweaking, as the output is too generic. For example, I know that "dict of __main__.NodeStatistics" has only 50 objects out of 36,043 (0.1%), yet it takes up 12% of the total memory used to run the script; still, I am unable to find which specific dictionary I would need to look into.
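For anyone reproducing this, a rough sketch of how such a Heapy snapshot is typically taken, assuming the guppy package that provides Heapy:
from guppy import hpy

hp = hpy()
hp.setrelheap()       # only count objects allocated after this call
# ... run the file-processing loop here ...
heap = hp.heap()      # snapshot partitioned by kind (the "Level 1" view above)
print(heap)
print(heap[0].byvia)  # drill into the largest kind, grouped by how it is referred to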
I tried implementing David Eyk's suggestion as below (snippet), manually garbage collecting every 500,000 lines:
import gc
for i, line in enumerate(file(datafile)):
    if i % 500000 == 0:
        print '-----------This is line number', i
        collected = gc.collect()
        print "Garbage collector: collected %d objects." % (collected)
Unfortunately, the memory usage is still at 3GB and the output (snippet) is as below,
-----------This is line number 0
Garbage collector: collected 0 objects.
-----------This is line number 500000
Garbage collector: collected 0 objects.
Having implemented martineau's suggestion, I see the memory usage is now 22MB, down from the earlier 3GB! Something I had been looking forward to achieving. The strange thing is the following:
I did the same memory profiling as before,
Level 1 Profiling
Partition of a set of 35474 objects. Total size = 5273376 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15889 45 1283640 24 1283640 24 str
1 50 0 628400 12 1912040 36 dict of __main__.NodeStatistics
2 7559 21 617496 12 2529536 48 tuple
3 781 2 589240 11 3118776 59 dict (no owner)
4 90 0 278640 5 3397416 64 dict of module
5 2132 6 255840 5 3653256 69 types.CodeType
6 2059 6 247080 5 3900336 74 function
7 1716 5 245408 5 4145744 79 list
8 244 1 218512 4 4364256 83 type
9 224 1 213632 4 4577888 87 dict of type
<104 more rows. Type e.g. '_.more' to view.>
Comparing the previous memory profiling output with the above: str has dropped by 45 objects (17,376 bytes), tuple by 25 objects (3,440 bytes), and dict (no owner), though unchanged in object count, has shed 1,536 bytes. All other objects are the same, including dict of __main__.NodeStatistics. The total number of objects is 35,474. The small reduction in objects (0.2%) produced 99.3% of the memory saving (22MB, down from 3GB). Very strange.
So although I know where the memory starvation is occurring, I am not yet able to narrow down which object is causing the bleed.
Will continue to investigate this.
Thanks for all the pointers; I'm using this opportunity to learn a lot about Python, as I'm not an expert. I appreciate the time you've taken to assist me.
Update 23 August 2012, 00:01 (GMT+1) -- SOLVED
I continued debugging using the minimalistic code per martineau's suggestion. I began to add code to the process function and observe the memory bleed.
I found that the memory starts to bleed when I add a class like the one below:
class PacketStatistics(object):
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...
I am using 3 classes with 136 counters.
I discussed this issue with my friend Gustavo Carneiro; he suggested using __slots__ to replace __dict__.
I converted the class as below:
class PacketStatistics(object):
    __slots__ = ('event_id', 'event_source', 'event_dest', ...)

    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...
When I converted all 3 classes, the memory usage dropped from 3GB to 504MB: a saving of more than 80%!!
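The saving comes from slotted instances no longer carrying a per-instance __dict__; a tiny illustrative sketch (hypothetical class names, not from the script above):
import sys

class WithDict(object):
    def __init__(self):
        self.a = 0

class WithSlots(object):
    __slots__ = ('a',)
    def __init__(self):
        self.a = 0

d, s = WithDict(), WithSlots()
print(sys.getsizeof(d.__dict__))   # every WithDict instance pays for its own dict
print(hasattr(s, '__dict__'))      # False: WithSlots instances have no __dict__ at all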
Below is the memory profiling after the __dict__ to __slots__ conversion.
Partition of a set of 36157 objects. Total size = 4758960 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15966 44 1304424 27 1304424 27 str
1 7592 21 624776 13 1929200 41 tuple
2 780 2 587424 12 2516624 53 dict (no owner)
3 90 0 278640 6 2795264 59 dict of module
4 2132 6 255840 5 3051104 64 types.CodeType
5 2059 6 247080 5 3298184 69 function
6 1715 5 245336 5 3543520 74 list
7 225 1 232344 5 3775864 79 dict of type
8 244 1 223952 5 3999816 84 type
9 166 0 190096 4 4189912 88 dict of class
<101 more rows. Type e.g. '_.more' to view.>
The dict of __main__.NodeStatistics is not in the top 10 anymore.
I am happy with the result and glad to close this issue.
Thanks for all your guidance. Truly appreciate it.
rgds
Saravanan K
with open('datafile') as f:
    for line in f:
        process(line)
This works because files are iterators yielding 1 line at a time until there are no more lines to yield.
The fileinput module will let you read it line by line without loading the entire file into memory. pydocs
import fileinput
for line in fileinput.input(['myfile']):
    do_something(line)
Code example taken from yak.net
mgilson's answer is correct. The simple solution bears official mention though (HerrKaputt mentioned this in a comment):
file = open('datafile')
for line in file:
    process(line)
file.close()
This is simple, pythonic, and understandable. If you don't understand how with works, just use this.
As the other poster mentioned, this does not create a large list the way file.readlines() does. Rather, it pulls one line at a time, in the way that is traditional for Unix files/pipes.
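On the question's option (a), capping how many lines sit in memory at once, a sketch using itertools.islice to pull the file in fixed-size batches (the batch size is an arbitrary example, and process is the question's own function):
from itertools import islice

def batched_lines(path, batch_size=100000):
    # Yield lists of at most batch_size lines; only one batch is in memory at a time.
    with open(path) as f:
        while True:
            batch = list(islice(f, batch_size))
            if not batch:
                break
            yield batch

for batch in batched_lines('datafile'):
    for line in batch:
        process(line)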
If the file is JSON, XML, CSV, genomics or any other well-known format, there are specialized readers which use C code directly and are far more optimized for both speed and memory than parsing in native Python - avoid parsing it natively whenever possible.
But in general, tips from my experience:
Python's multiprocessing package is fantastic for managing subprocesses, all memory leaks go away when the subprocess ends.
run the reader subprocess as a multiprocessing.Process and use a multiprocessing.Pipe(duplex=True) to communicate (send the filename and any other args, then read its stdout)
read in small (but not tiny) chunks, say 64KB-1MB. This is better for memory usage, and also for responsiveness with respect to other running processes/subprocesses.
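On that last point, and on the question's option (b), here is a sketch of reading fixed-size byte chunks while still handing only complete lines to the caller (the chunk size is an example value; lines are yielded without their trailing newline):
def lines_from_chunks(path, chunk_size=64 * 1024):
    # Read chunk_size bytes at a time, yielding only complete lines.
    with open(path) as f:
        leftover = ''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                if leftover:
                    yield leftover        # final line without a trailing newline
                return
            chunk = leftover + chunk
            lines = chunk.split('\n')
            leftover = lines.pop()        # possibly incomplete last line, kept for next round
            for line in lines:
                yield line

for line in lines_from_chunks('datafile'):
    process(line)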

Seeking a better design suggestion for a trial-and-error mechanism in Python?

See the data matrix below, obtained from sensors; just integer numbers, nothing special.
A B C D E F G H I J K
1 25 0 25 66 41 47 40 12 69 76 1
2 17 23 73 97 99 39 84 26 0 44 45
3 34 15 55 4 77 2 96 92 22 18 71
4 85 4 71 99 66 42 28 41 27 39 75
5 65 27 28 95 82 56 23 44 97 42 38
…
10 95 13 4 10 50 78 4 52 51 86 20
11 71 12 32 9 2 41 41 23 31 70
12 54 31 68 78 55 19 56 99 67 34 94
13 47 68 79 66 10 23 67 42 16 11 96
14 25 12 88 45 71 87 53 21 96 34 41
The horizontal A to K are the sensor names, and the vertical axis is the data produced by the sensors over time.
Now I want to analyze those data with trial-and-error methods. I defined some concepts to explain what I want:
o source
source is all the raw data I get
o entry
an entry is the set of all A to K sensor values; take the first row for example: the entry is
25 0 25 66 41 47 40 12 69 76 1
o rules
a rule is a "suppose" function with assert value return, so far just "true" or "false".
For example, I suppose the sensor A, E and F value will never be same in one enrty, if one entry with A=E=F, it will tigger violation and this rule function will return false.
o range:
a range is a function for selecting entries vertically, for example the first 5 entries
Then, the basic idea is:
o source + range = subsource(s)
o subsource + rules = validation(s)
Finally, I want to get a list that may look like this:
rangeID ruleID violation
1 1 Y
2 1 N
3 1 Y
1 2 N
2 2 N
3 2 Y
1 3 N
2 3 Y
3 3 Y
But the problem is that the rules and ranges I have defined here will get very complicated soon if you look deeper; they have too many possible combinations. Take "A=E=F" for example: one could also define "B=E=F", "C=E=F", "C>F" ......
So soon I will need a rule/range generator that can accept those "core parameters" such as "A=E=F" as input, perhaps even as a regex string later. That is so complicated it just defeated me, let alone that I may also need to persist unique rule IDs, handle data storage, and deal with rules nesting and combining with each other ......
So my questions are:
Does anyone know of a module or piece of software that fits this kind of trial-and-error calculation or the rule definitions I want?
Can anyone share a better design for the rules/ranges I described?
Thanks for any hints.
Rgs,
KC
If I understand what you're asking correctly, I probably wouldn't even venture down the NumPy path, as given your description I don't think it's really required. Here's a sample implementation of how I might go about solving the specific issue that you presented:
l = [
    {'a':25, 'b':0, 'c':25, 'd':66, 'e':41, 'f':47, 'g':40, 'h':12, 'i':69, 'j':76, 'k':1},
    {'a':25, 'b':0, 'c':25, 'd':66, 'e':41, 'f':47, 'g':40, 'h':12, 'i':69, 'j':76, 'k':1}
]
r = ['a=g=i', 'a=b', 'a=c']
res = []
# test all given rules
for n in range(0, len(r)):
    # i'm assuming equality here - you'd have to change this to accept other operators if needed
    c = r[n].split('=')
    vals = []
    # build up a list of values given our current rule
    for e in c:
        vals.append(l[0][e])
    # using len(set(vals)) gives us the number of distinct values
    res.append({'rangeID': 0, 'ruleID': n, 'violation': 'Y' if len(set(vals)) == 1 else 'N'})
print res
Output:
[{'violation': 'N', 'ruleID': 0, 'rangeID': 0}, {'violation': 'N', 'ruleID': 1, 'rangeID': 0}, {'violation': 'Y', 'ruleID': 2, 'rangeID': 0}]
http://ideone.com/zbTZr
There are a few assumptions made here (such as equality being the only operator in use in your rules) and some functionality left out (such as parsing your input into the list of dicts I used), but I'm hopeful that you can figure that out on your own.
Of course, there could be a NumPy-based solution that's simpler than this that I'm just not thinking of at the moment (it's late and I'm going to bed now ;)), but hopefully this helps you out anyway.
Edit:
Whoops, missed something else (forgot to add it before posting): I only test the first element in l (the given range). You'd just want to wrap that in another for loop rather than using the hard-coded 0 index, as sketched below.
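A sketch of that extra loop, keeping the same l and r from above and treating each entry as its own range purely for illustration:
res = []
for range_id, entry in enumerate(l):      # outer loop: one "range" per entry here
    for rule_id, rule in enumerate(r):    # inner loop: test every rule against it
        cols = rule.split('=')            # still assuming '=' is the only operator
        vals = [entry[c] for c in cols]
        res.append({'rangeID': range_id,
                    'ruleID': rule_id,
                    'violation': 'Y' if len(set(vals)) == 1 else 'N'})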
You want to look at NumPy's matrix type for data structures like matrices; it exposes a set of functions for matrix manipulation.
As for the rule/range generator, I am afraid you will have to build your own domain-specific language to achieve that.
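Before going all the way to a DSL, a lighter-weight sketch is to represent each rule as a plain predicate keyed by an ID and each range as a row slice, which at least keeps the combinations enumerable (the rule and range definitions below are hypothetical examples):
# source is assumed to be a list of dicts keyed by sensor name, e.g. {'A': 25, ..., 'K': 1}
rules = {
    1: lambda e: not (e['A'] == e['E'] == e['F']),   # "A, E and F are never all equal"
    2: lambda e: e['C'] > e['F'],
    3: lambda e: e['B'] != e['E'],
}
ranges = {
    1: (0, 5),     # first 5 entries
    2: (5, 10),
    3: (10, 14),
}

def check(source, ranges, rules):
    # Returns rows of (rangeID, ruleID, violation), like the table in the question.
    results = []
    for range_id, (start, stop) in ranges.items():
        subsource = source[start:stop]
        for rule_id, rule in rules.items():
            violated = any(not rule(entry) for entry in subsource)
            results.append((range_id, rule_id, 'Y' if violated else 'N'))
    return results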
