Hi, I'm working on converting Perl to Python for something to do.
I've been looking at some Perl code on hash tables, and I've come across a line of code where I really don't know how to do in Python what it does. I know that it shifts the bit string of page right by 1.
%page_table = ();       # page table is a hash of hashes
%page_table_entry = (   # page table entry structure
    "dirty",      0,    # 0/1 boolean
    "referenced", 0,    # 0/1 boolean
    "valid",      0,    # 0/1 boolean
    "frame_no",  -1,    # -1 indicates an "x", i.e. the page isn't in RAM
    "page",       0,    # used for aging algorithm. 8 bit string.
);
@ram = ((-1) x $num_frames);
Could someone please give me an idea of how this would be represented in Python? I've got the definitions of the hash tables done; they're just there as a reference for what I'm doing. Thanks for any help you can give me.
for ($i = 0; $i < @ram; $i++) {
    $page_table{$ram[$i]}->{page} = $page_table{$ram[$i]}->{page} >> 1;
}
The only confusing thing is that the page table is a hash of hashes. $page_table{$v} contains a hashref to a hash with a key 'page' whose value is an integer. The loop bit-shifts that integer, but it is not very clear Perl code. Simpler would be:
foreach my $v (@ram) {
    $page_table{$v}->{page} >>= 1;
}
Now the translation to python should be obvious:
for v in ram:
    page_table[v]['page'] >>= 1
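To see it run end to end, here is a minimal, self-contained sketch of the same structure in Python (num_frames = 4 is just an assumed placeholder; the entry fields mirror the Perl hash above):

num_frames = 4  # assumed placeholder; the real value comes from elsewhere

def new_page_table_entry():
    # mirrors %page_table_entry: frame_no of -1 means the page isn't in RAM
    return {"dirty": 0, "referenced": 0, "valid": 0, "frame_no": -1, "page": 0}

ram = [-1] * num_frames                         # @ram = ((-1) x $num_frames)
page_table = {v: new_page_table_entry() for v in ram}

for v in ram:                                   # aging sweep over the frames
    page_table[v]['page'] >>= 1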
Here is what my Pythonizer generates for that code:
#!/usr/bin/env python3
# Generated by "pythonizer -mV q8114826.pl" v0.974 run by snoopyjc on Thu Apr 21 23:35:38 2022
import perllib, builtins
_str = lambda s: "" if s is None else str(s)
perllib.init_package("main")
num_frames = 0
builtins.__PACKAGE__ = "main"
page_table = {} # page table is a hash of hashes
page_table_entry = {"dirty": 0, "referenced": 0, "valid": 0, "frame_no": -1, "page": 0}
# page table entry structure
# 0/1 boolean
# 0/1 boolean
# 0/1 boolean
# -1 indicates an "x", i.e. the page isn't in ram
# used for aging algorithm. 8 bit string.
ram = [(-1) for _ in range(num_frames)]
for i in range(0, len(ram)):
    page_table[_str(ram[i])]["page"] = perllib.num(page_table.get(_str(ram[i])).get("page")) >> 1
Woof! No wonder you want to try Python!
Yes, Python can do this, because Python dictionaries (what you'd call hashes in Perl) can contain lists or other dictionaries without needing explicit references to them.
However, I highly suggest that you look into moving to object-oriented programming. After looking at that assignment statement of yours, I had to lie down for a bit. I can't imagine trying to write and maintain an entire program like that.
Whenever you have a hash that contains an array, an array of arrays, or a hash of hashes, you should be looking into object-oriented code. Object-oriented code can prevent the sorts of errors that happen when you do that type of stuff, and it can make your code much more readable -- even Perl code.
Take a look at the Python Tutorial and take a look at the Perl Object Oriented Tutorial and learn a bit about object oriented programming.
This is especially true in Python which was written from the ground up to be object oriented.
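To make that concrete, here is a minimal sketch of what the entry could look like as a Python class (the field names come from the hash above; the age method is my own invention):

class PageTableEntry:
    """One page-table entry; frame_no == -1 means the page isn't in RAM."""

    def __init__(self):
        self.dirty = 0        # 0/1 boolean
        self.referenced = 0   # 0/1 boolean
        self.valid = 0        # 0/1 boolean
        self.frame_no = -1
        self.page = 0         # 8-bit aging counter

    def age(self):
        """One tick of the aging algorithm: shift the counter right by one."""
        self.page >>= 1

The aging loop then reads as for v in ram: page_table[v].age(), which says what it is doing without any bit-twiddling at the call site.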
I found myself writing some tricky algorithmic code, and I tried to comment it as well as I could since I really do not know who is going to maintain this part of the code.
Following this idea, I've written quite a lot of block and inline comments, while trying not to over-comment. But still, when I go back to code I wrote a week ago, I find it difficult to read because of the swarming presence of comments, especially the inline ones.
I thought that indenting them (to ~120 chars) could ease readability, but that would obviously make the lines way too long according to style standards.
Here's an example of the original code:
fooDataTableOccurrence = nestedData.find("table=\"public\".")
if 0 > fooDataTableOccurrence:  # selects only tables without tag value "public-"
    otherDataTableOccurrence = nestedData.find("table=")
    dbNamePos = nestedData.find("dbname=") + 7  # 7 is the length of "dbname="
    if -1 < otherDataTableOccurrence:  # selects only tables with tag value "table="
        # database resource case
        resourceName = self.findDB(nestedData, dbNamePos, otherDataTableOccurrence, destinationPartitionName)
        if resourceName:  # if the resource is in a wrong path
            if resourceName in ["foo", "bar", "thing", "stuff"]:
                return True, False, False  # respectively isProjectAlreadyExported, isThereUnexpectedData and wrongPathResources
            wrongPathResources.append("Database table: " + resourceName)
And here's how indenting inline comments would look like:
fooDataTableOccurrence = nestedData.find("table=\"public\".")
if 0 > fooDataTableOccurrence:                                                    # selects only tables without tag value "public-"
    otherDataTableOccurrence = nestedData.find("table=")
    dbNamePos = nestedData.find("dbname=") + 7                                    # 7 is the length of "dbname="
    if -1 < otherDataTableOccurrence:                                             # selects only tables with tag value "table="
        # database resource case
        resourceName = self.findDB(nestedData, dbNamePos, otherDataTableOccurrence, destinationPartitionName)
        if resourceName:                                                          # if the resource is in a wrong path
            if resourceName in ["foo", "bar", "thing", "stuff"]:
                return True, False, False                                         # respectively isProjectAlreadyExported, isThereUnexpectedData and wrongPathResources
            wrongPathResources.append("Database table: " + resourceName)
The code is in Python (my company's legacy code does not strictly follow the PEP8 standard, so we had to stick with that), but my point is not about the cleanness of the code itself, but about the comments. I am looking for a trade-off between readability and easy understanding of the code, and sometimes I find it difficult to achieve both at the same time.
Which of the examples is better? If none, what would be?
Maybe this is an XY problem?
Could the comments be eliminated altogether?
Here is a (quick & dirty) attempt at refactoring the code posted:
dataTableOccurrence_has_tag_public = nestedData.find("table=\"public\".") >= 0
if not dataTableOccurrence_has_tag_public:
    otherDataTableOccurrence = nestedData.find("table=")
    dataTableOccurrence_has_tag_table = otherDataTableOccurrence >= 0
    prefix = "dbname="
    dbNamePos = nestedData.find(prefix) + len(prefix)
    if dataTableOccurrence_has_tag_table:
        # database resource case
        resourceName = self.findDB(nestedData,
                                   dbNamePos,
                                   otherDataTableOccurrence,
                                   destinationPartitionName)
        resource_name_in_wrong_path = len(resourceName) > 0
        if resource_name_in_wrong_path:
            if resourceName in ["foo", "bar", "thing", "stuff"]:
                project_already_exported = True
                unexpected_data = False
                return (project_already_exported,
                        unexpected_data,
                        resource_name_in_wrong_path)
            wrongPathResources.append("Database table: " + resourceName)
Further work could involve extracting functions out of the block of code.
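As a sketch of that direction (the helper names are invented; nestedData and the tag strings are from the block above), the repeated find-and-compare calls could become small named predicates:

def has_tag(data, tag):
    # True if the raw string contains the given tag anywhere
    return data.find(tag) >= 0

def position_after(data, prefix):
    # index just past the first occurrence of prefix
    return data.find(prefix) + len(prefix)

# usage with the strings from the block above:
# if not has_tag(nestedData, 'table="public".'):
#     dbNamePos = position_after(nestedData, "dbname=")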
I wrote a simple procedure to calculate the average of the test coverage of some specific packages in a Java project. The raw data in a huge html file is like this:
<body>
package pkg1 <line_coverage>11/111,<branch_coverage>44/444<end>
package pkg2 <line_coverage>22/222,<branch_coverage>55/555<end>
package pkg3 <line_coverage>33/333,<branch_coverage>66/666<end>
...
</body>
Given the specified packages "pkg1" and "pkg3", for example, the average line coverage is:
(11+33)/(111+333)
and average branch coverage is:
(44+66)/(444+666)
I wrote the following procedure to get the result, and it works well. But how can I implement this calculation in a functional style? Something like "(x,y) for x in ... for b in ... if ...". I know a little Erlang, Haskell and Clojure, so solutions in those languages are also appreciated. Thanks a lot!
from __future__ import division
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for line in datafile:
    for pkg in core_pkgs:
        ptn = re.compile('.*' + pkg + '.*' + '>(\d+)/(\d+).*>(\d+)/(\d+).*')
        match = ptn.match(line)
        if match is not None:
            cvln, tlln, cvbh, tlbh = match.groups()
            covered_lines += int(cvln)
            total_lines += int(tlln)
            covered_branches += int(cvbh)
            total_branches += int(tlbh)
print 'Line coverage:', '{:.2%}'.format(covered_lines / total_lines)
print 'Branch coverage:', '{:.2%}'.format(covered_branches / total_branches)
Down below you can find my Haskell solution. I will try to explain the important points I went through as I wrote it.
First you will find that I created a data structure for coverage data. It's generally a good idea to create data structures to represent whatever data you want to handle. This is in part because it makes it easier to design your code when you can think in terms of whatever you are designing – closely related to functional programming philosophies, and in part because it can eliminate a few bugs where you think you are doing something but are in actuality doing something else.
Related to the point before: The first thing I do is to convert the string-represented data into my own data structure. When you are doing functional programming, you are often doing things in "sweeps." You don't have a single function that converts data to your format, filters out the unwanted data and summarises the result. You have three different functions for each of those tasks, and you do them one at a time!
This is because functions are very composable, i.e. if you have three different ones, you can stick them together to form a single one if you want to. If you start with a single one, it is very difficult to take it apart to form three different ones.
The actual workings of the conversion function are quite uninteresting unless you are specifically doing Haskell. All it does is try to match each string with a regex, and if it succeeds, it adds the coverage data to the resulting list.
Again, mad composition is about to happen. I don't create a function to loop over a list of coverages and sum them up. I create a single function to sum two coverages, because I know I can use it together with the specialised fold loop (which is sort of like a for loop on steroids) to summarise all coverages in a list. There's no need for me to reinvent the wheel and create a loop myself.
Besides, my sumCoverage function works with a lot of specialised loops, so I don't have to write a ton of functions; I just stick my single function into a ton of pre-made library functions!
In the main function you will see what I mean by programming in "sweeps" or "passes" over the data. First I convert it to the internal format, then I filter out the unwanted data, then I summarise the remaining data. These are completely independent computations. That's functional programming.
You will also notice that I use two specialised loops there, filter and fold. This means that I don't have to write any loops myself, I just stick in a function to those standard library loops and let those take it from there.
import Data.Maybe (catMaybes)
import Data.List (foldl')
import Text.Printf (printf)
import Text.Regex (matchRegex, mkRegex)
corePkgs = ["d", "f"]
stats = [
    "d>11/23d>34/89d",
    "e>25/65e>13/25e",
    "f>36/92f>19/76"
    ]
format = mkRegex ".*(\\w+).*>([0-9]+)/([0-9]+).*>([0-9]+)/([0-9]+).*"
-- It might be a good idea to define a datatype for coverage data.
-- A bit of coverage data is defined as the name of the package it
-- came from, the lines covered, the total amount of lines, the
-- branches covered and the total amount of branches.
data Coverage = Coverage String Int Int Int Int
-- Then we need a way to convert the string data into a list of
-- coverage data. We do this by regex. We try to match on each
-- string in the list, and then we choose to keep only the successful
-- matches. Returned is a list of coverage data that was represented
-- by the strings.
convert :: [String] -> [Coverage]
convert = catMaybes . map match
  where match line = do
          [name, cl, tl, cb, tb] <- matchRegex format line
          return $ Coverage name (read cl) (read tl) (read cb) (read tb)

-- We need a way to summarise two coverage data bits. This can of course also
-- be used to summarise entire lists of coverage data, by folding over it.
sumCoverage (Coverage nameA clA tlA cbA tbA) (Coverage nameB clB tlB cbB tbB) =
    Coverage (nameA ++ nameB ++ ",") (clA + clB) (tlA + tlB) (cbA + cbB) (tbA + tbB)

main = do
  -- First we need to convert the strings to coverage data
  let coverageData = convert stats
  -- Then we want to filter out only the relevant data
      relevantData = filter (\(Coverage name _ _ _ _) -> name `elem` corePkgs) coverageData
  -- Then we need to summarise it, but we are only interested in the numbers
      Coverage _ cl tl cb tb = foldl' sumCoverage (Coverage "" 0 0 0 0) relevantData
  -- So we can finally print them!
  printf "Line coverage: %.2f\n" (fromIntegral cl / fromIntegral tl :: Double)
  printf "Branch coverage: %.2f\n" (fromIntegral cb / fromIntegral tb :: Double)
Here are some quickly-hacked, untested ideas applied to your code:
import numpy as np
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for pkg in core_pkgs:
    ptn = re.compile('.*' + pkg + '.*' + '>(\d+)/(\d+).*>(\d+)/(\d+).*')
    matches = map(ptn.match, datafile)
    statsList = [map(int, match.groups()) for match in matches if match]
    # statsList is a list of [cvln, tlln, cvbh, tlbh]
    stats = np.array(statsList)
    covered_lines, total_lines, covered_branches, total_branches = stats.sum(axis=0)
Well, as you can see I haven't bothered to finish off the remaining loop, but I think the point is made by now. There's certainly a lot more than one way to do this; I elected to show off map() (which some will say makes this less efficient, and it probably does), as well as NumPy to get the (admittedly light) math done.
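If you want to stay in plain Python, here is a sketch of the same three sweeps (parse, filter, sum) in the question's Python 2 style; the single regex with a name group is my own substitution for the per-package loop:

from __future__ import division
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
ptn = re.compile(r'.*(\w+).*>(\d+)/(\d+).*>(\d+)/(\d+).*')

# Sweep 1: parse every line, keeping only the successful matches
parsed = [m.groups() for m in (ptn.match(line) for line in datafile) if m]
# Sweep 2: keep only the core packages, converting the numbers to ints
stats = [map(int, g[1:]) for g in parsed if g[0] in core_pkgs]
# Sweep 3: column-wise totals via zip(*...)
covered_lines, total_lines, covered_branches, total_branches = \
    [sum(col) for col in zip(*stats)]

print 'Line coverage:', '{:.2%}'.format(covered_lines / total_lines)
print 'Branch coverage:', '{:.2%}'.format(covered_branches / total_branches)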
This is the corresponding Clojure solution:
(defn extract-data
  "extract 4 integers from a string line according to a package name"
  [pkg line]
  (map read-string
       (rest (first
              (re-seq
               (re-pattern
                (str pkg ".*>(\\d+)/(\\d+).*>(\\d+)/(\\d+)"))
               line)))))
(defn scan-lines-by-pkg
  "scan all string lines and extract all data as integer sequences
  according to package names"
  [pkgs lines]
  (filter seq (for [pkg pkgs
                    line lines]
                (extract-data pkg line))))
(defn sum-data
  "add all data in valid lines together"
  [pkgs lines]
  (apply map + (scan-lines-by-pkg pkgs lines)))

(defn get-percent
  [covered all]
  (str (format "%.2f" (float (/ (* covered 100) all))) "%"))

(defn get-cov
  [pkgs lines]
  {:line-cov   (apply get-percent (take 2 (sum-data pkgs lines)))
   :branch-cov (apply get-percent (drop 2 (sum-data pkgs lines)))})
(get-cov ["d" "f"] ["abc" "d>11/23d>34/89d" "e>25/65e>13/25e" "f>36/92f>19/76"])
I am trying to understand if there is an advantage in space/time/programming to storing data from a signal-processing system as nested lists in either:
data[channel][sample]
data[sample][channel]
I can code the processing for both, though I personally find 1) easier to write and index than 2).
However, 2) is the more common way my local group programs and stores the data (either in Excel/CSV or from the data-gathering systems). While it is easy to transpose
dataA = map(list, zip(*dataB))
I was wondering if there are any storage or performance issues, or even module-compatibility issues, with 1) over 2)?
With 1) I can loop like this:
for R in dataA:
    for C in R:
        process_channel(C)
matplotlib.loglog(dataA[0], dataA[i])
where dataA[0] is time or frequency and i is some other channel to plot.
With 2):
for R in dataB:
    for C in R:
        process_sample(C)
matplotlib.loglog([j[0] for j in dataB], [k[i] for k in dataB])
This looks worse in programming style. Maybe I am missing a list method that makes this easier? I have also developed code that uses dicts ... but that really breaks down with general use, so I am less inclined to keep using them. Although the dict storage would be
dataC = [{'f': 0.1, 'chnl1': 100.0}, {'f': 0.2, 'chnl1': 110.0}]
or some such. It seems that option 2) is the better-integrated one. However, I am trying to understand how best to code with option 2) when you wish to process over channels, then samples. Just transpose the matrix first, do the work in option 1) space, and transpose the results back:
dataA = smoothing(dataA, smooth_factor)

def smoothing(d, s):
    td = map(list, zip(*d))  # transpose so we can work along each channel
    nd = []
    for row in td:
        col = []
        for i in xrange(0, len(row) - s, s):
            col.append(sum(row[i:i+s]) / s)
        nd.append(col)
    return map(list, zip(*nd))  # transpose back to sample-major order
while this construction works - transposing back and forth all the time looks - um - inefficient.
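One way to avoid the explicit back-and-forth is to hold the data in a NumPy array and choose the axis to operate along instead of transposing. A quick sketch, assuming the data is numeric and rectangular (the sample values are made up):

import numpy as np

# sample-major layout, as in option 2): rows are samples, columns are channels
dataB = np.array([[0.1, 100.0, 3.0],
                  [0.2, 110.0, 4.0],
                  [0.3, 105.0, 2.0],
                  [0.4, 115.0, 5.0]])

channel_means = dataB.mean(axis=0)  # one value per channel, no transpose needed

# block-average smoothing over samples with factor s, still without a transpose
s = 2
smoothed = dataB.reshape(-1, s, dataB.shape[1]).mean(axis=1)

The same axis argument works for sum, std and friends, so most "process over channels, then samples" patterns never need the transpose at all.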
I'm not sure what to call what I'm looking for, so if I failed to find this question elsewhere, I apologize. In short, I am writing Python code that will interface directly with the Linux kernel. It's easy to get the required values from the include header files and write them into my source:
IFA_UNSPEC = 0
IFA_ADDRESS = 1
IFA_LOCAL = 2
IFA_LABEL = 3
IFA_BROADCAST = 4
IFA_ANYCAST = 5
IFA_CACHEINFO = 6
IFA_MULTICAST = 7
It's easy to use these values when constructing structs to send to the kernel. However, they are of almost no help in resolving the values in the responses from the kernel.
If I put the values into a dict, I would have to scan all the values in the dict to look up the key for each item in each struct from the kernel, I presume. There must be a simpler, more efficient way.
How would you do it? (Feel free to retitle the question if it's way off.)
If you want to use two dicts, you can try this to create the inverted dict:
b = {v: k for k, v in a.iteritems()}
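A quick usage sketch with a few of the IFA_* values from the question (on Python 3 it would be a.items()):

a = {'IFA_UNSPEC': 0, 'IFA_ADDRESS': 1, 'IFA_LOCAL': 2}
b = {v: k for k, v in a.iteritems()}
print b[2]  # -> 'IFA_LOCAL'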
Your solution leaves a lot of repetitive work to the person maintaining the file. That is a source of errors (you actually have to write each name three times). If you have a file where you need to update those values from time to time (like when new kernel releases come out), you are destined to include an error sooner or later. Actually, that was just a long way of saying: your solution violates DRY.
I would change your solution to something like this:
IFA_UNSPEC = 0
IFA_ADDRESS = 1
IFA_LOCAL = 2
IFA_LABEL = 3
IFA_BROADCAST = 4
IFA_ANYCAST = 5
IFA_CACHEINFO = 6
IFA_MULTICAST = 7
__IFA_MAX = 8
values = {globals()[x]:x for x in dir() if x.startswith('IFA_') or x.startswith('__IFA_')}
This way the values dict is generated automatically. You might want to (or have to) change the condition in the if statement there, according to whatever else is in that file. Maybe something like the following. That version removes the need to list prefixes in the if statement, but it would fail if you had other stuff in the file.
values = {globals()[x]:x for x in dir() if not x.endswith('__')}
You could of course do something more sophisticated there, e.g. check for accidentally repeated values.
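For instance, a minimal sanity check for accidentally repeated values could look like this (a sketch; it just compares the number of names against the number of distinct values):

names = [x for x in dir() if x.startswith('IFA_') or x.startswith('__IFA_')]
numbers = [globals()[x] for x in names]
assert len(set(numbers)) == len(numbers), "duplicate constant values"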
What I ended up doing is leaving the constant values in the module and creating a dict. The module is ip_addr.py (the values are from linux/if_addr.h), so when constructing structs to send to the kernel I can use if_addr.IFA_LABEL, and I resolve responses with if_addr.values[2]. I'm hoping this is the most straightforward approach, so that when I have to look at this again in a year+ it's easy to understand :p
IFA_UNSPEC = 0
IFA_ADDRESS = 1
IFA_LOCAL = 2
IFA_LABEL = 3
IFA_BROADCAST = 4
IFA_ANYCAST = 5
IFA_CACHEINFO = 6
IFA_MULTICAST = 7
__IFA_MAX = 8
values = {
    IFA_UNSPEC:    'IFA_UNSPEC',
    IFA_ADDRESS:   'IFA_ADDRESS',
    IFA_LOCAL:     'IFA_LOCAL',
    IFA_LABEL:     'IFA_LABEL',
    IFA_BROADCAST: 'IFA_BROADCAST',
    IFA_ANYCAST:   'IFA_ANYCAST',
    IFA_CACHEINFO: 'IFA_CACHEINFO',
    IFA_MULTICAST: 'IFA_MULTICAST',
    __IFA_MAX:     '__IFA_MAX'
}
I'm rewriting some code from Ruby to Python. The code is for a Perceptron, listed in section 8.2.6 of Clever Algorithms: Nature-Inspired Programming Recipes. I've never used Ruby before and I don't understand this part:
def test_weights(weights, domain, num_inputs)
  correct = 0
  domain.each do |pattern|
    input_vector = Array.new(num_inputs) {|k| pattern[k].to_f}
    output = get_output(weights, input_vector)
    correct += 1 if output.round == pattern.last
  end
  return correct
end
Some explanation: num_inputs is an integer (2 in my case), and domain is a list of arrays: [[1,0,1], [0,0,0], etc.]
I don't understand this line:
input_vector = Array.new(num_inputs) {|k| pattern[k].to_f}
It creates an array with 2 values, where each element k stores pattern[k].to_f, but what is pattern[k].to_f?
Try this:
input_vector = [float(pattern[i]) for i in range(num_inputs)]
pattern[k].to_f
converts pattern[k] to a float.
I'm not a Ruby expert, but I think it would be something like this in Python:
def test_weights(weights, domain, num_inputs):
    correct = 0
    for pattern in domain:
        output = get_output(weights, pattern[:num_inputs])
        if round(output) == pattern[-1]:
            correct += 1
    return correct
There is plenty of scope for optimising this: if num_inputs is always one less than the length of the lists in domain, then you may not need that parameter at all.
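For instance (a sketch under the same assumptions about get_output and domain), the counting loop collapses to a single generator expression:

def test_weights(weights, domain, num_inputs):
    return sum(1 for pattern in domain
               if round(get_output(weights, pattern[:num_inputs])) == pattern[-1])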
Be careful about doing line by line translations from one language to another: that tends not to give good results no matter what languages are involved.
Edit: since you said you don't think you need to convert to float you can just slice the required number of elements from the domain value. I've updated my code accordingly.