python2 to 3 use of list()

I'm converting python2.7 scripts to python3.
2to3 makes these kinds of suggestions:
  result = result.split(',')
  syslog_trace("Result : {0}".format(result), False, DEBUG)
-  data.append(map(float, result))
+  data.append(list(map(float, result)))
  if (len(data) > samples):
    data.pop(0)
  syslog_trace("Data : {0}".format(data), False, DEBUG)

  # report sample average
  if (startTime % reportTime < sampleTime):
-    somma = map(sum, zip(*data))
+    somma = list(map(sum, list(zip(*data))))
    # not all entries should be float
    # 0.37, 0.18, 0.17, 4, 143, 32147, 3, 4, 93, 0, 0
    averages = [format(sm / len(data), '.3f') for sm in somma]
I'm sure the makers of Python3 did not want to do it like that. At least, it gives me a "you must be kidding" feeling.
Is there a more pythonic way of doing this?

What's wrong with the unfixed somma?
2to3 cannot know how somma is going to be used. In this case it is consumed as a generator in the next line to compute the averages, which is OK and optimal; there is no need to convert it to a list.
That's the genius of the Python 3 list-to-generator changes: most people were already using those lists as generators, wasting precious memory materializing lists they did not need.
  # report sample average
  if (startTime % reportTime < sampleTime):
    somma = map(sum, zip(*data))
    # not all entries should be float
    # 0.37, 0.18, 0.17, 4, 143, 32147, 3, 4, 93, 0, 0
    averages = [format(sm / len(data), '.3f') for sm in somma]
Of course the first statement, unconverted, will fail since we append a generator whereas we need a list. In that case, the error is quickly fixed.
If left like this: data.append(map(float, result)), the next trace shows something fishy: 'Data : [<map object at 0x00000000043DB6A0>]', which you can quickly fix by converting to a list as 2to3 suggested.
2to3 does its best to create running Python 3 code, but it does not replace manual rework or produce optimal code. When you are in a hurry you can apply it, but always check the diffs vs the old code like the OP did.
The -3 option of the latest Python 2 versions prints warnings wherever an error would be raised under Python 3. It's another approach, better suited when you have more time to perform your migration.

I'm sure the makers of Python3 did not want to do it like that
Well, the makers of Python generally don't like seeing Python 2 being used, I've seen that sentiment being expressed in pretty much every recent PyCon.
Is there a more pythonic way of doing this?
That really depends on your interpretation of Pythonic. List comps seem more intuitive in your case: you want to construct a list, so there's no need to create an iterator with map or zip and then feed it to list().
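For instance, the two converted lines could be written with comprehensions like this (a minimal sketch with made-up sample data standing in for result and data):

result = "0.37,0.18,0.17,4,143".split(',')  # made-up sample input
data = []

# instead of data.append(list(map(float, result)))
data.append([float(x) for x in result])

# instead of somma = list(map(sum, list(zip(*data)))); zip can stay lazy
somma = [sum(col) for col in zip(*data)]
averages = [format(sm / len(data), '.3f') for sm in somma]
print(averages)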
Now, why 2to3 chose list() wrapping instead of comps, I do not know; probably easiest to actually implement.


Min, Max, Avg from an input text file in Python, making the loop generic

I am pretty new to Python and am busy with a bootcamp. One of the tasks I have to complete has me a bit stumped. They give as input a txt file that looks like the following:
min:1,2,3,4,5,6
max:1,2,3,4,5,6
avg:1,2,3,4,5,6
The task is that I have to open the txt file in my program and then work out the min, max and avg of each line. I can do this the long way with .readlines(), but they want it generic so that the number of lines doesn't matter. They want me to read through the lines with a loop, check the first word, and have that word select the operation.
I hope that I have put the question through correctly.
Regards
While your question wasn't entirely clear about whether to use readlines or not, maybe this is what you were looking for.
f = open("store.txt", "r")
for i in f.readlines():
    func, data = i.split(":")
    data = [int(j) for j in data.rstrip('\n').split(",")]
    print(func, end=":")
    if func == "max":
        print(max(data))
    elif func == "min":
        print(min(data))
    else:
        print(sum(data)/len(data))
Next time please try to show your work and ask about specific errors, i.e. not how to solve a problem but rather how to change your solution to fix the problem that you are facing.
eval() may be useful here.
The name of the math operation to perform is conveniently the first word of each line in the text file, and some of them (min, max) are Python built-in functions. So after parsing the file into a math expression, I found it tempting to just use Python's eval function to perform the operations on the list of numbers.
Note: this is a one-off solution as use of eval is discouraged on unknown data, but safe here as we manage the input data.
avg is not a built-in operation, so we can define it (and any other operations that are not built-ins) with a lambda.
with open('input.txt', 'r') as f:
    data = f.readlines()

clean = [d.strip('\n').split(':') for d in data]
lines = []

# define operations in input file that are not built-in functions
avg = lambda x: sum(x)/float(len(x))  # float for accurate calculation result

for i in clean:
    lines.append([i[0], list(map(int, i[1].split(',')))])

for expr in lines:
    info = '{}({})'.format(str(expr[0]), str(expr[1]))
    print('{} = {}'.format(info, eval('{op}({d})'.format(op=expr[0], d=expr[1]))))
output:
min([1, 2, 3, 4, 5, 6]) = 1
max([1, 2, 3, 4, 5, 6]) = 6
avg([1, 2, 3, 4, 5, 6]) = 3.5
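For comparison, the same keyword dispatch can be done without eval by mapping each first word to a callable. A minimal sketch, assuming the same input.txt format as above (the ops dict name is mine):

# each keyword from the file maps to a callable; no eval needed
ops = {
    'min': min,
    'max': max,
    'avg': lambda x: sum(x) / float(len(x)),
}

with open('input.txt', 'r') as f:
    for line in f:
        name, _, nums = line.strip().partition(':')
        values = [int(n) for n in nums.split(',')]
        print('{} = {}'.format(name, ops[name](values)))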

I'm trying to make a simple script that says two different two-phrase lines (Python)

So, I'm just starting to program in Python and I wanted to make a very simple script that will say something like "Gabe- Hello, my name is Gabe" (just an example of a sentence) + "Jerry- Hello Gabe, I'm Jerry" OR "Gabe- Goodbye, Jerry" + "Jerry- Goodbye, Gabe". Here's pretty much what I wrote.
answers1 = [
    "James-Hello, my name is James!"
]
answers2 = [
    "Jerry-Hello James, my name is Jerry!"
]
answers3 = [
    "Gabe-Goodbye, Samuel."
]
answers4 = [
    "Samuel-Goodbye, Gabe"
]
Jack1 = (answers1 + answers2)
Jack2 = (answers3 + answers4)
Jacks = ([Jack1, Jack2])

import random
for x in range(2):
    a = random.randint(0,2)
    print (random.sample([Jacks, a]))
I'm quite sure it's a very simple fix, but as I have just started Python (Like, literally 2-3 days ago) I don't quite know what the problem would be. Here's my error message
Traceback (most recent call last):
File "C:/Users/Owner/Documents/Test Python 3.py", line 19, in <module>
print (random.sample([Jacks, a]))
TypeError: sample() missing 1 required positional argument: 'k'
If anyone could help me with this, I would very much appreciate it! Other than that, I shall be searching on ways that may be relevant to fixing this.
The problem is that sample requires a parameter k that indicates how many random samples you want to take. However in this case it looks like you do not need sample, since you already have the random integer. Note that that integer should be in the range [0,1], because the list Jack has only two elements.
a = random.randint(0,1)
print (Jacks[a])
or the same behavior with sample, see here for an explanation.
print (random.sample(Jacks,1))
Hope this helps!
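As a side note, random.choice does the index bookkeeping for you; a minimal self-contained sketch of the same idea:

import random

Jack1 = ["James-Hello, my name is James!", "Jerry-Hello James, my name is Jerry!"]
Jack2 = ["Gabe-Goodbye, Samuel.", "Samuel-Goodbye, Gabe"]
Jacks = [Jack1, Jack2]

# choice() returns one random element, so no index or k parameter is needed
for line in random.choice(Jacks):
    print(line)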
random.sample([Jacks, a])
This sample call should look like
random.sample(Jacks, a)
However, I am concerned that you also have no idea how lists work. Can you explain why you are using lists of strings and then adding them together? I am losing you here.
If you are going to pick a pair of strings, use the method described by Florian (requesting data by index value).
The k parameter tells the random.sample function how many samples you need, so you should write, for example:
print (random.sample(Jacks, 1))
which means you need 1 sample from your list. Note that k cannot exceed the population size: random.sample([Jacks, a], 3) would raise a ValueError, because that two-element list cannot yield three distinct samples.

How is timeit affected by the length of a list literal?

Update: Apparently I'm only timing the speed with which Python can read a list. This doesn't really change my question, though.
So, I read this post the other day and wanted to compare what the speeds looked like. I'm new to pandas so any time I see an opportunity to do something moderately interesting, I jump on it. Anyway, I initially just tested this out with 100 numbers, thinking that would be sufficient to satisfy my itch to play with pandas. But this is what that graph looked like:
Notice that there are 3 different runs. These runs were run in sequential order, but they all had a spike at the same two spots. The spots were approximately 28 and 64. So my initial thought was it had something to do with bytes, specifically 4. Maybe the first byte contains additional information about it being a list, and then the next byte is all data and every 4 bytes after that causes a spike in speed, which kinda made sense. So I needed to test it with more numbers. So I created a DataFrame of 3 sets of arrays, each with 1000 lists ranging in length from 0-999. I then timed them all in the same manner, that is:
Run 1: 0, 1, 2, 3, ...
Run 2: 0, 1, 2, 3, ...
Run 3: 0, 1, 2, 3, ...
What I expected to see was a dramatic increase approximately every 32 items in the array, but instead there's no recurrence to the pattern(I did zoom in and look for spikes):
However, you'll notice that they all vary a lot between the numbers 400 and 682. Oddly, one run always has a spike in the same place, making the pattern harder to distinguish than the 28 and 64 points were in the first graph. The green line is all over the place really. Shameful.
Question: What's happening at the initial two spikes, and why does it get "fuzzy" on the graph between 400 and 682? I just finished running a test over the 0-99 sets, but this time did simple addition to each item in the array, and the result was exactly linear, so I think it has something to do with strings.
I tested with other methods first, and got the same results, but the graph was messed up because I joined the results wrong, so I ran it again overnight(this took a long time) using this code to make sure the times were correctly aligned with their indexes and the runs were performed in the correct order:
import statistics as s
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([[('run_%s' % str(x + 1)), r, np.random.choice(100, r).tolist()]
                   for r in range(0, 1000) for x in range(3)],
                  columns=['run', 'length', 'array']).sort_values(['run', 'length'])
df['time'] = df.array.apply(lambda x: s.mean(timeit.repeat(str(x))))

# Graph
ax = df.groupby(['run', 'length']).mean().unstack('run').plot(y='time')
ax.set_ylabel('Time [ns]')
ax.set_xlabel('Array Length')
ax.legend(loc=3)
I also have the dataframe pickled if you'd like to see the raw data.
You are severely overcomplicating things using pandas and .apply here. There is no need - it is simply inefficient. Just do it the vanilla Python way:
In [3]: import timeit
In [4]: setup = "l = list(range({}))"
In [5]: test = "str(l)"
Note, timeit functions take a number parameter, which is the number of times everything is run. It defaults to 1000000, so let's make that more reasonable, by using number=100, so we don't have to wait around for forever...
In [8]: data = [timeit.repeat(test, setup.format(n), number=100) for n in range(0, 10001, 100)]
In [9]: import statistics
In [10]: mean_data = list(map(statistics.mean, data))
Visual inspection of the results:
In [11]: mean_data
Out[11]:
[3.977467228348056e-05,
0.0012597616684312622,
0.002014552320664128,
0.002637979011827459,
0.0034494600258767605,
0.0046060653403401375,
0.006786816345993429,
0.006134035007562488,
0.006666974319765965,
0.0073876206879504025,
0.008359026357841989,
0.008946725012113651,
0.01020014965130637,
0.0110439983351777,
0.012085124345806738,
0.013095536657298604,
0.013812023680657148,
0.014505649354153624,
0.015109792332320163,
0.01541508767210568,
0.018623976677190512,
0.018014412683745224,
0.01837641668195526,
0.01806374565542986,
0.01866597666715582,
0.021138361655175686,
0.020885809014240902,
0.023644315680333722,
0.022424093661053728,
0.024507874331902713,
0.026360396664434422,
0.02618172235088423,
0.02721496132047226,
0.026609957004742075,
0.027632603014353663,
0.029077719994044553,
0.030218352350251127,
0.03213361800105,
0.0321545610204339,
0.032791375007946044,
0.033749551337677985,
0.03418213398739075,
0.03482868466138219,
0.03569800598779693,
0.035460735321976244,
0.03980560234049335,
0.0375820419867523,
0.03880414469555641,
0.03926491799453894,
0.04079093333954612,
0.0420664346893318,
0.044861480011604726,
0.045125720323994756,
0.04562378901755437,
0.04398221097653732,
0.04668888701902082,
0.04841196699999273,
0.047662509993339576,
0.047592316346708685,
0.05009777001881351,
0.04870589632385721,
0.0532167866670837,
0.05079756366709868,
0.05264475334358091,
0.05531930166762322,
0.05283398299555605,
0.055121281009633094,
0.056162080339466534,
0.05814277834724635,
0.05694748067374652,
0.05985202432687705,
0.05949359833418081,
0.05837553597909088,
0.05975819365509475,
0.06247356999665499,
0.061310798317814864,
0.06292542165222888,
0.06698586166991542,
0.06634997764679913,
0.06443380867131054,
0.06923895300133154,
0.06685209332499653,
0.06864909763680771,
0.06959929631557316,
0.06832000267847131,
0.07180017333788176,
0.07092387134131665,
0.07280202202188472,
0.07342300032420705,
0.0745120863430202,
0.07483605532130848,
0.0734497313387692,
0.0763389469939284,
0.07811927401538317,
0.07915793966579561,
0.08072184936221068,
0.08046915601395692,
0.08565403800457716,
0.08061318534115951,
0.08411134833780427,
0.0865995019945937]
This looks pretty darn linear to me. Now, pandas is a handy way to graph things, especially if you want a convenient wrapper around matplotlib's API:
In [14]: import pandas as pd
In [15]: df = pd.DataFrame({'time': mean_data, 'n':list(range(0, 10001, 100))})
In [16]: df.plot(x='n', y='time')
Out[16]: <matplotlib.axes._subplots.AxesSubplot at 0x1102a4a58>
And here is the result:
This should get you on the right track to actually time what you've been trying to time. What you wound up timing, as I explained in the comments:
You are timing the result of str(x) which results in some list-literal,
so you are timing the interpretation of list literals, not the
conversion of list->str
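To make that distinction concrete, here is one way to time the two things separately (a hedged sketch, not from the original answer; the literal_time/convert_time names are mine):

import timeit

lst = list(range(100))
src = str(lst)  # the string "[0, 1, 2, ...]"

# timing the string itself measures parsing/evaluating the list literal
literal_time = min(timeit.repeat(src, number=1000))

# timing str(lst) measures the actual list -> str conversion
convert_time = min(timeit.repeat('str(lst)', globals={'lst': lst}, number=1000))

print(literal_time, convert_time)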
I can only speculate as to the patterns you are seeing as the result of that, but that is likely interpreter/hardware dependent. Here are my findings on my machine:
In [18]: data = [timeit.repeat("{}".format(str(list(range(n)))), number=100) for n in range(0, 10001, 100)]
And using a range that isn't so large:
In [23]: data = [timeit.repeat("{}".format(str(list(range(n)))), number=10000) for n in range(0, 101)]
And the results:
Which I guess sort of looks like yours. Perhaps that is better suited for its own question, though.

Converting an imperative algorithm into functional style

I wrote a simple procedure to calculate the average of the test coverage of some specific packages in a Java project. The raw data in a huge html file is like this:
<body>
package pkg1 <line_coverage>11/111,<branch_coverage>44/444<end>
package pkg2 <line_coverage>22/222,<branch_coverage>55/555<end>
package pkg3 <line_coverage>33/333,<branch_coverage>66/666<end>
...
</body>
Given the specified packages "pkg1" and "pkg3", for example, the average line coverage is:
(11+33)/(111+333)
and average branch coverage is:
(44+66)/(444+666)
I wrote the following procedure to get the result and it works well. But how can I implement this calculation in a functional style? Something like "(x,y) for x in ... for b in ... if ...". I know a little Erlang, Haskell and Clojure, so solutions in those languages are also appreciated. Thanks a lot!
from __future__ import division
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for line in datafile:
    for pkg in core_pkgs:
        ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
        match = ptn.match(line)
        if match is not None:
            cvln, tlln, cvbh, tlbh = match.groups()
            covered_lines += int(cvln)
            total_lines += int(tlln)
            covered_branches += int(cvbh)
            total_branches += int(tlbh)
print 'Line coverage:', '{:.2%}'.format(covered_lines / total_lines)
print 'Branch coverage:', '{:.2%}'.format(covered_branches / total_branches)
Down below you can find my Haskell solution. I will try to explain the important points I went through as I wrote it.
First you will find that I created a data structure for coverage data. It's generally a good idea to create data structures to represent whatever data you want to handle. This is in part because it makes it easier to design your code when you can think in terms of whatever you are designing – closely related to functional programming philosophies, and in part because it can eliminate a few bugs where you think you are doing something but are in actuality doing something else.
Related to the point before: The first thing I do is to convert the string-represented data into my own data structure. When you are doing functional programming, you are often doing things in "sweeps." You don't have a single function that converts data to your format, filters out the unwanted data and summarises the result. You have three different functions for each of those tasks, and you do them one at a time!
This is because functions are very composable, i.e. if you have three different ones, you can stick them together to form a single one if you want to. If you start with a single one, it is very difficult to take it apart to form three different ones.
The actual workings of the conversion function are quite uninteresting unless you are specifically doing Haskell. All it does is try to match each string against a regex, and if it succeeds, it adds the coverage data to the resulting list.
Again, mad composition is about to happen. I don't create a function to loop over a list of coverages and sum them up. I create a single function to sum two coverages, because I know I can use it together with the specialised fold loop (which is sort of like a for loop on steroids) to summarise all coverages in a list. There's no need for me to reinvent the wheel and create a loop myself.
Besides, my sumCoverages function works with a lot of specialised loops, so I don't have to write a ton of functions, I just stick my single function into a ton of pre-made library functions!
In the main function you will see what I mean by programming in "sweeps" or "passes" over the data. First I convert it to the internal format, then I filter out the unwanted data, then I summarise the remaining data. These are completely independent computations. That's functional programming.
You will also notice that I use two specialised loops there, filter and fold. This means that I don't have to write any loops myself, I just stick in a function to those standard library loops and let those take it from there.
import Data.Maybe (catMaybes)
import Data.List (foldl')
import Text.Printf (printf)
import Text.Regex (matchRegex, mkRegex)

corePkgs = ["d", "f"]

stats = [
    "d>11/23d>34/89d",
    "e>25/65e>13/25e",
    "f>36/92f>19/76"
  ]

format = mkRegex ".*(\\w+).*>([0-9]+)/([0-9]+).*>([0-9]+)/([0-9]+).*"

-- It might be a good idea to define a datatype for coverage data.
-- A bit of coverage data is defined as the name of the package it
-- came from, the lines covered, the total amount of lines, the
-- branches covered and the total amount of branches.
data Coverage = Coverage String Int Int Int Int

-- Then we need a way to convert the string data into a list of
-- coverage data. We do this by regex. We try to match on each
-- string in the list, and then we choose to keep only the successful
-- matches. Returned is a list of coverage data that was represented
-- by the strings.
convert :: [String] -> [Coverage]
convert = catMaybes . map match
  where match line = do
          [name, cl, tl, cb, tb] <- matchRegex format line
          return $ Coverage name (read cl) (read tl) (read cb) (read tb)

-- We need a way to summarise two coverage data bits. This can of course also
-- be used to summarise entire lists of coverage data, by folding over it.
sumCoverage (Coverage nameA clA tlA cbA tbA) (Coverage nameB clB tlB cbB tbB) =
  Coverage (nameA ++ nameB ++ ",") (clA + clB) (tlA + tlB) (cbA + cbB) (tbA + tbB)

main = do
  -- First we need to convert the strings to coverage data
  let coverageData = convert stats
      -- Then we want to filter out only the relevant data
      relevantData = filter (\(Coverage name _ _ _ _) -> name `elem` corePkgs) coverageData
      -- Then we need to summarise it, but we are only interested in the numbers
      Coverage _ cl tl cb tb = foldl' sumCoverage (Coverage "" 0 0 0 0) relevantData
  -- So we can finally print them!
  printf "Line coverage: %.2f\n" (fromIntegral cl / fromIntegral tl :: Double)
  printf "Branch coverage: %.2f\n" (fromIntegral cb / fromIntegral tb :: Double)
Here are some quickly-hacked, untested ideas applied to your code:
import numpy as np
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for pkg in core_pkgs:
    ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
    matches = map(ptn.match, datafile)  # map(func, iterable), not the reverse
    statsList = [list(map(int, match.groups())) for match in matches if match]
    # statsList is a list of [cvln, tlln, cvbh, tlbh]
    stats = np.array(statsList)
    # sum down the columns to get the four totals for this package
    covered_lines, total_lines, covered_branches, total_branches = stats.sum(axis=0)
Well, as you can see I haven't bothered to finish off the remaining loop, but I think the point is made by now. There's certainly a lot more than one way to do this; I elected to show off map() (which some will say makes this less efficient, and it probably does), as well as NumPy to get the (admittedly light) math done.
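Along the same lines, a more thoroughly functional pure-Python version is possible. A minimal sketch (Python 3, untested beyond the sample data; the variable names are mine):

import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
ptn = re.compile(r'.*?(\w+).*>(\d+)/(\d+).*>(\d+)/(\d+).*')

# sweep 1: parse; sweep 2: filter; sweep 3: summarise (cf. the Haskell answer)
parsed = (m.groups() for m in map(ptn.match, datafile) if m)
relevant = [tuple(map(int, g[1:])) for g in parsed if g[0] in core_pkgs]
cl, tl, cb, tb = (sum(col) for col in zip(*relevant))

print('Line coverage: {:.2%}'.format(cl / tl))
print('Branch coverage: {:.2%}'.format(cb / tb))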
This is the corresponding Clojure solution:
(defn extract-data
  "extract 4 integers from a string line according to a package name"
  [pkg line]
  (map read-string
       (rest (first
              (re-seq
               (re-pattern
                (str pkg ".*>(\\d+)/(\\d+).*>(\\d+)/(\\d+)"))
               line)))))

(defn scan-lines-by-pkg
  "scan all string lines and extract all data as integer sequences
  according to package names"
  [pkgs lines]
  (filter seq (for [pkg pkgs
                    line lines]
                (extract-data pkg line))))

(defn sum-data
  "add all data in valid lines together"
  [pkgs lines]
  (apply map + (scan-lines-by-pkg pkgs lines)))

(defn get-percent
  [covered all]
  (str (format "%.2f" (float (/ (* covered 100) all))) "%"))

(defn get-cov
  [pkgs lines]
  {:line-cov (apply get-percent (take 2 (sum-data pkgs lines)))
   :branch-cov (apply get-percent (drop 2 (sum-data pkgs lines)))})

(get-cov ["d" "f"] ["abc" "d>11/23d>34/89d" "e>25/65e>13/25e" "f>36/92f>19/76"])

Only index needed: enumerate or (x)range?

If I want to use only the index within a loop, am I better off using the range/xrange function in combination with len()?
a = [1,2,3]
for i in xrange(len(a)):
    print i
or enumerate? Even if I won't use p at all?
for i,p in enumerate(a):
    print i
I would use enumerate as it's more generic - e.g. it will work on arbitrary iterables as well as sequences, and the overhead of just returning a reference to an object isn't that big a deal - while xrange(len(something)), although (to me) more easily readable as to your intent, will break on objects with no support for len...
Using xrange with len is quite a common use case, so yes, you can use it if you only need to access values by index.
But if you prefer to use enumerate for some reason, you can use an underscore (_); it's just a frequently seen notation that shows you won't use the variable in any meaningful way:
for i, _ in enumerate(a):
    print i
There's also a pitfall that may happen when using underscore (_). It's also common to name 'translating' functions _ in i18n libraries and systems, so beware of using it with gettext or some other library of that kind (thanks to @lazyr).
That's a rare requirement – the only information used from the container is its length! In this case, I'd indeed make this fact explicit and use the first version.
xrange should be a little faster, but enumerate will mean you don't need to change it when you realise that you need p after all.
I ran a time test and found out range is about 2x faster than enumerate (on Python 3.6 for Win32).
best of 3, for len(a) = 1M
enumerate(a): 0.125s
range(len(a)): 0.058s
Hope it helps.
FYI: I initially started this test to compare Python vs VBA's speed... and found out VBA is actually 7x faster than the range method... is it because of my poor Python skills?
Surely Python can do better than VBA somehow.
script for enumerate
import time

a = [0]
a = a * 1000000
start = time.perf_counter()
for i, j in enumerate(a):
    pass
print(time.perf_counter() - start)
script for range
import time

a = [0]
a = a * 1000000
start = time.perf_counter()
for i in range(len(a)):
    pass
print(time.perf_counter() - start)
script for vba (0.008s)
Sub timetest_for()
    Dim a(1000000) As Byte
    Dim i As Long
    tproc = Timer
    For i = 1 To UBound(a)
    Next i
    Debug.Print Timer - tproc
End Sub
I wrote this because I wanted to test it.
So it depends on whether you need the values to work with.
Code:
testlist = []
for i in range(10000):
    testlist.append(i)

def rangelist():
    a = 0
    for i in range(len(testlist)):
        a += i
        a = testlist[i] + 1  # Comment this line for example for testing

def enumlist():
    b = 0
    for i, x in enumerate(testlist):
        b += i
        b = x + 1  # Comment this line for example for testing

import timeit
t = timeit.Timer(lambda: rangelist())
print("range(len()):")
print(t.timeit(number=10000))
t = timeit.Timer(lambda: enumlist())
print("enum():")
print(t.timeit(number=10000))
Now you can run it and will most likely get the result that enum() is faster.
When you comment out the lines a = testlist[i] + 1 and b = x + 1 you will see that range(len()) is faster.
For the code above I get:
range(len()):
18.766527627612255
enum():
15.353173553868345
Now when commenting as stated above I get:
range(len()):
8.231641875551514
enum():
9.974262515773656
Based on your sample code,
res = [[profiel.attr[i].x for i,p in enumerate(profiel.attr)] for profiel in prof_obj]
I would replace it with
res = [[p.x for p in profiel.attr] for profiel in prof_obj]
Just use range(). If you're going to use all the indexes anyway, xrange() provides no real benefit (unless len(a) is really large). And enumerate() creates a richer data structure that you're going to throw away immediately.
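To see that "richer data structure" concretely, a tiny sketch:

a = ['x', 'y', 'z']
pairs = enumerate(a)
print(next(pairs))  # (0, 'x') -- an (index, value) tuple is built on every iteration
print(next(pairs))  # (1, 'y') -- with range(len(a)) only the bare index is produced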
