Pandas .loc and PEP8

Pandas .loc and PEP8 - python

I've tried to search this a number of times but I don't see it answered so here goes...
I often use pandas to clean up a dataframe and conform it to my needs. With this comes a lot of .loc accessing to query it and return values. Depending on what I am doing (and column lengths), this can get pretty lengthy. Given PEP8 constrains to 79 characters a line, are there any best practices? Some examples below (these are simplified and for explanatory purposes):
missing_address_df = address_df.loc[address_df['address'].notnull()].copy()
or multiple query points:
nc_drive_df = address.loc[(address_df['address'].str.contains('drive')) & (address_df['state'] == 'NC')]

I'd advise two things
Ignore PEP 8's 80 char advice, but try to keep to 120 or 150 lines
Keeping some line length requirement makes sense to aid readability, but if you're trying to keep to 80 chars in (for example) a class method, it will lead to worse and less-readable code
PEP 8 actually has a section on this, A Foolish Consistency is the Hobgoblin of Little Minds, which describes cases you should deviate from its other advice, for example
When applying the guideline would make the code less readable, even for someone who is used to reading code that follows this PEP
split the .loc contents onto multiple lines
nc_drive_df = address.loc[
(address_df['address'].str.contains('drive')) & \
(address_df['state'] == 'NC')
]
It's hard to be objective about when code "looks bad", despite being valid syntax, but you will experience it. Practically, PEP 8 and Cyclomatic Complexity checkers are tools which will help you fight against and defend and propose code styles in a scientific way.
If you have a great many boolean statements, you (often must) break them up with parentheses to clarify their order
nc_drive_df = address.loc[
(
(address_df['address'].str.contains('drive')) & \
(address_df['state'] == 'NC')
) || (
address_df['zip'] == "00000"
)
]
This is somewhat in conflict with conventional Python operators, which are suggested to preceed lines (PEP8), but I challenge this when forming a Pandas boolean array because the dataframes must be the same to get a good result and it's likely easier to observe this when working with many dataframes when they are first.
Finally, often when doing scientific Python, you should absolutely try many possibilities against a partial and full data if possible to draw good performance conclusions, consider their readability to be second, and provide excellent comments about and linking to your research, etc. over any particular style.

Related

How to detect the amount of almost-repetition in a text file?

I am a programming teacher, and I would like to write a script that detects the amount of repetition in a C/C++/Python file. I guess I can treat any file as pure text.
The script's output would be the number of similar sequences that repeat. Eventually, I am only interested in a DRY's metric (how much the code satisfied the DRY principle).
Naively I tried to do a simple autocorrelation but it would be hard to find the proper threshold.
u = open("find.c").read()
v = [ord(x) for x in u]
y = np.correlate(v, v, mode="same")
y = y[: int(len(y) / 2)]
x = range(len(y))
z = np.polyval(np.polyfit(x, y, 3), x)
f = (y - z)[: -5]
plt.plot(f)
plt.show();
So I am looking at different strategies... I also tried to compare the similarities between each line, each group of 2 lines, each group of 3 lines ...
import difflib
import numpy as np
lines = open("b.txt").readlines()
lines = [line.strip() for line in lines]
n = 3
d = []
for i in range(len(lines)):
a = lines[i:i+n]
for j in range(len(lines)):
b = lines[j:j+n]
if i == j: continue # skip same line
group_size = np.sum([len(x) for x in a])
if group_size < 5: continue # skip short lines
ratio = 0
for u, v in zip(a, b):
r = difflib.SequenceMatcher(None, u, v).ratio()
ratio += r if r > 0.7 else 0
d.append(ratio)
dry = sum(d) / len(lines)
In the following, we can identify some repetition at a glance:
w = int(len(d) / 100)
e = np.convolve(d, np.ones(w), "valid") / w * 10
plt.plot(range(len(d)), d, range(len(e)), e)
plt.show()
Why not using:
d = np.exp(np.array(d))
Thus, difflib module looks promising, the SequenceMatcher does some magic (Levenshtein?), but I would need some magic constants as well (0.7)... However, this code is > O(n^2) and runs very slowly for long files.
What is funny is that the amount of repetition is quite easily identified with attentive eyes (sorry to this student for having taken his code as a good bad example):
I am sure there is a more clever solution out there.
Any hint?

I would build a system based on compressibility, because that is essentially what things being repeated means. Modern compression algorithms are already looking for how to reduce repetition, so let's piggy back on that work.
Things that are similar will compress well under any reasonable compression algorithm, eg LZ. Under the hood a compression algo is a text with references to itself, which you might be able to pull out.
Write a program that feeds lines [0:n] into the compression algorithm, compare it to the output length with [0:n+1].
When you see the incremental length of the compressed output increases by a lot less than the incremental input, you note down that you potentially have a DRY candidate at that location, plus if you can figure out the format, you can see what previous text it was deemed similar to.
If you can figure out the compression format, you don't need to rely on the "size doesn't grow as much" heuristic, you can just pull out the references directly.
If needed, you can find similar structures with different names by pre-processing the input, for instance by normalizing the names. However I foresee this getting a bit messy, so it's a v2 feature. Pre-processing can also be used to normalize the formatting.

Looks like you're choosing a long path. I wouldn't go there.
I would look into trying to minify the code before analyzing it. To completely remove any influence of variable names, extra spacing, formatting and even slight logic reshuffling.
Another approach would be comparing byte-code of the students. But it may be not a very good idea since the result will likely have to be additionally cleaned up.
Dis would be an interesting option.
I would, most likely, stop on comparing their AST. But ast is likely to give false positives for short functions. Cuz their structure may be too similar, so consider checking short functions with something else, something trivial.
On top of thaaaat, I would consider using Levenshtein distance or something similar to numerically calculate the differences between byte-codes/sources/ast/dis of the students. This would be what? Almost O(N^2)? Shouldn't matter.
Or, if needed, make it more complex and calculate the distance between each function of student A and each function of student B, highlighting cases when the distance is too short. It may be not needed though.
With simplification and normalization of the input, more algorithms should start returning good results. If a student is good enough to take someone's code and reshuffle not only the variables, but the logic and maybe even improve the algo, then this student understands the code well enough to defend it and use it with no help in future. I guess, that's the kind of help a teacher would want to be exchanged between students.

You can treat this as a variant of the longest common subsequence problem between the input and itself, where the trivial matching of each element with itself is disallowed. This retains the optimal substructure of the standard algorithm, since it can be phrased as a non-transitive “equality” and the algorithm never relies on transitivity.
As such, we can write this trivial implementation:
import operator
class Repeat:
def __init__(self,l):
self.l=list(l)
self.memo={}
def __call__(self,m,n):
l=self.l
memo=self.memo
k=m,n
ret=memo.get(k)
if not ret:
if not m or not n: ret=0,None
elif m!=n and l[m-1]==l[n-1]: # critical change here!
z,tail=self(m-1,n-1)
ret=z+1,((m-1,n-1),tail)
else: ret=max(self(m-1,n),self(m,n-1),key=operator.itemgetter(0))
memo[k]=ret
return ret
def go(self):
n=len(self.l)
v=self(n,n)[1]
ret=[]
while v:
x,v=v
ret.append(x)
ret.reverse()
return ret
def repeat(l): return Repeat(l).go()
You might want to canonicalize lines of code by removing whitespace (except perhaps between letters), removing comments, and/or replacing each unique identifier with a standardized label. You might also want to omit trivial lines like } in C/C++ to reduce noise. Finally, the symmetry should allow only cases with, say, m>=n to be treated.
Of course, there are also "real" answers and real research on this issue!

Frame challenge: I’m not sure you should do this
It’d be a fun programming challenge for yourself, but if you intend to use it as a teaching tool—-I’m not sure I would. There’s not a good definition of “repeat” from the DRY principle that would be easy to test for fully in a computer program. The human definition, which I’d say is basically “failure to properly abstract your code at an appropriate level, manifested via some type of repetition of code, whether repeating exact blocks of whether repeating the same idea over and over again, or somewhere in between” isn’t something I think anyone will be able to get working well enough at this time to use as a tool that teaches good habits with respect to DRY without confusing the student or teaching bad habits too. For now I’d argue this is a job for humans because it’s easy for us and hard for computers, at least for now…
That said if you want to give it a try, first define for yourself requirements for what errors you want to catch, what they’ll look like, and what good code looks like, and then define acceptable false positive and false negative rates and test your code on a wide variety of representative inputs, validating your code against human judgement to see if it performs well enough for your intended use. But I’m guessing you’re really looking for more than simple repetition of tokens, and if you want to have a chance at succeeding I think you need to clearly define what you’re looking for and how you’ll measure success and then validate your code. A teaching tool can do great harm if it doesn’t actually teach the correct lesson. For example if your tool simply encourages students to obfuscate their code so it doesn’t get flagged as violating DRY, or if the tool doesn’t flag bad code so the student assumes it’s ok. Or if it flags code that is actually very well written.
More specifically, what types of repetition are ok and what aren’t? Is it good or bad to use “if” or “for” or other syntax repeatedly in code? Is it ok for variables and functions/methods to have names with common substrings (e.g. average_age, average_salary, etc.?). How many times is repetition ok before abstraction should happen, and when it does what kind of abstraction is needed and at what level (e.g. a simple method, or a functor, or a whole other class, or a whole other module?). Is more abstraction always better or is perfect sometimes the enemy of on time on budget? This is a really interesting problem, but it’s also a very hard problem, and honestly I think a research problem, which is the reason for my frame challenge.
Edit:
Or if you definitely want to try this anyway, you can make it a teaching tool--not necessarily as you may have intended, but rather by showing your students your adherence to DRY in the code you write when creating your tool, and by introducing them to the nuances of DRY and the shortcomings of automated code quality assessment by being transparent with them about the limitations of your quality assessment tool. What I wouldn’t do is use it like some professors use plagiarism detection tools, as a digital oracle whose assessment of the quality of the students’ code is unquestioned. That approach is likely to cause more harm than good toward the students.

I suggest the following approach: let's say that repetitions should be at least 3 lines long. Then we hash every 3 lines. If hash repeats, then we write down the line number where it occurred. All is left is to join together adjacent duplicated line numbers to get longer sequences.
For example, if you have duplicate blocks on lines 100-103 and 200-203, you will get {HASH1:(100,200), HASH2:(101,201)} (lines 100-102 and 200-202 will produce the same value HASH1, and HASH2 covers lines 101-103 and 201-203). When you join the results, it will produce a sequence (100,101,200,201). Finding in that monotonic subsequences, you will get ((100,101), (200,201)).
As no loops used, the time complexity is linear (both hashing and dictionary insertion are O(n))
Algorithm:
read text line by line
transform it by
removing blanks
removing empty lines, saving mapping to original for the future
for each 3 transformed lines, join them and calculate hash on it
filter lines for which their hashes occur more then once (these are repetitions of at least 3 lines)
find longest sequences and present repetitive text
Code:
from itertools import groupby, cycle
import re
def sequences(l):
x2 = cycle(l)
next(x2)
grps = groupby(l, key=lambda j: j + 1 == next(x2))
yield from (tuple(v) + (next((next(grps)[1])),) for k,v in grps if k)
with open('program.cpp') as fp:
text = fp.readlines()
# remove white spaces
processed, text_map = [], {}
proc_ix = 0
for ix, line in enumerate(text):
line = re.sub(r"\s+", "", line, flags=re.UNICODE)
if line:
processed.append(line)
text_map[proc_ix] = ix
proc_ix += 1
# calc hashes
hashes, hpos = [], {}
for ix in range(len(processed)-2):
h = hash(''.join(processed[ix:ix+3])) # join 3 lines
hashes.append(h)
hpos.setdefault(h, []).append(ix) # this list will reflect lines that are duplicated
# filter duplicated three liners
seqs = []
for k, v in hpos.items():
if len(v) > 1:
seqs.extend(v)
seqs = sorted(list(set(seqs)))
# find longer sequences
result = {}
for seq in sequences(seqs):
result.setdefault(hashes[seq[0]], []).append((text_map[seq[0]], text_map[seq[-1]+3]))
print('Duplicates found:')
for v in result.values():
print('-'*20)
vbeg, vend = v[0]
print(''.join(text[vbeg:vend]))
print(f'Found {len(v)} duplicates, lines')
for line_numbers in v:
print(f'{1+line_numbers[0]} : {line_numbers[1]}')

Is overwriting variables names for lengthy operations bad style?

I quite often find myself in a situation where I undertake several steps to get from my start data input to the output I want to have, e.g. in functions/loops. To avoid making my lines very long, I sometimes overwrite the variable name I am using in these operations.
One example would be:
df_2 = df_1.loc[(df1['id'] == val)]
df_2 = df_2[['c1','c2']]
df_2 = df_2.merge(df3, left_on='c1', right_on='c1'))
The only alternative I can come up with is:
df_2 = df_1.loc[(df1['id'] == val)][['c1','c2']]\
.merge(df3, left_on='c1', right_on='c1'))
But none of these options feels really clean. how should these situations be handled?

You can refer to this article which discusses exactly your question.
The pandas core team now encourages the use of "method chaining".
This is a style of programming in which you chain together multiple
method calls into a single statement. This allows you to pass
intermediate results from one method to the next rather than storing
the intermediate results using variables.
In addition to prettifying the chained codes by using brackets and indentation like #perl's answer, you might also find using functions like .query() and .assign() useful for coding in a "method chaining" style.
Of course, there are some drawbacks for method chaining, especially when excessive:
"One drawback to excessively long chains is that debugging can be
harder. If something looks wrong at the end, you don't have
intermediate values to inspect."

Just as another option, you can put everything in brackets and then break the lines, like this:
df_2 = (df_1
.loc[(df1['id'] == val)][['c1','c2']]
.merge(df3, left_on='c1', right_on='c1')))
It's generally pretty readable even if you have a lot of lines, and if you want to change the name of the output variable, you only need to change it in one place. So, a bit less verbose, and a bit easier to make changes to as compared to overwriting the variables

Readability of Scientific Python Code (Line Continuations, Variable Names, Imports)

Do Python's stylistic best practices apply to scientific coding?
I am finding it difficult to keep scientific Python code readable.
For example, it is suggested to use meaningful names for variables and to keep the namespace ordered by avoiding import *. Thus, e.g. :
import numpy as np
normbar = np.random.normal(mean, std, np.shape(foo))
But these suggestions can lead to some difficult-to-read code, especially given the 79-character line width. For example, I just wrote the following operation:
net["weights"][ix1][ix2] += lrate * (CD / nCases - opts["weightcost_pretrain"].dot(net["weights"][ix1][ix2]))
I can span the expression across lines:
net["weights"][ix1][ix2] += lrate * (CD / nCases -
opts["weightcost_pretrain"].dot(net["weights"][ix1][ix2]))
but this does not seem much better, and I am not sure how deep to indent the second line. These kinds of line continuations become even trickier when one is double-indented into a nested loop, and there are only 50 characters available on a line.
Should I accept that scientific Python looks clunky, or are there ways to avoid lines like the example above?
Some potential approaches are:
using shorter variable names
using shorter dictionary key names
importing numpy functions directly and assigning them short names
defining helper functions for combinations of arithmetic operations
breaking operations into smaller pieces, and placing one on each line
I would appreciate any wisdom on which of these to pursue and which to avoid, as well as suggestions for other remedies.

defining helper functions for combinations of arithmetic operations
breaking operations into smaller pieces, and placing one on each line
These are both good ideas—in keeping with the intent behind PEP 8, and with Pythonic style in general. In fact, whenever someone suggests modifying PEP 8 to give more information about long lines, half the responses are usually "If you're going over the line limit, you're probably doing too much in one expression".
And, more generally, factoring out code and giving sensible names to sensible operations are always a good idea.
Of course without knowing exactly what all these things represent, I can only guess at how to split them up, but I think something like this would be pretty readable and meaningful:
cost = opts["weightcost_pretrain"].dot(net["weights"][ix1][ix2])
weight = lrate * (CD / nCases - cost)
net["weights"][ix1][ix2] += weight

I think the style guide always applies- I use Python daily for scientific work and find that I'm able to read my code more easily and come back to it months later with little effort if I've split up long lines into logical components and sensible variable names, or used a function.
I'd do something more like this:
weights = net["weights"][ix1][ix2]
opts_arr = opts["weightcost_pretrain"]
weights += lrate * (CD / nCases - opts_arr.dot(weights))
Another way of saying that Python is "concise" is that Python is syntactically dense, and I find it harder to read and understand a long line of Python than a long line of Java (especially when using high-level functions from 3rd party libraries that hide low-level logic, like NumPy).

How can I make this code more DRY

From web2py example 33 we see:
db.purchase.insert(buyer_id=form.vars.buyer_id,
product_id=form.vars.product_id,
quantity=form.vars.quantity)
but I think there should be some way to make this less repetitive. perhaps this?
db.purchase.insert(**dict( [k = getattr(form.vars, k) for k in "buyer_id product_id quantity".split()]))

For me, DRY means 1) not repeating actual code, and 2) (and more important) not repeating information; i.e. there should be one place that each item of information exists.
In this instance, you're really just repeating a pattern, and I think that's fine. The second example is much harder to read; why complicate it just to save a few characters?

You could avoid repeating form.vars:
vars = form.vars
db.purchase.insert(
buyer_id=vars.buyer_id,
product_id=vars.product_id,
quantity=vars.quantity)
There is still some repetition, but I think it's better to leave it with some repetition rather than making your code hard to read.

if those 3 things are all there is you could do
db.purchase.insert(**form.vars)
otherwise I think that the original code is plenty dry
but i guess you could do
to_insert = {"product_id":form.vars.product_id,"quantity":form.vars.quantity,"buyer_id":form.vars.buyer_id}
db.purchase.insert(**to_insert)
which is similar to your second example but is more readable and simple (some of the tenets of python)

A better way to assign list into a var

Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know if after splitting the line into a list, is there any better way to assign each list into a var. I have close to 30 lines assigning index values to vars. Just trying to learn more about Python that's it...

done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print data['done']

What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0,1,2,3,4:
starf[i] = int(starf[i]))
Assigning in blocks is useful; if you have many items--you said you had 30--you can split it up:
done, rema, succ = starf[0:2]
fails, size = starf[3:4]

I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython version, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.NamedTuple type with appropriately-named fields, and initialize an instance thereof, then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).

Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in values]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comphrensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multpile lines.
EDIT Again: fixed buggy part :)

Thanks for all the answers. So here's the summary -
Glenn's answer was to handle this issue in blocks. i.e. done, rema, succ = starf[0:2] etc.
Leoluk's approach was more short & sweet taking advantage of python's immensely powerful dict comprehensions.
Alex's answer was more design oriented. Loved this approach. I know it should be done the way Alex suggested but lot of code re-factoring needs to take place for that. Not a good time to do it now.
townsean - same as 2
I have taken up Leoluk's approach. I am not sure what the speed implication for this is? I have no idea if List/Dict comprehensions take a hit on speed of execution. But it reduces the size of my code considerable for now. I'll optimize when the need comes :) Going by - "Pre-mature optimization is the root of all evil"...

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.