Iterate through a list of methods? - python

I have a dataframe and I want to create a list of dataframes produced by the methods .mean() and .median().
Is there any way to do that like this?:
grouped_df = df.groupby(md)
my_list = [grouped_df.i for i in [mean(), median()]]

The most straightforward, no-nonsense way would be:
[grouped_df.mean(), grouped_df.median()]
Any alternative is unnecessarily more complex:
[i() for i in (grouped_df.mean, grouped_df.median)]
[getattr(grouped_df, i)() for i in ('mean', 'median')]
If you want to pass the object explicitly, take the unbound methods from the class rather than from an instance:
[i(grouped_df) for i in (type(grouped_df).mean, type(grouped_df).median)]
(Taking them from an instance, as in df.mean, would not work here: those are bound methods, so grouped_df would be passed as an extra argument.)
You'd need a substantially longer list of methods, or much more dynamic/functional code, to make any of these approaches worth it.
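For illustration, here is a minimal runnable sketch of the getattr variant, with made-up sample data:
import pandas as pd

# Made-up sample data, just to make the snippet runnable
df = pd.DataFrame({'md': ['a', 'a', 'b'], 'x': [1.0, 2.0, 3.0]})
grouped_df = df.groupby('md')

# Look the aggregation methods up by name and call each one
my_list = [getattr(grouped_df, name)() for name in ('mean', 'median')]
print(my_list[0])  # the per-group means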

Is overwriting variable names for lengthy operations bad style?

I quite often find myself in a situation where I undertake several steps to get from my input data to the output I want, e.g. in functions/loops. To avoid making my lines very long, I sometimes overwrite the variable name I am using in these operations.
One example would be:
df_2 = df_1.loc[df_1['id'] == val]
df_2 = df_2[['c1','c2']]
df_2 = df_2.merge(df3, left_on='c1', right_on='c1')
The only alternative I can come up with is:
df_2 = df_1.loc[df_1['id'] == val][['c1','c2']]\
    .merge(df3, left_on='c1', right_on='c1')
But none of these options feels really clean. How should these situations be handled?
You can refer to this article which discusses exactly your question.
The pandas core team now encourages the use of "method chaining".
This is a style of programming in which you chain together multiple
method calls into a single statement. This allows you to pass
intermediate results from one method to the next rather than storing
the intermediate results using variables.
In addition to prettifying chained code with brackets and indentation, as in #perl's answer, you might also find methods like .query() and .assign() useful for coding in a "method chaining" style.
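For example, a sketch of the question's pipeline in a chained style (the derived column c3 is made up, just to show .assign() in the middle of a chain):
df_2 = (
    df_1
    .query('id == @val')               # row filter, replacing .loc[...]
    .assign(c3=lambda d: d['c1'] * 2)  # derive a new column mid-chain
    [['c1', 'c2', 'c3']]
    .merge(df3, on='c1')
)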
Of course, there are some drawbacks to method chaining, especially when it is excessive:
"One drawback to excessively long chains is that debugging can be
harder. If something looks wrong at the end, you don't have
intermediate values to inspect."
Just as another option, you can put everything in brackets and then break the lines, like this:
df_2 = (df_1
        .loc[df_1['id'] == val][['c1','c2']]
        .merge(df3, left_on='c1', right_on='c1'))
It's generally pretty readable even if you have a lot of lines, and if you want to change the name of the output variable, you only need to change it in one place. So it's a bit less verbose and a bit easier to modify than overwriting the variables.

I got around a SettingWithCopyWarning, but it feels like the wrong way and is computationally inefficient; is there a better way?

I encountered the ever-common SettingWithCopyWarning when trying to change some values in a DataFrame. I found a way to get around this without having to disable the warning, but I feel like I've done it the wrong way, and that it is needlessly wasteful and computationally inefficient.
from sklearn.preprocessing import StandardScaler

label_encoded_feature_data_to_be_standardised_X_train = X_train_label_encoded[['price', 'vintage']]
label_encoded_feature_data_to_be_standardised_X_test = X_test_label_encoded[['price', 'vintage']]
label_encoded_standard_scaler = StandardScaler()
label_encoded_standard_scaler.fit(label_encoded_feature_data_to_be_standardised_X_train)
X_train_label_encoded_standardised = label_encoded_standard_scaler.transform(label_encoded_feature_data_to_be_standardised_X_train)
X_test_label_encoded_standardised = label_encoded_standard_scaler.transform(label_encoded_feature_data_to_be_standardised_X_test)
That's how it's set up; then I get the warning if I do this:
X_train_label_encoded.loc[:,'price'] = X_train_label_encoded_standardised[:,0]
or if I do this:
X_train_label_encoded_standardised_df = pd.DataFrame(data=X_train_label_encoded_standardised, columns=['price', 'vintage'])
And I solved it by doing this:
X_train_label_encoded = X_train_label_encoded.drop('price', axis=1)
X_train_label_encoded['price'] = X_train_label_encoded_standardised_df.loc[:,'price']
This also works:
X_train_label_encoded.replace(to_replace=X_train_label_encoded['price'], value=X_train_label_encoded_standardised_df['price'])
But even that feels overly clunky with the extra DataFrame creation.
Why can't I just assign the column in some way? Or use some arrangement of the replace method? The documentation doesn't seem to have a solution, or am I just reading it wrong? Am I missing some obvious but not spelled-out solution?
Is there a better way of doing this?
Many times, this warning is just a warning. If your code works and you aren't using chained assignment, you often have nothing to worry about.
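For reference, a sketch of what chained assignment looks like versus the single .loc call that avoids it (the filter condition here is made up):
# Chained assignment: two indexing steps, the first of which may return a copy
X_train_label_encoded[X_train_label_encoded['price'] > 0]['price'] = 0

# Single .loc operation: modifies the frame itself, no warning
X_train_label_encoded.loc[X_train_label_encoded['price'] > 0, 'price'] = 0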
If your transformation maintains the index, including order, and your data is numeric, you can use pd.DataFrame.values:
X_train_label_encoded['price'] = X_train_label_encoded_standardised_df.values[:, 0]
This should sidestep the warning since .values evaluates to a lower-level NumPy array. (Note that StandardScaler.transform already returns a plain NumPy array, so X_train_label_encoded_standardised itself can also be indexed directly with [:, 0].)
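More generally (a sketch, assuming X_train_label_encoded was itself sliced out of a larger frame), taking an explicit copy up front tells pandas you want an independent object, after which plain column assignment is warning-free:
# Take an explicit copy so pandas knows this is an independent frame
X_train_label_encoded = X_train_label_encoded.copy()

# Plain column assignment is then unambiguous and warning-free
X_train_label_encoded['price'] = X_train_label_encoded_standardised[:, 0]
X_train_label_encoded['vintage'] = X_train_label_encoded_standardised[:, 1]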

What is happening when I assign two DataFrames in Python

I am noticing some interesting behavior in my code.
If I do df1 = df2 and then df2 = df3, why does df1 also seem to equal df3 when I look inside? Is it something to do with DataFrame.copy(deep=True)?
Would the same behavior be observed in simple variables, or only complex objects like DFs?
Thanks.
In order to copy the values instead of sharing the underlying object, you need to use df1 = df2.copy(). Plain assignment (df1 = df2) never copies data in Python; it just binds another name to the same object, so in-place changes made through either name are visible through both. (Rebinding df2 = df3 afterwards does not change what df1 refers to, so what you are seeing is most likely an in-place modification.) This matters for any mutable object, not just DataFrames; with immutable values such as ints and strings, the sharing is harmless because they cannot be changed in place.
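A minimal demonstration of the difference, with made-up data:
import pandas as pd

df2 = pd.DataFrame({'a': [1, 2]})
df1 = df2               # df1 and df2 now name the same object
df1_copy = df2.copy()   # an independent copy of the data

df2.loc[0, 'a'] = 99         # in-place change through one name...
print(df1.loc[0, 'a'])       # 99 -- visible through the alias
print(df1_copy.loc[0, 'a'])  # 1  -- the copy is unaffected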

A better way to assign a list into variables

Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know if, after splitting the line into a list, there is any better way to assign each list item to a variable. I have close to 30 lines assigning index values to vars. Just trying to learn more about Python, that's it...
done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print(data['done'])
What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in (0, 1, 2, 3, 4):
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items--you said you had 30--you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]
I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython version, but an excellent idea for portability and future-proofing).
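A sketch of that approach, reusing STATS_FILE from the question:
import csv

# Read the single stats line with the csv module (delimiter '|'),
# using `with` so the file is closed promptly
with open(STATS_FILE) as f:
    reader = csv.reader(f, delimiter='|')
    row = next(reader)          # the one line of fields, still strings
values = [int(x) for x in row]  # convert them all to ints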
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, and initialize an instance thereof, then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
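A sketch of what that could look like; only the first five field names are known from the question, and the rest would follow the same pattern:
import collections

# One namespace for the related fields instead of 30 bare names
Stats = collections.namedtuple('Stats', ('done', 'rema', 'succ', 'fails', 'size'))
stats = Stats(*(int(x) for x in starf[:5]))
print(stats.done, stats.fails)  # qualified access instead of barenames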
Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in starf]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed buggy part :)
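For what it's worth, the dict-comprehension version mentioned in the edit would look something like this (reusing keys and starf from above):
somedict = {k: int(v) for k, v in zip(keys, starf)}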
Thanks for all the answers. So here's the summary -
Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3] etc.
Leoluk's approach was more short and sweet, taking advantage of Python's immensely powerful dict construction.
Alex's answer was more design oriented. Loved this approach. I know it should be done the way Alex suggested, but a lot of refactoring needs to take place for that. Not a good time to do it now.
townsean - same approach as Leoluk's.
I have taken up Leoluk's approach. I am not sure what the speed implications of this are; I have no idea if list/dict comprehensions take a hit on execution speed. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by - "Premature optimization is the root of all evil"...

Fastest Way To Remove Duplicates In Lists Python

I have two very large lists, and looping through one once takes at least a second, and I need to do it 200,000 times. What's the fastest way to remove duplicates in two lists to form one?
This is the fastest way I can think of:
import itertools
output_list = list(set(itertools.chain(first_list, second_list)))
Slight update: As jcd points out, depending on your application, you probably don't need to convert the result back to a list. Since a set is iterable by itself, you might be able to just use it directly:
output_set = set(itertools.chain(first_list, second_list))
for item in output_set:
    # do something with item
Beware though that any solution involving the use of set() will probably reorder the elements in your list, so there's no guarantee that elements will be in any particular order. That said, since you're combining two lists, it's hard to come up with a good reason why you would need a particular ordering over them anyway, so this is probably not something you need to worry about.
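If you ever do need to keep the first-seen order, one alternative (not part of the answer above) is dict.fromkeys, which preserves insertion order on Python 3.7+:
import itertools

# Order-preserving de-duplication across both lists
output_list = list(dict.fromkeys(itertools.chain(first_list, second_list)))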
I'd recommend something like this:
def combine_lists(list1, list2):
    s = set(list1)
    s.update(list2)
    return list(s)
This eliminates the problem of creating a monster list from the concatenation of the first two.
Depending on what you're doing with the output, don't bother to convert back to a list. If ordering is important, you might need some sort of decorate/sort/undecorate shenanigans around this.
As Daniel states, a set cannot contain duplicate entries - so concatenate the lists:
list1 + list2
Then convert the new list to a set:
set(list1 + list2)
Then back to a list:
list(set(list1 + list2))
result = list(set(list1).union(set(list2)))
That's how I'd do it. I am not so sure about performance, though, but it is certainly better than doing it by hand.
