What is happening when I assign two DataFrames in Python?

I am noticing some interesting behavior in my code.
If I do df1 = df2, and then df2 = df3, why does df1 also equal df3 when I look inside? Does it have something to do with DataFrame.copy(deep=True)?
Would the same behavior be observed with simple variables, or only with complex objects like DataFrames?
Thanks.

Plain assignment copies the reference, not the values: after df1 = df2, both names point to the same object in memory, so in-place changes made through one name are visible through the other. To copy the values instead of the memory location, you need to use df1 = df2.copy(). This matters for mutable objects like DataFrames; simple immutable values such as ints and strings behave as if copied, because they can't be changed in place.
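A minimal sketch of the difference (the column name 'a' is made up for illustration):
import pandas as pd

df2 = pd.DataFrame({'a': [1, 2, 3]})

df1 = df2                 # binds a second name to the SAME object
df2.loc[0, 'a'] = 99
print(df1.loc[0, 'a'])    # 99 -- the in-place change shows through both names

df1 = df2.copy()          # an independent copy (deep by default)
df2.loc[0, 'a'] = 0
print(df1.loc[0, 'a'])    # still 99 -- the copy is unaffected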

Related

Iterate through a list of methods?

I have a dataframe, and I want to create a list of the dataframes produced by the methods .mean() and .median().
Is there any way to do that, like this?
grouped_df = df.groupby(md)
my_list = [grouped_df.i for i in [mean(), median()]]
The most straightforward, no-nonsense way would be:
[grouped_df.mean(), grouped_df.median()]
Any alternative is really unnecessarily more complex:
[i() for i in (grouped_df.mean, grouped_df.median)]
[getattr(grouped_df, i)() for i in ('mean', 'median')]
I'm not sure about the class structure here, but pulling the unbound methods off the class works too:
[i(grouped_df) for i in (type(grouped_df).mean, type(grouped_df).median)]
You'd need a substantially longer list of methods, or much more dynamic/functional code, for any of these alternatives to be worth it.
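For concreteness, here is a runnable version of the getattr approach (the data and column names are made up):
import pandas as pd

df = pd.DataFrame({'md': ['x', 'x', 'y'], 'v': [1.0, 2.0, 4.0]})
grouped_df = df.groupby('md')

# getattr looks each method up by name; the trailing () then calls it
my_list = [getattr(grouped_df, name)() for name in ('mean', 'median')]
print(my_list[0])   # per-group means of 'v'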

Is overwriting variables names for lengthy operations bad style?

I quite often find myself in a situation where I take several steps to get from my input data to the output I want, e.g. in functions/loops. To avoid making my lines very long, I sometimes overwrite the variable name I am using in these operations.
One example would be:
df_2 = df_1.loc[df_1['id'] == val]
df_2 = df_2[['c1', 'c2']]
df_2 = df_2.merge(df3, left_on='c1', right_on='c1')
The only alternative I can come up with is:
df_2 = df_1.loc[df_1['id'] == val][['c1', 'c2']]\
    .merge(df3, left_on='c1', right_on='c1')
But none of these options feels really clean. How should these situations be handled?
You can refer to this article, which discusses exactly your question.
The pandas core team now encourages the use of "method chaining".
This is a style of programming in which you chain together multiple
method calls into a single statement. This allows you to pass
intermediate results from one method to the next rather than storing
the intermediate results using variables.
In addition to prettifying chained code with brackets and indentation, as in @perl's answer below, you might also find functions like .query() and .assign() useful for coding in a "method chaining" style (see the sketch after the quoted drawback below).
Of course, there are some drawbacks to method chaining, especially when it's excessive:
"One drawback to excessively long chains is that debugging can be
harder. If something looks wrong at the end, you don't have
intermediate values to inspect."
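For illustration, the question's example might look like this in that style (df_1, df3 and val are stand-ins, and the c2_doubled column is made up):
import pandas as pd

df_1 = pd.DataFrame({'id': [1, 1, 2], 'c1': ['a', 'b', 'a'], 'c2': [10, 20, 30]})
df3 = pd.DataFrame({'c1': ['a', 'b'], 'c3': [100, 200]})
val = 1

df_2 = (
    df_1
    .query('id == @val')                       # replaces .loc[df_1['id'] == val]
    [['c1', 'c2']]
    .assign(c2_doubled=lambda d: d['c2'] * 2)  # derived column, purely illustrative
    .merge(df3, on='c1')
)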
Just as another option, you can put everything in brackets and then break the lines, like this:
df_2 = (df_1
    .loc[df_1['id'] == val][['c1', 'c2']]
    .merge(df3, left_on='c1', right_on='c1'))
It's generally pretty readable even if you have a lot of lines, and if you want to change the name of the output variable, you only need to change it in one place. So it's a bit less verbose, and a bit easier to change, than overwriting the variables.

Pandas dataframe from dict, why?

I can create a pandas dataframe from dict as follows:
d = {'Key':['abc','def','xyz'], 'Value':[1,2,3]}
df = pd.DataFrame(d)
df.set_index('Key', inplace=True)
And also by first creating a series like this:
d = {'abc': 1, 'def': 2, 'xyz': 3}
a = pd.Series(d, name='Value')
df = pd.DataFrame(a)
But not directly like this:
d = {'abc': 1, 'def': 2, 'xyz': 3}
df = pd.DataFrame(d)
I'm aware of the from_dict method, and this also gives the desired result:
d = {'abc': 1, 'def': 2, 'xyz': 3}
pd.DataFrame.from_dict(d, orient='index')
but I don't see why:
(1) a separate method is needed to create a dataframe from a dict when creating one from a series or a list works without issue;
(2) how/why creating a dataframe from a dict of lists (or a list of lists) works, but not from a dict of scalars directly.
I have found several SE answers that offer solutions, but I am looking for the 'why', as this behavior seems inconsistent. Can anyone shed some light on what I may be missing here?
There's actually a lot happening here, so let's break it down.
The Problem
There are so many different ways to create a DataFrame (from a list of records, a dict, a CSV, an ndarray, etc.) that even Python veterans can take a long time to learn them all. And within each of those ways there are even more variations, obtained by tweaking parameters.
For example, for dictionaries whose values are equal-length lists, here are two ways pandas could handle them:
Case 1:
You treat each key-value pair as a column: the key is the column title, and the list holds that column's values, row by row. In this case the rows don't have names, so by default they are labeled by their integer index.
Case 2:
You treat each key-value pair as a row: the key is the row's name, and the list holds that row's values, column by column. In this case the columns don't have names, so by default they are labeled by their integer index.
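To make the two cases concrete (with made-up data):
import pandas as pd

d = {'Key': ['abc', 'def'], 'Value': [1, 2]}

# Case 1: keys become column names (the constructor's default)
print(pd.DataFrame(d))
#    Key  Value
# 0  abc      1
# 1  def      2

# Case 2: keys become row labels (opt in via from_dict)
print(pd.DataFrame.from_dict(d, orient='index'))
#          0    1
# Key    abc  def
# Value    1    2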
The Solution
Python is a dynamically typed language (variables don't declare a type and functions don't declare a return type). As a result, it doesn't have function overloading. So you basically have two philosophies when you want a class that can be constructed in multiple ways:
Create only one constructor that inspects the input and handles each case accordingly, covering all possible options. This can get very bloated and complicated when certain inputs have their own options/parameters and when there's simply too much variety.
Separate each option into @classmethod constructors, each handling one specific way of building the object.
The second is generally better, as it enforces separation of concerns as a design principle; however, the user then needs to know all the different @classmethod constructors. Although, in my opinion, if your class is complicated enough to have many construction options, the user should be aware of that anyway.
The Panda's Way
Pandas adopts a sort of mix between the two: the plain constructor applies a default behaviour for each input type, and if you want any other behaviour you need the respective @classmethod constructor.
For example, for dicts, if you pass a dict into the DataFrame constructor, it is by default handled as Case 1. If you want Case 2, you need DataFrame.from_dict with orient='index' (without orient='index', from_dict also defaults to the Case 1 behaviour).
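For example, with the question's dict of scalars:
import pandas as pd

d = {'abc': 1, 'def': 2, 'xyz': 3}

# pd.DataFrame(d) raises:
# ValueError: If using all scalar values, you must pass an index
df_row = pd.DataFrame(d, index=[0])                 # Case 1: keys as columns, one row
df_col = pd.DataFrame.from_dict(d, orient='index')  # Case 2: keys as row labels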
Personally, I'm not a fan of this kind of implementation; I find it more confusing than helpful. Honestly, a lot of pandas is designed like that, and there's a reason pandas is the topic of every other python-tagged question on Stack Overflow.

Creating new pandas dataframe in each loop iteration

I have several pandas dataframes (A,B,C,D) and I want to merge each one of them individually with another dataframe (E).
I wanted to write a for loop that allows me to run the merge code for all of them and save each resulting dataframe with a different name, so for example something like:
tables = [A, B, C, D]
n = 0
for df in tables:
    merged_n = df.merge(E, left_index=True, right_index=True)
    n = n + 1
I can't find a way to get different names for the new dataframes created in the loop. I have searched Stack Overflow, but people say this should never be done (without explaining why) or suggest dictionaries, and keeping dataframes inside a dictionary doesn't seem practical to me.
You want to clutter the namespace with automatically generated variable names? If so, don't do that; just use a dictionary.
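A runnable sketch of the dictionary approach (with tiny stand-ins for the question's dataframes):
import pandas as pd

A = pd.DataFrame({'a': [1, 2]})
B = pd.DataFrame({'b': [3, 4]})
E = pd.DataFrame({'e': [5, 6]})

tables = {'A': A, 'B': B}   # in the question this would be {'A': A, ..., 'D': D}
merged = {name: df.merge(E, left_index=True, right_index=True)
          for name, df in tables.items()}

merged['A']   # look each result up by name instead of generating variable names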
If you really don't want to use a dictionary (really think about why you don't), you can just do it the slow-to-write, obvious way:
ea = E.merge(A)
eb = E.merge(B)
...
Edit: if you really want to add variables to your namespace, which I don't recommend, you can do something like this:
l = locals()  # at module level this is the global namespace; inside a function, writing to locals() won't reliably create variables
for c in 'abcd':
    l[f'e{c}'] = E.merge(l[c.upper()])

What is the purpose of the alias method in PySpark?

While learning Spark in Python, I'm having trouble understanding both the purpose of the alias method and its usage. The documentation shows it being used to create copies of an existing DataFrame with new names, then join them together:
>>> from pyspark.sql.functions import *
>>> df_as1 = df.alias("df_as1")
>>> df_as2 = df.alias("df_as2")
>>> joined_df = df_as1.join(df_as2, col("df_as1.name") == col("df_as2.name"), 'inner')
>>> joined_df.select("df_as1.name", "df_as2.name", "df_as2.age").collect()
[Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', name=u'Alice', age=2)]
My question has two parts:
What is the purpose of the alias input? It seems redundant to give the alias string "df_as1" when we are already assigning the new DataFrame to the variable df_as1. If we were to instead use df_as1 = df.alias("new_df"), where would "new_df" ever appear?
In general, when is the alias function useful? The example above feels a bit artificial, but from exploring tutorials and examples it seems to be used regularly -- I'm just not clear on what value it provides.
Edit: some of my original confusion came from the fact that both DataFrame and Column have alias methods. Nevertheless, I'm still curious about both of the above questions, with question 2 now applying to Column.alias as well.
The variable name is irrelevant and can be whatever you like. It's the alias that will be used in string column identifiers and printouts.
I think the main purpose of aliases is brevity and avoiding confusion when there are conflicting column names. For example, a column that was simply 'age' could be aliased to 'max_age' after you searched for the biggest value in that column. Or you could join a data frame of a company's employees with itself and filter so that you get manager-subordinate pairs; in such a context it could be useful to use column names like "manager.name".
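For instance, a sketch of that manager-subordinate idea (the SparkSession, data, and column names are all made up):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

emp = spark.createDataFrame(
    [(1, 'Alice', 2), (2, 'Bob', 1)],
    ['id', 'name', 'manager_id'],
)

pairs = (
    emp.alias('sub')                                   # DataFrame.alias
    .join(emp.alias('mgr'), col('sub.manager_id') == col('mgr.id'))
    .select(
        col('sub.name').alias('subordinate'),          # Column.alias renames the output
        col('mgr.name').alias('manager'),
    )
)
pairs.show()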
