How to create a view of dataframe in pandas? - python

I have a large dataframe (10m rows, 40 columns, 7GB in memory). I would like to create a view in order to have a shorthand name for a view that is complicated to express, without adding another 2-4 GB to memory usage. In other words, I would rather type:
df2
Than:
df.loc[complicated_condition, some_columns]
The documentation states that, while using .loc ensures that setting values modifies the original dataframe, there is still no guarantee as to whether the object returned by .loc is a view or a copy.
I know I could assign the condition and column list to variables (e.g. df.loc[cond, cols]), but I'm generally curious to know whether it is possible to create a view of a dataframe.
Edit: Related questions:
What rules does Pandas use to generate a view vs a copy?
Pandas: Subindexing dataframes: Copies vs views

You generally can't return a view.
Your answer lies in the pandas docs:
returning-a-view-versus-a-copy.
Whenever an array of labels or a boolean vector are involved in the
indexing operation, the result will be a copy. With single label /
scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view
will be returned.
This answer was found in the following post: Link.

Related

What does the pandas groupby only return?

I noticed in some of my code I was doing different operations on the same groupby call. So I made only one call:
c = df.groupby("SP")
And ran different operations on the 'DataFrameGroupBy' object, such as:
c["SP"].cumsum()
And several others, most of them were of the cumsum, transform type, so same index expected.
So far so good, this sped the script somewhat, but than I noticed that I occassionaly added attributes to the original df after I had made the groupby object (c). This attributes were used on groupby operations. It seems to have worked fine.
But when some rows are droped from the df than the aggregate operations on c are still being done on the old df with all the rows.
So what is the DataFrameGroubBy? is it only a pointer to indices? Or does it make some copy of the DataFrame. From the behavior explained above it appears a copy is made when you change the number of rows in the original DataFrame.

When should I worry about using copy() with a pandas DataFrame?

I'm more of an R user and have recently been "switching" to Python. So that means I'm way more used to the R way of dealing with things. In Python, the whole concept of mutability and passing by assignment is kind of hard to grasp at first.
I can easily understand the issues that mutability may lead to when using lists or dictionaries. However, when using pandas DataFrames, I find that mutability is specially difficult to understand.
For example: let's say I have a DataFrame (df) with some raw data. I want to use a function that receives df as a parameter and outputs a modified version of that df, but keeping the original df. If I wrote the function, maybe I can inspect it and be assured that it makes a copy of the input before applying any manipulation. However, if it's a function I don't know (let's say, from some package), should I always pass my input df as df.copy()?
In my case, I'm trying to write some custom function that transforms a df using a WoE encoder. The data parameter is a DataFrame with feature columns and a label column. It kinda looks like this:
def my_function(data, var_list, label_column):
encoder = category_encoders.WOEEncoder(cols=var_list) # var_list = cols to be encoded
fit_encoder = encoder.fit(
X=data[var_list],
y=data[label_column]
)
new_data = fit_encoder.transform(
data[var_list]
)
new_data[label_column] = data[label_column]
return new_data
So should I be passing data[var_list].copy() instead of data[var_list]? Should I assume that every function that receives a df will modify it in-place or will it return a different object? I mean, how can I be sure that fit_encoder.transform won't modify data itself? I also learned that Pandas sometimes produces views and sometimes not, depending on the operation you apply to the whatever subset of the df. So I feel like there's too much uncertainty surrounding operations on DataFrames.
From the exercise shown on the website https://www.statology.org/pandas-copy-dataframe/ it shows that if you don't use .copy() when manipulating a subset of your dataframe, it could change values in your original dataframe as well. This is not what you want, so you should use .copy() when passing your dataframe to your function.
The example on the link I listed above really illustrates this concept well (and no I'm not affiliated with their site lol, I was just searching for this answer myself).

Dataframe .where() method returns None [duplicate]

In the pandas library many times there is an option to change the object inplace such as with the following statement...
df.dropna(axis='index', how='all', inplace=True)
I am curious what is being returned as well as how the object is handled when inplace=True is passed vs. when inplace=False.
Are all operations modifying self when inplace=True? And when inplace=False is a new object created immediately such as new_df = self and then new_df is returned?
If you are trying to close a question where someone should use inplace=True and hasn't, consider replace() method not working on Pandas DataFrame instead.
When inplace=True is passed, the data is renamed in place (it returns nothing), so you'd use:
df.an_operation(inplace=True)
When inplace=False is passed (this is the default value, so isn't necessary), performs the operation and returns a copy of the object, so you'd use:
df = df.an_operation(inplace=False)
In pandas, is inplace = True considered harmful, or not?
TLDR; Yes, yes it is.
inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
inplace does not work with method chaining
inplace can lead to SettingWithCopyWarning if used on a DataFrame column, and may prevent the operation from going though, leading to hard-to-debug errors in code
The pain points above are common pitfalls for beginners, so removing this option will simplify the API.
I don't advise setting this parameter as it serves little purpose. See this GitHub issue which proposes the inplace argument be deprecated api-wide.
It is a common misconception that using inplace=True will lead to more efficient or optimized code. In reality, there are absolutely no performance benefits to using inplace=True. Both the in-place and out-of-place versions create a copy of the data anyway, with the in-place version automatically assigning the copy back.
inplace=True is a common pitfall for beginners. For example, it can trigger the SettingWithCopyWarning:
df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})
df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
Calling a function on a DataFrame column with inplace=True may or may not work. This is especially true when chained indexing is involved.
As if the problems described above aren't enough, inplace=True also hinders method chaining. Contrast the working of
result = df.some_function1().reset_index().some_function2()
As opposed to
temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()
The former lends itself to better code organization and readability.
Another supporting claim is that the API for set_axis was recently changed such that inplace default value was switched from True to False. See GH27600. Great job devs!
The way I use it is
# Have to assign back to dataframe (because it is a new copy)
df = df.some_operation(inplace=False)
Or
# No need to assign back to dataframe (because it is on the same copy)
df.some_operation(inplace=True)
CONCLUSION:
if inplace is False
Assign to a new variable;
else
No need to assign
The inplace parameter:
df.dropna(axis='index', how='all', inplace=True)
in Pandas and in general means:
1. Pandas creates a copy of the original data
2. ... does some computation on it
3. ... assigns the results to the original data.
4. ... deletes the copy.
As you can read in the rest of my answer's further below, we still can have good reason to use this parameter i.e. the inplace operations, but we should avoid it if we can, as it generate more issues, as:
1. Your code will be harder to debug (Actually SettingwithCopyWarning stands for warning you to this possible problem)
2. Conflict with method chaining
So there is even case when we should use it yet?
Definitely yes. If we use pandas or any tool for handeling huge dataset, we can easily face the situation, where some big data can consume our entire memory.
To avoid this unwanted effect we can use some technics like method chaining:
(
wine.rename(columns={"color_intensity": "ci"})
.assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
.query("alcohol > 14 and color_filter == 1")
.sort_values("alcohol", ascending=False)
.reset_index(drop=True)
.loc[:, ["alcohol", "ci", "hue"]]
)
which make our code more compact (though harder to interpret and debug too) and consumes less memory as the chained methods works with the other method's returned values, thus resulting in only one copy of the input data. We can see clearly, that we will have 2 x original data memory consumption after this operations.
Or we can use inplace parameter (though harder to interpret and debug too) our memory consumption will be 2 x original data, but our memory consumption after this operation remains 1 x original data, which if somebody whenever worked with huge datasets exactly knows can be a big benefit.
Final conclusion:
Avoid using inplace parameter unless you don't work with huge data and be aware of its possible issues in case of still using of it.
Save it to the same variable
data["column01"].where(data["column01"]< 5, inplace=True)
Save it to a separate variable
data["column02"] = data["column01"].where(data["column1"]< 5)
But, you can always overwrite the variable
data["column01"] = data["column01"].where(data["column1"]< 5)
FYI: In default inplace = False
When trying to make changes to a Pandas dataframe using a function, we use 'inplace=True' if we want to commit the changes to the dataframe.
Therefore, the first line in the following code changes the name of the first column in 'df' to 'Grades'. We need to call the database if we want to see the resulting database.
df.rename(columns={0: 'Grades'}, inplace=True)
df
We use 'inplace=False' (this is also the default value) when we don't want to commit the changes but just print the resulting database. So, in effect a copy of the original database with the committed changes is printed without altering the original database.
Just to be more clear, the following codes do the same thing:
#Code 1
df.rename(columns={0: 'Grades'}, inplace=True)
#Code 2
df=df.rename(columns={0: 'Grades'}, inplace=False}
Yes, in Pandas we have many functions has the parameter inplace but by default it is assigned to False.
So, when you do df.dropna(axis='index', how='all', inplace=False) it thinks that you do not want to change the orignial DataFrame, therefore it instead creates a new copy for you with the required changes.
But, when you change the inplace parameter to True
Then it is equivalent to explicitly say that I do not want a new copy
of the DataFrame instead do the changes on the given DataFrame
This forces the Python interpreter to not to create a new DataFrame
But you can also avoid using the inplace parameter by reassigning the result to the orignal DataFrame
df = df.dropna(axis='index', how='all')
inplace=True is used depending if you want to make changes to the original df or not.
df.drop_duplicates()
will only make a view of dropped values but not make any changes to df
df.drop_duplicates(inplace = True)
will drop values and make changes to df.
Hope this helps.:)
inplace=True makes the function impure. It changes the original dataframe and returns None. In that case, You breaks the DSL chain.
Because most of dataframe functions return a new dataframe, you can use the DSL conveniently. Like
df.sort_values().rename().to_csv()
Function call with inplace=True returns None and DSL chain is broken. For example
df.sort_values(inplace=True).rename().to_csv()
will throw NoneType object has no attribute 'rename'
Something similar with python’s build-in sort and sorted. lst.sort() returns None and sorted(lst) returns a new list.
Generally, do not use inplace=True unless you have specific reason of doing so. When you have to write reassignment code like df = df.sort_values(), try attaching the function call in the DSL chain, e.g.
df = pd.read_csv().sort_values()...
As Far my experience in pandas I would like to answer.
The 'inplace=True' argument stands for the data frame has to make changes permanent
eg.
df.dropna(axis='index', how='all', inplace=True)
changes the same dataframe (as this pandas find NaN entries in index and drops them).
If we try
df.dropna(axis='index', how='all')
pandas shows the dataframe with changes we make but will not modify the original dataframe 'df'.
If you don't use inplace=True or you use inplace=False you basically get back a copy.
So for instance:
testdf.sort_values(inplace=True, by='volume', ascending=False)
will alter the structure with the data sorted in descending order.
then:
testdf2 = testdf.sort_values( by='volume', ascending=True)
will make testdf2 a copy. the values will all be the same but the sort will be reversed and you will have an independent object.
then given another column, say LongMA and you do:
testdf2.LongMA = testdf2.LongMA -1
the LongMA column in testdf will have the original values and testdf2 will have the decrimented values.
It is important to keep track of the difference as the chain of calculations grows and the copies of dataframes have their own lifecycle.

Size immutability in pandas data structure

While going through pandas Documentation for version 0.24.1 here, I came across this statement.
"All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame."
import pandas as pd
test_s = pd.Series([1,2,3])
id(test_s) # output: 140485359734400 (will vary)
len(test_s) # output: 3
test_s[3] = 37
id(test_s) # output: 140485359734400
len(test_s) # output: 4
The meaning of size immutable as per my inference is that operations like appending and deleting an element are not allowed, which is clearly not the case. Even the identity of the object remains the same, ruling out the possibility of a new object creation with the same name.
So, what does size immutability actually mean?
Appending and deleting are allowed, but that doesn't necessarily imply the Series is mutable.
Series/DataFrames are internally represented by NumPy arrays which are immutable (fixed size) to allow a more compact memory representation and better performance.
When you assign to a Series, you're actually calling Series.__setitem__ (which then delegates to NDFrame.__loc__) which creates a new array. This new array is then assigned back to the same Series (of course, as the end user, you don't get to see this), giving you the illusion of mutability.
#dbot_5
"All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable.
A per my opinion, it is already written that all pandas data structures (Series, Dataframes) are value-mutable. It means we can add or delete values in Series and DataFrame.
"The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame."
As given in this statement, we cannot change the columns in the Series (by default, it has only one column and we cannot add new columns and we cannot even delete it.) but we can change- add, delete columns in a DataFrame. so here length means no. of columns not the number of values.

The nature of pandas DataFrame

As a followup to my question on mixed types in a column:
Can I think of a DataFrame as a list of columns or is it a list of rows?
In the former case, it means that (optimally) each column has to be homogeneous (type-wise) and different columns can be of different types. The latter case, suggests that each row is type-wise homogeneous.
For the documentation:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
This implies that a DataFrame is a list of columns.
Does it mean that appending a row to a DataFrame is more expensive than appending a column?
You are fully correct that a DataFrame can be seen as a list of columns, or even more a (ordered) dictionary of columns (see explanation here).
Indeed, each column has to be homogeneous of type, and different columns can be of different types. But by using the object dtype you can still hold different types of objects in one column (although not recommended apart for eg strings).
To illustrate, if you ask the data types of a DataFrame, you get the dtype for each column:
In [2]: df = pd.DataFrame({'int_col':[0,1,2], 'float_col':[0.0,1.1,2.5], 'bool_col':[True, False, True]})
In [3]: df.dtypes
Out[3]:
bool_col bool
float_col float64
int_col int64
dtype: object
Internally, the values are stored as blocks of the same type. Each column, or collection of columns of the same type is stored in a separate array.
And this indeed implies that appending a row is more expensive. In general, appending multiple single rows is not a good idea: better to eg preallocate an empty dataframe to fill, or put the new rows/columns in a list and concat them all at once.
See the note at the end of the concat/append docs (just before the first subsection "Set logic on the other axes").
To address the question: Is appending a row to a DataFrame is more expensive than appending a column?
We need to take into account various factors, but the most important one is the internal physical data layout of Pandas Dataframe.
The short and kind of naive answer:
If the table(aka DataFrame) is stored in a column-wise physical layout, then add or fetch a column is faster than with a row; if the table is stored in a row-wise physical layout, it's the other way. In general, the default Pandas DataFrame is stored column-wise(but NOT all the time). So in general, appending a row to a DataFrame is indeed more expensive than appending a column. And you could consider the nature of Pandas DataFrame to be a dict of columns.
A longer answer:
Pandas needs to choose a way to arrange the internal layout of a table in memory (such as a Dataframe of 10 rows and 2 columns). The most common two approaches are column-wise and row-wise.
Pandas is built on top of Numpy, and DataFrame and Seires are built on top of Numpy Array. But do notice though Numpy Array is internally stored row-wise in Memory, this is NOT the case for Pandas DataFrame. How DataFrame is stored depends on how it was initiated, cf this post:https://krbnite.github.io/Memory-Efficient-Windowing-of-Time-Series-Data-in-Python-2-NumPy-Arrays-vs-Pandas-DataFrames/
It's actually quite natural that Pandas adopt a column-wise layout most of the time because Pandas was designed to be a data analysis tool that relies more heavily on column-oriented operations than row-oriented operations. cf https://www.stitchdata.com/columnardatabase/
In the end, the answer to the question Is appending a row to a DataFrame is more expensive than appending a column? also depends on caching, prefetching etc. Thus it's a rather complicated question to answer and could depend on specific runtime conditions. But the most important factor is the data layout.
Answer from the authors of Pandas
The authors of Pandas actually mentioned this point in their design documentation. cf https://github.com/pydata/pandas-design/blob/master/source/internal-architecture.rst#what-is-blockmanager-and-why-does-it-exist
So, to do anything row oriented on an all-numeric DataFrame, pandas
would concatenate all of the columns together (using numpy.vstack or
numpy.hstack) then use array broadcasting or methods like ndarray.sum
(combined with np.isnan to mind missing data) to carry out certain
operations.

Categories

Resources