using .columns in Pandas - python

Hi, I am using the .columns attribute in pandas and the output starts with Index. Can someone please explain why Index appears at the beginning?

This has been answered before in "Index objects in pandas--why pd.columns returns index rather than list". The official documentation describes Index as:
"Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all pandas objects"
If you just want the values, you have to go one step further. .columns returns an Index, .columns.values returns a NumPy array, and the helper method .tolist returns a plain list of column names.
car_sales.columns.values.tolist()
You can also use car_sales.columns.tolist() directly, but it may be slower on large DataFrames.
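As a minimal sketch of the conversions described above (the contents of car_sales are made up for illustration, since the original data is not shown):

```python
import pandas as pd

# Hypothetical data; the real car_sales DataFrame is not shown in the question.
car_sales = pd.DataFrame({"make": ["Toyota", "Honda"], "price": [4000, 5000]})

cols = car_sales.columns                  # a pandas Index, which is why "Index" is printed
as_array = car_sales.columns.values       # NumPy array of the labels
as_list = car_sales.columns.tolist()      # plain Python list of the labels

print(type(cols).__name__)
print(as_list)
```

list(car_sales.columns) gives the same list if you prefer the built-in.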

Related

Getting an error when converting to float to get top 10 largest values

I am trying to use the nlargest function to return the top 10 values, using the code below:
df['roi'].astype(float).nlargest(3, 'roi')
But I get this error:
ValueError: keep must be either "first", "last" or "all"
The roi column has object dtype, which is why I use astype(float), but I am still getting an error.
When I try passing keep='all', keep='first' or keep='last' to nlargest, I get TypeError: nlargest() got multiple values for argument 'keep'
Thanks!
To use the method as you want, you must change your code to:
df.astype(float).nlargest(3, 'roi')
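This syntax works only for pandas.DataFrame objects. If you select the column by its key, as in a dictionary, you get a pandas.Series, whose nlargest takes no column argument, so the second positional argument 'roi' is interpreted as keep (hence the ValueError). For a Series, the correct syntax is: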
df['roi'].astype(float).nlargest(3)
See the documentation for DataFrame.nlargest and Series.nlargest.
For a one-liner, convert "roi" to float first and then call nlargest. Passing a dictionary to .astype returns the entire DataFrame with only the specified columns' dtypes changed, so .nlargest can then be called on that DataFrame (instead of on a Series):
df.astype({"roi": float}).nlargest(3, columns="roi")
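A small sketch comparing the Series and DataFrame forms; the sample frame and its "name" column are invented for illustration:

```python
import pandas as pd

# Hypothetical data: "roi" is stored as strings, i.e. object dtype.
df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "roi": ["1.5", "3.2", "0.7", "2.9"]})

# Series.nlargest: no column argument, only n (and optionally keep).
top_series = df["roi"].astype(float).nlargest(3)

# DataFrame.nlargest: the column to sort by must be given.
top_frame = df.astype({"roi": float}).nlargest(3, columns="roi")

print(top_series)
print(top_frame)
```

The Series form returns only the roi values, while the DataFrame form keeps the other columns of the matching rows.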

Is there a way to use apply() to create two columns in pandas dataframe?

I have a function returning a tuple of values, as an example:
def dumb_func(number):
    return number+1,number-1
I'd like to apply it to a pandas DataFrame
df=pd.DataFrame({'numbers':[1,2,3,4,5,6,7]})
test=df['numbers'].apply(dumb_func)
The result is that test is a pandas series containing tuples.
Is there a way to use the variable test, or to replace it, so that the results of the function are assigned to two distinct columns 'number_plus_one' and 'number_minus_one' of the original DataFrame?
df[['number_plus_one', 'number_minus_one']] = pd.DataFrame(zip(*df['numbers'].apply(dumb_func))).transpose()
To understand, try taking it apart piece by piece. Have a look at zip(*df['numbers'].apply(dumb_func)) in isolation (you'll need to convert it to a list). You'll see how it unpacks the tuples one by one and creates two separate sequences out of them. Then have a look at what happens when you create a DataFrame out of it, and you'll see why the transpose is necessary. For more on zip, see docs.python.org/3.8/library/functions.html#zip
Method 1: When you don't use dumb function,
df[['numbers_plus_one','numbers_minus_one']]=pd.DataFrame(df.apply(lambda x: (x['numbers']+1,x['numbers']-1),axis=1).values.tolist())
Method 2: When you have test(i.e. series of tuples you mentioned in question)
df[['numbers_plus_one','numbers_minus_one']]=pd.DataFrame(test.values.tolist())
I hope this is helpful
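For reference, a self-contained sketch of Method 2, using the column names from the question. Passing index=df.index is an addition of mine so that the rows still line up if df does not have the default 0..n-1 index:

```python
import pandas as pd

def dumb_func(number):
    return number + 1, number - 1

df = pd.DataFrame({"numbers": [1, 2, 3, 4, 5, 6, 7]})

# Series of tuples, as in the question.
test = df["numbers"].apply(dumb_func)

# Expand the tuples into a two-column DataFrame and assign both columns at once.
df[["number_plus_one", "number_minus_one"]] = pd.DataFrame(test.tolist(), index=df.index)

print(df)
```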

Size immutability in pandas data structure

While going through pandas Documentation for version 0.24.1 here, I came across this statement.
"All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame."
import pandas as pd
test_s = pd.Series([1,2,3])
id(test_s) # output: 140485359734400 (will vary)
len(test_s) # output: 3
test_s[3] = 37
id(test_s) # output: 140485359734400
len(test_s) # output: 4
The meaning of size immutable as per my inference is that operations like appending and deleting an element are not allowed, which is clearly not the case. Even the identity of the object remains the same, ruling out the possibility of a new object creation with the same name.
So, what does size immutability actually mean?
Appending and deleting are allowed, but that doesn't necessarily imply the Series is size-mutable.
Series/DataFrames are internally backed by NumPy arrays, which have a fixed size (their values can be changed in place, but their length cannot), allowing a more compact memory representation and better performance.
When you assign to a label that does not exist yet, you're actually calling Series.__setitem__ (which then delegates to the .loc setter), and that creates a new, larger array. This new array is then assigned back to the same Series object (of course, as the end user, you don't get to see this), giving you the illusion of size mutability.
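One way to observe this, as a rough sketch: hold a reference to the underlying array before the assignment and compare it with the array afterwards. The variable names are mine, and the exact internals may differ between pandas versions.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
series_id_before = id(s)
values_before = s.to_numpy()   # reference to the data before enlargement

s[3] = 37                      # label 3 does not exist yet: setting with enlargement

values_after = s.to_numpy()
same_series = id(s) == series_id_before

print(same_series)                            # the Series object is the same
print(len(values_before), len(values_after))  # the old array kept its length
```

The Series object keeps its identity, but the array captured before the assignment still has three elements, so the four-element data must live in a different array.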
"All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable."
In my opinion, this already states that all pandas data structures (Series, DataFrames) are value-mutable, which means the values in a Series or DataFrame can be changed.
"The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame."
As I read this statement, a Series always has exactly one column: we cannot add a new column to it, and we cannot delete the one it has, whereas we can add and delete columns in a DataFrame. So "length" here would mean the number of columns, not the number of values.

Why does `head` need `()` and `shape` does not?

In the following code, I import a csv file into Python's pandas library and display the first 5 rows, and query the 'shape' of the pandas dataframe.
import pandas as pd
data = pd.read_csv('my_file.csv')
data.head() #returns the first 5 rows of the dataframe
data.shape # displays the # of rows and # of columns of dataframe
Why is it that the head() method requires empty parentheses after head but shape does not? Does it have to do with their types?
If I called head without following it with the empty parentheses, I would not get the same result. Is it that head is a method and shape is just an attribute?
How could I generalize the answer to the above question to the rest of Python? I am trying to learn not just about pandas here but Python in general. For example, a sentence such as "When _____ is the case, one must include empty parentheses if no arguments will be provided, but for other attributes one does not have to"?
The reason that head is a method and not an attribute most likely has to do with performance. If head were a stored attribute, then every time you changed a DataFrame, pandas would have to recompute that slice of the data and store it, which would be a waste of resources. The same goes for the other methods called with empty parentheses.
shape, on the other hand, is essential to almost any DataFrame manipulation and cheap to derive from the index and columns, so it is exposed as an attribute (in pandas it is implemented as a property, so it takes no parentheses).
When you call data.head(), you are calling the method head(self) on the object data.
When you write data.shape, however, you are referencing a public attribute of the object data.
In general, parentheses are needed whenever you want to call something (a function or a method), even with no arguments; plain attributes and properties are accessed without them. It is good to keep this distinction between methods and attributes in mind.
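To make the distinction concrete in plain Python, here is a toy class (Table is a made-up name, not a pandas type) with one property and one method:

```python
class Table:
    """Toy stand-in for a DataFrame, holding a list of rows."""

    def __init__(self, rows):
        self._rows = rows

    @property
    def shape(self):
        # Accessed without parentheses: t.shape
        n_cols = len(self._rows[0]) if self._rows else 0
        return (len(self._rows), n_cols)

    def head(self, n=5):
        # Must be called, because it takes an (optional) argument: t.head() or t.head(2)
        return self._rows[:n]


t = Table([[1, 2], [3, 4], [5, 6]])
shape = t.shape          # a tuple
first_two = t.head(2)    # the result of calling the method
bound = t.head           # without parentheses: the method object itself, not its result

print(shape, first_two, callable(bound))
```

Writing t.head without parentheses is not an error; it just gives you the method object instead of running it.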

How to create a view of dataframe in pandas?

I have a large dataframe (10m rows, 40 columns, 7GB in memory). I would like to create a view in order to have a shorthand name for a view that is complicated to express, without adding another 2-4 GB to memory usage. In other words, I would rather type:
df2
Than:
df.loc[complicated_condition, some_columns]
The documentation states that, while using .loc ensures that setting values modifies the original dataframe, there is still no guarantee as to whether the object returned by .loc is a view or a copy.
I know I could assign the condition and column list to variables (e.g. df.loc[cond, cols]), but I'm generally curious to know whether it is possible to create a view of a dataframe.
Edit: Related questions:
What rules does Pandas use to generate a view vs a copy?
Pandas: Subindexing dataframes: Copies vs views
You generally can't return a view.
Your answer lies in the pandas docs, under "Returning a view versus a copy":
"Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned."
(Note that .ix has since been removed from pandas; .loc and .iloc are the replacements.)
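Since a true view is generally not available for this kind of selection, one possible workaround is to keep only the condition and the column list around and build the selection on demand. This is a sketch with invented data and names (cond, cols, df2), not a pandas feature:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [10, 20, 30, 40],
                   "c": list("wxyz")})

# Store the "recipe" for the selection rather than the selection itself.
cond = df["a"] > 2       # boolean mask: one value per row, cheap compared to the data
cols = ["a", "b"]

def df2():
    """Shorthand for the selection; materialized only when called."""
    return df.loc[cond, cols]

before = df2()

# Writing through .loc on the original modifies df itself.
df.loc[cond, "b"] = 0
after = df2()

print(before)
print(after)
```

Each call to df2() still creates a temporary copy of the selected rows, but nothing extra stays in memory between calls apart from the mask.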
