Why does `head` need `()` and `shape` does not? - python

In the following code, I import a csv file into Python's pandas library and display the first 5 rows, and query the 'shape' of the pandas dataframe.
import pandas as pd
data = pd.read_csv('my_file.csv')
data.head() #returns the first 5 rows of the dataframe
data.shape # displays the # of rows and # of columns of dataframe
Why is it that the head() method requires empty parentheses after head but shape does not? Does it have to do with their types?
If I called head without following it with the empty parentheses, I would not get the same result. Is it that head is a method and shape is just an attribute?
How could I generalize the answer to the above question to the rest of Python? I am trying to learn not just about pandas here but about Python in general. For example, a sentence such as: "When _____ is the case, one must include empty parentheses even if no arguments will be provided, but for other attributes one does not have to."

The reason that head is a method and not an attribute most likely has to do with performance. If head were an attribute, it would mean that every time you wrangle a dataframe, pandas would have to precompute the slice of data and store it in the head attribute, which would be a waste of resources. The same goes for the other methods that are called with empty parentheses.
In the case of shape, it is provided as an attribute because this information is essential to any dataframe manipulation, so it is kept up to date and made available as an attribute.

When you call data.head(), you are calling the method head(self) on the object data.
When you write data.shape, however, you are referencing a public attribute of the object data.
It is good to keep in mind that there is a distinct difference between methods and object attributes. You can read up on it here.
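To see the same distinction outside pandas, here is a minimal sketch (the Kettle class and its names are made up for illustration): an attribute is plain data stored on the object, while a method is a function bound to the object that you must call with parentheses.
class Kettle:
    def __init__(self, capacity_litres):
        self.capacity_litres = capacity_litres  # attribute: plain data on the object

    def boil(self):
        # method: a function bound to the object, invoked with ()
        return f"Boiling {self.capacity_litres} litres"

k = Kettle(1.7)
print(k.capacity_litres)  # attribute access, no parentheses -> 1.7
print(k.boil())           # method call, parentheses required
print(k.boil)             # without (), you just get the bound method object
In pandas itself, DataFrame.shape is implemented as a property, i.e. a computed attribute, so it is read without parentheses, while head is an ordinary method that you must call.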

Pandas Groupby issue

Python noob and learning.
Running into an issue when I use the Groupby. If I remove the groupby and print result, it is fine. Not sure what the issue is, any help would be greatly appreciated.
import pandas as pd
path1 = "/content/NYC_Jobs_1.csv"
path2 = "/content/NYC_Jobs_2.xlsx"
df1 = pd.read_csv(path1)
df2 = pd.read_excel(path2)
result = df1.merge(df2,on="Job ID",how='outer')
grouped = result.groupby('Job ID')
grouped.to_csv('NYC.csv', index=False)
I'm getting an AttributeError:
AttributeError Traceback (most recent call last)
<ipython-input-1-066a0fd6dfcb> in <module>
9 grouped = result.groupby('Job ID')
10
---> 11 grouped.to_csv('NYC.csv', index=False)
/usr/local/lib/python3.8/dist-packages/pandas/core/groupby/groupby.py in __getattr__(self, attr)
909 return self[attr]
910
--> 911 raise AttributeError(
912 f"'{type(self).__name__}' object has no attribute '{attr}'"
913 )
AttributeError: 'DataFrameGroupBy' object has no attribute 'to_csv'
The problem you've run into is that you've not kept clear in your mind the distinction between a DataFrame and a DataFrameGroupBy object.
If you're new to programming, one thing that may not be clear to you yet is the relationship between classes, objects, attributes, methods, and functions. So the error you got is opaque to you.
Let me translate:
a class represents a 'type of thing', like, say, 'a saucepan'
an object is a member of a class; a specific saucepan is an object, whereas the 'idea of a saucepan' is a class
objects can have attributes; a saucepan has a handle with a specific length, but this doesn't have to be the same for all saucepans
some attributes are not really 'properties' (like the length of the saucepan handle), but rather methods, meaning things you can do with the object. For example: 'cook spaghetti'.
these methods are the same kind of thing as a function, but they only make sense in the context of the object they are part of (good luck trying to cook spaghetti with your bare hands)
I'll illustrate.
The pandas library provides a function called read_csv. This function returns a DataFrame object. All DataFrame objects store data across various attributes representing columns in a table (these are themselves another type of object called a Series). The read_csv function creates a DataFrame that stores data read from a CSV file on disk.
DataFrame objects have a method, merge, which takes another DataFrame as the first argument. This method returns a third DataFrame, which is an amalgam of the two you started with.
DataFrames have a method to_csv which causes the contents of the DataFrame to be read out to a CSV file.
DataFrames have a method groupby which returns a DataFrameGroupBy object.
Now, DataFrameGroupBy objects do not have a method called to_csv. I hope that you can now understand the meaning of the error: AttributeError: 'DataFrameGroupBy' object has no attribute 'to_csv'.
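You can check this for yourself with the result variable from the question: the object returned by groupby is an instance of a different class, and that class simply does not provide to_csv.
print(type(result).__name__)                        # DataFrame
print(type(result.groupby('Job ID')).__name__)      # DataFrameGroupBy
print(hasattr(result, 'to_csv'))                    # True
print(hasattr(result.groupby('Job ID'), 'to_csv'))  # False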
The way forward
The DataFrameGroupBy object is easy to confuse with the DataFrame. Not only does it have a similar name and similar data-holding attributes, but pandas also often takes the approach of having DataFrame methods return new DataFrames.
Why? This allows method chaining:
pd.read_csv("/some/file.csv").merge(df2).some_other_method()...
Every (...) marks a new DataFrame being created. The expression is evaluated left to right:
pd.read_csv("/some/file.csv").merge(df2).some_other_method()...
-> df1.merge(df2).some_other_method()...
-> df3.some_other_method()...
-> df4...
Neat. Neat, but confusing if you aren't keeping track. Especially confusing if, just as you're getting used to it, one of these methods doesn't return a DataFrame at all, but rather some other kind of object.
The purpose of a DataFrameGroupBy object is to store groups of data ready for aggregation. They have a variety of methods available to do the aggregation, which you can read about in the documentation.
Here is a tutorial on how to use groupby properly. Some examples would be:
count_jobs = result.groupby("Job ID").count()
max_some_column_name = result.groupby("Job ID")["Some Column Name"].max()
The first of these directly aggregates the data; the second selects another column from the grouped data and finds its maximum value for each Job ID.
In the second case, the output will be a Series object. Since such objects do have a to_csv method, you could successfully write the data:
result.groupby("Job ID")["Some Column Name"].max().to_csv("output.csv")

Why is a cell value coming as a series when you do dataframe[dataframe[ColumnName]==some_value]?

I am having the same problem as described here Cell value coming as series in pandas
While a solution is provided there, there is no explanation of why a cell would be a Series when I would expect it to be a string (which I understand is dtype=object).
My dataframe has columns as below
Serial Number object
Device ID int64
Device Name object
I am extracting a row:
device=s_devices[s_devices['Device ID']==17177529]
print(device['Device ID'])
prints fine as I would expect
17177529
print(device['Device Name'])
prints like below, like a Series:
49 10.112.165.182
Name: Device Name, dtype: object
What can be done? I can see that I could use ".values" to get only the IP, 10.112.165.182, but I am wondering what is causing the difference between dtype float and dtype object, at import or elsewhere. I am reading from Excel.
As far as I understand, your code should always output a Series, so the problem is probably in the code you are not describing. Also, the question you are referring to uses ix (which doesn't exist in the latest versions of pandas), which suggests the pandas version may also be an issue.
By the way, I don't think values is a good choice for your case, because it is used when you want an array, not an element (also, values is not recommended anymore).
If you just want to extract an element, try:
# If there are multiple elements, the first one will be extracted.
print(device['Device Name'].iloc[0])
or
# If there are multiple elements, ValueError will be raised.
print(device['Device Name'].item())
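For illustration, here is a small self-contained sketch with made-up data showing why the filtered result is a Series, and how .iloc[0] gets you back to a plain value:
import pandas as pd

# Toy data standing in for the real s_devices frame
s_devices = pd.DataFrame({
    "Device ID": [17177529, 20000001],
    "Device Name": ["10.112.165.182", "10.0.0.1"],
})

device = s_devices[s_devices["Device ID"] == 17177529]  # boolean filter -> DataFrame
print(type(device).__name__)                  # DataFrame (here with a single row)
print(type(device["Device Name"]).__name__)   # Series (even though it holds one value)
print(device["Device Name"].iloc[0])          # 10.112.165.182 -> a plain string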

When should I worry about using copy() with a pandas DataFrame?

I'm more of an R user and have recently been "switching" to Python. So that means I'm way more used to the R way of dealing with things. In Python, the whole concept of mutability and passing by assignment is kind of hard to grasp at first.
I can easily understand the issues that mutability may lead to when using lists or dictionaries. However, when using pandas DataFrames, I find that mutability is specially difficult to understand.
For example: let's say I have a DataFrame (df) with some raw data. I want to use a function that receives df as a parameter and outputs a modified version of that df, but keeping the original df. If I wrote the function, maybe I can inspect it and be assured that it makes a copy of the input before applying any manipulation. However, if it's a function I don't know (let's say, from some package), should I always pass my input df as df.copy()?
In my case, I'm trying to write some custom function that transforms a df using a WoE encoder. The data parameter is a DataFrame with feature columns and a label column. It kinda looks like this:
import category_encoders

def my_function(data, var_list, label_column):
    encoder = category_encoders.WOEEncoder(cols=var_list)  # var_list = cols to be encoded
    fit_encoder = encoder.fit(
        X=data[var_list],
        y=data[label_column]
    )
    new_data = fit_encoder.transform(
        data[var_list]
    )
    new_data[label_column] = data[label_column]
    return new_data
So should I be passing data[var_list].copy() instead of data[var_list]? Should I assume that every function that receives a df will modify it in place, or that it will return a different object? I mean, how can I be sure that fit_encoder.transform won't modify data itself? I also learned that pandas sometimes produces views and sometimes not, depending on the operation you apply to whatever subset of the df. So I feel like there's too much uncertainty surrounding operations on DataFrames.
The exercise shown at https://www.statology.org/pandas-copy-dataframe/ demonstrates that if you don't use .copy() when manipulating a subset of your dataframe, you could change values in your original dataframe as well. This is not what you want, so you should use .copy() when passing your dataframe to your function.
The example at the link above really illustrates this concept well (and no, I'm not affiliated with their site lol, I was just searching for this answer myself).
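As a minimal sketch of the pitfall with toy data (not the question's encoder): whether a selection shares memory with the original depends on the operation and on your pandas version, which is exactly why .copy() removes the uncertainty.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

sub = df["a"]           # selecting a column may return a view on df's data
sub.iloc[0] = 99        # on older pandas (no copy-on-write) this also changes df

safe = df["a"].copy()   # .copy() always gives you independent data
safe.iloc[1] = 99       # never affects df

print(df)               # whether row 0 changed depends on your pandas version
That version dependence (pandas 2.x can enable copy-on-write, which becomes the default behaviour in 3.0) is why passing df.copy() into code you don't control is the safe, if slightly wasteful, choice.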

Pandas one-line filtering for the entire dataset - how is it achieved?

I am just now diving into this wonderful library and am pretty baffled by how filtering, or even column manipulation, is done and am trying to understand if this is a feature of pandas or of python itself. More precisely:
import pandas
df = pandas.read_csv('data.csv')
# Doing
df['Column'] # would display all values from Column for dataframe
# Even moreso, doing
df.loc[df['Column'] > 10] # would display all values from Column greater than 10
# and is the same with
df.loc[df.Column > 10]
So columns are both attributes and keys, so a DataFrame is both a dict and an object? Or perhaps I am missing some basic Python functionality that I don't know about... And does accessing a column basically loop over the whole dataset? How is this achieved?
Column filtering, column manipulation, and data manipulation in general are features of the pandas library itself. Once you load your data using pd.read_csv, the data set is stored as a pandas DataFrame, a dictionary-like container in which every column is a pandas Series object. You can access a column either as an attribute (df.columnname) or as a key (df['columnname']), and either way you can then apply methods like .head(), .tail() or .isna(), or read attributes like .shape. Looking up a column does not scan every row of the data; it matches the name you supply against the DataFrame's column labels, much like a dictionary key lookup. If the name is found, that column is returned; otherwise you get a KeyError or an AttributeError, depending on how you accessed it.
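To tie this back to plain Python: the bracket syntax works because DataFrame implements __getitem__, and the attribute syntax works because it implements __getattr__ as a fallback for column names. Here is a toy sketch of the idea (not pandas' actual implementation):
class TinyFrame:
    def __init__(self, columns):
        self._columns = columns           # dict: column name -> list of values

    def __getitem__(self, name):          # enables tf['Column']
        return self._columns[name]        # raises KeyError if the name is missing

    def __getattr__(self, name):          # fallback that enables tf.Column
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name)

tf = TinyFrame({"Column": [5, 12, 20]})
print(tf["Column"])   # [5, 12, 20]
print(tf.Column)      # [5, 12, 20]
The filtering line works on the same principles: df['Column'] > 10 is a vectorised comparison that produces a boolean Series (one True/False per row), and df.loc[...] uses that mask to keep only the rows marked True, so there is no explicit loop in your code even though pandas iterates over the column's values internally in optimised code.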

Understanding parsing function here

I have googled it, but it is not clear what the parse function is doing here; at least, I do not quite understand it. If someone could clarify it for me, I would be grateful.
Data = pd.ExcelFile(filename[0])
ncols = Data.book.sheet_by_index(0).ncols #class book google it
Data_df = Data.parse(0, converters={i : str for i in range(ncols-1)}, encoding="utf-8")
I presume that the snippet presented was preceded by
import pandas as pd
The ExcelFile class is described here, in the Pandas documentation. The ExcelFile.parse function is a thin wrapper around pd.read_excel; the converters argument is described in the last link:
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
The book object accessed in line 2 is part of the xlrd package, which is the underlying implementation used by pandas to read Excel files. It is documented here, and the sheet_by_index method here (although those just do what you might expect); the ncols field of a Sheet is documented here, and it simply returns the number of columns in the sheet, ignoring trailing empty columns.
In short, range(ncols-1) will produce the indices of all the columns except the last one, so the converters dictionary {i : str for i in range(ncols-1)} has the effect of treating every column except the last as a simple string, instead of attempting to parse each cell to decide its datatype.
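For illustration, here is a roughly equivalent sketch using pd.read_excel directly (the filename is a placeholder, and the column count is taken from the header row rather than from xlrd's book object; note that recent pandas versions no longer accept an encoding argument here):
import pandas as pd

filename = "my_file.xlsx"  # placeholder

# Read only the header row to find out how many columns the sheet has.
ncols = pd.read_excel(filename, sheet_name=0, nrows=0).shape[1]

# Treat every column except the last as a plain string, instead of letting
# pandas guess a dtype for each cell.
df = pd.read_excel(filename, sheet_name=0,
                   converters={i: str for i in range(ncols - 1)})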
