I have just recently started on Python data science and noticed that I can access the columns of a dataset in two ways. I was wondering if there is an advantage to using one method over the other, or can they be used interchangeably?
import seaborn
iris = seaborn.load_dataset('iris')
print(iris.species)
print(iris['species'])
Both print statements give the same output in Jupyter.
There is no difference. iris is a pandas DataFrame, and these are two different ways to access a column in a DataFrame.
Try this:
iris['species'] is iris.species
# True (both return the same cached object here; on newer pandas versions each access may return a fresh object, so this can be False)
You can use either method, but I find the indexing approach (iris['species']) more versatile: you can use it to access columns whose names contain spaces, you can use it to create new columns, and you will never accidentally retrieve a DataFrame method or attribute (e.g. iris.shape) instead of a column.
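For example (continuing from the iris frame loaded above; the 'petal size' and 'is_setosa' columns are hypothetical ones added just for illustration):

iris['petal size'] = iris['petal_length'] * iris['petal_width']   # bracket syntax handles spaces
print(iris['petal size'])    # works
# print(iris.petal size)     # SyntaxError -- attribute access cannot express this name

iris['is_setosa'] = iris['species'] == 'setosa'   # bracket syntax can create new columns

print(iris.shape)   # the (rows, columns) attribute, NOT a column named 'shape'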
Also see answers to these questions:
In pandas, what's the difference between df['column'] and df.column?
For Pandas DataFrame, what's the difference between using squared brackets or dot to access a column?
Both methods of accessing the column are equivalent.
The main advantage of accessing a column via its key (e.g. iris['species']) is that the column name can contain spaces.
For example, you could access a hypothetical 'plant color' column as iris['plant color'], but iris.plant color is invalid syntax, so attribute access cannot reach that column.
I have 2 dataframes that I am trying to map to each other: one I created from a CSV file and the other from an Excel file. I am trying to map (something like a VLOOKUP in Excel) the name in the one df to the respective code in the other.
When I did the same task with another dataframe whose 'key' column values were of type integer, the process worked well. In this case, both columns are of dtype object, and only one value is mapped; the others come back as NaN. The one value that maps successfully contains two decimal points (for example 9.5.5), which is presumably why pandas treats the column as object rather than integer.
I have attempted the following:
Changing the dtypes of both columns to strings and then trying to map them:
df_1['code'] = df_1['code'].astype(str)
df_2['code'] = df_2['code'].astype(str)
I adjusted the index of both so that the map function works with index instead of columns:
df_1.set_index('code', inplace=True)
df_2.set_index('code', inplace=True)
The mapping was done using the following code:
df_1['code_name'] = df_1.index.map(df_2['code_name'])
Not too sure what else is possible. I cannot use the .apply or .applymap functions, as Python states that my data is a DataFrame type. I also attempted to use .squeeze() to convert df_2 to a Series and still got the same results: NaN for the majority, with only certain values mapped. If it helps, the values that were mapped are aligned to the left when opened in Excel, while those unmapped with NaN are aligned to the right (perhaps treated as numbers).
If possible, I prefer to use the .map function as opposed to .merge, as it is faster.
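One likely fix: normalize both key columns before mapping. Object-dtype keys that only partially match usually differ in invisible ways, such as stray whitespace or float formatting ('100' vs '100.0'), which would also explain the left/right alignment difference in Excel. A minimal sketch, with hypothetical sample values standing in for the CSV and Excel data:

import pandas as pd

df_1 = pd.DataFrame({'code': ['9.5.5', ' 100 ', '200.0']})   # hypothetical CSV side
df_2 = pd.DataFrame({'code': ['9.5.5', '100', '200'],        # hypothetical Excel side
                     'code_name': ['alpha', 'beta', 'gamma']})

def clean_key(s):
    # Force text, trim whitespace, and strip a trailing '.0' left over
    # from float parsing (e.g. 100 read in as 100.0)
    return (s.astype(str)
             .str.strip()
             .str.replace(r'\.0$', '', regex=True))

df_1['code'] = clean_key(df_1['code'])
df_2['code'] = clean_key(df_2['code'])

df_1['code_name'] = df_1['code'].map(df_2.set_index('code')['code_name'])
print(df_1)   # all three codes now map instead of returning NaN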
I'm more of an R user and have recently been "switching" to Python. So that means I'm way more used to the R way of dealing with things. In Python, the whole concept of mutability and passing by assignment is kind of hard to grasp at first.
I can easily understand the issues that mutability may lead to when using lists or dictionaries. However, when using pandas DataFrames, I find that mutability is especially difficult to understand.
For example: let's say I have a DataFrame (df) with some raw data. I want to use a function that receives df as a parameter and outputs a modified version of that df, but keeping the original df. If I wrote the function, maybe I can inspect it and be assured that it makes a copy of the input before applying any manipulation. However, if it's a function I don't know (let's say, from some package), should I always pass my input df as df.copy()?
In my case, I'm trying to write some custom function that transforms a df using a WoE encoder. The data parameter is a DataFrame with feature columns and a label column. It kinda looks like this:
import category_encoders

def my_function(data, var_list, label_column):
    encoder = category_encoders.WOEEncoder(cols=var_list)  # var_list = cols to be encoded
    fit_encoder = encoder.fit(
        X=data[var_list],
        y=data[label_column]
    )
    new_data = fit_encoder.transform(
        data[var_list]
    )
    new_data[label_column] = data[label_column]
    return new_data
So should I be passing data[var_list].copy() instead of data[var_list]? Should I assume that every function that receives a df will modify it in place, or will it return a different object? I mean, how can I be sure that fit_encoder.transform won't modify data itself? I also learned that pandas sometimes produces views and sometimes not, depending on the operation you apply to whatever subset of the df. So I feel like there's too much uncertainty surrounding operations on DataFrames.
The exercise at https://www.statology.org/pandas-copy-dataframe/ shows that if you don't use .copy() when manipulating a subset of your dataframe, you can change values in your original dataframe as well. This is usually not what you want, so you should use .copy() when passing your dataframe to your function.
The example at the link above really illustrates this concept well (and no, I'm not affiliated with their site lol, I was just searching for this answer myself).
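To make that concrete, here is a small illustration (the behavior shown is pandas' legacy semantics; under copy-on-write, which becomes the default in pandas 3.0, the original frame is protected automatically):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

col = df['a']          # may share memory with df
col.iloc[0] = 999      # under legacy semantics this can change df['a'] as well

col_copy = df['a'].copy()    # an independent copy, always safe to mutate
col_copy.iloc[0] = -1
print(df['a'].iloc[0])       # unaffected by the mutation of the copy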
I am just now diving into this wonderful library and am pretty baffled by how filtering, or even column manipulation, is done, and I am trying to understand whether this is a feature of pandas or of Python itself. More precisely:
import pandas
df = pandas.read_csv('data.csv')
# Doing
df['Column']  # displays all values from the 'Column' column of the dataframe
# Even more so, doing
df.loc[df['Column'] > 10]  # displays all rows where 'Column' is greater than 10
# which is the same as
df.loc[df.Column > 10]
So columns are both attributes and keys, so a DataFrame is both a dict and an object? Or perhaps I am missing some basic Python functionality that I don't know about... And does accessing a column loop over the whole dataset? How is this achieved?
Column filtering, column manipulation, and data manipulation in general are features of the pandas library itself. Once you load your data using pd.read_csv, the data set is stored as a pandas DataFrame, a dictionary-like container in which every column is a pandas Series object. You can access a column either as an attribute (df.columnname) or as a key (df['columnname']), and you can apply methods like .head(), .tail(), or .isna() to the result either way. Accessing a column does not loop over the whole dataset: pandas looks the requested name up in its column index, essentially a hash lookup against the column labels. If the name matches, the column is returned; otherwise it throws a KeyError or an AttributeError, depending on how you accessed it.
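To see how the attribute style can work at all, note that it is plain Python: pandas defines __getattr__, which is invoked only when normal attribute lookup fails, and uses it to fall back to a column lookup. A stripped-down sketch of the idea (not pandas' actual implementation):

class TinyFrame:
    def __init__(self, columns):
        # label -> list of values; a column lookup is a dict (hash) lookup,
        # not a scan over the data
        self._columns = dict(columns)

    def __getitem__(self, name):    # enables tf['Column']
        return self._columns[name]  # KeyError if absent

    def __getattr__(self, name):    # enables tf.Column
        # called only when regular attribute lookup fails, so real
        # methods and attributes always win over columns
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name)

tf = TinyFrame({'Column': [5, 12, 7]})
print(tf['Column'])   # [5, 12, 7]
print(tf.Column)      # [5, 12, 7], via the __getattr__ fallback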
In the following code, I import a csv file into Python's pandas library and display the first 5 rows, and query the 'shape' of the pandas dataframe.
import pandas as pd
data = pd.read_csv('my_file.csv')
data.head() #returns the first 5 rows of the dataframe
data.shape # displays the # of rows and # of columns of dataframe
Why is it that the head() method requires empty parentheses after head but shape does not? Does it have to do with their types?
If I called head without following it with the empty parentheses, I would not get the same result. Is it that head is a method and shape is just an attribute?
How could I generalize the answer to the above question to the rest of Python? I am trying to learn not just about pandas here but about Python in general. For example, a sentence such as: "When _____ is the case, one must include empty parentheses even if no arguments will be provided, but for other attributes one does not have to."
The reason that head is a method and not an attribute most likely has to do with performance. If head were an attribute, pandas would have to precompute that slice of data every time you wrangle a dataframe and store it, which would be a waste of resources; as a method, the slice is computed only when you ask for it, and you can pass an argument (e.g. data.head(10)). The same goes for the other methods called with empty parentheses.
shape, by contrast, is exposed as an attribute (implemented as a property) because it is essential to almost any dataframe manipulation and is cheap to produce: accessing it just reads off the number of rows and columns.
When you call data.head() you are calling the method head(self) on the object data. However, when you write data.shape, you are referencing a public attribute of the object data.
It is good to keep in mind that there is a distinct difference between methods and object attributes. You can read up on it here.
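A minimal sketch of the general rule, outside pandas (the class and its members are invented for illustration):

class Dataset:
    def __init__(self, rows):
        self._rows = rows
        self.shape = (len(rows),)   # plain attribute: stored data, no ()

    @property
    def size(self):                 # property: computed on access, still no ()
        return len(self._rows)

    def head(self, n=5):            # method: does its work each time it is called
        return self._rows[:n]

d = Dataset(list(range(100)))
print(d.shape)    # (100,)
print(d.size)     # 100
print(d.head())   # [0, 1, 2, 3, 4]
print(d.head)     # <bound method Dataset.head of ...> -- the function itself, not its result

Incidentally, pandas' DataFrame.shape is implemented as a property, so it is computed cheaply on access yet used without parentheses.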
I'm using Pandas to store a large dataset that has systematically generated column names. Something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame([[0, 1, 2], [10, 11, 12], [20, 21, 22]], columns=["r0", "r1", "r2"])
These systematic names also have more meaningful names that users would actually understand. So far, I've been mapping them using a dictionary like so:
altName = {"Objective 1":"r0", "Result 5":"r1", "Parameter 2":"r2"}
so that they could then be accessed like this:
print(df[altName["Objective 1"]])
This works, but it leads to very hard to read code (think a plot command with multiple variables, etc.). I can't simply rename the columns to the friendly names because there are times when I need access to both, but I'm not sure how to support both simultaneously without a dictionary.
Is it possible to assign more than one name to a column, or do some sort of implicit mapping that would let me use both of these access methods:
print(df["r0"])
print(df["Objective 1])
I've thought of making my own subclass that would detect a KeyError and then fall back to a secondary dictionary of alternate names and try that, but I wasn't sure I'd be able to do that while preserving all other DataFrame functionality (I'd self-assess my Python as beginner bordering on intermediate).
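Something like this rough, untested sketch is what I had in mind (the _constructor override is my guess at what's needed to keep other DataFrame operations working):

import pandas as pd

class AliasedFrame(pd.DataFrame):
    # friendly name -> systematic column name
    _alt_names = {"Objective 1": "r0", "Result 5": "r1", "Parameter 2": "r2"}

    @property
    def _constructor(self):
        return AliasedFrame   # so slices and copies stay AliasedFrame

    def __getitem__(self, key):
        try:
            return super().__getitem__(key)
        except KeyError:
            # fall back to the secondary dictionary of alternate names
            return super().__getitem__(self._alt_names[key])

df = AliasedFrame([[0, 1, 2], [10, 11, 12], [20, 21, 22]],
                  columns=["r0", "r1", "r2"])
print(df["r0"])           # systematic name
print(df["Objective 1"])  # friendly alias via the fallback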
Thanks very much for your suggestions.
Yes you can. DataFrames are just wrappers around numpy arrays, so you can create multiple wrappers over the same array (note that whether the constructor copies or shares the data can depend on your pandas version).
An example:
df = pd.DataFrame([[0, 1], [2, 3]], index=list('AB'), columns=list('CD'))
df2 = pd.DataFrame(df.values, index=df.index, columns=list('EF'))
df.loc['A', 'C'] = 999
Then df2 is also affected:
In [407]: df2['E']
Out[407]:
A 999
B 2
Name: E, dtype: int32