Using Pandas DataFrame with Multi-name Columns

I'm using Pandas to store a large dataset that has systematically generated column names. Something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame([[0,1,2],[10,11,12],[20,21,22]],columns=["r0","r1","r2"])
Each systematic name also has a more meaningful name that users would actually understand. So far, I've been mapping them with a dictionary like so:
altName = {"Objective 1":"r0", "Result 5":"r1", "Parameter 2":"r2"}
so that they could then be accessed like this:
print(df[altName["Objective 1"]])
This works, but it leads to very hard-to-read code (think of a plot command with multiple variables, etc.). I can't simply rename the columns to the friendly names because there are times when I need access to both, but I'm not sure how to support both simultaneously without a dictionary.
Is it possible to assign more than one name to a column, or do some sort of implicit mapping that would let me use both of these access methods:
print(df["r0"])
print(df["Objective 1])
I've thought of making my own subclass that would catch a KeyError and then fall back to a secondary dictionary of alternate names and try that, but I wasn't sure I could do that while preserving all other DataFrame functionality (I'd self-assess my Python as beginner bordering on intermediate).
Thanks very much for your suggestions.

Yes, you can. DataFrames are essentially wrappers around NumPy arrays, so you can create multiple wrappers around the same underlying array:
An example:
df = pd.DataFrame([[0, 1], [2, 3]], index=list('AB'), columns=list('CD'))
df2 = pd.DataFrame(df.values, index=df.index, columns=list('EF'))
df.loc['A', 'C'] = 999
Then df2 is also affected, because both DataFrames share the same underlying array (this works here because every column has the same dtype, so df.values returns a view rather than a copy):
In [407]: df2['E']
Out[407]:
A 999
B 2
Name: E, dtype: int32
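As a lighter-weight alternative to the subclass idea in the question, a small helper function can accept either name without touching DataFrame internals. This is just a sketch; the col helper is made up for illustration, not a pandas API:
altName = {"Objective 1": "r0", "Result 5": "r1", "Parameter 2": "r2"}

def col(df, name):
    # Use the label directly if it is a real column; otherwise
    # fall back to the friendly-name dictionary.
    return df[name] if name in df.columns else df[altName[name]]

print(col(df, "r0"))            # systematic name
print(col(df, "Objective 1"))   # friendly name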

Related

When should I worry about using copy() with a pandas DataFrame?

I'm more of an R user and have recently been "switching" to Python, so I'm much more used to the R way of doing things. In Python, the whole concept of mutability and pass-by-assignment is hard to grasp at first.
I can easily understand the issues that mutability may lead to when using lists or dictionaries. However, with pandas DataFrames, I find mutability especially difficult to understand.
For example: let's say I have a DataFrame (df) with some raw data. I want to use a function that receives df as a parameter and outputs a modified version of that df, while keeping the original df intact. If I wrote the function myself, I can inspect it and be assured that it makes a copy of the input before applying any manipulation. However, if it's a function I don't know (say, from some package), should I always pass my input df as df.copy()?
In my case, I'm trying to write a custom function that transforms a df using a WoE (weight of evidence) encoder. The data parameter is a DataFrame with feature columns and a label column. It looks something like this:
import category_encoders

def my_function(data, var_list, label_column):
    # var_list = columns to be encoded
    encoder = category_encoders.WOEEncoder(cols=var_list)
    fit_encoder = encoder.fit(
        X=data[var_list],
        y=data[label_column]
    )
    new_data = fit_encoder.transform(
        data[var_list]
    )
    new_data[label_column] = data[label_column]
    return new_data
So should I be passing data[var_list].copy() instead of data[var_list]? Should I assume that every function that receives a df will modify it in place, or that it will return a different object? I mean, how can I be sure that fit_encoder.transform won't modify data itself? I also learned that pandas sometimes produces views and sometimes copies, depending on the operation you apply to a given subset of the df. So I feel like there's too much uncertainty surrounding operations on DataFrames.
The exercise at https://www.statology.org/pandas-copy-dataframe/ shows that if you don't use .copy() when manipulating a subset of your dataframe, you can change values in your original dataframe as well. That is usually not what you want, so you should use .copy() when passing your dataframe to your function.
The example at the link above really illustrates this concept well (and no, I'm not affiliated with their site lol, I was just searching for this answer myself).
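To make that concrete, here is a minimal sketch (my own, not from the linked page) showing that an explicit deep copy is independent of the source frame:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

safe = df.copy()        # deep copy: its data are independent of df
safe.loc[0, "a"] = 999
print(df.loc[0, "a"])   # still 1; the original is untouched

# Whether plain indexing returns a view or a copy depends on the
# operation and the pandas version, which is exactly why an explicit
# .copy() is the safe choice before mutating.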

Pandas one-line filtering for the entire dataset - how is it achieved?

I am just now diving into this wonderful library and am pretty baffled by how filtering, or even column manipulation, is done, and I am trying to understand whether this is a feature of pandas or of Python itself. More precisely:
import pandas
df = pandas.read_csv('data.csv')
# Doing
df['Column']  # would display all values from Column in the dataframe
# Even more so, doing
df.loc[df['Column'] > 10]  # would display all rows where Column is greater than 10
# and the same with
df.loc[df.Column > 10]
So columns are both attributes and keys, so a DataFrame is both a dict and an object? Or perhaps I am missing some basic Python functionality that I don't know about... And does accessing a column basically loop over the whole dataset? How is this achieved?
Column filtering and data manipulation in general are features of the pandas library itself. Once you load your data with pd.read_csv, the dataset is stored as a pandas DataFrame, a dictionary-like container in which every column is a pandas Series object. You can access a column either as an attribute (df.columnname) or as a key (df['columnname']), and methods like .head(), .tail(), .shape, or .isna() work the same either way. Accessing a column does not scan the whole dataset: the name is looked up in the DataFrame's column index, and if no column matches, you get a KeyError or an AttributeError, depending on which access style you used.
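As for the one-line filtering itself: df['Column'] > 10 is an overloaded element-wise comparison that returns a boolean Series, and .loc then keeps the rows where that Series is True. A minimal sketch with made-up data (since we don't have the asker's data.csv):
import pandas as pd

df = pd.DataFrame({"Column": [5, 15, 25]})

mask = df["Column"] > 10   # element-wise comparison -> boolean Series
print(mask)                # False, True, True
print(df.loc[mask])        # keeps only the rows where the mask is True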

What is the difference between calling "iris.species" and "iris['species']"?

I have just recently started on Python data science and noticed that I can access the columns of a dataset in two ways. I was wondering if there is an advantage to using one method over the other, or can they be used interchangeably?
import seaborn
iris = seaborn.load_dataset('iris')
print(iris.species)
print(iris['species'])
Both print statements give the same output in Jupyter
There is no difference. iris is a pandas DataFrame, and these are two different ways to access a column in a DataFrame.
Try this:
iris['species'] is iris.species
# True in older pandas versions, where repeated column access returns a
# cached object; with copy-on-write (pandas 2.x+) each access may return
# a new object, so this can be False
You can use either method, but I find the indexing approach (iris['species']) is more versatile, e.g. you can use it to access columns whose names contain spaces, you can use it to create new columns, and you won't ever accidentally retrieve a dataframe method or attribute (e.g. iris.shape) instead of a column.
Also see answers to these questions:
In pandas, what's the difference between df['column'] and df.column?
For Pandas DataFrame, what's the difference between using squared brackets or dot to access a column?
Both ways of accessing the column are equivalent.
The main advantage of accessing a column via its key (e.g. iris['species']) is that the key can contain spaces.
For example, you could access a 'plant color' column like so: iris['plant color']. However, you cannot write iris.plant color, which is a syntax error.
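To illustrate the versatility point above with a quick sketch (the 'petal area' column is invented for the example):
import seaborn

iris = seaborn.load_dataset('iris')

# Creating a new column requires the bracket syntax; attribute
# assignment would just attach a plain attribute to the object.
iris['petal area'] = iris['petal_length'] * iris['petal_width']
print(iris['petal area'].head())
# iris.petal area would be a SyntaxError: attribute access cannot
# handle spaces in names.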

What is the best way to modify (e.g., perform math functions on) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds another column, using the index number as a string-type column name. Is there something akin to the pandas idiom of, say, df.iloc[-1,:] = df.iloc[-1,:]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
First question
If something works for string column names and not for numeric column names, then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
Second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a Dask dataframe that is stored in many pieces, but methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
In the future, we recommend asking separate questions separately on Stack Overflow.
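For the single-column case specifically, here is a minimal sketch of both approaches (assuming Dask is installed; the frame and column names are made up):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": [1.0, -2.0, 3.0], "y": [4.0, 5.0, 6.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Element-wise arithmetic on a single column is lazy and works directly:
ddf["x"] = ddf["x"] * -1
print(ddf.compute())   # x is negated, y is untouched

# For arbitrary per-partition logic, map_partitions applies a
# pandas-level function to each chunk:
# ddf = ddf.map_partitions(lambda part: part.assign(x=part["x"] * -1))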

Python - Use column name with int and string

I have imported an NBA statistical dataset. But some of my column names mix digits and letters and start with a number, as in "3PP" or "2FG". Therefore, the following code won't work.
for team in nba.3PP:
Because when it runs, it gives an "invalid syntax" error. Is there a special way I can use 3PP, like .\3PP or something, to get it to work? Thanks!
EDIT: Using Pandas dataFrame
You don't say what you've imported the data into. If pandas:
for team in nba['3PP']:
...
This uses item-oriented indexing rather than attribute-oriented indexing. In Python in general the two are not equivalent, but in pandas they can often be used interchangeably.
Use the .get method:
nba.get("3PP")
Or:
nba['3PP']
Depending on whether the dataset is in pandas or whatnot.
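To make the failure mode concrete, a quick sketch (the nba frame here is made up):
import pandas as pd

nba = pd.DataFrame({"3PP": [0.35, 0.41], "2FG": [0.48, 0.52]})

for team in nba["3PP"]:   # bracket indexing accepts any column label
    print(team)

# nba.3PP is a SyntaxError: a Python identifier cannot start with
# a digit, so attribute access can never reach this column.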
