I have three binary-type columns in a dataframe whose values together constitute a meaningful grouping of the data. To refer to the group, I'm currently making a new column with a hard-coded binary encoding, like so:
data['type'] = data['a'] + 2 * data['b'] + 4 * data['c']
Pandas factorize will assign an integer for each distinct value of a sequence, but it doesn't seem to work with combinations of multiple columns. Is there a more general pandas function for situations like this? It would be nice if such a function generalized to K distinct categorical variables of arbitrary number of categories, rather than being limited to binary variables.
If such a thing doesn't exist, would there be interest in a pull request?
Here are two methods you can try:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 0],
                   'b': [0, 1, 0],
                   'c': [1, 1, 1]})
>>> df
a b c
0 1 0 1
1 1 1 1
2 0 0 1
>>> ["".join(row) for row in df[['a', 'b', 'c']].values.astype(str)]
Out[22]: ['101', '111', '001']
>>> [bytearray("".join(row)) for row in df[['a', 'b', 'c']].values.astype(str)]
Out[23]: [bytearray(b'101'), bytearray(b'111'), bytearray(b'001')]
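A more general option that extends to K categorical columns of any cardinality (a sketch using groupby, not one of the two methods above) is ngroup, which assigns one integer per distinct combination:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 0],
                   'b': [0, 1, 0],
                   'c': [1, 1, 1]})

# One integer label per distinct (a, b, c) combination; this works for
# any number of columns and any number of categories per column
df['type'] = df.groupby(['a', 'b', 'c']).ngroup()

# Equivalently, factorize the row tuples directly
codes, uniques = pd.factorize(list(zip(df['a'], df['b'], df['c'])))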
You may want to take a look at patsy, which addresses things like categorical variable encoding and other model-related issues: see the docs.
Patsy offers quite a few encoding schemes, including:
Treatment (default)
Backward difference coding
Orthogonal polynomial contrast coding
Deviation coding (also known as sum-to-zero coding), and
Helmert contrasts
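As a minimal sketch of what this looks like in practice (assuming patsy is installed; the column name group is made up for illustration), the default Treatment coding can be requested like so:
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({'group': ['a', 'b', 'c', 'a']})

# Treatment (dummy) coding is patsy's default for categorical terms
design = dmatrix('C(group, Treatment)', df, return_type='dataframe')
print(design)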
I have a dataframe with locations in column "A" and values in column "B". Locations occur multiple times in this DataFrame. Now I'd like to add a third column in which I store the average of the column "B" values that share the same location in column "A".
-I know that .mean() can be used to get an average
-I know how to filter with .loc
I could make a list of all unique values in column A and compute the average for each of them with a for loop. However, this seems cumbersome to me. Any idea how this can be done more efficiently?
Sounds like what you need is GroupBy. Take a look at the pandas groupby documentation.
Given
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5],
                   'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])
You can use
df.groupby('A').mean()
to group the values based on the common values in column "A" and find the mean.
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
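Since the goal is a third column holding the per-group average, note that transform broadcasts the group means back onto the original rows (a sketch using the same df):
# Align each row with its group's mean and store it as a new column
df['B_mean'] = df.groupby('A')['B'].transform('mean')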
I could make a list of all unique values in column A, and compute the
average for all of them by making a for loop.
This can be done using pandas.DataFrame.groupby; consider the following simple example:
import pandas as pd
df = pd.DataFrame({"A":["X","Y","Y","X","X"],"B":[1,3,7,10,20]})
means = df.groupby('A').agg('mean')
print(means)
which gives the output:
B
A
X 10.333333
Y 5.000000
import pandas as pd
data = {'A': ['a', 'a', 'b', 'c'], 'B': [32, 61, 40, 45]}
df = pd.DataFrame(data)
df2 = df.groupby(['A']).mean()
print(df2)
Based on your description, I'm not sure whether you simply want to calculate the averages for each group, or you want to maintain the long format of your data. I'll break down a solution for each option.
The data I'll use below can be generated by running the following...
import pandas as pd
df = pd.DataFrame([['group1', 2],
                   ['group2', 4],
                   ['group1', 5],
                   ['group2', 2],
                   ['group1', 2],
                   ['group2', 0]], columns=['A', 'B'])
Option 1 - Calculate Group Averages
This one is super simple. It uses the .groupby method, which is the bread and butter of crunching data calculations.
df.groupby('A').B.mean()
Output:
A
group1 3.0
group2 2.0
If you wish for this to return a dataframe instead of a series, you can add .to_frame() to the end of the above line.
Option 2 - Calculate Group Averages and Maintain Long Format
By long format, I mean you want your data to be structured the same as it is currently, but with a third column (we'll call it C) containing a mean that is connected to the A column, i.e.:
A       B  C (average)
group1  2  3
group2  4  2
group1  5  3
group2  2  2
group1  2  3
group2  0  2
Where the averages for each group are...
group1 = (2+5+2)/3 = 3
group2 = (4+2+0)/3 = 2
The most efficient solution would be to use .transform, which behaves like an SQL window function, but I think this method can be a little confusing when you're new to pandas.
import numpy as np
df.assign(C=df.groupby('A').B.transform(np.mean))
A less efficient, but more beginner friendly option would be to store the averages in a dictionary and then map each row to the group average.
I find myself using this option a lot for modeling projects, when I want to impute a historical average rather than the average of my sampled data.
To accomplish this, you can...
Create a dictionary containing the grouped averages
For every row in the dataframe, pass the group name into the dictionary
# Create the group averages
group_averages = df.groupby('A').B.mean().to_dict()
# For every row, pass the group name into the dictionary
new_column = df.A.map(group_averages)
# Add the new column to the dataframe
df = df.assign(C=new_column)
You can also, optionally, do all of this in a single line:
df = df.assign(C=df.A.map(df.groupby('A').B.mean().to_dict()))
I am a beginner in Python and Pandas, and it has been 2 days since I opened Wes McKinney's book. So, this question might be a basic one.
I am using the Anaconda distribution (Python 3.6.6) and Pandas 0.21.0. I researched the following threads (https://pandas.pydata.org/pandas-docs/stable/advanced.html, the xs function at https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-xs, Select only one index of multiindex DataFrame, Selecting rows from pandas by subset of multiindex, and https://pandas.pydata.org/pandas-docs/stable/indexing.html) before posting this. All of them explain how to subset a DataFrame using either a hierarchical index or hierarchical columns, but not both.
Here's the data.
import pandas as pd
import numpy as np
from numpy import nan as NA
#Hierarchical index for row and column
data = pd.DataFrame(np.arange(36).reshape(6, 6),
                    index=[['a']*2 + ['b']*1 + ['c']*1 + ['d']*2,
                           [1, 2, 3, 1, 3, 1]],
                    columns=[['Title1']*3 + ['Title2']*3,
                             ['A']*2 + ['B']*2 + ['C']*2])
data.index.names = ['key1','key2']
data.columns.names = ['state','color']
Here are my questions:
Question 1: I'd like to access key1 = 'a', key2 = 1, state = 'Title1' (column), and color = 'A' (column).
After some trial and error, I found that this version works (I don't really know why; my hypothesis is that data.loc['a',1] gives an indexed dataframe, which is then subset further, and so on):
data.loc['a',1].loc['Title1'].loc['A']
Is there a better way to subset above?
Question 2: How do I subset the data after removing the indices?
data_wo_index = data.reset_index()
I'm relatively comfortable with data.table in R. So, I thought of using http://datascience-enthusiast.com/R/pandas_datatable.html to subset the data using my data.table knowledge.
I tried one step at a time, but even the first step (i.e. subsetting key1 = 'a') gave me an error:
data_wo_index[data_wo_index['key1']=='a']
Exception: cannot handle a non-unique multi-index!
I don't know why Pandas still thinks there is a multi-index. I have already reset it.
Question 3: If I run the data.columns command, I get the following output:
MultiIndex(levels=[['Title1', 'Title2'], ['A', 'B', 'C']],
labels=[[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]],
names=['state', 'color'])
It seems to me that the column names are also indexes. I say this because I see the MultiIndex class, which is also what I see if I run data.index:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 1, 2, 3, 3], [0, 1, 2, 0, 2, 0]],
names=['key1', 'key2'])
I am unsure why the column names are also an object of the MultiIndex class. If they are indeed a MultiIndex, then why do we need to set aside a few columns (e.g. key1 and key2 in our example above) as indices? That is, why can't we just use column-based indices? (As a comparison, with data.table in R, we can setkey on whatever columns we want.)
Question 4: Why are the column names an object of the MultiIndex class? It would be great if someone could offer a theoretical treatment of this.
As a beginner, I'd really appreciate your thoughts. I have spent 3-4 hours researching this topic and have hit a dead end.
First off, MultiIndexes can be tricky to work with, so it's worth considering whether they actually provide enough benefit for what you're doing (in terms of speed/organisation) to make those hassles worthwhile.
To answer your question 1, you can subset a MultiIndexed dataframe by providing tuples of the keys you want for each axis. So your first example subset can be done as:
# We want to use ":" to get all the states, but can't just
# have ":" by itself due to Python's syntax rules
# So pandas provides the IndexSlice object to wrap it in
slicer = pd.IndexSlice
data.loc[('a', 1), (slicer[:], 'A')]
Which gives:
state color
Title1 A 0
A 1
Name: (a, 1), dtype: int32
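If you want to pin down the state level as well, rather than slicing across all states, you can pass a full tuple on the column axis too. Note that in this particular frame the ('Title1', 'A') pair occurs twice, so this still returns two values rather than a scalar:
# Full (level-0, level-1) tuples on both axes; ('Title1', 'A') is
# duplicated in these columns, so a length-2 Series comes back
data.loc[('a', 1), ('Title1', 'A')]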
Wow, that's a lot of questions; I'll take them in order.
Q1: For subsetting on multiple index levels I recommend IndexSlice:
data.loc[pd.IndexSlice['a',1],pd.IndexSlice['Title1','A']]
Out[410]:
state color
Title1 A 0
A 1
Q2: When you reset the index for this complete data frame, the columns keep their MultiIndex, which is what causes the error. I do not think you can do this in R without ftable either.
Here is a way of doing it with pandas:
data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')]
Out[434]:
state key1 key2 Title1 Title2
color A A B B C C
0 a 1 0 1 2 3 4 5
1 a 2 6 7 8 9 10 11
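A simpler route (a sketch, not part of the original answer) is to flatten the column MultiIndex after resetting the index, after which ordinary boolean filtering works again:
data_wo_index = data.reset_index()

# Join the two column levels into plain strings: ('Title1', 'A')
# becomes 'Title1_A', and ('key1', '') collapses to just 'key1'
data_wo_index.columns = ['_'.join(filter(None, col))
                         for col in data_wo_index.columns]

data_wo_index[data_wo_index['key1'] == 'a']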
Q3: The multiple levels on both the columns and the index effectively offer four dimensions. And yes, you can use the index alone to represent everything; just move a column level into the index with stack:
data.stack()
Out[436]:
state Title1 Title2
key1 key2 color
a 1 A 0 3
B 1 4
C 2 5
2 A 6 9
B 7 10
C 8 11
b 3 A 12 15
B 13 16
C 14 17
c 1 A 18 21
B 19 22
C 20 23
d 3 A 24 27
B 25 28
C 26 29
1 A 30 33
B 31 34
C 32 35
Q4: MultiIndex is just one type of index; pandas represents both the row labels and the column labels as Index objects.
For example:
df.index # index but just different type of index
Out[441]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
df.columns # index as well
Out[442]: Index(['A', 'B'], dtype='object')
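You can confirm this relationship on the MultiIndexed frame directly, since MultiIndex is a subclass of Index:
# Both axes of `data` are Index objects; here they happen to be MultiIndexes
print(isinstance(data.columns, pd.Index))       # True
print(isinstance(data.columns, pd.MultiIndex))  # True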
I need to add a combined column containing the concatenation of all the values in each row.
Source:
pd.DataFrame(data={
    'a': [1, 2, 3],
    'b': [2, 3, 4]
})
Target:
pd.DataFrame(data={
    'a': [1, 2, 3],
    'b': [2, 3, 4],
    'combine': [[1, 2], [2, 3], [3, 4]]
})
Current solution:
test['combine'] = test[['a','b']].apply(lambda x: pd.Series([x.values]), axis=1)
Issues:
I actually have many columns, and this seems to take too long to run. Is there a better way?
df
a b
0 1 2
1 2 3
2 3 4
If you want to add a column of lists as a single column, you'll need to access the .values attribute, convert it to a nested list, and assign it back:
df['combine'] = df.values.tolist()
# or,
df['combine'] = df[['a', 'b']].values.tolist()
df
a b combine
0 1 2 [1, 2]
1 2 3 [2, 3]
2 3 4 [3, 4]
Note that just assigning the .values result directly does not work, as pandas special-cases numpy arrays, leading to undesirable outcomes:
df['combine'] = df[['a', 'b']].values
ValueError: Wrong number of items passed 2, placement implies 1
A couple of notes -
try not to use apply/transform as much as possible. It is only a convenience function meant to hide the application of a loop, and is slow, offering no performance/vectorization benefits whatsoever
keeping columns of objects offers no performance gains as far as pandas is concerned, so unless the goal is to display data, try to avoid it.
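If you want to see the cost of the row-wise apply for yourself, here is a rough benchmark sketch (absolute timings will vary by machine and pandas version):
import timeit

import numpy as np
import pandas as pd

big = pd.DataFrame(np.random.randint(0, 10, size=(10000, 2)),
                   columns=['a', 'b'])

# One C-level conversion of the underlying array
print(timeit.timeit(lambda: big[['a', 'b']].values.tolist(), number=3))

# A Python-level loop over every row
print(timeit.timeit(
    lambda: big.apply(lambda r: list(r.values), axis=1), number=3))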
Firstly, thank you for helping
I have a large pandas DataFrame and I need a fast "rank" transformation for each column:
1] if the column contains only 0-1 values, do nothing
2] else (for each column):
a] find the unique values in the column
b] sort them
c] replace each element of the column with its position in the sorted-unique "ranking" list
optional:
d] transform these new values to the interval [-0.99, 0.99]
e] apply scipy.special.erfinv to each element (to get a "normal"-like distribution)
How can I do this with pandas when I need to take care about speed?
Thanks
Identifying the columns that contain anything other than 0 or 1 (these are the ones to transform):
columns_to_handle = (~df.isin([0,1])).any()
Converting column type to categorical conveniently handles steps a, b and c:
df.some_column.astype('category').cat.codes
Unfortunately this does seem to require a loop (through apply) over the columns, but if you don't have too many columns this should still be reasonably fast.
Rescaling can be done by subtracting the minimum and dividing by the maximum for each column. However, as the minimum of each column is already 0, the subtraction step is redundant.
Scipy's erfinv can take a dataframe as input directly. However, the values must lie strictly between -1 and 1, so the range is shrunk by epsilon on each side.
Combining it all
import pandas as pd
from scipy.special import erfinv
df = pd.DataFrame(
    [['a', 10, 0],
     ['b', 11, 1],
     ['c', 9, 0],
     ['d', 12, 1]],
    columns=['val1', 'val2', 'val3']
)
columns_to_handle = (~df.isin([0, 1])).any()
intermediate = df.loc[:, columns_to_handle].apply(lambda x: x.astype('category').cat.codes)
epsilon = 0.0001
# intermediate -= intermediate.min() # the minimum is 0 for every column already
intermediate /= intermediate.max()/(2-2*epsilon)
intermediate -= (1-epsilon)
intermediate = erfinv(intermediate)
result = pd.concat(
    [intermediate,
     df.loc[:, ~columns_to_handle]],
    axis=1)
result being the following dataframe:
val1 val2 val3
0 -2.751064 -0.304538 0
1 -0.304538 0.304538 1
2 0.304538 -2.751064 0
3 2.751064 2.751064 1
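As an aside, pandas' built-in dense rank accomplishes steps a, b and c in one call per column, matching the cat.codes result up to an offset of 1 (a sketch, not part of the code above):
# method='dense' gives consecutive integer ranks starting at 1, with
# equal values sharing a rank; subtracting 1 matches cat.codes
ranked = df.loc[:, columns_to_handle].apply(
    lambda s: s.rank(method='dense') - 1)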
I am an R user who is currently learning Python and I am trying to replicate a method of selecting columns used in R into Python.
In R, I could select multiple columns like so:
df[,c(2,4:10)]
In Python, I know how iloc works, but I couldn't mix a single column number with a consecutive range of them.
This wouldn't work:
df.iloc[:,[1,3:10]]
So, I'll have to drop the second column like so:
df.iloc[:,1:10].drop(df.iloc[:,1:10].columns[1] , axis=1)
Is there a more efficient way of replicating the method from R in Python?
You can use np.r_, which accepts mixed slice notation and scalar indices and concatenates them into a 1-d array:
import numpy as np
df.iloc[:,np.r_[1, 3:10]]
df = pd.DataFrame([[1,2,3,4,5,6]])
df
# 0 1 2 3 4 5
#0 1 2 3 4 5 6
df.iloc[:, np.r_[1, 3:6]]
# 1 3 4 5
#0 2 4 5 6
As np.r_ produces:
np.r_[1, 3:6]
# array([1, 3, 4, 5])
Assuming one wants to select multiple columns of a DataFrame by their names, consider the DataFrame df:
import pandas

df = pandas.DataFrame({'A': ['X', 'Y'],
                       'B': 1,
                       'C': [2, 3]})
To select columns A and C, simply use:
df[['A', 'C']]
   A  C
0  X  2
1  Y  3
Note that this returns a new DataFrame; if one wants to use it later, one should assign it to a variable.