Pandas - fast DataFrame transformation for neural nets ("gausrank") - python

Firstly, thank you for helping.
I have a large pandas DataFrame and I need a fast "rank" transformation for each column:
1] if the column contains only 0s and 1s, do nothing
2] else (for each column):
a] find the unique values in the column
b] sort them
c] for each element of the column, replace its value with its position in the sorted-unique "ranking" list
optional:
d] transform these new values to the interval [-0.99, 0.99]
e] apply scipy.special.erfinv to each element (to get a "normal"-like distribution)
How can I do this with pandas, keeping speed in mind?
Thanks

Identifying the columns that contain anything other than 0 or 1 (i.e. the ones that need transforming):
columns_to_handle = (~df.isin([0,1])).any()
Converting column type to categorical conveniently handles steps a, b and c:
df.some_column.astype('category').cat.codes
Unfortunately this does seem to require a loop (through apply) over the columns, but if you don't have too many columns this should still be reasonably fast.
Rescaling can be done by subtracting the minimum of each column, dividing by the maximum, and then stretching and shifting into the target interval. However, as the category codes of each column already start at 0, the subtraction is redundant.
Scipy's erfinv can take a whole dataframe as input. However, its argument must be strictly between -1 and 1, so the range is shrunk by a small epsilon on each side.
Combining it all
import pandas as pd
from scipy.special import erfinv

df = pd.DataFrame(
    [['a', 10, 0],
     ['b', 11, 1],
     ['c', 9, 0],
     ['d', 12, 1]],
    columns=['val1', 'val2', 'val3']
)
columns_to_handle = (~df.isin([0, 1])).any()
intermediate = df.loc[:, columns_to_handle].apply(lambda x: x.astype('category').cat.codes)
epsilon = 0.0001
# intermediate -= intermediate.min()  # the minimum is 0 for every column already
intermediate /= intermediate.max() / (2 - 2 * epsilon)
intermediate -= (1 - epsilon)
intermediate = erfinv(intermediate)
result = pd.concat(
    [intermediate,
     df.loc[:, ~columns_to_handle]],
    axis=1)
result being the following dataframe:
val1 val2 val3
0 -2.751064 -0.304538 0
1 -0.304538 0.304538 1
2 0.304538 -2.751064 0
3 2.751064 2.751064 1
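If the per-column apply becomes a bottleneck, a hedged alternative is to let DataFrame.rank(method='dense') produce the positions in the sorted-unique list for all handled columns in one call. A minimal sketch under that assumption (gauss_rank is just an illustrative helper name, and the non-binary columns are assumed to be orderable):
import pandas as pd
from scipy.special import erfinv

def gauss_rank(df, epsilon=1e-4):
    # Columns containing anything other than 0/1 get the rank transform
    to_handle = (~df.isin([0, 1])).any()
    # Dense ranks shifted to start at 0, i.e. the position in the sorted-unique list
    ranked = df.loc[:, to_handle].rank(method='dense') - 1
    # Stretch into (-1 + epsilon, 1 - epsilon) so erfinv stays finite
    scaled = ranked / ranked.max() * (2 - 2 * epsilon) - (1 - epsilon)
    transformed = pd.DataFrame(erfinv(scaled), index=df.index, columns=scaled.columns)
    # Re-attach the untouched 0/1 columns
    return pd.concat([transformed, df.loc[:, ~to_handle]], axis=1)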

Related

python pandas column with averages [duplicate]

I have a dataframe with locations in column "A" and values in column "B". Locations occur multiple times in this DataFrame; now I'd like to add a third column in which I store the average of the column "B" values that share the same location in column "A".
-I know .mean() can be used to get an average
-I know how to filter with .loc()
I could make a list of all unique values in column A and compute the average for each of them with a for loop. However, this seems cumbersome to me. Any idea how this can be done more efficiently?
Sounds like what you need is GroupBy. Take a look at the pandas GroupBy documentation.
Given
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5],
                   'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])
You can use
df.groupby('A').mean()
to group the values based on the common values in column "A" and find the mean.
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
I could make a list of all unique values in column A, and compute the
average for all of them by making a for loop.
This can be done using pandas.DataFrame.groupby; consider the following simple example
import pandas as pd
df = pd.DataFrame({"A":["X","Y","Y","X","X"],"B":[1,3,7,10,20]})
means = df.groupby('A').agg('mean')
print(means)
gives output
B
A
X 10.333333
Y 5.000000
import pandas as pd
data = {'A': ['a', 'a', 'b', 'c'], 'B': [32, 61, 40, 45]}
df = pd.DataFrame(data)
df2 = df.groupby(['A']).mean()
print(df2)
Based on your description, I'm not sure if you are trying to simply calculate the averages for each group, or if you are wanting to maintain the long format of your data. I'll break down a solution for each option.
The data I'll use below can be generated by running the following...
import pandas as pd
df = pd.DataFrame([['group1', 2],
                   ['group2', 4],
                   ['group1', 5],
                   ['group2', 2],
                   ['group1', 2],
                   ['group2', 0]], columns=['A', 'B'])
Option 1 - Calculate Group Averages
This one is super simple. It uses the .groupby method, which is the bread and butter of crunching data calculations.
df.groupby('A').B.mean()
Output:
A
group1 3.0
group2 2.0
If you wish for this to return a dataframe instead of a series, you can add .to_frame() to the end of the above line.
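For instance, using the df defined above:
df.groupby('A').B.mean().to_frame()
Output:
          B
A
group1  3.0
group2  2.0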
Option 2 - Calculate Group Averages and Maintain Long Format
By long format, I mean you want your data to be structured the same as it is currently, but with a third column (we'll call it C) containing a mean that is connected to the A column. ie...
A        B   C (average)
group1   2   3
group2   4   2
group1   5   3
group2   2   2
group1   2   3
group2   0   2
Where the averages for each group are...
group1 = (2+5+2)/3 = 3
group2 = (4+2+0)/3 = 2
The most efficient solution would be to use .transform, which behaves like an SQL window function, but I think this method can be a little confusing when you're new to pandas.
import numpy as np
df.assign(C=df.groupby('A').B.transform(np.mean))
A less efficient, but more beginner friendly option would be to store the averages in a dictionary and then map each row to the group average.
I find myself using this option a lot for modeling projects, when I want to impute a historical average rather than the average of my sampled data.
To accomplish this, you can...
Create a dictionary containing the grouped averages
For every row in the dataframe, pass the group name into the dictionary
# Create the group averages
group_averages = df.groupby('A').B.mean().to_dict()
# For every row, pass the group name into the dictionary
new_column = df.A.map(group_averages)
# Add the new column to the dataframe
df = df.assign(C=new_column)
You can also, optionally, do all of this in a single line
df = df.assign(C=df.A.map(df.groupby('A').B.mean().to_dict()))

Assign a series to ALL columns of the dataFrame (columnwise)?

I have a dataframe, and a series of the same length (number of rows) as the dataframe, and I want to assign
that series to ALL columns of the DataFrame.
What is the natural way to do it?
For example
df = pd.DataFrame([[1, 2 ], [3, 4], [5 , 6]] )
ser = pd.Series([1, 2, 3 ])
I want all columns of "df" to be equal to "ser".
PS Related:
One way to solve it is via the answer to:
"How to assign dataframe[boolean Mask] = Series - make it row-wise?", i.e. where the mask is True, take values from the same row of the Series (using an all-True mask), but I guess there should be some simpler way.
If I need NOT all, but SOME columns, the answer is given here:
"Assign a Series to several Rows of a Pandas DataFrame"
Use to_frame with reindex:
a = ser.to_frame().reindex(columns=df.columns, method='ffill')
print (a)
0 1
0 1 1
1 2 2
2 3 3
But the solution from the comments seems easier; the columns parameter is added in case you need the same column order as the original with real data:
df = pd.DataFrame({c:ser for c in df.columns}, columns=df.columns)
Maybe a different way to look at it:
df = pd.concat([ser] * df.shape[1], axis=1)
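Another possibility, if you would rather overwrite the existing frame in place, is NumPy broadcasting; a small sketch using the df and ser from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
ser = pd.Series([1, 2, 3])

# Overwrite every column with the series, keeping df's index and column labels
df.loc[:, :] = np.broadcast_to(ser.to_numpy()[:, None], df.shape)
print(df)
#    0  1
# 0  1  1
# 1  2  2
# 2  3  3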

Replace a column in Pandas dataframe with another that has same index but in a different order

I'm trying to re-insert into a pandas dataframe a column that I extracted and whose order I changed by sorting it.
Very simply, I have extracted a column from a pandas df:
col1 = df.col1
This column contains integers. I used the .sort() method to order it from smallest to largest, and then did some operations on the data.
col1.sort()
#do stuff that changes the values of col1.
Now the indexes of col1 are the same as the indexes of the overall df, but in a different order.
I was wondering how I can insert the column back into the original dataframe (replacing the col1 that is there at the moment)
I have tried both of the following methods:
1)
df.col1 = col1
2)
df.insert(column_index_of_col1, "col1", col1)
but both methods give me the following error:
ValueError: cannot reindex from a duplicate axis
Any help will be greatly appreciated.
Thank you.
Consider this DataFrame:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]}, index=[0, 0, 1])
df
Out:
A B
0 1 6
0 2 5
1 3 4
Assign the second column to b and sort it and take the square, for example:
b = df['B']
b = b.sort_values()
b = b**2
Now b is:
b
Out:
1 16
0 25
0 36
Name: B, dtype: int64
Without knowing the exact operation you've done on the column, there is no way to know whether 25 corresponds to the first row in the original DataFrame or the second one. You can take the inverse of the operation (take the square root and match, for example) but that would be unnecessary I think. If you start with an index that has unique elements (df = df.reset_index()) it would be much easier. In that case,
df['B'] = b
should work just fine.
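To illustrate that last point with a small sketch (made-up numbers, not the asker's data): with a unique index, assignment aligns on the index labels, so the sorted order of b does not matter.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]})  # default RangeIndex, all labels unique

b = df['B'].sort_values() ** 2  # the order changes, but each value keeps its index label
df['B'] = b                     # alignment on the index puts every value back into its row
print(df)
#    A   B
# 0  1  36
# 1  2  25
# 2  3  16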

Equivalent of Pandas Factorize For Multiple Columns?

I have three binary-type columns of a dataframe whose values together constitute a meaningful grouping of the data. To refer to the group, I'm currently making a new column with a hard-coded binary encoding, like so:
data['type'] = data['a'] + 2 * data['b'] + 4 * data['c']
Pandas factorize will assign an integer for each distinct value of a sequence, but it doesn't seem to work with combinations of multiple columns. Is there a more general pandas function for situations like this? It would be nice if such a function generalized to K distinct categorical variables of arbitrary number of categories, rather than being limited to binary variables.
If such a thing doesn't exist, would there be interest in a pull request?
Here are two methods you can try:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 0],
                   'b': [0, 1, 0],
                   'c': [1, 1, 1]})
>>> df
a b c
0 1 0 1
1 1 1 1
2 0 0 1
>>> ["".join(row) for row in df[['a', 'b', 'c']].values.astype(str)]
Out[22]: ['101', '111', '001']
>>> [bytearray("".join(row), "utf8") for row in df[['a', 'b', 'c']].values.astype(str)]
Out[23]: [bytearray(b'101'), bytearray(b'111'), bytearray(b'001')]
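If the goal is a single integer label per distinct combination (as in the question), one option is to pass the joined keys from the first method to pd.factorize; a minimal sketch building on the df above:
codes, uniques = pd.factorize(["".join(row) for row in df[['a', 'b', 'c']].values.astype(str)])
df['type'] = codes
# codes   -> array([0, 1, 2]), one integer per distinct (a, b, c) combination
# uniques -> array(['101', '111', '001'], dtype=object)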
You may want to take a look at patsy which addresses things like categorical variable encoding and other model-related issues: see docs.
Patsy offers quite a few encoding schemes, including:
Treatment (default)
Backward difference coding
Orthogonal polynomial contrast coding
Deviation coding (also known as sum-to-zero coding), and
Helmert contrasts
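For example, a minimal treatment-coding sketch with patsy (the column name 'a' and its values are made up for illustration):
import pandas as pd
import patsy

df = pd.DataFrame({'a': ['x', 'y', 'x', 'z']})

# Dummy/treatment-encode the categorical column 'a' into a design matrix
design = patsy.dmatrix("C(a, Treatment)", df, return_type='dataframe')
print(design)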

Applying transformations to dataframes with multi-level indices in Python's pandas

I'm trying to apply simple functions to mostly numeric data in pandas. The data is a set of matrices indexed by time. I wanted to use hierarchical/multi-level indices to represent this and then use a split-apply-combine style operation to group the data, apply an operation, and summarize the result as a dataframe. I'd like the results of these operations to be dataframes and not Series objects.
Below is a simple example with two matrices (two time points) represented as a multi-level dataframe. I want to subtract a matrix from each time point, then collapse the data by taking the mean, and get back a dataframe that preserves the original column names of the data.
Everything I try either fails or gives an odd result. I tried to follow http://pandas.pydata.org/pandas-docs/stable/groupby.html since this is basically a split-apply-combine operation, I think, but the documentation is very hard to understand and the examples are dense.
How can this be achieved in pandas? I've annotated the relevant lines below where my code fails:
import pandas
import numpy as np

t1 = pandas.DataFrame([[0, 0, 0],
                       [0, 1, 1],
                       [5, 5, 5]], columns=[1, 2, 3], index=["A", "B", "C"])
t2 = pandas.DataFrame([[10, 10, 30],
                       [5, 1, 1],
                       [2, 2, 2]], columns=[1, 2, 3], index=["A", "B", "C"])
m = np.ones([3, 3])
c = pandas.concat([t1, t2], keys=["t1", "t2"], names=["time", "name"])
#print("c: ", c)
# How to view just the 'time' column values?
#print(c.ix["time"])  # fails
#print(c["time"])  # fails
# How to group matrix by time, subtract value from each matrix, and then
# take the mean across the columns and get a dataframe back?
result = c.groupby(level="time").apply(lambda x: np.mean(x - m, axis=1))
# Why does 'result' appear to have TWO "time" columns?!
print(result)
# Why is 'result' a series and not a dataframe?
print(type(result))
# Attempt to get a dataframe back
df = pandas.DataFrame(result)
# Why does 'df' have a weird '0' outer (hierarchical) column??
print(df)
# 0
# time time name
# t1 t1 A -1.000000
# B -0.333333
# C 4.000000
# t2 t2 A 15.666667
# B 1.333333
# C 1.000000
In short, the operation I'd like to do is:
for each time point:
subtract m from time point matrix
collapse the result matrix across the columns by taking the mean (preserving the row labels "A", "B", "C")
return result as dataframe
how to view just the 'time' column values?
In [11]: c.index.levels[0].values
Out[11]: array(['t1', 't2'], dtype=object)
how to group matrix by time, subtract value from each matrix, and then
take the mean across the columns and get a dataframe back?
Your attempt was pretty close:
In [46]: c.groupby(level='time').apply(lambda x: x - m).mean(axis=1)
Out[46]:
time name
t1 A -1.000000
B -0.333333
C 4.000000
t2 A 15.666667
B 1.333333
C 1.000000
dtype: float64
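The result above is a Series with a (time, name) MultiIndex; if a DataFrame is needed, as the question asks, two options (a sketch reusing the same c and m):
result = c.groupby(level='time').apply(lambda x: x - m).mean(axis=1)

# Keep the long shape as a one-column dataframe
df_long = result.to_frame(name='mean')

# Or pivot the 'time' level into columns (rows A/B/C, columns t1/t2)
df_wide = result.unstack(level='time')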
