I have two dataframes, A and B, and I'm wondering how to create the dataframe shown in orange.
The value to populate each cell would be based on its row and column headers. For example, the top-left cell would be a function of the row and column index (dataframe A.A0 + dataframe A.A1 - dataframe B.0).
I tried starting from an empty dataframe with the target dimensions (emptyDf):
emptyDf.applymap(lambda x: x[dfA[0]] + x[dfA[1]] - x[dfB[0]])
What you are trying to do is not really in the spirit of a Pandas dataframe; it is more of a matrix-manipulation exercise, for which NumPy (the library Pandas is built on) is more appropriate. It is not hard to move between Pandas dataframes and NumPy arrays and back again, though you might need to store the indexes and column labels somewhere safe so you can reattach them when you bring the result back into Pandas. NumPy has functions for just about any manipulation you could dream up; here are a few tools that help with this task:
import pandas as pd
import numpy as np
# create your dataframes:
series = pd.Series([10,9,8,7,6], index=[0,1,2,3,4])
df1 = pd.DataFrame([series])
cols = ['A','B','C','D']
list_of_series = [pd.Series([1,2,3,4],index=cols), pd.Series([5,6,7,8],index=cols)]
df2 = pd.DataFrame(list_of_series, columns=cols)
Now convert to NumPy
A = np.array(df2)
>>> A
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
B = np.array(df1)
>>> B.T
array([[10],
[ 9],
[ 8],
[ 7],
[ 6]])
Now a few NumPy operations to accomplish the task:
C = A.sum(axis=0)
D = np.tile(C,(5,1))
E = np.tile(B.T, (1,4))
F = D - E
F
array([[-4, -2, 0, 2],
[-3, -1, 1, 3],
[-2, 0, 2, 4],
[-1, 1, 3, 5],
[ 0, 2, 4, 6]])
Now convert it back to a dataframe:
pd.DataFrame(F, columns=['A','B','C','D'], index=[0,1,2,3,4])
Anyway, I wonder whether this can be done directly in Pandas, but it strikes me as a matrix problem, and since the computation stays within NumPy I don't expect it to be slow even for a large system.
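For what it's worth, NumPy broadcasting can replace the np.tile calls, and the labels can be carried over from the original dataframes. A minimal sketch using df1 and df2 from above (reusing df1's columns as the result index is an assumption about how you want the result labelled):
# (4,) row of column sums minus (5, 1) column of values broadcasts to (5, 4)
F = df2.to_numpy().sum(axis=0) - df1.to_numpy().T
result = pd.DataFrame(F, columns=df2.columns, index=df1.columns)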
I have a CSV file that I need to read as a DataFrame, but I'd like to apply a transformation in one of the columns using converters from pandas.read_csv.
This is what's in my file:
matrix size
"(1, 2, 3, 4)" 2
"(1, 2, 3, 4, 5, 6, 7, 8, 9)" 3
The strings in matrix need to be converted to matrices according to the corresponding size. (The actual process is more complex and the values in the data actually correspond to the lower triangle of each matrix, etc.)
So, the expected output DataFrame is:
matrix size
0 [[1, 2], [3, 4]] 2
1 [[1, 2, 3], [4, 5, 6], [7, 8, ... 3
I'm trying to use converters to convert the columns as I read them.
For example, if I wanted to read the strings in matrix as simple arrays, I could do the following:
import numpy as np
converters = {'matrix': lambda x: np.fromstring(x[1:-1], sep=',').astype('int64')}
And then read the file passing this dictionary:
import pandas as pd
df = pd.read_csv('mydata.csv', converters=converters)
The output would be:
matrix size
0 [1, 2, 3, 4] 2
1 [1, 2, 3, 4, 5, 6, 7, 8, 9] 3
In my case, I have a function to transform the strings to matrices:
def array_to_matrix(array_str, size):
    array = np.fromstring(array_str[1:-1], sep=',').astype('int64')
    return array.reshape(size, size)
But this function requires two arguments.
I can parse the matrix columns by doing this:
df['matrix'] = df.apply(lambda x: array_to_matrix(x['matrix'], x['size']), axis=1)
However, I haven't been able to find a way to parse the matrices using converters. To use converters, I could do the following:
matrix_converters = dict([('matrix', lambda x, y: array_to_matrix(x, y))])
But x will become the value in matrix (the dictionary key) and I have no way to pass y.
My use case is more complex and would benefit from being able to parse many similar columns while reading the file.
Is it possible to pass more than one column in the DataFrame to converters, or is it limited to one?
Try:
df['matrix'] = df.apply(lambda x: np.array(eval(x['matrix'])).reshape((x['size'], x['size'])), axis=1)
or, if the matrix is not square:
df['matrix'] = df.apply(lambda x: np.array(eval(x['matrix'])).reshape((x['size'], -1)), axis=1)
Output:
print(df)
matrix size
0 [[1, 2], [3, 4]] 2
1 [[1, 2, 3], [4, 5, 6], [7, 8, 9]] 3
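If you would rather not call eval on strings read from a file, ast.literal_eval parses the tuple literal safely. A minimal sketch of the same row-wise transform, assuming the column names 'matrix' and 'size' from the question:
import ast
import numpy as np
import pandas as pd
df = pd.read_csv('mydata.csv')
# Parse "(1, 2, 3, 4)" into a tuple safely, then reshape using that row's size.
df['matrix'] = df.apply(
    lambda row: np.array(ast.literal_eval(row['matrix'])).reshape(row['size'], row['size']),
    axis=1)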
I have a NumPy array with each row representing some (x, y, z) coordinate like so:
a = array([[0, 0, 1],
[1, 1, 2],
[4, 5, 1],
[4, 5, 2]])
I also have another NumPy array with unique values of the z-coordinates of that array like so:
b = array([1, 2])
How can I apply a function, let's call it "f", to each of the groups of rows in a which correspond to the values in b? For example, the first value of b is 1 so I would get all rows of a which have a 1 in the z-coordinate. Then, I apply a function to all those values.
In the end, the output would be an array the same shape as b.
I'm trying to vectorize this to make it as fast as possible. Thanks!
Example of an expected output (assuming that f is count()):
c = array([2, 2])
because there are 2 rows in array a which have a z value of 1 in array b and also 2 rows in array a which have a z value of 2 in array b.
A trivial solution would be to iterate over array b like so:
c = []
for val in b:
    # apply the function to the rows of a whose z value equals val
    c.append(f(a[a[:, 2] == val]))
c = np.array(c)
My attempt:
I tried doing something like this, but it just returns an empty array.
func(a[a[:, 2]==b])
The problem is that the groups of rows with the same z can have different sizes, so you cannot stack them into one 3D NumPy array and simply apply a function along an axis. One solution is a for-loop; another is np.split:
a = np.array([[0, 0, 1],
[1, 1, 2],
[4, 5, 1],
[4, 5, 2],
[4, 3, 1]])
a_sorted = a[a[:,2].argsort()]
inds = np.unique(a_sorted[:,2], return_index=True)[1]
a_split = np.split(a_sorted, inds)[1:]
# [array([[0, 0, 1],
# [4, 5, 1],
# [4, 3, 1]]),
# array([[1, 1, 2],
# [4, 5, 2]])]
f = np.sum # example of a function
result = list(map(f, a_split))
# [19, 15]
But imho the best solution is to use pandas and groupby as suggested by FBruzzesi. You can then convert the result to a numpy array.
EDIT: For completeness, here are the other two solutions
List comprehension:
b = np.unique(a[:,2])
result = [f(a[a[:,2] == z]) for z in b]
Pandas:
df = pd.DataFrame(a, columns=list('XYZ'))
result = df.groupby(['Z']).apply(lambda x: f(x.values)).tolist()
Comparing the three approaches on a = np.random.randint(0, 100, (n, 3)): up to roughly n = 10^5 the split solution is the fastest, but beyond that the pandas solution performs better.
If you are allowed to use pandas:
import pandas as pd
df=pd.DataFrame(a, columns=['x','y','z'])
df.groupby('z').agg(f)
Here f can be any custom function working on grouped data.
Numeric example:
a = np.array([[0, 0, 1],
[1, 1, 2],
[4, 5, 1],
[4, 5, 2]])
df=pd.DataFrame(a, columns=['x','y','z'])
df.groupby('z').size()
z
1 2
2 2
dtype: int64
Note that .size() is the way to count the number of rows per group.
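If you need the result as a NumPy array the same shape as b, a small sketch:
c = df.groupby('z').size().to_numpy()
# array([2, 2])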
To keep it in pure NumPy, maybe this can suit your case (note that it only stacks into a regular 3D array because both groups here have the same number of rows):
tmp = np.array([a[a[:,2]==i] for i in b])
tmp
array([[[0, 0, 1],
[4, 5, 1]],
[[1, 1, 2],
[4, 5, 2]]])
which is an array with each group of arrays.
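From there, a reduction can be applied to each group to get an output the same shape as b; a sketch using np.sum as the example f on the tmp above:
f = np.sum
c = np.array([f(group) for group in tmp])
# array([11, 15])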
c = np.array([])
for x in np.nditer(b):
    # count the rows of a whose z value equals x and append that count
    c = np.append(c, np.where(a[:, 2] == x)[0].shape[0])
Output:
[2. 2.]
Assuming you have a DataFrame with a column containing expressions (referring to other columns), is it possible to evaluate the expressions contained in that column?
I know one can use pd.eval() and df.eval() to apply column-wise operations (as seen below). Example taken from:
https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html
Assuming you have:
import pandas as pd
df = pd.DataFrame([[1, 2], [2, 3], [5, 6]], columns=['A', 'B'])
then you can write:
df.eval('(A + B)')
and you will get a series with 3, 5, 11 (expected).
Now what if that expression actually varies from row to row and is actually stored as a column? Such as this dataframe:
df = pd.DataFrame([[1, 2, "A + B"], [2, 3, "A - B"], [5, 6, "A + 2 * B"]], columns=['A', 'B', 'C'])
How does one go about evaluating the expressions in column C?
The expected result in that case is a series with 3, -1, 17.
Thanks for your help.
Use:
>>> import numpy as np
>>> np.diag(df.C.apply(df.eval).values)
array([ 3, -1, 17])
That said, this is a bad design IMO: (i) you're hardcoding operations in strings, which makes them harder to manipulate if you ever need to, and (ii) you're storing those operations as strings in a pandas DataFrame, and many string-heavy operations are slow.
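Also note that df.C.apply(df.eval) evaluates every expression against every row and np.diag then keeps only the matching diagonal, so it builds an n x n intermediate. For larger frames a row-wise evaluation may be preferable; a sketch, assuming the expression column is named C as above:
result = df.apply(lambda row: pd.eval(row['C'], engine='python', local_dict=row.to_dict()), axis=1)
# 0     3
# 1    -1
# 2    17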
What is the best way to compute the correlation between my features and the target variable? My dataframe has 1,000 rows and 40,000 columns.
Example:
df = pd.DataFrame([[1, 2, 4, 6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2, 10]], columns=['Feature1', 'Feature2', 'Feature3', 'Target'])
The code below works, but it is far too slow on my dataframe. I only need the last column of the correlation matrix: the correlation with the target (not the pairwise feature correlations).
corr_matrix = df.corr()
corr_matrix["Target"].sort_values(ascending=False)
The np.corrcoef() function works with arrays, but can I avoid computing the pairwise feature correlations?
You could use pandas corr on each column:
df.drop("Target", axis=1).apply(lambda x: x.corr(df.Target))
Since Pandas 0.24 released in January 2019, you can simply use DataFrame.corrwith():
df.corrwith(df["Target"])
You can use scipy.stats.pearsonr on each of the feature columns like so:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
# example data
df = pd.DataFrame([[1, 2, 4, 6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2, 10]],
                  columns=['Feature1', 'Feature2', 'Feature3', 'Target'])
# Only compute pearson prod-moment correlations between feature
# columns and target column
target_col_name = 'Target'
feature_target_corr = {}
for col in df:
    if target_col_name != col:
        feature_target_corr[col + '_' + target_col_name] = \
            pearsonr(df[col], df[target_col_name])[0]
print("Feature-Target Correlations")
print(feature_target_corr)
df = pd.DataFrame([[1, 2, 4, 6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2, 10]], columns=['Feature1', 'Feature2', 'Feature3', 'Target'])
For correlation between your target variable and all other features:
df.corr()['Target']
This works in my case. Let me know of any corrections or updates.
To get conclusive results, your number of instances should be at least 10 times your number of features.
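For a frame this wide, the feature-target correlations can also be computed in one vectorized NumPy pass, without forming the full 40,000 x 40,000 correlation matrix. A sketch (not from the original answers) using the example df:
import numpy as np
import pandas as pd
X = df.drop(columns='Target').to_numpy(dtype=float)
y = df['Target'].to_numpy(dtype=float)
# Standardize features and target (population std), then average the products:
# this gives the Pearson correlation of each feature with the target.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()
corr = pd.Series(Xz.T @ yz / len(y), index=df.columns.drop('Target'))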
What's an efficient way to remove columns from a NumPy array if the first value in a column is greater than the last value in that column? Let's say I had b:
>>> import numpy as np
>>> b = np.arange(9).reshape(3,3)
>>> b[0,0] = 9
>>> b
array([[9, 1, 2],
[3, 4, 5],
[6, 7, 8]])
And since b[0,0] > b[-1,0] (9 > 6), you would want to remove the first column and effectively be left with:
array([[1, 2],
[4, 5],
[7, 8]])
What's an efficient way to do this? I've seen it done with rows, with notation like:
b[np.logical_not(np.logical_and(b[:,0] > 20, b[:,0] < 25))]
But not with columns. Also, if transposing could be avoided that would definitely be preferable as I would like to use this on a large data set.
Simply use boolean (logical) indexing over the columns:
new_b = b[:, b[0,:]<=b[-1,:]]
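A quick check with the array from the question (a minimal sketch):
import numpy as np
b = np.arange(9).reshape(3, 3)
b[0, 0] = 9
# Keep a column only if its first value is not greater than its last value.
new_b = b[:, b[0, :] <= b[-1, :]]
print(new_b)
# [[1 2]
#  [4 5]
#  [7 8]]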