Pandas: 2 data frames have different shapes but no different columns - python

I have 2 dataframes: a and b.
When I run print(a.shape, b.shape), I get the following result: (1, 28849) (44, 29025), meaning that b has more columns than a. When I run b.columns.difference(a.columns), the result is an empty index: Index([], dtype='object'). I get the same result when I run a.columns.difference(b.columns). Why do the dataframes have different column counts in shape but no differing columns between them?

Why do the dataframes have different columns counts in shape but not have any different columns between them?
Empty bi-directional pd.Index.difference is no guarantee that columns in 2 dataframes are the same. Consider the following example:
A = pd.DataFrame(columns=[1, 1, 2, 3, 4])
B = pd.DataFrame(columns=[1, 2, 3, 4])
A.columns.difference(B.columns) # Int64Index([], dtype='int64')
B.columns.difference(A.columns) # Int64Index([], dtype='int64')
pd.Index.difference can be compared to set.difference, i.e. it does not consider duplicates. If you print the columns explicitly, you should see they are different.
Or, to explicitly calculate the counts of each column name, you can use numpy.unique:
import numpy as np
print(np.unique(A.columns, return_counts=True))
(array([1, 2, 3, 4], dtype=int64), array([2, 1, 1, 1], dtype=int64))
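A shorter check is possible with pandas alone. The sketch below reuses the A and B frames from the example above and relies on Index.duplicated and Index.is_unique; it only illustrates the idea and says nothing about which labels are duplicated in the asker's real data.

```python
import pandas as pd

A = pd.DataFrame(columns=[1, 1, 2, 3, 4])
B = pd.DataFrame(columns=[1, 2, 3, 4])

# duplicated() marks every repeat of a label after its first occurrence.
dupes = A.columns[A.columns.duplicated()]

print(dupes.tolist())        # labels that appear more than once in A
print(A.columns.is_unique)   # False when any label repeats
print(B.columns.is_unique)
```

If is_unique is False for either frame, a set-style difference cannot be used to compare column counts.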

Related

performing operation on matched columns of NumPy arrays

I am new to python and programming in general and ran into a question:
I have two NumPy arrays of the same shape: they are 2D arrays, of the dimensions 1000 x 2000.
I wish to compare the values of each column in array A with the values in array B. The important part is that not every column of A should be compared to every column in B, but rather the same columns of A & B should be compared to one another, as in: A[:,0] should be compared to B[:,0], A[:,1] should be compared to B[:,1],… etc.
This was easier to do when I had one dimensional arrays: I used zip(A, B), so I could run the following for-loop:
A = np.array([2,5,6,3,7])
B = np.array([1,3,9,4,8])
res_list = []
for number1, number2 in zip(A, B):
    if number1 > number2:
        comment1 = "bigger"
        res_list.append(comment1)
    if number1 < number2:
        comment2 = "smaller"
        res_list.append(comment2)
res_list
In [702]: res_list
Out[702]: ['bigger', 'bigger', 'smaller', 'smaller', 'smaller']
However, I am not sure how best to do this on the 2D array. As output, I am aiming for a list with 2000 sublists (one per column), so that I can later count the number of instances of "bigger" and "smaller" for each column.
I am very thankful for any input.
So far I have tried to use np.nditer in a double for loop, but it returned all the possible column combinations. I would specifically desire to only combine the "matching" columns.
An approximation of the input (the real arrays have 1000 rows and 2000 cols):
In [709]: A
Out[709]:
array([[2, 5, 6, 3, 7],
[6, 2, 9, 2, 3],
[2, 1, 4, 5, 7]])
In [710]: B
Out[710]:
array([[1, 3, 9, 4, 8],
[4, 8, 2, 3, 1],
[3, 7, 1, 8, 9]])
As desired output, I want to compare the values of the arrays A & B column-wise (only the "matching" columns, not all columns with all columns, as I tried to explain above), and store them in a nested list (the number of "sublists" should correspond to the number of columns):
res_list = [["bigger", "bigger", "smaller"], ["bigger", "smaller", "smaller"], ["smaller", "bigger", "bigger"], ["smaller", "smaller", "smaller"], ...]
From the example input and output, I see that you want to do an element wise comparison, and store the values per columns. From your code you understand the 1D variant of this problem, so the question seems to be how to do it in 2D.
Solution 1
To achieve this, we have to turn the 2D problem into a 1D problem, so you can do what you already did. If, for example, the columns became rows, you could reuse your zip strategy for every row.
In other words, if we can turn:
a = np.array(
[[2, 5, 6, 3, 7],
[6, 2, 9, 2, 3],
[2, 1, 4, 5, 7]]
)
into:
array([[2, 6, 2],
[5, 2, 1],
[6, 9, 4],
[3, 2, 5],
[7, 3, 7]])
we can iterate over a and b at the same time and get our 1D version of the problem. Swapping the x and y axes of a matrix like this is called transposing. It is a very common operation, and in NumPy it is written a.T (docs: ndarray.T).
Now we use your code once for the outer loop, iterating over all the rows (after transposing, each row holds the values of one original column). Inside it we reuse your element-wise comparison, because every row is a 1D NumPy array.
result = []
# Outer loop, to go over the columns of `a` and `b` at the same time.
for row_a, row_b in zip(a.T, b.T):
    result_col = []
    # Inner loop to compare a whole column element wise.
    for col_a, col_b in zip(row_a, row_b):
        result_col.append('bigger' if col_a > col_b else 'smaller')
    result.append(result_col)
Note: I use a ternary operator to assign smaller and bigger.
Solution 2
As indicated before you are only looking at 2 values that are in the same position for both arrays, this is called an elementwise comparison. Since we are only interested in the values that are at the exact same position, and we know the output shape of our result array (input 1000x2000, output will be 2000x1000), we can also iterate over all the elements using their index.
A few handy shortcuts first:
a.shape holds the dimensions of the array, so here a.shape is (1000, 2000).
Slicing with [::-1] reverses the order, similar to reversed().
Combined, a.shape[::-1] gives (2000, 1000), our expected output shape.
np.ndindex yields every index tuple for the dimensions it is given.
A leading * performs tuple unpacking, so np.ndindex(*a.shape) is equivalent to np.ndindex(1000, 2000).
Therefore we can use their index (from np.ndindex) and turn the x and y around to write the result to the correct location in the output array:
a = np.random.randint(0, 255, (1000, 2000))
b = np.random.randint(0, 255, (1000, 2000))
result = np.zeros(a.shape[::-1], dtype=object)
for rows, columns in np.ndindex(*a.shape):
    result[columns, rows] = 'bigger' if a[rows, columns] > b[rows, columns] else 'smaller'
print(result)
This will lead to the same result. Similarly we could also first transpose the a and b array, drop the [::-1] in the result array, and swap the assignment result[columns, rows] back to result[rows, columns].
Edit
Thinking about it a bit longer, you are only interested in comparing two arrays of the same shape (dimensions). For this NumPy already has a good solution, np.where(cond, <true>, <false>).
So the entire problem can be reduced to:
answer = np.where(a > b, 'bigger', 'smaller').T
Note the .T to transpose the solution, such that the answer has the columns in the rows.
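Since the stated goal is to count "bigger" and "smaller" per column, the counts can also be taken straight from the boolean comparison without building any labels. This is a small sketch on the 3 x 5 example arrays from the question; the names labels, bigger_counts and smaller_counts are introduced here for illustration. Note that np.where labels ties as 'smaller', while the counts below treat ties as neither.

```python
import numpy as np

a = np.array([[2, 5, 6, 3, 7],
              [6, 2, 9, 2, 3],
              [2, 1, 4, 5, 7]])
b = np.array([[1, 3, 9, 4, 8],
              [4, 8, 2, 3, 1],
              [3, 7, 1, 8, 9]])

# One row of labels per original column, as in the np.where answer.
labels = np.where(a > b, 'bigger', 'smaller').T

# Summing a boolean mask along axis 0 counts the True values per column.
bigger_counts = (a > b).sum(axis=0)
smaller_counts = (a < b).sum(axis=0)

print(labels[0].tolist())
print(bigger_counts.tolist(), smaller_counts.tolist())
```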

Is it possible to pass more than one argument to pandas converters (read_csv)?

I have a CSV file that I need to read as a DataFrame, but I'd like to apply a transformation in one of the columns using converters from pandas.read_csv.
This is what's in my file:
matrix size
"(1, 2, 3, 4)" 2
"(1, 2, 3, 4, 5, 6, 7, 8, 9)" 3
The strings in matrix need to be converted to matrices according to the corresponding size. (The actual process is more complex and the values in the data actually correspond to the lower triangle of each matrix, etc.)
So, the expected output DataFrame is:
matrix size
0 [[1, 2], [3, 4]] 2
1 [[1, 2, 3], [4, 5, 6], [7, 8, ... 3
I'm trying to use converters to convert the columns as I read them.
For example, if I wanted to read the strings in matrix as simple arrays, I could do the following:
import numpy as np
converters = {'matrix': lambda x: np.fromstring(x[1:-1], sep=',').astype('int64')}
And then read the file passing this dictionary:
import pandas as pd
df = pd.read_csv('mydata.csv', converters=converters)
The output would be:
matrix size
0 [1, 2, 3, 4] 2
1 [1, 2, 3, 4, 5, 6, 7, 8, 9] 3
In my case, I have a function to transform the strings to matrices:
def array_to_matrix(array_str, size):
    array = np.fromstring(array_str[1:-1], sep=',').astype('int64')
    return array.reshape(size, size)
But this function requires two arguments.
I can parse the matrix columns by doing this:
df['matrix'] = df.apply(lambda x: array_to_matrix(x['matrix'], x['size']), axis=1)
However, I haven't been able to find a way to parse the matrices using converters. To use converters, I could do the following:
matrix_converters = dict([('matrix', lambda x, y: array_to_matrix(x, y))])
But x will become the value in matrix (the dictionary key) and I have no way to pass y.
My use case is more complex and would benefit from being able to parse many similar columns while reading the file.
Is it possible to pass more than one column in the DataFrame to converters, or is it limited to one?
Try:
df.matrix = df.apply(lambda x: np.array(eval(x['matrix'])).reshape((x['size'], x['size'])), axis=1)
or, if the matrix is not square:
df.matrix = df.apply(lambda x: np.array(eval(x['matrix'])).reshape((x['size'], -1)), axis=1)
Output:
print(df)
matrix size
0 [[1, 2], [3, 4]] 2
1 [[1, 2, 3], [4, 5, 6], [7, 8, 9]] 3
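On the question as asked: each function in converters is called with the string of a single cell, so it has no access to the other columns of the same row. A two-step approach is therefore a reasonable workaround: parse in the converter, reshape after reading. The sketch below is only an illustration; the comma-separated layout of the file is an assumption, csv_text stands in for mydata.csv, and ast.literal_eval is swapped in for eval as a safer way to parse the tuple literal.

```python
import ast
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for mydata.csv (comma-separated layout is assumed).
csv_text = 'matrix,size\n"(1, 2, 3, 4)",2\n"(1, 2, 3, 4, 5, 6, 7, 8, 9)",3\n'

# Step 1: the converter sees one cell at a time, so it can only parse.
converters = {'matrix': lambda s: np.array(ast.literal_eval(s), dtype='int64')}
df = pd.read_csv(io.StringIO(csv_text), converters=converters)

# Step 2: the reshape needs two columns, so it runs after reading.
# An explicit object array keeps each matrix as one cell value.
matrices = np.empty(len(df), dtype=object)
for i, (flat, n) in enumerate(zip(df['matrix'], df['size'])):
    matrices[i] = flat.reshape(n, n)
df['matrix'] = matrices

print(df['matrix'].iloc[0])
```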

How can I add two pandas.DataFrames cellwise?

I have a bunch of pandas.DataFrames which are each a table of counts. I'd like to sum them up to return the totals. I.e. cellwise, as if I was summing two dimensional histograms or two dimensional arrays. The output should be a table of the same dimensions as input, but with numerical values summed up.
To make matters worse, the order of the columns may not be preserved.
There must be a cool way to do this without looping, but I can't figure it out.
Here's an example:
import pandas
df1 = pandas.DataFrame({'A': [3, 1, 2], 'B': [1, 1, 0]})
df2 = pandas.DataFrame({'B': [2, 0, 1], 'A': [4, 1, 6]})
I'm looking for a function something like:
df_cellwise_sum = cellwise_sum(df1, df2)
print(df_cellwise_sum)
which makes:
A B
0 7 3
1 2 1
2 8 1
Use DataFrame.add:
df = df1.add(df2)
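DataFrame.add aligns on row and column labels, so the differing column order in df2 does not matter. For "a bunch" of frames, the same call can be folded over a list. The sketch below assumes a hypothetical third frame df3 and uses fill_value=0 so that a cell missing from one frame counts as zero; both are illustrative choices, not part of the original answer.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'A': [3, 1, 2], 'B': [1, 1, 0]})
df2 = pd.DataFrame({'B': [2, 0, 1], 'A': [4, 1, 6]})
df3 = pd.DataFrame({'A': [1, 1, 1], 'B': [0, 0, 0]})  # hypothetical extra table

# Two frames: columns are matched by label, not by position.
pair_sum = df1.add(df2)

# Many frames: fold the same label-aligned addition over the list.
total = reduce(lambda left, right: left.add(right, fill_value=0), [df1, df2, df3])

print(pair_sum)
print(total)
```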

Using arrays in order to select values from multiindex

I wish to extract values from a multiindex DataFrame. This df has two index levels, a_idx and b_idx. The values to be extracted are, e.g., (1,1):
[in] df.loc[(1, 1), :]
[out] 0
Name: (1, 1), dtype: int64
which is as intended. But then if I want to obtain two values (1,2) and (2,3):
[in] df.loc[([1, 2], [2, 3]), :]
[out]
value
a_idx b_idx
1 2 1
3 6
2 2 3
3 9
This is not what I wanted: I needed the specific pairs, not all 4 values.
Furthermore, I wish to select elements from this dataframe with two arrays, select_a and select_b, that have the same length as each other but not the same length as the dataframe. So for
select_a = [1, 1, 2, 2, 3]
select_b = [1, 3, 2, 3, 1]
My gist was that I should do this using:
df.loc[(select_a, select_b), :]
and then receive a list of all items with a_idx==select_a[i] and b_idx==select_b[i] for all i in range(len(select_a)).
I have tried xs and slice indexing, but this did not return the desired results. My main reason for going to the indexing method is computational speed, as the real dataset has 4.3 million rows and the dataset that has to be created will have even more.
If this is not the best way to achieve this result, then please point me in the right direction. Any sources are also welcome; what I found in the pandas documentation was not geared towards this kind of indexing (or at least I have not been able to find it).
The dataframe is created using the following code:
numbers = pd.DataFrame(np.random.randint(0,10,10), columns=["value"])
numbers["a"] = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
numbers["b"] = [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]
print("before adding the index to the dataframe")
print(numbers)
index_cols = pd.MultiIndex.from_arrays(
[numbers["a"].values, numbers["b"].values],
names=["a_idx", "b_idx"])
df = pd.DataFrame(numbers.values,
index=index_cols,
columns=numbers.columns.values)
df = df.sort_index()
df.drop(columns=["a","b"],inplace=True)
print("after adding the indexes to the dataframe")
print(df)
You were almost there. To get the pair for those indexes, you need to have the syntax like this:
df.loc[[(1, 2), (2, 3)], :]
You can also do this using select_a and select_b. Just make sure that you pass the pairs to df.loc as tuples.
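Building the tuples from select_a and select_b can be done with zip. The sketch below uses the a/b index layout from the question but replaces the random values with range(10) so the result is reproducible; that substitution, and the names pairs and selected, are assumptions made for illustration.

```python
import pandas as pd

a = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
b = [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]
index = pd.MultiIndex.from_arrays([a, b], names=['a_idx', 'b_idx'])
# Deterministic values instead of np.random.randint, for a repeatable demo.
df = pd.DataFrame({'value': range(10)}, index=index).sort_index()

select_a = [1, 1, 2, 2, 3]
select_b = [1, 3, 2, 3, 1]

# A list of (a_idx, b_idx) tuples selects exactly those pairs,
# whereas a tuple of two lists selects their cross product.
pairs = list(zip(select_a, select_b))
selected = df.loc[pairs, :]

print(selected)
```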

function across 2 dataframes based on index (python)

I have 2 dataframes, A and B, and was wondering how to create the dataframe shown in orange.
The value for each cell would be based on its row and column labels. For example, the top left cell would be a function of the row and column index (dataframe A.A0 + dataframe A.A1 - dataframe B.0).
I tried with an empty dataframe of the orange dimensions (emptyDF):
emptyDf.applymap(lambda x: x[dfA[0]] + x[dfA[1]] - x[dfB[0]])
What you are trying to do is not really in the spirit of how a Pandas dataframe is meant to be used. It is more of a matrix manipulation exercise, for which NumPy (the library Pandas is built upon) is more appropriate. It is not hard to move between Pandas dataframes and NumPy arrays and back again, though you need to keep the index and column labels somewhere safe so you can reattach them when you bring the result back into Pandas. NumPy has functions for just about any manipulation you could dream up; I found a few tools that help with this application:
import pandas as pd
import numpy as np
# create your dataframes:
series = pd.Series([10,9,8,7,6], index=[0,1,2,3,4])
df1 = pd.DataFrame([series])
cols = ['A','B','C','D']
list_of_series = [pd.Series([1,2,3,4],index=cols), pd.Series([5,6,7,8],index=cols)]
df2 = pd.DataFrame(list_of_series, columns=cols)
Now convert to NumPy
A = np.array(df2)
>>> A
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
B = np.array(df1)
>>> B.T
array([[10],
[ 9],
[ 8],
[ 7],
[ 6]])
Now a few NumPy operations to accomplish the task:
C = A.sum(axis=0)
D = np.tile(C,(5,1))
E = np.tile(B.T, (1,4))
F = D - E
F
array([[-4, -2, 0, 2],
[-3, -1, 1, 3],
[-2, 0, 2, 4],
[-1, 1, 3, 5],
[ 0, 2, 4, 6]])
Now convert it back to a dataframe:
pd.DataFrame(F, columns=['A','B','C','D'], index=[0,1,2,3,4])
Anyway, I wonder if this can work directly from Pandas, but it strikes me as a matrix issue. In terms of computation time for a large system, since this stays within NumPy I don't think it would be slow.
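As a possible simplification, the two np.tile calls can be left out, because NumPy broadcasting stretches a shape (4,) array against a shape (5, 1) array automatically. The sketch below rebuilds the same example data more compactly; the names col_sums and offsets are introduced here for illustration.

```python
import numpy as np
import pandas as pd

cols = ['A', 'B', 'C', 'D']
df1 = pd.DataFrame([[10, 9, 8, 7, 6]])
df2 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=cols)

col_sums = df2.to_numpy().sum(axis=0)   # shape (4,): one sum per column of df2
offsets = df1.to_numpy().T              # shape (5, 1): one value per output row

# Broadcasting expands both operands to (5, 4) without explicit tiling.
F = col_sums - offsets

result = pd.DataFrame(F, columns=cols, index=df1.columns)
print(result)
```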
