Converting 2d Matrix to Single row DataFrame While Keeping Elements as Integers - python

I have a question regarding to converting a 2D matrix to a single row of Dataframe.
For example I have the following matrix (2D array) with integer elements
2d_array = [[0, 1, 1],[1, 0, 1],[1, 1, 0]]
Is there a way to convert it to a DataFrame like below, and keeping each element as integers?
df =
0 1 2 3 4 5 6 7 8
0 0 1 1 1 0 1 1 1 0
I tried to flatten the 2D array first
flattened_array = 2d_array.flatten()
Then I use pandas.DataFrame
df = pandas.DataFrame(flatttened_array)
But the results gave me a single column Dataframe with elements of "numpy.float64" like below:
df =
0
0 0.0
1 1.0
2 1.0
3 1.0
4 0.0
5 1.0
6 1.0
7 1.0
8 0.0
Please help. Thank you so much!
Tommy

Adding []
df = pd.DataFrame([flattened_array])
df
Out[297]:
0 1 2 3 4 5 6 7 8
0 0 1 1 1 0 1 1 1 0

maybe you can try:
df[flatttened_array] = df[flatttened_array].astype(int)

Another option:
pd.DataFrame(np.array(array).reshape(1,-1))

Related

Dataframe get exact value using an array

Suppose I have the following dataframe:
A B C D Count
0 0 0 0 0 12.0
1 0 0 0 1 2.0
2 0 0 1 0 4.0
3 0 0 1 1 0.0
4 0 1 0 0 3.0
5 0 1 1 0 0.0
6 1 0 0 0 7.0
7 1 0 0 1 9.0
8 1 0 1 0 0.0
... (truncated for readability)
And an array: [1, 0, 0, 1]
I would like to access Count value given the above values of each column. In this case, this would be row 7 with Count = 9.0
I can use iloc or at by deconstructing each value in the array, but that seems inefficient. Wondering if there's a way to map the values in the array to a value of a column.
You can index the DataFrame with a list of the key column names and compare the resulting view to the array, using NumPy broadcasting to do it for each line at once. Then collapse the resulting Boolean DataFrame to a Boolean row index with all() and use that to index the Count column.
If df is the DataFrame and a is the array (or a list):
df.Count.loc[(df[list('ABCD')] == a).all(axis=1)]
You can try with tuple
out = df.loc[df[list('ABCD')].apply(tuple,1) == (1, 0, 0, 1),'Count']
Out[333]:
7 9.0
Name: Count, dtype: float64
I just used the .loc command, and searched for the multiple conditions like this:
f = [1,0,0,1]
result = df['Count'].loc[(df['A']==f[0]) &
(df['B']==f[1]) &
(df['C']==f[2]) &
(df['D']==f[3])].values
print(result)
OUTPUT:
[9.]
However, I like Arne's answer better :)

Sort 'pandas.core.series.Series' so that largest value is in the centre

I have a Pandas Series that looks like this:
import pandas as pd
x = pd.Series([3, 1, 1])
print(x)
0 3
1 1
2 1
I would like to sort the output so that the largest value is in the center. Like this:
0 1
1 3
2 1
Do you have any ideas on how to do this also for series of different lengths (all of them are sorted with decreasing values). The length of the series will always be odd.
Thank you very much!
Anna
First sort values and then use indexing with join values by concat:
x = pd.Series([6, 4, 4, 2, 2, 1, 1])
x = x.sort_values()
print (pd.concat([x[::2], x[len(x)-2:0:-2]]))
5 1
3 2
1 4
0 6
2 4
4 2
6 1
dtype: int64
x = pd.Series(range(7))
x = x.sort_values()
print (pd.concat([x[::2], x[len(x)-2:0:-2]]))
0 0
2 2
4 4
6 6
5 5
3 3
1 1
dtype: int64

pandas adding columns to bottom of column

I have a df = pd.DataFrame([[1, 3, 5], [2, 4, 6]]) that looks like
0 1 2
0 1 3 5
1 2 4 6
I am trying to move each of the columns to the bottom of the first column. It should look something like...
0
0 1
1 2
2 3
3 4
4 5
5 6
Looking for a way to do this with n rows on a much larger dataframe. I was looking for other ways with pandas stack() but have not found a solution.
You could transpose and stack:
import pandas as pd
df = pd.DataFrame([[1, 3, 5], [2, 4, 6]])
res = df.T.stack()
print(res)
Output
0 0 1
1 2
1 0 3
1 4
2 0 5
1 6
dtype: int64
If you want to remove the index, use reset_index (as suggested by #JoeFerndz):
res = df.T.stack().reset_index(drop=True)
print(res)
Output
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
As an alternative, just flatten the numpy array directly:
res = pd.DataFrame(df.values.flatten('F'))
print(res)
Output
0
0 1
1 2
2 3
3 4
4 5
5 6
The F means:
to flatten in column-major (Fortran- style) order.
This can be done using Numpy's reshape method. The code below first converts the DataFrame to a Numpy array and then reshapes it to a column array by traversing the elements in Fortran-like index order. The result is finally converted back to a Pandas DataFrame.
pd.DataFrame(df.values.reshape((-1, 1), order="F"))

How to add numpy matrix as new columns for pandas dataframe?

I have a NxM dataframe and a NxL numpy matrix. I'd like to add the matrix to the dataframe to create L new columns by simply appending the columns and rows the same order they appear. I tried merge() and join(), but I end up with errors:
assign() keywords must be strings
and
columns overlap but no suffix specified
respectively.
Is there a way I can add a numpy matrix as dataframe columns?
You can turn the matrix into a datframe and use concat with axis=1:
For example, given a dataframe df and a numpy array mat:
>>> df
a b
0 5 5
1 0 7
2 1 0
3 0 4
4 6 4
>>> mat
array([[0.44926098, 0.29567859, 0.60728561],
[0.32180566, 0.32499134, 0.94950085],
[0.64958125, 0.00566706, 0.56473627],
[0.17357589, 0.71053224, 0.17854188],
[0.38348102, 0.12440952, 0.90359566]])
You can do:
>>> pd.concat([df, pd.DataFrame(mat)], axis=1)
a b 0 1 2
0 5 5 0.449261 0.295679 0.607286
1 0 7 0.321806 0.324991 0.949501
2 1 0 0.649581 0.005667 0.564736
3 0 4 0.173576 0.710532 0.178542
4 6 4 0.383481 0.124410 0.903596
Setup
df = pd.DataFrame({'a': [5,0,1,0,6], 'b': [5,7,0,4,4]})
mat = np.random.rand(5,3)
Using join:
df.join(pd.DataFrame(mat))
a b 0 1 2
0 5 5 0.884061 0.803747 0.727161
1 0 7 0.464009 0.447346 0.171881
2 1 0 0.353604 0.912781 0.199477
3 0 4 0.466095 0.136218 0.405766
4 6 4 0.764678 0.874614 0.310778
If there is the chance of overlapping column names, simply supply a suffix:
df = pd.DataFrame({0: [5,0,1,0,6], 1: [5,7,0,4,4]})
mat = np.random.rand(5,3)
df.join(pd.DataFrame(mat), rsuffix='_')
0 1 0_ 1_ 2
0 5 5 0.783722 0.976951 0.563798
1 0 7 0.946070 0.391593 0.273339
2 1 0 0.710195 0.827352 0.839212
3 0 4 0.528824 0.625430 0.465386
4 6 4 0.848423 0.467256 0.962953

Combine 2 pandas dataframes according to boolean Vector

My problem is the following:
Let's say I have two dataframes with same number of columns in pandas like for instance:
A= 1 2
3 4
8 9
and
B= 7 8
4 0
And also one boolean vector of length exactly num of rows from A + num of B rows = 5 , with the same number of 1s as num of rows in B which means two 1s in this example.
Let's say Bool= 0 1 0 1 0.
My goal is then to merge A and B into a bigger dataframe called C such that the rows of B corresponds to the 1s in Bool , so with this example it would give me:
C= 1 2
7 8
3 4
4 0
8 9
Do you know how to do this please?
If you know how this would help me tremendously.
Thanks for your reading.
Here's a pandas-only solution that reindexes the original dataframes and then concatenates them:
Bool = pd.Series([0, 1, 0, 1, 0], dtype=bool)
B.index = Bool[ Bool].index
A.index = Bool[~Bool].index
pd.concat([A,B]).sort_index() # sort_index() is not really necessary
# 0 1
#0 1 2
#1 7 8
#2 3 4
#3 4 0
#4 8 9
One option is to create an empty data frame with the expected shape and then fill the values from A and B in:
import pandas as pd
import numpy as np
# initialize a data frame with the same data types as A thanks to #piRSquared
df = pd.DataFrame(np.empty((A.shape[0] + B.shape[0], A.shape[1])), dtype=A.dtypes)
Bool = np.array([0, 1, 0, 1, 0]).astype(bool)
df.loc[Bool,:] = B.values
df.loc[~Bool,:] = A.values
df
# 0 1
#0 1 2
#1 7 8
#2 3 4
#3 4 0
#4 8 9
The following approach will generalize to larger groups than 2. Starting from
A = pd.DataFrame([[1,2],[3,4],[8,9]])
B = pd.DataFrame([[7,8],[4,0]])
C = pd.DataFrame([[9,9],[5,5]])
bb = pd.Series([0, 1, 0, 1, 2, 2, 0])
we can use
pd.concat([A, B, C]).iloc[bb.rank(method='first')-1].reset_index(drop=True)
which gives
In [269]: pd.concat([A, B, C]).iloc[bb.rank(method='first')-1].reset_index(drop=True)
Out[269]:
0 1
0 1 2
1 7 8
2 3 4
3 4 0
4 9 9
5 5 5
6 8 9
This works because when you use method='first', it ranks the values by their values in order and then by the order in which they're seen. This means that we get things like
In [270]: pd.Series([1, 0, 0, 1, 0]).rank(method='first')
Out[270]:
0 4.0
1 1.0
2 2.0
3 5.0
4 3.0
dtype: float64
which is exactly (after subtracting one) the iloc order in which we want to select the rows.

Categories

Resources