How to create a square dataframe/matrix given 3 columns - Python

I am struggling to figure out how to develop a square matrix given a format like
a a 0
a b 3
a c 4
a d 12
b a 3
b b 0
b c 2
...
To something like:
a b c d e
a 0 3 4 12 ...
b 3 0 2 7 ...
c 4 3 0 ...
d 12 ...
e ...
in pandas. I developed a method which I think works, but it takes forever to run because, for every value, it iterates through each column and row with for loops, starting from the beginning each time. I feel like I'm definitely reinventing the wheel here. This also isn't realistic for my dataset given how many columns and rows there are. Is there something similar to R's cast function in Python which can do this significantly faster?

You could use df.pivot:
import pandas as pd
df = pd.DataFrame([['a', 'a', 0],
                   ['a', 'b', 3],
                   ['a', 'c', 4],
                   ['a', 'd', 12],
                   ['b', 'a', 3],
                   ['b', 'b', 0],
                   ['b', 'c', 2]], columns=['X', 'Y', 'Z'])
print(df.pivot(index='X', columns='Y', values='Z'))
yields
Y a b c d
X
a 0.0 3.0 4.0 12.0
b 3.0 0.0 2.0 NaN
Here, index='X' tells df.pivot to use the column labeled 'X' as the index, and columns='Y' tells it to use the column labeled 'Y' as the column index.
See the docs for more on pivot and other reshaping methods.
Alternatively, you could use pd.crosstab:
print(pd.crosstab(index=df.iloc[:, 0], columns=df.iloc[:, 1],
                  values=df.iloc[:, 2], aggfunc='sum'))
Unlike df.pivot, which expects each (a1, a2) pair to be unique, pd.crosstab (with aggfunc='sum') will aggregate duplicate pairs by summing the associated values. Although there are no duplicate pairs in your posted example, specifying how duplicates are supposed to be aggregated is required when the values parameter is used.
Also, whereas df.pivot is passed column labels, pd.crosstab is passed
array-likes (such as whole columns of df). df.iloc[:, i] is the ith column
of df.
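For instance, here is a small illustration (using made-up duplicate data, not your posted example) of how pd.crosstab sums a repeated ('a', 'b') pair, where df.pivot would instead raise because the index/column pair is no longer unique:
dup = pd.DataFrame([['a', 'b', 3],
                    ['a', 'b', 5],
                    ['a', 'c', 4]], columns=['X', 'Y', 'Z'])
# crosstab aggregates the two ('a', 'b') rows into a single cell (3 + 5 = 8)
print(pd.crosstab(index=dup.iloc[:, 0], columns=dup.iloc[:, 1],
                  values=dup.iloc[:, 2], aggfunc='sum'))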

Related

How to record the "least occuring" item in a pandas DataFrame?

I have the following pandas DataFrame, with only three columns:
import pandas as pd
dict_example = {'col1': ['A', 'A', 'A', 'A', 'A'],
                'col2': ['A', 'B', 'A', 'B', 'A'],
                'col3': ['A', 'A', 'A', 'C', 'B']}
df = pd.DataFrame(dict_example)
print(df)
col1 col2 col3
0 A A A
1 A B A
2 A A A
3 A B C
4 A A B
For the rows with differing elements, I'm trying to write a function which will return the column names of the "minority" elements.
As an example, in row 1, there are 2 A's and 1 B. Given there is only one B, I consider this the "minority". If all elements are the same, there's naturally no minority (or majority). However, if each column has a different value, I consider these columns to be minorities.
Here is what I have in mind:
col1 col2 col3 min
0 A A A []
1 A B A ['col2']
2 A A A []
3 A B C ['col1', 'col2', 'col3']
4 A A B ['col3']
I'm stumped as to how to calculate this in a computationally efficient way.
Finding the most frequently occurring items appears straightforward, either with pandas.DataFrame.mode() or by finding the most common item in a list as follows:
lst = ['A', 'B', 'A']
max(lst,key=lst.count)
But I'm not sure how I could find the least occurring items.
This solution is not simple - but I could not think of a pandas-native solution without apply, and numpy seemingly does not provide much help without the complex-number trick below for row-wise uniqueness and value counts.
If you are not fixed on adding this min column, we can use some numpy tricks to NaN out the entries that are not least-occurring. First, given your dataframe, we can make a numpy array of integers to help.
v = pd.factorize(df.stack())[0].reshape(df.shape)
or, faster, since the stack is unnecessary:
v = pd.factorize(df.values.flatten())[0].reshape(df.shape)
Then, using a trick for row-wise unique elements in numpy (complex numbers mark each row's elements as distinct from the other rows'), we find the least occurring elements and mask everything else out. This approach is mostly borrowed from user unutbu, who has used it in several answers.
import numpy as np

def make_mask(a):
    # offset each row by a distinct imaginary part so np.unique works row-wise
    weight = 1j * np.linspace(0, a.shape[1], a.shape[0], endpoint=False)
    b = a + weight[:, np.newaxis]
    u, ind, c = np.unique(b, return_index=True, return_counts=True)
    # place each value's count at its first occurrence, NaN elsewhere
    b = np.full_like(a, np.nan, dtype=float)
    np.put(b, ind, c)
    m = np.nanmin(b, axis=1)
    # remove rows with only one unique value (no minority)
    b[(~np.isnan(b)).sum(axis=1) == 1, :] = np.nan
    # keep only the entries whose count equals the row minimum
    b[~(b == m.reshape(-1, 1))] = np.nan
    return b
m = np.isnan(make_mask(v))
df[m] = np.nan
Giving
col1 col2 col3
0 NaN NaN NaN
1 NaN B NaN
2 NaN NaN NaN
3 A B C
4 NaN NaN B
Hopefully this achieves what you want in a performant way (say, if this dataframe is quite large). If there is an even faster way to compute the first line (beyond dropping stack), I would imagine this would be quite fast even for very large dataframes.
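As a hedged follow-up sketch (not part of the answer above): if you do want the list-valued min column from the question rather than the NaN-masked frame, it can be rebuilt from the same mask on an untouched copy of the data. This assumes dict_example from the question is still in scope, and note that make_mask only flags the first occurrence of each minority value in a row, which is enough for the posted example.
fresh = pd.DataFrame(dict_example)        # unmasked copy of the original data
minority = ~np.isnan(make_mask(v))        # True where an entry is a row-wise minority
fresh['min'] = [list(fresh.columns[row]) for row in minority]
print(fresh)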

Missing columns when trying to groupby aggregate multiple rows in pandas

I have a dataframe with relevant info, and I want to group by one column, say id, with the other columns for the same id joined by "|". However, when I run my code, most of my columns end up missing (only the first 3 appear), and I don't know what is going wrong.
My code is:
df = df.groupby('id').agg(lambda col: '|'.join(set(col))).reset_index()
For instance, my data starts like
id words ... (other columns here)
0 a asd
1 a rtr
2 b s
3 c rrtttt
4 c dsfd
and I want
id ... (other columns here)
a asd|rtr
b s
c rrtttt|dsfd
but also with all the rest of my columns grouped similarly. Right now the rest of my columns just don't appear in my output dataset. Not sure what is going wrong. Thanks!
The non-string columns are most likely being dropped because '|'.join raises on them. Convert everything to string beforehand; you can then also avoid the lambda by using agg(set) and applymap afterwards:
df.astype(str).groupby('id').agg(set).applymap('|'.join)
Minimal Verifiable Example
df = pd.DataFrame({
    'id': ['a', 'a', 'b', 'c', 'c'],
    'numbers': [1, 2, 2, 3, 3],
    'words': ['asd', 'rtr', 's', 'rrtttt', 'dsfd']})
df
df
id numbers words
0 a 1 asd
1 a 2 rtr
2 b 2 s
3 c 3 rrtttt
4 c 3 dsfd
df.astype(str).groupby('id').agg(set).applymap('|'.join)
numbers words
id
a 1|2 asd|rtr
b 2 s
c 3 rrtttt|dsfd
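If you prefer to keep the shape of your original code, the same astype(str) fix also works with your lambda; a minimal sketch (assuming the same df as above), with reset_index restoring id as a regular column:
out = (df.astype(str)
         .groupby('id')
         .agg(lambda col: '|'.join(set(col)))
         .reset_index())
print(out)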

How to use an equality condition for manipulating a Pandas Dataframe based on another dataframe?

I have a dataframe in Python, say A, which has multiple columns, including columns named ECode and FG. I have another Pandas dataframe B, also with multiple columns, including columns named ECode, F Gping (note the space in the column name for F Gping) and EDesc. Note that EDesc, FG and F Gping contain String type values (text), while the remaining columns are numeric/floating type. Also, dataframes A and B have different dimensions (differing rows and columns), and I want to check equality on specific values in the dataframe columns. What I would like to do is to create a new column called EDesc in dataframe A based on the following conditions:
First, for all rows in dataframe A where the value in ECode matches a value of ECode in dataframe B, the new EDesc column to be created in dataframe A should take the corresponding EDesc value from B.
Secondly, for all rows in dataframe A where the value in FG matches an F Gping value in B, the new EDesc column in A should likewise take the corresponding EDesc value from B.
After this, if the newly created EDesc column in A still has missing values/NaNs, fill those rows of A's EDesc column with the string value MissingValue.
I have tried using for loops, as well as list comprehensions, but they don't help in accomplishing this. Moreover, the space within the column name F Gping in B creates problems accessing it; although I can access it as B['F Gping'], that doesn't solve the underlying problem. Any help in this regard is appreciated.
I'm assuming values are unique in B['ECode'] and B['F Gping'], otherwise we'll have to choose which value we give to A['EDesc'] when we find two matching values for ECode or FG.
There might be a smarter way but here's what I would do with joins:
Example DataFrames:
A = pd.DataFrame({'ECode': [1, 1, 3, 4, 6],
                  'FG': ['a', 'b', 'c', 'b', 'y']})
B = pd.DataFrame({'ECode': [1, 2, 3, 5],
                  'F Gping': ['b', 'c', 'x', 'x'],
                  'EDesc': ['a', 'b', 'c', 'd']})
So they look like:
A
ECode FG
0 1 a
1 1 b
2 3 c
3 4 b
4 6 y
B
ECode F Gping EDesc
0 1 b a
1 2 c b
2 3 x c
3 5 x d
First let's create A['EDesc'] by saying that it's the result of joining A and B on ECode. We'll temporarily use ECode as the index:
A.set_index('ECode', inplace=True, drop=False)
B.set_index('ECode', inplace=True, drop=False)
A['EDesc'] = A.join(B, lsuffix='A')['EDesc']
This works because the result of A.join(B, lsuffix='A') is:
ECodeA FG ECode F Gping EDesc
ECode
1 1 a 1.0 b a
1 1 b 1.0 b a
3 3 c 3.0 x c
4 4 b NaN NaN NaN
6 6 y NaN NaN NaN
Now let's fillna on A['EDesc'], using the match on FG. Same thing:
A.set_index('FG', inplace=True, drop=False)
B.set_index('F Gping', inplace=True, drop=False)
A['EDesc'].fillna(A.join(B, lsuffix='A')['EDesc'].drop_duplicates(), inplace=True)
This works because the result of A.join(B, lsuffix='A') is:
ECodeA FG EDescA ECode F Gping EDesc
a 1 a a NaN NaN NaN
b 1 b a 1.0 b a
b 4 b NaN 1.0 b a
c 3 c c 2.0 c b
y 6 y NaN NaN NaN NaN
Also, we dropped the duplicates because, as you can see, there are two b's in our index.
Finally let's fillna with "Missing" and reset the index:
A['EDesc'].fillna('Missing', inplace=True)
A.reset_index(drop=True, inplace=True)
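For what it's worth, here is a hedged alternative sketch (not from the answer above) doing the same two lookups with Series.map instead of joins, on the example A and B; drop_duplicates keeps the first EDesc when a lookup key repeats (e.g. the two 'x' values in F Gping), mirroring the drop_duplicates() call above:
by_ecode = B.drop_duplicates('ECode').set_index('ECode')['EDesc']
by_fgping = B.drop_duplicates('F Gping').set_index('F Gping')['EDesc']
A['EDesc'] = (A['ECode'].map(by_ecode)
                        .fillna(A['FG'].map(by_fgping))
                        .fillna('Missing'))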

Mean over a pandas DataFrame object including statistical significance after a groupby

Imagine I've got this pandas DataFrame:
Class Val
0 A 1
1 B 1
2 B 1
3 B 1
4 B 0
And I want to take the mean of the values grouped by Class, BUT keeping in mind the statistical significance of the values, so that if B had a lot of Val equal to 1, the resulting mean of B would overcome the resulting mean of A, because A only has one observation.
Use:
import pandas as pd
df = pd.DataFrame({'Class': ['A', 'B', 'B', 'B', 'B'], 'Val': [1, 1, 1, 1, 0]})
print(df.groupby('Class').agg(['mean', 'count']))
You will have to expand on how you decide which to use, but this provides you with the basic info you need to do that.
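As one possible illustration of how those two numbers could be combined (not from the answer above, just a sketch of a common choice): shrink each class mean toward the global mean, weighting by the observation count; prior_weight is an arbitrary smoothing constant you would have to pick.
stats = df.groupby('Class')['Val'].agg(['mean', 'count'])
global_mean = df['Val'].mean()
prior_weight = 2                       # hypothetical smoothing constant
stats['shrunk_mean'] = ((stats['count'] * stats['mean'] + prior_weight * global_mean)
                        / (stats['count'] + prior_weight))
print(stats)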

pandas DataFrame set value on boolean mask

I'm trying to set a number of different values in a pandas DataFrame all to the same value. I thought I understood boolean indexing for pandas, but I haven't found any resources on this specific error.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df[mask] = 30
Traceback (most recent call last):
...
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
Above, I want to replace all of the True entries in the mask with the value 30.
I could do df.replace instead, but masking feels a bit more efficient and intuitive here. Can someone explain the error, and provide an efficient way to set all of the values?
You can't use a boolean mask on mixed dtypes for this, unfortunately; you can use pandas where to set the values:
In [59]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df = df.where(mask, other=30)
df
Out[59]:
A B
0 1 a
1 30 30
2 3 30
Note that the above will fail if you pass inplace=True to the where method, so df.where(mask, other=30, inplace=True) will raise:
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
EDIT
OK, after a little misunderstanding, you can still use where by just inverting the mask:
In [2]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df.where(~mask, other=30)
Out[2]:
A B
0 30 30
1 2 b
2 30 f
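As a side note (a sketch, not part of the original answer): DataFrame.mask is the complement of where, so the inversion can be avoided entirely:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df.mask(mask, other=30)   # same result as df.where(~mask, other=30)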
If you want to use different columns to create your mask, you need to call the values property of the dataframe.
Example
Let's say we want to replace values in A_1 and A_2 according to a mask over B_1 and B_2. For example, replace those values in A (with 999) that correspond to nulls in B.
The original dataframe:
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 5 n NaN
2 3 6 NaN NaN
The desired dataframe
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 999 n NaN
2 999 999 NaN NaN
The code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})
_mask = df[['B_1', 'B_2']].notnull().values
df[['A_1', 'A_2']] = df[['A_1', 'A_2']].where(_mask, other=999)
A_1 A_2
0 1 4
1 2 999
2 999 999
I'm not 100% sure but I suspect the error message relates to the fact that there is not identical treatment of missing data across different dtypes. Only float has NaN, but integers can be automatically converted to floats so it's not a problem there. But it appears mixing number dtypes and object dtypes does not work so easily...
Regardless of that, you could get around it pretty easily with np.where:
df[:] = np.where(mask, 30, df)
A B
0 30 30
1 2 b
2 30 f
pandas uses NaN to mark invalid or missing data, and it can be used across types. Since your DataFrame has mixed int and string data types, it will not accept assignment of a single value of another type (other than NaN), as this would create a mixed type (int and str) in B through an in-place assignment.
@JohnE's method using np.where creates a new DataFrame in which the type of column B is object, not string as in the initial example.
