Pandas: MultiIndex from Nested Dictionary

Suppose I have a nested dictionary of the format:
dictionary = {
    "A": [1, 2],
    "B": [2, 3],
    "Coords": [{
        "X": [1, 2, 3],
        "Y": [1, 2, 3],
        "Z": [1, 2, 3],
    }, {
        "X": [2, 3],
        "Y": [2, 3],
        "Z": [2, 3],
    }]
}
How can I turn this into a Pandas MultiIndex Dataframe?
Equivalently, how can I produce a Dataframe where the information in the row is not duplicated for every co-ordinate?
As I imagine it, the two rows of the output DataFrame would look something like this:
Index  A  B  Coords
---------------------
0      1  2  X  Y  Z
             1  1  1
             2  2  2
             3  3  3
---------------------
1      2  3  X  Y  Z
             2  2  2
             3  3  3
---------------------

From your dictionary:
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict(dictionary)
>>> df
A B Coords
0 1 2 {'X': [1, 2, 3], 'Y': [1, 2, 3], 'Z': [1, 2, 3]}
1 2 3 {'X': [2, 3], 'Y': [2, 3], 'Z': [2, 3]}
Then we can apply pd.Series to the Coords column to expand each dictionary into its own columns:
df_concat = pd.concat([df.drop(['Coords'], axis=1), df['Coords'].apply(pd.Series)], axis=1)
>>> df_concat
A B X Y Z
0 1 2 [1, 2, 3] [1, 2, 3] [1, 2, 3]
1 2 3 [2, 3] [2, 3] [2, 3]
Finally, we use the explode method to turn the lists into rows and set the index on columns A and B to get the expected result:
>>> df_concat.explode(['X', 'Y', 'Z']).reset_index().set_index(['index', 'A', 'B'])
           X  Y  Z
index A B
0     1 2  1  1  1
        2  2  2  2
        2  3  3  3
1     2 3  2  2  2
        3  3  3  3
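A slightly shorter variant, assuming the same df_concat: set_index with append=True keeps the original row number as the outer index level, so the reset_index/set_index pair collapses into a single call (the outer level is simply left unnamed rather than being called index):
>>> df_concat.explode(['X', 'Y', 'Z']).set_index(['A', 'B'], append=True)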
UPDATE:
If you are using a pandas version older than 1.3.0 (explode only accepts multiple columns from 1.3.0 onwards), we can use the trick given by @MillerMrosek in this answer:
def explode(df, columns):
    # Zip the list columns together so each row holds a list of (X, Y, Z) tuples
    df['tmp'] = df.apply(lambda row: list(zip(*[row[_clm] for _clm in columns])), axis=1)
    # Explode the tuples into rows, then split them back out into the original columns
    df = df.explode('tmp')
    df[columns] = pd.DataFrame(df['tmp'].tolist(), index=df.index)
    df.drop(columns='tmp', inplace=True)
    return df
explode(df_concat, ["X", "Y", "Z"]).reset_index().set_index(['index', 'A', 'B'])
Output:
           X  Y  Z
index A B
0     1 2  1  1  1
        2  2  2  2
        2  3  3  3
1     2 3  2  2  2
        3  3  3  3

Related

How to get all unique combinations of values in one column that share a value in another column

Starting with a dataframe like this:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'b', 'b', 'a']})
A B
0 1 a
1 2 b
2 3 b
3 4 b
4 5 a
What is the best way of getting to a dataframe like this?
pd.DataFrame({'source': [1, 2, 2, 3], 'target': [5, 3, 4, 4]})
source target
0 1 5
1 2 3
2 2 4
3 3 4
Whenever two rows of column A share the same value in column B, I want to record that pair of A values (each unique relationship only once) in a new dataframe.
This is pretty close:
df.groupby('B')['A'].unique()
B
a [1, 5]
b [2, 3, 4]
Name: A, dtype: object
But I'd ideally convert it into a single dataframe now and my brain has gone kaput.
In your case, you can use itertools.combinations:
import itertools
s = df.groupby('B')['A'].apply(lambda x: set(itertools.combinations(x, 2))).explode().tolist()
out = pd.DataFrame(s, columns=['source', 'target'])
out
Out[312]:
source target
0 1 5
1 3 4
2 2 3
3 2 4
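For context: groupby('B')['A'].apply(...) builds every within-group pair with itertools.combinations, the set deduplicates them, and explode flattens each group's set into individual tuples before they are packed into the source/target frame. Since sets are iterated in arbitrary order rather than sorted order, a small optional tweak (assuming the out built above) makes the row order deterministic:
out = out.sort_values(['source', 'target']).reset_index(drop=True)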
Alternatively, use the merge function:
df.merge(df, how = "outer", on = ["B"]).query("A_x < A_y")
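For reference, a sketch of how the self-merge can be carried all the way to the expected source/target frame (assuming the df from the question): merging df with itself on B pairs every row with every other row that shares a B value, and the query keeps each unordered pair exactly once.
out = (
    df.merge(df, how="outer", on="B")                      # pair rows that share the same B
      .query("A_x < A_y")                                  # keep each unordered pair once
      .rename(columns={"A_x": "source", "A_y": "target"})  # label the pair columns
      [["source", "target"]]
      .reset_index(drop=True)
)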

Fill column based on subsets of array

I have a dataframe like this
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'A': [1, 2, 3, 2, 3, 1],
        'B': [5, 2, 4, 1, 4, 5],
        'C': list('abcdef')
    }
)
and an array like this:
a = np.array([
    [1, 5],
    [3, 4]
])
I would now like to add a column D to df which contains the word "found" wherever the row's (A, B) pair appears as a row of a.
A straightforward implementation would be
for li in a.tolist():
    m = (df['A'] == li[0]) & (df['B'] == li[1])
    df.loc[m, 'D'] = "found"
which gives the desired outcome
A B C D
0 1 5 a found
1 2 2 b NaN
2 3 4 c found
3 2 1 d NaN
4 3 4 e found
5 1 5 f found
Is there a solution which would avoid the loop?
One option is to use merge with indicator: the indicator column reports whether each row's (A, B) pair was found in a ("both") or only in df ("left_only").
out = df.merge(pd.DataFrame(a,columns=['A','B']),how='left',indicator="D")
out['D'] = np.where(out['D'].eq("both"),"Found","Not Found")
print(out)
A B C D
0 1 5 a Found
1 2 2 b Not Found
2 3 4 c Found
3 2 1 d Not Found
4 3 4 e Found
5 1 5 f Found
Here is one way of doing it, using numpy broadcasting:
m = (df[['A', 'B']].values[:, None] == a).all(-1).any(-1)
df['D'] = np.where(m, 'Found', 'Not found')
A B C D
0 1 5 a Found
1 2 2 b Not found
2 3 4 c Found
3 2 1 d Not found
4 3 4 e Found
5 1 5 f Found
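A hedged walk-through of the intermediate shapes in the broadcasting expression (assuming the df and a defined above):
vals = df[['A', 'B']].values   # shape (6, 2): one (A, B) pair per row
pairs = vals[:, None]          # shape (6, 1, 2): extra axis so it broadcasts against a
eq = pairs == a                # shape (6, 2, 2): compare every row with every row of a
row_match = eq.all(-1)         # shape (6, 2): True where both A and B match a row of a
m = row_match.any(-1)          # shape (6,): True if the row matches any row of a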
Here is another way:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'A': [1, 2, 3, 2, 3, 1],
        'B': [5, 2, 4, 1, 4, 5],
        'C': list('abcdef')
    }
)
a = np.array([
    [1, 5],
    [3, 4]
])
df = df.merge(pd.DataFrame(a, columns=['A', 'B']), 'left', indicator="D")
D = df.pop("D")
df['D'] = 'found'
df['D'] = df['D'].where(D.eq('both'), other=np.nan)
print(df)
Output:
A B C D
0 1 5 a found
1 2 2 b NaN
2 3 4 c found
3 2 1 d NaN
4 3 4 e found
5 1 5 f found

pandas.DataFrame.groupby.nunique() does not drop the groupby column/s. Is this a bug?

Although I set the parameter as_index to True, pandas.DataFrame.groupby.nunique() keeps the columns I am grouping by in the result.
The pandas version is: 0.24.1
df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 2],
     'b': [1, 2, 3, 4, 4]}
)
df.groupby('a', as_index=True).nunique()
The output is:
# a b
# a
# 1 1 2
# 2 1 2
# 3 1 1
I expected:
# b
# a
# 1 2
# 2 2
# 3 1
As a counterexample that behaves as expected:
df.groupby('a', as_index=True).max()
results in:
# b
# a
# 1 2
# 2 4
# 3 4
If you run [print(df.to_string() + '\n') for i, df in df.groupby('a', as_index=True)], the following is printed:
a b
0 1 1
1 1 2
a b
2 2 3
4 2 4
a b
3 3 4
The a column isn't set as the index within each group. as_index=True (which is also the default) only sets the group keys as the index of the aggregated output, not of the per-group data frames, so a is still an ordinary column inside each group and nunique counts it along with b.
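A minimal workaround, assuming the df defined above: select the column(s) to aggregate before calling nunique, so the grouping column is not counted.
df.groupby('a', as_index=True)['b'].nunique()
# a
# 1    2
# 2    2
# 3    1
# Name: b, dtype: int64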

Python Pandas - concatenate two column values into a single column with label name [duplicate]

I have a dataframe like this where the columns are the scores of some metrics:
A B C D
4 3 3 1
2 5 2 2
3 5 2 4
I want to create a new column to summarize which metrics each row scored over a set threshold in, using the column name as a string. So if the threshold was A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:
A B C D NewCol
4 3 3 1 AC
2 5 2 2 BC
3 5 2 4 ABCD
I tried using a series of np.where:
df['NewCol'] = np.where(df['A'] > 2, 'A', '')
df['NewCol'] = np.where(df['B'] > 3, 'B', '')
etc.
but realized the result was being overwritten by the last metric whenever all four metrics didn't meet their conditions, like so:
A B C D NewCol
4 3 3 1 C
2 5 2 2 C
3 5 2 4 ABCD
I am pretty sure there is an easier and correct way to do this.
You could do:
import pandas as pd
data = [[4, 3, 3, 1],
        [2, 5, 2, 2],
        [3, 5, 2, 4]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])
th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}
df['result'] = [''.join(k for k in df.columns if record[k] > th[k]) for record in df.to_dict('records')]
print(df)
Output
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD
Using dot
s = pd.Series([2, 3, 1, 3], index=df.columns)
df.gt(s, axis=1).dot(df.columns)
Out[179]:
0 AC
1 BC
2 ABCD
dtype: object
#df['New']=df.gt(s,1).dot(df.columns)
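To unpack this trick (a sketch assuming df holds only the four metric columns and s is the threshold Series above): gt compares each row against the thresholds, and the dot product of the resulting boolean frame with the column labels works because True * 'A' is 'A', False * 'A' is '', and summing the strings across a row concatenates the surviving labels.
mask = df.gt(s, axis=1)              # boolean DataFrame: True where the score beats its threshold
df['NewCol'] = mask.dot(df.columns)  # row-wise concatenation of the matching column labels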
Another option that operates in an array-oriented fashion; it would be interesting to compare performance.
import pandas as pd
import numpy as np
# Data to test.
data = pd.DataFrame(
    [
        [4, 3, 3, 1],
        [2, 5, 2, 2],
        [3, 5, 2, 4]
    ],
    columns=['A', 'B', 'C', 'D']
)
# Series to hold the thresholds.
thresholds = pd.Series([2, 3, 1, 3], index = ['A', 'B', 'C', 'D'])
# Subtract the series from the data, broadcasting, and then use sum to concatenate the strings.
data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis = 1)
print(data)
Gives:
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD

Pandas multiindex boolean indexing

Given a multi-indexed dataframe, I would like to return only the rows belonging to outer-level index values for which a condition holds for every entry of the inner level. Here is a small working example:
df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4], 'c': [0, 2, 2, 2]})
df = df.set_index(['a', 'b'])
print(df)
out:
c
a b
1 1 0
2 2
2 3 2
4 2
Now, I would like to return the entries for which c > 1. For instance, I would like to do something like
df[df['c'] > 1]
out:
c
a b
1 2 2
2 3 2
4 2
But I want to get
out:
c
a b
2 3 2
4 2
Any thoughts on how to do this in the most efficient way?
I ended up using groupby:
df.groupby(level=0).filter(lambda x: all(v > 1 for v in x['c']))
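A loop-free alternative sketch, assuming the df defined above: transform broadcasts the per-group minimum of c back onto every row, so ordinary boolean indexing keeps only the groups in which every entry satisfies c > 1.
mask = df.groupby(level=0)['c'].transform('min') > 1  # True for rows whose whole group has c > 1
df[mask]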
