Pandas: MultiIndex from Nested Dictionary

Suppose I have a nested dictionary of the format:
dictionary = {
    "A": [1, 2],
    "B": [2, 3],
    "Coords": [{
        "X": [1, 2, 3],
        "Y": [1, 2, 3],
        "Z": [1, 2, 3],
    }, {
        "X": [2, 3],
        "Y": [2, 3],
        "Z": [2, 3],
    }]
}
How can I turn this into a Pandas MultiIndex Dataframe?
Equivalently, how can I produce a Dataframe where the information in the row is not duplicated for every co-ordinate?
As I imagine it, the two rows of the output DataFrame would look something like this:
Index  A  B  Coords
---------------------
0      1  2  X  Y  Z
             1  1  1
             2  2  2
             3  3  3
---------------------
1      2  3  X  Y  Z
             2  2  2
             3  3  3
---------------------

From your dictionary:
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict(dictionary)
>>> df
A B Coords
0 1 2 {'X': [1, 2, 3], 'Y': [1, 2, 3], 'Z': [1, 2, 3]}
1 2 3 {'X': [2, 3], 'Y': [2, 3], 'Z': [2, 3]}
Then we can apply pd.Series to the Coords column to expand each dictionary into its own columns:
df_concat = pd.concat([df.drop(['Coords'], axis=1), df['Coords'].apply(pd.Series)], axis=1)
>>> df_concat
A B X Y Z
0 1 2 [1, 2, 3] [1, 2, 3] [1, 2, 3]
1 2 3 [2, 3] [2, 3] [2, 3]
Finally, we use the explode method to turn the lists into rows and set the index on columns A and B to get the expected result:
>>> df_concat.explode(['X', 'Y', 'Z']).reset_index().set_index(['index', 'A', 'B'])
           X  Y  Z
index A B
0     1 2  1  1  1
        2  2  2  2
        2  3  3  3
1     2 3  2  2  2
        3  3  3  3
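A slightly shorter variant, assuming the same df_concat: set_index with append=True keeps the original row number as the outer index level, so the reset_index/set_index pair collapses into a single call (the outer level is simply left unnamed rather than being called index):
>>> df_concat.explode(['X', 'Y', 'Z']).set_index(['A', 'B'], append=True)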
UPDATE:
If you are using a pandas version older than 1.3.0 (explode only accepts multiple columns from 1.3.0 onwards), we can use the trick given by @MillerMrosek in this answer:
def explode(df, columns):
    # Zip the list columns together so each row holds a list of (X, Y, Z) tuples
    df['tmp'] = df.apply(lambda row: list(zip(*[row[_clm] for _clm in columns])), axis=1)
    # Explode the tuples into rows, then split them back out into the original columns
    df = df.explode('tmp')
    df[columns] = pd.DataFrame(df['tmp'].tolist(), index=df.index)
    df.drop(columns='tmp', inplace=True)
    return df
explode(df_concat, ["X", "Y", "Z"]).reset_index().set_index(['index', 'A', 'B'])
Output:
           X  Y  Z
index A B
0     1 2  1  1  1
        2  2  2  2
        2  3  3  3
1     2 3  2  2  2
        3  3  3  3

Related

How to get all unique combinations of values in one column that share a value in another column

Starting with a dataframe like this:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'b', 'b', 'a']})
A B
0 1 a
1 2 b
2 3 b
3 4 b
4 5 a
What is the best way of getting to a dataframe like this?
pd.DataFrame({'source': [1, 2, 2, 3], 'target': [5, 3, 4, 4]})
source target
0 1 5
1 2 3
2 2 4
3 3 4
Whenever two rows of column A share the same value in column B, I want to record that pair of A values (each unique relationship only once) in a new dataframe.
This is pretty close:
df.groupby('B')['A'].unique()
B
a [1, 5]
b [2, 3, 4]
Name: A, dtype: object
But I'd ideally convert it into a single dataframe now and my brain has gone kaput.
In your case, you can use itertools.combinations:
import itertools
s = df.groupby('B')['A'].apply(lambda x: set(itertools.combinations(x, 2))).explode().tolist()
out = pd.DataFrame(s, columns=['source', 'target'])
out
Out[312]:
source target
0 1 5
1 3 4
2 2 3
3 2 4
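For context: groupby('B')['A'].apply(...) builds every within-group pair with itertools.combinations, the set deduplicates them, and explode flattens each group's set into individual tuples before they are packed into the source/target frame. Since sets are iterated in arbitrary order rather than sorted order, a small optional tweak (assuming the out built above) makes the row order deterministic:
out = out.sort_values(['source', 'target']).reset_index(drop=True)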
Alternatively, use the merge function:
df.merge(df, how = "outer", on = ["B"]).query("A_x < A_y")
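For reference, a sketch of how the self-merge can be carried all the way to the expected source/target frame (assuming the df from the question): merging df with itself on B pairs every row with every other row that shares a B value, and the query keeps each unordered pair exactly once.
out = (
    df.merge(df, how="outer", on="B")                      # pair rows that share the same B
      .query("A_x < A_y")                                  # keep each unordered pair once
      .rename(columns={"A_x": "source", "A_y": "target"})  # label the pair columns
      [["source", "target"]]
      .reset_index(drop=True)
)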

Fill column based on subsets of array

I have a dataframe like this
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'A': [1, 2, 3, 2, 3, 1],
        'B': [5, 2, 4, 1, 4, 5],
        'C': list('abcdef')
    }
)
and an array like this:
a = np.array([
    [1, 5],
    [3, 4]
])
I would now like to add a column D to df which contains the word "found" wherever the row's (A, B) pair appears as a row of a.
A straightforward implementation would be
for li in a.tolist():
    m = (df['A'] == li[0]) & (df['B'] == li[1])
    df.loc[m, 'D'] = "found"
which gives the desired outcome
A B C D
0 1 5 a found
1 2 2 b NaN
2 3 4 c found
3 2 1 d NaN
4 3 4 e found
5 1 5 f found
Is there a solution which would avoid the loop?
One option is to use merge with indicator: the indicator column reports whether each row's (A, B) pair was found in a ("both") or only in df ("left_only").
out = df.merge(pd.DataFrame(a,columns=['A','B']),how='left',indicator="D")
out['D'] = np.where(out['D'].eq("both"),"Found","Not Found")
print(out)
A B C D
0 1 5 a Found
1 2 2 b Not Found
2 3 4 c Found
3 2 1 d Not Found
4 3 4 e Found
5 1 5 f Found
Here is one way of doing it, using numpy broadcasting:
m = (df[['A', 'B']].values[:, None] == a).all(-1).any(-1)
df['D'] = np.where(m, 'Found', 'Not found')
A B C D
0 1 5 a Found
1 2 2 b Not found
2 3 4 c Found
3 2 1 d Not found
4 3 4 e Found
5 1 5 f Found
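A hedged walk-through of the intermediate shapes in the broadcasting expression (assuming the df and a defined above):
vals = df[['A', 'B']].values   # shape (6, 2): one (A, B) pair per row
pairs = vals[:, None]          # shape (6, 1, 2): extra axis so it broadcasts against a
eq = pairs == a                # shape (6, 2, 2): compare every row with every row of a
row_match = eq.all(-1)         # shape (6, 2): True where both A and B match a row of a
m = row_match.any(-1)          # shape (6,): True if the row matches any row of a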
Here is another way:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'A': [1, 2, 3, 2, 3, 1],
        'B': [5, 2, 4, 1, 4, 5],
        'C': list('abcdef')
    }
)
a = np.array([
    [1, 5],
    [3, 4]
])
df = df.merge(pd.DataFrame(a, columns=['A', 'B']), 'left', indicator="D")
D = df.pop("D")
df['D'] = 'found'
df['D'] = df['D'].where(D.eq('both'), other=np.nan)
print(df)
Output:
A B C D
0 1 5 a found
1 2 2 b NaN
2 3 4 c found
3 2 1 d NaN
4 3 4 e found
5 1 5 f found

pandas.DataFrame.groupby.nunique() does not drop the groupby column/s. Is this a bug?

Although I set the parameter as_index to True, pandas.DataFrame.groupby.nunique() keeps the columns I am grouping by in the result.
The pandas version is: 0.24.1
df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 2],
     'b': [1, 2, 3, 4, 4]}
)
df.groupby('a', as_index=True).nunique()
The output is:
# a b
# a
# 1 1 2
# 2 1 2
# 3 1 1
I expected:
# b
# a
# 1 2
# 2 2
# 3 1
As a counterexample that behaves as expected:
df.groupby('a', as_index=True).max()
results in:
# b
# a
# 1 2
# 2 4
# 3 4
If you run [print(df.to_string() + '\n') for i, df in df.groupby('a', as_index=True)], the following is printed:
a b
0 1 1
1 1 2
a b
2 2 3
4 2 4
a b
3 3 4
The a column isn't set as the index within each group. as_index=True (which is also the default) only sets the group keys as the index of the aggregated output, not of the per-group data frames, so a is still an ordinary column inside each group and nunique counts it along with b.
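A minimal workaround, assuming the df defined above: select the column(s) to aggregate before calling nunique, so the grouping column is not counted.
df.groupby('a', as_index=True)['b'].nunique()
# a
# 1    2
# 2    2
# 3    1
# Name: b, dtype: int64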

Python Pandas - concatenate two column values into a single column with label name [duplicate]

I have a dataframe like this where the columns are the scores of some metrics:
A B C D
4 3 3 1
2 5 2 2
3 5 2 4
I want to create a new column to summarize which metrics each row scored over a set threshold in, using the column name as a string. So if the threshold was A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:
A B C D NewCol
4 3 3 1 AC
2 5 2 2 BC
3 5 2 4 ABCD
I tried using a series of np.where:
df['NewCol'] = np.where(df['A'] > 2, 'A', '')
df['NewCol'] = np.where(df['B'] > 3, 'B', '')
etc.
but realized the result was being overwritten by the last metric whenever all four metrics didn't meet their conditions, like so:
A B C D NewCol
4 3 3 1 C
2 5 2 2 C
3 5 2 4 ABCD
I am pretty sure there is an easier and correct way to do this.
You could do:
import pandas as pd
data = [[4, 3, 3, 1],
        [2, 5, 2, 2],
        [3, 5, 2, 4]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])
th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}
df['result'] = [''.join(k for k in df.columns if record[k] > th[k]) for record in df.to_dict('records')]
print(df)
Output
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD
Using dot
s = pd.Series([2, 3, 1, 3], index=df.columns)
df.gt(s, axis=1).dot(df.columns)
Out[179]:
0 AC
1 BC
2 ABCD
dtype: object
#df['New']=df.gt(s,1).dot(df.columns)
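To unpack this trick (a sketch assuming df holds only the four metric columns and s is the threshold Series above): gt compares each row against the thresholds, and the dot product of the resulting boolean frame with the column labels works because True * 'A' is 'A', False * 'A' is '', and summing the strings across a row concatenates the surviving labels.
mask = df.gt(s, axis=1)              # boolean DataFrame: True where the score beats its threshold
df['NewCol'] = mask.dot(df.columns)  # row-wise concatenation of the matching column labels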
Another option that operates in an array-oriented fashion; it would be interesting to compare performance.
import pandas as pd
import numpy as np
# Data to test.
data = pd.DataFrame(
    [
        [4, 3, 3, 1],
        [2, 5, 2, 2],
        [3, 5, 2, 4]
    ],
    columns=['A', 'B', 'C', 'D']
)
# Series to hold the thresholds.
thresholds = pd.Series([2, 3, 1, 3], index = ['A', 'B', 'C', 'D'])
# Subtract the series from the data, broadcasting, and then use sum to concatenate the strings.
data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis = 1)
print(data)
Gives:
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD

Pandas multiindex boolean indexing

Given a multi-indexed dataframe, I would like to return only the rows belonging to outer-level index values for which a condition holds for every entry of the inner level. Here is a small working example:
df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4], 'c': [0, 2, 2, 2]})
df = df.set_index(['a', 'b'])
print(df)
out:
c
a b
1 1 0
2 2
2 3 2
4 2
Now, I would like to return the entries for which c > 1. For instance, I would like to do something like
df[df['c'] > 1]
out:
c
a b
1 2 2
2 3 2
4 2
But I want to get
out:
c
a b
2 3 2
4 2
Any thoughts on how to do this in the most efficient way?
I ended up using groupby:
df.groupby(level=0).filter(lambda x: all(v > 1 for v in x['c']))
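A loop-free alternative sketch, assuming the df defined above: transform broadcasts the per-group minimum of c back onto every row, so ordinary boolean indexing keeps only the groups in which every entry satisfies c > 1.
mask = df.groupby(level=0)['c'].transform('min') > 1  # True for rows whose whole group has c > 1
df[mask]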
