Suppose I have a 2D numpy array like this:
arr = np.array([[1, 2], [3, 4], [5, 6]])
# array([[1, 2],
#        [3, 4],
#        [5, 6]])
How can one transform that to a "long" structure with one record per value, associated with the row and column index? In this case that would look like:
df = pd.DataFrame({'row': [0, 0, 1, 1, 2, 2],
                   'column': [0, 1, 0, 1, 0, 1],
                   'value': [1, 2, 3, 4, 5, 6]})
melt only assigns the column identifier, not the row:
pd.DataFrame(arr).melt()
# variable value
# 0 0 1
# 1 0 3
# 2 0 5
# 3 1 2
# 4 1 4
# 5 1 6
Is there a way to attach the row identifier?
Pass 'index' to id_vars:
pd.DataFrame(arr).reset_index().melt('index')
# index variable value
# 0 0 0 1
# 1 1 0 3
# 2 2 0 5
# 3 0 1 2
# 4 1 1 4
# 5 2 1 6
You can then rename the columns:
df = pd.DataFrame(arr).reset_index().melt('index')
df.columns = ['row', 'column', 'value']
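Alternatively (a small sketch, not part of the original answer), melt can name the variable and value columns directly via var_name and value_name, so only the index column needs renaming:
df = (pd.DataFrame(arr)
        .reset_index()
        .melt('index', var_name='column', value_name='value')
        .rename(columns={'index': 'row'}))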
melt can use the index if it's a column:
arrdf = pd.DataFrame(arr)
arrdf['row'] = arrdf.index
arrdf.melt(id_vars='row', var_name='column')
# row column value
# 0 0 0 1
# 1 1 0 3
# 2 2 0 5
# 3 0 1 2
# 4 1 1 4
# 5 2 1 6
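For reference, the same long table can also be built in plain NumPy, without melt (a sketch; the rows/cols helper names are mine, not from the answers above):
rows, cols = np.indices(arr.shape)           # index grids with the same shape as arr
df = pd.DataFrame({'row': rows.ravel(),      # 0, 0, 1, 1, 2, 2
                   'column': cols.ravel(),   # 0, 1, 0, 1, 0, 1
                   'value': arr.ravel()})    # 1, 2, 3, 4, 5, 6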
I'm looking to find a minimum value across level 1 of a MultiIndex ('time' in this example), but I'd like to retain all the other labels of the index.
import numpy as np
import pandas as pd
stack = [
    [0, 1, 1, 5],
    [0, 1, 2, 6],
    [0, 1, 3, 2],
    [0, 2, 3, 4],
    [0, 2, 2, 5],
    [0, 3, 2, 1],
    [1, 1, 0, 5],
    [1, 1, 2, 6],
    [1, 1, 3, 7],
    [1, 2, 2, 8],
    [1, 2, 3, 9],
    [2, 1, 7, 1],
    [2, 1, 8, 3],
    [2, 2, 3, 4],
    [2, 2, 8, 1],
]
df = pd.DataFrame(stack)
df.columns = ['self', 'time', 'other', 'value']
df.set_index(['self', 'time', 'other'], inplace=True)
df.groupby(level=1).min() doesn't return the correct values:
      value
time
1         1
2         1
3         1
Doing something like df.groupby(level=[0,1,2]).min() returns the original DataFrame unchanged.
I swear I used to be able to do this by calling .min(level=1), but that now raises deprecation warnings telling me to use the groupby form above, and the result seems different from what I remember.
original:
                 value
self time other
0    1    1          5
          2          6
          3          2  # <-- min row
     2    3          4  # <-- min row
          2          5
     3    2          1  # <-- min row
1    1    0          5  # <-- min row
          2          6
          3          7
     2    2          8  # <-- min row
          3          9
2    1    7          1  # <-- min row
          8          3
     2    3          4
          8          1  # <-- min row
desired result:
                 value
self time other
0    1    3          2
     2    3          4
     3    2          1
1    1    0          5
     2    2          8
2    1    7          1
     2    8          1
Group by your first two levels, then take idxmin instead of min to get the full index of each minimum row. Finally, use loc to select those rows from your original dataframe:
out = df.loc[df.groupby(level=['self', 'time'])['value'].idxmin()]
print(out)
# Output
                 value
self time other
0    1    3          2
     2    3          4
     3    2          1
1    1    0          5
     2    2          8
2    1    7          1
     2    8          1
Why not just group by the first two index levels, rather than all three?
out = df.groupby(level=[0,1]).min()
Output:
>>> out
           value
self time
0    1         2
     2         4
     3         1
1    1         5
     2         8
2    1         1
     2         1
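If you want to keep every index level while grouping only the first two (a sketch, not part of either answer above), you can compare against a groupby transform; unlike idxmin, this keeps all tied rows when a group has several equal minima:
# Boolean mask: True where a row holds its group's minimum value
mask = df['value'] == df.groupby(level=['self', 'time'])['value'].transform('min')
out = df[mask]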
Suppose I have a nested dictionary of the format:
dictionary = {
    "A": [1, 2],
    "B": [2, 3],
    "Coords": [{
        "X": [1, 2, 3],
        "Y": [1, 2, 3],
        "Z": [1, 2, 3],
    }, {
        "X": [2, 3],
        "Y": [2, 3],
        "Z": [2, 3],
    }]
}
How can I turn this into a Pandas MultiIndex Dataframe?
Equivalently, how can I produce a DataFrame where the information in the row is not duplicated for every coordinate?
In what I imagine, the two rows of the output DataFrame would look like this:
Index A B Coords
---------------------
0 1 2 X Y Z
1 1 1
2 2 2
3 3 3
--------------------
---------------------
1 2 3 X Y Z
2 2 2
3 3 3
--------------------
From your dictionary:
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict(dictionary)
>>> df
A B Coords
0 1 2 {'X': [1, 2, 3], 'Y': [1, 2, 3], 'Z': [1, 2, 3]}
1 2 3 {'X': [2, 3], 'Y': [2, 3], 'Z': [2, 3]}
Then we can use pd.Series to expand the dicts in the Coords column into separate columns, like so:
df_concat = pd.concat([df.drop(['Coords'], axis=1), df['Coords'].apply(pd.Series)], axis=1)
>>> df_concat
A B X Y Z
0 1 2 [1, 2, 3] [1, 2, 3] [1, 2, 3]
1 2 3 [2, 3] [2, 3] [2, 3]
To finish, we use the explode method to turn the lists into rows and set the index on columns A and B to get the expected result:
>>> df_concat.explode(['X', 'Y', 'Z']).reset_index().set_index(['index', 'A', 'B'])
           X  Y  Z
index A B
0     1 2  1  1  1
        2  2  2  2
        3  3  3  3
1     2 3  2  2  2
        3  3  3  3
UPDATE:
If you are using a pandas version older than 1.3.0 (where explode does not yet accept a list of columns), you can use the trick given by #MillerMrosek in this answer:
def explode(df, columns):
    # Zip the listed columns row-wise into tuples, explode the tuples,
    # then split them back out into the original columns.
    df['tmp'] = df.apply(lambda row: list(zip(*[row[_clm] for _clm in columns])), axis=1)
    df = df.explode('tmp')
    df[columns] = pd.DataFrame(df['tmp'].tolist(), index=df.index)
    df.drop(columns='tmp', inplace=True)
    return df
explode(df_concat, ["X", "Y", "Z"]).reset_index().set_index(['index', 'A', 'B'])
Output:
           X  Y  Z
index A B
0     1 2  1  1  1
        2  2  2  2
        3  3  3  3
1     2 3  2  2  2
        3  3  3  3
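If you prefer to avoid explode altogether, a plain-Python sketch (assuming every Coords entry carries X, Y and Z lists of equal length) builds the same MultiIndex frame directly:
rows = []
for i, (a, b, coords) in enumerate(zip(dictionary['A'], dictionary['B'], dictionary['Coords'])):
    # One output row per coordinate triple, tagged with the record's index, A and B
    for x, y, z in zip(coords['X'], coords['Y'], coords['Z']):
        rows.append({'index': i, 'A': a, 'B': b, 'X': x, 'Y': y, 'Z': z})
long_df = pd.DataFrame(rows).set_index(['index', 'A', 'B'])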
I have a data frame like this:
import pandas as pd
import numpy as np
Out[10]:
     samples  subject  trial_num
0  [0, 2, 2]        1          1
1  [3, 3, 0]        1          2
2  [1, 1, 1]        1          3
3  [0, 1, 2]        2          1
4  [4, 5, 6]        2          2
5  [0, 8, 8]        2          3
I want the output to look like this:
     samples  subject  trial_num  frequency
0  [0, 2, 2]        1          1          2
1  [3, 3, 0]        1          2          2
2  [1, 1, 1]        1          3          1
3  [0, 1, 2]        2          1          3
4  [4, 5, 6]        2          2          3
5  [0, 8, 8]        2          3          2
The frequency here is the number of unique values in each list per sample. For example, [0, 2, 2] has two unique values (0 and 2).
I know how to count unique values in pandas when the column doesn't hold lists, and I could write a for loop that goes through each row and processes each list, but I'd like a more idiomatic pandas way to do it.
Thanks.
You can use collections.Counter for the task:
from collections import Counter
df['frequency'] = df['samples'].apply(lambda x: sum(v==1 for v in Counter(x).values()))
print(df)
Prints:
samples subject trial_num frequency
0 [0, 2, 2] 1 1 1
1 [3, 3, 0] 1 2 1
2 [1, 1, 1] 1 3 0
3 [0, 1, 2] 2 1 3
4 [4, 5, 6] 2 2 3
5 [0, 8, 8] 2 3 1
EDIT: For updated question:
df['frequency'] = df['samples'].apply(lambda x: len(set(x)))
print(df)
Prints:
samples subject trial_num frequency
0 [0, 2, 2] 1 1 2
1 [3, 3, 0] 1 2 2
2 [1, 1, 1] 1 3 1
3 [0, 1, 2] 2 1 3
4 [4, 5, 6] 2 2 3
5 [0, 8, 8] 2 3 2
import pandas as pd
import ast # import for sample data creation
from io import StringIO # import for sample data creation
# sample data
s = """samples;subject;trial_num
[0, 2, 2];1;1
[3, 3, 0];1;2
[1, 1, 1];1;3
[0, 1, 2];2;1
[4, 5, 6];2;2
[0, 8, 8];2;3"""
df = pd.read_csv(StringIO(s), sep=';')
df['samples'] = df['samples'].apply(ast.literal_eval)
# convert lists to a new frame and use nunique
# assign values to a col
df['frequency'] = pd.DataFrame(df['samples'].values.tolist()).nunique(axis=1)
samples subject trial_num frequency
0 [0, 2, 2] 1 1 2
1 [3, 3, 0] 1 2 2
2 [1, 1, 1] 1 3 1
3 [0, 1, 2] 2 1 3
4 [4, 5, 6] 2 2 3
5 [0, 8, 8] 2 3 2
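Yet another pandas-native sketch (assuming pandas >= 0.25, where Series.explode exists): explode the lists and count distinct values per original row, which aligns back to df by index:
df['frequency'] = df['samples'].explode().groupby(level=0).nunique()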
I have a DataFrame of randomly generated agents. However, I want to expand it to match the population I am looking for, so I need to repeat rows according to my sampled indexes.
Here is the loop-based code, which takes forever:
df = pd.DataFrame({'a': [0, 1, 2]})
sampled_indexes = [0, 0, 1, 1, 2, 2, 2]
new_df = pd.DataFrame(columns=['a'])
for i, idx in enumerate(sampled_indexes):
    new_df.loc[i] = df.loc[idx]
Then, the original DataFrame:
df
a
0 0
1 1
2 2
gives me this enlarged new DataFrame:
new_df
a
0 0
1 0
2 1
3 1
4 2
5 2
6 2
This loop is far too slow once the DataFrame has 34,000 or more rows.
How can I do this more simply and faster?
Reindex the dataframe with sampled_indexes, then reset the index.
df.reindex(sampled_indexes).reset_index(drop=True)
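df.loc[sampled_indexes] does the same here (a sketch); the practical difference is that .loc raises a KeyError for labels missing from the index, while reindex would insert NaN rows for them:
new_df = df.loc[sampled_indexes].reset_index(drop=True)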
You can use DataFrame.merge:
df = pd.DataFrame({'a': [0, 1, 2]})
sampled_indexes = [0, 0, 1, 1, 2, 2, 2]
print( df.merge(pd.DataFrame({'a': sampled_indexes})) )
Prints:
a
0 0
1 0
2 1
3 1
4 2
5 2
6 2
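If the sample is expressed as a repeat count per row rather than a list of indexes (a hypothetical variation on the question), Index.repeat gives the same expansion:
counts = [2, 2, 3]  # hypothetical: repeat row 0 twice, row 1 twice, row 2 three times
new_df = df.loc[df.index.repeat(counts)].reset_index(drop=True)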
Consider the DataFrame d:
d = pd.DataFrame({'a': [0, 2, 1, 1, 1, 1, 1],
                  'b': [2, 1, 0, 1, 0, 0, 2],
                  'c': [1, 0, 2, 1, 0, 2, 2]})
   a  b  c
0  0  2  1
1  2  1  0
2  1  0  2
3  1  1  1
4  1  0  0
5  1  0  2
6  1  2  2
I want to split it by column a into a dictionary like this:
{0: a b c
0 0 2 1,
1: a b c
2 1 0 2
3 1 1 1
4 1 0 0
5 1 0 2
6 1 2 2,
2: a b c
1 2 1 0}
The solution I've found using pandas.groupby is:
{k: table for k, table in d.groupby("a")}
What are the other solutions?
You can call dict on a tuple (or list) of your groupby:
res = dict(tuple(d.groupby('a')))
A memory-efficient alternative to building a dict is to keep the groupby object and use get_group:
res = d.groupby('a')
res.get_group(1) # select dataframe where column 'a' = 1
In cases where each resulting table needs minor manipulation, like resetting the index or removing the groupby column, you can still use a dictionary comprehension:
res = {k: v.drop('a', axis=1).reset_index(drop=True) for k, v in d.groupby('a')}
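If each group only needs to be processed once, you can also iterate the groupby object directly and skip building any container (a minimal sketch):
for key, group in d.groupby('a'):
    print(key, group.shape)  # key is the value of column 'a', group is the matching sub-DataFrame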