Plot top n metrics from a pandas pivot table - python

In a pandas dataframe, if I create a pivot and plot it by column, how can I restrict the number of columns being plotted based on an aggregate function?
Example,
Suppose I have a dataframe pivot as
     sum
name 'a' 'b' 'c' 'd'
key
1      1   2   3   4
2      1   2   3   4
3      1   2   3 NaN
which I plot as
Now I want to plot only the top 2 columns by average, so that only 'c' and 'd' show up, suppressing 'a' and 'b'.
How can I achieve this using pandas?
Input DataFrame and Plot
from io import StringIO
import numpy as np
import pandas as pd
TESTDATA = StringIO("""name;key;value
'a';1;1
'a';2;1
'a';3;1
'b';1;2
'b';2;2
'b';3;2
'c';1;3
'c';2;3
'c';3;3
'd';1;4
'd';2;4
""")
df = pd.read_csv(TESTDATA, sep=";")
pivot = pd.pivot_table(df, index='key', columns='name', values='value',aggfunc=[np.sum])
pivot.plot()

You can take the mean of each column of the pivot, pick the two largest with nlargest(2), and then use .loc[] to select only those columns before plotting:
pivot.loc[:,pivot.mean().nlargest(2).index].plot()
#or pivot[pivot.mean().nlargest(2).index].plot()
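The same idea can be wrapped in a small helper so that the ranking function is configurable. This is only a minimal sketch: the helper name top_n_columns and its how parameter are made up for illustration, the sample data drops the quote characters from the names, and the pivot uses aggfunc="sum" so that the columns stay single-level.

```python
from io import StringIO

import pandas as pd

# Same shape of data as in the question, without the quote characters.
data = StringIO("""name;key;value
a;1;1
a;2;1
a;3;1
b;1;2
b;2;2
b;3;2
c;1;3
c;2;3
c;3;3
d;1;4
d;2;4
""")
df = pd.read_csv(data, sep=";")
pivot = df.pivot_table(index="key", columns="name", values="value", aggfunc="sum")


def top_n_columns(frame, n, how="mean"):
    """Return only the n columns with the largest aggregate (hypothetical helper)."""
    ranking = frame.agg(how).nlargest(n)  # one aggregate value per column
    return frame[ranking.index]           # keep the winning columns, best first


top2 = top_n_columns(pivot, 2)
print(top2.columns.tolist())
# top2.plot() would then draw only these two columns (requires matplotlib).
```

Because mean() skips NaN by default, column d is ranked on its two available values.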

Related

Add multi level column to dataframe

At the beginning, I'd like to add a multilevel column to an empty dataframe.
df = pd.DataFrame({"nodes": list(range(1, 5, 2))})
df.set_index("nodes", inplace=True)
So this is the dataframe to start with (still empty):
>>> df
nodes
1
3
Now I'd like to add a first multilevel column.
I tried the following:
new_df = pd.DataFrame.from_dict(dict(zip(df.index, [1,2])), orient="index",
columns=["value"])
df = pd.concat([new_df], axis=1, keys=["test"])
Now the dataframe df looks like this:
>>> df
test
value
1 1
3 2
To add another column, I've done something similar.
new_df2 = pd.DataFrame.from_dict(dict(zip(df.index, [3,4])), orient="index",
columns=[("test2", "value2")])
df = pd.concat([df, new_df2], axis=1)
df.index.name = "nodes"
So the desired dataframe looks like this:
>>> df
test test2
nodes value value2
1 1 3
3 2 4
This way of adding multilevel columns seems a bit strange. Is there a better way of doing so?
Create a MultiIndex on the columns by storing your DataFrames in a dict and then concatenating along axis=1. The keys of the dict become levels of the column MultiIndex (tuple keys add one level per element, scalar keys add a single level) and the DataFrame columns stay as they are. Alignment is enforced on the row Index.
import pandas as pd
d = {}
d[('foo', 'bar')] = pd.DataFrame({'val': [1,2,3]}).rename_axis(index='nodes')
d[('foo2', 'bar2')] = pd.DataFrame({'val2': [4,5,6]}).rename_axis(index='nodes')
d[('foo2', 'bar1')] = pd.DataFrame({'val2': [7,8,9]}).rename_axis(index='nodes')
pd.concat(d, axis=1)
      foo foo2
      bar bar2 bar1
      val val2 val2
nodes
0       1    4    7
1       2    5    8
2       3    6    9
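When the frame already has two-level columns and only one more column is needed, assigning with a tuple key may be simpler than another concat. A minimal sketch, assuming the two-node frame from the question; the variable name nodes is only illustrative:

```python
import pandas as pd

nodes = pd.Index([1, 3], name="nodes")

# Start with one two-level column, built with the dict-and-concat approach.
df = pd.concat({"test": pd.DataFrame({"value": [1, 2]}, index=nodes)}, axis=1)

# A tuple key addresses (level 0, level 1) of the column MultiIndex directly.
df[("test2", "value2")] = [3, 4]

print(df)
```

The tuple must have as many elements as the column MultiIndex has levels.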

fillna() for Multi-Index Pandas DataFrame

I have a multi-index Pandas dataframe and I want to use ffill() to fill any NaNs in certain columns. The following code builds a sample dataframe and displays it before and after calling ffill().
room = ['A', 'B']
val = range(3)
df = pd.DataFrame(columns=pd.MultiIndex.from_product([room, val]),data=np.random.randn(3,6))
df.loc[1,('B',0)]=np.nan
# print(df.loc[1,('B',0)])
display(df)
df = df.ffill(axis=1)
display(df)
What I was hoping to get is that the NaN at [1,('B',0)] is replaced with -0.392674 and not with -1.349675.
Generally, I want to be able to ffill() from the corresponding columns from level 1 ([0,1,2]).
How do I achieve this?
I think you are looking for groupby fillna
df=df.groupby(level=1,axis=1).fillna(method='ffill')
df
Out[496]:
A B
0 1 2 0 1 2
0 -0.177358 -1.531091 -0.945004 1.665143 0.602459 -0.008192
1 -0.006995 0.472267 -0.859471 -0.006995 -0.601538 -0.410391
2 0.101494 1.031941 0.499288 0.804391 -0.224750 -0.778403
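Recent pandas releases (2.x) emit deprecation warnings for groupby(..., axis=1) and for fillna(method=...). One way to get the same result without either is to transpose, group the rows by level 1, forward-fill, and transpose back. This is a sketch with deterministic sample values instead of random ones, so the filled value is easy to check:

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([["A", "B"], range(3)])
# Values 0..17 laid out row by row, so row 1 holds 6..11.
df = pd.DataFrame(np.arange(18, dtype=float).reshape(3, 6), columns=cols)
df.loc[1, ("B", 0)] = np.nan

# After transposing, the column MultiIndex becomes the row index, so
# grouping by level 1 puts ('A', 0) and ('B', 0) in the same group and
# ffill() carries the 'A' value down into the missing 'B' slot.
filled = df.T.groupby(level=1).ffill().T

print(filled.loc[1, ("B", 0)])
```

The NaN at [1, ('B', 0)] is filled from [1, ('A', 0)] rather than from its left-hand neighbour ('A', 2).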

Value error when plotting Dataframe from index

I have a dataframe which is of the following structure:
A B
Location1 1
Location2 2
1 3
2 4
In the above example column A is the index. I am attempting to produce a scatter plot using the index and column B. This data frame is made by resampling and averaging another dataframe like so:
df = df.groupby("A").mean()
Now obviously this sets the index equal to column A, and I can plot it using the following, which is adapted from the question "Use index in pandas to plot data":
df.reset_index().plot(x = "A",y = "B",kind="scatter", figsize=(10,10))
When I run this, it returns the following:
ValueError: scatter requires x column to be numeric
As the index column is intended to be a column of strings for which I can plot a scatter plot how can I go about fixing this?
You may want to select only the integer rows:
import pandas as pd
d = {'A': ["Location1", "Location2", 1, 2], 'B': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
df_numeric = df[pd.to_numeric(df.A, errors='coerce').notnull()]
print(df_numeric)
A B
2 1 3
3 2 4
Grouped by A:
df_numeric_grouped_by_A = df_numeric.groupby("A").mean()
print(df_numeric_grouped_by_A)
B
A
1 3
2 4
You may have to transpose the DataFrame, so that the index (column A) becomes the column names, and then calculate the mean of the columns and plot them.

How to copy one DataFrame column in to another Dataframe if their indexes values are the same

After creating a DataFrame with some duplicated cell values in column with the name 'keys':
import pandas as pd
df = pd.DataFrame({'keys': [1,2,2,3,3,3,3],'values':[1,2,3,4,5,6,7]})
I go ahead and create two more DataFrames which are the consolidated versions of the original DataFrame df. Those newly created DataFrames will have no duplicated cell values under the 'keys' column:
df_sum = df.groupby('keys').sum().reset_index()
df_mean = df.groupby('keys').mean().reset_index()
As you can see, the df_sum['values'] cell values were all summed together,
while the df_mean['values'] cell values were averaged with the mean() method.
Lastly I rename the 'values' column in both dataframes with:
df_sum.columns = ['keys', 'sums']
df_mean.columns = ['keys', 'means']
Now I would like to copy the df_mean['means'] column into the dataframe df_sum.
How to achieve this?
The Photoshopped image below illustrates the dataframe I would like to create. Both 'sums' and 'means' columns are merged into a single DataFrame:
There are several ways to do this. Using the DataFrame's merge method is the most efficient.
df_both = df_sum.merge(df_mean, how='left', on='keys')
df_both
Out[1]:
keys sums means
0 1 1 1.0
1 2 5 2.5
2 3 22 5.5
I think pandas.merge() is the function you are looking for, e.g. pd.merge(df_sum, df_mean, on="keys"). Alternatively, the same result can be produced with a single agg call:
df.groupby('keys')['values'].agg(['sum', 'mean']).reset_index()
# keys sum mean
#0 1 1 1.0
#1 2 5 2.5
#2 3 22 5.5
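If the 'sums' and 'means' column names from the question are wanted directly, named aggregation can produce them in one step without a rename. A minimal sketch using the sample data from the question; the variable name result is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'keys': [1, 2, 2, 3, 3, 3, 3],
                   'values': [1, 2, 3, 4, 5, 6, 7]})

# Each keyword names an output column; the tuple is (input column, function).
result = df.groupby('keys', as_index=False).agg(
    sums=('values', 'sum'),
    means=('values', 'mean'),
)

print(result)
```

Here as_index=False keeps 'keys' as a regular column, so no reset_index() is needed.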

pandas DataFrame - find max between offset columns

Suppose I have a pandas dataframe given by
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,2))
df
0 1
0 0.264053 -1.225456
1 0.805492 -1.072943
2 0.142433 -0.469905
3 0.758322 0.804881
4 -0.281493 0.602433
I want to return a Series object with 4 rows, containing max(df[0,0], df[1,1]), max(df[1,0], df[2,1]), max(df[2,0], df[3,1]), max(df[3,0], df[4,1]). More generally, what is the best way to compare the max of column 0 and column 1 offset by n rows?
Thanks.
You want to apply max to rows after having shifted the first column.
pd.concat([df.iloc[:, 0].shift(), df.iloc[:, 1]], axis=1).apply(max, axis=1).dropna()
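For a general offset of n rows, the shift can be combined with NumPy's element-wise maximum. This is a sketch rather than the answer's exact method: the function name offset_max is hypothetical, and the sample values are fixed instead of random so the result is easy to verify. np.maximum propagates NaN, so the first n rows (which have no shifted partner) are removed by dropna().

```python
import numpy as np
import pandas as pd


def offset_max(frame, n=1):
    """Max of column 0 at row i and column 1 at row i + n (hypothetical helper)."""
    shifted = frame.iloc[:, 0].shift(n)  # row i + n now holds column 0's row i
    # Element-wise maximum; rows where the shifted value is NaN stay NaN.
    return np.maximum(shifted, frame.iloc[:, 1]).dropna()


df = pd.DataFrame({0: [1.0, 5.0, 2.0, 0.0, 3.0],
                   1: [9.0, 0.0, 7.0, 4.0, 1.0]})

result = offset_max(df, n=1)
print(result)
```

The result is indexed by the row of column 1 that took part in each comparison, so with n=1 it has four rows labelled 1 to 4.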
