In a pandas DataFrame, if I create a pivot table and plot its columns, how can I restrict the number of columns being plotted based on an aggregate function?
Example,
Suppose I have a dataframe pivot as
      sum
name  'a'  'b'  'c'  'd'
key
1       1    2    3    4
2       1    2    3    4
3       1    2    3  nan
which I plot as
Now I want to plot only the top 2 columns by average, so that only 'c' and 'd' show up and 'a' and 'b' are suppressed.
How can I achieve this using pandas?
Input DataFrame and Plot
from io import StringIO
import numpy as np  # needed for np.sum in the pivot_table call below
import pandas as pd
TESTDATA = StringIO("""name;key;value
'a';1;1
'a';2;1
'a';3;1
'b';1;2
'b';2;2
'b';3;2
'c';1;3
'c';2;3
'c';3;3
'd';1;4
'd';2;4
""")
df = pd.read_csv(TESTDATA, sep=";")
pivot = pd.pivot_table(df, index='key', columns='name', values='value',aggfunc=[np.sum])
pivot.plot()
You can take the column means of the pivot, pick the two largest with nlargest(2), and then select only those columns with .loc[] before plotting:
pivot.loc[:, pivot.mean().nlargest(2).index].plot()
# or: pivot[pivot.mean().nlargest(2).index].plot()
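As a minimal sketch of the selection step on its own, the snippet below rebuilds a pivot similar to the one in the question and picks the top two columns by mean. The names `top_cols` and `subset` are illustrative only, and it assumes `aggfunc="sum"` (a plain string rather than the list used in the question) so the columns stay single-level. The plot call is left commented out so the snippet runs without matplotlib.

```python
import pandas as pd

# Same shape as the question's data: 'd' has no value for key 3
df = pd.DataFrame({
    "name": ["a"] * 3 + ["b"] * 3 + ["c"] * 3 + ["d"] * 2,
    "key": [1, 2, 3] * 3 + [1, 2],
    "value": [1] * 3 + [2] * 3 + [3] * 3 + [4] * 2,
})
pivot = pd.pivot_table(df, index="key", columns="name", values="value", aggfunc="sum")

# Column means skip NaN by default; nlargest returns them in descending order
top_cols = pivot.mean().nlargest(2).index
subset = pivot[top_cols]
# subset.plot()
```

Changing the `2` in `nlargest(2)` changes how many columns survive, and swapping `mean()` for another reduction such as `max()` changes the ranking criterion.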
At the beginning, I'd like to add a multilevel column to an empty dataframe.
df = pd.DataFrame({"nodes": list(range(1, 5, 2))})
df.set_index("nodes", inplace=True)
So this is the dataframe to start with (still empty):
>>> df
nodes
1
3
Now I'd like to add a first multilevel column.
I tried the following:
new_df = pd.DataFrame.from_dict(dict(zip(df.index, [1,2])), orient="index",
columns=["value"])
df = pd.concat([new_df], axis=1, keys=["test"])
Now the dataframe df looks like this:
>>> df
test
value
1 1
3 2
To add another column, I've done something similar.
new_df2 = pd.DataFrame.from_dict(dict(zip(df.index, [3,4])), orient="index",
columns=[("test2", "value2")])
df = pd.concat([df, new_df2], axis=1)
df.index.name = "nodes"
So the desired dataframe looks like this:
>>> df
       test  test2
nodes value value2
1         1      3
3         2      4
This way of adding multilevel columns seems a bit strange. Is there a better way of doing so?
Create a MultiIndex on the columns by storing your DataFrames in a dict and then concatenating along axis=1. The keys of the dict become levels of the column MultiIndex: tuple keys add one level per tuple element, while scalar keys add a single level. The DataFrames' own columns stay as they are, as the innermost level. Alignment is enforced on the row Index.
import pandas as pd
d = {}
d[('foo', 'bar')] = pd.DataFrame({'val': [1,2,3]}).rename_axis(index='nodes')
d[('foo2', 'bar2')] = pd.DataFrame({'val2': [4,5,6]}).rename_axis(index='nodes')
d[('foo2', 'bar1')] = pd.DataFrame({'val2': [7,8,9]}).rename_axis(index='nodes')
pd.concat(d, axis=1)
      foo foo2
      bar bar2 bar1
      val val2 val2
nodes
0       1    4    7
1       2    5    8
2       3    6    9
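Applied to the data from the question, this could look like the sketch below. It also shows an alternative that is not part of the answer above: once the columns are a MultiIndex, a further column can be added by assigning to a tuple key. The variable names `nodes` and `first` are illustrative.

```python
import pandas as pd

# Row index shared by all pieces
nodes = pd.Index([1, 3], name="nodes")

# First multilevel column: the dict key becomes the outer column level
first = pd.DataFrame({"value": [1, 2]}, index=nodes)
df = pd.concat({"test": first}, axis=1)

# Alternative for later columns: assign directly with a tuple key
df[("test2", "value2")] = [3, 4]
```

Tuple assignment is handy for adding one column at a time, while the dict-and-concat approach is probably tidier when several DataFrames are already at hand.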
I have a pandas DataFrame with MultiIndex columns and I want to use ffill() to fill any NaNs in certain columns. The following code shows the structure of the sample dataframe, and the result of ffill() in the next snapshot.
import numpy as np
import pandas as pd
room = ['A', 'B']
val = range(3)
df = pd.DataFrame(columns=pd.MultiIndex.from_product([room, val]),data=np.random.randn(3,6))
df.loc[1,('B',0)]=np.nan
# print(df.loc[1,('B',0)])
display(df)
df = df.ffill(axis=1)
display(df)
What I was hoping to get is that the NaN at [1,('B',0)] is replaced with -0.392674 and not with -1.349675.
Generally, I want to be able to ffill() from the corresponding columns from level 1 ([0,1,2]).
How do I achieve this?
I think you are looking for groupby with fillna, grouping the columns by level 1:
df=df.groupby(level=1,axis=1).fillna(method='ffill')
df
Out[496]:
          A                             B
          0         1         2         0         1         2
0 -0.177358 -1.531091 -0.945004  1.665143  0.602459 -0.008192
1 -0.006995  0.472267 -0.859471 -0.006995 -0.601538 -0.410391
2  0.101494  1.031941  0.499288  0.804391 -0.224750 -0.778403
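Recent pandas releases deprecate both `axis=1` in `groupby` and the `method=` argument of `fillna`, so the call above may emit warnings or fail depending on the installed version. A sketch of the same idea that avoids both, assuming a transpose is acceptable for the data size, groups the transposed frame by the second column level and forward-fills within each group. The deterministic sample data here is made up for illustration so the result is easy to check.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random sample data in the question
cols = pd.MultiIndex.from_product([["A", "B"], range(3)])
df = pd.DataFrame(np.arange(18, dtype=float).reshape(3, 6), columns=cols)
df.loc[1, ("B", 0)] = np.nan

# Transpose so column labels become row labels, forward-fill within each
# level-1 group (0, 1, 2), then transpose back to the original layout
filled = df.T.groupby(level=1).ffill().T
```

Here the NaN at `[1, ('B', 0)]` is filled from `[1, ('A', 0)]`, the column with the same level-1 label, rather than from the neighbouring `('A', 2)`.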
I have a dataframe which is of the following structure:
A B
Location1 1
Location2 2
1 3
2 4
In the above example column A is the index. I am attempting to produce a scatter plot using the index and column B. This data frame is made by resampling and averaging another dataframe like so:
df = df.groupby("A").mean()
Now obviously this sets the index to column A, and I can plot it using the following, which is adapted from the question "Use index in pandas to plot data":
df.reset_index().plot(x = "A",y = "B",kind="scatter", figsize=(10,10))
When I run this, it raises the following error:
ValueError: scatter requires x column to be numeric
The index column is intended to be a column of strings. How can I produce a scatter plot against it?
You may want to select only the integer rows:
import pandas as pd
d = {'A': ["Location1", "Location2", 1, 2], 'B': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
df_numeric = df[pd.to_numeric(df.A, errors='coerce').notnull()]
print(df_numeric)
A B
2 1 3
3 2 4
Grouped by A:
df_numeric_grouped_by_A = df_numeric.groupby("A").mean()
print(df_numeric_grouped_by_A)
B
A
1 3
2 4
You may have to transpose the DataFrame, so that the index (column A) becomes the column names, and then calculate the mean of the columns and plot them.
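If the string labels should stay on the plot rather than being filtered out, one possible workaround (not taken from the answers above) is to plot against integer positions and use the labels as tick text. The sketch below only prepares the numeric x column with `pd.factorize`; the column name `x` is hypothetical, and the plotting calls are left as comments so the snippet runs without matplotlib.

```python
import pandas as pd

df = pd.DataFrame({"A": ["Location1", "Location2", 1, 2], "B": [1, 2, 3, 4]})

# Treat every label as a string so grouping and sorting are well defined
df["A"] = df["A"].astype(str)
means = df.groupby("A")["B"].mean().reset_index()

# Map each label to an integer position that scatter can use as x
codes, labels = pd.factorize(means["A"])
means["x"] = codes

# ax = means.plot(x="x", y="B", kind="scatter")
# ax.set_xticks(codes)
# ax.set_xticklabels(labels)
```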
After creating a DataFrame with some duplicated cell values in the column named 'keys':
import pandas as pd
df = pd.DataFrame({'keys': [1,2,2,3,3,3,3],'values':[1,2,3,4,5,6,7]})
I go ahead and create two more DataFrames which are the consolidated versions of the original DataFrame df. Those newly created DataFrames will have no duplicated cell values under the 'keys' column:
df_sum = df.groupby('keys').sum().reset_index()
df_mean = df.groupby('keys').mean().reset_index()
As you can see, the df_sum['values'] cell values were summed together within each key,
while the df_mean['values'] cell values were averaged with the mean() method.
Lastly I rename the 'values' column in both dataframes with:
df_sum.columns = ['keys', 'sums']
df_mean.columns = ['keys', 'means']
Now I would like to copy the df_mean['means'] column into the dataframe df_sum.
How to achieve this?
The Photoshopped image below illustrates the dataframe I would like to create. Both the 'sums' and 'means' columns are merged into a single DataFrame:
There are several ways to do this. Using the DataFrame's merge method is the most efficient.
df_both = df_sum.merge(df_mean, how='left', on='keys')
df_both
Out[1]:
keys sums means
0 1 1 1.0
1 2 5 2.5
2 3 22 5.5
I think pandas.merge() is the function you are looking for, as in pd.merge(df_sum, df_mean, on="keys"). Besides, the same result can be produced with a single agg call, as follows:
df.groupby('keys')['values'].agg(['sum', 'mean']).reset_index()
# keys sum mean
#0 1 1 1.0
#1 2 5 2.5
#2 3 22 5.5
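If the final column names should be 'sums' and 'means' straight away, named aggregation may save the separate renaming step. This is a sketch built on the question's data; the name `df_both` is borrowed from the merge answer above.

```python
import pandas as pd

df = pd.DataFrame({"keys": [1, 2, 2, 3, 3, 3, 3], "values": [1, 2, 3, 4, 5, 6, 7]})

# Each keyword becomes an output column: name=(source column, aggregation)
df_both = df.groupby("keys").agg(
    sums=("values", "sum"),
    means=("values", "mean"),
).reset_index()
```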
Suppose I have a pandas dataframe given by
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,2))
df
0 1
0 0.264053 -1.225456
1 0.805492 -1.072943
2 0.142433 -0.469905
3 0.758322 0.804881
4 -0.281493 0.602433
I want to return a Series object with 4 rows, containing max(df[0,0], df[1,1]), max(df[1,0], df[2,1]), max(df[2,0], df[3,1]), max(df[3,0], df[4,1]). More generally, what is the best way to compare the max of column 0 and column 1 offset by n rows?
Thanks.
You want to apply max to rows after having shifted the first column.
pd.concat([df.iloc[:, 0].shift(), df.iloc[:, 1]], axis=1).apply(max, axis=1).dropna()
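For the general case of an offset of n rows, the same idea can be wrapped in a small helper. This is only a sketch: `shifted_max` is a hypothetical name, and it uses `dropna()` followed by the vectorized `max(axis=1)` instead of `apply(max, axis=1)`. Note that `dropna()` here would also drop rows where the original data already contains NaN.

```python
import pandas as pd

def shifted_max(df, n=1):
    """Row-wise max of column 0 at row i - n and column 1 at row i."""
    pair = pd.concat([df.iloc[:, 0].shift(n), df.iloc[:, 1]], axis=1)
    # The first n rows have no shifted partner, so drop them before reducing
    return pair.dropna().max(axis=1)

# Small deterministic frame so the result is easy to verify by hand
df = pd.DataFrame({0: [1.0, 5.0, 2.0, 8.0, 0.0], 1: [3.0, 4.0, 6.0, 7.0, 9.0]})
result = shifted_max(df, n=1)
```

With n=1 on a 5-row frame this yields 4 rows, labelled with the index of the column 1 value in each pair.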