I have a dataframe like this:
df = pd.DataFrame({'A': [0.3, 0.2, 0.5, 0.2], 'B': [0.1, 0.0, 0.3, 0.1], 'C': [0.2, 0.5, 0.0, 0.7], 'D': [0.6, 0.3, 0.4, 0.6]}, index=list('abcd'))
     A    B    C    D
a  0.3  0.1  0.2  0.6
b  0.2  0.0  0.5  0.3
c  0.5  0.3  0.0  0.4
d  0.2  0.1  0.7  0.6
Now I want to plot each row as a bar plot in which the y-axis and the x-tick labels are shared, using add_subplot.
So far, I can only produce a plot that looks like this:
There is one problem:
The axes are not shared. How can one do this after using add_subplot? Here, this problem is solved by creating one huge subplot; is there any way to do this differently?
My desired outcome looks like the plot above, with the only difference that there are no x-tick labels in the upper row and no y-tick labels in the right column.
My current attempt is the following:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame({'A': [0.3, 0.2, 0.5, 0.2], 'B': [0.1, 0.0, 0.3, 0.1], 'C': [0.2, 0.5, 0.0, 0.7], 'D': [0.6, 0.3, 0.4, 0.6]}, index=list('abcd'))
fig = plt.figure()
bar_width = 0.35
counter = 1
index = np.arange(df.shape[0])
for indi, rowi in df.iterrows():
    ax = fig.add_subplot(2, 2, counter)
    ax.bar(index, rowi.values, width=bar_width, tick_label=df.columns)
    ax.set_ylim([0., 1.])
    ax.set_title(indi, fontsize=20)
    ax.set_xticks(index + bar_width / 2)
    counter += 1
plt.xticks(index + bar_width / 2, df.columns)
The question of how to produce shared subplots in matplotlib is already covered in several places:
The SO search engine results
The matplotlib recipes or the examples page
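For the matplotlib-only route, one option is plt.subplots, which accepts sharex and sharey and hides the inner tick labels by itself. (Figure.add_subplot also accepts sharex/sharey pointing at an existing Axes, but then the inner tick labels have to be hidden by hand.) A minimal sketch, assuming the same df as in the question; the loop layout and the use of the column names as categorical x values are my own choices, not part of the original attempt:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'A': [0.3, 0.2, 0.5, 0.2], 'B': [0.1, 0.0, 0.3, 0.1],
                   'C': [0.2, 0.5, 0.0, 0.7], 'D': [0.6, 0.3, 0.4, 0.6]},
                  index=list('abcd'))

# sharex/sharey link the axes; tick labels of inner subplots are hidden
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for ax, (label, row) in zip(axes.flat, df.iterrows()):
    ax.bar(list(df.columns), row.values, width=0.35)  # one bar per column
    ax.set_title(label, fontsize=20)
axes[0, 0].set_ylim(0., 1.)  # applies to every subplot because y is shared
fig.tight_layout()
# plt.show()  # uncomment to display the figure
```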
What may be more interesting here is that you could also use pandas directly to create the plot in a single line:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'A': [0.3, 0.2, 0.5, 0.2], 'B': [0.1, 0.0, 0.3, 0.1], 'C': [0.2, 0.5, 0.0, 0.7], 'D': [0.6, 0.3, 0.4, 0.6]}, index=list('abcd'))
df.plot(kind="bar", subplots=True, layout=(2,2), sharey=True, sharex=True)
plt.show()
I have the following dataframe:
df =
sample measurements
1 [0.2, 0.22, 0.3, 0.7, 0.4, 0.35, 0.2]
2 [0.2, 0.17, 0.6, 0.6, 0.54, 0.32, 0.2]
5 [0.2, 0.39, 0.40, 0.53, 0.41, 0.3, 0.2]
7 [0.2, 0.29, 0.46, 0.68, 0.44, 0.35, 0.2]
The data type in df['measurements'] is a 1-D np.array. I'm trying to concatenate each np.array in the column "measurements" and plot the result as a time series, but the issue is that the samples are discontinuous, and the interval between points is not consistent due to missing data. What is the best way to concatenate the arrays and plot them such that there is just a gap in the plot between samples 2 and 5 and between samples 5 and 7?
Depending on how you want to use the data, you can either convert the individual elements to new rows ("long form"), or create new columns ("wide form").
Convert to new rows
This is the preferred format for seaborn. explode() creates new rows from the array elements. Optionally, groupby() together with cumcount() can add a position.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({'sample': [1, 2, 5, 7],
                   'measurements': [np.array([0.2, 0.22, 0.3, 0.7, 0.4, 0.35, 0.2]),
                                    np.array([0.2, 0.17, 0.6, 0.6, 0.54, 0.32, 0.2]),
                                    np.array([0.2, 0.39, 0.40, 0.53, 0.41, 0.3, 0.2]),
                                    np.array([0.2, 0.29, 0.46, 0.68, 0.44, 0.35, 0.2])]})
df1 = df.explode('measurements', ignore_index=True)
df1['position'] = df1.groupby('sample').cumcount() + 1
sns.lineplot(df1, x='sample', y='measurements', hue='position', palette='bright')
plt.show()
Convert to new columns
If all arrays have the same length, each element can be converted to a new column. This is how pandas usually prefers to organize its data. New columns are created by applying to_list() to the original column.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({'sample': [1, 2, 5, 7],
                   'measurements': [np.array([0.2, 0.22, 0.3, 0.7, 0.4, 0.35, 0.2]),
                                    np.array([0.2, 0.17, 0.6, 0.6, 0.54, 0.32, 0.2]),
                                    np.array([0.2, 0.39, 0.40, 0.53, 0.41, 0.3, 0.2]),
                                    np.array([0.2, 0.29, 0.46, 0.68, 0.44, 0.35, 0.2])]})
df2 = pd.DataFrame(df['measurements'].to_list(),
                   columns=[f'measurement{i + 1}' for i in range(7)],
                   index=df['sample'])
df2.plot()
plt.show()
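If the goal is one continuous line with visible gaps where samples are missing, the wide form can be reindexed over every sample number, so that missing samples become rows of NaN. A rough sketch, assuming every array has the same length n and that each sample spans one unit on the x-axis (both are my assumptions, not stated in the question):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'sample': [1, 2, 5, 7],
                   'measurements': [np.array([0.2, 0.22, 0.3, 0.7, 0.4, 0.35, 0.2]),
                                    np.array([0.2, 0.17, 0.6, 0.6, 0.54, 0.32, 0.2]),
                                    np.array([0.2, 0.39, 0.40, 0.53, 0.41, 0.3, 0.2]),
                                    np.array([0.2, 0.29, 0.46, 0.68, 0.44, 0.35, 0.2])]})

n = 7  # points per sample (assumed equal for every row)
wide = pd.DataFrame(df['measurements'].to_list(), index=df['sample'])
# Reindex over every sample number so the missing ones become rows of NaN
full = np.arange(wide.index.min(), wide.index.max() + 1)
wide = wide.reindex(full)
y = wide.to_numpy().ravel()  # row-major: sample 1, then 2, then 3 (NaN), ...
x = np.repeat(full, n) + np.tile(np.arange(n) / n, len(full))

fig, ax = plt.subplots()
ax.plot(x, y)  # matplotlib breaks the line wherever y is NaN
ax.set_xlabel('sample')
# plt.show()  # uncomment to display the figure
```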
Assume I have a DataFrame column containing alternating runs of increasing and decreasing values (like a sawtooth pattern).
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
y = list(np.linspace(0, 1, 3)) + list(np.linspace(1,0, 11)) * 5
x = list(range(len(y)))
df = pd.DataFrame({"x": x, "y":y})
df.plot("x", "y")
Now I would like to extract these downward-sliding sections into separate dfs. What would be the best way to do this?
What I expect to get is a list of dfs like the one below (the image shows the data of the first df):
pd.DataFrame({"x": range(11), "y":list(np.linspace(1,0, 11))}).plot("x", "y")
Use:
s = (df['y']-df['y'].shift(-1)>0)
t = s-s.shift(1)
u = t[t>=0].astype(int).cumsum()
u = u[u>0]
df.loc[u.index].groupby(u)['y'].apply(list)
Output:
y
1 [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999...
2 [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999...
3 [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999...
4 [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999...
5 [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999...
Name: y, dtype: object
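Since the question asks for a list of dfs, here is an alternative sketch based on diff() rather than the shift arithmetic above. It is not the same method: it keeps both the peak and the final low point of each falling run. The names falling, in_run, starts, run_id and segments are only illustrative:

```python
import numpy as np
import pandas as pd

y = list(np.linspace(0, 1, 3)) + list(np.linspace(1, 0, 11)) * 5
df = pd.DataFrame({"x": range(len(y)), "y": y})

falling = df["y"].diff() < 0  # row is lower than the previous row
# A run also contains the peak row that precedes its first falling row
in_run = falling | falling.shift(-1, fill_value=False)
starts = in_run & ~falling    # peak rows open a new run
run_id = starts.cumsum()      # running label, one value per run
segments = [g for _, g in df[in_run].groupby(run_id[in_run])]
```

Each element of segments is a DataFrame holding the x and y values of one falling section, so segments[0].plot("x", "y") draws the first one.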
I want to create a heightfield map that consists of squares of random height. Given an array of NxN, I want that every square of size MxM, where M<N, will be at the same random height, with the height sampled from a uniform distribution. For example, if we have N = 6 and M = 2, we would have:
0.2, 0.2, 0.6, 0.6, 0.1, 0.1,
0.2, 0.2, 0.6, 0.6, 0.1, 0.1,
0.5, 0.5, 0.3, 0.3, 0.8, 0.8,
0.5, 0.5, 0.3, 0.3, 0.8, 0.8,
0.6, 0.6, 0.4, 0.4, 0.9, 0.9,
0.6, 0.6, 0.4, 0.4, 0.9, 0.9
For now, I've come up with an inefficient way of doing it with 2 nested for loops. I'm sure there must be an efficient and elegant way to do that with NumPy slicing.
This solution, using the repeat() method, works when N is an integer multiple of M.
import numpy as np
N = 6
M = 2
values = np.random.random( [N//M, N//M] )
y = values.repeat( M, axis=0 ).repeat( M, axis=1 )
print(y)
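If N is not a multiple of M, one possible extension is to generate one extra block per axis and crop the result. The following is only a sketch; block_heightfield is a hypothetical helper name, and letting the trailing blocks be cut off at the edge is an assumption about the desired behaviour:

```python
import numpy as np

def block_heightfield(N, M, rng=None):
    """N x N array made of M x M constant blocks with uniform random heights."""
    rng = np.random.default_rng() if rng is None else rng
    blocks = -(-N // M)  # ceil(N / M) blocks per axis
    values = rng.uniform(0.0, 1.0, size=(blocks, blocks))
    # Expand every value to an M x M block, then crop to N x N
    return values.repeat(M, axis=0).repeat(M, axis=1)[:N, :N]

h = block_heightfield(7, 3, np.random.default_rng(0))
print(h.shape)
```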
A half-open interval of the form [0,0.5) can be created using the following code:
rv = np.linspace(0., 0.5, nr, endpoint=False)
where nr is the number of points in the interval.
Question: How do I use linspace to create an open interval of the form (a,b) or a half-open interval of the form (a,b]?
Probably the simplest way (since this functionality isn't built in to np.linspace()) is to just slice what you want.
Let's say you're interested in the interval [0,1] with a spacing of 0.1.
>>> import numpy as np
>>> np.linspace(0, 1, 11) # [0,1]
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
>>> np.linspace(0, 1, 11-1, endpoint=False) # [0,1)
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
>>> np.linspace(0, 1, 11)[:-1] # [0,1) again
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
>>> np.linspace(0, 1, 11)[1:] # (0,1]
array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
>>> np.linspace(0, 1, 11)[1:-1] # (0,1)
array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
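The slicing idea can be wrapped in a small helper that returns exactly n points for any of the four interval types. This is a sketch; open_linspace and its left_open/right_open flags are made-up names, not part of NumPy:

```python
import numpy as np

def open_linspace(a, b, n, left_open=True, right_open=True):
    """Return n evenly spaced points, dropping each endpoint marked as open."""
    extra = int(left_open) + int(right_open)
    pts = np.linspace(a, b, n + extra)  # closed grid with the extra endpoints
    start = 1 if left_open else 0
    stop = len(pts) - 1 if right_open else len(pts)
    return pts[start:stop]

print(open_linspace(0, 1, 9))                     # (0, 1)
print(open_linspace(0, 1, 10, right_open=False))  # (0, 1]
```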
I have a symmetric, multi-index dataframe from which I want to systematically extract data:
import pandas as pd
df_index = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], [1, 2, 3, 4]], names=["group", "id"])
df = pd.DataFrame(
    [[1.0, 0.5, 0.3, -0.4],
     [0.5, 1.0, 0.9, -0.8],
     [0.3, 0.9, 1.0, 0.1],
     [-0.4, -0.8, 0.1, 1.0]],
    index=df_index, columns=df_index)
I want a function extract_vals that can return all values related to elements in the same group, EXCEPT for the diagonal AND elements must not be double-counted. Here are two examples of the desired behavior (order does not matter):
A_vals = extract_vals("A", df) # [0.5, 0.3, -0.4, 0.9, -0.8]
B_vals = extract_vals("B", df) # [0.3, 0.9, 0.1, -0.4, -0.8]
My question is similar to this question on SO, but my situation is different because I am using a multi-index dataframe.
Finally, to make things more fun, please consider efficiency because I'll be running this many times on much bigger dataframes. Thanks very much!
EDIT:
Happy001's solution is awesome. I came up with a method myself based on the logic of extracting the elements where target is NOT in BOTH the rows and columns, and then extracting the lower triangle of those elements where target IS in BOTH the rows and columns. However, Happy001's solution is much faster.
First, I created a more complex dataframe to make sure both methods are generalizable:
import pandas as pd
import numpy as np
df_index = pd.MultiIndex.from_arrays(
    [["A", "B", "A", "B", "C", "C"], [1, 2, 3, 4, 5, 6]], names=["group", "id"])
df = pd.DataFrame(
    [[1.0, 0.5, 1.0, -0.4, 1.1, -0.6],
     [0.5, 1.0, 1.2, -0.8, -0.9, 0.4],
     [1.0, 1.2, 1.0, 0.1, 0.3, 1.3],
     [-0.4, -0.8, 0.1, 1.0, 0.5, -0.2],
     [1.1, -0.9, 0.3, 0.5, 1.0, 0.7],
     [-0.6, 0.4, 1.3, -0.2, 0.7, 1.0]],
    index=df_index, columns=df_index)
Next, I defined both versions of extract_vals (the first is my own):
def extract_vals(target, multi_index_level_name, df):
    # Extract entries where target is in the rows but NOT also in the columns
    target_in_rows_but_not_in_cols_vals = df.loc[
        df.index.get_level_values(multi_index_level_name) == target,
        df.columns.get_level_values(multi_index_level_name) != target]
    # Extract entries where target is in the rows AND in the columns
    target_in_rows_and_cols_df = df.loc[
        df.index.get_level_values(multi_index_level_name) == target,
        df.columns.get_level_values(multi_index_level_name) == target]
    # np.bool was removed from recent NumPy releases; the builtin bool works everywhere
    mask = np.triu(np.ones(target_in_rows_and_cols_df.shape), k=1).astype(bool)
    vals_with_nans = target_in_rows_and_cols_df.where(mask).values.flatten()
    target_in_rows_and_cols_vals = vals_with_nans[~np.isnan(vals_with_nans)]
    # Append both arrays of extracted values
    vals = np.append(target_in_rows_but_not_in_cols_vals, target_in_rows_and_cols_vals)
    return vals
def extract_vals2(target, multi_index_level_name, df):
    # Get indices for what you want to extract and then extract all at once
    coord = [[i, j] for i in range(len(df)) for j in range(len(df)) if i < j and (
        df.index.get_level_values(multi_index_level_name)[i] == target or (
            df.columns.get_level_values(multi_index_level_name)[j] == target))]
    return df.values[tuple(np.transpose(coord))]
I checked that both functions returned output as desired:
# Expected values
e_A_vals = np.sort([0.5, 1.0, -0.4, 1.1, -0.6, 1.2, 0.1, 0.3, 1.3])
e_B_vals = np.sort([0.5, 1.2, -0.8, -0.9, 0.4, -0.4, 0.1, 0.5, -0.2])
e_C_vals = np.sort([1.1, -0.9, 0.3, 0.5, 0.7, -0.6, 0.4, 1.3, -0.2])
# Sort because order doesn't matter
assert np.allclose(np.sort(extract_vals("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals("C", "group", df)), e_C_vals)
assert np.allclose(np.sort(extract_vals2("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals2("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals2("C", "group", df)), e_C_vals)
And finally, I checked speed:
## Test speed
import time
# Method 1
start1 = time.time()
for ii in range(10000):
    out = extract_vals("C", "group", df)
elapsed1 = time.time() - start1
print(elapsed1)  # 28.5 sec
# Method 2
start2 = time.time()
for ii in range(10000):
    out2 = extract_vals2("C", "group", df)
elapsed2 = time.time() - start2
print(elapsed2)  # 10.9 sec
I don't assume df has the same columns and index. (Of course they can be the same).
import numpy as np
def extract_vals(group_label, df):
    # Keep upper-triangle positions (i < j) whose row or column belongs to the group
    coord = [[i, j] for i in range(len(df)) for j in range(len(df))
             if i < j and (df.index.get_level_values('group')[i] == group_label
                           or df.columns.get_level_values('group')[j] == group_label)]
    return df.values[tuple(np.transpose(coord))]
print(extract_vals('A', df))
print(extract_vals('B', df))
result:
[ 0.5 0.3 -0.4 0.9 -0.8]
[ 0.3 -0.4 0.9 -0.8 0.1]
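The same selection rule (i < j, and either the row or the column belongs to the group) can also be written with boolean masks instead of a Python-level double loop, which may help on larger frames. This is a sketch and has not been timed here; extract_vals_masked is a hypothetical name, and the level name is passed as an argument:

```python
import numpy as np
import pandas as pd

df_index = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], [1, 2, 3, 4]], names=["group", "id"])
df = pd.DataFrame(
    [[1.0, 0.5, 0.3, -0.4],
     [0.5, 1.0, 0.9, -0.8],
     [0.3, 0.9, 1.0, 0.1],
     [-0.4, -0.8, 0.1, 1.0]],
    index=df_index, columns=df_index)

def extract_vals_masked(target, level, df):
    rows = df.index.get_level_values(level) == target    # 1-D bool array
    cols = df.columns.get_level_values(level) == target  # 1-D bool array
    involved = rows[:, None] | cols[None, :]  # row OR column is in the group
    upper = np.triu(np.ones(df.shape, dtype=bool), k=1)  # strictly above the diagonal
    return df.to_numpy()[involved & upper]

print(extract_vals_masked("A", "group", df))
print(extract_vals_masked("B", "group", df))
```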
Is that what you want?
all elements above the diagonal:
In [139]: df.values[np.triu_indices(len(df), 1)]
Out[139]: array([ 0.5, 0.3, -0.4, 0.9, -0.8, 0.1])
A_vals:
In [140]: df.values[np.triu_indices(len(df), 1)][:-1]
Out[140]: array([ 0.5, 0.3, -0.4, 0.9, -0.8])
B_vals:
In [141]: df.values[np.triu_indices(len(df), 1)][1:]
Out[141]: array([ 0.3, -0.4, 0.9, -0.8, 0.1])
Source matrix:
In [142]: df.values
Out[142]:
array([[ 1. , 0.5, 0.3, -0.4],
[ 0.5, 1. , 0.9, -0.8],
[ 0.3, 0.9, 1. , 0.1],
[-0.4, -0.8, 0.1, 1. ]])