I am trying to read data using pandas.
Here is what I have tried:
df = pd.read_csv("samples_data.csv")
in_x = df.for_x
in_y = df.for_y
in_init = df.Init
plt.plot(in_x[0], in_y[0], 'b-')
The problem is that in_x and in_y hold strings rather than lists of numbers: (0, '[5 3 9 4.8 2]') (1, '[6 3 9 4.8 2]') ... How can I solve this?
Thank you for taking the time to answer my question.
I was expecting :
in_x_1 = in_x[2][0] # output: [
in_x_2 = in_x[2][1] # output: 6
Read in dataframe, and slice with the iloc method:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([
[[5,3,9,4.8,2], [5,3,9,4.8,9], 33],
[[6,3,9,4.8,2], [4,3.8,9,8,4], 87],
[[6.08,2.89,9,4.8,2], [8,3,9,4,7.34], 93],
],
columns=["for_x", "for_y", "Init"]
)
print(df)
in_x = df.for_x.iloc[0]
in_y = df.for_y.iloc[0]
plt.plot(in_x, in_y, 'b-')
plt.show()
Printing the dataframe:
for_x for_y Init
0 [5, 3, 9, 4.8, 2] [5, 3, 9, 4.8, 9] 33
1 [6, 3, 9, 4.8, 2] [4, 3.8, 9, 8, 4] 87
2 [6.08, 2.89, 9, 4.8, 2] [8, 3, 9, 4, 7.34] 93
If your dataframe has string entries, the eval function will turn them into lists which you can then plot data from:
df_2 = pd.DataFrame([
['[5,3,9,4.8,2]', '[5,3,9,4.8,9]', 33],
['[6,3,9,4.8,2]', '[4,3.8,9,8,4]', 87],
['[6.08,2.89,9,4.8,2]', '[8,3,9,4,7.34]', 93],
],
columns=["for_x", "for_y", "Init"]
)
in_x = eval(df_2.for_x.iloc[0])
in_y = eval(df_2.for_y.iloc[0])
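Because eval executes whatever expression the string contains, ast.literal_eval may be a safer choice for data read from a file, since it only accepts Python literals. A minimal sketch, assuming the same column names as above and converting whole columns at once:

```python
import ast

import pandas as pd

df_2 = pd.DataFrame(
    [
        ['[5,3,9,4.8,2]', '[5,3,9,4.8,9]', 33],
        ['[6,3,9,4.8,2]', '[4,3.8,9,8,4]', 87],
    ],
    columns=["for_x", "for_y", "Init"],
)

# literal_eval parses list and number literals but refuses arbitrary code.
for col in ["for_x", "for_y"]:
    df_2[col] = df_2[col].apply(ast.literal_eval)

in_x = df_2.for_x.iloc[0]
in_y = df_2.for_y.iloc[0]
```

After this, every cell in for_x and for_y is a real list, so in_x and in_y can be passed to plt.plot directly.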
If your values are not comma separated:
df_3 = pd.DataFrame([
['[5 3 9 4.8 2]', '[5 3 9 4.8 9]', 33],
['[6 3 9 4.8 2]', '[4 3.8 9 8 4]', 87],
['[6.08 2.89 9 4.8 2]', '[8 3 9 4 7.34]', 93],
],
columns=["for_x", "for_y", "Init"]
)
string_of_nums_x = df_3.for_x.iloc[0].strip('[').strip(']')
in_x = [float(s) for s in string_of_nums_x.split()]
string_of_nums_y = df_3.for_y.iloc[0].strip('[').strip(']')
in_y = [float(s) for s in string_of_nums_y.split()]
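The same parsing can be applied to every row rather than only the first one. The sketch below wraps the strip-and-split steps in a helper; the name parse_number_list is made up for illustration and is not part of pandas:

```python
import pandas as pd


def parse_number_list(text):
    """Turn a string such as '[5 3 9 4.8 2]' into a list of floats."""
    return [float(tok) for tok in text.strip("[]").split()]


df_3 = pd.DataFrame(
    [
        ['[5 3 9 4.8 2]', '[5 3 9 4.8 9]', 33],
        ['[6 3 9 4.8 2]', '[4 3.8 9 8 4]', 87],
    ],
    columns=["for_x", "for_y", "Init"],
)

# Replace the string cells with parsed lists, column by column.
for col in ["for_x", "for_y"]:
    df_3[col] = df_3[col].apply(parse_number_list)

first_of_row_1 = df_3.for_x.iloc[1][0]  # a number now, not the character '['
```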
Plotting:
I am trying to display, in a single grid of subplots, one boxplot for each column of my dataframe df.
I have read this question:
Subplot for seaborn boxplot
and tried to implement the given solution:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
d = {'col1': [1, 2, 5.5, 100], 'col2': [3, 4, 0.2, 3], 'col3': [1, 4, 6, 30], 'col4': [2, 24, 0.2, 13], 'col5': [9, 84, 0.9, 3]}
df = pd.DataFrame(data=d)
names = list(df.columns)
f, axes = plt.subplots(round(len(names)/3), 3)
y = 0;
for name in names:
    sns.boxplot(x= df[name], ax=axes[y])
    y = y + 1
Unfortunately I get an error
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-111-489a538377fc> in <module>
3 y = 0;
4 for name in names:
----> 5 sns.boxplot(x= df[name], ax=axes[y])
6 y = y + 1
AttributeError: 'numpy.ndarray' object has no attribute 'boxplot'
I understand there is a problem with df[name] but I can't see how to fix it.
Would someone be able to point me in the right direction?
Thank you very much.
The problem comes from passing ax=axes[y] to boxplot. axes is a 2-d numpy array with shape (2, 3) that contains the grid of Matplotlib axes you requested, so axes[y] is a 1-d numpy array holding three Matplotlib AxesSubplot objects. I suspect boxplot is attempting to dispatch to this argument, and it expects an object with a boxplot method. You can fix this by indexing axes with the row and column that you want to use.
Here's your script, with a small change to do that:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
d = {'col1': [1, 2, 5.5, 100], 'col2': [3, 4, 0.2, 3], 'col3': [1, 4, 6, 30], 'col4': [2, 24, 0.2, 13], 'col5': [9, 84, 0.9, 3]}
df = pd.DataFrame(data=d)
names = list(df.columns)
f, axes = plt.subplots(round(len(names)/3), 3)
y = 0;
for name in names:
    i, j = divmod(y, 3)
    sns.boxplot(x=df[name], ax=axes[i, j])
    y = y + 1
plt.tight_layout()
plt.show()
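An alternative that avoids tracking the row and column by hand is to flatten the axes grid and pair it with the column names. This is only a sketch: boxplot_grid is a hypothetical helper, and it uses Matplotlib's own ax.boxplot instead of seaborn to keep dependencies small (sns.boxplot(x=df[name], ax=ax) should fit in the same place).

```python
import math

import matplotlib.pyplot as plt
import pandas as pd


def boxplot_grid(df, ncols=3):
    """Draw one box plot per column of `df` on a grid with `ncols` columns."""
    names = list(df.columns)
    nrows = math.ceil(len(names) / ncols)
    # squeeze=False keeps `axes` 2-d even when there is only one row.
    fig, axes = plt.subplots(nrows, ncols, squeeze=False)
    flat_axes = axes.flatten()
    for ax, name in zip(flat_axes, names):
        ax.boxplot(df[name].to_numpy())
        ax.set_title(name)
    # Hide the panels that have no column to show.
    for ax in flat_axes[len(names):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes


d = {'col1': [1, 2, 5.5, 100], 'col2': [3, 4, 0.2, 3], 'col3': [1, 4, 6, 30],
     'col4': [2, 24, 0.2, 13], 'col5': [9, 84, 0.9, 3]}
fig, axes = boxplot_grid(pd.DataFrame(data=d))
```

Using math.ceil rather than round means the grid always has enough rows, and the unused sixth panel is hidden instead of being drawn empty. Call plt.show() afterwards to display the figure.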
The plot:
I have 60 numbers divided into 8 intervals:
[[534, 540.0, 3], [540.0, 546.0, 3], [546.0, 552.0, 14], [552.0, 558.0, 8], [558.0, 564.0, 14], [564.0, 570.0, 9], [570.0, 576.0, 6], [576.0, 582.0, 3]]
The number of numbers in each interval is divided by 6:
[0.5, 0.5, 2.33, 1.33, 2.33, 1.5, 1.0, 0.5]
How do I create a histogram so that the bar heights correspond to these values and the x-axis is labeled with my interval boundaries? The result should look something like this
(I do not have enough reputation to post images, so)
Running F Blanchet's code generates the following graph in my IPython console:
That doesn't really look like your image. I think you're looking for something more like this, where the x-ticks are between the bars:
This is the code I used to generate the above plot:
import matplotlib.pyplot as plt
# Include one more value for final x-tick.
intervals = list(range(534, 583, 6))
# Include one more bar height that == 0.
bar_height = [0.5, 0.5, 2.33, 1.33, 2.33, 1.5, 1.0, 0.5, 0]
plt.bar(intervals,
bar_height,
width = [6] * 8 + [0], # Set width of 0 bar to 0.
align = "edge", # Align ticks at edge of bars.
tick_label = intervals) # Make tick labels explicit.
You can use matplotlib:
import matplotlib.pyplot as plt
data = [[534, 540.0, 3], [540.0, 546.0, 3], [546.0, 552.0, 14], [552.0, 558.0, 8], [558.0, 564.0, 14], [564.0, 570.0, 9], [570.0, 576.0, 6], [576.0, 582.0, 3]]
x = [element[0]+3 for element in data]
y = [element[2]/6 for element in data]
width = 6
plt.bar(x, y, width, color="blue")
plt.show()
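If the tick labels should sit on the interval boundaries rather than at the bar centres, the edges, widths and heights can all be derived from the same data list. A rough sketch, assuming each entry is [left, right, count] as in the question:

```python
import matplotlib.pyplot as plt

data = [[534, 540.0, 3], [540.0, 546.0, 3], [546.0, 552.0, 14], [552.0, 558.0, 8],
        [558.0, 564.0, 14], [564.0, 570.0, 9], [570.0, 576.0, 6], [576.0, 582.0, 3]]

lefts = [lo for lo, hi, count in data]
widths = [hi - lo for lo, hi, count in data]
heights = [count / 6 for lo, hi, count in data]
# All left edges plus the right edge of the last interval.
edges = lefts + [data[-1][1]]

fig, ax = plt.subplots()
# align="edge" starts each bar at its left boundary instead of centring it.
ax.bar(lefts, heights, width=widths, align="edge", edgecolor="black")
ax.set_xticks(edges)
```

Because the widths are computed per interval, this should also cope with intervals of unequal size. Call plt.show() to display the result.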
When plotting 2 columns from a dataframe into a line plot, is it possible to, instead of a consistently increasing scale, have fixed values on your y axis (and keep the distances between the numbers on the axis constant)? For example, instead of 0, 100, 200, 300, ... to have 0, 21, 53, 124, 287, depending on the values from your dataset? So basically to have on the axis all your possible values fixed instead of an increasing scale?
Yes, you can use: ax.set_yticks()
Example:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[13, 1], [14, 1.5], [15, 1.8], [16, 2], [17, 2], [18, 3 ], [19, 3.6]], columns = ['A','B'])
fig, ax = plt.subplots()
x = df['A']
y = df['B']
ax.plot(x, y, 'g-')
ax.set_yticks(y)
plt.show()
Or, if the values are far apart from each other, you can use ax.set_yscale('log').
Example:
df = pd.DataFrame([[13, 1], [14, 1.5], [15, 1.8], [16, 2], [17, 2], [18, 3 ], [19, 3.6], [20, 300]], columns = ['A','B'])
fig, ax = plt.subplots()
x = df['A']
y = df['B']
ax.plot(x, y, 'g-')
ax.set_yscale('log', base=2)  # Matplotlib >= 3.3; older versions used basey=2 for the y axis
ax.yaxis.set_ticks(y)
ax.yaxis.set_ticklabels(y)
plt.show()
What you need to do is:
get all distinct y values and sort them
set their y position on the plot according to their place on the ordered list
set the y labels according to distinct ordered values
The code below does this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame([[13, 1], [14, 1.8], [16, 2], [15, 1.5], [17, 2], [18, 3 ],
[19, 200],[20, 3.6], ], columns = ['A','B'])
x = df['A']
y = df['B']
y_keys = np.sort(y.unique())
y_values = range(len(y_keys))
y_dict = dict(zip(y_keys,y_values))
fig, ax = plt.subplots()
ax.plot(x,[y_dict[k] for k in y],'o-')
ax.set_yticks(y_values)
ax.set_yticklabels(y_keys)
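The sort, dictionary and lookup steps can also be collapsed into a single call to np.unique with return_inverse=True, which returns the sorted distinct values together with each original value's position in that sorted list. A small sketch using the same B values as above:

```python
import numpy as np

y = np.array([1, 1.8, 2, 1.5, 2, 3, 200, 3.6])

# y_keys: sorted distinct values; y_pos: index of each entry of y in y_keys.
y_keys, y_pos = np.unique(y, return_inverse=True)

print(y_keys)  # [  1.    1.5   1.8   2.    3.    3.6 200. ]
print(y_pos)   # [0 2 3 1 3 4 6 5]
```

y_pos then takes the place of [y_dict[k] for k in y] in ax.plot, range(len(y_keys)) goes to ax.set_yticks, and y_keys goes to ax.set_yticklabels.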
I have the following array:
import numpy as np
a = np.array([[ 1, 2, 3],
[ 1, 2, 3],
[ 1, 2, 3]])
I understand that np.random.shuffle(a.T) will shuffle the array along the row, but what I need is for it to shuffle each row independently. How can this be done in NumPy? Speed is critical as there will be several million rows.
For this specific problem, each row will contain the same starting population.
import numpy as np
np.random.seed(2018)
def scramble(a, axis=-1):
    """
    Return an array with the values of `a` shuffled along the given axis.
    Note: the same permutation is applied to every slice along that axis.
    """
    b = a.swapaxes(axis, -1)
    n = a.shape[axis]
    idx = np.random.choice(n, n, replace=False)
    b = b[..., idx]
    return b.swapaxes(axis, -1)
a = np.arange(4*9).reshape(4, 9)
# array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8],
# [ 9, 10, 11, 12, 13, 14, 15, 16, 17],
# [18, 19, 20, 21, 22, 23, 24, 25, 26],
# [27, 28, 29, 30, 31, 32, 33, 34, 35]])
print(scramble(a, axis=1))
yields
[[ 3 8 7 0 4 5 1 2 6]
[12 17 16 9 13 14 10 11 15]
[21 26 25 18 22 23 19 20 24]
[30 35 34 27 31 32 28 29 33]]
while scrambling along the 0-axis:
print(scramble(a, axis=0))
yields
[[18 19 20 21 22 23 24 25 26]
[ 0 1 2 3 4 5 6 7 8]
[27 28 29 30 31 32 33 34 35]
[ 9 10 11 12 13 14 15 16 17]]
This works by first swapping the target axis with the last axis:
b = a.swapaxes(axis, -1)
This is a common trick used to standardize code which deals with one axis.
It reduces the general case to the specific case of dealing with the last axis.
Since in NumPy version 1.10 or higher swapaxes returns a view, there is no copying involved and so calling swapaxes is very quick.
Now we can generate a new index order for the last axis:
n = a.shape[axis]
idx = np.random.choice(n, n, replace=False)
Now we can reorder b along the last axis (every slice receives the same permutation idx):
b = b[..., idx]
and then reverse the swapaxes to return an a-shaped result:
return b.swapaxes(axis, -1)
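Note that the printed output above reorders every row in the same way, because a single idx is shared by all rows. If each row needs its own permutation, as the question asks, one vectorised option is to argsort a matrix of random numbers and gather with np.take_along_axis. The function name scramble_independent and its rng parameter are assumptions made for this sketch:

```python
import numpy as np


def scramble_independent(a, axis=-1, rng=None):
    """Shuffle `a` along `axis`, drawing a separate permutation per slice."""
    rng = np.random.default_rng() if rng is None else rng
    # Argsorting i.i.d. uniform noise along `axis` yields one random
    # permutation of the indices for every 1-d slice.
    idx = np.argsort(rng.random(a.shape), axis=axis)
    return np.take_along_axis(a, idx, axis=axis)


a = np.arange(4 * 9).reshape(4, 9)
shuffled = scramble_independent(a, axis=1, rng=np.random.default_rng(2018))
```

This costs a sort per slice, so it is O(n log n) in the slice length, but it runs without a Python-level loop over the rows.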
If you don't want a return value and want to operate on the array directly, you can specify the indices to shuffle.
>>> import numpy as np
>>>
>>>
>>> a = np.array([[1,2,3], [1,2,3], [1,2,3]])
>>>
>>> # Shuffle row `2` independently
>>> np.random.shuffle(a[2])
>>> a
array([[1, 2, 3],
[1, 2, 3],
[3, 2, 1]])
>>>
>>> # Shuffle column `0` independently
>>> np.random.shuffle(a[:,0])
>>> a
array([[3, 2, 3],
[1, 2, 3],
[1, 2, 1]])
If you want a return value as well, you can use numpy.random.permutation, in which case replace np.random.shuffle(a[n]) with a[n] = np.random.permutation(a[n]).
Warning: do not write a[n] = np.random.shuffle(a[n]). shuffle works in place and returns None, so the assignment either fills the row/column you meant to shuffle with nan (float arrays) or raises a TypeError (integer arrays).
Good answer above. But I will throw in a quick and dirty way:
a = np.array([[1,2,3], [4,5,6], [7,8,9]])
ignore_list_output = [np.random.shuffle(x) for x in a]
Then, a can be something like this
array([[2, 1, 3],
[4, 6, 5],
[9, 7, 8]])
Not very elegant but you can get this job done with just one short line.
Building on my comment to @Hun's answer, here's the fastest way to do this:
def shuffle_along(X):
    """Minimal in place independent-row shuffler."""
    [np.random.shuffle(x) for x in X]
This works in-place and can only shuffle rows. If you need more options:
def shuffle_along(X, axis=0, inline=False):
    """More elaborate version of the above."""
    if not inline:
        X = X.copy()
    if axis == 0:
        [np.random.shuffle(x) for x in X]
    if axis == 1:
        [np.random.shuffle(x) for x in X.T]
    if not inline:
        return X
This, however, has the limitation of only working on 2d-arrays. For higher dimensional tensors, I would use:
def shuffle_along(X, axis=0, inline=True):
    """Shuffle along any axis of a tensor."""
    if not inline:
        X = X.copy()
    np.apply_along_axis(np.random.shuffle, axis, X) # <-- I just changed this
    if not inline:
        return X
You can do it with NumPy without any loop or extra function, and much faster. For example, suppose we have an array of shape (6, 2) and we want a (2, 2) sub-array whose row indices are drawn independently at random for each column.
import numpy as np
test = np.array([[1, 1],
[2, 2],
[0.5, 0.5],
[0.3, 0.3],
[4, 4],
[7, 7]])
id_rnd = np.random.randint(6, size=(2, 2)) # draw random row indices (with replacement); use choice over a range if you don't want replacement.
new = np.take_along_axis(test, id_rnd, axis=0)
Out:
array([[2. , 2. ],
[0.5, 2. ]])
It works for any number of dimensions.
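Since randint samples with replacement, the same row can be picked twice within a column. A possible variant that avoids repeats, sketched here with Generator.permuted (available from NumPy 1.20; the variable names rows and k are illustrative), permutes the row indices separately for each column and keeps the first k:

```python
import numpy as np

rng = np.random.default_rng()
test = np.array([[1, 1], [2, 2], [0.5, 0.5], [0.3, 0.3], [4, 4], [7, 7]])
k = 2  # number of rows to draw per column

# One column of row indices 0..5 for every column of `test`.
rows = np.tile(np.arange(test.shape[0])[:, None], (1, test.shape[1]))
# Shuffle each index column independently, then keep the first k entries.
id_rnd = rng.permuted(rows, axis=0)[:k]
new = np.take_along_axis(test, id_rnd, axis=0)
```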
As of NumPy 1.20.0 released in January 2021 we have a permuted() method on the new Generator type (introduced with the new random API in NumPy 1.17.0, released in July 2019). This does exactly what you need:
import numpy as np
rng = np.random.default_rng()
a = np.array([
[1, 2, 3],
[1, 2, 3],
[1, 2, 3],
])
shuffled = rng.permuted(a, axis=1)
This gives you something like
>>> print(shuffled)
[[2 3 1]
[1 3 2]
[2 1 3]]
As you can see, the rows are permuted independently. This is in sharp contrast with both rng.permutation() and rng.shuffle().
If you want an in-place update you can pass the original array as the out keyword argument. And you can use the axis keyword argument to choose the direction along which to shuffle your array.
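A short sketch of the in-place form, assuming NumPy 1.20 or later:

```python
import numpy as np

rng = np.random.default_rng()
a = np.array([
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
])

# out=a writes the permuted values back into `a` itself;
# axis=1 shuffles within each row, axis=0 would shuffle within each column.
result = rng.permuted(a, axis=1, out=a)
```

Here result refers to the same data as a, so no second array the size of a is kept around.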