Python Numpy arange bins should be shown in ascending order

I have created a series of bins using the Numpy 'arange' function:
bins = np.arange(0, df['eCPM'].max(), 0.1)
The output looks like this:
[1.8, 1.9) 145940.67 52.569295 1.842306
[1.9, 2) 150356.59 54.159954 1.932365
[10.6, 10.7) 150980.84 54.384815 10.626436
[13.3, 13.4) 152038.63 54.765842 13.373157
[2, 2.1) 171494.11 61.773901 2.033192
[2.1, 2.2) 178196.65 64.188223 2.141412
[2.2, 2.3) 186259.13 67.092410 2.264005
How can I get the bins [10.6, 10.7) and [13.3, 13.4) to go where they belong, so that all bins appear in ascending order?
I assume the bin labels are being sorted as strings, hence the issue. I tried adding a dtype: bins = ..., 0.1, dtype=float), but with no luck.
[EDIT]
import numpy as np
import pandas
df = pandas.read_csv('path/to/file', skipfooter=1, engine='python')
bins = np.arange(0, df['eCPM'].max(), 0.1, dtype=float)
df['ecpm group'] = pandas.cut(df['eCPM'], bins, right=False, labels=None)
df = df[['ecpm group', 'Imps', 'Revenue']].groupby('ecpm group').sum()

You could sort the index in "human order" and then reindex:
import numpy as np
import pandas as pd
import re
def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    def atoi(text):
        # Turn runs of digits into ints so that 10 sorts after 2.
        return int(text) if text.isdigit() else text
    return [atoi(c) for c in re.split(r'(\d+)', text)]
# df = pandas.read_csv('path/to/file', skip_footer=1)
df = pd.DataFrame({'eCPM': np.random.randint(20, size=40)})
bins = np.arange(0, df['eCPM'].max()+1, 0.1, dtype=float)
df['ecpm group'] = pd.cut(df['eCPM'], bins, right=False, labels=None)
df = df.groupby('ecpm group').sum()
df = df.reindex(index=sorted(df.index, key=natural_keys))
print(df)
yields
eCPM
[0, 0.1) 0
[1, 1.1) 5
[2, 2.1) 4
[4, 4.1) 12
[6, 6.1) 24
[7, 7.1) 7
[8, 8.1) 16
[9, 9.1) 45
[10, 10.1) 40
[11, 11.1) 11
[12, 12.1) 12
[13, 13.1) 13
[15, 15.1) 15
[16, 16.1) 64
[17, 17.1) 34
[18, 18.1) 18
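When the bin labels really are plain strings such as '[10.6, 10.7)', another option is to sort them by their numeric lower bound. The following is only a minimal sketch under that assumption: the helper name left_edge is hypothetical, and the labels are copied from the output in the question.

```python
def left_edge(label):
    """Return the numeric lower bound of a label such as '[10.6, 10.7)'."""
    # Drop the surrounding brackets, then keep the text before the comma.
    return float(label.strip("[]()").split(",")[0])


# Labels in the (string-sorted) order shown in the question.
labels = ["[1.8, 1.9)", "[1.9, 2)", "[10.6, 10.7)", "[13.3, 13.4)",
          "[2, 2.1)", "[2.1, 2.2)", "[2.2, 2.3)"]

ordered = sorted(labels, key=left_edge)
print(ordered)
```

With a grouped frame whose index holds such strings, the same key could be passed to reindex, as in df.reindex(index=sorted(df.index, key=left_edge)).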

Related

How to access pandas data from a table

I am trying to read data using pandas.
Here is what I have tried:
df = pd.read_csv("samples_data.csv")
in_x = df.for_x
in_y = df.for_y
in_init = df.Init
plt.plot(in_x[0], in_y[0], 'b-')
The problem is that in_x and in_y contain strings: (0, '[5 3 9 4.8 2]') (1, '[6 3 9 4.8 2]') ... How can I solve this?
Thank you for taking the time to answer my question.
I was expecting :
in_x_1 = in_x[2][0] # output: [
in_x_2 = in_x[2][1] # output: 6

Read in dataframe, and slice with the iloc method:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([
[[5,3,9,4.8,2], [5,3,9,4.8,9], 33],
[[6,3,9,4.8,2], [4,3.8,9,8,4], 87],
[[6.08,2.89,9,4.8,2], [8,3,9,4,7.34], 93],
],
columns=["for_x", "for_y", "Init"]
)
print(df)
in_x = df.for_x.iloc[0]
in_y = df.for_y.iloc[0]
plt.plot(in_x, in_y, 'b-')
plt.show()
Printing the dataframe:
for_x for_y Init
0 [5, 3, 9, 4.8, 2] [5, 3, 9, 4.8, 9] 33
1 [6, 3, 9, 4.8, 2] [4, 3.8, 9, 8, 4] 87
2 [6.08, 2.89, 9, 4.8, 2] [8, 3, 9, 4, 7.34] 93
If your dataframe has string entries, the eval function will turn them into lists which you can then plot data from:
df_2 = pd.DataFrame([
['[5,3,9,4.8,2]', '[5,3,9,4.8,9]', 33],
['[6,3,9,4.8,2]', '[4,3.8,9,8,4]', 87],
['[6.08,2.89,9,4.8,2]', '[8,3,9,4,7.34]', 93],
],
columns=["for_x", "for_y", "Init"]
)
in_x = eval(df_2.for_x.iloc[0])
in_y = eval(df_2.for_y.iloc[0])
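Since eval executes arbitrary code, ast.literal_eval from the standard library is usually a safer choice for strings that only contain Python literals. A minimal sketch, assuming the comma-separated layout of df_2 above (shortened to two rows here):

```python
import ast

import pandas as pd

df_2 = pd.DataFrame(
    [
        ['[5,3,9,4.8,2]', '[5,3,9,4.8,9]', 33],
        ['[6,3,9,4.8,2]', '[4,3.8,9,8,4]', 87],
    ],
    columns=["for_x", "for_y", "Init"],
)

# Parse every cell of both columns; literal_eval only accepts literals,
# so a malformed or malicious string raises an error instead of running code.
for column in ("for_x", "for_y"):
    df_2[column] = df_2[column].apply(ast.literal_eval)

in_x = df_2["for_x"].iloc[0]
in_y = df_2["for_y"].iloc[0]
print(in_x, in_y)
```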
If your values are not comma separated:
df_3 = pd.DataFrame([
['[5 3 9 4.8 2]', '[5 3 9 4.8 9]', 33],
['[6 3 9 4.8 2]', '[4 3.8 9 8 4]', 87],
['[6.08 2.89 9 4.8 2]', '[8 3 9 4 7.34]', 93],
],
columns=["for_x", "for_y", "Init"]
)
string_of_nums_x = df_3.for_x.iloc[0].strip('[').strip(']')
in_x = [float(s) for s in string_of_nums_x.split()]
string_of_nums_y = df_3.for_y.iloc[0].strip('[').strip(']')
in_y = [float(s) for s in string_of_nums_y.split()]
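The strip-and-split steps can be wrapped in one small function so that they apply to any cell. This is just a sketch: parse_numbers is a hypothetical helper name, and it assumes each cell is a single bracketed list of numbers separated by spaces or commas.

```python
def parse_numbers(text):
    """Convert '[5 3 9 4.8 2]' or '[5,3,9,4.8,2]' into a list of floats."""
    # Remove the brackets, treat commas as whitespace, then split on whitespace.
    cleaned = text.strip("[] ").replace(",", " ")
    return [float(token) for token in cleaned.split()]


in_x = parse_numbers('[5 3 9 4.8 2]')
in_y = parse_numbers('[6.08,2.89,9,4.8,2]')
print(in_x, in_y)
```

With a DataFrame, the helper could be applied to a whole column, for example df_3["for_x"].apply(parse_numbers).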

boxplot for all data in dataframe: error "'numpy.ndarray' object has no attribute 'boxplot'"

I am trying to display in a subplot all the boxplots corresponding to each columns in my dataframe df.
I have read this question:
Subplot for seaborn boxplot
and tried to implement the given solution:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
d = {'col1': [1, 2, 5.5, 100], 'col2': [3, 4, 0.2, 3], 'col3': [1, 4, 6, 30], 'col4': [2, 24, 0.2, 13], 'col5': [9, 84, 0.9, 3]}
df = pd.DataFrame(data=d)
names = list(df.columns)
f, axes = plt.subplots(round(len(names)/3), 3)
y = 0;
for name in names:
    sns.boxplot(x= df[name], ax=axes[y])
    y = y + 1
Unfortunately I get an error
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-111-489a538377fc> in <module>
3 y = 0;
4 for name in names:
----> 5 sns.boxplot(x= df[name], ax=axes[y])
6 y = y + 1
AttributeError: 'numpy.ndarray' object has no attribute 'boxplot'
I understand there is a problem with df[name] but I can't see how to fix it.
Would someone be able to point me in the right direction?
Thank you very much.

The problem comes from passing ax=axes[y] to boxplot. axes is a 2-d numpy array with shape (2, 3) that contains the grid of Matplotlib axes that you requested. So axes[y] is a 1-d numpy array that contains three Matplotlib AxesSubplot objects. I suspect boxplot is attempting to dispatch to this argument, and it expects it to be an object with a boxplot method. You can fix this by indexing axes with the appropriate row and column that you want to use.
Here's your script, with a small change to do that:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
d = {'col1': [1, 2, 5.5, 100], 'col2': [3, 4, 0.2, 3], 'col3': [1, 4, 6, 30], 'col4': [2, 24, 0.2, 13], 'col5': [9, 84, 0.9, 3]}
df = pd.DataFrame(data=d)
names = list(df.columns)
f, axes = plt.subplots(round(len(names)/3), 3)
y = 0
for name in names:
    i, j = divmod(y, 3)
    sns.boxplot(x=df[name], ax=axes[i, j])
    y = y + 1
plt.tight_layout()
plt.show()
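A variation on the same idea is to flatten the axes grid so that a single index is enough, and to hide the grid cells that are left over. The sketch below is an assumption-laden illustration rather than the answer's method: it swaps in Matplotlib's own ax.boxplot so that it runs without seaborn (sns.boxplot(x=df[name], ax=ax) should slot into the same place), and it selects the Agg backend only so that it runs without a display.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

d = {'col1': [1, 2, 5.5, 100], 'col2': [3, 4, 0.2, 3], 'col3': [1, 4, 6, 30],
     'col4': [2, 24, 0.2, 13], 'col5': [9, 84, 0.9, 3]}
df = pd.DataFrame(data=d)

ncols = 3
nrows = math.ceil(len(df.columns) / ncols)  # enough rows for every column

# squeeze=False keeps axes 2-d even when there is a single row.
fig, axes = plt.subplots(nrows, ncols, squeeze=False)
flat_axes = axes.ravel()  # 1-d view: one Axes object per element

for ax, name in zip(flat_axes, df.columns):
    ax.boxplot(df[name].to_numpy())
    ax.set_title(name)

# Hide the unused cells of the grid (here, the sixth one).
for ax in flat_axes[len(df.columns):]:
    ax.set_visible(False)

fig.tight_layout()
```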

how to plot a histogram by given points in python 3

I have 60 numbers divided into 8 intervals:
[[534, 540.0, 3], [540.0, 546.0, 3], [546.0, 552.0, 14], [552.0, 558.0, 8], [558.0, 564.0, 14], [564.0, 570.0, 9], [570.0, 576.0, 6], [576.0, 582.0, 3]]
The number of numbers in each interval is divided by 6:
[0.5, 0.5, 2.33, 1.33, 2.33, 1.5, 1.0, 0.5]
How do I create a histogram whose bar heights correspond to these values, with the x-axis labeled according to my intervals? The result should be something like this
i do not have reputation to post images, so

Running F Blanchet's code generates the following graph in my IPython console:
That doesn't really look like your image. I think you're looking for something more like this, where the x-ticks are between the bars:
This is the code I used to generate the above plot:
import matplotlib.pyplot as plt
# Include one more value for final x-tick.
intervals = list(range(534, 583, 6))
# Include one more bar height that == 0.
bar_height = [0.5, 0.5, 2.33, 1.33, 2.33, 1.5, 1.0, 0.5, 0]
plt.bar(intervals,
        bar_height,
        width=[6] * 8 + [0],    # Set width of 0 bar to 0.
        align="edge",           # Align ticks at edge of bars.
        tick_label=intervals)   # Make tick labels explicit.

You can use matplotlib:
import matplotlib.pyplot as plt
data = [[534, 540.0, 3], [540.0, 546.0, 3], [546.0, 552.0, 14], [552.0, 558.0, 8], [558.0, 564.0, 14], [564.0, 570.0, 9], [570.0, 576.0, 6], [576.0, 582.0, 3]]
x = [element[0]+3 for element in data]
y = [element[2]/6 for element in data]
width = 6
plt.bar(x, y, width, color="blue")
plt.show()
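Another way to draw precomputed bar heights against known bin edges is Axes.stairs, which takes the heights and the edges directly. This is a sketch only: it assumes Matplotlib 3.4 or newer (where stairs was added), reuses the interval data from the question, and picks the Agg backend purely so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

data = [[534, 540.0, 3], [540.0, 546.0, 3], [546.0, 552.0, 14], [552.0, 558.0, 8],
        [558.0, 564.0, 14], [564.0, 570.0, 9], [570.0, 576.0, 6], [576.0, 582.0, 3]]

# One edge per interval start, plus the end of the last interval.
edges = [row[0] for row in data] + [data[-1][1]]
# Bar height = count divided by the interval width (6), as in the question.
heights = [row[2] / 6 for row in data]

fig, ax = plt.subplots()
ax.stairs(heights, edges, fill=True)
ax.set_xticks(edges)  # put the tick marks on the interval boundaries
```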

Plot specific values on y axis instead of increasing scale from dataframe

When plotting 2 columns from a dataframe into a line plot, is it possible to, instead of a consistently increasing scale, have fixed values on your y axis (and keep the distances between the numbers on the axis constant)? For example, instead of 0, 100, 200, 300, ... to have 0, 21, 53, 124, 287, depending on the values from your dataset? So basically to have on the axis all your possible values fixed instead of an increasing scale?

Yes, you can use: ax.set_yticks()
Example:
df = pd.DataFrame([[13, 1], [14, 1.5], [15, 1.8], [16, 2], [17, 2], [18, 3 ], [19, 3.6]], columns = ['A','B'])
fig, ax = plt.subplots()
x = df['A']
y = df['B']
ax.plot(x, y, 'g-')
ax.set_yticks(y)
plt.show()
Or, if the values are very far apart from each other, you can use ax.set_yscale('log').
Example:
df = pd.DataFrame([[13, 1], [14, 1.5], [15, 1.8], [16, 2], [17, 2], [18, 3 ], [19, 3.6], [20, 300]], columns = ['A','B'])
fig, ax = plt.subplots()
x = df['A']
y = df['B']
ax.plot(x, y, 'g-')
ax.set_yscale('log', base=2)  # Matplotlib 3.3+; older releases used basey=2 for the y axis
ax.yaxis.set_ticks(y)
ax.yaxis.set_ticklabels(y)
plt.show()

What you need to do is:
1. get all distinct y values and sort them
2. set their y positions on the plot according to their place in the ordered list
3. set the y labels to the distinct ordered values
The code below does this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame([[13, 1], [14, 1.8], [16, 2], [15, 1.5], [17, 2], [18, 3 ],
[19, 200],[20, 3.6], ], columns = ['A','B'])
x = df['A']
y = df['B']
y_keys = np.sort(y.unique())
y_values = range(len(y_keys))
y_dict = dict(zip(y_keys,y_values))
fig, ax = plt.subplots()
ax.plot(x,[y_dict[k] for k in y],'o-')
ax.set_yticks(y_values)
ax.set_yticklabels(y_keys)

NumPy random shuffle rows independently

I have the following array:
import numpy as np
a = np.array([[ 1, 2, 3],
[ 1, 2, 3],
[ 1, 2, 3]])
I understand that np.random.shuffle(a.T) will shuffle the array along the row, but what I need is for it to shuffle each row independently. How can this be done in NumPy? Speed is critical, as there will be several million rows.
For this specific problem, each row will contain the same starting population.

import numpy as np
np.random.seed(2018)
def scramble(a, axis=-1):
    """
    Return an array with the values of `a` shuffled along the given axis.
    The same random permutation is applied to every slice along that axis.
    """
    b = a.swapaxes(axis, -1)
    n = a.shape[axis]
    idx = np.random.choice(n, n, replace=False)
    b = b[..., idx]
    return b.swapaxes(axis, -1)
a = np.arange(4*9).reshape(4, 9)
# array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8],
# [ 9, 10, 11, 12, 13, 14, 15, 16, 17],
# [18, 19, 20, 21, 22, 23, 24, 25, 26],
# [27, 28, 29, 30, 31, 32, 33, 34, 35]])
print(scramble(a, axis=1))
yields
[[ 3 8 7 0 4 5 1 2 6]
[12 17 16 9 13 14 10 11 15]
[21 26 25 18 22 23 19 20 24]
[30 35 34 27 31 32 28 29 33]]
while scrambling along the 0-axis:
print(scramble(a, axis=0))
yields
[[18 19 20 21 22 23 24 25 26]
[ 0 1 2 3 4 5 6 7 8]
[27 28 29 30 31 32 33 34 35]
[ 9 10 11 12 13 14 15 16 17]]
This works by first swapping the target axis with the last axis:
b = a.swapaxes(axis, -1)
This is a common trick used to standardize code which deals with one axis.
It reduces the general case to the specific case of dealing with the last axis.
Since in NumPy version 1.10 or higher swapaxes returns a view, there is no copying involved and so calling swapaxes is very quick.
Now we can generate a new index order for the last axis:
n = a.shape[axis]
idx = np.random.choice(n, n, replace=False)
Now we can shuffle b along the last axis (the same index order idx is applied to every row, as the output above shows):
b = b[..., idx]
and then reverse the swapaxes to return an a-shaped result:
return b.swapaxes(axis, -1)
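In the output above, every row ends up in the same column order, because a single idx is shared by all rows. If each row needs its own permutation, one vectorized option is to argsort a matrix of random numbers and gather with take_along_axis. A minimal sketch, not part of the original answer; the function name shuffle_rows_independently is hypothetical:

```python
import numpy as np


def shuffle_rows_independently(a, rng=None):
    """Return a copy of `a` with every slice along the last axis shuffled on its own."""
    rng = np.random.default_rng() if rng is None else rng
    # Sorting independent uniform random numbers yields one random
    # permutation of the column indices per row.
    order = rng.random(a.shape).argsort(axis=-1)
    return np.take_along_axis(a, order, axis=-1)


a = np.arange(4 * 9).reshape(4, 9)
shuffled = shuffle_rows_independently(a, np.random.default_rng(2018))
print(shuffled)
```

The argsort makes this O(n log n) per row rather than O(n), but it avoids a Python-level loop over millions of rows.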

If you don't want a return value and want to operate on the array directly, you can specify the indices to shuffle.
>>> import numpy as np
>>>
>>>
>>> a = np.array([[1,2,3], [1,2,3], [1,2,3]])
>>>
>>> # Shuffle row `2` independently
>>> np.random.shuffle(a[2])
>>> a
array([[1, 2, 3],
[1, 2, 3],
[3, 2, 1]])
>>>
>>> # Shuffle column `0` independently
>>> np.random.shuffle(a[:,0])
>>> a
array([[3, 2, 3],
[1, 2, 3],
[1, 2, 1]])
If you want a return value as well, you can use numpy.random.permutation, in which case replace np.random.shuffle(a[n]) with a[n] = np.random.permutation(a[n]).
Warning: do not write a[n] = np.random.shuffle(a[n]). shuffle returns None, so the assignment overwrites the row/column you meant to shuffle: a float array gets filled with nan, and an integer array raises a TypeError.

Good answer above. But I will throw in a quick and dirty way:
a = np.array([[1,2,3], [4,5,6], [7,8,9]])
ignore_list_output = [np.random.shuffle(x) for x in a]
Then, a can be something like this
array([[2, 1, 3],
[4, 6, 5],
[9, 7, 8]])
Not very elegant but you can get this job done with just one short line.

Building on my comment to @Hun's answer, here's the fastest way to do this:
def shuffle_along(X):
    """Minimal in place independent-row shuffler."""
    [np.random.shuffle(x) for x in X]
This works in-place and can only shuffle rows. If you need more options:
def shuffle_along(X, axis=0, inline=False):
    """More elaborate version of the above."""
    if not inline:
        X = X.copy()
    if axis == 0:
        [np.random.shuffle(x) for x in X]
    if axis == 1:
        [np.random.shuffle(x) for x in X.T]
    if not inline:
        return X
This, however, has the limitation of only working on 2d-arrays. For higher dimensional tensors, I would use:
def shuffle_along(X, axis=0, inline=True):
    """Shuffle along any axis of a tensor."""
    if not inline:
        X = X.copy()
    np.apply_along_axis(np.random.shuffle, axis, X) # <-- I just changed this
    if not inline:
        return X

You can do it with numpy without any loop or extra function, and much faster. E.g., suppose we have an array of shape (6, 2) and we want a (2, 2) sub-array with an independent random index for each column.
import numpy as np
test = np.array([[1, 1],
[2, 2],
[0.5, 0.5],
[0.3, 0.3],
[4, 4],
[7, 7]])
id_rnd = np.random.randint(6, size=(2, 2)) # select random indices; use choice and range if you don't want replacement.
new = np.take_along_axis(test, id_rnd, axis=0)
Out:
array([[2. , 2. ],
[0.5, 2. ]])
It works for any number of dimensions.

As of NumPy 1.20.0 released in January 2021 we have a permuted() method on the new Generator type (introduced with the new random API in NumPy 1.17.0, released in July 2019). This does exactly what you need:
import numpy as np
rng = np.random.default_rng()
a = np.array([
[1, 2, 3],
[1, 2, 3],
[1, 2, 3],
])
shuffled = rng.permuted(a, axis=1)
This gives you something like
>>> print(shuffled)
[[2 3 1]
[1 3 2]
[2 1 3]]
As you can see, the rows are permuted independently. This is in sharp contrast with both rng.permutation() and rng.shuffle().
If you want an in-place update you can pass the original array as the out keyword argument. And you can use the axis keyword argument to choose the direction along which to shuffle your array.
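For the in-place variant, a short sketch, assuming NumPy 1.20 or newer; the seed 2018 is arbitrary and only makes the run repeatable:

```python
import numpy as np

rng = np.random.default_rng(2018)

a = np.tile(np.arange(1, 4), (3, 1))  # three rows of [1, 2, 3]
original = a.copy()

# Passing the input array as `out` shuffles each row of `a` in place.
rng.permuted(a, axis=1, out=a)
print(a)
```

Each row still holds the values 1, 2 and 3, but in its own random order.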
