I have a pandas dataframe with index 3 to 15 with 0.5 steps and want to reindex it to 0.1 steps.
I tried this code and it doesn't work:
import numpy as np
import pandas as pd

# create data and set index and print for verification
df = pd.DataFrame({'A':np.arange(3,5,0.5),'B':np.arange(3,5,0.5)})
df.set_index('A', inplace = True)
df.reindex(np.arange(3,5,0.1)).head(15)
The above code outputs this:
       B
A
3.0  3.0
3.1  NaN
3.2  NaN
3.3  NaN
3.4  NaN
3.5  NaN   * expected output in this position to be 3.5 since it exists in the original df
3.6  NaN
3.7  NaN
3.8  NaN
Strangely the problem is fixed when reindexing from 0 instead of 3 as it's shown in the code below:
df = pd.DataFrame({'A':np.arange(3,5,0.5),'B':np.arange(3,5,0.5)})
df.set_index('A', inplace = True)
print(df.head())
df.reindex(np.arange(0,5,0.1)).head(60)
The output now correctly shows
       B
A
0.0  NaN
...  ...
3.0  3.0
3.1  NaN
3.2  NaN
3.3  NaN
3.4  NaN
3.5  3.5
3.6  NaN
3.7  NaN
3.8  NaN
I'm running python 3.8.5 on Windows 10.
Pandas version is 1.4.07
Numpy version is 1.22.1
Does anyone know why this happens? Is it a known or a new bug? Has it been fixed in a newer version of Python, pandas or NumPy?
Thanks
Good question.
This happens because np.arange(3,5,0.1) produces a value that is not exactly 3.5: it is 3.5000000000000004. np.arange(0,5,0.1), by contrast, does produce a 3.5 that is exactly 3.5, and so does np.arange(3,5,0.5). Since reindex matches index labels by exact equality, the 3.5 row of the original df is only found when the new index contains exactly 3.5.
pd.Index(np.arange(3,5,0.1))
Float64Index([ 3.0, 3.1, 3.2,
3.3000000000000003, 3.4000000000000004, 3.5000000000000004,
3.6000000000000005, 3.7000000000000006, 3.8000000000000007,
3.900000000000001, 4.000000000000001, 4.100000000000001,
4.200000000000001, 4.300000000000001, 4.400000000000001,
4.500000000000002, 4.600000000000001, 4.700000000000001,
4.800000000000002, 4.900000000000002],
dtype='float64')
and
pd.Index(np.arange(0,5,0.1))
Float64Index([ 0.0, 0.1, 0.2,
0.30000000000000004, 0.4, 0.5,
0.6000000000000001, 0.7000000000000001, 0.8,
0.9, 1.0, 1.1,
1.2000000000000002, 1.3, 1.4000000000000001,
1.5, 1.6, 1.7000000000000002,
1.8, 1.9000000000000001, 2.0,
2.1, 2.2, 2.3000000000000003,
2.4000000000000004, 2.5, 2.6,
2.7, 2.8000000000000003, 2.9000000000000004,
3.0, 3.1, 3.2,
3.3000000000000003, 3.4000000000000004, 3.5,
3.6, 3.7, 3.8000000000000003,
3.9000000000000004, 4.0, 4.1000000000000005,
4.2, 4.3, 4.4,
4.5, 4.6000000000000005, 4.7,
4.800000000000001, 4.9],
dtype='float64')
and
pd.Index(np.arange(3,5,0.5))
Float64Index([3.0, 3.5, 4.0, 4.5], dtype='float64')
This is definitely related to Numpy:
np.arange(3,5,0.1)[5]
3.5000000000000004
and
np.arange(3,5,0.1)[5] == 3.5
False
This situation is documented in the Numpy arange doc:
https://numpy.org/doc/stable/reference/generated/numpy.arange.html
The length of the output might not be numerically stable.
Another stability issue is due to the internal implementation of
numpy.arange. The actual step value used to populate the array is
dtype(start + step) - dtype(start) and not step. Precision loss can
occur here, due to casting or due to using floating points when start
is much larger than step. This can lead to unexpected behaviour.
It looks like np.linspace might be able to help you out here:
pd.Index(np.linspace(3,5,num=21))
Float64Index([3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2,
4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0],
dtype='float64')
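If the 0.1 grid is needed for the reindexing itself, one way to sidestep the accumulated error is to build the grid from integers and divide once, so each label is the closest float to the intended decimal. A minimal sketch, assuming the same df as in the question (the name grid is only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(3, 5, 0.5), 'B': np.arange(3, 5, 0.5)})
df = df.set_index('A')

# Build 3.0, 3.1, ..., 4.9 from integers: one division per value,
# so rounding errors do not accumulate from one step to the next.
grid = np.arange(30, 50) / 10

result = df.reindex(grid)
print(result.loc[3.5, 'B'])  # 3.5, the row is found this time
```

All four original rows (3.0, 3.5, 4.0, 4.5) keep their values here, because 35 / 10 and 45 / 10 evaluate to exactly 3.5 and 4.5.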
how do i replicate the structure of result of itertools.product?
so as you know itertools.product gives us an object and we need to put them in a list so we can print it
.. something like this.. right?
import itertools
import numpy as np
CN = np.asarray(list(itertools.product([0, 1], repeat=5)))
print(CN)
i want to be able to make something like that but i want the data to be from a csv file.. so i want to make something like this
#PSEUDOCODE
import pandas as pd
df = pd.read_csv('csv here')
#a b c d are the columns that i want to get
x = list(df['a'] df['b'] df['c'] df['d'])
print(x)
so the result will be something like this
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
how can i do that?
EDIT:
i am trying to learn how to do recursive feature elimination and i saw in some codes in google that they use the iris data set..
from sklearn import datasets
dataset = datasets.load_iris()
x = dataset.data
print(x)
and when printed it looked something like this
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
how could i make my dataset something like that so i can use this RFE template ?
# Recursive Feature Elimination
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load the iris datasets
dataset = datasets.load_iris()
# create a base classifier used to evaluate a subset of attributes
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
print(rfe)
rfe = rfe.fit(dataset.data, dataset.target)
print("features:",dataset.data)
print("target:",dataset.target)
print(rfe)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
You don't have to. If you want to use the rfe.fit function, you need to feed the features and the target separately.
So if your df is like:
a b c d target
0 5.1 3.5 1.4 0.2 1
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 1
5 5.4 3.9 1.7 0.4 1
6 4.6 3.4 1.4 0.3 0
7 5.0 3.4 1.5 0.2 0
8 4.4 2.9 1.4 0.2 1
9 4.9 3.1 1.5 0.1 1
you can use:
...
rfe = rfe.fit(df[['a', 'b', 'c', 'd']], df['target'])
...
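If a plain 2-D array like dataset.data is still wanted, selecting the columns and calling to_numpy() gives one. A minimal sketch, assuming a CSV with columns a, b, c, d and target; the inline csv_text is a hypothetical stand-in for the real file, which the question does not show:

```python
import io
import pandas as pd

# Stand-in for the real CSV file (hypothetical rows taken from the table above).
csv_text = """a,b,c,d,target
5.1,3.5,1.4,0.2,1
4.9,3.0,1.4,0.2,1
4.7,3.2,1.3,0.2,0
"""

df = pd.read_csv(io.StringIO(csv_text))

X = df[['a', 'b', 'c', 'd']].to_numpy()  # feature matrix, shape (n_rows, 4)
y = df['target'].to_numpy()              # target vector, shape (n_rows,)

print(X)
print(y)
```

X and y can then be passed to rfe.fit(X, y) in place of dataset.data and dataset.target.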
Today, I tried to split a number range evenly into a list of numbers with TensorFlow's linspace function, and I found that it returns a very annoying result:
import tensorflow as tf
print(tf.linspace(-3.0, 3.0, 11))
The output:
tf.Tensor(
[-3. -2.4 -1.8 -1.1999999 -0.5999999 0.
0.60000014 1.2000003 1.8000002 2.4 3. ], shape=(11,), dtype=float32)
But if I use the same function in Numpy, it will show a more reasonable result:
import numpy as np
print(np.linspace(-3.0, 3.0, 11))
The output:
[-3. -2.4 -1.8 -1.2 -0.6 0. 0.6 1.2 1.8 2.4 3. ]
Why does TensorFlow return numbers with long decimals, like -0.5999999 instead of just -0.6?
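The tensor above is reported as dtype=float32, while np.linspace defaults to float64, so part of the difference is probably precision: 0.6 has no exact binary representation, and float32 keeps only about 7 significant decimal digits. The NumPy-only sketch below illustrates the size of that effect. It does not reproduce TensorFlow's internal computation, which is not shown in the question:

```python
import numpy as np

x64 = np.linspace(-3.0, 3.0, 11)                    # float64 (NumPy default)
x32 = np.linspace(-3.0, 3.0, 11, dtype=np.float32)  # same grid stored as float32

# Largest difference between the float32 grid and the float64 grid
err32 = np.abs(x32.astype(np.float64) - x64).max()
print(err32)

# Machine epsilon: the relative spacing of representable numbers
print(np.finfo(np.float32).eps, np.finfo(np.float64).eps)
```

If float64 values are needed on the TensorFlow side, passing start and stop as float64 tensors, for example tf.constant(-3.0, dtype=tf.float64), should give a float64 result, since the output type follows the type of start.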
My Python 3.5.2 output in the terminal (on a Mac) is limited to a width of about 80 characters, even if I increase the size of the terminal window.
This narrow width causes a bunch of line breaks when outputting long arrays, which is really a hassle. How do I tell Python to use the full command line window width?
For the record, I am not seeing this problem in any other program; for instance, my C++ output looks just fine.
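For numpy, it turns out you can enable the full output by setting
np.set_printoptions(suppress=True, linewidth=sys.maxsize, threshold=sys.maxsize)
(after import sys). Older NumPy releases also accepted np.nan for these values, but newer ones raise a ValueError for threshold=np.nan.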
In Python 3.3 and above, you can use
import pandas as pd
from shutil import get_terminal_size
pd.set_option('display.width', get_terminal_size()[0])
I have the same problem while using pandas. So if this is what you are trying to solve, I fixed mine by doing
pd.set_option('display.width', pd.util.terminal.get_terminal_size()[0])
Default output of a 2x15 matrix is broken:
a.T
array([[ 0.2, -1.4, -0.8, 1.3, -1.5, -1.4, 0.6, -1.5, 0.4, -0.9, 0.3,
1.1, 0.5, -0.3, 1.1],
[ 1.3, -1.2, 1.6, -1.4, 0.9, -1.2, -1.9, 0.9, 1.8, -1.8, 1.7,
-1.3, 1.4, -1.7, -1.3]])
Output is fixed using numpy set_printoptions() command
import sys
np.set_printoptions(suppress=True,linewidth=sys.maxsize,threshold=sys.maxsize)
a.T
[[ 0.2 -1.4 -0.8 1.3 -1.5 -1.4 0.6 -1.5 0.4 -0.9 0.3 1.1 0.5 -0.3 1.1]
[ 1.3 -1.2 1.6 -1.4 0.9 -1.2 -1.9 0.9 1.8 -1.8 1.7 -1.3 1.4 -1.7 -1.3]]
System and numpy versions:
sys.version = 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
numpy.__version__ = 1.18.5
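Instead of an unlimited line width, the same shutil call can be used to match numpy's line width to the actual terminal. A minimal sketch; the fallback size of 120 columns is an arbitrary choice for when no terminal is attached:

```python
import shutil
import numpy as np

# Query the terminal width; the fallback is used when no terminal is attached
# (for example when the output is piped to a file).
width = shutil.get_terminal_size(fallback=(120, 24)).columns

np.set_printoptions(linewidth=width)
print(np.get_printoptions()['linewidth'])
```

This only reads the width once, so it would need to be called again after the window is resized.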
I have a lines object which was created with the following:
junk = plt.plot([xxxx], [yyyy])
for x in junk:
    print(type(x))
<class 'matplotlib.lines.Line2D'>
I need to find the names of the two lists 'xxxx' and 'yyyy'. How can I get them from the class attributes?
You can use dir to see the content of an object in python, or check the docs for the class. I guess the objects you are looking for are xdata and ydata (although I'm a bit confused, in your post you ask for the names of the lists?)
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 5, 0.1)
y = np.sin(x)
junk = plt.plot(x, y)
# plt.plot returns a list of Line2D objects, one per plotted line
for line in junk:
    # print(dir(line))
    print(line.get_xdata())
    print(line.get_ydata())
[ 0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. 1.1 1.2 1.3 1.4
1.5 1.6 1.7 1.8 1.9 2. 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
3. 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4. 4.1 4.2 4.3 4.4
4.5 4.6 4.7 4.8 4.9]
[ 0. 0.09983342 0.19866933 0.29552021 0.38941834 0.47942554
0.56464247 0.64421769 0.71735609 0.78332691 0.84147098 0.89120736
0.93203909 0.96355819 0.98544973 0.99749499 0.9995736 0.99166481
0.97384763 0.94630009 0.90929743 0.86320937 0.8084964 0.74570521
0.67546318 0.59847214 0.51550137 0.42737988 0.33498815 0.23924933
0.14112001 0.04158066 -0.05837414 -0.15774569 -0.2555411 -0.35078323
-0.44252044 -0.52983614 -0.61185789 -0.68776616 -0.7568025 -0.81827711
-0.87157577 -0.91616594 -0.95160207 -0.97753012 -0.993691 -0.99992326
-0.99616461 -0.98245261]
Hope it helps.
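On the names themselves: a Line2D stores the data, not the names of the variables that were passed to plt.plot, so those cannot be recovered from the object. If a name is needed later, one option is to attach it as a label when plotting. A minimal sketch (the label text "sine" is just an example, and the Agg backend is chosen only so the snippet runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 5, 0.1)
y = np.sin(x)

# plt.plot returns a list with one Line2D per plotted line
lines = plt.plot(x, y, label="sine")

for line in lines:
    print(line.get_label())      # the label given above
    print(line.get_xdata()[:3])  # first few x values
    print(line.get_ydata()[:3])  # first few y values
```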