I have a pandas dataframe with index 3 to 15 with 0.5 steps and want to reindex it to 0.1 steps.
I tried the following code, but it doesn't work:
# create data and set index and print for verification
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(3, 5, 0.5), 'B': np.arange(3, 5, 0.5)})
df.set_index('A', inplace=True)
df.reindex(np.arange(3, 5, 0.1)).head(15)
The above code outputs this:
       B
A
3.0    3.0
3.1    NaN
3.2    NaN
3.3    NaN
3.4    NaN
3.5    NaN    <- expected this to be 3.5, since it exists in the original df
3.6    NaN
3.7    NaN
3.8    NaN
Strangely the problem is fixed when reindexing from 0 instead of 3 as it's shown in the code below:
df = pd.DataFrame({'A':np.arange(3,5,0.5),'B':np.arange(3,5,0.5)})
df.set_index('A', inplace = True)
print(df.head())
df.reindex(np.arange(0,5,0.1)).head(60)
The output now correctly shows
       B
A
0.0    NaN
...    ...
3.0    3.0
3.1    NaN
3.2    NaN
3.3    NaN
3.4    NaN
3.5    3.5
3.6    NaN
3.7    NaN
3.8    NaN
I'm running python 3.8.5 on Windows 10.
Pandas version is 1.4.07
Numpy version is 1.22.1
Does anyone know why this happens? Is it a known or new bug, and has it been fixed in a newer version of Python, pandas, or NumPy?
Thanks
Good question.
The answer is that np.arange(3,5,0.1) creates a value that is not exactly 3.5: it is 3.5000000000000004. In contrast, np.arange(0,5,0.1) does create a 3.5 that is exactly 3.5, and np.arange(3,5,0.5), which built the original index, also generates an exact 3.5. The labels therefore only line up in the second case.
pd.Index(np.arange(3,5,0.1))
Float64Index([ 3.0, 3.1, 3.2,
3.3000000000000003, 3.4000000000000004, 3.5000000000000004,
3.6000000000000005, 3.7000000000000006, 3.8000000000000007,
3.900000000000001, 4.000000000000001, 4.100000000000001,
4.200000000000001, 4.300000000000001, 4.400000000000001,
4.500000000000002, 4.600000000000001, 4.700000000000001,
4.800000000000002, 4.900000000000002],
dtype='float64')
and
pd.Index(np.arange(0,5,0.1))
Float64Index([ 0.0, 0.1, 0.2,
0.30000000000000004, 0.4, 0.5,
0.6000000000000001, 0.7000000000000001, 0.8,
0.9, 1.0, 1.1,
1.2000000000000002, 1.3, 1.4000000000000001,
1.5, 1.6, 1.7000000000000002,
1.8, 1.9000000000000001, 2.0,
2.1, 2.2, 2.3000000000000003,
2.4000000000000004, 2.5, 2.6,
2.7, 2.8000000000000003, 2.9000000000000004,
3.0, 3.1, 3.2,
3.3000000000000003, 3.4000000000000004, 3.5,
3.6, 3.7, 3.8000000000000003,
3.9000000000000004, 4.0, 4.1000000000000005,
4.2, 4.3, 4.4,
4.5, 4.6000000000000005, 4.7,
4.800000000000001, 4.9],
dtype='float64')
and
pd.Index(np.arange(3,5,0.5))
Float64Index([3.0, 3.5, 4.0, 4.5], dtype='float64')
This is definitely related to Numpy:
np.arange(3,5,0.1)[5]
3.5000000000000004
and
np.arange(3,5,0.1)[5] == 3.5
False
This situation is documented in the Numpy arange doc:
https://numpy.org/doc/stable/reference/generated/numpy.arange.html
The length of the output might not be numerically stable.
Another stability issue is due to the internal implementation of
numpy.arange. The actual step value used to populate the array is
dtype(start + step) - dtype(start) and not step. Precision loss can
occur here, due to casting or due to using floating points when start
is much larger than step. This can lead to unexpected behaviour.
It looks like np.linspace might be able to help you out here:
pd.Index(np.linspace(3,5,num=21))
Float64Index([3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2,
4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0],
dtype='float64')
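Applying that to the original problem, here is a sketch of the fix: reindex with np.linspace instead of np.arange (21 points gives a 0.1 step over [3, 5], and 3.5 comes out exact):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(3, 5, 0.5), 'B': np.arange(3, 5, 0.5)})
df.set_index('A', inplace=True)

# linspace hits 3.5 (and 4.0, 4.5) exactly, so existing rows survive the reindex.
result = df.reindex(np.linspace(3, 5, num=21))
print(result.loc[3.5, 'B'])  # 3.5, not NaN
```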
Related
Consider a CSV file:
z, a, error, b, error
cm, kg, dl , kg, dl
1.0 , 2.0, 3.0, 4.0, 5.0
1.1 , 2.1, 3.1, 4.1, 5.1
1.2 , 2.2, 3.2, 4.2, 5.2
The first line tells us what each variable is. The second line describes something about the data, namely the units of each variable. One way would be to ignore the second line, which is what I am currently doing.
Is there a more consistent way of handling this than ignoring the second line?
There is! You can tell pandas that your csv contains more than one header row.
header : int, list of int, None, default ‘infer’
Row number(s) to use as the column names, and the start of the data. [...] The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3] [...] (pandas documentation on read_csv)
Input csv
z,a,error,b,error
cm,kg,dl,kg,dl
1.0,2.0,3.0,4.0,5.0
1.1,2.1,3.1,4.1,5.1
1.2,2.2,3.2,4.2,5.2
Open it
df = pd.read_csv(path_to_csv, header=[0,1])
Your Dataframe
z a error b error
cm kg dl kg dl.1
0 1.0 2.0 3.0 4.0 5.0
1 1.1 2.1 3.1 4.1 5.1
2 1.2 2.2 3.2 4.2 5.2
You can now easily access the columns and rows.
Result of df["z"]
cm
0 1.0
1 1.1
2 1.2
Result of df.loc[1, "z"]
cm 1.1
Name: 1, dtype: float64
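If you later want plain single-level column names, you can drop the units level of the column MultiIndex once it is no longer needed. A sketch (the CSV is inlined via io.StringIO only to keep the example self-contained):

```python
import io
import pandas as pd

csv_text = """z,a,error,b,error
cm,kg,dl,kg,dl
1.0,2.0,3.0,4.0,5.0
1.1,2.1,3.1,4.1,5.1
1.2,2.2,3.2,4.2,5.2
"""

df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Level 0 holds the variable names, level 1 the units; keep only level 0.
flat = df.copy()
flat.columns = flat.columns.droplevel(1)
print(flat["z"].tolist())  # [1.0, 1.1, 1.2]
```

Note that after dropping the level the two "error" columns share a name, so flat["error"] returns a two-column DataFrame rather than a Series.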
I am writing a deferred acceptance algorithm for doctors and hospitals, but before getting there I need my dictionaries to be presented in a correct manner.
Currently, I have a dictionary of doctors containing a nested dictionary with their rankings of hospitals:
{'Doctor_7': {'Hospital_6': 4.0, 'Hospital_3': 8.0, 'Hospital_1': 10.0, 'Hospital_8': 1.0, 'Hospital_2': 9.0, 'Hospital_10': 5.5, 'Hospital_5': 5.5, 'Hospital_7': 2.0, 'Hospital_4': 7.0, 'Hospital_9': 3.0}}
Here 'Hospital_6' indicates the hospital and 4.0 indicates its ranking by this specific doctor (4 out of 10 in this case)
It is represented in this form because of the DataFrame from which I made the dictionary. However, I want the placement of 'Hospital_6' and 4.0 to switch: I want 4.0 to be a key and 'Hospital_6' to be its value (in the nested dictionary).
However, I do not quite know how to switch these two. If anyone could help me, that would be extremely appreciated!
You can use a dict comprehension to achieve this:
dict_ ={'Doctor_7': {'Hospital_6': 4.0, 'Hospital_3': 8.0, 'Hospital_1': 10.0, 'Hospital_8': 1.0, 'Hospital_2': 9.0, 'Hospital_10': 5.5, 'Hospital_5': 5.5, 'Hospital_7': 2.0, 'Hospital_4': 7.0, 'Hospital_9': 3.0 }}
new_dict = {key:{v:k for k,v in value.items()} for key, value in dict_.items()}
print(new_dict)
To learn more, see the Python documentation on dict comprehensions.
NOTE: It will overwrite duplicate keys (which were values in the previous dict). If two hospitals have the same rating, you will get only one of them.
Output:
{'Doctor_7': {4.0: 'Hospital_6',
8.0: 'Hospital_3',
10.0: 'Hospital_1',
1.0: 'Hospital_8',
9.0: 'Hospital_2',
5.5: 'Hospital_5',
2.0: 'Hospital_7',
7.0: 'Hospital_4',
3.0: 'Hospital_9'}}
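The duplicate-key caveat is easy to see in isolation: a dict comprehension keeps the last writer for each key, so the earlier hospital is silently dropped.

```python
# Two hospitals share the rating 5.5; inverting keeps only the last one seen.
ranks = {'Hospital_10': 5.5, 'Hospital_5': 5.5}
inverted = {rating: hospital for hospital, rating in ranks.items()}
print(inverted)  # {5.5: 'Hospital_5'}
```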
from collections import OrderedDict

old_dict = {'Doctor_7': {'Hospital_6': 4.0, 'Hospital_3': 8.0, 'Hospital_1': 10.0, 'Hospital_8': 1.0, 'Hospital_2': 9.0, 'Hospital_10': 5.5, 'Hospital_5': 5.5, 'Hospital_7': 2.0, 'Hospital_4': 7.0, 'Hospital_9': 3.0}}
new_dict = {doctor: OrderedDict(sorted(((value, hospital) for hospital, value in values.items()),
key=lambda p: p[0]))
for doctor, values in old_dict.items()}
Outputs
{'Doctor_7': OrderedDict([(1.0, 'Hospital_8'),
(2.0, 'Hospital_7'),
(3.0, 'Hospital_9'),
(4.0, 'Hospital_6'),
(5.5, 'Hospital_5'),
(7.0, 'Hospital_4'),
(8.0, 'Hospital_3'),
(9.0, 'Hospital_2'),
(10.0, 'Hospital_1')])}
Since the two solutions so far neglect taking care of duplicate keys (the same rating given to multiple hospitals), here is a solution that does.
It has the disadvantage that every rating points to a list of hospitals with that rating, instead of to the name directly, even if that list has a length of one.
from collections import defaultdict
d = {'Doctor_7': {'Hospital_6': 4.0,
'Hospital_3': 8.0,
'Hospital_1': 10.0,
'Hospital_8': 1.0,
'Hospital_2': 9.0,
'Hospital_10': 5.5,
'Hospital_5': 5.5,
'Hospital_7': 2.0,
'Hospital_4': 7.0,
'Hospital_9': 3.0}}
new_d = {}
for doctor, ratings in d.items():
    ratings_inverse = defaultdict(list)
    for hospital, rating in ratings.items():
        ratings_inverse[rating].append(hospital)
    new_d[doctor] = dict(ratings_inverse)
print(new_d)
# {'Doctor_7': {1.0: ['Hospital_8'],
# 2.0: ['Hospital_7'],
# 3.0: ['Hospital_9'],
# 4.0: ['Hospital_6'],
# 5.5: ['Hospital_10', 'Hospital_5'],
# 7.0: ['Hospital_4'],
# 8.0: ['Hospital_3'],
# 9.0: ['Hospital_2'],
# 10.0: ['Hospital_1']}}
But since you mention a dataframe, if this is a pandas.DataFrame that looked like this:
# Doctor_1 Doctor_7
# Hospital_1 1.0 10.0
# Hospital_10 8.0 5.5
# Hospital_2 3.0 9.0
# Hospital_3 10.0 8.0
# Hospital_4 6.0 7.0
# Hospital_5 8.0 5.5
# Hospital_6 4.0 4.0
# Hospital_7 4.0 2.0
# Hospital_8 9.0 1.0
# Hospital_9 3.0 3.0
You can do something like this:
df.apply(lambda col: col.reset_index()\
.groupby(col.name)["index"]\
.apply(lambda x: x.tolist()))
# Doctor_1 Doctor_7
# 1.0 [Hospital_1] [Hospital_8]
# 2.0 NaN [Hospital_7]
# 3.0 [Hospital_2, Hospital_9] [Hospital_9]
# 4.0 [Hospital_6, Hospital_7] [Hospital_6]
# 5.5 NaN [Hospital_10, Hospital_5]
# 6.0 [Hospital_4] NaN
# 7.0 NaN [Hospital_4]
# 8.0 [Hospital_10, Hospital_5] [Hospital_3]
# 9.0 [Hospital_8] [Hospital_2]
# 10.0 [Hospital_3] [Hospital_1]
I have an issue with numpy linspace
import numpy as np
temp = np.linspace(1,2,11)
for t in temp:
    print(t)
This return :
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7000000000000002
1.8
1.9
2.0
The 1.7 value looks definitely wrong.
It seems related to this issue https://github.com/numpy/numpy/issues/8909
Has anybody ever had such a problem with numpy.linspace? Is it a known issue?
François
This has nothing to do with numpy; consider:
>>> temp = np.linspace(1,2,11)
>>> temp
array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. ])
>>> # ^ look, numpy displays it fine
>>> for t in temp:
...     print(t)
...
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7000000000000002
1.8
1.9
2.0
The "issue" is with how computers represent floats in general. See: https://docs.python.org/3/tutorial/floatingpoint.html.
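If the long repr is the only concern, formatting the values on output hides it. A sketch (the g format spec rounds to six significant digits by default):

```python
import numpy as np

temp = np.linspace(1, 2, 11)
# print(t) shows the shortest repr that round-trips the stored double,
# which for temp[7] is 1.7000000000000002; a format spec tidies it up.
for t in temp:
    print(f"{t:g}")  # 1, 1.1, ..., 1.7, ..., 2
```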
My Python 3.5.2 output in the terminal (on a Mac) is limited to a width of about 80 characters, even if I increase the size of the terminal window.
This narrow width causes a bunch of line breaks when outputting long arrays, which is really a hassle. How do I tell Python to use the full command-line window width?
For the record, I am not seeing this problem in any other program; for instance, my C++ output looks just fine.
For numpy, it turns out you can enable the full output by setting
np.set_printoptions(suppress=True, linewidth=np.nan, threshold=np.nan)
(newer NumPy versions reject np.nan for threshold; sys.maxsize works instead).
In Python 3.7 and above, you can use
from shutil import get_terminal_size
pd.set_option('display.width', get_terminal_size()[0])
I have the same problem while using pandas, so if this is what you are trying to solve, I fixed mine by doing
pd.set_option('display.width', pd.util.terminal.get_terminal_size()[0])
Note that pd.util.terminal was removed in later pandas versions; the shutil.get_terminal_size approach above is the current equivalent.
Default output of a 2x15 matrix is broken:
a.T
array([[ 0.2, -1.4, -0.8, 1.3, -1.5, -1.4, 0.6, -1.5, 0.4, -0.9, 0.3,
1.1, 0.5, -0.3, 1.1],
[ 1.3, -1.2, 1.6, -1.4, 0.9, -1.2, -1.9, 0.9, 1.8, -1.8, 1.7,
-1.3, 1.4, -1.7, -1.3]])
The output is fixed using numpy's set_printoptions():
import sys
np.set_printoptions(suppress=True,linewidth=sys.maxsize,threshold=sys.maxsize)
a.T
[[ 0.2 -1.4 -0.8 1.3 -1.5 -1.4 0.6 -1.5 0.4 -0.9 0.3 1.1 0.5 -0.3 1.1]
[ 1.3 -1.2 1.6 -1.4 0.9 -1.2 -1.9 0.9 1.8 -1.8 1.7 -1.3 1.4 -1.7 -1.3]]
System and numpy versions:
sys.version = 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
numpy.__version__ = 1.18.5
I have a text file that consists of points of 4 dimensions each point.
The file is like this:
4.8 3.4 1.6 0.2
4.8 3.0 1.4 0.1
4.3 3.0 1.1 0.1
5.8 4.0 1.2 0.2
5.7 4.4 1.5 0.4
5.4 3.9 1.3 0.4
5.1 3.5 1.4 0.3
I want to read the file and store each line of the file as a separate list, for instance point1 = [4.8, 3.4, 1.6, 0.2].
What I have done so far is :
f = open('points.txt', 'r')
data = f.readlines()
for line in data:
    pList = line.rstrip()
    print(pList)
This prints all the points, but each one is a single string, not a list.
You might find Python's CSV module useful for this:
import csv

with open('points.txt', 'r') as f_input:
    points = list(csv.reader(f_input, delimiter='\t'))

# To convert the strings to floats
points = [[float(x) for x in row] for row in points]
print(points)
This would display the following:
[[4.8, 3.4, 1.6, 0.2], [4.8, 3.0, 1.4, 0.1], [4.3, 3.0, 1.1, 0.1], [5.8, 4.0, 1.2, 0.2], [5.7, 4.4, 1.5, 0.4], [5.4, 3.9, 1.3, 0.4], [5.1, 3.5, 1.4, 0.3]]
Try with:
f = open('points.txt', 'r')
data = f.readlines()
for line in data:
    points = line.split()
    print(points)
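If the goal is one list of floats per line, a minimal sketch along the same lines (the data is inlined via io.StringIO only to keep the example self-contained; with the real file, iterate over open('points.txt') instead):

```python
import io

data = """4.8 3.4 1.6 0.2
4.8 3.0 1.4 0.1
4.3 3.0 1.1 0.1
"""

# Split each whitespace-separated line and convert every field to float.
points = []
for line in io.StringIO(data):
    if line.strip():  # skip any blank lines
        points.append([float(v) for v in line.split()])

print(points[0])  # [4.8, 3.4, 1.6, 0.2]
```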