I have a table:
     -60  -40  -20    0   20   40   60
100  520  440  380  320  280  240  210
110  600  500  430  370  320  280  250
120  670  570  490  420  370  330  290
130  740  630  550  480  420  370  330
140  810  690  600  530  470  410  370
The headers along the top are a wind vector and the first column on the left is a distance. The actual data in the body of the table is a fuel additive value.
I am very new to Pandas and NumPy, so please excuse the simplicity of the question. What I would like to know is: how can I look up a single number in the table using the headers? I have seen it's possible using indexes, but I don't want to use that method if I don't have to.
For example:
I have a wind unit of -60 and a distance of 120, so I need to retrieve the number 670. How can I use NumPy or Pandas to do this?
Also, if I have a wind unit of, say, -50 and a distance of 125, is it then possible to interpolate these in a simple way?
EDIT:
Here is what I've tried so far:
import pandas as pd
df = pd.read_table('fuel_adjustment.txt', delim_whitespace=True, header=0,index_col=0)
print(df.loc[120, -60])
But I get the error:
line 3083, in get_loc raise KeyError(key) from err
KeyError: -60
You can select any cell from existing indices using:
df.loc[120,-60]
However, the indices need to be integers. If they are not, you can fix that with:
df.index = df.index.map(int)
df.columns = df.columns.map(int)
For interpolation, you need to add the empty new rows/columns using reindex, then apply interpolate on each dimension.
(df.reindex(index=sorted(df.index.to_list() + [125]),
            columns=sorted(df.columns.to_list() + [-50]))
   .interpolate(axis=1, method='index')
   .interpolate(method='index')
)
Output:
       -60    -50    -40    -20      0     20     40     60
100  520.0  480.0  440.0  380.0  320.0  280.0  240.0  210.0
110  600.0  550.0  500.0  430.0  370.0  320.0  280.0  250.0
120  670.0  620.0  570.0  490.0  420.0  370.0  330.0  290.0
125  705.0  652.5  600.0  520.0  450.0  395.0  350.0  310.0
130  740.0  685.0  630.0  550.0  480.0  420.0  370.0  330.0
140  810.0  750.0  690.0  600.0  530.0  470.0  410.0  370.0
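The interpolated value can then be read off the reindexed frame in the usual way; checking against the output above, this should give:
df.loc[125, -50]   # 652.5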
You can simply use df.loc for that purpose:
df.loc[120,-60]
You need to check the data type of the index and columns. That is most likely why df.loc[120, -60] failed.
Try:
df.loc[120, "-60"]
To validate the data type, you may call:
>>> df.index
Int64Index([100, 110, 120, 130, 140], dtype='int64')
>>> df.columns
Index(['-60', '-40', '-20', '0', '20', '40', '60'], dtype='object')
If you want to turn the column headers into int64, you need to convert them to numeric:
df.columns = pd.to_numeric(df.columns)
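Once the labels are numeric, the lookup from the question should work as expected (670 for the table above):
df.loc[120, -60]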
For interpolation, I think the only way is to create the nonexistent index and column first; then you can get that value. However, this will grow your df rapidly if it is queried frequently.
First, add the nonexistent index and column.
Then interpolate row-wise and column-wise.
Finally, get your value.
new_index = df.index.to_list()
new_index.append(125)
new_index.sort()
new_col = df.columns.to_list()
new_col.append(-50)
new_col.sort()
df = df.reindex(index=new_index, columns=new_col)
df = df.interpolate(axis=1).interpolate()
print(df.loc[125, -50])
Another way is to write a function that fetches the relevant neighbouring numbers and returns the interpolated result, as sketched below.
Find the index and column labels just above and below your target.
Fetch the four numbers.
Interpolate the column and index values in sequence.
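A minimal sketch of that idea, assuming the numeric index/columns from earlier (the lookup name and the use of numpy.interp are my own choices for illustration):
import numpy as np

def lookup(df, dist, wind):
    # Interpolate along the wind (column) axis for every row, then along the
    # distance (index) axis. For piecewise-linear data this is equivalent to
    # using only the four bracketing numbers.
    cols = df.columns.to_numpy(dtype=float)
    by_row = [np.interp(wind, cols, df.loc[i].to_numpy(dtype=float)) for i in df.index]
    return np.interp(dist, df.index.to_numpy(dtype=float), by_row)

print(lookup(df, 125, -50))   # 652.5 for the table above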
Related
When doing a groupby rolling in pandas with dtype float64, the sum of zeros becomes an arbitrarily small float when the number of groups is large. For example,
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame({'a': (np.random.random(800)*1e5+1e5).tolist() + [0.0]*800, 'b': list(range(80))*20})
a = df.groupby('b').rolling(5, min_periods=1).agg({'a': 'sum'})
The first line generates a dataframe with 2 columns a and b.
Column a has 800 random numbers between 1e5 and 2e5 and 800 zeros.
Column b assigns these to 80 groups.
For example for group 79, the df looks like below:
a b
79 158742.001924 79
159 115045.502837 79
239 171582.695286 79
319 181072.123361 79
399 194672.826961 79
479 130100.794308 79
559 169784.165605 79
639 132752.405585 79
719 162355.180105 79
799 148140.045915 79
879 0.000000 79
959 0.000000 79
1039 0.000000 79
1119 0.000000 79
1199 0.000000 79
1279 0.000000 79
1359 0.000000 79
1439 0.000000 79
1519 0.000000 79
1599 0.000000 79
The second line calculates the rolling sum of 5 for column a for each group.
One would expect the rolling sum to be zeros for the last few entries in each group, e.g. group 79.
However, arbitrarily small floats are returned, e.g. -5.820766e-11 for group 79 below:
a
79 1.587420e+05
159 2.737875e+05
239 4.453702e+05
319 6.264423e+05
399 8.211152e+05
479 7.924739e+05
559 8.472126e+05
639 8.083823e+05
719 7.896654e+05
799 7.431326e+05
879 6.130318e+05
959 4.432476e+05
1039 3.104952e+05
1119 1.481400e+05
1199 -5.820766e-11
1279 -5.820766e-11
1359 -5.820766e-11
1439 -5.820766e-11
1519 -5.820766e-11
1599 -5.820766e-11
If we decrease the number of groups to 20, the issue disappears. E.g.
df['b'] = list(range(20))*80
a = df.groupby('b').rolling(5, min_periods=1).agg({'a': 'sum'})
This yields (for group 19, since there are only 20 groups from 0-19)
a
19 165083.125668
39 359750.793592
59 485563.758520
79 644305.760443
99 837370.199660
...
1519 0.000000
1539 0.000000
1559 0.000000
1579 0.000000
1599 0.000000
[80 rows x 1 columns]
This was only tested on pandas 1.2.5 / Python 3.7.9 / Windows 10. You might have to increase the number of groups for this to show up, depending on your machine's memory.
In my application, I can't really control the number of groups. I can change the dtype to float32 and the issue goes away, but this causes me to lose precision for large numbers.
Any idea what's causing this and how to resolve it, besides using float32?
TLDR: this is a side effect of optimization; the workaround is to use a non-pandas sum.
The reason is that pandas tries to optimize. Naive rolling window functions take O(n*w) time. However, if we know the function is a sum, we can subtract the element going out of the window and add the one coming into it. This approach no longer depends on the window size and is always O(n).
The caveat is that we now get floating point precision side effects, manifesting themselves just as you've described: subtracting a value that was added earlier does not always bring the running sum back to exactly zero.
Sources: Python code calling window aggregation, Cython implementation of the rolling sum
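A minimal sketch of the non-pandas-sum workaround, assuming the slower speed is acceptable: route the aggregation through the generic apply path, which recomputes every window from scratch (O(n*w)) and so avoids the running-sum rounding error.
import numpy as np

a = (df.groupby('b')
       .rolling(5, min_periods=1)['a']
       .apply(np.sum, raw=True))   # trailing zero windows now sum to exactly 0.0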
I have a series object with the following shape:
H AB HBP SF G 2B 3B HR BAVG
playerID
ruthba01 2873 8398 43.0 0.0 2503 506 136 714 0.342105
willite01 2654 7706 39.0 20.0 2292 525 71 521 0.344407
gehrilo01 2721 8001 45.0 0.0 2164 534 163 493 0.340082
hornsro01 2930 8173 48.0 0.0 2259 541 169 301 0.358497
I am trying to extract the first column (playerIDs) and convert it to a list. However, list = df['playerID'] gives a KeyError, and iloc[0] returns the H column. FYI, it's a series object.
Thanks
You have playerID in the index:
df = df.reset_index()
Then you can call your .iloc and df['playerID'].
Or, without resetting the index:
l = df.index.tolist()
If that's the index, you just need df.index, I believe.
Looks like the first column is set as the index. To turn it into a list, use df.index.tolist().
I have a table:
     -60    -40    -20      0     20     40     60
100  520.0  440.0  380.0  320.0  280.0  240.0  210.0
110  600.0  500.0  430.0  370.0  320.0  280.0  250.0
I add the column to the dataframe like so:
wind_comp = -35
if int(wind_comp) not in df.columns:
    new_col = df.columns.to_list()
    new_col.append(int(wind_comp))
    new_col.sort()
    df = df.reindex(columns=new_col)
Which returns this:
-60 -40 -35 -20 0 20 40 60
100 520 440 NaN 380 320 280 240 210
110 600 500 NaN 430 370 320 280 250
I interpolate using pandas interpolate() method like this:
df.interpolate(axis=1).interpolate('linear')
If I add a new column of, say, -35, it just takes the midpoint of the -40 and -20 columns and doesn't get any more accurate. So it returns this:
-60 -40 -35 -20 0 20 40 60
100 520.0 440.0 410.0 380.0 320.0 280.0 240.0 210.0
110 600.0 500.0 465.0 430.0 370.0 320.0 280.0 250.0
Obviously this row would be correct if I had added a column of -30, but I didn't. I need it to give back more accuracy. I want to be able to enter -13, for example, and have it give me back the exact interpolated number.
How can I do this? Am I doing something wrong in my code, or am I missing something? Please help.
EDIT:
It seems that pandas interpolate() will only take the midpoint of the two numbers the new column is placed between and doesn't take the headers into account.
I can't find anything in scipy that really applies to working with a table like this, but maybe I'm interpreting it wrong. Is it possible to use that, or something different?
Here's an example of interp1d with your values. Now, I'm glossing over a huge number of details here, like how to get values from your DataFrame into a list like this. In many cases, it is easier to do manipulation like this with lists before it becomes a DataFrame.
import scipy.interpolate
x = [ -60, -40, -20, 0 , 20, 40, 60]
y1 = [ 520.0, 440.0, 380.0, 320.0, 280.0, 240.0, 210.0]
y2 = [ 600.0, 500.0, 430.0, 370.0, 320.0, 280.0, 250.0]
f1 = scipy.interpolate.interp1d(x,y1)
f2 = scipy.interpolate.interp1d(x,y2)
print(-35, f1(-35))
print(-35, f2(-35))
Output:
-35 425.0
-35 482.5
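For what it's worth, a similar distance-weighted result can be had without scipy by keeping the column labels numeric and telling interpolate to use them (method='index', as in the first answer above); the default method only interpolates by position, which is why the new column got the midpoint. A small sketch:
df.columns = pd.to_numeric(df.columns)   # only needed if the headers are strings
(df.reindex(columns=sorted(df.columns.to_list() + [-35]))
   .interpolate(axis=1, method='index'))
This should give 425.0 in the -35 column of the first row, matching interp1d above.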
I think I am overthinking this. I am trying to copy existing pandas DataFrame columns and values and make rolling averages, without overwriting the original data. I am iterating over the columns, taking the columns and values, and making a rolling 7-day MA as a new column with the suffix _ma, as a copy alongside the original. I want to compare the existing data to the 7-day MA and see how many standard deviations the data is from the 7-day MA, which I can figure out. I am just trying to save the MA data as a new data frame.
I have
for column in original_data[ma_columns]:
    ma_df = pd.DataFrame(original_data[ma_columns].rolling(window=7).mean(), columns = str(column)+'_ma')
and getting the error : Index(...) must be called with a collection of some kind, 'Carrier_AcctPswd_ma' was passed
But if I am iterating with
for column in original_data[ma_columns]:
    print('Column Name : ', str(column)+'_ma')
    print('Contents : ', original_data[ma_columns].rolling(window=7).mean())
I get the data I need:
My issue is just saving this as a new data frame, which I can concatenate to the old, and then do my analysis.
EDIT
I have now been able to make a bunch of data frames, but I want to concatenate them together and this is where the issue is:
for column in original_data[ma_columns]:
    MA_data = pd.DataFrame(original_data[column].rolling(window=7).mean())
    for i in MA_data:
        new = pd.concat(i)
        print(i)
<ipython-input-75-7c5e5fa775b3> in <module>
17 # print(type(MA_data))
18 for i in MA_data:
---> 19 new = pd.concat(i)
20 print(i)
21
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
279 verify_integrity=verify_integrity,
280 copy=copy,
--> 281 sort=sort,
282 )
283
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
307 "first argument must be an iterable of pandas "
308 "objects, you passed an object of type "
--> 309 '"{name}"'.format(name=type(objs).__name__)
310 )
311
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "str"
You should iterate over column names and assign the resulting pandas series as a new named column, for example:
import pandas as pd
original_data = pd.DataFrame({'A': range(100), 'B': range(100, 200)})
ma_columns = ['A', 'B']
for column in ma_columns:
    new_column = column + '_ma'
    original_data[new_column] = pd.DataFrame(original_data[column].rolling(window=7).mean())
print(original_data)
Output dataframe:
A B A_ma B_ma
0 0 100 NaN NaN
1 1 101 NaN NaN
2 2 102 NaN NaN
3 3 103 NaN NaN
4 4 104 NaN NaN
.. .. ... ... ...
95 95 195 92.0 192.0
96 96 196 93.0 193.0
97 97 197 94.0 194.0
98 98 198 95.0 195.0
99 99 199 96.0 196.0
[100 rows x 4 columns]
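As a side note, if the goal is a separate frame of MA columns concatenated back onto the original, the per-column loop can be skipped entirely; a small sketch (add_suffix is standard pandas, the variable names are mine):
ma_df = original_data[ma_columns].rolling(window=7).mean().add_suffix('_ma')
result = pd.concat([original_data, ma_df], axis=1)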
My dataframe has a column called dir with several values, and I want to know how many of the values have a count that passes a certain point. For example:
df['dir'].value_counts().sort_index()
It returns a Series
0 855
20 881
40 2786
70 3777
90 3964
100 4
110 2115
130 3040
140 1
160 1697
180 1734
190 3
200 618
210 3
220 1451
250 895
270 2167
280 1
290 1643
300 1
310 1894
330 1
340 965
350 1
Name: dir, dtype: int64
Here, I want to know how many values have a count that passes 500. In this case, it's all of them except 100, 140, 190, 210, 280, 300, 330, 350.
How can I do that?
I can get away with df['dir'].value_counts()[df['dir'].value_counts() > 500]
(df['dir'].value_counts() > 500).sum()
This gets the value counts and returns them as a Series of truth values. The parentheses treat the whole expression as a Series, and .sum() counts the True values as 1 and the False values as 0.
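Putting the two together, something like the following gives both the qualifying counts and how many there are (16 for the counts shown above):
counts = df['dir'].value_counts()
over_500 = counts[counts > 500]        # the directions whose count exceeds 500
n_over_500 = (counts > 500).sum()      # how many there are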