I have a numpy.ndarray like the one below. It's the output from talib.RSI, so it's of type numpy.ndarray. I want to get the list of rolling(window=3).max() and rolling(window=3).min() values.
How do I do that?
[ nan nan nan nan nan
nan nan nan nan nan
nan nan nan nan 56.50118203
60.05461743 56.99068148 55.70899949 59.2299361 64.19044898
60.62186599 53.96346826 44.06538636 52.04519976 51.32884016
58.65240379 60.44789401 58.94743634 59.75308787 53.56534397
54.22091468 47.22502341 51.5425848 50.0923126 49.80608264
45.69087847 50.07778871 54.21701441 58.79268406 63.59307774
66.08195696 65.49255218 65.11035657 68.47403716 70.70530564
73.21955929 76.57474822 65.89852612 66.51497688 72.42658468
73.80944844 69.56561001]
If you can afford to add a new dependency, I would rather do that with Pandas:
import numpy
import pandas
x = numpy.array([0, 1, 2, 3, 4])
s = pandas.Series(x)
print(s.rolling(3).min())
print(s.rolling(3).max())
print(s.rolling(3).mean())
print(s.rolling(3).std())
Note that converting your NumPy array to a Pandas series does not create a copy of the array, as Pandas uses NumPy arrays internally for its series.
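A quick way to check the no-copy claim (this reflects classic pandas behavior; under copy-on-write, the default in pandas 3.0, a write through the original array is no longer guaranteed to be visible):
x[0] = 99
print(s.iloc[0])  # 99 -- the Series sees the change, so no copy was made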
You can use np.lib.stride_tricks.as_strided:
# a smaller example
import numpy as np
import numpy.random as npr

npr.seed(123)
arr = npr.randn(10)
arr[:4] = np.nan
# 8 overlapping windows of length 3; strides are given in bytes (8 per float64 element)
windows = np.lib.stride_tricks.as_strided(arr, shape=(8, 3),
                                          strides=(arr.strides[0], arr.strides[0]))
print(windows.max(axis=1))
print(windows.sum(axis=1))
[ nan nan nan nan 1.65143654 1.65143654
1.26593626 1.26593626]
[ nan nan nan nan -1.35384296 -1.20415534
-1.58965561 -0.02971677]
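If your NumPy is recent enough (1.20+), sliding_window_view computes the strides for you and is safer than as_strided:
windows = np.lib.stride_tricks.sliding_window_view(arr, 3)  # shape (8, 3)
print(windows.max(axis=1))
print(windows.min(axis=1))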
I want to create a dataframe from census data. I want to calculate the number of people that filed a tax return for each specific earnings group. For now, I wrote this:
census_df = pd.read_csv('../zip code data/19zpallagi.csv')
sub_census_df = census_df[['zipcode', 'agi_stub', 'N02650', 'A02650', 'ELDERLY', 'A07180']].copy()
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
for i, column_name in zip(range(1, 7), num_of_returns):
    sub_census_df[column_name] = sub_census_df[sub_census_df['agi_stub'] == i]['N02650']
I have 6 groups attached to each specific zip code. I want to get one row per zip code, with the number of returns for each group appearing just once as a column. I already tried changing the NaNs to 0 and using groupby('zipcode').sum(), but I get 50 million returns summed for zip code 0, where it seems that only around 800k should exist.
Here is the dataframe that I currently get:
zipcode agi_stub N02650 A02650 ELDERLY A07180 Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more Amount_1_25000 Amount_25000_50000 Amount_50000_75000 Amount_75000_100000 Amount_100000_200000 Amount_200000_more
0 0 1 778140.0 10311099.0 144610.0 2076.0 778140.0 NaN NaN NaN NaN NaN 10311099.0 NaN NaN NaN NaN NaN
1 0 2 525940.0 19145621.0 113810.0 17784.0 NaN 525940.0 NaN NaN NaN NaN NaN 19145621.0 NaN NaN NaN NaN
2 0 3 285700.0 17690402.0 82410.0 9521.0 NaN NaN 285700.0 NaN NaN NaN NaN NaN 17690402.0 NaN NaN NaN
3 0 4 179070.0 15670456.0 57970.0 8072.0 NaN NaN NaN 179070.0 NaN NaN NaN NaN NaN 15670456.0 NaN NaN
4 0 5 257010.0 35286228.0 85030.0 14872.0 NaN NaN NaN NaN 257010.0 NaN NaN NaN NaN NaN 35286228.0 NaN
And here is what I want to get:
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 850.0
Here is one way to do it: use groupby and sum the desired columns.
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
df.groupby('zipcode', as_index=False)[num_of_returns].sum()
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 0.0
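If you would rather keep NaN than 0.0 for groups where every value is missing (like Number_of_returns_200000_more above), sum accepts a min_count argument:
df.groupby('zipcode', as_index=False)[num_of_returns].sum(min_count=1)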
This question needs more information to actually give a proper answer. For example, you leave out what is meant by certain columns in your data frame:
- `N1: Number of returns`
- `agi_stub: Size of adjusted gross income`
According to the IRS, this has the following levels:
Size of adjusted gross income
0 = No AGI Stub
1 = 'Under $1'
2 = '$1 under $10,000'
3 = '$10,000 under $25,000'
4 = '$25,000 under $50,000'
5 = '$50,000 under $75,000'
6 = '$75,000 under $100,000'
7 = '$100,000 under $200,000'
8 = '$200,000 under $500,000'
9 = '$500,000 under $1,000,000'
10 = '$1,000,000 or more'
I got the above from https://www.irs.gov/pub/irs-soi/16incmdocguide.doc
With this information, I think what you want to find is the number of people who filed a tax return for each of the income levels of agi_stub. If that is what you mean, then this can be achieved by:
import pandas as pd
data = pd.read_csv("./data/19zpallagi.csv")
## select only the desired columns
data = data[['zipcode', 'agi_stub', 'N1']]
## solution to your problem?
df = data.pivot_table(
    index='zipcode',
    values='N1',
    columns='agi_stub',
    aggfunc=['sum']
)
## bit of cleaning up.
PREFIX = 'agi_stub_level_'
df.columns = [PREFIX + level for level in df.columns.get_level_values(1).astype(str)]
Here's the output.
In [77]: df
Out[77]:
agi_stub_level_1 agi_stub_level_2 ... agi_stub_level_5 agi_stub_level_6
zipcode ...
0 50061850.0 37566510.0 ... 21938920.0 8859370.0
1001 2550.0 2230.0 ... 1420.0 230.0
1002 2850.0 1830.0 ... 1840.0 990.0
1005 650.0 570.0 ... 450.0 60.0
1007 1980.0 1530.0 ... 1830.0 460.0
... ... ... ... ... ...
99827 470.0 360.0 ... 170.0 40.0
99833 550.0 380.0 ... 290.0 80.0
99835 1250.0 1130.0 ... 730.0 190.0
99901 1960.0 1520.0 ... 1030.0 290.0
99999 868450.0 644160.0 ... 319880.0 142960.0
[27595 rows x 6 columns]
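If you want zipcode back as a regular column rather than the index, as in the desired output, reset the index afterwards:
df = df.reset_index()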
I have a tiff with lakes, which I converted into a 2D array. I would like to keep the outline of the lakes in a 2D array.
import rasterio
import numpy as np
with rasterio.open('myfile.tif') as dtm:
    array = dtm.read(1)

array[array > 0] = 1
array = array.astype(float)
array[array == 0] = np.nan
My array now looks like this; a lake can be seen in the upper right corner:
[[ nan nan nan ... 2888.001 **2877.458 2867.5798**]
[ nan nan nan ... 2890.188 **2879.2876 2869.0415**]
[ nan nan nan ... 2892.2622 2880.9907 2870.4985]
...
[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]]
To keep the outline of the lakes, I have to set all values to NaN which are NOT located next to a NaN (the values next to a NaN are marked in bold).
I have tried:
array[1:-1, 1:-1] = np.nan
However, this converts ALL inner values of the entire array to nan, not just the inner values of the lakes.
If you know of a completely different way to keep the outline of the lakes (maybe with rasterio), I would also be thankful.
I hope I made clear what I mean by the inner values of the lakes.
Tim
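A minimal sketch of one possible approach, assuming scipy is available: dilate the NaN mask and keep only the non-NaN cells it touches. Here array is the array from above, and the 3x3 structuring element makes diagonal neighbours count as "next to" as well:
import numpy as np
from scipy.ndimage import binary_dilation

nan_mask = np.isnan(array)
# non-NaN cells with at least one NaN neighbour form the outline
touches_nan = binary_dilation(nan_mask, structure=np.ones((3, 3), dtype=bool))
outline = np.where(touches_nan & ~nan_mask, array, np.nan)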
I have a pandas dataframe called 'result' containing Longitude, Latitude and Production values. The dataframe looks like the following. For each pair of latitude and longitude there is one production value, therefore there are many NaN values.
Latitude 0.00000 32.00057 32.00078 ... 32.92114 32.98220 33.11217
Longitude ...
-104.5213 NaN NaN NaN ... NaN NaN NaN
-104.4745 NaN NaN NaN ... NaN NaN NaN
-104.4679 NaN NaN NaN ... NaN NaN NaN
-104.4678 NaN NaN NaN ... NaN NaN NaN
-104.4660 NaN NaN NaN ... NaN NaN NaN
This is my code:
plt.rcParams['figure.figsize'] = (12.0, 10.0)
plt.rcParams['font.family'] = "serif"
plt.figure(figsize=(14,7))
plt.title('Heatmap based on ANN results')
sns.heatmap(result)
The heatmap plot looks like this [first image], but I want it to look more like this [second image]. How do I adjust my code so it looks like the second one?
I made a quick and dirty example of how you can smooth data in a numpy array. It should be directly applicable to pandas dataframes as well.
First I present the code, then go through it:
# Some needed packages
import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
from scipy.ndimage import gaussian_filter
np.random.seed(42)
# init an array with a lot of NaNs to imitate the OP's data
non_zero_entries = sparse.random(50, 60)    # sparse COO matrix, ~1% non-zero
sparse_matrix = non_zero_entries.toarray()  # dense ndarray
sparse_matrix[sparse_matrix == 0] = np.nan

# set NaNs to 0 so they can be smoothed
sparse_matrix[np.isnan(sparse_matrix)] = 0

# smooth the matrix
smoothed_matrix = gaussian_filter(sparse_matrix, sigma=5)

# set 0s back to NaN as they will be ignored when plotting
# smoothed_matrix[smoothed_matrix == 0] = np.nan
sparse_matrix[sparse_matrix == 0] = np.nan
# Plot the data
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,
                               sharex=False, sharey=True,
                               figsize=(9, 4))
ax1.matshow(sparse_matrix)
ax1.set_title("Original matrix")
ax2.matshow(smoothed_matrix)
ax2.set_title("Smoothed matrix")
plt.tight_layout()
plt.show()
The code is fairly simple. You can't smooth NaNs, so we have to get rid of them first; I set them to zero, but depending on your field you might want to interpolate them instead. Using gaussian_filter we then smooth the matrix, where sigma controls the width of the kernel.
The plot code yields a side-by-side figure of the original and the smoothed matrix.
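If zero-filling biases your data too much, a sketch of one alternative (assuming a pandas dependency is acceptable) is to interpolate the NaNs before smoothing:
import pandas as pd
# linear interpolation along the rows, filling the edges in both directions;
# rows that are entirely NaN are zero-filled as a fallback
filled = (pd.DataFrame(sparse_matrix)
            .interpolate(axis=1, limit_direction='both')
            .fillna(0.0)
            .to_numpy())
smoothed_matrix = gaussian_filter(filled, sigma=5)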
I have the following data frame:
index_names = ['1c', '1s', '2c', '2s', '2s', '3c', '3s', '4c', '4s']
individual_atom_df = pd.DataFrame(columns=['Q0', 'Q1', 'Q2', 'Q3', 'Q4'], index=index_names)
which returns the following:
Q0 Q1 Q2 Q3 Q4
1c NaN NaN NaN NaN NaN
1s NaN NaN NaN NaN NaN
2c NaN NaN NaN NaN NaN
2s NaN NaN NaN NaN NaN
2s NaN NaN NaN NaN NaN
3c NaN NaN NaN NaN NaN
3s NaN NaN NaN NaN NaN
4c NaN NaN NaN NaN NaN
4s NaN NaN NaN NaN NaN
as expected. The data that fills this data frame are lists contained within a list, whereby each list length varies according to a (2x + 1) rule. Here is an example of the lists:
my_list = [[-1.064525],
[-4e-06, -0.105246, 0.036201],
[0.340138, -6e-06, -2e-06, -0.454872, 0.383145],
[4e-06, -0.208369, -0.482417, -4e-06, 3e-06, -0.105177, -0.097678],
[0.047612,
3.5e-05,
5e-06,
0.734665,
0.979878,
-2.9e-05,
1.5e-05,
0.45498,
-0.005097]]
Each list will occupy a column of this data frame relating to the index of that list, for example:
-1.064525: Q0-1c (because -1.064525 is my_list[0][0] so it occupies Q0)
-4e-06: Q1-1c, -0.105246: Q1-1s, 0.036201: Q1-2c
and so on, until the upper-right triangle of the data frame is full of the my_list values and the lower-left triangle is left as NaN.
I need to iterate through my_list and fill the columns of the data frame (the reason for this is that this isn't the only list of lists; in fact, many lists of lists are contained in a dictionary, see ahead).
dictionary = {'H5': [[0.355421],
[-0.013164, -0.012894, 0.012746],
[0.011902, 0.004148, 0.00579, -0.022556, 0.017715],
[-0.007411, 0.015751, 0.003681, -0.0048, -0.020631, -0.004436, -0.002779],
[-0.012934,
-0.00844,
-0.013543,
0.003076,
0.00371,
-0.008476,
-0.008116,
-0.001628,
0.006953]],
'N1': [[-1.064525],
[-4e-06, -0.105246, 0.036201],
[0.340138, -6e-06, -2e-06, -0.454872, 0.383145],
[4e-06, -0.208369, -0.482417, -4e-06, 3e-06, -0.105177, -0.097678],
[0.047612,
3.5e-05,
5e-06,
0.734665,
0.979878,
-2.9e-05,
1.5e-05,
0.45498,
-0.005097]]}
I'm quite new to data frames and would highly appreciate some help in how to fill the data frame with the content of my_list. This is what I've tried:
for kk in dictionary:
    # define dataframe
    individual_atom_df = pd.DataFrame(columns=['Q0', 'Q1', 'Q2', 'Q3', 'Q4'], index=index_names)
    # loop over the columns Q0, Q1, Q2, ...
    for idx, val in enumerate(individual_atom_df):
        individual_atom_df[val].append(dictionary[kk][idx])
Each data frame generated for each dictionary element will be outputted to a .json file using the following (which will be put at the end of the loop):
coord_string = individual_atom_df.to_string().splitlines()
coord_data = {
    'File origin': file_directory,
    'Error list': error_array,
    'Data': coord_string
}
with open("file_name.json", "w") as coord_json:
    json.dump(coord_data, coord_json, indent=4)
First things first: in your outer loop you are overwriting your dataframe with every iteration. You need to save it in some fashion, maybe in a dictionary defined outside the loop. That said, what you are doing in the loop could be done with something like the following:
import numpy as np
import pandas as pd

# data
l1 = [np.random.rand()]
l2 = [np.random.rand() for i in range(3)]
l3 = [np.random.rand() for i in range(5)]
ll = [l1, l2, l3]
# find max length
maxlen = max(len(i) for i in ll)
# extend shorter arrays by filling with NaN
for col in ll:
    col.extend((maxlen - len(col)) * [np.nan])
# convert to array
arr = np.asarray(ll).T
df = pd.DataFrame(
    arr,
    columns=[f'Q{i}' for i in range(len(ll))],
    index=['1c', '1s', '2c', '2s', '2s']
)
Does this help?
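Applied to your dictionary, a sketch of the outer loop might look like this (collecting the frames in a dict instead of overwriting them; dictionary and index_names are yours, the padding logic is the same as above):
dfs = {}
for key, lists in dictionary.items():
    maxlen = max(len(l) for l in lists)
    padded = [l + [np.nan] * (maxlen - len(l)) for l in lists]
    dfs[key] = pd.DataFrame(np.asarray(padded).T,
                            columns=[f'Q{i}' for i in range(len(lists))],
                            index=index_names)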
multidimensional interpolation with dataframe not working
import pandas as pd
import numpy as np
raw_data = {'CCY_CODE': ['SGD','USD','USD','USD','USD','USD','USD','EUR','EUR','EUR','EUR','EUR','EUR','USD'],
'END_DATE': ['16/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018',
'17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018'],
'STRIKE':[0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05,0.55,0.06,0.065,0.07],
'VOLATILITY':[np.nan,np.nan,0.3424,np.nan,0.2617,0.2414,np.nan,np.nan,0.215,0.212,0.2103,np.nan,0.2092,np.nan]
}
df_volsurface = pd.DataFrame(raw_data,columns = ['CCY_CODE','END_DATE','STRIKE','VOLATILITY'])
df_volsurface['END_DATE'] = pd.to_datetime(df_volsurface['END_DATE'])
df_volsurface.interpolate(method='akima',limit_direction='both')
Output:
   CCY_CODE   END_DATE  STRIKE  VOLATILITY
0       SGD  3/16/2018   0.005         NaN
1       USD  3/17/2018   0.010         NaN
2       USD  3/17/2018   0.015      0.3424
3       USD  3/17/2018   0.020    0.296358
4       USD  3/17/2018   0.025      0.2617
5       USD  3/17/2018   0.030      0.2414
6       USD  3/17/2018   0.035    0.230295
7       EUR  3/17/2018   0.040    0.220911
8       EUR  3/17/2018   0.045       0.215
9       EUR  3/17/2018   0.050       0.212
10      EUR  3/17/2018   0.550      0.2103
11      EUR  3/17/2018   0.060    0.209471
12      EUR  3/17/2018   0.065      0.2092
13      USD  3/17/2018   0.070         NaN
Expected Result:
   CCY_CODE   END_DATE  STRIKE                   VOLATILITY
0       SGD  3/16/2018   0.005                          NaN
1       USD  3/17/2018   0.010  Expected some logical value
2       USD  3/17/2018   0.015                       0.3424
3       USD  3/17/2018   0.020                     0.296358
4       USD  3/17/2018   0.025                       0.2617
5       USD  3/17/2018   0.030                       0.2414
6       USD  3/17/2018   0.035                     0.230295
7       EUR  3/17/2018   0.040                     0.220911
8       EUR  3/17/2018   0.045                        0.215
9       EUR  3/17/2018   0.050                        0.212
10      EUR  3/17/2018   0.550                       0.2103
11      EUR  3/17/2018   0.060                     0.209471
12      EUR  3/17/2018   0.065                       0.2092
13      USD  3/17/2018   0.070  Expected some logical value
The linear interpolation method just copies the last available value to all backward and forward missing values, without considering CCY_CODE:
df_volsurface.interpolate(method='linear',limit_direction='both')
Output:
   CCY_CODE   END_DATE  STRIKE  VOLATILITY
0       SGD  3/16/2018   0.005      0.3424
1       USD  3/17/2018   0.010      0.3424
2       USD  3/17/2018   0.015      0.3424
3       USD  3/17/2018   0.020     0.30205
4       USD  3/17/2018   0.025      0.2617
5       USD  3/17/2018   0.030      0.2414
6       USD  3/17/2018   0.035      0.2326
7       EUR  3/17/2018   0.040      0.2238
8       EUR  3/17/2018   0.045       0.215
9       EUR  3/17/2018   0.050       0.212
10      EUR  3/17/2018   0.550      0.2103
11      EUR  3/17/2018   0.060     0.20975
12      EUR  3/17/2018   0.065      0.2092
13      USD  3/17/2018   0.070      0.2092
Any help is appreciated! Thanks!
I'd like to point out that this is still one-dimensional interpolation: we have one independent variable ('STRIKE') and one dependent variable ('VOLATILITY'). The interpolation is simply done separately for different conditions, e.g. for each day, each currency, each scenario, etc. The following is an example of how the interpolation can be done based on 'END_DATE' and 'CCY_CODE'.
import scipy.interpolate

# set all the conditions as index
df_volsurface.set_index(['END_DATE', 'CCY_CODE', 'STRIKE'], inplace=True)
df_volsurface.sort_index(level=['END_DATE', 'CCY_CODE', 'STRIKE'], inplace=True)
# create separate columns for all criteria except the independent variable
df_volsurface = df_volsurface.unstack(level=['END_DATE', 'CCY_CODE'])
for ccy in df_volsurface:
    indices = df_volsurface[ccy].notna()
    if not any(indices):
        continue  # we are not interested in a column with only NaNs
    x = df_volsurface.index.get_level_values(level='STRIKE')  # independent var
    y = df_volsurface[ccy]  # dependent var
    # create interpolation function
    f_interp = scipy.interpolate.interp1d(x[indices], y[indices], kind='linear',
                                          bounds_error=False, fill_value='extrapolate')
    df_volsurface['VOL_INTERP', ccy[1], ccy[2]] = f_interp(x)
print(df_volsurface)
The interpolation for the other conditions should work analogously. This is the resulting DataFrame:
VOLATILITY VOL_INTERP
END_DATE 2018-03-16 2018-03-17 2018-03-17
CCY_CODE SGD EUR USD EUR USD
STRIKE
0.005 NaN NaN NaN 0.23900 0.42310
0.010 NaN NaN NaN 0.23600 0.38275
0.015 NaN NaN 0.3424 0.23300 0.34240
0.020 NaN NaN NaN 0.23000 0.30205
0.025 NaN NaN 0.2617 0.22700 0.26170
0.030 NaN NaN 0.2414 0.22400 0.24140
0.035 NaN NaN NaN 0.22100 0.22110
0.040 NaN NaN NaN 0.21800 0.20080
0.045 NaN 0.2150 NaN 0.21500 0.18050
0.050 NaN 0.2120 NaN 0.21200 0.16020
0.055 NaN 0.2103 NaN 0.21030 0.13990
0.060 NaN NaN NaN 0.20975 0.11960
0.065 NaN 0.2092 NaN 0.20920 0.09930
0.070 NaN NaN NaN 0.20865 0.07900
Use df_volsurface.stack() to return to a multiindex of your choice. There are also several pandas interpolation methods to choose from. However, I have not found a satisfactory solution for your problem using method='akima' because it only interpolates between the given data points, but does not seem to extrapolate beyond.
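For example, to get back to a long format with one row per (STRIKE, END_DATE, CCY_CODE), something like this should work (a sketch; the level names must match your column index):
long_df = df_volsurface.stack(level=['END_DATE', 'CCY_CODE'])
print(long_df)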