Optimize dataframe fill and refill in Python pandas

I have changed the column names and added new columns too.
I have a numpy array whose values I have to fill into the respective dataframe columns.
Filling the dataframe with the following code is slow:
import pandas as pd
import numpy as np

df = pd.read_csv("sample.csv")
df = df.tail(1000)

DISPLAY_IN_TRAINING = []
Slice_Middle_Piece_X = slice(None, -1, None)
Slice_Middle_Piece_Y = slice(-1, None)
input_slicer = slice(None, None)
output_slice = slice(None, None)
seq_len = 15  # choose sequence length
n_steps = seq_len - 1
Disp_Data = df

def Generate_DataSet(stock, df_clone, seq_len):
    global DISPLAY_IN_TRAINING
    data_raw = stock.values  # convert to numpy array
    data = []
    len_data_raw = data_raw.shape[0]
    for index in range(0, len_data_raw - seq_len + 1):
        data.append(data_raw[index: index + seq_len])
    data = np.array(data)
    test_set_size = int(np.round(30 / 100 * data.shape[0]))
    train_set_size = data.shape[0] - test_set_size
    x_train, y_train = Get_Data_Chopped(data[:train_set_size])
    print("Training Sliced Successful....!")
    df_train_candle = df_clone[n_steps : train_set_size + n_steps]
    if len(DISPLAY_IN_TRAINING) == 0:
        DISPLAY_IN_TRAINING = list(df_clone)
    df_train_candle.columns = DISPLAY_IN_TRAINING
    return [x_train, y_train, df_train_candle]

def Get_Data_Chopped(data_related_to):
    x_values = []
    y_values = []
    for index, iter_values in enumerate(data_related_to):
        x_values.append(iter_values[Slice_Middle_Piece_X, input_slicer])
        y_values.append([item for sublist in iter_values[Slice_Middle_Piece_Y, output_slice] for item in sublist])
    x_values = np.asarray(x_values)
    y_values = np.asarray(y_values)
    return [x_values, y_values]

x_train, y_train, df_train_candle = Generate_DataSet(df, Disp_Data, seq_len)
df_train_candle.reset_index(drop=True, inplace=True)

df_columns = list(df_train_candle)
df_outputs_name = []
OUTPUT_COLUMN = df.columns
for output_column_name in OUTPUT_COLUMN:
    df_outputs_name.append(output_column_name + "_pred")
    for i in range(len(df_columns)):
        if df_columns[i] == output_column_name:
            df_columns[i] = output_column_name + "_orig"
            break
df_train_candle.columns = df_columns

df_pred_names = pd.DataFrame(columns=df_outputs_name)
df_train_candle = df_train_candle.join(df_pred_names, how="outer")
for row_index, row_value in enumerate(y_train):
    for valueindex, output_label in enumerate(OUTPUT_COLUMN):
        df_train_candle.loc[row_index, output_label + "_orig"] = row_value[valueindex]
        df_train_candle.loc[row_index, output_label + "_pred"] = row_value[valueindex]
print(df_train_candle.head())
The shape of my y_train is (195, 24) and the dataframe shape is (195, 48). Now I am trying to optimize the process and make it faster. The y_train may change shape, to say (195, 1) or (195, 5).
So can someone please tell me another (optimized) way of doing the above? I want a general solution that fits any shape without losing data integrity and is faster too.
If the data size increases from 1000 to 2000, the process becomes slow. Please advise how to make it faster.
Sample data df looks like this, with shape (1000, 8):
A B C D E F G H
64272 195 215 239 272 22 11 33 55
64273 196 216 240 273 22 11 33 55
64274 197 217 241 274 22 11 33 55
64275 198 218 242 275 22 11 33 55
64276 199 219 243 276 22 11 33 55
The output looks like this:
A_orig B_orig C_orig D_orig E_orig F_orig G_orig H_orig A_pred B_pred C_pred D_pred E_pred F_pred G_pred H_pred
0 10 30 54 87 22 11 33 55 10 30 54 87 22 11 33 55
1 11 31 55 88 22 11 33 55 11 31 55 88 22 11 33 55
2 12 32 56 89 22 11 33 55 12 32 56 89 22 11 33 55
3 13 33 57 90 22 11 33 55 13 33 57 90 22 11 33 55
4 14 34 58 91 22 11 33 55 14 34 58 91 22 11 33 55
Generate a CSV with 1000 or more rows and you will see the program become slower. I want to make it faster. I hope this is enough to understand the problem.
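For reference, the usual bottleneck in code like this is the per-cell .loc assignment in the final loop; pandas can assign whole blocks of columns in one operation. A hedged sketch of a drop-in replacement for that loop (assuming, as in the code above, that the _orig and _pred columns already exist and that y_train has one row per row of df_train_candle):

# Assumes y_train.shape == (len(df_train_candle), len(OUTPUT_COLUMN)).
orig_cols = [c + "_orig" for c in OUTPUT_COLUMN]
pred_cols = [c + "_pred" for c in OUTPUT_COLUMN]
df_train_candle[orig_cols] = y_train  # one block assignment instead of 195*24 .loc writes
df_train_candle[pred_cols] = y_train

This works for any y_train width, e.g. (195, 1) or (195, 5), as long as OUTPUT_COLUMN lists the matching column names. The window-building loop in Generate_DataSet can likewise be vectorized with numpy (NumPy >= 1.20), though it is usually the smaller cost:

# Equivalent to the append loop: shape (N - seq_len + 1, seq_len, n_cols).
data = np.lib.stride_tricks.sliding_window_view(data_raw, (seq_len, data_raw.shape[1]))[:, 0]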

Related

Python, trying to calculate RSI but I am getting unusually high numbers

I am trying to calculate the RSI formula in python. I am getting the closing price data from the AlphaVantage TimeSeries API.
def rsi(data, period):
    length = len(data) - 1
    current_price = 0
    previous_price = 0
    avg_up = 0
    avg_down = 0
    for i in range(length - period, length):
        current_price = data[i]
        if current_price > previous_price:
            avg_up += current_price - previous_price
        else:
            avg_down += previous_price - current_price
        previous_price = data[i]
    # Calculate average gain and loss
    avg_up = avg_up / period
    avg_down = avg_down / period
    # Calculate relative strength
    rs = avg_up / avg_down
    # Calculate rsi
    rsi = 100 - (100 / (1 + rs))
    return rsi

print(rsi(data=closing_price, period=14))
In this case, this will output a really high number along the lines of RSI: 99.824. But according to TradingView, the current RSI is actually 62.68.
Any feedback on what I am doing wrong would be very much appreciated!
Here is some data; it is 100 minutes of AAPL data:
0
0 118.3900
1 118.4200
2 118.3500
3 118.3000
4 118.2800
5 118.4000
6 118.3400
7 118.4500
8 118.3900
9 118.4100
10 118.4700
11 118.4000
12 118.4000
13 118.3400
14 118.4100
15 118.2850
16 118.2900
17 118.1700
18 118.2600
19 118.2800
20 118.2600
21 118.2400
22 118.2950
23 118.2800
24 118.2900
25 118.2850
26 118.3000
27 118.2150
28 118.2300
29 118.1450
30 118.1200
31 118.0800
32 118.1300
33 118.1100
34 118.1300
35 118.2300
36 118.1000
37 118.1900
38 118.2800
39 118.2400
40 118.2300
41 118.3300
42 118.3200
43 118.3500
44 118.3600
45 118.3650
46 118.3800
47 118.4500
48 118.5000
49 118.5100
50 118.5400
51 118.5100
52 118.5063
53 118.5200
54 118.5400
55 118.4700
56 118.4700
57 118.4300
58 118.4400
59 118.4300
60 118.3800
61 118.4000
62 118.3600
63 118.3700
64 118.3400
65 118.3200
66 118.3000
67 118.3210
68 118.3714
69 118.4000
70 118.4100
71 118.3500
72 118.3300
73 118.3200
74 118.3250
75 118.3200
76 118.3900
77 118.5000
78 118.4800
79 118.5300
80 118.5300
81 118.4800
82 118.5000
83 118.4400
84 118.5400
85 118.5550
86 118.5200
87 118.4600
88 118.4500
89 118.4400
90 118.4300
91 118.4019
92 118.4400
93 118.4400
94 118.4100
95 118.4000
96 118.4400
97 118.4400
98 118.4600
99 118.5050
I've managed to compute 59.4 with the code below, which is close to what you are looking for. Here is what I've changed:
- The averages are divided by the n_up and n_down counters, not by period.
- The previous_price and current_price variables were removed in favor of accessing data[i] and the previous data[i-1] directly.
Note that the code should be checked against other data.
close_AAPL = [118.4200, 118.3500, 118.3000, 118.2800, 118.4000,
118.3400, 118.4500, 118.3900, 118.4100, 118.4700,
118.4000, 118.4000, 118.3400, 118.4100, 118.2850,
118.2900, 118.1700, 118.2600, 118.2800, 118.2600,
118.2400, 118.2950, 118.2800, 118.2900, 118.2850,
118.3000, 118.2150, 118.2300, 118.1450, 118.1200,
118.0800, 118.1300, 118.1100, 118.1300, 118.2300,
118.1000, 118.1900, 118.2800, 118.2400, 118.2300,
118.3300, 118.3200, 118.3500, 118.3600, 118.3650,
118.3800, 118.4500, 118.5000, 118.5100, 118.5400,
118.5100, 118.5063, 118.5200, 118.5400, 118.4700,
118.4700, 118.4300, 118.4400, 118.4300, 118.3800,
118.4000, 118.3600, 118.3700, 118.3400, 118.3200,
118.3000, 118.3210, 118.3714, 118.4000, 118.4100,
118.3500, 118.3300, 118.3200, 118.3250, 118.3200,
118.3900, 118.5000, 118.4800, 118.5300, 118.5300,
118.4800, 118.5000, 118.4400, 118.5400, 118.5550,
118.5200, 118.4600, 118.4500, 118.4400, 118.4300,
118.4019, 118.4400, 118.4400, 118.4100, 118.4000,
118.4400, 118.4400, 118.4600, 118.5050]
def rsi(data, period):
    length = len(data) - 1
    avg_up = 0
    n_up = 0
    avg_down = 0
    n_down = 0
    for i in range(length - period, length):
        if data[i] > data[i-1]:
            avg_up += data[i] - data[i-1]
            n_up += 1
        else:
            avg_down += data[i-1] - data[i]
            n_down += 1
    # Calculate average gain and loss
    avg_up = avg_up / n_up
    avg_down = avg_down / n_down
    # Calculate relative strength
    rs = avg_up / avg_down
    # Calculate rsi
    return 100. - 100. / (1 + rs)

print(rsi(data=close_AAPL, period=14))
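As an aside, the same kind of simple-average RSI can be written without the explicit loop using pandas. A hedged sketch (rsi_simple is a made-up name; note it divides by the full period, the classical simple definition, rather than by the up/down counts used above, so the numbers will differ slightly):

import pandas as pd

def rsi_simple(data, period=14):
    # Price changes over the last `period` intervals.
    delta = pd.Series(data).diff().iloc[-period:]
    avg_up = delta.clip(lower=0).mean()      # losses contribute 0 to the gain average
    avg_down = (-delta.clip(upper=0)).mean()  # gains contribute 0 to the loss average
    rs = avg_up / avg_down
    return 100 - 100 / (1 + rs)

print(rsi_simple(close_AAPL, period=14))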

Getting a list of list of Nearest Neighbour within boundaries

I'm trying to return a list of lists of the vertical, horizontal, and diagonal nearest neighbors of every item in a 2D numpy array:
import numpy as np
import copy

tilemap = np.arange(99).reshape(11, 9)
print(tilemap)

def get_neighbor(pos, array):
    x = copy.deepcopy(pos[0])
    y = copy.deepcopy(pos[1])
    grid = copy.deepcopy(array)
    split = []
    split.append([grid[y-1][x-1]])
    split.append([grid[y-1][x]])
    split.append([grid[y-1][x+1]])
    split.append([grid[y][x-1]])
    split.append([grid[y][x+1]])
    split.append([grid[y+1][x-1]])
    split.append([grid[y+1][x]])
    split.append([grid[y+1][x+1]])
    print("\n Neighbors of ITEM[{}]\n {}".format(grid[y][x], split))

cordinates = [5, 6]
get_neighbor(pos=cordinates, array=tilemap)
I would want a list like this for the first item (0):
[[1], [12], [13],
[1, 2], [12, 24], [13, 26],
[1, 2, 3], [12, 24, 36], [13, 26, 39], ...
until it reaches the boundaries completely, then it proceeds to the second item (1) and keeps adding to the list. If there is a neighbor above, it should be added too.
MY RESULT
[[ 0 1 2 3 4 5 6 7 8]
[ 9 10 11 12 13 14 15 16 17]
[18 19 20 21 22 23 24 25 26]
[27 28 29 30 31 32 33 34 35]
[36 37 38 39 40 41 42 43 44]
[45 46 47 48 49 50 51 52 53]
[54 55 56 57 58 59 60 61 62]
[63 64 65 66 67 68 69 70 71]
[72 73 74 75 76 77 78 79 80]
[81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98]]
Neighbors of ITEM[59]
[[49], [50], [51], [58], [60], [67], [68], [69]]
Alright, what about using a function like this? It takes the array, your target index, and the "radius" of the elements to be included.
def get_idx_adj(arr, idx, radius):
    num_rows, num_cols = arr.shape
    idx_row, idx_col = idx
    slice_1 = np.s_[max(0, idx_row - radius):min(num_rows, idx_row + radius + 1)]
    slice_2 = np.s_[max(0, idx_col - radius):min(num_cols, idx_col + radius + 1)]
    return arr[slice_1, slice_2]
I'm currently trying to find the best way to transform the index of the element, so that the function can be used on its own output successively to get all the subarrays of various sizes.
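For instance, a quick sanity check against the ITEM[59] example above (note the index here is (row, col), unlike the (x, y) order in the question):

tilemap = np.arange(99).reshape(11, 9)
print(get_idx_adj(tilemap, (6, 5), radius=1))
# [[49 50 51]
#  [58 59 60]
#  [67 68 69]]

Larger radii then give the growing neighborhoods, clipped at the array boundaries by the max/min in the slices.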

How to create contours over points with Basemap?

Having a table "tempcc" of value with x,y geografic coords (don't know attaching files here, there is 86 rows in my csv):
X Y Temp
0 35.268 55.618 1.065389
1 35.230 55.682 1.119160
2 35.508 55.690 1.026214
3 35.482 55.652 1.007834
4 35.289 55.664 1.087598
5 35.239 55.655 1.099459
6 35.345 55.662 1.066117
7 35.402 55.649 1.035958
8 35.506 55.643 0.991939
9 35.526 55.688 1.018137
10 35.541 55.695 1.017870
11 35.471 55.682 1.033929
12 35.573 55.668 0.985559
13 35.547 55.651 0.982335
14 35.425 55.671 1.042975
15 35.505 55.675 1.016236
16 35.600 55.681 0.985532
17 35.458 55.717 1.063691
18 35.538 55.720 1.037523
19 35.230 55.726 1.146047
20 35.606 55.707 1.003364
21 35.582 55.700 1.006711
22 35.350 55.696 1.087173
23 35.309 55.677 1.088988
24 35.563 55.687 1.003785
25 35.510 55.764 1.079220
26 35.334 55.736 1.119026
27 35.429 55.745 1.093300
28 35.366 55.752 1.119061
29 35.501 55.745 1.068676
.. ... ... ...
56 35.472 55.800 1.117183
57 35.538 55.855 1.134721
58 35.507 55.834 1.129712
59 35.256 55.845 1.211969
60 35.338 55.823 1.174397
61 35.404 55.835 1.162387
62 35.460 55.826 1.138965
63 35.497 55.831 1.130774
64 35.469 55.844 1.148516
65 35.371 55.510 0.945187
66 35.378 55.545 0.969400
67 35.456 55.502 0.902285
68 35.429 55.517 0.925932
69 35.367 55.710 1.090652
70 35.431 55.490 0.903296
71 35.284 55.606 1.051335
72 35.234 55.634 1.088135
73 35.284 55.591 1.041181
74 35.354 55.587 1.010446
75 35.332 55.581 1.015004
76 35.356 55.606 1.023234
77 35.311 55.545 0.997468
78 35.307 55.575 1.020845
79 35.363 55.645 1.047831
80 35.401 55.628 1.021373
81 35.340 55.629 1.045491
82 35.440 55.643 1.017227
83 35.293 55.630 1.063910
84 35.370 55.623 1.029797
85 35.238 55.601 1.065699
I try to create isolines with:
import numpy as np
from numpy import meshgrid, linspace
from mpl_toolkits.basemap import Basemap

data = tempcc
m = Basemap(lat_0=np.mean(tempcc['Y'].values),
            lon_0=np.mean(tempcc['X'].values),
            llcrnrlon=35, llcrnrlat=55.3,
            urcrnrlon=35.9, urcrnrlat=56.0, resolution='l')
x = linspace(m.llcrnrlon, m.urcrnrlon, data.shape[1])
y = linspace(m.llcrnrlat, m.urcrnrlat, data.shape[0])
xx, yy = meshgrid(x, y)
m.contour(xx, yy, data, latlon=True)
#pt.legend()
m.scatter(tempcc['X'].values, tempcc['Y'].values, latlon=True)
#m.contour(x,y,data,latlon=True)
But I can't manage it correctly, although everything seems fine. As far as I understand, I have to make a 2D matrix of values where i is lat and j is lon, but I can't find an example.
The result I get (image): as you see, the region is correct, but the interpolation is not good.
What's the matter? Which parameter have I forgotten?
You could use a Triangulation and then call tricontour() instead of contour():
import matplotlib.pyplot as plt
from matplotlib.tri import Triangulation
from mpl_toolkits.basemap import Basemap
import numpy as np

m = Basemap(lat_0=np.mean(tempcc['Y'].values),
            lon_0=np.mean(tempcc['X'].values),
            llcrnrlon=35, llcrnrlat=55.3,
            urcrnrlon=35.9, urcrnrlat=56.0, resolution='l')
triMesh = Triangulation(tempcc['X'].values, tempcc['Y'].values)
tctr = m.tricontour(triMesh, tempcc['Temp'].values,
                    levels=np.linspace(min(tempcc['Temp'].values),
                                       max(tempcc['Temp'].values), 7),
                    latlon=True)
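To label the isolines and overlay the measurement points, something along these lines should work (an untested addition using standard matplotlib calls):

plt.clabel(tctr, fmt='%1.2f')  # label each isoline with its temperature value
m.scatter(tempcc['X'].values, tempcc['Y'].values, latlon=True)
plt.show()

The key design point is that tricontour interpolates on a triangulation of the scattered stations themselves, so no regular 2D grid of values is needed, unlike contour().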

Pyomo's parameter estimation in an ODE system with missing values in time series

I have an ODE system of 7 equations describing the dynamics of a particular set of microorganisms (the equations were posted as images; their right-hand sides appear in the constraints below). The x_i are the different chemical and microorganism species involved (even sub-indexes for chemical compounds), the y_i are the yield coefficients, and the k_j are the kinetic constants entering the Monod-type pseudo-reaction terms.
I am using Pyomo to estimate all my unknown parameters, which are basically all the yield coefficients and kinetic constants (15 in total).
The following code works perfectly when used with complete experimental time series for each of the dynamical variables:
from pyomo.environ import *
from pyomo.dae import *

m = AbstractModel()
m.t = ContinuousSet()
m.MEAS_t = Set(within=m.t)  # Measurement times, must be a subset of t
m.x1_meas = Param(m.MEAS_t)
m.x2_meas = Param(m.MEAS_t)
m.x3_meas = Param(m.MEAS_t)
m.x4_meas = Param(m.MEAS_t)
m.x5_meas = Param(m.MEAS_t)
m.x6_meas = Param(m.MEAS_t)
m.x7_meas = Param(m.MEAS_t)
m.x1 = Var(m.t, within=PositiveReals)
m.x2 = Var(m.t, within=PositiveReals)
m.x3 = Var(m.t, within=PositiveReals)
m.x4 = Var(m.t, within=PositiveReals)
m.x5 = Var(m.t, within=PositiveReals)
m.x6 = Var(m.t, within=PositiveReals)
m.x7 = Var(m.t, within=PositiveReals)
m.k1 = Var(within=PositiveReals)
m.k2 = Var(within=PositiveReals)
m.k3 = Var(within=PositiveReals)
m.k4 = Var(within=PositiveReals)
m.k5 = Var(within=PositiveReals)
m.k6 = Var(within=PositiveReals)
m.k7 = Var(within=PositiveReals)
m.k8 = Var(within=PositiveReals)
m.k9 = Var(within=PositiveReals)
m.y1 = Var(within=PositiveReals)
m.y2 = Var(within=PositiveReals)
m.y3 = Var(within=PositiveReals)
m.y4 = Var(within=PositiveReals)
m.y5 = Var(within=PositiveReals)
m.y6 = Var(within=PositiveReals)
m.x1dot = DerivativeVar(m.x1, wrt=m.t)
m.x2dot = DerivativeVar(m.x2, wrt=m.t)
m.x3dot = DerivativeVar(m.x3, wrt=m.t)
m.x4dot = DerivativeVar(m.x4, wrt=m.t)
m.x5dot = DerivativeVar(m.x5, wrt=m.t)
m.x6dot = DerivativeVar(m.x6, wrt=m.t)
m.x7dot = DerivativeVar(m.x7, wrt=m.t)

def _init_conditions(m):
    yield m.x1[0] == 51.963
    yield m.x2[0] == 6.289
    yield m.x3[0] == 0
    yield m.x4[0] == 6.799
    yield m.x5[0] == 0
    yield m.x6[0] == 4.08
    yield m.x7[0] == 0
m.init_conditions = ConstraintList(rule=_init_conditions)

def _x1dot(m, i):
    if i == 0:
        return Constraint.Skip
    return m.x1dot[i] == -m.y1*m.k1*m.x1[i]*m.x2[i]/(m.k2+m.x1[i]) - m.y2*m.k3*m.x1[i]*m.x4[i]/(m.k4+m.x1[i])
m.x1dotcon = Constraint(m.t, rule=_x1dot)

def _x2dot(m, i):
    if i == 0:
        return Constraint.Skip
    return m.x2dot[i] == m.k1*m.x1[i]*m.x2[i]/(m.k2+m.x1[i]) - m.k7*m.x2[i]*m.x3[i]
m.x2dotcon = Constraint(m.t, rule=_x2dot)

def _x3dot(m, i):
    if i == 0:
        return Constraint.Skip
    return m.x3dot[i] == m.y3*m.k1*m.x1[i]*m.x2[i]/(m.k2+m.x1[i]) - m.y4*m.k5*m.x3[i]*m.x6[i]/(m.k6+m.x3[i])
m.x3dotcon = Constraint(m.t, rule=_x3dot)

def _x4dot(m, i):
    if i == 0:
        return Constraint.Skip
    return m.x4dot[i] == m.k3*m.x1[i]*m.x4[i]/(m.k4+m.x1[i]) - m.k8*m.x4[i]*m.x3[i]
m.x4dotcon = Constraint(m.t, rule=_x4dot)

def _x5dot(m, i):
    if i == 0:
        return Constraint.Skip
    return m.x5dot[i] == m.y5*m.k3*m.x1[i]*m.x4[i]/(m.k4+m.x1[i])
m.x5dotcon = Constraint(m.t, rule=_x5dot)

def _x6dot(m, i):
    if i == 0:
        return Constraint.Skip
    return m.x6dot[i] == m.k5*m.x3[i]*m.x6[i]/(m.k6+m.x3[i]) - m.k9*m.x6[i]*m.x7[i]
m.x6dotcon = Constraint(m.t, rule=_x6dot)

def _x7dot(m, i):
    if i == 0:
        return Constraint.Skip
    return m.x7dot[i] == m.y6*m.k5*m.x3[i]*m.x6[i]/(m.k6+m.x3[i])
m.x7dotcon = Constraint(m.t, rule=_x7dot)

def _obj(m):
    return sum((m.x1[i]-m.x1_meas[i])**2 + (m.x2[i]-m.x2_meas[i])**2 + (m.x3[i]-m.x3_meas[i])**2
               + (m.x4[i]-m.x4_meas[i])**2 + (m.x5[i]-m.x5_meas[i])**2 + (m.x6[i]-m.x6_meas[i])**2
               + (m.x7[i]-m.x7_meas[i])**2 for i in m.MEAS_t)
m.obj = Objective(rule=_obj)

m.pprint()
instance = m.create_instance('exp.dat')
instance.t.pprint()
discretizer = TransformationFactory('dae.collocation')
discretizer.apply_to(instance, nfe=30)  # ,ncp=3
solver = SolverFactory('ipopt')
results = solver.solve(instance, tee=True)
However, I am now trying to run the same estimation routine on other experimental data that have missing values at the end of one, or at most two, of the time series of some of the dynamical variables.
In other words, the complete experimental data look like this (in the .dat file):
set t := 0 6 12 18 24 30 36 42 48 54 60 66 72 84 96 120 144;
set MEAS_t := 0 6 12 18 24 30 36 42 48 54 60 66 72 84 96 120 144;
param x1_meas :=
0 51.963
6 43.884
12 24.25
18 26.098
24 11.871
30 4.607
36 1.714
42 4.821
48 5.409
54 3.701
60 3.696
66 1.544
72 4.428
84 1.086
96 2.337
120 2.837
144 3.486
;
param x2_meas :=
0 6.289
6 6.242
12 7.804
18 7.202
24 6.48
30 5.833
36 6.644
42 5.741
48 4.568
54 4.252
60 5.603
66 5.167
72 4.399
84 4.773
96 4.801
120 3.866
144 3.847
;
param x3_meas :=
0 0
6 2.97
12 9.081
18 9.62
24 6.067
30 11.211
36 16.213
42 10.215
48 20.106
54 22.492
60 5.637
66 5.636
72 13.85
84 4.782
96 9.3
120 4.267
144 7.448
;
param x4_meas :=
0 6.799
6 7.73
12 7.804
18 8.299
24 8.208
30 8.523
36 8.507
42 8.656
48 8.49
54 8.474
60 8.203
66 8.127
72 8.111
84 8.064
96 6.845
120 6.721
144 6.162
;
param x5_meas :=
0 0
6 0.267
12 0.801
18 1.256
24 1.745
30 5.944
36 3.246
42 7.787
48 7.991
54 6.943
60 8.593
66 8.296
72 6.85
84 8.021
96 7.667
120 7.209
144 8.117
;
param x6_meas :=
0 4.08
6 4.545
12 4.784
18 4.888
24 5.293
30 5.577
36 5.802
42 5.967
48 6.386
54 6.115
60 6.625
66 6.835
72 6.383
84 6.605
96 5.928
120 5.354
144 4.975
;
param x7_meas :=
0 0
6 0.152
12 1.616
18 0.979
24 4.033
30 5.121
36 2.759
42 3.541
48 4.278
54 4.141
60 6.139
66 3.219
72 5.319
84 4.328
96 3.621
120 4.208
144 5.93
;
An incomplete data set, by contrast, could have all time series complete except one, like this:
param x6_meas :=
0 4.08
6 4.545
12 4.784
18 4.888
24 5.293
30 5.577
36 5.802
42 5.967
48 6.386
54 6.115
60 6.625
66 6.835
72 6.383
84 6.605
96 5.928
120 5.354
144 .
;
I know that one can tell Pyomo to take the derivative of certain variables with respect to a different time set. However, after trying it, it didn't work, and I guess that is because these are coupled ODEs. So basically my question is whether there is a way to overcome this issue in Pyomo.
Thanks in advance.
I think all you need to do is slightly modify your objective function like this:
def _obj(m):
    sum1 = sum((m.x1[i]-m.x1_meas[i])**2 for i in m.MEAS_t if i in m.x1_meas.keys())
    sum2 = sum((m.x2[i]-m.x2_meas[i])**2 for i in m.MEAS_t if i in m.x2_meas.keys())
    sum3 = sum((m.x3[i]-m.x3_meas[i])**2 for i in m.MEAS_t if i in m.x3_meas.keys())
    sum4 = sum((m.x4[i]-m.x4_meas[i])**2 for i in m.MEAS_t if i in m.x4_meas.keys())
    sum5 = sum((m.x5[i]-m.x5_meas[i])**2 for i in m.MEAS_t if i in m.x5_meas.keys())
    sum6 = sum((m.x6[i]-m.x6_meas[i])**2 for i in m.MEAS_t if i in m.x6_meas.keys())
    sum7 = sum((m.x7[i]-m.x7_meas[i])**2 for i in m.MEAS_t if i in m.x7_meas.keys())
    return sum1 + sum2 + sum3 + sum4 + sum5 + sum6 + sum7
m.obj = Objective(rule=_obj)
This double-checks that i is a valid index for each set of measurements before adding that index to the sum. If you knew a priori which measurement sets were missing data, then you could simplify this function by only doing this check on those sets and summing over the others as you were before.
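If you would rather not repeat the guard seven times, the same objective can be written generically over the state/measurement pairs; a sketch assuming the model above:

def _obj(m):
    pairs = [(m.x1, m.x1_meas), (m.x2, m.x2_meas), (m.x3, m.x3_meas),
             (m.x4, m.x4_meas), (m.x5, m.x5_meas), (m.x6, m.x6_meas),
             (m.x7, m.x7_meas)]
    # Only (variable, time) combinations that were actually measured
    # contribute to the least-squares objective.
    return sum((x[i] - meas[i])**2
               for x, meas in pairs
               for i in m.MEAS_t if i in meas.keys())
m.obj = Objective(rule=_obj)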

Pandas sort() ignoring negative sign

I want to sort a pandas df but I'm having problems with the negative values.
import pandas as pd
df = pd.read_csv('File.txt', sep='\t', header=None)
#Suppress scientific notation (finally)
pd.set_option('display.float_format', lambda x: '%.8f' % x)
print(df)
print(df.dtypes)
print(df.shape)
b = df.sort(axis=0, ascending=True)
print(b)
This gives me the ascending order but completely disregards the sign.
SPATA1 -0.00000005
HMBOX1 0.00000005
SLC38A11 -0.00000005
RP11-571M6.17 0.00000004
GNRH1 -0.00000004
PCDHB8 -0.00000004
CXCL1 0.00000004
RP11-48B3.3 -0.00000004
RNFT2 -0.00000004
GRIK3 -0.00000004
ZNF483 0.00000004
RP11-627G18.1 0.00000003
Any ideas what I'm doing wrong?
Thanks
Loading your file with:
df = pd.read_csv('File.txt', sep='\t', header=None)
Since sort() is deprecated, you can use sort_values():
b = df.sort_values(by=[1], axis=0, ascending=True)
where [1] is your column of values. For me this returns:
0 1
0 ACTA1 -0.582570
1 MT-CO1 -0.543877
2 CKM -0.338265
3 MT-ND1 -0.306239
5 MT-CYB -0.128241
6 PDK4 -0.119309
8 GAPDH -0.090912
9 MYH1 -0.087777
12 RP5-940J5.9 -0.074280
13 MYH2 -0.072261
16 MT-ND2 -0.052551
18 MYL1 -0.049142
19 DES -0.048289
20 ALDOA -0.047661
22 ENO3 -0.046251
23 MT-CO2 -0.043684
26 RP11-799N11.1 -0.034972
28 TNNT3 -0.032226
29 MYBPC2 -0.030861
32 TNNI2 -0.026707
33 KLHL41 -0.026669
34 SOD2 -0.026166
35 GLUL -0.026122
42 TRIM63 -0.022971
47 FLNC -0.018180
48 ATP2A1 -0.017752
49 PYGM -0.016934
55 hsa-mir-6723 -0.015859
56 MT1A -0.015110
57 LDHA -0.014955
.. ... ...
60 RP1-178F15.4 0.013383
58 HSPB1 0.014894
54 UBB 0.015874
53 MIR1282 0.016318
52 ALDH2 0.016441
51 FTL 0.016543
50 RP11-317J10.2 0.016799
46 RP11-290D2.6 0.018803
45 RRAD 0.019449
44 MYF6 0.019954
43 STAC3 0.021931
41 RP11-138I1.4 0.023031
40 MYBPC1 0.024407
39 PDLIM3 0.025442
38 ANKRD1 0.025458
37 FTH1 0.025526
36 MT-RNR2 0.025887
31 HSPB6 0.027680
30 RP11-451G4.2 0.029969
27 AC002398.12 0.033219
25 MT-RNR1 0.040741
24 TNNC1 0.042251
21 TNNT1 0.047177
17 MT-ND3 0.051963
15 MTND1P23 0.059405
14 MB 0.063896
11 MYL2 0.076358
10 MT-ND5 0.076479
7 CA3 0.100221
4 MT-ND6 0.140729
[18152 rows x 2 columns]
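One more thing worth checking if sort_values() still appears to ignore the sign: make sure the value column was parsed as numbers rather than strings (your print(df.dtypes) should show float64, not object). A hedged sketch of the coercion:

df[1] = pd.to_numeric(df[1], errors='coerce')  # force column 1 to numeric; unparseable entries become NaN
b = df.sort_values(by=[1], ascending=True)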
