How to set a variable dynamically - python, pandas

>>> import pandas as pd
>>> import numpy as np
>>> from pandas import Series, DataFrame
>>> rawData = pd.read_csv('wow.txt')
>>> rawData
time mean
0 0.005 0
1 0.010 258.64
2 0.015 258.43
3 0.020 253.72
4 0.025 0
5 0.030 0
6 0.035 253.84
7 0.040 254.17
8 0.045 0
9 0.050 0
10 0.055 0
11 0.060 254.73
12 0.065 254.90
.
.
.
489 4.180 167.46
I want to apply the formula below and get 'y' when I enter an 'x' value dynamically, in order to plot a graph.
y = y0 + (y1-y0)*((x-x0)/(x1-x0))
If the 'mean' value is 0 (for example indexes 4, 5 and 8, 9, 10):
1) Ask the question "Do you want to interpolate?"
2) If yes, enter the 'x' value.
3) Calculate using the formula (repeat 1-3 until the answer is no).
4) If the answer is no, finish the program.
time(x-axis) mean(y-axis)
0 0.005 0
1 0.010 258.64
2 0.015 258.43
3 0.020 <--x0 253.72 <-- y0
4 0.025 0
5 0.030 0
6 0.035 <--x1 253.84 <-- y1
7 0.040 <--x0 254.17 <-- y0
8 0.045 0
9 0.050 0
10 0.055 0
11 0.060 <--x1 254.73 <-- y1
12 0.065 254.90
.
.
.
489 4.180 167.46
The variables x0, x1, y0, y1 are determined as the non-zero rows immediately before and after a run of '0' values.
How can I set these variables dynamically and calculate? Do you have any good ideas for designing this program?

for i in df.index:
    if df['mean'][i] == 0:
        answer = input("Do you want to interpolate?")
        if answer == "Y":
            # next row after i with a non-zero mean
            nxt = df.loc[i:].loc[df['mean'] > 0].index[0]
            x1 = df['time'][nxt]
            y1 = df['mean'][nxt]
            # interpolate using the formula here
        else:
            break  # finish the program
    else:
        x0 = df['time'][i]
        y0 = df['mean'][i]
Excuse typos, working on mobile phone.
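A minimal sketch of one way to structure the interpolation step (the helper name interpolate_at and the sample frame are assumptions, not the asker's actual code; the interactive input() loop from steps 1-4 would wrap the function call):

```python
import pandas as pd

# hypothetical sample standing in for the data read from 'wow.txt'
df = pd.DataFrame({
    'time': [0.005, 0.010, 0.015, 0.020, 0.025, 0.030, 0.035, 0.040],
    'mean': [0.0, 258.64, 258.43, 253.72, 0.0, 0.0, 253.84, 254.17],
})

def interpolate_at(df, x):
    """Interpolate 'mean' at time x, taking the nearest non-zero rows
    before and after x as (x0, y0) and (x1, y1)."""
    nonzero = df[df['mean'] != 0]
    before = nonzero[nonzero['time'] <= x]
    after = nonzero[nonzero['time'] >= x]
    if before.empty or after.empty:
        raise ValueError('x lies outside the interpolable range')
    x0, y0 = before.iloc[-1]['time'], before.iloc[-1]['mean']
    x1, y1 = after.iloc[0]['time'], after.iloc[0]['mean']
    if x1 == x0:
        return y0
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

print(interpolate_at(df, 0.025))  # uses x0=0.020, x1=0.035
```

A while loop around input("Do you want to interpolate?") can then repeat steps 1-3 until the answer is no.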


Printing out effect sizes from an anova_lm model

I have the following code. I am running a linear model on the dataframe 'x', with gender and highest education level achieved as categorical variables.
The aim is to assess how well age, gender and highest level of education achieved can predict 'weighteddistance'.
resultmodeldistancevariation2sleep = smf.ols(formula='weighteddistance ~ age + C(gender) + C(highest_education_level_acheived)',data=x).fit()
summarymodel = resultmodeldistancevariation2sleep.summary()
print(summarymodel)
This gives me the output:
0 1 2 3 4 5 6
0 coef std err t P>|t| [0.025 0.975]
1 Intercept 6.3693 1.391 4.580 0.000 3.638 9.100
2 C(gender)[T.2.0] 0.2301 0.155 1.489 0.137 -0.073 0.534
3 C(gender)[T.3.0] 0.0302 0.429 0.070 0.944 -0.812 0.872
4 C(highest_education_level_acheived)[T.3] 1.1292 0.501 2.252 0.025 0.145 2.114
5 C(highest_education_level_acheived)[T.4] 1.0876 0.513 2.118 0.035 0.079 2.096
6 C(highest_education_level_acheived)[T.5] 1.0692 0.498 2.145 0.032 0.090 2.048
7 C(highest_education_level_acheived)[T.6] 1.2995 0.525 2.476 0.014 0.269 2.330
8 C(highest_education_level_acheived)[T.7] 1.7391 0.605 2.873 0.004 0.550 2.928
However, I want to calculate the main effect of each categorical variable on distance, which are not shown in the model above, and so I entered the model fit into an anova using 'anova_lm'.
anovaoutput = sm.stats.anova_lm(resultmodeldistancevariation2sleep)
anovaoutput['PR(>F)'] = anovaoutput['PR(>F)'].round(4)
This gives me following output below, and as I wanted, does show me the main effect of each categorical variable - gender and highest education level achieved - rather than the different groups within that variable (meaning that there is no gender[2.0] and gender[3.0] in the output below).
df sum_sq mean_sq F PR(>F)
C(gender) 2.0 4.227966 2.113983 5.681874 0.0036
C(highest_education_level_acheived) 5.0 11.425706 2.285141 6.141906 0.0000
age 1.0 8.274317 8.274317 22.239357 0.0000
Residual 647.0 240.721120 0.372057 NaN NaN
However, this output no longer shows me the confidence intervals or the coefficients for each variable.
In other words, I would like the bottom anova table to have a 'coef' column and '[0.025 0.975]' columns like the first table.
How can I achieve this?
I would be so grateful for a helping hand!

I am trying to interpolate a value from a pandas dataframe using numpy.interp but it keeps returning a wrong interpolation

import pandas as pd
import numpy as np
# defining a function for interpolation
def interpolate(x, df, xcol, ycol):
    return np.interp([x], df[xcol], df[ycol])

# function call
print(interpolate(0.4, freq_data, 'Percent_cum_freq', 'cum_OGIP'))
Trying a more direct method:
print(np.interp(0.4, freq_data['Percent_cum_freq'], freq_data['cum_OGIP']))
Output:
from function [2.37197912e+10]
from direct 23719791158.266743
For any values of x that I pass: 0.4, 0.6 and 0.9, it gives the same result, that is, 2.37197912e+10
freq_data dataframe
Percent_cum_freq cum_OGIP
0 0.999 4.455539e+07
1 0.981 1.371507e+08
2 0.913 2.777860e+08
3 0.824 4.664612e+08
4 0.720 7.031764e+08
5 0.615 9.879315e+08
6 0.547 1.320727e+09
7 0.464 1.701562e+09
8 0.396 2.130436e+09
9 0.329 2.607351e+09
10 0.285 3.132306e+09
11 0.245 3.705301e+09
12 0.199 4.326336e+09
13 0.167 4.995410e+09
14 0.136 5.712525e+09
15 0.115 6.477680e+09
16 0.085 7.290874e+09
17 0.072 8.152108e+09
18 0.056 9.061383e+09
19 0.042 1.001870e+10
20 0.034 1.102405e+10
21 0.027 1.207745e+10
22 0.022 1.317888e+10
23 0.015 1.432835e+10
24 0.013 1.552587e+10
25 0.010 1.677142e+10
26 0.007 1.806502e+10
27 0.002 1.940665e+10
28 0.002 2.079632e+10
29 0.002 2.223404e+10
30 0.001 2.371979e+10
What is wrong? How can I solve the problem?
I was just as surprised by the results when I ran the code you provided. After a little searching in the documentation for np.interp, I found that the x-coordinates must always be increasing.
np.interp(x, list_of_x_coordinates, list_of_y_coordinates)
where x is the value you want y at.
list_of_x_coordinates is df[xcol] -> this must always be increasing. Since your dataframe's x-column is decreasing, it will never give a correct result.
list_of_y_coordinates is df[ycol] -> this must have the same dimension as df[xcol] and stay in order with it.
My reproduced code:
import numpy as np
list_1 = np.interp([0.1, 0.5, 0.8],
                   [0.999, 0.547, 0.199, 0.056, 0.013, 0.001],
                   [4.455539e+07, 1.320727e+09, 4.326336e+09, 9.061383e+09, 1.552587e+10, 2.371979e+10])
list_2 = np.interp([0.1, 0.5, 0.8],
                   [0.001, 0.013, 0.056, 0.199, 0.547, 0.999],
                   [2.371979e+10, 1.552587e+10, 9.061383e+09, 4.326336e+09, 1.320727e+09, 4.455539e+07])
print("In decreasing order -> As in your case", list_1)
print("In increasing order of x coordinates", list_2)
Output:
In decreasing order -> As in your case [2.371979e+10 2.371979e+10 2.371979e+10]
In increasing order of x coordinates [7.60444546e+09 1.72665695e+09 6.06409705e+08]
As you can see now, you have to sort df[xcol] and reorder df[ycol] accordingly.
By default np.interp needs the x values to be sorted. If you do not want to sort your dataframe, a workaround is to set the period argument to np.inf:
print(np.interp(0.4, freq_data['Percent_cum_freq'], freq_data['cum_OGIP'], period=np.inf))
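Equivalently, the dataframe itself can be sorted once before calling np.interp; a small sketch using a hypothetical subset of freq_data:

```python
import numpy as np
import pandas as pd

# hypothetical subset of the freq_data frame from the question
freq_data = pd.DataFrame({
    'Percent_cum_freq': [0.999, 0.547, 0.199, 0.056, 0.013, 0.001],
    'cum_OGIP': [4.455539e+07, 1.320727e+09, 4.326336e+09,
                 9.061383e+09, 1.552587e+10, 2.371979e+10],
})

# sort_values reorders both columns together, so x stays aligned with y
s = freq_data.sort_values('Percent_cum_freq')
y = np.interp(0.4, s['Percent_cum_freq'], s['cum_OGIP'])
print(y)
```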

Obtaining the last value in a dataframe column that is equal or nearest to a target

I have an issue in my code; I'm computing cut points.
First, this is my dataframe column:
In [23]: df['bad_%']
0 0.025
1 0.007
2 0.006
3 0.006
4 0.006
5 0.006
6 0.007
7 0.007
8 0.007
9 0.006
10 0.006
11 0.009
12 0.009
13 0.009
14 0.008
15 0.008
16 0.008
17 0.012
18 0.012
19 0.05
20 0.05
21 0.05
22 0.05
23 0.05
24 0.05
25 0.05
26 0.05
27 0.062
28 0.062
29 0.061
5143 0.166
5144 0.166
5145 0.166
5146 0.167
5147 0.167
5148 0.167
5149 0.167
5150 0.167
5151 0.05
5152 0.167
5153 0.167
5154 0.167
5155 0.167
5156 0.051
5157 0.052
5158 0.161
5159 0.149
5160 0.168
5161 0.168
5162 0.168
5163 0.168
5164 0.168
5165 0.168
5166 0.168
5167 0.168
5168 0.049
5169 0.168
5170 0.168
5171 0.168
5172 0.168
Name: bad%, Length: 5173, dtype: float64
I used this code to detect the value equal or nearest to 0.05 (the value entered on the console):
error = 100  # margin of error
valuesA = []  # array to save data
pointCut = 0  # identify cut point
# variable "a" is entered on the console, in this case "0.05"
for index, row in df.iterrows():
    if abs(row['bad%'] - a) <= error:
        valuesA = row
        error = abs(row['bad%'] - a)
        pointCut = index
This code returns the value 0.05 at index 5151, which at first glance looks good, because the 0.05 at index 5151 is the last exact 0.05.
Out [27]:
5151 0.05
But my objective is to obtain THE LAST VALUE IN THE COLUMN equal or nearest to "0.05"; in this case that value is the "0.049" at index "5168", and that is the value I need.
Is there an algorithm that permits this? Any solution or recommendation?
Thanks in advance.
Solutions if at least one value exists:
Use [::-1] to reverse the values from the back and idxmax to get the last matched index value:
a = 0.05
s = df['bad%']
b = s[[(s[::-1] <= a).idxmax()]]
print (b)
5168 0.049
Or:
b = s[(s <= a)].iloc[[-1]]
print (b)
5168 0.049
Name: bad%, dtype: float64
A solution that also works if the value does not exist - then it yields an empty Series:
a = 0.05
s = df['bad%']
m1 = (s <= a)
m2 = m1[::-1].cumsum().eq(1)
b = s[m1 & m2]
print (b)
5168 0.049
Name: bad%, dtype: float64
Sample data:
df = pd.DataFrame({'bad%': {5146: 0.16699999999999998, 5147: 0.16699999999999998, 5148: 0.16699999999999998, 5149: 0.049, 5150: 0.16699999999999998, 5151: 0.05, 5152: 0.16699999999999998, 5167: 0.168, 5168: 0.049, 5169: 0.168}})

Looping a function with irregular index increase

I have my function:
import numpy as np

def monte_carlo1(N):
    x = np.random.random(size=N)
    y = np.random.random(size=N)
    dist = np.sqrt(x ** 2 + y ** 2)
    hit = 0
    miss = 0
    for z in dist:
        if z <= 1:
            hit += 1
        else:
            miss += 1
    hit_ratio = hit / N
    return hit_ratio
What I want to do is run this function 100 times for each of 10 different values of N, collecting the data into arrays.
For example, a couple of the data collections could be generated by:
data1 = np.array([monte_carlo1(10) for i in range(100)])
data2 = np.array([monte_carlo1(50) for i in range(100)])
data3 = np.array([monte_carlo1(100) for i in range(100)])
But it would be better if I could create a loop which iterates 10 times to produce 10 arrays of data, instead of having 10 variables data1...data10.
However, I want to increase the value of N inside monte_carlo1(N) by irregular amounts, so in my loop I can't just add a fixed value to N on each iteration.
Would someone suggest how I might build a loop like this?
Thanks
EDIT:
N_vals = [10, 50, 100, 250, 500, 1000, 3000, 5000, 7500, 10000]
def data_mc():
    for n in N_vals:
        data = np.array([monte_carlo1(n) for i in range(10)])
    return data
I've set up the function like this, but the output is just one array, which suggests I'm doing something wrong and N_vals isn't being cycled through.
Here is a solution using a pandas.DataFrame where the row index is the value of n for that iteration and the columns represent each of the repeat iterations.
import pandas as pd
def calc(n_list, repeat):
    # this will be a simple list of lists, no special 'numpy' arrays
    result_list = []
    for n in n_list:
        result_list.append([monte_carlo1(n) for _ in range(repeat)])
    return pd.DataFrame(data=result_list, index=n_list)
This would allow you to do some data analysis afterwards:
>>> from my_script import calc
>>> n_list = [10, 50, 100, 250, 500]
>>> df = calc(n_list, 10)
>>> df
0 1 2 3 4 5 6 7 8 9
10 0.600 0.800 0.700 0.800 0.600 1.000 0.800 0.800 0.700 0.900
50 0.840 0.860 0.700 0.860 0.740 0.860 0.780 0.740 0.740 0.820
100 0.770 0.780 0.730 0.790 0.780 0.730 0.760 0.740 0.770 0.690
250 0.784 0.804 0.792 0.768 0.800 0.780 0.792 0.800 0.804 0.764
500 0.798 0.776 0.782 0.786 0.768 0.798 0.786 0.774 0.774 0.796
Now you can calculate statistics per value of n:
>>> import pandas as pd
>>> stats = pd.DataFrame()
>>> stats['mean'] = df.mean(axis=1)
>>> stats['standard_dev'] = df.std(axis=1)
>>> stats
mean standard_dev
10 0.7700 0.125167
50 0.7940 0.061137
100 0.7540 0.030984
250 0.7888 0.014459
500 0.7838 0.010891
This data analysis shows you, for example, that your predictions get more accurate (smaller std) as you increase n.
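For completeness, the asker's own data_mc can also be fixed with a one-line change: collect each iteration's array in a list instead of overwriting the same variable. A sketch (using a vectorized stand-in for monte_carlo1 to keep it short):

```python
import numpy as np

def monte_carlo1(N):
    # vectorized stand-in for the loop-based version above
    x = np.random.random(size=N)
    y = np.random.random(size=N)
    return np.mean(np.sqrt(x ** 2 + y ** 2) <= 1)

def data_mc(n_vals, repeat=10):
    # one hit-ratio array per n; without the list, `data` would be
    # overwritten on every pass and only the last array returned
    results = []
    for n in n_vals:
        results.append(np.array([monte_carlo1(n) for _ in range(repeat)]))
    return results
```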

Improve performance on processing a big pandas dataframe

I have a big pandas dataframe (1 million rows), and I need better performance in my code to process this data.
My code is below, and a profiling analysis is also provided.
Header of the dataset:
key_id, date, par1, par2, par3, par4, pop, price, value
For each key, we have one row for every one of the 5000 possible dates.
There are 200 key_id * 5000 dates = 1,000,000 rows.
Using different variables val1, ..., val4, I compute a value for each row, and I want to extract the top 20 dates with the best value for each key_id, and then compute the popularity of the set of variables used.
In the end, I want to find the variables which optimize this popularity.
def compute_value_col(dataset, val1=0, val2=0, val3=0, val4=0):
    dataset['value'] = dataset['price'] + val1 * dataset['par1'] \
        + val2 * dataset['par2'] + val3 * dataset['par3'] \
        + val4 * dataset['par4']
    return dataset

def params_to_score(dataset, top=10, val1=0, val2=0, val3=0, val4=0):
    dataset = compute_value_col(dataset, val1, val2, val3, val4)
    dataset = dataset.sort(['key_id', 'value'], ascending=True)
    dataset = dataset.groupby('key_id').head(top).reset_index(drop=True)
    return dataset['pop'].sum()

def optimize(dataset, top):
    for i, j, k, l in product(xrange(10), xrange(10), xrange(10), xrange(10)):
        print i, j, k, l, params_to_score(dataset, top, 10*i, 10*j, 10*k, 10*l)

optimize(my_dataset, 20)
I need to enhance the performance of this code.
Here is a %prun output, after running 49 calls of params_to_score:
ncalls tottime percall cumtime percall filename:lineno(function)
98 2.148 0.022 2.148 0.022 {pandas.algos.take_2d_axis1_object_object}
49 1.663 0.034 9.852 0.201 <ipython-input-59-88fc8127a27f>:150(params_to_score)
49 1.311 0.027 1.311 0.027 {method 'get_labels' of 'pandas.hashtable.Float64HashTable' objects}
49 1.219 0.025 1.223 0.025 {pandas.algos.groupby_indices}
49 0.875 0.018 0.875 0.018 {method 'get_labels' of 'pandas.hashtable.PyObjectHashTable' objects}
147 0.452 0.003 0.457 0.003 index.py:581(is_unique)
343 0.193 0.001 0.193 0.001 {method 'copy' of 'numpy.ndarray' objects}
1 0.136 0.136 10.058 10.058 <ipython-input-59-88fc8127a27f>:159(optimize)
147 0.122 0.001 0.122 0.001 {method 'argsort' of 'numpy.ndarray' objects}
833 0.112 0.000 0.112 0.000 {numpy.core.multiarray.empty}
49 0.109 0.002 0.109 0.002 {method 'get_labels_groupby' of 'pandas.hashtable.Int64HashTable' objects}
98 0.083 0.001 0.083 0.001 {pandas.algos.take_2d_axis1_float64_float64}
49 0.078 0.002 1.460 0.030 groupby.py:1014(_cumcount_array)
I think I could split the big dataframe into small dataframes by key_id to improve the sort time, since I want the top 20 dates with the best value for each key_id, so sorting by key is just to separate the different keys.
But I would appreciate any advice: how can I improve the efficiency of this code, given that I need to run thousands of params_to_score calls?
EDIT: @Jeff
Thanks a lot for your help!
I tried using nsmallest instead of sort & head, but strangely it is 5-6 times slower when I benchmark the following two functions:
def to_bench1(dataset):
    dataset = dataset.sort(['key_id', 'value'], ascending=True)
    dataset = dataset.groupby('key_id').head(50).reset_index(drop=True)
    return dataset['pop'].sum()

def to_bench2(dataset):
    dataset = dataset.set_index('pop')
    dataset = dataset.groupby(['key_id'])['value'].nsmallest(50).reset_index()
    return dataset['pop'].sum()
On a sample of ~100000 rows, to_bench2 performs in 0.5 seconds, while to_bench1 takes only 0.085 seconds on average.
After profiling to_bench2, I notice many more isinstance calls compared to before, but I do not know where they come from...
The way to make this significantly faster is like this.
Create some sample data
In [148]: df = DataFrame({'A' : range(5), 'B' : [1,1,1,2,2] })
Define the compute_val_column like you have
In [149]: def f(p):
   .....:     return DataFrame({ 'A' : df['A']*p, 'B' : df.B })
   .....:
These are the cases (you probably want a list of tuples here), e.g. the cartesian product of all of the cases that you want to feed into the above function:
In [150]: parms = [1,3]
Create a new data frame that has the full set of values, keyed by each of the parms. This is basically a broadcasting operation:
In [151]: df2 = pd.concat([ f(p) for p in parms ],keys=parms,names=['parm','indexer']).reset_index()
In [155]: df2
Out[155]:
parm indexer A B
0 1 0 0 1
1 1 1 1 1
2 1 2 2 1
3 1 3 3 2
4 1 4 4 2
5 3 0 0 1
6 3 1 3 1
7 3 2 6 1
8 3 3 9 2
9 3 4 12 2
Here's the magic. Group by whatever columns you want, including parm as the first one (or possibly multiple ones). Then do a partial sort (this is what nlargest does); this is more efficient than sort & head (though it depends a bit on the group density). Sum at the end, again by the groupers we care about, as you are doing a 'partial' reduction.
In [153]: df2.groupby(['parm','B']).A.nlargest(2).sum(level=['parm','B'])
Out[153]:
parm B
1 1 3
2 7
3 1 9
2 21
dtype: int64
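In recent pandas versions Series.sum(level=...) has been removed, so the same broadcast-then-groupby pattern can be sketched as follows (a runnable adaptation, not the answer's original code):

```python
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': [1, 1, 1, 2, 2]})

def f(p):
    # stand-in for compute_value_col: scale A by the parameter p
    return pd.DataFrame({'A': df['A'] * p, 'B': df['B']})

parms = [1, 3]

# broadcast: one copy of the frame per parameter, keyed by 'parm'
df2 = pd.concat([f(p) for p in parms], keys=parms,
                names=['parm', 'indexer']).reset_index()

# partial sort per (parm, B) group, then reduce by the same groupers
top = df2.groupby(['parm', 'B'])['A'].nlargest(2)
result = top.groupby(level=['parm', 'B']).sum()
print(result)
```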
