I'm trying to port a formula from Excel that (as usual) iterates over two other values. Let me try to explain better.
I have the following three variables:
iva_versamenti_totale,
utilizzato,
riporto
iva_versamenti_totale has a length of 13 and is given by the following formula:
iva_versamenti_totale={'Saldo IVA': [sum(t) for t in zip(*iva_versamenti.values())],}
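To illustrate what that formula does, here is a hypothetical example (the keys and values are made up): zip(*iva_versamenti.values()) walks all the series in iva_versamenti in parallel, so each 'Saldo IVA' entry is a column-wise sum.
# Hypothetical data, just to show the column-wise sum
iva_versamenti = {'acconto': [1, 2, 3], 'saldo': [10, 20, 30]}
iva_versamenti_totale = {'Saldo IVA': [sum(t) for t in zip(*iva_versamenti.values())]}
# -> {'Saldo IVA': [11, 22, 33]}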
utilizzato and riporto are obtained simultaneously, in an iterative manner. I have tried the following code, but it does not work:
utilizzato = dict()
riporto = dict()
for index, xi in enumerate(iva_versamenti_totale['Saldo IVA']):
    if xi > 0:
        riporto[index] = riporto[index] + xi
    else:
        riporto[index] = riporto[index-1] - utilizzato[index]
for index, xi in enumerate(iva_versamenti_totale['Saldo IVA']):
    if xi > 0:
        utilizzato[index] == 0
    elif riporto[index-1] >= xi:
        utilizzato[index] = -xi
    else:
        utilizzato[index] = riporto[index-1]
Python gives me KeyError: 0.
EDIT
Here my excel file:
https://drive.google.com/open?id=1SRp9uscUgYsV88991yTnZ8X4c8NElhuD
Inputs are in grey and the variables in yellow.
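For what it's worth, the KeyError: 0 happens because both dicts start out empty: the first loop reads riporto[index] and utilizzato[index] before anything has been assigned, and the two series depend on each other, so they need to be built together in a single pass. Below is a minimal sketch of one possible fix, assuming the carry-over starts at 0 on the first row and that riporto[index] = riporto[index] + xi was meant to use the previous row's value; it also keeps your comparison riporto[index-1] >= xi, though you may have meant >= -xi:
utilizzato = dict()
riporto = dict()
for index, xi in enumerate(iva_versamenti_totale['Saldo IVA']):
    prev = riporto[index - 1] if index > 0 else 0  # assumed starting carry-over: 0, avoids KeyError: 0
    if xi > 0:
        utilizzato[index] = 0  # note '=' (assignment), not '==' (comparison)
        riporto[index] = prev + xi
    elif prev >= xi:  # possibly intended: prev >= -xi
        utilizzato[index] = -xi
        riporto[index] = prev - utilizzato[index]
    else:
        utilizzato[index] = prev
        riporto[index] = prev - utilizzato[index]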
The issue:
When computing a two-way nested ANOVA, the results do not match the corresponding results from R (the formulas and data are the same).
Sample:
We use "atherosclerosis" dataset from here: https://stepik.org/media/attachments/lesson/9250/atherosclerosis.csv.
To get nested data we replace dose values for age == 2:
df['dose'] = np.where((df['age']==2) & (df['dose']=='D1'),'D3', df.dose)
df['dose'] = np.where((df['age']==2) & (df['dose']=='D2'),'D4', df.dose)
So we have the dose factor nested within age: values D1 and D2 occur only in the first age group, and values D3 and D4 only in the second.
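As a quick sanity check on the nesting (a hypothetical snippet, assuming df is the loaded CSV), a cross-tabulation should show non-zero counts for D1/D2 only in age group 1 and for D3/D4 only in age group 2:
import pandas as pd

# Each dose level should appear in exactly one age group
print(pd.crosstab(df['age'], df['dose']))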
After getting the ANOVA table, we have the results below:
import statsmodels.api as sm
from statsmodels.formula.api import ols

mod = ols('expr~age/C(dose)', data=df).fit()
anova_table = sm.stats.anova_lm(mod, typ=1); anova_table
(screenshot of the statsmodels ANOVA table)
The total of the 'sum_sq' column is 1590.257424 + 47.039636 + 197.452754 = 1834.7498139999998, which is NOT equal to the correct total sum of squares (computed below) = 1805.5494956433238:
grand_mean = df['expr'].mean()
ssq_t = sum((df.expr - grand_mean)**2)
Expected Output:
Let's try to get the ANOVA table in R:
df <- read.csv(file = "/mnt/storage/users/kseniya/platform-adpkd-mrwda-aim-imaging/mrwda_training/data_samples/athero_new.csv")
nest <- aov(df$expr ~ df$age / factor(df$dose))
print(summary(nest))
The results:
(screenshot of the R ANOVA results)
Why are they not equal? The formulas are the same. Are there any mistakes in computing the ANOVA through statsmodels?
The results from R seem to be right, because the total sum 197.5 + 17.8 + 1590.3 = 1805.6 equals the total sum computed manually.
The degrees of freedom aren't equal. I suspect that the model definition is not really the same between OLS and R. Since lm(y ~ x/z, data) is just a shortcut for lm(y ~ x + x:z, data), I prefer using the extended formulation; recheck whether your data is the same. Also use lm instead of aov; the behaviour of the Python and R implementations should then be more similar.
Also, the behaviour of C() in Python does not seem to be the same as the factor() cast in R.
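To make that concrete, a sketch of the extended formulation on the statsmodels side (same names as the question; it may still not reproduce R exactly, since the C() coding can differ from factor()):
# expr ~ age/C(dose) is a nesting shortcut for age + age:C(dose)
mod = ols('expr ~ age + age:C(dose)', data=df).fit()
anova_table = sm.stats.anova_lm(mod, typ=1)
print(anova_table)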
I've got a dataframe like the one below, where column c01 holds the start time and c04 the end time of each interval:
c01 c04
1742 8.444991 14.022029
3786 29.91143 31.422439
3951 29.91143 31.145099
5402 37.81136 42.689595
8230 63.12394 65.34602
also a list like this (it's actually way longer):
8.522494
8.54471
8.578426
8.611193
8.644996
8.678053
8.710918
8.744901
8.777851
8.811053
8.844867
8.878389
8.912099
8.944729
8.977601
9.011232
9.04492
9.078157
9.111946
9.144788
9.177663
9.211054
9.245265
9.27805
9.311766
9.344647
9.377612
9.411709
I'd like to count how many elements of the list fall within the intervals given by the dataframe. I coded it like this:
count = 0
for index, row in speech.iterrows():
    count += gtls.count(lambda i : i in [row['c01'], row['c04']])
The file works as a whole, but 'count' always turns out to be 0. Would you please tell me where I messed up?
I took the liberty of converting your list into a NumPy array (I called it arr). Then you can use the apply function to create your count column. Let's assume your dataframe is called df.
import numpy as np

def get_count(row):  # the logic for your summation is here
    return np.sum((row['c01'] < arr) & (row['c04'] >= arr))

df['C_sum'] = df.apply(get_count, axis=1)
print(df)
Output:
c01 c04 C_sum
0 8.444991 14.022029 28
1 29.911430 31.422439 0
2 29.911430 31.145099 0
3 37.811360 42.689595 0
4 63.123940 65.346020 0
You can also do the whole thing in one line using lambda:
df['C_sum'] = df.apply(lambda row: np.sum((row['c01'] < arr) & (row['c04'] >= arr)), axis=1)
Welcome to Stack Overflow! The expression i in [row['c01'], row['c04']] doesn't do what you seem to think: it checks whether element i equals one of the two values in that list, not whether it lies in the range between row['c01'] and row['c04']. For checking whether a floating-point number is within a range, use row['c01'] < i < row['c04'].
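Building on that, a minimal sketch of the corrected loop (note that list.count takes a value rather than a predicate, so a generator expression with sum is used instead; gtls is assumed to be a plain Python list):
count = 0
for index, row in speech.iterrows():
    # Count the list elements lying strictly between the interval bounds
    count += sum(row['c01'] < i < row['c04'] for i in gtls)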
UPDATE: My question has been fully answered and I have applied jarmod's answer to my program. Although the code looks neater, it has not affected how quickly my graph appears (I plot this data using matplotlib). I am a little confused about why my program runs slowly and how I can increase the speed; it takes about 30 seconds, and I know this portion of the code is slowing it down. I have shown my real code in the second block of code. Also, the speed is strongly determined by the Range I set; with a short range it is quite fast.
I have sample code here that shows the calculation I need for forecasting and extracting values. I use the for loops to run through a specific range of CSV files that I labeled 1-100. I return numbers for each month (1-12) to get the average accuracy of a forecast for a given number of months.
My full code includes 12 functions for a full-year forecast, but I feel the code is inefficient because the functions are nearly identical except for one number, and reading the CSV files so many times slows the program down.
Is there a way I can combine these functions, perhaps by adding another parameter, to make this work? My biggest concern was that it would be hard to return separate numbers and categorize them. In other words, I would ideally like to have only one function for all 12 monthly accuracy predictions; the way I can see to do that would be to add another parameter and another loop, but I have no idea how to go about that or whether it is possible. Essentially, I would like to store all the values of onemonthaccuracy (which reads the file before the current file and compares the predicted value for the date associated with the current file), then store all the values of twomonthaccuracy, and so on, so I can later use these variables for graphing and other purposes.
import csv
import pandas as pd
import matplotlib.pyplot as plt

def onemonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    onemonthread = pd.read_csv(str(basefilenumber-1)+'.csv', encoding='latin-1')
    onemonthvalue = onemonthread.loc[onemonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    onetotal = int(onemonthvalue)/int(basefilevalue)
    return onetotal

def twomonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twomonthread = pd.read_csv(str(basefilenumber-2)+'.csv', encoding='Latin-1')
    twomonthvalue = twomonthread.loc[twomonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twototal = int(twomonthvalue)/int(basefilevalue)
    return twototal

onetotal = 0
twototal = 0
onetotallist = []
twototallist = []
for basefilenumber in range(24, 36):
    onetotal += onemonthaccuracy(basefilenumber)
    twototal += twomonthaccuracy(basefilenumber)
    onetotallist.append(onemonthaccuracy(basefilenumber))
    twototallist.append(twomonthaccuracy(basefilenumber))
onetotalpermonth = onetotal/12
twototalpermonth = twototal/12
x = [1, 2]
y = [onetotalpermonth, twototalpermonth]
z = [1, 2]
w = [onetotallist, twototallist]
for ze, we in zip(z, w):
    plt.scatter([ze] * len(we), we, marker='D', s=5)
plt.scatter(x, y)
plt.show()
This is the real block of code I am using in my program; perhaps something I'm unaware of is slowing it down?
#other parts of code
#StartRange = yearvalue+Value
#EndRange = endValue + endyearvalue
#Range = EndRange - StartRange
# Department
#more code....
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    baseheader = getfileheader(basefilenumber)
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains(Department, na=False), baseheader]
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding='Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains(Department, na=False), baseheader]
    return (1-(int(basefilevalue)/int(nmonthvalue))+1) if int(nmonthvalue) > int(basefilevalue) else int(nmonthvalue)/int(basefilevalue)

N = 13
total = [0] * N
total_by_month_list = [[] for _ in range(N)]
for basefilenumber in range(int(StartRange), int(EndRange)):
    for n in range(N):
        total[n] += nmonthaccuracy(basefilenumber, n)
        total_by_month_list[n].append(nmonthaccuracy(basefilenumber, n))
onetotal = total[1] / Range
twototal = total[2] / Range
threetotal = total[3] / Range
fourtotal = total[4] / Range
fivetotal = total[5] / Range  # ... all the way to 12
onetotallist = total_by_month_list[1]
twototallist = total_by_month_list[2]
threetotallist = total_by_month_list[3]
fourtotallist = total_by_month_list[4]
fivetotallist = total_by_month_list[5]  # ... all the way to 12
# a lot more code after this
Something like this:
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding='Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    return int(nmonthvalue)/int(basefilevalue)

N = 2
total_by_month = [0] * N
total_aggregate = 0
for basefilenumber in range(20, 30):
    for n in range(N):
        a = nmonthaccuracy(basefilenumber, n)
        total_by_month[n] += a
        total_aggregate += a
In case you are wondering what the following code does:
N = 2
total_by_month = [0] * N
It sets N to the number of months desired (2 here, but you could make it 12 or another value) and then creates a total_by_month array that can store N results, one per month. It initializes total_by_month to all zeroes (N of them) so that each monthly total starts at zero.
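Regarding the slowness mentioned in the UPDATE: the real code's nested loop re-reads the same CSV files from disk on every pass, and it also calls nmonthaccuracy twice per (file, month) pair. A minimal sketch of a read cache, assuming the files fit in memory (read_cached is a hypothetical helper, not part of the original code):
_file_cache = {}

def read_cached(filenumber):
    # Read each CSV from disk at most once; afterwards serve it from memory
    if filenumber not in _file_cache:
        _file_cache[filenumber] = pd.read_csv(str(filenumber)+'.csv', encoding='Latin-1')
    return _file_cache[filenumber]

# Inside nmonthaccuracy, replace both pd.read_csv(...) calls with read_cached(...).
# In the main loop, compute each value once and reuse it:
for basefilenumber in range(int(StartRange), int(EndRange)):
    for n in range(N):
        a = nmonthaccuracy(basefilenumber, n)
        total[n] += a
        total_by_month_list[n].append(a)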
I have two tab-delimited files. Each has two columns, one for chromosome and one for position. I want to identify the number of positions in file 2 that are within a specified window around each position in file 1, and then check the next window size, and the next, and so forth.
So if this is the first row from file 1:
1 10500
And this is from file 2:
1 10177
1 10352
1 10616
1 11008
1 11012
1 13110
1 13116
1 13118
1 13273
1 13550
If the window is of size 1000, the following co-ordinates from file 2 fall within this window:
1 10177
1 10352
1 10616
The second window should then be 2000 in size and therefore these co-ordinates fall within this window:
1 10177
1 10352
1 10616
1 11008
1 11012
My files are split by chromosome, so this needs to be taken into account as well.
I have started by creating the following function:
# Function counts the number of variants in a window of specified size
# around a specified genomic position
def vCount(nsPos, df, windowSize):
    # If the window minimum is below 0, set it to 0
    if nsPos - (windowSize/2) < 0:
        low = 0
    else:
        low = nsPos - (windowSize/2)
    high = nsPos + (windowSize/2)
    # Calculate the length (i.e. the number) of variants in the subsetted data
    diversity = len(df[(df['POS'] >= low) & (df['POS'] <= high)])
    return diversity
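As a quick sanity check against the example above (a hypothetical snippet; it assumes file 2 is loaded into a dataframe with a 'POS' column), the 1000-wide window around position 10500 should count exactly the three positions listed earlier:
import pandas as pd

variants = pd.DataFrame({'POS': [10177, 10352, 10616, 11008, 11012,
                                 13110, 13116, 13118, 13273, 13550]})
print(vCount(10500, variants, 1000))  # -> 3 (10177, 10352, 10616)
print(vCount(10500, variants, 2000))  # -> 5, matching the second window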
This calculates the count for a single co-ordinate from file 1. However, what I want to do is calculate the count for every co-ordinate and then take the average. For this I use the following function, which applies the one above:
# Function to apply the vCount function across the entire positional column of df
def windowAvg(NSdf, variantsDF, window):
    x = NSdf.POS.apply(vCount, df=variantsDF, windowSize=window)
    return x
This also works fine, but as mentioned above, this needs to be done over multiple genomic window sizes, and then the means must be plotted. So I have created a function that uses a loop and then plots the result:
import operator
import matplotlib.pyplot as plt

def loopNplot(NSdf, variantsDF, window):
    # Store windows
    windows = list(range(1, 101))
    # Store diversity
    diversity = []
    # Loop through window sizes, appending to the list
    for i in range(window, 101000, window):
        # Create a list to hold counts for each chromosome (which we then take the mean of)
        tempList = []
        # Loop through chromosomes
        for j in NSdf['CHROM']:
            # Subset NS polymorphisms
            subDF = vtable.loc[(vtable['CHROM'] == j)].copy()
            # Subset all variants
            subVar = all_variants.loc[(all_variants['CHROM'] == j)].copy()
            x = windowAvg(subDF, subVar, i)
            # Convert the series to a list
            x = x.tolist()
            # Extend tempList to include x
            tempList.extend(x)
        # Append the mean of tempList (counts of diversity) to the list
        diversity.append(sum(tempList)/len(tempList))
    # Copy diversity
    divCopy = list(diversity)
    # Add a new first index of 0
    diversity = [0] + diversity
    # Remove the last index
    diversity = diversity[:-1]
    # Subtract to give the number of variants added in each window
    div = list(map(operator.sub, divCopy, diversity))
    plt.scatter(windows, div)
    plt.show()
The issue here is that file 1 has 11,609 rows and file 2 has 13,758,644 rows. Running the above function is extremely sluggish, and I would like to know whether there is any way to optimise or change the script to make things more efficient.
I am more than happy to use other command-line tools if Python is not the best way to go about this.
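One direction worth trying (a sketch under assumptions, not a drop-in replacement): sort the 13.7 million positions of file 2 once per chromosome, then count each window with two binary searches via np.searchsorted instead of scanning the whole dataframe for every co-ordinate and every window size:
import numpy as np

# Pre-sort file 2's positions per chromosome once (same column names as above)
pos_by_chrom = {chrom: np.sort(grp['POS'].to_numpy())
                for chrom, grp in all_variants.groupby('CHROM')}

def vCountFast(chrom, nsPos, windowSize):
    pos = pos_by_chrom[chrom]
    low = max(nsPos - windowSize/2, 0)
    high = nsPos + windowSize/2
    # Two binary searches count the positions with low <= POS <= high
    return np.searchsorted(pos, high, side='right') - np.searchsorted(pos, low, side='left')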
I have 10 or more x and y coordinates, and also an equation for them, and I can't figure out how to make those calculations, especially the part with the equations. First of all, let's say there are 5 coordinates, and I need to apply this equation (the coordinates and equations were images; judging by the answer below, the main equation is of the form 2P = sum of x_i * (y_(i+1) - y_(i-1))). The first equation is the main one, and the two others are for control checks. How could I make it read those coordinates and calculate according to the equation? I tried:
import openpyxl

book = openpyxl.load_workbook('coordinates.xlsx')
sheet = book.active
for row_i in range(1, sheet.max_row + 1):
    x_value = sheet.cell(row=row_i, column=1).value
    y_value = sheet.cell(row=row_i, column=2).value
    print(x_value, y_value)
I am stuck at making the calculations and managing the whole process after reading the values in. Moreover, it needs to accept however many coordinates there are and compute the area of the plot.
The top equation seems to suggest that you would do:
two_p = 0
# openpyxl rows are 1-indexed, so start at row 2 and stop before the last row
# to keep row_i-1 and row_i+1 in range
for row_i in range(2, sheet.max_row):
    x_i = float(sheet.cell(row=row_i, column=1).value)
    y_plus1 = float(sheet.cell(row=row_i+1, column=2).value)
    y_minus1 = float(sheet.cell(row=row_i-1, column=2).value)
    two_p += x_i*(y_plus1 - y_minus1)
print(two_p)
Is this what you had in mind? You would do something similar for the control equations.
EDIT: Added float conversion in the formulas above in case the data coming from Excel is formatted as text.
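For completeness, a sketch of the fully cyclic version (hedged: it assumes each worksheet row holds one x, y vertex listed in polygon order, since the shoelace sum wraps around from the last row back to the first):
# Cyclic shoelace sum: 2P = sum of x_i * (y_(i+1) - y_(i-1)), indices wrapping around
xs, ys = [], []
for row_i in range(1, sheet.max_row + 1):
    xs.append(float(sheet.cell(row=row_i, column=1).value))
    ys.append(float(sheet.cell(row=row_i, column=2).value))
n = len(xs)
two_p = sum(xs[i] * (ys[(i + 1) % n] - ys[i - 1]) for i in range(n))  # ys[-1] wraps in Python
print(abs(two_p) / 2)  # the area P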