How to use increments in Python

def number():
    b = 0.1
    while True:
        yield b
        b = b + 0.1

b = number()
for i in range(10):
    print(next(b))
Outputs
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0.9999999999999999
Then, I just want:
c = b * 2
print("c=" + str(c))
My expected outputs are
c=0.2
0.4
0.6
0.8
1
1.2
And so on.
Could you tell me what I have to do to get my expected outputs?

Floating point numbers are not precise. The more operations you perform on them, the more error they accumulate. To get the numbers you want, the best approach is to keep things integral for as long as possible:
def number():
    b = 1
    while True:
        yield b / 10.0
        b += 1
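Applied to the doubling from the question, a minimal sketch of how this generator keeps the printed values clean (note that Python prints 1.0 rather than the asker's expected 1):

gen = number()
for i in range(6):
    value = next(gen)              # 0.1, 0.2, 0.3, ... with no drift
    print("c=" + str(value * 2))   # c=0.2, c=0.4, c=0.6, c=0.8, c=1.0, c=1.2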

You can pass the number as an argument:
def number(start=0.1, num=0.1):
    b = start
    while True:
        yield round(b, 1)
        b += num

b = number(0, 0.2)
It yields:
0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.8

Like this?
for i in range(10):
    AnotherB = next(b)
    c = AnotherB * 2
    print(AnotherB)
    print("c=" + str(c))
Or do you mean: how do you reset a yield? Just redeclare it:
def number():
    b = 0.1
    while True:
        yield round(b, 1)
        b = b + 0.1

b = number()
for i in range(10):
    print(next(b))

b = number()
for i in range(10):
    print("c=" + str(next(b) * 2))

Related

How to create modified dataframe based on list values?

Consider a dataframe df of the following structure:
Name Slide Height Weight Status General
A X 3 0.1 0.5 0.2
B Y 10 0.2 0.7 0.8
...
I would like to create duplicates for each row in this dataframe (specific to the Name and Slide) for the following combinations of Height and Weight shown by this list:
list_combinations = [[3,0.1],[10,0.2],[5,1.3]]
The desired output:
Name Slide Height Weight Status General
A X 3 0.1 0.5 0.2 #original
A X 10 0.2 0.5 0.2 # modified duplicate
A X 5 1.3 0.5 0.2 # modified duplicate
B Y 10 0.2 0.7 0.8 #original
B Y 3 0.1 0.7 0.8 # modified duplicate
B Y 5 1.3 0.7 0.8 # modified duplicate
etc. ...
Any suggestions and help would be much appreciated.
We can do a merge with how='cross':
out = (pd.DataFrame(list_combinations, columns=['Height', 'Weight'])
         .merge(df, how='cross', suffixes=('', '_'))
         .reindex(columns=df.columns)
         .sort_values('Name'))
Name Slide Height Weight Status General
0 A X 3 0.1 0.5 0.2
2 A X 10 0.2 0.5 0.2
4 A X 5 1.3 0.5 0.2
1 B Y 3 0.1 0.7 0.8
3 B Y 10 0.2 0.7 0.8
5 B Y 5 1.3 0.7 0.8
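Note that how='cross' requires pandas 1.2+. On older versions, the same cross join can be emulated with a temporary constant join key; a minimal sketch, not part of the original answer:

# Emulate how='cross' on pandas < 1.2 with a temporary constant key
combos = pd.DataFrame(list_combinations, columns=['Height', 'Weight'])
out = (combos.assign(_key=1)
             .merge(df.assign(_key=1), on='_key', suffixes=('', '_'))
             .drop(columns='_key')
             .reindex(columns=df.columns)
             .sort_values('Name'))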

Summing subsets of many dataframes

I have ~1.2k files that when converted into dataframes look like this:
df1
A B C D
0 0.1 0.5 0.2 C
1 0.0 0.0 0.8 C
2 0.5 0.1 0.1 H
3 0.4 0.5 0.1 H
4 0.0 0.0 0.8 C
5 0.1 0.5 0.2 C
6 0.1 0.5 0.2 C
Now, I have to subset each dataframe with a window of fixed size along the rows, and add its contents to a second dataframe, with all its values originally initialized to 0.
df_sum
A B C
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
For example, let's set the window size to 3. The first subset therefore will be
window = df.loc[start:end, 'A':'C']
window
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
window.index = correct_index
df_sum = df_sum.add(window, fill_value=0)
df_sum
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
After that, the window will be the subset of df1 from rows 1-4, then rows 2-5, and finally rows 3-6. Once the first file has been scanned, the second file will begin, until all files have been processed. As you can see, this approach relies on df.loc for the subsetting and df.add for the addition. However, despite the ease of coding, it is very inefficient. On my machine it takes about 5 minutes to process the whole batch of 1.2k files of 200 lines each. I know that an implementation based on numpy arrays is orders of magnitude faster (about 10 seconds), but a bit more complicated in terms of subsetting and adding. Is there any way to increase the performance of this method while still using dataframes? For example, substituting loc with a more performant slicing method.
Example:
def generate_index_list(window_size):
    before_offset = -(window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    index_list = list()
    for n in range(before_offset, after_offset + 1):
        index_list.append(str(n))
    return index_list

window_size = 3
for file in os.listdir('.'):
    df1 = pd.read_csv(file, sep='\t')
    starting_index = (window_size - 1) // 2
    before_offset = (window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    for index in df1.index:
        if index < starting_index or index + before_offset + 1 > len(df1.index):
            continue
        indexes = generate_index_list(window_size)
        window = df1.loc[index - before_offset:index + after_offset, 'A':'C']
        window.index = indexes
        df_sum = df_sum.add(window, fill_value=0)
Expected output:
df_sum
A B C
0 1.0 1.1 2.0
1 1.0 1.1 2.0
2 1.1 1.6 1.4
Consider building a list of subsetted data frames with .loc and .head. Then run a groupby aggregation after the individual elements are concatenated.
window_size = 3

def window_process(file):
    csv_df = pd.read_csv(file, sep='\t')
    window_dfs = [(csv_df.loc[i:, ['A', 'B', 'C']]   # ROW AND COLUMN SLICE
                         .head(window_size)          # SELECT FIRST WINDOW ROWS
                         .reset_index(drop=True)     # RESET INDEX TO 0, 1, 2, ...
                  ) for i in range(csv_df.shape[0])]
    sum_df = (pd.concat(window_dfs)        # COMBINE WINDOW DFS
                .groupby(level=0).sum())   # AGGREGATE BY INDEX
    return sum_df

# BUILD LONG DF FROM ALL FILES
long_df = pd.concat([window_process(f) for f in os.listdir('.')])

# FINAL AGGREGATION
df_sum = long_df.groupby(level=0).sum()
Using the posted data sample, below are the individual elements of window_dfs:
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
A B C
0 0.0 0.0 0.8
1 0.5 0.1 0.1
2 0.4 0.5 0.1
A B C
0 0.5 0.1 0.1
1 0.4 0.5 0.1
2 0.0 0.0 0.8
A B C
0 0.4 0.5 0.1
1 0.0 0.0 0.8
2 0.1 0.5 0.2
A B C
0 0.0 0.0 0.8
1 0.1 0.5 0.2
2 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
1 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
And the final df_sum, to show it agrees with the DataFrame.add() approach:
df_sum
A B C
0 1.2 2.1 2.4
1 1.1 1.6 2.2
2 1.1 1.6 1.4
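The question notes that a NumPy implementation is orders of magnitude faster. A minimal sketch using numpy.lib.stride_tricks.sliding_window_view (requires NumPy 1.20+; it sums full windows only, like the asker's loop, so it reproduces the asker's expected output rather than the partial-window totals above):

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

window_size = 3
values = df1[['A', 'B', 'C']].to_numpy()    # shape (n_rows, n_cols)
# One entry per full window: shape (n_windows, n_cols, window_size)
windows = sliding_window_view(values, window_size, axis=0)
# Sum across windows, then transpose to (window_size, n_cols)
df_sum = pd.DataFrame(windows.sum(axis=0).T, columns=['A', 'B', 'C'])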

List column names if the value change from the max is within a certain % (in pandas)

Apologies for the unclear title. My data look like this (the values always sum to 1):
>df
A B C D E
0.3 0.3 0.05 0.2 0.05
What I want to do is identify the columns which:
1) have the highest value, or
2) whose % reduction from the highest is less than a threshold.
For example, assuming a 50% threshold, I want to end up with [A, B, D], based on the logic that:
1) A & B have the highest value.
2) 50% of A or B is 0.15. Since D is 0.2, it is added to the list.
3) 50% of D is 0.1. Since both C and E are less than 0.1, they are not added to the list.
I used the following test DataFrame:
A B C D E
0 0.3 0.3 0.05 0.2 0.05
1 0.5 0.1 0.20 0.1 0.10
Start by defining the following function to get column names for the current row:
def getCols(row, threshold):
    s = row.sort_values(ascending=False)
    currVal = 0.0
    lst = []
    for key, grp in s.groupby(s, sort=False):
        if len(lst) > 0 and key < currVal * threshold:
            break
        currVal = key
        lst.extend(grp.index.sort_values().tolist())
    return lst
Then apply it:
df['cols'] = df.apply(getCols, axis=1, threshold=0.5)
The result is:
A B C D E cols
0 0.3 0.3 0.05 0.2 0.05 [A, B, D]
1 0.5 0.1 0.20 0.1 0.10 [A]
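For readers who want the selection rule without pandas, here is a plain-Python sketch of the same chained-threshold logic for a single row given as a dict (a hypothetical helper, not from the answer above):

def chained_threshold_cols(row, threshold=0.5):
    # Sort column names by value, descending (stable, so ties keep input order)
    ordered = sorted(row, key=row.get, reverse=True)
    kept = [ordered[0]]
    curr = row[ordered[0]]
    for col in ordered[1:]:
        val = row[col]
        if val >= curr * threshold:
            kept.append(col)
            curr = val   # chain: the next comparison uses this value
        else:
            break
    return kept

print(chained_threshold_cols({'A': 0.3, 'B': 0.3, 'C': 0.05, 'D': 0.2, 'E': 0.05}))
# -> ['A', 'B', 'D']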

Python: While loop inside a for loop only running for the first instance of for variable

So here's my problem: I have a for loop with a variable k running from 1 to 10. Inside the for loop there is a while loop that seemingly only runs for the very first k and no others.
from numpy import exp

a = 0.0
N = 1
x = 0.1

def f(t):
    return exp(-t**2)

def int_trap(x, N):
    h = (x - a)/N
    s = 0.5*f(a) + 0.5*f(x)
    for i in range(1, N):
        s += f(a + i*h)
    return h*s

new_value = 1.0
old_value = 0.0
for k in range(1, 11):
    x = k/10
    while abs(new_value - old_value) > 10**-6:
        old_value = new_value
        N = N*2
        new_value = int_trap(x, N)
        print(N, '\t', x, '\t', abs(new_value - old_value))
    print(x)
The print(x) at the end is there to confirm that the code is running through the k's.
And here's the output:
2 0.1 0.900373598036
4 0.1 3.09486672713e-05
8 0.1 7.73536466929e-06
16 0.1 1.93372859864e-06
32 0.1 4.83425115119e-07
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
The for loop runs fine through all k values; it just doesn't re-enter the while loop, because you don't reset the new_value and old_value variables inside the for loop. If we add some more detail to the prints in the original loop:
for k in range(1, 11):
    x = k/10
    while abs(new_value - old_value) > 10**-6:
        old_value = new_value
        N = N*2
        new_value = int_trap(x, N)
        print(N, '\t', x, '\t', abs(new_value - old_value), 'In while for x={} and k={}'.format(x, k))
    print(x, '\tThis is me completing the loop for k=', k)
We see that it is correctly running for all k values:
2 0.1 0.900373598036 In while for x=0.1 and k=1
4 0.1 3.09486672713e-05 In while for x=0.1 and k=1
8 0.1 7.73536466929e-06 In while for x=0.1 and k=1
16 0.1 1.93372859864e-06 In while for x=0.1 and k=1
32 0.1 4.83425115119e-07 In while for x=0.1 and k=1
0.1 This is me completing the loop for k= 1
0.2 This is me completing the loop for k= 2
0.3 This is me completing the loop for k= 3
0.4 This is me completing the loop for k= 4
0.5 This is me completing the loop for k= 5
0.6 This is me completing the loop for k= 6
0.7 This is me completing the loop for k= 7
0.8 This is me completing the loop for k= 8
0.9 This is me completing the loop for k= 9
1.0 This is me completing the loop for k= 10
So try the following:
for k in range(1, 11):
    x = k/10
    new_value = 1.0
    old_value = 0.0
    while abs(new_value - old_value) > 10**-6:
        old_value = new_value
        N = N*2
        new_value = int_trap(x, N)
        print(N, '\t', x, '\t', abs(new_value - old_value), 'In while for x={} and k={}'.format(x, k))
    print(x, '\tThis is me completing the loop for k=', k)
Note that N also carries over between values of k; if each k should start from a small N, reset N = 1 inside the for loop as well.
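An alternative that avoids the shared-state pitfall altogether is to wrap the convergence loop in a function, so every k starts fresh. A sketch using int_trap from above, not from the original answer:

def converge(x, tol=1e-6):
    # Double N until two successive trapezoid estimates agree to tol
    N = 1
    old_value = 0.0
    new_value = int_trap(x, N)
    while abs(new_value - old_value) > tol:
        old_value = new_value
        N = N * 2
        new_value = int_trap(x, N)
    return new_value

for k in range(1, 11):
    x = k/10
    print(x, converge(x))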

Rearranging a matrix with named columns/rows in Python

I am stuck (and in a bit of a time crunch) and was hoping for some help. This is probably a simple task but I can't seem to solve it.
I have a matrix, say 5 by 5, with an additional starting column of names for the rows and the same names for the columns in a text file like this:
b e a d c
b 0.0 0.1 0.3 0.2 0.5
e 0.1 0.0 0.4 0.9 0.3
a 0.3 0.4 0.0 0.7 0.6
d 0.2 0.9 0.7 0.0 0.1
c 0.5 0.3 0.6 0.1 0.0
I have multiple files that have the same format and size of matrix, but the order of the names is different. I need a way to change these around so they are all the same and maintain the 0.0 diagonal. So any swapping I do to the columns I must do to the rows.
I have been searching a bit and it seems like NumPy might do what I want but I have never worked with it or arrays in general. Any help is greatly appreciated!
In short: How do I get a text file into an array which I can then swap around rows and columns to a desired order?
I suggest you use pandas:
from io import StringIO
import pandas as pd

data = StringIO("""b e a d c
b 0.0 0.1 0.3 0.2 0.5
e 0.1 0.0 0.4 0.9 0.3
a 0.3 0.4 0.0 0.7 0.6
d 0.2 0.9 0.7 0.0 0.1
c 0.5 0.3 0.6 0.1 0.0
""")
df = pd.read_csv(data, sep=" ")
print(df.sort_index().sort_index(axis=1))
output:
a b c d e
a 0.0 0.3 0.6 0.7 0.4
b 0.3 0.0 0.5 0.2 0.1
c 0.6 0.5 0.0 0.1 0.3
d 0.7 0.2 0.1 0.0 0.9
e 0.4 0.1 0.3 0.9 0.0
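If the files must all end up in one specific order rather than alphabetical, pandas can reorder both axes at once with reindex. A small sketch, assuming a hypothetical target order:

order = ['a', 'b', 'c', 'd', 'e']   # hypothetical desired order
df2 = df.reindex(index=order, columns=order)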
Here's the start of a horrific Numpy version (use HYRY's answer...)
import numpy as np

with open("myfile", "r") as myfile:
    lines = myfile.read().split("\n")
floats = [[float(item) for item in line.split()[1:]] for line in lines[1:]]
floats_transposed = np.array(floats).transpose().tolist()
from copy import copy

f = open('input', 'r')
data = []
for line in f:
    row = line.rstrip().split(' ')
    data.append(row)

# collect labels, strip empty spaces
r = data.pop(0)
c = [row.pop(0) for row in data]
r.pop(0)

origrow, origcol = copy(r), copy(c)
r.sort()
c.sort()

newgrid = []
for row, rowtitle in enumerate(r):
    fromrow = origrow.index(rowtitle)
    newgrid.append(list(range(len(c))))
    for col, coltitle in enumerate(c):
        # We ask this len(row) times, so memoization
        # might matter on a large matrix
        fromcol = origcol.index(coltitle)
        newgrid[row][col] = data[fromrow][fromcol]

print("\t".join([''] + r))
clabel = iter(c)
for line in newgrid:
    print("\t".join([next(clabel)] + line))
