Get a value from data that depends on other data - Python

I am trying to read an input l that will be between 0 and 1; that input should select column A. The second input, w, corresponds to the 'mesafe' column, so the result should be 23, which is the value for column A in mesafe's first row. I get an error.
import pandas as pd
import numpy as np

def var():
    df = pd.read_csv('l_y.txt')
    l = float(input("speed of the wind:"))
    w = int(input("range:"))
    for l in range(0, 1) and w in range(0, 100):
        print(df['A'].values[0])
l_y.txt:
   mesafe      A     B     C     D     E     F
0     100     23    18    14     8     4     0
1    1000    210   170   110    60    40    30
2    5000    820   510   380   300   230   160
3   10000   1600  1200   820   560   400   250
4   20000   2800  2100  1600  1000   820   500
5   50000   5900  4600  3400  3200  1600  1100
6  100000  10000  8100  6100  3900  2800  2000
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    var()
  File "C:\Users\user\AppData\Local\Programs\Python\Python36-32\ml.py", line 8, in var
    for l in range(0, 1) and w in range(0, 100) :
TypeError: 'bool' object is not iterable

You have to follow the syntax of the language. In this case, if you want to run a double loop over l and w:
for l in range(0, 1):
    for w in range(0, 100):
        print(df['A'].values[0])
Note the indentation; if you don't indent properly, the Python code won't work correctly.
Also, your code here doesn't seem to accomplish anything except printing the same value 100 times. What are you trying to do?

Ok, I assume you want to get a certain value from your matrix, depending on two input values l and w, so that if l is between 0 and 1, column 'A' should be selected. (I further assume that if l is between 1 and 2 it's column 'B', 2 <= l < 3 -> 'C', and so on...)
The row is derived directly from w using the data in the mesafe column: if w is between 0 and 100 -> row 0, between 100 and 1000 -> row 1, and so on...
Well, this can be achieved as follows:
l = .3 # let's say user types 0.3
There is some mapping between l and the letters:
l_mapping = [1, 5, 12, 20, 35, 50] # These are the thresholds between the columns A...F
l_index = np.searchsorted(l_mapping, l) # calculate the index of the column letter
col = df.columns[1:][l_index] # this translates l to the column names from A...F
col # look what col is when l was < 1
Out: 'A'
w = 42 # now the user input for w is 42
row = np.searchsorted(df.mesafe.values, w) # this finds the insertion index of w in df.mesafe
row
Out: 0
So with these two lookups you get the column and row information to index your desired result:
df[col].iloc[row]
Out: 23
Summing this all up in a function would look like this:
def get_l_y(l, w, df_ly):
    l_mapping = [1, 5, 12, 20, 35, 50]
    l_index = np.searchsorted(l_mapping, l)
    col = df_ly.columns[1:][l_index]
    row = np.searchsorted(df_ly.mesafe.values, w)
    print(l, w, col, row)  # print the inputs and the calculated row/column for testing; delete or comment out once everything works
    return df_ly[col].iloc[row]
This function expects l, w and the pandas DataFrame of your matrix as input parameters and returns the looked-up value.
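A minimal usage sketch (assuming the table in l_y.txt is whitespace-separated as shown above, hence the sep=r'\s+'; the prompts mirror the ones in the question):
import numpy as np
import pandas as pd

df_ly = pd.read_csv('l_y.txt', sep=r'\s+')  # the table shown above is whitespace-separated

l = float(input("speed of the wind:"))      # e.g. 0.3
w = int(input("range:"))                    # e.g. 42

print(get_l_y(l, w, df_ly))                 # -> 23 when l < 1 and w <= 100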

Using np.vectorize to create a column in a data frame

I have a data frame that contains two columns with numbers and a third column with repeating letters. Let's say something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('xy'))
letters = ['A', 'B', 'C', 'D'] * int(len(df.index) / 4)
df['letters'] = letters
I want to create two new columns that compare the numbers in columns 'x' and 'y' to the average of their corresponding letters. For example, one new column will just contain the number 10 (if the value is at least 20% above the mean), -10 (if it is at least 20% below the mean), or else 0.
I wrote the function below:
def scoreFunHigh(dataField, mean, diff, multip):
    upper = mean * (1 + diff)
    lower = mean * (1 - diff)
    if dataField > upper:
        return multip * 1
    elif dataField < lower:
        return multip * (-1)
    else:
        return 0
And then created the column as follows:
letterMeanX = df.groupby('letters')['x'].transform(np.nanmean)
df['letter x score'] = np.vectorize(scoreFunHigh)(df['x'], letterMeanX, 0.2, 10)
letterMeanY = df.groupby('letters')['y'].transform(np.nanmean)
df['letter y score'] = np.vectorize(scoreFunHigh)(df['y'], letterMeanY, 0.3, 5)
This works. However, I am getting the runtime warning below:
C:\Users\ ... \Python\Python38\lib\site-packages\numpy\lib\function_base.py:2167: RuntimeWarning: invalid value encountered in ? (vectorized)
outputs = ufunc(*inputs)
(Please note that if I run the exact same code as above I do not get the warning. My real dataframe is much larger and I am using several functions for different data.)
What is the problem here? Is there a better way to set this up?
Thank you very much
The sample you give does not produce the RuntimeWarning, so we can't do anything to help you diagnose it. I don't know if a fuller traceback provides any useful information.
But let's look at the calculations:
In [70]: np.vectorize(scoreFunHigh)(df['x'], letterMeanX, 0.2, 10)
Out[70]:
array([-10, 0, 10, -10, 0, 0, -10, -10, 10, 0, 0, 10, -10,
-10, 0, 10, 10, -10, 0, 10, -10, -10, -10, 10, 10, -10,
...
-10, 10, -10, 0, 0, 10, 10, 0, 10])
and with the df assignment:
In [74]: df['letter x score'] = np.vectorize(scoreFunHigh)(df['x'], letterMeanX,
...: 0.2, 10)
...:
In [75]: df
Out[75]:
x y letters letter x score
0 33 98 A -10
1 38 49 B 0
2 78 46 C 10
3 31 46 D -10
4 41 74 A 0
.. .. .. ... ...
95 51 4 D 0
96 70 4 A 10
97 74 74 B 10
98 54 70 C 0
99 87 44 D 10
Often np.vectorize gives problems because of the otypes issue (read the docs): if the trial calculation produces an integer, then the return dtype is set to that, giving problems if other values are floats. However, in this case the result can only be one of three values, [-10, 0, 10] (driven by the last parameter).
The warning, as far as you provide it, suggests that some value(s) in the larger dataframe are invalid for the calculations in your scoreFunHigh function. But the warning doesn't give enough detail to say which.
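If the warning really does come from np.vectorize's return-type inference, one thing worth trying (just a sketch, not a confirmed fix for your data) is pinning the output dtype with the otypes argument:
# pin the output dtype so np.vectorize doesn't infer it from the first call
score_vec = np.vectorize(scoreFunHigh, otypes=[float])
df['letter x score'] = score_vec(df['x'], letterMeanX, 0.2, 10)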
It is relatively easy to apply real numpy vectorization to this problem, since it depends on two Series, df['x'] and letterMeanX, and two scalars.
In [111]: letterMeanX = df.groupby('letters')['x'].transform(np.nanmean)
In [112]: letterMeanX.shape
Out[112]: (100,)
In [113]: df['x'].shape
Out[113]: (100,)
In [114]: upper = letterMeanX *(1+0.2)
In [115]: lower = letterMeanX *(1-0.2)
In [116]: res = np.zeros(letterMeanX.shape,int)
In [117]: res[df['x']>upper] = 10
In [118]: res[df['x']<lower] = -10
In [119]: np.allclose(res, Out[70])
Out[119]: True
In other words, rather than applying the upper/lower comparison row by row, it applies it to the whole Series. It is still iterating, but in compiled numpy methods, which are much faster. np.vectorize is just a wrapper around an iteration. It still calls your python function once for each row. Hopefully the performance disclaimer is clear enough.
Consider calling your function directly, with a slight adjustment to handle the conditional logic using numpy.select (or numpy.where). With this approach no Python-level loops are run, only vectorized operations on the Series and scalar parameters:
def scoreFunHigh(dataField, mean, diff, multip):
    conds = [dataField > mean * (1 + diff),
             dataField < mean * (1 - diff)]
    vals = [multip * 1, multip * (-1)]
    return np.select(conds, vals, default=0)
letterMeanX = df.groupby('letters')['x'].transform(np.nanmean)
df['letter x score'] = scoreFunHigh(df['x'], letterMeanX, 0.2, 10)
letterMeanY = df.groupby('letters')['y'].transform(np.nanmean)
df['letter y score'] = scoreFunHigh(df['y'], letterMeanY, 0.3, 5)
Here is a version that doesn't use np.vectorize:
def scoreFunHigh(val, mean, diff, multip):
    upper = mean * (1 + diff)
    lower = mean * (1 - diff)
    if val > upper:
        return multip * 1
    elif val < lower:
        return multip * (-1)
    else:
        return 0

letterMeanX = df.groupby('letters')['x'].apply(lambda x: np.nanmean(x))
df['letter x score'] = df.apply(
    lambda row: scoreFunHigh(row['x'], letterMeanX[row['letters']], 0.2, 10), axis=1)
Output
x y letters letter x score
0 52 76 A 0
1 90 99 B 10
2 87 43 C 10
3 44 73 D 0
4 49 3 A 0
.. .. .. ... ...
95 16 51 D -10
96 38 3 A 0
97 43 47 B 0
98 58 39 C 0
99 41 26 D 0

Replicate the tuple values to create random dataset in python [duplicate]

I have a pandas DataFrame with 100,000 rows and want to split it into 100 sections with 1000 rows in each of them.
How do I draw a random sample of certain size (e.g. 50 rows) of just one of the 100 sections? The df is already ordered such that the first 1000 rows are from the first section, next 1000 rows from another, and so on.
You can use the sample method*:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
In [12]: df.sample(2)
Out[12]:
A B
0 1 2
2 5 6
In [13]: df.sample(2)
Out[13]:
A B
3 7 8
0 1 2
*On one of the section DataFrames.
Note: If you have a larger sample size than the size of the DataFrame, this will raise an error unless you sample with replacement.
In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'
In [15]: df.sample(5, replace=True)
Out[15]:
A B
0 1 2
1 3 4
2 5 6
3 7 8
1 3 4
One solution is to use the choice function from numpy.
Say you want 50 entries out of 1000, you can use:
import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]
This is of course not considering your block structure. If you want a 50-item sample from block i, for example, you can do:
import numpy as np
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
You could add a "section" column to your data then perform a groupby and sample:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)
# >>> df
# x section
# 0 0 0
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 0
# ... ... ...
# 99995 99995 99
# 99996 99996 99
# 99997 99997 99
# 99998 99998 99
# 99999 99999 99
#
# [100000 rows x 2 columns]
sample = df.groupby("section").sample(50)
# >>> sample
# x section
# 907 907 0
# 494 494 0
# 775 775 0
# 20 20 0
# 230 230 0
# ... ... ...
# 99740 99740 99
# 99272 99272 99
# 99863 99863 99
# 99198 99198 99
# 99555 99555 99
#
# [5000 rows x 2 columns]
with an additional .query("section == 42") (or whatever section you need) if you are interested in only a particular section.
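For example (a small sketch; 42 is just an illustrative section number):
# sample 50 rows per section, then keep only section 42
sample_42 = df.groupby("section").sample(50).query("section == 42")

# or sample from that one section directly
sample_42 = df[df["section"] == 42].sample(50)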
Note this requires pandas 1.1.0, see the docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html
For older versions, see the answer by #msh5678
Thank you, Jeff, but I received an error:
AttributeError: Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method
So, instead of sample = df.groupby("section").sample(50), I suggest using the command below:
df.groupby('section').apply(lambda grp: grp.sample(50))
This is a nice place for recursion.
def main2():
    rows = 8  # say you have 8 rows; for real data use the actual row count, e.g. len(df.index)
    rands = []
    for i in range(rows):
        gen = fun(rands)
        rands.append(gen)
    print(rands)  # now range through random values

def fun(rands):
    gen = np.random.randint(0, 8)
    if gen in rands:
        a = fun(rands)
        return a
    else:
        return gen

if __name__ == "__main__":
    main2()
output: [6, 0, 7, 1, 3, 5, 4, 2]

Take the sum of every N rows in a pandas series

Suppose
s = pd.Series(range(50))
0 0
1 1
2 2
3 3
...
48 48
49 49
How can I get a new series that consists of the sum of every n rows?
The expected result is like below, when n = 5:
0 10
1 35
2 60
3 85
...
8 210
9 235
It can of course be accomplished using loc or iloc and a Python loop, however I believe it could be done simply in a pandas way.
Also, this is a very simplified example; I don't expect an explanation of the sequences :). The actual data series I'm working with has a time index and the number of events that occurred in every second as the values.
GroupBy.sum
N = 5
s.groupby(s.index // N).sum()
0 10
1 35
2 60
3 85
4 110
5 135
6 160
7 185
8 210
9 235
dtype: int64
Chunk the index into groups of 5 and group accordingly.
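A quick illustration of what that grouping key looks like (just the first few entries):
(s.index // N).tolist()[:12]
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2]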
numpy.reshape + sum
If the size is a multiple of N (or 5), you can reshape and add:
s.values.reshape(-1, N).sum(1)
# array([ 10, 35, 60, 85, 110, 135, 160, 185, 210, 235])
numpy.add.at
b = np.zeros(len(s) // N)
np.add.at(b, s.index // N, s.values)
b
# array([ 10., 35., 60., 85., 110., 135., 160., 185., 210., 235.])
The most efficient solution I can think of is f1() in my example below. It is orders of magnitude faster than using the groupby in the other answer.
Note that f1() doesn't work when the length of the array is not an exact multiple of the group size, e.g. if you want to sum a 3-item array every 2 items.
For those cases, you can use f1v2():
f1v2( [0,1,2,3,4] ,2 ) = [1,5,4]
My code is below. I have used timeit for the comparisons:
import timeit
import numpy as np
import pandas as pd

def f1(a, x):
    if isinstance(a, pd.Series):
        a = a.to_numpy()
    return a.reshape((int(a.shape[0] / x), int(x))).sum(1)

def f2(myarray, x):
    return [sum(myarray[n: n + x]) for n in range(0, len(myarray), x)]

def f3(myarray, x):
    s = pd.Series(myarray)
    out = s.groupby(s.index // x).sum()
    return out

def f1v2(a, x):
    if isinstance(a, pd.Series):
        a = a.to_numpy()
    mod = a.shape[0] % x
    if mod != 0:
        excl = a[-mod:]
        keep = a[: len(a) - mod]
        out = keep.reshape((int(keep.shape[0] / x), int(x))).sum(1)
        out = np.hstack((out, excl.sum()))
    else:
        out = a.reshape((int(a.shape[0] / x), int(x))).sum(1)
    return out

a = np.arange(0, 1e6)
out1 = f1(a, 2)
out2 = f2(a, 2)
out3 = f3(a, 2)

t1 = timeit.Timer("f1(a,2)", globals=globals()).repeat(repeat=5, number=2)
t1v2 = timeit.Timer("f1v2(a,2)", globals=globals()).repeat(repeat=5, number=2)
t2 = timeit.Timer("f2(a,2)", globals=globals()).repeat(repeat=5, number=2)
t3 = timeit.Timer("f3(a,2)", globals=globals()).repeat(repeat=5, number=2)

resdf = pd.DataFrame(index=['min time'])
resdf['f1'] = [min(t1)]
resdf['f1v2'] = [min(t1v2)]
resdf['f2'] = [min(t2)]
resdf['f3'] = [min(t3)]
# the timeit docs explain why it makes more sense to take the min than the avg
resdf = resdf.transpose()
resdf['% difference vs fastest'] = (resdf / resdf.min() - 1) * 100

b = np.array([0, 1, 2, 4, 5, 6, 7])
out1v2 = f1v2(b, 2)

Filling in missing data in Python

I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value, it will move from X1 to X2 which are separated several millimeters apart. Then it will move to the next line and over again.
I am able to process the data in python with pandas/numpy but it doesn't work too well when there are any missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however, a line or a few will be missing, making it impossible to process and plot. Currently I have not been able to fix this automatically and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate property 1 and 2 with values of 0. This can be time consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line?) For instance, by having a 'if' function that will output each line between X(start) and X(end) to a separate array? I've tried doing that but with no success.
I've attached my current (crude) code:
df = pd.read_csv('FileName.csv', delimiter=',', skiprows=0)
rows = [-1] + np.where(df['X'] == 'New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r+1: rows[i+1]]
maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
    data2.append(dff[yaxes].ix[:, 2])
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the 'reindex' and 'reset_index' functions, but I haven't managed to make them work.
I would appreciate any suggestions.
Does this meet what you want?
Q1: fill X using reindex, and others using fillna
Q2: Passing separated StringIO to read_csv is easier (change if you use Python 3)
from StringIO import StringIO  # Python 2; on Python 3 use "from io import StringIO" (see below)

# read file and split the input
f = open('temp.csv', 'r')
chunks = f.read().split('New')

# read csv as separated dataframes, using first column as index
dfs = [pd.read_csv(StringIO(unicode(chunk)), header=None, index_col=0) for chunk in chunks]

def pad(df):
    # reindex; you should know the range of x
    df = df.reindex(np.arange(44, 48))
    # pad y forward / backward, assuming y should have a single value per chunk
    df[1] = df[1].fillna(method='bfill')
    df[1] = df[1].fillna(method='ffill')
    # padding others
    df = df.fillna(0)
    # revert index to values
    return df.reset_index(drop=False)

dfs = [pad(df) for df in dfs]
dfs[0]
dfs[0]
# 0 1 2 3
# 0 44 11 500 1
# 1 45 11 120 2
# 2 46 11 320 3
# 3 47 11 700 4
# dfs[1]
# 0 1 2 3
# 0 44 12 0 0
# 1 45 12 100 5
# 2 46 12 1500 6
# 3 47 12 2500 7
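A sketch of the same idea on Python 3, where unicode no longer exists and StringIO lives in the io module (pad() is the function defined above):
from io import StringIO
import numpy as np
import pandas as pd

with open('temp.csv') as f:
    chunks = f.read().split('New')

dfs = [pd.read_csv(StringIO(chunk), header=None, index_col=0) for chunk in chunks]
dfs = [pad(df) for df in dfs]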
First Question
I've included print statements inside the function to explain how it works.
In [89]:
def replace_missing(df, Ids):
    # check which values are missing
    missing = np.setdiff1d(Ids, df[0])
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        #print('---missing df---')
        #print(missing_df)
        missing_df[0] = missing
        #print('---missing df---')
        #print(missing_df)
        missing_df[1].replace(0, df[1].iloc[0], inplace=True)
        #print('---missing df---')
        #print(missing_df)
        df = pd.concat([df, missing_df])
        #print('---final df---')
        #print(df)
    return df

In [91]:
Ids = np.arange(44, 48)
final_df = df1.groupby(df1[1], as_index=False).apply(replace_missing, Ids).reset_index(drop=True)
final_df
Out[91]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
45 12 100 5
46 12 1500 6
47 12 2500 7
44 12 0 0
Second question
In [92]:
group = final_df.groupby(final_df[1])
In [99]:
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Out[104]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4

Finding row wise average and average of all rows of space delimited file

I have a file which has many lines and each row would look like below:
10 55 19 51 2 9 96 64 60 2 45 39 99 60 34 100 33 71 49 13
77 3 32 100 68 90 44 100 10 52 96 95 36 50 96 39 81 25 26 13
Each line has numbers separated by spaces, and each line (row) can be of a different length.
How can I find the average of each row?
How can I find Sum of all the row wise averages?
Preferred language Python
The code below does the task mentioned:
def rowAverageSum(filename):
    import numpy as np
    FullMean = 0
    li = [map(int, x) for x in [i.strip().split() for i in open(filename).readlines()]]
    i = 0
    while i < len(li):
        for k in li:
            print "Mean of row ", i+1, ":", np.mean(k)
            FullMean += np.mean(k)
            i += 1
    print "***************************"
    print "Grand Average:", FullMean
    print "***************************"
Using two utility functions words (to get the words in a line) and average (to get the average of a sequence of integers), I'd start with something like
def words(s):
    return (w for w in s.strip().split())

def average(l):
    return sum(l) / len(l)

with open('input.txt') as f:
    averages = [average(map(int, words(line))) for line in f]

total = sum(averages)
I like the total = sum(averages) part which very closely resembles your second requirement (the sum of all averages). :-)
I used map(int, words(line)) (to convert a list of strings to a list of integers) simply because it's shorter than [int(x) for x in words(line)] even though the latter would most certainly be considered to be "more Pythonic".
How about trying this in a short way?
avg_per_row = []
avg_all_row = 0
f1 = open("myfile")  # default mode is read
for line in f1:
    temp = line.split()
    avg = sum([int(x) for x in temp]) / len(temp)
    avg_per_row.append(avg)  # average per row

avg_all_row = sum(avg_per_row) / len(avg_per_row)  # average of all the row averages
Very compressed, but should work for you
In Python 2, 3 / 2 is 1. If you want a float result you should convert to float:
float(3) / 2 is 1.5
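A quick illustration (Python 2 semantics; on Python 3, / always does true division and // does floor division):
>>> 3 / 2          # Python 2: integer division
1
>>> float(3) / 2
1.5
>>> 3 // 2         # explicit floor division, same result on Python 2 and 3
1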
>>> s = '''10 55 19 51 2 9 96 64 60 2 45 39 99 60 34 100 33 71 49 13
77 3 32 100 68 90 44 100 10 52 96 95 36 50 96 39 81 25 26 13'''
>>> line_averages = []
>>> for line in s.splitlines():
... line_averages.append(sum([ float(ix) for ix in line.split()]) / len(line.split()))
...
>>> line_averages
[45.55, 56.65]
>>> sum(line_averages)
102.19999999999999
Or you can use reduce
>>> for line in s.splitlines():
... line_averages.append(reduce(lambda x,y: int(x) + int(y), line.split()) / len(line.split()))
>>> line_averages
[45, 56]
>>> reduce(lambda x,y: int(x) + int(y), line_averages)
101
>>> f = open('yourfile')
>>> averages = [ sum(map(float,x.strip().split()))/len(x.strip().split()) for x in f ]
>>> averages
[45.55, 56.65]
>>> sum(averages)
102.19999999999999
>>> sum(averages)/len(averages)
51.099999999999994
strip removes '\n', then split splits on whitespace and gives a list of the numbers, map converts all the numbers to float, and sum adds them all up.
If you don't understand the above code, you can look at this; it's the same as above but expanded:
>>> f = open('ll.txt')
>>> averages = []
>>> for x in f:
... x = x.strip() # removes newline character
... x = x.split() # split the lines on whitespaces and produces list of numbers
... x = [ float(i) for i in x ] # convert all number to type float
... avg = sum(x)/len(x) # calculate average ans store to variable avg
... averages.append(avg) # append the avg to list averages
...
>>> averages
[45.55, 56.65]
>>> sum(averages)/len(averages)
51.099999999999994
