Replace missing values with the average of their neighbors (time series) - Python

I want to replace all missing values in my dataset with the average of the two nearest neighbors, except for the first and last cells, and except when a neighbor is 0 (then I fix the values manually). I coded this and it works, but the solution is not very smart. Is there another way to do it faster? Is the interpolate method suitable for this? I'm not quite sure how it works.
Input:
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 0.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 0.0
3 1592.0 0.0 0.0 1571.0 1647.0 0.0
Output:
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 1486.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 1540.5
3 1592.0 0.0 0.0 1571.0 1647.0 0.0
Code:
data_len = len(df)
first_col = str(df.columns[0])
last_col = str(df.columns[len(df.columns) - 1])
d = df.apply(lambda s: pd.to_numeric(s, errors="coerce"))
m = d.eq(0) | d.isna()
s = m.stack()
missing = s[s].index.tolist()  # list of (row, column) indices of missing values
count = len(missing)
for el in missing:
    # skip the very first and very last cell
    if el == ('0', first_col) or el == (str(data_len - 1), last_col):
        continue
    # neighbour to the right (wrapping to the next row after the last column)
    nxt = df.at[str(int(el[0]) + 1), first_col] if el[1] == last_col else df.at[el[0], str(int(el[1]) + 1)]
    # neighbour to the left (wrapping to the previous row before the first column)
    prv = df.at[str(int(el[0]) - 1), last_col] if el[1] == first_col else df.at[el[0], str(int(el[1]) - 1)]
    if prv == 0 or nxt == 0:
        continue
    df.at[el[0], el[1]] = (prv + nxt) / 2
JSON of example:
{"0":{"0":0.0,"1":1554.0,"2":1588.0,"3":0.0},"1":{"0":1596.0,"1":1506.0,"2":1510.0,"3":0.0},"2":{"0":1578.0,"1":0.0,"2":1495.0,"3":1561.0},"3":{"0":1567.0,"1":1466.0,"2":1485.0,"3":1571.0},"4":{"0":1580.0,"1":1469.0,"2":1489.0,"3":1647.0},"5":{"0":1649.0,"1":1503.0,"2":0.0,"3":0.0}}

Here's one approach using shift to average the neighbours' values and slice-assigning back to the dataframe:
m = df==0
r = (df.shift(axis=1)+df.shift(-1,axis=1))/2
df.iloc[1:-1,1:-1] = df.mask(m,r)
print(df)
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 1486.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 0.0
3 0.0 0.0 1561.0 1571.0 1647.0 0.0
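As for the interpolate question: something along these lines might work too, with caveats. This is a sketch, assuming zeros are the only missing-value marker; note that a run of several consecutive zeros is linearly interpolated rather than skipped, interpolation stays within each row (no wrapping into the neighbouring row like the loop above), and boundary zeros come back as NaN rather than staying 0:
# mask zeros as NaN, then interpolate linearly along each row;
# a single interior NaN becomes exactly the mean of its two neighbours
filled = df.mask(df.eq(0)).interpolate(axis=1, limit_area='inside')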

Related

How to apply a function to each row of a dataframe and get the results back

This is my DataFrame:
3 4 5 6 97 98 99 100
0 1.0 2.0 3.0 4.0 95.0 96.0 97.0 98.0
1 50699.0 16302.0 50700.0 16294.0 50735.0 16334.0 50737.0 16335.0
2 57530.0 33436.0 57531.0 33438.0 NaN NaN NaN NaN
3 24014.0 24015.0 34630.0 24016.0 NaN NaN NaN NaN
4 44933.0 2611.0 44936.0 2612.0 44982.0 2631.0 44972.0 2633.0
1792 46712.0 35340.0 46713.0 35341.0 46759.0 35387.0 46760.0 35388.0
1793 61283.0 40276.0 61284.0 40277.0 61330.0 40323.0 61331.0 40324.0
1794 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1795 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1796 27156.0 48331.0 27157.0 48332.0 NaN NaN NaN NaN
How do I apply the function below and get the answers back for each row in one run?
values is the array of values of each row, and N is 100.
def entropy_s(values, N):
    a = scipy.stats.entropy(values, base=2)
    a = round(a, 2)
    global CONSTANT_COUNT, RANDOM_COUNT, LOCAL_COUNT, GLOBAL_COUNT, ODD_COUNT
    if math.isnan(a) == True:
        a = 0.0
    if a == 0.0:
        CONSTANT_COUNT += 1
    elif a < round(math.log2(N), 2):
        LOCAL_COUNT += 1
        RANDOM_COUNT += 1
    elif a == round(math.log2(N), 2):
        RANDOM_COUNT += 1
        GLOBAL_COUNT += 1
        LOCAL_COUNT += 1
    else:
        ODD_COUNT += 1
I assume that the values are supposed to be rows? In that case, I suggest the following: rows will be fed to the function, and you can get each column within a row using row.column_name.
def func(N=100):
    def entropy_s(values):
        a = scipy.stats.entropy(values, base=2)
        a = round(a, 2)
        global CONSTANT_COUNT, RANDOM_COUNT, LOCAL_COUNT, GLOBAL_COUNT, ODD_COUNT
        if math.isnan(a) == True:
            a = 0.0
        if a == 0.0:
            CONSTANT_COUNT += 1
        elif a < round(math.log2(N), 2):
            LOCAL_COUNT += 1
            RANDOM_COUNT += 1
        elif a == round(math.log2(N), 2):
            RANDOM_COUNT += 1
            GLOBAL_COUNT += 1
            LOCAL_COUNT += 1
        else:
            ODD_COUNT += 1
    return entropy_s

df.apply(func(100), axis=1)
If you want to have the rows as a list, you can do this:
df.apply(lambda x: func(100)([k for k in x]), axis=1)
import functools
series = df.apply(functools.partial(entropy_s, N=100), axis=1)
# or
series = df.apply(lambda x: entropy_s(x, N=100), axis=1)
axis=1 will push the rows of your df into the first argument of the function.
You will get a pd.Series of None's though, because your function doesn't return anything.
I highly suggest to avoid using globals in your function.
EDIT: If you want meaningful help, you need to ask meaningful questions. Which errors are you getting?
Here is a quick and dirty example that demonstrates what I've suggested. If you have an error, your function likely has a bug (for example, it doesn't return anything), or it doesn't know how to handle NaN.
In [6]: df = pd.DataFrame({1: [1, 2, 3], 2: [3, 4, 5], 3: [6, 7, 8]})
In [7]: df
Out[7]:
1 2 3
0 1 3 6
1 2 4 7
2 3 5 8
In [8]: df.apply(lambda x: np.sum(x), axis=1)
Out[8]:
0 10
1 13
2 16
dtype: int64
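Building on the no-globals advice: one way to restructure this is to have the function return a label per row and do the counting afterwards. This is only a sketch with my own classify_row name and label strings, and it returns a single label per row rather than incrementing several counters at once like the original:
import math
from collections import Counter
import scipy.stats

def classify_row(values, N=100):
    # entropy of one row, rounded as in the original code
    a = round(scipy.stats.entropy(values, base=2), 2)
    if math.isnan(a) or a == 0.0:
        return 'constant'
    if a < round(math.log2(N), 2):
        return 'local'
    if a == round(math.log2(N), 2):
        return 'random'
    return 'odd'

labels = df.apply(classify_row, axis=1)
print(Counter(labels))  # counts per label, no global state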

Operations on specific elements of a dataframe in Python

I'm trying to convert kilometer values in one column of a dataframe to mile values. I've tried various things and this is what I have now:
def km_dist(column, dist):
    length = len(column)
    for dist in zip(range(length), column):
        if (column == data["dist"] and dist in data.loc[(data["dist"] > 25)]):
            return dist / 5820
        else:
            return dist

data = data.apply(lambda x: km_dist(data["dist"], x), axis=1)
The dataset I'm working with looks something like this:
past_score dist income lab score gender race income_bucket plays_sports student_id lat long
0 8.091553 11.586920 67111.784934 0 7.384394 male H 3 0 1 0.0 0.0
1 8.091553 11.586920 67111.784934 0 7.384394 male H 3 0 1 0.0 0.0
2 7.924539 7858.126614 93442.563796 1 10.219626 F W 4 0 2 0.0 0.0
3 7.924539 7858.126614 93442.563796 1 10.219626 F W 4 0 2 0.0 0.0
4 7.726480 11.057883 96508.386987 0 8.544586 M W 4 0 3 0.0 0.0
With my code above, I'm trying to loop through all the "dist" values and, if those values are in the right column (data["dist"]) and greater than 25, divide them by 5820 (the number of feet in a kilometer). More generally, I'd like to find a way to operate on specific elements of dataframes. I'm sure this is at least a somewhat common question; I just haven't been able to find an answer for it. If someone could point me towards somewhere with an answer, I would be just as happy.
Instead of your solution, filter the rows with a boolean mask and divide the dist column by 5820:
data.loc[data["dist"] > 25, 'dist'] /= 5820
This works the same as:
data.loc[data["dist"] > 25, 'dist'] = data.loc[data["dist"] > 25, 'dist'] / 5820
data.loc[data["dist"] > 25, 'dist'] /= 5820
print (data)
past_score dist income lab score gender race \
0 8.091553 11.586920 67111.784934 0 7.384394 male H
1 8.091553 11.586920 67111.784934 0 7.384394 male H
2 7.924539 1.350194 93442.563796 1 10.219626 F W
3 7.924539 1.350194 93442.563796 1 10.219626 F W
4 7.726480 11.057883 96508.386987 0 8.544586 M W
income_bucket plays_sports student_id lat long
0 3 0 1 0.0 0.0
1 3 0 1 0.0 0.0
2 4 0 2 0.0 0.0
3 4 0 2 0.0 0.0
4 4 0 3 0.0 0.0
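The same operation can also be written with Series.mask, if you prefer to avoid repeating the .loc indexer; a sketch of the identical logic:
# divide only the entries above 25, leave the rest unchanged
data['dist'] = data['dist'].mask(data['dist'] > 25, data['dist'] / 5820)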

Pandas column that depends on its previous value (row)?

I would like to create a 3rd column in my dataframe, which depends on both the new and existing columns in the previous row.
This new column should start at 0. Its next value is its previous value plus df.below_lo[i] (if the previous value was 0); if its previous value was 1, its next value is its previous value plus df.above_hi[i].
I think I have two issues: how to initiate this 3rd column, and how to make it dependent on itself.
import pandas as pd
import math

data = {'below_lo': [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        'above_hi': [0, 0, -1, 0, -1, 0, -1, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data)
df['pos'] = math.nan
df['pos'][0] = 0
for i in range(len(df.below_lo)):
    if df.pos[i] == 0:
        df.pos[i+1] = df.pos[i] + df.below_lo[i]
    if df.pos[i] == 1:
        df.pos[i+1] = df.pos[i] + df.above_hi[i]
print(df)
The desired output is:
below_lo above_hi pos
0 0.0 0.0 0.0
1 1.0 0.0 0.0
2 0.0 -1.0 1.0
3 0.0 0.0 0.0
4 0.0 -1.0 0.0
5 0.0 0.0 0.0
6 0.0 -1.0 0.0
7 0.0 0.0 0.0
8 0.0 0.0 0.0
9 1.0 0.0 0.0
10 0.0 0.0 1.0
11 0.0 0.0 1.0
12 0.0 0.0 1.0
13 NaN NaN 1.0
The above code produces the correct output, but I am also getting a few of these warnings:
A value is trying to be set on a copy of a slice from a DataFrame
How do I clean this code up so that it runs without throwing this warning?
Use .loc:
df.loc[0, 'pos'] = 0
for i in range(len(df.below_lo)):
    if df.loc[i, 'pos'] == 0:
        df.loc[i+1, 'pos'] = df.loc[i, 'pos'] + df.loc[i, 'below_lo']
    if df.loc[i, 'pos'] == 1:
        df.loc[i+1, 'pos'] = df.loc[i, 'pos'] + df.loc[i, 'above_hi']
Appreciate there is an accepted, and perfectly good, answer by @Michael O. already, but if you dislike iterating over rows as not quite Pandas-esque, here is a solution without explicit looping over rows:
from functools import reduce

res = reduce(lambda d, _:
                 d.fillna({'pos': d['pos'].shift(1)
                           + (d['pos'].shift(1) == 0) * d['below_lo']
                           + (d['pos'].shift(1) == 1) * d['above_hi']}),
             range(len(df)), df)
res
res
produces
below_lo above_hi pos
-- ---------- ---------- -----
0 0 0 0
1 1 0 1
2 0 -1 0
3 0 0 0
4 0 -1 0
5 0 0 0
6 0 -1 0
7 0 0 0
8 0 0 0
9 1 0 1
10 0 0 1
11 0 0 1
12 0 0 1
It is, admittedly, somewhat less efficient and has a bit more obscure syntax. But it could be written on a single line (even if I split it over a few for clarity)!
The idea is that we can use the fillna(..) function by passing it a value calculated from the previous value of 'pos' (hence shift(1)) and the current values of 'below_lo' and 'above_hi'. The extra complication is that this operation only fills a NaN in the row just below one that already holds a non-NaN value. Hence we need to apply the function repeatedly until all NaNs are filled, and this is where reduce comes into play.
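If neither the .loc loop nor the reduce trick appeals, the same recurrence can also be spelled out over plain NumPy arrays, which avoids per-element DataFrame indexing (usually the slow part). A minimal sketch of the question's rules; the pos_column name is mine, and the carry-forward branch is my choice (the question's data only ever produces 0 or 1):
import numpy as np

def pos_column(below_lo, above_hi):
    # pos starts at 0; each next value depends on the previous pos
    below_lo = np.asarray(below_lo, dtype=float)
    above_hi = np.asarray(above_hi, dtype=float)
    pos = np.zeros(len(below_lo) + 1)
    for i in range(len(below_lo)):
        if pos[i] == 0:
            pos[i + 1] = pos[i] + below_lo[i]
        elif pos[i] == 1:
            pos[i + 1] = pos[i] + above_hi[i]
        else:
            pos[i + 1] = pos[i]  # carry anything else forward
    return pos

pos = pos_column(df['below_lo'], df['above_hi'])  # len(df) + 1 values, like the loop above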

Python checking if row is empty after slicing

I'm working on a program that pulls data from an Excel file and inputs it into a calendar. In the Excel file, dates are on the y axis. I'm iterating cell by cell across each row, but if the entire row is empty (aside from the date), I want to perform a different action than if just some of the cells in the row were empty. I cannot reference the header names directly, as they will not always be the same.
In the example below, for row 1 I'm iterating 0, 2, nan, 4, nan
For row 2, I want to print('empty row') before moving to row 3.
Date bx1 bx2 bx3 bx4 bx5
1 0 2 4
2
3 0 1 2 3 4
4 1 2 3 4 5
5
6
7 0 1 2 3 4
I've tried this:
if pd.isnull(m):
print('emptyrow')
and this:
if pd.isna(df[1]):
print('empty row')
Here's code for context:
layout = [[sg.In(key='-CAL-', enable_events=True, visible=False),
           sg.CalendarButton('Calendar', target='-CAL-', pad=None, font=('MS Sans Serif', 10, 'bold'),
                             button_color=('red', 'white'), format='%m/%d/%Y')],
          [sg.OK(), sg.Cancel()]]
window = sg.Window('Data Collector', layout, grab_anywhere=False, size=(400, 280), return_keyboard_events=True,
                   finalize=True)
event, values = window.read()
adate = (values['-CAL-'])
stu = (values[0])
window.close()
df = pd.read_excel('C:\\Users\\aelfont\\Documents\\python_date_test.xlsx', Sheet_name=0, header=None)
x = len(df.columns)  # length of bx
z = 1  # used to determine when at end of row
b = 1  # location of column to start summing
c = len(df.index)  # number of days in the month
r = 1  # used to stop once last day of month reached
y = 1
# while date < last day in month, do action
# while the above is true, enter data until end of row
# once at end of row, submit and move to next row
while y < c:
    while z < x:
        n = int((values['-CAL-'][3:5]))
        m = df.iloc[n, b]
        z = z + 1
        b = b + 1
        if pd.isnull(m):
            ActionChains(browser) \
                .send_keys(Keys.TAB) \
                .perform()
            continue
        else:
            ActionChains(browser) \
                .send_keys(str(m)) \
                .perform()
        if z == x:
            z = 1
            b = 1
            n = n + 1
            y = y + 1
            time.sleep(5)
            yes = browser.find_element_by_css_selector('button.publishBottom:nth-child(2)')
            time.sleep(5)
            yes.click()
        else:
            ActionChains(browser) \
                .send_keys(Keys.TAB) \
                .perform()
        if y == c:
            break
if pd.isnull(m):
    print('emptyrow')
Can you try the following:
if pd.isnull(m).all():
    print('emptyrow')
Full Code:
df = df.set_index('Date')
print(df)
for ind, row in df.iterrows():
    print(pd.isnull(row).all())
Output:
bx1 bx2 bx3 bx4 bx5
Date
1 0.0 2.0 4.0 NaN NaN
2 NaN NaN NaN NaN NaN
3 0.0 1.0 2.0 3.0 4.0
4 1.0 2.0 3.0 4.0 5.0
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
7 0.0 1.0 2.0 3.0 4.0
False
True
False
False
True
True
False
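As a side note, the per-row check can also be computed for the whole frame at once, which might simplify the loop. A minimal sketch, assuming the date column is named Date as above:
# Boolean Series: True for each date whose remaining cells are all NaN
empty_rows = df.set_index('Date').isnull().all(axis=1)
for date, is_empty in empty_rows.items():
    if is_empty:
        print('empty row', date)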

In Pandas, how can I count consecutive positive and negatives in a row?

In python pandas or numpy, is there a built-in function or a combination of functions that can count the number of positive or negative values in a row?
This could be thought of as similar to a roulette wheel with the number of blacks or reds in a row.
Example input series data:
Date
2000-01-07 -3.550049
2000-01-10 28.609863
2000-01-11 -2.189941
2000-01-12 4.419922
2000-01-13 17.690185
2000-01-14 41.219971
2000-01-18 0.000000
2000-01-19 -16.330078
2000-01-20 7.950195
2000-01-21 0.000000
2000-01-24 38.370117
2000-01-25 6.060059
2000-01-26 3.579834
2000-01-27 7.669922
2000-01-28 2.739991
2000-01-31 -8.039795
2000-02-01 10.239990
2000-02-02 -1.580078
2000-02-03 1.669922
2000-02-04 7.440186
2000-02-07 -0.940185
Desired output:
- in a row 5 times
+ in a row 4 times
++ in a row once
++++ in a row once
+++++++ in a row once
Nonnegatives:
from functools import reduce # For Python 3.x
ser = df['x'] >= 0
c = ser.expanding().apply(lambda r: reduce(lambda x, y: x + 1 if y else x * y, r))
c[ser & (ser != ser.shift(-1))].value_counts()
Out:
1.0 2
7.0 1
4.0 1
2.0 1
Name: x, dtype: int64
Negatives:
ser = df['x'] < 0
c = ser.expanding().apply(lambda r: reduce(lambda x, y: x + 1 if y else x * y, r))
c[ser & (ser != ser.shift(-1))].value_counts()
Out:
1.0 6
Name: x, dtype: int64
Basically, it creates a boolean series and takes the cumulative count between the turning points (when the sign changes, it starts over). For example, for nonnegatives, c is:
Out:
0 0.0
1 1.0 # turning point
2 0.0
3 1.0
4 2.0
5 3.0
6 4.0 # turning point
7 0.0
8 1.0
9 2.0
10 3.0
11 4.0
12 5.0
13 6.0
14 7.0 # turning point
15 0.0
16 1.0 # turning point
17 0.0
18 1.0
19 2.0 # turning point
20 0.0
Name: x, dtype: float64
Now, in order to identify the turning points, the condition is that the current value is different from the next one and that it is True. If you select those, you have the counts.
You can use the itertools.groupby() function.
import itertools

l = [-3.550049, 28.609863, -2.189941, 4.419922, 17.690185, 41.219971, 0.000000, -16.330078, 7.950195, 0.000000, 38.370117, 6.060059, 3.579834, 7.669922, 2.739991, -8.039795, 10.239990, -1.580078, 1.669922, 7.440186, -0.940185]
r_pos = {}
r_neg = {}
for k, v in itertools.groupby(l, lambda e: e > 0):
    count = len(list(v))
    r = r_pos
    if k == False:
        r = r_neg
    if count not in r.keys():
        r[count] = 0
    r[count] += 1
for k, v in r_neg.items():
    print('%s in a row %s time(s)' % ('-' * k, v))
for k, v in r_pos.items():
    print('%s in a row %s time(s)' % ('+' * k, v))
output
- in a row 6 time(s)
+ in a row 2 time(s)
++ in a row 1 time(s)
++++ in a row 1 time(s)
+++++++ in a row 1 time(s)
Depending on what you consider a positive value, you can change the line lambda e: e > 0.
So far this is what I've come up with; it works and outputs a count of how many times runs of negative, positive and zero values occur. Maybe someone can make it more concise using some of the suggestions posted by ayhan and Ghilas above.
import numpy as np
from collections import Counter

ser = [-3.550049, 28.609863, -2.189941, 4.419922, 17.690185, 41.219971, 0.000000, -16.330078, 7.950195, 0.000000, 38.370117, 6.060059, 3.579834, 7.669922, 2.739991, -8.039795, 10.239990, -1.580078, 1.669922, 7.440186, -0.940185]
c = 0
zeros, neg_counts, pos_counts = [], [], []
for i in range(len(ser)):
    c += 1
    s = np.sign(ser[i])
    try:
        if s != np.sign(ser[i + 1]):
            if s == 0:
                zeros.append(c)
            elif s == -1:
                neg_counts.append(c)
            elif s == 1:
                pos_counts.append(c)
            c = 0
    except IndexError:
        pos_counts.append(c) if s == 1 else neg_counts.append(c) if s == -1 else zeros.append(c)

print(Counter(neg_counts))
print(Counter(pos_counts))
print(Counter(zeros))
Out:
Counter({1: 6})
Counter({1: 3, 3: 1, 5: 1, 2: 1})
Counter({1: 2})
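For completeness, a common pandas idiom for run lengths, sketched here rather than taken from either answer: label each run of equal signs by cumulatively summing sign changes, then size the groups.
import numpy as np
import pandas as pd

sign = np.sign(pd.Series(ser))
# a run boundary is wherever the sign differs from the previous element;
# cumsum over the boundaries gives every run its own label
run_id = (sign != sign.shift()).cumsum()
runs = sign.groupby(run_id).agg(['first', 'size'])  # sign and length of each run
print(runs.groupby(['first', 'size']).size())  # how many runs of each sign/length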
