df:
cat116_O cat116_S cat116_T cat116_U cat116_Y
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
expected output:
df(changed):
cat116_O cat116_S cat116_T cat116_U cat116_Y
0 -1 -1 -1 -1 -1
1 -1 -1 -1 -1 -1
code:
df.replace(0.0, -1)
But it's not working. I was able to do it iteratively for each row and column, but that takes a lot of time. Where am I going wrong with the replace call in the code?
Sounds like you want to find out why your replace call isn't working.
I think this might be what you're looking for:
df.replace(to_replace=0.0, value=-1, inplace=True)
This will give you float values (-1.0), since your columns are floats.
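Note that the original df.replace(0.0, -1) does work; it just returns a new DataFrame instead of modifying df, so the result has to be assigned back. A minimal sketch of the two equivalent fixes:

# replace() returns a new DataFrame by default
df = df.replace(0.0, -1)           # assign the result back
# or, equivalently, modify df in place:
df.replace(0.0, -1, inplace=True)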
This is my DataFrame:
3 4 5 6 97 98 99 100
0 1.0 2.0 3.0 4.0 95.0 96.0 97.0 98.0
1 50699.0 16302.0 50700.0 16294.0 50735.0 16334.0 50737.0 16335.0
2 57530.0 33436.0 57531.0 33438.0 NaN NaN NaN NaN
3 24014.0 24015.0 34630.0 24016.0 NaN NaN NaN NaN
4 44933.0 2611.0 44936.0 2612.0 44982.0 2631.0 44972.0 2633.0
... ... ... ... ... ... ... ... ...
1792 46712.0 35340.0 46713.0 35341.0 46759.0 35387.0 46760.0 35388.0
1793 61283.0 40276.0 61284.0 40277.0 61330.0 40323.0 61331.0 40324.0
1794 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1795 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1796 27156.0 48331.0 27157.0 48332.0 NaN NaN NaN NaN
How do I apply the function below and get the answers back for each row in one run?
values is the array of values of each row, and N is 100.
import math
import scipy.stats

def entropy_s(values, N):
    a = scipy.stats.entropy(values, base=2)
    a = round(a, 2)
    global CONSTANT_COUNT, RANDOM_COUNT, LOCAL_COUNT, GLOBAL_COUNT, ODD_COUNT
    if math.isnan(a):
        a = 0.0
    if a == 0.0:
        CONSTANT_COUNT += 1
    elif a < round(math.log2(N), 2):
        LOCAL_COUNT += 1
        RANDOM_COUNT += 1
    elif a == round(math.log2(N), 2):
        RANDOM_COUNT += 1
        GLOBAL_COUNT += 1
        LOCAL_COUNT += 1
    else:
        ODD_COUNT += 1
I assume that the values are supposed to be rows? In that case, I suggest the following: rows will be fed to the function, and you can access a column within each row using row.column_name.
def func(N=100):
    def entropy_s(values):
        a = scipy.stats.entropy(values, base=2)
        a = round(a, 2)
        global CONSTANT_COUNT, RANDOM_COUNT, LOCAL_COUNT, GLOBAL_COUNT, ODD_COUNT
        if math.isnan(a):
            a = 0.0
        if a == 0.0:
            CONSTANT_COUNT += 1
        elif a < round(math.log2(N), 2):
            LOCAL_COUNT += 1
            RANDOM_COUNT += 1
        elif a == round(math.log2(N), 2):
            RANDOM_COUNT += 1
            GLOBAL_COUNT += 1
            LOCAL_COUNT += 1
        else:
            ODD_COUNT += 1
    return entropy_s

df.apply(func(100), axis=1)
If you want to pass each row as a plain list, you can do this:
df.apply(lambda x: func(100)([k for k in x]), axis=1)
import functools
series = df.apply(functools.partial(entropy_s, N=100), axis=1)
# or
series = df.apply(lambda x: entropy_s(x, N=100), axis=1)
axis=1 will pass each row of your df as the first argument of the function.
You will get a pd.Series of None's though, because your function doesn't return anything.
I highly suggest to avoid using globals in your function.
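To illustrate that suggestion (a sketch only; the helper name and labels here are hypothetical, not from the original), the function could return a label per row and the counts could be tallied afterwards with value_counts(). Note the original increments several counters in some branches; this sketch collapses each branch to a single label for brevity:

import math
import scipy.stats

def classify_entropy(values, N=100):
    # Hypothetical globals-free variant: return a label instead of
    # mutating module-level counters.
    a = round(scipy.stats.entropy(values, base=2), 2)
    if math.isnan(a):
        a = 0.0
    if a == 0.0:
        return 'constant'
    elif a < round(math.log2(N), 2):
        return 'local'
    elif a == round(math.log2(N), 2):
        return 'random'
    else:
        return 'odd'

labels = df.apply(classify_entropy, axis=1)
print(labels.value_counts())  # the tallies replace the global counters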
EDIT: If you want meaningful help, you need to ask meaningful questions. Which errors are you getting?
Here is a quick and dirty example that demonstrates what I've suggested. If you have an error, your function likely has a bug (for example, it doesn't return anything), or it doesn't know how to handle NaN.
In [6]: df = pd.DataFrame({1: [1, 2, 3], 2: [3, 4, 5], 3: [6, 7, 8]})
In [7]: df
Out[7]:
1 2 3
0 1 3 6
1 2 4 7
2 3 5 8
In [8]: df.apply(lambda x: np.sum(x), axis=1)
Out[8]:
0 10
1 13
2 16
dtype: int64
I am sorry for asking such a simple question (yes, I googled). Do I really need two steps to map a pandas Series of floats between 0 and 1 to 0s and 1s, given a threshold? This is the reproducible example:
series = pd.Series([0.0, 0.3, 0.6, 1.0])
threshold = 0.5
print(series)
series[series > threshold] = 1.0
series[series <= threshold] = 0.0
print(series)
It works, producing:
0 0.0
1 0.0
2 1.0
3 1.0
from:
0 0.0
1 0.3
2 0.6
3 1.0
You can use the > operator.
series = (series > threshold).astype(int)
print(series)
Output:
0 0
1 0
2 1
3 1
dtype: int32
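Your example maps to floats (0.0 and 1.0); if you want to keep that dtype, cast to float instead:

series = (series > threshold).astype(float)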
You could also apply a function to every element using map():
series = series.map(lambda x: 1.0 if x > threshold else 0.0)
I'd use numpy.where:
np.where(series > threshold, 1, 0)
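Note that np.where returns a NumPy array rather than a Series; if you need a Series with the original index, wrap the result back:

series = pd.Series(np.where(series > threshold, 1, 0), index=series.index)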
I would like to create a third column in my DataFrame, which depends on both the new and existing columns in the previous row.
This new column should start at 0.
Its next value is its previous value plus df.below_lo[i] (if the previous value was 0).
If its previous value was 1, its next value is its previous value plus df.above_hi[i].
I think I have two issues: how to initialize this third column and how to make it depend on itself.
import pandas as pd
import math

data = {'below_lo': [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        'above_hi': [0, 0, -1, 0, -1, 0, -1, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data)
df['pos'] = math.nan
df['pos'][0] = 0
for i in range(len(df.below_lo)):
    if df.pos[i] == 0:
        df.pos[i+1] = df.pos[i] + df.below_lo[i]
    if df.pos[i] == 1:
        df.pos[i+1] = df.pos[i] + df.above_hi[i]
print(df)
The desired output is:
below_lo above_hi pos
0 0.0 0.0 0.0
1 1.0 0.0 0.0
2 0.0 -1.0 1.0
3 0.0 0.0 0.0
4 0.0 -1.0 0.0
5 0.0 0.0 0.0
6 0.0 -1.0 0.0
7 0.0 0.0 0.0
8 0.0 0.0 0.0
9 1.0 0.0 0.0
10 0.0 0.0 1.0
11 0.0 0.0 1.0
12 0.0 0.0 1.0
13 NaN NaN 1.0
The above code produces the correct output, except that I am also getting a few of these warnings:
A value is trying to be set on a copy of a slice from a DataFrame
How do I clean this code up so that it runs without throwing this warning?
Use .loc, which performs the assignment in a single indexing operation instead of the chained indexing (df.pos[i+1] = ...) that triggers the warning:
df.loc[0, 'pos'] = 0
for i in range(len(df.below_lo)):
    if df.loc[i, 'pos'] == 0:
        df.loc[i+1, 'pos'] = df.loc[i, 'pos'] + df.loc[i, 'below_lo']
    if df.loc[i, 'pos'] == 1:
        df.loc[i+1, 'pos'] = df.loc[i, 'pos'] + df.loc[i, 'above_hi']
Appreciate there is an accepted, and perfectly good, answer by @Michael O. already, but if you dislike iterating over rows as not quite Pandas-esque, here is a solution without explicit looping over rows:
from functools import reduce

res = reduce(lambda d, _:
             d.fillna({'pos': d['pos'].shift(1)
                       + (d['pos'].shift(1) == 0) * d['below_lo']
                       + (d['pos'].shift(1) == 1) * d['above_hi']}),
             range(len(df)), df)
res
produces
below_lo above_hi pos
-- ---------- ---------- -----
0 0 0 0
1 1 0 1
2 0 -1 0
3 0 0 0
4 0 -1 0
5 0 0 0
6 0 -1 0
7 0 0 0
8 0 0 0
9 1 0 1
10 0 0 1
11 0 0 1
12 0 0 1
It is, admittedly, somewhat less efficient, and the syntax is a bit more obscure. But it could be written on a single line (even if I split it over a few for clarity)!
The idea is that we can use the fillna(..) function, passing a value calculated from the previous value of 'pos' (hence shift(1)) and the current values of 'below_lo' and 'above_hi'. The extra complication is that this operation only fills a NaN with a non-NaN in the row directly below one that already holds a non-NaN value. Hence we need to apply the function repeatedly until all NaNs are filled, and this is where reduce comes into play.
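For intuition, here is what a single pass of that fillna step does, assuming 'pos' starts as 0 in row 0 and NaN everywhere else (a sketch): only the first NaN row is filled, because the shifted 'pos' is still NaN below it.

one_pass = df.fillna({'pos': df['pos'].shift(1)
                      + (df['pos'].shift(1) == 0) * df['below_lo']
                      + (df['pos'].shift(1) == 1) * df['above_hi']})
print(one_pass['pos'].head(3))
# 0    0.0
# 1    1.0
# 2    NaN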
I want to replace all missing values in the dataset with the average of the two nearest neighbours, except for the first and last cells and cases where a neighbour is 0 (those I fix manually). I coded this and it works, but the solution is not very smart. Is there a faster way to do it? Is the interpolate method suitable for this? I'm not quite sure how it works.
Input:
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 0.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 0.0
3 1592.0 0.0 0.0 1571.0 1647.0 0.0
Output:
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 1486.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 1540.5
3 1592.0 0.0 0.0 1571.0 1647.0 0.0
Code:
data_len = len(df)
first_col = str(df.columns[0])
last_col = str(df.columns[len(df.columns) - 1])
d = df.apply(lambda s: pd.to_numeric(s, errors="coerce"))
m = d.eq(0) | d.isna()
s = m.stack()
missing = s[s].index.tolist()  # list of indices of missing values
count = len(missing)
for el in missing:
    if el == ('0', first_col) or el == (str(data_len - 1), last_col):
        continue
    next = df.at[str(int(el[0]) + 1), first_col] if el[1] == last_col else df.at[el[0], str(int(el[1]) + 1)]
    prev = df.at[str(int(el[0]) - 1), last_col] if el[1] == first_col else df.at[el[0], str(int(el[1]) - 1)]
    if prev == 0 or next == 0:
        continue
    df.at[el[0], el[1]] = (prev + next) / 2
JSON of example:
{"0":{"0":0.0,"1":1554.0,"2":1588.0,"3":0.0},"1":{"0":1596.0,"1":1506.0,"2":1510.0,"3":0.0},"2":{"0":1578.0,"1":0.0,"2":1495.0,"3":1561.0},"3":{"0":1567.0,"1":1466.0,"2":1485.0,"3":1571.0},"4":{"0":1580.0,"1":1469.0,"2":1489.0,"3":1647.0},"5":{"0":1649.0,"1":1503.0,"2":0.0,"3":0.0}}
Here's one approach using shift to average the neighbours' values, slice-assigning the result back to the DataFrame:
m = df == 0                                         # mask of cells to fill
r = (df.shift(axis=1) + df.shift(-1, axis=1)) / 2   # average of left/right neighbours
df.iloc[1:-1, 1:-1] = df.mask(m, r)                 # fill masked cells, leaving the border untouched
print(df)
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 1486.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 0.0
3 0.0 0.0 1561.0 1571.0 1647.0 0.0
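As for whether interpolate is suitable: if you treat the table as one row-major sequence (which is what the manual code effectively does when it wraps from the end of one row to the start of the next), masking zeros and interpolating linearly reproduces the neighbour averages for isolated gaps. A sketch, with the caveat that runs of consecutive zeros (row 3) would be filled linearly here rather than skipped as the manual rule does:

flat = pd.Series(df.to_numpy().ravel())           # row-major sequence
filled = flat.mask(flat == 0)                     # treat zeros as missing
filled = filled.interpolate(limit_area='inside')  # linear fill between valid neighbours
result = pd.DataFrame(filled.to_numpy().reshape(df.shape),
                      columns=df.columns, index=df.index)
result = result.fillna(0)                         # leading/trailing gaps stay 0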
I have binary values in a CSV file and a list of real-number values, and I would like to multiply the two files element-wise. How can I discard the values that are multiplied by a 0 in the CSV file? Can anyone help me with the algorithm part?
Binary.csv
These are three lines of binary values.
0 1 0 0 1 0 1 0 0
1 0 0 0 0 1 0 1 0
0 0 1 0 1 0 1 0 0
Real.csv
This is one line of real-number values.
0.1 0.2 0.4 0.1 0.5 0.5 0.3 0.6 0.3
Intermediate result (products, before discarding zeros)
0.0 0.2 0.0 0.0 0.5 0.0 0.3 0.0 0.0
0.1 0.0 0.0 0.0 0.0 0.5 0.0 0.6 0.0
0.0 0.0 0.4 0.0 0.5 0.0 0.3 0.0 0.0
Desired output
0.2 0.5 0.3
0.1 0.5 0.6
0.4 0.5 0.3
Code
import numpy as np
import itertools

a = np.array([[0, 1, 0, 0, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1, 0, 1, 0, 0]])
b = np.array([0.1, 0.2, 0.4, 0.1, 0.5, 0.5, 0.3, 0.6, 0.3])
c = a * b
d = itertools.compress(c, str(True))  # this attempt does not produce the desired output
print(d)
The above code is just another alternative that I tried at the same time. Sorry for the inconvenience; I very much appreciate all your help here.
Several ways to do this, mine is simplistic:
import csv

with open('real.csv', newline='') as csvfile:
    for row in csv.reader(csvfile, delimiter=' '):
        reals = row

with open('binary.csv', newline='') as csvfile:
    pwreader = csv.reader(csvfile, delimiter=' ')
    for row in pwreader:
        result = []
        for i, b in enumerate(row):
            if b == '1':
                result.append(reals[i])
        print(" ".join(result))
You will notice that there is no multiplication here. When you read from a CSV file the values are strings. You could convert each field to a numeric, construct a bit-mask, then work it out from there, but is it worth it? I have just used a simple string comparison. The output is a string anyway.
Edit: now I find you have numpy arrays in your code, ignoring the CSV files. Please stop moving the goalposts!
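That said, given the numpy arrays from the edited question, boolean indexing gets the desired output without any CSV handling (a sketch; it assumes every mask row selects the same number of values, as in the example):

import numpy as np

a = np.array([[0, 1, 0, 0, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1, 0, 1, 0, 0]])
b = np.array([0.1, 0.2, 0.4, 0.1, 0.5, 0.5, 0.3, 0.6, 0.3])

for row in a:
    print(b[row.astype(bool)])  # keep only values where the mask is 1
# [0.2 0.5 0.3]
# [0.1 0.5 0.6]
# [0.4 0.5 0.3]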