I would like to apply a test to a pandas dataframe, and create flags in a corresponding dataframe based on the test results. I've gotten this far:
import numpy as np
import pandas as pd
matrix = pd.DataFrame({'a': [1, 11, 2, 3, 4], 'b': [5, 6, 22, 8, 9]})
flags = pd.DataFrame(np.zeros(matrix.shape), columns=matrix.columns)
flag_values = pd.Series({"a": 100, "b": 200})
flags[matrix > 10] = flag_values
but this raises the error
ValueError: Must specify axis=0 or 1
Where can I specify the axis in this situation? Is there a better way to accomplish this?
Edit:
The result I'm looking for in this example for "flags" is
a b
0 0
100 0
0 200
0 0
0 0
You could define flags = (matrix > 10) * flag_values:
In [35]: (matrix > 10) * flag_values
Out[35]:
a b
0 0 0
1 100 0
2 0 200
3 0 0
4 0 0
This relies on True having numeric value 1 and False having numeric value 0.
It also relies on Pandas' nifty automatic alignment of DataFrames (and Series) based on labels before performing arithmetic operations.
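To make the two points above concrete, here is a minimal sketch that spells out both steps: the boolean mask is cast to 1/0 explicitly, and the Series is deliberately given in a different order from the columns to show that alignment is by label, not by position. The variable name `mask` is only illustrative.

```python
import pandas as pd

matrix = pd.DataFrame({'a': [1, 11, 2, 3, 4], 'b': [5, 6, 22, 8, 9]})
# Series deliberately listed in a different order than the DataFrame columns
flag_values = pd.Series({'b': 200, 'a': 100})

mask = matrix > 10  # boolean DataFrame, True where the test passes

# Cast True/False to 1/0, then scale each column by the flag with the same label
flags = mask.astype(int).mul(flag_values, axis='columns')
print(flags)
```

Column 'a' still receives 100 and column 'b' still receives 200, because the Series index is matched against the column labels.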
Alternatively, use mask with mul:
flags.mask(matrix > 10,1).mul(flag_values,axis=1)
Out[566]:
a b
0 0.0 0.0
1 100.0 0.0
2 0.0 200.0
3 0.0 0.0
4 0.0 0.0
I am trying to compute a kind of block average: out of 90 rows, every 3 consecutive rows in column A should be averaged, and that average should be written to the same rows in column B.
For example:
From this:
df = pd.DataFrame( A B
2 0
3 0
4 0
7 0
9 0
8 0)
to this:
df = pd.DataFrame( A B
2 3
3 3
4 3
7 8
9 8
8 8)
I tried running this code:
x=0
for i in df['A']:
    if x<90:
        y = (df['A'][x]+ df['A'][(x +1)]+df['A'][(x +2)])/3
        df['B'][x] = y
        df['B'][(x+1)] = y
        df['B'][(x+2)] = y
        x=x+3
        print(y)
It does print the correct y, but it does not change B.
I know there is a better way to do it, and if anyone knows - it would be great if they shared it. But the more important thing for me is to understand why what I wrote down doesn't have an effect on the df.
You could group by the index divided by 3, then use transform to compute the mean of those values and assign to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output (the mean is a float, so B becomes a float column):
   A    B
0  2  3.0
1  3  3.0
2  4  3.0
3  7  8.0
4  9  8.0
5  8  8.0
Note that this relies on the index being of the form 0,1,2,3,4,.... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
i = 0
batch_size = 3
df = pd.DataFrame({'A':[2,3,4,7,9,8,9,10],'B':[-1] * 8})
while i < len(df):
    j = min(i+batch_size-1,len(df)-1)
    avg =sum(df.loc[i:j,'A'])/ (j-i+1)
    df.loc[i:j,'B'] = [avg] * (j-i+1)
    i+=batch_size
df
In the corner case where len(df) % batch_size != 0, this assumes we take the average of the leftover rows.
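On the question of why the original loop leaves df unchanged: a likely explanation is that `df['B'][x] = y` is chained assignment. `df['B']` is evaluated first, and the value is then set on whatever that expression returned; the pandas documentation warns that this may be a temporary copy, and with Copy-on-Write enabled (the default from pandas 3.0) the original DataFrame is never updated this way. A minimal sketch of the same loop written with a single `.loc` call, assuming B is created as a float column so the averages fit without a dtype change:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8]})
df['B'] = 0.0  # float column, assumed here so the averages need no upcasting

for start in range(0, len(df), 3):
    stop = start + 3
    block_mean = df['A'].iloc[start:stop].mean()
    # One indexing operation on df itself, so the assignment reaches df
    df.loc[df.index[start:stop], 'B'] = block_mean

print(df)
```

Because rows and column are selected in one `.loc` call, there is no intermediate object for the assignment to get lost in.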
I am trying to implement the ALS algorithm in Dask, but I am having trouble figuring out how to compute the latent features in one step. I followed the formulas in this stackoverflow thread and came up with this code:
Items = da.linalg.lstsq(da.add(da.dot(Users, Users.T), lambda_ * da.eye(n_factors)),
da.dot(Users, X))[0].T.compute()
Items = np.where(Items < 0, 0, Items)
Users = da.linalg.lstsq(da.add(da.dot(Items.T, Items), lambda_ * da.eye(n_factors)),
da.dot(Items.T, X.T))[0].compute()
Users = np.where(Users < 0, 0, Users)
But I don't think this works correctly, because MSE is not decreasing.
Example input:
n_factors = 2
lambda_ = 0.1
# We have 6 users and 4 items
Matrices X_train(6x4), R(4x6), Users(2x6) and Items(4x2) look like:
1 0 0 0 5 2 1 0 0 0 0.8 1.3 1.1 0.2 4.1 1.6
0 0 0 0 4 0 0 0 1 1 3.9 4.3 3.5 2.7 4.3 0.5
0 3 0 0 4 0 0 0 0 0 2.9 1.5
0 3 0 0 0 0 0 0 0 0 0.2 4.7
1 1 1 0 0.9 1.1
1 0 0 0 4.8 3.0
EDIT: I found the problem, but I don't know how to get around it. Before the iteration starts I set all values in X_train matrix, where there is no rating, to 0.
X_train = da.nan_to_num(X_train)
The reason is that the dot product works only on numeric values. But because the matrix is very sparse, 90% of it now consists of zeros, and instead of fitting the real ratings the model fits these zeros.
Any help would be highly appreciated. <3
One way to handle gaps or missing values in data sets is to use masked arrays. As of May 2017 Dask also supports them.
Defining a masked array in Dask is fairly simple and similar to NumPy's. All supported functions are listed in the docs; here are just some of the most commonly used approaches:
data_set = da.array([[1, 2], [3, 4]])
masked_data_set_1 = da.ma.masked_array(data_set, mask=[[False, True],[True, False]])
# returns [[1, --],[--, 4]]
masked_data_set_2 = da.ma.masked_equal(data_set, 4)
# returns [[1, 2],[3, --]]
masked_data_set_3 = da.ma.masked_where(data_set < 3, data_set)
# returns [[--, --],[3, 4]]
In your case, you are trying to compute the dot product da.dot(Users, X). Instead of setting all NaN values to 0, you can use a masked array:
masked_X = da.ma.masked_where(X != X, X)
Now you can easily perform dot product like:
da.ma.getdata(da.dot(Users,masked_X))
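To illustrate why masking differs from zero-filling, here is a small sketch that uses NumPy's `np.ma` module rather than Dask (an assumption made to keep the example self-contained; `da.ma` follows the same interface). The ratings matrix is made up for the example.

```python
import numpy as np

# Hypothetical 2x3 ratings matrix, NaN where no rating exists
X = np.array([[5.0, np.nan, 1.0],
              [np.nan, 4.0, np.nan]])

masked_X = np.ma.masked_invalid(X)   # masks the NaN entries

observed_count = masked_X.count()    # number of unmasked (real) ratings
row_means = masked_X.mean(axis=1)    # averages over observed ratings only
zero_filled_means = np.nan_to_num(X).mean(axis=1)  # zeros pull the averages down

print(observed_count)
print(row_means)
print(zero_filled_means)
```

The masked means use only the ratings that exist, while the zero-filled means treat every missing rating as a real 0, which is the effect described in the question.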
I would like to add a column in a data frame when another column is increasing/decreasing or stays the same with:
1 -> increasing, 0 -> same, -1 -> decreasing
So if df['battery'] = [1,2,3,4,7,9,3,3,3,]
I would like state to be df['state'] = [1,1,1,1,1,-1,0,0]
This should do the trick!
a = [1,2,3,4,7,9,3,3,3]
b = []
for x in range(len(a)-1):
    b.append((a[x+1] > a[x]) - (a[x+1] < a[x]))
print(b)
You could use pd.Series.diff method to get the difference between consecutive values, and then assign the necessary state values by using boolean indexing:
import pandas as pd
df = pd.DataFrame()
df['battery'] = [1,2,3,4,7,9,3,3,3]
diff = df['battery'].diff()
df.loc[diff > 0, 'state'] = 1
df.loc[diff == 0, 'state'] = 0
df.loc[diff < 0, 'state'] = -1
print(df)
# battery state
# 0 1 NaN
# 1 2 1.0
# 2 3 1.0
# 3 4 1.0
# 4 7 1.0
# 5 9 1.0
# 6 3 -1.0
# 7 3 0.0
# 8 3 0.0
Or, alternatively, one could use np.select:
import numpy as np
diff = df['battery'].diff()
df['state'] = np.select([diff < 0, diff > 0], [-1, 1], 0)
# Be careful, default 0 will replace the first NaN as well.
print(df)
# battery state
# 0 1 0
# 1 2 1
# 2 3 1
# 3 4 1
# 4 7 1
# 5 9 1
# 6 3 -1
# 7 3 0
# 8 3 0
So here's your dataframe:
>>> import pandas as pd
>>> data = [[[1,2,3,4,7,9,3,3,3]]]
>>> df = pd.DataFrame(data, columns = ['battery'])
>>> df
battery
0 [1, 2, 3, 4, 7, 9, 3, 3, 3]
And finally use apply and a lambda function in order to generate the required result:
>>> df['state'] = df.apply(lambda row: [1 if t - s > 0 else -1 if t-s < 0 else 0 for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 1, 1, -1, 0, 0]
Alternatively, if you want the exact difference between each element in the list, you can use the following:
>>> df['state'] = df.apply(lambda row: [t - s for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 3, 2, -6, 0, 0]
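Try np.sign (the pd.np alias is deprecated and has been removed from recent pandas versions, so import NumPy directly):
import numpy as np
np.sign(df.battery.diff().fillna(1))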
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 -1.0
7 0.0
8 0.0
Name: battery, dtype: float64
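If the first row should stay missing instead of being filled with 1, one possible variant (a sketch, assuming pandas' nullable `Int64` dtype is acceptable for the state column) is to take the sign of the difference and convert the result to a nullable integer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'battery': [1, 2, 3, 4, 7, 9, 3, 3, 3]})

# diff() leaves NaN in the first row; np.sign keeps it as NaN,
# and the nullable Int64 dtype stores it as <NA> next to integer states
df['state'] = np.sign(df['battery'].diff()).astype('Int64')
print(df)
```

This keeps the states as integers while still marking that the first row has no previous value to compare against.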
I am trying to tabulate a change in condition using a 'groupby' but am stumped and would appreciate any guidance. I have a data frame as follows:
SUBJECT TYPE
1 1
1 2
1 2
2 1
2 1
3 1
3 3
3 5
I would like to generate a statement that tabulates any positive change, ignores any negative change, and generates a count of change per subject. For example, the output of the above would be:
Subject TYPE
1 1
2 0
3 2
Would I need to create an if/else clause using pandas, or is there a simpler way to achieve this using summit? Maybe something like...
def tabchange(type, subject):
    current_subject = subject[0]
    type_diff = type - type
    j = 1
    for i in range(1,len(type)):
        type_diff[i] = type[i] - type[i-j]
        if subject[i] == current_subject:
            if type_diff[i] > 0:
                new_row = 1
                j += 1
            else:
                j = 1
        else:
            new_row[i] = 0
            current_subject = subject[i]
    return new_row
import pandas as pd
df = pd.DataFrame({'SUBJECT': [1, 1, 1, 2, 2, 3, 3, 3],
'TYPE': [1, 2, 2, 1, 1, 1, 3, 5]})
grouped = df.groupby('SUBJECT')
df['TYPE'] = grouped['TYPE'].diff() > 0
result = grouped['TYPE'].agg('sum')
yields
SUBJECT
1 1.0
2 0.0
3 2.0
Name: TYPE, dtype: float64
Above, df is grouped by SUBJECT and the diff is taken of the TYPE column:
In [253]: grouped = df.groupby('SUBJECT'); df['TYPE'] = grouped['TYPE'].diff() > 0
In [254]: df
Out[254]:
SUBJECT TYPE
0 1 False
1 1 True
2 1 False
3 2 False
4 2 False
5 3 False
6 3 True
7 3 True
Then, again grouping by SUBJECT, the result is obtained by counting the number of Trues in the TYPE column:
In [255]: result = grouped['TYPE'].agg('sum'); result
Out[255]:
SUBJECT
1 1.0
2 0.0
3 2.0
Name: TYPE, dtype: float64
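The same two steps can also be written without overwriting the TYPE column. This is only a sketch of the idea described above; the name `increased` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'SUBJECT': [1, 1, 1, 2, 2, 3, 3, 3],
                   'TYPE': [1, 2, 2, 1, 1, 1, 3, 5]})

# Step 1: True where TYPE rose compared with the previous row of the same subject
increased = df.groupby('SUBJECT')['TYPE'].diff().gt(0)

# Step 2: count the True values per subject
result = increased.groupby(df['SUBJECT']).sum()
print(result)
```

The first row of each subject has no predecessor, so its diff is NaN and it counts as no change; df itself is left untouched.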
I'd like to create a new dataframe using the same values from another dataframe, unless there is a 0 value. If there is a 0 value, I'd like to find the average of the entry before and after.
For Example:
df = A B C
5 2 1
3 4 5
2 1 0
6 8 7
I'd like the result to look like the df below:
df_new = A B C
5 2 1
3 4 5
2 1 6
6 8 7
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[5, 3, 2, 6], 'B':[2, 4, 1, 8], 'C':[1, 5, 0, 7]})
Nrows = len(df)
def run(col):
    originalValues = list(df[col])
    # positions of the zeros in this column
    values = list(np.where(np.array(list(df[col])) == 0)[0])
    # only interior rows have both a previous and a next entry
    indices2replace = filter(lambda x: x > 0 and x < Nrows - 1, values)
    for index in indices2replace:
        originalValues[index] = 0.5 * (originalValues[index+1] + originalValues[index-1])
    return originalValues
newDF = pd.DataFrame(list(map(lambda x: run(x), df.columns))).transpose()
newDF.columns = df.columns  # restore the original column names
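A more compact variant of the same idea, offered as a sketch and not as a drop-in replacement: build the average of the previous and next row with `shift`, then use `mask` to substitute it wherever the original value is 0. It assumes zeros never sit in the first or last row; there one neighbour is missing and the result would be NaN.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 3, 2, 6], 'B': [2, 4, 1, 8], 'C': [1, 5, 0, 7]})

# Average of the entry before and the entry after, for every cell
neighbour_mean = (df.shift(1) + df.shift(-1)) / 2

# Keep the original values, except where they are 0
df_new = df.mask(df == 0, neighbour_mean)
print(df_new)
```

Here the single 0 in column C is replaced by the average of 5 and 7, and all other values are carried over.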