Related
I tried to make a kind of running average - out of 90 rows, every 3 in column A should make an average that would be the same as those rows in column B.
For example:
From this:
df = pd.DataFrame( A B
2 0
3 0
4 0
7 0
9 0
8 0)
to this:
df = pd.DataFrame( A B
2 3
3 3
4 3
7 8
9 8
8 8)
I tried running this code:
x=0
for i in df['A']:
if x<90:
y = (df['A'][x]+ df['A'][(x +1)]+df['A'][(x +2)])/3
df['B'][x] = y
df['B'][(x+1)] = y
df['B'][(x+2)] = y
x=x+3
print(y)
It does print the correct Y
But does not change B
I know there is a better way to do it, and if anyone knows - it would be great if they shared it. But the more important thing for me is to understand why what I wrote down doesn't have an effect on the df.
You could group by the index divided by 3, then use transform to compute the mean of those values and assign to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
A B
0 2 3
1 3 3
2 4 3
3 7 8
4 9 8
5 8 8
Note that this relies on the index being of the form 0,1,2,3,4,.... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
i = 0
batch_size = 3
df = pd.DataFrame({'A':[2,3,4,7,9,8,9,10],'B':[-1] * 8})
while i < len(df):
j = min(i+batch_size-1,len(df)-1)
avg =sum(df.loc[i:j,'A'])/ (j-i+1)
df.loc[i:j,'B'] = [avg] * (j-i+1)
i+=batch_size
df
corner case when len(df) % batch_size != 0 assumes we take the average of the leftover rows.
given the data set
#Create Series
s = pd.Series([[1,2,3,],[1,10,11],[2,11,12]],['buz','bas','bur'])
k = pd.Series(['y','n','o'],['buz','bas','bur'])
#Create DataFrame df from two series
df = pd.DataFrame({'first':s,'second':k})
I was able to create new columns based on all possible values of 'first'
def text_to_list(df,col):
val=df[col].explode().unique()
return val
unique=text_to_list(df,'first')
for options in unique :
df[options]=0
now I need to check off (or turn the value to '1') in each row and column where that value exists in the original list of 'first'
I'm pretty sure its a combination of .isin and/or .apply, but i'm struggling
the end result should be for row
buz: cols 1,2,3 are 1
bas: cols 1,10,11 are 1
bur: cols 2,11,12 are 1
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
adding the solution provided by -https://stackoverflow.com/users/3558077/ashutosh-porwal
df1=df.join(pd.get_dummies(df['first'].apply(pd.Series).stack()).sum(level=0))
print(df1)
Note: this solution did not require my hack job of creating the columns beforehand by explode column 'first'
From your update it seems that what you need is simply:
for opt in unique :
df[opt]=df['first'].apply(lambda x: int(opt in x))
Output:
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
Data:
>>> import pandas as pd
>>> s = pd.Series([[1,2,3,],[1,10,11],[2,11,12]],['buz','bas','bur'])
>>> k = pd.Series(['y','n','o'],['buz','bas','bur'])
>>> df = pd.DataFrame({'first':s,'second':k})
>>> df
first second
buz [1, 2, 3] y
bas [1, 10, 11] n
bur [2, 11, 12] o
Solution:
>>> df[df['first'].explode().to_list()] = 0
>>> df = df[['first', 'second']].join(df.apply(lambda x:x.loc[x['first']], axis=1).replace({0 : 1, np.nan : 0}).astype(int))
>>> df
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
Use pd.merge and pivot_table:
out = df.reset_index().explode('first') \
.pivot_table(values='index', index='second', columns='first',
aggfunc='any', fill_value=False, sort=False).astype(int)
out = df.merge(out, on='second')
Output:
>>> out
first second 1 2 3 10 11 12
0 [1, 2, 3] y 1 1 1 0 0 0
1 [1, 10, 11] n 1 0 0 1 1 0
2 [2, 11, 12] o 0 1 0 0 1 1
This question is very similar to one I posted before with just one change. Instead of doing just the absolute difference for all the columns I also want to find the magnitude difference for the 'Z' column, so if the current Z is 1.1x greater than prev than keep it.
(more context to the problem)
Pandas using the previous rank values to filter out current row
df = pd.DataFrame({
'rank': [1, 1, 2, 2, 3, 3],
'x': [0, 3, 0, 3, 4, 2],
'y': [0, 4, 0, 4, 5, 5],
'z': [1, 3, 1.2, 3.25, 3, 6],
})
print(df)
# rank x y z
# 0 1 0 0 1.00
# 1 1 3 4 3.00
# 2 2 0 0 1.20
# 3 2 3 4 3.25
# 4 3 4 5 3.00
# 5 3 2 5 6.00
Here's what I want the output to be
output = pd.DataFrame({
'rank': [1, 1, 2, 3],
'x': [0, 3, 0, 2],
'y': [0, 4, 0, 5],
'z': [1, 3, 1.2, 6],
})
print(output)
# rank x y z
# 0 1 0 0 1.0
# 1 1 3 4 3.0
# 2 2 0 0 1.2
# 5 3 2 5 6.00
basically what I want to happen is if the previous rank has any rows with x, y (+- 1 both ways) AND z (<1.1z) to remove it.
So for the rows rank 1 ANY rows in rank 2 that have any combo of x = (-1-1), y = (-1-1), z= (<1.1) OR x = (2-5), y = (3-5), z= (<3.3) I want it to be removed
Here's a solution using numpy broadcasting:
# Initially, no row is dropped
df['drop'] = False
for r in range(df['rank'].min(), df['rank'].max()):
# Find the x_min, x_max, y_min, y_max, z_max of the current rank
cond = df['rank'] == r
x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T
x_min, x_max = x + [[-1], [1]] # use numpy broadcasting to ±1 in one command
y_min, y_max = y + [[-1], [1]]
z_max = z * 1.1
# Find the x, y, z of the next rank. Raise them one dimension
# so that we can make a comparison matrix again x_min, x_max, ...
cond = df['rank'] == r + 1
if not cond.any():
continue
x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T[:, :, None]
# Condition to drop a row
drop = (
(x_min <= x) & (x <= x_max) &
(y_min <= y) & (y <= y_max) &
(z <= z_max)
).any(axis=1)
df.loc[cond, 'drop'] = drop
# Result
df[~df['drop']]
Condensed
An even more condensed version (and likely faster). This is a really good way to puzzle your future teammates when they read the code:
r, x, y, z = df[['rank', 'x', 'y', 'z']].T.to_numpy()
rr, xx, yy, zz = [col[:,None] for col in [r, x, y, z]]
drop = (
(rr == r + 1) &
(x-1 <= xx) & (xx <= x+1) &
(y-1 <= yy) & (yy <= y+1) &
(zz <= z*1.1)
).any(axis=1)
# Result
df[~drop]
What this does is comparing every row in df against each other (including itself) and return True (i.e. drop) if:
The current row's rank == the other row's rank + 1; and
The current row's x, y, z fall within the specified range of the other row's x, y, z
You need to slightly modify my previous code:
def check_previous_group(rank, d, groups):
if not rank-1 in groups.groups:
# check is a previous group exists, else flag all rows False (i.e. not to be dropped)
return pd.Series(False, index=d.index)
else:
# get previous group (rank-1)
d_prev = groups.get_group(rank-1)
# get the absolute difference per row with the whole dataset
# of the previous group: abs(d_prev-s)
# if all differences are within 1/1/0.1*z for x/y/z
# for at least one rows of the previous group
# then flag the row to be dropped (True)
return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*s['z']]).all(1).any(), axis=1)
groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]
output:
rank x y z
0 1 0 0 1.0
1 1 3 4 3.0
2 2 0 0 1.2
5 3 2 5 6.0
I have modified mozway's function so that it works according to your requirements.
# comparing 'equal' float values, may go wrong, that's why I am using this constant
DELTA=0.1**12
def check_previous_group(rank, d, groups):
if not rank-1 in groups.groups:
# check if a previous group exists, else flag all rows False (i.e. not to be dropped)
#return pd.Series(False, index=d.index)
return pd.Series(False, index=d.index)
else:
# get previous group (rank-1)
d_prev = groups.get_group(rank-1)
# get the absolute difference per row with the whole dataset
# of the previous group: abs(d_prev-s)
# if differences in x and y are within 1 and z < 1.1*x
# for at least one row of the previous group
# then flag the row to be dropped (True)
return d.apply(lambda s: (abs(d_prev-s)[['x', 'y']].le([1,1]).all(1)&
(s['z']<1.1*d_prev['x']-DELTA)).any(), axis=1)
tests,
>>> df = pd.DataFrame({
'rank': [1, 1, 2, 2, 3, 3],
'x': [0, 3, 0, 3, 4, 2],
'y': [0, 4, 0, 4, 5, 5],
'z': [1, 3, 1.2, 3.25, 3, 6],
})
>>> df
rank x y z
0 1 0 0 1.00
1 1 3 4 3.00
2 2 0 0 1.20
3 2 3 4 3.25
4 3 4 5 3.00
5 3 2 5 6.00
>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
>>> df[~mask]
rank x y z
0 1 0 0 1.0
1 1 3 4 3.0
2 2 0 0 1.2
5 3 2 5 6.0
>>> df = pd.DataFrame({
'rank': [1, 1, 2, 2, 3, 3],
'x': [0, 3, 0, 3, 4, 2],
'y': [0, 4, 0, 4, 5, 5],
'z': [1, 3, 1.2, 3.3, 3, 6],
})
>>> df
rank x y z
0 1 0 0 1.0
1 1 3 4 3.0
2 2 0 0 1.2
3 2 3 4 3.3
4 3 4 5 3.0
5 3 2 5 6.0
>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
>>> df[~mask]
rank x y z
0 1 0 0 1.0
1 1 3 4 3.0
2 2 0 0 1.2
3 2 3 4 3.3
5 3 2 5 6.0
Just takes an adjustment to the z term of the lamda equation from the linked post:
return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*d_prev['z']]).all(1).any(), axis=1)
Here's the full code that works for me:
df = pd.DataFrame({
'rank': [1, 1, 2, 2, 2, 3, 3],
'x': [0, 3, 0, 3, 3, 4, 2],
'y': [0, 4, 0, 4, 4, 5, 5],
'z': [1, 3, 1.2, 3.3, 3.31, 3, 6],
})
def check_previous_group(rank, d, groups):
if not rank-1 in groups.groups:
# check is a previous group exists, else flag all rows False (i.e. not to be dropped)
return pd.Series(False, index=d.index)
else:
# get previous group (rank-1)
d_prev = groups.get_group(rank-1)
# get the absolute difference per row with the whole dataset
# of the previous group: abs(d_prev-s)
# if all differences are within 1/1/0.1*z for x/y/z
# for at least one rows of the previous group
# then flag the row to be dropped (True)
return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*d_prev['z']]).all(1).any(), axis=1)
groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]
This works for me on Python 3.8.6
import pandas as pd
dfg = df.groupby("rank")
def filter_func(dfg):
for g in dfg.groups.keys():
if g-1 in dfg.groups.keys():
yield (
pd.merge(
dfg.get_group(g).assign(id = lambda df: df.index),
dfg.get_group(g-1),
how="cross", suffixes=("", "_prev")
).assign(
cond = lambda df: ~(
(df.x - df.x_prev).abs().le(1) & (df.y - df.y_prev).abs().le(1) & df.z.divide(df.z_prev).lt(1.1)
)
)
).groupby("id").agg(
{
**{"cond": "all"},
**{k: "first" for k in df.columns}
}).loc[lambda df: df.cond].drop(columns = ["cond"])
else:
yield dfg.get_group(g)
pd.concat(
filter_func(dfg), ignore_index=True
)
The output seems to match what you expected:
rank x y z
0 1 0 0 1.0
1 1 3 4 3.0
2 2 0 0 1.2
3 3 2 5 6.0
Small edit: in your question it seems like you care about the row index. The solution I posted just ignores this, but if you want to keep it, just save it as an additional column in the dataframe.
I would like to add a column in a data frame when another column is increasing/decreasing or stays the same with:
1 -> increasing, 0 -> same, -1 -> decreasing
So if df['battery'] = [1,2,3,4,7,9,3,3,3,]
I would like state to be df['state'] = [1,1,1,1,1,-1,0,0]
This should do the trick!
a = [1,2,3,4,7,9,3,3,3]
b = []
for x in range(len(a)-1):
b.append((a[x+1] > a[x]) - (a[x+1] < a[x]))
print(b)
You could use pd.Series.diff method to get the difference between consecutive values, and then assign the necessary state values by using boolean indexing:
import pandas as pd
df = pd.DataFrame()
df['battery'] = [1,2,3,4,7,9,3,3,3]
diff = df['battery'].diff()
df.loc[diff > 0, 'state'] = 1
df.loc[diff == 0, 'state'] = 0
df.loc[diff < 0, 'state'] = -1
print(df)
# battery state
# 0 1 NaN
# 1 2 1.0
# 2 3 1.0
# 3 4 1.0
# 4 7 1.0
# 5 9 1.0
# 6 3 -1.0
# 7 3 0.0
# 8 3 0.0
Or, alternatively, one could use np.select:
import numpy as np
diff = df['battery'].diff()
df['state'] = np.select([diff < 0, diff > 0], [-1, 1], 0)
# Be careful, default 0 will replace the first NaN as well.
print(df)
# battery state
# 0 1 0
# 1 2 1
# 2 3 1
# 3 4 1
# 4 7 1
# 5 9 1
# 6 3 -1
# 7 3 0
# 8 3 0
So here's your dataframe:
>>> import pandas as pd
>>> data = [[[1,2,3,4,7,9,3,3,3]]]
>>> df = pd.DataFrame(data, columns = ['battery'])
>>> df
battery
0 [1, 2, 3, 4, 7, 9, 3, 3, 3]
And finally use apply and a lambda function in order to generate the required result:
>>> df['state'] = df.apply(lambda row: [1 if t - s > 0 else -1 if t-s < 0 else 0 for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 1, 1, -1, 0, 0]
Alternatively, if you want the exact difference between each element in the list, you can use the following:
>>> df['state'] = df.apply(lambda row: [t - s for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 3, 2, -6, 0, 0]
Try pd.np.sign
pd.np.sign(df.battery.diff().fillna(1))
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 -1.0
7 0.0
8 0.0
Name: battery, dtype: float64
I'd like to create a new dataframe using the same values from another dataframe, unless there is a 0 value. If there is a 0 value, I'd like to find the average of the entry before and after.
For Example:
df = A B C
5 2 1
3 4 5
2 1 0
6 8 7
I'd like the result to look like the df below:
df_new = A B C
5 2 1
3 4 5
2 1 6
6 8 7
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[5, 3, 2, 6], 'B':[2, 4, 1, 8], 'C':[1, 5, 0, 7]})
Nrows = len(df)
def run(col):
originalValues = list(df[col])
values = list(np.where(np.array(list(df[col])) == 0)[0])
indices2replace = filter(lambda x: x > 0 and x < Nrows, values)
for index in indices2replace:
originalValues[index] = 0.5 * (originalValues[index+1] + originalValues[index-1])
return originalValues
newDF = pd.DataFrame(map(lambda x: run(x) , df.columns)).transpose()