Suppose I have the following dataframe.
df = pd.DataFrame({"a": [1, 0, 0, 2, 0]})
I want to construct a new dataframe based on df such that
newdf[0] = 1 or nan
newdf[1] = 0 + newdf[0] * exp(-alpha) # Alpha is some value.
newdf[2] = 0 + newdf[1] * exp(-alpha)
newdf[3] = 2 + newdf[2] * exp(-alpha)
newdf[4] = 0 + newdf[3] * exp(-alpha)
Basically, I want to construct a new dataframe that accepts instantaneous changes and then decays its own value.
Is there an elegant way to achieve this using pd.rolling or pd.ewm?
I'd like to avoid any for-loop because the dataframe has many rows and columns.
Thanks
Use -
alpha = 2
df['new'] = 1 or np.nan  # note: "1 or np.nan" evaluates to 1
df['new'] = df['a'] + df['a'].shift(-1) * np.exp(-alpha)
Note that import numpy as np is a dependency. The last row of df['new'] will be np.nan with this approach, because shift(-1) has nothing to shift in from below.
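As a side note, the recursion described in the question (newdf[i] = df[i] + newdf[i-1] * exp(-alpha)) is a first-order linear filter, so if SciPy is an acceptable dependency, scipy.signal.lfilter can evaluate it without a Python loop. A minimal sketch (not pd.rolling or pd.ewm, and assuming the series should simply start at df['a'][0]):

import numpy as np
import pandas as pd
from scipy.signal import lfilter

alpha = 2
df = pd.DataFrame({"a": [1, 0, 0, 2, 0]})

# new[i] = a[i] + exp(-alpha) * new[i-1], evaluated in compiled code, no Python loop
decay = np.exp(-alpha)
df["new"] = lfilter([1.0], [1.0, -decay], df["a"].to_numpy(dtype=float))
print(df)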
I have a dataframe (used_dataframe) that contains duplicates. I am required to create a list that contains the indices of those duplicates.
For this I used a function I found here:
Find indices of duplicate rows in pandas DataFrame
def duplicates(x):
    #dataframe = pd.read_csv(x)
    #df = dataframe.iloc[: , 1:]
    df = x
    duplicateRowsDF = df[df.duplicated()]
    df = df[df.duplicated(keep=False)]
    tuppl = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()  # this is the function!
    n = 1  # N. . .
    indicees = [x[n] for x in tuppl]
    return indicees

duplicates(used_df)
The next function I need is one where I remove the duplicates from the dataset, which I did like this:
def handling_duplicate_entries(mn):
    x = tidy(mn)
    indices = duplicates(tidy(mn))
    used_df = x
    used_df['indexcol'] = range(0, len(tidy(mn)))
    dropped = used_df[~used_df['indexcol'].isin(indices)]
    finito = dropped.drop(columns=['indexcol'])
    return finito

handling_duplicate_entries(used_df)
And it works - but when I want to check my solution (to assess that all duplicates have been removed), which I do with duplicates(handling_duplicate_entries(used_df)) and which should come back empty to show that there are no duplicates, it returns the error 'DataFrame' object has no attribute 'tolist'.
In the question linked above, this has also been raised in a comment but not solved. To be frank, I would love to find a different solution for the duplicates function, because I don't quite understand it, but so far I haven't.
OK, I'll do my best. If you are trying to find the duplicate indices and want to store those values in a list, you can use the following code. I have also included a small example that builds a dataframe and splits it into the duplicated rows (original) and the rows without any duplicated data.
import pandas as pd

# Toy dataset
data = {
    'A': [0, 0, 3, 0, 3, 0],
    'B': [0, 1, 3, 2, 3, 0],
    'C': [0, 1, 3, 2, 3, 0]
}
df = pd.DataFrame(data)

# Count how often each unique row occurs and keep only the rows that occur more than once
group = df.groupby(list(df.columns)).size()
group = group[group > 1].reset_index(name='count')
group = group.drop(columns=['count']).reset_index().rename(columns={'index': 'count'})

# Merge back against the original (with its index as a column) to recover the
# original indices of all duplicated rows
idxs = df.reset_index().merge(group, how='right')['index'].values

duplicates = df.loc[idxs]
no_duplicates = df.loc[~df.index.isin(idxs)]
duplicates
   A  B  C
0  0  0  0
5  0  0  0
2  3  3  3
4  3  3  3

no_duplicates
   A  B  C
1  0  1  1
3  0  2  2
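For what it's worth, pandas' built-in duplicated / drop_duplicates can also give you the duplicate indices and the de-duplicated frame directly. A short sketch on the same toy data, assuming the goal is to keep the first occurrence of each duplicated row:

import pandas as pd

df = pd.DataFrame({'A': [0, 0, 3, 0, 3, 0],
                   'B': [0, 1, 3, 2, 3, 0],
                   'C': [0, 1, 3, 2, 3, 0]})

# Indices of every row that occurs more than once (all occurrences): [0, 2, 4, 5]
dup_idx = df.index[df.duplicated(keep=False)].tolist()

# Indices of only the repeated occurrences (first occurrence not listed): [4, 5]
repeat_idx = df.index[df.duplicated()].tolist()

# Frame with duplicates removed, keeping the first occurrence of each row
no_duplicates = df.drop_duplicates()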
So, I have the following sample dataframe (included only one row for clarity/simplicity):
df = pd.DataFrame({'base_number': [2],
'std_dev': [1]})
df['amount_needed'] = 5
df['upper_bound'] = df['base_number'] + df['std_dev']
df['lower_bound'] = df['base_number'] - df['std_dev']
For each given row, I would like to generate the number of new rows given by df['amount_needed'] (so 5, in this example). I would like those 5 new rows to be spread evenly across the range given by df['lower_bound'] and df['upper_bound']. So for the example above, I would like the following result as an output:
df_new = pd.DataFrame({'base_number': [1, 1.5, 2, 2.5, 3]})
Of course, this process will be done for all rows in a much larger dataframe, with many other columns which aren't relevant to this particular issue, which is why I'm trying to find a way to automate this process.
One row of df will create one series (or one data frame). Here's one way to iterate over df and create the series with the values you specified:
import numpy as np

for row in df.itertuples():
    arr = np.linspace(row.lower_bound,
                      row.upper_bound,
                      row.amount_needed)
    s = pd.Series(arr).rename('base_number')
    print(s)
0    1.0
1    1.5
2    2.0
3    2.5
4    3.0
Name: base_number, dtype: float64
Ended up using jsmart's contribution and building on it to generate a new dataframe, conserving the original ids so that the other columns from the old dataframe can be merged onto this new one by id as needed (whole process shown below):
import itertools

import numpy as np
import pandas as pd

amount_needed = 5
df = pd.DataFrame({'base_number': [2, 4, 8, 0],
                   'std_dev': [1, 2, 3, 0]})
df['amount_needed'] = amount_needed
df['upper_bound'] = df['base_number'] + df['std_dev']
df['lower_bound'] = df['base_number'] - df['std_dev']

s1 = pd.Series([], dtype=int)
for row in df.itertuples():
    arr = np.linspace(row.lower_bound,
                      row.upper_bound,
                      row.amount_needed)
    s = pd.Series(arr).rename('base_number')
    s1 = pd.concat([s1, s])

df_new = pd.DataFrame({'base_number': s1})

ids_og = list(range(1, len(df) + 1))
ids_og = [ids_og] * amount_needed
ids_og = sorted(list(itertools.chain.from_iterable(ids_og)))
df_new['id'] = ids_og
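A possible loop-free variant of the same idea, just as a sketch: np.linspace also accepts array-valued start/stop (NumPy 1.16+), so the whole grid can be built in one call and the ids repeated with np.repeat:

import numpy as np
import pandas as pd

amount_needed = 5
df = pd.DataFrame({'base_number': [2, 4, 8, 0],
                   'std_dev': [1, 2, 3, 0]})
df['upper_bound'] = df['base_number'] + df['std_dev']
df['lower_bound'] = df['base_number'] - df['std_dev']

# One row of evenly spaced values per input row, shape (len(df), amount_needed)
grid = np.linspace(df['lower_bound'].to_numpy(),
                   df['upper_bound'].to_numpy(),
                   amount_needed, axis=1)

df_new = pd.DataFrame({'base_number': grid.ravel(),
                       'id': np.repeat(np.arange(1, len(df) + 1), amount_needed)})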
I want to remove outliers based on the 99th-percentile values, group-wise.
import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','B','B','B','B'], 'count': [1.1,11.2,1.1,3.3,3.40,3.3,100.0]})
In the output I want to remove 11.2 from group A and 100 from group B, so in the final dataset there will only be 5 observations.
wantdf = pd.DataFrame({'Group': ['A','A','B','B','B'], 'count': [1.1,1.1,3.3,3.40,3.3]})
I have tried this, but I'm not getting the desired results:
df[df.groupby("Group")['count'].transform(lambda x : (x<x.quantile(0.99))&(x>(x.quantile(0.01)))).eq(1)]
Here is my solution:
def is_outlier(s):
    lower_limit = s.mean() - (s.std() * 3)
    upper_limit = s.mean() + (s.std() * 3)
    return ~s.between(lower_limit, upper_limit)

df = df[~df.groupby('Group')['count'].apply(is_outlier)]
You can write your own is_outlier function to match whatever rule you prefer.
I don't think you want to use quantile, as you'll exclude your lower values:
import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','B','B','B','B'], 'count': [1.1,11.2,1.1,3.3,3.40,3.3,100.0]})
print(pd.DataFrame(df.groupby('Group').quantile(.01)['count']))
output:

       count
Group
A        1.1
B        3.3
Those aren't outliers, right? So you wouldn't want to exclude them.
You could try setting left and right limits by using standard deviations from the median maybe? This is a bit verbose, but it gives you the right answer:
# Per-group bounds: median +/- one standard deviation
left = pd.DataFrame(df.groupby('Group').median() - pd.DataFrame(df.groupby('Group').std()))
right = pd.DataFrame(df.groupby('Group').median() + pd.DataFrame(df.groupby('Group').std()))
left.columns = ['left']
right.columns = ['right']

# Attach the bounds to each row, filter, then drop the helper columns
df = df.merge(left, left_on='Group', right_index=True)
df = df.merge(right, left_on='Group', right_index=True)
df = df[(df['count'] > df['left']) & (df['count'] < df['right'])]
df = df.drop(['left', 'right'], axis=1)
print(df)
output:

  Group  count
0     A    1.1
2     A    1.1
3     B    3.3
4     B    3.4
5     B    3.3
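The same median +/- one standard deviation rule can also be written more compactly with groupby().transform; a sketch with the same logic, starting again from the original df:

import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'count': [1.1, 11.2, 1.1, 3.3, 3.40, 3.3, 100.0]})

# Keep rows strictly inside median +/- one standard deviation, computed per group
med = df.groupby('Group')['count'].transform('median')
std = df.groupby('Group')['count'].transform('std')
print(df[(df['count'] > med - std) & (df['count'] < med + std)])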
I've been browsing for an answer to my issue but I can't seem to find a suitable solution. I have a dataframe with distances (NxN cells) and I find the minimum distance of the whole dataframe with:
min_distance = distances.values.min()
Now I need to find the location (which row and which column of the dataframe) of the min_distance. Any ideas?
EDIT
Minimal code
import numpy as np
import pandas as pd

distances = []
for i in range(5):
    distances.append([])
    for j in range(5):
        distances[i].append(np.random.randint(10))
distances = pd.DataFrame(distances)

min_distance = distances.values.min()
print("Minimum=", min_distance)
print("Location of minimum value=")
It depends on what form you want your result in, but a very straightforward approach would be to use stack and idxmin.
Like so:
Setup
import pandas as pd

df = pd.DataFrame([[2, 2, 2], [2, 1, 2], [2, 2, 2]],
                  columns=list('ABC'), index=list('abc'))
print(df)

   A  B  C
a  2  2  2
b  2  1  2
c  2  2  2
We should expect the min to be 1 and the location to be row b, column B.
Solution
df.stack().idxmin()
('b', 'B')
Now you could manipulate this to deliver the result any other way you like. This just happens to deliver a tuple of (row label, column label).
Generate example:
import numpy as np
import pandas as pd

N = 4
df = pd.DataFrame(np.random.rand(N, N))
Find minimal index of flattened dataframe:
idx_min = df.values.flatten().argmin()
Simple arithmetic to get the row and column numbers back:
row = idx_min // N
column = idx_min % N
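Alternatively, np.unravel_index performs the same row/column arithmetic, and the positions can then be mapped back to labels if the index or columns are not the default 0..N-1; a small sketch:

import numpy as np
import pandas as pd

N = 4
df = pd.DataFrame(np.random.rand(N, N))

# Positional (row, column) of the minimum value
row, column = np.unravel_index(df.values.argmin(), df.shape)

# Corresponding labels, in case the index/columns are not 0..N-1
row_label, col_label = df.index[row], df.columns[column]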
I have a dataframe df like
A B
1 2
3 4
I then want to create 2 new series
t = pd.Series()
r = pd.Series()
I was able to assign values to t using the condition cond as below
t = "1+" + df.A.astype(str) + '+' + df.B.astype(str)
cond = df['A']<df['B']
team[cond] = "1+" + df.loc[cond,'B'].astype(str) + '+' + df.loc[cond,'A'].astype(str)
But I'm having problems with r. I just want r to contain the value 2 when cond is satisfied and 1 otherwise.
If I just try
r = 1
r[cond] = 2
Then I get TypeError: 'int' object does not support item assignment
I figure I could just run a for loop through df and check cond for each row, but I was wondering whether pandas offers a more efficient way instead?
You will laugh at how easy this is:
r = cond + 1
The reason is that cond is a boolean Series (True and False values), and booleans evaluate to 1 and 0. Adding one coerces the booleans to ints, which means True maps to 2 and False maps to 1.
df = pd.DataFrame({'A': [1, 3, 4],
'B': [2, 4, 3]})
cond = df['A'] < df['B']
>>> cond + 1
0    2
1    2
2    1
dtype: int64
When you assign 1 to r as in
r = 1
r now references the integer 1. So when you call r[cond] you're treating an integer like a series.
You want to first create a Series of ones for r, the same size as cond. Something like:
import numpy as np
r = pd.Series(np.ones(cond.shape))
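Equivalently, np.where builds the 1-or-2 values in a single step; a sketch of that variant, using the same cond as in the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4],
                   'B': [2, 4, 3]})
cond = df['A'] < df['B']

# 2 where the condition holds, 1 otherwise, aligned to df's index
r = pd.Series(np.where(cond, 2, 1), index=df.index)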