I am performing math operations on a dataframe. This is an operation that works for me:
df['delta'] = np.where(df['peak'] == 1, df['value'] - df['value'].shift(1), np.nan)
Now I want to do EXACTLY the same thing with the index. The index values are integers, but they have gaps.
My DataFrame looks like this:
     value   peak
40    5878      1
90    8091      1
98    9091      1
101  10987      1
So, how should I write my line? I mean something like this:
df['i'] = np.where(df['peak'] == 1, df.index - df.index.shift(1), np.nan)
So, I want to get a column with the values 50, 8, 3 and so on...
Since df.index.shift() is only implemented for time-related indexes, use NumPy's diff with prepend=np.nan as a replacement:
df['i'] = np.where(df.peak == 1, np.diff(df.index, prepend=np.nan), np.nan)
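As a quick sanity check, here is a minimal sketch that rebuilds the sample frame from the question and applies the line above:

import numpy as np
import pandas as pd

# rebuild the sample frame with the gapped integer index from the question
df = pd.DataFrame({'value': [5878, 8091, 9091, 10987],
                   'peak': [1, 1, 1, 1]},
                  index=[40, 90, 98, 101])

# diff of the index itself; prepend=np.nan keeps the output the same length
df['i'] = np.where(df.peak == 1, np.diff(df.index, prepend=np.nan), np.nan)
print(df['i'])   # NaN, then 50.0, 8.0, 3.0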
I have a df as below
I want to make this df binary, as follows: 1 where a value is greater than 0, and 0 otherwise.
I tried
df[:] = np.where(df > 0, 1, 0)
but with this I am losing my df index.
I could try this on all columns one by one, or use a loop, but I think there should be an easier and quicker way to do this.
You can convert the boolean mask from DataFrame.gt to integers:
df1 = df.gt(0).astype(int)
Or use DataFrame.clip if the values are integers with no negatives:
df1 = df.clip(upper=1)
Your solution should work with loc:
df.loc[:] = np.where(df>0, 1, 0)
Of course it is possible with a function, but it can be done with just an operator:
(df > 0) * 1
Without using NumPy (this assumes all values are non-negative, so anything not greater than 0 is already 0):
df[df > 0] = 1
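For a quick self-contained check of these approaches (the frame below is made up, since the question's data isn't shown):

import pandas as pd

# hypothetical frame with a non-default index, mimicking the question
df = pd.DataFrame({'a': [0, 3, 0], 'b': [5, 0, 2]}, index=['x', 'y', 'z'])

print(df.gt(0).astype(int))   # boolean mask -> 0/1, index preserved
print(df.clip(upper=1))       # same result for non-negative integers
print((df > 0) * 1)           # operator version, index preserved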
Suppose I have two series:
s = pd.Series([20, 21, 12])
t = pd.Series([17, 19, 11])
I want to apply a two argument function to the two series to get a series of results (as a series). Now, one way to do it is as follows:
df = pd.concat([s, t], axis=1)
result = df.apply(lambda x: foo(x[0], x[1]), axis=1)
But this seems clunky. Is there any more elegant way?
There are many ways to do what you want.
Depending on the function in question, you may be able to apply it directly to the series. For example, calling s + t returns
0    37
1    40
2    23
dtype: int64
However, if your function is more complicated than simple arithmetic, you may need to get creative. One option is to use the built-in Python map function. For example, calling
list(map(np.add, s, t))
returns
[37, 40, 23]
If the two series have the same index, you can create a series with list comprehension:
result = pd.Series([foo(xs, xt) for xs, xt in zip(s, t)], index=s.index)
If you can't guarantee that the two series have the same index, concat is the way to go as it helps align the index.
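For instance, a minimal sketch of that alignment behaviour, with a hypothetical foo and deliberately mismatched indexes:

import pandas as pd

def foo(a, b):                # hypothetical two-argument function
    return a - b

s = pd.Series([20, 21, 12], index=[0, 1, 2])
t = pd.Series([17, 19, 11], index=[1, 2, 3])   # deliberately shifted index

df = pd.concat([s, t], axis=1)                 # aligns on the union of the indexes
result = df.apply(lambda x: foo(x[0], x[1]), axis=1)
print(result)   # NaN at 0 and 3, where only one series has a value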
If I understand correctly, you can use this to apply a function to two columns and store the result in another column:
df['result'] = df[['s', 't']].apply(lambda row: foo(row['s'], row['t']), axis=1)
It might be possible to use numpy.vectorize:
from numpy import vectorize
vect_foo = vectorize(foo)
result = vect_foo(s, t)
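A minimal sketch of that in action, again with a hypothetical foo; note that np.vectorize returns a plain ndarray, so re-wrapping in a Series keeps the index:

import numpy as np
import pandas as pd

def foo(a, b):                 # hypothetical two-argument function
    return a if a > b else b

s = pd.Series([20, 21, 12])
t = pd.Series([17, 19, 11])

vect_foo = np.vectorize(foo)
result = pd.Series(vect_foo(s, t), index=s.index)
print(result)   # 20, 21, 12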
I've been going crazy trying to figure this out. I'm trying to avoid using df.iterrows() to iterate through the rows of a dataframe, as it's quite time-consuming, and .loc is better from what I've seen.
I know this works:
df = df.loc[df.number == 3, :]
And that'll basically set df to be each row where the "number" column is equal to 3.
But, I get an error when I try something like this:
df = df.loc[someFunction(df.number), :]
What I want is to get every row where someFunction() returns True whenever the "number" value of said row is set as the parameter.
For some reason, it's passing the entire column (the dataframe's entire "number" column, in this example) instead of the value of each row as it iterates through the rows, like the previous example does.
Again, I know I can just use a for loop and .iterrows(), but I'm working with around 280,000 rows and it just takes longer than I'd like. I've also tried using a lambda function, among other things.
Apply is slow. If you can, put the complex vectorization logic in a function that takes series as arguments:
import pandas as pd
df = pd.DataFrame()
df['a'] = [7, 6, 5, 4, 3, 2]
df['b'] = [1, 2, 3, 4, 5, 6]
def my_func(series1, series2):
    return (series2 > 3) | (series1 == series2)
df.loc[my_func(df.b, df.a), 'new_column_name'] = True
I think this is what you need:
import pandas as pd
df = pd.DataFrame({"number": [x for x in range(10)]})
def someFunction(row):
    if row > 5:
        return True
    else:
        return False
df = df.loc[df.number.apply(someFunction)]
print(df)
Output:
   number
6       6
7       7
8       8
9       9
You can use an anonymous function with .loc
x refers to the dataframe you are indexing
df.loc[lambda x: x.number > 5, :]
Two options I can think of:
Create a new column using the pandas apply() method and a lambda function that returns either True or False depending on someFunction(). Then, use loc to filter on the new column you just created.
Use a for loop and df.itertuples(), as it is way faster than iterrows. Make sure to look up the documentation, as the syntax is slightly different for itertuples; see the sketch below.
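A minimal itertuples sketch along those lines (someFunction here is a hypothetical predicate mirroring the examples above):

import pandas as pd

df = pd.DataFrame({'number': range(10)})

def someFunction(n):          # hypothetical predicate on a single value
    return n > 5

# itertuples yields namedtuples: attribute access (row.number), not row['number']
keep = [row.Index for row in df.itertuples() if someFunction(row.number)]
print(df.loc[keep])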
Something like this will work:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['number'] = np.arange(10)
display(df[df['number'] > 5])
display(df[(df['number'] > 2) & (df['number'] < 7)])
In a pandas dataframe, a function can be used to group its index. I'm looking to define a function that instead is applied to a column.
I'm looking to group by two columns, except I need the second column to be grouped by an arbitrary function, foo:
group_sum = df.groupby(['name', foo])['tickets'].sum()
How would foo be defined to group the second column into two groups, demarcated by whether values are > 0, for example? Or, is an entirely different approach or syntax used?
Groupby can accept any combination of labels and series/arrays (as long as the array has the same length as your dataframe), so you can map the function onto your column and pass it into the groupby, like:
df.groupby(['name', df[1].map(foo)])
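For a concrete (hypothetical) foo and a made-up frame, that might look like:

import pandas as pd

# made-up frame whose second column is literally named 1, matching the snippet above
df = pd.DataFrame({'name': ['a', 'a', 'b'],
                   'tickets': [3, 5, 2],
                   1: [-1, 2, 4]})

def foo(v):                    # hypothetical grouping function
    return v > 0

group_sum = df.groupby(['name', df[1].map(foo)])['tickets'].sum()
print(group_sum)               # sums for (a, False), (a, True), (b, True)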
Alternatively, you might want to add the condition as a new column to your dataframe before you perform the groupby; this has the advantage of giving it a name in the index:
df['>0'] = df[1] > 0
group_sum = df.groupby(['name', '>0'])['tickets'].sum()
Something like this will work:
x.groupby(['name', x['value'] > 0])['tickets'].sum()
As mentioned above, groupby can accept labels and series. This should give you the answer you are looking for. Here is an example:
import numpy as np
import pandas as pd

data = np.array([[1, -1, 20], [1, 1, 50], [1, 1, 50], [2, 0, 100]])
x = pd.DataFrame(data, columns=['name', 'value', 'value2'])
x.groupby(['name', x['value'] > 0])['value2'].sum()
name  value
1     False     20
      True     100
2     False    100
Name: value2, dtype: int64
I was wondering if there is a faster way to assign new values to cells in a pandas dataframe, conditional on the value of another cell. For example, take this df:
df = pd.DataFrame({'rank':[1, 1, 1, 1, 2, 2, 2, 2], 'condition':[.01, .01, .01, .01, .01, .01, .01, .01]})
The following code works:
def changerank(row):
    if (row['condition'] == 0) & (row['rank'] > 1):
        row['rank'] = 1
    return row
df = df.apply(changerank, axis=1)
But it is rather slow on my real dataframe, which contains millions of rows. I feel there may be another way to change the values of 'rank' depending on the values of each row.
Thanks for any thoughts!
You can use .loc (the older .ix indexer this answer originally suggested is deprecated and has been removed from modern pandas):
df.loc[(df.condition == 0) & (df['rank'] > 1), 'rank'] = 1
Note the bracket form df['rank'] rather than df.rank: rank is also the name of a DataFrame method, so attribute access would return the method instead of the column.
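An equivalent vectorized form uses np.where; here is a minimal sketch on the sample frame, with one condition value changed to 0 so the rule actually fires:

import numpy as np
import pandas as pd

# sample frame from the question, with one condition set to 0 for illustration
df = pd.DataFrame({'rank': [1, 1, 1, 1, 2, 2, 2, 2],
                   'condition': [.01, .01, .01, .01, 0, .01, .01, .01]})

# bracket access df['rank'] is required because rank is also a DataFrame method
df['rank'] = np.where((df.condition == 0) & (df['rank'] > 1), 1, df['rank'])
print(df['rank'].tolist())   # [1, 1, 1, 1, 1, 2, 2, 2]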