Pandas conditional row values based on another column - python

[Picture of the dataframe]
Hi! I've been trying to figure out how I could calculate wallet balances of ERC-20 tokens, but can't get this to work. The idea is simple: when the "Status" column's row value is "Sending", the value should be negative, and when it is "Receiving", it should be positive. Lastly, I would use groupby and calculate sums by token symbol. The problem is that I can't get the conditional statement to work. What would be a way to do this? I've tried loop iterations, but they don't seem to work.

Assuming that df is the dataframe you presented, it's enough to select the proper slice and multiply the values by -1:
df.loc[df['Status'] == 'Sending', 'Value'] *= -1
And then grouping:
df = df.groupby(['Symbol']).sum().reset_index()
Looping in pandas is not a good idea – you can perform these operations in a more efficient, vectorised manner, so try to avoid it.
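For completeness, a minimal end-to-end sketch on made-up data (column names assumed from the screenshot; adjust to match your dataframe):
import pandas as pd

# Hypothetical example data; column names assumed from the question
df = pd.DataFrame({
    'Status': ['Sending', 'Receiving', 'Receiving', 'Sending'],
    'Symbol': ['DAI', 'DAI', 'USDC', 'USDC'],
    'Value':  [10.0, 25.0, 5.0, 2.0],
})

# Flip the sign of outgoing transfers, then sum per token
df.loc[df['Status'] == 'Sending', 'Value'] *= -1
balances = df.groupby('Symbol')['Value'].sum().reset_index()
print(balances)  # DAI 15.0, USDC 3.0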

Related

Pandas dataframe sum of row won't let me use result in equation

Can anybody help me understand why the code below doesn't work?
import yfinance as yf

start_date = '1990-01-01'
ticker_list = ['SPY', 'QQQ', 'IWM', 'GLD']
tickers = yf.download(ticker_list, start=start_date)['Close'].dropna()
ticker_vol_share = (tickers.pct_change().rolling(20).std()) \
    / ((tickers.pct_change().rolling(20).std()).sum(axis=1))
Both (tickers.pct_change().rolling(20).std()) and ((tickers.pct_change().rolling(20).std()).sum(axis=1)) run fine by themselves, but when combined they produce a dataframe with thousands of columns, all filled with NaN.
Try this.
rolling_std = tickers.pct_change().rolling(20).std()
ticker_vol_share = rolling_std.apply(lambda row: row / sum(row), axis=1)
This divides each row's rolling standard deviations by that row's total.
Why it's not working as expected:
Your tickers object is a DataFrame, as are tickers.pct_change(), tickers.pct_change().rolling(20) and tickers.pct_change().rolling(20).std(). The tickers.pct_change().rolling(20).std().sum(axis=1) is probably a Series.
You're therefore doing element-wise division of a DataFrame by a Series. This yields a DataFrame, and by default pandas aligns the Series' index against the DataFrame's columns, so if the two don't match (here the Series is indexed by dates and the columns by tickers), the result blows up into a much wider frame of NaN.
Without seeing your source data, it's hard to say for sure why the output DF is filled with nan, but that can certainly happen if some of the things you're dividing by are 0. It might also happen if each series is only one element long after taking the rolling average. It might also happen if you're actually evaluating a Series tickers rather than a DataFrame, since Series.sum(axis=1) doesn't make a whole lot of sense. It is also suspicious that your top and bottom portions of the division are probably different shapes, since sum() collapses an axis.
It's not clear to me what your expected output is, so I'll defer to others or wait for an update before answering that part.
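One hedged suggestion: if the goal is each ticker's rolling volatility as a share of the row total, dividing with an explicit axis keeps the row-sum Series aligned on the date index instead of being broadcast across the columns. A sketch, assuming tickers is the DataFrame from the question:
rolling_std = tickers.pct_change().rolling(20).std()

# div(..., axis=0) aligns the row-sum Series on the date index,
# rather than matching it against the column labels
ticker_vol_share = rolling_std.div(rolling_std.sum(axis=1), axis=0)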

How to delete an outlier from a np.where condition

I have this dataframe that has an outlier, which I recognized through a boxplot. Then I caught its value through np.where, but the thing is, I don't know how to delete this value and its whole row from my dataframe so that I can get rid of the outlier.
This is the code I used for it so far:
sns.boxplot(x=df_cor_inc['rt'].astype(float))
outlier = np.where(df_cor_inc['rt'].astype(float)>50000)
Any help would be great. Thanks.
No need for np.where, a simple boolean mask will do the trick:
df_cor_inc = df_cor_inc[df_cor_inc['rt'] <= 50000]
Also, why are you casting df_cor_inc['rt'] as float? Is it not already numeric?
If you want to reset the indices of your dataframe, tack on a .reset_index(drop=True).
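Putting it together on a toy frame (hypothetical data, just to show the shape of the operation):
import pandas as pd

# Hypothetical reaction-time data with one obvious outlier
df_cor_inc = pd.DataFrame({'rt': [350.0, 420.0, 61000.0, 510.0]})

# Keep only rows at or below the threshold, then renumber the index
df_cor_inc = df_cor_inc[df_cor_inc['rt'] <= 50000].reset_index(drop=True)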
Try this:
df_cor_inc[np.where(df_cor_inc['rt'].astype(float) > 50000, False, True)]

Function to get Row and column of panda dataset

I have a csv dataset with texts. I need to search through them. I couldn't find an easy way to search for a string in a dataset and get the row and column indexes. For example, let's say the dataset is like:
df = pd.DataFrame({"China": ['Xi','Lee','Hung'], "India": ['Roy','Rani','Jay'], "England": ['Tom','Sam','Jack']})
Now let's say I want to find the string 'rani' and know its location. Is there a simple function to do that? Or do I have to loop through everything to find it?
One vectorized (and therefore relatively scalable) solution to this is to leverage numpy.where:
import numpy as np
np.where(df == 'Rani')
This returns two arrays, corresponding to row and column indices:
(array([1]), array([1]))
You can continue to take advantage of vectorized operations, but also write a more complicated filtering function, like so:
np.where(df.applymap(lambda x: "ani" in x))
In other words, "apply to each cell the function that returns True if 'ani' is in the cell", and then conduct the same np.where filtering step.
You can use any function:
def _should_include_cell(cell_contents):
    return cell_contents.lower() == "rani" or "Xi" in cell_contents

np.where(df.applymap(_should_include_cell))
Some final notes:
applymap is slower than simple equality checking
if you need this to scale WAY up, consider using dask instead of pandas
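If you want the actual labels rather than positional indices, the np.where output can be mapped back through the dataframe's index and columns – a small sketch, assuming the same toy df:
rows, cols = np.where(df == 'Rani')

# Translate positions into row labels and column names
hits = list(zip(df.index[rows], df.columns[cols]))
print(hits)  # [(1, 'India')]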
Not sure how this will scale, but it works:
df[df.eq('Rani')].dropna(axis=1, how='all').dropna()
India
1 Rani

Drop row in Pandas dataframe if zero bordered by numbers (Python)

Due to some foibles in the API I'm using, sometimes a 'Zero' is returned when it should return a number; which works its way through to a Pandas dataframe that my script outputs (Python).
What would be a Pythonic way to drop a row if a zero is bordered both above and below by non-zero numbers? I can think of extensive loops to solve this, but that'd be quite an intensive way of going about this.
Note that elsewhere in the dataframe there'll be continuous rows of zeros, which are valid, so it's not simply a case of dropping all rows with zeros in them; I only want to drop rows with zero if they're bordered by rows with valid non-zero numbers.
Assuming col is the column you want to filter on, and its type is str (drop the quotes if it's a float):
df = df.loc[~ (df["col"].shift(-1).ne("0.0") & df["col"].eq("0.0") & df["col"].shift(1).ne("0.0"))]
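If the column is already numeric, the same shift-based pattern works with numbers instead of strings – a sketch on made-up data:
import pandas as pd

# Hypothetical data: the lone 0 at index 2 is an API glitch,
# the run of zeros at the end is genuine
df = pd.DataFrame({"col": [5.0, 3.0, 0.0, 4.0, 0.0, 0.0, 0.0]})

is_zero = df["col"].eq(0)
neighbours_nonzero = df["col"].shift(1).ne(0) & df["col"].shift(-1).ne(0)

# Drop only the zeros that sit between two non-zero rows
df = df.loc[~(is_zero & neighbours_nonzero)]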

Conditional mean over a Pandas DataFrame

I have a dataset from which I want a few averages of multiple variables I created.
I started off with:
data2['socialIdeology2'].mean()
data2['econIdeology'].mean()
^ that works perfectly, and gives me the averages I'm looking for.
Now, I'm trying to do a conditional mean, i.e. the mean only for a select group within the data set. (I want the ideologies broken down by whom respondents voted for in the 2016 election.) In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'
Looking into it, I came to the conclusion a conditional mean just isn't a thing (although hopefully I am wrong?), so I was writing my own function for it.
This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:
def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count

print(mean())
Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.
So my question is: if the data stored within the socialIdeology2 variable really is a nan (which I don't understand how it could be), why is it that the .mean() function works with it?
And how can I generate means by category?
Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():
means = data2.groupby('voteChoice').mean()
or maybe, in your case, the following would be more efficient:
means = data2.groupby('voteChoice')['socialIdeology2'].mean()
to drill down to the mean you're looking for. (The first case will calculate means for all columns.) This is assuming that voteChoice is the name of the column you want to condition on.
If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:
voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:
means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()
Placing the ['socialIdeology2'] index before the .mean() means that you only compute the mean over the column you're interested in, whereas if you place the indexing expression after the .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']) this computes the means over all columns and then selects only the 'socialIdeology2' column from the result, which is less efficient.
See the pandas documentation for more info on indexing DataFrames using .loc and on groupby.
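As a side note on the NaN puzzle: pandas' .mean() skips missing values by default (skipna=True), while adding a NaN into a hand-rolled running sum turns the whole sum into NaN. A tiny sketch with made-up data, showing the per-group mean ignoring the missing value:
import pandas as pd
import numpy as np

# Hypothetical survey data; NaN marks a missing response
data2 = pd.DataFrame({
    'voteChoice':      ['Clinton', 'Trump', 'Clinton', 'Trump'],
    'socialIdeology2': [2.0, 5.0, np.nan, 4.0],
})

# NaN rows are ignored, so the Clinton mean is 2.0, not NaN
print(data2.groupby('voteChoice')['socialIdeology2'].mean())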
