Due to some foibles in the API I'm using, it sometimes returns a zero when it should return a number, and that zero works its way through to the Pandas dataframe that my script outputs (Python).
What would be a Pythonic way to drop a row if a zero is bordered both above and below by non-zero numbers? I can think of extensive loops to solve this, but that would be quite an intensive way of going about it.
Note that elsewhere in the dataframe there'll be continuous rows of zeros, which are valid, so it's not simply a case of dropping all rows with zeros in them; I only want to drop rows with zero if they're bordered by rows with valid non-zero numbers.
Assuming col is the column you want to filter on, and its dtype is str (drop the quotes around "0.0" if it's float):
df = df.loc[~ (df["col"].shift(-1).ne("0.0") & df["col"].eq("0.0") & df["col"].shift(1).ne("0.0"))]
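If the column is numeric instead, the same pattern works with numeric comparisons; a minimal sketch, assuming the spurious zeros really are 0.0 floats:
mask = df["col"].shift(-1).ne(0) & df["col"].eq(0) & df["col"].shift(1).ne(0)
df = df.loc[~mask]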
I have a large dataframe that consists of around 19,000 rows and 150 columns. Many of these columns contain values with -1s and -2s. When I try to replace the -1s and -2s with 0 using the following code, Jupyter times out on me and says there is no memory left. So, I am curious if you can select a range of columns and apply the replace function. This way I can replace in batches, since I can't seem to replace in one pass based on my available memory.
Here is the code I tried to use that timed out on me when first replacing the -2s:
df.replace(to_replace=-2, value="0")
Thank you for any guidance!
Sean
Let's say you want to divide your columns into chunks of 10; then you could try something like this:
columns = your_df.columns
division_num = 10
chunks_num = -(-len(columns) // division_num)  # ceiling division, so any leftover columns are included

index = 0
for i in range(chunks_num):
    cols = columns[index: index + division_num]
    # replace() is not in-place, so assign the result back; a numeric 0 avoids upcasting the columns to object
    your_df[cols] = your_df[cols].replace(to_replace=-2, value=0)
    index += division_num
If your memory keeps overflowing, then maybe you can try the loc/iloc functions to divide the data by rows instead of columns, as sketched below.
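A rough sketch of that row-wise variant; chunk_size is an arbitrary choice here, adjust it to your memory budget:
chunk_size = 1000
for start in range(0, len(your_df), chunk_size):
    # process a block of rows at a time and write the result back in place
    your_df.iloc[start: start + chunk_size] = your_df.iloc[start: start + chunk_size].replace(to_replace=-2, value=0)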
[picture of the dataframe omitted]
Hi! I've been trying to figure out how I could calculate wallet balances of ERC-20 tokens, but can't get this to work. The idea is simple: when the "Status" column's value is "Sending", the value would be negative, and when it is "Receiving", it would be positive. Lastly I would use groupby and calculate sums by token symbols. The problem is, I can't get the conditional statement to work. What would be a way to do this? I've tried making loop iterations but they don't seem to work.
Assuming that df is the dataframe you presented, it's enough to select the proper slice and multiply the values by -1:
df.loc[df['Status'] == 'Sending', 'Value'] *= -1
And then grouping:
df = df.groupby(['Symbol']).sum().reset_index()
Looping in pandas is not a good idea; you can perform these operations in a more optimal, vectorised manner, so try to avoid it.
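A minimal end-to-end sketch with made-up data (the column names are taken from your description, the values are invented):
import pandas as pd

df = pd.DataFrame({
    "Status": ["Sending", "Receiving", "Sending", "Receiving"],
    "Symbol": ["DAI", "DAI", "USDC", "USDC"],
    "Value":  [10.0, 25.0, 5.0, 5.0],
})

# flip the sign of outgoing transfers, then sum per token
df.loc[df["Status"] == "Sending", "Value"] *= -1
balances = df.groupby(["Symbol"])["Value"].sum().reset_index()
# balances: DAI 15.0, USDC 0.0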
Pandas loc method when used with row filter throws an error
test[test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)]
IndexingError: Unalignable boolean Series provided as indexer (index
of the boolean Series and of the indexed object do not match).
whereas the same code without row filter works fine
test[test.loc[:,['holiday','weekday']].apply(lambda x:True,axis=1)]
steps to reproduce
test=pd.DataFrame({"holiday":[0,0,0],"weekday":[1,2,3],"workingday":[1,1,1]})
test[test.loc[:,['holiday','weekday']].apply(lambda x:True,axis=1)] ##works fine
test[test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)] ##fails
I am trying to understand what the difference is between these two that makes one fail while the other succeeds.
So the basic syntax is DataFrame[things to look for, e.g. row slices or columns].
With that in mind, you are trying to filter your dataframe test with the following commands (these are the code snippets in the brackets):
test.loc[:,['holiday','weekday']].apply(lambda x:True,axis=1)
This returns True for every row in the dataframe, and therefore the "filter" returns the entire dataframe.
test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)
This part itself works: it slices rows 0 and 1 and then applies the lambda function, so the "filter" consists of True for only 2 rows. The point is that there is no value for the third row, and this causes your error: the index of the dataframe to be sliced (3 rows) and the index of the boolean Series used to slice it (2 values) don't match.
Solving this problem depends on what you actually want as your output, i.e. whether the lambda function is supposed to be applied only to a subset of the data, or whether you want to work with only a subset of the results.
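Depending on which of those you want, one of these sketches (both assume the test dataframe from the steps to reproduce) should cover it:
# Option 1: apply the lambda only to rows 0 and 1, then align the partial mask with the full index
mask = test.loc[0:1, ['holiday', 'weekday']].apply(lambda x: True, axis=1)
result = test[mask.reindex(test.index, fill_value=False)]

# Option 2: slice the rows first and filter only within that subset
subset = test.loc[0:1]
result = subset[subset[['holiday', 'weekday']].apply(lambda x: True, axis=1)]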
I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
def calculate_dfA(df_t, xout):
    df_A = df_t.copy()
    vector_x = xout.T
    for index_col, column in tqdm(df_A.iteritems()):
        for index_row, row in df_A.iterrows():
            df_A.iloc[index_row, index_col] = df_A.iloc[index_row, index_col] / vector_x.iloc[0, index_col]
    return df_A
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm, the processing speed is roughly 4.5 s/it. Accordingly, the calculation would require approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?
You need to vectorize your division:
result = df_A.values/vector_x
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
Edit: Coming back to this answer today, I noticed that converting to a numpy array first speeds up the computation. Locally I get a 10x speedup for an array of the size in the question above. I have edited my answer accordingly.
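A small sketch of the full vectorised version, wrapping the result back into a DataFrame; the variable names follow the question, but the toy shapes and values are assumptions:
import numpy as np
import pandas as pd

df_t = pd.DataFrame(np.random.rand(4, 3), columns=["a", "b", "c"])
xout = pd.Series([2.0, 4.0, 8.0], index=["a", "b", "c"])  # one divisor per column

# (n_rows, n_cols) / (n_cols,) broadcasts the division across every row at once
df_A = pd.DataFrame(df_t.values / xout.values, index=df_t.index, columns=df_t.columns)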
I'm on mobile now, but you should try to avoid every for loop in Python; there's always a better way.
For one, I know you can multiply a pandas column (Series) by another column to get your desired result.
I think that to multiply every column by the matching column of another DataFrame you would still need to iterate (but only with one for loop, which is already a performance boost).
I would strongly recommend that you temporarily convert to a numpy ndarray and work with that.
I am calling this line:
lang_modifiers = [keyw.strip() for keyw in row["language_modifiers"].split("|") if not isinstance(row["language_modifiers"], float)]
This seems to work where row["language_modifiers"] is a word (atlas method, central), but not when it comes up as nan.
I thought my if not isinstance(row["language_modifiers"], float) could catch the cases where it comes up as nan, but that is not the case.
Background: row["language_modifiers"] is a cell in a tsv file, and comes up as nan when that cell was empty in the tsv being parsed.
You are right, such errors are mostly caused by NaN representing empty cells. The isinstance check doesn't help here because row["language_modifiers"].split("|") is evaluated before the if clause gets a chance to filter anything, so the call to split() fails on the float nan first.
It is common to filter out such data, before applying your further operations, using this idiom on your dataframe df:
df_new = df[df['ColumnName'].notnull()]
Alternatively, it may be more handy to use the fillna() method to impute (replace) null values with a default.
E.g. all nulls or NaNs can be replaced with the average value of their column:
housing['LotArea'] = housing['LotArea'].fillna(housing.mean()['LotArea'])
or they can be replaced with a value like the empty string "" or another default:
housing['GarageCond']=housing['GarageCond'].fillna("")
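Applied to your original example, filling the empty cells with "" first lets the split run safely; a sketch, assuming df is the dataframe parsed from the tsv and row comes from iterating over it as in your code:
df["language_modifiers"] = df["language_modifiers"].fillna("")
# empty cells now split to [""], which the trailing filter drops
lang_modifiers = [keyw.strip() for keyw in row["language_modifiers"].split("|") if keyw.strip()]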
You might also use df = df.dropna(thresh=n), where n is the tolerance: a row needs at least n non-NA values to avoid being dropped.
Mind you, this approach will remove the whole row.
For example, if you have a dataframe with 5 columns, df.dropna(thresh=5) would drop any row that does not have 5 valid, non-NA values.
In your case you might only want to keep valid rows; if so, you can set the threshold to the number of columns you have.
pandas documentation on dropna
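A minimal sketch of that last suggestion, assuming df is the dataframe parsed from the tsv and you want every column populated:
n = len(df.columns)  # require all columns to be non-NA
df = df.dropna(thresh=n)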