I have a dataframe with two columns: one holds a string (irrelevant here), and the other holds (a reference to) a dataframe.
Now I want to keep only the rows where the dataframe in the second column has entries, i.e. len(df.index) > 0 (there should be rows left; I don't care about columns).
I know that filtering rows like this works perfectly fine when I do it in a list comprehension and check each entry on its own, as in the following example:
[do_x for a, inner_df
in zip(outer_df.index, outer_df["inner"])
if len(inner_df.index) > 0]
But if I try to use the same check for conditional indexing to create a shorter version of the dataframe, it produces the error KeyError: True.
I thought that wrapping it in len() could be the problem, so I also tried different approaches to check for zero rows. Below are four examples of what I tried:
# a) with the length of the index
outer_df = outer_df.loc[len(outer_df["inner"].index) > 0, :]
# b) same, but with a lambda, just like in the pandas docs user guide
# I used it on the other versions too, with no change in result
outer_df = outer_df.loc[lambda df: len(df["inner"]) > 0, :]
# c) switching
outer_df = outer_df.loc[outer_df["inner"].index.size > 0, :]
# d) even "shorter" version
outer_df = outer_df.loc[not outer_df["inner"].empty, :]
So... where is my error? Can I even do this with conditional indexing, or do I need to find another way?
Edit: Changed and added some sentences above for more clarity plus added all below.
I know that the filtering here works by creating a boolean Series of the same length as the dataframe from a comparison, and then keeping only the rows where that Series is True.
However, I do not see a fundamental difference between my attempt to create such a Series and the following examples (source: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/):
# 1. difference: the resulting Series is *not* altered
# it just gets compared directly with here the value 80
# -> I thought this might be the problem, but then there is also #2
df = df[df['Percentage'] > 80]
# or
df = df.loc[df['Percentage'] > 80]
# 2. Here the entry is checked in a similar way to my c and d
options = ['x', 'y']
df = df[df['Stream'].isin(options)]
# or
df = df.loc[df['Stream'].isin(options)]
In both number 2 here and my versions c & d, the entry in the cell (a string // a dataframe) is checked for something (is part of a list // is empty).
I am not sure I understand your question or where you are stuck; however, I will just write my comment as an answer so that I can easily edit the post.
First, try typing myvar = df['Percentage'] > 80 and look at what myvar is. See if its content makes sense to you.
There is really only one rule for .loc[] with a condition: it wants a truth table, i.e. a boolean Series aligned with the dataframe's rows.
Because the df[stuff] expression always appears inside .loc[ df[stuff] expression ], you might get the impression that the df[stuff] expression has some special meaning. For example, df[df['Percentage'] > 80] asks for any Percentage greater than 80, which looks quite intuitive, so df['Percentage'] > 80 must be a "special syntax"? In reality, df['Percentage'] > 80 isn't anything special; it is just another truth table. Whatever goes inside the brackets always has to be a truth table, that's it.
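For your case that means you need a boolean Series saying, row by row, whether the inner dataframe is non-empty. An expression like len(outer_df["inner"].index) > 0 is a single True/False (it measures the length of the outer column itself), which is why .loc[] receives a plain True and raises KeyError: True. A minimal sketch of building the mask element-wise instead (assuming the column is called "inner" as in the question):
mask = outer_df["inner"].map(lambda inner_df: len(inner_df.index) > 0)  # one True/False per row
outer_df = outer_df.loc[mask, :]
# or, equivalently: outer_df[outer_df["inner"].map(lambda d: not d.empty)]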
Below is the example to reproduce the error:
testx1df = pd.DataFrame()
testx1df['A'] = [100,200,300,400]
testx1df['B'] = [15,60,35,11]
testx1df['C'] = [11,45,22,9]
testx1df['D'] = [5,15,11,3]
testx1df['E'] = [1,6,4,0]
(testx1df[testx1df < 6].apply(lambda x: x.index.get_loc(x.first_valid_index(), method='ffill'), axis=1))
The desired output should be a list or array with the values [3, NaN, 4, 3]. The NaN is there because the second row has no value that satisfies the criterion.
I checked the pandas reference and it says that when there is no exact match you can set "method" to 'ffill', 'bfill', or 'nearest' to pick the previous, next, or closest index. Based on this, I expected that indicating method='ffill' would give me an index of 4 instead of NaN. However, when I do so it does not work and I get the error shown in the question title. For a criterion higher than 6 it works fine, but it doesn't for less than 6, because the second row of the data frame has no value that satisfies it.
Is there a way around this issue? Shouldn't it work for my example (and return a previous index of 3 or 4)?
One solution I thought of is to add a dummy column populated with zeros so that there is always a place to "find" an index that satisfies the criterion, but that seems a bit crude to me and I think there is a more efficient solution out there.
Please try this:
import numpy as np

# Mask values >= 6 to NaN, then count the NaNs per row (transposing so the sum runs per row);
# for this data that count equals the column position of the first value below 6.
ls = list(testx1df[testx1df < 6].T.isna().sum())
# A row where the count equals the number of columns had no value below 6 at all.
ls = [np.nan if x == testx1df.shape[1] else x for x in ls]
print(ls)  # [3, nan, 4, 3]
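If you would rather keep the get_loc approach from the question, one way around the error is to guard against rows where first_valid_index() returns None (i.e. no value below the threshold). A rough sketch under that assumption:
masked = testx1df[testx1df < 6]
result = masked.apply(
    lambda row: row.index.get_loc(row.first_valid_index())
    if row.first_valid_index() is not None
    else np.nan,
    axis=1,
)
print(list(result))  # [3.0, nan, 4.0, 3.0]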
I'm trying to find a vectorized way of determining the first instance where my column of data has a sign change. I looked at this question and it gets close to what I want, except that it also evaluates my zeros as sign changes. I'm open to different solutions, including changing how the data is set up in the first place. I'll detail what I'm doing below.
I have two columns, let's call them positive and negative, that look at a third column. The third column has values ranging between [-5, 5]. When this column is in [3, 5], my positive column gets a +1 in that row; all other rows are 0 in that column. Likewise, when the third column is in [-5, -3], my negative column gets a -1 in that row; all other rows are 0.
I combine these columns into one column. You can conceptualize this as 'turn machine on, keep it on/off, turn it off, keep it on/off, turn machine on ... etc.' The problem I'm having is that my combined column looks something like below:
pos = [1,1,1,0, 0, 0,0,0,0,0,1, 0,1]
neg = [0,0,0,0,-1,-1,0,0,0,0,0,-1,0]
com = [1,1,1,0,-1,-1,0,0,0,0,1,-1,1]
# Below is what I want to have as the final column.
cor = [1,0,0,0,-1, 0,0,0,0,0,1,-1,1]
The problem with what I've linked is that it gets close, but it also treats the first 0 as a sign change. Zeros should be ignored; I tried a few things, but I only seem to create new errors. For the sake of completeness, this is what the linked code outputs:
lnk = [True,False,False,True,True,False,True,False,False,False,True,True,True]
As you can see, it handles the 1 and -1 transitions fine (no false flips), but it does flip on the zeros. I'm not sure whether I should change how the combined column is made, change the logic that creates the component columns, or both. The big thing is that I need this code vectorized for performance reasons.
Any help would be greatly appreciated!
Let's suppose your dataframe is named df with columns pos and neg; then you can try something like the following:
import numpy as np

df.loc[:, "switch_pos"] = (np.diff(df.pos, prepend=0) > 0) * 1
df.loc[:, "switch_neg"] = (np.diff(df.neg, prepend=0) < 0) * (-1)
You can then combine your two switch columns.
Explanations
np.diff gives you the row-by-row difference: for the pos column it yields 1 for a 0-to-1 transition and -1 for a 1-to-0 transition. Considering your desired output, you only want to keep the 0-to-1 transitions, which is why only the positive differences are kept for pos (and, symmetrically, only the negative differences for neg).
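A quick end-to-end sketch with the sample lists from the question (the column names pos and neg and the result column name switch are assumptions made here):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pos": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    "neg": [0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, -1, 0],
})
switch_pos = (np.diff(df.pos, prepend=0) > 0) * 1     # marks 0 -> 1 transitions
switch_neg = (np.diff(df.neg, prepend=0) < 0) * (-1)  # marks 0 -> -1 transitions
df["switch"] = switch_pos + switch_neg
print(df["switch"].tolist())  # [1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, -1, 1]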
I need an efficient way to list and drop unary columns in a Spark DataFrame (I use the PySpark API). I define a unary column as one which has at most one distinct value and for the purpose of the definition, I count null as a value as well. That means that a column with one distinct non-null value in some rows and null in other rows is not a unary column.
Based on the answers to this question I managed to write an efficient way to obtain a list of null columns (which are a subset of my unary columns) and drop them as follows:
counts = df.summary("count").collect()[0].asDict()
null_cols = [c for c in counts.keys() if counts[c] == '0']
df2 = df.drop(*null_cols)
Based on my very limited understanding of Spark's inner workings, this is fast because summary processes the entire data frame at once (I have roughly 300 columns in my initial DataFrame). Unfortunately, I cannot find a similar way to deal with the second type of unary columns: ones which have no null values but are lit(something).
What I currently have is this (using the df2 I obtain from the code snippet above):
from pyspark.sql import functions as F

prox_counts = (df2.agg(*(F.approx_count_distinct(F.col(c)).alias(c)
                         for c in df2.columns))
               .collect()[0]
               .asDict())
poss_unarcols = [k for k in prox_counts.keys() if prox_counts[k] < 3]
unar_cols = [c for c in poss_unarcols if df2.select(c).distinct().count() < 2]
Essentially, I first find columns which could be unary in a fast but approximate way and then look at the "candidates" in more detail and more slowly.
What I don't like about it is that a) even with the approximate pre-selection it is still fairly slow, taking over a minute even though at this point I only have roughly 70 columns (and about 6 million rows), and b) I use approx_count_distinct with the magic constant 3 (approx_count_distinct does not count null, hence 3 instead of 2). Since I'm not exactly sure how approx_count_distinct works internally, I am a little worried that 3 is not a particularly good constant: the function might estimate the number of distinct (non-null) values as, say, 5 when it really is 1, so maybe a higher constant is needed to guarantee nothing is missing from the candidate list poss_unarcols.
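For reference, approx_count_distinct takes an optional relative-error argument rsd (default 0.05), so one thing I could try is tightening it; the 0.01 below is just an illustration:
prox_counts = (df2.agg(*(F.approx_count_distinct(F.col(c), rsd=0.01).alias(c)
                         for c in df2.columns))
               .collect()[0]
               .asDict())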
Is there a smarter way to do this, ideally one where I don't even have to drop the null columns separately and can do it all in one fell swoop (although that step is actually quite fast, so it's not that big an issue)?
I suggest that you have a look at the following function
pyspark.sql.functions.collect_set(col)
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe
It returns all the values in col with duplicates eliminated. Then you can check the length of the result (whether it equals one). I do wonder about the performance, but I think it will definitely beat distinct().count(). Let's have a look on Monday :)
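A rough sketch of how that could look (df2 and the idea of checking the set size come from the question and this answer; note that collect_set ignores nulls and materializes every distinct value per column on the driver, so under the question's definition you would still need the null counts to tell a truly unary column from one with a single value plus nulls):
from pyspark.sql import functions as F

value_sets = (df2.agg(*(F.collect_set(F.col(c)).alias(c) for c in df2.columns))
              .collect()[0]
              .asDict())
unar_candidates = [c for c, vals in value_sets.items() if len(vals) <= 1]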
You can do df.na.fill("some non existing value").summary() and then drop the relevant columns from the original dataframe.
So far the best solution I found is this (it is faster than the other proposed answers, although not ideal, see below):
rows = df.count()
nullcounts = df.summary("count").collect()[0].asDict()
del nullcounts['summary']
nullcounts = {key: (rows-int(value)) for (key, value) in nullcounts.items()}
# a list for columns with just null values
null_cols = []
# a list for columns with no null values
full_cols = []
for key, value in nullcounts.items():
    if value == rows:
        null_cols.append(key)
    elif value == 0:
        full_cols.append(key)
df = df.drop(*null_cols)
# only columns in full_cols can be unary
# all other remaining columns have at least 1 null and 1 non-null value
try:
    unarcounts = (df.agg(*(F.countDistinct(F.col(c)).alias(c) for c in full_cols))
                  .collect()[0]
                  .asDict())
    unar_cols = [key for key in unarcounts.keys() if unarcounts[key] == 1]
except AssertionError:
    # df.agg() raises an AssertionError when full_cols is empty
    unar_cols = []
df = df.drop(*unar_cols)
This works reasonably fast, mostly because I don't have too many "full columns", i.e. columns that contain no null entries; I only go through all rows of those, after using the fast summary("count") method to classify as many columns as I can.
Going through all rows of a column seems incredibly wasteful to me, since once two distinct values have been found I don't really care what's in the rest of the column. I don't think this can be solved in PySpark, though (but I am a beginner); it seems to require a UDF, and PySpark UDFs are so slow that they are unlikely to beat countDistinct(). Still, as long as there are many columns with no null rows in a dataframe, this method will be pretty slow (and I am not sure how much one can trust approx_count_distinct() to differentiate between one and two distinct values in a column).
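One way to at least cap the work per candidate column without a UDF is to limit the distinct scan to two rows; whether Spark actually stops scanning early depends on the physical plan, so this is only a sketch of the idea, not a guaranteed speed-up:
# Returns True if column c of df has at most one distinct value;
# distinct().limit(2) caps the result at two rows before counting.
def at_most_one_distinct(df, c):
    return df.select(c).distinct().limit(2).count() < 2

unar_cols = [c for c in full_cols if at_most_one_distinct(df, c)]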
As far as I can tell it beats the collect_set() approach, and filling in the null values is actually not necessary, as I realized (see the comments in the code).
I tried your solution, and it was too slow in my situation, so I simply grabbed the first row of the data frame and counted, per column, how many rows differ from it. This turned out to be far more performant. I'm sure there's a better way, but I don't know what it is!
from pyspark.sql import functions as sqlf

first_row = df.limit(1).collect()[0]
# For each column, count the rows whose (non-null) value differs from the first row's value;
# columns where that count is 0 are treated as constant and dropped.
drop_cols = [
    key for key, value in df.select(
        [
            sqlf.count(
                sqlf.when(sqlf.col(column) != first_row[column], column)
            ).alias(column)
            for column in df.columns
        ]
    ).collect()[0].asDict().items()
    if value == 0
]
df = df.drop(*drop_cols)
Here is a sample of my df:
      units    price
0  143280.0   0.8567
1    4654.0  464.912
2  512210.0      607
3   Unknown        0
4   Unknown        0
I have the following code:
myDf.loc[(myDf["units"].str.isnumeric()) & (myDf["price"].str.isnumeric()), 'newValue'] = (
    myDf["price"].astype(float).fillna(0.0) *
    myDf["units"].astype(float).fillna(0.0) /
    1000)
As you can see, I'm trying to only do math to create the 'newValue' column for rows where the two source columns are both numeric. However, I get the following error:
ValueError: could not convert string to float: 'Unknown'
So it seems that even though I'm attempting to perform math only on the rows that don't have text, Pandas does not like that any of the rows have text.
Note that I need to maintain the instances of "Unknown" exactly as they are and so filling those with zero is not a good option.
This has me pretty stumped. I could not find any solutions by searching Google.
Would appreciate any help/solutions.
You can use the same condition you use on the left-hand side of the = on the right-hand side as well (I put the condition in a variable is_num for readability):
is_num = (myDf["units"].astype(str).str.replace('.', '', regex=False).str.isnumeric()
          & myDf["price"].astype(str).str.replace('.', '', regex=False).str.isnumeric())
myDf.loc[is_num, 'newValue'] = (
    myDf.loc[is_num, "price"].astype(float).fillna(0.0) *
    myDf.loc[is_num, "units"].astype(float).fillna(0.0) / 1000)
Also, you should double-check against the dataframe you actually read in, but judging from this example you can:
Remove the fillna(0.0), since there are no NaNs
Remove the checks on 'price' (in your example, price is always numeric, so the check is not necessary)
Remove the astype(float) cast for price, since it's already numeric.
That would lead to the following somewhat more concise code:
is_num = myDf["units"].astype(str).str.replace('.', '', regex=False).str.isnumeric()
myDf.loc[is_num, 'newValue'] = (
    myDf.loc[is_num, "price"].astype(float) *
    myDf.loc[is_num, "units"].astype(float) / 1000)
I have seen several solutions that come close to solving my problem (link1, link2), but they have not helped me succeed thus far.
I believe that the following solution is what I need, but I continue to get an error (and I don't have the reputation points to comment on it): link
I get the following error, but I don't understand where to add a .copy() or an inplace=True when running the command df2 = df.groupby('install_site').transform(replace):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: link
So, I have attempted to come up with my own version, but I keep getting stuck. Here goes.
I have a data frame indexed by time with columns for site (string values for many different sites) and float values.
time_index site val
I would like to go through the 'val' column, grouped by site, and replace any outliers (those more than 3 standard deviations away from the group mean) with a NaN (for each group).
When I use the following function, I cannot index the data frame with my vector of True/Falses:
def replace_outliers_with_nan(df, stdvs):
    dfnew = pd.DataFrame()
    for i, col in enumerate(df.sites.unique()):
        dftmp = pd.DataFrame(df[df.sites == col])
        idx = [np.abs(dftmp - dftmp.mean()) <= (stdvs * dftmp.std())]  # boolean vector of T/F's
        dftmp[idx == False] = np.nan  # this is where the problem lies, I believe
        dfnew[col] = dftmp
    return dfnew
In addition, I fear the above function will take a very long time on 7 million+ rows, which is why I was hoping to use the groupby function option.
If I have understood you right, there is no need to iterate over the groups yourself. This solution replaces every value that deviates from its group mean by more than three group standard deviations with NaN.
def replace(group, stds):
    group[np.abs(group - group.mean()) > stds * group.std()] = np.nan
    return group

# df is your DataFrame
df.loc[:, df.columns != group_column] = df.groupby(group_column).transform(lambda g: replace(g, 3))
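Applied to your layout, with a site column and a val column (the names come from your sketch; group_column would be "site" here), that boils down to something like:
df["val"] = df.groupby("site")["val"].transform(lambda g: replace(g, 3))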