How to optimize code that iterates on a big dataframe in Python

I have a big pandas dataframe. It has thousands of columns and over a million rows. I want to calculate the difference between the max value and the min value row-wise. Keep in mind that there are many NaN values and some rows are all NaN values (but I still want to keep them!).
I wrote the following code. It works but it's time consuming:
totTime = []
for index, row in date.iterrows():
myRow = row.dropna()
if len(myRow):
tt = max(myRow) - min(myRow)
else:
tt = None
totTime.append(tt)
Is there any way to optimize it? I tried with the following code but I get an error when it encounters all NaN rows:
tt = lambda x: max(x.dropna()) - min(x.dropna())
totTime = date.apply(tt, axis=1)
Any suggestions will be appreciated!

It is usually a bad idea to use a Python for loop to iterate over a large pandas.DataFrame or numpy.ndarray. Instead, use the built-in methods available on them: they are optimized and in many cases not actually written in Python but in a compiled language. In your case you should use the methods pandas.DataFrame.max and pandas.DataFrame.min, which both give you a skipna option to skip NaN values in your DataFrame without the need to drop them manually. Furthermore, you can choose an axis to reduce along, so you can specify axis=1 to take the minimum and maximum across columns, i.e. one value per row.
This adds up to something similar to what @EdChum mentioned in the comments:
data.max(axis=1, skipna=True) - data.min(axis=1, skipna=True)
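A minimal sketch on a tiny synthetic frame (reusing the date and totTime names from the question; the data here is made up): rows that are entirely NaN come back as NaN, which mirrors the tt = None case of the original loop.
import numpy as np
import pandas as pd
# small stand-in for the large frame from the question
date = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                     "b": [5.0, np.nan, np.nan]})
# skipna=True is the default, so the keyword can even be omitted
totTime = date.max(axis=1, skipna=True) - date.min(axis=1, skipna=True)
print(totTime)  # the all-NaN row stays NaN instead of raising or being dropped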

I have the same problem with iterating. Two points:
Why don't you replace the NaN values with 0? You can do it with df.replace(['inf','nan'],[0,0]), which replaces inf and nan values.
Take a look at this question. Maybe it helps; I have a similar question about how to optimize a loop that calculates the difference between the current row and the previous one.

Related

Pandas dataframe sum of row won't let me use result in equation

Would anybody be willing to help me understand why the code below doesn't work?
start_date = '1990-01-01'
ticker_list = ['SPY', 'QQQ', 'IWM','GLD']
tickers = yf.download(ticker_list, start=start_date)['Close'].dropna()
ticker_vol_share = (tickers.pct_change().rolling(20).std()) \
/ ((tickers.pct_change().rolling(20).std()).sum(axis=1))
Both (tickers.pct_change().rolling(20).std()) and ((tickers.pct_change().rolling(20).std()).sum(axis=1)) run fine by themselves, but when run together they produce a DataFrame with thousands of columns, all filled with NaN.
Try this.
rolling_std = tickers.pct_change().rolling(20).std()
ticker_vol_share = rolling_std.apply(lambda row: row / sum(row), axis=1)
You will get the expected row-wise shares in ticker_vol_share.
Why it's not working as expected:
Your tickers object is a DataFrame, as are tickers.pct_change() and tickers.pct_change().rolling(20).std(); tickers.pct_change().rolling(20).std().sum(axis=1), on the other hand, is a Series.
You're therefore doing element-wise division of a DataFrame by a Series. By default this aligns the Series index against the DataFrame's columns, and it yields a DataFrame.
Without seeing your source data, it's hard to say for sure why the output DF is filled with nan, but that can certainly happen if some of the things you're dividing by are 0. It might also happen if each series is only one element long after taking the rolling average. It might also happen if you're actually evaluating a Series tickers rather than a DataFrame, since Series.sum(axis=1) doesn't make a whole lot of sense. It is also suspicious that your top and bottom portions of the division are probably different shapes, since sum() collapses an axis.
It's not clear to me what your expected output is, so I'll defer to others or wait for an update before answering that part.
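If the goal is row-wise shares, one way to sidestep the column-versus-index alignment (a sketch on synthetic data, since the real tickers frame isn't available here) is to divide with an explicit axis:
import numpy as np
import pandas as pd
# synthetic stand-in for the close-price frame from the question
idx = pd.date_range("2020-01-01", periods=60, freq="B")
tickers = pd.DataFrame(100 + np.random.rand(60, 4),
                       index=idx, columns=["SPY", "QQQ", "IWM", "GLD"])
rolling_std = tickers.pct_change().rolling(20).std()
# .div(..., axis=0) aligns the row-sum Series on the date index rather than on the columns
ticker_vol_share = rolling_std.div(rolling_std.sum(axis=1), axis=0)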

PySpark - an efficient way to find DataFrame columns with more than 1 distinct value

I need an efficient way to list and drop unary columns in a Spark DataFrame (I use the PySpark API). I define a unary column as one which has at most one distinct value and for the purpose of the definition, I count null as a value as well. That means that a column with one distinct non-null value in some rows and null in other rows is not a unary column.
Based on the answers to this question I managed to write an efficient way to obtain a list of null columns (which are a subset of my unary columns) and drop them as follows:
counts = df.summary("count").collect()[0].asDict()
null_cols = [c for c in counts.keys() if counts[c] == '0']
df2 = df.drop(*null_cols)
Based on my very limited understanding of the inner workings of Spark this is fast because the method summary manipulates the entire data frame simultaneously (I have roughly 300 columns in my initial DataFrame). Unfortunately, I cannot find a similar way to deal with the second type of unary columns - ones which have no null values but are lit(something).
What I currently have is this (using the df2 I obtain from the code snippet above):
prox_counts = (df2.agg(*(F.approx_count_distinct(F.col(c)).alias(c)
                         for c in df2.columns))
                  .collect()[0]
                  .asDict()
               )
poss_unarcols = [k for k in prox_counts.keys() if prox_counts[k] < 3]
unar_cols = [c for c in poss_unarcols if df2.select(c).distinct().count() < 2]
Essentially, I first find columns which could be unary in a fast but approximate way and then look at the "candidates" in more detail and more slowly.
What I don't like about it is that a) even with the approximate pre-selection it is still fairly slow, taking over a minute to run even though at this point I only have roughly 70 columns (and about 6 million rows), and b) I use approx_count_distinct with the magic constant 3 (approx_count_distinct does not count null, hence 3 instead of 2). Since I'm not exactly sure how approx_count_distinct works internally, I am a little worried that 3 is not a particularly good constant: the function might estimate the number of distinct (non-null) values as, say, 5 when it really is 1, so maybe a higher constant is needed to guarantee that nothing is missing from the candidate list poss_unarcols.
Is there a smarter way to do this, ideally so that I don't even have to drop the null columns separately and can do it all in one fell swoop (although that part is actually quite fast, so not that big an issue)?
I suggest that you have a look at the following function
pyspark.sql.functions.collect_set(col)
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe
It returns all the values in col with duplicate elements eliminated, so you can check whether the length of the result equals one. I would wonder about performance, but I think it will definitely beat distinct().count(). Let's have a look on Monday :)
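A minimal sketch of how that might look (assuming df is the DataFrame from the question and pyspark.sql.functions is imported as F):
from pyspark.sql import functions as F
# size(collect_set(col)) counts the distinct non-null values in each column
distinct_counts = (df.agg(*(F.size(F.collect_set(F.col(c))).alias(c)
                            for c in df.columns))
                     .collect()[0]
                     .asDict())
# columns with at most one distinct non-null value are drop candidates
candidate_cols = [c for c, n in distinct_counts.items() if n <= 1]
Note that collect_set ignores nulls, so under the question's definition an extra null-count check is still needed to tell a truly constant column from one that also contains nulls.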
You can do df.na.fill("some non existing value").summary() and then drop the relevant columns from the original DataFrame.
So far the best solution I found is this (it is faster than the other proposed answers, although not ideal, see below):
rows = df.count()
nullcounts = df.summary("count").collect()[0].asDict()
del nullcounts['summary']
nullcounts = {key: (rows - int(value)) for (key, value) in nullcounts.items()}
# a list for columns with just null values
null_cols = []
# a list for columns with no null values
full_cols = []
for key, value in nullcounts.items():
    if value == rows:
        null_cols.append(key)
    elif value == 0:
        full_cols.append(key)
df = df.drop(*null_cols)
# only columns in full_cols can be unary
# all other remaining columns have at least 1 null and 1 non-null value
try:
    unarcounts = (df.agg(*(F.countDistinct(F.col(c)).alias(c) for c in full_cols))
                    .collect()[0]
                    .asDict()
                  )
    unar_cols = [key for key in unarcounts.keys() if unarcounts[key] == 1]
except AssertionError:
    unar_cols = []
df = df.drop(*unar_cols)
This works reasonably fast, mostly because I don't have too many "full columns", i.e. columns which contain no null rows; I only go through all rows of these, using the fast summary("count") method to classify as many columns as I can beforehand.
Going through all rows of a column seems incredibly wasteful to me, since once two distinct values are found, I don't really care what's in the rest of the column. I don't think this can be solved in PySpark, though (but I am a beginner); it seems to require a UDF, and PySpark UDFs are so slow that they are not likely to be faster than using countDistinct(). Still, as long as there are many columns with no null rows in a DataFrame, this method will be pretty slow (and I am not sure how much one can trust approx_count_distinct() to differentiate between one or two distinct values in a column).
As far as I can tell it beats the collect_set() approach, and filling the null values is actually not necessary, as I realized (see the comments in the code).
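On the "once two distinct values are found" point, one thing worth trying (a sketch, assuming the same df and full_cols as above; whether Spark actually prunes work here depends on the physical plan) is to cap the distinct scan with limit:
def is_unary(df, col_name):
    # limit(2) means at most two distinct values are materialized before counting
    return df.select(col_name).distinct().limit(2).count() < 2
unar_cols = [c for c in full_cols if is_unary(df, c)]
df = df.drop(*unar_cols)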
I tried your solution, and it was too slow in my situation, so I simply grabbed the first row of the data frame and checked for duplicates. This turned out to be far more performant. I'm sure there's a better way, but I don't know what it is!
from pyspark.sql import functions as sqlf
first_row = df.limit(1).collect()[0]
drop_cols = [
    key for key, value in df.select(
        [
            sqlf.count(
                sqlf.when(sqlf.col(column) != first_row[column], column)
            ).alias(column)
            for column in df.columns
        ]
    ).collect()[0].asDict().items()
    if value == 0
]
df = df.drop(*drop_cols)

pandas - vectorized formula computation with nans

I have a DataFrame (Called signal) that is a simple timeseries with 5 columns. This is what its .describe() looks like:
ES NK NQ YM
count 5294.000000 6673.000000 4798.000000 3415.000000
mean -0.000340 0.000074 -0.000075 -0.000420
std 0.016726 0.018401 0.023868 0.015399
min -0.118724 -0.156342 -0.144667 -0.103101
25% -0.008862 -0.010297 -0.011481 -0.008162
50% -0.001422 -0.000590 -0.001747 -0.001324
75% 0.007069 0.009163 0.009841 0.006304
max 0.156365 0.192686 0.181245 0.132630
I want to apply a simple function on every single row, and receive back a matrix with the same dimensions:
weights = -2*signal.subtract( signal.mean(axis=1), axis=0).divide( signal.sub( signal.mean(axis=1), axis=0).abs().sum(axis=1), axis=0 )
However, when I run this line, the program gets stuck. I believe this issue comes from the difference in length/presence of NaNs. Dropping or filling the NaNs is not an option; for any given row that has a NaN, I want that NaN to simply be excluded from the computation. A temporary solution would be to do this iteratively using .iterrows(), but that is not an efficient solution.
Are there any smart solutions to this problem?
The thing is, the pandas mean and sum methods already exclude NaN values by default (see the description of the skipna keyword in the linked docs). Additionally, subtract and divide allow for the use of a fill_value keyword arg:
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
So you may be able to get what you want by setting fill_value=0 in the calls to subtract, and fill_value=1 in the calls to divide.
However, I suspect that the default behavior (NaN is ignored in mean and sum, NaN - anything = NaN, NaN / anything = NaN) is what you actually want. In that case, your problem isn't directly related to NaNs, and you're going to have to clarify your statement "when I run this line, the program gets stuck" in order to get a useful answer.
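For reference, a minimal sketch of the same computation broken into steps on a small synthetic signal frame, keeping the default NaN handling throughout:
import numpy as np
import pandas as pd
# small stand-in for the signal frame (columns ES/NK/NQ/YM with NaNs)
signal = pd.DataFrame({"ES": [0.01, np.nan, -0.02],
                       "NK": [0.00, 0.03, np.nan],
                       "NQ": [np.nan, 0.01, 0.02],
                       "YM": [-0.01, np.nan, 0.00]})
row_mean = signal.mean(axis=1)                 # skips NaN by default
demeaned = signal.subtract(row_mean, axis=0)   # NaN - x stays NaN
scale = demeaned.abs().sum(axis=1)             # skips NaN by default
weights = -2 * demeaned.divide(scale, axis=0)  # NaN / x stays NaN, shape is preserved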

Conditional mean over a Pandas DataFrame

I have a dataset from which I want a few averages of multiple variables I created.
I started off with:
data2['socialIdeology2'].mean()
data2['econIdeology'].mean()
^ that works perfectly, and gives me the averages I'm looking for.
Now, I'm trying to do a conditional mean, so the mean only for a select group within the data set. (I want the ideologies broken down by whom respondents voted for in the 2016 election.) In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'
Looking into it, I came to the conclusion a conditional mean just isn't a thing (although hopefully I am wrong?), so I was writing my own function for it.
This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:
def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count
print(mean())
Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.
So my question is: if the data stored within the socialIdeology2 variable really is a nan (which I don't understand how it could be), why is it that the .mean() function works with it?
And how can I generate means by category?
Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():
means = data2.groupby('voteChoice').mean()
or maybe, in your case, the following would be more efficient:
means = data2.groupby('voteChoice')['socialIdeology2'].mean()
to drill down to the mean you're looking for. (The first case will calculate means for all columns.) This is assuming that voteChoice is the name of the column you want to condition on.
If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:
voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:
means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()
Placing the ['socialIdeology2'] index before the .mean() means that you only compute the mean over the column you're interested in, whereas if you place the indexing expression after the .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']) this computes the means over all columns and then selects only the 'socialIdeology2' column from the result, which is less efficient.
See here for more info on indexing DataFrames using .loc and here for more info on groupby.
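As for the first part of the question (why .mean() works while the hand-written loop prints nan): pandas' mean() skips NaN by default (skipna=True), whereas adding a NaN into a plain running sum turns the whole sum into NaN. A minimal sketch of the difference, on made-up data:
import numpy as np
import pandas as pd
s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())        # 2.0 -- NaN values are skipped (skipna=True by default)
total = 0.0
for value in s:
    total = total + value
print(total / len(s))  # nan -- NaN propagates through ordinary addition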

Redundant index number in Pandas - can I disable it?

Is it feasible to strip a redundant index returned by groupby in pandas?
So in the following example using iris datasets:
df.groupby(["Species"]).apply(lambda x: x[x["Sepal.Length"]>=6]["Petal.Length"] + x[x["Petal.Length"]>=4]["Sepal.Width"])
This returns a Series, but unfortunately it has an awful index along with the grouping variable.
And note that I cannot strip it afterward, since the actual computation is much more convoluted such as:
df.groupby(["Species"]).apply(lambda x: x[x["Sepal.Length"]>=6]["Petal.Length"] + x[x["Petal.Length"]>=4]["Sepal.Width"])
This returns lots of NaN.
Note that the actual datasets I use don't produce NaN values within the lambda function (i.e. at least one record meets x["Sepal.Length"] >= 6 and x["Petal.Length"] >= 4 in every group), so it shouldn't return NaN. And I found out that the reason it still returns NaN is because of the redundant indexes.
Also, in this case:
df.groupby(["Species"]).apply(lambda x: x["Sepal.Length"].mean())
the result of this execution doesn't include the redundant indexes - why?
So I want to disable the useless feature within the lambda function. Is it feasible?
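A minimal sketch of one option, assuming a pandas version where groupby accepts group_keys (the iris frame here is loaded via seaborn purely for illustration): passing group_keys=False keeps groupby from prepending the group label to the index of the apply result.
import seaborn as sns  # hypothetical source for the iris frame
df = sns.load_dataset("iris").rename(columns={
    "sepal_length": "Sepal.Length", "sepal_width": "Sepal.Width",
    "petal_length": "Petal.Length", "petal_width": "Petal.Width",
    "species": "Species"})
# group_keys=False drops the Species level from the result index
result = df.groupby("Species", group_keys=False).apply(
    lambda x: x[x["Sepal.Length"] >= 6]["Petal.Length"]
              + x[x["Petal.Length"] >= 4]["Sepal.Width"])
Note that the addition inside the lambda still aligns on the row index, so rows that satisfy only one of the two conditions still come back as NaN regardless of group_keys.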
