I have a huge dataset, where I'm trying to reduce the dimensionality by removing the variables that fulfill these two conditions:
Count of unique values in a feature / sample size < 10%
Count of most common value / Count of second most common value > 20 times
The first condition is no problem; the second condition is where I'm stuck. Because of the size of the dataset I'm trying to be as efficient as possible, so I want to use numpy, which I understand is faster than pandas. A possible solution was numpy-most-efficient-frequency-counts-for-unique-values-in-an-array, but I'm having trouble getting the counts of the two most common values.
My attempt:
n = df.shape[0]/10
variable = []
condition_1 = []
condition_2 = []
for i in df:
    variable.append(i)
    condition_1.append(df[i].unique().shape[0] < n)
    condition_2.append(most_common_value_count/second_most_common_value_count > 20)
result = pd.DataFrame({"Variables": variable,
                       "Condition_1": condition_1,
                       "Condition_2": condition_2})
The dataset df contains positive and negative values (so I can't use np.bincount), and also categorical variables, objects, datetimes, dates, and NaN variables/values.
Any suggestions? Remember that it's critical to minimize the number of steps in order to maximize efficiency.
As noted in the comments, you may want to use np.unique (or pd.unique). You can set return_counts=True to get the value counts. These will be the second item in the tuple returned by np.unique, hence the [1] index below. After sorting them, the most common count will be the last value, and the second most common count will be the next to last value, so you can get them both by indexing with [-2:].
You could then construct a Boolean list indicating which columns meet your condition #2 (or rather the opposite). This list can then be used as a mask to reduce the dataframe:
def counts_ratio(s):
    """Take a pandas series s and return the
    count of its most common value /
    count of its second most common value."""
    counts = np.sort(np.unique(s, return_counts=True)[1])[-2:]
    return counts[1] / counts[0]

condition2 = [counts_ratio(df[col]) <= 20
              for col in df.columns]

df_reduced = df[df.columns[condition2]]
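For completeness, here is a minimal sketch of how both conditions could be combined into a single mask. It is only an illustration, not the answer's exact approach: the hypothetical keep_column helper uses pandas value_counts (which tolerates NaN and mixed-type columns, unlike np.unique), and it assumes the "both conditions must hold" reading of the question, with the 10% and 20x thresholds taken from there.
n = df.shape[0] / 10  # 10% of the sample size

def keep_column(s):
    """Return True if the column should be kept, i.e. it does NOT
    meet both removal conditions from the question."""
    counts = s.value_counts(dropna=False)   # counts sorted in descending order
    condition_1 = counts.shape[0] < n       # few unique values
    condition_2 = (counts.shape[0] > 1 and
                   counts.iloc[0] / counts.iloc[1] > 20)  # one dominant value
    return not (condition_1 and condition_2)

df_reduced = df.loc[:, [keep_column(df[c]) for c in df.columns]]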
I have a large dataset. I partitioned the data into training and test sets.
I found the missing values of the independent variables.
I want to calculate the number of columns that have missing values; in this case, I should get 12 names. I was only able to sum each whole column.
Here is my attempt:
finding_missing_values = data.train.isnull().sum()
finding_missing_values
finding_missing_values.sum()
Is there a way I can count the number of columns that have a missing value?
Convert the counts to a list and then count the non-zero values, as follows.
finding_missing_values = (data.train.isnull().sum()).to_list()
number_of_missing_value_columns = sum(k > 0 for k in finding_missing_values)
print(number_of_missing_value_columns)
This should give:
12
You wrote
finding_missing_values.sum()
You were looking for
(finding_missing_values > 0).values.sum()
From .values we get a numpy array.
The comparison gives us False / True values,
which conveniently are treated as 0 / 1 by .sum()
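A quick illustration of that idea on a small made-up frame (the column names here are invented for the example):
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, np.nan, 3],
                    "b": [4, 5, 6],
                    "c": [np.nan, np.nan, 9]})

finding_missing_values = toy.isnull().sum()       # per-column NaN counts
print((finding_missing_values > 0).values.sum())  # 2 -- columns "a" and "c" contain NaN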
Is it possible to use .nlargest to get the two highest numbers in a set of numbers, but ensure that they are x rows apart?
For example, in the following code I would want to find the two largest values but ensure that they are more than 5 values apart from each other. Is there an easy way to do this?
data = {'Pressure' : [100,112,114,120,123,420,1222,132,123,333,123,1230,132,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333],
}
If I understand the question correctly, you need to output the largest value, and then the next largest value that's at least X rows apart from it (based on the index).
The first value is just df.Pressure.max(), and its index is df.Pressure.idxmax() (assuming the dict is loaded into a dataframe df).
Second value is either before or after the first value's index:
max_before = df.Pressure.loc[:df.Pressure.idxmax() - X].max()
max_after = df.Pressure.loc[df.Pressure.idxmax() + X:].max()
second_value = max(max_before, max_after)
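Putting the pieces together with the data from the question (a sketch only; it does not handle edge cases such as the maximum sitting within X rows of either end of the series):
import pandas as pd

data = {'Pressure': [100, 112, 114, 120, 123, 420, 1222, 132, 123, 333,
                     123, 1230, 132, 1, 23, 13, 13, 13, 123, 13,
                     123, 3, 222, 2303, 1233, 1233, 1, 1, 30, 20,
                     40, 401, 10, 40, 12, 122, 1, 12, 333]}
df = pd.DataFrame(data)

X = 5
first_value = df.Pressure.max()                                # 2303, at index 23
max_before = df.Pressure.loc[:df.Pressure.idxmax() - X].max()  # best value at least X rows earlier
max_after = df.Pressure.loc[df.Pressure.idxmax() + X:].max()   # best value at least X rows later
second_value = max(max_before, max_after)
print(first_value, second_value)                               # 2303 1230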
What would be a more elegant way to writing:
df[df['income'] > 0].count()['income']
I would like to simply count the number of column values meeting a condition (in this example, the condition is just being larger than zero, but I would like a way applicable to any arbitrary condition or set of conditions). It would obviously be more elegant if the column name did not need to show up twice in the expression. Hopefully this is easy.
df = pd.DataFrame([0, 30000, 75000, -300, 23000], columns=['income'])
print(df)
income
0 0
1 30000
2 75000
3 -300
4 23000
If you would like to count values in a column meeting a slightly more complex condition than just being positive, for example "value is in the range from 5000 to 25000", you can use two methods.
First, using boolean indexing,
((df['income'] > 5000) & (df['income'] < 25000)).sum()
Second, applying a function to every element of the series,
df['income'].map(lambda x: 5000 < x < 25000).sum()
Note that the second approach allows arbitrarily complex conditions but is much slower than the first approach, which uses low-level operations on the underlying arrays. See the documentation on boolean indexing for more information.
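Applied to the sample frame above, both approaches count the same thing, and the original "greater than zero" case reduces to a one-liner; a small illustration:
import pandas as pd

df = pd.DataFrame([0, 30000, 75000, -300, 23000], columns=['income'])

# Boolean indexing: vectorised comparison, then sum the True values
print(((df['income'] > 5000) & (df['income'] < 25000)).sum())  # 1 (only 23000 qualifies)

# map with an arbitrary Python condition: same count, but slower
print(df['income'].map(lambda x: 5000 < x < 25000).sum())      # 1

# the original "greater than zero" count, without repeating the column name
print((df['income'] > 0).sum())                                # 3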
I have a csv datafile that I've split by a column value into 5 datasets, one for each person, using:
for i in range(1,6):
    PersonData = df[df['Person'] == i].values
    P[i] = PersonData
I want to sort the data into ascending order according to one column, then split the data half way at that column to find the median.
So I sorted the data with the following:
dataP = {}
for i in range(1,6):
    sortData = P[i][P[i][:,9].argsort()]
    P[i] = sortData
    P[i] = pd.DataFrame(P[i])
dataP[1]
Using that I get a dataframe for each of my datasets 1-5 sorted by the relevant column (9), depending on which number I put into dataP[i].
Then I calculate half the length:
for i in range(1,6):
    middle = len(dataP[i])/2
    print(middle)
Here is where I'm stuck!
I need to create a new column in each dataP[i] dataframe that splits the length in 2 and gives the value 0 if it's in the first half and 1 if it's in the second.
This is what I've tried but I don't understand why it doesn't produce a new list of values 0 and 1 that I can later append to dataP[i]:
for n in range(1, (len(dataP[i]))):
    for n, line in enumerate(dataP[i]):
        if middle > n:
            confval = 0
        elif middle < n:
            confval = 1
for i in range(1,6):
    Confval[i] = confval
Confval[1]
Sorry if this is basic, I'm quite new to this so a lot of what I've written might not be the best way to do it/necessary, and sorry also for the long post.
Any help would be massively appreciated. Thanks in advance!
If I'm reading your question right I believe you are attempting to do two things.
Find the median value of a column
Create a new column which is 0 if the value is less than the median or 1 if greater.
Let's tackle #1 first:
median = df['originalcolumn'].median()
That easy! There's many great pandas functions for things like this.
Ok so number two:
df['newcolumn'] = (df['originalcolumn'] > median).astype(int)
What we're doing here is creating a new bool series, false if the value at that location is less than the median, true otherwise. Then we can cast that to an int which gives us 0s and 1s.
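Since the question splits the data by person first, the same idea can be applied within each group. This is a rough sketch only, using the 'Person' column from the question and the placeholder 'originalcolumn' name from above; groupby/transform is one way to do it, not necessarily what the original code intended:
# median of the sort column within each person's data
person_median = df.groupby('Person')['originalcolumn'].transform('median')

# 0 for the lower half, 1 for the upper half, per person
df['confval'] = (df['originalcolumn'] > person_median).astype(int)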
I need some help with a very quick calculation. In the denominator line below I need to get the sum of the occurrences, but I only need to sum over values that are above a certain value. For example, I need to get the sum of all of them but exclude the number that comes with a certain occurrence at 2, so theoretically I need something along the lines of:
denominator = np.sum(occurances yet only sum above the value of occurances(2))
# the next bit uses the True/False columns to find the ranges in which a
# series of avalanches happen.
fst = bins.index[bins['avalanche'] & ~ bins['avalanche'].shift(1).fillna(False)]
lst = bins.index[bins['avalanche'] & ~ bins['avalanche'].shift(-1).fillna(False)]
for i, j in zip(fst, lst):
    bins.loc[j, 'total count'] = sum(bins.loc[i:j+1, 'count'])
    bins.loc[j, 'total duration'] = (j-i+1)*bin_width
writer = pd.ExcelWriter(bin_file)
bins.to_excel(writer)
writer.save()
# When a series of avalanches occur, we need to add them up.
occurances = bins.groupby(bins['total count']).size()
# Fill in the gaps with zero
occurances = occurances.reindex(np.arange(occurances.index.min(), occurances.index.max()), fill_value=0)
# Create a new series that shows the percentage of outcomes
denominator = np.sum(occurances)
print(denominator)
percentage = occurances/denominator
#print (denomimator)
So, this takes an Excel file and reads it into a dataframe. Nonetheless, as I mentioned earlier, I'm having trouble calculating the variable denominator. occurances simply adds up the number of times a given value is present; however, I need to calculate denominator such that:
denominator = np.sum(occurances) - occurances[2] + occurances[1]
Yet if occurances[2] or occurances[1] isn't present, it crashes. So how would I go about taking the sum of occurances[3] and above? I also tried:
denominator = np.sum(occurances) >=occurances[3]
but it only gave me a True/False statement and would crash shortly after. So I basically need the sum of the values present in occurances[3] and above. Thank you, any help is appreciated.
Using a conditional index:
denominator = occurances[occurances.index > 2].sum()
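For instance, on a made-up counts series (values invented purely for illustration), the entries at index 1 and 2 are excluded whether or not they exist:
import pandas as pd

occurances = pd.Series({1: 40, 3: 7, 4: 2, 6: 1})     # note that index 2 is absent
denominator = occurances[occurances.index > 2].sum()  # 7 + 2 + 1 = 10
print(denominator)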