I'm trying to drop columns in my pandas dataframe with 0 variance.
I'm sure this has been answered somewhere, but I had a lot of trouble finding a thread on it. I found this thread; however, when I tried its solution on my dataframe, baseline, with the command
baseline_filtered=baseline.loc[:,baseline.std() > 0.0]
I got the error
"Unalignable boolean Series provided as "
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
So, can someone tell me why I'm getting this error or provide an alternative solution?
There are some non-numeric columns, and std excludes them by default. The boolean mask therefore has fewer entries than baseline has columns, which is exactly why .loc raises the unalignable-indexer error:
import pandas as pd

baseline = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,5,4,5,5,4],
    'C':[7,8,9,4,2,3],
    'D':[1,1,1,1,1,1],
    'E':[5,3,6,9,2,4],
    'F':list('aaabbb')
})
# A and F columns are excluded (string dtype); newer pandas needs numeric_only=True explicitly
m = baseline.std(numeric_only=True) > 0.0
print (m)
B     True
C     True
D    False
E     True
dtype: bool
So a possible solution that lets you either keep or drop the string columns is to align the mask to all columns with Series.reindex:
baseline_filtered = baseline.loc[:, m.reindex(baseline.columns, fill_value=True)]
print (baseline_filtered)
   A  B  C  E  F
0  a  4  7  5  a
1  b  5  8  3  a
2  c  4  9  6  a
3  d  5  4  9  b
4  e  5  2  2  b
5  f  4  3  4  b
baseline_filtered = baseline.loc[:, m.reindex(baseline.columns, fill_value=False)]
print (baseline_filtered)
   B  C  E
0  4  7  5
1  5  8  3
2  4  9  6
3  5  4  9
4  5  2  2
5  4  3  4
Another idea is to use DataFrame.nunique, which works for both string and numeric columns:
baseline_filtered = baseline.loc[:, baseline.nunique() > 1]
print (baseline_filtered)
   A  B  C  E  F
0  a  4  7  5  a
1  b  5  8  3  a
2  c  4  9  6  a
3  d  5  4  9  b
4  e  5  2  2  b
5  f  4  3  4  b
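As a quick sanity check that nunique treats a constant string column the same way as a constant numeric one, here is a minimal sketch (the all-'a' column F here is hypothetical):

import pandas as pd

baseline = pd.DataFrame({
    'A': list('abcdef'),       # varying strings -> kept
    'D': [1, 1, 1, 1, 1, 1],   # constant numeric column -> dropped
    'F': list('aaaaaa'),       # constant string column (hypothetical) -> dropped
})
# nunique counts distinct values per column, regardless of dtype
print(baseline.loc[:, baseline.nunique() > 1])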
Related
I have one pandas dataframe, with one row and multiple columns.
I want to get the column number/index of the minimum value in the given row.
The code I found was: df.columns.get_loc('colname')
The above code asks for a column name.
My dataframe doesn't have column names. I want to get the column location of the minimum value.
Use numpy.argmin after converting the DataFrame to an array via .values; this requires the data to be all numeric:
df = pd.DataFrame({
    'B':[4,5,4,5,5,4],
    'C':[7,8,9,4,2,3],
    'D':[1,3,5,7,1,0],
    'E':[-5,3,6,9,2,-4]
})
print (df)
   B  C  D  E
0  4  7  1 -5
1  5  8  3  3
2  4  9  5  6
3  5  4  7  9
4  5  2  1  2
5  4  3  0 -4
df['col'] = df.values.argmin(axis=1)
print (df)
   B  C  D  E  col
0  4  7  1 -5    3
1  5  8  3  3    2
2  4  9  5  6    0
3  5  4  7  9    1
4  5  2  1  2    2
5  4  3  0 -4    3
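Two related points, sketched below with a hypothetical one-row frame: numpy.nanargmin skips NaN where plain argmin would land on it, and if your columns do get labels, DataFrame.idxmin(axis=1) returns the label of the minimum directly:

import numpy as np
import pandas as pd

row = pd.DataFrame([[4.0, np.nan, 1.0, -5.0]])   # one row, unnamed columns

print(row.values.argmin(axis=1))      # [1] - argmin lands on the NaN
print(np.nanargmin(row.values, axis=1))  # [3] - NaN is skipped

# with labelled columns, idxmin returns the label instead of the position
row.columns = list('BCDE')
print(row.idxmin(axis=1))             # 0    E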
I have data in a CSV file of the following format (one column in a dataframe). It is essentially like an outline in a Word document: the letters are the main headers and the numbers are subheaders:
A
1
2
3
B
1
2
C
1
2
3
4
I want to convert this to the following format (two columns in a dataframe):
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
C 4
I'm using pandas read_csv to load the data into a dataframe, and I'm trying to reformat it with for loops, but I'm having difficulty because the data repeats and gets overwritten. For example, A 3 gets overwritten with C 3 later in the loop (resulting in two instances of C 3 when only one is desired, and losing A 3 altogether). What's the best way to do this?
Apologies for poor formatting, new to the site.
Use:
# if the csv has no header, use the names parameter
df = pd.read_csv(file, names=['col'])
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
df = df[df['a'] != df['col']]
print (df)
    a col
1   A   1
2   A   2
3   A   3
5   B   1
6   B   2
8   C   1
9   C   2
10  C   3
11  C   4
Details:
Check isnumeric values:
print (df['col'].str.isnumeric())
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 False
8 True
9 True
10 True
11 True
Name: col, dtype: bool
Replace True values with NaN using Series.mask, then forward fill the missing values:
print (df['col'].mask(df['col'].str.isnumeric()).ffill())
0 A
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
10 C
11 C
Name: col, dtype: object
Add the new column at the first position with DataFrame.insert:
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
print (df)
    a col
0   A   A
1   A   1
2   A   2
3   A   3
4   B   B
5   B   1
6   B   2
7   C   C
8   C   1
9   C   2
10  C   3
11  C   4
Finally, remove the rows where both columns contain the same value using boolean indexing.
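One caveat on this approach: str.isnumeric is False for strings such as '-1' or '1.5', so if the subheaders could ever be signed or decimal numbers, a full regex match is a safer drop-in (a hypothetical variant of the same pipeline):

import pandas as pd

df = pd.read_csv(file, names=['col'])   # same hypothetical file as above
# fullmatch so something like '1abc' is not treated as a subheader
is_sub = df['col'].str.fullmatch(r'-?\d+(?:\.\d+)?')
df.insert(0, 'a', df['col'].mask(is_sub).ffill())
df = df[df['a'] != df['col']]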
I have this table:
A  B  C  E
1  2  1  3
1  2  4  4
2  7  1  1
3  4  0  2
3  4  8  3
Now, I want to remove duplicates based on columns A and B and at the same time sum up column C. For E, it should take the value from the row where C has its max value. The desired result table should look like this:
A  B  C  E
1  2  5  4
2  7  1  1
3  4  8  3
I tried df.groupby(['A', 'B']).sum()['C'], but my data frame does not change at all. I'm thinking that I didn't incorporate the E column part properly... Can somebody advise?
Thanks so much!
Since the duplicates are defined by columns A and B, we can group by those two columns.
In [20]: df
Out[20]:
   A  B  C  E
0  1  1  5  4
1  1  1  1  1
2  3  3  8  3

In [21]: df.groupby(['A', 'B'])['C'].sum()
Out[21]:
A  B
1  1    6
3  3    8
Name: C, dtype: int64
"I tried this: df.groupby(['A', 'B']).sum()['C'] but my data frame does not change at all"
Yes, that's because pandas did not overwrite the initial DataFrame; groupby returns a new object.
In [22]: df
Out[22]:
   A  B  C  E
0  1  1  5  4
1  1  1  1  1
2  3  3  8  3
You have to overwrite it explicitly.
In [23]: df = df.groupby(['A', 'B'])['C'].sum()

In [24]: df
Out[24]:
A  B
1  1    6
3  3    8
Name: C, dtype: int64
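Note that this returns only C. To also pick E from the row with the largest C, as your desired output shows, one hedged option is to sort by C first so that an 'E': 'last' aggregation takes exactly that row. A sketch using the question's original data:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8],
                   'E': [3, 4, 1, 2, 3]})

out = (df.sort_values('C')                 # max C ends up last in each group
         .groupby(['A', 'B'], as_index=False)
         .agg({'C': 'sum', 'E': 'last'}))  # sum C, take E from the max-C row
print(out)
#    A  B  C  E
# 0  1  2  5  4
# 1  2  7  1  1
# 2  3  4  8  3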
I have a data frame with a multi index and one column.
The index fields are type and amount; the column is called count.
I would like to add a column that multiplies amount and count
df2 = df.groupby(['type','amount']).count().copy()
# I then dropped all columns but one and renamed it to "count"
df2['total_amount'] = df2['count'].multiply(df2['amount'], axis='index')
This doesn't work; I get a KeyError on 'amount'.
How do I access a part of the multi index to use it in calculations?
Use GroupBy.transform to get a Series of aggregated values with the same length as the original df, so the counts repeat per group and can be multiplied directly. With this sample data:
df = pd.DataFrame({'A':list('abcdef'),
                   'amount':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'type':list('aaabbb')})
print (df)
   A  amount  C  D  E type
0  a       4  7  1  5    a
1  b       5  8  3  3    a
2  c       4  9  5  6    a
3  d       5  4  7  9    b
4  e       5  2  1  2    b
5  f       4  3  0  4    b
count = df.groupby(['type','amount'])['type'].transform('count')
df['total_amount'] = df['amount'].multiply(count, axis='index')
print (df)
   A  amount  C  D  E type  total_amount
0  a       4  7  1  5    a             8
1  b       5  8  3  3    a             5
2  c       4  9  5  6    a             8
3  d       5  4  7  9    b            10
4  e       5  2  1  2    b            10
5  f       4  3  0  4    b             4
Or, to reproduce your aggregated df2 and answer the question directly, access the amount level of the MultiIndex with Index.get_level_values:
df2 = df.groupby(['type','amount'])['type'].count().to_frame('count')
df2['total_amount'] = df2['count'].mul(df2.index.get_level_values('amount'))
print (df2)
             count  total_amount
type amount
a    4           2             8
     5           1             5
b    4           1             4
     5           2            10
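If indexing by level feels awkward, another option is to flatten the MultiIndex into ordinary columns with reset_index first (a small sketch continuing from df2 above):

# turn the type/amount index levels back into regular columns
df2_flat = df2.reset_index()
# now amount is a plain column and can be used directly
df2_flat['total_amount'] = df2_flat['count'] * df2_flat['amount']
print(df2_flat)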
I have the following problem with filling NaN in a filtered df.
Let's take this df:
   condition   value
0          A       1
1          B       8
2          B  np.nan
3          A  np.nan
4          C       3
5          C  np.nan
6          A       2
7          B       5
8          C       4
9          A  np.nan
10         B  np.nan
11         C  np.nan
How can I fill np.nan with the last preceding value for the same condition, so that I get the following result?
   condition  value
0          A      1
1          B      8
2          B      8
3          A      1
4          C      3
5          C      3
6          A      2
7          B      5
8          C      4
9          A      2
10         B      5
11         C      4
I've failed with the following code (ValueError: Cannot index with multidimensional key):
conditions = set(df['condition'].tolist())
for c in conditions:
    filter = df.loc[df['condition'] == c]
    df.loc[filter, 'value'] = df.loc[filter, 'value'].fillna(method='ffill')
THX & BR from Vienna
If your values are actual NaN, you simply need to do a groupby on condition and then call ffill (which is essentially a wrapper for fillna(method='ffill')):
df.groupby('condition').ffill()
Which returns:
   condition  value
0          A      1
1          B      8
2          B      8
3          A      1
4          C      3
5          C      3
6          A      2
7          B      5
8          C      4
9          A      2
10         B      5
11         C      4
If your values are strings that say np.nan, as in your example, then replace them before:
df.replace('np.nan', np.nan, inplace=True)
df.groupby('condition').ffill()
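One version note, hedged: depending on your pandas version, df.groupby('condition').ffill() may return only the value column (recent releases exclude the grouping column from the result), so it is safer to assign the filled values back explicitly. A self-contained sketch of the data above:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'condition': list('ABBACCABCABC'),
    'value': [1, 8, np.nan, np.nan, 3, np.nan, 2, 5, 4,
              np.nan, np.nan, np.nan],
})
# forward fill within each condition group and write back to the column
df['value'] = df.groupby('condition')['value'].ffill()
print(df)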