Python Pandas drop columns based on max value of column - python

I'm just getting going with Pandas as a tool for munging two-dimensional arrays of data. It's super overwhelming, even after reading the docs. You can do so much that I can't figure out how to do anything, if that makes any sense.
My dataframe (simplified):
Date Stock1 Stock2 Stock3
2014.10.10 74.75 NaN NaN
2014.9.9 NaN 100.95 NaN
2010.8.8 NaN NaN 120.45
So each column only has one value.
I want to remove all columns that have a max value less than x. So, as an example here, if x = 80, then I want a new DataFrame:
Date Stock2 Stock3
2014.10.10 NaN NaN
2014.9.9 100.95 NaN
2010.8.8 NaN 120.45
How can this be achieved? I've looked at dataframe.max(), which gives me a Series. Can I use that, or somehow pass a lambda function to select()?

Use df.max() to index with.
In [18]: import numpy as np
In [19]: from pandas import DataFrame
In [23]: df = DataFrame(np.random.randn(3,3), columns=['a','b','c'])
In [36]: df
Out[36]:
a b c
0 -0.928912 0.220573 1.948065
1 -0.310504 0.847638 -0.541496
2 -0.743000 -1.099226 -1.183567
In [24]: df.max()
Out[24]:
a -0.310504
b 0.847638
c 1.948065
dtype: float64
Next, we make a boolean expression out of this:
In [31]: df.max() > 0
Out[31]:
a False
b True
c True
dtype: bool
Next, you can index df.columns by this (this is called boolean indexing):
In [34]: df.columns[df.max() > 0]
Out[34]: Index([u'b', u'c'], dtype='object')
Which you can finally pass back to df to select those columns:
In [35]: df[df.columns[df.max() > 0]]
Out[35]:
b c
0 0.220573 1.948065
1 0.847638 -0.541496
2 -1.099226 -1.183567
Of course, instead of 0, you can use any value you want as the cutoff for dropping.
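Applied to the stock frame from the question (a sketch; the column names and the cutoff x = 80 come from the example above, and Date is moved into the index so that only the numeric stock columns are compared):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': ['2014.10.10', '2014.9.9', '2010.8.8'],
                   'Stock1': [74.75, np.nan, np.nan],
                   'Stock2': [np.nan, 100.95, np.nan],
                   'Stock3': [np.nan, np.nan, 120.45]}).set_index('Date')
x = 80
result = df[df.columns[df.max() > x]]  # keeps Stock2 and Stock3, drops Stock1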

Related


Pandas.read_excel sometimes incorrectly reads Boolean values as 1's/0's

I need to read a very large Excel file into a DataFrame. The file has string, integer, float, and Boolean data, as well as missing data and totally empty rows. It may also be worth noting that some of the cell values are derived from cell formulas and/or VBA - although theoretically that shouldn't affect anything.
As the title says, pandas sometimes reads Boolean values as float or int 1's and 0's, instead of True and False. It appears to have something to do with the amount of empty rows and type of other data. For simplicity's sake, I'm just linking a 2-sheet Excel file where the issue is replicated.
Boolean_1.xlsx
Here's the code:
import pandas as pd
df1 = pd.read_excel('Boolean_1.xlsx','Sheet1')
df2 = pd.read_excel('Boolean_1.xlsx','Sheet2')
print(df1, '\n' *2, df2)
Here's the output. Mainly note row ZBA, which has the same values in both sheets but different values in the DataFrames:
Name stuff Unnamed: 1 Unnamed: 2 Unnamed: 3
0 AFD a dsf ads
1 DFA 1 2 3
2 DFD 123.3 41.1 13.7
3 IIOP why why why
4 NaN NaN NaN NaN
5 ZBA False False True
Name adslfa Unnamed: 1 Unnamed: 2 Unnamed: 3
0 asdf 6.0 3.0 6.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 ZBA 0.0 0.0 1.0
I was also able to get integer 1's and 0's output in the large file I'm actually working on (yay), but wasn't able to easily replicate it.
What could be causing this inconsistency, and is there a way to force pandas to read Booleans as they should be read?
Pandas type-casting is applied by column / series. In general, Pandas doesn't work well with mixed types, or object dtype. You should expect internalised logic to determine the most efficient dtype for a series. In this case, Pandas has chosen float dtype as applicable for a series containing float and bool values. In my opinion, this is efficient and neat.
However, as you noted, this won't work when you have a transposed input dataset. Let's set up an example from scratch:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': [True, False, True, True],
'B': [np.nan, np.nan, np.nan, False],
'C': [True, 'hello', np.nan, True]})
df = df.astype({'A': bool, 'B': float, 'C': object})
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True 0.0 True
Option 1: change "row dtype"
You can, without transposing your data, change the dtype for objects in a row. This will force series B to have object dtype, i.e. a series storing pointers to arbitrary types:
df.iloc[3] = df.iloc[3].astype(bool)
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True False True
print(df.dtypes)
A bool
B object
C object
dtype: object
Option 2: transpose and cast to Boolean
In my opinion, this is the better option, as a data type is being attached to a specific category / series of input data.
df = df.T # transpose dataframe
df[3] = df[3].astype(bool) # convert series to Boolean
print(df)
0 1 2 3
A True False True True
B NaN NaN NaN False
C True hello NaN True
print(df.dtypes)
0 object
1 object
2 object
3 bool
dtype: object
read_excel determines the dtype for each column based on the first row in that column with a value. If the first row of the column is empty, read_excel continues to the next row until a value is found.
In Sheet1, your first row with values in columns B, C, and D contains strings. Therefore, all subsequent rows are treated as strings for these columns. In this case, FALSE = False.
In Sheet2, your first row with values in columns B, C, and D contains integers. Therefore, all subsequent rows are treated as integers for these columns. In this case, FALSE = 0.
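If you simply want the Booleans to survive the read, one workaround (a sketch, not tested against the linked file, and assuming a pandas version where read_excel accepts a dtype argument; 'Boolean_1.xlsx' and 'Sheet2' are from the question) is to skip the numeric inference entirely:
import pandas as pd
# Reading every column as object skips pandas' numeric type inference, so the
# boolean cells stay True/False instead of being coerced to 1.0/0.0.
df2 = pd.read_excel('Boolean_1.xlsx', 'Sheet2', dtype=object)
print(df2)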

Change values on some criteria in Pandas DataFrame and save it to the new df without affecting the original one

Here's the example. I will write Pandas DataFrame as a list so it's easier to read:
df_0 = [1,2,3]
I want to change values bigger than 2 to np.nan and save the new DataFrame to new variable df_1. Final result:
df_0 = [1,2,3]
df_1 = [1,2,np.nan]
You just need mask here (treating df_0 as a pandas Series rather than a plain Python list):
df_1 = df_0.mask(df_0 > 2)
df_1
Out[291]:
0 1.0
1 2.0
2 NaN
dtype: float64
df_0
Out[292]:
0 1
1 2
2 3
dtype: int64
What is your column name in the DataFrame? Let's say it is COL. Then you can do:
df_1 = df_0.copy()
df_1.loc[df_1['COL'] > 2, 'COL'] = np.nan
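As a self-contained sketch of that approach (COL is just the placeholder column name from above, and the values come from the question):
import numpy as np
import pandas as pd
df_0 = pd.DataFrame({'COL': [1, 2, 3]})
df_1 = df_0.copy()                          # work on a copy so df_0 is untouched
df_1.loc[df_1['COL'] > 2, 'COL'] = np.nan   # 3 becomes NaN; df_0 keeps 1, 2, 3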
Also, there is a method where that you can make use of.
df_0 = pd.Series([1,2,3])
df_1 = df_0.where(df_0 <= 2)
0 1.0
1 2.0
2 NaN
dtype: float64
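Putting the two together (a short sketch; df_0 is built as a pandas Series, as in the answers above), both mask and where return new objects and leave df_0 untouched:
import pandas as pd
df_0 = pd.Series([1, 2, 3])
df_1 = df_0.mask(df_0 > 2)     # NaN where the condition IS met
df_2 = df_0.where(df_0 <= 2)   # NaN where the condition is NOT met
# df_1 and df_2 are both [1.0, 2.0, NaN]; df_0 still holds 1, 2, 3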

How do I select an element in array column of a data frame?

I have the following data frame:
pa=pd.DataFrame({'a':np.array([[1.,4.],[2.],[3.,4.,5.]])})
I want to select the column 'a' and then only a particular element (i.e. first: 1., 2., 3.)
What do I need to add to:
pa.loc[:,['a']]
?
pa.loc[row] selects the row with label row.
pa.loc[row, col] selects the cells which are the intersection of row and col.
pa.loc[:, col] selects all rows and the column named col. Note that although this works, it is not the idiomatic way to refer to a column of a DataFrame; for that you should use pa['a'].
Now you have lists in the cells of your column, so you can use the vectorized string methods to access the elements of those lists, like so:
pa['a'].str[0] #first value in lists
pa['a'].str[-1] #last value in lists
Storing lists as values in a Pandas DataFrame tends to be a mistake because
it prevents you from taking advantage of fast NumPy or Pandas vectorized operations.
Therefore, you might be better off converting your DataFrame of lists of numbers into a wider DataFrame with native NumPy dtypes:
import numpy as np
import pandas as pd
pa = pd.DataFrame({'a':np.array([[1.,4.],[2.],[3.,4.,5.]])})
df = pd.DataFrame(pa['a'].values.tolist())
# 0 1 2
# 0 1.0 4.0 NaN
# 1 2.0 NaN NaN
# 2 3.0 4.0 5.0
Now, you could select the first column like this:
In [36]: df.iloc[:, 0]
Out[36]:
0 1.0
1 2.0
2 3.0
Name: 0, dtype: float64
or the first row like this:
In [37]: df.iloc[0, :]
Out[37]:
0 1.0
1 4.0
2 NaN
Name: 0, dtype: float64
If you wish to drop NaNs, use .dropna():
In [38]: df.iloc[0, :].dropna()
Out[38]:
0 1.0
1 4.0
Name: 0, dtype: float64
and .tolist() to retrieve the values as a list:
In [39]: df.iloc[0, :].dropna().tolist()
Out[39]: [1.0, 4.0]
but if you wish to leverage NumPy/Pandas for speed, you'll want to express your calculation as vectorized operations on df itself without converting back to Python lists.
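For instance (a small sketch on the wide df built above), a row-wise mean of each original list stays entirely in vectorized pandas code:
row_means = df.mean(axis=1)   # NaNs are skipped by default
# 0    2.5
# 1    2.0
# 2    4.0
# dtype: float64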

How to make pandas read_csv distinguish strings based on quoting

I want pandas.io.parsers.read_csv to distinguish between strings and the rest of data types based on the fact that strings in my csv file are always "quoted". Is it possible?
I have the following csv example:
"ID"|"DATE"|"NAME"|"YEAR"|"FLOAT"|"BOOL"
"01"|2000-01-01|"Name1"|1975|1.2|1
"02"||""||||
It should give me a dataframe where all the quoted guys are strings. Most likely pandas will make everything else np.float64, but I could deal with it afterwards. I want to wait with using dtype, because I have many columns, and I don't want to map types for all of them. I would like to try to make it only "quote"-based, if possible.
I tried to use quotechar='"' and quoting=3, but quotechar doesn't do anything at all, while quoting keeps the "", which I don't want either. It seems to me the pandas parsers should be able to do this, since this is the way strings are distinguished in csv files.
Specifying dtypes would be the more straightforward way, but if you don't want to do that, I'd suggest using quoting=3 (csv.QUOTE_NONE) and cleaning up afterwards. Here s is the CSV text shown in the question, and StringIO comes from the io module:
from io import StringIO
strip_char = lambda x: x.strip('"')
In [40]: df = pd.read_csv(StringIO(s), sep='|', quoting=3)
In [41]: df
Out[41]:
"ID" "DATE" "NAME" "YEAR" "FLOAT" "BOOL"
0 "01" 2000-01-01 "Name1" 1975 1.2 1
1 "02" NaN "" NaN NaN NaN
[2 rows x 6 columns]
In [42]: df = df.rename(columns=strip_char)
In [43]: df[['ID', 'NAME']] = df[['ID', 'NAME']].applymap(strip_char)
In [44]: df
Out[44]:
ID DATE NAME YEAR FLOAT BOOL
0 01 2000-01-01 Name1 1975 1.2 1
1 02 NaN NaN NaN NaN
[2 rows x 6 columns]
In [45]: df.dtypes
Out[45]:
ID object
DATE object
NAME object
YEAR float64
FLOAT float64
BOOL float64
dtype: object
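If you would rather not name the quoted columns one by one as in In [43], a sketch that strips the quotes from every string (object) column at once, after the column rename above:
obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = df[obj_cols].apply(lambda col: col.str.strip('"'))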
EDIT: Then you can set the index:
In [11]: df = df.set_index('ID')
In [12]: df
Out[12]:
DATE NAME YEAR FLOAT BOOL
ID
01 2000-01-01 Name1 1975 1.2 1
02 NaN NaN NaN NaN
[2 rows x 5 columns]
