What is the fastest way to query for staleness (unvarying data) in a DataFrame column, so that it returns the 'Stale' column?
For example:
from pandas import DataFrame
from numpy.random import randn
df = DataFrame(randn(50, 5))
df['Stale'] = 100.0
will yield a df similar to the following:
0 1 2 3 4 Stale
0 -0.064293 1.226319 -1.162909 -0.574240 -0.547402 100.0
1 0.529428 0.587148 0.367549 0.066041 -0.071709 100.0
2 -0.112633 0.217315 0.810061 -0.610718 0.179225 100.0
3 0.513706 -2.300195 -0.895974 0.853926 -1.604018 100.0
4 0.410546 0.641980 0.611272 1.121002 -1.082460 100.0
And I'd like to get the 'Stale' column returned. Right now I am doing
df.columns[df.std() == 0.0]
which works, but is probably not very efficient.
This:
df.columns[df.std() == 0.0]
returns an Index containing 'Stale', because the standard deviation of a stale column is zero.
If you define "staleness" as unvarying data, df.var() == 0 is slightly faster (probably because you don't need to take the square root). It also occurred to me to check df.max() == df.min() but that's actually slower.
To return the column using this information, do this:
df[df.columns[df.var() == 0.0]]
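For reference, here is a quick (and unscientific) way to compare the three checks on your own data with timeit; absolute numbers will vary with data size and pandas version:
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(50, 5))
df['Stale'] = 100.0

for expr in ("df.columns[df.std() == 0.0]",
             "df.columns[df.var() == 0.0]",
             "df.columns[df.max() == df.min()]"):
    t = timeit.timeit(expr, globals=globals(), number=1000)
    print(expr, round(t, 3), "s per 1000 runs")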
How about:
if 'Stale' in df.columns:  # test whether you have a column named 'Stale'
    _df = df.loc[:, df.columns != 'Stale']
    # do something with the DataFrame without the 'Stale' column
else:
    # _df = df
    # do something with the DataFrame directly
You have the following options that I can think of:
df.loc[:, df.columns != 'Stale'] will return the DataFrame without the 'Stale' column, and
df.loc[:, df.columns == 'Stale'] will return the 'Stale' column as a DataFrame if it is in the dataframe, and an empty DataFrame otherwise.
df.get('Stale') returns the 'Stale' column as a Series; if the column is not there, it returns None.
You can't just do df['Stale'], because if the column is not there, a KeyError will be raised.
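A quick illustration of the difference, using the toy frame from the question (get is called with parentheses, since it is a method):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 5))
df['Stale'] = 100.0

print(df.get('Stale'))    # the 'Stale' column as a Series
print(df.get('Missing'))  # column not present -> None, no exception
# print(df['Missing'])    # this line would raise a KeyError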
I suggest using the shift method of the pandas DataFrame:
df == df.shift()
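df == df.shift() gives a boolean DataFrame marking, cell by cell, whether a value repeats the one in the row above (the first row compares against NaN and is always False). To turn that into the stale column names, a sketch along these lines should work:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(50, 5))
df['Stale'] = 100.0

unchanged = df == df.shift()                        # True where a value repeats the previous row
stale_cols = df.columns[unchanged.iloc[1:].all()]   # skip the first row, require all True
print(stale_cols)                                   # Index(['Stale'], dtype='object')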
For the following screenshot, I want to change the NaN values under the total_claim_count_ge65 column to a 5 if the value in the ge65_suppress_flag column has the # symbol.
I want to use a for loop to go through the ge65_suppress_flag column and, every time it encounters a # symbol, change the NaN value in the very next column (total_claim_count_ge65) to a 5.
Try something like the following. Note that chained indexing such as df[...]['total_claim_count_ge65'].fillna(5, inplace=True) acts on a temporary copy and won't change df, so use .loc instead:
mask = df['ge65_suppress_flag'] == '#'
df.loc[mask, 'total_claim_count_ge65'] = df.loc[mask, 'total_claim_count_ge65'].fillna(5)
Creating a similar data frame:
import pandas as pd
df1 = pd.DataFrame({"ge65_suppress_flag": ['bla', 'bla', '#', 'bla'], "total_claim_count_ge65": [1.0, 2.0, None, 4.0]})
Filling in 5.0 in rows where the ge65_suppress_flag column value equals '#':
df1.loc[df1['ge65_suppress_flag']=="#", 'total_claim_count_ge65'] = 5.0
Using df.apply with a lambda:
import numpy as np
import pandas as pd

d = {'ge65_suppress_flag': ['not_supressed','not_supressed','#'], 'total_claim_count_ge65': [516.03, 881.0, np.nan]}
df = pd.DataFrame(data=d)
df['total_claim_count_ge65'] = df.apply(lambda x: 5 if x['ge65_suppress_flag']=='#' else x['total_claim_count_ge65'], axis=1)
print(df)
prints:
ge65_suppress_flag total_claim_count_ge65
0 not_supressed 516.03
1 not_supressed 881.00
2 # 5.00
I'll explain step-by-step after giving you a solution.
Here's a one-liner that will work.
df[df[0]=='#'] = df[df[0]=='#'].fillna(5)
To keep the solution general, I used the column's index (based on your screenshot). You can change the index number, or specify the column by name like so:
df['name_of_column']
Step-by-step explanation:
First, you want to use the values in your first column, df[0], to select only the rows equal to the string '#':
df[df[0]=='#']
Next, use the pandas fillna method to replace all values that are NaN with 5:
df[df[0]=='#'].fillna(5)
According to the fillna documentation, this function returns a new dataframe by default. So, to make the change stick, assign the result back to that subsection of your dataframe:
df[df[0]=='#'] = df[df[0]=='#'].fillna(5)
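Putting the same pattern together on a small made-up frame that uses the question's column names instead of a positional index (the data here is invented purely for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ge65_suppress_flag': ['ok', '#', 'ok'],
                   'total_claim_count_ge65': [10.0, np.nan, 3.0]})

mask = df['ge65_suppress_flag'] == '#'
df[mask] = df[mask].fillna(5)   # same one-liner, with the flag column selected by name
print(df)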
I am trying to create a row at the bottom of a dataframe to show the sums of certain columns. I am under the impression that this should be a really simple operation, but to my surprise, none of the methods I found on SO work for me in one step.
The methods that I've found on SO:
df.loc['TOTAL'] = df.sum()
This doesn't work for me as long as there are non-numeric columns in the dataframe; I need to select the numeric columns first and then concat the non-numeric columns back in.
df.append(df.sum(numeric_only=True), ignore_index=True)
This won't preserve my data types; integer columns will be converted to float.
df.loc['Total', 'ColumnA'] = df['ColumnA'].sum()
I can only use this to sum one column at a time.
I must have missed something in the process as this is not that hard an operation. Please let me know how I can add a sum row while preserving the data type of the dataframe.
Thanks.
Edit:
First off, sorry for the late update. I was on the road over the weekend.
Example:
df1 = pd.DataFrame(data = {'CountyID': [77, 95], 'Acronym': ['LC', 'NC'], 'Developable': [44490, 56261], 'Protected': [40355, 35943],
'Developed': [66806, 72211]}, index = ['Lehigh', 'Northampton'])
What I want to get would be
Please ignore the differences in the index.
It's a little tricky for me because I don't need the sum of the 'CountyID' column, since it's only used for indexing. So the question is really about getting the sum of specific numeric columns.
Thanks again.
Here is some toy data to use as an example:
df = pd.DataFrame({'A':[1.0,2.0,3.0],'B':[1,2,3],'C':['A','B','C']})
So that we can preserve the dtypes after the sum, we will store them as d
d = df.dtypes
Next, since we only want to sum numeric columns, pass numeric_only=True to sum(), but follow similar logic to your first attempt
df.loc['Total'] = df.sum(numeric_only=True)
And finally, reset the dtypes of your DataFrame to their original values (note that astype returns a new DataFrame rather than modifying in place).
df.astype(d)
A B C
0 1.0 1 A
1 2.0 2 B
2 3.0 3 C
Total 6.0 6 NaN
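If you need this often, the three steps can be wrapped in a small helper. This is only a sketch (add_total_row is a made-up name, not a pandas built-in):
import pandas as pd

def add_total_row(df, label='Total'):
    # work on a copy so the original frame is untouched
    out = df.copy()
    d = out.dtypes                               # remember the original dtypes
    out.loc[label] = out.sum(numeric_only=True)  # non-numeric columns get NaN in the new row
    return out.astype(d)                         # cast back to the remembered dtypes

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [1, 2, 3], 'C': ['A', 'B', 'C']})
print(add_total_row(df))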
To select the numeric columns, you can do
df_numeric = df.select_dtypes(include = ['int64', 'float64'])
df_num_cols = df_numeric.columns
Then do what you did first (using what I found here)
df.loc['Total'] = pd.Series(df[df_num_cols].sum(), index=df_num_cols)
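Applied to the example frame from the question's edit, and dropping 'CountyID' from the columns to sum (since no total is wanted for it), the same idea looks roughly like this; cols_to_sum is just a name made up for the sketch:
import pandas as pd

df1 = pd.DataFrame(data={'CountyID': [77, 95], 'Acronym': ['LC', 'NC'], 'Developable': [44490, 56261],
                         'Protected': [40355, 35943], 'Developed': [66806, 72211]},
                   index=['Lehigh', 'Northampton'])

# numeric columns, minus the identifier column that should not be totalled
cols_to_sum = df1.select_dtypes(include=['int64', 'float64']).columns.drop('CountyID')

d = df1.dtypes                                      # remember the original dtypes
df1.loc['Total'] = df1[cols_to_sum].sum()           # CountyID and Acronym get NaN in this row
df1 = df1.astype({c: d[c] for c in cols_to_sum})    # restore int64 on the summed columns
print(df1)                                          # CountyID stays float because of the NaN in the Total row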
I'm practicing with using apply with Pandas dataframes.
So I have cooked up a simple dataframe with dates, and values:
import numpy as np
import pandas as pd

dates = pd.date_range('2013', periods=10)
values = list(np.arange(1, 11, 1))
DF = pd.DataFrame({'date': dates, 'value': values})
I have a second dataframe, which is made up of 3 rows of the original dataframe:
DFa = DF.iloc[[1,2,4]]
So, I'd like to use the 2nd dataframe, DFa, and get the dates from each row (using apply), and then find and sum up any dates in the original dataframe that came earlier:
def foo(DFa, DF=DF):
    cutoff_date = DFa['date']
    ans = DF[DF['date'] < cutoff_date]

DFa.apply(foo, axis=1)
Things work fine. My question is, since I've created 3 ans, how do I access these values?
Obviously I'm new to apply and I'm eager to get away from loops. I just don't understand how to return values from apply.
Your function needs to return a value. E.g.,
def foo(df1, df2):
    cutoff_date = df1.date
    ans = df2[df2.date < cutoff_date].value.sum()
    return ans

DFa.apply(lambda x: foo(x, DF), axis=1)
Also, note that your current foo builds a DataFrame (the filtered rows) for each row of DFa, so if you returned that you would end up with a collection of DataFrames rather than the sums you want.
There's a bit of a mix-up in the way you're using apply. With axis=1, foo will be applied to each row (see the docs), and yet your code implies (by the parameter name) that its first parameter is a DataFrame.
Additionally, you state that you want to sum up the original DataFrame's values for the dates that come earlier. So foo needs to do this and return the result.
So the code needs to look something like this:
def foo(row, DF=DF):
    cutoff_date = row['date']
    return DF[DF['date'] < cutoff_date].value.sum()
Once you make these changes, since foo returns a scalar, apply will return a Series:
>> DFa.apply(foo, axis=1)
1 1
2 3
4 10
dtype: int64
I constantly struggle with cleanly iterating over or applying a function to pandas DataFrames of variable length, specifically a length-1 DataFrame slice (a pandas Series).
Simple example: a DataFrame and a function that acts on each row of it. The format of the dataframe is known/expected.
import pandas as pd

def stringify(row):
    return "-".join([row["y"], str(row["x"]), str(row["z"])])

df = pd.DataFrame(dict(x=[1, 2, 3], y=["foo", "bar", "bro"], z=[-99, 1.04, 213]))
Out[600]:
x y z
0 1 foo -99.00
1 2 bar 1.04
2 3 bro 213.00
df_slice = df.iloc[0] # This is a Series
Usually, you can apply the function in one of the following ways:
stringy = df.apply(stringify,axis=1)
# or
stringy = [stringify(row) for _,row in df.iterrows()]
Out[611]: ['foo-1--99.0', 'bar-2-1.04', 'bro-3-213.0']
## Error with same syntax if Series
stringy = df_slice.apply(stringify, axis=1)
If the dataframe is empty, or has only one entry, these methods no longer work. A Series does not have an iterrows() method, and Series.apply applies the function to each element rather than to the row as a whole.
Is there a cleaner built-in method to iterate over/apply functions to DataFrames of variable length? Otherwise you have to constantly write cumbersome logic like this:
if type(df) is pd.DataFrame:
    if len(df) == 0:
        return None
    else:
        return df.apply(stringify, axis=1)
elif type(df) is pd.Series:
    return stringify(df)
I realize there are methods to ensure you form length 1 DataFrames, but what I am asking is for a clean way to apply/iterate on the various pandas data structures when it could be like-formatted dataframes or series.
There is no generic way to write a function which will seamlessly handle both DataFrames and Series. You would either need to use an if-statement to check the type, or use try..except to handle exceptions.
Instead of doing either of those things, I think it is better to make sure you create the right type of object before calling apply. For example, instead of using df.iloc[0] which returns a Series, use df.iloc[:1] to select a DataFrame of length 1. As long as you pass a slice range instead of a single value to df.iloc, you'll get back a DataFrame.
In [155]: df.iloc[0]
Out[155]:
x 1
y foo
z -99
Name: 0, dtype: object
In [156]: df.iloc[:1]
Out[156]:
x y z
0 1 foo -99
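For example (a short sketch reusing stringify and df from the question): with the slice built this way, the same apply call works for one row and for zero rows without any special-casing:
import pandas as pd

def stringify(row):
    return "-".join([row["y"], str(row["x"]), str(row["z"])])

df = pd.DataFrame(dict(x=[1, 2, 3], y=["foo", "bar", "bro"], z=[-99, 1.04, 213]))

one_row = df.iloc[:1]                    # a DataFrame of length 1, not a Series
print(one_row.apply(stringify, axis=1))  # 0    foo-1--99.0

empty = df.iloc[:0]                      # an empty DataFrame with the same columns
print(empty.apply(stringify, axis=1))    # empty result, no exception raised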
I'm trying to select the first row of each group of a data frame.
import pandas as pd
import numpy as np
x = [{'id':"a",'val':np.nan, 'val2':-1},{'id':"a",'val':'TREE','val2':15}]
df = pd.DataFrame(x)
# id val val2
# 0 a NaN -1
# 1 a TREE 15
When I try to do this with groupby, I get
df.groupby('id', as_index=False).first()
# id val val2
# 0 a TREE -1
The row returned to me is nowhere in the original data frame. Do I need to do something special with NaN values in columns other than the groupby columns?
Found the following on the Pandas GitHub site; it appears to be a workaround. It uses the nth() method instead of first():
df.groupby('id', as_index=False).nth(0,dropna=False)
I didn't dig into it much. It seems odd that first() would actually use the val from a different row, but I haven't found documentation on first() to check whether this is by design.
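To see the difference side by side (same toy frame as above; nth(0) with its default dropna keeps the first row exactly as it appears, NaN included, while first() takes the first non-null value per column):
import numpy as np
import pandas as pd

x = [{'id': "a", 'val': np.nan, 'val2': -1}, {'id': "a", 'val': 'TREE', 'val2': 15}]
df = pd.DataFrame(x)

print(df.groupby('id', as_index=False).first())  # val is filled from the second row
print(df.groupby('id', as_index=False).nth(0))   # the actual first row, NaN and all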