I am very new to python and I would just like some guidance.
I want to know how to iterate over each value for each column of my data frame to apply a function I have created myself. First I need to check if it is numeric and if yes then I can proceed with my function.
Should I first make each column into lists? If so whats the best way?
You probably dont want to use a loop for that but instead return a Series containing the information OnlyNumeric/NotOnlyNumeric for every column.
You use pd.to_numeric and coerce the errors like this:
import pandas as pd
df = pd.DataFrame({'col': [1, 2, 10, np.nan, 'a'],
'col2': ['a', 10, 30, 40, 50],
'col3': [1, 2, 3, 4, 5.0]})
df.apply(lambda col: pd.to_numeric(col, errors='coerce').notnull().all())
output:
col False
col2 False
col3 True
dtype: bool
You can then use .all() again to get a single True or False and continue on with applying your function.
in one-line:
df.apply(lambda col: pd.to_numeric(col, errors='coerce').notnull().all()).all()
Related
Suppose I have a dataframe where some columns contain a zero value as one of their elements (or potentially more than one zero). I don't specifically want to retrieve these columns or discard them (I know how to do that) - I just want to locate these. For instance: if there is are zeros somewhere in the 4th, 6th and the 23rd columns, I want a list with the output [4,6,23].
You could iterate over the columns, checking whether 0 occurs in each columns values:
[i for i, c in enumerate(df.columns) if 0 in df[c].values]
Use any() for the fastest, vectorized approach.
For instance,
df = pd.DataFrame({'col1': [1, 2, 3],
'col2': [0, 100, 200],
'col3': ['a', 'b', 'c']})
Then,
>>> s = df.eq(0).any()
col1 False
col2 True
col3 False
dtype: bool
From here, it's easy to get the indexes. For example,
>>> s[s].tolist()
['col2']
Many ways to retrieve the indexes from a pd.Series of booleans.
Here is an approach that leverages a couple of lambda functions:
d = {'a': np.random.randint(10, size=100),
'b': np.random.randint(1,10, size=100),
'c': np.random.randint(10, size=100),
'd': np.random.randint(1,10, size=100)
}
df = pd.DataFrame(d)
df.apply(lambda x: (x==0).any())[lambda x: x].reset_index().index.to_list()
[0, 2]
Another idea based on #rafaelc slick answer (but returning relative locations of the columns instead of column names):
df.eq(0).any().reset_index()[lambda x: x[0]].index.to_list()
[0, 2]
Or with the column names instead of locations:
df.apply(lambda x: (x==0).any())[lambda x: x].index.to_list()
['a', 'c']
I have a pandas DataFrame with 4 columns and I want to create a new DataFrame that only has three of the columns. This question is similar to: Extracting specific columns from a data frame but for pandas not R. The following code does not work, raises an error, and is certainly not the pandasnic way to do it.
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new = pd.DataFrame(zip(old.A, old.C, old.D)) # raises TypeError: data argument can't be an iterator
What is the pandasnic way to do it?
There is a way of doing this and it actually looks similar to R
new = old[['A', 'C', 'D']].copy()
Here you are just selecting the columns you want from the original data frame and creating a variable for those. If you want to modify the new dataframe at all you'll probably want to use .copy() to avoid a SettingWithCopyWarning.
An alternative method is to use filter which will create a copy by default:
new = old.filter(['A','B','D'], axis=1)
Finally, depending on the number of columns in your original dataframe, it might be more succinct to express this using a drop (this will also create a copy by default):
new = old.drop('B', axis=1)
The easiest way is
new = old[['A','C','D']]
.
Another simpler way seems to be:
new = pd.DataFrame([old.A, old.B, old.C]).transpose()
where old.column_name will give you a series.
Make a list of all the column-series you want to retain and pass it to the DataFrame constructor. We need to do a transpose to adjust the shape.
In [14]:pd.DataFrame([old.A, old.B, old.C]).transpose()
Out[14]:
A B C
0 4 10 100
1 5 20 50
columns by index:
# selected column index: 1, 6, 7
new = old.iloc[: , [1, 6, 7]].copy()
As far as I can tell, you don't necessarily need to specify the axis when using the filter function.
new = old.filter(['A','B','D'])
returns the same dataframe as
new = old.filter(['A','B','D'], axis=1)
Generic functional form
def select_columns(data_frame, column_names):
new_frame = data_frame.loc[:, column_names]
return new_frame
Specific for your problem above
selected_columns = ['A', 'C', 'D']
new = select_columns(old, selected_columns)
As an alternative:
new = pd.DataFrame().assign(A=old['A'], C=old['C'], D=old['D'])
If you want to have a new data frame then:
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new= old[['A', 'C', 'D']]
You can drop columns in the index:
df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3], 'D': [4, 4]})
df[df.columns.drop(['B', 'C'])]
or
df.loc[:, df.columns.drop(['B', 'C'])]
Output:
A D
0 1 4
1 1 4
df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3], 'D': [4, 4]})
new = df.filter(['A','B','D'], axis=1)
Imagine I have a Pandas DataFrame:
# create df
df = pd.DataFrame({'id': [1,1,1,2,2,2],
'val': [5,4,6,3,2,3]})
Lets assume it is ordered by 'id' and an imaginary, not shown, date column (ascending).
I want to create another column where each row is a list of 'val' at that date.
The ending DataFrame will look like this:
df = pd.DataFrame({'id': [1,1,1,2,2,2],
'val': [5,4,6,3,2,3],
'val_list': [[5],[5,4],[5,4,6],[3],[3,2],[3,2,3]]})
I don't want to use a loop because the actual df I am working with has about 4 million records. I am imagining I would use a lambda function in conjunction with groupby (something like this):
df['val_list'] = df.groupby('id')['val'].apply(lambda x: x.runlist())
This raises an AttributError because the runlist() method does not exist, but I am thinking the solution would be something like this.
Does anyone know what to do to solve this problem?
Let us try
df['new'] = df.val.map(lambda x : [x]).groupby(df.id).apply(lambda x : x.cumsum())
Out[138]:
0 [5]
1 [5, 4]
2 [5, 4, 6]
3 [3]
4 [3, 2]
5 [3, 2, 3]
Name: val, dtype: object
I want to Fill in these missing numbers in column b with the consecutive values 1 and 2.
This is what I have done:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 4, 7,8,4],
'b': [1, np.nan, 3, np.nan, 5]})
df['b'].fillna({'b':[1,2]}, inplace=True)
but nothing is done.
One way is to use loc with an array:
df.loc[df['b'].isnull(), 'b'] = [1, 2]
What you're attempting is possible but cumbersome with fillna:
nulls = df['b'].isnull()
df['b'] = df['b'].fillna(pd.Series([1, 2], index=nulls[nulls].index))
You may be looking for interpolate but the above solutions are generic given an input list or array.
If, on the other hand, you want to fill nulls with a sequence 1, 2, 3, etc, you can use cumsum:
# fillna solution
df['b'] = df['b'].fillna(df['b'].isnull().cumsum())
# loc solution
nulls = df['b'].isnull()
df.loc[nulls, 'b'] = nulls.cumsum()
You can't feed fillna a list of values, as stated here and in the documentation. Also, if you're selecting the column, no need to tell fillna which column to use. You could do:
df.fillna({'b':1}, inplace=True)
Or
df['b'].fillna(1, inplace=True)
By the way, inplace is on the way to deprecation in Pandas, the preferred way to do this is, for example
df = df.fillna({'b':1})
You can interpolate. Example:
s = pd.Series([0, 1, np.nan, 3])
s.interpolate()
0 0
1 1
2 2
3 3
If I understand wording " consecutive values 1 and 2" correctly, the solution may be:
from itertools import isclice, cycle
filler = [1, 2]
nans = df.b.isna()
df.loc[nans, 'b'] = list(islice(cycle(filler), sum(nans)))
Suppose I want to compare the content of two dataframes, but not the column names (or index names). Is it possible to achieve this without renaming the columns?
For example:
df = pd.DataFrame({'A': [1,2], 'B':[3,4]})
df_equal = pd.DataFrame({'a': [1,2], 'b':[3,4]})
df_diff = pd.DataFrame({'A': [1,2], 'B':[3,5]})
In this case, df is df_equal but different to df_diff, because the values in df_equal has the same content, but the ones in df_diff. Notice that the column names in df_equal are different, but I still want to get a true value.
I have tried the following:
equals:
# Returns false because of the column names
df.equals(df_equal)
eq:
# doesn't work as it compares four columns (A,B,a,b) assuming nulls for the one that doesn't exist
df.eq(df_equal).all().all()
pandas.testing.assert_frame_equal:
# same as equals
pd.testing.assert_frame_equal(df, df_equal, check_names=False)
I thought that it was going to be possible to use the assert_frame_equal, but none of the parameters seem to work to ignore column names.
pd.DataFrame is built around pd.Series, so it's unlikely you will be able to perform comparisons without column names.
But the most efficient way would be to drop down to numpy:
assert_equal = (df.values == df_equal.values).all()
To deal with np.nan, you can use np.testing.assert_equal and catch AssertionError, as suggested by #Avaris :
import numpy as np
def nan_equal(a,b):
try:
np.testing.assert_equal(a,b)
except AssertionError:
return False
return True
assert_equal = nan_equal(df.values, df_equal.values)
I just needed to get the values (numpy array) from the data frame, so the column names won't be considered.
df.eq(df_equal.values).all().all()
I would still like to see a parameter on equals, or assert_frame_equal. Maybe I am missing something.
An advantage of this compared to #jpp answer is that, I can get see which columns do not match, calling only all() only once:
df.eq(df_diff.values).all()
Out[24]:
A True
B False
dtype: bool
One problem is that when eq is used, then np.nan is not equal to np.nan, in which case the following expression, would serve well:
(df.eq(df_equal.values) | (df.isnull().values & df_equal.isnull().values)).all().all()
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
for i in range(df1.shape[0]):
for j in range(df1.shape[1]):
print(df1.iloc[i, j] == df2.iloc[i, j])
Will return:
True
True
True
True
Same thing for:
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
One obvious issue is that column names matters in Pandas to sort dataframes. For example:
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 2], 'B': [3, 4]})
print(df1)
print(df2)
renders as ('B' is before 'a' in df2):
a b
0 1 3
1 2 4
B a
0 3 1
1 4 2