In one of the columns of my dataframe I have five values:
1, G, 2, 3, 4
How can I change all the "G" values to 1?
I tried:
df = df['col_name'].replace({'G': 1})
I also tried:
df = df['col_name'].replace('G',1)
"G" is in fact 1 (I do not know why there is a mixed naming)
Edit:
It works correctly with:
df['col_name'] = df['col_name'].replace({'G': 1})
If I am understanding your question correctly, you are trying to change the values in a column, not the column name itself.
Given that you have mixed data types there, I assume the column is of type object and thus the numbers are read as strings.
df['col_name'] = df['col_name'].str.replace('G', '1')
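Note that str.replace leaves you with the string '1' rather than the number 1. If you want a numeric column afterwards, you can convert it, for example with pd.to_numeric. A minimal sketch, assuming every value in the column really was read in as a string (the data here is made up):
import pandas as pd

df = pd.DataFrame({'col_name': ['1', 'G', '2', '3', '4']})  # made-up data, all read in as strings

df['col_name'] = df['col_name'].str.replace('G', '1')  # all values are now the strings '1'..'4'
df['col_name'] = pd.to_numeric(df['col_name'])          # optional: convert to a numeric dtype

print(df['col_name'].tolist())  # [1, 1, 2, 3, 4]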
You could try the following line:
df.replace('G', 1, inplace=True)
Use numpy:
import numpy as np
df['a'] = np.where((df.a =='G'), 1, df.a)
You can try this. Let's say your data is like:
ab=pd.DataFrame({'a':[1,2,3,'G',5]})
Then you replace it as:
ab1=ab.replace('G',4)
I have a pandas dataframe that works just fine. I am trying to figure out how to tell if a column with a label that I know is correct does not contain all the same values.
The code below errors out for some reason when I want to check whether the column contains -1 in each cell:
# column = "TheColumnLabelThatIsCorrect"
# df = "my correct dataframe"
# I get an "() takes 1 or 2 arguments but 3 is passed in" error
if (not df.loc(column, estimate.eq(-1).all())):
I just learned about .eq() and .all() and hopefully I am using them correctly.
It's a syntax issue - see docs for .loc/indexing. Specifically, you want to be using [] instead of ()
You can do something like
if not df[column].eq(-1).all():
...
If you want to use .loc specifically, you'd do something similar:
if not df.loc[:, column].eq(-1).all():
...
Also, note you don't need to use .eq(); you can just do (df[column] == -1).all() if you prefer.
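A fully reproducible sketch of that check, with made-up column names:
import pandas as pd

df = pd.DataFrame({'all_neg_one': [-1, -1, -1], 'mixed': [-1, 0, -1]})

print(df['all_neg_one'].eq(-1).all())  # True
print(df['mixed'].eq(-1).all())        # False

if not df['mixed'].eq(-1).all():
    print('column contains values other than -1')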
You could drop duplicates and if you get only one record it means all records are the same.
import pandas as pd
df = pd.DataFrame({'col': [1, 1, 1, 1]})
len(df['col'].drop_duplicates()) == 1
> True
The question is not entirely clear, but let's try the following.
Contains only -1 in each cell
df['estimate'].eq(-1).all()
Contains -1 in any cell
df['estimate'].eq(-1).any()
Select the rows where estimate equals -1 (keeping all columns)
df.loc[df['estimate'].eq(-1),:]
df['column'].value_counts() gives you all the unique values in a column and their counts. As for checking whether all the values are a specific number, you can do that by collecting the unique values and checking that there is only one:
len(set(df['column'])) == 1
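As a small sketch (sample data made up), Series.nunique() expresses the same check a bit more directly:
import pandas as pd

df = pd.DataFrame({'column': [-1, -1, -1]})

print(len(set(df['column'])) == 1)  # True
print(df['column'].nunique() == 1)  # True: only one distinct value
print(df['column'].eq(-1).all())    # True: and that value is -1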
I have a dataframe that I am concatenating from dataframes and arrays.
Somehow it has inherited the index of the original dataframe, hence I am trying to exclude rows based on one of the columns that should not have missing values.
If I view my dataframe, this is what it shows:
print(model_data2['is_62p_days_overdue'][0:11])
now, when I run:
print(model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'] != np.nan)[0:11])
I get the exact same output.
And when I run:
print(model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'] == np.nan)[0:11])
What am I missing? This is driving me nuts!
I've tried resetting the index - but this also does nothing.
IIUC:
Instead of this:
print(model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'] != np.nan)[0:11])
try the loc accessor with the notna() method:
print(model_data2.loc[model_data2['is_62p_days_overdue'].notna(),'is_62p_days_overdue'][0:11])
Answer to the comment:
There are two reasons for it:
1. You can't compare NaNs the way you do in your method (see the quick check below):
model_data2['is_62p_days_overdue'] != np.nan
# this is wrong; use the notna() method instead
2. You are using the where method, so even with the corrected condition it just puts NaN back into those rows instead of dropping them:
model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'].notna())
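The first point is easy to verify: NaN is not equal to anything, not even itself, so a != np.nan mask is True for every row and filters nothing. A minimal sketch:
import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, 1.0])

print(np.nan == np.nan)        # False: NaN never equals anything, not even itself
print((s != np.nan).tolist())  # [True, True, True] - this mask keeps every row
print(s.notna().tolist())      # [True, False, True] - this is the mask you actually want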
see the "# rows you may want to see" in the bottom of my code
import pandas as pd
import numpy as np

# make a dataset
values = {'is_62p_days_overdue': [0, 0, 0, 0, 0, None, None, 0, None, 0, None]}
data = pd.DataFrame(values)
print(data)

# append the numbers 1~10 (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
data = pd.concat([data, pd.DataFrame({'is_62p_days_overdue': list(range(1, 10 + 1))})], ignore_index=True)
data

# rows you may want to see
data.loc[~(data.is_62p_days_overdue.isna())]
You can use .dropna() to drop the rows with NaN values:
model_data2.dropna(subset = ['is_62p_days_overdue'], inplace = True)
I am quite new to Python programming.
I am working with the following dataframe:
[screenshot: Before]
Note that in the column "FBgn" there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with the FBgn values provided in the adjacent column called "## FlyBase_FBgn", while keeping the existing FBgn values in the "FBgn" column. Keep in mind that I am showing only a portion of the dataframe (in reality it has 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
[screenshot: After]
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
Welcome to Stack Overflow. Next time, please provide more info, including your code; it is always helpful.
Please see the code below; I think you need something similar.
import pandas as pd

# ignore dict1, I just wanted to recreate your df
dict1 = {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'],
         "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1)  # recreating your dataframe

# print df
print(df)

# function to replace the values
def replace_values(df):
    for i in range(len(df)):
        if 'tr' in df['FBgn'][i]:
            # .loc avoids chained assignment when writing back
            df.loc[i, 'FBgn'] = df['FBtr'][i]
    return df

df = replace_values(df)

# print new df
print(df)
I have a specific problem with pandas: I need to select rows in a dataframe which start with specific letters.
Details: I've imported my data into a dataframe and selected the columns that I need. I've also narrowed it down to the row index I need. Now I also need to select rows in another column where the objects START with the letters 'pl'.
Is there any solution to select row only based on first two characters in it?
I was thinking about
pl = df['Code'] == pl*
but it won't work due to row indexing. Advice appreciated!
Use startswith for this:
df = df[df['Code'].str.startswith('pl')]
Fully reproducible example for those who want to try it.
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]
If you use a string method on the Series that should return you a true/false result. You can then use that as a filter combined with .loc to create your data subset.
new_df = df.loc[df['Code'].str.startswith('pl')].copy()
The condition is just a filter that you then apply to the dataframe. As the filter you can use the method Series.str.startswith:
df_pl = df[df['Code'].str.startswith('pl')]
Let's say df is a pandas DataFrame.
I would like to find all columns of numeric type.
Something like:
isNumeric = is_numeric(df)
You could use the select_dtypes method of DataFrame. It has two parameters, include and exclude. So isNumeric would look like:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
Simple one-line answer to create a new dataframe with only numeric columns:
df.select_dtypes(include=np.number)
If you want the names of numeric columns:
df.select_dtypes(include=np.number).columns.tolist()
Complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': range(7, 10),
'B': np.random.rand(3),
'C': ['foo','bar','baz'],
'D': ['who','what','when']})
df
# A B C D
# 0 7 0.704021 foo who
# 1 8 0.264025 bar what
# 2 9 0.230671 baz when
df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only
# A B
# 0 7 0.704021
# 1 8 0.264025
# 2 9 0.230671
colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only
# ['A', 'B']
You can use the undocumented function _get_numeric_data() to filter only numeric columns:
df._get_numeric_data()
Example:
In [32]: data
Out[32]:
A B
0 1 s
1 2 s
2 3 s
3 4 s
In [33]: data._get_numeric_data()
Out[33]:
A
0 1
1 2
2 3
3 4
Note that this is a "private method" (i.e., an implementation detail) and is subject to change or total removal in the future. Use with caution.
df.select_dtypes(exclude = ['object'])
Update:
df.select_dtypes(include= np.number)
or with a newer version of pandas:
df.select_dtypes('number')
Simple one-liner:
df.select_dtypes('number').columns
The following code will return a list of the names of the numeric columns of a data set.
cnames=list(marketing_train.select_dtypes(exclude=['object']).columns)
Here marketing_train is my data set, select_dtypes() selects data types using the exclude and include arguments, and .columns fetches the column names of the data set.
The output of the above code will be the following:
['custAge',
'campaign',
'pdays',
'previous',
'emp.var.rate',
'cons.price.idx',
'cons.conf.idx',
'euribor3m',
'nr.employed',
'pmonths',
'pastEmail']
This is another simple way to find the numeric columns in a pandas data frame:
numeric_clmns = df.dtypes[df.dtypes != "object"].index
We can include and exclude data types as required, as below:
train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number') #will include all the numeric types
Referenced from a Jupyter Notebook.
From the select_dtypes docstring:
- To select all numeric types, use np.number or 'number'
- To select strings you must use the object dtype, but note that this will return all object dtype columns
- See the NumPy dtype hierarchy: http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html
- To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
- To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
- To select Pandas categorical dtypes, use 'category'
- To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
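A small made-up example (the column names are arbitrary) of how these selectors behave:
import pandas as pd

df = pd.DataFrame({
    'num': [1, 2, 3],
    'txt': ['a', 'b', 'c'],
    'cat': pd.Categorical(['x', 'y', 'x']),
    'when': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']),
})

print(df.select_dtypes('number').columns.tolist())    # ['num']
print(df.select_dtypes('category').columns.tolist())  # ['cat']
print(df.select_dtypes('datetime').columns.tolist())  # ['when']
print(df.select_dtypes('object').columns.tolist())    # ['txt']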
Although this is an old subject, I think the following formula is easier than all the other suggestions:
df[df.describe().columns]
Since the describe() function only works on numeric columns, the columns of its output will only be the numeric ones.
Please see the below code:
if dataset.select_dtypes(include=[np.number]).shape[1] > 0:
    display(dataset.select_dtypes(include=[np.number]).describe())

# np.object was removed from NumPy; the built-in object selects the string columns here
if dataset.select_dtypes(include=[object]).shape[1] > 0:
    display(dataset.select_dtypes(include=[object]).describe())
This way you can check whether the values are numeric (such as float and int) or strings. The second if statement checks for the string values, which are referred to by the object dtype.
Adapting this answer, you could do:
df.loc[:, df.applymap(np.isreal).all(axis=0)]
Here, df.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .all(axis=0) checks whether all values in a column are True, returning a Series of Booleans that can be used to select the desired columns.
A lot of the posted answers are inefficient. These answers either return/select a subset of the original dataframe (a needless copy) or perform needless computational statistics in the case of describe().
To just get the column names that are numeric, one can use a conditional list comprehension with the pd.api.types.is_numeric_dtype function:
numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
I'm not sure when this function was introduced.
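A quick, made-up illustration:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5], 'c': ['x', 'y']})

numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
print(numeric_cols)  # ['a', 'b']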
def is_type(df, baseType):
    import numpy as np
    import pandas as pd
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data=test, index=df.columns, columns=["test"])

def is_float(df):
    import numpy as np
    # np.float was removed from NumPy; np.floating covers the NumPy float dtypes
    return is_type(df, np.floating)

def is_number(df):
    import numpy as np
    return is_type(df, np.number)

def is_integer(df):
    import numpy as np
    return is_type(df, np.integer)
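Usage might look something like this (sample data invented for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3], 'c': ['x', 'y', 'z']})

print(is_number(df)['test'].tolist())  # [True, True, False]

# the boolean column can also be used to pick out the numeric columns
print(df.loc[:, is_number(df)['test']].columns.tolist())  # ['a', 'b']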