I came across this issue while practicing importing and summarizing data. Can anyone help?
The image doesn't seem to show any error, but I can't figure out the 'UN' and np.nan issue.
# Names of the columns we're searching for missing values
columns = ['median', 'p25th', 'p75th']
# take a look at the dtypes
print(recent_grads[columns].dtypes)
# find how missing values are represented
print(recent_grads['median'].unique())
# replace missing values with NaN
for column in columns:
    recent_grads.loc[recent_grads['median'] == 'UN', column] = np.nan
final output:
While reading the CSV, use the parameter na_values:
pd.read_csv('', na_values='UN')
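For example, a minimal sketch (the file name 'recent-grads.csv' is just a placeholder for whatever you are loading, and this assumes 'UN' is the only marker used for missing data):
import pandas as pd
# treat every 'UN' string as a missing value while parsing the file
recent_grads = pd.read_csv('recent-grads.csv', na_values='UN')
# the affected columns should then come in as floats with NaN instead of object dtype
print(recent_grads[['median', 'p25th', 'p75th']].dtypes)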
It seems like you need to import numpy and use nan to solve the problem the way they'd like it. NaN, short for 'not a number', is what they want you to use to represent the absence of a value.
import numpy as np
# ... your code up to the last bit ...
for value in recent_grads['median'].unique():
    if value == 'UN':
        # rebinding the loop variable wouldn't touch the DataFrame,
        # so write the replacement back to the column instead
        recent_grads['median'] = recent_grads['median'].replace(value, np.nan)
I have tried everything I can come up with and would appreciate some help! :)
This is a method that is supposed to return an imputed part of a data frame.
from statistics import mean
from unicodedata import numeric
import pandas as pd
import numpy as np

def imputation(df, columns_to_imputed):
    df.set_axis(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], axis=1, inplace=True)  # set the headers
    # Step 1: Get a part of the dataframe using the columns received as a parameter.
    part_of_df = pd.DataFrame(df.filter(columns_to_imputed, axis=1))
    part_of_df = part_of_df.drop([0], axis=0)
    # Step 2: Change the zero values in the columns to np.nan
    part_of_df = part_of_df.replace('0', np.nan)
    # Step 3: Change the nan values to the mean of each attribute (column).
    # You can use the apply(), fillna() functions.
    part_of_df = part_of_df.fillna(part_of_df.mean(axis=0))  # I've tried everything on this row and can't get it to work. I want to fill each NaN value with the mean of the column it's in.
    return part_of_df  # I'm returning this part to see if the NaNs are replaced, but nothing happens...
You were on the right track; you just need to make a small change. Here I created a sample DataFrame and introduced some NaNs:
import pandas as pd
dummy_df = pd.DataFrame({"col1": range(5), "col2": range(5)})
dummy_df['col1'][1] = None
dummy_df['col1'][3] = None
dummy_df['col2'][4] = None
and got this:
Disclaimer: Don't use my method of value assignment. Use proper indexing through loc.
Now, I use apply() and lambda to iterate over each column and fill NaNs with the mean value:
dummy_df = dummy_df.apply(lambda x: x.fillna(x.mean()), axis=0)
This gives me:
Hope this helps!
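As a small follow-up (standard pandas behaviour, not part of the answer above): fillna() also accepts a Series of per-column fill values, and df.mean() returns exactly that, so the apply/lambda can be collapsed into a single call:
# df.mean() is a Series indexed by column name; fillna matches it to each column
dummy_df = dummy_df.fillna(dummy_df.mean())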
I have a pandas dataframe that works just fine. I am trying to figure out how to tell whether a column with a label that I know is correct does not contain all the same values.
The code below errors out for some reason when I want to see if the column contains -1 in each cell:
# column = "TheColumnLabelThatIsCorrect"
# df = "my correct dataframe"
# I get an "() takes 1 or 2 arguments but 3 were passed" error
if (not df.loc(column, estimate.eq(-1).all())):
I just learned about .eq() and .all() and hopefully I am using them correctly.
It's a syntax issue - see the docs for .loc/indexing. Specifically, you want to be using [] instead of ().
You can do something like
if not df[column].eq(-1).all():
...
If you want to use .loc specifically, you'd do something similar:
if not df.loc[:, column].eq(-1).all():
...
Also, note you don't need to use .eq(); you can just do (df[column] == -1).all() if you prefer.
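For instance, a minimal sketch of that check on made-up data, assuming the column is called 'estimate':
import pandas as pd
df = pd.DataFrame({'estimate': [-1, -1, 3]})
# True whenever at least one cell differs from -1
if not df['estimate'].eq(-1).all():
    print('the column contains values other than -1')  # printed for this toy data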
You could drop duplicates: if you end up with only one record, it means all the values are the same.
import pandas as pd
df = pd.DataFrame({'col': [1, 1, 1, 1]})
len(df['col'].drop_duplicates()) == 1
> True
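As a side note (standard pandas, not from the answer above), Series.nunique() counts the distinct non-null values directly, so the same check can be written as:
df['col'].nunique() == 1
> True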
The question is not entirely clear. Let's try the following, though.
Contains only -1 in each cell
df['estimate'].eq(-1).all()
Contains -1 in any cell
df['estimate'].eq(-1).any()
Select the rows where the value is -1, keeping all columns
df.loc[df['estimate'].eq(-1),:]
df['column'].value_counts() gives you a list of all unique values and their counts in a column. As for checking whether all the values are a specific number, you can collapse the column to its unique values and check that only one remains:
len(set(df['column'])) == 1
I am currently learning pandas and I am using an IMDb movies database, in which one of the columns is the duration of the movies. However, one of the values is "None", so I can't calculate the mean because of this string in the middle. I thought of changing the "None" to 0, but that would skew the results, as can be seen with the code below.
dur_temp = duration.replace("None", 0)
dur_temp = dur_temp.astype(float)
descricao_duration = dur_temp.mean()
Any ideas on what I should do in order not to skew the data? I also graphed it, and the graph makes the skew even clearer.
You can replace "None" with numpy.nan instead of using 0.
Something like this should do the trick:
import numpy as np
dur_temp = duration.replace("None", np.nan)
descricao_duration = dur_temp.mean()
If you want this to work for any string in your pandas Series, you could use pd.to_numeric:
pd.to_numeric(dur_temp, errors='coerce').mean()
This way, all the values that cannot be converted to float will be replaced by NaN, regardless of which string they are.
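A quick sketch of that coercion on made-up data (the values are invented just for illustration):
import pandas as pd
duration = pd.Series([100, 90, "None", 110])
# "None" cannot be parsed as a number, becomes NaN, and is ignored by mean()
print(pd.to_numeric(duration, errors='coerce').mean())  # 100.0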
Just filter by the condition, like this:
df[df['a'] != 'None']  # assuming the values you are averaging are in column 'a'
Make them np.nan values. I am writing this as an answer because I can't comment:
df = df.replace('None', np.nan)
or
df.replace('None', np.nan, inplace=True)
You can use fillna(value=np.nan) as shown below:
descricao_duration = dur_temp.fillna(value=np.nan).mean()
Demo:
import pandas as pd
import numpy as np
dur_temp = pd.DataFrame({'duration': [10, 20, None, 15, None]})
descricao_duration = dur_temp.fillna(value=np.nan).mean()
print(descricao_duration)
Output:
duration 15.0
dtype: float64
I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that in column "FBgn" there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with the FBgn values provided in the adjacent column called "## FlyBase_FBgn", while keeping the existing FBgn values in column "FBgn". Keep in mind that I am showing only a portion of the dataframe (in reality it has 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
Welcome to Stack Overflow. Next time, please provide more info, including your code; it is always helpful.
Please see the code below; I think you need something similar.
import pandas as pd

# ignore dict1, I just wanted to recreate your df
dict1 = {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'], "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1)  # recreating your dataframe

# print df
print(df)

# function to replace the values
def replace_values(df):
    for i in range(0, (df.size // 2)):
        if 'tr' in df['FBgn'][i]:
            df['FBgn'][i] = df['FBtr'][i]
    return df

df = replace_values(df)

# print new df
print(df)
I was wondering how you could add a new column with values calculated from other columns.
Let's say a row is 'small' if it has a value less than 5. I want to record this information in a new column called 'small', which has the value 1 if the row is small and 0 otherwise.
I know you can do that with numpy like this:
import pandas as pd
import numpy as np
rawData = pd.read_csv('data.csv', encoding = 'ISO-8859-1')
rawData['small'] = np.where(rawData['value'] < 5, '1', '0')
How would you do the same operation without using numpy?
I've read about numpy.where() but I still don't get it.
Help would be really appreciated thanks!
You can convert the boolean mask to an integer and then to a string:
rawData['small'] = (rawData['value'] < 5).astype(int).astype(str)
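A minimal sketch of what that produces, using a made-up 'value' column:
import pandas as pd
rawData = pd.DataFrame({'value': [2, 7, 4]})
# boolean mask -> 0/1 integers -> '0'/'1' strings, no numpy needed
rawData['small'] = (rawData['value'] < 5).astype(int).astype(str)
print(rawData['small'].tolist())  # ['1', '0', '1']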