I have tried everything I can come up with and would appreciate some help! :)
This is a function that should return an imputed part of a data frame:
import pandas as pd
import numpy as np

def imputation(df, columns_to_imputed):
    # Step 1: Get a part of the dataframe using the columns received as a parameter.
    df.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']  # Set the column headers
    part_of_df = pd.DataFrame(df.filter(columns_to_imputed, axis=1))
    part_of_df = part_of_df.drop([0], axis=0)

    # Step 2: Change the zero values in the columns to np.nan.
    part_of_df = part_of_df.replace('0', np.nan)

    # Step 3: Change the NaN values to the mean of each attribute (column).
    # You can use the apply() and fillna() functions.
    part_of_df = part_of_df.fillna(part_of_df.mean(axis=0))  # I've tried everything on this row and can't get it to work. I want to fill each NaN value with the mean of the column it's in.
    return part_of_df  # I'm returning this part to see if the NaNs are replaced, but nothing has happened...
You were on the right track, you just need to make a small change. Here I created a sample DataFrame and introduced some NaNs:
dummy_df = pd.DataFrame({"col1":range(5), "col2":range(5)})
dummy_df['col1'][1] = None
dummy_df['col1'][3] = None
dummy_df['col2'][4] = None
and got this:
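   col1  col2
0   0.0   0.0
1   NaN   1.0
2   2.0   2.0
3   NaN   3.0
4   4.0   NaN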
Disclaimer: Don't use my method of value assignment. Use proper indexing through loc.
Now, I use apply() and lambda to iterate over each column and fill NaNs with the mean value:
dummy_df = dummy_df.apply(lambda x: x.fillna(x.mean()), axis=0)
This gives me:
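   col1  col2
0   0.0   0.0
1   2.0   1.0
2   2.0   2.0
3   2.0   3.0
4   4.0   1.5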
Hope this helps!
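One more thing worth checking in the original code: replace('0', np.nan) only matches the string '0', not the number 0, and if the columns still hold strings (which can happen when a header row gets read as data) the mean() call has nothing numeric to work with. A minimal sketch of steps 2 and 3 that covers both cases, assuming the columns are meant to be numeric:
part_of_df = part_of_df.apply(pd.to_numeric, errors='coerce')  # strings -> numbers, anything else -> NaN
part_of_df = part_of_df.replace(0, np.nan)                     # treat zeros as missing
part_of_df = part_of_df.fillna(part_of_df.mean())              # fill NaNs with each column's mean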
I am currently learning pandas and I am using an IMDb movies database, in which one of the columns is the duration of the movies. However, one of the values is "None", so I can't calculate the mean because of this string in the middle of the data. I thought of changing "None" to 0, but that would skew the results, as can be seen with the code below.
dur_temp = duration.replace("None", 0)
dur_temp = dur_temp.astype(float)
descricao_duration = dur_temp.mean()
Any ideas on what I should do in order not to skew the data? I also graphed it, and the graph makes the skew even clearer.
You can replace "None" with numpy.nan instead of using 0.
Something like this should do the trick:
import numpy as np
dur_temp = duration.replace("None", np.nan)
descricao_duration = dur_temp.mean()
If you want it to work for any string in your pandas Series, you could use pd.to_numeric:
pd.to_numeric(dur_temp, errors='coerce').mean()
This way, all values that cannot be converted to float will be replaced by NaN, regardless of what the string is.
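For example, with a toy Series (values made up for illustration):
import pandas as pd

duration = pd.Series([90, 120, "None", 105])
print(pd.to_numeric(duration, errors='coerce').mean())  # 105.0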
Just filter by condition, like this:
df[df['a']!='None'] #assuming your mean values are in column a
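Then compute the mean on the filtered column, converting to float first since the column held strings (still assuming the values are in column a):
df[df['a'] != 'None']['a'].astype(float).mean()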
Make them np.nan values. I am writing this as an answer because I can't comment:
df = df.replace('None', np.nan)
or
df.replace('None', np.nan, inplace=True)
You can use fillna(value=np.nan) as shown below:
descricao_duration = dur_temp.fillna(value=np.nan).mean()
Demo:
import pandas as pd
import numpy as np
dur_temp = pd.DataFrame({'duration': [10, 20, None, 15, None]})
descricao_duration = dur_temp.fillna(value=np.nan).mean()
print(descricao_duration)
Output:
duration 15.0
dtype: float64
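Note that mean() skips NaN by default (skipna=True), and pandas already converts Python None to NaN when it builds a numeric column, which is why the mean comes out right here.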
I simulate a data frame as follows:
import pandas as pd
import numpy as np
# Create Missing Values in DataFrame
df = pd.DataFrame(np.random.randn(5,5))
df[df > 0.9] = np.nan
df.columns = ['A', 'B','C','D','E']
df
which I gave the column names A, B, C, D, E. I have this Python code to remove rows that contain at least one missing value, using pandas, as follows:
df.loc[(~pd.isnull(df['A']))&\
(~pd.isnull(df['B']))&\
(~pd.isnull(df['C']))&\
(~pd.isnull(df['D']))&\
(~pd.isnull(df['E']))]
How can I instead remove the column(s) that contain at least one missing value in any of their rows, or only the columns with a missing value in a specific row?
Use df.dropna() to remove rows/columns that contain NaNs. Read more: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
Consider boolean selection:
df.loc[:, ~df.isnull().any(axis=0)]
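Here df.isnull().any(axis=0) yields one boolean per column (True if that column contains at least one NaN), and ~ inverts the mask, so .loc keeps only the fully populated columns. For the "specific row" part, test just that row instead, e.g. with row label 2 (an arbitrary example):
df.loc[:, df.loc[2].notnull()]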
You can use dropna while specifying axis:
df = df.dropna(axis='columns', how='any')
The default for how is 'any', but you can be explicit.
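dropna can also handle the "specific row" part of the question: with axis='columns', the subset parameter restricts the NaN check to particular row labels, e.g. (row label 0 as an example):
df = df.dropna(axis='columns', subset=[0])  # drop columns that have NaN in row 0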
You can use isnull() and drop (note that the number of NaNs is counted with .sum(), not .count()):
for i in df.columns:
    if df[i].isnull().sum() > 0:
        df = df.drop(i, axis=1)
I came across this issue while practicing importing and summarizing data. Can anyone help?
The output doesn't show any error, but I don't know how to handle the 'UN' markers with np.nan.
# Names of the columns we're searching for missing values
columns = ['median', 'p25th', 'p75th']

# Take a look at the dtypes
print(recent_grads[columns].dtypes)

# Find how missing values are represented
print(recent_grads['median'].unique())

# Replace missing values with NaN
for column in columns:
    recent_grads.loc[recent_grads['median'] == 'UN', column] = np.nan
While reading the CSV, use the parameter na_values:
pd.read_csv('', na_values = 'UN')
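If 'UN' should only count as missing in certain columns, na_values also accepts a dict mapping column names to sentinel values (the filename below is a placeholder):
import pandas as pd

recent_grads = pd.read_csv('recent-grads.csv',  # placeholder filename
                           na_values={'median': 'UN', 'p25th': 'UN', 'p75th': 'UN'})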
It seems like you need to import numpy and use nan to solve the problem the way they'd like. 'nan', short for 'not a number', is what they want you to use to represent the absence of a value.
import numpy as np
# ... your code up to the last bit ...
# Note: assigning to a loop variable would not modify the DataFrame,
# so replace the 'UN' markers directly:
recent_grads['median'] = recent_grads['median'].replace('UN', np.nan)
I am currently trying to implement a statistical test for a specific row based on the content of different rows. Given the dataframe in the following image:
DataFrame
I would like to create a new column based on a function that takes into account all the rows of the dataframe that have the same string in the "Template" column.
For example, in this case there are 2 rows with Template "[Are|Off]", and for each one of those rows I would need to create an element in a new column based on "Clicks", "Impressions" and "Conversions" of both rows.
How would you best approach this problem?
PS: I apologise in advance for the way I am describing the problem; as you might have noticed, I am not a professional coder :D But I would really appreciate your help!
Here is the formula with which I solved this in Excel:
Excel Chi Squared test
This might be overly general but I would use some sort of function map if different things should be done depending on the template name:
import pandas as pd
import numpy as np
import collections
template_column = ['are|off', 'are|off', 'comp', 'comp', 'comp|city']
n = len(template_column)
df = pd.DataFrame(np.random.random((n, 3)), index=range(n), columns=['Clicks', 'Impressions', 'Conversions'])
df['template'] = template_column
# Use a defaultdict so that you get a default value if a template is
# not defined
function_map = collections.defaultdict(lambda: lambda df: np.nan)
# Now define functions to compute what the new columns should do depending on
# the template.
function_map.update({
'are|off': lambda df: df.sum().sum(),
'comp': lambda df: df.mean().mean(),
'something else': lambda df: df.mean().max()
})
# The lambda functions are just placeholders. You could do whatever you want in these functions... for example:
def do_special_stuff(df):
"""Do something that uses rows and columns...
you could also do looping or whatever you want as long
as the result is a scalar, or a sequence with the same
number of columns as the original template DataFrame
"""
crazy_stuff = np.prod(np.sum(df.values,axis=1)[:,None] + 2*df.values, axis=1)
return crazy_stuff
function_map['comp'] = do_special_stuff
def wrap(f):
"""Wrap a function so that it returns an updated dataframe"""
def wrapped(df):
df = df.copy()
new_column_data = f(df.drop('template', axis=1))
df['new_column'] = new_column_data
return df
return wrapped
# wrap all the functions so that each template has a function defined that does
# the correct thing
series_function_map = {k: wrap(function_map[k]) for k in df['template'].unique()}
# throw everything back together
new_df = pd.concat([series_function_map[label](group)
for label, group in df.groupby('template')],
ignore_index=True)
# print your shiny new dataframe
print(new_df)
The result is then something like:
Clicks Impressions Conversions template new_column
0 0.959765 0.111648 0.769329 are|off 4.030594
1 0.809917 0.696348 0.683587 are|off 4.030594
2 0.265642 0.656780 0.182373 comp 0.502015
3 0.753788 0.175305 0.978205 comp 0.502015
4 0.269434 0.966951 0.478056 comp|city NaN
Hope it helps!
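If what you need is specifically the chi-squared test from your Excel formula, here is a minimal sketch with scipy, assuming each template's rows form a small contingency table built from the Clicks, Impressions and Conversions columns of the sample frame above (adjust the table construction to match your Excel setup):
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_p_value(group):
    # Need at least two rows to form a contingency table
    if len(group) < 2:
        return pd.Series(np.nan, index=group.index)
    table = group[['Clicks', 'Impressions', 'Conversions']].to_numpy()
    _, p_value, _, _ = chi2_contingency(table)
    return pd.Series(p_value, index=group.index)

df['chi2_p'] = df.groupby('template', group_keys=False).apply(chi2_p_value)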
OK, so after the groupby you need to apply this formula; you can do this in pandas as well:
import numpy as np
t = df.groupby("Template")  # group the rows that share a Template

def calculater(b5, b6, c5, c6):
    return b5 / (b5 + b6) * (c5 + c6)

df['result'] = np.vectorize(calculater)(df["b5"], df["b6"], df["c5"], df["c6"])
Here b5, b6, ... are the column names of the cells shown in the image.
This should work for you, though you may need to make some minor changes to the maths.
I have a dataframe. For each row of the dataframe: I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd
DF = pd.read_csv("...")
Equation_1 = f(x, y)
Equation_2 = g(x, y)
for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF, reading and entering new values for each row, this code iterates over DF and enters the same values for each row. I am not sure what I am doing wrong here.
Also, from what I have read it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. Not sure how I would go about this.
Thanks.
Turns out that this is extremely easy. All that must be done is to define two variables and assign the desired columns to them, then set the column to be filled equal to the equation containing those variables.
Pandas already knows that it must apply the equation to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd
df = pd.read_csv("...") # df is a large 2D array
A = df[0]
B = df[1]
def f(A, B):
    ...  # define the equation here

df[3] = f(A, B)
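A concrete toy version of this column-wise assignment (made-up column names and values):
import pandas as pd

df = pd.DataFrame({'m': [1, 2, 3], 'n': [4, 5, 6]})

# Arithmetic on whole columns is applied row by row automatically
df['p'] = df['m'] + df['n']       # 5, 7, 9
df['q'] = df['m'] * df['n'] ** 2  # 16, 50, 108
print(df)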
# If your equations are simple enough, do operations column-wise in Pandas:
import pandas as pd
test = pd.DataFrame([[1,2],[3,4],[5,6]])
test # Default column names are 0, 1
test[0] # This is column 0
test.iloc[:, 0] # This is column 0 by position, returned as a Series (icol() has been removed from pandas)
test.columns = ['S', 'Q'] # Column names are easier to use
test # Column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test # results stored in DataFrame
# For more complicated stuff, try apply, as in Python pandas apply on more columns :
def toyfun(df):
    # df here is a single row passed as a Series; use iloc for positional access
    return df.iloc[0] - df.iloc[1]**2

test['out2'] = test[['S','Q']].apply(toyfun, axis=1)
# You can also define the column names when you generate the DataFrame:
test2 = pd.DataFrame([[1,2],[3,4],[5,6]], columns=list('AB'))
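On the NumPy point from the question: the column-wise pandas operations above are already vectorized, but you can also work on the underlying arrays explicitly (a sketch reusing the toy frame test):
s = test['S'].to_numpy()
q = test['Q'].to_numpy()
test['out3'] = s**2 + q  # same result as the column-wise version above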