This is my task:
Write a function that accepts a dataframe as input, the name of the column with missing values, and a list of grouping columns, and returns the dataframe with the missing values filled in with the group median.
Here is what I tried to do:
def fillnull(set, col):
    val = {col: set[col].sum() / set[col].count()}
    set.fillna(val)
    return set

fillnull(titset, 'Age')
My problem is that my function doesn't work. Also, I don't know how to compute the median or how to apply the grouping inside this function.
Here are screenshots of my dataframe and the missing values in my dataset:

[screenshot: DATAFRAME]
[screenshot: NaN values]
Check whether this code works for you:
import pandas as pd

df = pd.DataFrame({
    'processId': range(100, 900, 100),
    'groupId': [1, 1, 2, 2, 3, 3, 4, 4],
    'other': [1, 2, 3, None, 3, 4, None, 9]
})
print(df)
def fill_na(df, missing_value_col, grouping_col):
    # per-group medians, indexed by the grouping column
    values = df.groupby(grouping_col)[missing_value_col].median()
    # index on the grouping column so fillna can align medians to rows
    df.set_index(grouping_col, inplace=True)
    df[missing_value_col] = df[missing_value_col].fillna(values)
    df.reset_index(grouping_col, inplace=True)
    return df
fill_na(df, 'other', 'groupId')
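The original task also asks for a list of grouping columns. A hedged alternative sketch using groupby(...).transform('median'), which broadcasts each group's median back onto the matching rows and avoids the index juggling (the function name here is illustrative):

def fill_na_transform(df, missing_value_col, grouping_cols):
    # transform('median') returns a Series aligned with df's original index,
    # holding each row's group median
    medians = df.groupby(grouping_cols)[missing_value_col].transform('median')
    df[missing_value_col] = df[missing_value_col].fillna(medians)
    return df

fill_na_transform(df, 'other', ['groupId'])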
I have two dataframes of different shapes.
The 'ANTENNA1' and 'ANTENNA2' columns in the bigger dataframe correspond to the ID column in the smaller dataframe. I want to merge the smaller dataframe into the bigger one so that the bigger dataframe gets '(POSITION, col1)', '(POSITION, col2)', '(POSITION, col3)' according to ANTENNA1 == ID.
Edit: I tried pd.merge, but it changes the original dataframe's column values.
Original:
df = pd.merge(df_main, df_sub, left_on='ANTENNA1', right_on='id', how='left')
Result:
I want to keep the original dataframe columns as they are.
Assuming your first dataframe (with positions) is called df1 and the second is called df2, with your loaded data you could just use pandas.DataFrame.merge (i.e. pd.merge(...)):
df = pd.merge(df1, df2, left_on='id', right_on='ANTENNA1')
Then you can select the needed columns (col1, col2, ..) to get the desired result: df[["col1","col2",..]].
A simple example:
import pandas as pd

# create dataframes df1 and df2
df1 = pd.DataFrame({'ID': [1, 2, 3, 5, 7, 8],
                    'Name': ['Sam', 'John', 'Bridge',
                             'Edge', 'Joe', 'Hope']})
df2 = pd.DataFrame({'id': [1, 2, 4, 5, 6, 8, 9],
                    'Marks': [67, 92, 75, 83, 69, 56, 81]})

# merge df1 and df2 on ID; only the rows with common IDs,
# i.e. {1, 2, 5, 8}, are kept
df = pd.merge(df1, df2, left_on="ID", right_on="id")
print(df)
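If the worry is that overlapping column names get suffixed or overwritten in the merged result, merge's suffixes parameter can keep the left-hand names untouched. A minimal sketch assuming df_main and df_sub are shaped roughly as in the question (the column names here are placeholders):

import pandas as pd

df_main = pd.DataFrame({'ANTENNA1': [1, 2, 3], 'value': [10, 20, 30]})
df_sub = pd.DataFrame({'id': [1, 2, 3],
                       'col1': [0.1, 0.2, 0.3],
                       'col2': [1.1, 1.2, 1.3]})

# how='left' keeps every row of df_main; suffixes=('', '_sub') leaves
# df_main's column names unchanged if any names overlap
merged = df_main.merge(df_sub, left_on='ANTENNA1', right_on='id',
                       how='left', suffixes=('', '_sub'))
print(merged)

Note that pd.merge returns a new dataframe and leaves df_main itself untouched.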
I need to append the row data from a column in df1 into separate dfs.
The row value from column 'i1' in df1 should correspond to the name of the dataframe that it needs appending to, and there is a common id column across the dataframes.
However, the i1 name and the name of the tables are different. I have created a dictionary below so you can see what I mean.
d_map = {'ab1': 'c30_sab1',
         'cd2': 'kjm_1cd2'}
Example data and the expected output are shown below with df1. Any pointers would be great, thanks so much.
df1
df = pd.DataFrame(data={'id': [1, 1, 2, 2, 3], 'i1': ['ab1','cd2','ab1','cd2','ab1'], 'i2': ['10:25','10:27','11:51','12:01','13:18']})
Tables that need the i2 column from df1 appended, depending on an id and i1 match:
c30_sab = pd.DataFrame(data={'id': [1, 2, 3]})
kjm_1cd = pd.DataFrame(data={'id': [1, 2]})
Expected output:
e_ab1 = pd.DataFrame(data={'id': [1, 2, 3], 'i2': ['10:25','11:51','13:18']})
e_cd2 = pd.DataFrame(data={'id': [1, 2], 'i2': ['10:27','12:01']})
A simple way to do it (assuming you accept repetitions when the df ids are duplicated):
df_ab1 = df[df['i1'] == 'ab1']  # select only the rows for the 'ab1' df
df_cd2 = df[df['i1'] == 'cd2']  # select only the rows for the 'cd2' df

e_ab1 = c30_sab.merge(df_ab1[['id', 'i2']], on='id')
e_cd2 = kjm_1cd.merge(df_cd2[['id', 'i2']], on='id')
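To avoid hardcoding one block per value of i1, the d_map dictionary from the question can drive a loop instead. A hedged sketch, assuming the target tables are kept in a dict keyed by the d_map names:

import pandas as pd

df = pd.DataFrame(data={'id': [1, 1, 2, 2, 3],
                        'i1': ['ab1', 'cd2', 'ab1', 'cd2', 'ab1'],
                        'i2': ['10:25', '10:27', '11:51', '12:01', '13:18']})
tables = {'c30_sab1': pd.DataFrame(data={'id': [1, 2, 3]}),
          'kjm_1cd2': pd.DataFrame(data={'id': [1, 2]})}
d_map = {'ab1': 'c30_sab1', 'cd2': 'kjm_1cd2'}

results = {}
for i1_value, table_name in d_map.items():
    # rows of df belonging to this table, with just the join key and i2
    subset = df.loc[df['i1'] == i1_value, ['id', 'i2']]
    results[table_name] = tables[table_name].merge(subset, on='id')

print(results['c30_sab1'])
print(results['kjm_1cd2'])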
I'm trying to write a function that takes a pandas DataFrame as an argument and at some point concatenates this dataframe with another.
For example:
def concat(df):
    df = pd.concat((df, pd.DataFrame({'E': [1, 1, 1]})), axis=1)
I would like this function to modify the input df in place, but I can't find how to achieve this. When I do
...
print(df)
concat(df)
print(df)
the dataframe df is identical before and after the function call.
Note: I don't want to do df['E'] = [1, 1, 1] because I don't know how many columns will be added to df. So I want to use pd.concat(), if possible...
This will edit the original DataFrame in place and give the desired output, as long as the new data contains the same number of rows as the original and there are no conflicting column names.
It's the same idea as your df['E'] = [1, 1, 1] suggestion, except it will work for an arbitrary number of columns.
I don't think there is a way to achieve this using pd.concat, as it doesn't have an inplace parameter like some other pandas functions do.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [10, 20, 30], 'D': [40, 50, 60]})
df[df2.columns] = df2
Results (df):
A B C D
0 1 4 10 40
1 2 5 20 50
2 3 6 30 60
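Wrapped in a function, the same idea makes the mutation visible to the caller, which was the original goal. A minimal sketch, assuming the new columns arrive as a DataFrame with the same number of rows:

import pandas as pd

def concat_inplace(df, extra):
    # assigning through df[...] mutates the object the caller passed in,
    # unlike rebinding df to the result of pd.concat inside the function
    df[extra.columns] = extra

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
concat_inplace(df, pd.DataFrame({'E': [1, 1, 1], 'F': [2, 2, 2]}))
print(df)  # df now has columns A, B, E, F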
I would like to drop all data in a pandas dataframe, but am getting TypeError: drop() takes at least 2 arguments (3 given). I essentially want a blank dataframe with just my column headers.
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df.drop(axis=0, inplace=True)
print(df)
You need to pass the labels to be dropped.
df.drop(df.index, inplace=True)
By default, it operates on axis=0.
You can achieve the same with
df.iloc[0:0]
which is much more efficient.
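A quick check that iloc[0:0] keeps the schema, i.e. the column headers and dtypes the question wants to preserve:

import pandas as pd

df = pd.DataFrame({'Day': [1, 2], 'Visitors': [43, 43]})
empty = df.iloc[0:0]
print(empty.columns.tolist())  # ['Day', 'Visitors']
print(empty.dtypes)            # original dtypes are preserved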
My favorite:
df = df.iloc[0:0]
But be aware that df.index.max() will be nan.
To add items I use:

import math

df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = data
My favorite way is:
df = df[0:0]
Overwrite the dataframe with something like this:
import pandas as pd
df = pd.DataFrame(None)
or if you want to keep columns in place
df = pd.DataFrame(columns=df.columns)
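Note that rebuilding from columns alone resets every column's dtype to object, whereas df.iloc[0:0] preserves the original dtypes.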
If your goal is to drop the dataframe, then you need to pass all columns. For me, the best way is to pass a list comprehension to the columns kwarg. This will then work regardless of the columns in a given df.
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df = df.drop(columns=[i for i in df.columns])
This code makes a clean dataframe:

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# clean
df = pd.DataFrame()
Right now I'm having to do calculations on dataframe_one, then create a new column on dataframe_two and fill in the results. dataframe_one is multi-indexed, while the second one is not, but there are columns that are matched to the indices in dataframe_one.
This is what I'm currently doing:
import pandas as pd
import numpy as np

dataframe_two = {}
dataframe_two['project_id'] = [1, 2]
dataframe_two['scenario'] = ['hgh', 'low']
dataframe_two = pd.DataFrame(dataframe_two)

dataframe_one = {}
dataframe_one['ts_project_id'] = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
dataframe_one['ts_scenario'] = ['hgh', 'hgh', 'hgh', 'hgh', 'hgh', 'low', 'low', 'low', 'low', 'low']
dataframe_one['ts_economics_atcf'] = [-2, 2, -3, 4, 5, -6, 3, -3, 4, 5]
dataframe_one = pd.DataFrame(dataframe_one)
dataframe_one.index = [dataframe_one['ts_project_id'], dataframe_one['ts_scenario']]

project_scenario = zip(dataframe_two['project_id'], dataframe_two['scenario'])
dataframe_two['econ_irr'] = np.zeros(len(dataframe_two.index))

i = 0
for project, scenario in project_scenario:
    # Grab the corresponding series from dataframe_one
    atcf = dataframe_one.ix[project].ix[scenario]['ts_economics_atcf']
    irr = np.irr(atcf.values)
    dataframe_two['econ_irr'][i] = irr
    i = i + 1

print(dataframe_two)
Is there an easier way to do this?
Cheers!
If I understood right, you want the pandas equivalent of SQL GROUP BY and aggregation functions. They are essentially the same: the groupby method of a DataFrame and the aggregate method of the resulting groupby.SeriesGroupBy object.
>>> dataframe_one['ts_economics_atcf'].groupby(level=[0, 1]).aggregate(np.irr)
ts_project_id  ts_scenario
1              hgh            0.544954
2              low            0.138952
dtype: float64
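As a hedged follow-up sketch, the grouped result can then be written back into dataframe_two in one pass (np.irr was later moved to the separate numpy-financial package, so this assumes an older NumPy or that package):

irr = dataframe_one.groupby(level=[0, 1])['ts_economics_atcf'].aggregate(np.irr)
# look up each (project_id, scenario) pair in the grouped IRR series
dataframe_two['econ_irr'] = [
    irr[(p, s)]
    for p, s in zip(dataframe_two['project_id'], dataframe_two['scenario'])
]
print(dataframe_two)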