Using data.table in R, you can simultaneously select and assign columns. Assume one has a data.table with 3 columns: col1, col2, and col3. One could do the following using data.table:
dt2 <- dt[, .(col1, col2, newcol = 3, anothercol = col3)]
I want to do something similar in pandas but it looks like it would take 3 lines.
df2 = df.copy()
df2['newcol'] = 3
df2 = df2.rename(columns={"col3": "anothercol"})
Is there a more concise way to do what I did above?
This might work:
import pandas as pd
ddict = {
    'col1': ['A','A','B','X'],
    'col2': ['A','A','B','X'],
    'col3': ['A','A','B','X'],
}
df = pd.DataFrame(ddict)
df.loc[:, ['col1', 'col2', 'col3']].rename(columns={"col3":"anothercol"}).assign(newcol=3)
result:
col1 col2 anothercol newcol
0 A A A 3
1 A A A 3
2 B B B 3
3 X X X 3
I don't know R, but what I'm seeing is that you are adding a new column called newcol that has a value of 3 in all the rows, and that you are also renaming the column col3 to anothercol.
You don't really need the copy step:
df2 = df.rename(columns = {'col3': 'anothercol'})
df2['newcol'] = 3
You can use df.assign for that.
Example:
>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                      index=['Portland', 'Berkeley'])
>>> df
temp_c
Portland 17.0
Berkeley 25.0
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
temp_c temp_f
Portland 17.0 62.6
Berkeley 25.0 77.0
>>> df.assign(newcol=3).rename(columns={"temp_c": "anothercol"})
anothercol newcol
Portland 17.0 3
Berkeley 25.0 3
And then you can assign the result to df2.
The first example is taken from the pandas docs.
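Putting the pieces together for the columns in the original question, a minimal sketch (assuming a frame with col1, col2 and col3 as above):
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
# select, rename col3 -> anothercol and add the constant newcol in one chained expression
df2 = (df[['col1', 'col2', 'col3']]
       .rename(columns={'col3': 'anothercol'})
       .assign(newcol=3))
assign also accepts callables, so newcol could instead be derived from existing columns, e.g. .assign(newcol=lambda d: d['col1'] * 2).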
Hello, I have the following dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'grade_1': ['A','B','C'],
                        'grade_1_count': [19, 28, 32],
                        'grade_2': ['pass', 'fail', np.nan],
                        'grade_2_count': [39, 18, np.nan]})
whereby some grades are missing and need to be inserted into the grade_n column according to the values in this dictionary
grade_dict = {'grade_1': ['A','B','C','D','E','F'],
              'grade_2': ['pass','fail','not present','borderline']}
and the corresponding row value in the _count column should be filled with np.nan
so the expected output is like this
expected_df = pd.DataFrame(data={'grade_1': ['A','B','C','D','E','F'],
                                 'grade_1_count': [19, 28, 32, 0, 0, 0],
                                 'grade_2': ['pass','fail','not present','borderline', np.nan, np.nan],
                                 'grade_2_count': [39, 18, 0, 0, np.nan, np.nan]})
So far I have this rather inelegant code that creates a column including all the correct categories for the grades, but I cannot reinsert it into the dataframe or fill the count columns with zeros (the np.nan values just reflect empty cells due to coercing columns with different numbers of rows). I hope that makes sense. Any advice would be great. Thanks.
x = []
for k, v in grade_dict.items():
    out = df[k].reindex(grade_dict[k], axis=0, fill_value=0)
    x = pd.concat([out], axis=1)
    x[k] = x.index
    x = x.reset_index(drop=True)
    df[k] = x.fillna(np.nan)
Here is a solution using two consecutive merges:
# set up combinations
from itertools import zip_longest
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
# merge
(df2.merge(df.filter(like='grade_1'),
           on='grade_1', how='left')
    .merge(df.filter(like='grade_2'),
           on='grade_2', how='left')
    .sort_index(axis=1)
)
output:
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D NaN borderline NaN
4 E NaN None NaN
5 F NaN None NaN
multiple merges:
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
for col in grade_dict:
    df2 = df2.merge(df.filter(like=col),
                    on=col, how='left')
df2
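If you also want the _count columns of the newly added grades filled with 0, as in the expected output (while keeping NaN where the grade itself is missing), a sketch like the following could be chained on afterwards; the *_count naming pattern is taken from the question:
# fill each *_count column with 0 only where the corresponding grade exists
for col in grade_dict:
    count_col = f'{col}_count'
    df2.loc[df2[col].notna() & df2[count_col].isna(), count_col] = 0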
If you only need to merge on grade_1 without updating the non-NaNs of grade_2, you can cast grade_dict into a df and then use combine_first:
print(df.set_index("grade_1")
        .combine_first(pd.DataFrame(grade_dict.values(),
                                    index=grade_dict.keys()).T.set_index("grade_1"))
        .fillna({"grade_1_count": 0})
        .reset_index())
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D 0.0 borderline NaN
4 E 0.0 None NaN
5 F 0.0 None NaN
I used pandas to load a CSV into the following DataFrame:
value values
0 56.0 [-0.5554548,10.0748005,4.232949]
1 72.0 [-0.1953888,0.15093994,-0.058532715]
...
Now I would like to replace the "values" column with 3 new columns, like so:
value values_a values_b values_c
0 56.0 -0.5554548 10.0748005 4.232949
1 72.0 -0.1953888 0.15093994 -0.058532715
...
How can I split the list into 3 columns?
You can use str.split after removing the [] with str.strip:
df1 = df.pop('values').str.strip('[]').str.split(',',expand=True).astype(float)
df[['values_a', 'values_b', 'values_c']] = df1
A solution if there are no NaNs:
L = [x.split(',') for x in df.pop('values').str.strip('[]').values.tolist()]
df[['values_a', 'values_b', 'values_c']] = pd.DataFrame(L).astype(float)
A solution that converts the column to a list first and then uses the DataFrame constructor:
import ast
s = df.pop('values').apply(ast.literal_eval)
df[['values_a', 'values_b', 'values_c']] = pd.DataFrame(s.values.tolist()).astype(float)
Similarly, you can parse the lists while reading the CSV:
df = pd.read_csv(file, converters={'values': ast.literal_eval})
print (df)
value values
0 56.0 [-0.5554548, 10.0748005, 4.232949]
1 72.0 [-0.1953888, 0.15093994, -0.058532715]
df1 = pd.DataFrame(df.pop('values').tolist()).astype(float)
df[['values_a', 'values_b', 'values_c']] = df1
Final:
print (df)
value values_a values_b values_c
0 56.0 -0.555455 10.074801 4.232949
1 72.0 -0.195389 0.150940 -0.058533
EDIT:
If some rows can contain more than 3 values, it is not possible to assign them to exactly 3 new columns. The solution is to use join:
df = df.join(df1.add_prefix('val'))
print (df)
value val0 val1 val2
0 56.0 -0.555455 10.074801 4.232949
1 72.0 -0.195389 0.150940 -0.058533
This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.
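To see what the intermediate step produces, here is a small sketch on the question's data; transform('mean') returns a Series aligned to the original index, which is exactly what fillna needs:
import numpy as np
import pandas as pd
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})
# group-wise means broadcast back to every row of the original frame
means = df.groupby('name')['value'].transform('mean')
# A rows -> 1.0, B rows -> 2.0, C rows -> 3.0
df['value'] = df['value'].fillna(means)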
#DSM has IMO the right answer, but I'd like to share my generalization and optimization of it: multiple columns to group by and multiple value columns:
df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
                .transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could add the selection at the end instead, but then you would run the transformation for all columns only to throw away all but one measure column afterwards. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed in proportion to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
    ...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
                .transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
                                 .transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan, np.nan, 4, 3],
                       'name': ['A','A', 'B','B','B','B', 'C','C','C'],
                       'class': list('ppqqrrsss')})
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s
I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')
The featured, highly ranked answer only works for a pandas DataFrame with just two columns. If you have more columns, use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
    lambda x: x.fillna(x.mean()))
To summarize all of the above concerning the efficiency of the possible solutions:
I have a dataset with 97,906 rows and 48 columns.
I want to fill 4 columns with the median of each group.
The column I want to group by has 26,200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
I ran the next solution only on a subset, since it was taking too long.
start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed that once a column was not numeric, the times went up sharply (which makes sense, as I was computing the median).
def groupMeanValue(group):
    # fill the NaNs in 'value' with the mean of that group's 'value' column
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

dft = df.groupby("name").transform(groupMeanValue)
I know this is an old question, but I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, from a timing point of view, that is the second worst thing to do after iterating over rows.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, the unanimity for apply/lambda-based solutions made me doubt my instinct), and this is indeed about 2.5× faster than the most upvoted solutions.
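A minimal sketch of how such a timing comparison could be reproduced with timeit (the data below is made up, and the numbers will vary with machine, data size and pandas version):
import timeit
import numpy as np
import pandas as pd
df = pd.DataFrame({'value': np.random.choice([1.0, 2.0, 3.0, np.nan], size=100_000),
                   'name': np.random.choice(list('ABC'), size=100_000)})
# vectorized: group-wise mean broadcast with transform, then a single fillna
t_transform = timeit.timeit(
    lambda: df['value'].fillna(df.groupby('name')['value'].transform('mean')),
    number=20)
# per-group apply with a lambda calling fillna inside each group
t_apply = timeit.timeit(
    lambda: df.groupby('name')['value'].apply(lambda x: x.fillna(x.mean())),
    number=20)
print(t_transform, t_apply)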
To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
You can also use dataframe_name.apply(lambda x: x.fillna(x.mean())).
energy.loc['Republic of Korea']
I want to change the value of index from 'Republic of Korea' to 'South Korea'.
But the dataframe is too large and it is not possible to change every index value. How do I change only this single value?
#EdChum's solution looks good.
Here's one using rename, which would replace all these values in the index.
energy.rename(index={'Republic of Korea':'South Korea'},inplace=True)
Here's an example
>>> example = pd.DataFrame({'key1' : ['a','a','a','b','a','b'],
                            'data1' : [1,2,2,3,np.nan,4],
                            'data2' : list('abcdef')})
>>> example.set_index('key1',inplace=True)
>>> example
data1 data2
key1
a 1.0 a
a 2.0 b
a 2.0 c
b 3.0 d
a NaN e
b 4.0 f
>>> example.rename(index={'a':'c'}) # can also use inplace=True
data1 data2
key1
c 1.0 a
c 2.0 b
c 2.0 c
b 3.0 d
c NaN e
b 4.0 f
You want to do something like this:
as_list = df.index.tolist()
idx = as_list.index('Republic of Korea')
as_list[idx] = 'South Korea'
df.index = as_list
Basically, you get the index as a list, change that one element, and then replace the existing index.
Try this:
df.rename(index={'Republic of Korea':'South Korea'},inplace=True)
If you have MultiIndex DataFrame, do this:
# input DataFrame
import pandas as pd
t = pd.DataFrame(data={'i1': [0,0,0,0,1,1,1,1,2,2,2,2],
                       'i2': [0,1,2,3,0,1,2,3,0,1,2,3],
                       'x': [1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,11.,12.]})
t.set_index(['i1','i2'], inplace=True)
t.sort_index(inplace=True)
# changes index level 'i1' values 0 to -1
t.rename(index={0:-1}, level='i1', inplace=True)
Here's another good one, using replace on the column.
df.reset_index(inplace=True)
df.drop('index', axis = 1, inplace=True)
df["Country"].replace("Republic of Korea", value="South Korea", inplace=True)
df.set_index("Country", inplace=True)
Here's another idea based on set_value
df = df.reset_index()
df.drop('index', axis = 1, inplace=True)
index = df.index[df["Country"] == "Republic of Korea"]
df.set_value(index, "Country", "South Korea")
df = df.set_index("Country")
df["Country"] = df.index
We can use the rename function to change a row index or a column name. Here is an example.
Suppose the data frame is as given below:
student_id marks
index
1 12 33
2 23 98
To change index 1 to 5, we use axis=0, which refers to rows:
df.rename({ 1 : 5 }, axis=0)
Here df refers to the data frame variable. The output will be:
student_id marks
index
5 12 33
2 23 98
To change a column name, we have to use axis=1:
df.rename({ "marks" : "student_marks" }, axis=1)
The changed data frame is:
student_id student_marks
index
5 12 33
2 23 98
This seems to work too:
energy.index.values[energy.index.tolist().index('Republic of Korea')] = 'South Korea'
No idea though whether this is recommended or discouraged.
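A related variant that builds a new index instead of mutating the underlying array in place (a sketch using Index.where):
energy.index = energy.index.where(energy.index != 'Republic of Korea', 'South Korea')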
I have a big data frame with lots of NaN values, and I want to store it as a smaller data frame that holds the indexes and values of all the non-NaN, non-zero entries.
dff = pd.DataFrame(np.random.randn(4,3), columns=list('ABC'))
dff.iloc[0:2,0] = np.nan
dff.iloc[2,2] = np.nan
dff.iloc[1:4,1] = 0
The data frame may look like this:
A B C
0 NaN -2.268882 0.337074
1 NaN 0.000000 1.340350
2 -1.526945 0.000000 NaN
3 -1.223816 0.000000 -2.185926
I want a data frame that looks like this:
0 B -2.268882
0 C 0.337074
1 C 1.340350
2 A -1.526945
3 A -1.223816
4 C -2.185926
How can I do it quickly, as I have a relatively big data frame, thousands by thousands...
Many thanks!
Replace 0 with np.nan and .stack() the result (see docs).
If there's a chance that you have rows of all np.nan values after .replace(), you could do .dropna(how='all') before .stack() to reduce the number of rows to pivot. If that could apply to columns, do .dropna(how='all', axis=1).
df.replace(0, np.nan).stack()
0 B -2.268882
C 0.337074
1 C 1.340350
2 A -1.526945
3 A -1.223816
C -2.185926
Combine with .reset_index() as needed.
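For example, to get a plain three-column frame like the one asked for, a sketch reusing dff and np from the question (the names row/column/value are just placeholders):
out = (dff.replace(0, np.nan)
          .stack()
          .rename_axis(['row', 'column'])
          .reset_index(name='value'))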
To select from a Series with a MultiIndex, use .loc[(level_0, level_1)]:
df.loc[(0, 'B')]  # -2.268882
Details on slicing etc in the docs.
I've come up with a somewhat ugly way of achieving this, but hey, it works. Note that this solution's index starts from 0 and it does not preserve the original ordering of 'A', 'B', 'C' as in your question, if that matters.
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(4,3), columns=list('ABC'))
dff.iloc[0:2,0] = np.nan
dff.iloc[2,2] = np.nan
dff.iloc[1:4,1] = 0
dff.iloc[2,1] = np.nan
# mask to do logical and for two lists
mask = lambda y,z: list(map(lambda x: x[0] and x[1], zip(y,z)))
# create new frame
new_df = pd.DataFrame()
types = []
vals = []
# iterate over columns
for col in dff.columns:
    # get the non-empty and non-zero values from the current column
    data = dff[col][mask(dff[col].notnull(), dff[col] != 0)]
    # add the corresponding original column name
    types.extend([col for x in range(len(data))])
    vals.extend(data)
# populate the dataframe
new_df['Types'] = pd.Series(types)
new_df['Vals'] = pd.Series(vals)
print(new_df)
# A B C
#0 NaN -1.167975 -1.362128
#1 NaN 0.000000 1.388611
#2 1.482621 NaN NaN
#3 -1.108279 0.000000 -1.454491
# Types Vals
#0 A 1.482621
#1 A -1.108279
#2 B -1.167975
#3 C -1.362128
#4 C 1.388611
#5 C -1.454491
I am looking forward to a more pandas/Python-like answer myself!