how to overwrite values of first row of dataframe - python

Given a pandas.DataFrame such as:
df = pd.DataFrame(np.random.randn(10,5), columns = ['a','b','c','d','e'])
I would like to know the best way to replace all values in the first row with a 0 (or some other specific value) and work with the new dataframe. I would like to do this in a general way, where there may be more or fewer columns than in this example.
Despite the simplicity of the question, I was not able to come across a solution; most examples posted by others had to do with fillna() and related methods.

You can use iloc to do that pretty cleanly:
Code:
df.iloc[0] = 0
Test Code:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(df)
df.iloc[0] = 0
print(df)
Results:
a b c d e
0 0.715524 -0.914676 0.241008 -1.353033 0.170578
1 -0.300348 1.118491 -0.520407 0.185877 -0.950839
2 1.942239 0.980477 0.110457 -0.558483 0.903775
3 0.400923 1.347769 -0.120445 0.036253 0.683571
4 -0.761881 -0.642469 2.030019 2.274070 -0.067672
5 0.566003 0.263949 -0.567247 0.689599 0.870442
6 1.904812 -0.689312 1.400950 1.942681 -1.268679
7 -0.253381 0.464208 1.362960 0.129433 0.527576
8 -1.404035 0.174586 1.006268 0.007333 1.172559
9 0.330404 0.735610 1.277451 -0.104888 0.528356
a b c d e
0 0.000000 0.000000 0.000000 0.000000 0.000000
1 -0.300348 1.118491 -0.520407 0.185877 -0.950839
2 1.942239 0.980477 0.110457 -0.558483 0.903775
3 0.400923 1.347769 -0.120445 0.036253 0.683571
4 -0.761881 -0.642469 2.030019 2.274070 -0.067672
5 0.566003 0.263949 -0.567247 0.689599 0.870442
6 1.904812 -0.689312 1.400950 1.942681 -1.268679
7 -0.253381 0.464208 1.362960 0.129433 0.527576
8 -1.404035 0.174586 1.006268 0.007333 1.172559
9 0.330404 0.735610 1.277451 -0.104888 0.528356
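If you need per-column values rather than a single scalar, iloc row assignment also accepts any sequence whose length matches the number of columns; a hedged variant of the same idea:
df.iloc[0] = np.zeros(df.shape[1])  # zeros, for any number of columns
df.iloc[0] = [1, 2, 3, 4, 5]        # per-column values for this 5-column example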

Related

Using slice on DataFrameGroupBy

I need to use a slice on a DataFrameGroupBy object.
For example, assume there is a DataFrame with columns A-Z. If I want to use columns A-C I can write .loc[:, 'A':'C'], but when I'm using a DataFrameGroupBy I can't use slicing, so I have to write [['A', 'B', 'C']].
Take a look here:
from numpy import around
from numpy.random import uniform
from pandas import DataFrame
from string import ascii_lowercase
data = around(a=uniform(low=1.0, high=50.0, size=(6, len(ascii_lowercase) + 1)), decimals=3)
df = DataFrame(data=data, columns=['group'] + list(ascii_lowercase), dtype='float64')
rows, columns = df.shape
df.loc[:rows // 2, 'group'] = 1.0
df.loc[rows // 2:, 'group'] = 2.0
print(df)
abc = df.groupby(by='group')[['a', 'b', 'c']].shift(periods=1)
print(abc)
Output of df is:
group a b c ... w x y z
0 1.0 22.380 36.873 10.073 ... 26.052 38.625 48.122 33.841
1 1.0 16.702 32.160 35.018 ... 12.990 17.878 19.297 16.330
2 1.0 9.957 25.202 7.106 ... 46.500 12.932 37.401 43.134
3 2.0 42.395 40.616 24.611 ... 30.436 33.521 42.136 2.690
4 2.0 2.069 29.891 2.217 ... 20.734 12.365 9.302 47.019
5 2.0 4.208 23.955 33.966 ... 45.439 16.488 32.892 9.345
Output of abc is:
a b c
0 NaN NaN NaN
1 22.380 36.873 10.073
2 16.702 32.160 35.018
3 NaN NaN NaN
4 42.395 40.616 24.611
5 2.069 29.891 2.217
How can I avoid using [['a', 'b', 'c']]? I have 105 columns that I would need to write out there; I want to use slicing like .loc[:, 'a':'c'].
Thank you all :)
You can group by the Series df['group'], so it is possible to filter the columns before the groupby and pass only the filtered columns:
abc = df.loc[:, 'a':'c'].groupby(by=df['group']).shift(periods=1)
print(abc)
a b c
0 NaN NaN NaN
1 37.999 21.197 39.527
2 35.560 27.214 23.211
3 NaN NaN NaN
4 49.053 11.319 37.279
5 27.881 38.529 46.550
Another idea is to select the column names first and pass them to the groupby:
cols = df.loc[:, 'a':'c'].columns
abc = df.groupby(by='group')[cols].shift(periods=1)
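For reference, a self-contained sketch of that second idea on a small made-up frame (the 4-column layout is an assumption for the demo, not the question's data):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4), columns=['group', 'a', 'b', 'c'])
df['group'] = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]

cols = df.loc[:, 'a':'c'].columns                    # label-based slice of the column names
abc = df.groupby(by='group')[cols].shift(periods=1)  # the groupby sees only the sliced columns
print(abc)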

Pandas: Dynamically replace NaN values with the average of previous and next non-missing values

I have a dataframe df with NaN values and I want to dynamically replace them with the average values of previous and next non-missing values.
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
For example, A[3] is NaN so its value should be (-0.120211-0.788073)/2 = -0.454142. A[4] then should be (-0.454142-0.788073)/2 = -0.621108.
Therefore, the result dataframe should look like:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621108 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260202
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Is this a good way to deal with the missing values? I can't simply replace them with the average of each column, because my data is a time series and tends to increase over time. (The initial value may be $0 and the final value might be $100000, so the average is $50000, which can be much bigger/smaller than the NaN values.)
The logic behind your average is a geometric progression, which has a closed form:
s = df.isnull().cumsum()
# t1: the last non-missing value before each run of NaNs
t1 = df[(s == 1).shift(-1).fillna(False)].stack().reset_index(level=0, drop=True)
# t2: the first non-missing value after each run of NaNs
# (DataFrame.lookup was removed in pandas 2.0; this ran on older versions)
t2 = df.lookup(s.idxmax() + 1, s.idxmax().index)
df.fillna(t1 / (2**s) + t2 * (1 - 0.5**s))
Out[212]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621107 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260201
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Explanation (x is the last value before the gap, y the first value after it):
1st NaN: x/2 + y/2 = 1st
2nd NaN: 1st/2 + y/2 = 2nd
3rd NaN: 2nd/2 + y/2 = 3rd
Unrolling the recurrence, the nth NaN is x/(2**n) + (y/2)*(1 - (1/2)**n)/(1 - 1/2) = x/(2**n) + y*(1 - (1/2)**n); this is the key.
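Plugging the question's column A numbers into that closed form reproduces the expected output (values taken from the question):
x, y = -0.120211, -0.788073           # A[2] and A[5], the values surrounding the gap
for n in (1, 2):                      # A[3] is the 1st NaN in the gap, A[4] the 2nd
    print(x / 2**n + y * (1 - 0.5**n))
# -0.454142   -> A[3]
# -0.6211075  -> A[4] (rounds to -0.621108)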
Got a similar problem. The following code worked for me:
def fill_nan_with_mean_from_prev_and_next(df):
    # indices of rows that contain at least one NaN
    nan_rows = df.isnull().any(axis=1).to_numpy().nonzero()[0]
    null_df = df.isnull()
    for row in nan_rows:
        for col in range(df.shape[1]):
            if null_df.iloc[row, col]:
                # mean of the values just above and just below
                df.iloc[row, col] = (df.iloc[row - 1, col] + df.iloc[row + 1, col]) / 2
    return df
Maybe it helps someone too.
As Ben.T has mentioned above, if you have another group of NaNs in the same column, you can consider this lazy solution :)
for column in df:
    for ind, row in df[[column]].iterrows():
        if not np.isnan(row[column]):
            previous = row[column]
        else:
            # scan forward to the next non-missing value in this column
            indx = ind + 1
            while np.isnan(df.loc[indx, column]):
                indx += 1
            nxt = df.loc[indx, column]
            previous = df.loc[ind, column] = (previous + nxt) / 2
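Aside: the recursive halving above is not the same rule as linear interpolation, but if plain linear interpolation between the surrounding values is acceptable, pandas has it built in:
df_linear = df.interpolate(method='linear')  # different numbers than the recursive average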

Python Pandas fillna doesn't work in for loop?

Given a set up such as below:
import pandas as pd
import numpy as np
#Create random number dataframes
df1 = pd.DataFrame(np.random.rand(10,4))
df2 = pd.DataFrame(np.random.rand(10,4))
df3 = pd.DataFrame(np.random.rand(10,4))
#Create list of dataframes
data_frame_list = [df1, df2, df3]
#Introduce some NaN values
df1.iloc[4,3] = np.NaN
df2.iloc[1:4,2] = np.NaN
#Create loop to ffill any NaN values
for df in data_frame_list:
    df = df.fillna(method='ffill')
This still leaves df2 (for example) as:
0 1 2 3
0 0.946601 0.492957 0.688421 0.582571
1 0.365173 0.507617 NaN 0.997909
2 0.185005 0.496989 NaN 0.962120
3 0.278633 0.515227 NaN 0.868952
4 0.346495 0.779571 0.376018 0.750900
5 0.384307 0.594381 0.741655 0.510144
6 0.499180 0.885632 0.13413 0.196010
7 0.245445 0.771402 0.371148 0.222618
8 0.564510 0.487644 0.121945 0.095932
9 0.401214 0.282698 0.0181196 0.689916
Although the individual line of code:
df2 = df2.fillna(method='ffill')
does work. I thought the issue might be due to the way I was naming variables, so I tried globals()[df], but this didn't seem to work either.
Wondering if it is possible to do a ffill of an entire dataframe in a for loop, or am I going wrong somewhere in my approach?
No, it unfortunately does not. You are calling fillna not in place, which generates a copy that you then reassign to the loop variable df. Reassigning that variable does not change the contents of the list.
If you want to do that, iterate over the index or use a list comprehension.
data_frame_list = [df.ffill() for df in data_frame_list]
Or,
for i in range(len(data_frame_list)):
    data_frame_list[i].ffill(inplace=True)
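To see why the loop in the question had no effect, a minimal sketch contrasting rebinding with in-place mutation (toy frames, assumed for the demo):
frames = [pd.DataFrame({'x': [1.0, None]}), pd.DataFrame({'x': [None, 2.0]})]
for f in frames:
    f = f.ffill()            # rebinds the local name f only; the list is untouched
for f in frames:
    f.ffill(inplace=True)    # mutates the object the list element refers to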
You can change the DataFrames inside the list with ffill and the parameter inplace=True; since df1 - df3 refer to the same objects, they are modified as well:
data_frame_list = [df1, df2, df3]
for df in data_frame_list:
    df.ffill(inplace=True)
print (data_frame_list)
[ 0 1 2 3
0 0.506726 0.057531 0.627580 0.132553
1 0.131085 0.788544 0.506686 0.412826
2 0.578009 0.488174 0.335964 0.140816
3 0.891442 0.086312 0.847512 0.529616
4 0.550261 0.848461 0.158998 0.529616
5 0.817808 0.977898 0.933133 0.310414
6 0.481331 0.382784 0.874249 0.363505
7 0.384864 0.035155 0.634643 0.009076
8 0.197091 0.880822 0.002330 0.109501
9 0.623105 0.999237 0.567151 0.487938, 0 1 2 3
0 0.104856 0.525416 0.284066 0.658453
1 0.989523 0.644251 0.284066 0.141395
2 0.488099 0.167418 0.284066 0.097982
3 0.930415 0.486878 0.284066 0.192273
4 0.210032 0.244598 0.175200 0.367130
5 0.981763 0.285865 0.979590 0.924292
6 0.631067 0.119238 0.855842 0.782623
7 0.815908 0.575624 0.037598 0.532883
8 0.346577 0.329280 0.606794 0.825932
9 0.273021 0.503340 0.828568 0.429792, 0 1 2 3
0 0.491665 0.752531 0.780970 0.524148
1 0.635208 0.283928 0.821345 0.874243
2 0.454211 0.622611 0.267682 0.726456
3 0.379144 0.345580 0.694614 0.585782
4 0.844209 0.662073 0.590640 0.612480
5 0.258679 0.413567 0.797383 0.431819
6 0.034473 0.581294 0.282111 0.856725
7 0.352072 0.801542 0.862749 0.000285
8 0.793939 0.297286 0.441013 0.294635
9 0.841181 0.804839 0.311352 0.171094]
Or you can concat:
df = pd.concat([df1, df2, df3], keys=['df1', 'df2', 'df3'])
[x for _, x in df.groupby(level=0).ffill().groupby(level=0)]

Pandas dataframe total row

I have a dataframe, something like:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
and I would like to add a 'total' row to the end of dataframe:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 total 18 9.47
I've tried to use the sum command, but I end up with a Series, which, although I can convert it back to a DataFrame, doesn't maintain the data types:
tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo object
bar object
qux object
I would like to maintain the data types from the original data frame as I need to apply other operations to the total row, something like:
baz = 2*tot_row['qux'] + 3*tot_row['bar']
Update June 2022
DataFrame.append is now deprecated. You could use pd.concat instead, but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!
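A minimal sketch of the concat route, assuming the question's frame (the non-numeric foo column comes back as NaN in the totals row):
totals = df.sum(numeric_only=True).rename('total').to_frame().T
out = pd.concat([df, totals])  # df itself stays untouched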
Append a totals row with
df.append(df.sum(numeric_only=True), ignore_index=True)
The conversion is necessary only if you have a column of strings or objects.
It's a bit of a fragile solution, so I'd recommend sticking to operations on the dataframe instead, e.g.:
baz = 2*df['qux'].sum() + 3*df['bar'].sum()
df.loc["Total"] = df.sum()
works for me and I find it easier to remember. Am I missing something?
Probably wasn't possible in earlier versions.
I'd actually like to add the total row only temporarily though.
Adding it permanently is good for display but makes it a hassle in further calculations.
Just found
df.append(df.sum().rename('Total'))
This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.
New Method
To get both row and column total:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})
df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)
print(df)
a b c Row_Total
0 10.0 100.0 a 110.0
1 20.0 200.0 b 220.0
Column_Total 30.0 300.0 NaN 330.0
Use DataFrame.pivot_table with margins=True:
import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))
Original df:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:
df.pivot_table(index='foo',
               margins=True,
               margins_name='total',  # defaults to 'All'
               aggfunc=sum)
Voilà!
bar qux
foo
a 1 3.14
b 3 2.72
c 2 1.62
d 9 1.41
e 3 0.58
total 18 9.47
Alternative way (verified on Pandas 0.18.1):
import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)
Result:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 tot 18 9.47
Building on JMZ's answer
df.append(df.sum(numeric_only=True), ignore_index=True)
if you want to continue using your current index you can name the sum series using .rename() as follows:
df.append(df.sum().rename('Total'))
This will add a row at the bottom of the table.
This is the way that I do it, by transposing and using the assign method in combination with a lambda function. It makes it simple for me.
df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T
Building on answer from Matthias Kauer.
To add row total:
df.loc["Row_Total"] = df.sum()
To add column total,
df.loc[:,"Column_Total"] = df.sum(axis=1)
New method [September 2022]
TL;DR:
Just use
df.style.concat(df.agg(['sum']).style)
for a solution that won't change your dataframe, works even if you have a "sum" in your index, and can be styled!
Explanation
In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show the total (or any other statistics), because it is not changing the original dataframe, and works even if you have an index named "sum" in your original dataframe.
For example:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)
and it will return a formatted table that renders in Jupyter (the styled HTML output is not reproduced here).
Styling
With a little longer code, you can even make the last row look different:
df.style.concat(
    df.agg(['sum']).style
      .set_properties(**{'background-color': 'yellow'})
)
to get a totals row highlighted in yellow (again, the styled output is not reproduced here).
See other ways to style (such as bold font or table lines) in the docs.
The following helped me add a column total and a row total to a dataframe.
Assume dft1 is your original dataframe; now add a column total and a row total with the following steps.
from io import StringIO
import pandas as pd
#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")
#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")
## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)
## add a row total to dft1 with the following steps
sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe
dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total
dft1.append(dft1_sum) # append the row to dft1
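On pandas 2.0 and later, where DataFrame.append has been removed, the last step can be written with concat instead; a drop-in sketch:
dft1 = pd.concat([dft1, dft1_sum])  # equivalent of dft1.append(dft1_sum)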
Actually, all the proposed solutions render the original DataFrame unusable for any further analysis and can invalidate subsequent computations, which is easy to overlook and could lead to false results.
This is because you add a row to the data, which Pandas cannot differentiate from an additional row of data.
Example:
import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df
df.describe()
yields

   0
0  1
1  5
2  6
3  8
4  9

and

             0
count        5
mean       5.8
std    3.11448
min          1
25%          5
50%          6
75%          8
max          9
After
df.loc['Totals']= df.sum(numeric_only=True, axis=0)
the dataframe looks like this
         0
0        1
1        5
2        6
3        8
4        9
Totals  29
This looks nice, but the new row is treated as if it were an additional data item, so df.describe() will produce false results:

             0
count        6
mean   9.66667
std    9.87252
min          1
25%       5.25
50%          7
75%       8.75
max         29
So: watch out! Apply this only after doing all other analyses of the data, or work on a copy of the DataFrame!
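One way to follow that advice, as a hedged sketch: run the analysis on the untouched frame and add the totals row to a copy used only for display.
stats = df.describe()         # analyses on the original, still-clean data
display_df = df.copy()        # only the display copy gets the extra row
display_df.loc['Totals'] = df.sum(numeric_only=True, axis=0)
print(display_df)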
When the "totals" need to be added to an index column:
totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)
e.g.
(Pdb) df
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200 67412.0 368733992.0 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000 85380.0 692782132.0 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200 67412.0 379484173.0 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200 85392.0 328063972.0 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800 67292.0 383487021.0 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600 112309.0 379483824.0 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600 664144.0 358486985.0 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400 67300.0 593141462.0 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800 215002028.0 327493141.0 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800 202248016.0 321657935.0 2.684668e+08 1.865470e+07 9.632590e+13
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
0 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) df.append(totals)
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200.0 67412.0 3.687340e+08 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000.0 85380.0 6.927821e+08 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200.0 67412.0 3.794842e+08 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200.0 85392.0 3.280640e+08 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800.0 67292.0 3.834870e+08 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600.0 112309.0 3.794838e+08 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600.0 664144.0 3.584870e+08 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400.0 67300.0 5.931415e+08 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800.0 215002028.0 3.274931e+08 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800.0 202248016.0 3.216579e+08 2.684668e+08 1.865470e+07 9.632590e+13
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
Since I generally want to do this at the very end so as to avoid breaking the integrity of the dataframe (right before printing), I created a summary_rows_cols method which returns a printable dataframe:
def summary_rows_cols(df: pd.DataFrame,
                      column_sum: bool = False,
                      column_avg: bool = False,
                      column_median: bool = False,
                      row_sum: bool = False,
                      row_avg: bool = False,
                      row_median: bool = False) -> pd.DataFrame:
    ret = df.copy()
    # summary rows appended at the bottom
    if column_sum: ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
    if column_avg: ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
    if column_median: ret.loc['Median'] = df.median(numeric_only=True, axis=0)
    # summary columns appended at the right
    if row_sum: ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
    if row_avg: ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
    if row_median: ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)
    ret.fillna('-', inplace=True)
    return ret
This allows me to enter a generic (numeric) df and get a summarized output such as:
a b c Sum Median
0 1 4 7 12 4
1 2 5 8 15 5
2 3 6 9 18 6
Sum 6 15 24 - -
from:
data = {
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
}
df = pd.DataFrame(data)
printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)

Pandas: df.set_value() method erases / resets column names of MultiIndex

I am writing an application that makes use of pandas (version 0.10.1) to store the underlying data model as a (3-level) MultiIndex'ed DataFrame. The model is a line spectrum, and the top level of the index is the atomic transition.
A simple dataframe could look like this:
Pos Sigma Ampl Line center Identifier
H-alpha-6697.6 30-30 Comp2 -3.600 0.774000 33.058000 6699.5 b
Comp3 3.538 2.153000 28.054000 6699.5 c
Contin NaN NaN 0.000000 NaN NaN
Comp4 1.384 0.921000 37.504000 6699.5 d
Comp1 -2.124 1.977000 69.166000 6699.5 a
31-31 Comp2 -3.292 0.884603 49.813423 6699.5 b
Comp3 3.600 2.299000 19.999000 6699.5 c
Contin NaN NaN 0.000000 NaN NaN
Comp4 1.692 1.009000 22.222000 6699.5 d
Comp1 -1.262 2.534000 68.002000 6699.5 a
At some point, I need to be able to create a different transition, e.g. H-beta, using H-alpha as a template. I would ideally do this by something like df.ix['H-beta-wavelength'] = df.ix['H-alpha-6697.6'], but this is not possible to do. So instead, I tried following this example: Prepend a level to a pandas MultiIndex
However, the example above requires the .names of the MultiIndex levels to be set in order to reorder them. The names attribute is set when initializing the dataframe, but while building it I rely quite extensively on the set_value() method, and doing so destroys the names attribute - or rather sets it to [None, None, None].
Example:
In [68]: df
Out[68]:
Pos Sigma Ampl Line center Identifier
Transition Rows Component
Center: 6699.5 26-26 Comp2 -3.846 0.657 15.2740 6699.5 b
Comp3 2.924 1.449 31.3930 6699.5 c
Contin NaN NaN 0.0000 NaN NaN
Comp4 8.030 1.009 7.0831 6699.5 d
Comp1 -1.816 2.153 50.2750 6699.5 a
In [69]: df.set_value(('Center: 5044.3', '26-26', 'Comp1'), 'Sigma', 2.457)
Out[69]:
Pos Sigma Ampl Line center Identifier
Center: 6699.5 26-26 Comp2 -3.846 0.657 15.2740 6699.5 b
Comp3 2.924 1.449 31.3930 6699.5 c
Contin NaN NaN 0.0000 NaN NaN
Comp4 8.030 1.009 7.0831 6699.5 d
Comp1 -1.816 2.153 50.2750 6699.5 a
Center: 5044.3 26-26 Comp1 NaN 2.457 NaN NaN NaN
Of course, this makes it quite hard to use the names for reordering the levels of the MultiIndex. Is there a way to avoid this, short of brute-force re-setting the names after each run of set_value()?
EDIT: simpler, reproducible example.
Here is an IPython session recreating the index.names problem with a somewhat simpler example. It also shows that this is possibly a bug that goes beyond index.names, as it seems to change the index.lexsort_depth from 3 to 0. Missing numbers in the prompt are just unnecessary views of the dataframe.
I believe that one must choose secondary and/or tertiary indices that already exist like I have done below in order to reproduce it.
In [4]: idx = pd.MultiIndex.from_arrays(
   ...:     [['Hans']*4 + ['Grethe']*4, ['1', '1', '2', '2']*2, ['a', 'b']*4],
   ...:     names=['Name', 'Number', 'Letter'])
In [5]: df = pd.DataFrame(
   ...:     np.random.random((8, 3)),
   ...:     columns=['one', 'two', 'three'],
   ...:     index=idx)
In [6]: df
Out[6]:
one two three
Name Number Letter
Hans 1 a 0.803566 0.434574 0.805976
b 0.655322 0.208469 0.989559
2 a 0.893952 0.380358 0.173764
b 0.822446 0.673894 0.676573
Grethe 1 a 0.202641 0.387263 0.405296
b 0.646733 0.086953 0.882114
2 a 0.358458 0.147107 0.769586
b 0.183782 0.477863 0.601098
# To rule out another possible source of problems:
In [9]: df.unstack().drop(('Grethe', '1')).stack()
Out[9]:
one two three
Name Number Letter
Grethe 2 a 0.358458 0.147107 0.769586
b 0.183782 0.477863 0.601098
Hans 1 a 0.803566 0.434574 0.805976
b 0.655322 0.208469 0.989559
2 a 0.893952 0.380358 0.173764
b 0.822446 0.673894 0.676573
In [10]: df.set_value(('Frans', '2', 'b'), 'one', 23.)
Out[10]:
one two three
Hans 1 a 0.803566 0.434574 0.805976
b 0.655322 0.208469 0.989559
2 a 0.893952 0.380358 0.173764
b 0.822446 0.673894 0.676573
Grethe 1 a 0.202641 0.387263 0.405296
b 0.646733 0.086953 0.882114
2 a 0.358458 0.147107 0.769586
b 0.183782 0.477863 0.601098
Frans 2 b 23.000000 NaN NaN
In [11]: df = df.sortlevel(level='Name')
In [13]: df.index.lexsort_depth
Out[13]: 3
In [14]: df.set_value(('Frans', '2', 'b'), 'one', 23.).index.lexsort_depth
Out[14]: 0
Your index needs to be sorted! See docs here: http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-need-for-sortedness and these recipes may help http://pandas.pydata.org/pandas-docs/dev/cookbook.html
This is 0.10.1 as well
Heres a sorted frame
In [26]: index = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    ...:                                    names=['first', 'second'])
In [27]: df = pd.DataFrame(np.random.rand(len(index)), index=index, columns=['A'])
In [7]: df.index.lexsort_depth
Out[7]: 2
In [28]: df.set_value(('a',1),'A',1)
Out[28]:
A
first second
a 1 1.000000
2 0.136456
b 1 0.712612
2 0.818473
And if I sort by the 2nd level (so it's unsorted):
In [29]: df2 = df.sortlevel(level='second')
# this is not sorted! (well it is, just not lexsorted)
In [10]: df2.index.lexsort_depth
Out[10]: 0
In [30]: df2.set_value(('b','1'),'A',2)
Out[30]:
A
a 1 1.000000
b 1 0.712612
a 2 0.136456
b 2 0.818473
1 2.000000
So according to Andy Hayden, this is a names bug in pandas.
Hopefully a fix will come soon.
Until then, I believe the best way to do this is to do the following:
tmp = df.ix['ExistingTransition'].copy()
tmp['Transition'] = 'NewTransition'
tmp = tmp.set_index('Transition', append=True)
tmp.index = tmp.index.reorder_levels([2, 0, 1])
# ...Do whatever else needs to be done to this before applying as template...
df = df.append(tmp)
...That, or making sure that the names attribute is recreated after each run of set_value(), and then just going by the example linked in the question.
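For readers on modern pandas: set_value was deprecated in 0.21 and removed in 1.0; setting with enlargement through .loc is the replacement, and it preserves the index names. A hedged sketch using the question's labels:
df.loc[('Center: 5044.3', '26-26', 'Comp1'), 'Sigma'] = 2.457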
