Issues with adding new rows to a pandas dataframe - python

Apologies if the formatting on this is strange, it's the first time I've posted anything. I've created a multi-index data frame in Python, which works fine:
import pandas as pd
import numpy as np

arrays = [['one', 'one', 'two', 'two'],
          ['A', 'B', 'A', 'B']]
tuples = list(zip(*arrays))
mindex = pd.MultiIndex.from_tuples(tuples)
s = pd.DataFrame(data=np.random.randn(4), index=mindex, columns=['Values'])
s
This works fine, except that I think I should be able to add new rows by simply typing
s['Values'].loc[('Three', 'A')] = 1
s['Values'].loc[('Three', 'B')] = 2
This returns no error message, and I can check it has worked by entering
s['Values'].loc[('Three', 'A')]
which gives me 1, so all is as expected.
However, I can't see the 'Three' data in Jupyter notebook - if I simply type
s
then it only shows me the original one, two, A & B rows. This is probably because the new row is not in the index:
s.index
returns
MultiIndex(levels=[['one', 'two'], ['A', 'B']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Can anyone please give me a hint as to what's going on here? I'd like rows I subsequently add to appear in the index. Should I be using the .append function instead? It seems a bit cumbersome and other posts have recommended using the .loc approach above to add rows.
Thanks!

I believe you need to select the column(s) inside DataFrame.loc:
s.loc[('Three', 'A'), 'Values'] = 1
s.loc[('Three', 'B'), 'Values'] = 2
print(s)
              Values
one   A    -0.808372
      B     0.904552
two   A    -0.443619
      B     1.157234
Three A     1.000000
      B     2.000000

print(s.index)
MultiIndex(levels=[['one', 'two', 'Three'], ['A', 'B']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
This is because your solution adds values to a column (a Series), but not to the DataFrame:
s['Values'].loc[('Three', 'A')] = 1
print(s['Values'])
one    A   -0.808372
       B    0.904552
two    A   -0.443619
       B    1.157234
Three  A    1.000000
Name: Values, dtype: float64

print(s)
            Values
one A    -0.808372
    B     0.904552
two A    -0.443619
    B     1.157234
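A note on the .append idea from the question: DataFrame.append was later deprecated (and removed in pandas 2.0), so if an append-style approach is preferred over .loc assignment, a minimal sketch with pd.concat (reusing the s and 'Values' column from the question) could look like this:

# Sketch: appending rows to a MultiIndexed frame with pd.concat
# (assumes the s defined in the question; behaviour is pandas-version dependent)
new_rows = pd.DataFrame(
    data=[1, 2],
    index=pd.MultiIndex.from_tuples([('Three', 'A'), ('Three', 'B')]),
    columns=['Values'],
)
s = pd.concat([s, new_rows])

Either way, the new labels end up in s.index, so they show up when displaying s.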

How do I select a specific column in a pivot_table - Python [duplicate]

I have a pandas DataFrame with 4 columns and I want to create a new DataFrame that only has three of the columns. This question is similar to: Extracting specific columns from a data frame, but for pandas, not R. The following code does not work, raises an error, and is certainly not the idiomatic pandas way to do it.
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new = pd.DataFrame(zip(old.A, old.C, old.D)) # raises TypeError: data argument can't be an iterator
What is the idiomatic pandas way to do it?
There is a way of doing this, and it actually looks similar to R:
new = old[['A', 'C', 'D']].copy()
Here you are just selecting the columns you want from the original data frame and creating a variable for those. If you want to modify the new dataframe at all you'll probably want to use .copy() to avoid a SettingWithCopyWarning.
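To make the .copy() advice concrete, here is a minimal sketch (with the old frame from the question) of the behaviour it avoids; whether the warning actually fires depends on the pandas version:

# Sketch: assigning into a selection taken without .copy() can trigger
# SettingWithCopyWarning (version dependent); .copy() avoids it
new = old[['A', 'C', 'D']]
new['A'] = 0                        # may raise SettingWithCopyWarning
safe = old[['A', 'C', 'D']].copy()  # independent copy
safe['A'] = 0                       # no warning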
An alternative method is to use filter, which creates a copy by default:
new = old.filter(['A','B','D'], axis=1)
Finally, depending on the number of columns in your original dataframe, it might be more succinct to express this using drop (this also creates a copy by default):
new = old.drop('B', axis=1)
The easiest way is:
new = old[['A','C','D']]
Another simple way is:
new = pd.DataFrame([old.A, old.B, old.C]).transpose()
where old.column_name gives you a Series.
Make a list of all the column Series you want to retain and pass it to the DataFrame constructor; a transpose is then needed to adjust the shape.
In [14]: pd.DataFrame([old.A, old.B, old.C]).transpose()
Out[14]:
   A   B    C
0  4  10  100
1  5  20   50
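One caveat with this constructor-plus-transpose approach, shown as a sketch on a hypothetical mixed-dtype frame: the intermediate row-wise frame mixes the dtypes of every column, so the result can come back as object dtype even for originally numeric columns:

# Sketch (hypothetical data): each intermediate column mixes int and str,
# so both columns end up with object dtype after the transpose
mixed = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
print(pd.DataFrame([mixed.A, mixed.B]).transpose().dtypes)
# A    object
# B    object
# dtype: object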
Selecting columns by index:
# selected column indices: 1, 6, 7
new = old.iloc[:, [1, 6, 7]].copy()
As far as I can tell, you don't necessarily need to specify the axis when using the filter function.
new = old.filter(['A','B','D'])
returns the same dataframe as
new = old.filter(['A','B','D'], axis=1)
Generic functional form:
def select_columns(data_frame, column_names):
    # label-based selection of the requested columns
    new_frame = data_frame.loc[:, column_names]
    return new_frame
Specific to the problem above:
selected_columns = ['A', 'C', 'D']
new = select_columns(old, selected_columns)
As an alternative:
new = pd.DataFrame().assign(A=old['A'], C=old['C'], D=old['D'])
If you want to have a new data frame then:
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new = old[['A', 'C', 'D']]
You can also drop the unwanted columns from the columns index:
df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3], 'D': [4, 4]})
df[df.columns.drop(['B', 'C'])]
or
df.loc[:, df.columns.drop(['B', 'C'])]
Output:
   A  D
0  1  4
1  1  4
df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3], 'D': [4, 4]})
new = df.filter(['A','B','D'], axis=1)

assigning an alternative value to a pandas DataFrame conditional on its value

I am trying to assign alternative values to a column in a pandas DataFrame object. The condition for assigning an alternative value is that the element currently has the value zero.
This is my code snippet:
df = pd.DataFrame({'A': [0, 1, 2, 0, 0, 1, 1, 0], 'B': [1, 2, 3, 4, 1, 2, 3, 4]})
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.iloc[i]['A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
However, as it turns out, the values in these elements remain zero! The above has zero effect.
What's going on?
The original answer below works for some inputs, but it's not entirely right. Testing your code with the dataframe in your question, I found that it works, but it's not guaranteed to work with all dataframes. Here's an example where it doesn't work:
df = pd.DataFrame(np.random.randn(6,4), index=list(range(0,12,2)), columns=['A', 'B', 'C', 'D'])
This dataframe will cause your code to fail because the indices are not 0, 1, 2... as your algorithm expects, they're 0, 2, 4, ..., as defined by index=list(range(0,12,2)).
That means the values of i returned by the iterator will also be 0, 2, 4,..., so you'll get unexpected results when you try to use i-1 as a parameter to iloc.
In short, when you use for i, row in df.iterrows(): to iterate over a dataframe, i takes on the index values of the dimension you're iterating over as they're defined in the dataframe. Make sure you know what those values are when using them with offsets inside the loop.
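A quick sketch (hypothetical index values) that demonstrates this point:

# Sketch: with a non-default index, the i yielded by iterrows() is the
# index label, not the position
import numpy as np
import pandas as pd
demo = pd.DataFrame(np.random.randn(3, 2), index=[0, 2, 4], columns=['A', 'B'])
for i, row in demo.iterrows():
    print(i)  # prints 0, 2, 4 - the labels, not positions 0, 1, 2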
Original answer:
I can't figure out why your code doesn't work, but I can verify that it doesn't. It may have something to do with modifying a dataframe while iterating over it, since you can use df.iloc[1]['A'] = 0.0 to set a value outside a loop with no problems.
Try using DataFrame.at instead:
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.at[i, 'A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
This doesn't do anything to account for df.iloc[i-1] returning the last row in the dataframe, so be aware of that when the first value in column A is 0.0.
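A hedged sketch of one way to guard against that wrap-around, assuming the default 0..n-1 RangeIndex from the question:

# Sketch: skip the first row so df.iloc[i-1] never wraps to the last row
for i, row in df.iterrows():
    if row['A'] == 0.0 and i > 0:
        df.at[i, 'A'] = df.at[i-1, 'A'] + df.at[i, 'B'] - df.at[i-1, 'B']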
What about:
df = pd.DataFrame({'A': [0, 1, 2, 0, 0, 1, 1, 0], 'B': [1, 2, 3, 4, 1, 2, 3, 4]})
df['A'] = df.where(df[['A']] != 0,
                   df['A'].shift() + df['B'] - df['B'].shift(),
                   axis=0)['A']
print(df)
     A  B
0  NaN  1
1  1.0  2
2  2.0  3
3  3.0  4
4 -3.0  1
5  1.0  2
6  1.0  3
7  2.0  4
The NaN is there because there is no element prior to the first one for shift() to pull from.
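A sketch of the same computation written with an explicit boolean mask and .loc, which some readers may find easier to follow (df is recreated here so column 'A' still holds the original values):

# Sketch: same result via a boolean mask and .loc assignment
df = pd.DataFrame({'A': [0, 1, 2, 0, 0, 1, 1, 0], 'B': [1, 2, 3, 4, 1, 2, 3, 4]})
mask = df['A'] == 0
df['A'] = df['A'].astype('float')  # make room for the NaN in row 0
df.loc[mask, 'A'] = (df['A'].shift() + df['B'] - df['B'].shift())[mask]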
You are using chained indexing, which is related to the famous SettingWithCopy warning. See the SettingWithCopy section in "Modern Pandas" by Tom Augspurger.
In general this means that assignments of the form df['A']['B'] = ... are discouraged. It doesn't matter if you use a loc accessor there.
If you add print statements to your code:
for i, row in df.iterrows():
    print(df)
    if row['A'] == 0.0:
        df.iloc[i]['A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
you see strange things happening: the dataframe df is modified if and only if the first row of column 'A' is 0.
As Bill the Lizard pointed out, you need a single accessor. However, note that Bill's method has the disadvantage of providing label-based access. This may not be what you want when your dataframe is indexed differently. Then a better solution is to use loc
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.loc[df.index[i], 'A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
or iloc
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.iloc[i, df.columns.get_loc('A')] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
assuming unique index labels in the loc variant and unique column names in the iloc variant.
Note that the chained indexing occurs when setting values. Though this approach may sometimes work, it is, by the quote above, not encouraged!

How can I get the name of grouping columns from a Pandas GroupBy object?

Suppose I have the following dataframe:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Bar=[1, 2, 3, 4]))
i.e.:
   Bar Foo
0    1   A
1    2   A
2    3   B
3    4   B
Then I create a pandas.GroupBy object:
g = df.groupby('Foo')
How can I get, from g, the fact that g is grouped by a column originally named Foo?
If I do g.groups I get:
{'A': Int64Index([0, 1], dtype='int64'),
'B': Int64Index([2, 3], dtype='int64')}
That tells me the values that the Foo column takes ('A' and 'B') but not the original column name.
Now, I can just do something like:
g.first().index.name
But it seems odd that there's not an attribute of g with the group name in it, so I feel like I must be missing something. In particular, if g was grouped by multiple columns, then the above doesn't work:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Baz=['C', 'D', 'C', 'D'], Bar=[1, 2, 3, 4]))
g = df.groupby(['Foo', 'Baz'])
g.first().index.name # returns None, because it's a MultiIndex
g.first().index.names # returns ['Foo', 'Baz']
For context, I am trying to do some plotting with a grouped dataframe, and I want to be able to label each facet (which is plotting a single group) with the name of that group as well as the group label.
Is there a better way?
Query the names attribute of the groupby's grouper (a BaseGrouper object) to get a list of all grouping names:
df.groupby('Foo').grouper.names
which gives:
['Foo']
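The same attribute appears to cover the multi-key case from the question as well; note that this is a version-dependent sketch, since the grouper attribute has been deprecated in recent pandas releases:

# Sketch: grouper names for a multi-column groupby
# (uses the multi-key df from the question; .grouper is deprecated in newer pandas)
g = df.groupby(['Foo', 'Baz'])
print(g.grouper.names)  # ['Foo', 'Baz']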

Semantics of DataFrame groupby method

I find the behavior of the groupby method on a DataFrame object unexpected.
Let me explain with an example.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
data1 = df['data1']
data1
# Out[14]:
# 0 1.989430
# 1 -0.250694
# 2 -0.448550
# 3 0.776318
# 4 -1.843558
# Name: data1, dtype: float64
data1 does not have the 'key1' column anymore.
So I would expect to get an error if I applied the following operation:
grouped = data1.groupby(df['key1'])
But I don't, and I can further apply the mean method on grouped to get the expected result.
grouped.mean()
# Out[13]:
# key1
# a -0.034941
# b 0.163884
# Name: data1, dtype: float64
However, the above operation does create a group using the 'key1' column of df.
How can this happen? Does the interpreter store information about the originating DataFrame (df in this case) with the created Series (data1 in this case)?
Thank you.
It is only syntactic sugar; check the docs on selecting columns (Series) separately:
This is mainly syntactic sugar for the alternative and much more verbose
s = df['data1'].groupby(df['key1']).mean()
print(s)
key1
a    0.565292
b    0.106360
Name: data1, dtype: float64
Although the grouping columns are typically from the same dataframe or series, they don't have to be.
Your statement data1.groupby(df['key1']) is equivalent to data1.groupby(['a', 'a', 'b', 'b', 'a']). In fact, you can inspect the actual groups:
>>> data1.groupby(['a', 'a', 'b', 'b', 'a']).groups
{'a': [0, 1, 4], 'b': [2, 3]}
This means that your groupby on data1 will have a group a using rows 0, 1, and 4 from data1 and a group b using rows 2 and 3.
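To underline that the grouping key is just an aligned sequence, here is a small sketch (hypothetical values) that groups a Series by a plain list which was never part of any dataframe:

# Sketch (hypothetical data): grouping by an external list of the same length
import pandas as pd
vals = pd.Series([10, 20, 30, 40, 50])
print(vals.groupby(['x', 'x', 'y', 'y', 'x']).sum())
# x    80
# y    70
# dtype: int64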

subsetting hierarchical data in pandas

I am trying to subset hierarchical data that has two row index levels.
Say I have data in hdf:
import numpy as np
from pandas import DataFrame, MultiIndex

index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                           ['one', 'two', 'three']],
                   labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                           [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]])
hdf = DataFrame(np.random.randn(10, 3), index=index,
                columns=['A', 'B', 'C'])
hdf
And I wish to subset so that I see only foo and qux, only sub-row two, and only columns A and C.
I can do this in two steps as follows:
sub1 = hdf.loc[['foo', 'qux'], ['A', 'C']]
sub1.xs('two', level=1)
Is there a single-step way to do this?
Thanks!
In [125]: hdf[hdf.index.get_level_values(0).isin(['foo', 'qux'])
              & (hdf.index.get_level_values(1) == 'two')][['A', 'C']]
Out[125]:
                A         C
foo two -0.113320 -1.215848
qux two  0.953584  0.134363
This is more complicated, but it works better if you have many different values to choose from in level one.
It doesn't look the nicest, but use tuples to get the rows you want and then square brackets to select the columns.
In [36]: hdf.loc[[('foo', 'two'), ('qux', 'two')]][['A', 'C']]
Out[36]:
                A         C
foo two -0.356165  0.565022
qux two -0.701186  0.026532
In older pandas versions, loc could be swapped out for ix here, but ix has since been deprecated and removed.
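In more recent pandas versions, pd.IndexSlice also gives a single-step spelling of the same selection; a sketch, assuming the hdf defined in the question:

# Sketch: single-step selection with pd.IndexSlice (newer pandas)
import pandas as pd
idx = pd.IndexSlice
hdf.loc[idx[['foo', 'qux'], 'two'], ['A', 'C']]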
itertools to the rescue:
>>> from itertools import product
>>>
>>> def _p(*iterables):
...     return list(product(*iterables))
...
>>> hdf.loc[_p(('foo', 'qux'), ('two',)), ['A', 'C']]
                A         C
foo two  1.125401  1.389568
qux two  1.051455 -0.271256
Thanks everyone for your help. I also hit upon this solution:
hdf.loc[['foo', 'qux'], ['A', 'C']].xs('two', level=1)
