Ambiguity in Pandas Dataframe / Numpy Array "axis" definition - python

I've been very confused about how python axes are defined, and whether they refer to a DataFrame's rows or columns. Consider the code below:
>>> df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]], columns=["col1", "col2", "col3", "col4"])
>>> df
col1 col2 col3 col4
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
So if we call df.mean(axis=1), we'll get a mean across the rows:
>>> df.mean(axis=1)
0 1
1 2
2 3
However, if we call df.drop(name, axis=1), we actually drop a column, not a row:
>>> df.drop("col4", axis=1)
col1 col2 col3
0 1 1 1
1 2 2 2
2 3 3 3
Can someone help me understand what is meant by an "axis" in pandas/numpy/scipy?
A side note, DataFrame.mean just might be defined wrong. It says in the documentation for DataFrame.mean that axis=1 is supposed to mean a mean over the columns, not the rows...

It's perhaps simplest to remember it as 0=down and 1=across.
This means:
Use axis=0 to apply a method down each column, or to the row labels (the index).
Use axis=1 to apply a method across each row, or to the column labels.
Here's a picture to show the parts of a DataFrame that each axis refers to:
It's also useful to remember that Pandas follows NumPy's use of the word axis. The usage is explained in NumPy's glossary of terms:
Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). [my emphasis]
So, concerning the method in the question, df.mean(axis=1), seems to be correctly defined. It takes the mean of entries horizontally across columns, that is, along each individual row. On the other hand, df.mean(axis=0) would be an operation acting vertically downwards across rows.
Similarly, df.drop(name, axis=1) refers to an action on column labels, because they intuitively go across the horizontal axis. Specifying axis=0 would make the method act on rows instead.

There are already proper answers, but I give you another example with > 2 dimensions.
The parameter axis means axis to be changed.
For example, consider that there is a dataframe with dimension a x b x c.
df.mean(axis=1) returns a dataframe with dimenstion a x 1 x c.
df.drop("col4", axis=1) returns a dataframe with dimension a x (b-1) x c.
Here, axis=1 means the second axis which is b, so b value will be changed in these examples.

Another way to explain:
// Not realistic but ideal for understanding the axis parameter
df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]],
columns=["idx1", "idx2", "idx3", "idx4"],
index=["idx1", "idx2", "idx3"]
)
---------------------------------------1
| idx1 idx2 idx3 idx4
| idx1 1 1 1 1
| idx2 2 2 2 2
| idx3 3 3 3 3
0
About df.drop (axis means the position)
A: I wanna remove idx3.
B: **Which one**? // typing while waiting response: df.drop("idx3",
A: The one which is on axis 1
B: OK then it is >> df.drop("idx3", axis=1)
// Result
---------------------------------------1
| idx1 idx2 idx4
| idx1 1 1 1
| idx2 2 2 2
| idx3 3 3 3
0
About df.apply (axis means direction)
A: I wanna apply sum.
B: Which direction? // typing while waiting response: df.apply(lambda x: x.sum(),
A: The one which is on *parallel to axis 0*
B: OK then it is >> df.apply(lambda x: x.sum(), axis=0)
// Result
idx1 6
idx2 6
idx3 6
idx4 6

It should be more widely known that the string aliases 'index' and 'columns' can be used in place of the integers 0/1. The aliases are much more explicit and help me remember how the calculations take place. Another alias for 'index' is 'rows'.
When axis='index' is used, then the calculations happen down the columns, which is confusing. But, I remember it as getting a result that is the same size as another row.
Let's get some data on the screen to see what I am talking about:
df = pd.DataFrame(np.random.rand(10, 4), columns=list('abcd'))
a b c d
0 0.990730 0.567822 0.318174 0.122410
1 0.144962 0.718574 0.580569 0.582278
2 0.477151 0.907692 0.186276 0.342724
3 0.561043 0.122771 0.206819 0.904330
4 0.427413 0.186807 0.870504 0.878632
5 0.795392 0.658958 0.666026 0.262191
6 0.831404 0.011082 0.299811 0.906880
7 0.749729 0.564900 0.181627 0.211961
8 0.528308 0.394107 0.734904 0.961356
9 0.120508 0.656848 0.055749 0.290897
When we want to take the mean of all the columns, we use axis='index' to get the following:
df.mean(axis='index')
a 0.562664
b 0.478956
c 0.410046
d 0.546366
dtype: float64
The same result would be gotten by:
df.mean() # default is axis=0
df.mean(axis=0)
df.mean(axis='rows')
To get use an operation left to right on the rows, use axis='columns'. I remember it by thinking that an additional column may be added to my DataFrame:
df.mean(axis='columns')
0 0.499784
1 0.506596
2 0.478461
3 0.448741
4 0.590839
5 0.595642
6 0.512294
7 0.427054
8 0.654669
9 0.281000
dtype: float64
The same result would be gotten by:
df.mean(axis=1)
Add a new row with axis=0/index/rows
Let's use these results to add additional rows or columns to complete the explanation. So, whenever using axis = 0/index/rows, its like getting a new row of the DataFrame. Let's add a row:
df.append(df.mean(axis='rows'), ignore_index=True)
a b c d
0 0.990730 0.567822 0.318174 0.122410
1 0.144962 0.718574 0.580569 0.582278
2 0.477151 0.907692 0.186276 0.342724
3 0.561043 0.122771 0.206819 0.904330
4 0.427413 0.186807 0.870504 0.878632
5 0.795392 0.658958 0.666026 0.262191
6 0.831404 0.011082 0.299811 0.906880
7 0.749729 0.564900 0.181627 0.211961
8 0.528308 0.394107 0.734904 0.961356
9 0.120508 0.656848 0.055749 0.290897
10 0.562664 0.478956 0.410046 0.546366
Add a new column with axis=1/columns
Similarly, when axis=1/columns it will create data that can be easily made into its own column:
df.assign(e=df.mean(axis='columns'))
a b c d e
0 0.990730 0.567822 0.318174 0.122410 0.499784
1 0.144962 0.718574 0.580569 0.582278 0.506596
2 0.477151 0.907692 0.186276 0.342724 0.478461
3 0.561043 0.122771 0.206819 0.904330 0.448741
4 0.427413 0.186807 0.870504 0.878632 0.590839
5 0.795392 0.658958 0.666026 0.262191 0.595642
6 0.831404 0.011082 0.299811 0.906880 0.512294
7 0.749729 0.564900 0.181627 0.211961 0.427054
8 0.528308 0.394107 0.734904 0.961356 0.654669
9 0.120508 0.656848 0.055749 0.290897 0.281000
It appears that you can see all the aliases with the following private variables:
df._AXIS_ALIASES
{'rows': 0}
df._AXIS_NUMBERS
{'columns': 1, 'index': 0}
df._AXIS_NAMES
{0: 'index', 1: 'columns'}

When axis='rows' or axis=0, it means access elements in the direction of the rows, up to down. If applying sum along axis=0, it will give us totals of each column.
When axis='columns' or axis=1, it means access elements in the direction of the columns, left to right. If applying sum along axis=1, we will get totals of each row.
Still confusing! But the above makes it a bit easier for me.

I remembered by the change of dimension, if axis=0, row changes, column unchanged, and if axis=1, column changes, row unchanged.

Related

Pivot table to "tidy" data frame in Pandas

I have an array of numbers (I think the format makes it a pivot table) that I want to turn into a "tidy" data frame. For example, I start with variable 1 down the left, variable 2 across the top, and the value of interest in the middle, something like this:
X Y
A 1 2
B 3 4
I want to turn that into a tidy data frame like this:
V1 V2 value
A X 1
A Y 2
B X 3
B Y 4
The row and column order don't matter to me, so the following is totally acceptable:
value V1 V2
2 A Y
4 B Y
3 B X
1 A X
For my first go at this, which was able to get me the correct final answer, I looped over the rows and columns. This was terribly slow, and I suspected that some machinery in Pandas would make it go faster.
It seems that melt is close to the magic I seek, but it doesn't get me all the way there. That first array turns into this:
V2 value
0 X 1
1 X 2
2 Y 3
3 Y 4
It gets rid of my V1 variable!
Nothing is special about melt, so I will be happy to read answers that use other approaches, particularly if melt is not much faster than my nested loops and another solution is. Nonetheless, how can I go from that array to the kind of tidy data frame I want as the output?
Example dataframe:
df = pd.DataFrame({"X":[1,3], "Y":[2,4]},index=["A","B"])
Use DataFrame.reset_index with DataFrame.rename_axis and then DataFrame.melt. If you want order columns we could use DataFrame.reindex.
new_df = (df.rename_axis(index = 'V1')
.reset_index()
.melt('V1',var_name='V2')
.reindex(columns = ['value','V1','V2']))
print(new_df)
Another approach DataFrame.stack:
new_df = (df.stack()
.rename_axis(index = ['V1','V2'])
.rename('value')
.reset_index()
.reindex(columns = ['value','V1','V2']))
print(new_df)
value V1 V2
0 1 A X
1 3 B X
2 2 A Y
3 4 B Y
to names names there is another alternative like commenting #Scott Boston in the comments
Melt is a good approach, but it doesn't seem to play nicely with identifying the results by index. You can reset the index first to move it to its own column, then use that column as the id col.
test = pd.DataFrame([[1,2],[3,4]], columns=['X', 'Y'], index=['A', 'B'])
X Y
A 1 2
B 3 4
test = test.reset_index()
index X Y
0 A 1 2
1 B 3 4
test.melt('index',['X', 'Y'], 'prev cols')
index prev cols value
0 A X 1
1 B X 3
2 A Y 2
3 B Y 4

How to convert a Pandas series into a Dataframe for merging [duplicate]

If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge on DataFrame and Series as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
You could construct a dataframe from the series and then merge with the dataframe.
So you specify the data as the values but multiply them by the length, set the columns to the index and set params for left_index and right_index to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT for the situation where you want the index of your constructed df from the series to use the index of the df then you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that the indices match the length.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
columns=s.index,
index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
Nowadays, much simpler and concise solution can achieve the same task. Leveraging the capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we used DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda function simply just returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame is well documented in the official document as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official document also includes example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
If I could suggest setting up your dataframes like this (auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using shape() to return the number of rows from df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display (df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int such as in your example. If the column you specify isn't in the df, then pandas will create a new column with the name you specify. So after your dataframe is constructed, (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to make it do this for all the elements in a list of tuples, or keys and values in a dictionary depending on how you have your real data stored.
If df is a pandas.DataFrame then df['new_col']= Series list_object of length len(df) will add the or Series list_object as a column named 'new_col'. df['new_col']= scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col']= [scalar]*len(df)
So a two-line code serves the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6

Does Pandas compare indexes when adding a new column?

When I add a column using apply on other columns, does panda store the result of this new column in the same row as the one used for the computation. If not how can I make it do it.
The reason why I'am not completely confident is following example
df = pd.DataFrame({'index':[0,1,2,3,4], 'value':[1,2,3,4,5]})
df2 = pd.DataFrame({'index':[0,2,1,3,5], 'value':[1,2,3,4,5]})
df['second_value'] = df['value'].apply(lambda x: x**2)
df['third_value'] = df2['value'].apply(lambda x: x**2)
df
The results this yields is
index value second_value third_value
0 1 1 1
1 2 4 4
2 3 9 9
3 4 16 16
4 5 25 25
So what I see here is that pandas only checks for the order. So can it happen that a DataFrame is sorted at a random moment which could mess up or can I assume that the order is always preserved when I perform
df['new_value'] = df['old_value'].apply(...)
?
EDIT: In my original code snippet I forgot to set the index and that is actually where I was doing wrong. So I had df.set_index('index') and df2.set_index('index') before using apply. the problem is that this method creates a copy with the said index. So either you asign these to the original dataframes df and df2 or even better you add inline=True in the method call in order to not create a copy and set the index in the given dataframe.
That's not how you define an index. You need to pass a list/iterable to the index keyword argument when calling the pd.DataFrame constructor.
df = pd.DataFrame({'value' : [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'value' : [1, 2, 3, 4, 5]}, index=[0, 2, 1, 3, 4])
df['second'] = df['value'] ** 2
df['third'] = df2['value'] ** 2
df
value second third
0 1 1 1
1 2 4 9 # note these
2 3 9 4 # two rows
3 4 16 16
4 5 25 25
The assignment operations are always index aligned.

Delete zeros from a pandas dataframe

This question was asked in multiple other posts but I could not get any of the methods to work. This is my dataframe:
df = pd.DataFrame([[1,2,3,4.5],[1,2,0,4,5]])
I would like to know how I can either:
1) Delete rows that contain any/all zeros
2) Delete columns that contain any/all zeros
In order to delete rows that contain any zeros, this worked:
df2 = df[~(df == 0).any(axis=1)]
df2 = df[~(df == 0).all(axis=1)]
But I cannot get this to work column wise. I tried to set axis=0 but that gives me this error:
__main__:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Any suggestions?
You're going to need loc for this:
df
0 1 2 3 4
0 1 2 3 4 5
1 1 2 0 4 5
df.loc[:, ~(df == 0).any(0)] # notice the :, this means we are indexing on the columns now, not the rows
0 1 3 4
0 1 2 4 5
1 1 2 4 5
Direct indexing defaults to indexing on the rows. You are trying to index a dataframe with only two rows using [0, 1, 3, 4], so pandas is warning you about that.

Keeping the N first occurrences of

The following code will (of course) keep only the first occurrence of 'Item1' in rows sorted by 'Date'. Any suggestions as to how I could get it to keep, say the first 5 occurrences?
## Sort the dataframe by Date and keep only the earliest appearance of 'Item1'
## drop_duplicates considers the column 'Date' and keeps only first occurence
coocdates = data.sort('Date').drop_duplicates(cols=['Item1'])
You want to use head, either on the dataframe itself or on the groupby:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [1, 6], [2, 8]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
1 1 4
2 1 6
3 2 8
In [13]: df.head(2) # the first two rows
Out[13]:
A B
0 1 2
1 1 4
In [14]: df.groupby('A').head(2) # the first two rows in each group
Out[14]:
A B
0 1 2
1 1 4
3 2 8
Note: the behaviour of groupby's head was changed in 0.14 (it didn't act like a filter - but modified the index), so you will have to reset index if using an earlier versions.
Use groupby() and nth():
According to Pandas docs, nth()
Take the nth row from each group if n is an int, or a subset of rows if n is a list of ints.
Therefore all you need is:
df.groupby('Date').nth([0,1,2,3,4]).reset_index(drop=False, inplace=True)

Categories

Resources