When I add a column using apply on other columns, does pandas store the result of this new column in the same row as the one used for the computation? If not, how can I make it do so?
The reason I'm not completely confident is the following example:
df = pd.DataFrame({'index':[0,1,2,3,4], 'value':[1,2,3,4,5]})
df2 = pd.DataFrame({'index':[0,2,1,3,5], 'value':[1,2,3,4,5]})
df['second_value'] = df['value'].apply(lambda x: x**2)
df['third_value'] = df2['value'].apply(lambda x: x**2)
df
The result this yields is
index value second_value third_value
0 1 1 1
1 2 4 4
2 3 9 9
3 4 16 16
4 5 25 25
So what I see here is that pandas only checks the order. Can it happen that a DataFrame gets re-sorted at some arbitrary moment, which would mess this up, or can I assume that the order is always preserved when I perform
df['new_value'] = df['old_value'].apply(...)
?
EDIT: In my original code snippet I forgot to set the index, and that is actually where I was going wrong. I had df.set_index('index') and df2.set_index('index') before using apply. The problem is that this method returns a copy with the given index. So either you assign the results back to the original dataframes df and df2, or you pass inplace=True in the method call so that no copy is created and the index is set on the given dataframe.
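A minimal sketch of the pitfall described in the edit, assuming pandas is imported as pd and reusing the frames from the question. The helper name still_default is only for illustration. It shows that set_index leaves the frame untouched unless the result is assigned back or inplace=True is passed, and that the later assignment then aligns on the index labels.

```python
import pandas as pd

df = pd.DataFrame({'index': [0, 1, 2, 3, 4], 'value': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'index': [0, 2, 1, 3, 5], 'value': [1, 2, 3, 4, 5]})

# set_index returns a new frame; without assignment df is left unchanged
df.set_index('index')
still_default = list(df.index) == [0, 1, 2, 3, 4] and 'index' in df.columns

# Either assign the result back ...
df = df.set_index('index')
# ... or modify the frame in place
df2.set_index('index', inplace=True)

# Assignment now aligns on index labels; label 4 does not exist in df2,
# so that row receives NaN
df['third_value'] = df2['value'] ** 2
print(df)
```

With the index set on both frames, the values for labels 1 and 2 are swapped relative to their position in df2, which is the behaviour the question was trying to observe.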
That's not how you define an index. You need to pass a list/iterable to the index keyword argument when calling the pd.DataFrame constructor.
df = pd.DataFrame({'value' : [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'value' : [1, 2, 3, 4, 5]}, index=[0, 2, 1, 3, 4])
df['second'] = df['value'] ** 2
df['third'] = df2['value'] ** 2
df
value second third
0 1 1 1
1 2 4 9 # note these
2 3 9 4 # two rows
3 4 16 16
4 5 25 25
The assignment operations are always index aligned.
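As a small illustration of that statement, the sketch below contrasts assigning a Series (aligned by label) with assigning its underlying NumPy array (placed by position). The column names aligned and positional are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'value': [1, 2, 3, 4, 5]}, index=[0, 2, 1, 3, 4])

squared = df2['value'] ** 2

# Series on the right-hand side: values are matched by index label
df['aligned'] = squared

# Plain NumPy array: there are no labels, so values are placed by position
df['positional'] = squared.to_numpy()

print(df)
```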
I don't understand this code:
d = {'col1': [5, 6,4, 1, 2, 9, 15, 11]}
df = pd.DataFrame(data=d)
df.head(10)
df['col1'] = df.sort_values('col1')['col1']
print(df.sort_values('col1')['col1'])
This is what is printed:
3 1
4 2
2 4
0 5
1 6
5 9
7 11
6 15
My df doesn't change at all.
Why does this code, df.sort_values('col1')['col1'], not rearrange my dataframe?
Thanks
To assign a sorted column back, convert the output to a NumPy array to prevent index alignment. df.sort_values('col1')['col1'] does sort correctly and the index order changes with it, but during assignment pandas aligns the values back to the original index, so the order of the values does not change.
df['col1'] = df.sort_values('col1')['col1'].to_numpy()
If the DataFrame has the default index, another option is to reset the sorted Series to a new default index (the same as the original), so that alignment assigns by the new index values:
df['col1'] = df.sort_values('col1')['col1'].reset_index(drop=True)
If you want to sort the whole DataFrame by the col1 column:
df = df.sort_values('col1')
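The three variants can be compared side by side. This is a sketch using the data from the question; the names unchanged, by_position and by_reset are hypothetical and exist only to keep the results apart.

```python
import pandas as pd

df = pd.DataFrame({'col1': [5, 6, 4, 1, 2, 9, 15, 11]})
sorted_col = df.sort_values('col1')['col1']

# Index-aligned assignment puts every value back on its original label
unchanged = df.copy()
unchanged['col1'] = sorted_col

# Dropping the labels assigns by position instead
by_position = df.copy()
by_position['col1'] = sorted_col.to_numpy()

# A fresh 0..n-1 index has the same effect when df uses the default index
by_reset = df.copy()
by_reset['col1'] = sorted_col.reset_index(drop=True)

print(unchanged['col1'].tolist())
print(by_position['col1'].tolist())
```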
Suppose I have the following Pandas DataFrame:
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
})
a b c
0 1 4 7
1 2 5 8
2 3 6 9
I want to generate a new pandas.Series so that the values of this series are selected, row by row, from a random column in the DataFrame. So, a possible output for that would be the series:
0 7
1 2
2 9
dtype: int64
(where in row 0 it randomly chose 'c', in row 1 it randomly chose 'a' and in row 2 it randomly chose 'c' again).
I know this can be done by iterating over the rows and using random.choice to choose each row, but iterating over the rows not only has bad performance but also is "unpandonic", so to speak. Also, df.sample(axis=1) would choose whole columns, so all of them would be chosen from the same column, which is not what I want. Is there a better way to do this with vectorized pandas methods?
Here is a fully vectorized solution. Note however that it does not use Pandas methods, but rather involves operations on the underlying numpy array.
import numpy as np
indices = np.random.choice(np.arange(len(df.columns)), len(df), replace=True)
Example output is [1, 2, 1] which corresponds to ['b', 'c', 'b'].
Then use this to slice the numpy array:
df['random'] = df.to_numpy()[np.arange(len(df)), indices]
Results:
a b c random
0 1 4 7 7
1 2 5 8 5
2 3 6 9 9
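For repeatable results, the same indexing idea can be written with NumPy's Generator API. This is a sketch, not part of the original answer: the seed value and the names rng, col_idx and picked are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
# One random column position per row
col_idx = rng.integers(0, df.shape[1], size=len(df))

# Pair each row position with its randomly chosen column position
picked = pd.Series(df.to_numpy()[np.arange(len(df)), col_idx], index=df.index)
print(picked)
```

Building a Series with index=df.index keeps the row labels, so the result can be assigned back to the frame without alignment surprises.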
Maybe something like:
pd.Series([np.random.choice(i,1)[0] for i in df.values])
This does the job (using the built-in module random):
ddf = df.apply(lambda row : random.choice(row.tolist()), axis=1)
or using pandas sample:
ddf = df.apply(lambda row : row.sample().iloc[0], axis=1)
Both have the same behaviour. ddf is your Series.
pd.DataFrame(
df.values[range(df.shape[0]),
np.random.randint(
0, df.shape[1], size=df.shape[0])])
output
0
0 4
1 5
2 9
You're probably still going to need to iterate through each row while selecting a random value in each row - whether you do it explicitly with a for loop or implicitly with whatever function you decide to call.
You can, however, simplify this to a single line using a list comprehension, if it suits your style:
result = pd.Series([random.choice(df.iloc[i].tolist()) for i in range(len(df))])
If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
    df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge on DataFrame and Series as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
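Note that merging on the index only matches rows when the Series shares index labels with the DataFrame. The sketch below uses a hypothetical Series named extra whose labels match df's rows; it is not the s from the question, whose labels are 's1' and 's2'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3], 'b': [4, 5, 6]}, index=[3, 5, 6])

# Hypothetical Series whose index labels match df's row labels
extra = pd.Series([10, 20, 30], index=[3, 5, 6], name='new')

# The Series name becomes the name of the merged column
merged = df.merge(extra, left_index=True, right_index=True)
print(merged)
```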
You could construct a dataframe from the series and then merge with the dataframe.
So you specify the data as the values but multiply them by the length, set the columns to the index and set params for left_index and right_index to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT for the situation where you want the index of your constructed df from the series to use the index of the df then you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that the indices match the length.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
columns=s.index,
index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
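To see what the repeat and reshape steps produce, here is a sketch that prints the intermediate arrays. The comparison with np.tile is an addition for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3], 'b': [4, 5, 6]}, index=[3, 5, 6])
s = pd.Series({'s1': 5, 's2': 6})

# Each value of s is repeated len(df) times in a flat array
repeated = s.repeat(len(df)).values
# Fortran (column-major) order fills the first column before the second
block = repeated.reshape((len(df), -1), order='F')

# np.tile stacks the same row len(df) times and gives an identical block
tiled = np.tile(s.to_numpy(), (len(df), 1))

result = df.join(pd.DataFrame(block, columns=s.index, index=df.index))
print(repeated)
print(block)
```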
Nowadays, a much simpler and more concise solution can achieve the same task. Leveraging the capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we used DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda function simply just returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame is well documented in the official document as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official documentation also includes an example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
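The three return shapes mentioned in the quoted documentation can be compared directly. This is a sketch with a made-up frame and made-up column names (total, product), assuming a reasonably recent pandas version.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def as_list(row):
    # Hypothetical row function returning a plain list
    return [row['a'], row['b'], row['a'] + row['b']]

# A list-like return value is kept as one object per row
as_series = df.apply(as_list, axis=1)

# result_type='expand' spreads the list over columns 0, 1, 2
expanded = df.apply(as_list, axis=1, result_type='expand')

# Returning a Series names the new columns after the Series index
named = df.apply(
    lambda row: pd.Series({'total': row['a'] + row['b'],
                           'product': row['a'] * row['b']}),
    axis=1,
)
print(named)
```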
If I could suggest setting up your dataframes like this (auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using df.shape[0] to get the number of rows in df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display (df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int such as in your example. If the column you specify isn't in the df, then pandas will create a new column with the name you specify. So after your dataframe is constructed, (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to make it do this for all the elements in a list of tuples, or keys and values in a dictionary depending on how you have your real data stored.
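One way to avoid writing the loop by hand is DataFrame.assign with the Series unpacked as keyword arguments. This is a sketch of an alternative, not the answer's own method, and it assumes the Series labels are strings that are valid keyword names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3], 'b': [4, 5, 6]}, index=[3, 5, 6])
s = pd.Series({'s1': 5, 's2': 6})

# Each key of the dict becomes a column; each scalar is broadcast to every row.
# assign returns a new frame and leaves df itself untouched.
result = df.assign(**s.to_dict())
print(result)
```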
If df is a pandas.DataFrame, then df['new_col'] = obj, where obj is a list or Series of length len(df), adds obj as a column named 'new_col' (a Series is aligned on its index rather than placed by position). df['new_col'] = scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col'] = [scalar]*len(df)
So a two-line code serves the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
    df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
So I've been doing things like this with pandas:
usrdata['columnA'] = usrdata.apply(functionA, axis=1)
in order to do row operations and changing/adding columns to my dataframe.
However, now I want to try to do something like this:
usrdata['columnB', 'columnC'] = usrdata.apply(functionB, axis=1)
But the output of function B is a Series with only one column in a tuple (with two values for each row) apparently. Is there a nice way for me to either:
format the output from functionB so it can readily be added to my
dataframe
add (and possibly have to unpack) the output from functionB and assign each output value to the corresponding column of my dataframe?
Try using zip:
usrdata['columnB'], usrdata['columnC'] = zip(*usrdata.apply(functionB, axis=1))
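A self-contained sketch of the zip approach. The frame contents and the body of functionB are hypothetical, since the question does not show them; the only assumption is that functionB returns a two-element tuple per row.

```python
import pandas as pd

usrdata = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30], 'z': [0, 0, 0]})

def functionB(row):
    # Hypothetical row function returning two values as a tuple
    return row['x'] + row['y'], row['x'] * row['y']

# apply yields a Series holding one tuple per row
pairs = usrdata.apply(functionB, axis=1)

# zip(*pairs) regroups the per-row tuples into one sequence per output column
usrdata['columnB'], usrdata['columnC'] = zip(*pairs)
print(usrdata)
```

The tuples produced by zip carry no index, so the values are placed by position; this is safe here because they come from the same frame in the same row order.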
I'd assign directly to the new columns of your df and modify the function body to return a Series constructed from a list of the data:
In [9]:
df = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
df
Out[9]:
a
0 1
1 2
2 3
3 4
4 5
In [10]:
def func(x):
    return pd.Series([x*3, x*10])
df[['b','c']] = df['a'].apply(func)
df
Out[10]:
a b c
0 1 3 10
1 2 6 20
2 3 9 30
3 4 12 40
4 5 15 50
I've been very confused about how python axes are defined, and whether they refer to a DataFrame's rows or columns. Consider the code below:
>>> df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]], columns=["col1", "col2", "col3", "col4"])
>>> df
col1 col2 col3 col4
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
So if we call df.mean(axis=1), we'll get a mean across the rows:
>>> df.mean(axis=1)
0 1
1 2
2 3
However, if we call df.drop(name, axis=1), we actually drop a column, not a row:
>>> df.drop("col4", axis=1)
col1 col2 col3
0 1 1 1
1 2 2 2
2 3 3 3
Can someone help me understand what is meant by an "axis" in pandas/numpy/scipy?
A side note, DataFrame.mean just might be defined wrong. It says in the documentation for DataFrame.mean that axis=1 is supposed to mean a mean over the columns, not the rows...
It's perhaps simplest to remember it as 0=down and 1=across.
This means:
Use axis=0 to apply a method down each column, or to the row labels (the index).
Use axis=1 to apply a method across each row, or to the column labels.
Here's a picture to show the parts of a DataFrame that each axis refers to:
It's also useful to remember that Pandas follows NumPy's use of the word axis. The usage is explained in NumPy's glossary of terms:
Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). [my emphasis]
So, concerning the method in the question, df.mean(axis=1), seems to be correctly defined. It takes the mean of entries horizontally across columns, that is, along each individual row. On the other hand, df.mean(axis=0) would be an operation acting vertically downwards across rows.
Similarly, df.drop(name, axis=1) refers to an action on column labels, because they intuitively go across the horizontal axis. Specifying axis=0 would make the method act on rows instead.
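Both readings of axis can be checked on the frame from the question. A short sketch; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]],
                  columns=['col1', 'col2', 'col3', 'col4'])

row_means = df.mean(axis=1)   # across the columns: one value per row label
col_means = df.mean(axis=0)   # down the rows: one value per column label

no_col4 = df.drop('col4', axis=1)  # axis=1: the label is looked up among the columns
no_row0 = df.drop(0, axis=0)       # axis=0: the label is looked up in the index

print(row_means)
print(col_means)
```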
There are already proper answers, but I give you another example with > 2 dimensions.
The parameter axis means axis to be changed.
For example, consider that there is a dataframe with dimension a x b x c.
df.mean(axis=1) returns a dataframe with dimension a x 1 x c.
df.drop("col4", axis=1) returns a dataframe with dimension a x (b-1) x c.
Here, axis=1 means the second axis which is b, so b value will be changed in these examples.
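A DataFrame itself is two-dimensional, so the sketch below uses a NumPy array as a stand-in for the a x b x c case; that substitution, along with the concrete sizes, is an assumption made for illustration.

```python
import numpy as np

a, b, c = 2, 3, 4
arr = np.arange(a * b * c).reshape(a, b, c)

# Reducing along axis=1 collapses the second dimension;
# keepdims=True keeps it as a length-1 axis so the shape is (a, 1, c)
reduced = arr.mean(axis=1, keepdims=True)

# Deleting one entry along axis=1 shrinks the second dimension to b - 1
dropped = np.delete(arr, 0, axis=1)

print(reduced.shape, dropped.shape)
```

Without keepdims the reduced axis disappears entirely, which is why a DataFrame reduction returns a one-dimensional Series.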
Another way to explain:
# Not realistic but ideal for understanding the axis parameter
df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]],
columns=["idx1", "idx2", "idx3", "idx4"],
index=["idx1", "idx2", "idx3"]
)
---------------------------------------1
| idx1 idx2 idx3 idx4
| idx1 1 1 1 1
| idx2 2 2 2 2
| idx3 3 3 3 3
0
About df.drop (axis means the position)
A: I wanna remove idx3.
B: **Which one**? // typing while waiting response: df.drop("idx3",
A: The one which is on axis 1
B: OK then it is >> df.drop("idx3", axis=1)
// Result
---------------------------------------1
| idx1 idx2 idx4
| idx1 1 1 1
| idx2 2 2 2
| idx3 3 3 3
0
About df.apply (axis means direction)
A: I wanna apply sum.
B: Which direction? // typing while waiting response: df.apply(lambda x: x.sum(),
A: The one which is on *parallel to axis 0*
B: OK then it is >> df.apply(lambda x: x.sum(), axis=0)
// Result
idx1 6
idx2 6
idx3 6
idx4 6
It should be more widely known that the string aliases 'index' and 'columns' can be used in place of the integers 0/1. The aliases are much more explicit and help me remember how the calculations take place. Another alias for 'index' is 'rows'.
When axis='index' is used, then the calculations happen down the columns, which is confusing. But, I remember it as getting a result that is the same size as another row.
Let's get some data on the screen to see what I am talking about:
df = pd.DataFrame(np.random.rand(10, 4), columns=list('abcd'))
a b c d
0 0.990730 0.567822 0.318174 0.122410
1 0.144962 0.718574 0.580569 0.582278
2 0.477151 0.907692 0.186276 0.342724
3 0.561043 0.122771 0.206819 0.904330
4 0.427413 0.186807 0.870504 0.878632
5 0.795392 0.658958 0.666026 0.262191
6 0.831404 0.011082 0.299811 0.906880
7 0.749729 0.564900 0.181627 0.211961
8 0.528308 0.394107 0.734904 0.961356
9 0.120508 0.656848 0.055749 0.290897
When we want to take the mean of all the columns, we use axis='index' to get the following:
df.mean(axis='index')
a 0.562664
b 0.478956
c 0.410046
d 0.546366
dtype: float64
The same result would be gotten by:
df.mean() # default is axis=0
df.mean(axis=0)
df.mean(axis='rows')
To apply an operation left to right across each row, use axis='columns'. I remember it by thinking that an additional column may be added to my DataFrame:
df.mean(axis='columns')
0 0.499784
1 0.506596
2 0.478461
3 0.448741
4 0.590839
5 0.595642
6 0.512294
7 0.427054
8 0.654669
9 0.281000
dtype: float64
The same result would be gotten by:
df.mean(axis=1)
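The equivalence of the integer and string forms can be verified programmatically. A sketch with an arbitrarily seeded random frame; the seed and variable names are assumptions, and the 'rows' alias is used as described in the answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
df = pd.DataFrame(rng.random((10, 4)), columns=list('abcd'))

down = df.mean(axis='index')      # one value per column
across = df.mean(axis='columns')  # one value per row

# Integer axes and string aliases select the same computation
same_down = down.equals(df.mean(axis=0)) and down.equals(df.mean(axis='rows'))
same_across = across.equals(df.mean(axis=1))

print(same_down, same_across)
```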
Add a new row with axis=0/index/rows
Let's use these results to add additional rows or columns to complete the explanation. So, whenever using axis = 0/index/rows, it's like getting a new row of the DataFrame. Let's add a row:
pd.concat([df, df.mean(axis='rows').to_frame().T], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
a b c d
0 0.990730 0.567822 0.318174 0.122410
1 0.144962 0.718574 0.580569 0.582278
2 0.477151 0.907692 0.186276 0.342724
3 0.561043 0.122771 0.206819 0.904330
4 0.427413 0.186807 0.870504 0.878632
5 0.795392 0.658958 0.666026 0.262191
6 0.831404 0.011082 0.299811 0.906880
7 0.749729 0.564900 0.181627 0.211961
8 0.528308 0.394107 0.734904 0.961356
9 0.120508 0.656848 0.055749 0.290897
10 0.562664 0.478956 0.410046 0.546366
Add a new column with axis=1/columns
Similarly, when axis=1/columns it will create data that can be easily made into its own column:
df.assign(e=df.mean(axis='columns'))
a b c d e
0 0.990730 0.567822 0.318174 0.122410 0.499784
1 0.144962 0.718574 0.580569 0.582278 0.506596
2 0.477151 0.907692 0.186276 0.342724 0.478461
3 0.561043 0.122771 0.206819 0.904330 0.448741
4 0.427413 0.186807 0.870504 0.878632 0.590839
5 0.795392 0.658958 0.666026 0.262191 0.595642
6 0.831404 0.011082 0.299811 0.906880 0.512294
7 0.749729 0.564900 0.181627 0.211961 0.427054
8 0.528308 0.394107 0.734904 0.961356 0.654669
9 0.120508 0.656848 0.055749 0.290897 0.281000
It appears that you can see all the aliases with the following private variables:
df._AXIS_ALIASES
{'rows': 0}
df._AXIS_NUMBERS
{'columns': 1, 'index': 0}
df._AXIS_NAMES
{0: 'index', 1: 'columns'}
When axis='rows' or axis=0, it means access elements in the direction of the rows, up to down. If applying sum along axis=0, it will give us totals of each column.
When axis='columns' or axis=1, it means access elements in the direction of the columns, left to right. If applying sum along axis=1, we will get totals of each row.
Still confusing! But the above makes it a bit easier for me.
I remembered by the change of dimension, if axis=0, row changes, column unchanged, and if axis=1, column changes, row unchanged.