I am an R user who is currently learning Python and I am trying to replicate a method of selecting columns used in R into Python.
In R, I could select multiple columns like so:
df[,c(2,4:10)]
In Python, I know how iloc works, but I can't figure out how to mix a single column number with a consecutive range of columns.
This wouldn't work
df.iloc[:,[1,3:10]]
So I have to slice a wider range and then drop the unwanted column, like so:
df.iloc[:,1:10].drop(df.iloc[:,1:10].columns[1] , axis=1)
Is there a more efficient way of replicating the method from R in Python?
You can use np.r_, which accepts mixed slice notation and scalar indices and concatenates them into a 1-D array:
import numpy as np
df.iloc[:,np.r_[1, 3:10]]
df = pd.DataFrame([[1,2,3,4,5,6]])
df
#    0  1  2  3  4  5
# 0  1  2  3  4  5  6
df.iloc[:, np.r_[1, 3:6]]
#    1  3  4  5
# 0  2  4  5  6
As np.r_ produces:
np.r_[1, 3:6]
# array([1, 3, 4, 5])
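np.r_ is not limited to one scalar and one slice; any mix of scalars and slices concatenates the same way. A small sketch:
np.r_[0, 2:5, 7]
# array([0, 2, 3, 4, 7])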
Assuming one wants to select multiple columns of a DataFrame by name, consider the DataFrame df
df = pandas.DataFrame({'A' : ['X', 'Y'],
'B' : 1,
'C' : [2, 3]})
If one wants the columns A and C, simply use
df[['A', 'C']]
   A  C
0  X  2
1  Y  3
Note that if one wants to use the result later, one should assign it to a variable.
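The same selection can also be written with .loc, which some may prefer for symmetry with row selection; a sketch of the equivalent call:
df.loc[:, ['A', 'C']]
#    A  C
# 0  X  2
# 1  Y  3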
Suppose I have the following Pandas DataFrame:
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
})
a b c
0 1 4 7
1 2 5 8
2 3 6 9
I want to generate a new pandas.Series so that the values of this series are selected, row by row, from a random column in the DataFrame. So, a possible output for that would be the series:
0 7
1 2
2 9
dtype: int64
(where in row 0 it randomly chose 'c', in row 1 it randomly chose 'a' and in row 2 it randomly chose 'c' again).
I know this can be done by iterating over the rows and using random.choice to pick a value from each row, but iterating over the rows not only performs badly, it is also "unpandonic", so to speak. Also, df.sample(axis=1) would choose a whole column, so all the values would come from the same column, which is not what I want. Is there a better way to do this with vectorized pandas methods?
Here is a fully vectorized solution. Note however that it does not use Pandas methods, but rather involves operations on the underlying numpy array.
import numpy as np
indices = np.random.choice(np.arange(len(df.columns)), len(df), replace=True)
Example output is [1, 2, 1] which corresponds to ['b', 'c', 'b'].
Then use this to slice the numpy array:
df['random'] = df.to_numpy()[np.arange(len(df)), indices]
Results:
a b c random
0 1 4 7 7
1 2 5 8 5
2 3 6 9 9
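Put together, a self-contained sketch of this approach (the names col_indices and picked are mine, and np.random.seed is only there to make the illustration reproducible):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
# one random column index per row
col_indices = np.random.choice(np.arange(len(df.columns)), len(df), replace=True)
# fancy indexing: row i paired with column col_indices[i]
picked = pd.Series(df.to_numpy()[np.arange(len(df)), col_indices], index=df.index)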
Maybe something like:
pd.Series([np.random.choice(i,1)[0] for i in df.values])
This does the job (using the built-in random module):
import random
ddf = df.apply(lambda row: random.choice(row.tolist()), axis=1)
or using pandas sample:
# .iloc[0] extracts the scalar, so apply returns a Series rather than expanding to a DataFrame
ddf = df.apply(lambda row: row.sample().iloc[0], axis=1)
Both have the same behaviour. ddf is your Series.
pd.DataFrame(
    df.values[range(df.shape[0]),
              np.random.randint(0, df.shape[1], size=df.shape[0])])
Output:
   0
0  4
1  5
2  9
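Note that this builds a one-column DataFrame; since the question asks for a Series, a slight variation (a sketch) returns one directly:
pd.Series(df.values[np.arange(df.shape[0]),
                    np.random.randint(0, df.shape[1], size=df.shape[0])])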
You're probably still going to need to iterate through each row while selecting a random value in each row - whether you do it explicitly with a for loop or implicitly with whatever function you decide to call.
You can, however, simplify this to a single line using a list comprehension, if that suits your style:
import random
result = pd.Series([random.choice(df.iloc[i].tolist()) for i in range(len(df))])
If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge a DataFrame and a Series, as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
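As a minimal sketch with an index-aligned named Series (df2 and t are hypothetical here; the OP's s is indexed by ['s1', 's2'], so a plain index merge would not align it):
df2 = pd.DataFrame({'a': [1, 2]})
t = pd.Series([5, 6], name='t')
df2.merge(t, left_index=True, right_index=True)
#    a  t
# 0  1  5
# 1  2  6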
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
You could construct a dataframe from the series and then merge with the dataframe.
So you build the data by repeating the series values once per row, set the columns to the series index, and set the left_index and right_index params to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT: For the situation where you want the DataFrame constructed from the series to use the index of df, you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes the indices match in length.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
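As a side note, recent pandas (2.1+) deprecates fillna(method='ffill') in favour of the .ffill() method, so a modernized sketch of the same one-liner would be:
df.join(pd.DataFrame(s).T).ffill()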
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
columns=s.index,
index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
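To see what the intermediate reshape produces, a small sketch for len(df) == 3:
s.repeat(len(df)).values
# array([5, 5, 5, 6, 6, 6])
s.repeat(len(df)).values.reshape((len(df), -1), order='F')
# array([[5, 6],
#        [5, 6],
#        [5, 6]])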
Nowadays, a much simpler and more concise solution can achieve the same task. Leveraging the ability of DataFrame.apply() to turn a returned Series into columns of its enclosing DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we use DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda simply returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This capability of DataFrame.apply() to turn a returned Series into columns of the resulting DataFrame is well documented in the official documentation as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official documentation also includes an example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
If I may suggest setting up your dataframes like this (auto-indexed):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values like this (using df.shape[0] to get the number of rows in df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display(df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int, as in your example. If the column you specify isn't in df, pandas will create a new column with the name you specify. So, after your dataframe is constructed (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to make it do this for all the elements in a list of tuples, or keys and values in a dictionary depending on how you have your real data stored.
If df is a pandas.DataFrame, then df['new_col'] = list_object, where list_object is a list or Series of length len(df), will add list_object as a column named 'new_col'. df['new_col'] = scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col'] = [scalar]*len(df).
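A quick sketch of that equivalence ('new_col' is just an illustrative name):
df['new_col'] = 5              # the scalar is broadcast to every row
df['new_col'] = [5] * len(df)  # same result, spelled out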
So a simple two-line loop serves the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Sorry for editing the question again, but as I dug deeper into it I realized that it boils down to the question of whether I can access the values of a column and the values of the row index in the same way. This, to me, would seem quite natural, as the row index and a column are actually very similar entities.
For example, if I define a DataFrame with a two-level rows multi-index like that:
df = pd.DataFrame(data=None, index=pd.MultiIndex.from_product([['A', 'B'], [1, 2]], names=['X', 'Y']))
df.insert(loc=0, column='DATA', value=[1, 2, 3, 4])
Which gives
DATA
X Y
A 1 1
2 2
B 1 3
2 4
To access column values I can, e.g., use df.DATA or df.loc[:, 'DATA']. Consequently, to select all rows where DATA is 2, I can do df.loc[df.DATA == 2, :] or df.loc[df.loc[:, 'DATA'] == 2, :].
However, the same operation does not work on, say, the index column Y: neither df.Y nor df.loc[:, 'Y'] is valid. And therefore I can't select rows based on index values as above: df.loc[df.Y == 2, :] and df.loc[df.loc[:, 'Y'] == 2, :] both fail.
Which is a pity, as this requires writing different code depending on whether the column is a normal column or part of the index. Or is there another way to do this that works for both columns and indexes?
If you want to call .loc[:, 'Y'], reset the index first, i.e.
df.reset_index().loc[:,'Y']
Output:
0 1
1 2
2 1
3 2
Name: Y, dtype: object
If you want to select the data based on a condition, then
df.reset_index()[df.reset_index().Y == 2].set_index(['X','Y'])
Output:
DATA
X Y
A 2 2
B 2 4
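A further option, not shown in the answer above but part of the standard pandas API, is Index.get_level_values, which lets you filter on an index level without resetting it; a sketch:
df.loc[df.index.get_level_values('Y') == 2]
#      DATA
# X Y
# A 2     2
# B 2     4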
I'm new to Python, and have what is probably a basic question.
I have imported a number of pandas DataFrames consisting of stock data for different sectors, so all the columns are the same, just with different DataFrame names.
I need to do a lot of different small operations on some of the columns. I can figure out how to do them on one DataFrame at a time, but I can't figure out how to loop over the different frames and do the same operations on each.
For example for one DF i do:
ConsumerDisc['IDX_EST_PRICE_BOOK']=1/ConsumerDisc['IDX_EST_PRICE_BOOK']
ConsumerDisc['IDX_EST_EV_EBITDA']=1/ConsumerDisc['IDX_EST_EV_EBITDA']
ConsumerDisc['INDX_GENERAL_EST_PE']=1/ConsumerDisc['INDX_GENERAL_EST_PE']
ConsumerDisc['EV_TO_T12M_SALES']=1/ConsumerDisc['EV_TO_T12M_SALES']
ConsumerDisc['CFtoEarnings']=ConsumerDisc['CASH_FLOW_PER_SH']/ConsumerDisc['TRAIL_12M_EPS']
And instead of just copying and pasting this code for the next 10 sectors, I want to do it in a loop somehow, but I can't figure out how to access the df via a variable, e.g.:
CS=['ConsumerDisc']
CS['IDX_EST_PRICE_BOOK']=1/CS['IDX_EST_PRICE_BOOK']
so I could just create a list of df names and loop through it.
Hope you can give a small example of how to do this.
You're probably looking for something like this
for df in (df1, df2, df3):
df['IDX_EST_PRICE_BOOK']=1/df['IDX_EST_PRICE_BOOK']
df['IDX_EST_EV_EBITDA']=1/df['IDX_EST_EV_EBITDA']
df['INDX_GENERAL_EST_PE']=1/df['INDX_GENERAL_EST_PE']
df['EV_TO_T12M_SALES']=1/df['EV_TO_T12M_SALES']
df['CFtoEarnings']=df['CASH_FLOW_PER_SH']/df['TRAIL_12M_EPS']
Here we're iterating over the dataframes that we've put in a tuple data structure; does that make sense?
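If you only have the frame names as strings (as the question's CS = ['ConsumerDisc'] attempt suggests), a common pattern is to keep the frames in a dict keyed by name; a sketch (the sector names are placeholders):
sectors = {'ConsumerDisc': ConsumerDisc, 'Energy': Energy}  # etc.
for name, df in sectors.items():
    df['IDX_EST_PRICE_BOOK'] = 1 / df['IDX_EST_PRICE_BOOK']
    # ... the remaining column operations, as above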
Do you mean something like this?
import pandas as pd
d = {'a' : pd.Series([1, 2, 3, 10]), 'b' : pd.Series([2, 2, 6, 8])}
z = {'d' : pd.Series([4, 2, 3, 1]), 'e' : pd.Series([21, 2, 60, 8])}
df = pd.DataFrame(d)
zf = pd.DataFrame(z)
df.head()
    a  b
0   1  2
1   2  2
2   3  6
3  10  8
df = df.apply(lambda x: 1/x)
df.head()
          a         b
0  1.000000  0.500000
1  0.500000  0.500000
2  0.333333  0.166667
3  0.100000  0.125000
You have more functions, so you can create a single function and then just apply it to each DataFrame. Alternatively, you could apply these lambda functions only to specific columns. So let's say you want to apply 1/column to every column but the last (going by your example, I am assuming it is at the end): you could do df.iloc[:, :-1].apply(lambda x: 1/x) (the original df.ix accessor has since been removed from pandas, so iloc is used here).
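Sketching that suggestion with the columns from the original question (the preprocess name and the frame list are hypothetical):
def preprocess(frame):
    # invert the valuation ratios
    for col in ['IDX_EST_PRICE_BOOK', 'IDX_EST_EV_EBITDA',
                'INDX_GENERAL_EST_PE', 'EV_TO_T12M_SALES']:
        frame[col] = 1 / frame[col]
    frame['CFtoEarnings'] = frame['CASH_FLOW_PER_SH'] / frame['TRAIL_12M_EPS']

for frame in (ConsumerDisc, Energy):  # placeholder frame names
    preprocess(frame)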
Finding a normalized dataframe removes the column being used to group by, so that it can't be used in subsequent groupby operations. For example (edit: updated):
df = pd.DataFrame({'a':[1, 1 , 2, 3, 2, 3], 'b':[0, 1, 2, 3, 4, 5]})
a b
0 1 0
1 1 1
2 2 2
3 3 3
4 2 4
5 3 5
df.groupby('a').transform(lambda x: x)
b
0 0
1 1
2 2
3 3
4 4
5 5
Now, with most operations on groups the 'missing' column becomes a new index (which can then be adjusted using reset_index, or set as_index=False), but when using transform it just disappears, leaving the original index and a new dataset without the key.
Edit: here's a one liner of what I would like to be able to do
df.groupby('a').transform(lambda x: x+1).groupby('a').mean()
KeyError 'a'
In the example from the pandas docs a function is used to split based on the index, which appears to avoid this issue entirely. Alternatively, it would always be possible just to add the column after the groupby/transform, but surely there's a better way?
Update:
It looks like reset_index/as_index are intended only for functions that reduce each group to a single row. There seem to be a couple of options, from the answers below.
The issue is also discussed here.
The returned object has the same indices as the original df, therefore you can do
pd.concat([
df['a'],
df.groupby('a').transform(lambda x: x)
], axis=1)
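With the key column restored this way, the one-liner from the question's edit works as intended; a sketch:
pd.concat([df['a'], df.groupby('a').transform(lambda x: x + 1)], axis=1).groupby('a').mean()
#      b
# a
# 1  1.5
# 2  4.0
# 3  5.0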
That is bizarre!
I tricked it like this
df.groupby(df.a.values).transform(lambda x: x)
Another way to achieve something similar to what Pepacz suggested:
df.loc[:, df.columns.drop('a')] = df.groupby('a').transform(lambda x: x)
Try this:
df['b'] = df.groupby('a').transform(lambda x: x)
df.drop_duplicates()