Pandas, how to combine multiple columns into an array column - python

I need to add a combined column that holds all the values of each row as a list.
Source:
pd.DataFrame(data={
    'a': [1, 2, 3],
    'b': [2, 3, 4]
})
Target:
pd.DataFrame(data={
    'a': [1, 2, 3],
    'b': [2, 3, 4],
    'combine': [[1, 2], [2, 3], [3, 4]]
})
Current solution:
test['combine'] = test[['a','b']].apply(lambda x: pd.Series([x.values]), axis=1)
Issues:
I actually have many columns, so this seems to take too long to run. Is there a better way?

df
a b
0 1 2
1 2 3
2 3 4
If you want to add a column of lists as a single column, you'll need to call the .values attribute, convert it to a nested list, and assign it back -
df['combine'] = df.values.tolist()
# or,
df['combine'] = df[['a', 'b']].values.tolist()
df
a b combine
0 1 2 [1, 2]
1 2 3 [2, 3]
2 3 4 [3, 4]
Note that just assigning the .values result directly does not work, as pandas special-cases numpy arrays, leading to undesirable outcomes:
df['combine'] = df[['a', 'b']].values
ValueError: Wrong number of items passed 2, placement implies 1
A couple of notes -
try to avoid apply/transform as much as possible. They are only convenience functions meant to hide the application of a loop, are slow, and offer no performance/vectorization benefits whatsoever (a rough timing sketch follows below)
keeping columns of objects (such as lists) offers no performance gains as far as pandas is concerned, so unless the goal is to display data, try to avoid it.
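As an illustration of the first note, here is a sketch comparing the row-wise apply from the question with the vectorized .values.tolist() approach; the wide frame and its size are made up purely for the benchmark, and actual timings will vary by machine and data.
import numpy as np
import pandas as pd
from timeit import timeit

# A made-up wide frame standing in for the "many columns" case.
df = pd.DataFrame(np.random.randint(0, 10, size=(10_000, 20)),
                  columns=[f'c{i}' for i in range(20)])

# Row-wise apply (runs a Python-level loop under the hood).
apply_way = lambda: df.apply(lambda x: pd.Series([x.values]), axis=1)

# Single vectorized conversion to a nested list.
tolist_way = lambda: df.values.tolist()

print(timeit(apply_way, number=1))
print(timeit(tolist_way, number=1))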

Related

Selecting multiple columns R vs python pandas

I am an R user who is currently learning Python, and I am trying to replicate in Python a method of selecting columns that I use in R.
In R, I could select multiple columns like so:
df[,c(2,4:10)]
In Python, I know how iloc works, but I can't figure out how to mix a single column number with a consecutive range of columns.
This wouldn't work
df.iloc[:,[1,3:10]]
So, I'll have to drop the second column like so:
df.iloc[:,1:10].drop(df.iloc[:,1:10].columns[1] , axis=1)
Is there a more efficient way of replicating the method from R in Python?
You can use np.r_, which accepts mixed slice notation and scalar indices and concatenates them into a 1-d array:
import numpy as np
df.iloc[:,np.r_[1, 3:10]]
df = pd.DataFrame([[1,2,3,4,5,6]])
df
# 0 1 2 3 4 5
#0 1 2 3 4 5 6
df.iloc[:, np.r_[1, 3:6]]
# 1 3 4 5
#0 2 4 5 6
As np.r_ produces:
np.r_[1, 3:6]
# array([1, 3, 4, 5])
Assuming you want to select multiple columns of a DataFrame by name, consider the DataFrame df
df = pandas.DataFrame({'A': ['X', 'Y'],
                       'B': 1,
                       'C': [2, 3]})
If you want the columns A and C, simply use
df[['A', 'C']]
   A  C
0  X  2
1  Y  3
Note that if you want to use the selection later on, you should assign it to a variable.
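For instance, a minimal sketch of keeping that selection around for later use, continuing from the df defined above (the name sub is just illustrative):
sub = df[['A', 'C']]   # df itself is unchanged
print(sub)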

Pandas loc does not work to subset DataFrame when using a variable

I am fairly new to Python, especially pandas. I have a DataFrame called KeyRow which is from a bigger df:
KeyRow=df.loc[df['Order'] == UniqueOrderName[i]]
Then I make a nested loop
for i in range(0, len(PersonNum)):
    print(KeyRow.loc[KeyRow['Aisle'] == '6', 'FixedPill'])
So it appears to work only when a constant is used: if I use PersonNum[0] instead of '6', even though both values are equivalent, it does not work. When I use PersonNum[i] this is the output I get:
Series([], Name: FixedPill, dtype: object)
Whereas if I use 'x' I get the desired result:
15 5
Name: FixedPill, dtype: object
It's a little unclear what you are trying to accomplish with this question. If you are looking to filter a DataFrame, I would suggest never doing it in an iterative manner. You should take full advantage of the slicing capabilities of .loc. Consider the example:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6],
                   [1, 2, 3], [2, 5, 6],
                   [1, 2, 3], [4, 5, 6],
                   [1, 2, 3], [4, 5, 6]],
                  columns=["A", "B", "C"])
df.head()
A B C
0 1 2 3
1 4 5 6
2 1 2 3
3 2 5 6
4 1 2 3
Suppose you have a list PersonNum = [1, 2] that you want to use to locate a particular field. You can slice the DataFrame in one step by performing:
df.loc[df["A"].isin(PersonNum), "B"]
Which will return a pandas Series and
df.loc[df["A"].isin(PersonNum), "B"].to_frame()
which returns a new DataFrame. Utilizing the .loc is significantly faster than an iterative approach.
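For comparison, a minimal sketch of the same selection done both ways, reusing the df and PersonNum defined above; the loop mirrors the question's structure, and the gap between the two grows with the size of the data.
# Iterative approach: one .loc call per value, results stitched together
# (note this also reorders rows by value).
pieces = [df.loc[df["A"] == num, "B"] for num in PersonNum]
iterative = pd.concat(pieces)

# Vectorized approach: a single boolean mask built with .isin.
vectorized = df.loc[df["A"].isin(PersonNum), "B"]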

Loop through different Pandas Dataframes

I'm new to Python, and have what is probably a basic question.
I have imported a number of Pandas Dataframes consisting of stock data for different sectors. So all columns are the same, just with different dataframe names.
I need to do a lot of different small operations on some of the columns, and I can figure out how to do it on one Dataframe at a time, but I need to figure out how to loop over the different frames and do the same operations on each.
For example for one DF i do:
ConsumerDisc['IDX_EST_PRICE_BOOK']=1/ConsumerDisc['IDX_EST_PRICE_BOOK']
ConsumerDisc['IDX_EST_EV_EBITDA']=1/ConsumerDisc['IDX_EST_EV_EBITDA']
ConsumerDisc['INDX_GENERAL_EST_PE']=1/ConsumerDisc['INDX_GENERAL_EST_PE']
ConsumerDisc['EV_TO_T12M_SALES']=1/ConsumerDisc['EV_TO_T12M_SALES']
ConsumerDisc['CFtoEarnings']=ConsumerDisc['CASH_FLOW_PER_SH']/ConsumerDisc['TRAIL_12M_EPS']
Instead of just copying and pasting this code for the next 10 sectors, I want to do it in a loop somehow, but I can't figure out how to access a DataFrame via a variable, e.g.:
CS=['ConsumerDisc']
CS['IDX_EST_PRICE_BOOK']=1/CS['IDX_EST_PRICE_BOOK']
so that I could just create a list of DataFrame names and loop through it.
I hope you can give a small example of how to do this.
You're probably looking for something like this
for df in (df1, df2, df3):
    df['IDX_EST_PRICE_BOOK'] = 1/df['IDX_EST_PRICE_BOOK']
    df['IDX_EST_EV_EBITDA'] = 1/df['IDX_EST_EV_EBITDA']
    df['INDX_GENERAL_EST_PE'] = 1/df['INDX_GENERAL_EST_PE']
    df['EV_TO_T12M_SALES'] = 1/df['EV_TO_T12M_SALES']
    df['CFtoEarnings'] = df['CASH_FLOW_PER_SH']/df['TRAIL_12M_EPS']
Here we're iterating over the DataFrames that we've put in a tuple data structure; does that make sense?
Do you mean something like this?
import pandas as pd
d = {'a' : pd.Series([1, 2, 3, 10]), 'b' : pd.Series([2, 2, 6, 8])}
z = {'d' : pd.Series([4, 2, 3, 1]), 'e' : pd.Series([21, 2, 60, 8])}
df = pd.DataFrame(d)
zf = pd.DataFrame(z)
df.head()
a b
0 1 2
1 2 2
2 3 6
3 10 8
df = df.apply(lambda x: 1/x)
df.head()
a b
0 1.0 0.500000
1 2.0 0.500000
2 3.0 0.166667
3 10.0 0.125000
Since you have more operations, you can create a function and then just apply it to each DataFrame (see the sketch below). Alternatively, you could apply these lambda functions to only specific columns. So let's say you want to apply 1/column to every column but the last (going by your example, I assume it is at the end); you could do df.iloc[:, :-1].apply(lambda x: 1/x).
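Building on that suggestion, here is a sketch of wrapping the operations in a function and applying it to every sector frame; the toy frames and the dict container are assumptions about how the sectors might be stored, and the column names are taken from the question.
import pandas as pd

def invert_ratios(df):
    # Invert the valuation ratios and add the cash-flow-to-earnings column.
    for col in ['IDX_EST_PRICE_BOOK', 'IDX_EST_EV_EBITDA',
                'INDX_GENERAL_EST_PE', 'EV_TO_T12M_SALES']:
        df[col] = 1 / df[col]
    df['CFtoEarnings'] = df['CASH_FLOW_PER_SH'] / df['TRAIL_12M_EPS']
    return df

# Toy frames standing in for the imported sector data.
cols = ['IDX_EST_PRICE_BOOK', 'IDX_EST_EV_EBITDA', 'INDX_GENERAL_EST_PE',
        'EV_TO_T12M_SALES', 'CASH_FLOW_PER_SH', 'TRAIL_12M_EPS']
sectors = {name: pd.DataFrame([[2.0, 4.0, 8.0, 5.0, 3.0, 1.5]], columns=cols)
           for name in ['ConsumerDisc', 'Energy']}

# Apply the same operations to every sector frame.
sectors = {name: invert_ratios(frame) for name, frame in sectors.items()}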

Keeping 'key' column when using groupby with transform in pandas

Computing a normalized DataFrame removes the column being used to group by, so it can't be used in subsequent groupby operations. For example (edit: updated):
df = pd.DataFrame({'a':[1, 1 , 2, 3, 2, 3], 'b':[0, 1, 2, 3, 4, 5]})
a b
0 1 0
1 1 1
2 2 2
3 3 3
4 2 4
5 3 5
df.groupby('a').transform(lambda x: x)
b
0 0
1 1
2 2
3 3
4 4
5 5
Now, with most operations on groups the 'missing' column becomes a new index (which can then be adjusted using reset_index, or set as_index=False), but when using transform it just disappears, leaving the original index and a new dataset without the key.
Edit: here's a one-liner of what I would like to be able to do:
df.groupby('a').transform(lambda x: x+1).groupby('a').mean()
KeyError 'a'
In the example from the pandas docs a function is used to split based on the index, which appears to avoid this issue entirely. Alternatively, it would always be possible just to add the column after the groupby/transform, but surely there's a better way?
Update:
It looks like reset_index/as_index are intended only for functions that reduce each group to a single row. There seem to be a couple of options, from the answers below.
The issue is discussed also here.
The returned object has the same indices as the original df, therefore you can do
pd.concat([
    df['a'],
    df.groupby('a').transform(lambda x: x)
], axis=1)
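With that, the one-liner from the question can be made to work by re-attaching the key before the second groupby; a sketch, using the same df as above:
(pd.concat([df['a'], df.groupby('a').transform(lambda x: x + 1)], axis=1)
   .groupby('a')
   .mean())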
That is bizarre!
I tricked it like this (grouping by the underlying array rather than the column label, so the 'a' column is not consumed by the groupby):
df.groupby(df.a.values).transform(lambda x: x)
Another way to achieve something similar to what Pepacz suggested:
df.loc[:, df.columns.drop('a')] = df.groupby('a').transform(lambda x: x)
Try this:
df['b'] = df.groupby('a').transform(lambda x: x)
df.drop_duplicates()

Apply function to pandas dataframe that returns multiple rows

I would like to apply a function to a pandas DataFrame that splits some of the rows into two. So for example, I may have this as input:
df = pd.DataFrame([{'one': 3, 'two': 'a'}, {'one': 5, 'two': 'b,c'}], index=['i1', 'i2'])
one two
i1 3 a
i2 5 b,c
And I want something like this as output:
one two
i1 3 a
i2_0 5 b
i2_1 5 c
My hope was that I could just use apply() on the data frame, calling a function that returns a dataframe with 1 or more rows itself, which would then get merged back together. However, this does not seem to work at all. Here is a test case where I am just trying to duplicate each row:
dfa = df.apply(lambda s: pd.DataFrame([s.to_dict(), s.to_dict()]), axis=1)
one two
i1 one two
i2 one two
So if I return a DataFrame, the column names of that DataFrame seem to become the contents of the rows. This is obviously not what I want.
There is another question on here that was solved by using .groupby(); however, I don't think that applies to my case since I don't actually want to group by anything.
What is the correct way to do this?
You have a messed-up database (a comma-separated string where you should have separate columns). We first fix this:
df2 = pd.concat([df['one'], pd.DataFrame(df.two.str.split(',').tolist(), index=df.index)], axis=1)
Which gives us something neater:
In[126]: df2
Out[126]:
one 0 1
i1 3 a None
i2 5 b c
Now, we can just do
In[125]: df2.set_index('one').unstack().dropna()
Out[125]:
one
0 3 a
5 b
1 5 c
Adjusting the index (if desired) is trivial and left to the reader as an exercise.
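As a side note, on newer pandas (0.25+) a similar result can be had with DataFrame.explode; a sketch, with the index-suffixing step included only because the question's target output uses labels like i2_0 and i2_1:
import pandas as pd

df = pd.DataFrame([{'one': 3, 'two': 'a'}, {'one': 5, 'two': 'b,c'}],
                  index=['i1', 'i2'])

# Split the comma-separated strings into lists, then emit one row per element.
out = df.assign(two=df['two'].str.split(',')).explode('two')

# Optionally suffix only the duplicated index labels (i2 -> i2_0, i2_1).
pos = out.groupby(level=0).cumcount()
size = out.groupby(level=0)['two'].transform('size')
out.index = [f'{idx}_{i}' if n > 1 else idx
             for idx, i, n in zip(out.index, pos, size)]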
