Given a dataframe with columns a1, b1 and a2, b2, I want to find the column index for the largest value out of a1, b1 and then get the value with the same relative column index out of a2, b2, as shown in the want column below:
import pandas as pd
import numpy as np
# Sample data
df= pd.DataFrame({'a_1':[1,2,3], 'b_1': [2,1,3], 'a_2': [3,4,7], 'b_2':[5,6,8], 'want':[5, 4, 7]})
I was able to get this far, but I'm not sure what the best approach is for the final step:
# Get the argmax for a1, b1
df['c'] = df[['a_1', 'b_1']].idxmax(axis=1)
# Get the column index of the argmax
df['d'] = df['c'].apply(lambda x: ['a_1', 'b_1'].index(x))
This is a simplified version of the problem - there are actually many more columns to search through - e.g. a1-z1, a2-z2.
For two columns, this should do:
df['e'] = np.where(df['a_1']>=df['b_1'], df['a_2'], df['b_2'])
For several columns:
numcols = 2
idx_max = np.argmax(df.iloc[:, :numcols].values, 1)
df['e'] = df.iloc[:,numcols:2*numcols].values[np.arange(len(df)), idx_max]
You can also replace df.iloc[...] with the corresponding column names, e.g. replace df.iloc[:, :numcols] with df[['a_1','b_1']].
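For the many-column case (a_1–z_1, a_2–z_2), one way to avoid hard-coding names is to select the two column groups by suffix with filter. A minimal sketch, reusing the sample df from the question and assuming the _1 and _2 columns appear in the same relative order:
# Select the "search" and "value" column groups by suffix.
search = df.filter(regex=r'_1$')
values = df.filter(regex=r'_2$')
# Row-wise position of the max in the search group...
idx_max = np.argmax(search.to_numpy(), axis=1)
# ...used to pick the same relative position from the value group.
df['e'] = values.to_numpy()[np.arange(len(df)), idx_max]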
We can do
s=df[['a_1','b_1']].idxmax(1).replace(['a_1','b_1'],['a_2','b_2'])
df['value']=df.lookup(s.index,s)
df
Out[23]:
a_1 b_1 a_2 b_2 want value
0 1 2 3 5 5 5
1 2 1 4 6 4 4
2 3 3 7 8 7 7
Use DataFrame.filter along with DataFrame.lookup:
cols = df.filter(regex=r'[a-zA-Z]+_1').idxmax(1).str.rstrip('1') + '2'
df['want'] = df.filter(regex=r'[a-zA-Z]+_2').lookup(df.index, cols)
# print(df)
a_1 b_1 a_2 b_2 want
0 1 2 3 5 5
1 2 1 4 6 4
2 3 3 7 8 7
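Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, a NumPy-based equivalent of the same row-wise lookup might look like this (a sketch reusing the cols Series from above):
vals = df.filter(regex=r'[a-zA-Z]+_2')
# Translate the target column labels into positions, then index row-wise.
df['want'] = vals.to_numpy()[np.arange(len(df)), vals.columns.get_indexer(cols)]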
This is the solution I initially came up with:
df['e'] = np.choose(df['d'].values, df[['a_2', 'b_2']].transpose().values)
I think this works, but is there a simpler way?
EDIT: It seems this only works as long as you have at most 32 columns to choose from, so other options are definitely better than this one.
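If you do hit the 32-choice limit of np.choose, np.take_along_axis is one possible replacement (a sketch using the d column computed above, not necessarily the best option here):
# Pick, for each row, the value at position d from the a_2/b_2 block.
vals = df[['a_2', 'b_2']].to_numpy()
df['e'] = np.take_along_axis(vals, df['d'].to_numpy()[:, None], axis=1)[:, 0]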
Related
I have an array of numbers (I think the format makes it a pivot table) that I want to turn into a "tidy" data frame. For example, I start with variable 1 down the left, variable 2 across the top, and the value of interest in the middle, something like this:
X Y
A 1 2
B 3 4
I want to turn that into a tidy data frame like this:
V1 V2 value
A X 1
A Y 2
B X 3
B Y 4
The row and column order don't matter to me, so the following is totally acceptable:
value V1 V2
2 A Y
4 B Y
3 B X
1 A X
For my first go at this, which was able to get me the correct final answer, I looped over the rows and columns. This was terribly slow, and I suspected that some machinery in Pandas would make it go faster.
It seems that melt is close to the magic I seek, but it doesn't get me all the way there. That first array turns into this:
  variable  value
0        X      1
1        X      3
2        Y      2
3        Y      4
It gets rid of my V1 variable!
Nothing is special about melt, so I will be happy to read answers that use other approaches, particularly if melt is not much faster than my nested loops and another solution is. Nonetheless, how can I go from that array to the kind of tidy data frame I want as the output?
Example dataframe:
df = pd.DataFrame({"X":[1,3], "Y":[2,4]},index=["A","B"])
Use DataFrame.rename_axis with DataFrame.reset_index and then DataFrame.melt. If you want to order the columns, we can use DataFrame.reindex.
new_df = (df.rename_axis(index = 'V1')
.reset_index()
.melt('V1',var_name='V2')
.reindex(columns = ['value','V1','V2']))
print(new_df)
Another approach DataFrame.stack:
new_df = (df.stack()
.rename_axis(index = ['V1','V2'])
.rename('value')
.reset_index()
.reindex(columns = ['value','V1','V2']))
print(new_df)
value V1 V2
0 1 A X
1 3 B X
2 2 A Y
3 4 B Y
For setting the names there is another alternative, as @Scott Boston suggested in the comments.
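On pandas 1.1+, melt can also keep the original index via ignore_index=False, which avoids the reset_index-before-melt step; a minimal sketch producing the same result:
new_df = (df.melt(ignore_index=False, var_name='V2')
            .rename_axis('V1')
            .reset_index()
            .reindex(columns=['value', 'V1', 'V2']))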
Melt is a good approach, but it doesn't seem to play nicely with identifying the results by index. You can reset the index first to move it to its own column, then use that column as the id col.
test = pd.DataFrame([[1,2],[3,4]], columns=['X', 'Y'], index=['A', 'B'])
X Y
A 1 2
B 3 4
test = test.reset_index()
index X Y
0 A 1 2
1 B 3 4
test.melt('index',['X', 'Y'], 'prev cols')
index prev cols value
0 A X 1
1 B X 3
2 A Y 2
3 B Y 4
If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
    df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT: The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge on DataFrame and Series as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
You could construct a dataframe from the series and then merge with the dataframe.
So you build the data by repeating the Series values len(s) times, set the columns to the Series index, and set the left_index and right_index params to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT: for the situation where you want the constructed df (built from the series) to use the index of df, you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that the indices are the same length.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
columns=s.index,
index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
Nowadays, a much simpler and more concise solution can achieve the same task. Leveraging the capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we used DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda function simply returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame is well documented in the official document as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official documentation also includes an example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
If I could suggest setting up your dataframes like this (auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using shape[0] to get the number of rows of df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display (df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int such as in your example. If the column you specify isn't in the df, then pandas will create a new column with the name you specify. So after your dataframe is constructed, (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to make it do this for all the elements in a list of tuples, or keys and values in a dictionary depending on how you have your real data stored.
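As one hedged alternative to writing that loop yourself, the Series can be converted to a dict and unpacked into DataFrame.assign, which broadcasts each scalar across all rows (a sketch, not the only way):
df = df.assign(**s.to_dict())   # adds s1=5 and s2=6 as constant columns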
If df is a pandas.DataFrame, then df['new_col'] = list_object (a list or Series of length len(df)) will add that list or Series as a column named 'new_col'. df['new_col'] = scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col'] = [scalar]*len(df).
So two lines of code serve the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
    df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Let's say I have a DataFrame that looks like this:
a b c d e f g
1 2 3 4 5 6 7
4 3 7 1 6 9 4
8 9 0 2 4 2 1
How would I go about deleting every column besides a and b?
This would result in:
a b
1 2
4 3
8 9
I would like a way to delete these using a simple line of code that says, delete all columns besides a and b, because let's say hypothetically I have 1000 columns of data.
Thank you.
In [48]: df.drop(df.columns.difference(['a','b']), axis=1, inplace=True)

In [49]: df
Out[49]:
a b
0 1 2
1 4 3
2 8 9
or:
In [55]: df = df.loc[:, df.columns.intersection(['a','b'])]
In [56]: df
Out[56]:
a b
0 1 2
1 4 3
2 8 9
PS: please be aware that the most idiomatic Pandas way to do that was already proposed by @Wen:
df = df[['a','b']]
or
df = df.loc[:, ['a','b']]
Another option to add to the mix. I prefer this approach for readability.
df = df.filter(['a', 'b'])
Where the first positional argument is items=[]
Bonus
You can also use a like argument or regex to filter.
Helpful if you have a set of columns like ['a_1','a_2','b_1','b_2']
You can do
df = df.filter(like='b_')
and end up with ['b_1','b_2']
Pandas documentation for filter.
There are multiple solutions:
df = df[['a','b']] #1
df = df[list('ab')] #2
df = df.loc[:,df.columns.isin(['a','b'])] #3
df = pd.DataFrame(data=df.eval('a,b').T, columns=['a','b']) #4 PS: I do not recommend this method, but it is still a way to achieve this
Hey, what you are looking for is:
df = df[["a","b"]]
You will receive a dataframe which only contains the columns a and b.
If you want to keep more columns than you're dropping, put a "~" before the .isin statement to select every column except the ones you list:
df = df.loc[:, ~df.columns.isin(['a','b'])]
If you have more than two columns that you want to drop, let's say 20 or 30, you can use lists as well. Make sure that you also specify the axis value.
drop_list = ["a","b"]
df = df.drop(df.columns.difference(drop_list), axis=1)
Let's say I have a table with a key (e.g. customer ID) and two numeric columns C1 and C2. I would like to group rows by the key (customer) and run some aggregators like sum and mean on its columns. After computing group aggregators I would like to assign the results back to each customer row in a DataFrame (as some customer-wide features added to each row).
I can see that I can do something like
df['F1'] = df.groupby(['Key'])['C1'].transform(np.sum)
if I want to aggregate just one column and be able to add the result back to the DataFrame.
Can I make it conditional - can I add up C1 column in a group only for rows whose C2 column is equal to some number X and still be able to add results back to the DataFrame?
How can I run aggregator on a combination of rows like:
np.sum(C1 + C2)?
What would be the simplest and most elegant way to implement this? What is the most efficient way to do it? Can those aggregations be done in one pass?
Thank you in advance.
Here's some setup of some dummy data.
In [81]: df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                            'C1': [1,2,3,4,5,6],
                            'C2': [7,8,9,10,11,12]})
In [82]: df['F1'] = df.groupby('Key')['C1'].transform(np.sum)
In [83]: df
Out[83]:
C1 C2 Key F1
0 1 7 a 3
1 2 8 a 3
2 3 9 b 7
3 4 10 b 7
4 5 11 c 11
5 6 12 c 11
If you want to do a conditional GroupBy, you can just filter the dataframe as it's passed to .groupby. For example, if you wanted the group sum of 'C1' where C2 is less than 8 or greater than 9:
In [87]: cond = (df['C2'] < 8) | (df['C2'] > 9)
In [88]: df['F2'] = df[cond].groupby('Key')['C1'].transform(np.sum)
In [89]: df
Out[89]:
C1 C2 Key F1 F2
0 1 7 a 3 1
1 2 8 a 3 NaN
2 3 9 b 7 NaN
3 4 10 b 7 4
4 5 11 c 11 11
5 6 12 c 11 11
This works because the transform operation preserves the index, so it will still align with the original dataframe correctly.
If you want to sum the group totals for two columns, it's probably easiest to do something like this. Someone may have something more clever.
In [93]: gb = df.groupby('Key')
In [94]: df['C1+C2'] = gb['C1'].transform(np.sum) + gb['C2'].transform(np.sum)
Edit:
Here's one other way to get group totals for multiple columns. The syntax isn't really any cleaner, but may be more convenient for a large number of columns.
df['C1_C2'] = gb[['C1','C2']].apply(lambda x: pd.DataFrame(x.sum().sum(), index=x.index, columns=['']))
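Another hedged option is to sum the two columns first and then transform that single total per group (a sketch; for sums this matches the two-transform line above):
df['C1_C2'] = (df['C1'] + df['C2']).groupby(df['Key']).transform('sum')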
I found another approach that uses apply() instead of transform(), but you need to join the result table with the input DataFrame and I just haven't figured out yet how to do it. Would appreciate help to finish the table joining part or any better alternatives.
df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                   'C1': [1,2,3,4,5,6],
                   'C2': [7,8,9,10,11,12]})
# Group g will be given as a DataFrame
def group_feature_extractor(g):
    feature_1 = (g['C1'] + g['C2']).sum()
    even_C1_filter = g['C1'] % 2 == 0
    feature_2 = g[even_C1_filter]['C2'].sum()
    return pd.Series([feature_1, feature_2], index=['F1', 'F2'])
# Group once
group = df.groupby(['Key'])
# Extract features from each group
group_features = group.apply(group_feature_extractor)
#
# Join with the input data frame ...
#
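To finish the joining part, one possible sketch: group_features comes back indexed by 'Key', so it can be joined onto each row through that column (assuming the feature names F1/F2 don't collide with existing columns):
result = df.join(group_features, on='Key')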
Suppose I have a Pandas dataframe df that has columns a, b, c, d, ..., z, and I want to do df.groupby('a').apply(my_func) for columns d-z while leaving columns 'b' & 'c' unchanged. How do I do that?
I notice Pandas can apply different functions to different columns by passing a dict, but I have a long column list and just want a parameter or tip to simply tell Pandas to bypass some columns and apply my_func() to the rest of the columns. (Otherwise I have to build a long dict.)
One simple (and general) approach is to create a view of the dataframe with the subset you are interested in (or, for your case, a view with all columns except the ones you want to ignore), and then use apply on that view.
In [116]: df
Out[116]:
a b c d f
0 one 3 0.493808 a bob
1 two 8 0.150585 b alice
2 one 6 0.641816 c michael
3 two 5 0.935653 d joe
4 one 1 0.521159 e kate
Use your favorite methods to create the view you need. You could select a range of columns like so df_view = df.loc[:, 'b':'d'] (older pandas used df.ix, which has since been removed), but the following might be more useful for your scenario:
#I want all columns except two
cols = df.columns.tolist()
mycols = [x for x in cols if not x in ['a','f']]
df_view = df[mycols]
Apply your function to that view. (Note this doesn't yet change anything in df.)
In [158]: df_view.apply(lambda x: x /2)
Out[158]:
b c d
0 1 0.246904 20
1 4 0.075293 25
2 3 0.320908 28
3 2 0.467827 28
4 0 0.260579 24
Update the df using update()
In [156]: df.update(df_view.apply(lambda x: x/2))
In [157]: df
Out[157]:
a b c d f
0 one 1 0.246904 20 bob
1 two 4 0.075293 25 alice
2 one 3 0.320908 28 michael
3 two 2 0.467827 28 joe
4 one 0 0.260579 24 kate
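If the goal is simply to modify df in place, assigning the transformed view back to the same column subset is one possible shortcut (a sketch, assuming all selected columns are numeric):
df[mycols] = df[mycols] / 2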