If I have a df such as this:
   a  b
0  1  3
1  2  4
I can use df['c'] = '' and df['d'] = -1 to add two columns, which gives:
   a  b c  d
0  1  3   -1
1  2  4   -1
How can I wrap this in a function, so I can apply that function to df and add all the columns at once, instead of adding them one by one separately as above? Thanks
Create a dictionary:
dictionary = {'c': '', 'd': -1}

def new_columns(df, dictionary):
    return df.assign(**dictionary)
then call it with your df:
df = new_columns(df, dictionary)
or, if you don't need a function (not sure what your use case is), just:
df.assign(**dictionary)
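Note that assign returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep the new columns:
df = df.assign(**dictionary)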
Alternatively, a function that adds the columns to the dataframe in place:
def update_df(a_df, new_cols_names, new_cols_vals):
    for n, v in zip(new_cols_names, new_cols_vals):
        a_df[n] = v

update_df(df, ['c', 'd', 'e'], ['', 5, 6])
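For the example df above, this mutates df in place, giving (the empty strings in c print as blanks):
   a  b c  d  e
0  1  3    5  6
1  2  4    5  6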
I have n variables. Suppose n equals 3 in this case. I want to apply one function to all of the combinations (or permutations, depending on how you want to solve this) of the variables and store each result at the corresponding row and column of a dataframe.
import numpy as np
import pandas as pd

a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
df = pd.DataFrame({x: np.nan for x in indexes}, index=indexes)
If I apply sum (the function can be anything), then the result that I want to get is like this:
   a  b  c
a  2  3  4
b  3  4  5
c  4  5  6
I can only think of iterating over all the variables, applying the function one by one, and using the iterators' indices to set the values in the dataframe. Is there any better solution?
You can use apply and return a pd.Series to that effect. In such cases, pandas uses the series indices as columns in the resulting dataframe.
s = pd.Series({"a": 1, "b": 2, "c": 3})
s.apply(lambda x: x+s)
Just note that the operation you do is between an element and a series.
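A minimal, self-contained version of this approach for the example above:

import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3})
# each scalar element x is added to the whole series, producing one row per element
df = s.apply(lambda x: x + s)
print(df)
   a  b  c
a  2  3  4
b  3  4  5
c  4  5  6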
If performance is important, I believe you need a broadcast sum of an array created from the variables:
a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
arr = np.array([a, b, c])
df = pd.DataFrame(arr + arr[:, None], index=indexes, columns=indexes)
print(df)
   a  b  c
a  2  3  4
b  3  4  5
c  4  5  6
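An equivalent spelling of the same broadcast uses NumPy's outer addition (same arr and indexes as above):
df = pd.DataFrame(np.add.outer(arr, arr), index=indexes, columns=indexes)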
I need to make a function that can act on any dataframe and perform an action on it.
To clarify, for example let's say I have this sample dataframe here:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
Which looks like this:
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
I have created a function that does something of this sort:
def ColDrop(df, collist):
    df = df.drop(columns=collist)
    return df
What I'd like is for it to accept a list as the collist argument and drop all of those columns from the dataframe df, so...
col = ['a', 'b']
ColDrop(df, col)
Would look like...
   c
0  3
1  6
2  9
However, it doesn't seem to work. Similarly I want to remove values from any dataframe based on its row, for example...
def rowvaluedrop(df, column, pattern):
    filter = df[column].str.contains(pattern)
    df = df[~filter]
    return df
rowvaluedrop(df, a, 4)
Would look like...
   a  b  c
0  1  2  3
2  7  8  9
(I realise this second example may not work since the values are integers rather than strings, but I hope that my point gets across regardless.)
Thanks in advance.
You need to assign the returned dataframe back to df explicitly, since both of your functions return a new dataframe rather than modifying the original:
df = rowvaluedrop(df, 'a', 4)
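If you also want the pattern matching to work on numeric columns, here is one sketch (casting everything to str is my assumption, not part of the original question):

def rowvaluedrop(df, column, pattern):
    # str.contains needs strings, so cast the column and the pattern first
    mask = df[column].astype(str).str.contains(str(pattern))
    return df[~mask]

df = rowvaluedrop(df, 'a', 4)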
I want to add a multi-index column to an existing pandas dataframe df. An example:
d = {('a', 'b'): [1, 2, 3], ('c', 'd'): [4, 5, 6]}
df = pd.DataFrame(d)
The resulting dataframe is:
   a  c
   b  d
0  1  4
1  2  5
2  3  6
Now I want to add a new column to the dataframe. The correct way to do that would be to use df['e', 'f'] = [7,8,9]. However, I would like to use the list new_key as the key. Normally I could unpack it with the asterisk *, but apparently that cannot be used outside of function calls. So I get the following errors.
new_key = ['e', 'f']

df[new_key] = [7, 8, 9]
# KeyError: "['e' 'f'] not in index"

df[*new_key] = [7, 8, 9]
# SyntaxError: invalid syntax
Does anyone know how to solve this?
Cast to a tuple first; pandas treats a tuple as a single (multi-level) column key, whereas a list is interpreted as a list of separate column labels:
df[tuple(new_key)] = [7,8,9]
   a  c  e
   b  d  f
0  1  4  7
1  2  5  8
2  3  6  9
Let's say I have a data frame with such column names:
['a','b','c','d','e','f','g']
And I would like to change the names from 'c' to 'f' (actually, add a string to each of those column names), so the whole data frame's column names would look like this:
['a','b','var_c_equal','var_d_equal','var_e_equal','var_f_equal','g']
Well, firstly I made a function that changes column names with the string I want:
df.rename(columns=lambda x: 'or_'+x+'_no', inplace=True)
But now I really want to understand how to implement something like this:
df.loc[:,'c':'f'].rename(columns=lambda x: 'var_'+x+'_equal', inplace=True)
You can use a list comprehension for that, like:
Code:
new_columns = ['var_{}_equal'.format(c) if c in 'cdef' else c for c in df.columns]
Test Code:
import pandas as pd
df = pd.DataFrame({'a':(1,2), 'b':(1,2), 'c':(1,2), 'd':(1,2)})
print(df)
df.columns = ['var_{}_equal'.format(c) if c in 'cdef' else c
              for c in df.columns]
print(df)
Results:
   a  b  c  d
0  1  1  1  1
1  2  2  2  2

   a  b  var_c_equal  var_d_equal
0  1  1            1            1
1  2  2            2            2
One way is to use a dictionary instead of an anonymous function. The first two variations below assume the columns you need to rename are contiguous.
Contiguous columns by position
d = {k: 'var_'+k+'_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)
Contiguous columns by name
If you need to calculate the numerical indices:
cols = df.columns.get_loc
d = {k: 'var_'+k+'_equal' for k in df.columns[cols('c'):cols('f')+1]}
df = df.rename(columns=d)
Specifically identified columns
If you want to provide the columns explicitly:
d = {k: 'var_'+k+'_equal' for k in 'cdef'}
df = df.rename(columns=d)
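A quick sketch to check the by-name variant above, assuming the question's columns ['a','b','c','d','e','f','g']:

df = pd.DataFrame(columns=list('abcdefg'))
cols = df.columns.get_loc
d = {k: 'var_' + k + '_equal' for k in df.columns[cols('c'):cols('f') + 1]}
print(d)
# {'c': 'var_c_equal', 'd': 'var_d_equal', 'e': 'var_e_equal', 'f': 'var_f_equal'}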
Is there a way to slice a DataFrameGroupBy object?
For example, if I have:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3], 'B': ['x', 'y', 'z', 'r', 'p']})
   A  B
0  2  x
1  1  y
2  1  z
3  3  r
4  3  p
dfg = df.groupby('A')
Now, the returned GroupBy object is indexed by values from A, and I would like to select a subset of it, e.g. to perform aggregation. It could be something like
dfg.loc[1:2].agg(...)
or, for a specific column,
dfg['B'].loc[1:2].agg(...)
EDIT: To make it clearer: by slicing the GroupBy object I mean accessing only a subset of groups. In the above example, the GroupBy object will contain 3 groups, for A = 1, A = 2, and A = 3. For some reason, I may only be interested in the groups for A = 1 and A = 2.
It seems you need a custom function with iloc. Note that if you use agg, the function must return an aggregated value:
df = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[0:3]))
print(df)
A
1    y,z
2      x
3    r,p
Name: B, dtype: object

df = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[1:3]))
print(df)
A
1    z
2
3    p
Name: B, dtype: object
For multiple columns:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3],
                   'B': ['x', 'y', 'z', 'r', 'p'],
                   'C': ['g', 'y', 'y', 'u', 'k']})
print(df)
   A  B  C
0  2  x  g
1  1  y  y
2  1  z  y
3  3  r  u
4  3  p  k

df = df.groupby('A').agg(lambda x: ','.join(x.iloc[1:3]))
print(df)
   B  C
A
1  z  y
2
3  p  k
If I understand correctly, you only want some groups, but those are supposed to be returned completely:
   A  B
1  1  y
2  1  z
0  2  x
You can solve your problem by extracting the keys and then selecting groups based on those keys.
Assuming you already know the groups:
pd.concat([dfg.get_group(1), dfg.get_group(2)])
If you don't know the group names and are just looking for n arbitrary groups, this might work:
pd.concat([dfg.get_group(n) for n in list(dict(list(dfg)).keys())[:2]])
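A simpler equivalent uses the GroupBy object's groups attribute, which maps each group key to its row labels:
pd.concat([dfg.get_group(n) for n in list(dfg.groups)[:2]])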
The output in both cases is a normal DataFrame, not a DataFrameGroupBy object, so it might be smarter to first filter your DataFrame and only aggregate afterwards:
df[df['A'].isin([1,2])].groupby('A')
The same for unknown groups:
df[df['A'].isin(list(set(df['A']))[:2])].groupby('A')
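For example, keeping only the groups for A = 1 and A = 2 and then joining the B values, as in the earlier answer:
agg = df[df['A'].isin([1, 2])].groupby('A')['B'].agg(','.join)
print(agg)
A
1    y,z
2      x
Name: B, dtype: object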
I believe there are some Stack Overflow answers referring to this, like How to access pandas groupby dataframe by key.