Slice multiple columns that are not next to each other in dataframe - python

I want to slice multiple columns that are located several columns apart from each other, and I'm trying to do it without writing the same code repeatedly.
Take df (see below for an example), where the columns run from A to H and there are many rows containing some data (x).
How do I slice multiple arbitrarily spaced columns, say A, D, E, G, in a minimal amount of code? I don't want to repeat loc calls (df.loc['A'], df.loc['C:E'], df.loc['G']).
Can I generate a list and loop through it, or is there a shorter/quicker way?
Ultimately my goal is to drop the selected columns from the main DataFrame.
A B C D E F G H
0 x x x x x x x x
1 x x x x x x x x
2 x x x x x x x x
3 x x x x x x x x
4 x x x x x x x x

You can use the .iloc indexer to get columns by their position rather than their name, for example:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9],'D':[10,11,12],'E':[13,14,15]})
df2 = df.iloc[:, [0,2,4]]
print(df2)
output:
A C E
0 1 7 13
1 2 8 14
2 3 9 15
If you just need x random columns from a df that has y columns, you can use random.sample. For example, if you want 3 columns out of 5:
import random
cols = sorted(random.sample(range(0,5),k=3))
This gives cols, a sorted list of three distinct positions (thanks to sorted, the original column order is preserved), which you can pass to df.iloc[:, cols].
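Since the question names the columns (A, D, E, G) and ultimately wants to drop them, here is a minimal sketch of doing both by label. The toy frame and the variable names (wanted, selected, remaining, by_position) are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the one in the question: columns A..H, five rows of "x".
df = pd.DataFrame("x", index=range(5), columns=list("ABCDEFGH"))

wanted = ["A", "D", "E", "G"]        # labels to pick, in any order

selected = df[wanted]                # same result as df.loc[:, wanted]
remaining = df.drop(columns=wanted)  # the original frame minus those columns

# np.r_ concatenates single positions and slices into one index array,
# which helps when mixing ranges with individual columns.
by_position = df.iloc[:, np.r_[0, 3:5, 6]]   # positions 0, 3, 4, 6

print(selected.columns.tolist())
print(remaining.columns.tolist())
```

A plain list of labels is usually enough; np.r_ is only worth it when most of the selection is contiguous ranges.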


KeyError: in Pandas

I am trying to run a groupby with multiple grouping columns, an aggregate column, and an aggregate operator.
I receive all of the above as parameters to a method. I have to do the groupby:
result = df.groupby([groupByColumns])[aggColumn].agg(aggOperation)
Here
groupByColumns: clientId,state,branchId
aggColumn: amount
aggOperator: sum
But I am getting this error
KeyError: ''
I am not very experienced with pandas. How can I correct the statement above?
If groupByColumns is already a list, remove [] in groupby:
groupByColumns = ['clientId', 'state', 'branchId']
aggColumn = 'amount'
aggOperation = sum
out = df.groupby(groupByColumns)[aggColumn].agg(aggOperation)
# OR
out = df.groupby(['clientId', 'state', 'branchId'])['amount'].sum()
print(out)
# Output
clientId  state  branchId
A         M      X           3
          N      Y           3
B         M      X           9
          N      Y           6
Name: amount, dtype: int64
Setup:
df = pd.DataFrame({'clientId': list('AAABBB'),
'state': list('MMNMMN'),
'branchId': list('XXYXXY'),
'amount': range(1, 7)})
print(df)
# Output
clientId state branchId amount
0 A M X 1
1 A M X 2
2 A N Y 3
3 B M X 4
4 B M X 5
5 B N Y 6
groupby expects a flat (1D) list of column names. In your case groupByColumns is already ['clientId', 'state', 'branchId'], and wrapping it in another pair of brackets inside groupby turns it into a nested (2D) list of length 1. This is what is effectively being executed:
df.groupby([['clientId', 'state', 'branchId']])['amount'].sum()
Solution
As answered by @Corralien, use the same command without the extra brackets, so that groupby receives a 1D list, and it should work.
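Because the question says the grouping columns arrive as a method parameter, a small wrapper can normalise them before calling groupby. This is only a sketch: the helper name grouped_agg is hypothetical, and the comma-separated string input is an assumption about how the parameter might be passed.

```python
import pandas as pd

def grouped_agg(df, group_cols, agg_col, agg_op):
    """Group df by group_cols and aggregate agg_col with agg_op.

    group_cols may be a list of names or a comma-separated string
    (the string form is an assumption, not something the question confirms).
    """
    if isinstance(group_cols, str):
        group_cols = [name.strip() for name in group_cols.split(",")]
    # Pass the flat list straight through; do not wrap it in another list.
    return df.groupby(list(group_cols))[agg_col].agg(agg_op)

df = pd.DataFrame({"clientId": list("AAABBB"),
                   "state": list("MMNMMN"),
                   "branchId": list("XXYXXY"),
                   "amount": range(1, 7)})

out = grouped_agg(df, "clientId,state,branchId", "amount", "sum")
print(out)
```

Passing the operator as the string "sum" lets pandas pick its own optimised implementation.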

How to filter out dataframe into 2 based on particular column value?

I have one dataframe that I need to split into two dataframes.
Example:
Project_Number Indication
S100 X
S100 Y
S200 Z
S300 P
S300 Q
S300 R
S400 S
Now I have to split it in two based on Project_Number. If a particular Project_Number occurs more than once, its rows should go into the first dataframe; if it occurs only once, its rows should go into the second dataframe.
Output:
df1-
Project_Number Indication
S100 X
S100 Y
S300 P
S300 Q
S300 R
df2-
Project_Number Indication
S200 Z
S400 S
Use Series.duplicated with keep=False for all dupes:
m = df['Project_Number'].duplicated(keep=False)
df1 = df[m]
df2 = df[~m]
You can do this in a few steps using groupby() and duplicated():
df = pd.DataFrame([x.split(" ") for x in ("""S100 X
S100 Y
S200 Z
S300 P
S300 Q
S300 R
S400 S""").split("\n")], columns="Project_Number,Indication".split(","))
# groupby yields the False group (projects that occur once) first, then the True group (repeated projects)
(_, df2), (_, df1) = df.groupby(df['Project_Number'].duplicated(keep=False))
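Another way to build the same mask, sketched here on the question's sample data, is to count the rows per project with groupby(...).transform("size"). That also makes it easy to change the threshold later. The variable name counts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Project_Number": ["S100", "S100", "S200", "S300", "S300", "S300", "S400"],
    "Indication": list("XYZPQRS"),
})

# Number of rows per project, broadcast back onto every row.
counts = df.groupby("Project_Number")["Project_Number"].transform("size")

df1 = df[counts > 1]    # projects that occur more than once
df2 = df[counts == 1]   # projects that occur exactly once

print(df1)
print(df2)
```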

Pandas groupby aggregation with percentages

I have the following dataframe:
import pandas as pd
import numpy as np
np.random.seed(123)
n = 10
df = pd.DataFrame({"val": np.random.randint(1, 10, n),
"cat": np.random.choice(["X", "Y", "Z"], n)})
val cat
0 3 Z
1 3 X
2 7 Y
3 2 Z
4 4 Y
5 7 X
6 2 X
7 1 X
8 2 X
9 1 Y
I want to know the percentage each category X, Y, and Z has of the entire val column sum. I can aggregate df like this:
total_sum = df.val.sum()
#32
s = df.groupby("cat").val.sum().div(total_sum)*100
#this is the desired result in % of total val
cat
X 46.875 #15/32
Y 37.500 #12/32
Z 15.625 #5/32
Name: val, dtype: float64
However, I find it rather surprising that pandas seemingly does not have a percentage/frequency function, something like df.groupby("cat").val.freq(), alongside df.groupby("cat").val.sum() or df.groupby("cat").val.mean(). I assumed this is a common operation, and Series.value_counts implements it with normalize=True, but for groupby aggregation I cannot find anything similar. Am I missing something here, or is there indeed no out-of-the-box function?
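For what it's worth, here is a short sketch of one way to get the shares without storing the total separately: divide the grouped sums by their own sum. The values are copied from the table above rather than regenerated randomly, and the names sums and share are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"val": [3, 3, 7, 2, 4, 7, 2, 1, 2, 1],
                   "cat": list("ZXYZYXXXXY")})

sums = df.groupby("cat")["val"].sum()
share = sums / sums.sum() * 100   # percentage of the overall total per category

print(share)
```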

Multiplying pandas dataframe and series, element wise

Let's say I have a pandas DataFrame and a Series:
import pandas as pd
x = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })
y = pd.Series([-1, 1, -1])
I want to multiply x and y in such a way that I get z:
z = pd.DataFrame({0: [-1,2,-3], 1: [-4,5,-6], 2: [-7,8,-9] })
In other words, if element j of the series is -1, then all elements of the j-th row of x get multiplied by -1. If element k of the series is 1, then all elements of the k-th row of x get multiplied by 1.
How do I do this?
You can do that:
>>> new_x = x.mul(y, axis=0)
>>> new_x
0 1 2
0 -1 -4 -7
1 2 5 8
2 -3 -6 -9
Adding to the best answer: if the result is full of unexpected NaNs, the index of the series probably does not match the index of the dataframe (mul aligns on labels). In that case, multiply by the underlying values of the series instead:
new_x = df.mul(s.values, axis=0)
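To illustrate the NaN case, here is a minimal sketch that reuses x from the question but gives y a deliberately non-matching index (the index values 10, 11, 12 are made up for the example).

```python
import pandas as pd

x = pd.DataFrame({0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]})
# Same values as y in the question, but with a hypothetical non-matching index.
y = pd.Series([-1, 1, -1], index=[10, 11, 12])

aligned = x.mul(y, axis=0)             # no labels in common, so every cell is NaN
positional = x.mul(y.values, axis=0)   # plain array, multiplied row by row

print(aligned.isna().all().all())
print(positional)
```

Using .values drops the labels altogether, so it is only safe when the series is already in the same row order as the dataframe.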
As Abdou points out, the answer is
z = x.apply(lambda col: col*y)
Moreover, if y is a DataFrame instead, e.g.
y = pd.DataFrame({"colname": [1,-1,-1]})
then you can do
z = x.apply(lambda col: col*y["colname"])
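You can also multiply the two objects directly, but note that x * y aligns the series with the columns of x, so it scales columns rather than rows. To get the row-wise result, transpose before and after:
(x.T * y).T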

Rename specific column(s) in pandas

I've got a dataframe called data. How would I rename only one column header? For example, gdp to log(gdp)?
data =
y gdp cap
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
5 4 8 3
6 8 2 8
7 9 9 10
8 6 6 4
9 10 10 7
data.rename(columns={'gdp':'log(gdp)'}, inplace=True)
The rename docs show that it accepts a dict as the columns parameter, so you just pass a dict with a single entry.
A much faster alternative, if you need to rename a single column, is a list comprehension.
df.columns = ['log(gdp)' if x=='gdp' else x for x in df.columns]
If the need arises to rename multiple columns, either use conditional expressions like:
df.columns = ['log(gdp)' if x=='gdp' else 'cap_mod' if x=='cap' else x for x in df.columns]
Or, construct a mapping using a dictionary and perform the list comprehension with its get operation, setting the default value to the old name:
col_dict = {'gdp': 'log(gdp)', 'cap': 'cap_mod'} ## key→old name, value→new name
df.columns = [col_dict.get(x, x) for x in df.columns]
Timings:
%%timeit
df.rename(columns={'gdp':'log(gdp)'}, inplace=True)
10000 loops, best of 3: 168 µs per loop
%%timeit
df.columns = ['log(gdp)' if x=='gdp' else x for x in df.columns]
10000 loops, best of 3: 58.5 µs per loop
How do I rename a specific column in pandas?
From v0.24+, to rename one (or more) columns at a time, use either:
DataFrame.rename() with axis=1 or axis='columns' (the axis argument was introduced in v0.21).
Index.str.replace() for string/regex based replacement.
If you need to rename ALL columns at once, use:
DataFrame.set_axis() with axis=1. Pass a list-like sequence. Options are available for in-place modification as well.
rename with axis=1
df = pd.DataFrame('x', columns=['y', 'gdp', 'cap'], index=range(5))
df
y gdp cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
With 0.21+, you can now specify an axis parameter with rename:
df.rename({'gdp':'log(gdp)'}, axis=1)
# df.rename({'gdp':'log(gdp)'}, axis='columns')
y log(gdp) cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
(Note that rename is not in-place by default, so you will need to assign the result back.)
This addition has been made to improve consistency with the rest of the API. The new axis argument is analogous to the columns parameter—they do the same thing.
df.rename(columns={'gdp': 'log(gdp)'})
y log(gdp) cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
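Because rename returns a new DataFrame in both spellings, the result has to be captured. A minimal sketch using the same toy frame; the names before, via_axis and via_columns are illustrative.

```python
import pandas as pd

df = pd.DataFrame('x', columns=['y', 'gdp', 'cap'], index=range(5))
before = df.columns.tolist()

via_axis = df.rename({'gdp': 'log(gdp)'}, axis=1)        # mapping + axis
via_columns = df.rename(columns={'gdp': 'log(gdp)'})     # columns keyword

print(df.columns.tolist())         # the original frame still has 'gdp'
print(via_axis.columns.tolist())

df = via_axis                      # assign back to keep the new names
```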
rename also accepts a callback that is called once for each column.
df.rename(lambda x: x[0], axis=1)
# df.rename(lambda x: x[0], axis='columns')
y g c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
For this specific scenario, you would want to use
df.rename(lambda x: 'log(gdp)' if x == 'gdp' else x, axis=1)
Index.str.replace
Similar to the replace method of strings in Python, pandas Index and Series (object dtype only) define a ("vectorized") str.replace method for string and regex-based replacement.
df.columns = df.columns.str.replace('gdp', 'log(gdp)')
df
y log(gdp) cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
The advantage of this over the other methods is that str.replace supports regex (enabled by default in older releases; from pandas 2.0 you need to pass regex=True). See the docs for more information.
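A small sketch of the regex form, passing regex=True explicitly so that it does not depend on the installed version's default. The extra column gdp_per_cap is hypothetical and is there only to show why anchoring the pattern matters.

```python
import pandas as pd

df = pd.DataFrame('x', columns=['y', 'gdp', 'gdp_per_cap'], index=range(2))

# The ^ and $ anchors make the pattern match only the column named exactly 'gdp'.
df.columns = df.columns.str.replace(r'^gdp$', 'log(gdp)', regex=True)

print(df.columns.tolist())
```

Without the anchors, the substring gdp inside gdp_per_cap would be replaced as well.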
Passing a list to set_axis with axis=1
Call set_axis with a list of header(s). The list must be equal in length to the columns/index size. set_axis mutates the original DataFrame by default, but you can specify inplace=False to return a modified copy.
df.set_axis(['cap', 'log(gdp)', 'y'], axis=1, inplace=False)
# df.set_axis(['cap', 'log(gdp)', 'y'], axis='columns', inplace=False)
cap log(gdp) y
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
Note: In later releases inplace defaults to False, and the parameter was removed in pandas 2.0, where set_axis always returns a new object.
Method Chaining
Why choose set_axis when we already have an efficient way of assigning columns with df.columns = ...? As shown by Ted Petrou in this answer, set_axis is useful when trying to chain methods.
Compare
# new for pandas 0.21+
(df.some_method1()
   .some_method2()
   .set_axis(columns, axis=1)
   .some_method3())
Versus
# old way
df1 = (df.some_method1()
         .some_method2())
df1.columns = columns
df1.some_method3()
The former is more natural and free flowing syntax.
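A runnable version of the chained form might look like the sketch below. It assumes a pandas release in which set_axis returns a new object (1.0 or later), and the surrounding sort_values and reset_index calls are arbitrary stand-ins for some_method1 and some_method3.

```python
import pandas as pd

df = pd.DataFrame({'y': [1, 2], 'gdp': [3, 4], 'cap': [5, 6]})

result = (
    df.sort_values('gdp', ascending=False)
      .set_axis(['y', 'log(gdp)', 'cap'], axis=1)   # relabel every column mid-chain
      .reset_index(drop=True)
)

print(result)
```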
There are at least five different ways to rename specific columns in pandas, and I have listed them below along with links to the original answers. I also timed these methods and found them to perform about the same (though YMMV depending on your data set and scenario). The test case below is to rename columns A M N Z to A2 M2 N2 Z2 in a dataframe with columns A to Z containing a million rows.
# Import required modules
import numpy as np
import pandas as pd
import timeit
# Create sample data
df = pd.DataFrame(np.random.randint(0,9999,size=(1000000, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
# Standard way - https://stackoverflow.com/a/19758398/452587
def method_1():
    df_renamed = df.rename(columns={'A': 'A2', 'M': 'M2', 'N': 'N2', 'Z': 'Z2'})
# Lambda function - https://stackoverflow.com/a/16770353/452587
def method_2():
    df_renamed = df.rename(columns=lambda x: x + '2' if x in ['A', 'M', 'N', 'Z'] else x)
# Mapping function - https://stackoverflow.com/a/19758398/452587
def rename_some(x):
    if x=='A' or x=='M' or x=='N' or x=='Z':
        return x + '2'
    return x
def method_3():
    df_renamed = df.rename(columns=rename_some)
# Dictionary comprehension - https://stackoverflow.com/a/58143182/452587
def method_4():
    df_renamed = df.rename(columns={col: col + '2' for col in df.columns[
        np.asarray([i for i, col in enumerate(df.columns) if 'A' in col or 'M' in col or 'N' in col or 'Z' in col])
    ]})
# Dictionary comprehension - https://stackoverflow.com/a/38101084/452587
def method_5():
    df_renamed = df.rename(columns=dict(zip(df[['A', 'M', 'N', 'Z']], ['A2', 'M2', 'N2', 'Z2'])))
print('Method 1:', timeit.timeit(method_1, number=10))
print('Method 2:', timeit.timeit(method_2, number=10))
print('Method 3:', timeit.timeit(method_3, number=10))
print('Method 4:', timeit.timeit(method_4, number=10))
print('Method 5:', timeit.timeit(method_5, number=10))
Output:
Method 1: 3.650640267
Method 2: 3.163998427
Method 3: 2.998530871
Method 4: 2.9918436889999995
Method 5: 3.2436501520000007
Use the method that is most intuitive to you and easiest for you to implement in your application.
Use the pandas.DataFrame.rename function.
Check its documentation for a description.
data.rename(columns = {'gdp': 'log(gdp)'}, inplace = True)
If you intend to rename multiple columns then
data.rename(columns = {'gdp': 'log(gdp)', 'cap': 'log(cap)', ..}, inplace = True)
df.rename(columns=lambda x: {"My_sample": "My_sample_new_name"}.get(x, x))
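We can also rename by rebuilding the table:
df = pd.DataFrame()
column_names = mydataframe.columns
# iterate over column positions, not over rows
for i in range(len(column_names)):
    column = mydataframe.iloc[:, i]
    # drop the last 8 characters of the old name and append the new text
    df[column_names[i][:-8] + "desired_text"] = column
print(df.columns)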
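The same renaming rule can probably be written without a loop by passing a function to rename. A sketch, where the frame and its column names (ending in the 8-character suffix "original") are hypothetical:

```python
import pandas as pd

# Hypothetical frame whose column names all end in the 8-character suffix "original".
mydataframe = pd.DataFrame({"price_original": [1, 2], "qty_original": [3, 4]})

# Same rule as the loop: drop the last 8 characters, then append the new text.
df = mydataframe.rename(columns=lambda name: name[:-8] + "desired_text")

print(df.columns.tolist())
```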
