Selecting values from a series in pandas - python

I have a dataset D with 26 columns, named A to Z. I ran some tests and stored the columns that are useful to me in a Series S.
D #Dataset with columns from A - Z
S
B 0.78
C 1.04
H 2.38
S holds the column names and a value for each, so I now know their importance and would like to keep only those columns in the dataset, e.g. (B, C, D). How can I do it?

IIUC you can use:
cols = ['B','C','D']
df = df[cols]
Or, if the column names are stored as the values of a Series:
S = pd.Series(['B','C','D'])
df = df[S]
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
S = pd.Series(['B','C','D'])
print (S)
0 B
1 C
2 D
dtype: object
print (df[S])
B C D
0 4 7 1
1 5 8 3
2 6 9 5
Or, if the column names are the index of the Series:
S = pd.Series([1,2,3], index=['B','C','D'])
print (S)
B 1
C 2
D 3
dtype: int64
print (df[S.index])
B C D
0 4 7 1
1 5 8 3
2 6 9 5
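For the Series from the question, where the column names are the index and the scores are the values, a minimal sketch could look like the following. The frame D here is made-up sample data, and the 1.0 cutoff is only an assumed threshold for illustration. Index.intersection is used so that names in S that are missing from D do not raise a KeyError.

```python
import pandas as pd

# Hypothetical stand-in for the 26-column dataset D
D = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6],
                  'H': [7, 8], 'Z': [9, 10]})

# Scores from the question: index = column name, value = importance
S = pd.Series([0.78, 1.04, 2.38], index=['B', 'C', 'H'])

# Keep every column named in S (the intersection skips names missing from D)
kept = D[D.columns.intersection(S.index)]

# Keep only the columns whose score exceeds an assumed cutoff of 1.0
important = D[S[S > 1.0].index]

print(kept)
print(important)
```

Filtering S first and then indexing with its index keeps the selection tied to the scores, so changing the cutoff changes the columns that survive.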

Related

Summing every row of dataframe with a series

I am trying to sum every row of a dataframe with a series.
I have a dataframe with [107 rows and 42 columns] and a series of length 42. I would like to sum every row with the series such that every column in the dataframe would have the same number added to it. I tried df.add(series) but the result was a dataframe with 107 rows and 84 columns with all NaN values.
For example
dataframe:
Index a b c
d 1 2 3
e 4 5 6
f 7 8 9
g 0 0 0
series: 1 2 3
result would be
Index a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
You can use DataFrame.add or + with a NumPy array if the index values of the Series differ from the column names:
s = pd.Series([1,2,3])
df = df.add(s.to_numpy())
#alternative
#df = df + s.to_numpy()
print (df)
a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
Or assign the column names to the index of the Series so that the values align by label:
s = pd.Series([1,2,3])
s.index = df.columns
df = df.add(s)
#alternative
#df = df + s
print (df)
a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
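The all-NaN result in the question comes from label alignment: DataFrame.add matches the Series index against the column names, and a default index of 0, 1, 2 matches none of a, b, c. A small sketch reproducing this on the sample data (the variable names misaligned and by_position are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 7, 0], 'b': [2, 5, 8, 0], 'c': [3, 6, 9, 0]},
                  index=['d', 'e', 'f', 'g'])
s = pd.Series([1, 2, 3])  # default index 0, 1, 2

# Labels a, b, c never match 0, 1, 2, so the result holds the union
# of both label sets and every cell is NaN
misaligned = df.add(s)
print(misaligned.shape)

# Dropping the labels makes the addition purely positional
by_position = df.add(s.to_numpy())
print(by_position)
```

With 42 columns and a length-42 Series, the same mismatch yields the 84 columns reported in the question.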

coding 2 columns in pandas with the same key

I need to encode the following two columns. When I use cat.codes, the two columns do not get the same codes. What I want is for equal values to get the same code in both columns.
Example:
The input is a dataframe
col1 col2
0 A E
1 B F
2 C A
3 D B
4 A B
5 E A
Assuming this input as df:
col1 col2
0 A E
1 B F
2 C A
3 D B
4 A B
5 E A
You can compute the unique values and use them to factorize:
vals = df[['col1', 'col2']].stack().unique()
d = {k:v for v,k in enumerate(vals)}
df['col1_codes'] = df['col1'].map(d)
df['col2_codes'] = df['col2'].map(d)
output:
col1 col2 col1_codes col2_codes
0 A E 0 1
1 B F 2 3
2 C A 4 0
3 D B 5 2
4 A B 0 2
5 E A 1 0
You can also try scikit-learn's LabelEncoder. Given this df:
a b
0 apple nokia
1 xiomi samsung
2 samsung apple
3 moto oneplus
import pandas as pd
from sklearn import preprocessing
cat_var = list(df.a.values)+list(df.b.values)
le = preprocessing.LabelEncoder()
le.fit(cat_var)
df['a'] = le.transform(df.a)
df['b'] = le.transform(df.b)
This gives the following output:
a b
0 0 2
1 5 4
2 4 0
3 1 3
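Since the question mentions cat.codes, another option is to give both columns the same categorical dtype, so that the codes come from one shared category list. This is a sketch under the assumption that sorted order is an acceptable code order. It therefore produces different numbers from the stack().unique() mapping above, which numbers values by first appearance.

```python
import pandas as pd

df = pd.DataFrame({'col1': list('ABCDAE'), 'col2': list('EFABBA')})

# One category list covering the values of both columns
categories = sorted(set(df['col1']) | set(df['col2']))
shared = pd.CategoricalDtype(categories=categories)

# cat.codes is the position of each value in the shared category list
df['col1_codes'] = df['col1'].astype(shared).cat.codes
df['col2_codes'] = df['col2'].astype(shared).cat.codes

print(df)
```

Because both columns use the same dtype, a value such as A gets the same code wherever it appears.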

Aggregate data frame rows based on conditions

I have this table
A B C E
1 2 1 3
1 2 4 4
2 7 1 1
3 4 0 2
3 4 8 3
Now, I want to remove duplicates based on column A and B and at the same time sum up column C. For E, it should take the value where C shows the max value. The desirable result table should look like this:
A B C E
1 2 5 4
2 7 1 1
3 4 8 3
I tried this: df.groupby(['A', 'B']).sum()['C'] but my data frame does not change at all as I am thinking that I didn't incorporate the E column part properly...Can somebody advise?
Thanks so much!
If the first and second rows are duplicates, we can group by them.
In [20]: df
Out[20]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
In [21]: df.groupby(['A', 'B'])['C'].sum()
Out[21]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
"I tried this: df.groupby(['A', 'B']).sum()['C'] but my data frame does not change at all"
Yes, that is because groupby returns a new object and does not modify the original DataFrame:
In [22]: df
Out[22]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
You have to reassign the result explicitly:
In [23]: df = df.groupby(['A', 'B'])['C'].sum()
In [24]: df
Out[24]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
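The answer above covers only the sum of C. For the E requirement (take E from the row where C is largest in each group), one possible sketch on the table from the question is to sort by C first, so that the last row of every group is the one with the maximum C. This is an assumed approach rather than part of the original answer, and ties in C would be resolved arbitrarily.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8],
                   'E': [3, 4, 1, 2, 3]})

# After sorting by C, 'last' picks E from the row with the largest C,
# while 'sum' still adds up all C values of the group
result = (df.sort_values('C')
            .groupby(['A', 'B'], as_index=False)
            .agg(C=('C', 'sum'), E=('E', 'last')))

print(result)
```

As with the plain sum, the result is a new DataFrame, so it has to be assigned to a name to be kept.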

How do I multiply a pandas column with a part of a multi index dataframe

I have a data frame with a multi index and one column.
The index fields are type and amount, and the column is called count.
I would like to add a column that multiplies amount and count.
df2 = df.groupby(['type','amount']).count().copy()
# I then dropped all columns but one and renamed it to "count"
df2['total_amount'] = df2['count'].multiply(df2['amount'], axis='index')
This doesn't work: I get a KeyError on amount.
How do I access a part of the multi index to use it in calculations?
Use GroupBy.transform, which returns a Series of aggregated values with the same size as the original df, so the two can be multiplied:
count = df.groupby(['type','amount'])['type'].transform('count')
df['total_amount'] = df['amount'].multiply(count, axis='index')
print (df)
A amount C D E type total_amount
0 a 4 7 1 5 a 8
1 b 5 8 3 3 a 5
2 c 4 9 5 6 a 8
3 d 5 4 7 9 b 10
4 e 5 2 1 2 b 10
5 f 4 3 0 4 b 4
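Or aggregate first and multiply by the amount level of the MultiIndex, shown here with the sample data: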
df = pd.DataFrame({'A':list('abcdef'),
'amount':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'type':list('aaabbb')})
print (df)
A amount C D E type
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df2 = df.groupby(['type','amount'])['type'].count().to_frame('count')
df2['total_amount'] = df2['count'].mul(df2.index.get_level_values('amount'))
print (df2)
count total_amount
type amount
a 4 2 8
5 1 5
b 4 1 4
5 2 10
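If a flat result is acceptable, another way to reach the index levels is to turn them back into ordinary columns with reset_index. A minimal sketch, assuming only the type and amount columns matter for the count:

```python
import pandas as pd

df = pd.DataFrame({'type': list('aaabbb'),
                   'amount': [4, 5, 4, 5, 5, 4]})

# size() counts the rows per (type, amount) pair; reset_index moves both
# index levels back into columns and names the counts 'count'
df2 = df.groupby(['type', 'amount']).size().reset_index(name='count')

# amount is a regular column again, so no KeyError
df2['total_amount'] = df2['amount'] * df2['count']

print(df2)
```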

Pandas show column number

Is there any way to get pandas to show the column number and the column name at the same time? I'm dealing with a dataset with more than 30 columns, all with very long names and some differing only slightly from each other. It's an absolute chore to type out the names when writing code. (I would still need to see the column names to know which columns to select.)
Thanks.
One possible solution is to create a MultiIndex and then select columns with DataFrame.xs:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
df.columns = pd.MultiIndex.from_arrays([pd.RangeIndex(len(df.columns)), df.columns])
print (df)
0 1 2 3 4 5
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
print (df.xs(2, level=0, axis=1))
C
0 7
1 8
2 9
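If changing the column index is not desirable, a lighter alternative is to print a position-to-name lookup once and then select by position with iloc. This is a sketch on the same sample frame; the names lookup and subset are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 4, 3]})

# Lookup table: position on the left, column name on the right
lookup = pd.Series(df.columns)
print(lookup)

# Select columns by position instead of typing out the names
subset = df.iloc[:, [1, 3]]
print(subset)
```

This leaves df.columns untouched, so code that selects by name keeps working.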
