Value_counts on multiple columns with groupby - Python

I need some help with Pandas.
I have the following DataFrame:
df = pd.DataFrame({'1Country': ['FR', 'FR', 'GER', 'GER', 'IT', 'IT', 'FR', 'GER', 'IT'],
                   '2City': ['Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Paris', 'Berlin', 'Rome'],
                   'F1': ['A', 'B', 'C', 'B', 'B', 'C', 'A', 'B', 'C'],
                   'F2': ['B', 'C', 'A', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'F3': ['C', 'A', 'B', 'C', 'C', 'C', 'A', 'B', 'C']})
I am trying to do a groupby on the first two columns, 1Country and 2City, and run value_counts on columns F1 and F2. So far I have only been able to group and run value_counts on one column at a time with
df.groupby(['1Country','2City'])['F1'].apply(pd.Series.value_counts)
How can I run value_counts on multiple columns and get a DataFrame as a result?

You could use agg, something along these lines:
df.groupby(['1Country','2City']).agg({i:'value_counts' for i in df.columns[2:]})
                    F1   F2   F3
FR  Paris   A      2.0  1.0  2.0
            B      1.0  1.0  NaN
            C      NaN  1.0  1.0
GER Berlin  A      NaN  2.0  NaN
            B      2.0  1.0  2.0
            C      1.0  NaN  1.0
IT  Rome    B      1.0  1.0  NaN
            C      2.0  2.0  3.0

You can pass a dict to agg as follows:
df.groupby(['1Country', '2City']).agg({'F1': 'value_counts', 'F2': 'value_counts'})
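For completeness, a melt-based sketch (a variant of my own, not from either answer) builds the same wide table without the dict comprehension: reshape to long form, run value_counts per group and variable, then unstack the variable names back into columns.
counts = (df.melt(id_vars=['1Country', '2City'])        # long form: one variable/value pair per row
            .groupby(['1Country', '2City', 'variable'])['value']
            .value_counts()
            .unstack('variable'))                       # F1/F2/F3 back to columns
Up to row ordering, this matches the agg output above.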

Different ways to get unique get_level_values()

Consider the following DataFrame df:
kind       A               B
names     a1      a2      b1      b2      b3
Time
0.0   0.7804  0.5294  0.1895  0.9195  0.0508
0.1   0.1703  0.7095  0.8704  0.8566  0.5513
0.2   0.8147  0.9055  0.0506  0.4212  0.2464
0.3   0.3985  0.4515  0.7118  0.6146  0.2682
0.4   0.2505  0.2752  0.4097  0.3347  0.1296
When I issue the command levs = df.columns.get_level_values("kind"), I get that levs is equal to
Index(['A', 'A', 'A', 'B', 'B'], dtype='object', name='kind')
whereas I would like levs to be Index(['A', 'B'], dtype='object', name='kind').
One way to achieve this is levs = list(set(levs)), but I am wondering whether there are other simple methods.
I think you can use levels:
out = df.columns.levels[0]
print(out)
Index(['A', 'B'], dtype='object')
EDIT: One idea, looking the level up by name via the MultiIndex names:
d = {v: k for k, v in enumerate(df.columns.names)}
print(d)
{'kind': 0, 'names': 1}
out = df.columns.levels[d['kind']]
print(out)
Index(['A', 'B'], dtype='object', name='kind')
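Another short option is to chain unique onto get_level_values; unlike columns.levels[0], it only reports labels actually present (levels can retain unused entries, e.g. after slicing a frame):
out = df.columns.get_level_values('kind').unique()
print(out)
Index(['A', 'B'], dtype='object', name='kind')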

PANDAS - Rename and combine like columns

I am trying to rename a column and combine that renamed column with others like it. The row indexes will not be the same (i.e. I am not combining 'City' and 'State' from two columns).
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
                   'Col_2': ['D', 'E', 'F'],
                   'Col_one': ['G', 'H', 'I']})
df.rename(columns={'Col_one': 'Col_1'}, inplace=True)
# Desired output:
({'Col_1': ['A', 'B', 'C', 'G', 'H', 'I'],
  'Col_2': ['D', 'E', 'F', '-', '-', '-']})
I've tried pd.concat and a few other things, but it fails to combine the columns in a way I'm expecting. Thank you!
This is melt and pivot after you have renamed. cumcount numbers the values within each variable, which gives pivot a unique index to spread the values over:
u = df.melt()
out = (u.assign(k=u.groupby("variable").cumcount())
        .pivot(index="k", columns="variable", values="value")
        .fillna('-'))
out = out.rename_axis(index=None, columns=None)
print(out)
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -
Using append without modifying the actual dataframe:
result = (df[['Col_1', 'Col_2']]
          .append(df[['Col_one']].rename(columns={'Col_one': 'Col_1'}),
                  ignore_index=True)
          .fillna('-'))
OUTPUT:
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -
Might be a slightly longer method than the other answers, but the below delivers the required output.
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
                   'Col_2': ['D', 'E', 'F'],
                   'Col_one': ['G', 'H', 'I']})
# Keep the values we want to retain (a Series, not a list)
TempList = df['Col_one']
# Append them to the existing dataframe as extra Col_1 rows
df = df.append(pd.DataFrame({'Col_1': TempList}), ignore_index=True)
# Drop the redundant column
df.drop(columns=['Col_one'], inplace=True)
# Populate NaN with -
df.fillna('-', inplace=True)
Output is
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -
Using concat should work.
import pandas as pd

df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
                   'Col_2': ['D', 'E', 'F'],
                   'Col_one': ['G', 'H', 'I']})
df2 = pd.DataFrame()
# Stack Col_1 and Col_one into a single column
df2['Col_1'] = pd.concat([df['Col_1'], df['Col_one']], axis=0)
df2 = df2.reset_index(drop=True)
# Col_2 aligns on index, so the appended rows get NaN
df2['Col_2'] = df['Col_2']
df2['Col_2'] = df2['Col_2'].fillna('-')
print(df2)
prints
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -
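A caveat on the two append-based answers above: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea needs pd.concat. A minimal equivalent of the first of them:
result = pd.concat([df[['Col_1', 'Col_2']],
                    df[['Col_one']].rename(columns={'Col_one': 'Col_1'})],
                   ignore_index=True).fillna('-')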

How to replace values in a pandas dataframe with NaN when another dataframe encodes which values mean missing or unknown

Dataframe 1 (800,000 rows, columns C1 ... C85):

        C1  C2  C3  ...  C85
1
2
3
...
800000

Columns with missing values across rows:
0        32
100      10
200       7
300       7
400       6
1000      5
2000      3
3000      3
9000      3
12000     2
13000     1
15000     1
20000     1
30000     1
40000     1
50000     1
60000     1

Dataframe 2:

attribute  missing_or_unknown
C1         [-1,X]
C2         [XX]
.          .
.          .
C85        []

Missing values sorted by value_counts():
[-1]        26
[-1,9]      17
[-1,0]      16
[0]         12
[]          10
[-1,0,9]     1
[-1,XX]      1
[-1,X]       1
[XX]         1
Need
Dataframe 1 is the master table; it has many missing or unknown values that need to be cleaned up. That determination has to happen by referencing Dataframe 2 and using the encoded indicators in its missing_or_unknown column.
Approach
To do that, I was trying to concat the two dataframes and see if I can add the missing_or_unknown column to Dataframe 1, before proceeding to use the replace function to replace those indicators with np.nan.
Question
How do I concatenate the two dataframes when they don't have the same number of rows? Basically, the first dataframe's columns are rows in the second dataframe.
I suggest that you transpose Dataframe 2, replace the column headings with the values of the first row, and then concatenate Dataframe 1 and Dataframe 2. After this, you can operate on the first row of the resulting dataframe to replace the encoded indicators with NaN values.
Here is a sample of this:
import pandas as pd

dummy_data1 = {
    'C1': ['11', '12', '13', '14', '15', '16', '17', '18', '19', '20'],
    'C2': ['A', 'E', 'I', 'M', 'Q', 'A', 'E', 'I', 'M', 'Q'],
    'C3': ['B', 'F', 'J', 'N', 'R', 'B', 'F', 'J', 'N', 'R'],
    'C4': ['C', 'G', 'K', 'O', 'S', 'C', 'G', 'K', 'O', 'S'],
    'C5': ['D', 'H', 'L', 'P', 'T', 'D', 'H', 'L', 'P', 'T']}
df1 = pd.DataFrame(dummy_data1, columns=['C1', 'C2', 'C3', 'C4', 'C5'])

dummy_data2 = {
    'attribute': ['C1', 'C2', 'C4', 'C5', 'C3'],
    'missing_or_unknown': ['X1', 'X2', 'X4', 'X5', 'X3']}
df2 = pd.DataFrame(dummy_data2, columns=['attribute', 'missing_or_unknown'])

# Transpose so the attributes become columns, then promote the first
# row (the attribute names) to the column header
df2_transposed = df2.transpose()
print("df2_transposed=\n", df2_transposed)
df2_transposed.columns = df2_transposed.iloc[0]
df2_transposed = df2_transposed.drop(df2_transposed.index[0])
print("df2_transposed with HEADER replaced=\n", df2_transposed)

# The marker row now lines up with df1's columns, so they can be concatenated
df_new = pd.concat([df2_transposed, df1])
print("df_new=\n", df_new)

How to extract an index (multi-level) from a dataframe

mydf = pd.DataFrame({'dts': ['1/1/2000', '1/1/2000', '1/1/2000', '1/2/2000', '1/3/2000', '1/3/2000'],
                     'product': ['A', 'B', 'A', 'A', 'A', 'B'],
                     'value': [1, 2, 2, 3, 6, 1]})
a = mydf.groupby(['dts', 'product']).sum()
So a now has a multi-level index:
a
Out[1]:
                  value
dts      product
1/1/2000 A            3
         B            2
1/2/2000 A            3
1/3/2000 A            6
         B            1
How do I extract the product-level index from a? a.index['product'] does not work.
Using get_level_values
>>> a.index.get_level_values(1)
Index(['A', 'B', 'A', 'A', 'B'], dtype='object', name='product')
You can also use the name of the level:
>>> a.index.get_level_values('product')
Index(['A', 'B', 'A', 'A', 'B'], dtype='object', name='product')
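Note that get_level_values repeats a label for every row it covers; if each product is wanted only once, unique chains on directly:
>>> a.index.get_level_values('product').unique()
Index(['A', 'B'], dtype='object', name='product')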

Can I replace several values at once in a dataframe?

Now I have to do it like this:
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
df['column'] = (df['column'].str.replace('A', 'cat').replace('B', 'rabit')
                .replace('C', 'octpath').replace('D', 'spider')
                .replace('E', 'mammoth').replace('F', 'snake')
                .replace('G', 'starfish'))
But I think this is long and unreadable.
Do you know a simple solution?
Here is another approach using pandas.Series.replace:
d = {'A':'cat','B':'rabit', 'C':'octpath','D':'spider','E':'mammoth','F':'snake','G':'starfish'}
df['column'] = df['column'].replace(d)
Output:
column
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 snake
6 starfish
7 -
You can define a dict of your replacement values and call map on the column, passing in your dict. Values that are not in the dict come back as NaN (na_action='ignore' only skips NaN inputs), so to keep your existing values, call fillna with the original column:
In[60]:
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
d = {'A':'cat','B':'rabit', 'C':'octpath','D':'spider','E':'mammoth','F':'snake','G':'starfish'}
df['column'] = df['column'].map(d, na_action='ignore').fillna(df['column'])
df
Out[60]:
column
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 snake
6 starfish
7 -
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
mapper = {'A': 'cat', 'B': 'rabit', 'C': 'octpath', 'D': 'spider', 'E': 'mammoth'}
df['column'] = df.column.apply(lambda x: mapper.get(x))
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 None
6 None
7 None
In case you want to set a default value, get takes one directly:
df['column'] = df.column.apply(lambda x: mapper.get(x, "pandas"))
df.column
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 pandas
6 pandas
7 pandas
Greetings from Shibuya!
