How to extract an index level from a multi-level DataFrame index - Python

import pandas as pd

mydf = pd.DataFrame({'dts': ['1/1/2000', '1/1/2000', '1/1/2000', '1/2/2000', '1/3/2000', '1/3/2000'],
                     'product': ['A', 'B', 'A', 'A', 'A', 'B'],
                     'value': [1, 2, 2, 3, 6, 1]})
a = mydf.groupby(['dts', 'product']).sum()
So a has a multi-level index now:
a
Out[1]:
                  value
dts      product
1/1/2000 A            3
         B            2
1/2/2000 A            3
1/3/2000 A            6
         B            1
How do I extract the product-level index from a? a.index['product'] does not work.

Using get_level_values
>>> a.index.get_level_values(1)
Index(['A', 'B', 'A', 'A', 'B'], dtype='object', name='product')
You can also use the name of the level:
>>> a.index.get_level_values('product')
Index(['A', 'B', 'A', 'A', 'B'], dtype='object', name='product')
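If you only need the distinct labels of a level rather than one entry per row, unique works on the result (a small sketch, reusing a from above):
>>> a.index.get_level_values('product').unique()
Index(['A', 'B'], dtype='object', name='product')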

Related

PANDAS - Rename and combine like columns

I am trying to rename a column and combine that renamed column with others like it. The row indexes will not be the same (i.e., I am not combining 'City' and 'State' from two columns).
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
                   'Col_2': ['D', 'E', 'F'],
                   'Col_one': ['G', 'H', 'I']})
df.rename(columns={'Col_one': 'Col_1'}, inplace=True)

# Desired output:
# {'Col_1': ['A', 'B', 'C', 'G', 'H', 'I'],
#  'Col_2': ['D', 'E', 'F', '-', '-', '-']}
I've tried pd.concat and a few other things, but it fails to combine the columns in a way I'm expecting. Thank you!
This is melt and pivot after you have renamed:
u = df.melt()
out = (u.assign(k=u.groupby("variable").cumcount())
        .pivot(index="k", columns="variable", values="value")
        .fillna('-'))
out = out.rename_axis(index=None, columns=None)
print(out)
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -
Using append without modifying the actual dataframe:
result = (df[['Col_1', 'Col_2']]
          .append(df[['Col_one']].rename(columns={'Col_one': 'Col_1'}),
                  ignore_index=True)
          .fillna('-'))
Output:
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -
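Note that DataFrame.append was removed in pandas 2.0; on current pandas the same result can be written with pd.concat. A minimal equivalent sketch:
# same result via concat (append was removed in pandas 2.0)
result = pd.concat([df[['Col_1', 'Col_2']],
                    df[['Col_one']].rename(columns={'Col_one': 'Col_1'})],
                   ignore_index=True).fillna('-')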
This might be a slightly longer method than the other answers, but it delivers the required output.
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
                   'Col_2': ['D', 'E', 'F'],
                   'Col_one': ['G', 'H', 'I']})
# Series of the values we want to retain
TempList = df['Col_one']
# Append the values to the existing dataframe as new Col_1 rows
df = df.append(pd.DataFrame({'Col_1': TempList}), ignore_index=True)
# Drop the redundant column
df.drop(columns=['Col_one'], inplace=True)
# Populate NaN with -
df.fillna('-', inplace=True)
Output:
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -
Using concat should work.
import pandas as pd

df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
                   'Col_2': ['D', 'E', 'F'],
                   'Col_one': ['G', 'H', 'I']})
df2 = pd.DataFrame()
df2['Col_1'] = pd.concat([df['Col_1'], df['Col_one']], axis=0)
df2 = df2.reset_index(drop=True)
df2['Col_2'] = df['Col_2']
df2['Col_2'] = df2['Col_2'].fillna('-')
print(df2)
This prints:
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -

Value_counts on multiple columns with groupby

I need some help with Pandas.
I have the following dataframe:
df = pd.DataFrame({'1Country': ['FR', 'FR', 'GER', 'GER', 'IT', 'IT', 'FR', 'GER', 'IT'],
                   '2City': ['Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Paris', 'Berlin', 'Rome'],
                   'F1': ['A', 'B', 'C', 'B', 'B', 'C', 'A', 'B', 'C'],
                   'F2': ['B', 'C', 'A', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'F3': ['C', 'A', 'B', 'C', 'C', 'C', 'A', 'B', 'C']})
I am trying to group by the first two columns, 1Country and 2City, and run value_counts on columns F1 and F2. So far I have only been able to do groupby and value_counts on one column at a time:
df.groupby(['1Country','2City'])['F1'].apply(pd.Series.value_counts)
How can I do value_counts on multiple columns and get a dataframe as a result?
You could use agg, something along these lines:
df.groupby(['1Country', '2City']).agg({i: 'value_counts' for i in df.columns[2:]})

                  F1   F2   F3
FR  Paris   A    2.0  1.0  2.0
            B    1.0  1.0  NaN
            C    NaN  1.0  1.0
GER Berlin  A    NaN  2.0  NaN
            B    2.0  1.0  2.0
            C    1.0  NaN  1.0
IT  Rome    B    1.0  1.0  NaN
            C    2.0  2.0  3.0
You can pass a dict to agg as follows:
df.groupby(['1Country', '2City']).agg({'F1': 'value_counts', 'F2': 'value_counts'})
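Another way to get the same counts, as a sketch, is to reshape to long form first and count within each group (column names as in the question):
# melt to long form, then count each letter per (country, city) and source column
long_df = df.melt(id_vars=['1Country', '2City'])
counts = (long_df.groupby(['1Country', '2City', 'value'])['variable']
                 .value_counts()
                 .unstack(fill_value=0))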

Sort or group a dataframe in Python using a given list

I have a given dataframe:
   Id          Direction  Load  Unit
1  CN05059815  LoadFWD    0,0   NaN
2  CN05059815  LoadBWD    0,0   NaN
4  ...
and a given list:
list = ['CN05059830', 'CN05059946', 'CN05060010', 'CN05060064', ...]
I would like to sort or group the data by the elements of the list.
For example, the sorted data should follow exactly the ordering of the list: the rows whose Id does not belong to the list (such as CN05059815) come first, followed by CN05059830, CN05059946, ... in the list's order, keeping all the other data.
One way is to use Categorical Data. Here's a minimal example:
# sample dataframe
df = pd.DataFrame({'col': ['A', 'B', 'C', 'D', 'E', 'F']})
# required ordering
lst = ['D', 'E', 'A', 'B']
# convert to categorical
df['col'] = df['col'].astype('category')
# set order, adding values not in lst to the front
order = list(set(df['col']) - set(lst)) + lst
# attach ordering information to categorical series
df['col'] = df['col'].cat.reorder_categories(order)
# apply ordering
df = df.sort_values('col')
print(df)
  col
2   C
5   F
3   D
4   E
0   A
1   B
Consider the approach below:
df = pd.DataFrame({'col': ['a', 'b', 'c', 'd', 'e']})
list_ = ['d', 'b', 'a']
print(df)
Output:
  col
0   a
1   b
2   c
3   d
4   e
Then, to sort df according to the list's ordering (values not in the list first):
df.reindex(df['col']
           .apply(lambda x: list_.index(x) if x in list_ else -1)
           .sort_values()
           .index)
Output:
  col
2   c
4   e
3   d
1   b
0   a
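On pandas 1.1+, the key argument of sort_values can express the same ordering without building an intermediate index. A sketch using the toy df and list_ above:
# rank values by their position in list_; unknown values get -1 and sort first
rank = {v: i for i, v in enumerate(list_)}
df.sort_values('col', key=lambda s: s.map(rank).fillna(-1))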

Can I replace multiple values at once in a DataFrame?

Right now I do it like this:
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
df['column'] = (df['column'].str.replace('A', 'cat')
                            .replace('B', 'rabit')
                            .replace('C', 'octpath')
                            .replace('D', 'spider')
                            .replace('E', 'mammoth')
                            .replace('F', 'snake')
                            .replace('G', 'starfish'))
But I think this is long and unreadable.
Do you know a simpler solution?
Here is another approach using pandas.Series.replace:
d = {'A': 'cat', 'B': 'rabit', 'C': 'octpath', 'D': 'spider',
     'E': 'mammoth', 'F': 'snake', 'G': 'starfish'}
df['column'] = df['column'].replace(d)
Output:
     column
0       cat
1     rabit
2   octpath
3    spider
4   mammoth
5     snake
6  starfish
7         -
You can define a dict of your replacement values and call map on the column, passing in your dict. Values that are not present in the dict will map to NaN (you can pass na_action='ignore' so existing NaN values are left untouched); since you want to keep your existing values where there is no mapping, call fillna with the original column:
In [60]:
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
d = {'A': 'cat', 'B': 'rabit', 'C': 'octpath', 'D': 'spider',
     'E': 'mammoth', 'F': 'snake', 'G': 'starfish'}
df['column'] = df['column'].map(d, na_action='ignore').fillna(df['column'])
df

Out[60]:
     column
0       cat
1     rabit
2   octpath
3    spider
4   mammoth
5     snake
6  starfish
7         -
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
mapper = {'A': 'cat', 'B': 'rabit', 'C': 'octpath', 'D': 'spider', 'E': 'mammoth'}
df['column'] = df.column.apply(lambda x: mapper.get(x))
0        cat
1      rabit
2    octpath
3     spider
4    mammoth
5       None
6       None
7       None
In case you want to set a default value:
df['column'] = df.column.apply(lambda x: mapper.get(x) if mapper.get(x) is not None else "pandas")
df.column
0        cat
1      rabit
2    octpath
3     spider
4    mammoth
5     pandas
6     pandas
7     pandas
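Since dict.get also accepts a default value, the same fallback can be written more compactly:
# get() returns "pandas" directly when the key is missing
df['column'] = df.column.apply(lambda x: mapper.get(x, "pandas"))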
Greetings from Shibuya.

Mapping values from one DataFrame to another

I am trying to figure out a fast and clean way to map values from one DataFrame to another. Say I have a DataFrame A like this one:
   C1 C2 C3 C4 C5
1  a  b  c  a
2  d  a  e  b  a
3  a  c
4  b  e  e
Now I want to change those letter codes to actual values. My DataFrame B with the explanations looks like this:
  Code  Value
1    a  'House'
2    b  'Bike'
3    c  'Lamp'
4    d  'Window'
5    e  'Car'
So far my brute-force approach has been to go through every element in A and look up its value in B with isin(). I know that I can also use a Series (or a simple dictionary) as B instead of a DataFrame, with for example the Code column as the index, but I would still need multiple loops to map everything.
Is there any other nice way to achieve my goal?
You could use replace:
A.replace(B.set_index('Code')['Value'])
import pandas as pd

A = pd.DataFrame({'C1': ['a', 'd', 'a', 'b'],
                  'C2': ['b', 'a', 'c', 'e'],
                  'C3': ['c', 'e', '', 'e'],
                  'C4': ['a', 'b', '', ''],
                  'C5': ['', 'a', '', '']})
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'd', 'e'],
                  'Value': ["'House'", "'Bike'", "'Lamp'", "'Window'", "'Car'"]})
print(A.replace(B.set_index('Code')['Value']))
yields
         C1       C2      C3       C4       C5
0   'House'   'Bike'  'Lamp'  'House'
1  'Window'  'House'   'Car'   'Bike'  'House'
2   'House'   'Lamp'
3    'Bike'    'Car'   'Car'
Another alternative is map. Although it requires looping over columns, if I didn't mess up the tests, it is still faster than replace:
import numpy as np

A = pd.DataFrame(np.random.choice(list("abcdef"), (1000, 1000)))
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'd', 'e'],
                  'Value': ["'House'", "'Bike'", "'Lamp'", "'Window'", "'Car'"]})
B = B.set_index("Code")["Value"]

%timeit A.replace(B)
1 loop, best of 3: 970 ms per loop

C = pd.DataFrame()

%%timeit
for col in A:
    C[col] = A[col].map(B).fillna(A[col])
1 loop, best of 3: 586 ms per loop
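The explicit loop can also be folded into DataFrame.apply, which runs the same map column by column; a sketch with the same A and B as in the timing test:
# map each column through B, keeping the original value where B has no entry
C = A.apply(lambda col: col.map(B).fillna(col))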
