Rename name in Python Pandas MultiIndex - python

I'm trying to rename a column in a pandas MultiIndex, but it doesn't work. Here you can see my Series object. By the way, why does the DataFrame df_injury_record become a Series object in this function?
Frequency_BodyPart = df_injury_record.groupby(["Surface","BodyPart"]).size()
The next line shows my attempt to rename the index level.
Frequency_BodyPart.rename_axis(index={'Surface': 'Class'})
But afterwards the level still has the same name.
Regards

One possible problem is a pandas version below 0.24 (where rename_axis with a dict is not supported), or forgetting to assign the result back, as mentioned by #anky_91:
df_injury_record = pd.DataFrame({'Surface': list('aaaabbbbddd'),
                                 'BodyPart': list('abbbbdaaadd')})
Frequency_BodyPart = df_injury_record.groupby(["Surface","BodyPart"]).size()
print (Frequency_BodyPart)
Surface BodyPart
a a 1
b 3
b a 2
b 1
d 1
d a 1
d 2
dtype: int64
Frequency_BodyPart = Frequency_BodyPart.rename_axis(index={'Surface': 'Class'})
print (Frequency_BodyPart)
Class BodyPart
a a 1
b 3
b a 2
b 1
d 1
d a 1
d 2
dtype: int64
If you want a 3-column DataFrame that also works with older pandas versions:
df = Frequency_BodyPart.reset_index(name='count').rename(columns={'Surface': 'Class'})
print (df)
Class BodyPart count
0 a a 1
1 a b 3
2 b a 2
3 b b 1
4 b d 1
5 d a 1
6 d d 2
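As for the side question: GroupBy.size() produces one number per group, so pandas returns a Series indexed by the group keys rather than a DataFrame. In recent pandas (1.1+, an assumption about your version), you can pass as_index=False to get a DataFrame directly and rename with the ordinary columns mapping; a sketch:

```python
import pandas as pd

df_injury_record = pd.DataFrame({'Surface': list('aaaabbbbddd'),
                                 'BodyPart': list('abbbbdaaadd')})

# as_index=False makes GroupBy.size return a DataFrame with a
# 'size' column, so a plain rename(columns=...) applies
freq = (df_injury_record
        .groupby(['Surface', 'BodyPart'], as_index=False).size()
        .rename(columns={'Surface': 'Class'}))
print(freq)
```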

Related

Adding pandas series on end of each pandas dataframe's row

I've had trouble finding a concise way to append a series to each row of a dataframe, with the series labels becoming new columns in the df. All the values will be the same on each of the dataframe's rows, which is desired.
I can get the effect by doing the following:
df["new_col_A"] = ser["new_col_A"]
.....
df["new_col_Z"] = ser["new_col_Z"]
But this is so tedious there must be a better way, right?
Given:
# df
A B
0 1 2
1 1 3
2 4 6
# ser
C a
D b
dtype: object
Doing:
df[ser.index] = ser
print(df)
Output:
A B C D
0 1 2 a b
1 1 3 a b
2 4 6 a b
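A non-mutating alternative, assuming the series labels are valid Python identifiers, is DataFrame.assign, which broadcasts each scalar down a new column; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 4], 'B': [2, 3, 6]})
ser = pd.Series({'C': 'a', 'D': 'b'})

# **ser expands the series into keyword arguments; each scalar
# value is broadcast over every row of its new column
out = df.assign(**ser)
print(out)
```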

Find all groups in which specific values show up

I'm new to Python and Pandas. I have the following DataFrame:
import pandas as pd
df = pd.DataFrame( {'a':['A','A','B','B','B','C','C','C'], 'b':[1,3,1,2,3,1,3,3]})
a b
0 A 1
1 A 3
2 B 1
3 B 2
4 B 3
5 C 1
6 C 3
7 C 3
I would like to create a new DataFrame containing only those groups from column a that have both the values 1 and 2 in column b, that is:
a b
0 B 1
1 B 2
2 B 3
I know we can create groups using df.groupby('a'), and the method df.all() seems to be related to this, but I can't figure it out by myself. It seems like it should be straightforward. Any help?
Use GroupBy.filter + Series.any:
new_df=df.groupby('a').filter(lambda x: x.b.eq(2).any() & x.b.eq(1).any())
print(new_df)
a b
2 B 1
3 B 2
4 B 3
We could also use:
new_df=df[df.groupby('a').transform(lambda x: x.eq(1).any() & x.eq(2).any()).b]
print(new_df)
a b
2 B 1
3 B 2
4 B 3
Another approach:
import numpy as np

s = (pd.DataFrame(df['b'].values == np.array([[1], [2]])).T
       .groupby(df['a'])
       .transform('any')
       .all(1))
df[s]
Output:
a b
2 B 1
3 B 2
4 B 3
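A third option, sketched along the same lines: build a boolean mask with a set-containment check per group, which reads closely to the requirement "both 1 and 2 appear":

```python
import pandas as pd

df = pd.DataFrame({'a': list('AABBBCCC'), 'b': [1, 3, 1, 2, 3, 1, 3, 3]})

# True for every row whose group contains both 1 and 2 in column b;
# the scalar result of the lambda is broadcast over each group
mask = df.groupby('a')['b'].transform(lambda s: {1, 2} <= set(s))
print(df[mask])
```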

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what's the best way to duplicate rows whose groupby(['Type']) count is less than 3, up to a count of 3? df is the input and df1 is my desired outcome; you can see that row 3 from df was duplicated 2 times at the end. This is only an example deck. The real data has approximately 20 million lines and 400K unique Types, so a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I don't know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: the sort=False argument to append was added in pandas 0.23.0; remove it if you are on a lower version.
EDIT: If the data contains multiple value columns, set all columns as the index except one, then repeat and reset_index:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
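In pandas 2.x, DataFrame.append has been removed, so the same idea can be sketched with pd.concat plus Index.repeat (following the same per-row repeat logic as above, written against the example data):

```python
import pandas as pd

df = pd.DataFrame({'Type': list('aaabccc'), 'Val': [1, 2, 3, 1, 3, 2, 1]})

counts = df['Type'].value_counts()
need = 3 - counts[counts < 3]                      # extra copies per Type
extra = df[df['Type'].isin(need.index)]
# repeat each under-represented row the required number of times
extra = extra.loc[extra.index.repeat(extra['Type'].map(need))]
out = pd.concat([df, extra], ignore_index=True)
print(out)
```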

How do you concatenate two single rows in pandas?

I am trying to select a bunch of single rows from a bunch of dataframes and make a new dataframe by concatenating them together.
Here is a simple example
x=pd.DataFrame([[1,2,3],[1,2,3]],columns=["A","B","C"])
A B C
0 1 2 3
1 1 2 3
a=x.loc[0,:]
A 1
B 2
C 3
Name: 0, dtype: int64
b=x.loc[1,:]
A 1
B 2
C 3
Name: 1, dtype: int64
c=pd.concat([a,b])
I end up with this:
A 1
B 2
C 3
A 1
B 2
C 3
Name: 0, dtype: int64
Whereas I would expect the original data frame:
A B C
0 1 2 3
1 1 2 3
I can get the values and create a new dataframe, but this doesn't seem like the way to do it.
If you want to stack the two series as rows (vertical stacking), one option is a concat along axis=1 followed by a transpose. Another is np.vstack:
import numpy as np

pd.DataFrame(np.vstack([a, b]), columns=a.index)
A B C
0 1 2 3
1 1 2 3
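The concat-and-transpose option mentioned above looks like this as a sketch:

```python
import pandas as pd

x = pd.DataFrame([[1, 2, 3], [1, 2, 3]], columns=['A', 'B', 'C'])
a = x.loc[0, :]
b = x.loc[1, :]

# axis=1 lines the two series up as columns; .T flips them back
# into rows, restoring the original shape
c = pd.concat([a, b], axis=1).T
print(c)
```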
Since you are slicing by index, I'd use .iloc, and note the difference between [[]] and [], which return a DataFrame and a Series respectively*
a = x.iloc[[0]]
b = x.iloc[[1]]
pd.concat([a, b])
# A B C
#0 1 2 3
#1 1 2 3
To still use .loc, you'd do something like
a = x.loc[[0,]]
b = x.loc[[1,]]
*There's a small caveat that if index 0 is duplicated in x then x.loc[0,:] will return a DataFrame and not a Series.
It looks like you want to make a new dataframe from a collection of records. There's a method for that:
import pandas as pd
x = pd.DataFrame([[1,2,3],[1,2,3]], columns=["A","B","C"])
a = x.loc[0,:]
b = x.loc[1,:]
c = pd.DataFrame.from_records([a, b])
print(c)
# A B C
# 0 1 2 3
# 1 1 2 3

Convert DataFrame to Series and vice versa / Delete columns from Series or DataFrame

I'm trying to convert this dataframe into a series, or the series into a dataframe (basically one into the other), in order to be able to do operations with it. My second problem is wanting to delete the first column of the dataframe below (before or after converting doesn't really matter), or to be able to delete an entry from a series.
I searched for similar questions but they did not correspond to my issue.
Thanks in advance; here are the dataframe and the series.
JOUR FL_AB_PCOUP FL_ABER_NEGA FL_AB_PMAX FL_AB_PSKVA FL_TROU_PDC \
0 2018-07-09 -0.448787 0.0 1.498464 -0.197012 1.001577
CDC_INCOMPLET_HORS_ABERRANTS CDC_COMPLET_HORS_ABERRANTS CDC_ABSENT \
0 -0.729002 -1.03586 1.032936
CDC_ABERRANTS PRM_X_PDC_ZERO mean.msr.pdc sd.msr.pdc sum.msr.pdc \
0 1.49976 -0.497693 -1.243274 -1.111366 0.558516
FL_AB_PCOUP 8.775974e-05
FL_ABER_NEGA 0.000000e+00
FL_AB_PMAX 1.865632e-03
FL_AB_PSKVA 2.027215e-05
FL_TROU_PDC 2.222952e-02
FL_AB_COMBI 1.931156e-03
CDC_INCOMPLET_HORS_ABERRANTS 1.562195e-03
CDC_COMPLET_HORS_ABERRANTS 9.758743e-01
CDC_ABSENT 2.063239e-02
CDC_ABERRANTS 1.931156e-03
PRM_X_PDC_ZERO 2.127753e+01
mean.msr.pdc 1.125987e+03
sd.msr.pdc 1.765955e+03
sum.msr.pdc 3.310615e+08
n.resil 3.884103e-04
dtype: float64
Setup:
df = pd.DataFrame({'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4]})
print (df)
B C D E
0 4 7 1 5
1 5 8 3 3
2 4 9 5 6
3 5 4 7 9
4 5 2 1 2
5 4 3 0 4
To go from DataFrame to Series, select a row, e.g. by position with iloc or by index label with loc:
#select some row, e.g. first
s = df.iloc[0]
print (s)
B 4
C 7
D 1
E 5
Name: 0, dtype: int64
And for Series to DataFrame use to_frame with transpose if necessary:
df = s.to_frame().T
print (df)
B C D E
0 4 7 1 5
Last, to remove a column from a DataFrame use DataFrame.drop:
df = df.drop('B',axis=1)
print (df)
C D E
0 7 1 5
And to remove a value from a Series use Series.drop:
s = s.drop('C')
print (s)
B 4
D 1
E 5
Name: 0, dtype: int64
You can also delete a particular column by position:
df = df.drop(df.columns[i], axis=1)
Note that pd.Series(df) will not convert a DataFrame to a Series; select a single row or column instead.
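For the one-row case in this question, DataFrame.squeeze is a concise round-trip between the two types; a small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'B': [4], 'C': [7], 'D': [1], 'E': [5]})

# a single-row (or single-column) DataFrame collapses to a Series
s = df.squeeze()
# and to_frame().T turns it back into a one-row DataFrame
back = s.to_frame().T
print(s)
print(back)
```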
