How to select multiple columns in pandas without the deprecated .ix

In Python, .ix has been deprecated for slicing a pandas DataFrame since pandas 0.20.0. The official documentation offers alternatives using either .loc or .iloc for hybrid selection, that is, mixing labels and positions (http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html). The .index attribute can help extract multiple rows. By contrast, columns.get_loc seems able to select only one column at a time. Is there an alternative function that can extract multiple columns in a hybrid manner using .iloc?

Yes, the method is Index.get_indexer. Given a list of names, it returns the integer positions of the matching column or index labels.
Use it like this:
df = pd.DataFrame({
'a':[4,5,4,5,5,4],
'b':[7,8,9,4,2,3],
'c':[1,3,5,7,1,0],
'd':[5,3,6,9,2,4],
}, index=list('ABCDEF'))
print (df)
a b c d
A 4 7 1 5
B 5 8 3 3
C 4 9 5 6
D 5 4 7 9
E 5 2 1 2
F 4 3 0 4
cols = ['a','b','c']
df1 = df.iloc[1, df.columns.get_indexer(cols)]
print (df1)
a 5
b 8
c 3
Name: B, dtype: int64
df11 = df.iloc[[1], df.columns.get_indexer(cols)]
print (df11)
a b c
B 5 8 3
idx = ['A','C']
df2 = df.iloc[df.index.get_indexer(idx), 2:]
print (df2)
c d
A 1 5
C 5 6
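
One caveat worth keeping in mind: Index.get_indexer returns -1 for any label it cannot find, and .iloc reads -1 as "last position" instead of raising an error. The following is a minimal sketch of a guard around that, reusing the sample frame above. The helper name positions_of is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 4, 5, 5, 4],
    'b': [7, 8, 9, 4, 2, 3],
    'c': [1, 3, 5, 7, 1, 0],
    'd': [5, 3, 6, 9, 2, 4],
}, index=list('ABCDEF'))


def positions_of(index, labels):
    """Hypothetical helper: integer positions of labels, raising on unknown ones."""
    pos = index.get_indexer(labels)
    # get_indexer marks labels it cannot find with -1
    missing = [label for label, p in zip(labels, pos) if p == -1]
    if missing:
        raise KeyError(f"labels not found: {missing}")
    return pos


# Rows by position, columns by name
subset = df.iloc[1:3, positions_of(df.columns, ['a', 'c'])]

# The same selection with .loc, converting the row positions to labels instead
same = df.loc[df.index[1:3], ['a', 'c']]
```

Which direction to convert (names to positions for .iloc, or positions to labels for .loc) is mostly a matter of taste. The .loc form raises on unknown labels by itself.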

Related

Pandas melt function using column index positions rather than column names

Is there a way to pass column index positions, rather than column names, as the column arguments of melt?
Every example I see uses column names in value_vars. I need to use the column positions.
For instance, instead of:
df2 = pd.melt(df,value_vars=['asset1','asset2'])
Using something similar to:
df2 = pd.melt(df,value_vars=[0,1])
Select the column names by position through df.columns:
df = pd.DataFrame({
'asset1':list('acacac'),
'asset2':[4]*6,
'A':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4]
})
df2 = pd.melt(df,
id_vars=df.columns[[0,1]],
value_vars=df.columns[[2,3]],
var_name= 'c_name',
value_name='Value')
print (df2)
asset1 asset2 c_name Value
0 a 4 A 7
1 c 4 A 8
2 a 4 A 9
3 c 4 A 4
4 a 4 A 2
5 c 4 A 3
6 a 4 D 1
7 c 4 D 3
8 a 4 D 5
9 c 4 D 7
10 a 4 D 1
11 c 4 D 0
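
If this pattern comes up often, it could be wrapped in a small function. The sketch below assumes the same sample frame as above; melt_by_position is a hypothetical name, not a pandas function. It converts the positions to plain lists of names before calling pd.melt.

```python
import pandas as pd

df = pd.DataFrame({
    'asset1': list('acacac'),
    'asset2': [4] * 6,
    'A': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
})


def melt_by_position(frame, id_pos, value_pos, **kwargs):
    """Hypothetical wrapper: call pd.melt with column positions instead of names."""
    id_vars = frame.columns[id_pos].tolist()
    value_vars = frame.columns[value_pos].tolist()
    return pd.melt(frame, id_vars=id_vars, value_vars=value_vars, **kwargs)


long_df = melt_by_position(df, [0, 1], [2, 3],
                           var_name='c_name', value_name='Value')
```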

Pandas - How to swap column contents leaving label sequence intact?

I am using pandas v0.25.3 and am inexperienced but learning.
I have a dataframe and would like to swap the contents of two columns, leaving the column labels and their sequence intact.
df = pd.DataFrame ({"A": [(1),(2),(3),(4)],
'B': [(5),(6),(7),(8)],
'C': [(9),(10),(11),(12)]})
This yields a dataframe,
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want to swap column contents B and C to get
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
I tried looking at pd.DataFrame.values, which led me to NumPy arrays and advanced slicing, and I got lost.
What's the simplest way to do this?
You can assign numpy array:
#pandas 0.24+
df[['B','C']] = df[['C','B']].to_numpy()
#older pandas versions
df[['B','C']] = df[['C','B']].values
Or use DataFrame.assign:
df = df.assign(B = df.C, C = df.B)
print (df)
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
Or just use:
df['B'], df['C'] = df['C'], df['B'].copy()
print(df)
Output:
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
You can also swap the labels:
df.columns = ['A','C','B']
If your DataFrame is very large, I believe this would require less from your computer than copying all the data.
If the order of the columns is important, you can then reorder them:
df = df.reindex(['A','B','C'], axis=1)
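
The label-swapping idea can also be written without typing out the full column list, which may help on wide frames. This is a minimal sketch that assumes the column names are unique; it uses DataFrame.rename with a two-way mapping and then restores the original column order.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'C': [9, 10, 11, 12]})

original_order = list(df.columns)

# Swap the two labels in one step, then put the columns back in their old order
swapped = df.rename(columns={'B': 'C', 'C': 'B'})[original_order]
```

rename returns a new DataFrame, so df itself is left unchanged here.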

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what is the best way to duplicate rows so that every group in groupby(['Type']) with a count below 3 ends up with 3 rows? df is the input and df1 is my desired outcome: row 3 from df is duplicated twice at the end. This is only a toy example. The real data has approximately 20 million rows and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Thought about using something like the following but do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
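Note: the sort=False argument of append is available in pandas>=0.23.0; remove it if using a lower version. DataFrame.append itself was deprecated in pandas 1.4 and removed in 2.0, where pd.concat is the replacement.
EDIT: If the data contains multiple value columns, set all columns except one as the index, repeat, and then call reset_index: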
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)
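
For recent pandas versions, where DataFrame.append no longer exists, the same idea can be sketched with pd.concat and positional repetition. This is an illustrative variant, not the answer's original code. The names target, extra and padded are assumptions. It repeats only the first row of each short group, so a group of two rows ends at exactly three (repeating every row of a two-row group once would give four).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': list('aaabccc'),
                   'Val': [1, 2, 3, 1, 3, 2, 1]})

target = 3  # assumed minimum number of rows per Type

# Size of the group each row belongs to
group_size = df['Type'].map(df['Type'].value_counts())

# Only the first row of each group is a candidate for repetition
is_first = ~df['Type'].duplicated()

# Extra copies per row: (target - size) for the first row of short groups, else 0
extra = (target - group_size).clip(lower=0).where(is_first, 0).astype(int)

# Repeat row positions, then append the repeated rows after the originals
positions = np.repeat(np.arange(len(df)), extra.to_numpy())
padded = pd.concat([df, df.iloc[positions]], ignore_index=True)
```

Everything here is vectorized, so it should avoid a Python-level groupby.apply on the 400K groups, although it has not been timed on data of that size.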

Convert DataFrame to Series and vice versa / Delete columns from a Series or DataFrame

I am trying to convert this DataFrame into a Series, or the Series into a DataFrame (basically one into the other), in order to be able to do operations between them. My second problem is that I want to delete the first column of the DataFrame below (before or after converting doesn't really matter), or be able to delete an entry from a Series.
I searched for similar questions but they did not correspond to my issue.
Thanks in advance here are the dataframe and the series.
JOUR FL_AB_PCOUP FL_ABER_NEGA FL_AB_PMAX FL_AB_PSKVA FL_TROU_PDC \
0 2018-07-09 -0.448787 0.0 1.498464 -0.197012 1.001577
CDC_INCOMPLET_HORS_ABERRANTS CDC_COMPLET_HORS_ABERRANTS CDC_ABSENT \
0 -0.729002 -1.03586 1.032936
CDC_ABERRANTS PRM_X_PDC_ZERO mean.msr.pdc sd.msr.pdc sum.msr.pdc \
0 1.49976 -0.497693 -1.243274 -1.111366 0.558516
FL_AB_PCOUP 8.775974e-05
FL_ABER_NEGA 0.000000e+00
FL_AB_PMAX 1.865632e-03
FL_AB_PSKVA 2.027215e-05
FL_TROU_PDC 2.222952e-02
FL_AB_COMBI 1.931156e-03
CDC_INCOMPLET_HORS_ABERRANTS 1.562195e-03
CDC_COMPLET_HORS_ABERRANTS 9.758743e-01
CDC_ABSENT 2.063239e-02
CDC_ABERRANTS 1.931156e-03
PRM_X_PDC_ZERO 2.127753e+01
mean.msr.pdc 1.125987e+03
sd.msr.pdc 1.765955e+03
sum.msr.pdc 3.310615e+08
n.resil 3.884103e-04
dtype: float64
Setup:
df = pd.DataFrame({'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4]})
print (df)
B C D E
0 4 7 1 5
1 5 8 3 3
2 4 9 5 6
3 5 4 7 9
4 5 2 1 2
5 4 3 0 4
To go from a DataFrame to a Series, select a row, e.g. by position with iloc or by index label with loc:
#select some row, e.g. first
s = df.iloc[0]
print (s)
B 4
C 7
D 1
E 5
Name: 0, dtype: int64
To go from a Series to a DataFrame, use to_frame, with a transpose if necessary:
df = s.to_frame().T
print (df)
B C D E
0 4 7 1 5
Finally, to remove a column from a DataFrame, use DataFrame.drop:
df = df.drop('B',axis=1)
print (df)
C D E
0 7 1 5
And to remove a value from a Series, use Series.drop:
s = s.drop('C')
print (s)
B 4
D 1
E 5
Name: 0, dtype: int64
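You can delete a particular column by its position i (drop returns a new DataFrame, so assign the result):
df = df.drop(df.columns[i], axis=1)
To convert a one-row or one-column DataFrame to a Series:
s = df.squeeze()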
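
Applied to data shaped like the question's, the steps above might combine as follows. This is a sketch under assumptions: only three of the question's columns are reproduced, the name weights for the Series is made up, and the multiplication merely stands in for whatever operation is actually needed.

```python
import pandas as pd

# One-row DataFrame with a date column, as in the question (truncated to 3 columns)
row_df = pd.DataFrame({'JOUR': ['2018-07-09'],
                       'FL_AB_PCOUP': [-0.448787],
                       'FL_ABER_NEGA': [0.0],
                       'FL_AB_PMAX': [1.498464]})

# Series indexed by the same names ("weights" is a hypothetical name)
weights = pd.Series({'FL_AB_PCOUP': 8.775974e-05,
                     'FL_ABER_NEGA': 0.0,
                     'FL_AB_PMAX': 1.865632e-03})

# Drop the first column, then take the single row as a Series
row = row_df.drop(columns='JOUR').iloc[0]

# Arithmetic between two Series aligns on the index labels
product = row * weights
```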

How can I sort one column without changing other columns in pandas?

Example:
Current df looks like:
df=
A B
1 5
2 6
3 8
4 1
I want the resulting df to be like this (B is sorted and A remains untouched):
df=
A B
1 8
2 6
3 5
4 1
You need to bypass pandas' index alignment, the mechanism that normally keeps data consistent by matching values to their index labels. Assigning a 1D NumPy array or a plain Python list does the trick: neither has an index, so pandas cannot align and assigns by position instead:
df['B'] = df['B'].sort_values(ascending=False).values
or
df['B'] = df['B'].sort_values(ascending=False).tolist()
both yield:
In [77]: df
Out[77]:
A B
0 1 8
1 2 6
2 3 5
3 4 1
You can also do this:
df['B'] = sorted(df['B'].tolist())[::-1]
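
To see why stripping the index matters, the short comparison below assigns the sorted column once as a Series and once as a bare array. It is a minimal illustration using the question's data; the variable names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 8, 1]})

sorted_b = df['B'].sort_values(ascending=False)

# Assigning a Series: values are matched back to their original index labels,
# so the column ends up in its old order
aligned = df.copy()
aligned['B'] = sorted_b

# Assigning a bare array: values are placed by position, so the sort sticks
positional = df.copy()
positional['B'] = sorted_b.to_numpy()
```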
