How to keep pandas indexes when extracting columns

How to keep pandas indexes when extracting columns - python

So I have a file 500 columns by 600 rows and want to take the average of all columns for rows 200-400:
df = pd.read_csv('file.csv', sep= '\s+')
sliced_df=df.iloc[200:400]
Then create a new column of the averages of all rows across all columns. And extract only that newly created column:
sliced_df['mean'] = sliced_df.mean(axis=1)
final_df = sliced_df['mean']
But how can I prevent the indexes from resetting when I extract the new column?

I think is not necessary create new column in sliced_df, only rename name of Series and if need output as DataFrame add to_frame. Indexes are not resetting, see sample bellow:
#random dataframe
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
#in real data use df.iloc[200:400]
sliced_df=df.iloc[2:4]
print (sliced_df)
A B C D E
2 2 2 1 0 8
3 4 0 9 6 2
final_ser = sliced_df.mean(axis=1).rename('mean')
print (final_ser)
2 2.6
3 4.2
Name: mean, dtype: float64
final_df = sliced_df.mean(axis=1).rename('mean').to_frame()
print (final_df)
mean
2 2.6
3 4.2
Python counts from 0, so maybe need change slice from 200:400 to 100:300, see difference:
sliced_df=df.iloc[1:3]
print (sliced_df)
A B C D E
1 0 4 2 5 2
2 2 2 1 0 8
final_ser = sliced_df.mean(axis=1).rename('mean')
print (final_ser)
1 2.6
2 2.6
Name: mean, dtype: float64
final_df = sliced_df.mean(axis=1).rename('mean').to_frame()
print (final_df)
mean
1 2.6
2 2.6

Use copy() function as follows:
df = pd.read_csv('file.csv', sep= '\s+')
sliced_df=df.iloc[200:400].copy()
sliced_df['mean'] = sliced_df.mean(axis=1)
final_df = sliced_df['mean'].copy()

Related

Pandas - How to swap column contents leaving label sequence intact?

I am using pandas v0.25.3. and am inexperienced but learning.
I have a dataframe and would like to swap the contents of two columns leaving the columns labels and sequence intact.
df = pd.DataFrame ({"A": [(1),(2),(3),(4)],
'B': [(5),(6),(7),(8)],
'C': [(9),(10),(11),(12)]})
This yields a dataframe,
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want to swap column contents B and C to get
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
I have tried looking at pd.DataFrame.values which sent me to numpy array and advanced slicing and got lost.
Whats the simplest way to do this?.

You can assign numpy array:
#pandas 0.24+
df[['B','C']] = df[['C','B']].to_numpy()
#oldier pandas versions
df[['B','C']] = df[['C','B']].values
Or use DataFrame.assign:
df = df.assign(B = df.C, C = df.B)
print (df)
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8

Or just use:
df['B'], df['C'] = df['C'], df['B'].copy()
print(df)
Output:
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8

You can also swap the labels:
df.columns = ['A','C','B']
If your DataFrame is very large, I believe this would require less from your computer than copying all the data.
If the order of the columns is important, you can then reorder them:
df = df.reindex(['A','B','C'], axis=1)

pandas add a column with only one row

This sounds a bit weird, but I think that's exactly what I needed now:
I got several pandas dataframes that contains columns with float numbers, for example:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
Now I want to add a column, with only one row, and the value is equal to the average of column 'a', in this case, is 3.0. So the new dataframe will looks like this:
a b c average
0 0 1 2 3.0
1 3 4 5
2 6 7 8
And all the rows below are empty.
I've tried things like df['average'] = np.mean(df['a']) but that give me a whole column of 3.0. Any help will be appreciated.

Assign a series, this is cleaner.
df['average'] = pd.Series(df['a'].mean(), index=df.index[[0]])
Or, even better, assign with loc:
df.loc[df.index[0], 'average'] = df['a'].mean().item()
Filling NaNs is straightforward, you can do
df['average'] = df['average'].fillna('')
df
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8

Can do something like:
df['average'] = [np.mean(df['a'])]+['']*(len(df)-1)

Here is a full example:
import pandas as pd
import numpy as np
df = pd.DataFrame(
[(0,1,2), (3,4,5), (6,7,8)],
columns=['a', 'b', 'c'])
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df['average'] = ''
df['average'][0] = df['a'].mean()
print(df)
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8

Duplicate row of low occurrence in pandas dataframe

In the following dataset what's the best way to duplicate row with groupby(['Type']) count < 3 to 3. df is the input, and df1 is my desired outcome. You see row 3 from df was duplicated by 2 times at the end. This is only an example deck. the real data has approximately 20mil lines and 400K unique Types, thus a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Thought about using something like the following but do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.

Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.
EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)

Select rows of pandas dataframe from list, in order of list

The question was originally asked here as a comment but could not get a proper answer as the question was marked as a duplicate.
For a given pandas.DataFrame, let us say
df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
How can we select rows from a list, based on values in a column ('A' for instance)
For instance
# from
list_of_values = [3,4,6]
# we would like, as a result
# A B
# 2 3 3
# 3 4 5
# 1 6 2
Using isin as mentioned here is not satisfactory as it does not keep order from the input list of 'A' values.
How can the abovementioned goal be achieved?

One way to overcome this is to make the 'A' column an index and use loc on the newly generated pandas.DataFrame. Eventually, the subsampled dataframe's index can be reset.
Here is how:
ret = df.set_index('A').loc[list_of_values].reset_index(inplace=False)
# ret is
# A B
# 0 3 3
# 1 4 5
# 2 6 2
Note that the drawback of this method is that the original indexing has been lost in the process.
More on pandas indexing: What is the point of indexing in pandas?

Use merge with helper DataFrame created by list and with column name of matched column:
df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3,5]})
list_of_values = [3,6,4]
df1 = pd.DataFrame({'A':list_of_values}).merge(df)
print (df1)
A B
0 3 3
1 6 2
2 4 5
For more general solution:
df = pd.DataFrame({'A' : [5,6,5,3,4,4,6,5], 'B':range(8)})
print (df)
A B
0 5 0
1 6 1
2 5 2
3 3 3
4 4 4
5 4 5
6 6 6
7 5 7
list_of_values = [6,4,3,7,7,4]
#create df from list
list_df = pd.DataFrame({'A':list_of_values})
print (list_df)
A
0 6
1 4
2 3
3 7
4 7
5 4
#column for original index values
df1 = df.reset_index()
#helper column for count duplicates values
df1['g'] = df1.groupby('A').cumcount()
list_df['g'] = list_df.groupby('A').cumcount()
#merge together, create index from column and remove g column
df = list_df.merge(df1).set_index('index').rename_axis(None).drop('g', axis=1)
print (df)
A B
1 6 1
4 4 4
3 3 3
5 4 5

1] Generic approach for list_of_values.
In [936]: dff = df[df.A.isin(list_of_values)]
In [937]: dff.reindex(dff.A.map({x: i for i, x in enumerate(list_of_values)}).sort_values().index)
Out[937]:
A B
2 3 3
3 4 5
1 6 2
2] If list_of_values is sorted. You can use
In [926]: df[df.A.isin(list_of_values)].sort_values(by='A')
Out[926]:
A B
2 3 3
3 4 5
1 6 2

Convert Dataframe to series and viceversa / Delete columns from serie or dataframe

I'am trying to convert this dataframe into a series or the series to a dataframe (basicly one into an other) in order to be able to do operations with it, my second problem is wanting to delete the first column of the dataframe below (before of after converting doesn't really matter) or be able to delete a column from a series.
I searched for similar questions but they did not correspond to my issue.
Thanks in advance here are the dataframe and the series.
JOUR FL_AB_PCOUP FL_ABER_NEGA FL_AB_PMAX FL_AB_PSKVA FL_TROU_PDC \
0 2018-07-09 -0.448787 0.0 1.498464 -0.197012 1.001577
CDC_INCOMPLET_HORS_ABERRANTS CDC_COMPLET_HORS_ABERRANTS CDC_ABSENT \
0 -0.729002 -1.03586 1.032936
CDC_ABERRANTS PRM_X_PDC_ZERO mean.msr.pdc sd.msr.pdc sum.msr.pdc \
0 1.49976 -0.497693 -1.243274 -1.111366 0.558516
FL_AB_PCOUP 8.775974e-05
FL_ABER_NEGA 0.000000e+00
FL_AB_PMAX 1.865632e-03
FL_AB_PSKVA 2.027215e-05
FL_TROU_PDC 2.222952e-02
FL_AB_COMBI 1.931156e-03
CDC_INCOMPLET_HORS_ABERRANTS 1.562195e-03
CDC_COMPLET_HORS_ABERRANTS 9.758743e-01
CDC_ABSENT 2.063239e-02
CDC_ABERRANTS 1.931156e-03
PRM_X_PDC_ZERO 2.127753e+01
mean.msr.pdc 1.125987e+03
sd.msr.pdc 1.765955e+03
sum.msr.pdc 3.310615e+08
n.resil 3.884103e-04
dtype: float64

Setup:
df = pd.DataFrame({'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4]})
print (df)
B C D E
0 4 7 1 5
1 5 8 3 3
2 4 9 5 6
3 5 4 7 9
4 5 2 1 2
5 4 3 0 4
Use for DataFrame to Series selecting, e.g. by position by iloc or by name of index by loc :
#select some row, e.g. first
s = df.iloc[0]
print (s)
B 4
C 7
D 1
E 5
Name: 0, dtype: int64
And for Series to DataFrame use to_frame with transpose if necessary:
df = s.to_frame().T
print (df)
B C D E
0 4 7 1 5
Last for remove column from DataFrame use DataFrame.drop:
df = df.drop('B',axis=1)
print (df)
C D E
0 7 1 5
And value from Series use Series.drop:
s = s.drop('C')
print (s)
B 4
D 1
E 5
Name: 0, dtype: int64

you can delete your particular column by
df.drop(df.columns[i], axis=1)
to convert dataframe to series
pd.Series(df)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to keep pandas indexes when extracting columns - python

Use copy() function as follows: df = pd.read_csv('file.csv', sep= '\s+') sliced_df=df.iloc[200:400].copy() sliced_df['mean'] = sliced_df.mean(axis=1) final_df = sliced_df['mean'].copy()

Related

Pandas - How to swap column contents leaving label sequence intact?

pandas add a column with only one row

Duplicate row of low occurrence in pandas dataframe

Select rows of pandas dataframe from list, in order of list

Convert Dataframe to series and viceversa / Delete columns from serie or dataframe

Categories

Resources