Selecting a dataframe column without dropping the label - python

How do I select a dataframe column, df['col'], without dropping the name of the column?
df
colname  col1  col2  col3
index
1           0     1     2
2           3     4     5
3           6     7     8
4           9    10    11
Desired output:
df['col1']
colname  col1
index
1           0
2           3
3           6
4           9
Edit: as correctly answered, df[['col1']] does the job... Now for something a bit more tricky: what if the columns are multi-indexed?
df
grpname    A                B              ...    Z
colname  cA1  ...  cAN    cB1  ...  cBN    ...  cZ1  ...  cZN
index
1        a11  ...  a1N    b11  ...  b1N    ...  z11  ...  z1N
2        a21  ...  a2N    b21  ...  b2N    ...  z21  ...  z2N
3        a31  ...  a3N    b31  ...  b3N    ...  z31  ...  z3N
4        a41  ...  a4N    b41  ...  b4N    ...  z41  ...  z4N
I want to get
df
grpname    A
colname  cA1  cA2
index
1        a11  a12
2        a21  a22
3        a31  a32
4        a41  a42
It looks like .xs() only lets me retrieve one particular column, namely df.xs(('A', 'cAi'), level=('grpname', 'colname'), axis=1, drop_level=False), and df[['A']]['cA1':'cAi'] doesn't work either?
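For reference, a sketch of the MultiIndex case (the sub-column names c1/c2 and the numeric data here are made up purely for illustration): selecting the top level with double brackets keeps both header levels, and so does xs with drop_level=False:

import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([['A', 'B'], ['c1', 'c2']],
                                  names=['grpname', 'colname'])
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=cols,
                  index=pd.Index([1, 2, 3, 4], name='index'))

df[['A']]                                              # keeps both grpname and colname
df.xs('A', axis=1, level='grpname', drop_level=False)  # same result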

For a single column selection, df['col'] will return a Series; if you want to keep the column name then you need to double subscript, which will return a DataFrame:
In [2]:
import pandas as pd
pd.set_option('display.notebook_repr_html', False)
import io
temp = """index col1 col2 col3
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+', index_col=[0])
df
Out[2]:
col1 col2 col3
index
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
In [4]:
df[['col1']]
Out[4]:
col1
index
1 0
2 3
3 6
4 9
contrast this with:
In [5]:
df['col1']
Out[5]:
index
1 0
2 3
3 6
4 9
Name: col1, dtype: int64
EDIT
As @joris has pointed out, you can see that the name is displayed at the bottom of the output; the name isn't lost as such, it is just a different output type.
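If you'd rather not repeat the label, Series.to_frame converts the Series back into a one-column DataFrame with its name kept (continuing the session above):

df['col1'].to_frame()  # same one-column DataFrame as df[['col1']]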

There is a way to do it with NumPy if you are sure of the space taken by each column. Here is the example (note the dtype formats must be valid NumPy specifiers, e.g. np.float64 for numbers and a fixed-width string such as 'U10' for text):

import numpy as np

data = np.loadtxt("df.txt",
                  dtype={'names': ('index', 'colname', 'col1', 'col2', 'col3'),
                         'formats': (np.float64, 'U10', np.float64, np.float64, np.float64)},
                  delimiter=' ', skiprows=1)
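Within pandas itself, pd.read_fwf targets exactly this fixed-width situation; a minimal sketch, assuming the same hypothetical df.txt file laid out as above:

import pandas as pd

# read_fwf infers the column widths from the file layout
df = pd.read_fwf('df.txt', index_col='index')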

Related

pandas drop last group element

I have a DataFrame df = pd.DataFrame({'col1': ["a","b","c","d","e", "f","g","h"], 'col2': [1,1,1,2,2,3,3,3]}) that looks like
Input:
col1 col2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 3
6 g 3
7 h 3
I want to drop the last row of each group based on "col2", which would look like...
Expected Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
I wrote df.groupby('col2').tail(1), which gets me the rows I want to delete, but when I try df.drop(df.groupby('col2').tail(1)) I get an axis error. What would be a solution to this?
Looks like duplicated would work:
df[df.duplicated('col2', keep='last') |
   (~df.duplicated('col2', keep=False))  # this keeps all single-row groups
  ]
Or with your approach, you should drop the index:
# this would also drop all single-row groups
df.drop(df.groupby('col2').tail(1).index)
Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
Try this:
df.groupby('col2', as_index=False).apply(lambda x: x.iloc[:-1, :]).reset_index(drop=True)
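One more option worth noting (a sketch, not from the answers above): groupby.cumcount(ascending=False) numbers rows from the end of each group, so the last row of every group gets 0; like the drop approach, this also removes single-row groups:

df[df.groupby('col2').cumcount(ascending=False) != 0]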

Pandas, duplicate a row based on a condition

I have a dataframe like this -
What I want to do is: whenever there is 'X' in Col3, that row should get duplicated and the 'X' changed to 'Z' in the copy. The result must look like this -
I did try a few approaches, but nothing worked!
Can somebody please guide me on how to do this?
You can filter first by boolean indexing and set Z in Col3 with DataFrame.assign, join with the original using concat, sort the index with DataFrame.sort_index using the stable algorithm mergesort, and last create a default RangeIndex with DataFrame.reset_index and drop=True:
df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'Col3': list('aXcdXf'),
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})

df = (pd.concat([df, df[df['Col3'].eq('X')].assign(Col3='Z')])
        .sort_index(kind='mergesort')
        .reset_index(drop=True))
print(df)
B C Col3 D E F
0 4 7 a 1 5 a
1 5 8 X 3 3 a
2 5 8 Z 3 3 a
3 4 9 c 5 6 a
4 5 4 d 7 9 b
5 5 2 X 1 2 b
6 5 2 Z 1 2 b
7 4 3 f 0 4 b
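An alternative sketch built on Index.repeat (self-contained, same data as above): repeat the 'X' rows twice, then relabel the second copy of each:

import pandas as pd

df = pd.DataFrame({'B': [4, 5, 4, 5, 5, 4], 'C': [7, 8, 9, 4, 2, 3],
                   'Col3': list('aXcdXf'), 'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4], 'F': list('aaabbb')})

# Repeat 'X' rows twice, everything else once; row order is preserved.
out = df.loc[df.index.repeat(df['Col3'].eq('X') + 1)].copy()

# The second copy of a repeated row shares its original index label,
# so index.duplicated marks exactly those copies.
out.loc[out.index.duplicated(keep='first'), 'Col3'] = 'Z'
out = out.reset_index(drop=True)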

Delete pandas column if column name begins with a number

I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
    if col begins with a number:
        df.drop(col)
I'm not sure what are the best practices when it comes to handling pandas DataFrames, how should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas dataframe in a for loop?
I think the simplest is to select all columns which do not start with a number, using filter with a regex - ^ anchors the start of the string and \D matches a non-digit:
df1 = df.filter(regex=r'^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains(r'^\D')]
Or invert the condition and match the digits:
df1 = df.loc[:, ~df.columns.str.contains(r'^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If you want to use your pseudocode:
for col in df.columns:
    if col[0].isnumeric():
        df = df.drop(col, axis=1)
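The loop is safe here (reassigning df does not disturb the Index being iterated), but collecting the names first and dropping once is more idiomatic; a one-line sketch:

df = df.drop(columns=[c for c in df.columns if c[0].isdigit()])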
Sample:
df = pd.DataFrame({'2A': list('abcdef'),
                   '1B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D3': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print(df)
1B 2A C D3 E F
0 4 a 7 1 5 a
1 5 b 8 3 3 a
2 4 c 9 5 6 a
3 5 d 4 7 9 b
4 5 e 2 1 2 b
5 4 f 3 0 4 b
df1 = df.filter(regex=r'^\D')
print(df1)
C D3 E F
0 7 1 5 a
1 8 3 3 a
2 9 5 6 a
3 4 7 9 b
4 2 1 2 b
5 3 0 4 b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]

Set new column values in pandas DataFrame1 where DF2 column values match DF1 index

I'd like to set a new column in a pandas dataframe with values calculated using a groupby on dataframe2.
DF1:
col1 col2
id
1 'a'
2 'b'
3 'c'
DF2:
id col2
index
1 1 11
1 1 22
1 1 12
1 1 45
3 3 83
3 3 11
3 3 35
3 3 54
I want to group DF2 by 'id', and then apply a function on 'col2' to put the result into the corresponding index in DF1. If there is no group for that particular index, then I want to fill with NaN...
ret_val = DF2.groupby('id').apply(lambda x: my_func(x['col2']))
col1 col2
id
1 'a' ret_val
2 'b' NaN
3 'c' ret_val
... I can't quite figure out how to achieve this though
Use map on the df1.index series:
In [5327]: df1['col2'] = df1.index.to_series().map(
      ...:     df2.groupby('id').apply(lambda x: my_func(x['col2'])))
In [5328]: df1
Out[5328]:
col1 col2
id
1 a 360.0
2 b NaN
3 c 536.0
Details
In [5322]: def my_func(x):
      ...:     return x.sum()
      ...:
In [5323]: df2.groupby('id').apply(lambda x: my_func(x['col2']))
Out[5323]:
id
1 360.0
3 536.0
dtype: float64
In [5324]: df1.index.to_series().map(df2.groupby('id').apply(lambda x: my_func(x['col2'])))
Out[5324]:
id
1 360.0
2 NaN
3 536.0
Name: id, dtype: float64
Apply the function on col2 of df2 first, then use pd.concat, dropping col2 from df1 since it is empty.
x = df2.groupby('id')['col2'].apply(sum)  # instead of sum use your own function
ndf = pd.concat([df1.drop('col2', axis=1), x], axis=1)
col1 col2
id
1 'a' 90.0
2 'b' NaN
3 'c' 183.0
Straight and simple, as suggested by @Zero:
df1['col2'] = df2.groupby('id')['col2'].apply(sum)
you can replace sum with .apply(lambda x: your_func(x))
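This works because column assignment aligns on df1's index; a self-contained sketch reconstructing the frames with the values shown in the question:

import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c']},
                   index=pd.Index([1, 2, 3], name='id'))
df2 = pd.DataFrame({'id': [1, 1, 1, 1, 3, 3, 3, 3],
                    'col2': [11, 22, 12, 45, 83, 11, 35, 54]})

# The grouped sums are indexed by id; assignment aligns on df1's index,
# so id 2 (absent from df2) becomes NaN automatically.
df1['col2'] = df2.groupby('id')['col2'].sum()
print(df1)
#    col1   col2
# id
# 1     a   90.0
# 2     b    NaN
# 3     c  183.0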
df1['col2'] = df2.set_index('id').groupby(level='id')['col2'].sum()
df1
Out[975]:
col1 col2
id
1 'a' 90.0
2 'b' NaN
3 'c' 183.0

Concatenate column values in Pandas DataFrame with "NaN" values

I'm trying to concatenate Pandas DataFrame columns with NaN values.
In [96]: import numpy as np, pandas as pd
    ...: df = pd.DataFrame({'col1': ["1", "1", "2", "2", "3", "3"],
    ...:                    'col2': ["p1", "p2", "p1", np.nan, "p2", np.nan],
    ...:                    'col3': ["A", "B", "C", "D", "E", "F"]})
In [97]: df
Out[97]:
col1 col2 col3
0 1 p1 A
1 1 p2 B
2 2 p1 C
3 2 NaN D
4 3 p2 E
5 3 NaN F
In [98]: df['concatenated'] = df['col2'] +','+ df['col3']
In [99]: df
Out[99]:
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 NaN D NaN
4 3 p2 E p2,E
5 3 NaN F NaN
Instead of the NaN values in the "concatenated" column, I want to get "D" and "F" respectively for this example.
I don't think your problem is trivial. However, here is a workaround using numpy vectorization:
In [49]: def concat(*args):
    ...:     strs = [str(arg) for arg in args if not pd.isnull(arg)]
    ...:     return ','.join(strs) if strs else np.nan
    ...: np_concat = np.vectorize(concat)
    ...:
In [50]: np_concat(df['col2'], df['col3'])
Out[50]:
array(['p1,A', 'p2,B', 'p1,C', 'D', 'p2,E', 'F'],
dtype='|S64')
In [51]: df['concatenated'] = np_concat(df['col2'], df['col3'])
In [52]: df
Out[52]:
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 NaN D D
4 3 p2 E p2,E
5 3 NaN F F
[6 rows x 4 columns]
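A plainer row-wise alternative (a sketch; slower than the vectorized version on large frames, but it handles the NaN the same way by dropping it per row before joining):

df['concatenated'] = df[['col2', 'col3']].apply(
    lambda row: ','.join(row.dropna().astype(str)), axis=1)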
You could first replace NaNs with empty strings, for the whole dataframe or the column(s) you desire.
In [6]: df = df.fillna('')
In [7]: df['concatenated'] = df['col2'] +','+ df['col3']
In [8]: df
Out[8]:
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 D ,D
4 3 p2 E p2,E
5 3 F ,F
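If the leading comma in ,D and ,F is unwanted, one small follow-up on the same frame is to strip it afterwards (a sketch):

# strip(',') removes the placeholder comma left by the empty strings
df['concatenated'] = (df['col2'] + ',' + df['col3']).str.strip(',')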
We can use stack, which drops the NaN, then groupby.agg with ','.join to combine the strings:
df['concatenated'] = df[['col2', 'col3']].stack().groupby(level=0).agg(','.join)
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 NaN D D
4 3 p2 E p2,E
5 3 NaN F F
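The same pattern scales to any number of columns, assuming they all hold strings (a sketch):

cols = ['col1', 'col2', 'col3']
df['concatenated'] = df[cols].stack().groupby(level=0).agg(','.join)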
