Hello, I have been trying to drop 2 columns of an Excel-sourced DataFrame in pandas, using a drop command like this:
energy = energy.drop(energy.columns[[0, 1]], axis=1)
However, I could not make the columns disappear from view, and I eventually realized that the columns I am supposed to delete come in as a multi-level index on my machine. I then tried to drop one of the levels from it like this:
energy.index = energy.index.droplevel(2)
But I still can't figure out how to get rid of these columns.
I have attached a screenshot of my work.
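A minimal sketch of one likely fix, assuming the stray columns come from a multi-level header that read_excel parsed into the columns (not the index):

energy.columns = energy.columns.droplevel(0)          # drop the extra header level from the columns
energy = energy.drop(energy.columns[[0, 1]], axis=1)  # then drop the first two columns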
Instead of dropping the columns, you could subset your data frame like so:
In [3]: mydf = pd.DataFrame({"A":[1,2,3,4],"B":[4,3,2,1], "C":[3,4,5,3],"D":[6,4,3,2]})
In [4]: mydf
Out[4]:
A B C D
0 1 4 3 6
1 2 3 4 4
2 3 2 5 3
3 4 1 3 2
In [5]: mydf[mydf.columns[2:]]
Out[5]:
C D
0 3 6
1 4 4
2 5 3
3 3 2
This works if you're trying to remove the first 2 columns, for example. It takes the Index returned by df.columns, slices it, and uses the result to subset your DataFrame. You would then assign the result back to a variable, as shown below.
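For example, to keep the change with the mydf above, assign the subset back:

mydf = mydf[mydf.columns[2:]]   # keep everything from the 3rd column onward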
If the columns that you want to drop are nonadjacent you can loop through a list of columns to drop:
In [7]: mydf1 = mydf.copy()
In [8]: for col in ["A","D"]:
...: mydf1 = mydf1.drop(col,axis=1)
In [9]: mydf1
Out[9]:
B C
0 4 3
1 3 4
2 2 5
3 1 3
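The loop isn't strictly necessary here: drop also accepts a list of labels, so the nonadjacent columns can go in one call (same mydf as above):

mydf1 = mydf.drop(["A", "D"], axis=1)   # drop both nonadjacent columns at once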
Try simply renaming the columns
Say you have
In: df.columns
Out: MultiIndex(levels=[['BURGLARY', 'GRAND LARCENY',
                         'GRAND LARCENY OF MOTOR VEHICLE', 'TMAX', 'TMIN'],
                        ['count', 'mean']],
                labels=[[0, 1, 2, 3, 4], [0, 0, 0, 1, 1]])
Then
In: df.columns = ['Burglary', 'Grand Larceny', 'Grand Larceny on Motor Vehicle',
'TMAX', 'TMIN']
And voila
In: df.columns
Out: Index(['Burglary', 'Grand Larceny', 'Grand Larceny on Motor Vehicle',
            'TMAX', 'TMIN'],
           dtype='object')
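If you'd rather derive the flat names from the MultiIndex itself instead of typing them out, one sketch (assuming each column has a unique first-level name, as in the repr above):

df.columns = df.columns.get_level_values(0)   # keep only the first header level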
If you really want to remove columns you can use del:
>>> df = pd.DataFrame({'A':range(3),'B':list('abc'), 'C':range(3,6), 'D':list('gde')})
>>> for x in ['A', 'B']:
... del df[x]
...
>>> df
C D
0 3 g
1 4 d
2 5 e
This might help
energy.drop(energy.columns[[0,1]] , axis=1, inplace=True)
I'm simply trying to access named pandas columns by an integer.
You can select a row by location using df.ix[3].
But how do you select a column by integer?
My dataframe:
df = pandas.DataFrame({'a': np.random.rand(5), 'b': np.random.rand(5)})
Two approaches that come to mind:
>>> df
A B C D
0 0.424634 1.716633 0.282734 2.086944
1 -1.325816 2.056277 2.583704 -0.776403
2 1.457809 -0.407279 -1.560583 -1.316246
3 -0.757134 -1.321025 1.325853 -2.513373
4 1.366180 -1.265185 -2.184617 0.881514
>>> df.iloc[:, 2]
0 0.282734
1 2.583704
2 -1.560583
3 1.325853
4 -2.184617
Name: C
>>> df[df.columns[2]]
0 0.282734
1 2.583704
2 -1.560583
3 1.325853
4 -2.184617
Name: C
Edit: The original answer suggested the use of df.ix[:, 2], but .ix has since been deprecated (and removed in pandas 1.0). Users should switch to df.iloc[:, 2].
You can also use df.icol(n) to access a column by integer.
Update: icol is deprecated and the same functionality can be achieved by:
df.iloc[:, n] # to access the column at the nth position
You could use the label-based .loc or the position-based .iloc method to do column slicing, including column ranges:
In [50]: import pandas as pd
In [51]: import numpy as np
In [52]: df = pd.DataFrame(np.random.rand(4,4), columns = list('abcd'))
In [53]: df
Out[53]:
a b c d
0 0.806811 0.187630 0.978159 0.317261
1 0.738792 0.862661 0.580592 0.010177
2 0.224633 0.342579 0.214512 0.375147
3 0.875262 0.151867 0.071244 0.893735
In [54]: df.loc[:, ["a", "b", "d"]]   ### Selecting specific columns by label
Out[54]:
a b d
0 0.806811 0.187630 0.317261
1 0.738792 0.862661 0.010177
2 0.224633 0.342579 0.375147
3 0.875262 0.151867 0.893735
In [55]: df.loc[:, "a":"c"]   ### Label-based column-range slicing
Out[55]:
a b c
0 0.806811 0.187630 0.978159
1 0.738792 0.862661 0.580592
2 0.224633 0.342579 0.214512
3 0.875262 0.151867 0.071244
In [56]: df.iloc[:, 0:3]   ### Position-based column-range slicing
Out[56]:
a b c
0 0.806811 0.187630 0.978159
1 0.738792 0.862661 0.580592
2 0.224633 0.342579 0.214512
3 0.875262 0.151867 0.071244
You can access multiple columns by passing a list of column positions to DataFrame.ix.
For example:
>>> df = pandas.DataFrame({
'a': np.random.rand(5),
'b': np.random.rand(5),
'c': np.random.rand(5),
'd': np.random.rand(5)
})
>>> df
a b c d
0 0.705718 0.414073 0.007040 0.889579
1 0.198005 0.520747 0.827818 0.366271
2 0.974552 0.667484 0.056246 0.524306
3 0.512126 0.775926 0.837896 0.955200
4 0.793203 0.686405 0.401596 0.544421
>>> df.ix[:,[1,3]]
b d
0 0.414073 0.889579
1 0.520747 0.366271
2 0.667484 0.524306
3 0.775926 0.955200
4 0.686405 0.544421
The method .transpose() converts columns to rows and rows to columns, so you could even write
df.transpose().ix[3]
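.ix has since been removed from pandas (in 1.0), so the positional equivalents of the two snippets above would be:

df.iloc[:, [1, 3]]        # select the 2nd and 4th columns by position
df.transpose().iloc[3]    # the transpose trick, written with iloc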
Most people have answered how to take columns starting from an index, but there are scenarios where you need to pick specific, non-contiguous columns by position, which the solution below handles.
Say that you have columns A, B and C. If you need to select only columns A and C, you can use the code below:
df = df.iloc[:, [0, 2]]
where 0 and 2 specify that you select only the 1st and 3rd columns.
You can use the take method. For example, to select the first and last columns:
df.take([0, -1], axis=1)
I have a dataframe whose header is a list of 'string-integers':
import pandas as pd
d = {'1': [1, 2], '7': [3, 4], '3': [3, 4], '5': [2, 7]}
df = pd.DataFrame(data=d)
1 3 5 7
0 1 3 2 3
1 2 4 7 4
This code changes the column order (sorts it):
cols = df.columns.tolist()
cols = [int(x) for x in cols]
cols.sort()
cols = [str(x) for x in cols]
df = df[cols]
1 3 5 7
0 1 3 2 3
1 2 4 7 4
I'm not happy with this solution. Of course, I can hide it in a function, but probably a more elegant approach exists.
There are several options depending on what you require.
Option 1
You can use sort_values to sort the labels as strings, then select the columns in that order (assigning the sorted labels back to df.columns would only relabel the data without moving it):
df = df[df.columns.sort_values()]
Note this means that "10" will appear before "2".
Option 2
If you wish to convert the labels to integers and then sort numerically:
df.columns = df.columns.astype(int)
df = df.sort_index(axis=1)
Option 3
If you want to keep the labels as strings but order by integer value:
df = df[df.columns.astype(int).sort_values().astype(str)]
A pure Python approach is also possible:
df = df[sorted(df, key=int)]
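Applied to the example data from the question, a quick check of the keep-as-string option:

import pandas as pd

d = {'1': [1, 2], '7': [3, 4], '3': [3, 4], '5': [2, 7]}
df = pd.DataFrame(data=d)
df = df[df.columns.astype(int).sort_values().astype(str)]
print(df.columns.tolist())   # ['1', '3', '5', '7']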
Suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe which looks like this (the A1/B1 values stacked under A/B, with an id marking which pair they came from):
    A   B  id
0   1   2   1
1   5   6   1
2   9  10   1
3   3   4   2
4   7   8   2
5  11  12   2
What does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
Of course I could loop over the data and build a new list of lists, but there must be a better way. Any ideas?
The pd.wide_to_long function is built almost exactly for this situation, where you have many variables with the same prefix ending in different digit suffixes. The only difference here is that your first set of variables doesn't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that, unlike melt, it must have an identification variable, i. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape, with numpy.repeat for the id column:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len(df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, and it may be removed in the future (along with pd.wide_to_long). One possible direction is merging all three functions into one, perhaps melt, but that is not implemented yet. If it lands in a future version of pandas, this answer will be updated.
I solved this in 3 steps:
Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
Delete the data from df that will be added below (and that was used to make df2).
Append df2 to df.
Like so:
# step 1: create new dataframe
df2 = df[['A1', 'B1']]
df2.columns = ['A', 'B']
# step 2: delete that data from original
df = df.drop(["A1", "B1"], 1)
# step 3: append
df = df.append(df2, ignore_index=True)
Note that when you call df.append() you need to specify ignore_index=True so the new rows get a fresh index rather than keeping their old one.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
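One caveat: DataFrame.append was deprecated and then removed in pandas 2.0. On recent versions, steps 2 and 3 collapse into a single pd.concat call (starting from the original df and the df2 built in step 1):

df = pd.concat([df.drop(['A1', 'B1'], axis=1), df2], ignore_index=True)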
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
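If you also want a fresh 0..n index like the output shown in the previous answer, pass ignore_index=True:

pd.concat([df_1, df_2], ignore_index=True)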
I'm using pandas to read an Excel file and convert the spreadsheet to a DataFrame. Then I apply groupby and store the individual groups in variables using get_group for later computation.
My issue is that the input file isn't always the same size; sometimes the groupby results in 10 groups, sometimes 25, etc. How do I get my program to cope when a group is missing from the initial data?
df = pd.read_excel(filepath, 0, skiprows=3, parse_cols='A,B,C,E,F,G',
names=['Result', 'Trial', 'Well', 'Distance', 'Speed', 'Time'])
df = df.replace({'-': 0}, regex=True) #replaces '-' values with 0
trials = df['Trial'].unique() # the unique trial names
gb = df.groupby('Trial') #groups by column Trial
trial_1 = gb.get_group('Trial 1')
trial_2 = gb.get_group('Trial 2')
trial_3 = gb.get_group('Trial 3')
trial_4 = gb.get_group('Trial 4')
trial_5 = gb.get_group('Trial 5')
Say my initial data only has 3 trials; how would I get it to ignore trials 4 and 5 later? My code runs when all trials are present but fails when some are missing :( It sounds very much like an if statement is needed, but my tired brain has no idea where...
Thanks in advance!
After grouping you can get the group names using the .groups attribute; this returns a dict keyed by group name, so you can iterate over the dict keys dynamically and don't need to hard-code the number of groups:
In [22]:
df = pd.DataFrame({'grp':list('aabbbc'), 'val':np.arange(6)})
df
Out[22]:
grp val
0 a 0
1 a 1
2 b 2
3 b 3
4 b 4
5 c 5
In [23]:
gp = df.groupby('grp')
gp.groups
Out[23]:
{'a': Int64Index([0, 1], dtype='int64'),
'b': Int64Index([2, 3, 4], dtype='int64'),
'c': Int64Index([5], dtype='int64')}
In [25]:
for g in gp.groups.keys():
print(gp.get_group(g))
grp val
0 a 0
1 a 1
grp val
2 b 2
3 b 3
4 b 4
grp val
5 c 5
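To mirror the OP's per-trial variables without hard-coding the group names, one sketch (trials is a hypothetical name) using the gp object from above:

trials = {name: group for name, group in gp}   # one DataFrame per group name
trials.get('b')    # the 'b' group
trials.get('z')    # None when that group is absent, instead of a KeyError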
In a column risklevels I want to replace Small with 1, Medium with 5 and High with 15.
I tried:
dfm.replace({'risk':{'Small': '1'}},
{'risk':{'Medium': '5'}},
{'risk':{'High': '15'}})
But only the Medium values were replaced.
What is wrong?
Your replace format is off
In [21]: df = pd.DataFrame({'a':['Small', 'Medium', 'High']})
In [22]: df
Out[22]:
a
0 Small
1 Medium
2 High
[3 rows x 1 columns]
In [23]: df.replace({'a' : { 'Medium' : 2, 'Small' : 1, 'High' : 3 }})
Out[23]:
a
0 1
1 2
2 3
[3 rows x 1 columns]
In [123]: import pandas as pd
In [124]: state_df = pd.DataFrame({'state':['Small', 'Medium', 'High', 'Small', 'High']})
In [125]: state_df
Out[125]:
state
0 Small
1 Medium
2 High
3 Small
4 High
In [126]: replace_values = {'Small' : 1, 'Medium' : 2, 'High' : 3 }
In [127]: state_df = state_df.replace({"state": replace_values})
In [128]: state_df
Out[128]:
state
0 1
1 2
2 3
3 1
4 3
You could define a dict and call map
In [256]:
df = pd.DataFrame({'a':['Small', 'Medium', 'High']})
df
Out[256]:
a
0 Small
1 Medium
2 High
[3 rows x 1 columns]
In [258]:
vals_to_replace = {'Small':'1', 'Medium':'5', 'High':'15'}
df['a'] = df['a'].map(vals_to_replace)
df
Out[258]:
a
0 1
1 5
2 15
[3 rows x 1 columns]
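One caveat with map: values missing from the dict become NaN, whereas replace leaves unmatched values untouched. A defensive sketch, assuming a value outside the mapping can occur:

df['a'] = df['a'].map(vals_to_replace).fillna(df['a'])   # fall back to the original value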
In [279]:
val1 = [1,5,15]
df['risk'].update(pd.Series(val1))
df
Out[279]:
risk
0 1
1 5
2 15
[3 rows x 1 columns]
Looks like OP may have been looking for a one-liner to solve this through consecutive calls to .str.replace:
dfm.column = dfm.column.str.replace('Small', '1') \
.str.replace('Medium', '5') \
.str.replace('High', '15')
OP, you were close, but the commas make those three dicts separate arguments to one replace call rather than three separate replacements. Chain .str.replace instead; the per-column dict format ('risk') isn't necessary. Just pass the pattern-to-match and replacement value as arguments.
I had to turn on the "regex" flag to make it work:
df.replace({'a' : {'Medium':2, 'Small':1, 'High':3 }}, regex=True)
String-replace each string (Small, Medium, High) with the new string (1, 5, 15).
Here dfm is the DataFrame name and column is the column name.
dfm.column = dfm.column.str.replace('Small', '1')
dfm.column = dfm.column.str.replace('Medium', '5')
dfm.column = dfm.column.str.replace('High', '15')
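Note that str.replace produces strings ('1', '5', '15'); if you need numeric values afterwards, convert the column:

dfm['column'] = dfm['column'].astype(int)   # bracket form avoids attribute-assignment pitfalls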
Use series.replace with lists of before and after values for greater ease:
df.risklevels = df.risklevels.replace( ['Small','Medium','High'], [1,5,15] )