How can I avoid repeated indices in pandas DataFrame after concat?

I have two pandas dataframes and concatenate them:
In [55]: import pandas as pd
         from pandas import DataFrame
         adict = {'a': [0, 1]}
         bdict = {'a': [2, 3]}
         dfa = DataFrame(adict)
         dfb = DataFrame(bdict)
         dfab = pd.concat([dfa, dfb])
The problem is that the resulting dataframe has a repeated index.
In [56]: dfab.head()
Out[56]:
   a
0  0
1  1
0  2
1  3
How can I have a single index running through the resulting dataframe, i.e.
In [56]: dfab.head()
Out[56]:
   a
0  0
1  1
2  2
3  3

Just do: dfab = pd.concat([dfa,dfb], ignore_index=True)
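If the frames are already concatenated, an equivalent after-the-fact fix (a minimal sketch of the same idea) is to rebuild the index:
dfab = pd.concat([dfa, dfb]).reset_index(drop=True)  # drops the old per-frame indexes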

Related

Accessing a Non-Numerical Index in a DataFrame [duplicate]

I'm simply trying to access named pandas columns by an integer.
You can select a row by location using df.ix[3].
But how do I select a column by integer?
My dataframe:
import numpy as np
import pandas
df = pandas.DataFrame({'a': np.random.rand(5), 'b': np.random.rand(5)})
Two approaches that come to mind:
>>> df
          A         B         C         D
0  0.424634  1.716633  0.282734  2.086944
1 -1.325816  2.056277  2.583704 -0.776403
2  1.457809 -0.407279 -1.560583 -1.316246
3 -0.757134 -1.321025  1.325853 -2.513373
4  1.366180 -1.265185 -2.184617  0.881514
>>> df.iloc[:, 2]
0    0.282734
1    2.583704
2   -1.560583
3    1.325853
4   -2.184617
Name: C
>>> df[df.columns[2]]
0    0.282734
1    2.583704
2   -1.560583
3    1.325853
4   -2.184617
Name: C
Edit: The original answer suggested the use of df.ix[:,2] but this function is now deprecated. Users should switch to df.iloc[:,2].
You can also use df.icol(n) to access a column by integer.
Update: icol is deprecated and the same functionality can be achieved by:
df.iloc[:, n] # to access the column at the nth position
You can use label-based slicing with .loc or position-based slicing with .iloc to select columns, including column ranges:
In [50]: import pandas as pd
In [51]: import numpy as np
In [52]: df = pd.DataFrame(np.random.rand(4,4), columns = list('abcd'))
In [53]: df
Out[53]:
          a         b         c         d
0  0.806811  0.187630  0.978159  0.317261
1  0.738792  0.862661  0.580592  0.010177
2  0.224633  0.342579  0.214512  0.375147
3  0.875262  0.151867  0.071244  0.893735
In [54]: df.loc[:, ["a", "b", "d"]]  ### selecting specific columns by label
Out[54]:
          a         b         d
0  0.806811  0.187630  0.317261
1  0.738792  0.862661  0.010177
2  0.224633  0.342579  0.375147
3  0.875262  0.151867  0.893735
In [55]: df.loc[:, "a":"c"]  ### label-based column range slicing
Out[55]:
          a         b         c
0  0.806811  0.187630  0.978159
1  0.738792  0.862661  0.580592
2  0.224633  0.342579  0.214512
3  0.875262  0.151867  0.071244
In [56]: df.iloc[:, 0:3]  ### position-based column range slicing
Out[56]:
          a         b         c
0  0.806811  0.187630  0.978159
1  0.738792  0.862661  0.580592
2  0.224633  0.342579  0.214512
3  0.875262  0.151867  0.071244
You can access multiple columns by passing a list of column positions to DataFrame.ix (note that .ix is deprecated in modern pandas; .iloc accepts the same list).
For example:
>>> df = pandas.DataFrame({
...     'a': np.random.rand(5),
...     'b': np.random.rand(5),
...     'c': np.random.rand(5),
...     'd': np.random.rand(5)
... })
>>> df
          a         b         c         d
0  0.705718  0.414073  0.007040  0.889579
1  0.198005  0.520747  0.827818  0.366271
2  0.974552  0.667484  0.056246  0.524306
3  0.512126  0.775926  0.837896  0.955200
4  0.793203  0.686405  0.401596  0.544421
>>> df.ix[:, [1, 3]]
          b         d
0  0.414073  0.889579
1  0.520747  0.366271
2  0.667484  0.524306
3  0.775926  0.955200
4  0.686405  0.544421
The method .transpose() converts columns to rows and rows to columns, hence you could even write
df.transpose().ix[3]
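Since .ix is deprecated, the same idea in current pandas would be (a minimal sketch):
df.T.iloc[3]   # the fourth column as a Series, via the transpose
df.iloc[:, 3]  # the same column, without transposing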
Most people have answered how to select columns starting from an index, but you might need to pick specific columns from in between. You can pass a list of positions to .iloc:
df = df.iloc[:, [0, 2]]
where 0 and 2 select only the 1st and 3rd columns.
You can use the take method. For example, to select the first and last columns:
df.take([0, -1], axis=1)
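For example, reusing the four-column df from the .ix answer above:
>>> df.take([0, -1], axis=1)
          a         d
0  0.705718  0.889579
1  0.198005  0.366271
2  0.974552  0.524306
3  0.512126  0.955200
4  0.793203  0.544421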

Keep pair of row data in pandas [duplicate]

I am stuck with a seemingly easy problem: dropping unique rows in a pandas dataframe. Basically, the opposite of drop_duplicates().
Let's say this is my data:
     A  B  C
0  foo  0  A
1  foo  1  A
2  foo  1  B
3  bar  1  A
I would like to drop the rows where the combination of A and B is unique, i.e. I would like to keep only rows 1 and 2.
I tried the following:
# Load Dataframe
import pandas as pd
df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
uniques = df[['A', 'B']].drop_duplicates()
duplicates = df[~df.index.isin(uniques.index)]
But I only get the row 2, as 0, 1, and 3 are in the uniques!
Solutions to select all duplicated rows:
You can use duplicated with subset and the parameter keep=False to select all duplicates:
df = df[df.duplicated(subset=['A', 'B'], keep=False)]
print(df)
     A  B  C
1  foo  1  A
2  foo  1  B
Solution with transform:
df = df[df.groupby(['A', 'B'])['A'].transform('size') > 1]
print(df)
     A  B  C
1  foo  1  A
2  foo  1  B
Slightly modified solutions to select all unique rows:
# invert the boolean mask with ~
df = df[~df.duplicated(subset=['A', 'B'], keep=False)]
print(df)
     A  B  C
0  foo  0  A
3  bar  1  A
df = df[df.groupby(['A', 'B'])['A'].transform('size') == 1]
print(df)
     A  B  C
0  foo  0  A
3  bar  1  A
I came up with a solution using groupby. One caveat: after reset_index() the grouped frame has a fresh 0..n index, so it cannot be compared against df.index directly; merging the duplicated keys back onto df keeps the logic correct:
grouped = df.groupby(['A', 'B']).size().reset_index().rename(columns={0: 'count'})
dup_keys = grouped[grouped['count'] > 1][['A', 'B']]  # (A, B) pairs that repeat
duplicates = df.merge(dup_keys, on=['A', 'B'])
duplicates now has the proper result (note that merge resets the row index):
     A  B  C
0  foo  1  A
1  foo  1  B
Also, my original attempt in the question can be fixed by simply adding keep=False in the drop_duplicates method:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates(keep=False)
duplicates = df[~df.index.isin(uniques.index)]
Please see @jezrael's answer; I think it is the safest, as I am relying on pandas indexes here.
df1 = df.drop_duplicates(['A', 'B'], keep=False)  # rows whose (A, B) pair is unique
df1 = pd.concat([df, df1])                        # those unique rows now appear twice
df1 = df1.drop_duplicates(keep=False)             # ...so they are dropped, leaving only the duplicates
This technique is more suitable when you have two datasets dfX and dfY with millions of records. You may first concatenate dfX and dfY and follow the same steps.
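A sketch of that two-dataset case (assuming dfX and dfY share the key columns A and B, and contain no fully identical rows):
combined = pd.concat([dfX, dfY], ignore_index=True)
unique_rows = combined.drop_duplicates(['A', 'B'], keep=False)  # (A, B) pairs occurring once
duplicates = pd.concat([combined, unique_rows]).drop_duplicates(keep=False)  # unique rows appear twice and are dropped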

Sum columns in a pandas dataframe which contain a string

I am trying to do something relatively simple: summing all columns in a pandas dataframe whose names contain a certain string, then adding that sum as a new column in the dataframe. These columns are all numeric float values...
I can get the list of columns which contain the string I want
StmCol = [col for col in cdf.columns if 'Stm_Rate' in col]
But when I try to sum them using:
cdf['PadStm'] = cdf[StmCol].sum()
I get a new column full of "nan" values.
You need to pass axis=1 to .sum; by default (axis=0) it sums over each column:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df
Out[12]:
   A  B
0  1  2
1  3  4
In [13]: df[["A"]].sum()  # Here I'm passing the list of columns ["A"]
Out[13]:
A    4
dtype: int64
In [14]: df[["A"]].sum(axis=1)
Out[14]:
0    1
1    3
dtype: int64
Only the latter matches the index of df:
In [15]: df["C"] = df[["A"]].sum()
In [16]: df["D"] = df[["A"]].sum(axis=1)
In [17]: df
Out[17]:
   A  B   C  D
0  1  2 NaN  1
1  3  4 NaN  3
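Applied to the question's dataframe, a minimal sketch (assuming cdf and the StmCol list built above):
StmCol = [col for col in cdf.columns if 'Stm_Rate' in col]
cdf['PadStm'] = cdf[StmCol].sum(axis=1)  # row-wise sum, aligned with cdf's index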

Get pandas groupby object to ignore missing dataframes

I'm using pandas to read an Excel file and convert the spreadsheet to a dataframe. Then I apply groupby and store the individual groups in variables using get_group for later computation.
My issue is that the input file isn't always the same size; sometimes the groupby will result in 10 dfs, sometimes 25, etc. How do I get my program to ignore it if a df is missing from the initial data?
df = pd.read_excel(filepath, 0, skiprows=3, parse_cols='A,B,C,E,F,G',
                   names=['Result', 'Trial', 'Well', 'Distance', 'Speed', 'Time'])
df = df.replace({'-': 0}, regex=True)  # replaces '-' values with 0
trials = df['Trial'].unique()
gb = df.groupby('Trial')  # groups by column Trial
trial_1 = gb.get_group('Trial 1')
trial_2 = gb.get_group('Trial 2')
trial_3 = gb.get_group('Trial 3')
trial_4 = gb.get_group('Trial 4')
trial_5 = gb.get_group('Trial 5')
Say my initial data only has 3 trials; how would I get it to ignore trials 4 and 5 later? My code runs when all trials are present but fails when some are missing :( It sounds very much like an if statement would be needed, but my tired brain has no idea where...
Thanks in advance!
After grouping, you can get the group names from the .groups attribute. It returns a dict keyed by group name, so you can iterate over the keys dynamically and never hard-code the number of groups:
In [22]:
df = pd.DataFrame({'grp': list('aabbbc'), 'val': np.arange(6)})
df
Out[22]:
  grp  val
0   a    0
1   a    1
2   b    2
3   b    3
4   b    4
5   c    5
In [23]:
gp = df.groupby('grp')
gp.groups
Out[23]:
{'a': Int64Index([0, 1], dtype='int64'),
 'b': Int64Index([2, 3, 4], dtype='int64'),
 'c': Int64Index([5], dtype='int64')}
In [25]:
for g in gp.groups.keys():
    print(gp.get_group(g))
  grp  val
0   a    0
1   a    1
  grp  val
2   b    2
3   b    3
4   b    4
  grp  val
5   c    5
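Applied to the question, a hedged sketch (assuming the df and gb from the question's code): build per-trial dataframes only for the trials actually present, and let lookups degrade gracefully:
gb = df.groupby('Trial')
trials = {name: gb.get_group(name) for name in gb.groups}
trial_4 = trials.get('Trial 4')  # None instead of a KeyError when Trial 4 is absent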

Mean of repeated columns in pandas dataframe

I have a dataframe with repeated column names which account for repeated measurements.
from numpy.random import randn
import pandas as pd
df = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df2 = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df3 = pd.concat([df, df2], axis=1)
df3
          A         B         A         B
0 -0.875884 -0.298203  0.877414  1.282025
1  1.605602 -0.127038 -0.286237  0.572269
2  1.349540 -0.067487  0.126440  1.063988
3 -0.142809  1.282968  0.941925 -1.593592
4 -0.630353  1.888605 -1.176436 -1.623352
I'd like to take the mean of cols 'A's and 'B's such that the dataframe shrinks to
          A         B
0  0.000765  0.491911
1  0.659682  0.222616
2  0.737990  0.498251
3  0.399558 -0.155312
4 -0.903395  0.132627
If I do the typical
df3['A'].mean(axis=1)
I get a Series (with no column name), and I would then have to build a new dataframe from the means of each column group. Also, the .groupby() method apparently doesn't let you group by column name; rather, you pass the columns and it groups the index. Is there a fancy way to do this?
Side question: why does
df = pd.DataFrame({'A': randn(5), 'B': randn(5), 'A': randn(5), 'B': randn(5)})
not generate a 4-column dataframe but merges same-name cols?
You can use the level keyword (treating your columns as the first level, level 0, of an index that has only one level in this case):
In [11]: df3
Out[11]:
          A         B         A         B
0 -0.367326 -0.422332  2.379907  1.502237
1 -1.060848  0.083976  0.619213 -0.303383
2  0.805418 -0.109793  0.257343  0.186462
3  2.419282 -0.452402  0.702167  0.216165
4 -0.464248 -0.980507  0.823302  0.900429
In [12]: df3.mean(axis=1, level=0)
Out[12]:
          A         B
0  1.006291  0.539952
1 -0.220818 -0.109704
2  0.531380  0.038334
3  1.560725 -0.118118
4  0.179527 -0.040039
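Note that the level= argument on reductions was deprecated and later removed in recent pandas releases; under that assumption, an equivalent sketch is to group the transposed frame by the first level of its index:
df3.T.groupby(level=0).mean().T  # same result where df3.mean(axis=1, level=0) is unavailable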
You've created df3 in a strange way; for this simple case the following would work:
In [86]:
df = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df2 = pd.DataFrame({'A': randn(5), 'B': randn(5)})
print(df)
print(df2)
          A         B
0 -0.732807 -0.571942
1 -1.546377 -1.586371
2  0.638258  0.569980
3 -1.017427  1.395300
4  0.666853 -0.258473
[5 rows x 2 columns]
          A         B
0  0.589185  1.029062
1 -1.447809 -0.616584
2 -0.506545  0.432412
3 -1.168424  0.312796
4  1.390517  1.074129
[5 rows x 2 columns]
In [87]:
(df+df2)/2
Out[87]:
          A         B
0 -0.071811  0.228560
1 -1.497093 -1.101477
2  0.065857  0.501196
3 -1.092925  0.854048
4  1.028685  0.407828
[5 rows x 2 columns]
To answer your side question: this has nothing to do with pandas and everything to do with the dict constructor:
In [88]:
{'A': randn(5), 'B': randn(5), 'A': randn(5), 'B': randn(5)}
Out[88]:
{'B': array([-0.03087831, -0.24416885, -2.29924624, 0.68849978, 0.41938536]),
'A': array([ 2.18471335, 0.68051101, -0.35759988, 0.54023489, 0.49029071])}
dict keys must be unique, so in the constructor the later duplicate keys simply overwrite the values assigned to the earlier ones.
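A quick sketch, independent of pandas, shows the overwriting:
>>> {'A': 1, 'B': 2, 'A': 3, 'B': 4}
{'A': 3, 'B': 4}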
EDIT
If you insist on having duplicate columns, then you have to create a new dataframe from the result, because if you were to assign the means back to columns 'A' and 'B' of df3, they would still be duplicated, as the columns are repeated:
In [92]:
df3 = pd.concat([df, df2], axis=1)
new_df = pd.DataFrame()
new_df['A'] = df3['A'].sum(axis=1) / df3['A'].shape[1]
new_df['B'] = df3['B'].sum(axis=1) / df3['B'].shape[1]
new_df
Out[92]:
          A         B
0 -0.071811  0.228560
1 -1.497093 -1.101477
2  0.065857  0.501196
3 -1.092925  0.854048
4  1.028685  0.407828
[5 rows x 2 columns]
So the above works with df3, and in fact for an arbitrary number of repeated columns, which is why I am using shape; you could hard-code this to 2 if you knew the columns were only ever duplicated once.
