I have a DataFrame where each observation is identified by an index. However, for some indices the DataFrame contains several observations, and only one of them holds the most up-to-date data. I would like to drop the outdated duplicate rows based on the values in some of the columns.
For example, in the following DataFrame, how can I drop the first and third rows with index = 122?
index col1 col2
122 - -
122 one two
122 - two
123 four one
124 five -
That is, I would like to get a final DF like this:
index col1 col2
122 one two
123 four one
124 five -
This seems to be a very common problem when data is gathered through several retrievals over time, but I cannot figure out an efficient way of cleaning it.
You could use groupby/transform to create a boolean mask which is True where the group count is greater than 1 and any of the values in the row equals '-'. Then you could use df.loc[~mask] to select the unmasked rows of df:
import pandas as pd
df = pd.read_table('data', sep=r'\s+')
count = df.groupby(['index'])['col1'].transform('count') > 1
mask = (df['col1'] == '-') | (df['col2'] == '-')
mask = mask & count
result = df.loc[~mask]
print(result)
yields
index col1 col2
1 122 one two
3 123 four one
4 124 five -
If the index is already a column then you can drop_duplicates and pass keep='last':
In [14]:
df.drop_duplicates('index', keep='last')
Out[14]:
index col1 col2
2 122 - two
3 123 four one
4 124 five -
If it's actually your index, then you'd be better off calling reset_index first, performing the above step, and then setting the index back again (see the sketch after the next paragraph).
Index also has a drop_duplicates method, but that just removes duplicates from the index itself; the returned index does not let you select the corresponding rows of the df with the duplicates removed, so I recommend the above approach of calling drop_duplicates on the df itself.
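For illustration, a minimal sketch of that round trip, assuming the duplicated key is the DataFrame's actual index (the data is the one from the question):

import pandas as pd

df = pd.DataFrame({'col1': ['-', 'one', '-', 'four', 'five'],
                   'col2': ['-', 'two', 'two', 'one', '-']},
                  index=[122, 122, 122, 123, 124])
df.index.name = 'index'

# move the index into a column, de-duplicate on it, then restore it as the index
result = (df.reset_index()
            .drop_duplicates('index', keep='last')
            .set_index('index'))
print(result)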
EDIT
Based on your new information, the easiest approach may be to replace the outdated data with NaN values and drop those rows:
In [36]:
df.replace('-', np.nan).dropna()
Out[36]:
col1 col2
index
122 one two
123 four one
Another Edit
What you could do is groupby the index and take the first values of the remaining columns, then call reset_index:
In [56]:
df.groupby('index')[['col1', 'col2']].first().reset_index()
Out[56]:
index col1 col2
0 122 - -
1 123 four one
2 124 five -
Related
I have the following df:
d = {"Col1":['a','d','b','c','a','d','b','c'],
"Col2":['x','y','x','z','x','y','z','y'],
"Col3":['n','m','m','l','m','m','l','l'],
"Col4":[1,4,2,2,1,4,2,2]}
df = pd.DataFrame(d)
When I groupby on three fields, I get the result:
gb = df.groupby(['Col1', 'Col2', 'Col3'])['Col4'].agg(['sum', 'mean'])
How can I extract only the rows where a row of one group matches at least one row of another group on some of the grouped columns? The original post included a picture highlighting the rows to keep.
I want to get the rows marked in red, on the basis of the ones in blue and black which match each other on those columns.
Apologies if my statement is ambiguous. Any help would be appreciated.
You can reset_index, then use duplicated and a boolean index to filter your dataframe:
gb = gb.reset_index()
gb[gb.duplicated(subset=['Col2','Col3'], keep=False)]
Output:
Col1 Col2 Col3 sum mean
0 a x m 1 1
2 b x m 2 2
3 b z l 2 2
5 c z l 2 2
Make a table with all allowed combinations and then inner join it with this dataframe.
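A minimal sketch of that idea, assuming the "allowed combinations" are the (Col2, Col3) pairs that occur under more than one Col1 group, and that gb is the grouped result from the question (before any reset_index); the construction of the allowed table here is hypothetical and could come from anywhere:

# table of allowed (Col2, Col3) combinations:
# here, pairs that appear under more than one Col1 group
allowed = (gb.reset_index()
             .groupby(['Col2', 'Col3'])
             .size()
             .loc[lambda s: s > 1]
             .reset_index()[['Col2', 'Col3']])

# inner join keeps only the rows whose (Col2, Col3) pair is in the allowed table
result = gb.reset_index().merge(allowed, on=['Col2', 'Col3'], how='inner')
print(result)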
Is it possible to reorder pandas.DataFrame rows (based on an index) inplace?
I am looking for something like this:
df.reorder(idx, axis=0, inplace=True)
where idx is not equal to df.index but is of the same type; it contains the same elements, just in another order. The result should be df reordered according to the new idx.
I have not found anything in the documentation, and I could not get reindex_axis to do this. That made me hope it was possible, because:
A new object is produced unless the new index is equivalent to the
current one and copy=False
I might have misunderstood what equivalent index means in this context.
Try using the reindex function (note that this is not inplace):
>>> import pandas as pd
>>> df = pd.DataFrame({'col1':[1,2,3],'col2':['test','hi','hello']})
>>> df
col1 col2
0 1 test
1 2 hi
2 3 hello
>>> df = df.reindex([2,0,1])
>>> df
col1 col2
2 3 hello
0 1 test
1 2 hi
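Since reindex returns a new object, the closest thing to an in-place reorder is to assign the result back; a small usage sketch with the same toy frame:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['test', 'hi', 'hello']})
idx = [2, 0, 1]  # a permutation of df.index, as in the question

# reindex produces a new frame; rebinding df gives the "in place" effect
df = df.reindex(idx)
print(df)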
In a non-indexed df, each row contains a gene, a cell that carries a mutation in that gene, and the type of that mutation:
df = pd.DataFrame({'gene': ['one','one','one','two','two','two','three'],
'cell': ['A', 'A', 'C', 'A', 'B', 'C','A'],
'mutation': ['frameshift', 'missense', 'nonsense', '3UTR', '3UTR', '3UTR', '3UTR']})
df:
cell gene mutation
0 A one frameshift
1 A one missense
2 C one nonsense
3 A two 3UTR
4 B two 3UTR
5 C two 3UTR
6 A three 3UTR
I'd like to pivot this df so I can index by gene and set columns to cells. The trouble is that there can be multiple entries per cell: a given cell can have multiple mutations in the same gene (cell A has two different mutations in gene 'one'). So when I run:
df.pivot_table(index='gene', columns='cell', values='mutation')
this happens:
DataError: No numeric types to aggregate
I'd like to use masking to perform the pivot while capturing the presence of at least one mutation:
A B C
gene
one 1 1 1
two 0 1 0
three 1 1 0
Solution with drop_duplicates and pivot_table:
df = (df.drop_duplicates(['cell','gene'])
        .pivot_table(index='gene',
                     columns='cell',
                     values='mutation',
                     aggfunc=len,
                     fill_value=0))
print (df)
cell A B C
gene
one 1 0 1
three 1 0 0
two 1 1 1
Another solution with drop_duplicates, then groupby with the size aggregation, and finally a reshape with unstack:
df = (df.drop_duplicates(['cell','gene'])
        .groupby(['cell', 'gene'])
        .size()
        .unstack(0, fill_value=0))
print (df)
cell A B C
gene
one 1 0 1
three 1 0 0
two 1 1 1
The error message is not about the index. pivot_table can handle multiple values for the same index/column combination (I don't believe this is true of the pivot method). You can fix the problem by changing the aggregation to something that works on strings rather than numerics: most aggregation functions operate only on numeric columns, which is why the code you wrote above produces an error about the data type of the column, not about the index.
df.pivot_table(index='gene',
               columns='cell',
               values='mutation',
               aggfunc='count',
               fill_value=0)
If you only want 1 value per cell you can do a groupby and aggregate everything to 1 and then unstack a level.
df.groupby(['cell', 'gene']).agg(lambda x: 1).unstack(fill_value=0)
I'm trying to re-insert into a pandas dataframe a column that I extracted and whose order I changed by sorting it.
Very simply, I have extracted a column from a pandas df:
col1 = df.col1
This column contains integers, and I used the .sort() method to order it from smallest to largest, then did some operations on the data.
col1.sort()
#do stuff that changes the values of col1.
Now the indexes of col1 are the same as the indexes of the overall df, but in a different order.
I was wondering how I can insert the column back into the original dataframe, replacing the col1 that is there at the moment.
I have tried both of the following methods:
1)
df.col1 = col1
2)
df.insert(column_index_of_col1, "col1", col1)
but both methods give me the following error:
ValueError: cannot reindex from a duplicate axis
Any help will be greatly appreciated.
Thank you.
Consider this DataFrame:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]}, index=[0, 0, 1])
df
Out:
A B
0 1 6
0 2 5
1 3 4
Assign the second column to b, then sort it and take the square, for example:
b = df['B']
b = b.sort_values()
b = b**2
Now b is:
b
Out:
1 16
0 25
0 36
Name: B, dtype: int64
Without knowing the exact operation you've done on the column, there is no way to tell whether 25 corresponds to the first row of the original DataFrame or the second. You could invert the operation (take the square root and match, for example), but that would be unnecessary, I think. If you start with an index that has unique elements (df = df.reset_index()), it is much easier. In that case,
df['B'] = b
should work just fine.
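For example, a small sketch of that route, continuing the toy frame above; reset_index(drop=True) is one way to get a unique index (drop=True simply discards the old duplicated labels instead of keeping them as a column):

df = df.reset_index(drop=True)   # index is now 0, 1, 2 and unique

b = df['B'].sort_values() ** 2   # the same sort-and-square as before
df['B'] = b                      # aligns on the unique index, no error
print(df)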
So I created two dataframes from existing CSV files, both consisting entirely of numbers. The second dataframe has an index from 0 to 8783 and one column of numbers, and I want to add it as a new column to the first dataframe, whose index consists of a month, day and hour. I tried append, merge and concat, and none of them worked, so I then tried simply:
x1GBaverage['Power'] = x2_cut
where x1GBaverage is the first dataframe and x2_cut is the second. When I did this it added x2_cut as a column, but all the values came out as NaN instead of the numbers they should be. How should I be approaching this?
x1GBaverage['Power'] = x2_cut.values
problem solved :)
The thing about pandas is that values stay linked to their index: assignment aligns on the index unless you deliberately pass only the underlying values.
If the row counts are the same and you just want to tack it on the end, the indexes either need to match, or you need to pass just the underlying values. In the example below, columns 3 and 5 are the index-matching and values versions, and column 4 is what you're running into now:
In [58]: df = pd.DataFrame(np.random.random((3,3)))
In [59]: df
Out[59]:
0 1 2
0 0.670812 0.500688 0.136661
1 0.185841 0.239175 0.542369
2 0.351280 0.451193 0.436108
In [61]: df2 = pd.DataFrame(np.random.random((3,1)))
In [62]: df2
Out[62]:
0
0 0.638216
1 0.477159
2 0.205981
In [64]: df[3] = df2
In [66]: df.index = ['a', 'b', 'c']
In [68]: df[4] = df2
In [70]: df[5] = df2.values
In [71]: df
Out[71]:
0 1 2 3 4 5
a 0.670812 0.500688 0.136661 0.638216 NaN 0.638216
b 0.185841 0.239175 0.542369 0.477159 NaN 0.477159
c 0.351280 0.451193 0.436108 0.205981 NaN 0.205981
If the row counts differ, you'll need to use df.merge and let it know which columns it should be using to join the two frames.
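A minimal sketch of that case; the key columns and data here are hypothetical, made up only to show the call shape:

import pandas as pd

left = pd.DataFrame({'month': [1, 1, 2], 'day': [1, 2, 1], 'hour': [0, 0, 0],
                     'value': [10.0, 11.0, 12.0]})
right = pd.DataFrame({'month': [1, 2], 'day': [2, 1], 'hour': [0, 0],
                      'power': [5.0, 6.0]})

# tell merge explicitly which columns identify matching rows
merged = left.merge(right, on=['month', 'day', 'hour'], how='left')
print(merged)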