Given a dataframe, I want to get the duplicated index labels whose rows do not have identical values in the columns, and see which values differ.
Specifically, I have this dataframe:
import pandas as pd
!wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed
alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed', header=None, index_col=3)
In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False
Some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element at that location), and I want to know which different types of repetitive elements occur at each individual location (each index = a genome location).
I'm guessing this will require some kind of groupby and hopefully some groupby ninja can help me out.
To simplify even further, if we only have the index and the repeat type,
genome_location1 MIR3
genome_location1 AluJb
genome_location2 Tigger1
genome_location3 AT_rich
So in the output I'd like to see all duplicated indexes and their repeat types, like this:
genome_location1 MIR3
genome_location1 AluJb
EDIT: added toy example
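For reference, here is a minimal sketch that constructs the toy example above (the column name 'type' is taken from the answers below; everything else is just for illustration):
import pandas as pd
df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'],
)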
Also useful and very succinct:
df[df.index.duplicated()]
Note that this flags only the second and later occurrences of each index label (the first occurrence is not flagged), so to see all rows that share a duplicated index you'll want this:
df[df.index.duplicated(keep=False)]
df.groupby(level=0).filter(lambda x: len(x) > 1)['type']
We added the filter method for this kind of operation. You can also use masking and transform for equivalent results, but this is faster, and a little more readable too.
Important:
The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue -- and a related issue with transform on Series -- was fixed for version 0.13, which should be released any day now.
Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try that on a Series with a nonunique index, it too will fail.
There is no good reason why filter and transform should not be applied to nonunique indexes; it was just poorly implemented at first.
Even faster and better:
df.index.get_duplicates()
As of 2018-09-21, pandas raises FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release, suggesting the following instead:
df.index[df.index.duplicated()].unique()
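If you then want the rows themselves rather than just the labels, one option (a sketch) is to feed those unique labels back into .loc:
dup_labels = df.index[df.index.duplicated()].unique()
df.loc[dup_labels]  # all rows whose index label occurs more than once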
>>> df[df.groupby(level=0).transform(len)['type'] > 1]
type
genome_location1 MIR3
genome_location1 AluJb
More succinctly:
df[df.groupby(level=0).type.count() > 1]
FYI, for a multi-index:
df[df.groupby(level=[0,1]).type.count() > 1]
This gives you the index values along with a preview of the duplicated rows:
def dup_rows_index(df):
    # rows whose values duplicate an earlier row
    dup = df[df.duplicated()]
    print('Duplicated index loc:', dup.index.tolist())
    return dup
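A usage sketch (note that df.duplicated() flags rows whose values repeat an earlier row, which is not the same thing as a repeated index label):
dups = dup_rows_index(df)
print(dups)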
I am new to Python and am converting SQL to Python and want to learn the most efficient way to process a large dataset (rows > 1 million and columns > 100). I need to create multiple new columns based on other columns in the DataFrame. I have recently learned how to use pd.concat for new boolean columns, but I also have some non-boolean columns that rely on the values of other columns.
In SQL I would use a single case statement (case when age > 1000 then sample_id else 0 end as custom1, etc...). In Python I can achieve the same result in 2 steps (pd.concat + loc find & replace) as shown below. I have seen references in other posts to using the apply method but have also read in other posts that the apply method can be inefficient.
My question is then, for the code shown below, is there a more efficient way to do this? Can I do it all in one step within the pd.concat (so far I haven't been able to get that to work)? I am okay doing it in 2 steps if necessary. I need to be able to handle large integers (100 billion) in my custom1 element and have decimals in my custom2 element.
And finally, I tried using multiple separate np.where statements but received a warning that my DataFrame was fragmented and that I should try to use concat. So I am not sure which approach overall is most efficient or recommended.
Update - after receiving a comment and an answer pointing me towards use of np.where, I decided to test the approaches. Using a data set with 2.7 million rows and 80 columns, I added 25 new columns. First approach was to use the concat + df.loc replace as shown in this post. Second approach was to use np.where. I ran the test 10 times and np.where was faster in all 10 trials. As noted above, I think repeated use of np.where in this way can cause fragmentation, so I suppose now my decision comes down to faster np.where with potential fragmentation vs. slower use of concat without risk of fragmentation. Any further insight on this final update is appreciated.
df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])

# step 1: efficiently create starting custom columns using concat
df = pd.concat(
    [
        df,
        (df["age"] > 1000).rename("custom1").astype(int),
        (df["weight"] < 100).rename("custom2").astype(float),
    ],
    axis=1,
)

# step 2: assign final values to custom columns based on other column values
df.loc[df.custom1 == 1, 'custom1'] = df['sample_id']
df.loc[df.custom2 == 1, 'custom2'] = df['weight'] / 2
Thanks for any feedback you can provide...I appreciate your time helping me.
The standard way to do this is using numpy where:
import numpy as np
df['custom1'] = np.where(df.age.gt(1000), df.sample_id, 0)
df['custom2'] = np.where(df.weight.lt(100), df.weight / 2, 0)
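If you are adding many such columns and are worried about the fragmentation warning from repeated single-column assignments, one option (a sketch, reusing the two example columns above) is to build all the new columns first and attach them with a single concat:
import numpy as np
import pandas as pd

# build every new column as an array first...
new_cols = {
    'custom1': np.where(df.age.gt(1000), df.sample_id, 0),
    'custom2': np.where(df.weight.lt(100), df.weight / 2, 0),
}
# ...then attach them all at once, so the frame is extended in a single step
df = pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)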
I have a Pandas dataframe (tempDF) of 5 columns by N rows. Each element of the dataframe is an object (a string in this case); the columns include "medcode" and "enttype" (this is fake data, not real-world data).
I have two tuples, each containing a collection of numbers stored as strings. For example:
codeset = ('6108','532','98120')
additionalClinicalCodes = ('131','1','120','130')
I want to retrieve a subset of the rows from tempDF in which the columns "medcode" OR "enttype" have at least one entry in the tuples above. Thus, from the example above, I would retrieve a subset containing the rows with indexes 8, 9 and 11.
Until updating some packages earlier today (too many now to work out which has started throwing the warning), this did work:
tempDF = tempDF[tempDF["medcode"].isin(codeset) | tempDF["enttype"].isin(additionalClinicalCodes)]
But now it is throwing the warning:
FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask |= (ar1 == a)
Looking at the API, isin states the condition "if ALL" is in the iterable collection. I want an "if ANY" condition.
UPDATE #1
The problem lies with using the | operator, and likewise the np.logical_or method. If I remove the second isin condition, i.e. just keep tempDF[tempDF["medcode"].isin(codeset)], then no warning is thrown, but I'm only subsetting on one of the two conditions.
import numpy as np
tempDF = tempDF[np.logical_or(tempDF["medcode"].isin(codeset), tempDF["enttype"].isin(additionalClinicalCodes))]
I'm unable to reproduce your warning (I assume you are using an outdated numpy version), however I believe it is related to the fact that your enttype column is a numerical type, but you're using strings in additionalClinicalCodes.
Try this:
tempDF = tempDF[tempDF["medcode"].isin(list(codeset)) | tempDF["enttype"].isin(list(additionalClinicalCodes))]
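If the mismatch really is between a numeric enttype column and string codes, another option (just a sketch of the idea) is to align the types explicitly before comparing:
mask = (tempDF["medcode"].astype(str).isin(codeset)
        | tempDF["enttype"].astype(str).isin(additionalClinicalCodes))
tempDF = tempDF[mask]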
Boiling your question down to an executable example:
import pandas as pd
tempDF = pd.DataFrame({'medcode': ['6108', '6154', '95744', '98120'], 'enttype': ['99', '131', '372', '372']})
codeset = ('6108','532','98120')
additionalClinicalCodes = ('131','1','120','130')
newDF = tempDF[tempDF["medcode"].isin(codeset) | tempDF["enttype"].isin(additionalClinicalCodes)]
print(newDF)
print("Pandas Version")
print(pd.__version__)
This returns for me
medcode enttype
0 6108 99
1 6154 131
3 98120 372
Pandas Version
1.4.2
Thus I am not able to reproduce your warning.
This is strange numpy behaviour. I think your way is the right way to do it, but if the warning bothers you, try this:
tempDF = tempDF[
    (
        tempDF.medcode.isin(codeset).astype(int) +
        tempDF.enttype.isin(additionalClinicalCodes).astype(int)
    ) >= 1
]
I have a dataframe df:
20060930 10.103 NaN 10.103 7.981
20061231 15.915 NaN 15.915 12.686
20070331 3.196 NaN 3.196 2.710
20070630 7.907 NaN 7.907 6.459
Then I want to select rows whose sequence numbers are indicated in a list, say [1, 3] here, which leaves:
20061231 15.915 NaN 15.915 12.686
20070630 7.907 NaN 7.907 6.459
How, or with what function, can I do that?
Use .iloc for integer based indexing and .loc for label based indexing. See below example:
ind_list = [1, 3]
df.iloc[ind_list]
you can also use iloc:
df.iloc[[1,3],:]
This will not work if the indexes in your dataframe do not correspond to the row order because of prior computations. In that case use:
df[df.index.isin([1, 3])]
... as suggested in the other responses.
Another way (although the code is longer) that is faster than the ones above. Check it with the %timeit function:
df[df.index.isin([1,3])]
PS: you can figure out the reason for yourself.
If index_list contains your desired indices, you can get the dataframe with the desired rows by doing
index_list = [1,2,3,4,5,6]
df.loc[df.index[index_list]]
This is based on the latest documentation as of March 2021.
For large datasets, it is memory efficient to read only selected rows via the skiprows parameter.
Example
pred = lambda x: x not in [1, 3]
pd.read_csv("data.csv", skiprows=pred, index_col=0, names=...)
This will now return a DataFrame from a file that skips all rows except 1 and 3.
Details
From the docs:
skiprows : list-like or integer or callable, default None
...
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2]
This feature works in pandas 0.20.0+. See also the corresponding issue and a related post.
There are many ways of solving this problem, and the ones listed above are the most commonly used ways of achieving the solution. I want to add two more ways, just in case someone is looking for an alternative.
index_list = [1,3]
df.take(index_list)
#or
df.query('index in #index_list')
What you are trying to do is to filter your dataframe by index. The best way to do that in pandas at the moment is the following:
Single Index
desired_index_list = [1,3]
df[df.index.isin(desired_index_list)]
Multiindex
desired_index_list = [1,3]
index_level_to_filter = 0
df[df.index.get_level_values(index_level_to_filter).isin(desired_index_list)]
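A small sketch of the MultiIndex case (the frame and level names here are made up purely for illustration):
import pandas as pd

df = pd.DataFrame(
    {'value': [10, 20, 30, 40]},
    index=pd.MultiIndex.from_tuples(
        [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')],
        names=['num', 'letter'],
    ),
)
desired_index_list = [1, 3]
index_level_to_filter = 0
# keep only rows whose first index level is in the desired list
print(df[df.index.get_level_values(index_level_to_filter).isin(desired_index_list)])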
To get a new DataFrame from filtered indexes:
For my problem, I needed a new dataframe built from the indexes. I found a straightforward way to do this:
iloc_list = [1, 2, 4, 8]
df_new = df.filter(items=iloc_list, axis=0)
You can also filter columns using this. Please see the documentation for details.
I have a pandas dataframe containing indices that have a one-to-many relationship. A very simplified and shortened example of my data is shown in the DataFrame Example link. I want to get a list or Series or ndarray of the unique namIdx values in which nCldLayers <= 1. The final result should show indices of 601 and 603.
I am able to accomplish this with the 3 statements below, but I am wondering if there is a much better, more succinct way with perhaps 'filter', 'select', or 'where'.
grouped=(namToViirs['nCldLayers']<=1).groupby(namToViirs.index).all(axis=0)
grouped = grouped[grouped==True]
filterIndex = grouped.index
Is there a better approach to accomplishing this result by applying the logical condition (namToViirs['nCldLayers'] <= 1) in a subsequent part of the chain, i.e., first group, then apply the logical condition, and then retrieve only the namIdx values where the logical result is true for every member of the group?
I think your code works nicely; you only need a couple of small changes:
In all, the axis=0 can be omitted.
In grouped[grouped==True], the ==True can be omitted.
grouped = (namToViirs['nCldLayers']<=1).groupby(level='namIdx').all()
grouped = grouped[grouped]
filterIndex = grouped.index
print(filterIndex)
Int64Index([601, 603], dtype='int64', name='namIdx')
I think it is better to first filter by boolean indexing and then groupby, because fewer loops -> better performance.
For question 1, see jezrael's answer. For question 2, you could play with indexes as sets:
namToViirs.index[namToViirs.nCldLayers <= 1] \
.difference(namToViirs.index[namToViirs.nCldLayers > 1])
You might be interested in this answer.
The implementation is currently a bit hackish, but it should reduce your statement above to:
filterIndex = ((namToViirs['nCldLayers']<=1)
               .groupby(namToViirs.index).all(axis=0)[W].index)
EDIT: also see this answer for an analogous approach not requiring external components, resulting in:
filterIndex = ((namToViirs['nCldLayers']<=1)
               .groupby(namToViirs.index).all(axis=0)[lambda x : x].index)
Another option is to use .pipe() and a function which applies the desired filtering.
For instance:
filterIndex = ((namToViirs['nCldLayers']<=1)
               .groupby(namToViirs.index)
               .all(axis=0)
               .pipe(lambda s : s[s])
               .index)
I have a vanilla pandas dataframe with an index. I need to check if the index is sorted. Preferably without sorting it again.
e.g. I can test an index to see if it is unique with index.is_unique; is there a similar way to test whether it is sorted?
How about:
df.index.is_monotonic
Just for the sake of completeness, this would be the procedure to check whether the dataframe index is monotonically increasing and also unique, and, if not, make it so:
if not (df.index.is_monotonic_increasing and df.index.is_unique):
    df.reset_index(inplace=True, drop=True)
NOTE: df.index.is_monotonic_increasing returns True even if there are repeated indices, so it has to be complemented with df.index.is_unique (see the sketch after the references below).
API References
Index.is_monotonic_increasing
Index.is_unique
DataFrame.reset_index
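A quick check of the behaviour described in the note above (just a sketch):
import pandas as pd

idx = pd.Index([1, 1, 2, 3])
print(idx.is_monotonic_increasing)  # True, despite the repeated label
print(idx.is_unique)                # False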
If sorting is allowed, try
all(df.sort_index().index == df.index)
If not, try
all(a <= b for a, b in zip(df.index, df.index[1:]))
The first one is more readable, while the second one has lower time complexity.
EDIT
Adding another method I've just found. It is similar to the second one, but the comparison is vectorized:
all(df.index[:-1] <= df.index[1:])
For non-indices (note that DataFrame.sort was removed in newer pandas versions; use sort_values or sort_index instead):
df.equals(df.sort())