Make new dataframe with value based on list Python [duplicate]

I have a dataframe df:
20060930 10.103 NaN 10.103 7.981
20061231 15.915 NaN 15.915 12.686
20070331 3.196 NaN 3.196 2.710
20070630 7.907 NaN 7.907 6.459
Then I want to select the rows whose positional sequence numbers are given in a list, say [1, 3], which would leave:
20061231 15.915 NaN 15.915 12.686
20070630 7.907 NaN 7.907 6.459
What function can do that?

Use .iloc for integer-based indexing and .loc for label-based indexing. See the example below:
ind_list = [1, 3]
df.iloc[ind_list]
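For instance, here is a minimal sketch (with a made-up dataframe and index labels, so treat it as illustrative only) showing how .iloc and .loc differ when the index labels are not simply 0, 1, 2, ...:
import pandas as pd

# Hypothetical dataframe whose index labels differ from the row positions
df = pd.DataFrame({'a': [10.1, 15.9, 3.2, 7.9]}, index=[100, 101, 102, 103])

ind_list = [1, 3]
print(df.iloc[ind_list])    # rows at positions 1 and 3 (labels 101 and 103)
print(df.loc[[101, 103]])   # the same rows, selected by their labels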

You can also use .iloc directly:
df.iloc[[1, 3], :]
This will not work if the index labels in your dataframe no longer correspond to the row positions, for example after prior filtering or sorting. In that case use:
df[df.index.isin([1, 3])]
... as suggested in other answers.
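As a quick illustration (a hedged sketch with made-up data) of why the fallback matters once labels and positions diverge:
import pandas as pd

df = pd.DataFrame({'val': range(10, 20)})
filtered = df[df['val'] > 13]                  # keeps index labels 4..9

print(filtered.iloc[[1, 3]])                   # positions 1 and 3 -> labels 5 and 7
print(filtered[filtered.index.isin([5, 7])])   # the same rows, selected by label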

Another way, although the code is slightly longer, that is faster than the approaches above. Check it with %timeit:
df[df.index.isin([1, 3])]
PS: You can figure out the reason yourself.
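If you want to verify the timing claim yourself, a rough sketch of the comparison in an IPython/Jupyter session (results will vary with the size of your dataframe) could be:
%timeit df.iloc[[1, 3]]
%timeit df.loc[df.index[[1, 3]]]
%timeit df[df.index.isin([1, 3])]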

If index_list contains your desired row positions, you can get a dataframe with those rows by doing
index_list = [1,2,3,4,5,6]
df.loc[df.index[index_list]]
This is based on the latest documentation as of March 2021.

For large datasets, it is memory efficient to read only selected rows via the skiprows parameter.
Example
pred = lambda x: x not in [1, 3]
pd.read_csv("data.csv", skiprows=pred, index_col=0, names=...)
This returns a DataFrame read from the file, keeping only rows 1 and 3 and skipping everything else.
Details
From the docs:
skiprows : list-like or integer or callable, default None
...
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2]
This feature works in version pandas 0.20.0+. See also the corresponding issue and a related post.
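As a concrete, hypothetical sketch (the file name and column names below are placeholders, not from the original post), keeping only rows 1 and 3 of a headerless CSV whose first column is the index:
import pandas as pd

keep = [1, 3]
df = pd.read_csv(
    "data.csv",                          # hypothetical file
    skiprows=lambda x: x not in keep,    # skip every row index not in the keep list
    index_col=0,
    header=None,
    names=["date", "a", "b", "c", "d"],  # hypothetical column names
)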

There are many ways to solve this problem, and the ones listed above are the most commonly used. I want to add two more, in case someone is looking for an alternative.
index_list = [1, 3]
df.take(index_list)
# or
df.query('index in @index_list')
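A quick hedged sketch of both alternatives on a toy dataframe (note that query references the Python variable with the @ prefix):
import pandas as pd

df = pd.DataFrame({'val': [10.1, 15.9, 3.2, 7.9]})
index_list = [1, 3]

print(df.take(index_list))                # positional take
print(df.query('index in @index_list'))   # query against the (default) integer index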

What you are trying to do is to filter your dataframe by index. The best way to do that in pandas at the moment is the following:
Single Index
desired_index_list = [1,3]
df[df.index.isin(desired_index_list)]
Multiindex
desired_index_list = [1,3]
index_level_to_filter = 0
df[df.index.get_level_values(index_level_to_filter).isin(desired_index_list)]
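Here is a small sketch with a made-up two-level index (names and values are illustrative) showing the MultiIndex variant in action:
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 'a'), (2, 'a'), (3, 'b'), (4, 'b')], names=['num', 'grp'])
df = pd.DataFrame({'val': [10, 20, 30, 40]}, index=idx)

desired_index_list = [1, 3]
index_level_to_filter = 0
print(df[df.index.get_level_values(index_level_to_filter).isin(desired_index_list)])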

To get a new DataFrame from filtered indexes:
For my problem, I needed a new dataframe built from a list of indexes. I found a straightforward way to do this:
iloc_list=[1,2,4,8]
df_new = df.filter(items = iloc_list , axis=0)
You can also filter columns using this. Please see the documentation for details.
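For instance, the same filter call with axis=1 selects columns by name (a small sketch with made-up column names):
import pandas as pd

df = pd.DataFrame({'alpha': [1, 2], 'beta': [3, 4], 'gamma': [5, 6]})
print(df.filter(items=['alpha', 'gamma'], axis=1))   # keep only these two columns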

Related

Pandas map, check if any values in a list are inside another

I have the following list
x = [1,2,3]
And the following df
Sample df
pd.DataFrame({'UserId':[1,1,1,2,2,2,3,3,3,4,4,4],'Origins':[1,2,3,2,2,3,7,8,9,10,11,12]})
Let's say I want to return the UserIds whose grouped Origins contain any of the values in the list.
Wanted result
pd.Series({'UserId':[1,2]})
What would be the best approach? Maybe a groupby with a lambda, but I am having a little trouble formulating the condition.
df['UserId'][df['Origins'].isin(x)].drop_duplicates()
I had considered using unique(), but that returns a numpy array. Since you wanted a series, I went with drop_duplicates().
IIUC, OP wants the UserIds whose Origins include any of the values in list x. If that is the case, the following, using pandas.Series.isin and pandas.unique, will do the work:
df_new = df[df['Origins'].isin(x)]['UserId'].unique()
[Out]:
[1 2]
Assuming one wants a Series, one can convert the resulting array to a Series as follows:
df_new = pd.Series(df_new)
[Out]:
0 1
1 2
dtype: int64
If one wants to return a Series and do it all in one step, instead of pandas.unique one can use pandas.DataFrame.drop_duplicates (see Steven Rumbaliski's answer).
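Putting it together on the sample data from the question, a quick sketch of the one-step Series result (using .loc, one of several equivalent spellings):
import pandas as pd

x = [1, 2, 3]
df = pd.DataFrame({'UserId': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                   'Origins': [1, 2, 3, 2, 2, 3, 7, 8, 9, 10, 11, 12]})

result = df.loc[df['Origins'].isin(x), 'UserId'].drop_duplicates()
print(result)   # UserIds 1 and 2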

Referencing each element in a dictionary of dataframes [duplicate]

I am curious as to why df[2] is not supported, while df.ix[2] and df[2:3] both work.
In [26]: df.ix[2]
Out[26]:
A 1.027680
B 1.514210
C -1.466963
D -0.162339
Name: 2000-01-03 00:00:00
In [27]: df[2:3]
Out[27]:
A B C D
2000-01-03 1.02768 1.51421 -1.466963 -0.162339
I would expect df[2] to work the same way as df[2:3] to be consistent with Python indexing convention. Is there a design reason for not supporting indexing row by single integer?
Echoing @HYRY, see the new docs in 0.11:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
Here we have new operators, .iloc to explicitly support only integer indexing, and .loc to explicitly support only label indexing.
e.g. imagine this scenario
In [1]: df = pd.DataFrame(np.random.rand(5,2),index=range(0,10,2),columns=list('AB'))
In [2]: df
Out[2]:
A B
0 1.068932 -0.794307
2 -0.470056 1.192211
4 -0.284561 0.756029
6 1.037563 -0.267820
8 -0.538478 -0.800654
In [5]: df.iloc[[2]]
Out[5]:
A B
4 -0.284561 0.756029
In [6]: df.loc[[2]]
Out[6]:
A B
2 -0.470056 1.192211
[] slices the rows (by integer location or by label) only
The primary purpose of the DataFrame indexing operator, [] is to select columns.
When the indexing operator is passed a string or integer, it attempts to find a column with that particular name and return it as a Series.
So, in the question above: df[2] searches for a column name matching the integer value 2. This column does not exist and a KeyError is raised.
The DataFrame indexing operator completely changes behavior to select rows when slice notation is used
Strangely, when given a slice, the DataFrame indexing operator selects rows and can do so by integer location or by index label.
df[2:3]
This will slice beginning from the row with integer location 2 up to 3, exclusive of the last element. So, just a single row. The following selects rows beginning at integer location 6 up to but not including 20 by every third row.
df[6:20:3]
You can also use slices consisting of string labels if your DataFrame index has strings in it. For more details, see this solution on .iloc vs .loc.
I almost never use this slice notation with the indexing operator as it's not explicit and hardly ever used. When slicing by rows, stick with .loc/.iloc.
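For example, on a dataframe with a default RangeIndex (a hedged sketch, not from the original answer), the explicit forms I would reach for instead of df[2:3] are:
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': range(5, 10)})   # default index 0..4

print(df.iloc[2:3])   # explicit: the row at integer position 2
print(df.loc[2:2])    # explicit: the row with index label 2 (label slices are inclusive)
print(df[2:3])        # works, but the intent is far less obvious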
You can think of a DataFrame as a dict of Series. df[key] tries to select the column by key and returns a Series object.
However, slicing inside [] slices the rows, because it's a very common operation.
You can read the document for detail:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
For index-based access to a pandas table, one can also convert the table to a NumPy array:
np_df = df.as_matrix()
(note that as_matrix() is deprecated in recent pandas versions; use df.to_numpy() instead), and then
np_df[i]
would work.
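A small sketch of the modern equivalent using to_numpy() (assuming a recent pandas version):
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})
np_df = df.to_numpy()   # replacement for the deprecated as_matrix()
print(np_df[1])         # second row as a NumPy array: [2. 5.]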
You can take a look at the source code.
DataFrame has a private function _slice() to slice the DataFrame, and it allows an axis parameter to determine which axis to slice. __getitem__() for DataFrame doesn't set the axis when invoking _slice(), so _slice() slices along the default axis 0.
You can try a simple experiment that might help you:
print(df._slice(slice(0, 2)))
print(df._slice(slice(0, 2), 0))
print(df._slice(slice(0, 2), 1))
You can loop through the data frame like this:
for ad in range(len(dataframe_c)):
    print(dataframe_c.values[ad])
I would normally go for .loc/.iloc as suggested by Ted, but one may also select a row by transposing the DataFrame. To stay in the example above, df.T[2] gives you row 2 of df.
If you want to index multiple rows by their integer indexes, use a list of indexes:
idx = [2,3,1]
df.iloc[idx]
N.B. If idx is created using some rule, then you can also sort the dataframe by using .iloc (or .loc) because the output will be ordered by idx. So in a sense, iloc can act like a sorting function where idx is the sorting key.

Efficient way of converting a numpy array of 2 dimensions into a list with no duplicates

I want to extract the values from two different columns of a pandas dataframe, put them in a list with no duplicate values.
I have tried the following:
arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])
This works but it is taking too long. The dataframe contains around 30 million rows.
Example:
column1 column2
1 adr1 adr2
2 adr1 adr2
3 adr3 adr4
4 adr4 adr5
Should generate the list with the values:
[adr1, adr2, adr3, adr4, adr5]
Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows.
@ALollz gave the right answer. I'll extend from there. To convert the result into a list as expected, just use list(np.unique(df.values)).
You can use just np.unique(df) (maybe this is the shortest version).
Formally, the first parameter of np.unique should be an array_like object,
but as I checked, you can also pass just a DataFrame.
Of course, if you want a plain list rather than an ndarray, write
np.unique(df).tolist().
Edit following your comment
If you want the unique values but in their order of appearance, write:
pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()
Operation order:
reshape changes the source array into a single column.
Then a DataFrame is created, with default column name = 0.
Then [0] takes just this (the only) column.
drop_duplicates does exactly what the name says.
And the last step: tolist converts to a plain list.
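An end-to-end sketch on the small example from the question (treat it as illustrative; the real dataframe has 30 million rows):
import pandas as pd

df = pd.DataFrame({'column1': ['adr1', 'adr1', 'adr3', 'adr4'],
                   'column2': ['adr2', 'adr2', 'adr4', 'adr5']})

# unique values across both columns, in order of appearance
result = pd.DataFrame(df.values.reshape(-1, 1))[0].drop_duplicates().tolist()
print(result)   # ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']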

How to optimize code that iterates on a big dataframe in Python

I have a big pandas dataframe. It has thousands of columns and over a million rows. I want to calculate the difference between the max value and the min value row-wise. Keep in mind that there are many NaN values and some rows are all NaN values (but I still want to keep them!).
I wrote the following code. It works but it's time consuming:
totTime = []
for index, row in date.iterrows():
    myRow = row.dropna()
    if len(myRow):
        tt = max(myRow) - min(myRow)
    else:
        tt = None
    totTime.append(tt)
Is there any way to optimize it? I tried the following code, but I get an error when it encounters all-NaN rows:
tt = lambda x: max(x.dropna()) - min(x.dropna())
totTime = date.apply(tt, axis=1)
Any suggestions will be appreciated!
It is usually a bad idea to use a Python for loop to iterate over a large pandas.DataFrame or numpy.ndarray. You should instead use the available built-in functions, as they are optimized and in many cases not actually written in Python but in a compiled language. In your case you should use the methods pandas.DataFrame.max and pandas.DataFrame.min, which both offer a skipna option to skip NaN values without the need to drop them manually. Furthermore, you can choose an axis to compute along, so you can specify axis=1 to compute the max/min across columns, i.e. per row.
This will add up to something similar as what #EdChum just mentioned in the comments:
data.max(axis=1, skipna=True) - data.min(axis=1, skipna=True)
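A small sketch (with made-up data) showing that all-NaN rows survive as NaN in the result, which is what the question asked for:
import numpy as np
import pandas as pd

date = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [5.0, np.nan, np.nan],
                     'c': [2.0, np.nan, 7.0]})

totTime = date.max(axis=1, skipna=True) - date.min(axis=1, skipna=True)
print(totTime)   # 4.0, NaN, 4.0 -- the all-NaN row stays NaN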
I have had the same problem with iterating. Two points:
Why not replace NaN values with 0? You can do it with df.replace(['inf','nan'],[0,0]), which replaces inf and nan values.
Take a look at this. Maybe it helps; I have a similar question about how to optimize a loop that calculates the difference between each row and the previous one.

Pandas: Get duplicated indexes

Given a dataframe, I want to get the duplicated indexes that do not have duplicate values in the columns, and see which values are different.
Specifically, I have this dataframe:
import pandas as pd
wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed
alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed', header=None, index_col=3)
In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False
And some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element in this location), and I want to know what are the different types of repetitive elements for individual locations (each index = a genome location).
I'm guessing this will require some kind of groupby and hopefully some groupby ninja can help me out.
To simplify even further, if we only have the index and the repeat type,
genome_location1 MIR3
genome_location1 AluJb
genome_location2 Tigger1
genome_location3 AT_rich
So in the output I'd like to see all duplicated indexes and their repeat types, like this:
genome_location1 MIR3
genome_location1 AluJb
EDIT: added toy example
Also useful and very succinct:
df[df.index.duplicated()]
Note that this only returns one of the duplicated rows, so to see all the duplicated rows you'll want this:
df[df.index.duplicated(keep=False)]
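A small sketch on a toy frame modeled after the question's example (the labels are made up to match it):
import pandas as pd

df = pd.DataFrame({'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
                  index=['genome_location1', 'genome_location1',
                         'genome_location2', 'genome_location3'])

print(df[df.index.duplicated(keep=False)])   # both genome_location1 rows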
df.groupby(level=0).filter(lambda x: len(x) > 1)['type']
We added the filter method for this kind of operation. You can also use masking and transform for equivalent results, but this is faster, and a little more readable too.
Important:
The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue -- and a related issue with transform on Series -- was fixed for version 0.13, which should be released any day now.
Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try that on a Series with a nonunique index, it too will fail.
There is no good reason why filter and transform should not be applied to nonunique indexes; it was just poorly implemented at first.
Even faster and better:
df.index.get_duplicates()
As of 9/21/18, pandas raises FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release, suggesting the following instead:
df.index[df.index.duplicated()].unique()
>>> df[df.groupby(level=0).transform(len)['type'] > 1]
type
genome_location1 MIR3
genome_location1 AluJb
More succinctly:
df[df.groupby(level=0).type.count() > 1]
FYI a multi-index:
df[df.groupby(level=[0,1]).type.count() > 1]
This gives you the index values along with a preview of the duplicated rows:
def dup_rows_index(df):
    dup = df[df.duplicated()]
    print('Duplicated index loc:', dup.index.tolist())
    return dup
