I have data from an Excel file in the format
0,1,0
1,0,0
0,0,1
I want to convert those data into a list where the ith element indicates the position of the nonzero element for the ith row. For example, the above would be:
[1,0,2]
I tried two ways to no avail:
Way one (NumPy)
df = pd.read_excel(file,convert_float=False)
idx = np.where(df==1)[1]
This gives me an odd result: idx is never the same length as the number of rows in df, even though for this data set the two numbers should always be equal. (I double-checked, and there are no empty rows.)
Way two (Pandas)
idx = df.where(df==1)
This gives me output like:
52 NaN NaN NaN
53 1 NaN NaN
54 1 NaN NaN
This is the appropriate shape, but I don't know how to just get the column index.
Set up the dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[0,1,0],[1,0,0],[0,0,1]]))
Use np.argwhere to find the element indices:
np.argwhere(df.values == 1)
returns:
array([[0, 1],
       [1, 0],
       [2, 2]], dtype=int64)
so for row 0, column 1 contains a 1, given the df:
0 1 2
0 0 1 0
1 1 0 0
2 0 0 1
Note: you can get just the column indices with, for example, np.array_split(indices, 2, 1)[1].
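Assuming every row contains exactly one 1 (as in the question's data), a minimal self-contained sketch that turns this into the requested list:

import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 0], [1, 0, 0], [0, 0, 1]])

# column index of the single 1 in each row
idx = np.argwhere(df.values == 1)[:, 1]
print(idx.tolist())  # [1, 0, 2]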
Here is a solution that works for limited use cases, including this one. If you know that each row will contain only a single 1, you can transpose the original data frame so that its column indices become the row indices of the transposed data frame. With that, you can take the idxmax of each column of the transpose and return an array of those index values, one per original row.
Your original data frame is not the best example for this solution because it is symmetrical and its transpose is the same as the original data frame. So for the sake of this solution we'll use a starting data frame that looks like:
df = pd.DataFrame({0:[0,0,1], 1:[1,0,0], 2:[0,1,0]})
# original data frame --> df
0 1 2
0 0 1 0
1 0 0 1
2 1 0 0
# transposed data frame --> df.T
0 1 2
0 0 0 1
1 1 0 0
2 0 1 0
Now to find the max of each row:
np.array(df.T.idxmax())
Which returns an array of values that represent the column indices of the original data frame that contain a 1:
[1 2 0]
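As a side note, the same result can usually be obtained without the transpose by taking the row-wise idxmax directly; a one-line sketch using the df defined above:

np.array(df.idxmax(axis=1))  # also gives [1 2 0]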
Related
I have a pandas dataframe (shown as a screenshot in the original post).
I want to drop the rows that have only one non-zero value. What's the most efficient way to do this?
Try boolean indexing
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame(np.zeros((10, 10)), columns=list('abcdefghij'))
df.iloc[2:5, 3] = 1
df.iloc[4:5, 4] = 1
# boolean indexing based on condition
df[df.ne(0).sum(axis=1).ne(1)]
Only rows 2 and 3 are removed: row 4 has two non-zero values and every other row has none, so rows 2 and 3 are the only ones with exactly one non-zero value.
df.ne(0).sum(axis=1)
0 0
1 0
2 1
3 1
4 2
5 0
6 0
7 0
8 0
9 0
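For reference, a quick sketch of what survives the filter with the sample data above:

df[df.ne(0).sum(axis=1).ne(1)].index.tolist()
# [0, 1, 4, 5, 6, 7, 8, 9]  -> rows 2 and 3 are gone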
Not sure if this is the most efficient but I'll try:
df[[col for col in df.columns if (df[col] != 0).sum() == 1]]
Two passes per column here: one to check != 0 and one more to sum the boolean values up (we could break early once a second non-zero value is found).
Otherwise, you can define a custom function to check without looping twice per column:
def check(column):
    already_has_one = False
    for value in column:
        if value != 0:
            if already_has_one:
                return False
            already_has_one = True
    return already_has_one
then:
df[[col for col in df.columns if check(df[col])]]
Which is much faster than the first.
Or like this:
df[(df.applymap(lambda x: bool(x)).sum(1) > 1).values]
I was looking for some help regarding how I can add a column to my df that contains the cluster id (the algorithm used to cluster the dataset is DBSCAN). I tried the following:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn import metrics

# Compute DBSCAN
db = DBSCAN(eps=1, min_samples=30, algorithm='kd_tree', n_jobs=-1).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
np.sum(labels)
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_clusters_
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels))
df = df.join(pd.DataFrame(labels))
df = df.rename(columns={0:'Cluster'})
df.head()
But I have a problem that does not seem logical. Before the clustering my dataset had no missing values, whereas when I add the column (Cluster), where cluster = -1 marks noise, I get missing values too(!). So when I try to clean my dataset I have no option but to exclude both cluster = -1 and the missing values, which I do not want. Can you please help me with my issue?
You can find attached the output that contains the problem.
There are about 3000 missing values in the clustering column, and I don't understand how that happened.
Before the extra column was added, the dataset had 38037 rows.
Any comment would be helpful.
Thank you
Something has happened to the indices in your df. As you can read in the pandas join docs, if the on parameter has not been specified:
Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index.
So, something like this is happening:
labels
Out[66]: array([ 0, 0, 0, 1, 1, -1], dtype=int64)
# make dataframe that exactly matches labels
df = pd.DataFrame(labels, columns=['a'])
df
Out[68]:
a
0 0
1 0
2 0
3 1
4 1
5 -1
# change indices
df = df.set_index([pd.Index([0, 1, 3, 5, 7, 8])])
df
Out[70]:
a
0 0
1 0
3 0
5 1
7 1
8 -1
df.join(pd.DataFrame(labels))
Out[71]:
a 0
0 0 0.0
1 0 0.0
3 0 1.0
5 1 -1.0
7 1 NaN
8 -1 NaN
I'd suggest resetting the indices before running DBSCAN if you don't need the current ones: df.reset_index(drop=True, inplace=True).
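A minimal sketch of how that could look in the question's code, assuming df and labels as defined there (naming the labels column up front also makes the later rename unnecessary):

df = df.reset_index(drop=True)                           # make the index 0..n-1 again
df = df.join(pd.DataFrame(labels, columns=['Cluster']))  # index-on-index join now lines up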
This line in your code is causing the missing values:
df = df.join(pd.DataFrame(labels))
Explanation:
pandas.DataFrame.join() joins DataFrame objects by index. The "df" DataFrame has an Int64Index with values ranging from 0 to 41187, but only 38037 entries - that means the index values are not consecutive but contain gaps, probably from removing/filtering rows after the dataframe was created and before your code snippet was executed.
The DataFrame containing the labels that you create with pd.DataFrame(labels) will have its own default index, with values ranging from 0 to 38036. When this DataFrame is joined with the original DataFrame, only rows where the index values of your original DataFrame and the label DataFrame match receive a label, and due to the gaps in your original DataFrame's index this is only the case for 35246 rows.
The easiest solution is to reindex the original DataFrame so it contains consecutive index values again:
df = df.reset_index(drop=True).join(pd.DataFrame(labels))
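Alternatively, since labels is a plain NumPy array with one entry per row of X, you can sidestep index alignment entirely by assigning it as a column; a sketch that assumes the rows of df are in the same order as X:

df['Cluster'] = labels  # ndarray assignment is positional and ignores the existing index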
I have a multi-index dataframe and want to set a slice of one of its columns equal to a series, ordered (sorted) according to the match between the column slice's index and the series' index. The column's innermost index and the series' index contain the same labels, just in a different order (see example below).
I can do this by first sorting the series' index according to the column's index and then using series.values (see below), but this feels like a workaround and I was wondering if it's possible to directly assign the series to the column slice.
example:
import pandas as pd
multi_index = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']])
df = pd.DataFrame(0, multi_index, ['p', 'q'])
s1 = pd.Series([1, 2], ['y', 'x'])
df.loc['a', 'p'] = s1[df.loc['a', 'p'].index].values
The code above gives the desired output, but I was wondering if the last line could be done simpler, e.g.:
df.loc['a','p']=s1
but this sets the column slice to NaNs.
Desired output:
p q
a x 2 0
y 1 0
b x 0 0
y 0 0
obtained output from df.loc['a','p']=s1:
p q
a x NaN 0
y NaN 0
b x 0.0 0
y 0.0 0
It seems like a simple issue to me but I haven't been able to find the answer anywhere.
Have you tried something like this?
df.loc['a']['p'] = s1
The resulting df is:
p q
a x 2 0
y 1 0
b x 0 0
y 0 0
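If that chained assignment does not propagate on your pandas version (chained indexing can write to a copy and raise a SettingWithCopyWarning), a sketch that stays close to the question's own workaround is to align the series to the slice's index explicitly before assigning:

df.loc['a', 'p'] = s1.reindex(df.loc['a', 'p'].index).values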
I have seen a variant of this question that keeps the top n rows of each group in a pandas dataframe, where the solutions use n as an absolute number rather than a percentage: Pandas get topmost n records within each group. However, in my dataframe each group has a different number of rows, and I want to keep the top n% of rows of each group. How would I approach this problem?
You can construct a Boolean series of flags from a groupby and use it to filter. First let's create an example dataframe and look at the number of rows for each unique value in the first series:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
1 2
0
0 1 1
0 1 1
0 1 0
1 1 1
1 1 0
As you can see, the resulting dataframe has only three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.
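For comparison, a more compact (though typically slower) sketch that keeps the first 50% of rows of each group via apply, assuming the same df and n as above:

df.groupby(0, group_keys=False).apply(lambda g: g.head(int(len(g) * n)))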
Here is another option which builds on some of the answers in the post you mentioned.
First of all here is a quick function to either round up or round down. If we want the top 30% of rows of a dataframe 8 rows long then we would try to take 2.4 rows. So we will need to either round up or down.
My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but had one group with only a single row, we would still keep that row. I kept this separate so that you can change the rounding as you wish.
def round_func(x, up=True):
    '''Function to round up or round down a float'''
    if up:
        return int(x + 1)
    else:
        return int(x)
Next I make a dataframe to work with and set a parameter p to be the fraction of the rows from each group that we should keep. Everything follows and I have commented it so that hopefully you can follow.
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
p = 0.30 # top fraction to keep from each group (currently 30%)
df_top = df.groupby('id').apply(                  # group by the ids
    lambda x: x.reset_index()['value'].nlargest(  # in each group take the top rows by column 'value'
        round_func(x.count().max() * p)))         # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1) # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1
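As a side note, int(x+1) also bumps whole numbers (2.0 becomes 3), so if you want a true ceiling instead, math.ceil is the usual drop-in; a sketch of that variant:

import math

def round_func(x, up=True):
    '''True ceiling/floor: 2.4 -> 3 when rounding up, 2.0 stays 2'''
    return math.ceil(x) if up else math.floor(x)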
This question was asked in multiple other posts but I could not get any of the methods to work. This is my dataframe:
df = pd.DataFrame([[1,2,3,4,5],[1,2,0,4,5]])
I would like to know how I can either:
1) Delete rows that contain any/all zeros
2) Delete columns that contain any/all zeros
In order to delete rows that contain any zeros (or all zeros), these worked:
df2 = df[~(df == 0).any(axis=1)]
df2 = df[~(df == 0).all(axis=1)]
But I cannot get this to work column-wise. I tried setting axis=0, but that gives me this warning:
__main__:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Any suggestions?
You're going to need loc for this:
df
0 1 2 3 4
0 1 2 3 4 5
1 1 2 0 4 5
df.loc[:, ~(df == 0).any(0)] # notice the :, this means we are indexing on the columns now, not the rows
0 1 3 4
0 1 2 4 5
1 1 2 4 5
Direct indexing defaults to indexing on the rows. Your boolean key, ~(df == 0).any(0), is indexed by the columns [0, 1, 2, 3, 4], while the dataframe has only two rows, so pandas warns you that the boolean Series key will be reindexed to match the DataFrame's row index.
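For completeness, a sketch of the same pattern covering the other cases listed in the question (any/all zeros, by rows and by columns):

df[~(df == 0).any(axis=1)]         # drop rows containing any zero
df[~(df == 0).all(axis=1)]         # drop rows that are entirely zero
df.loc[:, ~(df == 0).any(axis=0)]  # drop columns containing any zero
df.loc[:, ~(df == 0).all(axis=0)]  # drop columns that are entirely zero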