Problem with missing values in cluster id column - python

I was looking for some help on how to add a column to my df containing the cluster id (the algorithm used to cluster the dataset is DBSCAN). I tried the following:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.cluster import DBSCAN

# Compute DBSCAN (X is the feature matrix)
db = DBSCAN(eps=1, min_samples=30, algorithm='kd_tree', n_jobs=-1).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

df = df.join(pd.DataFrame(labels))
df = df.rename(columns={0: 'Cluster'})
df.head()
But I have a problem that does not seem logical. Before the clustering my dataset had no missing values, whereas when I add the Cluster column (cluster = -1 is noise, etc.), I get missing values too(!). So when I try to clean my dataset, I have no option other than excluding both cluster = -1 and the missing values, which I do not want. Can you please help me with my issue?
You can find attached the output that contains the problem.
There are about 3000 missing values in the clustering column and I don't understand how that happened.
Before the extra column was added, the dataset had 38037 rows.
Any comment would be helpful.
Thank you

Something has happened to the indices in your df. As you can read in the Pandas join docs, if the parameter on is not specified:
Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index.
So, something like this is happening:
labels
Out[66]: array([ 0,  0,  0,  1,  1, -1], dtype=int64)
# make a dataframe that exactly matches labels
df = pd.DataFrame(labels, columns=['a'])
df
Out[68]:
   a
0  0
1  0
2  0
3  1
4  1
5 -1
# change indices
df = df.set_index([pd.Index([0, 1, 3, 5, 7, 8])])
df
Out[70]:
   a
0  0
1  0
3  0
5  1
7  1
8 -1
df.join(pd.DataFrame(labels))
Out[71]:
   a    0
0  0  0.0
1  0  0.0
3  0  1.0
5  1 -1.0
7  1  NaN
8 -1  NaN
I'd suggest resetting the index before running DBSCAN if you don't need the current indices: df.reset_index(drop=True, inplace=True).

This line in your code is causing the missing values:
df = df.join(pd.DataFrame(labels))
Explanation:
pandas.DataFrame.join() joins DataFrame objects by index. The "df" DataFrame has an Int64Index with values ranging from 0 to 41187, but only 38037 entries - that means the index values are not consecutive but contain gaps, probably from removing/filtering rows after the dataframe was created and before your code snippet was executed.
The DataFrame you create with pd.DataFrame(labels) has its own default index, with values ranging from 0 to 38036. join is a left join on the index, so the result keeps all rows of your original DataFrame, but the label is only filled in where the index values of the two DataFrames match; because of the gaps in your original DataFrame's index, that is only the case for 35246 rows, and the remaining rows get NaN.
The easiest solution is to reindex the original DataFrame so it contains consecutive index values again:
df = df.reset_index(drop=True).join(pd.DataFrame(labels))
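As an alternative to join, assigning the label array directly aligns by position instead of by index, so the gaps in df's index cause no NaNs. A minimal sketch (the toy frame and labels below are made up for illustration):

```python
import numpy as np
import pandas as pd

# toy frame with a gappy index, mimicking rows dropped before clustering
df = pd.DataFrame({'x': [10, 20, 30]}, index=[0, 2, 5])
labels = np.array([0, 1, -1])  # stand-in for db.labels_

# assigning a NumPy array is positional, so no NaNs can appear
df['Cluster'] = labels
print(df)
```

This sidesteps the index-alignment problem entirely, at the cost of relying on the rows of df being in the same order as the rows of X that were clustered.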

Related

Combine some columns in last row of a pandas dataframe

edit: I understand how to get the actual values, but I wonder how to append a row with these 2 sums to the existing df?
I have a dataframe score_card that looks like:
   15min_colour  15min_high  15min_price  30min_colour  30min_high  30min_price
0             1           1           -1             1          -1            1
1             1          -1            1             1           1            1
2            -1           1           -1             1           1            1
3            -1           1           -1             1          -1            1
Now I'd like to add a row that sums up all the 15min numbers (first 3 columns), all the 30min numbers, and so on (the actual df is larger). That means I don't want to add up the individual columns, but rather the sum of the columns' sums. The row I'd like to add would look like:
   sum_15min_colour&15min_high&15min_price  sum_30min_colour&30min_high&30min_price
0                                         0                                        8
Please disregard the header, it's only to clarify what I'm intending to do.
I assume there's a multiindex involved, but I couldn't figure out how to apply it to my existing df to achieve the desired output.
Also, is it possible to add a column with the sum of the whole table?
Thanks for your support.
You can sum in this way:
np.sum(df.filter(like='15').values), np.sum(df.filter(like='30').values)
(0, 8)
groupby can take a callable (think: a function) and use it on the index or columns:
df.groupby(lambda x: x.split('_')[0], axis=1).sum().sum()
15min 0
30min 8
dtype: int64
It depends on the axis. df.sum(axis=0, skipna=True) sums along axis 0 - in your case down the columns (it sums all values in each column vertically):
df.sum(axis=0, skipna=True)
To add two columns into a new one:
sum_column = df["col1"] + df["col2"]
df["col3"] = sum_column
print(df)
So in your case:
summed0Axis = df.sum(axis=0, skipna=True)
sum_column = summed0Axis["15min_colour"] + summed0Axis["15min_high"] + summed0Axis["15min_price"]
print(sum_column)
A more intelligent option: find all columns whose name contains "15" (note that the boolean mask itself does the selecting - no .sum belongs here):
columnsWith15 = df.loc[:, df.columns.str.contains("15")]
columnsWith30 = df.loc[:, df.columns.str.contains("30")]
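Putting the pieces together, here is one way to compute the two group sums and the whole-table sum in a single sketch (the score_card values are reconstructed from the question's table):

```python
import pandas as pd

# score_card rebuilt from the question's table
score_card = pd.DataFrame({
    '15min_colour': [1, 1, -1, -1],
    '15min_high':   [1, -1, 1, 1],
    '15min_price':  [-1, 1, -1, -1],
    '30min_colour': [1, 1, 1, 1],
    '30min_high':   [-1, 1, 1, -1],
    '30min_price':  [1, 1, 1, 1],
})

# sum of all 15min columns together, and all 30min columns together
sums = pd.DataFrame({
    'sum_15min': [score_card.filter(like='15').to_numpy().sum()],
    'sum_30min': [score_card.filter(like='30').to_numpy().sum()],
})
print(sums)  # sum_15min = 0, sum_30min = 8

# sum of the whole table
grand_total = score_card.to_numpy().sum()
print(grand_total)  # 8
```

Because the group sums have different column names than the original frame, keeping them in a separate small DataFrame (rather than appending a row) avoids mixing incompatible schemas.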

pandas count number of filled cells within row

I have a large dataset with columns labelled from 1 - 65 (among other titled columns), and want to find how many of the columns, per row, have a string (of any value) in them. For example, if all rows 1 - 65 are filled, the count should be 65 in this particular row, if only 10 are filled then the count should be 10.
Is there any easy way to do this? I'm currently using the following code, which is taking very long as there are a large number of rows.
array = pd.read_csv(csvlocation, encoding="ISO-8859-1")
for i in range(0, lengthofarray):
    for k in range(1, 66):
        if array[k][i] != "":
            array["count"][i] = array["count"][i] + 1
From my understanding of the post and the subsequent comments, you are interested in knowing the number of strings in each row for columns labels 1 through 65. There are two steps, the first is to subset your data down to columns 1 through 65, and then the following is the count the number of strings in each row. To do this:
import pandas as pd
import numpy as np
# create sample data
df = pd.DataFrame({'col1': list('abdecde'),
                   'col2': np.random.rand(7)})
# change one val of column two to string for illustration purposes
df.loc[3, 'col2'] = 'b'
# to create the subset of columns, you could use
# subset = [str(num) for num in list(range(1, 66))]
# and then just use df[subset]
# for each row, count the number of columns that have a string value
# applymap operates elementwise, so we are essentially creating
# a new representation of your data in place, where a 1 represents a
# string value was there, and a 0 represent not a string.
# we then sum along the rows to get the final counts
col_str_counts = np.sum(df.applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)
# we changed the column two value above, so to check that the count is 2 for that row idx:
col_str_counts[3]
>>> 2
# and for the subset, it would simply become:
# col_str_counts = np.sum(df[subset].applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)
You should be able to adapt your problem to this example
Say we have this dataframe
df = pd.DataFrame([["","foo","bar"],["","","bar"],["","",""],["foo","bar","bar"]])
     0    1    2
0       foo  bar
1            bar
2
3  foo  bar  bar
Then we create a boolean mask where a cell != "" and sum those values
df['count'] = (df != "").sum(1)
print(df)
     0    1    2  count
0       foo  bar      2
1            bar      1
2                     0
3  foo  bar  bar      3
df = pd.DataFrame([["","foo","bar"],["","","bar"],["","",""],["foo","bar","bar"]])
total_cells = df.size
df['filled_cell_count'] = (df != "").sum(1)
print(f"{df}")
     0    1    2  filled_cell_count
0       foo  bar                  2
1            bar                  1
2                                 0
3  foo  bar  bar                  3
total_filled_cells = df['filled_cell_count'].sum() / total_cells  # this is the filled fraction, not a count
print()
print(f"Fraction of filled cells in dataframe: {total_filled_cells}")
Fraction of filled cells in dataframe: 0.5
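If the "empty" cells were NaN instead of empty strings, pandas' built-in count would do this directly; a short sketch converting empty strings first (using the same toy frame as above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["", "foo", "bar"], ["", "", "bar"], ["", "", ""], ["foo", "bar", "bar"]])

# treat empty strings as missing, then count non-missing cells per row
counts = df.replace("", np.nan).count(axis=1)
print(counts.tolist())  # [2, 1, 0, 3]
```

This is fully vectorized, so it avoids the slow nested Python loops from the question.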

Delete zeros from a pandas dataframe

This question was asked in multiple other posts but I could not get any of the methods to work. This is my dataframe:
df = pd.DataFrame([[1,2,3,4,5],[1,2,0,4,5]])
I would like to know how I can either:
1) Delete rows that contain any/all zeros
2) Delete columns that contain any/all zeros
In order to delete rows that contain any zeros, this worked:
df2 = df[~(df == 0).any(axis=1)]
and to delete rows that are all zeros:
df2 = df[~(df == 0).all(axis=1)]
But I cannot get this to work column wise. I tried to set axis=0 but that gives me this error:
__main__:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Any suggestions?
You're going to need loc for this:
df
   0  1  2  3  4
0  1  2  3  4  5
1  1  2  0  4  5
df.loc[:, ~(df == 0).any(0)]  # notice the ":," - this means we are indexing on the columns now, not the rows
   0  1  3  4
0  1  2  4  5
1  1  2  4  5
Direct indexing defaults to indexing on the rows. You are trying to index a dataframe with only two rows using [0, 1, 3, 4], so pandas is warning you about that.
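For completeness, the four any/all combinations for rows and columns can be sketched like this (using the two-row frame from the question):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 2, 0, 4, 5]])
mask = df == 0

rows_without_any_zero = df[~mask.any(axis=1)]         # drops row 1
rows_not_all_zero     = df[~mask.all(axis=1)]         # keeps both rows here
cols_without_any_zero = df.loc[:, ~mask.any(axis=0)]  # drops column 2
cols_not_all_zero     = df.loc[:, ~mask.all(axis=0)]  # keeps all columns here
```

Note the symmetry: rows use plain boolean indexing, columns need df.loc[:, mask] and axis=0 in the reduction.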

Index of columns where rows match criterion Python Pandas

I have data from an Excel file in the format
0,1,0
1,0,0
0,0,1
I want to convert those data into a list where the ith element indicates the position of the nonzero element for the ith row. For example, the above would be:
[1,0,2]
I tried two ways to no avail:
Way one (NumPy)
df = pd.read_excel(file,convert_float=False)
idx = np.where(df==1)[1]
This gives me an odd error- idx is never the same length as the number of row in df. For this data set the two numbers are always equal. (I double checked, and there are no empty rows.)
Way two (Pandas)
idx = df.where(df==1)
This gives me output like:
52 NaN NaN NaN
53 1 NaN NaN
54 1 NaN NaN
This is the appropriate shape, but I don't know how to just get the column index.
Set up the dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[0,1,0],[1,0,0],[0,0,1]]))
Use np.argwhere to find the element indices:
np.argwhere(df.values ==1)
returns:
array([[0, 1],
       [1, 0],
       [2, 2]], dtype=int64)
so for row 0, column 1 contains a 1, for the df:
0 1 2
0 0 1 0
1 1 0 0
2 0 0 1
Note: you can get just the column indices by using, for example, np.array_split(indices, 2, 1)[1] (where indices is the array returned by np.argwhere), or simply indices[:, 1].
Here is a solution that works for limited use cases including this one. If you know that you will only have a single 1 in your row, then you can transpose the original data frame so the indices of your columns from the original data frame become the row indices of the transposed data frame. With that you can find the max value in each row and return an array of those values.
Your original data frame is not the best example for this solution because it is symmetrical and its transpose is the same as the original data frame. So for the sake of this solution we'll use a starting data frame that looks like:
df = pd.DataFrame({0:[0,0,1], 1:[1,0,0], 2:[0,1,0]})
# original data frame --> df
   0  1  2
0  0  1  0
1  0  0  1
2  1  0  0
# transposed data frame --> df.T
   0  1  2
0  0  0  1
1  1  0  0
2  0  1  0
Now to find the max of each row:
np.array(df.T.idxmax())
Which returns an array of values that represent the column indices of the original data frame that contain a 1:
[1 2 0]
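Under the same single-1-per-row assumption, df.idxmax(axis=1) gives the same result without transposing; a short sketch with the frame above:

```python
import pandas as pd

df = pd.DataFrame({0: [0, 0, 1], 1: [1, 0, 0], 2: [0, 1, 0]})

# idxmax along axis=1 returns, per row, the column label of the max value
idx = df.idxmax(axis=1).to_numpy()
print(idx)  # [1 2 0]
```

Using axis=1 directly keeps the intent ("per row, which column?") visible in the code.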

Ambiguity in Pandas Dataframe / Numpy Array "axis" definition

I've been very confused about how python axes are defined, and whether they refer to a DataFrame's rows or columns. Consider the code below:
>>> df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]], columns=["col1", "col2", "col3", "col4"])
>>> df
col1 col2 col3 col4
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
So if we call df.mean(axis=1), we'll get a mean across the rows:
>>> df.mean(axis=1)
0    1.0
1    2.0
2    3.0
dtype: float64
However, if we call df.drop(name, axis=1), we actually drop a column, not a row:
>>> df.drop("col4", axis=1)
col1 col2 col3
0 1 1 1
1 2 2 2
2 3 3 3
Can someone help me understand what is meant by an "axis" in pandas/numpy/scipy?
A side note, DataFrame.mean just might be defined wrong. It says in the documentation for DataFrame.mean that axis=1 is supposed to mean a mean over the columns, not the rows...
It's perhaps simplest to remember it as 0=down and 1=across.
This means:
Use axis=0 to apply a method down each column, or to the row labels (the index).
Use axis=1 to apply a method across each row, or to the column labels.
It's also useful to remember that Pandas follows NumPy's use of the word axis. The usage is explained in NumPy's glossary of terms:
Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). [my emphasis]
So, concerning the method in the question, df.mean(axis=1), seems to be correctly defined. It takes the mean of entries horizontally across columns, that is, along each individual row. On the other hand, df.mean(axis=0) would be an operation acting vertically downwards across rows.
Similarly, df.drop(name, axis=1) refers to an action on column labels, because they intuitively go across the horizontal axis. Specifying axis=0 would make the method act on rows instead.
There are already proper answers, but here is another example with more than 2 dimensions.
The parameter axis means the axis to be changed.
For example, consider an array with dimensions a x b x c:
arr.mean(axis=1) returns an array with dimensions a x c (the second axis is collapsed; a x 1 x c with keepdims=True).
np.delete(arr, idx, axis=1) returns an array with dimensions a x (b-1) x c.
Here, axis=1 means the second axis, which is b, so the b dimension is what changes in these examples.
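The shape effect is easy to verify with a plain NumPy array; a sketch using a=2, b=3, c=4:

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)    # dimensions a=2, b=3, c=4

# reducing along axis=1 collapses the b dimension
print(arr.mean(axis=1).shape)           # (2, 4)

# deleting along axis=1 shrinks the b dimension by one
print(np.delete(arr, 0, axis=1).shape)  # (2, 2, 4)
```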
Another way to explain:
// Not realistic but ideal for understanding the axis parameter
df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]],
                  columns=["idx1", "idx2", "idx3", "idx4"],
                  index=["idx1", "idx2", "idx3"])
---------------------------------------1
| idx1 idx2 idx3 idx4
| idx1 1 1 1 1
| idx2 2 2 2 2
| idx3 3 3 3 3
0
About df.drop (axis means the position)
A: I wanna remove idx3.
B: **Which one**? // typing while waiting response: df.drop("idx3",
A: The one which is on axis 1
B: OK then it is >> df.drop("idx3", axis=1)
// Result
---------------------------------------1
| idx1 idx2 idx4
| idx1 1 1 1
| idx2 2 2 2
| idx3 3 3 3
0
About df.apply (axis means direction)
A: I wanna apply sum.
B: Which direction? // typing while waiting response: df.apply(lambda x: x.sum(),
A: The one which is parallel to axis 0
B: OK then it is >> df.apply(lambda x: x.sum(), axis=0)
// Result
idx1 6
idx2 6
idx3 6
idx4 6
It should be more widely known that the string aliases 'index' and 'columns' can be used in place of the integers 0/1. The aliases are much more explicit and help me remember how the calculations take place. Another alias for 'index' is 'rows'.
When axis='index' is used, then the calculations happen down the columns, which is confusing. But, I remember it as getting a result that is the same size as another row.
Let's get some data on the screen to see what I am talking about:
df = pd.DataFrame(np.random.rand(10, 4), columns=list('abcd'))
a b c d
0 0.990730 0.567822 0.318174 0.122410
1 0.144962 0.718574 0.580569 0.582278
2 0.477151 0.907692 0.186276 0.342724
3 0.561043 0.122771 0.206819 0.904330
4 0.427413 0.186807 0.870504 0.878632
5 0.795392 0.658958 0.666026 0.262191
6 0.831404 0.011082 0.299811 0.906880
7 0.749729 0.564900 0.181627 0.211961
8 0.528308 0.394107 0.734904 0.961356
9 0.120508 0.656848 0.055749 0.290897
When we want to take the mean of each column, we use axis='index' to get the following:
df.mean(axis='index')
a 0.562664
b 0.478956
c 0.410046
d 0.546366
dtype: float64
The same result would be gotten by:
df.mean() # default is axis=0
df.mean(axis=0)
df.mean(axis='rows')
To use an operation left to right along the rows, use axis='columns'. I remember it by thinking that an additional column may be added to my DataFrame:
df.mean(axis='columns')
0 0.499784
1 0.506596
2 0.478461
3 0.448741
4 0.590839
5 0.595642
6 0.512294
7 0.427054
8 0.654669
9 0.281000
dtype: float64
The same result would be gotten by:
df.mean(axis=1)
Add a new row with axis=0/index/rows
Let's use these results to add additional rows or columns to complete the explanation. So, whenever you use axis=0/index/rows, it's like getting a new row of the DataFrame. Let's add a row:
df.append(df.mean(axis='rows'), ignore_index=True)
# note: DataFrame.append was removed in pandas 2.0; the equivalent is
# pd.concat([df, df.mean(axis='rows').to_frame().T], ignore_index=True)
a b c d
0 0.990730 0.567822 0.318174 0.122410
1 0.144962 0.718574 0.580569 0.582278
2 0.477151 0.907692 0.186276 0.342724
3 0.561043 0.122771 0.206819 0.904330
4 0.427413 0.186807 0.870504 0.878632
5 0.795392 0.658958 0.666026 0.262191
6 0.831404 0.011082 0.299811 0.906880
7 0.749729 0.564900 0.181627 0.211961
8 0.528308 0.394107 0.734904 0.961356
9 0.120508 0.656848 0.055749 0.290897
10 0.562664 0.478956 0.410046 0.546366
Add a new column with axis=1/columns
Similarly, when axis=1/columns it will create data that can be easily made into its own column:
df.assign(e=df.mean(axis='columns'))
a b c d e
0 0.990730 0.567822 0.318174 0.122410 0.499784
1 0.144962 0.718574 0.580569 0.582278 0.506596
2 0.477151 0.907692 0.186276 0.342724 0.478461
3 0.561043 0.122771 0.206819 0.904330 0.448741
4 0.427413 0.186807 0.870504 0.878632 0.590839
5 0.795392 0.658958 0.666026 0.262191 0.595642
6 0.831404 0.011082 0.299811 0.906880 0.512294
7 0.749729 0.564900 0.181627 0.211961 0.427054
8 0.528308 0.394107 0.734904 0.961356 0.654669
9 0.120508 0.656848 0.055749 0.290897 0.281000
It appears that you can see all the aliases with the following private variables (these existed in older versions of pandas; they have since been removed):
df._AXIS_ALIASES
{'rows': 0}
df._AXIS_NUMBERS
{'columns': 1, 'index': 0}
df._AXIS_NAMES
{0: 'index', 1: 'columns'}
When axis='rows' or axis=0, it means accessing elements in the direction of the rows, top to bottom. Applying sum along axis=0 gives the total of each column.
When axis='columns' or axis=1, it means accessing elements in the direction of the columns, left to right. Applying sum along axis=1 gives the total of each row.
Still confusing! But the above makes it a bit easier for me.
I remember it by the change of dimension: with axis=0 the rows change and the columns are unchanged; with axis=1 the columns change and the rows are unchanged.
