I have a question: how does one count the number of unique values that occur within each column of a pandas dataframe?
Say I have a dataframe named df that looks like this:
1 2 3 4
a yes f c
b no f e
c yes d h
I want to get output that shows the number of unique values within each of the four columns, something similar to this:
Column # of Unique Values
1 3
2 2
3 2
4 3
I don't need to know what the unique values are, just how many there are within each column.
I have played around with something like this:
df[all_cols].value_counts()
where all_cols is a list of all the columns in the dataframe. But this counts how many times each value appears within a column, not how many distinct values there are.
Any advice/suggestions would be a great help. Thanks
You could apply Series.nunique:
>>> df.apply(pd.Series.nunique)
1 3
2 2
3 2
4 3
dtype: int64
Or you could do a groupby/nunique on the unstacked version of the frame:
>>> df.unstack().groupby(level=0).nunique()
1 3
2 2
3 2
4 3
dtype: int64
Both of these produce a Series, which you could then use to build a frame with whatever column names you wanted.
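For instance, a minimal sketch to get the exact layout from the question:
>>> counts = df.apply(pd.Series.nunique)
>>> counts.rename_axis('Column').reset_index(name='# of Unique Values')
  Column  # of Unique Values
0      1                   3
1      2                   2
2      3                   2
3      4                   3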
You could try df.nunique():
>>> df.nunique()
1 3
2 2
3 2
4 3
dtype: int64
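As a side note, df.nunique(axis=1) counts unique values per row instead, and passing dropna=False includes NaN in the counts.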
Related
I have a dataframe like this:
A  B
1  2
3  4
5  6
I want to take its rows and put them side by side, like this:
A  B  A  B  A  B
1  2  3  4  5  6
Is there any way I can do that?
I tried using iloc but could not figure out how to do this.
One option is to:
- flatten the values into a 1-row numpy array using the .values dataframe property and np.reshape
- build a new dataframe whose column names can be obtained by applying np.tile to the original column list
import numpy as np
import pandas as pd

pd.DataFrame(
    df.values.reshape(1, -1),
    columns=np.tile(df.columns.values, len(df)).tolist()
)
Output:
A B A B A B
0 1 2 3 4 5 6
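One caveat: df.values collapses the frame into a single NumPy array, so frames with mixed dtypes get upcast (often to object) in the reshaped result.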
A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number, since I do not know the positions. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column range. For example, if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
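A sketch of an equivalent positional route, in case you prefer it, using the same example endpoints B and D:
# find the positional range of the named endpoints, inclusive
start = df.columns.get_loc("B")
stop = df.columns.get_loc("D") + 1
df = df.drop(columns=df.columns[start:stop])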
I have N dataframes with different numbers of columns, and I want to get one dataframe with two columns, x and y, where x is the data from the input dataframes' columns and y is the column name itself. I have many such dataframes to concat (N is of the order of 10^2), so efficiency is a priority. A numpy way rather than a pandas way is also welcome.
For example,
df1:
one two
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
df2:
  three four
0   NaN
1  None    f
2          g
3     6    7
Final Output Dataframe:
x y
0 1 one
1 2 one
2 3 one
3 4 one
4 5 one
5 a two
6 b two
7 c two
8 d two
9 e two
10 6 three
11 f four
12 g four
13 7 four
Note: I'm ignoring empty strings, NaNs and Nones in the final dataframe.
IIUC you can use melt() before concatenating, then drop the empty strings, NaNs and Nones. (Cleaning after melting, rather than calling df2.dropna() first, keeps values like 'f' that share a row with a NaN.)
import numpy as np

final = (pd.concat([df1.melt(), df2.melt()])
           .replace('', np.nan)
           .dropna()
           .rename(columns={'variable': 'y', 'value': 'x'})
           .reindex(['x', 'y'], axis=1)
           .reset_index(drop=True))
print(final)
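Since N is on the order of 10^2, the same idea generalizes with a single concat over a list comprehension. A sketch, assuming the frames are collected in a list (dfs is a name introduced here for illustration):
dfs = [df1, df2]  # stand-in for the ~10^2 frames
final = (pd.concat([d.melt() for d in dfs], ignore_index=True)
           .replace('', np.nan)  # treat empty strings like NaN
           .dropna()
           .rename(columns={'variable': 'y', 'value': 'x'})
           [['x', 'y']])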
So I have an extremely simple dataframe:
values
1
1
1
2
2
I want to add a new column and, for each row, assign the number of times its value occurs, so the table would look like:
values unique_sum
1 3
1 3
1 3
2 2
2 2
I have seen some examples in R, but for Python and pandas I have not come across anything and am stuck. I can list the value counts using .value_counts(), and I have tried groupby routines but cannot fathom it.
Just use map to map your column onto its value_counts:
>>> x
A
0 1
1 1
2 1
3 2
4 2
>>> x['unique'] = x.A.map(x.A.value_counts())
>>> x
A unique
0 1 3
1 1 3
2 1 3
3 2 2
4 2 2
(I named the column A instead of values. values is not a great choice for a column name, because DataFrames have a special attribute called values, which prevents you from getting the column with x.values; you'd have to use x['values'] instead.)
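Since you mentioned trying groupby routines, an equivalent groupby spelling uses transform, which returns a result aligned to the original rows:
>>> x['unique'] = x.groupby('A')['A'].transform('count')
This produces the same column as the map/value_counts approach.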
I am new to data frames, so I apologize if the question is obvious. Assume I have a data frame that looks like this:
1 2 3
4 5 6
7 8 9
and I would like to check if it contains the following data frame:
5 6
8 9
Is there any built-in function in pandas.DataFrame that does it?
Supposing the two dataframes have the same relative columns and index (I assume so, since they are dataframes and not just value arrays), here is a quick solution (not the most elegant or efficient) where you compare the two dataframes after combine_first:
DataFrame.combine_first(other): combine two DataFrame objects and default to non-null values in the frame calling the method. The result's index and columns will be the union of the respective indexes and columns.
Example:
df
a b c
0 1 2 3
1 4 5 6
2 7 8 9
df1
a b
1 4 5
2 7 8
(df1.combine_first(df) == df.combine_first(df1)).all().all()
True
or, if you want to check that df1 (the smaller one) is contained in df (you already know their sizes):
(df == df1.combine_first(df)).all().all()
True
(Note that plain all(...) on a boolean dataframe iterates over the column labels rather than the values, so the chained .all().all() is needed.)
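If you instead want a purely value-based check that ignores the index and column labels, here is a sketch using NumPy sliding windows (requires numpy >= 1.20):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# slide a df1-shaped window over df's values and test every position
windows = sliding_window_view(df.to_numpy(), df1.shape)
contained = (windows == df1.to_numpy()).all(axis=(-2, -1)).any()
print(contained)  # True if df1's values appear as a contiguous block in df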