Create column named "Id" based on row index - python

I would like to create a new column for my dataframe named "Id" where the value is the row index +1.
I would like to be like the example below:
ID Col1 ...
0 1 a ...
1 2 b ...
2 3 c ...

You can add one to the index and assign it to the id column:
df = pd.DataFrame({"Col1": list("abc")})
df["id"] = df.index + 1
df
#Col1 id
#0 a 1
#1 b 2
#2 c 3

I converted the prediction list to an array then created an dataframe with group totals and plotted the dataframe
y_pred=model.predict(X_test)
#print(y_pred)
setosa=np.array([y_pred[i][0] for i in range(len(y_pred))])
versicolor=np.array([y_pred[i][1] for i in range(len(y_pred))])
virginica=np.array([y_pred[i][2] for i in range(len(y_pred))])
df2=pd.DataFrame({'setosa': [len(setosa[setosa>.5])],'versicolor':
[len(versicolor[versicolor>.5])], 'virginica': [len(virginica[virginica>.5])] })
df2['id']=np.arange(0,1)
df2=df2.set_index('id')
df2.plot.bar()
plt.show()

Related

Assign counts from .count() to a dataframe + column names - pandas python

Hoping someone can help me here - i believe i am close to the solution.
I have a dataframe, of which i have am using .count() in order to return a series of all column names of my dataframe, and each of their respective non-NAN value counts.
Example dataframe:
feature_1
feature_2
1
1
2
NaN
3
2
4
NaN
5
3
Example result for .count() here would output a series that looks like:
feature_1 5
feature_2 3
I am now trying to get this data into a dataframe, with the column names "Feature" and "Count". To have the expected output look like this:
Feature
Count
feature_1
5
feature_2
3
I am using .to_frame() to push the series to a dataframe in order to add column names. Full code:
df = data.count()
df = df.to_frame()
df.columns = ['Feature', 'Count']
However receiving this error message - "ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements", as if though it is not recognising the actual column names (Feature) as a column with values.
How can i get it to recognise both Feature and Count columns to be able to add column names to them?
Add Series.reset_index instead Series.to_frame for 2 columns DataFrame - first column from index, second from values of Series:
df = data.count().reset_index()
df.columns = ['Feature', 'Count']
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
Another solution with name parameter and Series.rename_axis or with DataFrame.set_axis:
df = data.count().rename_axis('Feature').reset_index(name='Count')
#alternative
df = data.count().reset_index().set_axis(['Feature', 'Count'], axis=1)
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
This happens because your new dataframe has only one column (the column name is taken as series index, then translated into dataframe index with the func to_frame()). In order to assign a 2 elements list to df.columns you have to reset the index first:
df = data.count()
df = df.to_frame().reset_index()
df.columns = ['Feature', 'Count']

How to Pivot/Stack for multi header column dataframe

np.random.seed(2022) # added to make the data the same each time
cols = pd.MultiIndex.from_arrays([['A','A' ,'B','B'], ['min','max','min','max']])
df = pd.DataFrame(np.random.rand(3,4),columns=cols)
df.index.name = 'item'
A B
min max min max
item
0 0.009359 0.499058 0.113384 0.049974
1 0.685408 0.486988 0.897657 0.647452
2 0.896963 0.721135 0.831353 0.827568
There are two column headers and while working with csv, I get a blank column name for every other column on unmerging.
I want result that looks like this. How can I do it?
I tried to use pivot table but couldn't do it.
Try:
df = (
df.stack(level=0)
.reset_index()
.rename(columns={"level_1": "title"})
.sort_values(by=["title", "item"])
)
print(df)
Prints:
item title max min
0 0 A 0.762221 0.737758
2 1 A 0.930523 0.275314
4 2 A 0.746246 0.123621
1 0 B 0.044137 0.264969
3 1 B 0.577637 0.699877
5 2 B 0.601034 0.706978
Then to CSV:
df.to_csv('out.csv', index=False)

Pandas: select column with most unique values

I have a pandas DataFrame and want to find select the column with the most unique values.
I already filtered the unique values with nunique(). How can I now choose the column with the highest nunique()?
This is my code so far:
numeric_columns = df.select_dtypes(include = (int or float))
unique = []
for column in numeric_columns:
unique.append(numeric_columns[column].nunique())
I later need to filter all the columns of my dataframe depending on this column(most uniques)
Use DataFrame.select_dtypes with np.number, then get DataFrame.nunique with column by maximal value by Series.idxmax:
df = pd.DataFrame({'a':[1,2,3,4],'b':[1,2,2,2], 'c':list('abcd')})
print (df)
a b c
0 1 1 a
1 2 2 b
2 3 2 c
3 4 2 d
numeric = df.select_dtypes(include = np.number)
nu = numeric.nunique().idxmax()
print (nu)
a

Mapping Values of key in Dictionary present in dataframe columns

I have a dataframe A with column 'col_1' and values of column is A and B and and I am trying to map the values of A and B present in Dictionary
DataFrame A:
enter image description here
and have dictionary
enter image description here
and I want the output like this
Dataframe :
col_1 Values
A 1
A 2
A 3
B 1
B 2
Any help will be highly appreciated
thanks
I tried to frame your problem properly:
df = pd.DataFrame({"col_1":["A","A","A","B","B"]})
Printing df gives us your dataframe shown in the image above:
print(df)
col_1
0 A
1 A
2 A
3 B
4 B
Here is your dictionary:
dict1 = {"A":[1,2,3], "B":[1,2]}
I created an empty list to hold the elements then stack up the list with your request, and finally created a new column called values and write the list into the column
values1 = []
for key,value_list in dict1.items():
for item in value_list:
value_item = key+" "+ str(item)
values1.append(value_item)
df["values"] = values1
printing df results into:
df
col_1 values
0 A A 1
1 A A 2
2 A A 3
3 B B 1
4 B B 2

Unable to fillna a column in dataframe with values from a series

I am trying to fillna in a specific column of the dataframe with the mean of not-null values of the same type (based on the value from another column in the dataframe).
Here is the code to reproduce my issue:
import numpy as np
import pandas as pd
df = pd.DataFrame()
#Create the DateFrame with a column of floats
#And a column of labels (str)
np.random.seed(seed=6)
df['col0']=np.random.randn(100)
lett=['a','b','c','d']
df['col1']=np.random.choice(lett,100)
#Set some of the floats to NaN for the test.
toz = np.random.randint(0,100,25)
df.loc[toz,'col0']=np.NaN
df[df['col0'].isnull()==False].count()
#Create a DF with mean for each label.
w_series = df.loc[(~df['col0'].isnull())].groupby('col1').mean()
col0
col1
a 0.057199
b 0.363899
c -0.068074
d 0.251979
#This dataframe has our label (a,b,c,d) as the index. Doesn't seem
#to work when I try to df.fillna(w_series). So I try to reindex such
#that the labels (a,b,c,d) become a column again.
#
#For some reason I cannot just do a set_index and expect the
#old index to become column. So I append the new index and
#then reset it.
w_series['col2'] = list(range(w_series.size))
w_frame = w_series.set_index('col2',append=True)
w_frame.reset_index('col1',inplace=True)
#I try fillna() with the new dataframe.
df.fillna(w_frame)
Still no luck:
col0 col1
0 0.057199 b
1 0.729004 a
2 0.217821 d
3 0.251979 c
4 -2.486781 a
5 0.913252 b
6 NaN a
7 NaN b
What am I doing wrong?
How do I fillna the dataframe with the averages of specific rows that match the missing information?
Does the size of the dataframe being filled (df) and the filler dataframe (w_frame) have to match?
Thank you
fillna is base on index, so , you need same index for your target dataframe and process dataframe
df.set_index('col1')['col0'].fillna(w_frame.set_index('col1').col0).reset_index()
# I only show the first 11 row
Out[74]:
col1 col0
0 b 0.363899
1 a 0.729004
2 d 0.217821
3 c -0.068074
4 a -2.486781
5 b 0.913252
6 a 0.057199
7 b 0.363899
8 c -0.068074
9 b -0.429894
10 a 2.631281
My way to fillna
df['col1']=df.groupby("col1")['col0'].transform(lambda x: x.fillna(x.mean()))

Categories

Resources