Getting the NaN rows from pandas.dropna - python

I am using dropna to get rid of the NaN values, but instead of just dropping them I want to get a new table where those rows are saved. That is to say, starting from the current code:
df_weight.dropna(subset = ["age"], inplace=True)
df_weight.dropna(subset = ["height"], inplace=True)
df_weight.dropna(subset = ["weight"], inplace=True)
df_weight
I want to save the rows that are dropped by the line df_weight.dropna(subset = ["weight"], inplace=True). I think that dropna does not have a return value, so is there any workaround to achieve this?
EDIT: my data comes from https://data.world/bgadoci/crossfit-data/workspace/file?filename=athletes.csv. I deleted all the other rows to make a mini database. After loading the data into pandas I do df_weight = df[['gender','age','height','weight']]; with the code mentioned above I get something like this (where the desired row type is marked):

You could use
dropped_rows_df = df_weight[df_weight[['age','height','weight']].isna().any(axis=1)]
#then
df_weight.dropna(subset = ["age"], inplace=True)
df_weight.dropna(subset = ["height"], inplace=True)
df_weight.dropna(subset = ["weight"], inplace=True)
df_weight
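For reference, a minimal self-contained sketch of this approach on invented toy data (the column names follow the question, but the values and the small frame are made up for illustration):
import numpy as np
import pandas as pd

# Toy frame standing in for the athletes data
df_weight = pd.DataFrame({
    'gender': ['M', 'F', 'F', 'M'],
    'age':    [25, np.nan, 31, 40],
    'height': [70, 66, np.nan, 72],
    'weight': [180, 140, 150, np.nan],
})

# Rows that would be removed by the three dropna calls
dropped_rows_df = df_weight[df_weight[['age', 'height', 'weight']].isna().any(axis=1)]

# Rows that survive (equivalent to the three inplace dropna calls)
df_weight = df_weight.dropna(subset=['age', 'height', 'weight'])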

You can try as follows. If you share the DF, it will be easier to reproduce and provide a working solution. This is just an idea or direction:
df_weight.isna()['age']
df_weight.isna()['height']
df_weight.isna()['weight']
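If the goal is specifically the rows removed by the "weight" line (NaN weight, but valid age and height), a hedged sketch combining these masks could look like this:
# Compute this before the inplace dropna calls:
# rows dropped only because of the weight column
weight_only_nan = df_weight[
    df_weight['weight'].isna()
    & df_weight['age'].notna()
    & df_weight['height'].notna()
]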

Cannot drop index column from DataFrame when converting to HTML

I'm trying to get rid of the index column when converting a DataFrame into HTML, but even though I reset the index or set index=False in to_html, it is still there, though with no values.
df = df.set_index(['ID','Name','PM', 'Theme'])['Score'].unstack()
df = df.reset_index()
df_HTML = df.to_html(table_id = "table_score", index=False, escape=False)
Any idea how to get rid of that, please?
Try this:
df = df.set_index(['ID','Name','PM', 'Theme'])['Score'].unstack()
df = df.reset_index(drop=True).drop('Theme',axis=1)
df_HTML = df.to_html(table_id = "table_score", index=False, escape=False)
The issue was caused because your 'Theme' column seems to be your old index. And since you didn't drop it in the reset_index method, it stayed there.
If this doesn't work, just drop 'Theme'.
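As a quick illustration of the difference (toy data invented for the example): reset_index() re-inserts the old index levels as columns, reset_index(drop=True) discards them, and index=False keeps the remaining RangeIndex out of the HTML:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B'], 'Score': [10, 20]})
df = df.set_index(['ID', 'Name'])

with_cols = df.reset_index()              # old index levels become regular columns
without_cols = df.reset_index(drop=True)  # old index levels are discarded

# index=False leaves the new RangeIndex out of the rendered table
df_HTML = without_cols.to_html(table_id="table_score", index=False, escape=False)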

Overwrite portion of dataframe

I'm starting to lose my mind a bit. I have:
df = pd.DataFrame(bunch_of_stuff)
df2 = df.loc[bunch_of_conditions].copy()
def transform_df2(df2):
    df2['new_col'] = [rand()]*len(df2)
    df2['existing_column_1'] = [list of new values]
    return df2
df2 = transform_df2(df2)
I now want to re-insert df2 into df, such that it overwrites all its previous records.
What would the best way to do this be? df.loc[df2.index] = df2 ? This doesn't bring over any of the new columns in df2 though.
You have the right method with pd.concat. However, you can optimize a little bit by using a boolean mask to avoid recomputing the index difference:
m = bunch_of_conditions
df2 = df[m].copy()
df = pd.concat([df[~m], df2]).sort_index()
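A minimal end-to-end sketch of that mask-based overwrite on invented data (the transform step below is a stand-in for the question's transform_df2):
import numpy as np
import pandas as pd

df = pd.DataFrame({'existing_column_1': [1, 2, 3, 4], 'other': list('abcd')})

m = df['existing_column_1'] > 2   # stand-in for bunch_of_conditions
df2 = df[m].copy()

# Stand-in transform: add a column and overwrite an existing one
df2['new_col'] = np.random.rand(len(df2))
df2['existing_column_1'] = [10, 20]

# Matched rows are replaced by their transformed versions; untouched rows
# simply get NaN in the newly added 'new_col'
df = pd.concat([df[~m], df2]).sort_index()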
Why do you want to make a copy of your dataframe? Is it not simpler to use the dataframe itself?
One way I did it was:
df= pd.concat([df.loc[~df.index.isin(df2.index)],df2])

How to delete specific values from a column in a dataset (Python)?

I have a data set as below:
I want to remove 'Undecided' from my ['Grad Intention'] column. For this, I created a copy of the DataFrame and am using the code as follows:
df_copy=df_copy.drop([df_copy['Grad Intention'] =='Undecided'], axis=1)
However, this is giving me an error.
How can I remove the row with 'Undecided'? Also, what's wrong with my code?
You could simply use:
df = df[df['Grad Intention'] != 'Undecided']
or
df.drop(df[df['Grad Intention'] == 'Undecided'].index, inplace = True)
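Both variants behave the same way on a small made-up frame:
import pandas as pd

df = pd.DataFrame({'Grad Intention': ['Yes', 'Undecided', 'No', 'Undecided']})

# Boolean filtering keeps every row that is not 'Undecided'
filtered = df[df['Grad Intention'] != 'Undecided']

# drop() removes the rows whose index matched 'Undecided'
df.drop(df[df['Grad Intention'] == 'Undecided'].index, inplace=True)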

Pandas "A value is trying to be set on a copy of a slice from a DataFrame"

Having a bit of trouble understanding the documentation
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
C:/Users/erasmuss/PycharmProjects/Sarah/farmdata.py:38: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The code is basically to re-arrange and clean some data to make analysis easier.
The data is given row-by-row per animal, but has repetitions, blanks, and some other sparse values.
The idea is basically to stack rows into columns and grab the useful data (weight by date and final BCS) per animal.
(The original question included screenshots of a few snippets of the initial DataFrame and of the desired output DF/CSV format.)
import pandas as pd
import numpy as np
#Function for cleaning up multiple entries of breeds
def testbreed(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
#Read Data
df1 = pd.read_csv("farmdata.csv")
#Drop empty rows
df1.dropna(how='all', axis=1, inplace=True)
#Copy to extract Weights in DF2
df2 = df1.copy()
df2 = df2.drop(['BCS', 'Breed','Age'], axis=1)
#Pivot for ID names in DF1
df1 = df1.pivot(index='ID', columns='Date', values=['Breed','Weight', 'BCS'])
#Pivot for weights in DF2
df2 = df2.pivot(index='ID', columns='Date', values = 'Weight')
#Split out Breeds and BCS into individual dataframes w/Duplicate/missing data for each ID
df3 = df1.copy()
dfbreed = df3[['Breed']]
dfBCS = df3[['BCS']]
#Drop empty BCS columns
df1.dropna(how='all', axis=1, inplace=True)
#Shorten Breed and BCS to single Column by grabbing first value that is real. see function above
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
dfBCS['x'] = dfBCS.apply(testbreed, axis=1)
#Populate BCS and Breed into new DF
df5= pd.DataFrame(data=None)
df5['Breed'] = dfbreed['x']
df5['BCS'] = dfBCS['x']
#Join Weights
df5 = df5.join(df2)
#Write output
df5.to_csv(r'.\out1.csv')
I want to take the BCS and Breed dataframes, which are multi-indexed on the columns by Breed or BCS and then by date, take the first non-NaN value across the row of dates, and set that into a single column named Breed.
I had a lot of trouble getting the columns to pick the first valid values in-situ on the DF, so I found a work-around in a 2015 answer, which defined the function at the top.
Reading through the documentation on setting a value on a copy of a slice makes sense intuitively, but I can't seem to think of a way to make it work as a direct replacement or index-based.
Should I be looping through?
Trying the approach from the second answer here, I get
dfbreed.loc[:,'Breed'] = dfbreed['Breed'].apply(testbreed, axis=1)
dfBCS.loc[:, 'BCS'] = dfBCS['BCS'].apply(testbreed, axis=1)
which returns
ValueError: Must have equal len keys and value when setting with an iterable
I'm thinking this has something to do with the multi-index
keys come up as:
MultiIndex([('Breed', '1/28/2021'),
('Breed', '2/12/2021'),
('Breed', '2/4/2021'),
('Breed', '3/18/2021'),
('Breed', '7/30/2021')],
names=[None, 'Date'])
MultiIndex([('BCS', '1/28/2021'),
('BCS', '2/12/2021'),
('BCS', '2/4/2021'),
('BCS', '3/18/2021'),
('BCS', '7/30/2021')],
names=[None, 'Date'])
Sorry for the long question(s?)
Can anyone help me out?
Thanks.
You created dfbreed as:
dfbreed = df3[['Breed']]
So it is a view of the original DataFrame (limited to just this one column).
Remember that a view does not have its own data buffer; it is only a tool to "view"
a fragment of the original DataFrame, with read-only access.
When you attempt to perform dfbreed['x'] = dfbreed.apply(...), you
actually attempt to violate that read-only access mode.
To avoid this error, create dfbreed as an "independent" DataFrame:
dfbreed = df3[['Breed']].copy()
Now dfbreed has its own data buffer and you are free to change the data.
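A stripped-down sketch of the difference (the tiny frame and the lambda are invented for illustration; whether the warning actually fires depends on pandas internals, which is exactly why the explicit .copy() is the safer pattern):
import pandas as pd

df3 = pd.DataFrame({'Breed': ['Angus', None], 'BCS': [3.0, 2.5]})

# Selecting without .copy() may hand back a view; assigning to it later can
# raise SettingWithCopyWarning or fail to propagate as expected
dfbreed_view = df3[['Breed']]

# An explicit copy owns its data, so adding columns is safe
dfbreed = df3[['Breed']].copy()
dfbreed['x'] = dfbreed.apply(lambda row: row['Breed'], axis=1)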

How to get the correlated values of a dataset in a separate column if the index and the columns are the same

I imported a dataset into my python script and took the correlation. This is the code for correlation:
data = pd.read_excel('RQ_ID_Grouping.xlsx' , 'Sheet1')
corr = data.corr()
After the correlation the data looks like this:
I want to convert the data into the below format:
I am using this code to achieve the above format, but it doesn't seem to be working:
corr1 = (corr.melt(var_name = 'X' , value_name = 'Y').groupby('X')['Y'].reset_index(name = 'Corr_Value'))
I know there should be something after the 'groupby' part, but I don't know what. If you could help me, I would greatly appreciate it.
Use DataFrame.stack to reshape and drop missing values, convert the MultiIndex to columns with DataFrame.reset_index, and finally set the column names:
df = corr.stack().reset_index()
df.columns = ['X','Y','Corr_Value']
Another solution with DataFrame.rename_axis:
df = corr.stack().rename_axis(('X','Y')).reset_index(name='Corr_Value')
And your solution with melt is also possible:
df = (corr.rename_axis('X')
          .reset_index()
          .melt('X', var_name='Y', value_name='Corr_Value')
          .dropna()
          .sort_values(['X','Y'])
          .reset_index(drop=True))
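On a tiny invented frame, each variant yields the same three-column long format:
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 6], 'C': [1, 0, 1]})
corr = data.corr()

long_corr = corr.stack().rename_axis(('X', 'Y')).reset_index(name='Corr_Value')
# long_corr now has columns X, Y and Corr_Value, one row per pair,
# e.g. ('A', 'B', 1.0) for the perfectly correlated toy columns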
