I would like to drop all data in a pandas dataframe, but I am getting TypeError: drop() takes at least 2 arguments (3 given). I essentially want a blank dataframe with just my column headers.
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df.drop(axis=0, inplace=True)
print(df)
You need to pass the labels to be dropped.
df.drop(df.index, inplace=True)
By default, it operates on axis=0.
You can achieve the same with
df.iloc[0:0]
which is much more efficient. Note that iloc returns a new, empty dataframe (keeping the column headers) rather than modifying df in place, so assign the result back if you need it.
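A minimal sketch of both approaches on the question's own data; either one leaves an empty frame with the headers intact:
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)

df.drop(df.index, inplace=True)  # drop every row label, in place
# or equivalently: df = df.iloc[0:0]
print(df)
# Empty DataFrame
# Columns: [Day, Visitors, Bounce_Rate]
# Index: []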
My favorite:
df = df.iloc[0:0]
But be aware that df.index.max() will be nan afterwards.
To add items I use:
import math
df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = data
My favorite way is:
df = df[0:0]
Overwrite the dataframe with something like this:
import pandas as pd
df = pd.DataFrame(None)
or, if you want to keep the column headers in place:
df = pd.DataFrame(columns=df.columns)
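One caveat, shown in a small sketch on made-up columns: rebuilding the frame from just the column names resets every dtype to object, whereas the df.iloc[0:0] slice shown above preserves the original dtypes:
import pandas as pd

df = pd.DataFrame({'Day': [1, 2], 'Visitors': [43, 43]})
print(df.iloc[0:0].dtypes)                      # Day: int64, Visitors: int64
print(pd.DataFrame(columns=df.columns).dtypes)  # Day: object, Visitors: object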
If your goal is to drop the whole contents of the dataframe, you can pass all of its columns. For me, the best way is to pass a list comprehension to the columns kwarg; this then works regardless of which columns a given df contains.
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df = df.drop(columns=[i for i in df.columns])  # drop() returns a new frame, so assign it back
This code makes a clean dataframe:
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
# clean: overwrite with an empty dataframe
df = pd.DataFrame()
I'm creating a function to filter many dataframes using groupby. The dataframes look like the one below; however, each dataframe does not always contain the same number of columns.
df = pd.DataFrame({
    'xyz CODE': [1, 2, 3, 3, 4, 5, 6, 7, 7, 8],
    'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
    'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
    'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10]})
For each dataframe I always apply groupby to the first column - which are named differently across dataframes. All other columns are named consistently across all dataframes.
My question: Is it possible to run groupby using a combination of column location and column names? How can I do it?
I wrote the following function and got the error TypeError: unhashable type: 'list':
def filter_all_df(df):
    df['max_c'] = df.groupby(df.columns[0])['a'].transform('max')
    newdf = df[df['a'] == df['max_c']].drop(['max_c'], axis=1)
    newdf['max_score'] = newdf.groupby([newdf.columns[0], 'a', 'b'])['c'].transform('max')
    newdf = newdf[newdf['c'] == newdf['max_score']]
    newdf = newdf.sort_values([newdf.columns[0]]).drop_duplicates([newdf.columns[0], 'a', 'b', 'c'], keep='last')
    newdf.to_csv('newdf_all.csv')
    return newdf
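As a side note, a quick sketch (on a trimmed-down version of the sample frame) showing that mixing column position and column names is valid: df.columns[0] simply resolves to the first column's label, so a list combining it with literal names is an ordinary, hashable groupby key:
import pandas as pd

df = pd.DataFrame({'xyz CODE': [1, 1, 2], 'a': [4, 5, 3], 'c': [25, 20, 5]})
print(df.columns[0])                                # 'xyz CODE'
print(df.groupby([df.columns[0], 'a'])['c'].max())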
This is my task:
Write a function that accepts a dataframe as input, the name of the column with missing values, and a list of grouping columns, and returns the dataframe with the missing values filled in with the median value.
Here is what I tried to do:
def fillnull(set, col):
    val = {col: set[col].sum() / set[col].count()}
    set.fillna(val)
    return set
fillnull(titset,'Age')
My problem is that my function doesn't work, and I also don't know how to compute the median or how to do the grouping inside this function.
(Screenshots of the dataframe and of its NaN values were attached to the original question as "DATAFRAME" and "NaN Values".)
Check whether this code works for you:
import pandas as pd
df = pd.DataFrame({
    'processId': range(100, 900, 100),
    'groupId': [1, 1, 2, 2, 3, 3, 4, 4],
    'other': [1, 2, 3, None, 3, 4, None, 9]
})
print(df)
def fill_na(df, missing_value_col, grouping_col):
    # per-group medians of the column that has the gaps
    values = df.groupby(grouping_col)[missing_value_col].median()
    # index by the grouping column so fillna can align the medians with the rows
    df.set_index(grouping_col, inplace=True)
    df[missing_value_col] = df[missing_value_col].fillna(values)
    df.reset_index(grouping_col, inplace=True)
    return df
fill_na(df, 'other', 'groupId')
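For reference, an equivalent one-liner uses groupby(...).transform('median'), which returns the per-group medians already aligned to the original rows, so no index juggling is needed:
df['other'] = df['other'].fillna(df.groupby('groupId')['other'].transform('median'))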
I'm using iterrows() to work my way through a dataframe. Using a for loop and nested if statements I'm able to identify the cells I want to change.
I used a print statement to verify I'm able to change the data, but when I print out the dataframe the information is unchanged. I was able to do this on a smaller dataframe. Any ideas?
Originally, this was my code that worked:
data.loc[(data.ID.isin([10,45])) & (data.source.notnull()), 'ID'] = 50
But I need to add this:
data.loc[(data.ID.isin([23,45])) & (data.source.notnull()), 'ID'] = 60
This worked for me as a test
The DataFrame did change with this logic:
import pandas as pd
data = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                     'num_wings': [10, 23, 32, 45],
                     'num_specimen_seen': [10, 2, 1, 8]},
                    index=['falcon', 'dog', 'spider', 'fish'])
for x, y in data.iterrows():
    if y['num_wings'] in [10, 45]:
        y['num_wings'] = 50
        print(x, y)
This is basically what I'm trying to do:
I can change the data using this logic, but it doesn't seem to change the actual DataFrame:
import pandas as pd
...
...
for x, y in data.iterrows():
    if y['ID'] in [10, 45]:
        if y['source'] == 0:
            if y['username'] == 'bill':
                y['ID'] = 50
                print(x, y)  # print the results to confirm it worked; it did,
                             # however, the dataframe is unchanged
I feel confident that I can make the changes I want, but I need to apply them to the DataFrame.
To clarify, you're trying to conditionally update the value of the num_wings column? If so, here you go. You need to use the .loc method to update values in a dataframe.
import pandas as pd
data = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                     'num_wings': [10, 23, 32, 45],
                     'num_specimen_seen': [10, 2, 1, 8]},
                    index=['falcon', 'dog', 'spider', 'fish'])
data.loc[data['num_wings'].isin([10,45]),'num_wings'] = 50
data
num_legs num_specimen_seen num_wings
falcon 2 10 50
dog 4 2 23
spider 8 1 32
fish 0 8 50
The code doesn't work because, as the pandas documentation for iterrows() warns:
Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
To write to the frame, you can try to see if .at works, i.e.:
for x, y in data.iterrows():
    if y['num_wings'] in [10, 45]:
        data.at[x, 'num_wings'] = 50
Modifying something while you're iterating over it is generally not recommended, but I think it should be OK in your case.
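For completeness, here is a sketch of the question's three nested conditions collapsed into a single vectorized .loc assignment (column names and values taken from the question; data is the asker's frame):
# build one boolean mask from all three conditions, then assign in one shot
mask = data['ID'].isin([10, 45]) & (data['source'] == 0) & (data['username'] == 'bill')
data.loc[mask, 'ID'] = 50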
I am working on a dataset, and when I try to create a new column after finding the difference I get KeyError: 'filtered'.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'col2': [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]}
df = pd.DataFrame(data=d)
fig, ax = plt.subplots(2, figsize=(8,8))
df['col2'].diff().plot(ax=ax[0])
cutoff = 3
df['filtered'] = df.loc[df['col2'].diff().abs() > cutoff]
df.plot(ax=ax[1])
I used to create new columns like this (df['filtered'] = some operation), but it gives KeyError: 'filtered' in this situation. Thank you for the help.
You need to replace the second-to-last line with:
df['filtered'] = df.loc[df['col2'].diff().abs() > cutoff, 'col2']
assuming that you want to get a filtered version of 'col2'. As @RafaelC mentioned, the current .loc[] operation returns all the columns (two in this case) for which the row filter applies, hence the error.
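An equivalent spelling, for reference, is Series.where, which keeps the values where the condition holds and inserts NaN elsewhere:
df['filtered'] = df['col2'].where(df['col2'].diff().abs() > cutoff)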
I am trying to draw subplots using two identical DataFrames (predicted and observed) with the exact same structure ... the first column is the index.
The code below creates a new index when they are concatenated using pd.melt and pd.concat.
As you can see in the figure, the index of the orange line is changed from 1-5 to 6-10.
I was wondering if someone could fix the code below to keep the same index for the orange line:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
                       'b': [89, 22, 44, 6, 44, 1]})
predicted = pd.DataFrame({'a': [7, 2, 13, 18, 20, 2],
                          'b': [9, 20, 4, 16, 40, 11]})
# Creating a tidy-dataframe to input under seaborn
merged = pd.concat([pd.melt(actual), pd.melt(predicted)]).reset_index()
merged['category'] = ''
merged.loc[:len(actual)*2,'category'] = 'actual'
merged.loc[len(actual)*2:,'category'] = 'predicted'
g = sns.FacetGrid(merged, col="category", hue="variable")
g.map(plt.plot, "index", "value", alpha=.7)
g.add_legend();
The orange line ('variable' == 'b') doesn't have an index of 0-5 because of how you used melt. If you look at pd.melt(actual), the index doesn't match what you are expecting, IIUC.
Here is how I would rearrange the dataframe:
merged = pd.concat([actual, predicted], keys=['actual', 'predicted'])
merged.index.names = ['category', 'index']
merged = merged.reset_index()
merged = pd.melt(merged, id_vars=['category', 'index'], value_vars=['a', 'b'])
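With this tidy frame, the plotting code from the question works unchanged, and both panels now share the original 0-5 index:
g = sns.FacetGrid(merged, col="category", hue="variable")
g.map(plt.plot, "index", "value", alpha=.7)
g.add_legend()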
Set the ignore_index parameter to False to preserve the index, e.g.
df = df.melt(var_name='species', value_name='height', ignore_index=False)
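Applied to the question's frames (this requires pandas 1.1+, where melt gained ignore_index), each frame keeps its original 0-5 index through the melt, and the concat pattern from the previous answer labels the categories:
merged = pd.concat([actual.melt(ignore_index=False), predicted.melt(ignore_index=False)],
                   keys=['actual', 'predicted'])
merged.index.names = ['category', 'index']
merged = merged.reset_index()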