I want to delete all the rows in a dataframe.
The reason I want to do this is so that I can reconstruct the dataframe with an iterative loop. I want to start with a completely empty dataframe.
Alternatively, I could create an empty df from just the column / type information if that is possible
Here's another method if you have an existing DataFrame that you'd like to empty without recreating the column information:
df_empty = df[0:0]
df_empty is a DataFrame with zero rows but with the same column structure as df
The latter is possible and strongly recommended - "inserting" rows row-by-row is highly inefficient. A sketch could be
>>> import numpy as np
>>> import pandas as pd
>>> index = np.arange(0, 10)
>>> df = pd.DataFrame(index=index, columns=['foo', 'bar'])
>>> df
Out[268]:
foo bar
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
If you have an existing DataFrame with the columns you want then extract the column names into a list comprehension then create an empty DataFrame with your column names.
# Creating DataFrame from a CSV file with desired headers
csv_a = "path/to/my.csv"
df_a = pd.read_csv(csv_a)
# Extract column names into a list
names = [x for x in df_a.columns]
# Create empty DataFrame with those column names
df_b = pd.DataFrame(columns=names)
df.drop(df.index,inplace=True)
This line will delete all rows, while keeping the column names.
You can also use head:
df_empty = df.head(0)
Old Thread. But i found another way
df_final=df_dup[0:0].copy(deep=True)
Related
I have a simple problem that probably has a simple solution but I couldn't found it anywhere. I have the following multindex column Dataframe:
mux = pd.MultiIndex.from_product(['A','B','C'], ['Datetime', 'Str', 'Ret']])
dfr = pd.DataFrame(columns=mux)
| A | B | C |
|Datetime|Str|Ret|Datetime|Str|Ret|Datetime|Str|Ret|
I need to add values one by one at the end of a specific subcolumn. For example add one value at the end of column A sub-column Datetime and leave the rest of the row as it is, then add another value to column B sub-column Str and again leave the rest of the values in the same row untouched and so on. So my questions are: Is it possible to target individual locations in this type of Dataframes? How? and also Is it possible to append not a full row but an individual value always at the end after the previous value without knowing where the end is?. Thank you so much for your answers.
IIUC, you can use .loc:
idx = len(dfr) # get the index of the next row after the last one
dfr.loc[idx, ('A', 'Datetime')] = pd.to_datetime('2021-09-24')
dfr.loc[idx, ('B', 'Str')] = 'Hello'
dfr.loc[idx, ('C', 'Ret')] = 4.3
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 00:00:00 NaN NaN NaN Hello NaN NaN NaN 4.3
Update
I mean for example when I have different number of values in different columns (for example 6 values in column A-Str but only 4 in column B-Datetime) but I donĀ“t really know. In that case what I need is to add the next value in that column after the last one so I need to know the index of the last non Nan value of that particular column so I can use it in your answer because if I use len(dfr) while trying to add value to the column that only has 4 values it will end up in the 7th row instead of the 5th row, this is because one of the columns may have more values than the others.
You can do it easily using last_valid_index. Create a convenient function append_to_col to append values inplace in your dataframe:
def append_to_col(col, val):
idx = dfr[col].last_valid_index()
dfr.loc[idx+1 if idx is not None else 0, col] = val
# Fill your dataframe
append_to_col(('A', 'Datetime'), '2021-09-24')
append_to_col(('A', 'Datetime'), '2021-09-25')
append_to_col(('B', 'Str'), 'Hello')
append_to_col(('C', 'Ret'), 4.3)
append_to_col(('C', 'Ret'), 8.2)
append_to_col(('A', 'Datetime'), '2021-09-26')
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 NaN NaN NaN Hello NaN NaN NaN 4.3
1 2021-09-25 NaN NaN NaN NaN NaN NaN NaN 8.2
2 2021-09-26 NaN NaN NaN NaN NaN NaN NaN NaN
Saying that I have a DataFrame with 6 Columns and I want to add a new colum that could give me a 1 when column 3 to 6 is NaN and a 0 when not all of them are NaN
like this
How Can I do that?
Or How could I remove those rows where all those columns are NaN except from the fisrt 2 columns?
IIUC, here's one way:
df['<New Col Name>'] = df.set_index(['Date', 'Name']).isna().all(1).astype(int)
My goal is to have my column titles in the small df added to an existing large dataframe without me manually typing the name in.
This is the small dataframe.
veddra_term_code veddra_version veddra_term_name number_of_animals_affected accuracy
335 11 Emesis NaN NaN
142 11 Anaemia NOS NaN NaN
The large dataframe is similar to the above but has forty columns.
This is the code I used to extract the small dataframe from dict.
df = pd.DataFrame(reaction for result in d['results'] for reaction in result['reaction']) #get reaction data
df
You can pass dataframe.reindex a list of columns, consisting of the existing columns and also new ones. If a column does not exist yet in the dataframe, it will get as value NaN.
Assume that df is your big dataframe you want to extend with columns. You can then create a new list of column names (columns_to_add) from your small dataframe and combine them. Then you call reindex on the big dataframe.
import pandas as pd
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
existing_columns = df.columns.tolist()
columns_to_add = ["C", "D"] # or use small_df.columns.tolist()
new_columns = existing_columns + columns_to_add
df = df.reindex(columns = new_columns)
This will produce:
A B C D
0 1 2 NaN NaN
1 2 3 NaN NaN
2 3 4 NaN NaN
If you do not like NaN you can use a different value by passing the keyword fill_value.
(e.g. df.reindex(columns = new_columns, fill_value=0).
df.columns will give you an array of the names of the columns
import numpy as np
#loop small dataframe headers
for i in small_df.columns:
# if large df doesnt have the header, create the header
if i not in large_df.columns:
#creates new header with no data
large_df.loc[:,i]=np.nan
I have a dataframe like the following
df = [[1,'NaN',3],[4,5,'Nan'],[7,8,9]]
df = pd.DataFrame(df)
and I would like to remove all columns that have in their first row a NaN value.
So the output should be:
df = [[1,3],[4,'Nan'],[7,9]]
df = pd.DataFrame(df)
So in this case, only the second column is removed since the first element was a NaN value.
Hence, dropna() is based on a condition.. any idea how to handle this? Thx!
If values are np.nan and not string NaN(else replace them), you can do:
Input:
df = [[1,np.nan,3],[4,5,np.nan],[7,8,9]]
df = pd.DataFrame(df)
Solution:
df.loc[:,df.iloc[0].notna()] #assign back to your desired variable
0 2
0 1 3.0
1 4 NaN
2 7 9.0
I tried to insert a new row to a dataframe named 'my_df1' using the my_df1.loc function.But in the result ,the new row added has NaN values
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
my_df1.loc[3] = pd.Series([5,5,5])
Result displayed is as below
A B C
0 1.0 4.0 a
1 2.0 5.0 b
2 3.0 6.0 c
3 NaN NaN NaN
The reason that is all NaN is that my_df1.loc[3] as index (A,B,C) while pd.Series([5,5,5]) as index (0,1,2). When you do series1=series2, pandas only copies values of common indices, hence the result.
To fix this, do as #anky_91 says, or if you already has a series, use its values:
my_df1.loc[3] = my_series.values
Finally I found out how to add a Series as a row or column to a dataframe
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
Code1 adds a new column 'D' and values 5,5,5 to the dataframe
my_df1.loc[:,'D'] = pd.Series([5,5,5],index = my_df1.index)
print(my_df1)
Code2 adds a new row with index 3 and values 3,4,3,4 to the dataframe in code 1
my_df1.loc[3] = pd.Series([3,4,3,4],index = ('A','B','C','D'))
print(my_df1)