How to fix displaced rows in Excel with pandas? - python

I need to fix a large Excel database where, in some columns, some cells are blank and all the data from the row is shifted one cell to the right.
For example:
In this example I need a script that would detect that the first cell of the last row is blank, and then move all the values in that row one cell to the left.
I'm trying to do it with the function below. vencli_col is the dataset; df1 and df2 are copies. In df2 I drop Column12, which is where the error originates. I collect the indexes of the rows where the error happens and then try to replace those rows with the values from df2.
df1 = vencli_col.copy()
df2 = vencli_col.copy()
df2 = df1.drop(columns=['Column12'])
df2['droppedcolumn'] = np.nan

i = 0
col = []
for k, value in vencli_col.iterrows():
    i += 1
    if str(value['Column12']) == '' or str(value['Column12']) == str(np.nan):
        col.append(i + 1)

for j in col:
    df1.iloc[j] = df2.iloc[j]

df1.head(25)

You could do something like the below. It is not very pretty but it does the trick.
# Select the column names that are correct and the ones that are shifted.
# This assumes the error column is the second one, as in the image you have.
correct_cols = df.columns[1:-1]
shifted_cols = df.columns[2:]

# Get the indexes of the rows that are NaN or ""
df = df.fillna("")
shifted_indexes = df[df["col1"] == ""].index

# Shift the data 1 column to the left.
# It has to be converted to numpy because otherwise the column names
# prevent copying into the destination columns.
df.loc[shifted_indexes, correct_cols] = df.loc[shifted_indexes, shifted_cols].to_numpy()
EDIT: just realised there is an easier way using df.shift()
columns_to_shift = df.columns[1:]
shifted_indexes = df[df["col1"] == ""].index
df.loc[shifted_indexes, columns_to_shift] = df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
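To make the shift() fix concrete, here is a minimal self-contained sketch with a hypothetical three-column frame (the real column names come from your file), where the second row is displaced one cell to the right:

```python
import pandas as pd

# Hypothetical toy frame: row 1 is shifted one cell to the right,
# leaving its first cell blank
df = pd.DataFrame({
    "col1": ["a", ""],
    "col2": ["b", "a"],
    "col3": ["c", "b"],
})

shifted_indexes = df[df["col1"] == ""].index
cols = df.columns  # shift the whole displaced row left
df.loc[shifted_indexes, cols] = df.loc[shifted_indexes, cols].shift(-1, axis=1)
```

After this, row 1 reads "a", "b" and a trailing NaN, which a final fillna("") can clear if needed.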


delete rows based on if consecutive rows are similar - Python

I have this data frame and want to remove rows based on this set of rules. If consecutive rows have the same 'area' and 'local' value and the 'group_name' is different then I want to remove the first row:
df = pd.DataFrame()
df['time'] = pd.date_range("2018-01-01", freq = "s", periods = 10)
df['area'] = [1,1,1,2,2,2,3,3,4,4]
df['local'] = [1,1,1,1,2,2,2,2,2,2]
df['group_name'] = [1,1,2,2,2,3,3,3,4,4]
df['value'] = [1,4,3,2,5,6,2,1,7,8]
From the table above I would want to remove rows 1 and 4.
I have tried using duplicated() on the subset of area, local and group_name, but this does not keep all the unique rows that I need.
Please help me out!
You can do this by writing a number of if statements like this:
for i in range(len(df) - 1):
    if df.loc[i]['local'] == df.loc[i + 1]['local']:
        if df.loc[i]['area'] == df.loc[i + 1]['area']:
            if df.loc[i]['group_name'] != df.loc[i + 1]['group_name']:
                df.drop(i, inplace=True)
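As a sanity check, running that loop on the sample frame from the question does drop rows 1 and 4 (a sketch; the column values are copied from the question):

```python
import pandas as pd

df = pd.DataFrame()
df['time'] = pd.date_range("2018-01-01", freq="s", periods=10)
df['area'] = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
df['local'] = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
df['group_name'] = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4]
df['value'] = [1, 4, 3, 2, 5, 6, 2, 1, 7, 8]

# Compare each row with the next; drop the first of the pair
# when area and local match but group_name differs
for i in range(len(df) - 1):
    if df.loc[i]['local'] == df.loc[i + 1]['local']:
        if df.loc[i]['area'] == df.loc[i + 1]['area']:
            if df.loc[i]['group_name'] != df.loc[i + 1]['group_name']:
                df.drop(i, inplace=True)
```

Each dropped label is only ever read at or before the step that drops it, so the label lookups never hit a missing row.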

iloc[column,0] in for loop

Not sure if this is the right place to ask, but I am having trouble understanding this for loop with iloc in it.
Specifically, I don't understand what this line is doing: hr_new['ID'] = hr_new[column_list.iloc[column, 0]]
Can anyone help with this?
code:
column_list = pd.DataFrame(['ColA','ColB','ColC','ColD'])
final_df = pd.DataFrame()
for column in range(len(column_list)):
    hr_new = hr.copy()
    hr_new.dropna(subset=[column_list.iloc[column, 0]], inplace=True)
    hr_new['ID'] = hr_new[column_list.iloc[column, 0]]
    merged_data = pd.merge(hr_new, dataframenotshown, how='left', left_on='ID', right_on='IDtwo')
    final_df = final_df.append(merged_data)
You could also rewrite the code as
final_df = pd.DataFrame()
for i in range(4):
    hr_new = hr.copy()
    hr_new.dropna(subset=[column_list.iloc[i, 0]], inplace=True)
    hr_new['ID'] = hr_new[column_list.iloc[i, 0]]
    ...
Now you can see i is a value between 0 and 3 (len(column_list) == 4).
Selecting (multiple) Rows/Cols using iloc would look like this:
data.iloc[row_1, col_1] # select one cell
data.iloc[[row_1,row_2,row_3,row_4], [col_1,col_2,col_3]] # select multiple cells
data.iloc[:, col_1] # select one column
data.iloc[row_1, :] # select one row
So the code:
hr_new['ID'] = hr_new[column_list.iloc[i, 0]]
EDIT:
This line fills the column 'ID' of hr_new with the values of the column hr_new[x], where x is the name stored at row i, column 0 of column_list.
In my opinion, this is a very complicated way to do this.
Consider storing the Column names as a list and iterate over them instead of creating a dataframe and selecting rows.
column_list = ["col_1", "col_2", "col_3", "col_4"]
for col in column_list:
    hr_new = hr.copy()
    hr_new.dropna(subset=[col], inplace=True)
    hr_new['ID'] = hr_new[col]
    ...
This should work the same way if I understand your code correctly
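A minimal runnable sketch of that list-based version, with a hypothetical two-column hr frame standing in for yours, and pd.concat in place of the deprecated DataFrame.append (removed in pandas 2.0):

```python
import pandas as pd

# Hypothetical stand-in for the question's hr frame
hr = pd.DataFrame({
    "ColA": ["a1", None, "a3"],
    "ColB": ["b1", "b2", None],
})

column_list = ["ColA", "ColB"]
frames = []
for col in column_list:
    hr_new = hr.copy()
    hr_new = hr_new.dropna(subset=[col])  # keep rows where this column has a value
    hr_new["ID"] = hr_new[col]            # 'ID' takes the values of that column
    frames.append(hr_new)

# DataFrame.append was removed in pandas 2.0; concat does the same job
final_df = pd.concat(frames, ignore_index=True)
```

The merge step from the question would slot in before the append, exactly as in the original loop.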

If dataframe.tail(1) is X, do something

I am trying to check if the last cell in a pandas data-frame column contains a 1 or a 2 (these are the only options). If it is a 1, I would like to delete the whole row, if it is a 2 however I would like to keep it.
import pandas as pd
df1 = pd.DataFrame({'number':[1,2,1,2,1], 'name': ['bill','mary','john','sarah','tom']})
df2 = pd.DataFrame({'number':[1,2,1,2,1,2], 'name': ['bill','mary','john','sarah','tom','sam']})
In the above example I would want to delete the last row of df1 (so the final row is 'sarah'), however in df2 I would want to keep it exactly as it is.
So far I have thought to try the following, but I am getting an error:
if df1['number'].tail(1) == 1:
    df = df.drop(-1)
DataFrame.drop removes rows based on labels (the actual values of the index). While it is possible to do this with df1.drop(df1.index[-1]), that becomes problematic with a duplicated index. The last row can be sliced away with iloc, and a single value read with .iat:
if df1['number'].iat[-1] == 1:
    df1 = df1.iloc[:-1, :]
You can check if the value of number in the last row is equal to one:
check = df1['number'].tail(1).values == 1
# Or check entire row with
# check = 1 in df1.tail(1).values
If that condition holds, you can select all rows, except the last one and assign back to df1:
if check:
    df1 = df1.iloc[:-1, :]
if df1['number'].tail(1).item() == 1:
    df1.drop(len(df1) - 1, inplace=True)
You can use the same tail function
df.drop(df.tail(n).index,inplace=True) # drop last n rows
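Putting the .iat version together with the question's own sample frames (a sketch; only df1 should lose its last row):

```python
import pandas as pd

df1 = pd.DataFrame({'number': [1, 2, 1, 2, 1],
                    'name': ['bill', 'mary', 'john', 'sarah', 'tom']})
df2 = pd.DataFrame({'number': [1, 2, 1, 2, 1, 2],
                    'name': ['bill', 'mary', 'john', 'sarah', 'tom', 'sam']})

def drop_last_if_one(df):
    # read the last scalar with .iat, slice the row off with iloc
    if df['number'].iat[-1] == 1:
        return df.iloc[:-1]
    return df

df1 = drop_last_if_one(df1)  # last number is 1 -> 'tom' row goes
df2 = drop_last_if_one(df2)  # last number is 2 -> unchanged
```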

How to eliminate the index value of columns and rows in a python dataframe?

I am trying to eliminate the column and the row that show the index values, and replace the first value of my column with 'kw'.
I have tried .drop without success.
def main():
    df_old_m, df_new_m = open_file_with_data_m(file_old_m, file_new_m)
    # combine two dataframes together
    df = df_old_m.append(df_new_m)
    son = df["son"]
    gson = df["gson"]
    # add son
    df_old_m = df["seeds"].append(son)
    # add gson
    df_old_m = df_old_m.append(gson)
    # delete repeated values
    df_old_m = df_old_m.drop_duplicates()
    # add kw as the header of the list
    df_old_m.loc[-1] = 'kw'  # adding a row
    df_old_m.index = df_old_m.index + 1  # shifting index
    df_old_m.sort_index(inplace=True)
This gives me the .xlsx output.
If kw is the column you want to be your new index, you can do this:
df.set_index('kw', inplace=True)
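To see the effect, here is a minimal sketch with a hypothetical two-column frame; when you later write the file, to_excel(index=False) also suppresses the index column in the .xlsx output:

```python
import pandas as pd

# Hypothetical frame standing in for the question's keyword data
df = pd.DataFrame({'kw': ['seed1', 'seed2'], 'count': [3, 5]})

df.set_index('kw', inplace=True)
# the default 0..n-1 RangeIndex is gone; 'kw' now labels the rows

# df.to_excel('out.xlsx', index=False)  # would drop the index column entirely
```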

Deleting a DataFrame from a list of DataFrames depending on the DataFrame - python?

I have a list of DataFrames and I want to delete DataFrames from the list which fulfil any of the conditions below:
If the DataFrame has 2 or fewer columns.
If the DataFrame contains the string 'A3' anywhere.
The code I have tried for the column length is shown below, where the list is named df_list:
for i in df_list:
    if len(i.columns) == 1:
        del [i]
or
df_list = [i for i in df_list if not (i.shape[1] == 2)]
The code I have tried to remove DataFrames that include the string 'A3' anywhere is:
df_list = [i for i in df_list if not ('A3' in i.columns)]
I know my numbers are inconsistent, but neither version removes anything from my list when it should. Does anyone know the correct way to do this?
Is this what you're looking for?
import pandas as pd

url = 'https://www.bls.gov/web/empsit/cesbmart.htm'
df_list = pd.read_html(url)
key_word = 'CES'

delete_by_idx = []
for idx, dataframe in enumerate(df_list):
    A3_found = False
    # Check if the keyword is in any row
    # (cast to str first so non-string cells don't break .str.contains)
    for i, row in dataframe.iterrows():
        if row.astype(str).str.contains(key_word).any():
            A3_found = True
    # If the keyword was found, delete the dataframe
    if A3_found == True:
        delete_by_idx.append(idx)
        continue
    # If the keyword is in the column names, delete the dataframe
    cols = [str(col_name) for col_name in list(dataframe.columns)]
    if any(key_word in x for x in cols):
        delete_by_idx.append(idx)
        continue
    # If there are 2 or fewer columns, delete the dataframe
    if len(dataframe.columns) <= 2:
        delete_by_idx.append(idx)
        continue

delete_by_idx.sort(reverse=True)
for each in delete_by_idx:
    del df_list[each]
This will check for "A3" in the column names ('A3' in a row Series tests its index, which for .loc[0] is the set of column labels). You can then use the same pattern to check the values of the columns. Iterate over a copy of the list, since removing items from a list while iterating over it skips elements:
for each in list(df_list):
    if 'A3' in each.loc[0]:
        df_list.remove(each)
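Both conditions can also be combined into one filtering pass. A minimal sketch, with hypothetical toy frames standing in for your df_list:

```python
import pandas as pd

# Hypothetical toy frames: one to keep, one too narrow, one containing 'A3'
keep = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
narrow = pd.DataFrame({'a': [1], 'b': [2]})                   # <= 2 columns
tagged = pd.DataFrame({'a': ['A3'], 'b': ['x'], 'c': ['y']})  # contains 'A3'
df_list = [keep, narrow, tagged]

def contains_a3(df):
    # True if 'A3' appears in any cell or in any column name
    in_cells = df.astype(str).eq('A3').any().any()
    in_cols = any('A3' in str(c) for c in df.columns)
    return bool(in_cells or in_cols)

df_list = [df for df in df_list
           if df.shape[1] > 2 and not contains_a3(df)]
```

Building a new list this way sidesteps the remove-while-iterating problem entirely.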
