iloc[column,0] in for loop - python

Not sure if this is the right place to ask, but I am having trouble understanding this for loop with iloc in it.
In particular, I don't understand what this line is doing: hr_new['ID'] = hr_new[column_list.iloc[column, 0]]
Can anyone help with this?
code:
column_list = pd.DataFrame(['ColA', 'ColB', 'ColC', 'ColD'])
final_df = pd.DataFrame()
for column in range(len(column_list)):
    hr_new = hr.copy()
    hr_new.dropna(subset=[column_list.iloc[column, 0]], inplace=True)
    hr_new['ID'] = hr_new[column_list.iloc[column, 0]]
    merged_data = pd.merge(hr_new, dataframenotshown, how='left', left_on='ID', right_on='IDtwo')
    final_df = final_df.append(merged_data)

You could also rewrite the code as
final_df = pd.DataFrame()
for i in range(4):
    hr_new = hr.copy()
    hr_new.dropna(subset=[column_list.iloc[i, 0]], inplace=True)
    hr_new['ID'] = hr_new[column_list.iloc[i, 0]]
    ...
Now you can see i is a value between 0 and 3 (len(column_list) == 4).
Selecting (multiple) Rows/Cols using iloc would look like this:
data.iloc[row_1, col_1] # select one cell
data.iloc[[row_1,row_2,row_3,row_4], [col_1,col_2,col_3]] # select multiple cells
data.iloc[:, col_1] # select one column
data.iloc[row_1, :] # select one row
So the code:
hr_new['ID'] = hr_new[column_list.iloc[i, 0]]
EDIT:
selects the column 'ID' from hr_new and fills it with the column hr_new[x], where x is the value stored in column_list at row i of column 0.
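As a toy illustration of what iloc resolves to here (hr itself is not shown in the question, so only column_list is demonstrated):

```python
import pandas as pd

column_list = pd.DataFrame(['ColA', 'ColB', 'ColC', 'ColD'])

# Row i, column 0 of column_list is just the i-th column name.
print(column_list.iloc[0, 0])  # ColA
print(column_list.iloc[2, 0])  # ColC

# So hr_new[column_list.iloc[0, 0]] is the same as hr_new['ColA'].
```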
In my opinion, this is a very complicated way to do this.
Consider storing the column names as a plain list and iterating over them, instead of creating a dataframe and selecting rows:
column_list = ["col_1", "col_2", "col_3", "col_4"]
for col in column_list:
    hr_new = hr.copy()
    hr_new.dropna(subset=[col], inplace=True)
    hr_new['ID'] = hr_new[col]
    ...
This should work the same way, if I understand your code correctly.
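A minimal runnable sketch of that list-based loop, with a made-up hr frame (the real hr and the merge target are not shown in the question). It collects results with pd.concat, since DataFrame.append was removed in pandas 2.0:

```python
import pandas as pd

# Made-up stand-in for the hr dataframe from the question.
hr = pd.DataFrame({
    'col_1': ['a1', None, 'a3'],
    'col_2': ['b1', 'b2', None],
})

column_list = ['col_1', 'col_2']
frames = []
for col in column_list:
    hr_new = hr.copy()
    hr_new = hr_new.dropna(subset=[col])
    hr_new['ID'] = hr_new[col]   # copy that column's values into 'ID'
    frames.append(hr_new)

final_df = pd.concat(frames, ignore_index=True)
print(final_df['ID'].tolist())  # ['a1', 'a3', 'b1', 'b2']
```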

Related

How to fix displaced rows in excel with pandas?

I need to fix a large Excel database where in some columns some cells are blank and all the data from the row is moved one cell to the right.
For example:
In this example I need a script that would detect that the first cell from the last row is blank and then move all the values one cell to the left.
I'm trying to do it with this function. vencli_col is the dataset; df1 and df2 are copies. In df2 I drop column 12, which is where the error originates. I index the rows where the error happens and then try to replace them with the values from df2.
df1 = vencli_col.copy()
df2 = vencli_col.copy()
df2 = df1.drop(columns=['Column12'])
df2['droppedcolumn'] = np.nan
i = 0
col = []
for k, value in vencli_col.iterrows():
    i += 1
    if str(value['Column12']) == '' or str(value['Column12']) == str(np.nan):
        col.append(i + 1)
for j in col:
    df1.iloc[j] = df2.iloc[j]
df1.head(25)
You could do something like the below. It is not very pretty but it does the trick.
# Select the column names that are correct and the ones that are shifted
# This is assuming the error column is the second one as in the image you have
correct_cols = df.columns[1:-1]
shifted_cols = df.columns[2:]
# Get the indexes of the rows that are NaN or ""
df = df.fillna("")
shifted_indexes = df[df["col1"] == ""].index
# Shift the data 1 column to the left
# It has to be transformed in numpy because if you don't the column names
# prevent from copying in the destination columns
df.loc[shifted_indexes, correct_cols] = df.loc[shifted_indexes, shifted_cols].to_numpy()
EDIT: just realised there is an easier way using df.shift()
columns_to_shift = df.columns[1:]
shifted_indexes = df[df["col1"] == ""].index
df.loc[shifted_indexes, columns_to_shift] = df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
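A self-contained demo of the shift approach on toy data (column names guessed to match the answer, where 'col1' is the second column and the first column is never misaligned):

```python
import pandas as pd

# Toy data: in the last row a blank was inserted at 'col1',
# pushing that row's values one cell to the right.
df = pd.DataFrame({
    'col0': ['id1', 'id2', 'id3'],
    'col1': ['a', 'b', ''],
    'col2': ['x', 'y', 'c'],
    'col3': ['1', '2', 'z'],
})

columns_to_shift = df.columns[1:]              # 'col1', 'col2', 'col3'
shifted_indexes = df[df['col1'] == ''].index   # rows where 'col1' is blank

# Move the misaligned values one column to the left;
# the last shifted column becomes NaN.
df.loc[shifted_indexes, columns_to_shift] = (
    df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
)
print(df)
```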

How can I append values from one df to the bottom of a column in my second in one expression?

I'm trying to append a list of values ('A') from a separate df to the bottom of my output (finalDf), where the values are always the same and don't need to be in order.
Here's what I have tried so far:
temp1 = pd.DataFrame(df['A'].append(df1['A'], ignore_index = True))
temp2 = pd.DataFrame(df['B'].append(df1['B'], ignore_index = True))
print(df.shape)
print(temp1.shape)
print(temp2.shape)
Shape output (example from my code, with 28 extra values from df1):
(11641, 6)
(11669, 1)
(11669, 1)
Appending the values seems to work based on the shape of temp1, but I can't seem to apply the values from both col 'A' and col 'B' to the bottom of dfFinal together: it's always either col 'A' or col 'B' from df1, never both.
TL;DR: how can I best take the values from col 'A' and col 'B' in df1 and append them to col 'A' and col 'B' in df, to make dfFinal which I can then export to CSV?
This can be done with the concat function along axis=0, i.e. it will join the data frames along rows. In layman's terms, it will place the second data frame below the first. Keep in mind that the number of columns should be the same in both data frames. (Note that concat is a top-level pandas function, not a DataFrame method.)
pd.concat([df, df1], axis=0, ignore_index=True)
Here, ignore_index discards the original indexes and instead creates a new one from 0 to n-1.
For more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
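A runnable sketch with made-up frames standing in for df and df1 from the question:

```python
import pandas as pd

# Made-up stand-ins: both frames share the same columns 'A' and 'B'.
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
df1 = pd.DataFrame({'A': [3, 4], 'B': [30, 40]})

# Stack df1 below df; ignore_index renumbers the rows 0..n-1.
dfFinal = pd.concat([df, df1], axis=0, ignore_index=True)
print(dfFinal)
```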

pandas df masking specific row by list

I have a pandas df which has 7000 rows * 7 columns, and I have a list (row_list) that consists of the values that I want to filter from df.
What I want to do is filter the rows of df that contain a corresponding value from the list.
This is what I got when I tried:
"Empty DataFrame
Columns: [A,B,C,D,E,F,G]
Index: []"
df = pd.read_csv('filename.csv')
df1 = pd.read_csv('filename1.csv', names='A')
row_list = []
for index, rows in df1.iterrows():
    my_list = [rows.A]
    row_list.append(my_list)
boolean_series = df.D.isin(row_list)
filtered_df = df[boolean_series]
print(filtered_df)
replace
boolean_series = df.RightInsoleImage.isin(row_list)
with
boolean_series = df.RightInsoleImage.isin(df1.A)
And let us know the result. If it doesn't work, show a sample of df and df1.A.
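For illustration, the likely failure mode is that row_list ends up as a list of one-element lists (because [rows.A] is appended each time), which isin never matches; passing the flat column works (toy data, since the real CSVs are not shown):

```python
import pandas as pd

# Toy stand-ins for df and df1 from the question.
df = pd.DataFrame({'D': ['x', 'y', 'z']})
df1 = pd.DataFrame({'A': ['x', 'z']})

# Building row_list as [['x'], ['z']] makes isin match nothing;
# pass the flat Series df1['A'] instead.
filtered_df = df[df['D'].isin(df1['A'])]
print(filtered_df)
```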
Some other approaches you could try:
(1) generate separate dfs for each condition, concat, then dedup (slow);
(2) use a custom function to annotate with a bool column (default False, set True if the condition is fulfilled), then filter based on that column;
(3) keep a list of indices of all rows with your row_list values, then filter using iloc based on that indices list.
Without an MRE, sample data, or a reason why your method didn't work, it's difficult to provide a more specific answer.

Python Pandas showing change in position between two dataframes

I am reading two dataframes, looking at one column, and then showing the difference in position between the two dataframes with a -1 or +1 etc.
I have tried the following code, but it only shows 0 in Position Change when there should be a difference between British Airways and Ryanair:
first = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base.csv", encoding='unicode_escape')
df1 = pd.DataFrame(first, columns=['airlines', 'Position'])
second = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base2.csv", encoding='unicode_escape')
df2 = pd.DataFrame(second, columns=['airlines', 'Position'])
df1['Position Change'] = np.where(df1['airlines'] == df2['airlines'], 0, df1['Position'] - df2['Position'])
I have also tried to do it with the following code, but I just keep getting a ValueError: cannot reindex from a duplicate axis:
df1.set_index('airlines', drop=False) # Set index to cross reference by (icao)
df2.set_index('airlines', drop=False)
df2['Position Change'] = df1[['Position']].sub(df2['Position'], axis=0)
df2 = df2.reset_index(drop=True)
pd.set_option('display.precision', 0)
Base csv looks like this -
and Base2 csv looks like this -
As you can see, British Airways is in position 3 in the Base csv and position 4 in the Base2 csv, but when running the code it just shows 0 and does not do the math between the two dataframes.
I have been stuck on this for days now; I would be so grateful for any help.
I would like to offer an easier way based on columns, values and an if-statement.
It will probably be slow on a big dataframe, but it gives you the information you expect.
first = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base.csv", encoding='unicode_escape')
df1 = pd.DataFrame(first, columns=['airlines', 'Position'])
second = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base2.csv", encoding='unicode_escape')
df2 = pd.DataFrame(second, columns=['airlines', 'Position'])
I agree that my first answer did not match your question.
Now, if I understand correctly, you want to create a new column in the DataFrame that gives you -1 if the corresponding values in the two DataFrames differ and 1 if they match.
It should help:
key = "Name_Of_Column"
new = []
for i in range(0, len(df1)):
    if df1[key][i] != df2[key][i]:
        new.append(-1)
    else:
        new.append(1)
df3 = pd.DataFrame({"Diff": new})      # I create a new DataFrame from a dictionary.
df1 = pd.concat([df1, df3], axis=1)    # join as a new column (append would stack rows below)
print(df1)
I am giving you an alternative; I am not sure whether it is appreciated or not, but it is just an idea.
After reading the two CSVs and getting the column you require, why don't you join the two dataframes on the column 'airlines'? It will merge the two dataframes with 'airlines' as the key.
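A sketch of that merge idea on toy data (the real CSVs are not shown; the suffix names are made up). Merging on 'airlines' lines rows up by airline, not by row order, which avoids the 0-everywhere problem:

```python
import pandas as pd

# Toy stand-ins for the two CSVs from the question.
df1 = pd.DataFrame({'airlines': ['British Airways', 'Ryanair'],
                    'Position': [3, 4]})
df2 = pd.DataFrame({'airlines': ['Ryanair', 'British Airways'],
                    'Position': [3, 4]})

# Merge on 'airlines' so each airline's two positions end up in one row.
merged = df1.merge(df2, on='airlines', suffixes=('_base', '_base2'))
merged['Position Change'] = merged['Position_base'] - merged['Position_base2']
print(merged)
```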

Pandas - contains from other DF

I have 2 dataframes:
DF A:
and DF B:
I need to check every row in DFA['item'] to see if it contains some of the values in DFB['original'], and if it does, add a new column DFA['my'] that corresponds to the value in DFB['my'].
So here is the result I need:
I thought of converting DFB['original'] into a list and then using regex, but that way I won't get the matching result from column 'my'.
Ok, maybe not the best solution, but it seems to be working.
I did a cartesian join and then kept the records which contain the data needed:
dfa['join'] = 1
dfb['join'] = 1
dfFull = dfa.merge(dfb, on='join').drop('join', axis=1)
dfFull['match'] = dfFull.apply(lambda x: x.original in x.item, axis=1)
dfFull[dfFull['match']]
