Pandas DataFrame delete row by repetition - python

I am looking to delete rows from a dataframe that is imported into Python with pandas.
If you look at the sheet below, the first column contains the same name multiple times. The condition is: if a value in the first column reappears in a later row, delete that row. If not, keep that row in the dataframe.
My final output should look like the following:
Presently I am doing it by converting each column into a list and deleting entries by index. I am hoping there is an easier way than this workaround.

df.drop_duplicates(subset=[df.columns[0]])
should do the trick.

Try the following code:
df.drop_duplicates(subset='columnName', keep='first', inplace=True)
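As a small illustration of the two answers above, here is a sketch on a made-up frame (the column names `name` and `value` and the data are assumptions, not taken from the question). It also includes a variant that drops a row only when it repeats the value of the row immediately before it, in case that is what "re-appears in a next row" is meant to say.

```python
import pandas as pd

# Hypothetical data: the first column repeats some names.
df = pd.DataFrame({
    "name": ["a", "a", "b", "a", "c", "c"],
    "value": [1, 2, 3, 4, 5, 6],
})

first_col = df.columns[0]

# Keep only the first occurrence of each value in the first column.
unique_rows = df.drop_duplicates(subset=[first_col], keep="first")

# Variant: drop a row only if it repeats the previous row's value,
# so a name may come back later after a different name.
consecutive = df[df[first_col] != df[first_col].shift()]
```

Note that without `inplace=True`, `drop_duplicates` returns a new frame and leaves `df` untouched, so the result has to be assigned to a name.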

Related

How do I add every column in a pandas dataframe to a list except for the first column?

Normally, I would be able to call dataframe.columns for a list of all columns, but I don't want to include the very first column in my list. Writing each column manually is an option, but one I'd like to avoid, given the few hundred column headers I'm working with. I do need to use this column, though, so deleting it from the dataframe entirely wouldn't work. How can I put every column into a list except for the first one?
This should work:
list(df.columns[1:])
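A minimal sketch of the slice above, using a made-up three-column frame (the column names are assumptions for illustration). The first column stays in the dataframe; it is only left out of the list.

```python
import pandas as pd

# Hypothetical frame; "id" plays the role of the first column.
df = pd.DataFrame({"id": [1, 2], "x": [3, 4], "y": [5, 6]})

# Every column label except the first, as a plain Python list.
other_cols = list(df.columns[1:])

# The list can then be used to select those columns.
subset = df[other_cols]
```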

Save an updated dataframe in Pandas

I am using pandas for the first time.
df.groupby(np.arange(len(df))//10).mean()
I used the code above, which works to take the average of every block of 10 rows. I want to save this updated dataframe, but calling df.to_csv saves the original dataframe that I imported.
I also want to multiply one column of the averaged dataframe (the df.groupby result) by a number and store the result in a new column. How do I do that?
The operation:
df.groupby(np.arange(len(df))//10).mean()
returns the averaged dataframe you want, but it won't change the original dataframe. Instead you'll need to assign the result:
df_new = df.groupby(np.arange(len(df))//10).mean()
You could assign it to the same name if you want. The other option is that some operations which you might expect to modify the dataframe accept an inplace argument, which normally defaults to False. See this question on SO.
To create a new column that is an existing column multiplied by a number, you'd do:
df_new['new_col'] = df_new['existing_col']*a_number
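Putting the steps of this answer together, a minimal sketch might look like the following. The 30-row frame, the value of `a_number`, and the file name mentioned in the comment are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical input: 30 rows, so there are three blocks of 10.
df = pd.DataFrame({"existing_col": np.arange(30, dtype=float)})

# Average each block of 10 rows and keep the result under a new name.
df_new = df.groupby(np.arange(len(df)) // 10).mean()

# New column: an existing column multiplied by a number.
a_number = 2.5
df_new["new_col"] = df_new["existing_col"] * a_number

# With no path, to_csv returns the CSV text; pass a path such as
# "averaged.csv" to write the averaged frame to disk instead.
csv_text = df_new.to_csv(index=False)
```

The point is that `to_csv` is called on `df_new`, not on `df`, so the averaged data is what gets saved.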

Splitting column of dataframe based on text characters in cells

I imported a .csv file with a single column of data into a dataframe that I am trying to clean up by splitting the column based on various string occurrences within the cells. I've tried numerous means to split the column, but can't seem to get it to work. My latest attempt was using the following:
df.loc[:,'DataCol'] = df.DataCol.str.split(pat=':\n',expand=True)
df
The result is a dataframe that is still one column and completely unchanged. What am I doing wrong? This is my first time doing anything like this so please forgive the simple question.
The df.loc assignment may not be updating the column the way you expect. Try replacing the expression below with df['DataCol'], which references the actual column in the original dataframe.
df.loc[:,'DataCol']
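One further thing that may be worth checking: with expand=True, str.split returns a DataFrame with one column per piece, so the result does not fit naturally into the single column DataCol. A hedged sketch of assigning the pieces to new columns instead is shown below. The sample strings, the ": " separator, and the column names `key` and `value` are assumptions, since the original data is not shown.

```python
import pandas as pd

# Hypothetical single-column data.
df = pd.DataFrame({"DataCol": ["name: alice", "name: bob"]})

# expand=True gives a DataFrame with one column per split piece.
parts = df["DataCol"].str.split(pat=": ", expand=True)
parts.columns = ["key", "value"]

# Attach the pieces as new columns next to the original one.
df = df.join(parts)
```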

python: looping through indices in Excel and replacing with string

So I have an Excel sheet with the following format:
Now what I'm looking to do is to loop through each index cell in column A and assign all cells the same value until the next 0 is reached. So for example:
Now I have tried importing the excel file into a pandas dataframe and then using for loops to do this, but I can't seem to make it work. Any suggestions or directions to the appropriate method would be much appreciated!
Thank you for your time
Edit:
Using @wen-ben's method: s.index=pd.Series((s.index==0).cumsum()).map({1:'bananas',2:'cherries',3:'pineapples'})
just enters the first element (bananas) for all cells in column A.
Assuming you have a dataframe s, you can use cumsum:
s.index=pd.Series((s.index==0).cumsum()).map({1:'bananas',2:'cherries',3:'pineapples'})
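To make the cumsum idea concrete, here is a small sketch on invented data (the column `B` and the index values are assumptions; the real sheet is not shown). Each 0 in the index starts a new group, the running count of zeros numbers the groups, and map turns those numbers into labels.

```python
import pandas as pd

# Hypothetical frame whose index restarts at 0 for each block.
s = pd.DataFrame({"B": [10, 20, 30, 40, 50, 60]},
                 index=[0, 1, 2, 0, 1, 0])

# Running count of zeros seen so far: 1, 1, 1, 2, 2, 3.
groups = pd.Series((s.index == 0).cumsum())

# Replace each group number with its label.
s.index = groups.map({1: "bananas", 2: "cherries", 3: "pineapples"})
```

This only starts a new group where an index label compares equal to 0, so it may be worth checking how the index was read from Excel if every row ends up with the first label.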

How to iterate row by row in a pandas dataframe and look for a value in its columns

I must read each row of an Excel file and perform calculations based on the contents of each row. Each row is divided into columns; my problem is that I cannot find a way to access the contents of those columns.
I'm reading the rows with:
for i in df.index,:
    print(df.loc[i])
This works well, but when I try to access, say, the 4th column with this type of indexing, I get an error:
for i in df.index,:
    print(df.loc[i][3])
I'm pretty sure I'm approaching the indexing issue in the wrong way, but I cannot figure out how to solve it.
You can use iterrows(), like in the following code:
for index, row in dataFrame.iterrows():
    print(row)
But this is not the most efficient way to iterate over a pandas DataFrame; more info at this post.
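As a sketch of how the 4th column could be reached inside such a loop, the example below uses positional access with iloc on each row. The four-column frame is made up for illustration. It also shows selecting the whole column at once, which avoids the loop when no per-row logic is needed.

```python
import pandas as pd

# Hypothetical frame with four columns.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# Row by row: each row is a Series, so iloc[3] is its 4th value.
fourth_values = []
for index, row in df.iterrows():
    fourth_values.append(row.iloc[3])

# Without a loop: all rows of the 4th column in one step.
fourth_column = df.iloc[:, 3]
```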
