Pandas DataFrame: How to drop_duplicates() based on index subset? - python

Wondering if someone could please help me on this:
I have a pandas df with a rather large number of columns (over 50). I'd like to remove duplicates based on a subset (columns 2 to 50).
I've been trying to use df.drop_duplicates(subset=["col1","col2",...]), but I'm wondering if there is a way to pass the column positions instead, so I don't have to write out all the column headers to consider for the drop and can instead do something along the lines of df.drop_duplicates(subset=[2:]).
Thanks upfront

You can slice df.columns like:
df.drop_duplicates(subset = df.columns[2:])
Or:
df.drop_duplicates(subset = df.columns[2:].tolist())
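For example, a quick sketch with invented data (the frame and column names here are hypothetical):
import pandas as pd

# hypothetical frame: de-duplicate on everything except the first two columns
df = pd.DataFrame({
    "id": [1, 2, 3],
    "label": ["a", "b", "c"],
    "x": [10, 10, 20],
    "y": [5, 5, 7],
})

# subset=df.columns[2:] means only 'x' and 'y' decide what counts as a duplicate,
# so the second row is dropped as a duplicate of the first
deduped = df.drop_duplicates(subset=df.columns[2:])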

Related

How to merge a big dataframe with small dataframe?

I have a big dataframe with 100 rows and the structure is [qtr_dates<datetime.date>, sales<float>], and a small dataframe with the same structure with fewer than 100 rows. I want to merge these two dfs such that the merged df has all the rows from the small df, with the remaining rows taken from the big df.
Right now I am doing this
df = big_df.merge(small_df, on=big_df.columns.tolist(), how='outer')
But this is creating a df with duplicate qtr_dates.
Use concat and then remove duplicates with DataFrame.drop_duplicates:
pd.concat([small_df, big_df], ignore_index=True).drop_duplicates(subset=['qtr_dates'])
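As a rough illustration (dates and numbers invented), passing small_df first means its row wins whenever a qtr_date appears in both frames, because drop_duplicates keeps the first occurrence by default:
import datetime
import pandas as pd

big_df = pd.DataFrame({'qtr_dates': [datetime.date(2020, 3, 31), datetime.date(2020, 6, 30)],
                       'sales': [100.0, 200.0]})
small_df = pd.DataFrame({'qtr_dates': [datetime.date(2020, 6, 30)],
                         'sales': [250.0]})

out = pd.concat([small_df, big_df], ignore_index=True).drop_duplicates(subset=['qtr_dates'])
# 2020-06-30 keeps sales 250.0 from small_df; 2020-03-31 comes from big_df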
If I understand correctly, you want everything from the bigger dataframe, but if that date is present in the smaller data frame you would want it replaced by the relevant value from the smaller one?
Hence I think you want to do this:
df = big_df.merge(small_df, on=big_df.columns.tolist(), how='left', indicator=True)
df = df[df['_merge'] != "both"]
df_out = pd.concat([df,small_df],ignore_index=True)
The second step removes any rows from big_df which also exist in small_df, before the small_df rows are then added by concatenating rather than merging.
If you had more column names that weren't involved with the join you'd have to do some column renaming/dropping though I think.
Hope that's right.
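One detail the snippet glosses over, sketched here as an assumption rather than a tested fix: the indicator merge leaves a _merge column behind, so you would probably drop it before the concat so it does not show up in df_out:
df = big_df.merge(small_df, on=big_df.columns.tolist(), how='left', indicator=True)
df = df[df['_merge'] != "both"].drop(columns='_merge')
df_out = pd.concat([df, small_df], ignore_index=True)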
Maybe try join instead of merge.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

How to include grouping variable in the groupby().diff() results

Newbie to pandas here, so go easy on me.
I have a dataframe with lots of columns. I want to do something like
df.groupby('row').diff()
However, the result of the groupby doesn't include the row column.
How do I include the row column in the groupby results?
Alternatively, is it possible to merge the groupby results back into the dataframe?
Create an index from the row column first:
df1 = df.set_index('row').groupby('row').diff().reset_index()
Or:
df1 = df.set_index('row').groupby(level=0).diff().reset_index()
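As a small illustration of the first form (column names and values invented):
import pandas as pd

df = pd.DataFrame({'row': ['a', 'a', 'b', 'b'],
                   'val': [1, 3, 10, 14]})

df1 = df.set_index('row').groupby('row').diff().reset_index()
#   row  val
# 0   a  NaN
# 1   a  2.0
# 2   b  NaN
# 3   b  4.0

# to merge the diffs back into the original dataframe instead:
df['val_diff'] = df.groupby('row')['val'].diff()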
You could use agg with np.diff:
import numpy as np
df.groupby('row').agg(np.diff)

How to find if a value exists in all rows of a dataframe?

I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in every row of the dataframe.
P.S. - I am new to python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive; is there any better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can chain several filters and combine them with AND/OR logical operators (& and |).
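For instance, a small sketch of chaining two conditions (data and names invented):
import pandas as pd

df = pd.DataFrame({"Column1": ["ValueToFind", "Other", "ValueToFind"],
                   "Column2": [1, 2, -3]})

# & is AND, | is OR; each condition needs its own parentheses
df2 = df[(df["Column1"] == "ValueToFind") & (df["Column2"] > 0)]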
You can try
for i in uniqueArray:
    if newDF['MKT'].str.contains(i).any():
        # do your task
You can use the isin() method of the pd.Series object.
Assuming you have a dataframe named df, you can check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your thing on new_df, and join/merge/concat it to the former df as you wish.
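A toy run of that idea (the array and column values are invented):
import pandas as pd

uniqueArray = ['AAA', 'BBB']
df = pd.DataFrame({'MKT': ['AAA', 'CCC', 'BBB']})

new_df = df[df.MKT.isin(uniqueArray)].copy()
# new_df keeps only the 'AAA' and 'BBB' rows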

I have a pandas dataframe which I would like to be sliced after every 4 columns

I have a pandas dataframe which I would like to slice after every 4 columns and then stack vertically on top of each other, keeping the date as the index. Is this possible using np.vstack()? Thanks in advance!
Please refer to the images in the original post: one shows the original dataframe, and the other shows what I want it modified to.
Until you provide a Minimal, Complete, and Verifiable example, I will not test this answer, but the following should work:
Given that we have the data stored in a pandas DataFrame called df, we can use pd.melt:
moltendfs = []
for i in range(4):
    moltendfs.append(df.iloc[:, i::4].reset_index().melt(id_vars='date'))

newdf = pd.concat(moltendfs, axis=1)
We use iloc to take only every fourth column, starting with the i-th column. Then we reset_index in order to be able to keep the date column as our identifier variable. We use melt in order to melt our DataFrame. Finally we simply concatenate all of these molten DataFrames together side by side.
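For reference, a self-contained toy run of the snippet above; the dates, column names and the number of columns are invented here, since no example data was posted:
import numpy as np
import pandas as pd

dates = pd.date_range('2020-01-01', periods=3, name='date')
df = pd.DataFrame(np.arange(24).reshape(3, 8),
                  index=dates,
                  columns=[f'c{i}' for i in range(8)])

moltendfs = []
for i in range(4):
    # every fourth column, starting at column i
    moltendfs.append(df.iloc[:, i::4].reset_index().melt(id_vars='date'))

newdf = pd.concat(moltendfs, axis=1)
# each molten frame has columns date / variable / value, stacked side by side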

Adding DataFrame columns in Python pandas

I have a pandas DataFrame that has a number of columns (about 20) containing string objects. I'm looking for a simple method to add all of the columns together into one new column, but have so far been unsuccessful, e.g.:
for i in df.columns:
    df['newcolumn'] = df['newcolumn'] + '/' + df.ix[:, i]
This results in an empty DataFrame column 'newcolumn' instead of the concatenated column.
I'm new to pandas, so any help would be much appreciated.
df['newcolumn'] = df.apply(''.join, axis=1)
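If the goal is the '/'-separated concatenation from the question, a hedged variant (assuming every column already holds strings and 'newcolumn' has not been added yet) would be:
# join each row's values with '/' instead of joining them with no separator
df['newcolumn'] = df.apply('/'.join, axis=1)
If some columns are not strings, casting with df.astype(str) before the apply handles the conversion.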
