How to get rid of equal columns in our pandas data frame? - python

I have three data frames with sizes df1=(176, 5766), df2=(8, 5766), and df3=(16, 5766). Despite the different column names, they contain similar data values (all the columns in each of the three frames are equal), but when I use, for example,
# transpose, drop the duplicate rows (i.e. the duplicate columns), transpose back
df1.T.drop_duplicates().T
df2.T.drop_duplicates().T
df3.T.drop_duplicates().T
although this should produce the same number of output columns for all three, it instead converts them to
df1=(176, 581), df2=(8, 632), df3=(16, 622).
How can I get rid of this?

Syntax: df.drop_duplicates(subset=None, keep='first', inplace=False)
Parameters:
subset:
Takes a column label or a list of column labels. Its default value is None. After passing columns, only those columns are considered when looking for duplicates.
keep:
Controls how duplicate values are treated. It accepts three distinct values: 'first', 'last', and False; the default is 'first'. With 'first', the first occurrence is considered unique and the rest of the equal values are treated as duplicates.
inplace:
Boolean; if True, the rows with duplicates are removed from the frame itself.
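For instance, here are the three keep options side by side on a toy frame (the values are made up for illustration):
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2]})
df.drop_duplicates(keep='first')   # keeps rows 0 and 2
df.drop_duplicates(keep='last')    # keeps rows 1 and 2
df.drop_duplicates(keep=False)     # keeps row 2 only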
# dropping duplicate rows entirely: keep=False removes every row that
# has a duplicate, and inplace=True modifies the frame itself
df1.drop_duplicates(keep=False, inplace=True)
df2.drop_duplicates(keep=False, inplace=True)
df3.drop_duplicates(keep=False, inplace=True)
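If the columns really do hold equal values, the differing column counts usually mean some values are not exactly identical (floating-point noise is a common culprit). Below is a minimal sketch that rounds before comparing; the helper name and the tolerance are assumptions, not part of the original answer:
import pandas as pd

def drop_equal_columns(df, decimals=6):
    # round so tiny floating-point differences don't make
    # otherwise-equal columns look distinct (tolerance is assumed)
    rounded = df.round(decimals)
    # duplicated() on the transpose flags columns whose values already
    # appeared; keep the first column of each identical group
    keep = ~rounded.T.duplicated()
    return df.loc[:, keep]

df1 = drop_equal_columns(df1)
df2 = drop_equal_columns(df2)
df3 = drop_equal_columns(df3)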

Related

How to convert a pandas dataframe with non-unique indexes into one with unique indexes?

I created a dataframe with some previous operations, but when I query a column with an index (for example, df['order_number'][0]), multiple rows/records come back as output.
A screenshot (not reproduced here) showed the difference in length between the unique index values and all index entries of the dataframe.
It looks like the data kept its original index when you merged/joined the DataFrames. Try:
df.reset_index()
Could you show a df.head(), for example? Usually when you consume a data source, if you set the index arg to True, each row will be assigned a unique numerical index.
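For example, a small sketch of how a merged index ends up with duplicates and how reset_index repairs it (toy data; the column name comes from the question):
import pandas as pd
part1 = pd.DataFrame({'order_number': [101, 102]})
part2 = pd.DataFrame({'order_number': [103, 104]})
df = pd.concat([part1, part2])    # index is now 0, 1, 0, 1
df['order_number'][0]             # returns two rows, one per index label 0
df = df.reset_index(drop=True)    # rebuild a unique 0..3 index
df['order_number'][0]             # now a single value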

Python/Pandas: Divide numeric columns from different dataframes based on a common row identifier and unique row-col combination

I would like to calculate the rate of change between the numeric columns of two dataframes based on a common unique row identifier and unique row-column combination.
Here is an example. I opted to present the tables as images in order to use colors to highlight the peculiarities of the two datasets. That is, each dataframe contains numeric and non-numeric columns, and rows and columns may not be in the same order. Also, the numeric columns on which the calculation should take place are always those after the 'Time' column.
The df.divide() approach doesn't work here because the rows and columns are not in the same order. I also saw the top answer in this thread, but again the approach doesn't generalize to mine.
If your problem is essentially that the columns and rows are not in the same order, it can be solved by reordering them.
# Identify the columns on which the calculation is to be computed.
# Since 'Time' is the 4th column, we take all columns after that.
valCols = list(df1.columns)[4:]
# Sort both datasets by the shared identifier so that the rows align.
df1 = df1.sort_values('ID')
df2 = df2.sort_values('ID')
# Keep only the value columns. This also ensures that the columns
# of both frames are in the same order now.
df1 = df1[valCols]
df2 = df2[valCols]
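One caveat worth adding: pandas aligns on index labels during arithmetic, and after sort_values the two frames still carry their original row labels. Resetting the indexes (or setting 'ID' as the index) makes the element-wise step safe. A short sketch; the rate-of-change formula here is one common definition and an assumption, not part of the original answer:
# align by position now that both frames are sorted the same way
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
rate = (df2 - df1) / df1   # element-wise rate of change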

Identify records with duplicate values in a specified column using pandas

I am new to Python and Pandas.
I am cleansing a data file of 50,000 pieces of equipment (50,000 rows and 10 columns).
One column ('UNITNUMBER') should be unique for each record. However there are duplicates and I'm trying to produce two data frames: one containing all the records where UNITNUMBER is unique and a second containing all the records where UNITNUMBER is repeated in another record.
The following produces a series where UNITNUMBER is the Index, True means duplicated, and False means unique.
MData=pd.read_excel(MFile,MFileTab, skiprows=0)
DupSeries=(MData.UNITNUMBER.value_counts()>1)
The following produces a series with one record for each piece of equipment, in the same order as the original DataFrame; the index is the UNITNUMBER values and the series value is True or False.
DupSeries[MData['UNITNUMBER']]
I expected that
MData[DupSeries[MData['UNITNUMBER']]]
would yield all the records in MData where UNITNUMBER is duplicated but instead I get a warning and an error:
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
ValueError: cannot reindex from a duplicate axis
In short, I can't figure out the syntax. Please help. I'm happy to use a completely different method if there is one.
# keep=False marks every row whose UNITNUMBER occurs more than once
MDuplicates = MData.loc[MData.duplicated('UNITNUMBER', keep=False)]
# negating the same mask keeps only the rows whose UNITNUMBER occurs once
MUnique = MData.loc[~MData.duplicated('UNITNUMBER', keep=False)]
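If you'd rather stay close to the original value_counts idea, mapping the counts back onto the column produces a mask that shares MData's index, which avoids the reindexing error. A small sketch, assuming MData is loaded as above:
# broadcast each UNITNUMBER's occurrence count back onto its rows;
# the resulting series is aligned with MData's index
counts = MData['UNITNUMBER'].map(MData['UNITNUMBER'].value_counts())
MDuplicates = MData[counts > 1]
MUnique = MData[counts == 1]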

pandas unique values multiple columns different dtypes

Similar to pandas unique values multiple columns, I want to count the number of unique values per column. However, as the dtypes differ, I get the following error:
The data frame looks like
A small[['TARGET', 'title']].apply(pd.Series.describe) gives me the result, but only for the category types, and I am unsure how to filter the index for only the last row with the unique values per column.
Use apply and np.unique to grab the unique values in each column and take their size:
import numpy as np
small[['TARGET','title']].apply(lambda x: np.unique(x).size)
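As a side note, pandas' built-in nunique does the same count per column without the sorting that np.unique performs, which is what tends to fail on mixed dtypes. A minimal sketch with made-up data:
import pandas as pd
small = pd.DataFrame({'TARGET': [1, 0, 1, 1], 'title': ['a', 'b', 'a', 'c']})
small[['TARGET', 'title']].nunique()   # TARGET: 2, title: 3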
Thanks!

Unique Value Index from two fields

I'm new to pandas and python, and could definitely use some help.
I have the code below, which almost does what I want. It creates dummy variables for the unique values in a field and indexes them by the unique combinations of the unique values in two other fields.
What I would like is only one row for each unique combination of the fields used for the index. Right now I get multiple rows for, say, 'asset subs end dt' = 10/30/2008 and 'reseller csn' = 55008 if the dummy variable comes up 3 times. I would rather have one row for that combination of index field values, with a 3 in the dummy variable column.
Code:
df = data
df = df.set_index(['ASSET_SUBS_END_DT','RESELLER_CSN'])
Dummies=pd.get_dummies(df['EXPERTISE'])
something like:
df.groupby(level=[0, 1]).EXPERTISE.count()
When you do this groupby, everything with the same index is grouped together. Assuming your data in EXPERTISE is not null, you will get a new Series back with unique index values and the count per index. Try it out for yourself, play around with the results, and see how it can be combined with your existing DataFrame to get the final result you want.
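For completeness, if the goal is one row per unique index combination with the occurrence count in each dummy column, summing the dummies per group gets there directly. A sketch building on the code above (variable names are illustrative):
import pandas as pd
df = data.set_index(['ASSET_SUBS_END_DT', 'RESELLER_CSN'])
dummies = pd.get_dummies(df['EXPERTISE'])
# one row per unique index pair; each dummy column now holds how
# many times that EXPERTISE value occurred for that pair
dummy_counts = dummies.groupby(level=[0, 1]).sum()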
