I have the DataFrame below.
As you can see, ItemNo 1 is duplicated three times, and each column has a value corresponding to it.
I am looking for a method to check against all columns and, if they match, combine Price, Sales, and Stock into one entry instead of three.
Any help will be greatly appreciated.
Simply remove all the NaN instances and redefine the column names:
df = df1.apply(lambda x: pd.Series(x.dropna().values), axis=1)
df.columns = ['ItemNo','Category','SIZE','Model','Customer','Week Date','<New col name>']
For collapsing the result to one row per ItemNo, you can use groupby like this:
df.groupby('ItemNo', as_index=False).first()
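For instance, here is a minimal, self-contained sketch of the groupby step with made-up data (the Price/Sales/Stock names come from the question; the values are invented):

import pandas as pd
import numpy as np

# Hypothetical input: ItemNo 1 is split across three rows, each carrying one value
df1 = pd.DataFrame({'ItemNo': [1, 1, 1],
                    'Price': [9.99, np.nan, np.nan],
                    'Sales': [np.nan, 20, np.nan],
                    'Stock': [np.nan, np.nan, 5]})

# first() picks the first non-null value per column within each group,
# collapsing the three partial rows into a single complete row
print(df1.groupby('ItemNo', as_index=False).first())
#    ItemNo  Price  Sales  Stock
# 0       1   9.99   20.0    5.0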
I have a df where I want to check for duplicate rows in only two of the columns; if those columns match the previous row, I'd like to isolate/print them. For example, if rows 12-89 have the same values in column 2 and column 3 as the previous row(s), then I want to know that range of rows.
See image 1 for an example of the df, where 'pm10_ugm3' and 'pm25_ugm3' are duplicated but the other columns are not:
Many thanks
Try the dataframe's duplicated function. It returns a boolean Series that you can use to slice/select those rows. Some variation of this will get you close:
dup_rows = df.duplicated(subset=['col1', 'col2'], keep='first')
print(df[dup_rows])
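One such variation, sketched with invented data (the 'site' column and all values are assumptions; only the two pollutant columns matter for the check):

import pandas as pd

df = pd.DataFrame({'site': ['A', 'B', 'C', 'D'],
                   'pm10_ugm3': [12.0, 12.0, 12.0, 7.5],
                   'pm25_ugm3': [5.0, 5.0, 5.0, 3.1]})

# keep=False flags every member of a duplicated group, not just the repeats,
# so the index of the result shows the full range of affected rows
dup_rows = df.duplicated(subset=['pm10_ugm3', 'pm25_ugm3'], keep=False)
print(df[dup_rows])
#   site  pm10_ugm3  pm25_ugm3
# 0    A       12.0        5.0
# 1    B       12.0        5.0
# 2    C       12.0        5.0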
I have a dataframe with 7 columns and ~5,000 rows. I want to check that all the column values in a row are in my list and, if so, either add them to a new dataframe OR remove the rows where not all values match, i.e. remove the false rows (whichever is easiest):
for row in df:
    for column in row:
        if df.iloc[row, column].isin(MyList):
            ...*something*
I could imagine that .apply and .all could be used, but I'm afraid my python skills are limited, any help?
If I understood correctly, you can solve this by using apply with a lambda expression like:
df.loc[df.apply(lambda row: all(value in MyList for value in row), axis=1)]
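A quick sketch with invented data (MyList and the frame are assumptions), including a vectorised alternative: isin() marks each cell, and all(axis=1) keeps only rows where every cell matched.

import pandas as pd

MyList = [1, 2, 3]  # hypothetical whitelist
df = pd.DataFrame({'a': [1, 4], 'b': [2, 2], 'c': [3, 3]})

# Row-wise apply, as above: keep rows where every cell is in MyList
good = df.loc[df.apply(lambda row: all(value in MyList for value in row), axis=1)]

# Vectorised equivalent: isin() builds a boolean frame, all(axis=1) reduces per row
good_fast = df[df.isin(MyList).all(axis=1)]
print(good)
print(good_fast)
#    a  b  c
# 0  1  2  3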
I would like to have a new dataframe with only rows that are duplicated in the previous df.
I tried to assign a new column that is true if there are duplicates and then select only the rows that are true. However, I got 0 entries, even though I am sure that I have duplicates in the df.
I want to keep in the old dataframe the first rows and remove all the other duplicates.
The column with the duplicate values is called 'merged'.
df = df.assign(
    is_duplicate=lambda d: d.duplicated()
).sort_values('merged').reset_index(drop=True)
df2 = df.loc[df['is_duplicate'] == 'True']
They are not strings, they are booleans, so use:
df2 = df.loc[df['is_duplicate']]
I think you need boolean indexing, loc should be removed:
df[df.duplicated()]
Also, your solution cannot be used together with .reset_index(drop=True), because after re-indexing different rows get filtered; sorting should be done either before or after the solution:
df = df.assign(is_duplicate=lambda d: d.duplicated())
df2 = df[df['is_duplicate']]
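A minimal sketch covering both halves of the question, with an invented 'merged' column: duplicated() (keep='first' by default) marks every repeat after the first occurrence, while drop_duplicates() keeps only the first rows.

import pandas as pd

df = pd.DataFrame({'merged': ['x', 'y', 'x', 'x'], 'other': [1, 2, 3, 4]})

# New frame holding only the repeated rows (first occurrences excluded)
df2 = df[df.duplicated(subset=['merged'])]

# Old frame reduced to the first occurrence of each duplicate group
df = df.drop_duplicates(subset=['merged'])
print(df2)
print(df)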
I need to create some additional columns in my table, or a separate table, based on the following:
I have a table
and I need to create additional columns where the column indexes (the names of the columns) are inserted as values, like this:
How to do it in pandas? Any ideas?
Thank you
If you need the matching column names only for the 1 values:
df = (df.set_index('name')
        .eq(1)
        .dot(df.columns[1:].astype(str) + ',')
        .str.rstrip(',')
        .str.split(',', expand=True)
        .add_prefix('c')
        .reset_index())
print(df)
Explanation:
The idea is to create a boolean mask with True for the values to be replaced by column names: compare against 1 with DataFrame.eq, then use matrix multiplication via DataFrame.dot with all columns except the first, each with a separator appended. Then strip the trailing separator with Series.str.rstrip, split into new columns with Series.str.split(expand=True), and rename the columns with DataFrame.add_prefix.
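A runnable sketch of the whole chain with invented data (the 'name' column comes from the answer; the indicator columns x/y/z are assumptions):

import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'],
                   'x': [1, 0],
                   'y': [1, 1],
                   'z': [0, 1]})

out = (df.set_index('name')
         .eq(1)                                  # boolean mask of the 1s
         .dot(df.columns[1:].astype(str) + ',')  # join matching column names
         .str.rstrip(',')                        # drop the trailing separator
         .str.split(',', expand=True)            # one matched name per column
         .add_prefix('c')
         .reset_index())
print(out)
#   name c0 c1
# 0    a  x  y
# 1    b  y  z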
Another solution:
df1 = df.set_index('name').eq(1).apply(lambda x: x.index[x].tolist(), axis=1)
df = pd.DataFrame(df1.values.tolist(), index=df1.index).add_prefix('c').reset_index()
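With the same hypothetical frame as in the sketch above, this produces identical output; it trades the string join/split round-trip for a row-wise apply that collects the matching column names into lists.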
I have a Pandas dataframe with two columns:
I would like to group the numbers by the column Fee_Code. I do the following:
df.groupby('Fee_Code').sum()
However, in the output I get 137651.03 for the row Management fees, which is just the first value. When I do:
df.groupby('Fee_Code').count()
I do see that Management fees has 2 observations. So why is .sum() not working, then?
EDIT:
df.groupby('Fee_Code').get_group('Management fees') returns:
Solved it. My column of values was not numeric, so it was just taking the first element.
To make it numeric I did the following:
df.loc[:, 'Value'] = pd.to_numeric(df.loc[:, 'Value'], downcast='float', errors='coerce')
And then .groupby(...).sum() worked perfectly fine.
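A sketch of the fix with invented numbers (137651.03 is from the question; the second value is made up):

import pandas as pd

# 'Value' arrives as strings, so groupby().sum() cannot add the rows numerically
df = pd.DataFrame({'Fee_Code': ['Management fees', 'Management fees'],
                   'Value': ['137651.03', '2500.00']})

df.loc[:, 'Value'] = pd.to_numeric(df.loc[:, 'Value'], downcast='float', errors='coerce')
print(df.groupby('Fee_Code').sum())  # now a single numeric total per Fee_Code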