I have the following DataFrame:
KPI_01 KPI_02 KPI_03
date
2015-05-24 green green red
2015-06-24 orange red NaN
And I want to count the number of colors for each date in order to obtain:
value green orange red
date
2015-05-24 2 0 1
2015-06-24 0 1 1
Here is my code that does the job. Is there a shorter, better way to do it?
import numpy as np
import pandas as pd
# Test data
df = pd.DataFrame({'date': ['05-24-2015','06-24-2015'],
'KPI_01': ['green','orange'],
'KPI_02': ['green','red'],
'KPI_03': ['red',np.nan]
})
df.set_index('date', inplace=True)
# Transforming to long format
df.reset_index(inplace=True)
long = pd.melt(df, id_vars=['date'])
# Pivoting data
pivoted = pd.pivot_table(long, index='date', columns=['value'], aggfunc='count', fill_value=0)
# Dropping unnecessary level
pivoted.columns = pivoted.columns.droplevel()
You could apply value_counts:
>>> df.apply(pd.Series.value_counts,axis=1).fillna(0)
green orange red
date
05-24-2015 2 0 1
06-24-2015 0 1 1
apply tends to be slow, and row-wise operations are slow as well, but if your frame isn't very big you might not even notice the difference.
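If the row-wise apply does become a bottleneck, one possible alternative is to reshape to long format and cross-tabulate. This is only a sketch, assuming the frame is indexed by date as in the question; the names long_df and counts are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['05-24-2015', '06-24-2015'],
                   'KPI_01': ['green', 'orange'],
                   'KPI_02': ['green', 'red'],
                   'KPI_03': ['red', np.nan]}).set_index('date')

# Long format: one row per (date, KPI) pair, with the colour in the 'value' column
long_df = df.reset_index().melt(id_vars='date')

# Count each colour per date; missing values are not counted
counts = pd.crosstab(long_df['date'], long_df['value'])
print(counts)
```

Unlike the value_counts version, this keeps integer counts, so no fillna step is needed.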
I want to subset a DataFrame by two columns in different dataframes if the values in the columns are the same. Here is an example of df1 and df2:
df1
A
0 apple
1 pear
2 orange
3 apple
df2
B
0 apple
1 orange
2 orange
3 pear
I would like the output to be a subsetted df1 based upon the df2 column:
A
0 apple
2 orange
I tried
df1 = df1[df1.A == df2.B] but get the following error:
ValueError: Can only compare identically-labeled Series objects
I do not want to rename the column in either.
What is the best way to do this? Thanks
If you need to compare the index values together with both columns, create a MultiIndex and use Index.isin:
df = df1[df1.set_index('A', append=True).index.isin(df2.set_index('B', append=True).index)]
print (df)
A
0 apple
2 orange
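As a self-contained sketch of that idea, the snippet below rebuilds the two frames from the question and also shows a purely positional comparison. The positional variant is an assumption on my part: it is only valid when both frames have the same length and row order. The names pairs1, pairs2, by_label and by_position are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['apple', 'pear', 'orange', 'apple']})
df2 = pd.DataFrame({'B': ['apple', 'orange', 'orange', 'pear']})

# Pair each row label with its value, then test membership of the (label, value) pairs
pairs1 = df1.set_index('A', append=True).index
pairs2 = df2.set_index('B', append=True).index
by_label = df1[pairs1.isin(pairs2)]

# Positional alternative: compares row 0 with row 0, row 1 with row 1, and so on,
# ignoring the index labels entirely
by_position = df1[df1['A'].to_numpy() == df2['B'].to_numpy()]

print(by_label)
```

Neither variant renames a column in df1 or df2.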
I'd like to group a data frame by a column called 'Fruit' and calculate the percentage of each fruit that is 'Good'.
See below for my initial dataframe
import pandas as pd
df = pd.DataFrame({'Fruit': ['Apple','Apple','Banana'], 'Condition': ['Good','Bad','Good']})
Dataframe
Fruit Condition
0 Apple Good
1 Apple Bad
2 Banana Good
See below for my desired output data frame
Fruit Percentage
0 Apple 50%
1 Banana 100%
Note: Because there is 1 "Good" Apple and 1 "Bad" Apple, the percentage of Good Apples is 50%.
See below for my attempt which is overwriting all the columns
groupedDF = df.groupby('Fruit')
groupedDF.apply(lambda x: x[(x['Condition'] == 'Good')].count()/x.count())
See below for resulting table, which seems to calculate percentage but within existing columns instead of new column:
Fruit Condition
Fruit
Apple 0.5 0.5
Banana 1.0 1.0
We can compare Condition with eq and take advantage of the fact that True counts as 1 and False as 0 when processed as numbers, then take the groupby mean over Fruit:
new_df = (
df['Condition'].eq('Good').groupby(df['Fruit']).mean().reset_index()
)
new_df:
Fruit Condition
0 Apple 0.5
1 Banana 1.0
We can further map to a format string and rename to get the desired output shown in the question:
new_df = (
df['Condition'].eq('Good')
.groupby(df['Fruit']).mean()
.map('{:.0%}'.format) # Change to Percent Format
.rename('Percentage') # Rename Column to Percentage
.reset_index() # Restore RangeIndex and make Fruit a Column
)
new_df:
Fruit Percentage
0 Apple 50%
1 Banana 100%
*Naturally further manipulations can be done as well.
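To see why the mean of a boolean column is the share of 'Good' rows, the sketch below computes the same number explicitly as good count divided by group size. The names is_good, share and explicit are illustrative, not from the answer.

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'Apple', 'Banana'],
                   'Condition': ['Good', 'Bad', 'Good']})

is_good = df['Condition'].eq('Good')  # True/False per row

# Mean of booleans per fruit
share = is_good.groupby(df['Fruit']).mean()

# The same quantity spelled out: number of good rows / number of rows
explicit = is_good.groupby(df['Fruit']).sum() / df.groupby('Fruit').size()

print(share)
print(explicit)
```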
I have a dataframe as below.
Date Fruit level_0 Num Color
0 2013-11-25 Apple DF2 22.1 Red
1 2013-11-24 Banana DF1 22.1 Yellow
2 2013-11-24 Banana DF2 122.1 Yellow
3 2013-11-23 Celery DF1 10.2 Green
4 2013-11-24 Orange DF1 8.6 Orange
5 2013-11-24 Orange DF2 8.6 Orange1
6 2013-11-25 Orange DF1 8.6 Orange
I need to compare rows within the dataframe and see which columns have mismatched data. Only rows that have the same "Date" and "Fruit" values but different "level_0" values should be compared. So in this dataframe I need to compare the rows with index 1 and 2, since they have the same values for "Date" and "Fruit" but different "level_0" values. Since they differ in the "Num" column, a label (say "NM") should be suffixed to the value in both rows. Rows with only one occurrence of a "Date" and "Fruit" combination need a label (say "Miss") suffixed to the value in the "Fruit" column.
Example of expected output below:
1.) Is it possible to get such an output?
2.) Is there a fast way to get it, as my actual dataset contains millions of rows and 20-25 columns?
This is pretty complex, since there are a lot of different filters you want to apply. If I understand you correctly, you want:
rows that have the same "Date" and "Fruit" values, and
of those rows, those that have different "level_0" values, and
of those rows, those that have different "Num" values, to get -NM. From your example, you want to do the same with the "Color" column.
Rows that are the only occurrence of a "Date" and "Fruit" combination get -Miss.
First, you'll need to make Num a string column, since we are adding suffixes. Then we group by Date and Fruit (1). Then, since you wanted the groups to have different level_0 values, we build a filter for that called diff_frames (2). Then we add the suffixes using transform on both columns if they have two unique elements (3).
df['Num'] = df['Num'].astype(str)
g = df.groupby(['Date', 'Fruit'])
diff_frames = g['level_0'].transform(lambda s: s.nunique() == 2)
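# Assign through .loc so that rows outside diff_frames keep their values;
# assigning to df[['Num', 'Color']] directly aligns on the index and fills them with NaN
df.loc[diff_frames, ['Num', 'Color']] = df[diff_frames].groupby(['Date', 'Fruit'])[['Num', 'Color']].transform(
lambda s: s+'-NM' if s.nunique() == 2 else s)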
Then, for the second part, we get the non-duplicated rows in Date and Fruit, and add -Miss to the Fruit column. (4)
df.loc[~df.duplicated(subset=['Date', 'Fruit'], keep=False), 'Fruit'] += '-Miss'
print(df)
Date Fruit level_0 Num Color
0 0 Apple-Miss DF2 22.1 Red
1 1 Banana DF1 22.1-NM Yellow
2 1 Banana DF2 122.1-NM Yellow
3 2 Celery-Miss DF1 10.2 Green
4 3 Orange DF1 8.6 Orange-NM
5 3 Orange DF2 8.6 Orange1-NM
6 4 Orange-Miss DF2 8.6 Orange
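Since the question mentions millions of rows, here is a variant sketch that replaces the lambdas with the built-in 'nunique' transform and handles one column at a time. It is not the code above, only one possible restatement of it; the names keys, paired, differs and single are illustrative, and I have not benchmarked it.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2013-11-25', '2013-11-24', '2013-11-24', '2013-11-23',
             '2013-11-24', '2013-11-24', '2013-11-25'],
    'Fruit': ['Apple', 'Banana', 'Banana', 'Celery', 'Orange', 'Orange', 'Orange'],
    'level_0': ['DF2', 'DF1', 'DF2', 'DF1', 'DF1', 'DF2', 'DF1'],
    'Num': [22.1, 22.1, 122.1, 10.2, 8.6, 8.6, 8.6],
    'Color': ['Red', 'Yellow', 'Yellow', 'Green', 'Orange', 'Orange1', 'Orange'],
})

keys = ['Date', 'Fruit']
df['Num'] = df['Num'].astype(str)  # suffixes need string values

# Rows whose (Date, Fruit) group contains two different level_0 values
paired = df.groupby(keys)['level_0'].transform('nunique').eq(2)

for col in ['Num', 'Color']:
    # Rows whose group holds more than one distinct value in this column
    differs = df.groupby(keys)[col].transform('nunique').gt(1)
    df.loc[paired & differs, col] += '-NM'

# (Date, Fruit) combinations that occur only once
single = ~df.duplicated(subset=keys, keep=False)
df.loc[single, 'Fruit'] += '-Miss'

print(df)
```

To check more columns, extend the list in the for loop; each column must hold strings before a suffix can be appended.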
I have a data frame df where some rows are duplicates with respect to a subset of columns:
A B C
1 Blue Green
2 Red Green
3 Red Green
4 Blue Orange
5 Blue Orange
I would like to remove (or replace with a dummy string) values for duplicate rows with respect to B and C, without deleting the whole row, ideally producing:
A B C
1 Blue Green
2 Red Green
3 NaN NaN
4 Blue Orange
5 NaN NaN
As per this thread: Replace duplicate values across columns in Pandas I've tried using pd.Series.duplicated, however I can't get it to work with duplicates in a subset of columns.
I've also played around with:
is_duplicate = df.loc[df.duplicated(subset=['B','C'])]
df = df.where(is_duplicated==True, 999) # 999 intended as a placeholder that I could find-and-replace later on
However this replaces almost every row with 999 in each column - so clearly I'm doing something wrong. I'd appreciate any advice on how to proceed!
df.loc[df.duplicated(subset=['B','C']), ['B','C']] = np.nan seems to work for me.
Edited to include #ALollz and #macaw_9227 correction.
Let me share how I used to approach this kind of challenge in the beginning. There are quicker ways (a one-liner), but for the sake of the answer, let's do it on a more intuitive level first (later, you'll see that you can do it in one line).
So here we go...
df = pd.DataFrame({"B":['Blue','Red','Red','Blue','Blue'],"C":['Green','Green','Green','Orange','Orange']})
which results in
Step 1: identify the duplication:
For this, I'm simply adding another (facilitator) column and asking with True/False if B and C are duplicated.
df['IS_DUPLICATED']= df.duplicated(subset=['B','C'])
Step 2: Identify the indexes of the 'True' IS_DUPLICATED:
dup_index = df[df['IS_DUPLICATED']].index
result: Int64Index([2, 4], dtype='int64')
Step 3: mark them as NaN (only in B and C; np.NaN was removed in NumPy 2.0, so use np.nan):
df.loc[dup_index, ['B','C']] = np.nan
Step 4: remove the IS_DUPLICATED column:
df.drop('IS_DUPLICATED',axis=1, inplace=True)
and the desired result:
I would use mask:
df[['B','C']]=df[['B','C']].mask(df.duplicated(['B','C']))
df
Out[141]:
A B C
0 1 Blue Green
1 2 Red Green
2 3 NaN NaN
3 4 Blue Orange
4 5 NaN NaN
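For comparison, the .loc assignment and the mask approach can be run side by side on the question's data. This is a minimal sketch; the names dup, via_loc and via_mask are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': ['Blue', 'Red', 'Red', 'Blue', 'Blue'],
                   'C': ['Green', 'Green', 'Green', 'Orange', 'Orange']})

# True for every row that repeats an earlier (B, C) combination
dup = df.duplicated(subset=['B', 'C'])

# Variant 1: assign NaN to B and C on the duplicated rows
via_loc = df.copy()
via_loc.loc[dup, ['B', 'C']] = np.nan

# Variant 2: mask replaces values with NaN wherever the condition is True
via_mask = df.copy()
via_mask[['B', 'C']] = via_mask[['B', 'C']].mask(dup)

print(via_loc)
```

Both variants leave column A untouched and keep the first occurrence of each (B, C) pair.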
After using transpose on a dataframe, there is always an extra row left over from the initial dataframe's index. For example:
import pandas as pd
df = pd.DataFrame({'fruit':['apple','banana'],'number':[3,5]})
df
fruit number
0 apple 3
1 banana 5
df.transpose()
0 1
fruit apple banana
number 3 5
Even when I have no index:
df.reset_index(drop = True, inplace = True)
df
fruit number
0 apple 3
1 banana 5
df.transpose()
0 1
fruit apple banana
number 3 5
The problem is that when I save the dataframe to a csv file by:
df.to_csv(f)
this extra row stays at the top and I have to remove it manually every time.
Also this doesn't work:
df.to_csv(f, index = None)
because the old index is no longer considered an index (just another row...).
It also happened when I transposed the other way around: I got an extra column which I could not remove.
Any tips?
I had the same problem, and I solved it by setting the index before doing the transpose, that is, df.set_index('fruit').transpose():
import pandas as pd
df = pd.DataFrame({'fruit':['apple','banana'],'number':[3,5]})
df
fruit number
0 apple 3
1 banana 5
And df.set_index('fruit').transpose() gives:
fruit apple banana
number 3 5
Instead of removing the extra index, why not set the new index that you want and then use slicing?
step 1: Set the new index you want:
df.columns = df.iloc[0]
step 2: Create a new dataframe removing extra row.
df_new = df[1:]
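Put together on the question's data, the two steps might look like the sketch below. It assumes that df in the steps above refers to the already transposed frame; the names t and t_new are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana'], 'number': [3, 5]})
t = df.transpose()  # columns are 0 and 1, first row holds the fruit names

# Step 1: promote the first row to column labels
t.columns = t.iloc[0]

# Step 2: drop that first row by slicing
t_new = t[1:]

print(t_new)
```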