I want to subset a DataFrame by two columns in different dataframes if the values in the columns are the same. Here is an example of df1 and df2:
df1
A
0 apple
1 pear
2 orange
3 apple
df2
B
0 apple
1 orange
2 orange
3 pear
I would like the output to be a subsetted df1 based upon the df2 column:
A
0 apple
2 orange
I tried
df1 = df1[df1.A == df2.B] but get the following error:
ValueError: Can only compare identically-labeled Series objects
I do not want to rename the column in either.
What is the best way to do this? Thanks
If you need to compare the index values together with both columns, create a MultiIndex from each DataFrame and use Index.isin:
df = df1[df1.set_index('A', append=True).index.isin(df2.set_index('B', append=True).index)]
print (df)
A
0 apple
2 orange
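For reference, here is a minimal self-contained sketch of the approach above, using the sample data from the question. The positional variant at the end is my own assumption and not part of the original answer: it only applies if both frames have the same length and you want a row-by-row comparison that ignores index labels.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["apple", "pear", "orange", "apple"]})
df2 = pd.DataFrame({"B": ["apple", "orange", "orange", "pear"]})

# Pair each value with its index label, then test membership of the pairs
idx1 = df1.set_index("A", append=True).index
idx2 = df2.set_index("B", append=True).index
by_label = df1[idx1.isin(idx2)]

# Assumed alternative: compare row by row on the raw arrays (same length required)
by_position = df1[df1["A"].to_numpy() == df2["B"].to_numpy()]

print(by_label)
```

Neither version renames a column, and the label-based one also works when the two frames have different lengths.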
I'd like to group by a specific column within a data frame called 'Fruit' and calculate the percentage of that particular fruit that are 'Good'
See below for my initial dataframe
import pandas as pd
df = pd.DataFrame({'Fruit': ['Apple','Apple','Banana'], 'Condition': ['Good','Bad','Good']})
Dataframe
Fruit Condition
0 Apple Good
1 Apple Bad
2 Banana Good
See below for my desired output data frame
Fruit Percentage
0 Apple 50%
1 Banana 100%
Note: Because there is 1 "Good" Apple and 1 "Bad" Apple, the percentage of Good Apples is 50%.
See below for my attempt which is overwriting all the columns
groupedDF = df.groupby('Fruit')
groupedDF.apply(lambda x: x[(x['Condition'] == 'Good')].count()/x.count())
See below for resulting table, which seems to calculate percentage but within existing columns instead of new column:
Fruit Condition
Fruit
Apple 0.5 0.5
Banana 1.0 1.0
We can compare Condition with eq and take advantage of the fact that True counts as 1 and False as 0 when processed as numbers, so the mean of the comparison within each Fruit group is the share of 'Good' rows:
new_df = (
df['Condition'].eq('Good').groupby(df['Fruit']).mean().reset_index()
)
new_df:
Fruit Condition
0 Apple 0.5
1 Banana 1.0
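To see why the mean works here, a small sketch on hand-written booleans may help. The names is_good, fruit and share are only illustrative and do not appear in the answer.

```python
import pandas as pd

# Booleans behave like 1 and 0 in arithmetic, so their mean is a proportion
is_good = pd.Series([True, False, True], name="is_good")
fruit = pd.Series(["Apple", "Apple", "Banana"], name="Fruit")

total_good = is_good.sum()               # counts the True values
share = is_good.groupby(fruit).mean()    # fraction of True values per fruit

print(share)
```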
We can further map to a format string and rename to get output into the shown desired output:
new_df = (
df['Condition'].eq('Good')
.groupby(df['Fruit']).mean()
.map('{:.0%}'.format) # Change to Percent Format
.rename('Percentage') # Rename Column to Percentage
.reset_index() # Restore RangeIndex and make Fruit a Column
)
new_df:
Fruit Percentage
0 Apple 50%
1 Banana 100%
*Naturally further manipulations can be done as well.
I have a dataframe as below.
Date Fruit level_0 Num Color
0 2013-11-25 Apple DF2 22.1 Red
1 2013-11-24 Banana DF1 22.1 Yellow
2 2013-11-24 Banana DF2 122.1 Yellow
3 2013-11-23 Celery DF1 10.2 Green
4 2013-11-24 Orange DF1 8.6 Orange
5 2013-11-24 Orange DF2 8.6 Orange1
6 2013-11-25 Orange DF1 8.6 Orange
I need to find and compare the rows within the dataframe and see which columns have a data mismatch. The rows selected for comparison should be only those that have the same "Date" and "Fruit" values but different "level_0" values. So in this dataframe I need to compare the rows with index 1 and 2, since they have the same values for "Date" and "Fruit" but different "level_0" values. Since these rows differ in the "Num" column, we need to suffix a label (say "NM") to the value in both rows. Rows whose "Date" and "Fruit" combination occurs only once need a label (say "Miss") suffixed to the value in the "Fruit" column.
Example of expected output below:
1.) Is it possible to get such an output?
2.) Is there a fast way to get it, as my actual dataset contains millions of rows and 20-25 columns?
This is pretty complex, since there are a lot of different filters you want to apply. If I understand you correctly, you want
rows that have the same "Date" and "Fruit" values, and
of those rows, those that have different "level_0" values, and
of those rows, those that have different "Num" values to get -NM. From your example, you want to do the same with the "Color" column.
Rows that are the only occurrence of a "Date" and "Fruit" combination get -Miss.
First, you'll need to make Num a string column, since we are adding suffixes. Then we group by Date and Fruit (1). Then, since you wanted the groups to have different level_0 values, we build a boolean filter for that, called diff_frames (2). Then we add the suffixes using transform on both columns wherever a group has two unique values in that column (3).
df['Num'] = df['Num'].astype(str)
g = df.groupby(['Date', 'Fruit'])
diff_frames = g['level_0'].transform(lambda s: s.nunique() == 2)
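# Assign through .loc so rows outside diff_frames keep their values instead of becoming NaN
df.loc[diff_frames, ['Num', 'Color']] = df[diff_frames].groupby(['Date', 'Fruit'])[['Num', 'Color']].transform(
    lambda s: s + '-NM' if s.nunique() == 2 else s)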
Then, for the second part, we get the non-duplicated rows in Date and Fruit, and add -Miss to the Fruit column. (4)
df.loc[~df.duplicated(subset=['Date', 'Fruit'], keep=False), 'Fruit'] += '-Miss'
print(df)
Date Fruit level_0 Num Color
0 0 Apple-Miss DF2 22.1 Red
1 1 Banana DF1 22.1-NM Yellow
2 1 Banana DF2 122.1-NM Yellow
3 2 Celery-Miss DF1 10.2 Green
4 3 Orange DF1 8.6 Orange-NM
5 3 Orange DF2 8.6 Orange1-NM
6 4 Orange-Miss DF2 8.6 Orange
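As a cross-check, the same steps can be sketched end to end on the sample data from the question. This is a variant under my own assumptions rather than the answer's exact code: it uses the built-in "nunique" transform instead of a lambda, loops over the two columns, and the names keys, two_sources and differs are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2013-11-25", "2013-11-24", "2013-11-24", "2013-11-23",
             "2013-11-24", "2013-11-24", "2013-11-25"],
    "Fruit": ["Apple", "Banana", "Banana", "Celery", "Orange", "Orange", "Orange"],
    "level_0": ["DF2", "DF1", "DF2", "DF1", "DF1", "DF2", "DF1"],
    "Num": [22.1, 22.1, 122.1, 10.2, 8.6, 8.6, 8.6],
    "Color": ["Red", "Yellow", "Yellow", "Green", "Orange", "Orange1", "Orange"],
})

df["Num"] = df["Num"].astype(str)   # suffixes need string values
keys = ["Date", "Fruit"]

# True for rows whose (Date, Fruit) group contains two different level_0 values
two_sources = df.groupby(keys)["level_0"].transform("nunique").eq(2)

for col in ["Num", "Color"]:
    # Within those groups, mark rows where this column has two distinct values
    differs = two_sources & df.groupby(keys)[col].transform("nunique").eq(2)
    df.loc[differs, col] += "-NM"

# Rows whose (Date, Fruit) combination occurs only once get the -Miss label
df.loc[~df.duplicated(subset=keys, keep=False), "Fruit"] += "-Miss"

print(df)
```

Both transforms are vectorised per group, which should matter on millions of rows, though I have not benchmarked it.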
I have a Pandas dataframe with two columns. One is an identifier and the second is the name of the product attached to that identifier. The identifier repeats, once for each of its products. I want to spread the single column of product names across several columns, so that each identifier appears only once. Maybe I need to aggregate the product names by identifier.
My dataframe looks like:
ID Product_Name
100 Apple
100 Banana
200 Cherries
200 Apricots
200 Apple
300 Avocados
I want to have dataframe like this:
ID
100 Apple Banana
200 Cherries Apricots Apple
300 Avocados
Each product along each identifier has to be in separate column
I tried pd.melt, pd.pivot and pd.pivot_table, but I only get errors, and the error message says "No numeric types to aggregate".
Any idea how to do this?
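Use cumcount to number the products within each ID, add that counter as a second index level with set_index (creating a MultiIndex), and reshape with unstack so the counter becomes the new column names: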
df = df.set_index(['ID',df.groupby('ID').cumcount()])['Product_Name'].unstack()
Or create a Series of lists and build a new DataFrame with the constructor:
s = df.groupby('ID')['Product_Name'].apply(list)
df = pd.DataFrame(s.values.tolist(), index=s.index)
print (df)
0 1 2
ID
100 Apple Banana NaN
200 Cherries Apricots Apple
300 Avocados NaN NaN
But if you want a two-column DataFrame instead:
df1 = df.groupby('ID')['Product_Name'].apply(' '.join).reset_index(name='new')
print (df1)
ID new
0 100 Apple Banana
1 200 Cherries Apricots Apple
2 300 Avocados
Use the pivot function; pivoting can do what is required.
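The answer above does not show how, so here is one possible sketch. Note that pivot needs a column to spread the values over, and the sample data has none, so I am assuming a helper counter built with cumcount; the column name "n" is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [100, 100, 200, 200, 200, 300],
    "Product_Name": ["Apple", "Banana", "Cherries", "Apricots", "Apple", "Avocados"],
})

# Number the products within each ID: 0, 1, 2, ...
numbered = df.assign(n=df.groupby("ID").cumcount())

# pivot only reshapes and never aggregates, so string values are fine here
wide = numbered.pivot(index="ID", columns="n", values="Product_Name")

print(wide)
```

Unlike pivot_table, pivot does not apply an aggregation function, so it avoids the "No numeric types to aggregate" error mentioned in the question.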
I am attempting to perform a partial string match between columns in data frames for example:
df_A:
Items_A
purse
string
hat
glue
gum
cherry
cherry
cherry pie
and
df_B:
1 2 3
string gum cherry
glue
desired output:
df_matched:
matched Items_A
0 purse
1 string
0 hat
1 glue
2 gum
3 cherry
3 cherry
3 cherry pie
Note that numbers in the matched columns are the labels from the column that is matched, either 1, 2, or 3. If there is no match, then the label is 0.
I was able to do this with regular expression matching and several nested loops, but I was wondering if there is a way to use the pandas library to perform the operation more efficiently.
Reshape df_B to get this:
level_0 level_1 0
0 0 1 string
1 0 2 gum
2 0 3 cherry
3 1 1 glue
Rename the df_B columns.
Get the list of unique words in df_B.
Create a new column in df_A that holds the matching word from df_B.
Merge and filter.
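import re  # standard library; the original imported the third-party "regex" package

# Reshape df_B to long form: one row per (row, column label, word)
df_B = df_B.stack().reset_index()
df_B = df_B.rename(columns={"level_1": "matched", 0: "Items_A"})
items = df_B.Items_A.unique()

def partial_match(x, items):
    # Return the first word from df_B found anywhere in x, or 0 if none is found
    for item in items:
        if re.search(re.escape(item), x):
            return item
    return 0

df_A["matching_item"] = df_A["Items_A"].apply(lambda x: partial_match(x, items))
df_A = df_A.merge(df_B, how="left", left_on="matching_item", right_on="Items_A", suffixes=('', '_y'))
# Unmatched rows have no label after the left merge; use 0 for them, as requested
df_A = df_A.loc[:, ["Items_A", "matched"]].fillna({"matched": 0})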
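For comparison, here is a self-contained sketch on the sample data that swaps in a few things of my own: melt plus dropna instead of stack, a plain substring test instead of a regular expression, and a dictionary lookup instead of a merge. The names long_B, label_of and first_label are hypothetical, and I am assuming the empty cells of df_B are missing values.

```python
import pandas as pd

df_A = pd.DataFrame({"Items_A": ["purse", "string", "hat", "glue", "gum",
                                 "cherry", "cherry", "cherry pie"]})
df_B = pd.DataFrame({1: ["string", "glue"], 2: ["gum", None], 3: ["cherry", None]})

# Long form: one row per (column label, word), with the empty cells dropped
long_B = df_B.melt(var_name="matched", value_name="item").dropna(subset=["item"])
label_of = {item: int(label) for item, label in zip(long_B["item"], long_B["matched"])}

def first_label(text):
    # Label of the first df_B word contained in text, or 0 when nothing matches
    for item, label in label_of.items():
        if item in text:
            return label
    return 0

df_matched = df_A.assign(matched=df_A["Items_A"].map(first_label))[["matched", "Items_A"]]
print(df_matched)
```

This still loops over the words for every row, so it is tidier than nested loops rather than fundamentally faster.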
After using transpose on a dataframe there is always an extra row as a remainder from the initial dataframe's index for example:
import pandas as pd
df = pd.DataFrame({'fruit':['apple','banana'],'number':[3,5]})
df
fruit number
0 apple 3
1 banana 5
df.transpose()
0 1
fruit apple banana
number 3 5
Even when i have no index:
df.reset_index(drop = True, inplace = True)
df
fruit number
0 apple 3
1 banana 5
df.transpose()
0 1
fruit apple banana
number 3 5
The problem is that when I save the dataframe to a csv file by:
df.to_csv(f)
this extra row stays at the top and I have to remove it manually every time.
Also this doesn't work:
df.to_csv(f, index = None)
because the old index is no longer considered an index (just another row...).
It also happened when I transposed the other way around and I got an extra column which i could not remove.
Any tips?
I had the same problem. I solved it by setting the index to the 'fruit' column before doing the transpose, that is, df.set_index('fruit').transpose():
import pandas as pd
df = pd.DataFrame({'fruit':['apple','banana'],'number':[3,5]})
df
fruit number
0 apple 3
1 banana 5
And df.set_index('fruit').transpose() gives:
fruit apple banana
number 3 5
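To connect this back to the CSV file from the question, here is a small sketch of what gets written. The second option is my own addition, not part of the answer: the extra row is the header of the transposed frame, so passing header=False to to_csv skips it.

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "banana"], "number": [3, 5]})

# Option 1: make 'fruit' the index first, so the fruit names become the header row
csv_named = df.set_index("fruit").transpose().to_csv()

# Option 2 (assumed alternative): transpose as-is and skip the 0, 1 header row
csv_no_header = df.transpose().to_csv(header=False)

print(csv_named)
print(csv_no_header)
```

Calling to_csv without a path returns the CSV text as a string, which makes it easy to inspect before writing the file.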
Instead of removing the extra index, why not set the labels you want and then use slicing? (This assumes df here is the already transposed DataFrame.)
Step 1: Use the first row as the new column labels:
df.columns = df.iloc[0]
Step 2: Create a new DataFrame without that extra row:
df_new = df[1:]
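A minimal sketch of these two steps on the sample data, assuming they are applied to the transposed frame; the names t and t_new are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "banana"], "number": [3, 5]})

t = df.transpose()

# Step 1: promote the first row (the fruit names) to column labels
t.columns = t.iloc[0]

# Step 2: drop that first row, which is now redundant
t_new = t.iloc[1:]

print(t_new)
```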