Dataframes in Python - matching multiple columns of rows between two data frames

I have two data frames. df1 holds a 'grouped inventory' of items, grouped by the numerical values A, B and C. For each item there is a SUM column which should reflect the total price of all items I have of that particular type. Initially I have set the SUM column to zero.
df2 is a list of items I have with A, B, C and the price of the item.
df1 (Initial Inventory):
A B C SUM
1 1 1 0
1 1 2 0
1 2 2 0
2 2 2 0
df2 (List of items):
A B C PRICE
2 2 2 30
1 1 2 100
1 1 2 110
1 1 2 105
So my code should convert df1 into:
df1 (expected output):
A B C SUM
1 1 1 0
1 1 2 315
1 2 2 0
2 2 2 30
Explanation: My list of items (df2) contains one item coded as 2,2,2, which has a value of 30, and three items coded as 1,1,2, whose values are 100 + 110 + 105 = 315. So I update the inventory table df1 to reflect that I have a total value of 30 for items coded 2,2,2 and a total value of 315 for items coded 1,1,2. I have 0 in value for items coded 1,1,1 and 1,2,2, since they aren't found in my items list.
What would be the most efficient way to do this?
I would rather not use loops since df1 is 720 rows and df2 is 10,000 rows.

You can try to merge on columns "A", "B" and "C" with how="left" (the key combinations in df2_sum below are a subset of df1's, so we choose a left join here):
df2_sum = df2.groupby(["A", "B", "C"])["PRICE"].sum().reset_index()
df1.merge(df2_sum, on=["A","B","C"], how="left").fillna(0)
A B C SUM PRICE
0 1 1 1 0 0.0
1 1 1 2 0 315.0
2 1 2 2 0 0.0
3 2 2 2 0 30.0
You can then add PRICE to your SUM column.
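Putting both steps together, a minimal sketch of the full pipeline (assuming df1 and df2 are as shown above, with SUM initialized to zero):

import pandas as pd

# Aggregate df2 so there is one row per (A, B, C) combination.
df2_sum = df2.groupby(["A", "B", "C"])["PRICE"].sum().reset_index()

# Left-merge onto the inventory; combinations absent from df2 get 0.
merged = df1.merge(df2_sum, on=["A", "B", "C"], how="left").fillna(0)

# Fold the merged totals into SUM and drop the helper column.
merged["SUM"] = merged["SUM"] + merged["PRICE"]
df1 = merged.drop(columns="PRICE")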

Related

Python Pandas Dataframe: Divide values in two rows based on column values

I have a pandas dataframe:
A B C   D
1 1 0  32
    1   4
  2 0  43
    1  12
  3 0  58
    1  34
2 1 0  37
    1   5
[..]
where A, B and C are index columns. What I want to compute, for every group of rows with unique values of A and B, is: D WHERE C=1 / D WHERE C=0.
The result should look like this:
A B  NEW
1 1  4/32
  2  12/43
  3  34/58
2 1  5/37
[..]
Can you help me?
Use Series.unstack first, so it is possible to divide columns 0 and 1:
# Move index level C into columns 0 and 1, then divide C=1 by C=0.
new = df['D'].unstack()
new = new[1].div(new[0]).to_frame('NEW')
print(new)
          NEW
A B
1 1  0.125000
  2  0.279070
  3  0.586207
2 1  0.135135
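For reference, a self-contained sketch that reproduces this (the constructor is an assumption; the question only shows the printed frame, with A, B and C as index levels):

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 1, 1, 1, 2, 2],
                   "B": [1, 1, 2, 2, 3, 3, 1, 1],
                   "C": [0, 1, 0, 1, 0, 1, 0, 1],
                   "D": [32, 4, 43, 12, 58, 34, 37, 5]}).set_index(["A", "B", "C"])

# Unstack C into columns 0 and 1, then divide C=1 by C=0.
new = df["D"].unstack()
new = new[1].div(new[0]).to_frame("NEW")
print(new)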

Compare all column values to another one with Pandas

I am having trouble with Pandas.
I am trying to compare each value in a row to another one.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock's variation to the variation of the column labelled 'CAC 40'.
If the value is greater I want to turn it into a Boolean 1, or 0 if lower.
This should return a dataframe filled only with 1s and 0s so I can then summarize by columns.
I have tried the apply method but this doesn't work.
It returns a pandas Series (attached below):
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0

data2 = data.apply(compare_to_cac, axis=1)
Please can someone help me out?
I worked with this data (column names are not important here, only the CAC 40 one is):
A B CAC 40
0 0 2 9
1 1 3 9
2 2 4 1
3 3 5 2
4 4 7 2
With just a for loop:

import numpy as np

for column in df.columns:
    if column == "CAC 40":
        continue
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    df[column] = np.select(condition, value, default=0)
Which gives me as a result:
A B CAC 40
0 0 0 9
1 0 0 9
2 1 1 1
3 1 1 2
4 1 1 2

Filter rows with more than one value in a set and count their occurrences in pandas

Let's assume, I have the following data frame.
Id Combinations
1 (A,B)
2 (C,)
3 (A,D)
4 (D,E,F)
5 (F)
I would like to filter out Combinations column values with more than one value in a set, something like below, and I would like to count the number of occurrences across the whole Combinations column. For example, Id numbers 2 and 5 should be removed since their sets contain only one value.
The result I am looking for is:
ID Combination Frequency
1 A 2
1 B 1
3 A 2
3 D 2
4 D 2
4 E 1
4 F 2
Can anyone help to get the above result in Python pandas?
First, if necessary, convert the values to lists:
df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')
If you need the count after filtering out the single-value rows (via Series.str.len in boolean indexing), use DataFrame.explode and count values by Series.map with Series.value_counts:
df1 = df[df['Combinations'].str.len().gt(1)].explode('Combinations')
df1['Frequency'] = df1['Combinations'].map(df1['Combinations'].value_counts())
print(df1)
Id Combinations Frequency
0 1 A 2
0 1 B 1
2 3 A 2
2 3 D 2
3 4 D 2
3 4 E 1
3 4 F 1
Or, if you need the count before removing rows, filter them by Series.duplicated in the last step:
df2 = df.explode('Combinations')
df2['Frequency'] = df2['Combinations'].map(df2['Combinations'].value_counts())
df2 = df2[df2['Id'].duplicated(keep=False)]
Alternative:
df2 = df2[df2.groupby('Id').Id.transform('size') > 1]
Or:
df2 = df2[df2['Id'].map(df2['Id'].value_counts()) > 1]
print(df2)
Id Combinations Frequency
0 1 A 2
0 1 B 1
2 3 A 2
2 3 D 2
3 4 D 2
3 4 E 1
3 4 F 2
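As a side note, the Frequency column in either variant can also be built with a groupby transform instead of map with value_counts; an equivalent sketch:

# Count how often each exploded combination value occurs.
df2["Frequency"] = df2.groupby("Combinations")["Combinations"].transform("size")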

Deleting rows with values higher than the minimum value of all rows with the same id using Pandas

So, I have the following dataframe:
id value
0 a 1
1 a 1
2 a 2
3 b 3
4 b 3
I want to delete every row whose value is higher than the minimum value of the rows with the same id. For example, for rows with id 'a' the minimum value is 1, so the row with value 2 would be deleted; for id 'b' the minimum value is 3, so no 'b' rows would be deleted.
Output:
id value
0 a 1
1 a 1
2 b 3
3 b 3
So far, I've only grouped the rows with the same id and found their lowest values, but couldn't find a way to delete all the expected rows.
I've used the following command:
min_values = df.loc[df.groupby(['id'])['value'].idxmin()]['value']
Use transform (idxmin only returns the first index of the min value; since you have duplicates here it would not return all the indices):
df[df.value==df.groupby('id').value.transform('min')]
Out[257]:
id value
0 a 1
1 a 1
3 b 3
4 b 3
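An equivalent sketch with map, if you prefer to build the per-id minimums explicitly (reusing the asker's min_values idea):

# Map each id to its minimum value, then keep only the matching rows.
min_values = df.groupby("id")["value"].min()
df[df["value"] == df["id"].map(min_values)]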

Adding non-existing combinations

I want to make a table with all available products for every customer. However, I only have a table with a product-customer combination if it was bought. I want to make a new table that also includes the products that were not bought by a customer. The current table looks as follows:
The table I want to end up with is:
Could anyone help me how to do this in pandas?
One way to do this is to use pd.MultiIndex and reindex:
import pandas as pd

df = pd.DataFrame({'Product': list('ABCDEF'),
                   'Customer': [1, 1, 2, 3, 3, 3],
                   'Amount': [4, 5, 3, 1, 1, 2]})

# Build every (Product, Customer) pair, then reindex onto it.
indx = pd.MultiIndex.from_product([df['Product'].unique(),
                                   df['Customer'].unique()],
                                  names=['Product', 'Customer'])

df.set_index(['Product', 'Customer'])\
  .reindex(indx, fill_value=0)\
  .reset_index()\
  .sort_values(['Customer', 'Product'])
Output:
Product Customer Amount
0 A 1 4
3 B 1 5
6 C 1 0
9 D 1 0
12 E 1 0
15 F 1 0
1 A 2 0
4 B 2 0
7 C 2 3
10 D 2 0
13 E 2 0
16 F 2 0
2 A 3 0
5 B 3 0
8 C 3 0
11 D 3 1
14 E 3 1
17 F 3 2
You can also create a pivot to do what you want in one line. Note that the output format is different: DataFrame.pivot produces a wide table with customers as columns, rather than the long format above. But if you're not especially fussed about that (it depends on how you intend to use the final table), the following code does the job.
import pandas as pd

df = pd.DataFrame({'Product': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'Customer': [1, 1, 2, 3, 3, 3],
                   'Amount': [4, 5, 3, 1, 1, 2]})

# Pivot to wide format; missing pairs become NaN, filled with 0 below.
pivot_df = df.pivot(index='Product',
                    columns='Customer',
                    values='Amount').fillna(0).astype('int')
Output:
Customer 1 2 3
Product
A 4 0 0
B 5 0 0
C 0 3 0
D 0 0 1
E 0 0 1
F 0 0 2
df.pivot creates NaN values when there are no corresponding entries in the original df (it creates a NaN value for Product A and Customer 2, for instance). NaNs are float values, so all the 'Amounts' in the pivot are implicitly converted into floats. This is why I use fillna(0) to convert the NaN values into 0s, and then finally change the dtype back to int.
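If you later need the long format back, a short sketch using stack on the pivot defined above:

# Collapse the wide pivot back into one row per (Product, Customer) pair.
long_df = pivot_df.stack().reset_index(name="Amount")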
