How can we multiply specific columns with other specific columns? - python

I have a df like this:

Web  R_Val  B_Cost  R_Total  B_Total
A    20     2       20       1
B    30     3       10       2
C    40     1       30       1

I would like to multiply the columns starting with R_ together and the columns starting with B_ together; in the real data there are many more such columns. This is just dummy data. What would be the best solution to achieve this?
Web  R_Val  B_Cost  R_Total  B_Total  R_New  B_New
A    20     2       20       1        400    2
B    30     3       10       2        300    6
C    40     1       30       1        1200   1

Check the answer I just posted on your other question:
How to multiply specific column from dataframe with one specific column in same dataframe?
import pandas as pd

dfr = pd.DataFrame({
    'Brand': ['A', 'B', 'C', 'D', 'E', 'F'],
    'price': [10, 20, 30, 40, 50, 10],
    'S_Value': [2, 4, 2, 1, 1, 1],
    'S_Factor': [2, 1, 1, 2, 1, 1]
})

pre_fixes = ['S_']
for prefix in pre_fixes:
    # collect every column that starts with this prefix
    coltocal = [col for col in dfr.columns if col.startswith(prefix)]
    for col in coltocal:
        # multiply each matching column by price into a new column
        dfr.loc[:, col + '_new'] = dfr.price * dfr[col]
dfr
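Adapted to the data in this question, a minimal sketch (it assumes the goal is the row-wise product of all columns sharing a prefix):

import pandas as pd

df = pd.DataFrame({
    'Web': ['A', 'B', 'C'],
    'R_Val': [20, 30, 40],
    'B_Cost': [2, 3, 1],
    'R_Total': [20, 10, 30],
    'B_Total': [1, 2, 1],
})

for prefix in ['R_', 'B_']:
    # row-wise product of every column that shares the prefix
    cols = [c for c in df.columns if c.startswith(prefix)]
    df[prefix + 'New'] = df[cols].prod(axis=1)

print(df)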

divide a number by all values of column in pandas

df1 has one column (total) with 2 values, 5000 and 1000, with ids A and B respectively. df2 has one column (marks) with 10 values, where the first 5 values (100, 200, 300, 400, 500) have id A and the next 5 (10, 20, 30, 40, 50) have id B.
Now I have to get the expected output:

id  final_value
A   50
A   25
A   16.6
A   12.5
A   10
B   100
B   50
B   33.3
B   25
B   20
My code is:
new_df = df1['total'] / df2['marks']
But I got this output:
A     50
B    100
with NaN for all the remaining rows.
pandas divides the two columns (Series) element by element, aligning them on the index.
If you want to divide using 'id' as the link, you have to merge your dataframes first:
df1 = pd.DataFrame([[5000, 'A'], [1000, 'B']], columns=['test', 'id'])
df2 = pd.DataFrame([[100, 'A'], [200, 'A'], [300, 'A'], [400, 'A'], [500, 'A'], [10, 'B'], [20, 'B'], [30, 'B'], [40, 'B'], [50, 'B']],columns=['marks', 'id'])
df3 = df1.merge(df2, on='id')
df3['test']/df3['marks']
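To get the asker's exact output frame from there, a short follow-up sketch (reusing the df3 built above; the one-decimal rounding mirrors the other answers):

new_df = pd.DataFrame({'id': df3['id'],
                       'final_value': (df3['test'] / df3['marks']).round(1)})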
Setup:
df1 = pd.DataFrame({'total': [5000, 1000]}, index = ['A', 'B'])
df2 = pd.DataFrame({'marks': [100, 200, 300, 400, 500, 10, 20, 30, 40, 50]}, index = ['A','A','A','A','A','B','B','B','B','B'])
Interestingly enough this works:
df2['total'] = df1['total']
df2['final_value'] = df2['total'] / df2['marks']
This works because pandas aligns the assignment on the index, so every 'A' row in df2 picks up df1's 'A' total. Then you can copy the answer to a new df and drop the helper columns, if you want it as you stated:
new_df = df2[['final_value']]
df2 = df2.drop(['total', 'final_value'], axis = 1)
Assuming your data looks like this:
df1 = pd.DataFrame(dict(id=['A', 'B'], total=[5000,1000]))
df2 = pd.DataFrame(dict(id=['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                        vals=[100, 200, 300, 400, 500, 10, 20, 30, 40, 50]))
you can get the new column you're interested in by first merging the two dataframes on the id column and then applying a lambda function that divides the total (from df1) by the vals (from df2). Specifically:
df2['final_result'] = df2.merge(df1, on='id').apply(lambda x: round(x.total/x.vals, 1), axis=1)
And if you only want the id and final_result columns, you can just select those:
df2[['id', 'final_result']]
Your data should now look like you expected:
id final_result
0 A 50.0
1 A 25.0
2 A 16.7
3 A 12.5
4 A 10.0
5 B 100.0
6 B 50.0
7 B 33.3
8 B 25.0
9 B 20.0
Note that in the lambda function I also applied some rounding to get just 1 decimal as you indicated.
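As a design note, the row-wise apply can be replaced by a plain vectorized division on the merged frame, which is usually faster (a sketch using the same names as above):

merged = df2.merge(df1, on='id')
df2['final_result'] = (merged['total'] / merged['vals']).round(1)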
Try:
>>> df1.set_index("id").rename(columns={"total": "marks"}).div(df2.set_index("id")).round(1).reset_index()
id marks
0 A 50.0
1 A 25.0
2 A 16.7
3 A 12.5
4 A 10.0
5 B 100.0
6 B 50.0
7 B 33.3
8 B 25.0
9 B 20.0
It leverages the fact that for any arithmetic operation between two DataFrames, pandas automatically aligns the two operands by index and by columns (so the operation runs index x against index x and column a against column a).
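A minimal illustration of that alignment behavior, on hypothetical toy frames:

import pandas as pd

x = pd.DataFrame({'a': [10, 20]}, index=['A', 'B'])
y = pd.DataFrame({'a': [2, 5], 'b': [1, 1]}, index=['B', 'A'])

# 'a' is divided label by label; 'b' has no partner in x, so it becomes NaN
print(x / y)
#       a   b
# A   2.0 NaN
# B  10.0 NaN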

Filter Columns from Pandas Dataframe with given list when list elements may or may not be present as columns

I have a huge dataframe and I need to select only the columns of the dataframe that are present in a given list.
For example,
df = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]], columns=list('ABCDE'))
This is the dataframe.
A B C D E
0 1 2 3 4 5
1 6 7 8 9 10
I have a list.
fil_lst = ['A', 'D', 'F']
The list may contain column names that are not present in the dataframe. I need only the columns that are present in the dataframe.
I need the resulting dataframe like,
A D
0 1 4
1 6 9
I know it can be done with the help of list comprehension like,
new_df = df[[col for col in fil_lst if col in df.columns]]
But as I have a huge dataframe, it is better if I don't use this computationally expensive process.
Is it possible to vectorize this in any way?
Use Index.isin to test membership in the columns, and DataFrame.loc to filter by columns: the : means select all rows, and the mask selects the columns:
fil_lst = ['A', 'D', 'F']
df = df.loc[:, df.columns.isin(fil_lst)]
print(df)
A D
0 1 4
1 6 9
Or use Index.intersection:
fil_lst = ['A', 'D', 'F']
df = df[df.columns.intersection(fil_lst)]
print(df)
A D
0 1 4
1 6 9
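Not mentioned above, but worth knowing: DataFrame.filter with items= also silently ignores labels that are missing from the frame, so the same result is a one-liner:

# keeps only the labels from fil_lst that actually exist as columns
df.filter(items=fil_lst)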
If you are dealing with large lists, and the focus is on performance more than order of columns, you can use set intersection:
In [2944]: fil_lst = ['A', 'D', 'F']
In [2945]: col_list = df.columns.tolist()
In [2947]: df = df[list(set(col_list) & set(fil_lst))]
In [2947]: df
Out[2947]:
D A
0 4 1
1 9 6
EDIT: If order of columns is important, then do this:
In [2953]: df = df[sorted(set(col_list) & set(fil_lst), key = col_list.index)]
In [2953]: df
Out[2953]:
A D
0 1 4
1 6 9
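If performance is the real concern, it's easy to benchmark the variants yourself rather than guess (a sketch; the frame and list sizes here are made up, and timings will vary with your data):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(2, 10000),
                  columns=[f'c{i}' for i in range(10000)])
fil_lst = [f'c{i}' for i in range(0, 20000, 2)]  # half of these are missing

for stmt in ("df[[c for c in fil_lst if c in df.columns]]",
             "df.loc[:, df.columns.isin(fil_lst)]",
             "df[df.columns.intersection(fil_lst)]"):
    print(stmt, timeit.timeit(stmt, globals=globals(), number=10))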

loop through a single column in one dataframe, compare to a column in another dataframe, and create a new column in the first dataframe using pandas

right now I have two dataframes they look like:
c = pd.DataFrame({'my_goal': [3, 4, 5, 6, 7],
                  'low_number': [0, 100, 1000, 2000, 3000],
                  'high_number': [100, 1000, 2000, 3000, 4000]})
and
a = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
                  'Number': [50, 500, 1030, 2005, 3575]})
What I want to do is: if 'Number' falls between the low number and the high number, bring back the value in 'my_goal'. For example, if we look at 'a', its Number is 50, which falls between 0 and 100, so I want it to bring back 3. I also want to create a dataframe that contains all the columns from dataframe a plus the 'my_goal' column from dataframe c. I want the output to look like:

a  Number  my_goal
a  50      3
b  500     4
c  1030    5
d  2005    6
e  3575    7
I tried making my high and low numbers into a separate list and running a for loop over that, but all it gives me are 'my_goal' numbers:

low_number = [0, 100, 1000, 2000, 3000]
for i in a:
    if float(i) >= low_number:
        a = c['my_goal']
print(a)
You can use pd.cut; when I see ranges, I first think of pd.cut:

dfa = pd.DataFrame(a)
dfc = pd.DataFrame(c)
dfa['my_goal'] = pd.cut(dfa['Number'],
                        bins=[0] + dfc['high_number'].tolist(),
                        labels=dfc['my_goal'])
Output:
a Number my_goal
0 a 50 3
1 b 500 4
2 c 1030 5
3 d 2005 6
4 e 3575 7
I changed one row slightly (Number 2005 → 1995) to include a test case where the condition is not met. You can concat a with the rows of c where the condition is true.
a= pd.DataFrame({'a':['a', 'b', 'c', 'd', 'e'],'Number':[50, 500, 1030, 1995 , 3575]})
cond = a.Number.between(c.low_number, c.high_number)
pd.concat([a, c.loc[cond, ['my_goal']]], axis=1, join='inner')
Number a my_goal
0 50 a 3
1 500 b 4
2 1030 c 5
4 3575 e 7
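Another option is an IntervalIndex lookup (a sketch; it assumes every Number falls inside exactly one (low, high] range, since unmatched values would need extra handling):

import pandas as pd

c = pd.DataFrame({'my_goal': [3, 4, 5, 6, 7],
                  'low_number': [0, 100, 1000, 2000, 3000],
                  'high_number': [100, 1000, 2000, 3000, 4000]})
a = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
                  'Number': [50, 500, 1030, 2005, 3575]})

# intervals are closed on the right by default: (low, high]
intervals = pd.IntervalIndex.from_arrays(c['low_number'], c['high_number'])
# get_indexer returns the position of the interval containing each Number
a['my_goal'] = c['my_goal'].to_numpy()[intervals.get_indexer(a['Number'])]
print(a)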

Looking for the presence of composite key in three DataFrames, and concatenating DataFrames accordingly

This question was very hard to word..
Here is some sample code for a reproducible example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([['a', 1, 10, 1], ['a', 2, 20, 1], ['b', 1, 4, 1],
                    ['c', 1, 2, 1], ['e', 2, 10, 1]])
df2 = pd.DataFrame([['a', 1, 15, 2], ['a', 2, 20, 2], ['c', 1, 2, 2]])
df3 = pd.DataFrame([['d', 1, 10, 3], ['e', 2, 20, 3], ['f', 1, 15, 3]])
df1.columns = ['name', 'id', 'price', 'part']
df2.columns = ['name', 'id', 'price', 'part']
df3.columns = ['name', 'id', 'price', 'part']
result = pd.DataFrame([['a', 1, 10, 15, 'missing'],
                       ['a', 2, 20, 20, 'missing'],
                       ['b', 1, 4, 'missing', 'missing'],
                       ['c', 1, 2, 2, 'missing'],
                       ['e', 2, 10, 'missing', 20],
                       ['d', 1, 'missing', 'missing', 10],
                       ['f', 1, 'missing', 'missing', 15]])
result.columns = ['name', 'id', 'pricepart1', 'pricepart2', 'pricepart3']
So there are three DataFrames:
df1
name id price part
0 a 1 10 1
1 a 2 20 1
2 b 1 4 1
3 c 1 2 1
4 e 2 10 1
df2
name id price part
0 a 1 15 2
1 a 2 20 2
2 c 1 2 2
df3
name id price part
0 d 1 10 3
1 e 2 20 3
2 f 1 15 3
The name and id together act like a composite key. It may be present in all three DataFrames, in just two of the three, or in just one. To represent which DataFrame a name, id pair came from, a part column exists in df1, df2 and df3.
The result I'm looking for is given by the result DataFrame.
name id pricepart1 pricepart2 pricepart3
0 a 1 10 15 missing
1 a 2 20 20 missing
2 b 1 4 missing missing
3 c 1 2 2 missing
4 e 2 10 missing 20
5 d 1 missing missing 10
6 f 1 missing missing 15
Basically, I want EVERY name, id pair to be accounted for. Even if the SAME name, id appears in both df1 and df2, I want separate price columns for each part, even if the prices in both parts/DataFrames are the same.
In the results DataFrame, take row1, a 1 10 15 missing
What this represents is, the name, id pair a 1 had a price of 10 in df1, 15 in df2, and missing in df3.
If the row value is missing for a specific pricepart that means, the name, id pair did not appear in that particular DataFrame!
I've used the part to represent the DataFrame, so you can assume that part is ALWAYS 1 in df1, ALWAYS 2 in df2 and ALWAYS 3 in df3.
So far, I have literally just done pd.concat([df1, df2, df3]).
Not sure if this approach is going to lead to a dead end..
Keep in mind that the original three DataFrames are 62245 rows × 4 columns EACH. And each DataFrame may or may not contain the name, id pair. If the name, id pair is present in EVEN 1 of the DataFrames, and not the others, I wanted that to be accounted for with a missing for the other DataFrames.
You can use pd.merge with how='outer':
# Change column names and remove 'part' column
df1 = df1.rename(columns={'price':'pricepart1'}).drop('part', axis=1)
df2 = df2.rename(columns={'price':'pricepart2'}).drop('part', axis=1)
df3 = df3.rename(columns={'price':'pricepart3'}).drop('part', axis=1)
# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
# Fill na values with 'missing'
df = df.fillna('missing')
Out[]:
name id pricepart1 pricepart2 pricepart3
0 a 1 10 15 missing
1 a 2 20 20 missing
2 b 1 4 missing missing
3 c 1 2 2 missing
4 e 2 10 missing 20
5 d 1 missing missing 10
6 f 1 missing missing 15
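Continuing the asker's own concat idea, a pivot-based sketch reaches the same table (starting from the original df1/df2/df3 with their part columns, before the renames above; it assumes each (name, id, part) triple occurs at most once, as it does here):

import pandas as pd

combined = pd.concat([df1, df2, df3])
result = (combined
          .pivot_table(index=['name', 'id'], columns='part',
                       values='price', aggfunc='first')
          .rename(columns=lambda p: f'pricepart{p}')
          .reset_index()
          .fillna('missing'))
# note: rows come out sorted by name/id rather than in merge order
print(result)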

Merging and sum up several value-counts series in Pandas

I usually use value_counts() to get the number of occurrences of a value. However, I am now dealing with large database tables (I cannot load them fully into RAM) and query the data in fractions of one month.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of
user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
So far, I have only found a method using groupby and sum:
# count user actions and remember them in a new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# drop unnecessary columns
df1 = df1[['userId', 'count']]
# drop duplicate rows
df1 = df1.drop_duplicates(subset=['userId'])
# repeat
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1, df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest "add" and specify a fill value of 0. This has an advantage over the previously suggested answer in that it will work when the two Dataframes have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument handles any NaN values that would arise; in this example, 'd' appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b,fill_value=0)
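For more than two chunks (the asker queries one month at a time), the same idea extends to a running total. A sketch, where monthly_frames stands in for whatever iterable of monthly DataFrames you load:

import pandas as pd

total = pd.Series(dtype='int64')
for chunk in monthly_frames:  # hypothetical: one DataFrame per month
    total = total.add(chunk['userId'].value_counts(), fill_value=0)
total = total.astype(int)  # add() with fill_value promotes to float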
You can sum the series generated by the value_counts method directly:
# create frames
df = pd.DataFrame({'User_id': ['a', 'a', 'b', 'c', 'c'], 'a': [1, 1, 2, 3, 3]})
df1 = pd.DataFrame({'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
This is known as "Split-Apply-Combine". It can be done in one line and 3-4 clicks, using a lambda function as follows.
1️⃣ paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ replace the three occurrences of label with the name of the column whose values you are counting (case-sensitive)
3️⃣ print(df.head()) to check it worked correctly
