Pandas Dataframe multiply with a column of another dataframe - python

Is there a way to multiply each element of a row of a dataframe by the element in the same row of a particular column of another dataframe?
For example, such that:
df1:
1 2 3
2 2 2
3 2 1
and df2:
x 1 b
z 2 c
x 4 a
results in
1 2 3
4 4 4
12 8 4
So basically, for a fixed column j, df1[i,:] * df2[i,j] = df3[i,:].

Multiply the first df by the column of the second df.
Assuming your column names are 0, 1, 2:
df1.mul(df2[1], axis=0)
Output
0 1 2
0 1 2 3
1 4 4 4
2 12 8 4
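For completeness, here is a minimal runnable sketch of the same idea, assuming both frames use the default integer column labels:
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [2, 2, 2], [3, 2, 1]])
df2 = pd.DataFrame([['x', 1, 'b'], ['z', 2, 'c'], ['x', 4, 'a']])

# axis=0 aligns df2's column 1 with df1's row index,
# so each row of df1 is scaled by its matching multiplier
print(df1.mul(df2[1], axis=0))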

Here you go.
I have created a variable that lets you select which column of the second dataframe you want to multiply with the numbers in the first dataframe. Note that the multiplier must be broadcast over rows, and that df2's numeric values sit in column 1 (column 0 holds strings).
import numpy as np
arr1 = np.array(df1)  # df1 to array
which_df2col_to_multiply = 1  # select the numeric col from df2
arr2 = np.array(df2)[:, which_df2col_to_multiply]  # selected col to array
print(arr2[:, None] * arr1)  # broadcast each multiplier across its row
This is the output:
[[1 2 3]
[4 4 4]
[12 8 4]]
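Note that the arr2[:, None] broadcasting scales whole rows, so it also works when df1 is not square; a quick sketch with made-up values:
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [2, 2, 2]])  # 2 rows, 3 columns
col = np.array([10, 100])                   # one multiplier per row
print(col[:, None] * df1.to_numpy())        # [[10 20 30], [200 200 200]]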

Related

Combine 2 dataframes of different length to form a new dataframe that has a length equal to the max length of the 2 dataframes

I have a dataframe:
import pandas as pd
index = [0, 1, 2, 3, 4, 5]
s = pd.Series([1, 2, 2, 3, 3, 6], index=index)
t = pd.Series([2, 4, 6, 8, 10, 12], index=index)
df1 = pd.DataFrame(s, columns=["MUL1"])
df1["MUL2"] = t
MUL1 MUL2
0 1 2
1 2 4
2 2 6
3 3 8
4 3 10
5 6 12
and another dataframe:
u = pd.Series([1, 2, 3, 6])
v = pd.Series([2, 8, 10, 12])
df2 = pd.DataFrame(u, columns=["MUL3"])
df2["MUL4"] = v
Now I want a new dataframe which looks like the following:
MUL6 MUL7
0 1 2
1 2 8
2 2 8
3 3 10
4 3 10
5 6 12
By combining the first 2 dataframes.
I have tried the following:
X1 = df1.to_numpy()
X2 = df2.to_numpy()
result = []
for i in range(X1.shape[0]):
    for j in range(X2.shape[0]):
        if X1[i, -1] == X2[j, -1]:
            result.append(X2[X1[i, -1] == X2[j, -1], -1])
I was trying to convert the dataframes to NumPy arrays so I can iterate through them to get a new array that I can convert back to a dataframe, but the size of the new dataframe does not equal the size of the first dataframe. I would appreciate any help. Thanks.
Although the details of the logic are cryptic, I believe that you want a merge:
(df1[['MUL1']].rename(columns={'MUL1': 'MUL6'})
     .merge(df2.rename(columns={'MUL3': 'MUL6', 'MUL4': 'MUL7'}),
            on='MUL6', how='left')
)
output:
MUL6 MUL7
0 1 2
1 2 8
2 2 8
3 3 10
4 3 10
5 6 12
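A note on the design choice: how='left' keeps every row of df1, duplicates included, and preserves df1's order, which is why the result has the same length as df1. If df2 could ever contain duplicate MUL6 keys, the merge would multiply rows, so deduplicating first (e.g. df2.drop_duplicates('MUL3')) would be a sensible guard.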

pandas dataframe from dictionary where keys are tuples of tuples of row indexes and column indexes resp [duplicate]

I tried to create a data frame df using the code below:
import numpy as np
import pandas as pd
index = [0,1,2,3,4,5]
s = pd.Series([1,2,3,4,5,6],index= index)
t = pd.Series([2,4,6,8,10,12],index= index)
df = pd.DataFrame(s,columns = ["MUL1"])
df["MUL2"] =t
print(df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
While trying to create the same data frame using the below syntax, I am getting a weird output.
df = pd.DataFrame([s,t],columns = ["MUL1","MUL2"])
print(df)
MUL1 MUL2
0 NaN NaN
1 NaN NaN
Please explain why NaN is being displayed in the dataframe when both Series are non-empty, and why only two rows are displayed and not the rest.
Also, please provide the correct way to create the same data frame as above using the columns argument of the pandas DataFrame constructor.
One of the correct ways would be to stack the array data from the input list holding those series into columns -
In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]:
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -
In [162]: np.c_[s,t]
Out[162]:
array([[ 1, 2],
[ 2, 4],
[ 3, 6],
[ 4, 8],
[ 5, 10],
[ 6, 12]])
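As a side note, an alternative that keeps the row-wise constructor but still ends up with the intended columns is to label the two rows and transpose; a minimal sketch, assuming s and t as defined above:
df = pd.DataFrame([s, t], index=["MUL1", "MUL2"]).T
print(df)  # same 6x2 frame as above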
If you remove the columns argument, you get:
df = pd.DataFrame([s,t])
print (df)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 2 4 6 8 10 12
If you then pass column labels that do not exist in the data, you get NaN columns:
df = pd.DataFrame([s,t], columns=[0,'MUL2'])
print (df)
0 MUL2
0 1.0 NaN
1 2.0 NaN
Better is to use a dictionary:
df = pd.DataFrame({'MUL1':s,'MUL2':t})
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
And if you need to change the column order, add the columns parameter:
df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])
print (df)
MUL2 MUL1
0 2 1
1 4 2
2 6 3
3 8 4
4 10 5
5 12 6
More information is in the DataFrame documentation.
Another solution uses concat - the DataFrame constructor is not necessary:
df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
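As a side note, keys names the columns here because concatenating Series along axis=1 turns each Series into a column; the same keys argument on DataFrames would instead become the outer level of a column MultiIndex.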
A pandas.DataFrame takes in the parameter data, which can be an ndarray, iterable, dict, or DataFrame.
If you pass in a list, it will assume each member is a row. Example:
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame([a, b], columns = ["Col1","Col2", "Col3"])
# output:
Col1 Col2 Col3
0 1 2 3
1 2 4 6
You are getting NaN because each Series becomes a row whose index labels 0..5 become the columns; the columns=["MUL1","MUL2"] argument then selects columns that do not exist, so every value is NaN, and there are only two rows because you passed two Series.
To get the shape you want, first transpose the data:
data = np.array([a, b]).transpose()
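A minimal completion of that idea, reusing the a and b lists from the example above:
import numpy as np
import pandas as pd

a = [1, 2, 3]
b = [2, 4, 6]
data = np.array([a, b]).transpose()  # shape (3, 2): one row per element
df = pd.DataFrame(data, columns=["Col1", "Col2"])
print(df)  # three rows: (1, 2), (2, 4), (3, 6)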
How to create a pandas dataframe
import pandas as pd
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame(dict(Col1=a, Col2=b))
Output:
Col1 Col2
0 1 2
1 2 4
2 3 6

delete the first n rows of each ids in dataframe

I have a DataFrame with two columns. I want to delete the first 3 rows of each id. If an id has three or fewer rows, delete all of its rows. In the following, ids 3 and 1 have 3 and 2 rows respectively, so they should be deleted entirely; for ids 4 and 2, only their 4th and 5th rows are preserved.
import pandas as pd
df = pd.DataFrame()
df['id'] = [4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 3, 3, 3, 1, 1]
df['value'] = [2, 1, 1, 2, 3, 4, 6, -1, -2, 2, -3, 5, 7, -2, 5]
Here is the DataFrame which I want.
Number each "id" using groupby + cumcount and filter the rows where the number is more than 2:
out = df[df.groupby('id').cumcount() > 2]
Output:
id value
3 4 2
4 4 3
8 2 -2
9 2 2
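To see why the > 2 cutoff works, here is what cumcount assigns; the counter restarts at 0 within each id, so only the 4th row onward of any id survives:
print(df.groupby('id').cumcount().tolist())
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1]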
Use Series.value_counts and Series.map to perform boolean indexing that removes the ids with three or fewer rows, combined with cumcount to skip the first three rows of the ids that remain:
new_df = df[df['id'].map(df['id'].value_counts().gt(3)) & df.groupby('id').cumcount().gt(2)]
id value
3 4 2
4 4 3
8 2 -2
9 2 2
Using cumcount is the way to go, but drop works as well:
out = df.groupby('id', sort=False).apply(lambda x: x.drop(x.index[:3])).reset_index(drop=True)
Out[12]:
id value
0 4 2
1 4 3
2 2 -2
3 2 2
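Note that groupby.apply with a Python lambda is generally slower than the vectorized cumcount mask above on large frames, so the drop variant is best reserved for cases where the per-group logic is genuinely more complex than a row cutoff.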

Select rows of pandas dataframe from list, in order of list

The question was originally asked here as a comment but could not get a proper answer as the question was marked as a duplicate.
For a given pandas.DataFrame, let us say
df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3,5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
How can we select rows from a list, based on values in a column ('A' for instance)?
For instance
# from
list_of_values = [3,4,6]
# we would like, as a result
# A B
# 2 3 3
# 3 4 5
# 1 6 2
Using isin as mentioned here is not satisfactory, as it does not keep the order from the input list of 'A' values.
How can this goal be achieved?
One way to overcome this is to make the 'A' column an index and use loc on the newly generated pandas.DataFrame. Finally, the subsampled dataframe's index can be reset.
Here is how:
ret = df.set_index('A').loc[list_of_values].reset_index(inplace=False)
# ret is
# A B
# 0 3 3
# 1 4 5
# 2 6 2
Note that the drawback of this method is that the original indexing has been lost in the process.
More on pandas indexing: What is the point of indexing in pandas?
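If keeping the original index matters, a variant of the same idea, assuming the 'A' values are unique, stashes the index in a column before the set_index/loc round trip:
ret = (df.reset_index()
         .set_index('A')
         .loc[list_of_values]
         .reset_index()
         .set_index('index')
         .rename_axis(None))
# ret is
# A B
# 2 3 3
# 3 4 5
# 1 6 2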
Use merge with a helper DataFrame created from the list, matching on the column name of the lookup column:
df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3,5]})
list_of_values = [3,6,4]
df1 = pd.DataFrame({'A':list_of_values}).merge(df)
print (df1)
A B
0 3 3
1 6 2
2 4 5
For a more general solution:
df = pd.DataFrame({'A' : [5,6,5,3,4,4,6,5], 'B':range(8)})
print (df)
A B
0 5 0
1 6 1
2 5 2
3 3 3
4 4 4
5 4 5
6 6 6
7 5 7
list_of_values = [6,4,3,7,7,4]
#create df from list
list_df = pd.DataFrame({'A':list_of_values})
print (list_df)
A
0 6
1 4
2 3
3 7
4 7
5 4
#column for original index values
df1 = df.reset_index()
#helper column for counting duplicated values
df1['g'] = df1.groupby('A').cumcount()
list_df['g'] = list_df.groupby('A').cumcount()
#merge together, create index from column and remove g column
df = list_df.merge(df1).set_index('index').rename_axis(None).drop('g', axis=1)
print (df)
A B
1 6 1
4 4 4
3 3 3
5 4 5
1] Generic approach for list_of_values.
In [936]: dff = df[df.A.isin(list_of_values)]
In [937]: dff.reindex(dff.A.map({x: i for i, x in enumerate(list_of_values)}).sort_values().index)
Out[937]:
A B
2 3 3
3 4 5
1 6 2
2] If list_of_values is sorted, you can use
In [926]: df[df.A.isin(list_of_values)].sort_values(by='A')
Out[926]:
A B
2 3 3
3 4 5
1 6 2
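On pandas 1.1+, the same ranking trick fits directly into sort_values via its key parameter; a minimal sketch, assuming the values in list_of_values are unique:
order = {v: i for i, v in enumerate(list_of_values)}
out = df[df.A.isin(list_of_values)].sort_values('A', key=lambda s: s.map(order))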

Adding row in Pandas DataFrame keeping index order

I have a DataFrame and I would like to add some rows that do not exist in it yet. I have found the .loc method, but it appends the values at the end rather than in sorted order. For example
import numpy as np
import pandas as pd
dfi = pd.DataFrame(np.arange(6).reshape(3,2),columns=['A','B'])
>>> dfi
A B
0 0 1
1 2 3
2 4 5
[3 rows x 2 columns]
Adding a nonexistent row through .loc:
dfi.loc[5,:] = 0
>>> dfi
A B
0 0 1
1 2 3
2 4 5
5 0 0
[4 rows x 2 columns]
So far everything is OK. But this is what happens when trying to add another row with an index smaller than the last one:
dfi.loc[3,:] = 0
>>> dfi
A B
0 0 1
1 2 3
2 4 5
5 0 0
3 0 0
[5 rows x 2 columns]
I would like it to put the row with index 3 between rows 2 and 5. I could sort the DataFrame by index every time, but that would take too long. Is there another way?
My actual problem involves a DataFrame whose indexes are datetime objects. I didn't put the whole detail of that implementation here because it would obscure what my real problem is: adding rows to a DataFrame such that the result has an ordered index.
If your index is almost continuous, only missing a few values here and there, you may try the following:
In [15]:
df=pd.DataFrame(np.zeros((100,2)), columns=['A', 'B'])
df['A']=np.nan
df['B']=np.nan
In [16]:
df.iloc[[0,1,2]] = pd.DataFrame({'A': [0,2,4], 'B': [1,3,5]})
df.iloc[5]=[0,0]
df.iloc[3]=0
print(df.dropna())
A B
0 0 1
1 2 3
2 4 5
3 0 0
5 0 0
[5 rows x 2 columns]
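If the insertion points are arbitrary, as with datetime indexes, a simpler route may be to insert with .loc and sort once at the end rather than after every insertion; a minimal sketch reusing dfi from the question:
for idx in [5, 3]:        # rows arriving out of index order
    dfi.loc[idx] = 0
dfi = dfi.sort_index()    # restore index order in one pass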
