Having input dataframe:
x_1 x_2
0 0.0 0.0
1 1.0 0.0
2 2.0 0.2
3 2.5 1.5
4 1.5 2.0
5 -2.0 -2.0
and additional dataframe as follows:
index x_1_x x_2_x x_1_y x_2_y value dist dist_rank
0 0 0.0 0.0 0.1 0.1 5.0 0.141421 2.0
4 0 0.0 0.0 1.5 1.0 -2.0 1.802776 3.0
5 0 0.0 0.0 0.0 0.0 3.0 0.000000 1.0
9 1 1.0 0.0 0.1 0.1 5.0 0.905539 1.0
11 1 1.0 0.0 2.0 0.4 3.0 1.077033 3.0
14 1 1.0 0.0 0.0 0.0 3.0 1.000000 2.0
18 2 2.0 0.2 0.1 0.1 5.0 1.902630 3.0
20 2 2.0 0.2 2.0 0.4 3.0 0.200000 1.0
22 2 2.0 0.2 1.5 1.0 -2.0 0.943398 2.0
29 3 2.5 1.5 2.0 0.4 3.0 1.208305 3.0
30 3 2.5 1.5 2.5 2.5 4.0 1.000000 1.0
31 3 2.5 1.5 1.5 1.0 -2.0 1.118034 2.0
38 4 1.5 2.0 2.0 0.4 3.0 1.676305 3.0
39 4 1.5 2.0 2.5 2.5 4.0 1.118034 2.0
40 4 1.5 2.0 1.5 1.0 -2.0 1.000000 1.0
45 5 -2.0 -2.0 0.1 0.1 5.0 2.969848 2.0
46 5 -2.0 -2.0 1.0 -2.0 6.0 3.000000 3.0
50 5 -2.0 -2.0 0.0 0.0 3.0 2.828427 1.0
I want to create new columns in the input dataframe, based on the additional dataframe with respect to dist_rank. For each row it should extract x_1_y, x_2_y and value per index and dist_rank, so the expected output has one set of these columns for every dist_rank value.
I tried the following lines:
df['value_dist_rank1']=result.loc[result['dist_rank']==1.0, 'value']
df['value_dist_rank1 ']=result[result['dist_rank']==1.0]['value']
but both gave the same output:
x_1 x_2 value_dist_rank1
0 0.0 0.0 NaN
1 1.0 0.0 NaN
2 2.0 0.2 NaN
3 2.5 1.5 NaN
4 1.5 2.0 NaN
5 -2.0 -2.0 3.0
Here is a way to do it:
(For the sake of clarity, I will refer to the input df as df1 and the additional df as df2.)
# First we group df2 by index to gather all the column information for each index on one line
df2 = df2.groupby('index').agg(lambda x: list(x)).reset_index()

# Then we expand each list column into three new columns, since there are always three rows per index
columns = ['dist_rank', 'value', 'x_1_y', 'x_2_y']
column_to_add = ['value', 'x_1_y', 'x_2_y']

for index, row in df2.iterrows():
    for i in range(3):
        # the [:-2] strips the trailing '.0' of the float dist_rank from the column name
        column_names = ["{}_dist_rank{}".format(x, row.dist_rank[i])[:-2] for x in column_to_add]
        values = [row[x][i] for x in column_to_add]
        for column, value in zip(column_names, values):
            df2.loc[index, column] = value

# We drop the columns that are no longer needed:
df2.drop(columns=columns + ['dist', 'x_1_x', 'x_2_x'], inplace=True)

# Finally we merge the modified df with our initial dataframe:
result = df1.merge(df2, left_index=True, right_on='index', how='left')
Output:
x_1 x_2 index value_dist_rank2 x_1_y_dist_rank2 x_2_y_dist_rank2 \
0 0.0 0.0 0 5.0 0.1 0.1
1 1.0 0.0 1 3.0 0.0 0.0
2 2.0 0.2 2 -2.0 1.5 1.0
3 2.5 1.5 3 -2.0 1.5 1.0
4 1.5 2.0 4 4.0 2.5 2.5
5 -2.0 -2.0 5 5.0 0.1 0.1
value_dist_rank3 x_1_y_dist_rank3 x_2_y_dist_rank3 value_dist_rank1 \
0 -2.0 1.5 1.0 3.0
1 3.0 2.0 0.4 5.0
2 5.0 0.1 0.1 3.0
3 3.0 2.0 0.4 4.0
4 3.0 2.0 0.4 -2.0
5 6.0 1.0 -2.0 3.0
x_1_y_dist_rank1 x_2_y_dist_rank1
0 0.0 0.0
1 0.1 0.1
2 2.0 0.4
3 2.5 2.5
4 1.5 1.0
5 0.0 0.0
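For reference, a more compact route to the same result is to pivot the additional dataframe and flatten the resulting columns before joining. This is just a sketch: it starts again from the original, unmodified df2 and assumes each index has exactly one row per dist_rank and a reasonably recent pandas where pivot accepts a list of value columns.
wide = df2.pivot(index='index', columns='dist_rank', values=['value', 'x_1_y', 'x_2_y'])
# flatten the (column, rank) MultiIndex into names like value_dist_rank1
wide.columns = ['{}_dist_rank{}'.format(col, int(rank)) for col, rank in wide.columns]
result = df1.join(wide)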
I have two dataframes:
df1:
Name Segment Axis 1 2 3 4 5
Amazon 1 slope NaN 2.5 2.5 2.5 2.5
Amazon 1 x 0.0 1.0 2.0 3.0 4.0
Amazon 1 y 0.0 0.4 0.8 1.2 1.6
Amazon 2 slope NaN 2.0 2.0 2.0 2.0
Amazon 2 x 0.0 2.0 4.0 6.0 8.0
Amazon 2 y 0.0 1.0 2.0 3.0 4.0
df2:
Name Segment Cost
Amazon 1 100
Amazon 2 112
Netflix 1 110
Netflix 2 210
I want to multiply all the values that fall on the "slope" rows in columns 1-5 by the corresponding cost in the second dataframe.
Expected output:
Name Segment Axis 1 2 3 4 5
Amazon 1 slope NaN 250 250 250 250
Amazon 1 x 0.0 1.0 2.0 3.0 4.0
Amazon 1 y 0.0 0.4 0.8 1.2 1.6
Amazon 2 slope NaN 224 224 224 224
Amazon 2 x 0.0 2.0 4.0 6.0 8.0
Amazon 2 y 0.0 1.0 2.0 3.0 4.0
Try this:
# merge df2 to align with df1
u = df1.merge(df2, on=['Name', 'Segment'], how='left')
# find the columns to multiply by the cost
cols = df1.columns ^ ['Name', 'Segment', 'Axis']
# multiply and assign back
df1[cols] = u[cols].mul(u['Cost'], axis=0).where(df1['Axis'].eq('slope'), df1[cols])
print(df1)
Name Segment Axis 1 2 3 4 5
0 Amazon 1 slope NaN 250.0 250.0 250.0 250.0
1 Amazon 1 x 0.0 1.0 2.0 3.0 4.0
2 Amazon 1 y 0.0 0.4 0.8 1.2 1.6
3 Amazon 2 slope NaN 224.0 224.0 224.0 224.0
4 Amazon 2 x 0.0 2.0 4.0 6.0 8.0
5 Amazon 2 y 0.0 1.0 2.0 3.0 4.0
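One caveat about the snippet above: newer pandas versions deprecate using ^ on an Index as a set operation, so Index.difference is the safer way to pick the numeric columns (same variable names as above):
cols = df1.columns.difference(['Name', 'Segment', 'Axis'])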
You can make use of the Index to have pandas do all the heavy lifting with alignment. Unfortunately DataFrame.mul(Series) doesn't yet support a fill_value, so we need to .fillna.
df1 = df1.set_index(['Name', 'Segment', 'Axis'])
# Give df2 a 'slope' level so we know what to align
df2 = df2.assign(Axis='slope').set_index(['Name', 'Segment', 'Axis'])
# So we don't add rows from df2 not in df1
df2 = df2[df2.index.isin(df1.index)]
df1 = df1.mul(df2['Cost'], axis=0).fillna(df1)
print(df1)
1 2 3 4 5
Name Segment Axis
Amazon 1 slope NaN 250.0 250.0 250.0 250.0
x 0.0 1.0 2.0 3.0 4.0
y 0.0 0.4 0.8 1.2 1.6
2 slope NaN 224.0 224.0 224.0 224.0
x 0.0 2.0 4.0 6.0 8.0
y 0.0 1.0 2.0 3.0 4.0
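If the flat layout from the question is wanted back afterwards, resetting the index restores it:
df1 = df1.reset_index()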
I have a DataFrame df1 that looks like this:
userId movie1 movie2 movie3
0 4.1 0.0 1.0
1 3.1 1.1 3.4
2 2.8 0.0 1.7
3 0.0 5.0 0.0
4 0.0 0.0 0.0
5 2.3 0.0 2.0
and another DataFrame, df2 that looks like this:
userId movie4 movie5 movie6
0 4.1 0.0 1.0
1 3.1 1.1 3.4
2 2.8 0.0 1.7
3 0.0 5.0 0.0
4 0.0 0.0 0.0
5 2.3 0.0 2.0
How do I select one column from df2 and add it to df1? For example, adding movie6 to df1 would result in:
userId movie1 movie2 movie3 movie6
0 4.1 0.0 1.0 1.0
1 3.1 1.1 3.4 3.4
2 2.8 0.0 1.7 1.7
3 0.0 5.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 2.3 0.0 2.0 2.0
df1 = pd.concat([df1, df2['movie6']], axis=1)
You can merge on the shared column, userId:
df1 = df1.merge(df2[["userId","movie6"]], on="userId")
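Note that merge defaults to an inner join; if some userId values could be missing from df2, passing how="left" keeps every row of df1:
df1 = df1.merge(df2[["userId", "movie6"]], on="userId", how="left")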
I have two .csv files. One contains what could be described as a header and a body: the header holds data such as the total number of rows, a datetime, what application generated the data, and what line the body starts on. The second file contains a single row.
>>> import pandas as pd
>>> df = pd.read_csv("data.csv", names=list('abcdef'))
>>> df
a b c d e f
0 data start row 5 NaN NaN NaN NaN
1 row count 7 NaN NaN NaN NaN
2 made by foo.exe NaN NaN NaN NaN
3 date 01-01-2000 NaN NaN NaN NaN
4 a b c d e f
5 0.0 1.0 2.0 3.0 4.0 5.0
6 0.0 1.0 2.0 3.0 4.0 5.0
7 0.0 1.0 2.0 3.0 4.0 5.0
8 0.0 1.0 2.0 3.0 4.0 5.0
9 0.0 1.0 2.0 3.0 4.0 5.0
10 0.0 1.0 2.0 3.0 4.0 5.0
11 0.0 1.0 2.0 3.0 4.0 5.0
>>> df2 = pd.read_csv("extra_data.csv")
>>> df2
a b c
0 6.0 5.0 4.0
>>> row = df2.loc[0]
>>>
I am having trouble modifying the 'a', 'b' and 'c' columns and then saving the DataFrame to a new .csv file.
I have tried adding the row by way of slicing and the addition operator but this did not work:
>>> df[5:,'a':'c'] += row
TypeError: '(slice(5, None, None), slice('a', 'c', None))' is an invalid key
>>>
I also tried the answer I found here, but this gave a similar error:
>>> df[5:,row.index] += row
TypeError: '(slice(5, None, None), Index(['a', 'b', 'c'], dtype='object'))' is an invalid key
>>>
I suspect the problem comes from the object dtypes, so I tried converting a sub-section to float:
>>> sub_section = df.loc[5:,['a','b','c']].astype(float)
>>> sub_section
a b c
5 0.0 1.0 2.0
6 0.0 1.0 2.0
7 0.0 1.0 2.0
8 0.0 1.0 2.0
9 0.0 1.0 2.0
10 0.0 1.0 2.0
11 0.0 1.0 2.0
>>> sub_section += row
>>> sub_section
a b c
5 6.0 6.0 6.0
6 6.0 6.0 6.0
7 6.0 6.0 6.0
8 6.0 6.0 6.0
9 6.0 6.0 6.0
10 6.0 6.0 6.0
11 6.0 6.0 6.0
>>> df
a b c d e f
0 data start row 5 NaN NaN NaN NaN
1 row count 7 NaN NaN NaN NaN
2 made by foo.exe NaN NaN NaN NaN
3 date 01-01-2000 NaN NaN NaN NaN
4 a b c d e f
5 0.0 1.0 2.0 3.0 4.0 5.0
6 0.0 1.0 2.0 3.0 4.0 5.0
7 0.0 1.0 2.0 3.0 4.0 5.0
8 0.0 1.0 2.0 3.0 4.0 5.0
9 0.0 1.0 2.0 3.0 4.0 5.0
10 0.0 1.0 2.0 3.0 4.0 5.0
11 0.0 1.0 2.0 3.0 4.0 5.0
>>>
Obviously, in this case df.loc[...].astype(float) returns a copy, and modifying the copy does nothing to df.
How do I modify parts of a DataFrame (dtype=object) and then save the changes?
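One way that works (a minimal sketch; the output file name is just a placeholder) is to do the float conversion and the addition on the slice, then assign the result back through .loc so the change lands in df itself:
# add the row to the numeric block and write it back into df
df.loc[5:, ['a', 'b', 'c']] = df.loc[5:, ['a', 'b', 'c']].astype(float).add(row)
# write out without the RangeIndex and without an extra header row
# (the column names a..f were supplied to read_csv, not taken from the file)
df.to_csv("new_data.csv", index=False, header=False)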
I have a dataframe like the one below and I need to create two columns out of the base column.
Input
Kg
0.5
0.5
1
1
1
2
2
5
5
5
Expected Output
Kg_From Kg_To
0 0.5
0 0.5
0.5 1
0.5 1
0.5 1
1 2
1 2
2 5
2 5
2 5
How can this be done in pandas?
Assuming your Kg column is sorted:
s = df["Kg"].unique()
df["Kg_from"] = df["Kg"].map({k:v for k,v in zip(s[1:], s)}).fillna(0)
print (df)
Kg Kg_from
0 0.5 0.0
1 0.5 0.0
2 1.0 0.5
3 1.0 0.5
4 1.0 0.5
5 2.0 1.0
6 2.0 1.0
7 5.0 2.0
8 5.0 2.0
9 5.0 2.0
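To match the column names in the expected output exactly, a rename can follow:
df = df.rename(columns={"Kg": "Kg_To", "Kg_from": "Kg_From"})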
import numpy as np

# get the unique values and the count of each value in the Kg column
val, counts = np.unique(df.Kg, return_counts=True)
# shift forward by 1 and replace the first value with 0
val = np.roll(val, 1)
val[0] = 0
# repeat each shifted value according to the counts generated earlier
df['Kg_from'] = np.repeat(val, counts)
df
Kg Kg_from
0 0.5 0.0
1 0.5 0.0
2 1.0 0.5
3 1.0 0.5
4 1.0 0.5
5 2.0 1.0
6 2.0 1.0
7 5.0 2.0
8 5.0 2.0
9 5.0 2.0
Use zip and dict to build a mapping from the unique sorted values (np.unique) to the same values shifted by one position, with a leading 0 added by np.insert, and create the new column with DataFrame.insert:
import numpy as np

df = df.rename(columns={'Kg': 'Kg_To'})
a = np.unique(df["Kg_To"])
df.insert(0, 'Kg_from', df['Kg_To'].map(dict(zip(a, np.insert(a, 0, 0)))))
print (df)
Kg_from Kg_To
0 0.0 0.5
1 0.0 0.5
2 0.5 1.0
3 0.5 1.0
4 0.5 1.0
5 1.0 2.0
6 1.0 2.0
7 2.0 5.0
8 2.0 5.0
9 2.0 5.0
Code:
kgs = df.Kg.unique()
lower = [0] + list(kgs[:-1])
kg_dict = {k: v for v, k in zip(lower, kgs)}

# new dataframe
new_df = pd.DataFrame({
    'Kg_From': df['Kg'].map(kg_dict),
    'Kg_To': df['Kg']
})

# or if you want new columns:
df['Kg_from'] = df['Kg'].map(kg_dict)
Output:
Kg_From Kg_To
0 0.0 0.5
1 0.0 0.5
2 0.5 1.0
3 0.5 1.0
4 0.5 1.0
5 1.0 2.0
6 1.0 2.0
7 2.0 5.0
8 2.0 5.0
9 2.0 5.0
I have a dataframe such as:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
5 11.4 5.6 3.2 1.6 0.8 1.0
Where the final row contains averages. I would like to rename the final row label to "A" so that the dataframe will look like this:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
A 11.4 5.6 3.2 1.6 0.8 1.0
I understand columns can be done with df.columns = . . .. But how can I do this with a specific row label?
You can get the last index label using negative indexing, similar to Python lists:
last = df.index[-1]
Then
df = df.rename(index={last: 'a'})
Edit: If you are looking for a one-liner,
df.index = df.index[:-1].tolist() + ['a']
Use the index attribute:
df.index = df.index[:-1].append(pd.Index(['A']))