I have posted two sample DataFrames below. I would like to map one column of a DataFrame against the index of a column in another DataFrame, and place the mapped values back in the first DataFrame, as shown below.
import numpy as np
import pandas as pd

A = np.array([0, 1, 1, 3, 5, 2, 5, 4, 2, 0])
B = np.array([55, 75, 86, 98, 100, 111])
df1 = pd.Series(A, name='data').to_frame()
df2 = pd.Series(B, name='values_for_replacement').to_frame()
Below is the first DataFrame, df1:
data
0 0
1 1
2 1
3 3
4 5
5 2
6 5
7 4
8 2
9 0
And below is the second DataFrame, df2:
values_for_replacement
0 55
1 75
2 86
3 98
4 100
5 111
Below is the output needed (mapped with respect to the index of df2):
data new_data
0 0 55
1 1 75
2 1 75
3 3 98
4 5 111
5 2 86
6 5 111
7 4 100
8 2 86
9 0 55
I would kindly like to know how one can achieve this using a pandas function such as map.
Looking forward to some answers. Many thanks in advance.
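For what it's worth, here is a minimal sketch of the map approach, continuing from the setup above. Passing a Series to map uses that Series' index as the lookup table, which is exactly the "index of df2" behaviour asked for:

# each value in df1['data'] is looked up in the index of the Series,
# so 0 -> 55, 1 -> 75, ..., 5 -> 111
df1['new_data'] = df1['data'].map(df2['values_for_replacement'])
print(df1)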
Related
I have the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': [10, 10, 10, 20, 20, 20], 'idx': [1, 2, 3, 1, 2, 3],
                   'v1': [1, 2, 4, 5, 1, 9], 'v2': [1, 2, 8, 5, 1, 2]})
timestamp idx v1 v2
0 10 1 1 1
1 10 2 2 2
2 10 3 4 8
3 20 1 5 5
4 20 2 1 1
5 20 3 9 2
I'd like to group the data by timestamp and calculate the following statistic for every timestamp: np.sum(v1 * v2). I'd like to see the following result:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
I'm trying to do the following:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But the stat column comes out as all NaN values. What is wrong with my code?
We want groupby transform here, not groupby apply. Since transform works on a single column, we compute the v1 * v2 product first and then group that product by timestamp:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function we need to join back to scale up the aggregated DataFrame:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
      .rename('stat'),  # needed to use join, but also sets the column name
    on='timestamp'
)
df:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
The issue is that groupby apply is producing summary information:
timestamp
10 37
20 44
dtype: int64
This does not assign back to the DataFrame naturally, as it has only 2 rows while the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this by using groupby transform, which is designed to produce a:
like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values
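To make the contrast concrete, here is a quick sketch using the df defined above:

prod = df['v1'] * df['v2']

# groupby sum collapses to one row per group (index = timestamps 10 and 20)
print(prod.groupby(df['timestamp']).sum())

# groupby transform keeps the original 6-row index, so it assigns back cleanly
print(prod.groupby(df['timestamp']).transform('sum'))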
I have a data frame like this.
FOOD_ID Cumulative_addition
0 110 0
1 110 15
2 110 15
3 110 35
4 111 0
5 111 10
6 111 10
I want to add another column that gives the addition only (the per-row increment) for each FOOD_ID. The final data frame that I want looks like this:
FOOD_ID Cumulative_addition Addition_Only
0 110 0 0
1 110 15 15
2 110 15 0
3 110 35 20
4 111 0 0
5 111 10 10
6 111 10 0
I know how to do this in Excel using an IF statement, but I do not know how to do it in Python.
Try:
df['Addition_only'] = (df.groupby('FOOD_ID').Cumulative_addition.shift(-1) - df.Cumulative_addition).shift(1).fillna(0)
Detail:
df.groupby('FOOD_ID').Cumulative_addition.shift(-1)
This groups the Cumulative_addition column by FOOD_ID and shifts each group up by one row. Then you can subtract the original column to get the difference, and shift the result back down by one row.
Hope that helps.
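As an aside, groupby diff gives a simpler equivalent here; this is a sketch assuming the goal is just the per-group difference of Cumulative_addition:

# diff() takes the row-to-row difference within each FOOD_ID group;
# the first row of each group comes out as NaN, which we fill with 0
df['Addition_Only'] = df.groupby('FOOD_ID').Cumulative_addition.diff().fillna(0)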
I have been modifying an Excel document with pandas. I only need to work with small sections at a time, and breaking each into a separate DataFrame, modifying it, and then recombining it back into the whole seems like the best solution. Is this feasible? I've tried a couple of options with merge() and concat(), but they don't seem to give me the results I am looking for.
As previously stated, I've tried using the merge() function to recombine them back together; with the larger table I just got a MemoryError, and when I tested it with smaller DataFrames, rows weren't maintained.
Here's a small-scale example of what I am looking to do:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 5, 6],
                    'B': [3, 10, 11, 13, 324],
                    'C': [64, '', '', '', ''],
                    'D': [32, 45, 67, 80, 100]})  # example df
print(df1)
df2 = df1[['B', 'C']]  # section taken
df2.at[2, 'B'] = 1  # modify area
print(df2)
df1 = df1.merge(df2)  # merge dataframes
print(df1)
output:
A B C D
0 1 3 64 32
1 2 10 45
2 3 11 67
3 5 13 80
4 6 324 100
B C
0 3 64
1 10
2 1
3 13
4 324
A B C D
0 1 3 64 32
1 2 10 45
2 5 13 80
3 6 324 100
What I would like to see:
A B C D
0 1 3 64 32
1 2 10 45
2 3 11 67
3 5 13 80
4 6 324 100
B C
0 3 64
1 10
2 1
3 13
4 324
A B C D
0 1 3 64 32
1 2 10 45
2 3 1 67
3 5 13 80
4 6 324 100
As I said before, in my actual code I just get a MemoryError if I try this, due to the size of the DataFrame.
No need to merge here; you can just re-assign the values from df2 back into df1:
...
df1.loc[df2.index, df2.columns] = df2  # recover changes into the original dataframe
print(df1)
giving as expected:
A B C D
0 1 3 64 32
1 2 10 45
2 3 1 67
3 5 13 80
4 6 324 100
df1.update(df2) gives the same result (thanks to Quang Hoang for pointing this out).
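For reference, a minimal sketch of the update variant. Note that update aligns df2 on df1's index and columns and, by default, only overwrites with non-NA values from df2:

df1.update(df2)  # in place; only non-NA values from df2 replace values in df1
print(df1)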
I have two DataFrames that roughly look like
(ID) (Category) (Value1) (Value2)
111 1 5 7
112 1 3 8
113 2 6 9
114 3 2 6
and
(Category) (Value1 Average for Category) (Value2 Average for Category)
1 4 5
2 6 7
3 9 2
Ultimately, I'd like to join the two DataFrames so that each ID has the average values for its category in its row. I'm having trouble finding the right way to join/merge/etc. that will fill in the columns by checking the category in the other DataFrame. Does anyone have any idea where to start?
You are simply looking for a join; in pandas we use pd.merge for that, like the following:
df3 = pd.merge(df1, df2, on='Category')
ID Category Value1 Value2 Value 1 Average Value 2 Average
0 111 1 5 7 4 5
1 112 1 3 8 4 5
2 113 2 6 9 6 7
3 114 3 2 6 9 2
Official documentation of pandas on merging:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Here is a good explanation on joins:
Pandas Merging 101
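For reference, here is a runnable sketch of the merge; the average column names (Value1_Avg, Value2_Avg) are hypothetical stand-ins for the headers in the tables above:

import pandas as pd

df1 = pd.DataFrame({'ID': [111, 112, 113, 114],
                    'Category': [1, 1, 2, 3],
                    'Value1': [5, 3, 6, 2],
                    'Value2': [7, 8, 9, 6]})
df2 = pd.DataFrame({'Category': [1, 2, 3],
                    'Value1_Avg': [4, 6, 9],   # hypothetical column name
                    'Value2_Avg': [5, 7, 2]})  # hypothetical column name

# each row of df1 is paired with the df2 row sharing its Category
df3 = pd.merge(df1, df2, on='Category')
print(df3)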
Just do:
df1.groupby('Category')[['Value1', 'Value2']].transform('mean')
on the first dataframe to get the per-category averages aligned with each row. (Grouping by ID as well would put every row in its own single-row group, leaving the values unchanged.)
I have a database from which I am bringing in a SQL table of events and alarms (df1), and I have a txt file of alarm codes and properties (df2) to watch for. I want to take the values of one column from df2, cross-check each value against an entire column of values in df1, and output the entire rows of any matches into another dataframe, df3.
df1 A B C D
0 100 20 1 1
1 101 30 1 1
2 102 21 2 3
3 103 15 2 3
4 104 40 2 3
df2 0 1 2 3 4
0 21 2 2 3 3
1 40 0 NaN NaN NaN
Output into df3 the entire rows of df1 whose column B values match any of the values in df2 column 0.
df3 A B C D
0 102 21 2 3
1 104 40 2 3
I was able to get single results using:
df1[df1['B'] == df2.iloc[0,0]]
But I need something that will do this on a larger scale.
Method 1: merge
Use merge on B and 0, then select only the df1 columns:
df1.merge(df2, left_on='B', right_on='0')[df1.columns]
A B C D
0 102 21 2 3
1 104 40 2 3
Method 2: loc
Alternatively, use loc with .isin to find the rows in df1 where B has a match in df2 column 0 (note that, unlike merge, this keeps the original index of df1):
df1.loc[df1.B.isin(df2['0'])]
A B C D
2 102 21 2 3
4 104 40 2 3
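For completeness, a runnable sketch of the loc/isin method; df2's column labels are assumed to be strings here (matching the '0' used above), and reset_index(drop=True) reproduces the 0, 1 index shown for df3:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [100, 101, 102, 103, 104],
                    'B': [20, 30, 21, 15, 40],
                    'C': [1, 1, 2, 2, 2],
                    'D': [1, 1, 3, 3, 3]})
df2 = pd.DataFrame({'0': [21, 40], '1': [2, 0], '2': [2, np.nan],
                    '3': [3, np.nan], '4': [3, np.nan]})

# keep only the df1 rows whose B value appears in df2 column '0'
df3 = df1.loc[df1.B.isin(df2['0'])].reset_index(drop=True)
print(df3)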