Merging data in a pandas dataframe by values in a column - python

I have a pandas dataframe like this:
Name Year Sales
Ann 2010 500
Ann 2011 500
Bob 2010 400
Bob 2011 700
Ed 2010 300
Ed 2011 300
I want to be able to combine the figures in the sales column for each name returning:
Name Sales
Ann 1000
Bob 1100
Ed 600
Perhaps I need a for loop to go through and combine the 2 values for both years and create a new column, but I'm not quite sure. Is there a pandas function that can help me with this?

That's a simple dataframe groupby.
First, select the two columns you need:
df = df[["Name", "Sales"]]
And then apply the groupby
df.groupby(["name"], as_index=False).sum()
By default the groupby will make the grouped by columns part of the index. If you want to keep them as colum you need to specify as_index=False
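Putting the two steps together, a minimal runnable sketch using the sample data from the question:
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob", "Bob", "Ed", "Ed"],
    "Year": [2010, 2011, 2010, 2011, 2010, 2011],
    "Sales": [500, 500, 400, 700, 300, 300],
})

# Select the relevant columns, then group by Name and sum Sales.
result = df[["Name", "Sales"]].groupby("Name", as_index=False).sum()
print(result)
#   Name  Sales
# 0  Ann   1000
# 1  Bob   1100
# 2   Ed    600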

Related

Creating a calculated field based on row values provided column values match

I have a pandas dataframe, US state temperature data that is grouped firstly by State and then by Year. I have already selected the first and last years of entries by subsetting the original dataframe. I want to create a new dataframe that shows the difference in AvgTemperature from the first year (1995) and the last year (2019) for all 50 states.
State    Year  AvgTemperature
Alabama  1995  63.66
Alabama  2019  66.32
Alaska   1995  35.97
...      ...   ...
I want a result that I can plot to show which states have changed the most over time, preferably with State as column 1 and Temperature_Change as column 2.
You can pivot, compute the diff, and plot as a bar chart:
(df.pivot(index='State', columns='Year', values='AvgTemperature')
.diff(axis=1)
.iloc[:,-1]
.rename('diff')
.plot.bar()
)
NB. I used dummy data for Alaska in 2019.
Output: a bar chart of the temperature change per state.
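For reference, a self-contained sketch of the same pipeline without the plot call, using made-up numbers for Alaska in 2019 as noted above:
import pandas as pd

df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "Year": [1995, 2019, 1995, 2019],
    "AvgTemperature": [63.66, 66.32, 35.97, 37.10],  # Alaska 2019 is dummy data
})

# Wide layout: one row per state, one column per year.
wide = df.pivot(index="State", columns="Year", values="AvgTemperature")
# diff(axis=1) leaves the change in the last column; grab it as a Series.
change = wide.diff(axis=1).iloc[:, -1].rename("diff")
print(change)
# State
# Alabama    2.66
# Alaska     1.13
# Name: diff, dtype: float64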
Try this:
df.sort_values(['State', 'Year']).groupby('State').apply(
    lambda g: g.iloc[-1]['AvgTemperature'] - g.iloc[0]['AvgTemperature']
)
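This returns a Series indexed by State. If you want the two-column State / Temperature_Change layout asked for in the question, a small follow-up (the column name Temperature_Change is mine):
change = (
    df.sort_values(["State", "Year"])
      .groupby("State")
      .apply(lambda g: g["AvgTemperature"].iloc[-1] - g["AvgTemperature"].iloc[0])
      .rename("Temperature_Change")
      .reset_index()  # turn the State index back into a regular column
)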

Compare different df's row by row and return changes

Every month I collect data that contains details of employees to be stored in our database.
I need to compare the data stored for the previous month with the newly received data and return, in a new dataframe, every row where any of the columns changed.
I would also need to know which columns changed in each row of this returned dataframe.
There are also some important details to mention:
Each column can also contain blank values in any of the dataframes;
The dataframes have the same column names but not necessarily the same data type;
The dataframes do not necessarily have the same number of rows;
If a row does not find its Index match, it should not be returned in the new dataframe;
The rows of the dataframes can be matched by a column named "Index".
So, for example, we would have these dataframes (just a slice of the real ones, as the real data has 63 columns):
df1:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com
3 MKT $7600 Maria d 30-06-2021
4 I'T 8000 Peter az#i.com 14-07-2021
df2:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
5 IT 9000 John NOT PROVIDED
6 IT 9900 John NOT PROVIDED
df3:
Index Department Salary Manager Email Start_Date
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
The differences in this example are:
Start date added in row of Index 2
Salary format corrected and email corrected for row Index 3
Department format corrected for row Index 4
What would be the best way to do this comparison?
I am not sure if there is an easy way to see what changed in each field, but even returning a dataframe of the rows that had at least one change would be helpful.
Thank you for the support!
I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows between the old and new dataframes via the index:
new_df_to_compare = new_df.loc[old_df.index]
When the datatypes don't match, you would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)
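Putting those steps together, a sketch of the whole pipeline under the question's setup (old_df and new_df are placeholder names), matching rows on the "Index" column and dropping rows without a match:
import pandas as pd

# Match rows by the "Index" column rather than by position.
old = old_df.set_index("Index")
new = new_df.set_index("Index")

# Keep only rows present in both months; unmatched rows are dropped.
common = old.index.intersection(new.index)
old, new = old.loc[common], new.loc[common]

# Align dtypes, then compare: the result keeps only rows with at least
# one change and shows, per changed column, the "self" (old) and
# "other" (new) values.
new = new.astype(old.dtypes.to_dict())
difference_df = old.compare(new)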

Compare two columns A, B from two pandas dataframes df1 and df2 and create new column to df1 [duplicate]

This question already has answers here: Pandas Merging 101 (closed as a duplicate).
I have two dataframes (df1 and df2) of different size.
df1
NUTS lvl2 Code
Greece GR
Italy IT
Germany DE
and df2
Ports Code Year
Patras GR 2010
Lefkada GR 2010
Bergamo IT 2010
Palermo IT 2011
I am trying to figure out how to transform df2 as follows:
Ports Code Year NUTS lvl2
Patras GR 2010 Greece
Lefkada GR 2010 Greece
Bergamo IT 2010 Italy
Palermo IT 2011 Italy
Could you please help me figure this out? I have tried "where" with conditions, and inner joins, but no luck so far.
df3 = df2.merge(df1, on='Code', how='left')
This is the same as a left join in SQL: you want all the rows of one dataframe (df2 in this case), matched on the Code column against its corresponding column in df1. The on argument names a column shared by the two dataframes (it must have the same name in both), and how='left' keeps all the rows of the left dataframe. Whichever dataframe you want to keep all the rows of goes first in the syntax, e.g. df2.merge(). An inner join would have given you the same result here, because the Code column in df2 only contains GR and IT, which both appear in df1; in layman's terms, an inner join keeps what the two columns have in common.
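A minimal sketch with the frames from the question (column names copied from the post, including the space in "NUTS lvl2"), showing that the left and inner joins agree here:
import pandas as pd

df1 = pd.DataFrame({"NUTS lvl2": ["Greece", "Italy", "Germany"],
                    "Code": ["GR", "IT", "DE"]})
df2 = pd.DataFrame({"Ports": ["Patras", "Lefkada", "Bergamo", "Palermo"],
                    "Code": ["GR", "GR", "IT", "IT"],
                    "Year": [2010, 2010, 2010, 2011]})

left = df2.merge(df1, on="Code", how="left")
inner = df2.merge(df1, on="Code", how="inner")
# Both produce the same four rows, because every Code in df2 also
# exists in df1; they would differ if df2 had an unmatched Code.
print(left.equals(inner))  # True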

how to select a group from groupby dataframe using pandas

I have a dataframe with multilevel index (company, year) that grouped by mean, looks like this:
company year mean salary
ABC 2018 3000
2019 3400
LOL 2018 1200
2019 3500
I want to select the data belongs to "LOL", my desired outcome would be:
company year mean salary
LOL 2018 1200
2019 3500
Is there a way to select only a certain group? I tried the .filter function on the dataframe, but I was only able to apply it to row values (e.g. lambda x: x > 1000), not to an index value.
Any advice will be appreciated!
Use DataFrame.xs with drop_level=False to avoid dropping the first level:
df1 = df.xs('LOL', drop_level=False)
Or filter by first level with Index.get_level_values:
df1 = df[df.index.get_level_values(0) == 'LOL']
print(df1)
mean salary
company year
LOL 2018 1200
2019 3500
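A runnable sketch of both options, rebuilding the grouped frame first (after the mean the column is simply named salary):
import pandas as pd

df = pd.DataFrame({
    "company": ["ABC", "ABC", "LOL", "LOL"],
    "year": [2018, 2019, 2018, 2019],
    "salary": [3000, 3400, 1200, 3500],
})
grouped = df.groupby(["company", "year"]).mean()

print(grouped.xs("LOL", drop_level=False))                  # keeps the company level
print(grouped[grouped.index.get_level_values(0) == "LOL"])  # same rows via boolean mask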

Pandas sum by groupby, but exclude certain columns

What is the best way to do a groupby on a Pandas dataframe, but exclude some columns from that groupby? e.g. I have the following dataframe:
Code Country Item_Code Item Ele_Code Unit Y1961 Y1962 Y1963
2 Afghanistan 15 Wheat 5312 Ha 10 20 30
2 Afghanistan 25 Maize 5312 Ha 10 20 30
4 Angola 15 Wheat 7312 Ha 30 40 50
4 Angola 25 Maize 7312 Ha 30 40 50
I want to group by the columns Country and Item_Code and only compute the sum of the values under the columns Y1961, Y1962 and Y1963. The resulting dataframe should look like this:
Code Country Item_Code Item Ele_Code Unit Y1961 Y1962 Y1963
2 Afghanistan 15 C3 5312 Ha 20 40 60
4 Angola 25 C4 7312 Ha 60 80 100
Right now I am doing this:
df.groupby('Country').sum()
However this adds up the values in the Item_Code column as well. Is there any way I can specify which columns to include in the sum() operation and which ones to exclude?
You can select the columns of a groupby:
In [11]: df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum()
Out[11]:
Y1961 Y1962 Y1963
Country Item_Code
Afghanistan 15 10 20 30
25 10 20 30
Angola 15 30 40 50
25 30 40 50
Note that the list passed must be a subset of the columns otherwise you'll see a KeyError.
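If you are not sure all of those columns are present, one option is to intersect the list with the actual columns first (a small sketch, not from the original answer):
# Hypothetical guard against missing year columns.
cols = df.columns.intersection(["Y1961", "Y1962", "Y1963"])
df.groupby(["Country", "Item_Code"])[list(cols)].sum()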
The agg function will do this for you. Pass the columns and functions as a dict of column: aggregation:
df.groupby(['Country', 'Item_Code']).agg({'Y1961': 'sum', 'Y1962': ['sum', 'mean']})  # example: two output columns from a single input column
This will display only the group-by columns and the specified aggregate columns. In this example I included two agg functions applied to 'Y1962'.
To get exactly what you hoped to see, include the other columns in the group by, and apply sums to the Y variables in the frame:
df.groupby(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit']).agg({'Y1961': 'sum', 'Y1962': 'sum', 'Y1963': 'sum'})
If you are looking for a more generalized way to apply this to many columns, you can build a list of column names and use it to select from the grouped dataframe. In your case, for example:
columns = ['Y' + str(year) for year in range(1961, 2011)]
df.groupby('Country')[columns].agg('sum')
If you want to add a suffix/prefix to the aggregated column names, use add_suffix() / add_prefix().
df.groupby(["Code", "Country"])[["Y1961", "Y1962", "Y1963"]].sum().add_suffix("_total")
If you want to retain Code and Country as columns after aggregation, set as_index=False in groupby() or use reset_index().
df.groupby(["Code", "Country"], as_index=False)[["Y1961", "Y1962", "Y1963"]].sum()
# df.groupby(["Code", "Country"])[["Y1961", "Y1962", "Y1963"]].sum().reset_index()
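Putting it together, a minimal runnable sketch with the question's sample data:
import pandas as pd

df = pd.DataFrame({
    "Code": [2, 2, 4, 4],
    "Country": ["Afghanistan", "Afghanistan", "Angola", "Angola"],
    "Item_Code": [15, 25, 15, 25],
    "Item": ["Wheat", "Maize", "Wheat", "Maize"],
    "Ele_Code": [5312, 5312, 7312, 7312],
    "Unit": ["Ha"] * 4,
    "Y1961": [10, 10, 30, 30],
    "Y1962": [20, 20, 40, 40],
    "Y1963": [30, 30, 50, 50],
})

out = df.groupby(["Code", "Country"], as_index=False)[["Y1961", "Y1962", "Y1963"]].sum()
print(out)
#    Code      Country  Y1961  Y1962  Y1963
# 0     2  Afghanistan     20     40     60
# 1     4       Angola     60     80    100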
