Combining, summing, and renaming rows in a dataframe - python

I have the following data:
| code | name   | qty |
|------|--------|-----|
| FZH  | apple  | 3   |
| ZH2  | orange | 7   |
| H26  | mt dew | 5   |
| 6YS  | pear   | 7   |
| LKZ  | coke   | 4   |
Using pandas, I want to sum the apple, orange, and pear rows and write:
| 2DC | FRUIT | 17 |
The actual list of fruit is a lot longer. "Bananas" is sometimes on the list for the day, and I want it summed too, but skipped when it isn't there.
I want to do the same for all of the sodas, using my list of possible sodas needed for the day.

It looks like you're trying to filter, then sum. You can use loc, isin, and sum:
fruits = ["apple", "orange", "pear"]
df.loc[df["name"].isin(fruits)].qty.sum()
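If you also need to write the new summary row, a minimal sketch (the "2DC"/"FRUIT" labels come from the question; the same pattern works for the soda list, and names missing from today's data, such as bananas, are simply skipped by isin):
import pandas as pd

fruits = ["apple", "orange", "pear", "banana"]  # extra names are skipped by isin
total = df.loc[df["name"].isin(fruits), "qty"].sum()

# Append the summary row with the labels from the question
df = pd.concat(
    [df, pd.DataFrame([{"code": "2DC", "name": "FRUIT", "qty": total}])],
    ignore_index=True,
)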

How to find the sum of a dataframe?

While finding the sum as follows:
g.loc[g.index[0], 'sum'] = g[RDM].sum()
where RDM is
RDM = [f"R_Dist_meas_{i}" for i in range(48)]
the error was as follows:
KeyError: "None of [Index(['R_Dist_meas_0', 'R_Dist_meas_1', 'R_Dist_meas_2',\n .........................'R_Dist_meas_45', 'R_Dist_meas_46', 'R_Dist_meas_47'],\n dtype='object')] are in the [columns]"
The sample dataframe is as follows; it has many other columns besides distance (angle, velocity, etc.).
The format of the dataframe is A0B0C0 A1B1C1 A2B2C2 ... A47B47C47.
| R_Dist_meas_0 | R_vel_meas_0 | R_Dist_meas_1 | R_vel_meas_1 | R_Dist_meas_2 | R_vel_meas_2 | ... | R_Dist_meas_47 | R_vel_meas_47 |
|---------------|--------------|---------------|--------------|---------------|--------------|-----|----------------|---------------|
| 5             |              |               |              |               |              |     |                |               |
|               |              |               |              | 10            |              |     |                |               |
|               |              |               |              | 8             |              |     |                |               |
| 2             |              | 8             |              |               |              |     |                |               |
The sum = 33.
How do I solve it?
Your RDM list goes out of bounds for this dataframe: it only has columns up to R_Dist_meas_2, so indexing with all 48 labels raises the KeyError. Note also that those labels select columns, not rows.
sum(g.iloc[:, ::2].sum())
The ::2 slice selects every other column, i.e. the distance columns. The inner .sum() totals each of those columns separately, and the outer sum() adds the totals for the final result. This should give you the sum you are looking for.
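If the distance and velocity columns are not strictly interleaved, a sketch that selects the distance columns by name instead of by position (only the columns actually present in g are summed, so the missing R_Dist_meas_3 .. R_Dist_meas_47 are no problem):
# Select only the distance columns that actually exist in this dataframe
dist_cols = [c for c in g.columns if c.startswith("R_Dist_meas_")]
total = g[dist_cols].sum().sum()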

How to find and group similar terms in a dataframe in order to sum their values?

I have data like this:
| Term | Value|
| -------- | -----|
| Apple | 100 |
| Appel | 50 |
| Banana | 200 |
| Banan | 25 |
| Orange | 140 |
| Pear | 75 |
| Lapel | 10 |
Currently, I am using the following code:
import difflib

matches = []
for term in terms:
    tlist = difflib.get_close_matches(term, terms, cutoff=0.80, n=5)
    matches.append(tlist)
df["terms"] = matches
The output is like this:
| Term | Value|
| --------------------- | -----|
| [Apple, Appel] | 100 |
| [Appel, Apple, Lapel] | 50 |
| [Banana, Banan] | 200 |
| [Banan, Banana] | 25 |
| [Orange] | 140 |
| [Pear] | 75 |
| [Lapel, Appel] | 10 |
This code isn't really helpful. My desired output is something like:
| Term | Value|
| -------- | -----|
| Apple | 150 |
| Banana | 225 |
| Orange | 140 |
| Pear | 75 |
| Lapel | 10 |
The main issue is that the lists aren't in the same order, and often there are only one or two words of overlap between the lists. For example, I might have
[apple, appel]
[appel, apple, lapel]
Ideally, I would like both of these to return "apple", because that has the highest value among the overlapping terms.
Is there a way to do this?
One simple way to achieve your goal is to use the Python standard library difflib module, which provides helpers for computing deltas, like this:
from difflib import SequenceMatcher
import pandas as pd

# Toy dataframe
df = pd.DataFrame(
    {
        "Term": ["Apple", "Appel", "Banana", "Banan", "Orange", "Pear", "Lapel"],
        "Value": [100, 50, 200, 25, 140, 75, 10],
    }
)

KEY_TERMS = ("Apple", "Banana", "Orange", "Pear")

for i, row in df.copy().iterrows():
    # Get the similarity ratio between the value in the df "Term" column (row[0])
    # and each term from KEY_TERMS, and store the pair "term: ratio" in a dict
    similarities = {
        term: SequenceMatcher(None, row[0], term).ratio() for term in KEY_TERMS
    }
    # Find the key term for which the similarity ratio is maximum
    # and use it to replace the original term in the dataframe
    df.loc[i, "Term"] = max(similarities, key=lambda key: similarities[key])

# Group by term and sum values
df = df.groupby("Term").agg("sum").reset_index()
Then:
print(df)
# Outputs
     Term  Value
0   Apple    160
1  Banana    225
2  Orange    140
3    Pear     75
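Note that "Lapel" is absorbed into "Apple" here (its closest key term), which is why the output shows 160 rather than the 150 in the desired output. If terms without a convincing match should keep their own name, a sketch that swaps the assignment inside the loop for a cutoff check (0.80 mirrors the cutoff in the question's get_close_matches call):
# Inside the for-loop, replacing the df.loc[i, "Term"] assignment:
best = max(similarities, key=similarities.get)
if similarities[best] >= 0.80:  # below the cutoff, keep the original term
    df.loc[i, "Term"] = best
With this change "Lapel" (ratio 0.6 against "Apple") stays separate, while "Appel" (ratio 0.8) is still merged, matching the desired output.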

Group by and choose the string value of a column based on a condition using pandas

I have a dataframe consisting of people: ('id','name','occupation').
| id | name | occupation |
|:--:|:--------:|:----------:|
| 1 | John | artist |
| 1 | John | painter |
| 2 | Mary | consultant |
| 3 | Benjamin | architect |
| 3 | Benjamin | manager |
| 4 | Alice | intern |
| 4 | Alice | architect |
Task:
Some people have multiple occupations; however, I need each person to have only one. For this I am trying to use the pandas groupby function.
Issue:
So far so good; however, I need to apply a condition based on their occupation, and this is where I got stuck.
The condition is simple:
if "architect" is in the 'occupation' of the group (person):
   keep the 'occupation' as "architect"
else:
   keep any/last/first (it doesn't matter) 'occupation'
The desired output would be:
| id | name | occupation |
|:--:|:--------:|:----------:|
| 1 | John | artist |
| 2 | Mary | consultant |
| 3 | Benjamin | architect |
| 4 | Alice | architect |
Attempt:
def one_occupation_per_person(occupation):
    if "architect" in occupation:
        return "architect"
    else:
        return ???

df.groupby(['id','name'])['occupation'].apply(lambda x: one_occupation_per_person(x['occupation']), axis=1)
I hope this describes the issue clear enough. Any hints and ideas are appreciated!
Since "architect" comes out as the first item in a natural sort, you can simply sort on occupation and then groupby:
df.sort_values("occupation").groupby("id", as_index=False).first()
If you somehow had another occupation that sorts before architect, you can convert the column to pd.Categorical before sorting:
s = ["architect"] + df.loc[df["occupation"].ne("architect"),"occupation"].unique().tolist()
df["occupation"] = pd.Categorical(df["occupation"], ordered=True, categories=s)
print (df.sort_values("occupation").groupby("id", as_index=False).first())
Result:
   id      name occupation
0   1      John     artist
1   2      Mary consultant
2   3  Benjamin  architect
3   4     Alice  architect
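If you'd rather keep the question's if/else logic explicit, a sketch using a custom aggregation (this completes the asker's one_occupation_per_person, using the first occupation as the fallback):
def one_occupation_per_person(occupations):
    # Prefer "architect" when present; otherwise keep the first occupation
    return "architect" if occupations.eq("architect").any() else occupations.iloc[0]

df.groupby(["id", "name"], as_index=False)["occupation"].agg(one_occupation_per_person)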

Is there any way to rearrange excel data without copy paste?

I have an Excel file that contains country names and dates as column names.
+---------+------------+------------+------------+
| country | 20/01/2020 | 21/01/2020 | 22/01/2020 |
+---------+------------+------------+------------+
| us      | 0          | 5          | 6          |
+---------+------------+------------+------------+
| Italy   | 20         | 23         | 33         |
+---------+------------+------------+------------+
| India   | 0          | 0          | 6          |
+---------+------------+------------+------------+
But I need to rearrange it into the columns country, date, and count. Is there any way to rearrange Excel data without copy-paste?
The final Excel sheet needs to look like this:
+---------+------------+-------+
| country | date       | count |
+---------+------------+-------+
| us      | 20/01/2020 | 0     |
+---------+------------+-------+
| us      | 21/01/2020 | 5     |
+---------+------------+-------+
| us      | 22/01/2020 | 6     |
+---------+------------+-------+
| Italy   | 20/01/2020 | 20    |
+---------+------------+-------+
| Italy   | 21/01/2020 | 23    |
+---------+------------+-------+
| Italy   | 22/01/2020 | 33    |
+---------+------------+-------+
| India   | 20/01/2020 | 0     |
+---------+------------+-------+
Unpivot using Power Query:
Data --> Get & Transform --> From Table/Range
Select the country column
Unpivot Other columns
Rename the resulting Attribute and Value columns to date and count
Because the dates in the header are turned into text, you may need to change the date column's type to Date, or, as I did, to Date using locale.
M-Code
let
    Source = Excel.CurrentWorkbook(){[Name="Table2"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"country", type text}, {"20/01/2020", Int64.Type}, {"21/01/2020", Int64.Type}, {"22/01/2020", Int64.Type}}),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Changed Type", {"country"}, "date", "count"),
    #"Changed Type with Locale" = Table.TransformColumnTypes(#"Unpivoted Other Columns", {{"date", type date}}, "en-150")
in
    #"Changed Type with Locale"
Power Query is the best way, but if you want to use formulas:
In F1 enter:
=INDEX($A$2:$A$4,ROUNDUP(ROWS($1:1)/3,0))
and copy downward. In G1 enter:
=INDEX($B$1:$D$1,MOD(ROWS($1:1)-1,3)+1)
and copy downward. In H1 enter:
=INDEX($B$2:$D$4,ROUNDUP(ROWS($1:1)/3,0),MOD(ROWS($1:1)-1,3)+1)
and copy downward.
The 3 in these formulas is there because we have 3 dates in the original table.
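If you'd rather do the reshaping outside Excel, a pandas sketch using melt (assuming the sheet loads cleanly with pd.read_excel; "data.xlsx" is a placeholder filename):
import pandas as pd

df = pd.read_excel("data.xlsx")  # placeholder filename

# Unpivot: one row per (country, date) pair
long_df = df.melt(id_vars="country", var_name="date", value_name="count")
long_df.to_excel("data_long.xlsx", index=False)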

groupby: how to show max(field1) and the value of field2 corresponding to max(field1)?

Let's say I have a table with 3 fields: client, city, sales, with sales being a float.
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | NY | 0 |
| a | LA | 1 |
| a | London | 2 |
| b | NY | 3 |
| b | LA | 4 |
| b | London | 5 |
+--------+--------+-------+
For each client, I would like to show which city has the greatest sales and what those sales are, i.e. I want this output:
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | London | 2 |
| b | London | 5 |
+--------+--------+-------+
Any suggestions?
This table can be generated with:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['client'] = np.repeat(['a', 'b'], 3)
df['city'] = np.tile(['NY', 'LA', 'London'], 2)
df['sales'] = np.arange(0, 6)
The following is wrong because it also takes the 'maximum' of the city, showing NY because it considers that N > L:
max_by_id = df.groupby('client').max()
I can first create a dataframe with the highest sales, and then merge it with the initial dataframe to retrieve the city; it works, but I was wondering if there is a faster / more elegant way?
out = pd.merge( df, max_by_id, how='inner' ,on=['client','sales'] )
I remember doing something similar with cross apply statements in SQL but wouldn't know how to run a Pandas equivalent.
You need to sort by sales, then groupby client and pick the first row:
df.sort_values(['sales'], ascending=False).groupby('client').first().reset_index()
OR
As @user3483203 suggested:
df.loc[df.groupby('client')['sales'].idxmax()]
Output:
  client    city  sales
0      a  London      2
1      b  London      5
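A minimal end-to-end check, reusing the question's own construction code:
import numpy as np
import pandas as pd

# Rebuild the toy dataframe from the question
df = pd.DataFrame({
    "client": np.repeat(["a", "b"], 3),
    "city": np.tile(["NY", "LA", "London"], 2),
    "sales": np.arange(0, 6),
})

# For each client, keep the row holding the maximum sales
print(df.loc[df.groupby("client")["sales"].idxmax()])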
