I'm not sure I titled this post correctly, but I have a situation where I want to append new rows to an existing DataFrame, where each new row is the sum of existing rows, and I'm not sure where to start.
For example, I have the following DataFrame:
import pandas as pd
data = {'Team': ['Atlanta', 'Atlanta', 'Cleveland', 'Cleveland'],
'Position': ['Defense', 'Kicker', 'Defense', 'Kicker'],
'Points': [5, 10, 15, 20]}
df = pd.DataFrame(data)
print(df)
Team Position Points
0 Atlanta Defense 5
1 Atlanta Kicker 10
2 Cleveland Defense 15
3 Cleveland Kicker 20
How do I create/append a new row for each team, with a new position whose points are the sum of the two existing positions for that team? Additionally, the full dataset contains several more teams, so I'm looking for a solution that works for any number of teams.
Edit: I forgot to mention that there are other positions in the complete DataFrame, but I only want this applied to the positions "Defense" and "Kicker".
My desired output is below.
Team Position Points
0 Atlanta Defense 5
1 Atlanta Kicker 10
2 Cleveland Defense 15
3 Cleveland Kicker 20
4 Atlanta Defense + Kicker 15
5 Cleveland Defense + Kicker 35
Thanks in advance!
We can use groupby with agg to create the summary rows, then append them to the DataFrame:
df = df.append(df.groupby('Team', as_index=False).agg({
'Position': ' + '.join, # Concat Strings together
'Points': 'sum' # Total Points
}), ignore_index=True)
df:
Team Position Points
0 Atlanta Defense 5
1 Atlanta Kicker 10
2 Cleveland Defense 15
3 Cleveland Kicker 20
4 Atlanta Defense + Kicker 15
5 Cleveland Defense + Kicker 35
We can also whitelist certain positions by filtering df before groupby to aggregate only the desired positions:
whitelisted_positions = ['Kicker', 'Defense']
df = df.append(
df[df['Position'].isin(whitelisted_positions)]
.groupby('Team', as_index=False).agg({
'Position': ' + '.join, # Concat Strings together
'Points': 'sum' # Total Points
}), ignore_index=True
)
pandas.DataFrame.append has been deprecated since version 1.4.0 (and removed in pandas 2.0). Henry Ecker's neat solution just needs a slight tweak to use concat instead.
whitelisted_positions = ['Kicker', 'Defense']
df = pd.concat([df,
df[df['Position'].isin(whitelisted_positions)]
.groupby('Team', as_index=False).agg({
'Position': ' + '.join, # Concat Strings together
'Points': 'sum' # Total Points
})],
ignore_index=True)
I have a dataframe with inventory and purchases across multiple stores and regions. I am trying to stack the dataframe using melt, but I need to have two value columns, inventory and purchases, and can't figure out how to do that. The dataframe looks like this:
Region | Store | Inventory_Item_1 | Inventory_Item_2 | Purchase_Item_1 | Purchase_Item_2
------------------------------------------------------------------------------------------------------
North A 15 20 5 6
North B 20 25 7 8
North C 18 22 6 10
South D 10 15 9 7
South E 12 12 10 8
The format I am trying to get the dataframe into looks like this:
Region | Store | Item | Inventory | Purchases
-----------------------------------------------------------------------------
North A Inventory_Item_1 15 5
North A Inventory_Item_2 20 6
North B Inventory_Item_1 20 7
North B Inventory_Item_2 25 8
North C Inventory_Item_1 18 6
North C Inventory_Item_2 22 10
South D Inventory_Item_1 10 9
South D Inventory_Item_2 15 7
South E Inventory_Item_1 12 10
South E Inventory_Item_2 12 8
This is what I have written, but I don't know how to create columns for Inventory and Purchases. Note that my full dataframe is considerably larger (50+ regions, 140+ stores, 15+ items).
df_1 = df.melt(id_vars = ['Store','Region'],value_vars = ['Inventory_Item_1','Inventory_Item_2'])
Any help or advice would be appreciated!
I would do this with hierarchical indexes on the rows and columns.
For the rows, you can set_index(['Region', 'Store']) easily enough.
You have to get a little tricksy for the columns though. Since you need access to the non-index columns that result from setting the index on Region and Store, you need to pipe it to a custom function that builds the desired tuples and creates a named multi-level column index.
After that, you can stack the columns into the row index and optionally reset the full row index to make everything a normal column again.
df = pd.DataFrame({
'Region': ['North', 'North', 'North', 'South', 'South'],
'Store': ['A', 'B', 'C', 'D', 'E'],
'Inventory_Item_1': [15, 20, 18, 10, 12],
'Inventory_Item_2': [20, 25, 22, 15, 12],
'Purchase_Item_1': [5, 7, 6, 9, 10],
'Purchase_Item_2': [6, 8, 10, 7, 8]
})
output = (
df.set_index(['Region', 'Store'])
.pipe(lambda df:
df.set_axis(df.columns.str.split('_', n=1, expand=True), axis='columns')
)
.rename_axis(['Status', 'Product'], axis='columns')
.stack(level='Product')
.reset_index()
)
Which gives me:
Region Store Product Inventory Purchase
North A Item_1 15 5
North A Item_2 20 6
North B Item_1 20 7
North B Item_2 25 8
North C Item_1 18 6
North C Item_2 22 10
South D Item_1 10 9
South D Item_2 15 7
South E Item_1 12 10
South E Item_2 12 8
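If you want the headers to match the question exactly (Item and Purchases instead of Product and Purchase), an optional rename at the end does it; this is just a cosmetic follow-up, not part of the reshape itself:
# optional: align column names with the desired output in the question
output = output.rename(columns={'Product': 'Item', 'Purchase': 'Purchases'})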
You can use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install the latest dev version of pyjanitor
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.pivot_longer(
index=["Region", "Store"],
names_to=(".value", "item"),
names_pattern=r"(Inventory|Purchase)_(.+)",
sort_by_appearance=True,
)
Region Store item Inventory Purchase
0 North A Item_1 15 5
1 North A Item_2 20 6
2 North B Item_1 20 7
3 North B Item_2 25 8
4 North C Item_1 18 6
5 North C Item_2 22 10
6 South D Item_1 10 9
7 South D Item_2 15 7
8 South E Item_1 12 10
9 South E Item_2 12 8
It works by passing a regex containing groups to the names_pattern parameter. The '.value' in names_to ensures that Inventory and Purchase are kept as column headers, while the other group (Item_1 and Item_2) is collated into a new column, item.
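For comparison, a similar reshape is possible in plain pandas with pd.wide_to_long; this is only a sketch using the sample df from above, not part of the pyjanitor answer:
import pandas as pd

# "Inventory" and "Purchase" act as stubnames; everything after the first "_" becomes the item label
out = pd.wide_to_long(
    df,
    stubnames=["Inventory", "Purchase"],
    i=["Region", "Store"],
    j="item",
    sep="_",
    suffix=r".+",
).reset_index()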
You can get there with these steps:
# Please always provide minimal working code - otherwise helpers and answerers
# have to invest extra time just to reconstruct a starting DataFrame,
# on top of the time already spent solving the actual problem:
df = pd.DataFrame([
["North","A",15,20,5,6],
["North","B",20,25,7,8],
["North","C",18,22,6,10],
["South","D",10,15,9,7],
["South","E",12,12,10,8]], columns=["Region","Store","Inventory_Item_1","Inventory_Item_2","Purchase_Item_1","Purchase_Item_2"])
# melt the dataframe completely first
df_final = pd.melt(df, id_vars=['Region', 'Store'], value_vars=['Inventory_Item_1', 'Inventory_Item_2', 'Purchase_Item_1', 'Purchase_Item_2'])
# extract inventory and purchase sub data frames
# they have in common the "variable" column (the item number!)
# so let it look exactly the same in both data frames by removing
# unnecessary parts
# use .copy() so the later column edits don't trigger a SettingWithCopyWarning
df_inventory = df_final.loc[df_final['variable'].str.startswith("Inventory"), :].copy()
df_inventory['variable'] = df_inventory['variable'].str.replace("Inventory_", "", regex=False)
df_purchase = df_final.loc[df_final['variable'].str.startswith("Purchase"), :].copy()
df_purchase['variable'] = df_purchase['variable'].str.replace("Purchase_", "", regex=False)
# deepcopy the data frames (just to keep old results so that you can inspect them)
df_purchase_ = df_purchase.copy()
df_inventory_ = df_inventory.copy()
# rename the columns to prepare for merging
df_inventory_.columns = ["Region", "Store", "variable", "Inventory"]
df_purchase_.columns = ["Region", "Store", "variable", "Purchase"]
# merge by the three common columns
df_final_1 = pd.merge(df_inventory_, df_purchase_, how="left", left_on=["Region", "Store", "variable"], right_on=["Region", "Store", "variable"])
# sort by the three common columns
df_final_1 = df_final_1.sort_values(by=["Region", "Store", "variable"], axis=0)
This returns
Region Store variable Inventory Purchase
0 North A Item_1 15 5
5 North A Item_2 20 6
1 North B Item_1 20 7
6 North B Item_2 25 8
2 North C Item_1 18 6
7 North C Item_2 22 10
3 South D Item_1 10 9
8 South D Item_2 15 7
4 South E Item_1 12 10
9 South E Item_2 12 8
I am trying to sort DataFrame column values in conjunction with value_counts.
Below is a code snippet of my algorithm:
with open(f_out_txt_2, 'w', encoding='utf-8') as f_txt_out_2:
    f_txt_out_2.write("SORTED First Names w/SORTED value counts:\n")
    for val, cnt in df['First Name'].value_counts(sort=True).items():
        f_txt_out_2.write("\n{0:9s} {1:2d}".format(val, cnt))
Below are the first few lines of output - note that the "First Name" values are not in alphabetical order.
How can I get the "First Name" values sorted while keeping value counts sorted?
Output:
SORTED First Names w/SORTED value counts:
Marilyn 11
Todd 10
Jeremy 10
Barbara 10
Sarah 9
Rose 9
Kathy 9
Steven 9
Irene 9
Cynthia 9
Carl 8
Alice 8
Justin 8
Bobby 8
Ruby 8
Gloria 8
Julie 8
Clarence 8
Harry 8
Andrea 8
....
Unfortunately I can't find the original source link of where I downloaded the "employee.csv" file from.
I believe you would use the following code to sort by value counts (descending) and then alphabetically by first name within equal counts:
dfg = (df.groupby('First Name')
         .agg(value_count=('First Name', 'count'))
         .sort_values(by=['value_count', 'First Name'], ascending=[False, True]))
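A minimal sketch of how this could feed the original write loop; f_out_txt_2, df and the format string are taken from the question, the rest is only illustrative:
# value_counts gives counts in descending order; turning it into a frame lets us
# break ties alphabetically by name before writing
counts = (
    df['First Name']
    .value_counts()
    .rename_axis('First Name')
    .reset_index(name='value_count')
    .sort_values(by=['value_count', 'First Name'], ascending=[False, True])
)

with open(f_out_txt_2, 'w', encoding='utf-8') as f_txt_out_2:
    f_txt_out_2.write("SORTED First Names w/SORTED value counts:\n")
    for val, cnt in zip(counts['First Name'], counts['value_count']):
        f_txt_out_2.write("\n{0:9s} {1:2d}".format(val, cnt))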
Below is my dataframe:
Sno Name Region Num
0 1 Rubin Indore 79744001550
1 2 Rahul Delhi 89824304549
2 3 Rohit Noida 91611611478
3 4 Chirag Delhi 85879761557
4 5 Shan Bharat 95604535786
5 6 Jordi Russia 80777784005
6 7 El Russia 70008700104
7 8 Nino Spain 87707101233
8 9 Mark USA 98271377772
9 10 Pattinson Hawk Eye 87888888889
Retrieve the numbers from the given CSV file and store them region-wise.
delhi_list = []
for i in range(len(data)):
    if data.loc[i]['Region'] == 'Delhi':
        delhi_list.append(data.loc[i]['Num'])
I am getting the results, but I want to build this as a dictionary in Python. Can I do that?
IIUC, you can use groupby, apply the list aggregation then use to_dict:
data.groupby('Region')['Num'].apply(list).to_dict()
[out]
{'Bharat': [95604535786],
'Delhi': [89824304549, 85879761557],
'Hawk Eye': [87888888889],
'Indore': [79744001550],
'Noida': [91611611478],
'Russia': [80777784005, 70008700104],
'Spain': [87707101233],
'USA': [98271377772]}
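As a quick usage note (the name result is just for illustration), the resulting dictionary can then be queried per region:
result = data.groupby('Region')['Num'].apply(list).to_dict()

# all numbers for Delhi, or an empty list if the region is missing
delhi_list = result.get('Delhi', [])
print(delhi_list)  # [89824304549, 85879761557]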
I have a comma-delimited CSV in which one column holds multiple pipe-delimited values. I need to map those to another column that also holds multiple pipe-delimited values, then give each pair its own row along with the data in the original row that doesn't have multiple values. My CSV looks like this (with commas between the categories):
row name city amount
1 frank | john | dave toronto | new york | anaheim 10
2 george | joe | fred fresno | kansas city | reno 20
I need it to look like this:
row name city amount
1 frank toronto 10
2 john new york 10
3 dave anaheim 10
4 george fresno 20
5 joe kansas city 20
6 fred reno 20
Maybe not the nicest, but a working solution (it handles rows without pipes and rows whose pipe lengths differ):
df = pd.read_csv('<your_data>.csv')
str_split = ' | '
# Calculate maximum length of piped (' | ') values
df['max_len'] = df[['name', 'city']].apply(lambda x: max(len(x['name'].split(str_split)),
                                                          len(x['city'].split(str_split))), axis=1)
max_len = df['max_len'].max()
# Split '|' piped cell values into columns (needed at unpivot step)
# Create as many new 'name_<x>' & 'city_<x>' columns as 'max_len'
df[['name_{}'.format(i) for i in range(max_len)]] = df['name'].apply(lambda x: \
pd.Series(x.split(str_split)))
df[['city_{}'.format(i) for i in range(max_len)]] = df['city'].apply(lambda x: \
pd.Series(x.split(str_split)))
# Unpivot 'name_<x>' & 'city_<x>' columns into rows
df_pv_name = pd.melt(df, value_vars=['name_{}'.format(i) for i in range(max_len)],
id_vars=['amount'])
df_pv_city = pd.melt(df, value_vars=['city_{}'.format(i) for i in range(max_len)],
id_vars=['amount'])
# Rename upivoted columns (these are the final columns)
df_pv_name = df_pv_name.rename(columns={'value':'name'})
df_pv_city = df_pv_city.rename(columns={'value':'city'})
# Rename 'city_<x>' values (rows) to be 'key' for join (merge)
df_pv_city['variable'] = df_pv_city['variable'].map({'city_{}'.format(i):'name_{}'\
.format(i) for i in range(max_len)})
# Join unpivoted 'name' & 'city' dataframes
df_res = df_pv_name.merge(df_pv_city, on=['variable', 'amount'])
# Drop the 'variable' column, plus NULL rows that appear when the original rows have unequal pipe lengths
# (to drop rows where either value is NULL, change how='all' to how='any')
df_res = df_res.drop(['variable'], axis=1).dropna(subset=['name', 'city'], how='all',
axis=0).reset_index(drop=True)
The result is:
amount name city
0 10 frank toronto
1 20 george fresno
2 10 john new york
3 20 joe kansas city
4 10 dave anaheim
5 20 fred reno
Another test input:
name city amount
0 frank | john | dave | joe | bill toronto | new york | anaheim | los angeles | caracas 10
1 george | joe | fred fresno | kansas city 20
2 danny miami 30
Result of this test (if you don't want NaN rows, change how='all' to how='any' in the dropna step):
amount name city
0 10 frank toronto
1 20 george fresno
2 30 danny miami
3 10 john new york
4 20 joe kansas city
5 10 dave anaheim
6 20 fred NaN
7 10 joe los angeles
8 10 bill caracas
Given a row:
['1','frank|joe|dave', 'toronto|new york|anaheim', '20']
you can use
itertools.zip_longest(*[value.split('|') for value in row])
on it to obtain the following structure:
[('1', 'frank', 'toronto', '20'),
(None, 'joe', 'new york', None),
(None, 'dave', 'anaheim', None)]
Here we want to replace every None value with the last seen value in the corresponding column, which can be done while looping over the result.
So given a TSV whose lines are already split on tabs, the following code should do the trick:
import itertools

def flatten_tsv(lines):
    result = []
    for line in lines:
        flat_lines = itertools.zip_longest(*[value.split('|') for value in line])
        for flat_line in flat_lines:
            # fill None with the last seen value in that column
            result.append([result[-1][i] if v is None else v
                           for i, v in enumerate(flat_line)])
    return result
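A quick illustration of how flatten_tsv behaves; the rows below are assumed to be lines already split on tabs, and the expected output is shown in the comments:
rows = [
    ['1', 'frank|john|dave', 'toronto|new york|anaheim', '10'],
    ['2', 'george|joe|fred', 'fresno|kansas city|reno', '20'],
]

for out_row in flatten_tsv(rows):
    print(out_row)
# ['1', 'frank', 'toronto', '10']
# ['1', 'john', 'new york', '10']
# ['1', 'dave', 'anaheim', '10']
# ['2', 'george', 'fresno', '20']
# ['2', 'joe', 'kansas city', '20']
# ['2', 'fred', 'reno', '20']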