Splitting Pandas Row Data into Multiple Rows without Adding Columns - python

I have some American Football data in a DataFrame like below:
import pandas as pd

df = pd.DataFrame({'Green Bay Packers': ['30-18-0', '5-37', '10-71'],
                   'Chicago Bears': ['45-26-1', '5-20', '10-107']},
                  index=['Att - Comp - Int', 'Sacked - Yds Lost', 'Penalties - Yards'])
                  Green Bay Packers Chicago Bears
Att - Comp - Int            30-18-0       45-26-1
Sacked - Yds Lost              5-37          5-20
Penalties - Yards             10-71        10-107
You can see above that each row contains multiple data points that need to be split off.
What I'd like to do is find some way to split the rows up so that each data point is its own row. The final output would look like:
          Green Bay Packers Chicago Bears
Att                      30            45
Comp                     18            26
Int                       0             1
Sacked                    5             5
Yds Lost                 37            20
Penalties                10            10
Yards                    71           107
Is there a way to do this efficiently? I tried some regex but it just turned into a mess. Sorry if my formatting isn't perfect... this is only the 2nd question I've ever posted here.

Try:
df = df.reset_index().apply(lambda x: x.str.split("-"))
df = pd.DataFrame(
    {c: df[c].explode().str.strip() for c in df.columns},
).set_index("index")
df.index.name = None
print(df)
Prints:
          Green Bay Packers Chicago Bears
Att                      30            45
Comp                     18            26
Int                       0             1
Sacked                    5             5
Yds Lost                 37            20
Penalties                10            10
Yards                    71           107
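For reference, Series.explode does the heavy lifting here: each element of a list-valued cell becomes its own row, and the index label repeats for each piece. A minimal sketch on a toy Series (the data is just an excerpt of the question's):

import pandas as pd

# Each list element becomes its own row; the index label repeats.
s = pd.Series([['30', '18', '0'], ['5', '37']],
              index=['Att - Comp - Int', 'Sacked - Yds Lost'])
print(s.explode())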

First reset the index, then stack all the columns and split them on "-". You can additionally apply a function to strip any leftover whitespace after the split. Then unstack, apply pd.Series.explode, reset the index, and finally drop the leftover unneeded column.
out = (df.reset_index()
         .stack().str.split('-').apply(lambda x: [i.strip() for i in x])
         .unstack()
         .apply(pd.Series.explode)
         .reset_index()
         .drop(columns='level_0'))
       index Green Bay Packers Chicago Bears
0        Att                30            45
1       Comp                18            26
2        Int                 0             1
3     Sacked                 5             5
4   Yds Lost                37            20
5  Penalties                10            10
6      Yards                71           107
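If you want the labels back as the (unnamed) index, as in the desired output, one small extra step should do it (the same pattern the other answers here use):

out = out.set_index('index').rename_axis(None)
print(out)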

Assuming you have the same number of splits for every row: with pandas >= 1.3.0, you can explode multiple columns at the same time:
df = df.reset_index().apply(lambda s: s.str.split(' *- *'))
df.explode(df.columns.tolist()).set_index('index')
          Green Bay Packers Chicago Bears
index
Att                      30            45
Comp                     18            26
Int                       0             1
Sacked                    5             5
Yds Lost                 37            20
Penalties                10            10
Yards                    71           107
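This answer leans on the stated assumption that every cell in a row splits into the same number of pieces; if that could be violated, a hedged pre-check (a sketch, run after the split but before the explode) avoids a confusing ValueError:

# After the split, every cell holds a list; .str.len() gives its length.
# Multi-column explode requires equal lengths within each row.
lengths = df.apply(lambda s: s.str.len())
assert lengths.nunique(axis=1).eq(1).all(), "ragged splits: explode would fail"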

Use .apply() on each column (including the index) and, for each column:
use .str.split() to split the data points, and
use .explode() to create a row for each split element.
df_out = (df.reset_index()
            .apply(lambda x: x.str.split(r'\s*-\s*').explode())
            .set_index('index').rename_axis(index=None)
          )
Result:
print(df_out)
          Green Bay Packers Chicago Bears
Att                      30            45
Comp                     18            26
Int                       0             1
Sacked                    5             5
Yds Lost                 37            20
Penalties                10            10
Yards                    71           107
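One caveat that applies to all of the approaches above: after splitting, every value is still a string. If you need numbers downstream, a one-line cast (a sketch) converts the whole frame:

# '30', '45', ... are strings after the split; cast every column to numeric.
df_out = df_out.apply(pd.to_numeric)
print(df_out.dtypes)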

Related

Pandas groupby: add a total for the numeric column and an average for the time column

A      B   C      Time             D
1  sandy  12  02:30:24    California
2  sandy  22  01:24:06    California
3  sunny   8  05:03:52  Rhode Island
4  sunny  32  07:03:25  Rhode Island
Required output
A            B       C      Time             D
1        sandy      12  02:30:24    California
2                   22  01:24:06    California
sandy Total         34  01:57:15
3        sunny       8  05:03:52  Rhode Island
4                   32  07:03:25  Rhode Island
sunny Total         40  06:03:38
Total               74  04:00:27
I want to add a total of the numeric columns at the end of each group, and an average of the time column (my actual data has two time columns).
You can generate the Total lines with .groupby() + .agg() and .assign(), and the Grand Total line with pd.Series(). Then append both to the original df with .append(), followed by .sort_index() to sort the totals back together with their matching values in column B ('sandy', 'sunny'):
df_total = (df.assign(Time=pd.to_timedelta(df['Time']))
              .groupby('B')[['C', 'Time']]
              .agg({'C': 'sum', 'Time': lambda x: str(x.mean().round('1s')).split()[-1]})
              .assign(A='Total: ', D='')
            )
df_grand_total = pd.Series({'A': '',
                            'C': df['C'].sum(),
                            'Time': str(pd.to_timedelta(df['Time']).mean().round('1s')).split()[-1],
                            'D': ''},
                           name='~Grand Total:')
df_final = (df.set_index('B')
              .append(df_total)
              .append(df_grand_total)
              .sort_index()
              .reset_index()
            )
Result:
print(df_total)
        C      Time       A D
B
sandy  34  01:57:15  Total:
sunny  40  06:03:38  Total:

print(df_grand_total)
A
C             74
Time    04:00:27
D
Name: ~Grand Total:, dtype: object

print(df_final)
               B       A   C      Time             D
0          sandy       1  12  02:30:24    California
1          sandy       2  22  01:24:06    California
2          sandy  Total:  34  01:57:15
3          sunny       3   8  05:03:52  Rhode Island
4          sunny       4  32  07:03:25  Rhode Island
5          sunny  Total:  40  06:03:38
6  ~Grand Total:          74  04:00:27
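Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current pandas, the two .append() calls above can be replaced with a single pd.concat; a sketch, otherwise identical to the code above:

# df_grand_total is a Series; .to_frame().T turns it into a one-row frame
# whose index label is the Series name ('~Grand Total:').
df_final = (pd.concat([df.set_index('B'), df_total, df_grand_total.to_frame().T])
              .sort_index()
              .rename_axis('B')   # concat can drop the index name when labels mix
              .reset_index())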

I want to create a new column territory based on the city column

Data Frame:
          city  Temperature
0   Chandigarh           15
1        Delhi           22
2       Kanpur           20
3      Chennai           26
4       Manali           -2
0    Bengalaru           24
1   Coimbatore           35
2    Srirangam           36
3  Pondicherry           39
I need to create another column in the data frame, which contains a boolean value for each city to indicate whether it's a union territory or not. Chandigarh, Pondicherry and Delhi are the only 3 union territories here.
I have written below code
import numpy as np
conditions = [df3['city'] == 'Chandigarh',
              df3['city'] == 'Pondicherry',
              df3['city'] == 'Delhi']
values = [1, 1, 1]
df3['territory'] = np.select(conditions, values)
Is there an easier or more efficient way to write this?
You can use isin:
union_terrs = ["Chandigarh", "Pondicherry", "Delhi"]
df3["territory"] = df3["city"].isin(union_terrs).astype(int)
which checks whether each entry in the city column is in union_terrs, giving True or False. The astype call then converts True/False to 1/0,
to get
          city  Temperature  territory
0   Chandigarh           15          1
1        Delhi           22          1
2       Kanpur           20          0
3      Chennai           26          0
4       Manali           -2          0
0    Bengalaru           24          0
1   Coimbatore           35          0
2    Srirangam           36          0
3  Pondicherry           39          1
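If a True/False flag works for your downstream code, you can also skip the astype and keep a proper boolean column, which filters directly (the column name is_territory below is just illustrative):

# Boolean dtype instead of 1/0; handy for direct row filtering.
df3["is_territory"] = df3["city"].isin(union_terrs)
print(df3[df3["is_territory"]])  # only the union-territory rows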

How to get top 3 sales in data frame after using group by and sorting in python?

I have recently been working with this data set:
import pandas as pd
data = {'Product': ['Box', 'Bottles', 'Pen', 'Markers', 'Bottles', 'Pen', 'Markers', 'Bottles', 'Box', 'Markers', 'Markers', 'Pen'],
        'State': ['Alaska', 'California', 'Texas', 'North Carolina', 'California', 'Texas', 'Alaska', 'Texas', 'North Carolina', 'Alaska', 'California', 'Texas'],
        'Sales': [14, 24, 31, 12, 13, 7, 9, 31, 18, 16, 18, 14]}
df1 = pd.DataFrame(data, columns=['Product', 'State', 'Sales'])
df1
I want to find the 3 groups that have the highest sales
grouped_df1 = df1.groupby('State')
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False))
So I have a dataframe like this
Now, I want to find the top 3 State that have the highest sales.
I tried to use
grouped_df1.apply(lambda x: x.sort_values(by='Sales', ascending=False)).head(3)
# It gives me the first three rows
grouped_df1.apply(lambda x: x.sort_values(by='Sales', ascending=False)).max()
# It only gives me the maximum value
The expected result should be:
Texas: 31
California: 24
North Carolina: 18
Thus, how can I fix it? Sometimes a single State can hold all of the top 3 sales values; for example, Alaska might have the 3 highest rows. If I simply sort and take the top 3, all of them would be Alaska, and the other 2 groups would be missed.
Many thanks!
You could add a new column called Sales_Max_For_State and then use drop_duplicates and nlargest:
>>> df1['Sales_Max_For_State'] = df1.groupby(['State'])['Sales'].transform(max)
>>> df1
    Product           State  Sales  Sales_Max_For_State
0       Box          Alaska     14                   16
1   Bottles      California     24                   24
2       Pen           Texas     31                   31
3   Markers  North Carolina     12                   18
4   Bottles      California     13                   24
5       Pen           Texas      7                   31
6   Markers          Alaska      9                   16
7   Bottles           Texas     31                   31
8       Box  North Carolina     18                   18
9   Markers          Alaska     16                   16
10  Markers      California     18                   24
11      Pen           Texas     14                   31
>>> df2 = df1.drop_duplicates(['Sales_Max_For_State']).nlargest(3, 'Sales_Max_For_State')[['State', 'Sales_Max_For_State']]
>>> df2
            State  Sales_Max_For_State
2           Texas                   31
1      California                   24
3  North Carolina                   18
I think there are a few ways to do this:
1. df1.groupby('State').agg({'Sales': 'max'}).sort_values(by='Sales', ascending=False).iloc[:3]
2. df1.groupby('State').agg({'Sales': 'max'})['Sales'].nlargest(3)
                Sales
State
Texas              31
California         24
North Carolina     18
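An equivalent, slightly shorter route (a sketch): take each State's max first, then the three largest of those maxima:

# Per-state maximum, then the top 3 of those maxima.
print(df1.groupby('State')['Sales'].max().nlargest(3))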

Set multiple columns to zero based on a value in another column [duplicate]

This question already has answers here:
Change one value based on another value in pandas
(7 answers)
Closed 2 years ago.
I have a sample dataset here. In the real case, my train and test datasets each have around 300 columns and 800 rows. I want to find all rows with a certain value in one column, and then set the values in those rows from, e.g., column 3 to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Princi', 'Anuj', 'Nancy'],
        'Age': [27, 24, 22, 32, 66, 43],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
        'Payment': [15, 20, 40, 50, 3, 23],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'MA', 'MS']}
df = pd.DataFrame(data)
df
Here is the output of sample dataset:
     Name  Age    Address  Payment Qualification
0     Jai   27      Delhi       15           Msc
1  Princi   24     Kanpur       20            MA
2  Gaurav   22  Allahabad       40           MCA
3  Princi   32    Kannauj       50           Phd
4    Anuj   66     Katauj        3            MA
5   Nancy   43    vbinauj       23            MS
As you can see, the Name column contains the value "Princi" in some rows. For the rows where Name == "Princi", I want to set the "Address" and "Payment" columns to zero.
Here is the expected output:
     Name  Age    Address  Payment Qualification
0     Jai   27      Delhi       15           Msc
1  Princi   24          0        0            MA   # this row
2  Gaurav   22  Allahabad       40           MCA
3  Princi   32          0        0           Phd   # this row
4    Anuj   66     Katauj        3            MA
5   Nancy   43    vbinauj       23            MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']  # I could select all those columns
and
train.loc[df['column_wanted'] == "that value"]  # I got all those rows
But how can I combine them? Thanks for your help!
Use the loc accessor; df.loc[boolean selection, columns]
df.loc[df['Name'].eq('Princi'),'Address':'Payment']=0
     Name  Age    Address  Payment Qualification
0     Jai   27      Delhi       15           Msc
1  Princi   24          0        0            MA
2  Gaurav   22  Allahabad       40           MCA
3  Princi   32          0        0           Phd
4    Anuj   66     Katauj        3            MA
5   Nancy   43    vbinauj       23            MS
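Applied to the asker's real dataset, the same pattern combines their two fragments in one statement ('got', 'tod', 'column_wanted', and the filter value are the question's own placeholders, kept as-is):

# Boolean row mask + label-based column slice, assigned in one .loc call.
train.loc[train['column_wanted'] == "that value", 'got':'tod'] = 0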

Python - Performing Max Function on Multiple Groupby

I have a data frame below that shows the price of wood and steel from two different suppliers.
I would like to add a column that shows the highest price for the opposite item (i.e. if the line is wood, it would pull steel) from the same supplier.
For example, the "Steel" row for "Tom" would show his highest wood price, which is 42.
The code I have so far simply returns the highest price for the original item rather than the opposite one, so Tom's steel row returns 24 when I wanted it to return 42.
I think this is an issue with pulling the max value for a multi-group. I have tried a number of different ways but just cannot seem to get it.
Any thoughts would be greatly appreciated.
import pandas as pd
import numpy as np
data = {'Supplier': ['Tom', 'Tom', 'Tom', 'Bill', 'Bill', 'Bill'],
        'Item': ['Wood', 'Wood', 'Steel', 'Steel', 'Steel', 'Wood'],
        'Price': [42, 33, 24, 16, 12, 18]}
df = pd.DataFrame(data)
df['Opp_Item'] = np.where(df['Item'] == "Wood", "Steel", "Wood")
df['Opp_Item_Max'] = df.groupby(['Supplier', 'Opp_Item'])['Price'].transform(max)
print(df)
  Supplier   Item  Price Opp_Item  Opp_Item_Max
0      Tom   Wood     42    Steel            42
1      Tom   Wood     33    Steel            42
2      Tom  Steel     24     Wood            24
3     Bill  Steel     16     Wood            16
4     Bill  Steel     12     Wood            16
5     Bill   Wood     18    Steel            18
If you can find the per supplier+item maximum, then you can just swap the values and assign them back through a join:
v = df.groupby(['Supplier', 'Item']).Price.max().unstack(-1)
# This reversal operation works under the assumption that
# there are only two items and that they are opposites of each other.
v[:] = v.values[:, ::-1]
df = (df.set_index(['Supplier', 'Item'])
        .join(v.stack().to_frame('Opp_Item_Max'), how='left')
        .reset_index())
print(df)
print(df)
  Supplier   Item  Price  Opp_Item_Max
0     Bill  Steel     16            18
1     Bill  Steel     12            18
2     Bill   Wood     18            16
3      Tom  Steel     24            42
4      Tom   Wood     42            24
5      Tom   Wood     33            24
Note: Ordering of your data will not be preserved after the join.
You could map to the opposite values before a groupby, and then merge this back to the original DataFrame.
d = {'Steel': 'Wood', 'Wood': 'Steel'}
df.merge(df.assign(Item=df.Item.map(d))
           .groupby(['Supplier', 'Item'], as_index=False).max(),
         on=['Supplier', 'Item'],
         how='left',
         suffixes=['', '_Opp_Item'])
  Supplier   Item  Price  Price_Opp_Item
0      Tom   Wood     42              24
1      Tom   Wood     33              24
2      Tom  Steel     24              42
3     Bill  Steel     16              18
4     Bill  Steel     12              18
5     Bill   Wood     18              16
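Since the question already builds an Opp_Item column, another option (a sketch, assuming exactly two opposite items per supplier) is to merge the per-(Supplier, Item) maxima back keyed on the opposite item:

# Max price per (Supplier, Item); pairing left_on columns with the grouped
# result's MultiIndex looks up the *opposite* item's maximum for each row.
opp_max = df.groupby(['Supplier', 'Item'])['Price'].max().rename('Opp_Item_Max')
out = df[['Supplier', 'Item', 'Price', 'Opp_Item']].merge(
    opp_max, left_on=['Supplier', 'Opp_Item'], right_index=True, how='left')
print(out)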
