Divide DataFrame Column on (,) into two new columns - python

I have a pandas DataFrame called data_combined with the following structure:
          index  corr_year   corr_5d
0    (DAL, AAL)   0.873762  0.778594
1     (WEC, ED)   0.851578  0.850549
2    (CMS, LNT)   0.850028  0.776143
3  (SWKS, QRVO)   0.850799  0.830603
4    (ALK, DAL)   0.874162  0.744590
Now I am trying to divide the column named index into two new columns on the comma.
The desired output should look like this:
  index1 index2  corr_year   corr_5d
0    DAL    AAL   0.873762  0.778594
1    WEC     ED   0.851578  0.850549
2    CMS    LNT   0.850028  0.776143
3   SWKS   QRVO   0.850799  0.830603
4    ALK    DAL   0.874162  0.744590
I have tried using pd.explode() with the following code
data_results_test = data_results_combined.explode('index')
data_results_test
Which leads to the following output:
  index  corr_year   corr_5d
0   DAL   0.873762  0.778594
0   AAL   0.873762  0.778594
1   WEC   0.851578  0.850549
1    ED   0.851578  0.850549
How can I achieve the split with newly added columns instead of rows? pd.explode does not seem to have any option to choose whether to add new rows or columns.

How about a simple apply? (Assuming the 'index' column holds tuples.)
data_results_combined['index1'] = data_results_combined['index'].apply(lambda x: x[0])
data_results_combined['index2'] = data_results_combined['index'].apply(lambda x: x[1])
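If the column really holds 2-tuples, both new columns can also be built in a single assignment; a minimal sketch (untested against the asker's exact frame):
data_results_combined[['index1', 'index2']] = pd.DataFrame(
    data_results_combined['index'].tolist(),
    index=data_results_combined.index)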

df[['index1', 'index2']] = df['index'].str.split(',', expand=True)
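Note that the split above assumes plain "A,B" strings. If the values are strings that still include the parentheses, e.g. "(DAL, AAL)", you would likely need to strip those first; a sketch under that assumption:
df[['index1', 'index2']] = (df['index']
                            .str.strip('()')
                            .str.split(r'\s*,\s*', expand=True))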

Related

Pandas - pivoting multiple columns into fewer columns with some level of detail kept

Say I have the following code that generates a dataframe:
df = pd.DataFrame({"customer_code": ['1234', '3411', '9303'],
                   "main_purchases": [3, 10, 5],
                   "main_revenue": [103.5, 401.5, 99.0],
                   "secondary_purchases": [1, 2, 4],
                   "secondary_revenue": [43.1, 77.5, 104.6]})
df.head()
There's the customer_code column that's the unique ID for each client.
And then there are 2 columns to indicate the purchases that took place and revenue generated from main branches by those clients.
And another 2 columns to indicate the purchases/revenue from secondary branches by those clients.
I want to get the data into a format like this, where a pivot adds a new column to differentiate between main and secondary, but the revenue and purchases columns are not mixed up:
The obvious solution is just to split this into 2 dataframes and then concatenate, but I'm wondering whether there's a built-in way to do this in a line or two - this strikes me as the kind of thing someone might have thought to bake in a solution for.
With a little column renaming using a regular expression and str.replace to put "purchases" and "revenue" first in the column names, we can use pd.wide_to_long to convert these now-stubnames from columns to rows:
# Reorder column names so stubnames come first
df.columns = [df.columns[0],
              *df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]

# Convert wide to long
df = (
    pd.wide_to_long(
        df,
        i='customer_code',
        stubnames=['purchases', 'revenue'],
        j='type',
        sep='_',
        suffix='.*'
    )
    .sort_index()     # optional sort to match expected output
    .reset_index()    # retrieve customer_code from the index
)
df:

   customer_code       type  purchases  revenue
0           1234       main          3    103.5
1           1234  secondary          1     43.1
2           3411       main         10    401.5
3           3411  secondary          2     77.5
4           9303       main          5     99.0
5           9303  secondary          4    104.6
What does reordering the column headers do?
df.columns = [df.columns[0],
              *df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]
Produces:
Index(['customer_code', 'purchases_main', 'revenue_main',
       'purchases_secondary', 'revenue_secondary'],
      dtype='object')
The "type" column is now the suffix of the column header which allows wide_to_long to process the table as expected.
You can abstract the reshaping process with pivot_longer from pyjanitor; it is a collection of convenient wrapper functions around pandas:
# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(index='customer_code',
                names_to=('type', '.value'),
                names_sep='_',
                sort_by_appearance=True)
  customer_code       type  purchases  revenue
0          1234       main          3    103.5
1          1234  secondary          1     43.1
2          3411       main         10    401.5
3          3411  secondary          2     77.5
4          9303       main          5     99.0
5          9303  secondary          4    104.6
The .value in names_to signifies to the function that you want that part of the column name to remain as a header; the other part goes under the type column. The split here is determined by names_sep (there is also a names_pattern option, which allows splitting with a regular expression); if you do not care about the order of appearance, you can set sort_by_appearance to False.
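For comparison, a sketch of the same call using names_pattern instead of names_sep (assuming the same pyjanitor version; the regex groups map one-to-one onto names_to):
df.pivot_longer(index='customer_code',
                names_to=('type', '.value'),
                names_pattern=r'(.*)_(.*)',
                sort_by_appearance=True)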
You can also use the melt() and concat() functions to solve this problem.
import pandas as pd

df1 = df.melt(id_vars='customer_code',
              value_vars=['main_purchases', 'secondary_purchases'],
              var_name='type',
              value_name='purchases',
              ignore_index=True)

df2 = df.melt(id_vars='customer_code',
              value_vars=['main_revenue', 'secondary_revenue'],
              var_name='type',
              value_name='revenue',
              ignore_index=True)
Then we use concat() with axis=1 to join the two side by side, and sort_values(by='customer_code') to sort the data by customer.
result = pd.concat([df1, df2['revenue']],
                   axis=1,
                   ignore_index=False).sort_values(by='customer_code')
Using replace() with a regex to align the type names (stripping the '_purchases'/'_revenue' suffix leaves just 'main' and 'secondary'):
result.type.replace(r'_.*$', '', regex=True, inplace=True)
The above code will output the following dataframe:

   customer_code       type  purchases  revenue
0           1234       main          3    103.5
3           1234  secondary          1     43.1
1           3411       main         10    401.5
4           3411  secondary          2     77.5
2           9303       main          5     99.0
5           9303  secondary          4    104.6

How to count the occurrences of strings that start with a specific substring in comma-separated values in a pandas data frame?

I am new to Python. I am working with a dataframe (360000 rows and 2 columns) that looks something like this:
business_id  date
P01          2019-07-6 , 2018-06-05, 2019-07-06...
P02          2016-03-6 , 2019-04-10
P03          2019-01-02
The date column has comma-separated dates from the years 2010-2019. I am trying to count, for each business id, only the dates that fall in each month of 2019. Specifically, I am looking for the output:
Can anyone please help me? Thanks.
You can do it as follows:
first use str.split to separate the dates in each cell into a list,
then explode to flatten the lists,
convert to datetime with pd.to_datetime and extract the month,
and finally use pd.crosstab to pivot/count the months and join.
Altogether:
s = pd.to_datetime(df['date'].str.split(r'\s*,\s*').explode()).dt.to_period('M')
out = pd.crosstab(s.index, s)

# this gives the expected output
df.join(out)
Output (out):

date   2016-03  2018-06  2019-01  2019-04  2019-07
row_0
0            0        1        0        0        2
1            1        0        0        1        0
2            0        0        1        0        0
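Since the question only asks about 2019, a sketch that filters the periods before pivoting (reusing s from above; .dt.year also works on a period-dtype Series):
s_2019 = s[s.dt.year == 2019]
out = pd.crosstab(s_2019.index, s_2019)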
If they are not datetime objects yet, you may want to start by converting the column (Series) to datetime:
pd.to_datetime()
Note: the format parameter.
Then you can access the datetime attributes through the .dt accessor,
e.g. df[df.COLUMN_NAME.dt.month == 5]
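A minimal sketch of that conversion; the format string here is an assumption about the data and should be adjusted to the actual layout:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
may_rows = df[df['date'].dt.month == 5]   # keep only May rows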

How to solve error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects when mapping a dataframe to another one

I have a Pandas DataFrame. I am trying to map the ProductID from one dataframe to another dataframe.
Here is my attempt:
Product_id_mapper = dict(df1[['ProductID', 'Cost']].drop_duplicates().values)
df2["Actual_cost"] = df2['ProductID'].map(Product_id_mapper)
Unfortunately, I get the following error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I wonder why I keep getting this error even after dropping duplicates.
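One thing worth checking: drop_duplicates() on the ProductID/Cost pair still keeps a ProductID twice when it appears with two different costs, so the mapping may not be one-to-one. A quick diagnostic sketch (column names taken from the snippet above):
dupes = df1[['ProductID', 'Cost']].drop_duplicates()
print(dupes['ProductID'].duplicated().sum())   # > 0 means ambiguous ProductIDs remain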
If I have understood correctly, you want to merge two dataframes based on a key. Then this is my suggestion:
Suppose a.csv:
carrier,type,count
DTH,a,123
DTH,b,3123
DTH,c,41341
DTH,d,13411
BLUEDART,a,12123
BLUEDART,b,31231
BLUEDART,c,411
BLUEDART,d,11
And b.csv is:
carrier,year
DTH,1997
BLUEDART,2005
Python code:
import pandas as pd
a_df = pd.read_csv(r"a.csv")
b_df = pd.read_csv(r"b.csv")
merged_df = pd.merge(a_df, b_df, on=['carrier'])
print(merged_df)
and the output:
    carrier type  count  year
0       DTH    a    123  1997
1       DTH    b   3123  1997
2       DTH    c  41341  1997
3       DTH    d  13411  1997
4  BLUEDART    a  12123  2005
5  BLUEDART    b  31231  2005
6  BLUEDART    c    411  2005
7  BLUEDART    d     11  2005
If the key column is named differently in the two CSVs, use:
out = (df1.merge(df2, left_on='key1', right_on='key2')
          .reindex(columns=[...]))
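A concrete sketch with hypothetical frames (key1, key2 and the column handling are made up for illustration):
import pandas as pd

df1 = pd.DataFrame({'key1': ['DTH', 'BLUEDART'], 'count': [123, 11]})
df2 = pd.DataFrame({'key2': ['DTH', 'BLUEDART'], 'year': [1997, 2005]})

out = (df1.merge(df2, left_on='key1', right_on='key2')
          .drop(columns='key2'))   # drop the now-redundant second key
print(out)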

How to fill empty column values with another dataframe's value if two other columns have matching values in Pandas?

I have two similar dataframes. I want to populate df2['Material'] with the value from df1['Material'] where df1['PartNumber'] and df2['PartNumber'] match. How can I accomplish this with Pandas (or Python in general)? The data frames are several thousand lines each; these are just snippets.
df1
     PartNumber     Material  ProgramNo  Machine
114  JEFD0302000 E  304L      O0219      CHNC III
218  REFD0502050 B  6AL-4V    O0295      CHNC III
df2
   PartNumber     ProgramNo    Machine  Material
0  JEFD0302670 A  6109 + 6609  WY-100   NaN
1  JEFD0510820 A  6110 + 6610  WY-100   NaN
You can do:
df2['Material'] = (df2['PartNumber']
                   .map(dict(zip(df1['PartNumber'], df1['Material'])))
                   .fillna(df2.Material))
Using np.where with map (note the condition tests Material, the column being filled):
import numpy as np

s = df2.PartNumber.map(df1.set_index('PartNumber').Material)
df2.Material = np.where(df2.Material.isnull(), s, df2.Material)
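Equivalent without numpy, letting index alignment do the work (a sketch under the same matching assumption):
df2['Material'] = df2['Material'].fillna(s)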

extract unique values and make a new dataframe on a condition

Assume this is my sample input df:
       date h_league
0  19901126       AA
1  19911127       NA
2  20030130       AA
3  20041217      NaN
4  20080716       AA
5  20011215       NA
6  19970603      NaN
I'm looking to extract the unique leagues from h_league and also to make two new columns: max_date holding the maximum date and min_date holding the minimum date for each league.
# Desired Output:
  h_league  Max_date  Min_date
0       AA  20080716  19901126
1       NA  20011215  19911127
I wrote a function for this task that returns output similar to what I want, but not the exact desired output.
def league_info(league):
    league_games = df[df["h_league"] == league]
    earliest = df["date"].min()
    latest = df["date"].max()
    print("{} went from {} to {}".format(league, earliest, latest))

for league in df["h_league"].unique():
    league_info(league)
I'm looking for a pandas way to achieve my desired output. Any help is appreciated. Thank you!
IIUC
df = df.fillna('NA')
df.groupby('h_league').date.agg(['max', 'min'])
Out[98]:
               max       min
h_league
AA        20080716  19901126
NA        20041217  19911127
df2 = df.fillna('NA')
df2.groupby('h_league').date.agg(['max', 'min'])
Does this work for you? You can assign df = df.fillna('NA') directly too. Let me know if this works; I tried it.
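To match the asker's desired column names exactly, a sketch layering named aggregation and reset_index on top of the same groupby (Max_date/Min_date taken from the desired output above):
out = (df.fillna('NA')
         .groupby('h_league')['date']
         .agg(Max_date='max', Min_date='min')
         .reset_index())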
