I have a dataframe that looks like the following:
ID,CUSTOMER_ID,ACC_NUMBER,TRANSACTION_ID,PACK_DESC,PACK_VALIDITY,PACK_NUMBER
1,ABCVRXJ,1027,1248,PackA,30,PACKA-XXXX
2,ABCVRXJ,1029,1249,PackC,32,PACKC-XXXX
3,XUVZ200,1028,12491,PackB,31,PACKB-XXXX
4,XUVZ200,1030,12421,PackD,33,PACKD-XXXX
I want the final dataframe to look something like:
ID,CUSTOMER_ID,ACC_NUMBER,TRANSACTION_ID,PACK_DESC,PACK_VALIDITY,PACK_NUMBER_1,PACK_NUMBER_2
1,ABCVRXJ,1027,1248,PackA,30,PACKA-XXXX,PACKC-XXXX
3,XUVZ200,1028,12491,PackB,31,PACKB-XXXX,PACKD-XXXX
Each CUSTOMER_ID who has opted for two packs should be collapsed into a single row, with the two PACK_NUMBERs becoming two new columns.
I tried:
df['index'] = df.groupby('CUSTOMER_ID').cumcount()
df_vchrNumber = df.pivot(index='CUSTOMER_ID', columns='index', values='PACK_NUMBER').rename(columns=lambda x: 'PACK_NUMBER_'+str(x + 1))
df_vchrNumber = df_vchrNumber.fillna('').reset_index()
but this returns,
CUSTOMER_ID,PACK_NUMBER_1,PACK_NUMBER_2
ABCVRXJ,PACKA-XXXX,PACKC-XXXX
XUVZ200,PACKB-XXXX,PACKD-XXXX
but this is not the expected output, as I'm not sure how to include the other columns.
Would somebody mind helping me out a bit?
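For reference, a minimal snippet that rebuilds your sample frame from the CSV rows above, so the answers below can be run as-is:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'CUSTOMER_ID': ['ABCVRXJ', 'ABCVRXJ', 'XUVZ200', 'XUVZ200'],
    'ACC_NUMBER': [1027, 1029, 1028, 1030],
    'TRANSACTION_ID': [1248, 1249, 12491, 12421],
    'PACK_DESC': ['PackA', 'PackC', 'PackB', 'PackD'],
    'PACK_VALIDITY': [30, 32, 31, 33],
    'PACK_NUMBER': ['PACKA-XXXX', 'PACKC-XXXX', 'PACKB-XXXX', 'PACKD-XXXX'],
})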
If you need only the first and last PACK_NUMBER per group, use DataFrame.drop_duplicates to keep the first row per CUSTOMER_ID, then join the last PACK_NUMBER per group as a new column:
s = (df.drop_duplicates('CUSTOMER_ID', keep='last')
       .set_index('CUSTOMER_ID')['PACK_NUMBER']
       .rename('PACK_NUMBER_2'))

df = (df.drop_duplicates('CUSTOMER_ID')
        .rename(columns={'PACK_NUMBER': 'PACK_NUMBER_1'})
        .join(s, on='CUSTOMER_ID'))
print(df)
ID CUSTOMER_ID ACC_NUMBER TRANSACTION_ID PACK_DESC PACK_VALIDITY \
0 1 ABCVRXJ 1027 1248 PackA 30
2 3 XUVZ200 1028 12491 PackB 31
PACK_NUMBER_1 PACK_NUMBER_2
0 PACKA-XXXX PACKC-XXXX
2 PACKB-XXXX PACKD-XXXX
Your solution can be fixed by removing duplicates and joining the pivoted Series back:
df['index'] = df.groupby('CUSTOMER_ID').cumcount()
df_vchrNumber = (df.pivot(index='CUSTOMER_ID', columns='index', values='PACK_NUMBER')
                   .rename(columns=lambda x: 'PACK_NUMBER_' + str(x + 1)))
df = (df.drop_duplicates('CUSTOMER_ID')
        .drop(columns=['PACK_NUMBER', 'index'])
        .join(df_vchrNumber, on='CUSTOMER_ID'))
And if need processes all columns:
df['index'] = df.groupby('CUSTOMER_ID').cumcount() + 1
df = df.set_index(['CUSTOMER_ID', 'index']).unstack()
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
CUSTOMER_ID ID_1 ID_2 ACC_NUMBER_1 ACC_NUMBER_2 TRANSACTION_ID_1 \
0 ABCVRXJ 1 2 1027 1029 1248
1 XUVZ200 3 4 1028 1030 12491
TRANSACTION_ID_2 PACK_DESC_1 PACK_DESC_2 PACK_VALIDITY_1 PACK_VALIDITY_2 \
0 1249 PackA PackC 30 32
1 12421 PackB PackD 31 33
PACK_NUMBER_1 PACK_NUMBER_2
0 PACKA-XXXX PACKC-XXXX
1 PACKB-XXXX PACKD-XXXX
Use groupby with agg to select the first row of each group. Then groupby again to get the last PACK_NUMBER, and finally merge the two DataFrames together to get your wanted output:
a = df.groupby('CUSTOMER_ID', as_index=False).agg('first')
b = df.groupby('CUSTOMER_ID', as_index=False).agg({'PACK_NUMBER':'last'})
df_final = a.merge(b, on='CUSTOMER_ID', suffixes=['_1', '_2'])
CUSTOMER_ID ID ACC_NUMBER TRANSACTION_ID PACK_DESC PACK_VALIDITY PACK_NUMBER_1 PACK_NUMBER_2
0 ABCVRXJ 1 1027 1248 PackA 30 PACKA-XXXX PACKC-XXXX
1 XUVZ200 3 1028 12491 PackB 31 PACKB-XXXX PACKD-XXXX
I am using the code below to search a .csv file, match a column between the two files, and grab a different column that I want to add as a new column. However, I am trying to make the match based on two columns instead of one. Is there a way to do this?
import pandas as pd
df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")
def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
        else:
            return '0'
df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above searches on the shared 'name' column in both files and takes the column I request (row[3]) from df2. I want the code to match on both the 'name' column and another column, 'price', and only take the value in row[3] when both columns match in df1 and df2.
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2, matching only on df1 name = df2 name), the produced result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is: if both df1 name = df2 name and df1 price = df2 price, then take the df2 want column, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use the pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
The merge represents missing values as NaN, so the want column's dtype changes to float64; you can change it back after filling the missing values with 0.
Also please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
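If you want the integer dtype back after the fill, a small sketch (column names taken from your example):

out = df1.merge(df2, on=['name', 'price'], how='left').fillna(0)
out['want'] = out['want'].astype(int)  # safe: no NaN remains after fillna(0)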
If you are matching the two dataframes on name and price, you can use df.where and df.isin; note that this relies on df1 and df2 sharing the same index/row order:
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" rows of df1, to which you add a constant-valued want column. It would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
mask = ~pd.Series(zip(df1.name, df1.price)).isin(merged_pairs).to_numpy()
df_anti = df1.loc[mask].copy()  # .copy() avoids SettingWithCopyWarning on the next line
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti])  # perhaps ignore_index=True?
Looks complicated, but it should be quite performant because it filters by set. It might also be possible to set name and price as the index, merge on the index, and then filter by index to avoid the zip/set shenanigans, but I'm no expert on MultiIndex handling.
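As a sketch of a variant (not the answer above, just an alternative I'd consider), the zip/set filtering can also be expressed with merge's indicator flag, which marks rows of df1 with no match in df2. This assumes unique (name, price) pairs in df2, as validate='1:1' already enforces:

flags = pd.merge(df1, df2[['name', 'price']], on=['name', 'price'],
                 how='left', indicator=True)['_merge'].to_numpy()
df_anti = df1.loc[flags == 'left_only'].copy()  # rows of df1 with no (name, price) match
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti], ignore_index=True)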
Try this code; it will give you the expected results:
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

new = pd.merge(df1, df2, how='left', left_on=['name', 'price'], right_on=['name', 'price'])
print(new.fillna(0))
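For reference, this prints (want becomes float64 because of the NaNs introduced before fillna):

  name  price  value   want
0    a     10     35  123.0
1    b     10     21    0.0
2    c     10     33  944.0
3    d     10     20  104.0
4    e     10     88    0.0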
I have a dataframe (df1) that contains information about project_id, cost_center and other features:
project_id  cost_center  month
       101         8575      3
       321         8597      4
       321         8597      2
       NaN         8522      1
Sometimes the project_id is missing (NaN) there, and for these cases I have a "mapping table" (df2) that indicates the project_id that should be associated with that cost_center:
project_id  cost_center
       832         8522
So in my example, I should be able to replace the NaN in df1 with 832. That is, I should fill in the project_id in df1 whenever the cost_center appears in df2.
I tried the following code, but it is not working. It says "Length of values (0) does not match length of index (565)", I think because df1 and df2 have different sizes:
df['project_id'] = df_mapping[df['cost_center'].isin(df_mapping['cost_center'])]['project_id'].values
One way to do this is to merge the two DataFrames and then use fillna() to create a new (output) column. Hope it helps!
df_1 = pd.DataFrame({"id": [1, 2, 3, None], "center": [5, 6, 7, 8]}, index=["a", "b", "c", "d"])
df_2 = pd.DataFrame({"id": [4], "center": [8]}, index=["g"])

df_merge = df_1.merge(df_2, on="center", how="outer")
df_merge["id_output"] = df_merge["id_x"].fillna(df_merge["id_y"])
df_merge.drop(["id_x", "id_y"], inplace=True, axis=1)
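For reference, df_merge then looks like this (the merge resets the original string index, and id_output is float because of the NaN):

   center  id_output
0       5        1.0
1       6        2.0
2       7        3.0
3       8        4.0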
You can use a mapping series and fillna:
df1['project_id'] = (df1['project_id']
.fillna(df1['cost_center'].map(df2.set_index('cost_center')['project_id']),
downcast='infer'
)
)
output:
project_id cost_center month
0 101 8575 3
1 321 8597 4
2 321 8597 2
3 832 8522 1
My pandas dataframe currently has a column titled BinLocation that contains the location of a material in a warehouse. For example:
If a part is located in column A02, row 33, and level B21, then the BinLocation ID is A02033B21.
For some entries, the format may be A0233B21. The naming convention is not consistent, but that was not up to me, and now I have to clean the data up.
I want to split the string such that for any given input for the BinLocation, I can return the column, row and level. Ultimately, I want to create 3 new columns for the dataframe (column, row, level).
In case it is not clear, the general structure of the ID is ColumnChar + ColumnInt + RowInt + LevelChar + LevelInt.
Now, for some BinLocations, the ID is separated by hyphens, so I wrote this code for those:
def forHyphenRow(s):
    return s.split('-')[1]

def forHyphenColumn(s):
    return s.split('-')[0]

def forHyphenLevel(s):
    return s.split('-')[2]
How do I do the same but for the other IDs?
Also, is there any way to group the rows in the dataframe together by warehouse column? (so A02 are all grouped together, CB-22 are all grouped together, etc.)
Here is an answer that:
- uses Python regular expression syntax to parse your ID (handles cases with and without hyphens and can be tweaked to accommodate other quirks of historical IDs if needed)
- puts the ID in a regularized format
- adds columns for the ID components
- sorts based on the ID components so rows are "grouped" together (though not in the "groupby" sense of pandas)
import re
import pandas as pd

df = pd.DataFrame({'BinLocation': ['A0233B21', 'A02033B21', 'A02-033-B21', 'A02-33-B21',
                                   'A02-33-B15', 'A02-30-B21', 'A01-33-B21']})
print(df)
print()

df['RawBinLocation'] = df['BinLocation']

def parse(s):
    m = re.match(r'^([A-Z])([0-9]{2})-?([0-9]+)-?([A-Z])([0-9]{2})$', s)
    if not m:
        return None
    tup = m.groups()
    colChar, colInt, rowInt, levelChar, levelInt = tup[0], int(tup[1]), int(tup[2]), tup[3], int(tup[4])
    return pd.Series((colChar, colInt, rowInt, levelChar, levelInt))

df[['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt']] = df['BinLocation'].apply(parse)
df['BinLocation'] = df.apply(lambda x: f"{x.ColChar}{x.ColInt:02}-{x.RowInt:03}-{x.LevChar}{x.LevInt:02}", axis=1)
df.sort_values(by=['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt'], inplace=True, ignore_index=True)
print(df)
Output:
BinLocation
0 A0233B21
1 A02033B21
2 A02-033-B21
3 A02-33-B21
4 A02-33-B15
5 A02-30-B21
6 A01-33-B21
BinLocation RawBinLocation ColChar ColInt RowInt LevChar LevInt
0 A01-033-B21 A01-33-B21 A 1 33 B 21
1 A02-030-B21 A02-30-B21 A 2 30 B 21
2 A02-033-B15 A02-33-B15 A 2 33 B 15
3 A02-033-B21 A0233B21 A 2 33 B 21
4 A02-033-B21 A02033B21 A 2 33 B 21
5 A02-033-B21 A02-033-B21 A 2 33 B 21
6 A02-033-B21 A02-33-B21 A 2 33 B 21
If the first three characters of the string will always be the Column, and the last three the Level (and therefore the Row is everything in between):
def forNotHyphenColumn(s):
    return s[:3]

def forNotHyphenLevel(s):
    return s[-3:]

def forNotHyphenRow(s):
    return s[3:-3]
Then, you could sort your DataFrame by Column by creating separate DataFrame columns for the BinLocation items and using df.sort_values():
df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)
df
#Out:
# BinLocation Column Row Level
#0 A02033B21 A02 033 B21
#2 A0233B21 A02 33 B21
#1 C02044C12 C02 044 C12
EDIT:
To also use the hyphen functions in the apply():
df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21", "A01-33-C13"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forHyphenColumn(x) if "-" in x else forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forHyphenRow(x) if "-" in x else forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forHyphenLevel(x) if "-" in x else forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)
df
#Out:
# BinLocation Column Row Level
#3 A01-33-C13 A01 33 C13
#0 A02033B21 A02 033 B21
#2 A0233B21 A02 33 B21
#1 C02044C12 C02 044 C12
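As a side note (not part of the answers above, just a sketch reusing the same regex idea), the whole parse can be vectorized with Series.str.extract instead of a row-wise apply:

pattern = r'^([A-Z])([0-9]{2})-?([0-9]+)-?([A-Z])([0-9]{2})$'
parts = df['BinLocation'].str.extract(pattern)
parts.columns = ['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt']
# assumes every ID matches the pattern; non-matching rows would be NaN and break astype
parts[['ColInt', 'RowInt', 'LevInt']] = parts[['ColInt', 'RowInt', 'LevInt']].astype(int)
df[parts.columns] = parts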
I am working with a data frame in Jupyter Notebooks and I am having some difficulty with it. The data frame consists of locations and these are represented by coordinates. These points represent a route taken by a driver on a given day.
There are 3 columns at the moment: Start, Intermediary, and End.
A driver begins the day at the Start point, visits 1 or more Intermediary points and returns to the End point at the end of the day. The Start point is like a base location so the End point is identical to the Start point.
It's very basic, but I am having trouble visualising this data. I was thinking something like the layout below might improve my situation:
|     Start     |  Intermediary |      End      |
| s_lat | s_lng | i_lat | i_lng | e_lat | e_lng |
Or would it be best if I scrap the top 3 columns (Start, Intermediary, End)?
I am keen not to start a discussion here, as per the guidelines; I just want to learn something new about pandas and whether there is a way I can improve my current method.
I think you need a MultiIndex created by MultiIndex.from_product:
mux = pd.MultiIndex.from_product([['Start','Intermediary','End'], ['lat','lng']])
df = pd.DataFrame(data, columns=mux)
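For example, an empty frame built with that MultiIndex shows the two header levels:

df = pd.DataFrame(columns=mux)
print(df.columns.tolist())
# [('Start', 'lat'), ('Start', 'lng'), ('Intermediary', 'lat'),
#  ('Intermediary', 'lng'), ('End', 'lat'), ('End', 'lng')]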
EDIT:
Setup:
temp=u""" start intermediary end
('54.957055',' -7.740156') ('54.956915136264', ' -7.753690062122') ('54.957055','-7.740156')
('54.8913208', '-7.5740475') ('54.864402885577', '-7.653445692445'),('54','0') ('54.8913208','-7.5740475')
('55.2375819', '-7.2357427') ('55.253936739337', '-7.259624609577'), ('54','2'),('54','1') ('55.2375819','-7.2357427')
('54.5298806', '-8.1350247') ('54.504374314741', '-8.188334960168') ('54.5298806','-8.1350247')
('54.2810187', ' -7.896937') ('54.303836850038', '-8.180136033695'), ('54','3') ('54.2810187','-7.896937')
"""
from io import StringIO

# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=r"\s{3,}", engine='python')
print (df)
start \
0 ('54.957055',' -7.740156')
1 ('54.8913208', '-7.5740475')
2 ('55.2375819', '-7.2357427')
3 ('54.5298806', '-8.1350247')
4 ('54.2810187', ' -7.896937')
intermediary \
0 ('54.956915136264', ' -7.753690062122')
1 ('54.864402885577', '-7.653445692445'),('54','0')
2 ('55.253936739337', '-7.259624609577'), ('54',...
3 ('54.504374314741', '-8.188334960168')
4 ('54.303836850038', '-8.180136033695'), ('54',...
end
0 ('54.957055','-7.740156')
1 ('54.8913208','-7.5740475')
2 ('55.2375819','-7.2357427')
3 ('54.5298806','-8.1350247')
4 ('54.2810187','-7.896937')
import ast

# convert string values to tuples
df = df.applymap(ast.literal_eval)
# wrap a single pair in a list, so every 'intermediary' cell is a list of tuples
df['intermediary'] = df['intermediary'].apply(lambda x: list(x) if isinstance(x[1], tuple) else [x])
# DataFrame from the start column
df1 = pd.DataFrame(df['start'].values.tolist(), columns=['lat', 'lng'])
# DataFrame from the intermediary column, reshaped to a 2-column df
df2 = (pd.concat([pd.DataFrame(x, columns=['lat', 'lng']) for x in df['intermediary']], keys=df.index)
         .reset_index(level=1, drop=True)
         .add_prefix('intermediary_'))
print(df2)
# join all DataFrames together
df3 = df1.add_prefix('start_').join(df2).join(df1.add_prefix('end_'))
# create a MultiIndex by splitting the column names on '_'
df3.columns = df3.columns.str.split('_', expand=True)
print(df3)
start intermediary end \
lat lng lat lng lat
0 54.957055 -7.740156 54.956915136264 -7.753690062122 54.957055
1 54.8913208 -7.5740475 54.864402885577 -7.653445692445 54.8913208
1 54.8913208 -7.5740475 54 0 54.8913208
2 55.2375819 -7.2357427 55.253936739337 -7.259624609577 55.2375819
2 55.2375819 -7.2357427 54 2 55.2375819
2 55.2375819 -7.2357427 54 1 55.2375819
3 54.5298806 -8.1350247 54.504374314741 -8.188334960168 54.5298806
4 54.2810187 -7.896937 54.303836850038 -8.180136033695 54.2810187
4 54.2810187 -7.896937 54 3 54.2810187
lng
0 -7.740156
1 -7.5740475
1 -7.5740475
2 -7.2357427
2 -7.2357427
2 -7.2357427
3 -8.1350247
4 -7.896937
4 -7.896937
To add a top column level to a pd.DataFrame, run:
def add_top_column(df, top_col, inplace=False):
    if not inplace:
        df = df.copy()
    df.columns = pd.MultiIndex.from_product([[top_col], df.columns])
    return df
orig_df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
new_df = add_top_column(orig_df, "new column")
In order to combine 3 DataFrames, each with its own new top column:
new_df2 = add_top_column(orig_df, "new column2")
new_df3 = add_top_column(orig_df, "new column3")
print(pd.concat([new_df, new_df2, new_df3], axis=1))
"""
# And this is the expected output:
new column new column2 new column3
a b a b a b
0 1 2 1 2 1 2
1 3 4 3 4 3 4
"""
Note that if the DataFrames' indexes do not match, you might need to reset the index.
You can read an Excel file with 2 header rows (2 levels of columns):
df = pd.read_excel(
sourceFilePath,
index_col = [0],
header = [0, 1]
)
You can reshape your df like this in order to keep just one header level (it's easier to work with only one header):
df = df.stack([0,1], dropna=False).to_frame('Valeur').reset_index()
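A quick way to see what that stack step does, without needing an Excel file, is to run it on the small two-level frame from the previous answer (new_df and new_df2 carried over from there):

wide = pd.concat([new_df, new_df2], axis=1)  # frame with 2-level columns
flat = wide.stack([0, 1], dropna=False).to_frame('Valeur').reset_index()
print(flat.head())  # one row per (row, level-0, level-1) combination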
I have dataframe
ID   2016-01  2016-02  ...  2017-01  2017-02  ...  2017-10  2017-11  2017-12
111       12       34            0       12             3        0        0
222        0       32            5        5             0        0        0
I need to sum every 12 columns (one column per year) and get
ID   2016  2017
111    46    15
222    32    10
I tried to use
(df.groupby((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
but it sums across all the columns, including ID.
But when I try to use
df.groupby['ID']((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
It returns
TypeError: 'method' object is not subscriptable
How can I fix that?
First, set as the index all columns that are not dates:
df = df.set_index('ID')
1. groupby on the column names split on '-', selecting the first (year) part:
df = df.groupby(df.columns.str.split('-').str[0], axis=1).sum()
2. a lambda function for the split:
df = df.groupby(lambda x: x.split('-')[0], axis=1).sum()
3. convert the columns to datetimes and groupby the years:
df.columns = pd.to_datetime(df.columns)
df = df.groupby(df.columns.year, axis=1).sum()
4. resample by year:
df.columns = pd.to_datetime(df.columns)
df = df.resample('A', axis=1).sum()
df.columns = df.columns.year
print (df)
2016 2017
ID
111 46 15
222 32 10
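Note that axis=1 in groupby and resample is deprecated in recent pandas versions; an equivalent that avoids it is to transpose, group on the row axis, and transpose back:

df.columns = pd.to_datetime(df.columns)
df = df.T.groupby(df.columns.year).sum().T  # same result as groupby(..., axis=1)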
The above code has a slight syntax error and throws the following error:
ValueError: No axis named 1 for object type
Basically, the groupby condition needs to be wrapped in []. So I'm rewriting the code correctly for convenience:
new_df = df.groupby([[i//n for i in range(0,m)]], axis = 1).sum()
where n is the number of columns you want to group together and m is the total number of columns being grouped. You will have to rename the columns afterwards.
If you don't mind losing the labels, you can try this (n and m as above; again, rename the columns afterwards):
new_df = df.groupby([i//n for i in range(0,m)], axis = 1).sum()
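Putting it together, a small runnable sketch with a hypothetical 24-column frame (n=12, m=24):

import numpy as np
import pandas as pd

cols = [f'2016-{i+1:02d}' for i in range(12)] + [f'2017-{i+1:02d}' for i in range(12)]
df = pd.DataFrame(np.arange(48).reshape(2, 24), columns=cols)
m, n = len(df.columns), 12
new_df = df.groupby([i // n for i in range(0, m)], axis=1).sum()
new_df.columns = ['2016', '2017']  # rename afterwards, as noted above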