Imagine having the following df:
Document type Invoicenumber Invoicedate description quantity unit price line amount
Invoice 123 28-08-2020
0 NaN 17-09-2020 test 1,5 5 20
0 NaN 16-04-2020 test2 1,5 5 20
Invoice 456 02-03-2020
0 NaN NaN test3 21 3 64
0 0 NaN test3 21 3 64
0 0 NaN test3 21 3 64
The rows containing a 0 belong to the row above: they are line items of the same document.
My goal is to transpose the line items so that these are on the same line for each invoice as such:
I've tried to transpose them based on index, but this did not work.
**Document type** **Invoicenumber Invoicedate** description#1 description#2 quantity quantity#2 unit price unit price #2 line amount line amount #2
Invoice 123 28-08-2020 test test2 1,5 1,5 5 5 20 20
and for the second row:
**Document type** **Invoicenumber Invoicedate** description#1 description#2 description #3 quantity quantity#2 quantity #3 unit price unit price #2 unit price #3 line amount line amount #2 line amount #3
Invoice 123 28-08-2020 test3 test3 test3 21 21 21 3 3 3 64 64 64
Here is the dictionary code:
df = pd.DataFrame.from_dict({'Document Type': {0: 'IngramMicro.AccountsPayable.Invoice',
1: 0,
2: 0,
3: 'IngramMicro.AccountsPayable.Invoice',
4: 0,
5: 0,
6: 0},
'Factuurnummer': {0: '0.78861803',
1: 'NaN',
2: 'NaN',
3: '202130534',
4: 'NaN',
5: 'NaN',
6: 'NaN'},
'Factuurdatum': {0: '2021-05-03',
1: 'NaN',
2: 'NaN',
3: '2021-09-03',
4: 'NaN',
5: 'NaN',
6: 'NaN'},
'description': {0: 'NaN',
1: 'TM 300 incl onderstel 3058C003 84433210 4549292119381',
2: 'ESP 5Y 36 inch 7950A539 00000000 4960999794266',
3: 'NaN',
4: 'Basistarief A3 Office',
5: 'Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021',
6: 'Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021'},
'quantity': {0: 'NaN', 1: 1.0, 2: 1.0, 3: 'NaN', 4: 1.0, 5: 1.0, 6: 2.0},
'unit price': {0: 'NaN',
1: 1211.63,
2: 742.79,
3: 'NaN',
4: 260.0,
5: 30.0,
6: 30.0},
'line amount': {0: 'NaN',
1: 21.0,
2: 21.0,
3: 'NaN',
4: 260.0,
5: 30.0,
6: 30.0}})
I've tried the following:
df = pd.DataFrame(data=d1)
However, this failed to produce the result I'm after. Please help!
Here is what you can do. First we enumerate the groups and the line items within each group, and clean up 'Document Type':
import numpy as np
df['g'] = df['Document Type'].ne(0).cumsum()
df['l'] = df.groupby('g').cumcount()
df['Document Type'] = df['Document Type'].replace(0, np.nan).ffill()
df
we get
Document Type Factuurnummer Factuurdatum description quantity unit price line amount g l
-- ----------------------------------- --------------- -------------- ------------------------------------------------------------------------ ---------- ------------ ------------- --- ---
0 IngramMicro.AccountsPayable.Invoice 0.788618 2021-05-03 NaN nan nan nan 1 0
1 IngramMicro.AccountsPayable.Invoice nan NaN TM 300 incl onderstel 3058C003 84433210 4549292119381 1 1211.63 21 1 1
2 IngramMicro.AccountsPayable.Invoice nan NaN ESP 5Y 36 inch 7950A539 00000000 4960999794266 1 742.79 21 1 2
3 IngramMicro.AccountsPayable.Invoice 2.02131e+08 2021-09-03 NaN nan nan nan 2 0
4 IngramMicro.AccountsPayable.Invoice nan NaN Basistarief A3 Office 1 260 260 2 1
5 IngramMicro.AccountsPayable.Invoice nan NaN Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 1 30 30 2 2
6 IngramMicro.AccountsPayable.Invoice nan NaN Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 2 30 30 2 3
Now we can index on 'g' and 'l', move 'l' to the columns via unstack, and drop the columns that are all NaN:
df2 = df.set_index(['g','Document Type','l']).unstack(level = 2).replace('NaN',np.nan).dropna(axis='columns', how = 'all')
We rename the column labels to be single-level:
df2.columns = [tup[0] + '_' + str(tup[1]) for tup in df2.columns.values]
df2.reset_index().drop(columns = 'g')
and we get something that looks like what you are after, I believe
Document Type Factuurnummer_0 Factuurdatum_0 description_1 description_2 description_3 quantity_1 quantity_2 quantity_3 unit price_1 unit price_2 unit price_3 line amount_1 line amount_2 line amount_3
-- ----------------------------------- ----------------- ---------------- ----------------------------------------------------- ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------ ------------ ------------ -------------- -------------- -------------- --------------- --------------- ---------------
0 IngramMicro.AccountsPayable.Invoice 0.788618 2021-05-03 TM 300 incl onderstel 3058C003 84433210 4549292119381 ESP 5Y 36 inch 7950A539 00000000 4960999794266 nan 1 1 nan 1211.63 742.79 nan 21 21 nan
1 IngramMicro.AccountsPayable.Invoice 2.02131e+08 2021-09-03 Basistarief A3 Office Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 1 1 2 260 30 30 260 30 30
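For reference, the steps above can be combined into one self-contained sketch; the frame below is an abbreviated toy version of the data (column values shortened for readability):

```python
import numpy as np
import pandas as pd

# toy frame mirroring the structure above (values abbreviated)
df = pd.DataFrame({'Document Type': ['Invoice', 0, 0, 'Invoice', 0, 0, 0],
                   'description': ['NaN', 'a', 'b', 'NaN', 'c', 'd', 'e'],
                   'quantity': ['NaN', 1, 1, 'NaN', 1, 1, 2]})

df['g'] = df['Document Type'].ne(0).cumsum()  # enumerate invoices
df['l'] = df.groupby('g').cumcount()          # enumerate line items within each invoice
df['Document Type'] = df['Document Type'].replace(0, np.nan).ffill()

wide = (df.set_index(['g', 'Document Type', 'l'])
          .unstack(level=2)                   # move line items to the columns
          .replace('NaN', np.nan)
          .dropna(axis='columns', how='all'))
wide.columns = [f'{col}_{i}' for col, i in wide.columns]
wide = wide.reset_index().drop(columns='g')
print(wide)
```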
The task is the following:
Is there a correlation between the age of an athlete and his result at the Olympics in the entire dataset?
Each athlete has a name, age, medal (gold, silver, bronze or NA).
In my opinion, it is necessary to count the number of all athletes of the same age and calculate the percentage of them who have any kind of medal (data.Medal.notnull()). The graph should show all ages on the x-axis and the percentage of those who have a medal on the y-axis. How can I get this data and create the graphic with the help of pandas and matplotlib?
For instance, some data like in table:
Name Age Medal
Name1 20 Silver
Name2 21 NA
Name3 20 NA
Name4 22 Bronze
Name5 22 NA
Name6 21 NA
Name7 20 Gold
Name8 19 Silver
Name9 20 Gold
Name10 20 NA
Name11 21 Silver
The result should be (in the graphic):
19 - 100%
20 - 60%
21 - 33%
22 - 50%
First, turn df.Medal into 1s for a medal and 0s for NaN values using np.where.
import pandas as pd
import numpy as np
data = {'Name': {0: 'Name1', 1: 'Name2', 2: 'Name3', 3: 'Name4', 4: 'Name5',
5: 'Name6', 6: 'Name7', 7: 'Name8', 8: 'Name9', 9: 'Name10',
10: 'Name11'},
'Age': {0: 20, 1: 21, 2: 20, 3: 22, 4: 22, 5: 21, 6: 20, 7: 19, 8: 20,
9: 20, 10: 21},
'Medal': {0: 'Silver', 1: np.nan, 2: np.nan, 3: 'Bronze', 4: np.nan,
5: np.nan, 6: 'Gold', 7: 'Silver', 8: 'Gold', 9: np.nan,
10: 'Silver'}}
df = pd.DataFrame(data)
df.Medal = np.where(df.Medal.notna(),1,0)
print(df)
Name Age Medal
0 Name1 20 1
1 Name2 21 0
2 Name3 20 0
3 Name4 22 1
4 Name5 22 0
5 Name6 21 0
6 Name7 20 1
7 Name8 19 1
8 Name9 20 1
9 Name10 20 0
10 Name11 21 1
Now you could plot the data, for example as follows:
import seaborn as sns
import matplotlib.ticker as mtick
sns.set_theme()
ax = sns.barplot(data=df, x=df.Age, y=df.Medal, errorbar=None)
# in versions prior to `seaborn 0.12` use
# `ax = sns.barplot(data=df, x=df.Age, y=df.Medal, ci=None)`
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0))
# adding labels
ax.bar_label(ax.containers[0],
labels=[f'{round(v*100,2)}%' for v in ax.containers[0].datavalues])
Result:
Incidentally, if you wanted to calculate these percentages explicitly, one option is to use pd.crosstab:
percentages = pd.crosstab(df.Age,df.Medal, normalize='index')\
.rename(columns={1:'percentages'})['percentages']
print(percentages)
Age
19 1.000000
20 0.600000
21 0.333333
22 0.500000
Name: percentages, dtype: float64
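Equivalently, since Medal has already been turned into 0s and 1s at this point, a plain groupby mean produces the same series (a minimal sketch on the same values):

```python
import pandas as pd

df = pd.DataFrame({'Age': [20, 21, 20, 22, 22, 21, 20, 19, 20, 20, 21],
                   'Medal': [1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1]})

# the mean of a 0/1 column is exactly the fraction of medal winners per age
percentages = df.groupby('Age')['Medal'].mean()
print(percentages)
```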
So, with matplotlib, you could also do something like:
percentages = pd.crosstab(df.Age,df.Medal, normalize='index')\
.rename(columns={1:'percentages'})['percentages'].mul(100)
import matplotlib.pyplot as plt

my_cmap = plt.get_cmap("viridis")
rescale = lambda y: (y - np.min(y)) / (np.max(y) - np.min(y))
fig, ax = plt.subplots()
ax.bar(x=percentages.index.astype(str),
height=percentages.to_numpy(),
color=my_cmap(rescale(percentages.to_numpy())))
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.bar_label(ax.containers[0], fmt='%.1f%%')
plt.show()
Result:
I am a newbie in pandas, so please bear with me.
I have this dataframe:
Name,Year,Engine,Price
Car1,2001,100 CC,1000
Car2,2002,150 CC,2000
Car1,2001,100 CC,nan
Car1,2001,100 CC,100
I can't figure out how to change the NaN/null value for "Car1" + Year + "100 CC" from NaN to 1000.
I need to look up the value of "Price" for each combination of "Name + Year + Engine" and fill it in where it is null.
There are a number of rows in the csv file that have a null "Price" for a given "Name + Engine" combination, while other rows with the same "Name + Year + Engine" do have a "Price" associated with them.
Thanks for the help.
With the update of your question (an extra row with Price == 100, where Name == Car and Engine == 100 CC), the logic behind the choice for filling the NaN value in this group with 1000.0 has become ambiguous. Let's add yet another row:
import pandas as pd
import numpy as np
data = {'Name': {0: 'Car1', 1: 'Car2', 2: 'Car1', 3: 'Car1', 4: 'Car1'},
'Year': {0: 2001, 1: 2002, 2: 2001, 3: 2001, 4: 2001},
'Engine': {0: '100 CC', 1: '150 CC', 2: '100 CC', 3: '100 CC', 4: '100 CC'},
'Price': {0: 1000.0, 1: 2000.0, 2: np.nan, 3: 100.0, 4: np.nan}}
df = pd.DataFrame(data)
print(df)
Name Year Engine Price
0 Car1 2001 100 CC 1000.0
1 Car2 2002 150 CC 2000.0
2 Car1 2001 100 CC NaN
3 Car1 2001 100 CC 100.0
4 Car1 2001 100 CC NaN
In this case, what should happen with the second associated NaN value? If you want to fill all NaNs with the first value, you could limit the assignment to the rows that contain NaNs by combining df.loc with pd.Series.isna(). This way you'll only overwrite the NaNs:
df.loc[df['Price'].isna(),'Price'] = df.groupby(['Name','Engine'])\
['Price'].transform('first')
print(df)
Name Year Engine Price
0 Car1 2001 100 CC 1000.0
1 Car2 2002 150 CC 2000.0
2 Car1 2001 100 CC 1000.0
3 Car1 2001 100 CC 100.0
4 Car1 2001 100 CC 1000.0
But you can of course change the function (here: "first") passed to DataFrameGroupBy.transform. E.g. use "max" for 1000.0, if you are selecting it because it is the highest value. Or if you want the mode, you could do: .transform(lambda x: x.mode().iloc[0]) (and get 100.0 in this case!); or get "mean" (550.0), "last" (100) etc.
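As a sketch of those alternatives side by side, each applied only where Price is NaN (same toy data as above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Car1', 'Car2', 'Car1', 'Car1', 'Car1'],
                   'Engine': ['100 CC', '150 CC', '100 CC', '100 CC', '100 CC'],
                   'Price': [1000.0, 2000.0, np.nan, 100.0, np.nan]})

grp = df.groupby(['Name', 'Engine'])['Price']
mask = df['Price'].isna()

# fill the NaNs with each strategy, leaving non-NaN rows untouched
filled = {how: df['Price'].mask(mask, grp.transform(how))
          for how in ('first', 'max', 'mean', 'last')}
filled['mode'] = df['Price'].mask(mask, grp.transform(lambda x: x.mode().iloc[0]))

for how, s in filled.items():
    print(how, s.tolist())
```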
More likely, you would want to use df.ffill, i.e. "forward fill", to propagate the last valid value forward. So, fill first NaN with 1000.0, and the second with 100.0. If so, use:
df['Price'] = df.groupby(['Name','Engine'])['Price'].transform('ffill')
print(df)
Name Year Engine Price
0 Car1 2001 100 CC 1000.0
1 Car2 2002 150 CC 2000.0
2 Car1 2001 100 CC 1000.0
3 Car1 2001 100 CC 100.0
4 Car1 2001 100 CC 100.0
I have a dataset with fields such as:
OrderID, Supplier, Order_Date, Fulfillment_Date
Assumptions:
OrderIDs are unique with no duplicates
Every OrderID has an Order_Date, but not necessarily a
Fulfillment_Date
Every fulfilled order has a fulfillment date
Unfulfilled orders are missing a fulfillment date
I want to calculate 2 things:
Number of unfulfilled orders per supplier by every date in a
range. If there are no unfulfilled orders on a particular date for a particular supplier, mark it 0.
The total unfulfilled order age/vintage by supplier for every date. If there are no unfulfilled orders, mark it 0.
What I've tried:
Some hacky nested loops: they worked, but they're really slow
Some fake sample data:
df=pd.DataFrame.from_dict({'orderID': {0: 'ORDER3762642',
1: 'ORDER3787490',
2: 'ORDER3807252',
3: 'ORDER3800697',
4: 'ORDER3815902',
5: 'ORDER3798524',
6: 'ORDER3809288',
7: 'ORDER3814427',
8: 'ORDER3808695',
9: 'ORDER3809680'},
'supplier': {0: 'Under Armour',
1: 'Nike',
2: 'Nike',
3: 'Nike',
4: 'Nike',
5: 'Adidas',
6: 'Under Armour',
7: 'Adidas',
8: 'Adidas',
9: 'Adidas'},
'order_date': {0: '2022-01-06 17:27:00',
1: '2022-01-20 12:32:00',
2: '2022-02-03 12:18:00',
3: '2022-01-31 09:08:00',
4: '2022-02-08 08:43:00',
5: '2022-01-28 11:10:00',
6: '2022-02-04 12:38:00',
7: '2022-02-07 15:05:00',
8: '2022-02-04 03:39:00',
9: '2022-02-04 17:08:00'},
'fulfillment_date': {0: '2022-02-08 13:05:00',
1: '2022-02-08 12:48:00',
2: '2022-02-08 12:46:00',
3: '2022-02-08 12:45:00',
4: '2022-02-08 12:44:00',
5: '2022-02-08 12:34:00',
6: '2022-02-08 12:22:00',
7: '2022-02-08 12:12:00',
8: "",
9: ""}})
To walk through an example of how one single order would calculate:
df[df["orderID"]=='ORDER3807252']
orderID | supplier | order_date | fulfillment_date
ORDER3807252 | Nike | 2022-02-03 12:18:00 | 2022-02-08 12:46:00
Assuming we were just looking at this one single order, the output might look like:
Supplier | Date | Unfulfilled Orders | Unfulfilled Vintage
-------- | ---------- |------------------- | -------------------
Nike | 2022/02/03 | 0 | 0
Nike | 2022/02/04 | 1 | 36 hours
Nike | 2022/02/05 | 1 | 60 hours
Nike | 2022/02/06 | 1 | 94 hours
Nike | 2022/02/07 | 1 | 118 hours
Nike | 2022/02/08 | 0 | 0
It seems you want to groupby "supplier" and "order_date" to count the number of empty fulfillment_dates. So here goes:
For the first question:
unfulfilled_orders_per_day = df['fulfillment_date'].eq('').groupby([df['supplier'], df['order_date']]).sum().reset_index()
Output:
supplier order_date fulfillment_date
0 Adidas 2022-01-28 11:10:00 0
1 Adidas 2022-02-04 03:39:00 1
2 Adidas 2022-02-04 17:08:00 1
3 Adidas 2022-02-07 15:05:00 0
4 Nike 2022-01-20 12:32:00 0
5 Nike 2022-01-31 09:08:00 0
6 Nike 2022-02-03 12:18:00 0
7 Nike 2022-02-08 08:43:00 0
8 Under Armour 2022-01-06 17:27:00 0
9 Under Armour 2022-02-04 12:38:00 0
For the second question:
total_unfulfilled_orders = df['fulfillment_date'].eq('').groupby(df['supplier']).sum().reset_index()
Output:
supplier fulfillment_date
0 Adidas 2
1 Nike 0
2 Under Armour 0
I have the below dataset. How can I create a new column that shows the difference in money for each person, for each expiry?
The column in yellow is what I want. You can see that it is the difference in money at each expiry point for the person. I highlighted the other rows in colors so it is more clear.
Thanks a lot.
Example
import pandas as pd
import numpy as np
example = pd.DataFrame( data = {'Day': ['2020-08-30', '2020-08-30','2020-08-30','2020-08-30',
'2020-08-29', '2020-08-29','2020-08-29','2020-08-29'],
'Name': ['John', 'Mike', 'John', 'Mike','John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': ['1Y', '1Y', '2Y','2Y','1Y','1Y','2Y','2Y']})
example_0830 = example[ example['Day']=='2020-08-30' ].reset_index()
example_0829 = example[ example['Day']=='2020-08-29' ].reset_index()
example_0830['key'] = example_0830['Name'] + example_0830['Expiry']
example_0829['key'] = example_0829['Name'] + example_0829['Expiry']
example_0829 = pd.DataFrame( example_0829, columns = ['key','Money'])
example_0830 = pd.merge(example_0830, example_0829, on = 'key')
example_0830['Difference'] = example_0830['Money_x'] - example_0830['Money_y']
example_0830 = example_0830.drop(columns=['key', 'Money_y','index'])
Result:
Day Name Money_x Expiry Difference
0 2020-08-30 John 100 1Y 50
1 2020-08-30 Mike 950 1Y 900
2 2020-08-30 John 200 2Y -50
3 2020-08-30 Mike 1000 2Y -200
If the difference is just derived from the previous date, you can define a date variable at the beginning to find today (t) and the previous day (t-1), and filter the original dataframe with it.
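A minimal sketch of that idea, deriving t and t-1 from the data instead of hard-coding them (column names as in the example above, variable names illustrative):

```python
import pandas as pd

example = pd.DataFrame({'Day': ['2020-08-30', '2020-08-30', '2020-08-29', '2020-08-29'],
                        'Name': ['John', 'Mike', 'John', 'Mike'],
                        'Money': [100, 950, 50, 50],
                        'Expiry': ['1Y', '1Y', '1Y', '1Y']})

# today (t) and the previous day (t-1), taken from the data itself
days = sorted(example['Day'].unique(), reverse=True)
t, t_prev = days[0], days[1]

latest = example[example['Day'] == t]
previous = example[example['Day'] == t_prev]
merged = latest.merge(previous, on=['Name', 'Expiry'], suffixes=('', '_prev'))
merged['Difference'] = merged['Money'] - merged['Money_prev']
print(merged[['Day', 'Name', 'Expiry', 'Money', 'Difference']])
```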
You can solve it with groupby.diff
Take the dataframe
df = pd.DataFrame({
'Day': [30, 30, 30, 30, 29, 29, 28, 28],
'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]
})
print(df)
Which looks like
Day Name Money Expiry
0 30 John 100 1
1 30 Mike 950 1
2 30 John 200 2
3 30 Mike 1000 2
4 29 John 50 1
5 29 Mike 50 1
6 28 John 250 2
7 28 Mike 1200 2
And the code
# make sure the dates are in the order we want
df = df.sort_values('Day', ascending=False)
# groupby and get the difference from the next row in each group
# diff(1) calculates the difference from the previous row, so diff(-1) points to the next
df['Difference'] = df.groupby(['Name', 'Expiry']).Money.diff(-1)
Output
Day Name Money Expiry Difference
0 30 John 100 1 50.0
1 30 Mike 950 1 900.0
2 30 John 200 2 -50.0
3 30 Mike 1000 2 -200.0
4 29 John 50 1 NaN
5 29 Mike 50 1 NaN
6 28 John 250 2 NaN
7 28 Mike 1200 2 NaN
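To see the direction of diff at a glance, on a toy series:

```python
import pandas as pd

s = pd.Series([100, 50, 250])
print(s.diff().tolist())    # difference from the previous row: [nan, -50.0, 200.0]
print(s.diff(-1).tolist())  # difference from the next row: [50.0, -200.0, nan]
```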
I am kind of new to data frames and have been trying to find a solution to shift data in specific columns to another row if ID and CODE match.
df = pd.DataFrame({'ID': ['123', '123', '123', '154', '167', '167'],
'NAME': ['Adam', 'Adam', 'Adam', 'Bob', 'Charlie', 'Charlie'],
'CODE': ['1001', '1001', '1011', '1002', 'A0101', 'A0101'],
'TAG1': ['A123', 'B123', 'K123', 'D123', 'E123', 'G123'],
'TAG2': [ np.NaN, 'C123', 'L123', np.NaN, 'F123', 'H123'],
'TAG3': [ np.NaN, 'M123', np.NaN, np.NaN, np.NaN, np.NaN]})
ID NAME CODE TAG1 TAG2 TAG3
0 123 Adam 1001 A123 NaN NaN
1 123 Adam 1001 B123 C123 M123
2 123 Adam 1011 K123 L123 NaN
3 154 Bob 1002 D123 NaN NaN
4 167 Charlie A0101 E123 F123 NaN
5 167 Charlie A0101 G123 H123 NaN
Above I have added the code that creates the initial data frame. We can see that ID '123' has two rows with the same code, and the values in the 'TAG' columns vary. I would like to shift the 'TAG' data from the second row into the first row (after 'TAG1', or into 'TAG1' if it is empty) and delete the second row completely. The same applies to ID '167'.
Below is code for another data frame, added manually, that depicts the desired final result; any suggestions would be great. I tried one thing, but it did not work the way I wanted it to.
df_result = pd.DataFrame({'ID': ['123', '123', '154', '167'],
'NAME': ['Adam', 'Adam', 'Bob', 'Charlie',],
'CODE': ['1001', '1011', '1002', 'A0101'],
'TAG1': ['A123', 'K123', 'D123', 'E123'],
'TAG2': ['B123', 'L123', np.NaN, 'F123'],
'TAG3': ['C123', np.NaN, np.NaN, 'G123'],
'TAG4': ['M123', np.NaN, np.NaN, 'H123']})
ID NAME CODE TAG1 TAG2 TAG3 TAG4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123
The code I tried to get the result is below, but it did not produce exactly the output I wanted:
df2=pd.pivot_table(test_shift, index=['ID', 'NAME', 'CODE'],
columns=test_shift.groupby(['ID', 'CODE']).cumcount().add(1),
values=['TAG1'], aggfunc='sum')
Output Image
NOTE: Sorry for the bad posting the first time; I tried adding the dataframe as code to help you visually, but I failed. I will try to learn that over the coming days and be a better member of Stack Overflow.
Thank you for the help.
Here is one way:
def f(x):
x = x.stack().to_frame(name='val')
x = x.assign(tags='Tag'+x["val"].notna().cumsum().astype(str))
x = (x.reset_index(level=3, drop=True)
.set_index('tags', append=True)['val'].unstack())
return x
df_out = (df.set_index(['ID', 'NAME','CODE'])
.groupby(level=[0,1,2], as_index=False).apply(f)
.reset_index().drop('level_0', axis=1))
print(df_out)
Output:
ID NAME CODE Tag1 Tag2 Tag3 Tag4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123
An approach with pd.wide_to_long: flatten the TAG columns, map the index to the group of ID and CODE, then unstack and join the de-duplicated ID, NAME and CODE rows:
u = pd.wide_to_long(df.filter(like='TAG').reset_index(),'TAG','index','j').dropna()
idx = u.index.get_level_values(0).map(df.groupby(['ID','CODE']).ngroup())
u = u.set_index(pd.MultiIndex.from_arrays((idx,u.groupby(idx).cumcount()+1))).unstack()
u.columns = u.columns.map('{0[0]}{0[1]}'.format)
out = df[['ID','NAME','CODE']].drop_duplicates().reset_index(drop=True).join(u)
print(out)
ID NAME CODE TAG1 TAG2 TAG3 TAG4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123