Pandas dataframe NaN replacements - python

I am a newbie in pandas, so please bear with me.
I have this dataframe:
Name,Year,Engine,Price
Car1,2001,100 CC,1000
Car2,2002,150 CC,2000
Car1,2001,100 CC,nan
Car1,2001,100 CC,100
I can't figure out how to change the NaN (null) value for "Car1" + Year + "100 CC" from NaN to 1000.
I need to extract the value of "Price" for each "Name + Year + Engine" combination and use it to replace the nulls.
There are a number of rows in the csv file that have a null "Price" for a given combination, while other rows with the same "Name + Year + Engine" do have a "Price" associated with them.
Thanks for the help.

With the update of your question (an extra row with Price == 100, where Name == Car1 and Engine == 100 CC), the logic behind the choice for filling the NaN value in this group with 1000.0 has become ambiguous. Let's add yet another row:
import pandas as pd
import numpy as np

data = {'Name': {0: 'Car1', 1: 'Car2', 2: 'Car1', 3: 'Car1', 4: 'Car1'},
        'Year': {0: 2001, 1: 2002, 2: 2001, 3: 2001, 4: 2001},
        'Engine': {0: '100 CC', 1: '150 CC', 2: '100 CC', 3: '100 CC', 4: '100 CC'},
        'Price': {0: 1000.0, 1: 2000.0, 2: np.nan, 3: 100.0, 4: np.nan}}
df = pd.DataFrame(data)
print(df)
Name Year Engine Price
0 Car1 2001 100 CC 1000.0
1 Car2 2002 150 CC 2000.0
2 Car1 2001 100 CC NaN
3 Car1 2001 100 CC 100.0
4 Car1 2001 100 CC NaN
In this case, what should happen with the second associated NaN value? If you want to fill all NaNs with the first value, you could limit the assignment to the rows that contain NaNs by combining df.loc with pd.Series.isna(). This way you'll only overwrite the NaNs:
df.loc[df['Price'].isna(),'Price'] = df.groupby(['Name','Engine'])\
['Price'].transform('first')
print(df)
Name Year Engine Price
0 Car1 2001 100 CC 1000.0
1 Car2 2002 150 CC 2000.0
2 Car1 2001 100 CC 1000.0
3 Car1 2001 100 CC 100.0
4 Car1 2001 100 CC 1000.0
But you can of course change the function (here: "first") passed to DataFrameGroupBy.transform. E.g. use "max" for 1000.0, if you are selecting it because it is the highest value. Or if you want the mode, you could do: .transform(lambda x: x.mode().iloc[0]) (and get 100.0 in this case!); or get "mean" (550.0), "last" (100.0), etc.
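For instance, a minimal sketch of two of those alternatives, reusing the same grouping (the 1000.0 and 100.0 values are from the example above):
grp = df.groupby(['Name', 'Engine'])['Price']
# fill the NaNs with the group's maximum (1000.0 here)
df.loc[df['Price'].isna(), 'Price'] = grp.transform('max')
# or fill them with the group's mode (100.0 here):
# df.loc[df['Price'].isna(), 'Price'] = grp.transform(lambda x: x.mode().iloc[0])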
More likely, you would want to use ffill, i.e. "forward fill", to propagate the last valid value within each group forward: fill the first NaN with 1000.0 and the second with 100.0. If so, use:
df['Price'] = df.groupby(['Name','Engine'])['Price'].transform('ffill')
print(df)
Name Year Engine Price
0 Car1 2001 100 CC 1000.0
1 Car2 2002 150 CC 2000.0
2 Car1 2001 100 CC 1000.0
3 Car1 2001 100 CC 100.0
4 Car1 2001 100 CC 100.0
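Note that ffill cannot fill a NaN that appears before the first valid value of its group. If, in that case, a later value is an acceptable fallback, you can chain a backward fill (a sketch, assuming any value within the group may be used):
df['Price'] = df.groupby(['Name', 'Engine'])['Price'].transform(lambda s: s.ffill().bfill())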

Selecting duplicates by condition python pandas

I have a simple dataframe whose rows I would like to separate from each other based on some conditions.
Car    Year  Speed  Cond
BMW    2001  150    X
BMW    2000  150
Audi   1997  200
Audi   2000  200
Audi   2012  200    X
Fiat   2020  180
Mazda  2022  183
What I have to do is move the duplicates to another dataframe and leave only one row per car in my main dataframe.
Rows that are duplicates in the Car column I would like to separate into a second dataframe, but I don't need the ones that have X in the Cond column.
In the main dataframe I would like to keep one row per car, and the kept row should be the one that contains X in the Cond column.
I have code:
import pandas as pd
import numpy as np
cars = {'Car': {0: 'BMW', 1: 'BMW', 2: 'Audi', 3: 'Audi', 4: 'Audi', 5: 'Fiat', 6: 'Mazda'},
        'Year': {0: 2001, 1: 2000, 2: 1997, 3: 2000, 4: 2012, 5: 2020, 6: 2022},
        'Speed': {0: 150, 1: 150, 2: 200, 3: 200, 4: 200, 5: 180, 6: 183},
        'Cond': {0: 'X', 1: np.nan, 2: 'X', 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan}}
df = pd.DataFrame.from_dict(cars)
df_duplicates = df.loc[df.duplicated(subset=['Car'], keep=False) & df['Cond'].ne('X')]
I don't know how I can leave the main dataframe with only one row per car, which additionally contains X in the Cond column.
Maybe there is a single command that can both drop those rows and select them into another dataframe according to the rules above?
If I understand correctly the desired logic, you can use groupby.idxmax to select the first X per group if any (else the first row of the group), to keep in the main DataFrame. The rest goes in the other DataFrame (df2).
# get indices of the row with X if any, else of the first one per group
keep = df['Cond'].eq('X').groupby(df['Car']).idxmax()
# drop selected rows
df2 = df.drop(keep)
# keep selected rows
df = df.loc[keep]
Output:
# updated df (main DataFrame)
Car Year Speed Cond
0 BMW 2001 150 X
2 Audi 1997 200 X
5 Fiat 2020 180 NaN
6 Mazda 2022 183 NaN
# df2
Car Year Speed Cond
1 BMW 2000 150 NaN
3 Audi 2000 200 NaN
4 Audi 2012 200 NaN
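For reference, the intermediate keep Series holds one row label per Car (the first X if present, else the group's first row); it should look like this:
print(keep)
Car
Audi     2
BMW      0
Fiat     5
Mazda    6
Name: Cond, dtype: int64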

Transposing values in df?

Imagine having the following df:
Document type Invoicenumber Invoicedate description quantity unit price line amount
Invoice 123 28-08-2020
0 NaN 17-09-2020 test 1,5 5 20
0 NaN 16-04-2020 test2 1,5 5 20
Invoice 456 02-03-2020
0 NaN NaN test3 21 3 64
0 0 NaN test3 21 3 64
0 0 NaN test3 21 3 64
The rows with a 0 belong to the row above: they are line items of the same document.
My goal is to transpose the line items so that they end up on the same line for each invoice, as shown below. I've tried to transpose them based on the index, but this did not work.
Document type  Invoicenumber  Invoicedate  description#1  description#2  quantity#1  quantity#2  unit price#1  unit price#2  line amount#1  line amount#2
Invoice        123            28-08-2020   test           test2          1,5         1,5         5             5             20             20
and for the second invoice:
Document type  Invoicenumber  Invoicedate  description#1  description#2  description#3  quantity#1  quantity#2  quantity#3  unit price#1  unit price#2  unit price#3  line amount#1  line amount#2  line amount#3
Invoice        456            02-03-2020   test3          test3          test3          21          21          21          3             3             3             64             64             64
Here is the dictionary code:
df = pd.DataFrame.from_dict({'Document Type': {0: 'IngramMicro.AccountsPayable.Invoice',
1: 0,
2: 0,
3: 'IngramMicro.AccountsPayable.Invoice',
4: 0,
5: 0,
6: 0},
'Factuurnummer': {0: '0.78861803',
1: 'NaN',
2: 'NaN',
3: '202130534',
4: 'NaN',
5: 'NaN',
6: 'NaN'},
'Factuurdatum': {0: '2021-05-03',
1: 'NaN',
2: 'NaN',
3: '2021-09-03',
4: 'NaN',
5: 'NaN',
6: 'NaN'},
'description': {0: 'NaN',
1: 'TM 300 incl onderstel 3058C003 84433210 4549292119381',
2: 'ESP 5Y 36 inch 7950A539 00000000 4960999794266',
3: 'NaN',
4: 'Basistarief A3 Office',
5: 'Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021',
6: 'Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021'},
'quantity': {0: 'NaN', 1: 1.0, 2: 1.0, 3: 'NaN', 4: 1.0, 5: 1.0, 6: 2.0},
'unit price': {0: 'NaN',
1: 1211.63,
2: 742.79,
3: 'NaN',
4: 260.0,
5: 30.0,
6: 30.0},
'line amount': {0: 'NaN',
1: 21.0,
2: 21.0,
3: 'NaN',
4: 260.0,
5: 30.0,
6: 30.0}})
I've tried the following:
df = pd.DataFrame(data=d1)
However, that failed to accomplish significant results.
Please help!
Here is what you can do. First we enumerate the groups and the line items within each group, and clean up 'Document Type':
import numpy as np
df['g'] = df['Document Type'].ne(0).cumsum()
df['l'] = df.groupby('g').cumcount()
df['Document Type'] = df['Document Type'].replace(0, np.nan).ffill()
df
we get
Document Type Factuurnummer Factuurdatum description quantity unit price line amount g l
-- ----------------------------------- --------------- -------------- ------------------------------------------------------------------------ ---------- ------------ ------------- --- ---
0 IngramMicro.AccountsPayable.Invoice 0.788618 2021-05-03 NaN nan nan nan 1 0
1 IngramMicro.AccountsPayable.Invoice nan NaN TM 300 incl onderstel 3058C003 84433210 4549292119381 1 1211.63 21 1 1
2 IngramMicro.AccountsPayable.Invoice nan NaN ESP 5Y 36 inch 7950A539 00000000 4960999794266 1 742.79 21 1 2
3 IngramMicro.AccountsPayable.Invoice 2.02131e+08 2021-09-03 NaN nan nan nan 2 0
4 IngramMicro.AccountsPayable.Invoice nan NaN Basistarief A3 Office 1 260 260 2 1
5 IngramMicro.AccountsPayable.Invoice nan NaN Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 1 30 30 2 2
6 IngramMicro.AccountsPayable.Invoice nan NaN Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 2 30 30 2 3
Now we can index on 'g' and 'l' and then move 'l' to the columns via unstack. We drop columns that are all NaNs:
df2 = df.set_index(['g','Document Type','l']).unstack(level = 2).replace('NaN',np.nan).dropna(axis='columns', how = 'all')
We rename the column labels to be single-level:
df2.columns = [tup[0] + '_' + str(tup[1]) for tup in df2.columns.values]
df2.reset_index().drop(columns = 'g')
and we get something that looks like what you are after, I believe:
Document Type Factuurnummer_0 Factuurdatum_0 description_1 description_2 description_3 quantity_1 quantity_2 quantity_3 unit price_1 unit price_2 unit price_3 line amount_1 line amount_2 line amount_3
-- ----------------------------------- ----------------- ---------------- ----------------------------------------------------- ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------ ------------ ------------ -------------- -------------- -------------- --------------- --------------- ---------------
0 IngramMicro.AccountsPayable.Invoice 0.788618 2021-05-03 TM 300 incl onderstel 3058C003 84433210 4549292119381 ESP 5Y 36 inch 7950A539 00000000 4960999794266 nan 1 1 nan 1211.63 742.79 nan 21 21 nan
1 IngramMicro.AccountsPayable.Invoice 2.02131e+08 2021-09-03 Basistarief A3 Office Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 Toeslag 100 km enkele reis Leveren installeren Xerox VL C7020 05-03-2021 1 1 2 260 30 30 260 30 30
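As a side note, the same reshape can be written with pivot instead of set_index + unstack (assuming pandas >= 1.1, where pivot accepts a list of index columns); the same column-flattening cleanup applies afterwards:
df2 = df.pivot(index=['g', 'Document Type'], columns='l').replace('NaN', np.nan).dropna(axis='columns', how='all')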

Pandas - Create column with difference in values

I have the dataset below. How can I create a new column that shows the difference in money for each person, for each expiry?
The column in yellow is what I want. You can see that it is the difference in money at each expiry point for the person. I highlighted the other rows in colors so it is clearer.
Thanks a lot.
Example:
import pandas as pd
import numpy as np

example = pd.DataFrame(data={'Day': ['2020-08-30', '2020-08-30', '2020-08-30', '2020-08-30',
                                     '2020-08-29', '2020-08-29', '2020-08-29', '2020-08-29'],
                             'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
                             'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
                             'Expiry': ['1Y', '1Y', '2Y', '2Y', '1Y', '1Y', '2Y', '2Y']})
example_0830 = example[example['Day'] == '2020-08-30'].reset_index()
example_0829 = example[example['Day'] == '2020-08-29'].reset_index()
example_0830['key'] = example_0830['Name'] + example_0830['Expiry']
example_0829['key'] = example_0829['Name'] + example_0829['Expiry']
example_0829 = pd.DataFrame(example_0829, columns=['key', 'Money'])
example_0830 = pd.merge(example_0830, example_0829, on='key')
example_0830['Difference'] = example_0830['Money_x'] - example_0830['Money_y']
example_0830 = example_0830.drop(columns=['key', 'Money_y', 'index'])
Result:
Day Name Money_x Expiry Difference
0 2020-08-30 John 100 1Y 50
1 2020-08-30 Mike 950 1Y 900
2 2020-08-30 John 200 2Y -50
3 2020-08-30 Mike 1000 2Y -200
If the difference is always derived from the previous date, you can also define a date variable at the beginning to find today (t) and the previous day (t-1), and use it to filter the original dataframe.
You can solve it with groupby.diff
Take the dataframe
df = pd.DataFrame({
'Day': [30, 30, 30, 30, 29, 29, 28, 28],
'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]
})
print(df)
Which looks like
Day Name Money Expiry
0 30 John 100 1
1 30 Mike 950 1
2 30 John 200 2
3 30 Mike 1000 2
4 29 John 50 1
5 29 Mike 50 1
6 28 John 250 2
7 28 Mike 1200 2
And the code
# make sure the dates are in the order we want
df = df.sort_values('Day', ascending=False)
# groupby and get the difference from the next row in each group;
# diff(1) calculates the difference from the previous row, so -1 points to the next
df['Difference'] = df.groupby(['Name', 'Expiry']).Money.diff(-1)
Output
Day Name Money Expiry Difference
0 30 John 100 1 50.0
1 30 Mike 950 1 900.0
2 30 John 200 2 -50.0
3 30 Mike 1000 2 -200.0
4 29 John 50 1 NaN
5 29 Mike 50 1 NaN
6 28 John 250 2 NaN
7 28 Mike 1200 2 NaN
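The same per-group difference can also be written with shift, which makes the "next row" logic explicit:
# subtract the next row's Money within each (Name, Expiry) group
df['Difference'] = df['Money'] - df.groupby(['Name', 'Expiry'])['Money'].shift(-1)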

Fill column with value of a column from another dataframe, depending on conditions

I have a dataframe that looks like this (my input database on COVID cases)
data:
date state cases
0 20200625 NY 300
1 20200625 CA 250
2 20200625 TX 200
3 20200625 FL 100
5 20200624 NY 290
6 20200624 CA 240
7 20200624 TX 100
8 20200624 FL 80
...
Worth noting: the "date" column in the above data is a number (not a datetime).
I want to make it a timeseries like this (desired output), with dates as index and each state's COVID cases as columns
NY CA TX FL
20200625 300 250 200 100
20200626 290 240 100 80
...
So far I have managed to create only the skeleton of the output, with the following code:
import pandas as pd

states = ['NY', 'CA', 'TX', 'FL']
days = [20200625, 20200626]
columns = states
positives = pd.DataFrame(columns=columns)
i = 0
for day in days:
    positives.loc[i, "date"] = day
    i = i + 1
positives.set_index('date', inplace=True)
positives = positives.rename_axis(None)
print(positives)
which returns:
NY CA TX FL
20200625.0 NaN NaN NaN NaN
20200626.0 NaN NaN NaN NaN
How can I get from the "data" dataframe the value of the "cases" column when:
(i) value in data["state"] = column header of "positives",
(ii) value in data["date"] = row index of "positives"
You can do:
df = df.set_index(['date', 'state']).unstack().reset_index()
# fix column names
df.columns = df.columns.get_level_values(1)
state CA FL NY TX
0 20200624 240.0 NaN 290.0 NaN
1 20200625 250.0 100.0 300.0 200.0
Later, to set the index again, we need to set the name explicitly:
df = df.set_index("")
df.index.name = "date"
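A variant that sidesteps the empty column name entirely is to select the "cases" column before unstacking, so the result keeps "date" as a named index from the start:
df = df.set_index(['date', 'state'])['cases'].unstack()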
The transformation you are interested in is called a pivot. You can achieve this in Pandas as follows:
# Reproduce part of the data
data = pd.DataFrame({'date': [20200625, 20200625, 20200624, 20200624],
'state': ['NY', 'CA', 'NY', 'CA'],
'cases': [300, 250, 290, 240]})
data
# date state cases
# 0 20200625 NY 300
# 1 20200625 CA 250
# 2 20200624 NY 290
# 3 20200624 CA 240
# Pivot
data.pivot(index='date', columns='state', values='cases')
# state CA NY
# date
# 20200624 240 290
# 20200625 250 300
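One caveat: pivot raises an error if any (date, state) pair occurs more than once. If your data can contain such duplicates, pivot_table with an aggregation function is the usual fallback, e.g.:
data.pivot_table(index='date', columns='state', values='cases', aggfunc='sum')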

Pandas DataFrame multiplication with missing values

I have 2 dataframes:
Value
Location Time
Hawai 2000 1.764052
2002 0.400157
Torino 2000 0.978738
2002 2.240893
Paris 2000 1.867558
2002 -0.977278
2000 2002
Country Unit Location
US USD Hawai 2 8
IT EUR Torino 4 10
FR EUR Paris 6 12
Created with
np.random.seed(0)
tuples = list(zip(*[['Hawai', 'Hawai', 'Torino', 'Torino', 'Paris', 'Paris'],
                    [2000, 2002, 2000, 2002, 2000, 2002]]))
idx = pd.MultiIndex.from_tuples(tuples, names=['Location', 'Time'])
df = pd.DataFrame(np.random.randn(6, 1), index=idx, columns=['Value'])
df2 = pd.DataFrame({'Country': ['US', 'IT', 'FR'],
                    'Unit': ['USD', 'EUR', 'EUR'],
                    'Location': ['Hawai', 'Torino', 'Paris'],
                    '2000': [2, 4, 6],
                    '2002': [8, 10, 12]})
df2.set_index(['Country', 'Unit', 'Location'], inplace=True)
I want to multiply each column from df2 with the corresponding Value from df.
This code does it well:
df2.columns=df2.columns.astype(int)
s=df.Value.unstack(fill_value=1)
df2 = df2.mul(s)
and produces
2000 2002
Country Unit Location
US USD Hawai 3.528105 3.201258
IT EUR Torino 3.914952 22.408932
FR EUR Paris 11.205348 -11.727335
Now I want to handle the case where df2 has missing values represented as '..', multiplying the numerical values and skipping the others:
2000 2002
Country Unit Location
US USD Hawai 2 8
IT EUR Torino .. 10
FR EUR Paris 6 12
Running the code above gives the error TypeError: can't multiply sequence by non-int of type 'float'.
Any idea how to achieve this result?
2000 2002
Country Unit Location
US USD Hawai 3.528105 3.201258
IT EUR Torino .. 22.408932
FR EUR Paris 11.205348 -11.727335
I think it is better here to use missing values instead of '..', via to_numeric with errors='coerce', so the multiplication works nicely:
df2 = pd.DataFrame({'Country': ['US', 'IT', 'FR'],
                    'Unit': ['USD', 'EUR', 'EUR'],
                    'Location': ['Hawai', 'Torino', 'Paris'],
                    '2000': [2, '..', 6],
                    '2002': [8, 10, 12]})
df2.set_index(['Country', 'Unit', 'Location'], inplace=True)
df2.columns=df2.columns.astype(int)
s= df.Value.unstack(fill_value=1)
df2 = df2.apply(lambda x: pd.to_numeric(x, errors='coerce')).mul(s)
print (df2)
2000 2002
Country Unit Location
US USD Hawai 3.528105 3.201258
IT EUR Torino NaN 22.408932
FR EUR Paris 11.205348 -11.727335
If the only non-numeric values are '..', another solution is to use replace:
df2 = df2.replace('..', np.nan).mul(s)
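If, as in your desired output above, you want to see '..' again in place of the resulting NaN, you can fill it back in afterwards (note this turns the columns into object dtype, so it is best done only for display):
print(df2.fillna('..'))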
