Pandas DataFrame multiplication with missing values - python

I have two dataframes:
                    Value
Location Time
Hawai    2000    1.764052
         2002    0.400157
Torino   2000    0.978738
         2002    2.240893
Paris    2000    1.867558
         2002   -0.977278

                       2000  2002
Country Unit Location
US      USD  Hawai        2     8
IT      EUR  Torino       4    10
FR      EUR  Paris        6    12
Created with:
import numpy as np
import pandas as pd

np.random.seed(0)
tuples = list(zip(*[['Hawai', 'Hawai', 'Torino', 'Torino', 'Paris', 'Paris'],
                    [2000, 2002, 2000, 2002, 2000, 2002]]))
idx = pd.MultiIndex.from_tuples(tuples, names=['Location', 'Time'])
df = pd.DataFrame(np.random.randn(6, 1), index=idx, columns=['Value'])

df2 = pd.DataFrame({'Country': ['US', 'IT', 'FR'],
                    'Unit': ['USD', 'EUR', 'EUR'],
                    'Location': ['Hawai', 'Torino', 'Paris'],
                    '2000': [2, 4, 6],
                    '2002': [8, 10, 12]})
df2.set_index(['Country', 'Unit', 'Location'], inplace=True)
I want to multiply each column of df2 by the corresponding Value from df (the first DataFrame).
This code works well:
df2.columns = df2.columns.astype(int)
s = df.Value.unstack(fill_value=1)
df2 = df2.mul(s)
and produces
                            2000       2002
Country Unit Location
US      USD  Hawai      3.528105   3.201258
IT      EUR  Torino     3.914952  22.408932
FR      EUR  Paris     11.205348 -11.727335
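For reference, the intermediate s produced by the unstack step is just df reshaped so that Time becomes the columns, indexed by Location; mul then aligns on both the Location level and the year columns. With np.random.seed(0) it looks roughly like this (row order may differ, which doesn't matter since mul aligns by label):

s = df.Value.unstack(fill_value=1)
print(s)

Time          2000      2002
Location
Hawai     1.764052  0.400157
Torino    0.978738  2.240893
Paris     1.867558 -0.977278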
Now I want to handle the case where df2 has missing values represented as '..', multiplying the numeric values and skipping the others:
                       2000  2002
Country Unit Location
US      USD  Hawai        2     8
IT      EUR  Torino      ..    10
FR      EUR  Paris        6    12
Running the code above gives the error TypeError: can't multiply sequence by non-int of type 'float'.
Any idea how to achieve this result?
                            2000       2002
Country Unit Location
US      USD  Hawai      3.528105   3.201258
IT      EUR  Torino           ..  22.408932
FR      EUR  Paris     11.205348 -11.727335

I think it is better here to use real missing values instead of '..', converting with to_numeric and errors='coerce'; then the multiplication works nicely:
df2 = pd.DataFrame({'Country': ['US', 'IT', 'FR'],
                    'Unit': ['USD', 'EUR', 'EUR'],
                    'Location': ['Hawai', 'Torino', 'Paris'],
                    '2000': [2, '..', 6],
                    '2002': [8, 10, 12]})
df2.set_index(['Country', 'Unit', 'Location'], inplace=True)
df2.columns = df2.columns.astype(int)

s = df.Value.unstack(fill_value=1)
df2 = df2.apply(lambda x: pd.to_numeric(x, errors='coerce')).mul(s)
print (df2)
                            2000       2002
Country Unit Location
US      USD  Hawai      3.528105   3.201258
IT      EUR  Torino          NaN  22.408932
FR      EUR  Paris     11.205348 -11.727335
If the only non-numeric values are '..', another solution is to use replace:
df2 = df2.replace('..', np.nan).mul(s)
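If you additionally want the final output to keep showing '..' where the input was non-numeric (as in the desired result in the question), you can mask the product afterwards. A minimal sketch, assuming df2 is still the raw frame containing '..' with its columns already cast to int, and s is the unstacked Value; note the affected columns become object dtype:

num = df2.apply(lambda x: pd.to_numeric(x, errors='coerce'))  # '..' -> NaN
res = num.mul(s)                                              # numeric multiply
res = res.where(num.notna(), '..')                            # put '..' back where the input was non-numeric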

Related

Pandas dataframe NaN replacements

I am a newbie in pandas, so please bear with me.
I have this dataframe:
Name,Year,Engine,Price
Car1,2001,100 CC,1000
Car2,2002,150 CC,2000
Car1,2001,100 CC,nan
Car1,2001,100 CC,100
I can't figure out how to change the NaN/null value for "Car1" + Year + "100 CC" from NaN to 1000.
I need to extract the value of "Price" for each "Name + Year + Engine" combination and use it to replace the null values.
There are a number of rows in the csv file that have a null "Price" for a given "Name + Engine" combination, while other rows with the same "Name + Year + Engine" do have a "Price" associated with them.
Thanks for the help.
With the update of your question (an extra row with Price == 100, where Name == Car1 and Engine == 100 CC), the logic behind the choice of filling the NaN value in this group with 1000.0 has become ambiguous. Let's add yet another row:
import pandas as pd
import numpy as np

data = {'Name': {0: 'Car1', 1: 'Car2', 2: 'Car1', 3: 'Car1', 4: 'Car1'},
        'Year': {0: 2001, 1: 2002, 2: 2001, 3: 2001, 4: 2001},
        'Engine': {0: '100 CC', 1: '150 CC', 2: '100 CC', 3: '100 CC', 4: '100 CC'},
        'Price': {0: 1000.0, 1: 2000.0, 2: np.nan, 3: 100.0, 4: np.nan}}
df = pd.DataFrame(data)
print(df)
   Name  Year  Engine   Price
0  Car1  2001  100 CC  1000.0
1  Car2  2002  150 CC  2000.0
2  Car1  2001  100 CC     NaN
3  Car1  2001  100 CC   100.0
4  Car1  2001  100 CC     NaN
In this case, what should happen with the second associated NaN value? If you want to fill all NaNs with the first value, you could limit the assignment to the rows that contain NaNs by combining df.loc with pd.Series.isna(). This way you'll only overwrite the NaNs:
df.loc[df['Price'].isna(), 'Price'] = (df.groupby(['Name', 'Engine'])['Price']
                                       .transform('first'))
print(df)
   Name  Year  Engine   Price
0  Car1  2001  100 CC  1000.0
1  Car2  2002  150 CC  2000.0
2  Car1  2001  100 CC  1000.0
3  Car1  2001  100 CC   100.0
4  Car1  2001  100 CC  1000.0
But you can of course change the function (here: "first") passed to DataFrameGroupBy.transform. E.g. use "max" for 1000.0, if you are selecting it because it is the highest value. Or if you want the mode, you could do: .transform(lambda x: x.mode().iloc[0]) (and get 100.0 in this case!); or get "mean" (550.0), "last" (100) etc.
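For example, the mode variant could look like this (a sketch; note that x.mode().iloc[0] would raise an IndexError on a group whose prices are all NaN):

df.loc[df['Price'].isna(), 'Price'] = (df.groupby(['Name', 'Engine'])['Price']
                                       .transform(lambda x: x.mode().iloc[0]))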
More likely, you would want to forward fill ('ffill'), i.e. propagate the last valid value within each group forward: fill the first NaN with 1000.0 and the second with 100.0. If so, use:
df['Price'] = df.groupby(['Name','Engine'])['Price'].transform('ffill')
print(df)
   Name  Year  Engine   Price
0  Car1  2001  100 CC  1000.0
1  Car2  2002  150 CC  2000.0
2  Car1  2001  100 CC  1000.0
3  Car1  2001  100 CC   100.0
4  Car1  2001  100 CC   100.0

Pandas: removing float values from output of a pivot_table used for counting

I have the following (toy) dataset:
import pandas as pd
import numpy as np

df = pd.DataFrame({'System_Key': ['MER-002', 'MER-003', 'MER-004', 'MER-005', 'BAV-378', 'BAV-379',
                                  'BAV-380', 'BAV-381', 'AUD-220', 'AUD-221', 'AUD-222', 'AUD-223'],
                   'Manufacturer': ['Mercedes', 'Mercedes', 'Mercedes', 'Mercedes', 'BMW', 'BMW',
                                    'BMW', 'BMW', 'Audi', 'Audi', 'Audi', 'Audi'],
                   'Region': ['Americas', 'Europe', 'Americas', 'Asia', 'Asia', 'Europe',
                              'Europe', 'Europe', 'Americas', 'Asia', 'Americas', 'Americas'],
                   'Department': [np.nan, 'Sales', np.nan, 'Operations', np.nan, np.nan,
                                  'Accounting', np.nan, 'Finance', 'Finance', 'Finance', np.nan]})
   System_Key Manufacturer    Region  Department
0     MER-002     Mercedes  Americas         NaN
1     MER-003     Mercedes    Europe       Sales
2     MER-004     Mercedes  Americas         NaN
3     MER-005     Mercedes      Asia  Operations
4     BAV-378          BMW      Asia         NaN
5     BAV-379          BMW    Europe         NaN
6     BAV-380          BMW    Europe  Accounting
7     BAV-381          BMW    Europe         NaN
8     AUD-220         Audi  Americas     Finance
9     AUD-221         Audi      Asia     Finance
10    AUD-222         Audi  Americas     Finance
11    AUD-223         Audi  Americas         NaN
First, I remove the NaN values in the data frame:
df = df.fillna('')
Then, I pivot the data frame as follows:
pivot = pd.pivot_table(df, index='Manufacturer', columns='Region', values='System_Key', aggfunc='size').applymap(str)
Notice that I'm passing aggfunc='size' for counting.
This results in the following pivot table:
Region        Americas  Asia  Europe
Manufacturer
Audi               3.0   1.0     NaN
BMW                NaN   1.0     3.0
Mercedes           2.0   1.0     1.0
How would I convert the float values in this pivot table to integers?
Thanks in advance!
Try fill_value
pivot = pd.pivot_table(df, index='Manufacturer', columns='Region', values='System_Key', aggfunc='size',fill_value=-1).astype(int)
The only reason you get floats from an integer aggregation like size() is that some cells in the result are missing and become NaN. So use fill_value=0 to impute them as zeros and avoid getting NaNs in the first place:
df.pivot_table(index='Manufacturer', columns='Region', values='System_Key', aggfunc='size', fill_value=0)
Region        Americas  Asia  Europe
Manufacturer
Audi                 3     1       0
BMW                  0     1       3
Mercedes             2     1       1
Notes:
This is much better than kludging the dtype afterwards.
You also don't need the df.fillna(''); filling NaN with the string '' in an integer/float column is a bad idea.
Note you don't need to call pd.pivot_table(df, ...); just call df.pivot_table(...) directly, since it is a method of DataFrame.
Since you have NaN in your result, pandas automatically casts the counts to float. You can either use the nullable Int64 datatype (available from pandas 0.24+):
pivot = (pd.pivot_table(df, index='Manufacturer', columns='Region',
                        values='System_Key', aggfunc='size')
           .astype('Int64'))
Output:
Region        Americas  Asia  Europe
Manufacturer
Audi                 3     1    <NA>
BMW               <NA>     1       3
Mercedes             2     1       1
or fill NaN with, say, -1 in pivot_table:
pivot = pd.pivot_table(df, index='Manufacturer', columns='Region',
                       values='System_Key', aggfunc='size',
                       fill_value=-1)  # <--- here
Output:
Region        Americas  Asia  Europe
Manufacturer
Audi                 3     1      -1
BMW                 -1     1       3
Mercedes             2     1       1
Use the Int64 datatype, which allows for integer NaNs. The convert_dtypes() function is handy here:
pivot.convert_dtypes()
              Americas  Asia  Europe
Manufacturer
Audi                 3     1    <NA>
BMW               <NA>     1       3
Mercedes             2     1       1
Also...
I'd probably do df.fillna('', inplace=True) instead of df = df.fillna('') to minimize data copies
I assume you meant to ditch the .applymap(str) bit at the end of your call to pivot_table().
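If you want to double-check what you ended up with, a quick sketch (assuming pivot is one of the results built above):

print(pivot.dtypes)
# each Region column should now report int64 (or Int64), rather than float64 or object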

Why do I have NaN after key-value matching with a dict in a pandas dataframe?

I am trying to add a new label column by key-value matching against a dictionary in my dataframe. I used the map function to do so, but the values of the newly added column are all NaN, meaning the matching doesn't work in my code. How can I correct this, and why is it happening? I intend to add a new label column by matching keys from my dictionary against the prod_code column in my pandas dataframe.
Minimal data:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[list('EEEIEEIIEI'),
                        ['AR', 'AUC', 'CA', 'CN', 'MX', 'MX', 'AR', 'IT', 'UK', 'RU'],
                        ['ALBANIA', 'PAKISTN', 'UGANDA', 'FRANCE', 'USA', 'RUSSIA', 'COLOMBIA', 'KAZAK', 'KOREA', 'JAPAN'],
                        [20230, 20220, 20120, 20230, 20230, 20220, 20230, 20120, 20130, 20329],
                        list(np.random.randint(10, 100, 10)),
                        list(np.random.randint(10, 100, 10))]).T
df.columns = ['ID', 'cty', 'cty_ptn', 'prod_code', 'Quantity1', 'Quantity2']
print(df)
Here is my code:
my_dict = {'20230': 'Gas',
           '20220': 'Water',
           '20210': 'Refined',
           '20120': 'Oil',
           '20239': 'Other'}
df['prod_label'] = df['prod_code'].map(my_dict)
How can I fix the NaN in the newly assigned column? Any ideas? Thanks.
Since the column prod_code contains integers and the dictionary keys are strings, you have to convert the column to str with astype before mapping:
my_dict = {'20230': 'Gas',
           '20220': 'Water',
           '20210': 'Refined',
           '20120': 'Oil',
           '20239': 'Other'}
df['prod_label'] = df['prod_code'].astype(str).map(my_dict)
  ID  cty   cty_ptn prod_code Quantity1 Quantity2 prod_label
0  E   AR   ALBANIA     20230        45        84        Gas
1  E  AUC   PAKISTN     20220        68        10      Water
2  E   CA    UGANDA     20120        48        45        Oil
3  I   CN    FRANCE     20230        11        93        Gas
4  E   MX       USA     20230        62        81        Gas
5  E   MX    RUSSIA     20220        27        49      Water
6  I   AR  COLOMBIA     20230        55        97        Gas
7  I   IT     KAZAK     20120        32        93        Oil
8  E   UK     KOREA     20130        63        88        NaN
9  I   RU     JAPAN     20329        99        39        NaN
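Alternatively, you could go the other way and make the dictionary keys match the column's integer values; a sketch (the remaining NaNs for 20130 and 20329 are expected, since those codes simply aren't in the dictionary):

my_dict_int = {int(k): v for k, v in my_dict.items()}
df['prod_label'] = df['prod_code'].astype(int).map(my_dict_int)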

How to combine two pandas dataframes on two different columns having elements not in order? [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have two datasets that look like this:
          name  Longitude   Latitude      continent
0        Aruba -69.982677  12.520880  North America
1  Afghanistan  66.004734  33.835231           Asia
2       Angola  17.537368 -12.293361         Africa
3     Anguilla -63.064989  18.223959  North America
4      Albania  20.049834  41.142450         Europe
And another dataset looks like this:
          COUNTRY  GDP (BILLIONS) CODE
0     Afghanistan           21.71  AFG
1         Albania           13.40  ALB
2         Algeria          227.80  DZA
3  American Samoa            0.75  ASM
4         Andorra            4.80  AND
Here, the columns name and COUNTRY contain the country names, but not in the same order.
How do I combine the second dataframe into the first one and add the CODE column to the first dataframe?
Required output:
          name  Longitude   Latitude      continent CODE
0        Aruba -69.982677  12.520880  North America  NaN
1  Afghanistan  66.004734  33.835231           Asia  AFG
2       Angola  17.537368 -12.293361         Africa  NaN
3     Anguilla -63.064989  18.223959  North America  NaN
4      Albania  20.049834  41.142450         Europe  ALB
Attempt:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
                   'Longitude': [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
                   'Latitude': [12.520880, 33.835231, '-12.293361', 18.223959, 41.142450],
                   'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe']})
print(df)

df2 = pd.DataFrame({'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
                    'GDP (BILLIONS)': [21.71, 13.40, 227.80, 0.75, 4.80],
                    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND']})
print(df2)

pd.merge(left=df, right=df2, left_on='name', right_on='COUNTRY')
# but this fails
By default, pd.merge uses how='inner', which uses the intersection of keys across your two dataframes. Here, you need how='left' to use keys only from the left dataframe:
res = pd.merge(df, df2, how='left', left_on='name', right_on='COUNTRY')
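If you want exactly the required output (only the CODE column added, without COUNTRY and GDP), you could also restrict the right frame and drop the redundant key column afterwards; a sketch:

res = pd.merge(df, df2[['COUNTRY', 'CODE']], how='left',
               left_on='name', right_on='COUNTRY').drop(columns='COUNTRY')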
The merge performs an 'inner' merge or join by default, only keeping records that have a match on both the left and the right. You want an 'outer' join, keeping all records (there is also 'left' or 'right').
Example:
import pandas as pd

df1 = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
    'Longitude': [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
    'Latitude': [12.520880, 33.835231, '-12.293361', 18.223959, 41.142450],
    'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe']
})
print(df1)

df2 = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'GDP (BILLIONS)': [21.71, 13.40, 227.80, 0.75, 4.80],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND']
})
print(df2)

# merge, using 'outer' to avoid losing records from either left or right
df3 = pd.merge(left=df1, right=df2, left_on='name', right_on='COUNTRY', how='outer')

# combine the columns used to match
df3['name'] = df3.apply(lambda row: row['name'] if not pd.isnull(row['name']) else row['COUNTRY'], axis=1)

# drop the now-spare column
df3 = df3.drop('COUNTRY', axis=1)
print(df3)
Pandas has the pd.merge function [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html], which uses an inner join by default. An inner join keeps only the key values present in both dataframes, using the keys specified in on, or in left_on and right_on if the key columns to merge on have different names.
Since you require the CODE value to be added, the following line of code could be used:
pd.merge(left=df, right=df2[['COUNTRY', 'CODE']], left_on='name', right_on='COUNTRY', how='left')
This gives the following output:
          name  Longitude   Latitude      continent      COUNTRY CODE
0        Aruba -69.982677  12.520880  North America          NaN  NaN
1  Afghanistan  66.004734  33.835231           Asia  Afghanistan  AFG
2       Angola  17.537368 -12.293361         Africa          NaN  NaN
3     Anguilla -63.064989  18.223959  North America          NaN  NaN
4      Albania  20.049834  41.142450         Europe      Albania  ALB
The following also gives the same result (using the question's naming, where df2 is the GDP frame):
new_df = pd.merge(left=df2[['COUNTRY', 'CODE']], right=df, left_on='COUNTRY', right_on='name', how='right')
       COUNTRY CODE         name  Longitude   Latitude      continent
0  Afghanistan  AFG  Afghanistan  66.004734  33.835231           Asia
1      Albania  ALB      Albania  20.049834  41.142450         Europe
2          NaN  NaN        Aruba -69.982677  12.520880  North America
3          NaN  NaN       Angola  17.537368 -12.293361         Africa
4          NaN  NaN     Anguilla -63.064989  18.223959  North America
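A merge-free alternative, if all you need is the CODE column, is to build a lookup Series and map it onto name; a sketch (names with no match simply come back as NaN):

df['CODE'] = df['name'].map(df2.set_index('COUNTRY')['CODE'])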

Pandas DataFrame compare columns to a threshold column using where()

I need to null out values in several columns where they are smaller in absolute value than the corresponding values in the threshold column.
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'key2': [2000, 2001, 2002, 2001, 2002],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5),
                   'threshold': [0.5, 0.4, 0.6, 0.1, 0.2]}).set_index(['key1', 'key2'])
                data1     data2  threshold
key1   key2
Ohio   2000  0.201240  0.083833        0.5
       2001 -1.993489 -1.081208        0.4
       2002  0.759038 -1.688769        0.6
Nevada 2001 -0.543916  1.412679        0.1
       2002 -1.545781  0.181224        0.2
This gives me the error "cannot join with no level specified and no overlapping names":
df.where(df.abs() > df['threshold'])
This works, but obviously only against a scalar:
df.where(df.abs() > 0.5)
                data1     data2  threshold
key1   key2
Ohio   2000       NaN       NaN        NaN
       2001 -1.993489 -1.081208        NaN
       2002  0.759038 -1.688769        NaN
Nevada 2001 -0.543916  1.412679        NaN
       2002 -1.545781       NaN        NaN
BTW, this does appear to give me an OK result, but I still want to find out how to do it with the where() method:
df.apply(lambda x: x.where(x.abs() > x['threshold']), axis=1)
Here's a slightly different option using the DataFrame.gt (greater than) method.
df[df.abs().gt(df['threshold'], axis='rows')]
# Output might not look the same because of different random numbers;
# use np.random.seed() for reproducible random number generation
                data1     data2  threshold
key1   key2
Ohio   2000       NaN       NaN        NaN
       2001  1.954543  1.372174        NaN
       2002       NaN       NaN        NaN
Nevada 2001  0.275814  0.854617        NaN
       2002       NaN  0.204993        NaN
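To answer the where() part of the question directly, the same axis-aware comparison can be passed to where(); a sketch (not re-run against the exact data above, but equivalent to the boolean indexing shown):

masked = df.where(df.abs().gt(df['threshold'], axis=0))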
