Pandas DataFrame compare columns to a threshold column using where() - python

I need to null out values in several columns wherever they are smaller in absolute value than the corresponding values in the threshold column:
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'key2': [2000, 2001, 2002, 2001, 2002],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5),
                   'threshold': [0.5, 0.4, 0.6, 0.1, 0.2]}).set_index(['key1', 'key2'])
data1 data2 threshold
key1 key2
Ohio 2000 0.201240 0.083833 0.5
2001 -1.993489 -1.081208 0.4
2002 0.759038 -1.688769 0.6
Nevada 2001 -0.543916 1.412679 0.1
2002 -1.545781 0.181224 0.2
This gives me the error "cannot join with no level specified and no overlapping names":
df.where(df.abs()>df['threshold'])
This works, but obviously only against a scalar:
df.where(df.abs()>0.5)
data1 data2 threshold
key1 key2
Ohio 2000 NaN NaN NaN
2001 -1.993489 -1.081208 NaN
2002 0.759038 -1.688769 NaN
Nevada 2001 -0.543916 1.412679 NaN
2002 -1.545781 NaN NaN
BTW, this does appear to give me an OK result, but I still want to find out how to do it with the where() method:
df.apply(lambda x: x.where(x.abs() > x['threshold']), axis=1)

Here's a slightly different option using the DataFrame.gt (greater than) method.
df[df.abs().gt(df['threshold'], axis=0)]
# Output will not look the same as above because of different random numbers;
# use np.random.seed() for reproducible random number generation.
Out[13]:
data1 data2 threshold
key1 key2
Ohio 2000 NaN NaN NaN
2001 1.954543 1.372174 NaN
2002 NaN NaN NaN
Nevada 2001 0.275814 0.854617 NaN
2002 NaN 0.204993 NaN
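Since gt returns a boolean mask, the same comparison plugs straight into the where() method the question asked about. A minimal sketch (axis=0 aligns the threshold Series with the rows; note the threshold column itself becomes NaN, since abs(threshold) is never strictly greater than threshold):

df.where(df.abs().gt(df['threshold'], axis=0))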

Related

How to concatenate two dataframes with some duplicated values?

I have two dataframes of unequal lengths. I want to combine them with a condition: if two rows of df1 are identical, then they must share the same value from df2 (without changing the order).
import pandas as pd

d = {'country': ['France', 'France', 'Japan', 'China', 'China', 'Canada', 'Canada', 'India']}
df1 = pd.DataFrame(data=d)
I = {'conc': [0.30, 0.25, 0.21, 0.37, 0.15]}
df2 = pd.DataFrame(data=I)
dfc = pd.concat([df1, df2], axis=1)
My output:
country conc
0 France 0.30
1 France 0.25
2 Japan 0.21
3 China 0.37
4 China 0.15
5 Canada NaN
6 Canada NaN
7 India NaN
Expected output:
country conc
0 France 0.30
1 France 0.30
2 Japan 0.25
3 China 0.21
4 China 0.21
5 Canada 0.37
6 Canada 0.37
7 India 0.15
You need to create a link between the values and the countries first.
df2["country"] = df1["country"].unique()
Then you can use it to merge it with your original dataframe.
pd.merge(df1, df2, on="country")
But be aware that this only works as long as the number of values matches the number of unique countries and their order is as expected.
I'd construct the dataframe directly, without intermediate dfs.
d = {'country': ['France', 'France', 'Japan', 'China', 'China', 'Canada', 'Canada', 'India']}
I = {'conc': [0.30, 0.25, 0.21, 0.37, 0.15]}
c = 'country'
dfc = (pd.DataFrame(I, index=pd.Index(pd.unique(d[c]), name=c))
         .reindex(d[c])
         .reset_index())
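For the sample data above this reproduces the expected output, since each unique country is keyed to one conc value and then expanded back to the original row order:

print(dfc)
#   country  conc
# 0  France  0.30
# 1  France  0.30
# 2   Japan  0.25
# 3   China  0.21
# 4   China  0.21
# 5  Canada  0.37
# 6  Canada  0.37
# 7   India  0.15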

Remove typos from a dictionary of dataframes

I am trying to remove specific typos from a dictionary of dataframes, which looks like this:
import pandas as pd

data = {'dataframe_1': pd.DataFrame({'col1': ['John', 'Ashley'],
                                     'col2': ['+10', '-1']}),
        'dataframe_2': pd.DataFrame({'col3': ['Italy', 'Brazil', 'Japan'],
                                     'col4': ['Milan', 'Rio do Jaineiro', 'Tokio'],
                                     'percentage': ['+95%', '≤0%', '80%+']})}
The function remove_typos() is meant to remove those specific characters; however, when applied as below it returns a corrupted dataframe.
def remove_typos(string):
    # remove '+' and '≤'
    string = string.replace('+', '')
    string = string.replace('≤', '')
    return string

# store remove_typos() output in a dictionary of dataframes
cleaned_df = pd.concat(data.values()).pipe(remove_typos)
Console Output:
# col1 col2 col3 col4 percentage
#0 John +10 NaN NaN NaN
#1 Ashley -1 NaN NaN NaN
#0 NaN NaN Italy Milan +95%
#1 NaN NaN Brazil Rio do Jaineiro ≤0%
#2 NaN NaN Japan Tokio 80%+
The idea is to end up with a cleaned result in which each dataframe is still accessible by its dictionary key:
data['dataframe_1']
# col1 col2
#0 John 10
#1 Ashley -1
Is there any other way to apply this function over a dict of df's?
We can replace the values inside a dict comprehension (the '+' is escaped as r'\+' because replace with regex=True treats the patterns as regular expressions):
data = {k: v.replace([r'\+', '≤'], '', regex=True) for k, v in data.items()}
>>> data['dataframe_1']
col1 col2
0 John 10
1 Ashley -1
>>> data['dataframe_2']
col3 col4 percentage
0 Italy Milan 95%
1 Brazil Rio do Jaineiro 0%
2 Japan Tokio 80%
There is no harm in using a loop over a dictionary (as opposed to looping over a dataframe):
data1 = {}
for k, v in data.items():
    v1 = v.select_dtypes("O")  # object (string) columns only
    v = v.assign(**v1.applymap(remove_typos))
    data1[k] = v
print(data1)
{'dataframe_1': col1 col2
0 John 10
1 Ashley -1, 'dataframe_2': col3 col4 percentage
0 Italy Milan 95%
1 Brazil Rio do Jaineiro 0%
2 Japan Tokio 80%}
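A dict comprehension that reuses the OP's remove_typos directly is also possible. This is only a sketch: the isinstance guard is added because applymap visits every cell, including non-string values:

cleaned = {k: v.applymap(lambda x: remove_typos(x) if isinstance(x, str) else x)
           for k, v in data.items()}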

Why does loc assign the same values to every selected row when updating several columns at known indices?

Datatable:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 2017 NaN
1 NISSAN 2017 NaN
2 HYUNDAI 2017 1.0
3 DODGE 2017 NaN
I want to update more than one index, and the column data at those indices, with the loc function.
But when I use loc, it writes the same pair of values into both rows.
listcolumns = ['VEHICLE_YEAR', 'NUM_PASSENGERS']
listnewvalue = [16000, 28000]
indexlister = [0, 1]
data.loc[indexlister, listcolumns] = listnewvalue
As you can see in the output below, I wanted only row 0 of 'VEHICLE_YEAR' to become 16000 and row 1 of 'NUM_PASSENGERS' to become 28000, but both rows changed in both columns.
How can I control this and change only the row/column pairs I want? Or do you have a different method? Thank you very much.
output:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 16000 28000.0
1 NISSAN 16000 28000.0
In the printout below I left the other fields empty so that the new entries stand out. For example, I want to assign the value 2005 to index 0 of the column 'VEHICLE_YEAR' and 2005 to index 1 of the column 'NUM_PASSENGERS'.
The output I want is as follows:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 2005 NaN
1 NISSAN NaN 2005
2 HYUNDAI NaN NaN
The list you're setting the values with needs to correspond to the number of rows and the number of columns you've selected with loc. If it receives a single flat list, it will assign that same list of values to every selected row.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'ARAC': ['CHEVROLET', 'NISSAN', 'HYUNDAI', 'DODGE'],
    'VEHICLE_YEAR': [2017, 2017, 2017, 2017],
    'NUM_PASSENGERS': [np.nan, np.nan, 1.0, np.nan]
})
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET NaN 2017
1 NISSAN NaN 2017
2 HYUNDAI 1.0 2017
3 DODGE NaN 2017
df.loc[[0, 2], ['NUM_PASSENGERS', 'VEHICLE_YEAR']] = [[1000, 2014], [3000, 2015]]
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET 1000.0 2014
1 NISSAN NaN 2017
2 HYUNDAI 3000.0 2015
3 DODGE NaN 2017
If you only want to change the values in the NUM_PASSENGERS column, select only that and give it a single list/array, the same length as your row indices.
df.loc[[0,1,3], ['NUM_PASSENGERS']] = [10, 20, 30]
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET 10.0 2014
1 NISSAN 20.0 2017
2 HYUNDAI 3000.0 2015
3 DODGE 30.0 2017
The docs might be helpful too. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc
If this didn't answer your question, please provide your expected output.
I solved the problem as follows.
I could not describe the problem exactly (I am still working on that), but when I changed the code this way it worked, and now I can change any row and column value I want:
listcolumns = ['VEHICLE_YEAR', 'NUM_PASSENGERS']
listnewvalue = [16000, 28000]
indexlister = [0, 1]
for i in range(len(indexlister)):
    df.loc[indexlister[i], listcolumns[i]] = listnewvalue[i]
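The same pairwise update reads a little more cleanly with zip; this is just a sketch over the variables above:

for row, col, val in zip(indexlister, listcolumns, listnewvalue):
    df.loc[row, col] = val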

Pandas DataFrame multiplication with missing values

I have 2 dataframes:
Value
Location Time
Hawai 2000 1.764052
2002 0.400157
Torino 2000 0.978738
2002 2.240893
Paris 2000 1.867558
2002 -0.977278
2000 2002
Country Unit Location
US USD Hawai 2 8
IT EUR Torino 4 10
FR EUR Paris 6 12
Created with:
np.random.seed(0)
tuples = list(zip(*[['Hawai', 'Hawai', 'Torino', 'Torino', 'Paris', 'Paris'],
                    [2000, 2002, 2000, 2002, 2000, 2002]]))
idx = pd.MultiIndex.from_tuples(tuples, names=['Location', 'Time'])
df = pd.DataFrame(np.random.randn(6, 1), index=idx, columns=['Value'])

df2 = pd.DataFrame({'Country': ['US', 'IT', 'FR'],
                    'Unit': ['USD', 'EUR', 'EUR'],
                    'Location': ['Hawai', 'Torino', 'Paris'],
                    '2000': [2, 4, 6],
                    '2002': [8, 10, 12]})
df2.set_index(['Country', 'Unit', 'Location'], inplace=True)
I want to multiply each column of df2 by the corresponding Value from the first dataframe.
This code does it well:
df2.columns = df2.columns.astype(int)
s = df.Value.unstack(fill_value=1)
df2 = df2.mul(s)
and produces
2000 2002
Country Unit Location
US USD Hawai 3.528105 3.201258
IT EUR Torino 3.914952 22.408932
FR EUR Paris 11.205348 -11.727335
Now I want to handle the case where df2 has missing values represented as '..', multiplying the numerical values and skipping the others:
2000 2002
Country Unit Location
US USD Hawai 2 8
IT EUR Torino .. 10
FR EUR Paris 6 12
Running the code above gives the error TypeError: can't multiply sequence by non-int of type 'float'.
Any idea how to achieve this result?
2000 2002
Country Unit Location
US USD Hawai 3.528105 3.201258
IT EUR Torino .. 22.408932
FR EUR Paris 11.205348 -11.727335
I think it is better here to use real missing values instead of '..', via to_numeric with errors='coerce', so the multiplication works very nicely:
df2 = pd.DataFrame({'Country': ['US', 'IT', 'FR'],
                    'Unit': ['USD', 'EUR', 'EUR'],
                    'Location': ['Hawai', 'Torino', 'Paris'],
                    '2000': [2, '..', 6],
                    '2002': [8, 10, 12]})
df2.set_index(['Country', 'Unit', 'Location'], inplace=True)
df2.columns = df2.columns.astype(int)

s = df.Value.unstack(fill_value=1)
df2 = df2.apply(lambda x: pd.to_numeric(x, errors='coerce')).mul(s)
print(df2)
2000 2002
Country Unit Location
US USD Hawai 3.528105 3.201258
IT EUR Torino NaN 22.408932
FR EUR Paris 11.205348 -11.727335
If '..' is the only non-numeric value, another solution is to use replace:
df2 = df2.replace('..', np.nan).mul(s)
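If the final table should display '..' again, as in the desired output above, one option is to fill the NaNs back in after multiplying. This is a sketch and assumes NaN can only have come from the '..' markers:

result = df2.fillna('..')  # note: affected columns become object dtype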

Pandas: Divide MultiIndex data frame by row

I have a data frame with a multi-index (panel data), and for each group (county) I would like to divide each row's values by the values of a specific year.
>>> fields
Out[39]: ['emplvl', 'population', 'estab', 'estab_pop', 'emp_pop']
>>> df[fields]
Out[40]:
emplvl population estab estab_pop emp_pop
county year
1001 2003 11134.500000 46800 801.75 0.017131 0.237917
2004 11209.166667 48366 824.00 0.017037 0.231757
2005 11452.166667 49676 870.75 0.017529 0.230537
2006 11259.250000 51328 862.50 0.016804 0.219359
2007 11403.333333 52405 879.25 0.016778 0.217600
2008 11272.833333 53277 890.25 0.016710 0.211589
2009 11003.833333 54135 877.00 0.016200 0.203267
2010 10693.916667 54632 877.00 0.016053 0.195745
2011 10627.000000 NaN 862.00 NaN NaN
2012 10136.916667 NaN 841.75 NaN NaN
1003 2003 51372.250000 151509 4272.00 0.028196 0.339071
2004 53450.583333 156266 4536.25 0.029029 0.342049
2005 56110.250000 162183 4880.50 0.030093 0.345969
2006 59291.000000 168121 5067.50 0.030142 0.352669
2007 62600.083333 172404 5337.25 0.030958 0.363101
2008 62611.500000 175827 5529.25 0.031447 0.356097
2009 58947.666667 179406 5273.75 0.029396 0.328571
2010 58139.583333 183195 5171.25 0.028228 0.317364
2011 59581.000000 NaN 5157.75 NaN NaN
2012 60440.250000 NaN 5171.75 NaN NaN
The rows to divide by:
>>> df[fields].loc[df.index.get_level_values('year') == 2007, fields]
Out[32]:
emplvl population estab estab_pop emp_pop
county year
1001 2007 11403.333333 52405 879.25 0.016778 0.217600
1003 2007 62600.083333 172404 5337.25 0.030958 0.363101
However, both
df[fields].div(df.loc[df.index.get_level_values('year') == 2007, fields], axis=0)
df[fields].div(df.loc[df.index.get_level_values('year') == 2007, fields], axis=1)
give me a data frame full of NaN, probably because pandas tries to align on the year index as well and finds nothing to divide by.
To compensate for this, I also tried
df[fields].div(df.loc[df.index.get_level_values('year') == 2007, fields].values)
which gives me ValueError: Shape of passed values is (5, 2), indices imply (5, 20).
I think you can reset_index the 2007 rows into df1 and then use div:
fields = ['emplvl', 'population', 'estab', 'estab_pop', 'emp_pop']
df1 = df.loc[df.index.get_level_values('year') == 2007, fields].reset_index(level=1)
print(df1)
year emplvl population estab estab_pop emp_pop
county
1001 2007 11403.333333 52405.0 879.25 0.016778 0.217600
1003 2007 62600.083333 172404.0 5337.25 0.030958 0.363101
print(df.div(df1[fields], axis=0))
emplvl population estab estab_pop emp_pop
county year
1001 2003 0.976425 0.893045 0.911857 1.021039 1.093369
2004 0.982973 0.922927 0.937162 1.015437 1.065060
2005 1.004282 0.947925 0.990333 1.044761 1.059453
2006 0.987365 0.979449 0.980950 1.001550 1.008084
2007 1.000000 1.000000 1.000000 1.000000 1.000000
2008 0.988556 1.016640 1.012511 0.995947 0.972376
2009 0.964966 1.033012 0.997441 0.965550 0.934131
2010 0.937789 1.042496 0.997441 0.956789 0.899563
2011 0.931920 NaN 0.980381 NaN NaN
2012 0.888943 NaN 0.957350 NaN NaN
1003 2003 0.820642 0.878802 0.800412 0.910782 0.933820
2004 0.853842 0.906394 0.849923 0.937690 0.942022
2005 0.896329 0.940715 0.914422 0.972059 0.952818
2006 0.947139 0.975157 0.949459 0.973642 0.971270
2007 1.000000 1.000000 1.000000 1.000000 1.000000
2008 1.000182 1.019855 1.035974 1.015796 0.980711
2009 0.941655 1.040614 0.988102 0.949545 0.904902
2010 0.928746 1.062591 0.968898 0.911816 0.874038
2011 0.951772 NaN 0.966368 NaN NaN
2012 0.965498 NaN 0.968992 NaN NaN
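A more compact alternative, sketched here under the same setup, is to do the division per county group, picking out each group's 2007 row inside the lambda:

result = df[fields].groupby(level='county', group_keys=False).apply(
    lambda g: g / g.xs(2007, level='year').iloc[0])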
First, I would suggest putting everything into a single dataframe for this operation; let's assume its name is df.
You need to start by selecting the row that all the other rows are to be divided by: the row with 2007 as the year for each individual county.
In the following code I loop through the index and columns of the dataframe. The index of the row to divide by is selected as reference_index, which consists of the county name and the year.
The other rows are divided by the row at reference_index first; at the very end, each reference row is divided by itself to become 1.
for index in df.index:
    for column in df.columns:
        county = index[0]
        # index of the reference row to divide the other rows by
        reference_index = (county, 2007)
        if index != reference_index:
            df.loc[index, column] = df.loc[index, column] / df.loc[reference_index, column]

# The 2007 rows must be divided by themselves last; doing it any earlier
# would turn them into 1 before the other rows had been divided.
for county in df.index.get_level_values('county').unique():
    df.loc[(county, 2007)] = df.loc[(county, 2007)] / df.loc[(county, 2007)]
