divide all columns by each column in pandas - python

I have a data frame and I am trying to create a new data frame that consists of the ratios between each pair of columns of the original.
I tried the logic below:
df_new = pd.concat([df[df.columns.difference([col])].div(df[col], axis=0)
                    .add_suffix('/R') for col in df.columns], axis=1)
Output is:
B/R C/R D/R A/R C/R D/R A/R B/R D/R A/R B/R C/R
0 0.46 1.16 0.78 2.16 2.50 1.69 0.86 0.40 0.68 1.28 0.59 1.48
1 1.05 1.25 1.64 0.95 1.19 1.55 0.80 0.84 1.30 0.61 0.64 0.77
2 1.56 2.78 2.78 0.64 1.79 1.79 0.36 0.56 1.00 0.36 0.56 1.00
3 0.54 2.23 0.35 1.86 4.14 0.64 0.45 0.24 0.16 2.89 1.56 6.44
However, I am facing two issues here. First, I get both A/B and B/A, which are redundant and double the number of columns. Is there a way to get only A/B and eliminate/restrict B/A?
Second, the column names produced by the add_suffix method do not convey which column is divided by which. Is there a way to create column names like A/B for column A divided by column B?

Use itertools.combinations and divide the columns in a dictionary comprehension:
import pandas as pd
from itertools import combinations

df = pd.DataFrame({
    'A': [5, 3, 6, 9, 2, 4],
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 8],
})

L = {f'{a}/{b}': df[a].div(df[b]) for a, b in combinations(df.columns, 2)}
df = pd.concat(L, axis=1)
print(df)
A/B A/C A/D B/C B/D C/D
0 1.25 0.714286 5.000000 0.571429 4.000000 7.000000
1 0.60 0.375000 1.000000 0.625000 1.666667 2.666667
2 1.50 0.666667 1.200000 0.444444 0.800000 1.800000
3 1.80 2.250000 1.285714 1.250000 0.714286 0.571429
4 0.40 1.000000 2.000000 2.500000 5.000000 2.000000
5 1.00 1.333333 0.500000 1.333333 0.500000 0.375000
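If you want the ratios alongside the original columns instead of overwriting df, a small variation (assuming the original four-column df and the same combinations import; ratios and out are just illustrative names):
ratios = pd.concat({f'{a}/{b}': df[a].div(df[b]) for a, b in combinations(df.columns, 2)}, axis=1)
out = df.join(ratios)  # A-D first, then the A/B ... C/D ratio columns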


How to list the specific countries in a df which have NaN values?

I have created the df_nan below which shows the sum of NaN values from the main df, which, as seen, shows how many are in each specific column.
However, I want to create a new df, which has a column/index of countries, then another with the number of NaN values for the given country.
Country Number of NaN Values
Aruba 4
Finland 3
I feel like I have to use groupby to create something along the lines of the code below, but .isna is not an attribute of a groupby object. Any help would be great, thanks!
df_nan2 = df_nan.groupby(['Country']).isna().sum()
Current code
import pandas as pd
import seaborn as sns
import numpy as np
from scipy.stats import spearmanr
# given dataframe df
df = pd.read_csv('countries.csv')
df.drop(columns=['Population (millions)', 'HDI', 'GDP per Capita', 'Fish Footprint', 'Fishing Water',
                 'Urban Land', 'Earths Required', 'Countries Required', 'Data Quality'], inplace=True)
df_nan = df.isna().sum()
Head of main df (columns, matching df_nan below: Country, Region, Cropland Footprint, Grazing Footprint, Forest Footprint, Carbon Footprint, Total Ecological Footprint, Cropland, Grazing Land, Forest Land, Total Biocapacity, Biocapacity Deficit or Reserve)
0 Afghanistan Middle East/Central Asia 0.30 0.20 0.08 0.18 0.79 0.24 0.20 0.02 0.50 -0.30
1 Albania Northern/Eastern Europe 0.78 0.22 0.25 0.87 2.21 0.55 0.21 0.29 1.18 -1.03
2 Algeria Africa 0.60 0.16 0.17 1.14 2.12 0.24 0.27 0.03 0.59 -1.53
3 Angola Africa 0.33 0.15 0.12 0.20 0.93 0.20 1.42 0.64 2.55 1.61
4 Antigua and Barbuda Latin America NaN NaN NaN NaN 5.38 NaN NaN NaN 0.94 -4.44
5 Argentina Latin America 0.78 0.79 0.29 1.08 3.14 2.64 1.86 0.66 6.92 3.78
6 Armenia Middle East/Central Asia 0.74 0.18 0.34 0.89 2.23 0.44 0.26 0.10 0.89 -1.35
7 Aruba Latin America NaN NaN NaN NaN 11.88 NaN NaN NaN 0.57 -11.31
8 Australia Asia-Pacific 2.68 0.63 0.89 4.85 9.31 5.42 5.81 2.01 16.57 7.26
9 Austria European Union 0.82 0.27 0.63 4.14 6.06 0.71 0.16 2.04 3.07 -3.00
10 Azerbaijan Middle East/Central Asia 0.66 0.22 0.11 1.25 2.31 0.46 0.20 0.11 0.85 -1.46
11 Bahamas Latin America 0.97 1.05 0.19 4.46 6.84 0.05 0.00 1.18 9.55 2.71
12 Bahrain Middle East/Central Asia 0.52 0.45 0.16 6.19 7.49 0.01 0.00 0.00 0.58 -6.91
13 Bangladesh Asia-Pacific 0.29 0.00 0.08 0.26 0.72 0.25 0.00 0.00 0.38 -0.35
14 Barbados Latin America 0.56 0.24 0.14 3.28 4.48 0.08 0.00 0.02 0.19 -4.29
15 Belarus Northern/Eastern Europe 1.32 0.12 0.91 2.57 5.09 1.52 0.30 1.71 3.64 -1.45
16 Belgium European Union 1.15 0.48 0.99 4.43 7.44 0.56 0.03 0.28 1.19 -6.25
17 Benin Africa 0.49 0.04 0.26 0.51 1.41 0.44 0.04 0.34 0.88 -0.53
18 Bermuda North America NaN NaN NaN NaN 5.77 NaN NaN NaN 0.13 -5.64
19 Bhutan Asia-Pacific 0.50 0.42 3.03 0.63 4.84 0.28 0.34 4.38 5.27 0.43
Head of df_nan
Country 0
Region 0
Cropland Footprint 15
Grazing Footprint 15
Forest Footprint 15
Carbon Footprint 15
Total Ecological Footprint 0
Cropland 15
Grazing Land 15
Forest Land 15
Total Biocapacity 0
Biocapacity Deficit or Reserve 0
dtype: int64
Suppose you want the null count per Country for the "Cropland Footprint" column; then you can use the following code -
Unique_Country = df['Country'].unique()
Col1 = 'Cropland Footprint'
NullCount = []
for i in Unique_Country:
    s = df[df['Country'] == i][Col1].isnull().sum()
    NullCount.append(s)

df2 = pd.DataFrame({'Country': Unique_Country,
                    'Number of NaN Values': NullCount})
df2 = df2[df2['Number of NaN Values'] != 0]
df2
Output -
Country Number of NaN Values
Antigua and Barbuda 1
Aruba 1
Bermuda 1
If you want the null count for another column, just change the value of the Col1 variable.
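As an aside, the per-country loop can be collapsed into a single vectorized pass with groupby, which is also what the question was reaching for. A minimal sketch, assuming the same df and column names as above:
# isna() marks missing cells; summing the booleans per country counts them
null_counts = df['Cropland Footprint'].isna().groupby(df['Country']).sum()
df2 = (null_counts[null_counts > 0]          # keep countries with at least one NaN
       .rename('Number of NaN Values')
       .reset_index())
print(df2)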

Correlation between values

I want to compute correlations for this DataFrame, but not presented the way they are shown below; instead I want the values ranked from lowest to largest.
import pandas as pd
import numpy as np
rs = np.random.RandomState(1)
df = pd.DataFrame(rs.rand(9, 8))
corr = df.corr()
corr.style.background_gradient().set_precision(2)
0 1 2 3 4 5 6 7
0 1 0.42 0.031 -0.16 -0.35 0.23 -0.22 0.4
1 0.42 1 -0.24 -0.55 0.011 0.3 -0.26 0.23
2 0.031 -0.24 1 0.29 0.44 0.29 0.23 0.25
3 -0.16 -0.55 0.29 1 -0.33 -0.42 0.58 -0.37
4 -0.35 0.011 0.44 -0.33 1 0.46 0.074 0.19
5 0.23 0.3 0.29 -0.42 0.46 1 -0.41 0.71
6 -0.22 -0.26 0.23 0.58 0.074 -0.41 1 -0.66
7 0.4 0.23 0.25 -0.37 0.19 0.71 -0.66 1
You can use sort_values:
import pandas as pd
import numpy as np
rs = np.random.RandomState(1)
df = pd.DataFrame(rs.rand(9, 8))
corr = df.corr()
print(corr)
print(corr.sort_values(by=0, axis=1, ascending=False))  # by=0: order columns by the values in the first row, largest first
Results:
0 1 2 3 4 5 6 7
0 1.000000 0.418246 0.030692 -0.160001 -0.352993 0.230069 -0.216804 0.395662
1 0.418246 1.000000 -0.244115 -0.549013 0.010745 0.299203 -0.262351 0.232681
2 0.030692 -0.244115 1.000000 0.288011 0.435907 0.285408 0.225205 0.253840
3 -0.160001 -0.549013 0.288011 1.000000 -0.326950 -0.415688 0.578549 -0.366539
4 -0.352993 0.010745 0.435907 -0.326950 1.000000 0.455738 0.074293 0.193905
5 0.230069 0.299203 0.285408 -0.415688 0.455738 1.000000 -0.413383 0.708467
6 -0.216804 -0.262351 0.225205 0.578549 0.074293 -0.413383 1.000000 -0.664207
7 0.395662 0.232681 0.253840 -0.366539 0.193905 0.708467 -0.664207 1.000000
0 1 7 5 2 3 6 4
0 1.000000 0.418246 0.395662 0.230069 0.030692 -0.160001 -0.216804 -0.352993
1 0.418246 1.000000 0.232681 0.299203 -0.244115 -0.549013 -0.262351 0.010745
2 0.030692 -0.244115 0.253840 0.285408 1.000000 0.288011 0.225205 0.435907
3 -0.160001 -0.549013 -0.366539 -0.415688 0.288011 1.000000 0.578549 -0.326950
4 -0.352993 0.010745 0.193905 0.455738 0.435907 -0.326950 0.074293 1.000000
5 0.230069 0.299203 0.708467 1.000000 0.285408 -0.415688 -0.413383 0.455738
6 -0.216804 -0.262351 -0.664207 -0.413383 0.225205 0.578549 1.000000 0.074293
7 0.395662 0.232681 1.000000 0.708467 0.253840 -0.366539 -0.664207 0.193905
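For lowest-to-largest order, drop ascending=False (ascending sorting is the default). If you also want the rows reordered the same way so the matrix stays symmetric, one possible sketch (order and corr_sorted are illustrative names):
order = corr.loc[0].sort_values().index   # column labels, lowest to largest by the first row
corr_sorted = corr.loc[order, order]      # apply the same order to both rows and columns
print(corr_sorted)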

transpose multiple columns Pandas dataframe

I'm trying to reshape a dataframe, but I'm not able to get the results I need.
The dataframe looks like this:
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26
I need to reshape the dataframe so it will look like this:
m r s p O W N p O W N p O W N
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
1 4 4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
1 4 5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
I tried to use the pivot_table function
df.pivot_table(index=['m','r','s'], columns=['p'], values=['O','W','N'])
but I'm not able to get quite what I want. Does anyone know how to do this?
As someone who fancies himself as pretty handy with pandas, the pivot_table and melt functions are confusing to me. I prefer to stick with a well-defined and unique index and use the stack and unstack methods of the dataframe itself.
First, I'll ask if you really need to repeat the p-column like that? I can sort of see its value when presenting data, but IMO pandas isn't really set up to work like that. We could shoehorn it in, but let's see if a simpler solution gets you what you need.
Here's what I would do:
from io import StringIO
import pandas
datatable = StringIO("""\
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26""")
df = (
    pandas.read_table(datatable, sep=r'\s+')
    .set_index(['m', 'r', 's', 'p'])
    .unstack(level='p')
)
df.columns = df.columns.swaplevel(0, 1)
# sort by the 'p' level only, keeping the original O/W/N order within each p
df = df.sort_index(axis=1, level='p', sort_remaining=False)
print(df)
Which prints:
p 1 2 3
O W N O W N O W N
m r s
1 4 3 2.81 3.70 3.03 0.58 0.78 0.60 1.98 1.34 1.81
4 2.14 2.82 2.31 0.67 0.00 0.00 0.00 0.04 0.15
5 1.47 1.94 1.59 1.03 2.45 1.68 0.01 0.00 0.26
So now the columns are a MultiIndex and you can access, for example, all of the values where p = 2 with df[2] or df.xs(2, level='p', axis=1), which gives me:
O W N
m r s
1 4 3 0.58 0.78 0.60
4 0.67 0.00 0.00
5 1.03 2.45 1.68
Similarly, you can get all of the W columns with df.xs('W', level=1, axis=1) (we say level=1 because that column level does not have a name, so we use its position instead):
p 1 2 3
m r s
1 4 3 3.70 0.78 1.34
4 2.82 0.00 0.04
5 1.94 2.45 0.00
You can similarly query the rows by using axis=0.
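For example, all of the rows where s equals 4 (a hypothetical query against the same df):
df.xs(4, level='s', axis=0)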
If you really need the p values in a column, just add them there manually and reindex your columns:
for p in df.columns.get_level_values('p').unique():
    df[p, 'p'] = p

cols = pandas.MultiIndex.from_product([[1, 2, 3], list('pOWN')])
df = df.reindex(columns=cols)
print(df)
1 2 3
p O W N p O W N p O W N
m r s
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
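For completeness, the pivot_table call from the question can be massaged into the same shape; it just needs the column levels swapped and sorted as above. A sketch, assuming df is the flat frame as first read from datatable (before set_index). Note that pivot_table's default mean aggregation is a no-op here because each (m, r, s, p) combination is unique, and that depending on the pandas version the within-p column order may come out alphabetical, in which case a final reindex like the one above pins it down:
pivoted = df.pivot_table(index=['m', 'r', 's'], columns='p', values=['O', 'W', 'N'])
pivoted = (pivoted.swaplevel(0, 1, axis=1)
           .sort_index(axis=1, level='p', sort_remaining=False))
print(pivoted)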

Group by - select most recent 4 events

I have the following df in pandas:
df:
DATE STOCK DATA1 DATA2 DATA3
01/01/12 ABC 0.40 0.88 0.22
04/01/12 ABC 0.50 0.49 0.13
07/01/12 ABC 0.85 0.36 0.83
10/01/12 ABC 0.28 0.12 0.39
01/01/13 ABC 0.86 0.87 0.58
04/01/13 ABC 0.95 0.39 0.87
07/01/13 ABC 0.60 0.25 0.56
10/01/13 ABC 0.15 0.28 0.69
01/01/11 XYZ 0.94 0.40 0.50
04/01/11 XYZ 0.65 0.19 0.81
07/01/11 XYZ 0.89 0.59 0.69
10/01/11 XYZ 0.12 0.09 0.18
01/01/12 XYZ 0.25 0.94 0.55
04/01/12 XYZ 0.07 0.22 0.67
07/01/12 XYZ 0.46 0.08 0.54
10/01/12 XYZ 0.04 0.03 0.94
...
I want to group by the stocks, sort by date and then for specified columns (in this case DATA1 and DATA3), I want to get the last four items summed (TTM data).
The output would look like this:
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
01/01/12 ABC 0.40 0.88 0.22 NaN NaN
04/01/12 ABC 0.50 0.49 0.13 NaN NaN
07/01/12 ABC 0.85 0.36 0.83 NaN NaN
10/01/12 ABC 0.28 0.12 0.39 2.03 1.56
01/01/13 ABC 0.86 0.87 0.58 2.49 1.92
04/01/13 ABC 0.95 0.39 0.87 2.94 2.66
07/01/13 ABC 0.60 0.25 0.56 2.69 2.39
10/01/13 ABC 0.15 0.28 0.69 2.55 2.70
01/01/11 XYZ 0.94 0.40 0.50 NaN NaN
04/01/11 XYZ 0.65 0.19 0.81 NaN NaN
07/01/11 XYZ 0.89 0.59 0.69 NaN NaN
10/01/11 XYZ 0.12 0.09 0.18 2.59 2.18
01/01/12 XYZ 0.25 0.94 0.55 1.90 2.23
04/01/12 XYZ 0.07 0.22 0.67 1.33 2.09
07/01/12 XYZ 0.46 0.08 0.54 0.89 1.94
10/01/12 XYZ 0.04 0.03 0.94 0.82 2.70
...
My approach so far has been to sort by date, then group, then iterate through each group; if there are 3 events older than the current event, I sum them. I also want to check that the dates fall within 1 year. Can anyone offer a better way in Python? Thank you.
Added: As a clarification for the 1 year part, let's say you take the last four dates and it goes 1/1/1993, 4/1/12, 7/1/12, 10/1/12 -- a data error. I wouldn't want to sum those four. I would want that one to say NaN.
For this I think you can use transform with a rolling sum. Starting from your dataframe, I might do something like:
>>> df["DATE"] = pd.to_datetime(df["DATE"]) # switch to datetime to ease sorting
>>> df = df.sort(["STOCK", "DATE"])
>>> rsum_columns = "DATA1", "DATA3"
>>> grouped = df.groupby("STOCK")[rsum_columns]
>>> new_columns = grouped.transform(lambda x: pd.rolling_sum(x, 4))
>>> df[new_columns.columns + "_TTM"] = new_columns
>>> df
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
0 2012-01-01 00:00:00 ABC 0.40 0.88 0.22 NaN NaN
1 2012-04-01 00:00:00 ABC 0.50 0.49 0.13 NaN NaN
2 2012-07-01 00:00:00 ABC 0.85 0.36 0.83 NaN NaN
3 2012-10-01 00:00:00 ABC 0.28 0.12 0.39 2.03 1.57
4 2013-01-01 00:00:00 ABC 0.86 0.87 0.58 2.49 1.93
5 2013-04-01 00:00:00 ABC 0.95 0.39 0.87 2.94 2.67
6 2013-07-01 00:00:00 ABC 0.60 0.25 0.56 2.69 2.40
7 2013-10-01 00:00:00 ABC 0.15 0.28 0.69 2.56 2.70
8 2011-01-01 00:00:00 XYZ 0.94 0.40 0.50 NaN NaN
9 2011-04-01 00:00:00 XYZ 0.65 0.19 0.81 NaN NaN
10 2011-07-01 00:00:00 XYZ 0.89 0.59 0.69 NaN NaN
11 2011-10-01 00:00:00 XYZ 0.12 0.09 0.18 2.60 2.18
12 2012-01-01 00:00:00 XYZ 0.25 0.94 0.55 1.91 2.23
13 2012-04-01 00:00:00 XYZ 0.07 0.22 0.67 1.33 2.09
14 2012-07-01 00:00:00 XYZ 0.46 0.08 0.54 0.90 1.94
15 2012-10-01 00:00:00 XYZ 0.04 0.03 0.94 0.82 2.70
[16 rows x 7 columns]
As for "I want to check to see if the dates fall within 1 year": per the clarification above, a four-date window that spans more than a year (because of a data error) should show NaN instead of a sum.
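One possible sketch of that check (the 366-day threshold is an assumption for "within 1 year"): look up the date three rows back within each group and blank out any rolling sum whose window spans longer than that.
>>> oldest = df.groupby("STOCK")["DATE"].transform(lambda d: d.shift(3))  # date 3 quarters back
>>> too_wide = (df["DATE"] - oldest) > pd.Timedelta(days=366)             # window wider than a year
>>> df.loc[too_wide, ["DATA1_TTM", "DATA3_TTM"]] = float("nan")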

Pandas DataFrame Advanced Slicing

I am an R user and I found myself struggling a bit moving to Python, especially with the indexing capabilities of pandas.
household_id is my second column. I sorted my dataframe on this column and ran the following instructions, which returned different results (that I would expect to be the same). Are those expressions equivalent? If so, why do I see different results?
In [63]: ground_truth.columns
Out[63]: Index([Timestamp, household_id, ... (continues)
In [59]: ground_truth.ix[1107177,'household_id']
Out[59]: 2
In [60]: ground_truth.ix[1107177,1]
Out[60]: 2.0
In [61]: ground_truth.iloc[1107177,1]
Out[61]: 4.0
In [62]: ground_truth['household_id'][1107177]
Out[62]: 2
PS: I cant post the data unfortunately (too big).
NOTE: When you sort by a column, you'll rearrange the index, and assuming it wasn't sorted that way to begin with, you'll have integer labels that don't equal their linear index in the array.
First, ix will try integers as labels and then as indices, so it is immediate that In [59] and In [62] are the same. Second, if the index is not 0:n - 1, then 1107177 is a label, not an integer index, hence the difference between In [60] and In [61]. As far as the float casting goes, it looks like you might be using an older version of pandas. This doesn't happen in git master.
Here are the docs on ix.
Here's an example with a toy DataFrame:
In [1]:
from numpy.random import randn
from pandas import DataFrame

df = DataFrame(randn(10, 3), columns=list('abc'))
print(df)
print()
print(df.sort_values('a'))
a b c
0 -1.80 -0.28 -1.10
1 -0.58 1.00 -0.48
2 -2.50 1.59 -1.42
3 -1.00 -0.12 -0.93
4 -0.65 1.41 1.20
5 0.51 0.96 1.28
6 -0.28 0.13 1.59
7 1.28 -0.84 0.51
8 0.77 -1.26 -0.50
9 -0.59 -1.34 -1.06
a b c
2 -2.50 1.59 -1.42
0 -1.80 -0.28 -1.10
3 -1.00 -0.12 -0.93
4 -0.65 1.41 1.20
9 -0.59 -1.34 -1.06
1 -0.58 1.00 -0.48
6 -0.28 0.13 1.59
5 0.51 0.96 1.28
8 0.77 -1.26 -0.50
7 1.28 -0.84 0.51
Notice that after sorting, the row labels are still the original integers and no longer match their positional locations.
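To make that label/position distinction concrete with the explicit indexers (loc selects by label, iloc by position), a minimal sketch on the sorted frame above (sorted_df is an illustrative name):
sorted_df = df.sort_values('a')
print(sorted_df.loc[2, 'a'])    # label 2: the row originally labeled 2 (first row after this sort)
print(sorted_df.iloc[2]['a'])   # position 2: the third row of the sorted frame (label 3 here)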
