I have the dataframe below. I want to add a column based on the 'need' column (for example, in row zero the need is 1, so select the part1 value, -0.17). The dataframe I want is pasted below. Thanks.
import pandas as pd

df = pd.DataFrame({
'date': [20130101,20130101, 20130103, 20130104, 20130105, 20130107],
'need':[1,3,2,4,3,1],
'part1':[-0.17,-1.03,1.59,-0.05,-0.1,0.9],
'part2':[0.67,-0.03,1.95,-3.25,-0.3,0.6],
'part3':[0.7,-3,1.5,-0.25,-0.37,0.62],
'part4':[0.24,-0.44,1.335,-0.45,-0.57,0.92]
})
date need output part1 part2 part3 part4
0 20130101 1 -0.17 -0.17 0.67 0.70 0.240
1 20130101 3 -3.00 -1.03 -0.03 -3.00 -0.440
2 20130103 2 1.95 1.59 1.95 1.50 1.335
3 20130104 4 -0.45 -0.05 -3.25 -0.25 -0.450
4 20130105 3 -0.37 -0.10 -0.30 -0.37 -0.570
5 20130107 1 0.90 0.90 0.60 0.62 0.920
Use DataFrame.lookup:
df['new'] = df.lookup(df.index, 'part' + df['need'].astype(str))
print(df)
date need part1 part2 part3 part4 new
0 20130101 1 -0.17 0.67 0.70 0.240 -0.17
1 20130101 3 -1.03 -0.03 -3.00 -0.440 -3.00
2 20130103 2 1.59 1.95 1.50 1.335 1.95
3 20130104 4 -0.05 -3.25 -0.25 -0.450 -0.45
4 20130105 3 -0.10 -0.30 -0.37 -0.570 -0.37
5 20130107 1 0.90 0.60 0.62 0.920 0.90
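Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On modern pandas, a sketch of an equivalent uses positional NumPy indexing (get_indexer maps each looked-up column name to its position):

import numpy as np

# target column label for each row, e.g. 'part1' when need == 1
cols = 'part' + df['need'].astype(str)
col_pos = df.columns.get_indexer(cols)
df['new'] = df.to_numpy()[np.arange(len(df)), col_pos]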
NumPy solution; it requires the part columns to be numbered consecutively from 1 and ordered as in the sample:
import numpy as np

df['new'] = df.filter(like='part').values[np.arange(len(df)), df['need'] - 1]
print(df)
date need part1 part2 part3 part4 new
0 20130101 1 -0.17 0.67 0.70 0.240 -0.17
1 20130101 3 -1.03 -0.03 -3.00 -0.440 -3.00
2 20130103 2 1.59 1.95 1.50 1.335 1.95
3 20130104 4 -0.05 -3.25 -0.25 -0.450 -0.45
4 20130105 3 -0.10 -0.30 -0.37 -0.570 -0.37
5 20130107 1 0.90 0.60 0.62 0.920 0.90
This should also work:
df['new'] = df.iloc[:, 1:].apply(lambda row: row['part'+str(int(row['need']))], axis=1)
date need part1 part2 part3 part4 new
0 20130101 1 -0.17 0.67 0.70 0.240 -0.17
1 20130101 3 -1.03 -0.03 -3.00 -0.440 -3.00
2 20130103 2 1.59 1.95 1.50 1.335 1.95
3 20130104 4 -0.05 -3.25 -0.25 -0.450 -0.45
4 20130105 3 -0.10 -0.30 -0.37 -0.570 -0.37
5 20130107 1 0.90 0.60 0.62 0.920 0.90
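One caveat: the apply version calls a Python function for every row, so on large frames the lookup/NumPy approaches above will generally be considerably faster.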
I have created df_nan below, which shows the total number of NaN values in each column of the main df.
However, I want to create a new df with a column/index of countries and another column with the number of NaN values for each country.
Country Number of NaN Values
Aruba 4
Finland 3
I feel like I have to use groupby to create something along the lines of the code below, but .isna is not an attribute of a groupby object. Any help would be great, thanks!
df_nan2= df_nan.groupby(['Country']).isna().sum()
Current code
import pandas as pd
import seaborn as sns
import numpy as np
from scipy.stats import spearmanr
# given dataframe df
df = pd.read_csv('countries.csv')
df.drop(columns= ['Population (millions)', 'HDI', 'GDP per Capita','Fish Footprint','Fishing Water',
'Urban Land','Earths Required', 'Countries Required', 'Data Quality'], axis=1, inplace = True)
df_nan= df.isna().sum()
Head of main df
0 Afghanistan Middle East/Central Asia 0.30 0.20 0.08 0.18 0.79 0.24 0.20 0.02 0.50 -0.30
1 Albania Northern/Eastern Europe 0.78 0.22 0.25 0.87 2.21 0.55 0.21 0.29 1.18 -1.03
2 Algeria Africa 0.60 0.16 0.17 1.14 2.12 0.24 0.27 0.03 0.59 -1.53
3 Angola Africa 0.33 0.15 0.12 0.20 0.93 0.20 1.42 0.64 2.55 1.61
4 Antigua and Barbuda Latin America NaN NaN NaN NaN 5.38 NaN NaN NaN 0.94 -4.44
5 Argentina Latin America 0.78 0.79 0.29 1.08 3.14 2.64 1.86 0.66 6.92 3.78
6 Armenia Middle East/Central Asia 0.74 0.18 0.34 0.89 2.23 0.44 0.26 0.10 0.89 -1.35
7 Aruba Latin America NaN NaN NaN NaN 11.88 NaN NaN NaN 0.57 -11.31
8 Australia Asia-Pacific 2.68 0.63 0.89 4.85 9.31 5.42 5.81 2.01 16.57 7.26
9 Austria European Union 0.82 0.27 0.63 4.14 6.06 0.71 0.16 2.04 3.07 -3.00
10 Azerbaijan Middle East/Central Asia 0.66 0.22 0.11 1.25 2.31 0.46 0.20 0.11 0.85 -1.46
11 Bahamas Latin America 0.97 1.05 0.19 4.46 6.84 0.05 0.00 1.18 9.55 2.71
12 Bahrain Middle East/Central Asia 0.52 0.45 0.16 6.19 7.49 0.01 0.00 0.00 0.58 -6.91
13 Bangladesh Asia-Pacific 0.29 0.00 0.08 0.26 0.72 0.25 0.00 0.00 0.38 -0.35
14 Barbados Latin America 0.56 0.24 0.14 3.28 4.48 0.08 0.00 0.02 0.19 -4.29
15 Belarus Northern/Eastern Europe 1.32 0.12 0.91 2.57 5.09 1.52 0.30 1.71 3.64 -1.45
16 Belgium European Union 1.15 0.48 0.99 4.43 7.44 0.56 0.03 0.28 1.19 -6.25
17 Benin Africa 0.49 0.04 0.26 0.51 1.41 0.44 0.04 0.34 0.88 -0.53
18 Bermuda North America NaN NaN NaN NaN 5.77 NaN NaN NaN 0.13 -5.64
19 Bhutan Asia-Pacific 0.50 0.42 3.03 0.63 4.84 0.28 0.34 4.38 5.27 0.43
df_nan output
Country 0
Region 0
Cropland Footprint 15
Grazing Footprint 15
Forest Footprint 15
Carbon Footprint 15
Total Ecological Footprint 0
Cropland 15
Grazing Land 15
Forest Land 15
Total Biocapacity 0
Biocapacity Deficit or Reserve 0
dtype: int64
Suppose you want to get the null count for each country from the "Cropland Footprint" column; then you can use the following code:
Unique_Country = df['Country'].unique()
Col1 = 'Cropland Footprint'
NullCount = []
for i in Unique_Country:
    s = df[df['Country'] == i][Col1].isnull().sum()
    NullCount.append(s)

df2 = pd.DataFrame({'Country': Unique_Country,
                    'Number of NaN Values': NullCount})
df2 = df2[df2['Number of NaN Values'] != 0]
df2
Output:
Country Number of NaN Values
Antigua and Barbuda 1
Aruba 1
Bermuda 1
If you want to get the null count for another column, just change the value of the Col1 variable.
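As for the groupby idea in the original question: .isna() is not available on a GroupBy object, but you can build the NaN mask first and then group it. A sketch, assuming df still contains its 'Country' column:

# boolean NaN mask, grouped by country, summed per column
nan_by_country = df.drop(columns='Country').isna().groupby(df['Country']).sum()
# one total per country across all columns
print(nan_by_country.sum(axis=1))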
How could I read in a txt file like the one from
https://psl.noaa.gov/data/correlation/pna.data (example below)
1960 -0.16 -0.22 -0.69 -0.07 0.99 1.20 1.11 1.85 -0.01 0.48 -0.52 1.15
1961 1.16 0.17 0.28 -1.14 -0.25 1.84 -0.52 0.47 1.10 -1.94 -0.40 -1.54
1962 -0.74 -0.54 -0.71 -1.50 -1.11 -0.97 -0.36 0.57 -0.83 1.33 0.53 -0.38
1963 0.09 0.79 -2.04 -0.79 -0.95 0.50 -1.10 -1.01 0.87 0.93 -0.31 1.46
1964 -0.44 1.36 -1.31 -1.30 -2.27 0.27 0.20 0.83 0.92 0.80 -0.78 -2.03
1965 -0.92 -1.03 -0.80 -1.07 -0.42 1.89 -1.26 0.32 0.36 1.42 -0.81 -1.56
into a pandas dataframe, to plot it as a time series, for example from 1960-1965, with each value column (corresponding to a month) plotted? I rarely work with .txt files.
Here's what you can try:
import pandas as pd
import requests
import re

# download the raw text, then drop the header line and the four footer lines
aa = requests.get("https://psl.noaa.gov/data/correlation/pna.data").text
aa = aa.split("\n")[1:-4]
# strip the leading character of each line, then collapse runs of spaces into commas
aa = [line[1:] for line in aa]
aa = re.sub(" +", ",", "\n".join(aa))
with open("test.csv", "w") as f:
    f.write(aa)

# read it back with the year as the index, and name the columns after the months
df = pd.read_csv("test.csv", header=None, index_col=0).rename_axis('Year')
df.columns = list(pd.date_range(start='2021-01', freq='M', periods=12).month_name())
print(df.head())
df.to_csv("test.csv")
This is going to give you, in the test.csv file, something like:

Year  January  February  March  ...  December
1948       73        67     67  773....
1949       73        67     67  773....
1950       73        67     67  773....
 ...      ...       ...    ...  ...
2021       73        88     84  733....
Use pd.read_fwf, as suggested by @SanskarSingh:
>>> pd.read_fwf('data.txt', header=None, index_col=0).rename_axis('Year')
1 2 3 4 5 6 7 8 9 10 11 12
Year
1960 -0.16 -0.22 -0.69 -0.07 0.99 1.20 1.11 1.85 -0.01 0.48 -0.52 1.15
1961 1.16 0.17 0.28 -1.14 -0.25 1.84 -0.52 0.47 1.10 -1.94 -0.40 -1.54
1962 -0.74 -0.54 -0.71 -1.50 -1.11 -0.97 -0.36 0.57 -0.83 1.33 0.53 -0.38
1963 0.09 0.79 -2.04 -0.79 -0.95 0.50 -1.10 -1.01 0.87 0.93 -0.31 1.46
1964 -0.44 1.36 -1.31 -1.30 -2.27 0.27 0.20 0.83 0.92 0.80 -0.78 -2.03
1965 -0.92 -1.03 -0.80 -1.07 -0.42 1.89 -1.26 0.32 0.36 1.42 -0.81 -1.56
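To then plot 1960-1965 as a time series, a minimal sketch (assuming matplotlib is installed and df is the frame produced by read_fwf above):

import matplotlib.pyplot as plt

df.loc[1960:1965].plot(figsize=(10, 5))  # one line per month column
plt.xlabel('Year')
plt.ylabel('PNA index')
plt.show()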
I have a CSV file as below:
t dd hh v.amm v.alc v.no2 v.cmo aqi
0 201811170000 17 0 0.40 0.41 1.33 1.55 2.45
1 201811170002 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170007 17 0 0.40 0.37 1.35 1.45 2.40
Now I have to fill in the missing minutes by last observation carried forward. Expected output:
t dd hh v.amm v.alc v.no2 v.cmo aqi
0 201811170000 17 0 0.40 0.41 1.33 1.55 2.45
1 201811170001 17 0 0.40 0.41 1.33 1.55 2.45
2 201811170002 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170003 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170004 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170005 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170006 17 0 0.40 0.41 1.34 1.51 2.46
3 201811170007 17 0 0.40 0.37 1.35 1.45 2.40
I tried following this link but was unable to achieve the expected output. Sorry, I'm new to coding.
First create a DatetimeIndex with to_datetime and DataFrame.set_index, and then change the frequency with DataFrame.asfreq:
df['t'] = pd.to_datetime(df['t'], format='%Y%m%d%H%M')
df = df.set_index('t').sort_index().asfreq('Min', method='ffill')
print(df)
dd hh v.amm v.alc v.no2 v.cmo aqi
t
2018-11-17 00:00:00 17 0 0.4 0.41 1.33 1.55 2.45
2018-11-17 00:01:00 17 0 0.4 0.41 1.33 1.55 2.45
2018-11-17 00:02:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:03:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:04:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:05:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:06:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:07:00 17 0 0.4 0.37 1.35 1.45 2.40
Or use DataFrame.resample with Resampler.ffill:
df['t'] = pd.to_datetime(df['t'], format='%Y%m%d%H%M')
df = df.set_index('t').sort_index().resample('Min').ffill()
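For a self-contained run of either snippet, the sample frame can be reconstructed like this (a sketch of the data shown in the question):

import pandas as pd

df = pd.DataFrame({
    't': [201811170000, 201811170002, 201811170007],
    'dd': [17, 17, 17], 'hh': [0, 0, 0],
    'v.amm': [0.40, 0.40, 0.40], 'v.alc': [0.41, 0.41, 0.37],
    'v.no2': [1.33, 1.34, 1.35], 'v.cmo': [1.55, 1.51, 1.45],
    'aqi': [2.45, 2.46, 2.40],
})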
I want to compute the correlation on this DataFrame, but instead of displaying it the way shown below, I want to rank the values from lowest to largest.
import pandas as pd
import numpy as np
rs = np.random.RandomState(1)
df = pd.DataFrame(rs.rand(9, 8))
corr = df.corr()
corr.style.background_gradient().set_precision(2)
0 1 2 3 4 5 6 7
0 1 0.42 0.031 -0.16 -0.35 0.23 -0.22 0.4
1 0.42 1 -0.24 -0.55 0.011 0.3 -0.26 0.23
2 0.031 -0.24 1 0.29 0.44 0.29 0.23 0.25
3 -0.16 -0.55 0.29 1 -0.33 -0.42 0.58 -0.37
4 -0.35 0.011 0.44 -0.33 1 0.46 0.074 0.19
5 0.23 0.3 0.29 -0.42 0.46 1 -0.41 0.71
6 -0.22 -0.26 0.23 0.58 0.074 -0.41 1 -0.66
7 0.4 0.23 0.25 -0.37 0.19 0.71 -0.66 1
You can use sort_values:
import pandas as pd
import numpy as np
rs = np.random.RandomState(1)
df = pd.DataFrame(rs.rand(9, 8))
corr = df.corr()
print(corr)
print(corr.sort_values(by=0, axis=1, inplace=False)) # by=0 means first row
Results:
0 1 2 3 4 5 6 7
0 1.000000 0.418246 0.030692 -0.160001 -0.352993 0.230069 -0.216804 0.395662
1 0.418246 1.000000 -0.244115 -0.549013 0.010745 0.299203 -0.262351 0.232681
2 0.030692 -0.244115 1.000000 0.288011 0.435907 0.285408 0.225205 0.253840
3 -0.160001 -0.549013 0.288011 1.000000 -0.326950 -0.415688 0.578549 -0.366539
4 -0.352993 0.010745 0.435907 -0.326950 1.000000 0.455738 0.074293 0.193905
5 0.230069 0.299203 0.285408 -0.415688 0.455738 1.000000 -0.413383 0.708467
6 -0.216804 -0.262351 0.225205 0.578549 0.074293 -0.413383 1.000000 -0.664207
7 0.395662 0.232681 0.253840 -0.366539 0.193905 0.708467 -0.664207 1.000000
0 1 7 5 2 3 6 4
0 1.000000 0.418246 0.395662 0.230069 0.030692 -0.160001 -0.216804 -0.352993
1 0.418246 1.000000 0.232681 0.299203 -0.244115 -0.549013 -0.262351 0.010745
2 0.030692 -0.244115 0.253840 0.285408 1.000000 0.288011 0.225205 0.435907
3 -0.160001 -0.549013 -0.366539 -0.415688 0.288011 1.000000 0.578549 -0.326950
4 -0.352993 0.010745 0.193905 0.455738 0.435907 -0.326950 0.074293 1.000000
5 0.230069 0.299203 0.708467 1.000000 0.285408 -0.415688 -0.413383 0.455738
6 -0.216804 -0.262351 -0.664207 -0.413383 0.225205 0.578549 1.000000 0.074293
7 0.395662 0.232681 1.000000 0.708467 0.253840 -0.366539 -0.664207 0.193905
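If the goal is instead a single ranked list of every pairwise correlation from lowest to largest, a common idiom (a sketch) keeps only the upper triangle and stacks it:

# mask out the diagonal and the duplicate lower triangle, then rank
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
print(corr.where(mask).stack().sort_values())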
Round works on a single element but not on the whole DataFrame; I tried DataFrame.round(), but it didn't work... any idea? Thanks.
I have the code below:
print "Panda Version: ", pd.__version__
print "['5am'][0]: ", x3['5am'][0]
print "Round element: ", np.round(x3['5am'][0]*4) /4
print "Round Dataframe: \r\n", np.round(x3 * 4, decimals=2) / 4
df = np.round(x3 * 4, decimals=2) / 4
print "Round Dataframe Again: \r\n", df.round(2)
Got result:
Panda Version: 0.18.0
['5am'][0]: 0.279914529915
Round element: 0.25
Round Dataframe:
5am 6am 7am 8am 9am 10am 11am
Date
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
Round Dataframe Again:
5am 6am 7am 8am 9am 10am 11am
Date
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
Try to cast to float type:
x3.astype(float).round(2)
As simple as this:
df['col_name'] = df['col_name'].astype(float).round(2)
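The cast matters when the underlying dtype is float32: round(2) then rounds within float32 precision, which is what produces the long decimal tails some people see. astype(float) upcasts to float64 first, where the rounded values display as expected.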
Explanation of your code:
In [166]: np.round(df * 4, decimals=2)
Out[166]:
a b c d
0 0.11 0.45 1.65 3.38
1 3.97 2.90 1.89 3.42
2 1.46 0.79 3.00 1.44
3 3.48 2.33 0.81 1.02
4 1.03 0.65 1.94 2.92
5 1.88 2.21 0.59 0.39
6 0.08 2.09 4.00 1.02
7 2.86 0.71 3.56 0.57
8 1.23 1.38 3.47 0.03
9 3.09 1.10 1.12 3.31
In [167]: np.round(df * 4, decimals=2) / 4
Out[167]:
a b c d
0 0.0275 0.1125 0.4125 0.8450
1 0.9925 0.7250 0.4725 0.8550
2 0.3650 0.1975 0.7500 0.3600
3 0.8700 0.5825 0.2025 0.2550
4 0.2575 0.1625 0.4850 0.7300
5 0.4700 0.5525 0.1475 0.0975
6 0.0200 0.5225 1.0000 0.2550
7 0.7150 0.1775 0.8900 0.1425
8 0.3075 0.3450 0.8675 0.0075
9 0.7725 0.2750 0.2800 0.8275
In [168]: np.round(np.round(df * 4, decimals=2) / 4, 2)
Out[168]:
a b c d
0 0.03 0.11 0.41 0.84
1 0.99 0.72 0.47 0.86
2 0.36 0.20 0.75 0.36
3 0.87 0.58 0.20 0.26
4 0.26 0.16 0.48 0.73
5 0.47 0.55 0.15 0.10
6 0.02 0.52 1.00 0.26
7 0.72 0.18 0.89 0.14
8 0.31 0.34 0.87 0.01
9 0.77 0.28 0.28 0.83
This works properly for me (pandas 0.18.1):
In [162]: df = pd.DataFrame(np.random.rand(10,4), columns=list('abcd'))
In [163]: df
Out[163]:
a b c d
0 0.028700 0.112959 0.412192 0.845663
1 0.991907 0.725550 0.472020 0.856240
2 0.365117 0.197468 0.750554 0.360272
3 0.870041 0.582081 0.203692 0.255915
4 0.257433 0.161543 0.483978 0.730548
5 0.470767 0.553341 0.146612 0.096358
6 0.020052 0.522482 0.999089 0.254312
7 0.714934 0.178061 0.889703 0.143701
8 0.308284 0.344552 0.868151 0.007825
9 0.771984 0.274245 0.280431 0.827999
In [164]: df.round(2)
Out[164]:
a b c d
0 0.03 0.11 0.41 0.85
1 0.99 0.73 0.47 0.86
2 0.37 0.20 0.75 0.36
3 0.87 0.58 0.20 0.26
4 0.26 0.16 0.48 0.73
5 0.47 0.55 0.15 0.10
6 0.02 0.52 1.00 0.25
7 0.71 0.18 0.89 0.14
8 0.31 0.34 0.87 0.01
9 0.77 0.27 0.28 0.83
Similar issue: df.round(1) didn't round as expected (e.g. .400000000123), but df.astype('float64').round(1) worked. Significantly, the dtype of df was float32. Apparently round() doesn't work properly on float32. How is this behavior not a bug?
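A minimal sketch of that workaround, upcasting float32 data to float64 before rounding:

import pandas as pd

s32 = pd.Series([0.4123456], dtype='float32')
print(s32.round(1))                    # rounds within float32 precision
print(s32.astype('float64').round(1))  # upcast first, then round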
As I just found here,
"round does not modify in-place. Rather, it returns the dataframe
rounded."
It might be helpful to think of this as follows:
df.round(2) performs the correct rounding operation, but you are neither displaying the result nor saving it anywhere.
Thus, df_final = df.round(2) will likely give you the functionality you expect, instead of just df.round(2), because the result of the rounding operation is now saved to the df_final dataframe.
Additionally, it might be best to do one additional thing and use df_final = df.round(2).copy() instead of simply df_final = df.round(2). I find that some things return unexpected results if I don't assign a copy of the old dataframe to the new dataframe.
I've tried to reproduce your situation, and it seems to work nicely.
import pandas as pd
import numpy as np
from io import StringIO
s = """Date 5am 6am 7am 8am 9am 10am 11am
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
"""
df = pd.read_table(StringIO(s), delim_whitespace=True)
df.set_index('Date').round(2)