I want to find the correlation between cities and Rainfall. Note that 'city' is categorical, not numerical.
I want to compare their rainfall.
How do I go about it? I haven't seen anything on here that talks about how to deal with duplicate cities with different data,
like
Date Location MinTemp MaxTemp Rainfall
12/1/2008 Albury 13.4 22.9 0.6
12/2/2008 Albury 7.4 25.1 0
12/3/2008 Albury 12.9 25.7 0
12/5/2008 Brisbane 20.5 29 9.6
12/6/2008 Brisbane 22.1 33.4 7.8
12/7/2008 Brisbane 22.6 33.4 12.4
12/8/2008 Brisbane 21.9 26.7 0
12/9/2008 Brisbane 19.5 27.6 0.2
12/10/2008 Brisbane 22.1 30.3 0.6
3/30/2011 Tuggeranong 9.8 25.2 0.4
3/31/2011 Tuggeranong 10.3 18.5 2.8
5/1/2011 Tuggeranong 5.5 20.8 0
5/2/2011 Tuggeranong 11 16.1 0
5/3/2011 Tuggeranong 7.3 17.5 0.6
8/29/2016 Woomera 15 22.9 0
8/30/2016 Woomera 12.5 22.1 12.8
8/31/2016 Woomera 8 20 0
9/1/2016 Woomera 11.6 21.4 0
9/2/2016 Woomera 11.2 19.6 0.3
9/3/2016 Woomera 7.1 20.4 0
9/4/2016 Woomera 6.5 18.6 0
9/5/2016 Woomera 7.3 21.5 0
One possible solution, if I understood you correctly (based on the title of your post), is:
Step 1
Prepare a dataset with Locations as columns and Rainfall values as rows (note: you will lose information here, since every city is truncated to the length of the shortest rainfall series)
df2 = df.groupby("Location")[["Location", "Rainfall"]].head(3)  # head(3) keeps the first 3 observations per city
df2.loc[:, "col"] = 4 * ["x1", "x2", "x3"]  # 4 is the number of unique cities
df3 = df2.pivot_table(index="col", columns="Location", values="Rainfall")
df3
Location Albury Brisbane Tuggeranong Woomera
col
x1 0.6 9.6 0.4 0.0
x2 0.0 7.8 2.8 12.8
x3 0.0 12.4 0.0 0.0
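(As an aside, the helper column can be built without hard-coding the number of cities; a sketch using cumcount, assuming the same df2 as above:)
# number the observations within each city instead of hand-building 4*["x1","x2","x3"]
df2["col"] = df2.groupby("Location").cumcount()
df3 = df2.pivot_table(index="col", columns="Location", values="Rainfall")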
Step 2
Compute the correlation matrix on the obtained dataset:
df3.corr()
Location Albury Brisbane Tuggeranong Woomera
Location
Albury 1.000000 -0.124534 -0.381246 -0.500000
Brisbane -0.124534 1.000000 -0.869799 -0.797017
Tuggeranong -0.381246 -0.869799 1.000000 0.991241
Woomera -0.500000 -0.797017 0.991241 1.000000
An alternative, slightly more involved solution would be to keep the longest series and impute missing values with means or medians.
But even though you would feed more data into your algorithm, it would not cure the main problem: your data seem to be misaligned. What I mean by this is that to do correlation analysis properly, you should make sure you are comparing comparable values, e.g. one city's summer rainfall with another city's summer rainfall. To do the analysis this way, you need an equal number of comparable rainfall observations for each city: e.g. winter, spring, summer, autumn; or January, February, ..., December.
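A minimal sketch of the aligned approach, assuming the Date column parses as month/day/year and the frame is named df as above (note that in the sample the cities' date ranges barely overlap, so in practice you need overlapping periods for this to produce meaningful numbers):
import pandas as pd
df["Date"] = pd.to_datetime(df["Date"])
# one row per date, one column per city; rows align rainfall by calendar date
aligned = df.pivot_table(index="Date", columns="Location", values="Rainfall")
# pairwise correlations, each computed only over the dates two cities share
print(aligned.corr())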
Related
This is my current DataFrame:
Df:
DATA
4.15
4.02
3.70
3.51
3.17
2.95
2.86
NaN
NaN
I already know that 4.15 (the first value) is 100%, 2.86 (the last value) is 30%, and 2.5 is 0%. First, I want to interpolate the NaN in the first column (the second-to-last value), given that the last NaN is 2.5 (this is already predefined). After this, I want to create a second column and interpolate it based on the first column and the three available percentage values.
Is this possible?
I have tried this code, but it is not giving the expected results:
df = pd.DataFrame({'DATA':range(df.DATA.min(), df.DATA.max()+1)}).merge(df, on='DATA', how='left')
df.Voltage = df.Voltage.interpolate()
Expected output:
Df:
DATA %
4.15 100%
4.02 89%
3.70 75%
3.51 70%
3.17 50%
2.95 35%
2.86 30%
2.74 15%
2.5 0%
Your logic is unclear; my understanding is that you want to compute a rank, but the provided output is also unclear. Please detail the computations.
What I would do:
df.loc[df.index[-1], 'DATA'] = 2.5
df['DATA'] = df['DATA'].interpolate()
# compute rank
s = df['DATA'].rank(pct=True)
# rescale to 0-1 and convert to %
df['%'] = ((s-s.min())/(1-s.min())).mul(100)
Output:
DATA %
0 4.15 100.0
1 4.02 87.5
2 3.70 75.0
3 3.51 62.5
4 3.17 50.0
5 2.95 37.5
6 2.86 25.0
7 2.68 12.5
8 2.50 0.0
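If, instead, the intent is a piecewise-linear mapping through the three known anchor points (4.15 → 100%, 2.86 → 30%, 2.5 → 0%), a sketch using numpy's interp; the anchor values come from the question, the rest is an assumption:
import numpy as np
# np.interp requires the x anchors in increasing order
xp = [2.5, 2.86, 4.15]   # known DATA values
fp = [0.0, 30.0, 100.0]  # their corresponding percentages
df["%"] = np.interp(df["DATA"], xp, fp)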
I am trying to group by state, with different counties per state, and I am facing a problem: some columns hold percentage values such as Unemployment, others hold fixed per-county values such as lat and long, and others hold numeric totals, so simply summing every column yields inaccurate results.
What approaches to such a problem can I implement?
dataset sample:
county state total_votes20 votes20_Donald_Trump \
685 Dale AL 19699.0 14281.0
716 DeKalb AL 29322.0 24744.0
1546 Lauderdale AL 44149.0 31578.0
votes20_Joe_Biden lat long cases deaths TotalPop ... \
685 5154.0 31.430371 -85.610957 1926.0 52.0 49393.0 ...
716 4271.0 34.459469 -85.807829 3691.0 30.0 71194.0 ...
1546 11872.0 34.901719 -87.656247 2743.0 43.0 92590.0 ...
Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork \
685 2.1 1.8 2.2 20.7 17988.0 73.6
716 0.9 0.6 2.1 23.2 28416.0 78.0
1546 1.7 0.7 1.8 24.1 39857.0 77.6
PublicWork SelfEmployed FamilyWork Unemployment
685 20.7 5.6 0.2 10.4
716 12.1 9.7 0.1 5.1
1546 15.6 6.6 0.1 6.6
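One common approach is a per-column aggregation map: sum the totals, and average the coordinate and percentage columns instead of summing them. A minimal sketch, assuming the frame is named df and using a few of the columns shown (the population-weighting for Unemployment is my assumption, not from the question):
agg = df.groupby("state").agg(
    total_votes20=("total_votes20", "sum"),
    cases=("cases", "sum"),
    deaths=("deaths", "sum"),
    TotalPop=("TotalPop", "sum"),
    lat=("lat", "mean"),    # coordinates: a centroid-style mean, not a sum
    long=("long", "mean"),
    Unemployment=("Unemployment", "mean"),  # simple mean of county percentages
)
# a population-weighted mean is usually more faithful for percentage columns
w = df["TotalPop"] * df["Unemployment"]
agg["Unemployment_weighted"] = w.groupby(df["state"]).sum() / df.groupby("state")["TotalPop"].sum()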
I want to calculate a positive streak for the numbers in each row, working in reverse (from the last score backwards).
I tried using cumsum() but that's not helping me.
The DataFrame looks as follows with the expected output:
country score_1 score_2 score_3 score_4 score_5 expected_streak
U.S. 12.4 13.6 19.9 22 28.7 4
Africa 11.1 15.5 9.2 7 34.2 1
India 13.9 6.6 16.3 21.8 30.9 3
Australia 25.4 36.9 18.9 29 NaN 0
Malaysia 12.8 NaN -6.2 28.6 31.7 2
Argentina 40.7 NaN 16.3 20.1 39 2
Canada 56.4 NaN NaN -2 -1 1
So, basically, score_5 should be greater than score_4, and so on, to count the streak. The streak ends at the first score (walking right to left) that is not smaller than the score after it.
One way, using diff with cummin:
df2 = df.filter(like="score_").loc[:, ::-1]  # score columns only, reversed so score_5 comes first
df["expected"] = df2.diff(-1, axis=1).gt(0).cummin(1).sum(1)  # True while each score beats the one before it; cummin cuts the run at the first False (NaN comparisons count as False)
print(df)
Output:
country score_1 score_2 score_3 score_4 score_5 expected
0 U.S. 12.4 13.6 19.9 22.0 28.7 4
1 Africa 11.1 15.5 9.2 7.0 34.2 1
2 India 13.9 6.6 16.3 21.8 30.9 3
3 Australia 25.4 36.9 18.9 29.0 NaN 0
4 Malaysia 12.8 NaN -6.2 28.6 31.7 2
5 Argentina 40.7 NaN 16.3 20.1 39.0 2
6 Canada 56.4 NaN NaN -2.0 -1.0 1
I have a dataframe with the following data
T1 SO DR AX NO Overig
SK1 20.2 21.7 27 22.4 22.6 25
PA 20.2 21.7 21.6 20.4 17.7 25.0
T4 30.8 30.0 24.3 28.6 32.3 0.0
XXS 7.7 10.0 10.8 8.2 9.7 25.0
MvM 20.2 16.7 13.5 18.4 14.5 25.0
ACH 1.0 0.0 2.7 2.0 3.2 0.0
With a specified index and columns.
I need a bar chart for just the columns T1, SO, and DR, with the index names on the x-axis and the values of those three columns on the y-axis. In this case the total number of bars will be 6*3 = 18.
I have tried the following:
df.T.plot(kind='bar')
tevr_asp[['T1','SO','DR']].T.plot.bar()
You can use the DataFrame plot function to put specific columns on the y-axis; with use_index you can put the index on the x-axis:
df.plot(y=["T1", "SO", "DR"],use_index=True, kind="bar")
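For completeness, a runnable sketch of the same idea with axis labels added (assuming the table above is loaded in a DataFrame named df; the label texts are placeholders, since the question does not name the units):
import matplotlib.pyplot as plt
ax = df.plot(y=["T1", "SO", "DR"], use_index=True, kind="bar")
ax.set_xlabel("index")
ax.set_ylabel("value")
plt.tight_layout()
plt.show()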
Can someone please tell me how I can fill in the missing values of my dataframe? The missing values don't come up as NaN or anything common; instead they show as two dots, like '..'. How would I go about filling them in with the mean of the row they are in?
1971 1990 1999 2000 2001 2002
Estonia .. 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia .. 12.4 13.3 13.6 14.5 14.6
My headers are the years and my index is the countries.
It seems you can use mask: compare against the numpy array created by .values, replace the matches with the row means, and finally cast all columns to float:
print (df.mean(axis=1))
Estonia 10.26
Spain 210.82
SlovakRepublic 29.70
Slovenia 13.68
df = df.mask(df.values == '..', df.mean(axis=1), axis=0).astype(float)
print (df)
1971 1990 1999 2000 2001 2002
Estonia 10.26 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia 13.68 12.4 13.3 13.6 14.5 14.6
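An alternative that avoids arithmetic on mixed-dtype columns is to turn the '..' placeholders into real NaN first, then fill each row with its own mean; a sketch, assuming the frame is named df:
import numpy as np
# '..' strings -> NaN, then the columns can become real floats
df = df.replace('..', np.nan).astype(float)
# fill each row's NaN with that row's mean
df = df.apply(lambda row: row.fillna(row.mean()), axis=1)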
You should be able to use .set_value:
try df_name.set_value('index', 'column', value)
something like
df_name.set_value('Estonia', '1971', 50)
Note that set_value has since been deprecated and removed in modern pandas; the equivalent today is df_name.at['Estonia', '1971'] = 50.