Can I add a new column to a DataFrame with interpolation? - python

this is my current DataFrame:
Df:
DATA
4.15
4.02
3.70
3.51
3.17
2.95
2.86
NaN
NaN
I already know that 4.15 (the first value) is 100%, 2.86 (the last value) is 30%, and 2.5 is 0%. First, I want to interpolate the NaN (second to last) value in the first column, given that the last NaN is 2.5 (this is already predefined). After that, I want to create a second column and interpolate its values based on the first column and these three known percentage points.
Is that possible?
I have tried this code, but it is not giving the expected results:
df = pd.DataFrame({'DATA':range(df.DATA.min(), df.DATA.max()+1)}).merge(df, on='DATA', how='left')
df.Voltage = df.Voltage.interpolate()
Expected output:
Df:
DATA %
4.15 100%
4.02 89%
3.70 75%
3.51 70%
3.17 50%
2.95 35%
2.86 30%
2.74 15%
2.5 0%

Your logic is unclear; my understanding is that you want to compute a rank, but the provided output is also unclear, so please detail the computations.
What I would do:
# set the last value to the known 2.5, then fill the remaining NaN linearly
df.loc[df.index[-1], 'DATA'] = 2.5
df['DATA'] = df['DATA'].interpolate()
# compute the percentile rank
s = df['DATA'].rank(pct=True)
# rescale so the smallest rank maps to 0 and the largest to 1, then convert to %
df['%'] = ((s-s.min())/(1-s.min())).mul(100)
output:
DATA %
0 4.15 100.0
1 4.02 87.5
2 3.70 75.0
3 3.51 62.5
4 3.17 50.0
5 2.95 37.5
6 2.86 25.0
7 2.68 12.5
8 2.50 0.0
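If the goal is instead to anchor the percentages to the three known points (2.5 is 0%, 2.86 is 30%, 4.15 is 100%) rather than to a rank, a piecewise-linear mapping with np.interp is another option. A minimal sketch under that assumption (it will not exactly reproduce the hand-written expected column, which does not look strictly linear):
import numpy as np
import pandas as pd

df = pd.DataFrame({'DATA': [4.15, 4.02, 3.70, 3.51, 3.17, 2.95, 2.86, np.nan, np.nan]})
df.loc[df.index[-1], 'DATA'] = 2.5           # the last value is known to be 2.5
df['DATA'] = df['DATA'].interpolate()        # fill the remaining NaN linearly
# np.interp needs the anchor x-values in increasing order
df['%'] = np.interp(df['DATA'], [2.5, 2.86, 4.15], [0, 30, 100])
print(df)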

Related

Pandas: Parse Excel spreadsheet with merged cells and blank values

My question is similar to this one. I have a spreadsheet with some merged cells, but the column with merged cells also has empty cells, e.g.:
Day Sample CD4 CD8
----------------------------
Day 1 8311 17.3 6.44
--------------------
8312 13.6 3.50
--------------------
8321 19.8 5.88
--------------------
8322 13.5 4.09
----------------------------
Day 2 8311 16.0 4.92
--------------------
8312 5.67 2.28
--------------------
8321 13.0 4.34
--------------------
8322 10.6 1.95
----------------------------
8323 16.0 4.92
----------------------------
8324 5.67 2.28
----------------------------
8325 13.0 4.34
How can I parse this into a Pandas DataFrame? I understand that the fillna(method='ffill') method will not solve my issue, since it will replace the actually missing values with something else. I want to get a DataFrame like this:
Day Sample CD4 CD8
----------------------------
Day 1 8311 17.3 6.44
----------------------------
Day 1 8312 13.6 3.50
----------------------------
Day 1 8321 19.8 5.88
----------------------------
Day 1 8322 13.5 4.09
----------------------------
Day 2 8311 16.0 4.92
----------------------------
Day 2 8312 5.67 2.28
----------------------------
Day 2 8321 13.0 4.34
----------------------------
Day 2 8322 10.6 1.95
----------------------------
NA 8323 16.0 4.92
----------------------------
NA 8324 5.67 2.28
----------------------------
NA 8325 13.0 4.34
Something like this should work, assuming you know the starting row of your Excel file (or can come up with a better way to check that):
import pandas as pd
import numpy as np
import openpyxl

def test():
    filepath = "C:\\Users\\me\\Desktop\\SO nonsense\\PandasMergeCellTest.xlsx"
    df = pd.read_excel(filepath)
    wb = openpyxl.load_workbook(filepath)
    sheet = wb["Sheet1"]
    df["Row"] = np.arange(len(df)) + 2  # my headers were row 1, so add 2 to get the sheet row numbers
    df["Merged"] = df.apply(lambda x: checkMerged(x, sheet), axis=1)
    # forward-fill Day only for rows that belong to a merged cell; leave the rest as NaN
    df["Day"] = np.where(df["Merged"] == True, df["Day"].ffill(), np.nan)
    df = df.drop(["Row", "Merged"], axis=1)
    print(df)

def checkMerged(x, sheet):
    cell = sheet.cell(x["Row"], 1)
    for mergedcell in sheet.merged_cells.ranges:
        if cell.coordinate in mergedcell:
            return True
    return False

test()

Calculate Positive Streak for Pandas Rows in reverse

I want to calculate a positive streak for numbers in a row in reverse fashion.
I tried using cumsum() but that's not helping me.
The DataFrame looks as follows with the expected output:
country score_1 score_2 score_3 score_4 score_5 expected_streak
U.S. 12.4 13.6 19.9 22 28.7 4
Africa 11.1 15.5 9.2 7 34.2 1
India 13.9 6.6 16.3 21.8 30.9 3
Australia 25.4 36.9 18.9 29 NaN 0
Malaysia 12.8 NaN -6.2 28.6 31.7 2
Argentina 40.7 NaN 16.3 20.1 39 2
Canada 56.4 NaN NaN -2 -1 1
So, basically, score_5 should be greater than score_4, and so on, working backwards to build up the streak count. The streak ends as soon as, moving backwards, a score is not greater than the score before it.
One way using diff with cummin:
df2 = df.filter(like="score_").loc[:, ::-1]
df["expected"] = df2.diff(-1, axis=1).gt(0).cummin(1).sum(1)
print(df)
Output:
country score_1 score_2 score_3 score_4 score_5 expected
0 U.S. 12.4 13.6 19.9 22.0 28.7 4
1 Africa 11.1 15.5 9.2 7.0 34.2 1
2 India 13.9 6.6 16.3 21.8 30.9 3
3 Australia 25.4 36.9 18.9 29.0 NaN 0
4 Malaysia 12.8 NaN -6.2 28.6 31.7 2
5 Argentina 40.7 NaN 16.3 20.1 39.0 2
6 Canada 56.4 NaN NaN -2.0 -1.0 1
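For readers who want to see the intermediate steps, here is the same idea expanded, assuming the DataFrame from the question (a sketch, not the answerer's code):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["U.S.", "Africa", "India", "Australia", "Malaysia", "Argentina", "Canada"],
    "score_1": [12.4, 11.1, 13.9, 25.4, 12.8, 40.7, 56.4],
    "score_2": [13.6, 15.5, 6.6, 36.9, np.nan, np.nan, np.nan],
    "score_3": [19.9, 9.2, 16.3, 18.9, -6.2, 16.3, np.nan],
    "score_4": [22.0, 7.0, 21.8, 29.0, 28.6, 20.1, -2.0],
    "score_5": [28.7, 34.2, 30.9, np.nan, 31.7, 39.0, -1.0],
})

scores = df.filter(like="score_").loc[:, ::-1]   # columns reversed: score_5 ... score_1
increases = scores.diff(-1, axis=1).gt(0)        # True where a score beats the one before it
streak = increases.cummin(axis=1).sum(axis=1)    # stop at the first False, count leading Trues
df["expected"] = streak
print(df)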

Compare entries to bucketed table thresholds, as they get populated in an SQL database

I'm trying to compare live sensor data stored in an SQL database to a table of "acceptable thresholds".
Each entry in the database has three variables that identify it, and then a series of independent variables which I'm trying to compare.
This is the table to which I'd like to compare each entry. 'Gear', 'Throttle Bucket' and '% Load' are the three variables that I want to use to identify which row from the comparison table I want to compare each entry to.
Gear Throttle Bucket % Load Sensor 1 Expected Sensor 2 Expected Sensor 3 Expected
1 1000-1500 0-50 40.1 72.4 8.7
1 1000-1500 51-100 58.8 70.4 17.5
1 1000-1500 0-50 79.0 68.5 22.5
1 1000-1500 51-100 40.2 71.7 8.5
2 1000-1500 0-50 58.6 70.6 16.9
2 1000-1500 51-100 43.9 71.0 10.1
2 1000-1500 0-50 46.6 67.4 14.3
2 1000-1500 51-100 78.5 95.4 25.0
3 1000-1500 0-50 17.7 68.4 3.7
3 1000-1500 51-100 47.7 88.4 19.7
3 1000-1500 0-50 46.6 67.4 14.3
3 1000-1500 51-100 78.5 95.4 25.0
4 1000-1500 0-50 17.7 68.4 3.7
4 1000-1500 0-50 40.2 71.7 8.5
4 1000-1500 51-100 77.7 69.2 23.8
And this is how the data is coming into the server.
Entry Number Gear Throttle % Load Sensor 1 Sensor 2 Sensor 3
1 2 1134 36% 34.1 76.4 9.7
2 4 1758 87% 47.6 89.4 14.5
3 3 1642 15% 52.6 64.5 17.3
4 1 1224 51% 48.6 59.7 4.2
5 4 1421 34% 61.4 74.9 3.9
6 2 1801 73% 52.9 88.4 12.5
So what I've been trying to do is, for every entry in that table, use the values for 'Gear', 'Throttle', and '% Load' to find out what the expected value for each of the sensors should have been, and calculate the percent difference.
I have the "Comparison Table" stored locally as a Pandas dataframe, and every morning I bulk download all the entries from the live sensor data (approximately 1000 lines).
I've put together a loop with iterrows, but I'm having trouble comparing float values to the bucketed columns, and also with setting three different conditions to "triangulate" the correct row.
Please let me know if there is any information that I've missed. I appreciate any help in advance!
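A rough sketch of one merge-based way to do the "triangulation", assuming hypothetical frames thresholds and readings shaped like the tables above and the bucket edges shown there (throttle 1000-1500, load 0-50 / 51-100); a real implementation would need the full set of bucket edges, and duplicate combinations in the comparison table would need to be resolved first (e.g. by keeping one row per combination):
import pandas as pd

# small stand-ins for the two tables above (only Sensor 1 shown for brevity)
thresholds = pd.DataFrame({
    "Gear": [1, 1, 2, 2],
    "Throttle Bucket": ["1000-1500"] * 4,
    "% Load": ["0-50", "51-100", "0-50", "51-100"],
    "Sensor 1 Expected": [40.1, 58.8, 58.6, 43.9],
})
readings = pd.DataFrame({
    "Entry Number": [1, 4],
    "Gear": [2, 1],
    "Throttle": [1134, 1224],
    "% Load": ["36%", "51%"],
    "Sensor 1": [34.1, 48.6],
})

# bucket the raw throttle and load readings so they match the comparison table's labels
readings["Throttle Bucket"] = pd.cut(
    readings["Throttle"], bins=[1000, 1500, 2000],
    labels=["1000-1500", "1501-2000"]).astype(str)
load = readings["% Load"].str.rstrip("%").astype(float)
readings["% Load"] = pd.cut(
    load, bins=[0, 50, 100], labels=["0-50", "51-100"]).astype(str)

# one merge instead of an iterrows loop, then the percent difference per sensor
merged = readings.merge(thresholds, on=["Gear", "Throttle Bucket", "% Load"], how="left")
merged["Sensor 1 % diff"] = (
    (merged["Sensor 1"] - merged["Sensor 1 Expected"])
    / merged["Sensor 1 Expected"] * 100)
print(merged)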

correlation matrix between cities

I want to find the correlation between cities and Rainfall. Note that 'city' is categorical, not numerical.
I want to compare their rainfall.
How do I go about it? I haven't seen anything on here that talks about how to deal with duplicate cities with different data,
like
Date Location MinTemp MaxTemp Rainfall
12/1/2008 Albury 13.4 22.9 0.6
12/2/2008 Albury 7.4 25.1 0
12/3/2008 Albury 12.9 25.7 0
12/5/2008 Brisbane 20.5 29 9.6
12/6/2008 Brisbane 22.1 33.4 7.8
12/7/2008 Brisbane 22.6 33.4 12.4
12/8/2008 Brisbane 21.9 26.7 0
12/9/2008 Brisbane 19.5 27.6 0.2
12/10/2008 Brisbane 22.1 30.3 0.6
3/30/2011 Tuggeranong 9.8 25.2 0.4
3/31/2011 Tuggeranong 10.3 18.5 2.8
5/1/2011 Tuggeranong 5.5 20.8 0
5/2/2011 Tuggeranong 11 16.1 0
5/3/2011 Tuggeranong 7.3 17.5 0.6
8/29/2016 Woomera 15 22.9 0
8/30/2016 Woomera 12.5 22.1 12.8
8/31/2016 Woomera 8 20 0
9/1/2016 Woomera 11.6 21.4 0
9/2/2016 Woomera 11.2 19.6 0.3
9/3/2016 Woomera 7.1 20.4 0
9/4/2016 Woomera 6.5 18.6 0
9/5/2016 Woomera 7.3 21.5 0
One possible solution, if I understood you correctly (based on the title of the OP), is:
Step 1
Prepare a dataset with Locations as columns and Rainfall observations as rows (note that you will lose information here; only as many rows as the shortest rainfall series are kept)
df2=df.groupby("Location")[["Location", "Rainfall"]].head(3) # head(3) is first 3 observations
df2.loc[:,"col"] = 4*["x1","x2","x3"] # 4 is number of unique cities
df3 = df2.pivot_table(index="col",columns="Location",values="Rainfall")
df3
Location Albury Brisbane Tuggeranong Woomera
col
x1 0.6 9.6 0.4 0.0
x2 0.0 7.8 2.8 12.8
x3 0.0 12.4 0.0 0.0
Step 2
Compute the correlation matrix on the obtained dataset
df3.corr()
Location Albury Brisbane Tuggeranong Woomera
Location
Albury 1.000000 -0.124534 -0.381246 -0.500000
Brisbane -0.124534 1.000000 -0.869799 -0.797017
Tuggeranong -0.381246 -0.869799 1.000000 0.991241
Woomera -0.500000 -0.797017 0.991241 1.000000
An alternative, slightly more involved solution would be to keep the longest series and impute the missing values with the mean or median.
But even though you would feed more data into your algorithm, it won't cure the main problem: your data seem to be misaligned. What I mean by this is that, to do correlation analysis properly, you should make sure you compare comparable values, e.g. one city's summer rainfall with another city's summer rainfall. To do the analysis this way, you should make sure you have an equal number of comparable rainfall observations for each city: e.g. winter, spring, summer, autumn; or January, February, ..., December.
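As a rough sketch of that alignment idea, assuming the DataFrame from the question is named df and its Date column parses as shown, one could aggregate rainfall to a common monthly index per city before correlating:
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"])
monthly = (df.set_index("Date")
             .groupby("Location")["Rainfall"]
             .resample("M").sum()       # total rainfall per city per month
             .unstack("Location"))      # one column per city, indexed by month
print(monthly.corr())                   # pairwise correlation over overlapping months
Cities with no overlapping months will simply show NaN correlations, which makes the misalignment visible instead of hiding it.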

How can I fill my dataframe

Can someone please tell me how I can fill in the missing values of my dataframe? The missing values don't come up as NaN or anything common; instead they show as two dots, like '..'. How would I go about filling them in with the mean of the row they are in?
1971 1990 1999 2000 2001 2002
Estonia .. 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia .. 12.4 13.3 13.6 14.5 14.6
My headers are the years and my index are the countries.
It seems you can use mask: compare against the numpy array from .values, replace the matching cells with the row means, and finally cast all columns to float:
print (df.mean(axis=1))
Estonia 10.26
Spain 210.82
SlovakRepublic 29.70
Slovenia 13.68
df = df.mask(df.values == '..', df.mean(axis=1), axis=0).astype(float)
print (df)
1971 1990 1999 2000 2001 2002
Estonia 10.26 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia 13.68 12.4 13.3 13.6 14.5 14.6
You should be able to use .set_value (note it has since been deprecated and removed in favor of .at in newer pandas versions):
try df_name.set_value('index', 'column', value)
something like
df_name.set_value('Estonia', '1971', 50)
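Another option, if the data comes from a file, is to turn '..' into NaN at read time with na_values and then fill each row with its own mean. A minimal sketch, assuming a hypothetical data.csv with the countries as the index column:
import pandas as pd

# 'data.csv' is a hypothetical file name; na_values makes '..' parse as NaN
df = pd.read_csv("data.csv", index_col=0, na_values="..")
# fill each row's NaN with that row's mean
df = df.apply(lambda row: row.fillna(row.mean()), axis=1)
print(df)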
