Pandas: Parse Excel spreadsheet with merged cells and blank values

Pandas: Parse Excel spreadsheet with merged cells and blank values - python

My question is similar to this one. I have a spreadsheet with some merged cells, but the column with merged cells also has empty cells, e.g.:
Day Sample CD4 CD8
----------------------------
Day 1 8311 17.3 6.44
--------------------
8312 13.6 3.50
--------------------
8321 19.8 5.88
--------------------
8322 13.5 4.09
----------------------------
Day 2 8311 16.0 4.92
--------------------
8312 5.67 2.28
--------------------
8321 13.0 4.34
--------------------
8322 10.6 1.95
----------------------------
8323 16.0 4.92
----------------------------
8324 5.67 2.28
----------------------------
8325 13.0 4.34
How can I parse this into a Pandas DataFrame? I understand that the fillna(method='ffill') method will not solve my issue, since it will replace the actually missing values with something else. I want to get a DataFrame like this:
Day Sample CD4 CD8
----------------------------
Day 1 8311 17.3 6.44
----------------------------
Day 1 8312 13.6 3.50
----------------------------
Day 1 8321 19.8 5.88
----------------------------
Day 1 8322 13.5 4.09
----------------------------
Day 2 8311 16.0 4.92
----------------------------
Day 2 8312 5.67 2.28
----------------------------
Day 2 8321 13.0 4.34
----------------------------
Day 2 8322 10.6 1.95
----------------------------
NA 8323 16.0 4.92
----------------------------
NA 8324 5.67 2.28
----------------------------
NA 8325 13.0 4.34

Something like this should work assuming you know the starting row of your excel file (or come up with a better way to check that)
import pandas as pd
import numpy as np
import openpyxl
def test():
filepath = "C:\\Users\\me\\Desktop\\SO nonsense\\PandasMergeCellTest.xlsx"
df = pd.read_excel(filepath)
wb = openpyxl.load_workbook(filepath)
sheet = wb["Sheet1"]
df["Row"] = np.arange(len(df)) + 2 #My headers were row 1 so adding 2 to get the row numbers
df["Merged"] = df.apply(lambda x: checkMerged(x, sheet), axis=1)
df["Day"] = np.where(df["Merged"] == True, df["Day"].ffill(), np.nan)
df = df.drop(["Row", "Merged"], 1)
print(df)
def checkMerged(x, sheet):
cell = sheet.cell(x["Row"], 1)
for mergedcell in sheet.merged_cells.ranges:
if(cell.coordinate in mergedcell):
return True
test()

Related

Can i add new column in DataFrame with Interpolation?

this is my current DataFrame:
Df:
DATA
4.15
4.02
3.70
3.51
3.17
2.95
2.86
NaN
NaN
i alredy know that 4.15(first value) is 100%, 2.86(last value) is 30% and 2.5 is 0%. firstly, i want to interpolate first column the NaN(second last)value based on last NaN is 2.5(this is alredy predfined). after this i want to create second column and interpolate based on first coumn and available these three percentage value.
is it possible?
i have tried this code but it is not giving expected results:
df = pd.DataFrame({'DATA':range(df.DATA.min(), df.DATA.max()+1)}).merge(df, on='DATA', how='left')
df.Voltage = df.Voltage.interpolate()
Expected output:
Df:
DATA %
4.15 100%
4.02 89%
3.70 75%
3.51 70%
3.17 50%
2.95 35%
2.86 30%
2.74 15%
2.5 0%

Your logic is unclear, my understanding is that you want to compute a rank, but the provided output is unclear, please details the computations.
What I would do:
df.loc[df.index[-1], 'DATA'] = 2.5
df['DATA'] = df['DATA'].interpolate()
# compute rank
s = df['DATA'].rank(pct=True)
# rescale to 0-1 and convert to %
df['%'] = ((s-s.min())/(1-s.min())).mul(100)
output:
DATA %
0 4.15 100.0
1 4.02 87.5
2 3.70 75.0
3 3.51 62.5
4 3.17 50.0
5 2.95 37.5
6 2.86 25.0
7 2.68 12.5
8 2.50 0.0

Getting the average of the previous "x" amount of days into the current position of the new Pandas column

I need help with getting the average of the previous X amount of days into the current position of the new column.
The problem I am having is at the line of code df['avg'] = (df['Close'].shift(0) + df['Close'].shift(1)) / 2.
This is what I want, but of course, I want it to be dynamic. That is where I need help! I can't figure out how to do so because I am having issues with how it already seems to by looping itself when called.
I understand what it is doing and why (...I think) but can't figure out a way around it to get my desired result.
import pandas as pd
import os
import sys
import NasdaqTickerSymbols as nts
class MY_PANDA_INDICATORS():
def __init__(self, days, csvFile):
self.days = days
self.df = None
self.csvFile = csvFile
def GetDataFrame(self):
modpath = os.path.dirname(os.path.abspath(sys.argv[0]))
datapath = os.path.join(modpath, "CSV\\"+ self.csvFile + ".csv")
df = pd.read_csv(datapath)
return(df)
def GetEMA(self):
df['avg'] = df['Close'].shift(0) + df['Close'].shift(1)
return(df)
myD = MY_PANDA_INDICATORS(2,nts.matches[0])
print(myD.GetEMA())
Here is what I am getting and also what I want, but I want to be able to change the number of days and get the average of that "x" amount I pass to it. I have tried looping but none work as intended.
Date Open High Low Close Adj Close Volume avg
0 2020-11-16 1.15 1.15 1.11 1.12 1.12 17100 NaN
1 2020-11-17 1.15 1.15 1.11 1.13 1.13 29900 1.125
2 2020-11-18 1.15 1.20 1.12 1.16 1.16 127700 1.145
3 2020-11-19 1.17 1.22 1.16 1.16 1.16 64500 1.160
4 2020-11-20 1.18 1.18 1.14 1.15 1.15 32600 1.155
.. ... ... ... ... ... ... ... ...
246 2021-11-08 2.40 2.40 2.31 2.32 2.32 20000 2.340
247 2021-11-09 2.35 2.35 2.28 2.31 2.31 19700 2.315
248 2021-11-10 2.29 2.31 2.20 2.20 2.20 24200 2.255
249 2021-11-11 2.20 2.22 2.18 2.21 2.21 18700 2.205
250 2021-11-12 2.21 2.22 2.18 2.21 2.21 7800 2.210

You can reindex your DataFrame by the date, and then perform a rolling mean and with the argument x number of days as a string (such as "2D"):
df['avg'] = df.set_index(["Date"]).rolling(f"{self.days}D").mean().values
On a smaller example:
df = pd.DataFrame({'date': pd.date_range('2021-01-01','2021-01-05'), 'close':[1,3,5,7,9]})
Input:
>>> df
date close
0 2021-01-01 1
1 2021-01-02 3
2 2021-01-03 5
3 2021-01-04 7
4 2021-01-05 9
df['avg'] = df.set_index(["date"]).rolling("2D").mean().values
Output:
>>> df
date close avg
0 2021-01-01 1 1.0
1 2021-01-02 3 2.0
2 2021-01-03 5 4.0
3 2021-01-04 7 6.0
4 2021-01-05 9 8.0

How to decompose cohort data?

I'm trying to decompose cohort data into time series for further analysis. I'm imagining the algorithm pretty well, but my code doesn't work at all.
The input data in df is like:
Cohort Day
0
1
2
3
4
5
2020-12-27
5.87
4.9
2.89
1.47
1.38
0.95
2020-12-28
13.2
3.1
0.79
1.47
1.38
0.95
I'm trying to decompose it in this format:
day
sum
2020-12-27
5.87
2020-12-28
4.9
2020-12-29
2.89
2020-12-30
1.47
2020-12-31
1.38
2020-01-01
0.95
2020-12-28
13.2
2020-12-29
3.1
2020-12-30
0.79
2020-12-31
1.47
2020-01-01
1.38
2020-01-02
0.95
To achieve that I created an empty dataframe test and then I'm using for loop to create a column with dates at first:
for row in test.itertuples():
test[0:5, 0] = df['Cohort Day'] + df.apply(lambda x: int(str(df.iloc[0, 4:].columns)) for x in df.iteritems())
test[0:5, 1] = df[0, 1:].transpose()
But all I receive is an empty test dataframe.
Any suggestions will be appreciated!

Avoid using looping codes which are slow. Use fast vectorized Pandas built-in functions whenever possible.
You can transform the dataframe from wide to long by .stack(). Set day as Cohort Day plus the day offsets 0, 1, ..., 5, as follows:
# convert `Cohort Day` to datetime format
df['Cohort Day'] = pd.to_datetime(df['Cohort Day'])
# transform from wide to long
df2 = (df.set_index('Cohort Day')
.rename_axis(columns='day_offset')
.stack()
.reset_index(name='sum')
)
# convert day offsets 0, 1, 2, ..., 5 to timedelta format
df2['day_offset'] = pd.to_timedelta(df2['day_offset'].astype(int), unit='d')
# set up column `day` as the `Cohort Day` + day offsets
df2['day'] = df2['Cohort Day'] + df2['day_offset']
# Get the desired columns
df_out = df2[['day', 'sum']]
Result:
print(df_out)
day sum
0 2020-12-27 5.87
1 2020-12-28 4.90
2 2020-12-29 2.89
3 2020-12-30 1.47
4 2020-12-31 1.38
5 2021-01-01 0.95
6 2020-12-28 13.20
7 2020-12-29 3.10
8 2020-12-30 0.79
9 2020-12-31 1.47
10 2021-01-01 1.38
11 2021-01-02 0.95

correlation matrix between cities

I want to find the corr btw cities and and Rainfall. Note that 'city' is categorical, not numerical.
I wand to compare their rainfall.
How do I go about it? I haven't seen anything on here that talk about how to deal with duplicate cities with different data
like
Date Location MinTemp MaxTemp Rainfall
12/1/2008 Albury 13.4 22.9 0.6
12/2/2008 Albury 7.4 25.1 0
12/3/2008 Albury 12.9 25.7 0
12/5/2008 Brisbane 20.5 29 9.6
12/6/2008 Brisbane 22.1 33.4 7.8
12/7/2008 Brisbane 22.6 33.4 12.4
12/8/2008 Brisbane 21.9 26.7 0
12/9/2008 Brisbane 19.5 27.6 0.2
12/10/2008 Brisbane 22.1 30.3 0.6
3/30/2011 Tuggeranong 9.8 25.2 0.4
3/31/2011 Tuggeranong 10.3 18.5 2.8
5/1/2011 Tuggeranong 5.5 20.8 0
5/2/2011 Tuggeranong 11 16.1 0
5/3/2011 Tuggeranong 7.3 17.5 0.6
8/29/2016 Woomera 15 22.9 0
8/30/2016 Woomera 12.5 22.1 12.8
8/31/2016 Woomera 8 20 0
9/1/2016 Woomera 11.6 21.4 0
9/2/2016 Woomera 11.2 19.6 0.3
9/3/2016 Woomera 7.1 20.4 0
9/4/2016 Woomera 6.5 18.6 0
9/5/2016 Woomera 7.3 21.5 0

One possible solution, if I understood you correctly (based on the title of OP), is:
Step 1
Preparing a dataset with Locations as columns and Rainfall as rows (note, you will lose information here up to a shortest rainfall series)
df2=df.groupby("Location")[["Location", "Rainfall"]].head(3) # head(3) is first 3 observations
df2.loc[:,"col"] = 4*["x1","x2","x3"] # 4 is number of unique cities
df3 = df2.pivot_table(index="col",columns="Location",values="Rainfall")
df3
Location Albury Brisbane Tuggeranong Woomera
col
x1 0.6 9.6 0.4 0.0
x2 0.0 7.8 2.8 12.8
x3 0.0 12.4 0.0 0.0
Step 2
Doing correlation matrix on the obtained dataset
df3.corr()
Location Albury Brisbane Tuggeranong Woomera
Location
Albury 1.000000 -0.124534 -0.381246 -0.500000
Brisbane -0.124534 1.000000 -0.869799 -0.797017
Tuggeranong -0.381246 -0.869799 1.000000 0.991241
Woomera -0.500000 -0.797017 0.991241 1.000000
An alternative, slightly more involved solution would be to keep the longest series and impute missing values with means or median.
But even though you will feed more data into your algo, it won't cure the main problem: your data seem to be misaligned. What I mean by this is that to do correlation analysis properly you should make it sure, that you compare comparable values, e.g. rainfall for summer with rainfall for summer for another city. To do analysis this way, you should make it sure you have equal amount of comparable rainfalls for each city: e.g. winter, spring, summer, autumn; or, January, February, ..., December.

Python/Pandas - Sum dataframe items if indexes have the same month

I have this two DataFrames:
Seasonal_Component:
# DataFrame that has the seasonal component of a time series
Date
2014-12 -1.08
2015-01 -0.28
2015-02 0.15
2015-03 0.46
2015-04 0.48
2015-05 0.37
2015-06 0.20
2015-07 0.15
2015-08 0.12
2015-09 -0.02
2015-10 -0.17
2015-11 -0.39
Prediction_df:
# DataFrame with the prediction of the trend of that same time series
Prediction MAPE Score
2015-11-01 7.93 1.83 1
2015-12-01 7.93 1.67 1
2016-01-01 7.92 1.71 1
2016-02-01 7.95 1.84 1
2016-03-01 7.94 1.53 1
2016-04-01 7.87 1.45 1
2016-05-01 7.91 1.53 1
2016-06-01 7.87 1.40 1
2016-07-01 7.84 1.40 1
2016-08-01 7.89 1.77 1
2016-09-01 7.87 1.99 1
What I need to do:
Check which Prediction_df index have the same months as the Seasonal_Component index and sum the correspondent seasonal component with the prediction, so the Prediction_df looks like this:
Prediction MAPE Score
2015-11-01 7,54 1.83 1
2015-12-01 6.85 1.67 1
2016-01-01 7.64 1.71 1
2016-02-01 8.10 1.84 1
2016-03-01 8.40 1.53 1
2016-04-01 8.35 1.45 1
2016-05-01 8.28 1.53 1
2016-06-01 8.07 1.40 1
2016-07-01 7.99 1.40 1
2016-08-01 8.01 1.77 1
2016-09-01 7.85 1.99 1
Anyone available to enlight my journey?
I'm already on the "almost mad" stage trying to solve this.
EDIT
Important note to make it clearer: I need to disconsider the year and consider only the month to make the sum. Something like "everytime that an April appears (doesn't matter if it is 2006 or 2025) I need to sum with the April value of the Seasonal_Component frame.

Consider a data frame merge on the date fields (month values), then a simple addition of the two fields. The date fields may require conversion from string values:
import datetime as dt
...
# IF DATES ARE REGULAR COLUMNS
seasonal_component['Date'] = pd.to_datetime(seasonal_component['Date'])
seasonal_component['Month'] = seasonal_component['Date'].dt.month
predict_df['Date'] = pd.to_datetime(predict_df['Date'])
predict_df['Month'] = predict_df['Date'].dt.month
# IF DATES ARE INDICES
seasonal_component.index = pd.to_datetime(seasonal_component.index)
seasonal_component['Month'] = seasonal_component.index.month
predict_df.index = pd.to_datetime(predict_df.index)
predict_df['Month'] = predict_df.index.month
However, think about how you need to join the two data sets (akin to SQL's join clauses):
inner (default) - keeps only records matching both
left - keeps records of predict_df and only those matching seasonal_component where predict_df is first argument
right - keeps records of seasonal_component and only those matching predict_df where predict_df is first argument
outer - keeps all records, those that match and those that don't match
Below assumes an outer join where data on both sides remain with NaNs to fill for missing values.
# MERGING DATA FRAMES
merge_df = pd.merge(predict_df, seasonal_component[['Month', 'SeasonalComponent']],
on=['Month'], how='outer')
# ADDING COLUMNS
merge_df['Prediction'] = merge_df['Prediction'] + merge_df['SeasonalComponent']
Outcome (using posted data)
Date Prediction MAPE Score Month SeasonalComponent
0 2015-11-01 7.54 1.83 1 11 -0.39
1 2015-12-01 6.85 1.67 1 12 -1.08
2 2016-01-01 7.64 1.71 1 1 -0.28
3 2016-02-01 8.10 1.84 1 2 0.15
4 2016-03-01 8.40 1.53 1 3 0.46
5 2016-04-01 8.35 1.45 1 4 0.48
6 2016-05-01 8.28 1.53 1 5 0.37
7 2016-06-01 8.07 1.40 1 6 0.20
8 2016-07-01 7.99 1.40 1 7 0.15
9 2016-08-01 8.01 1.77 1 8 0.12
10 2016-09-01 7.85 1.99 1 9 -0.02
11 NaT NaN NaN NaN 10 -0.17

Firstly separate the month from both dataframes and then merge on basis of month. Further add the required columns and create new column with desired output. Here is the code below:
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
from numpy.random import randn
Seasonal_Component = DataFrame({
'Date': ['2014-12','2015-01','2015-02','2015-03','2015-04','2015-05','2015-06','2015-07','2015-08','2015-09','2015-10','2015-11'],
'Value': [-1.08,-0.28,0.15,0.46,0.48,0.37,0.20,0.15,0.12,-0.02,-0.17,-0.39]
})
Prediction_df = DataFrame({
'Date': ['2015-11-01','2015-12-01','2016-01-01','2016-02-01','2016-03-01','2016-04-01','2016-05-01','2016-06-01','2016-07-01','2016-08-01','2016-09-01'],
'Prediction': [7.93,7.93,7.92,7.95,7.94,7.87,7.91,7.87,7.84,7.89,7.87],
'MAPE':[1.83,1.67,1.71,1.84,1.53,1.45,1.53,1.40,1.40,1.77,1.99],
'Score':[1,1,1,1,1,1,1,1,1,1,1]
})
def mon_extract(date):
return date.split('-')[1]
Seasonal_Component['Month']=Seasonal_Component['Date'].apply(mon_extract)
def mon_extract(date):
return date.split('-')[1].split('-')[0]
Prediction_df['Month']=Prediction_df['Date'].apply(mon_extract)
FinalDF=pd.merge(Seasonal_Component,Prediction_df,on='Month',how='right')
FinalDF
FinalDF['PredictionF']=FinalDF['Value']+FinalDF['Prediction']
FinalDF.loc[:,['Date_y','PredictionF','MAPE','Score']]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas: Parse Excel spreadsheet with merged cells and blank values - python

Related

Can i add new column in DataFrame with Interpolation?

Getting the average of the previous "x" amount of days into the current position of the new Pandas column

How to decompose cohort data?

correlation matrix between cities

Python/Pandas - Sum dataframe items if indexes have the same month

Categories

Resources