Interpolate and match missing values between two dataframes of different dimensions

Interpolate and match missing values between two dataframes of different dimensions - python

I'm new to pandas and python in general.
Currently I'm trying to interpolate and make the coordinates of two different dataframes match. The data comes from two different GEOTIFF files from the same source, one being temperature and the other being radiation. The file was converted to pandas with georasters.
The radiation dataframe has more points and data, I want to upscale the temperature dataframe and have the same coordinates as the prior.
Radiation Dataframe:
row
col
value
x
y
0
197
2427
5.755
-83.9325
17.5075
1
197
2428
5.755
-83.93
17.5075
2
197
2429
5.755
-83.9275
17.5075
3
197
2430
5.755
-83.925
17.5075
4
197
2431
5.755
-83.9225
17.5075
1850011 rows × 5 columns
Temperature Dataframe:
row
col
value
x
y
0
59
725
26.8
-83.9583
17.5083
1
59
726
26.8
-83.95
17.5083
2
59
727
26.8
-83.9417
17.5083
3
59
728
26.8
-83.9333
17.5083
4
59
729
26.8
-83.925
17.5083
167791 rows × 5 columns
Source of data
"Gis data - LTAym_AvgDailyTotals (GeoTIFF)"
Temperature Map
Radiation (GHI) Map

In order to be able to change the information of a column, you have to use iloc. I said I gave the fourth column from the left and index 3, which is the same as column x, then I gave it your values, then I printed the result.
import pandas as pd
Radiation = {'row':["197","197","197","197","197"],
'col':["2427","2428","2429","2430","2431"],
'value':['5.755','5.755','5.755','5.755','5.755'],
'x':['-83.9325','-83.93','-83.9275','-83.925','-83.9225'],
'y':['17.5075','17.5075','17.5075','17.5075','17.5075']
}
Temperature = { 'row':["59","59","59","59","59"],
'col':["725","726","727","728","729"],
'value':["26.8","26.8","26.8","26.8","26.8"],
'x':["-83.9583","-83.95","-83.9417","-83.9333","-83.925"],
'y':["17.5083","17.5083","17.5083","17.5083","17.5083"]
}
df1 = pd.DataFrame(Radiation)
df2 = pd.DataFrame(Temperature)
df1.iloc[4:,3]='1850011'
df2.iloc[4:,3]='167791'
Comparison = df1.compare(df2, keep_shape=True, keep_equal=True)
print(df1)
print(df2)

Related

Iterate over specific rows, sum results and store in new row

I have a DataFrame in which I have already defined rows to be summed up and store the results in a new row.
For example in Year 1990:
Category
A
B
C
D
Year
E
147
78
476
531
1990
F
914
356
337
781
1990
G
117
874
15
69
1990
H
45
682
247
65
1990
I
20
255
465
19
1990
Here, the rows G - H should be summed up and the results stored in a new row. The same categories repeat every year from 1990 - 2019
I have already tried it with .iloc e.g. [4:8], [50:54] [96:100] and so on, but with iloc I can not specify multiple index. I can't manage to make a loop over the single years.
Is there a way to sum the values in categories (G-H) for each year (1990 -2019)?

I'm not sure the multiple index what you mean.
It usually appear after some group and aggregate function.
At your table, it looks just multiple column
So, if I understand correctly.
Here a complete code to show how to use the multiple condition of DataFrame
import io
import pandas as pd
data = """Category A B C D Year
E 147 78 476 531 1990
F 914 356 337 781 1990
G 117 874 15 69 1990
H 45 682 247 65 1990
I 20 255 465 19 1990"""
table = pd.read_csv(io.StringIO(data), delimiter="\t")
years = table["Year"].unique()
for year in years:
row = table[((table["Category"] == "G") | (table["Category"] == "H")) & (table["Year"] == year)]
row = row[["A", "B", "C", "D"]].sum()
row["Category"], row["Year"] = "sum", year
table = table.append(row, ignore_index=True)

If you are only interested in G/H, you can slice with isin combined with boolean indexing, then sum:
df[df['Category'].isin(['G', 'H'])].sum()
output:
Category GH
A 162
B 1556
C 262
D 134
Year 3980
dtype: object
NB. note here the side effect of sum that combines the two "G"/"H" strings into one "GH".
Or, better, set Category as index and slice with loc:
df.set_index('Category').loc[['G', 'H']].sum()
output:
A 162
B 1556
C 262
D 134
Year 3980
dtype: int64

Copying existing columns as moving averages to a dataframe

I think I am overthinking this - I am trying to copy existing pandas data frame columns and values and making rolling averages - I do not want to overwrite original data. I am iterating over the columns, taking the columns and values, making a rolling 7 day ma as a new column with the suffix _ma as a copy to the original copy. I want to compare existing data to the 7day MA and see how many standard dev the data is from the 7 day MA - which I can figure out - I am just trying to save MA data as a new data frame.
I have
for column in original_data[ma_columns]:
ma_df = pd.DataFrame(original_data[ma_columns].rolling(window=7).mean(), columns = str(column)+'_ma')
and getting the error : Index(...) must be called with a collection of some kind, 'Carrier_AcctPswd_ma' was passed
But if I am iterating with
for column in original_data[ma_columns]:
print('Colunm Name : ', str(column)+'_ma')
print('Contents : ', original_data[ma_columns].rolling(window=7).mean())
I get the data I need :
My issue is just saving this as a new data frame, which I can concatenate to the old, and then do my analysis.
EDIT
I have now been able to make a bunch of data frames, but I want to concatenate them together and this is where the issue is:
for column in original_data[ma_columns]:
MA_data = pd.DataFrame(original_data[column].rolling(window=7).mean())
for i in MA_data:
new = pd.concat(i)
print(i)
<ipython-input-75-7c5e5fa775b3> in <module>
17 # print(type(MA_data))
18 for i in MA_data:
---> 19 new = pd.concat(i)
20 print(i)
21
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
279 verify_integrity=verify_integrity,
280 copy=copy,
--> 281 sort=sort,
282 )
283
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
307 "first argument must be an iterable of pandas "
308 "objects, you passed an object of type "
--> 309 '"{name}"'.format(name=type(objs).__name__)
310 )
311
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "str"

You should iterate over column names and assign the resulting pandas series as a new named column, for example:
import pandas as pd
original_data = pd.DataFrame({'A': range(100), 'B': range(100, 200)})
ma_columns = ['A', 'B']
for column in ma_columns:
new_column = column + '_ma'
original_data[new_column] = pd.DataFrame(original_data[column].rolling(window=7).mean())
print(original_data)
Output dataframe:
A B A_ma B_ma
0 0 100 NaN NaN
1 1 101 NaN NaN
2 2 102 NaN NaN
3 3 103 NaN NaN
4 4 104 NaN NaN
.. .. ... ... ...
95 95 195 92.0 192.0
96 96 196 93.0 193.0
97 97 197 94.0 194.0
98 98 198 95.0 195.0
99 99 199 96.0 196.0
[100 rows x 4 columns]

Aggregations over specific columns of a large dataframe, with named output

I am looking for a way to aggregate over a large dataframe, possibly using groupby. Each group would be based on either pre-specified columns or regex, and the aggregation should produce a named output.
This produces a sample dataframe:
import pandas as pd
import itertools
import numpy as np
col = "A,B,C".split(',')
col1 = "1,2,3,4,5,6,7,8,9".split(',')
col2 = "E,F,G".split(',')
all_dims = [col, col1, col2]
all_keys = ['.'.join(i) for i in itertools.product(*all_dims)]
rng = pd.date_range(end=pd.Timestamp.today().date(), periods=12, freq='M')
df = pd.DataFrame(np.random.randint(0, 1000, size=(len(rng), len(all_keys))), columns=all_keys, index=rng)
Above produces a dataframe with one year's worth of monthly data, with 36 columns with following names:
['A.1.E', 'A.1.F', 'A.1.G', 'A.2.E', 'A.2.F', 'A.2.G', 'A.3.E', 'A.3.F',
'A.3.G', 'A.4.E', 'A.4.F', 'A.4.G', 'A.5.E', 'A.5.F', 'A.5.G', 'A.6.E',
'A.6.F', 'A.6.G', 'A.7.E', 'A.7.F', 'A.7.G', 'A.8.E', 'A.8.F', 'A.8.G',
'A.9.E', 'A.9.F', 'A.9.G', 'B.1.E', 'B.1.F', 'B.1.G', 'B.2.E', 'B.2.F',
'B.2.G', 'B.3.E', 'B.3.F', 'B.3.G', 'B.4.E', 'B.4.F', 'B.4.G', 'B.5.E',
'B.5.F', 'B.5.G', 'B.6.E', 'B.6.F', 'B.6.G', 'B.7.E', 'B.7.F', 'B.7.G',
'B.8.E', 'B.8.F', 'B.8.G', 'B.9.E', 'B.9.F', 'B.9.G', 'C.1.E', 'C.1.F',
'C.1.G', 'C.2.E', 'C.2.F', 'C.2.G', 'C.3.E', 'C.3.F', 'C.3.G', 'C.4.E',
'C.4.F', 'C.4.G', 'C.5.E', 'C.5.F', 'C.5.G', 'C.6.E', 'C.6.F', 'C.6.G',
'C.7.E', 'C.7.F', 'C.7.G', 'C.8.E', 'C.8.F', 'C.8.G', 'C.9.E', 'C.9.F',
'C.9.G']
What I would like now is to be able aggregate over the dataframe and take certain column combinations and produce named outputs. For example, one rules might be that I will take all 'A.*.E' columns (that have any number in the middle), sum them and produce a named output column called 'A.SUM.E'. And then do the same for 'A.*.F', 'A.*.G' and so on.
I have looked into pandas 25 named aggregation which allows me to name my outputs but I couldn't see how to simultaneously capture the right column combinations and produce the right output names.
If you need to reshape the dataframe to make a workable solution, that is fine as well.
Note, I am aware I could do something like this in a Python loop but I am looking for a pandas way to do it.

Not a groupby solution and it uses a loop but I think it's nontheless rather elegant: first get a list of unique column from - to combinations using a set and then do the sums using filter:
cols = sorted([(x[0],x[1]) for x in set([(x.split('.')[0], x.split('.')[-1]) for x in df.columns])])
for c0, c1 in cols:
df[f'{c0}.SUM.{c1}'] = df.filter(regex = f'{c0}\.\d+\.{c1}').sum(axis=1)
Result:
A.1.E A.1.F A.1.G A.2.E ... B.SUM.G C.SUM.E C.SUM.F C.SUM.G
2018-08-31 978 746 408 109 ... 4061 5413 4102 4908
2018-09-30 923 649 488 447 ... 5585 3634 3857 4228
2018-10-31 911 359 897 425 ... 5039 2961 5246 4126
2018-11-30 77 479 536 509 ... 4634 4325 2975 4249
2018-12-31 608 995 114 603 ... 5377 5277 4509 3499
2019-01-31 138 612 363 218 ... 4514 5088 4599 4835
2019-02-28 994 148 933 990 ... 3907 4310 3906 3552
2019-03-31 950 931 209 915 ... 4354 5877 4677 5557
2019-04-30 255 168 357 800 ... 5267 5200 3689 5001
2019-05-31 593 594 824 986 ... 4221 2108 4636 3606
2019-06-30 975 396 919 242 ... 3841 4787 4556 3141
2019-07-31 350 312 104 113 ... 4071 5073 4829 3717
If you want to have the result in a new DataFrame, just create an empty one and add the columns to it:
result = pd.DataFrame()
for c0, c1 in cols:
result[f'{c0}.SUM.{c1}'] = df.filter(regex = f'{c0}\.\d+\.{c1}').sum(axis=1)
Update: using simple groupby (which is even more simple in this particular case):
def grouper(col):
c = col.split('.')
return f'{c[0]}.SUM.{c[-1]}'
df.groupby(grouper, axis=1).sum()

Handling Zeros or NaNs in a Pandas DataFrame operations

I have a DataFrame (df) like shown below where each column is sorted from largest to smallest for frequency analysis. That leaves some values either zeros or NaN values as each column has a different length.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column has different lengths or number of records (ignoring zero values). I have tried using NaN but for some reason operations on Nan values are not possible.
Here is what I am trying to do with my df columns :
shape_list1=[]
location_list1=[]
scale_list1=[]
for column in df.columns:
shape1, location1, scale1=stats.genpareto.fit(df[column])
shape_list1.append(shape1)
location_list1.append(location1)
scale_list1.append(scale1)

Assuming all values are positive (as seems from your example and description), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])

The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column][df[column].nonzero()[0]])
Explanation: df[column].nonzero() returns a tuple of size (1,) whose only element, element [0], is a numpy array that holds the index labels where df is nonzero. To index df[column] by these nonzero labels, you can use df[column][df[column].nonzero()[0]].

Join dataframe with matrix output using pandas

I am trying to translate the input dataframe (inp_df) to output dataframe (out_df) using the the data from the cell based intermediate dataframe (matrix_df) as shown below.
There are several cell number based files with distance values shown in matrix_df .
The program iterates by cell & fetches data from appropriate file so each time matrix_df will have the data for all rows of the current cell# that we are iterating for in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell# based data is
use a apply function to go row by row
then use a join based on column B in the inp_df with with matrix_df, where the matrix df is somehow translated into a tuple of column name, distance & average distance.
But I am looking for a pandonic way of doing this since my approach will slow down when there are millions of rows in the input. I am specifically looking for core logic inside an iteration to fetch the matches, since in each cell the number of columns in matrix_df would vary
If its any help the matrix files is the distance based outputs from sklearn.metrics.pairwise.pairwise_distances .
NB: In inp_df the value of column B is unique and values of column A may or may not be unique
Also the matrix_dfs first column was empty & i had renamed it with the following code for easiness in understanding since it was a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)

Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Interpolate and match missing values between two dataframes of different dimensions - python

Related

Iterate over specific rows, sum results and store in new row

Copying existing columns as moving averages to a dataframe

Aggregations over specific columns of a large dataframe, with named output

Handling Zeros or NaNs in a Pandas DataFrame operations

Join dataframe with matrix output using pandas

Categories

Resources