Join dataframe with matrix output using pandas - python

I am trying to translate the input dataframe (inp_df) to the output dataframe (out_df) using data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with distance values, shown as matrix_df.
The program iterates by cell and fetches data from the appropriate file, so on each iteration matrix_df holds the data for all rows of the current cell# in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell#'s data is:
use an apply function to go row by row
then join inp_df with matrix_df on column B, where matrix_df is somehow translated into tuples of (column name, distance, avg_distance).
But I am looking for a pandas-idiomatic way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside each iteration that fetches the matches, since the number of columns in matrix_df varies from cell to cell.
If it's any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: in inp_df the values of column B are unique, while the values of column A may or may not be unique.
Also, matrix_df's first column was empty (it is a header-less matrix output file), so I renamed it with the following code to make it easier to follow:
dist_df = pd.read_csv(mypath, index_col=False)
dist_df.rename(columns={'Unnamed: 0': 'B'}, inplace=True)

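For a self-contained run, the inputs can be reconstructed from the tables above; a minimal sketch (the names inp_df, matrix_df1 and matrix_df2 are assumptions, and the matrix column names are strings, as pd.read_csv would produce):

import pandas as pd

inp_df = pd.DataFrame({'A': [100, 115, 145, 115],
                       'B': [200, 270, 255, 266],
                       'cell': [1, 1, 2, 1]})

# cell_1.csv: distance of each B row to columns 100, 115, 199
matrix_df1 = pd.DataFrame({'B': [200, 270, 266],
                           '100': [7.5, 6.8, 58.0],
                           '115': [80.7, 53.0, 84.0],
                           '199': [67.8, 92.0, 31.0],
                           'avg_distance': [52.0, 50.0, 57.0]})

# cell_2.csv: distance of each B row to columns 145, 121, 166
matrix_df2 = pd.DataFrame({'B': [255],
                           '145': [74.9],
                           '121': [77.53],
                           '166': [8.0],
                           'avg_distance': [53.47]})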
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
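If the row-wise apply proves too slow at millions of rows, a fully vectorized alternative is to melt each cell's matrix into long form once and do a single merge; a minimal sketch using the frames built above:

# Reshape each matrix to one row per (B, distance-column) pair.
long_parts = []
for cell, mdf in [(1, matrix_df1), (2, matrix_df2)]:
    part = mdf.melt(id_vars=['B', 'avg_distance'],
                    var_name='A', value_name='distance')
    part['A'] = part['A'].astype(int)   # column names were strings
    part['cell'] = cell
    long_parts.append(part)

dist_long = pd.concat(long_parts, ignore_index=True)
out_df = inp_df.merge(dist_long, on=['A', 'B', 'cell'])[
    ['A', 'B', 'cell', 'distance', 'avg_distance']]

The merge keys line up with inp_df directly, so no per-row Python is involved, and the varying column count per cell is absorbed by the melt.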

Related

Interpolate and match missing values between two dataframes of different dimensions

I'm new to pandas and Python in general.
Currently I'm trying to interpolate and make the coordinates of two different dataframes match. The data comes from two GeoTIFF files from the same source, one for temperature and the other for radiation. The files were converted to pandas with georasters.
The radiation dataframe has more points and data; I want to upscale the temperature dataframe so it has the same coordinates as the radiation one.
Radiation Dataframe:
   row   col  value         x        y
0  197  2427  5.755  -83.9325  17.5075
1  197  2428  5.755    -83.93  17.5075
2  197  2429  5.755  -83.9275  17.5075
3  197  2430  5.755   -83.925  17.5075
4  197  2431  5.755  -83.9225  17.5075
1850011 rows × 5 columns
Temperature Dataframe:
   row  col  value         x        y
0   59  725   26.8  -83.9583  17.5083
1   59  726   26.8    -83.95  17.5083
2   59  727   26.8  -83.9417  17.5083
3   59  728   26.8  -83.9333  17.5083
4   59  729   26.8   -83.925  17.5083
167791 rows × 5 columns
Source of data
"Gis data - LTAym_AvgDailyTotals (GeoTIFF)"
Temperature Map
Radiation (GHI) Map
To change the contents of a column, you can use iloc. Below I take the fourth column from the left (index 3, which is column x), assign it your values, and print the result.
import pandas as pd

Radiation = {'row': ["197", "197", "197", "197", "197"],
             'col': ["2427", "2428", "2429", "2430", "2431"],
             'value': ['5.755', '5.755', '5.755', '5.755', '5.755'],
             'x': ['-83.9325', '-83.93', '-83.9275', '-83.925', '-83.9225'],
             'y': ['17.5075', '17.5075', '17.5075', '17.5075', '17.5075']}

Temperature = {'row': ["59", "59", "59", "59", "59"],
               'col': ["725", "726", "727", "728", "729"],
               'value': ["26.8", "26.8", "26.8", "26.8", "26.8"],
               'x': ["-83.9583", "-83.95", "-83.9417", "-83.9333", "-83.925"],
               'y': ["17.5083", "17.5083", "17.5083", "17.5083", "17.5083"]}

df1 = pd.DataFrame(Radiation)
df2 = pd.DataFrame(Temperature)

df1.iloc[4:, 3] = '1850011'
df2.iloc[4:, 3] = '167791'

Comparison = df1.compare(df2, keep_shape=True, keep_equal=True)
print(df1)
print(df2)
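The comparison above lines the frames up, but it does not itself resample temperature onto the radiation coordinates. For the interpolation step the question asks about, one option is scipy.interpolate.griddata; a minimal sketch, assuming radiation and temperature are the full frames with numeric x, y and value columns (names assumed):

import numpy as np
from scipy.interpolate import griddata

temp_xy = temperature[['x', 'y']].to_numpy(dtype=float)
temp_v = temperature['value'].to_numpy(dtype=float)
rad_xy = radiation[['x', 'y']].to_numpy(dtype=float)

# method='nearest' always returns a value; method='linear' needs the
# sample points to span a 2-D area and yields NaN outside their hull.
radiation['temperature'] = griddata(temp_xy, temp_v, rad_xy, method='nearest')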

How to join concatenated values to new values in python

Hi, I'm new to Python and trying to understand joining.
I have two dataframes -
df1
OutputValues
12-99
22-99
264-99
12-323,138-431
4-21
12-123
df2
OldId NewId
99 191
84 84
323 84
59 59
431 59
208 59
60 59
58 59
325 59
390 59
324 59
564 564
123 564
21 21
I want to join these based on the second half of the values in df1, i.e. the values after the hyphen; for example, 12-99 joins to OldId 99 in df2, and 4-21 to OldId 21.
The final output dataframe should join to the new values in df2 and look like:
df3
OutputValues OutputValues2
12-99 12-191
22-99 22-191
264-99 264-191
12-323,138-431 12-323,138-59
4-21 4-21
12-123,4-325 12-564,4-59
As you can see, the first part of each concatenated value is kept and joined with the new id in my desired final output dataframe df3: where there was 99 it is replaced with 191, 123 is replaced with 564, 325 with 59, etc.
How can I do this?
Let's extract both parts, map the last part then concatenate back:
s = df1.OutputValues.str.extractall(r'(\d+-)(\d+)')
df1['OutputValues2'] = (s[0] + s[1].map(df2.astype(str).set_index('OldId')['NewId'])
                        ).groupby(level=0).agg(','.join)
Output:
OutputValues OutputValues2
0 12-99 12-191
1 22-99 22-191
2 264-99 264-191
3 12-323,138-431 12-84,138-59
4 4-21 4-21
5 12-123 12-564
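For intuition, str.extractall returns one row per regex match, keyed by a MultiIndex of (original row, match number); that is why the final groupby(level=0) stitches multi-value rows back together. For the row '12-323,138-431' (index 3), s looks like this:

            0    1
  match
3 0       12-  323
  1      138-  431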
Update: It looks like a simple replace would also work, but it might fail in edge cases, e.g. when one OldId is a prefix of another (a regex replace of -99 would also hit the front of -990):
df1['OutputValues2'] = df1.OutputValues.replace(
    ('-' + df2.astype(str)).set_index('OldId')['NewId'],
    regex=True)
An alternative is to explode the values, merge, and group back together:
# Split on ',' and explode, then split each part into (prefix, suffix)
# columns 0 and 1, and join OutputValues back on for regrouping later.
df1 = df1['OutputValues'].str.split(',').explode().str.split('-', expand=True).join(df1)
# Merge df2 with the exploded df1 on the suffix (column 1 vs OldId).
df3 = df2.astype(str).merge(df1, left_on='OldId', right_on=1)
# Build OutputValues2 from the prefix and the new id, then drop helper columns.
df3 = df3.assign(OutputValues2=df3[0].str.cat(df3.NewId, sep='-')).drop(columns=['OldId', 'NewId', 0, 1])
df3.groupby('OutputValues')['OutputValues2'].agg(','.join).reset_index()
OutputValues OutputValues2
0 12-123 12-564
1 12-323,138-431 12-84,138-59
2 12-99 12-191
3 22-99 22-191
4 264-99 264-191
5 4-21 4-21

Maximal Subset of Pandas Column based on a Cutoff

I am having an algorithmic problem which I am trying to solve in Python. I have a pandas dataframe, say, of two columns (I have kept it sorted in descending order here to make the problem easier to explain):
df:
ACOL BCOL
LA1 234
LA2 230
LA3 220
LA4 218
LA5 210
LA6 200
LA7 185
LA8 180
LA9 150
LA10 100
I have a threshold value for BCOL, say 215. What I want is the maximal subset of the above dataframe whose BCOL average is greater than or equal to 215.
In this case, if I keep the BCOL values down to 200, the mean of (234, 230, ..., 200) is 218.67, whereas if I keep down to 185 (234, 230, ..., 200, 185), the mean is 213.86. So my maximal subset with BCOL mean at least 215 runs from 234 to 200, and I drop the rest of the rows. My final output dataframe should be:
dfnew:
ACOL BCOL
LA1 234
LA2 230
LA3 220
LA4 218
LA5 210
LA6 200
I was trying to put BCOL into a list and use a for/while loop, but that is not pythonic and also a bit time-consuming for a very large table. Is there a more pythonic way to achieve this in pandas?
Will appreciate any help. Thanks.
IIUC, you could do:
import numpy as np

# guarantee that the DF is sorted in descending order
df = df.sort_values(by=['BCOL'], ascending=False)
# cumulative mean, then find where it is greater than 215
mask = (df['BCOL'].cumsum() / np.arange(1, len(df) + 1)) > 215.0
print(df[mask])
print(df[mask])
Output
ACOL BCOL
0 LA1 234
1 LA2 230
2 LA3 220
3 LA4 218
4 LA5 210
5 LA6 200
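Note that because BCOL is sorted in descending order, the running mean is non-increasing, so the mask is True on exactly a prefix of the rows, which is what makes this select the maximal subset. The same running mean can also be spelled with pandas' built-in expanding window; a minimal equivalent sketch:

import pandas as pd

df = pd.DataFrame({'ACOL': [f'LA{i}' for i in range(1, 11)],
                   'BCOL': [234, 230, 220, 218, 210, 200, 185, 180, 150, 100]})

# expanding().mean() is the cumulative mean of the (sorted) column.
dfnew = df[df['BCOL'].expanding().mean() >= 215.0]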

Average/Percentile of each value against column and save result in new column using pandas built in function

I have a dataframe with two columns
df
hr count
2 53
3 1586
4 890
5 833
6 209
I want to take the average/percentile of each value of the column and store the result in a new column. Currently I am doing it this way:
df['avg'] = (df['count'] / df['count'].sum()) * 100
df
hr count avg
2 53 1.484178
3 1586 44.413330
4 890 24.922991
5 833 23.326799
6 209 5.852702
I want to do this with a built-in function like mean(). How can I achieve this with a built-in function?
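The percentage-of-total itself has no dedicated single built-in as far as I know, but it can be written entirely with built-in Series methods, and rank(pct=True) is the built-in for true percentile ranks; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'hr': [2, 3, 4, 5, 6],
                   'count': [53, 1586, 890, 833, 209]})

# Percentage share of the column total, via built-in methods only.
df['avg'] = df['count'].div(df['count'].sum()).mul(100)

# Percentile rank of each value within the column.
df['pctile'] = df['count'].rank(pct=True) * 100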

Handling Zeros or NaNs in a Pandas DataFrame operations

I have a DataFrame (df), shown below, where each column is sorted from largest to smallest for frequency analysis. That leaves some values as either zeros or NaNs, since each column has a different length.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column had its own length or number of records (i.e. ignoring zero values). I have tried using NaN, but for some reason operations on NaN values are not possible.
Here is what I am trying to do with my df columns :
from scipy import stats

shape_list1 = []
location_list1 = []
scale_list1 = []
for column in df.columns:
    shape1, location1, scale1 = stats.genpareto.fit(df[column])
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)
Assuming all values are positive (as seems to be the case from your example and description), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])
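If the padding is stored as NaN rather than zero (which is what makes genpareto.fit fail), dropping it per column works the same way; a minimal sketch under that assumption:

from scipy import stats

for column in df.columns:
    vals = df[column].dropna()   # drop NaN padding
    vals = vals[vals != 0]       # drop zero padding, if any
    shape1, location1, scale1 = stats.genpareto.fit(vals)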
The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column][df[column].nonzero()[0]])
Explanation: df[column].nonzero() returns a tuple of size (1,) whose only element, element [0], is a numpy array that holds the index labels where df is nonzero. To index df[column] by these nonzero labels, you can use df[column][df[column].nonzero()[0]].
