Join/merge not working in Python

I'm trying to join df61 and df_petsy_gz on pic_code; I've included the data types for the columns as well. My code outputs a bunch of NaN values, which suggests that none of the pic_codes match between the two data sets. There are a couple million rows of data, so I'm certain there are plenty of matches. I think I'm doing something wrong.
df61.head(3)
   mpe_wgt                         pic_code
        10  420336479305589843900801597032
        10  420907139300189843900792911982
        10  420967449300189843900797682603

df61.dtypes
mpe_wgt     object
pic_code    object
df_petsy_gz.head(3)
   monthly_fiscal_year  month                         pic_code  class_of_mail
                  2017     11  420606019300189843900566128707             FC
                  2017     11  420731629300189843900584700299             FC
                  2017     11  420405029300189843900568579224             FC

   weight  calc_postage  calc_total_postage  MikeZone
   0.8750          4.02                4.02         5
   0.3750          2.77                2.77         6
   0.6875          3.60                3.60         8

df_petsy_gz.dtypes
monthly_fiscal_year      int64
month                    int64
pic_code                object
class_of_mail           object
weight                 float64
calc_postage           float64
calc_total_postage     float64
MikeZone                 int64
df61_mpe = pd.merge(df_petsy_gz, df61, on='pic_code', how='outer')
output
monthly_fiscal_year month pic_code class_of_mail \
2017.0 11.0 420606019300189843900566128707 FC
2017.0 11.0 420731629300189843900584700299 FC
2017.0 11.0 420405029300189843900568579224 FC
2017.0 11.0 420301349300189843900567382542 FC
weight calc_postage calc_total_postage MikeZone mpe_wgt
0.8750 4.02 4.02 5.0 NaN
0.3750 2.77 2.77 6.0 NaN
0.6875 3.60 3.60 8.0 NaN
0.5000 2.77 2.77 4.0 NaN

I don't know what your data looks like, but I do know that the type of join affects which rows end up in the result.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
How to handle the operation of the two objects.
left: use calling frame’s index (or column if on is specified)
right: use other frame’s index
outer: form union of calling frame’s index (or column if on is specified) with other frame’s index, and sort it lexicographically
inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the order of the calling’s one
Try an 'inner' join and see if that is what you need. It will only return rows where pic_code is found in both dataframes, so every row will have an mpe_wgt.
Additionally, make sure that pic_code has no leading/trailing whitespace, so that matching pic_codes from the two dataframes actually compare equal.
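For example, a minimal sketch of that approach, reusing the dataframe and column names from the question and assuming pic_code is stored as strings in both frames:

import pandas as pd

# Normalize the join key on both sides: cast to string and strip whitespace.
df61['pic_code'] = df61['pic_code'].astype(str).str.strip()
df_petsy_gz['pic_code'] = df_petsy_gz['pic_code'].astype(str).str.strip()

# An inner join keeps only the pic_codes present in both dataframes.
df61_mpe = pd.merge(df_petsy_gz, df61, on='pic_code', how='inner')
print(df61_mpe.head())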

Related

Pandas Dataframe Comparison and Copying

Below I have two dataframes, the first being dataframe det and the second being orig. I need to compare det['Detection'] with orig['Date/Time']. Once matching values are found during the comparison, I need to copy values from orig and det into a final dataframe (final). The format I need the final dataframe in is det['Date/Time'], orig['Lat'], orig['Lon'], orig['Dep'], det['Mag']. I hope my formatting is adequate; I was not sure how best to present the dataframes, so I just placed them in tables. Some additional information that probably won't matter: det is 3385 rows by 3 columns and orig is 818 rows by 9 columns.
det:
Date/Time                  Mag   Detection
2008/12/27T01:06:56.37   0.280   2008/12/27T13:50:07.00
2008/12/27T01:17:39.39   0.485   2008/12/27T01:17:39.00
2008/12/27T01:33:23.00  -0.080   2008/12/27T01:17:39.00
orig:
Date/Time               Lat      Lon        Dep   Ml     Mc    N  Dmin  ehz
2008/12/27T01:17:39.00  44.5112  -110.3742  5.07  -9.99  0.51  5  6     3.2
2008/12/27T04:33:30.00  44.4985  -110.3750  4.24  -9.99  1.63  9  8     0.9
2008/12/27T05:38:22.00  44.4912  -110.3743  4.73  -9.99  0.37  8  8     0.8
final:
det['Date/Time']  orig['Lat']  orig['Lon']  orig['Dep']  det['Mag']
You can merge the two dataframes. Since you want to match the Detection column from the first dataframe against the Date/Time column from the second, you can simply rename that column of the second dataframe while merging, because the name Date/Time already exists in the first dataframe:
det.merge(orig.rename(columns={'Date/Time': 'Detection'}))
OUTPUT:
Date/Time Mag Detection Lat Lon Dep Ml Mc N Dmin ehz
0 2008/12/27T01:17:39.39 0.485 2008/12/27T01:17:39.00 44.5112 -110.3742 5.07 -9.99 0.51 5 6 3.2
1 2008/12/27T01:33:23.00 -0.080 2008/12/27T01:17:39.00 44.5112 -110.3742 5.07 -9.99 0.51 5 6 3.2
You can then select the columns you want.
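For instance, a minimal sketch of building the final dataframe in the layout the question asks for (det['Date/Time'], orig['Lat'], orig['Lon'], orig['Dep'], det['Mag']); merged here is just an illustrative intermediate name:

merged = det.merge(orig.rename(columns={'Date/Time': 'Detection'}))
# Keep only the requested columns for the final dataframe.
final = merged[['Date/Time', 'Lat', 'Lon', 'Dep', 'Mag']]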

Pandas Way of Weighted Average in a Large DataFrame

I have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to find a way to compute a weighted average over this dataframe, which in turn creates another dataframe.
Here is what my dataset looks like (a very simplified version of it):
                   prec  temp
location_id hours
135         1      12.0   4.0
            2      14.0   4.1
            3      14.3   3.5
            4      15.0   4.5
            5      15.0   4.2
            6      15.0   4.7
            7      15.5   5.1
136         1      12.0   4.0
            2      14.0   4.1
            3      14.3   3.5
            4      15.0   4.5
            5      15.0   4.2
            6      15.0   4.7
            7      15.5   5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is create a new data frame that is basically a weighted average of this data frame. The requirements indicate that 12 of these location_ids should be averaged, with specified weights, to form the combined_location_id values.
For example, location_ids 1,3,5,7,9,11,13,15,17,19,21,23 with their appropriate weights (separate data coming in from another data frame) should be weighted-averaged to form the combined_location_id CL_1's data.
That is a lot of data to handle and I wasn't able to find a purely Pandas way of solving it, so I went with a for-loop approach. It is extremely slow and I am sure this is not the right way to do it:
def __weighted(self, ds, weights):
    return np.average(ds, weights=weights)

f = {'hours': 'first', 'location_id': 'first',
     'temp': lambda x: self.__weighted(x, weights),
     'prec': lambda x: self.__weighted(x, weights)}

data_frames = []
for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    data_for_this_combined_location = pd.concat(
        df_data.loc[df_data.index.get_level_values(0) == location_id]
        for location_id in mapped_location_ids)
    data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
    data_grouped_by_distance = data_grouped_by_distance.agg(f)
    data_frames.append(data_grouped_by_distance)

df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works functionally; however, the performance and memory consumption are horrible. It takes over 2 hours on my dataset, which is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
From what I can see, you can at least eliminate the inner loop over mapped_location_ids:
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
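In context, that one line replaces the pd.concat generator inside the loop with a single boolean selection. A sketch of the loop with just that change, reusing the names from the question:

for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    # Select all requested location_ids in one vectorised index lookup
    # instead of concatenating one slice per location_id.
    data_for_this_combined_location = df_data.loc[
        df_data.index.get_level_values(0).isin(mapped_location_ids)]
    # The groupby/agg steps stay exactly as in the question.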

How to remove missing data and 0s whilst keeping the dataframe the same shape using Pandas?

I have a dataframe and I want to reformat it to remove any missing values or zeros that occur before the first non-zero value in each row. However, I do not want to delete any rows or columns, and I do not want to remove any 0s or missing values which appear after the first non-zero value.
Below is the dataframe I am working with:
> data =[['Adam',2.55,4.53,3.45,2.12,3.14],['Bill',np.NaN,2.14,3.65,4.12],['Chris',np.NaN,0,2.82,0,6.04],['David',np.NaN,0,7.42,3.52]]
> df = pd.DataFrame(data, columns = ['Name', 'A','B','C','D','E'])
Moreover, here is the expected outcome:
> data1 =[['Adam',2.55,4.53,3.45,2.12,3.14],['Bill',2.14,3.65,4.12],['Chris',2.82,0,6.04],['David',7.42,3.52]]
> df1 = pd.DataFrame(data1, columns = ['Name', 'A','B','C','D','E'])
This is not a trivial problem. Here is the solution:
m=df.set_index('Name')
m=m[m.isin(m.mask(m.le(0)).bfill(axis=1).iloc[:,0]).cumsum(axis=1).astype(bool)]
print(m)
A B C D E
Name
Adam 2.55 4.53 3.45 2.12 3.14
Bill NaN 2.14 3.65 4.12 NaN
Chris NaN NaN 2.82 0.00 6.04
David NaN NaN 7.42 3.52 NaN
Then using justify:
pd.DataFrame(justify(m.values,np.nan),columns=m.columns,index=m.index).reset_index()
Name A B C D E
0 Adam 2.55 4.53 3.45 2.12 3.14
1 Bill 2.14 3.65 4.12 NaN NaN
2 Chris 2.82 0.00 6.04 NaN NaN
3 David 7.42 3.52 NaN NaN NaN
Explanation:
Step 1: Set the Name column as the index so we only deal with the numeric values.
Step 2: m.mask(m.le(0)).bfill(axis=1).iloc[:,0] gives the first value in each row which is greater than 0.
Step 3: isin() then returns True wherever that value appears in each row.
Step 4: cumsum(axis=1).astype(bool) makes all the remaining elements True so we can keep only those values; everything else becomes NaN.
Then use the justify function from the linked post.
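The justify helper itself isn't reproduced in this answer. As a rough reference, a minimal left-justify sketch along the same lines (an assumption, not necessarily identical to the version in the linked post, and limited to float arrays with NaN as the invalid value) could look like this:

import numpy as np

def justify(arr, invalid_val=np.nan, axis=1, side='left'):
    # Push all valid (non-NaN) values in each row (axis=1) or column (axis=0)
    # to one side, filling the remaining positions with invalid_val.
    mask = ~np.isnan(arr)
    justified_mask = np.sort(mask, axis=axis)
    if side == 'left':
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(arr.shape, invalid_val, dtype=float)
    if axis == 1:
        out[justified_mask] = arr[mask]
    else:
        out.T[justified_mask.T] = arr.T[mask.T]
    return out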

Row-wise calculations (Python)

Trying to run the following code to create a new column 'Median Rank':
N = data2.Rank.count()
for i in data2.Rank:
    data2['Median_Rank'] = i - 0.3/(N+0.4)
But I'm getting a constant value of 0.99802, even though my rank column is as follows:
data2.Rank.head()
Out[464]:
4131 1.0
4173 3.0
4172 3.0
4132 3.0
5335 10.0
4171 10.0
4159 10.0
5079 10.0
4115 10.0
4179 10.0
4180 10.0
4147 10.0
4181 10.0
4175 10.0
4170 10.0
4116 24.0
4129 24.0
4156 24.0
4153 24.0
4160 24.0
5358 24.0
4152 24.0
Somebody please point out the errors in my code.
Your code isn't vectorised. Use this:
N = data2.Rank.count()
data2['Median_Rank'] = data2['Rank'] - 0.3 / (N+0.4)
The reason your code does not work is that you are assigning to the entire column on every iteration of the loop, so only the last iteration's i sticks and all values in data2['Median_Rank'] are guaranteed to be identical.
This occurs because every time you execute data2['Median_Rank'] = i - 0.3/(N+0.4) you update the entire column with the value calculated by the expression. The easiest way to do this doesn't actually need a loop:
N = data2.Rank.count()
data2['Median_Rank'] = data2.Rank - 0.3/(N+0.4)
This works because pandas supports element-wise operations between a Series and a scalar.
If you still want to use a for loop, you need to use .at and iterate row by row:
for i, el in zip(data2.index, data2.Rank.values):
    data2.at[i, 'Median_Rank'] = el - 0.3/(N+0.4)

Summing rows in Python Dataframe

I just started learning Python so forgive me if this question has already been answered somewhere else. I want to create a new column called "Sum", which will simply be the previous columns added up.
Risk_Parity.tail()
VCIT VCLT PCY RWR IJR XLU EWL
Date
2017-01-31 21.704155 11.733716 9.588649 8.278629 5.061788 7.010918 7.951747
2017-02-28 19.839319 10.748690 9.582891 7.548530 5.066478 7.453951 7.950232
2017-03-31 19.986782 10.754507 9.593623 7.370828 5.024079 7.402774 7.654366
2017-04-30 18.897307 11.102380 10.021139 9.666693 5.901137 7.398604 11.284331
2017-05-31 63.962659 23.670240 46.018698 9.917160 15.234977 12.344524 20.405587
The table columns are a little off but all I need is (21.70 + 11.73...+7.95)
I can only get as far as creating the column Risk_Parity['sum'] = , but then I'm lost.
I'd rather not have to do Risk_Parity['sum'] = Risk_Parity['VCIT'] + Risk_Parity['VCLT'] + ...
After creating the sum column, I want to divide each column by the sum column and make that into a new dataframe, which wouldn't include the sum column.
If anyone could help, I'd greatly appreciate it. Please try to dumb your answers down as much as possible lol.
Thanks!
Tom
Use sum with the parameter axis=1 to specify summation over rows
Risk_Parity['Sum'] = Risk_Parity.sum(1)
To create a new copy of Risk_Parity without writing a new column to the original
Risk_Parity.assign(Sum= Risk_Parity.sum(1))
Notice also, that I named the column Sum and not sum. I did this to avoid colliding with the very same method named sum I used to create the column.
To include only the numeric columns explicitly (although sum skips non-numeric columns anyway):
RiskParity.assign(Sum=RiskParity.select_dtypes(['number']).sum(1))
# same as
# RiskParity.assign(Sum=RiskParity.sum(1))
VCIT VCLT PCY RWR IJR XLU EWL Sum
Date
2017-01-31 21.70 11.73 9.59 8.28 5.06 7.01 7.95 71.33
2017-02-28 19.84 10.75 9.58 7.55 5.07 7.45 7.95 68.19
2017-03-31 19.99 10.75 9.59 7.37 5.02 7.40 7.65 67.79
2017-04-30 18.90 11.10 10.02 9.67 5.90 7.40 11.28 74.27
2017-05-31 63.96 23.67 46.02 9.92 15.23 12.34 20.41 191.55
l = ['VCIT', 'VCLT', 'PCY', ..., 'EWL']
Risk_Parity['sum'] = 0
for item in l:
    Risk_Parity['sum'] += Risk_Parity[item]
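For the follow-up in the question about dividing each column by the sum and leaving the sum column out of the result, a minimal sketch (assuming the Sum column was created in place as in the first snippet above; normalized is just an illustrative name) could be:

# Divide every column by the row total, then leave the Sum column out of the result.
normalized = Risk_Parity.drop(columns='Sum').div(Risk_Parity['Sum'], axis=0)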
