pandas dataframe merge returns empty data frame - python
I have been trying to join/merge two DataFrames, df and df_QA, for a while.
The first data frame:
df_QA:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6878 entries, 0 to 6877
Data columns (total 14 columns):
PROPERTY_CODE 6878 non-null object
ACCOUNT_CODE 6878 non-null object
Jan 6878 non-null float64
Feb 6878 non-null float64
Mar 6878 non-null float64
Apr 6878 non-null float64
May 6878 non-null float64
Jun 6878 non-null float64
Jul 6878 non-null float64
Aug 6878 non-null float64
Sep 6878 non-null float64
Oct 6878 non-null float64
Nov 6878 non-null float64
Dec 6878 non-null float64
dtypes: float64(12), object(2)
memory usage: 752.4+ KB
The 2nd data frame:
df:
df = pd.read_csv(fname, sep="^", usecols=[2,3,5,6,7,8,9,10,11,12,13,14,15,16], converters={'Account': str, 'Entity ID': str}).dropna(subset=['Account'], how='any')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2441 entries, 0 to 2440
Data columns (total 14 columns):
PROPERTY_CODE 2441 non-null object
ACCOUNT_CODE 2441 non-null object
Jan 2441 non-null float64
Feb 2441 non-null float64
Mar 2441 non-null float64
Apr 2441 non-null float64
May 2441 non-null float64
Jun 2441 non-null float64
Jul 2441 non-null float64
Aug 2441 non-null float64
Sep 2441 non-null float64
Oct 2441 non-null float64
Nov 2441 non-null int64
Dec 2441 non-null int64
dtypes: float64(10), int64(2), object(2)
memory usage: 286.1+ KB
I tried:
df_check = pd.merge(df, df_QA, how='inner', on=['PROPERTY_CODE','ACCOUNT_CODE'])
or
df_check = df.merge(df_QA, left_on=['PROPERTY_CODE', 'ACCOUNT_CODE'], right_on=['PROPERTY_CODE', 'ACCOUNT_CODE'], how='inner', sort=True)
returns:
print (df_check)
Empty DataFrame
Columns: [PROPERTY_CODE, ACCOUNT_CODE, Jan_x, Feb_x, Mar_x, Apr_x, May_x, Jun_x, Jul_x, Aug_x, Sep_x, Oct_x, Nov_x, Dec_x, Jan_y, Feb_y, Mar_y, Apr_y, May_y, Jun_y, Jul_y, Aug_y, Sep_y, Oct_y, Nov_y, Dec_y]
Index: []
We are hoping to get a data frame in the following format:
PROPERTY_CODE, ACCOUNT_CODE, Jan_x, Feb_x, Mar_x, Apr_x, May_x, Jun_x, Jul_x, Aug_x, Sep_x, Oct_x, Nov_x, Dec_x, Jan_y, Feb_y, Mar_y, Apr_y, May_y, Jun_y, Jul_y, Aug_y, Sep_y, Oct_y, Nov_y, Dec_y
Any thoughts? Thank you!
When I try an outer merge:
df_check = pd.merge(df, df_QA, how='outer', on=['PROPERTY_CODE','ACCOUNT_CODE'])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar Apr May Jun Jul Aug Sep \
0 05099 MR01030000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 05099 MR01060000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 05099 MR01060005 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 05099 MR01200000 NaN NaN NaN NaN NaN NaN NaN NaN
It returns NaN. I checked PROPERTY_CODE and ACCOUNT_CODE, and they look exactly the same to me.
print (df_QA.loc[df_QA['PROPERTY_CODE'] == "05099"])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar
604 05099 MR01030000 -1000 -10000.75 -10000.09
605 05099 MR01060000 100000.05 100.35 -1003128.17
print (df.loc[df['PROPERTY_CODE'] == "05099"])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar
0 05099 MR01030000 -1.000000e+09 -100000.75 -100000.09
1 05099 MR01060000 1.000000e+05 1100.35 -1000000.17
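One way to narrow this down, as a minimal sketch: an empty inner merge usually means the key values differ in a way that print() hides, such as stray whitespace or a different dtype. The two frames below are made-up stand-ins for df and df_QA, and the trailing space in ACCOUNT_CODE is an assumption for illustration, not a diagnosis of the real data.

```python
import pandas as pd

# Hypothetical stand-ins for df and df_QA; the trailing space is an assumed defect.
df = pd.DataFrame({
    "PROPERTY_CODE": ["05099", "05099"],
    "ACCOUNT_CODE": ["MR01030000 ", "MR01060000 "],
    "Jan": [-1000.0, 100000.05],
})
df_QA = pd.DataFrame({
    "PROPERTY_CODE": ["05099", "05099"],
    "ACCOUNT_CODE": ["MR01030000", "MR01060000"],
    "Jan": [-1000.0, 100000.05],
})
keys = ["PROPERTY_CODE", "ACCOUNT_CODE"]

# Compare key dtypes, and use repr() to expose whitespace that print(df) hides.
print(df[keys].dtypes, df_QA[keys].dtypes, sep="\n")
print([repr(v) for v in df["ACCOUNT_CODE"].unique()])

# indicator=True adds a _merge column telling which side each row came from.
check = pd.merge(df, df_QA, how="outer", on=keys, indicator=True)
print(check["_merge"].value_counts())

# Normalise the keys on both sides, then retry the inner merge.
for frame in (df, df_QA):
    for col in keys:
        frame[col] = frame[col].astype(str).str.strip()

df_check = pd.merge(df, df_QA, how="inner", on=keys)
print(df_check)
```

If _merge shows only left_only and right_only, no key pair matched at all, which points at the key values themselves rather than at the merge call.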
Related
Extract prefix (city name) from column title and place in new column
I have a DataFrame that looks like this:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 8763 non-null datetime64[ns]
1 Madrid_wind_speed 8763 non-null float64
2 Valencia_wind_deg 8763 non-null object
3 Bilbao_rain_1h 8763 non-null float64
4 Valencia_wind_speed 8763 non-null float64
5 Seville_humidity 8763 non-null float64
6 Madrid_humidity 8763 non-null float64
7 Bilbao_clouds_all 8763 non-null float64
8 Bilbao_wind_speed 8763 non-null float64
9 Seville_clouds_all 8763 non-null float64
10 Bilbao_wind_deg 8763 non-null float64
11 Barcelona_wind_speed 8763 non-null float64
12 Barcelona_wind_deg 8763 non-null float64
13 Madrid_clouds_all 8763 non-null float64
14 Seville_wind_speed 8763 non-null float64
15 Barcelona_rain_1h 8763 non-null float64
16 Seville_pressure 8763 non-null object
17 Seville_rain_1h 8763 non-null float64
18 Bilbao_snow_3h 8763 non-null float64
19 Barcelona_pressure 8763 non-null float64
20 Seville_rain_3h 8763 non-null float64
21 Madrid_rain_1h 8763 non-null float64
22 Barcelona_rain_3h 8763 non-null float64
23 Valencia_snow_3h 8763 non-null float64
24 Madrid_weather_id 8763 non-null float64
25 Barcelona_weather_id 8763 non-null float64
26 Bilbao_pressure 8763 non-null float64
27 Seville_weather_id 8763 non-null float64
28 Valencia_pressure 6695 non-null float64
29 Seville_temp_max 8763 non-null float64
30 Madrid_pressure 8763 non-null float64
31 Valencia_temp_max 8763 non-null float64
32 Valencia_temp 8763 non-null float64
33 Bilbao_weather_id 8763 non-null float64
34 Seville_temp 8763 non-null float64
35 Valencia_humidity 8763 non-null float64
36 Valencia_temp_min 8763 non-null float64
37 Barcelona_temp_max 8763 non-null float64
38 Madrid_temp_max 8763 non-null float64
39 Barcelona_temp 8763 non-null float64
40 Bilbao_temp_min 8763 non-null float64
41 Bilbao_temp 8763 non-null float64
42 Barcelona_temp_min 8763 non-null float64
43 Bilbao_temp_max 8763 non-null float64
44 Seville_temp_min 8763 non-null float64
45 Madrid_temp 8763 non-null float64
46 Madrid_temp_min 8763 non-null float64
47 load_shortfall_3h 8763 non-null float64
dtypes: datetime64[ns](1), float64(45), object(2)
How would I go about extracting the city name from each column name and placing it in a new column, and merging the wind_speed, rain_1h, etc. data into their own respective columns?
Use DataFrame.set_index first to build a MultiIndex from the columns whose names should not be split, then use str.split on the remaining column names and DataFrame.stack to reshape them by city:
df1 = df.set_index(['time','load_shortfall_3h'])
df1.columns = df1.columns.str.split('_', n=1, expand=True)
df1 = df1.rename_axis([None, 'type'], axis=1).stack().reset_index()
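To make the reshape concrete, here is a small sketch on a made-up four-column slice of the weather data (the values are invented for illustration). It follows the same set_index / str.split / stack idea, but names both column levels and stacks the city level explicitly, since a bare stack() moves the last column level into the rows.

```python
import pandas as pd

# Made-up miniature of the weather frame; values are illustrative only.
df = pd.DataFrame({
    "time": pd.to_datetime(["2015-01-01 03:00", "2015-01-01 06:00"]),
    "Madrid_wind_speed": [0.67, 0.33],
    "Madrid_humidity": [64.0, 64.7],
    "Bilbao_wind_speed": [1.0, 1.0],
    "Bilbao_humidity": [80.0, 81.0],
    "load_shortfall_3h": [6715.7, 4171.7],
})

# Park the columns that must not be split in the row index.
df1 = df.set_index(["time", "load_shortfall_3h"])

# "Madrid_wind_speed" -> ("Madrid", "wind_speed"): split on the first underscore only.
df1.columns = df1.columns.str.split("_", n=1, expand=True)
df1 = df1.rename_axis(["city", "type"], axis=1)

# Move the city level into the rows; each measurement keeps its own column.
long = df1.stack(level="city").reset_index()
print(long)
```

The result has one row per (time, city) pair, with city as an ordinary column next to wind_speed and humidity.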
Apply operation to column
I import a dataframe which has a column 'Goals' that accumulates the previous results, like so:
print (df[df['name']=='Player Name']['Goals'])
152 1.0
828 2.0
1591 3.0
I know for a fact that the player scored only one goal per game, so the column should be like:
152 1.0
828 1.0
1591 1.0
By the way the same logic applies to all other scout columns:
...
FF 322 non-null float64
FS 568 non-null float64
Goals 80 non-null float64
A 63 non-null float64
PI 834 non-null float64
SG 140 non-null float64
DD 46 non-null float64
DS 611 non-null float64
FC 602 non-null float64
GC 3 non-null float64
GS 45 non-null float64
FD 231 non-null float64
CA 190 non-null float64
FT 34 non-null float64
I 112 non-null float64
PP 4 non-null float64
CV 9 non-null float64
...
QUESTION
What is the best way to correct this logic and apply this subtraction to the subset of columns above?
Edit:
df['G'] = df['G'].diff()
returns
Name: G, dtype: float64
152 NaN
828 NaN
1562 NaN
Writing this in the blind as you have not shown the full DataFrame:
for col in ['Goals', 'FF', 'FS', ...]:
    tmp = df.groupby('name')[col].diff().fillna(1)
    df[col] = tmp
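A runnable variant of the same idea, as a sketch on invented data: it differences several cumulative columns per player in one call. Unlike the loop above, it fills each player's first row with the original cumulative value rather than a constant 1; that is a design choice of this sketch, and it assumes there are no NaN gaps inside a player's history.

```python
import pandas as pd

# Invented cumulative scout data for two players.
df = pd.DataFrame({
    "name": ["A", "B", "A", "B", "A"],
    "Goals": [1.0, 2.0, 2.0, 2.0, 3.0],
    "FS": [2.0, 1.0, 5.0, 4.0, 5.0],
})
cumulative_cols = ["Goals", "FS"]

# Row-to-row difference within each player, for all columns at once.
per_game = df.groupby("name")[cumulative_cols].diff()

# The first row of each player has no predecessor (NaN after diff),
# so keep its original cumulative value there.
df[cumulative_cols] = per_game.fillna(df[cumulative_cols])
print(df)
```

Grouping by name matters: a plain df['Goals'].diff() would subtract one player's total from another's whenever their rows are interleaved.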
How to add a column name for Timeseries when it is indexed
Input:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 2019-01-16 to 2018-08-23 - I want to add this as my first column for analysis.
Data columns (total 5 columns):
open 100 non-null float64
high 100 non-null float64
low 100 non-null float64
close 100 non-null float64
volume 100 non-null float64
dtypes: float64(5)
memory usage: 9.7+ KB
df = df.assign(time=date)
df.head()
Out[76]:
1. open 2. high 3. low 4. close 5. volume time
2019-01-16 105.2600 106.2550 104.9600 105.3800 29655851 2019-01-16
2019-01-15 102.5100 105.0500 101.8800 105.0100 31587616 2019-01-15
2019-01-14 101.9000 102.8716 101.2600 102.0500 28437079 2019-01-14
2019-01-11 103.1900 103.4400 101.6400 102.8000 28314202 2019-01-11
2019-01-10 103.2200 103.7500 102.3800 103.6000 30067556 2019-01-10
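Since the snippet above does not show where date comes from, here is a self-contained sketch of two ways to get the index into a first column. The frame and the column name "time" are made up for illustration.

```python
import pandas as pd

# Made-up two-row frame with dates in the index.
df = pd.DataFrame(
    {"open": [105.26, 102.51], "close": [105.38, 105.01]},
    index=["2019-01-16", "2019-01-15"],
)

# Option 1: keep the index and also copy it into a first column.
with_copy = df.copy()
with_copy.insert(0, "time", with_copy.index)

# Option 2: name the index and turn it into a regular column.
as_column = df.rename_axis("time").reset_index()

print(with_copy)
print(as_column)
```

Option 1 keeps date-based row lookups working; option 2 leaves a plain 0..n-1 index, which is often easier to merge or export.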
Pandas Merge returning only null values
I am trying to use Pandas to merge a product's packing information with each order record for a given product. The data frame information is below.
BreakerOrders.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774010 entries, 0 to 3774009
Data columns (total 2 columns):
Material object
Quantity float64
dtypes: float64(1), object(1)
memory usage: 86.4+ MB
manh.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1381 entries, 0 to 1380
Data columns (total 4 columns):
Material 1381 non-null object
SUBPACK_QTY 202 non-null float64
PACK_QTY 591 non-null float64
PALLET_QTY 809 non-null float64
dtypes: float64(3), object(1)
memory usage: 43.2+ KB
When attempting the merge using the code below, I get the following table with all NaN values for packaging quantities.
BreakerOrders.merge(manh,how='left',on='Material')
Material Quantity SUBPACK_QTY PACK_QTY PALLET_QTY
HOM230CP 5.0 NaN NaN NaN
QO115 20.0 NaN NaN NaN
QO2020CP 20.0 NaN NaN NaN
QO220CP 50.0 NaN NaN NaN
HOM115CP 50.0 NaN NaN NaN
HOM120 100.0 NaN NaN NaN
I was having the same problem and was able to solve it by just flipping the DataFrames. So instead of:
df2 = df.merge(df1)
try
df2 = df1.merge(df)
It looks silly, but it solved my issue.
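Before flipping the frames, it may help to measure how many keys actually match. The sketch below uses invented stand-ins for BreakerOrders and manh; the trailing spaces and the PALLET_QTY values are assumptions for illustration only.

```python
import pandas as pd

# Invented stand-ins; the trailing spaces in orders are an assumed defect.
orders = pd.DataFrame({"Material": ["HOM230CP ", "QO115 "], "Quantity": [5.0, 20.0]})
packing = pd.DataFrame({"Material": ["HOM230CP", "QO115"], "PALLET_QTY": [100.0, 240.0]})

# Share of order keys that exist in the packing table (0.0 means nothing matches).
match_rate = orders["Material"].isin(packing["Material"]).mean()
print(match_rate)

# Normalise the key on both sides before merging.
orders["Material"] = orders["Material"].str.strip().str.upper()
packing["Material"] = packing["Material"].str.strip().str.upper()

# validate raises if a Material appears more than once in the packing table.
merged = orders.merge(packing, how="left", on="Material", validate="many_to_one")
print(merged)
```

A left merge keeps every row of the left frame, so all-NaN packing columns mean no key on the left was found on the right.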
Applying function on multiple columns to create multiple new columns
I need to apply a function on a df to create multiple new columns. As an input to my function I would need (i) a row or (ii) multiple columns
def divideAndMultiply(x,y):
    return x/y, x*y
df["e"], df["f"] = zip(*df.a.apply(lambda val: divideAndMultiply(val,2)))
(https://stackoverflow.com/a/36600318/1831695)
This works to create multiple new columns, but not with multiple inputs (= columns from that df). Or did I miss something?
Example of what I want to do: My DF has several columns. Three of them (a, b, c) are relevant to calculate two new columns (y, z), where y = a + b + c and z = c - b - a.
I know this is an easy calculation where I do not need a function, but let's assume we would need one. Instead of writing and applying 2 functions, I would like to have only one function that accepts all three values (or even better: a row) and returns both values.
This example:
df["y"], df["z"] = zip(*df.a.apply(lambda val: divideAndMultiply(val,2)))
works only when using one column data item (val) and one other value (2 in this case).
I would need something like that:
df["y"], df["z"] = zip(*df.a.apply(lambda val: divideAndMultiply(df['a'],df['b'],df['c'])))
(And yes, I know that val is assigned to df.a )
Update 2
This is what my df looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4655 entries, 0 to 4654
Data columns (total 71 columns):
Open 4655 non-null float64
Close 4655 non-null float64
High 4655 non-null float64
Low 4655 non-null float64
DateTime 4655 non-null datetime64[ns]
Date 4655 non-null datetime64[ns]
T_CLOSE 4655 non-null float64
T_HIGH 4655 non-null float64
T_LOW 4655 non-null float64
T_OPEN 4655 non-null float64
MVA5 4651 non-null float64
MVA10 4646 non-null float64
MVA14 4642 non-null float64
MVA15 4641 non-null float64
MVA28 4628 non-null float64
MVA50 4606 non-null float64
STD5 4651 non-null float64
STD10 4646 non-null float64
STD15 4641 non-null float64
CV_5 4651 non-null float64
CV_10 4646 non-null float64
CV_15 4641 non-null float64
DIFF_VP_CLOSE 4654 non-null float64
DIFF_VP_HIGH 4654 non-null float64
DIFF_VP_OPEN 4654 non-null float64
DIFF_VP_LOW 4654 non-null float64
AVG_STEIG_5 4650 non-null float64
AVG_STEIG_10 4645 non-null float64
AVG_STEIG_15 4640 non-null float64
AVG_STEIG_28 4627 non-null float64
AVG_5_DIFF 4651 non-null float64
AVG_10_DIFF 4646 non-null float64
AVG_15_DIFF 4641 non-null float64
AVG_14_DIFF 4642 non-null float64
AVG_50_DIFF 4606 non-null float64
AD_5_14 4642 non-null float64
Momentum_4 4651 non-null float64
ROC_4 4652 non-null float64
Momentum_8 4647 non-null float64
ROC_8 4648 non-null float64
Momentum_12 4643 non-null float64
ROC_12 4644 non-null float64
VT_OPEN 4598 non-null float64
VT_CLOSE 4598 non-null float64
VT_HIGH 4598 non-null float64
VT_LOW 4598 non-null float64
PP_VT 4598 non-null float64
R1_VT 4598 non-null float64
R2_VT 4598 non-null float64
R3_VT 4598 non-null float64
S1_VT 4598 non-null float64
S2_VT 4598 non-null float64
S3_VT 4598 non-null float64
DIFF_VT_CLOSE 4598 non-null float64
DIFF_VT_HIGH 4598 non-null float64
DIFF_VT_OPEN 4598 non-null float64
DIFF_VT_LOW 4598 non-null float64
DIFF_T_OPEN 4655 non-null float64
DIFF_T_LOW 4655 non-null float64
DIFF_T_HIGH 4655 non-null float64
DIFF_T_CLOSE 4655 non-null float64
DIFF_VTCLOSE_TOPEN 4598 non-null float64
VP_HIGH 4654 non-null float64
VP_LOW 4654 non-null float64
VP_OPEN 4654 non-null float64
VP_CLOSE 4654 non-null float64
regel_r1 4655 non-null int64
regel_r2 4655 non-null int64
regel_r3 4655 non-null int64
regeln 4655 non-null int64
vormittag_flag 4655 non-null int64
dtypes: datetime64[ns](2), float64(64), int64(5)
memory usage: 2.6 MB
None
Open Close High Low DateTime Date T_CLOSE T_HIGH T_LOW T_OPEN MVA5 MVA10 MVA14 MVA15 MVA28 MVA50 STD5 STD10 STD15 CV_5 CV_10 CV_15 DIFF_VP_CLOSE DIFF_VP_HIGH DIFF_VP_OPEN DIFF_VP_LOW AVG_STEIG_5 AVG_STEIG_10 AVG_STEIG_15 AVG_STEIG_28 AVG_5_DIFF AVG_10_DIFF AVG_15_DIFF AVG_14_DIFF AVG_50_DIFF AD_5_14 Momentum_4 ROC_4 Momentum_8 ROC_8 Momentum_12 ROC_12 VT_OPEN VT_CLOSE VT_HIGH VT_LOW PP_VT R1_VT R2_VT R3_VT S1_VT S2_VT S3_VT DIFF_VT_CLOSE DIFF_VT_HIGH DIFF_VT_OPEN DIFF_VT_LOW DIFF_T_OPEN DIFF_T_LOW DIFF_T_HIGH DIFF_T_CLOSE DIFF_VTCLOSE_TOPEN VP_HIGH VP_LOW VP_OPEN VP_CLOSE regel_r1 regel_r2 regel_r3 regeln vormittag_flag T_DIRC T_WECHSELC T_NUM_INNEN T_CANDLEC
4653 12488.1 12490.1 12490.6 12484.9 2017-05-03 14:00:00 2017-05-03 12490.1 12508.3 12475.4 12506.5 12490.18 12488.41 12487.521429 12487.053333 12493.078571 12498.118 2.178761 4.334218 4.515033 0.000174 0.000347 0.000362 -1.7 2.5 -1.0 -2.7 -9.6 8.5 4.866667 -8.178571 -0.08 1.69 3.046667 2.578571 -8.018 -2.658571 -3.8 0.00004 8.7 0.000577 -0.3 0.000521 12449.3 12514.3 12527.8 12432.0 12491.366667 12550.733333 12587.166667 12646.533333 12454.933333 12395.566667 12359.133333 24.2 37.7 -40.8 -58.1 16.4 -14.7 18.2 0.0 7.8 12492.6 12487.4 12489.1 12488.4 0 0 0 0 0 neutral INNEN 1 GRUEN
4654 12489.9 12489.9 12489.9 12489.6 2017-05-03 14:15:00 2017-05-03 12489.9 12508.3 12475.4 12506.5 12489.38 12488.91 12487.828571 12487.680000 12492.182143 12498.436 0.712039 4.169586 4.180431 0.000057 0.000334 0.000335 0.2 0.7 -1.8 -5.0 -8.0 5.0 6.266667 -8.964286 0.52 0.99 2.220000 2.071429 -8.536 -1.551429 0.3 0.00008 7.0 0.000064 6.3 0.000665 12449.3 12514.3 12527.8 12432.0 12491.366667 12550.733333 12587.166667 12646.533333 12454.933333 12395.566667 12359.133333 24.4 37.9 -40.6 -57.9 16.6 -14.5 18.4 0.0 7.8 12490.6 12484.9 12488.1 12490.1 0 0 0 0 0 neutral INNEN 1 GRUEN
UPDATED
This is a straightforward example with a Quandl dataset:
import quandl
df = quandl.get("WIKI/GOOGL")
columns = ["High", 'Low', 'Close']
def operations(row, columns):
    df1 = row[columns[0]] + row[columns[1]] + row[columns[2]]
    df2 = -row[columns[1]] - row[columns[2]] + row[columns[0]]
    return df1, df2
df["function1"], df["function2"] = zip(*df.apply(lambda row: operations(row, columns), axis=1))
df[["High","Low","Close","function1", "function2"]].head(5)
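The same row-wise pattern can be tried without downloading anything. The sketch below uses a tiny invented frame with the a, b, c columns from the question and computes y = a + b + c and z = c - b - a in one function. It uses result_type="expand" instead of zip(*...), which is an alternative to the approach above rather than a correction; the function name sum_and_diff is made up.

```python
import pandas as pd

# Tiny invented frame with the three input columns from the question.
df = pd.DataFrame({"a": [1, 2], "b": [10, 20], "c": [100, 200]})

def sum_and_diff(row):
    """Return (y, z) for one row, where y = a + b + c and z = c - b - a."""
    y = row["a"] + row["b"] + row["c"]
    z = row["c"] - row["b"] - row["a"]
    return y, z

# axis=1 passes each row to the function; result_type="expand" turns the
# returned tuples into columns, which are then assigned to y and z.
df[["y", "z"]] = df.apply(sum_and_diff, axis=1, result_type="expand")
print(df)
```

For plain arithmetic like this, direct column expressions such as df["a"] + df["b"] + df["c"] are faster; the row-wise form is for functions that cannot be vectorised.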