pandas dataframe merge returns empty data frame

I have been trying to join/merge two DataFrames, df and df_QA, for a while.
The first data frame:
df_QA:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6878 entries, 0 to 6877
Data columns (total 14 columns):
PROPERTY_CODE 6878 non-null object
ACCOUNT_CODE 6878 non-null object
Jan 6878 non-null float64
Feb 6878 non-null float64
Mar 6878 non-null float64
Apr 6878 non-null float64
May 6878 non-null float64
Jun 6878 non-null float64
Jul 6878 non-null float64
Aug 6878 non-null float64
Sep 6878 non-null float64
Oct 6878 non-null float64
Nov 6878 non-null float64
Dec 6878 non-null float64
dtypes: float64(12), object(2)
memory usage: 752.4+ KB
The 2nd data frame:
df:
df = pd.read_csv(fname, sep="^", usecols=[2,3,5,6,7,8,9,10,11,12,13,14,15,16], converters={'Account': str, 'Entity ID': lambda x: str(x)}).dropna(subset=['Account'], how='any')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2441 entries, 0 to 2440
Data columns (total 14 columns):
PROPERTY_CODE 2441 non-null object
ACCOUNT_CODE 2441 non-null object
Jan 2441 non-null float64
Feb 2441 non-null float64
Mar 2441 non-null float64
Apr 2441 non-null float64
May 2441 non-null float64
Jun 2441 non-null float64
Jul 2441 non-null float64
Aug 2441 non-null float64
Sep 2441 non-null float64
Oct 2441 non-null float64
Nov 2441 non-null int64
Dec 2441 non-null int64
dtypes: float64(10), int64(2), object(2)
memory usage: 286.1+ KB
I tried:
df_check = pd.merge(df, df_QA, how='inner', on=['PROPERTY_CODE','ACCOUNT_CODE'])
or
df_check = df.merge(df_QA, left_on=['PROPERTY_CODE', 'ACCOUNT_CODE'], right_on=['PROPERTY_CODE', 'ACCOUNT_CODE'], how='inner', sort=True)
returns:
print (df_check)
Empty DataFrame
Columns: [PROPERTY_CODE, ACCOUNT_CODE, Jan_x, Feb_x, Mar_x, Apr_x, May_x, Jun_x, Jul_x, Aug_x, Sep_x, Oct_x, Nov_x, Dec_x, Jan_y, Feb_y, Mar_y, Apr_y, May_y, Jun_y, Jul_y, Aug_y, Sep_y, Oct_y, Nov_y, Dec_y]
Index: []
We are hoping to get a data frame in the following format:
PROPERTY_CODE, ACCOUNT_CODE, Jan_x, Feb_x, Mar_x, Apr_x, May_x, Jun_x, Jul_x, Aug_x, Sep_x, Oct_x, Nov_x, Dec_x, Jan_y, Feb_y, Mar_y, Apr_y, May_y, Jun_y, Jul_y, Aug_y, Sep_y, Oct_y, Nov_y, Dec_y
Any thoughts? Thank you!
When I try an outer merge:
df_check = pd.merge(df, df_QA, how='outer', on=['PROPERTY_CODE','ACCOUNT_CODE'])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar Apr May Jun Jul Aug Sep \
0 05099 MR01030000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 05099 MR01060000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 05099 MR01060005 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 05099 MR01200000 NaN NaN NaN NaN NaN NaN NaN NaN
It returns NaN. I checked PROPERTY_CODE and ACCOUNT_CODE in both frames, and they look exactly the same to me.
print (df_QA.loc[df_QA['PROPERTY_CODE'] == "05099"])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar
604 05099 MR01030000 -1000 -10000.75 -10000.09
605 05099 MR01060000 100000.05 100.35 -1003128.17
print (df.loc[df['PROPERTY_CODE'] == "05099"])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar
0 05099 MR01030000 -1.000000e+09 -100000.75 -100000.09
1 05099 MR01060000 1.000000e+05 1100.35 -1000000.17
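
One way to investigate a merge that matches nothing is to run an outer merge with indicator=True and then look at repr() of the key values, which exposes differences that print() hides. The sketch below uses made-up frames (left, right) and assumes, purely for illustration, that trailing whitespace in one frame's keys is the culprit; nothing in the post confirms this, and the real cause may be different.

```python
import pandas as pd

# Hypothetical data: the keys look identical when printed, but the
# right-hand frame carries trailing spaces (an assumption, for illustration).
left = pd.DataFrame({
    "PROPERTY_CODE": ["05099", "05099"],
    "ACCOUNT_CODE": ["MR01030000", "MR01060000"],
    "Jan": [-1000.0, 100000.05],
})
right = pd.DataFrame({
    "PROPERTY_CODE": ["05099 ", "05099 "],
    "ACCOUNT_CODE": ["MR01030000", "MR01060000"],
    "Jan": [-1000.0, 100000.05],
})
keys = ["PROPERTY_CODE", "ACCOUNT_CODE"]

# indicator=True adds a "_merge" column saying which side each row came from.
check = left.merge(right, on=keys, how="outer", indicator=True)
n_both = int((check["_merge"] == "both").sum())
n_left_only = int((check["_merge"] == "left_only").sum())
n_right_only = int((check["_merge"] == "right_only").sum())

# repr() shows the hidden characters that print() does not.
sample_reprs = [repr(v) for v in right["PROPERTY_CODE"].head(2)]

# Normalise the keys on a copy, then retry the inner merge.
right_clean = right.copy()
for col in keys:
    right_clean[col] = right_clean[col].astype(str).str.strip()
fixed = left.merge(right_clean, on=keys, how="inner")
```

If n_both is zero while both *_only counts are non-zero, the keys genuinely differ somewhere, even when they look the same on screen.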

Related

Extract prefix (city name) from column title and place in new column

I have a DataFrame that looks like this:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 8763 non-null datetime64[ns]
1 Madrid_wind_speed 8763 non-null float64
2 Valencia_wind_deg 8763 non-null object
3 Bilbao_rain_1h 8763 non-null float64
4 Valencia_wind_speed 8763 non-null float64
5 Seville_humidity 8763 non-null float64
6 Madrid_humidity 8763 non-null float64
7 Bilbao_clouds_all 8763 non-null float64
8 Bilbao_wind_speed 8763 non-null float64
9 Seville_clouds_all 8763 non-null float64
10 Bilbao_wind_deg 8763 non-null float64
11 Barcelona_wind_speed 8763 non-null float64
12 Barcelona_wind_deg 8763 non-null float64
13 Madrid_clouds_all 8763 non-null float64
14 Seville_wind_speed 8763 non-null float64
15 Barcelona_rain_1h 8763 non-null float64
16 Seville_pressure 8763 non-null object
17 Seville_rain_1h 8763 non-null float64
18 Bilbao_snow_3h 8763 non-null float64
19 Barcelona_pressure 8763 non-null float64
20 Seville_rain_3h 8763 non-null float64
21 Madrid_rain_1h 8763 non-null float64
22 Barcelona_rain_3h 8763 non-null float64
23 Valencia_snow_3h 8763 non-null float64
24 Madrid_weather_id 8763 non-null float64
25 Barcelona_weather_id 8763 non-null float64
26 Bilbao_pressure 8763 non-null float64
27 Seville_weather_id 8763 non-null float64
28 Valencia_pressure 6695 non-null float64
29 Seville_temp_max 8763 non-null float64
30 Madrid_pressure 8763 non-null float64
31 Valencia_temp_max 8763 non-null float64
32 Valencia_temp 8763 non-null float64
33 Bilbao_weather_id 8763 non-null float64
34 Seville_temp 8763 non-null float64
35 Valencia_humidity 8763 non-null float64
36 Valencia_temp_min 8763 non-null float64
37 Barcelona_temp_max 8763 non-null float64
38 Madrid_temp_max 8763 non-null float64
39 Barcelona_temp 8763 non-null float64
40 Bilbao_temp_min 8763 non-null float64
41 Bilbao_temp 8763 non-null float64
42 Barcelona_temp_min 8763 non-null float64
43 Bilbao_temp_max 8763 non-null float64
44 Seville_temp_min 8763 non-null float64
45 Madrid_temp 8763 non-null float64
46 Madrid_temp_min 8763 non-null float64
47 load_shortfall_3h 8763 non-null float64
dtypes: datetime64[ns](1), float64(45), object(2)
How would I go about extracting the city name from each column name, placing it in a new column, and merging the wind_speed, rain_1h, etc. data into their own respective columns?

First use DataFrame.set_index to move the columns that should not be split into the index, then split the remaining column names with str.split to build a MultiIndex, and finally reshape with DataFrame.stack:
df1 = df.set_index(['time','load_shortfall_3h'])
df1.columns = df1.columns.str.split('_', n=1, expand=True)
df1 = df1.rename_axis([None, 'type'], axis=1).stack().reset_index()
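
A small worked variant of the same idea may help. It is a sketch on a made-up four-column frame, not the asker's data, and it stacks the city level (level 0) so that the city ends up in its own column and each measurement keeps its own column. The names wide and tidy are introduced here for illustration only.

```python
import pandas as pd

# Toy frame with the same "City_measure" naming pattern (values are made up).
df = pd.DataFrame({
    "time": pd.to_datetime(["2015-01-01 03:00", "2015-01-01 06:00"]),
    "Madrid_wind_speed": [1.0, 2.0],
    "Madrid_rain_1h": [0.0, 0.1],
    "Seville_wind_speed": [3.0, 4.0],
    "Seville_rain_1h": [0.2, 0.3],
    "load_shortfall_3h": [10.0, 20.0],
})

# Keep the columns that are not city-specific out of the way.
wide = df.set_index(["time", "load_shortfall_3h"])

# Split "Madrid_wind_speed" into ("Madrid", "wind_speed"); n=1 splits only
# at the first underscore, so "rain_1h" stays intact.
wide.columns = wide.columns.str.split("_", n=1, expand=True)
wide = wide.rename_axis(["city", None], axis=1)

# Move the city level from the columns into the rows.
tidy = wide.stack(level=0).reset_index()
```

The result has one row per (time, city) pair, with wind_speed and rain_1h as ordinary columns.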

Apply operation to column

I import a dataframe which has a column 'Goals' that accumulates the previous results, like so:
print (df[df['name']=='Player Name']['Goals'])
152 1.0
828 2.0
1591 3.0
I know for a fact that the player scored only one goal per game, so the column should be like:
152 1.0
828 1.0
1591 1.0
By the way the same logic applies to all other scout columns:
...
FF 322 non-null float64
FS 568 non-null float64
Goals 80 non-null float64
A 63 non-null float64
PI 834 non-null float64
SG 140 non-null float64
DD 46 non-null float64
DS 611 non-null float64
FC 602 non-null float64
GC 3 non-null float64
GS 45 non-null float64
FD 231 non-null float64
CA 190 non-null float64
FT 34 non-null float64
I 112 non-null float64
PP 4 non-null float64
CV 9 non-null float64
...
QUESTION
What is the best way to correct this logic and apply this subtraction to the subset of columns above?
Edit:
df['G'] = df['G'].diff()
returns
Name: G, dtype: float64
152 NaN
828 NaN
1562 NaN

Writing this in the blind as you have not shown the full DataFrame:
for col in ['Goals', 'FF', 'FS', ...]:
    tmp = df.groupby('name')[col].diff().fillna(1)
    df[col] = tmp
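
As a minimal sketch of the same groupby/diff approach on made-up data: the first row of each player has no previous value, so diff() yields NaN there. Filling that gap with the original cumulative value, rather than a constant 1, is a variation introduced here; it assumes the first recorded value already is that game's count and that the rows are in chronological order.

```python
import pandas as pd

# Toy cumulative scout data for two players (values are made up).
df = pd.DataFrame({
    "name": ["A", "B", "A", "B", "A"],
    "Goals": [1.0, 2.0, 2.0, 2.0, 3.0],
    "FS": [2.0, 1.0, 5.0, 4.0, 5.0],
})
cols = ["Goals", "FS"]

# Per-player difference between consecutive rows; the first row of each
# player becomes NaN and is filled with its original cumulative value.
per_game = df.groupby("name")[cols].diff().fillna(df[cols])
df[cols] = per_game
```

Missing values inside the real columns would need extra handling, which this sketch does not attempt.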

How to add a column name for Timeseries when it is indexed

Input: df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 2019-01-16 to 2018-08-23 - I want to add this as my first column to do analysis.
Data columns (total 5 columns):
open 100 non-null float64
high 100 non-null float64
low 100 non-null float64
close 100 non-null float64
volume 100 non-null float64
dtypes: float64(5)
memory usage: 9.7+ KB

df = df.assign(time=date)
df.head()
Out[76]:
1. open 2. high 3. low 4. close 5. volume time
2019-01-16 105.2600 106.2550 104.9600 105.3800 29655851 2019-01-16
2019-01-15 102.5100 105.0500 101.8800 105.0100 31587616 2019-01-15
2019-01-14 101.9000 102.8716 101.2600 102.0500 28437079 2019-01-14
2019-01-11 103.1900 103.4400 101.6400 102.8000 28314202 2019-01-11
2019-01-10 103.2200 103.7500 102.3800 103.6000 30067556 2019-01-10
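
Another option, sketched here on a two-row toy frame, is to name the index and turn it into a regular column with reset_index, which places it first. The column name "time" and the sample values are assumptions made for this example, as is the guess that the dates are stored as strings.

```python
import pandas as pd

# Toy frame whose index holds the dates as strings (values are made up).
df = pd.DataFrame(
    {"open": [105.26, 102.51], "close": [105.38, 105.01]},
    index=["2019-01-16", "2019-01-15"],
)

# Name the index, then move it into the columns; it becomes the first column.
with_time = df.rename_axis("time").reset_index()

# Optional: parse the strings so the column is a real datetime.
with_time["time"] = pd.to_datetime(with_time["time"])
```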

Pandas Merge returning only null values

I am trying to use Pandas to merge a products packing information with each order record for a given product. The data frame information is below.
BreakerOrders.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774010 entries, 0 to 3774009
Data columns (total 2 columns):
Material object
Quantity float64
dtypes: float64(1), object(1)
memory usage: 86.4+ MB
manh.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1381 entries, 0 to 1380
Data columns (total 4 columns):
Material 1381 non-null object
SUBPACK_QTY 202 non-null float64
PACK_QTY 591 non-null float64
PALLET_QTY 809 non-null float64
dtypes: float64(3), object(1)
memory usage: 43.2+ KB
When attempting the merge using the code below, I get the following table with all NaN values for packaging quantities.
BreakerOrders.merge(manh,how='left',on='Material')
Material Quantity SUBPACK_QTY PACK_QTY PALLET_QTY
HOM230CP 5.0 NaN NaN NaN
QO115 20.0 NaN NaN NaN
QO2020CP 20.0 NaN NaN NaN
QO220CP 50.0 NaN NaN NaN
HOM115CP 50.0 NaN NaN NaN
HOM120 100.0 NaN NaN NaN

I was having the same issue and was able to solve it by just flipping the DataFrames. So instead of:
df2 = df.merge(df1)
try
df2 = df1.merge(df)
Looks silly, but it solved my issue.
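
For context, here is a small sketch of what swapping the two sides does and does not change. With how='left', the rows of whichever frame is on the left are all kept, and unmatched ones get NaN; swapping only changes which frame's rows are preserved, not how many keys actually match. The frames orders and packs are made up for this illustration.

```python
import pandas as pd

# Toy frames: only "QO115" exists on both sides (values are made up).
orders = pd.DataFrame({
    "Material": ["HOM230CP", "QO115", "XYZ"],
    "Quantity": [5.0, 20.0, 1.0],
})
packs = pd.DataFrame({
    "Material": ["QO115", "ABC"],
    "PACK_QTY": [12.0, 6.0],
})

# Left merge keeps every order row; unmatched ones get NaN for PACK_QTY.
orders_left = orders.merge(packs, how="left", on="Material")

# Swapping the frames keeps every pack row instead.
packs_left = packs.merge(orders, how="left", on="Material")

# The inner merge shows the keys that really match, whichever side is left.
matched = orders.merge(packs, how="inner", on="Material")
```

If the inner merge is empty, swapping the frames will not produce matches, and the key values themselves are worth inspecting.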

Applying function on multiple columns to create multiple new columns

I need to apply a function on a df to create multiple new columns. As an input to my function I would need (i) a row or (ii) multiple columns
def divideAndMultiply(x, y):
    return x/y, x*y
df["e"], df["f"] = zip(*df.a.apply(lambda val: divideAndMultiply(val,2)))
(https://stackoverflow.com/a/36600318/1831695)
This works to create multiple new columns, but not with multiple inputs (= columns from that df). Or did I miss something?
Example what I want to do:
My DF has several columns. 3 (a, b, c) of them are relevant to calculate two new columns (y, z) where, y = a + b + c and z = c - b - a
I know this is an easy calculation where I do not need a function, but let's assume we would need one.
Instead of writing and applying 2 functions, I would like to have only one function, return both values and accepting all three values (or even better: a row) for the calculation.
This example:
df["y"], df["z"] = zip(*df.a.apply(lambda val: divideAndMultiply(val,2)))
works only with a single column value (val) and one other value (2 in this case).
I would need something like this:
df["y"], df["z"] = zip(*df.a.apply(lambda val: divideAndMultiply(df['a'],df['b'],df['c'])))
(And yes, I know that val is assigned to df.a )
Update 2
This is what my df looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4655 entries, 0 to 4654
Data columns (total 71 columns):
Open 4655 non-null float64
Close 4655 non-null float64
High 4655 non-null float64
Low 4655 non-null float64
DateTime 4655 non-null datetime64[ns]
Date 4655 non-null datetime64[ns]
T_CLOSE 4655 non-null float64
T_HIGH 4655 non-null float64
T_LOW 4655 non-null float64
T_OPEN 4655 non-null float64
MVA5 4651 non-null float64
MVA10 4646 non-null float64
MVA14 4642 non-null float64
MVA15 4641 non-null float64
MVA28 4628 non-null float64
MVA50 4606 non-null float64
STD5 4651 non-null float64
STD10 4646 non-null float64
STD15 4641 non-null float64
CV_5 4651 non-null float64
CV_10 4646 non-null float64
CV_15 4641 non-null float64
DIFF_VP_CLOSE 4654 non-null float64
DIFF_VP_HIGH 4654 non-null float64
DIFF_VP_OPEN 4654 non-null float64
DIFF_VP_LOW 4654 non-null float64
AVG_STEIG_5 4650 non-null float64
AVG_STEIG_10 4645 non-null float64
AVG_STEIG_15 4640 non-null float64
AVG_STEIG_28 4627 non-null float64
AVG_5_DIFF 4651 non-null float64
AVG_10_DIFF 4646 non-null float64
AVG_15_DIFF 4641 non-null float64
AVG_14_DIFF 4642 non-null float64
AVG_50_DIFF 4606 non-null float64
AD_5_14 4642 non-null float64
Momentum_4 4651 non-null float64
ROC_4 4652 non-null float64
Momentum_8 4647 non-null float64
ROC_8 4648 non-null float64
Momentum_12 4643 non-null float64
ROC_12 4644 non-null float64
VT_OPEN 4598 non-null float64
VT_CLOSE 4598 non-null float64
VT_HIGH 4598 non-null float64
VT_LOW 4598 non-null float64
PP_VT 4598 non-null float64
R1_VT 4598 non-null float64
R2_VT 4598 non-null float64
R3_VT 4598 non-null float64
S1_VT 4598 non-null float64
S2_VT 4598 non-null float64
S3_VT 4598 non-null float64
DIFF_VT_CLOSE 4598 non-null float64
DIFF_VT_HIGH 4598 non-null float64
DIFF_VT_OPEN 4598 non-null float64
DIFF_VT_LOW 4598 non-null float64
DIFF_T_OPEN 4655 non-null float64
DIFF_T_LOW 4655 non-null float64
DIFF_T_HIGH 4655 non-null float64
DIFF_T_CLOSE 4655 non-null float64
DIFF_VTCLOSE_TOPEN 4598 non-null float64
VP_HIGH 4654 non-null float64
VP_LOW 4654 non-null float64
VP_OPEN 4654 non-null float64
VP_CLOSE 4654 non-null float64
regel_r1 4655 non-null int64
regel_r2 4655 non-null int64
regel_r3 4655 non-null int64
regeln 4655 non-null int64
vormittag_flag 4655 non-null int64
dtypes: datetime64[ns](2), float64(64), int64(5)
memory usage: 2.6 MB
None
Open Close High Low DateTime Date T_CLOSE T_HIGH T_LOW T_OPEN MVA5 MVA10 MVA14 MVA15 MVA28 MVA50 STD5 STD10 STD15 CV_5 CV_10 CV_15 DIFF_VP_CLOSE DIFF_VP_HIGH DIFF_VP_OPEN DIFF_VP_LOW AVG_STEIG_5 AVG_STEIG_10 AVG_STEIG_15 AVG_STEIG_28 AVG_5_DIFF AVG_10_DIFF AVG_15_DIFF AVG_14_DIFF AVG_50_DIFF AD_5_14 Momentum_4 ROC_4 Momentum_8 ROC_8 Momentum_12 ROC_12 VT_OPEN VT_CLOSE VT_HIGH VT_LOW PP_VT R1_VT R2_VT R3_VT S1_VT S2_VT S3_VT DIFF_VT_CLOSE DIFF_VT_HIGH DIFF_VT_OPEN DIFF_VT_LOW DIFF_T_OPEN DIFF_T_LOW DIFF_T_HIGH DIFF_T_CLOSE DIFF_VTCLOSE_TOPEN VP_HIGH VP_LOW VP_OPEN VP_CLOSE regel_r1 regel_r2 regel_r3 regeln vormittag_flag T_DIRC T_WECHSELC T_NUM_INNEN T_CANDLEC
4653 12488.1 12490.1 12490.6 12484.9 2017-05-03 14:00:00 2017-05-03 12490.1 12508.3 12475.4 12506.5 12490.18 12488.41 12487.521429 12487.053333 12493.078571 12498.118 2.178761 4.334218 4.515033 0.000174 0.000347 0.000362 -1.7 2.5 -1.0 -2.7 -9.6 8.5 4.866667 -8.178571 -0.08 1.69 3.046667 2.578571 -8.018 -2.658571 -3.8 0.00004 8.7 0.000577 -0.3 0.000521 12449.3 12514.3 12527.8 12432.0 12491.366667 12550.733333 12587.166667 12646.533333 12454.933333 12395.566667 12359.133333 24.2 37.7 -40.8 -58.1 16.4 -14.7 18.2 0.0 7.8 12492.6 12487.4 12489.1 12488.4 0 0 0 0 0 neutral INNEN 1 GRUEN
4654 12489.9 12489.9 12489.9 12489.6 2017-05-03 14:15:00 2017-05-03 12489.9 12508.3 12475.4 12506.5 12489.38 12488.91 12487.828571 12487.680000 12492.182143 12498.436 0.712039 4.169586 4.180431 0.000057 0.000334 0.000335 0.2 0.7 -1.8 -5.0 -8.0 5.0 6.266667 -8.964286 0.52 0.99 2.220000 2.071429 -8.536 -1.551429 0.3 0.00008 7.0 0.000064 6.3 0.000665 12449.3 12514.3 12527.8 12432.0 12491.366667 12550.733333 12587.166667 12646.533333 12454.933333 12395.566667 12359.133333 24.4 37.9 -40.6 -57.9 16.6 -14.5 18.4 0.0 7.8 12490.6 12484.9 12488.1 12490.1 0 0 0 0 0 neutral INNEN 1 GRUEN

UPDATED
This is a straightforward example with a Quandl dataset:
import quandl
df = quandl.get("WIKI/GOOGL")
columns = ["High", 'Low', 'Close']
def operations(row, columns):
    df1 = row[columns[0]] + row[columns[1]] + row[columns[2]]
    df2 = -row[columns[1]] - row[columns[2]] + row[columns[0]]
    return df1, df2
df["function1"], df["function2"] = zip(*df.apply(lambda row: operations(row, columns), axis=1))
df[["High","Low","Close","function1", "function2"]].head(5)
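
The same row-wise pattern can be tried without downloading any data. The sketch below uses the a, b, c columns from the question on a made-up two-row frame, and swaps the zip(*...) unpacking for the result_type="expand" option of DataFrame.apply, which spreads the returned tuple across columns. The function name sum_and_diff is hypothetical.

```python
import pandas as pd

def sum_and_diff(row):
    """Return y = a + b + c and z = c - b - a for one row."""
    y = row["a"] + row["b"] + row["c"]
    z = row["c"] - row["b"] - row["a"]
    return y, z

# Toy frame with the three relevant columns (values are made up).
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [10, 20]})

# axis=1 passes each row to the function; result_type="expand" turns the
# returned tuple into two columns, which are assigned to y and z in order.
df[["y", "z"]] = df.apply(sum_and_diff, axis=1, result_type="expand")
```

For arithmetic this simple, plain column expressions such as df["a"] + df["b"] + df["c"] are usually much faster than a row-wise apply.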
