I have a DataFrame that looks like this:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 8763 non-null datetime64[ns]
1 Madrid_wind_speed 8763 non-null float64
2 Valencia_wind_deg 8763 non-null object
3 Bilbao_rain_1h 8763 non-null float64
4 Valencia_wind_speed 8763 non-null float64
5 Seville_humidity 8763 non-null float64
6 Madrid_humidity 8763 non-null float64
7 Bilbao_clouds_all 8763 non-null float64
8 Bilbao_wind_speed 8763 non-null float64
9 Seville_clouds_all 8763 non-null float64
10 Bilbao_wind_deg 8763 non-null float64
11 Barcelona_wind_speed 8763 non-null float64
12 Barcelona_wind_deg 8763 non-null float64
13 Madrid_clouds_all 8763 non-null float64
14 Seville_wind_speed 8763 non-null float64
15 Barcelona_rain_1h 8763 non-null float64
16 Seville_pressure 8763 non-null object
17 Seville_rain_1h 8763 non-null float64
18 Bilbao_snow_3h 8763 non-null float64
19 Barcelona_pressure 8763 non-null float64
20 Seville_rain_3h 8763 non-null float64
21 Madrid_rain_1h 8763 non-null float64
22 Barcelona_rain_3h 8763 non-null float64
23 Valencia_snow_3h 8763 non-null float64
24 Madrid_weather_id 8763 non-null float64
25 Barcelona_weather_id 8763 non-null float64
26 Bilbao_pressure 8763 non-null float64
27 Seville_weather_id 8763 non-null float64
28 Valencia_pressure 6695 non-null float64
29 Seville_temp_max 8763 non-null float64
30 Madrid_pressure 8763 non-null float64
31 Valencia_temp_max 8763 non-null float64
32 Valencia_temp 8763 non-null float64
33 Bilbao_weather_id 8763 non-null float64
34 Seville_temp 8763 non-null float64
35 Valencia_humidity 8763 non-null float64
36 Valencia_temp_min 8763 non-null float64
37 Barcelona_temp_max 8763 non-null float64
38 Madrid_temp_max 8763 non-null float64
39 Barcelona_temp 8763 non-null float64
40 Bilbao_temp_min 8763 non-null float64
41 Bilbao_temp 8763 non-null float64
42 Barcelona_temp_min 8763 non-null float64
43 Bilbao_temp_max 8763 non-null float64
44 Seville_temp_min 8763 non-null float64
45 Madrid_temp 8763 non-null float64
46 Madrid_temp_min 8763 non-null float64
47 load_shortfall_3h 8763 non-null float64
dtypes: datetime64[ns](1), float64(45), object(2)
How would I go about extracting the city name from each column name, placing it in a new column, and merging the wind_speed, rain_1h, etc. data into their own respective columns?
Use DataFrame.set_index first to create a MultiIndex from the columns that should not be split, then use str.split with DataFrame.stack on the city-prefixed column names:
df1 = df.set_index(['time', 'load_shortfall_3h'])
df1.columns = df1.columns.str.split('_', n=1, expand=True)
df1 = df1.rename_axis(['city', None], axis=1).stack(0).reset_index()
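A runnable sketch of this reshape on a toy frame (the columns below are small stand-ins for the real weather columns, not your actual data):

```python
import pandas as pd

# Toy stand-in for the weather frame: two id columns plus
# city-prefixed measurement columns.
df = pd.DataFrame({
    'time': pd.to_datetime(['2018-01-01 00:00', '2018-01-01 03:00']),
    'load_shortfall_3h': [10.0, 12.0],
    'Madrid_wind_speed': [1.0, 2.0],
    'Valencia_wind_speed': [3.0, 4.0],
    'Madrid_rain_1h': [0.0, 0.5],
    'Valencia_rain_1h': [0.1, 0.2],
})

# Keep the non-city columns out of the reshape by moving them to the index.
df1 = df.set_index(['time', 'load_shortfall_3h'])
# Split 'Madrid_wind_speed' into ('Madrid', 'wind_speed'); n=1 keeps
# everything after the first underscore together.
df1.columns = df1.columns.str.split('_', n=1, expand=True)
# Stack the city level (level 0) into rows: one row per (time, city),
# with each measurement type as its own column.
df1 = df1.rename_axis(['city', None], axis=1).stack(0).reset_index()
print(df1)
```

Stacking level 0 (rather than the default innermost level) is what puts the city names into a column while leaving wind_speed, rain_1h, etc. as columns.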
Input: df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 2019-01-16 to 2018-08-23 - I want to add this index as the first column for my analysis.
Data columns (total 5 columns):
open 100 non-null float64
high 100 non-null float64
low 100 non-null float64
close 100 non-null float64
volume 100 non-null float64
dtypes: float64(5)
memory usage: 9.7+ KB
df = df.assign(time=df.index)
df.head()
Out[76]:
1. open 2. high 3. low 4. close 5. volume time
2019-01-16 105.2600 106.2550 104.9600 105.3800 29655851 2019-01-16
2019-01-15 102.5100 105.0500 101.8800 105.0100 31587616 2019-01-15
2019-01-14 101.9000 102.8716 101.2600 102.0500 28437079 2019-01-14
2019-01-11 103.1900 103.4400 101.6400 102.8000 28314202 2019-01-11
2019-01-10 103.2200 103.7500 102.3800 103.6000 30067556 2019-01-10
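A minimal sketch of moving a DatetimeIndex into a regular column (toy values, not the real OHLCV data; the reset_index(names=...) form needs pandas >= 1.5):

```python
import pandas as pd

# Toy frame with a DatetimeIndex, mimicking the OHLCV frame above.
idx = pd.to_datetime(['2019-01-16', '2019-01-15', '2019-01-14'])
df = pd.DataFrame({'open': [105.26, 102.51, 101.90],
                   'close': [105.38, 105.01, 102.05]}, index=idx)

# Option 1: copy the index into a new column (appended last), as above.
df_a = df.assign(time=df.index)

# Option 2: move the index out so it becomes the *first* column.
df_b = df.reset_index(names='time')  # pandas >= 1.5
print(df_b.columns.tolist())
```

Option 2 matches the "first column" requirement directly; option 1 keeps the DatetimeIndex in place and just duplicates it as data.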
I have been trying to join/merge two DataFrames, df and df_QA, for a while.
The first data frame:
df_QA:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6878 entries, 0 to 6877
Data columns (total 14 columns):
PROPERTY_CODE 6878 non-null object
ACCOUNT_CODE 6878 non-null object
Jan 6878 non-null float64
Feb 6878 non-null float64
Mar 6878 non-null float64
Apr 6878 non-null float64
May 6878 non-null float64
Jun 6878 non-null float64
Jul 6878 non-null float64
Aug 6878 non-null float64
Sep 6878 non-null float64
Oct 6878 non-null float64
Nov 6878 non-null float64
Dec 6878 non-null float64
dtypes: float64(12), object(2)
memory usage: 752.4+ KB
The 2nd data frame:
df:
df = pd.read_csv(fname, sep="^", usecols=[2,3,5,6,7,8,9,10,11,12,13,14,15,16], converters={'Account': str, 'Entity ID': str}).dropna(subset=['Account'], how='any')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2441 entries, 0 to 2440
Data columns (total 14 columns):
PROPERTY_CODE 2441 non-null object
ACCOUNT_CODE 2441 non-null object
Jan 2441 non-null float64
Feb 2441 non-null float64
Mar 2441 non-null float64
Apr 2441 non-null float64
May 2441 non-null float64
Jun 2441 non-null float64
Jul 2441 non-null float64
Aug 2441 non-null float64
Sep 2441 non-null float64
Oct 2441 non-null float64
Nov 2441 non-null int64
Dec 2441 non-null int64
dtypes: float64(10), int64(2), object(2)
memory usage: 286.1+ KB
I tried:
df_check = pd.merge(df, df_QA, how='inner', on=['PROPERTY_CODE','ACCOUNT_CODE'])
or
df_check = df.merge(df_QA, left_on=['PROPERTY_CODE', 'ACCOUNT_CODE'], right_on=['PROPERTY_CODE', 'ACCOUNT_CODE'], how='inner',sort='True')
returns:
print (df_check)
Empty DataFrame
Columns: [PROPERTY_CODE, ACCOUNT_CODE, Jan_x, Feb_x, Mar_x, Apr_x, May_x, Jun_x, Jul_x, Aug_x, Sep_x, Oct_x, Nov_x, Dec_x, Jan_y, Feb_y, Mar_y, Apr_y, May_y, Jun_y, Jul_y, Aug_y, Sep_y, Oct_y, Nov_y, Dec_y]
Index: []
We are hoping to get a data frame in the following format:
PROPERTY_CODE, ACCOUNT_CODE, Jan_x, Feb_x, Mar_x, Apr_x, May_x, Jun_x, Jul_x, Aug_x, Sep_x, Oct_x, Nov_x, Dec_x, Jan_y, Feb_y, Mar_y, Apr_y, May_y, Jun_y, Jul_y, Aug_y, Sep_y, Oct_y, Nov_y, Dec_y
Any thoughts? Thank you!
When I try an outer join:
df_check = pd.merge(df, df_QA, how='outer', on=['PROPERTY_CODE','ACCOUNT_CODE'])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar Apr May Jun Jul Aug Sep \
0 05099 MR01030000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 05099 MR01060000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 05099 MR01060005 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 05099 MR01200000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
It returns NaN. I checked the PROPERTY_CODE and ACCOUNT_CODE values; they look exactly the same to me, though.
print (df_QA.loc[df_QA['PROPERTY_CODE'] == "05099"])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar
604 05099 MR01030000 -1000 -10000.75 -10000.09
605 05099 MR01060000 100000.05 100.35 -1003128.17
print (df.loc[df['PROPERTY_CODE'] == "05099"])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar
0 05099 MR01030000 -1.000000e+09 -100000.75 -100000.09
1 05099 MR01060000 1.000000e+05 1100.35 -1000000.17
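When keys that look identical refuse to merge, the usual culprits are hidden whitespace or dtype mismatches in the key columns. A hedged sketch of that diagnosis on toy stand-ins for df and df_QA (the values are made up):

```python
import pandas as pd

# Toy stand-ins: the keys print the same, but one side carries a
# trailing space, so an inner merge finds no matches.
df = pd.DataFrame({'PROPERTY_CODE': ['05099 '],   # note trailing space
                   'ACCOUNT_CODE': ['MR01030000'],
                   'Jan': [-1.0e9]})
df_QA = pd.DataFrame({'PROPERTY_CODE': ['05099'],
                      'ACCOUNT_CODE': ['MR01030000'],
                      'Jan': [-1000.0]})

keys = ['PROPERTY_CODE', 'ACCOUNT_CODE']

# The merge on the raw keys comes back empty:
assert pd.merge(df, df_QA, on=keys, how='inner').empty

# Normalize the keys on both sides before merging:
for frame in (df, df_QA):
    for k in keys:
        frame[k] = frame[k].astype(str).str.strip()

df_check = pd.merge(df, df_QA, on=keys, how='inner')
print(df_check)
```

After stripping, the row matches and the month columns come through with the usual _x/_y suffixes. Comparing `df[keys].applymap(repr)` on both frames is another quick way to make invisible characters visible.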
I need to apply a function on a df to create multiple new columns. As input to my function I need either (i) a row or (ii) multiple columns.
def divideAndMultiply(x, y):
    return x / y, x * y

df["e"], df["f"] = zip(*df.a.apply(lambda val: divideAndMultiply(val, 2)))
(https://stackoverflow.com/a/36600318/1831695)
This works for creating multiple new columns, but not with multiple inputs (i.e. multiple columns from that df). Or did I miss something?
Example what I want to do:
My DF has several columns. Three of them (a, b, c) are relevant for calculating two new columns (y, z), where y = a + b + c and z = c - b - a.
I know this is an easy calculation where I do not need a function, but let's assume we would need one.
Instead of writing and applying two functions, I would like to have a single function that returns both values and accepts all three values (or, even better, a whole row) for the calculation.
This example:
df["y"], df["z"] = zip(*df.a.apply(lambda val: divideAndMultiply(val,2)))
works only with one column value (val) and one other scalar value (2 in this case).
I would need something like this:
df["y"], df["z"] = zip(*df.a.apply(lambda val: divideAndMultiply(df['a'],df['b'],df['c'])))
(And yes, I know that val is assigned to df.a )
Update 2
This is how my df looks:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4655 entries, 0 to 4654
Data columns (total 71 columns):
Open 4655 non-null float64
Close 4655 non-null float64
High 4655 non-null float64
Low 4655 non-null float64
DateTime 4655 non-null datetime64[ns]
Date 4655 non-null datetime64[ns]
T_CLOSE 4655 non-null float64
T_HIGH 4655 non-null float64
T_LOW 4655 non-null float64
T_OPEN 4655 non-null float64
MVA5 4651 non-null float64
MVA10 4646 non-null float64
MVA14 4642 non-null float64
MVA15 4641 non-null float64
MVA28 4628 non-null float64
MVA50 4606 non-null float64
STD5 4651 non-null float64
STD10 4646 non-null float64
STD15 4641 non-null float64
CV_5 4651 non-null float64
CV_10 4646 non-null float64
CV_15 4641 non-null float64
DIFF_VP_CLOSE 4654 non-null float64
DIFF_VP_HIGH 4654 non-null float64
DIFF_VP_OPEN 4654 non-null float64
DIFF_VP_LOW 4654 non-null float64
AVG_STEIG_5 4650 non-null float64
AVG_STEIG_10 4645 non-null float64
AVG_STEIG_15 4640 non-null float64
AVG_STEIG_28 4627 non-null float64
AVG_5_DIFF 4651 non-null float64
AVG_10_DIFF 4646 non-null float64
AVG_15_DIFF 4641 non-null float64
AVG_14_DIFF 4642 non-null float64
AVG_50_DIFF 4606 non-null float64
AD_5_14 4642 non-null float64
Momentum_4 4651 non-null float64
ROC_4 4652 non-null float64
Momentum_8 4647 non-null float64
ROC_8 4648 non-null float64
Momentum_12 4643 non-null float64
ROC_12 4644 non-null float64
VT_OPEN 4598 non-null float64
VT_CLOSE 4598 non-null float64
VT_HIGH 4598 non-null float64
VT_LOW 4598 non-null float64
PP_VT 4598 non-null float64
R1_VT 4598 non-null float64
R2_VT 4598 non-null float64
R3_VT 4598 non-null float64
S1_VT 4598 non-null float64
S2_VT 4598 non-null float64
S3_VT 4598 non-null float64
DIFF_VT_CLOSE 4598 non-null float64
DIFF_VT_HIGH 4598 non-null float64
DIFF_VT_OPEN 4598 non-null float64
DIFF_VT_LOW 4598 non-null float64
DIFF_T_OPEN 4655 non-null float64
DIFF_T_LOW 4655 non-null float64
DIFF_T_HIGH 4655 non-null float64
DIFF_T_CLOSE 4655 non-null float64
DIFF_VTCLOSE_TOPEN 4598 non-null float64
VP_HIGH 4654 non-null float64
VP_LOW 4654 non-null float64
VP_OPEN 4654 non-null float64
VP_CLOSE 4654 non-null float64
regel_r1 4655 non-null int64
regel_r2 4655 non-null int64
regel_r3 4655 non-null int64
regeln 4655 non-null int64
vormittag_flag 4655 non-null int64
dtypes: datetime64[ns](2), float64(64), int64(5)
memory usage: 2.6 MB
None
Open Close High Low DateTime Date T_CLOSE T_HIGH T_LOW T_OPEN MVA5 MVA10 MVA14 MVA15 MVA28 MVA50 STD5 STD10 STD15 CV_5 CV_10 CV_15 DIFF_VP_CLOSE DIFF_VP_HIGH DIFF_VP_OPEN DIFF_VP_LOW AVG_STEIG_5 AVG_STEIG_10 AVG_STEIG_15 AVG_STEIG_28 AVG_5_DIFF AVG_10_DIFF AVG_15_DIFF AVG_14_DIFF AVG_50_DIFF AD_5_14 Momentum_4 ROC_4 Momentum_8 ROC_8 Momentum_12 ROC_12 VT_OPEN VT_CLOSE VT_HIGH VT_LOW PP_VT R1_VT R2_VT R3_VT S1_VT S2_VT S3_VT DIFF_VT_CLOSE DIFF_VT_HIGH DIFF_VT_OPEN DIFF_VT_LOW DIFF_T_OPEN DIFF_T_LOW DIFF_T_HIGH DIFF_T_CLOSE DIFF_VTCLOSE_TOPEN VP_HIGH VP_LOW VP_OPEN VP_CLOSE regel_r1 regel_r2 regel_r3 regeln vormittag_flag T_DIRC T_WECHSELC T_NUM_INNEN T_CANDLEC
4653 12488.1 12490.1 12490.6 12484.9 2017-05-03 14:00:00 2017-05-03 12490.1 12508.3 12475.4 12506.5 12490.18 12488.41 12487.521429 12487.053333 12493.078571 12498.118 2.178761 4.334218 4.515033 0.000174 0.000347 0.000362 -1.7 2.5 -1.0 -2.7 -9.6 8.5 4.866667 -8.178571 -0.08 1.69 3.046667 2.578571 -8.018 -2.658571 -3.8 0.00004 8.7 0.000577 -0.3 0.000521 12449.3 12514.3 12527.8 12432.0 12491.366667 12550.733333 12587.166667 12646.533333 12454.933333 12395.566667 12359.133333 24.2 37.7 -40.8 -58.1 16.4 -14.7 18.2 0.0 7.8 12492.6 12487.4 12489.1 12488.4 0 0 0 0 0 neutral INNEN 1 GRUEN
4654 12489.9 12489.9 12489.9 12489.6 2017-05-03 14:15:00 2017-05-03 12489.9 12508.3 12475.4 12506.5 12489.38 12488.91 12487.828571 12487.680000 12492.182143 12498.436 0.712039 4.169586 4.180431 0.000057 0.000334 0.000335 0.2 0.7 -1.8 -5.0 -8.0 5.0 6.266667 -8.964286 0.52 0.99 2.220000 2.071429 -8.536 -1.551429 0.3 0.00008 7.0 0.000064 6.3 0.000665 12449.3 12514.3 12527.8 12432.0 12491.366667 12550.733333 12587.166667 12646.533333 12454.933333 12395.566667 12359.133333 24.4 37.9 -40.6 -57.9 16.6 -14.5 18.4 0.0 7.8 12490.6 12484.9 12488.1 12490.1 0 0 0 0 0 neutral INNEN 1 GRUEN
UPDATED
This is a straightforward example with a Quandl dataset,
import quandl
df = quandl.get("WIKI/GOOGL")
columns = ["High", 'Low', 'Close']
def operations(row, columns):
    # sum of the three columns, and first column minus the other two
    df1 = row[columns[0]] + row[columns[1]] + row[columns[2]]
    df2 = -row[columns[1]] - row[columns[2]] + row[columns[0]]
    return df1, df2
df["function1"], df["function2"] = zip(*df.apply(lambda row: operations(row, columns), axis=1))
df[["High","Low","Close","function1", "function2"]].head(5)
I have been trying to replace the NaN values in my DataFrame with the last valid value, but this does not seem to do the job. Just wondering if anyone else has had this same issue, or what could be causing this problem.
In [16]: ABCW.info()
Out[16]:<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 692 entries, 2014-10-22 10:30:00 to 2015-05-21 16:00:00
Data columns (total 6 columns):
Price 692 non-null float64
Volume 692 non-null float64
Symbol_Num 692 non-null object
Actual Price 577 non-null float64
Market Cap Rank 577 non-null float64
Market Cap 577 non-null float64
dtypes: float64(5), object(1)
memory usage: 37.8+ KB
In [18]: ABCW.fillna(method = 'pad')
In [19]: ABCW.info()
Out [19]: <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 692 entries, 2014-10-22 10:30:00 to 2015-05-21 16:00:00
Data columns (total 6 columns):
Price 692 non-null float64
Volume 692 non-null float64
Symbol_Num 692 non-null object
Actual Price 577 non-null float64
Market Cap Rank 577 non-null float64
Market Cap 577 non-null float64
dtypes: float64(5), object(1)
memory usage: 37.8+ KB
There is no change in the number of non-null values, and all the preexisting NaN values are still in the DataFrame.
You are using the 'pad' method, which is basically a forward fill. See the examples at http://pandas.pydata.org/pandas-docs/stable/missing_data.html
I am reproducing the relevant example here,
In [33]: df
Out[33]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e 0.119209 -1.044236 -0.861849
f -2.104569 -0.494929 1.071804
h NaN -0.706771 -1.039575
In [34]: df.fillna(method='pad')
Out[34]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e 0.119209 -1.044236 -0.861849
f -2.104569 -0.494929 1.071804
h -2.104569 -0.706771 -1.039575
This method will not do a backfill, so you should also consider a backfill if you want all your NaNs to go away. Also, inplace=False by default, so you need to assign the result of the operation back to ABCW, like so:
ABCW = ABCW.fillna(method = 'pad')
ABCW = ABCW.fillna(method = 'bfill')
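A minimal runnable sketch of the pad-then-backfill sequence; note that newer pandas versions spell these DataFrame.ffill and DataFrame.bfill, and deprecate fillna(method=...):

```python
import numpy as np
import pandas as pd

# Toy frame with a leading NaN, which a forward fill alone cannot remove.
ABCW = pd.DataFrame({'one': [np.nan, 1.0, np.nan, 3.0]})

# ffill/bfill return a new frame by default (inplace=False),
# so the result must be assigned back.
ABCW = ABCW.ffill()   # forward fill: [NaN, 1.0, 1.0, 3.0]
ABCW = ABCW.bfill()   # backfill removes the leading NaN
print(ABCW['one'].tolist())  # [1.0, 1.0, 1.0, 3.0]
```

Calling fillna without assigning the result back, as in the question, leaves the original frame untouched, which is exactly why the non-null counts did not change.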