I have a DataFrame that looks like this:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 8763 non-null datetime64[ns]
1 Madrid_wind_speed 8763 non-null float64
2 Valencia_wind_deg 8763 non-null object
3 Bilbao_rain_1h 8763 non-null float64
4 Valencia_wind_speed 8763 non-null float64
5 Seville_humidity 8763 non-null float64
6 Madrid_humidity 8763 non-null float64
7 Bilbao_clouds_all 8763 non-null float64
8 Bilbao_wind_speed 8763 non-null float64
9 Seville_clouds_all 8763 non-null float64
10 Bilbao_wind_deg 8763 non-null float64
11 Barcelona_wind_speed 8763 non-null float64
12 Barcelona_wind_deg 8763 non-null float64
13 Madrid_clouds_all 8763 non-null float64
14 Seville_wind_speed 8763 non-null float64
15 Barcelona_rain_1h 8763 non-null float64
16 Seville_pressure 8763 non-null object
17 Seville_rain_1h 8763 non-null float64
18 Bilbao_snow_3h 8763 non-null float64
19 Barcelona_pressure 8763 non-null float64
20 Seville_rain_3h 8763 non-null float64
21 Madrid_rain_1h 8763 non-null float64
22 Barcelona_rain_3h 8763 non-null float64
23 Valencia_snow_3h 8763 non-null float64
24 Madrid_weather_id 8763 non-null float64
25 Barcelona_weather_id 8763 non-null float64
26 Bilbao_pressure 8763 non-null float64
27 Seville_weather_id 8763 non-null float64
28 Valencia_pressure 6695 non-null float64
29 Seville_temp_max 8763 non-null float64
30 Madrid_pressure 8763 non-null float64
31 Valencia_temp_max 8763 non-null float64
32 Valencia_temp 8763 non-null float64
33 Bilbao_weather_id 8763 non-null float64
34 Seville_temp 8763 non-null float64
35 Valencia_humidity 8763 non-null float64
36 Valencia_temp_min 8763 non-null float64
37 Barcelona_temp_max 8763 non-null float64
38 Madrid_temp_max 8763 non-null float64
39 Barcelona_temp 8763 non-null float64
40 Bilbao_temp_min 8763 non-null float64
41 Bilbao_temp 8763 non-null float64
42 Barcelona_temp_min 8763 non-null float64
43 Bilbao_temp_max 8763 non-null float64
44 Seville_temp_min 8763 non-null float64
45 Madrid_temp 8763 non-null float64
46 Madrid_temp_min 8763 non-null float64
47 load_shortfall_3h 8763 non-null float64
dtypes: datetime64[ns](1), float64(45), object(2)
How would I go about extracting the city name from each column name into a new column, and merging the wind_speed, rain_1h, etc. data into their own respective columns?
First use DataFrame.set_index to move the columns whose names should not be split into a MultiIndex, then split the remaining column names into city and measurement with str.split and reshape with DataFrame.stack:
df1 = df.set_index(['time','load_shortfall_3h'])
df1.columns = df1.columns.str.split('_', n=1, expand=True)
# stack level 0 (the city) into rows so each measurement keeps its own column
df1 = df1.rename_axis(['city', None], axis=1).stack(0).reset_index()
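As an alternative route to the same layout, here is a minimal sketch that uses melt and pivot_table instead of set_index and stack. The sample values and the helper column names "column", "city" and "type" are made up for illustration; only the naming pattern City_measurement comes from the question.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented, only the column
# naming pattern (City_measurement) follows the question.
df = pd.DataFrame({
    "time": pd.to_datetime(["2015-01-01 03:00", "2015-01-01 06:00"]),
    "Madrid_wind_speed": [0.7, 0.3],
    "Bilbao_wind_speed": [1.0, 1.0],
    "Madrid_rain_1h": [0.0, 0.0],
    "Bilbao_rain_1h": [0.0, 0.1],
    "load_shortfall_3h": [6715.7, 4171.7],
})

# Long format: one row per (time, original column name).
long = df.melt(id_vars=["time", "load_shortfall_3h"], var_name="column")

# Split "Madrid_wind_speed" into "Madrid" and "wind_speed" at the first underscore.
long[["city", "type"]] = long["column"].str.split("_", n=1, expand=True)

# Back to wide format: one row per (time, city), one column per measurement.
tidy = long.pivot_table(
    index=["time", "load_shortfall_3h", "city"],
    columns="type",
    values="value",
    aggfunc="first",
).reset_index()
tidy.columns.name = None

print(tidy)
```

The result has the columns time, load_shortfall_3h, city, rain_1h and wind_speed, with one row per time and city.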
Related
I import a dataframe which has a column 'Goals' that accumulates the previous results, like so:
print (df[df['name']=='Player Name']['Goals'])
152 1.0
828 2.0
1591 3.0
I know for a fact that the player scored only one goal per game, so the column should be like:
152 1.0
828 1.0
1591 1.0
By the way the same logic applies to all other scout columns:
...
FF 322 non-null float64
FS 568 non-null float64
Goals 80 non-null float64
A 63 non-null float64
PI 834 non-null float64
SG 140 non-null float64
DD 46 non-null float64
DS 611 non-null float64
FC 602 non-null float64
GC 3 non-null float64
GS 45 non-null float64
FD 231 non-null float64
CA 190 non-null float64
FT 34 non-null float64
I 112 non-null float64
PP 4 non-null float64
CV 9 non-null float64
...
QUESTION
What is the best way to correct this logic and apply this subtraction to the subset of columns above?
Edit:
df['G'] = df['G'].diff()
returns
Name: G, dtype: float64
152 NaN
828 NaN
1562 NaN
Writing this blind, since you have not shown the full DataFrame:
for col in ['Goals', 'FF', 'FS', ...]:
    tmp = df.groupby('name')[col].diff().fillna(1)
    df[col] = tmp
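One thing to watch: diff() returns NaN whenever the previous row is NaN, which would explain the all-NaN result shown in the Edit if the scout columns are mostly empty. A hedged sketch that works around this, assuming each column is a per-player running total with NaN for games without an event (the sample frame and the helper name decumulate are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: cumulative goals per player, NaN where nothing was recorded.
df = pd.DataFrame({
    "name": ["A", "B", "A", "A", "B", "A"],
    "Goals": [1.0, np.nan, np.nan, 2.0, 1.0, 3.0],
})

def decumulate(frame, cols, key="name"):
    """Turn per-player running totals into per-row values, keeping NaN rows as NaN."""
    out = frame.copy()
    for col in cols:
        # Carry the last known total forward within each player; 0 before the first value.
        filled = out.groupby(key)[col].ffill().fillna(0)
        # Difference to the previous total; the first row of each player keeps its own total.
        per_row = filled.groupby(out[key]).diff().fillna(filled)
        # Restore NaN where the original had no value.
        out[col] = per_row.where(frame[col].notna())
    return out

result = decumulate(df, ["Goals"])
print(result)
```

The same call can be given the full list of scout columns in place of ["Goals"].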
Input: df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 2019-01-16 to 2018-08-23 - I want to add this as my first column for the analysis.
Data columns (total 5 columns):
open 100 non-null float64
high 100 non-null float64
low 100 non-null float64
close 100 non-null float64
volume 100 non-null float64
dtypes: float64(5)
memory usage: 9.7+ KB
df = df.assign(time=date)
df.head()
Out[76]:
1. open 2. high 3. low 4. close 5. volume time
2019-01-16 105.2600 106.2550 104.9600 105.3800 29655851 2019-01-16
2019-01-15 102.5100 105.0500 101.8800 105.0100 31587616 2019-01-15
2019-01-14 101.9000 102.8716 101.2600 102.0500 28437079 2019-01-14
2019-01-11 103.1900 103.4400 101.6400 102.8000 28314202 2019-01-11
2019-01-10 103.2200 103.7500 102.3800 103.6000 30067556 2019-01-10
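Note that assign appends the new column at the end, as the output above shows. A minimal sketch of placing the index values first instead, assuming the column should be called time and the index holds date strings (the two sample rows reuse values from the output above):

```python
import pandas as pd

# Two rows taken from the output above; the index holds the dates as strings.
df = pd.DataFrame(
    {"1. open": [105.26, 102.51], "4. close": [105.38, 105.01]},
    index=["2019-01-16", "2019-01-15"],
)

# DataFrame.insert places the new column at a given position (0 = first)
# and modifies the frame in place.
df.insert(0, "time", pd.to_datetime(df.index))

print(df.head())
```

If the dates are no longer needed as the index, df.rename_axis("time").reset_index() also produces a frame with time as the first column.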
I have been trying to join/merge two DataFrames, "df" and "df_QA", for a while.
The first data frame:
df_QA:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6878 entries, 0 to 6877
Data columns (total 14 columns):
PROPERTY_CODE 6878 non-null object
ACCOUNT_CODE 6878 non-null object
Jan 6878 non-null float64
Feb 6878 non-null float64
Mar 6878 non-null float64
Apr 6878 non-null float64
May 6878 non-null float64
Jun 6878 non-null float64
Jul 6878 non-null float64
Aug 6878 non-null float64
Sep 6878 non-null float64
Oct 6878 non-null float64
Nov 6878 non-null float64
Dec 6878 non-null float64
dtypes: float64(12), object(2)
memory usage: 752.4+ KB
The 2nd data frame:
df:
df = pd.read_csv(fname, sep="^",usecols=[2,3,5,6,7,8,9,10,11,12,13,14,15,16],converters={'Account': str, 'Entity ID': lambda x: str(x)}).dropna(subset=['Account'],how='any')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2441 entries, 0 to 2440
Data columns (total 14 columns):
PROPERTY_CODE 2441 non-null object
ACCOUNT_CODE 2441 non-null object
Jan 2441 non-null float64
Feb 2441 non-null float64
Mar 2441 non-null float64
Apr 2441 non-null float64
May 2441 non-null float64
Jun 2441 non-null float64
Jul 2441 non-null float64
Aug 2441 non-null float64
Sep 2441 non-null float64
Oct 2441 non-null float64
Nov 2441 non-null int64
Dec 2441 non-null int64
dtypes: float64(10), int64(2), object(2)
memory usage: 286.1+ KB
I tried:
df_check = pd.merge(df, df_QA, how='inner', on=['PROPERTY_CODE','ACCOUNT_CODE'])
or
df_check = df.merge(df_QA, left_on=['PROPERTY_CODE', 'ACCOUNT_CODE'], right_on=['PROPERTY_CODE', 'ACCOUNT_CODE'], how='inner', sort=True)
returns:
print (df_check)
Empty DataFrame
Columns: [PROPERTY_CODE, ACCOUNT_CODE, Jan_x, Feb_x, Mar_x, Apr_x, May_x, Jun_x, Jul_x, Aug_x, Sep_x, Oct_x, Nov_x, Dec_x, Jan_y, Feb_y, Mar_y, Apr_y, May_y, Jun_y, Jul_y, Aug_y, Sep_y, Oct_y, Nov_y, Dec_y]
Index: []
We are hoping to get a data frame in the following format:
PROPERTY_CODE, ACCOUNT_CODE, Jan_x, Feb_x, Mar_x, Apr_x, May_x, Jun_x, Jul_x, Aug_x, Sep_x, Oct_x, Nov_x, Dec_x, Jan_y, Feb_y, Mar_y, Apr_y, May_y, Jun_y, Jul_y, Aug_y, Sep_y, Oct_y, Nov_y, Dec_y
Any thoughts? Thank you!
When I try outer:
df_check = pd.merge(df, df_QA, how='outer', on=['PROPERTY_CODE','ACCOUNT_CODE'])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar Apr May Jun Jul Aug Sep \
0 05099 MR01030000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 05099 MR01060000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 05099 MR01060005 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 05099 MR01200000 NaN NaN NaN NaN NaN NaN NaN NaN
It returns NaN. I checked PROPERTY_CODE and ACCOUNT_CODE, and they look exactly the same to me.
print (df_QA.loc[df_QA['PROPERTY_CODE'] == "05099"])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar
604 05099 MR01030000 -1000 -10000.75 -10000.09
605 05099 MR01060000 100000.05 100.35 -1003128.17
print (df.loc[df['PROPERTY_CODE'] == "05099"])
PROPERTY_CODE ACCOUNT_CODE Jan Feb Mar
0 05099 MR01030000 -1.000000e+09 -100000.75 -100000.09
1 05099 MR01060000 1.000000e+05 1100.35 -1000000.17
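An empty inner merge on keys that print identically is often caused by invisible differences such as leading or trailing whitespace. Whether that is the cause here is an assumption; the sketch below only shows how one might check for it. The sample frames and the helper name clean_keys are hypothetical.

```python
import pandas as pd

keys = ["PROPERTY_CODE", "ACCOUNT_CODE"]

# Hypothetical frames: the left keys carry stray spaces that do not show when printed.
left = pd.DataFrame({
    "PROPERTY_CODE": ["05099 ", "05099"],
    "ACCOUNT_CODE": ["MR01030000", " MR01060000"],
    "Jan": [1.0, 2.0],
})
right = pd.DataFrame({
    "PROPERTY_CODE": ["05099", "05099"],
    "ACCOUNT_CODE": ["MR01030000", "MR01060000"],
    "Jan": [10.0, 20.0],
})

def clean_keys(frame, keys):
    """Return a copy whose key columns are strings without surrounding whitespace."""
    out = frame.copy()
    for key in keys:
        out[key] = out[key].astype(str).str.strip()
    return out

# indicator=True adds a _merge column telling which side each row came from.
before = left.merge(right, on=keys, how="outer", indicator=True)
print(before["_merge"].value_counts())

# repr() makes hidden characters in the keys visible.
print(left["PROPERTY_CODE"].map(repr).unique())

after = clean_keys(left, keys).merge(clean_keys(right, keys), on=keys, how="inner")
print(after)
```

Rows marked left_only or right_only in the _merge column are the keys that failed to match; once the keys are cleaned, the inner merge returns both rows with Jan_x and Jan_y columns.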
I need to apply a function on a df to create multiple new columns. As input, my function would need either (i) a row or (ii) multiple columns.
def divideAndMultiply(x,y):
    return x/y, x*y
df["e"], df["f"] = zip(*df.a.apply(lambda val: divideAndMultiply(val,2)))
(https://stackoverflow.com/a/36600318/1831695)
This works to create multiple new columns, but not with multiple inputs (= columns from that df). Or did I miss something?
Example of what I want to do:
My DF has several columns. 3 (a, b, c) of them are relevant to calculate two new columns (y, z) where, y = a + b + c and z = c - b - a
I know this is an easy calculation where I do not need a function, but let's assume we would need one.
Instead of writing and applying 2 functions, I would like to have only one function that returns both values and accepts all three values (or even better: a row) for the calculation.
This example:
df["y"], df["z"] = zip(*df.a.apply(lambda val: divideAndMultiply(val,2)))
works only when using one column's data item (val) and one other value (2 in this case).
I would need something like that:
df["y"], df["z"] = zip(*df.a.apply(lambda val: divideAndMultiply(df['a'],df['b'],df['c'])))
(And yes, I know that val is assigned to df.a )
Update 2
This is what my df looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4655 entries, 0 to 4654
Data columns (total 71 columns):
Open 4655 non-null float64
Close 4655 non-null float64
High 4655 non-null float64
Low 4655 non-null float64
DateTime 4655 non-null datetime64[ns]
Date 4655 non-null datetime64[ns]
T_CLOSE 4655 non-null float64
T_HIGH 4655 non-null float64
T_LOW 4655 non-null float64
T_OPEN 4655 non-null float64
MVA5 4651 non-null float64
MVA10 4646 non-null float64
MVA14 4642 non-null float64
MVA15 4641 non-null float64
MVA28 4628 non-null float64
MVA50 4606 non-null float64
STD5 4651 non-null float64
STD10 4646 non-null float64
STD15 4641 non-null float64
CV_5 4651 non-null float64
CV_10 4646 non-null float64
CV_15 4641 non-null float64
DIFF_VP_CLOSE 4654 non-null float64
DIFF_VP_HIGH 4654 non-null float64
DIFF_VP_OPEN 4654 non-null float64
DIFF_VP_LOW 4654 non-null float64
AVG_STEIG_5 4650 non-null float64
AVG_STEIG_10 4645 non-null float64
AVG_STEIG_15 4640 non-null float64
AVG_STEIG_28 4627 non-null float64
AVG_5_DIFF 4651 non-null float64
AVG_10_DIFF 4646 non-null float64
AVG_15_DIFF 4641 non-null float64
AVG_14_DIFF 4642 non-null float64
AVG_50_DIFF 4606 non-null float64
AD_5_14 4642 non-null float64
Momentum_4 4651 non-null float64
ROC_4 4652 non-null float64
Momentum_8 4647 non-null float64
ROC_8 4648 non-null float64
Momentum_12 4643 non-null float64
ROC_12 4644 non-null float64
VT_OPEN 4598 non-null float64
VT_CLOSE 4598 non-null float64
VT_HIGH 4598 non-null float64
VT_LOW 4598 non-null float64
PP_VT 4598 non-null float64
R1_VT 4598 non-null float64
R2_VT 4598 non-null float64
R3_VT 4598 non-null float64
S1_VT 4598 non-null float64
S2_VT 4598 non-null float64
S3_VT 4598 non-null float64
DIFF_VT_CLOSE 4598 non-null float64
DIFF_VT_HIGH 4598 non-null float64
DIFF_VT_OPEN 4598 non-null float64
DIFF_VT_LOW 4598 non-null float64
DIFF_T_OPEN 4655 non-null float64
DIFF_T_LOW 4655 non-null float64
DIFF_T_HIGH 4655 non-null float64
DIFF_T_CLOSE 4655 non-null float64
DIFF_VTCLOSE_TOPEN 4598 non-null float64
VP_HIGH 4654 non-null float64
VP_LOW 4654 non-null float64
VP_OPEN 4654 non-null float64
VP_CLOSE 4654 non-null float64
regel_r1 4655 non-null int64
regel_r2 4655 non-null int64
regel_r3 4655 non-null int64
regeln 4655 non-null int64
vormittag_flag 4655 non-null int64
dtypes: datetime64[ns](2), float64(64), int64(5)
memory usage: 2.6 MB
None
Open Close High Low DateTime Date T_CLOSE T_HIGH T_LOW T_OPEN MVA5 MVA10 MVA14 MVA15 MVA28 MVA50 STD5 STD10 STD15 CV_5 CV_10 CV_15 DIFF_VP_CLOSE DIFF_VP_HIGH DIFF_VP_OPEN DIFF_VP_LOW AVG_STEIG_5 AVG_STEIG_10 AVG_STEIG_15 AVG_STEIG_28 AVG_5_DIFF AVG_10_DIFF AVG_15_DIFF AVG_14_DIFF AVG_50_DIFF AD_5_14 Momentum_4 ROC_4 Momentum_8 ROC_8 Momentum_12 ROC_12 VT_OPEN VT_CLOSE VT_HIGH VT_LOW PP_VT R1_VT R2_VT R3_VT S1_VT S2_VT S3_VT DIFF_VT_CLOSE DIFF_VT_HIGH DIFF_VT_OPEN DIFF_VT_LOW DIFF_T_OPEN DIFF_T_LOW DIFF_T_HIGH DIFF_T_CLOSE DIFF_VTCLOSE_TOPEN VP_HIGH VP_LOW VP_OPEN VP_CLOSE regel_r1 regel_r2 regel_r3 regeln vormittag_flag T_DIRC T_WECHSELC T_NUM_INNEN T_CANDLEC
4653 12488.1 12490.1 12490.6 12484.9 2017-05-03 14:00:00 2017-05-03 12490.1 12508.3 12475.4 12506.5 12490.18 12488.41 12487.521429 12487.053333 12493.078571 12498.118 2.178761 4.334218 4.515033 0.000174 0.000347 0.000362 -1.7 2.5 -1.0 -2.7 -9.6 8.5 4.866667 -8.178571 -0.08 1.69 3.046667 2.578571 -8.018 -2.658571 -3.8 0.00004 8.7 0.000577 -0.3 0.000521 12449.3 12514.3 12527.8 12432.0 12491.366667 12550.733333 12587.166667 12646.533333 12454.933333 12395.566667 12359.133333 24.2 37.7 -40.8 -58.1 16.4 -14.7 18.2 0.0 7.8 12492.6 12487.4 12489.1 12488.4 0 0 0 0 0 neutral INNEN 1 GRUEN
4654 12489.9 12489.9 12489.9 12489.6 2017-05-03 14:15:00 2017-05-03 12489.9 12508.3 12475.4 12506.5 12489.38 12488.91 12487.828571 12487.680000 12492.182143 12498.436 0.712039 4.169586 4.180431 0.000057 0.000334 0.000335 0.2 0.7 -1.8 -5.0 -8.0 5.0 6.266667 -8.964286 0.52 0.99 2.220000 2.071429 -8.536 -1.551429 0.3 0.00008 7.0 0.000064 6.3 0.000665 12449.3 12514.3 12527.8 12432.0 12491.366667 12550.733333 12587.166667 12646.533333 12454.933333 12395.566667 12359.133333 24.4 37.9 -40.6 -57.9 16.6 -14.5 18.4 0.0 7.8 12490.6 12484.9 12488.1 12490.1 0 0 0 0 0 neutral INNEN 1 GRUEN
UPDATED
This is a straightforward example with a Quandl dataset:
import quandl
df = quandl.get("WIKI/GOOGL")
columns = ["High", 'Low', 'Close']
def operations(row, columns):
    df1 = row[columns[0]] + row[columns[1]] + row[columns[2]]
    df2 = -row[columns[1]] - row[columns[2]] + row[columns[0]]
    return df1, df2
df["function1"], df["function2"] = zip(*df.apply(lambda row: operations(row, columns), axis=1))
df[["High","Low","Close","function1", "function2"]].head(5)
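The same pattern applied to the a, b, c example from the question, as a minimal sketch. It uses result_type="expand" rather than zip to spread the returned tuple over two columns; the sample values and the function name sum_and_diff are made up for illustration.

```python
import pandas as pd

# Illustrative frame with the three relevant columns from the question.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [10, 20]})

def sum_and_diff(row):
    """Take a whole row and return both derived values at once."""
    y = row["a"] + row["b"] + row["c"]
    z = row["c"] - row["b"] - row["a"]
    return y, z

# axis=1 passes each row to the function; result_type="expand" turns the
# returned tuple into separate columns, which are assigned to y and z.
df[["y", "z"]] = df.apply(sum_and_diff, axis=1, result_type="expand")

print(df)
```

Row-wise apply calls the function once per row, so for simple arithmetic like this the direct column expressions are faster; the function form pays off when the logic cannot be written as column operations.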
I have a pandas panel of:
Items axis: X1 to X3
Major_axis axis: (1973-09-30 00:00:00, 1989-03-31 00:00:00) to (2015-07-31 00:00:00, 2015-08-21 00:00:00)
Minor_axis axis: A to C
and I would like to convert it to a DataFrame with multilevel columns of (minor, item). The multilevel columns would look as follows:
mi_tuples = [ ('A','X1'), ('A','X2'), ('A','X3'), ('B','X1'), ('B','X2'), ('B','X3'), ('C','X1'), ('C','X2'), ('C','X3') ]
mi_columns = pd.MultiIndex.from_tuples(mi_tuples, names = ['minor', 'items'])
Any thoughts?
Thanks!
I think a combination of to_frame, unstack, and swaplevel can get you there. See below with some example data.
In [134]: pnl = pd.io.data.DataReader(['GOOG', 'AAPL'], 'yahoo')
In [135]: pnl
Out[135]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1421 (major_axis) x 2 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2010-01-04 00:00:00 to 2015-08-25 00:00:00
Minor_axis axis: AAPL to GOOG
In [136]: df = pnl.to_frame().unstack(level=1)
In [137]: df.columns = df.columns.swaplevel(0,1)
In [138]: df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1421 entries, 2010-01-04 to 2015-08-25
Data columns (total 12 columns):
(AAPL, Open) 1421 non-null float64
(GOOG, Open) 357 non-null float64
(AAPL, High) 1421 non-null float64
(GOOG, High) 357 non-null float64
(AAPL, Low) 1421 non-null float64
(GOOG, Low) 357 non-null float64
(AAPL, Close) 1421 non-null float64
(GOOG, Close) 357 non-null float64
(AAPL, Volume) 1421 non-null float64
(GOOG, Volume) 357 non-null float64
(AAPL, Adj Close) 1421 non-null float64
(GOOG, Adj Close) 357 non-null float64
dtypes: float64(12)
memory usage: 144.3 KB
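Panel was removed in pandas 0.25, so the session above only runs on older versions. The unstack and swaplevel steps still apply to a DataFrame with (major, minor) rows, which is the shape to_frame produced. A minimal sketch under that assumption, with made-up values and simplified dates on the major axis:

```python
import pandas as pd

# Stand-in for pnl.to_frame(): rows indexed by (major, minor), one column per item.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2015-07-31", "2015-08-21"]), ["A", "B", "C"]],
    names=["major", "minor"],
)
long_df = pd.DataFrame(
    {"X1": range(6), "X2": range(6, 12), "X3": range(12, 18)},
    index=index,
)
long_df.columns.name = "items"

# Move the minor level from the rows into the columns: columns become (items, minor).
wide = long_df.unstack(level="minor")

# Swap to (minor, items) and sort so the columns are grouped by minor.
wide.columns = wide.columns.swaplevel(0, 1)
wide = wide.sort_index(axis=1)

print(wide.columns.tolist())
```

After sorting, the columns follow the same order as mi_tuples in the question, with the level names minor and items.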