I have been trying to replace the NaN values in my DataFrame with the last valid value, but this does not seem to do the job. Has anyone else run into this, or does anyone know what could be causing it?
In [16]: ABCW.info()
Out[16]:<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 692 entries, 2014-10-22 10:30:00 to 2015-05-21 16:00:00
Data columns (total 6 columns):
Price 692 non-null float64
Volume 692 non-null float64
Symbol_Num 692 non-null object
Actual Price 577 non-null float64
Market Cap Rank 577 non-null float64
Market Cap 577 non-null float64
dtypes: float64(5), object(1)
memory usage: 37.8+ KB
In [18]: ABCW.fillna(method = 'pad')
In [19]: ABCW.info()
Out [19]: <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 692 entries, 2014-10-22 10:30:00 to 2015-05-21 16:00:00
Data columns (total 6 columns):
Price 692 non-null float64
Volume 692 non-null float64
Symbol_Num 692 non-null object
Actual Price 577 non-null float64
Market Cap Rank 577 non-null float64
Market Cap 577 non-null float64
dtypes: float64(5), object(1)
memory usage: 37.8+ KB
There is no change in the number of non-null values, and all the preexisting NaN values are still in the DataFrame.
You are using the 'pad' method. This is basically a forward fill. See examples at http://pandas.pydata.org/pandas-docs/stable/missing_data.html
I am reproducing the relevant example here,
In [33]: df
Out[33]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e 0.119209 -1.044236 -0.861849
f -2.104569 -0.494929 1.071804
h NaN -0.706771 -1.039575
In [34]: df.fillna(method='pad')
Out[34]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e 0.119209 -1.044236 -0.861849
f -2.104569 -0.494929 1.071804
h -2.104569 -0.706771 -1.039575
This method will not do a backfill, which you should also consider if you want all your NaNs to go away. Also, 'inplace=False' by default, so you need to assign the result of the operation back to ABCW, like so:
ABCW = ABCW.fillna(method = 'pad')
ABCW = ABCW.fillna(method = 'bfill')
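In later pandas versions the same fix is usually written with the `ffill`/`bfill` methods; a minimal sketch on a toy frame (invented column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Price": [1.0, np.nan, 3.0],
                   "Market Cap": [np.nan, 20.0, np.nan]})

# ffill/bfill return a new DataFrame by default (inplace=False),
# so the result must be assigned back.
df = df.ffill().bfill()  # forward fill, then backfill any leading NaNs
```

After this, `df.info()` reports 3 non-null values in every column.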
I import a DataFrame that has a column 'Goals' which accumulates the previous results, like so:
print (df[df['name']=='Player Name']['Goals'])
152 1.0
828 2.0
1591 3.0
I know for a fact that the player scored only one goal per game, so the column should be like:
152 1.0
828 1.0
1591 1.0
By the way, the same logic applies to all the other scout columns:
...
FF 322 non-null float64
FS 568 non-null float64
Goals 80 non-null float64
A 63 non-null float64
PI 834 non-null float64
SG 140 non-null float64
DD 46 non-null float64
DS 611 non-null float64
FC 602 non-null float64
GC 3 non-null float64
GS 45 non-null float64
FD 231 non-null float64
CA 190 non-null float64
FT 34 non-null float64
I 112 non-null float64
PP 4 non-null float64
CV 9 non-null float64
...
QUESTION
What is the best way to correct this logic and apply this subtraction to the subset of columns above?
Edit:
df['G'] = df['G'].diff()
returns
152     NaN
828     NaN
1562    NaN
Name: G, dtype: float64
Writing this in the blind as you have not shown the full DataFrame:
for col in ['Goals', 'FF', 'FS', ...]:
    tmp = df.groupby('name')[col].diff().fillna(1)
    df[col] = tmp
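A minimal sketch of the groupby/diff idea on the example values from the question (assuming a `name` column, which the question implies; filling the first diff with the original value rather than a constant, since the first cumulative total equals the first per-game value):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Player Name"] * 3,
                   "Goals": [1.0, 2.0, 3.0]},   # cumulative totals
                  index=[152, 828, 1591])

# diff() turns the running total back into per-game values;
# the first row of each group becomes NaN, so fill it from the
# original column (index alignment makes this per-row).
df["Goals"] = df.groupby("name")["Goals"].diff().fillna(df["Goals"])
```

This recovers one goal per game, as expected.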
I want to add the index (the dates) as the first column for my analysis.
Input: df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 2019-01-16 to 2018-08-23
Data columns (total 5 columns):
open 100 non-null float64
high 100 non-null float64
low 100 non-null float64
close 100 non-null float64
volume 100 non-null float64
dtypes: float64(5)
memory usage: 9.7+ KB
df = df.assign(time=df.index)  # copy the DatetimeIndex into a 'time' column
df.head()
Out[76]:
1. open 2. high 3. low 4. close 5. volume time
2019-01-16 105.2600 106.2550 104.9600 105.3800 29655851 2019-01-16
2019-01-15 102.5100 105.0500 101.8800 105.0100 31587616 2019-01-15
2019-01-14 101.9000 102.8716 101.2600 102.0500 28437079 2019-01-14
2019-01-11 103.1900 103.4400 101.6400 102.8000 28314202 2019-01-11
2019-01-10 103.2200 103.7500 102.3800 103.6000 30067556 2019-01-10
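The index can be copied into a regular column in a couple of ways; a minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"open": [105.26, 102.51], "close": [105.38, 105.01]},
                  index=pd.to_datetime(["2019-01-16", "2019-01-15"]))

# Option 1: keep the index and copy it into a new column
df_a = df.assign(time=df.index)

# Option 2: move the index out into the first column
# (an unnamed index becomes a column called 'index')
df_b = df.reset_index().rename(columns={"index": "time"})
```

Option 2 is the one to use if downstream code expects the dates as an ordinary first column rather than as the index.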
I am trying to use pandas to merge a product's packaging information with each order record for that product. The DataFrame information is below.
BreakerOrders.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774010 entries, 0 to 3774009
Data columns (total 2 columns):
Material object
Quantity float64
dtypes: float64(1), object(1)
memory usage: 86.4+ MB
manh.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1381 entries, 0 to 1380
Data columns (total 4 columns):
Material 1381 non-null object
SUBPACK_QTY 202 non-null float64
PACK_QTY 591 non-null float64
PALLET_QTY 809 non-null float64
dtypes: float64(3), object(1)
memory usage: 43.2+ KB
When attempting the merge using the code below, I get the following table with all NaN values for packaging quantities.
BreakerOrders.merge(manh,how='left',on='Material')
Material Quantity SUBPACK_QTY PACK_QTY PALLET_QTY
HOM230CP 5.0 NaN NaN NaN
QO115 20.0 NaN NaN NaN
QO2020CP 20.0 NaN NaN NaN
QO220CP 50.0 NaN NaN NaN
HOM115CP 50.0 NaN NaN NaN
HOM120 100.0 NaN NaN NaN
I was having the same issue and was able to solve it by just flipping the DataFrames. So instead of:
df2 = df.merge(df1)
try
df2 = df1.merge(df)
Looks silly, but it solved my issue.
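Independent of argument order, all-NaN columns after a left merge usually mean the join keys never actually match, e.g. stray whitespace or differing dtypes in the key column. A sketch of how to diagnose this with made-up data, using merge's `indicator` option:

```python
import pandas as pd

orders = pd.DataFrame({"Material": ["HOM230CP", "QO115"],
                       "Quantity": [5.0, 20.0]})
packs = pd.DataFrame({"Material": ["HOM230CP ", "QO115 "],  # trailing spaces
                      "PACK_QTY": [10.0, 12.0]})

# indicator=True adds a _merge column showing which side each row matched
merged = orders.merge(packs, how="left", on="Material", indicator=True)
# every row comes back 'left_only': the keys never matched

# Normalising the keys (here: stripping whitespace) fixes the join
packs["Material"] = packs["Material"].str.strip()
fixed = orders.merge(packs, how="left", on="Material")
```

If `_merge` shows `left_only` everywhere, compare `repr()` of sample key values from both frames to spot the mismatch.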
I need to apply a function to a df to create multiple new columns. As input, my function would need (i) a row or (ii) multiple columns.
def divideAndMultiply(x, y):
    return x/y, x*y
df["e"], df["f"] = zip(*df.a.apply(lambda val: divideAndMultiply(val,2)))
(https://stackoverflow.com/a/36600318/1831695)
This works to create multiple new columns, but not with multiple inputs (= columns from that df). Or did I miss something?
Example what I want to do:
My DF has several columns. Three of them (a, b, c) are needed to calculate two new columns (y, z), where y = a + b + c and z = c - b - a.
I know this is an easy calculation where I do not need a function, but let's assume we would need one.
Instead of writing and applying 2 functions, I would like to have only one function, return both values and accepting all three values (or even better: a row) for the calculation.
This example:
df["y"], df["z"] = zip(*df.a.apply(lambda val: divideAndMultiply(val,2)))
works only with one column value (val) and one other value (2 in this case).
I would need something like that:
df["y"], df["z"] = zip(*df.a.apply(lambda val: divideAndMultiply(df['a'],df['b'],df['c'])))
(And yes, I know that val is assigned to df.a )
Update 2
This is what my df looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4655 entries, 0 to 4654
Data columns (total 71 columns):
Open 4655 non-null float64
Close 4655 non-null float64
High 4655 non-null float64
Low 4655 non-null float64
DateTime 4655 non-null datetime64[ns]
Date 4655 non-null datetime64[ns]
T_CLOSE 4655 non-null float64
T_HIGH 4655 non-null float64
T_LOW 4655 non-null float64
T_OPEN 4655 non-null float64
MVA5 4651 non-null float64
MVA10 4646 non-null float64
MVA14 4642 non-null float64
MVA15 4641 non-null float64
MVA28 4628 non-null float64
MVA50 4606 non-null float64
STD5 4651 non-null float64
STD10 4646 non-null float64
STD15 4641 non-null float64
CV_5 4651 non-null float64
CV_10 4646 non-null float64
CV_15 4641 non-null float64
DIFF_VP_CLOSE 4654 non-null float64
DIFF_VP_HIGH 4654 non-null float64
DIFF_VP_OPEN 4654 non-null float64
DIFF_VP_LOW 4654 non-null float64
AVG_STEIG_5 4650 non-null float64
AVG_STEIG_10 4645 non-null float64
AVG_STEIG_15 4640 non-null float64
AVG_STEIG_28 4627 non-null float64
AVG_5_DIFF 4651 non-null float64
AVG_10_DIFF 4646 non-null float64
AVG_15_DIFF 4641 non-null float64
AVG_14_DIFF 4642 non-null float64
AVG_50_DIFF 4606 non-null float64
AD_5_14 4642 non-null float64
Momentum_4 4651 non-null float64
ROC_4 4652 non-null float64
Momentum_8 4647 non-null float64
ROC_8 4648 non-null float64
Momentum_12 4643 non-null float64
ROC_12 4644 non-null float64
VT_OPEN 4598 non-null float64
VT_CLOSE 4598 non-null float64
VT_HIGH 4598 non-null float64
VT_LOW 4598 non-null float64
PP_VT 4598 non-null float64
R1_VT 4598 non-null float64
R2_VT 4598 non-null float64
R3_VT 4598 non-null float64
S1_VT 4598 non-null float64
S2_VT 4598 non-null float64
S3_VT 4598 non-null float64
DIFF_VT_CLOSE 4598 non-null float64
DIFF_VT_HIGH 4598 non-null float64
DIFF_VT_OPEN 4598 non-null float64
DIFF_VT_LOW 4598 non-null float64
DIFF_T_OPEN 4655 non-null float64
DIFF_T_LOW 4655 non-null float64
DIFF_T_HIGH 4655 non-null float64
DIFF_T_CLOSE 4655 non-null float64
DIFF_VTCLOSE_TOPEN 4598 non-null float64
VP_HIGH 4654 non-null float64
VP_LOW 4654 non-null float64
VP_OPEN 4654 non-null float64
VP_CLOSE 4654 non-null float64
regel_r1 4655 non-null int64
regel_r2 4655 non-null int64
regel_r3 4655 non-null int64
regeln 4655 non-null int64
vormittag_flag 4655 non-null int64
dtypes: datetime64[ns](2), float64(64), int64(5)
memory usage: 2.6 MB
None
Open Close High Low DateTime Date T_CLOSE T_HIGH T_LOW T_OPEN MVA5 MVA10 MVA14 MVA15 MVA28 MVA50 STD5 STD10 STD15 CV_5 CV_10 CV_15 DIFF_VP_CLOSE DIFF_VP_HIGH DIFF_VP_OPEN DIFF_VP_LOW AVG_STEIG_5 AVG_STEIG_10 AVG_STEIG_15 AVG_STEIG_28 AVG_5_DIFF AVG_10_DIFF AVG_15_DIFF AVG_14_DIFF AVG_50_DIFF AD_5_14 Momentum_4 ROC_4 Momentum_8 ROC_8 Momentum_12 ROC_12 VT_OPEN VT_CLOSE VT_HIGH VT_LOW PP_VT R1_VT R2_VT R3_VT S1_VT S2_VT S3_VT DIFF_VT_CLOSE DIFF_VT_HIGH DIFF_VT_OPEN DIFF_VT_LOW DIFF_T_OPEN DIFF_T_LOW DIFF_T_HIGH DIFF_T_CLOSE DIFF_VTCLOSE_TOPEN VP_HIGH VP_LOW VP_OPEN VP_CLOSE regel_r1 regel_r2 regel_r3 regeln vormittag_flag T_DIRC T_WECHSELC T_NUM_INNEN T_CANDLEC
4653 12488.1 12490.1 12490.6 12484.9 2017-05-03 14:00:00 2017-05-03 12490.1 12508.3 12475.4 12506.5 12490.18 12488.41 12487.521429 12487.053333 12493.078571 12498.118 2.178761 4.334218 4.515033 0.000174 0.000347 0.000362 -1.7 2.5 -1.0 -2.7 -9.6 8.5 4.866667 -8.178571 -0.08 1.69 3.046667 2.578571 -8.018 -2.658571 -3.8 0.00004 8.7 0.000577 -0.3 0.000521 12449.3 12514.3 12527.8 12432.0 12491.366667 12550.733333 12587.166667 12646.533333 12454.933333 12395.566667 12359.133333 24.2 37.7 -40.8 -58.1 16.4 -14.7 18.2 0.0 7.8 12492.6 12487.4 12489.1 12488.4 0 0 0 0 0 neutral INNEN 1 GRUEN
4654 12489.9 12489.9 12489.9 12489.6 2017-05-03 14:15:00 2017-05-03 12489.9 12508.3 12475.4 12506.5 12489.38 12488.91 12487.828571 12487.680000 12492.182143 12498.436 0.712039 4.169586 4.180431 0.000057 0.000334 0.000335 0.2 0.7 -1.8 -5.0 -8.0 5.0 6.266667 -8.964286 0.52 0.99 2.220000 2.071429 -8.536 -1.551429 0.3 0.00008 7.0 0.000064 6.3 0.000665 12449.3 12514.3 12527.8 12432.0 12491.366667 12550.733333 12587.166667 12646.533333 12454.933333 12395.566667 12359.133333 24.4 37.9 -40.6 -57.9 16.6 -14.5 18.4 0.0 7.8 12490.6 12484.9 12488.1 12490.1 0 0 0 0 0 neutral INNEN 1 GRUEN
UPDATED
This is a straightforward example with a Quandl dataset,
import quandl
df = quandl.get("WIKI/GOOGL")
columns = ["High", 'Low', 'Close']
def operations(row, columns):
    df1 = row[columns[0]] + row[columns[1]] + row[columns[2]]
    df2 = -row[columns[1]] - row[columns[2]] + row[columns[0]]
    return df1, df2
df["function1"], df["function2"] = zip(*df.apply(lambda row: operations(row, columns), axis=1))
df[["High","Low","Close","function1", "function2"]].head(5)
I have a pandas panel of:
Items axis: X1 to X3
Major_axis axis: (1973-09-30 00:00:00, 1989-03-31 00:00:00) to (2015-07-31 00:00:00, 2015-08-21 00:00:00)
Minor_axis axis: A to C
and I would like to convert it to a dataframe with a multilevel column of (Item, Minor), the multilevel columns would look as follows:
mi_tuples = [ ('A','X1'), ('A','X2'), ('A','X3'), ('B','X1'), ('B','X2'), ('B','X3'), ('C','X1'), ('C','X2'), ('C','X3') ]
mi_columns = pd.MultiIndex.from_tuples(mi_tuples, names = ['minor', 'items'])
Any thoughts?
Thanks!
I think a combination of to_frame, unstack, and swaplevel can get you there. See below with some example data.
In [134]: pnl = pd.io.data.DataReader(['GOOG', 'AAPL'], 'yahoo')
In [135]: pnl
Out[135]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1421 (major_axis) x 2 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2010-01-04 00:00:00 to 2015-08-25 00:00:00
Minor_axis axis: AAPL to GOOG
In [136]: df = pnl.to_frame().unstack(level=1)
In [137]: df.columns = df.columns.swaplevel(0,1)
In [138]: df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1421 entries, 2010-01-04 to 2015-08-25
Data columns (total 12 columns):
(AAPL, Open) 1421 non-null float64
(GOOG, Open) 357 non-null float64
(AAPL, High) 1421 non-null float64
(GOOG, High) 357 non-null float64
(AAPL, Low) 1421 non-null float64
(GOOG, Low) 357 non-null float64
(AAPL, Close) 1421 non-null float64
(GOOG, Close) 357 non-null float64
(AAPL, Volume) 1421 non-null float64
(GOOG, Volume) 357 non-null float64
(AAPL, Adj Close) 1421 non-null float64
(GOOG, Adj Close) 357 non-null float64
dtypes: float64(12)
memory usage: 144.3 KB
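`Panel` (and `pd.io.data`) were removed in later pandas versions; the modern equivalent builds the MultiIndex columns directly, e.g. with `pd.concat` on a dict of per-item DataFrames. A sketch with invented data:

```python
import pandas as pd

idx = pd.date_range("2015-08-24", periods=2)
frames = {
    "AAPL": pd.DataFrame({"Open": [1.0, 2.0], "Close": [1.5, 2.5]}, index=idx),
    "GOOG": pd.DataFrame({"Open": [3.0, 4.0], "Close": [3.5, 4.5]}, index=idx),
}

# concat along the columns with dict keys yields (ticker, field)
# MultiIndex columns, matching the layout produced by the
# to_frame/unstack/swaplevel recipe above
df = pd.concat(frames, axis=1)
```

Selecting `df["AAPL"]` then returns the full per-ticker sub-frame, which covers most uses of the old minor-axis slicing.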