fillna(0) first but NaN value appears in iloc - python

df1.fillna(0)
Montant vente Marge
0 778283.75 13.63598
1 312271.20 9.26949
2 163214.65 14.50288
3 191000.20 9.55818
4 275970.00 12.76534
... ... ...
408 2999.80 14.60610
409 390.00 0.00000
410 699.00 26.67334
411 625.00 30.24571
412 0.00 24.79797
x = df1.iloc[:,1:3] # first slice selects rows, second selects columns
x
Marge
0 13.63598
1 9.26949
2 14.50288
3 9.55818
4 12.76534
... ...
408 14.60610
409 NaN
410 26.67334
411 30.24571
412 24.79797
413 rows × 1 columns
Why does row 409 show 0.00000 first but NaN after iloc? I also get this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

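You should learn which functions mutate the data frame and which don't. For example, fillna does not modify the DataFrame in place by default; it returns a new DataFrame, so df1.fillna(0) on its own only displays the filled copy and leaves df1 unchanged. Either assign the result back or pass inplace=True: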
df1 = df1.fillna(0)
or
df1.fillna(0, inplace=True)
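A minimal sketch of the difference, assuming a two-row stand-in for df1 (the values below are hypothetical and only mimic the question's columns):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1 from the question
df1 = pd.DataFrame({"Montant vente": [390.0, 699.0],
                    "Marge": [np.nan, 26.67334]})

df1.fillna(0)                        # returns a filled copy; df1 itself is untouched
still_has_nan = bool(df1["Marge"].isna().any())

df1 = df1.fillna(0)                  # keep the filled copy by assigning it back
x = df1.iloc[:, 1:3]                 # column positions 1 and 2; only "Marge" exists here
has_nan_after = bool(x.isna().any().any())

print(still_has_nan, has_nan_after)
```

Note that iloc[:, 1:3] starts at column position 1, so "Montant vente" (position 0) is not part of x.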

Related

Python Map function deletes all data in column

I have a Pandas DataFrame with several columns.
One of these ('Code') is object-type but has missing data (NaN). Other data can be numbers or letters.
For the missing data, I want to do a map / set_index function in order to fill in the data.
Here is my code:
for row in df['Code']:
    if pd.isnull(row) == True:
        df['Code']= df['account'].map(df_2.set_index('AccountID')['AccountCode'])
    else:
        None
However, this code deletes all data from the entire columns.
This is the original (I mean to do the map function on the NaN values only!) :
0 23050178040
1 23050178040
2 23050178040
3 23050178106
4 23050178040
...
288 23050942326
289 23050942326
290 NaN
291 23050942858
292 NaN
Name: Code BU, Length: 293, dtype: object
And the result:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
288 NaN
289 NaN
290 NaN
291 NaN
292 NaN
Name: Code BU, Length: 293, dtype: object
What is the issue here?
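The loop assigns the mapped Series to the whole column every time it meets a NaN, so the existing values are overwritten as well. Instead of the loop, use Series.fillna, which only fills the missing entries: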
df['Code']= df['Code'].fillna(df['account'].map(df_2.set_index('AccountID')['AccountCode']))
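As a rough illustration, here is the same pattern on small made-up frames (the account IDs and codes are hypothetical; only the column names come from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the question's frames
df = pd.DataFrame({
    "account": [1, 2, 3, 4],
    "Code": ["23050178040", np.nan, "23050942858", np.nan],
})
df_2 = pd.DataFrame({
    "AccountID": [1, 2, 3, 4],
    "AccountCode": ["A1", "B2", "C3", "D4"],
})

# Lookup Series: AccountID -> AccountCode
lookup = df_2.set_index("AccountID")["AccountCode"]

# Only the NaN entries of Code are replaced by the mapped values
df["Code"] = df["Code"].fillna(df["account"].map(lookup))
print(df)
```

If the mapped values still come out as NaN, the keys in df['account'] probably do not match the AccountID values exactly (for example strings on one side and integers on the other), which is worth checking first.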

Copying existing columns as moving averages to a dataframe

I think I am overthinking this. I am trying to copy existing pandas DataFrame columns and compute rolling averages from them without overwriting the original data. I iterate over the columns and, for each one, create a rolling 7-day moving average (MA) as a new column with the suffix _ma. I want to compare the existing data to the 7-day MA and see how many standard deviations the data is from it, which I can figure out. I am just trying to save the MA data as a new data frame.
I have
for column in original_data[ma_columns]:
    ma_df = pd.DataFrame(original_data[ma_columns].rolling(window=7).mean(), columns = str(column)+'_ma')
and getting the error : Index(...) must be called with a collection of some kind, 'Carrier_AcctPswd_ma' was passed
But if I am iterating with
for column in original_data[ma_columns]:
    print('Column Name : ', str(column)+'_ma')
    print('Contents : ', original_data[ma_columns].rolling(window=7).mean())
I get the data I need :
My issue is just saving this as a new data frame, which I can concatenate to the old, and then do my analysis.
EDIT
I have now been able to make a bunch of data frames, but I want to concatenate them together and this is where the issue is:
for column in original_data[ma_columns]:
    MA_data = pd.DataFrame(original_data[column].rolling(window=7).mean())
for i in MA_data:
    new = pd.concat(i)
    print(i)
<ipython-input-75-7c5e5fa775b3> in <module>
17 # print(type(MA_data))
18 for i in MA_data:
---> 19 new = pd.concat(i)
20 print(i)
21
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
279 verify_integrity=verify_integrity,
280 copy=copy,
--> 281 sort=sort,
282 )
283
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
307 "first argument must be an iterable of pandas "
308 "objects, you passed an object of type "
--> 309 '"{name}"'.format(name=type(objs).__name__)
310 )
311
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "str"
You should iterate over column names and assign the resulting pandas series as a new named column, for example:
import pandas as pd
original_data = pd.DataFrame({'A': range(100), 'B': range(100, 200)})
ma_columns = ['A', 'B']
for column in ma_columns:
    new_column = column + '_ma'
    # rolling(...).mean() already returns a Series, so it can be assigned directly
    original_data[new_column] = original_data[column].rolling(window=7).mean()
print(original_data)
Output dataframe:
A B A_ma B_ma
0 0 100 NaN NaN
1 1 101 NaN NaN
2 2 102 NaN NaN
3 3 103 NaN NaN
4 4 104 NaN NaN
.. .. ... ... ...
95 95 195 92.0 192.0
96 96 196 93.0 193.0
97 97 197 94.0 194.0
98 98 198 95.0 195.0
99 99 199 96.0 196.0
[100 rows x 4 columns]
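If the moving averages should live in their own data frame, as the question asks, one possible sketch is to roll all selected columns at once, rename them with add_suffix, and pass a list of frames to pd.concat. The names ma_df and combined are illustrative only:

```python
import pandas as pd

original_data = pd.DataFrame({"A": range(100), "B": range(100, 200)})
ma_columns = ["A", "B"]

# Rolling mean of every selected column, renamed with the _ma suffix
ma_df = original_data[ma_columns].rolling(window=7).mean().add_suffix("_ma")

# pd.concat expects a list (or other iterable) of pandas objects;
# axis=1 places the frames side by side, aligned on the index
combined = pd.concat([original_data, ma_df], axis=1)
print(combined.tail())
```

This leaves original_data untouched, and it avoids the TypeError from the question, which came from passing a single column name (a str) to pd.concat instead of a list of pandas objects.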

Handling Zeros or NaNs in a Pandas DataFrame operations

I have a DataFrame (df) as shown below, where each column is sorted from largest to smallest for frequency analysis. Because the columns have different lengths, the unused cells hold either zeros or NaN values.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation treating each column as having its own length or number of records (ignoring zero values). I have tried using NaN, but for some reason operations on NaN values are not possible.
Here is what I am trying to do with my df columns :
shape_list1=[]
location_list1=[]
scale_list1=[]
for column in df.columns:
    shape1, location1, scale1=stats.genpareto.fit(df[column])
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)
Assuming all values are positive (as seems from your example and description), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])
The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column].iloc[df[column].to_numpy().nonzero()[0]])
Explanation: df[column].to_numpy().nonzero() returns a tuple of length 1 whose only element, element [0], is a numpy array holding the integer positions where the column is nonzero. To select those positions from df[column], use df[column].iloc[...]. (Series.nonzero() was removed in pandas 1.0, hence the detour through to_numpy().)
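One caveat with the != 0 and nonzero variants: NaN != 0 evaluates to True, so NaN padding survives those filters. A small sketch that drops both kinds of padding before fitting, using a made-up frame (the dictionary name samples is hypothetical, and the fit call is left as a comment so the snippet needs only pandas and numpy):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: shorter columns are padded with 0 or NaN
df = pd.DataFrame({
    "08FB006": [253, 250, 246, 0, 0],
    "08FC005": [11.80, 10.60, np.nan, np.nan, 0],
})

samples = {}
for column in df.columns:
    # Turn zeros into NaN, then drop every NaN, leaving only real records
    values = df[column].replace(0, np.nan).dropna()
    samples[column] = values.to_numpy()
    # each array could then be passed on, e.g. stats.genpareto.fit(samples[column])

print({name: len(arr) for name, arr in samples.items()})
```

Each column ends up with its own length, which is what the per-column fit needs.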

how to multiply all values from a column in a particular year in pandas

I'm trying to multiply all values in a particular year and push it to another column. With the code below I'm getting this error
TypeError: ("'NoneType' object is not callable", 'occurred at index
I'm getting NaT and NaN when I use shift(1). How can I get it to work?
def check_date():
    next_row = df.Date.shift(1)
    first_row = df.Date
    date1 = pd.to_datetime(first_row).year
    date2 = pd.to_datetime(next_row).year
    if date1 == date2:
        df['all_data_in_year'] = date1 * date2
df.apply(check_date(), axis=1)
DataSet:
Date Open High Low Last Close Total Trade Quantity Turnover (Lacs)
31/12/10 816 824.5 807.3 815 818.45 1165987 9529.64
31/01/11 675 680 654 670.1 669.35 535039 3553.92
28/02/11 550 561.6 542 548.5 548.4 749166 4136.09
31/03/11 621.5 624.7 607.1 618 616.25 628572 3866
29/04/11 654.7 657.95 626 631 632.05 833213 5338.91
31/05/11 575 590 565.6 589.3 585.15 908185 5239.36
30/06/11 527 530.7 521.3 524 524.6 534496 2804.89
29/07/11 496.95 502.9 486 486.2 489.7 500743 2477.96
30/08/11 365.95 382.7 365 380 376.65 844439 3171.6
30/09/11 362.4 365.9 348.1 352 352.75 617537 2196.56
31/10/11 430 439.5 425 429.1 431.2 1033903 4493.97
30/11/11 349.05 354.95 344.15 348 350 686735 2404.1
30/12/11 353 355.9 340.1 340.1 342.75 740222 2565.39
31/01/12 443 451.45 428 445.5 446 1344942 5952.77
29/02/12 485.55 505.9 484 497 495.1 1011007 5004.46
30/03/12 421 436.45 418.4 432.5 432.95 867832 3740.04
30/04/12 410.35 419.4 406.85 414.3 414.05 418539 1733.81
31/05/12 362 363.05 351.2 359 358.3 840753 3000.41
29/06/12 385.05 395.3 382.9 388 389.75 1171690 4581.58
31/07/12 377.75 386 367.7 380.5 381.35 499246 1886.06
31/08/12 473.7 473.7 394.25 399 400.85 631225 2544.24
I think it is better to avoid loops (apply is a loop under the hood) and use numpy.where:
import numpy as np
import pandas as pd
#sample Dataframe with sample datetimes
rng = pd.date_range('2017-04-03', periods=10, freq='8m')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
# year of the previous row and year of the current row
date1 = df.Date.shift(1).dt.year
date2 = df.Date.dt.year
df['all_data_in_year'] = np.where(date1 == date2, date1 * date2, np.nan)
print (df)
Date a all_data_in_year
0 2017-04-30 0 NaN
1 2017-12-31 1 4068289.0
2 2018-08-31 2 NaN
3 2019-04-30 3 NaN
4 2019-12-31 4 4076361.0
5 2020-08-31 5 NaN
6 2021-04-30 6 NaN
7 2021-12-31 7 4084441.0
8 2022-08-31 8 NaN
9 2023-04-30 9 NaN
EDIT1:
To multiply all Close values within each year:
df['new'] = df.groupby(pd.to_datetime(df['Date']).dt.year)['Close'].transform('prod')
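A small sketch of the per-year product on made-up numbers (the Close values are hypothetical; the explicit format string is an assumption based on the dd/mm/yy dates in the question's dataset):

```python
import pandas as pd

# Hypothetical sample in the question's dd/mm/yy date layout
df = pd.DataFrame({
    "Date": ["31/12/10", "31/01/11", "28/02/11", "31/01/12"],
    "Close": [2.0, 3.0, 4.0, 5.0],
})

# Parse with an explicit format so day and month cannot be swapped
years = pd.to_datetime(df["Date"], format="%d/%m/%y").dt.year

# transform('prod') broadcasts each year's product back onto its rows
df["new"] = df.groupby(years)["Close"].transform("prod")
print(df)
```

Because transform returns a result aligned with the original index, every row of a given year carries that year's product.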

Python Pandas Dataframe select row by max value in group

I have a dataframe which was created via a df.pivot:
type start end
F_Type to_date
A 20150908143000 345 316
B 20150908140300 NaN 480
20150908140600 NaN 120
20150908143000 10743 8803
C 20150908140100 NaN 1715
20150908140200 NaN 1062
20150908141000 NaN 145
20150908141500 418 NaN
20150908141800 NaN 450
20150908142900 1973 1499
20150908143000 19522 16659
D 20150908143000 433 65
E 20150908143000 7290 7375
F 20150908143000 0 0
G 20150908143000 1796 340
I would like to filter and return a single row for each 'F_Type', keeping only the row with the maximum 'to_date'. I would like to return the following dataframe:
type start end
F_Type to_date
A 20150908143000 345 316
B 20150908143000 10743 8803
C 20150908143000 19522 16659
D 20150908143000 433 65
E 20150908143000 7290 7375
F 20150908143000 0 0
G 20150908143000 1796 340
Thanks..
A standard approach is to use groupby(keys)[column].idxmax().
However, idxmax returns index labels, so to select the desired rows the index it works on must be unique. One way to obtain a unique index is to call reset_index.
Once you obtain the index values from groupby(keys)[column].idxmax() you can then select the entire row of the reset frame using .loc, and restore the original index with set_index:
In [20]: flat = df.reset_index()
In [21]: flat.loc[flat.groupby(['F_Type'])['to_date'].idxmax()].set_index(['F_Type', 'to_date'])
Out[21]:
start end
F_Type to_date
A 20150908143000 345 316
B 20150908143000 10743 8803
C 20150908143000 19522 16659
D 20150908143000 433 65
E 20150908143000 7290 7375
F 20150908143000 0 0
G 20150908143000 1796 340
Note: idxmax returns index labels, not necessarily ordinals. After using reset_index the index labels happen to also be ordinals, but since idxmax is returning labels (not ordinals) it is better to always use idxmax in conjunction with df.loc, not df.iloc (as I originally did in this post.)
Other ways to do this are as follows. Both snippets assume that to_date is a regular column and F_Type is the only index level, for example after df.reset_index(level='to_date').
If you want only one max row per group:
(
    df
    .groupby(level=0)
    .apply(lambda group: group.nlargest(1, columns='to_date'))
    .reset_index(level=-1, drop=True)
)
If you want to get all rows that are equal to the max per group:
(
    df
    .groupby(level=0)
    .apply(lambda group: group.loc[group['to_date'] == group['to_date'].max()])
    .reset_index(level=-1, drop=True)
)
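For reference, a self-contained sketch of the idxmax approach on a reduced copy of the question's data (only five of the rows are reproduced; the names flat, latest and result are illustrative):

```python
import pandas as pd

# Reduced copy of the pivoted frame: MultiIndex of (F_Type, to_date)
index = pd.MultiIndex.from_tuples(
    [("A", 20150908143000),
     ("B", 20150908140300), ("B", 20150908143000),
     ("C", 20150908140100), ("C", 20150908143000)],
    names=["F_Type", "to_date"],
)
df = pd.DataFrame(
    {"start": [345, None, 10743, None, 19522],
     "end": [316, 480, 8803, 1715, 16659]},
    index=index,
)

flat = df.reset_index()                                  # unique labels 0..n-1
latest = flat.loc[flat.groupby("F_Type")["to_date"].idxmax()]
result = latest.set_index(["F_Type", "to_date"])         # restore the MultiIndex
print(result)
```

The selection is done with .loc on the reset frame because idxmax returns labels of that frame's index.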
