Replacing NaTs in a DataFrame using shift - python

I am importing a dataframe from an Excel spreadsheet where the date column is incomplete:
Date Value
0 2020-04-29 144
1 NaT 158
2 NaT 134
3 2020-04-30 114
4 NaT 153
and I'd like to fill in the NaTs by replacing them with the date from the line above. The slow method works:
for i in range(0, df.shape[0]):
    if pd.isnull(df.iat[i, 0]):
        df.iat[i, 0] = df.iat[i-1, 0]
but the methods I think ought to work don't. Both of these replace the first NaT they encounter but skip the NaTs after that (are they working on copies of the data?):
df["Date"] = np.where(df["Date"].isnull(), df["Date"].shift(1), df["Date"])
df['Date'].mask(df['Date'].isnull(), df['Date'].shift(1), inplace=True)
Is there any quick way of doing this?
A

You can try ffill:
df.ffill()
If the "Date" values are strings, you can first convert each "NaT" string into an actual NaN value using replace:
df.replace("NaT", np.nan).ffill()
Explanation
Use replace to convert the "NaT" strings to actual NaN values.
Fill each NaN cell from the previous non-NaN cell using ffill.
Code + illustration
import pandas as pd
import numpy as np
print(df.replace("NaT", np.nan))
# Date Value
# 0 2020-04-29 144
# 1 NaN 158
# 2 NaN 134
# 3 2020-04-30 114
# 4 NaN 153
print(df.replace("NaT", np.nan).ffill())
# Date Value
# 0 2020-04-29 144
# 1 2020-04-29 158
# 2 2020-04-29 134
# 3 2020-04-30 114
# 4 2020-04-30 153
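The shift-based attempts in the question stop after one step because shift(1) reads the original column, so the second NaT in a run is filled from another NaT. A minimal sketch of the difference, assuming a small hand-built frame that mirrors the question's data (the names one_step and filled are illustrative only):

```python
import pandas as pd

# Hand-built stand-in for the question's data (assumption: real datetimes, not strings)
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-04-29", None, None, "2020-04-30", None]),
    "Value": [144, 158, 134, 114, 153],
})

# One-step fill: each NaT takes the value directly above it in the ORIGINAL column,
# so a NaT that sits below another NaT stays NaT
one_step = df["Date"].where(df["Date"].notna(), df["Date"].shift(1))

# Forward fill: carries the last valid value down through the whole run
filled = df["Date"].ffill()

print(one_step.isna().sum())  # NaTs left after the one-step fill
print(filled.isna().sum())    # NaTs left after ffill
```

Note that ffill returns a new object, so the result has to be assigned back (for example df["Date"] = df["Date"].ffill()) for the change to persist.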

Related

appending pandas columns data

Why can't the pandas data frame append appropriately to form one data frame in this loop?
# Produce the overall data frame
def processed_data(data1_, f_loc, open, close):
    """data1_: is the csv file to be modified
    f_loc: is the location of csv files to be processed
    open and close: are the columns to undergo computations
    returns a new dataframe of modified columns"""
    main_file = drop_col(data1_)  # Dataframe to append more data columns to
    for i in files_path(f_loc):
        data = get_data_frame(i[0])  # takes the file path of the csv file and returns the data frame
        perc = perc_df(data, open, close, i[1])  # Dataframe to append
        copy_data = main_file.append(perc)
    return copy_data
Here's the output:
Date WTRX-USD
0 2021-05-27 NaN
1 2021-05-28 NaN
2 2021-05-29 NaN
3 2021-05-30 NaN
4 2021-05-31 NaN
.. ... ...
79 NaN -2.311576
80 NaN 5.653349
81 NaN 5.052950
82 NaN -2.674435
83 NaN -3.082957
[450 rows x 2 columns]
My intention is to return something like this (where each append operation adds a column):
Date Open High Low Close Adj Close Volume
0 2021-05-27 0.130793 0.136629 0.124733 0.128665 0.128665 70936563
1 2021-05-28 0.128659 0.129724 0.111244 0.113855 0.113855 71391441
2 2021-05-29 0.113752 0.119396 0.108206 0.111285 0.111285 62049940
3 2021-05-30 0.111330 0.115755 0.107028 0.112185 0.112185 70101821
4 2021-05-31 0.112213 0.126197 0.111899 0.125617 0.125617 83502219
.. ... ... ... ... ... ... ...
361 2022-05-23 0.195637 0.201519 0.185224 0.185231 0.185231 47906144
362 2022-05-24 0.185242 0.190071 0.181249 0.189553 0.189553 33312065
363 2022-05-25 0.189550 0.193420 0.183710 0.183996 0.183996 33395138
364 2022-05-26 0.184006 0.186190 0.165384 0.170173 0.170173 57218888
365 2022-05-27 0.170636 0.170660 0.165052 0.166864 0.166864 63560568
[366 rows x 7 columns]
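pandas.concat
pandas.DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. Use pandas.concat instead. append also stacks rows, which is why the output above has NaN wherever a frame lacks the other frame's column.
To combine DataFrame objects horizontally (side by side, as new columns), pass axis=1. Set copy_data = main_file before the loop so that each iteration adds to the accumulated result:
copy_data = pd.concat([copy_data, perc], axis=1)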
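A common variant is to collect the per-file frames in a list and concatenate once at the end. The sketch below assumes two made-up single-column frames (AAA-USD and BBB-USD are hypothetical names, as are base and pieces) that share the same default index as the base frame; concat with axis=1 aligns rows on the index, so mismatched indexes would again produce NaN.

```python
import pandas as pd

# Base frame holding the dates (hypothetical data)
base = pd.DataFrame({"Date": ["2021-05-27", "2021-05-28"]})

# One single-column frame per processed file (hypothetical names and values)
pieces = [
    pd.DataFrame({"AAA-USD": [1.0, 2.0]}),
    pd.DataFrame({"BBB-USD": [3.0, 4.0]}),
]

# axis=1 places the frames side by side, matching rows by index label
combined = pd.concat([base, *pieces], axis=1)
print(combined)
```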

Shifting column values in Pandas Dataframe causes missing values

I want to shift column values one space to the left. I don't want to save the original values of the column 'average_rating'.
I used the shift command:
data3 = data3.shift(-1, axis=1)
But the output I get has missing values for two columns: 'num_pages' and 'text_reviews_count'.
It is because the data types of the source and target columns do not match. Try converting the column value after shift() to the target data type for each source and target column - for example .fillna(0).astype(int).
Alternately, you can convert all the data in the data frame to strings and then perform the shift. You might want to convert them back to their original data types again.
from io import StringIO
import pandas as pd

df = df.astype(str) # convert all data to str
df_shifted = df.shift(-1, axis=1) # perform the shift
df_string = df_shifted.to_csv() # store the shifted data in a CSV string
new_df = pd.read_csv(StringIO(df_string), index_col=0) # re-read the string so pandas infers the dtypes again
Output:
average_rating isbn isbn13 language_code num_pages ratings_count text_reviews_count extra
0 3.57 0674842111 978067 en-US 236 55 6.0 NaN
1 3.60 1593600119 978067 eng 400 25 4.0 NaN
2 3.63 156384155X 978067 eng 342 38 4.0 NaN
3 3.98 1857237250 978067 eng 383 2197 17.0 NaN
4 0.00 0851742718 978067 eng 49 0 0.0 NaN
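For a self-contained view of the string round trip described above, here is a small sketch on invented data (the column names title, average_rating and num_pages and their values are assumptions, not the asker's real frame). It shows that after the shift each column holds its right-hand neighbour's values, and that re-reading the CSV text lets pandas infer numeric dtypes again.

```python
from io import StringIO

import pandas as pd

# Invented three-column frame with mixed dtypes
df = pd.DataFrame({
    "title": ["a", "b"],
    "average_rating": [3.57, 3.60],
    "num_pages": [236, 400],
})

# Cast everything to str so the shift moves values between same-typed columns
shifted = df.astype(str).shift(-1, axis=1)

# Round-trip through CSV text so the dtypes are inferred from the shifted values
restored = pd.read_csv(StringIO(shifted.to_csv()), index_col=0)
print(restored)
print(restored.dtypes)
```

The last column is empty after the shift because nothing lies to its right; it can be dropped or filled as needed.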

Selecting column values of a dataframe which is in a range and put it in appropriate columns of another dataframe in pandas

I have a csv file which is something like below
date,mean,min,max,std
2018-03-15,3.9999999999999964,inf,0.0,100.0
2018-03-16,0.46403712296984756,90.0,0.0,inf
2018-03-17,2.32452732452731,,0.0,143.2191767899579
2018-03-18,2.8571428571428523,inf,0.0,100.0
2018-03-20,0.6928406466512793,100.0,0.0,inf
2018-03-22,2.8675703858185635,,0.0,119.05383697172658
I want to select the column values that are > 20 and < 500 (that is, between 20 and 500) and put those values, along with the date, into the matching columns of another dataframe. The other dataframe looks something like this:
Date percentage_change location
2018-02-14 23.44 BOM
So I want to get the date and value from the csv and add them to the new dataframe in the appropriate columns. Something like:
Date percentage_change location
2018-02-14 23.44 BOM
2018-03-15 100.0 NaN
2018-03-16 90.0 NaN
2018-03-17 143.2191767899579 NaN
.... .... ....
Now I am aware of functions like df.max(axis=1) and df.min(axis=1), which give you the min and max, but I am not sure how to find values based on a range. So how can this be achieved?
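Given dataframes df1 and df2, you can achieve this by aligning column names, cleaning the numeric data, and then appending the rows with pd.concat (pd.DataFrame.append was removed in pandas 2.0).
import numpy as np
import pandas as pd

df_app = (
    df1.loc[:, ['date', 'mean', 'min', 'std']]
    .rename(columns={'date': 'Date'})
    .replace(np.inf, 0)  # treat inf as 0 so it never wins the maximum below
    .fillna(0)
)
print(df_app)
df_app['percentage_change'] = np.maximum(df_app['min'], df_app['std'])
print(df_app)
df_app = df_app[df_app['percentage_change'].between(20, 500)]  # bounds are inclusive by default
# sort=True orders the combined columns alphabetically, as in the output below
res = pd.concat([df2, df_app.loc[:, ['Date', 'percentage_change']]], sort=True)
print(res)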
# Date location percentage_change
# 0 2018-02-14 BOM 23.440000
# 0 2018-03-15 NaN 100.000000
# 1 2018-03-16 NaN 90.000000
# 2 2018-03-17 NaN 143.219177
# 3 2018-03-18 NaN 100.000000
# 4 2018-03-20 NaN 100.000000
# 5 2018-03-22 NaN 119.053837
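One detail worth checking is the boundary handling: Series.between includes both endpoints by default, while the question asks for strictly > 20 and < 500. A minimal sketch on made-up rows (the fourth row is invented to land exactly on 20; the names clean, pct, inclusive_mask and strict_mask are illustrative), assuming pandas 1.3 or later for the string form of the inclusive argument:

```python
import numpy as np
import pandas as pd

# Made-up rows in the shape of the question's csv; the last row hits the boundary
df1 = pd.DataFrame({
    "date": ["2018-03-15", "2018-03-16", "2018-03-17", "2018-03-18"],
    "min": [np.inf, 90.0, np.nan, 20.0],
    "std": [100.0, np.inf, 143.2, 5.0],
})

# Same cleaning as above: positive inf and missing values become 0
clean = df1.replace(np.inf, 0).fillna(0)
pct = np.maximum(clean["min"], clean["std"])

inclusive_mask = pct.between(20, 500)                    # 20 <= x <= 500
strict_mask = pct.between(20, 500, inclusive="neither")  # 20 < x < 500

print(clean.loc[inclusive_mask, "date"].tolist())
print(clean.loc[strict_mask, "date"].tolist())
```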

pandas dataframe: perform calculations on columns

New to pandas and new to stackoverflow (really), any suggestions are highly appreciated!
I have this dataframe df:
col1 col2 col3
Date
2017-08-24 100 101 105
2017-08-23 102 102 107
2017-08-22 101 100 106
2017-08-21 103 99 106
2017-08-18 103 98 108
...
Now I'd like to perform some calculations with the values of each column, e.g. calculate the logarithm of each value.
I thought it's a good idea to loop over the columns and create a new temporary data frame with the resulting columns.
This new data frame should look like this e.g.:
col1 RN LOG
Date
2017-08-24 100 1 2
2017-08-23 102 2 2,008600
2017-08-22 101 3 2,004321
2017-08-21 103 4 2,012837
2017-08-18 103 5 2,012837
So I tried this for-loop:
for column in df:
    tmp_df = df[column]
    tmp_df['RN'] = range(1, len(tmp_df) + 1) # to create a new column with the row number
    tmp_df['LOG'] = np.log(df[column]) # to create a new column with the LOG
However this doesn't print the new columns next to col1, but one below the other. The result looks like this:
Name: col1, Length: 86, dtype: object
Date
2017-08-24 00:00:00 100
2017-08-23 00:00:00 102
2017-08-22 00:00:00 101
2017-08-21 00:00:00 103
2017-08-18 00:00:00 103
RN,"range(1, 86)"
LOG,"Date
2017-08-24 2
2017-08-23 2,008600
2017-08-22 2,004321
2017-08-21 2,012837
2017-08-18 2,012837
00:00:00 was added to the date in the first part...
I also tried something with assign:
tmp_df = tmp_df.assign(LN=np.log(df[column]))
But this results in "AttributeError: 'Series' object has no attribute 'assign'".
It'd really be great if someone could point me in the right direction.
Thanks!
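Your for loop is a good idea, but assign the new columns directly on df under distinct names. Avoid wrapping the values in a pd.Series with a default integer index, because it would not align with the Date index and the new columns would come out as NaN:
for column in list(df.columns):  # snapshot the names, since the loop adds columns
    df['RN ' + column] = range(1, len(df) + 1)
    df['Log ' + column] = np.log(df[column])  # natural log; use np.log10 for base 10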
Now I figured it out. :)
import pandas as pd
import numpy as np
...
for column in df:
    tmp_res = pd.DataFrame(data=df[column])
    newcol = range(1, len(df) + 1)
    tmp_res = tmp_res.assign(RN=newcol)
    newcol2 = np.log10(df[column])  # base-10 log, matching the values shown below
    tmp_res = tmp_res.assign(LOG=newcol2)
    print(tmp_res)
This prints all columns next to each other:
col1 RN LOG
Date
2017-08-24 100 1 2
2017-08-23 102 2 2.008600
2017-08-22 101 3 2.004321
2017-08-21 103 4 2.012837
2017-08-18 103 5 2.012837
Now I can go on processing them or put it all in a csv / excel file.
Thanks for all your suggestions!
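For reference, the per-column frames can also be built in a single assign call and kept in a dictionary for later processing. This is only a sketch on a three-row stand-in for the question's data; the results dictionary is an illustrative name, and np.log10 is assumed because the expected LOG values in the question are base-10.

```python
import numpy as np
import pandas as pd

# Three-row stand-in for the question's frame, indexed by date
df = pd.DataFrame(
    {"col1": [100, 102, 101], "col2": [101, 102, 100]},
    index=pd.to_datetime(["2017-08-24", "2017-08-23", "2017-08-22"]),
)
df.index.name = "Date"

results = {}
for column in df.columns:
    # df[[column]] keeps a one-column DataFrame, so assign is available
    results[column] = df[[column]].assign(
        RN=range(1, len(df) + 1),   # row number
        LOG=np.log10(df[column]),   # base-10 logarithm
    )

print(results["col1"])
```

Each frame in results can then be written out with to_csv or to_excel.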

Pandas Reindex - Fill Column with Missing Values

I tried several examples of this topic but with no results. I'm reading a DataFrame like:
Code,Counts
10006,5
10011,2
10012,26
10013,20
10014,17
10015,2
10018,2
10019,3
How can I get another DataFrame like:
Code,Counts
10006,5
10007,NaN
10008,NaN
...
10011,2
10012,26
10013,20
10014,17
10015,2
10016,NaN
10017,NaN
10018,2
10019,3
Basically filling the missing values of the 'Code' Column? I tried the df.reindex() method but I can't figure out how it works. Thanks a lot.
I'd set the index to your 'Code' column, reindex by passing in a new array based on your current index, and then call reset_index. arange accepts start and stop params (you need to add 1 to the stop value because it is exclusive). This assumes that your 'Code' values are already sorted:
In [21]:
df.set_index('Code', inplace=True)
df = df.reindex(index = np.arange(df.index[0], df.index[-1] + 1)).reset_index()
df
Out[21]:
Code Counts
0 10006 5
1 10007 NaN
2 10008 NaN
3 10009 NaN
4 10010 NaN
5 10011 2
6 10012 26
7 10013 20
8 10014 17
9 10015 2
10 10016 NaN
11 10017 NaN
12 10018 2
13 10019 3
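The same idea as a self-contained sketch, using a shortened three-row version of the question's data. It takes the range from min and max rather than the first and last index values, so it does not rely on the codes being sorted; the names full_range and filled are illustrative. The rename_axis call is there only to make sure the index keeps its 'Code' name before reset_index.

```python
import numpy as np
import pandas as pd

# Shortened version of the question's data
df = pd.DataFrame({"Code": [10006, 10011, 10012], "Counts": [5, 2, 26]})

# Every code from the smallest to the largest, inclusive (arange's stop is exclusive)
full_range = np.arange(df["Code"].min(), df["Code"].max() + 1)

filled = (
    df.set_index("Code")
    .reindex(full_range)   # codes with no row get NaN in Counts
    .rename_axis("Code")
    .reset_index()
)
print(filled)
```

Because NaN is a float, the Counts column becomes float64 after reindexing.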
