Creating columns of a dataframe in for loop - python

I need help on a pandas issue:
I have a function which returns a pandas series as a result, which looks like the following:
date number
2018-01-01 1
2018-02-01 2
2018-03-01 3
2018-04-01 3
And I call this function within a for loop. The for loop does the following:
for i in range(len(category_list)):
    print(category_list[i])
    serie = filter_users_pageNS(data, new_index, category_list[i])
    serie = pd.DataFrame(serie).reset_index()
Where category_list is the list of parameters to pass to the filter_users_pageNS function, one per iteration.
My intention is to append the number column obtained on each iteration to a dataframe, so that at the end of the for loop I get the following dataframe:
date number_iteration_1 number_iteration_2
2018-01-01 1 4
2018-02-01 2 1
2018-03-01 3 5
2018-04-01 3 2
As you can see, the final dataframe has each iteration's number column added, named number_iteration_x.
Any ideas on how to get this final dataframe?
Thank you very much in advance.

IIUC, something along these lines should work:
df = df.reset_index()
for i in range(len(category_list)):
    print(category_list[i])
    serie = filter_users_pageNS(data, new_index, category_list[i])
    serie = pd.DataFrame(serie).reset_index()
    # assign only the 'number' column, so a single column is added each time
    df['number_iteration_' + str(i)] = serie['number']
Edit: Adjusted answer based on your new data
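Alternatively, a minimal sketch of the same idea without growing the frame column by column, assuming each call to filter_users_pageNS returns a Series indexed by date (data, new_index and category_list are the objects from the question):
import pandas as pd

# collect one Series per category, then build the frame in one step
results = {
    'number_iteration_' + str(i): filter_users_pageNS(data, new_index, cat)
    for i, cat in enumerate(category_list)
}
# concat aligns the Series on their shared date index and names the
# columns after the dictionary keys
final_df = pd.concat(results, axis=1).reset_index()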

Related

Subtract one datetime column after a groupby with a time reference for each group from a second Pandas dataframe

I have one dataframe df1 with one admissiontime for each id.
id admissiontime
1 2117-04-03 19:15:00
2 2117-10-18 22:35:00
3 2163-10-17 19:15:00
4 2149-01-08 15:30:00
5 2144-06-06 16:15:00
And another dataframe df2 with several datetimes for each id:
id datetime
1 2135-07-28 07:50:00.000
1 2135-07-28 07:50:00.000
2 2135-07-28 07:57:15.900
3 2135-07-28 07:57:15.900
3 2135-07-28 07:57:15.900
I would like to subtract, for each id, its specific admissiontime from the datetimes, storing the result in a column of the second dataframe.
I think I have to use something like df2.groupby('id')['datetime'] - something, but I struggle to connect it with df1.
Use Series.sub together with Series.map to look up each id's admissiontime in the other DataFrame:
df1['admissiontime'] = pd.to_datetime(df1['admissiontime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])

# map each id in df2 to its admissiontime from df1, then subtract it
df2['diff'] = df2['datetime'].sub(df2['id'].map(df1.set_index('id')['admissiontime']))
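For illustration, a minimal self-contained version of the same map/sub pattern (the values below are made up, not the poster's data):
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2],
                    'admissiontime': ['2117-04-03 19:15:00', '2117-10-18 22:35:00']})
df2 = pd.DataFrame({'id': [1, 1, 2],
                    'datetime': ['2135-07-28 07:50:00', '2135-07-28 08:50:00',
                                 '2135-07-28 07:57:15']})

df1['admissiontime'] = pd.to_datetime(df1['admissiontime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])

# look up each row's admissiontime via its id, then subtract it
df2['diff'] = df2['datetime'].sub(df2['id'].map(df1.set_index('id')['admissiontime']))
print(df2)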

Mapping two rows to one row in pandas

I have a dataframe a with 14 rows and another dataframe comp1sum with 7 rows. a has a date column covering 7 days at 12-hour intervals, which makes 14 rows. comp1sum has a column with 7 days.
This is the comp1sum dataframe
And this is the a dataframe
I want to map 2 rows of dataframe a to a single row of the comp1sum dataframe, so that one day of dataframe a is mapped to one day of the comp1sum dataframe.
I have the following code for that
j = 0
for i in range(0, 7):
    a.loc[i, 'comp1_sum'] = comp_sum.iloc[j]['comp1sum']
    a.loc[i, 'comp2_sum'] = comp_sum.iloc[j]['comp2sum']
    j = j + 1
And its output is
dt_truncated comp1_sum
3 2015-02-01 00:00:00 142.0
10 2015-02-01 12:00:00 144.0
12 2015-02-03 00:00:00 145.0
2 2015-02-05 00:00:00 141.0
14 2015-02-05 12:00:00 NaN
The code is mapping the days from comp1sum based on the index of a and not based on the dates of a. I want 2015-02-01 00:00:00 to have the value 139.0, 2015-02-02 00:00:00 to have the value 140.0, and so on, such that increasing dates have increasing values.
I am not able to map it in such a way. Please help.
Edit 1: As per @Ssayan's answer, I am getting this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-255-77e55efca5f9> in <module>
3 # use the sorted index to iterate through the sorted dataframe
4 for i, idx in enumerate(a.index):
----> 5 a.loc[idx, 'comp1_sum'] = b.iloc[i//2]['comp1sum']
6 a.loc[idx,'comp2_sum'] = b.iloc[i//2]['comp2sum']
IndexError: single positional indexer is out-of-bounds
Your issue is that your DataFrame a is not sorted by date, so index 0 does not match the earliest date. When you use loc it works with the value of the index, not the position of the row in the table, so even after sorting the DataFrame the issue remains if you keep indexing with loc[i].
One way out is to sort the DataFrame a by date and then use the sorted index to apply the values in the order you need.
# sort the dataframe by date
a = a.sort_values("dt_truncated")

# use the sorted index to iterate through the sorted dataframe
for i, idx in enumerate(a.index):
    a.loc[idx, 'val_1'] = b.iloc[i // 2]['val1']
    a.loc[idx, 'val_2'] = b.iloc[i // 2]['val2']
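As a self-contained toy illustration of the same idea (made-up values, not the original frames): after sorting, every two consecutive rows of a receive the values of one row of b.
import pandas as pd

a = pd.DataFrame({'dt_truncated': pd.to_datetime(
        ['2015-02-01 12:00', '2015-02-01 00:00',
         '2015-02-02 12:00', '2015-02-02 00:00'])})
b = pd.DataFrame({'val1': [139.0, 140.0], 'val2': [10.0, 11.0]})

# sort by date so the row order matches chronological order
a = a.sort_values('dt_truncated')

# i // 2 maps rows 0-1 of a to row 0 of b, rows 2-3 to row 1, and so on
for i, idx in enumerate(a.index):
    a.loc[idx, 'val_1'] = b.iloc[i // 2]['val1']
    a.loc[idx, 'val_2'] = b.iloc[i // 2]['val2']

print(a)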

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates with a corresponding NaN value in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column.
However, the bfill there replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code so that it fills in my missing dates. However, it is part of a programme which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex(), or does the assignment of today's date need changing?
Pandas has an asfreq function for a DatetimeIndex; it is basically a thin but convenient wrapper around reindex() which generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
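Regarding the follow-up "cannot reindex from a duplicate axis" error: asfreq needs a unique DatetimeIndex, so if re-running the script can append a date that already exists, one option (a sketch, assuming the most recently written value per date should win) is to drop duplicate dates first:
# keep only the last entry per date so the index passed to asfreq is unique
df.Date = pd.to_datetime(df.Date)
df = (df.drop_duplicates(subset='Date', keep='last')
        .set_index('Date')
        .asfreq('D')
        .reset_index())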
Pandas has a reindex method: given a list of indices, it keeps only the indices from that list, adding rows for any new ones.
In your case, you can create all the dates you want, with date_range for example, and pass them to reindex. You might also need a set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we use reindex with the full list of dates (given by date_range from the minimum to the maximum date in the 'Date' column, at daily frequency) as the new index. This produces NaNs in the places that had no former value.
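A minimal runnable version of this approach, using the sample data from the question:
import pandas as pd

df = pd.DataFrame({'Date': ['2021-05-01', '2021-05-05'],
                   'Portfoliovalue': [50000.0, 52304.0]})
df['Date'] = pd.to_datetime(df['Date'])

# build the full daily range and reindex onto it
full_range = pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')
out = (df.set_index('Date')
         .reindex(full_range)
         .rename_axis('Date')
         .reset_index())
print(out)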

Assign values to a pandas dataframe column filtered by index and column

I have a pandas DataFrame with a DateTime index and two columns called 'text' and 'labels'. I want to assign the value 50 to the 'labels' entries that have value 2 and lie within a given DateTime index range.
I tried using,
df[df['labels']==2]['2017-02-01 05:03:25+00:00':'2017-02-01 05:05:55+00:00']['labels']=50
I am able to view the DataFrame filtered by the DateTime index (rows) and columns, but I am not able to assign to it.
Also tried
df.loc[df['2017-03-13 00:00:00':'2017-03-23 00:00:00'], df['labels']==2]=50
but it threw an error
df looks like
created text labels
2017-02-01 05:03:25+00:00 break john cena eyelash grow 4
2017-02-01 05:05:55+00:00 eyelash tooooo much sweeti definit 2
2017-02-01 05:14:57+00:00 come eyelash 2
created is the DateTime index and 'text' and 'labels' are the columns of the DataFrame
df[df['labels']==2]['2017-02-01 05:03:25+00:00':'2017-02-01 05:05:55+00:00']['labels']
filters the DataFrame but doesn't assign to it if we set it equal to a value.
After assigning labels=50 for rows where created lies between '2017-02-01 05:03:25+00:00' and '2017-02-01 05:05:55+00:00' and labels equals 2, I expect the result to look like this:
created text labels
2017-02-01 05:03:25+00:00 break john cena eyelash grow 4
2017-02-01 05:05:55+00:00 eyelash tooooo much sweeti definit 50
2017-02-01 05:14:57+00:00 come eyelash 2
Let us use get_level_values:
s = df.index.get_level_values(0)
m = (s > '2017-02-01 05:03:25+00:00') & (s <= '2017-02-01 05:05:55+00:00')
df.loc[m & (df.labels == 2), 'labels'] = 50
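A small self-contained demonstration of this mask-based assignment, rebuilding a frame like the one in the question:
import pandas as pd

df = pd.DataFrame(
    {'text': ['break john cena eyelash grow',
              'eyelash tooooo much sweeti definit',
              'come eyelash'],
     'labels': [4, 2, 2]},
    index=pd.to_datetime(['2017-02-01 05:03:25+00:00',
                          '2017-02-01 05:05:55+00:00',
                          '2017-02-01 05:14:57+00:00']))
df.index.name = 'created'

s = df.index.get_level_values(0)
m = (s > '2017-02-01 05:03:25+00:00') & (s <= '2017-02-01 05:05:55+00:00')

# .loc with a boolean mask assigns in place, unlike chained indexing
df.loc[m & (df.labels == 2), 'labels'] = 50
print(df)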

Pandas: fill in a dataframe column with a series starting at a specific index

My dataframe looks like this:
time price
0 2019-02-01 00:07:00 0.00234135
1 2019-02-01 00:10:15 0.0023541
2 2019-02-01 00:13:30 0.00235838
3 2019-02-01 01:03:00 0.00236977
4 2019-02-01 01:07:00 0.00237751
What I did next was compute the MACD using the following code: macd, macd_signal, macd_histogram = ti.macd(data, 10, 20, 9)
I would now like to create a new column with the related MACD values: df['macd'] = pd.Series(macd). However, the first 20 values are used to compute the MACD, so there are no MACD values for the first 20 rows.
I should then create a column with the MACD values starting at index 20. I tried this: df.at[18, 'macd'] = pd.Series(macd), but it did not work; I got the following error message:
ValueError: setting an array element with a sequence.
Any help? Thanks!
Converting to a Series without a defined index is not a good idea, because the new Series may not align with the existing index:
df.loc[18:, 'macd'] = macd[18:]
Solution with pd.Series:
df.loc[18:, 'macd'] = pd.Series(macd, index=df.index)
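As a toy illustration of this positional assignment (using a rolling mean as a stand-in for the real MACD, since ti.macd is not available here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.arange(30, dtype=float)})

# stand-in for the indicator: aligned with df by position,
# NaN for the warm-up rows
indicator = df['price'].rolling(20).mean().to_numpy()

# assign only from row 18 onwards; earlier rows stay NaN
df.loc[18:, 'macd'] = indicator[18:]
print(df.tail())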
