Mapping two rows to one row in pandas - python

I have a dataframe a with 14 rows and another dataframe comp1sum with 7 rows. a has date column for 7 days in 12hr interval. So that makes it 14 rows. Also, comp1sum has a column with 7 days.
This is the comp1sum dataframe
And this is the a dataframe
I want to map 2 rows of a dataframe to single rows of comp1sum dataframe. So, that one day of dataframe a is mapped to one day of comp1sum dataframe.
I have the following code for that
j=0
for i in range(0,7):
a.loc[i,'comp1_sum'] = comp_sum.iloc[j]['comp1sum']
a.loc[i,'comp2_sum'] = comp_sum.iloc[j]['comp2sum']
j=j+1
And its output is
dt_truncated comp1_sum
3 2015-02-01 00:00:00 142.0
10 2015-02-01 12:00:00 144.0
12 2015-02-03 00:00:00 145.0
2 2015-02-05 00:00:00 141.0
14 2015-02-05 12:00:00 NaN
The code is mapping the days from comp1sum based on index of a and not based on dates of a. I want 2015-02-01 00:00:00 to have the values 139.0 and 2015-02-02 00:00:00 to have the value 140.0 and so on such that increasing dates have increasing values.
I am not able to map in such a way. please help.
Edit1- As per #Ssayan answer, I am getting this error-
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-255-77e55efca5f9> in <module>
3 # use the sorted index to iterate through the sorted dataframe
4 for i, idx in enumerate(a.index):
----> 5 a.loc[idx, 'comp1_sum'] = b.iloc[i//2]['comp1sum']
6 a.loc[idx,'comp2_sum'] = b.iloc[i//2]['comp2sum']
IndexError: single positional indexer is out-of-bounds

Your issue is that your DataFrame a is not sorted by date so the index 0 does not match the earliest date. When you use loc it uses the value of the index, not the order in which the table is, so even with sorting the DataFrame the issue remains.
One way out is to sort the DataFrame a by date and then to use the sorted index to apply the value in the order you need.
# sort the dataframe by date
a = a.sort_values("dt_truncated")
# use the sorted index to iterate through the sorted dataframe
for i, idx in enumerate(a.index):
a.loc[idx, 'val_1'] = b.iloc[i//2]['val1']
a.loc[idx,'val_2'] = b.iloc[i//2]['val2']

Related

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill the missing dates and have a corresponding NaN value in the Portfoliovalue column with NaN. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However the bfill replaces all my NaN's and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis”
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
Pandas has asfreq function for datetimeIndex, this is basically just a thin, but convenient wrapper around reindex() which generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
Pandas has reindex method: given a list of indices, it remains only indices from list.
In your case, you can create all the dates you want, by date_range for example, and then give it to reindex. you might needed a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
On first we set 'Date' column as index. Then we use reindex, it full list of dates (given by date_range from minimal date to maximal date in 'Date' column, with daily frequency) as new index. It result nans in places without former value.

How can I filter for rows one hour before and after a set timestamp in Python?

I am trying to filter a DataFrame to only show values 1-hour before and 1-hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame, but only rows for 1-hour before and 1-hour after a specified timestamp, and so only rows within this specified 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1-hour before 2011-07-14 06:15:00, and 1-hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter where I get rid of all values before 2011-07-14 05:15:00 and 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do with:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00","2011-01-15 03:40:00","2011-01-15 04:10:00","2011-01-15 04:40:00","2011-01-15 05:10:00","2011-01-15 07:10:00"],
"value":[1,2,3,4,5,6]}
df=pd.DataFrame(data)
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='ignore')
date_search= dt.datetime.strptime("2011-01-15 05:20:00",'%Y-%m-%d %H:%M:%S')
mask = (df['date'] > date_search-dt.timedelta(hours = 1)) & (df['date'] <= date_search+dt.timedelta(hours = 1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5

Boxplot of Multiindex df

I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds in that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).
Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02 NumericIndex 1611664
OriginMovementID 4744
DestinationMovementID 5084
MeanTravelTimeSeconds 1233
RangeLowerBoundTravelTimeSeconds 756
...
2020-03-31 DestinationMovementID 3594
MeanTravelTimeSeconds 1778
RangeLowerBoundTravelTimeSeconds 1601
RangeUpperBoundTravelTimeSeconds 1973
DayOfWeek Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot I guet a bucnh of errors after playing with different selections.
If I try to do df.stack().unstack() or df.stack().T I get then following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique to make the functions you want to work. I suggest a sequential number that resets at every change in the other two key columns.
import datetime as dt
import random
import numpy as np
cat = ["NumericIndex","OriginMovementID","DestinationMovementID","MeanTravelTimeSeconds",
"RangeLowerBoundTravelTimeSeconds"]
df = pd.DataFrame(
[{"Date":d, "Observation":cat[random.randint(0,len(cat)-1)],
"Value":random.randint(1000,10000)}
for i in range(random.randint(5,20))
for d in pd.date_range(dt.datetime(2016,1,2), dt.datetime(2016,3,31), freq="14D")])
# starting point....
df = df.sort_values(["Date","Observation"]).set_index(["Date","Observation"])
# generate an array that is sequential within change of key
seq = np.full(df.index.shape, 0)
s=0
p=""
for i, v in enumerate(df.index):
if i==0 or p!=v: s=0
else: s+=1
seq[i] = s
p=v
df["SeqNo"] = seq
# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
output
Value
Observation DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date SeqNo
2016-01-02 0 NaN NaN 2560.0 5324.0 5085.0
1 NaN NaN 1066.0 7372.0 NaN
2016-01-16 0 NaN 6226.0 NaN 7832.0 NaN
1 NaN 1384.0 NaN 8839.0 NaN
2 NaN 7892.0 NaN NaN NaN

Calculating moving median within group

I want to perform rolling median on price column over 4 days back, data will be groupped by date. So basically I want to take prices for a given day and all prices for 4 days back and calculate median out of these values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close with combining rolling and groupby:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
But this seems to add one row per each index value and by median definition, I am not able to somehow merge these rows to produce one result per row.
Result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just deleted 3 first values and then just printed price value.
Is it possible to get one lagged / moving median value per one date?
You can use rolling with a frequency window of 5 days to get today and last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original one), sort_values per date and ensure the date column is datetime
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
This is a step by step process. There are probably more efficient methods of getting what you want. Note, if you have time information for your dates, you would need to drop that information before grouping by date.
import pandas as pd
import statistics as stat
import numpy as np
# Replace with you data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the last four days to a single list
for i in range(3, len(prices), 1):
x = i - 1
y = i - 2
z = i - 3
four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through four_days list and calculate the median of the last for days for the current date
for i in range(len(four_days)):
medians.append(stat.median(four_days[i]))
# Create dummy zero values to add lists create to dataframe
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean data frame so you only have a single date a median value for past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)

Creating columns of a dataframe in for loop

I need help on a pandas issue:
I have a function which returns a pandas series as a result, which looks like the following:
date number
2018-01-01 1
2018-02-01 2
2018-03-01 3
2018-04-01 3
And I call this function within a for loop. The for loop does the following:
for i in range(len(category_list)):
print(category_list[i])
serie = filter_users_pageNS(data, new_index, category_list[i])
serie = pd.DataFrame(serie).reset_index()
Where category_list is just the number of iterations to be done that contains the parameter to be passed to the filter_users_pageNS function.
My intention is to append the number column obtained on each iteration in a dataframe, so that at the end of the for loop, I get the following dataframe:
date number_iteration_1 number_iteration_2
2018-01-01 1 4
2018-02-01 2 1
2018-03-01 3 5
2018-04-01 3 2
As you can see, the final dataframe has the number column added to the dataframe, with the name number_iteration_x.
Any ideas on how to get this final dataframe?
Thank you very much in advance.
IIUC, something along these lines should work:
df=df.reset_index()
for i in range(len(category_list)):
print(category_list[i])
serie = filter_users_pageNS(data, new_index, category_list[i])
serie = pd.DataFrame(serie).reset_index()
df['number_iteration_'+str(i)]=serie
Edit: Adjusted answer based on your new data

Categories

Resources