I have a pandas DataFrame where observations are broken out over two-day periods. The values in the 'Date' column each describe a range of two days (e.g. 2020-02-22 to 2020-02-23).
I want to split those Date values into individual days, with a row for each day. The closest I got was:
newdf = df_day.set_index(df_day.columns.drop('Date').tolist()).Date.str.split(' to ', expand=True).stack().reset_index().loc[:, df_day.columns]
The problem here is that the new date values are returned as NaNs. Is there a way to get this data broken out by individual day?
I might not be understanding, but based on the image each row already covers a single date as is, just poorly labeled. I would manipulate the index strings; if that isn't possible, I would create a new date column (or a new DataFrame with a clean date) and merge it.
You should be able to chop off the first 14 characters with a lambda, leaving you with the second listed date in the index.
I can't reproduce this, so bear with me.
df.rename(index=lambda s: s[14:])
#removes the first 14 characters from each row label,
#leaving just '2020-02-23' in row 2.
#If you must skip row 1: idx = df.index[1:]
#or df.iloc[1:].rename(index=lambda s: s[14:])
Otherwise, I would just replace it with a new datetime index.
didx = pd.date_range(start='2000-01-10', end='2020-02-26', freq='D')
#Make sure it is the same length as df
df = df.set_index(didx)
#Or
#df['new_date'] = didx.values
#df = df.set_index('new_date').drop(columns=['Date'])
#Or
#df = pd.concat([df.reset_index(drop=True), pd.Series(didx, name='new_date')], axis=1)
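Coming back to the original goal of one row per day: a split-and-explode sketch might be simpler than the set_index/stack attempt. This assumes pandas 0.25 or newer (for explode) and uses a toy df_day with made-up columns:

```python
import pandas as pd

# Toy frame standing in for df_day; column names are assumptions.
df_day = pd.DataFrame({
    "Date": ["2020-02-22 to 2020-02-23", "2020-02-24 to 2020-02-25"],
    "Value": [10, 20],
})

# Split each range on " to ", then explode so every day gets its own row;
# the other columns are duplicated automatically.
out = df_day.assign(Date=df_day["Date"].str.split(" to ")).explode("Date")
out["Date"] = pd.to_datetime(out["Date"])
print(out.reset_index(drop=True))
```

Because explode keeps all the non-split columns aligned, there is no NaN problem to work around.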
I have a pandas DataFrame extracted from Estespark Weather for the dates between Sep-2009 and Oct-2018, and the mean of the Average windspeed column is 4.65. I am taking a challenge with a sanity check that requires the mean of this column to be 4.64. How can I modify the values of this column so that its mean becomes 4.64? Is there a code solution for this, or do we have to do it manually?
I can see two solutions:
1. Subtract 0.01 (4.65 - 4.64) from every value of that column:
df['AvgWS'] -= 0.01
2. If you don't want to alter all rows: find which rows you could remove to give the desired mean (if any exist):
current_mean = 4.65
desired_mean = 4.64
n_rows = len(df['AvgWS'])
df['can_remove'] = df['AvgWS'].map(lambda x: (current_mean*n_rows - x)/(n_rows-1) == 4.64)
This will create a new boolean column in your DataFrame with True in the rows that, if removed, make the rest of the column's mean 4.64. If there is more than one, you can analyse them to choose which seems least important and remove that one.
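As a runnable sketch of that second check, on toy numbers (the AvgWS name follows the answer above; np.isclose is used instead of == because the exact float comparison will almost never hit):

```python
import pandas as pd
import numpy as np

# Toy data chosen so that dropping the first row yields a mean of 4.64.
df = pd.DataFrame({"AvgWS": [4.65, 4.60, 4.70, 4.63, 4.63]})

current_mean = df["AvgWS"].mean()
desired_mean = 4.64
n = len(df)

# A row x can be removed iff (current_mean*n - x) / (n - 1) == desired_mean;
# np.isclose dodges float round-off in that equality.
df["can_remove"] = np.isclose(
    (current_mean * n - df["AvgWS"]) / (n - 1), desired_mean
)
```

Here only the first row qualifies: dropping it leaves a mean of (4.60+4.70+4.63+4.63)/4 = 4.64.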
Ok, here is my situation (leaving out uninteresting things):
I have a DataFrame from a csv file, where I get information about the inventory of stores, like
Date,StoreID,…,InventoryCount
The rows are sorted by Date, but not by StoreID, and the number of stores can vary over this time series.
What I want:
I want to add a column to the DataFrame with the change in InventoryCount from one day to the previous one.
For that I was trying:
for name, group in df.groupby(["StoreID"]):
    for i in range(1, len(group)):
        group.loc[i, 'InventoryChange'] = group.loc[i, 'InventoryCount'] - group.loc[i-1, 'InventoryCount']
Your code explicitly iterates through the rows, which is a terrible idea in pandas, both aesthetically and performance-wise. It also assigns to group, which is only a copy, so the new column never reaches df. Instead, replace the loop with a single vectorised call on the original frame:
df['InventoryChange'] = df.groupby('StoreID')['InventoryCount'].diff(n)
where n is the number of days you are interested in: 1 in your example, 8 in your comment.
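A minimal runnable sketch of that one-liner, with made-up store data (note the assignment targets df itself, not the per-group copies):

```python
import pandas as pd

# Toy inventory data; column names follow the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"]
    ),
    "StoreID": ["A", "A", "B", "B"],
    "InventoryCount": [100, 90, 50, 55],
})

# Sort so diff compares consecutive days within each store, then take
# the per-store difference; the first row of each store comes out NaN.
df = df.sort_values(["StoreID", "Date"])
df["InventoryChange"] = df.groupby("StoreID")["InventoryCount"].diff()
```

Because groupby handles the per-store boundaries, no change "leaks" from one store's last day into the next store's first day.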
Rather than a question, this is more of my workaround for a problem I was having when reading tmy3 files. I hope that it can be of use to some of you. I am still a novice when it comes to coding and python, so there may be simpler ways.
PROBLEM
Upon using the iotools.read_tmy3() function, I followed examples and code outlined by others. This included:
1. Reading the tmy3 datafile;
2. Coercing the year to 2015 (or any other year you like); and,
3. Shifting my index by 30 minutes to match the solar radiation simulation.
The code I used was:
tmy_data, metadata = pvlib.iotools.read_tmy3(datapath, coerce_year=2015)
tmy_data = tmy_data.shift(freq='-30Min')['2015']
tmy_data.index.name = 'Time'
By using this code, you lose the final row of your DataFrame. Removing the ['2015'] in line two resolves this, but then the final date falls in 2014. For the purpose of my work, I needed the final index to remain consistent.
WHAT I TRIED
I attempted to shift index of the final row, unsuccessfully, using the shift method.
I attempted to reindex by setting the index of the final row equal to the DatetimeIndex I wanted.
I attempted to remove the timezone data from the index, modify the index, then reapply the timezone data.
All of these were overly complicated, and did not help resolve my issue.
WHAT WORKED FOR ME
The code for what I did is shown below. These were my steps:
1. Identify the index from my final row, and copy its data;
2. Drop the final row of my tmy_data DataFrame;
3. Create a new dataframe with the shifted date and copied data; and,
4. Append the row to my tmy_data DataFrame
It is a bit tedious, but it is an easy fix when reading multiple tmy3 files with multiple timezones.
#Remove the specific year from line 2 in the code above
tmy_data = tmy_data.shift(freq='-30Min')
#Identify the last row, and copy its data
last_date = tmy_data.iloc[[-1]].index
copied_data = tmy_data.iloc[[-1]].to_dict(orient='list')
#Drop the row with the incorrect index
tmy_data.drop(last_date, inplace=True)
#Shift the date of the last row forward by one year
last_date = last_date.shift(1, freq='A')
#Create a new DataFrame with the shifted date and copied data
last_row = pd.DataFrame(data=copied_data, index=last_date)
tmy_data = tmy_data.append(last_row)
After doing this, my final indices are:
2015-01-01 00:30:00-05:00
....
2015-12-31 22:30:00-05:00
2015-12-31 23:30:00-05:00
This way, the DataFrame contains 8760 rows, instead of the 8759 produced by the previous method.
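On recent pandas (2.0 and later) DataFrame.append has been removed, so the same last-row fix can be expressed with pd.concat. A minimal sketch on a made-up two-row frame; the GHI column name and the Etc/GMT+5 timezone are assumptions standing in for the real tmy3 data:

```python
import pandas as pd

# Stand-in for tmy_data after the -30min shift: the last stamp has
# fallen into the previous year.
idx = pd.DatetimeIndex(
    ["2015-12-31 22:30", "2014-12-31 23:30"], tz="Etc/GMT+5"
)
tmy_data = pd.DataFrame({"GHI": [0.0, 0.0]}, index=idx)

# Take the last row, move its timestamp forward one year, and glue it
# back on with pd.concat instead of the removed DataFrame.append.
last_row = tmy_data.iloc[[-1]]
last_row.index = last_row.index + pd.DateOffset(years=1)
tmy_data = pd.concat([tmy_data.iloc[:-1], last_row])
```

The data values are untouched; only the final index label changes year.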
I hope this helps others!
I am trying to replace a value in the data frame dh based on the data frame larceny.
If the date in larceny exists, I want to find the corresponding date in dh and replace the corresponding 5th column entry with 1.
I am currently (somewhat successfully) doing it with the below code but, it is taking forever. Any help on this?
When I try to compare the dates, the code does not work, so I compare the .value of the dates and this seems to work.
import pandas as pd
from datetime import datetime
for i, row in dh.iterrows():
    for j in range(45314):
        if dh.iat[i,0].value == larceny.iat[j,0].value:
            dh.iat[i,5] = 1
            print("Larceny")
            print(i, j)
            print(dh.iat[i,0], larceny.iat[j,0])
            print(dh.iat[i,0].value, larceny.iat[j,0].value, '\n\n')
Basically, dh has a cell for each hour of each day for 4 years. I want to populate the cell for each hour with a 1 in the "Is_larceny" column, if that corresponding year-month-day-hour appears in the larceny data frame.
Please help. I tried some pandas search methods but I was having a problem with comparing dates and searching and replacing properly.
Thanks.
dh.loc[dh['col1'].isin(larceny['col2']), 'col1'] = 1
This looks for any value in the dh['col1'] that also appears in larceny['col2'], then sets those values in dh['col1'] to 1. You will have to replace col1 and col2 with your respective column names.
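A small self-contained sketch of that isin pattern, adapted to the question's setup: the Date and Is_larceny column names are assumptions standing in for the real ones, and the flag column is filled rather than the date column being overwritten:

```python
import pandas as pd

# Toy hourly frame standing in for dh.
dh = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00"]
    ),
    "Is_larceny": [0, 0, 0],
})
# Toy larceny frame with one matching timestamp.
larceny = pd.DataFrame({"Date": pd.to_datetime(["2020-01-01 01:00"])})

# Vectorised membership test replaces the double loop over .iat cells.
dh.loc[dh["Date"].isin(larceny["Date"]), "Is_larceny"] = 1
```

Since isin compares whole Timestamp values at once, the .value workaround from the loop version is no longer needed.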
I was working on this question: filter iteration using FOR. I wonder how to replace the last cell of the year column in each csv file generated. Let's say I want to replace the last cell (of the year column) of each csv file with the current year (2018). I did the following:
for i, g in df.sort_values("year").groupby("Univers", sort=False):
    for y in g.iloc[-1, g.columns.get_loc('year')]:
        y = 2018
    g.to_csv('{}.xls'.format(i))
But I get the same column back without any changes. Any ideas how to do this?
The problem is twofold: first find the index of the last row, then replace the value at (last_row_idx, "year").
Try this
for i, g in df.sort_values("year").groupby("Univers", sort=False):
    last_row_idx = g.tail(1).index.item()  # first task: find the index
    g.at[last_row_idx, "year"] = 2018      # replace
    g.to_csv('{}.xls'.format(i))
Alternatively, one can use g.set_value(last_row_idx, "year", 2018) to set the value at a particular cell, though set_value is deprecated in recent pandas in favour of .at.
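A loop-free variant of the same idea, on toy data reusing the question's Univers and year column names: groupby(...).tail(1).index picks out each group's last row after sorting, and a single .loc edits them on the original frame:

```python
import pandas as pd

# Toy frame; 'Univers' and 'year' follow the question's column names.
df = pd.DataFrame({
    "Univers": ["a", "a", "b", "b", "b"],
    "year": [2001, 2002, 2000, 2003, 2001],
})

# After sorting by year, tail(1) per group yields each group's last row;
# .loc on those index labels updates the original frame in one shot.
df = df.sort_values("year")
last_rows = df.groupby("Univers", sort=False).tail(1).index
df.loc[last_rows, "year"] = 2018
```

This avoids mutating per-group copies entirely, which is what made the original loop appear to do nothing.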
Reference
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_value.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.at.html
Set value for particular cell in pandas DataFrame
Get index of a row of a pandas dataframe as an integer