I am new to Python and over the last few days I have taught myself how to open and operate on data stored in txt, xls and asc files with pandas, but I still get confused when doing operations on dataframes.
I have a .wac file which has the right formatting (it will later be used as an input file for a piece of software) but contains partially wrong values, and an .xlsx file containing the right values.
I have transferred the data to two dataframes with this code (I used skiprows to skip the string header in both files):
data_format = pd.read_csv('Example.wac', skiprows=11, delim_whitespace=True, names=["Date", "Hour", "Temp gap North [C]", "RH %"])  # skip the 11 string header lines
data_WUFI = pd.read_excel('Temperature_RH_North.xlsx', skiprows=1, header=None, dtype=float, names=["Hour", "Temp gap [C]", "RH %"])  # skip the header row, read values as floats
Now I need to make the following modifications to the dataframes, but I do not know where to start, and I hope I have come to the right place to seek help.
For data_format:
- the column 'Date' is in the format *2018-01-01* and runs to *2019-12-31*. Being a date, it stays the same for 24 rows and then increases by one day. I need to add rows to that column up to *2027-12-31* (without leap years)
- the column 'Hour' is in the format *01:00*, with values running from *01:00* to *24:00*. I need to add rows so that after every 24 hours the date in the first column increases by one day and the hour numbering restarts at *01:00*
- the column 'RH %' contains the same value in all rows, i.e. 0.5 (see the sketch below)
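A minimal sketch of how the extended frame could be built (the column names and the 0.5 value come from above; dropping Feb 29 is my reading of "without leap years"):
import pandas as pd

# daily dates up to 2027-12-31, with leap days removed
dates = pd.date_range('2018-01-01', '2027-12-31', freq='D')
dates = dates[~((dates.month == 2) & (dates.day == 29))]

# hours 01:00 .. 24:00, repeated for every day
hours = ['%02d:00' % h for h in range(1, 25)]

data_format_NEW = pd.DataFrame({
    'Date': [d.strftime('%Y-%m-%d') for d in dates for _ in hours],
    'Hour': hours * len(dates),
    'Temp gap North [C]': float('nan'),  # placeholder, filled from data_WUFI below
    'RH %': 0.5,
})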
I add a snapshot of data_format to make this clearer:
Once the new dataframe is created, e.g. data_format_NEW, I can substitute the values in 'Temp gap North [C]' with the correct values from data_WUFI (already of the right size):
data_format_NEW['Temp gap North [C]'] = data_WUFI['Temp gap [C]']
At that point I will write data_format_NEW in a .wac file:
data_format_NEW.to_csv('Example_NEW.wac', index=False, sep='\t')  # to_csv has no delim_whitespace argument; a separator must be given explicitly
but the first 12 rows will have to contain string values as in the picture:
I am not sure whether I got the planning right, but I hope I have explained myself clearly enough.
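A minimal sketch of the writing step, assuming the string header can simply be copied verbatim from the original file (adjust the line count: skiprows=11 above suggests it may be 11 rather than 12):
# copy the original string header, then append the data rows
with open('Example.wac') as src:
    header_lines = [next(src) for _ in range(12)]
with open('Example_NEW.wac', 'w') as dst:
    dst.writelines(header_lines)
# header=False assumes the column labels are already part of the copied header
data_format_NEW.to_csv('Example_NEW.wac', mode='a', index=False, header=False, sep='\t')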
Related
Using the ff_monthly.csv data set (https://github.com/alexpetralia/fama_french):
- Use the first column as an index (this contains the year and month of the data as a string).
- Create a new column 'Mkt' as 'Mkt-RF' + 'RF'.
- Create two new columns in the loaded DataFrame, 'Month' and 'Year', to contain the year and month of the dataset extracted from the index column.
- Create a new DataFrame with columns 'Mean' and 'Standard Deviation' and the full set of years from (b) above.
- Write a function which accepts (r_m, s_m), the monthly mean and standard deviation of a return series, and returns a tuple (r_a, s_a), the annualised mean and standard deviation. Use the formulae: r_a = (1 + r_m)^12 - 1 and s_a = s_m * 12^0.5.
- Loop through each year in the data and calculate the annualised mean and standard deviation of the new 'Mkt' column, storing each in the newly created DataFrame. Note that the values in the input file are % returns and need to be divided by 100 to give decimals (i.e. the value for August 2022 represents a return of -3.78%).
- Print the DataFrame and output it to a csv file.
Workings so far:
import pandas as pd

# read the data, using the first column (the year-month string) as the index
ff_monthly = pd.read_csv(r"file path", index_col=0)

# 'Mkt' is the sum of the 'Mkt-RF' and 'RF' columns
ff_monthly = ff_monthly.assign(Mkt=ff_monthly['Mkt-RF'] + ff_monthly['RF'])
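Continuing from the workings, a minimal sketch of how the remaining steps might look, assuming the index holds YYYYMM strings such as '202208' (adjust the slicing if the format differs; the output file name is a placeholder):
# extract year and month from the YYYYMM index strings
ff_monthly['Year'] = ff_monthly.index.astype(str).str[:4].astype(int)
ff_monthly['Month'] = ff_monthly.index.astype(str).str[4:6].astype(int)

def annualise(r_m, s_m):
    # r_a = (1 + r_m)^12 - 1, s_a = s_m * 12^0.5
    return (1 + r_m) ** 12 - 1, s_m * 12 ** 0.5

rows = []
for year, grp in ff_monthly.groupby('Year'):
    mkt = grp['Mkt'] / 100  # % returns -> decimals
    r_a, s_a = annualise(mkt.mean(), mkt.std())
    rows.append({'Year': year, 'Mean': r_a, 'Standard Deviation': s_a})

results = pd.DataFrame(rows).set_index('Year')
print(results)
results.to_csv('annualised_mkt.csv')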
There are a few things to pay attention to.
The Date is the index of your DataFrame. The index is treated in a special way compared to normal columns, which is why df.Date raises an AttributeError: Date is not an attribute but the index. Try df.index instead.
df.Date.str.split("_", expand=True) would work if your Date looked like 22_10. However, according to your picture it doesn't contain an underscore and also contains the day, so this cannot work.
In fact, the format you have doesn't follow any standard. The best way to deal with it properly is to parse it into a proper datetime64[ns] type that pandas understands, with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the Python documentation for supported format strings.
If all this works, it should be rather straightforward to create the columns
df['year'] = df.index.year  # a DatetimeIndex has .year directly; .dt is only for Series
In fact, this part has been asked before
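Putting the pieces together, a minimal sketch (assuming six-digit index strings such as '221005', to match the %y%m%d format above):
df.index = pd.to_datetime(df.index, format='%y%m%d')
df['year'] = df.index.year
df['month'] = df.index.month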
Each hour I get current weather data via an API. Current weather data is appended to the bottom of the dataframe:
df = df.append([current_weather], sort=False, ignore_index=True)
Current weather data includes precipitation totals for the past hour (precipitation_1h).
I also have a column for 24 hour precipitation totals (precipitation_1d). I want to calculate this value for the appended row only, not the entire column (so I'm not looking to use df.rolling).
Here's what I've tried...
This code skips the bottom row (i.e. the most recent precipitation_1h value):
df.at[current_weather['timestamp'], 'precipitation_1d'] = df.iloc[-1:-25 , df.columns.get_loc('precipitation_1h')].sum()
This code includes too many rows (I think all rows are included and bottom 25 counted twice):
df.at[current_weather['timestamp'], 'precipitation_1d'] = df.iloc[0:-25 , df.columns.get_loc('precipitation_1h')].sum()
This code works if I flip the order and add the new data as the first row of the dataframe (but I'd prefer to keep the new data at the bottom of the dataframe):
df.at[current_weather['timestamp'], 'precipitation_1d'] = df.iloc[:24 , df.columns.get_loc('precipitation_1h')].sum()
Any ideas?
UPDATE: G. Anderson's suggestion in the comments worked perfectly. He said it would be helpful to see the dataframe, so I've included a screenshot of the tail in case it helps anyone in the future. The screenshot shows the dataframe after I applied his solution. Before, all columns ending in 1d, 5d, 10d and 20d had NaN values; now they hold sums (precipitation) or means (humidity, temp).
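Since the comment itself isn't reproduced here, a minimal sketch of what the fix presumably amounts to: sum the last 24 rows including the freshly appended one, mirroring the df.at pattern above.
col = df.columns.get_loc('precipitation_1h')
df.at[current_weather['timestamp'], 'precipitation_1d'] = df.iloc[-24:, col].sum()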
I have a pandas dataframe where observations are broken out every two days. The values in the 'Date' column each describe a range of two days (e.g. 2020-02-22 to 2020-02-23).
I want to split those Date values into individual days, with a row for each day. The closest I got was:
newdf = df_day.set_index(df_day.columns.drop('Date',1).tolist()).Date.str.split(' to ', expand=True).stack().reset_index().loc[:, df_day.columns]
The problem is that the new date values are returned as NaNs. Is there a way to get this data broken out by individual day?
I might not be understanding, but based on the image it's a single date per row as is, just poorly labeled -- I would manipulate the index strings, and if I can't do that I would create a new date column, or new df w/ clean date and merge it.
You should be able to chop off the first 14 characters with a lambda, leaving you with the second listed date in the index.
I can't reproduce this, so bear with me.
df.rename(index=lambda s: s[14:])
#removes the first 14 characters from each row label,
#leaving just '2020-02-23' in row 2.
#If you must skip row 1: idx = df.index[1:]
#or df.iloc[1:].rename(index=lambda s: s[14:])
Otherwise, I would just replace it with a new datetime index.
didx = pd.date_range(start='2000-01-10', freq='D', end='2020-02-26')  # pd.DatetimeIndex no longer accepts start/end/freq
#Make sure same length as df
df.set_index(didx)
#Or
#df['new_date'] = didx.values
#df.set_index('new_date').drop(columns=['Date'])
#Or
#pd.concat([df.reset_index(), pd.Series(didx, name='new_date')], axis=1)
#(df.append has no axis argument, so concat is the tool for column-wise joins)
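Alternatively, a minimal sketch using str.split plus explode (assuming the 'Date' strings look like '2020-02-22 to 2020-02-23'; explode needs pandas 0.25+):
# one row per day: split the range into a list, then explode it
newdf = df_day.assign(Date=df_day['Date'].str.split(' to ')).explode('Date')
newdf['Date'] = pd.to_datetime(newdf['Date'])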
Rather than a question, this is more of my workaround for a problem I was having when reading tmy3 files. I hope that it can be of use to some of you. I am still a novice when it comes to coding and python, so there may be simpler ways.
PROBLEM
Upon using the iotools.read_tmy3() function, I followed examples and code outlined by others. This included:
1. Reading the tmy3 datafile;
2. Coercing the year to 2015 (or any other year you like); and,
3. Shifting my index by 30 minutes to match the solar radiation simulation.
The code I used was:
tmy_data, metadata = pvlib.iotools.read_tmy3(datapath, coerce_year=2015)
tmy_data = tmy_data.shift(freq='-30Min')['2015']
tmy_data.index.name = 'Time'
Using this code, you lose the final row of your DataFrame. Removing the ['2015'] in line two resolves this, but then the final date ends up in 2014. For the purpose of my work, I needed the final index to remain consistent.
WHAT I TRIED
I attempted, unsuccessfully, to shift the index of the final row using the shift method.
I attempted to reindex by setting the index of the final row equal to the DatetimeIndex I wanted.
I attempted to remove the timezone data from the index, modify the index, then reapply the timezone data.
All of these were overly complicated, and did not help resolve my issue.
WHAT WORKED FOR ME
The code for what I did is shown below. These were my steps:
1. Identify the index from my final row, and copy its data;
2. Drop the final row of my tmy_data DataFrame;
3. Create a new dataframe with the shifted date and copied data; and,
4. Append the row to my tmy_data DataFrame
It is a bit tedious, but it is an easy fix when reading multiple tmy3 files with multiple timezones.
#Remove the specific year from line 2 in the code above
tmy_data = tmy_data.shift(freq='-30Min')
#Identify the last row, and copy its data
last_date = tmy_data.iloc[[-1]].index
copied_data = tmy_data.iloc[[-1]].to_dict(orient='list')
#Drop the row with the incorrect index
tmy_data.drop(last_date, inplace=True)
#Shift the date of the last row forward by one year
last_date = last_date.shift(n=1, freq='A')
#Create a new DataFrame with the shifted date and copied data, and append it
last_row = pd.DataFrame(data=copied_data, index=last_date)
tmy_data = tmy_data.append(last_row)
After doing this, my final indices are:
2015-01-01 00:30:00-05:00
....
2015-12-31 22:30:00-05:00
2015-12-31 23:30:00-05:00
This way the DataFrame contains 8760 rows, as opposed to 8759 with the previous method.
I hope this helps others!
I am taking 30 days of historical data and modifying it.
Hopefully, I can read the historical data and have it refer to a rolling 30-day window. The 'DateTime' value is a column in the raw data.
df_new = pd.read_csv(loc + filename)
max_date = df_new['DateTime'].max()
date_range = max_date - pd.Timedelta(30, unit='d')
df_old = pd.read_hdf(loc + filename, 'TableName', where='DateTime > date_range')  # filter on the DateTime column
Then I would read the new data, which is a separate file that always contains month-to-date values (all of June so far, for example; this file is replaced daily with the latest data), and concat it to the old dataframe.
frames = [df_old, df_new]
df = pd.concat(frames)
Then I do some things to the data (I check whether certain values repeat within a 30 day window; if they do, I place a timestamp in a column).
Now I want to add this modified data back into my original file (it was HDF5, but it could be a .sqlite file too), i.e. into df_old. There are bound to be a ton of duplicates, since I am reading the previous 30 days of data as well as the MTD data. How do I manage this?
My only solution is to read the entire file (df_old along with the new data I added), drop duplicates, and then overwrite it again. This isn't very efficient.
Can the .sqlite or .hdf formats enforce non-duplicates? If so, I have 3 columns which together identify a unique row (Date, EmpID, CustomerID). I do not want exact duplicate rows.
Define them as primary keys in sqlite. It won't allow you to insert rows whose primary key values already exist.
e.g.
CREATE TABLE mytable (
    a INT,
    b INT,
    c INT,
    PRIMARY KEY(a, b)
);
won't allow you to add duplicates of (a, b) to the data ("table" itself is a reserved word, hence the placeholder name). Then use INSERT OR IGNORE to add data, and any duplicates will be ignored.
http://sqlite.org/lang_insert.html
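A minimal sketch of the round trip from pandas, assuming a table keyed on the three identifying columns from the question (the file and table names are placeholders):
import sqlite3

con = sqlite3.connect('data.sqlite')
con.execute("""
    CREATE TABLE IF NOT EXISTS records (
        Date TEXT,
        EmpID INTEGER,
        CustomerID INTEGER,
        PRIMARY KEY (Date, EmpID, CustomerID)
    )
""")

# INSERT OR IGNORE silently skips rows whose primary key already exists;
# df is the combined dataframe from the question (other columns would be
# added to the table and the INSERT in the same way)
rows = df[['Date', 'EmpID', 'CustomerID']].itertuples(index=False, name=None)
con.executemany(
    "INSERT OR IGNORE INTO records (Date, EmpID, CustomerID) VALUES (?, ?, ?)",
    rows,
)
con.commit()
con.close()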