I have a .csv file in which data is stored for date ranges (from and to date columns). However, I would like to create a daily data frame with Python out of it.
The time can be ignored, as a gasday always starts at 6am and ends at 6am.
My idea was to end up with a data frame indexed by date (e.g. ranging from March 1st, 2019 to December 31st, 2019 at daily granularity).
I would create columns from the unique values of the identifier and fill in the respective values, or NaN where none apply.
The latter part I can easily do with pd.pivot_table, but my problem with the time range remains...
Any ideas of how to cope with that?
[image: time-ranged data frame]
It should look like this, just with rows at daily granularity, taking the to column into account as well. Maybe with range?
[image: output should look similar to this, just with a different period]
You can use pandas and group by the column you want:
df = pd.read_csv("yourfile.csv")
groups = df.groupby("periodFrom")
groups.get_group("2019-03-09 06:00")
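The grouping alone will not expand the ranges to daily rows, though. A minimal sketch for that part (assuming the range columns are called periodFrom and periodTo, and that identifier and value are placeholder column names for your data): build a date range per row, explode it, then pivot.
import pandas as pd

df = pd.read_csv("yourfile.csv", parse_dates=["periodFrom", "periodTo"])

# One row per gas day: normalize() drops the 6am time component
df["date"] = [
    list(pd.date_range(start.normalize(), end.normalize(), freq="D"))
    for start, end in zip(df["periodFrom"], df["periodTo"])
]
daily = df.explode("date").drop(columns=["periodFrom", "periodTo"])

# One column per identifier, NaN where no value applies
result = daily.pivot_table(index="date", columns="identifier", values="value")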
Related
I'm trying to pull some data from yfinance in Python for different funds from different exchanges. In pulling my data I just set up the start and end dates through:
start = '2002-01-01'
end = '2022-06-30'
and pulling it through:
assets = ['GOVT', 'IDNA.L', 'IMEU.L', 'EMMUSA.SW', 'EEM', 'IJPD.L', 'VCIT',
'LQD', 'JNK', 'JNKE.L', 'IEF', 'IEI', 'SHY', 'TLH', 'IGIB',
'IHYG.L', 'TIP', 'TLT']
assets.sort()
data = yf.download(assets, start = start, end = end)
I guess you've noticed that the "assets" (the ETFs) come from different exchanges, with suffixes such as ".L" or ".SW".
Now the result looks like this:
It seems to me that there is no overlap for a single instrument (i.e. no two prices for the same day), so I don't think the data will be disturbed by any scrubbing or clean-up.
So my goal is to harmonize or consolidate the prices onto a date index rather than a date-and-time index, so that the prices of all instruments sit side by side for a particular date.
Thanks!
If you want the daily last closing price from the yahoo-finance API, you can use the interval argument:
yf.download(assets, start=start, end=end, interval="1d")
Solution with Pandas:
Transforming the Index
You have an index where each row is a string representing the datetime. First, transform those strings into an actual DatetimeIndex, where each row is of type datetime64. This makes it easy to work with dates in your dataset by applying functions from the datetime library. Finally, pick the date from each datetime64:
data.index = pd.to_datetime(data.index).date
Groupby
Now that you have an index of dates you can group by the index. First, deal with the NaN values. If the closing price should only be used to fill values within the same date, apply:
data = data.groupby(data.index).ffill()
Otherwise, if you think that the closing price of (e.g.) October 1st can be used to fill NaN values not only on October 1st but also on October 2nd and 3rd, simply apply ffill() without the groupby:
data = data.ffill()
Lastly, take the last observed record per date (index). Note that you can apply any function you want here, even a custom lambda:
data = data.groupby(data.index).last()
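Putting the steps together, a minimal end-to-end sketch (nothing new here, just the steps above combined on the data from yf.download):
import pandas as pd
import yfinance as yf

data = yf.download(assets, start=start, end=end)

# Collapse the datetime index to plain dates
data.index = pd.to_datetime(data.index).date

# Forward-fill within each date, then keep the last observation per date
data = data.groupby(data.index).ffill()
data = data.groupby(data.index).last()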
I have a csv-file (called "cameradata") with the columns MeetingRoomID and Time (there are more columns, but they should not be needed).
I would like to get the number of occurrences of a certain MeetingRoomID ("14094020", from the column "MeetingRoomID") during one day. The csv-file luckily only consists of timestamps from one day in the "Time" column. One problem is that the timestamps are in the datetime format %H:%M:%S and I want to categorize the occurrences by the hour they occurred (between 07:00-18:00).
The goal is to link the occurrences to the hours of the timestamps, so that I can plot a barplot with x = "timestamps (hourly)" and y = "a dataframe/series that maps the certain MeetingRoomID to the hour it was used".
How can I get a function for my y-axis that understands that the value_count for ID 14094020 and the timestamps are connected?
So far I've come up with something like this:
y = cameradata.set_index('Time').resample('H')
cameradata['MeetingRoomID'].value_counts()[14094020]
Each line seems to work on its own, but I do not know how to connect them in a syntax-friendly way.
Clarification:
The code: cameradata['MeetingRoomID'].value_counts().idxmax() revealed the ID with the most occurrences, so I think I'm onto something there.
Grateful for your help!
This is what the print of the DataFrame looks like; 'Tid' is time and 'MätplatsID' is what I called MeetingRoomID.
For some reason the "Time" column gained a made-up year and month when I converted it to datetime. I converted it with: kameradata['Tid'] = pd.to_datetime(kameradata['Tid'], format='%H:%M:%S')
This is an example of what the output looks like in the end
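A minimal sketch of one way to connect the two pieces (assuming the column names 'Tid' and 'MätplatsID' from the printout, and a hypothetical file name "kameradata.csv"): extract the hour from each timestamp and count rows per hour for the one room.
import pandas as pd
import matplotlib.pyplot as plt

kameradata = pd.read_csv("kameradata.csv")  # hypothetical file name
kameradata['Tid'] = pd.to_datetime(kameradata['Tid'], format='%H:%M:%S')

# Keep only the room of interest, then count occurrences per hour of day
mask = kameradata['MätplatsID'] == 14094020  # may need '14094020' if the IDs are strings
counts = kameradata.loc[mask, 'Tid'].dt.hour.value_counts().sort_index()

# Restrict to 07:00-18:00 and plot
counts = counts.reindex(range(7, 19), fill_value=0)
counts.plot(kind='bar', xlabel='Hour', ylabel='Occurrences')
plt.show()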
I have a dataframe, called PORResult, of daily temperatures where rows are years and each column is a day (121 rows x 365 columns). I also have an array, called Percentile_90, of a threshold temperature for each day (length=365). For every day in every year in the PORResult dataframe I want to find out if the value for that day is higher than the value for that day in the Percentile_90 array, and store the results in a new dataframe, called Count (121 rows x 365 columns). To start, the Count dataframe is full of zeros, but if the daily value in PORResult is greater than the daily value in Percentile_90, I want to change the daily value in Count to 1.
This is what I'm starting with:
for i in range(len(PORResult)):
    if PORResult.loc[i] > Percentile_90[i]:
        CountResult[i] += 1
But when I try this I get KeyError:0. What else can I try?
(Edited:)
Depending on your data structure, I think
CountResult = PORResult.gt(Percentile_90, axis=1).astype(int)
should do the trick (axis=1 aligns the 365-length threshold array with the 365 day columns). Generally, the toolset provided in pandas is rich enough that for-looping over a dataframe is unnecessary (as well as remarkably inefficient).
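A tiny self-contained check of that broadcasting, with made-up data (3 "years" by 4 "days"):
import numpy as np
import pandas as pd

PORResult = pd.DataFrame(np.random.rand(3, 4))  # toy stand-in for 121 x 365
Percentile_90 = np.random.rand(4)               # one threshold per day column

# 1 where the day's value exceeds that day's threshold, else 0
CountResult = PORResult.gt(Percentile_90, axis=1).astype(int)
print(CountResult)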
I have downloaded ten open datasets of air pollution for 2010-2019 (loaded into pandas DataFrames with read_csv) that have some missing values.
The rows are ordered by day, with several items per day (like PM2.5, SO2, ...). Most days include 17 or 18 items. There are 27 columns: Year, Station, Item, 00, 01, ..., 23.
In this case, I already used
df.fillna(np.nan).apply(lambda x: pd.to_numeric(x, errors='coerce'))
and df.interpolate(axis=1, inplace=True)
But if the data have missing values from '00' up to some later hour, the interpolate function does not work. If I want to fill all these blanks, I need to merge in the last non-null day's data and use interpolate again.
However, different days have different numbers of items, which means there are still some rows that can't be filled.
In a nutshell, I'm now trying to concat all the data keyed by item and use interpolate.
By the way, after data cleaning, I would like to apply xgboost and linear regression to predict PM2.5. Is there any recommended way to deal with the data?
(Or any demo code online?)
For example, the data would be like:
[image: one of the datasets]
I used df.groupby('date').size() and got
[image: size of different days]
Or in other words, how do I split the different days and concat them back together? Groupby(['date','items'])? And then how do I merge?
Or is it possible to interpolate from the last value of the previous row?
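One possible sketch (assuming the hour columns are literally named '00' through '23' and that 'date' and 'Item' are the grouping columns; adjust to the real names): melt the hour columns into one long series per item, so interpolation can run across day boundaries instead of stopping at each row.
import pandas as pd

hours = [f'{h:02d}' for h in range(24)]
df[hours] = df[hours].apply(pd.to_numeric, errors='coerce')

# One row per (date, item, hour) instead of 24 hour columns
long = df.melt(id_vars=['date', 'Item'], value_vars=hours,
               var_name='hour', value_name='value')
long = long.sort_values(['Item', 'date', 'hour'])

# Interpolate each item's series across the whole timeline, so a gap at
# the start of a day borrows from the previous day's last value
long['value'] = long.groupby('Item')['value'].transform(
    lambda s: s.interpolate(limit_direction='both'))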
I'm using pytrends to do some basic data analysis. Here's my code:
from pytrends import dailydata
df = dailydata.get_daily_data('blockchain', 2020, 4, 2020, 5, geo = '')
The result looks like this:
I'm struggling to understand the meaning of each column. Can anyone explain them?
I had the same doubt and found the correct answer at these pages:
https://raw.githubusercontent.com/GeneralMills/pytrends/master/pytrends/dailydata.py
https://medium.com/@yanweiliu/getting-the-google-trends-data-with-python-67b335e7d1cf
Returns:
complete (pd.DataFrame): Contains 4 columns.
The column named after the word argument contains the daily search
volume already scaled and comparable through time.
The column f'{word}_unscaled' is the original daily data fetched
month by month, and it is not comparable across different months
(but is comparable within a month).
The column f'{word}_monthly' contains the original monthly data
fetched at once. The values in this column have been backfilled
so that there are no NaN present.
The column 'scale' contains the scale used to obtain the scaled
daily data.
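If I read that docstring right, the columns can be sanity-checked against each other; a small sketch, assuming word='blockchain' as in the question:
# The scaled daily column should be (approximately) the unscaled daily
# data multiplied by the monthly scale factor
check = df['blockchain_unscaled'] * df['scale']
print((check - df['blockchain']).abs().max())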