The following routine retrieves the data file and loads it into a DataFrame:
import wget
import pandas as pd

wget.download("https://www.aaii.com/files/surveys/sentiment.xls", "C:/temp/sentiment.xls")
df = pd.read_excel("C:/temp/sentiment.xls", sheet_name="SENTIMENT", skiprows=3,
                   parse_dates=['Date'], date_format='%m-%d-%y', index_col='Date')
The first three data lines are incomplete, so I can slice them off with df[3:].
At about line 1640 there is a blank line, and I want to skip the rest of the file after it. I tried to find that line and get its index (so I could do another slice) like this, but I get nan for the index value:
df[df.isnull().all(1)].index.values[0]
How can I find that line and skip the rest of the file?
I think you have two NaN-row problems in this file:
1. The first row after the header is already an empty row, which produces a NaN index.
2. The reason for your post: the empty row that marks the end of the data you're interested in.
First, import the data as you did:
df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')
df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
NaN NaN NaN NaN ... NaN NaN NaN
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66
Then remove the first empty row (the NaN index), problem No. 1:
df = df[1:]
df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66
1987-08-07 00:00:00 0.56 0.15 0.29 ... 323.00 316.23 323.00
Now you want to select all rows before the first NaN index, problem No. 2.
Idea: create a boolean array with True entries for all NaN indices, cast it to integer, and build the cumulative sum. That gives an array which is 0 for all the data you want and >0 from the first unwanted line to the end.
Tested against 0, this returns a boolean index for your data:
data_idx = df.index.isna().astype(int).cumsum() == 0
Applied to your dataframe:
df[data_idx]
Bullish Neutral ... Low Close
Date ...
1987-06-26 00:00:00 NaN NaN ... NaN NaN
1987-07-17 00:00:00 NaN NaN ... 307.63 314.59
1987-07-24 00:00:00 0.360000 0.500000 ... 307.81 309.27
1987-07-31 00:00:00 0.260000 0.480000 ... 310.65 318.66
1987-08-07 00:00:00 0.560000 0.150000 ... 316.23 323.00
... ... ... ... ...
2018-10-11 00:00:00 0.306061 0.339394 ... 2784.86 2785.68
2018-10-18 00:00:00 0.339350 0.310469 ... 2710.51 2809.21
2018-10-25 00:00:00 0.279693 0.310345 ... 2651.89 2656.10
2018-11-01 00:00:00 0.379310 0.275862 ... 2603.54 2711.74
2018-11-08 00:00:00 0.412844 0.275229 ... 2700.44 2813.89
[1635 rows x 12 columns]
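A compact variant of the same idea (just a sketch, with nan_pos and df_clean as illustrative names, assuming df is the frame exactly as read above, i.e. before the df[1:] slice): locate the positions of the NaN index labels once and slice between the first one (right after the header) and the second one (the end-of-data marker).
import numpy as np

# positions of the blank rows, i.e. rows whose index label is NaT/NaN
nan_pos = np.flatnonzero(df.index.isna())

# keep everything between the first blank row and the next one
df_clean = df.iloc[nan_pos[0] + 1 : nan_pos[1]]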
I have a dataset that marks the occurrences of an event every minute for four years. Here's a sample:
In [547]: result
Out[547]:
uuid timestamp col1 col2 col3
0 100 2016-03-30 00:00:00+02:00 NaN NaN NaN
1 100 2016-03-30 00:01:00+02:00 NaN NaN NaN
2 100 2016-03-30 00:02:00+02:00 NaN NaN NaN
3 100 2016-03-30 00:03:00+02:00 1.49 1.79 0.979
4 100 2016-03-30 00:04:00+02:00 NaN NaN NaN
... ... ... .. ...
1435 100 2016-03-30 23:55:00+02:00 NaN NaN NaN
1436 100 2016-03-30 23:56:00+02:00 1.39 2.19 1.09
1437 100 2016-03-30 23:57:00+02:00 NaN NaN NaN
1438 100 2016-03-30 23:58:00+02:00 NaN NaN NaN
1439 100 2016-03-30 23:59:00+02:00 NaN NaN NaN
[1440 rows x 5 columns]
I am trying to compute summary statistics over the non-blank rows, aggregated into six-hour windows. The resample() function works great for this. Here's a sample:
In [548]: result = result.set_index('timestamp').tz_convert('Europe/Berlin').resample('6h', label='right', closed='right', origin='start_day').agg(['mean', 'last', 'count']).iloc[:,-9:]
Out[548]:
col1_mean col1_last ... col3_last times_changed
timestamp ...
2016-03-30 00:00:00+02:00 NaN NaN ... NaN 0
2016-03-30 07:00:00+02:00 1.0690 1.069 ... 1.279 1
2016-03-30 13:00:00+02:00 1.0365 1.009 ... 1.239 4
2016-03-30 19:00:00+02:00 1.0150 0.989 ... 1.209 5
2016-03-30 01:00:00+02:00 1.1290 1.129 ... 1.329 1
[5 rows x 7 columns]
This looks great and is the format I'd like to work with. However, when I run my code on all data (spanning many years), here's an excerpt of what the output looks like:
In [549]: result
Out[549]:
col1_mean col1_last ... col3_last times_changed
timestamp ...
2016-03-27 00:00:00+01:00 NaN NaN ... NaN 0
2016-03-27 07:00:00+02:00 1.0690 1.069 ... 1.279 1
2016-03-27 13:00:00+02:00 1.0365 1.009 ... 1.239 4
2016-03-27 19:00:00+02:00 1.0150 0.989 ... 1.209 5
2016-03-28 01:00:00+02:00 1.1290 1.129 ... 1.329 1
[5 rows x 7 columns]
The new index takes DST into consideration and throws everything off by an hour. I would like the new times to still be between 0–6, 6–12 etc.
Is there a way to coerce my dataset to adhere to a 0–6, 6–12 format? If there's an extra hour, maybe the aggregations from that could still be tucked into the 0–6 range?
The timezone I'm working with is Europe/Berlin and I tried converting everything to UTC. However, values are not at their right date or time — for example, an occurrence at 00:15hrs would be 23:15hrs the previous day, which throws off those summary statistics.
Are there any creative solutions to fix this?
Have you tried this? I think it should work.
(It first converts to the local timezone and then drops the timezone info with .tz_localize(None).)
result = result.set_index('timestamp').tz_convert('Europe/Berlin').tz_localize(None).resample('6h', label='right', closed='right', origin='start_day').agg(['mean', 'last', 'count']).iloc[:,-9:]
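To see why this helps, here is a minimal sketch with made-up hourly data around the 2016-03-27 spring-forward in Europe/Berlin (not the poster's dataset): with the timezone kept, the 6-hour bins are spaced in absolute time and their labels drift by an hour after the switch, while the tz-naive version keeps plain wall-clock 00:00 / 06:00 / 12:00 bins.
import pandas as pd

# one reading per hour across the DST transition (02:00 local does not exist)
idx = pd.date_range('2016-03-26 22:00', '2016-03-27 12:00', freq='h', tz='Europe/Berlin')
s = pd.Series(1, index=idx)

# tz-aware: bin edges are 6 absolute hours apart, so labels jump to 07:00 after the switch
print(s.resample('6h', origin='start_day').count())

# tz-naive: bins follow the wall clock and stay at 00:00 / 06:00 / 12:00
print(s.tz_localize(None).resample('6h', origin='start_day').count())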
I have several CSV files in a directory of folders and subfolders. All the CSV files have a header row and a timestamp as the first column, whether or not time-series data is present. I want to read all the CSV files and report a status of empty if no data is present.
When I use df.empty to check, it returns False even when there is no data (the file has only the header row and the first column of timestamps).
The code I used is:
import pandas as pd
df1 = pd.read_csv("D://sirifort_with_data.csv", index_col=0)
df2 = pd.read_csv("D://sirifort_without_data.csv", index_col=0)
print(df1.empty)
print(df2.empty)
print(df2)
The result is:
False
False
PM2.5(ug/m3) PM10(ug/m3) ... NOx(ppb) NH3(ug/m3)
Time_Stamp ...
26/02/2022 0:00 NaN NaN ... NaN NaN
26/02/2022 0:15 NaN NaN ... NaN NaN
26/02/2022 0:30 NaN NaN ... NaN NaN
26/02/2022 0:45 NaN NaN ... NaN NaN
26/02/2022 1:00 NaN NaN ... NaN NaN
26/02/2022 1:15 NaN NaN ... NaN NaN
26/02/2022 1:30 NaN NaN ... NaN NaN
26/02/2022 1:45 NaN NaN ... NaN NaN
26/02/2022 2:00 NaN NaN ... NaN NaN
26/02/2022 2:15 NaN NaN ... NaN NaN
26/02/2022 2:30 NaN NaN ... NaN NaN
26/02/2022 2:45 NaN NaN ... NaN NaN
[12 rows x 6 columns]
Use the sums of the columns: for a frame that holds only NaNs, every column sums to zero.
def col_check(col):
    # any column with a non-zero sum means the frame holds real data
    if df[col].sum() != 0:
        return 1

for col in df.columns:
    if col_check(col):
        print('not empty')
        break
The documentation clearly indicates:
If we only have NaNs in our DataFrame, it is not considered empty! We
will need to drop the NaNs to make the DataFrame empty:
df = pd.DataFrame({'A' : [np.nan]})
df.empty
False
and then suggests:
df.dropna().empty
True
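Applied to the files from the question (a sketch; has_data is just an illustrative helper name), dropping the all-NaN rows before the check gives the expected answer. Using how='all' is a little stricter than the plain dropna() from the quote above: it only discards rows that contain no readings at all.
import pandas as pd

def has_data(path):
    # timestamp column as the index, as in the question
    df = pd.read_csv(path, index_col=0)
    # a file with only a header row and timestamps becomes empty once
    # the rows consisting entirely of NaNs are dropped
    return not df.dropna(how='all').empty

print(has_data("D://sirifort_with_data.csv"))     # expected: True
print(has_data("D://sirifort_without_data.csv"))  # expected: False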
I'm trying to loop over multiple JSON responses and, for each value in the list, add it to the DataFrame. For each JSON response I create a column header. I always seem to get data only for the last column, so I believe there is something wrong with the way I append the data.
from pycoingecko import CoinGeckoAPI
import pandas as pd

cg = CoinGeckoAPI()
df = pd.DataFrame()
timePeriod = 120

for x in range(10):
    try:
        data = cg.get_coin_market_chart_by_id(id=geckoList[x],
                                              vs_currency='btc', days='timePeriod')
        for y in range(timePeriod):
            df = df.append({geckoList[x]: data['prices'][y][1]},
                           ignore_index=True)
        print(geckoList[x])
    except:
        pass
geckoList example:
['bitcoin',
'ethereum',
'xrp',
'bitcoin-cash',
'litecoin',
'binance-coin']
Example JSON of one of the coins:
'prices': [[1565176840078, 0.029035263522626625],
[1565177102060, 0.029079747150763842],
[1565177434439, 0.029128983083947863],
[1565177700686, 0.029136960678700433],
[1565178005716, 0.0290826667213779],
[1565178303855, 0.029173025688296675],
[1565178602640, 0.029204331218623796],
[1565178911561, 0.029211943928343167],
The expected result would be a DataFrame with columns and rows of data for each crypto coin. Right now only the last column shows data.
Currently, it looks like this:
bitcoin ethereum bitcoin-cash
0 NaN NaN 0.33
1 NaN NaN 0.32
2 NaN NaN 0.21
3 NaN NaN 0.22
4 NaN NaN 0.25
5 NaN NaN 0.26
6 NaN NaN 0.22
7 NaN NaN 0.22
OK, I think I found the issue.
The problem is that you append records row by row that contain only one column each, so all the other columns are filled with NaN. What I think you want is to join the columns by their timestamp, which is what I do in the example below. Let me know if this is what you need:
from pycoingecko import CoinGeckoAPI
import pandas as pd
cg = CoinGeckoAPI()
timePeriod = 120
gecko_list = ['bitcoin',
'ethereum',
'xrp',
'bitcoin-cash',
'litecoin',
'binance-coin']
data = {}
for coin in gecko_list:
    try:
        nested_lists = cg.get_coin_market_chart_by_id(
            id=coin, vs_currency='btc', days='timePeriod')['prices']
        data[coin] = {}
        data[coin]['timestamps'], data[coin]['values'] = zip(*nested_lists)
    except Exception as e:
        print(e)
        print('coin: ' + coin)

frame_list = [pd.DataFrame(
                  data[coin]['values'],
                  index=data[coin]['timestamps'],
                  columns=[coin])
              for coin in gecko_list
              if coin in data]
df = pd.concat(frame_list, axis=1).sort_index()
df.index = pd.to_datetime(df.index, unit='ms')
print(df)
This gets me the output:
bitcoin ethereum bitcoin-cash litecoin
2019-08-07 12:20:14.490 NaN NaN 0.029068 NaN
2019-08-07 12:20:17.420 NaN NaN NaN 0.007890
2019-08-07 12:20:21.532 1.0 NaN NaN NaN
2019-08-07 12:20:27.730 NaN 0.019424 NaN NaN
2019-08-07 12:24:45.309 NaN NaN 0.029021 NaN
... ... ... ... ...
2019-08-08 12:15:47.548 NaN NaN NaN 0.007578
2019-08-08 12:18:41.000 NaN 0.018965 NaN NaN
2019-08-08 12:18:44.000 1.0 NaN NaN NaN
2019-08-08 12:18:54.000 NaN NaN NaN 0.007577
2019-08-08 12:18:59.000 NaN NaN 0.028144 NaN
[1153 rows x 4 columns]
This is the data I get if I switch days to 180.
To get daily data, use the groupby function:
df = df.groupby(pd.Grouper(freq='D')).mean()
On a data frame of 5 days, this gives me:
bitcoin ethereum bitcoin-cash litecoin
2019-08-03 1.0 0.020525 0.031274 0.008765
2019-08-04 1.0 0.020395 0.031029 0.008583
2019-08-05 1.0 0.019792 0.029805 0.008360
2019-08-06 1.0 0.019511 0.029196 0.008082
2019-08-07 1.0 0.019319 0.028837 0.007854
2019-08-08 1.0 0.018949 0.028227 0.007593
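As an aside, since the frame has a DatetimeIndex, an equivalent spelling of the same daily aggregation is resample, which may read a little more naturally:
df = df.resample('D').mean()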
I currently have some time-series data to which I applied a rolling mean with a window of 17520.
Before, the head of my data looked like this:
SETTLEMENTDATE ...
0 2006/01/01 00:30:00 8013.27833 ... 5657.67500 20.03
1 2006/01/01 01:00:00 7726.89167 ... 5460.39500 18.66
2 2006/01/01 01:30:00 7372.85833 ... 5766.02500 20.38
3 2006/01/01 02:00:00 7071.83333 ... 5503.25167 18.59
4 2006/01/01 02:30:00 6865.44000 ... 5214.01500 17.53
And now it looks like this:
SETTLEMENTDATE ...
0 2006/01/01 00:30:00 NaN ... NaN NaN
1 2006/01/01 01:00:00 NaN ... NaN NaN
2 2006/01/01 01:30:00 NaN ... NaN NaN
3 2006/01/01 02:00:00 NaN ... NaN NaN
4 2006/01/01 02:30:00 NaN ... NaN NaN
How can I get it so that my data only begins where there is no NaN (while also making sure that the dates still match)?
You can try rolling with min_periods=1:
data['NSW DEMAND'] = data['NSW DEMAND'].rolling(17520, min_periods=1).mean()
Also, try using a for loop; you do not need to write the columns out one by one:
youcols = ['xxx', ..., 'xxx1']   # your column names
for x in youcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
Based on your comments:
for x in youcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
then:
data = data.dropna(subset=youcols, thresh=1)
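If you would rather keep the strict 17520-sample window, so that values only appear once a complete window exists and the frame starts there (closer to what the question asks), a sketch along these lines should also work; youcols is the same placeholder column list as above:
# min_periods defaults to the window size, so the first 17519 rows stay NaN
for x in youcols:
    data[x] = data[x].rolling(17520).mean()

# drop the warm-up rows where every rolled column is still NaN;
# rows are dropped whole, so SETTLEMENTDATE stays aligned with the rolled values
data = data.dropna(subset=youcols, how='all')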
I have a table that looks like this:
Index Group_Id Period Start Period End Value Value_Count
42 1016833 2012-01-01 2013-01-01 127491.00 17.0
43 1016833 2013-01-01 2014-01-01 48289.00 9.0
44 1016833 2014-01-01 2015-01-01 2048.00 2.0
45 1016926 2012-02-01 2013-02-01 913.00 1.0
46 1016926 2013-02-01 2014-02-01 6084.00 5.0
47 1016926 2014-02-01 2015-02-01 29942.00 3.0
48 1016971 2014-03-01 2015-03-01 0.00 0.0
I am trying to end up with a 'wide' df where each Group_Id has one observation and the value/value counts are converted to columns that correspond to their respective period, in order of recency. So the end result would look like:
Index Group_Id Value_P0 Value_P1 Value_P2 Count_P0 Count_P1 ...
42 1016833 2048.00 48289.00 127491.00 2.0 9.0
45 1016926 29942.00 6084.00 913.00 3.0 5.0
48 1016971 0.0 0.00 0.0 0.0 0.0
Where Value_P0 is the most recent value, Value_P1 is the next most recent value after that, and the Count columns work the same way.
I've tried pivoting the table so that the Group_Ids form the index, Period Start supplies the columns, and Value or Value_Count is the corresponding cell:
Period Start 2006-07-01 2008-07-01 2009-02-01 2009-12-17 2010-02-01 2010-06-01 2010-07-01 2010-08-13 2010-09-01 2010-12-01 ... 2016-10-02 2016-10-20 2016-12-29 2017-01-05 2017-02-01 2017-03-28 2017-04-10 2017-05-14 2017-08-27 2017-09-15
Group_Id
1007310 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1007318 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1007353 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
This way each Group_Id is a single record, but I would then need to loop through the many columns of each row and pull out the non-NaN values, whose order corresponds to oldest to newest. This seems like the wrong way to go about it, though.
I've also considered grouping by Group_Id and somehow creating a timedelta relative to the most recent date, then pivoting/unstacking so that the columns are the timedeltas and the values are Value or Value_Count. I'm not sure how to do this, though. I appreciate the help.
Still using pivot:
df['ID'] = df.groupby('Group_Id').cumcount()
d1 = df.pivot(index='Group_Id', columns='ID', values='Value').add_prefix('Value_P')
d2 = df.pivot(index='Group_Id', columns='ID', values='Value_Count').add_prefix('Count_P')
pd.concat([d1, d2], axis=1).fillna(0)
Out[347]:
ID Value_P0 Value_P1 Value_P2 Count_P0 Count_P1 Count_P2
Group_Id
1016833 127491.0 48289.0 2048.0 17.0 9.0 2.0
1016926 913.0 6084.0 29942.0 1.0 5.0 3.0
1016971 0.0 0.0 0.0 0.0 0.0 0.0
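If, as in the desired output above, P0 should be the most recent period rather than the oldest, one option (a sketch, assuming Period Start sorts chronologically) is to number the rows per group in reverse with cumcount(ascending=False):
# sort so the numbering is deterministic, then give the newest period ID 0
df = df.sort_values(['Group_Id', 'Period Start'])
df['ID'] = df.groupby('Group_Id').cumcount(ascending=False)

d1 = df.pivot(index='Group_Id', columns='ID', values='Value').add_prefix('Value_P')
d2 = df.pivot(index='Group_Id', columns='ID', values='Value_Count').add_prefix('Count_P')
wide = pd.concat([d1, d2], axis=1).fillna(0)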