pandas resample dealing with missing data - python

I am using pandas to work with monthly data that have some missing values. I would like to use the resample method to compute annual statistics, but only for years with no missing data.
Here is some code and output to demonstrate:
import pandas as pd
import numpy as np
dates = pd.date_range(start='1980-01', periods=24, freq='M')
df = pd.DataFrame([np.nan] * 10 + list(range(14)), index=dates)
Here is what I obtain if I resample :
In [18]: df.resample('A').mean()
Out[18]:
0
1980-12-31 0.5
1981-12-31 7.5
I would like to have np.nan for the 1980-12-31 index, since that year does not have a value for every month. I tried playing with the aggregation arguments, but had no luck.
How can I accomplish this?

I'm sure there's a better way, but in this case you can use:
df.resample('A').agg(['mean', 'count', 'size'])
and then drop all rows where count != size.
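For reference, a minimal sketch of the full recipe against the current pandas API (which selects the aggregation by chaining rather than a how argument): size counts every slot in an annual bin, while count counts only the non-missing ones, so comparing the two flags incomplete years.
mean_ = df.resample('A').mean()
count_ = df.resample('A').count()
size_ = df.resample('A').size()
# keep the annual mean only where no month in the bin is missing
result = mean_[0].where(count_[0] == size_)
This leaves np.nan for 1980 and 7.5 for 1981, as requested.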

Related

Finding the Difference between Rows and Formatting Into New DF

This is my first time working with dataframes in Python. Here is the question:
"Calculate the difference in the number of Occupied Housing Units from year to year and print it. The difference must be calculated for the consecutive years such as 2008-2009, 2009-2010 etc. Finally, print the values in the ascending order."
Below is my code. It works, but I know there has to be a more efficient way than the brute-force, manual data entry method I used:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/unt-iialab/INFO5731_Spring2020/master/Assignments/Assignment1_denton_housing.csv'
data = pd.read_csv(url, index_col=0)
A = data.loc[data["title_field"] == "Occupied Housing Units", ["title_field", "value"]]
data_2 = [['2008-2009', 35916-36711], ['2009-2010', 41007-35916], ['2010-2011', 40704-41007],
          ['2011-2012', 42108-40704], ['2012-2013', 43673-42108], ['2013-2014', 46295-43673]]
B = pd.DataFrame(data_2, columns=['Years', 'Difference'])
sort_by_diff = B.sort_values('Difference')
print(sort_by_diff)
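For what it's worth, here is a vectorized sketch of the same computation using Series.diff(), assuming (as the snippet above suggests) that the CSV's first column is the year, used as the index, alongside title_field and value columns:
occupied = data.loc[data["title_field"] == "Occupied Housing Units", "value"].sort_index()
diff = occupied.diff().dropna().astype(int)
# label each difference with the pair of consecutive years it spans
years = [f"{a}-{b}" for a, b in zip(occupied.index[:-1], occupied.index[1:])]
B = pd.DataFrame({"Years": years, "Difference": diff.to_numpy()})
print(B.sort_values("Difference"))
This removes the hand-typed values entirely.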

Pandas ValueError on df.join(), then TypeError on pd.concat()

I'm sure I have a simple error here but I'm not seeing it. Maybe a fresh pair of eyes can pick it out in a minute or two. I've been working on various solutions for a while now and I need some help.
I have a pandas dataframe and I'm attempting to add a calculated column based on an existing column in that dataframe.
import pandas as pd
df = pd.read_csv('mydata.csv')
df.date = pd.to_datetime(df.date, format='%d.%m.%Y %H:%M:%S.%f')
# calculate Simple Moving Average with a 20 day window
sma = df.close.rolling(window=20).mean()
res = df.join(sma, on='date', how='left', lsuffix='_left', rsuffix='_right')
print(res)
ValueError: You are trying to merge on datetime64[ns] and int64
columns. If you wish to proceed you should use pd.concat
Ok, so I tried using pd.concat:
import pandas as pd
df = pd.read_csv('mydata.csv')
df.date = pd.to_datetime(df.date, format='%d.%m.%Y %H:%M:%S.%f')
# calculate Simple Moving Average with 20 days window
sma = df.close.rolling(window=20).mean()
frames = [df, sma]
res = pd.concat(frames)
print("Printing result of concat")
print(res)
TypeError: concat() missing 1 required positional argument: 'objs'
What positional argument is needed? I can't figure this out based on the research and documentation I've seen online.
Thanks in advance.
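Since the rolling mean here is computed from a column of the same dataframe, it already shares that dataframe's index, so no join, merge, or concat is needed at all. A minimal sketch, assuming close is a column of df as above (the name sma20 is just an illustration):
# plain column assignment aligns on the shared index
df['sma20'] = df.close.rolling(window=20).mean()
If you do want pd.concat, it must receive a list of objects and be told to stack column-wise: res = pd.concat([df, sma], axis=1). As for the TypeError, that is what Python raises when pd.concat() is called with no arguments at all, which suggests the code that actually ran differed from the snippet shown.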

Filter each column by having the same value three times or more

I have a dataset with dates as the index, where each column is an item name and each value is a count. I'm trying to figure out how to filter each column for runs of more than 3 consecutive days where the count is zero. I was thinking of using a for loop; any help is appreciated. I'm using Python for this project.
I'm fairly new to Python. So far I have tried for loops, but did not get them to work in any way.
for i in a.index:
    if a.loc[i, 'name'] == 3 == df.loc[i + 1, 'name'] == df.loc[i + 2, 'name']:
        print(a.loc[i, "name"])
This fails with:
Cannot add integral value to Timestamp without freq.
It would be better if you included a sample dataframe and your desired output in your question; please do so next time. As it is, I have to guess what your data looks like and may not be answering your question. I assume the values are integers. Does your dataframe have a row for every day? I will assume it might not, and make it so that every day in the last delta days has a row. I created a sample dataframe like this:
import pandas as pd
import numpy as np
import datetime
# Here I am just creating random data from your description
delta = 365
start_date = datetime.datetime.now() - datetime.timedelta(days=delta)
end_date = datetime.datetime.now()
datetimes = [end_date - diff for diff in [datetime.timedelta(days=i) for i in range(delta,0,-1)]]
# This is the list of dates we will have in our final dataframe (includes all days)
dates = pd.Series([date.strftime('%Y-%m-%d') for date in datetimes], name='Date', dtype='datetime64[ns]')
# random integer dataframe
df = pd.DataFrame(np.random.randint(0, 5, size=(delta,4)), columns=['item' + str(i) for i in range(4)])
df = pd.concat([df, dates], axis=1).set_index('Date')
# Create a missing day
df = df.drop(df.loc['2019-08-01'].name)
# Reindex so that index has all consecutive days
df = df.reindex(index=dates)
Now that we have a sample dataframe, the rest is straightforward. I check whether each value in the dataframe equals 0 and then take a rolling sum with a window of 4 (> 3). This way I can avoid for loops. The resulting dataframe has all the rows where at least one of the items had a value of 0 for 4 consecutive rows. If an item is 0 for more than window consecutive rows, it will show up as multiple rows whose dates are just one day apart. I hope that makes sense.
# custom function as I want "np.nan" returned if a value does not equal "test_value"
def equals(df_value, test_value=0):
    return 1 if df_value == test_value else np.nan
# apply the function to every value in the dataframe
# for each cell, take the rolling sum over a window of four rows (> 3)
df = df.applymap(equals).rolling(window=4).sum()
# if there was np.nan in the sum, the sum is np.nan, so it can be dropped
# keep the rows where there is at least 1 value
df = df.dropna(thresh=1)
# drop all columns that don't have any values
df = df.dropna(thresh=1, axis=1)
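As a side note, the same rolling-window idea can be written with boolean arithmetic instead of applymap; a sketch, assuming a fresh copy of the sample dataframe is called df:
# True where the current row and the previous three rows are all zero
mask = df.eq(0).rolling(window=4).sum().eq(4)
mask can then be used per column to locate or filter the offending stretches.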

Extract data between two dates each year

I have a time series of daily data from 2000 to 2015. What I want is a single time series that contains only the data from April 15 to June 15 of each year (because that is the period relevant for my analysis).
I have already written code to do this myself, given below:
import pandas as pd
df = pd.read_table(myfilename, delimiter=",", parse_dates=['Date'], na_values=-99)
dff = df[df['Date'].apply(lambda x: x.month>=4 and x.month<=6)]
dff = dff[dff['Date'].apply(lambda x: x.day>=15 if x.month==4 else True)]
dff = dff[dff['Date'].apply(lambda x: x.day<=15 if x.month==6 else True)]
I think this code is quite inefficient, as it has to operate on the dataframe three times to get the desired subset.
I would like to know the following two things:
Is there an inbuilt pandas function to achieve this?
If not, is there a more efficient and better way to achieve this?
Let the dataframe look like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': pd.date_range('2000-01-01', periods=365*10, freq='D'),
                   'Value': np.random.random(365*10)})
Create a series of dates with the year set to the same value:
x = df.Date.apply(lambda d: pd.Timestamp(2000, d.month, d.day))
Filter using this series to select from the dataframe:
df[(x >= pd.Timestamp(2000, 4, 15)) & (x <= pd.Timestamp(2000, 6, 15))]
Try this:
index = pd.date_range("2000/01/01", "2016/01/01")
s = index.to_series()
s[(s.dt.month * 100 + s.dt.day).between(415, 615)]
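Applied to the asker's dataframe, the same month * 100 + day trick becomes a two-liner; a sketch, assuming a Date column of datetimes as in df above:
# 415 = April 15, 615 = June 15; between() is inclusive on both ends
key = df['Date'].dt.month * 100 + df['Date'].dt.day
dff = df[key.between(415, 615)]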

Adding columns to a dataframe where all other columns are periods

I have a timeseries dataframe with a PeriodIndex. I would like to use its values as column names in another dataframe and also add other columns that are not Periods. The problem is that when I create the dataframe with only Periods as the column index, adding a column whose key is a string raises an error. However, if I create the dataframe with a column index that mixes Periods and strings, then I am able to add columns with string keys.
import numpy as np
import pandas as pd
data = np.random.normal(size=(5, 2))
idx = pd.Index(pd.period_range(2011, 2012, freq='A'), name='year')
df = pd.DataFrame(data, columns=idx)
df['age'] = 0
This raises an error.
import numpy as np
import pandas as pd
data = np.random.normal(size=(5, 2))
idx = pd.Index(pd.period_range(2011, 2012, freq='A'), name='year')
df = pd.DataFrame(columns=idx.tolist() + ['age'])
df = df.iloc[:, :-1]
df[:] = data
df['age'] = 0
This does not raise an error and gives my desired outcome, but this way I can't assign the data conveniently when I create the dataframe. I would like a more elegant way of achieving the result. I wonder if this is a bug in pandas?
Not really sure what you are trying to achieve, but here is one way to get what I understood you wanted:
import pandas as pd
idx = pd.Index(pd.period_range(2011, 2015, freq='A'), name='year')
df = pd.DataFrame(index=idx)
df1 = pd.DataFrame({'age': ['age']}).set_index('age')
df = pd.concat([df, df1]).T
print(df)
Which gives:
Empty DataFrame
Columns: [2011, 2012, 2013, 2014, 2015, age]
Index: []
And it keeps your years as Periods:
df.columns[0]
Period('2011', 'A-DEC')
The same result most likely can be achieved using .merge.
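Alternatively, the frame can be built in one shot by forcing the column index to plain object dtype up front, so mixed Period and string labels are allowed from the start; a sketch, reusing data and idx from the question:
# object-dtype columns accept Period and str keys alike
df = pd.DataFrame(data, columns=pd.Index(idx.tolist(), dtype=object))
df['age'] = 0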
