Trying to create a dataframe from another dataframe with certain restrictions - python

I am trying to write a VAR model in Python, and I am not allowed to use pre-made functions such as VAR from statsmodels.
For this I need the matrix of the dependent variables.
I have a data set of 3 government bonds, all with different maturities.
The data is imported and treated as follows:
import numpy as np
import pandas as pd
# importing file
df = pd.read_csv("C://Users/raymond/Desktop/Econometrie3/us_tbills_8019.csv")
# dropping years > 1999
df = df.iloc[:240]
# calculating log differences
Dates = pd.to_datetime(df['DATE'], format='%Y-%m-%d')
mData = df[['GS10','GS5','GS1']]
mData.index = pd.DatetimeIndex(Dates)
AllData = mData
logdif = np.log(mData).diff().shift(1).dropna()
To create the matrix Y, which is the matrix of dependent variables, I want to take the values of logdif with range i = 1:K and j = P+1:T-1.
I tried the following to create my matrix:
# variables
K = 3
T = df.shape[0]
P = 4
# matrix of Dependent Variable
Y = logdif
def functionY():
    for i in range(1, 3, 1):
        for j in range(P+1, T-1, 1):
            Y[i][j-1] = logdif[i][j]
    return Y
I have tried other ways to find the matrix as well, with none working.
Any tips for creating my matrix?

I cannot follow all of your code (so there may be other problems), but I notice that you assign your dataframe logdif to a second variable Y and then write values to Y. Be aware that this also modifies logdif, because the assignment only creates a second reference to the same object. (This doesn't look like what you want, since you're reading from logdif at the same time.)
Use Y = logdif.copy() to avoid that.
Edit: Also, range(1,3,1) is identical to range(1,3).
Edit: Also, you mention that you are trying to create a matrix; the variable you're using right now is still a DataFrame (np.log preserves the type of the argument here). You can use Y = np.matrix(logdif) to change this, or Y = logdif.to_numpy() for a plain array, since NumPy no longer recommends the np.matrix class.
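To make the reference-versus-copy point concrete, here is a minimal sketch. It uses made-up random numbers in place of the real log differences, and the row range taken for Y (everything from row P onward, 0-based) is only an assumption about what the question intends.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the log-differenced yields (values are made up)
rng = np.random.default_rng(0)
logdif = pd.DataFrame(rng.normal(size=(10, 3)), columns=["GS10", "GS5", "GS1"])

alias = logdif        # plain assignment: a second name for the same object
Y = logdif.copy()     # independent copy: writes to Y leave logdif untouched

original_value = logdif.iloc[0, 0]
Y.iloc[0, 0] = 999.0  # only the copy changes

P = 4
# Rows P onward and all K columns, as a plain NumPy array (assumed row range)
Y_matrix = logdif.iloc[P:].to_numpy()
```

Slicing with iloc replaces the double loop entirely, so no element-by-element assignment is needed.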

Related

Question about best way to structure my data in specific loop and saving

I am new to python and programming. I started this summer and have been trying to use it to do chemistry research.
My goal:
Currently, my goal is to make a loop that runs a simulation that generates two float64s and a list. The float64s contain x and y values for a plot and I am looking to save each output as x and y coordinates and then save them to a single data structure so I can call on them later for machine learning (haven't gotten to that point yet).
What I was going to try:
I was told that a list might be the best way to store the values for this because I will be generating something like 10000 iterations of this with about 1000 x and y points each iteration. I was thinking I could have a loop to store each set of x and y as lists in two other lists and at the end change them into some sort of csv file. I am worried this will be very inefficient and thought I could put the question out there.
My current code:
x_list = []
y_list = []
count = 0
while count < 2:
    df = *this is a function I am not sure I can share the name of*
    x_arr = df[0]
    y_arr = df[1]
    k = df[2]
    x_axis = x_arr.tolist()
    y_axis = y_arr.tolist()
    x_list.append(x_axis)
    y_list.append(y_axis)
    count += 1
The data and data types:
print(x_arr.dtype)
print(y_arr.dtype)
print(type(k))
print(len(x_list), len(y_list))
float64
float64
<class 'list'>
2 2
I would also appreciate recommendations for resources where I can learn more about these questions.

Python: Only exponentiate array if values is not equal to zero

I have the below function in which I run several regressions. Some estimated coefficients are outputted as '0s' and naturally when they're exponentiated they turn into '1s'.
Ideally, I would have sm.OLS() output 'blanks' rather than 'zeros' in those cases where the estimated coefficient is zero. But I've tried and this doesn't seem possible.
So, alternatively, I would prefer to keep zeros rather than 1s. This would require not exponentiating the zeros in this line of the code: exp_coefficients=np.exp(results.params)
How could I do this?
import numpy as np
import pandas as pd
import statsmodels.api as sm
df_index = []
coef_mtr = [] # start with an empty list
for x in df_main.project_x.unique():
    df_holder=df_main[df_main.project_x == x]
    X = df_holder.drop(['unneeded1', 'unneeded2','unneeded3'], axis=1)
    X['constant']=1
    Y = df_holder['sales']
    eq=sm.OLS(Y, X)
    results=eq.fit()
    exp_coefficients=np.exp(results.params)
    # print(exp_coefficients)
    coef_mtr.append(exp_coefficients)
    df_index.append(x)
coef_mtr = np.array(coef_mtr)
# create a dataframe with this data
df_columns = [f'coef_{n}' for n in range(coef_mtr.shape[1])]
df_matrix=pd.DataFrame(data = coef_mtr, index = df_index, columns = df_columns)
The cleanest way is probably the where keyword argument of np.exp (not the np.where function), as in
out = np.exp(in_,where=in_!=0)
This skips all zero values. Skipping really means skipping, though: the corresponding entries of out are left uninitialized. We therefore need to preset them to zero:
out = np.zeros_like(in_)
np.exp(in_,where=in_!=0,out=out)
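A small self-contained sketch of that pattern, using a made-up coefficient vector. The name params is only illustrative; with a pandas Series such as results.params, one would presumably convert it with .to_numpy() first.

```python
import numpy as np

# Made-up coefficients; two of them are exactly zero
params = np.array([0.5, 0.0, -1.2, 0.0])

# Preset the output to zeros so the skipped positions hold a defined value
out = np.zeros_like(params)

# Exponentiate only where the coefficient is non-zero
np.exp(params, where=params != 0, out=out)
```

Note that np.zeros_like inherits the dtype of its argument, so the input should be a float array; an integer array would give an integer output buffer that np.exp cannot write into.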

Efficient way to calculate pandas columns with pymap3d?

I have a pandas dataframe with columns of geo-coordinates and I am using pymap3d to convert the locations to other coordinate systems. A typical function I have implemented to do this is:
def append_enu(df, observerlla):
    e = []
    n = []
    u = []
    for index, row in df.iterrows():
        tmpe, tmpn, tmpu = pm.geodetic2enu( row["lat_deg"], row["lon_deg"], row["alt_m"], *observerlla )
        e.append(tmpe)
        n.append(tmpn)
        u.append(tmpu)
    df["enu_e_m"] = e
    df["enu_n_m"] = n
    df["enu_u_m"] = u
    return df
This works, but I find it extremely slow. (My tables have over 700000 rows, and I am adding conversions for 3 different coordinate systems.) Is there a "more pythonic" way to do this that properly takes advantage of the ways pandas allows dataframe manipulation?
Here's the solution I came up with, which modifies the original DataFrame instead of returning a new one from a function:
(df['enu_e_m'], df['enu_n_m'], df['enu_u_m']) = pm.geodetic2enu(df['lat_deg'], df['lon_deg'], df['alt_m'], *observerlla)
So, yeah, "it just works": geodetic2enu accepts whole columns, so no explicit loop is needed.
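The same whole-column pattern can be sketched without pymap3d installed. The function offsets_from_observer below is a hypothetical stand-in that just subtracts the observer position; it is NOT a real ENU conversion and only illustrates calling a NumPy-aware function once on entire columns instead of once per row.

```python
import numpy as np
import pandas as pd

def offsets_from_observer(lat_deg, lon_deg, alt_m, lat0, lon0, alt0):
    """Hypothetical stand-in: plain differences, not a geodetic conversion."""
    lat = np.asarray(lat_deg, dtype=float)
    lon = np.asarray(lon_deg, dtype=float)
    alt = np.asarray(alt_m, dtype=float)
    return lat - lat0, lon - lon0, alt - alt0

df = pd.DataFrame({
    "lat_deg": [47.0, 47.5, 48.0],
    "lon_deg": [-122.0, -122.5, -123.0],
    "alt_m": [10.0, 20.0, 30.0],
})
observerlla = (47.0, -122.0, 0.0)

# Row-by-row version, kept only for comparison
slow = [offsets_from_observer(r.lat_deg, r.lon_deg, r.alt_m, *observerlla)
        for r in df.itertuples()]

# Whole-column version: one call, three new columns
df["d_lat"], df["d_lon"], df["d_alt"] = offsets_from_observer(
    df["lat_deg"].to_numpy(), df["lon_deg"].to_numpy(), df["alt_m"].to_numpy(),
    *observerlla,
)
```

Both versions give the same numbers; the second avoids the per-row overhead of iterating over the dataframe.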

how to create multiple variables with similar name in for loop?

I had a problem with for loops earlier, and it was solved thanks to @mak4515. However, there is something else I want to accomplish:
# Use pandas to read in csv file
data_df_0 = pd.read_csv('puget_sound_ctd.csv')
#create data subsets based on specific buoy coordinates
data_df_1 = pd.read_csv('puget_sound_ctd.csv', skiprows=range(9,114))
data_df_2 = pd.read_csv('puget_sound_ctd.csv', skiprows=([i for i in range(1, 8)] + [j for j in range(21, 114)]))
for x in range(0,2):
    for df in [data_df_0, data_df_2]:
        lon_(x) = df['longitude']
        lat_(x) = df['latitude']
This is my current code. I want it to read the different data sets and create different variables based on the data set it is reading. However, when I run the code this way I get this error:
File "<ipython-input-66-446aebc48604>", line 21
lon_(x) = df['longitude']
^
SyntaxError: can't assign to function call
What does "can't assign to function call" mean, and how do I fix this?
I think the comment by @Chris is probably a good way to go. I wanted to point out that since you're already using pandas dataframes, an easier way might be to add a column that identifies the original dataframe and then concatenate them.
import pandas as pd
data_df_0 = pd.DataFrame({'longitude':range(-125,-120,1),'latitude':range(45,50,1)})
data_df_0['dfi'] = 0
data_df_2 = pd.DataFrame({'longitude':range(-120,-125,-1),'latitude':range(50,45,-1),'dfi':[2]*5})
data_df_2['dfi'] = 2
df = pd.concat([data_df_0,data_df_2])
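Then you could access the data from one of the original frames by filtering on that column (df.loc[2] would instead select by index label and return one row from each frame):
df[df['dfi'] == 2]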
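On the error message itself: Python parses lon_(x) as a call to a function named lon_, and the result of a call cannot be the target of an assignment. A common alternative to numbered variable names is a dictionary keyed by the data set index. The sketch below assumes small made-up frames instead of the CSV file, and the names frames, lon and lat are only illustrative.

```python
import pandas as pd

# Made-up stand-ins for the frames read from the CSV file
data_df_0 = pd.DataFrame({"longitude": [-125, -124], "latitude": [45, 46]})
data_df_2 = pd.DataFrame({"longitude": [-120, -121], "latitude": [50, 49]})

frames = {0: data_df_0, 2: data_df_2}

lon = {}
lat = {}
for key, frame in frames.items():
    # lon[key] plays the role that the variable lon_<key> was meant to play
    lon[key] = frame["longitude"]
    lat[key] = frame["latitude"]
```

After the loop, lon[0] and lon[2] hold the longitude columns of the two data sets, and new data sets can be added without inventing new variable names.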

Finding euclidean distance from multiple mean vectors

This is what I am trying to do - I was able to do steps 1 to 4. Need help with steps 5 onward
Basically for each data point I would like to find euclidean distance from all mean vectors based upon column y
1. take data
2. separate out non numerical columns
3. find mean vectors by y column
4. save means
5. subtract each mean vector from each row based upon y value
6. square each column
7. add all columns
8. join back to numerical dataset and then join non numerical columns
import pandas as pd
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric=df.select_dtypes(include='number')
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric. Then square each column of the result and, for each row, add up all columns. Finally, join this data back to df_numeric and df_non_numeric.
--------------update1
added code as below. My questions have changed and updated questions are at the end.
import numpy as np
def calculate_distance(row):
    return (np.sum(np.square(row-means.head(1)),1))
def calculate_distance2(row):
    return (np.sum(np.square(row-means.tail(1)),1))
df_numeric2=df_numeric.drop("class",axis=1)
#np.sum(np.square(df_numeric2.head(1)-means.head(1)),1)
df_numeric2['distance0']= df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1']= df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
Could anyone confirm that this is a correct way to achieve the results? I am mainly concerned about the last two statements. Would the second-to-last statement do a correct join? Would the final statement assign the original class? I would like to confirm that Python won't do the concat and class assignment in a random order, and that it maintains the order in which the rows appear.
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
I think this is what you want
import pandas as pd
import numpy as np
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric=df.select_dtypes(include='number')
# Make df_non_numeric a copy and not a view
df_non_numeric=df.select_dtypes(exclude='number').copy()
# Subtract mean (calculated using the transform function which preserves the
# number of rows) for each class to create distance to mean
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']
# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
It is probably possible to write this more compactly, but this way you can see what's going on.
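The code above measures each row against the mean of its own class. If the distance from every row to every class mean is wanted, as the question describes, a broadcasting sketch along the following lines may help. The column names dist_to_class_0 and dist_to_class_1 are made up for illustration, and dtype=float is left out of the constructor on the assumption that newer pandas versions may refuse to cast the Name column.

```python
import numpy as np
import pandas as pd

data = [['Alex', 10, 5, 0], ['Bob', 12, 4, 1], ['Clarke', 13, 6, 0], ['brke', 15, 1, 0]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'weight', 'class'])

features = ['Age', 'weight']
means = df.groupby('class')[features].mean()   # one mean vector per class

# Broadcasting: (rows, 1, features) - (1, classes, features)
diff = df[features].to_numpy()[:, None, :] - means.to_numpy()[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))        # shape (rows, classes)

# One distance column per class, joined back on the original index
dist_cols = [f'dist_to_class_{c}' for c in means.index]
final = df.join(pd.DataFrame(dist, index=df.index, columns=dist_cols))
```

Because join (like pd.concat with axis=1) aligns on the index labels rather than on position, the rows stay matched to their original names and classes as long as the index is left unchanged.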
I'm sure there is a better way to do this, but I iterated over the classes and followed the exact steps:
1. Assigned the 'class' as the index.
2. Rotated so that the 'class' was in the columns.
3. Subtracted the means that corresponded with df_numeric.
4. Squared the values.
5. Summed the rows.
6. Concatenated the dataframes back together.
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
#print (df)
df_numeric=df.select_dtypes(include='number')
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean().T
import numpy as np
# Changed index
df_numeric.index = df_numeric['class']
df_numeric.drop('class' , axis = 1 , inplace = True)
# Rotated the Numeric data sideways so the class was in the columns
df_numeric = df_numeric.T
# Iterate over the class labels in means and pick the matching df_numeric columns
store = [] # Start with an empty list
for j in means:
    sto = df_numeric[j]
    if isinstance(sto, pd.Series): # If there is a single value it comes out as a pd.Series type
        sto = sto.to_frame() # Need to convert to dataframe type
    # Subtract the mean vector of class j (not the label j itself) and square
    store.append(sto.sub(means[j], axis=0) ** 2)
# Summing the rows (Age and weight) gives one squared distance per data point
summed = []
for i in store:
    summed.append(i.sum(axis = 0))
df_new = pd.concat(summed)
df_new
