Converting an array structure to a dataframe to get the column names - python

I have a dataframe which I converted to an array in order to model the data with a regression algorithm. I used the following code to do it:
X = df.iloc[:, 0:345].values
Y = df.iloc[:, 345].values
Hence X and Y are now arrays. There are many columns because the categorical variables have been converted into dummy variables. Next, I create the train and test split:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
Now that I have finished building the model and making predictions, I want to get back the values of my categorical variables (X and Y were created after generating dummy variables for all categorical variables). For this, I am trying to convert my X_test back to a dataframe with the column names from the original dataframe df. I tried the following code:
dff=df.iloc[:, 0:345]
The above statement gets the first 345 columns of the dataframe.
Then,
pd.DataFrame(X_test, index=dff.index, columns=dff.columns)
I get the following error
ValueError: Shape of passed values is (345, 25000), indices imply (345, 100000)
I don't understand why the number of rows matters. I have fewer rows because my data has been split 75%/25% into train and test, and I perform the split after the data is converted to an array. How do I now convert the array back into a dataframe with the column names from the dff dataframe?

pd.DataFrame(X_test, index=dff.index, columns=dff.columns)
X_test being a numpy.ndarray
I modified the above statement to just this:
df_new = pd.DataFrame(X_test)
df_new.columns = list(dff.columns)
The new dataframe contains the X_test data, with the column names taken from the dff dataframe. The original call failed because dff.index still has the full 100000 rows while X_test only has 25000 (in a shuffled order after the split), so passing index=dff.index can never line up; letting pandas build a fresh default index avoids the mismatch.

I would recommend using the DataFrame for train_test_split, and then passing in arrays to your algorithm using numpy:
my_algorithm(np.asarray(X_train), np.asarray(y_train))
This way you can look at your data the same way you would for any df, but can run the model with the array. I'm not sure what library you are using, but I'm pretty sure some can now take DataFrames directly for modeling.
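A minimal sketch of that approach, assuming df and the column positions from the question, could look like this:
import numpy as np
from sklearn.model_selection import train_test_split

X = df.iloc[:, 0:345]   # keep X as a DataFrame so the column names survive the split
Y = df.iloc[:, 345]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

# X_test is still a DataFrame with the original column names;
# convert to arrays only when calling the estimator
my_algorithm(np.asarray(X_train), np.asarray(y_train))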

Related

What dimensions should my NumPy array be? Obspy Traces

I currently have seismic data for 175 events, with 3 traces for each event (the traces are numpy arrays of seismic data). I have classification labels for whether the seismic data is an earthquake or not for each of those 175 samples. I'm looking to format my data into numpy arrays for modelling. I've tried placing it into a dataframe of numpy arrays with each column being a different trace, so the columns would be 'Trace one', 'Trace two', 'Trace three'. This did not work. I have tried lots of different ways of arranging the data to use with Keras.
I'm now looking to create a numpy matrix for the data to go into, which I can then use for modelling.
I had thought that the shape might be (175, 3, 7501), as (#number of events, #number of traces, #number of samples per trace); however, when I iterate through and try to add the three traces to the numpy matrix, it fails. I'm used to using dataframes rather than numpy for inputting to Keras.
newrow = np.array([[trace_copy_1],[trace_copy_2],[trace_copy_3]])
data = numpy.vstack([data, newrow])
The data shape is (175, 3, 7510). The newrow shape is (3, 1, 7510), which does not allow me to add newrow to data.
The data arrives as obspy streams, and each stream holds 3 trace objects. Each trace object stores its trace data in a numpy array, so I have to access those arrays and append them to a dataframe for modelling, since I obviously can't feed a stream or trace object to a Keras model.
If I understand your data correctly, you can try one of the following methods:
If your data shape is (175, 3, 7510), define newrow as newrow = np.array([trace_copy_1, trace_copy_2, trace_copy_3]), with each trace_copy_x being a numpy array of shape (7510,).
Use the reshape function (either numpy.reshape(new_row, (3, 7510)) or new_row.reshape((3, 7510))).
If you're more familiar with dataframes, you can still use a pandas dataframe by reducing the dimension of your data (for example, placing the different traces end to end on the same row, something you often see when working with images). Here it could be something like pandas.DataFrame(data.reshape((175, 3*7510))).
In addition to that, I recommend using numpy.concatenate instead of numpy.vstack (it is more general).
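For illustration, a rough sketch of building the (175, 3, 7510) array directly from the obspy streams (the streams variable and the per-event trace count are assumptions based on the question):
import numpy as np

events = []
for stream in streams:                                     # one obspy Stream per event, each with 3 traces (assumed)
    events.append(np.stack([tr.data for tr in stream]))    # each entry has shape (3, 7510)
data = np.stack(events)                                    # final shape (175, 3, 7510)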
I hope it works.
Cheers
Thanks for the answers. The way I solved this was to create a NumPy array of the desired final shape: (number of events, number of traces (or number of arrays), number of samples in each trace).
I then created a new row, reshaped it, and added it. Finally, I split the data to remove the zero-initialised rows that were there before I started appending my new data.
data = np.zeros(shape=(175, 3, 7501))                                  # placeholder array of the target shape
newrow = np.array([[trace_copy_1], [trace_copy_2], [trace_copy_3]])    # shape (3, 1, 7501)
newrow = newrow.reshape((1, 3, 7501))                                   # reshape so it can be stacked onto data
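A sketch of the appending and trimming steps described above (the loop over events is implied):
data = np.vstack([data, newrow])    # shape grows from (175, 3, 7501) to (176, 3, 7501)
# ... repeat for every event ...
data = data[175:]                   # drop the zero-initialised placeholder rows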

scikit preprocessing across entire dataframe

I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})
The data is the average response to the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so, I wanted to preprocess it first by either standardizing or normalizing it.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column, but I'm struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
# define scaler
scaler = MinMaxScaler()  # or StandardScaler
X = df.loc[1, 'Q1-2019':'Q4-2019']   # one row of quarterly values (Company excluded)
X = X.to_numpy().reshape(-1, 1)      # scalers expect a 2-D array
# transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read in the linked documentation, you need to provide, inside a tuple:
a name for the step
the chosen transformer (e.g. StandardScaler) or a Pipeline
a list of columns to which the selected transformations are applied
Code example:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']
# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])
# fit and transform the input dataframe
ct.fit_transform(df)
array([[ 0.86955718, 0.93177476, 0.96056682, 0.46493449],
[ 0.53109031, 0.45544147, 0.41859563, 0.92419906],
[-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer outputs a numpy array with the transformed values, fitted on the input dataframe df. Even though the column names are gone, the array columns are still ordered the same way as in the input dataframe, so it's easy to convert the array back to a pandas dataframe if you need to.
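For example, one way to get a dataframe with the original column names back (reusing the columns list defined above) is:
scaled_df = pd.DataFrame(ct.fit_transform(df), columns=columns, index=df.index)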
In addition to @RicS's answer, note that what the scikit-learn function returns is a numpy array, not a dataframe anymore, and that the Company column is not included. You can use this to turn the result back into a dataframe:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company", axis=1))  # scale all columns except Company
y = pd.concat([df["Company"], pd.DataFrame(x, columns=df.columns[1:])], axis=1)  # put the results and Company back into one dataframe
y.head()

ValueError and shape problem when creating a DataFrame in Python

I would like to combine the coefficients from a Linear Regression model with values from the test dataset, but I get an error like the one below. My code is below as well; do you know where the problem is and what I can do?
I need something like the example below, where the indexes come from X.columns and the numbers come from LR.coef_.
In the following example, values is a dataframe with the same shape as your LR.coef_. To use its first row as column values in another dataframe, you can create a dict and pass that dict to pandas.DataFrame().
import pandas as pd
import numpy as np
values = pd.DataFrame(np.zeros((1, 689)))
X = pd.DataFrame(np.zeros((2096, 689)))
frame = { 'coefficient': values.iloc[0] }
coefficient = pd.DataFrame(frame, index=X.columns)
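Applied to the objects from the question (assuming LR is a LinearRegression already fitted on X), the same idea would look something like this:
import pandas as pd
# LR.coef_ is 1-D for a single-target regression; ravel() also covers the (1, n_features) case
coefficient = pd.DataFrame({'coefficient': LR.coef_.ravel()}, index=X.columns)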

Put a fixed quantity of missing values in a dataset - Azure ML

I'm dealing with Azure ML, and my goal is to see what happens if I have a fixed quantity (as a percentage) of missing values in my dataset.
My idea is:
Starting from a dataset (take the Adult dataset as an example), duplicate the original dataset and call the copy X, by convention. Dataset X will contain randomly placed missing values amounting to 20% of the data. Once we have the original dataset and the duplicated dataset X, we can use a neural net algorithm, create training and test sets, and then train the neural net with dataset X as input. What would be interesting to see is the global error produced. After that, we can imagine expanding the proportion of missing values in dataset X: starting from 20%, then 40%, and so on. I think the hardest part is duplicating the original dataset to create dataset X with these missing values.
How can I do this? Using modules in Azure ML, or maybe R/Python scripts?
Just sharing my idea; please see the sample code and comments below.
import numpy as np
import pandas as pd
# Origin DataFrame
df = pd.DataFrame(np.random.randn(6,4))
# Copy the data by flattening the data matrix into a 1-D array
array = df.values.flatten()
# insert missing data by percent
# Define the percent of missing data
percent = 0.2
size = len(array)
# generate a random list of indices for the cells that will be assigned NaN
# (replace=False so exactly the requested fraction of cells is hit)
chosen = np.random.choice(size, int(size*percent), replace=False)
array[chosen] = np.nan
# Create a new DataFrame with missing data
df2 = pd.DataFrame(np.reshape(array, (6,4)))
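To verify how much data actually ended up missing, you can check the fraction of NaN cells, e.g.:
print(df2.isna().mean().mean())  # fraction of NaN cells across the whole dataframe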
Hope it helps.

Insert result of sklearn CountVectorizer in a pandas dataframe

I have 14784 text documents that I am trying to vectorize so I can run some analysis. I used CountVectorizer in sklearn to convert the documents to feature vectors. I did this by calling:
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(examples)
where examples is an array of all the text documents.
Now I am trying to use additional features. For this, I am storing the features in a pandas dataframe. At present, my pandas dataframe (without the text features) has the shape (14784, 5). The shape of my feature vector is (14784, 21343).
What would be a good way to insert the vectorized features into the pandas dataframe?
Learn the vocabulary dictionary from the raw documents and return the term-document matrix (vect being a CountVectorizer instance):
X = vect.fit_transform(docs)
Convert the sparse CSR matrix to dense format, using the mapping from feature integer indices to feature names as the column names:
count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out())
Concatenate the original df and the count_vect_df columnwise.
pd.concat([df, count_vect_df], axis=1)
If your base data frame is df, all you need to do is:
import pandas as pd
features_df = pd.DataFrame(features.toarray())  # features is a sparse matrix, so convert it to a dense array first
combined_df = pd.concat([df, features_df], axis=1)
I'd also recommend some options to reduce the number of features, which could be useful depending on what type of analysis you're doing. For example, if you haven't already, I'd suggest looking into removing stop words and stemming. You can also set max_features when constructing the vectorizer, e.g. CountVectorizer(max_features=1000), to limit the number of features.
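For illustration, a small sketch of that configuration (the parameter values are just examples):
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english', max_features=1000)  # drop English stop words, keep the 1000 most frequent terms
features = vectorizer.fit_transform(examples)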
