I have about 88 columns in a pandas dataframe. I'm trying to apply a formula that calculates a single value for each column. How do I switch out the name of each column and then build a new single-row dataframe from the equation?
Below is the equation (linear mixed model) which results in a single value for each column.
B1 = (((gdf.groupby(['Benthic_Mo'])['SHAPE_Area'].sum())/Area_sum) *
(gdf.groupby(['Benthic_Mo'])['W8_629044'].mean())).sum()
Below is a sample of the names of the columns
['OBJECTID', 'Benthic_Mo', 'SHAPE_Leng', 'SHAPE_Area', 'geometry', 'tmp', 'Species','W8_629044', 'W8_642938', 'W8_656877', 'W8_670861', 'W8_684891', 'W8_698965', 'W8_713086', 'W8_72726',...]
The W8_## columns need to be swapped into the formula one at a time, and there are about 80 of them. The output I need is a new dataframe with a single row. I would also like to calculate the variance or standard deviation of the values calculated with the formula.
Thank you!
You can loop through the dataframe columns. I think the below code should work.
collist = list(gdf.columns)  # gdf is the dataframe used in the formula
emptylist = []
emptydict = {}
for i in collist[7:]:  # in the sample column list, the W8_ columns start at position 7
    B1 = (((gdf.groupby(['Benthic_Mo'])['SHAPE_Area'].sum())/Area_sum) * (gdf.groupby(['Benthic_Mo'])[i].mean())).sum()
    emptydict[i] = B1
emptylist.append(emptydict)
resdf = pd.DataFrame(emptylist)
To create a new df with one result per column (a single row), you can use something like the following:
W8_cols = [col for col in gdf.columns if 'W8_' in col]
df_out = pd.DataFrame()
for col in W8_cols:
    B1 = (((gdf.groupby(['Benthic_Mo'])['SHAPE_Area'].sum()) / Area_sum) *
          (gdf.groupby(['Benthic_Mo'])[col].mean())).sum()
    t_data = [{col: B1}]
    df_temp = pd.DataFrame(t_data)
    data = [df_out, df_temp]
    df_out = pd.concat(data, axis=1)
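The loop can also be avoided. Below is a minimal sketch, assuming Area_sum is the total of SHAPE_Area (the question does not show how it is defined) and using a small made-up dataframe in place of gdf. It computes the weighted value for every W8_ column in one pass and then takes the spread of those values.

```python
import pandas as pd

# Toy stand-in for gdf; the values are invented for illustration only.
gdf = pd.DataFrame({
    "Benthic_Mo": ["a", "a", "b"],
    "SHAPE_Area": [1.0, 3.0, 4.0],
    "W8_1": [2.0, 4.0, 6.0],
    "W8_2": [1.0, 1.0, 3.0],
})

# Assumption: Area_sum is the total area over all rows.
Area_sum = gdf["SHAPE_Area"].sum()

W8_cols = [col for col in gdf.columns if col.startswith("W8_")]
grouped = gdf.groupby("Benthic_Mo")

# Area share of each Benthic_Mo class (a Series indexed by class).
weights = grouped["SHAPE_Area"].sum() / Area_sum

# Per-class mean of every W8_ column, scaled by the area share and summed over classes.
weighted = grouped[W8_cols].mean().mul(weights, axis=0).sum()

# Single-row dataframe with one column per W8_ column.
df_out = weighted.to_frame().T

# Spread of the calculated values across the W8_ columns (sample statistics, ddof=1).
std_of_values = weighted.std()
var_of_values = weighted.var()
```

Here `mul(weights, axis=0)` aligns the weights with the group index, so each class mean is multiplied by the area share of its own class.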
Related
I have the following dataframe:
[Image: Dataframe]
Now I want to find the average of every column and create a new dataframe with the result.
My only solution has been:
#convert all rows to mean of values in column
df_find_mean['Germany'] = (df_find_mean["Germany"].mean())
df_find_mean['Turkey'] = (df_find_mean["Turkey"].mean())
df_find_mean['USA_NJ'] = (df_find_mean["USA_NJ"].mean())
df_find_mean['USA_TX'] = (df_find_mean["USA_TX"].mean())
df_find_mean['France'] = (df_find_mean["France"].mean())
df_find_mean['Sweden'] = (df_find_mean["Sweden"].mean())
df_find_mean['Italy'] = (df_find_mean["Italy"].mean())
df_find_mean['SouthAfrica'] = (df_find_mean["SouthAfrica"].mean())
df_find_mean['Taiwan'] = (df_find_mean["Taiwan"].mean())
df_find_mean['Hungary'] = (df_find_mean["Hungary"].mean())
df_find_mean['Portugal'] = (df_find_mean["Portugal"].mean())
df_find_mean['Croatia'] = (df_find_mean["Croatia"].mean())
df_find_mean['Albania'] = (df_find_mean["Albania"].mean())
df_find_mean['England'] = (df_find_mean["England"].mean())
df_find_mean['Switzerland'] = (df_find_mean["Switzerland"].mean())
df_find_mean['Denmark'] = (df_find_mean["Denmark"].mean())
#Remove all rows except first
df_find_mean = df_find_mean.loc[[0]]
#Verify data
display(df_find_mean)
Which works, but is not very elegant.
Is there some way to iterate over each column and construct a new dataframe from the average (.mean()) of that column?
Expected output:
[Image: Dataframe with average of columns from previous dataframes]
Use DataFrame.mean, then convert the resulting Series to a one-row DataFrame with Series.to_frame and a transpose:
df = df_find_mean.mean().to_frame().T
display(df)
Just use DataFrame.mean() to compute the mean of all your columns. Passing the resulting Series inside a list to pd.DataFrame, as in pd.DataFrame([df_find_mean.mean()]), gives a one-row DataFrame:
means = df_find_mean.mean()
df_mean = pd.DataFrame([means])
display(df_mean)
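As a quick check that the two approaches give the same single row, here is a small sketch. The column names are borrowed from the question, but the values are invented; the `agg` call at the end is an extra, in case the standard deviation is wanted alongside the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the question.
df_find_mean = pd.DataFrame({
    "Germany": [1.0, 2.0, 3.0],
    "Turkey": [4.0, 6.0, 8.0],
    "Sweden": [10.0, 10.0, 13.0],
})

# Option 1: Series -> one-column frame -> transpose to one row.
via_to_frame = df_find_mean.mean().to_frame().T

# Option 2: wrap the Series in a list so it becomes a single row.
via_list = pd.DataFrame([df_find_mean.mean()])

# agg accepts a list of function names and returns one row per statistic.
stats = df_find_mean.agg(["mean", "std"])
```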
How do I load the data and rearrange it so that x has shape (2000, 2) and y has shape (2000,) and holds the labels?
This is what I am currently doing.
This is what I know: the dataframe has 100 rows × 40 columns, so I do:
p1 = q2_data.iloc[:,0:2]
p2 = q2_data.iloc[:,2:4]
.......
p20 = q2_data.iloc[:,38:40]
new_columns = ["x1", "x2"]
p1.columns = new_columns
p2.columns = new_columns
.....
p20.columns = new_columns
print( pd.concat([p1, p2,.....,p20], ignore_index=True))
[2000 rows x 2 columns]
How do I also add a label to each of p1, p2, ..., p20, so that I can create another column with labels ranging from 0 to 19?
If you are looking for the loop logic, this should work, though it is not the prettiest script.
columns_name = ["x1", "x2"]  # target column names
new_df = pd.DataFrame(columns=columns_name)  # empty dataframe with the target column names
for col_index in range(0, len(q2_data.columns), 2):  # step through the columns two at a time
    temp = q2_data.iloc[:, col_index:col_index + 2].copy()  # temporary df holding one pair of columns
    temp.columns = columns_name  # rename the columns (rename() without columns= would target the index)
    new_df = pd.concat([new_df, temp], ignore_index=True)
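For the label part of the question, one possible approach is to do the rearrangement in NumPy and build the labels with `np.repeat`. This is a sketch under the assumption that consecutive column pairs (0-1, 2-3, ...) form the 20 blocks and that a block's label is simply its position; the q2_data below is a made-up stand-in with the same shape.

```python
import numpy as np
import pandas as pd

# Stand-in for q2_data: 100 rows x 40 columns of made-up numbers.
q2_data = pd.DataFrame(np.arange(100 * 40, dtype=float).reshape(100, 40))

n_rows, n_cols = q2_data.shape
n_blocks = n_cols // 2  # 20 blocks of two columns each

# (100, 40) -> (100, 20, 2) -> (20, 100, 2) -> (2000, 2):
# all rows of block 0 come first, then all rows of block 1, and so on.
x = q2_data.to_numpy().reshape(n_rows, n_blocks, 2).transpose(1, 0, 2).reshape(-1, 2)

# Label every row with the index (0..19) of the block it came from.
y = np.repeat(np.arange(n_blocks), n_rows)

# Optional: put both back into one dataframe.
labelled = pd.DataFrame(x, columns=["x1", "x2"])
labelled["label"] = y
```

The row order matches what `pd.concat([p1, p2, ...], ignore_index=True)` produces, so `y` lines up with `x` row by row.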
I need to fix a large excel database where in some columns some cells are blank and all the data from the row is moved one cell to the right.
For example:
In this example I need a script that detects that the first cell of the last row is blank and then moves all the values in that row one cell to the left.
I'm trying to do it with this function. Vencli_col is the dataset, df1 and df2 are copies. In df2 I drop column 12, which is where the error originates. I index the rows where the error happens and then I try to replace them with the values from df2.
df1 = vencli_col.copy()
df2 = vencli_col.copy()
df2 = df1.drop(columns=['Column12'])
df2['droppedcolumn'] = np.nan
i = 0
col = []
for k, value in vencli_col.iterrows():
    i += 1
    if str(value['Column12']) == '' or str(value['Column12']) == str(np.nan):
        col.append(i+1)
for j in col:
    df1.iloc[j] = df2.iloc[j]
df1.head(25)
You could do something like the below. It is not very pretty but it does the trick.
# Select the column names that are correct and the ones that are shifted
# This is assuming the error column is the second one as in the image you have
correct_cols = df.columns[1:-1]
shifted_cols = df.columns[2:]
# Get the indexes of the rows that are NaN or ""
df = df.fillna("")
shifted_indexes = df[df["col1"] == ""].index
# Shift the data 1 column to the left
# The values have to be converted to a NumPy array; otherwise pandas aligns on
# the column names and the values do not land in the destination columns
df.loc[shifted_indexes ,correct_cols] = df.loc[shifted_indexes, shifted_cols].to_numpy()
EDIT: just realised there is an easier way using df.shift()
columns_to_shift = df.columns[1:]
shifted_indexes = df[df["col1"] == ""].index
df.loc[shifted_indexes, columns_to_shift] = df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
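Here is a small self-contained sketch of the shift approach. It assumes every column holds strings (for example, data read from Excel as text) and that blanks are empty strings; the column names and values are made up.

```python
import pandas as pd

# Made-up example: in row 1 the "col1" cell is blank and the data sits one cell too far right.
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "col1": ["x1", "", "x3"],
    "col2": ["y1", "x2", "y3"],
    "col3": ["z1", "y2", "z3"],
    "extra": ["", "z2", ""],
})

columns_to_shift = df.columns[1:]
shifted_rows = df["col1"] == ""  # boolean mask of the rows that need fixing

# Move the values of the affected rows one column to the left.
# The rightmost column of those rows becomes missing (NaN).
df.loc[shifted_rows, columns_to_shift] = df.loc[shifted_rows, columns_to_shift].shift(-1, axis=1)
```

Rows where "col1" is not blank are left untouched because the mask restricts both the read and the write.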
I have written a function to calculate the gradient between two columns of two dataframes and output this gradient in a new dataframe. These columns have the same headings, and upon merging the columns a suffix of _A or _B is added.
The column headings are chemical formulas. The expected output is a dataframe holding, for each chemical formula, the gradient of the linear regression between the two matching columns from the two dataframes.
The input dataframes contain columns with chemical formula headings and integer values, and are indexed with a DatetimeIndex.
def find_gradient(dfA, dfB):
    dfA.resample('1min')
    dfB.resample('1min')
    combined_df = dfA.merge(dfB,how='inner',left_index=True,right_index=True, suffixes=('_A', "_B"))
    combined_df= combined_df.dropna(how='all', axis=0)
    #return combined_df
    listofcols = combined_df.columns
    listofcols = listofcols.tolist()
    listofformulas = dfA.columns
    listofformulas = listofformulas.tolist()
    for cols in listofcols:
        for formula in listofformulas:
            A = [col for col in combined_df if col.startswith(formula) and col.endswith('_A')]#.str] # startswith(formula)
            B = [col for col in combined_df if col.startswith(formula) and col.endswith('_B')]#.str] #startswith(formula)
            q1 = combined_df[A]
            q2 = combined_df[B]
            x1 = np.squeeze(np.array(q1))
            x2 = np.squeeze(np.array(q2))
            gradient = np.polyfit(x1,x2,0) ## fits a linear regression and forces the y-intercept to 0
            GradientFormulas = pd.DataFrame(columns=['formula','gradient'])
            GradientFormulas = GradientFormulas.append([{'formula':formula,'gradient':gradient}])
            return GradientFormulas
The output is expected to be a table with as many rows as there are chemical formulas (columns in input) and a single gradient value corresponding to each row. However, the output currently only shows the chemical formula and gradient for the first column.
Please help. It seems easy, but I just can't figure it out.
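A few things in the function may explain the symptom. If the result dataframe is re-created and returned inside the loop, only the first formula can ever appear. `resample` returns a new object and does nothing unless an aggregation is applied. `np.polyfit` with degree 0 fits a constant rather than a slope. `DataFrame.append` was removed in pandas 2.0. The sketch below is one possible rewrite, not a definitive fix. It assumes a mean aggregation for the resampling and a least-squares slope through the origin, sum(x*y)/sum(x*x), and the demo data are invented.

```python
import numpy as np
import pandas as pd

def find_gradient(dfA, dfB):
    # Assumption: average within each 1-minute bin; resample alone does nothing.
    a = dfA.resample("1min").mean()
    b = dfB.resample("1min").mean()
    combined = a.merge(b, how="inner", left_index=True, right_index=True,
                       suffixes=("_A", "_B"))
    rows = []  # collect one dict per formula, build the dataframe once at the end
    for formula in dfA.columns:
        col_a, col_b = f"{formula}_A", f"{formula}_B"
        if col_a not in combined.columns or col_b not in combined.columns:
            continue  # formula is not present in both inputs
        pair = combined[[col_a, col_b]].dropna()
        x = pair[col_a].to_numpy(dtype=float)
        y = pair[col_b].to_numpy(dtype=float)
        # Least-squares slope of y = m * x with the intercept fixed at 0.
        slope = np.dot(x, y) / np.dot(x, x)
        rows.append({"formula": formula, "gradient": slope})
    return pd.DataFrame(rows, columns=["formula", "gradient"])

# Invented demo data: dfB is an exact multiple of dfA, so the slopes are known.
idx = pd.date_range("2021-01-01", periods=5, freq="1min")
dfA = pd.DataFrame({"H2O": [1, 2, 3, 4, 5], "CO2": [2, 4, 6, 8, 10]}, index=idx)
dfB = pd.DataFrame({"H2O": [2, 4, 6, 8, 10], "CO2": [1, 2, 3, 4, 5]}, index=idx)
result = find_gradient(dfA, dfB)
```

If an ordinary regression with a free intercept is wanted instead, `np.polyfit(x, y, 1)[0]` gives the slope.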
DataFrame (df) contains numbers. For each column:
* compute the mean and std
* compute a new value for each value in each row in each column
* change that value with the new value
Method 1
import numpy as np
import pandas as pd
n = 1
while n<len(df.column.values.tolist()):
    col = df.values[:,n]
    mean = sum(col)/len(col)
    std = np.std(col, axis = 0)
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        df.set_value(x, df.columns.values[n], y)
    n = n+1
Method 2
labels = df.columns.values.tolist()
df2 = df.ix[:,0]
n = 1
while n<len(df.column.values.tolist()):
    col = df.values[:,n]
    mean = sum(col)/len(col)
    std = np.std(col, axis = 0)
    ls = []
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        ls.append(y)
    df2 = pd.DataFrame({labels[n]:str(ls)})
    df1 = pd.concat([df1, df2], axis=1, ignore_index=True)
    n = n+1
Error: ValueError: If using all scalar values, you must pass an index
Also tried the .apply method but the new DataFrame doesn't change the values.
print(df.to_json()):
{"col1":{"subj1":4161.97,"subj2":5794.73,"subj3":4740.44,"subj4":4702.84,"subj5":3818.94},"col2":{"subj1":13974.62,"subj2":19635.32,"subj3":17087.721851,"subj4":13770.461021,"subj5":11546.157578},"col3":{"subj1":270.7,"subj2":322.607708,"subj3":293.422314,"subj4":208.644585,"subj5":210.619961},"col4":{"subj1":5400.16,"subj2":7714.080365,"subj3":6023.026011,"subj4":5880.187272,"subj5":4880.056292}}
You are standardizing each column by removing the mean and scaling to unit variance. You can use scikit-learn's StandardScaler for this:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
# StandardScaler works column by column, so pass df itself rather than df.T
new_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
See the scikit-learn documentation for StandardScaler for details.
It looks like you're trying to do operations on DataFrame columns and values as though DataFrames were simple lists or arrays, rather than in the vectorized / column-at-a-time way more usual for NumPy and Pandas work.
A simple, first-pass improvement might be:
# import your data
import json
import numpy as np
import pandas as pd
# json_text is assumed to hold the JSON string printed by df.to_json() above
df = pd.DataFrame(json.loads(json_text))
# loop over only numeric columns
for col in df.select_dtypes([np.number]):
    # compute column mean and std
    col_mean = df[col].mean()
    col_std = df[col].std()
    # adjust column to normalized values
    df[col] = df[col].apply(lambda x: (x - col_mean) / col_std)
That is vectorized by column. It retains some explicit looping, but is straightforward and relatively beginner-friendly.
If you're comfortable with Pandas, it can be done more compactly:
numeric_cols = list(df.select_dtypes([np.number]))
df[numeric_cols] = df[numeric_cols].apply(lambda col: (col - col.mean()) / col.std(), axis=0)
In your revised DataFrame, there are no string columns. But the earlier DataFrame had string columns, causing problems when they were computed upon, so let's be careful. This is a generic way to select numeric columns. If it's too much, you can simplify at the cost of generality by listing them explicitly:
numeric_cols = ['col1', 'col2', 'col3', 'col4']
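When every column is numeric, as in the posted data, the whole operation can also be written as plain dataframe arithmetic. The sketch below uses the JSON from the question. Note that pandas' `std()` defaults to the sample standard deviation (ddof=1), while `np.std` and scikit-learn's StandardScaler use the population version (ddof=0), so the two conventions give slightly different numbers.

```python
import json
import numpy as np
import pandas as pd

# The JSON printed in the question.
json_text = '{"col1":{"subj1":4161.97,"subj2":5794.73,"subj3":4740.44,"subj4":4702.84,"subj5":3818.94},"col2":{"subj1":13974.62,"subj2":19635.32,"subj3":17087.721851,"subj4":13770.461021,"subj5":11546.157578},"col3":{"subj1":270.7,"subj2":322.607708,"subj3":293.422314,"subj4":208.644585,"subj5":210.619961},"col4":{"subj1":5400.16,"subj2":7714.080365,"subj3":6023.026011,"subj4":5880.187272,"subj5":4880.056292}}'
df = pd.DataFrame(json.loads(json_text))

# Column-wise z-scores; df.mean() and df.std() are aligned with the columns automatically.
standardized = (df - df.mean()) / df.std()

# Same idea with the population standard deviation (ddof=0).
standardized_pop = (df - df.mean()) / df.std(ddof=0)
```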