Python Pandas create new column in loop

I'm attempting to create a new column for each pair of columns by dividing one by the other. df is a pandas DataFrame...
columns = list(df.columns.values)
for column_1 in columns:
    for column_2 in columns:
        new_column = '-'.join([column_1, column_2])
        df[new_column] = df[column_1] / df[column_2]
I'm getting an error: NotImplementedError: operator '/' not implemented for bool dtypes
Any thoughts would be appreciated.

Like Brian said, you're definitely trying to divide non-numeric columns. Here's a working example of dividing two columns to create a third:
import pandas as pd

name = ['bob', 'sam', 'joe']
age = [25, 32, 50]
wage = [50000, 75000, 32000]
people = {}
for i in range(3):
    people[i] = {'name': name[i], 'age': age[i], 'wage': wage[i]}
# you should now have a data frame where each row is a person:
# one string column (name) and two numeric columns (age and wage)
df = pd.DataFrame(people).transpose()
df['wage_per_year'] = df['wage'] / df['age']
print(df)
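For the original nested loop, one way to avoid the bool-dtype error (a sketch, assuming you only want ratios of numeric columns) is to restrict the loop to numeric columns with select_dtypes:

import pandas as pd

# toy dataframe with a bool column that would break '/'
df = pd.DataFrame({'a': [1, 2], 'b': [4.0, 5.0], 'flag': [True, False]})

# keep only numeric columns so bool/object columns never reach the division
numeric_cols = df.select_dtypes(include='number').columns
for column_1 in numeric_cols:
    for column_2 in numeric_cols:
        df['-'.join([column_1, column_2])] = df[column_1] / df[column_2]
print(df)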

Related

Finding .mean() of all columns in python using loop

I have the following dataframe:
[image: dataframe]
Now I want to find the average of every column and create a new dataframe with the result.
My only solution has been:
#convert all rows to mean of values in column
df_find_mean['Germany'] = (df_find_mean["Germany"].mean())
df_find_mean['Turkey'] = (df_find_mean["Turkey"].mean())
df_find_mean['USA_NJ'] = (df_find_mean["USA_NJ"].mean())
df_find_mean['USA_TX'] = (df_find_mean["USA_TX"].mean())
df_find_mean['France'] = (df_find_mean["France"].mean())
df_find_mean['Sweden'] = (df_find_mean["Sweden"].mean())
df_find_mean['Italy'] = (df_find_mean["Italy"].mean())
df_find_mean['SouthAfrica'] = (df_find_mean["SouthAfrica"].mean())
df_find_mean['Taiwan'] = (df_find_mean["Taiwan"].mean())
df_find_mean['Hungary'] = (df_find_mean["Hungary"].mean())
df_find_mean['Portugal'] = (df_find_mean["Portugal"].mean())
df_find_mean['Croatia'] = (df_find_mean["Croatia"].mean())
df_find_mean['Albania'] = (df_find_mean["Albania"].mean())
df_find_mean['England'] = (df_find_mean["England"].mean())
df_find_mean['Switzerland'] = (df_find_mean["Switzerland"].mean())
df_find_mean['Denmark'] = (df_find_mean["Denmark"].mean())
#Remove all rows except first
df_find_mean = df_find_mean.loc[[0]]
#Verify data
display(df_find_mean)
Which works, but is not very elegant.
Is there some way to iterate over each column and construct a new dataframe as the average (.mean()) of each column?
Expected output:
[image: dataframe with the average of each column from the previous dataframe]
Use DataFrame.mean, then convert the resulting Series to a one-row DataFrame with Series.to_frame and transpose:
df = df_find_mean.mean().to_frame().T
display(df)
Just use DataFrame.mean() to compute the mean of all your columns, then wrap the resulting Series in pd.DataFrame([...]) to get a one-row dataframe:
means = df_find_mean.mean()
df_mean = pd.DataFrame([means])
display(df_mean)
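A minimal, self-contained check of both approaches on made-up data:

import pandas as pd

df_find_mean = pd.DataFrame({'Germany': [1.0, 2.0, 3.0],
                             'Turkey': [4.0, 5.0, 6.0]})

# approach 1: Series -> one-row DataFrame via to_frame().T
row1 = df_find_mean.mean().to_frame().T

# approach 2: wrap the mean Series in a list
row2 = pd.DataFrame([df_find_mean.mean()])

print(row1)  # one row: Germany 2.0, Turkey 5.0
print(row2)  # same result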

Use one df column to filter another df, multiple filter

I want to filter df1 based on values in the Version column of df2, and then set Cost Total to 0 in df1 for those rows whose Version appears in df2.
df1 is [24867 rows x 63 columns]
df2 is [35 rows x 7 columns]
The code I'm using for filtering and setting the value is:
df1.loc[
    (df1['Group'] == "CBSS_cq_....JZJN") &
    (df1['Version – USE'] == df2['Version - USE']),
    df1['Cost Total']] = 0
The code assigns Cost Total to 0 for every row matching 'Group'; it is not filtering on my second condition for Version, and gives this error:
raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects
Note that when I used .values:
df1.loc[
    (df1['Group'] == "CBSS_.......KJZJN") &
    (df1['Version – USE'].values == df2['Version'].values),
    df1['Cost Total']] = 0
I get the following error:
block_values = np.empty(block_shape, dtype=dtype)
ValueError: array is too big; arr.size * arr.dtype.itemsize is larger than the maximum possible size.
Update: the above was solved with .isin.
My df2 comes from the template files: 24 Excel files, each with 3-4 sheets. I have looped through all the files and their sheets.
The Index Template files are named like:
AdDape CBS Index Template 6.3.xlsx
AdDape Midlife Index Template 5.3.xlsx
and my loop looks like this:
print("\nIndex Template Files\n")
os.chdir('path to my \IndexTemplatefiles')
FileList = glob.glob('*.xlsx')
print(FileList)
for fname in FileList:
excel = pd.ExcelFile(fname)
sheets = pd.ExcelFile(fname).sheet_names # list of sheets
print(fname)
for sheet in excel.sheet_names:
df2 = pd.read_excel(excel, sheet_name=sheet)
df3 = pd.read_excel(CostGroupFile, sheet_name='Sheet2')
#merging df1 and df2
df1 = pd.merge(df1, df2, left_on='Version', right_on='Version Market - USE', how='left')
df1.loc[(
(df1['Cost Group'] == "CBSS_ron_rt_na_disp_JZJN") &
(df1['Version'].isin(df2['Version Market - USE'])),
'Cost Total')] = (df1['Market Spend'] / df1['Sum of Impressions']) * df1['Impressions']
#deleting extra columns
df1 = df1.drop(columns=['..all columns that came after merging'])
df1.to_excel(writer, index=False)
writer.save()
This code works and updates the Cost Total values, but as you can see, I have entered the Cost Group manually; I want that to be dynamic.
If the Excel file's name (the Index Template file) matches df3[filename], and the current sheet's name (the sheet df2 was read from) matches df3[Sheetname], then the corresponding cost group from df3 should be used in the filter to select rows of df1 and update Cost Total.
Do you have an example of your data?
I'm not sure if this is what you want, but you could try this:
df1.loc[df1['Version – USE'].isin(df2['Version - USE']), 'Cost Total'] = 0
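For the dynamic cost-group lookup described in the question, a minimal sketch; the df3 column names 'Filename', 'Sheetname', and 'Cost Group' are assumptions here and should be replaced with the real headers:

# inside the per-sheet loop: look up the cost group for this file/sheet in df3
# ('Filename', 'Sheetname', 'Cost Group' are assumed column names)
match = df3.loc[(df3['Filename'] == fname) &
                (df3['Sheetname'] == sheet), 'Cost Group']
if not match.empty:
    cost_group = match.iloc[0]
    df1.loc[(df1['Cost Group'] == cost_group) &
            (df1['Version'].isin(df2['Version Market - USE'])),
            'Cost Total'] = (df1['Market Spend'] / df1['Sum of Impressions']) * df1['Impressions']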

Concat and delete NaN in each row and put the result in another column

I'm just learning the basics of pandas and I'm looking for a way to concatenate values, drop the NaNs, and put the result in a new column of my dataframe.
I know how to concat and how to create a list, but not really how to iterate through the columns, drop the NaN values, and finally concat the result into the new column.
I have a table with different numbers and I would like to create a column with pandas (CONTACT['CALLER_PHONE'] = ...) containing all the numbers from each row, with no null values.
Example of the result that I want in a table:
Number1     Number2     Number3     CALLER_PHONE
0675416952  0675416941  0675416930  0675416952,0675416941,0675416930
NaN         0675417080  0675417082  0675417080,0675417082
NaN         NaN         0675837759  0675837759
My code:

import pandas as pd

CONTACT = pd.read_excel('O:/16_GIS_Team/X_Tools/Model Builder And Parcels Package/Contact_20200807/CONTACT_20200807.xlsx')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
CONTACT['CALLER_NAME'] = CONTACT['First Name'].str.cat(CONTACT['Last Name'], sep=" ")
cols = CONTACT[['Work Phone', 'Mobile', 'Home Phone', 'SMS marketing phone',
                'Other Phone Number', 'Details (USA): Caller Phone']]
print(cols)
columns = list(cols)
for i in columns:
    Clean_Columns = cols.dropna(axis=1, how='any')
    print(Clean_Columns[i][2])
My file is an Excel file, and CONTACT is my dataframe.
I tried to iterate through the columns, then use dropna and chain the results from the list, but it's not working and I didn't dig deeper. The error comes from the list piece of my code.
I'm open to any advice, thank you very much in advance!
You can define your own function that takes the numbers you select and returns them as a string delimited with ','.
# get the data
cols = CONTACT[['Work Phone', 'Mobile', 'Home Phone', 'SMS marketing phone',
                'Other Phone Number', 'Details (USA): Caller Phone']]

def concatenate_numbers(s):
    """Remove all NA values from a series and return the rest as a string joined by ','."""
    s = s.dropna()
    return ','.join([str(number) for number in s])

# create a new column by applying the above function to every row of the phone columns
CONTACT['all_phones'] = cols.apply(concatenate_numbers, axis=1)
pandas.Series.dropna returns a pandas.Series with dropped NA values, so you need to assign that to a variable. You can then create a new column in the dataframe from the result.
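A quick check of this approach on the made-up numbers from the question's table:

import pandas as pd

demo = pd.DataFrame({'Number1': ['0675416952', None, None],
                     'Number2': ['0675416941', '0675417080', None],
                     'Number3': ['0675416930', '0675417082', '0675837759']})

def concatenate_numbers(s):
    # drop NA values from the row and join the rest with ','
    return ','.join(str(n) for n in s.dropna())

demo['CALLER_PHONE'] = demo.apply(concatenate_numbers, axis=1)
print(demo['CALLER_PHONE'])
# 0    0675416952,0675416941,0675416930
# 1               0675417080,0675417082
# 2                          0675837759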
You can also drop rows with missing values directly:
df = df.dropna()
or filter on a specific column:
df = df[df["column_name"].notna()]
(Note that comparing with != np.nan does not work: NaN compares unequal to everything, including itself, so use notna() instead.)

Select columns based on != condition

I have a dataframe and a list of some column names that correspond to it. How do I filter the dataframe so that it excludes those column names, i.e. I want the dataframe columns that are outside the specified list.
I tried the following:
quant_vair = X != true_binary_cols
but get the output error of: Unable to coerce to Series, length must be 545: given 155
Been battling for hours, any help will be appreciated.
This will help:
df.drop(columns = ["col1", "col2"])
You can either drop the columns from the dataframe, or create a list that does not contain all these columns:
df_filtered = df.drop(columns=true_binary_cols)
Or:
filtered_col = [col for col in df if col not in true_binary_cols]
df_filtered = df[filtered_col]
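An equivalent alternative, sketched with a boolean mask over the column index:

# keep every column whose name is not in true_binary_cols
df_filtered = df.loc[:, ~df.columns.isin(true_binary_cols)]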

How to merge columns interspersing the data?

I'm new to Python and pandas, and I'm working to create a pandas MultiIndex from two independent variables, flow and head, with 27 different design points. The data is currently organized in a single dataframe with columns for each variable and rows for each design point.
Here's how I created the MultiIndex:
flow = df.loc[0, ["Mass_Flow_Rate", "Mass_Flow_Rate.1", "Mass_Flow_Rate.2"]]
dp = df.loc[:, "Design Point"]
index = pd.MultiIndex.from_product([dp, flow], names=['DP', 'Flows'])
I then created three columns of data:
df0 = df.loc[:, "Head2D"]
df1 = df.loc[:, "Head2D.1"]
df2 = df.loc[:, "Head2D.2"]
I want to merge these into a single column of data so that I can use this command:
pc = pd.DataFrame(data, index=index)
Using the three columns with the same row indexes (0-27), I want to merge them into a single column such that the data is interleaved. If I call the columns col1, col2, and col3, and denote the index in parentheses (so col1(0) means column 1, index 0), I want the data to look like:
col1(0)
col2(0)
col3(0)
col1(1)
col2(1)
col3(1)
col1(2)...
It is a bit confusing, but what I understood is that you are trying to do this:
flow = df.loc[0, ["Mass_Flow_Rate", "Mass_Flow_Rate.1", "Mass_Flow_Rate.2"]]
dp = df.loc[:, "Design Point"]
index = pd.MultiIndex.from_product([dp, flow], names=['DP', 'Flows'])

df0 = df.loc[:, "Head2D"]
df1 = df.loc[:, "Head2D.1"]
df2 = df.loc[:, "Head2D.2"]

# concat takes a list; use the values positionally so the new index is not aligned by label
data = pd.concat([df0, df1, df2])
pc = pd.DataFrame(data=data.to_numpy(), index=index)
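Note that a plain pd.concat stacks the three series block by block (all of col1, then all of col2, then all of col3). To get the interleaved order the question asks for (col1(0), col2(0), col3(0), col1(1), ...), one option, sketched here on toy stand-ins for the three Head2D series, is to concatenate side by side and then stack row-major:

import pandas as pd

# toy stand-ins sharing the same row index
df0 = pd.Series([10, 11, 12], name="Head2D")
df1 = pd.Series([20, 21, 22], name="Head2D.1")
df2 = pd.Series([30, 31, 32], name="Head2D.2")

# side-by-side concat, then stack() walks each row left to right,
# giving col1(0), col2(0), col3(0), col1(1), ...
interleaved = pd.concat([df0, df1, df2], axis=1).stack()
print(interleaved.to_list())
# [10, 20, 30, 11, 21, 31, 12, 22, 32]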
