ValueError: Columns must be same length as key in pandas - python

I have the df below:
Cost,Reve
0,3
4,0
0,0
10,10
4,8
len(df['Cost']) = 300
len(df['Reve']) = 300
I need to divide df['Cost'] / df['Reve']
Below is my code
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
I got the error ValueError: Columns must be same length as key
df['C/R'] = df[['Cost']].div(df['Reve'].values, axis=0)
I got the error ValueError: Wrong number of items passed 2, placement implies 1

The problem is duplicated column names; verify:
#generate duplicates
df = pd.concat([df, df], axis=1)
print (df)
Cost Reve Cost Reve
0 0 3 0 3
1 4 0 4 0
2 0 0 0 0
3 10 10 10 10
4 4 8 4 8
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
print (df)
# ValueError: Columns must be same length as key
You can find these column names with:
print (df.columns[df.columns.duplicated(keep=False)])
Index(['Cost', 'Reve', 'Cost', 'Reve'], dtype='object')
If the duplicated columns contain the same values, you can drop the duplicates with:
df = df.loc[:, ~df.columns.duplicated()]
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
#simplify division
df['C/R'] = df['Cost'].div(df['Reve'])
print (df)
Cost Reve C/R
0 0 3 0.0
1 4 0 inf
2 0 0 NaN
3 10 10 1.0
4 4 8 0.5

The issue lies in the size of the data that you are trying to assign to the columns. I ran into this with:
df[['X1','X2', 'X3']] = pd.DataFrame(df.X.tolist(), index= df.index)
I was trying to assign the values of X to 3 columns X1, X2, X3, assuming that X had 3 values, but X had 4. So the revised code in my case was:
df[['X1','X2', 'X3','X4']] = pd.DataFrame(df.X.tolist(), index= df.index)

I had the same error, but it did not come from either of the two issues above. In my case the columns had the same length. What helped me was transforming my Series to a DataFrame with pd.DataFrame() and then assigning its values to a new column of my existing df.
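A minimal sketch of that workaround, assuming the underlying cause was an index mismatch (the frames and names here are hypothetical, not from the original post):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]})
# a Series whose index does not line up with df's index
s = pd.Series([10, 20, 30], index=[7, 8, 9])
# direct assignment (df['B'] = s) would align on the index and produce NaN;
# wrapping the Series in a DataFrame and taking the raw values drops the index
df['B'] = pd.DataFrame(s).to_numpy().ravel()
print(df)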

Related

Pandas: Append copy of rows changing only values in multiple columns larger than max allowed to split bin values

Problem: I have a data frame that I need to modify based on the values of particular columns. If any column's value is greater than the maximum allowed, the row is broken into additional rows by distributing the value into equally sized bins (taking the integer division of the data value by the max allowed value).
Table and Explanation:
Original:
Index  Data 1  Data 2  Max. Allowed
1      1       2       3
2      10      5       8
3      7       12      5
Required (index values in brackets refer to the original index value):
Index  Data 1  Data 2  Max. Allowed
1 (1)  1       2       3
2 (2)  8       5       8
3      2       0       8
4 (3)  5       5       5
5      2       5       5
6      0       2       5
Since the original index = 2 had Data 1 = 10, which is greater than the max allowed = 8, that row has been broken into two rows, as shown in the table above.
Attempt: I was able to find which columns had values greater than the max allowed, and the number of rows to be inserted. But I was unsure whether that approach would work if both columns had values greater than the max allowed (as is the case for index = 3). The values below indicate how many more rows are to be inserted for each index value and column; a sketch of how to compute them follows the table.
Index  Data 1  Data 2
1      0       0
2      1       0
3      1       2
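One possible way to compute that table (a sketch, assuming the column names from the question): integer division of each data column by Max. Allowed gives the number of extra rows per value. Note that a value which is an exact multiple of the maximum would be counted as one extra row by this formula.
import pandas as pd
df = pd.DataFrame({'Index': [1, 2, 3],
                   'Data 1': [1, 10, 7],
                   'Data 2': [2, 5, 12],
                   'Max. Allowed': [3, 8, 5]})
# floor-divide each data column by the row's Max. Allowed value
extra_rows = df[['Data 1', 'Data 2']].floordiv(df['Max. Allowed'], axis=0)
extra_rows.insert(0, 'Index', df['Index'])
print(extra_rows)
#    Index  Data 1  Data 2
# 0      1       0       0
# 1      2       1       0
# 2      3       1       2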
Let's approach this in the following steps:
Step 1: Preparation of split values:
Define a custom lambda function to turn Data 1 and Data 2 into lists of values, split by Max. Allowed whenever the value is larger than it. Hold the expanded lists in 2 new columns, Data 1x and Data 2x:
f = lambda x, y, z: [z] * (x // z) + [x % z] + [0] * (max(x//z, y//z) - x//z)
df['Data 1x'] = df.apply(lambda x: f(x['Data 1'], x['Data 2'], x['Max. Allowed']) , axis=1)
df['Data 2x'] = df.apply(lambda x: f(x['Data 2'], x['Data 1'], x['Max. Allowed']) , axis=1)
The lambda function pads the lists with 0 so that the two lists in the same row have the same length.
Intermediate result:
print(df)
Index Data 1 Data 2 Max. Allowed Data 1x Data 2x
0 1 1 2 3 [1] [2]
1 2 10 5 8 [8, 2] [5, 0]
2 3 7 12 5 [5, 2, 0] [5, 5, 2]
Step 2: Explode split values into separate rows:
Case 1: If your Pandas version is 1.3 or above
We use DataFrame.explode() to explode the 2 new columns (exploding multiple columns at once requires Pandas 1.3 or above):
df = df.explode(['Data 1x', 'Data 2x'])
Case 2: For Pandas version lower than 1.3, try the following way to explode:
df = df.apply(pd.Series.explode)
Case 3: If the above 2 ways to explode don't work in your programming environment, use:
df_exp = df.explode('Data 1x')[['Index', 'Data 1', 'Data 2', 'Max. Allowed']].reset_index(drop=True)
df_1x = df.explode('Data 1x')[['Data 1x']].reset_index(drop=True)
df_2x = df.explode('Data 2x')[['Data 2x']].reset_index(drop=True)
df = df_exp.join([df_1x, df_2x])
Result:
print(df)
Index Data 1 Data 2 Max. Allowed Data 1x Data 2x
0 1 1 2 3 1 2
1 2 10 5 8 8 5
1 2 10 5 8 2 0
2 3 7 12 5 5 5
2 3 7 12 5 2 5
2 3 7 12 5 0 2
Step 3: Formatting to the required output:
# select and rename columns
df = (df[['Index', 'Data 1x', 'Data 2x', 'Max. Allowed']]
.rename({'Data 1x': 'Data 1', 'Data 2x': 'Data 2'}, axis=1)
.reset_index(drop=True)
)
# reset the `Index` values
df['Index'] = df.index + 1
Final result:
print(df)
Index Data 1 Data 2 Max. Allowed
0 1 1 2 3
1 2 8 5 8
2 3 2 0 8
3 4 5 5 5
4 5 2 5 5
5 6 0 2 5
Assuming you are willing to process the data frame row by row, you could carry out the check against the maximum value in a while loop and populate a new data frame with the new rows.
import pandas as pd
df = pd.DataFrame({"Index" : [1, 2, 3], "Data 1" : [1,10,7], "Data 2" : [2,5,12], "Max_Allowed" : [3,8,5]})
print(df)
# create a new data frame that we can populate with rows of data
dfz = pd.DataFrame(columns=("Index", "Data 1","Data 2","Max_Allowed"))
iz = 0
for (_, col1, col2, col3, col4) in df.itertuples(name=None):
    if col2 <= col4 and col3 <= col4:
        dfz.loc[iz] = [str(iz+1)+"("+str(col1)+")", col2, col3, col4]
        iz += 1
    else:
        iz_orig = iz  # keep the index we are at currently
        while col2 > 0 or col3 > 0:
            if col2 > col4:  # check if more than maximum value for Data 1
                col2p = col4
                col2 -= col4  # subtract the maximum value from the current value
            else:
                col2p = col2
                col2 = 0  # set the value to zero
            if col3 > col4:  # check if more than maximum value for Data 2
                col3p = col4
                col3 -= col4
            else:
                col3p = col3
                col3 = 0
            if iz_orig == iz:
                # enter with the original Index in parentheses
                dfz.loc[iz] = [str(iz+1)+"("+str(col1)+")", col2p, col3p, col4]
            else:
                # enter the row with just the new Index
                dfz.loc[iz] = [str(iz+1), col2p, col3p, col4]
            iz += 1
print(dfz)

How to drop float values from a column - pandas

I have a dataframe as shown below:
df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1],
'val' :[5,6.4,5.4,6,6,6]
})
It looks like this:
subject_id val
0 1 5.0
1 1 6.4
2 1 5.4
3 1 6.0
4 1 6.0
5 1 6.0
I would like to drop the rows where the val column ends with .[1-9]. Basically I would like to retain values like 5.0 and 6.0 and drop values like 5.4, 6.4, etc.
Though I tried the code below, it isn't accurate:
df['val'] = df['val'].astype(int)
df.drop_duplicates() # it doesn't give the expected output and isn't accurate
I expect my output to be as shown below:
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
The first idea is to compare the original values with the column cast to integer, and also assign the integers back for the expected output (integer column):
s = df['val']
df['val'] = df['val'].astype(int)
df = df[df['val'] == s]
print (df)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
Another idea is to test is_integer:
mask = df['val'].apply(lambda x: x.is_integer())
df['val'] = df['val'].astype(int)
df = df[mask]
print (df)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
If you need floats in the output, use:
df1 = df[ df['val'].astype(int) == df['val']]
print (df1)
subject_id val
0 1 5.0
3 1 6.0
4 1 6.0
5 1 6.0
Use mod 1 to determine the residual. If the residual is 0, the number is an integer. Then use the result as a mask to select only those rows.
df.loc[df.val.mod(1).eq(0)].astype(int)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6

Append data with one column to existing dataframe

I want to append a list of data to a dataframe such that the list will appear in a column, i.e.:
#Existing dataframe:
[A, 20150901, 20150902
1 4 5
4 2 7]
#list of data to append to column A:
data = [8,9,4]
#Required dataframe
[A, 20150901, 20150902
1 4 5
4 2 7
8 0 0
9 0 0
4 0 0]
I am using the following:
df_new = df.copy(deep=True)
#I am copying and deleting data, as the column names are of type Timestamp and it is easier to reuse them
df_new.drop(df_new.index, inplace=True)
for item in data_list:
df_new = df_new.append([{'A':item}], ignore_index=True)
df_new.fillna(0, inplace=True)
df = pd.concat([df, df_new], axis=0, ignore_index=True)
But doing this in a loop is inefficient, plus I get this warning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
Any ideas on how to overcome this warning and append the 2 dataframes in one go?
I think you need to concat the new DataFrame with column A, then reindex if you want the same order of columns, and finally replace missing values with fillna:
data = [8,9,4]
df_new = pd.DataFrame({'A':data})
df = (pd.concat([df, df_new], ignore_index=True)
.reindex(columns=df.columns)
.fillna(0, downcast='infer'))
print (df)
A 20150901 20150902
0 1 4 5
1 4 2 7
2 8 0 0
3 9 0 0
4 4 0 0
I think you could do something like this (note that DataFrame.append was removed in pandas 2.0; a concat-based sketch follows the output below):
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame({'A':[8,9,4]})
df.append(df2).fillna(0)
A B
0 1 2.0
1 3 4.0
0 8 0.0
1 9 0.0
2 4 0.0
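A rough pd.concat equivalent of the snippet above, for pandas 2.0+ where DataFrame.append no longer exists:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame({'A': [8, 9, 4]})
# concat stacks the two frames; fillna replaces the missing B values with 0
print(pd.concat([df, df2]).fillna(0))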
Maybe you can do it this way:
import numpy as np
new = pd.DataFrame(np.zeros((3, 3)), columns=existed_dataframe.columns) # create a new zero dataframe with the same columns
new['A'] = [8, 9, 4] # add the values to column A
existed_dataframe.append(new) # and merge both dataframes

reshape pandas column to allow sum instead of all values

I have a data frame with 10 columns which successfully loads into a classifier. Now I am trying to load the sum of the columns instead of all 10 columns.
previous_games_stats = pd.read_csv('stats/2016-2017 CANUCKS STATS.csv', header=1)
numGamesToLookBack = 10
X = previous_games_stats[['GF', 'GA']]
X = X[0:numGamesToLookBack] #num games to look back
stats_feature_names = list(X.columns.values)
totals = pd.DataFrame(X, columns=stats_feature_names)
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[numGamesToLookBack+1]
df = pd.DataFrame(iris.data, columns=iris.feature_names)
stats_df = pd.DataFrame(X, columns=stats_feature_names).sum()
The final line (with .sum() at the end) causes stats_df to go from being formatted like:
GF GA
0 2 1
1 4 3
2 2 1
3 2 1
4 3 4
5 2 4
6 0 3
7 0 2
8 2 5
9 0 3
to:
GF 17
GA 27
But I want to keep the same format, so the end result should be this:
GF GA
0 17 27
Since it is getting re-formatted, I am getting the following error:
IndexError: boolean index did not match indexed array along dimension 0; dimension is 4 but corresponding boolean dimension is 3
What can I do to make the format stay the same?
If you call sum on a DataFrame, you get a Series. For a one-row DataFrame use:
stats_df = pd.DataFrame(X, columns=stats_feature_names).sum().to_frame().T
Another solution:
df1 = pd.DataFrame(X, columns=stats_feature_names)
stats_df = pd.DataFrame([df1.sum().values], columns=df1.columns)
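A minimal, self-contained check of the first solution, using the GF/GA values shown in the question:
import pandas as pd
X = pd.DataFrame({'GF': [2, 4, 2, 2, 3, 2, 0, 0, 2, 0],
                  'GA': [1, 3, 1, 1, 4, 4, 3, 2, 5, 3]})
# sum() collapses the frame to a Series; to_frame().T restores a one-row DataFrame
stats_df = X.sum().to_frame().T
print(stats_df)
#    GF  GA
# 0  17  27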

Compare rows of 2 pandas data frames by column and keep bigger and sum

I have two data frames of same IDs with identical structure:
X, Y, Value, ID
The only difference between the two should be the values in column Value; the frames may need to be sorted by ID first to make sure both have the same order of rows.
I want to compare these two data frames row by row based on column Value and keep the row from the first or second frame depending on where the Value is bigger. I would also like to see an example of how to add an additional column SUM holding the sum of the Value columns from both data frames.
I will be glad for any example, including one using numpy if you feel it is better suited for this than Pandas.
Edit: After testing the example from the first answer, I just realized that my data frames are completely missing the rows with IDs where Value was null. That makes two data frames with different numbers of rows. So could you please also include how to make them the same size before comparison, i.e. adding to each frame the rows with the IDs missing from the other, filled with zeros?
import numpy as np
import pandas as pd
# create a new dataframe, where Value is the max value per row
val1 = df1['Value']
val2 = df2['Value'][val1.index] # align to val1
df = df1.copy()
df['Value'] = np.maximum(val1, val2)
# add a SUM column:
df1['SUM'] = df1['Value'].sum()
df2['SUM'] = df2['Value'].sum()
# or compute the max and the sum in one go:
df = (pd.concat([df1, df2])
        .groupby(['ID', 'X', 'Y'])
        .agg(Value=('Value', 'max'), SUM=('Value', 'sum')))
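A hedged demonstration of the groupby variant, using the same sample frames as the answer below; because groupby collects the IDs from both frames, it also copes with frames of different lengths (the question's edit):
import pandas as pd
df1 = pd.DataFrame({'X': [1, 2, 3], 'Y': [4, 5, 6], 'Value': [10, 55, 21], 'ID': [0, 1, 2]})
df2 = pd.DataFrame({'X': [2, 3], 'Y': [5, 6], 'Value': [7, 34], 'ID': [1, 2]})
out = (pd.concat([df1, df2])
         .groupby(['ID', 'X', 'Y'])
         .agg(Value=('Value', 'max'), SUM=('Value', 'sum'))
         .reset_index())
print(out)
#    ID  X  Y  Value  SUM
# 0   0  1  4     10   10
# 1   1  2  5     55   62
# 2   2  3  6     34   55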
I use reindex_like to align the dataframes, and then where and loc to fill column Value of the new dataframe df:
print (df1)
X Y Value ID
0 1 4 10 0
1 2 5 55 1
2 3 6 21 2
print (df2)
X Y Value ID
0 2 5 7 1
1 3 6 34 2
#align dataframes
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df2 = df2.reindex_like(df1)
print (df2)
X Y Value
ID
0 NaN NaN NaN
1 2 5 7
2 3 6 34
#create new df
df = df1.copy()
df['Value'] = df1['Value'].where(df1['Value'] > df2['Value'], df2['Value'])
#if Value is NaN in df2, take the value from df1
df.loc[df2['Value'].isnull(), 'Value'] = df1['Value']
print (df)
X Y Value
ID
0 1 4 10
1 2 5 55
2 3 6 34
#write the sum of column Value into column SUM
df1['SUM'] = df1['Value'].sum()
df2['SUM'] = df2['Value'].sum()
print (df1)
X Y Value SUM
ID
0 1 4 10 86
1 2 5 55 86
2 3 6 21 86
#remove rows with NaN
print (df2.dropna())
X Y Value SUM
ID
1 2 5 7 41
2 3 6 34 41
