Here is a picture of my file; the upper one is the original, and the bottom one is what I get after I run my code: https://imgur.com/a/zUCGart
Here is my code:
import pandas as pd
path = r'C:\Users\myname\Downloads\RGB.xlsx'
df = pd.read_excel(path)
df['RGB'] = df.iloc[1:7:,:5:8].sum(axis=1)
df.to_excel(path)
Basically, what I want to do is create a new column called RGB that sums the values of the Red, Blue, and Green columns. That's why I wrote 1:7:,:5:8, to apply it to all of the rows and to the 5th, 6th, and 7th (Red, Blue, Green) columns, but instead it just made RGB equal to the Black (first) column...
Not sure what I did wrong here.
My original dataframe:
Black Orange Yellow Brown Blue Green Red
7 4 3 1 6 7 2
3 3 3 8 4 5 2
6 7 3 2 2 2 5
2 9 2 2 2 2 2
5 5 5 5 5 5 5
2 2 8 2 27 8 2
You have some extra colons in your code, at df.iloc[1:7:,:5:8]. Try without them. Let me know if it works; otherwise I will come back with a more general solution.
import pandas as pd
path = r'C:\Users\myname\Downloads\RGB.xlsx'
df = pd.read_excel(path)
df['RGB'] = df.iloc[1:7,5:8].sum(axis=1)
df.to_excel(path)
You should be able to do it like this:
df['RGB'] = df['Red'] + df['Green'] + df['Blue']
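For a slightly fuller variant, here is a minimal sketch that sums the three named columns in one step (assuming the path and column headers from the question):
import pandas as pd

path = r'C:\Users\myname\Downloads\RGB.xlsx'
df = pd.read_excel(path)

# sum only the named columns, row-wise
df['RGB'] = df[['Red', 'Green', 'Blue']].sum(axis=1)

df.to_excel(path, index=False)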
Assume the below dummy dataframe:
a b
0 red 0
1 red 1
2 red 3
3 blue 0
4 blue 1
5 blue 3
6 black 4
7 black 2
I want to sample on column a with a sample size of 1, with the one condition that the values of column b in the resulting sample are unique, so that it gives me a result something like:
a b
2 red 0
3 blue 1
5 black 4
The below result is not acceptable:
a b
2 red 1
3 blue 1
5 black 4
Right now I am using the below code with the pandas sample function, but it is returning duplicate values in column b:
df.groupby('a').sample(n=1)
sample(n=1) selects a row at random from each group in column 'a'.
You can use df.groupby('a').first(), which chooses the first row of each group and would return the output you are looking for.
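For illustration, a minimal sketch on the dummy dataframe from the question:
import pandas as pd

df = pd.DataFrame({'a': ['red', 'red', 'red', 'blue', 'blue', 'blue', 'black', 'black'],
                   'b': [0, 1, 3, 0, 1, 3, 4, 2]})

# first() takes the first row of each group (groups sorted by key)
print(df.groupby('a').first())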
I have a dataframe with column headings (and, for my real data, multi-level row indexes). I want to add a second-level index to the columns based on a list I have.
import pandas as pd
data = {"apple": [7,5,6,4,7,5,8,6],
        "strawberry": [3,5,2,1,3,0,4,2],
        "banana": [1,2,1,2,2,2,1,3],
        "chocolate": [5,8,4,2,1,6,4,5],
        "cake": [4,4,5,1,3,0,0,3]
        }
df = pd.DataFrame(data)
food_cat = ["fv","fv","fv","j","j"]
I want something that looks like this:
I tried to use the approach from "How to add a second level column header/index to dataframe by matching to dictionary values?", but couldn't get it working (and it's not ideal, as I'd need to figure out how to automate the dictionary, which I don't have).
I also tried adding the list as a row in the dataframe and converting that row to a second-level index, as in this answer, using
df.loc[len(df)] = food_cat
df = pd.MultiIndex.from_arrays(df.columns, df.iloc[len(df)-1])
but got the error
Check if lengths of all arrays are equal or not,
TypeError: Input must be a list / sequence of array-likes.
I also tried using df = pd.MultiIndex.from_arrays(df.columns, np.array(food_cat)) with import numpy as np but got the same error.
I feel like this should be a simple task (it is for rows), and there are a lot of similar questions, but I struggled to find something I could adapt to my data.
Pandas MultiIndex creation requires a single list (or list-like) of arrays passed as the argument:
df.columns = pd.MultiIndex.from_arrays([food_cat, df.columns])
df
     fv                             j
  apple strawberry banana chocolate cake
0     7          3      1         5    4
1     5          5      2         8    4
2     6          2      1         4    5
3     4          1      2         2    1
4     7          3      2         1    3
5     5          0      2         6    0
6     8          4      1         4    0
7     6          2      3         5    3
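With the second level in place, selecting by category works as usual for MultiIndex columns; a small usage sketch:
# select all columns under the 'fv' category
print(df['fv'])

# select a single (category, item) pair
print(df[('fv', 'apple')])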
I'm new to this, so this may sound weird, but basically I have a large dataframe; for simplification purposes, let's assume the dataframe is this:
import pandas as pd
import numpy as np
dfn = pd.DataFrame({'a': [1,2,3,4,5],
                    'b': [6,7,8,9,10],
                    'c': np.nan})
dfn
Output:
a b c
0 1 6 NaN
1 2 7 NaN
2 3 8 NaN
3 4 9 NaN
4 5 10 NaN
What I want to do is fill in the values in the 'c' column based on a condition: if the corresponding row value in 'a' is odd, add it to the corresponding row value in 'b' and put the result in 'c'; otherwise, just use the 'a' value for 'c'.
What I currently have is this:
for row in range(dfn.shape[0]):
    if dfn.loc[row]['a'] % 2 != 0:
        dfn.loc[row]['c'] = dfn.loc[row]['a'] + dfn.loc[row]['b']
    else:
        dfn.loc[row]['c'] = dfn.loc[row]['a']
dfn
Output:
a b c
0 1 6 NaN
1 2 7 NaN
2 3 8 NaN
3 4 9 NaN
4 5 10 NaN
Nothing seems to happen here and I'm not entirely sure why.
I've also tried a different approach of:
is_odd=dfn[dfn['a']%2!=0]
is_odd['c'] = is_odd['a'] + is_odd['b']
is_odd
Here, weirdly, I get the right output:
   a   b   c
0  1   6   7
2  3   8  11
4  5  10  15
But when I call dfn again, it comes out with all NaN values.
I've also tried doing it without using a variable name and nothing happens.
Any idea what I'm missing or if there's a way of doing this?
Thanks!
Use numpy where, which works for conditionals. It is akin to an if statement in Python, but significantly faster. I rarely use iterrows, since I don't find it as efficient as numpy where.
dfn['c'] = np.where(dfn['a'] % 2 != 0,
                    dfn.a + dfn.b,
                    dfn.a)
a b c
0 1 6 7
1 2 7 2
2 3 8 11
3 4 9 4
4 5 10 15
Basically, the first line in np.where defines your condition, which in this case is finding out if the 'a' column is an odd number. If it is, the next line is executed. If it is an even number, then the last line is executed. You can think of it as an if-else statement.
Use Series.mod and Series.where to get a copy of column b with 0 where there is an even value in a, then add this series to a.
dfn['c'] = dfn['b'].where(dfn['a'].mod(2).eq(1), 0).add(dfn['a'])
print(dfn)
a b c
0 1 6 7
1 2 7 2
2 3 8 11
3 4 9 4
4 5 10 15
Alternative
dfn['c'] = dfn['a'].mask(dfn['a'].mod(2).eq(1), dfn['a'].add(dfn['b']))
dfn.loc[row]['c']=... is always wrong. dfn.loc[row] may be either a copy or a view, so you cannot know what will happen. The correct way is:
dfn.loc[row, 'c']=...
Anyway, here you should avoid the iteration and use np.where as suggested in the other answers.
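For completeness, here is a minimal sketch of the original loop corrected to use single-step .loc indexing (the vectorized answers above remain preferable):
# dfn as defined in the question
for row in range(dfn.shape[0]):
    if dfn.loc[row, 'a'] % 2 != 0:
        # single-step .loc[row, col] writes to the frame itself,
        # unlike chained dfn.loc[row]['c'], which may write to a copy
        dfn.loc[row, 'c'] = dfn.loc[row, 'a'] + dfn.loc[row, 'b']
    else:
        dfn.loc[row, 'c'] = dfn.loc[row, 'a']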
Here is my solution, which is close to the original thought of the author of the question; hope it can be helpful:
def oddadd(x):
    if x['a'] % 2 != 0:
        return x['a'] + x['b']
    else:
        return x['a']

dfn["c"] = dfn.apply(oddadd, axis=1)
I am currently using Pandas and Python to handle many of the repetitive tasks I need done for my master's thesis. At this point, I have written some code (with help from Stack Overflow) that, based on some event dates in one file, finds a start and end date to use as a date range in another file. These dates are then located and appended to an empty list, which I can then output to Excel. However, using the code below I get a dataframe with 5 columns and 400,000+ rows (which is basically what I want), but not in the shape I want when outputting it to Excel. Below is my code:
end_date = pd.DataFrame(data=(df_sample['Date']-pd.DateOffset(days=2)))
start_date = pd.DataFrame(data=(df_sample['Date']-pd.offsets.BDay(n=252)))
merged_dates = pd.merge(end_date,start_date,left_index=True,right_index=True)
ff_factors = []
for index, row in merged_dates.iterrows():
    time_range = (df['Date'] > row['Date_y']) & (df['Date'] <= row['Date_x'])
    df_factor = df.loc[time_range]
    ff_factors.append(df_factor)
appended_data = pd.concat(ff_factors, axis=0)
I need the data to be 5 columns by 250 rows (the columns are variable identifiers), laid out side by side, so that when outputting to Excel I have, for example, columns A-D and then 250 rows for each column. This then needs to be repeated for columns E-H, and so on. Using iloc, I can locate the first 250 observations with appended_data.iloc[0:250], with both 5 columns and 250 rows, and then output that to Excel.
Is there any way for me to automate the process, so that after selecting the first 250 and outputting them to Excel, it selects the next 250 and outputs them next to the first 250, and so on?
I hope the above is precise and clear, else I'm happy to elaborate!
EDIT:
The above picture illustrates what I get when outputting to Excel: 5 columns and 407,764 rows. What I need is to get this split up in the following way:
The second picture illustrates how I need the total sample to be split up. The first five columns and their corresponding 250 rows need to be as in the second picture. When I do the next split using iloc[250:500], I will get the next 250 rows, which need to be added after the initial five columns, and so on.
You can do this with a combination of np.reshape, which can be made to behave as desired on individual columns (and should be much faster than a loop through the rows), and pd.concat, which joins the resulting dataframes back together:
def reshape_appended(df, target_rows, pad=4):
    df = df.copy()  # don't modify in-place
    # below line adds strings '0000', ..., '0004' to the column names;
    # this ensures sorting the columns preserves the order
    df.columns = [str(i).zfill(pad) + df.columns[i] for i in range(len(df.columns))]
    # target number of new columns per column in df
    target_cols = len(df.index) // target_rows
    last_group = pd.DataFrame()
    # below conditional fires if there will be leftover rows - % is mod
    if len(df.index) % target_rows != 0:
        last_group = df.iloc[-(len(df.index) % target_rows):].reset_index(drop=True)
        df = df.iloc[:-(len(df.index) % target_rows)]  # keep rows that divide nicely
    # this is a large list comprehension, elaborated on below
    groups = [pd.DataFrame(df[col].values.reshape((target_rows, target_cols),
                                                  order='F'),
                           columns=[str(i).zfill(pad) + col for i in range(target_cols)])
              for col in df.columns]
    if not last_group.empty:  # if there are leftover rows, add them back
        last_group.columns = [pad * '9' + col for col in last_group.columns]
        groups.append(last_group)
    out = pd.concat(groups, axis=1).sort_index(axis=1)
    out.columns = out.columns.str[2 * pad:]  # remove the extra characters in the column names
    return out
last_group takes care of any rows that don't divide evenly into sets of 250. The playing around with column names enforces proper sorting order.
df[col].values.reshape((target_rows, target_cols), order='F')
Reshapes the values in the column col of df into the shape specified by the tuple (target_rows, target_cols), with the ordering Fortran uses, indicated by F.
columns=[str(i).zfill(pad)+col for i in range(target_cols)]
is just giving names to these columns, with an eye to establishing proper ordering afterward.
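For intuition, a small illustration of the column-major ('F') ordering used above:
import numpy as np

# Fortran order fills column by column: the first three values become
# the first column, the next three the second
print(np.arange(6).reshape((3, 2), order='F'))
# [[0 3]
#  [1 4]
#  [2 5]]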
Ex:
df = pd.DataFrame(np.random.randint(0, 10, (23, 3)), columns=list('abc'))
reshape_appended(df, 5)
Out[160]:
a b c a b c a b c a b c a b c
0 8 3 0 4 1 9 5 4 7 2 3 4 5.0 7.0 2.0
1 1 6 1 3 5 1 1 6 0 5 9 4 6.0 0.0 1.0
2 3 1 3 4 3 8 9 3 9 8 7 8 7.0 3.0 2.0
3 4 0 1 5 5 6 6 4 4 0 0 3 NaN NaN NaN
4 9 7 3 5 7 4 6 5 8 9 5 5 NaN NaN NaN
df
Out[161]:
a b c
0 8 3 0
1 1 6 1
2 3 1 3
3 4 0 1
4 9 7 3
5 4 1 9
6 3 5 1
7 4 3 8
8 5 5 6
9 5 7 4
10 5 4 7
11 1 6 0
12 9 3 9
13 6 4 4
14 6 5 8
15 2 3 4
16 5 9 4
17 8 7 8
18 0 0 3
19 9 5 5
20 5 7 2
21 6 0 1
22 7 3 2
My best guess at solving the problem would be to loop until the counter is greater than the length, so:
i = 250  # counter (right limit)
j = 0    # left limit
for x in range(len(appended_data)):
    chunk = appended_data.iloc[j:i]  # current 250-row block
    i += 250
    if i > len(appended_data):
        chunk = appended_data.iloc[j:len(appended_data)]  # last, shorter block
        break
    else:
        j = i
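As a fuller sketch, the same chunking idea can write each block side by side using pandas' ExcelWriter with the startcol parameter; this assumes appended_data from the question, and the output path and the one-column gap between blocks are illustrative:
import pandas as pd

chunk_size = 250  # rows per block, per the question
with pd.ExcelWriter('output.xlsx') as writer:  # hypothetical output path
    for k, start in enumerate(range(0, len(appended_data), chunk_size)):
        chunk = appended_data.iloc[start:start + chunk_size].reset_index(drop=True)
        # place each block to the right of the previous one,
        # leaving one empty spacer column between blocks
        chunk.to_excel(writer, startcol=k * (len(chunk.columns) + 1), index=False)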
After slicing, I have a multi header Dataframe with two levels, indexed by date, obtained like this:
df = df.iloc[:, df.columns.get_level_values(1).isin({'a','b'})]
     one     two
       a  b    a  b
Date
2      2  3    3  3
3      2  3    3  3
4      2  3    3  3
5      2  3    3  3
6      2  3    3  3
7      2  3    3  3
What I would like to do is plot this dataframe as a line plot with the Date on the x axis, using the same color for each level-0 group and solid/dashed lines to distinguish the level-1 values.
I have tried unstacking, i.e.
df.unstack(level=0).plot(kind='line')
but with no success. The plot as it is now shows Date on the x axis but treats each combination of level-0 and level-1 headers as a separate entry.
Here is a picture of the plot obtained:
What we would like to implement is a two-level legend (color/shape of line).
Code Example:
import numpy as np
import pandas as pd
A = np.random.rand(4,4)
C = pd.DataFrame(A, index=range(4), columns=[np.array(['A','A','B','B']), np.array(['a','b','a','b'])])
C.plot(kind='line')
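For reference, a minimal sketch of the kind of plot described, keeping one color per level-0 group and varying the line style over level 1; the specific colors and styles are illustrative:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

A = np.random.rand(4, 4)
C = pd.DataFrame(A, index=range(4),
                 columns=[np.array(['A', 'A', 'B', 'B']),
                          np.array(['a', 'b', 'a', 'b'])])

colors = {'A': 'tab:blue', 'B': 'tab:orange'}  # one color per level-0 label
styles = {'a': '-', 'b': '--'}                 # one line style per level-1 label

fig, ax = plt.subplots()
# DataFrame.items() yields ((level0, level1), Series) for MultiIndex columns
for (lvl0, lvl1), series in C.items():
    ax.plot(series.index, series.values,
            color=colors[lvl0], linestyle=styles[lvl1],
            label=f'{lvl0}, {lvl1}')
ax.legend()
plt.show()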