Generate dataframe columns where each column is shift(-1) of the previous column - python

I want to create 44 dataframe columns based on TAZ_1270 such that each column is shift(-1) of the previous column.
How can I do this without writing it out 44 times?
df['m1'] = df['TAZ_1270'].shift(-1)
df['m2'] = df['m1'].shift(-1)
df['m3'] = df['m2'].shift(-1)

Use DataFrame.assign with a dict comprehension.
Here is a minimal example with 4 shifts:
df = pd.DataFrame({'TAZ_1270': [100047, 100500, 100488, 100099]})
# TAZ_1270
# 0 100047
# 1 100500
# 2 100488
# 3 100099
df = df.assign(**{f'm{i}': df['TAZ_1270'].shift(-i) for i in range(1, 5)})
# TAZ_1270 m1 m2 m3 m4
# 0 100047 100500.0 100488.0 100099.0 NaN
# 1 100500 100488.0 100099.0 NaN NaN
# 2 100488 100099.0 NaN NaN NaN
# 3 100099 NaN NaN NaN NaN
Re: questions in comments
Why use **?
DataFrame.assign normally accepts the format df.assign(col1=foo, col2=bar, ...). When we use ** on a dict in a function call, it automatically unpacks the dict's 'col1': foo, 'col2': bar, ... pairs into col1=foo, col2=bar, ... arguments.
Why use f?
This is f-string syntax (introduced in Python 3.6). f'm{i}' is just a more concise version of 'm' + str(i).
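The same comprehension scales directly to all 44 columns. As a minimal sketch (my own variant, not from the answer above: it builds the shifted columns in one pass with pd.concat instead of assigning them one at a time):
import pandas as pd

df = pd.DataFrame({'TAZ_1270': [100047, 100500, 100488, 100099]})
# build all 44 shifted columns at once, then join them to the original
shifted = pd.concat({f'm{i}': df['TAZ_1270'].shift(-i) for i in range(1, 45)}, axis=1)
df = pd.concat([df, shifted], axis=1)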

Comparing 2 dataframes and set the value of the dataframe if not exists [duplicate]

I cannot find a pandas function (which I had seen before) to substitute the NaN's in a dataframe with values from another dataframe (assuming a common index which can be specified). Any help?
If you have two DataFrames of the same shape, then:
df[df.isnull()] = d2
will do the trick.
Only locations where df.isnull() evaluates to True will be eligible for assignment.
In practice, the DataFrames aren't always the same size / shape, and transforming methods (especially .shift()) are useful.
Data coming in is invariably dirty, incomplete, or inconsistent. Par for the course. There's a pretty extensive pandas tutorial and associated cookbook for dealing with these situations.
As I just learned, there is a DataFrame.combine_first() method, which does precisely this, with the additional property that if your updating data frame d2 is bigger than your original df, the additional rows and columns are added, as well.
df = df.combine_first(d2)
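For instance, a small runnable sketch of that union behaviour (the data here is my own illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan]})
d2 = pd.DataFrame({'a': [9.0, 2.0], 'b': [5.0, 6.0]})
# NaNs in df are filled from d2; d2's extra column 'b' is added too
df = df.combine_first(d2)
#      a    b
# 0  1.0  5.0
# 1  2.0  6.0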
This should be as simple as
df = df.fillna(d2)
Note the assignment: without it (or inplace=True), fillna returns a new DataFrame and leaves df unchanged.
A dedicated method for this is DataFrame.update:
Quoted from the documentation:
Modify in place using non-NA values from another DataFrame.
Aligns on indices. There is no return value.
Important to note: this method modifies your data in place, so it will overwrite the DataFrame you call it on.
Example:
print(df1)
A B C
aaa NaN 1.0 NaN
bbb NaN NaN 10.0
ccc 3.0 NaN 6.0
ddd NaN NaN NaN
eee NaN NaN NaN
print(df2)
A B C
index
aaa 1.0 1.0 NaN
bbb NaN NaN 10.0
eee NaN 1.0 NaN
# update df1 NaN where there are values in df2
df1.update(df2)
print(df1)
A B C
aaa 1.0 1.0 NaN
bbb NaN NaN 10.0
ccc 3.0 NaN 6.0
ddd NaN NaN NaN
eee NaN 1.0 NaN
Notice the updated NaN values at the intersections (aaa, A) and (eee, B).
DataFrame.combine_first() answers this question exactly.
However, sometimes you want to fill/replace/overwrite some of the non-missing (non-NaN) values of DataFrame A with values from DataFrame B. That question brought me to this page, and the solution is DataFrame.mask()
A = B.mask(condition, A)
When condition is true, the values from A will be used, otherwise B's values will be used.
For example, you could solve the OP's original question with mask such that when an element from A is non-NaN, use it, otherwise use the corresponding element from B.
But using DataFrame.mask() you could replace the values of A that fail to meet arbitrary criteria (less than zero? more than 100?) with values from B. So mask is more flexible, and overkill for this problem, but I thought it was worthy of mention (I needed it to solve my problem).
It's also important to note that B could be a numpy array instead of a DataFrame. DataFrame.combine_first() requires that B be a DataFrame, but DataFrame.mask() just requires an NDFrame or array whose dimensions match A's.
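As a minimal sketch of that idea on the original NaN-filling problem (the names df and d2 follow the first answer; the data is my own illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})
d2 = pd.DataFrame({'x': [9.0, 2.0], 'y': [3.0, 9.0]})
# keep df's value where it is non-NaN, otherwise fall back to d2
df = d2.mask(df.notna(), df)
#      x    y
# 0  1.0  3.0
# 1  2.0  4.0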
One important piece of information missing from the other answers is that both combine_first and fillna match on index, so you have to make the indices match across the DataFrames for these methods to work.
Oftentimes there's a need to match on some other column(s) to fill in missing values. In that case, you need to use set_index first to make the columns to be matched the index.
df1 = df1.set_index(cols_to_be_matched).fillna(df2.set_index(cols_to_be_matched)).reset_index()
or
df1 = df1.set_index(cols_to_be_matched).combine_first(df2.set_index(cols_to_be_matched)).reset_index()
Another option is to use merge:
df1 = (df1.merge(df2, on=cols_to_be_matched, how='left', suffixes=('', '\x00'))
          .sort_index(axis=1).bfill(axis=1)[df1.columns])
The idea here is to left-merge, and by sorting the columns (we use '\x00' as the suffix for columns coming from df2, since it is the character with the lowest Unicode value) we make sure corresponding columns end up next to each other. Then use bfill horizontally to update df1 with values from df2.
Example:
Suppose you had df1:
C1 C2 C3 C4
0 1 a 1.0 0
1 1 b NaN 1
2 2 b NaN 2
3 2 b NaN 3
and df2
C1 C2 C3
0 1 b 2
1 2 b 3
and you want to fill in the missing values in df1 with values from df2 for each C1-C2 value pair. Then
cols_to_be_matched = ['C1', 'C2']
and all of the snippets above produce the following output (where the values are indeed filled as required):
C1 C2 C3 C4
0 1 a 1.0 0
1 1 b 2.0 1
2 2 b 3.0 2
3 2 b 3.0 3
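For reference, a minimal runnable version of the set_index + fillna variant on this example (the frames and column names are the ones shown above):
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'C1': [1, 1, 2, 2], 'C2': ['a', 'b', 'b', 'b'],
                    'C3': [1.0, np.nan, np.nan, np.nan], 'C4': [0, 1, 2, 3]})
df2 = pd.DataFrame({'C1': [1, 2], 'C2': ['b', 'b'], 'C3': [2, 3]})

cols_to_be_matched = ['C1', 'C2']
# fillna aligns on the (C1, C2) MultiIndex built by set_index
df1 = (df1.set_index(cols_to_be_matched)
          .fillna(df2.set_index(cols_to_be_matched))
          .reset_index())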

Adding a value at the end of a column in a multindex column dataframe

I have a simple problem that probably has a simple solution but I couldn't find it anywhere. I have the following multiindex column DataFrame:
mux = pd.MultiIndex.from_product([['A', 'B', 'C'], ['Datetime', 'Str', 'Ret']])
dfr = pd.DataFrame(columns=mux)
| A | B | C |
|Datetime|Str|Ret|Datetime|Str|Ret|Datetime|Str|Ret|
I need to add values one by one at the end of a specific sub-column. For example, add one value at the end of column A, sub-column Datetime, and leave the rest of the row as it is; then add another value to column B, sub-column Str, and again leave the rest of the values in that row untouched; and so on. So my questions are: is it possible to target individual locations in this type of DataFrame, and how? And is it possible to append not a full row but an individual value, always right after the previous value, without knowing where the end is? Thank you so much for your answers.
IIUC, you can use .loc:
idx = len(dfr) # get the index of the next row after the last one
dfr.loc[idx, ('A', 'Datetime')] = pd.to_datetime('2021-09-24')
dfr.loc[idx, ('B', 'Str')] = 'Hello'
dfr.loc[idx, ('C', 'Ret')] = 4.3
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 00:00:00 NaN NaN NaN Hello NaN NaN NaN 4.3
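Note that assigning with .loc to a row label that does not yet exist enlarges the DataFrame (pandas calls this "setting with enlargement"), which is what makes this append pattern work.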
Update
I mean, for example, when I have a different number of values in different columns (say 6 values in column A-Str but only 4 in column B-Datetime), but I don't really know how many. In that case, what I need is to add the next value in that column after the last one, so I need to know the index of the last non-NaN value of that particular column. If I use len(dfr) while trying to add a value to the column that only has 4 values, it will end up in the 7th row instead of the 5th, because one of the columns may have more values than the others.
You can do it easily using last_valid_index. Create a convenience function append_to_col to append values in place in your dataframe:
def append_to_col(col, val):
    idx = dfr[col].last_valid_index()
    dfr.loc[idx + 1 if idx is not None else 0, col] = val
# Fill your dataframe
append_to_col(('A', 'Datetime'), '2021-09-24')
append_to_col(('A', 'Datetime'), '2021-09-25')
append_to_col(('B', 'Str'), 'Hello')
append_to_col(('C', 'Ret'), 4.3)
append_to_col(('C', 'Ret'), 8.2)
append_to_col(('A', 'Datetime'), '2021-09-26')
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 NaN NaN NaN Hello NaN NaN NaN 4.3
1 2021-09-25 NaN NaN NaN NaN NaN NaN NaN 8.2
2 2021-09-26 NaN NaN NaN NaN NaN NaN NaN NaN
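Series.last_valid_index returns the label of the last non-NA entry, or None when the column is empty or all-NA, which is why the helper falls back to row 0 in that case.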

Add new column with column names of a table, based on conditions [duplicate]

I have a dataframe as below:
I want to get the name of the column for a particular row if that row contains 1 in that column.
Use DataFrame.dot:
df1 = df.dot(df.columns)
If there are multiple 1s per row:
df2 = df.dot(df.columns + ';').str.rstrip(';')
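As a small runnable illustration of the dot trick (the data is my own): multiplying the 0/1 frame by its column names turns each 1 into the column's name and each 0 into an empty string, and the row-wise sums concatenate them.
import pandas as pd

df = pd.DataFrame({'foo': [0, 0], 'bar': [1, 0], 'spam': [1, 1]})
df2 = df.dot(df.columns + ';').str.rstrip(';')
# 0    bar;spam
# 1        spam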
Firstly
Your question is very ambiguous and I recommend reading this link in #sammywemmy's comment. If I understand your problem correctly... we'll talk about this mask first:
df.columns[
    (df == 1)      # mask
    .any(axis=0)   # mask
]
What's happening? Let's work our way outward, starting from within df.columns[**HERE**]:
(df == 1) makes a boolean mask of the df with True/False (1/0)
.any() as per the docs:
"Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent".
This gives us a handy Series to mask the column names with.
We will use this example to automate your solution below.
Next:
Automate to get an output of (<row index>, [<col name>, <col name>, ...]) wherever there is a 1 in the row values. Although this will be slower on large datasets, it should do the trick:
import pandas as pd
data = {'foo':[0,0,0,0], 'bar':[0, 1, 0, 0], 'baz':[0,0,0,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data, index=['a','b','c','d'])
print(df)
foo bar baz spam
a 0 0 0 0
b 0 1 0 1
c 0 0 0 0
d 0 0 0 1
# group our df by index and create a dict mapping each index label to a one-row df
df_dict = dict(list(df.groupby(df.index)))
Next step is a for loop that iterates the contents of each df in df_dict, checks them with the mask we created earlier, and prints the intended results:
for k, v in df_dict.items():  # k: index label, v: a one-row df
    check = v.columns[(v == 1).any()]
    if len(check) > 0:
        print((k, check.to_list()))
('b', ['bar', 'spam'])
('d', ['spam'])
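A more compact alternative (my own sketch, not from the answer) that produces the same pairs without the groupby detour:
# for each row, collect the names of the columns equal to 1
ones = df.apply(lambda row: row.index[row == 1].tolist(), axis=1)
print({k: v for k, v in ones.items() if v})
# {'b': ['bar', 'spam'], 'd': ['spam']}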
Side note:
You see how I generated sample data that can be easily reproduced? In the future, please try to post questions with sample data that can be reproduced. It helps you understand your problem better, and it makes it easier for us to answer.
Getting the column name divides into two cases.
First case: if you want the result in a new column, the condition should be unique, because this yields only one column name per row.
import numpy as np

data = {'foo': [0, 0, 3, 0], 'bar': [0, 5, 0, 0], 'baz': [0, 0, 2, 0], 'spam': [0, 1, 0, 1]}
df = pd.DataFrame(data)
df = df.replace(0, np.nan)
df
foo bar baz spam
0 NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0
2 3.0 NaN 2.0 NaN
3 NaN NaN NaN 1.0
If you are looking for the minimum or maximum:
# avoid shadowing the built-in max/min names
max_col = df.idxmax(axis=1)
min_col = df.idxmin(axis=1)
out = df.assign(max=max_col, min=min_col)
out
foo bar baz spam max min
0 NaN NaN NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0 bar spam
2 3.0 NaN 2.0 NaN foo baz
3 NaN NaN NaN 1.0 spam spam
Second case: if your condition is satisfied in multiple columns (for example, you are looking for columns that contain 1), you want a list, because the result cannot fit in the same dataframe.
str_con = df.astype(str).apply(lambda x: x.str.contains('1.0', case=False, na=False)).any()
df.columns[str_con]
# output
Index(['spam'], dtype='object')  # only spam contains 1
Or, if you are looking for a numeric condition, e.g. columns containing a value greater than 1:
num_con = df.apply(lambda x: x > 1.0).any()
df.columns[num_con]
# output
Index(['foo', 'bar', 'baz'], dtype='object')  # these columns have values greater than 1
Happy learning

Python adding two dataframes based on index (edited)

(no idea how to introduce a matrix here for readability)
I have two dataframes obtained with pandas in Python.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Index': ['0', '1', '2'], 'number': [3, 'dd', 1], 'people': [3, 's', 3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0', '1', '2'], 'quantity': [3, 2, 'hi'], 'persons': [1, 5, np.nan]})
I would like to sum the values of the columns based on Index. The columns do not have the same names and may contain strings (I have in fact 50 columns in each df). I want to consider NaN as 0. The result should look like:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how could this be done.
Note:
For sure a double while (or for) loop would do the trick, just not very elegantly...
indices = 0
while indices < len(df1.index):
    columna = 0
    while columna < numbercolumns:
        df3.iloc[indices, columna] = df1.iloc[indices, columna] + df2.iloc[indices, columna]
        columna += 1
    indices += 1
Thank you.
You can try concatenating both dataframes, then summing within each index group:
df2.columns = df1.columns
df2['people'] = pd.to_numeric(df2['people'], errors='coerce')
pd.concat([df1, df2]).groupby('Index').sum()
Out:
number people
Index
A 8 5.0
B 2 2.0
C 2 5.0
F 3 3.0
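Against the question's exact frames, a minimal runnable sketch (my own adaptation, not the answerer's code: both frames are coerced to numeric so strings become NaN, and DataFrame.add aligns on the index):
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Index': ['0', '1', '2'], 'number': [3, 'dd', 1], 'people': [3, 's', 3]}).set_index('Index')
df2 = pd.DataFrame({'Index': ['0', '1', '2'], 'quantity': [3, 2, 'hi'], 'persons': [1, 5, np.nan]}).set_index('Index')

df2.columns = df1.columns  # align column names positionally
a = df1.apply(pd.to_numeric, errors='coerce')  # 'dd', 's' -> NaN
b = df2.apply(pd.to_numeric, errors='coerce')  # 'hi' -> NaN
# NaN propagates through the sum, matching the expected output;
# pass fill_value=0 to a.add(b) if you truly want NaN treated as 0
df3 = a.add(b)
#        number  people
# Index
# 0         6.0     4.0
# 1         NaN     NaN
# 2         NaN     NaN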
