Iloc and rename in pandas - python

I am new to data science and have recently been working with pandas, but I cannot figure out what the following line means:
df1=df1.rename(columns=df1.iloc[0,:]).iloc[1:,:]
The problem states that this is used to make the columns with index 11 the header, but I can't understand how.
I know how rename is used, but I cannot understand what is happening here with the two iloc calls.

Just dissect the line, one method at a time:
df1 = (                     # reassign df1 to ...
    df1.rename(             # the renamed frame of df1 ...
        columns=            # where the column names use the mapper ...
            df1.iloc[0, :]  # row 0 of df1, all columns (a Series: old label -> new name)
    )
    .iloc[1:, :]            # then slice the renamed frame from row 1 onward, all columns
)
Effectively, it removes the first row and uses its values as the column names. The same can be done in two steps, assuming the first row carries the index label 0:
df1.columns = df1.iloc[0, :]
df1.drop(0, inplace=True)
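A small, self-contained sketch of the same idea may help. The frame `raw` and its values are invented for illustration; the point is only that the first row holds the intended header and the one-liner promotes it.

```python
import pandas as pd

# Hypothetical frame whose first row is really the header
raw = pd.DataFrame([['name', 'age'], ['Ann', 31], ['Bob', 27]])

# Row 0 as a Series: index = current column labels (0, 1), values = new names
header = raw.iloc[0, :]

# rename() treats the Series as a mapping {0: 'name', 1: 'age'};
# .iloc[1:, :] then drops the row that was promoted to the header
tidy = raw.rename(columns=header).iloc[1:, :]
print(tidy)
```

Note that the remaining rows keep their original index labels (1 and 2 here); call reset_index(drop=True) afterwards if a fresh 0-based index is wanted.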

Related

Name and join 2 multi index dataframe after operation is done

I have the following multi-index data frame, df.
I computed a 20-day moving average of df['Close'] with the code below and stored it in ma20.
How do I append ma20 to df as a multi-index data frame?
I should have 3 level-0 columns, ['Adj Close', 'Close', 'ma20'], each with the 3 tickers, ['MSFT', 'AAPL', 'AMZN'], as level-1 columns.
The answer should also not require me to type out all the tickers manually.
import pandas as pd
import yfinance as yf
df = yf.download(['MSFT','AAPL','AMZN'], start="2022-01-01", end="2022-09-01").loc[:,['Adj Close','Close']]
ma20 = df['Close'].sort_index(ascending=True).rolling(20, min_periods=20).mean()
pd.concat([df,ma20], axis=1)????
It's not a very elegant solution, but it should work. The idea is to explicitly specify the MultiIndex names for the new columns:
df[[('Mean', col) for col in ma20.columns]] = ma20
UPDATE:
If you want to use the concat() method, you first need to add another column index level to ma20. The way to do this looks counter-intuitive to me:
pd.concat((df, pd.concat({'Mean': ma20}, axis=1)), sort=False, axis=1)
The purpose of pd.concat({'Mean': ma20}, axis=1) is just to add another index level to the columns of ma20.
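The concat approach can be tried without downloading anything. The sketch below is only an illustration: the random numbers stand in for the yfinance prices, and the label 'ma20' is used instead of 'Mean' to match the name the question asks for.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded prices (values are random, not real quotes)
tickers = ['MSFT', 'AAPL', 'AMZN']
cols = pd.MultiIndex.from_product([['Adj Close', 'Close'], tickers])
idx = pd.date_range('2022-01-03', periods=30, freq='B')
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((30, 6)), index=idx, columns=cols)

# 20-day moving average of the 'Close' block; its columns are just the tickers
ma20 = df['Close'].rolling(20, min_periods=20).mean()

# Wrapping ma20 in a dict adds the key as a new outer column level
ma20_labeled = pd.concat({'ma20': ma20}, axis=1)

# Both frames now have two column levels, so they line up side by side
out = pd.concat([df, ma20_labeled], axis=1)
print(out.columns.get_level_values(0).unique())
```

The tickers are never typed a second time: they are carried over from the columns of ma20.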

Replace elements of a dataframe with a values at another dataframe elements

I want to replace the elements of df2 with elements of df1 according to this rule: if the first row, first column of df2 holds '1', the first row, first column element of df1 goes there; if it holds zero, the '0' stays. If the last column of any df2 row holds '1', that row's last df1 element goes there, and so on.
So I want to replace every '1' in df2 with the df1 elements according to that rule. df3 is going to look like:
abcde0000;
abcd0e000;
abcd00e00;...
We can use the apply function for this, but first you have to concat both frames along axis 1. I am using a dummy table with just three rows; the approach works for any number of rows.
import pandas as pd
import numpy as np
# Dummy data
df1 = pd.DataFrame([['a','b','c','d','e'],['a','b','c','d','e'],['a','b','c','d','e']])
df2 = pd.DataFrame([[1,1,1,1,1,0,0,0,0],[1,1,1,1,0,1,0,0,0],[1,1,1,1,0,0,1,0,0]])
# Display the dataframes. display() may not work in plain Python scripts; I used it in Jupyter notebooks
display(df1)
display(df2)
# Concat DFs
df3 = pd.concat([df1,df2],axis=1)
display(df3)
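# Define function for replacing: each 1 in indexes takes the next unused letter
def replace(letters, indexes):
    seek = 0
    for i in range(len(indexes)):
        if indexes[i] == 1:
            indexes[i] = letters[seek]
            seek += 1
    return ''.join(map(str, indexes))
# Apply the replace function row by row; .iloc and tolist() pass plain Python lists,
# so the duplicate column labels produced by concat cannot interfere
df4 = df3.apply(lambda x: replace(x.iloc[:5].tolist(), x.iloc[5:].tolist()), axis=1)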
# Display df4
display(df4)
The result is
0 abcde0000
1 abcd0e000
2 abcd00e00
dtype: object
I think this will solve your problem
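As a possible variation, the same rule can be applied without concatenating the frames first. This is only a sketch under the assumption that each row of df2 never contains more 1s than the matching row of df1 has letters; the helper name fill_row is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([list('abcde')] * 3)
df2 = pd.DataFrame([[1, 1, 1, 1, 1, 0, 0, 0, 0],
                    [1, 1, 1, 1, 0, 1, 0, 0, 0],
                    [1, 1, 1, 1, 0, 0, 1, 0, 0]])

def fill_row(letters, flags):
    # Hypothetical helper: every 1 consumes the next letter, other values are kept as text
    it = iter(letters)
    return ''.join(str(next(it)) if flag == 1 else str(flag) for flag in flags)

# Walk both frames row by row in parallel
result = pd.Series(
    [fill_row(letters, flags) for letters, flags in zip(df1.to_numpy(), df2.to_numpy())],
    index=df2.index,
)
print(result)
```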

Pandas "A value is trying to be set on a copy of a slice from a DataFrame"

Having a bit of trouble understanding the documentation
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
C:/Users/erasmuss/PycharmProjects/Sarah/farmdata.py:38: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The code basically rearranges and cleans some data to make analysis easier.
The data is given row by row per animal, but it has repetitions, blanks, and some other sparse values.
The idea is basically to stack rows into columns and grab the useful data (weight by date and final BCS) per animal.
Initial DF
few snippets of the dataframe
Output Format
Output DF/csv
import pandas as pd
import numpy as np
#Function for cleaning up multiple entries of breeds
def testbreed(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
#Read Data
df1 = pd.read_csv("farmdata.csv")
#Drop empty columns (axis=1)
df1.dropna(how='all', axis=1, inplace=True)
#Copy to extract Weights in DF2
df2 = df1.copy()
df2 = df2.drop(['BCS', 'Breed','Age'], axis=1)
#Pivot for ID names in DF1
df1 = df1.pivot(index='ID', columns='Date', values=['Breed','Weight', 'BCS'])
#Pivot for weights in DF2
df2 = df2.pivot(index='ID', columns='Date', values = 'Weight')
#Split out Breeds and BCS into individual dataframes w/Duplicate/missing data for each ID
df3 = df1.copy()
dfbreed = df3[['Breed']]
dfBCS = df3[['BCS']]
#Drop empty BCS columns
df1.dropna(how='all', axis=1, inplace=True)
#Shorten Breed and BCS to single Column by grabbing first value that is real. see function above
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
dfBCS['x'] = dfBCS.apply(testbreed, axis=1)
#Populate BCS and Breed into new DF
df5= pd.DataFrame(data=None)
df5['Breed'] = dfbreed['x']
df5['BCS'] = dfBCS['x']
#Join Weights
df5 = df5.join(df2)
#Write output
df5.to_csv(r'.\out1.csv')
I want to take the BCS and Breed dataframes, whose columns are multi-indexed by Breed or BCS and then by date, take the first non-NaN value across the date columns in each row, and put it into a single column named Breed (or BCS).
I had a lot of trouble getting the columns to pick the first unique values in situ on the DF.
I found a workaround in a 2015 answer:
2015 Answer
which defined the function at the top.
Reading about setting a value on a copy of a slice makes sense intuitively,
but I can't think of a way to make it work as a direct replacement or index-based assignment.
Should I be looping through?
Trying from The second answer here
I get
dfbreed.loc[:,'Breed'] = dfbreed['Breed'].apply(testbreed, axis=1)
dfBCS.loc[:, 'BCS'] = dfBCS.apply['BCS'](testbreed, axis=1)
which returns
ValueError: Must have equal len keys and value when setting with an iterable
I'm thinking this has something to do with the multi-index
keys come up as:
MultiIndex([('Breed', '1/28/2021'),
('Breed', '2/12/2021'),
('Breed', '2/4/2021'),
('Breed', '3/18/2021'),
('Breed', '7/30/2021')],
names=[None, 'Date'])
MultiIndex([('BCS', '1/28/2021'),
('BCS', '2/12/2021'),
('BCS', '2/4/2021'),
('BCS', '3/18/2021'),
('BCS', '7/30/2021')],
names=[None, 'Date'])
Sorry for the long question(s?)
Can anyone help me out?
Thanks.
You created dfbreed as:
dfbreed = df3[['Breed']]
So dfbreed is a selection taken from df3, and pandas remembers that. Whether such a
selection is a view or a copy is not guaranteed, so pandas cannot tell whether a write to it
is also meant to reach the original DataFrame.
When you perform dfbreed['x'] = dfbreed.apply(...), pandas therefore emits
SettingWithCopyWarning. It is a warning, not an error: the column is still added to dfbreed.
To avoid the warning, create dfbreed as an "independent" DataFrame:
dfbreed = df3[['Breed']].copy()
Now dfbreed is explicitly its own object and you are free to change its data.
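Another way to sidestep the assignment altogether is to compute the "first valid value per row" as a Series and build the summary frame from those Series. The sketch below uses a tiny invented frame shaped like the pivoted data in the question (the IDs and values are made up), and first_valid mirrors the testbreed function.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the pivoted frame: (field, Date) columns, one row per animal ID
cols = pd.MultiIndex.from_product(
    [['Breed', 'BCS'], ['1/28/2021', '2/4/2021']], names=[None, 'Date']
)
df3 = pd.DataFrame(
    [[np.nan, 'Angus', 3.0, np.nan],
     ['Hereford', np.nan, np.nan, np.nan]],
    index=pd.Index(['A1', 'A2'], name='ID'),
    columns=cols,
)

def first_valid(row):
    # Same logic as testbreed: first non-NaN entry of the row, or None
    key = row.first_valid_index()
    return None if key is None else row[key]

# df3['Breed'] selects the block under 'Breed'; apply() returns a new Series,
# so nothing is ever written into a selection of df3
breed = df3['Breed'].apply(first_valid, axis=1)
bcs = df3['BCS'].apply(first_valid, axis=1)

summary = pd.DataFrame({'Breed': breed, 'BCS': bcs})
print(summary)
```

Because nothing is assigned into a frame derived from df3, there is no copy-versus-view question for pandas to warn about.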

Wide to long returns empty output - Python dataframe

I have a dataframe which can be generated from the code as given below
df = pd.DataFrame({'person_id' :[1,2,3],'date1':
['12/31/2007','11/25/2009','10/06/2005'],'val1':
[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'val2':[1,3,5],'date3':
['12/31/2027','11/25/2029','10/06/2025'],'val3':[7,9,11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
j='grp').sort_index(level=0)
Though this works with the sample data shown above, it doesn't work with my real data, which has more than 200 columns. Instead of person_id, my real data has subject_ID, with values like DC0001, DC0002, etc. Does "i" always have to be numeric? With my real data, it adds the stub names as new columns in my dataset and returns zero rows.
This is how my real columns look
My real data might contain NAs as well. So do I have to fill them with default values for wide_to_long to work?
Can you please help me find what the issue could be? Any other approach that achieves the same result would also be helpful.
Try adding the suffix argument to the function, which allows string suffixes.
pd.wide_to_long(......................., suffix=r'\w+')
The issue is with your column names: the numbers used to convert from wide to long need to be at the end of your column names, or you need to specify a suffix pattern that matches what follows the stub. I think the easiest solution is to create a function that accepts a regex and the dataframe.
import pandas as pd
import re
def change_names(df, regex):
    # Select one of three column groups
    old_cols = df.filter(regex=regex).columns
    # Create list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stubname of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col)  # returns a list like ['1']
        # Make new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename columns
    df.rename(columns=dd, inplace=True)
    return df
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
# Change date columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
   person_id      hdate1  tval1      hdate2  tval2      hdate3  tval3
0          1  12/31/2007      2  12/31/2017      1  12/31/2027      7
1          2  11/25/2009      4  11/25/2019      3  11/25/2029      9
2          3  10/06/2005      6  10/06/2015      5  10/06/2025     11
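The same renaming can probably be written more compactly by passing a function to rename. This is a sketch, not the answer's method: the helper name move_digits_to_end is made up, and the pattern assumes names of the form letters, digits, letters (such as 'h1date').

```python
import re
import pandas as pd

tdf = pd.DataFrame({'person_id': [1, 2],
                    'h1date': ['12/31/2007', '11/25/2009'], 't1val': [2, 4],
                    'h2date': ['12/31/2017', '11/25/2019'], 't2val': [1, 3]})

def move_digits_to_end(name):
    # 'h1date' -> 'hdate1'; names that do not match the pattern are returned unchanged
    return re.sub(r'^(\D+)(\d+)(\D+)$', r'\1\3\2', name)

# rename() calls the function once per column label and returns a new frame
renamed = tdf.rename(columns=move_digits_to_end)
print(list(renamed.columns))
```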
This is quite late to answer this question, but I am putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
## You can use m13op22's solution to rename your columns so that the numeric part is at the
## end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
                          'h2date': 'hdate2', 't2val': 'tval2',
                          'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion (in this example 'hdate', 'tval') as
## stubnames. The mistake you were making was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)
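On the question of whether "i" has to be numeric: it does not appear to. The sketch below uses invented string IDs in the style of the question (DC0001, DC0002) and also shows the suffix argument for non-numeric suffixes; the column names val_pre and val_post are hypothetical.

```python
import pandas as pd

# String IDs work as the "i" column; only the suffix after the stub must match the pattern
wide = pd.DataFrame({
    'subject_ID': ['DC0001', 'DC0002'],
    'date1': ['12/31/2007', '11/25/2009'], 'val1': [2, 4],
    'date2': ['12/31/2017', '11/25/2019'], 'val2': [1, 3],
})
long_df = pd.wide_to_long(
    wide, stubnames=['date', 'val'], i='subject_ID', j='grp'
).sort_index()
print(long_df)

# Non-numeric suffixes need an explicit suffix pattern (the default is r'\d+')
wide2 = pd.DataFrame({'subject_ID': ['DC0001'], 'val_pre': [2], 'val_post': [5]})
long_df2 = pd.wide_to_long(
    wide2, stubnames='val', i='subject_ID', j='phase', sep='_', suffix=r'\w+'
)
print(long_df2)
```

If no column matches stub + sep + suffix, wide_to_long has nothing to reshape, which is consistent with the empty result described in the question.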

How to rename one-hot encoded columns in pandas to their respective index?

I'm one-hot encoding some categorical variables with some code that was provided to me. This line adds columns of 0s and 1s whose names have the format prefix_categoricalValue
dataframe = pandas.concat([dataframe,pandas.get_dummies(dataframe[0], prefix='protocol')],axis=1).drop([0],axis=1)
I want the column to have as a name its index, not prefix_categoricalValue.
I know that I can do something like df.rename(columns={'prefix_categoricalValue': '0'}, inplace=True), but I'm not sure how to do it for all the columns which have that prefix.
This is an example of a part of the dataframe. Whether I decide to leave the local_address prefix or not, each category will have its name. Is it possible to rename the column with its index?
EDIT:
I'm trying to do this:
for column in dataframe:
    dataframe.rename(columns={column: 'new_name'}, inplace=True)
    print(column)
but I'm not exactly sure why it doesn't work.
import pandas as pd
# 'dataframe' is the name of your data frame in the question, so that's what I use
# in my code below, although I suggest using 'data' or something similar instead,
# as 'DataFrame' is the name of the pandas class and it's easy to confuse the two. But anyway...
features = ['list of column names you want one-hot encoded']
# for example, features = ['Cars', 'Model', 'Year', ... ]
for f in features:
    df = dataframe[[f]]
    # .T.groupby(level=0).max().T merges dummy columns that share a name; it stands in
    # for .max(level=0, axis=1), whose level argument was removed in pandas 2.0
    df2 = (pd.get_dummies(df, prefix='', prefix_sep='')
           .T.groupby(level=0).max().T
           .add_prefix(f + ' - '))
    # the new feature names will be "<old_feature_name> - <categorical_value>"
    # for example, "Cars" will get transformed to "Cars - Minivan", "Cars - Truck", etc
    # add the new one-hot encoded columns to the dataframe
    dataframe = pd.concat([dataframe, df2], axis=1)
    # you can remove the original column, if you don't need it anymore (optional)
    dataframe = dataframe.drop([f], axis=1)
Let's say your prefix is local_address_0.0.0.0. The following code renames the columns that start with the prefix you specify to the index that column has according to the order in which they appear in the dataframe:
prefix = 'local_address_0.0.0.0'
cols = list(dataframe)
for idx, val in enumerate(cols):
    if val.startswith(prefix):
        dataframe.rename(index=str, columns={val: idx}, inplace=True)
This will show a warning in the console:
python3.6/site-packages/pandas/core/frame.py:3027: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
return super(DataFrame, self).rename(**kwargs)
But it is just a warning; the column names of the dataframe are still updated. If you want to learn more about the warning, see How to deal with SettingWithCopyWarning in Pandas?
If someone knows how to do the same thing without a warning, please comment.
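One approach that may avoid the warning is to build the whole mapping first and call rename once, without inplace, so that a new frame is returned instead of an existing one being modified. A minimal sketch, with an invented frame and a shortened prefix for illustration:

```python
import pandas as pd

# Invented example frame; the column names mimic the one-hot output in the question
dataframe = pd.DataFrame({
    'port': [80, 443],
    'local_address_0.0.0.0': [1, 0],
    'local_address_127.0.0.1': [0, 1],
})

prefix = 'local_address_'

# Map every prefixed column name to its position in the frame
mapping = {col: idx
           for idx, col in enumerate(dataframe.columns)
           if str(col).startswith(prefix)}

# A single rename call that returns a new frame; the original is left untouched
renamed = dataframe.rename(columns=mapping)
print(list(renamed.columns))
```

Unlike the loop above, this does not pass index=str, so the row labels are left as they are.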
IIUC
dummydf=pd.get_dummies(df.A)
dummydf.columns=['A']*dummydf.shape[1]
dummydf
Out[1171]:
   A  A
0  1  0
1  0  1
2  1  0
df
Out[1172]:
   A  B  C
0  a  b  1
1  b  a  2
2  a  c  3
