This question may be very basic, but I'm stuck on dropping a column that has no column name. I imported an Excel file into pandas and the data looked something like this:
A B
0 24 -10
1 12 -3
2 17 5
3 63 45
I tried to get rid of the first column (supposed to be the index column), which has no column name, so that the dataframe has just the A and B columns.
When I ran
df.columns
I got the following:
Index(['Unnamed: 0', 'A', 'B'], dtype='object')
I tried several ways
df = pd.read_excel(r"path", index_col = False)
and
df.reset_index(drop=True, inplace=True)
and
df = df.drop([''], axis=1)
The line below displays an error:
self.DATA.drop([""], axis=1, inplace=True)
The error for the above line is
name 'self' is not defined
I tried other possible ways, but nothing seems to work. What mistake am I making? Any help would be highly appreciated.
You can try
pd.read_excel('tmp.xlsx', index_col=0)
# Or
pd.read_excel('tmp.xlsx', usecols=lambda x: 'Unnamed' not in x)
# Or
pd.read_excel('tmp.xlsx', usecols=['A', 'B'])
This should work for the nth column of your dataframe: df.drop(columns=df.columns[n], inplace=True). If it's the first column, then n = 0.
Working from @enke's comment, I realized that a simple drop call given the auto-generated column name (as below) solves the issue of removing the undesired index column:
df = df.drop(['Unnamed: 0'], axis=1)
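To see this end to end, here's a minimal sketch (with made-up numbers matching the question's data) that recreates the stray 'Unnamed: 0' column and drops it:

```python
import pandas as pd

# Recreate the situation: the saved index came back as a column that
# pandas auto-named 'Unnamed: 0'
df = pd.DataFrame({'Unnamed: 0': [0, 1, 2, 3],
                   'A': [24, 12, 17, 63],
                   'B': [-10, -3, 5, 45]})

# Drop the unwanted column by its auto-generated name
df = df.drop(['Unnamed: 0'], axis=1)
print(df.columns.tolist())  # ['A', 'B']
```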
Try this:
empty_index = [" " for i in range(len(d["A"]))]
df = pd.DataFrame(d, index = empty_index)
Here the index is a list of blank strings.
Related
I am new to data science and have recently been working with pandas, and I cannot figure out what the following line means:
df1=df1.rename(columns=df1.iloc[0,:]).iloc[1:,:]
The problem states that this is used to make the row at index 0 the header, but I can't understand how.
I know the use of rename but cannot understand what's happening here with the multiple iloc calls.
Just dissect the line by each method applied:
df1 = # reassign df1 to ...
df1.rename( # the renamed frame of df1 ...
columns = # where column names will use mapper of ...
df1.iloc[0,:] # slice of df1 on row 0, include all columns ...
)
.iloc[1:,:] # the slice of the renamed frame from row 1 forward, include all columns...
Effectively, it removes the first row and sets it as the column names, which can also be done like this:
df1.columns = df1.iloc[0, :]
df1.drop(0, inplace=True)
I imported a dataset into my Python script and took the correlation. This is the code for the correlation:
data = pd.read_excel('RQ_ID_Grouping.xlsx' , 'Sheet1')
corr = data.corr()
After the correlation the data looks like this:
I want to convert the data into the format below:
I am using this code to achieve that, but it doesn't seem to be working:
corr1 = (corr.melt(var_name = 'X' , value_name = 'Y').groupby('X')['Y'].reset_index(name = 'Corr_Value'))
I know there should be something after the groupby part, but I don't know what. If you could help me, I would greatly appreciate it.
Use DataFrame.stack to reshape and drop missing values, convert the MultiIndex to columns with DataFrame.reset_index, and finally set the column names:
df = corr.stack().reset_index()
df.columns = ['X','Y','Corr_Value']
Another solution with DataFrame.rename_axis:
df = corr.stack().rename_axis(('X','Y')).reset_index(name='Corr_Value')
And your solution with melt is also possible:
df = (corr.rename_axis('X')
.reset_index()
.melt('X', var_name='Y', value_name='Corr_Value')
.dropna()
.sort_values(['X','Y'])
.reset_index(drop=True))
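For a runnable illustration, here's a toy frame (column names are made up) pushed through the stack approach:

```python
import pandas as pd

# Toy data standing in for the spreadsheet from the question
data = pd.DataFrame({'a': [1, 2, 3, 4],
                     'b': [2, 4, 6, 8],
                     'c': [4, 3, 2, 1]})
corr = data.corr()

# Wide correlation matrix -> long (X, Y, Corr_Value) format
df = corr.stack().rename_axis(('X', 'Y')).reset_index(name='Corr_Value')
print(df.columns.tolist())  # ['X', 'Y', 'Corr_Value']
print(len(df))              # 9 pairs for a 3x3 matrix
```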
Trying to write a quick function, but struggling since I'm new to pandas/Python. I'm trying to remove NAs from two of my columns, but I keep getting this error. My code is the following:
def remove_na():
    df.dropna(subset=['Column 1', 'Column 2'])
    df.reset_index(drop=True)

df = remove_rows()
df.head(3)
AttributeError: 'NoneType' object has no attribute 'dropna'
I want to use this function on different tables, hence why I thought it would make sense to create a method. However, I just don't understand why it's not working for this method when compared to others it seems fine. Thank you.
I believe you can specify whether you want to remove NAs from rows or columns with the axis parameter, where 0 is the index (rows) and 1 is columns. This would remove every column containing an NA:
df.dropna(axis =1, inplace=True )
I think you can use apply with dropna:
df = df.apply(lambda x: pd.Series(x.dropna().values))
print (df)
Or you can also try this:
df=df.dropna(axis=0, how='any')
You're getting an error because the dropna function here returns a dataframe as its output rather than modifying df in place.
You can either save it to a dataframe:
df = df.dropna(subset=['Column 1', 'Column 2'])
or call the argument 'inplace=True' :
df.dropna(subset=['Column 1', 'Column 2'], inplace=True)
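Putting that together, here's one sketch of the original helper, rewritten to take the frame as an argument and return the result ('Column 1'/'Column 2' are the question's placeholder names):

```python
import pandas as pd
import numpy as np

def remove_na(frame, cols):
    """Drop rows with NAs in the given columns and reset the index."""
    frame = frame.dropna(subset=cols)
    return frame.reset_index(drop=True)

df = pd.DataFrame({'Column 1': [1, np.nan, 3],
                   'Column 2': [4, 5, np.nan]})
df = remove_na(df, ['Column 1', 'Column 2'])
print(len(df))            # 1 row survives
print(df.index.tolist())  # [0]
```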
In order to remove all the missing values from the dataset at once using pandas, you can use the following (remember to specify the axis in the arguments so that you remove the missing values efficiently):
# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')
I'm one-hot encoding some categorical variables with some code that was provided to me. This line adds columns of 0s and 1s, each named with the format prefix_categoricalValue:
dataframe = pandas.concat([dataframe,pandas.get_dummies(dataframe[0], prefix='protocol')],axis=1).drop([0],axis=1)
I want the column to have as a name its index, not prefix_categoricalValue.
I know that I can do something like df.rename(columns={'prefix_categoricalValue': '0'}, inplace=True), but I'm not sure how to do it for all the columns which have that prefix.
This is an example of a part of the dataframe. Whether I decide to leave the local_address prefix or not, each category will have its name. Is it possible to rename the column with its index?
EDIT:
I'm trying to do this:
for column in dataframe:
    dataframe.rename(columns={column: 'new_name'}, inplace=True)
    print(column)
but I'm not exactly sure why it doesn't work
import pandas as pd

# 'dataframe' is the name of your data frame in the question, so that's what I use
# in my code below, although I suggest using 'data' or something else for it instead,
# as 'DataFrame' is the pandas class name and it's easy to cause confusion. But anyway...

features = ['list of column names you want one-hot encoded']
# for example, features = ['Cars', 'Model', 'Year', ... ]

for f in features:
    df = dataframe[[f]]
    df2 = (pd.get_dummies(df, prefix='', prefix_sep='')
             .max(level=0, axis=1)
             .add_prefix(f + ' - '))
    # the new feature names will be "<old_feature_name> - <categorical_value>"
    # for example, "Cars" will get transformed to "Cars - Minivan", "Cars - Truck", etc

    # add the new one-hot encoded columns to the dataframe
    dataframe = pd.concat([dataframe, df2], axis=1)

    # you can remove the original column, if you don't need it anymore (optional)
    dataframe = dataframe.drop([f], axis=1)
Let's say your prefix is local_address_0.0.0.0. The following code renames the columns that start with the prefix you specify to the index that column has according to the order in which they appear in the dataframe:
prefix = 'local_address_0.0.0.0'
cols = list(dataframe)
for idx, val in enumerate(cols):
    if val.startswith(prefix):
        dataframe.rename(index=str, columns={val: idx}, inplace=True)
This will show a warning in the console:
python3.6/site-packages/pandas/core/frame.py:3027: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
return super(DataFrame, self).rename(**kwargs)
But it is just a warning, the column names of the dataframe are updated. If you want to learn more about the warning, see How to deal with SettingWithCopyWarning in Pandas?
If someone knows how to do the same thing without a warning, please comment.
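One possible warning-free alternative (a sketch, not tested against the asker's exact frame) is to rebuild df.columns in a single assignment instead of calling rename inside the loop; the column names below are illustrative:

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 0], [0, 1, 5]],
                  columns=['local_address_a', 'local_address_b', 'other'])

prefix = 'local_address'
# Replace each matching column name with its positional index, in one assignment
df.columns = [idx if str(col).startswith(prefix) else col
              for idx, col in enumerate(df.columns)]
print(df.columns.tolist())  # [0, 1, 'other']
```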
IIUC
dummydf=pd.get_dummies(df.A)
dummydf.columns=['A']*dummydf.shape[1]
dummydf
Out[1171]:
A A
0 1 0
1 0 1
2 1 0
df
Out[1172]:
A B C
0 a b 1
1 b a 2
2 a c 3
I have 2 dataframes. df1 is built from several Series of values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns':cumsums_returns_s, 'spent': cumsums_spent_s, 'runs': cumsums_runs_s, 'wins': cumsums_wins_s, 'expected': cumsums_expected_s}, columns=["winnings", "returns", "runs", "wins", "expected"])
df2 runs each row through a function which takes 3 columns and produces a result for each row: specialSauce.
df2= pd.DataFrame(list(map(lambda w,r,e: doStuff(w,r,e), df1['wins'], df1['runs'], df1['expected'])), columns=["specialSauce"])
print(df2.append(df1))
produces all the columns from both frames, but with NaN values where the columns don't overlap (and vice versa if df1/df2 are switched in the append).
So the problem I have is how to append these 2 dataframes correctly.
As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one extra column, the one created with apply() and doStuff, while df1 does not have that column. When trying to append one pd.DataFrame to the other, the result will have all the columns of both pd.DataFrame objects. Naturally, you will have some NaN's for ['specialSauce'] since this column does not exist in df1.
This would be the same if you were to use pd.concat(), both methods do the same thing in this case. The one thing that you could do to bring the result closer to your desired result is use the ignore_index flag like this:
>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>> df1['specialSauce'] = <'specialSauce values'>
I hope that helps, if not please rephrase the description of what you're after.
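As a quick reproduction of the axis=1 suggestion (doStuff is stood in by a simple row sum here, since the original wasn't shown):

```python
import pandas as pd

df1 = pd.DataFrame({'wins': [1, 2], 'runs': [3, 4], 'expected': [5, 6]})
# Stand-in for doStuff: any row-wise function of the three columns
df2 = pd.DataFrame({'specialSauce': df1['wins'] + df1['runs'] + df1['expected']})

# Side-by-side concatenation: the new column lines up with df1's rows
result = pd.concat([df1, df2], axis=1)
print(result.columns.tolist())          # ['wins', 'runs', 'expected', 'specialSauce']
print(result['specialSauce'].tolist())  # [9, 12]
```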
OK, there are a couple of things going on here. You've left code out, and I had to fill in the gaps. For example, you did not define doStuff, so I had to:
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
columns=["winnings", "returns",
"spent", "runs",
"wins", "expected"]).cumsum()
doStuff = lambda w, r, e: w + r + e
df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
winnings returns spent runs wins expected specialSauce
0 0.166085 0.781964 0.852285 -0.707071 -0.931657 0.886661 -0.752067
1 -0.055704 1.163688 0.079710 0.155916 -1.212917 -0.045265 -1.102266
2 -0.554241 1.928014 0.271214 -0.462848 0.452802 1.692924 1.682878
3 0.627985 3.047389 -1.594841 -1.099262 -0.308115 4.356977 2.949601
4 0.796156 3.228755 -0.273482 -0.661442 -0.111355 2.827409 2.054611
Also
You tried to use pd.DataFrame.append(). Per the linked documentation, it attaches the DataFrame specified as the argument to the end of the DataFrame being appended to. You would have wanted to use pd.concat() instead.