I have a dataframe df where some of the columns are strings and some are numeric. I am trying to convert all of them to numeric. So what I would like to do is something like this:
col = df.iloc[:, i]
le = preprocessing.LabelEncoder()
le.fit(col)
newCol = le.transform(col)
df.iloc[:, i] = newCol
but this does not work. Basically my question is: how do I delete a column from a data frame and then create a new column with the same name as the one I deleted, when I only know the column index, not the column name?
This should do it for you:
# Find the name of the column by index
n = df.columns[1]
# Drop that column
df.drop(n, axis = 1, inplace = True)
# Put whatever series you want in its place
df[n] = newCol
...where [1] can be whatever your column index is; axis=1 should not change.
This answers your question very literally: you asked how to drop a column and then add one back in. But in reality there is no need to drop the column at all if you just replace its contents with newCol.
newcol = [..,..,.....]
df['colname'] = newcol
This will keep the colname intact while replacing its contents with newcol.
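Putting the two answers together for the original LabelEncoder use case: look up the name by position and assign the encoded values straight back, no drop needed. A minimal sketch (the column position i and the sklearn import follow the question's setup):

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
name = df.columns[i]                   # column name at position i
df[name] = le.fit_transform(df[name])  # encode and overwrite in place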
I need to fix a large Excel database where some cells are blank and all the data in those rows is moved one cell to the right.
For example (screenshot omitted): I need a script that would detect that the first cell of the last row is blank and then move all the values in that row one cell to the left.
I'm trying to do it with the code below. vencli_col is the dataset; df1 and df2 are copies. In df2 I drop Column12, which is where the error originates. I collect the indexes of the rows where the error happens and then try to replace them with the values from df2.
df1 = vencli_col.copy()
df2 = vencli_col.copy()
df2 = df1.drop(columns=['Column12'])
df2['droppedcolumn'] = np.nan

i = 0
col = []
for k, value in vencli_col.iterrows():
    i += 1
    if str(value['Column12']) == '' or str(value['Column12']) == str(np.nan):
        col.append(i + 1)

for j in col:
    df1.iloc[j] = df2.iloc[j]

df1.head(25)
You could do something like the below. It is not very pretty but it does the trick.
# Select the column names that are correct and the ones that are shifted
# This is assuming the error column is the second one as in the image you have
correct_cols = df.columns[1:-1]
shifted_cols = df.columns[2:]
# Get the indexes of the rows that are NaN or ""
df = df.fillna("")
shifted_indexes = df[df["col1"] == ""].index
# Shift the data 1 column to the left.
# Convert to NumPy first: otherwise pandas aligns on the column names
# and the values never copy across into the destination columns.
df.loc[shifted_indexes, correct_cols] = df.loc[shifted_indexes, shifted_cols].to_numpy()
EDIT: just realised there is an easier way using df.shift()
columns_to_shift = df.columns[1:]
shifted_indexes = df[df["col1"] == ""].index
df.loc[shifted_indexes, columns_to_shift] = df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
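A toy reproduction of the shift() fix (the column names id/col1/col2/col3 are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 3],
    "col1": ["a", "", "c"],
    "col2": ["b", "a2", "d"],
    "col3": ["x", "b2", "y"],
})
columns_to_shift = df.columns[1:]
shifted_indexes = df[df["col1"] == ""].index
df.loc[shifted_indexes, columns_to_shift] = df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
# the middle row becomes col1="a2", col2="b2", col3=NaN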
I have this CSV:
name,age,count
will,12,2
joe,22,4
tim,34,10
I want to reset all the values in the count column to 0:
name,age,count
will,12,0
joe,22,0
tim,34,0
To change the values I currently do:
file = pd.read_csv("./users.csv")
df = pd.DataFrame(file)
df["count"].replace({ to_replace: value }, inplace=True)
Is there a way to do it without mentioning to_replace and setting the value directly?
In your case just assign back
df["count"] = 0
I'd like to split a column of a dataframe into two separate columns. Here is what my dataframe looks like (only the first 3 rows):
I'd like to split the column referenced_tweets into two columns: type and id in a way that for example, for the first row, the value of the type column would be replied_to and the value of id would be 1253050942716551168.
Here is what I've tried:
df[['type', 'id']] = df['referenced_tweets'].str.split(',', n=1, expand=True)
but I get the error:
ValueError: Columns must be the same length as key
(I think I get this error because the type in the referenced_tweets column is NOT always replied_to; it can also be retweeted, for example, so the lengths would be different.)
Why not get the values from the dicts and add them as two new columns?
def unpack_column(df_series, key):
    """Unpack the given key from each value of the column, skipping NaN values."""
    return [None if pd.isna(value) else value[0][key] for value in df_series]

df['type'] = unpack_column(df['referenced_tweets'], 'type')
df['id'] = unpack_column(df['referenced_tweets'], 'id')
or, if there are no NaN rows, in a one-liner (the tuple has to be wrapped in pd.Series so it expands into the two columns):
df[['type', 'id']] = df['referenced_tweets'].apply(lambda x: pd.Series([x[0]['type'], x[0]['id']]))
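A toy frame to sanity-check the helper (the single-element list-of-dicts shape is an assumption based on how the question describes referenced_tweets; the second id is a placeholder):

import pandas as pd

df = pd.DataFrame({
    "referenced_tweets": [
        [{"type": "replied_to", "id": "1253050942716551168"}],
        [{"type": "retweeted",  "id": "1111111111111111111"}],
        None,
    ]
})
df['type'] = unpack_column(df['referenced_tweets'], 'type')
df['id'] = unpack_column(df['referenced_tweets'], 'id')
# the None row yields None in both new columns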
I'm using this dataset:
https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/datasets/commutingtoworkbygenderukcountryandregion
Loaded thus:
commuting_data_xls = pd.ExcelFile(commuting_data_filename)
commuting_data_sheets = commuting_data_front['Table description '].dropna()
commuting_data_1 = pd.read_excel(commuting_data_xls, '1', header=4, usecols=range(1,13))
commuting_data_1 = commuting_data_1.dropna().dropna(axis=1)
The resulting hierarchical index only gets the rows right where all index columns are specified.
How can I correct this and name the index columns?
Try the following steps:
Open using pd.read_excel(), just the sheet and range you want.
commuting_data_xls = pd.read_excel("commutingdata.xlsx",'1', header=4, usecols=range(1,13))
Reset the multi index names.
commuting_data_xls.index.names = ['Gender', 'Work_Region', 'Region']
Reset the index and then restrict the rows to eliminate the totals; I assume you want them gone? If not, just remove the iloc step.
commuting_data_xls = commuting_data_xls.reset_index().iloc[0:28]
Remove the 'Work_Region' column as this seems superfluous.
commuting_data_xls = commuting_data_xls.loc[:,commuting_data_xls.columns != 'Work_Region']
Fill down the Gender column to replace NaN
commuting_data_xls['Gender'].fillna(method='ffill', inplace=True)
Reset the index if it suits your purposes.
commuting_data_xls = commuting_data_xls.set_index(['Gender', 'Region'])
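All the steps combined (a sketch under the same assumptions about the sheet layout; Series.ffill() replaces the chained inplace fillna):

import pandas as pd

commuting_data_xls = pd.read_excel("commutingdata.xlsx", '1', header=4, usecols=range(1, 13))
commuting_data_xls.index.names = ['Gender', 'Work_Region', 'Region']
commuting_data_xls = commuting_data_xls.reset_index().iloc[0:28]
commuting_data_xls = commuting_data_xls.loc[:, commuting_data_xls.columns != 'Work_Region']
commuting_data_xls['Gender'] = commuting_data_xls['Gender'].ffill()
commuting_data_xls = commuting_data_xls.set_index(['Gender', 'Region'])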
I want to iterate through the rows of a DataFrame and assign values to a new DataFrame. I've accomplished that task indirectly like this:
#first I read the data from df1 and assign it to df2 if something happens
counter = 0 #line1
for index, row in df1.iterrows():          #line2
    value = row['df1_col']                 #line3
    value2 = row['df1_col2']               #line4
    #try unzipping a file (pseudo code)
    df2.loc[counter, 'df2_col'] = value    #line5
    counter += 1                           #line6
    #except
    print("Error, could not unzip {}")     #line7

#then I set the desired index for df2
df2 = df2.set_index(['df2_col'])           #line8
Is there a way to assign the values to the index of df2 directly in line5? Sorry, my original question was unclear: I'm creating an index based on whether something happens.
There are a bunch of ways to do this. According to your code, all you've done is create an empty df2 dataframe with an index of values from df1.df1_col. You could do this directly like this:
df2 = pd.DataFrame([], df1.df1_col)
#                  ^       ^
#                  |       |
#  specifies no data, yet  |
#            defines the index
If you are concerned about having to filter df1 then you can do:
# cond is some boolean mask representing a condition to filter on.
# I'll make one up for you.
cond = df1.df1_col > 10
df2 = pd.DataFrame([], df1.loc[cond, 'df1_col'])
No need to iterate, you can do:
df2.index = df1['df1_col']
If you really want to iterate, save it to a list and set the index.
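A sketch of that pattern (the success check is pseudocode standing in for whatever "something happens" means in your pipeline):

import pandas as pd

idx = []
for _, row in df1.iterrows():
    # e.g. only keep this row's value if the unzip succeeds
    idx.append(row['df1_col'])

df2 = pd.DataFrame(index=pd.Index(idx, name='df2_col'))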