Is there a way to dynamically update column names that are based on previous column names? Or what are best practices for column names while processing data? Below I explain the problem:
When processing data, I often need to create columns that are calculated from the previous columns, and I set up the names like below:
|STUDENT|GRADE|GRADE_AVG|GRADE_AVG_FORMATTED|GRADE_AVG_FORMATTED_FINAL
The problem is, if I need to make a change in the middle of this data flow [for example, hypothetically, say I needed to scale the grade before taking the average], I would have to rename all the column names that were produced after this point. See below:
|STUDENT|GRADE|**GRADE_SCALED**|GRADE_SCALED_AVG|GRADE_SCALED_AVG_FORMATTED|GRADE_SCALED_AVG_FORMATTED_FINAL
Since the code that calculates each column refers to the previous column names, this renaming gets really cumbersome, especially for big datasets for which a lot of code has been written. Any suggestions on how to dynamically update the column names, or best practices around them?
To clarify, an extension of the example:
my code would look like:
df['GRADE_AVG'] = df['GRADE'].apply(something)
df['GRADE_AVG_FORMATTED'] = df['GRADE_AVG'].apply(something)
df['GRADE_AVG_FORMATTED_FINAL'] = df['GRADE_AVG_FORMATTED'].apply(something)
...
... more column names based on the previous one...
...
df['FINAL_SCORE'] = df['GRADE_AVG_FORMATTED_FINAL_REVISED...etc']
And then... I need to change GRADE_AVG to GRADE_SCALED_AVG in the code, so I will have to change all the downstream column names. This is a small example, but when there are a lot of column names based on the previous one, changing the code gets messy.
What I currently do is change every affected column name in the code, like below, but this quickly becomes impractical, hence my question:
df['GRADE_SCALED_AVG'] = df['GRADE'].apply(something)
df['GRADE_SCALED_AVG_FORMATTED'] = df['GRADE_SCALED_AVG'].apply(something)
df['GRADE_SCALED_AVG_FORMATTED_FINAL'] = df['GRADE_SCALED_AVG_FORMATTED'].apply(something)
...
... more column names based on the previous one...
...
df['FINAL_SCORE'] = df['GRADE_SCALED_AVG_FORMATTED_FINAL_REVISED...etc']
Let's say your column names start with GRADE. You can do this:
df.columns = ['GRADE_SCALED_'+ '_'.join(x.split('_')[1:]) if x.startswith('GRADE') else x for x in df.columns]
# sample test case
>>> l = ['abc','GRADE_AVG','GRADE_AVG_TOTAL']
>>> ['GRADE_SCALED_'+ '_'.join(x.split('_')[1:]) if x.startswith('GRADE') else x for x in l]
['abc', 'GRADE_SCALED_AVG', 'GRADE_SCALED_AVG_TOTAL']
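Alternatively, to sidestep the renaming problem altogether, one common pattern is to define each column name as a variable once, so a change propagates downstream automatically. A minimal sketch (something here is a hypothetical placeholder for your actual transformations):
import pandas as pd

# Hypothetical placeholder for whatever each step really computes
something = lambda v: v

# Define the chain of names once; inserting a scaling step later only
# means editing these definitions, and the code below follows along.
GRADE = 'GRADE'
GRADE_AVG = GRADE + '_AVG'
GRADE_AVG_FORMATTED = GRADE_AVG + '_FORMATTED'
GRADE_AVG_FORMATTED_FINAL = GRADE_AVG_FORMATTED + '_FINAL'

df = pd.DataFrame({'STUDENT': ['ann', 'bob'], GRADE: [80, 90]})
df[GRADE_AVG] = df[GRADE].apply(something)
df[GRADE_AVG_FORMATTED] = df[GRADE_AVG].apply(something)
df[GRADE_AVG_FORMATTED_FINAL] = df[GRADE_AVG_FORMATTED].apply(something)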
A nice way to rename dynamically is with the rename method:
import pandas as pd
import re
header = '|STUDENT|GRADE|GRADE_AVG|GRADE_AVG_FORMATTED|GRADE_AVG_FORMATTED_FINAL'
df = pd.DataFrame(columns=header.split('|')[1:])  # your dataframe (drop the empty name before the first '|')
print(df)
# now rename: can take a function or a dictionary as a parameter
df1 = df.rename(lambda x: re.sub('^GRADE', 'GRADE_SCALE', x), axis=1)
print(df1)
# Empty DataFrame
# Columns: [STUDENT, GRADE, GRADE_AVG, GRADE_AVG_FORMATTED, GRADE_AVG_FORMATTED_FINAL]
# Index: []
# Empty DataFrame
# Columns: [STUDENT, GRADE_SCALE, GRADE_SCALE_AVG, GRADE_SCALE_AVG_FORMATTED, GRADE_SCALE_AVG_FORMATTED_FINAL]
# Index: []
However, in your case, I'm not sure this is what you are looking for. Are the AVG and FORMATTED columns generated from the GRADE column? Also, is this renaming or replacing? Doesn't the content of the columns change as well?
It seems a more complete description of the problem might help.
I am using Python to clean address data and standardize abbreviations, etc. so that it can be compared against other address data. I finally have 2 dataframes in pandas. I would like to compare each row in the first df, named df, against a list created from the addresses in a second df of similar structure, second_df. If the address from df is on the list, then I would like to create a column to note this, maybe a boolean, but best case the string 'found'. I have used isin and it did not work.
For example, suppose my data looks like the sample data below. I would like to compare each row in the df['concat'] column to the entire list to see if that address appears in the second_df list.
read = pd.read_excel('fullfilepath.xlsx')
second_df = pd.read_excel('anotherfilepath.xlsx')
df = read[['column1','column2', 'concat']]
address_list = second_df.concat.tolist()  # avoid naming this "list": it shadows the built-in
EDIT based on tdy's comment, as my original answer didn't supply a value for the False case of the where statement.
Try something like this:
df["isFound"] = np.where(df['concat'].isin(second_df["concat"]), "found", "notfound")
It should be exactly what you need.
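For instance, with made-up toy addresses (the values here are purely illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'concat': ['1 MAIN ST', '2 OAK AVE', '3 PINE RD']})
second_df = pd.DataFrame({'concat': ['2 OAK AVE', '3 PINE RD']})

# 'found' where the address appears in second_df, 'notfound' otherwise
df['isFound'] = np.where(df['concat'].isin(second_df['concat']), 'found', 'notfound')
print(df)
#       concat   isFound
# 0  1 MAIN ST  notfound
# 1  2 OAK AVE     found
# 2  3 PINE RD     found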
I am new to using Python with data sets and am trying to exclude a column ("id") from being shown in the output. Wondering how to go about this using describe() and its exclude option.
describe works on data types: you can include or exclude based on dtype, not on column names. If your id column has a dtype that no other column shares, then
df.describe(exclude=[datatype])
or if you just want to drop specific column(s) from describe, then try this:
cols = set(df.columns) - {'id'}  # note: a set does not preserve column order
df1 = df[list(cols)]
df1.describe()
Ta-da, it's done. For more info, see the pandas documentation on describe.
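For example (a toy frame; the dtype names assume pandas defaults):
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'grade': [80.0, 90.0, 85.0]})

# 'id' is the only int64 column here, so excluding that dtype hides it
print(df.describe(exclude=['int64']))
#            grade
# count   3.000000
# mean   85.000000
# ...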
You can do that by slicing your original DF to remove the 'id' column. One way is through .iloc. Let's suppose the column 'id' is the first column of your DF; then you could do this:
df.iloc[:,1:].describe()
The first colon represents the rows, the second the columns.
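A tiny demo of that slicing (toy data):
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'score': [80, 90], 'age': [20, 21]})

# All rows (first colon), columns from position 1 onward (second slice),
# i.e. everything except the first column 'id'
print(df.iloc[:, 1:].describe())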
Although somebody responded with an example from the official docs which is more than enough, I'd just like to add this, since it might help a few people:
IF your DataFrame is large (say, hundreds of columns), removing one or two columns might not be enough; instead, create a smaller DataFrame holding only what you're interested in and go from there.
Example of removing 2+ columns:
columns_to_keep = set(your_bigger_data_frame.columns) - {'column_1', 'column_2', 'column3', 'etc'}
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_to_keep)]
your_new_smaller_data_frame.describe()
IF your DataFrame is medium/small, you already know every column, and you only need a few of them, just create a new DataFrame and then apply describe():
I'll give an example that reads a .csv file and then takes a smaller portion of that DataFrame holding only what you need:
import pandas as pd

df = pd.read_csv(r'.\docs\project\file.csv')  # raw string so the backslashes aren't treated as escapes
df = df[['column_1', 'column_2', 'column_3', 'etc']]
df.describe()
Use output.drop(columns=['id']).describe() (note that describe's exclude parameter filters by dtype, not by column name).
I'm struggling with something I thought would be fairly trivial. I have a spreadsheet that provides data in the format below; unfortunately this can't be changed, as this is the only way it can be provided:
I load the file with pandas in a Jupyter notebook and can read it, specifying that the header has 3 rows; so far so good. The point is that because some of the headers on the second level repeat (teachers, students, other), I want to combine the 3 levels into one so I can easily identify which column does what. The data in the top-left corner changes every day, hence I renamed that one column to nothing (''). The output I'm looking for should have the following columns: country, region, teachers_present, ..., perf_teachers_score, ..., count_teachers, etc.
For some reason, pandas renders this table like this:
It doesn't add any Unnamed column name placeholders on level 0, but it does on levels 1 and 2. If I concatenate the names, I get some very ugly column names. I need to concatenate them but ignore the Unnamed ones in the process. My code is:
df = pd.read_excel(src, header=[0,1,2])
# to get rid of the date, works as intended
df.columns.set_levels(['', 'perf', 'count'], level=0, inplace=True)
# doesn't work, tells me str has no str method, despite successfully using this function elsewhere
df.columns.set_levels(['' if x.str.contains('unnamed', case=False, na=False) else x for x in df.columns.levels[1].values], level=1, inplace=True)
In conclusion, what am I doing wrong and how do I get my column names concatenated without the Unnamed (and unwanted) labels?
Thank you!
Got it...
df.columns = [f'{x}{z}' if 'unnamed' in y.lower() else f'{x}{y}' if 'unnamed' in z.lower() else f'{x}{y}{z}' for x, y, z in df.columns]
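The same logic reads a bit easier as a small helper; this sketch drops every level whose label contains 'unnamed' (so, unlike the one-liner, it also covers the case where both lower levels are placeholders). The toy header below just mimics the question's structure:
import pandas as pd

def join_levels(levels):
    # Keep only the level labels that aren't 'Unnamed: ...' placeholders
    return ''.join(p for p in levels if 'unnamed' not in p.lower())

# Toy 3-level header mimicking the question's structure
cols = pd.MultiIndex.from_tuples([
    ('', 'country', 'Unnamed: 1_level_2'),
    ('perf_', 'teachers_', 'score'),
])
print([join_levels(c) for c in cols])
# ['country', 'perf_teachers_score']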
Thank you David, you've been helpful!
I have a dataframe like this in pandas, where A is the column name:
results.head()
A
when you are away
when I was away
when they are away
I want to add a new column B which would look like the following:
A B
when you are away you
when I was away I
when they are away they
I tried with this code but it did not work:
results.assign(B = you, I, they)
I am new to pandas dataframe and would very much appreciate the help.
Try this:
B_list = ['you','I','they']
results['B'] = B_list
OR
results = results.assign(B=['you', 'I', 'they'])
If you are interested in placing the column in a specific location (index), then use the insert method:
results.insert(index, "ColName", new_set)
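For example, to place B right after A (position 1), using the question's data:
import pandas as pd

results = pd.DataFrame({'A': ['when you are away',
                              'when I was away',
                              'when they are away']})

# Insert at position 1, i.e. immediately after column A
results.insert(1, 'B', ['you', 'I', 'they'])
print(results)
#                     A     B
# 0   when you are away   you
# 1     when I was away     I
# 2  when they are away  they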
otherwise, the answer by Mayank is simpler.
I have a JSON object inside a pandas dataframe column, which I want to pull apart and put into other columns. In the dataframe, the JSON object looks like a string containing an array of dictionaries. The array can be of variable length, including zero, or the column can even be null. I've written some code, shown below, which does what I want. The column names are built from two components, the first being the keys in the dictionaries, and the second being a substring from a key value in the dictionary.
This code works okay, but it is very slow when running on a big dataframe. Can anyone offer a faster (and probably simpler) way to do this? Also, feel free to pick holes in what I have done if you see something which is not sensible/efficient/pythonic. I'm still a relative beginner. Thanks heaps.
# Import libraries
import pandas as pd
from IPython.display import display # Used to display df's nicely in jupyter notebook.
import json
# Set some display options
pd.set_option('display.max_colwidth', 150)
# Create the example dataframe
print("Original df:")
df = pd.DataFrame.from_dict(
    {'ColA': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},
     'ColB': {0: '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
              1: '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
              2: '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
              3: '[]',
              4: None}})
display(df)
# Create a temporary dataframe to append results to, record by record
dfTemp = pd.DataFrame()

# Step through all rows in the dataframe
for i in range(df.shape[0]):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(df.iloc[i, df.columns.get_loc("ColB")]) and len(df.iloc[i, df.columns.get_loc("ColB")]) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(df.iloc[i, df.columns.get_loc("ColB")])
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # Give the single record the same index number as the parent dataframe (for the merge to work)
        y.index = [df.index[i]]
        # Append this dataframe on sequentially for each row as we go through the loop
        dfTemp = dfTemp.append(y)

# Merge the new dataframe back onto the original one as extra columns, with index matching original dataframe
df = pd.merge(df, dfTemp, how='left', left_index=True, right_index=True)
print("Processed df:")
display(df)
First, a general piece of advice about pandas. If you find yourself iterating over the rows of a dataframe, you are most likely doing it wrong.
With this in mind, we can re-write your current procedure using pandas 'apply' method (this will likely speed it up to begin with, as it means far fewer index lookups on the df):
def do_the_thing(row):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(row) and len(row) > 2:
        # Convert the json structure into a dataframe
        x = pd.read_json(row)
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # We don't need to re-index or append to a temp df: apply keeps the
        # parent index, so we can just return the single record as a Series
        return y.iloc[0]
    else:
        return pd.Series()

df2 = df.merge(df.ColB.apply(do_the_thing), how='left', left_index=True, right_index=True)
Notice that this returns exactly the same result as before; we haven't changed the logic. The apply method sorts out the indexes, so we can just merge.
I believe that answers your question in terms of speeding it up and being a bit more idiomatic.
I think you should consider, however, what you want to do with this data structure, and how you might better structure what you're doing.
Given that ColB can be of arbitrary length, you will end up with a dataframe with an arbitrary number of columns. When you come to access these values, this will cause you pain, whatever the purpose is.
Are all entries in ColB important? Could you get away with just keeping the first one? Do you need to know the index of a certain valA value?
These are questions you should ask yourself, then decide on a structure which will allow you to do whatever analysis you need, without having to check a bunch of arbitrary things.
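As an aside, if your pandas has Series.explode (0.25+), here is a rough sketch of a parse-once-then-pivot alternative that avoids calling read_json per row. It assumes the same toy df as above; note the values stay strings here, whereas read_json infers numeric types:
import json
import pandas as pd

# Parse each non-null JSON string once, then give each dict its own row
# while keeping the parent row index (empty arrays drop out as NaN)
parsed = df['ColB'].dropna().apply(json.loads).explode().dropna()

# One flat row per dict, indexed back to the parent row
flat = pd.DataFrame(parsed.tolist(), index=parsed.index)
flat['key'] = flat['key'].str.split('=').str[-1]

# Pivot so each (key, value-column) pair becomes its own column, e.g. '1_valA'
wide = flat.pivot(columns='key')
wide.columns = ['{}_{}'.format(key, col) for col, key in wide.columns]

df3 = df.join(wide)  # rows with no records just get NaNs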