I'm struggling with something I thought would be fairly trivial. I have a spreadsheet that provides data in the format below; unfortunately this can't be changed, as it's the only way the data can be provided:
I load the file with pandas in a Jupyter notebook and can read it fine by specifying that the header spans 3 rows, so far so good. The problem is that some of the headers in the second level repeat (teachers, students, other), so I want to combine the 3 levels into one, letting me easily identify which column does what. The value in the top-left corner changes every day, hence I renamed that one column to an empty string (''). The output I'm looking for should have the following columns: country, region, teachers_present, ..., perf_teachers_score, ..., count_teachers, etc.
For some reason, pandas renders this table like this:
It doesn't add any Unnamed placeholders on level 0, but it does on levels 1 and 2. If I simply concatenate the names, I get some very ugly column names; I need to concatenate them while ignoring the Unnamed ones. My code is:
df = pd.read_excel(src, header=[0,1,2])
# to get rid of the date, works as intended
df.columns.set_levels(['', 'perf', 'count'], level=0, inplace=True)
# doesn't work; raises AttributeError: 'str' object has no attribute 'str', despite my using this pattern successfully elsewhere
df.columns.set_levels(['' if x.str.contains('unnamed', case=False, na=False) else x for x in df.columns.levels[1].values], level=1, inplace=True)
In conclusion, what am I doing wrong and how do I get my column names concatenated without the Unnamed (and unwanted) labels?
Thank you!
Got it...
df.columns = [f'{x}{z}' if 'unnamed' in y.lower() else f'{x}{y}' if 'unnamed' in z.lower() else f'{x}{y}{z}' for x, y, z in df.columns]
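For reference, a more general version of the same idea, which joins however many header levels there are and skips every auto-generated placeholder, might look like this (a sketch; the '_' separator and the str() coercion are my assumptions):
df = pd.read_excel(src, header=[0, 1, 2])
# join the levels with '_', dropping empty labels and 'Unnamed: ...' placeholders
df.columns = [
    '_'.join(str(part) for part in col if str(part) and 'unnamed' not in str(part).lower())
    for col in df.columns
]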
Thank you David, you've been helpful!
I have a dataset for which I am not able to reference the columns. In the screenshot below, I have marked in yellow what needs to be recognized as columns (Vale On, Petroleo, etc.), as well as the Date column, which I need recognized as dates since I am working with time series data.
I have tried resetting the index and some related solutions, but nothing worked. I am new to Python, so I am sorry if this is too obvious.
# use first row as column names
df.columns = df.iloc[0]
# and then drop it
df = df.iloc[1:]
# convert first col to date
# if it doesn't work, try passing format=... ; refer to https://strftime.org/
# also https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
df['Date'] = pd.to_datetime(df['Date'])
A debugging hint if parsing the date keeps failing: check whether your date strings are consistent, perhaps like so: df['Date'].str.len().value_counts(). That should hopefully return only one length. If it returns multiple rows, you have inconsistent, anomalous data that you'll have to clean first.
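A minimal sketch of that check, plus a strict parse (the format string is an assumption; adjust it to your data):
# how many distinct string lengths do the dates have?
print(df['Date'].astype(str).str.len().value_counts())
# an explicit format makes the parse strict, so anomalies fail loudly
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')  # '%Y-%m-%d' is a guess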
df = pd.read_csv("p4ds_esi_messy_data.txt", sep = "\t")
The messy dataset:
All the columns become one column when I read it; df.columns shows [name mass radius density gravity ...].
I want to unpack all the columns and access each column individually.
It would be helpful if you added a sample of the data so we can reproduce the issue. Having said that, judging by the picture, I think this will work:
df = pd.read_csv("p4ds_esi_messy_data.txt", sep = '[*]')
Seems like the delimiter in your file is '*'.
Additional parameters might be needed but, again, we need a sample of the data to be able to help you out.
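A minimal sketch of that call (the extra parameters here are guesses, since we haven't seen the raw file):
import pandas as pd

# '[*]' is a regex character class matching a literal '*';
# a regex separator requires the python parsing engine
df = pd.read_csv("p4ds_esi_messy_data.txt", sep='[*]', engine='python')
df.columns = df.columns.str.strip()  # assumes whitespace padding around the '*'
print(df.columns.tolist())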
I have a program that applies pd.groupby().agg('sum') to a bunch of different pandas.DataFrame objects. Those dataframes are all in the same format. The code works on all of them except for this one (picture: df1), which produces a funny result (picture: result1).
I tried:
df = df.groupby('Mapping')[list(df)].agg('sum')
This code works fine for the other dataframes (pictures: df2, result2), but not for df1.
Could somebody tell me why it turned out that way for df1?
The problem in the first dataframe is the commas in variables that should be numeric; I think Python is not recognizing those columns as numeric. Did you try replacing the commas?
It seems that in df1, most of the numeric columns are actually str. You can tell by the commas (,) that delimit thousands. Try:
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: x.astype(str).str.replace(",", ""))
df.iloc[:, 1:] = df.iloc[:, 1:].apply(pd.to_numeric)
The first line removes the commas from the second, third, etc. columns; the second line converts them to numeric dtypes. This could actually be a one-liner, but I wrote it in two lines for readability's sake.
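For the record, the one-liner would simply chain the two steps:
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: pd.to_numeric(x.astype(str).str.replace(",", "")))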
Once this is done, you can try your groupby code.
It's good practice to check the data types of your columns as soon as you load them. You can do so with df1.dtypes.
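As an aside, if df1 is loaded from a CSV, pandas can strip the separators at load time instead; a sketch, with a made-up file name:
df1 = pd.read_csv('data1.csv', thousands=',')  # 'data1.csv' is hypothetical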
Is there a way to dynamically update column names that are based on previous column names? Or what are best practices for column names while processing data? Below I explain the problem:
When processing data, I often need to create columns that are calculated from the previous columns, and I set up the names like below:
|STUDENT|GRADE|GRADE_AVG|GRADE_AVG_FORMATTED|GRADE_AVG_FORMATTED_FINAL
The problem is, if I need to make a change in the middle of this data flow (for example, hypothetically, say I needed to scale the grade before taking the average), I would have to rename all the columns produced after that point. See below:
|STUDENT|GRADE|**GRADE_SCALED**|GRADE_SCALED_AVG|GRADE_SCALED_AVG_FORMATTED|GRADE_SCALED_AVG_FORMATTED_FINAL
Since the code that calculates each column is based on the previous column's name, this renaming gets really cumbersome, especially for big datasets for which a lot of code has been written. Any suggestions on how to dynamically update the column names, or best practices around them?
To clarify, an extension of the example:
my code would look like:
df['GRADE_AVG'] = df['GRADE'].apply(something)
df['GRADE_AVG_FORMATTED'] = df['GRADE_AVG'].apply(something)
df['GRADE_AVG_FORMATTED_FINAL'] = df['GRADE_AVG_FORMATTED'].apply(something)
...
... more column names based on the previous one..
...
df['FINAL_SCORE'] = df['GRADE_AVG_FORMATTED_FINAL_REVISED...etc']
And then... I need to change GRADE_AVG to GRADE_SCALED_AVG in the code, so I have to change all those column names. This is a small example, but when there are a lot of column names based on the previous one, changing the code gets messy.
What I do is change all the column names in the code, like below (but this gets really impractical), hence my question:
df['GRADE_SCALED_AVG'] = df['GRADE'].apply(something)
df['GRADE_SCALED_AVG_FORMATTED'] = df['GRADE_SCALED_AVG'].apply(something)
df['GRADE_SCALED_AVG_FORMATTED_FINAL'] = df['GRADE_SCALED_AVG_FORMATTED'].apply(something)
...
... more column names based on the previous one..
...
df['FINAL_SCORE'] = df['GRADE_SCALED_AVG_FORMATTED_FINAL_REVISED...etc']
Let's say your columns all start with GRADE. You can do this:
df.columns = ['GRADE_SCALED_'+ '_'.join(x.split('_')[1:]) if x.startswith('GRADE') else x for x in df.columns]
# sample test case
>>> l = ['abc','GRADE_AVG','GRADE_AVG_TOTAL']
>>> ['GRADE_SCALED_'+ '_'.join(x.split('_')[1:]) if x.startswith('GRADE') else x for x in l]
['abc', 'GRADE_SCALED_AVG', 'GRADE_SCALED_AVG_TOTAL']
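One edge case worth noting: a bare GRADE label would come out as GRADE_SCALED_ with a trailing underscore, because nothing is left to join after the split; testing x.startswith('GRADE_') instead would leave it untouched, if that matters.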
A nice way to rename dynamically is with the rename method:
import pandas as pd
import re
header = '|STUDENT|GRADE|GRADE_AVG|GRADE_AVG_FORMATTED|GRADE_AVG_FORMATTED_FINAL'
df = pd.DataFrame(columns=header.split('|')) # your dataframe
print(df)
# now rename: can take a function or a dictionary as a parameter
df1 = df.rename(lambda x: re.sub('^GRADE', 'GRADE_SCALE', x), axis=1)
print(df1)
#Empty DataFrame
#Columns: [, STUDENT, GRADE, GRADE_AVG, GRADE_AVG_FORMATTED, GRADE_AVG_FORMATTED_FINAL]
#Index: []
#Empty DataFrame
#Columns: [, STUDENT, GRADE_SCALE, GRADE_SCALE_AVG, GRADE_SCALE_AVG_FORMATTED, GRADE_SCALE_AVG_FORMATTED_FINAL]
#Index: []
However, in your case I'm not sure this is what you are looking for. Are the AVG and FORMATTED columns generated from the GRADE column? Also, is this renaming or replacing? Doesn't the content of the columns change as well?
It seems a more complete description of the problem might help.
Some columns of a dataframe, df, have elements equal to the "?" character. The df has 2000 rows. I want to drop the columns where more than 1800 elements are equal to "?".
I think I need to use the apply method to figure out which columns need to be dropped, and then use the drop method to drop them, but I can't figure out how.
df.drop(df.apply(lambda x: x.value_counts()["?"] > 1800, axis=0))
but obviously it doesn't work. The above line is not the first thing I tried; I've tried many other things as well, but they all give me different errors. I appreciate any help.
You do not necessarily have to use the apply method and value_counts; checking equality and summing can do the same thing here and would potentially be more efficient:
df.eq("?").sum()
gives the number of "?" in each column, and
df.eq("?").sum().gt(1800)
gives a boolean Series that is True where a column has more than 1800 question marks; this can then be used to subset the dataframe with loc. Put together:
df.loc[:,~df.eq("?").sum().gt(1800)]
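A quick toy demonstration of the idea, with made-up data:
toy = pd.DataFrame({'a': ['?', '?', '?'], 'b': [1, '?', 3]})
print(toy.eq('?').sum())                     # a: 3, b: 1
print(toy.loc[:, ~toy.eq('?').sum().gt(2)])  # keeps only column 'b'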
To use the drop method, you need to make sure you are passing labels (a list of column names) rather than a boolean Series, and to drop columns you need to set the axis parameter to 1. So, to make your original attempt work:
df.drop(df.apply(lambda x: x.value_counts()["?"]>1800)[lambda x: x].index, axis=1)
# ^^^^^^^^^^^^^
# here use a lambda filter to extract column names that need to be dropped
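One caveat with this version: x.value_counts()["?"] raises a KeyError for any column that contains no "?" at all. A safer variant of the same idea, as a sketch:
df.drop(df.columns[df.eq("?").sum().gt(1800)], axis=1)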