I am doing some work with pandas in Python and came across a dataset that needs to be cleaned.
It looks like this:
I need to merge the rows at index 0, 1, 2 and 3 into a single row, skipping NaN values. After that, I want to remove the default header and make the newly merged row the column names.
I tried using a groupby operation in pandas, but nothing happened.
Any idea on this?
Thanks
Akhi
pd.DataFrame([df.aggregate(lambda x: " ".join(str(e).strip() for e in x if not pd.isna(e))).tolist()], columns=[<add_new_list here>])
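As a minimal sketch of the idea on toy data (the values and the two-column shape are assumptions for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame where the real header is split across the first rows.
df = pd.DataFrame([["First", np.nan], [np.nan, "Name"], ["Last", np.nan]])

# Collapse each column into one string, skipping NaN values.
merged = df.aggregate(
    lambda col: " ".join(str(v).strip() for v in col if not pd.isna(v))
)
print(merged.tolist())  # ['First Last', 'Name']

# Rebuild the frame with the merged row as the column names.
clean = pd.DataFrame(columns=merged.tolist())
```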
I have a DataFrame with 259399 rows and one column. It is called hfreq. A single row contains a NaN value and I want to find it. I thought this would be easy and tried hfreq[hfreq.isnull()], but as you can see, it doesn't help:
What am I doing wrong and how is it done correctly?
Edit: For clarity, this is what my DataFrame looks like:
There is only one NaN value hidden somewhere in the middle, and I want to find where it is, so I want to get its index.
Use the following code:
hfreq.loc[hfreq['value'].isnull()]
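For example, on a small stand-in frame (the 'value' column name follows the answer above; the data is made up):

```python
import pandas as pd
import numpy as np

hfreq = pd.DataFrame({"value": [1.0, 2.0, np.nan, 4.0]})

# Rows where the value is NaN; the index tells you where it hides.
nan_rows = hfreq.loc[hfreq["value"].isnull()]
print(nan_rows.index.tolist())  # [2]
```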
I am looking to delete rows in a DataFrame that was imported into Python with pandas.
If you see the sheet below, the first column has the same name repeated multiple times. The condition is: if the first column's value reappears in a later row, delete that row; otherwise keep it in the DataFrame.
My final output should look like the following:
Presently I am doing this by converting each column into a list and deleting entries by index. I am hoping there is an easier way than this workaround.
df.drop_duplicates([df.columns[0]])
should do the trick.
Try the following code:
df.drop_duplicates(subset='columnName', keep='first', inplace=True)
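A small demonstration on toy data (the 'name' column and its values are assumptions for illustration); note that keep='first' is already the default:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "A", "B"], "score": [1, 2, 3]})

# Keep only the first row for each repeated value in the first column.
deduped = df.drop_duplicates(subset=df.columns[0], keep="first")
print(deduped["name"].tolist())  # ['A', 'B']
```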
I have been using pandas but am open to all suggestions. I'm not an expert at scripting and am at a complete loss. My goal is the following:
Merge multiple CSV files. I was able to do this in pandas and have a DataFrame with the merged dataset.
Screenshot of how merged dataset looks like
Delete the duplicated "GEO" columns after the first set. This part doesn't let me use df = df.loc[:,~df.columns.duplicated()] because the columns are not technically duplicated: the repeated column names end with .1, .2, etc., which I am guessing the concat adds. Another problem is that some columns share a name but hold different datasets. I have been using the first row as the index since it always contains the same coded values, but this row is unnecessary and will be deleted later in the script. This is my biggest problem right now.
Delete certain columns, such as the ones with "Margins". I use ~df2.columns.str.startswith for this and have no trouble with it.
Replace spaces, ":" and ";" with underscores in the first row. I have no clue how to do this.
Insert a new column, write the '=TEXT(B1,0)' formula for the whole column (the formula changing to B2, B3, etc.), then copy the column and paste it as values. I was able to attempt this in openpyxl, although I had trouble and could not check the final output because of Excel issues.
source = excel.Workbooks.Open(filename)
excel.Range("C1:C1337").Select()
excel.Selection.Copy()
excel.Selection.PasteSpecial(Paste=constants.xlPasteValues)
I'm not sure if it works, and I was wondering whether this is possible in pandas or win32com, or whether I should stay with openpyxl. Thanks all!
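For the column-cleanup parts, a pandas sketch is possible; the column names, the .1 suffix pattern, and the regex below are assumptions based on the description above, not the actual dataset:

```python
import pandas as pd
import re

# Toy frame standing in for the merged CSVs (names assumed for illustration).
df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=["GEO", "Pop: Total", "GEO.1", "Margin of Error"],
)

# Drop the repeated GEO columns that concat suffixed with .1, .2, ...
df = df.loc[:, ~df.columns.str.match(r"GEO\.\d+$")]

# Drop the margin columns.
df = df.loc[:, ~df.columns.str.startswith("Margin")]

# Replace spaces, ':' and ';' with underscores in the header row.
df.columns = [re.sub(r"[ :;]", "_", c) for c in df.columns]
print(df.columns.tolist())  # ['GEO', 'Pop__Total']
```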
I don't know whether this is a very simple question, but I would like to write a conditional statement based on two other columns.
I have two columns, age and SES, and another empty column whose value should be based on these two. For example, when a person is 65 years old and their corresponding socio-economic status is high, the third column (the empty column, vitality class) should be given a value of, say, 1. I have an idea of what I want to achieve, but no idea how to implement it in Python itself. I know I could use a for loop and I know how to write conditions, but because I want to take two columns into consideration to determine what gets written into the empty column, I have no idea how to write that in a function,
and furthermore how to write it back into the same csv (in the respective empty column).
Use the pandas module to import the csv as a DataFrame object. Then you can use logical statements to fill the empty column:
import pandas as pd

df = pd.read_csv('path_to_file.csv')

# Set vitality_class to 1 where both conditions hold.
df.loc[(df['age'] == 65) & (df['SES'] == 'high'), 'vitality_class'] = 1

df.to_csv('path_to_new_file.csv', index=False)
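If several age/SES combinations should map to different classes, numpy.select keeps the conditions readable; the thresholds and class values below are made-up examples:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [65, 40, 70], "SES": ["high", "low", "low"]})

# Each condition is paired with the class value at the same position;
# rows matching no condition fall through to the default.
conditions = [
    (df["age"] >= 65) & (df["SES"] == "high"),
    (df["age"] >= 65) & (df["SES"] == "low"),
]
df["vitality_class"] = np.select(conditions, [1, 2], default=0)
print(df["vitality_class"].tolist())  # [1, 0, 2]
```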
I am trying to combine two tables row-wise (stack them on top of each other, like using rbind in R). I've followed the steps mentioned in:
Pandas version of rbind
how to combine two data frames in python pandas
But neither "append" nor "concat" is working for me.
About my data
I have two pandas DataFrame objects (type 'pandas.core.frame.DataFrame'), both with 19 columns. When I print each DataFrame, they look fine.
The problem
So I created another pandas DataFrame using:
query_results = pd.DataFrame(columns=header_cols)
and then in a loop (because sometimes I may be combining more than just two tables) I try to combine all the tables:
for CCC in CCCList:
query_results.append(cost_center_query(cccode=CCC))
where cost_center_query is a custom function that returns pandas DataFrame objects with the same column names as query_results.
However, with this, whenever I print "query_results" I get an empty DataFrame.
Any idea why this is happening? There is no error message either, so I am just confused.
Thank you so much for any advice!
Consider the concat method on a list of DataFrames, which avoids growing an object inside a loop with repeated append calls. Note that append returns a new DataFrame rather than modifying the caller in place, which is why query_results stays empty. You can even use a list comprehension:
query_results = pd.concat([cost_center_query(cccode=CCC) for CCC in CCCList], ignore_index=True)
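A tiny demonstration of why the loop above stayed empty: append and concat return a new frame instead of modifying the caller, so the result must be reassigned (the sample data is made up):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3]})

# concat returns a NEW frame; calling append/concat in a loop without
# reassigning the result leaves the original frame unchanged.
stacked = pd.concat([a, b], ignore_index=True)
print(len(stacked))  # 3
```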