Columns with missing values not dropping [duplicate] - python

I have a DataFrame like this (the first column is the index (786...), the second is Day (25...), and the Rainfall amount column is empty):
     Day  Rainfall amount (millimetres)
786   25
787   26
788   27
789   28
790   29
791    1
792    2
793    3
794    4
795    5
and I want to delete the row 790. I tried so many things with df.drop, but nothing happened.
I hope you can help me.

drop returns a new DataFrame rather than modifying the existing one. If you want to apply the change to the current DataFrame, you have to either assign the result back or specify the inplace parameter.
Option 1
Assigning back to df -
df = df.drop(790)
Option 2
Inplace argument -
df.drop(790, inplace=True)
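For example, a minimal sketch reproducing the situation above (the Day values and the empty rainfall column are taken from the question):
import pandas as pd

# Rebuild the example: index 786-795, a Day column, an empty rainfall column
df = pd.DataFrame(
    {"Day": [25, 26, 27, 28, 29, 1, 2, 3, 4, 5],
     "Rainfall amount (millimetres)": [None] * 10},
    index=range(786, 796),
)

df = df.drop(790)             # option 1: assign the result back
# df.drop(790, inplace=True)  # option 2: modify df in place
print(df.index)               # 790 is gone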

As others may be in my shoes, I'll add a bit here. I've merged three CSV files of data and they mistakenly have the headers copied into the dataframe. Now, naturally, I assumed pandas would have an easy method to remove these obviously bad rows. However, it's not working and I'm still a bit perplexed with this. After using df.drop() I see that the length of my dataframe correctly decreases by 2 (I have two bad rows of headers). But the values are still there and attempts to make a histogram will throw errors due to empty values. Here's the code:
import pandas as pd

df1 = pd.read_csv('./summedDF_combined.csv', index_col=[0])
print(len(df1['x']))
badRows = pd.isnull(pd.to_numeric(df1['y'], errors='coerce')).nonzero()[0]
print("Bad rows:", badRows)
df1.drop(badRows, inplace=True)
print(len(df1['x']))
I've tried other functions in tandem with no luck. This shows an empty list for badRows, but it still will not plot because the bad rows are still in the df, just de-indexed:
print(len(df1['x']))
df1 = df1.dropna().reset_index(drop=True)
df1 = df1.dropna(axis=0).reset_index(drop=True)
badRows = pd.isnull(pd.to_numeric(df1['x'], errors='coerce')).nonzero()[0]
print("Bad rows:", badRows)
I'm stumped, but have one solution that works for the subset of folks who merged CSV files and got stuck. Go back to your original files and merge again, but take care to exclude the headers like so:
head -n 1 anyOneFile.csv > summedDFs.csv && tail -n+2 -q summedBlipDF2*.csv >> summedDFs.csv
Apologies, I know this isn't the pythonic or pandas way to fix it and I hope the mods don't feel the need to remove it as it works for the small subset of people with my problem.
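For completeness, here is a pandas-only sketch of the same clean-up (not from the original answer; it assumes the stray header rows show up as non-numeric values in columns 'x' and 'y'). A likely reason the histogram still failed above is that the leftover header rows make the columns object dtype, so the surviving values are strings and need to be converted back to numbers:
import pandas as pd

df1 = pd.read_csv('./summedDF_combined.csv', index_col=[0])

# Keep only rows where 'y' parses as a number; the stray header rows become NaN and are dropped
df1 = df1[pd.to_numeric(df1['y'], errors='coerce').notna()].reset_index(drop=True)

# Re-parse the affected columns as numeric so plotting works again
df1['x'] = pd.to_numeric(df1['x'])
df1['y'] = pd.to_numeric(df1['y'])
print(len(df1['x']))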

Related

Divide dataframe column by series

Python beginner here so I reckon I'm massively overcomplicating this in some way.
I have a dataframe with about 20 columns, but I've only shown a small subset for simplicity. I want to get the totals for red, blue and none as percentages of the total for that month. So I thought it might be easiest to take a subset of these three columns and then add the result back to the rest of the data:
import pandas as pd

data = [['2022-08', 10, 'red', 0, 0], ['2022-04', 15, 'blue', 1, 0], ['2022-08', 14, 'none', 1, 1],
        ['2022-04', 14, 'blue', 0, 0], ['2022-03', 14, 'none', 1, 0]]
df = pd.DataFrame(data, columns=['Month', 'Balance', 'Type', 'Flag_1', 'Flag_2'])
df2 = df[['Month', 'Type', 'Balance']].groupby(['Month', 'Type']).sum().unstack().fillna(0)
df2['balance_all_categories'] = df2.sum(axis=1)
Now I want to add this back to my full dataframe and turn the balances for red, blue and none into percentages of the total for that month. I have many more than just two flags, and I will need to make subsets based on all flags being zero, all flags being one, and so on. If I group by month and type here, accessing any one column starts to involve incredibly long names, so I'd like to avoid that if possible.
Is there an easy way to deal with this? Thanks for any suggestions! :)
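One possible approach (a sketch, not an answer from the original thread) is to skip the unstacked intermediate table entirely and divide by a per-month total computed with groupby().transform, which keeps the column names flat:
import pandas as pd

data = [['2022-08', 10, 'red', 0, 0], ['2022-04', 15, 'blue', 1, 0], ['2022-08', 14, 'none', 1, 1],
        ['2022-04', 14, 'blue', 0, 0], ['2022-03', 14, 'none', 1, 0]]
df = pd.DataFrame(data, columns=['Month', 'Balance', 'Type', 'Flag_1', 'Flag_2'])

# Each row's balance as a share of that month's total
monthly_total = df.groupby('Month')['Balance'].transform('sum')
df['pct_of_month'] = df['Balance'] / monthly_total * 100

# Type-level percentages per month, if a summary table is still wanted
pct_by_type = df.groupby(['Month', 'Type'])['pct_of_month'].sum().unstack(fill_value=0)
print(pct_by_type)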

Pandas deleting partly duplicate rows with wrong values in specific columns

I have a large dataframe from a CSV file with a few dozen columns. I have another CSV file, with exactly the same structure, which I concatenated to the original; however, a particular column in the second file may have incorrect values. I want to delete the rows that are duplicates except for this one wrong column. For example, in the data below the last row should be removed. (The names of the specimens (Albert, etc.) are unique.) I have been struggling to find a way of deleting only the rows with the wrong value, without risking deleting the correct rows.
0 Albert alive
1 Newton alive
2 Galileo alive
3 Copernicus dead
4 Galileo dead
...
Any help would be greatly appreciated!
You could use this to determine whether a name is mentioned more than once:
df['RN'] = df.groupby(['Name']).cumcount() + 1
You can also expand it out to have more columns in the "groupby" to see if there are any more limitations you want to put on the duplicates
df['RN'] = df.groupby(['Name', 'Another Column']).cumcount() + 1
The advantage of this approach is that it gives you more control over the selection by RN if you need it, e.g. df.loc[df['RN'] > 1] to pick out the later occurrences.
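Putting that together on the example data, a minimal sketch (assuming the row from the original, correct file always comes first after concatenation, so the first occurrence of each name is the one to keep):
import pandas as pd

df = pd.DataFrame({'Name': ['Albert', 'Newton', 'Galileo', 'Copernicus', 'Galileo'],
                   'Status': ['alive', 'alive', 'alive', 'dead', 'dead']})

# Number each occurrence of a name, then keep only the first one
df['RN'] = df.groupby(['Name']).cumcount() + 1
cleaned = df.loc[df['RN'] == 1].drop(columns='RN')
print(cleaned)  # the last (wrong) Galileo row is gone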

Modifying the date column calculation in pandas dataframe

I have a dataframe that looks like this
I need to adjust the time_in_weeks column for the entry with the value 34. When there is a duplicate uniqueid with a different rma_created_date, that means some failure occurred. The 34 needs to be changed to the number of weeks between the new, most recent rma_created_date (2020-10-15 in this case) and the rma_processed_date of the row above (2020-06-28).
I hope that makes sense in terms of what I am trying to do.
So far I did this
def clean_df(df):
    '''
    This function will fix the time_in_weeks column to calculate the correct number of weeks
    when there are multiple failures for an item.
    '''
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
Now I need to perform what I described above, but I am a little confused about how to do this, especially considering there could be multiple failures and not just two.
I should get something like this returned as output
As you can see, the 34 was changed to the number of weeks between 2020-10-15 and 2020-06-26.
Here is another example with more rows
Using the expression suggested
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_processed_date.dt.isocalendar().week.sub(df.rma_processed_date.dt.isocalendar().week.shift(1)),
    df.time_in_weeks,
)
I get this
Final note: if there is a date of 1/1/1900 then don't perform any calculation.
The question is not very clear; happy to correct if I have interpreted it wrongly.
Try np.where(condition, choice if condition is True, choice if condition is False).
import numpy as np
import pandas as pd

# Coerce dates into datetime
df['rma_processed_date'] = pd.to_datetime(df['rma_processed_date'])
df['rma_created_date'] = pd.to_datetime(df['rma_created_date'])
# Solution
df['time_in_weeks'] = np.where(df.uniqueid.duplicated(keep='first'),
                               df.rma_created_date.sub(df.rma_processed_date),
                               df.time_in_weeks)
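Note that subtracting two datetime columns gives a Timedelta rather than a number of weeks. A sketch of one way to express the gap in whole weeks, continuing from the coerced columns above (it assumes the previous failure for a uniqueid sits in the row directly above after sorting):
# Weeks between this row's rma_created_date and the previous row's rma_processed_date
weeks_since_prev = (df['rma_created_date'] - df['rma_processed_date'].shift(1)).dt.days / 7
df['time_in_weeks'] = np.where(df['uniqueid'].duplicated(keep='first'),
                               weeks_since_prev.round(), df['time_in_weeks'])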

How to index a 2d array properly pandas dataframe?

I am reading a .xlsx Excel file into a pandas dataframe.
Here is what it looks like:
Image Link
Or in text form:
      1            2            3            4
3.5   15.48403728  23.22605592  30.96807456  38.7100932
4     17.41954194  26.12931291  34.83908388  43.54885485
4.5   19.3550466   29.0325699   38.7100932   48.3876165
5     21.29055126  31.93582689  42.58110252  53.22637815
As you can see, the top-left cell is empty.
The rows are amounts, the columns are materials, and the values are the prices.
I don't really know how to name the columns properly for indexing.
If I try
df.columns = ['Material 1',...'Material 4']
it errors, because it expects five column names, since there are five columns.
Really what I want is to label that top-left column as amount/material or something like that, but I don't have a clue how to do it.
I think the best way would be for me to try and transform this dataframe into something like this:
Amount Material Price
3.5 1 15.48...
3.5 2 23.22...
...
5 4 53.22...
as this will hopefully make it easier to deal with.
Any idea how to do this?
I believe this is called "unpivot columns" in Excel, or something like that?
I am not sure how you have read the Excel file, but if all you want is to rename the columns, you can set the column names while reading the Excel file itself.
Suppose the file name is MyExcelFile.xlsx and the desired column names are 'Amount', 'Material_1', 'Material_2', 'Material_3' and 'Material_4'; then I would read it as follows. If these column names do not already exist in the Excel file, you have to pass header=None explicitly.
MyDF = pd.read_excel('/FullPathToYourExcelFile/MyExcelFile.xlsx', names=['Amount','Material_1','Material_2','Material_3','Material_4'], header=None)
The output is as below.
See the documentation here (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html). If you have already done it, as I have suggested above, then I am sorry I have underestimated your problem requirements. All the best
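For the unpivot part of the question, here is a possible sketch (not from the original answer) that rebuilds the small table from the question and reshapes it from wide to long, giving the Amount/Material/Price layout you described:
import pandas as pd

# Rebuild the table from the question: amounts down the side, materials 1-4 across
df = pd.DataFrame(
    {1: [15.48403728, 17.41954194, 19.3550466, 21.29055126],
     2: [23.22605592, 26.12931291, 29.0325699, 31.93582689],
     3: [30.96807456, 34.83908388, 38.7100932, 42.58110252],
     4: [38.7100932, 43.54885485, 48.3876165, 53.22637815]},
    index=[3.5, 4, 4.5, 5],
)
df.index.name = 'Amount'
df.columns.name = 'Material'

# Wide -> long: one row per (Amount, Material) pair with its price
long_df = df.stack().rename('Price').reset_index()
print(long_df)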
