df.duplicated() false positives? - python

I have a dataframe that holds 2,865,044 entries with a 3-level MultiIndex
MultiIndex.levels.names = ['year', 'country', 'productcode']
I am trying to reshape the dataframe to produce a wide dataframe but I am getting the error:
ReshapeError: Index contains duplicate entries, cannot reshape
I have used:
data[data.duplicated()]
to identify the lines causing the error but the data that it lists doesn't seem to contain any duplicates.
This led me to export the dataframe with to_csv(), open the data in Stata, and run the duplicates list command, which reports that the dataset holds no duplicates (according to Stata).
An Example from the sorted csv file:
year country productcode duplicate
1962 MYS 711 FALSE
1962 MYS 712 TRUE
1962 MYS 721 FALSE
I know it's a long shot, but any ideas what might be causing this? The data types of the index columns are {'year': int, 'country': str, 'productcode': str}. Could it be how pandas defines the unique groups? Are there better ways to list the offending index lines?
Update:
I have tried resetting the index
temp = data.reset_index()
dup = temp[temp.duplicated(subset=['year', 'country', 'productcode'])]
and I get a completely different list!
year country productcode
1994 HKG 9710
1994 USA 9710
1995 HKG 9710
1995 USA 9710
Update 2 [28 June 2013]:
It appears to have been a strange memory issue during my IPython session.
This morning's fresh instance seems to work fine and reshapes the data without any adjustments to yesterday's code! I will debug further if the issue returns and let you know. Does anyone know of a good debugger for IPython sessions?

Perhaps try:
cleaned = df.reset_index().drop_duplicates(df.index.names)
cleaned.set_index(df.index.names, inplace=True)
I think there ought to be a duplicated method on the index, but there isn't one yet:
https://github.com/pydata/pandas/issues/4060
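Note: since that issue was filed, pandas has added Index.duplicated(), which also works on a MultiIndex, so on a recent version the offending rows can be listed directly. A minimal sketch, assuming data is the MultiIndexed frame from the question:
# keep=False flags every occurrence of a duplicated index entry,
# not just the second and later ones, so both halves of each pair show up
data[data.index.duplicated(keep=False)]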

Related

Pandas drop creates a NoneType Object [duplicate]

I have a DataFrame like this (the first column is the index (786...), the second is Day (25...), and Rainfall amount is empty):
Day Rainfall amount (millimetres)
786 25
787 26
788 27
789 28
790 29
791 1
792 2
793 3
794 4
795 5
and I want to delete row 790. I tried so many things with df.drop but nothing happened.
I hope you can help me.
df.drop returns a new DataFrame rather than modifying the existing one. If you want to apply the change to the current DataFrame, you have to specify the inplace parameter.
Option 1
Assigning back to df -
df = df.drop(790)
Option 2
Inplace argument -
df.drop(790, inplace=True)
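A minimal, self-contained illustration of the difference, using synthetic data shaped like the question's (the index labels 786-790 are the assumption here):
import pandas as pd

df = pd.DataFrame({'Day': [25, 26, 27, 28, 29]}, index=[786, 787, 788, 789, 790])
df.drop(790)                  # returns a new DataFrame; df itself is unchanged
df = df.drop(790)             # Option 1: assign the result back
# df.drop(790, inplace=True)  # Option 2: modify df in place instead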
As others may be in my shoes, I'll add a bit here. I've merged three CSV files of data and they mistakenly have the headers copied into the dataframe. Now, naturally, I assumed pandas would have an easy method to remove these obviously bad rows. However, it's not working and I'm still a bit perplexed with this. After using df.drop() I see that the length of my dataframe correctly decreases by 2 (I have two bad rows of headers). But the values are still there and attempts to make a histogram will throw errors due to empty values. Here's the code:
df1 = pd.read_csv('./summedDF_combined.csv', index_col=[0])
print(len(df1['x']))
badRows = pd.isnull(pd.to_numeric(df1['y'], errors='coerce')).to_numpy().nonzero()[0]
print("Bad rows:", badRows)
df1.drop(badRows, inplace=True)
print(len(df1['x']))
I've tried other functions in tandem with no luck. This shows an empty list for badRows, but the data still will not plot because the bad rows are still in the df, just deindexed:
print(len(df1['x']))
df1 = df1.dropna().reset_index(drop=True)
df1 = df1.dropna(axis=0).reset_index(drop=True)
badRows = pd.isnull(pd.to_numeric(df1['x'], errors='coerce')).to_numpy().nonzero()[0]
print("Bad rows:", badRows)
I'm stumped, but have one solution that works for the subset of folks who merged CSV files and got stuck. Go back to your original files and merge again, but take care to exclude the headers like so:
head -n 1 anyOneFile.csv > summedDFs.csv && tail -n+2 -q summedBlipDF2*.csv >> summedDFs.csv
Apologies, I know this isn't the pythonic or pandas way to fix it and I hope the mods don't feel the need to remove it as it works for the small subset of people with my problem.
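For reference, a pandas-native sketch of the same cleanup (assuming the file and column names from the snippet above). The likely catch is that once stray header rows are mixed in, pandas reads the whole column as strings, so dropping the bad rows by label alone isn't enough; the survivors still need converting before plotting:
import pandas as pd

df1 = pd.read_csv('./summedDF_combined.csv', index_col=[0])
# Coerce to numeric: real numbers survive, stray header text becomes NaN
mask = pd.to_numeric(df1['y'], errors='coerce').notna()
df1 = df1[mask].reset_index(drop=True)  # filter with a boolean mask, not labels
df1['y'] = pd.to_numeric(df1['y'])      # convert the surviving strings to numbers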

How to create a Dataframe from rows with conditions from another existing Dataframe using pandas?

So I have this problem: because of the size of the dataframe I am working on, I clearly cannot upload it, but it has the following structure:
    country  coastline   EU  highest
1    Norway        yes  yes     1500
2    Turkey        yes   no    20100
..      ...        ...  ...      ...
41  Bolivia         no   no      999
42    Japan        yes   no       89
I have to solve several exercises with pandas. Among them is, for example, showing the country with the maximum "highest" value, the country with the minimum, and the average, but only for the countries that do belong to the EU. I already solved the maximum and the minimum, but for the average I thought about creating a new dataframe, one built from only the rows that contain a "yes" in the EU column. I've tried a few things, but they haven't worked.
I thought this is the best way to solve it, but if anyone has another idea, I'm looking forward to reading it.
By the way, these are the examples that I said that I was able to solve:
print('Minimum outside the EU')
paises[(paises.EU == "no")].sort_values(by=['highest'], ascending=[True]).head(1)
Which gives me this:
   country  coastline  EU  highest
3  Belarus         no  no      345
As a last condition, this must be solved using pandas, since it is basically the chapter that we are working on in classes.
If you want to create a new dataframe that is based off of a filter on your first, you can do this:
new_df = df[df['EU'] == 'yes'].copy()
This will look at the 'EU' column in the original dataframe df and only return the rows where it is 'yes'. I think it is good practice to add the .copy(), since we can sometimes get strange side effects if we then make changes to new_df (it probably wouldn't matter here).
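From there, the average the question asks about is just a column reduction on the filtered frame. A short sketch using the paises name from the question (it assumes the highest column is numeric):
eu = paises[paises['EU'] == 'yes'].copy()
print('Average "highest" in the EU:', eu['highest'].mean())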

Raise exception in pandas when one column has duplicates but only for given rows

I have a dataframe read in from this Excel file; if you look at FNCL 2019 and 2018, you'll see that those years (and only in the Vintage column, not Bal) are duplicated. How could I raise an exception to prevent that from happening? It's not that 2019 and 2018 cannot show up multiple times in the Vintage column, but rather that, within the FNCL cohort (or any other, for that matter), each Vintage cannot show up more than once.
There are two different ways, depending on the expected outcome. If you want to combine the Bal entries for the same cohort and Vintage with a function 'f', then
df.groupby(['Cohort', 'Vintage']).agg({'Bal': f}).reset_index()
Otherwise, if you just want to drop the duplicates (keeping the first row), you can use
df.drop_duplicates(subset=['Cohort', 'Vintage'], keep='first')
Assuming the first values are the correct ones, why not simply remove the rest? You can use the following to drop duplicate rows based on certain columns:
df_unique = df.drop_duplicates(subset=['Cohort', 'Vintage'])
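Neither snippet actually raises, though. If you want the exception the question asks for, one sketch (assuming the column names Cohort and Vintage as above) is to test for the duplicates explicitly:
# keep=False marks every row of each duplicated (Cohort, Vintage) pair
dupes = df.duplicated(subset=['Cohort', 'Vintage'], keep=False)
if dupes.any():
    raise ValueError(
        'Duplicate Vintage within a Cohort:\n'
        + df.loc[dupes, ['Cohort', 'Vintage']].to_string()
    )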

python pandas multi-indexed dataframe selection

Although I found multiple questions on the topic, I could not find a solution for this one in particular.
I am playing around with this CSV file, which contains a subselection of TBC data from the WHO:
http://dign.eu/temp/tbc.csv
import pandas as pd
df = pd.read_csv('tbc.csv', index_col=['country', 'year'])
This gives a nicely formatted DataFrame, sorted on country and year, showing one of the parameters.
Now, for this case I would like the mean value of "param" for each country over all available years. Using df.mean() gives me an overall value, and df.mean(axis=1) removes all indices, which makes the results useless.
Obviously I can do this using a loop, but I guess there is a smarter way. But how?
If I understand you correctly, you want to pass the level to the mean function:
In [182]:
df.mean(level='country')
Out[182]:
param
country
Afghanistan 8391.312500
Albania 183.888889
Algeria 8024.588235
American Samoa 1.500000
....
West Bank and Gaza Strip 12.538462
Yemen 4029.166667
Zambia 13759.266667
Zimbabwe 12889.666667
[219 rows x 1 columns]
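One caveat for readers on current pandas: the level= argument to reductions like mean() was deprecated and then removed in pandas 2.0. The equivalent on recent versions is a groupby on the index level:
df.groupby(level='country').mean()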
