I have a problem with my pandas dataframe. Pandas detects duplicated rows, but there are none.
I wanted to use the pivot function but I have the error message “ValueError: Index contains duplicate entries, cannot reshape”.
So I tried to find the duplicated rows in my dataframe, and when I used the duplicated() function the result was:
Number_id [...] Name Value
802 001 [...] Name1 41
809 001 [...] Name2 75
813 001 [...] Name3 13
845 001 [...] Name4 2
Obviously, those rows are not the same: for each row, Number_id, Name and Value are different.
My dataframe dimensions are [860 rows x 10 columns]. There are 215 Number_id values, and each Number_id has 4 values, one for each Name (215*4 = 860).
I wanted to use the pivot function like this :
df.pivot(index=list_of_index_columns, columns='Name', values='Value')
The list_of_index_columns corresponds to all the columns of the df except Name and Value, so 8 columns.
I don't know how to handle this. Can I have some help?
I am using Spyder 3.8.
There is duplication in your data. For example:

import pandas as pd

df = pd.DataFrame([
    ['0','0','C','0','E','0'],
    ['A','0','0','0','0','F'],
    ['A','0','0','0','0','F'],
    ['0','0','C','D','0','0'],
    ['A','B','0','0','0','0']], columns=['A','B','C','D','E','F']
)

df[df.duplicated()]            # flags only the later copies of each duplicated row
df[df.duplicated(keep=False)]  # flags every copy, including the first occurrence
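In your case, pivot's key is the 8 index columns plus the Name column, so rows can collide on that key even when the full rows differ. A minimal sketch to surface the colliding rows, assuming list_of_index_columns is the same list you pass to pivot:

key_cols = list_of_index_columns + ['Name']   # the combination pivot requires to be unique
dupes = df[df.duplicated(subset=key_cols, keep=False)]
print(dupes.sort_values(key_cols))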
I have a pandas dataframe with something like the following:
index  order_id  cost
123a        123     5
123b        123  None
123c        123     3
124a        124  None
124b        124  None
For each unique value of order_id, I'd like to drop any row that doesn't have the lowest cost. For any order_id that only contains nulls for cost, any one of its rows can be retained.
I've been struggling with this for a while now.
ol3 = ol3.loc[ol3.groupby('Order_ID').cost.idxmin()]
This code doesn't play nice with the order_ids that have only nulls. So I tried to figure out how to drop the nulls I don't want with:
ol4 = ol3.loc[ol3['cost'].isna()].drop_duplicates(subset=['Order_ID', 'cost'], keep='first')
This gives me the list of null order_id's I want to retain. Not sure where to go from here. I'm pretty sure I'm looking at this the wrong way. Any help would be appreciated!
You can use transform to broadcast the minimum cost per order_id to every row, then keep the rows that match it. We additionally need an isna check for the special order_ids that contain only NaNs:
order_mins = df.groupby('order_id').cost.transform('min')
df[(df.cost == order_mins) | (order_mins.isna())]
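A quick, self-contained check against the sample data from the question (column names taken from the table above):

import pandas as pd

df = pd.DataFrame({
    'index': ['123a', '123b', '123c', '124a', '124b'],
    'order_id': [123, 123, 123, 124, 124],
    'cost': [5, None, 3, None, None],
})

order_mins = df.groupby('order_id').cost.transform('min')   # group minimum, aligned to each row
print(df[(df.cost == order_mins) | (order_mins.isna())])
# keeps 123c (lowest cost for 123) and both all-NaN rows of order 124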
You can (temporarily) fill the NA/None with np.inf (numpy imported as np) before getting the idxmin:
ol3.loc[ol3['cost'].fillna(np.inf).groupby(ol3['order_id']).idxmin()]
You will have exactly one row per order_id: idxmin returns the first occurrence of the minimum, so an all-NaN group keeps its first row.
output:
index order_id cost
2 123c 123 3.0
3 124a 124 NaN
Most of the other questions about updating values in a pandas df are focused on appending a new column or just replacing a cell with a new value. My question is a bit different. Assuming my df already has values in it, and I find a new value, I need to add it to the cell's current value. For example, if a cell already has 5 and I find the value 10 in my file for that column/row, the cell should now hold 15.
But I am having trouble writing this bit of code and even getting values to show up in my dataframe.
I have a dictionary, for example:
id_dict={'Treponema': ['162'], 'Leptospira': ['174'], 'Azospirillum': ['192'], 'Campylobacter': ['195', '197', '199', '201'], 'Pseudomonas': ['287'], 'NONE': ['2829358', '2806529']}
And I have sample id files that contain ids and the number of times those ids showed up in a previous file; the first value is the count and the second value is the id.
cat Sample1_idsummary.txt
1,162
15,174
4,195
5,197
6,201
10,2829358
Some of the ids have the same key in id_dict and I need to create a dataframe like the following:
Sample Treponema Leptospira Azospirillum Campylobacter Pseudomonas NONE
0 sample1 1 15 0 15 0 10
Here is my script, but my issue is that my output is always zero for all columns.
import sys
import pandas as pd

samplefile = sys.argv[1]
sample_ID = samplefile.split("_")[0]  ## get just ID name

def get_ids_counts(id_dict, samplefie):
    '''Obtain a table of id counts from the samplefile.'''
    column_names = ["Sample"]
    column_names.extend([x for x in list(id_dict.keys())])
    df = pd.DataFrame(columns=column_names)
    df["Sample"] = [sample_ID]
    with open(samplefile) as sf:  # open the sample taxid count file
        for line in sf:
            id = line.split(",")[1]  # the taxid (multiple can hit the same lineage info)
            idcount = int(line.split(",")[0])  # the count from uniq
            # For all keys in the dict, if that key is in the sample id file use the count from the id file
            # Otherwise all keys not found in the file are "0" in the df
            if id in id_dict:
                df[list(id_dict.keys())[list(id_dict.values().index(id))]] = idcount
    return df.fillna(0)
It's the very last if statement that is confusing me. How do I make idcount add up each time it hits the same key, and why do I always get zeros filled in?
UPDATE
The method mentioned below worked! Here is the updated code:
def get_ids_counts(id_dict, samplefie):
    '''Obtain a table of id counts from the samplefile.'''
    df = pd.DataFrame([id_dict]).stack().explode().to_frame('id').droplevel(0).reset_index().astype({'id': int})
    iddf = pd.read_csv(samplefile, sep=",", names=["count", "id"])
    df = df.merge(iddf, how='outer').fillna(0).groupby('index')['count'].sum().to_frame(sample_ID).T
    return df
And the output, which is still not coming up right:
index 0 Azospirillaceae Campylobacteraceae Leptospiraceae NONE Pseudomonadacea Treponemataceae
mini 106.0 0.0 20.0 0.0 0.0 0.0 5.0
UPDATE 2
With the code below and using my proper files I've managed to get the table but cannot for the life of me get the "NONE" column to show up anymore. Any suggestions? My output is essentially every key value with proper counts but "NONE" disappears.
Instead of doing it iteratively like that, you can automate the whole thing with pandas operations.
Start by creating the dataframe from id_dict:
df = pd.DataFrame([id_dict]).stack().explode().to_frame('id').droplevel(0).reset_index()\
.astype({'id': int})
index id
0 Treponema 162
1 Leptospira 174
2 Azospirillum 192
3 Campylobacter 195
4 Campylobacter 197
5 Campylobacter 199
6 Campylobacter 201
7 Pseudomonas 287
8 NONE 2829358
9 NONE 2806529
Read the count/id text file into a data frame:
idDF = pd.read_csv('Sample1_idsummary.txt', sep=',' , names=['count', 'id'])
count id
0 1 162
1 15 174
2 4 195
3 5 197
4 6 201
5 10 2829358
Now outer-merge the two dataframes, fill the NaNs with 0, group by index, take the sum, turn the result into a dataframe with to_frame (passing the sample name as the column name), and finally transpose it:
df.merge(idDF, how='outer').fillna(0).groupby('index')['count'].sum().to_frame('Sample1').T
OUTPUT:
index Azospirillum Campylobacter Leptospira NONE Pseudomonas Treponema
Sample1 0.0 15.0 15.0 10.0 0.0 1.0
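If you need one row per sample file, as in your script, a minimal sketch of the same idea in a loop could look like this (the list of file names is hypothetical, and id_dict is assumed to be the dictionary from the question):

import pandas as pd

id_df = (pd.DataFrame([id_dict]).stack().explode().to_frame('id')
           .droplevel(0).reset_index().astype({'id': int}))

sample_files = ['Sample1_idsummary.txt', 'Sample2_idsummary.txt']  # hypothetical file names
rows = []
for f in sample_files:
    sample_id = f.split('_')[0]
    counts = pd.read_csv(f, sep=',', names=['count', 'id'])
    rows.append(id_df.merge(counts, how='outer').fillna(0)
                     .groupby('index')['count'].sum().to_frame(sample_id).T)

result = pd.concat(rows)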
I have the following csv file that I converted to a DataFrame:
apartment,floor,gasbill,internetbill,powerbill
401,4,120,nan,340
409,4,190,50,140
410,4,155,45,180
I want to iterate over the rows, and if the value in the internetbill column is not a number, delete that whole row. So in this example, the "401,4,120,nan,340" row would be eliminated from the DataFrame.
I thought that something like this would work, but to no avail, and I'm stuck:
df.drop[df['internetbill'] == "nan"]
If you are using pd.read_csv, then that nan will get imported as np.nan. If so, then you need dropna:
df.dropna(subset=['internetbill'])
apartment floor gasbill internetbill powerbill
1 409 4 190 50.0 140
2 410 4 155 45.0 180
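For example, a self-contained check of the dropna route, reading the csv text from the question:

import io
import pandas as pd

csv_text = """apartment,floor,gasbill,internetbill,powerbill
401,4,120,nan,340
409,4,190,50,140
410,4,155,45,180"""

df = pd.read_csv(io.StringIO(csv_text))     # 'nan' is parsed as np.nan by default
print(df.dropna(subset=['internetbill']))   # drops the apartment 401 row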
If those are strings for whatever reason, you could do one of two things:
replace
df.replace({'internetbill': {'nan': np.nan}}).dropna(subset=['internetbill'])
to_numeric
df.assign(
internetbill=pd.to_numeric(df['internetbill'], errors='coerce')
).dropna(subset=['internetbill'])
I have three dataframes with row counts more than 71K. Below are the samples.
df_1 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001],'Col_A':[45,56,78,33]})
df_2 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887],'Col_B':[35,46,78,33,66]})
df_3 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887,1223],'Col_C':[5,14,8,13,16,8]})
Edit
As suggested, below is my desired output:
df_final
Device_ID Col_A Col_B Col_C
1001 45 35 5
1034 56 46 14
1223 78 78 8
1001 33 33 13
1887 NaN 66 16
1223 NaN NaN 8
While using pd.merge() or df_1.set_index('Device_ID').join([df_2.set_index('Device_ID'), df_3.set_index('Device_ID')], on='Device_ID'), it takes a very long time. One reason is the repeated values of Device_ID.
I am aware of the reduce method, but I suspect it may lead to the same situation.
Is there a better, more efficient way?
To get your desired outcome, you can use this:
result = pd.concat([df_1.drop('Device_ID', axis=1), df_2.drop('Device_ID', axis=1), df_3], axis=1).set_index('Device_ID')
If you don't want to use Device_ID as index, you can remove the set_index part of the code. Also, note that because of the presence of NaN's in some columns (Col_A and Col_B) in the final dataframe, Pandas will cast non-missing values to floats, as NaN can't be stored in an integer array (unless you have Pandas version 0.24, in which case you can read more about it here).
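If you want to keep integer columns despite the NaNs, one option on recent pandas versions (a sketch, assuming the column names above) is the nullable Int64 dtype:

# cast the columns that picked up NaNs to pandas' nullable integer dtype
result = result.astype({'Col_A': 'Int64', 'Col_B': 'Int64'})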
I'm not sure I'm going to describe this right, but I'll try.
I have several excel files with about 20 columns and 10k or so rows. Let's say the column names are in the form col1, col2...col20.
Col2 is a timestamp column, so, for instance, a value could read: "2012-07-25 14:21:00".
I want to read the excel files into a DataFrame and perform some time series and grouping operations.
Here's some simplified code to load an excel file:
xl = pd.ExcelFile(os.path.join(dirname, filename))
df = xl.parse(xl.sheet_names[0], index_col=1) # Col2 above
When I run
df.index
it gives me:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-19 15:37:55, ..., 2012-02-02 16:13:42]
Length: 9977, Freq: None, Timezone: None
as expected. However, inspecting the columns, I get:
Index([u'Col1', u'Col2',...u'Col20'], dtype='object')
Which may be why I have problems with some of the manipulation I want to do. So for instance, when I run:
df.groupby(category_col).count()
I expect to get a dataframe with 1 row for each category and 1 column containing the count for that category. Instead, I get a dataframe with 1 row for each category and 19 columns describing the number of values for that column/category pair.
The same thing happens when I try to resample:
df.resample('D', how='count')
Instead of a single column Dataframe with the number of records per day, I get:
2012-01-01  Col1     8
            Col2     8
            Coln     8
2012-01-02  Col1    10
            Col2    10
            Coln    10
Is this normal behavior? How would I instead get just one value per day or per category?
Based on this blog post from Wes McKinney, I think the problem is that I have to run my operations on a specific column, namely a column that I know won't have missing data.
So instead of doing:
df.groupby(category_col).count()
I should do:
df['col3'].groupby(df[category_col]).count()
and this:
df2.resample('D', how='count')
should be this:
df2['col3'].resample('D', how='count')
The results are more in line with what I'm looking for:
Category
Cat1 1232
Cat2 7677
Cat3 1053
Date
2012-01-01 8
2012-01-02 66
2012-01-03 89
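As a side note, on current pandas versions the how= keyword has been removed from resample; the modern spelling of the last call (same column name assumed) would be:

df2['col3'].resample('D').count()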