This question is related to a question I asked previously about using the pandas get_dummies() function (link below).
Pandas Get_dummies for nested tables
However, in the course of using the solution provided in the answer, I noticed odd behavior with the groupby function. The issue is that repeated (non-unique) index values for a dataframe appear to cause an error when the matrix is represented in sparse format, while everything works as expected for a dense matrix.
I have extremely high-dimensional data, so a sparse matrix is required for memory reasons. An example of the error is below. If anyone has a workaround it would be greatly appreciated.
Working:
import pandas as pd
df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name')
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Failing:
import pandas as pd
df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name', sparse=True)
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Note: you will need pandas version 0.16.1 or greater.
Thank you in advance
You can perform your groupby in a different way as a workaround. Don't set Instance as the index; instead, use the column directly in your groupby and then drop the Instance column (the last column in this case, since it was just added). Groupby will create an Instance index for you.
import pandas as pd
df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name', sparse=True)
result['Instance'] = df.Instance
#WORKAROUND:
result=result.groupby('Instance').apply(max)[result.columns[:-1]]
result
Out[58]:
          Name_Doe  Name_Jane  Name_John  Name_Smith
Instance
1                0          0          1           1
2                0          1          0           0
3                1          0          0           0
Note: the sparse dataframe stores your Instance ints as floats within a BlockIndex in the dataframe column. To make the index exactly the same as in the first example, you need to convert it from float back to int:
result.index=result.index.map(int)
result.index.name='Instance'
I have a Pandas dataframe (tempDF) of 5 columns by N rows. Each element of the dataframe is an object (string in this case). For example, the dataframe looks like (this is fake data - not real world):
I have two tuples, each containing a collection of numbers stored as strings. For example:
codeset = ('6108','532','98120')
additionalClinicalCodes = ('131','1','120','130')
I want to retrieve a subset of the rows from tempDF in which the column "medcode" OR the column "enttype" has at least one entry in the tuples above. Thus, from the example above, I would retrieve a subset containing the rows with indices 8, 9 and 11.
Until updating some packages earlier today (too many now to work out which has started throwing the warning), this did work:
tempDF = tempDF[tempDF["medcode"].isin(codeset) | tempDF["enttype"].isin(additionalClinicalCodes)]
But now it is throwing the warning:
FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask |= (ar1 == a)
Looking at the API, isin states the condition "if ALL" is in the iterable collection; I want an "if ANY" condition.
UPDATE #1
The problem lies with using the | operator, and also with the np.logical_or method. If I remove the second isin condition, i.e. just keep tempDF[tempDF["medcode"].isin(codeset)], then no warning is thrown, but then I'm only subsetting on one of the two conditions.
import numpy as np
tempDF = tempDF[np.logical_or(tempDF["medcode"].isin(codeset), tempDF["enttype"].isin(additionalClinicalCodes))]
I'm unable to reproduce your warning (I assume you are using an outdated numpy version); however, I believe it is related to the fact that your enttype column is a numerical type while you're using strings in additionalClinicalCodes.
Try this:
tempDF = tempDF[tempDF["medcode"].isin(list(codeset)) | tempDF["enttype"].isin(list(additionalClinicalCodes))]
Boiling your question down to an executable example:
import pandas as pd
tempDF = pd.DataFrame({'medcode': ['6108', '6154', '95744', '98120'], 'enttype': ['99', '131', '372', '372']})
codeset = ('6108','532','98120')
additionalClinicalCodes = ('131','1','120','130')
newDF = tempDF[tempDF["medcode"].isin(codeset) | tempDF["enttype"].isin(additionalClinicalCodes)]
print(newDF)
print("Pandas Version")
print(pd.__version__)
This returns for me
  medcode enttype
0    6108      99
1    6154     131
3   98120     372
Pandas Version
1.4.2
Thus I am not able to reproduce your warning.
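If the dtype mismatch is indeed the cause on your side, one option is to cast the numeric column to strings before the membership test. This is only a sketch of that idea (with enttype stored as integers purely for illustration), not a reproduction of your setup:
import pandas as pd
tempDF = pd.DataFrame({'medcode': ['6108', '6154', '95744', '98120'],
                       'enttype': [99, 131, 372, 372]})   # enttype as integers, for illustration
codeset = ('6108', '532', '98120')
additionalClinicalCodes = ('131', '1', '120', '130')
# Cast the numeric column to string so both sides of the comparison share a dtype,
# which avoids the elementwise-comparison warning caused by mixed-type comparisons.
mask = tempDF["medcode"].isin(codeset) | tempDF["enttype"].astype(str).isin(additionalClinicalCodes)
newDF = tempDF[mask]
print(newDF)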
This is strange numpy behaviour. I think the right way to do this is your way, but if the warning bothers you, try this:
tempDF = tempDF[
    (
        tempDF.medcode.isin(codeset).astype(int) +
        tempDF.enttype.isin(additionalClinicalCodes).astype(int)
    ) >= 1
]
I have a dataset which I transformed to CSV as potential input for a Keras autoencoder.
Loading the CSV works flawlessly with pandas.read_csv(), but the data types are not correct.
The CSV contains only two columns: label and features. The label column contains strings and the features column contains arrays of signed integers ([-1, 1]), so in general it is a pretty simple structure.
To get two different dataframes for further processing I created them via:
labels = pd.DataFrame(columns=['label'], data=csv_data, dtype='U')
and
features = pd.DataFrame(columns=['features'], data=csv_data)
In both cases I get the wrong datatypes, as both are marked as object-typed dataframes. What am I doing wrong?
For the features it is even harder, because parsing returns a pandas Series that contains the array as a string: ['[1, ..., 1]'].
So I tried a tedious workaround: parsing the string back to a numpy array via .to_numpy(), a Python cast for every element, and then an np.asarray() - but the type of the dataframe is still incorrect. I don't think this can be the general approach for this task. As I am fairly new to pandas I checked some tutorials and the API, but in most cases a cell in a dataframe contains a single value rather than a complete array. Maybe my overall design of the dataframe is just not suitable for this task.
Any help appreciated!
You are reading the file as strings, but you have a Python list as a column, so you need to evaluate the string to get the list back.
I am not sure of the use case, but you can also split the labels for a more readable dataframe.
import pandas as pd
features = ["featurea","featureb","featurec","featured","featuree"]
labels = ["[1,0,1,1,1,1]","[1,0,1,1,1,1]","[1,0,1,1,1,1]","[1,0,1,1,1,1]","[1,0,1,1,1,1]"]
df = pd.DataFrame(list(zip(features, labels)),
                  columns=['Features', 'Labels'])
import ast
#convert Strings to lists
df['Labels'] = df['Labels'].map(ast.literal_eval)
df.index = df['Features']
#Since list itself might not be useful you can split and expand it to multiple columns
new_df = pd.DataFrame(df['Labels'].values.tolist(),index= df.index)
Output
          0  1  2  3  4  5
Features
featurea  1  0  1  1  1  1
featureb  1  0  1  1  1  1
featurec  1  0  1  1  1  1
featured  1  0  1  1  1  1
featuree  1  0  1  1  1  1
The input CSV was formatted incorrectly, therefore the parsing was accurate but not what I intended. I expanded the real columns and skipped the header so that there is a column for every array entry - now pandas recognizes the types and the correct dimensions.
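If you ever want to keep the original single features column instead, a minimal sketch is to let read_csv parse the stringified arrays with a converter. The file name data.csv and the column names label/features are assumptions here, matching the description above:
import ast
import pandas as pd
# 'data.csv', 'label' and 'features' are placeholder names for illustration.
df = pd.read_csv('data.csv', converters={'features': ast.literal_eval})  # "[1, -1, 1]" -> [1, -1, 1]
labels = df['label'].astype('string')                # pandas' nullable string dtype
features = pd.DataFrame(df['features'].tolist())     # expand each array into its own set of columns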
My question is a little bit different from the question posted here
So I thought I'd open a new thread. I have a pandas data frame with 5 attributes. One of these attributes is created using a pandas Series. Here is the sample code for creating the data frame:
import numpy as np
import pandas as pd

mydf1 = pd.DataFrame(columns=['group','id','name','mail','gender'])
data = np.array([2540948, 2540955, 2540956, 2540956, 7138932])
x = pd.Series(data)
mydf1.loc[0] = [1, x, 'abc', 'abc#xyz.com', 'male']
I have another data frame; the code for creating it is given below:
mydf2=pd.DataFrame(columns=['group','id'])
data1 = np.array([2540948, 2540955, 2540956])
y=pd.Series(data1)
mydf2.loc[0]=[1,y]
These are sample data; the actual data will have a large number of rows, and the series are long too. I want to match mydf1 with mydf2 (sometimes there won't be a matching element in mydf2), and where they match I want to delete from mydf1's id the values that are also present in mydf2. For example, after the run, the id for group 1 will be 2540956, 7138932. I also tried the code mentioned in the link above, but for the first line
counts = mydf1.groupby('id').cumcount()
I got the error message
TypeError: 'Series' objects are mutable, thus they cannot be hashed
in my Python 3.x. Can you please suggest how to solve this?
This should work. We use Counter to find the difference between two lists of ids. (P.S. This assumes the difference does not need to preserve order.)
Setup
import numpy as np
import pandas as pd
from collections import Counter

mydf1 = pd.DataFrame(columns=['group','id','name','mail','gender'])
x = [2540948, 2540955, 2540956, 2540956, 7138932]
y = [2540948, 2540955, 2540956, 2540956, 7138932]
mydf1.loc[0] = [1, x, 'abc', 'abc#xyz.com', 'male']
mydf1.loc[1] = [2, y, 'def', 'def#xyz.com', 'female']
mydf2=pd.DataFrame(columns=['group','id'])
x2 = np.array([2540948, 2540955, 2540956])
y2 = np.array([2540955, 2540956])
mydf2.loc[0]=[1,x2]
mydf2.loc[1]=[2,y2]
Code
mydf3 = mydf1[["group", "id"]]
mydf3 = mydf3.merge(mydf2, how="inner", on="group")
new_id_finder = lambda x: list((Counter(x.id_x) - Counter(x.id_y)).elements())
mydf3["new_id"] = mydf3.apply(new_id_finder, 1)
mydf3["new_id"]
   group                       new_id
0      1           [2540956, 7138932]
1      2  [2540948, 2540956, 7138932]
One Counter object can subtract another to get the difference in occurrences of elements. Then you can use the elements() method to retrieve all the values that are left.
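As a standalone illustration of that Counter arithmetic, using the same ids as in the setup above:
from collections import Counter
full = Counter([2540948, 2540955, 2540956, 2540956, 7138932])
remove = Counter([2540948, 2540955, 2540956])
# Subtraction keeps only the surplus occurrences of each element.
diff = full - remove
print(list(diff.elements()))  # [2540956, 7138932]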
I have a DataFrame similar to this:
import numpy as np
raw_data = {'Identifier':['10','10','10','11',11,'12','13']}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['Identifier'])
print df
As you can see the 'Identifier' column is not unique and the dataframe itself has many rows.
Whenever I do a calculation on the Identifier column, I use something like:
df['CalculatedColumn'] = df['Identifier'] + apply calculation here
As Identifier is not unique, is there a better way of doing this? Maybe store the calculation for each unique identifier and then pass in the results? The calculation is quite complex and, combined with the number of rows, it takes a long time; I would like to reduce that time by taking advantage of the fact that the identifiers are not unique.
Any thoughts?
I'm pretty sure there is a more pythonic way, but this works for me:
import numpy as np
import pandas as pd

raw_data = {'Identifier': ['10', '10', '10', '11', '11', '12', '13']}
df = pd.DataFrame(raw_data, columns=['Identifier'])
df['CalculatedColumn'] = 0
dfuni = df.drop_duplicates(['Identifier'])
dfuni['CalculatedColumn'] = dfuni['Identifier'] * 2  # perform calculation
for j in range(len(dfuni)):
    df['CalculatedColumn'][df['Identifier'] == dfuni['Identifier'].iloc[j]] = dfuni['CalculatedColumn'].iloc[j]
print(df)
print(dfuni)
As an explanation: I'm creating a new dataframe dfuni which contains all the unique values of your original dataframe. Then you perform your calculation on this (I just multiplied the value of the Identifier by two, and because it is a string, the result is 1010, 1111, etc.). Up to here I like the code, but then I'm using a loop through all the values of dfuni to copy them back into your original df. For this part there might be a more elegant solution (see the sketch after the output below).
As a result, I get:
  Identifier CalculatedColumn
0         10             1010
1         10             1010
2         10             1010
3         11             1111
4         11             1111
5         12             1212
6         13             1313
PS: This code was tested with Python 3. The only thing I adapted was the print-statements. I may have missed something.
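Since the loop back into df is the part flagged above as possibly inelegant, here is one loop-free sketch of the same idea (not the original answer's code): run the calculation once per unique Identifier, then map the results back onto every row. The expensive_calculation function is a placeholder for the real calculation.
import pandas as pd

raw_data = {'Identifier': ['10', '10', '10', '11', '11', '12', '13']}
df = pd.DataFrame(raw_data, columns=['Identifier'])

def expensive_calculation(identifier):
    # placeholder for the real, complex calculation
    return identifier * 2

# Compute once per unique value, then map the results onto every row.
unique_results = {ident: expensive_calculation(ident) for ident in df['Identifier'].unique()}
df['CalculatedColumn'] = df['Identifier'].map(unique_results)
print(df)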
I have a data frame (df) consisting of more than 1000 columns. Each cell contains a list.
e.g.
         0        1        2   ....  n
0  [1,2,3]  [3,7,9]  [1,2,1]   ....  [x,y,z]
1  [2,5,6]  [2,3,1]  [3,3,3]   ....  [x1,y1,z1]
2     None  [2,0,1]  [2,2,2]   ....  [x2,y2,z2]
3     None  [9,5,9]     None   ....  None
Each list actually holds the dimensions of a point. I need to find the Euclidean distance of each cell in column 0 to every cell in column 1 and store the minimum.
Similarly from column 0 to column 2, then to column 3, and so on.
Example
distance of df[0][0] from df[1][0], df[1][1], df[1][2]
then of df[0][1] from df[1][0], df[1][1], df[1][2] and so on...
Currently I am doing it with the help of for loops, but it is taking a lot of time for large data.
Following is the implementation:
from scipy.spatial import distance

for n in range(len(df.columns)):
    for m in range(n + 1, len(df.columns)):
        for q in range(df.shape[0]):
            min1 = 9999
            r = 0
            while r < df.shape[0]:
                if df[n][q] is not None and df[m][r] is not None:
                    dist = distance.euclidean(df[n][q], df[m][r])
                    if dist < min1:
                        min1 = dist
                    if min1 == 0:  # because a distance can never be less than zero
                        break
                r = r + 1
Is there any other way to do this?
You can use pandas apply. The example below will compute the distance between columns 0 and 1 and create a new column.
import pandas as pd
import numpy as np

df = pd.DataFrame({'0': [[1, 2, 3], [2, 5, 6]], '1': [[3, 7, 9], [2, 3, 1]]})

def euclidean(row):
    # Euclidean distance between the two lists held in columns '0' and '1'
    x = (row['0'][0] - row['1'][0]) ** 2
    y = (row['0'][1] - row['1'][1]) ** 2
    z = (row['0'][2] - row['1'][2]) ** 2
    return np.sqrt(x + y + z)

df['dist0-1'] = df.apply(euclidean, axis=1)
With this said, I highly recommend unrolling your variables into separate columns, e.g. 0.x, 0.y, 0.z, etc., and then using numpy to operate directly on the columns. This will be much faster if you have a large amount of data.
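A minimal sketch of that idea for the two example columns above (assuming every cell holds a 3-element list and ignoring None handling):
import numpy as np
import pandas as pd

df = pd.DataFrame({'0': [[1, 2, 3], [2, 5, 6]], '1': [[3, 7, 9], [2, 3, 1]]})

# Unroll each list column into a plain (n_rows, 3) array.
a = np.vstack(df['0'].to_numpy())
b = np.vstack(df['1'].to_numpy())

# Vectorised row-wise Euclidean distance, with no Python-level loop over rows.
df['dist0-1'] = np.linalg.norm(a - b, axis=1)
print(df)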
Since you don't say how to handle None in the distance calculation, I just give an example; you need to handle the None entries yourself.
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist

df = pd.DataFrame([{'a': [1, 2, 3], 'b': [3, 7, 9], 'c': [1, 2, 1]},
                   {'a': [2, 5, 6], 'b': [2, 3, 1], 'c': [3, 3, 3]}])

def distance(row):
    # pairwise Euclidean distances between all lists in this row
    return pd.Series(pdist(np.vstack(row.to_numpy())))

df.apply(distance, axis=1).apply(min, axis=1)