I have loaded an S3 bucket with JSON files and parsed/flattened them into a pandas dataframe. Now I have a dataframe with 175 columns, 4 of which contain personally identifiable information.
I am looking for a quick solution for anonymising those columns (name & address). The mapping needs to be consistent, so that names or addresses of the same person occurring multiple times get the same hash.
Is there existing functionality in pandas or some other package I can utilize for this?
Using a Categorical would be an efficient way to do this - the main caveat is that the numbering will be based solely on the ordering in the data, so some care will be needed if this numbering scheme needs to be used across multiple columns / datasets.
df = pd.DataFrame({'ssn': [1, 2, 3, 999, 10, 1]})
df['ssn_anon'] = df['ssn'].astype('category').cat.codes
df
Out[38]:
ssn ssn_anon
0 1 0
1 2 1
2 3 2
3 999 4
4 10 3
5 1 0
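If the same numbering has to stay consistent across several columns or files (the caveat above), one option is to fix the category list up front and reuse it everywhere. A minimal sketch, not part of the original answer; known_ssns and other_df are hypothetical names:
import pandas as pd

# Shared, agreed-upon list of values; anything not in it gets code -1
known_ssns = [1, 2, 3, 10, 999]

df['ssn_anon'] = pd.Categorical(df['ssn'], categories=known_ssns).codes
other_df['ssn_anon'] = pd.Categorical(other_df['ssn'], categories=known_ssns).codes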
You can use ngroup or factorize from pandas:
df.groupby('ssn').ngroup()
Out[25]:
0 0
1 1
2 2
3 4
4 3
5 0
dtype: int64
pd.factorize(df.ssn)[0]
Out[26]: array([0, 1, 2, 3, 4, 0], dtype=int64)
In sklearn, if you are doing ML, I would recommend this approach:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.ssn).transform(df.ssn)
Out[31]: array([0, 1, 2, 4, 3, 0], dtype=int64)
You seem to be looking for a way to encrypt the strings in your dataframe. There are a bunch of Python encryption libraries, such as cryptography. Using it is pretty simple: just apply it to each element.
import pandas as pd
from cryptography.fernet import Fernet

df = pd.DataFrame([{'a': 'a', 'b': 'b'}, {'a': 'a', 'b': 'c'}])
key = Fernet.generate_key()  # Fernet needs a 32-byte url-safe base64 key, not an arbitrary password
f = Fernet(key)
res = df.applymap(lambda x: f.encrypt(bytes(x, 'utf-8')))
# Decrypt
res.applymap(lambda x: f.decrypt(x))
That is probably the best way in terms of security but it would generate a long byte/string and be hard to look at.
# 'a' -> b'gAAAAABaRQZYMjB7wh-_kD-VmFKn2zXajMRUWSAeridW3GJrwyebcDSpqyFGJsCEcRcf68ylQMC83G7dyqoHKUHtjskEtne8Fw=='
Another simple way to solve your problem is to create a function that maps a key to a value and creates a new value when a new key appears.
mapper = {}

def encode(x):
    if x not in mapper:
        # This part can be changed to anything really,
        # such as mapper[x] = randint(-10**10, 10**10);
        # just ensure the values do not repeat
        mapper[x] = len(mapper) + 1
    return mapper[x]

res = df.applymap(encode)
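If you specifically want hashes rather than small integers, a keyed hash from the standard library gives the same consistency (identical input, identical output) without keeping a mapping in memory. A minimal sketch, not from the answers above; the secret key and the 'name' column are placeholders:
import hashlib
import hmac

SECRET_KEY = b'replace-with-a-real-secret'  # placeholder; keep it out of the dataset

def pseudonymize(value):
    # Same input always yields the same digest, so repeated names/addresses stay linkable
    return hmac.new(SECRET_KEY, str(value).encode('utf-8'), hashlib.sha256).hexdigest()

df['name_anon'] = df['name'].map(pseudonymize)  # 'name' is a hypothetical PII column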
Sounds a bit like you want to be able to reverse the process by maintaining a key somewhere. If your use case allows I would suggest replacing all the values with valid, human readable and irreversible placeholders.
John > Mark
21 Hammersmith Grove rd > 48 Brewer Street
This is good for generating usable test data for remote devs etc. You can use Faker to generate replacement values yourself. If you want to maintain some utility in your data, i.e. "replace all addresses with alternate addresses within 2 miles", you could use an API I'm working on called Anon AI. We parse JSON from S3 buckets, find all the PII automatically (including in free text fields) and replace it with placeholders according to your spec. We can keep consistency and reversibility if required, and it will be most useful if you want to keep a "live" anonymous version of a growing data set. We're in beta at the moment, so let me know if you would be interested in testing it out.
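For the do-it-yourself route with Faker, a minimal sketch of consistent replacement (the 'name' column and the replacements cache are illustrative, not part of Anon AI):
from faker import Faker

fake = Faker()
replacements = {}  # remembers the fake value chosen for each real value

def replace_name(real_name):
    if real_name not in replacements:
        replacements[real_name] = fake.name()
    return replacements[real_name]

df['name'] = df['name'].map(replace_name)  # 'name' is a hypothetical PII column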
Related
I have a large pandas dataframe with many different types of observations that need different models applied to them. One column is which model to apply, and that can be mapped to a python function which accepts a dataframe and returns a dataframe. One approach would be just doing 3 steps:
split dataframe into n dataframes for n different models
run each dataframe through each function
concatenate output dataframes at the end
This just ends up not being super flexible, particularly as models are added and removed. Looking at groupby, it seems like I should be able to leverage that to make this look much cleaner code-wise, but I haven't been able to find a pattern that does what I'd like.
Also because of the size of this data, using apply isn't particularly useful as it would drastically slow down the runtime.
Quick example:
df = pd.DataFrame({"model":["a","b","a"],"a":[1,5,8],"b":[1,4,6]})
def model_a(df):
return df["a"] + df["b"]
def model_b(df):
return df["a"] - df["b"]
model_map = {"a":model_a,"b":model_b}
results = df.groupby("model")...
The expected result would look like [2,1,14]. Is there an easy way code-wise to do this? Note that the actual models are much more complicated and involve potentially hundreds of variables with lots of transformations, this is just a toy example.
Thanks!
You can use groupby/apply:
x.name contains the name of the group, here a and b
x contains the sub dataframe
df['r'] = df.groupby('model') \
.apply(lambda x: model_map[x.name](x)) \
.droplevel(level='model')
>>> df
model a b r
0 a 1 1 2
1 b 5 4 1
2 a 8 6 14
Or you can use np.select:
>>> np.select([df['model'] == 'a', df['model'] == 'b'],
[model_a(df), model_b(df)])
array([ 2, 1, 14])
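If models are added and removed often, the two lists for np.select can also be built from model_map instead of being written out by hand. A small sketch, not part of the answer above, reusing df and model_map from the question (dict insertion order keeps the two lists aligned on Python 3.7+):
import numpy as np

conditions = [df['model'] == name for name in model_map]
choices = [func(df) for func in model_map.values()]
df['r'] = np.select(conditions, choices)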
I have a dataset which I transformed to CSV as potential input for a Keras autoencoder.
The loading of the CSV works flawlessly with pandas.read_csv(), but the data types are not correct.
The CSV contains only two columns: label and features, where the label column contains strings and the features column contains arrays of signed integers ([-1, 1]). So in general a pretty simple structure.
To get two different dataframes for further processing I created them via:
labels = pd.DataFrame(columns=['label'], data=csv_data, dtype='U')
and
features = pd.DataFrame(columns=['features'], data=csv_data)
In both cases I get the wrong datatypes, as both are marked as object-typed dataframes. What am I doing wrong?
For the features it is even harder, because the parsing returns a pandas Series that contains the array as a string: ['[1, ..., 1]'].
So I tried a tedious workaround: parsing the string back to a numpy array via .to_numpy(), a Python cast for every element and then an np.asarray() - but the type of the dataframe is still incorrect. I doubt this is the general approach to this task. As I am fairly new to pandas I checked some tutorials and the API, but in most cases a cell in a dataframe contains a single value rather than a complete array. Maybe my overall design of the dataframe is just not suitable for this task.
Any help appreciated!
You are reading the file as strings, but your column holds a Python list literal, so you need to evaluate it to get the list back.
I am not sure of the use case, but you can also split the labels into separate columns for a more readable dataframe:
import ast
import pandas as pd

features = ["featurea", "featureb", "featurec", "featured", "featuree"]
labels = ["[1,0,1,1,1,1]", "[1,0,1,1,1,1]", "[1,0,1,1,1,1]", "[1,0,1,1,1,1]", "[1,0,1,1,1,1]"]

df = pd.DataFrame(list(zip(features, labels)),
                  columns=['Features', 'Labels'])

# convert the strings to lists
df['Labels'] = df['Labels'].map(ast.literal_eval)
df.index = df['Features']

# since the list itself might not be useful, split and expand it into multiple columns
new_df = pd.DataFrame(df['Labels'].values.tolist(), index=df.index)
Output
0 1 2 3 4 5
Features
featurea 1 0 1 1 1 1
featureb 1 0 1 1 1 1
featurec 1 0 1 1 1 1
featured 1 0 1 1 1 1
featuree 1 0 1 1 1 1
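The conversion can also happen while reading the file, so the column never exists as strings in the first place. A minimal sketch; the file name and column names are placeholders for the asker's CSV:
import ast
import pandas as pd

df = pd.read_csv('data.csv', converters={'features': ast.literal_eval})
labels = df['label']
features = pd.DataFrame(df['features'].tolist())  # one column per array entry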
The input CSV was formatted incorrectly, therefore the parsing was accurate but not what I intended. I expanded the real columns and skipped the header to have a column for every array entry - now pandas recognizes the types and the correct dimensions.
I'm a bit lost with the use of feature hashing in Python pandas.
I have a DataFrame with multiple columns containing a lot of information of different types. There is one column that represents a class for the data.
Example:
col1 col2 colType
1 1 2 'A'
2 1 1 'B'
3 2 4 'C'
My goal is to apply FeatureHashing for the ColType, in order to be able to apply a Machine Learning Algorithm.
I have created a separate DataFrame for the colType, having something like this:
colType value
1 'A' 1
2 'B' 2
3 'C' 3
4 'D' 4
Then I applied feature hashing to this class DataFrame. But I don't understand how to add the result of the feature hashing back to my DataFrame with the info, in order to use it as an input to a machine learning algorithm.
This is how I use FeatureHashing:
from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(n_features=10, input_type='string')
result = fh.fit_transform(categoriesDF)
How do I insert this FeatureHasher result, to my DataFrame? How bad is my approach? Is there any better way to achieve what I am doing?
Thanks!
I know this answer comes in late, but I stumbled upon the same problem and found this works:
fh = FeatureHasher(n_features=8, input_type='string')
sp = fh.fit_transform(df['colType'])
df = pd.DataFrame(sp.toarray(), columns=['fh1', 'fh2', 'fh3', 'fh4', 'fh5', 'fh6', 'fh7', 'fh8'])
pd.concat([df1, df], axis=1)
This creates a dataframe out of the sparse matrix returned by the FeatureHasher and concatenates it to the existing dataframe (df1 here).
I have switched to one-hot encoding, using something like this:
categoriesDF = pd.get_dummies(categoriesDF)
This function will create a column for every distinct category value, filled with 1 or 0.
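Applied to the original frame from the question, that could look something like this (a sketch; the column names follow the example above):
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2], 'col2': [2, 1, 4], 'colType': ['A', 'B', 'C']})

# One 0/1 column per distinct colType value, kept alongside the numeric columns
encoded = pd.get_dummies(df, columns=['colType'])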
I have a bunch of survey data broken down by the number of responses for each choice for each question (multiple-choice questions). I have one of these summaries for each of several different courses, semesters, sections, etc. Unfortunately, all of my data was given to me in PDF printouts and I cannot get the digital data. On the bright side, that means I have free rein to format my data file however I need to so that I can import it into Pandas.
How do I import my data into Pandas, preferably without needing to reproduce it line-by-line (one line for each entry represented by my summary)?
The data
My survey comprises several multiple-choice questions. I have the number of respondents who chose each option for each question. Something like:
Course Number: 100
Semester: Spring
Section: 01
Question 1
----------
Option A: 27
Option B: 30
Option C: 0
Option D: 2
Question 2
----------
Option X: 20
Option Y: 10
So essentially I have the .value_counts() results I would get if my data were already in Pandas. Note that the questions do not always have the same number of options (categories), and they do not always have the same number of respondents. I will have similar results for multiple course numbers, semesters, and sections.
The categories A, B, C, etc. are just placeholders here to represent the labels for each response category in my actual data.
Also, I have to manually input all of this into something, so I am not worried about reading the specific file format above, it just represents what I have on the actual printouts in front of me.
The goal
I would like to recreate the response data in Pandas by telling Pandas how many of each response category I have for each question. Basically I want an Excel file or CSV that looks like the response data above, and a Pandas DataFrame that looks like:
Course Number Semester Section Q1 Q2
100 Spring 01 A X
100 Spring 01 A X
... (20 identical entries)
100 Spring 01 A Y
100 Spring 01 A Y
... (7 of these)
100 Spring 01 B Y
100 Spring 01 B Y
100 Spring 01 B Y
100 Spring 01 B N/A (out of Q2 responses)
...
100 Spring 01 D N/A
100 Spring 01 D N/A
I should note that I am not reproducing the actual response data here, because I have no way of knowing that someone who chose option D for question 1 didn't also choose option X for question 2. I just want the number of each result to show up the same, and for my df.value_counts() output to basically give me what my summary already says.
Attempts so far
So far the best I can come up with is actually reproducing each response as its own row in an excel file, and then importing this file and converting to categories:
import pandas as pd
df = pd.read_excel("filename")
df["Q1"] = df["Q1"].astype("category")
df["Q2"] = df["Q2"].astype("category")
There are a couple of problems with this. First, I have thousands of responses, so creating all of those rows is going to take way too long. I would much prefer the compact approach of just recording directly how many of each response I have and then importing that into Pandas.
Second, this becomes a bit awkward when I do not have the same number of responses for every question. At first, to save time on entering every response, I was only putting a value in a column when that value was different than the previous row, and then using .ffill() to forward-fill the values in the Pandas DataFrame. The issue with that is that all NaN values get filled, so I cannot have different numbers of responses for different questions.
I am not married to the idea of recording the data in Excel first, so if there is an easier way using something else I am all ears.
If there is some other way of looking at this problem that makes more sense than what I am attempting here, I am open to hearing about that as well.
Edit: kind of working
I switched gears a bit and made an Excel file where each sheet is a single survey summary, the first few columns identify the Course, Semester, Section, Year, etc., and then I have a column of possible Response categories. The rest of the file comprises a column for each question, and then the number of responses in each row corresponding to the responses that match that question. I then import each sheet and concatenate:
df = [pd.read_excel("filename", sheet_name=i, index_col=list(range(0, 7))) for i in range(1, 3)]
df = pd.concat(df)
This seems to work, but I end up with a really ugly table (lots of NaN's for all of the responses that don't actually correspond to each question). I can kind of get around this for plotting the results for any one question with something like:
df_grouped = df.groupby("Response", sort=False).aggregate(sum) # group according to response
df_grouped["Q1"][np.isfinite(df_grouped["Q1"])].plot(kind="bar") # only plot responses that have values
I feel like there must be a better way to do this, maybe with multiple indices or some kind of 3D data structure...
One hacky way to get the information out is to first split on the ---------- separator and then use regex.
For each course, do something like the following:
In [11]: s
Out[11]: 'Semester: Spring\nSection: 01\nQuestion 1\n----------\nOption A: 27\nOption B: 30\nOption C: 0\nOption D: 2\n\nQuestion 2\n----------\nOption A: 20\nOption B: 10'
In [12]: blocks = s.split("----------")
Parse out the information from the first block, use regex or just split:
In [13]: semester = re.match("Semester: (.*)", blocks[0]).groups()[0]
In [14]: semester
Out[14]: 'Spring'
To parse the option info from each block:
import re

def parse_block(lines):
    d = {}
    for line in lines:
        m = re.match(r"Option ([^:]+): (\d+)", line)
        if m:
            d[m.groups()[0]] = int(m.groups()[1])
    return d
In [21]: [parse_block(x.splitlines()) for x in blocks[1:]]
Out[21]: [{'A': 27, 'B': 30, 'C': 0, 'D': 2}, {'A': 20, 'B': 10}]
You can similarly pull out the question number (if you don't know they're sequential):
In [22]: questions = [int(re.match(r".*Question (\d+)", x, re.DOTALL).groups()[0]) for x in blocks[:-1]]
In [23]: questions
Out[23]: [1, 2]
and zip these together (where ds is the list of parsed blocks from [21]):
In [31]: dict(zip(questions, ds))
Out[31]: {1: {'A': 27, 'B': 30, 'C': 0, 'D': 2}, 2: {'A': 20, 'B': 10}}
In [32]: pd.DataFrame(dict(zip(questions, ds)))
Out[32]:
1 2
A 27 20
B 30 10
C 0 NaN
D 2 NaN
I'd put these in another dict of (course, semester, section) -> DataFrame and then concat and work out where to go from the big MultiIndex DataFrame...
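A minimal sketch of that last concat step; the per-survey frame names are hypothetical, each being one DataFrame built as above:
import pandas as pd

frames = {
    (100, 'Spring', '01'): df_100_spring_01,
    (100, 'Fall', '02'): df_100_fall_02,
}

# The tuple keys become the outer levels of the resulting MultiIndex
big = pd.concat(frames, names=['course', 'semester', 'section'])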
I'm trying to do some conditional parsing of excel files into Pandas dataframes. I have a group of excel files and each has some number of lines at the top of the file that are not part of the data -- some identification data based on what report parameters were used to create the report.
I want to use the ExcelFile.parse() method with skiprows=some_number but I don't know what some_number will be for each file.
I do know that the HeaderRow will start with one member of a list of possibilities. How can I tell Pandas to create the dataframe starting on the row that includes any some_string in my list of possibilities?
Or, is there a way to import the entire sheet and then remove the rows preceding the row that includes any some_string in my list of possibilities?
Most of the time I would just post-process this in pandas, i.e. diagnose, remove the rows, and correct the dtypes there. This has the benefit of being easier, though it's arguably less elegant (I suspect it'll also be faster doing it this way!):
In [11]: df = pd.DataFrame([['blah', 1, 2], ['some_string', 3, 4], ['foo', 5, 6]])
In [12]: df
Out[12]:
0 1 2
0 blah 1 2
1 some_string 3 4
2 foo 5 6
In [13]: df[0].isin(['some_string']).argmax() # assuming it's found
Out[13]: 1
I may actually write this in plain Python, as there's probably little/no benefit in vectorizing (and I find this more readable):
def to_skip(df, preceding):
    for i, s in enumerate(df[0]):
        if s in preceding:
            return i
    raise ValueError("No preceding string found in first column")
In [21]: preceding = ['some_string']
In [22]: to_skip(df, preceding)
Out[22]: 1
In [23]: df.iloc[1:] # or whatever you need to do
Out[23]:
0 1 2
1 some_string 3 4
2 foo 5 6
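Putting the pieces together, a rough sketch of the whole post-processing route (assuming, as in the question, that the matched row is the real header row; the file name is a placeholder):
import pandas as pd

preceding = ['some_string']                       # possible first cells of the header row
raw = pd.read_excel('report.xlsx', header=None)   # hypothetical file, read without a header

idx = raw[0].isin(preceding).to_numpy().argmax()  # position of the first matching row
data = raw.iloc[idx + 1:].copy()
data.columns = raw.iloc[idx]                      # promote the matched row to the header
data = data.infer_objects()                       # soft-convert leftover object dtypes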
The other possibility, messing about with ExcelFile to find the row number, could be done with a for-loop as above but in openpyxl or similar. However, I don't think there would be a way to read the Excel file (xml) just once if you do this.
This is somewhat unfortunate compared to how you could do this with a csv, where you can read the first few lines (until you see the row/entry you want) and then pass the opened file to read_csv. (If you can export your Excel spreadsheet to csv and then parse it in pandas, that would be faster/cleaner...)
Note: read_excel isn't really that fast anyways (esp. compared to read_csv)... so IMO you want to get to pandas asap.