Creating a pandas DataFrame with counts of categorical data - python

I have a bunch of survey data broken down by number of responses for each choice for each question (multiple-choice questions). I have one of these summaries for each of several different courses, semesters, sections, etc. Unfortunately, all of my data was given to me in PDF printouts and I cannot get the digital data. On the bright side, that means I have free rein to format my data file however I need to so that I can import it into Pandas.
How do I import my data into Pandas, preferably without needing to reproduce it line by line (one line for each response represented by my summary)?
The data
My survey comprises several multiple-choice questions. I have the number of respondents who chose each option for each question. Something like:
Course Number: 100
Semester: Spring
Section: 01
Question 1
----------
Option A: 27
Option B: 30
Option C: 0
Option D: 2
Question 2
----------
Option X: 20
Option Y: 10
So essentially I have the .value_counts() results I would get if my data were already in Pandas. Note that the questions do not always have the same number of options (categories), and they do not always have the same number of respondents. I will have similar results for multiple course numbers, semesters, and sections.
The categories A, B, C, etc. are just placeholders here to represent the labels for each response category in my actual data.
Also, I have to manually input all of this into something, so I am not worried about reading the specific file format above, it just represents what I have on the actual printouts in front of me.
The goal
I would like to recreate the response data in Pandas by telling Pandas how many of each response category I have for each question. Basically I want an Excel file or CSV that looks like the response data above, and a Pandas DataFrame that looks like:
Course Number Semester Section Q1 Q2
100 Spring 01 A X
100 Spring 01 A X
... (20 identical entries)
100 Spring 01 A Y
100 Spring 01 A Y
... (7 of these)
100 Spring 01 B Y
100 Spring 01 B Y
100 Spring 01 B Y
100 Spring 01 B N/A (ran out of Q2 responses)
...
100 Spring 01 D N/A
100 Spring 01 D N/A
I should note that I am not reproducing the actual response data here, because I have no way of knowing that someone who chose option D for question 1 didn't also choose option X for question 2. I just want the number of each result to show up the same, and for my df.value_counts() output to basically give me what my summary already says.
Attempts so far
So far the best I can come up with is actually reproducing each response as its own row in an excel file, and then importing this file and converting to categories:
import pandas as pd
df = pd.read_excel("filename")
df["Q1"] = df["Q1"].astype("category")
df["Q2"] = df["Q2"].astype("category")
There are a couple of problems with this. First, I have thousands of responses, so creating all of those rows is going to take way too long. I would much prefer the compact approach of just recording directly how many of each response I have and then importing that into Pandas.
Second, this becomes a bit awkward when I do not have the same number of responses for every question. At first, to save time on entering every response, I was only putting a value in a column when that value was different than the previous row, and then using .ffill() to forward-fill the values in the Pandas DataFrame. The issue with that is that all NaN values get filled, so I cannot have different numbers of responses for different questions.
I am not married to the idea of recording the data in Excel first, so if there is an easier way using something else I am all ears.
If there is some other way of looking at this problem that makes more sense than what I am attempting here, I am open to hearing about that as well.
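For reference, a minimal sketch of that compact approach (one row per question/response pair with a Count column; the column names and counts here are made up) would be to repeat each row Count times with Index.repeat:
import pandas as pd

# hypothetical compact table: one row per (question, response) with its count
counts = pd.DataFrame({
    "Course Number": [100, 100, 100, 100],
    "Semester": ["Spring"] * 4,
    "Section": ["01"] * 4,
    "Question": ["Q1", "Q1", "Q1", "Q2"],
    "Response": ["A", "B", "D", "X"],
    "Count": [27, 30, 2, 20],
})

# repeat each row 'Count' times to get one row per individual response
expanded = counts.loc[counts.index.repeat(counts["Count"])].drop(columns="Count")
expanded.groupby("Question")["Response"].value_counts()   # matches the summary counts
This keeps the hand-entered file small (one line per response option) while still letting value_counts reproduce the summary.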
Edit: kind of working
I switched gears a bit and made an Excel file where each sheet is a single survey summary, the first few columns identify the Course, Semester, Section, Year, etc., and then I have a column of possible Response categories. The rest of the file comprises a column for each question, and then the number of responses in each row corresponding to the responses that match that question. I then import each sheet and concatenate:
df = [pd.read_excel("filename", sheet_name=i, index_col=list(range(7))) for i in range(1, 3)]
df = pd.concat(df)
This seems to work, but I end up with a really ugly table (lots of NaN's for all of the responses that don't actually correspond to each question). I can kind of get around this for plotting the results for any one question with something like:
import numpy as np

df_grouped = df.groupby("Response", sort=False).aggregate(sum)  # group according to response
df_grouped["Q1"][np.isfinite(df_grouped["Q1"])].plot(kind="bar")  # only plot responses that have values
I feel like there must be a better way to do this, maybe with multiple indices or some kind of 3D data structure...
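One way to tame the NaN-heavy wide table (a sketch, assuming df is the concatenated frame described above, with the identifying columns plus Response in the index and one column of counts per question) is to stack the question columns into a long format:
long = (df.rename_axis(columns="Question")   # give the unnamed column axis a name
          .stack()                           # drops the NaN cells by default
          .rename("Count")
          .reset_index())

# plot the counts for a single question without any NaN handling
(long[long["Question"] == "Q1"]
    .groupby("Response", sort=False)["Count"].sum()
    .plot(kind="bar"))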

One hacky way to get the information out is to first split on the ---------- separators and then use regex.
For each course, do something like the following:
In [11]: s
Out[11]: 'Semester: Spring\nSection: 01\nQuestion 1\n----------\nOption A: 27\nOption B: 30\nOption C: 0\nOption D: 2\n\nQuestion 2\n----------\nOption A: 20\nOption B: 10'
In [12]: blocks = s.split("----------")
Parse out the information from the first block, use regex or just split:
In [13]: semester = re.match("Semester: (.*)", blocks[0]).groups()[0]
In [14]: semester
Out[14]: 'Spring'
To parse the option info from each block:
import re

def parse_block(lines):
    d = {}
    for line in lines:
        m = re.match(r"Option ([^:]+): (\d+)", line)
        if m:
            d[m.groups()[0]] = int(m.groups()[1])
    return d
In [21]: ds = [parse_block(x.splitlines()) for x in blocks[1:]]

In [22]: ds
Out[22]: [{'A': 27, 'B': 30, 'C': 0, 'D': 2}, {'A': 20, 'B': 10}]
You can similarly pull out the question number (if you don't know they're sequential):
In [23]: questions = [int(re.match(r".*Question (\d+)", x, re.DOTALL).groups()[0]) for x in blocks[:-1]]

In [24]: questions
Out[24]: [1, 2]
and zip these together:
In [31]: dict(zip(questions, ds))
Out[31]: {1: {'A': 27, 'B': 30, 'C': 0, 'D': 2}, 2: {'A': 20, 'B': 10}}
In [32]: pd.DataFrame(dict(zip(questions, ds)))
Out[32]:
1 2
A 27 20
B 30 10
C 0 NaN
D 2 NaN
I'd put these in another dict of (course, semester, section) -> DataFrame and then concat and work out where to go from the big MultiIndex DataFrame...
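A sketch of that last step, with made-up keys: passing a dict whose keys are (course, semester, section) tuples to pd.concat builds the MultiIndex for you.
import pandas as pd

# made-up per-survey frames keyed by (course, semester, section)
frames = {
    (100, "Spring", "01"): pd.DataFrame({1: {"A": 27, "B": 30, "C": 0, "D": 2},
                                          2: {"A": 20, "B": 10}}),
    (100, "Fall", "02"):   pd.DataFrame({1: {"A": 12, "B": 5},
                                          2: {"A": 7, "B": 9}}),
}

# the dict keys become the outer levels of a MultiIndex
big = pd.concat(frames, names=["Course", "Semester", "Section", "Response"])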

Related

How to get average of a column per block of data in python

Here is an example of the data I am dealing with:
This example is a shortened version of each Run. Here the runs are about 4 rows long; in a typical data set they are anywhere between 50 and 100 rows long, and there are 44 different runs.
My goal is to get the average of the last 4 rows of a given column in Stage 2. Right now I am achieving that, but it grabs the average based on these conditions for the whole spreadsheet. I want to be able to get these average values for each and every 'Run'.
df["Run"] = pd.DataFrame({
"Run": ["Run1.1", "Run1.2", "Run1.3", "Run2.1", "Run2.2", "Run2.3", "Run3.1", "Run3.2", "Run3.3", "Run4.1",
"Run4.2", "Run4.3", "Run5.1", "Run5.2", "Run5.3", "Run6.1", "Run6.2", "Run6.3", "Run7.1", "Run7.2",
"Run7.3", "Run8.1", "Run8.2", "Run8.3", "Run9.1", "Run9.2", "Run9.3", "Run10.1", "Run10.2", "Run10.3",
"Run11.1", "Run11.2", "Run11.3"],
})
av = df.loc[df['Stage'].eq(2),'Vout'].groupby("Run").tail(4).mean()
print(av)
I want to be able to get these averages, for a given column and only the rows in Stage 2, for each and every 'Run'. As you can see, before each data set there is a corresponding 'Run', e.g. the second data set has 'Run1.2' before it.
Also, in each file I am dealing with, the number of rows per Run is different/not always the same.
So it is important to note that this is not achievable with np.repeat, as with each new sheet of data the runs can be any length, not just the same as in the example above.
Expected output:
Run1.1 1841 (example value)
Run1.2 1703 (example value)
Run1.3 1390 (example value)
... so on
etc
Any help would be greatly appreciated.
What does your pandas df look like after you import the CSV?
I would say you can just groupby on the run column like such:
import pandas as pd
df = pd.DataFrame({
"run": ["run1.1", "run1.2", "run1.1", "run1.2"],
"data": [1, 2, 3, 4],
})
df.groupby("run").agg({"data": ["sum"]}).head()
Out[4]:
         data
          sum
run
run1.1      4
run1.2      6
This will do the trick:
av = df.loc[df["Stage"].eq(2)]
av = av.groupby("Run").tail(4).groupby("Run")["Vout"].mean()
Now df.groupby("a").tail(n) will return a dataframe with only the last n rows for each value of a. The second groupby will then just aggregate these and return the average per group.
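As a quick sanity check, here is a tiny self-contained example of that pattern with made-up numbers (column names as in the question):
import pandas as pd

df = pd.DataFrame({
    "Run":   ["Run1.1"] * 6 + ["Run1.2"] * 6,
    "Stage": [1, 1, 2, 2, 2, 2] * 2,
    "Vout":  [5, 6, 10, 20, 30, 40, 7, 8, 100, 200, 300, 400],
})

av = df.loc[df["Stage"].eq(2)]
av = av.groupby("Run").tail(4).groupby("Run")["Vout"].mean()
print(av)
# Run1.1     25.0
# Run1.2    250.0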

How to filter and drop rows based on groups when condition is specific

So I struggled to even come up with a title for this question. Not sure I can edit the question title, but I would be happy to do so once there is clarity.
I have a data set from an experiment where each row is a point in time for a specific group. [Edited based on better approach to generate data by Daniela Vera below]
df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})
df.head(10)
x1 time grp
0 0.533131 1 c
1 1.486672 2 c
2 1.560158 3 c
3 -1.076457 4 a
4 -1.835047 5 a
5 -0.374595 6 b
6 -1.301875 7 b
7 -0.533907 8 c
8 0.052951 9 c
9 -0.257982 10 c
10 -0.442044 1 c
In the dataset, some groups only start to have values after time 5 (in this case, group b). However, in the dataset I am working with there are up to 5,000 groups rather than just the 3 groups in this example.
I would like to identify every group that only has values after time 5, and drop those groups from the overall dataframe.
I have come up with a solution that works, but I feel like it is very clunky, and wondered if there was something cleaner.
# First I split the data into before and after the time of interest
after = df[df['time'] > 5].copy()
before = df[df['time'] < 5].copy()
#Then I merge the two dataframes and use indicator to find out which ones only appear after time 5.
missing = pd.merge(after,before, on='grp', how='outer', indicator = True)
#Then I use groupby and nunique to identify the groups that only appear after time 5 and save them as an array
something = missing[missing['_merge'] == 'left_only'].groupby('grp').nunique()
#I extract the list of group ids from the array
something = something.index
# I go back to my main dataframe and make group id the index
df = df.set_index('grp')
#I then apply .drop on the array of group ids
df = df.drop(something)
df = df.reset_index()
Like I said, super clunky. But I just couldn't figure out an alternative. Please let me know if anything isn't clear and I'll happily edit with more details.
I am not sure if I get it, but let's say you have this data:
df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})
In this case, group "b" just has data for times 6 and 7, which are above time 5. You can use this process to get a dictionary with the times at which each group has at least one data point, and also a list called "keep" with the groups that have at least one data point before time 5.
list_groups = ["a", "b", "c"]
times_per_group = {}
keep = []

for group in list_groups:
    times_per_group[group] = list(df[df.grp == group].time.unique())
    condition = any([i < 5 for i in list(df[df.grp == group].time.unique())])
    if condition:
        keep.append(group)
Finally, you just keep the groups present in the list "keep":
df = df[df.grp.isin(keep)]
Let me know if I understood your question!
Of course you can just simplify the process; the dictionary is just there to check, and you don't actually need the whole code.
If this result is what you're looking for, you can just do:
keep = [group for group in list_groups if any([i<5 for i in list(df[df.grp == group].time.unique())])]
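A more vectorized alternative (a sketch using the same df) is to compute each group's earliest time with transform and keep only the groups that start before time 5:
# keep rows whose group has at least one observation strictly before time 5,
# matching the i < 5 condition above (use <= 5 if time 5 itself should count)
df = df[df.groupby("grp")["time"].transform("min") < 5]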

Anonymize specific columns with pii in pandas dataframe python

I have loaded an S3 bucket with JSON files and parsed/flattened it into a pandas dataframe. Now I have a dataframe with 175 columns, 4 of which contain personally identifiable information.
I am looking for a quick solution for anonymising those columns (name & address). The anonymisation needs to be consistent, so that names or addresses of the same person occurring multiple times get the same hash.
Is there existing functionality in pandas or some other package I can utilize for this?
Using a Categorical would be an efficient way to do this - the main caveat is that the numbering will be based solely on the ordering in the data, so some care will be needed if this numbering scheme needs to be used across multiple columns / datasets.
df = pd.DataFrame({'ssn': [1, 2, 3, 999, 10, 1]})
df['ssn_anon'] = df['ssn'].astype('category').cat.codes
df
Out[38]:
   ssn  ssn_anon
0    1         0
1    2         1
2    3         2
3  999         4
4   10         3
5    1         0
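If the same numbering has to hold across several columns or files (the caveat mentioned above), one option, sketched here with made-up values, is to fix the category list up front:
import pandas as pd

# assumed master list of every value you expect to see
all_ssns = [1, 2, 3, 10, 999]

df1 = pd.DataFrame({'ssn': [1, 2, 999]})
df2 = pd.DataFrame({'ssn': [999, 10, 1]})

for d in (df1, df2):
    # same fixed categories -> the same value gets the same code in both frames
    d['ssn_anon'] = pd.Categorical(d['ssn'], categories=all_ssns).codes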
You can use ngroup or factorize from pandas:
df.groupby('ssn').ngroup()
Out[25]:
0 0
1 1
2 2
3 4
4 3
5 0
dtype: int64
pd.factorize(df.ssn)[0]
Out[26]: array([0, 1, 2, 3, 4, 0], dtype=int64)
If you are doing ML with sklearn, I would recommend this approach:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.ssn).transform(df.ssn)
Out[31]: array([0, 1, 2, 4, 3, 0], dtype=int64)
You seem to be looking for a way to encrypt the strings in your dataframe. There are a bunch of Python encryption libraries, such as cryptography.
How to use it is pretty simple, just apply it to each element.
import pandas as pd
from cryptography.fernet import Fernet

df = pd.DataFrame([{'a': 'a', 'b': 'b'}, {'a': 'a', 'b': 'c'}])

# Fernet needs a 32-byte url-safe base64-encoded key, not an arbitrary password
key = Fernet.generate_key()
f = Fernet(key)

res = df.applymap(lambda x: f.encrypt(bytes(x, 'utf-8')))
# Decrypt
res.applymap(lambda x: f.decrypt(x))
That is probably the best way in terms of security but it would generate a long byte/string and be hard to look at.
# 'a' -> b'gAAAAABaRQZYMjB7wh-_kD-VmFKn2zXajMRUWSAeridW3GJrwyebcDSpqyFGJsCEcRcf68ylQMC83G7dyqoHKUHtjskEtne8Fw=='
Another simple way to solve your problem is to create a function that maps a key to a value and creates a new value whenever a new key appears.
mapper = {}

def encode(x):
    if x not in mapper:
        # This part can be changed to anything really,
        # such as mapper[x] = randint(-10**10, 10**10)
        # Just ensure it would not repeat
        mapper[x] = len(mapper) + 1
    return mapper[x]

res = df.applymap(encode)
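If you literally want a hash, so the same name or address always maps to the same opaque value without keeping a mapping dict around, here is a sketch using the standard-library hashlib (the column names are assumptions):
import hashlib
import pandas as pd

df = pd.DataFrame({'name':    ['Alice', 'Bob', 'Alice'],
                   'address': ['1 Main St', '2 Side St', '1 Main St']})

def sha256_hash(value):
    # identical inputs give identical digests, so repeated occurrences
    # of the same person stay linkable after anonymisation
    return hashlib.sha256(value.encode('utf-8')).hexdigest()

for col in ['name', 'address']:   # the PII columns (assumed names)
    df[col] = df[col].map(sha256_hash)
Note that a bare hash of guessable values can be brute-forced, so consider mixing a secret salt into the hashed string if that matters for your use case.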
Sounds a bit like you want to be able to reverse the process by maintaining a key somewhere. If your use case allows I would suggest replacing all the values with valid, human readable and irreversible placeholders.
John > Mark
21 Hammersmith Grove rd > 48 Brewer Street
This is good for generating usable test data for remote devs etc. You can use Faker to generate replacement values yourself. If you want to maintain some utility in your data, i.e. "replace all addresses with alternate addresses within 2 miles", you could use an API I'm working on called Anon AI. We parse JSON from S3 buckets, find all the PII automatically (including in free text fields) and replace it with placeholders given your spec. We can keep consistency and reversibility if required, and it will be most useful if you want to keep a "live" anonymous version of a growing data set. We're in beta at the moment, so let me know if you would be interested in testing it out.

Pandas merge giving wrong output

Ok
I have gone through some blogs related to this topic, but I am still getting the same problem. I have two dataframes. Both have a column X which holds SHA2 values as hex strings.
Example (Dataframe lookup)
X,Y
000000000E000394574D69637264736F66742057696E646F7773204861726477,7
0000000080000000000000090099000000040005000000000000008F2A000010,7
000000020000000000000000777700010000000000020000000040C002004600,24
0000005BC614437F6BE049237FA1DDD2083B5BA43A10175E4377A59839DC2B64,7
Example (Dataframe source)
X,Z
000000000E000394574D69637264736F66742057696E646F7773204861726477,'blah'
0000000080000000000000090099000000040005000000000000008F2A000010,'blah blah'
000000020000000000000000777700010000000000020000000040C002004600,'dummy'
etc.
So now I am doing
lookup['X'] = lookup['X'].astype(str)
source['X'] = source['X'].astype(str)
source['newcolumn'] = source.merge(lookup, on='X', how='inner')['Y']
The source has 160,000 rows and the lookup has around 500,000 rows.
Now, when the operation finishes, I get newcolumn but the values are wrong.
I have made sure that they are not being picked up from duplicate values of X, because there are no duplicate X in either table.
So, this is really making me feel dumb, and it gave me quite a pain in my live systems. Can anyone suggest what the problem is?
I have now replaced the call with
def getReputation(lookupDF, value, lookupcolumn, default):
    lookupRows = lookupDF.loc[lookupDF['X'] == value]
    if lookupRows.shape[0] > 0:
        return lookupRows[lookupcolumn].values[0]
    else:
        return default

source['newcolumn'] = source.apply(lambda x: getReputation(lookup, x['X'], 'Y', -1), axis=1)
This code works - but obviously it is BAD code and takes a horribly long time. I can multiprocess it - but the question remains: WHY is the merge failing?
Thanks for your help
Rgds
I'd use the map() method in this case.
First set 'X' as the index in the lookup DF:
In [58]: lookup.set_index('X', inplace=True)
In [59]: lookup
Out[59]:
Y
X
000000000E000394574D69637264736F66742057696E646F7773204861726477 7
0000000080000000000000090099000000040005000000000000008F2A000010 7
000000020000000000000000777700010000000000020000000040C002004600 24
0000005BC614437F6BE049237FA1DDD2083B5BA43A10175E4377A59839DC2B64 7
In [60]: df['Y'] = df.X.map(lookup.Y)
In [61]: df
Out[61]:
X Z Y
0 000000000E000394574D69637264736F66742057696E646F7773204861726477 blah 7
1 0000000080000000000000090099000000040005000000000000008F2A000010 blah blah 7
2 000000020000000000000000777700010000000000020000000040C002004600 dummy 24
Actually your code is working properly for your sample DFs:
In [68]: df.merge(lookup, on='X', how='inner')
Out[68]:
X Z Y
0 000000000E000394574D69637264736F66742057696E646F7773204861726477 blah 7
1 0000000080000000000000090099000000040005000000000000008F2A000010 blah blah 7
2 000000020000000000000000777700010000000000020000000040C002004600 dummy 24
So check whether you have the same data and dtypes in the X column in both DFs.
You probably have duplicate values in column X in the lookup data frame, or the indexes are misaligned; the snippet below will produce the right results:
output = source.merge(lookup, on='X', how='inner')
If you want to add the result as a new column instead, either the right df should not have any duplicates, or the indexes need to be adjusted accordingly. If you are sure there are no duplicate values, compare the indexes from the above snippet and your snippet for a better understanding, and try resetting the indexes before merging.
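For what it's worth, here is a small sketch with toy stand-in frames showing how the inner-merge assignment goes wrong and how a left merge avoids it:
import pandas as pd

source = pd.DataFrame({'X': ['aa', 'bb', 'cc', 'dd'], 'Z': [1, 2, 3, 4]})
lookup = pd.DataFrame({'X': ['bb', 'cc', 'dd'], 'Y': [20, 30, 40]})

# The inner merge drops 'aa', so the result has 3 rows with a fresh 0..2 index.
# Assigning its 'Y' back onto the 4-row source aligns on that new index and
# shifts every value by one row - the "wrong output" from the question.
source['bad'] = source.merge(lookup, on='X', how='inner')['Y']

# A left merge keeps one row per source row (given X is unique in lookup),
# and .values sidesteps index alignment entirely.
source['good'] = source.merge(lookup, on='X', how='left')['Y'].values

print(source)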

Conditional Sum/Average/etc... CSV file in Python

First off, I've found similar articles, but I haven't been able to figure out how to translate the answers from those questions to my own problem. Secondly, I'm new to python, so I apologize for being a noob.
Here's my question: I want to perform conditional calculations (average/proportion/etc.) on values within a text file.
More concretely, I have a file that looks a little something like the one below.
0 Diamond Correct
0 Cross Incorrect
1 Diamond Correct
1 Cross Correct
Thus far, I am able to read in the file and collect all of the rows.
import pandas as pd

fileLocation = r'C:/Users/Me/Desktop/LogFiles/SubjectData.txt'
df = pd.read_csv(fileLocation, header=None, sep='\t', index_col=False,
                 names=["Session Number", "Image", "Outcome"])
I'm looking to query the file such that I can ask questions like:
--What is the proportion of "Correct" values in the 'Outcome' column when the first column ('Session Number') is 0? So this would be 0.5, because there is one "Correct" and one "Incorrect".
I have other calculations I'd like to perform, but I should be able to figure out where to go once I know how to do this, hopefully simple, command.
Thanks!
you can also do it this way:
In [467]: df.groupby('Session#')['Outcome'].apply(lambda x: (x == 'Correct').sum()/len(x))
Out[467]:
Session#
0 0.5
1 1.0
Name: Outcome, dtype: float64
It'll group your DF by Session# and calculate the ratio of Correct outcomes for each group (Session#).
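An equivalent, slightly more direct spelling (assuming the column names from the question's read_csv call): comparing the Outcome column to 'Correct' gives booleans, and the mean of booleans is exactly the proportion of True values per group.
proportions = df['Outcome'].eq('Correct').groupby(df['Session Number']).mean()
print(proportions.loc[0])   # proportion of Correct outcomes in session 0 -> 0.5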
Or, filtering step by step:
# getting the rows where 'Session Number' is 0
session_zero = df[df['Session Number'] == 0]
total = len(session_zero)
# getting the number of those rows that have 'Correct' for 'Outcome'
correct_and_session_zero = len(session_zero[session_zero['Outcome'] == 'Correct'])
# if you're using python 2 you might need to convert correct_and_session_zero or total
# to float so you won't lose precision
print(correct_and_session_zero / total)
