First off, I've found similar articles, but I haven't been able to figure out how to translate the answers from those questions to my own problem. Secondly, I'm new to python, so I apologize for being a noob.
Here's my question: I want to perform conditional calculations (averages, proportions, etc.) on values within a text file.
More concretely, I have a file that looks a little something like this:
0 Diamond Correct
0 Cross Incorrect
1 Diamond Correct
1 Cross Correct
Thus far, I am able to read in the file and collect all of the rows.
import pandas as pd
fileLocation = r'C:/Users/Me/Desktop/LogFiles/SubjectData.txt'
df = pd.read_csv(fileLocation, header=None, sep='\t', index_col=False,
                 names=["Session Number", "Image", "Outcome"])
I'm looking to query the file such that I can ask questions like:
--What is the proportion of "Correct" values in the 'Outcome' column when the first column ('Session Number') is 0? So this would be 0.5, because there is one "Correct" and one "Incorrect".
I have other calculations I'd like to perform, but I should be able to figure out where to go once I know how to do this, hopefully simple, command.
Thanks!
You can also do it this way:
In [467]: df.groupby('Session Number')['Outcome'].apply(lambda x: (x == 'Correct').sum()/len(x))
Out[467]:
Session Number
0    0.5
1    1.0
Name: Outcome, dtype: float64
It'll group your DF by 'Session Number' and calculate the ratio of correct Outcomes for each group.
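A slightly more compact variant (just a sketch along the same lines, using the 'Session Number' column created by the read_csv above) relies on the fact that the mean of a boolean comparison is the ratio of True values:
df.groupby('Session Number')['Outcome'].apply(lambda x: (x == 'Correct').mean())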
# getting the number of rows for session 0
total = len(df[df['Session Number'] == 0])
# getting the number of those rows that have 'Correct' for 'Outcome'
correct_and_session_zero = len(df[(df['Outcome'] == 'Correct') &
                                  (df['Session Number'] == 0)])
# if you're using python 2 you might need to convert correct_and_session_zero or total
# to float so you won't lose precision
print(correct_and_session_zero / total)
I have a data frame like this:
MONTH TIME PATH RATE
0 Feb 15:24:11 enp1s0 14.71Kb
I want to create a function that identifies whether an entry in the RATE column ends in 'Kb' or 'Mb'. If it does, I want to strip the 'Kb'/'Mb' suffix and perform an operation to convert the number into plain b.
Here's my code so far; the DataFrame currently treats RATE as an object:
df = pd.DataFrame(listOfLists)

def strip(bytesData):
    if "Kb" in bytesData:
        bytesData/1000
    elif "Mb" in bytesData:
        bytesData/1000000

df['RATE'] = df.apply(lambda x: strip(x['byteData']), axis=1)
How can I get it to change the value within the column, stripping the unwanted characters and converting it into the format I need? I know that once this operation is complete I'll have to change it to an int; however, I can't seem to alter the data in the way I need.
Thanks in advance!
I modified your function a bit and used map(lambda x:) instead of apply, since we are working with a Series and not the full DataFrame. I also added a few lines to provide examples for both Kb and Mb, and for when neither is present:
import numpy as np
import pandas as pd

example_df = pd.DataFrame({'Month': [0, 1, 2, 3],
                           'Time': ['15:32', '16:42', '17:11', '15:21'],
                           'Path': ['xxxxx', 'yyyyy', 'zzzzz', 'aaaaa'],
                           'Rate': ['14.71Kb', '18.21Mb', '19.01Kb', 'Error_1']})

def case_1(value):
    if value[-2:] == 'Kb':
        return float(value[:-2]) * 1000
    elif value[-2:] == 'Mb':
        return float(value[:-2]) * 1000000
    else:
        return np.nan
example_df['Rate'] = example_df['Rate'].map(lambda x: case_1(x))
The logic of the function: if the value ends with Kb, multiply it by 1000; else if it ends with Mb, multiply it by 1000000; otherwise return NaN (because neither condition is satisfied).
Output:
Month Time Path Rate
0 0 15:32 xxxxx 14710.0
1 1 16:42 yyyyy 18210000.0
2 2 17:11 zzzzz 19010.0
3 3 15:21 aaaaa NaN
Here's an alternative way I might approach this. This solution handles other abbreviations as well, though it does rely on the re standard-library package for regular expressions.
This approach makes a new column called Bytes. I often find it helpful to keep the RATE column in cases like this, to verify there aren't any edge cases I haven't thought of. I also use a mapping from each abbreviation to the power of 1000 needed to convert the value into plain b. I did add the code required to drop the original RATE column and rename the new column.
import re

def convert_to_bytes(contents):
    value, label, _ = re.split('([A-Za-z]+)', contents)
    factors = {'Kb': 1, 'Mb': 2, 'Gb': 3, 'Tb': 4}
    return float(value) * 1000**(factors[label])
df['Bytes'] = df['RATE'].map(convert_to_bytes)
# Drop original RATE column
df = df.drop('RATE', axis=1)
# Rename Bytes column to RATE
df = df.rename({'Bytes': 'RATE'}, axis='columns')
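As a quick check (a sketch on made-up values, not part of the original answer), the conversion behaves like this:
import pandas as pd
example = pd.DataFrame({'RATE': ['14.71Kb', '18.21Mb']})
example['Bytes'] = example['RATE'].map(convert_to_bytes)
# 14.71Kb -> 14710.0, 18.21Mb -> 18210000.0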
I am able to use groupby to get the overall medians for a document, e.g. print(df.groupby(['Key']).median()). But I want to learn the appropriate way to do it line by line, checking whether the group has changed from one row to the next. Below is one approach that is very clunky and non-pythonic.
csv:
A,1
A,2
A,3
A,4
A,5
A,6
A,7
B,8
B,9
B,10
B,11
B,12
B,13
B,14
B,15
B,16
B,17
import pandas as pd
import numpy as np
import statistics
df = pd.read_csv(r"C:\Users\mmcgown\Downloads\PythonMedianTest.csv",names=['Key','Values'])
rows = len(df.iloc[:, 0])
i = 0
med = []
while i < rows:
    if i == 0 or df.iloc[(i-1, 0)] == df.iloc[(i, 0)]:
        med.append(df.iloc[i, 1])
        if i == (rows-1):
            print(f"The median of {df.iloc[(i, 0)]} is {statistics.median(med)}")
    elif df.iloc[(i-1, 0)] != df.iloc[(i, 0)]:
        print(f"The median of {df.iloc[(i-1, 0)]} is {statistics.median(med)}")
        med = []
    i += 1
Output:
The median of A is 4
The median of B is 13
I get the same thing as groupby, save some rounding error. But I want to do it in the most concise, pythonic way, probably using a list comprehension.
A proposal for a more pythonic version could look like this:
med = []
rows, cols = df.shape
last_group = None
group_field = 'Key'
med_field = 'Values'
for i, row in df.iterrows():
    if last_group is None or last_group == row[group_field]:
        med.append(row[med_field])
    else:
        print(f"The median of {last_group} is {statistics.median(med)}")
        med = [row[med_field]]
    last_group = row[group_field]
if med:
    print(f"The median of {last_group} is {statistics.median(med)}")
I tried to avoid the iloc calls with indexes, which are not so easy to read. To be honest, at first I didn't get what you were comparing. You also don't need the elif in your case; you can just use else, because your condition is simply the negation of part of the if clause. Then I noticed a difference between the median your version computes and the one mine computes. If I am not mistaken, you throw away the very first value for B, right?
And if you want to get the number of rows of a dataframe, you could use:
rows, cols = df.shape
instead of calling len. I think that makes it more obvious to the reader what the code does.
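For comparison, a purely pandas-based version (just a sketch; it leans on groupby instead of manual bookkeeping) prints the same per-group medians:
for key, values in df.groupby('Key')['Values']:
    print(f"The median of {key} is {values.median()}")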
Coming from R, the code would be
x <- data.frame(vals = c(100,100,100,100,100,100,200,200,200,200,200,200,200,300,300,300,300,300))
x$state <- cumsum(c(1, diff(x$vals) != 0))
This marks every row where the difference from the previous row is non-zero, so I can use it to spot transitions in the data, like so:
vals state
1 100 1
...
7 200 2
...
14 300 3
What would be a clean equivalent in Python?
Additional question
The answer to the original question is posted below, but won't work properly for a grouped dataframe with pandas.
Data here: https://pastebin.com/gEmPHAb7. Notice that there are 2 different filenames.
When imported as df_all, I group it with the following and then apply the solution posted below.
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
Using diff and cumsum, as in your R example:
df['state'] = (df['vals'].diff()!= 0).cumsum()
This uses the fact that True has integer value 1
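As a quick illustration (a sketch on a shortened version of the R example's data):
import pandas as pd
demo = pd.DataFrame({'vals': [100, 100, 200, 200, 300]})
demo['state'] = (demo['vals'].diff() != 0).cumsum()
# state is [1, 1, 2, 2, 3]: diff() is NaN for the first row, and NaN != 0
# evaluates to True, so the numbering starts at 1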
Bonus question
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
I think you misunderstand what groupby does. All groupby does is create groups based on the criterion (filename in this instance). You then need to add another operation that tells pandas what should happen with each group.
Common operations are mean and sum, or more advanced ones such as apply and transform.
You can find more information in the pandas groupby documentation.
If you can explain in more detail what you want to achieve with the groupby, I can help you find the correct method. If you want to perform the above operation per filename, you probably need something like this:
def get_state(group):
    return (group.diff() != 0).cumsum()

df_all['state'] = df_all.groupby('filename')['Fit'].transform(get_state)
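To sanity-check the grouped version, here is a sketch on made-up data (the filename and Fit column names come from the question; the values are invented):
df_all = pd.DataFrame({'filename': ['a', 'a', 'a', 'b', 'b'],
                       'Fit': [1, 1, 2, 5, 5]})
df_all['state'] = df_all.groupby('filename')['Fit'].transform(get_state)
# 'state' restarts at 1 for each filename: [1, 1, 2, 1, 1]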
I have a Pandas dataframe with 3000+ rows that looks like this:
t090: c0S/m: pr: timeJ: potemp090C: sal00: depSM: \
407 19.3574 4.16649 1.836 189.617454 19.3571 30.3949 1.824
408 19.3519 4.47521 1.381 189.617512 19.3517 32.9250 1.372
409 19.3712 4.44736 0.710 189.617569 19.3711 32.6810 0.705
410 19.3602 4.26486 0.264 189.617627 19.3602 31.1949 0.262
411 19.3616 3.55025 0.084 189.617685 19.3616 25.4410 0.083
412 19.2559 0.13710 0.071 189.617743 19.2559 0.7783 0.071
413 19.2092 0.03000 0.068 189.617801 19.2092 0.1630 0.068
414 19.4396 0.00522 0.068 189.617859 19.4396 0.0321 0.068
What I want to do is create individual dataframes from each portion of the dataframe in which the values in column 'c0S/m:' exceed 0.1 (e.g. rows 407-412 in the example above).
So let's say that I have 7 sections in my 3000+ row dataframe in which a series of rows exceed 0.1 in the second column. My if/for/while statement will slice these sections and create 7 separate dataframes.
I tried researching the best I could but could not find a question that would address this problem. Any help is appreciated.
Thank you.
You can try this:
First add a column of 0s and 1s based on whether the value is greater than 1 or not.
df['splitter'] = np.where(df['c0S/m:'] > 1, 1, 0)
Now group by the cumulative sum of this column's diff:
df.groupby((df['splitter'].diff(1) != 0).astype('int').cumsum()).apply(lambda x: [x.index.min(),x.index.max()])
You get the required blocks of indices
splitter
1 [407, 411]
2 [412, 414]
3 [415, 415]
Now you can create dataframes using loc
df.loc[407:411]
Note: I added a line to your sample df using:
df.loc[415] = [19.01, 5.005, 0.09, 189.62, 19.01, 0.026, 0.09]
to be able to test better, which is why it splits into 3 groups.
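If you want the dataframes themselves rather than just the index ranges, a small extension of the same idea (a sketch reusing the splitter-based grouping above) is:
groups = (df['splitter'].diff(1) != 0).astype('int').cumsum()
block_dfs = [block for _, block in df.groupby(groups)]
# block_dfs[0] is df.loc[407:411], block_dfs[1] is df.loc[412:414], and so on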
Here's another way.
sub_set = df[df['c0S/m:'] > 0.1]
last = None
for i in sub_set.index:
    if last is None:
        start = i
    else:
        if i - last > 1:
            print(start, last)
            start = i
    last = i
# report the final block as well
print(start, last)
I think it works. (Instead of printing start and last, you could insert code that creates the slices you wanted of the original data frame.)
There are some neat tricks in the other answer that do an even better job.
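Building on that, a compact sketch (assuming the 'c0S/m:' column name from the question) that collects only the above-threshold sections into separate dataframes:
mask = df['c0S/m:'] > 0.1
block_id = (mask != mask.shift()).cumsum()
blocks = [block for _, block in df[mask].groupby(block_id[mask])]
# each element of blocks is one contiguous run of rows exceeding 0.1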
Background
I deal with a csv datasheet that prints out columns of numbers. I am working on a program that takes the first column, asks the user for a time as a float (e.g. 45 and a half hours = 45.5), and then subtracts that number from the first column. I have been successful in that regard. Now I need to find the row index of the "zero" time point. I use min to find that index and then read off the corresponding value in the following column. I need the reading at time 0 so I can normalize A1 to it, so that on a graph the reading at the 0 time point is 1 in column A1 (and eventually all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1']= df['A1']-time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line correctly identifies a series that I can pull values from for all my other columns. Next, r1 correctly identifies the proper A1.1 value, and this value is a float when I use type(r1).
However, when I divide df[' A1.1'] by r1, it yields only one correct value, the row where r1/r1 = 1. All other values come out NaN.
My Questions:
How do I divide a column by a float? Why am I getting NaN?
Is there a faster way to do this, since I need to do it for 16 columns (i.e. 'A2/r2', 'A3/r3', etc.)?
Do I need inplace=True anywhere to make the operations stick before re-saving the data, or is that only for adding/deleting rows?
Example
Dataframe that looks like this
(image: http://i.imgur.com/ObUzY7p.png)
zero time sets properly (image not shown)
after dividing the column
(image: http://i.imgur.com/TpLUiyE.png)
This should work:
df['A1.1']=df['A1.1']/df['A1.1'].min()
I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work is that r1 is a Series, so the division aligns on the index and every non-matching row becomes NaN. Try r1? instead of type(r1) and pandas will tell you that r1 is a Series, not an individual float number.
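One way around that (a sketch that keeps the question's column names) is to pull the scalar out of the one-row Series before dividing, so the division broadcasts over the whole column instead of aligning on the index:
r1 = zero_location_series[' A1.1'].iloc[0]
df[' A1.1'] = df[' A1.1'] / r1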
To do it in one attempt, you have to iterate over each column, like this:
for c in df:
    df[c] = df[c]/df[c].min()
If you want to divide every value in the column by r1, it's best to use apply; for example:
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5])
# apply an anonymous function to the first column ([0]), dividing every value
# in the column by 3
df[0] = df[0].apply(lambda x: x / 3.0)
print(df)
So you'd probably want something like this:
df["A1.1"] = df["A1.1"].apply(lambda x: x/r1)
This really only answers part 2 of your question. Apply is probably your best bet for running a function over multiple rows and columns quickly. As for why you're getting NaNs when dividing by a float, is it possible the values in your columns are something other than floats or integers?
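For the 16-column case, one vectorized option (a sketch; the reading-column names below are hypothetical placeholders, not taken from the question) is to divide every reading column by its value at the zero-time row in a single step:
reading_cols = [' A1.1', ' A2.1', ' A3.1']  # hypothetical column names
zero_idx = df['A1'].idxmin()                # index of the row closest to time 0
df[reading_cols] = df[reading_cols] / df.loc[zero_idx, reading_cols]
Dividing a DataFrame by a Series aligns on column labels, so each column is scaled by its own time-zero reading without an explicit loop.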