Pandas Dataframe grouping / combining columns? - python

I'm new to Pandas, and I'm having a horrible time figuring out datasets.
I have a csv file I've read in using pandas.read_csv, dogData, that looks as follows:
The column names are dog breeds, the first line [0] refers to the size of the dogs, and beyond that there's a bunch of numerical values. The very first column has a string description that I need to keep, but it isn't relevant to the question. The last column for each size category contains separate "Average" values. (Note that pandas changed the "Average" columns to "Average.1", "Average.2" and so on, to take care of them not being unique.)
Basically, I want to "group" by the first row - so all "small" dog values will be averaged except the "small" average column, and so on. The result would look something like this:
The existing "Average" columns should not be included in the new average being calculated, and they don't need to be altered at all. All "small" breed values should be averaged, all "medium" breed values should be averaged, and so on (the actual file is much larger than the sample I showed here).
There's no guarantee the breeds won't be altered, and no guarantee the "sizes" will remain the same / always be included ("Small" could be left out, for example).
EDIT: After Joe Ferndz's comment, I've updated my code and have something slightly closer to working, but actually adding the columns is still giving me trouble.
dogData = pd.read_csv("dogdata.csv", header=[0,1])
dogData.columns = dogData.columns.map("_".join)
totalVal = ""
count = 0
for col in dogData:
    if "Unnamed" in col:
        continue  # to skip starting columns
    if "Average" not in col:
        totalVal += dogData[col]
        count += 1
    else:
        # this is where I'd calculate the average, then reset count and totalVal;
        # right now, because the addition isn't working, I haven't figured that out
        break
print(totalVal)
Now, this code is getting the correct values technically... but it won't let me numerically add them (hence why totalVal is a string right now). It gives me a string of concatenated numbers, the correct concatenated numbers, but won't let me convert them to floats to actually add.
I've tried doing float(dogData[col]) for the totalVal addition line - it gives me a TypeError: cannot convert the series to <class float>
I've tried keeping it as a string, putting in "," between the numbers, then doing totalVal.split(",") to separate them, then convert and add... but obviously that doesn't work either, because AttributeError: 'Series' has no attribute 'split'
These errors make sense to me and I understand why it's happening, but I don't know what the correct method for doing this is. dogData[col] gives me all the values for every row at once, which is what I want, but I don't know how to then store that and add it in the next iteration of the loop.
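For what it's worth, a minimal sketch of just the accumulation fix: keep totalVal numeric so the columns add element-wise. (This sums every breed column together rather than per size; the per-size grouping is handled in the answer below.)
import pandas as pd

totalVal = 0   # start from a numeric zero; 0 + Series gives a Series
count = 0
for col in dogData:
    if "Unnamed" in col:
        continue  # skip the leading description column
    if "Average" not in col:
        # to_numeric guards against a column that was read in as strings
        totalVal = totalVal + pd.to_numeric(dogData[col])
        count += 1
print(totalVal / count)  # element-wise average of the summed columns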
Here's a copy/pastable sample of data:
,Corgi,Yorkie,Pug,Average,Average,Dalmation,German Shepherd,Average,Great Dane,Average
,Small,Small,Small,Small,Medium,Large,Large,Large,Very Large,Very Large
Words,1,3,3,3,2.4,3,5,7,7,7
Words1,2,2,4,4,2.2,4,4,6,8,8
Words2,2,1,5,3,2.5,5,3,8,9,6
Words3,1,4,4,2,2.7,6,6,5,6,9

You have to do a few tricks to get this to work.
Step 1: Read the csv file using the first two rows as the header. This creates a MultiIndex column list.
Step 2: Join the two header levels together with, say, an _.
Step 3: Rename the specific columns per your requirement, e.g. S-Average, M-Average, ....
Step 4: Find out how many columns are a dog name + Small.
Step 5: Compute the value for Small. Per your requirement, sum(columns with Small) / count(columns with Small).
Steps 6, 7: Do the same for Large.
Steps 8, 9: Do the same for Very Large. (In the sample, Medium has only an Average column, so there is nothing to compute for it.)
This will give you the final list. If you want the columns to be in a specific order, you can change the order.
Step 10: Change the column order of the dataframe.
import pandas as pd
df = pd.read_csv('abc.txt', header=[0, 1], index_col=0)
df.columns = df.columns.map('_'.join)
df.rename(columns={'Average_Small': 'S-Average',
                   'Average_Medium': 'M-Average',
                   'Average_Large': 'L-Average',
                   'Average_Very Large': 'Very L-Average'}, inplace=True)
idx = [i for i, x in enumerate(df.columns) if x.endswith('_Small')]
if idx:
    df['Small'] = ((df.iloc[:, idx].sum(axis=1)) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)
idx = [i for i, x in enumerate(df.columns) if x.endswith('_Large')]
if idx:
    df['Large'] = ((df.iloc[:, idx].sum(axis=1)) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)
idx = [i for i, x in enumerate(df.columns) if x.endswith('_Very Large')]
if idx:
    df['Very_Large'] = ((df.iloc[:, idx].sum(axis=1)) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)
df = df[['Small', 'S-Average', 'M-Average', 'L-Average', 'Very L-Average', 'Large', 'Very_Large']]
print(df)
The output of this will be:
        Small  S-Average  M-Average  ...  Very L-Average  Large  Very_Large
Words    2.33          3        2.4  ...               7    4.0         7.0
Words1   2.67          4        2.2  ...               8    4.0         8.0
Words2   2.67          3        2.5  ...               6    4.0         9.0
Words3   3.00          2        2.7  ...               9    6.0         6.0
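Since the question notes that the sizes may change or be left out entirely, here is a more generic sketch that averages the breed columns for whatever sizes happen to be present, leaving the existing Average columns aside. It assumes the same two-row header as above; the name size_means is mine:
import pandas as pd

df = pd.read_csv('abc.txt', header=[0, 1], index_col=0)

# separate the breed columns from the per-size 'Average' columns
breed_cols = [c for c in df.columns if c[0] != 'Average']

# group the breed columns by their size label (the second header row) and average
size_means = df[breed_cols].T.groupby(level=1).mean().T.round(2)
Nothing is hard-coded here, so a missing size (no Small columns, say) simply produces no column in size_means.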

Related

Replace the last value of each row by the penultimate

I have a pivot table that looks like this:
And I need to replace the last (non-empty) value of each row by the value before it.
for row in cohort_pivot2.iterrows():
    a = [i for i in row[1].values if i][-1]
    b = [i for i in row[1].values if i][-2]
But I can't think of any way to do the substitution.
I have not added the steps to replicate the pivot table because it's generated from confidential information and it's a process that would take up a lot of space, but if necessary I'd generate dummy data and replicate the process.
I would greatly appreciate any help
Here is a solution using apply, which converts each row to a list and then converts it back to a dataframe:
df = pd.DataFrame([[1, 2, 3, None],
                   [10, 20, None, None],
                   [100, None, None, None]]).fillna('')
using list addition
ll = df.apply(lambda x: list(x)[:list(x).index('')-1] + [list(x)[list(x).index('')-2]] + ['']*(len(x) - list(x).index('')), 1)
using list comprehension
ll = df.apply(lambda x: [el if idx != list(x).index('') -1 else list(x)[list(x).index('') -2] for idx, el in enumerate(list(x))], 1)
results in
pd.DataFrame.from_records(ll)
#     0   1  2 3
# 0   1   2  2
# 1  10  10
# 2
Please note that the last row becomes entirely empty because there was no previous element with which to fill it. Also note that I used empty strings as the null elements; I did that because of pandas' automatic type inference, which converts None to np.nan when a column has a float type.
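For comparison, a sketch of the same replacement using NaN as the null value and Series.last_valid_index, looping row by row. It assumes a default integer index, so row labels equal positions:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, None],
                   [10, 20, None, None],
                   [100, None, None, None]], dtype=float)

for i, row in df.iterrows():
    last = row.last_valid_index()        # label of the last non-null cell
    pos = df.columns.get_loc(last)       # its integer position
    if pos >= 1:                         # needs a penultimate value to copy
        df.iat[i, pos] = df.iat[i, pos - 1]
Unlike the apply version above, a row holding a single value is left untouched rather than blanked.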

Is there a way to allocate sorted values in a dataframe to groups based on alternating elements?

I have a Pandas DataFrame like:
COURSE  BIB#  COURSE 1  COURSE 2  STRAIGHT-GLIDING     MEAN  PRESTASJON
     1     2    20.220    22.535             19.91  21.3775    1.073707
     0     1    21.235    23.345             20.69  22.2900    1.077332
This is from a pilot and the DataFrame may be much longer when we perform the real experiment. Now that I have calculated the performance for each BIB#, I want to allocate them into two different groups based on their performance. I have therefore written the following code:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
This sorts values in the DataFrame. Now I want to assign even rows to one group and odd rows to another. How can I do this?
I have no idea what I am looking for. I have looked up in the documentation for the random module in Python but that is not exactly what I am looking for. I have seen some questions/posts pointing to a scikit-learn stratification function but I don't know if that is a good choice. Alternatively, is there a way to create a loop that accomplishes this? I appreciate your help.
Here is a figure to illustrate what I want to accomplish:
How about this:
threshold = 0.5
df1['group'] = df1['PRESTASJON'] > threshold
Or if you want values for your groups:
import numpy as np
df['group'] = np.where(df['PRESTASJON'] > threshold, 'A', 'B')
Here, 'A' will be assigned to column 'group' if PRESTASJON exceeds the threshold, otherwise 'B'.
UPDATE: Per OP's update on the post, if you want to group them alternatively into two groups:
# sort your dataframe based on the PRESTASJON column
df1 = df1.sort_values(by='PRESTASJON')
# create a new column with default value 'A', then assign every other row to 'B'
df1['group'] = 'A'
df1.iloc[1::2, -1] = 'B'
Are you splitting the dataframe alternatingly? If so, you can do:
import numpy as np
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
for i, d in df1.groupby(np.arange(len(df1)) % 2):
    print(f'group {i}')
    print(d)
Another way without groupby:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
mask = np.arange(len(df1)) % 2
group1 = df1.loc[mask == 0]
group2 = df1.loc[mask == 1]

How can I alter pandas column data according to a specific condition based on a column entry?

I have a data frame like this:
  MONTH      TIME    PATH     RATE
0   Feb  15:24:11  enp1s0  14.71Kb
I want to create a function that identifies whether 'Kb' or 'Mb' appears in the RATE column. If an entry in RATE ends with 'Kb' or 'Mb', it should strip the suffix and convert the value into plain b.
Here's my code so far; RATE is treated by the DataFrame as an object:
df = pd.DataFrame(listOfLists)
def strip(bytesData):
    if "Kb" in bytesData:
        bytesData/1000
    elif "Mb" in bytesData:
        bytesData/1000000
df['RATE'] = df.apply(lambda x: strip(x['byteData']), axis=1)
How can I get it to change the value within the column while stripping it of unwanted characters and converting it into the format I need? I know once this operation is complete I'll have to change it to an int, however, I can't seem to alter the data in the way I need.
Thanks in advance!
I modified your function a bit and used map(lambda x:) instead of apply, since we are working with a Series rather than the full dataframe. I also added some rows to provide examples for both Kb and Mb, and for when neither is present:
import numpy as np
import pandas as pd

example_df = pd.DataFrame({'Month': [0, 1, 2, 3],
                           'Time': ['15:32', '16:42', '17:11', '15:21'],
                           'Path': ['xxxxx', 'yyyyy', 'zzzzz', 'aaaaa'],
                           'Rate': ['14.71Kb', '18.21Mb', '19.01Kb', 'Error_1']})

def case_1(value):
    if value[-2:] == 'Kb':
        return float(value[:-2]) * 1000
    elif value[-2:] == 'Mb':
        return float(value[:-2]) * 1000000
    else:
        return np.nan

example_df['Rate'] = example_df['Rate'].map(lambda x: case_1(x))
The logic of the function: if the value ends with Kb, multiply the number by 1,000; else if it ends with Mb, multiply by 1,000,000; otherwise return NaN, because neither condition is satisfied.
Output:
   Month   Time   Path        Rate
0      0  15:32  xxxxx     14710.0
1      1  16:42  yyyyy  18210000.0
2      2  17:11  zzzzz     19010.0
3      3  15:21  aaaaa         NaN
Here's an alternative approach. It handles other abbreviations as well, though it relies on the re package from the standard library.
This approach makes a new column called Bytes. I often find it helpful to keep the RATE column in cases like this, to verify there aren't any edge cases I haven't thought of. I also use a mapping from the unit label to the power of 1000 needed to convert the value into bytes. I added the code required to drop the original RATE column and rename the new column.
import re

def convert_to_bytes(contents):
    # split e.g. '14.71Kb' into the numeric part and the unit label
    value, label, _ = re.split('([A-Za-z]+)', contents)
    factors = {'Kb': 1, 'Mb': 2, 'Gb': 3, 'Tb': 4}
    return float(value) * 1000 ** factors[label]

df['Bytes'] = df['RATE'].map(convert_to_bytes)
# Drop the original RATE column
df = df.drop('RATE', axis=1)
# Rename the Bytes column to RATE
df = df.rename({'Bytes': 'RATE'}, axis='columns')
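As a quick sanity check, with values like the ones in the question (my own illustrative calls, not part of the original answer):
convert_to_bytes('14.71Kb')   # 14.71 * 1000**1 -> 14710.0
convert_to_bytes('18.21Mb')   # 18.21 * 1000**2 -> 18210000.0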

Split a list by half the length and add new column with dependent values

I have a csv datafile that I've split by a column value into 5 datasets for each person using:
for i in range(1, 6):
    PersonData = df[df['Person'] == i].values
    P[i] = PersonData
I want to sort the data into ascending order according to one column, then split the data half way at that column to find the median.
So I sorted the data with the following:
dataP = {}
for i in range(1, 6):
    sortData = P[i][P[i][:, 9].argsort()]
    P[i] = sortData
    dataP[i] = pd.DataFrame(P[i])
dataP[1]
Using that I get a dataframe for each of my datasets 1-6 sorted by the relevant column (9), depending on which number I put into dataP[i].
Then I calculate half the length:
for i in range(1, 6):
    middle = len(dataP[i]) / 2
    print(middle)
Here is where I'm stuck!
I need to create a new column in each dataP[i] dataframe that splits the length in 2 and gives the value 0 if it's in the first half and 1 if it's in the second.
This is what I've tried but I don't understand why it doesn't produce a new list of values 0 and 1 that I can later append to dataP[i]:
for n in range(1, (len(dataP[i]))):
    for n, line in enumerate(dataP[i]):
        if middle > n:
            confval = 0
        elif middle < n:
            confval = 1
for i in range(1, 6):
    Confval[i] = confval
Confval[1]
Sorry if this is basic, I'm quite new to this so a lot of what I've written might not be the best way to do it/necessary, and sorry also for the long post.
Any help would be massively appreciated. Thanks in advance!
If I'm reading your question right, I believe you are attempting to do two things:
1. Find the median value of a column.
2. Create a new column which is 0 if the value is less than the median or 1 if greater.
Let's tackle #1 first:
median = df['originalcolumn'].median()
That easy! There are many great pandas functions for things like this.
Ok so number two:
df['newcolumn'] = (df['originalcolumn'] > median).astype(int)
What we're doing here is creating a boolean Series: False where the value is at or below the median, True otherwise. Then we cast it to int, which gives us 0s and 1s.
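If you instead want a strict positional half-split after sorting (which can differ from a median threshold when there are ties), a minimal sketch using the same hypothetical column names:
import pandas as pd

# sort, then label the first half 0 and the second half 1
df = df.sort_values('originalcolumn').reset_index(drop=True)
df['newcolumn'] = (df.index >= len(df) // 2).astype(int)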

Pandas - slice sections of dataframe into multiple dataframes

I have a Pandas dataframe with 3000+ rows that looks like this:
t090:   c0S/m:    pr:      timeJ:  potemp090C:   sal00:  depSM:  \
407  19.3574  4.16649  1.836  189.617454      19.3571  30.3949   1.824
408  19.3519  4.47521  1.381  189.617512      19.3517  32.9250   1.372
409  19.3712  4.44736  0.710  189.617569      19.3711  32.6810   0.705
410  19.3602  4.26486  0.264  189.617627      19.3602  31.1949   0.262
411  19.3616  3.55025  0.084  189.617685      19.3616  25.4410   0.083
412  19.2559  0.13710  0.071  189.617743      19.2559   0.7783   0.071
413  19.2092  0.03000  0.068  189.617801      19.2092   0.1630   0.068
414  19.4396  0.00522  0.068  189.617859      19.4396   0.0321   0.068
What I want to do is create individual dataframes from each portion of the dataframe in which the values in column 'c0S/m:' exceed 0.1 (e.g. rows 407-412 in the example above).
So let's say that I have 7 sections in my 3000+ row dataframe in which a series of rows exceed 0.1 in the second column. My if/for/while statement will slice these sections and create 7 separate dataframes.
I tried researching the best I could but could not find a question that would address this problem. Any help is appreciated.
Thank you.
You can try this:
First add a column of 0 or 1 based on whether the value is greater than 0.1 or not.
import numpy as np
df['splitter'] = np.where(df['c0S/m:'] > 0.1, 1, 0)
Now group by the cumulative sum of the changes in this column, using diff and cumsum:
df.groupby((df['splitter'].diff(1) != 0).astype('int').cumsum()).apply(lambda x: [x.index.min(),x.index.max()])
You get the required blocks of indices
splitter
1    [407, 412]
2    [413, 414]
3    [415, 415]
Now you can create dataframes using loc
df.loc[407:412]
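To go a step further and collect the above-threshold sections into separate dataframes, a sketch (the name frames is mine):
groups = df.groupby((df['splitter'].diff(1) != 0).astype('int').cumsum())
# keep only the blocks whose rows exceed the threshold
frames = [g.drop(columns='splitter') for _, g in groups if g['splitter'].iat[0] == 1]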
Note: I added a line to your sample df using:
df.loc[415] = [19.01, 5.005, 0.09, 189.62, 19.01, 0.026, 0.09]
to be able to test better; hence it splits into 3 groups
Here's another way.
sub_set = df[df['c0S/m:'] > 0.1]
last = None
for i in sub_set.index:
    if last is None:
        start = i
    else:
        if i - last > 1:
            print(start, last)
            start = i
    last = i
print(start, last)  # emit the final block, which the loop itself never prints
I think it works. (Instead of print(start, last) you could insert code to create the slices you want from the original dataframe.)
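For instance, a sketch that collects the slices into a list instead of printing the boundaries (the name chunks is mine):
chunks = []
last = None
for i in sub_set.index:
    if last is None:
        start = i
    elif i - last > 1:              # a gap in the index closes the current block
        chunks.append(df.loc[start:last])
        start = i
    last = i
if last is not None:                # don't forget the final block
    chunks.append(df.loc[start:last])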
Some neat tricks here that do an even better job.
