Pandas split DataFrame according to indices - python

I've been working on a pandas DataFrame,
df = pd.DataFrame({'col':[-0.217514, -0.217834, 0.844116, 0.800125, 0.824554]}, index=[49082, 49083, 49853, 49854, 49855])
and I get data that looks like this:
            col
49082 -0.217514
49083 -0.217834
49853  0.844116
49854  0.800125
49855  0.824554
As you can see, the index suddenly jumps by 770 (due to a sort I did earlier).
Now I would like to split this DataFrame into several ones, where each consists only of rows whose indices are consecutive (here the first two rows would go into one DataFrame while the last three would go into another).
Does anyone have an idea as to how to do this?
Thanks!

Use groupby on the index minus an increasing-by-1 sequence: within each run of consecutive indices that difference is constant, so each run forms its own group. Then collect each group as a separate DataFrame in a list:
import numpy as np

all_dfs = [g for _, g in df.groupby(df.index - np.arange(len(df.index)))]
all_dfs
output:
[          col
 49082 -0.217514
 49083 -0.217834,
           col
 49853  0.844116
 49854  0.800125
 49855  0.824554]
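To see why this works: within each run of consecutive indices, index minus position is constant, so it serves as the group key. A quick check with the example data (the exact Index repr varies by pandas version):
print(df.index - np.arange(len(df.index)))
# 49082-0, 49083-1, 49853-2, 49854-3, 49855-4
# -> Int64Index([49082, 49082, 49851, 49851, 49851], dtype='int64')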

Related

calculate sum of rows in pandas dataframe grouped by date

I have a csv that I loaded into a Pandas Dataframe.
I then select only the rows with duplicate dates in the DF:
df_dups = df[df.duplicated(['Date'])].copy()
I'm trying to get the sum of all the rows with the exact same date for 4 columns (all float values), like this:
df_sum = df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].sum()
However, this does not give the desired result. When I examine the .groups of the groupby object (see below), I noticed that the first occurrence of each duplicated date is not included in the indices. So for two items with the same date, there is only one index in the groups object.
pprint(df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].groups)
I have no idea how to get the sum of all duplicates.
I've also tried:
df_sum = df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].apply(lambda x : x.sum())
This gives the same result, which makes sense I guess, as the indices in the groupby object are not complete. What am I missing here?
Check the documentation for the method duplicated. By default, duplicates are marked True except for the first occurrence, which is why the first row of each date is missing from your sums.
You only need to pass keep=False to duplicated for your desired behaviour.
df_dups = df[df.duplicated(['Date'], keep=False)].copy()
After that, the sum can be calculated properly with the expression you wrote:
df_sum = df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].apply(lambda x : x.sum())
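One caveat, assuming a recent pandas version: selecting multiple columns from a groupby with a bare tuple of names is no longer supported; pass a list instead, and plain .sum() does the same job as the apply:
cols = ["Received Quantity", "Sent Quantity", "Fee Amount", "Market Value"]
df_sum = df_dups.groupby('Date')[cols].sum()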

How to sort a dataFrame in python pandas by two or more columns based on a list of values? [duplicate]

So I have a pandas DataFrame, df, with columns that represent taxonomic classification (i.e. Kingdom, Phylum, Class, etc.). I also have a list of taxonomic labels in the order I would like the DataFrame to be sorted by.
The list looks something like this:
class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']
This list corresponds to the DataFrame column df['Class']. I would like to sort all the rows of the whole dataframe based on the order of the list, since df['Class'] is currently in a different order. What would be the best way to do this?
You could make the Class column your index column
df = df.set_index('Class')
and then use df.loc to reindex the DataFrame with class_list:
df.loc[class_list]
Minimal example:
>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
                 Class  Number
0  Gammaproteobacteria       3
1        Bacteroidetes       5
2        Negativicutes       6
>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
                     Number
Class
Bacteroidetes             5
Negativicutes             6
Gammaproteobacteria       3
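Note that in recent pandas versions, passing a list containing labels missing from the index to .loc raises a KeyError; reindex is the forgiving variant (a sketch):
df.reindex(class_list)  # classes absent from df appear as all-NaN rows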
Alex's solution doesn't work if your original dataframe does not contain all of the elements in the ordered list, i.e. if your input data at some point does not contain "Negativicutes", this script will fail. One way to get past this is to append the matching sub-DataFrames to a list and concatenate them at the end. For example:
ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']
df_list = []
for i in ordered_classes:
    df_list.append(df[df['Class'] == i])
ordered_df = pd.concat(df_list)
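Another common approach (a sketch, not from the answers above) is an ordered Categorical, which lets sort_values follow class_list directly and also tolerates classes missing from the data:
import pandas as pd

df = pd.DataFrame({'Class': ['Negativicutes', 'Gammaproteobacteria', 'Bacteroidetes'],
                   'Number': [6, 3, 5]})
class_list = ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes']

# Ordered categorical: sort_values now sorts by position in class_list.
df['Class'] = pd.Categorical(df['Class'], categories=class_list, ordered=True)
df_sorted = df.sort_values('Class')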

Run functions for each row in dataframe

I have a dataframe df1, like this:
date sentence
29/03/1029 I like you
30/03/2019 You eat cake
and I want to run the functions getVerb and getObj on df1, so that the output looks like this:
date sentence verb object
29/03/1029 I like you like you
30/03/2019 You eat cake eat cake
I want those functions (getVerb and getObj) to run for each row in df1. Could someone help me solve this problem in an efficient way?
Thank you so much.
Each column of a pandas DataFrame is a Series. You can use the Series.apply or Series.map functions to get the result you want.
df1['verb'] = df1['sentence'].apply(getVerb)
df1['object'] = df1['sentence'].apply(getObj)
# OR
df1['verb'] = df1['sentence'].map(getVerb)
df1['object'] = df1['sentence'].map(getObj)
See the pandas documentation for more details on Series.apply or Series.map.
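For a self-contained illustration, here is a runnable sketch; the getVerb/getObj bodies below are hypothetical stand-ins (they naively take the second word as the verb and the rest as the object), since the question does not show them:
import pandas as pd

def getVerb(sentence):
    # Hypothetical stand-in: treat the second word as the verb.
    return sentence.split()[1]

def getObj(sentence):
    # Hypothetical stand-in: treat everything after the verb as the object.
    return ' '.join(sentence.split()[2:])

df1 = pd.DataFrame({'date': ['29/03/1029', '30/03/2019'],
                    'sentence': ['I like you', 'You eat cake']})
df1['verb'] = df1['sentence'].apply(getVerb)
df1['object'] = df1['sentence'].apply(getObj)
print(df1)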
Assume you have a pandas dataframe such as:
import pandas as pd, numpy as np
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9
Say we want the sum of columns A and B, both row-wise and column-wise. To accomplish this, we write:
df.apply(np.sum, axis=1)  # for row-wise sum
Output:
0    13
1    13
2    13
dtype: int64

df.apply(np.sum, axis=0)  # for column-wise sum
Output:
A    12
B    27
dtype: int64
Now, if you want to apply a function to a specific set of columns, select a subset of the data-frame first.
For example, to compute the sum over column A only:
df['A'].sum()
(Series.apply has no axis argument, so df['A'].apply(np.sum, axis=1) would raise an error; plain .sum() is the idiomatic way.)
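Similarly, for a row-wise function over several specific columns, select the subset first (a small sketch):
df[['A', 'B']].apply(np.sum, axis=1)  # row-wise sum over columns A and B only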
See the DataFrame.apply documentation as well. Other than that, Series.map and Series.apply can be handy too, as mentioned in the answer above.
Cheers!
Using a simple loop (assuming the columns 'verb' and 'object' already exist in the data frame):
for index, row in df1.iterrows():
    df1.loc[index, 'verb'] = getVerb(row['sentence'])
    df1.loc[index, 'object'] = getObj(row['sentence'])
Note: df1.loc[index, 'verb'] = ... avoids the chained-assignment problem that df1['verb'].iloc[index] = ... can trigger.

how to get range of index of pandas dataframe

What is the most efficient way to get the range of indices for which the corresponding column content satisfies a condition, e.g. the rows from a <body> tag through a </body> tag?
For example, for the data frame below I want to get the row indices 1-3.
Can anyone suggest the most pythonic way to achieve this?
import pandas as pd

df = pd.DataFrame([['This is also a interesting topic', 2],
                   ['<body> the valley of flowers ...', 1],
                   ['found in the hilly terrain', 5],
                   ['we must preserve it </body>', 6]],
                  columns=['description', 'count'])
print(df.head())
What condition are you looking to satisfy?
import pandas as pd

df = pd.DataFrame([['This is also a interesting topic', 2],
                   ['<body> the valley of flowers ...', 1],
                   ['found in the hilly terrain', 5],
                   ['we must preserve it </body>', 6]],
                  columns=['description', 'count'])
print(df)
print(len(df[df['count'] != 2].index))
Here, df['count'] != 2 subsets the df, and len(df.index) returns the length of the index.
Updated; note that I used str.contains(), rather than explicitly looking for starting or ending strings.
df2 = df[(df.description.str.contains('<body>') | (df.description.str.contains('</body>')))]
print(df2)
print(len(df2.index))
help from: Check if string is in a pandas dataframe
You can also find the indices of the start and end rows, then take the rows between them to get all the content in between:
start_index = df[df['description'].str.contains('<body>')].index[0]
end_index = df[df['description'].str.contains('</body>')].index[0]
print(df['description'][start_index:end_index + 1].sum())

Python dataframe groupby by dictionary list then sum

I have two dataframes. The first named mergedcsv is of the format:
[screenshot: mergedcsv dataframe]
The second dataframe, named idgrp_df, is in a dictionary-like format: for each region ID it holds a list of corresponding string IDs.
[screenshot: idgrp_df dataframe - keys with lists]
For each row in mergedcsv (and the corresponding row in idgrp_df) I wish to select the columns of mergedcsv whose labels are in the list from idgrp_df for that row, sum the values of those columns, and write the result to a new column in mergedcsv. The function will iterate through all rows of mergedcsv (582 rows x 600 columns).
My line of code to try to attempt this is:
mergedcsv['TotRegFlows'] = mergedcsv.groupby([idgrp_df],as_index=False).numbers.apply(lambda x: x.iat[0].sum())
It returns: ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional.
This relates to the input dataframe for the groupby. How can I access the list for each row as the input for the groupby?
So for example, for the first row in mergedcsv I wish to select the columns with labels F95RR04, F95RR06 and F95RR15 (reading from the list in the first row of idgrp_df). Sum the values in these columns for that row and insert the sum value into TotRegFlows column.
Any ideas as to how I can utilize the list would be very much appreciated.
Edits:
Many thanks IanS. Your solution is useful. After modifying the code based on this advice I realised that (as suggested) the indices of my two dataframes are out of sync: mergedcsv had None as index while idgrp_df has the 'REG_ID' column as index. I set mergedcsv's index to 'REG_ID' as well, then realised that mergedcsv has 582 rows (REG_ID is not unique) while idgrp_df has 220 rows (REG_ID is unique). I therefore think I am missing a groupby on the REG_ID index in mergedcsv.
I have modified the code as follows:
mergedcsv.set_index('REG_ID', inplace=True)
print mergedcsv.index.name
print idgrp_df.index.name
mergedcsvgroup = mergedcsv.groupby('REG_ID')[mergedcsv.columns].apply(lambda y: y.tolist())
mergedcsvgroup['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum(), axis=1)
This raises a KeyError: 'REG_ID'.
Any further recommendations are most welcome. Would it be more efficient to combine the groupby and apply into one line?
I am new to working with pandas and trying to build experience in python
Further amendments:
Without an index for mergedcsv:
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID').sum(), axis=1)
this throws a KeyError: ('the label [0] is not in the [index]', u'occurred at index 0')
With an index for mergedcsv:
mergedcsv.set_index('REG_ID', inplace=True)
columnlist = list(mergedcsv.columns.values)
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID')[columnlist].transform().sum(), axis=1)
this throws a TypeError: ("unhashable type:'list'", u'occurred at index 7')
Or finally separating the groupby function:
columnlist = list(mergedcsv.columns.values)
mergedcsvgroup = mergedcsv.groupby('REG_ID')
mergedcsv['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum())
this throws a TypeError: unhashable type: 'list'. The axis=1 argument is also not available with groupby apply.
Any ideas how I can use the lists with the apply function? I've explored tuples in the apply code but have not had any success.
Any suggestions much appreciated.
If I understand correctly, I have a simple solution with apply:
Setup
import pandas as pd
df = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6], 'C': [7,8,9]})
lists = pd.Series([['A', 'B'], ['A', 'C'], ['C']])
Solution
I apply a lambda function that gets the list of columns to be summed from the lists series:
df.apply(lambda row: row[lists[row.name]].sum(), axis=1)
The trick is that, when iterating over rows (axis=1), row.name is the original index of the dataframe df. I use that to access the list from the lists series.
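For this setup the result would be (row 0 sums A and B, row 1 sums A and C, row 2 takes only C):
0     5
1    10
2     9
dtype: int64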
Notes
This solution assumes that both dataframes share the same index, which appears not to be the case in the screenshots you included. You have to address that.
Also, if idgrp_df is a dataframe and not a series, then you need to access its values with .loc.
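For instance, if idgrp_df is a dataframe whose lists live in a column (the column name 'ids' here is hypothetical), the lookup becomes:
mergedcsv['TotRegFlows'] = mergedcsv.apply(
    lambda row: row[idgrp_df.loc[row.name, 'ids']].sum(), axis=1)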
