I have a pandas groupby object that I made from a larger dataframe, in which amounts are grouped under a person ID variable as well as whether it was an ingoing or outgoing transaction. Here's an example:
ID In_Out Amount
1 In 5
1 Out 8
2 In 4
2 Out 2
3 In 3
3 Out 9
4 Out 8
(sorry I don't know how to put actual sample data in). Note that some folks can have one or the other (e.g., maybe they have some going out but nothing coming in).
All I want to do is get the difference in the amounts, collapsed under each person. So the ideal output would be, perhaps, a dictionary or another dataframe containing the difference in amounts for each person, like this:
ID Difference
1 -3
2 2
3 -6
4 -8
I have tried a handful of different ways to do this but am not sure how to work with these nested lists in Python.
Thanks!
We could select the rows that are Out, convert them to negative integers, and then use sum().
import io
import pandas as pd
s = '''\
ID In_Out Amount
1 In 5
1 Out 8
2 In 4
2 Out 2
3 In 3
3 Out 9
4 Out 8'''
# Recreate dataframe
df = pd.read_csv(io.StringIO(s), sep=r'\s+')
# Select rows where In_Out == 'Out' and multiply by -1
df.loc[df['In_Out'] == 'Out', 'Amount'] *= -1
# Convert to dict
d = df.groupby('ID')['Amount'].sum().to_dict()
print(d)
Returns:
{1: -3, 2: 2, 3: -6, 4: -8}
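If you'd rather leave the Amount column untouched, a pivot-based variant gives the same result. This is a sketch that re-reads the same sample string s (i.e. before the sign flip above) and assumes both 'In' and 'Out' appear somewhere in the data:
# Fresh copy of the sample; amounts not negated here
df2 = pd.read_csv(io.StringIO(s), sep=r'\s+')
# Spread In/Out into columns (missing combinations become 0), then subtract
wide = df2.pivot_table(index='ID', columns='In_Out', values='Amount', aggfunc='sum', fill_value=0)
print((wide['In'] - wide['Out']).to_dict())
# {1: -3, 2: 2, 3: -6, 4: -8}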
I have a column with entries stored as [array, type] and want to convert them into integers. I already managed to convert the itemId strings into integers with:
for i in range(0, len(df)):
    if len(df["itemIds"][i]) < 2:
        df['n'][i] = df.itemIds[i][0]
    else:
        df['n'][i] = df.itemIds[i]
But now I have the problem that some arrays have multiple string entries and I do not know how to create extra rows for those so that all values are stored separately. I am trying to run this for loop:
for i in range(0, len(df)):
    if len(df.n[i]) > 1:
        df.loc[-1] = df.iloc[i]
but since the data is quite large it runs forever. Any advice is highly appreciated! Thank you
You could just use explode. For example:
df = pd.DataFrame({
    'itemIds': [10103923431, 1003052070, 935653934, [10040664250, 10076964903, 10106433820, 5551386]],
    'other': range(5, 9)
})
# itemIds other
# 0 10103923431 5
# 1 1003052070 6
# 2 935653934 7
# 3 [10040664250, 10076964903, 10106433820, 5551386] 8
df = df.explode('itemIds', ignore_index=True)
Output:
itemIds other
0 10103923431 5
1 1003052070 6
2 935653934 7
3 10040664250 8
4 10076964903 8
5 10106433820 8
6 5551386 8
Note that with explode you can just leave lists of 1 entry as a list rather than converting them into an int value. This will save some computation.
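If you still want plain integers in the exploded column (as in your original loop), a single cast afterwards should cover both the former scalars and the exploded list entries, assuming every value is numeric:
# After explode every cell holds a single value, so one cast handles them all
df['itemIds'] = df['itemIds'].astype('int64')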
Given a table,
Id  Value
1   1
2   2
2   3
3   4
4   5
4   6
2   8
2   3
1   1
Instead of a simple groupby('Id').agg({'Value':'sum'}), which would aggregate over all the instances and yield a table with only four rows, I wish the result to aggregate only over neighbouring instances, hence maintaining the order in which the table was created.
The expected output is the following:
Id  Value
1   1
2   5
3   4
4   11
2   11
1   1
If not possible with pandas groupby, any other kind of trick would also be greatly appreciated.
Note: If the above example is not helpful, basically what I want is to somehow compress the table by aggregating over 'Value'. The aggregation should be done only over duplicate 'Id's which occur exactly one after the other.
Unfortunately, the answers from eshirvana and wwnde don't seem to work for a long dataset. Inspired by wwnde's answer, I found a workaround:
# create a series referring to groups of consecutive identical Ids
new = []
i = -1
seen = None
for item in df.Id:
    if item != seen:
        i += 1
        seen = item
    new.append(i)
df['temp'] = new
Now, we group by the 'temp' column:
df.groupby('temp').agg({'Id':max, 'Value':sum}).reset_index(drop=True)
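As an aside, the same consecutive-group key can be built without a Python loop, by comparing each Id with the previous row and taking a cumulative sum. This is a sketch assuming the Id/Value columns from the question:
# A new group starts wherever Id differs from the row above it
group_key = (df['Id'] != df['Id'].shift()).cumsum()
result = (df.groupby(group_key)
            .agg({'Id': 'max', 'Value': 'sum'})
            .reset_index(drop=True))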
I have a dataframe (df) with the following columns:
print(df.columns)
['A','B','C','D','E']
And let's assume all the columns have numbers as data.
Then I select some of the columns to become indexes
Index = ['A','B','C']
df = df.set_index(Index).sort_index()
and I use it this way for some analysis. At some point I need to change the rows of column 'E' when index 'C' has certain values, for instance something like:
df.loc[df[(slice(None,None),slice(None,None),slice(5,10))], 'E' ] = 6
Which, obviously, doesn't work. I have tried a bunch of different approaches: using tuples and slices for the index as shown in my line above, re-arranging the indexes so I can use a single slice (moving 'C' to the first level), trying .xs (cross section), etc., and I cannot do it. (I have been looking into the documentation of .loc, .xs, etc.) I don't find an example that does exactly this, nor do I find a conclusive answer that it is not possible. Right now I was able to do the following:
df.reset_index(inplace=True)  # turn it back into a normal DataFrame
df.loc[(df['C'] >= 5) & (df['C'] <= 10), 'E'] = 6  # modify normally based on column data
df = df.set_index(Index).sort_index()  # bring it back to a multiindex
But this doesn't seem right. It would seem to me that indexes should be able to be sliced somehow, I just can't find how. Perhaps I'm not searching for the correct terms on Google. If anyone could give me a hand or point me in the right direction I'd greatly appreciate it.
You can use df.index.get_level_values('C'), which returns an array of the values at that index level, like below.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(100, 5)), columns=list('ABCDE'))
df = df.set_index(['A','B','C']).sort_index()
df.loc[(df.index.get_level_values('C') <= 10) & (df.index.get_level_values('C') >= 5), 'E'] = 6
print(df)
Results:
        D  E
A B C
0 0 6   3  6
  2 0   6  1
    7   2  6
  3 6   5  6
    9   1  6
...    .. ..
9 3 3   5  0
    6   6  6
  4 3   5  7
    7   6  6
  6 8   6  6
Note: The parentheses around both .get_level_values() comparisons are required because otherwise the expression is ambiguous and will throw an error.
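If you prefer label-based slicing to a boolean mask, pd.IndexSlice expresses the same selection. This is a sketch assuming the MultiIndex is sorted (as it is after sort_index()) and that 'C' is the innermost level:
idx = pd.IndexSlice
# Slice the third index level 'C' between the labels 5 and 10 (inclusive)
df.loc[idx[:, :, 5:10], 'E'] = 6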
I'm working on some code that generates features from a dataframe and adds these features as columns to the dataframe.
The trouble is I'm working with a time series so that for any given tuple, I need (let's say) 5 of the previous tuples to generate the corresponding feature for that tuple.
lookback_period = 5
df['feature1'] = np.zeros(len(df)) # preallocate
for index, row in df.iterrows():
    if index < lookback_period:
        continue
    slice = df[index - lookback_period:index]
    some_int = SomeFxn(slice)
    row['feature1'] = some_int
Is there a way to execute this code without explicitly looping through every row and then slicing?
One way is to create several lagged columns using df['column_name'].shift() such that all the necessary information is contained in each row, but this quickly gets intractable for my computer's memory since the dataset is large (millions of rows).
I don't have enough reputation to comment so will just post it here.
Can't you use apply on your dataframe, e.g.
df['feature1'] = df.apply(someRowFunction, axis=1)
where someRowFunction will accept the full row and you can perform whatever row based slice and logic you want to do.
--- updated ---
As we do not have much information about the dataframe and the required/expected output, I just based the answer on the information from the comments.
Let's define a function that takes a DataFrame slice (based on the current row index and the lookback) and the row itself, and returns the sum of the slice's first column plus the current row's b value.
def someRowFunction(slice, row):
    if slice.shape[0] == 0:
        return 0
    return slice[slice.columns[0]].sum() + row.b
d = {'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0], 'b': [0, 9, 8, 7, 6, 5, 4, 3, 2, 1]}
df = pd.DataFrame(data=d)
lookback = 5
df['c'] = df.apply(lambda current_row: someRowFunction(
    df[current_row.name - lookback:current_row.name], current_row), axis=1)
We can get the row index inside apply from its name attribute, and with it we can retrieve the required slice. The above results in the following:
print(df)
a b c
0 1 0 0
1 2 9 0
2 3 8 0
3 4 7 0
4 5 6 0
5 6 5 20
6 7 4 24
7 8 3 28
8 9 2 32
9 0 1 36
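For what it's worth, when the per-row computation can be expressed with rolling, as the example function above can (sum of the previous five 'a' values plus the current row's b), a vectorised sketch avoids apply entirely:
# Rolling sum of 'a', shifted so each row only sees the rows before it;
# rows without a full lookback window become 0, matching the output above
prev_sum = df['a'].rolling(lookback).sum().shift(1)
df['c'] = (prev_sum + df['b']).fillna(0).astype(int)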
Here is an example of data I'm working on. (as a pandas df)
index inv Rev_stream Bill_type Net_rev
1 1 A Original -24.77
2 1 B Original -24.77
3 2 A Original -409.33
4 2 B Original -409.33
5 2 C Original -409.33
6 2 D Original -409.33
7 3 A Original -843.11
8 3 A Rebill 279.5
9 3 B Original -843.11
10 4 A Rebill 279.5
11 4 B Original -843.11
12 5 B Rebill 279.5
How could I filter this df so that I only get the lines where an invoice/Rev_stream combination has both the Original and the Rebill kind of Net_rev? In the example above it would be only the lines with index 7 and 8.
Is there an easy way to do it, without iterating over the whole dataframe and building dictionaries of invoice+RevStream : Bill_type?
What I'm looking for is some kind of
df = df[df[['inv','Rev_stream']]['Bill_type'].unique().len() == 2]
Unfortunately the code above doesn't work.
Thanks in advance.
You can group your data by the inv and Rev_stream columns, then check for each group whether both Original and Rebill appear in the Bill_type values, and filter on that condition:
(df.groupby(['inv', 'Rev_stream'])
.filter(lambda g: 'Original' in g.Bill_type.values and 'Rebill' in g.Bill_type.values))
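On the sample data above this keeps exactly the rows with index 7 and 8. For large frames, a transform-based variant of the same idea can be faster than filter, since it avoids calling a Python lambda per group; this sketch assumes Bill_type only ever takes the values 'Original' and 'Rebill':
# Number of distinct Bill_type values per (inv, Rev_stream) group
n_types = df.groupby(['inv', 'Rev_stream'])['Bill_type'].transform('nunique')
# Keep rows whose group has both types
df[n_types == 2]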