Consecutively subtract columns and get the resulting cumulative sum - python

Given a dataframe like this:
|one|two|three|
| 1 | 2 | 4 |
| 4 | 6 | 3 |
| 2 | 4 | 9 |
How can I subtract the values of column two from the values of column one, and so on for every pair of columns, and then get the sum of the obtained values? Like
|one|two|three| one-two | one-three | two-three | SUM |
| 1 | 2 | 4 | -1 | -3 | -2 | -6 |
| 4 | 6 | 3 | -2 | 1 | 3 | 2 |
| 2 | 4 | 9 | -2 | -7 | -5 | -14 |
As a result, I need a df with only the difference columns (one-two, one-three, two-three) and SUM.

You can try this:
from itertools import combinations
import pandas as pd
df = pd.DataFrame({'one': {0: 1, 1: 4, 2: 2},
                   'two': {0: 2, 1: 6, 2: 4},
                   'three': {0: 4, 1: 3, 2: 9}})
# create column combinations using itertools.combinations
column_combinations = list(combinations(list(df.columns), 2))
# subtract each column pair and store the result in a new column
column_names = []
for column_comb in column_combinations:
    name = f"{column_comb[0]}_{column_comb[1]}"
    df[name] = df[column_comb[0]] - df[column_comb[1]]
    column_names.append(name)
df["SUM"] = df[column_names].sum(axis=1)
print(df)
output:
   one  two  three  one_two  one_three  two_three  SUM
0    1    2      4       -1         -3         -2   -6
1    4    6      3       -2          1          3    2
2    2    4      9       -2         -7         -5  -14

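Since the question asks for a frame holding only the difference columns and SUM, you can then keep just those (a minimal sketch reusing the column_names list built above):
df = df[column_names + ['SUM']]
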
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
column_differences = df.apply(np.diff, axis=1)
total_column_differences = np.sum(column_differences.to_list(), axis=1)
df['SUM'] = total_column_differences
print(df)
Gives the following.
a b c SUM
0 1 4 7 6
1 2 5 8 6
2 3 6 9 6
Note that np.diff takes consecutive differences (b - a and c - b), so their row sum telescopes to c - a; that is why SUM is 6 in every row here, which differs from the pairwise sums the question asks for.

DataFrames make operations like this very easy:
df['new_column'] = df['colA'] - df['colB']
PyData has a great resource to learn more.
In your example:
import pandas as pd
df = pd.DataFrame(data=[[1,2,4],[4,6,3], [2,4,9]], columns=['one', 'two', 'three'])
df['one-two'] = df['one'] - df['two']
df['one-three'] = df['one'] - df['three']
df['two-three'] = df['two'] - df['three']
df['sum'] = df['one-two'] + df['one-three'] + df['two-three']
df.drop(columns=['one', 'two', 'three'], inplace=True)
print(df)
one-two one-three two-three sum
0 -1 -3 -2 -6
1 -2 1 3 2
2 -2 -7 -5 -14

Assuming you have only 3 columns and only want pairs of 2 features:
from itertools import combinations
from pandas import DataFrame
# Creating the DataFrame
df = DataFrame({'one': [1,4,2], 'two': [2,6,4], 'three': [4,3,9]})
# Getting the possible feature combinations
combs = combinations(df.columns, 2)
# Calculating the totals for the column pairs
for comb in combs:
    df['-'.join(comb)] = df[comb[0]] - df[comb[1]]
# Adding the totals to the DataFrame
df['SUM'] = df[df.columns[3:]].sum(axis=1)
Output:
one two three one-two one-three two-three SUM
0 1 2 4 -1 -3 -2 -6
1 4 6 3 -2 1 3 2
2 2 4 9 -2 -7 -5 -14

Related

python: filter groups whose last row is negative

I'd like to slice my data frame, based on some group conditions, and get all the groups whose last record has a negative value.
A B C D
1 a a 1
1 a a 2
1 a a 3
2 a a 1
2 a a -1
3 a a -1
3 a a -2
3 a a -3
Suppose this is my data frame, grouped by column A.
I want to get all the groups with a negative last value in column D.
output:
A B C D
2 a a 1
2 a a -1
3 a a -1
3 a a -2
3 a a -3
Columns B and C are not related to the filter, but I need all the rows in each group, not only the last rows.
How to do this?
Use groupby to group by column A, then filter its result to keep only the groups you want.
Fully functional example:
import pandas as pd
df = pd.DataFrame({"A":[1, 1, 1, 2, 2, 3, 3, 3],
"B": ["a"] *8,
"C": ["a"] *8,
"D": [1, 2, 3, 1, -1, -1, -2 , -3]
})
df.groupby('A').filter(lambda df_group: df_group['D'].iloc[-1] < 0)
Output:
A B C D
3 2 a a 1
4 2 a a -1
5 3 a a -1
6 3 a a -2
7 3 a a -3
You can pass the 'last' method to groupby.transform to broadcast each group's last value across its rows, use lt to build a boolean mask that is True where that last value is less than 0, and then use the mask to filter df:
out = df[df.groupby('A')['D'].transform('last').lt(0)]
Output:
A B C D
3 2 a a 1
4 2 a a -1
5 3 a a -1
6 3 a a -2
7 3 a a -3
Use:
string = """A B C D
1 a a 1
1 a a 2
1 a a 3
2 a a 1
2 a a -1
3 a a -1
3 a a -2
3 a a -3"""
import numpy as np
import pandas as pd
data = [x.split() for x in string.split('\n')]
df = pd.DataFrame(np.array(data[1:]), columns=data[0])
df['D'] = df['D'].astype(int)
#solution
temp = df.groupby('A')['D'].last()
df[df['A'].isin(temp[temp<0].index)]
Output:
   A  B  C  D
3  2  a  a  1
4  2  a  a -1
5  3  a  a -1
6  3  a  a -2
7  3  a  a -3

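As an aside, the same frame can be built more directly with pandas' own parser (a sketch using io.StringIO; with a whitespace separator pandas also infers the integer dtype for D, so the astype call becomes unnecessary):
import io
import pandas as pd
df = pd.read_csv(io.StringIO(string), sep=r'\s+')
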
assign new columns to datatable

How do I assign new columns to a datatable?
Tried this:
import datatable as dt
DT1 = dt.Frame(A=range(5))
DT1['B'] = [1, 2, 3, 4, 5]
But getting
ValueError: The LHS of the replacement has 1 columns, while the RHS has 5 replacement expressions
With the DT[col] = ... syntax you can assign single-column datatable Frames, numpy arrays, pandas DataFrames, or pandas Series. In your example, you just need to wrap the right-hand side in a Frame:
>>> DT1['B'] = dt.Frame([1, 2, 3, 4, 5])
>>> DT1
| A B
| int32 int32
-- + ----- -----
0 | 0 1
1 | 1 2
2 | 2 3
3 | 3 4
4 | 4 5
[5 rows x 2 columns]
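For completeness, a minimal sketch of the other right-hand sides mentioned above (numpy arrays and pandas Series; the column names C and D are illustrative):
import numpy as np
import pandas as pd
DT1['C'] = np.array([10, 20, 30, 40, 50])   # assign a numpy array
DT1['D'] = pd.Series([5, 4, 3, 2, 1])       # assign a pandas Series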

Plot in python after crosstab merge

I'd like to plot my DataFrame. I had this DF first:
id|project|categories|rating
1 | a | A | 1
1 | a | B | 1
1 | a | C | 2
1 | b | A | 1
1 | b | B | 1
2 | c | A | 1
2 | c | B | 2
then used this code:
import pandas as pd
df = pd.DataFrame(...)
(df.groupby('id').project.nunique().reset_index()
.merge(pd.crosstab(df.id, df.categories).reset_index()))
and now got this DataFrame:
id | project | A | B | C |
1 | 2 | 2 | 2 | 1 |
2 | 1 | 1 | 1 | 0 |
Now I'd like to plot the DF. I want to show whether the number of projects depends on how many categories are affected, or on which categories are affected. I know how to visualize dataframes, but after the crosstab and merge it is not working as usual.
I reproduced your data using the code below:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2],
                   'project': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
                   'categories': ['A', 'B', 'C', 'A', 'B', 'A', 'B'],
                   'rating': [1, 1, 2, 1, 1, 1, 2]})
Now the data looks like this:
categories id project rating
0 A 1 a 1
1 B 1 a 1
2 C 1 a 2
3 A 1 b 1
4 B 1 b 1
5 A 2 c 1
6 B 2 c 2
If you want to plot 'category count' as a function of 'project count', it looks like this:
import matplotlib.pyplot as plt
# this line is your code
df2 = df.groupby('id').project.nunique().reset_index().merge(pd.crosstab(df.id, df.categories).reset_index())
plt.scatter(df2.project, df2.A, label='A', alpha=0.5)
plt.scatter(df2.project, df2.B, label='B', alpha=0.5)
plt.scatter(df2.project, df2.C, label='C', alpha=0.5)
plt.xlabel('project count')
plt.ylabel('category count')
plt.legend()
plt.show()
And you will get a scatter plot of the per-category counts against the project count (original image omitted).
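As an aside, the same figure can be produced from the merged frame with pandas' own plotting wrapper (a sketch; the xlabel/ylabel keywords assume pandas >= 1.1):
df2.plot(x='project', y=['A', 'B', 'C'], style='o', alpha=0.5,
         xlabel='project count', ylabel='category count')
plt.show()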

Pandas - select column using other column value as column name

I have a dataframe that contains a column, let's call it "names", which holds the names of other columns. I would like to add a new column that, for each row, takes the value from the column named in that row's "names" entry.
Example:
Input dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b']})
a | b | names |
--- | --- | ---- |
1 | -1 | 'a' |
2 | -2 | 'b' |
3 | -3 | 'a' |
4 | -4 | 'b' |
Output dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b'], "new_col":[1,-2,3,-4]})
a | b | names | new_col |
--- | --- | ---- | ------ |
1 | -1 | 'a' | 1 |
2 | -2 | 'b' | -2 |
3 | -3 | 'a' | 3 |
4 | -4 | 'b' | -4 |
You can use lookup:
df['new_col'] = df.lookup(df.index, df.names)
df
# a b names new_col
#0 1 -1 a 1
#1 2 -2 b -2
#2 3 -3 a 3
#3 4 -4 b -4
EDIT
lookup has been deprecated; here's the currently recommended solution:
import numpy as np
idx, cols = pd.factorize(df['names'])
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Because DataFrame.lookup is deprecated as of Pandas 1.2.0, the following is what I came up with using DataFrame.melt:
df['new_col'] = (df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False)
                   .query('names == variable')
                   .loc[df.index, 'value'])
Output:
>>> df
a b names new_col
0 1 -1 a 1
1 2 -2 b -2
2 3 -3 a 3
3 4 -4 b -4
Can this be simplified? For correctness, the index must not be ignored.
Additional reference:
Looking up values by index/column labels (archive)
Solution using pd.factorize (from https://github.com/pandas-dev/pandas/issues/39171#issuecomment-773477244):
import numpy as np
idx, cols = pd.factorize(df['names'])
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
With the straightforward and easy solution (lookup) deprecated, another alternative to the pandas-based ones proposed here is to convert df into a numpy array and use numpy indexing:
df['new_col'] = df.values[df.index.get_indexer(df['names'].index), df.columns.get_indexer(df['names'])]
Let me explain what this does. df.values is a numpy array based on the DataFrame. As numpy arrays have to be indexed numerically, we need to use the get_indexer function to convert the pandas row and column index names to index numbers that can be used with numpy:
>>> df.index.get_indexer(df['names'].index)
array([0, 1, 2, 3], dtype=int64)
>>> df.columns.get_indexer(df['names'])
array([0, 1, 0, 1], dtype=int64)
(In this case, where the row index is already numerical, you could get away with simply using df.index as the first argument inside the bracket, but this does not work generally.)
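To illustrate the parenthetical above, a quick sketch with non-numerical row labels (the labels are illustrative), where passing df.index directly as a positional indexer would fail:
df2 = df.set_axis(['r0', 'r1', 'r2', 'r3'], axis=0)
df2['new_col'] = df2.values[df2.index.get_indexer(df2.index),
                            df2.columns.get_indexer(df2['names'])]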
Here's a short solution using df.melt and df.merge:
df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names'])
Outputs:
key_0 a b names value
0 0 1 -1 a 1
1 1 2 -2 b -2
2 2 3 -3 a 3
3 3 4 -4 b -4
There's a redundant key_0 column which you need to drop with df.drop.
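For example (a minimal sketch chaining the drop onto the merge above):
df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names']).drop(columns='key_0')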

Rolling up data frame along with count of rows in python

I am still in a learning phase with Python and wanted to know how to roll up the data and count duplicate rows, storing the count in a column called Count.
The data frame structure is as follows:
Col1| Value
A | 1
B | 1
A | 1
B | 1
C | 3
C | 3
C | 3
C | 3
My result should be as follows:
Col1|Value|Count
A | 1 | 2
B | 1 | 2
C | 3 | 4
>>> df2 = df.groupby(['Col1', 'Value']).size().reset_index()
>>> df2.columns = ['Col1', 'Value', 'Count']
>>> df2
Col1 Value Count
0 A 1 2
1 B 1 2
2 C 3 4
Roman Pekar's fine answer is correct for this case. However, I saw it after trying to write a solution for the general case stated in the text of your question, not just the example with specific column names. So, for the general case, consider:
df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
For example:
import pandas as pd
df = pd.DataFrame({'Col1': ['a', 'a', 'a', 'b', 'c'], 'Value': [1, 2, 1, 3, 2]})
>>> df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
Col1 Value Count
0 a 1 2
1 a 2 1
2 b 3 1
3 c 2 1
You can also try:
df.groupby('Col1')['Value'].value_counts().reset_index(name='Count')
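On the example data above this should give the same rolled-up result (a sketch of the expected output):
  Col1  Value  Count
0    A      1      2
1    B      1      2
2    C      3      4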
