I have a dataframe with a column, let's call it "names", whose values are the names of other columns. I would like to add a new column that, for each row, holds the value from the column named in that row's "names" entry.
Example:
Input dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4], "names": ['a', 'b', 'a', 'b']})
a | b | names |
--- | --- | ---- |
1 | -1 | 'a' |
2 | -2 | 'b' |
3 | -3 | 'a' |
4 | -4 | 'b' |
Output dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4], "names": ['a', 'b', 'a', 'b'], "new_col": [1, -2, 3, -4]})
a | b | names | new_col |
--- | --- | ---- | ------ |
1 | -1 | 'a' | 1 |
2 | -2 | 'b' | -2 |
3 | -3 | 'a' | 3 |
4 | -4 | 'b' | -4 |
You can use lookup:
df['new_col'] = df.lookup(df.index, df.names)
df
# a b names new_col
#0 1 -1 a 1
#1 2 -2 b -2
#2 3 -3 a 3
#3 4 -4 b -4
EDIT
lookup has been deprecated (and has since been removed in pandas 2.0); here's the currently recommended replacement:
import numpy as np

idx, cols = pd.factorize(df['names'])
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
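For reference, this is what factorize returns on the example frame; the integer codes index into the array of uniques, which is what makes the positional lookup line up row by row:
>>> idx, cols = pd.factorize(df['names'])
>>> idx
array([0, 1, 0, 1])
>>> cols
Index(['a', 'b'], dtype='object')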
Because DataFrame.lookup is deprecated as of Pandas 1.2.0, the following is what I came up with using DataFrame.melt:
df['new_col'] = df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False).query('names == variable').loc[df.index, 'value']
Output:
>>> df
a b names new_col
0 1 -1 a 1
1 2 -2 b -2
2 3 -3 a 3
3 4 -4 b -4
Can this be simplified? For correctness, the index must not be ignored.
Additional reference:
Looking up values by index/column labels (archive)
Solution using pd.factorize (from https://github.com/pandas-dev/pandas/issues/39171#issuecomment-773477244):
import numpy as np

idx, cols = pd.factorize(df['names'])
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
With the straightforward and easy solution (lookup) deprecated, another alternative to the pandas-based ones proposed here is to convert df into a numpy array and use numpy indexing:
df['new_col'] = df.values[df.index.get_indexer(df['names'].index), df.columns.get_indexer(df['names'])]
Let me explain what this does. df.values is a numpy array based on the DataFrame. As numpy arrays have to be indexed numerically, we need to use the get_indexer function to convert the pandas row and column index names to index numbers that can be used with numpy:
>>> df.index.get_indexer(df['names'].index)
array([0, 1, 2, 3], dtype=int64)
>>> df.columns.get_indexer(df['names'])
array([0, 1, 0, 1], dtype=int64)
(In this case, where the row index is already numerical, you could get away with simply using df.index as the first argument inside the bracket, but this does not work generally.)
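For instance, here is a minimal sketch with a made-up string-indexed frame (df2 is hypothetical), where plain df.index would not work but get_indexer still does:
import pandas as pd

df2 = pd.DataFrame({"a": [1, 2], "b": [-1, -2], "names": ["b", "a"]},
                   index=["x", "y"])

rows = df2.index.get_indexer(df2["names"].index)  # array([0, 1])
cols = df2.columns.get_indexer(df2["names"])      # array([1, 0])
df2["new_col"] = df2.values[rows, cols]           # [-1, 2]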
Here's a short solution using df.melt and df.merge:
df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names'])
Outputs:
key_0 a b names value
0 0 1 -1 a 1
1 1 2 -2 b -2
2 2 3 -3 a 3
3 3 4 -4 b -4
There's a redundant key_0 column (the merge key generated from the index), which you need to drop with df.drop.
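For example, chaining the drop onto the merge (same output as above, minus key_0):
df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names']).drop(columns='key_0')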
Example data:
| alcoholism | diabetes | handicapped | hypertension | new col |
| ---------- | -------- | ----------- | ------------ | ----------------------- |
| 1 | 0 | 1 | 0 | alcoholism, handicapped |
| 0 | 1 | 0 | 1 | diabetes, hypertension |
| 0 | 1 | 0 | 0 | diabetes |
If any of the above columns has value 1, then I need the new column to contain the names of just those columns, and if all of them are zero, the new column should be empty (no condition).
I tried to do it with the code below:
problems = ['alcoholism', 'diabetes', 'handicapped', 'hypertension']
m1 = df[problems].isin([1])
mask = m1 | (m1.loc[~m1.any(axis=1)])
df['sp_name'] = mask.mul(problems).apply(lambda x: [i for i in x if i], axis=1)
But it returns the data with brackets like [handicapped, alcoholism].
The issue is that I can't run value_counts on the result, as the all-zero rows show up as empty lists [] and will not be plotted.
I still don't understand your ultimate goal, or how this will be useful in plotting, but all you're really missing is using str.join to combine each list into the string you want. That said, the way you've gotten there involves unnecessary steps. First, multiply the DataFrame by its own column names:
df * df.columns
alcoholism diabetes handicapped hypertension
0 alcoholism handicapped
1 diabetes hypertension
2 diabetes
Then you can apply the same as you did:
(df * df.columns).apply(lambda row: [i for i in row if i], axis=1)
0 [alcoholism, handicapped]
1 [diabetes, hypertension]
2 [diabetes]
dtype: object
Then you just need to include a string join in the function you supply to apply. Here's a complete example:
import pandas as pd
df = pd.DataFrame({
'alcoholism': [1, 0, 0],
'diabetes': [0, 1, 1],
'handicapped': [1, 0, 0],
'hypertension': [0, 1, 0],
})
df['new_col'] = (
    (df * df.columns)
    .apply(lambda row: ', '.join([i for i in row if i]), axis=1)
)
print(df)
alcoholism diabetes handicapped hypertension new_col
0 1 0 1 0 alcoholism, handicapped
1 0 1 0 1 diabetes, hypertension
2 0 1 0 0 diabetes
df['new_col'] = df.iloc[:, :-1].dot(df.add_suffix(",").columns[:-1]).str[:-1]
I found this solution helpful.
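If the one-liner is hard to parse, here is the same idea spelled out step by step. Note it assumes the result column is already the frame's last column, which is what both :-1 slices exclude:
# comma-suffixed column names, e.g. ['alcoholism,', 'diabetes,', ...]
names_with_comma = df.add_suffix(",").columns[:-1]

# dot the 0/1 indicators with the names: 1 * 'name,' keeps the name, 0 * 'name,' gives ''
joined = df.iloc[:, :-1].dot(names_with_comma)

# strip the trailing comma
df['new_col'] = joined.str[:-1]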
I have a DataFrame with column names 'a', 'b', 'c'.
#Input
import pandas as pd
list_of_dicts = [
    {'a': 0, 'b': 4, 'c': 3},
    {'a': 1, 'b': 1, 'c': 2},
    {'a': 0, 'b': 0, 'c': 0},
    {'a': 1, 'b': 0, 'c': 3},
    {'a': 2, 'b': 1, 'c': 0},
]
df = pd.DataFrame(list_of_dicts)
#Input DataFrame
|   | a | b | c |
| --- | --- | --- | --- |
| 0 | 0 | 4 | 3 |
| 1 | 1 | 1 | 2 |
| 2 | 0 | 0 | 0 |
| 3 | 1 | 0 | 3 |
| 4 | 2 | 1 | 0 |
I want to reduce the wide DataFrame to one column, where each column name appears as a value, repeated according to the corresponding row value. The operation must be done row-wise.
#Output
|    | Values |
| --- | ------ |
| 0  | b |
| 1  | b |
| 2  | b |
| 3  | b |
| 4  | c |
| 5  | c |
| 6  | c |
| 7  | a |
| 8  | b |
| 9  | c |
| 10 | c |
| 11 | a |
| 12 | c |
| 13 | c |
| 14 | c |
| 15 | a |
| 16 | a |
| 17 | b |
Explanation:
Row 0 in the Input DataFrame has 4 'b' and 3 'c', so the first seven elements of the output DataFrame are bbbbccc
Row 1 similarly has 1 'a' 1 'b' and 2 'c', so the output will have abcc as the next 4 elements
Row 2 has 0s across, so it is skipped entirely.
The order of the output is very important.
For example, the first row has 4 'b' and 3 'c', so the output must be bbbbccc, because column 'b' comes before column 'c'. The operation must proceed row-wise, from left to right.
I'm trying to find an efficient way to accomplish this; the real dataset is too big for me to compute. Please provide a Python 3 solution.
Stack the data (you could melt as well), and drop rows where the count is zero. Finally use numpy.repeat to build a new array, and build your new dataframe from that.
import numpy as np

reshape = df.stack().droplevel(0).loc[lambda x: x != 0]
pd.DataFrame(np.repeat(reshape.index, reshape), columns=['values'])
values
0 b
1 b
2 b
3 b
4 c
5 c
6 c
7 a
8 b
9 c
10 c
11 a
12 c
13 c
14 c
15 a
16 a
17 b
I don't think pandas buys you anything in this process; especially with a large amount of data, you don't want to read it all into memory and reprocess it into another large data structure.
import csv

with open('input.csv', 'r', newline='') as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        for key in reader.fieldnames:
            value = int(row[key])
            for i in range(value):
                print(key)
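If the result should go to a file rather than stdout, the same streaming loop can feed csv.writer (a sketch; input.csv and output.csv are placeholder names):
import csv

with open('input.csv', newline='') as fh, open('output.csv', 'w', newline='') as out:
    reader = csv.DictReader(fh)
    writer = csv.writer(out)
    writer.writerow(['values'])        # header
    for row in reader:
        for key in reader.fieldnames:  # preserves left-to-right column order
            for _ in range(int(row[key])):
                writer.writerow([key])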
Having the dataframe like this:
| one | two | three |
| --- | --- | ----- |
| 1   | 2   | 4     |
| 4   | 6   | 3     |
| 2   | 4   | 9     |
How can I subtract the values of column two from column one, and so on for each pair of columns, and then get the sum of the obtained values? Like:
| one | two | three | one-two | one-three | two-three | SUM |
| --- | --- | ----- | ------- | --------- | --------- | --- |
| 1   | 2   | 4     | -1      | -3        | -2        | -6  |
| 4   | 6   | 3     |         |           |           |     |
| 2   | 4   | 9     |         |           |           |     |
As a result, I need a df with only the difference columns (one-two, one-three, two-three) and SUM.
You can try this:
from itertools import combinations
import pandas as pd
df = pd.DataFrame({'one': {0: 1, 1: 4, 2: 2},
'two': {0: 2, 1: 6, 2: 4},
'three': {0: 4, 1: 3, 2: 9}})
Create the column combinations using itertools.combinations:
## create column combinations
column_combinations = list(combinations(list(df.columns), 2))
Subtract each pair of columns and create a new column for each difference:
column_names = []
for column_comb in column_combinations:
name = f"{column_comb[0]}_{column_comb[1]}"
df[name] = df[column_comb[0]] - df[column_comb[1]]
column_names.append(name)
df["SUM"] = df[column_names].sum(axis=1)
print(df)
output:
   one  two  three  one_two  one_three  two_three  SUM
0    1    2      4       -1         -3         -2   -6
1    4    6      3       -2          1          3    2
2    2    4      9       -2         -7         -5  -14
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
column_differences = df.apply(np.diff, axis=1)
total_column_differences = np.sum(column_differences.to_list(), axis=1)
df['SUM'] = total_column_differences
print(df)
Gives the following. Note that summing the consecutive differences telescopes to (last column minus first column) for each row, which is not the same quantity as the sum of all pairwise differences requested above.
a b c SUM
0 1 4 7 6
1 2 5 8 6
2 3 6 9 6
DataFrames make operations like this very easy
df['new_column'] = df['colA'] - df['colB']
PyData has a great resource to learn more.
In your example:
import pandas as pd
df = pd.DataFrame(data=[[1,2,4],[4,6,3], [2,4,9]], columns=['one', 'two', 'three'])
df['one-two'] = df['one'] - df['two']
df['one-three'] = df['one'] - df['three']
df['two-three'] = df['two'] - df['three']
df['sum'] = df['one-two'] + df['one-three'] + df['two-three']
df.drop(columns=['one', 'two', 'three'], inplace=True)
print(df)
one-two one-three two-three sum
0 -1 -3 -2 -6
1 -2 1 3 2
2 -2 -7 -5 -14
Assuming you have only 3 columns and only want pairs of 2 features:
from itertools import combinations
from pandas import DataFrame
# Creating the DataFrame
df = DataFrame({'one': [1,4,2], 'two': [2,6,4], 'three': [4,3,9]})
# Getting the possible feature combinations
combs = combinations(df.columns, 2)
# Calculating the differences for the column pairs
for comb in combs:
df['-'.join(comb)] = df[comb[0]] - df[comb[1]]
# Adding the sum of the differences to the DataFrame
df['SUM'] = df[df.columns[3:]].sum(axis=1)
one two three one-two one-three two-three SUM
0 1 2 4 -1 -3 -2 -6
1 4 6 3 -2 1 3 2
2 2 4 9 -2 -7 -5 -14
I would like to groupby and sum a dataframe without changing the number of rows, applying the aggregated result to the first occurrence of each group only.
Initial DF:
C1 | Val
--- | ---
a | 1
a | 1
b | 1
c | 1
c | 1
Wanted DF:
C1 | Val
--- | ---
a | 2
a | 0
b | 1
c | 2
c | 0
I tried to apply the following code:
df.groupby(['C1'])['Val'].transform('sum')
which propagates the aggregated result to all rows of each group. However, transform does not seem to have an argument that applies the result to the first or last occurrence only.
Indeed, what I currently get is:
C1 | Val
--- | ---
a | 2
a | 2
b | 1
c | 2
c | 2
Using pandas.DataFrame.groupby:
s = df.groupby('C1')['Val']
v = s.sum().values
df.loc[:, 'Val'] = 0
df.loc[s.head(1).index, 'Val'] = v
print(df)
Output:
C1 Val
0 a 2
1 a 0
2 b 1
3 c 2
4 c 0
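A shorter variant of the same idea keeps your transform and just zeroes out every row except each group's first occurrence, using Series.where with duplicated (a sketch, equivalent for this data):
df['Val'] = df.groupby('C1')['Val'].transform('sum').where(~df['C1'].duplicated(), 0)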
I am still in a learning phase in Python and wanted to know how to roll up the data and count the duplicate rows, storing the count in a column called Count.
The data frame structure is as follows:
Col1 | Value
--- | ---
A | 1
B | 1
A | 1
B | 1
C | 3
C | 3
C | 3
C | 3
My result should be as follows:
Col1 | Value | Count
--- | --- | ---
A | 1 | 2
B | 1 | 2
C | 3 | 4
>>> df2 = df.groupby(['Col1', 'Value']).size().reset_index()
>>> df2.columns = ['Col1', 'Value', 'Count']
>>> df2
Col1 Value Count
0 A 1 2
1 B 1 2
2 C 3 4
Roman Pekar's fine answer is correct for this case. However, I saw it after trying to write a solution for the general case stated in the text of your question, not just the example with specific column names. So, for the general case, consider:
df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
For example:
import pandas as pd
df = pd.DataFrame({'Col1': ['a', 'a', 'a', 'b', 'c'], 'Value': [1, 2, 1, 3, 2]})
>>> df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
Col1 Value Count
0 a 1 2
1 a 2 1
2 b 3 1
3 c 2 1
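Equivalently, you can pass the column names directly instead of building a list of Series (same result as above):
df.groupby(list(df.columns)).size().reset_index().rename(columns={0: 'Count'})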
You can also try:
df.groupby('Col1')['Value'].value_counts().reset_index(name='Count')
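One caveat: value_counts sorts each group's counts in descending order, so the row order can differ from the groupby/size approach; adding sort_index restores label order (a sketch):
df.groupby('Col1')['Value'].value_counts().sort_index().reset_index(name='Count')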