How do I assign new columns to a datatable?
I tried this:
DT1 = dt.Frame(A = range(5))
DT1['B'] = [1, 2, 3, 4, 5]
But I'm getting:
ValueError: The LHS of the replacement has 1 columns, while the RHS has 5 replacement expressions
With the DT[col] = ... syntax you can assign single-column datatable Frames, numpy arrays, or pandas DataFrames/Series. In your example, you just need to wrap the right-hand side in a Frame:
>>> DT1['B'] = dt.Frame([1, 2, 3, 4, 5])
>>> DT1
   |     A      B
   | int32  int32
-- + -----  -----
 0 |     0      1
 1 |     1      2
 2 |     2      3
 3 |     3      4
 4 |     4      5
[5 rows x 2 columns]
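As the answer notes, a single-column numpy array is accepted as well. A minimal sketch of that variant (assuming numpy is installed alongside datatable):
import datatable as dt
import numpy as np

DT1 = dt.Frame(A=range(5))
DT1['B'] = np.array([1, 2, 3, 4, 5])  # a 1-D numpy array becomes a single column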
I've got an Excel/pandas dataframe that looks like this:
+------+--------+
| ID | 2nd ID |
+------+--------+
| ID_1 | R_1 |
| ID_1 | R_2 |
| ID_2 | R_3 |
| ID_3 | |
| ID_4 | R_4 |
| ID_5 | |
+------+--------+
How can I transform it into a Python dictionary? I want my result to be like:
{'ID_1':['R_1','R_2'],'ID_2':['R_3'],'ID_3':[],'ID_4':['R_4'],'ID_5':[]}
What should I do to obtain it?
If you need to remove the missing values (for the non-existent entries), use Series.dropna in a lambda function inside GroupBy.apply:
d = df.groupby('ID')['2nd ID'].apply(lambda x: x.dropna().tolist()).to_dict()
print(d)
{'ID_1': ['R_1', 'R_2'], 'ID_2': ['R_3'], 'ID_3': [], 'ID_4': ['R_4'], 'ID_5': []}
Or use the fact that np.nan == np.nan returns False in a list comprehension to filter out missing values; see also the warning in the docs for more explanation.
d = df.groupby('ID')['2nd ID'].apply(lambda x: [y for y in x if y == y]).to_dict()
If you need to remove empty strings instead:
d = df.groupby('ID')['2nd ID'].apply(lambda x: [y for y in x if y != '']).to_dict()
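For reference, a minimal runnable sketch that reconstructs the example frame (using None for the blank cells, which pandas treats as missing) and applies the first solution:
import pandas as pd

df = pd.DataFrame({'ID': ['ID_1', 'ID_1', 'ID_2', 'ID_3', 'ID_4', 'ID_5'],
                   '2nd ID': ['R_1', 'R_2', 'R_3', None, 'R_4', None]})

# groups containing only missing values become empty lists
d = df.groupby('ID')['2nd ID'].apply(lambda x: x.dropna().tolist()).to_dict()
print(d)
# {'ID_1': ['R_1', 'R_2'], 'ID_2': ['R_3'], 'ID_3': [], 'ID_4': ['R_4'], 'ID_5': []}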
Apply a function over the rows of the dataframe that appends each value to your dict. apply is not in place, so it returns a Series of Nones while the dictionary is filled as a side effect.
d = {k: [] for k in df['ID'].unique()}  # one fresh list per key (dict.fromkeys would make every key share a single list)
def func(x):
    d[x['ID']].append(x['2nd ID'])
# will return a Series of Nones
df.apply(func, axis=1)
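A collections.defaultdict sidesteps pre-building the keys entirely; a sketch of the same idea:
from collections import defaultdict

d = defaultdict(list)
for _, row in df.iterrows():
    d[row['ID']].append(row['2nd ID'])  # missing values are appended as NaN unless filtered first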
Edit:
I asked about it on Gitter and @gurukiran07 gave me an answer. What you are trying to do is the reverse of the explode function:
>>> s = pd.Series([[1, 2, 3], [4, 5]])
>>> s
0    [1, 2, 3]
1       [4, 5]
dtype: object
>>> exploded = s.explode()
>>> exploded
0    1
0    2
0    3
1    4
1    5
dtype: object
>>> exploded.groupby(level=0).agg(list)
0    [1, 2, 3]
1       [4, 5]
dtype: object
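Applied to the original frame, the same idea is a plain groupby aggregation into lists (a sketch; missing values would need dropping first, as in the accepted approach):
d = df.groupby('ID')['2nd ID'].agg(list).to_dict()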
I have a dataframe like this:
|one|two|three|
| 1 | 2 | 4 |
| 4 | 6 | 3 |
| 2 | 4 | 9 |
How can I subtract the values of column one from the values of column two, and so on for each pair of columns, and then get the sum of the obtained values? Like this:
|one|two|three| one-two | one-three | two-three | SUM |
| 1 | 2 | 4 | -1 | -3 | -2 | -6 |
| 4 | 6 | 3 |
| 2 | 4 | 9 |
As a result I need a df with only the one-two, one-three, two-three columns and SUM.
You can try this:
from itertools import combinations
import pandas as pd
df = pd.DataFrame({'one': {0: 1, 1: 4, 2: 2},
'two': {0: 2, 1: 6, 2: 4},
'three': {0: 4, 1: 3, 2: 9}})
Create the column combinations using itertools.combinations:
## create column combinations
column_combinations = list(combinations(list(df.columns), 2))
Subtract each column pair and create a new column:
column_names = []
for column_comb in column_combinations:
    name = f"{column_comb[0]}_{column_comb[1]}"
    df[name] = df[column_comb[0]] - df[column_comb[1]]
    column_names.append(name)
df["SUM"] = df[column_names].sum(axis=1)
print(df)
Output:
   one  two  three  one_two  one_three  two_three  SUM
0    1    2      4       -1         -3         -2   -6
1    4    6      3       -2          1          3    2
2    2    4      9       -2         -7         -5  -14
A different approach computes the successive column differences with numpy.diff and sums them per row (note that np.diff gives two-one, three-two, etc., so the sign convention differs from the example in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
column_differences = df.apply(np.diff, axis=1)  # per-row successive differences
total_column_differences = np.sum(column_differences.to_list(), axis=1)
df['SUM'] = total_column_differences
print(df)
Gives the following.
a b c SUM
0 1 4 7 6
1 2 5 8 6
2 3 6 9 6
DataFrames make operations like this very easy:
df['new_column'] = df['colA'] - df['colB']
PyData has a great resource to learn more.
In your example:
import pandas as pd
df = pd.DataFrame(data=[[1,2,4],[4,6,3], [2,4,9]], columns=['one', 'two', 'three'])
df['one-two'] = df['one'] - df['two']
df['one-three'] = df['one'] - df['three']
df['two-three'] = df['two'] - df['three']
df['sum'] = df['one-two'] + df['one-three'] + df['two-three']
df.drop(columns=['one', 'two', 'three'], inplace=True)
print(df)
one-two one-three two-three sum
0 -1 -3 -2 -6
1 -2 1 3 2
2 -2 -7 -5 -14
Assuming you have only 3 columns and only want pairs of 2 features:
from itertools import combinations
from pandas import DataFrame
# Creating the DataFrame
df = DataFrame({'one': [1,4,2], 'two': [2,6,4], 'three': [4,3,9]})
# Getting the possible feature combinations
combs = combinations(df.columns, 2)
# Calculating the difference for each column pair
for comb in combs:
    df['-'.join(comb)] = df[comb[0]] - df[comb[1]]
# Adding the row totals to the DataFrame
df['SUM'] = df[df.columns[3:]].sum(axis=1)
one two three one-two one-three two-three SUM
0 1 2 4 -1 -3 -2 -6
1 4 6 3 -2 1 3 2
2 2 4 9 -2 -7 -5 -14
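If, as the question asks, only the difference columns and SUM should remain, the original columns can then be dropped (a sketch reusing the frame above):
df = df.drop(columns=['one', 'two', 'three'])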
I have a column with the categorical values [0 1 2 3 4 5]. I want to replace these values with only [1 2 3 4], in the following manner:
1 -> 1
2 -> 2
3 -> 3
0,4,5 -> 3
[figure: excel categorical map]
I tried this code:
bins = [0, 1, 2, 3, 4, np.inf]
names = ['4','1','2','3','4']
data['NEW_EDU'] = pd.cut(data['EDU'], bins, labels=names)
But I get:
ValueError: Categorical categories must be unique
You simply need to use isin(), selecting the EDU column explicitly on the left so that other columns are left untouched:
df.loc[df['EDU'].isin([0, 4, 5]), 'EDU'] = 3
Example:
df = pd.DataFrame({
'EDU': [1,2,3,4,5,0,4,2]
})
Output:
EDU
0 1
1 2
2 3
3 4
4 5
5 0
6 4
7 2
Then apply the replacement:
df.loc[df['EDU'].isin([0, 4, 5]), 'EDU'] = 3
Output:
EDU
0 1
1 2
2 3
3 3
4 3
5 3
6 3
7 2
An alternative way using a lambda expression:
df['NEW_EDU'] = df['EDU'].apply(lambda x: 3 if x in [0, 4, 5] else x)
Or using numpy.where:
import numpy as np
df['NEW_EDU'] = np.where(df["EDU"].isin([0, 4, 5]), 3, df["EDU"])
If, as @rafaelc suggests, it's important to keep the column as a categorical type:
df['NEW_EDU'] = pd.Categorical(np.where(df["EDU"].isin([0, 4, 5]), 3, df["EDU"]))
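Series.mask is another pandas-native way to express the same replacement; a minimal sketch:
df['NEW_EDU'] = df['EDU'].mask(df['EDU'].isin([0, 4, 5]), 3)  # replace where the condition holds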
Use a dictionary to map the new values to the keys:
value_dict = {1:1, 2:2, 3:3, 0:4, 4:4, 5:4}
Then iterate through the column and replace the keys with the mapped values.
df['NEW_EDU'] = [value_dict[x] for x in df['EDU']]
This lets you create an arbitrary mapping between any sets of values.
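A more idiomatic pandas spelling of the same mapping (a sketch; Series.map leaves unmapped values as NaN, so the dict must cover every value in the column):
df['NEW_EDU'] = df['EDU'].map(value_dict)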
How can I add columns of two dataframes (A + B), so that the result (C) takes into account missing values ('---')?
DataFrame A
a = pd.DataFrame({'A': [1, 2, 3, '---', 5]})
A
0 1
1 2
2 3
3 ---
4 5
DataFrame B
b = pd.DataFrame({'B': [3, 4, 5, 6, '---']})
B
0 3
1 4
2 5
3 6
4 ---
Desired Result of A+B
C
0 4
1 6
2 8
3 ---
4 ---
Replace the '---' with np.nan, add the columns, and fillna with '---':
(a['A'].replace('---', np.nan)+b['B'].replace('---', np.nan)).fillna('---')
You can assign the result to a new dataframe or an existing one. Note that assign returns a new frame rather than modifying in place, so capture the result:
df = pd.DataFrame()
df = df.assign(C=(a['A'].replace('---', np.nan) + b['B'].replace('---', np.nan)).fillna('---'))
or
a = a.assign(C=(a['A'].replace('---', np.nan) + b['B'].replace('---', np.nan)).fillna('---'))
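If the placeholder strings have left the columns with object dtype, pd.to_numeric with errors='coerce' is another way to turn them into NaN before adding; a sketch reusing a and b from the question:
c = (pd.to_numeric(a['A'], errors='coerce')
     + pd.to_numeric(b['B'], errors='coerce')).fillna('---')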
I have a dataframe that contains a column, let's call it "names", holding the names of other columns. I would like to add a new column that, for each row, takes its value from the column named in that row's "names" entry.
Example:
Input dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b']})
a | b | names |
--- | --- | ---- |
1 | -1 | 'a' |
2 | -2 | 'b' |
3 | -3 | 'a' |
4 | -4 | 'b' |
Output dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b'], "new_col":[1,-2,3,-4]})
a | b | names | new_col |
--- | --- | ---- | ------ |
1 | -1 | 'a' | 1 |
2 | -2 | 'b' | -2 |
3 | -3 | 'a' | 3 |
4 | -4 | 'b' | -4 |
You can use lookup:
df['new_col'] = df.lookup(df.index, df.names)
df
# a b names new_col
#0 1 -1 a 1
#1 2 -2 b -2
#2 3 -3 a 3
#3 4 -4 b -4
EDIT
lookup has been deprecated; here's the currently recommended solution:
import numpy as np

idx, cols = pd.factorize(df['names'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Because DataFrame.lookup is deprecated as of Pandas 1.2.0, the following is what I came up with using DataFrame.melt:
df['new_col'] = (df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False)
                   .query('names == variable')
                   .loc[df.index, 'value'])
Output:
>>> df
a b names new_col
0 1 -1 a 1
1 2 -2 b -2
2 3 -3 a 3
3 4 -4 b -4
Can this be simplified? For correctness, the index must not be ignored.
Additional reference:
Looking up values by index/column labels (archive)
Solution using pd.factorize (from https://github.com/pandas-dev/pandas/issues/39171#issuecomment-773477244):
idx, cols = pd.factorize(df['names'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
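A self-contained sketch of the factorize approach, including the assignment back to the frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [-1, -2, -3, -4],
                   'names': ['a', 'b', 'a', 'b']})
idx, cols = pd.factorize(df['names'])
# for each row, pick the value from the column named in 'names'
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]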
With the straightforward and easy solution (lookup) deprecated, another alternative to the pandas-based ones proposed here is to convert df into a numpy array and use numpy indexing:
df['new_col'] = df.values[df.index.get_indexer(df['names'].index), df.columns.get_indexer(df['names'])]
Let me explain what this does. df.values is a numpy array based on the DataFrame. As numpy arrays have to be indexed numerically, we need to use the get_indexer function to convert the pandas row and column index names to index numbers that can be used with numpy:
>>> df.index.get_indexer(df['names'].index)
array([0, 1, 2, 3], dtype=int64)
>>> df.columns.get_indexer(df['names'])
array([0, 1, 0, 1], dtype=int64)
(In this case, where the row index is already numerical, you could get away with simply using df.index as the first argument inside the bracket, but this does not work generally.)
Here's a short solution using df.melt and df.merge:
df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names'])
Outputs:
key_0 a b names value
0 0 1 -1 a 1
1 1 2 -2 b -2
2 2 3 -3 a 3
3 3 4 -4 b -4
There's a redundant key_0 column which you need to drop with df.drop.
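Chaining the drop onto the merge (a sketch, assuming the merge behaves exactly as shown above):
out = df.merge(df.melt(var_name='names', ignore_index=False),
               on=[None, 'names']).drop(columns='key_0')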