Add series to dataframe which is an intersection of two others - python

Suppose I have a dataframe like this
column1 column2
1 8
2 9
20 1
4 2
56
6
2
I want a result like this:
column1 column2 column3
1 8 1
2 9 2
20 1
4 2
56
6
2
So column3 should hold the values that appear in both column1 and column2.

Using set.intersection with pd.DataFrame.loc:
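import numpy as np
# values that occur in both columns
L = list(set(df['column1']) & set(df['column2']))
# write them into the first len(L) rows of a new column
df.loc[np.arange(len(L)), 'column3'] = L
print(df)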
column1 column2 column3
0 1 8.0 1.0
1 2 9.0 2.0
2 20 1.0 NaN
3 4 2.0 NaN
4 56 NaN NaN
5 6 NaN NaN
6 2 NaN NaN
Be aware that this isn't vectorised and goes somewhat against the grain of Pandas / NumPy, which is why the solution falls back on regular Python objects.
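For comparison, here is a minimal sketch of the same idea that stays inside NumPy by using np.intersect1d instead of Python sets. The sample frame is rebuilt from the question, and the missing column2 entries are assumed to be NaN. Note that intersect1d returns the common values sorted and de-duplicated, which may or may not be the order you want.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; missing column2 cells assumed NaN.
df = pd.DataFrame({
    "column1": [1, 2, 20, 4, 56, 6, 2],
    "column2": [8, 9, 1, 2, np.nan, np.nan, np.nan],
})

# Sorted unique values present in both columns (NaN dropped first).
common = np.intersect1d(df["column1"].to_numpy(),
                        df["column2"].dropna().to_numpy())

# Assigning a shorter Series aligns on the index; remaining rows become NaN.
df["column3"] = pd.Series(common)
print(df)
```

Because the assignment aligns on the index, this sketch assumes df has the default 0..n-1 index.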

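column1 = [1, 2, 20, 4, 56, 6, 2]
column2 = [8, 9, 1, 2]
list_1 = []
# compare every value of column1 with every value of column2
for item1 in column1:
    for item2 in column2:
        if item1 == item2:
            list_1.append(item1)
        else:
            print("NO MATCH")
# drop duplicates (2 occurs twice in column1)
z = list(set(list_1))
print(z)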

Related

Pandas backward fill from another column

I have a df like this:
val1 val2
9 3
2 .
9 4
1 .
5 1
How can I use bfill on val2 but referencing val1, such that the dataframe results in:
val1 val2
9 3
2 9
9 4
1 3
5 1
So the missing values in val2 are the previous value, BUT taken from val1.
You can fill NA values in the second column with the first column shifted down one row:
>>> import pandas as pd
>>> df = pd.DataFrame({"val1": [9, 2, 9, 1, 5], "val2": [3, None, 4, None, 1]})
>>> df
val1 val2
0 9 3.0
1 2 NaN
2 9 4.0
3 1 NaN
4 5 1.0
>>> df["val2"] = df["val2"].fillna(df["val1"].shift())
>>> df
val1 val2
0 9 3.0
1 2 9.0
2 9 4.0
3 1 9.0
4 5 1.0
I guess you want ffill, not bfill.
STEPS:
Use mask to set the values in the val1 column to NaN wherever val2 is NaN.
ffill the masked val1 column and save the result in the variable m.
Fill the NaN values in val2 with m.
m = df.val1.mask(df.val2.isna()).ffill()
df.val2 = df.val2.fillna(m)
OUTPUT:
val1 val2
0 9 3
1 2 9
2 9 4
3 1 9
4 5 1
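The two answers above agree on the question's data but are not equivalent in general. The following sketch, using made-up sample values with several consecutive gaps in val2, shows where they diverge: the shift approach takes val1 from the row directly above, while the mask-and-ffill approach takes val1 from the last row where val2 was present.

```python
import numpy as np
import pandas as pd

# Hypothetical data with three consecutive missing values in val2.
df = pd.DataFrame({"val1": [9, 2, 7, 1, 5],
                   "val2": [3, np.nan, np.nan, np.nan, 1]})

# Approach 1: fill from val1 of the row directly above.
shifted = df["val2"].fillna(df["val1"].shift())

# Approach 2: fill from val1 of the last row where val2 was not missing.
m = df["val1"].mask(df["val2"].isna()).ffill()
carried = df["val2"].fillna(m)

print(pd.DataFrame({"shifted": shifted, "carried": carried}))
```

Which of the two is right depends on what "previous value" is meant to be when gaps come in runs.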

Create dataframe with hierarchical indices and extra columns from non-hierarchically indexed dataframe

Consider a simple dataframe:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(10).reshape(5,2))
print(x)
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I would like to create a hierarchically indexed dataframe of the form:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
where the 'a' columns correspond to the original dataframe columns and the 'b' columns are blank (or nan).
I can certainly create a hierarchically indexed dataframe with all NaNs and loop over the columns of the original dataframe, writing them into
the new dataframe. Is there something more compact than that?
You can do this with MultiIndex.from_product:
extra_level = ['a', 'b']
new_cols = pd.MultiIndex.from_product([x.columns, extra_level])
x.columns = new_cols[::len(extra_level)] # keep only the entries paired with the first element of extra_level
x = x.reindex(columns=new_cols)
print(x)
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
Very much like @Ben.T, I am using MultiIndex.from_product:
(x.assign(l='a')
  .set_index('l', append=True)
  .unstack()
  .reindex(pd.MultiIndex.from_product([x.columns.tolist(), ['a','b']]), axis=1))
Output:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
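As a further variation, here is a sketch that builds the same layout with pd.concat and dictionary keys rather than reindexing. This is not taken from either answer; it assumes an all-NaN frame with the same shape as x is acceptable for the 'b' columns.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.arange(10).reshape(5, 2))

# An all-NaN frame with the same index and columns as x.
blank = pd.DataFrame(np.nan, index=x.index, columns=x.columns)

# Concatenating a dict puts its keys on the outer column level: ('a', 0), ('a', 1), ...
result = pd.concat({"a": x, "b": blank}, axis=1)

# Move the original column labels to the outer level and order the columns.
result = result.swaplevel(axis=1).sort_index(axis=1)
print(result)
```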

Creating child pandas dataframe from a mother dataframe based on condition

I have a dataframe as follows. It is actually the result of an outer join of two tables.
IndentID IndentNo role_name role_id user_id ishod isdesignatedhod Flag
100 1234 xyz 3 17 1 nan right_only
nan nan nan -1 -1 None None right_only
nan nan nan 1 15 None None right_only
nan nan nan 100 100 None 1 right_only
Objective: I want a resultant dataframe based on column conditions. The conditions are given below
if ishod == 1 the resultant df will be:
IndentID IndentNo role_name role_id user_id
100 1234 xyz 3 17
if ishod!=1 and isdesignatedhod==1 the resultant df will be:
IndentID IndentNo role_name role_id user_id
100 1234 xyz 100 100
I am really clueless on how to proceed on this. Any clue will be appreciated!!
To select rows based on a value in a certain column, you can use the following notation:
df[ df["column_name"] == value_to_keep ]
Here is an example of this in action:
import pandas as pd
d = {'col1': [1,2,1,2,1,2,1,2,1,2,1,2,1,2,1],
     'col2': [3,4,5,3,4,5,3,4,5,3,4,5,3,4,5],
     'col3': [6,7,8,9,6,7,8,9,6,7,8,9,6,7,8]}
# create a dataframe
df = pd.DataFrame(d)
This is what df looks like:
In [17]: df
Out[17]:
col1 col2 col3
0 1 3 6
1 2 4 7
2 1 5 8
3 2 3 9
4 1 4 6
5 2 5 7
6 1 3 8
7 2 4 9
8 1 5 6
9 2 3 7
10 1 4 8
11 2 5 9
12 1 3 6
13 2 4 7
14 1 5 8
Now to select all rows for which the value is '2' in the first column:
df_1 = df[df["col1"] == 2]
In [19]: df_1
Out[19]:
col1 col2 col3
1 2 4 7
3 2 3 9
5 2 5 7
7 2 4 9
9 2 3 7
11 2 5 9
13 2 4 7
You can also combine multiple conditions this way:
df_2 = df[(df["col2"] >= 4) & (df["col3"] != 7)]
In [22]: df_2
Out[22]:
col1 col2 col3
2 1 5 8
4 1 4 6
7 2 4 9
8 1 5 6
10 1 4 8
11 2 5 9
14 1 5 8
Hope this example helps!
Andre gives the right answer. Also keep in mind the dtype of the ishod and isdesignatedhod columns. They are of "object" dtype, in this specific case strings.
So you have to use quotes when comparing these object columns with numbers.
df[df["ishod"] == "1"]
This should do approximately what you want:
import io
import pandas as pd
nan = float("nan")
def func(row):
    # the flag columns hold strings, so compare against "1" rather than 1
    if row["ishod"] == "1":
        return pd.Series([100, 1234, "xyz", 3, 17, nan, nan, nan], index=row.index)
    elif row["isdesignatedhod"] == "1":
        return pd.Series([100, 1234, "xyz", 100, 100, nan, nan, nan], index=row.index)
    else:
        return row
# the two flag columns are read as strings (an added assumption) so that the "1" comparisons in func apply
pd.read_csv(io.StringIO(
"""IndentID IndentNo role_name role_id user_id ishod isdesignatedhod Flag
100 1234 xyz 3 17 1 nan right_only
nan nan nan -1 -1 None None right_only
nan nan nan 1 15 None None right_only
nan nan nan 100 100 None 1 right_only
"""), sep=" +", engine='python', dtype={"ishod": str, "isdesignatedhod": str})\
.apply(func,axis=1)
Output:
IndentID IndentNo role_name role_id user_id ishod isdesignatedhod Flag
0 100.0 1234.0 xyz 3 17 NaN NaN NaN
1 NaN NaN NaN -1 -1 None None right_only
2 NaN NaN NaN 1 15 None None right_only
3 100.0 1234.0 xyz 100 100 NaN NaN NaN
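To tie the answers above together, here is a minimal sketch that selects the rows for each condition with boolean masks instead of apply. The frame is rebuilt by hand from the question, with the flag columns assumed to hold strings as described above; pd.to_numeric with errors="coerce" converts them to numbers first. Unlike the apply version, it only selects existing rows and does not copy IndentID, IndentNo, or role_name from another row.

```python
import pandas as pd

# Hand-built approximation of the question's frame (flag columns as strings).
df = pd.DataFrame({
    "IndentID": [100, None, None, None],
    "IndentNo": [1234, None, None, None],
    "role_name": ["xyz", None, None, None],
    "role_id": [3, -1, 1, 100],
    "user_id": [17, -1, 15, 100],
    "ishod": ["1", None, None, None],
    "isdesignatedhod": [None, None, None, "1"],
})
keep = ["IndentID", "IndentNo", "role_name", "role_id", "user_id"]

# Convert the string flags to numbers; anything unparseable becomes NaN.
ishod = pd.to_numeric(df["ishod"], errors="coerce")
isdesignated = pd.to_numeric(df["isdesignatedhod"], errors="coerce")

hod_rows = df.loc[ishod == 1, keep]
designated_rows = df.loc[(ishod != 1) & (isdesignated == 1), keep]
print(hod_rows)
print(designated_rows)
```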

Slicing each dataframe row into 3 windows with different slicing ranges

I want to slice each row of my dataframe into 3 windows, using slice indices that are stored in another dataframe and change for each row. Afterwards I want to return a single dataframe containing the windows in the form of a MultiIndex. Rows that are shorter than the longest row in a window should be padded with NaN values.
Since my actual dataframe has around 100,000 rows and 600 columns, I am looking for an efficient solution.
Consider the following example:
This is my dataframe, which I want to slice into 3 windows:
>>> df
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
1 8 9 10 11 12 13 14 15
2 16 17 18 19 20 21 22 23
And the second dataframe containing my slicing indices having the same count of rows as df:
>>> df_slice
0 1
0 3 5
1 2 6
2 4 7
I've tried slicing the windows, like so:
first_window = df.iloc[:, :df_slice.iloc[:, 0]]
first_window.columns = pd.MultiIndex.from_tuples([("A", c) for c in first_window.columns])
second_window = df.iloc[:, df_slice.iloc[:, 0] : df_slice.iloc[:, 1]]
second_window.columns = pd.MultiIndex.from_tuples([("B", c) for c in second_window.columns])
third_window = df.iloc[:, df_slice.iloc[:, 1]:]
third_window.columns = pd.MultiIndex.from_tuples([("C", c) for c in third_window.columns])
result = pd.concat([first_window,
second_window,
third_window], axis=1)
Which gives me the following error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'> with these indexers [0 3
1 2
2 4
Name: 0, dtype: int64] of <class 'pandas.core.series.Series'>
My expected output is something like this:
>>> result
A B C
0 1 2 3 4 5 6 7 8 9 10
0 0 1 2 NaN 3 4 NaN NaN 5 6 7
1 8 9 NaN NaN 10 11 12 13 14 15 NaN
2 16 17 18 19 20 21 22 NaN 23 NaN NaN
Is there an efficient solution for my problem without iterating over each row of my dataframe?
Here's a solution using melt and then pivot_table, plus some logic to:
Identify the three groups 'A', 'B', and 'C'.
Shift the columns to the left, so that NaN would only appear at the right side of each window.
Rename columns to get the expected output.
t = df.reset_index().melt(id_vars="index")
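# df_slice's columns are 0 and 1; add_prefix renames them to c_0 and c_1, the names used below
t = pd.merge(t, df_slice.add_prefix("c_"), left_on="index", right_index=True)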
t.variable = pd.to_numeric(t.variable)
t.loc[t.variable < t.c_0,"group"] = "A"
t.loc[(t.variable >= t.c_0) & (t.variable < t.c_1), "group"] = "B"
t.loc[t.variable >= t.c_1, "group"] = "C"
# shift relevant values to the left
shift_val = t.groupby(["group", "index"]).variable.transform("min") - t.groupby(["group"]).variable.transform("min")
t.variable = t.variable - shift_val
# extract a, b, and c groups, and create a multi-level index for their
# columns
df_a = pd.pivot_table(t[t.group == "A"], index= "index", columns="variable", values="value")
df_a.columns = pd.MultiIndex.from_product([["a"], df_a.columns])
df_b = pd.pivot_table(t[t.group == "B"], index= "index", columns="variable", values="value")
df_b.columns = pd.MultiIndex.from_product([["b"], df_b.columns])
df_c = pd.pivot_table(t[t.group == "C"], index= "index", columns="variable", values="value")
df_c.columns = pd.MultiIndex.from_product([["c"], df_c.columns])
res = pd.concat([df_a, df_b, df_c], axis=1)
res.columns = pd.MultiIndex.from_tuples([(c[0], i) for i, c in enumerate(res.columns)])
print(res)
The output is:
a b c
0 1 2 3 4 5 6 7 8 9 10
index
0 0.0 1.0 2.0 NaN 3.0 4.0 NaN NaN 5.0 6.0 7.0
1 8.0 9.0 NaN NaN 10.0 11.0 12.0 13.0 14.0 15.0 NaN
2 16.0 17.0 18.0 19.0 20.0 21.0 22.0 NaN 23.0 NaN NaN
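The group-identification step can also be sketched with NumPy broadcasting, which avoids melting the frame just to label cells. This is only an illustration of that one step under the question's sample data, not a replacement for the full reshaping above; the names cols, start, stop and group are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(3, 8))
df_slice = pd.DataFrame({0: [3, 2, 4], 1: [5, 6, 7]})

cols = np.arange(df.shape[1])            # column positions 0..7
start = df_slice[0].to_numpy()[:, None]  # per-row start of window B, shape (3, 1)
stop = df_slice[1].to_numpy()[:, None]   # per-row start of window C, shape (3, 1)

# Broadcasting (3, 1) against (8,) labels every cell with its window.
group = np.where(cols < start, "A", np.where(cols < stop, "B", "C"))
print(group)
```

Each row of group then says which window every column of the corresponding df row belongs to.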

Pandas sum two columns, skipping NaN

If I add two columns to create a third, any columns containing NaN (representing missing data in my world) cause the resulting output column to be NaN as well. Is there a way to skip NaNs without explicitly setting the values to 0 (which would lose the notion that those values are "missing")?
In [42]: frame = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 4]})
In [44]: frame['c'] = frame['a'] + frame['b']
In [45]: frame
Out[45]:
a b c
0 1 3 4
1 2 NaN NaN
2 NaN 4 NaN
In the above, I would like column c to be [4, 2, 4].
Thanks...
With fillna():
frame['c'] = frame.fillna(0)['a'] + frame.fillna(0)['b']
or, as suggested:
frame['c'] = frame.a.fillna(0) + frame.b.fillna(0)
giving:
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
Another approach:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
As an expansion of the answer above, note that frame[["a", "b"]].sum(axis=1) returns 0 for a row in which every value is NaN:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN 0
If you want the sum of an all-NaN row to be NaN, you can pass the min_count argument, as described in the docs:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1, min_count=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN NaN
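The variants above can be compared side by side in one runnable sketch. The frame here adds a fourth, all-NaN row to the question's data, as in the last two outputs. It also shows Series.add with fill_value=0, which is not mentioned in the answers: it treats a single missing operand as 0 but leaves the result missing when both operands are missing.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, np.nan, np.nan],
                      "b": [3, np.nan, 4, np.nan]})

# Row-wise sum that skips NaN; an all-NaN row sums to 0.
plain = frame[["a", "b"]].sum(axis=1)

# Require at least one non-NaN value per row, otherwise the result is NaN.
strict = frame[["a", "b"]].sum(axis=1, min_count=1)

# Element-wise add where one missing side counts as 0, but two missing sides stay NaN.
added = frame["a"].add(frame["b"], fill_value=0)

print(pd.DataFrame({"plain": plain, "strict": strict, "added": added}))
```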
