Creating child pandas dataframe from a mother dataframe based on condition - python

I have a dataframe as follows. This dataframe is actually the result of an outer join of two tables.
IndentID  IndentNo  role_name  role_id  user_id  ishod  isdesignatedhod        Flag
     100      1234        xyz        3       17      1              nan  right_only
     nan       nan        nan       -1       -1   None             None  right_only
     nan       nan        nan        1       15   None             None  right_only
     nan       nan        nan      100      100   None                1  right_only
Objective: I want a resultant dataframe based on column conditions. The conditions are given below.
If ishod == 1, the resultant df will be:
IndentID  IndentNo  role_name  role_id  user_id
     100      1234        xyz        3       17
If ishod != 1 and isdesignatedhod == 1, the resultant df will be:
IndentID  IndentNo  role_name  role_id  user_id
     100      1234        xyz      100      100
I am really clueless on how to proceed on this. Any clue will be appreciated!!

To select rows based on a value in a certain column you can use the following notation:
df[ df["column_name"] == value_to_keep ]
Here is an example of this in action:
import pandas as pd
d = {'col1': [1,2,1,2,1,2,1,2,1,2,1,2,1,2,1],
     'col2': [3,4,5,3,4,5,3,4,5,3,4,5,3,4,5],
     'col3': [6,7,8,9,6,7,8,9,6,7,8,9,6,7,8]}
# create a dataframe
df = pd.DataFrame(d)
This is what df looks like:
In [17]: df
Out[17]:
col1 col2 col3
0 1 3 6
1 2 4 7
2 1 5 8
3 2 3 9
4 1 4 6
5 2 5 7
6 1 3 8
7 2 4 9
8 1 5 6
9 2 3 7
10 1 4 8
11 2 5 9
12 1 3 6
13 2 4 7
14 1 5 8
Now to select all rows for which the value is '2' in the first column:
df_1 = df[df["col1"] == 2]
In [19]: df_1
Out[19]:
col1 col2 col3
1 2 4 7
3 2 3 9
5 2 5 7
7 2 4 9
9 2 3 7
11 2 5 9
13 2 4 7
You can also combine multiple conditions this way:
df_2 = df[(df["col2"] >= 4) & (df["col3"] != 7)]
In [22]: df_2
Out[22]:
col1 col2 col3
2 1 5 8
4 1 4 6
7 2 4 9
8 1 5 6
10 1 4 8
11 2 5 9
14 1 5 8
Hope this example helps!

Andre gives the right answer. Also keep in mind the dtype of the columns ishod and isdesignatedhod: they are of "object" type, in this specific case strings.
So you have to use "quotes" when comparing these object columns with numbers.
df[df["ishod"] == "1"]

This should do approximately what you want:
import io
import pandas as pd

nan = float("nan")

def func(row):
    # ishod/isdesignatedhod hold strings here, hence the comparison with "1"
    if row["ishod"] == "1":
        return pd.Series([100, 1234, "xyz", 3, 17, nan, nan, nan], index=row.index)
    elif row["isdesignatedhod"] == "1":
        return pd.Series([100, 1234, "xyz", 100, 100, nan, nan, nan], index=row.index)
    else:
        return row

pd.read_csv(io.StringIO(
"""IndentID IndentNo role_name role_id user_id ishod isdesignatedhod Flag
100 1234 xyz 3 17 1 nan right_only
nan nan nan -1 -1 None None right_only
nan nan nan 1 15 None None right_only
nan nan nan 100 100 None 1 right_only
"""), sep=" +", engine='python')\
    .apply(func, axis=1)
Output:
IndentID IndentNo role_name role_id user_id ishod isdesignatedhod Flag
0 100.0 1234.0 xyz 3 17 NaN NaN NaN
1 NaN NaN NaN -1 -1 None None right_only
2 NaN NaN NaN 1 15 None None right_only
3 100.0 1234.0 xyz 100 100 NaN NaN NaN
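For comparison, here is a vectorized sketch of the same selection (my own variant, not part of the answer above). It assumes the identifying columns IndentID, IndentNo and role_name appear on only one row and can be filled across the frame, and that ishod/isdesignatedhod are strings as noted earlier:
id_cols = ["IndentID", "IndentNo", "role_name"]
keep = id_cols + ["role_id", "user_id"]

# spread the identifying values to every row, then select rows by condition
df[id_cols] = df[id_cols].ffill().bfill()
hod_df = df.loc[df["ishod"] == "1", keep]
designated_df = df.loc[(df["ishod"] != "1") & (df["isdesignatedhod"] == "1"), keep]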

Related

Merge a specific row from two dataframes

I have dataframes like this:
df1:
Name A B C
a b r t y U
0 xyz 1 2 3 4 3 4
1 abc 3 5 4 7 7 8
2 pqr 2 4 4 5 4 6
df2:
Name A B C
a b r t y U
0 xyz Nan Nan Nan Nan Nan Nan
1 abc 2 4 5 7 7 9
2 pqr Nan Nan Nan Nan Nan Nan
I want a df like this:
Name A B C
a b r t y U
0 xyz Nan Nan Nan Nan Nan Nan
1 abc 5 9 9 14 14 17
2 pqr Nan Nan Nan Nan Nan Nan
Basically I want the sum of the abc row only.
First check what the column names are; here the first one is obviously the tuple ('Name', ''), so set it as the index and then sum:
print (df1.columns.tolist())
print (df2.columns.tolist())
df1 = df1.set_index([('Name', '')])
df2 = df2.set_index([('Name', '')])
#set by position
#df1 = df1.set_index([df1.columns[0]])
#df2 = df2.set_index([df2.columns[0]])
df = df1.add(df2)
Or:
df = df1 + df2
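A runnable sketch of the idea, using hypothetical data shaped like the question's (the column tuples are assumptions based on the printout above):
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples([('Name', ''), ('A', 'a'), ('A', 'b'),
                                  ('B', 'r'), ('B', 't'), ('C', 'y'), ('C', 'U')])
df1 = pd.DataFrame([['xyz', 1, 2, 3, 4, 3, 4],
                    ['abc', 3, 5, 4, 7, 7, 8],
                    ['pqr', 2, 4, 4, 5, 4, 6]], columns=cols)
df2 = pd.DataFrame([['xyz'] + [np.nan] * 6,
                    ['abc', 2, 4, 5, 7, 7, 9],
                    ['pqr'] + [np.nan] * 6], columns=cols)

df1 = df1.set_index([('Name', '')])
df2 = df2.set_index([('Name', '')])
# NaN + number is NaN, so only the abc row gets real sums
print(df1 + df2)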

Create dataframe with hierarchical indices and extra columns from non-hierarchically indexed dataframe

Consider a simple dataframe:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(10).reshape(5,2))
print(x)
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I would like to create a hierarchically indexed dataframe of the form:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
where the 'a' columns correspond to the original dataframe columns and the 'b' columns are blank (or nan).
I can certainly create a hierarchically indexed dataframe with all NaNs and loop over the columns of the original dataframe, writing them into
the new dataframe. Is there something more compact than that?
You can do this with MultiIndex.from_product:
extra_level = ['a', 'b']
new_cols = pd.MultiIndex.from_product([x.columns, extra_level])
x.columns = new_cols[::len(extra_level)] # every first element of extra_level, i.e. the 'a' entries
x = x.reindex(columns=new_cols)
print(x)
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
Very much like @Ben.T I am using MultiIndex.from_product:
(x.assign(l='a')
  .set_index('l', append=True)
  .unstack()
  .reindex(pd.MultiIndex.from_product([x.columns.tolist(), ['a', 'b']]), axis=1))
Output:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN

How to insert an empty column after each column in an existing dataframe

I have a dataframe that looks as follows
df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [3, 4, 5, 6],
                   "C": [2, 3, 4, 5]})
I would like to insert an empty column (with type string) after each existing column in the dataframe, such that the output looks like
A col1 B col2 C col3
0 1 NaN 3 NaN 2 NaN
1 2 NaN 4 NaN 3 NaN
2 3 NaN 5 NaN 4 NaN
3 4 NaN 6 NaN 5 NaN
Actually there's a much simpler way thanks to reindex:
df.reindex([x for i, c in enumerate(df.columns, 1) for x in (c, f'col{i}')], axis=1)
Result:
A col1 B col2 C col3
0 1 NaN 3 NaN 2 NaN
1 2 NaN 4 NaN 3 NaN
2 3 NaN 5 NaN 4 NaN
3 4 NaN 6 NaN 5 NaN
Here's another, more complicated way:
import numpy as np
df.join(pd.DataFrame(np.empty(df.shape, dtype=object), columns=df.columns + '_sep')).sort_index(axis=1)
A A_sep B B_sep C C_sep
0 1 None 3 None 2 None
1 2 None 4 None 3 None
2 3 None 5 None 4 None
3 4 None 6 None 5 None
This solution worked for me, though note that as written it appends a single empty column named ' ' at the end rather than one after each existing column:
merged = pd.concat([myDataFrame, pd.DataFrame(columns=[' '])], axis=1)
This is what you can do:
for count in range(len(df.columns)):
    # note: this inserts the literal string 'NaN' in each new column
    df.insert(count*2 + 1, 'col' + str(count+1), 'NaN')
print(df)
Output:
A col1 B col2 C col3
0 1 NaN 3 NaN 2 NaN
1 2 NaN 4 NaN 3 NaN
2 3 NaN 5 NaN 4 NaN
3 4 NaN 6 NaN 5 NaN

pd.wide_to_long() lost data

I'm very new to Python. I've tried to reshape a data set using pd.wide_to_long. The original dataframe looks like this:
chk1 chk2 chk3 ... chf1 chf2 chf3 id var1 var2
0 3 4 2 ... nan nan nan 1 1 0
1 4 4 4 ... nan nan nan 2 1 0
2 2 nan nan ... 3 4 3 3 0 1
3 3 3 3 ... 3 2 2 4 1 0
I used the following code:
df2 = pd.wide_to_long(df,
                      stubnames=['chk', 'chf'],
                      i=['id', 'var1', 'var2'],
                      j='type')
When checking the data after running this code, it looks like this:
chk chf
id var1 var2 type
1 1 0 1 3 nan
2 4 nan
3 2 nan
4 nan nan
5 4 nan
6 nan nan
7 4 nan
8 4 nan
2 1 0 1 4 nan
2 4 nan
3 4 nan
4 5 nan
But when I check the columns in the new data set, it seems that all columns except 'chk' and 'chf' are gone!
df2.columns
Out[47]: Index(['chk', 'chf'], dtype='object')
for col in df2.columns:
    print(col)
chk
chf
From the data view it looks like 'id', 'var1' and 'var2' have been merged into one common index.
Can someone please help me? :)
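A note on the observed behavior (not from the original thread, but standard pd.wide_to_long behavior): the i columns ('id', 'var1', 'var2') and the j column become the row MultiIndex of the result rather than being dropped, so they can be recovered with reset_index. A minimal sketch:
# turn the index levels back into ordinary columns
df2 = df2.reset_index()
print(df2.columns)  # now includes 'id', 'var1', 'var2', the j column, 'chk', 'chf'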

Join on a fragment of a dataframe

I am trying to join a fragment of a dataframe with another one. The structure of the dataframe to join is simplified below:
left:
ID f1 TIME
1 10 1
3 10 1
7 10 1
9 10 2
2 10 2
1 10 2
3 10 2
right:
ID f2 f3
1 0 11
7 9 11
I need to select the left dataset by time and attach the right one to it; the result I would like to have is the following:
left:
ID f1 TIME f2 f3
1 10 1 0 11
3 10 1 nan nan
7 10 1 9 11
9 10 2 nan nan
2 10 2 nan nan
1 10 2 nan nan
3 10 2 nan nan
Currently I am usually joining dataframes in this way:
left = left.join(right.set_index('ID'), on='ID')
In this case I am using:
left[left.TIME == 1] = left[left.TIME == 1].join(right.set_index('ID'), on='ID')
I have also tried with merge, but the result is the left dataframe without any of the other columns.
Finally the structure of my script need to do this for every unique TIME in the dataframe, thus:
for t in numpy.unique(left.TIME):
    # do join on the fragment left.TIME == t
If I save the return value of the join in a new dataframe everything works fine, but assigning the result back into the left dataframe does not work.
EDIT: The IDs of the left dataset can be present multiple times, but not inside the same TIME value.
You can filter first by boolean indexing, then merge, and concat last:
df1 = left[left['TIME']==1]
#alternative
#df1 = left.query('TIME == 1')
df2 = left[left['TIME']!=1]
#alternative
#df2 = left.query('TIME != 1')
df = pd.concat([df1.merge(right, how='left'), df2])
print (df)
ID TIME f1 f2 f3
0 1 1 10 0.0 11.0
1 3 1 10 NaN NaN
2 7 1 10 9.0 11.0
3 9 2 10 NaN NaN
4 2 2 10 NaN NaN
5 1 2 10 NaN NaN
6 3 2 10 NaN NaN
EDIT: merge creates a default index, so a possible solution is to move the original index into a column first and then set it back afterwards:
print (left)
ID f1 TIME
10 1 10 1
11 3 10 1
12 7 10 1
13 9 10 2
14 2 10 2
15 1 10 2
16 3 10 2
#df = left.merge(right, how='left')
df1 = left[left['TIME']==1]
df2 = left[left['TIME']!=1]
df = pd.concat([df1.reset_index().merge(right, how='left').set_index('index'), df2])
print (df)
ID TIME f1 f2 f3
10 1 1 10 0.0 11.0
11 3 1 10 NaN NaN
12 7 1 10 9.0 11.0
13 9 2 10 NaN NaN
14 2 2 10 NaN NaN
15 1 2 10 NaN NaN
16 3 2 10 NaN NaN
EDIT: After discussion, with the modified input data it is possible to use:
df = left.merge(right, how='left', on=['ID','TIME'])
This is one way:
res = left.drop_duplicates('ID')\
          .merge(right, how='left')\
          .append(left[left.duplicated(subset=['ID'])])
# note: DataFrame.append was removed in pandas 2.0; use pd.concat there
# ID TIME f1 f2 f3
# 0 1 1 10 0.0 11.0
# 1 3 1 10 NaN NaN
# 2 7 1 10 9.0 11.0
# 3 9 2 10 NaN NaN
# 4 2 2 10 NaN NaN
# 5 1 2 10 NaN NaN
# 6 3 2 10 NaN NaN
Note that columns f2 and f3 become float since NaN is considered a float.
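If the float upcast is unwanted, one option (a sketch, not part of the original answer) is pandas' nullable integer dtype, which can represent missing values without falling back to float:
# convert the merged columns to nullable Int64; missing entries become <NA>
res[["f2", "f3"]] = res[["f2", "f3"]].astype("Int64")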
