Sort a column based on another column - python

I am trying to sort one column according to another, and in doing so I deliberately create duplicates.
This is what my df looks like at the moment:
ticket magic Id
0 193454845 1311 1313
1 193454846 1927 1311
2 193454847 1810 1927
3 193454852 1313 NaN
What I want:
ticket magic Id
0 193454845 1311 1311
1 193454846 1927 1927
2 193454847 1810 NaN
3 193454852 1313 1313
The column "magic" and "Id" should be identical if no NaN.
Does anyone have an idea?
Thank you very much!

I guess the value in the Id column for magic = 1313 should be NaN instead.
data['Id'] = np.where(data['Id'].isna(), np.nan, data['magic'])
Update:
data.merge(data[['Id']], left_on='magic', right_on='Id', how='left',suffixes=['_x','']).drop(columns='Id_x')
ticket magic Id
0 193454845 1311 1311.0
1 193454846 1927 1927.0
2 193454847 1810 NaN
3 193454852 1313 1313.0
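
For completeness, here is a minimal, self-contained sketch of the merge approach, built from the sample frame above; out is just an illustrative name, and the final Int64 cast is optional, only there to remove the .0 that NaN forces on the Id column (it assumes a pandas version with nullable integer support):
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'ticket': [193454845, 193454846, 193454847, 193454852],
    'magic': [1311, 1927, 1810, 1313],
    'Id': [1313, 1311, 1927, np.nan],
})

# Self-merge: look up each magic value in the existing Id column;
# values with no match (here 1810) become NaN.
out = (data.merge(data[['Id']], left_on='magic', right_on='Id',
                  how='left', suffixes=['_x', ''])
           .drop(columns='Id_x'))

# Optional: nullable integer dtype so 1311.0 prints as 1311 again
out['Id'] = out['Id'].astype('Int64')
print(out)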

Data preprocessing using Python dataframes

I have a data frame with 2 streams of time-series values, and an attribute assigned to a few of the values.
VAL1  VAL2  ATT1  ATT2
1221  1221  O
1121  1228  O
1323  1425  O
1522  1222        X
1824  1128
1286  1221  O
1829  1245
1111  1421        X
1123  1622        X
1276  1282  O
1262  1542        X
1423  1228  O
I want an output where the attributes are alternating and are not repeated sequentially.
To do that, I want my logic to select the highest value amongst the first 3 rows from column VAL1 (i.e. 1323), and similarly the lowest value from VAL2 for runs of ATT2.
I tried to split the data frame into various chunks where attribute 1 or 2 is repeated sequentially and then find the largest value amongst the same, but it's not giving the desired result.
Eventually, I want the data frame to look something like this.
VAL1  VAL2  ATT1  ATT2
1323  1425  O
1522  1222        X
1286  1221  O
1111  1421        X
1276  1282  O
1262  1542        X
1423  1228  O
Also, I wish to create a list out of it, as follows:
list = [1323, 1222, 1286, 1421, 1276, 1542, 1423]
To alternate the rows, you do the following:
# create a df with only ATT1 in the right order.
df1 = df[~df['ATT1'].isna()].sort_values(by='VAL1', ascending=False).reset_index(drop=True)
# do the same but then with ATT2 and reversed order.
df2 = df[~df['ATT2'].isna()].sort_values(by='VAL2', ascending=True).reset_index(drop=True)
# On the second one, create an index that will fit neatly with the df1 index.
df2.index = df2.index + 0.5
# concat the df's and sort by index.
final_df = pd.concat([df1, df2]).sort_index().reset_index(drop=True)
print(final_df)
The result is this:
VAL1 VAL2 ATT1 ATT2
0 1423 1228 O NaN
1 1522 1222 NaN X
2 1323 1425 O NaN
3 1111 1421 NaN X
4 1286 1221 O NaN
5 1262 1542 NaN X
6 1276 1282 O NaN
7 1123 1622 NaN X
8 1221 1221 O NaN
9 1121 1228 O NaN
As to your final question (how to get a list of values alternating between the two columns), we can build on the previous split:
for_list = pd.concat([df1[['VAL1']], df2[['VAL2']].rename(columns={'VAL2': 'VAL1'})]).sort_index()
l = for_list['VAL1'].to_list()
print(l)
This will result in [1423, 1222, 1323, 1421, 1286, 1542, 1276, 1622, 1221, 1121]
Assumptions:
The empty values in the columns ATT1 and ATT2 of df are NaN (i.e. NaN, None, etc.). If those values are actually the empty string "", then replace .isna() in the following with .eq("").
There's no row where both columns ATT1 and ATT2 are filled.
You could try the following:
m = df["ATT1"].isna()
idx_O = df[~m].groupby(m.cumsum())["VAL1"].idxmax().to_list()
m = df["ATT2"].isna()
idx_X = df[~m].groupby(m.cumsum())["VAL2"].idxmin().to_list()
res = df.loc[sorted(idx_O + idx_X)].reset_index(drop=True)
Build a mask m where ATT1/ATT2 is NaN. Then pick the indices where VAL1/VAL2 is maximal/minimal within each group of consecutive non-NaN values by using .idxmax/.idxmin.
Then select the corresponding parts of df after sorting the selected indices.
Result for the sample:
VAL1 VAL2 ATT1 ATT2
0 1323 1425 O NaN
1 1522 1222 NaN X
2 1286 1221 O NaN
3 1111 1421 NaN X
4 1276 1282 O NaN
5 1262 1542 NaN X
6 1423 1228 O NaN
For the second part you could try:
values = res["VAL1"].where(res["ATT1"].notna(), res["VAL2"]).to_list()
Result for the sample:
[1323, 1222, 1286, 1421, 1276, 1542, 1423]
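
For reference, a self-contained sketch that reproduces this second approach end to end, assuming the blank attribute cells in the table above are NaN:
import pandas as pd

# Sample data from the question; missing ATT cells are None/NaN
df = pd.DataFrame({
    'VAL1': [1221, 1121, 1323, 1522, 1824, 1286, 1829, 1111, 1123, 1276, 1262, 1423],
    'VAL2': [1221, 1228, 1425, 1222, 1128, 1221, 1245, 1421, 1622, 1282, 1542, 1228],
    'ATT1': ['O', 'O', 'O', None, None, 'O', None, None, None, 'O', None, 'O'],
    'ATT2': [None, None, None, 'X', None, None, None, 'X', 'X', None, 'X', None],
})

# index of the max VAL1 within each run of consecutive ATT1 rows
m = df['ATT1'].isna()
idx_O = df[~m].groupby(m.cumsum())['VAL1'].idxmax().to_list()

# index of the min VAL2 within each run of consecutive ATT2 rows
m = df['ATT2'].isna()
idx_X = df[~m].groupby(m.cumsum())['VAL2'].idxmin().to_list()

res = df.loc[sorted(idx_O + idx_X)].reset_index(drop=True)

# alternating list: VAL1 for ATT1 rows, VAL2 for ATT2 rows
values = res['VAL1'].where(res['ATT1'].notna(), res['VAL2']).to_list()
print(values)  # [1323, 1222, 1286, 1421, 1276, 1542, 1423]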

Selecting rows that match a column in another dataframe

I have 2 dataframes, df_Participants and df_Movements, and I want to keep the rows in df_Movements only if the participant is in df_Participants.
df_Participants:
id
0 1053.0
1 1052.0
2 1049.0
df_Movements:
id participant
0 3902 1053
1 3901 1053
611 2763 979
612 2762 979
Expected results:
id participant
0 3902 1053
1 3901 1053
what I have tried so far:
remove_incomplete_submissions = True
if remove_incomplete_submissions:
    df_Movements = df_Movements.loc[df_Movements['participant'].isin(df_Participants['id'])]
When I check the number of unique participants, it does not match. I know this is simple, but I can't seem to notice the issue here.
You can use merge:
new_df = df_Participants.merge(df_Movements, how='left', left_on='id', right_on='participant')
Use isin to create a boolean mask and get rows that match the condition:
>>> df_Movements[df_Movements['participant'].isin(df_Participants['id'])]
id participant
0 3902 1053
1 3901 1053
Edit: as also suggested by @Ben.T in the comments.
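
For clarity, a minimal, self-contained version of the isin filter, using only the rows shown in the question:
import pandas as pd

df_Participants = pd.DataFrame({'id': [1053.0, 1052.0, 1049.0]})
df_Movements = pd.DataFrame({'id': [3902, 3901, 2763, 2762],
                             'participant': [1053, 1053, 979, 979]})

# True where the participant appears in df_Participants['id']
mask = df_Movements['participant'].isin(df_Participants['id'])
print(df_Movements[mask])
#      id  participant
# 0  3902         1053
# 1  3901         1053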

fillna(0) first but NaN value appears in iloc

df1.fillna(0)
Montant vente Marge
0 778283.75 13.63598
1 312271.20 9.26949
2 163214.65 14.50288
3 191000.20 9.55818
4 275970.00 12.76534
... ... ...
408 2999.80 14.60610
409 390.00 0.00000
410 699.00 26.67334
411 625.00 30.24571
412 0.00 24.79797
x = df1.iloc[:, 1:3]  # first slice selects rows, second selects columns
x
Marge
0 13.63598
1 9.26949
2 14.50288
3 9.55818
4 12.76534
... ...
408 14.60610
409 NaN
410 26.67334
411 30.24571
412 24.79797
413 rows × 1 columns
Why does line 409 have a value of 0.00000 first, but NaN after iloc?
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
You should be aware of which functions mutate the data frame and which don't. For example, fillna does not mutate the dataframe; it returns a new one. Either assign the result back:
df1 = df1.fillna(0)
or pass inplace=True:
df1.fillna(0, inplace=True)
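
A tiny sketch of the difference, using a hypothetical two-row frame (not your data):
import numpy as np
import pandas as pd

# hypothetical frame with one missing Marge value
df1 = pd.DataFrame({'Montant vente': [390.0, 699.0],
                    'Marge': [np.nan, 26.67334]})

df1.fillna(0)                      # returns a filled copy; df1 itself is unchanged
print(df1['Marge'].isna().sum())   # 1 -> the NaN is still there

df1 = df1.fillna(0)                # assign the result back
print(df1['Marge'].isna().sum())   # 0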

Pandas: map dictionary values on an existing column based on key from another column to replace NaN

I've had a good look and I can't seem to find the answer to this question. I want to replace all NaN values in the Department Code column of my DataFrame with values from a dictionary, using the Job Number column as the key into that dictionary. The data can be seen below. (Please note there are many extra columns; these are just the two relevant ones.)
df =
Job Number Department Code
0 3525 403
1 4555 NaN
2 5575 407
3 6515 407
4 7525 NaN
5 8535 102
6 3545 403
7 7455 102
8 3365 NaN
9 8275 403
10 3185 408
dict = {'4555': '012', '7525': '077', '3365': '034'}
What I am hoping the output to look like is:
Job Number Department Code
0 3525 403
1 4555 012
2 5575 407
3 6515 407
4 7525 077
5 8535 102
6 3545 403
7 7455 102
8 3365 034
9 8275 403
10 3185 408
Both columns are object dtype. I have tried the replace function, which I have used before, but that only replaces a value if the key is in the same column.
df['Department Code'].replace(dict, inplace=True)
This does not replace the NaN values.
I'm sure the answer is very simple and I apologise in advance, but I'm just stuck.
(Excuse my poor code display; it's handwritten, as I'm not sure how to export code from Python to here.)
It is better to avoid the variable name dict, because it shadows a Python builtin. Then use Series.map to look up each Job Number in the dictionary and Series.fillna to replace only the missing Department Code values; where there is no match, map returns NaN, so nothing is replaced:
d = {'4555': '012', '7525': '077', '3365': '034'}
df['Department Code'] = df['Department Code'].fillna(df['Job Number'].astype(str).map(d))
print (df)
Job Number Department Code
0 3525 403
1 4555 012
2 5575 407
3 6515 407
4 7525 077
5 8535 102
6 3545 403
7 7455 102
8 3365 034
9 8275 403
10 3185 408
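
A self-contained sketch of this map + fillna approach, assuming both columns really are strings (object dtype) as stated; if Job Number were numeric instead, the astype(str) cast used above would be needed before map:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Job Number': ['3525', '4555', '5575', '6515', '7525', '8535',
                   '3545', '7455', '3365', '8275', '3185'],
    'Department Code': ['403', np.nan, '407', '407', np.nan, '102',
                        '403', '102', np.nan, '403', '408'],
})
d = {'4555': '012', '7525': '077', '3365': '034'}

# map: look up each Job Number in d (NaN where there is no key);
# fillna: use that lookup only where Department Code is missing
df['Department Code'] = df['Department Code'].fillna(df['Job Number'].map(d))
print(df)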
Another way is to use set_index and fillna:
df['Department Code'] = (df.set_index('Job Number')['Department Code']
                           .fillna(d).values)
print(df)
Job Number Department Code
0 3525 403
1 4555 012
2 5575 407
3 6515 407
4 7525 077
5 8535 102
6 3545 403
7 7455 102
8 3365 034
9 8275 403
10 3185 408

How to unify (collapse) multiple columns into one assigning unique values

Edited my previous question:
I want to distinguish each Device (FOUR types) attached to a particular Building's particular Elevator (represented by its height).
As there are no unique IDs for the devices, I want to identify them and assign a unique ID to each by grouping on ('BldgID', 'BldgHt', 'Device'), so that any particular 'Device' can be identified.
I then want to count their testing results, i.e. how many times a device failed (NG) out of the total number of tests (NG + OK) on any particular date, over a duration of a few months.
The original dataframe looks like this:
BldgID BldgHt Device Date Time Result
1074 34.0 790 2018/11/20 10:30 OK
1072 31.0 780 2018/11/19 11:10 NG
1072 36.0 780 2018/11/17 05:30 OK
1074 10.0 790 2018/11/19 06:10 OK
1074 10.0 790 2018/12/20 11:50 NG
1076 17.0 760 2018/08/15 09:20 NG
1076 17.0 760 2018/09/20 13:40 OK
As 'Time' is irrelevant, I dropped it. I want to find the number of 'NG' per day for each set (consisting of 'BldgID', 'BldgHt', 'Device').
# aggregate both functions in a single groupby
df1 = (mel_df.groupby(['BldgID', 'BldgHt', 'Device', 'Date'])['Result']
       .agg([('NG', lambda x: (x == 'NG').sum()), ('ALL', 'count')])
       .round(2).reset_index())
# create New_ID by inserting a Series zero-filled to 3 digits
s = pd.Series(np.arange(1, len(mel_df2) + 1),
              index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Now the filtered DataFrame looks like:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 2
8 002 1076 17.0 760 2018/11/20 1 1
If I group by ['BldgID', 'BldgHt', 'Device', 'Date'] then I get the 'NG' count per day.
But that treats every day separately, whereas if I assign 'unique' IDs I can plot how each unique Device behaves on every single day.
If I group by ['BldgID', 'BldgHt', 'Device'] then I get the overall 'NG' count for that set (i.e. for a unique Device), which is not my goal.
What I want to achieve is:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
001 1072 31.0 780 2018/11/19 1 2
1072 31.0 780 2018/12/30 3 4
002 1076 17.0 760 2018/11/20 1 1
1076 17.0 760 2018/09/20 2 4
003 1072 36.0 780 2018/08/15 1 3
Any tips would be very much appreciated.
Use:
# aggregate both functions in a single groupby
df1 = (mel_df.groupby(['BldgID', 'BldgHt', 'Device', 'Date'])['Result']
       .agg([('NG', lambda x: (x == 'NG').sum()), ('ALL', 'count')])
       .round(2).reset_index())
# keep only rows with a non-zero NG count
mel_df2 = df1[df1.NG != 0]
# keep only the first row per Date
mel_df2 = mel_df2.drop_duplicates('Date')
# create New_ID by inserting a Series zero-filled to 3 digits
s = pd.Series(np.arange(1, len(mel_df2) + 1), index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Output from data from question:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 1
8 002 1076 17.0 780 2018/11/20 1 1
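
For reference, the same aggregation can also be written with keyword-style named aggregation, which avoids the list-of-tuples syntax. A minimal sketch, built from the sample rows in the question (Time already dropped); the exact rows that survive the NG and duplicate filters depend on your full data:
import numpy as np
import pandas as pd

# sample rows copied from the question
mel_df = pd.DataFrame({
    'BldgID': [1074, 1072, 1072, 1074, 1074, 1076, 1076],
    'BldgHt': [34.0, 31.0, 36.0, 10.0, 10.0, 17.0, 17.0],
    'Device': [790, 780, 780, 790, 790, 760, 760],
    'Date': ['2018/11/20', '2018/11/19', '2018/11/17', '2018/11/19',
             '2018/12/20', '2018/08/15', '2018/09/20'],
    'Result': ['OK', 'NG', 'OK', 'OK', 'NG', 'NG', 'OK'],
})

# NG count and total test count per (BldgID, BldgHt, Device, Date)
df1 = (mel_df.groupby(['BldgID', 'BldgHt', 'Device', 'Date'])['Result']
       .agg(NG=lambda x: (x == 'NG').sum(), ALL='count')
       .reset_index())

mel_df2 = df1[df1.NG != 0].drop_duplicates('Date')
mel_df2.insert(0, 'New_ID',
               pd.Series(np.arange(1, len(mel_df2) + 1),
                         index=mel_df2.index).astype(str).str.zfill(3))
print(mel_df2)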
