Split dataframe column values on the prefix and make the prefixes headers in a Python dataframe - python

I have traffic data that looks like this. Each column has data in the format meters:seconds. For example, in row 1, column 2, 57:9 represents 57 meters and 9 seconds.
  0       1       2       3       4       5       6       7        8     9
0:0    57:9  166:34  178:37  203:44  328:63  344:65  436:77  737:108  None
0:0  166:34  178:37  203:43  328:61  436:74  596:51  737:106     None  None
0:0    57:6  166:30  178:33  203:40  328:62  344:64   436:74   596:91  None
0:0  203:43  328:61    None    None    None    None     None     None  None
0:0    57:7  166:20  178:43  203:10  328:61    None     None     None  None
I want to extract the meters values from the dataframe and store them in a list in ascending order, then create a new dataframe whose column headers are the meters values from that list. For each row of the parent dataframe, the meter value is matched against the headers and the corresponding seconds value placed under the right column. Any missing meters:seconds pair should become NaN, and the value that was at that position should shift to the correct column within the same row.
The desired outcome is:
list = [0, 57, 166, 178, 203, 328, 344, 436, 596, 737]
dataframe:
0   57  166  178  203  328  344  436  596  737
0    9   34   37   44   63   65   77  NaN  108
0  NaN   34   37   43   61  NaN   74   51  106
0    6   30   33   40   62   64   74   91  None
0  NaN  NaN  NaN   43   61  None None None None
0    7   20   43   10   61  None None None None
I know I probably need a loop to iterate over the whole dataframe. I am new to Python, so I am unable to solve this. I tried using str.split(), but it works only on one column. I have 98 columns and 290 rows, and this is just one month of data; I will have 12 months of data. So I need suggestions and help.

Try:
import pandas as pd

tmp = df1.apply(
    lambda x: dict(
        map(int, val.split(":"))
        for val in x
        if isinstance(val, str) and ":" in val
    ),
    axis=1,
).to_list()
out = pd.DataFrame(tmp)
print(out[sorted(out.columns)])
Prints:
0 57 166 178 203 328 344 436 596 737
0 0 9.0 34.0 37.0 44 63 65.0 77.0 NaN 108.0
1 0 NaN 34.0 37.0 43 61 NaN 74.0 51.0 106.0
2 0 6.0 30.0 33.0 40 62 64.0 74.0 91.0 NaN
3 0 NaN NaN NaN 43 61 NaN NaN NaN NaN
4 0 7.0 20.0 43.0 10 61 NaN NaN NaN NaN
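As a follow-up for the multi-month case mentioned in the question, one option is to wrap the above in a small function and concatenate the monthly results. This is only a sketch: monthly_dfs is a hypothetical list holding the twelve monthly dataframes, not a name from the question.

import pandas as pd

def meters_to_columns(df1):
    # Build one dict per row mapping meters -> seconds, then let the DataFrame
    # constructor align the dict keys into columns.
    tmp = df1.apply(
        lambda x: dict(
            map(int, val.split(":"))
            for val in x
            if isinstance(val, str) and ":" in val
        ),
        axis=1,
    ).to_list()
    out = pd.DataFrame(tmp)
    return out[sorted(out.columns)]

# monthly_dfs = [df_jan, df_feb, ...]   # hypothetical list of the 12 monthly frames
# result = pd.concat([meters_to_columns(m) for m in monthly_dfs], ignore_index=True)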

Related

How to delete rows which have nan or an empty value in a SPECIFIC column?

I have a dataframe which has nan or empty cells in a specific column, for example the column with index 2. Unfortunately I don't have the column name to pass as subset; I just have the index. I want to delete the rows which have this feature. On Stack Overflow there are too many solutions which use subset.
This is the dataframe for example:
12 125 36 45 665
15 212 12 65 62
65 9 nan 98 84
21 54 78 5 654
211 65 58 26 65
...
output:
12 125 36 45 665
15 212 12 65 62
21 54 78 5 654
211 65 58 26 65
If you need to test the third column (with index=2), use boolean indexing. This works whether nan is the missing value np.nan or the string 'nan':
idx = 2
df1 = df[df.iloc[:, idx].notna() & df.iloc[:, idx].ne('nan')]
#if no value is empty string or nan string or missing value NaN/None
#df1 = df[df.iloc[:, idx].notna() & ~df.iloc[:, idx].isin(['nan',''])]
print (df1)
0 1 2 3 4
0 12 125 36.0 45 665
1 15 212 12.0 65 62
3 21 54 78.0 5 654
4 211 65 58.0 26 65
If nans are missing values:
df1 = df.dropna(subset=df.columns[[idx]])
print (df1)
0 1 2 3 4
0 12 125 36.0 45 665
1 15 212 12.0 65 62
3 21 54 78.0 5 654
4 211 65 58.0 26 65
Not sure what you mean by
there are too many solutions which use subset
but the way to do this would be
df[~df.isna().any(axis=1)]
You can use notnull()
df = df.loc[df[df.columns[idx]].notnull()]

Loop for creating new columns and filling them with neighboring row values

I have the following df. I am going to dynamically create new columns based on the number of days (day_number=2), and conditionally fill them based on "code" and "count".
Current format:
code count
id date
ABC1 2019-04-04 1 76
2019-04-05 2 82
Desired matrix-like format:
code count code1_day1 code2_day1 code1_day2 code2_day2
id date
ABC1 2019-04-04 1 76 76 0 0 82
2019-04-05 2 82
I have done this, but it fills the same value in every column:
code = [1, 2]
for date, new in df.groupby(level=[0]):
    for col in range(day_number):  # day_number=2
        for lvl in code:
            new[f"day{col+1}_code1"] = new['count'].where(new['code'] == 1)
            new[f"day{col+1}_code2"] = new['count'].where(new['code'] == 2)
So many thanks for your help!
A bigger example of the database:
code count new-col1 new_col2 ......
id date
ABC1
2019-04-04 1 76 76 0 79 0 82 0 83 0 88 0 55 3 65 6
2019-04-05 1 79 79 0 82 0 83 0 88 0 55 3 65 6 101 10
2019-04-06 1 82 82 0 83 0 88 0 55 3 65 6 101 10 120 14
2019-04-07 2 83 83 0 88 0 55 3 65 6 101 10 120 14 0 0
2019-04-08 1 88 88 0 55 3 65 6 101 10 120 14 0 0 0 0
2019-04-09 1 55 55 3 65 6 101 10 120 14 0 0 0 0 10 0
2019-04-09 2 3 65 6 101 10 120 14 0 0 0 0 10 0
2019-04-10 1 65 101 10 120 14 0 0 0 0 10 0
2019-04-10 2 6 120 14 0 0 0 0 10 0
2019-04-11 1 101 0 0 0 0 10 0
Your sample data is not really usable, so I've simulated some.
Considered differently, the data is grouped, hence groupby() on ID (which is in the index) and code.
apply() after a groupby() is passed a sub-dataframe, so build the required columns on that dataframe.
import numpy as np
import pandas as pd

d = pd.date_range("01-jan-2021", "03-jan-2021")
df = pd.concat([
    pd.DataFrame({"ID": "ABC1", "date": d, "code": 1, "count": np.random.randint(20, 50, len(d))}),
    pd.DataFrame({"ID": "ABC1", "date": d, "code": 2, "count": np.random.randint(20, 50, len(d))})
]).sort_values(["ID", "date", "code"], ascending=[True, False, True]).set_index(["ID", "date"])

# pad an array with NaN to the same length as the second iterable
def nppad(a, s):
    return np.pad(a.astype(float), (0, len(s) - len(a)), "constant", constant_values=np.nan)

df2 = df.groupby(["ID", "code"]).apply(
    lambda dfa: dfa.assign(**{
        f"code{dfa.iloc[0,0]}_day{i+1}": nppad(dfa["count"].values[i:], dfa)
        for i in range(len(dfa))
    })
)
Output:
code count code1_day1 code1_day2 code1_day3 code2_day1 code2_day2 code2_day3
ID date
ABC1 2021-01-03 1 40 40.0 38.0 46.0 NaN NaN NaN
2021-01-03 2 37 NaN NaN NaN 37.0 33.0 33.0
2021-01-02 1 38 38.0 46.0 NaN NaN NaN NaN
2021-01-02 2 33 NaN NaN NaN 33.0 33.0 NaN
2021-01-01 1 46 46.0 NaN NaN NaN NaN NaN
2021-01-01 2 33 NaN NaN NaN 33.0 NaN NaN

Create new Pandas columns using the value from previous row

I need to create two new Pandas columns using the logic and value from the previous row.
I have the following data:
Day Vol Price Income Outgoing
1 499 75
2 3233 90
3 1812 70
4 2407 97
5 3474 82
6 1057 53
7 2031 68
8 304 78
9 1339 62
10 2847 57
11 3767 93
12 1096 83
13 3899 88
14 4090 63
15 3249 52
16 1478 52
17 4926 75
18 1209 52
19 1982 90
20 4499 93
My challenge is to come up with logic where both the Income and Outgoing columns (which are currently empty) hold the value of (Vol * Price).
But the Income column should carry this value only when the previous day's Price is lower than the present day's, and the Outgoing column should carry it only when the previous day's Price is higher. The rest of the Income and Outgoing cells should just hold NaN. If the Price is unchanged, that day's row should be dropped.
The whole logic should start from day n + 1: the first row is skipped and the logic applies from row 2 onwards.
I have tried using shift in my code example such as:
if sample_data['Price'].shift(1) < sample_data['Price'].shift(2):
    sample_data['Income'] = sample_data['Vol'] * sample_data['Price']
else:
    sample_data['Outgoing'] = sample_data['Vol'] * sample_data['Price']
But it isn't working.
I feel there should be a simpler, more comprehensive way to go about this. Could someone please help?
Update (The final output should look like this):
For day 16, the row is deleted because days 15 and 16 have the same price.
I'd calculate the product and the mask separately, and then update the cols:
In [11]: vol_price = df["Vol"] * df["Price"]
In [12]: incoming = df["Price"].diff() < 0
In [13]: df.loc[incoming, "Income"] = vol_price
In [14]: df.loc[~incoming, "Outgoing"] = vol_price
In [15]: df
Out[15]:
Day Vol Price Income Outgoing
0 1 499 75 NaN 37425.0
1 2 3233 90 NaN 290970.0
2 3 1812 70 126840.0 NaN
3 4 2407 97 NaN 233479.0
4 5 3474 82 284868.0 NaN
5 6 1057 53 56021.0 NaN
6 7 2031 68 NaN 138108.0
7 8 304 78 NaN 23712.0
8 9 1339 62 83018.0 NaN
9 10 2847 57 162279.0 NaN
10 11 3767 93 NaN 350331.0
11 12 1096 83 90968.0 NaN
12 13 3899 88 NaN 343112.0
13 14 4090 63 257670.0 NaN
14 15 3249 52 168948.0 NaN
15 16 1478 52 NaN 76856.0
16 17 4926 75 NaN 369450.0
17 18 1209 52 62868.0 NaN
18 19 1982 90 NaN 178380.0
19 20 4499 93 NaN 418407.0
Or is it the other way around:
In [21]: incoming = df["Price"].diff() > 0
In [22]: df.loc[incoming, "Income"] = vol_price
In [23]: df.loc[~incoming, "Outgoing"] = vol_price
In [24]: df
Out[24]:
Day Vol Price Income Outgoing
0 1 499 75 NaN 37425.0
1 2 3233 90 290970.0 NaN
2 3 1812 70 NaN 126840.0
3 4 2407 97 233479.0 NaN
4 5 3474 82 NaN 284868.0
5 6 1057 53 NaN 56021.0
6 7 2031 68 138108.0 NaN
7 8 304 78 23712.0 NaN
8 9 1339 62 NaN 83018.0
9 10 2847 57 NaN 162279.0
10 11 3767 93 350331.0 NaN
11 12 1096 83 NaN 90968.0
12 13 3899 88 343112.0 NaN
13 14 4090 63 NaN 257670.0
14 15 3249 52 NaN 168948.0
15 16 1478 52 NaN 76856.0
16 17 4926 75 369450.0 NaN
17 18 1209 52 NaN 62868.0
18 19 1982 90 178380.0 NaN
19 20 4499 93 418407.0 NaN
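The question also asks for days with an unchanged price to be dropped (day 16) and for the first day to carry no value. A minimal sketch of that variant, assuming the first reading (previous day's price lower meaning Income), could be:

import pandas as pd

change = df["Price"].diff()            # NaN for the first row
df = df[change != 0].copy()            # drop days whose price equals the previous day's (day 16)
change = change[change != 0]

vol_price = df["Vol"] * df["Price"]
df["Income"] = vol_price.where(change > 0)    # price rose versus the previous day
df["Outgoing"] = vol_price.where(change < 0)  # price fell versus the previous day

The first row keeps NaN in both columns because its diff is NaN, which satisfies neither condition.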

Adding a row from one dataframe into another by matching columns, with NaN values for missing columns, in pandas (Python)

The Scenario:
I have 2 dataframes, fc0 and yc0, where fc0 is a cluster and yc0 is another dataframe which needs to be merged into fc0.
The nature of the data is as follows:
fc0
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
yc0
iid uid 1 2 5 6 9 15
0 944 5.0 3.0 4.0 3.0 3.0 5.0
The Twist
I have 1682 columns in fc0 and a few hundred values in yc0. Now I need yc0 to go into fc0.
In my haste to resolve it, I even tried yc0.reset_index(inplace=True), but it wasn't really helpful.
Expected Output
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
944 5.0 3.0 NaN NaN 4.0 3.0 3.0
References
Link1: Tried this, but ended up inserting NaN values in the first 16 columns, with the rest of the data shifted by that many columns.
Link2: Couldn't match the column keys; besides, I tried it for rows.
Link3 Merging doesn't match the columns in it.
Link4 Concatenation doesn't work that way.
Link5 Same issues with Join.
EDIT 1
fc0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 235 entries, 234 to 468
Columns: 1683 entries, uid to 1682
dtypes: float64(1682), int64(1)
memory usage: 3.0 MB
and
yc0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 336 entries, uid to 1007
dtypes: float64(335), int64(1)
memory usage: 2.7 KB
Here's an MVCE. Does this small sample data show the functionality that you are expecting?
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, (5, 4)), columns=list('ABCE'))
A B C E
0 81 57 54 88
1 63 63 74 10
2 13 89 88 66
3 90 81 3 31
4 66 93 55 4
df2 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('BCDE'))
B C D E
0 93 48 62 25
1 24 97 52 88
2 53 50 21 13
3 81 27 7 81
4 10 21 77 19
df_out = pd.concat([df1,df2])
print(df_out)
Output:
A B C D E
0 81.0 57 54 NaN 88
1 63.0 63 74 NaN 10
2 13.0 89 88 NaN 66
3 90.0 81 3 NaN 31
4 66.0 93 55 NaN 4
0 NaN 93 48 62.0 25
1 NaN 24 97 52.0 88
2 NaN 53 50 21.0 13
3 NaN 81 27 7.0 81
4 NaN 10 21 77.0 19
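Applied to the frames from the question, the same idea would look roughly like the sketch below. It assumes the column labels in both frames share the same dtype (e.g. both ints) so they line up.

# Concatenate, then keep only fc0's columns: columns missing from yc0 (3, 4)
# become NaN, and columns present only in yc0 (iid, 9, 15) are dropped by the
# reindex, as in the expected output.
out = pd.concat([fc0, yc0], ignore_index=True).reindex(columns=fc0.columns)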

Calculating the duration of an event in a time series data frame (python 2.7)

I have a rather large pandas data frame which is a time series with a lot of different information for each time stamp (eye-tracking data).
Part of the data looks a bit like:
In [58]: df
Out[58]:
time event
49 44295 NaN
50 44311 NaN
51 44328 NaN
52 44345 2
53 44361 2
54 44378 2
55 44395 2
56 44411 2
57 44428 3
58 44445 3
59 44461 3
60 44478 3
61 44495 NaN
62 44511 NaN
63 44528 NaN
64 44544 NaN
65 44561 NaN
66 44578 NaN
67 44594 NaN
68 44611 4
69 44628 4
70 44644 4
71 44661 NaN
72 44678 NaN
I would like to calculate the (time) duration of each event as max(time) - min(time) for that event, e.g. for event 2: 44411 - 44345 = 66.
I would like this duration in a new column, so that the data ends up like this:
In [60]: df
Out[60]:
time event duration
49 44295 NaN NaN
50 44311 NaN NaN
51 44328 NaN NaN
52 44345 2 66
53 44361 2 66
54 44378 2 66
55 44395 2 66
56 44411 2 66
57 44428 3 50
58 44445 3 50
59 44461 3 50
60 44478 3 50
61 44495 NaN NaN
62 44511 NaN NaN
63 44528 NaN NaN
64 44544 NaN NaN
65 44561 NaN NaN
66 44578 NaN NaN
67 44594 NaN NaN
68 44611 4 33
69 44628 4 33
70 44644 4 33
71 44661 NaN NaN
72 44678 NaN NaN
How can I do that?
One way would be to use groupby and transform. max - min is also called peak-to-peak, or ptp for short, so "ptp" here stands in for lambda x: x.max() - x.min().
>>> df = pd.read_csv("eye.csv",sep="\s+")
>>> df["duration"] = df.dropna().groupby("event")["time"].transform("ptp")
>>> df
time event duration
49 44295 NaN NaN
50 44311 NaN NaN
51 44328 NaN NaN
52 44345 2 66
53 44361 2 66
54 44378 2 66
55 44395 2 66
56 44411 2 66
57 44428 3 50
58 44445 3 50
59 44461 3 50
60 44478 3 50
61 44495 NaN NaN
62 44511 NaN NaN
63 44528 NaN NaN
64 44544 NaN NaN
65 44561 NaN NaN
66 44578 NaN NaN
67 44594 NaN NaN
68 44611 4 33
69 44628 4 33
70 44644 4 33
71 44661 NaN NaN
72 44678 NaN NaN
The dropna was to prevent each NaN value in the event column from being considered its own event. (There's also something weird going on in how ptp works when the key is NaN too, but that's a separate issue.)
Iterate over the records using groupby from itertools. The grouping criterion should be the event number. As your data is properly ordered (all rows belonging to the same event are contiguous, not interrupted by other events), there is no need to sort by event code first.
groupby will iteratively return tuples (key, group), where key is the event code and group is an iterator over all the records in that group.
From those records, pick the minimal and maximal time and calculate the duration.
Then do your work to attach the duration as a new field on your records, as sketched below.
There might be more efficient methods using pandas which I am not aware of, but the described solution does not require pandas.
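A minimal sketch of that itertools-based approach, using a few made-up (time, event) records in the same shape as the data above (the values are illustrative, not the real data):

from itertools import groupby

records = [(44295, None), (44345, 2), (44361, 2), (44411, 2),
           (44428, 3), (44478, 3), (44495, None), (44611, 4), (44644, 4)]

durations = {}
for event, group in groupby(records, key=lambda rec: rec[1]):
    times = [t for t, _ in group]
    if event is not None:
        durations[event] = max(times) - min(times)

# Attach the duration to each record as a new field (None when there is no event).
with_duration = [(t, ev, durations.get(ev)) for t, ev in records]
print(durations)  # {2: 66, 3: 50, 4: 33}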
I ended up doing the following workaround to the answer posted by @DSM:
df["dur"] = datalist[i][j].groupby("event")["time"].transform("ptp")
dur = []
for i in datalist.index:
if np.isnan(df["event"][i]):
dur.append(df["event"][i])
else:
dur.append(df["dur"][i])
df["Duration"] = dur
This at least works for me.
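For what it's worth, a vectorized sketch of that same loop (assuming df already has the "dur" column computed as above) would be:

# Keep "dur" where an event code exists, NaN otherwise.
df["Duration"] = df["dur"].where(df["event"].notna())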
