I am working with python to create a new frame starting from two frame by using Pandas.
The first frame (called frame1) is composed by the following line:
A B C D E
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
12 12 12 12 12
13 13 13 13 13
14 14 14 14 14
15 15 15 15 15
The second frame (called frame2) is:
A B C D E
19 19 19 19 19
24 24 24 24 24
29 29 29 29 29
34 34 34 34 34
39 39 39 39 39
44 44 44 44 44
49 49 49 49 49
54 54 54 54 54
59 59 59 59 59
64 64 64 64 64
69 69 69 69 69
74 74 74 74 74
79 79 79 79 79
84 84 84 84 84
89 89 89 89 89
94 94 94 94 94
99 99 99 99 99
Now i want to create a new dataset with this logic: starting from frame1 substitute every 5 row until the end of the frame1, the row of the frame1 with a random row of the frame2 (and remove the added row from frame2). A possible output should be:
A B C D E
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
59 59 59 59 59
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
29 29 29 29 29
11 11 11 11 11
12 12 12 12 12
13 13 13 13 13
14 14 14 14 14
84 84 84 84 84
How can i do this operation?
It's quite simple:
frame1.loc[4::5] = frame2.sample(frac=1).reset_index(drop=True)
where
df.loc[4::5] selects every fifth element, starting with the fifth one in df, and
df.sample(frac=1).reset_index(drop=True) shuffles a df around randomly
One way is to first obtain the indices where to update (we could also slice assign, but we'd have the problem of the end not being included), and then assign back taking a sample from df2 of the corresponding size:
ix = np.flatnonzero(np.diff(np.arange(df.shape[0]+1)//5))
df1.iloc[ix] = df2.sample(df1.shape[0]//5).to_numpy()
print(df1)
A B C D E
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 84 84 84 84 84
5 6 6 6 6 6
6 7 7 7 7 7
7 8 8 8 8 8
8 9 9 9 9 9
9 89 89 89 89 89
10 11 11 11 11 11
11 12 12 12 12 12
12 13 13 13 13 13
13 14 14 14 14 14
14 99 99 99 99 99
Related
Let's say I have the following dataframe:
ID stop x y z
0 202 9 20 27 4
1 202 2 23 24 13
2 1756 5 5 41 73
3 1756 3 7 42 72
4 1756 4 3 50 73
5 2153 14 121 12 6
6 2153 3 122.5 2 6
7 3276 1 54 33 -12
8 5609 9 -2 44 -32
9 5609 2 8 44 -32
10 5609 5 102 -23 16
I would like to change the ID values in order to have the smallest being 1, the second smallest being 2 etc.. So for my example, I would get this:
ID stop x y z
0 1 9 20 27 4
1 1 2 23 24 13
2 2 5 5 41 73
3 2 3 7 42 72
4 2 4 3 50 73
5 3 14 121 12 6
6 3 3 122.5 2 6
7 4 1 54 33 -12
8 5 9 -2 44 -32
9 5 2 8 44 -32
10 5 5 102 -23 16
Any idea please?
Thanks in advance!
You can use pd.Series.rank with method='dense'
df['ID'] = df['ID'].rank(method='dense').astype(int)
I have a text file which has a number of integer values like this.
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0 4 5 2
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27 34 54 11
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69 66 87 14
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67 82 92 17
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49 41 53 12
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47 29 36 21
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34 27 64 7
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10 7 11 1
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2 6 6 10
I have to make a file by merging several files like this but you guys can see a problem with this data.
In 4 and 5 lines, the first values, 1017 and 1106, right next to period index make a problem.
When I try to read these two lines, I always have had this result.
It came out that first values in first column next to index columns couldn't recognized as first values themselves.
In [14]: fw.iloc[80,:]
Out[14]:
3 72.0
4 46.0
5 52.0
6 29.0
7 29.0
8 22.0
9 204.0
10 41.0
11 46.0
12 51.0
13 57.0
14 67.0
15 82.0
16 92.0
17 17.0
18 NaN
Name: (20180722, 201807281017), dtype: float64
I tried to make it correct with indexing but failed.
The desirable result is,
In [14]: fw.iloc[80,:]
Out[14]:
2 1017.0
3 110.0
4 72.0
5 46.0
6 52.0
7 29.0
8 29.0
9 22.0
10 204.0
11 41.0
12 46.0
13 51.0
14 57.0
15 67.0
16 82.0
17 92.0
18 17.0
Name: (20180722, 201807281017), dtype: float64
How can I solve this problem?
+
I used this code to read this file.
fw = pd.read_csv('warm_patient.txt', index_col=[0,1], header=None, delim_whitespace=True)
A better fit for this would be pandas.read_fwf. For your example:
df = pd.read_fwf(filename, index_col=[0,1], header=None, widths=2*[10]+17*[4])
I don't know if the column widths can be inferred for all your data or need to be hardcoded.
One possibility would be to manually construct the dataframe, this way we can parse the text by splitting the values every 4 characters.
from textwrap import wrap
import pandas as pd
def read_file(f_name):
data = []
with open(f_name) as f:
for line in f.readlines():
idx1 = line[0:8]
idx2 = line[10:18]
points = map(lambda x: int(x.replace(" ", "")), wrap(line.rstrip()[18:], 4))
data.append([idx1, idx2, *points])
return pd.DataFrame(data).set_index([0, 1])
It could be made somewhat more efficient (in particular if this is a particularly long text file), but here's one solution.
fw = pd.read_csv('test.txt', header=None, delim_whitespace=True)
for i in fw[pd.isna(fw.iloc[:,-1])].index:
num_str = str(fw.iat[i,1])
a,b = map(int,[num_str[:-4],num_str[-4:]])
fw.iloc[i,3:] = fw.iloc[i,2:-1]
fw.iloc[i,:3] = [fw.iat[i,0],a,b]
fw = fw.set_index([0,1])
The result of print(fw) from there is
2 3 4 5 6 7 8 9 10 11 12 13 14 15 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69
20180722 20180728 1017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 20180804 1106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2
16 17 18
0 1
20180701 20180707 4 5 2.0
20180708 20180714 34 54 11.0
20180715 20180721 66 87 14.0
20180722 20180728 82 92 17.0
20180729 20180804 41 53 12.0
20180805 20180811 29 36 21.0
20180812 20180818 27 64 7.0
20180819 20180825 7 11 1.0
20180826 20180901 6 6 10.0
Here's the result of the print after applying your initial solution of fw = pd.read_csv('test.txt', index_col=[0,1], header=None, delim_whitespace=True) for comparison.
2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8
15 16 17 18
0 1
20180701 20180707 0 4 5 2.0
20180708 20180714 27 34 54 11.0
20180715 20180721 69 66 87 14.0
20180722 201807281017 82 92 17 NaN
20180729 201808041106 41 53 12 NaN
20180805 20180811 47 29 36 21.0
20180812 20180818 34 27 64 7.0
20180819 20180825 10 7 11 1.0
20180826 20180901 2 6 6 10.0
I have a dataframe with profit values, IDs, and week values. It looks a little like this
ID
Week
Profit
A
1
2
A
2
2
A
3
0
A
4
0
I want to create two new columns called "Bi-Weekly" and "Monthly", so week 1 would be label 2, week 2 would also be label 2, but week 3 would be labeled 4, and week 4 would be labeled 4, and they would all be labeled month 1, so I could groupby weekly, bi-weekly, or monthly profit as needed. Right now I've created two functions which work, but the weeks are going to go up to a year (52 weeks) so I was wondering if there's a more efficient way. My bi-weekly function below.
def biweek(prof_calc):
if (prof_calc['week']==2):
return 2
elif (prof_calc['week']==3):
return 2
elif (prof_calc['week']==4):
return 4
elif (prof_calc['week']==5):
return 4
elif (prof_calc['week']==6):
return 6
elif (prof_calc['week']==7):
return 6
elif (prof_calc['week']==8):
return 8
elif (prof_calc['week']==9):
return 8
elif (prof_calc['week']==10):
return 10
elif (prof_calc['week']==11):
return 10
prof_calc['BiWeek'] = prof_calc.apply(biweek, axis=1)
IIUC, you could try:
df["Biweekly"] = (df["Week"]-1)//2+1
df["Monthly"] = (df["Week"]-1)//4+1
>>> df
ID Week Profit Biweekly Monthly
0 A 1 42 1 1
1 A 2 69 1 1
2 A 3 53 2 1
3 A 4 63 2 1
4 A 5 56 3 2
5 A 6 57 3 2
6 A 7 86 4 2
7 A 8 23 4 2
8 A 9 35 5 3
9 A 10 10 5 3
10 A 11 25 6 3
11 A 12 21 6 3
12 A 13 39 7 4
13 A 14 82 7 4
14 A 15 76 8 4
15 A 16 20 8 4
16 A 17 97 9 5
17 A 18 67 9 5
18 A 19 21 10 5
19 A 20 22 10 5
20 A 21 88 11 6
21 A 22 67 11 6
22 A 23 33 12 6
23 A 24 38 12 6
24 A 25 8 13 7
25 A 26 67 13 7
26 A 27 16 14 7
27 A 28 49 14 7
28 A 29 3 15 8
29 A 30 17 15 8
30 A 31 79 16 8
31 A 32 19 16 8
32 A 33 21 17 9
33 A 34 9 17 9
34 A 35 56 18 9
35 A 36 83 18 9
36 A 37 1 19 10
37 A 38 53 19 10
38 A 39 66 20 10
39 A 40 55 20 10
40 A 41 85 21 11
41 A 42 90 21 11
42 A 43 34 22 11
43 A 44 3 22 11
44 A 45 9 23 12
45 A 46 28 23 12
46 A 47 58 24 12
47 A 48 14 24 12
48 A 49 42 25 13
49 A 50 69 25 13
50 A 51 76 26 13
51 A 52 49 26 13
I am working on a dataframe and I want to group the data for an hour into 4 different slots of 15 mins,
0-15 - 1st slot
15-30 - 2nd slot
30-45 - 3rd slot
45-00(or 60) - 4th slot
I am not even able to think, how to go forward with this
I tried extracting hours, minutes and seconds from the time, but what to do now?
Use integer division by 15 and then add 1:
df = pd.DataFrame({'M': range(60)})
df['slot'] = df['M'] // 15 + 1
print (df)
M slot
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 1
13 13 1
14 14 1
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 2
22 22 2
23 23 2
24 24 2
25 25 2
26 26 2
27 27 2
28 28 2
29 29 2
30 30 3
31 31 3
32 32 3
33 33 3
34 34 3
35 35 3
36 36 3
37 37 3
38 38 3
39 39 3
40 40 3
41 41 3
42 42 3
43 43 3
44 44 3
45 45 4
46 46 4
47 47 4
48 48 4
49 49 4
50 50 4
51 51 4
52 52 4
53 53 4
54 54 4
55 55 4
56 56 4
57 57 4
58 58 4
59 59 4
I want to convert N columns into one series. How to do it effectively?
Input:
0 1 2 3
0 64 98 47 58
1 80 94 81 46
2 18 43 79 84
3 57 35 81 31
Expected Output:
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
dtype: int64
So Far I tried:
print df[0].append(df[1]).append(df[2]).append(df[3]).reset_index(drop=True)
I'm not satisfied with my solution, moreover it won't work for dynamic columns. Please help me to find a better approach.
You can use unstack
pd.Series(df.unstack().values)
you need np.flatten
pd.Series(df.values.flatten(order='F'))
out[]
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
dtype: int64
Here's yet another short one.
>>> pd.Series(df.values.ravel(order='F'))
>>>
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
dtype: int64
You can also use Series class and .values attribute:
pd.Series(df.values.T.flatten())
Output:
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
dtype: int64
Use pd.melt() -
df.melt()['value']
Output
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
Name: value, dtype: int64
df.T.stack().reset_index(drop=True)
Out:
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
dtype: int64