How do I drop the 0, 1 and 2? - python

I have the code below, trying to check dob_years for suspicious values and count the percentages:
df['dob_years'].value_counts()
The result is below
35 614
41 603
40 603
34 597
38 595
42 592
33 577
39 572
31 556
36 553
44 543
29 543
30 536
48 536
37 531
43 510
50 509
32 506
49 505
28 501
45 494
27 490
52 483
56 482
47 480
54 476
46 469
58 461
57 457
53 457
51 446
55 441
59 441
26 406
60 376
25 356
61 353
62 351
63 268
24 263
64 263
23 252
65 194
66 183
22 183
67 167
21 110
0 100
68 99
69 83
2 76
70 65
71 58
20 51
1 47
72 33
19 14
73 8
74 6
75 1
How do I drop the ages showing as 0, 1, and 2?
I tried the code below, but it didn't work:
df.drop(df[(df['dob_years'] = 0) & (df['dob_years'] = 1)].index, inplace=True)

The statement df['dob_years'].value_counts() takes a Series and returns another Series. The result in your question is a Series whose index holds the dob_years values and whose values are the counts.
To follow the suggestions from Jon Clements and others, you will have to convert it into a DataFrame using the to_frame function. Consider this code:
import pandas as pd
# create the data frame
df = pd.read_csv('dob.csv')
# create the counts Series and convert it into a DataFrame
df1 = df['dob_years'].value_counts().to_frame("counts")
# convert the DataFrame index into a column
df1.reset_index(inplace=True)
# rename the column index to dob_years
df1 = df1.rename(columns = {'index':'dob_years'})
# dropping the required rows from DataFrame
df1 = df1[~df1['dob_years'].isin([0, 1, 2])]
print(df1)
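If the goal is to remove those ages from the original DataFrame itself, rather than only from the counts, a boolean filter with isin avoids drop entirely. A minimal sketch, using made-up sample data in place of the original dob.csv:

```python
import pandas as pd

# Made-up sample data standing in for the original file
df = pd.DataFrame({'dob_years': [35, 0, 41, 1, 2, 40]})

# Keep only the rows whose age is not 0, 1, or 2
df = df[~df['dob_years'].isin([0, 1, 2])].reset_index(drop=True)
print(df['dob_years'].tolist())  # [35, 41, 40]
```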

Related

Select pandas dataframe where row and column== 0 to F

I have a dataframe A of index and column labelled 0 to F (0-15) in hex.
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 99 124 119 123 242 107 111 197 48 1 103 43 254 215 171 118
1 202 130 201 125 250 89 71 240 173 212 162 175 156 164 114 192
2 183 253 147 38 54 63 247 204 52 165 229 241 113 216 49 21
3 4 199 35 195 24 150 5 154 7 18 128 226 235 39 178 117
4 9 131 44 26 27 110 90 160 82 59 214 179 41 227 47 132
5 83 209 0 237 32 252 177 91 106 203 190 57 74 76 88 207
6 208 239 170 251 67 77 51 133 69 249 2 127 80 60 159 168
7 81 163 64 143 146 157 56 245 188 182 218 33 16 255 243 210
8 205 12 19 236 95 151 68 23 196 167 126 61 100 93 25 115
9 96 129 79 220 34 42 144 136 70 238 184 20 222 94 11 219
A 224 50 58 10 73 6 36 92 194 211 172 98 145 149 228 121
B 231 200 55 109 141 213 78 169 108 86 244 234 101 122 174 8
C 186 120 37 46 28 166 180 198 232 221 116 31 75 189 139 138
D 112 62 181 102 72 3 246 14 97 53 87 185 134 193 29 158
E 225 248 152 17 105 217 142 148 155 30 135 233 206 85 40 223
F 140 161 137 13 191 230 66 104 65 153 45 15 176 84 187 22
I did dataframe A by this
df_sbox=pd.DataFrame(from_a_2d_nparray)
df_sbox.index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'A', 'B', 'C', 'D', 'E', 'F']
df_sbox.columns = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'A', 'B', 'C', 'D', 'E', 'F']
I want to select A where index == 0 - F and column == 0 - F and assign it to a 2D matrix.
What can I use for selecting A where "index == 0 - F and column == 0 - F" in one statement?
You can use hex with pandas.DataFrame.loc:
num1 = 10 #row 'A' in hex
num2 = 3 #column 3
df.loc[hex(num1)[2:].upper(), hex(num2)[2:].upper()]
#10
Explanation
You can use Python's built-in function hex to get the hexadecimal representation of an integer:
hex(12)
#0xc
Since we are not interested in the first two characters ("0x"), we can omit them by slicing the str:
hex(12)[2:] #from index 2 onwards
#c
Since the dataframe uses uppercase for its indices and columns, we can use str.upper to match them:
hex(12)[2:].upper()
#'C'
Additional
You can also get the upper-case hex representation using the Standard Format Specifiers:
"{:X}".format(43)
#2B
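As an aside on the larger goal of assigning the whole 0-F by 0-F block to a 2D matrix in one statement: since that selection covers every row and column, calling to_numpy() (or .values) on the frame should suffice. A sketch with dummy data in place of the original S-box values:

```python
import numpy as np
import pandas as pd

# Dummy 16x16 data with the same mixed int/str labels as df_sbox
labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'A', 'B', 'C', 'D', 'E', 'F']
df_sbox = pd.DataFrame(np.arange(256).reshape(16, 16), index=labels, columns=labels)

# The full frame as a plain 2D NumPy matrix, in one statement
matrix = df_sbox.to_numpy()
print(matrix.shape)  # (16, 16)
```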

How can I transform a DataFrame into many temporal features in Python?

I have this dataframe:
Timestamp DATA0 DATA1 DATA2 DATA3 DATA4 DATA5 DATA6 DATA7
0 1.478196e+09 219 128 220 27 141 193 95 50
1 1.478196e+09 95 237 27 121 90 194 232 137
2 1.478196e+09 193 22 103 217 138 195 153 172
3 1.478196e+09 181 120 186 73 120 239 121 218
4 1.478196e+09 70 194 36 16 81 129 95 217
... ... ... ... ... ... ... ... ... ...
242 1.478198e+09 15 133 112 2 236 81 94 252
243 1.478198e+09 0 123 163 160 13 156 145 32
244 1.478198e+09 83 147 61 61 33 199 147 110
245 1.478198e+09 172 95 87 220 226 99 108 176
246 1.478198e+09 123 240 180 145 132 213 47 60
I need to create a temporal features like this:
Timestamp DATA0 DATA1 DATA2 DATA3 DATA4 DATA5 DATA6 DATA7
0 1.478196e+09 219 128 220 27 141 193 95 50
1 1.478196e+09 95 237 27 121 90 194 232 137
2 1.478196e+09 193 22 103 217 138 195 153 172
3 1.478196e+09 181 120 186 73 120 239 121 218
4 1.478196e+09 70 194 36 16 81 129 95 217
Timestamp DATA0 DATA1 DATA2 DATA3 DATA4 DATA5 DATA6 DATA7
1 1.478196e+09 95 237 27 121 90 194 232 137
2 1.478196e+09 193 22 103 217 138 195 153 172
3 1.478196e+09 181 120 186 73 120 239 121 218
4 1.478196e+09 70 194 36 16 81 129 95 217
5 1.478196e+09 121 69 111 204 134 92 51 190
Timestamp DATA0 DATA1 DATA2 DATA3 DATA4 DATA5 DATA6 DATA7
2 1.478196e+09 193 22 103 217 138 195 153 172
3 1.478196e+09 181 120 186 73 120 239 121 218
4 1.478196e+09 70 194 36 16 81 129 95 217
5 1.478196e+09 121 69 111 204 134 92 51 190
6 1.478196e+09 199 132 39 197 159 242 153 104
How can I do this automatically? What structure should I use, and what functions? I was told that the dataframe should become an array of arrays, but that's not very clear to me.
If I understand it correctly, you want e.g. a list of dataframes, where each dataframe is a progressing slice of the original frame. This example would give you a list of dataframes:
import pandas as pd
# dummy dataframe
df = pd.DataFrame({'col_1': range(10), 'col_2': range(10)})
# returns slices of size slice_length with step size 1
slice_length = 5
lst = [df.iloc[i:i+slice_length,: ] for i in range(df.shape[0] - slice_length)]
Please note that you are duplicating a lot of data and thus increasing memory usage. If you merely have to perform an operation on subsequent slices, it is better to loop over the dataframe and apply your function slice by slice. Better still, if possible, try to vectorize your operation, as this will likely make a huge difference in performance.
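As a quick sanity check on the comprehension above: with the dummy 10-row frame and a slice_length of 5, it should produce 5 overlapping slices:

```python
import pandas as pd

# Same dummy frame and comprehension as above
df = pd.DataFrame({'col_1': range(10), 'col_2': range(10)})
slice_length = 5
lst = [df.iloc[i:i + slice_length, :] for i in range(df.shape[0] - slice_length)]

print(len(lst))                  # 5
print(lst[0]['col_1'].tolist())  # [0, 1, 2, 3, 4]
print(lst[1]['col_1'].tolist())  # [1, 2, 3, 4, 5]
```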
EDIT: saving the slices to file:
If you're only interested in saving the slices to file (e.g. in a csv), you don't need to first create a list of all slices (with the associated memory usage). Instead, loop over the slices (by looping over the starting indices that define each slice), and save each slice to file.
slice_length = 5
# loop over starting indices (i.e. slices)
for idx_from in range(df.shape[0] - slice_length):
    # create the slice and write it to file
    df.iloc[idx_from: idx_from + slice_length, :].to_csv(f'slice_starting_idx_{idx_from}.csv', sep=';', index=False)
Hi, I have tried this, which might match your expectations, based on indexes:
import numpy as np
import pandas as pd
x = np.array([[8, 9], [2, 3], [9, 10], [25, 78], [56, 67], [56, 67], [72, 12], [98, 24],
              [8, 9], [2, 3], [9, 10], [25, 78], [56, 67], [56, 67], [72, 12], [98, 24]])
df = pd.DataFrame(np.reshape(x, (16, 2)), columns=['Col1', 'Col2'])
print(df)
print("**********************************")
count = df['Col1'].count()  # number of rows in the dataframe
i = 0       # start index for each iteration
n = 4       # end index for each iteration
count2 = 3  # important: for 4 rows per slice, set count2 to 4-1, i.e. 3; for 5 rows, set it to 5-1, i.e. 4
while count != 0:    # loop until count reaches 0
    df1 = df[i:n]    # first iteration i=0, n=4; second iteration i=4, n=8; and so on
    if i > 0:
        print(df1.set_index(np.arange(i - count2, n - count2)))
        count2 = count2 + 3  # increment count2 so the index runs 0 to 3, then 1 to 4, and so on
    else:
        print(df1.set_index(np.arange(i, n)))
    i = n
    count = count - 4
    n = n + 4
First output of Dataframe
Col1 Col2
0 8 9
1 2 3
2 9 10
3 25 78
4 56 67
5 56 67
6 72 12
7 98 24
8 8 9
9 2 3
10 9 10
11 25 78
12 56 67
13 56 67
14 72 12
15 98 24
Final Ouput
Col1 Col2
0 8 9
1 2 3
2 9 10
3 25 78
Col1 Col2
1 56 67
2 56 67
3 72 12
4 98 24
Col1 Col2
2 8 9
3 2 3
4 9 10
5 25 78
Col1 Col2
3 56 67
4 56 67
5 72 12
6 98 24
Note: I am also new to Python; there may be shorter ways to achieve the expected output.

Values in pandas dataframe not getting sorted

I have a dataframe as shown below:
Category 1 2 3 4 5 6 7 8 9 10 11 12 13
A 424 377 161 133 2 81 141 169 297 153 53 50 197
B 231 121 111 106 4 79 68 70 92 93 71 65 66
C 480 379 159 139 2 116 148 175 308 150 98 82 195
D 88 56 38 40 0 25 24 55 84 36 24 26 36
E 1084 1002 478 299 7 256 342 342 695 378 175 132 465
F 497 246 283 206 4 142 151 168 297 224 194 198 148
H 8 5 4 3 0 2 3 2 7 5 3 2 0
G 3191 2119 1656 856 50 826 955 739 1447 1342 975 628 1277
K 58 26 27 51 1 18 22 42 47 35 19 20 14
S 363 254 131 105 6 82 86 121 196 98 81 57 125
T 54 59 20 4 0 9 12 7 36 23 5 4 20
O 554 304 207 155 3 130 260 183 287 204 98 106 195
P 756 497 325 230 5 212 300 280 448 270 201 140 313
PP 64 43 26 17 1 15 35 17 32 28 18 9 27
R 265 157 109 89 1 68 68 104 154 96 63 55 90
S 377 204 201 114 5 112 267 136 209 172 147 90 157
St 770 443 405 234 5 172 464 232 367 270 290 136 294
Qs 47 33 11 14 0 18 14 19 26 17 5 6 13
Y 1806 626 1102 1177 14 625 619 1079 1273 981 845 891 455
W 123 177 27 28 0 18 62 34 64 27 14 4 51
Z 2770 1375 1579 1082 17 900 1630 1137 1465 1383 861 755 1201
I want to sort the dataframe by values in each row. Once done, I want to sort the index also.
For example the values in first row corresponding to category A, should appear as:
2 50 53 81 133 141 153 161 169 197 297 377 424
I have tried df.sort_values(by=df.index.tolist(), ascending=False, axis=1), but this doesn't work; the values don't appear in sorted order at all.
np.sort + sort_index
You can use np.sort along axis=1, then sort_index:
cols, idx = df.columns[1:], df.iloc[:, 0]
res = pd.DataFrame(np.sort(df.iloc[:, 1:].values, axis=1), columns=cols, index=idx)\
.sort_index()
print(res)
1 2 3 4 5 6 7 8 9 10 11 12 \
Category
A 2 50 53 81 133 141 153 161 169 197 297 377
B 4 65 66 68 70 71 79 92 93 106 111 121
C 2 82 98 116 139 148 150 159 175 195 308 379
D 0 24 24 25 26 36 36 38 40 55 56 84
E 7 132 175 256 299 342 342 378 465 478 695 1002
F 4 142 148 151 168 194 198 206 224 246 283 297
G 50 628 739 826 856 955 975 1277 1342 1447 1656 2119
H 0 0 2 2 2 3 3 3 4 5 5 7
K 1 14 18 19 20 22 26 27 35 42 47 51
O 3 98 106 130 155 183 195 204 207 260 287 304
P 5 140 201 212 230 270 280 300 313 325 448 497
PP 1 9 15 17 17 18 26 27 28 32 35 43
Qs 0 5 6 11 13 14 14 17 18 19 26 33
R 1 55 63 68 68 89 90 96 104 109 154 157
S 6 57 81 82 86 98 105 121 125 131 196 254
S 5 90 112 114 136 147 157 172 201 204 209 267
St 5 136 172 232 234 270 290 294 367 405 443 464
T 0 4 4 5 7 9 12 20 20 23 36 54
W 0 4 14 18 27 27 28 34 51 62 64 123
Y 14 455 619 625 626 845 891 981 1079 1102 1177 1273
Z 1 17 755 861 900 1082 1137 1375 1383 1465 1579 1630
One way is to apply sorted with axis=1, apply pd.Series to return a dataframe instead of a column of lists, and finally set the index and sort by Category:
df.loc[:, '1':].apply(sorted, axis=1).apply(pd.Series)\
    .set_index(df.Category).sort_index()
Category 0 1 2 3 4 5 6 7 8 9 10 ...
0 A 2 50 53 81 133 141 153 161 169 197 297 ...
1 B 4 65 66 68 70 71 79 92 93 106 111 ...

Populating a Pandas dataframe with data from a list in a unique order of indexes?

So I have webscraped data from an MLB betting site aggregator and have my data points in two lists. The first list is all of the teams, formatted so that teamlist[1] and teamlist[2] are playing each other, then teamlist[3] and teamlist[4] play each other, and so on. Each row index is a team, and each column index is a betting site.
site1|site2|site3|site4|...
team1
team2
team3
team4
...
This outlines the general form.
I have figured out the pattern for where each betting odd needs to go, but I cannot figure out how to input them properly.
I apologize, I do not have the reputation to post the actual image, so I must use a link instead. This outlines the structure I need to index; the data points are the indexes they need to go to. As you can see, df[0,0] = moneylines[0] and df[0,1] = moneylines[1]. My primary issue is that once I make it through the first two rows (which are done in the same loop) and it tries to go to the third row, it reindexes over the first two rows. link
Here is the code I am currently using to populate the DataFrame. moneylines is the list of betting odds I am trying to populate the dataframe with, and teams is the row index:
ctr = 0
for t in range(0, int(len(teams) / 2)):
    for m in range(14):
        df.ix[m, t] = moneylines[ctr]
        df.ix[m, t + 1] = moneylines[ctr + 1]
        ctr = ctr + 2
Please let me know if there is anything else I can include to help solve this question.
Your issue is due to your first for loop: you increment t one by one, so:
first loop :
t = 0
you fill line 0 and line 1
then
t = 1
you fill line 1 and line 2
and so on...
You should use instead of :
for t in range(0,int(len(teams)/2)):
this:
for t in range(0, len(teams), 2)
NB: You can also multiply t by 2 in the index, but it's not as logical as using the above solution.
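Putting the suggestion together, the corrected loop might look like the sketch below. The team names, site count, and odds are dummy stand-ins for the scraped data, and .iloc replaces the deprecated .ix:

```python
import pandas as pd

teams = ['team1', 'team2', 'team3', 'team4']    # dummy team names
n_sites = 3                                     # stand-in for the 14 betting sites
moneylines = list(range(len(teams) * n_sites))  # dummy odds

df = pd.DataFrame(index=teams, columns=range(n_sites))

ctr = 0
for t in range(0, len(teams), 2):   # step by 2 so each pass fills a fresh pair of rows
    for m in range(n_sites):
        df.iloc[t, m] = moneylines[ctr]          # first team of the pair
        df.iloc[t + 1, m] = moneylines[ctr + 1]  # second team of the pair
        ctr += 2
print(df)
```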
I hope it helps,
I'm posting an alternative to looping over the values of a dataframe, which you can avoid pretty easily here, because doing so loses the efficiency boost of using a dataframe in the first place.
It's not entirely clear to me what the formatting of your starting data is, but if, say, you have a series s with values 0 through 195:
s = pd.Series(range(196))
Then, using numpy.reshape you could get the pairings:
>>>s.values.reshape((len(s)//2, 2))
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
...,
[190, 191],
[192, 193],
[194, 195]])
And using it again you could get the desired output:
>>>pd.DataFrame(s.values.reshape((len(s)//2, 2)).T.reshape((len(s)//14, 14))).sort_values(0)
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 0 2 4 6 8 10 12 14 16 18 20 22 24 26
7 1 3 5 7 9 11 13 15 17 19 21 23 25 27
1 28 30 32 34 36 38 40 42 44 46 48 50 52 54
8 29 31 33 35 37 39 41 43 45 47 49 51 53 55
2 56 58 60 62 64 66 68 70 72 74 76 78 80 82
9 57 59 61 63 65 67 69 71 73 75 77 79 81 83
3 84 86 88 90 92 94 96 98 100 102 104 106 108 110
10 85 87 89 91 93 95 97 99 101 103 105 107 109 111
4 112 114 116 118 120 122 124 126 128 130 132 134 136 138
11 113 115 117 119 121 123 125 127 129 131 133 135 137 139
5 140 142 144 146 148 150 152 154 156 158 160 162 164 166
12 141 143 145 147 149 151 153 155 157 159 161 163 165 167
6 168 170 172 174 176 178 180 182 184 186 188 190 192 194
13 169 171 173 175 177 179 181 183 185 187 189 191 193 195

Issue calling time series data by date

So I have been working with the following data:
Date AB1 BC1 MB1 NWT1 SK1 Total1 AB2 BC2 MB2 SK2 Total2
0 2007-01-05 305 76 1 0 36 418 324 64 0 23 417
1 2007-01-12 427 95 5 0 58 585 435 82 2 62 586
2 2007-01-19 481 102 4 0 65 652 460 77 3 63 606
3 2007-01-26 491 98 6 0 59 654 506 79 4 70 664
4 2007-02-02 459 95 6 2 55 617 503 79 5 71 660
5 2007-02-09 459 88 5 4 61 617 493 73 4 68 641
6 2007-02-16 450 83 5 5 60 603 486 74 5 68 636
....
And I am using the following code to read and parse the data; I am now trying to select a row by 'sdate'.
def readcsv3():
    csv_data = read_csv(file3, dtype=object, parse_dates=[0])
    csv_data3 = csv_data.values
    return csv_data3

def otherrigs():
    sdate = '2007-01-26'
    df = readcsv3()
    df = DataFrame(df, columns=['Date', 'AB1', 'BC1', 'MB1', 'NWT1', 'SK1', 'Total1',
                                'AB2', 'BC2', 'MB2', 'SK2', 'Total2'])
    print(df[sdate])
Now I get the following error:
KeyError: '2007-01-26'
Process finished with exit code 1
Any suggestions?
It looks like you are trying to access the row containing '2007-01-26', but your syntax is trying to pull a column with that name. Try:
print(df[df['Date'] == sdate])
As an aside, if you keep the result of pd.read_csv() as a DataFrame (rather than taking .values and rebuilding the frame), it will retain those column names for you.
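A minimal sketch of that fix, keeping the result of read_csv as a DataFrame and filtering on the Date column; the inline CSV here is dummy data standing in for file3:

```python
import io
import pandas as pd

# Dummy CSV standing in for file3
csv_text = """Date,AB1,BC1
2007-01-19,481,102
2007-01-26,491,98
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

sdate = '2007-01-26'
row = df[df['Date'] == sdate]  # boolean filter on the Date column
print(row['AB1'].iloc[0])      # 491
```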
