I have this file
0 0 716
0 1 851
0 2 900
1 0 724
1 1 857
1 2 903
2 0 812
2 1 858
2 2 902
3 0 799
3 1 852
3 2 905
4 0 833
4 1 871
4 2 907
5 0 940
5 1 955
5 2 995
6 0 941
6 1 956
6 2 996
7 0 942
7 1 957
7 2 999
8 0 944
8 1 958
8 2 992
9 0 946
9 1 952
9 2 998
I want to write the third-column values in this order:
0 0 716
1 0 724
2 0 812
3 0 799
4 0 833
0 1 851
1 1 857
2 1 858
3 1 852
4 1 871
0 2 900
1 2 903
2 2 902
3 2 905
4 2 907
5 0 940
6 0 941
7 0 942
8 0 944
9 0 946
5 1 955
6 1 956
7 1 957
8 1 958
9 1 952
5 2 995
6 2 996
7 2 999
8 2 992
9 2 998
I have read the file:
l = [line.rstrip('\n') for line in open('test.txt')]
Now I am stuck: how do I read this as a 3D array? Using the enumerate function does not work because it includes the first value on its own, which I do not need.
This works (Python 3):
with open('input.txt') as infile:
    # build real lists; in Python 3, map() returns a lazy iterator
    rows = [list(map(int, line.split())) for line in infile]
def part(minval, maxval):
    # rows whose first column lies in [minval, maxval]
    return [r for r in rows if minval <= r[0] <= maxval]
with open('output.txt', 'w') as outfile:
    for half in [part(0, 4), part(5, 9)]:
        # sort by second column, then first, then third
        half.sort(key=lambda r: (r[1], r[0], r[2]))
        for row in half:
            outfile.write('%s %s %s\n' % tuple(row))
Let me know if you have questions.
It would be very simple if you could use the pandas module:
import pandas as pd
fn = r'D:\temp\.data\37146154.txt'
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(fn, sep=r'\s+', header=None, names=['col1','col2','col3'])
df.sort_values(['col2','col1','col3'])
If you want to write it back to a new file:
df.sort_values(['col2','col1','col3']).to_csv('new_file', sep='\t', index=False, header=False)
Test:
In [15]: df.sort_values(['col2','col1','col3'])
Out[15]:
col1 col2 col3
0 0 0 716
3 1 0 724
6 2 0 812
9 3 0 799
12 4 0 833
15 5 0 940
18 6 0 941
21 7 0 942
24 8 0 944
27 9 0 946
1 0 1 851
4 1 1 857
7 2 1 858
10 3 1 852
13 4 1 871
16 5 1 955
19 6 1 956
22 7 1 957
25 8 1 958
28 9 1 952
2 0 2 900
5 1 2 903
8 2 2 902
11 3 2 905
14 4 2 907
17 5 2 995
20 6 2 996
23 7 2 999
26 8 2 992
29 9 2 998
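Note that the sort shown above orders all ten values of the first column together, whereas the desired output in the question sorts rows 0-4 and rows 5-9 as separate blocks. A minimal sketch of one way to get the block-wise order with pandas follows; the block size of 5 and the helper column name `block` are assumptions read off the desired output, not something the question states.

```python
import io
import pandas as pd

raw = """0 0 716
0 1 851
0 2 900
1 0 724
1 1 857
1 2 903
2 0 812
2 1 858
2 2 902
3 0 799
3 1 852
3 2 905
4 0 833
4 1 871
4 2 907
5 0 940
5 1 955
5 2 995
6 0 941
6 1 956
6 2 996
7 0 942
7 1 957
7 2 999
8 0 944
8 1 958
8 2 992
9 0 946
9 1 952
9 2 998
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None,
                 names=["col1", "col2", "col3"])

# Assumption: rows are processed in blocks of 5 consecutive col1 values.
out = (
    df.assign(block=df["col1"] // 5)            # 0 for col1 in 0-4, 1 for 5-9
      .sort_values(["block", "col2", "col1"])   # sort inside each block
      .drop(columns="block")
      .reset_index(drop=True)
)

print(out.to_string(index=False, header=False))
```

Keeping the block id as a temporary column lets a single sort_values call handle both the split into halves and the ordering inside each half.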
Running the following code:
from thinc.api import chain, PyTorchLSTM, Sigmoid, Embed, with_padded, with_array2d
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2
model = chain(
    Embed(nV=vocab_size, nO=embedding_dim),
    with_padded(PyTorchLSTM(nI=embedding_dim, nO=hidden_dim, depth=n_layers)),
    with_array2d(Sigmoid(nI=hidden_dim, nO=output_size))
)
model.initialize(X=train_x[:5], Y=train_y[:5])
I get this error: ValueError: Provided 'x' array should be 2-dimensional, but found 3 dimension(s).
Here is x[0], y[0]
[ 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
21025 308 6 3 1050 207 8 2138 32 1 171 57
15 49 81 5785 44 382 110 140 15 5194 60 154
9 1 4975 5852 475 71 5 260 12 21025 308 13
1978 6 74 2395 5 613 73 6 5194 1 24103 5
1983 10166 1 5786 1499 36 51 66 204 145 67 1199
5194 19869 1 37442 4 1 221 883 31 2988 71 4
1 5787 10 686 2 67 1499 54 10 216 1 383
9 62 3 1406 3686 783 5 3483 180 1 382 10
1212 13583 32 308 3 349 341 2913 10 143 127 5
7690 30 4 129 5194 1406 2326 5 21025 308 10 528
12 109 1448 4 60 543 102 12 21025 308 6 227
4146 48 3 2211 12 8 215 23] 1
I am relatively new to building these models, but I think it has to do with the fact that the output of the PyTorch LSTM layer has two dimensions. In a typical torch LSTM you'd stack the output from the LSTM layer (I think), but I'm not sure how to do that here. I assumed with_array2d would help, but it doesn't seem to.
The DataFrame below has a number of columns, but the column names are random numbers.
daily1=
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 0 0 0 0 0 0 4 0 0 0 ... 640 777 674 842 786 865 809 674 679 852
1 0 0 0 0 0 0 0 0 0 0 ... 108 29 74 102 82 62 83 68 30 61
2 rows × 244 columns
I would like to organise the column names in numerical order (from 0 to 243).
I tried
for i, n in zip(daily1.columns, range(244)):
    asd = daily1.rename(columns={i: n})
asd
but the expected output does not appear.
Ideal output is
0 1 2 3 4 5 6 7 8 9 ... 234 235 236 237 238 239 240 241 242 243
0 0 0 0 0 0 0 4 0 0 0 ... 640 777 674 842 786 865 809 674 679 852
1 0 0 0 0 0 0 0 0 0 0 ... 108 29 74 102 82 62 83 68 30 61
Could I get some advice? Thank you.
If you want to reorder the columns, you can try this:
columns = sorted(df.columns)
df = df[columns]
If you just want to rename the columns, you can try:
df.columns = range(df.shape[1])
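To make the difference between the two options concrete, here is a small sketch on a made-up three-column frame (the values and the labels 7, 3, 11 are hypothetical stand-ins for the 244-column daily1). It also hints at why the loop in the question appears to do nothing: rename returns a new DataFrame each time, so every iteration starts again from the unchanged daily1 and only the last single-column rename survives in asd.

```python
import pandas as pd

# Hypothetical small stand-in for daily1, with out-of-order numeric labels.
daily1 = pd.DataFrame([[0, 4, 640], [0, 0, 108]], columns=[7, 3, 11])

# Option 1: reorder existing columns by their labels (data moves with its label).
reordered = daily1[sorted(daily1.columns)]

# Option 2: relabel columns positionally as 0..n-1 (data stays where it is).
renamed = daily1.copy()
renamed.columns = range(renamed.shape[1])

print(reordered)
print(renamed)
```

Working on a copy in option 2 leaves daily1 itself untouched; assigning to daily1.columns directly would relabel it in place.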
I have performed a groupby on my dataframe.
grouped = data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
I am getting the below output :
data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
Out[81]:
Cluster Visit Number Final
0 1 21846
2 1485
3 299
4 95
5 24
6 8
7 3
1 1 33600
2 2283
3 404
4 117
5 34
6 7
2 1 5858
2 311
3 55
4 14
5 6
6 3
7 1
3 1 19699
2 1101
3 214
4 78
5 14
6 8
7 3
4 1 10086
2 344
3 59
4 14
5 3
6 1
Name: Visitor_ID, dtype: int64
Now I want to collapse the rows whose Visit Number Final is greater than 3 (that is, add a new row that holds the sum for Visit Number Final 4 and above). I am trying groupby.filter but not getting the expected output.
My final output should look like
Cluster Visit Number Final
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18
The easiest way is to replace the 'Visit Number Final' values bigger than 3, before you group the dataframe:
df.loc[df['Visit Number Final'] > 3, 'Visit Number Final'] = '>=4'
df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
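A self-contained sketch of the same idea is given below. It differs slightly from the snippet above in that it builds a separate string label instead of writing '>=4' into the integer column, which sidesteps mixing strings and integers in one column. The sample rows and Visitor_ID values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
data_df = pd.DataFrame({
    "Cluster": [0, 0, 0, 0, 0, 1, 1, 1],
    "Visit Number Final": [1, 2, 4, 5, 7, 1, 4, 6],
    "Visitor_ID": list("abcdefgh"),
})

visits = data_df["Visit Number Final"]
# Keep 1, 2, 3 as text and replace everything above 3 with the label '>=4'.
label = visits.astype(str).where(visits <= 3, ">=4")

counts = data_df.groupby(["Cluster", label])["Visitor_ID"].count()
print(counts)
```

Because the label is a Series aligned with data_df, it can be passed to groupby alongside the 'Cluster' column name without modifying the original frame.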
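Try this (assuming df is the grouped result as a DataFrame indexed by Cluster and Visit, with a 'Number Final' column):
import numpy as np
visit_val = df.index.get_level_values(1)
# label every visit number above 3 as '>=4'
grp = np.where(visit_val > 3, '>=4', visit_val)
(df.groupby(['Cluster', grp])['Number Final'].sum()
   .reset_index().rename(columns={'level_1': 'Visit'}))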
Output:
Cluster Visit Number Final
0 0 1 21846
1 0 2 1485
2 0 3 299
3 0 >=4 130
4 1 1 33600
5 1 2 2283
6 1 3 404
7 1 >=4 158
8 2 1 5858
9 2 2 311
10 2 3 55
11 2 >=4 24
12 3 1 19699
13 3 2 1101
14 3 3 214
15 3 >=4 103
16 4 1 10086
17 4 2 344
18 4 3 59
19 4 >=4 18
Or, to get a DataFrame that keeps the index:
(df.groupby(['Cluster', grp])['Number Final'].sum()
   .rename_axis(['Cluster', 'Visit']).to_frame())
Output:
Number Final
Cluster Visit
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18
I have a dataframe like this :
df = pd.DataFrame({'dir': [1,1,1,1,0,0,1,1,1,0], 'price':np.random.randint(100,200,10)})
dir price
0 1 100
1 1 150
2 1 190
3 1 194
4 0 152
5 0 151
6 1 131
7 1 168
8 1 112
9 0 193
I want a new column that shows the maximum price over each consecutive run where dir is 1, resetting whenever dir is 0.
My desired outcome looks like this:
dir price max
0 1 100 194
1 1 150 194
2 1 190 194
3 1 194 194
4 0 152 NaN
5 0 151 NaN
6 1 131 168
7 1 168 168
8 1 112 168
9 0 193 NaN
Use transform with max on the filtered rows:
# give each run of consecutive equal dir values its own group id
g = df['dir'].ne(df['dir'].shift()).cumsum()
# keep only the rows where dir is 1
m = df['dir'] == 1
df['max'] = df[m].groupby(g)['price'].transform('max')
print(df)
dir price max
0 1 100 194.0
1 1 150 194.0
2 1 190 194.0
3 1 194 194.0
4 0 152 NaN
5 0 151 NaN
6 1 131 168.0
7 1 168 168.0
8 1 112 168.0
9 0 193 NaN
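Since the question builds its prices with np.random.randint, the snippet above will not reproduce the printed numbers as-is. The following sketch hard-codes the prices from the question so the result can be checked by eye; the names run_id and is_one are illustrative, and the grouper is filtered with the same mask so both sides share one index.

```python
import pandas as pd

df = pd.DataFrame({
    "dir":   [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    "price": [100, 150, 190, 194, 152, 151, 131, 168, 112, 193],
})

# A new run starts wherever dir differs from the previous row.
run_id = df["dir"].ne(df["dir"].shift()).cumsum()
is_one = df["dir"].eq(1)

# Max price per run, computed only on the dir == 1 rows; the assignment
# aligns on the index, so the dir == 0 rows are left as NaN.
df["max"] = df.loc[is_one, "price"].groupby(run_id[is_one]).transform("max")

print(df)
```

The new column comes out as float because the rows with dir equal to 0 hold NaN.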
I have two dicts, one with three columns (A) and another with six columns (B). I would like to use the value in the first column (the index, which runs from 1 to 4 in both) together with the value in the second column (1-2000) to pick out the correct element in the third column for subtraction. The second dict is similar: the first and second columns identify the correct row, but it is the value in the sixth column of that row that is needed for the subtraction.
A             | B
1 1  260  541 | 1 1  260  280 0.001  521.4
1 1  390 1195 | 1 1  390  900  0.02  963.3
1 1  102    6 | 1 1  102    2  0.01    4.8
2 1   65   12 | 2 1   65    9  0.13   13.1
2 1  515  659 | 2 1  515  356 0.002  532.2
2 1  354 1200 | 2 1  354 1087 0.119 1502.3
3 1 1190   53 | 3 1 1190   46 0.058   12.0
3 1 1985    3 | 3 1 1985    1 0.006   1.02
3 1  457  192 | 3 1   25    3 0.001  178.2
4 1  261 2084 | 4 1  261 1792 0.196  100.7
4 1   12    0 | 4 1   12    0 0.000   12.6
4 1 1756   30 | 4 1 1756   28 0.006   23.7
4 1  592  354 | 4 1  592  291 0.357  251.9
So basically I would like to subtract the last column of B from the last column of A whilst retaining the information held in the first and second columns.
C (desired output)
1 1 260 19.6
1 1 390 231.7
1 1 102 1.2
2 1 65 -1.1
2 1 515 126.8
2 1 354 -302.3
3 1 1190 41.0
3 1 1985 1.98
3 1 457 13.8
4 1 261 1983.3
4 1 12 -12.6
4 1 1756 6.3
4 1 592 102.1
I have been through SO for hours looking for a solution but haven't found one yet, although I'm sure it must be possible.
I also need to create a scatter plot afterwards, so any suggestions on how to plot the positive values and ignore the negatives are welcome.
EDIT:
I have added my code below to make it clearer. I read in a three-column CSV file and then need a count of how often each value of the third column occurs among rows that share the same value in the first column. B then goes through further alterations to extract the desired data streams, and after that the subtraction needs to be made. A few of the comments mentioned that columns one and two are unnecessary, but the value in column three is linked to the value in column one, so they must always remain in the same row together.
import pandas as pd
import numpy as np
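def ba(fn, float1, float2):
    # column names are assumed here; the original read_csv call did not set them
    df = pd.read_csv(fn, header=None, names=['col1', 'col2', 'col3'],
                     skipfooter=6, engine='python')
    # frequency of each col3 value among rows with the same col1 value
    df['col4'] = df.groupby(['col1', 'col3'])['col3'].transform('size')
    df['col5'] = df['col4'] / float(float2)
    df['col6'] = df['col5'] * float1
    df = df.set_index('col1')
    # one DataFrame per col1 value
    return dict(tuple(df.groupby('col1')))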
If I understand it correctly, A and B are DataFrames; then:
In [1062]: A.iloc[:, :3].assign(output=A.iloc[:, -1] - B.iloc[:, -1])
Out[1062]:
0 1 2 output
0 1 1 260 19.60
1 1 1 390 231.70
2 1 1 102 1.20
3 2 1 65 -1.10
4 2 1 515 126.80
5 2 1 354 -302.30
6 3 1 1190 41.00
7 3 1 1985 1.98
8 3 1 457 13.80
9 4 1 261 1983.30
10 4 1 12 -12.60
11 4 1 1756 6.30
12 4 1 592 102.10
Details
In [1063]: A
Out[1063]:
0 1 2 3
0 1 1 260 541
1 1 1 390 1195
2 1 1 102 6
3 2 1 65 12
4 2 1 515 659
5 2 1 354 1200
6 3 1 1190 53
7 3 1 1985 3
8 3 1 457 192
9 4 1 261 2084
10 4 1 12 0
11 4 1 1756 30
12 4 1 592 354
In [1064]: B
Out[1064]:
0 1 2 3 4 5
0 1 1 260 280 0.001 521.40
1 1 1 390 900 0.020 963.30
2 1 1 102 2 0.010 4.80
3 2 1 65 9 0.130 13.10
4 2 1 515 356 0.002 532.20
5 2 1 354 1087 0.119 1502.30
6 3 1 1190 46 0.058 12.00
7 3 1 1985 1 0.006 1.02
8 3 1 25 3 0.001 178.20
9 4 1 261 1792 0.196 100.70
10 4 1 12 0 0.000 12.60
11 4 1 1756 28 0.006 23.70
12 4 1 592 291 0.357 251.90
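For anyone who wants to run this outside an IPython session, here is a small sketch using four of the rows shown above. Like the one-liner in the answer, it subtracts row by row in position order and assumes A and B list their rows in the same order; the rounding step and the `positive` filter (one way to drop the negative values before plotting) are additions for illustration, not part of the original answer.

```python
import pandas as pd

# Four rows copied from the A and B tables above; default integer column labels.
A = pd.DataFrame([
    [1, 1, 260, 541],
    [1, 1, 390, 1195],
    [2, 1, 65, 12],
    [4, 1, 12, 0],
])
B = pd.DataFrame([
    [1, 1, 260, 280, 0.001, 521.4],
    [1, 1, 390, 900, 0.02, 963.3],
    [2, 1, 65, 9, 0.13, 13.1],
    [4, 1, 12, 0, 0.0, 12.6],
])

# Keep A's first three columns and add last(A) - last(B), matched by row position.
diff = (A.iloc[:, -1] - B.iloc[:, -1]).round(2)
C = A.iloc[:, :3].assign(output=diff)

# Rows with a positive difference only, e.g. as input for a scatter plot.
positive = C[C["output"] > 0]

print(C)
print(positive)
```

If the two frames could ever differ in row order, merging on the key columns before subtracting would be the safer route than relying on position.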