I have 2 dataframes:
q = pd.DataFrame({'ID':[700,701,701,702,703,703,702],'TX':[0,0,1,0,0,1,1],'REF':[100,120,144,100,103,105,106]})
ID TX REF
0 700 0 100
1 701 0 120
2 701 1 144
3 702 0 100
4 703 0 103
5 703 1 105
6 702 1 106
and
p = pd.DataFrame({'ID':[700,701,701,702,703,703,702,708],'REF':[100,121,149,100,108,105,106,109],'NOTE':['A','B','V','V','T','A','L','M']})
ID REF NOTE
0 700 100 A
1 701 121 B
2 701 149 V
3 702 100 V
4 703 108 T
5 703 105 A
6 702 106 L
7 708 109 M
I wish to merge p with q so that the IDs are equal AND the REF is an exact match OR the next higher value.
Example 1:
for p: ID=700 and REF=100 and
for q: ID=700 and REF=100. So that's a clear match!
Example 2
for q:
1 701 0 120
2 701 1 144
they would match in p to:
1 701 121 B
2 701 149 V
this way:
1 701 0 120 121 B (121 is just after 120)
2 701 1 144 149 V (149 comes after 144)
When I use the code below (note: it only matches on REF, which is wrong; it should be ID AND REF):
p = p.sort_values(by=['REF'])
q = q.sort_values(by=['REF'])
pd.merge_asof(p, q, on='REF', direction='forward').sort_values(by=['ID_x','TX'])
I get the wrong matches. My expected result should be something like this:
ID TX REF REF_2 NOTE
0 700 0 100 100 A
1 701 0 120 121 B
2 701 1 144 149 V
3 702 0 100 100 V
4 703 0 103 108 T
5 703 1 105 105 A
6 702 1 106 109 L
Does this work?
pd.merge_asof(q.sort_values(['REF', 'ID']),
              p.sort_values(['REF', 'ID']),
              on='REF',
              direction='forward',
              by='ID').sort_values('ID')
Output:
ID TX REF NOTE
0 700 0 100 A
5 701 0 120 B
6 701 1 144 V
1 702 0 100 V
4 702 1 106 L
2 703 0 103 A
3 703 1 105 A
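If you also want p's original REF kept (shown as REF_2 in the expected output), one option is to copy the column before the merge. A sketch reusing the frames from the question; the names `p2` and `REF_2` are mine:

```python
import pandas as pd

q = pd.DataFrame({'ID': [700, 701, 701, 702, 703, 703, 702],
                  'TX': [0, 0, 1, 0, 0, 1, 1],
                  'REF': [100, 120, 144, 100, 103, 105, 106]})
p = pd.DataFrame({'ID': [700, 701, 701, 702, 703, 703, 702, 708],
                  'REF': [100, 121, 149, 100, 108, 105, 106, 109],
                  'NOTE': ['A', 'B', 'V', 'V', 'T', 'A', 'L', 'M']})

# copy p's REF under a second name so it survives the asof merge
p2 = p.assign(REF_2=p['REF'])
out = pd.merge_asof(q.sort_values('REF'),
                    p2.sort_values('REF'),
                    on='REF', direction='forward',
                    by='ID').sort_values(['ID', 'TX'])
```

Note that with direction='forward' the 703/REF=103 row matches the nearest REF at or above it (105, NOTE A), not 108/T as sketched in the question; that agrees with the accepted answer's output above.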
The dataframe below has a number of columns, but the column names are random numbers.
daily1=
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 0 0 0 0 0 0 4 0 0 0 ... 640 777 674 842 786 865 809 674 679 852
1 0 0 0 0 0 0 0 0 0 0 ... 108 29 74 102 82 62 83 68 30 61
2 rows × 244 columns
I would like to organise the column names in numerical order (from 0 to 243).
I tried
for i, n in zip(daily1.columns, range(244)):
    asd = daily1.rename(columns={i: n})
asd
but the output does not show the renaming...
Ideal output is
0 1 2 3 4 5 6 7 8 9 ... 234 235 236 237 238 239 240 241 242 243
0 0 0 0 0 0 0 4 0 0 0 ... 640 777 674 842 786 865 809 674 679 852
1 0 0 0 0 0 0 0 0 0 0 ... 108 29 74 102 82 62 83 68 30 61
Could I get some advice guys? Thank you
If you want to reorder the columns you can try this:
columns = sorted(df.columns)
df = df[columns]
If you just want to rename the columns then you can try this:
df.columns = [i for i in range(df.shape[1])]
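One caveat worth checking (an assumption about the data; it only matters if the labels are strings rather than ints): plain sorted() orders string labels lexicographically, so '10' sorts before '2'. Sorting with key=int avoids that:

```python
import pandas as pd

df = pd.DataFrame([[7, 8, 9]], columns=['10', '2', '0'])

# lexicographic order would be ['0', '10', '2']; sort numerically instead
df = df[sorted(df.columns, key=int)]
print(list(df.columns))  # → ['0', '2', '10']
```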
While designing a recommendation system, I have stumbled upon a case where a rating or something similar is required for the collaborative filtering implementation.
But in our system we don't have any field for rating/voting, so I want to derive a similar kind of rating from the timestamp at which the user watched the show.
This is what view-history looks like
subscriber_id content_id timestamp
1 123 1576833135000
1 124 1576833140000
1 125 1576833145000
1 126 1576833150000
1 127 1576833155000
1 128 1576833160000
1 129 1576833165000
1 130 1576833170000
1 131 1576833175000
1 132 1576833180000
2 123 1576833135000
2 124 1576833140000
2 125 1576833145000
2 126 1576833150000
2 127 1576833155000
2 128 1576833160000
2 129 1576833165000
2 130 1576833170000
2 131 1576833175000
2 132 1576833180000
2 133 1576833185000
2 134 1576833190000
2 135 1576833195000
2 136 1576833200000
2 137 1576833205000
2 138 1576833210000
2 139 1576833215000
2 140 1576833220000
2 141 1576833225000
2 142 1576833230000
2 143 1576833235000
2 144 1576833240000
I want to assign a number to each of these entries, ranging from 5 to 1 (5 being most recent). I have implemented a rank, but it does not map onto that range.
df1['rank'] = df1.sort_values(['subscriber_id', 'timestamp']) \
                 .groupby(['subscriber_id'])['timestamp'] \
                 .rank(method='max').astype(int)
Expected Output:
subscriber_id content_id timestamp rating
1 123 1576833135000 1
1 124 1576833140000 1
1 125 1576833145000 2
1 126 1576833150000 2
1 127 1576833155000 3
1 128 1576833160000 3
1 129 1576833165000 4
1 130 1576833170000 4
1 131 1576833175000 5
1 132 1576833180000 5
2 123 1576833135000 1
2 124 1576833140000 1
2 125 1576833145000 1
2 126 1576833150000 1
2 127 1576833155000 2
2 128 1576833160000 2
2 129 1576833165000 2
2 130 1576833170000 2
2 131 1576833175000 3
2 132 1576833180000 3
2 133 1576833185000 3
2 134 1576833190000 3
2 135 1576833195000 4
2 136 1576833200000 4
2 137 1576833205000 4
2 138 1576833210000 4
2 139 1576833215000 4
2 140 1576833220000 5
2 141 1576833225000 5
2 142 1576833230000 5
2 143 1576833235000 5
2 144 1576833240000 5
Any help would be much appreciated!
Now it makes sense. The solution is to build a list of ratings based on the remainder when the number of rows for the selected user is divided by 5. There you go :)
import pandas as pd
from io import StringIO
data = StringIO("""
content_id subscriber_id timestamp
123 1 1576833135000
124 1 1576833140000
125 1 1576833145000
126 1 1576833150000
127 1 1576833155000
128 1 1576833160000
129 1 1576833165000
130 1 1576833170000
131 1 1576833175000
132 1 1576833180000
123 2 1576833135000
124 2 1576833140000
125 2 1576833145000
126 2 1576833150000
127 2 1576833155000
128 2 1576833160000
129 2 1576833165000
130 2 1576833170000
131 2 1576833175000
132 2 1576833180000
133 2 1576833185000
134 2 1576833190000
135 2 1576833195000
136 2 1576833200000
137 2 1576833205000
138 2 1576833210000
139 2 1576833215000
140 2 1576833220000
141 2 1576833225000
142 2 1576833230000
143 2 1576833235000
144 2 1576833240000
""")
# load data into data frame
df = pd.read_csv(data, sep=' ')
# get unique users
user_list = df['subscriber_id'].unique()
# collect results
results = pd.DataFrame(columns=['content_id','subscriber_id','timestamp','rating'])
for user in user_list:
    # select the rows for one user; copy to avoid SettingWithCopyWarning
    df2 = df[df['subscriber_id'] == user].copy()
    items_number = df2.shape[0]
    modulo_remainder = items_number % 5
    ranks_repeat = items_number // 5
    # base rating list: each of 1..5 repeated ranks_repeat times
    rating = []
    for i in range(1, 6):
        rating.extend([i] * ranks_repeat)
    # distribute the remainder to the highest ratings first
    if modulo_remainder >= 1:
        rating.insert(rating.index(5), 5)
    if modulo_remainder >= 2:
        rating.insert(rating.index(4), 4)
    if modulo_remainder >= 3:
        rating.insert(rating.index(3), 3)
    if modulo_remainder >= 4:
        rating.insert(rating.index(2), 2)
    df2.insert(3, 'rating', rating, True)
    # collect results
    results = results.append(df2)
Result:
content_id subscriber_id timestamp rating
0 123 1 1576833135000 1
1 124 1 1576833140000 1
2 125 1 1576833145000 2
3 126 1 1576833150000 2
4 127 1 1576833155000 3
5 128 1 1576833160000 3
6 129 1 1576833165000 4
7 130 1 1576833170000 4
8 131 1 1576833175000 5
9 132 1 1576833180000 5
10 123 2 1576833135000 1
11 124 2 1576833140000 1
12 125 2 1576833145000 1
13 126 2 1576833150000 1
14 127 2 1576833155000 2
15 128 2 1576833160000 2
16 129 2 1576833165000 2
17 130 2 1576833170000 2
18 131 2 1576833175000 3
19 132 2 1576833180000 3
20 133 2 1576833185000 3
21 134 2 1576833190000 3
22 135 2 1576833195000 4
23 136 2 1576833200000 4
24 137 2 1576833205000 4
25 138 2 1576833210000 4
26 139 2 1576833215000 4
27 140 2 1576833220000 5
28 141 2 1576833225000 5
29 142 2 1576833230000 5
30 143 2 1576833235000 5
31 144 2 1576833240000 5
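The per-user loop above can be condensed with np.array_split (a sketch, not the original answer's code): array_split hands any leftover rows to the earliest chunks, so splitting a newest-first index gives the extras to the highest ratings, matching the expected output.

```python
import numpy as np
import pandas as pd

def quintile_ratings(df):
    """Rate each user's rows 1-5 by recency, the newest fifth getting 5."""
    out = df.copy()
    out['rating'] = 0
    for user, g in out.groupby('subscriber_id'):
        # newest rows first; array_split gives leftover rows to the
        # earliest chunks, i.e. to the highest ratings
        idx = g.sort_values('timestamp', ascending=False).index.to_numpy()
        for score, chunk in zip([5, 4, 3, 2, 1], np.array_split(idx, 5)):
            out.loc[chunk, 'rating'] = score
    return out
```

On the sample above this reproduces the expected output, including the 4/4/4/5/5 split for subscriber 2's 22 rows.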
I have following dataframe in pandas
code amnt pre_amnt
123 200 200
124 234 0
125 231 231
126 236 0
128 122 130
I want to do a subtraction only when pre_amnt is non-zero. My desired dataframe would be
code amnt pre_amnt diff
123 200 200 0
124 234 0 0
125 231 231 0
126 236 0 0
128 122 130 8
So, if pre_amnt is zero then diff should be also 0. How can I do it in pandas?
Use numpy.where:
import numpy as np

m = df['pre_amnt'] > 0
df['diff'] = np.where(m, df['pre_amnt'] - df['amnt'], 0)
Another solution with where:
df['diff'] = (df['pre_amnt'] - df['amnt']).where(m, 0)
print (df)
code amnt pre_amnt diff
0 123 200 200 0
1 124 234 0 0
2 125 231 231 0
3 126 236 0 0
4 128 122 130 8
Another approach:
data['diff'] = 0
data.loc[data['pre_amnt'] != 0, 'diff'] = abs(data['pre_amnt'] - data['amnt'])
code amnt pre_amnt diff
0 123 200 200 0
1 124 234 0 0
2 125 231 231 0
3 126 236 0 0
4 128 122 130 8
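Putting the first answer together with the sample data as a self-contained sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'code': [123, 124, 125, 126, 128],
                   'amnt': [200, 234, 231, 236, 122],
                   'pre_amnt': [200, 0, 231, 0, 130]})

# subtract only where pre_amnt is non-zero, otherwise 0
df['diff'] = np.where(df['pre_amnt'] > 0, df['pre_amnt'] - df['amnt'], 0)
print(df['diff'].tolist())  # → [0, 0, 0, 0, 8]
```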
I have a sample with potential malware behaviour; I want to reveal all the network indicators, like website names and IP addresses, that it connects to.
Using strings I got this output:
$ strings 6787c54e6a2c5cffd1576dcdc8c4f42c954802b7
%PDF-1.5
1 0 obj
<</Type/Page/Parent 80 0 R/Contents 36 0 R/MediaBox[0 0 612 792]/Annots[2 0 R 4 0 R 6 0 R 8 0 R 10 0 R 12 0 R 14 0 R 16 0 R 18 0 R]/Group 20 0 R/StructParents 1/Tabs/S/Resources<</Font<</F1 21 0 R/F2 23 0 R/F3 26 0 R/F4 29 0 R/F5 31 0 R>>/XObject<</Image6 33 0 R/Image9 34 0 R>>>>>>
endobj
2 0 obj
<</Type/Annot/Subtype/Link/Rect[139.10001 398.20001 449.84 726.20001]/Border[0 0 0]/F 4/NM(PDFE-48D407B4789BA8880)/P 1 0 R/StructParent 0/A 3 0 R>>
endobj
3 0 obj
<</S/URI/URI(http://www.pdfupdatersacrobat.top/website/hts-cache/index.php?userid=info#narainsfashionfabrics.com)>>
endobj
4 0 obj
<</Type/Annot/Subtype/Link/Rect[232.39999 618.03003 370.14999 629.53003]/Border[0 0 0]/F 4/NM(PDFE-48D407B4789BA8881)/P 1 0 R/StructParent 2/A 5 0 R>>
endobj
5 0 obj
<</S/URI/URI(>>
endobj
6 0 obj
<</Type/Annot/Subtype/Link/Rect[278.87 583.20001 324.88 594.13]/Border[0 0 0]/F 4/NM(PDFE-48D407B4789BA8882)/P 1 0 R/StructParent 3/A 7 0 R>>
endobj
7 0 obj
<</S/URI/URI()>>
endobj
8 0 obj
<</Type/Annot/Subtype/Link/Rect[185.75999 377.28 398.16 733.67999]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D4183FB09C5EC13)/P 1 0 R/A 9 0 R/H/N>>
endobj
9 0 obj
<</S/URI/URI(http://sajiye.net/file/website/file/main/index.php?userid=alwaha_alghannaa#hotmail.com)>>
endobj
10 0 obj
<</Type/Annot/Subtype/Link/Rect[185.75999 373.67999 398.88 734.40002]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D4183FB09C5EC14)/P 1 0 R/A 11 0 R/H/N>>
endobj
11 0 obj
<</S/URI/URI(http://sajiye.net/file/website/file/main/index.php?userid=kitja#siamdee2558.com)>>
endobj
12 0 obj
<</Type/Annot/Subtype/Link/Rect[132.48 0 474.48001 772.56]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D460B5879C4D8C5)/P 1 0 R/A 13 0 R/H/N>>
endobj
13 0 obj
<</S/URI/URI(http://nurking.pl/wp-admin/user/email.163.htm?login=)>>
endobj
14 0 obj
<</Type/Annot/Subtype/Link/Rect[0 0 612 792]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D465334C760A446)/P 1 0 R/A 15 0 R/H/N>>
endobj
15 0 obj
<</S/URI/URI(https://www.dropbox.com/s/76jr9jzg020gory/Swift%20Copy.uue?dl=1)>>
endobj
16 0 obj
<</Type/Annot/Subtype/Link/Rect[.72 0 612 789.84003]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D4C7F946F3F02B7)/P 1 0 R/A 17 0 R/H/N>>
endobj
17 0 obj
<</S/URI/URI(https://www.dropbox.com/s/28aaqjdradyy4io/Swift-Copy_pdf.uue?dl=1)>>
endobj
18 0 obj
<</Type/Annot/Subtype/Link/Rect[0 5.76 612 792]/Border[0 0 0]/C[0 0 0]/F 4/P 1 0 R/A 19 0 R/H/N>>
endobj
19 0 obj
<</S/URI/URI(https://www.dropbox.com/s/d71h5a56r16u3f0/swift_copy.jar?dl=1)>>
endobj
20 0 obj
<</S/Transparency/CS/DeviceRGB>>
endobj
21 0 obj
<</Type/Font/Subtype/TrueType/BaseFont/TimesNewRoman/FirstChar 32/LastChar 252/Encoding/WinAnsiEncoding/FontDescriptor 22 0 R/Widths[250 333 408 500 500 833 777 180 333 333 500 563 250 333 250 277 500 500 500 500 500 500 500 500 500 500 277 277 563 563 563 443 920 722 666 666 722 610 556 722 722 333 389 722 610 889 722 722 556 722 666 556 610 722 722 943 722 722 610 333 277 333 469 500 333 443 500 443 500 443 333 500 500 277 277 500 277 777 500 500 500 500 333 389 277 500 500 722 500 500 443 479 200 479 541 350 500 350 333 500 443 1000 500 500 333 1000 556 333 889 350 610 350 350 333 333 443 443 350 500 1000 333 979 389 333 722 350 443 722 250 333 500 500 500 500 200 500 333 759 275 500 563 333 759 500 399 548 299 299 333 576 453 333 333 299 310 500 750 750 750 443 722 722 722 722 722 722 889 666 610 610 610 610 333 333 333 333 722 722 722 722 722 722 722 563 722 722 722 722 722 722 556 500 443 443 443 443 443 443 666 443 443 443 443 443 277 277 277 277 500 500 500 500 500 500 500 548 500 500 500 500 500]>>
endobj
22 0 obj
<</Type/FontDescriptor/FontName/TimesNewRoman/Flags 32/FontBBox[-568 -215 2045 891]/FontFamily(Times New Roman)/FontWeight 400/Ascent 891/CapHeight 693/Descent -215/MissingWidth 777/StemV 0/ItalicAngle 0/XHeight 485>>
endobj
23 0 obj
<</Type/Font/Subtype/TrueType/BaseFont/ABCDEE+Calibri,BoldItalic/FirstChar 32/LastChar 117/Name/F2/Encoding/WinAnsiEncoding/FontDescriptor 24 0 R/Widths[226 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 630 0 459 0 0 0 0 0 0 0 0 668 532 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 528 0 412 0 491 316 0 0 246 0 0 246 804 527 527 0 0 0 0 347 527]>>
endobj
24 0 obj
<</Type/FontDescriptor/FontName/ABCDEE+Calibri,BoldItalic/FontWeight 700/Flags 32/FontBBox[-691 -250 1265 750]/Ascent 750/CapHeight 750/Descent -250/StemV 53/ItalicAngle -11/AvgWidth 536/MaxWidth 1956/XHeight 250/FontFile2 25 0 R>>
endobj
<</Type/Pages/Count 1/Kids[1 0 R]>>
endobj
81 0 obj
<</Type/Catalog/Pages 80 0 R/Lang(en-US)/MarkInfo<</Marked true>>/Metadata 83 0 R/StructTreeRoot 37 0 R>>
endobj
82 0 obj
<</Producer(RAD PDF 2.36.8.0 - http://www.radpdf.com)/Author(alesk)/Creator(RAD PDF)/RadPdfCustomData(pdfescape.com-open-AC00E8D5A4B4C84BC37A2054F4EC794B0297765728CB8415)/CreationDate(D:20160825075202+01'00')/ModDate(D:20170711012532-08'00')>>
endobj
83 0 obj
<</Type/Metadata/Subtype/XML/Length 1031>>stream
<?xpacket begin="
" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="DynaPDF 4.0.11.30, http://www.dynaforms.com">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<pdf:Producer>RAD PDF 2.36.8.0 - http://www.radpdf.com</pdf:Producer>
<xmp:CreateDate>2016-08-25T07:52:02+01:00</xmp:CreateDate>
<xmp:CreatorTool>RAD PDF</xmp:CreatorTool>
<xmp:MetadataDate>2017-07-11T01:25:32-08:00</xmp:MetadataDate>
<xmp:ModifyDate>2017-07-11T01:25:32-08:00</xmp:ModifyDate>
<dc:creator><rdf:Seq><rdf:li xml:lang="x-default">alesk</rdf:li></rdf:Seq></dc:creator>
<xmpMM:DocumentID>uuid:a184332f-8592-38c8-908c-45914e523218</xmpMM:DocumentID>
<xmpMM:VersionID>1</xmpMM:VersionID>
<xmpMM:RenditionClass>default</xmpMM:RenditionClass>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
endobj
84 0 obj
<</Type/XRef/Size 85/Root 81 0 R/Info 82 0 R/ID[<299C21286E590F03363518EFD9FBBF99><299C21286E590F03363518EFD9FBBF99>]/W[1 3 0]/Filter/FlateDecode/Length 239>>stream
cx?{
endstream
endobj
startxref
204273
%%EOF
So, is there any way to digest all these strings and extract only the network indicators, like domains or IP addresses, using a regex or any other method?
Suggestions are welcome.
Output Expected:
http://www.pdfupdatersacrobat.top/website/hts-cache/index.php?userid=info#narainsfashionfabrics.com
http://sajiye.net/file/website/file/main/index.php?userid=alwaha_alghannaa#hotmail.com
http://ns.adobe.com/pdf/1.3/
Yes, it is possible. You can find all URLs and then extract them using capture groups (back references).
import re

# pattern describing a parenthesised URL
pattern = re.compile(r'(\(https?[:_%A-Z=?/a-z0-9.-]+\))')

# list where we store all URLs
urls = []
# for each URL the pattern finds in `string`, append it to the list
for url in pattern.finditer(string):
    urls.append(url.group(1))
Note:
You should use pattern.finditer() because that way you can iterate through all pattern matches in the text you called string. From the re.finditer documentation:
re.finditer(pattern, string, flags=0)
Return an iterator yielding
MatchObject instances over all non-overlapping matches for the RE
pattern in string. The string is scanned left-to-right, and matches
are returned in the order found. Empty matches are included in the
result unless they touch the beginning of another match.
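Both approaches leave PDF delimiters such as `)>>` on the matches. A tighter pattern that stops at whitespace, quotes, and parentheses yields cleaner indicators; a sketch where the sample text and the octet-range filter are mine:

```python
import re

text = """
<</S/URI/URI(http://sajiye.net/file/main/index.php?userid=kitja#siamdee2558.com)>>
<</Producer(RAD PDF 2.36.8.0 - http://www.radpdf.com)>>
"""

# URLs: stop at whitespace, quotes, angle brackets, and parentheses
url_re = re.compile(r'https?://[^\s()<>"\']+')
# candidate IPv4 addresses: four dot-separated 1-3 digit groups
ip_re = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

urls = url_re.findall(text)
# keep only dotted quads whose octets are all <= 255
# (version strings like 2.36.8.0 still slip through)
ips = [ip for ip in ip_re.findall(text)
       if all(int(part) <= 255 for part in ip.split('.'))]
```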
As findstr only offers rudimentary RegEx functionality, I suggest using PowerShell
(if necessary wrapped in a batch file).
This rather coarse RegEx doesn't strip the tail of the http lines:
> gc .\sample.txt |sls '^.*?(https?:\/\/.*)$'|%{$_.Matches.Groups[1].Value}
http://www.pdfupdatersacrobat.top/website/hts-cache/index.php?userid=info#narainsfashionfabrics.com)>>
http://sajiye.net/file/website/file/main/index.php?userid=alwaha_alghannaa#hotmail.com)>>
http://sajiye.net/file/website/file/main/index.php?userid=kitja#siamdee2558.com)>>
http://nurking.pl/wp-admin/user/email.163.htm?login=)>>
https://www.dropbox.com/s/76jr9jzg020gory/Swift%20Copy.uue?dl=1)>>
https://www.dropbox.com/s/28aaqjdradyy4io/Swift-Copy_pdf.uue?dl=1)>>
https://www.dropbox.com/s/d71h5a56r16u3f0/swift_copy.jar?dl=1)>>
http://www.radpdf.com)/Author(alesk)/Creator(RAD PDF)/RadPdfCustomData(pdfescape.com-open-AC00E8D5A4B4C84BC37A2054F4EC794B0297765728CB8415)/CreationDate(D:20160825075202+01'00')/ModDate(D:20170711012532-08'00')>>
http://www.dynaforms.com">
http://www.w3.org/1999/02/22-rdf-syntax-ns#">
http://ns.adobe.com/pdf/1.3/"
http://purl.org/dc/elements/1.1/"
http://ns.adobe.com/xap/1.0/"
http://ns.adobe.com/xap/1.0/mm/">
http://www.radpdf.com</pdf:Producer>
Likewise coarse for the possible IPs:
> gc .\sample.txt |sls '^(.*?(\d{1,3}\.){3}\d{1,3}.*)$'|%{$_.Matches.Groups[1].Value}
<</Producer(RAD PDF 2.36.8.0 - http://www.radpdf.com)/Author(alesk)/Creator(RAD PDF)/RadPdfCustomData(pdfescape.com-open-AC00E8D5A4B4C84BC37A2054F4EC794B0297765728CB8415)/CreationDate(D:20160825075202+01'00')/ModDate(D:20170711012532-08'00')>>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="DynaPDF 4.0.11.30, http://www.dynaforms.com">
<pdf:Producer>RAD PDF 2.36.8.0 - http://www.radpdf.com</pdf:Producer>
Aliases used:
gc = Get-Content
sls = Select-String
% = ForEach-Object
I have the following code:
data = open(filename).readlines()
res = {}
for d in data:
    res[d[0:11]] = d
Each line in data looks like this, and there are 251 lines with 2 different "keys" in the first 11 characters:
205583620002008TAVG 420 M 400 M 1140 M 1590 M 2160 M 2400 M 3030 M 2840 M 2570 M 2070 M 1320 M 750 M
205583620002009TAVG 380 M 890 M 1060 M 1630 M 2190 M 2620 M 2880 M 2790 M 2500 M 2130 M 1210 M 640 M
205583620002010TAVG 530 M 750 M 930 M 1280 M 2080 M 2380 M 2890 M 3070 M 2620 M 1920 M 1400 M 790 M
205583620002011TAVG 150 M 600 M 930 M 1600 M 2160 M 2430 M 3000 M 2790 M 2430 M 1910 M 1670 M 650 M
205583620002012TAVG 470 M 440 M 950 M 1750 M 2130 M 2430 M 2970 M 2900 M 2370 M 1980 M 1220 M 630 M
205583620002013TAVG 460 M 680 M 1100 M 1530 M 2130 M 2410 M 3200 M 3100 M-9999 -9999 -9999 -9999 XM
205583620002014TAVG-9999 XC-9999 XC-9999 XC-9999 XC-9999 XP-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC
205583620002015TAVG-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XK-9999 XP-9999 -9999 -9999
210476000001930TAVG 153 0 343 0 593 0 1033 0 1463 0 1893 0 2493 0 2583 0 2023 0 1483 0 873 0 473 0
210476000001931TAVG 203 0 73 0 473 0 833 0 1383 0 1823 0 2043 0 2513 0 2003 0 1413 0 1033 0 543 0
210476000001932TAVG 433 0 243 0 403 0 933 0 1503 0 1833 0 2353 0 2493 0 2043 0 1393 0 963 0 583 0
210476000001933TAVG 133 0 53 0 213 0 953 0 1553 0 1983 0 2543 0 2543 0 2043 0 1403 0 973 0 503 0
210476000001934TAVG 103 0 153 0 333 0 843 0 1493 0 1933 0 2243 0 2353 0 1983 0 1353 0 863 0 523 0
210476000001935TAVG 243 0 273 0 503 0 983 0 1453 0 1893 0 2303 0 2343 0 2053 0 1473 0 993 0 453 0
210476000001936TAVG -7 0 33 0 223 0 903 0 1433 0 1983 0 2293 0 2383 0 2153 0 1443 0 913 0 573 0
The keys output is this:
print res.keys()
>['20558362000', '21047600000']
And to check the result I have 3 prints:
print len(res.values())
print len(res.values()[0])
print len(res.values()[1])
My expected output is:
2
165
86
But I end up with:
2
116
116
It's pretty clear to me that it adds the same values to both keys, but I don't understand why.
If anyone could clarify, with or without a working code snippet, it would help a lot.
When you use
res[d[0:11]] = d
the entry in the dict gets overwritten, and each line has a length of 116 characters. So a
print len(res.values()[0])
returns the length of a single line, not a number of elements.
You need to do something like this:
for d in data:
    key = d[0:11]
    if key in res:
        res[key].append(d)
    else:
        res[key] = [d]
Check out collections.defaultdict.
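For illustration, a collections.defaultdict version removes the membership test entirely; the sample lines here are stand-ins for the real file's lines:

```python
from collections import defaultdict

# stand-ins for readlines() output: first 11 chars form the key
lines = ['20558362000 a', '20558362000 b', '21047600000 c']

res = defaultdict(list)
for d in lines:
    # missing keys are created automatically with an empty list
    res[d[0:11]].append(d)
```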
The answer is simple: lines whose first 11 characters are the same produce the same key, so the value for that key keeps being overwritten!
Use this code as proof (substitute your own filename):
filename = 'test.dat'
data = open(filename).readlines()
result = {}
for d in data:
    key = d[0:11]
    if key in result:
        print 'key {key} already in {dict}'.format(key=key, dict=result)
    result[key] = d
print result
If you want to collect multiple lines you should store them in a sequence (preferably a list). Right now you are overwriting the previously stored line with the new one (that's what res[d[0:11]] = d does).
Possible solution:
data = open(filename).readlines()
res = {}
for d in data:
    try:
        res[d[0:11]].append(d)
    except KeyError:
        res[d[0:11]] = [d]