Building dict in loop adds values to wrong keys - python

I have the following code:
data = open(filename).readlines()
res = {}
for d in data:
    res[d[0:11]] = d
Each line in data looks like the following; there are 251 lines with 2 different "keys" in the first 11 characters:
205583620002008TAVG 420 M 400 M 1140 M 1590 M 2160 M 2400 M 3030 M 2840 M 2570 M 2070 M 1320 M 750 M
205583620002009TAVG 380 M 890 M 1060 M 1630 M 2190 M 2620 M 2880 M 2790 M 2500 M 2130 M 1210 M 640 M
205583620002010TAVG 530 M 750 M 930 M 1280 M 2080 M 2380 M 2890 M 3070 M 2620 M 1920 M 1400 M 790 M
205583620002011TAVG 150 M 600 M 930 M 1600 M 2160 M 2430 M 3000 M 2790 M 2430 M 1910 M 1670 M 650 M
205583620002012TAVG 470 M 440 M 950 M 1750 M 2130 M 2430 M 2970 M 2900 M 2370 M 1980 M 1220 M 630 M
205583620002013TAVG 460 M 680 M 1100 M 1530 M 2130 M 2410 M 3200 M 3100 M-9999 -9999 -9999 -9999 XM
205583620002014TAVG-9999 XC-9999 XC-9999 XC-9999 XC-9999 XP-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC
205583620002015TAVG-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XK-9999 XP-9999 -9999 -9999
210476000001930TAVG 153 0 343 0 593 0 1033 0 1463 0 1893 0 2493 0 2583 0 2023 0 1483 0 873 0 473 0
210476000001931TAVG 203 0 73 0 473 0 833 0 1383 0 1823 0 2043 0 2513 0 2003 0 1413 0 1033 0 543 0
210476000001932TAVG 433 0 243 0 403 0 933 0 1503 0 1833 0 2353 0 2493 0 2043 0 1393 0 963 0 583 0
210476000001933TAVG 133 0 53 0 213 0 953 0 1553 0 1983 0 2543 0 2543 0 2043 0 1403 0 973 0 503 0
210476000001934TAVG 103 0 153 0 333 0 843 0 1493 0 1933 0 2243 0 2353 0 1983 0 1353 0 863 0 523 0
210476000001935TAVG 243 0 273 0 503 0 983 0 1453 0 1893 0 2303 0 2343 0 2053 0 1473 0 993 0 453 0
210476000001936TAVG -7 0 33 0 223 0 903 0 1433 0 1983 0 2293 0 2383 0 2153 0 1443 0 913 0 573 0
The keys output is this:
print res.keys()
>['20558362000', '21047600000']
And to check the result I have 3 prints:
print len(res.values())
print len(res.values()[0])
print len(res.values()[1])
My expected output is:
2
165
86
But I end up with:
2
116
116
It's pretty clear to me that it adds the same values to both keys, but I don't understand why.
If anyone could clarify, with or without a working code snippet, it would help a lot.

When you use
res[d[0:11]] = d
the entry in the dict gets overwritten, and each line has a length of 116 characters. So
print len(res.values()[0])
returns the length of a single line, not the number of elements.
You need to do something like this:
for d in data:
    key = d[0:11]
    if key in res:
        res[key].append(d)
    else:
        res[key] = [d]
Check out collections.defaultdict.
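The last line presumably means collections.defaultdict; a minimal sketch with made-up stand-in lines (the real data would come from the file):

```python
from collections import defaultdict

# Made-up stand-ins for lines of the real file; keys are the first 11 characters
data = ["20558362000 2008 TAVG ...\n",
        "20558362000 2009 TAVG ...\n",
        "21047600000 1930 TAVG ...\n"]

# Missing keys automatically start out as empty lists
res = defaultdict(list)
for d in data:
    res[d[0:11]].append(d)

print(len(res))                 # 2 distinct keys
print(len(res["20558362000"]))  # 2 lines grouped under the first key
```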

The answer is simple: many lines share the same first 11 characters, which you use as the key, so the value for that key keeps being overwritten!
Use this code as a proof (substitute your own filename):
filename = 'test.dat'
data = open(filename).readlines()
result = {}
for d in data:
    key = d[0:11]
    if key in result:
        print 'key {key} already in dict'.format(key=key)
    result[key] = d
print result

If you want to collect multiple lines you should store them in a sequence (preferably a list). Right now you are overwriting the previously stored line with the new one (that's what res[d[0:11]] = d does).
Possible solution:
data = open(filename).readlines()
res = {}
for d in data:
    try:
        res[d[0:11]].append(d)
    except KeyError:
        res[d[0:11]] = [d]


merge_asof with multiple columns and forward direction

I have 2 dataframes:
q = pd.DataFrame({'ID':[700,701,701,702,703,703,702],'TX':[0,0,1,0,0,1,1],'REF':[100,120,144,100,103,105,106]})
ID TX REF
0 700 0 100
1 701 0 120
2 701 1 144
3 702 0 100
4 703 0 103
5 703 1 105
6 702 1 106
and
p = pd.DataFrame({'ID':[700,701,701,702,703,703,702,708],'REF':[100,121,149,100,108,105,106,109],'NOTE':['A','B','V','V','T','A','L','M']})
ID REF NOTE
0 700 100 A
1 701 121 B
2 701 149 V
3 702 100 V
4 703 108 T
5 703 105 A
6 702 106 L
7 708 109 M
I wish to merge p with q in such a way that the IDs are equal AND the REF is an exact match OR higher.
Example 1:
for p: ID=700 and REF=100 and
for q: ID=700 and REF=100 So that's a clear match!
Example 2
for p:
1 701 0 120
2 701 1 144
they would match to:
1 701 121 B
2 701 149 V
this way:
1 701 0 120 121 B 121 is just after 120
2 701 1 144 149 V 149 comes after 144
When I use the code below (note: I only match on REF, which is wrong; it should be ID AND REF):
p = p.sort_values(by=['REF'])
q = q.sort_values(by=['REF'])
pd.merge_asof(p, q, on='REF', direction='forward').sort_values(by=['ID_x','TX'])
I get wrong matches.
My expected result should be something like this:
ID TX REF REF_2 NOTE
0 700 0 100 100 A
1 701 0 120 121 B
2 701 1 144 149 V
3 702 0 100 100 V
4 703 0 103 108 T
5 703 1 105 105 A
6 702 1 106 109 L
Does this work?
pd.merge_asof(q.sort_values(['REF', 'ID']),
              p.sort_values(['REF', 'ID']),
              on='REF',
              direction='forward',
              by='ID').sort_values('ID')
Output:
ID TX REF NOTE
0 700 0 100 A
5 701 0 120 B
6 701 1 144 V
1 702 0 100 V
4 702 1 106 L
2 703 0 103 A
3 703 1 105 A
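For completeness, the merge can be reproduced end-to-end as a self-contained sketch (same frames as in the question):

```python
import pandas as pd

q = pd.DataFrame({'ID': [700, 701, 701, 702, 703, 703, 702],
                  'TX': [0, 0, 1, 0, 0, 1, 1],
                  'REF': [100, 120, 144, 100, 103, 105, 106]})
p = pd.DataFrame({'ID': [700, 701, 701, 702, 703, 703, 702, 708],
                  'REF': [100, 121, 149, 100, 108, 105, 106, 109],
                  'NOTE': ['A', 'B', 'V', 'V', 'T', 'A', 'L', 'M']})

# merge_asof needs both frames sorted on the 'on' key; by='ID' restricts
# matches to rows with the same ID, direction='forward' takes REF >= left REF
out = pd.merge_asof(q.sort_values(['REF', 'ID']),
                    p.sort_values(['REF', 'ID']),
                    on='REF',
                    direction='forward',
                    by='ID').sort_values('ID')
print(out)
```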

Renaming a number of columns using for loop (python)

The dataframe below has a number of columns, but the column names are random numbers.
daily1=
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 0 0 0 0 0 0 4 0 0 0 ... 640 777 674 842 786 865 809 674 679 852
1 0 0 0 0 0 0 0 0 0 0 ... 108 29 74 102 82 62 83 68 30 61
2 rows × 244 columns
I would like to organise the column names in numerical order (from 0 to 243).
I tried
for i, n in zip(daily1.columns, range(244)):
    asd = daily1.rename(columns={i: n})
asd
but the output is not what I expect...
Ideal output is
0 1 2 3 4 5 6 7 8 9 ... 234 235 236 237 238 239 240 241 242 243
0 0 0 0 0 0 0 4 0 0 0 ... 640 777 674 842 786 865 809 674 679 852
1 0 0 0 0 0 0 0 0 0 0 ... 108 29 74 102 82 62 83 68 30 61
Could I get some advice guys? Thank you
If you want to reorder the columns you can try this:
columns = sorted(list(df.columns), reverse=False)
df = df[columns]
If you just want to rename the columns then you can try:
df.columns = [i for i in range(df.shape[1])]
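A toy version of the renaming approach (4 columns stand in for the question's 244):

```python
import pandas as pd

# Toy frame whose column labels are "random" numbers
daily1 = pd.DataFrame([[0, 4, 640, 777],
                       [0, 0, 108, 29]], columns=[7, 2, 11, 5])

# Relabel the columns 0..n-1 in their current positional order
daily1.columns = range(daily1.shape[1])
print(list(daily1.columns))  # [0, 1, 2, 3]
```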

how to do a subtraction on condition in pandas

I have following dataframe in pandas
code amnt pre_amnt
123 200 200
124 234 0
125 231 231
126 236 0
128 122 130
I want to do a subtraction only when pre_amnt is non zero. My desired dataframe would be
code amnt pre_amnt diff
123 200 200 0
124 234 0 0
125 231 231 0
126 236 0 0
128 122 130 8
So, if pre_amnt is zero then diff should be also 0. How can I do it in pandas?
Use numpy.where:
import numpy as np

m = df['pre_amnt'] > 0
df['diff'] = np.where(m, df['pre_amnt'] - df['amnt'], 0)
Another solution with Series.where:
df['diff'] = (df['pre_amnt'] - df['amnt']).where(m, 0)
print (df)
code amnt pre_amnt diff
0 123 200 200 0
1 124 234 0 0
2 125 231 231 0
3 126 236 0 0
4 128 122 130 8
Another approach:
data['diff'] = 0
data.loc[data['pre_amnt'] != 0, 'diff'] = abs(data['pre_amnt'] - data['amnt'])
code amnt pre_amnt diff
0 123 200 200 0
1 124 234 0 0
2 125 231 231 0
3 126 236 0 0
4 128 122 130 8
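The numpy.where version can be verified end-to-end on the question's data; a minimal runnable sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'code': [123, 124, 125, 126, 128],
                   'amnt': [200, 234, 231, 236, 122],
                   'pre_amnt': [200, 0, 231, 0, 130]})

# diff = pre_amnt - amnt where pre_amnt is non-zero, otherwise 0
m = df['pre_amnt'] > 0
df['diff'] = np.where(m, df['pre_amnt'] - df['amnt'], 0)
print(df['diff'].tolist())  # [0, 0, 0, 0, 8]
```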

How to extract only network indicators from the strings in a sample?

I have a sample with potential malware behaviour. I want to reveal all the network indicators, like website names and IP addresses, that it connects to.
Using the strings output I got:
$ strings 6787c54e6a2c5cffd1576dcdc8c4f42c954802b7
%PDF-1.5
1 0 obj
<</Type/Page/Parent 80 0 R/Contents 36 0 R/MediaBox[0 0 612 792]/Annots[2 0 R 4 0 R 6 0 R 8 0 R 10 0 R 12 0 R 14 0 R 16 0 R 18 0 R]/Group 20 0 R/StructParents 1/Tabs/S/Resources<</Font<</F1 21 0 R/F2 23 0 R/F3 26 0 R/F4 29 0 R/F5 31 0 R>>/XObject<</Image6 33 0 R/Image9 34 0 R>>>>>>
endobj
2 0 obj
<</Type/Annot/Subtype/Link/Rect[139.10001 398.20001 449.84 726.20001]/Border[0 0 0]/F 4/NM(PDFE-48D407B4789BA8880)/P 1 0 R/StructParent 0/A 3 0 R>>
endobj
3 0 obj
<</S/URI/URI(http://www.pdfupdatersacrobat.top/website/hts-cache/index.php?userid=info#narainsfashionfabrics.com)>>
endobj
4 0 obj
<</Type/Annot/Subtype/Link/Rect[232.39999 618.03003 370.14999 629.53003]/Border[0 0 0]/F 4/NM(PDFE-48D407B4789BA8881)/P 1 0 R/StructParent 2/A 5 0 R>>
endobj
5 0 obj
<</S/URI/URI(>>
endobj
6 0 obj
<</Type/Annot/Subtype/Link/Rect[278.87 583.20001 324.88 594.13]/Border[0 0 0]/F 4/NM(PDFE-48D407B4789BA8882)/P 1 0 R/StructParent 3/A 7 0 R>>
endobj
7 0 obj
<</S/URI/URI()>>
endobj
8 0 obj
<</Type/Annot/Subtype/Link/Rect[185.75999 377.28 398.16 733.67999]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D4183FB09C5EC13)/P 1 0 R/A 9 0 R/H/N>>
endobj
9 0 obj
<</S/URI/URI(http://sajiye.net/file/website/file/main/index.php?userid=alwaha_alghannaa#hotmail.com)>>
endobj
10 0 obj
<</Type/Annot/Subtype/Link/Rect[185.75999 373.67999 398.88 734.40002]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D4183FB09C5EC14)/P 1 0 R/A 11 0 R/H/N>>
endobj
11 0 obj
<</S/URI/URI(http://sajiye.net/file/website/file/main/index.php?userid=kitja#siamdee2558.com)>>
endobj
12 0 obj
<</Type/Annot/Subtype/Link/Rect[132.48 0 474.48001 772.56]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D460B5879C4D8C5)/P 1 0 R/A 13 0 R/H/N>>
endobj
13 0 obj
<</S/URI/URI(http://nurking.pl/wp-admin/user/email.163.htm?login=)>>
endobj
14 0 obj
<</Type/Annot/Subtype/Link/Rect[0 0 612 792]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D465334C760A446)/P 1 0 R/A 15 0 R/H/N>>
endobj
15 0 obj
<</S/URI/URI(https://www.dropbox.com/s/76jr9jzg020gory/Swift%20Copy.uue?dl=1)>>
endobj
16 0 obj
<</Type/Annot/Subtype/Link/Rect[.72 0 612 789.84003]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D4C7F946F3F02B7)/P 1 0 R/A 17 0 R/H/N>>
endobj
17 0 obj
<</S/URI/URI(https://www.dropbox.com/s/28aaqjdradyy4io/Swift-Copy_pdf.uue?dl=1)>>
endobj
18 0 obj
<</Type/Annot/Subtype/Link/Rect[0 5.76 612 792]/Border[0 0 0]/C[0 0 0]/F 4/P 1 0 R/A 19 0 R/H/N>>
endobj
19 0 obj
<</S/URI/URI(https://www.dropbox.com/s/d71h5a56r16u3f0/swift_copy.jar?dl=1)>>
endobj
20 0 obj
<</S/Transparency/CS/DeviceRGB>>
endobj
21 0 obj
<</Type/Font/Subtype/TrueType/BaseFont/TimesNewRoman/FirstChar 32/LastChar 252/Encoding/WinAnsiEncoding/FontDescriptor 22 0 R/Widths[250 333 408 500 500 833 777 180 333 333 500 563 250 333 250 277 500 500 500 500 500 500 500 500 500 500 277 277 563 563 563 443 920 722 666 666 722 610 556 722 722 333 389 722 610 889 722 722 556 722 666 556 610 722 722 943 722 722 610 333 277 333 469 500 333 443 500 443 500 443 333 500 500 277 277 500 277 777 500 500 500 500 333 389 277 500 500 722 500 500 443 479 200 479 541 350 500 350 333 500 443 1000 500 500 333 1000 556 333 889 350 610 350 350 333 333 443 443 350 500 1000 333 979 389 333 722 350 443 722 250 333 500 500 500 500 200 500 333 759 275 500 563 333 759 500 399 548 299 299 333 576 453 333 333 299 310 500 750 750 750 443 722 722 722 722 722 722 889 666 610 610 610 610 333 333 333 333 722 722 722 722 722 722 722 563 722 722 722 722 722 722 556 500 443 443 443 443 443 443 666 443 443 443 443 443 277 277 277 277 500 500 500 500 500 500 500 548 500 500 500 500 500]>>
endobj
22 0 obj
<</Type/FontDescriptor/FontName/TimesNewRoman/Flags 32/FontBBox[-568 -215 2045 891]/FontFamily(Times New Roman)/FontWeight 400/Ascent 891/CapHeight 693/Descent -215/MissingWidth 777/StemV 0/ItalicAngle 0/XHeight 485>>
endobj
23 0 obj
<</Type/Font/Subtype/TrueType/BaseFont/ABCDEE+Calibri,BoldItalic/FirstChar 32/LastChar 117/Name/F2/Encoding/WinAnsiEncoding/FontDescriptor 24 0 R/Widths[226 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 630 0 459 0 0 0 0 0 0 0 0 668 532 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 528 0 412 0 491 316 0 0 246 0 0 246 804 527 527 0 0 0 0 347 527]>>
endobj
24 0 obj
<</Type/FontDescriptor/FontName/ABCDEE+Calibri,BoldItalic/FontWeight 700/Flags 32/FontBBox[-691 -250 1265 750]/Ascent 750/CapHeight 750/Descent -250/StemV 53/ItalicAngle -11/AvgWidth 536/MaxWidth 1956/XHeight 250/FontFile2 25 0 R>>
endobj
<</Type/Pages/Count 1/Kids[1 0 R]>>
endobj
81 0 obj
<</Type/Catalog/Pages 80 0 R/Lang(en-US)/MarkInfo<</Marked true>>/Metadata 83 0 R/StructTreeRoot 37 0 R>>
endobj
82 0 obj
<</Producer(RAD PDF 2.36.8.0 - http://www.radpdf.com)/Author(alesk)/Creator(RAD PDF)/RadPdfCustomData(pdfescape.com-open-AC00E8D5A4B4C84BC37A2054F4EC794B0297765728CB8415)/CreationDate(D:20160825075202+01'00')/ModDate(D:20170711012532-08'00')>>
endobj
83 0 obj
<</Type/Metadata/Subtype/XML/Length 1031>>stream
<?xpacket begin="
" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="DynaPDF 4.0.11.30, http://www.dynaforms.com">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<pdf:Producer>RAD PDF 2.36.8.0 - http://www.radpdf.com</pdf:Producer>
<xmp:CreateDate>2016-08-25T07:52:02+01:00</xmp:CreateDate>
<xmp:CreatorTool>RAD PDF</xmp:CreatorTool>
<xmp:MetadataDate>2017-07-11T01:25:32-08:00</xmp:MetadataDate>
<xmp:ModifyDate>2017-07-11T01:25:32-08:00</xmp:ModifyDate>
<dc:creator><rdf:Seq><rdf:li xml:lang="x-default">alesk</rdf:li></rdf:Seq></dc:creator>
<xmpMM:DocumentID>uuid:a184332f-8592-38c8-908c-45914e523218</xmpMM:DocumentID>
<xmpMM:VersionID>1</xmpMM:VersionID>
<xmpMM:RenditionClass>default</xmpMM:RenditionClass>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
endobj
84 0 obj
<</Type/XRef/Size 85/Root 81 0 R/Info 82 0 R/ID[<299C21286E590F03363518EFD9FBBF99><299C21286E590F03363518EFD9FBBF99>]/W[1 3 0]/Filter/FlateDecode/Length 239>>stream
cx?{
endstream
endobj
startxref
204273
%%EOF
So is there any way to digest all these strings and extract only network indicators like domains or IP addresses, using a regex or any other method?
Suggestions are welcome.
Output Expected:
http://www.pdfupdatersacrobat.top/website/hts-cache/index.php?userid=info#narainsfashionfabrics.com
http://sajiye.net/file/website/file/main/index.php?userid=alwaha_alghannaa#hotmail.com
http://ns.adobe.com/pdf/1.3/
Yes it is possible. You can find all the URLs and then extract them using capture groups.
import re

# Pattern matching a parenthesised URL, as they appear in the PDF /URI entries
pattern = re.compile(r'(\(https?[:_%A-Z=?/a-z0-9.-]+\))')
# List where we store all URLs
urls = []
# For each match the pattern finds in string, append it to the list
for url in pattern.finditer(string):
    urls.append(url.group(1))
Note:
You should use pattern.finditer() because that way you can iterate through all the pattern matches in the text (here called string). From the re.finditer documentation:
re.finditer(pattern, string, flags=0)
Return an iterator yielding
MatchObject instances over all non-overlapping matches for the RE
pattern in string. The string is scanned left-to-right, and matches
are returned in the order found. Empty matches are included in the
result unless they touch the beginning of another match.
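A self-contained Python sketch of the same idea; the simplified URL character class below (including '#' so fragments survive) is an assumption, and the excerpt stands in for the full strings dump:

```python
import re

# Short excerpt of the strings output (stand-in for the full dump)
text = r"""
<</S/URI/URI(http://sajiye.net/file/website/file/main/index.php?userid=kitja#siamdee2558.com)>>
<</S/URI/URI(https://www.dropbox.com/s/d71h5a56r16u3f0/swift_copy.jar?dl=1)>>
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
"""

# URLs: scheme plus a run of URL-ish characters
url_pattern = re.compile(r'https?://[A-Za-z0-9._%/?=&#-]+')
urls = url_pattern.findall(text)

# Dotted-quad IPs (coarse: does not range-check each octet)
ip_pattern = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
ips = ip_pattern.findall(text)

print(urls)
print(ips)  # [] in this excerpt
```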
As findstr only offers rudimentary RegEx functionality, I suggest using PowerShell
(if necessary wrapped in a batch).
This rather coarse RegEx doesn't strip the tail of the http lines:
> gc .\sample.txt |sls '^.*?(https?:\/\/.*)$'|%{$_.Matches.Groups[1].Value}
http://www.pdfupdatersacrobat.top/website/hts-cache/index.php?userid=info#narainsfashionfabrics.com)>>
http://sajiye.net/file/website/file/main/index.php?userid=alwaha_alghannaa#hotmail.com)>>
http://sajiye.net/file/website/file/main/index.php?userid=kitja#siamdee2558.com)>>
http://nurking.pl/wp-admin/user/email.163.htm?login=)>>
https://www.dropbox.com/s/76jr9jzg020gory/Swift%20Copy.uue?dl=1)>>
https://www.dropbox.com/s/28aaqjdradyy4io/Swift-Copy_pdf.uue?dl=1)>>
https://www.dropbox.com/s/d71h5a56r16u3f0/swift_copy.jar?dl=1)>>
http://www.radpdf.com)/Author(alesk)/Creator(RAD PDF)/RadPdfCustomData(pdfescape.com-open-AC00E8D5A4B4C84BC37A2054F4EC794B0297765728CB8415)/CreationDate(D:20160825075202+01'00')/ModDate(D:20170711012532-08'00')>>
http://www.dynaforms.com">
http://www.w3.org/1999/02/22-rdf-syntax-ns#">
http://ns.adobe.com/pdf/1.3/"
http://purl.org/dc/elements/1.1/"
http://ns.adobe.com/xap/1.0/"
http://ns.adobe.com/xap/1.0/mm/">
http://www.radpdf.com</pdf:Producer>
Likewise coarse for the possible IPs
> gc .\sample.txt |sls '^(.*?(\d{1,3}\.){3}\d{1,3}.*)$'|%{$_.Matches.Groups[1].Value}
<</Producer(RAD PDF 2.36.8.0 - http://www.radpdf.com)/Author(alesk)/Creator(RAD PDF)/RadPdfCustomData(pdfescape.com-open-AC00E8D5A4B4C84BC37A2054F4EC794B0297765728CB8415)/CreationDate(D:20160825075202+01'00')/ModDate(D:20170711012532-08'00')>>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="DynaPDF 4.0.11.30, http://www.dynaforms.com">
<pdf:Producer>RAD PDF 2.36.8.0 - http://www.radpdf.com</pdf:Producer>
Aliases used:
gc = Get-Content
sls = Select-String
% = ForEach-Object

python re.split lookahead pattern

I'm trying re.split to get BCF#, BTS#, LAC and CI from a logfile with the following header and regular structure inside:
==================================================================================
RADIO NETWORK CONFIGURATION IN BSC:
E P B
F T R C D-CHANNEL BUSY
AD OP R ET- BCCH/CBCH/ R E S O&M LINK HR FR
LAC CI HOP ST STATE FREQ T PCM ERACH X F U NAME ST
/GP
===================== == ====== ==== == ==== =========== = = == ===== == === ===
BCF-0010 FLEXI MULTI U WO 2 LM10 WO
10090 31335 BTS-0010 U WO 0 0
KHAKHAATT070D BB/-
7
TRX-001 U WO 779 0 1348 MBCCH+CBCH P 0
TRX-002 U WO 659 0 1348 1
TRX-003 U WO 661 0 1348 2
TRX-004 U WO 670 0 1348 0
TRX-005 U WO 674 0 1348 1
10090 31336 BTS-0011 U WO 0 0
KHAKHAATT200D BB/-
7
TRX-006 U WO 811 0 1348 MBCCH+CBCH P 2
TRX-009 U WO 845 0 1349 2
TRX-010 U WO 819 0 1349 0
TRX-011 U WO 823 0 1349 1
TRX-012 U WO 836 0 1349 2
10090 31337 BTS-0012 U WO 0 0
KHAKHAATT340D BB/-
5
TRX-013 U WO 799 0 1349 MBCCH+CBCH P 0
TRX-014 U WO 829 0 1349 1
TRX-017 U WO 831 0 1302 2
TRX-018 U WO 834 0 1302 1
TRX-019 U WO 853 0 1302 0
TRX-020 U WO 858 0 1302 2
TRX-021 U WO 861 0 1302 1
BCF-0020 FLEXI MULTI U WO 0 LM20 WO
10090 30341 BTS-0020 U WO 0 0
KHAKHABYT100G BB/-
1
TRX-001 U WO 14 0 1856 MBCCH+CBCH P 0
TRX-002 U WO 85 0 1856 1
10090 30342 BTS-0021 U WO 0 0
KHAKHABYT230G BB/-
1
TRX-003 U WO 4 0 1856 MBCCH+CBCH P 2
TRX-004 U WO 12 0 1856 0
10090 30343 BTS-0022 U WO 0 0
KHAKHABYT340G BB/-
1
TRX-005 U WO 20 0 1856 MBCCH+CBCH P 1
TRX-006 U WO 22 0 1856 2
10090 30345 BTS-0025 U WO 0 0
KHAKHABYT100D BB/-
5
TRX-007 U WO 793 0 1856 MBCCH+CBCH P 0
TRX-008 U WO 851 0 1856 1
TRX-009 U WO 834 0 1857 2
TRX-010 U WO 825 0 1857 1
10090 30346 BTS-0026 U WO 0 0
KHAKHABYT230D BB/-
4
TRX-011 U WO 803 0 1857 MBCCH+CBCH P 2
TRX-012 U WO 860 0 1857 0
TRX-013 U WO 846 0 1857 1
TRX-014 U WO 844 0 1857 2
TRX-015 U WO 828 0 1857 0
TRX-016 U WO 813 0 1857 1
10090 30347 BTS-0027 U WO 0 2
KHAKHABYT340D BB/-
5
TRX-017 U WO 801 0 1352 MBCCH+CBCH P 2
TRX-018 U WO 857 0 1352 0
TRX-019 U WO 840 0 1352 1
TRX-020 U WO 838 0 1352 0
TRX-021 U WO 836 0 1352 1
TRX-022 U WO 823 0 1352 2
TRX-023 U WO 821 0 1352 0
TRX-024 U WO 817 0 1352 1
=======================================================================================
with code:
def GetTheSentences(infile):
    with con:
        cur = con.cursor()
        cur.execute("DROP TABLE IF EXISTS eei")
        cur.execute("CREATE TABLE eei(BCF INT, BTS INT PRIMARY KEY) ")
        with open(infile) as fp:
            for result_1 in re.split('BCF-', fp.read(), flags=re.UNICODE):
                BCF = result_1[:4]
                for result_2 in re.compile("(?=BTS-)").split(result_1):
                    rec = re.search('TRX-', result_2)
                    if rec is not None:
                        BTS = result_2[4:8]
                        print BCF + "," + BTS
I need to split result_1 into BTS-related parts including the 13 characters before "BTS-" ("10090 31335 BTS-0010") using a regex lookahead, and then split each part into result_3 for each TRX, but have no success.
Please support!
Python's re.split() does not split on zero-length matches (this only changed in Python 3.7, which does allow it).
Therefore, on Python 2 and on Python 3.6 and earlier, re.compile("(?=BTS-)").split(result_1) will never split your string. You need to find a solution without re.split(), use the third-party regex module, or move to Python 3.7+.
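On Python 3.7 and later the zero-width lookahead split does work; on older versions, re.findall with an explicit pattern is one workaround. A sketch on a trimmed stand-in for result_1:

```python
import re

chunk = "10090 31335 BTS-0010 ... TRX-001 ... 10090 31336 BTS-0011 ... TRX-006 ..."

# Python 3.7+: zero-width lookahead split keeps "BTS-" at the start of each part
parts = re.split(r'(?=BTS-)', chunk)
print(parts[1][:8])  # BTS-0010

# Version-independent alternative: capture the LAC/CI pair preceding each BTS
pairs = re.findall(r'(\d{5} \d{5}) BTS-(\d{4})', chunk)
print(pairs)  # [('10090 31335', '0010'), ('10090 31336', '0011')]
```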
