I'm trying to use re.split to get the BCF number, BTS number, and LAC/CI from a logfile with a header and a regular structure inside:
==================================================================================
RADIO NETWORK CONFIGURATION IN BSC:
E P B
F T R C D-CHANNEL BUSY
AD OP R ET- BCCH/CBCH/ R E S O&M LINK HR FR
LAC CI HOP ST STATE FREQ T PCM ERACH X F U NAME ST
/GP
===================== == ====== ==== == ==== =========== = = == ===== == === ===
BCF-0010 FLEXI MULTI U WO 2 LM10 WO
10090 31335 BTS-0010 U WO 0 0
KHAKHAATT070D BB/-
7
TRX-001 U WO 779 0 1348 MBCCH+CBCH P 0
TRX-002 U WO 659 0 1348 1
TRX-003 U WO 661 0 1348 2
TRX-004 U WO 670 0 1348 0
TRX-005 U WO 674 0 1348 1
10090 31336 BTS-0011 U WO 0 0
KHAKHAATT200D BB/-
7
TRX-006 U WO 811 0 1348 MBCCH+CBCH P 2
TRX-009 U WO 845 0 1349 2
TRX-010 U WO 819 0 1349 0
TRX-011 U WO 823 0 1349 1
TRX-012 U WO 836 0 1349 2
10090 31337 BTS-0012 U WO 0 0
KHAKHAATT340D BB/-
5
TRX-013 U WO 799 0 1349 MBCCH+CBCH P 0
TRX-014 U WO 829 0 1349 1
TRX-017 U WO 831 0 1302 2
TRX-018 U WO 834 0 1302 1
TRX-019 U WO 853 0 1302 0
TRX-020 U WO 858 0 1302 2
TRX-021 U WO 861 0 1302 1
BCF-0020 FLEXI MULTI U WO 0 LM20 WO
10090 30341 BTS-0020 U WO 0 0
KHAKHABYT100G BB/-
1
TRX-001 U WO 14 0 1856 MBCCH+CBCH P 0
TRX-002 U WO 85 0 1856 1
10090 30342 BTS-0021 U WO 0 0
KHAKHABYT230G BB/-
1
TRX-003 U WO 4 0 1856 MBCCH+CBCH P 2
TRX-004 U WO 12 0 1856 0
10090 30343 BTS-0022 U WO 0 0
KHAKHABYT340G BB/-
1
TRX-005 U WO 20 0 1856 MBCCH+CBCH P 1
TRX-006 U WO 22 0 1856 2
10090 30345 BTS-0025 U WO 0 0
KHAKHABYT100D BB/-
5
TRX-007 U WO 793 0 1856 MBCCH+CBCH P 0
TRX-008 U WO 851 0 1856 1
TRX-009 U WO 834 0 1857 2
TRX-010 U WO 825 0 1857 1
10090 30346 BTS-0026 U WO 0 0
KHAKHABYT230D BB/-
4
TRX-011 U WO 803 0 1857 MBCCH+CBCH P 2
TRX-012 U WO 860 0 1857 0
TRX-013 U WO 846 0 1857 1
TRX-014 U WO 844 0 1857 2
TRX-015 U WO 828 0 1857 0
TRX-016 U WO 813 0 1857 1
10090 30347 BTS-0027 U WO 0 2
KHAKHABYT340D BB/-
5
TRX-017 U WO 801 0 1352 MBCCH+CBCH P 2
TRX-018 U WO 857 0 1352 0
TRX-019 U WO 840 0 1352 1
TRX-020 U WO 838 0 1352 0
TRX-021 U WO 836 0 1352 1
TRX-022 U WO 823 0 1352 2
TRX-023 U WO 821 0 1352 0
TRX-024 U WO 817 0 1352 1
=======================================================================================
with this code:
import re

# "con" is assumed to be an existing database connection
# (e.g. sqlite3) set up elsewhere in the script.
def GetTheSentences(infile):
    with con:
        cur = con.cursor()
        cur.execute("DROP TABLE IF EXISTS eei")
        cur.execute("CREATE TABLE eei(BCF INT, BTS INT PRIMARY KEY) ")
    with open(infile) as fp:
        for result_1 in re.split('BCF-', fp.read(), flags=re.UNICODE):
            BCF = result_1[:4]
            for result_2 in re.compile("(?=BTS-)").split(result_1):
                rec = re.search('TRX-', result_2)
                if rec is not None:
                    BTS = result_2[4:8]
                    print BCF + "," + BTS
I need to split result_1 into BTS-related parts, each including the 13 characters before "BTS-" (e.g. "10090 31335 BTS-0010"), using a regex lookahead, and then split each of those into a result_3 per TRX, but I have had no success.
Please help!
Python's re.split() doesn't split on zero-length matches (this only changed in Python 3.7). Therefore re.compile("(?=BTS-)").split(result_1) will never split your string here. You need to find a solution that avoids splitting on a zero-width pattern, or use the third-party regex module.
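A sketch of one workaround, assuming the field widths shown in the sample log (5-digit LAC and CI, 4-digit BCF/BTS numbers): skip the zero-width split entirely and pull each record out with re.finditer. The path in infile is hypothetical.
import re

infile = 'radio_network.log'  # hypothetical path to your logfile

with open(infile) as fp:
    text = fp.read()

# [1:] skips everything before the first "BCF-" (the header block)
for bcf_block in re.split(r'BCF-', text)[1:]:
    bcf = bcf_block[:4]
    # LAC, CI and the BTS number sit together on one line,
    # e.g. "10090 31335 BTS-0010"
    for m in re.finditer(r'(\d{5})\s+(\d{5})\s+BTS-(\d{4})', bcf_block):
        lac, ci, bts = m.groups()
        print('%s,%s,%s,%s' % (bcf, bts, lac, ci))
Alternatively, the third-party regex module can split on the zero-width lookahead directly (regex.split(r'(?=BTS-)', result_1, flags=regex.VERSION1)), and on Python 3.7+ plain re.split accepts such patterns as well.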
I have 2 dataframes:
q = pd.DataFrame({'ID':[700,701,701,702,703,703,702],'TX':[0,0,1,0,0,1,1],'REF':[100,120,144,100,103,105,106]})
ID TX REF
0 700 0 100
1 701 0 120
2 701 1 144
3 702 0 100
4 703 0 103
5 703 1 105
6 702 1 106
and
p = pd.DataFrame({'ID':[700,701,701,702,703,703,702,708],'REF':[100,121,149,100,108,105,106,109],'NOTE':['A','B','V','V','T','A','L','M']})
ID REF NOTE
0 700 100 A
1 701 121 B
2 701 149 V
3 702 100 V
4 703 108 T
5 703 105 A
6 702 106 L
7 708 109 M
I wish to merge p with q in such a way that the IDs are equal AND the REF matches exactly OR is the next higher value.
Example 1:
for p: ID=700 and REF=100 and
for q: ID=700 and REF=100. So that's a clear match!
Example 2
for q:
1 701 0 120
2 701 1 144
they would match to:
1 701 121 B
2 701 149 V
this way:
1 701 0 120 121 B (121 is just after 120)
2 701 1 144 149 V (149 comes after 144)
When I use the code below (note: I only match on REF, which is wrong; it should be ID AND REF):
p = p.sort_values(by=['REF'])
q = q.sort_values(by=['REF'])
pd.merge_asof(p, q, on='REF', direction='forward').sort_values(by=['ID_x','TX'])
I get an incorrect result (rows are matched on REF across different IDs).
My expected result should be something like this:
ID TX REF REF_2 NOTE
0 700 0 100 100 A
1 701 0 120 121 B
2 701 1 144 149 V
3 702 0 100 100 V
4 703 0 103 108 T
5 703 1 105 105 A
6 702 1 106 109 L
Does this work?
pd.merge_asof(q.sort_values(['REF', 'ID']),
p.sort_values(['REF', 'ID']),
on='REF',
direction='forward',
by='ID').sort_values('ID')
Output:
ID TX REF NOTE
0 700 0 100 A
5 701 0 120 B
6 701 1 144 V
1 702 0 100 V
4 702 1 106 L
2 703 0 103 A
3 703 1 105 A
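The asof logic above gives the expected pairing, but p's own REF (the REF_2 column in the expected output) is dropped because REF is the join key. A sketch (untested) that keeps it by duplicating p's REF into a helper column before merging:
p2 = p.rename(columns={'REF': 'REF_2'})
p2['REF'] = p2['REF_2']  # keep a copy to join on

out = pd.merge_asof(q.sort_values('REF'),
                    p2.sort_values('REF'),
                    on='REF',
                    direction='forward',
                    by='ID').sort_values(['ID', 'TX'])
Note that by='ID' is what restricts matches to rows with equal IDs; direction='forward' then picks the nearest REF that is equal or higher.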
I want to parse the data from the following API response into a pandas dataframe. There is an extra parent level in this JSON that I guess is causing the problem. How can I get past it and parse the data correctly?
URL: "https://api.covid19india.org/state_district_wise.json"
import pandas as pd
URL = "https://api.covid19india.org/state_district_wise.json"
df = pd.read_json(URL)
df.head()
The above code does not produce a usable dataframe. Please help.
Parsing nested structures in Python is a pain; here is a solution working for your data:
import requests
import pandas as pd

URL = "https://api.covid19india.org/state_district_wise.json"
d = requests.get(URL).json()

L = []
for k, v in d.items():                  # state level
    for k1, v1 in v.items():
        if isinstance(v1, dict):        # the 'districtData' dict
            for k2, v2 in v1.items():   # district level
                if isinstance(v2, dict):
                    for k3, v3 in v2.items():
                        if isinstance(v3, dict):    # the nested 'delta' dict
                            d1 = {f'{k3}.{k4}': v4 for k4, v4 in v3.items()}
                            d2 = {'districtData': k, 'State': k2, 'statecode': v['statecode']}
                            d3 = {**d2, **v2, **d1}
                            del d3[k3]  # drop the unflattened dict
                            L.append(d3)

df = pd.DataFrame(L)
print(df)
districtData State statecode \
0 State Unassigned Unassigned UN
1 Andaman and Nicobar Islands Nicobars AN
2 Andaman and Nicobar Islands North and Middle Andaman AN
3 Andaman and Nicobar Islands South Andaman AN
4 Andaman and Nicobar Islands Unknown AN
.. ... ... ...
767 West Bengal Purba Bardhaman WB
768 West Bengal Purba Medinipur WB
769 West Bengal Purulia WB
770 West Bengal South 24 Parganas WB
771 West Bengal Uttar Dinajpur WB
notes active confirmed \
0 0 0
1 District-wise numbers are out-dated as cumulat... 0 0
2 District-wise numbers are out-dated as cumulat... 0 1
3 District-wise numbers are out-dated as cumulat... 19 51
4 148 4442
.. ... ... ...
767 618 8773
768 1424 16548
769 350 5609
770 1899 27445
771 358 5197
deceased recovered delta.confirmed delta.deceased delta.recovered
0 0 0 0 0 0
1 0 0 0 0 0
2 0 1 0 0 0
3 0 32 0 0 0
4 60 4234 0 0 0
.. ... ... ... ... ...
767 74 8081 0 0 0
768 212 14912 0 0 0
769 33 5226 0 0 0
770 501 25045 0 0 0
771 55 4784 0 0 0
[772 rows x 11 columns]
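If the feed keeps the shape the loop above assumes ({state: {'districtData': {district: {...stats, 'delta': {...}}}, 'statecode': ...}}), a shorter sketch using pd.json_normalize (pandas 1.0+) may also work; column names here differ from the answer's:
import requests
import pandas as pd

URL = "https://api.covid19india.org/state_district_wise.json"
d = requests.get(URL).json()

# one flat-ish dict per district; json_normalize expands the
# nested 'delta' dict into delta.* columns
rows = [
    {'State': state, 'statecode': sv.get('statecode'), 'District': district, **stats}
    for state, sv in d.items()
    for district, stats in sv.get('districtData', {}).items()
]
df2 = pd.json_normalize(rows)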
I have this file
0 0 716
0 1 851
0 2 900
1 0 724
1 1 857
1 2 903
2 0 812
2 1 858
2 2 902
3 0 799
3 1 852
3 2 905
4 0 833
4 1 871
4 2 907
5 0 940
5 1 955
5 2 995
6 0 941
6 1 956
6 2 996
7 0 942
7 1 957
7 2 999
8 0 944
8 1 958
8 2 992
9 0 946
9 1 952
9 2 998
I want to write the rows out reordered like this:
0 0 716
1 0 724
2 0 812
3 0 799
4 0 833
0 1 851
1 1 857
2 1 858
3 1 852
4 1 871
0 2 900
1 2 903
2 2 902
3 2 905
4 2 907
5 0 940
6 0 941
7 0 942
8 0 944
9 0 946
5 1 955
6 1 956
7 1 957
8 1 958
9 1 952
5 2 995
6 2 996
7 2 999
8 2 992
9 2 998
I have read the file:
l = [line.rstrip('\n') for line in open('test.txt')]
Now I am stuck: how do I read this as a 3-D array? Using enumerate doesn't work, because it yields the index as a separate first value, which I don't need.
This works:
with open('input.txt') as infile:
    rows = [[int(x) for x in line.split()] for line in infile]

def part(minval, maxval):
    # rows whose first column falls within [minval, maxval]
    return [r for r in rows if minval <= r[0] <= maxval]

with open('output.txt', 'w') as outfile:
    for half in [part(0, 4), part(5, 9)]:
        # sort by second column, then first, then third
        half.sort(key=lambda r: (r[1], r[0], r[2]))
        for row in half:
            outfile.write('%s %s %s\n' % tuple(row))
Let me know if you have questions.
It would be very simple if you could use the pandas module:
import pandas as pd
fn = r'D:\temp\.data\37146154.txt'
df = pd.read_csv(fn, delim_whitespace=True, header=None, names=['col1','col2','col3'])
df.sort_values(['col2','col1','col3'])
If you want to write it back to a new file:
df.sort_values(['col2','col1','col3']).to_csv('new_file', sep='\t', index=False, header=None)
Test:
In [15]: df.sort_values(['col2','col1','col3'])
Out[15]:
col1 col2 col3
0 0 0 716
3 1 0 724
6 2 0 812
9 3 0 799
12 4 0 833
15 5 0 940
18 6 0 941
21 7 0 942
24 8 0 944
27 9 0 946
1 0 1 851
4 1 1 857
7 2 1 858
10 3 1 852
13 4 1 871
16 5 1 955
19 6 1 956
22 7 1 957
25 8 1 958
28 9 1 952
2 0 2 900
5 1 2 903
8 2 2 902
11 3 2 905
14 4 2 907
17 5 2 995
20 6 2 996
23 7 2 999
26 8 2 992
29 9 2 998
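Note that the pandas sort above orders purely by col2 then col1, while the asker's expected output also groups first-column values 0-4 before 5-9. If that grouping matters, a sketch (assuming the split at 5 shown in the expected output):
df['half'] = df['col1'] // 5  # 0 for col1 in 0-4, 1 for 5-9
print(df.sort_values(['half', 'col2', 'col1']).drop(columns='half'))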
I have the following code:
data = open(filename).readlines()
res = {}
for d in data:
    res[d[0:11]] = d
Each line in data looks like the following, and there are 251 lines with 2 different "keys" in the first 11 characters:
205583620002008TAVG 420 M 400 M 1140 M 1590 M 2160 M 2400 M 3030 M 2840 M 2570 M 2070 M 1320 M 750 M
205583620002009TAVG 380 M 890 M 1060 M 1630 M 2190 M 2620 M 2880 M 2790 M 2500 M 2130 M 1210 M 640 M
205583620002010TAVG 530 M 750 M 930 M 1280 M 2080 M 2380 M 2890 M 3070 M 2620 M 1920 M 1400 M 790 M
205583620002011TAVG 150 M 600 M 930 M 1600 M 2160 M 2430 M 3000 M 2790 M 2430 M 1910 M 1670 M 650 M
205583620002012TAVG 470 M 440 M 950 M 1750 M 2130 M 2430 M 2970 M 2900 M 2370 M 1980 M 1220 M 630 M
205583620002013TAVG 460 M 680 M 1100 M 1530 M 2130 M 2410 M 3200 M 3100 M-9999 -9999 -9999 -9999 XM
205583620002014TAVG-9999 XC-9999 XC-9999 XC-9999 XC-9999 XP-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC
205583620002015TAVG-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XC-9999 XK-9999 XP-9999 -9999 -9999
210476000001930TAVG 153 0 343 0 593 0 1033 0 1463 0 1893 0 2493 0 2583 0 2023 0 1483 0 873 0 473 0
210476000001931TAVG 203 0 73 0 473 0 833 0 1383 0 1823 0 2043 0 2513 0 2003 0 1413 0 1033 0 543 0
210476000001932TAVG 433 0 243 0 403 0 933 0 1503 0 1833 0 2353 0 2493 0 2043 0 1393 0 963 0 583 0
210476000001933TAVG 133 0 53 0 213 0 953 0 1553 0 1983 0 2543 0 2543 0 2043 0 1403 0 973 0 503 0
210476000001934TAVG 103 0 153 0 333 0 843 0 1493 0 1933 0 2243 0 2353 0 1983 0 1353 0 863 0 523 0
210476000001935TAVG 243 0 273 0 503 0 983 0 1453 0 1893 0 2303 0 2343 0 2053 0 1473 0 993 0 453 0
210476000001936TAVG -7 0 33 0 223 0 903 0 1433 0 1983 0 2293 0 2383 0 2153 0 1443 0 913 0 573 0
The keys output is this:
print res.keys()
>['20558362000', '21047600000']
And to check the result I have 3 prints:
print len(res.values())
print len(res.values()[0])
print len(res.values()[1])
My expected output is:
2
165
86
But I end up with:
2
116
116
It's pretty clear to me that it adds the same values to both keys, but I don't understand why.
If anyone could clarify, with or without a working code snippet, it would help a lot.
When you use
res[d[0:11]] = d
the entry in the dict gets overwritten each time the same key appears, and each line is 116 characters long. So a
print len(res.values()[0])
returns the length of a single line, not a number of elements.
You need to do something like this:
res = {}
for d in data:
    key = d[0:11]
    if key in res:
        res[key].append(d)
    else:
        res[key] = [d]
Check out collections.defaultdict.
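For example, a minimal sketch with defaultdict, which creates the empty list automatically on first access:
from collections import defaultdict

res = defaultdict(list)
for d in data:
    res[d[0:11]].append(d)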
The answer is simple: the first 11 characters of the string that you use as the key are the same for many lines in your example, so the value for each key keeps being overwritten! Use this code as a proof (substitute your own filename):
filename = 'test.dat'
data = open(filename).readlines()
result = {}
for d in data:
    key = d[0:11]
    if key in result:
        print 'key {key} already in {dict}'.format(key=key, dict=result)
    result[key] = d
print result
If you want to collect multiple lines you should store them in a sequence (preferably a list). Right now you are overwriting the previously stored line with the new one (that's what res[d[0:11]] = d does).
Possible solution:
data = open(filename).readlines()
res = {}
for d in data:
    try:
        res[d[0:11]].append(d)
    except KeyError:
        res[d[0:11]] = [d]
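dict.setdefault gives the same effect without the try/except, as a short sketch:
res = {}
for d in data:
    # setdefault returns the existing list, or inserts and returns []
    res.setdefault(d[0:11], []).append(d)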
I have two questions: the first is about counting and the second is about erasing rows.
My data set looks like this
X = pd.read_table(dstore, sep=',', warn_bad_lines=True, error_bad_lines=True)
X.sort(['Chain_key'], ascending=[True], inplace=True)
IRI_KEY OU EST_ACV Market_Name Open Clsd Chain_key
230 229030 GR 6.619999 SEATTLE/TACOMA 1123 9998 1
264 231588 GR 5.286999 SEATTLE/TACOMA 960 9998 1
291 233708 GR 5.556000 SEATTLE/TACOMA 607 9998 1
1083 288392 GR 5.556000 SEATTLE/TACOMA 902 1400 1
167 223660 GR 5.825996 SEATTLE/TACOMA 1123 9998 1
1128 292476 GR 12.683000 LOS ANGELES 1048 9998 2
451 243939 GR 15.306000 WEST TEX/NEW MEX 1196 9998 2
980 281109 GR 15.800990 PORTLAND,OR 435 9998 2
945 278738 GR 9.685997 LOS ANGELES 435 9998 2
1473 656089 GR 14.738000 PHOENIX, AZ 1192 9998 2
1329 648019 GR 13.397990 PHOENIX, AZ 902 9998 3
999 283190 GR 19.213990 SACRAMENTO 1059 1450 3
207 227169 GR 18.780990 WEST TEX/NEW MEX 1075 9998 3
1026 285252 GR 31.476000 WEST TEX/NEW MEX 659 9998 4
1231 535552 GR 22.150990 SPOKANE 1145 9998 4
455 244163 GR 19.213990 PORTLAND,OR 435 1424 4
328 236100 GR 19.120000 WEST TEX/NEW MEX 493 9998 5
1228 535326 GR 15.429990 PHOENIX, AZ 1190 9998 6
436 242841 GR 20.472990 PORTLAND,OR 1285 9998 6
Then I want to count how many rows share each value in the Chain_key column, like this:
1: 5
2: 5
3: 3
4: 3
5: 1
...
And how can I erase rows whose Chain_key value occurs only a few times? For example, erasing groups with fewer than 2 rows would remove Chain_key 5 above, because it occurs only once (5: 1). I tried groupby and some other things but haven't cracked it yet.
I think you're looking for size:
In [11]: df.groupby('Chain_key').size()
Out[11]:
Chain_key
1 5
2 5
3 3
4 3
5 1
6 2
dtype: int64
To remove rows with fewer than 2 in the group, use filter:
In [12]: df.groupby('Chain_key').filter(lambda x: len(x) >= 2)
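For reference, a couple of equivalent sketches (assuming no NaN keys): value_counts gives the same tallies as groupby().size(), sorted by count rather than by key, and a transform-based mask reproduces the filter:
# same counts, ordered by frequency
df['Chain_key'].value_counts()

# keep rows whose Chain_key group has at least 2 members
df[df.groupby('Chain_key')['Chain_key'].transform('size') >= 2]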