pandas count occurrences during time window on other dataframe

I have a dataframe with this pattern of events
df = {
'2017-11-28 11:00': 'event1',
'2017-11-28 11:01': 'event1',
'2017-11-28 11:02': 'event1', <-----
'2017-11-28 11:03': 'event2',
'2017-11-28 11:04': 'event2',
'2017-11-28 11:05': 'event1',
'2017-11-28 11:06': 'event1',
'2017-11-28 11:07': 'event1', <-----
'2017-11-28 11:08': 'event2',
'2017-11-28 11:09': 'event2',
'2017-11-28 11:10': 'event2',
}
What I want to do is, for every event1 followed by one or many event2s, count the number of these event2s occurring during a specified time window, say 3 mins after that event1.
The arrows indicate the beginning of the time window.
Any help please?

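It looks like you have a Series there (called s below, and assuming its index is a DatetimeIndex), in which case you can do:
import pandas as pd

# Each event1 that is immediately followed by an event2 starts a new group.
threshold = (s.index.to_series()
              .groupby((s.eq('event1') & s.shift(-1).eq('event2')).cumsum())
              .transform('min') + pd.to_timedelta('3Min')  # adjust threshold here
            )
# Count the event2 rows that fall before their group's threshold.
(s.eq('event2') & (s.index < threshold)).sum()
# out 4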
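To see what that one-liner is counting, here is a plain-Python sketch of the same idea, without pandas. It assumes the events are already sorted by time and counts only the run of event2 rows that directly follows each marked event1, using the same strict "less than" cutoff as the answer. The function name count_following_event2 is made up for this illustration.

```python
from datetime import datetime, timedelta


def count_following_event2(events, window=timedelta(minutes=3)):
    """Map each event1 that is directly followed by an event2 to the number
    of consecutive event2 rows occurring before start + window."""
    counts = {}
    for i in range(len(events) - 1):
        start, name = events[i]
        if name != "event1" or events[i + 1][1] != "event2":
            continue  # not the start of a time window
        cutoff = start + window
        n = 0
        for ts, later in events[i + 1:]:
            if later != "event2" or ts >= cutoff:
                break  # run of event2 ended, or we left the window
            n += 1
        counts[start] = n
    return counts


# The question's sample data: one event per minute from 11:00 to 11:10.
names = ["event1"] * 3 + ["event2"] * 2 + ["event1"] * 3 + ["event2"] * 3
events = [(datetime(2017, 11, 28, 11, m), n) for m, n in enumerate(names)]
counts = count_following_event2(events)
print(sum(counts.values()))  # 4
```

The per-window dictionary (two event2 rows after 11:02 and two after 11:07) adds up to the same total of 4 that the pandas version reports.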

Related

How to throw away lines of text with specific characters?

I have multiple .log files that look like:
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2020-04-02 00:09:16
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
2020-04-02 00:14:16 172.31.11.70 GET /ben_laptop_Apple.html - 443 - 156.154.81.54 curl/7.54.0 - 404 0 2 28
...
2020-04-02 00:19:16 172.31.11.70 GET /ben_laptop_Apple.html - 443 - 123.123.23.23 curl/7.54.0 - 404 0 2 47
I want to parse and concatenate the fields to get a nicely formatted Pandas table. To that end, I have the following working well:
import glob
import pandas as pd

# Match the extension pattern and save the list of file names in the ‘all_filenames’ variable.
extension = 'log'
all_filenames = glob.glob('*.{}'.format(extension))
# Column names, taken from the #Fields header line.
fields = 'date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken'.split(' ')
# Read every file in the list and concatenate the results into one DataFrame.
combined_csv = pd.concat([pd.read_csv(f, sep=' ', header=None, skiprows=4, names=fields) for f in all_filenames])
As you can see, I skip the first 4 rows to remove the header text. However, the problem is that a single log file will have the header text repeated throughout the .log file. So my files actually look like:
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2020-04-02 00:09:16
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
2020-04-02 00:14:16 172.31.11.70 GET /ben_laptop_Apple.html - 443 - 156.154.81.54 curl/7.54.0 - 404 0 2 28
...
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2020-04-02 00:09:16
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
...
2020-04-02 00:19:16 172.31.11.70 GET /ben_laptop_Apple.html - 443 - 123.123.23.23 curl/7.54.0 - 404 0 2 47
How do I filter out the repeating header text? I'm guessing I need a RegEx solution.

Instead of skiprows=4, use comment='#'. That way pd.read_csv skips every row that begins with #:
combined_csv = pd.concat([pd.read_csv(f, sep=' ', header=None, comment='#', names=fields) for f in all_filenames ])

I think this can help you:
df = pd.read_csv('sample_file.csv', comment='#')
From the documentation:
comment : str, default None
Indicates remainder of line should not be parsed. If found at the
beginning of a line, the line will be ignored altogether. This
parameter must be a single character. Like empty lines (as long as
skip_blank_lines=True), fully commented lines are ignored by the
parameter header but not by skiprows. For example, if comment='#',
parsing '#empty\na,b,c\n1,2,3' with header=0 will result in 'a,b,c'
being treated as the header.
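One thing to keep in mind: per the documentation quoted above, comment marks the remainder of a line as not parsed, so if a data field ever contained a '#' the row would presumably be cut short there. If that is a concern, a minimal sketch of an alternative is to drop only the lines that start with '#' before parsing. The helper name parse_iis_log is made up for this example, and it uses the standard csv module rather than pandas.

```python
import csv
import io

FIELDS = ('date time s-ip cs-method cs-uri-stem cs-uri-query s-port '
          'cs-username c-ip cs(User-Agent) cs(Referer) sc-status '
          'sc-substatus sc-win32-status time-taken').split()


def parse_iis_log(lines):
    """Yield one dict per data row, skipping blank lines and any line
    that starts with '#' (the repeated header blocks)."""
    data = (ln for ln in lines if ln.strip() and not ln.startswith('#'))
    # QUOTE_NONE: treat quote characters in a field as ordinary text.
    for row in csv.reader(data, delimiter=' ', quoting=csv.QUOTE_NONE):
        yield dict(zip(FIELDS, row))


sample = """#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
2020-04-02 00:14:16 172.31.11.70 GET /ben_laptop_Apple.html - 443 - 156.154.81.54 curl/7.54.0 - 404 0 2 28
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
2020-04-02 00:19:16 172.31.11.70 GET /ben_laptop_Apple.html - 443 - 123.123.23.23 curl/7.54.0 - 404 0 2 47
"""
rows = list(parse_iis_log(io.StringIO(sample)))
print(len(rows))  # 2
```

The resulting list of dicts can be handed to pd.DataFrame(rows) if you still want a table at the end.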

Copying a portion of .docx file (keep formatting and images)

Good day SO,
I am trying to copy part of a .docx file into another .docx file using Python, while keeping the formatting of the copied part as well as any images.
I have tried python-docx, but I am unable to find anything regarding images. Link to my previous question here: Extracting .docx data, images and structure
Is there a way for me to copy part of a document, let's say DocA, and insert it at the end of DocB (including images and formatting, basically a clean copy-and-paste)?
Thanks a lot!
EDIT:
I have managed to find paragraphs containing images in DocA using the following code. I understand that it is a very hack-ish way as I am a complete beginner in python-docx, but here it is:
for x in document.paragraphs:
    if "<w:pict" in x._p.xml:
        print(x._p.xml)
Using this code, I successfully managed to find paragraphs containing the said images in the document. However, I am still unable to copy the image over to DocB (It appears as blanks in DocB), which is because (based on my understanding) I didn't extract the image data from the .docx file DocA.
EDIT 2:
Here is the XML of the Paragraph object containing the images:
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" w14:paraId="18A83B04" w14:textId="77777777" w:rsidR="00200C54" w:rsidRDefault="00051C61" w:rsidP="00200C54">
<w:pPr>
<w:jc w:val="center"/>
</w:pPr>
<w:r>
<w:rPr>
<w:noProof/>
</w:rPr>
<w:pict w14:anchorId="30C19523">
<v:shapetype id="_x0000_t202" coordsize="21600,21600" o:spt="202" path="m,l,21600r21600,l21600,xe">
<v:stroke joinstyle="miter"/>
<v:path gradientshapeok="t" o:connecttype="rect"/>
</v:shapetype>
<v:shape id="Text Box 2" o:spid="_x0000_s1029" type="#_x0000_t202" style="position:absolute;left:0;text-align:left;margin-left:305.1pt;margin-top:112.75pt;width:86.25pt;height:19.5pt;z-index:1;visibility:visible;mso-wrap-distance-top:3.6pt;mso-wrap-distance-bottom:3.6pt;mso-width-relative:margin;mso-height-relative:margin" o:gfxdata="UEsDBBQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbJSRQU7DMBBF
90jcwfIWJU67QAgl6YK0S0CoHGBkTxKLZGx5TGhvj5O2G0SRWNoz/78nu9wcxkFMGNg6quQqL6RA
0s5Y6ir5vt9lD1JwBDIwOMJKHpHlpr69KfdHjyxSmriSfYz+USnWPY7AufNIadK6MEJMx9ApD/oD
OlTrorhX2lFEilmcO2RdNtjC5xDF9pCuTyYBB5bi6bQ4syoJ3g9WQ0ymaiLzg5KdCXlKLjvcW893
SUOqXwnz5DrgnHtJTxOsQfEKIT7DmDSUCaxw7Rqn8787ZsmRM9e2VmPeBN4uqYvTtW7jvijg9N/y
JsXecLq0q+WD6m8AAAD//wMAUEsDBBQABgAIAAAAIQA4/SH/1gAAAJQBAAALAAAAX3JlbHMvLnJl
bHOkkMFqwzAMhu+DvYPRfXGawxijTi+j0GvpHsDYimMaW0Yy2fr2M4PBMnrbUb/Q94l/f/hMi1qR
JVI2sOt6UJgd+ZiDgffL8ekFlFSbvV0oo4EbChzGx4f9GRdb25HMsYhqlCwG5lrLq9biZkxWOiqY
22YiTra2kYMu1l1tQD30/bPm3wwYN0x18gb45AdQl1tp5j/sFB2T0FQ7R0nTNEV3j6o9feQzro1i
OWA14Fm+Q8a1a8+Bvu/d/dMb2JY5uiPbhG/ktn4cqGU/er3pcvwCAAD//wMAUEsDBBQABgAIAAAA
IQBxa2hSIQIAAB0EAAAOAAAAZHJzL2Uyb0RvYy54bWysU11v2yAUfZ+0/4B4X+x4cdNYcaouXaZJ
3YfU7gdgjGM04DIgsbtfvwtO06h7m+YHxPW9HM4997C+GbUiR+G8BFPT+SynRBgOrTT7mv543L27
psQHZlqmwIiaPglPbzZv36wHW4kCelCtcARBjK8GW9M+BFtlmee90MzPwAqDyQ6cZgFDt89axwZE
1yor8vwqG8C11gEX3uPfuylJNwm/6wQP37rOi0BUTZFbSKtLaxPXbLNm1d4x20t+osH+gYVm0uCl
Z6g7Fhg5OPkXlJbcgYcuzDjoDLpOcpF6wG7m+atuHnpmReoFxfH2LJP/f7D86/G7I7KtaTFfUmKY
xiE9ijGQDzCSIuozWF9h2YPFwjDib5xz6tXbe+A/PTGw7ZnZi1vnYOgFa5HfPJ7MLo5OOD6CNMMX
aPEadgiQgMbO6SgeykEQHef0dJ5NpMLjlfmqfL8sKeGYKxbLqzINL2PV82nrfPgkQJO4qanD2Sd0
drz3IbJh1XNJvMyDku1OKpUCt2+2ypEjQ5/s0pcaeFWmDBlquiqLMiEbiOeThbQM6GMldU2v8/hN
zopqfDRtKglMqmmPTJQ5yRMVmbQJYzNiYdSsgfYJhXIw+RXfF256cL8pGdCrNfW/DswJStRng2Kv
5otFNHcKFuWywMBdZprLDDMcoWoaKJm225AeRNTBwC0OpZNJrxcmJ67owSTj6b1Ek1/GqerlVW/+
AAAA//8DAFBLAwQUAAYACAAAACEAiK7BRuMAAAAQAQAADwAAAGRycy9kb3ducmV2LnhtbExPy26D
MBC8V+o/WFupl6oxQQESgon6UKtek+YDDN4ACl4j7ATy992e2stKuzM7j2I3215ccfSdIwXLRQQC
qXamo0bB8fvjeQ3CB01G945QwQ097Mr7u0Lnxk20x+shNIJFyOdaQRvCkEvp6xat9gs3IDF2cqPV
gdexkWbUE4vbXsZRlEqrO2KHVg/41mJ9PlysgtPX9JRspuozHLP9Kn3VXVa5m1KPD/P7lsfLFkTA
Ofx9wG8Hzg8lB6vchYwXvYJ0GcVMVRDHSQKCGdk6zkBUfElXCciykP+LlD8AAAD//wMAUEsBAi0A
FAAGAAgAAAAhALaDOJL+AAAA4QEAABMAAAAAAAAAAAAAAAAAAAAAAFtDb250ZW50X1R5cGVzXS54
bWxQSwECLQAUAAYACAAAACEAOP0h/9YAAACUAQAACwAAAAAAAAAAAAAAAAAvAQAAX3JlbHMvLnJl
bHNQSwECLQAUAAYACAAAACEAcWtoUiECAAAdBAAADgAAAAAAAAAAAAAAAAAuAgAAZHJzL2Uyb0Rv
Yy54bWxQSwECLQAUAAYACAAAACEAiK7BRuMAAAAQAQAADwAAAAAAAAAAAAAAAAB7BAAAZHJzL2Rv
d25yZXYueG1sUEsFBgAAAAAEAAQA8wAAAIsFAAAAAA==
" stroked="f">
<v:textbox>
<w:txbxContent>
<w:p w14:paraId="467DC1DB" w14:textId="77777777" w:rsidR="00200C54" w:rsidRDefault="00200C54" w:rsidP="00200C54">
<w:pPr>
<w:jc w:val="center"/>
</w:pPr>
<w:r>
<w:t>tLSTM</w:t>
</w:r>
</w:p>
</w:txbxContent>
</v:textbox>
</v:shape>
</w:pict>
</w:r>
<w:r>
<w:rPr>
<w:noProof/>
</w:rPr>
<w:pict w14:anchorId="0D832600">
<v:line id="Straight Connector 8" o:spid="_x0000_s1028" style="position:absolute;left:0;text-align:left;flip:y;z-index:2;visibility:visible;mso-width-relative:margin;mso-height-relative:margin" from="205.4pt,44.35pt" to="249.05pt,45.55pt" o:gfxdata="UEsDBBQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbJSRQU7DMBBF
90jcwfIWJU67QAgl6YK0S0CoHGBkTxKLZGx5TGhvj5O2G0SRWNoz/78nu9wcxkFMGNg6quQqL6RA
0s5Y6ir5vt9lD1JwBDIwOMJKHpHlpr69KfdHjyxSmriSfYz+USnWPY7AufNIadK6MEJMx9ApD/oD
OlTrorhX2lFEilmcO2RdNtjC5xDF9pCuTyYBB5bi6bQ4syoJ3g9WQ0ymaiLzg5KdCXlKLjvcW893
SUOqXwnz5DrgnHtJTxOsQfEKIT7DmDSUCaxw7Rqn8787ZsmRM9e2VmPeBN4uqYvTtW7jvijg9N/y
JsXecLq0q+WD6m8AAAD//wMAUEsDBBQABgAIAAAAIQA4/SH/1gAAAJQBAAALAAAAX3JlbHMvLnJl
bHOkkMFqwzAMhu+DvYPRfXGawxijTi+j0GvpHsDYimMaW0Yy2fr2M4PBMnrbUb/Q94l/f/hMi1qR
JVI2sOt6UJgd+ZiDgffL8ekFlFSbvV0oo4EbChzGx4f9GRdb25HMsYhqlCwG5lrLq9biZkxWOiqY
22YiTra2kYMu1l1tQD30/bPm3wwYN0x18gb45AdQl1tp5j/sFB2T0FQ7R0nTNEV3j6o9feQzro1i
OWA14Fm+Q8a1a8+Bvu/d/dMb2JY5uiPbhG/ktn4cqGU/er3pcvwCAAD//wMAUEsDBBQABgAIAAAA
IQBOa9I08gEAAMcDAAAOAAAAZHJzL2Uyb0RvYy54bWysU02P0zAQvSPxHyzfadqyQUvUdA+tlssK
KrVwn3XsxMJf8pim+feM3dIWuCFysGzPvJeZN8+rp5M17Cgjau9avpjNOZNO+E67vuVfD8/vHjnD
BK4D451s+SSRP63fvlmNoZFLP3jTyciIxGEzhpYPKYWmqlAM0gLOfJCOgspHC4mOsa+6CCOxW1Mt
5/MP1ehjF6IXEpFut+cgXxd+paRIX5RCmZhpOdWWyhrL+prXar2Cpo8QBi0uZcA/VGFBO/rplWoL
CdiPqP+islpEj16lmfC28kppIUsP1M1i/kc3+wGCLL2QOBiuMuH/oxWfj7vIdNdyGpQDSyPapwi6
HxLbeOdIQB/ZY9ZpDNhQ+sbtYu5UnNw+vHjxHZnzmwFcL0u9hykQySIjqt8g+YDhDD6paJkyOnzL
qZmOpGCnMpfpOhd5SkzQZV0/vK9rzgSFFvXyoYytgiazZGyImD5Jb1netNxol1WDBo4vmHIdt5R8
7fyzNqZM3jg2EufHeU3mEEAGVAYSbW0gSdD1nIHpydkixUKJ3uguwzMRTrgxkR2BzEWe7Px4oJI5
M4CJAtRH+YoUlH0PzZVuAYczuKPd2YpWJ3oPRlsayD3YuPxDWRx9aeqmZ969+m7axV+ik1tK2xdn
Zzven8tobu9v/RMAAP//AwBQSwMEFAAGAAgAAAAhAMrmE8fjAAAADgEAAA8AAABkcnMvZG93bnJl
di54bWxMj8FOwzAQRO9I/IO1SNyoY1TaJI1TIapyRLRw4ebGJomw15HtNIGvZzmVy0qj3Z15U21n
Z9nZhNh7lCAWGTCDjdc9thLe3/Z3ObCYFGplPRoJ3ybCtr6+qlSp/YQHcz6mlpEJxlJJ6FIaSs5j
0xmn4sIPBmn36YNTiWRouQ5qInNn+X2WrbhTPVJCpwbz1Jnm6zg6CZN9Xj3oYjh87HkQ69efUePu
Rcrbm3m3ofG4AZbMnC4f8NeB+KEmsJMfUUdmJSxFRvxJQp6vgdHBssgFsJOEQgjgdcX/16h/AQAA
//8DAFBLAQItABQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAAAAAAAAAAAAAAAAAAABbQ29udGVu
dF9UeXBlc10ueG1sUEsBAi0AFAAGAAgAAAAhADj9If/WAAAAlAEAAAsAAAAAAAAAAAAAAAAALwEA
AF9yZWxzLy5yZWxzUEsBAi0AFAAGAAgAAAAhAE5r0jTyAQAAxwMAAA4AAAAAAAAAAAAAAAAALgIA
AGRycy9lMm9Eb2MueG1sUEsBAi0AFAAGAAgAAAAhAMrmE8fjAAAADgEAAA8AAAAAAAAAAAAAAAAA
TAQAAGRycy9kb3ducmV2LnhtbFBLBQYAAAAABAAEAPMAAABcBQAAAAA=
" strokecolor="windowText" strokeweight="1.5pt">
<v:stroke dashstyle="dash" joinstyle="miter"/>
</v:line>
</w:pict>
</w:r>
<w:r>
<w:rPr>
<w:noProof/>
</w:rPr>
<w:pict w14:anchorId="7B559002">
<v:line id="Straight Connector 9" o:spid="_x0000_s1027" style="position:absolute;left:0;text-align:left;z-index:3;visibility:visible;mso-width-relative:margin;mso-height-relative:margin" from="203.6pt,47.3pt" to="249.65pt,114pt" o:gfxdata="UEsDBBQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbJSRQU7DMBBF
90jcwfIWJU67QAgl6YK0S0CoHGBkTxKLZGx5TGhvj5O2G0SRWNoz/78nu9wcxkFMGNg6quQqL6RA
0s5Y6ir5vt9lD1JwBDIwOMJKHpHlpr69KfdHjyxSmriSfYz+USnWPY7AufNIadK6MEJMx9ApD/oD
OlTrorhX2lFEilmcO2RdNtjC5xDF9pCuTyYBB5bi6bQ4syoJ3g9WQ0ymaiLzg5KdCXlKLjvcW893
SUOqXwnz5DrgnHtJTxOsQfEKIT7DmDSUCaxw7Rqn8787ZsmRM9e2VmPeBN4uqYvTtW7jvijg9N/y
JsXecLq0q+WD6m8AAAD//wMAUEsDBBQABgAIAAAAIQA4/SH/1gAAAJQBAAALAAAAX3JlbHMvLnJl
bHOkkMFqwzAMhu+DvYPRfXGawxijTi+j0GvpHsDYimMaW0Yy2fr2M4PBMnrbUb/Q94l/f/hMi1qR
JVI2sOt6UJgd+ZiDgffL8ekFlFSbvV0oo4EbChzGx4f9GRdb25HMsYhqlCwG5lrLq9biZkxWOiqY
22YiTra2kYMu1l1tQD30/bPm3wwYN0x18gb45AdQl1tp5j/sFB2T0FQ7R0nTNEV3j6o9feQzro1i
OWA14Fm+Q8a1a8+Bvu/d/dMb2JY5uiPbhG/ktn4cqGU/er3pcvwCAAD//wMAUEsDBBQABgAIAAAA
IQDiYn9F7wEAAL4DAAAOAAAAZHJzL2Uyb0RvYy54bWysU02P0zAQvSPxHyzfadJlC23UdA+tlssK
KrX8gFnHSSz8JY9pkn/P2P3YAjdEDtbY43me9+Zl/TQazU4yoHK25vNZyZm0wjXKdjX/fnz+sOQM
I9gGtLOy5pNE/rR5/249+Eo+uN7pRgZGIBarwde8j9FXRYGilwZw5ry0lGxdMBBpG7qiCTAQutHF
Q1l+KgYXGh+ckIh0ujsn+Sbjt60U8VvbooxM15x6i3kNeX1Na7FZQ9UF8L0SlzbgH7owoCw9eoPa
QQT2M6i/oIwSwaFr40w4U7i2VUJmDsRmXv7B5tCDl5kLiYP+JhP+P1jx9bQPTDU1X3FmwdCIDjGA
6vrIts5aEtAFtko6DR4rur61+5CYitEe/IsTP5BZt+3BdjL3e5w8gcxTRfFbSdqgPxePbTAJhARg
Y57GdJuGHCMTdLhYPi4/LjgTlFo+fi5XeVoFVNdiHzB+kc6wFNRcK5vEggpOLxjT81Bdr6Rj656V
1nng2rKBelyVC/KEAPJdqyFSaDwpgbbjDHRHhhYxZEh0WjWpPAHhhFsd2AnIU2TFxg1H6pkzDRgp
QUTylxWg2/elqZ8dYH8ubig6O9CoSL+BVoao3hdrmx6U2cgXUm8ypujVNdM+XLUmk2TaF0MnF97v
80TefrvNLwAAAP//AwBQSwMEFAAGAAgAAAAhAMoUjjTgAAAADwEAAA8AAABkcnMvZG93bnJldi54
bWxMT0tOwzAQ3SNxB2sqsaN2TVSaNE6FIFRiSekBpvGQRI3tKHY+vT1mBZuRnuZ988NiOjbR4Ftn
FWzWAhjZyunW1grOX++PO2A+oNXYOUsKbuThUNzf5ZhpN9tPmk6hZtHE+gwVNCH0Gee+asigX7ue
bPx9u8FgiHCouR5wjuam41KILTfY2pjQYE+vDVXX02gUmCo9jjSV5VGeb3zm/fWjwVKph9Xyto/n
ZQ8s0BL+FPC7IfaHIha7uNFqzzoFiXiWkaogTbbAIiFJ0ydgFwVS7gTwIuf/dxQ/AAAA//8DAFBL
AQItABQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAAAAAAAAAAAAAAAAAAABbQ29udGVudF9UeXBl
c10ueG1sUEsBAi0AFAAGAAgAAAAhADj9If/WAAAAlAEAAAsAAAAAAAAAAAAAAAAALwEAAF9yZWxz
Ly5yZWxzUEsBAi0AFAAGAAgAAAAhAOJif0XvAQAAvgMAAA4AAAAAAAAAAAAAAAAALgIAAGRycy9l
Mm9Eb2MueG1sUEsBAi0AFAAGAAgAAAAhAMoUjjTgAAAADwEAAA8AAAAAAAAAAAAAAAAASQQAAGRy
cy9kb3ducmV2LnhtbFBLBQYAAAAABAAEAPMAAABWBQAAAAA=
" strokecolor="windowText" strokeweight="1.5pt">
<v:stroke dashstyle="dash" joinstyle="miter"/>
</v:line>
</w:pict>
</w:r>
<w:r>
<w:rPr>
<w:noProof/>
</w:rPr>
<w:pict w14:anchorId="1C829DE8">
<v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m#4#5l#4#11#9#11#9#5xe" filled="f" stroked="f">
<v:stroke joinstyle="miter"/>
<v:formulas>
<v:f eqn="if lineDrawn pixelLineWidth 0"/>
<v:f eqn="sum #0 1 0"/>
<v:f eqn="sum 0 0 #1"/>
<v:f eqn="prod #2 1 2"/>
<v:f eqn="prod #3 21600 pixelWidth"/>
<v:f eqn="prod #3 21600 pixelHeight"/>
<v:f eqn="sum #0 0 1"/>
<v:f eqn="prod #6 1 2"/>
<v:f eqn="prod #7 21600 pixelWidth"/>
<v:f eqn="sum #8 21600 0"/>
<v:f eqn="prod #7 21600 pixelHeight"/>
<v:f eqn="sum #10 21600 0"/>
</v:formulas>
<v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
<o:lock v:ext="edit" aspectratio="t"/>
</v:shapetype>
<v:shape id="Picture 6" o:spid="_x0000_s1026" type="#_x0000_t75" style="position:absolute;left:0;text-align:left;margin-left:247.8pt;margin-top:10pt;width:186pt;height:112.15pt;z-index:-1;visibility:visible" wrapcoords="17332 576 17332 2880 3571 3024 1742 3312 1742 5184 348 5328 348 6192 1742 7488 1742 9792 871 12096 871 12816 1481 14400 1742 16704 261 16848 261 18000 2439 19008 2613 21456 3135 21456 5661 21312 18726 19440 19945 19008 21426 17712 21339 16704 19510 14400 19510 7488 20816 7200 20816 5472 19510 4752 18639 3456 17855 2880 18639 2016 18639 864 17768 576 17332 576">
<v:imagedata r:id="rId8" o:title=""/>
<w10:wrap type="tight"/>
</v:shape>
</w:pict>
</w:r>
<w:r w:rsidR="00524183">
<w:rPr>
<w:noProof/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:pict w14:anchorId="63A496C5">
<v:shape id="Picture 5" o:spid="_x0000_i1025" type="#_x0000_t75" style="width:191.7pt;height:128.1pt;visibility:visible">
<v:imagedata r:id="rId9" o:title=""/>
</v:shape>
</w:pict>
</w:r>
</w:p>
The images are in the .docx file, but they do not show up in document.inline_shapes (python-docx), so I have no idea how to continue. Any help appreciated :)

With this code you can identify where an image appears relative to the surrounding text:
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
for t in document.element.iter():
    if t.tag in (W + 'r', W + 't', W + 'drawing'):
        if t.tag == W + 'drawing':
            print('Picture Found')
        else:
            print(t.text)

You can also extract the image name and its position between two pieces of text:
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PIC = '{http://schemas.openxmlformats.org/drawingml/2006/picture}'
tags = []
text = []
for t in doc.element.iter():
    if t.tag in (W + 'r', W + 't', PIC + 'cNvPr', W + 'drawing'):
        if t.tag == PIC + 'cNvPr':
            print('Picture Found: ', t.attrib['name'])
            tags.append('Picture')
            text.append(t.attrib['name'])
        elif t.text:
            tags.append('text')
            text.append(t.text)
You can look up the text before and after each picture in the text list, and the matching tag in the tags list.
Once you have the image location and image name, you can add the image to your .docx file with this code:
from docx import Document
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
r.add_picture('/tmp/foo.jpg')
r.add_text(' do you like it?')
document.save('demo.docx')
You can access the images by unzipping your .docx file. The archive contains several folders; all of the images in the document are in the word/media folder.
See this link for unzipping a .docx file:
https://towardsdatascience.com/how-to-extract-data-from-ms-word-documents-using-python-ed3fbb48c122
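This may also explain why the pictures come out blank when only the paragraph XML is copied: in the question's XML each v:imagedata element carries just a relationship id such as r:id="rId8", while the image bytes sit in word/media and are linked to that id in word/_rels/document.xml.rels. Below is a minimal sketch, using only the standard library, that maps those ids to the image bytes. The function name image_parts is made up for this example, and it only handles images stored inside the package, not externally linked ones.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the package relationship files inside a .docx archive.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def image_parts(docx_file):
    """Return {relationship id: image bytes} for images embedded in a .docx.

    docx_file may be a path or a binary file-like object.
    """
    images = {}
    with zipfile.ZipFile(docx_file) as archive:
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        for rel in rels.iter("{%s}Relationship" % REL_NS):
            target = rel.get("Target", "")
            if target.startswith("media/"):
                # Targets are relative to the word/ folder.
                images[rel.get("Id")] = archive.read("word/" + target)
    return images
```

Calling image_parts('DocA.docx') (file name assumed) would give you the bytes behind ids like rId8 and rId9; to show up in DocB, the image then has to be added there as well, for example by saving the bytes to a file and using add_picture as shown above.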

This may not be a direct answer to your question, but it is worth considering.
If you have control over DocA, have you considered using a docx template? In my case I needed to generate reports from a template, so I had to copy information from Python variables into a document. I found this library, which does the replacement: https://github.com/elapouya/python-docx-template
You can then fill in the content from your variables like this:
from docxtpl import DocxTemplate
doc = DocxTemplate("my_word_template.docx")
context = { 'company_name' : "World company" }
doc.render(context)
doc.save("generated_doc.docx")
I have not checked, but I believe this preserves formatting.

Python removing duplicate names

I have plain text file with words in each line:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
3521 >India<TOPONYM>/O
3526 >Zimbabwe<TOPONYM>/O
3531 >England<TOPONYM>/O
3536 >Melbourne<TOPONYM>/O
3541 >England<TOPONYM>/O
3546 >England<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3556 >England<TOPONYM>/O
3561 >England<TOPONYM>/O
3566 >Australia<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3821 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4234 >Hampden<TOPONYM>/O
4239 >Hampden<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4845 >Edinburgh<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
I want to remove repeated location names from this list, so that it looks like this:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
3526 >Zimbabwe<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
I want to remove the duplicate location names, while the DOCID lines should remain in the file. I know there is a way to do this on Linux with uniq, but if I run that it will also remove locations that repeat across different DOCIDs.
Is there any way to split the file at every DOCID and remove duplicate location names only within each DOCID?

I am writing from mobile, so this will not be a complete solution, but the key points:
import re

docid = re.compile(r"^ *\d+ +<DOCID>")
# Capture the name up to the first '<' or '/', skipping an optional leading '>'.
location = re.compile(r"^ *\d+ +>?([^</]+)")
seen = {}
with open('testfile') as f:  # assumed name of the input file
    for line in f:
        if docid.match(line):
            seen = {}  # new DOCID: forget the locations read so far
            print(line, end='')
        else:
            loc = location.findall(line)[0]
            if loc not in seen:
                print(line, end='')
                seen[loc] = True
Basically, it checks whether each line is a new DOCID. If it isn't, it reads the location and checks whether that location has already been seen. If not, it prints the line and adds the location to the list of locations read.
If there is a new DOCID, it resets the list of read locations.

Here is a way to do it.
import string

filename = 'testfile'
with open(filename, 'r') as f:
    lines = f.read().splitlines()
exclude = set(string.punctuation)
final_list = []
unique_list = []  # this resets itself every docid
for line in lines:
    if 'DOCID' in line:
        unique_list = []  # this resets itself every docid
        final_list.append(line)
    else:
        # Replace punctuation with spaces, then take the word after the line number.
        cleaned = ''.join(ch if ch not in exclude else ' ' for ch in line)
        city = cleaned.split()[1]
        if city not in unique_list:
            unique_list.append(city)
            final_list.append(line)
for line in final_list:
    print(line)
output:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3526 >Zimbabwe<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
Note: testfile is a text file containing your input text. You can optimize the code if necessary.
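As one possible optimization, here is a sketch that keeps the seen names in a set (constant-time membership test) and compares the whole name, so a two-word location such as "Fir Park" is not reduced to its first word. The function name dedupe_per_doc and the two regular expressions are assumptions made for this example, based only on the sample lines in the question.

```python
import re

DOCID = re.compile(r"^\s*\d+\s+<DOCID>")
# Name = everything after the line number (and an optional '>')
# up to the first '<', '>' or '/'.
NAME = re.compile(r"^\s*\d+\s+>?([^<>/]+)")


def dedupe_per_doc(lines):
    """Keep every DOCID line, and only the first occurrence of each
    location name between two DOCID lines."""
    seen = set()
    kept = []
    for line in lines:
        if DOCID.match(line):
            seen = set()  # a new document starts: reset the names
            kept.append(line)
            continue
        match = NAME.match(line)
        name = match.group(1).strip() if match else None
        if name is None or name not in seen:
            kept.append(line)  # unparseable lines are kept as they are
            if name is not None:
                seen.add(name)
    return kept


sample = [
    "3210 <DOCID>GH950102-000003<DOCID>/O",
    "3360 England/LOCATION",
    "3497 England/LOCATION",
    "3531 >England<TOPONYM>/O",
    "3551 >Glasgow<TOPONYM>/O",
    "3568 <DOCID>GH950102-000004<DOCID>/O",
    "4161 Fir Park/LOCATION",
    "4229 Park<TOPONYM>/O",
    "4249 >Glasgow<TOPONYM>/O",
]
result = dedupe_per_doc(sample)
for line in result:
    print(line)
```

On this sample the two repeated England lines are dropped, while Glasgow is kept once under each DOCID.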

How to extract dates and all of the data following them using re.findall in python

I have a string, and I'm hoping to use regular expressions to separate some of the information from the string.
The following code that I have has two problems that I'm aware of: 1) it is failing to capture the date, "20120623" for a reason that I'm not quite sure I understand (perhaps the compiled regex query is not doing what I expect it to do?), and 2) the current regex expression I'm using to designate how the expression ends won't work for the last match in the string.
I may revise the title of this question, depending on what the problem/solution ends up being.
The current code:
import re
s='20120622 10.3 -84.8$Sabulodes colombiata;Clemensia cincinnata;Gn sp_san_luis_1037;Glena mopsaria;Euphyia sp_group_san_luis;Thysanopyga carfinia;Glena mopsaria;Eusarca minucia;Euphyia sp_group_san_luis;Scopula compensata;Gn sp_san_luis_1028;Gn sp_san_luis_7003;Trygodes amphion;Phyllodonta latrata;Gn sp_san_luis_1003;Idaea sp_group_san_luis;Gn sp_san_luis_1002;Melanolophia sp_san_luis;Hemiceras pernubila;Leucula meganira;Oospila athena;Gn sp_san_luis_6001;Epimecis semicompleta;Gn sp_san_luis_1004;Glena mopsaria;Euphyia sp_group_san_luis;Iridopsis validaria;Eusarca asteria--cayennaria;Disphragis proba;Trygodes amphion;Gn sp_san_luis_8022;Melanolophia sp_san_luis;Melanolophia sp_san_luis;Melanolophia sp_san_luis;Idaea sp_group_san_luis;Gn sp_san_luis_1013;Gn sp_san_luis_7001;Disphragis proba;Gn sp_san_luis_1003;Eois sp_san_luis_b;Prochoerodes striata;Nola sp_san_luis_a;Oospila venezuelata;Eois sp_san_luis_b;Oospila venezuelata;Glena mopsaria;Idaea sp_group_san_luis;Oospila venezuelata;Gn sp_san_luis_1002;Phyllodonta latrata;Virbia sp_san_luis_a;Oospila venezuelata;Hemiceras pernubila;Pantherodes conglomerata;Nephodia betala;Melanolophia sp_san_luis;Herbita medama--medona;Euphyia sp_group_san_luis;Idaea sp_group_san_luis;Glena mopsaria;Pantherodes conglomerata;Glena mopsaria;Eois sp_san_luis_k;Melanolophia sp_san_luis;Idaea sp_group_san_luis;Eois sp_san_luis_b;Idaea sp_group_san_luis;Oospila venezuelata;Gn sp_san_luis_1010;Gn sp_san_luis_7001;Eois sp_san_luis_k;Hemiceras pernubila;Dyspteris tenuivitta;Macaria pernicata;Hemiceras pernubila;Hymenia perspectalis;Oxydia sp_san_luis_c;Cliniodes opalalis;Gn sp_san_luis_1042;Oospila athena;Eois sp_san_luis_b;Hemiceras rufescens;Glena mopsaria;Gn sp_san_luis_1011;Phyllodonta latrata;Simopteryx torquataria;Perasia helvina;Nola sp_san_luis_a;Gn sp_san_luis_2023;Nephodia auxesia;Nephodia betala;Glena mopsaria;Ametris nitocris;Gn sp_san_luis_1003;Glena mopsaria;Hymenia perspectalis;Phyllodonta latrata;Eusarca 
asteria--cayennaria;Disphragis proba;Gn sp_san_luis_2028;Ptychamalia sp_san_luis_a;Nola sp_san_luis_a;Anticla antica;Nola sp_san_luis_a;Semaeopus sabuloides;Simopteryx torquataria;Gn sp_san_luis_1002;Phyllodonta latrata;Glena mopsaria;Gizma undilinealis;Eois sp_san_luis_d;Oxydia bilinea;Hemiceras pernubila;Hemiceras pernubila;Thysanopyga amarantha;Plusiodonta sp_san_luis_a;Gn sp_san_luis_1011;Cyclophora sp_san_luis_a;Bleptina caradrinalis;Opharus rudis;Melanolophia sp_san_luis;Schrankia macula;Leucula meganira;Iridopsis lurida--oberthuri--herse;Schrankia macula;Letis buteo;Scopula sp_group_san_luis;Herbita amicaria;Gn sp_san_luis_1013;Adhemarius ypsilon;Adhemarius ypsilon;Adhemarius ypsilon;Euphyia sp_group_san_luis;Idaea tacturata;Adhemarius ypsilon;Hemerophila gradella;Acrosemia vulpecularia;Pyrgion repanda;Epimecis matronaria;Scopula sp_group_san_luis;Eois sp_san_luis_b;Virbia sp_san_luis_a;Conchylodes erinalis;Eois sp_san_luis_b;Epimecis matronaria;Melanolophia sp_san_luis;Melese monima;Rhabdatomis laudamia;Lirimiris inopinata;Josia sp_group_san_luis;Pseudodirphia menander;Josia sp_group_san_luis;Disphragis proba;Cliniodes opalalis;Melanolophia sp_san_luis;Prorifrons rufescens;Antiblemma concinnula;Melanolophia sp_san_luis;Nola sp_san_luis_a;Melanolophia sp_san_luis;Cliniodes opalalis;Eois sp_san_luis_b;Renia vinasalis;Eucereon relegata;Colla rhodope;Conchylodes erinalis;Oospila venezuelata;Semaeopus illimitata;Eucereon aurantiaca;Macaria approximaria--gambarina--ostia;Hemiceras modesta;Microphysetica hermeasalis;Nola sp_san_luis_a;Melanolophia sp_san_luis;Oospila venezuelata;Cautethia spuria;Euphyia sp_group_san_luis;Gn sp_san_luis_8046;Polypoetes villia;Synnomos sp_san_luis_a;Hemiceras pernubila;Hemerophila gradella;Nola sp_san_luis_a;Iridopsis validaria;Sarsina purpurascens;Argyrotome prospecta;Gn sp_san_luis_2008;Eulepidotis alabastraria;Gn sp_san_luis_7001;Desmia bajulalis;Eusarca asteria--cayennaria;Nola sp_san_luis_a;Oospila venezuelata;Macaria 
approximaria--gambarina--ostia;Epimecis matronaria;Glena mopsaria;Euphyia sp_group_san_luis;Iridopsis lurida--oberthuri--herse;Gizma undilinealis;Dichomeris arotrosema;Iridopsis lurida--oberthuri--herse;Disphragis proba;Gn sp_san_luis_7003;Eusarca asteria--cayennaria;Acharia hyperoche;Thysanopyga amarantha;Oospila athena;Glena mopsaria;Crambidia myrlosea;Marimatha nigrofimbria;Eucereon aurantiaca;Euphyia sp_group_san_luis;Gn sp_san_luis_1023;Pyrgion repanda;Microphysetica hermeasalis;Lobocleta tenellata;Scopula sp_group_san_luis;Clepsis sp_san_luis_a;Iridopsis lurida--oberthuri--herse;Argyrotome prospecta;Phrygionis polita;Marimatha nigrofimbria;Iridopsis lurida--oberthuri--herse;Semaeopus viridiplaga;Euphyia sp_group_san_luis;Iridopsis lurida--oberthuri--herse;Nephodia auxesia;Josia sp_group_san_luis;Herbita amicaria;Melanolophia sp_san_luis;Semaeopus illimitata;Thysanopyga amarantha;Opisthoxia miletia;Hymenia perspectalis;Gn sp_san_luis_4001;Gn sp_san_luis_8057;Oospila athena;Hymenia perspectalis 20120623 10.3 -84.8$Amorbia emigratella;Idaea sp_group_san_luis;Oospila athena;Lomographa argentata;Idaea sp_group_san_luis;Dysodia oculatana;Ptychamalia sp_san_luis_a;Gizma undilinealis;Udea rubigalis;Antaeotricha sp_san_luis_b;Bryoptera friaria;Euphyia sp_group_san_luis;Gn sp_san_luis_1028;Glena mopsaria;Epicrisias eschara;Gn sp_san_luis_1037;Hemiceras pernubila;Melanolophia sp_san_luis;Eois sp_san_luis_b;Lonomia electra;Gn sp_san_luis_1027;Macaria approximaria--gambarina--ostia;Gn sp_san_luis_1002;Gn sp_san_luis_1001;Eusarca asteria--cayennaria;Eucereon tigrata;Simopteryx torquataria;Phyllodonta latrata;Eois sp_san_luis_b;Thysanopyga amarantha;Oospila athena;Scopula sp_group_san_luis;Oospila venezuelata;Lineodes sp_san_luis_a;Gn sp_san_luis_1001;Glena mopsaria;Glena mopsaria;Amorbia emigratella;Oospila venezuelata;Oospila venezuelata;Oospila venezuelata;Nola sp_san_luis_a;Oospila venezuelata;Gn sp_san_luis_1002;Glena mopsaria;Gn sp_san_luis_1002;Anomis 
texana;Melanolophia sp_san_luis;Paragonia cruraria;Euphyia sp_group_san_luis;Gn sp_san_luis_1013;Spodoptera eridania;Eois sp_san_luis_f;Euphyia sp_group_san_luis;Herbita amicaria;Disphragis proba;Anomis texana;Hemiceras pernubila;Thysanopyga amarantha;Physocleora pauper;Disphragis proba;Acrosemia vulpecularia;Melanolophia sp_san_luis;Glena mopsaria;Gn sp_san_luis_1023;Gn sp_san_luis_1023;Gn sp_san_luis_1001;Gn sp_san_luis_1003;Bagisara laverna;Clemensia leopardina;Conchylodes erinalis;Gn sp_san_luis_1001;Bleptina caradrinalis;Gn sp_san_luis_1001;Nephodia auxesia;Leucula meganira;Nola sp_san_luis_a;Thysanopyga amarantha;Eois sp_san_luis_e;Orthofidonia sp_san_luis_a;Isogona continua--natatrix;Oospila venezuelata;Manduca lucetius;Manduca lucetius;Rhabdatomis laudamia;Gn sp_san_luis_7003;Oxydia trychiata;Idaea sp_group_san_luis;Stenoma byssina;Melanolophia sp_san_luis;Amorbia emigratella;Thysanopyga amarantha;Marimatha nigrofimbria;Iridopsis lurida--oberthuri--herse;Clepsis sp_san_luis_a;Sabulodes colombiata;Scopula sp_group_san_luis;Idalus crinis;Macrocneme iole;Urodus sp_san_luis_c;Hymenia perspectalis;Gonodonta paraequalis;Disphragis proba;Rhabdatomis laudamia;Ethmia exornata;Hymenia perspectalis;Gn sp_san_luis_2021;Quentalia subumbrata;Disphragis proba;Oospila venezuelata;Amorbia emigratella;Glena mopsaria;Herbita amicaria;Hylesia continua;Hylesia continua;Hylesia continua;Gn sp_san_luis_1023;Ethmia exornata;Gn sp_san_luis_1003;Psilosetia pura;Psilosetia pura;Psilosetia pura;Thysanopyga casperia;Glena mopsaria;Nola sp_san_luis_a;Herbita amicaria;Idaea sp_group_san_luis;Gn sp_san_luis_4001;Clemensia leopardina;Gn sp_san_luis_1004;Thysanopyga amarantha;Conchylodes erinalis;Disphragis proba;Semaeopus illimitata;Gn sp_san_luis_1010;Idaea sp_group_san_luis;Renia vinasalis;Euphyia sp_group_san_luis;Glena mopsaria;Josia sp_group_san_luis;Eois sp_san_luis_e;Stenoma byssina;Pyrgion repanda;Glena mopsaria;Cimicodes albicosta;Phyllodonta latrata;Lineodes sp_san_luis_a;Hymenia 
perspectalis;Acharia hyperoche;Conchylodes erinalis;Oxydia bilinea;Iridopsis lurida--oberthuri--herse;Hemiceras pernubila;Oxydia masthala;Isogona continua--natatrix;Amorbia emigratella;Cliniodes opalalis;Hypena livia;Cecharismena sp_san_luis_a;Hemiceras nigricosta;Glenopteris oculifera;Melanolophia sp_san_luis;Antaeotricha sp_san_luis_a;Leucanopsis longa;Anomis texana;Crambidia cephalica;Ascalapha odorata;Ascalapha odorata 20120623 33.9 -83.3$Renia flavipunctalis;Peridea basitriens;Peridea basitriens;Idaea obfusaria;Eulithis diversilineata;Clemensia albata;Acrolophus texanella;Chytonix palliatricula;Idia rotundalis;Spilosoma congrua;Spilosoma congrua;Spilosoma congrua;Parapediasia decorellus;Aethiophysa lentiflualis;Datana major;Datana major;Spodoptera ornithogalli;Cisthene packardii;Idaea obfusaria;Nigetia formosalis;Clemensia albata;Eutrapela clemataria;Ectropis crepuscularia;Heterocampa obliqua;Heterocampa obliqua;Baileya dormitans;Datana drexelii;Datana drexelii;Eulithis diversilineata;Crambidia pallida--uniformis;Arta statalis;Megalopyge opercularis;Megalopyge opercularis;Heterocampa obliqua;Isochaetes beutenmuelleri;Eudryas grata;Eudryas grata;Parasa chloris;Parasa chloris;Palthis asopialis;Apoda biguttata;Apoda biguttata;Paectes abrostoloides;Clemensia albata;Loxostegopsis merrickalis;Loxostegopsis merrickalis;Hypagyrtis esther;Idia julia;Microcrambus elegans;Megalopyge opercularis;Melanolophia signataria;Ectropis crepuscularia;Nigetia formosalis 20120624 10.3 -84.8$Nola sp_san_luis_a;Nola sp_san_luis_a;Gn sp_san_luis_2009;Amorbia emigratella;Eois sp_san_luis_b;Eupithecia sp_group_san_luis;Eusarca asteria--cayennaria;Gn sp_san_luis_1027;Glena mopsaria;Iridopsis validaria;Macaria carpo;Glena mopsaria;Campatonema lineata;Eusarca asteria--cayennaria;Ametris nitocris;Melanolophia sp_san_luis;Iridopsis lurida--oberthuri--herse;Thysanopyga carfinia;Conchylodes erinalis;Gn sp_san_luis_1004;Phyllodonta latrata;Gn sp_san_luis_1001;Nola sp_san_luis_a;Gn 
sp_san_luis_1001;Glena mopsaria;Gn sp_san_luis_1003;Euphyia sp_group_san_luis;Eois sp_san_luis_b;Euphyia sp_group_san_luis;Pareuchaetes insulata;Phyllodonta latrata;Glena mopsaria;Xylophanes porcus;Trygodes amphion;Synnomos sp_san_luis_a;Nola sp_san_luis_a;Eucereon aroa;Nephodia betala;Eusarca asteria--cayennaria;Thysanopyga amarantha;Gn sp_san_luis_1024;Amorbia emigratella;Conchylodes erinalis;Sphacelodes vulneraria;Sabulodes arge;Nola sp_san_luis_a;Gn sp_san_luis_1042;Euphyia sp_group_san_luis;Simopteryx torquataria;Clepsis sp_san_luis_a;Oospila venezuelata;Oxydia bilinea;Oospila venezuelata;Eusarca asteria--cayennaria;Anomis texana;Eusarca crameraria--brown;Nola sp_san_luis_a;Melanolophia sp_san_luis;Nola sp_san_luis_a;Melanolophia sp_san_luis;Gn sp_san_luis_8097;Gn sp_san_luis_1001;Nola sp_san_luis_a;Semaeopus viridiplaga;Gn sp_san_luis_1023;Iridopsis lurida--oberthuri--herse;Isochromodes caleta;Isochromodes caleta;Antaeotricha sp_san_luis_d;Crambidia myrlosea;Oospila venezuelata;Gn sp_san_luis_8032;Phyllodonta latrata;Amorbia emigratella;Oospila venezuelata;Gn sp_san_luis_2022;Glena mopsaria;Oospila venezuelata;Glena mopsaria;Euphyia sp_group_san_luis;Gn sp_san_luis_1001;Macaria carpo;Hemiceras pernubila;Gn sp_san_luis_1003;Pyrgion repanda;Gn sp_san_luis_1002;Gn sp_san_luis_1023;Opharus rudis;Tricentrogyna vinacea;Trygodes amphion;Anticarsia gemmatalis;Eois sp_san_luis_b;Glena mopsaria;Euphyia sp_group_san_luis;Synnomos sp_san_luis_a;Nola sp_san_luis_a;Glena mopsaria;Trygodes amphion;Lobocleta tenellata;Gn sp_san_luis_7001;Thysanopyga amarantha;Synnomos sp_san_luis_a;Gn sp_san_luis_1002;Gn sp_san_luis_1002;Gn sp_san_luis_1002;Eusarca asteria--cayennaria;Cratoptera zarumata;Sabulodes colombiata;Synnomos sp_san_luis_a;Stenoma byssina;Sanys irrosea;Gn sp_san_luis_1001;Gn sp_san_luis_1011;Euphyia sp_group_san_luis;Sanys irrosea;Herbita aglausaria;Eois sp_san_luis_b;Bagisara laverna;Anticarsia gemmatalis;Melanolophia sp_san_luis;Eois sp_san_luis_e;Synnomos urota;Gn 
sp_san_luis_1003;Iridopsis lurida--oberthuri--herse;Gn sp_san_luis_1002;Gn sp_san_luis_1003;Eusarca asteria--cayennaria;Gn sp_san_luis_8097;Opharus rudis;Thysanopyga amarantha;Scopula sp_group_san_luis;Gn sp_san_luis_1042;Idaea sp_group_san_luis;Eusarca asteria--cayennaria;Pareuchaetes insulata;Iridopsis validaria;Glena mopsaria;Gn sp_san_luis_1003;Herminocala sabata;Gizma undilinealis;Nola sp_san_luis_a;Herminocala sabata;Josia sp_group_san_luis;Melanolophia sp_san_luis;Gn sp_san_luis_7001;Polypoetes villia;Gn sp_san_luis_2006;Gn sp_san_luis_8073;Nola sp_san_luis_a;Eois sp_san_luis_b;Glena mopsaria;Gn sp_san_luis_7001;Melanolophia sp_san_luis;Gn sp_san_luis_1004;Idaea sp_group_san_luis;Nola sp_san_luis_a;Eusarca sp_san_luis_a;Gn sp_san_luis_1002;Gn sp_san_luis_8005;Idaea sp_group_san_luis;Dichomeris arotrosema;Simopteryx torquataria;Idaea sp_group_san_luis;Nola sp_san_luis_a;Thysanopyga amarantha;Simopteryx torquataria;Thysanopyga amarantha;Gn sp_san_luis_1023;Trygodes amphion;Eois sp_san_luis_b;Semaeopus viridiplaga;Euphyia sp_group_san_luis;Hemiceras pernubila;Elaphria sp_san_luis_a;Rivula leucosticta;Scopula sp_group_san_luis;Thysanopyga amarantha;Gn sp_san_luis_7001;Semaeopus viridiplaga;Dichomeris arotrosema;Melanolophia sp_san_luis;Epimecis semicompleta;Phrygionis platinata;Gn sp_san_luis_1005;Opharus rudis;Melanolophia sp_san_luis;Pachylia syces;Pachylia syces;Pachylia syces;Pachylia syces;Pachylia syces;Thysanopyga amarantha;Eois sp_san_luis_b;Semaeopus sp_san_luis_a;Bertholdia specularis;Melese monima;Rhabdatomis laudamia;Tosale aucta;Phostria tedea;Ischnurges eudamidasalis;Semaeopus viridiplaga;Trygodes amphion;Opisthoxia miletia;Eusarca melenda;Conchylodes erinalis;Conchylodes erinalis;Glena mopsaria;Melanolophia sp_san_luis;Hymenia perspectalis;Tosale aucta;Gn sp_san_luis_7001;Euphyia sp_group_san_luis;Gn sp_san_luis_8001;Scopula sp_group_san_luis;Glena mopsaria;Amorbia emigratella;Eois sp_san_luis_f;Amorbia emigratella;Amorbia emigratella;Hemerophila 
gradella;Amorbia emigratella;Givira lineaeplena;Rhabdatomis laudamia;Hypena andraca;Macaria carpo;Clepsis sp_san_luis_a;Euphyia sp_group_san_luis;Gn sp_san_luis_2020;Clepsis sp_san_luis_a;Clepsis sp_san_luis_a;Disphragis proba;Cliniodes opalalis;Gn sp_san_luis_7003;Hymenia perspectalis;Micrathetis dasarada;Iridopsis lurida--oberthuri--herse;Diphthera festiva;Gn sp_san_luis_1001;Thysanopyga amarantha;Desmia bajulalis;Idaea sp_group_san_luis;Euphyia sp_group_san_luis;Conchylodes concinnalis;Hymenia perspectalis;Euphyia sp_group_san_luis;Tetanolita mynesalis;Pachylioides resumens;Cratoptera zarumata;Iridopsis lurida--oberthuri--herse;Glena mopsaria;Euphyia sp_group_san_luis;Gn sp_san_luis_2003;Udea rubigalis;Cimicodes albicosta;Euphyia sp_group_san_luis;Renodes curviluna;Trygodes amphion;Iridopsis lurida--oberthuri--herse;Lobocleta tenellata;Iridopsis lurida--oberthuri--herse;Gn sp_san_luis_1001;Oospila venezuelata;Xenosoma nigromarginatum;Eucereon relegata;Iridopsis lurida--oberthuri--herse;Renodes curviluna;Macaria approximaria--gambarina--ostia;Glena mopsaria;Lineodes sp_san_luis_a;Cosmosoma impar;Eucereon tigrata;Eusarca asteria--cayennaria;Sabulodes colombiata;Pseudodirphia menander;Hymenia perspectalis;Eucereon tigrata;Sabulodes colombiata;Oospila venezuelata;Pyrgion repanda;Ischnurges eudamidasalis 20120624 33.9 -83.3$Acrolophus popeanella;Feltia subterranea;Acrolophus texanella;Acrolophus texanella;Neodactria luteolella;Neodactria luteolella;Renia flavipunctalis;Lithacodes fasciola;Bleptina inferior;Amydria effrentella;Cisthene packardii;Baileya ophthalmica;Protoboarmia porcelaria;Glyphidocera lactiflosella;Clemensia albata;Spilosoma congrua;Spilosoma congrua;Spilosoma congrua;Rhyacionia rigidana;Diatraea lisetta'
CR_query=re.compile(r'(\d\d\d\d\d\d\d\d\s)10.3\s-84.8\$(.*?)\d\d\d\d\d\d\d\d')
x=re.findall(CR_query,s)
d=[i[0] for i in x]
print "d", d
c=[i[1] for i in x]
print "c", c
print "len(c)", len(c)

Try using the regex:
(\d{8}\s)10\.3\s-84\.8\$(.*?)(?=\d{8}|$)
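Your current regex prevented successive matches because each match consumed the 8-digit date that starts the next record, so that record could no longer match. I fixed this with a positive lookahead (?= ... ), a zero-width assertion that checks for the next date without consuming it. The |$ alternative lets the last record match at the end of the string.
Also, I escaped your periods. A dot must be escaped when it is meant as a literal dot; unescaped, it matches any character.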
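A minimal sketch of the difference, using a shortened, made-up sample string in the same layout as the question's data (the sample contents and variable names are illustrative, not the asker's real data). With the consuming pattern the second 10.3 -84.8 record is lost, because its date was already used up as the tail of the first match; with the lookahead both records are found:

```python
import re

# Hypothetical, shortened sample in the question's "date lat lon$species;..." layout.
sample = ("20120623 10.3 -84.8$Glena mopsaria;Anomis texana "
          "20120624 10.3 -84.8$Nola sp_a;Macaria carpo "
          "20120624 33.9 -83.3$Datana major")

# Original approach: the trailing \d{8} consumes the next record's date.
consuming = re.compile(r'(\d{8}\s)10\.3\s-84\.8\$(.*?)\d{8}')
# Suggested approach: the lookahead only peeks at the next date (or end of string).
lookahead = re.compile(r'(\d{8}\s)10\.3\s-84\.8\$(.*?)(?=\d{8}|$)')

consuming_matches = consuming.findall(sample)
lookahead_matches = lookahead.findall(sample)

# Strip the captured whitespace and split each species list on ';'.
dates = [date.strip() for date, _ in lookahead_matches]
species = [names.strip().split(';') for _, names in lookahead_matches]

print(len(consuming_matches), len(lookahead_matches))
print(dates)
print(species)
```

Splitting on ';' afterwards keeps the regex simple; the pattern only has to find record boundaries.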

It doesn't look like your regex will match your string.
Your regex looks for 8 digits, followed by "10.3 -84.8$" and then captures everything and then 8 more digits afterwards. However, it doesn't look like your string has those 8 digits anywhere else.

Wrong Range Rate with Pyephem

I am trying to calculate a satellite's range rate using Python and pyephem. Unfortunately, pyephem's result seems to be wrong.
Compared with the values calculated by other satellite tracking programs such as GPredict or Ham Radio Deluxe, the difference is as large as 2 km/sec. The calculated azimuth and elevation angles are almost the same, though. The TLEs are new and the system clock is the same.
Do you see any mistake in my code, or do you have an idea what else could cause the error?
Thank you very much!
Here is my Code:
import ephem
import time
#TLE Kepler elements
line1 = "ESTCUBE 1"
line2 = "1 39161U 13021C 13255.21187718 .00000558 00000-0 10331-3 0 3586"
line3 = "2 39161 98.1264 332.9982 0009258 190.0328 170.0700 14.69100578 18774"
satellite = ephem.readtle(line1, line2, line3) # create ephem object from tle information
while True:
    city = ephem.Observer() # recreate Observer with current time
    city.lon, city.lat, city.elevation = '52.5186' , '13.4080' , 100
    satellite.compute(city)
    RangeRate = satellite.range_velocity/1000 # get RangeRate in km/sec
    print ("RangeRate: " + str(RangeRate))
    time.sleep(1)
I recorded some range rate values from the script and from GPredict to make the error reproducible:
ESTCUBE 1
1 39161U 13021C 13255.96108453 .00000546 00000-0 10138-3 0 3602
2 39161 98.1264 333.7428 0009246 187.4393 172.6674 14.69101320 18883
date: 2013-09-13
time      pyephem-Script  Gpredict
14:07:02  -1.636          -3.204
14:12:59  -2.154          -4.355
14:15:15  -2.277          -4.747
14:18:48  -2.368          -5.291
I also added some lines to calculate the satellite's elevation and coordinates:
elevation = satellite.elevation
sat_latitude = satellite.sublat
sat_longitude = satellite.sublong
The results with time stamp are:
2013-09-13 14:58:13
RangeRate: 2.15717797852 km/s
Range: 9199834.0
Sat Elevation: 660743.6875
Sat_Latitude: -2:22:27.3
Sat_Longitude: -33:15:15.4
2013-09-13 14:58:14
RangeRate: 2.15695092773 km/s
Range: 9202106.0
Sat Elevation: 660750.9375
Sat_Latitude: -2:26:05.8
Sat_Longitude: -33:16:01.7
Another important piece of information might be that I am trying to calculate the Doppler frequency for a satellite pass, which is why I need the range rate:
f_Doppler_corrected = (c0/(c0 + RangeRate))*f0
Range rate describes the velocity of a moving object along the line of sight to the observer. Maybe range_velocity is something different?
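To show how much the disagreement matters for the Doppler correction, here is a small sketch of the formula above. The carrier frequency f0 is a made-up value for illustration, and the two range rates are the 14:07:02 readings from the table above converted to m/s; the function and variable names are not from any library.

```python
C0 = 299792458.0  # speed of light in m/s


def doppler_corrected(f0_hz, range_rate_m_s):
    """Apply the formula from the question: f = c0 / (c0 + range_rate) * f0."""
    return C0 / (C0 + range_rate_m_s) * f0_hz


f0 = 437.0e6  # hypothetical carrier frequency in Hz

# Range rates at 14:07:02 from the table, in m/s (negative = approaching).
shift_pyephem = doppler_corrected(f0, -1636.0) - f0
shift_gpredict = doppler_corrected(f0, -3204.0) - f0

print("shift with pyephem range rate:  %.1f Hz" % shift_pyephem)
print("shift with Gpredict range rate: %.1f Hz" % shift_gpredict)
```

Both shifts are positive for an approaching satellite, but they differ by roughly a factor of two, so the choice of range rate directly changes the tuned frequency.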

It seems that pyephem (with libastro as a backend) and gpredict (with predict as a backend) calculate the satellite velocity differently. I am attaching the detailed output of a comparison for an actual reference observation. Both output the correct position, but only gpredict outputs reasonable range rate values. The error seems to occur in the satellite velocity vector. I would say that the results from gpredict are more plausible (and the corresponding code in libastro is marked with question marks), so I will propose a fix in libastro to handle it as gpredict does. Maybe someone who understands the math behind it can add to this.
I added another tool, PyPredict (also predict-based), to get a third set of calculations. Its values are off, however, so something else must be going on there.
Pyephem: 3.7.5.3
Gpredict: 1.3
PyPredict 1.1 (Git: 10/02/2015)
OS: Ubuntu x64
Python 2.7.6
Time:
Epoch timestamp: 1420086600
Timestamp in milliseconds: 1420086600000
Human time (GMT): Thu, 01 Jan 2015 04:30:00 GMT
ISS (ZARYA)
1 25544U 98067A 15096.52834639 .00016216 00000-0 24016-3 0 9993
2 25544 51.6469 82.0200 0006014 185.1879 274.8446 15.55408008936880
observation point: N0 E0 alt=0
Test 1:
Gpredict: (Time, Az, El, Slant Range, Range Velocity)
2015 01 01 04:30:00 202.31 -21.46 5638 -5.646
2015 01 01 04:40:00 157.31 -2.35 2618 -3.107
2015 01 01 04:50:00 72.68 -10.26 3731 5.262
Pyephem 3.7.5.3 (default atmospheric refraction)
(2015/1/1 04:30:00, 202:18:45.3, -21:27:43.0, 5638.0685, -5.3014228515625)
(2015/1/1 04:40:00, 157:19:08.3, -1:21:28.6, 2617.9915, -2.934402099609375)
(2015/1/1 04:50:00, 72:40:59.9, -10:15:15.1, 3730.78375, 4.92381201171875)
No atmospheric refraction
(2015/1/1 04:30:00, 202:18:45.3, -21:27:43.0, 5638.0685, -5.3014228515625)
(2015/1/1 04:40:00, 157:19:08.3, -1:21:28.6, 2617.9915, -2.934402099609375)
(2015/1/1 04:50:00, 72:40:59.9, -10:15:15.1, 3730.78375, 4.92381201171875)
Pypredict
1420086600.0
{'decayed': 0, 'elevation': -19.608647085869123, 'name': 'ISS (ZARYA)', 'norad_id': 25544, 'altitude': 426.45804846615556, 'orbit': 92208, 'longitude': 335.2203454719759, 'sunlit': 1, 'geostationary': 0, 'footprint': 4540.173580837984, 'epoch': 1420086600.0, 'doppler': 1635.3621339278857, 'visibility': 'D', 'azimuth': 194.02436209048014, 'latitude': -45.784314563471646, 'orbital_model': 'SGP4', 'orbital_phase': 73.46488929141783, 'eclipse_depth': -8.890253049060693, 'slant_range': 5311.3721164183535, 'has_aos': 1, 'orbital_velocity': 27556.552465256085}
1420087200.0
{'decayed': 0, 'elevation': -6.757496200551716, 'name': 'ISS (ZARYA)', 'norad_id': 25544, 'altitude': 419.11153234752874, 'orbit': 92208, 'longitude': 9.137628905963876, 'sunlit': 1, 'geostationary': 0, 'footprint': 4502.939901708917, 'epoch': 1420087200.0, 'doppler': 270.6901377419433, 'visibility': 'D', 'azimuth': 139.21315598291235, 'latitude': -20.925997669236732, 'orbital_model': 'SGP4', 'orbital_phase': 101.06301876416072, 'eclipse_depth': -18.410968838249545, 'slant_range': 3209.8444916123644, 'has_aos': 1, 'orbital_velocity': 27568.150821416708}
1420087800.0
{'decayed': 0, 'elevation': -16.546383900323555, 'name': 'ISS (ZARYA)', 'norad_id': 25544, 'altitude': 414.1342802649042, 'orbit': 92208, 'longitude': 31.52356804788407, 'sunlit': 1, 'geostationary': 0, 'footprint': 4477.499436144489, 'epoch': 1420087800.0000002, 'doppler': -1597.032808834609, 'visibility': 'D', 'azimuth': 76.1840387294104, 'latitude': 9.316828913183791, 'orbital_model': 'SGP4', 'orbital_phase': 128.66115193399546, 'eclipse_depth': -28.67721196244149, 'slant_range': 4773.838774518728, 'has_aos': 1, 'orbital_velocity': 27583.591664378775}
Test 2 (short time):
Gpredict: (Slant Range, Range Velocity)
2015 01 01 04:30:00 5638 -5.646
2015 01 01 04:30:10 5581 -5.648
->5.7 km/s avg
Pyephem: (Time, Slant Range, Range Velocity)
(2015/1/1 04:30:00, 5638.0685, -5.3014228515625)
(2015/1/1 04:30:10, 5581.596, -5.30395361328125)
->5.7 km/s avg
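The averages above are plain finite differences of the slant range over the 10 s interval. A short sketch of that arithmetic, using only the numbers listed in Test 2 (the helper name is illustrative):

```python
def average_range_rate(range_start_km, range_end_km, dt_s):
    """Mean range rate over an interval: change in slant range divided by time."""
    return (range_end_km - range_start_km) / dt_s


# Slant ranges at 04:30:00 and 04:30:10, taken from Test 2.
gpredict_avg = average_range_rate(5638.0, 5581.0, 10.0)
pyephem_avg = average_range_rate(5638.0685, 5581.596, 10.0)

print("Gpredict average: %.3f km/s" % gpredict_avg)
print("Pyephem average:  %.3f km/s" % pyephem_avg)
```

The slant ranges of both tools imply an average of about -5.6 to -5.7 km/s, which is close to Gpredict's reported range velocity but not to the -5.30 km/s that pyephem reports for the same interval.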
Pyephem
import ephem
import time
#TLE Kepler elements
line1 = "ISS (ZARYA)"
line2 = "1 25544U 98067A 15096.52834639 .00016216 00000-0 24016-3 0 9993"
line3 = "2 25544 51.6469 82.0200 0006014 185.1879 274.8446 15.55408008936880"
satellite = ephem.readtle(line1, line2, line3) # create ephem object from tle information
obs = ephem.Observer() # create the Observer; the date is set explicitly below
obs.lon, obs.lat, obs.elevation = '0' , '0' , 0
print('Pyephem Default (atmospheric refraction)')
obs.date = '2015/1/1 04:30:00'
satellite.compute(obs)
print(obs.date, satellite.az, satellite.alt,satellite.range/1000, satellite.range_velocity/1000)
obs.date = '2015/1/1 04:40:00'
satellite.compute(obs)
print(obs.date, satellite.az, satellite.alt,satellite.range/1000, satellite.range_velocity/1000)
obs.date = '2015/1/1 04:50:00'
satellite.compute(obs)
print(obs.date, satellite.az, satellite.alt,satellite.range/1000, satellite.range_velocity/1000)
obs.pressure = 0 # disable atmospheric refraction
print('Pyephem No atmospheric refraction')
obs.date = '2015/1/1 04:30:00'
satellite.compute(obs)
print(obs.date, satellite.az, satellite.alt,satellite.range/1000, satellite.range_velocity/1000)
obs.date = '2015/1/1 04:40:00'
satellite.compute(obs)
print(obs.date, satellite.az, satellite.alt,satellite.range/1000, satellite.range_velocity/1000)
obs.date = '2015/1/1 04:50:00'
satellite.compute(obs)
print(obs.date, satellite.az, satellite.alt,satellite.range/1000, satellite.range_velocity/1000)
print('10 s timing')
obs.date = '2015/1/1 04:30:00'
satellite.compute(obs)
print(obs.date, satellite.range/1000, satellite.range_velocity/1000)
obs.date = '2015/1/1 04:30:10'
satellite.compute(obs)
print(obs.date, satellite.range/1000, satellite.range_velocity/1000)
Pypredict
import predict
import datetime
import time
format = '%Y/%m/%d %H:%M:%S'
tle = """ISS (ZARYA)
1 25544U 98067A 15096.52834639 .00016216 00000-0 24016-3 0 9993
2 25544 51.6469 82.0200 0006014 185.1879 274.8446 15.55408008936880"""
qth = (0, 10, 0) # lat (N), long (W), alt (meters)
#expect time as epoch time float
time= (datetime.datetime.strptime('2015/1/1 04:30:00', format) -datetime.datetime(1970,1,1)).total_seconds()
result = predict.observe(tle, qth, time)
print time
print result
time= (datetime.datetime.strptime('2015/1/1 04:40:00', format) -datetime.datetime(1970,1,1)).total_seconds()
result = predict.observe(tle, qth, time)
print time
print result
time= (datetime.datetime.strptime('2015/1/1 04:50:00', format) -datetime.datetime(1970,1,1)).total_seconds()
result = predict.observe(tle, qth, time)
print time
print result
Debug output of Gpredict and PyEphem
PyEphem
Name = ISS (ZARYA)
current jd = 2457023.68750
current mjd = 42003.7
satellite jd = 2457119.02835
satellite mjd = 42099
SiteLat = 0
SiteLong = 6.28319
SiteAltitude = 0
se_EPOCH : 115096.52834638999775052071
se_XNO : 0.06786747737871574870
se_XINCL : 0.90140843391418457031
se_XNODEO : 1.43151903152465820312
se_EO : 0.00060139998095110059
se_OMEGAO : 3.23213863372802734375
se_XMO : 4.79694318771362304688
se_BSTAR : 0.00024016000679694116
se_XNDT20 : 0.00000000049135865048
se_orbit : 93688
dt : -137290.81880159676074981689
CrntTime = 42004.2
SatX = -3807.5
SatY = 2844.85
SatZ = -4854.26
Radius = 6793.68
SatVX = -5.72752
SatVY = -3.69533
SatVZ = 2.32194
SiteX = -6239.11
SiteY = 1324.55
SiteZ = 0
SiteVX = -0.0965879
SiteVY = -0.454963
Height = 426.426
SSPLat = -0.795946
SSPLong = 0.432494
Azimuth = 3.53102
Elevation = -0.374582
Range = 5638.07
RangeRate = -5.30142
(2015/1/1 04:30:00, 5638.0685, -5.3014228515625)
Gpredict
time: 2457023,687500
pos obs: -6239,093574, 1324,506494, 0,000000
pos sat: -3807,793748, 2844,641722, -4854,112635
vel obs: -0,096585, -0,454962, 0,000000
vel sat: -6,088242, -3,928388, 2,468585
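The range rate is the relative velocity projected onto the line of sight from observer to satellite. As a rough check, the sketch below recomputes it from the two sets of debug vectors above (the decimal commas in the Gpredict output are written as decimal points; the function name is illustrative). Under that definition, each tool's reported range rate is consistent with its own vectors, which points at the satellite velocity vector as the source of the disagreement.

```python
import math


def range_rate(pos_obs, vel_obs, pos_sat, vel_sat):
    """Project the relative velocity onto the observer-to-satellite direction."""
    dr = [s - o for s, o in zip(pos_sat, pos_obs)]
    dv = [s - o for s, o in zip(vel_sat, vel_obs)]
    distance = math.sqrt(sum(c * c for c in dr))
    return sum(r * v for r, v in zip(dr, dv)) / distance


# Vectors copied from the PyEphem debug output (km and km/s).
rr_pyephem = range_rate(
    pos_obs=(-6239.11, 1324.55, 0.0),
    vel_obs=(-0.0965879, -0.454963, 0.0),
    pos_sat=(-3807.5, 2844.85, -4854.26),
    vel_sat=(-5.72752, -3.69533, 2.32194),
)

# Vectors copied from the Gpredict debug output (km and km/s).
rr_gpredict = range_rate(
    pos_obs=(-6239.093574, 1324.506494, 0.0),
    vel_obs=(-0.096585, -0.454962, 0.0),
    pos_sat=(-3807.793748, 2844.641722, -4854.112635),
    vel_sat=(-6.088242, -3.928388, 2.468585),
)

print("PyEphem vectors:  %.4f km/s" % rr_pyephem)
print("Gpredict vectors: %.4f km/s" % rr_gpredict)
```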
Gpredict (sgp_math.h)
/*------------------------------------------------------------------*/
/* Converts the satellite's position and velocity  */
/* vectors from normalised values to km and km/sec */
void
Convert_Sat_State( vector_t *pos, vector_t *vel )
{
    Scale_Vector( xkmper, pos );
    Scale_Vector( xkmper*xmnpda/secday, vel );
} /* Procedure Convert_Sat_State */
Ephem (Libastro)
*SatX = ERAD*posvec.x/1000; /* earth radii to km */
*SatY = ERAD*posvec.y/1000;
*SatZ = ERAD*posvec.z/1000;
*SatVX = 100*velvec.x; /* ?? */
*SatVY = 100*velvec.y;
*SatVZ = 100*velvec.z;
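The two snippets scale the normalised velocity by different factors. A quick sketch compares them, assuming the usual SGP4 constants for xkmper, xmnpda and secday (6378.135 km, 1440 minutes per day, 86400 seconds per day); these values are not shown in the excerpts above and are an assumption here:

```python
# Assumed SGP4 constants (not shown in the excerpts above).
XKMPER = 6378.135   # Earth radius in km
XMNPDA = 1440.0     # minutes per day
SECDAY = 86400.0    # seconds per day

gpredict_scale = XKMPER * XMNPDA / SECDAY  # factor used by Convert_Sat_State
libastro_scale = 100.0                     # factor in the libastro excerpt
expected_ratio = gpredict_scale / libastro_scale

# Satellite velocity vectors from the debug output above (km/s).
pyephem_vel = (-5.72752, -3.69533, 2.32194)
gpredict_vel = (-6.088242, -3.928388, 2.468585)
observed_ratios = [g / p for g, p in zip(gpredict_vel, pyephem_vel)]

print("scale factor ratio: %.4f" % expected_ratio)
print("observed ratios:    " + ", ".join("%.4f" % r for r in observed_ratios))
```

Under these assumptions the component-wise ratios of the two velocity vectors come out close to the ratio of the scale factors (about 1.063), which would be consistent with the factor of 100 being the source of the velocity difference.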

Updating to the most recent release of pyephem (I tried V3.7.6.0) seems to solve the problem. The range rate now agrees closely with the values given by other commonly used tracking software.
