Deleting numbers and colons in a text file - Python

I'm a newbie in Python and I want to write a simple script that will erase the numbers and colons from my txt file.
Example :
00:00:01:05 00:00:03:12 so I thought it was very interesting
00:00:03:12 00:00:06:15 when we did the videos that most of the
00:00:06:15 00:00:09:09 line of business users minds are around
00:00:09:09 00:00:12:04 data rate about data analytics and you
I want to remove the timestamp part at the start of each line.
Does anyone have an idea for the code?
Thank you guys in advance!!

Use a regex; I'm assuming you already know how to read a txt file.
import re
new_string = re.sub(r'\d+:\d+:\d+:\d+', '', your_string_here)
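For completeness, here is a minimal sketch that wraps this in a helper; the `\s*` also eats the spaces after each timestamp, and the filename in the usage comment is just a placeholder:

```python
import re

# One or more timestamp groups like 00:00:01:05, plus any trailing spaces.
_TS = re.compile(r'(?:\d+:\d+:\d+:\d+\s*)+')

def strip_timestamps(text):
    """Remove timestamp groups like 00:00:01:05 from the text."""
    return _TS.sub('', text)

print(strip_timestamps('00:00:01:05 00:00:03:12 so I thought it was very interesting'))
# so I thought it was very interesting

# To process a whole file (filename is a placeholder):
# with open('subtitles.txt') as f:
#     cleaned = strip_timestamps(f.read())
```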

Related

Read file with strange separations in python

I'm trying to automate a process, but one of the files has a very strange separation.
The columns are separated by spaces, but some rows have more spaces than others.
Does anyone have an idea how to solve this?
Thanks a lot! :D
First of all, I'd make sure that this is not an artefact of the interface you use for viewing the file, because some viewers simply display tabs this way.
You can split the file using regular expressions on multiple spaces.
import re
for line in file:
    # strip the newline first so it doesn't produce an empty trailing field
    fields = re.split(r"\s+", line.strip())
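Note that for plain runs of whitespace you don't even need a regex: `str.split()` with no argument splits on any run of whitespace and ignores leading/trailing whitespace. A quick comparison on a made-up row:

```python
line = "alpha    beta  gamma\n"   # sample row with uneven spacing

# str.split() with no argument splits on any whitespace run
# and drops leading/trailing whitespace, so no regex is needed:
fields = line.split()
print(fields)  # ['alpha', 'beta', 'gamma']
```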

How to find filenames with a specific extension using regex?

How can I grab 'dlc3.csv' & 'spongebob.csv' from the below string via the absolute quickest method, which I assume is regex?
4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv
I've already managed to achieve this by using split() and for loops, but it's slowing my program down way too much.
I would post an example of my current code, but it's got a load of other stuff in it, so it would only cause you to ask more questions.
In a nutshell, I'm opening a large 6,000-line .csv file and using nested for loops to iterate through each line, calling .split() to find specific parts of it. I have many files where I need to scan for specific things on each line, and at the moment, with only a couple of features implemented in my Qt program, it's already taking up to 5 seconds to load some things and up to 10 seconds for others, all because of the nested loops. I've looked at where to use range, where not to, and where to use enumerate. I also use time.time() and logging.info() to measure how each change affects speed. After asking around, I've been told that a regex is the best option for me, as it would remove the need for many of my for loops. The problem is I have no clue how to use regex. I of course plan on learning it, but if someone could help me out with this, it would be much appreciated.
Thanks.
Edit: just to point out that when scanning each line, the filename is unknown; ".csv" is the only part that is known. So I basically need the regex to grab every filename ending in .csv, without grabbing the junk before the filename.
I'm currently looking for .csv using .split('/') & .split('|'), then checking if .csv is in the list index to grab the 'unknown' filename. Some lines will only have 1 filename, whereas others will have 2+, so the regex needs to account for this too.
You can use this pattern: [^/]*\.csv
Breakdown:
[^/] - Any character that's not a forward slash
* - Zero or more of them
\. - A literal dot. (This is necessary because the dot is a special character in regex.)
For example:
import re
s = '''4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv'''
pattern = re.compile(r'[^/]*\.csv')
result = pattern.findall(s)
Result:
['dlc3.csv', 'spongebob.csv']
Note: It could just as easily be result = re.findall(r'[^/]*\.csv', s), but for code cleanliness, I prefer naming my regexes. You might consider giving it an even clearer name in your code, like pattern_csv_basename or something like that.
Docs: re, including re.findall
See also: The official Python Regular Expression HOWTO

Search for a line that does NOT contain a certain expression (Python 2.7)

The following line of code is used to search for lines of source code that contains the text "#":
XPATH_RANK = '//span[contains(text(),"#")]//text()'
How can I modify this particular line of code to IGNORE certain text?
Please keep in mind that I know next to nothing about Python, and am only learning as I go along with this project for work.
Thanks in advance.
'//span[not(contains(text(),"#"))]//text()'
This question is a duplicate of How to use not contains() in xpath? but here's your use.
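A quick sketch of that XPath in action, assuming you're using lxml (a Scrapy selector accepts the same expression); the HTML snippet is made up for illustration:

```python
from lxml import html  # assumed; Scrapy selectors take the same XPath

# Made-up snippet for illustration:
doc = html.fromstring(
    '<div>'
    '<span>#1 ranked</span>'
    '<span>unranked item</span>'
    '</div>'
)

# Text of spans whose text does NOT contain "#":
print(doc.xpath('//span[not(contains(text(),"#"))]//text()'))
# ['unranked item']
```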

python code for parsing mysql-dump file and extract useful data from it

I'm sorry for the badly worded question. I have a set of MySQL dump files, and I want to parse these files with Python and extract valuable information from them.
In the parsing operation I have 3 states, as follows:
(screenshot of the three states omitted)
In your opinion, how can I handle these 3 states?
You can simply use:
('BoredMS site, ddos regularly :3')
If you really want that exact part! :) But I suspect you want the 5th quoted item in the comma-separated list. Give this a shot:
(?:[^,]+,){4}\s*('[^']+')
To explain: that's 4 items separated by commas, then maybe some spaces, then everything between the next pair of single quotes. Hope that helps!
\(\d+\,\s*\d+,\s*\'\w\',\s*\'\d+\.\d+\.\d+\.\d+\',\s*\'(.*)\'\)
When I need to create a regular expression, I use regex101.com.
It helps to build the regex string and see your example being parsed live.
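To make the first suggested pattern concrete, here's a small sketch; the sample row is made up to match the shape the answers assume, with the 5th comma-separated item being the quoted string of interest:

```python
import re

# A made-up row in the shape the answers assume: the 5th
# comma-separated item is the quoted string we want.
row = "(1, 2, 'a', '127.0.0.1', 'BoredMS site, ddos regularly :3')"

# Skip 4 comma-separated items, then capture the next quoted string.
m = re.search(r"(?:[^,]+,){4}\s*('[^']+')", row)
if m:
    print(m.group(1))  # 'BoredMS site, ddos regularly :3'
```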

python and pyPdf - how to extract text from the pages so that there are spaces between lines

Currently, if I make a page object of a PDF page with pyPdf and call extractText(), the lines are concatenated together. For example, if line 1 of the page says "hello" and line 2 says "world", the resulting text returned from extractText() is "helloworld" instead of "hello world". Does anyone know how to fix this, or have suggestions for a workaround? I really need the text to have spaces between the lines, because I'm doing text mining on this PDF text, and not having spaces between lines kills it....
This is a common problem with pdf parsing. You can also expect trailing dashes that you will have to fix in some cases. I came up with a workaround for one of my projects which I will describe here shortly:
I used pdfminer to extract XML from PDF and also found concatenated words in the XML. I extracted the same PDF as HTML and the HTML can be described by lines of the following regex:
<span style="position:absolute; writing-mode:lr-tb; left:[0-9]+px; top:([0-9]+)px; font-size:[0-9]+px;">([^<]*)</span>
The spans are positioned absolutely and have a top-style that you can use to determine if a line break happened. If a line break happened and the last word on the last line does not have a trailing dash you can separate the last word on the last line and the first word on the current line. It can be tricky in the details, but you might be able to fix almost all text parsing errors.
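A rough sketch of that line-joining idea, applying the span regex above to a made-up two-line sample (real pdfminer HTML output will need more robust handling):

```python
import re

# The span shape described above: capture the top offset and the text.
SPAN = re.compile(r'<span style="position:absolute; writing-mode:lr-tb; '
                  r'left:[0-9]+px; top:([0-9]+)px; font-size:[0-9]+px;">'
                  r'([^<]*)</span>')

# Made-up sample: two spans with different top offsets, i.e. two lines.
page = ('<span style="position:absolute; writing-mode:lr-tb; left:10px; '
        'top:100px; font-size:12px;">hello</span>'
        '<span style="position:absolute; writing-mode:lr-tb; left:10px; '
        'top:120px; font-size:12px;">world</span>')

parts = []
last_top = None
for top, text in SPAN.findall(page):
    # A change in the top offset means a line break happened; insert a
    # space unless the previous line ended with a trailing dash.
    if last_top is not None and top != last_top and not parts[-1].endswith('-'):
        parts.append(' ')
    parts.append(text)
    last_top = top
print(''.join(parts))  # hello world
```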
Additionally, you might want to run a spell-checking library like enchant over your text to find errors; if the fix suggested by the dictionary is the error word with a space inserted somewhere, the error word is likely a parsing error and can be fixed with the dictionary's suggestion.
Parsing PDF sucks and if you find a better source, use it.
