Read file with strange separations in Python

I'm trying to automate a process, but one of the files has a very strange separation.
The columns are separated by spaces, but some rows have more spaces than others.
Does anyone have an idea how to solve this?
Thanks a lot! :D

First of all, I'd make sure that this is not an artefact of the interface you use for viewing the file, because some viewers simply display tabs this way.
You can split the file using regular expressions on multiple spaces.
import re

with open("data.txt") as f:  # "data.txt" is a placeholder for your file
    for line in f:
        fields = re.split(r"\s+", line.strip())
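As a quick standalone check (the sample line below is made up for illustration), re.split on \s+ collapses any run of spaces or tabs; note that plain str.split with no argument does the same thing without regex:

```python
import re

line = "alpha   beta \t gamma  \n"  # invented sample with uneven spacing
print(re.split(r"\s+", line.strip()))  # ['alpha', 'beta', 'gamma']
print(line.split())                    # same result, no regex needed
```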

Related

How to find filenames with a specific extension using regex?

How can I grab 'dlc3.csv' & 'spongebob.csv' from the below string via the quickest method, which I assume is regex?
4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv
I've already managed to achieve this using split() and for loops, but it's slowing my program down way too much.
I would post an example of my current code, but it's got a load of other stuff in it, so it would only cause you to ask more questions.
In a nutshell, I'm opening a large 6,000-line .csv file and using nested for loops to iterate through each line, calling .split() to find specific parts of it. I have many files where I need to scan for specific things on each line, and so far I've only implemented a couple of features into my Qt program, yet it's already taking up to 5 seconds to load some things and up to 10 seconds for others, all due to the nested loops. I've looked at where to use range, where not to, and where to use enumerate. I also use time.time() and logging.info() to measure how each code change affects speed. After asking around, I've been told that a regex is the best option for me, as it would remove the need for many of my for loops. The problem is that I have no clue how to use regex. I plan on learning it, of course, but if someone could help me out with this, it would be much appreciated.
Thanks.
Edit: just to point out that when scanning each line, the filename is unknown; ".csv" is the only part that is known. So I basically need the regex to grab every filename ending in .csv, but without grabbing the path junk before the filename.
I'm currently looking for .csv using .split('/') & .split('|'), then checking whether .csv is in a list index to grab the 'unknown' filename. Some lines will have only one filename, whereas others will have two or more, so the regex needs to account for this too.
You can use this pattern: [^/]*\.csv
Breakdown:
[^/] - Any character that's not a forward slash
* - Zero or more of them
\. - A literal dot. (This is necessary because the dot is a special character in regex.)
For example:
import re
s = '''4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv'''
pattern = re.compile(r'[^/]*\.csv')
result = pattern.findall(s)
Result:
['dlc3.csv', 'spongebob.csv']
Note: It could just as easily be result = re.findall(r'[^/]*\.csv', s), but for code cleanliness, I prefer naming my regexes. You might consider giving it an even clearer name in your code, like pattern_csv_basename or something like that.
Docs: re, including re.findall
See also: The official Python Regular Expression HOWTO

Deleting numbers and colons in a text file

I'm a newbie in Python and I want to write a simple script that will erase the numbers and colons from my txt file.
Example:
00:00:01:05 00:00:03:12 so I thought it was very interesting
00:00:03:12 00:00:06:15 when we did the videos that most of the
00:00:06:15 00:00:09:09 line of business users minds are around
00:00:09:09 00:00:12:04 data rate about data analytics and you
I want to remove the timestamps at the start of each line (the bolded part in the original post).
Does anyone have an idea for the code?
Thank you guys in advance!!
Use regex; I am assuming you already know how to read a txt file.
import re
new_string = re.sub(r'\d+:\d+:\d+:\d+', '', your_string_here)
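A short end-to-end sketch using the example lines from the question (the \s* at the end of the pattern is my addition, so the spaces after each timestamp are removed as well):

```python
import re

lines = [
    "00:00:01:05 00:00:03:12 so I thought it was very interesting",
    "00:00:03:12 00:00:06:15 when we did the videos that most of the",
]
# Remove both timestamps plus the whitespace that follows them.
cleaned = [re.sub(r'\d+:\d+:\d+:\d+\s*', '', line) for line in lines]
print(cleaned[0])  # so I thought it was very interesting
```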

Python code for parsing a MySQL dump file and extracting useful data from it

I'm sorry if this is a bad question. I have a set of MySQL dump files, and I want to parse these files with Python and extract valuable information from them.
In the parsing operation I have 3 states, shown in an image in the original post (not reproduced here).
In your opinion, how can I handle these 3 states?
You can simply use:
('BoredMS site, ddos regularly :3')
If you really want that exact part! :) But I suspect you want the 5th quoted item in the comma-separated list. Give this a shot:
(?:[^,]+,){4}\s*('[^']+')
To explain: that matches 4 items each ending in a comma, then optional whitespace, then captures everything between the next pair of single quotes. Hope that helps!
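For example (the sample row below is invented to match the quoted fragment; real dump lines may differ):

```python
import re

# Hypothetical INSERT-style row; only the quoted fragment comes from the question.
row = "(1, 2, 'a', '1.2.3.4', 'BoredMS site, ddos regularly :3')"
match = re.search(r"(?:[^,]+,){4}\s*('[^']+')", row)
print(match.group(1))  # 'BoredMS site, ddos regularly :3'
```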
\(\d+\,\s*\d+,\s*\'\w\',\s*\'\d+\.\d+\.\d+\.\d+\',\s*\'(.*)\'\)
When I need to create a regular expression, I use regex101.com.
It helps you build the regex string and see your example being parsed live.

Python: select a line if specific characters spaced by tab at end of line

I am trying to find out how best to select specific lines from multiple txt files in Python. One way could be to use regex, but I have read that this would probably be a 'heavy' solution for a simple line-selection task. Another possibility may be string.split(), but it seems that I would have to split all lines first before making my selection. The selection I intend to make is based on the following condition:
if a line ends with 'a', a tab, 'a', a tab (or the same with 'b'), then I select that line
in regex this would be the following:
((a\t){2}|(b\t){2})\n # character 'a' or 'b' at end of line
The function line.endswith() is also available, yet writing the suffix with plain spaces does not recognize tabs.
if line.endswith('a a '): # suffix typed with spaces; tabs at the end of the line are not matched
Can you please advise whether regex is a good fit here or too heavy, or whether string.split or another function like line.endswith is more appropriate?
Thank you.
endswith is enough to solve your selection problem.
\t is the way to represent a tab in a Python string:
>>> print('a\ta\t')
a a
And endswith matches it nicely:
>>> print('foobar a\ta\t'.endswith('a\ta\t'))
True
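Putting it together for the selection in the question (the sample lines are mine; note that endswith also accepts a tuple of suffixes, which covers the 'a' or 'b' alternative from the regex):

```python
lines = ["x a\ta\t\n", "y b\tb\t\n", "z c\tc\t\n"]  # invented sample lines
selected = [line for line in lines
            if line.rstrip("\n").endswith(("a\ta\t", "b\tb\t"))]
print(selected)  # the 'c' line is filtered out
```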

Loading regular expression patterns from external source?

I have a series of regular expression patterns defined for automated processing of text. Due to the design of the program, it's better to keep these patterns in a separate text file, namely a JSON file. A pattern in Python source is written as an r'' literal, but all I can provide from the file is a plain string. I'd like to retain functionality such as grouping, and to keep features such as character classes ([A-z]), so I'm not talking about escaping everything.
I'm using Python 3.4. How do I properly load these patterns into the re module? And what kind of escaping problems should I watch out for?
I am not sure exactly what you want, but have a look at this:
If you have a file called input.txt containing \d+
Then you can use it this way:
import re

with open("input.txt") as f:
    x = "asasd3243sdfdsf23234sdsdf"
    print(re.findall(f.readline().strip(), x))  # strip the trailing newline from the pattern
Output: ['3243', '23234']
Note that the r prefix only affects string literals in your source code; text read from a file already contains literal backslashes, so it needs no extra escaping.
The r'' thing in Python is not a different type from simple ''. The r'' syntax simply creates a string that looks exactly like the one you typed, so the \n sequence stays as a backslash followed by 'n' and isn't turned into a newline (the same goes for other escape sequences). The little r simply disables escape processing in that literal.
Check it yourself with these few simple lines in the console:
print('test \n test')
print(r'test \n test')
print(type(r''))
print(type(''))
Now, when you read strings from a JSON file, the unescaping is done for you. I don't know how you will create the JSON file, but you should take a look at the json module and its load function, which will let you read a JSON file.
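A minimal sketch of that workflow (the JSON document is inlined via json.loads here; with a real file you'd use json.load instead, and the pattern names are my own):

```python
import json
import re

# In a JSON file, backslashes must be doubled: "\\d+" on disk becomes \d+ after parsing.
doc = json.loads(r'{"number": "\\d+", "word": "[A-Za-z]+"}')
patterns = {name: re.compile(p) for name, p in doc.items()}
print(patterns["number"].findall("abc123 45"))    # ['123', '45']
print(patterns["word"].search("abc123").group())  # abc
```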
You can use re.escape to escape the strings. However, this escapes everything, and you might want some characters to keep their special meaning. I'd just store the patterns as plain strings and be careful about placing \ in the right places (remembering that JSON requires backslashes to be doubled).
BTW: if you have many regular expressions, matching might get slow. You might want to consider alternatives such as esmre.
