Can you help me figure out how to split a string using a group of digits as the delimiter?
I have content in a file in the format below:
data_file_10572_2018-02-15-12-57-29.file
header_file_13238_2018-02-15-12-57-48.file
sig_file1_17678_2018-02-15-12-57-14.file
Expected output:
data_file
header_file
sig_file1
I'm new to Python and I'm not sure how to cut the string at the group of numbers. Thanks for any replies!
I hope this helps. The function finds the first element that consists only of digits (i.e. can be cast to an integer) and returns the string up to that element.
data = ['data_file_10572_2018-02-15-12-57-29.file', 'header_file_13238_2018-02-15-12-57-48.file', 'sig_file1_17678_2018-02-15-12-57-14.file']
def split_before_int(elem):
    parts = elem.split('_')
    for i, part in enumerate(parts):
        # The first all-digit part marks where the numeric suffix begins
        if part.isdigit():
            return '_'.join(parts[:i])
    # No numeric part found: return the name unchanged
    return elem

for elem in data:
    print(split_before_int(elem))
Output:
data_file
header_file
sig_file1
Split the name on the _ symbol, take the first two pieces with Python list slicing (i.e. list[0:2]), and join them back with _ to get the substring up to the second _.
files = ['data_file_10572_2018-02-15-12-57-29.file', 'header_file_13238_2018-02-15-12-57-48.file','sig_file1_17678_2018-02-15-12-57-14.file']
cleaned_files = list(map(lambda file: '_'.join(file.split('_')[0:2]), files))
This results in:
['data_file', 'header_file', 'sig_file1']
You can use a regex to match everything up to the first "_" that is followed by a digit, then split the match on "_" and join the elements, excluding the last.
Ex:
import re
a = "data_file_10572_2018-02-15-12-57-29.file"
print("_".join(re.match(r"(.*?)_\d", a).group().split("_")[:-1]))
output:
data_file
This code will work if all your filenames follow the pattern you described.
filename = 'data_file_10572_2018-02-15-12-57-29.file'
parts = filename.split('_')
new_filename = '_'.join(parts[:2])
If the alphabetical part of the file name has a variable number of underscores, it's better to use a regex.
import re
pattern = re.compile(r'_[0-9_-]{3,}\.file$')
print(re.sub(pattern, '', filename))
Output:
data_file
Essentially, it first creates a pattern that starts with _, is followed by 3 or more digits, underscores or hyphens, and ends with .file.
Then you replace the largest substring of your string that follows this pattern with an empty string.
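As an alternative sketch (not taken from any of the answers above), you could also split once at the first underscore that is directly followed by a digit. The helper name name_before_numbers is made up for illustration, and this assumes the alphabetical part never contains an underscore followed by a digit:

```python
import re

def name_before_numbers(filename):
    # Hypothetical helper: split once at the first "_" that is
    # immediately followed by a digit and keep the left-hand side.
    return re.split(r'_(?=\d)', filename, maxsplit=1)[0]

files = [
    'data_file_10572_2018-02-15-12-57-29.file',
    'header_file_13238_2018-02-15-12-57-48.file',
    'sig_file1_17678_2018-02-15-12-57-14.file',
]
prefixes = [name_before_numbers(f) for f in files]
print(prefixes)
```

A name with no such underscore comes back unchanged, since re.split returns the whole string when nothing matches.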
Related
I am trying to replace a number in a string with another number. For instance, I have the string "APU12_24F" and I want to add 7 to the second number to make it "APU12_31F".
Right now I am simply able to locate the number in which I'm interested by using string.split.
I can't figure out how to edit the new strings which this produces.
def main():
    f = open("edita15888_debug.txt", "r")
    fl = f.readlines()
    for x in fl:
        if "APU12" in x:
            list_string = split_string_APU12(x)
            print(list_string)
    return

def split_string_APU12(string):
    # Split the string based on APU12_
    list_string = string.split("APU12_")
    return list_string

main()
The output for this makes sense, as I'll get something like ['', '24F\n']. I just now need to change the 24 to 31 and then put it back into the original string.
Feel free to let me know if there is a better approach to this. I'm very new to python and everything I can find online with the available search/replace functions doesn't seem to do what I'd need them to do. Thank you!
Assuming that pattern is _ + multiple digits you can replace it with regex
import re
re.sub(r"_(\d+)", lambda r: '_'+str(int(r.group(1)) + 7),'APU12_24F')
This isn't generalized because I'm not sure what the rest of the data looks like but maybe something like this should work:
def main():
    f = open("edita15888_debug.txt", "r")
    fl = f.readlines()
    for x in fl:
        if "APU12" in x:
            list_string = split_string_APU12(x)
            # Take the digits before the trailing 'F' and add 7
            number = int(list_string[1].split('F')[0]) + 7
            # Rebuild the string, keeping the trailing 'F'
            new_string = "APU12_" + str(number) + "F"
            print(new_string)
    return

def split_string_APU12(string):
    # Split the string based on APU12_
    list_string = string.split("APU12_")
    return list_string

main()
I'm assuming your strings will be of the format
APU12_###...F
(where ###... means a number with a variable count of digits, and F could be any letter, but just one). If so, you could do something like this:
# Notice the use of a context manager;
# I would recommend learning about this for working with files
your_quantity = 7  # the amount to add, e.g. your +7
with open('edita15888_debug.txt', 'r') as f:
    fl = f.readlines()
new_strings = []
for line in fl:
    # Strip the trailing newline so the last character is the letter
    beg, end = line.strip().split('_')
    # This splits the end part into number + character
    number, char = int(end[:-1]), end[-1]
    # Here goes your operation on the number
    number += your_quantity
    # Now joining everything back together
    new_strings.append(beg + '_' + str(number) + char)
And this would yield you the same list of strings but with the numbers before the last letter modified as you need.
I hope this helps you!
I assumed you need to add seven to a number that comes after an underscore. I hope this function will be helpful:
import re
def add_seven_to_number_after_underscore_in_a_string(aString):
    regex = re.compile(r'_(\d+)')
    match = regex.search(aString)
    return regex.sub('_' + str(int(match.group(1)) + 7), aString)
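A slightly more defensive variant, as a sketch only: passing a callback to re.sub means a string without a match is returned unchanged instead of raising, and count=1 limits the change to the first number after an underscore. The function name and the delta parameter are made up for illustration:

```python
import re

def add_to_number_after_underscore(text, delta=7):
    # Hypothetical helper: bump only the first "_<digits>" group found.
    def bump(match):
        return '_' + str(int(match.group(1)) + delta)
    # With no match, re.sub simply returns the text unchanged.
    return re.sub(r'_(\d+)', bump, text, count=1)

print(add_to_number_after_underscore('APU12_24F'))
```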
I have a number of html files in a directory. I am trying to store the filenames in a list so that I can use it later to compare with another list.
Eg: Prod224_0055_00007464_20170930.html is one of the filenames. From the filename, I want to extract '00007464' and store this value in a list and repeat the same for all the other files in the directory. How do I go about doing this? I am new to Python and any help would be greatly appreciated!
Please let me know if you need more information to answer the question.
Split the filename on underscores and select the third element (index 2).
>>> 'Prod224_0055_00007464_20170930.html'.split('_')[2]
'00007464'
In context that might look like this:
nums = [f.split('_')[2] for f in os.listdir(dir) if f.endswith('.html')]
You may try this (assuming you are in the folder with the files):
import os
num_list = []
r, d, files = next(os.walk('.'))
for f in files:
    parts = f.split('_')  # now `parts` contains ['Prod224', '0055', '00007464', '20170930.html']
    print(parts[2])  # this outputs '00007464'
    num_list.append(parts[2])
Assuming you have a certain pattern for your files, you can use a regex:
>>> import re
>>> s = 'Prod224_0055_00007464_20170930.html'
>>> desired_number = re.findall(r"\d+", s)[2]
>>> desired_number
'00007464'
Using a regex will help you get not only the specific number you want, but also the other numbers in the file name.
This will work if the name of your files follow the pattern "[some text][number]_[number]_[desired_number]_[a date].html". After getting the number, I think it will be very simple to use the append method to add that number to any list you want.
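To make that pattern explicit, here is a minimal sketch that only accepts names of that shape and collects the third field. The pattern, the function name extract_ids and the sample list are assumptions for illustration; in practice the names could come from os.listdir:

```python
import re

# Assumed shape: <text+number>_<number>_<desired number>_<date>.html
PATTERN = re.compile(r'^[^_]+_\d+_(\d+)_\d+\.html$')

def extract_ids(filenames):
    ids = []
    for name in filenames:
        m = PATTERN.match(name)
        if m:  # silently skip names that do not follow the pattern
            ids.append(m.group(1))
    return ids

names = [
    'Prod224_0055_00007464_20170930.html',
    'notes.txt',
    'Prod224_0055_00007465_20170930.html',
]
print(extract_ids(names))
```

The values stay strings, so leading zeros such as 00007464 are preserved for the later comparison with another list.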
I have the following line of code reading in a specific part of a text file. The problem is these are numbers not strings so I want to convert them to ints and read them into a list of some sort.
A sample of the data from the text file is shown below.
The sample is not wholly representative, so I have uploaded the full set of data as a text file here: http://s000.tinyupload.com/?file_id=08754130146692169643
*NSET, NSET=Nodes_Pushed_Back_IB
99915527, 99915529, 99915530, 99915532, 99915533, 99915548, 99915549, 99915550,
99915551, 99915552, 99915553, 99915554, 99915555, 99915556, 99915557, 99915558,
99915562, 99915563, 99915564, 99915656, 99915657, 99915658, 99915659, 99915660,
99915661, 99915662, 99915663, 99915664, 99915665, 99915666, 99915667, 99915668,
99915669, 99915670, 99915885, 99915886, 99915887, 99915888, 99915889, 99915890,
99915891, 99915892, 99915893, 99915894, 99915895, 99915896, 99915897, 99915898,
99915899, 99915900, 99916042, 99916043, 99916044, 99916045, 99916046, 99916047,
99916048, 99916049, 99916050
*NSET, NSET=Nodes_Pushed_Back_OB
Any help would be much appreciated.
Hi, I am still stuck with this issue. Any more suggestions? My latest code and the error message are below. Thanks!
import tkinter as tk
from tkinter import filedialog
file_path = filedialog.askopenfilename()
print(file_path)
data = []
data2 = []
data3 = []
flag = False
with open(file_path, 'r') as f:
    for line in f:
        if line.strip().startswith('*NSET, NSET=Nodes_Pushed_Back_IB'):
            flag = True
        elif line.strip().endswith('*NSET, NSET=Nodes_Pushed_Back_OB'):
            flag = False  # loop stops when condition is false i.e. if false do nothing
        elif flag:  # as long as flag is true append
            data.append([int(x) for x in line.strip().split(',')])
The result is the following error:
ValueError: invalid literal for int() with base 10: ''
Instead of reading these as strings I would like each to be a number in a list, i.e [98932850 98932852 98932853 98932855 98932856 98932871 98932872 98932873]
In such cases I use regular expressions together with string methods. I would solve this problem like so:
import re
with open(filepath) as f:
    txt = f.read()
g = re.search(r'NSET=Nodes_Pushed_Back_IB(.*?)(?=\*NSET|\Z)', txt, re.S)
snums = g.group(1).replace(',', ' ').split()
numbers = [int(num) for num in snums]
I read the entire text into txt.
Next I use a regular expression with the last portion of your header as an anchor, and capture with capturing parentheses everything up to the next *NSET header or the end of the file (the re.S flag means that a dot also matches newlines). I access all the numbers as one unit of text via g.group(1).
Next, I remove all the commas (actually replace them with spaces) because on the resulting text I use split(), which is an excellent function to use on text items that are separated by whitespace: it doesn't matter how much whitespace there is, it just splits as you would intend.
The rest is just converting the text to numbers using a list comprehension.
Your line contains more than one number, and some separating characters. You could parse that format by judicious application of split and perhaps strip, or you could minimize string handling by having re extract specifically the fields you care about:
ints = list(map(int, re.findall(r'-?\d+', line)))
This regular expression will find each group of digits, optionally prefixed by a minus sign, and then map will apply int to each such group found.
Using a sample of your string:
strings = ' 98932850, 98932852, 98932853, 98932855, 98932856, 98932871, 98932872, 98932873,\n'
I'd just split the string, strip the commas, and return a list of numbers:
numbers = [ int(s.strip(',')) for s in strings.split() ]
Based on your comment and the larger context of your code, I'd suggest a few things:
from itertools import groupby
number_groups = []
with open('data.txt', 'r') as f:
    for k, g in groupby(f, key=lambda x: x.startswith('*NSET')):
        if k:
            pass
        else:
            number_groups += list(filter('\n'.__ne__, list(g)))  # drop blank lines
data = []
for group in number_groups:
    for str_num in group.strip().split(','):
        # Skip the empty piece left by a trailing comma
        if str_num.strip():
            data.append(int(str_num))
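If you need the numbers kept separate per node set, a rough sketch along these lines might help. It assumes every header line starts with *NSET and carries the set name after NSET=, as in the sample; the function name parse_nsets and the shortened sample are made up for illustration:

```python
import re

def parse_nsets(lines):
    # Map each node-set name to the list of integers listed under it.
    sets = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith('*NSET'):
            # Header line: the set name follows "NSET="
            current = line.partition('NSET=')[2].strip()
            sets[current] = []
        elif current is not None and line:
            # Data line: pick out every run of digits, ignoring commas
            sets[current].extend(int(n) for n in re.findall(r'\d+', line))
    return sets

sample = """*NSET, NSET=Nodes_Pushed_Back_IB
99915527, 99915529, 99915530,
99916048, 99916049, 99916050
*NSET, NSET=Nodes_Pushed_Back_OB
""".splitlines()

node_sets = parse_nsets(sample)
print(node_sets['Nodes_Pushed_Back_IB'])
```

Because re.findall only returns the digit runs, trailing commas and blank lines never reach int(), which is what caused the ValueError in the question.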
I need to find the identification number of a large number of files while iterating through them.
The file names are loaded onto a list and look like:
ID322198.nii
ID9828731.nii
ID23890.nii
FILEID988312.nii
So the best way to approach this would be to find the number that sits between ID and .nii.
Because the number of digits varies, I can't simply select [-10:-4] of the file name. Any ideas?
You can use a regex:
import re
files = ['ID322198.nii','ID9828731.nii','ID23890.nii','FILEID988312.nii']
[re.findall(r'ID(\d+)\.nii', file)[0] for file in files]
Returns:
['322198', '9828731', '23890', '988312']
To find the positions of ID and .nii, you can use Python's index() method:
for line in files:
    idpos = line.index("ID") + 2  # first character after "ID"
    nilpos = line.index(".nii")
    data = line[idpos:nilpos]
or as a list of ints:
[int(line[line.index("ID") + 2:line.index(".nii")]) for line in files]
Using rindex:
s = 'ID322198.nii'
s = s[s.rindex('D')+1 : s.rindex('.')]
print(s)
Returns:
322198
Then apply this syntax to a list of strings.
It seems like you could filter the digits out, like this:
digits = ''.join(d for d in filename if d.isdigit())
That will work nicely as long as there are no other digits in the filename (e.g backups with a .1 suffix or something).
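If other digits can show up elsewhere in the name, anchoring on ID and the .nii extension avoids picking them up. A small sketch, assuming the ID always sits directly before .nii; the function name extract_id and the extra test names are hypothetical:

```python
import re

# Digits directly between "ID" and the ".nii" extension at the end
ID_PATTERN = re.compile(r'ID(\d+)\.nii$')

def extract_id(filename):
    m = ID_PATTERN.search(filename)
    # Return None rather than raising when the name does not fit
    return int(m.group(1)) if m else None

files = ['ID322198.nii', 'ID9828731.nii', 'ID23890.nii', 'FILEID988312.nii']
ids = [extract_id(f) for f in files]
print(ids)
```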
for name in files:
    name = name.replace('.nii', '')
    id_num = name.replace(name.rstrip('0123456789'), '')
How this works:
# example
name = 'ID322198.nii'
# remove '.nii'. -> name1 = 'ID322198'
name1 = name.replace('.nii', '')
# strip all digits from the end. -> name2 = 'ID'
name2 = name1.rstrip('0123456789')
# remove 'ID' from 'ID322198'. -> id_num = '322198'
id_num = name1.replace(name2, '')
I have a very basic question. I have files named like Dipole_E0=1.2625E-01.dat and I want to extract the 1.2625E-01 part and finally sort them in ascending order. How can this be done? I first tried to split the filename with .split() but it does not do what I expect. Thanks for your help.
Best
Roland
The best way is to use a regexp. To obtain the value from the file name:
import re
m = re.search(r'^Dipole_E0=(.*)\.dat$', filename)
val = m.group(1)
Walk through all filenames and append all values to an array. After that, sort it and that's all.
You want to look into regular expressions. In python they live in the re module. Depending on exact format, something like:
import re
ematch = re.compile(r"=([0-9]*\.[0-9]*[eE][+-][0-9]+)")
val = ematch.search(filename).group(1)
Sorting a list can be done with the .sort() method on lists, or the sorted(list) builtin, which gives you a new list.
This is a good situation to use a generator expression and the sorted builtin:
sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
Where filenames is your list of filenames.
>>> filenames = ["Dipole_E0=1.2625E-01.dat", "Dipole_E0=1.3625E-01.dat", "Dipole_E0=0.2625E-01.dat"]
>>> sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
[0.02625, 0.12625, 0.13625]
You can get the filenames with the glob module.
from glob import glob
file_names = glob("yourpath/*.dat")
vals = []
for name in file_names:
    vals.append(float(name[:-4].rpartition("=")[2]))
vals.sort()
name[:-4] throws away the ".dat". rpartition is a string method. It returns a tuple where entry 0 is the string left of the string used to split, entry 1 is the string used to split (here: "=") and entry 2 is the string right of this string (here: your float). Then it is converted to a float and appended to the list of values.
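If the goal is to sort the file names themselves rather than only the extracted values, the same extraction can be used as a sort key. This is just a sketch; the helper name field_value is made up, and it assumes every name ends in .dat and has the value after the last "=":

```python
def field_value(filename):
    # Text after the last "=" with the ".dat" extension cut off,
    # converted to a float so the ordering is numeric, not textual.
    return float(filename.rpartition('=')[2][:-len('.dat')])

filenames = [
    'Dipole_E0=1.2625E-01.dat',
    'Dipole_E0=1.3625E-01.dat',
    'Dipole_E0=0.2625E-01.dat',
]
sorted_names = sorted(filenames, key=field_value)
print(sorted_names)
```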