Extract list of words from filenames

I need to get a list of words that the filenames contain. Here are the files:
sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
I need to get the part that comes after task- and before the next underscore, so my list should look like this:
['FmriPictures','FmriVernike','FmriWgWords','RestingState']
How can I implement this in Python 3?

Here's a Python solution that uses a regular expression.
>>> import re
>>> test_str = 'sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii'
>>> re.search(r'task-(.*?)_', test_str).group(1)
'FmriPictures'
I think you can do the same for every string.

l=["sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii"]
k = []
for i in l:
    # The third '-'-separated field is e.g. 'FmriPictures_space'
    k.append(i.split('-')[2].replace("_space", ""))
print(k)
That's just one approach.

You can loop over your list and use regex to get the names from the strings like this example:
import re
a = ['sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii']
out = []
for elm in a:
    condition = re.search(r'_task-(.*?)_', elm)
    if condition:
        out.append(condition.group(1))
print(out)
Output:
['FmriPictures', 'FmriVernike', 'FmriWgWords', 'RestingState']

I would simply replace
sub-Dzh_task-
and
_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
with an empty string. Once both parts are removed, only the task names remain. This works only if every filename shares exactly this prefix and suffix.
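A minimal sketch of this replace-based idea, assuming every filename shares exactly the same prefix and suffix as in the question (the variable names prefix, suffix and tasks are only illustrative):

```python
# Filenames copied from the question.
filenames = [
    "sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
    "sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
    "sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
    "sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
]

# Assumption: these two fixed strings surround the task name in every file.
prefix = "sub-Dzh_task-"
suffix = "_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii"

# Replace both fixed parts with an empty string.
tasks = [name.replace(prefix, "").replace(suffix, "") for name in filenames]
print(tasks)
```

If the subject ID or any other field varies between files, the regex answers are likely the safer choice.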

Related

How to convert string to list?

I have a list:
sample = ["['P001']"]
How to convert it into:
sample = [["P001" ]]
I tried [[x] for x in sample] but the output is [["[P001]"]]
Edit: I apologize for the confusion; I made a mistake in the original question. "P001" is actually a string.

If you want the output as this [['P001']], you can do this:
>>> sample = ["[P001]"]
>>> [[sample[0].strip('[]')]]
[['P001']]
But if you want the output as [[P001]], I don't know if that's possible. P001 is not a Python literal, and if P001 is a variable defined earlier, the resulting list will hold the value of P001, not the name itself. For example:
>>> P001 = 'something'
>>> sample = ["[P001]"]
>>> [eval(sample[0])]
[['something']]

This is an array of strings and each string consists of pattern like "[P001]". You need to loop over the outer array and run a match with each element to get the inner value, in this case 'P001'. Then you can append the value as you like.
import re

s = ["[P001]", "[S002]"]
result = []
for i in s:
    r = re.match(r"\[(.*?)\]", i)
    if r:
        result.append([r.group(1)])
print(result)
Output:
[['P001'], ['S002']]

It can be done just using a regular expression for alphanumeric characters:
sample = ["[P001,P002]","[P003,P004]"]
import re
sample = [re.findall(r"\w+", sublist) for sublist in sample]

Get list from string with exec in python

I have:
"[15765,22832,15289,15016,15017]"
I want:
[15765,22832,15289,15016,15017]
What should I do to convert this string to list?
P.S. Post was edited without my permission and it lost important part. The type of line that looks like list is 'bytes'. This is not string.
P.S. №2. My initial code was:
import urllib.request, re
f = urllib.request.urlopen("http://www.finam.ru/cache/icharts/icharts.js")
lines = f.readlines()
for line in lines:
    m = re.match(r'var\s+(\w+)\s*=\s*\[\s*(.+)\s*\];', line.decode('windows-1251'))
    if m is not None:
        varname = m.group(1)
        if varname == "aEmitentIds":
            aEmitentIds = line  # its type is 'bytes', not 'str'
I need to get a list from line.
The line from the web page looks like:
[15765, 22832, 15289, 15016, 15017]

Assuming s is your string, you can just use split and then cast each number to integer:
s = [int(number) for number in s[1:-1].split(',')]
For detailed information about split function:
Python3 split documentation

What you have is a stringified list. You could use a json parser to parse that information into the corresponding list
import json
test_str = "[15765,22832,15289,15016,15017]"
l = json.loads(test_str) # List that you need.
Or another way to do this would be to use ast
import ast
test_str = "[15765,22832,15289,15016,15017]"
data = ast.literal_eval(test_str)
The result is
[15765, 22832, 15289, 15016, 15017]
To understand why using eval() is bad practice you could refer to this answer
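Since the question says the value is of type bytes rather than str, here is a small sketch of decoding it before parsing. It assumes the bytes hold only the bracketed list; raw_line is a made-up stand-in for the data read from the page, and windows-1251 is the encoding used in the question's own code:

```python
import json

# Hypothetical stand-in for the bytes value read from the web page.
raw_line = b"[15765, 22832, 15289, 15016, 15017]"

# Decode to str first, using the encoding from the question's code.
text = raw_line.decode("windows-1251")

# Parse the stringified list into a real list of ints.
ids = json.loads(text)
print(ids)
```

ast.literal_eval(text) should work on the decoded string in the same way.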

You can also use regex to pull out numeric values from the string as follows:
import re
lst = "[15765,22832,15289,15016,15017]"
lst = [int(number) for number in re.findall(r'\d+', lst)]
Output of the above code is,
[15765, 22832, 15289, 15016, 15017]

How do you pick out certain values from a list based on their string values in Python?

I have a list of hyperlinks, with three types of links; htm, csv and pdf. And I would like to just pick out those that are csv.
The list contains strings of the form: csv/damlbmp/20160701damlbmp_zone_csv.zip
I was thinking of running a for loop over the list and returning only the strings whose first three characters are csv, but I am not really sure how to do this.

I would use link.endswith('csv') (or link.endswith('csv.zip')), where link is a string containing that link.
For example:
lst = ['csv/damlbmp/20160701damlbmp_zone_csv.zip',
'pdf/damlbmp/20160701damlbmp_zone_pdf.zip',
'html/damlbmp/20160701damlbmp_zone_html.zip',
'csv/damlbmp/20160801damlbmp_zone_csv.zip']
csv_files = [link for link in lst if link.endswith('csv.zip')]

If your list is called links:
[x for x in links if 'csv/' in x]
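One caveat worth illustrating: 'csv/' in x matches the substring anywhere in the link, not only at the start. The sketch below compares it with str.startswith; the third link is a made-up example added only to show the difference:

```python
links = [
    "csv/damlbmp/20160701damlbmp_zone_csv.zip",
    "pdf/damlbmp/20160701damlbmp_zone_pdf.zip",
    "htm/archive/csv/index.htm",  # hypothetical link, not from the question
]

# Substring test: also picks up 'csv/' in the middle of a path.
by_substring = [x for x in links if "csv/" in x]

# Prefix test: keeps only links that begin with 'csv/'.
by_prefix = [x for x in links if x.startswith("csv/")]

print(by_substring)
print(by_prefix)
```

For the links shown in the question both tests give the same result, so the choice only matters if other path layouts can occur.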

You can try this
import re

l = ["www.h.com", "abc.csv", "test.pdf", "another.csv"]  # list of links

def match_csv(links):
    matches = []
    for link in links:
        # findall returns a list of every '<name>.csv' piece in the link
        m = re.findall(r'[^.]*\.csv', link)
        if m:
            matches.append(m)
    return matches

print(match_csv(l))
[['abc.csv'], ['another.csv']]
(endswith is a good option too)

This is one way:
lst = ['csv/damlbmp/20160701damlbmp_zone_csv.zip',
'pdf/damlbmp/20160701damlbmp_zone_pdf.zip',
'html/damlbmp/20160701damlbmp_zone_html.zip',
'csv/damlbmp/20160801damlbmp_zone_csv.zip']
[i for i in lst if i[:3]=='csv']
# ['csv/damlbmp/20160701damlbmp_zone_csv.zip',
# 'csv/damlbmp/20160801damlbmp_zone_csv.zip']

Extract and sort numbers from filenames in python

I have a very basic question. I have files named like Dipole_E0=1.2625E-01.dat and I want to extract the 1.2625E-01 part and finally sort the values in ascending order. How can this be done? I first tried to split the filename with .split(), but it does not do what I expect. Thanks for your help.
Best
Roland

The best way is to use a regular expression. To obtain the value from a file name:
import re
m = re.search(r'^Dipole_E0=(.*)\.dat$', filename)
val = m.group(1)
Walk through all the filenames and append each value to a list. After that, sort the list and that's all.
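The steps described in this answer (extract the value from each name, collect the values, sort them) might look like the following sketch. It assumes every file follows the Dipole_E0=<number>.dat pattern from the question; the sample names are the ones used in another answer here:

```python
import re

filenames = [
    "Dipole_E0=1.2625E-01.dat",
    "Dipole_E0=1.3625E-01.dat",
    "Dipole_E0=0.2625E-01.dat",
]

# Capture everything between 'Dipole_E0=' and the '.dat' extension.
pattern = re.compile(r"^Dipole_E0=(.*)\.dat$")

values = []
for filename in filenames:
    m = pattern.search(filename)
    if m:  # skip names that do not follow the pattern
        values.append(float(m.group(1)))

values.sort()  # ascending numeric order
print(values)
```

Converting to float before sorting matters: sorting the raw strings would order them lexicographically, not numerically.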

You want to look into regular expressions. In python they live in the re module. Depending on exact format, something like:
import re
ematch = re.compile(r"=([0-9]*\.[0-9]*[eE][+-][0-9]+)")
val = ematch.search(filename).group(1)
Sorting a list can be done with the .sort() method on lists, or with the sorted(list) builtin, which gives you a new list.
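To illustrate the difference between the two, here is a hedged sketch that orders the filenames themselves by their embedded number. The helper e0_value is a made-up name; the pattern is the one from this answer, read from capture group 1:

```python
import re

ematch = re.compile(r"=([0-9]*\.[0-9]*[eE][+-][0-9]+)")

def e0_value(filename):
    """Return the number embedded in the filename as a float."""
    return float(ematch.search(filename).group(1))

filenames = [
    "Dipole_E0=1.2625E-01.dat",
    "Dipole_E0=1.3625E-01.dat",
    "Dipole_E0=0.2625E-01.dat",
]

# sorted() returns a new list and leaves the original untouched.
ordered = sorted(filenames, key=e0_value)
print(ordered)

# list.sort() reorders the existing list in place and returns None.
filenames.sort(key=e0_value)
print(filenames)
```

Using key= keeps the full filenames, which can be handy if the files need to be opened in that order afterwards.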

This is a good situation to use a generator expression and the sorted builtin:
sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
Where filenames is your list of filenames.
>>> filenames = ["Dipole_E0=1.2625E-01.dat", "Dipole_E0=1.3625E-01.dat", "Dipole_E0=0.2625E-01.dat"]
>>> sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
[0.02625, 0.12625, 0.13625]

You can get the filenames with the glob module.
from glob import glob
file_names = glob("yourpath/*.dat")
vals = []
for name in file_names:
    vals.append(float(name[:-4].rpartition("=")[2]))
vals.sort()
name[:-4] throws away the ".dat". rpartition is a string method. It returns a tuple where entry 0 is the string left of the string used to split, entry 1 is the string used to split (here: "=") and entry 2 is the string right of this string (here: your float). Then it is converted to a float and appended to the list of values.

appending regex matches to a dictionary

I have a file in which there is the following info:
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
I'm looking for a way to match the colon and append whatever appears afterwards (the numbers) to a dictionary the keys of which are the name of the animals in the beginning of each line.

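Actually, regular expressions are unnecessary, provided that your data is well formatted and contains no surprises.
Assuming that data is a variable containing the string that you listed above:
dict(item.split(":") for item in data.split())
Note that this maps each full file name (for example dogs_3351.txt) to its number as a string; it does not group the entries by animal.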

t = """
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
"""
import re
d = {}
# Group 1 is the animal name before the first '_', group 2 the number after ':'
for p, q in re.findall(r'^(.+?)_.+?:(.+)', t, re.M):
    d.setdefault(p, []).append(q)
print(d)

Why don't you use the string find method to locate the index of the colon, which you can then use to slice the string?
>>> x='dogs_3351.txt:34.13559322033898'
>>> key_index = x.find(':')
>>> key = x[:key_index]
>>> key
'dogs_3351.txt'
>>> value = x[key_index+1:]
>>> value
'34.13559322033898'
>>>
Read in each line of the file as a text and process the lines individually as above.
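Applied line by line, the find-and-slice idea could be sketched as follows. The function name parse_lines and the sample lines are illustrative; taking the part before the first underscore as the animal name, and converting values to float, are assumptions about what the question wants:

```python
def parse_lines(lines):
    """Build {animal: [numbers]} from 'name_id.txt:number' lines."""
    result = {}
    for line in lines:
        line = line.strip()
        key_index = line.find(":")
        if key_index == -1:
            continue  # skip blank or malformed lines
        key = line[:key_index]          # e.g. 'dogs_3351.txt'
        value = line[key_index + 1:]    # e.g. '34.13559322033898'
        animal = key.split("_")[0]      # e.g. 'dogs'
        result.setdefault(animal, []).append(float(value))
    return result

# A few lines from the question, as they would come from iterating a file.
sample_lines = [
    "dogs_3351.txt:34.13559322033898\n",
    "cats_1875.txt:23.25581395348837\n",
    "cats_2231.txt:22.087912087912088\n",
]
animals = parse_lines(sample_lines)
print(animals)
```

An open file object can be passed in place of sample_lines, since iterating a text file yields its lines one at a time.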

Without regex and using defaultdict:
from collections import defaultdict
data = """dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478"""
dictionary = defaultdict(list)
for l in data.splitlines():
    animal = l.split('_')[0]
    number = l.split(':')[-1]
    dictionary[animal].append(number)
Just make sure your data is well formatted.
