I need to get a list of the words that these files contain. Here are the files:
sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
I need the part that comes between task- and the following underscore, so my list should look like:
['FmriPictures','FmriVernike','FmriWgWords','RestingState']
How can I implement this in Python 3?
Here's a Python solution that uses a regex:
>>> import re
>>> test_str = 'sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii'
>>> re.search('task-(.*?)_', test_str).group(1)
'FmriPictures'
I think you can do the same for every string.
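For instance, the same search can be applied across the whole list with a comprehension (a sketch; `filenames` stands in for your list of filenames):

```python
import re

filenames = [
    'sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
    'sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
    'sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
    'sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
]

# Apply the same non-greedy search to every filename
tasks = [re.search(r'task-(.*?)_', name).group(1) for name in filenames]
print(tasks)  # ['FmriPictures', 'FmriVernike', 'FmriWgWords', 'RestingState']
```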
l=["sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii"]
k=[]
for i in l:
    k.append(i.split('-')[2].replace("_space",""))
print(k)
That's just one approach.
You can loop over your list and use a regex to pull the names out of the strings, like this:
import re
a = ['sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii']
out = []
for elm in a:
    condition = re.search(r'_task-(.*?)_', elm)
    if condition:
        out.append(condition.group(1))
print(out)
Output:
['FmriPictures', 'FmriVernike', 'FmriWgWords', 'RestingState']
I would simply replace
sub-Dzh_task-
and
_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
with an empty string. Strip those two fragments out and you'll be left with just the task names.
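A minimal sketch of that replace approach, assuming every filename shares exactly these two fixed fragments (the list is rebuilt from the fragments here just to keep the example short):

```python
prefix = 'sub-Dzh_task-'
suffix = ('_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-'
          'Language_sub01_component_ica_s1_.nii')

filenames = [
    prefix + 'FmriPictures' + suffix,
    prefix + 'FmriVernike' + suffix,
    prefix + 'FmriWgWords' + suffix,
    prefix + 'RestingState' + suffix,
]

# Replace both fixed fragments with the empty string
tasks = [name.replace(prefix, '').replace(suffix, '') for name in filenames]
print(tasks)  # ['FmriPictures', 'FmriVernike', 'FmriWgWords', 'RestingState']
```

This is brittle if any filename deviates from the fixed pattern; the regex answers above are more robust in that case.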
I am trying to process a CSV file and want to extract the entire row if it contains a given string, appending it to a new list. But my approach gives me every row that contains the string as a substring, whereas I want only the row with the exact string. Let me explain with an example:
I have the following list of lists:
myList = [['abc', 1, 3, 5, 6], ['abcx', 5, 6, 8, 9], ['abcn', 7, 12, 89, 23]]
I want to get the whole sub-list that contains the string 'abc'. I tried the following:
newList = []
for temp in myList:
    if 'abc' in temp:
        newList.append(temp)
But this gives me all the values, since 'abc' is a substring of the other strings too. What is a cleaner approach to this problem?
Update:
I have a huge CSV file, which I am reading line by line using readlines(), and I want to find the line that has the "abc" gene and push the whole line into a list. But when I do if 'abc' in, I get all the other strings that also have "abc" as a substring. How can I ignore the substrings?
From your comment on the question, it sounds like pandas (or numpy) is a straightforward fit if you are processing a CSV file. Pandas has a built-in CSV reader, and you can extract the row and convert it into a list or a numpy array in a couple of lines. Here's how I would do it:
import pandas
df = pandas.read_csv("your_csv")
#assuming you have column names.
x = df.loc[df['col_name'] == 'abc'].values.tolist() #this will give you the whole row and convert into a list.
Or
import numpy as np
x = np.array(df.loc[df['col_name'] == 'abc']) #gives you a numpy array
This gives you much more flexibility to do processing. I hope this helps.
It seems you want to append only if the string matches 'abc' exactly and nothing else (e.g. true for 'abc', but false for 'abcx'). Is this correct?
If so, you need to make two corrections:
First, you need to index into the list: currently temp is the entire sub-list, but if you know the string will always be at position 0, index that in the if statement. (If you don't, a nested for loop will work.)
Second, you need to use '==' instead of 'in': 'in' accepts a match that is part of a larger string, whereas '==' requires an exact match.
newList = []
for temp in myList:
    if temp[0] == 'abc':
        newList.append(temp)
or
newList = [temp for temp in myList if temp[0] == 'abc']
Your code already works, as others have said before me.
Part of your question asked for cleaner code. Since you only want the sub-lists that contain your string, I would recommend filter:
check_against_string = 'abc'
newList = list(filter(lambda sub_list: check_against_string in sub_list, myList))
filter keeps the elements for which the function returns true (the list() wrapper is needed because Python 3's filter returns an iterator). It is exactly the code you wrote, but more pythonic!
I am new to Python and want to split what I have read in from a text file into two specific parts. Below is an example of what could be read in:
f = ['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]
What I want is to end up with two variables that the second part of the program can use:
words = ['Cats','like','dogs','as','much','cats.']
numbers = [1,2,3,4,5,4,3,2,6]
I have tried using:
words,numbers = f.split("][")
However, this removes the brackets at the split point from the two new variables, which means the second part of my program (which recreates the original text) does not work.
Thanks.
I assume f is a string like
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
then we can find the index of ][ and add one to find the point between the brackets
i = f.index('][')
a, b = f[:i+1], f[i+1:]
print(a)
print(b)
output:
['Cats','like','dogs','as','much','cats.']
[1,2,3,4,5,4,3,2,6]
Another alternative, if you still want to use split():
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
d="]["
print(f.split(d)[0]+d[0])
print(d[1]+f.split(d)[1])
If you can make your file look something like this:
[["Cats","like","dogs","as","much","cats."],[1,2,3,4,5,4,3,2,6]]
then you could simply use Python's json module to do this for you. Note that the JSON format requires double quotes rather than single.
import json
f = '[["Cats","like","dogs","as","much","cats."],[1,2,3,4,5,4,3,2,6]]'
a, b = json.loads(f)
print(a)
print(b)
Documentation for the json library can be found here: https://docs.python.org/3/library/json.html
An alternative to Patrick's answer using regular expressions:
import re
data = "f = ['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
pattern = r'f = (?P<words>\[.*?\])(?P<numbers>\[.*?\])'
match = re.match(pattern, data)
words = match.group('words')
numbers = match.group('numbers')
print(words)
print(numbers)
Output
['Cats','like','dogs','as','much','cats.']
[1,2,3,4,5,4,3,2,6]
If I understand correctly, you have a text file that contains ['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6] and you just need to split that string at the transition between brackets. You can do this with the string.index() method and string slicing. See my console output below:
>>> f = open('./catsdogs12.txt', 'r')
>>> input = f.read()[:-1] # Read file without trailing newline (\n)
>>> input
"['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
>>> bracket_index = input.index('][') # Get index of transition between brackets
>>> bracket_index
41
>>> words = input[:bracket_index + 1] # Slice from beginning of string
>>> words
"['Cats','like','dogs','as','much','cats.']"
>>> numbers = input[bracket_index + 1:] # Slice from middle of string
>>> numbers
'[1,2,3,4,5,4,3,2,6]'
Note that this will leave you with Python strings that look visually identical to lists (arrays). If you need the data as native Python objects (i.e. so you can actually use it like a list), you'll need some combination of string[1:-1].split(',') on both strings and map() on the numbers list to convert the numbers from strings to numbers.
Hope this helps!
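A sketch of that final conversion step, using the two sliced strings from above. The manual split works for the numbers; as an alternative not mentioned above, ast.literal_eval from the standard library can parse each bracketed string as a Python literal in one call:

```python
import ast

words_str = "['Cats','like','dogs','as','much','cats.']"
numbers_str = '[1,2,3,4,5,4,3,2,6]'

# Manual approach described above: strip the brackets, split on commas,
# and convert each piece to an int
numbers = [int(n) for n in numbers_str[1:-1].split(',')]

# Alternative: ast.literal_eval safely parses Python literals,
# handling quoting and types for you
words = ast.literal_eval(words_str)

print(words)    # ['Cats', 'like', 'dogs', 'as', 'much', 'cats.']
print(numbers)  # [1, 2, 3, 4, 5, 4, 3, 2, 6]
```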
Another thing you can do is first replace ][ with ]-[ and then split or partition on -. I suggest split, since we don't want the delimiter in the result.
SPLIT
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
f = f.replace('][',']-[')
a,b = f.split('-')
Output
>>> print(a)
['Cats','like','dogs','as','much','cats.']
>>> print(b)
[1,2,3,4,5,4,3,2,6]
PARTITION
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
f = f.replace('][',']-[')
a,b,c = f.partition('-')
Output
>>> print(a)
['Cats','like','dogs','as','much','cats.']
>>> print(c)
[1,2,3,4,5,4,3,2,6]
I have a list that contains file paths like this:
my_paths = ['/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv','/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv','/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv','/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']
I would like to sort the list based on the date in the second level, e.g. 140616 in 15383_chilo_140616_099_X. So the output should be:
['/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv', '/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv', '/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv', '/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']
What is the best way to do this? I can't make up my mind whether I should first loop through the paths and take the second level like this:
for my_path in my_paths:
    (SeqDir, seqFileName) = os.path.split(my_path)
    (SeqDir_remaining, second_level) = os.path.split(SeqDir)
...and then split on underscores, take the date, sort on it and recover the path for each date; or use a dictionary with the dates as keys and the paths as values (but then I have a problem with sorting).
Appreciate your help.
Thanks!
Split on underscores (limiting to three splits) and take the third element, casting it to int; the path separators are irrelevant, you just want the number between the second and third underscores:
my_paths = ['/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv','/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv','/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv','/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']
my_paths.sort(key=lambda x: int(x.split("_", 3)[2]))
Output:
['/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv',
'/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv',
'/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv',
'/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']
If they are actually year/month/day dates, you don't need to use int.
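A quick check of that claim with the dates from the example paths: fixed-width YYMMDD strings sort lexicographically in the same order as their integer values, so the int() cast is optional here.

```python
# Same-width numeric strings sort the same way as their int values
dates = ['140618', '140610', '140616', '140620']
assert sorted(dates) == sorted(dates, key=int)
print(sorted(dates))  # ['140610', '140616', '140618', '140620']
```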
Write a function to extract the thing you want to sort on:
def getdate(item):
    ...
then
my_paths.sort(key=getdate)
Your getdate function might need to be better than this, but you get the idea:
>>> import pprint
>>> pprint.pprint(my_paths)
['/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv',
'/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv',
'/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv',
'/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']
>>> def getdate(item):
...     start = len('/home/mark/results/chilo/15381_chilo_')
...     end = start + 6
...     return item[start:end]
...
>>> getdate(my_paths[0])
'140618'
>>> my_paths.sort(key=getdate)
>>> pprint.pprint(my_paths)
['/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv',
'/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv',
'/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv',
'/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']
>>>
def sort_links(my_paths, pattern):
    # to sort by chilo_xxxxxx, pass pattern = r'(chilo_\d+)'
    import re
    return sorted(my_paths, key=lambda x: re.search(pattern, x).group(1))

print(sort_links(my_paths, r'(chilo_\d+)'))
['/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv', '/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv', '/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv', '/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']
I have a file in which there is the following info:
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
I'm looking for a way to match the colon and append whatever appears after it (the numbers) to a dictionary whose keys are the animal names at the beginning of each line.
Actually, regular expressions are unnecessary, provided that your data is well formatted and contains no surprises.
Assuming that data is a variable containing the string that you listed above:
dict(item.split(":") for item in data.split())
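Note that this one-liner keys the dictionary by the full filename (e.g. 'cats_1875.txt'), not by the animal name the question asked for, and animal-name keys would collide for duplicates. A sketch that keys by animal and collects duplicates into lists (shown on a subset of the data):

```python
data = """dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088"""

# Split each line at the colon, then take the animal prefix
# before the first underscore; collect values per animal
d = {}
for item in data.split():
    name, value = item.split(':')
    animal = name.split('_')[0]
    d.setdefault(animal, []).append(float(value))

print(d)  # {'dogs': [34.13559322033898], 'cats': [23.25581395348837, 22.087912087912088]}
```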
t = """
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
"""
import re
d = {}
for p, q in re.findall(r'^(.+?)_.+?:(.+)', t, re.M):
    d.setdefault(p, []).append(q)
print(d)
Why don't you use the Python find method to locate the index of the colon, which you can then use to slice the string:
>>> x='dogs_3351.txt:34.13559322033898'
>>> key_index = x.find(':')
>>> key = x[:key_index]
>>> key
'dogs_3351.txt'
>>> value = x[key_index+1:]
>>> value
'34.13559322033898'
>>>
Read in each line of the file as text and process the lines individually as above.
Without regex and using defaultdict:
from collections import defaultdict
data = """dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478"""
dictionary = defaultdict(list)
for l in data.splitlines():
    animal = l.split('_')[0]
    number = l.split(':')[-1]
    dictionary[animal].append(number)
Just make sure your data is well formatted.