Finding particular column value via regex - python

I have a txt file containing multiple rows as below.
56.0000 3 1
62.0000 3 1
74.0000 3 1
78.0000 3 1
82.0000 3 1
86.0000 3 1
90.0000 3 1
94.0000 3 1
98.0000 3 1
102.0000 3 1
106.0000 3 1
110.0000 3 0
116.0000 3 1
120.0000 3 1
Now I am looking for the row which has '0' in the third column .
I am using python regex package. What I have tried is re.match("(.*)\s+(0-9)\s+(1)",line) but of no help..
What should be the regular expression pattern I should be looking for?

You probably don't need a regex for this. You can strip trailing whitespaces from the right side of the line and then check the last character:
if line.rstrip()[-1] == "0": # since your last column only contains 0 or 1
...

Just split line and read value from list.
>>> line = "56.0000 3 1"
>>> a=line.split()
>>> a
['56.0000', '3', '1']
>>> print a[2]
1
>>>
Summary:
f = open("sample.txt",'r')
for line in f:
tmp_list = line.split()
if int(tmp_list[2]) == 0:
print "Line has 0"
print line
f.close()
Output:
C:\Users\dinesh_pundkar\Desktop>python c.py
Line has 0
110.0000 3 0

Related

Removing the first and only the first '-' in the values of a string column

1 2
0 ADRC-111-01 ADRC111
1 ADRC-11955-01 ADRC11955
2 ADRC-18133-01 ADRC18133
3 SWAN0023-03 SWAN0023
In Column 1, I wish to get rid of the first - sign, regardless of how many are in the the cell. There are one or two - in each entry.
Desired output:
1 2
0 ADRC111-01 ADRC111
1 ADRC11955-01 ADRC11955
2 ADRC18133-01 ADRC18133
3 SWAN002303 SWAN0023
Use .str.replace with n=1:
df['1'] = df['1'].str.replace('-', '', n=1)
Output:
>>> df
1 2
0 ADRC111-01 ADRC111
1 ADRC11955-01 ADRC11955
2 ADRC18133-01 ADRC18133
3 SWAN002303 SWAN0023

Remove chinese parentheses and inside content in string column with Python

I would like to remove chinese type parentheses and their contents inside from the following dataframe:
id title
0 1 【第一次拍卖】深圳市光明新区公明街道中心区(拍卖) ---> (拍卖)need to remove
1 2 【第一次拍卖】深圳市龙岗区龙岗街道新生社区
2 3 【第一次拍卖】(破)广东省深圳市龙岗区布吉新区 ---> (破) need to remove
3 4 【第一次拍卖】深圳市宝安区新安街道新城大道
4 5 (拍卖)【第二次拍卖】深圳市盐田区沙头角东和路 ---> (拍卖) need to remove
I tried with df['title'].str.replace(r'\([^()]*\)', '') and df['title'].str.replace(r'\([^)]*\)', ''), but they both can remove them if they are in the end of string.
0 【第一次拍卖】深圳市光明新区公明街道中心区 ---> this row works
1 【第一次拍卖】深圳市龙岗区龙岗街道新生社区
2 【第一次拍卖】(拍卖)广东省深圳市龙岗区布吉新区
3 【第一次拍卖】深圳市宝安区新安街道新城大道
4 (拍卖)【第二次拍卖】深圳市盐田区沙头角东和路
How could I modify my code to get the following output? Thank you.
0 【第一次拍卖】深圳市光明新区公明街道中心区
1 【第一次拍卖】深圳市龙岗区龙岗街道新生社区
2 【第一次拍卖】广东省深圳市龙岗区布吉新区
3 【第一次拍卖】深圳市宝安区新安街道新城大道
4 【第二次拍卖】深圳市盐田区沙头角东和路
The following three solutions work out:
df['title'].str.replace(r'\([^()]*\)', '')
df['title'].str.replace(r'\([^)]*\)', '')
df['title'].str.replace(r'\(\S+\)', '')
Out:
0 【第一次拍卖】深圳市光明新区公明街道中心区
1 【第一次拍卖】深圳市龙岗区龙岗街道新生社区
2 【第一次拍卖】广东省深圳市龙岗区布吉新区
3 【第一次拍卖】深圳市宝安区新安街道新城大道
4 【第二次拍卖】深圳市盐田区沙头角东和路

select some lines as a group then give each line same label in same group in txt file

I want to extract some lines of txt file between two patterns as different groups and give each line a same number as a label in each group.(last column input the group label ) for example, have data:
==g1
a 1 2 3 4
b 2 1 2 3
~~
==g2
c 2...
d 1...
...
I want to get output like
a 1 2 3 4 1
b 2 1 2 3 1
c 2 ... 2
d 1 ... 2
...
I use python 3.7
with open('input.txt') as infile, open('output.txt', 'w') as outfile:
copy = False
for line in infile:
#ng as the new label in each group
ng=0
if line.strip() == "start":
copy = True
continue
elif line.startswith ('with'):
copy = False
continue
elif copy:
ng=ng+1
line=line.rstrip('\n') +'\t'+ str(ng)+'\n'
outfile.write(line)

Splitting a string into a list of integers with Python

This method inputs a file and the directory of the file. It contains a matrix of data, and needs to copy the first 20 columns of each row after the given row name and the corresponding letter for the row. The first 3 lines of each file is skipped because it has unimportant information that is not needed, and it also doesn't need the data at the bottom of the file.
For example a file would look like:
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)
-blank line
unimportant information--------
unimportant information--------
The output of the method needs to print out a "matrix" in some given form.
So far the output gives a list of each row as a string, however I'm trying to figure out the best way to approach the problem. I don't know how to ignore the unimportant information at the end of the files. I don't know how to only retrieve the first 20 columns after the letter in each row, and I don't know how to ignore the row number and the row letter.
def pssmMatrix(self,ipFileName,directory):
dir = directory
filename = ipFileName
my_lst = []
#takes every file in fasta folder and put in files list
for f in os.listdir(dir):
#splits the file name into file name and its extension
file, file_ext = os.path.splitext(f)
if file == ipFileName:
with open(os.path.join(dir,f)) as file_object:
for _ in range(3):
next(file_object)
for line in file_object:
my_lst.append(' '.join(line.strip().split()))
return my_lst
Expected results:
['-1 2 -3 4 5 6 7'], ['3 -1 3 4 0 -2 1'], ['3 -1 3 6 0 -2 5']
Actual results:
['1 F -1 2 -3 4 5 6 7'], ['2 L 3 -1 3 4 0 -2 1'], ['3 A 3 -1 3 6 0 -2 5'], [' '], [' unimportant info'], ['unimportant info']
Try this solution.
import re
reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')
text = """
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)"""
ignore_start = 5 # 0,1,2,3 = 4
expected_array = []
for index, line in enumerate(text.splitlines()):
if(index >= ignore_start):
if reg.search(line):
result = reg.search(line).group(0).strip()
# Use Result
expected_array.append(' '.join(result))
print(expected_array)
# Result: [
#'- 1 2 - 3 4 5 6 7',
#'3 - 1 3 4 0 - 2 1',
#'3 - 1 3 6 0 - 2 5'
#]
Ok so it looks to me like you have a file with certain lines that you want and the lines that you want always start with a number followed by a letter. So what we can do is apply a regular expression to this that only gets lines that match that pattern and only get the numbers after the pattern
The expression for this would look like (?<=[0-9]\s[A-Z]\s)[0-9\-\s]+
import re
reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')
for line in file:
if reg.search(line):
result = reg.search(test).group(0)
# Use Result
my_lst.append(' '.join(result))
Hope that helps

Listing distributions in Python

I have a sheet of numbers, separated by spaces into columns. Each column represents a different category, and within each column, each number represents a different value. For example, column number four represents age, and within the column, the number 5 represents an age of 44-55. Obviously, each row is a different person's record. I'd like to use a Python script to search through the the sheet, and find all columns where the sixth column is number "1." After that, I want to know how many times each number in column one appears where the number in column six is equal to "1." The script should output to the user that "While column six equals '1', the value '1' appears 12 times in column one. The value '2' appears 18 times..." etc. I hope I'm being clear here. I just want it to list the numbers, basically. Anyway, I'm new to Python. I've attached my code below. I think I should be using dictionaries, but I'm just not totally sure how. So far, I haven't really come close to figuring this out. I would really appreciate if someone could walk me through the logic that would be behind such code. Thank you so much!
ldata = open("list.data", "r")
income_dist = {}
for line in ldata:
linelist = line.strip().split(" ")
key_income_dist = linelist[6]
if key_income_dist in income_dist:
income_dist[key_income_dist] = 1 + income_dist[key_income_dist]
else:
income_dist[key_income_dist] = 1
ldata.close()
print value_no_occupations
First, indentation is majorly important in Python and the above is bad: the 5 lines following linelist = line.strip().split(" ") need to be indented to be in the loop like they should be.
Next they should be indented further and this line added before them:
if len(linelist)>6 and linelist[6]=="1":
This line skips over short lines (there are some), and tests for what you said you wanted: "where column six equals "1."" This is column [6] where the first number on the line is referenced as [0] (these are "offsets", not "cardinal", or counting, numbers).
You'll probably want to change key_income_dist = linelist[6] to key_income_dist = linelist[0] or [1] to get what you want. Play around if necessary.
Finally, you should say print income_dist at the end to get a look at your results. If you want fancier output, study up on formatting.
This is actually easier than it seems! The key is collections.Counter
from collections import Counter
ldata = open("list.data")
rows = [tuple(row.split()) for row in ldata if row.split()[5]==1]
# warning this will break if some rows are shorter than 6 columns
first_col = Counter(item[0] for item in rows)
If you want the distribution of every column (not just the first) do:
distribution = {column: Counter(item[column] for item in rows) for column in range(len(rows[0]))}
# warning this will break if all rows are not the same size!
Considering that the data file has ~9000 rows of data, if you don't want to keep the original data, you can combine step 1 & 2 to make the program use less memory and a little faster.
ldata = open("list.data", "r")
# read in all the rows, note that the list values are strings instead of integers
# keep only the rows with 6th column = '1'
only1 = []
for line in ldata:
if line.strip() == '': # ignor blank lines
continue
row = tuple(line.strip().split(" "))
if row[5] == '1':
only1.append(row)
ldata.close()
# tally the statistics
income_dist = {}
for row in only1:
if row[0] in income_dist:
income_dist[row[0]] += 1
else:
income_dist[row[0]] = 1
# print result
print "While column six equals '1',"
for num in sorted(income_dist):
print "the value %s appears %d times in column one." % (num, income_dist[num])
Sample Test Data in list.data:
9 2 1 5 4 5 5 3 3 0 1 1 7 NA
9 1 1 5 5 5 5 3 5 2 1 1 7 1
9 2 1 3 5 1 5 2 3 1 2 3 7 1
1 2 5 1 2 6 5 1 4 2 3 1 7 1
1 2 5 1 2 6 3 1 4 2 3 1 7 1
8 1 1 6 4 8 5 3 2 0 1 1 7 1
1 1 5 2 3 9 4 1 3 1 2 3 7 1
6 1 3 3 4 1 5 1 1 0 2 3 7 1
2 1 1 6 3 8 5 3 3 0 2 3 7 1
4 1 1 7 4 8 4 3 2 0 2 3 7 1
1 1 5 2 4 1 5 1 1 0 2 3 7 1
4 2 2 2 3 2 5 1 2 0 1 1 5 1
8 2 1 3 6 6 2 2 4 2 1 1 7 1
7 2 1 5 3 5 5 3 4 0 2 1 7 1
1 1 5 2 3 9 4 1 3 1 2 3 7 1
6 1 3 3 4 1 5 1 1 0 2 3 7 1
2 1 1 6 3 8 5 3 3 0 2 3 7 1
4 1 1 7 4 8 4 3 2 0 2 3 7 1
1 1 5 2 4 9 5 1 1 0 2 3 7 1
4 2 2 2 3 2 5 1 2 0 1 1 5 1
Following your original program logic, I come up with this version:
ldata = open("list.data", "r")
# read in all the rows, note that the list values are strings instead of integers
linelist = []
for line in ldata:
linelist.append(tuple(line.strip().split(" ")))
ldata.close()
# keep only the rows with 6th column = '1'
only1 = []
for row in linelist:
if row[5] == '1':
only1.append(row)
# tally the statistics
income_dist = {}
for row in only1:
if row[0] in income_dist:
income_dist[row[0]] += 1
else:
income_dist[row[0]] = 1
# print result
print "While column six equals '1',"
for num in sorted(income_dist):
print "the value %s appears %d times in column one." % (num, income_dist[num])

Categories

Resources