Extract rows based on values from text file using Python - python

I have a list of information in file A that I want to extract according to the numbering in file B. Given the values 4 and 5, every row in file A whose 4th column is 4 or 5 should be extracted. How can I do this in Python? The code below only extracts rows by index where the value is 4.
with open("B.txt", "rt") as f:
classes = [int(line) for line in f.readlines()]
with open("A.txt", "rt") as f:
lines = [line for index, line in enumerate(f.readlines()) if classes[index]== 4]
lines_all= "".join(lines)
with open("C.txt", "w") as f:
f.write(lines_all)
A.txt
hg17_ct_ER_ER_1003 36 42 1
hg17_ct_ER_ER_1003 109 129 2
hg17_ct_ER_ER_1003 110 130 2
hg17_ct_ER_ER_1003 129 149 2
hg17_ct_ER_ER_1003 130 150 2
hg17_ct_ER_ER_1003 157 163 3
hg17_ct_ER_ER_1003 157 165 3
hg17_ct_ER_ER_1003 179 185 4
hg17_ct_ER_ER_1003 197 217 5
hg17_ct_ER_ER_1003 220 226 6
B.txt
4
5
Desired output
hg17_ct_ER_ER_1003 179 185 4
hg17_ct_ER_ER_1003 197 217 5

Create a set of the lines/numbers from the B file, then compare the last element of each row in A against that set:
import csv

with open("a.txt") as f, open("b.txt") as f2:
    st = set(line.rstrip() for line in f2)
    r = csv.reader(f, delimiter=" ")
    data = [row for row in r if row[-1] in st]

print(data)
[['hg17_ct_ER_ER_1003', '179', '185', '4'], ['hg17_ct_ER_ER_1003', '197', '217', '5']]
Set delimiter= to whatever your file actually uses, or omit it altogether if the file is comma-separated.
Or:
with open("a.txt") as f, open("b.txt") as f2:
st = set(line.rstrip() for line in f2)
data = [line.rstrip() for line in f if line.rsplit(None, 1)[1] in st ]
print(data)
['hg17_ct_ER_ER_1003 179 185 4', 'hg17_ct_ER_ER_1003 197 217 5']
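If you also want the matching lines written to C.txt (as in the question's code) rather than just printed, a minimal sketch based on the second variant:

with open("a.txt") as f, open("b.txt") as f2, open("C.txt", "w") as out:
    st = set(line.rstrip() for line in f2)
    for line in f:
        # keep the line if its last whitespace-separated field is in the set from b.txt
        if line.rsplit(None, 1)[-1] in st:
            out.write(line)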

with open("B.txt", "r") as target_file:
target = [i.strip() for i in target_file]
with open("A.txt", "r") as data_file:
r = filter(lambda x: x.strip().rsplit(None, 1)[1] in target, data_file)
print "".join(r)
the output:
hg17_ct_ER_ER_1003 179 185 4
hg17_ct_ER_ER_1003 197 217 5
As mentioned by @Padraic, I changed split()[-1] to rsplit(None, 1)[1].
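For illustration, rsplit(None, 1) splits on whitespace only once, starting from the right, so only the last field is separated out instead of splitting every column; both forms give the same last field here:

>>> line = "hg17_ct_ER_ER_1003 197 217 5\n"
>>> line.split()[-1]
'5'
>>> line.rsplit(None, 1)[1]
'5'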

Related

Compare a file line by line and for those lines that meet the given requirement, print them

I have a txt file with content:
577 181 619 216
603 175 630 202
651 180 681 202
661 152 676 179
604 176 630 204
605 177 632 202
I want to read each line of this file and compare every line with each of the others; if, for example, each number in line i differs from the corresponding number in line j by at most 3, then remove one of them and keep only one of those lines in the output.
For above content I want the output as:
577 181 619 216
603 175 630 202
651 180 681 202
661 152 676 179
In this case the second line, 603 175 630 202, meets the condition, so the other two lines (5 and 6) are removed and only line 2 is written to the output, as shown above.
f1 = open("result.txt", "r")
f2 = open("final.txt", "w" )
for line1 in f1:
for line2 in f1:
if each number line2 - line1 <= 3:
#remove one of those line and write the remaining line to new file
#f2.write(lines)
f1.close()
f2.close()
For example, look at lines 2, 5 and 6: for each pair of corresponding numbers the difference is at most 3. For lines 2 and 5, the first elements are 603 and 604 (604 - 603 = 1, i.e. at most 3), the second elements give 176 - 175 = 1, the third 630 - 630 = 0, and the fourth 204 - 202 = 2. All of this falls under the given condition, so for lines 2 and 5 only one line is needed.
For starters, you need to convert the lines into numbers.
With that, find the absolute differences between corresponding values across the lines.
lines = f1.readlines()  # read the file once so it can be iterated more than once
for i, line1 in enumerate(lines):
    nums1 = list(map(int, line1.strip().split()))
    for j, line2 in enumerate(lines):
        if j <= i:  # skip repeating + equal lines
            continue
        nums2 = list(map(int, line2.strip().split()))
        diffs = [abs(nums2[x] - nums1[x]) for x in range(len(nums1))]
        print(f'Diff between {nums1} and {nums2}: {diffs}')  # for debugging
        # check all the differences, something like this...
        if not all(d <= 3 for d in diffs):
            f2.write(line1)
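For completeness, here is a minimal sketch of the whole filtering step, using the file names result.txt and final.txt from the question; it keeps a line only when it is not within 3 of every element of some line already kept, which reproduces the desired output for the sample data:

kept = []  # rows (as lists of ints) that survived the filter

with open("result.txt") as f_in, open("final.txt", "w") as f_out:
    for line in f_in:
        nums = list(map(int, line.split()))
        # a line is a near-duplicate if every element is within 3 of a line already kept
        is_duplicate = any(
            all(abs(a - b) <= 3 for a, b in zip(nums, prev))
            for prev in kept
        )
        if not is_duplicate:
            kept.append(nums)
            f_out.write(line)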

Replace a number in a file in python - script worked and now does not

I need to replace a number in a file for a programme to work. These are ASCII files, and the format itself doesn't need to change.
The head of the file looks like:
106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 1 1 1 1 1 1 1 1 1 1
I want to replace all numbers equal to 106 with -9999 and all numbers equal to 1 to 0.1.
This is my code, which worked previously and now does not:
import numpy as np

lowres_file = str(lowres_file)
with open(lowres_file, 'r') as f_in:
    with open(lowres_file_converted, 'w') as f_out:
        for line in f_in:
            print(line)
            if line[0] != ' ':
                f_out.write(line)
            else:
                split_line = np.array(line[1:-1].split(' '), dtype=int)
                split_line[split_line == 106] = -9999
                split_line[split_line == 1] = 0.1
                split_line = np.array(split_line, dtype=str)
                new_line = ' ' + ' '.join(split_line) + '\n'
                f_out.write(new_line)
However, this doesn't replace the 1s with 0.1; it just replaces them with 0s. Converting 106 to -9999 works just fine.
Is there something wrong with my code?
filepointer = open('word.txt', 'r+')
text = filepointer.read().split(' ')
str_data = ''
for i in text:
    if i == '106':
        str_data += '-9999 '
    elif i == '1':
        str_data += '0.1 '
    else:
        str_data += i + ' '  # keep any other value unchanged
filepointer.close()
filepointer = open('word.txt', 'w')
filepointer.write(str_data)
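As for why the original snippet writes 0 instead of 0.1: split_line is created with dtype=int, and assigning 0.1 into an integer array truncates it to 0. A minimal sketch of only the dtype fix, assuming numpy is imported as np and line is one data line from the file:

split_line = np.array(line.split(), dtype=float)  # float dtype, so 0.1 survives the assignment
split_line[split_line == 106] = -9999
split_line[split_line == 1] = 0.1
new_line = ' ' + ' '.join(str(x) for x in split_line) + '\n'  # note: values are written as floats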

How to bring out the common values between two csv files and create a new csv file with the desired output in python?

Hi, I have two csv files, boom.csv and kaboom.csv, which contain data like this:
boom.csv
id;rollnumber;total;subjects;obtained;rank;standing
260406;260737;137;10;127;10;111
552592;260806;134;10;124;10;108
402788;260837;134;10;124;10;108
262744;260851;131;10;121;10;105
502870;260874;131;10;121;10;105
342541;260879;131;10;121;10;105
502806;260902;135;10;125;10;109
261664;261182;217;21;196;15;161
and kaboom.csv consists of data like this
kaboom.csv
id
342541
552592
402788
502806
502870
Here I'm trying to compare these two files, find the data that is common to both, and store it in a new csv file.
To be more exact: I'm taking the "id" values in kaboom.csv, comparing them with the id values in boom.csv, and trying to create a new csv file that contains only the matching ids together with the entire row of values associated with each (rollnumber, total, subjects, obtained, rank, standing).
Desired output:
bigbang.csv
id rollnumber total subjects obtained rank standing
402788 260837 134 10 124 10 108
552592 260806 134 10 124 10 108
502870 260874 131 10 121 10 105
342541 260879 131 10 121 10 105
502806 260902 135 10 125 10 109
Can anyone help with this? How do I do it in python?
Using Pandas, you can easily load CSV files as dataframes and merge them by column:
import pandas as pd
boom = pd.read_csv('boom.csv',sep = ';')
kaboom = pd.read_csv('kaboom.csv',header=0,names=['id'])
bigbang = pd.merge(boom, kaboom, on="id")
print(bigbang)
Output:
id rollnumber total subjects obtained rank standing
0 552592 261347 243 16 227 19 174
1 402788 261381 231 16 215 19 164
2 502870 262871 248 22 226 21 151
3 342541 267359 117 8 108 8 106
4 502806 261664 235 14 221 15 173
You can then easily write the resulting dataframe to a CSV file with:
bigbang.to_csv('bigbang.csv',sep = ' ',index = False)
Assuming your DataFrames are loaded as df_initial and df_compare (more readable than boom and kaboom), a simple merge would suffice:
df_merge = pd.merge(df_initial, df_compare, on='id', how='left')
To expand, this command matches rows from both DataFrames on id. With how='left' every row of the left DataFrame is kept (unmatched rows get NaN in the columns coming from the right); to keep only the rows whose id appears in both files, as in the desired output, use how='inner' (the default).
I've included a longer version of this solution on this collab notebook.
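An alternative that avoids the merge entirely is to filter boom.csv by membership in the kaboom ids; a small sketch, assuming pandas and the file names from the question:

import pandas as pd

boom = pd.read_csv('boom.csv', sep=';')
kaboom = pd.read_csv('kaboom.csv')  # single 'id' column

# keep only the rows of boom whose id appears in kaboom
bigbang = boom[boom['id'].isin(kaboom['id'])]
bigbang.to_csv('bigbang.csv', sep=' ', index=False)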
We start by getting all the items from boom.csv and kaboom.csv, reading each file via the csv module:
import csv

boom_items = []
# Iterate over the csv and read all rows (boom.csv in the question is semicolon-delimited)
with open('boom.csv') as fp:
    reader = csv.reader(fp, delimiter=';', skipinitialspace=True)
    next(reader)  # skip the header row
    boom_items = [row for row in reader]

kaboom_items = []
# Iterate over the csv and read all rows
with open('kaboom.csv') as fp:
    reader = csv.reader(fp, skipinitialspace=True)
    next(reader)  # skip the header row
    kaboom_items = [row for row in reader]
Then we iterate over both lists and find the common id's:
bigbang_items = [ item_2 for item_1 in kaboom_items for item_2 in boom_items if item_1[0] == item_2[0]]
Then we save this list to bigbang.csv
headers = ['id', 'rollnumber', 'total', 'subjects', 'obtained', 'rank', 'standing']
with open('bigbang.csv', 'w') as fp:
    writer = csv.writer(fp, delimiter='\t')
    writer.writerow(headers)
    writer.writerows(bigbang_items)
Hence the bigbang.csv would look like
id rollnumber total subjects obtained rank standing
342541 267359 117 8 108 8 106
552592 261347 243 16 227 19 174
402788 261381 231 16 215 19 164
502806 261664 235 14 221 15 173
502870 262871 248 22 226 21 151
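If the files are large, building a set of the kaboom ids makes the membership test O(1) instead of rescanning the whole list for every row; a small sketch reusing boom_items and kaboom_items from above:

kaboom_ids = {row[0] for row in kaboom_items}  # set of ids for fast lookup
bigbang_items = [row for row in boom_items if row[0] in kaboom_ids]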

Replacing value in text file column with string

I'm having a pretty simple issue. I have a dataset (small sample shown below)
22 85 203 174 9 0 362 40 0
21 87 186 165 5 0 379 32 0
30 107 405 306 25 0 756 99 0
6 5 19 6 2 0 160 9 0
21 47 168 148 7 0 352 29 0
28 38 161 114 10 3 375 40 0
27 218 1522 1328 114 0 1026 310 0
21 78 156 135 5 0 300 27 0
The first issue I needed to cover was replacing each space with a comma. I did that with the following code:
import fileinput

with open('Data_Sorted.txt', 'w') as f:
    for line in fileinput.input('DATA.dat'):
        line = line.split(None, 8)
        f.write(','.join(line))
The result was the following
22,85,203,174,9,0,362,40,0
21,87,186,165,5,0,379,32,0
30,107,405,306,25,0,756,99,0
6,5,19,6,2,0,160,9,0
21,47,168,148,7,0,352,29,0
28,38,161,114,10,3,375,40,0
27,218,1522,1328,114,0,1026,310,0
21,78,156,135,5,0,300,27,0
My next step is to grab the values from the last column, check if they are less than 2 and replace it with the string 'nfp'.
I'm able to separate the last column with the following:
for line in open("Data_Sorted.txt"):
columns = line.split(',')
print columns[8]
My issue is implementing the conditional to replace the value with the string, and then I'm not sure how to put the modified column back into the original dataset.
There's no need to do this in two loops through the file. Also, you can use -1 to index the last element in the line.
import fileinput

with open('Data_Sorted.txt', 'w') as f:
    for line in fileinput.input('DATA.dat'):
        # strip newline character and split on whitespace
        line = line.strip().split()
        # check condition for last element (assuming you're using ints)
        if int(line[-1]) < 2:
            line[-1] = 'nfp'
        # write out the line, but you have to add the newline back in
        f.write(','.join(line) + "\n")
Further Reading:
Negative list index?
Understanding Python's slice notation
You need to convert columns[8] to an int and compare if it is less than 2.
for line in open("Data_Sorted.txt"):
columns = line.split(',')
if (int(columns[8]) < 2):
columns[8] = "nfp"
print columns
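Since the question also asks how to put the modified column back into the dataset, a minimal sketch that writes the updated rows to a new file (the output file name here is just an example):

with open("Data_Sorted.txt") as f_in, open("Data_Updated.txt", "w") as f_out:
    for line in f_in:
        columns = line.strip().split(',')
        if int(columns[-1]) < 2:
            columns[-1] = "nfp"
        f_out.write(','.join(columns) + "\n")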

Print Dictionary with commas separating values

import csv
import output

fill = input("Enter File name:")
f = open(fill)
csv_f = csv.reader(f)
m = open('data.csv', "w")
dict_out = {}
for row in csv_f:
    if row[1] in dict_out:
        dict_out[row[1]] += row[3]
    else:
        dict_out[row[1]] = row[3]
for title, value in dict_out.items():
    m.write('{},'.format(title))
    m.write('{} \n'.format(value))
m.close()
Prints my csv as
Title,Detail
Siding, 50 63 22 68 138 47 123 107 107 93 117
Asphalt, 49 8 72 19 125 95 33 83 123 144
Rail, 82 98 89 62 58 66 24 77 120 93
Grinding, 127 47 20 66 29 137 33 145 3 98
Concrete, 130 75 12 88 22 137 114 88 143 16
I would like to put a comma in between the numbers. I have tried m.write(',') after m.write('{} \n'.format(value)) but it only adds it after the last one. How can I format it so it will output as:
Title,Detail
Siding, 50,63,22,68,138,47,123,107,107,93,117
Asphalt, 49,8,72,19,125,95,33,83,123,144
Rail, 82,98,89,62,58,66,24,77,120,93
Grinding, 127,47,20,66,29,137,33,145,3,98
Concrete, 130,75,12,88,22,137,114,88,143,16
Not the best way, but you can:
for title, value in dict_out.items():
    m.write('{},'.format(title))
    m.write('{} \n'.format(value.replace(' ', ',')))
But you should definitely use the csv writer:
import csv
import output

fill = input("Enter File name:")
f = open(fill)
csv_f = csv.reader(f)
c = open('data.csv', "w")
m = csv.writer(c)
dict_out = {}
for row in csv_f:
    if row[1] in dict_out:
        dict_out[row[1]].append(row[3])
    else:
        dict_out[row[1]] = [row[3]]
for title, value in dict_out.items():
    m.writerow([title] + value)
c.close()
If value is a string then you need to use value.split(). If it is already a list then you don't need to use the split method.
with open('data.csv', "w") as m:
for title, value in dict_out.items():
m.write(title + "," + ",".join(value.split()) + "\n")
