Replace the data/string inside a text-like file - Python

I am trying to automate some parts of my work. I have an INP file which is text-like (but not a .txt file) and contains both strings and ints/floats. I'd like to replace certain columns, from row 6 to the end, with the values in the output (result) of a loop.
Here's what I want to accomplish for the test.INP:
Keep the first 5 lines and replace the data in columns 3-5 with the corresponding data in result. Ideally the final test.INP file is not newly created; the data is replaced in place.
Because the data to be replaced and the target data in result have the same dimensions, and to skip the first 5 lines, I am trying to define a function that reads the file line by line in reverse and replaces the contents of test.INP.
Python script:
...
with open('test.INP') as j:
    raw = j.readlines()

def replace(raw_line, sep='\t', idx=[2, 3, 4], values=result[-1:]):
    temp = raw[-1].split('\t')
    for i, v in zip(idx, values):
        temp[i] = str(v)
    return sep.join(temp)

raw[::-1] = replace(raw[::-1])
print('\n'.join(raw))
...
test.INP contents (before):
aa bb cc dd
abcd
e
fg
cols1 cols2 cols3 cols4 cols5 cols6
65 69 433 66 72 70b
65 75 323 61 71 68g
61 72 12 57 73 26c
Result contents:
[[329 50 58]
[258 47 66]
[451 38 73]]
My final goal is to get the test.INP below:
test.INP contents (after):
aa bb cc dd
abcd
e
fg
cols1 cols2 cols3 cols4 cols5 cols6
65 69 329 50 58 70b
65 75 258 47 66 68g
61 72 451 38 73 26c
But the code doesn't work as expected; it seems nothing changed in the test.INP file. Any suggestions?
I'm getting an error message; at the bottom it says:
ValueError Traceback (most recent call last)
<ipython-input-1-92f8c1020af3> in <module>
36 temp[i] = str(v)
37 return sep.join(temp)
---> 38 raw[::-1] = replace(raw[::-1])
39 print('\n'.join(raw))
ValueError: attempt to assign sequence of size 100 to extended slice of size 8

I couldn't understand your code, so I built my own version.
Later I understood what you were trying to do - you reverse the lines so you can work from the last one until you have used all the results. The problem is that you forgot the loop that would do that. You run replace only once and send all rows at once, but replace works on a single row and returns a single row - so you end up with one row (with 8 columns) and try to assign it in place of all the rows (probably 100 rows).
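The ValueError itself can be reproduced in isolation (a minimal sketch with made-up sizes): replace returns one joined string, and assigning its 100 characters to an extended slice over 8 list elements fails because extended-slice assignment requires equal lengths:

```python
# Assigning to an extended slice requires a sequence of exactly
# the same length as the slice selects.
raw = list(range(8))       # stand-in for the 8 lines of the file
try:
    raw[::-1] = "x" * 100  # one 100-character string, not 8 lines
except ValueError as e:
    print(e)               # attempt to assign sequence of size 100 to extended slice of size 8
```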
Here is a version that works for me. I put the text directly in the code, but I expect it will also work with text read from a file.
text = '''aa bb cc dd
abcd
e
fg
cols1\tcols2\tcols3\tcols4\tcols5\tcols6
65\t69\t433\t66\t72\t70b
65\t75\t323\t61\t71\t68g
61\t72\t12\t57\t73\t26c'''
results = [[329, 50, 58], [258, 47, 66], [451, 38, 73]]
idx = [2,3,4]
sep = '\t'
print(text)
#with open('test.INP') as j:
#    lines = j.readlines()

# split text to lines
lines = text.splitlines()

def replace(line_list, result_list, idx):
    for i, v in zip(idx, result_list):
        line_list[i] = str(v)
    return line_list

# start at line 5 and pair each line (text) with the values to replace
for line_number, result_as_list in zip(range(5, len(lines)), results):
    # convert line from string to list
    line_list = lines[line_number].split(sep)
    # replace values
    line_list = replace(line_list, result_as_list, idx)
    # convert line from list to string
    lines[line_number] = sep.join(line_list)

# join lines to text
text = '\n'.join(lines)
print(text)

with open('test.INP', 'w') as j:
    j.write(text)

Related

Removing characters from lists in pandas column

I have a pandas DataFrame df with two columns (NACE and cleaned) which looks like this:
NACE cleaned
0 071 [260111, 260112]
1 072 [2603, 2604, 2606, 261610, 261690, 2607, 2608]
2 081 [251511, 251512, 251520, 251611, 251612, 25162]
3 089 [251010, 251020, 2502, 25030010, 251110, 25112]
4 101 [020110, 02012020, 02012030a), 02012050, 020130]
... ... ...
92 324 [95030021, 95030041, 95030049, 95030029, 95030]
93 325 [901841, 90184910, 90184990b), 841920, 90183110]
94 329 [960310, 96039010, 96039091, 96039099, 960321]
95 331 [-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-, 983843]
96 332 [-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-]
The cleaned column consists of lists of strings, some of which still contain characters that need to be removed. Specifically I need to remove all +, -, and ).
To focus on one of these +, I have tried many methods including:
df['cleaned'] = df['cleaned'].str.replace('+', '')
but also:
df.replace('+', '', regex = True, inplace = True)
and a desperate:
for i in df['cleaned']:
    for x in i:
        i.replace('+', '')
Different versions of these solutions work on most dataframes, but not when the column consists of lists.
Just change
for i in df['cleaned']:
    for x in i:
        i.replace('+', '')
to:
for i in df['cleaned']:
    for x in range(len(i)):
        i[x] = i[x].replace('+', '')
it should work - str.replace returns a new string, so the result has to be assigned back into the list.
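A loop-free alternative (a sketch, assuming the cleaned column holds actual Python lists and the stray characters are only +, -, and )) is to map a list comprehension over the column with apply:

```python
import pandas as pd

# toy frame mimicking the structure described in the question
df = pd.DataFrame({
    "NACE": ["071"],
    "cleaned": [["2601+11", "02012030a)", "-+-983843"]],
})

# translation table that deletes '+', '-', and ')'
strip_chars = str.maketrans("", "", "+-)")
df["cleaned"] = df["cleaned"].apply(
    lambda lst: [s.translate(strip_chars) for s in lst]
)
print(df["cleaned"][0])  # ['260111', '02012030a', '983843']
```

The .str accessor and DataFrame.replace operate on scalar cell values, which is why they don't reach inside lists; apply with a per-list comprehension does.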

Adding values from a CSV file

I am beginning to learn python and am struggling with Syntax.
I have a simple CSV file that looks like this
0.01,10,20,0.35,40,50,60,70,80,90,100
2,22,32,42,52,62,72,82,92,102,112
3,33,43,53,63,5647,83,93,103,113,123
I want to look for the highest and lowest value in all the data in the csv file except in the first value of each row.
So effectively the answer here would be
highestValue=5647
lowestValue=0.35
because the data that is looked at is as follows (it ignored the first value of each row)
10,20,0.35,40,50,60,70,80,90,100
22,32,42,52,62,72,82,92,102,112
33,43,53,63,5647,83,93,103,113,123
I would like my code to work for ANY row length.
I really have to admit I'm struggling, but here's what I've tried. I usually program in PHP, so this is all new to me. I have been working on this simple task for a day and can't fathom it out. I think I'm getting confused with terminology - 'lists', for example.
import numpy

test_data_file = open("Anaconda3JamesData/james_test_3.csv", "r")
test_data_list = test_data_file.readlines()
test_data_file.close()

for record in test_data_list:
    all_values = record.split(',')
    maxvalue = numpy.max(numpy.asfarray(all_values[1:]))
    print(maxvalue)
With the test data (the CSV file shown at the very top of this question) I would expect the answer to be
highestValue=5647
lowestValue=0.35
If you're using numpy, you can read your csv file as a numpy.ndarray using numpy.genfromtxt() and then use the array's .max() and .min() methods
import numpy
array = numpy.genfromtxt('Anaconda3JamesData/james_test_3.csv', delimiter=',')
array[:, 1:].max()
array[:, 1:].min()
The [:, 1:] part is using numpy's array indexing. It says: take all the rows (the [:, part), and for each row take all but the first column (the 1:] part). This doesn't work with Python's built-in lists.
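Since the snippet above reads from a file, here is a self-contained illustration of the same [:, 1:] slicing on a small made-up array:

```python
import numpy as np

# two toy rows; column 0 is dropped by the [:, 1:] slice
a = np.array([
    [0.01, 10, 20, 0.35],
    [2,    22, 32, 42],
])
print(a[:, 1:].max())  # 42.0  (column 0 values 0.01 and 2 are ignored)
print(a[:, 1:].min())  # 0.35
```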
You're overwriting maxvalue each time through the loop, so you're just getting the max value from the last line, not the whole file. You need to compare with the previous maximum.
maxvalue = None
for record in test_data_list:
    all_values = record.split(',')
    if maxvalue is None:
        maxvalue = numpy.max(numpy.asfarray(all_values[1:]))
    else:
        maxvalue = max(maxvalue, numpy.max(numpy.asfarray(all_values[1:])))
You do not need the power of numpy for this problem. A simple CSV reader is good enough:
import csv

with open("Anaconda3JamesData/james_test_3.csv") as infile:
    r = csv.reader(infile)
    rows = [list(map(float, line))[1:] for line in r]

max(map(max, rows))
# 5647.0
min(map(min, rows))
# 0.35
I think using numpy is unneeded for this task. First of all, this:
test_data_file = open ("Anaconda3JamesData/james_test_3.csv","r")
test_data_list = test_data_file.readlines()
test_data_file.close()
for record in test_data_list:
can be simplified into this:
with open("Anaconda3JamesData/james_test_3.csv", "r") as test_data_file:
    for record in test_data_file:
We can use a list comprehension to read in all of the values:
with open("Anaconda3JamesData/james_test_3.csv", "r") as test_data_file:
    values = [float(val) for line in test_data_file for val in line.split(",")[1:]]
values now contains all relevant numbers, so we can just do:
highest_value = max(values)
lowest_value = min(values)
Here's a pandas solution that can give the desired results:
import pandas as pd
df = pd.read_csv('test1.csv', header=None)
# df:
# 0 1 2 3 4 5 6 7 8 9 10
# 0 0.01 10 20 0.35 40 50 60 70 80 90 100
# 1 2.00 22 32 42.00 52 62 72 82 92 102 112
# 2 3.00 33 43 53.00 63 5647 83 93 103 113 123
df = df.iloc[:, 1:]
print("Highest value: {}".format(df.values.max()))
print("Lowest value: {}".format(df.values.min()))
#Output:
Highest value: 5647.0
Lowest value: 0.35

How can I extract elements from lists in Python

I am trying to extract elements from list.
I've looked up a lot of data, but I do not know..
this is my test.txt (text file)
[ left in the table = time, right in the table = value ]
0 81
1 78
2 76
3 74
4 81
5 79
6 80
7 81
8 83
9 83
10 83
11 82
.
.
22 81
23 80
If the current time is equal to the time in the table, i want to extract the value of that time.
this is my demo.py (python file)
import datetime

now = datetime.datetime.now()
current_hour = now.hour

with open('test.txt') as f:
    lines = f.readlines()
    time = [int(line.split()[0]) for line in lines]
    value = [int(line.split()[1]) for line in lines]
>>>time = [0,1,2,3,4,5,....,23]
>>>value = [81,78,76,......,80]
You could make a loop where you iterate over the list, looking for the current hour at every position on the list.
Starting at position 0, it will compare it with the current hour. If it's the same value, it will assign the value at the position it was found in "time" to the variable extractedValue, then it will break the loop.
If it isn't the same value, it will increase by 1 the pos variable, which we use to look into the list. So it will keep searching until the first if is True or the list ends.
pos = 0
for i in time:
    if current_hour == time[pos]:
        extractedValue = value[pos]
        break
    else:
        pos += 1
Feel free to ask if you don't understand something :)
Assuming unique values for the time column:
import datetime

with open('test.txt') as f:
    lines = f.readlines()

# this will create a dictionary with the time value from test.txt as the key
time_data_dict = { l.split()[0] : l.split()[1] for l in lines }

# the keys are strings, so convert the hour before looking it up
current_hour = datetime.datetime.now().hour
print(time_data_dict[str(current_hour)])
import datetime
import csv

data = {}
with open('hour.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        k, v = row
        data[k] = v

hour = str(datetime.datetime.now().hour)
print(data[hour])

Adding a number to a column with python

I am very new to python and I would be grateful for some guidance with the following.
I have a text file with over 5 million rows and 8 columns, I am trying to add "15" to each value in column 4 only.
For example:
10 21 34 12 50 111 234 21 7
21 10 23 56 80 90 221 78 90
Would be changed to:
10 21 34 12 **65** 111 234 21 7
21 10 23 56 **95** 90 221 78 90
My script below allows me to isolate the column, but when I try to add any amount to it i return "TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'"
file = open("file.txt")
column = []
for line in file:
    column.append(int(line.split("\t")[3]))
print column
Any advice would be great.
Try this to get you started - there are many better ways using libraries, but this will show you some basic file-handling methods anyway. It works for the data you posted - as long as the delimiter in your files is a double space (" ") and everything can be cast to an int. If not.....
Also - note that the correct way to start a script is with:
if __name__ == "__main__":
This is because you generally won't want any code to execute when you are making a library...
__author__ = 'charlie'

in_filename = "in_file.txt"
out_filename = "out_file.txt"
delimiter = " "

def main():
    with open(in_filename, "r") as infile:
        with open(out_filename, "w") as outfile:
            for line in infile:
                ldata = line.split(delimiter)
                ldata[4] = str(int(ldata[4]) + 15)
                outfile.write(delimiter.join(ldata))

if __name__ == "__main__":
    main()
With Pandas :
import pandas as pd
df = pd.read_clipboard(header=None)
df[4] += 15
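read_clipboard only reads from the clipboard; for the actual file, a sketch along these lines (assuming a single-space delimiter; the in-memory StringIO stands in for the real path) reads, modifies, and writes back:

```python
import io
import pandas as pd

# stand-in for pd.read_csv("file.txt", ...); swap in the real path
data = io.StringIO("10 21 34 12 50 111 234 21 7\n"
                   "21 10 23 56 80 90 221 78 90\n")
df = pd.read_csv(data, sep=" ", header=None)
df[4] += 15  # column index 4 (0-based), i.e. the 5th column
print(df[4].tolist())  # [65, 95]
# df.to_csv("out.txt", sep=" ", header=False, index=False)
```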

Python: Removing duplicates from col 1,2 and printing col 3 values on 1 line

I have a file with AA sequences in column 1, and in column two, the number of times they appear, which I created using Counter(). In column three I have numerical values, which are all different. The items in col 1 and col 2 can be identical.
Ex. Input file:
ADVAEDY 28 0.17805
ADVAEDY 28 0.17365
ADVAEDY 28 0.16951
...
ARYLGYNSNWYPFDY 23 4.16148
ARYLGYNSNWYPFDY 23 3.17716
ARYLGYNSNWYPFDY 23 1.74919
...
ARHLGYNSAWYPFDY 21 10.6038
ARHLGYNSAWYPFDY 21 2.3498
ARHLGYNSAWYPFDY 21 1.68818
...
AGIAFDY 20 0.457553
AGIAFDY 20 0.416321
AGIAFDY 20 0.286349
...
ATIEDH 4 2.45283
ATIEDH 4 0.553351
ATIEDH 4 0.441266
So there are 197 lines in this file. There are only 48 unique AA sequences in col 1. The code that generated this file:
input_fh = sys.argv[1]   # File containing all CDR(x)
cdr_spec = sys.argv[2]   # File containing CDR(x) in one column and specificities in the second

with open(input_fh, "r") as f1:
    cdr = [line.strip() for line in f1]
with open(cdr_spec, "r") as f2:
    cdr_spec_list = [line.strip().split() for line in f2]

cdr_spec_out = open("CDR" + c + "_counts_spec.txt", "w")
counter_cdr = Counter(cdr)
countermc_cdr = counter_cdr.most_common()
print len(countermc_cdr)

#This one might work:
for k, v in countermc_cdr:
    for x, y in cdr_spec_list:
        if k == x:
            print >> cdr_spec_out, k, '\t', v, '\t', y
cdr_spec_out.close()
The output I want to generate, using the example above, removes duplicates in col 1 and 2 but keeps all matching values from col 3 on one line:
ADVAEDY 28 0.17805, 0.17365, 0.16951
...
ARYLGYNSNWYPFDY 23 4.16148, 3.17716, 1.74919
...
ARHLGYNSAWYPFDY 21 10.6038, 2.3498, 1.68818
...
AGIAFDY 20 0.457553, 0.416321, 0.286349
...
ATIEDH 4 2.45283, 0.553351, 0.441266
Also, the comma-separated values in the "new" col 3 need to be in order from largest to smallest. I would prefer to stay away from modules, as I'm still learning Python and the "pythonic" way of doing things.
Any help is appreciated.
What causes the same AA to be printed additional times is the second for loop:
for x, y in cdr_spec_list:
Try loading cdr_spec_list as a dictionary from the start:
from collections import defaultdict

with open(cdr_spec, "r") as f2:
    cdr_spec_dic = defaultdict(list)  # a dictionary with a default value of list
    for ln in f2:
        k, v = ln.strip().split()
        cdr_spec_dic[k].append(v)
Now you have a dictionary mapping each AA sequence to the numerical values you're presenting.
So now we don't need the second for loop, and we can also sort - numerically, largest to smallest - while we're at it:
for k, v in countermc_cdr:
    print >> cdr_spec_out, k, '\t', v, '\t', ', '.join(sorted(cdr_spec_dic[k], key=float, reverse=True))
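Since the question prefers to stay away from modules, the same grouping can also be sketched with a plain dict and setdefault (toy input inline; in the real script the lines would come from the file):

```python
# toy subset of the input file from the question
lines = [
    "ADVAEDY 28 0.17805",
    "ADVAEDY 28 0.16951",
    "ADVAEDY 28 0.17365",
    "ATIEDH 4 0.553351",
    "ATIEDH 4 2.45283",
]

# group col-3 values under the (sequence, count) pair
groups = {}
for ln in lines:
    seq, count, val = ln.split()
    groups.setdefault((seq, count), []).append(val)

for (seq, count), vals in groups.items():
    # sort numerically, largest to smallest, as the question asks
    vals.sort(key=float, reverse=True)
    print(seq, count, ", ".join(vals))
```

Sorting with key=float matters here: the values are strings, and a plain string sort would order "10.6038" before "2.3498".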
