I'm trying to edit some data in a txt file, but the file is written such that there are some rows with more columns than others. Example:
1 0.0 0.
2 0.25 0.
3 0.50 0. 13 1 0.2 0.
14 2.625 0.
15 2.800 0. 20 1 0.2
21 4.05 0.
22 4.2 0. 24 1 0.2
25 4.75 0.
26 4.90
27 5.05
28 5.15
29 5.25
As can be seen, there are sections with multiple spaces, and some rows have 7 columns instead of 3.
I want to take each value from the second column (0.0, 0.25, etc) and from the sixth column (0.2, 0.2, etc) and perform basic multiplication and division on each. So for example, in the second row, I want to take 0.25 and multiply it by 25.4.
I tried to read the file and break it into a list:
g = open("myfile.txt","r+")
lines = g.read().split(' ')
while('' in lines):
    lines.remove('')
This gives the output
['1', '0.0', '0.\n', '2', '0.25', '0.\n', '3', '0.50', '0.', '13', '1', '0.2',
'0.\n', '14', '2.625', '0.\n', '15', '2.800', '0.', '20', '1', '0.2\n', '21',
'4.05', '0.\n', '22', '4.2', '0.', '24', '1', '0.2\n', '25', '4.75', '0.\n', '26',
'4.90\n', '27', '5.05\n', '28', '5.15\n', '29', '5.25\n\n']
(The second \n at the end is because there are empty rows to space each section of this data). I then tried to use a loop and counter to define where each item in the list is in the table:
counter = 0
for i in lines:
    if '\n' in lines[i]:
        counter = 0
    elif counter == 1 or counter == 5:
        lines[i] = float(lines[i])*25.4
    counter += 1
From this, I end up with the error:
TypeError: list indices must be integers or slices, not str
Any ideas on what I could do that would work, and potentially be more elegant?
A possible solution for your problem, if I understood it correctly:
Use the with open() as f syntax to make sure your file is closed after the scope ends.
lines = list()
with open('my_file.txt', 'r') as f:
    for line in f.readlines():
        line = line.strip()  # clean possible additional spaces just to be sure
        lines.append(line.split())
To multiply the values you want in place (the fields are strings, so convert them first, and guard against the short rows that have no sixth column):
for line in lines:
    if len(line) > 1:
        line[1] = float(line[1]) * 25.4
    if len(line) > 5:
        line[5] = float(line[5]) * 25.4
Hope it helped.
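If it helps, here is a minimal end-to-end sketch of the same idea that also writes the result back out. It assumes the whitespace-delimited myfile.txt and the 25.4 factor from the question; the output file name is my own invention:

SCALE = 25.4  # the factor from the question

converted = []
with open('myfile.txt', 'r') as f:
    for line in f:
        fields = line.split()  # split() with no argument collapses runs of spaces
        if not fields:  # keep the empty separator rows
            converted.append('')
            continue
        if len(fields) > 1:
            fields[1] = str(float(fields[1]) * SCALE)  # second column
        if len(fields) > 5:
            fields[5] = str(float(fields[5]) * SCALE)  # sixth column
        converted.append(' '.join(fields))

with open('myfile_converted.txt', 'w') as f:  # hypothetical output name
    f.write('\n'.join(converted) + '\n')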
I wish to merge multiple files with a single file (f1.txt) based on matches in 2 columns. I can do it in pandas, but it reads everything into memory, which can get big really fast. I am thinking a line-by-line read will avoid that; pandas is also not an option right now. How do I perform the operation while filling in null for cells where a match with f1.txt does not occur?
Here I used a dictionary, which I am not sure will fit in memory, and I also can't find a way to add null where there is no match in the other files with f1.txt. There could be as many as 1000 other files. Time does not matter as long as I do not read everything into memory.
FILES (tab-delimited)
f1.txt
A B num val scol
1 a1 1000 2 3
2 a2 456 7 2
3 a3 23 2 7
4 a4 800 7 3
5 a5 10 8 7
a1.txt
A B num val scol fcol dcol
1 a1 1000 2 3 0.2 0.77
2 a2 456 7 2 0.3 0.4
3 a3 23 2 7 0.5 0.6
4 a4 800 7 3 0.003 0.088
a2.txt
A B num val scol fcol2 dcol1
2 a2 456 7 2 0.7 0.8
4 a4 800 7 3 0.9 0.01
5 a5 10 8 7 0.03 0.07
Current Code
import os
import csv

m1 = os.getcwd() + '/f1.txt'
files_to_compare = [i for i in os.listdir('dir')]
dictionary = dict()
dictionary1 = dict()
with open(m1, 'rt') as a:
    reader1 = csv.reader(a, delimiter='\t')
    for x in files_to_compare:
        with open(os.getcwd() + '/dir/' + x, 'rt') as b:
            reader2 = csv.reader(b, delimiter='\t')
            for row1 in list(reader1):
                dictionary[row1[0]] = list()
                dictionary1[row1[0]] = list(row1)
            for row2 in list(reader2):
                try:
                    dictionary[row2[0]].append(row2[5:])
                except KeyError:
                    pass
print(dictionary)
print(dictionary1)
What I am trying to achieve is similar to using: df.merge(df1, on=['A','B'], how='left').fillna('null')
Current result
{'A': [['fcol1', 'dcol1'], ['fcol', 'dcol']], '1': [['0.2', '0.77']], '2': [['0.7', '0.8'], ['0.3', '0.4']], '3': [['0.5', '0.6']], '4': [['0.9', '0.01'], ['0.003', '0.088']], '5': [['0.03', '0.07']]}
{'A': ['A', 'B', 'num', 'val', 'scol'], '1': ['1', 'a1', '1000', '2', '3'], '2': ['2', 'a2', '456', '7', '2'], '3': ['3', 'a3', '23', '2', '7'], '4': ['4', 'a4', '800', '7', '3'], '5': ['5', 'a5', '10', '8', '7']}
Desired result
{'A': [['fcol1', 'dcol1'], ['fcol', 'dcol']], '1': [['0.2', '0.77'],['null', 'null']], '2': [['0.7', '0.8'], ['0.3', '0.4']], '3': [['0.5', '0.6'],['null', 'null']], '4': [['0.9', '0.01'], ['0.003', '0.088']], '5': [['null', 'null'],['0.03', '0.07']]}
{'A': ['A', 'B', 'num', 'val', 'scol'], '1': ['1', 'a1', '1000', '2', '3'], '2': ['2', 'a2', '456', '7', '2'], '3': ['3', 'a3', '23', '2', '7'], '4': ['4', 'a4', '800', '7', '3'], '5': ['5', 'a5', '10', '8', '7']}
My final intent is to write the dictionary to a text file. I do not know how much memory will be used or whether it will even fit in memory. If there is a better way without using pandas, that would be nice; otherwise, how do I make the dictionary approach work?
DASK ATTEMPT:
import dask.dataframe as dd

directory = 'input_dir/'
first_file = dd.read_csv('f1.txt', sep='\t')
df = dd.read_csv(directory + '*.txt', sep='\t')
df2 = dd.merge(first_file, df, on=['A', 'B'])
I kept getting ValueError: Metadata mismatch found in 'from_delayed'
+--------+-------+----------+
| column | Found | Expected |
+--------+-------+----------+
| fcol   | int64 | float64  |
+--------+-------+----------+
I googled and found similar complaints but could not fix it, which is why I decided to try the dictionary approach. I checked my files and all dtypes seem to be consistent. My version of dask was 2.9.1.
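As an aside on the dask error: "Metadata mismatch" usually means dask inferred different dtypes for the same column in different files. A sketch of the usual workaround, forcing the column named in the error message to one dtype; this is not a verified fix for this case, and dask still cannot reconcile files whose extra columns have different names, as a1.txt and a2.txt do:

import dask.dataframe as dd

# 'fcol' comes from the error message; add any other mismatching columns
first_file = dd.read_csv('f1.txt', sep='\t')
df = dd.read_csv('input_dir/*.txt', sep='\t', dtype={'fcol': 'float64'})
df2 = dd.merge(first_file, df, on=['A', 'B'])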
If you want a hand-made solution, you can look at heapq.merge and itertools.groupby. This assumes your files are sorted by the first two columns (the key).
I made a simple example that merges and groups the files and produces two files instead of dictionaries, so (almost) nothing is stored in memory; everything is read from and written to disk:
from heapq import merge
from itertools import groupby

first_file_name = 'f1.txt'
other_files = ['a1.txt', 'a2.txt']

def get_lines(filename):
    # yield every line as [filename, col1, col2, ...] so we can tell
    # later which file a merged line came from
    with open(filename, 'r') as f_in:
        for line in f_in:
            yield [filename, *line.strip().split()]

def get_values(lines):
    # yield the real lines first, then 'null' placeholders forever, so
    # the caller can keep calling next() after the matches run out
    for line in lines:
        yield line
    while True:
        yield ['null']

opened_files = [get_lines(f) for f in [first_file_name] + other_files]
# save headers
headers = [next(f) for f in opened_files]

with open('out1.txt', 'w') as out1, open('out2.txt', 'w') as out2:
    # print headers to files
    print(*headers[0][1:6], sep='\t', file=out1)
    new_header = []
    for h in headers[1:]:
        new_header.extend(h[6:])
    print(*(['ID'] + new_header), sep='\t', file=out2)

    # merge the sorted files on columns A and B, then group rows with equal keys
    for v, g in groupby(merge(*opened_files, key=lambda k: (k[1], k[2])), lambda k: (k[1], k[2])):
        lines = [*g]
        print(*lines[0][1:6], sep='\t', file=out1)
        out_line = [lines[0][1]]
        iter_lines = get_values(lines[1:])
        current_line = next(iter_lines)
        for current_file in other_files:
            if current_line[0] == current_file:
                out_line.extend(current_line[6:])
                current_line = next(iter_lines)
            else:
                out_line.extend(['null', 'null'])
        print(*out_line, sep='\t', file=out2)
Produces two files:
out1.txt:
A B num val scol
1 a1 1000 2 3
2 a2 456 7 2
3 a3 23 2 7
4 a4 800 7 3
5 a5 10 8 7
out2.txt:
ID fcol dcol fcol2 dcol1
1 0.2 0.77 null null
2 0.3 0.4 0.7 0.8
3 0.5 0.6 null null
4 0.003 0.088 0.9 0.01
5 null null 0.03 0.07
One crucial step in my project is to track the absolute difference of values in a column of a pandas dataframe for subsamples.
I managed to write a for-loop to create my subsamples: I select every person and go through every year this person is observed. I also accessed the index of each group's first element and compared it to each one's second element.
Here is my MWE data:
df = pd.DataFrame({'year': ['2001', '2004', '2005', '2006', '2007', '2008', '2009',
'2003', '2004', '2005', '2006', '2007', '2008', '2009',
'2003', '2004', '2005', '2006', '2007', '2008', '2009'],
'id': ['1', '1', '1', '1', '1', '1', '1',
'2', '2', '2', '2', '2', '2', '2',
'5', '5', '5','5', '5', '5', '5'],
'money': ['15', '15', '15', '21', '21', '21', '21',
'17', '17', '17', '20', '17', '17', '17',
'25', '30', '22', '25', '8', '7', '12']}).astype(int)
Here is my code:
# do it for all IDs in my dataframe
for i in df.id.unique():
    # now check every given year for that particular ID
    for j in df[df['id']==i].year:
        # access the index of the first element of that ID, as integer
        index = df[df['id']==i].index.values.astype(int)[0]
        # use that index to calculate absolute difference of the first and second element
        abs_diff = abs( df['money'].iloc[index] - df['money'].iloc[index+1] )
        # print all the changes, before further calculations
        index =+1
        print(abs_diff)
My index is not updating. It yields 0000000 0000000 5555555 (3 x 7 changes), but it should show 0,0,0,6,0,0,0 0,0,0,3,3,0,0 0,5,8,3,17,1,5 (3 x 7 changes). Since either the first or the last element has no change, I added a 0 in front of each group.
Solution: I changed the second loop from a for to a while:
for i in df.id.unique():
    first = df[df['id']==i].index.values.astype(int)[0]   # ID1 = 0
    last = df[df['id']==i].index.values.astype(int)[-1]   # ID1 = 6
    while first < last:
        abs_diff = abs( df['money'][first] - df['money'][first+1] )
        print(abs_diff)
        first += 1
for i in df.id.unique():
    for j in df[df['id']==i].year:
        index = df[(df['id']==i)&(df['year']==j)].index.values[0].astype(int)
        try:
            abs_diff = abs(df['money'].iloc[index] - df['money'].iloc[index+1])
            print(abs_diff)
        except:
            pass
output:
0
0
6
0
0
0
4
0
0
3
3
0
0
8
5
8
3
17
1
5
You're currently always checking the first value of each batch, so you'd need to do:
# do it for all IDs in my dataframe
for i in df.id.unique():
    # now check every given year for that particular ID
    for idx, j in enumerate(df[df['id']==i].year):
        # access the index of the idx-th element of that ID, as integer
        index = df[df['id']==i].index.values.astype(int)[idx]
        # use that index to calculate the absolute difference of this element and the next
        try:
            abs_diff = abs( df['money'][index] - df['money'][index+1] )
        except (KeyError, IndexError):  # the very last row has no successor
            continue
        # print all the changes, before further calculations
        print(abs_diff)
Which outputs:
0
0
6
0
0
0
4
0
0
3
3
0
0
8
5
8
3
17
1
5
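For completeness, pandas can do this without an explicit loop. A vectorized sketch using the MWE df above; fillna(0) plays the role of the 0 prepended to each group, and grouping by id avoids the stray boundary values (the 4 and the first 8 in the output above come from diffing the last row of one id against the first row of the next):

# absolute year-over-year change within each id, 0 for each group's first row
abs_diff = df.groupby('id')['money'].diff().abs().fillna(0).astype(int)
print(abs_diff.tolist())
# [0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 5, 8, 3, 17, 1, 5]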
I have a text file with multiple matrices like this:
4 5 1
4 1 5
1 2 3
[space]
4 8 9
7 5 6
7 4 5
[space]
2 1 3
5 8 9
4 5 6
I want to read this input file in python and store it in multiple matrices like:
matrixA = [...] # first matrix
matrixB = [...] # second matrix
and so on. I know how to read external files in Python, but I don't know how to divide this input file into multiple matrices. How can I do this?
Thank you
You can write code like this:
all_matrices = []  # holds matrixA, matrixB, ...
matrix = []        # holds the current matrix
with open('file.txt', 'r') as f:
    for line in f:
        values = line.split()
        if values:  # if the line contains numbers
            matrix.append(values)
        else:  # if the line contains nothing, add matrix to all_matrices
            all_matrices.append(matrix)
            matrix = []
if matrix:  # append the last matrix if the file doesn't end with a blank line
    all_matrices.append(matrix)
# do whatever you want with all_matrices ...
I am sure the algorithm could be optimized somewhere, but the answer I found is quite simple:
file = open('matrix_list.txt').read()  # open the file
matrix_list = file.split("\n\n")  # split the file into a list of matrices
for i, m in enumerate(matrix_list):
    matrix_list[i] = m.split("\n")  # split each matrix into rows
    for j, r in enumerate(matrix_list[i]):
        matrix_list[i][j] = r.split()  # split each row into values
This will result in the following format:
[[['4', '5', '1'], ['4', '1', '5'], ['1', '2', '3']], [['4', '8', '9'], ['7', '5', '6'], ['7', '4', '5']], [['2', '1', '3'], ['5', '8', '9'], ['4', '5', '6']]]
Example on how to use the list:
print(matrix_list) #prints all matrices
print(matrix_list[0]) #prints the first matrix
print(matrix_list[0][1]) #prints the second row of the first matrix
print(matrix_list[0][1][2]) #prints the value from the second row and third column of the first matrix
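If you need the entries as numbers rather than strings, a small follow-up sketch:

# convert every entry from str to int (use float if the data can be fractional)
matrix_list = [[[int(v) for v in row] for row in m] for m in matrix_list]
print(matrix_list[0][1][2])  # 5, now an int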
I have a file with 4-column data, and I want to prepare a final output file sorted by the first column. The data file (rough.dat) looks like:
1 2 4 9
11 2 3 5
6 5 7 4
100 6 1 2
The code I am using to sort by the first column is:
with open('rough.dat','r') as f:
    lines = [line.split() for line in f]
a = sorted(lines, key=lambda x:x[0])
print a
The result I am getting is strange, and I think I'm doing something silly!
[['1', '2', '4', '9'], ['100', '6', '1', '2'], ['11', '2', '3', '5'], ['6', '5', '7', '4']]
You can see that the first column is not sorted in ascending numeric order; instead, the numbers starting with '1' take priority: a zero after '1', i.e. 100, sorts before 11!
Strings are compared lexicographically (dictionary order):
>>> '100' < '6'
True
>>> int('100') < int('6')
False
Converting the first item to int in the key function will give you what you want.
a = sorted(lines, key=lambda x: int(x[0]))
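A quick check with the data above (assuming rough.dat is exactly as shown):

with open('rough.dat', 'r') as f:
    lines = [line.split() for line in f]
a = sorted(lines, key=lambda x: int(x[0]))
print(a)
# [['1', '2', '4', '9'], ['6', '5', '7', '4'], ['11', '2', '3', '5'], ['100', '6', '1', '2']]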
You are sorting your numbers literally because they are strings, not integers. As a more numpythonic way, you can use np.loadtxt to load your data, then reorder the rows by the first column with argsort:
import numpy as np

array = np.loadtxt('rough.dat')
array = array[array[:, 0].argsort()]  # reorder rows by the first column
print array

[[   1.    2.    4.    9.]
 [   6.    5.    7.    4.]
 [  11.    2.    3.    5.]
 [ 100.    6.    1.    2.]]
I am a complete beginner in Python and have the following problem; I would be glad if you could help me.
I have a *.dat file (let's name it file-1; the first row is just a headline, which I use only here to mark the columns) which looks like:
1 2 3 4 5 6
6 5 -1000 "" "" ""
6 5 -1000 "" "" ""
6 5 -1000 "" "" ""
6 5 -1000 "" "" ""
6 5 -1000 "" "" ""
6 5 -1000 "" "" ""
6 5 -1000 "" "" ""
I need it to be like (file-1 (converted)):
6 5 1 -1000
6 5 1 -1000
6 5 2 -1000
6 5 3 -1000
6 5 3 -1000
6 5 3 -1000
6 5 3 -1000
So, file-1 has 9 rows (7 with information and 2 empty) and 6 columns, and I have to do the following:
Delete the last 3 columns in file-1.
Add 1 new column between columns 2 and 3.
Increase the value of this new column by 1 unit (like '+= 1') after each empty line.
Delete all the empty lines. The result is represented as 'file-1 (converted)'.
I've tried to do this but got stuck. For now I am at this stage:
import sys
import csv

with open("file-1.dat", "r", newline="") as f:
    sys.stdout = open('%s2 (converted).txt' % f.name, 'a')
    incsv = csv.reader(f, delimiter="\t")
    for row in incsv:
        if len(row) == 6:
            i = 0
            row = row[0:3]
            row.insert(2, i)
            print(row)
and it looks like:
['6', '5', 0, '-1000']
['6', '5', 0, '-1000']
['6', '5', 0, '-1000']
['6', '5', 0, '-1000']
['6', '5', 0, '-1000']
['6', '5', 0, '-1000']
['6', '5', 0, '-1000']
I don't know yet how to change the 0 to 1, 2, and so on, so that it looks like:
['6', '5', 0, '-1000']
['6', '5', 0, '-1000']
['6', '5', 1, '-1000']
['6', '5', 2, '-1000']
['6', '5', 2, '-1000']
['6', '5', 2, '-1000']
['6', '5', 2, '-1000']
And the result should be like the 'file-1 (converted)' file.
P.S. All the examples are simplified; the real file has a lot of rows, and I don't know where the empty lines appear.
P.P.S. Sorry for such a long post; I hope it makes sense. Ask or suggest anything; I would be really glad to see other opinions. Thank you.
Seems like you're almost there; you're just inserting i = 0 every time instead of the count of empty rows. Try something like:
import sys
import csv

with open("file-1.dat", "r", newline="") as f:
    sys.stdout = open('%s2 (converted).txt' % f.name, 'a')
    incsv = csv.reader(f, delimiter="\t")
    empties = 0  # init empty row counter
    for row in incsv:
        if len(row) == 6:
            row = row[0:3]
            row.insert(2, empties)  # insert number of empty rows seen so far
            print(row)
        else:
            empties += 1  # if row is empty, increase counter
This is a bit different, without using the csv module. Hope this helps. :)
import sys

count = 0
with open("file-1.dat", "r") as f:
    sys.stdout = open('%s2 (converted).txt' % f.name, 'a')
    for line in f:
        converted_line = line.split()[:-3]  # split each line and remove the last 3 columns
        if not converted_line:  # if the list/line is empty
            count += 1  # increase count but DO NOT PRINT/WRITE TO FILE
        else:
            converted_line.insert(2, str(count))  # insert between 2nd and 3rd column
            print('\t'.join(converted_line))  # join with tab delimiter and print
You need to increment i on every empty line:
import sys
import csv

with open("file-1.dat", "r") as f:
    sys.stdout = open('%s2 (converted).txt' % f.name, 'a')
    incsv = csv.reader(f, delimiter="\t")
    incsv.next()  # ignore first line
    i = 0
    for row in incsv:
        if len(row) == 6:
            row = row[0:3]
            row.insert(2, i)
            print(row)
        elif len(row) == 0:
            i += 1
Also, I couldn't execute your code on my machine (with Python 2.7.6), so I changed it accordingly to run with Python 2.x.
Edit: I see it runs with Python 3.x.
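For reference, a sketch of the same idea for Python 3, where the csv reader has no .next() method; use the built-in next() instead:

import sys
import csv

with open("file-1.dat", "r", newline="") as f:
    sys.stdout = open('%s2 (converted).txt' % f.name, 'a')
    incsv = csv.reader(f, delimiter="\t")
    next(incsv)  # skip the headline row
    i = 0
    for row in incsv:
        if len(row) == 6:
            row = row[0:3]
            row.insert(2, i)
            print(row)
        elif len(row) == 0:
            i += 1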