Splitting Text File - Column to Rows in Python

I have a txt file which looks like:
X Y Z I
1 1 1 10
2 2 2 20
3 3 3 30
4 4 4 40
5 5 5 50
6 6 6 60
7 7 7 70
8 8 8 80
9 9 9 90
I want to split the 4th column into 3 rows and export it to a txt file:
10 20 30
40 50 60
70 80 90
This is just an example. In my real case I have to split a column with 675311 values into 16471 rows of 41 values each, so the first 41 values in column "I" will form the first row.

If you use numpy, this is trivial and potentially more flexible:
Edit: added parameters for selecting which column to pick and how many columns the output table will have. You can change it to fit whatever shape you want the output to be.
import numpy as np
datacolumn = 3
outputcolumns = 3
data = np.genfromtxt('path/to/csvfile', skip_header=1)
column = data[:, datacolumn]
reshaped = column.reshape((len(column) // outputcolumns, outputcolumns))
np.savetxt('path/to/newfile',reshaped)
Edit: separated out comments from code for readability. Here's what each line does:
# Parse CSV file with header
# Extract 4th column
# Reshape column into new matrix
# Save matrix to text file
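A self-contained sketch of the same reshape, run on the sample data from the question. The in-memory `text`, the temporary output file and the `fmt='%d'` argument are assumptions added for illustration; they are not part of the answer above. Passing `-1` as the row count lets NumPy infer it, so for the real data (675311 = 16471 × 41) `reshape((-1, 41))` should give 16471 rows.

```python
import io
import os
import tempfile

import numpy as np

# Sample input from the question, held in memory for illustration.
text = """X Y Z I
1 1 1 10
2 2 2 20
3 3 3 30
4 4 4 40
5 5 5 50
6 6 6 60
7 7 7 70
8 8 8 80
9 9 9 90
"""

data = np.genfromtxt(io.StringIO(text), skip_header=1)
column = data[:, 3]                 # 4th column ("I")
reshaped = column.reshape((-1, 3))  # -1: let NumPy infer the number of rows

# Write to a temporary file and read it back to inspect the result.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'newfile.txt')
    np.savetxt(path, reshaped, fmt='%d')  # integers instead of floats
    with open(path) as f:
        result = f.read()

print(result)
```

Note that `reshape` raises a `ValueError` if the column length is not a multiple of the requested row width, which is a useful sanity check on the input.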

with open('in.txt', 'r') as f:
    next(f)  # skip header
    l = [x.split()[-1] for x in f]
print([l[x:x+3] for x in range(0, len(l), 3)])
[['10', '20', '30'], ['40', '50', '60'], ['70', '80', '90']]

I first build a list of all the numbers to write to the output file. Then, with the output file open, I loop over that list by index, because a new line is needed after every third value. The local variable j is i + 1; since i starts at 0, j is a multiple of 3 on every third iteration. When it is, I write a newline character; otherwise I write a space.
nums = []
with open('input.txt', 'r') as f:
    next(f)  # skip the header line ("X Y Z I")
    for line in f:
        s = line.split(' ')
        num = s[3]
        nums.append(num)
with open('output.txt', 'w') as f:
    for i in range(0, len(nums)):
        num = nums[i].strip('\n')
        f.write(num)
        j = i + 1
        if j % 3 == 0:
            f.write('\n')
        else:
            f.write(' ')
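The same grouping can also be sketched with list slicing instead of a modulo check. The helper name `column_to_rows` and its parameters are hypothetical, and the sample lines are generated in memory rather than read from `input.txt`:

```python
def column_to_rows(lines, col=3, width=3):
    """Take field `col` of every non-empty line and group the values into rows of `width`."""
    values = [line.split()[col] for line in lines if line.strip()]
    return [values[i:i + width] for i in range(0, len(values), width)]


# Data lines from the question (header already removed), built for illustration.
sample = [f"{n} {n} {n} {n * 10}" for n in range(1, 10)]

rows = column_to_rows(sample)
output = '\n'.join(' '.join(row) for row in rows) + '\n'
print(output)
```

For the real file, `width=41` would be the only change; if the number of values is not a multiple of `width`, the last row is simply shorter.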

Related

how to write a list in a file with a specific format?

I have a Python list and want to write it out in a special format.
input:
trend_end= ['skill1',10,0,13,'skill2',6,1,0,'skill3',5,8,9,'skill4',9,0,1]
I want to write a file like this:
output:
1 2 3
1 10 0 13
2 6 1 0
3 5 8 9
4 9 0 1
Basically, I need to do the following steps:
Separate elements of the list for each skill.
Write them in a table shape, add indices of columns and rows.
I wanna use it as an input of another software. That's why I wanna write a file.
I did this but I know it is wrong, can you see how I can fix it?
f1 = open("data.txt", "a")
for j in trend_end:
f1.write(str(j))
for i in range(1,int(len(trend_end)/df1ana.shape[0])):
G=[trend_end[i*(df1ana.shape[0]-10)- (df1ana.shape[0]-10):i*(df1ana.shape[0]-10)]]
for h in G:
f1.write(i)
f1.write(h)
f1.write('\n')
f.close()
df1ana.shape[0] is 3 in the above example. It is basically the length of data for each skill
Another option that you can try via pandas:
import pandas as pd
pd.DataFrame([trend_end[i+1:i+4] for i in range(0,len(trend_end),4)]).to_csv('data.txt', sep='\t')
OUTPUT:
0 1 2
0 10 0 13
1 6 1 0
2 5 8 9
3 9 0 1
You should iterate over the list in steps of 4, i.e. df1ana.shape[0]+1
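steps = df1ana.shape[0] + 1
with open("data.txt", "a") as f:
    # write header line: a blank index field, then the column numbers
    f.write(' ' * 3 + ''.join(f"{c:<3}" for c in range(1, steps)) + '\n')
    # i points at the first value after each skill name; row counts 1, 2, 3, ...
    for row, i in enumerate(range(1, len(trend_end), steps), start=1):
        f.write(f"{row:<3}")
        for j in range(i, i + steps - 1):
            f.write(f"{trend_end[j]:<3}")
        f.write("\n")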
The :<3 formatting puts each value in a 3-character, left-aligned field.
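To see what the `:<3` format spec produces, here is a minimal sketch that builds the table as a string instead of writing a file. The variable `group` stands in for `df1ana.shape[0]` (3 in the question) and is an assumption, as are the names `lines` and `table`:

```python
trend_end = ['skill1', 10, 0, 13, 'skill2', 6, 1, 0,
             'skill3', 5, 8, 9, 'skill4', 9, 0, 1]
group = 3  # values per skill; stands in for df1ana.shape[0]

# Header: an empty 3-character index field followed by the column numbers.
lines = [' ' * 3 + ''.join(f"{c:<3}" for c in range(1, group + 1))]

# Each record is one skill name followed by `group` values.
for row, start in enumerate(range(1, len(trend_end), group + 1), start=1):
    values = trend_end[start:start + group]
    lines.append(f"{row:<3}" + ''.join(f"{v:<3}" for v in values))

# Drop the trailing padding of the last field on each line.
table = '\n'.join(line.rstrip() for line in lines)
print(table)
```

Values longer than three characters would push the following columns to the right, so a wider field (for example `:<6`) may be needed for real data.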
This should work regardless of the number of groups or the number of records per group. It uses the difference in the size of the full list compared to the integer only list to calculate the number of rows you should have, and uses the ratio of the number of integers over the number of rows to get the number of columns.
import numpy as np
import pandas as pd
digits = [x for x in trend_end if isinstance(x, int)]
rows = len(trend_end) - len(digits)  # one string label per group
pd.DataFrame(np.reshape(digits, (rows, len(digits) // rows))).to_csv('output.csv')

How to read after a space until the next space in Python

I have this program:
import sys
import itertools
from itertools import islice
fileLocation = input("Input the file location of ScoreBoard: ")
input1 = open(fileLocation, "rt")
amountOfLines = 0
for line in open('input1.txt').readlines( ):
amountOfLines += 1
timestamps = [line.split(' ', 1)[0][0:] for line in islice(input1, 2, amountOfLines)]
teamids = [line.split(' ', 1)[0][0:] for line in islice(input1, 2, amountOfLines)]
print(teamids)
and this text file:
1
5 6
1 5 1 5 0
1 4 1 4 1
2 1 2 1 1
2 2 3 1 1
3 5 2 1 1
4 4 5 4 1
For teamids, I want to read from just after the first space up to the next space, starting from the second line. I have already managed to start from the second line, but I don't see how to read the text between the first and second spaces. For timestamps I have managed this, but only from the first character to the first space, and I don't know how to do the same for teamids. Any help would be much appreciated.
Here's one suggestion showcasing a nice use case of zip to transpose your array:
lines = open(fileLocation, 'r').readlines()[2:]
array = [[int(x) for x in line.split()] for line in lines]
transpose = list(zip(*filter(None, array)))
# now we can do this:
timestamps = transpose[0] # (1, 1, 2, 2, 3, 4)
teamids = transpose[1] # (5, 4, 1, 2, 5, 4)
This exploits the fact that zip(*some_list) returns the transpose of some_list.
Be aware that the number of columns you get equals the length of the shortest row, which is one reason why I included the call to filter: it removes empty rows caused by empty lines.
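A small in-memory sketch of both points, the transpose and the truncation to the shortest row. The `rows` list is made-up sample data standing in for the parsed file:

```python
rows = [[1, 5, 1, 5, 0],
        [1, 4, 1, 4, 1],
        [],                # an empty row, as produced by a blank line
        [2, 1, 2, 1, 1]]

# filter(None, rows) drops the empty row before transposing.
columns = list(zip(*filter(None, rows)))
timestamps, teamids = columns[0], columns[1]
print(timestamps)  # (1, 1, 2)
print(teamids)     # (5, 4, 1)

# Without the filter, the empty row truncates every column to length 0.
unfiltered = list(zip(*rows))
print(unfiltered)  # []
```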

How to filter column values from a file and write in a new file in python

I have a .txt file with columns
#x y z
1 4 6
2 5 6
3 6 8
4 8 8
5 7 8
6 7 8
The first column is sorted in an ascending order. I want to filter the first column x for values between 2 and 6 and then create a new file with corresponding y and z columns
So the output file looks like:
# x y z
3 6 8
4 8 8
5 7 8
These simple lines filter the x column, but how do I get the corresponding other columns to write to a new file?
x=x[np.where(x>2)]
print x
x=x[np.where(x<6)]
print x
Your help is very much appreciated.
You can use np.where to get the indices of the entries that satisfy the condition and then save only those rows to file:
import numpy as np
data_in = np.loadtxt('xyz.txt', dtype = int)
idx = np.where(np.logical_and(data_in[:,0]>2, data_in[:,0]<6))[0]
np.savetxt('xyz_filtered.txt', data_in[idx,:], fmt = '%d')
This assumes that you want all your data as integers. A header line starting with #, like the one in your file, is skipped automatically because np.loadtxt treats # as a comment character by default; any other necessary changes should not affect the program much.
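The same filter can also be written with a boolean mask instead of np.where. This is a sketch on the sample data from the question, read from an in-memory string (the `text` and `mask` names are assumptions for illustration):

```python
import io

import numpy as np

text = """#x y z
1 4 6
2 5 6
3 6 8
4 8 8
5 7 8
6 7 8
"""

# Lines starting with '#' are treated as comments and skipped by default.
data = np.loadtxt(io.StringIO(text), dtype=int)

x = data[:, 0]
mask = (x > 2) & (x < 6)  # True for rows whose x lies strictly between 2 and 6
filtered = data[mask]     # keeps the matching rows with all their columns
print(filtered)
```

Indexing with the mask selects whole rows, so the y and z values stay attached to their x value.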
I am not sure if this is exactly what you want but filtering the array that you get from the input file is a viable option. Here's the code:
filename = 'table.txt'
with open(filename, mode='rt') as file:
    # skip blank lines and comment/header lines such as "#x y z"
    table = [[int(n) for n in line.split()] for line in file
             if line.strip() and not line.startswith('#')]
predicate = lambda l: 2 < l[0] < 6
table = filter(predicate, table)
with open('output.txt', mode='wt') as file:
    for row in table:
        line = ' '.join(map(str, row))
        file.write(line + '\n')

Python csv; get max length of all columns then lengthen all other columns to that length

I have a directory full of data files in the following format:
4 2 5 7
1 4 9 8
8 7 7 1
  4 1 4
    1 5
    2 0
    1 0
    0 0
    0 0
They are separated by tabs. The third and fourth columns contain useful information until they reach 'zeroes'.. At which point, they are arbitrarily filled with zeroes until the end of file.
I want to get the length of the longest column where we do not count the 'zero' values on the bottom. In this case, the longest column is column 3 with a length of 7 because we disregard the zeros at the bottom. Then I want to transform all the other columns by packing zeroes on them until their length is equal to the length of my third column (besides column 4 b/c it is already filled with zeroes). Then I want to get rid of all the zeros beyond my max length in all my columns.. So my desired file output will be as follows:
4 2 5 7
1 4 9 8
8 7 7 1
0 4 1 4
0 0 1 5
0 0 2 0
0 0 1 0
These files consist of ~ 100,000 rows each on average... So processing them takes a while. Can't really find an efficient way of doing this. Because of the way file-reading goes (line-by-line), am I right in assuming that in order to find the length of a column, we need to process in the worst case, N rows? Where N is the length of the entire file. When I just ran a script to print out all the rows, it took about 10 seconds per file... Also, I'd like to modify the file in-place (over-write).
Hi I would use Pandas and Numpy for this:
import pandas as pd
import numpy as np
df = pd.read_csv('csv.csv', delimiter='\t', header=None)  # the data has no header row
df = df.replace(0, np.nan)
while df.tail(1).isnull().all().all():
    df = df[0:len(df)-1]
df = df.fillna(0).astype(int)  # back to integers after the NaN round trip
df.to_csv('csv2.csv', sep='\t', index=False, header=False)  # I used a different name just for testing
You create a DataFrame with your csv data.
There are a lot of built in functions that deal with NaN values, so change all 0s to nan.
Then start at the end tail(1) and check if the row is all() NaN. If so copy the DF less the last row and repeat.
I did this with 100k rows and it takes only a few seconds.
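If the row-by-row loop turns out to be slow on long runs of trailing zeros, the cut-off can be found in one vectorized step. This is a sketch on an in-memory array, assuming the empty fields have already been filled with zeros; the names `nonzero_rows`, `last` and `trimmed` are illustrative:

```python
import numpy as np

# Sample data from the question with empty fields already replaced by 0.
data = np.array([[4, 2, 5, 7],
                 [1, 4, 9, 8],
                 [8, 7, 7, 1],
                 [0, 4, 1, 4],
                 [0, 0, 1, 5],
                 [0, 0, 2, 0],
                 [0, 0, 1, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])

# Indices of rows that contain at least one nonzero value.
nonzero_rows = np.flatnonzero(data.any(axis=1))

# Keep everything up to and including the last such row.
last = nonzero_rows[-1] + 1 if nonzero_rows.size else 0
trimmed = data[:last]
print(trimmed.shape)  # (7, 4)
```

This still has to look at every row once, so the worst case remains proportional to the file length, as the question suspects.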
Here are two ways to do it:
# Read in the lines and fill in the zeroes
with open('input.txt') as input_file:
    data = [[item.strip() or '0'
             for item in line.split('\t')]
            for line in input_file]
# Delete lines near the end that are only zeroes
while set(data[-1]) == {'0'}:
    del data[-1]
# Write out the lines
with open('output.txt', 'wt') as output_file:
    output_file.writelines('\t'.join(line) + '\n' for line in data)
Or
with open('input.txt') as input_file:
    with open('output.txt', 'wt') as output_file:
        for line in input_file:
            line = line.split('\t')
            line = [item.strip() or '0' for item in line]
            if all(item == '0' for item in line):
                break
            output_file.write('\t'.join(line))
            output_file.write('\n')

Assigning list to an array in python

I have a data file "list_2_array.dat" as shown below. First, I want to read it and then I want to take control over fourth column elements for further mathematical operations.
1 2 3 10
4 5 6 20
1 3 5 30
2 1 4 40
3 2 3 50
I tried following piece of code
b_list = []
file=open('/path_to_file/list_2_array.dat', 'r')
m1=[(i.strip()) for i in file]
for j in m1:
b_list.append(j.replace('\n','').split(' '))
for i in range(5):
print b_list[i][3]
which gives output
10
20
30
40
50
I don't want to print the elements; I am interested in first assigning the fourth column elements to a 1-D array so that I can easily process them later. I tried several ways to do this, one of which is shown below, but none of them worked:
import numpy as np
for i in range(5):
arr = array (b_list[i][3])
import numpy as np
f = open('/path_to_file/list_2_array.dat', 'r')
l = []
for line in f.readlines():
    l.append(int(line.strip().split()[-1]))
array = np.array(l)
or more pythonic I guess..:
f=open('/path_to_file/list_2_array.dat', 'r')
l = [int(line.strip().split()[-1]) for line in f.readlines()]
array=np.array(l)
data = """1 2 3 10
4 5 6 20
1 3 5 30
2 1 4 40
3 2 3 50"""
fourth = [int(line.split()[3]) for line in data.split("\n")]
print(fourth)
Output:
[10, 20, 30, 40, 50]
def get_last_col(file):
    last_col = [int(line.split()[-1]) for line in open(file)]
    return last_col
First of all, never use names like str, file, or int for your variables.
Next, you were nearly there:
b_list = []
c_list = []
file = open('/path_to_file/list_2_array.dat', 'r')
m1 = [(i.strip()) for i in file]
for j in m1:
    b_list.append(j.replace('\n', '').split(' '))
for i in range(5):
    c_list.append(b_list[i][3])
print(c_list)
I don't really like this solution, so I adapted @user2994666's solution:
file_location = "/path_to_file/list_2_array.dat"
def get_last_col(file_location):
    last_col = [int(line.split()[-1]) for line in open(file_location)]
    return last_col
print(get_last_col(file_location))
Note that the [-1] solution yields the last column, which is fine in your case. If you have a file with 5 columns and are still interested in the 4th, use [3] instead of [-1].
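Since the goal is a NumPy array, another option is to let np.loadtxt pick the column directly through its usecols parameter. This is a sketch on the sample data from the question, read from an in-memory string rather than from list_2_array.dat (the `text` and `fourth` names are assumptions):

```python
import io

import numpy as np

text = """1 2 3 10
4 5 6 20
1 3 5 30
2 1 4 40
3 2 3 50
"""

# usecols=(3,) selects the 4th column by position, regardless of how many columns follow it.
fourth = np.loadtxt(io.StringIO(text), usecols=(3,), dtype=int)
print(fourth)      # [10 20 30 40 50]
print(fourth * 2)  # element-wise arithmetic works directly on the array
```

With a real file, the `io.StringIO(text)` argument would be replaced by the file path.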
