Reading columns of a txt file on python - python

I am working with a .txt file that has 100 rows and 5 columns. I need to divide it into five vectors of length 100, one for each column. I am trying to follow this: Reading specific columns from a text file in python.
However, when I implement it as:
token = open('token_data.txt','r')
linestoken=token.readlines()
resulttoken=[]
for x in linestoken:
    resulttoken.append(x.split(' ')[1])
token.close()
I don't know how this is stored. If I write print('resulttoken'), nothing appears on my screen.
Can someone please tell me what I am doing wrong?
Thanks.
part of my text file

x.split(' ') does not work here, because the columns of your text file are separated by more than one space. Use x.split() with no argument, which splits on any run of whitespace:
token = open('token_data.txt','r')
linestoken=token.readlines()
tokens_column_number = 1
resulttoken=[]
for x in linestoken:
    resulttoken.append(x.split()[tokens_column_number])
token.close()
print(resulttoken)
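To see why the separator matters, here is a minimal sketch. The sample line is made up for illustration and is not taken from the asker's file:

```python
# Hypothetical line whose columns are separated by runs of spaces.
line = "1.0   2.5  3.7\n"

# split(' ') splits on every single space, so runs of spaces yield empty strings
# and the trailing newline stays attached to the last field.
by_single_space = line.split(' ')

# split() with no argument splits on any run of whitespace and drops the newline.
by_whitespace = line.split()

print(by_single_space)
print(by_whitespace)
```

With split(' '), index 1 is an empty string rather than the second column, which is why the original loop did not return the expected values.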

Well, the file looks like it is separated by tabs rather than spaces, so try this:
token = open('token_data.txt','r')
linestoken=token.readlines()
tokens_column_number = 1
resulttoken=[]
for x in linestoken:
    resulttoken.append(x.split('\t')[tokens_column_number])
token.close()
print(resulttoken)
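If the file really is tab-separated, the standard csv module can do the splitting. This is only a sketch: the sample data below is hypothetical and stands in for token_data.txt.

```python
import csv
import io

# Hypothetical tab-separated sample standing in for token_data.txt.
sample = io.StringIO("a\t1\tx\nb\t2\ty\n")

# csv.reader yields one list of fields per line and strips the line ending.
second_column = [row[1] for row in csv.reader(sample, delimiter="\t")]

print(second_column)
```

For a real file, replace the StringIO object with open('token_data.txt', newline='').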

You want a list of five distinct lists, and append to each in turn.
# A comprehension creates five separate lists; [[]] * 5 would repeat the same list five times.
columns = [[] for _ in range(5)]
with open('token_data.txt','r') as token:
    for line in token:
        for field, value in enumerate(line.split()):
            columns[field].append(value)
Now, you will find the first value from the first line in columns[0][0], the second value from the first line in columns[1][0], the first value from the second line in columns[0][1], etc.
To print the value of a variable, don't put quotes around it. Quotes create a literal string.
print(columns[0][0])
prints the value of columns[0][0] whereas
print('columns[0][0]')
simply prints the literal text "columns[0][0]".
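As an alternative sketch of the same column-splitting idea, zip can transpose the rows into columns. The three-line sample is hypothetical and stands in for token_data.txt:

```python
import io

# Hypothetical sample standing in for token_data.txt.
sample = io.StringIO("1 2 3 4 5\n6 7 8 9 10\n11 12 13 14 15\n")

# One list of fields per non-empty line.
rows = [line.split() for line in sample if line.strip()]

# zip(*rows) transposes rows into columns; it stops at the shortest row.
columns = [list(col) for col in zip(*rows)]

print(columns[0])     # first column
print(columns[1][0])  # second value from the first line
```

The values are still strings; convert them with float() or int() if you need numbers.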

You can use the data_py package to read column-wise data in FORTRAN style.
Install this package using
pip install data-py
Usage Example
from data_py import datafile
NoOfLines=0
lineNumber=2 # Line number to read (Excluding lines starting with '#')
df1=datafile("C:/Folder/SubFolder/data-file-name.txt")
df1.separator="," # No need to specify if separator is space(" ") and for 'tab' separated values use '\t'
NoOfLines=df1.lines # Total number of lines in the data file (Excluding lines starting with '#')
[Col1,Col2,Col3,Col4,Col5]=["","","","",""] # Initial values
[Col1,Col2,Col3,Col4,Col5]=df1.read([Col1,Col2,Col3,Col4,Col5],lineNumber)
print(Col1,Col2,Col3,Col4,Col5) # In str format
For details please follow the link https://www.respt.in/p/python-package-datapy.html

Related

Why does this error occur when my text files have clearly more than 1 lines?

I'm a beginner in Python. I've checked my text files, and they definitely have more than one line, so I don't understand why I get this error:
---> 11 Coachid.append(split[1].rstrip())
IndexError: list index out of range
The problem is in these lines:
split=line.split(",")
Coachname.append(split[0].rstrip())
Coachid.append(split[1].rstrip())
The first line assumes that line contains at least one comma, so that after split is called the variable split is a list of length at least two. But if line contains no commas, split has length 1 and Coachid.append(split[1].rstrip()) raises the error you are getting. You need to check the length of split before indexing into it.
Update
Your code should look like (assuming that the correct action is to append an empty string to the Coachid list if it is missing from the input):
split=line.split(",")
split_length = len(split)
Coachname.append(split[0].rstrip())
# append '' if split_length is less than 2:
Coachid.append('' if split_length < 2 else split[1].rstrip())
etc. for the other fields
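The length check can also be wrapped in a small helper. This is a sketch only: split_fields and the sample lines are hypothetical names and data introduced for illustration, and it uses strip() where the original uses rstrip().

```python
def split_fields(line, expected, sep=","):
    """Split line on sep and pad with '' so the result has at least `expected` items."""
    fields = [field.strip() for field in line.split(sep)]
    # Multiplying a list by a negative number gives [], so long lines are left alone.
    fields += [""] * (expected - len(fields))
    return fields

coach_names, coach_ids = [], []
for line in ["Smith,101\n", "Jones\n", "\n"]:
    if not line.strip():
        continue  # skip blank lines, a common cause of this IndexError
    name, coach_id = split_fields(line, 2)[:2]
    coach_names.append(name)
    coach_ids.append(coach_id)

print(coach_names)
print(coach_ids)
```

A line without a comma now produces an empty ID instead of raising IndexError.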
If you want to loop over the lines of a file, you can iterate over the file object directly (f.readlines() also works, but it reads the whole file into memory first):
for line in f:
    ...

Rewriting Single Words in a .txt with Python

I need to create a Database, using Python and a .txt file.
Creating new items is no problem. The contents of the database .txt file look like this:
Index Objektname Objektplace Username
i.e:
1 Pen Office Daniel
2 Saw Shed Nic
6 Shovel Shed Evelyn
4 Knife Room6 Evelyn
I get the index from a QR scanner (OpenCV), and the other information comes from Tkinter Entry widgets. If an object is already saved in the database, it should be possible to rewrite its Objektplace and Username.
My problems are the following:
If I scan the code with index 6, how do I navigate to that entry, even if it is not on line 6, without causing a problem with the Room6?
How do I, for example, replace only the "Shed" of index 4 when that object is moved to, e.g., Room6?
The same goes for the usernames.
So far I have tried different methods, but nothing has worked.
My last attempt looked something like this:
def DBChange():
    #Removes unwanted bits from the scanned code
    data2 = data.replace("'", "")
    Index = data2.replace("b","")
    #Gets the Data from the Entry-Widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    #Adds a whitespace at the end of the Entrys to seperate them
    Userlen = len(User)
    User2 = User.ljust(Userlen)
    Einlagerungsortlen = len(Einlagerungsort)+1
    Einlagerungsort2 = Einlagerungsort.ljust(Einlagerungsortlen)
    #Navigate to the exact line of the scanned Index and replace the words
    #for the place and the user ONLY in this line
    file = open("Datenbank.txt","r+")
    lines=file.readlines()
    for word in lines[Index].split():
        List.append(word)
    checkWords = (List[2],List[3])
    repWords = (Einlagerungsort2, User2)
    for line in file:
        for check, rep in zip(checkWords, repWords):
            line = line.replace(check, rep)
        file.write(line)
    file.close()
    Return()
Thanks in advance
I'd suggest using Pandas to read and write your text file. That way you can just use the index to select the appropriate line. And if there is no specific reason to use your text format, I would switch to csv for ease of use.
import pandas as pd
def DBChange():
    #Removes unwanted bits from the scanned code
    # I haven't changed this part, since I guess you need this for some input data
    data2 = data.replace("'", "")
    # Convert to int so it matches the integer values of the Index column
    Indexnr = int(data2.replace("b",""))
    #Gets the Data from the Entry-Widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    # I removed the lines here. This isn't necessary when using csv and Pandas
    # read in the csv file, using the Index column as the row labels
    df = pd.read_csv("Datenbank.csv", index_col="Index")
    # Select line with index and replace value
    df.loc[Indexnr, 'Username'] = User
    df.loc[Indexnr, 'Objektplace'] = Einlagerungsort
    # Write back to csv
    df.to_csv("Datenbank.csv")
    Return()
Since I can't reproduce your specific problem, I haven't tested it. But something like this should work.
Edit
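To read and write the text file, use ' ' as the separator. (I assume that no value contains spaces, and that your text file uses exactly one space between values.) Passing index_col='Index' keeps Pandas from adding an extra unnamed index column when writing.
Reading:
df = pd.read_csv('Datenbank.txt', sep=' ', index_col='Index')
Writing:
df.to_csv('Datenbank.txt', sep=' ')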
First of all, this is a terrible way to store data. My suggestion is not particularly good code either; don't do this in production!
newlines = []
for line in lines:
    entry = line.split()
    if entry[0] == Index:
        # line is now the correct line
        # Index 2 is the place, index 0 the ID, etc.
        entry[2] = Einlagerungsort2
    # join drops the line ending, so add "\n" again when writing
    newlines.append(" ".join(entry))
# Now write newlines back to the file
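The whole read, modify, and write cycle could be sketched as follows. The function name update_entry and its parameters are hypothetical, and the demo runs on a temporary copy of the sample rows from the question rather than on the real Datenbank.txt:

```python
import os
import tempfile

def update_entry(path, index, place=None, user=None):
    """Rewrite the place and/or user of the row whose first field equals index."""
    with open(path) as f:
        lines = f.read().splitlines()

    found = False
    for i, line in enumerate(lines):
        entry = line.split()
        # Compare the whole first field, so index "6" never matches "Room6".
        if len(entry) == 4 and entry[0] == str(index):
            if place is not None:
                entry[2] = place
            if user is not None:
                entry[3] = user
            lines[i] = " ".join(entry)
            found = True

    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return found

# Demo on a temporary file holding the sample rows from the question.
demo_path = os.path.join(tempfile.mkdtemp(), "Datenbank.txt")
with open(demo_path, "w") as f:
    f.write("1 Pen Office Daniel\n2 Saw Shed Nic\n"
            "6 Shovel Shed Evelyn\n4 Knife Room6 Evelyn\n")

update_entry(demo_path, 6, place="Room6")
with open(demo_path) as f:
    result = f.read().splitlines()
print(result[2])
```

Because the match is made on the first field only, the row's position in the file does not matter, and words that merely contain the index are left untouched.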

How to organise dataFrame like this, in Python:

I have a file which has some information:
1. Movie ID (the first number, before a ":")
2. User ID
3. User Rating
4. Date
All elements are separated by a ",", except the Movie ID, which is followed by a colon.
if I create a dataframe like this:
df=pd.read_csv('combined_data_1.txt',header = None,names=['Movie_ID','User_ID','Rating','Date'])
and print the dataframe, I will get this:
Which is not correct, obviously.
So, if you look at the "Movie_ID" column, in the first row, there is a 1:1488844. Only the number "1" (just before the colon) should be in the "Movie_ID" column, not "1:1488844". The rest (1488844) should be in the User_ID column.
Another problem is that not every row has its correct ID in the "Movie_ID" column. In this case it should be "1" until another movie ID appears, which again will be the first number before a colon.
I know that the IDs of all the movies follow a sequence, that is: 1,2,3,4,...
Another problem I saw is that when I read the file, for some reason a split occurs wherever there is a colon. So after the first row (which doesn't get split), when a colon appears, a row is created whose "Movie_ID" contains only, for example, "2:", unlike the first row.
In the end, I would like to get something like this:
But I don't know how to organise like this.
Thanks for the help!
Use shift with axis=1 and simply modify the columns:
df=df.shift(axis=1)
df['Movie_ID']=df['User_ID'].str[0]
df['User_ID']=df['User_ID'].str[2:]
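And now:
print(df)
gives the desired result, provided every row starts with a single-character movie ID followed by a colon, which is what .str[0] and .str[2:] assume.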
I believe the issue comes from how your data is stored and therefore parsed: the Movie ID is separated by a : (colon) rather than a , (comma), as a CSV would need.
If you can preprocess the text so that it is delimited by commas exclusively before it is opened as a CSV, you may be able to eliminate this issue. I only note this because read_csv with its default C engine expects a single delimiter; a regular-expression separator requires engine='python'.
Here is what I came up with to split the data by colon and comma. While I know this isn't your ultimate goal, hopefully it gets you on the right path.
import pandas as pd
with open("combined_data_1.txt") as file:
    lines = file.readlines()
#Splitting the data into a list delineated by colons
data = []
for line in lines:
    if(":" in line):
        data.append([])
    else: #Using else here prevents the line containing the colon from being saved.
        data[len(data)-1].append(line)
for x in range(len(data)):
    print("Section " + str(x+1) + ":\n")
    print(str(data[x]) + "\n\n")
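Carrying the movie ID forward to every following row could be sketched like this. The file layout is an assumption based on the question: a line containing "N:" starts movie N, and the record may follow the colon on the same line or on the next lines. parse_ratings and all sample values except 1488844 are hypothetical.

```python
import io

def parse_ratings(lines):
    """Return (movie_id, user_id, rating, date) tuples, carrying the movie ID forward."""
    records = []
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if ":" in line:
            # Text before the colon is a new movie ID; anything after it is a record.
            movie_id, _, line = line.partition(":")
            if not line:
                continue
        user_id, rating, date = line.split(",")
        records.append((int(movie_id), int(user_id), int(rating), date))
    return records

# Hypothetical sample covering both "1:record" and a bare "2:" line.
sample = io.StringIO(
    "1:1488844,3,2005-09-06\n"
    "822109,5,2005-05-13\n"
    "2:\n"
    "2059652,4,2005-09-05\n"
)
records = parse_ratings(sample)
for record in records:
    print(record)
```

The resulting list of tuples can be passed to pd.DataFrame(records, columns=['Movie_ID','User_ID','Rating','Date']) if a DataFrame is needed.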

Copy number file format issue (Need to modify the structure)

I have a file in a special format .cns,which is a segmented file used to analyze copy number. It is a text file, that looks like this (first line plus header):
head -2 copynumber.cns
chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067
We transformed it to a .csv so we could separate it by tab (but it didn't work well). The .cns is separated by commas but genes are a single string delimited by quotes. I hope this is useful. The output I need is something like this:
gene log2
LOC102725121 -0.28067
DDX11L1 -0.28067
OR4F5 -0.28067
PIK3CA 0.35475
NRAS 3.35475
The first step would be to separate everything by commas and then transpose the columns(?), and finally print the log2 value for each gene that was contained in that string delimited by quotes. If you could help me with an R or Python script, it would help a lot. Perhaps awk would work too.
I am using Linux Ubuntu 16.04.
I'm not sure if I am being clear, let me know if this is useful.
Thank you!
I hope the following Python code helps:
import csv
list1 = []
with open('copynumber.cns','r') as file:
    exampleReader = csv.reader(file)
    for row in exampleReader:
        list1.append(row)
for row in list1:
    strings = row[3].split(',') # Get fourth column in CSV, i.e. gene column, and split on occurrence of comma
    for string in strings: # Loop through each string
        print(string + ' ' + str(row[4]))
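A variant of the same idea that addresses columns by name could look like this. It is a sketch under the assumption that the header row is exactly the one shown in the question; the sample row is shortened to three genes for illustration.

```python
import csv
import io

# Shortened sample standing in for copynumber.cns.
sample = io.StringIO(
    'chromosome,start,end,gene,log2\n'
    'chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5",-0.28067\n'
)

# DictReader uses the header row as keys; the csv module keeps the quoted
# gene list together as a single field, which is then split on commas.
reader = csv.DictReader(sample)
pairs = [(gene, row["log2"]) for row in reader for gene in row["gene"].split(",")]

print("gene\tlog2")
for gene, log2 in pairs:
    print(f"{gene}\t{log2}")
```

For the real file, replace the StringIO object with open('copynumber.cns', newline='').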

Looping a write command to output many different indices from a list separately in Python

I'm trying to get an output like:
KPLR003222854-2009131105131
in a text file. The way I am attempting to derive that output is as such:
with open('Processed_Data.txt', 'r') as file_P, open('KIC_list.txt', 'w') as namelist:
    nameData = []
    for line in file_P:
        splt_file_P = line.split()
        nameData.append(splt_file_P[0])
    for key in nameData:
        namelist.write('\n' 'KPLR00' + "".join(str(w) for w in nameData) + '-2009131105131')
However, I am having an issue: all the numbers in the nameData array appear at once in the output. Instead of one ID per line, as shown above, the output is something like this:
KPLR00322285472138721382172198371823798123781923781237819237894676472634973256279234987-2009131105131
So my question is how do I loop the write command in a way that will allow me to get each separate ID (each has a specific index value, but there are over 150) to be properly outputted.
EDIT:
Also, some of the IDs in the list are not the same length, so I wanted to add 0s to the front of the 'key' to make them all 9 digits long. I cheated by adding the 0s to the KPLR string in quotes, but not all of the IDs need exactly two 0s. The question is, could I add 0s between KPLR and the key in some way to match the 9-digit format?
Your code looks like it's working as one would expect: "".join(str(w) for w in nameData) makes a string composed of the concatenation of every item in nameData.
Chances are you want:
for key in nameData:
    namelist.write('\n' 'KPLR00' + key + '-2009131105131')
Or even better:
for key in nameData:
    namelist.write('\nKPLR%09i-2009131105131'%int(key)) #no string concatenation
String concatenation tends to be slower, and if you're not only operating on strings, will involve explicit calls to str. Here's a pair of ideone snippets showing the difference: http://ideone.com/RR5RnL and http://ideone.com/VH2gzx
Also, the above form with the format string '%09i' will pad with 0s to make the number up to 9 digits. Because the format is '%i', I've added an explicit conversion to int. See here for full details: http://docs.python.org/2/library/stdtypes.html#string-formatting-operations
Finally, here's a single line version (excepting the with statement, which you should of course keep):
namelist.write("\n".join("KPLR%09i-2009131105131"%int(line.split()[0]) for line in file_P))
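The zero padding can also be sketched with str.format. The helper name kplr_name and the second and third sample IDs are hypothetical; only the first ID comes from the question.

```python
def kplr_name(key, suffix="2009131105131"):
    """Build a KPLR name with the ID zero-padded to 9 digits."""
    # {:09d} pads the integer with leading zeros to a width of 9.
    return "KPLR{:09d}-{}".format(int(key), suffix)

# IDs of different lengths all end up 9 digits wide.
names = [kplr_name(key) for key in ["3222854", "12345678", "757076"]]

for name in names:
    print(name)
```

The first name printed is KPLR003222854-2009131105131, the format asked for in the question.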
You can change this:
"".join(str(w) for w in nameData)
to this:
",".join(str(w) for w in nameData)
Basically, the "," will comma delimit the elements in your nameData list. If you use "", then there will be nothing to separate the elements, so they appear all at once. You can change the delimiter to suit your needs.
Just for kicks:
with open('Processed_Data.txt', 'r') as file_P, open('KIC_list.txt', 'w') as namelist:
    nameData = [line.split()[0] for line in file_P]
    namelist.write("\n".join("KPLR00" + str(key) + '-2009131105131' for key in nameData))
I think that will work, but I haven't tested it. You can make it even smaller/uglier by not using nameData at all, and just use that list comprehension right in its place.
