I have a .txt file which looks like:
# Explanatory text
# Explanatory text
# ID_1 ID_2
10310 34426
104510 4582343
1032410 5424233
12410 957422
In the file, the two IDs on the same row are separated by a tab character ('\t').
I'm trying to do some analysis using the numbers in the dataset, so I want to delete the first three rows. How can this be done in Python? I.e. I'd like to produce a new dataset that looks like:
10310 34426
104510 4582343
1032410 5424233
12410 957422
I've tried the following code but it didn't work:
f = open(filename,'r')
lines = f.readlines()[3:]
f.close()
It doesn't work because I get this format (a list, with \t and \n present), not the one I indicated I want above:
['10310\t34426\n', '104510\t4582343\n', '1032410\t5424233\n', ...]
You can try something like this:
with open(filename, 'r') as fh:
    for curline in fh:
        # check if the current line
        # starts with "#"
        if curline.startswith("#"):
            # it's a comment/header line, skip it
            ...
        else:
            # it's a data line, process it
            ...
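Building on that skeleton, here is a minimal sketch that collects the ID pairs for analysis (the pairs list and the int conversion are my additions, not part of the original answer):
pairs = []
with open(filename) as fh:
    for curline in fh:
        # keep only the data rows
        if not curline.startswith("#"):
            # the two IDs are tab-separated, per the question
            id_1, id_2 = curline.split('\t')
            pairs.append((int(id_1), int(id_2)))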
You can use Python's pandas to do this kind of task easily:
import pandas as pd
pd.read_csv(filename, header=None, skiprows=[0, 1, 2], sep='\t')
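If you also want named columns, the names can be taken from the question's "# ID_1 ID_2" header line (the names argument and the df variable here are my additions, not from the original answer):
import pandas as pd

# skiprows drops the three comment lines; names label the two ID columns
df = pd.read_csv(filename, sep='\t', skiprows=[0, 1, 2], names=['ID_1', 'ID_2'])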
Ok, here is the solution:
with open('file.txt') as f:
    lines = f.readlines()
lines = lines[3:]
Remove Comments
This function removes all comment lines:
def remove_comments(lines):
    return [line for line in lines if not line.startswith("#")]
Remove n lines from the top
def remove_n_lines_from_top(lines, n):
    if n <= len(lines):
        return lines[n:]
    else:
        return lines
Here is the complete source:
with open('file.txt') as f:
    lines = f.readlines()

def remove_comments(lines):
    return [line for line in lines if not line.startswith("#")]

def remove_n_lines_from_top(lines, n):
    return lines[n if n <= len(lines) else 0:]

lines = remove_n_lines_from_top(lines, 3)

with open("new_file.txt", "w") as f:  # save to new_file.txt
    f.writelines(lines)
I'm working on some simple data filters and I need help with this task:
Let's say this is my .csv file:
08534710,2888,15
08583315,2999,5
My goal here is to write a function that will search for a given value (e.g. 2888) and return the value next to it (15).
Here's my code so far:
def wordfinder(searchstring):
    csv_file = pd.read_csv('test.csv', "r")
    for searchstring in csv_file:
        if searchstring in csv_file:
            print(searchstring[2])
            return searchstring[2]
But I don't think it works as intended.
Search for searchstring in the second column and return the matching values from the third column:
import pandas as pd

def wordfinder(searchstring):
    df = pd.read_csv('test.csv', dtype=str, header=None)
    return df.loc[df[1] == searchstring, 2]
Output:
>>> wordfinder('2888')
0 15
Name: 2, dtype: object
# OR
>>> wordfinder('2888').tolist()
['15']
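If you want a single scalar rather than a Series (assuming the first match is the one you want), you can index into the result:
>>> wordfinder('2888').iloc[0]
'15'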
I suggest the following solution when:
this is just a simple search task, which means you don't need a DataFrame or to import the whole pandas package; or
you don't know in advance in which column your searchstring appears.
import csv

def word_finder(search_string):
    # Open the CSV file (csv.reader needs a file object, not a filename)
    with open('test.csv', newline='') as f:
        csv_reader = csv.reader(f)
        # Loop through every line
        for line in csv_reader:
            # If the search_string exists in this line
            if search_string in line:
                # Get its position
                position = line.index(search_string)
                # Only take the next value if it exists
                if position + 1 < len(line):
                    return line[position + 1]
    # Not found
    return None
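With the sample test.csv from the question, usage would look like:
>>> word_finder('2888')
'15'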
It's really this simple:
# Open and read the file
with open('t.txt') as f:
    lines = [l.strip() for l in f.readlines()]

def get_field(num):
    for line in lines:
        parts = line.split(',')
        if parts[1] == str(num):
            return int(parts[2])
Usage:
>>> get_field(2888)
15
>>> get_field(2999)
5
You don't really need pandas for such a simple case. Try this:
search = '2888'
with open('my.csv') as csv:
    for line in csv:
        tokens = line.strip().split(',')
        try:
            i = tokens.index(search)
            print(tokens[i+1])
        except (ValueError, IndexError):
            pass
I'm trying to analyze a text file with data in columns and records.
My file:
Name Surname Age Sex Grade
Chris M. 14 M 4
Adam A. 17 M
Jack O. M 8
The text file has some empty fields, as shown above.
The user wants to show Name and Grade:
import csv
with open('launchlog.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split() for line in stripped if line)
    with open('log.txt', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('Name', 'Surname', 'Age', 'Sex', 'Grade'))
        writer.writerows(lines)
log.txt :
Chris,M.,14,M,4
Adam,A.,17,M
Jack,O.,M,8
How can I insert the string "None" for the empty fields?
For example:
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
What would be the best way to do this in Python?
Use pandas:
import pandas
data = pandas.read_fwf("file.txt")
To get your dictionary:
data.set_index("Name")["Grade"].to_dict()
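And to produce the CSV output with "None" for the empty fields, as asked, a minimal sketch building on read_fwf (the dtype and fillna calls are my additions, and this assumes read_fwf infers the column boundaries correctly):
import pandas

# read as strings so the Age column doesn't turn into floats like 17.0
data = pandas.read_fwf("file.txt", dtype=str)
# replace missing fields with the string "None" and write plain CSV rows
data.fillna("None").to_csv("log.txt", index=False, header=False)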
Here's something in Pure Python™ that seems to do what you want, at least on the sample data file in your question.
In a nutshell, it first determines where each field name in the column-header line starts and ends. Then, for each remaining line of the file, it computes the same positions and uses them to determine which column each data item falls under, placing the item in its proper position in the row written to the output file.
import csv

def find_words(line):
    """ Return a list of (start, stop) tuples with the indices of the
        first and last characters of each "word" in the given string.
        Any sequence of consecutive non-space characters is considered
        as comprising a word.
    """
    line_len = len(line)
    indices = []
    i = 0
    while i < line_len:
        start, count = i, 0
        while line[i] != ' ':
            count += 1
            i += 1
            if i >= line_len:
                break
        indices.append((start, start+count-1))
        while i < line_len and line[i] == ' ':  # advance to start of next word
            i += 1
    return indices

# convert text file with missing fields to csv
with open('name_grades.txt', 'rt') as in_file, open('log.csv', 'wt', newline='') as out_file:
    writer = csv.writer(out_file)
    header = next(in_file)  # read first line
    fields = header.split()
    writer.writerow(fields)
    # determine the indices of where each field starts and stops based on header line
    field_positions = find_words(header)
    for line in in_file:
        line = line.rstrip('\r\n')  # remove trailing newline
        row = ['None' for _ in range(len(fields))]
        value_positions = find_words(line)
        for (vstart, vstop) in value_positions:
            # determine what field the value is underneath
            for i, (hstart, hstop) in enumerate(field_positions):
                if vstart <= hstop and hstart <= vstop:  # overlap?
                    row[i] = line[vstart:vstop+1]
                    break  # stop looking
        writer.writerow(row)
Here's the contents of the log.csv file it created:
Name,Surname,Age,Sex,Grade
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
I would use baloo's answer over mine, but if you just want to get a feel for where your code went wrong, the solution below mostly works (there is a formatting issue with the Grade field, but I'm sure you can get through that). Add some print statements to your code and to mine and you should be able to pick up the differences.
import csv
<Old Code removed in favor of new code below>
EDIT: I see your difficulty now. Please try the code below; I'm out of time today, so you will have to fill in the writer parts where the print statement is, but this will fulfill your request to replace empty fields with None.
import csv
with open('Test.txt', 'r') as in_file:
    with open('log.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        lines = [line for line in in_file]
        name_and_grade = dict()
        for line in lines[1:]:
            # slice the fixed-width fields out of each line
            parts = line[0:10], line[11:19], line[20:24], line[25:31], line[32:]
            new_line = list()
            for part in parts:
                val = part.replace('\n', '')
                val = val.strip()
                val = val if val != '' else 'None'
                new_line.append(val)
            print(new_line)
Without using pandas:
Edited based on your comment: I hard-coded this solution based on your data. It will not work for rows that don't have the Surname column.
I'm writing out Name and Grade since you only need those two columns.
o = open("out.txt", 'w')
with open("inFIle.txt") as f:
    for lines in f:
        lines = lines.strip("\n").split(",")
        try:
            grade = int(lines[-1])
            if (lines[-2][-1]) != '.':
                o.write(lines[0] + "," + str(grade) + "\n")
        except ValueError:
            print(lines)
o.close()
I have a text file that contains the following contents. I want to split this file into multiple files (1.txt, 2.txt, 3.txt, ...). Each new output file should look as shown below. The code I tried doesn't split the input file properly. How can I split the input file into multiple files?
My code:
#!/usr/bin/python
with open("input.txt", "r") as f:
a1=[]
a2=[]
a3=[]
for line in f:
if not line.strip() or line.startswith('A') or line.startswith('$$'): continue
row = line.split()
a1.append(str(row[0]))
a2.append(float(row[1]))
a3.append(float(row[2]))
f = open('1.txt','a')
f = open('2.txt','a')
f = open('3.txt','a')
f.write(str(a1))
f.close()
Input file:
A
x
k
..
$$
A
z
m
..
$$
A
B
l
..
$$
Desired output 1.txt
A
x
k
..
$$
Desired output 2.txt
A
z
m
..
$$
Desired output 3.txt
A
B
l
..
$$
Read your input file, write to an output file each time you find "$$", and increment the output-file counter. Code:
with open("input.txt", "r") as f:
buff = []
i = 1
for line in f:
if line.strip(): #skips the empty lines
buff.append(line)
if line.strip() == "$$":
output = open('%d.txt' % i,'w')
output.write(''.join(buff))
output.close()
i+=1
buff = [] #buffer reset
EDIT: should be efficient too https://wiki.python.org/moin/PythonSpeed/PerformanceTips#String_Concatenation
Try the re.findall() function:
import re
with open('input.txt', 'r') as f:
    data = f.read()

found = re.findall(r'\n*(A.*?\n\$\$)\n*', data, re.M | re.S)
[open(str(i) + '.txt', 'w').write(found[i-1]) for i in range(1, len(found)+1)]
Minimalistic approach for the first 3 occurrences:
import re

found = re.findall(r'\n*(A.*?\n\$\$)\n*', open('input.txt', 'r').read(), re.M | re.S)
# enumerate avoids found.index(), which would mis-number duplicate blocks
[open(str(i) + '.txt', 'w').write(f) for i, f in enumerate(found[:3], 1)]
Some explanations:
found = re.findall(r'\n*(A.*?\n\$\$)\n*', data, re.M | re.S)
will find all occurrences matching the specified RegEx and put them into a list called found
[open(str(i) + '.txt', 'w').write(f) for i, f in enumerate(found, 1)]
iterates (using a list comprehension) through all elements of the found list and, for each element, creates a text file named after the element's 1-based position in the list (e.g. "1.txt") and writes that element (occurrence) to the file.
Another version, without RegExes:
blocks_to_read = 3
blk_begin = 'A'
blk_end = '$$'

with open('35916503.txt', 'r') as f:
    fn = 1
    data = []
    write_block = False
    for line in f:
        if fn > blocks_to_read:
            break
        line = line.strip()
        if line == blk_begin:
            write_block = True
        if write_block:
            data.append(line)
        if line == blk_end:
            write_block = False
            with open(str(fn) + '.txt', 'w') as fout:
                fout.write('\n'.join(data))
            data = []
            fn += 1
PS: Personally, I don't like this version and would use the one using RegEx.
Open 1.txt at the beginning for writing. Write each line to the current output file. Additionally, if line.strip() == '$$', close the old file and open a new one for writing.
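A minimal sketch of that approach (the input filename is assumed from the question; note that it leaves one empty trailing file if the input ends with "$$"):
with open('input.txt') as f:
    i = 1
    out = open('%d.txt' % i, 'w')
    for line in f:
        out.write(line)
        # a block ends at "$$": close the current file and start the next one
        if line.strip() == '$$':
            out.close()
            i += 1
            out = open('%d.txt' % i, 'w')
    out.close()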
If the blocks are divided by empty lines, try this:
import sys

lines = sys.stdin.readlines()
i = 1
o = open("{}.txt".format(i), "w")
for line in lines:
    if len(line.strip()) == 0:
        o.close()
        i = i + 1
        o = open("{}.txt".format(i), "w")
    else:
        o.write(line)
o.close()
It looks to me like the condition you should be checking for is a line that contains just the newline (\n) character. When you encounter such a line, write the contents of the parsed file so far, close the file, and open another one for writing.
A very easy way, if you want to split it into 2 files for example:
with open("myInputFile.txt",'r') as file:
lines = file.readlines()
with open("OutputFile1.txt",'w') as file:
for line in lines[:int(len(lines)/2)]:
file.write(line)
with open("OutputFile2.txt",'w') as file:
for line in lines[int(len(lines)/2):]:
file.write(line)
Making that dynamic would be:
with open("inputFile.txt", 'r') as file:
    lines = file.readlines()

Batch = 10
end = 0
for i in range(1, Batch + 1):
    if i == 1:
        start = 0
    increase = int(len(lines)/Batch)
    end = end + increase
    with open("splitText_" + str(i) + ".txt", 'w') as file:
        for line in lines[start:end]:
            file.write(line)
    start = end
The problem is I need to read a text.txt file and just get very specific data from it.
The entries of that text.txt look like this:
b(1,4,8,1,4,TEST,0,3,AAAA,Test,2-150,000)
a(1,1,3,1,3,BBBB,0,3,BBBB,Test,2-150,000)
a(1,0,2,1,4,TEST,0,3,CCCC,Test,2-150,000)
b(1,1,0,1,4,TEST,0,3,DDDD,Test,2-150,000)
So now I just want those lines starting with "a(", and in those I just need the string after the 5th and the 8th comma, so in line 2 it would be BBBB, BBBB.
My code so far is:
infile = open("text.txt", "r")
numlines = 0
found = []
for line in infile:
    numlines += 1
    if "a" in line:
        line = line[line.find("(")+1:line.find(")")]
        found.append(line.split(','))

wordLed = len(found)
for i in range(0, wordLed):
    print(found[i])
infile.close()
This just gives me the full lines separated at the ",", but how can I index through them?
The quick short and dirty:
with open('text.txt') as f:
    result = [line.split(',')[5:9:3] for line in f if line.startswith("a(")]
    #                      ^^^^^^^
    #   "5 to 9 (excl.) by step of 3"
    #   that is items 5 and 5+3
    #
    #   replace by [5] if you only want the fifth item
    #   replace by [5:9] if you want items from 5 to 9 (excl.)

from pprint import pprint
pprint(result)
dirty because of the lack of error handling...
... anyway, given your sample data, this produces:
[['BBBB', 'BBBB'], ['TEST', 'CCCC']]
I would use the readlines function:
with open("data.txt","r") as f:
lines = f.readlines()
for line in lines:
if line[0:2] == 'a(':
data1 = line.split(',')[5]
data2 = line.split(',')[8]
print(data1, data2)
f.close()
You should check the full condition at the start, i.e. a( instead of a. Also, you can use split to create a list out of your string, based on ,:
infile = open("text.txt","r")
for line in infile:
if line.startswith("a("): # Starts with a(
data = line.split(',')
print data[5] # Print data at place 5
print data[8] # Print data at place 8
infile.close()
with open('text.txt') as infile:
    for line in [l for l in infile if l.startswith('a(')]:
        line = line[line.find('('):].strip('()\n').split(',')
        a_field, other_field = line[5], line[8]
You split the string already, just index into the list to get the fields you want.
How can I skip the header row and start reading a file from line 2?
with open(fname) as f:
    next(f)
    for line in f:
        # do something
        ...
f = open(fname,'r')
lines = f.readlines()[1:]
f.close()
If you want the first line and then want to perform some operation on the rest of the file, this code will be helpful.
with open(filename, 'r') as f:
    first_line = f.readline()
    for line in f:
        # Perform some operations
        ...
If slicing could work on iterators...
from itertools import islice

with open(fname) as f:
    for line in islice(f, 1, None):
        pass
f = open(fname).readlines()
firstLine = f.pop(0)  # removes the first line
for line in f:
    ...
To generalize the task of reading multiple header lines and to improve readability I'd use method extraction. Suppose you wanted to tokenize the first three lines of coordinates.txt to use as header information.
Example
coordinates.txt
---------------
Name,Longitude,Latitude,Elevation, Comments
String, Decimal Deg., Decimal Deg., Meters, String
Euler's Town,7.58857,47.559537,0, "Blah"
Faneuil Hall,-71.054773,42.360217,0
Yellowstone National Park,-110.588455,44.427963,0
Then method extraction allows you to specify what you want to do with the header information (in this example we simply tokenize the header lines based on the comma and return them as a list, but there's room to do much more).
def __readheader(filehandle, numberheaderlines=1):
    """Reads the specified number of lines and returns the comma-delimited
    strings on each line as a list"""
    for _ in range(numberheaderlines):
        # list(...) so this also prints as a list on Python 3
        yield list(map(str.strip, filehandle.readline().strip().split(',')))

with open('coordinates.txt', 'r') as rh:
    # Single header line
    # print(next(__readheader(rh)))
    # Multiple header lines
    for headerline in __readheader(rh, numberheaderlines=2):
        print(headerline)  # Or do other stuff with headerline tokens
Output
['Name', 'Longitude', 'Latitude', 'Elevation', 'Comments']
['String', 'Decimal Deg.', 'Decimal Deg.', 'Meters', 'String']
If coordinates.txt contains another header line, simply change numberheaderlines. Best of all, it's clear what __readheader(rh, numberheaderlines=2) is doing, and we avoid the ambiguity of having to figure out or comment on why the author of the accepted answer uses next() in his code.
If you want to read multiple CSV files starting from line 2, this works like a charm
import csv

for files in csv_file_list:
    with open(files, 'r') as r:
        next(r)  # skip headers
        rr = csv.reader(r)
        for row in rr:
            # do something
            ...
(this is part of Parfait's answer to a different question)
# Open a connection to the file
with open('world_dev_ind.csv') as file:
    # Skip the column names
    file.readline()
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    # Process only the first 1000 rows
    for j in range(0, 1000):
        # Split the current line into a list: line
        line = file.readline().split(',')
        # Get the value for the first column: first_col
        first_col = line[0]
        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)