I have an assignment that requires me to analyze data for presidential job creation without using dictionaries. I have to open a text file and average the data that applies to the Democratic and Republican presidents. I am having trouble understanding how to skip over certain lines (in my case I don't want to include the first line, which holds the month names, or index position 0 of each row, which holds the year). This is what I have so far, along with a bit of the input file:
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1979,14090,14135,14152,14191,14221,14239,14288,14328,14422,14484,14532,14559
1980,14624,14747,14754,14795,14827,14784,14861,14870,14824,14900,14903,14946
1981,14969,14981,14987,14985,14971,14963,14993,15007,14971,15028,15073,15075
1982,15056,15056,15050,15075,15132,15207,15299,15328,15403,15463,15515,15538
def g_avg():
    infile = open("government_employment_Windows.txt", 'r')
    lines = []
    for line in infile:
        print(line)
        lines.append(line)
    infile.close()
    print(lines)
    mean = 0
    for line in lines:
        number = float(line)
        mean = mean + number
    mean = mean / len(lines)
    print(mean)
Here's a very pythonic way to calculate this:
with open('filename') as f:
    lines = f.readlines()  # read the lines of the txt
    total = 0
    n = 0
    for line in lines[1:]:  # use [1:] to skip the first row with the months
        row = line.split(',')  # split converts the line into a list separated by commas
        for element in row[1:]:  # use [1:] to skip the year
            total += float(element)
            n += 1
    mean = total / n
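For reference, the same computation can be collapsed into a few lines with a list comprehension (a minimal sketch, assuming the same file layout as above):
with open('filename') as f:
    next(f)  # skip the header row with the months
    values = [float(v) for line in f for v in line.split(',')[1:]]  # [1:] drops the year
    mean = sum(values) / len(values)
print(mean)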
This also looks like a CSV file, in which case you can use the built-in csv module:
import csv

total = 0
count = 0
with open("government_employment_Windows.txt", 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skips the headers
    for line in reader:
        for item in line[1:]:
            count += 1
            total += float(item)
print(total)
print(count)
print('average: ', total/count)
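The csv module also has DictReader, which consumes the header row for you and lets you skip the year column by name instead of by position (a sketch along the same lines):
import csv

total = 0
count = 0
with open("government_employment_Windows.txt", newline='') as f:
    for row in csv.DictReader(f):  # the header row becomes the keys
        for month, value in row.items():
            if month != 'Year':  # skip the year column by name
                total += float(value)
                count += 1
print('average: ', total / count)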
Use a slice to skip over the first line, i.e. file.readlines()[1:]
# foo.txt
line, that, you, want, to, skip
important, stuff
other, important, stuff
with open('foo.txt') as file:
    for line in file.readlines()[1:]:
        print(line)
important, stuff
other, important, stuff
Note that since I have used with to open the file, Python will close it for me automatically. If you had just done file = open(..), you would also have to call file.close()
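For the curious, the with statement above is roughly equivalent to this try/finally pattern:
file = open('foo.txt')
try:
    for line in file.readlines()[1:]:
        print(line)
finally:
    file.close()  # runs even if the loop raises an exception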
I need to break up a 1.3m text file into smaller text files based on the first row of each section. The data inputs will likely vary over time, so I'd like to automate the process with something that looks like the following, but I'm open to any suggestions:
FirstLine test1
1 1 1
TIMESTEP Avg VARIANCE(mm^2) STD
2006-01-06T00:00:00Z 77.556335 114.23446 10.688052
2006-02-06T00:00:00Z 30.174097 20.363855 4.512633
2006-03-06T00:00:00Z 65.48971 146.99098 12.123984
2006-04-06T00:00:00Z 68.65635 335.42905 18.314722
2006-05-06T00:00:00Z 65.31086 121.24954 11.011337
2006-06-06T00:00:00Z 123.571075 172.97223 13.151891
FirstLine test2
1 1 1
TIMESTEP Avg VARIANCE(mm^2) STD
2006-01-06T00:00:00Z 66.34833 258.47723 16.077227
2006-02-06T00:00:00Z 16.08292 16.153652 4.0191607
2006-03-06T00:00:00Z 34.585014 185.23705 13.610182
I need each output file to start with its FirstLine row and include everything up to the next FirstLine row.
I've tried identifying the row number with this script:
def search_string_in_file(content, FirstLine):
    line_number = 0
    list_of_results = []
    RowList = []
    # Open the file in read only mode
    with open('test.csv', 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if line contains the string
            line_number += 1
            if FirstLine in line:
                # If yes, then add the line number & line as a tuple in the list
                list_of_results.append((line_number, line.rstrip()))
                print(list_of_results)
    # Return list of tuples containing line numbers and lines where string is found
    RowList = pd.DataFrame.from_string(list_of_results)
    return list_of_results
The above seems to run successfully, but there are no results and no errors.
I found a way to do this that actually cuts some steps out:
import re

found = re.findall(r'\n*(.*?\n\#)\n*', data, re.M | re.S)
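If you would rather avoid regular expressions, a plain loop that opens a new output file every time it sees a FirstLine row does the same job (a minimal sketch; the input name test.csv and the part_N.txt output names are placeholders):
part = None
part_number = 0
with open('test.csv') as infile:
    for line in infile:
        if line.startswith('FirstLine'):  # a new section begins here
            if part is not None:
                part.close()
            part_number += 1
            part = open('part_%d.txt' % part_number, 'w')
        if part is not None:  # skip anything before the first FirstLine row
            part.write(line)
if part is not None:
    part.close()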
I'm new to Python programming, and my task is to count how many binary values in the list have more 0's than 1's. The data for this task is in a text file; I've opened the file and put every line of text into a separate element of a list.
binary = list()
file = 'liczby.txt'
with open(file) as fin:
    for line in fin:
        binary.append(line)
print(*binary, sep="\n")
And now I'm stuck.
more_zeros = 0
file = 'liczby.txt'
with open(file) as fin:
    for line in fin:
        if line.count('0') > line.count('1'):
            more_zeros += 1
print(more_zeros)
Out[1]: 6 # based on the 17 lines you gave me in your comment above
def count(fname):
    cnt = 0
    with open(fname, newline='') as f:
        for line in f:
            if line.count('0') > line.count('1'):
                cnt += 1
    return cnt

print(count('/tmp/g.data'))
Read help(str); there are many useful functions.
EDIT:
If you like minimalist notation, you can use this ;-)
It includes Nicolas Gervais's len() trick, which is awesome.
def count(fname):
    with open(fname, newline='') as f:
        return sum(line.count('0') > len(line) // 2 for line in f)
EDIT2: I had misunderstood the question. I've updated the code to count only the lines that contain more zeros.
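One caveat with the len() trick: len(line) includes the trailing newline, so the comparison can be off by one on some lines. Stripping each line first keeps it exact (a sketch, assuming the lines contain only 0s and 1s):
def count(fname):
    with open(fname) as f:
        return sum(s.count('0') > len(s) // 2 for s in (line.strip() for line in f))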
I have a big text file with a lot of parts. Every part has 4 lines, and the next part starts immediately after the previous one.
The first line of each part starts with #, the 2nd line is a sequence of characters, the 3rd line is a +, and the 4th line is again a sequence of characters.
Small example:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACGCTTATCGATAAAATTTTGAATTTTGTAACTTGTTTTTGTAATTCTTTAGTTTGTATGTCTGTTGCTATTATGTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAAGTTAACCGATACCGTCGAGATCCGTTCACTAATCGAACGGATCTGTCTCTGTCTCTCTC
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5AEG1EF511F1?GFH3#BFADGD55F?#GFHFGGFCGG/GHGHHHHHHHDBG4E?FB?BGHHHHHHHHHHHHHHHHHFHHHHHHHHHGHGHGHHHHHFHHHHHGGGGHHHHGGGGHHHHHHHGHGHHHHHHFGHCFGGGHGGGGGGGGFGGEGBFGGGGGGGGGFGGGGFFB9/BFFFFFFFFFF/
I want to change the 2nd and the 4th line of each part and make a new file with a similar structure (4 lines for each part). In fact, I want to keep the first 65 characters of lines 2 and 4 and remove the rest. The expected output for the small example would look like this:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACG
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5A
I wrote the following code:
infile = open("file.fastq", "r")
new_line=[]
for line_number in len(infile.readlines()):
if line_number ==2 or line_number ==4:
new_line.append(infile[line_number])
with open('out_file.fastq', 'w') as f:
for item in new_line:
f.write("%s\n" % item)
but it does not return what I want. How can I fix it to get the expected output?
This code will achieve what you want:
from itertools import islice

with open('bio.txt', 'r') as infile:
    while True:
        lines_gen = list(islice(infile, 4))
        if not lines_gen:
            break
        a, b, c, d = lines_gen
        b = b[0:65] + '\n'
        d = d[0:65] + '\n'
        with open('mod_bio.txt', 'a+') as f:
            f.write(a + b + c + d)
How does it work?
We first use islice to read 4 lines at a time, as you describe.
Then we unpack them into the individual lines a, b, c, d and perform string slicing. Finally we join the strings and write them to a new file.
I think some itertools.cycle could be nice here:
import itertools

with open("transformed.file.fastq", "w+") as output_file:
    with open("file.fastq", "r") as input_file:
        for i in itertools.cycle((1, 2, 3, 4)):
            line = input_file.readline().strip()
            if not line:
                break
            if i in (2, 4):
                line = line[:65]
            output_file.write("{}\n".format(line))
readlines() returns a list of the lines in your file, so you don't need to prepare the new_line list. Iterate directly over the index-value pairs of the list; then you can modify the values at the positions you want.
Modifying your code, try this:
infile = open("file.fastq", "r")
new_lines = infile.readlines()
for i, t in enumerate(new_lines):
if i == 1 or i == 3:
new_lines[i] = new_lines[i][:65]
with open('out_file.fastq', 'w') as f:
for item in new_lines:
f.write("%s" % item)
I'm trying to analyze a text file with data in columns and records.
My file:
Name Surname Age Sex Grade
Chris M. 14 M 4
Adam A. 17 M
Jack O. M 8
The text file has some empty fields, as shown above.
The user wants to show Name and Grade:
import csv

with open('launchlog.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split() for line in stripped if line)
    with open('log.txt', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('Name', 'Surname', 'Age', 'Sex', 'Grade'))
        writer.writerows(lines)
log.txt:
Chris,M.,14,M,4
Adam,A.,17,M
Jack,O.,M,8
How can I insert the string "None" for the empty fields?
For example:
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
What would be the best way to do this in Python?
Use pandas:
import pandas
data = pandas.read_fwf("file.txt")
To get your dictionary:
data.set_index("Name")["Grade"].to_dict()
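To produce the requested output, pandas can also fill the missing fields with the string "None" and write the result back out (a sketch, assuming read_fwf infers the column boundaries correctly):
import pandas

data = pandas.read_fwf("file.txt")
data.fillna("None").to_csv("log.txt", index=False)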
Here's something in Pure Python™ that seems to do what you want, at least on the sample data file in your question.
In a nutshell, it first determines where each field name in the column header line starts and ends. Then, for each remaining line of the file, it does the same thing to get a second list of positions, which it uses to determine which column each data item falls under (and therefore its proper position in the row that will be written to the output file).
import csv

def find_words(line):
    """ Return a list of (start, stop) tuples with the indices of the
        first and last characters of each "word" in the given string.
        Any sequence of consecutive non-space characters is considered
        as comprising a word.
    """
    line_len = len(line)
    indices = []
    i = 0
    while i < line_len:
        start, count = i, 0
        while line[i] != ' ':
            count += 1
            i += 1
            if i >= line_len:
                break
        indices.append((start, start+count-1))
        while i < line_len and line[i] == ' ':  # advance to start of next word
            i += 1
    return indices

# convert text file with missing fields to csv
with open('name_grades.txt', 'rt') as in_file, open('log.csv', 'wt', newline='') as out_file:
    writer = csv.writer(out_file)
    header = next(in_file)  # read first line
    fields = header.split()
    writer.writerow(fields)
    # determine the indices of where each field starts and stops based on header line
    field_positions = find_words(header)
    for line in in_file:
        line = line.rstrip('\r\n')  # remove trailing newline
        row = ['None' for _ in range(len(fields))]
        value_positions = find_words(line)
        for (vstart, vstop) in value_positions:
            # determine what field the value is underneath
            for i, (hstart, hstop) in enumerate(field_positions):
                if vstart <= hstop and hstart <= vstop:  # overlap?
                    row[i] = line[vstart:vstop+1]
                    break  # stop looking
        writer.writerow(row)
Here's the contents of the log.csv file it created:
Name,Surname,Age,Sex,Grade
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
I would use baloo's answer over mine, but if you just want to get a feel for where your code went wrong, the solution below mostly works (there is a formatting issue with the Grade field, but I'm sure you can get through that). Add some print statements to your code and to mine and you should be able to pick out the differences.
<Old Code removed in favor of new code below>
EDIT: I see your difficulty now. Please try the code below; I'm out of time today, so you will have to fill in the writer parts where the print statement is, but this fulfills your request to replace empty fields with None.
import csv

with open('Test.txt', 'r') as in_file:
    with open('log.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        lines = [line for line in in_file]
        for line in lines[1:]:
            parts = line[0:10], line[11:19], line[20:24], line[25:31], line[32:]
            new_line = list()
            for part in parts:
                val = part.replace('\n', '')  # note: '\n', not '/n'
                val = val.strip()
                val = val if val != '' else 'None'
                new_line.append(val)
            print(new_line)
Without using pandas:
Edited based on your comment: I hard-coded this solution to your data, so it will not work for rows that don't have the Surname column.
I'm writing out only Name and Grade, since you only need those two columns.
o = open("out.txt", 'w')
with open("inFIle.txt") as f:
for lines in f:
lines = lines.strip("\n").split(",")
try:
grade = int(lines[-1])
if (lines[-2][-1]) != '.':
o.write(lines[0]+","+ str(grade)+"\n")
except ValueError:
print(lines)
o.close()
How can I skip the header row and start reading a file from line 2?
with open(fname) as f:
    next(f)
    for line in f:
        # do something
f = open(fname, 'r')
lines = f.readlines()[1:]
f.close()
If you want the first line, and then want to perform some operation on the rest of the file, this code will be helpful:
with open(filename, 'r') as f:
    first_line = f.readline()
    for line in f:
        # Perform some operations
If slicing could work on iterators...
from itertools import islice

with open(fname) as f:
    for line in islice(f, 1, None):
        pass
f = open(fname).readlines()
firstLine = f.pop(0)  # removes the first line
for line in f:
    ...
To generalize the task of reading multiple header lines and to improve readability, I'd use method extraction. Suppose you wanted to tokenize the first two lines of coordinates.txt to use as header information.
Example
coordinates.txt
---------------
Name,Longitude,Latitude,Elevation, Comments
String, Decimal Deg., Decimal Deg., Meters, String
Euler's Town,7.58857,47.559537,0, "Blah"
Faneuil Hall,-71.054773,42.360217,0
Yellowstone National Park,-110.588455,44.427963,0
Then method extraction allows you to specify what you want to do with the header information (in this example we simply tokenize the header lines on the comma and return them as lists, but there's room to do much more).
def __readheader(filehandle, numberheaderlines=1):
    """Reads the specified number of lines and returns the comma-delimited
    strings on each line as a list"""
    for _ in range(numberheaderlines):
        yield list(map(str.strip, filehandle.readline().strip().split(',')))

with open('coordinates.txt', 'r') as rh:
    # Single header line
    # print(next(__readheader(rh)))
    # Multiple header lines
    for headerline in __readheader(rh, numberheaderlines=2):
        print(headerline)  # Or do other stuff with headerline tokens
Output
['Name', 'Longitude', 'Latitude', 'Elevation', 'Comments']
['String', 'Decimal Deg.', 'Decimal Deg.', 'Meters', 'String']
If coordinates.txt contains another header line, simply change numberheaderlines. Best of all, it's clear what __readheader(rh, numberheaderlines=2) is doing, and we avoid the ambiguity of having to figure out or comment on why the author of the accepted answer uses next() in his code.
If you want to read multiple CSV files starting from line 2, this works like a charm:
import csv

for files in csv_file_list:
    with open(files, 'r') as r:
        next(r)  # skip headers
        rr = csv.reader(r)
        for row in rr:
            # do something
(this is part of Parfait's answer to a different question)
# Open a connection to the file
with open('world_dev_ind.csv') as file:
    # Skip the column names
    file.readline()
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    # Process only the first 1000 rows
    for j in range(0, 1000):
        # Split the current line into a list: line
        line = file.readline().split(',')
        # Get the value for the first column: first_col
        first_col = line[0]
        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1
# Print the resulting dictionary
print(counts_dict)
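For reference, collections.Counter together with itertools.islice can replace the manual dictionary bookkeeping (a sketch of the same first-1000-rows count):
from collections import Counter
from itertools import islice

with open('world_dev_ind.csv') as file:
    file.readline()  # skip the column names
    counts_dict = Counter(line.split(',')[0] for line in islice(file, 1000))
print(counts_dict)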