I'm trying to analyze a text file containing data arranged in columns and records.
My file:
Name Surname Age Sex Grade
Chris M. 14 M 4
Adam A. 17 M
Jack O. M 8
The text file has some empty fields, as shown above. I want to show Name and Grade:
import csv
with open('launchlog.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split() for line in stripped if line)
    with open('log.txt', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('Name', 'Surname', 'Age', 'Sex', 'Grade'))
        writer.writerows(lines)
log.txt:
Chris,M.,14,M,4
Adam,A.,17,M
Jack,O.,M,8
How can I insert the string "None" for the empty fields?
For example:
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
What would be the best way to do this in Python?
Use pandas:
import pandas
data = pandas.read_fwf("file.txt")
To get your dictionary:
data.set_index("Name")["Grade"].to_dict()
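A minimal end-to-end sketch for the CSV output asked for above (assuming the input is saved as file.txt; this relies on read_fwf reading missing cells as NaN):
import pandas

# read_fwf infers the fixed-width column boundaries from the header line
data = pandas.read_fwf("file.txt")
# missing cells come back as NaN; substitute the literal string "None"
data = data.fillna("None")
# write all columns back out as comma-separated values
data.to_csv("log.csv", index=False)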
Here's something in Pure Python™ that seems to do what you want, at least on the sample data file in your question.
In a nutshell, it first determines where each field name in the column-header line starts and ends. Then, for each remaining line of the file, it computes the same spans to determine which column each data item falls under, and puts the item in its proper position in the row written to the output file.
import csv

def find_words(line):
    """ Return a list of (start, stop) tuples with the indices of the
        first and last characters of each "word" in the given string.
        Any sequence of consecutive non-space characters is considered
        as comprising a word.
    """
    line_len = len(line)
    indices = []
    i = 0
    while i < line_len:
        start, count = i, 0
        while line[i] != ' ':
            count += 1
            i += 1
            if i >= line_len:
                break
        indices.append((start, start+count-1))
        while i < line_len and line[i] == ' ':  # advance to start of next word
            i += 1
    return indices
# convert text file with missing fields to csv
with open('name_grades.txt', 'rt') as in_file, open('log.csv', 'wt', newline='') as out_file:
    writer = csv.writer(out_file)
    header = next(in_file)  # read first line
    fields = header.split()
    writer.writerow(fields)
    # determine the indices of where each field starts and stops based on header line
    field_positions = find_words(header)
    for line in in_file:
        line = line.rstrip('\r\n')  # remove trailing newline
        row = ['None' for _ in range(len(fields))]
        value_positions = find_words(line)
        for (vstart, vstop) in value_positions:
            # determine what field the value is underneath
            for i, (hstart, hstop) in enumerate(field_positions):
                if vstart <= hstop and hstart <= vstop:  # overlap?
                    row[i] = line[vstart:vstop+1]
                    break  # stop looking
        writer.writerow(row)
Here's the contents of the log.csv file it created:
Name,Surname,Age,Sex,Grade
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
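As a quick sanity check, calling find_words on the sample header line yields the inclusive span of each field name:
find_words("Name Surname Age Sex Grade")
# [(0, 3), (5, 11), (13, 15), (17, 19), (21, 25)]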
I would use baloo's answer over mine, but if you just want to get a feel for where your code went wrong, the solution below mostly works (there is a formatting issue with the Grade field, but I'm sure you can work through that). Add some print statements to your code and to mine and you should be able to pick up the differences.
<Old Code removed in favor of new code below>
EDIT: I see your difficulty now. Please try the code below; I'm out of time today, so you will have to fill in the writer parts where the print statement is, but this will fulfill your request to replace empty fields with None.
import csv
with open('Test.txt', 'r') as in_file:
    with open('log.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        lines = [line for line in in_file]
        name_and_grade = dict()
        for line in lines[1:]:
            parts = line[0:10], line[11:19], line[20:24], line[25:31], line[32:]
            new_line = list()
            for part in parts:
                val = part.replace('\n', '')  # strip the newline
                val = val.strip()
                val = val if val != '' else 'None'
                new_line.append(val)
            print(new_line)
Without using pandas:
Edited based on your comment: I hard-coded this solution for your data. It will not work for rows that don't have a Surname column.
I'm writing out Name and Grade, since you only need those two columns.
o = open("out.txt", 'w')
with open("inFIle.txt") as f:
    for lines in f:
        lines = lines.strip("\n").split(",")
        try:
            grade = int(lines[-1])
            if (lines[-2][-1]) != '.':
                o.write(lines[0] + "," + str(grade) + "\n")
        except ValueError:
            print(lines)
o.close()
Related
I have an assignment that requires me to analyze data for presidential job creation without using dictionaries. I have to open a text file and average data that applies to the Democratic and Republican presidents. I am having trouble understanding how to skip over certain lines (in my case, I don't want to include the first line or index position 0: the years and the months). This is what I have so far, with a bit of the input file:
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1979,14090,14135,14152,14191,14221,14239,14288,14328,14422,14484,14532,14559
1980,14624,14747,14754,14795,14827,14784,14861,14870,14824,14900,14903,14946
1981,14969,14981,14987,14985,14971,14963,14993,15007,14971,15028,15073,15075
1982,15056,15056,15050,15075,15132,15207,15299,15328,15403,15463,15515,15538
def g_avg():
    infile = open("government_employment_Windows.txt", 'r')
    lines = []
    for line in infile:
        print(line)
        lines.append(line)
    infile.close()
    print(lines)
    mean = 0
    for line in lines:
        number = float(line)
        mean = mean + number
    mean = mean / len(lines)
    print(mean)
Here's a very Pythonic way to calculate this:
with open('filename') as f:
    lines = f.readlines()  # read the lines of the txt
sum = 0
n = 0
for line in lines[1:]:  # use [1:] to skip the first row with the months
    row = line.split(',')  # split the line into a list on the commas
    for element in row[1:]:  # use [1:] to skip the years
        sum += float(element)
        n += 1
mean = sum / n
This also looks like a csv file, in which case you can use the built-in csv module:
import csv

total = 0
count = 0
with open("government_employment_Windows.txt", 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skips the headers
    for line in reader:
        for item in line[1:]:
            count += 1
            total += float(item)
print(total)
print(count)
print('average: ', total/count)
Use a slice to skip over the first line, i.e. file.readlines()[1:]
# foo.txt
line, that, you, want, to, skip
important, stuff
other, important, stuff

with open('foo.txt') as file:
    for line in file.readlines()[1:]:
        print(line)

Output:
important, stuff
other, important, stuff
Note that since I have used with to open the file, Python will close it for me automatically. If you have just done file = open(...), you will also have to call file.close() yourself.
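For completeness, a minimal sketch of the manual variant; try/finally guarantees the file is closed even if the loop raises:
f = open('foo.txt')
try:
    for line in f.readlines()[1:]:
        print(line)
finally:
    f.close()  # must be closed explicitly without a with-block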
The title isn't big enough for me to explain this so here it goes:
I have a csv file looking something like this:
Example csv containing
long string with some special characters , number, string, number
long string with some special characters , number, string, number
long string with some special characters , number, string, number
long string with some special characters , number, string, number
I want to go through the first column, and if the length of the string is greater than 20, do this:
LINE 20: long string with som, e special characters
That is: split the string, modify the first csv with the first part of the string, and create a new csv and add the other part on the same line number, leaving the rest just whitespace.
What I have for now is this:
(The code below doesn't do anything right now; it's just what I made to try to explain to myself and figure out how I could write splitString to a new file.)
fileName = the file name
maxCollumnLength = number of rows in the whole set
lineNum = line number of a string that is greater than 20
splitString = second part of the split string that should be written to another file
def newopenfile(fileName, maxCollumnLength, lineNum, splitString):
    with open(fileName, 'rw', encoding="utf8") as nf:
        writer = csv.writer(fileName, quoting=csv.QUOTE_NONE)
        for i in range(0, maxCollumnLength-1):
            # write whitespace until reaching lineNum of a string that's bigger than 20,
            # then write that part of the string to a csv
            pass
This goes through the first column and checks the length:
fileName = 'uskrs.csv'
firstColList = []  # an empty list to store the first column
splitString = []
i = 0
with open(fileName, 'rw', encoding="utf8") as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
        if len(row[0]) > 20:
            i += 1
            # split the row and pass the other end of it to
            # newopenfile(fileName, len(reader), i, splitString)
            # print(row[0])  # for debugging
            firstColList.append(row[0])
From this point I am stuck on how to actually change the string in the csv and how to split it.
The string could also have 60+ chars, so it could need splitting more than two times and storing the parts in more than two csvs.
I suck at explaining the problem, so if you have any questions please do ask.
Okay, so I was successful in dividing the first column if its length is greater than 20, and replacing the first column with the first 20 chars:
import csv

def checkLength(column, readFile, writeFile, maxLen):
    counter = 0
    i = 0
    idxSplitItems = []
    final = []
    newSplits = 0
    with open(readFile, 'r', encoding="utf8", newline='') as f:
        reader = csv.reader(f)
        your_list = list(reader)
        final = your_list
        for sublist in your_list:
            # del sublist[-1]  # remove last invisible element
            i += 1
            data = removeUnwanted(sublist[column])
            print(data)
            if len(data) > maxLen:
                counter += 1  # number of large items
                idxSplitItems.append(split_bylen(i, data, maxLen))
                if len(idxSplitItems) > newSplits:
                    newSplits = len(idxSplitItems)
                final[i-1][column] = split_bylen(i, data, maxLen)[1]
                final[i-1][column] = removeUnwanted(final[i-1][column])
            print("After split data: " + data)
            print("After split final: " + final[i-1][column])
    writeSplitToCSV(writeFile, final)
    checkCols(final, 6)
    return final, idxSplitItems

def removeUnwanted(data):
    data = data.replace(',', ' ')
    return data

def split_bylen(index, item, maxLen):
    clean = removeUnwanted(item)
    splitList = [clean[ind:ind+maxLen] for ind in range(0, len(item), maxLen)]
    splitList.insert(0, index)
    return splitList

def writeSplitToCSV(writeFile, data):
    with open(writeFile, 'w', encoding="utf8", newline='') as f:
        writer = csv.writer(f)
        writer.writerows(data)

def checkCols(data, columns):
    for sublist in data:
        if len(sublist) - 1 != columns:
            print("[X] This row doesn't have the same amount of columns as the others: " + str(sublist))
        else:
            print("All okay")

# len(data)  # how many split items
# print(your_list[0][0])
# print("Number of large: ", counter)

final, idxSplitItems = checkLength(0, 'test.csv', 'final.csv', 30)
print("------------------------")
print(idxSplitItems)
print("-------------------------")
print(final)
Now I have a problem with this part of the code. Notice these lines:
print("After split data: "+ data)
print("After split final: "+ final[i-1][column])
This is to check whether removing the comma worked. With the example of
"BUTKOVIĆ VESNA , DIPL.IUR."
data returns
BUTKOVIĆ VESNA DIPL.IUR.
but final returns
BUTKOVIĆ VESNA , DIPL.IUR.
Why does my final contain the "," again while in data it's gone? It must be something in split_bylen() that makes it do that.
Dictionaries are fun!
To overwrite the original csv, see here. You would have to use DictReader & DictWriter. I keep your method of reading just for clarity.
writecsvs = {}  # store each line of each new csv
# e.g. {'csv1': [[row0_split1, row0_num, row0_str, row0_num],
#                [row1_split1, row1_num, row1_str, row1_num], ...],
#       'csv2': [[row0_split2, row0_num, row0_str, row0_num],
#                [row1_split2, row1_num, row1_str, row1_num], ...],
#       ...}

with open(fileName, mode='r', encoding="utf-8-sig") as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
        col1 = row[0]
        # check size & split
        # decide number of new csvs
        # overwrite original csv
        # store new content in writecsvs dict
        pass

for name, writelines in writecsvs.items():  # loop over each csv in writecsvs
    out_file = open(name, mode='w')  # use the keys in writecsvs for filenames
    for line in writelines:
        out_file.write(line)
    out_file.close()
Hope this helps.
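For the splitting step itself, a minimal sketch; the 20-character limit comes from the question, and the helper name is made up for illustration:
def split_into_chunks(text, size=20):
    """Split text into consecutive chunks of at most size characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# chunks[0] stays in the original csv, chunks[1] goes to the second csv,
# and any further chunks would each go to an additional csv.
chunks = split_into_chunks("long string with some special characters")
print(chunks)  # ['long string with som', 'e special characters']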
I have a CSV file, which I created using an HTML export from a Check Point firewall policy.
In some cases, each rule is represented as several lines. That occurs when a rule has several source addresses, destinations, or services.
I need the output to have each rule described in only one line.
It's easy to distinguish when each rule begins. In the first column, there's the rule ID, which is a number.
Here's an example. In green are marked the strings that should be moved:
http://i.imgur.com/i785sDi.jpg
Let me show you an example:
NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;
1;;fwgcluster;mcast_vrrp;vrrp;accept;
;;;;igmp;;
2;Testing;fwgcluster;fwgcluster;FireWall;accept;
;;fwmgmpe;fwmgmpe;ssh;;
;;fwmgm;fwmgm;;;
What I need, explained in pseudocode, is this:
Read the first column of the next line. If there's a number:
Evaluate the first column of the line after it. If there's no number there, concatenate (separating with a comma) the strings in the columns of this line with the last one, and eliminate the text in the current one.
The output should be something like this:
NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;
1;;fwgcluster;mcast_vrrp;vrrp-igmp;accept;
;;;;;;
2;Testing;fwgcluster-fwmgmpe-fwmgm;fwgcluster-fwmgmpe-fwmgm;FireWall-ssh;accept;
;;;;;;
The empty lines are there only to be more clear, I don't actually need them.
Thanks!
This should get you started:
import csv
with open('data.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=';')
    for r in reader:
        print(r)
EDIT: Given your required output, this should get you nearly there. It's a bit crude, but it does the majority of what you need. It checks the 'NO.' key, and if it has a value it starts a new record; if not, it joins any other data in the row with the equivalent data in the record. Finally, when a new record is created the old one is appended to the result; this also happens at the end to catch the last item.
import csv

result, record = [], None
with open('data2.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=';', lineterminator='\n')
    for r in reader:
        if r['NO.']:
            if record:
                result.append(record)
            record = r
        else:
            for key in r.keys():
                if r[key]:
                    record[key] = '-'.join([record[key], r[key]])
if record:
    result.append(record)
print(result)
Graeme, thanks again! Just before your edit I solved it with the following code, but you got me looking in the right direction!
If anyone needs it, here it is:
import csv

# adjust these 3 lines
WRITE_EMPTIES = False
INFILE = "input.csv"
OUTFILE = "output.csv"

with open(INFILE, "r") as in_file:
    r = csv.reader(in_file, delimiter=";")
    with open(OUTFILE, "wb") as out_file:
        previous = None
        empties_to_write = 0
        out_writer = csv.writer(out_file, delimiter=";")
        for i, row in enumerate(r):
            first_val = row[0].strip()
            if first_val:
                if previous:
                    out_writer.writerow(previous)
                    if WRITE_EMPTIES and empties_to_write:
                        out_writer.writerows(
                            [["" for _ in previous]] * empties_to_write
                        )
                        empties_to_write = 0
                previous = row
            else:  # append sub-portions to each other
                previous = [
                    "|".join(
                        subitem
                        for subitem in existing.split(",") + [new]
                        if subitem
                    )
                    for existing, new in zip(previous, row)
                ]
                empties_to_write += 1
        if previous:  # take care of the last row
            out_writer.writerow(previous)
            if WRITE_EMPTIES and empties_to_write:
                out_writer.writerows(
                    [["" for _ in previous]] * empties_to_write
                )
I want to get data from a table in a text file into a python array. The text file that I am using as an input has 7 columns and 31 rows. Here is an example of the first two rows:
10672 34.332875 5.360831 0.00004035881220 0.00000515052523 4.52E-07 6.5E-07
12709 40.837833 19.429158 0.00012010938453 -0.00000506426720 7.76E-06 2.9E-07
The code that I have tried to write isn't working: it is not reading one line at a time as it goes through the for loop.
data = []
f = open('hyadeserr.txt', 'r')
while True:
    eof = "no"
    array = []
    for i in range(7):
        line = f.readline()
        word = line.split()
        if len(word) == 0:
            eof = "yes"
        else:
            array.append(float(word[0]))
    print(array)
    if eof == "yes":
        break
    data.append(array)
Any help would be greatly appreciated.
A file with space-separated values is just a dialect of the classic comma-separated values (CSV) file, where the delimiter is a space (' '), possibly followed by more spaces, which can be ignored.
Happily, Python comes with a csv.reader class that understands dialects.
You should use this:
Example:
#!/usr/bin/env python
import csv

csv.register_dialect('ssv', delimiter=' ', skipinitialspace=True)

data = []
with open('hyadeserr.txt', 'r') as f:
    reader = csv.reader(f, 'ssv')
    for row in reader:
        floats = [float(column) for column in row]
        data.append(floats)

print(data)
If you don't want to use csv here, since you don't really need it:
data = []
with open("hyadeserr.txt") as file:
    for line in file:
        data.append([float(f) for f in line.strip().split()])
Or, if you know for sure that the only extra chars are spaces and line ending \n, you can turn the last line into:
data.append([float(f) for f in line[:-1].split()])
How can I skip the header row and start reading a file from line 2?
with open(fname) as f:
    next(f)
    for line in f:
        # do something
        pass
f = open(fname,'r')
lines = f.readlines()[1:]
f.close()
If you want the first line and then want to perform some operation on the rest of the file, this code will be helpful:
with open(filename, 'r') as f:
    first_line = f.readline()
    for line in f:
        # perform some operations
        pass
If slicing could work on iterators...
from itertools import islice

with open(fname) as f:
    for line in islice(f, 1, None):
        pass
f = open(fname).readlines()
firstLine = f.pop(0)  # removes the first line
for line in f:
    ...
To generalize the task of reading multiple header lines and to improve readability I'd use method extraction. Suppose you wanted to tokenize the first three lines of coordinates.txt to use as header information.
Example
coordinates.txt
---------------
Name,Longitude,Latitude,Elevation, Comments
String, Decimal Deg., Decimal Deg., Meters, String
Euler's Town,7.58857,47.559537,0, "Blah"
Faneuil Hall,-71.054773,42.360217,0
Yellowstone National Park,-110.588455,44.427963,0
Then method extraction allows you to specify what you want to do with the header information (in this example we simply tokenize the header lines based on the comma and return them as a list, but there's room to do much more).
def __readheader(filehandle, numberheaderlines=1):
    """Reads the specified number of lines and returns the comma-delimited
    strings on each line as a list"""
    for _ in range(numberheaderlines):
        yield list(map(str.strip, filehandle.readline().strip().split(',')))

with open('coordinates.txt', 'r') as rh:
    # Single header line
    # print(next(__readheader(rh)))
    # Multiple header lines
    for headerline in __readheader(rh, numberheaderlines=2):
        print(headerline)  # or do other stuff with the headerline tokens
Output
['Name', 'Longitude', 'Latitude', 'Elevation', 'Comments']
['String', 'Decimal Deg.', 'Decimal Deg.', 'Meters', 'String']
If coordinates.txt contains another header line, simply change numberheaderlines. Best of all, it's clear what __readheader(rh, numberheaderlines=2) is doing, and we avoid the ambiguity of having to figure out or comment on why the author of the accepted answer uses next() in his code.
If you want to read multiple CSV files starting from line 2, this works like a charm
for files in csv_file_list:
with open(files, 'r') as r:
next(r) #skip headers
rr = csv.reader(r)
for row in rr:
#do something
(this is part of Parfait's answer to a different question)
# Open a connection to the file
with open('world_dev_ind.csv') as file:
    # Skip the column names
    file.readline()
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    # Process only the first 1000 rows
    for j in range(0, 1000):
        # Split the current line into a list: line
        line = file.readline().split(',')
        # Get the value for the first column: first_col
        first_col = line[0]
        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)