I have quite a messy txt file which I need to convert to a dataframe to use as reference data. An excerpt is shown below:
http://amdc.in2p3.fr/nubase/nubase2016.txt
I've cleaned it up as best I can. To cut a long story short, I would like to split most of each line on spaces and then treat the last column as fixed width, i.e. ignore the spaces in the last section.
Cleaned Data Text File
Can anyone point me in the right direction of a resource which can do this? Not sure if Pandas copes with this?
Kenny
P.S. I have found some great resources to clean up the multiple whitespaces and replace the line breaks. Sorry can't find the original reference, so see attached.
import re

fin = open("Input.txt", "rt")
fout = open("Ouput.txt", "wt")
for line in fin:
    # collapse runs of spaces into one space and trim the line
    fout.write(re.sub(' +', ' ', line).strip() + "\n")
fin.close()
fout.close()
What I would do is very simple: clean up the data as much as possible and then convert it to a CSV file, because those are easy to work with. Then I would load it into a pandas DataFrame step by step and adjust it where needed.
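import csv

with open("NudatClean.txt") as f:
    text = f.readlines()

with open('dat.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for i in text:
        # drop the trailing newline, then split on single spaces
        l = i.strip().split(' ')
        row = []
        for a in l:
            # skip the empty strings left behind by repeated spaces
            if a != '':
                row.append(a)
        print(row)
        writer.writerow(row)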
That should do the job to begin with. I don't know exactly what you want to remove, but the rest should be fairly straightforward from there.
The way I managed to do this was split the csv into two parts then recombine. Not particularly elegant but did the job I needed.
Split by Column
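One way to avoid the split-and-recombine step might be to cap the number of whitespace splits, so that everything after the last space-delimited field stays in one piece. This is only a minimal sketch: the sample lines, the column count of 3, and the helper name split_line are made up for illustration and are not taken from the NUBASE file.

```python
def split_line(line, n_leading):
    """Split the first n_leading fields on whitespace; keep the rest intact."""
    parts = line.strip().split(None, n_leading)
    # Pad short lines so every row has n_leading + 1 entries.
    parts += [""] * (n_leading + 1 - len(parts))
    return parts

# Hypothetical sample lines: three space-delimited fields, then free text.
sample = [
    "001 0000 1n B-=100 extra text",
    "001 0010 1H IS=99.9885 70",
]
rows = [split_line(line, 3) for line in sample]
for row in rows:
    print(row)
```

The resulting list of lists can be handed to pandas.DataFrame(rows, columns=[...]) with whatever column names suit the data. If the original, uncleaned file really is fixed width throughout, pandas.read_fwf may also be worth a look.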
Related
I'm trying to load the following test.csv file:
R1C1 R1C2 R1C3
R2C1 R2C2 R2C3
R3C1 "R3C2 R3C3
R4C1 R4C2 R4C3
... Using this Python script :
import csv

with open("test.csv") as f:
    for row in csv.reader(f, delimiter='\t'):
        print(row)
The result I got was the following :
['R1C1', 'R1C2', 'R1C3']
['R2C1', 'R2C2', 'R2C3']
['R3C1', 'R3C2\tR3C3\nR4C1\tR4C2\tR4C3\n']
It turns out that when Python finds a field whose first character is a quotation mark and there is no closing quotation mark, it will include all of the following content as part of the same field.
My question: What is the best approach for all rows in the file to be read properly? Please consider I'm using Python 3.8.5 and the script should be able to read huge files (2gb or more), so memory usage and performance issues should be also considered.
Thanks!
Honestly, if you're dealing with that much data, it'd be best to go in and clean it first. And if possible, fix whatever process is producing your bad data in the first place.
I haven't tested with a large file, but you may just be able to replace " characters as you read lines, assuming there's never a case where they're valid characters:
import csv

with open("test.csv") as f:
    # strip quote characters lazily, one line at a time
    line_generator = (line.replace('"', '') for line in f)
    for row in csv.reader(line_generator, delimiter='\t'):
        print(row)
Output:
['R1C1', 'R1C2', 'R1C3']
['R2C1', 'R2C2', 'R2C3']
['R3C1', 'R3C2', 'R3C3']
['R4C1', 'R4C2', 'R4C3']
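If the quote characters should be kept rather than stripped, another option is to tell the reader not to treat them specially at all, using quoting=csv.QUOTE_NONE. A minimal sketch, with an in-memory string standing in for test.csv and a hypothetical helper name read_rows:

```python
import csv
import io

# Stand-in for the tab-separated file from the question (note the stray quote).
sample = 'R1C1\tR1C2\tR1C3\nR2C1\tR2C2\tR2C3\nR3C1\t"R3C2\tR3C3\nR4C1\tR4C2\tR4C3\n'

def read_rows(stream):
    # QUOTE_NONE makes the reader treat quote characters as ordinary data.
    return csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)

rows = list(read_rows(io.StringIO(sample)))
for row in rows:
    print(row)
```

The reader still works through the file one line at a time, so memory use should stay flat for large inputs as long as the rows are consumed in a loop rather than collected into a list as done here for the demo. The stray quote remains part of the field ('"R3C2'), which may or may not be what you want.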
I'm doing some measurements in the lab and want to transform them into some nice Python plots. The problem is the way the software exports CSV files, as I can't find a way to properly read the numbers. It looks like this:
-10;-0,0000026
-8;-0,00000139
-6;-0,000000546
-4;-0,000000112
-2;-5,11E-09
0,0000048;6,21E-09
2;0,000000318
4;0,00000304
6;0,0000129
8;0,0000724
10;0,000268
Separation by ; is fine, but I need every , to be ..
Ideally I would like Python to be able to read numbers such as 6.21E-09 as well, but I should be able to fix that in excel...
My main issue: Change every , to . so Python can read them as a float.
The simplest way would be to treat each value as a string and use the .replace() method to swap the comma for a point. For example:
txt = "0,0000048;6,21E-09"
txt = txt.replace(',', '.')
You could also read the CSV file with a library (I don't know how you are reading the file); depending on the library, you can change the delimiter (to ; for example). CSV stands for comma-separated values and, as the name implies, columns are separated by ',' by default.
You can do whatever you want in Python, for example:
import csv

with open('path_to_csv_file', 'r', newline='') as csv_file:
    data = list(csv.reader(csv_file, delimiter=';'))

# either column may use a decimal comma, so convert every field to float
data = [tuple(float(field.replace(',', '.')) for field in row) for row in data]

with open('path_to_csv_file', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=';')
    writer.writerows(data)
You could also consider a regex that matches every ',' in the text and replaces each match with '.'.
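The regex idea could look roughly like this. It is a sketch that assumes every comma sitting between two digits is a decimal separator; the helper name parse_line is made up, and the sample lines are copied from the question.

```python
import re

# A comma with a digit on both sides is assumed to be a decimal separator.
DECIMAL_COMMA = re.compile(r"(?<=\d),(?=\d)")

def parse_line(line):
    """Turn one 'x;y' line with decimal commas into a list of floats."""
    fixed = DECIMAL_COMMA.sub(".", line.strip())
    return [float(field) for field in fixed.split(";")]

sample = ["-10;-0,0000026", "-2;-5,11E-09", "0,0000048;6,21E-09"]
values = [parse_line(line) for line in sample]
print(values)
```

float() already understands scientific notation such as 6.21E-09, so no extra step in Excel should be needed for those values.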
I have an input file that contains data in the same format repeatedly across 5 rows. I need to format this data into one row (CSV file) and have only few fields relevant to me. How do i achieve the mentioned output with the input file provided.
Note - I'm very new to programming and haven't reached this level of detail yet, so I can't write this on my own. I have already written code that imports the input file, finds a specific word and then prints the rest of the data (this is where I need help: I don't need all the information in the input, and using a space as the delimiter does not put the output in the correct columns). I have also written the code to write the output to a CSV file.
Note 2 - I'm very new to this forum as well, so kindly excuse me in case I have made any mistakes in posting my query.
Input -
Input File
Output -
Output File
import itertools, csv
You should read in the file and parse it manually, then use the csv module to write it to a .csv file:
import csv
import re

with open('myfile.txt', 'r') as f:
    lines = f.readlines()

# divide on runs of two or more whitespace characters, but not single spaces
lines = [re.split(r"\s\s+", line.strip()) for line in lines]

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for line in lines:
        # write one parsed line per CSV row
        writer.writerow(line)
But this will include every piece of data. You can iterate through lines and remove the fields you don't want to keep. So before you do the csv writing, you could do:
def filter_line(line):
    # see how the input file was parsed
    print(line)
    # for example, only keep the first 2 columns
    return [line[0], line[1]]

lines = [filter_line(line) for line in lines]
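Since the question describes records that repeat every 5 rows, one possible way to get one CSV row per record is to group the lines in blocks of 5 before filtering. The input file is not shown, so everything below is an assumption: the sample lines, the block size default, the kept positions, and the helper name group_records are all hypothetical.

```python
import csv
import io

def group_records(lines, size=5):
    """Yield consecutive blocks of `size` non-empty lines as one record."""
    cleaned = [line.strip() for line in lines if line.strip()]
    # A trailing block with fewer than `size` lines is dropped.
    for start in range(0, len(cleaned) - size + 1, size):
        yield cleaned[start:start + size]

# Hypothetical input: two records of five lines each.
sample = ["id 1", "name A", "x", "y", "z",
          "id 2", "name B", "x", "y", "z"]
keep = (0, 1)  # positions within each record that are worth keeping

buffer = io.StringIO()  # stands in for open('output.csv', 'w', newline='')
writer = csv.writer(buffer)
for record in group_records(sample):
    writer.writerow([record[i] for i in keep])
print(buffer.getvalue())
```

Each kept line could be split further with re.split as shown above if only part of a line is relevant.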
I have a plain .txt document that I'm splitting up into individual words, appending them into a list, and then I'm trying to write that list into an Excel file so that each word is a new cell in the file. No matter what I do though, my code keeps taking all the words in the string and puts them in one cell, instead of splitting it by word like I'd intended. If you can help, could you also help explain why your solution works or why mine was wrong? Thanks!
Here's what my code looks like right now:
import csv

list_of_words = []
with open('ExampleText.txt', 'r') as ExampleText:
    for line in ExampleText:
        for word in line.split():
            print(word)
            list_of_words.append(word)
    print("Done!")
print("Also done!")

with open('Gazete.csv', 'wb') as WordsFromText:
    writer = csv.writer(WordsFromText, delimiter=' ', dialect='excel')
    writer.writerow(list_of_words)
Excel defaults to comma or tab delimited when opening from CSV regardless of what your delimiter was set to during the export from Python. Try using a delimiter of ',' if you are trying to put these words into a separate cell in the same row.
Remove the delimiter parameter from your call to csv.writer.
writer = csv.writer(WordsFromText, dialect='excel')
You don't need one when a dialect is specified.
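To see what the writer produces without the custom delimiter, here is a small sketch that writes to an in-memory buffer instead of Gazete.csv. It assumes Python 3, where the real file would be opened with open('Gazete.csv', 'w', newline='') rather than 'wb'; the sample sentence is made up.

```python
import csv
import io

words = "The quick brown fox".split()

# StringIO stands in for a file opened with open(..., 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer, dialect="excel")
writer.writerow(words)  # one row, one word per field

print(repr(buffer.getvalue()))
```

The words come out separated by commas on a single line, which Excel splits into separate cells of one row when it opens the file.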
I have an excel sheet that has a lot of data in it in one column in the form of a python dictionary from a sql database. I don't have access to the original database and I can't import the CSV back into sql with the local infile command due to the fact that the keys/values on each row of the CSV are not in the same order. When I export the excel sheet to CSV I get:
"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"
What is the best way to remove the " before and after the curly brackets as well as the extra " around the keys/values?
I also need to leave the integers alone that don't have quotes around them.
I am trying to then import this into python with the json module so that I can print specific keys but I can't import them with the doubled double quotes. I ultimately need the data saved in a file that looks like:
{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
Any help is most appreciated!
Easy:
text = re.sub(r'"(?!")', '', text)
Given the input file: TEST.TXT:
"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"
The script:
import re

with open("TEST.TXT", "r") as f:
    text_in = f.read()
# remove every double quote that is not followed by another double quote
text_out = re.sub(r'"(?!")', '', text_in)
print(text_out)
produces the following output:
{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
This should do it:
import re

with open('old.csv') as old, open('new.csv', 'w') as new:
    new.writelines(re.sub(r'"(?!")', '', line) for line in old)
If the input file is just as shown, and of the small size you mention, you can load the whole file in memory, make the substitutions, and then save it. IMHO, you don't need a RegEx to do this. The easiest to read code that does this is:
with open(filename) as f:
    text = f.read()
# undo the doubled quotes, then drop the quotes wrapping each object
text = text.replace('""', '"')
text = text.replace('"{', '{')
text = text.replace('}"', '}')
with open(filename, "w") as f:
    f.write(text)
I tested it with the sample input and it produces:
{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
Which is exactly what you want.
If you want, you can also pack the code and write
with open(inputFilename) as fin:
    with open(outputFilename, "w") as fout:
        fout.write(fin.read().replace('""','"').replace('"{','{').replace('}"','}'))
but I think the first one is much clearer and both do exactly the same.
I think you are overthinking the problem. Why not simply replace the data?
l = list()
with open('foo.txt') as f:
    for line in f:
        l.append(line.replace('""','"').replace('"{','{').replace('}"','}'))
s = ''.join(l)
print(s)  # or save it to file
It generates:
{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
Use a list to store the intermediate lines and then call .join on it, which improves performance, as explained in "Good way to append to a string".
You can actually use the csv module and regex to do this:
st='''\
"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\
'''
import csv, re
data=[]
reader=csv.reader(st, dialect='excel')
for line in reader:
    data.extend(line)
# re-quote every word, then put each {...} object on its own line
s=re.sub(r'(\w+)',r'"\1"',''.join(data))
s=re.sub(r'({[^}]+})',r'\1\n',s).strip()
print(s)
Prints
{"first_name":"John","last_name":"Smith","age":"30"}
{"first_name":"Tim","last_name":"Johnson","age":"34"}
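Because the end goal is to load the rows with the json module, another option is to let csv.reader undo the CSV quoting line by line and pass each field straight to json.loads, which leaves the integers unquoted. A minimal sketch, with a list of strings standing in for the exported file:

```python
import csv
import json

# Stand-in for the lines of the exported CSV file.
lines = [
    '"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"',
    '"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"',
]

# Each CSV row holds a single field: the JSON text with the quoting undone.
records = [json.loads(row[0]) for row in csv.reader(lines) if row]

for record in records:
    # Compact separators reproduce the spacing used in the question.
    print(json.dumps(record, separators=(",", ":")))
```

With a real file, csv.reader can be given the open file object instead of the list, and the json.dumps output can be written to a new file one record per line.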