I need to process hundreds of fairly large CSV files. Each file contains 4 header lines followed by 864,000 lines of data and weighs more than 200 MB. Column types are most of the time recognized as object because missing values are indicated as "NAN" (with quotes). I want to perform a couple of operations on these data and export them to a new file in a format similar to the input file. To do so, I wrote the following code:
import pandas as pd

df = pd.read_csv(in_file, skiprows=[0, 2, 3])
# Get file header
with open(in_file, 'r') as fi:
    header = [next(fi) for x in range(4)]
# Write header to destination file
with open(out_file, 'w') as fo:
    for i_line in header:
        fo.write(i_line)
# Do some data transformation here
df = foobar(df)
# Append data to destination file
df.to_csv(out_file, header=False, index=False, mode='a')
I struggle to preserve the input format exactly. For instance, dates in the input files are formatted as "2019-08-28 00:00:00.2", but they are written to the output files as 2019-08-28 00:00:00.2, i.e. without the quotation marks.
The same goes for "NAN" values, which are rewritten without their quotes. Pandas wants to clean everything out.
I tried other variants that worked, but because of the file size, the running time was unreasonable.
Include the quoting parameter in to_csv, i.e. quoting=csv.QUOTE_NONNUMERIC or quoting=2,
so your to_csv statement will be as follows:
df.to_csv(out_file, header=False, index=False, mode='a', quoting=2)
Note: you need to import csv if you want to use csv.QUOTE_NONNUMERIC
More details about the parameters can be found in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
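Combined with keep_default_na=False on the read side (so pandas does not convert the quoted "NAN" strings to real NaN, an assumption about the desired behavior rather than something stated in the question), the quoting flag can make the round trip lossless. A minimal sketch with inline data:

```python
import csv
import io

import pandas as pd

# Inline sample in the same shape as the question: quoted strings, "NAN" markers.
raw = '"2019-08-28 00:00:00.2","NAN",1.5\n"2019-08-28 00:00:00.3","2.0",2.5\n'

# keep_default_na=False stops pandas from treating "NAN" as missing,
# so the marker survives as an ordinary string.
df = pd.read_csv(io.StringIO(raw), header=None, keep_default_na=False)

buf = io.StringIO()
# QUOTE_NONNUMERIC re-quotes every non-numeric field on the way out.
df.to_csv(buf, header=False, index=False, quoting=csv.QUOTE_NONNUMERIC)
print(buf.getvalue())
```

With this combination the output is byte-for-byte identical to the input sample, including the quotes around the dates and the "NAN" markers.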
Dear all, I often need to concatenate CSV files with identical headers (i.e. put them into one big file). Usually I just use pandas, but I now need to operate in an environment where I am not at liberty to install any library. The csv and the html libs do exist.
I also need to remove all remaining HTML entities like &amp; for the ampersand symbol, which are still present within the data. I do know in which columns these can come up.
I thought about doing it like this, and the concat part of my code seems to work fine:
import csv
import html

for file in files:  # files is a list of csv files.
    with open(file, "rt", encoding="utf-8-sig") as source, \
         open(outfilePath, "at", newline='', encoding='utf-8-sig') as result:
        d_reader = csv.DictReader(source, delimiter=";")
        # Set header based on first file in file_list:
        if file == files[0]:
            Common_header = d_reader.fieldnames
        # Define DictWriter object
        wtr = csv.DictWriter(result, fieldnames=Common_header, lineterminator='\n', delimiter=";")
        # Write header only once to empty file
        if result.tell() == 0:
            wtr.writeheader()
        # If I remove this block, I get my concatenated single csv file as a result.
        # However, the html tags/encoded symbols are still present.
        for row in d_reader:
            print(html.unescape(row['ColA']))  # This prints the unescaped values in the column correctly
            # If I keep these two lines, I get an empty file with just the header as a result of the concatenation
            row['ColA'] = html.unescape(row['ColA'])
            row['ColB'] = html.unescape(row['ColB'])
        wtr.writerows(d_reader)
I would have thought that simply supplying encoding='utf-8-sig' for the result file would be sufficient to get rid of the HTML symbols, but that does not work. If you could give me a hint about what I am doing wrong in my usage of html.unescape, that would be nice.
Thank you in advance
I'm loading a .csv from an Excel workbook and running some data cleaning on the file. However, the issue I'm having is that there are commas everywhere in the data, and this is moving my data around and breaking the schema. This is how I'm converting the df to a CSV (I'm also not sure what quotechar is doing).
Does anyone know how to stop commas ANYWHERE in a dataframe from displacing columns?
gas_data.to_csv('Clean_zenos_data_' + datetime.datetime.today().strftime('%m%d%Y%H%M%S') + '.csv',
                index=False, quotechar="'")
Here is a link to an example file with my displaced data. The name column is causing the comma in this example, but my true data set allows commas in all columns.
Use a different delimiter, like a tab.
fname = 'Clean_zenos_data_' + datetime.datetime.today().strftime('%m%d%Y%H%M%S')+'.tsv'
gas_data.to_csv(fname, index=False, sep='\t', quotechar="'")
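If the output has to stay comma-separated, another option (a sketch, not part of the answer above) is to keep the default double-quote quotechar rather than "'": pandas then quotes any field containing the delimiter, so embedded commas no longer displace columns on re-read:

```python
import io

import pandas as pd

# Hypothetical frame with embedded commas in one column.
df = pd.DataFrame({"name": ["Doe, John", "Roe, Jane"], "value": [1, 2]})

buf = io.StringIO()
# With the default quotechar ('"'), fields containing the delimiter are
# quoted automatically, so the schema survives a round trip.
df.to_csv(buf, index=False)

buf.seek(0)
back = pd.read_csv(buf)
print(back.shape)
```

The single quote in quotechar="'" only helps if the reader is told to use the same character; the CSV default on both sides is the double quote.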
Trying to whip this out in Python. Long story short, I have a csv file that contains column data I need to inject into another file that is pipe delimited. My understanding is that Python can't replace values in place, so I have to re-write the whole file with the new values.
data file (csv):
value1,value2,iwantthisvalue3
source file (txt, | delimited):
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file (txt, | delimited):
samevalue1|samevalue2|replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt (broken code):
import csv

result = []
row = []
with open(r"C:\data\generatedfixed.csv", "r") as data_file:
    for line in data_file:
        fields = line.split(',')
        result.append(fields[2])
with open(r"C:\data\data.txt", "r") as source_file, \
     open(r"C:\data\data_fixed.txt", "w") as fixed_file:
    for line in source_file:
        fields = line.split('|')
        n = 0
        for value in result:
            fields[2] = result[n]
            n = n + 1
        row.append(line)
    for value in row:
        fixed_file.write(value)
I would highly suggest you use the pandas package here; it makes handling tabular data very easy and would help you a lot in this case. Once you have installed pandas, import it with:
import pandas as pd
To read the files simply use:
data_file = pd.read_csv(r"C:\data\generatedfixed.csv")
source_file = pd.read_csv(r"C:\data\data.txt", delimiter="|")
and after that, manipulating these two files is easy. I'm not exactly sure how many values or which ones you want to replace, but if both "iwantthisvalue3" and "iwanttoreplacethisvalue3" have the same length, then this should do the trick:
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3']
Now all you need to do is save the dataframe (the table that we just updated) to a file. Since you want to save it to a .txt file with "|" as the delimiter, this is the line to do that (you can customize how to save it in a lot of ways):
source_file.to_csv(r"C:\data\data_fixed.txt", sep='|', index=False)
Let me know if everything works and whether this helped you. I would also encourage you to read up (or watch some videos) on pandas if you're planning to work with tabular data; it is an awesome library with great documentation and functionality.
I have a pandas dataframe of shape (455698, 62). I want to save it as a csv file, and load it again later with pandas. For now I do this :
df.to_csv("/path/to/file.csv",index=False,sep="\\", encoding='utf-8') #saving
df=pd.read_csv("/path/to/file.csv",delimiter="\\",encoding ='utf-8') #loading
and I get a dataframe with shape (455700, 62): 2 more lines? When I check in detail (looking at all unique values in each column), I find that some values changed columns in the process.
I've tried multiple separators and forcing dtype="object", and I can't figure out where the bug is. What should I try?
Is it possible that some of your strings contain the newline (\n) character?
In that case I would suggest using quoting when saving your CSV file:
import csv
df.to_csv("/path/to/file.csv",index=False,sep="\\", encoding='utf-8', quoting=csv.QUOTE_NONNUMERIC)
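A small self-contained demo of the idea (with made-up data, and the default comma separator rather than the "\\" used in the question): a cell containing a newline round-trips cleanly once quoting keeps it inside one field.

```python
import csv
import io

import pandas as pd

# Hypothetical frame with an embedded newline in one cell.
df = pd.DataFrame({"text": ["line1\nline2", "plain"], "n": [1, 2]})

buf = io.StringIO()
# QUOTE_NONNUMERIC wraps string fields in quotes, so the embedded
# newline stays inside a single field instead of starting a new record.
df.to_csv(buf, index=False, quoting=csv.QUOTE_NONNUMERIC)

buf.seek(0)
back = pd.read_csv(buf)
print(back.shape)
```

Without the quoting argument the behavior depends on the separator and reader; with it, the re-read frame has the original two rows rather than three.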
I don't know if this is possible. I am trying to append 12 files into a single file. One of the files is tab delimited and the rest are comma delimited. I load all 12 files into dataframes and append each one to an empty dataframe in a loop.
import glob

import pandas as pd

list_of_files = glob.glob('./*.txt')
df = pd.DataFrame()
for filename in list_of_files:
    file = pd.read_csv(filename)
    dfFilename = pd.DataFrame(file)
    df = df.append(dfFilename, ignore_index=True)
But the big file is not in the format I wanted it to be, and I think the problem is with the tab-delimited file. When I run the code without the tab-delimited file, the format of the appended file is fine. So I was wondering if it is possible to change the tab-delimited format into comma-delimited using pandas.
Thank you for your help and suggestions
You need to tell Pandas that the file is tab delimited when you import it. You can pass a delimiter to the read_csv method, but in your case, since the delimiter changes by file, you want to pass None; this makes Pandas auto-detect the correct delimiter (delimiter sniffing requires the Python parsing engine, so pass engine='python' as well).
Change your read_csv line to:
pd.read_csv(filename, sep=None, engine='python')
For the file that is tab-separated, you should use:
file = pd.read_csv(filename, sep="\t")
Pandas read_csv has quite a lot of parameters; check them out in the docs.
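A minimal sketch of the whole loop with sniffing, using in-memory stand-ins for a comma file and a tab file (and pd.concat in place of the asker's df.append, which was removed in pandas 2.0):

```python
import io

import pandas as pd

# Hypothetical stand-ins for one comma-delimited and one tab-delimited file
# that share the same columns.
comma_file = io.StringIO("a,b\n1,2\n")
tab_file = io.StringIO("a\tb\n3\t4\n")

frames = []
for f in [comma_file, tab_file]:
    # sep=None asks pandas to sniff the delimiter per file;
    # this only works with engine='python'.
    frames.append(pd.read_csv(f, sep=None, engine="python"))

combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

Writing `combined` back out with a plain `combined.to_csv(...)` then produces a uniformly comma-delimited result regardless of the input delimiters.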