I don't know if this is possible. I am trying to append 12 files into a single file. One of the files is tab-delimited and the rest are comma-delimited. I loaded all 12 files into dataframes and appended each one to an empty dataframe in a loop.
list_of_files = glob.glob('./*.txt')
df = pd.DataFrame()
for filename in list_of_files:
    file = pd.read_csv(filename)
    dfFilename = pd.DataFrame(file)
    df = df.append(dfFilename, ignore_index=True)
But the big file is not in the format I want, and I think the problem is the tab-delimited file: when I ran the code without it, the format of the appended file was fine. So I was wondering if it is possible to convert the tab-delimited file to comma-delimited using pandas.
Thank you for your help and suggestion
You need to tell pandas that the file is tab-delimited when you import it. You can pass a delimiter to the read_csv method, but in your case, since the delimiter changes by file, you want to pass None; this will make pandas auto-detect the correct delimiter (it uses the slower python parsing engine to do the sniffing).
Change your read_csv line to:
pd.read_csv(filename, sep=None, engine='python')
For the file that is tab-separated, you should use:
file = pd.read_csv(filename, sep="\t")
Pandas read_csv has quite a lot of parameters; check them out in the docs.
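Putting the answer together, a minimal sketch of the whole loop might look like this. It assumes pandas 1.x or later and collects the frames with pd.concat, since DataFrame.append is deprecated and was removed in pandas 2.0; the helper name load_mixed is illustrative.

```python
import glob
import pandas as pd

def load_mixed(paths):
    # sep=None makes pandas sniff each file's delimiter (comma, tab, ...),
    # which requires the pure-python parsing engine.
    frames = [pd.read_csv(p, sep=None, engine='python') for p in paths]
    # Concatenate once at the end instead of appending inside the loop.
    return pd.concat(frames, ignore_index=True)

# df = load_mixed(glob.glob('./*.txt'))
```

Reading all frames first and concatenating once is also faster than growing a dataframe row-block by row-block.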
I'm loading a .csv from an Excel workbook and running some data cleaning on the file. However, the issue I'm having is that there are commas everywhere in the data, and this is shifting my data around and breaking the schema. This is how I'm converting the df to a CSV (I'm also not sure what quotechar is doing).
Does anyone know how to stop commas ANYWHERE in a dataframe from displacing columns?
gas_data.to_csv('Clean_zenos_data_' + datetime.datetime.today().strftime('%m%d%Y%H%M%S') + '.csv',
                index=False, quotechar="'")
Here is a link to an example file with my displaced data; the name is causing the comma in this example, but my true data set has commas in all columns.
Use a different delimiter, like a tab.
fname = 'Clean_zenos_data_' + datetime.datetime.today().strftime('%m%d%Y%H%M%S')+'.tsv'
gas_data.to_csv(fname, index=False, sep='\t', quotechar="'")
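Alternatively, if the file must stay comma-separated, note that to_csv's default quoting (with the standard double-quote character) already keeps embedded commas inside a single field; quotechar="'" only helps if whatever reads the file also expects single quotes. A minimal sketch with illustrative data:

```python
import io
import pandas as pd

# Hypothetical row whose 'name' value contains a comma.
gas_data = pd.DataFrame({'name': ['Smith, John'], 'reading': [42]})

# Default quoting wraps any field containing a comma in double quotes,
# so the embedded comma no longer splits the column on read-back.
buf = io.StringIO()
gas_data.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)
```

The round trip preserves the comma because pd.read_csv understands the same quoting convention by default.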
Trying to whip this out in Python. Long story short, I have a CSV file that contains column data I need to inject into another file that is pipe-delimited. My understanding is that Python can't replace values in place, so I have to re-write the whole file with the new values.
data file (csv):
value1,value2,iwantthisvalue3
source file (txt, | delimited):
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file (txt, | delimited):
samevalue1|samevalue2|replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt (broken code):
import re
import csv

result = []
row = []

with open("C:\data\generatedfixed.csv", "r") as data_file:
    for line in data_file:
        fields = line.split(',')
        result.append(fields[2])

with open("C:\data\data.txt", "r") as source_file, open("C:\data\data_fixed.txt", "w") as fixed_file:
    for line in source_file:
        fields = line.split('|')
        n = 0
        for value in result:
            fields[2] = result[n]
            n = n + 1
        row.append(line)
    for value in row:
        fixed_file.write(row)
I would highly suggest you use the pandas package here; it makes handling tabular data very easy and would help you a lot in this case. Once you have installed pandas, import it with:
import pandas as pd
To read the files simply use:
data_file = pd.read_csv(r"C:\data\generatedfixed.csv")
source_file = pd.read_csv(r"C:\data\data.txt", delimiter="|")
and after that, manipulating these two files is easy. I'm not exactly sure how many values or which ones you want to replace, but if both columns have the same number of rows, then this should do the trick:
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3']
Now all you need to do is save the dataframe (the table we just updated) to a file. Since you want a .txt file with "|" as the delimiter, this is the line to do that (though you can customize how it is saved in many ways):
source_file.to_csv(r"C:\data\data_fixed.txt", sep='|', index=False)
Let me know if everything works and whether this helped you. I would also encourage you to read up (or watch some videos) on pandas if you're planning to work with tabular data; it is an awesome library with great documentation and functionality.
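For reference, the steps above can be sketched end to end. The column names and file contents here are illustrative stand-ins (in-memory buffers instead of the real paths), and it assumes both files have header rows and the same number of data rows:

```python
import io
import pandas as pd

# Stand-ins for the two files; in practice use the real paths.
data_csv = io.StringIO('col1,col2,iwantthisvalue3\nvalue1,value2,new3\n')
source_txt = io.StringIO('col1|col2|iwanttoreplacethisvalue3|col4\nvalue1|value2|old3|value4\n')

data_file = pd.read_csv(data_csv)
source_file = pd.read_csv(source_txt, delimiter='|')

# Overwrite the target column with the values from the data file
# (rows are matched by position, i.e. by index).
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3']

out = io.StringIO()
source_file.to_csv(out, sep='|', index=False)
```

Because the assignment aligns on the index, both files must list their rows in the same order for this to substitute the right values.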
My data in Excel is not separated by ","; it is Twitter data separated into columns. When I load it in Python, it is automatically read into a DataFrame, and the tweets are not shown in full text. How can I overcome this?
If you have a copy open in Excel, the easiest solution would be to save a copy as a csv.
File -> Save As -> dropdown and select CSV.
But pandas also allows you to read excel files. This would be recommended if you have a lot of files and don't want to convert all of them.
df = pd.read_excel(<file>)
Now, if you're saying it isn't .xlsx and also not .csv, but you know the delimiter, then:
df = pd.read_csv(<file>, delimiter='\t') # for tab delimited, but you can change '\t' to any delimiter
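As for the tweets not showing in full: if the text is loaded correctly but cut off with "..." on screen, that is a pandas display setting rather than a parsing problem. Raising the column-width limit (None means unlimited) prints the whole cell; the dataframe here is an illustrative stand-in:

```python
import pandas as pd

# A stand-in for a tweet column with long text.
df = pd.DataFrame({'tweet': ['a very long tweet ' * 10]})

# By default pandas truncates wide cells at 50 characters with '...';
# setting display.max_colwidth to None removes the limit globally.
pd.set_option('display.max_colwidth', None)
print(df)
```

Note that set_option changes the display globally for the session; you can restore the default afterwards with pd.reset_option('display.max_colwidth').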
import pandas as pd
check = pd.read_csv('1.csv')
nocheck = check['CUSIP'].str[:-1]
nocheck = nocheck.to_frame()
nocheck['CUSIP'] = nocheck['CUSIP'].astype(str)
nocheck.to_csv('NoCheck.csv')
This works but while writing the csv, a value for an identifier like 0003418 (type = str) converts to 3418 (type = general) when the csv file is opened in Excel. How do I avoid this?
I couldn't find a dupe for this question, so I'll post my comment as a solution.
This is an Excel issue, not a python error. Excel autoformats numeric columns to remove leading 0's. You can "fix" this by forcing pandas to quote when writing:
import csv
# insert pandas code from question here
# use csv.QUOTE_ALL when writing CSV.
nocheck.to_csv('NoCheck.csv', quoting=csv.QUOTE_ALL)
Note that this will actually put quotes around each value in your CSV. It will render the way you want in Excel, but you may run into issues if you try to read the file some other way.
Another solution is to write the CSV without quoting, and change the cell format in Excel to "General" instead of "Numeric".
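The same leading-zero loss can also happen on the pandas side if the CSV is read back later, because read_csv infers an integer column by default. Forcing the identifier column to be parsed as text preserves it exactly; the file here is an in-memory stand-in:

```python
import io
import pandas as pd

# Stand-in for a CSV containing an identifier with leading zeros.
csv_data = io.StringIO('CUSIP\n0003418\n')

# Without dtype, pandas would infer int64 and store 3418;
# dtype=str keeps the value exactly as written in the file.
check = pd.read_csv(csv_data, dtype={'CUSIP': str})
```

This complements the quoting fix above: quoting protects the value in Excel, dtype protects it when pandas reads the file again.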
I have a .tsf file.
I want to read it to a dataframe in pandas through a specified path.
How can I do that?
If by TSF you mean Tab Separated Fields, then you need to use pandas.read_csv('filename.tsf', sep='\t').
The sep='\t' will tell pandas that the fields are separated by tabs.