Trying to read data from an Excel file
I have tried various ways but keep getting errors of different types.
import codecs
f = codecs.open('sampledata.xlsx', encoding='utf-8')
for line in f:
    print(repr(line))
The other way I tried is:
f = open(fname, encoding="ascii", errors="surrogateescape")
Still no luck. Any help?
Newer versions of Pandas support xlsx.
import pandas as pd

file_name = ...  # path to file + file name
sheet = ...      # sheet name, sheet number, or list of sheet numbers and names

df = pd.read_excel(io=file_name, sheet_name=sheet)
print(df.head(5))  # print first 5 rows of the dataframe
Works great, especially if you're working with many sheets.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
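As an aside, the reason the codecs.open / open attempts above fail is that an .xlsx file is a zipped XML container rather than plain text, so its raw bytes can't be decoded as UTF-8 or ASCII. And if you are juggling many sheets, sheet_name=None loads them all at once; a minimal sketch, assuming the same sampledata.xlsx:
import pandas as pd

# sheet_name=None returns a dict mapping each sheet name to its DataFrame
sheets = pd.read_excel('sampledata.xlsx', sheet_name=None)
for name, df in sheets.items():
    print(name, df.shape)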
The problem:
I am trying to reproduce results from a YouTube course by Keith Galli.
import pandas as pd
import os
import csv
input_loc = "./SalesAnalysis/Sales_Data/"
output_loc = "./SalesAnalysis/korbi_output/"
fileList = os.listdir(input_loc)
all_months_data = pd.DataFrame()
The problem probably starts here:
for file in fileList:
    if file.endswith(".csv"):
        df = pd.read_csv(input_loc + file)
        all_months_data = all_months_data.append(df)  # note: DataFrame.append was removed in pandas 2.0; pd.concat is the modern replacement
all_months_data.to_csv(output_loc + "all_months_data.csv")
all_months_data.head()
This is my output, and I don't want row 1 to be displayed because it contains no data:
The issue seems to be line 3 in one of my csv files. A3 is empty except for commas:
So I go to the csv file and delete cell A3. I run the code again and get this:
instead of this:
What do I have to do to remove the cells without value and to still display everything correctly?
I did not understand why these weird problems occurred, but I figured out a workaround to change the data and save everything in a new csv file:
all_months_data_cleaned = all_months_data.copy()
all_months_data_cleaned = all_months_data_cleaned.dropna()
all_months_data_cleaned.reset_index(drop=True, inplace=True)
all_months_data_cleaned.to_csv(output_loc+"all_months_data_cleaned.csv")
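A gentler variant of this workaround, in case some rows are only partially filled: dropna() with no arguments drops every row that contains any NaN, which can also remove valid rows that merely have one missing field. Passing how='all' drops only rows where every value is missing, which matches the "empty except for commas" lines; a sketch:
# drop only rows where every column is NaN (the ",,,,," lines),
# keeping rows that just have some missing fields
all_months_data_cleaned = all_months_data.dropna(how='all')
all_months_data_cleaned.reset_index(drop=True, inplace=True)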
Weird (and possibly bad) question: when using the pd.read_csv() method, it appears as if pandas separates each character into its own row.
My code:
import pandas as pd
import csv
# doing this in colab
from google.colab import files
# downloading a csv with my apps sdk
data = sdk.run_look('2131','csv')
with open('big.csv', 'w') as file:
    csvwriter = csv.writer(file, delimiter=',')
    csvwriter.writerows(data)
df = pd.read_csv('big.csv', delimiter=',')
files.download('big.csv')
Output:
What I'm getting from line 4 (data=sdk...) looks like this:
Orders Status,Orders Count
complete,31377
pending,505
cancelled,375
However, what I get back from pandas looks like this:
0 r
1 d
2 e
3 r
4 s
...
I think it's line 6 (df = pd.read_csv(...)), because when I print(data) and compare with print(df.head()), print(data) shows the correct data while print(df.head()) shows the weirdly formatted output.
Any idea of what I'm doing wrong here? I'm a complete noob, so probably pebkac :)
It appears that your data is just coming in as a big string. If this is the case, you don't need to use the csv writer at all, and can just write it directly to your out file.
import pandas as pd
# doing this in colab
from google.colab import files
# downloading a csv with my apps sdk
data = sdk.run_look('2131', 'csv')
with open('big.csv', 'w') as file:
    file.write(data)
df = pd.read_csv('big.csv', delimiter=',')
files.download('big.csv')
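If you don't actually need big.csv on disk, another option is to skip the intermediate file and parse the string in memory; a sketch, assuming data is the CSV text returned by sdk.run_look:
import io
import pandas as pd

# wrap the CSV string in a file-like object and let pandas parse it directly
df = pd.read_csv(io.StringIO(data))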
I am trying to come up with a script that will allow me to read all csv files larger than 62 bytes, print two columns into a separate excel file, and create a list.
The following is one of the csv files:
FileUUID Table RowInJSON JSONVariable Error Notes SQLExecuted
ff3ca629-2e9c-45f7-85f1-a3dfc637dd81 lng02_rpt_b_calvedets 1 Duplicate entry 'ETH0007805440544' for key 'nosameanimalid' INSERT INTO lng02_rpt_b_calvedets(farmermobile,hh_id,rpt_b_calvedets_rowid,damidyesno,damid,calfdam_id,damtagid,calvdatealv,calvtype,calvtypeoth,easecalv,easecalvoth,birthtyp,sex,siretype,aiprov,othaiprov,strawidyesno,strawid) VALUES ('0974502779','1','1','0','ETH0007805440544','ETH0007805470547',NULL,'2017-09-16','1',NULL,'1',NULL,'1','2','1',NULL,NULL,NULL,NULL,NULL,'0',NULL,NULL,NULL,NULL,NULL,NULL,'0',NULL,'Tv',NULL,NULL,'Et','23',NULL,'5',NULL,NULL,NULL,'0','0')
This is my attempt at solving this problem:
import csv
import glob
import os

path = 'csvs/'
for infile in glob.glob(os.path.join(path, '*.csv')):
    output = infile + '.out'
    with open(infile, 'r') as source:
        readr = csv.reader(source)
        with open(output, 'w') as result:
            writr = csv.writer(result)
            for r in readr:
                writr.writerow((r[4], r[2]))
Please help point me in the right direction, or suggest an alternative solution.
pandas does a lot of what you are trying to achieve:
import pandas as pd
# Read a csv file to a dataframe
df = pd.read_csv("<path-to-csv>")
# Filter two columns
columns = ["FileUUID", "Table"]
df = df[columns]
# Combine multiple dataframes
df_combined = pd.concat([df1, df2, df3, ...])
# Output dataframe to excel file
df_combined.to_excel("<output-path>", index=False)
To loop through all csv files larger than 62 bytes, you can use glob.glob() and os.stat():
import os
import glob
import pandas as pd

dataframes = []
for csvfile in glob.glob("<csv-folder-path>/*.csv"):
    if os.stat(csvfile).st_size > 62:
        dataframes.append(pd.read_csv(csvfile))
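Putting the pieces together, a short end-to-end sketch (the two column names come from the sample header above; "csvs/" and "output.xlsx" are placeholder paths):
import glob
import os
import pandas as pd

dataframes = []
for csvfile in glob.glob("csvs/*.csv"):
    if os.stat(csvfile).st_size > 62:  # skip files of 62 bytes or smaller
        df = pd.read_csv(csvfile)
        dataframes.append(df[["FileUUID", "Table"]])  # keep just the two columns

pd.concat(dataframes).to_excel("output.xlsx", index=False)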
Use the standard csv module. Don't re-invent the wheel.
https://docs.python.org/3/library/csv.html
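For instance, a bare-bones standard-library sketch of the same task (column indices 4 and 2 are from the attempt above; the length check guards against short or blank rows):
import csv
import glob
import os

for infile in glob.glob(os.path.join('csvs/', '*.csv')):
    with open(infile, newline='') as source, \
         open(infile + '.out', 'w', newline='') as result:
        writr = csv.writer(result)
        for r in csv.reader(source):
            if len(r) > 4:  # skip rows too short to have column 4
                writr.writerow((r[4], r[2]))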
I am relatively new to Python (about a week's experience) and I can't seem to find the answer to my problem.
I am trying to merge hundreds of csv files based in my folder Data into a single csv file based on column name.
The solutions I have found require me to type out either each file name or the column headers, which would take days.
I used this code to create one csv file, but the column names move around, so the data is not in the same columns across the whole DataFrame:
import pandas as pd
import glob
import os

def concatenate(indir=r"C:\\Users\ge\Documents\d\de",
                outfile=r"C:\Users\ge\Documents\d"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename, header=None)
        dfList.append(df)
    concatDf = pd.concat(dfList, axis=0)
    concatDf.to_csv(outfile, index=None)
Is there a quick method to do this? I have less than a week to run statistics on the dataset.
Any help would be appreciated.
Here is one memory-efficient way to do that.
from pathlib import Path
import csv

indir = Path(r'C:\\Users\gerardchurch\Documents\Data\dev_en')
outfile = Path(r"C:\\Users\gerardchurch\Documents\Data\output.csv")

def find_header_from_all_files(indir):
    columns = set()
    print("Looking for column names in", indir)
    for f in indir.glob('*.csv'):
        with f.open() as sample_csv:
            sample_reader = csv.DictReader(sample_csv)
            try:
                first_row = next(sample_reader)
            except StopIteration:
                print("File {} doesn't contain any data. Double check this".format(f))
                continue
            else:
                columns.update(first_row.keys())
    return columns

columns = find_header_from_all_files(indir)
print("The columns are:", sorted(columns))

# newline='' avoids blank lines between rows on Windows
with outfile.open('w', newline='') as outf:
    wr = csv.DictWriter(outf, fieldnames=list(columns))
    wr.writeheader()
    for inpath in indir.glob('*.csv'):
        print("Parsing", inpath)
        with inpath.open() as infile:
            reader = csv.DictReader(infile)
            wr.writerows(reader)
print("Done, find the output at", outfile)
This should handle the case where one of the input csvs doesn't contain all of the columns, since csv.DictWriter fills any missing fields with its restval (an empty string by default).
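A tiny illustration of that behaviour:
import csv
import io

buf = io.StringIO()
wr = csv.DictWriter(buf, fieldnames=['a', 'b'])
wr.writeheader()
wr.writerow({'a': 1})  # no 'b' here, so DictWriter writes an empty field for it
print(buf.getvalue())  # prints "a,b" then "1,"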
I am not sure if I understand your problem correctly, but this is one way to merge your files without specifying any column names:
import pandas as pd
import glob
import os

def concatenate(indir):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    output_file = pd.concat([pd.read_csv(filename) for filename in fileList])
    output_file.to_csv("_output.csv", index=False)

concatenate(indir=r"C:\\Users\gerardchurch\Documents\Data\dev_en")
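Note that pd.concat matches columns by name rather than by position, so headers appearing in a different order from file to file still line up; a tiny illustration:
import pandas as pd

df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'B': [4], 'A': [3]})  # same columns, different order
print(pd.concat([df1, df2]))  # both rows line up under columns A and B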
So far for my code to read from text files and export to Excel I have:
import glob

data = {}
for infile in glob.glob("*.txt"):
    with open(infile) as inf:
        data[infile] = [l[:-1] for l in inf]

with open("summary.xls", "w") as outf:
    outf.write("\t".join(data.keys()) + "\n")
    for sublst in zip(*data.values()):
        outf.write("\t".join(sublst) + "\n")
The goal with this was to reach all of the text files in a specific folder.
However, when I run it, Excel gives me an error saying:
"File cannot be opened because: Invalid at the top level of the document. Line 1, Position 1. outputgooderr.txt outputbaderr.txt. fixed_inv.txt"
Note: outputgooderr.txt, outputbaderr.txt, and fixed_inv.txt are the names of the text files I wish to export to Excel, one file per sheet.
When I only have one file for the program to read, it is able to extract the data. Unfortunately, this is not what I would like since I have multiple files.
Please let me know of any ways I can combat this. I am very much a beginner in programming in general and would appreciate any advice! Thank you.
If you're not opposed to having the output Excel file be a .xlsx rather than a .xls, I'd recommend making use of some of the features of pandas, in particular pandas.read_csv() and DataFrame.to_excel().
I've provided a fully reproducible example of how you might go about doing this. Please note that I create 2 .txt files in the first 3 lines for the test.
import pandas as pd
import numpy as np
import glob

# Creating a dataframe and saving it as test_1.txt/test_2.txt in the current directory;
# feel free to remove the next 3 lines if you want to test in your own directory
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df.to_csv('test_1.txt', index=False)
df.to_csv('test_2.txt', index=False)

txt_list = []    # filenames of the .txt files
sheet_list = []  # excel sheet names derived from the filenames

# loop through filenames matching a specified pattern (.txt) in the current directory
for infile in glob.glob("*.txt"):
    outfile = infile.replace('.txt', '')  # removing '.txt' for the excel sheet name
    sheet_list.append(outfile)            # sheet name goes to sheet_list
    txt_list.append(infile)               # '...txt' filename goes to txt_list

# the context manager closes (and saves) the workbook on exit;
# ExcelWriter.save() itself was removed in pandas 2.0
with pd.ExcelWriter('summary.xlsx', engine='xlsxwriter') as writer:
    for i in range(len(txt_list)):
        df = pd.read_csv(txt_list[i])  # read the file at index i of txt_list
        df.to_excel(writer, sheet_name=sheet_list[i], index=False)  # write it to its own sheet
Output example: