Pandas pd.read_csv() splits each character into a new row - python

Weird (and possibly bad) question - when using the pd.read_csv() method, it appears as if pandas separates each character of the csv into a new row.
My code:
import pandas as pd
import csv
# doing this in colab
from google.colab import files
# downloading a csv with my apps sdk
data = sdk.run_look('2131','csv')
with open('big.csv', 'w') as file:
    csvwriter = csv.writer(file, delimiter=',')
    csvwriter.writerows(data)
df = pd.read_csv('big.csv', delimiter=',')
files.download('big.csv')
Output:
What I'm getting from line 4 (data=sdk...) looks like this:
Orders Status,Orders Count
complete,31377
pending,505
cancelled,375
However, what I get back from pandas looks like this:
0 r
1 d
2 e
3 r
4 s
...
I think the problem is line 6 (df = read_csv...): if I print(data) and compare it to print(df.head()), print(data) returns the correct data, while print(df.head()) returns the weirdly formatted data.
Any idea of what I'm doing wrong here? I'm a complete noob, so probably pebkac :)

It appears that your data is just coming in as one big string. If that's the case, you don't need the csv writer at all; you can write it directly to your output file.
import pandas as pd
# doing this in colab
from google.colab import files
# downloading a csv with my apps sdk
data = sdk.run_look('2131', 'csv')
with open('big.csv', 'w') as file:
    file.write(data)
df = pd.read_csv('big.csv', delimiter=',')
files.download('big.csv')
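If you'd rather skip the intermediate file entirely, pandas can parse the string in memory via io.StringIO. A minimal sketch, using a hypothetical string payload in place of the SDK response:

```python
import io
import pandas as pd

# Hypothetical string payload standing in for sdk.run_look('2131', 'csv')
data = "Orders Status,Orders Count\ncomplete,31377\npending,505\ncancelled,375\n"

# Parse the CSV text directly; no intermediate file needed
df = pd.read_csv(io.StringIO(data))
print(df)
```

You would still write `data` out separately if you need the file for files.download().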

Related

Issues with the delimiter when trying to read a comma separated file (Python, Pandas & .csv)

The problem:
I am trying to reproduce results from a YouTube course by Keith Galli.
import pandas as pd
import os
import csv
input_loc = "./SalesAnalysis/Sales_Data/"
output_loc = "./SalesAnalysis/korbi_output/"
fileList = os.listdir(input_loc)
all_months_data = pd.DataFrame()
problem probably starts here:
for file in fileList:
    if file.endswith(".csv"):
        df = pd.read_csv(input_loc + file)
        all_months_data = all_months_data.append(df)
all_months_data.to_csv(output_loc+"all_months_data.csv")
all_months_data.head()
this is my output and I don't want row 1 to be displayed, because it contains no data:
The issue seems to be line 3 in one of my csv files; cell A3 is empty except for commas:
So I went to the csv file and deleted cell A3. Running the code again, I get this:
instead of this:
What do I have to do to remove the cells without value and to still display everything correctly?
I did not understand why this weird problem occurred, but I figured out a workaround to clean the data and save everything to a new csv file:
all_months_data_cleaned = all_months_data.copy()
all_months_data_cleaned = all_months_data.dropna()
all_months_data_cleaned.reset_index(drop=True, inplace=True)
all_months_data_cleaned.to_csv(output_loc+"all_months_data_cleaned.csv")
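For what it's worth, the likely culprits in concatenated CSVs like these are rows that are entirely empty and rows that repeat the header line. A minimal sketch of a more targeted cleanup, using hypothetical column names in place of the real Sales_Data headers:

```python
import pandas as pd

# Toy frame mimicking the problem: one all-empty row plus one repeated header row
# (column names here are hypothetical stand-ins for the real dataset's headers)
df = pd.DataFrame({
    "Product": ["USB-C Cable", None, "Product", "Monitor"],
    "Quantity Ordered": ["2", None, "Quantity Ordered", "1"],
})

# Drop rows where every value is missing, then rows that repeat the header
cleaned = df.dropna(how="all")
cleaned = cleaned[cleaned["Product"] != "Product"]
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

dropna(how="all") is a bit safer than a bare dropna(), which would also discard rows that are only partially filled.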

How do you add a header to an excel csv file using python

So I'm trying to add a header to a csv file dynamically. My current code looks like the following:
import csv
from datetime import datetime
import pandas as pd
rows = []
with open(r'Test_Timestamp.csv', 'r', newline='') as file:
    with open(r'Test_Timestamp_Result.csv', 'w', newline='') as file2:
        reader = csv.reader(file, delimiter=',')
        for row in reader:
            rows.append(row)
        file_write = csv.writer(file2)
        for val in rows:
            current_date_time = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
            val.insert(0, current_date_time)
            file_write.writerow(val)
Currently this inserts a new timestamp at column A, which is exactly what I want, since everything gets pushed over; I'll be working with csv files with varying numbers of columns.
What I'm having trouble with is adding a column header. Currently a timestamp is created next to the header; I want a new header named Execution_Date instead.
I have looked at pandas as a solution, but from the documentation the examples I've seen look like the column headers are already pre-determined. I've tried inserting a column header with df.insert(0, "Execution_Date", current_date_time), but it gives me an error.
I know I'm fairly close, but I'm running into errors. Is there a way to do this dynamically, so it automatically works with different csv files and different numbers of columns in each file? The current output looks like:
What I want the final result to look like is:
Any help with this would be greatly appreciated! I'm going to continue to see if I can solve this in the meantime, but I'm at a wall with how to proceed.
If the end result is something Excel can read, like a csv, you can likely bypass pandas altogether.
Edit: adding support for existing titles
Given a simple csv like:
Title,Other
Geeks1,foo
Geeks2,bar
Then you might use:
import contextlib
import csv
from datetime import datetime
with contextlib.ExitStack() as stack:
    file_in = stack.enter_context(open('Test_Timestamp.csv', "r", encoding="utf-8"))
    file_out = stack.enter_context(open('Test_Timestamp_Result.csv', "w", encoding="utf-8", newline=""))
    reader = csv.reader(file_in, delimiter=',')
    writer = csv.writer(file_out)
    writer.writerow(["Execution_Date"] + next(reader))
    writer.writerows(
        [datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)] + row
        for row in reader
    )
to give you a file like:
Execution_Date,Title,Other
2022-02-11 00:00:00,Geeks1,foo
2022-02-11 00:00:00,Geeks2,bar
One way to do this is to utilize to_csv().
Example:
# importing python package
import pandas as pd
# read contents of csv file
file = pd.read_csv("gfg.csv")
print("\nOriginal file:")
print(file)
# adding header
headerList = ['id', 'name', 'profession']
# converting data frame to csv
file.to_csv("gfg2.csv", header=headerList, index=False)
# display modified csv file
file2 = pd.read_csv("gfg2.csv")
print('\nModified file:')
print(file2)
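If you do want to stay in pandas, the df.insert approach mentioned in the question works when given a scalar, which pandas broadcasts to every row. A minimal sketch, using a hypothetical two-column frame in place of the real CSV:

```python
import pandas as pd
from datetime import datetime

# Hypothetical frame standing in for pd.read_csv('Test_Timestamp.csv')
df = pd.DataFrame({"Title": ["Geeks1", "Geeks2"], "Other": ["foo", "bar"]})

stamp = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
# insert(loc, column, value): a scalar value is broadcast to every row,
# so this works regardless of how many columns the file has
df.insert(0, "Execution_Date", stamp)
df.to_csv("Test_Timestamp_Result.csv", index=False)
print(df.columns.tolist())
```

Because insert takes a position rather than a name, the same two lines work unchanged for files with any number of columns.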

How to write a dictionary to a text file with columns for each element in the dictionary in Python

I am trying to use a previously generated Workspace from matlab to create a input text file for another program to read from. I can import the Workspace into Python with no issues as a dictionary. From here I am having difficulty writing to a text file due to the different data types and different sized arrays in the dictionary. I would like for each field in the dictionary to have its own column in the text file but have had little luck.
Here is a screen shot of the imported dictionary from matlab.
I would like the text file to have this format. (the header is not necessary)
I was able to get one of the variables into the text file with the following code but, I can't add more variables or ones with different data types.
import csv
import scipy.io
import numpy as np
#import json
#import _pickle as pickle
#import pandas as pd
mat = scipy.io.loadmat('Day055.mat')
#print(type(mat))
#print(mat['CC1'])
CC1 = mat['CC1']
CC2 = mat['CC2']
DP1 = mat['DP1']
#print(CC1)
#print(CC2)
dat = np.array([CC1, DP1])
dat = dat.T
#np.savetxt('tester.txt', dat, delimiter = '\t')
np.savetxt('tester.txt', CC1, delimiter = '\t')
'''
with open('test.csv', 'w') as f:
    writer = csv.writer(f)
    for row in CC1:
        writer.writerow(row)
#print(type(CC1))
#print("CC1=", CC1)
#print("first entry to CC1:", CC1[0])
mat = {'mat':mat}
df = pd.DataFrame.from_dict(mat)
print(df)
print(type(mat))
x=0
with open('inputDay055.txt', 'w') as file:
    for line in CC1:
        file.write(CC1[x])
        #file.write("\t".join(map(str, CC2))+"\n")
        #file.write(pickle.dumps(mat))
        x = x + 1
'''
print("all done")
As you can see, I have tried a few different ways as well, but commented them out when I was not successful.
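One way to line up arrays of different lengths as columns is itertools.zip_longest, which pads the shorter ones. A minimal sketch, using short hypothetical 1-D arrays in place of the MATLAB variables (arrays from scipy.io.loadmat are usually 2-D, so the real ones may need .ravel() first):

```python
import numpy as np
from itertools import zip_longest

# Hypothetical arrays of unequal length standing in for the loaded variables
CC1 = np.array([1.0, 2.0, 3.0])
CC2 = np.array([10.0, 20.0])
DP1 = np.array([0.5, 0.6, 0.7, 0.8])

columns = [CC1, CC2, DP1]
with open("tester.txt", "w") as f:
    f.write("CC1\tCC2\tDP1\n")  # optional header
    # zip_longest pads shorter columns with fillvalue so every row lines up
    for row in zip_longest(*columns, fillvalue=""):
        f.write("\t".join(str(v) for v in row) + "\n")
```

Each variable becomes one tab-separated column; missing entries at the bottom of shorter columns come out as empty cells.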

how to handle error in reading file containing multiple languages

data trying to read
I have tried various ways still getting errors of the different type.
import codecs
f = codecs.open('sampledata.xlsx', encoding='utf-8')
for line in f:
    print(repr(line))
the other way I tried is
f = open(fname, encoding="ascii", errors="surrogateescape")
Still no luck. Any help?
Newer versions of pandas support xlsx files natively.
file_name = # path to file + file name
sheet = # sheet name or sheet number or list of sheet numbers and names
import pandas as pd
df = pd.read_excel(io=file_name, sheet_name=sheet)
print(df.head(5)) # print first 5 rows of the dataframe
Works great, especially if you're working with many sheets.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
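As a side note, passing sheet_name=None makes read_excel return a dict mapping sheet names to DataFrames, which is handy when you don't know the sheet layout up front. A minimal sketch that builds a small two-sheet workbook first (assumes an xlsx engine such as openpyxl is installed; file and sheet names are hypothetical):

```python
import pandas as pd

# Build a two-sheet workbook to demonstrate, including non-ASCII text
with pd.ExcelWriter("sampledata.xlsx") as writer:
    pd.DataFrame({"word": ["hello", "こんにちは"]}).to_excel(writer, sheet_name="en_ja", index=False)
    pd.DataFrame({"word": ["bonjour"]}).to_excel(writer, sheet_name="fr", index=False)

# sheet_name=None loads every sheet into a dict of DataFrames
sheets = pd.read_excel("sampledata.xlsx", sheet_name=None)
for name, frame in sheets.items():
    print(name, frame.shape)
```

Excel files store text as unicode internally, so this also sidesteps the encoding errors from opening the raw .xlsx bytes as text.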

How can I read only the header column of a CSV file using Python?

I am looking for a a way to read just the header row of a large number of large CSV files.
Using Pandas, I have this method available, for each csv file:
>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns
I could do this with just the csv module:
>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames
The problem with these is that each CSV file is 500MB+ in size, and it seems to be a gigantic waste to read in the entire file of each just to pull the header lines.
My end goal of all of this is to pull out unique column names. I can do that once I have a list of column headers that are in each of these files.
How can I extract only the header row of a CSV file, quickly?
Expanding on the answer given by Jeff: it is now possible to use pandas without actually reading any rows.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.DataFrame(np.random.randn(10, 4), columns=list('abcd')).to_csv('test.csv', mode='w')
In [4]: pd.read_csv('test.csv', index_col=0, nrows=0).columns.tolist()
Out[4]: ['a', 'b', 'c', 'd']
pandas can have the advantage that it deals more gracefully with CSV encodings.
I might be a little late to the party, but here's one way to do it using just the Python standard library. When dealing with text data, I prefer to use Python 3 because of its unicode handling. So this is very close to your original suggestion, except I'm only reading in one row rather than the whole file.
import csv
with open(fpath, 'r') as infile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames
Hopefully that helps!
One way is to collect the headers into a set; I've used iglob as an example to search for the .csv files. Adjust as necessary, e.g.:
import csv
from glob import iglob
unique_headers = set()
for filename in iglob('*.csv'):
    with open(filename, 'r', newline='') as fin:
        csvin = csv.reader(fin)
        unique_headers.update(next(csvin, []))
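The same set-based aggregation can be done with pandas by parsing only the header of each file via nrows=0. A minimal sketch, with two hypothetical header-only files standing in for the real CSVs:

```python
import pandas as pd

# Write two tiny header-only CSVs to demonstrate (hypothetical file names)
pd.DataFrame(columns=["a", "b"]).to_csv("headers_a.csv", index=False)
pd.DataFrame(columns=["b", "c"]).to_csv("headers_b.csv", index=False)

unique_headers = set()
for filename in ("headers_a.csv", "headers_b.csv"):  # or iglob('*.csv')
    # nrows=0 parses only the header line of each file
    unique_headers.update(pd.read_csv(filename, nrows=0).columns)
print(sorted(unique_headers))
```

This keeps the memory cost per file to a single line, while letting pandas handle encodings and quoting.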
Here's one way. You get 1 row.
In [9]: DataFrame(np.random.randn(10,4),columns=list('abcd')).to_csv('test.csv',mode='w')
In [10]: read_csv('test.csv',index_col=0,nrows=1)
Out[10]:
a b c d
0 0.365453 0.633631 -1.917368 -1.996505
What about:
pandas.read_csv(PATH_TO_CSV, nrows=1).columns
That'll read the first row only and return the columns found.
You have missed the nrows=1 param to read_csv:
>>> df= pd.read_csv(PATH_TO_CSV, nrows=1)
>>> df.columns
It depends on what the header will be used for. If you need the headers for comparison purposes only (my case), this code is simple and super fast: it reads the whole header as one string. You can then transform the collected strings according to your needs:
for filename in glob.glob(files_path+"\*.csv"):
    with open(filename) as f:
        first_line = f.readline()
It is easy; you can use this:
df = pd.read_csv("path.csv", skiprows=0, nrows=2)
df.columns.to_list()
In this case you only read a very few rows to get your header.
If you are only interested in the headers and would like to use pandas, the only extra thing you need to pass in, apart from the csv file name, is nrows=0:
headers = pd.read_csv("test.csv", nrows=0)
import pandas as pd
get_col = list(pd.read_csv("first_test_pipe.csv", sep="|", nrows=1).columns)
print(get_col)
