Trouble getting pandas to read entire column of data - python

The goal of this program is to read each column header and all of the data underneath each column. After reading this data, it makes a list of it and logs it all into a text file. This works with small data, but with large amounts of data (2000 lines and up) the text file records values up to number 30, and then the next element is '...'. It then resumes recording correctly all the way up to the 2000th element.
I have tried everything I can think of. Please help. I almost punched a hole in the wall trying to fix this.
import csv
import pandas as pd
import os
import linecache
from tkinter import *
from tkinter import filedialog
def create_dict(df):
    # Creates an empty text file for the dictionary if it doesn't exist
    if not os.path.isfile("Dictionary.txt"):
        open("Dictionary.txt", 'w').close()
    # Opens the dictionary for reading and writing
    with open("Dictionary.txt", 'r+') as dictionary:
        column_headers = list(df)
        i = 0
        # Creates an entry in the dictionary for each header
        for header in column_headers:
            dictionary.write("==========================\n"
                             "\t=" + header + "=\n"
                             "==========================\n\n\n\n")
            dictionary.write(str(df[str(column_headers[i])]))
            #for line in column_info[:-1]:
            #    dictionary.write(line + '\n')
            dictionary.write('\n')
            i += 1
Some of these imports might not be used. I just included all of them.

You can write a pandas DataFrame directly to a text file with to_string(), which writes every row, whereas str() abbreviates long output with '...':
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low = 1, high = 100, size =3000), columns= ['Random Number'])
filename = 'dictionary.txt'
with open(filename,'w') as file:
    df.to_string(file)
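To keep the per-column banner layout from the question, the same idea can be applied one column at a time. This is only a sketch: the helper name write_columns and the in-memory buffer are assumptions made for illustration, and the buffer stands in for the real Dictionary.txt file.

```python
import io

import numpy as np
import pandas as pd


def write_columns(df, out):
    """Write every column of df, untruncated, under a banner header."""
    for header in df.columns:
        out.write("==========================\n")
        out.write("\t=" + str(header) + "=\n")
        out.write("==========================\n\n")
        # Series.to_string() renders every row; str(series) abbreviates
        # long output once it exceeds pandas' display.max_rows setting.
        out.write(df[header].to_string())
        out.write("\n\n")


df = pd.DataFrame({"Random Number": np.arange(3000)})
buffer = io.StringIO()  # stand-in for open("Dictionary.txt", "w")
write_columns(df, buffer)
text = buffer.getvalue()
```

Swapping the buffer for a real file handle should be enough to reproduce the original output file without the abbreviated middle section.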

Related

Issues with the delimiter when trying to read a comma separated file (Python, Pandas & .csv)

The problem:
I am trying to reproduce results from one of Keith Galli's YouTube courses.
import pandas as pd
import os
import csv
input_loc = "./SalesAnalysis/Sales_Data/"
output_loc = "./SalesAnalysis/korbi_output/"
fileList = os.listdir(input_loc)
all_months_data = pd.DataFrame()
The problem probably starts here:
for file in fileList:
    if file.endswith(".csv"):
        df = pd.read_csv(input_loc+file)
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        all_months_data = pd.concat([all_months_data, df])
all_months_data.to_csv(output_loc+"all_months_data.csv")
all_months_data.head()
This is my output, and I don't want row 1 to be displayed because it contains no data:
The issue seems to be line 3 in one of my CSV files. A3 is empty except for commas:
So I went to the CSV file and deleted cell A3. I ran the code again and got this:
instead of this:
What do I have to do to remove the cells without value and to still display everything correctly?
I did not understand WHY these weird problems occurred, but I figured out a workaround to clean the data and save everything in a new CSV file:
# dropna() already returns a new DataFrame, so no separate copy() is needed
all_months_data_cleaned = all_months_data.dropna()
all_months_data_cleaned.reset_index(drop=True, inplace=True)
all_months_data_cleaned.to_csv(output_loc+"all_months_data_cleaned.csv")
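As for why this happens: a line that contains nothing but commas is not blank, so read_csv parses it as a row in which every field is missing (NaN). The sketch below reproduces that with made-up sample data (the column names and values are assumptions, not the course's actual files) and drops only rows where every field is missing.

```python
import io

import pandas as pd

# Hypothetical sample standing in for one monthly sales file;
# the second data line is empty except for commas.
raw = io.StringIO(
    "Order ID,Product,Quantity Ordered\n"
    "176558,USB-C Charging Cable,2\n"
    ",,\n"
    "176559,Bose SoundSport Headphones,1\n"
)

df = pd.read_csv(raw)

# The comma-only line became a row of NaN values.
all_missing = df.isna().all(axis=1)

# how="all" removes only rows in which every column is NaN, then the
# index is renumbered so no gap is left behind.
cleaned = df.dropna(how="all").reset_index(drop=True)
```

Plain dropna(), as in the workaround above, is stricter: it also discards rows where only some fields are missing.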

How to write a dictionary to a text file with columns for each element in the dictionary in Python

I am trying to use a previously generated workspace from MATLAB to create an input text file for another program to read from. I can import the workspace into Python as a dictionary with no issues. From here I am having difficulty writing to a text file because of the different data types and different-sized arrays in the dictionary. I would like each field in the dictionary to have its own column in the text file, but have had little luck.
Here is a screenshot of the imported dictionary from MATLAB.
I would like the text file to have this format. (The header is not necessary.)
I was able to get one of the variables into the text file with the following code, but I can't add more variables or ones with different data types.
import csv
import scipy.io
import numpy as np
#import json
#import _pickle as pickle
#import pandas as pd
mat = scipy.io.loadmat('Day055.mat')
#print(type(mat))
#print(mat['CC1'])
CC1 = mat['CC1']
CC2 = mat['CC2']
DP1 = mat['DP1']
#print(CC1)
#print(CC2)
dat = np.array([CC1, DP1])
dat =dat.T
#np.savetxt('tester.txt', dat, delimiter = '\t')
np.savetxt('tester.txt', CC1, delimiter = '\t')
'''
with open('test.csv', 'w') as f:
    writer = csv.writer(f)
    for row in CC1:
        writer.writerow(row)
#print(type(CC1))
#print("CC1=", CC1)
#print("first entry to CC1:", CC1[0])
mat = {'mat':mat}
df = pd.DataFrame.from_dict(mat)
print(df)
print(type(mat))
x=0
with open('inputDay055.txt', 'w') as file:
    for line in CC1:
        file.write(CC1[x])
        #file.write("\t".join(map(str, CC2))+"\n")
        #file.write(pickle.dumps(mat))
        x=x+1
'''
print("all done")
As you can see, I have tried a few different ways as well, but commented them out when I was not successful.
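One possible approach, sketched under assumptions: flatten each variable to one dimension, wrap it in a pandas Series so that columns of unequal length are padded, and write the table tab-separated. The dictionary below is a made-up stand-in for the loadmat result (the values and shapes are invented), and skipping keys that start with '__' assumes the usual loadmat bookkeeping entries.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for scipy.io.loadmat('Day055.mat'): a dict of
# 2-D arrays with different lengths and dtypes, plus a bookkeeping key.
mat = {
    "__header__": b"placeholder",
    "CC1": np.array([[1.5], [2.5], [3.5]]),
    "CC2": np.array([[10, 20]]),
    "DP1": np.array([[7], [8], [9], [10]]),
}

# Flatten each variable to 1-D and wrap it in a Series so pandas can
# align columns of unequal length, padding the short ones with NaN.
columns = {
    name: pd.Series(np.ravel(values))
    for name, values in mat.items()
    if not name.startswith("__")
}
table = pd.DataFrame(columns)

out = io.StringIO()  # stand-in for open('inputDay055.txt', 'w')
# na_rep="" leaves padded cells empty; header=False omits the column names.
table.to_csv(out, sep="\t", index=False, header=False, na_rep="")
text = out.getvalue()
```

Whether empty cells are acceptable depends on the program that reads the file; na_rep can be set to another placeholder if it expects one.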

Iterate through Time Series data from .txt file using Numpy Array

My background is VBA and I am very new to Python, so please forgive me at the outset.
I have a .txt file with time series data.
My goal is to loop through the data and do simple comparisons, such as High - Close etc. Coming from VBA, this is straightforward for me there, namely (in simple terms):
Sub Loop()
    Dim arrTS() As Variant, i As Long
    arrTS = Array("Date", "Time", ..)
    For i = LBound(arrTS, 1) to UBound(arrTS, 1)
        Debug.Print arrTS(i, "High") - arrTS(i, "Close")
    Next i
End Sub
Now what I have in python is:
import os
import numpy as np
import urllib.request
import matplotlib.pyplot as plt
#load the .txt file
# os.path.join avoids backslashes being read as escape sequences in the path
ES_D1 = np.loadtxt(fname = os.path.join(os.getcwd(), "ES", "D1", "ES_10122007_04122019_D1.txt"), dtype='str')
#now get the shape
print(ES_D1.shape)
Out: (3025, 8)
Can anyone recommend the best way to iterate through this file line by line, with reference to specific columns, and not iterate through each element?
Something like:
For i = 0 To 3025
    print(ES_D1[i,4] - ES_D1[i,5])
Next i
The regular way to read csv/tsv files for me is this:
import os
filename = '...'
filepath = '...'
infile = os.path.join(filepath, filename)
with open(infile) as fin:
    for line in fin:
        parts = line.split('\t')
        # do something with the list "parts"
But in your case, using the pandas function read_csv() might be a better way:
import pandas as pd
# Control delimiters, rows, column names with read_csv
data = pd.read_csv(infile)
# View the first 5 lines
data.head()
Creating the simple for loop was easier than I thought; here it is for others.
import os
import numpy as np
import urllib.request
import matplotlib.pyplot as plt
#load the .txt file
ES_D1 = np.loadtxt(fname = os.path.join(os.getcwd(), "ES", "D1", "ES_10122007_04122019_D1.txt"), dtype='str')
#now need to loop through the array
#this is the engine
for i in range(ES_D1.shape[0]):
    # note: with dtype='str' the values are compared as text, not as numbers
    if ES_D1[i,3] > ES_D1[i,6]:
        print(ES_D1[i,0])
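Because the array is loaded as strings, arithmetic such as High - Close needs a conversion to float first. The following is a minimal sketch on made-up rows; the column order (date, time, open, high, low, close) is an assumption, since the real file's layout is not shown.

```python
import io

import numpy as np

# Hypothetical sample rows; the real file's column layout is an assumption.
# Columns here: date, time, open, high, low, close.
sample = io.StringIO(
    "2019-12-02 16:00 3140.0 3145.5 3120.0 3125.0\n"
    "2019-12-03 16:00 3125.0 3130.0 3090.0 3128.5\n"
    "2019-12-04 16:00 3128.5 3150.0 3126.0 3150.0\n"
)

rows = np.loadtxt(sample, dtype=str)

# Convert the price columns to floats before doing arithmetic;
# string columns compare alphabetically and cannot be subtracted.
high = rows[:, 3].astype(float)
close = rows[:, 5].astype(float)

# Row-by-row loop, closest to the VBA version.
for i in range(rows.shape[0]):
    print(rows[i, 0], high[i] - close[i])

# Equivalent vectorized form: one subtraction over whole columns.
spread = high - close
```

The vectorized form is usually preferred in NumPy, but the explicit loop is fine when each row needs its own branching logic.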

Writing CSV row values to a PDF using Python

I've been using some great answers on Stack Overflow to help solve my problem, but I've hit a roadblock.
What I'm trying to do
Read values from rows of CSV
Write the values from the CSV to Unique PDFs
Work through all rows in the CSV file and write each row to a different unique PDF
What I have so far
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
import pandas as pd
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
# Read CSV into pandas dataframe and assign columns as variables
csv = '/myfilepath/test.csv'
df = pd.read_csv(csv)
Name = df['First Name'].values + " " + df['Last Name'].values
OrderID = df['Order Number'].values
packet = io.BytesIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.setFont("Helvetica", 12)
if OrderID is not None:
    can.drawString(80, 655, '#' + str(OrderID)[1:-1])
can.setFont("Helvetica", 16)
if Name is not None:
    can.drawString(315, 630, str(Name)[2:-2])
can.save()
# move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(open("Unique1.pdf", "rb"))
output = PdfFileWriter()
# add the new pdf to the existing page
page = existing_pdf.getPage(0)
page2 = new_pdf.getPage(0)
page.mergePage(page2)
output.addPage(page)
# finally, write "output" to a real file
outputStream = open("Output.pdf", "wb")
output.write(outputStream)
outputStream.close()
The code above works if:
I specify the PDF that I want to write to
I specify the output file name
The CSV only has 1 row
What I need help with
Reading values from the CSV one row at a time and storing them as a variable to write
Select a unique PDF, and write the values from above, then save that file and select the next unique PDF
Loop through all rows in a CSV and end when the last row has been reached
Additional Info: the unique PDFs will be contained in a folder as they each have the same layout but different barcodes
Any help would be greatly appreciated!
I would personally suggest that you reconsider using Pandas and instead try the standard CSV module. It will meet your need for streaming through a file for row-by-row processing. Shown below is some code looping through a CSV file getting each row as a dictionary, and processing that in a write_pdf function, as well as logic that will get you a new filename to write the PDF to for each row.
import csv
# import the PDF libraries you need
def write_pdf(data, filename):
    name = data['First Name'] + ' ' + data['Last Name']
    order_no = data['Order Number']
    # Leaving PDF writing to you
row_counter = 0
# newline='' is what the csv module docs recommend when opening CSV files
with open('file.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # row_counter is an int, so convert it before building the filename
        write_pdf(row, 'Output' + str(row_counter) + '.pdf')
        row_counter += 1
I'm going to leave the PDF writing to you because I think you understand what you need from that better than I do.
I know I cut out the Pandas part, but I think the issue you are having with it, and the reason it doesn't work for a CSV with more than one row, stems from DataFrame.get being an operation that retrieves an entire column.
Python CSV module docs
pandas DataFrame docs
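To also pick a different source PDF for each row, one option is to walk the CSV rows and the list of template files together. This is a sketch under assumptions: the helper name pair_rows_with_templates, the sample names, and the Unique2.pdf path are all hypothetical, and in practice the template list might come from something like sorted(Path(folder).glob("*.pdf")).

```python
import csv
import io
from pathlib import Path


def pair_rows_with_templates(csv_file, template_paths):
    """Yield (row, template, output_name) for each CSV row."""
    reader = csv.DictReader(csv_file)
    # zip stops at the shorter input, so surplus rows or PDFs are ignored.
    for index, (row, template) in enumerate(zip(reader, template_paths), start=1):
        yield row, template, f"Output{index}.pdf"


# Made-up sample data standing in for test.csv.
sample_csv = io.StringIO(
    "First Name,Last Name,Order Number\n"
    "Ada,Lovelace,1001\n"
    "Alan,Turing,1002\n"
)
# Hypothetical template files, one per row.
templates = [Path("Unique1.pdf"), Path("Unique2.pdf")]

jobs = list(pair_rows_with_templates(sample_csv, templates))
for row, template, output_name in jobs:
    name = row["First Name"] + " " + row["Last Name"]
    print(f"{template} + {name} (#{row['Order Number']}) -> {output_name}")
```

Each (row, template, output_name) triple can then be handed to the PDF-merging code from the question in place of the hard-coded file names.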

subtract consecutive rows from a .dat file

I wish to subtract each row from the preceding row in a .dat file and make a new column out of the result. In my file, I want to do that with the first column, time: I want to find the time interval for each timestep and then make a new column out of it. With help from the Stack Overflow community I wrote some pseudocode in pandas, but it's not working so far:
import pandas as pd
import numpy as np
from sys import argv
from pylab import *
import csv
script, filename = argv
# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./flash.dat").readlines()]
# write it as a new CSV file
with open("./flash.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)
columns_to_keep = ['#time']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
df = pd.DataFrame({"#time": pd.date_range("24 sept 2016", periods=5*24, freq="1h")})
df["time"] = df["#time"] + [pd.Timedelta(minutes=m) for m in np.random.choice(a=range(60), size=df.shape[0])]
df["value"] = np.random.normal(size=df.shape[0])
df["prev_time"] = [np.nan] + df.iloc[:-1]["time"].tolist()
df["time_delta"] = df.time - df.prev_time
df
dataframe.plot(x='#time', y='time_delta', style='r')
print(dataframe)
show()
I am also sharing the file for your convenience; your help is much appreciated.
https://www.dropbox.com/s/w4jbxmln9e83355/flash.dat?dl=0
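For reference, pandas has a built-in for subtracting the preceding row: Series.diff(). The sketch below uses a made-up excerpt, and it assumes whitespace-separated columns with a header line whose first field is '#time', as the question's code suggests; the energy column and all values are invented.

```python
import io

import pandas as pd

# Hypothetical excerpt; the real flash.dat layout is an assumption.
sample = io.StringIO(
    "#time energy\n"
    "0.0 1.00\n"
    "0.5 1.10\n"
    "1.5 1.25\n"
    "3.0 1.40\n"
)

# sep=r"\s+" splits on any run of whitespace, so no CSV conversion is needed.
df = pd.read_csv(sample, sep=r"\s+")

# diff() subtracts the preceding row from each row; the first row has
# no predecessor, so its interval is NaN.
df["time_delta"] = df["#time"].diff()
```

With the new column in place, df.plot(x='#time', y='time_delta') should have the data it needs.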
