I am trying to read a text file from a third party in which some lines are randomly split at the 28th column.
When I convert it to CSV it is fine, but when I feed the files to Athena, it is not able to read them because of the split.
Is there a way to find the CR here and join the line back together so it looks like the other lines?
Thanks,
SM
Here is a code snippet:
import pandas as pd
add_columns = ["col1", "col2", "col3"...."col59"]
# pass only one of sep/delimiter; recent pandas versions reject both together
res = pd.read_csv("file_name.txt", names=add_columns, delimiter=',', encoding="utf-8", skipinitialspace=True)
df = pd.DataFrame(res)
df.to_csv('final_name.csv', index=None)
file_name.txt
99,999,00499013,X701,,,5669,5669,1232,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1232,LXA,,<<line is split on column 28>>
2,5669,,,,68,,,1,,,,,,,,,,,,71,
99,999,00499017,X701,,,5669,5669,1160,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1160,LXA,,1,5669,,,,,,,1,,,,,,,,,,,,71,
99,999,00499019,X701,,,5669,5669,1284,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1284,LXA,,2,5669,,,,66,,,1,,,,,,,,,,,,71,
I have tried str.split, but no luck.
If you are able to convert it successfully to CSV using pandas, you can try to save it as a CSV to feed into Athena.
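If Athena needs the lines repaired before conversion, one option is to glue physical lines back together until each record has the expected number of fields. This is only a minimal sketch: the helper name rejoin_split_lines and the expected_fields parameter are made up for illustration, and it assumes no field contains a quoted comma or an embedded newline.

```python
import io

def rejoin_split_lines(lines, expected_fields, sep=","):
    """Yield logical records, joining physical lines until each has enough fields."""
    buffer = ""
    for raw in lines:
        # Drop the line ending (CR and/or LF) and append to the pending record.
        buffer += raw.rstrip("\r\n")
        # A complete record with N fields contains N - 1 separators.
        if buffer.count(sep) >= expected_fields - 1:
            yield buffer
            buffer = ""
    if buffer:
        # Leftover fragment that never reached the expected width.
        yield buffer

# Small made-up sample: the first record is split across two physical lines.
sample = io.StringIO("a,b,\nc,d,e\n1,2,3,4,5\n")
fixed = list(rejoin_split_lines(sample, expected_fields=5))
print(fixed)
```

For the real file, expected_fields could be taken from a line known to be intact, for example intact_line.count(",") + 1, and the repaired lines written to a new file before passing it to pd.read_csv.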
I want to read a text file. The file looks like this:
17430147 17277121 17767569 17352501 17567841 17650342 17572001
I want this result:
17430147
17277121
17767569
17352501
17567841
17650342
17572001
So I tried this code:
data = pd.read_csv('train.txt', header=None, delimiter=r"\s+")
or
data = pd.read_csv('train.txt', header=None, delim_whitespace=True)
Both of them give this error:
ParserError: Too many columns specified: expected 75262 and found 154
Then I tried this code:
file = open("train.txt", "r")
data = []
for i in file:
    i = i.replace("\n", "")
    data.append(i.split(" "))
But I think there are missing values in the txt file:
'2847',
'2848',
'2849',
'1947',
'2850',
'2851',
'2729',
''],
['2852',
'2853',
'2036',
Thank you!
The first step would be to read the text file as a string of values.
with open('train.txt','r') as f:
    lines = f.readlines()
# split() with no argument splits on any whitespace and drops empty strings
list_of_values = lines[0].split()
Here, list_of_values looks like:
['17430147',
'17277121',
'17767569',
'17352501',
'17567841',
'17650342',
'17572001']
Now, to create a DataFrame out of this list, simply execute:
import pandas as pd
pd.DataFrame(list_of_values)
This will give a pandas DataFrame with a single column with values read from the text file.
If you only need the values from the text file, you can use the list list_of_values directly.
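If the values are spread over several lines, or lines end with a trailing space (which is where the empty '' entries in the question come from), a variation is to read the whole file and split on any whitespace. This is a rough sketch: read_tokens is a made-up helper name and the demo file contents are invented.

```python
import os
import tempfile

def read_tokens(path):
    """Return every whitespace-separated token in the file as one flat list."""
    with open(path, "r", encoding="utf-8") as f:
        # str.split() with no argument splits on spaces, tabs and newlines
        # and never produces empty strings.
        return f.read().split()

# Demo with a temporary file that has a trailing space and a double space.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("17430147 17277121 17767569 \n17352501  17567841\n")
    values = read_tokens(demo_path)

print(values)
```

The flat list can then be passed to pd.DataFrame(values) in the same way as list_of_values above.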
You can use the .T attribute to transpose your DataFrame.
data = pd.read_csv("train.txt", header=None, sep=r"\s+").T
I am trying to analyse a WhatsApp chat by loading it into a pandas DataFrame, but it is read as a single column. What do I need to do to correct this? I believe the error is due to how the data needs to be formatted.
I have tried reading the file and then using pandas to split it into columns but, because of how it is read, I believe pandas only sees one column.
I have also tried pd.read_csv, with and without the sep parameter, and that does not yield the correct result either.
The information from WhatsApp appears as follows in the notebook:
[01/09/2017, 13:51:27] name1: abc
[02/09/2017, 13:51:28] name2: def
[03/09/2017, 13:51:29] name3: ghi
[04/09/2017, 13:51:30] name4: jkl
[05/09/2017, 13:51:31] name5: mno
[06/09/2017, 13:51:32] name6: pqr
The Python code is as follows:
import re
import sys
import pandas as pd
pd.set_option('display.max_rows', 500)
def read_history1(file):
    chat = open(file, 'r', encoding="utf8")
    #get all which exist in this format
    messages = re.findall(r'\d+/\d+/\d+, \d+:\d+:\d+\W .*: .*', chat.read())
    print(messages)
    chat.close()
    #make messages into a database
    history = pd.DataFrame(messages,columns=['Date','Time', 'Name', 'Message'])
    print(history)
    return history
#the encoding is added because of the way the file is written
#https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character/9233174
#i tried using sep, but it is not ideal for this data
def read_history2(file):
    messages = pd.read_csv(file)
    messages.columns = ['a','b']
    print(messages.head())
    return
filename = "AFC_Test.txt"
read_history2(filename)
The two methods I have tried are above.
I expect 4 columns:
the date, time, name and message for each row.
In case anyone comes across this, I resolved it as follows.
The error was in the regex:
def read_history2(file):
    print('\n')
    with open(file, 'r', encoding="utf8") as chat:
        # each pair of parentheses is a capture group, so findall returns 4-tuples
        content = re.findall(r'\W(\d+/\d+/\d+), (\d+:\d+:\d+)\W (.*): (.*)', chat.read())
    history = pd.DataFrame(content, columns=['Date','Time', 'Name', 'Message'])
    print(history)
filename = "AFC_Test.txt"
read_history2(filename)
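The reason the regex change matters: re.findall returns whole matched strings when the pattern has no capture groups, and tuples of the groups when it has several, so only the grouped version yields four values per message. Here is a small sketch of that difference using two lines from the sample above. The pattern names are made up, and the grouped pattern is a slightly stricter variant (anchored per line, literal brackets) rather than the exact one used in the fix.

```python
import re

sample = (
    "[01/09/2017, 13:51:27] name1: abc\n"
    "[02/09/2017, 13:51:28] name2: def\n"
)

# No capture groups: findall returns one plain string per message.
NO_GROUPS = re.compile(r"\d+/\d+/\d+, \d+:\d+:\d+\W .*: .*")
flat = NO_GROUPS.findall(sample)

# Four capture groups: findall returns one 4-tuple per message.
# Assumes the name itself contains no colon.
WITH_GROUPS = re.compile(
    r"^\[(\d+/\d+/\d+), (\d+:\d+:\d+)\] ([^:]+): (.*)$", re.MULTILINE
)
rows = WITH_GROUPS.findall(sample)

print(flat)
print(rows)
```

A list of 4-tuples like rows is what pd.DataFrame(rows, columns=['Date', 'Time', 'Name', 'Message']) expects, whereas the flat list of strings only ever fills one column.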
So you can split each line into a set of strings, with code that might look a bit like this:
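# read in file
with open(file, 'r', encoding="utf8") as chat:
    contents = chat.read()
# list for each line of the dataframe
rows = []
# clean data up into nice strings
for line in contents.splitlines():
    # split into date, time, name and the rest of the message
    # (assumes the name contains no spaces)
    parts = line.split(maxsplit=3)
    if len(parts) < 4:
        continue  # skip lines that do not have all four parts
    date, time, name, message = parts
    rows.append([date.strip("[],"), time.strip("[]"), name.rstrip(":"), message])
# create dataframe
history = pd.DataFrame(rows, columns=['Date','Time', 'Name', 'Message'])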
I think that should work!
Let me know how it goes :)
I've been using some great answers on Stack Overflow to help solve my problem, but I've hit a roadblock.
What I'm trying to do
Read values from rows of CSV
Write the values from the CSV to Unique PDFs
Work through all rows in the CSV file and write each row to a different unique PDF
What I have so far
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
import pandas as pd
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
# Read CSV into pandas dataframe and assign columns as variables
csv = '/myfilepath/test.csv'
df = pd.read_csv(csv)
Name = df['First Name'].values + " " + df['Last Name'].values
OrderID = df['Order Number'].values
packet = io.BytesIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.setFont("Helvetica", 12)
if OrderID is not None:
    can.drawString(80, 655, '#' + str(OrderID)[1:-1])
can.setFont("Helvetica", 16)
if Name is not None:
    can.drawString(315, 630, str(Name)[2:-2])
can.save()
# move to the beginning of the BytesIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(open("Unique1.pdf", "rb"))
output = PdfFileWriter()
# add the new pdf to the existing page
page = existing_pdf.getPage(0)
page2 = new_pdf.getPage(0)
page.mergePage(page2)
output.addPage(page)
# finally, write "output" to a real file
outputStream = open("Output.pdf", "wb")
output.write(outputStream)
outputStream.close()
The code above works if:
I specify the PDF that I want to write to
I specify the output file name
The CSV only has 1 row
What I need help with
Reading values from the CSV one row at a time and storing them as a variable to write
Select a unique PDF, and write the values from above, then save that file and select the next unique PDF
Loop through all rows in a CSV and end when the last row has been reached
Additional Info: the unique PDFs will be contained in a folder as they each have the same layout but different barcodes
Any help would be greatly appreciated!
I would personally suggest that you reconsider using Pandas and instead try the standard CSV module. It will meet your need for streaming through a file for row-by-row processing. Shown below is some code looping through a CSV file getting each row as a dictionary, and processing that in a write_pdf function, as well as logic that will get you a new filename to write the PDF to for each row.
import csv
# import the PDF libraries you need
def write_pdf(data, filename):
    name = data['First Name'] + ' ' + data['Last Name']
    order_no = data['Order Number']
    # Leaving PDF writing to you
row_counter = 0
with open('file.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # row_counter is an int, so convert it before building the filename
        write_pdf(row, 'Output' + str(row_counter) + '.pdf')
        row_counter += 1
I'm going to leave the PDF writing to you because I think you understand what you need from that better than I do.
I know I cut out the pandas part, but I think the issue you are having with it, and the reason it doesn't work for a CSV with more than 1 row, stems from DataFrame.get being an operation that retrieves an entire column.
Python CSV module docs
pandas DataFrame docs
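For the "select the next unique PDF" part, one possible approach is to pair each CSV row with one PDF from the folder before doing any PDF writing. The sketch below only plans the work and does not touch PDF contents. The function name plan_jobs, the pairing by sorted filename, the output naming scheme and the sample names are all assumptions for illustration; the question does not say how rows map to PDFs.

```python
import csv
import tempfile
from pathlib import Path

def plan_jobs(csv_path, template_dir, output_dir):
    """Pair each CSV row with one template PDF and pick an output path for it."""
    # Assumption: the Nth row goes with the Nth PDF in sorted filename order.
    templates = sorted(Path(template_dir).glob("*.pdf"))
    jobs = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        # zip stops at the shorter of the two, so extra rows or PDFs are ignored.
        for row, template in zip(csv.DictReader(f), templates):
            name = f"{row['First Name']} {row['Last Name']}"
            output = Path(output_dir) / f"Output_{row['Order Number']}.pdf"
            jobs.append((name, row["Order Number"], template, output))
    return jobs

# Demo with empty placeholder PDFs and a made-up two-row CSV.
with tempfile.TemporaryDirectory() as tmp_dir:
    base = Path(tmp_dir)
    (base / "Unique1.pdf").touch()
    (base / "Unique2.pdf").touch()
    csv_file = base / "test.csv"
    csv_file.write_text(
        "First Name,Last Name,Order Number\nAda,Lovelace,101\nAlan,Turing,102\n",
        encoding="utf-8",
    )
    jobs = plan_jobs(csv_file, base, base / "out")

for name, order_no, template, output in jobs:
    print(name, order_no, template.name, output.name)
```

Each tuple then carries everything write_pdf needs for one row: the text to draw, the PDF to read and the file to save.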
I have a huge text file that I want to export to Excel after first loading it into a DataFrame in Python and doing some operations on it.
The file contains some special characters in one of the headers, which is why I am not able to export the header line from the DataFrame to Excel.
It looks something like this:
{"ÿþ""DOEClientID""",DOEClient,ChgClientID,ChgClient,ChgSystemID,ChgSystem}
I am able to export the data when I use the header=False option, but I get an error when I set header to True.
Please help me out. I have searched a lot but have not been able to find a solution.
I need those headers in the file.
Code:
def files(file_name, outfile_name):
    data_initial = open(path + file_name, "rU")
    data1 = csv.reader((line.replace('\0','') for line in data_initial), delimiter=",")
    reader = csv.reader(open(path + file_name, 'rU'))
    writer = csv.writer(open(path + outfile_name ,'wb'),dialect = 'excel')
    for row in data1:
        writer.writerow(row)
    df = pd.DataFrame(pd.read_csv(path + outfile_name,sep=',', engine='python'))
    final_frame = df.dropna(how='all')
    file_list = list(uniq(list(final_frame['DOEClient'])))
    return file_list, final_frame
The problem with your input file is that it starts with a UTF-16 little-endian BOM. This is why you see the funny characters ÿþ: they are the bytes 0xFF 0xFE displayed as ISO-8859-1.
So you just need to pass the parameter encoding='utf-16' to read the file correctly:
df = pd.read_csv(path_to_csv, encoding='utf-16')
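To see the effect without the original file, here is a small standard-library sketch. The header text is a shortened version of the one in the question and the variable names are made up. It builds UTF-16 little-endian bytes with a BOM, then decodes them both ways.

```python
import codecs

text = "DOEClientID,DOEClient\n"

# Bytes as they would sit on disk: the 2-byte BOM followed by UTF-16-LE data.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Decoding with a single-byte codec shows the BOM as two visible characters
# and leaves a NUL after every ASCII character.
as_latin1 = raw.decode("latin-1")

# The "utf-16" codec reads the BOM to pick the byte order and then drops it.
as_utf16 = raw.decode("utf-16")

print(repr(as_latin1[:6]))
print(repr(as_utf16))
```

The NUL bytes in the wrongly decoded text are also why the code in the question needed line.replace('\0','') before the csv module would accept the lines.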