We receive multiple .txt files from our ERP every night. Sometimes a product name ends in a TAB character, because the person who entered the product name copy-pasted it from somewhere else. Long story short, this breaks the process: an automated Python script performs some very modest cleaning and then inserts the data into our MySQL database.
The import script errors out and breaks when this happens, because the extra TAB pushes one row in the file to be one column longer than the rest. I need to find a way to stop this from happening, because whenever it does, it breaks our BI reporting.
I've thought of some rules for pinpointing where the user input error sits in the file. I reckon the right way would be to write a Python script that imports the .txt file as a pandas DataFrame, finds all rows where the [amount] column is blank, and then fixes those rows. Unfortunately, to my understanding the fixing can't happen in pandas: by the time the file is imported into a DataFrame the problem has already happened, so it needs to be fixed before importing, unless it is somehow possible to remove the blank cell from column X and move all the other columns one step back, filling the void left behind.
So I need to find a way to either move all the cells one step back (left) when column X is blank, or some other way. All help is welcome.
EDIT:
I suppose there is a way after all to do this in pandas with shift. If anyone can assist on how to make it shift when column X is blank, it would be greatly appreciated!
EDIT2:
Here are the headers in the .txt file, a 2nd row which is fine, and a 3rd row which errors out:
tilausnro tasiakasnro ttkoodi lasiakasnro ltkoodi tilpvm myyja kasittelija myypri toiala tila tyonro toimpvm tryhma tuote nimi maara hinta valuutta mtili kpka s.posti kirjpvm aspvm ensvahaspvm vahvpvm tulpvm
100000-1 121007 121007 20-10-15 oer oer 8 100000-1 27-10-15 2100 ESP_734249 Wisby Hopfwis. Wei 5,6% 50EG Buk 150000 2032,26 SEK 3350 2 20-10-15 30-10-15 ? ? ?
500072-2 121110 121110 20-10-20 jra NTA 1 500072-2 21-10-20 2000 EVILN_007 Kwas Ostrabramski 0,5l back 60000 82,8 3350 600 20-10-20 23-10-20 ? ? ?
Managed to fix this with a little help from kind people at Discord.
lst = open('directory//filename.txt').readlines()
fixed = []
for line in lst:
    inner = line.split('\t')  # string to list
    if inner[16] == '':       # blank cell where [maara] (amount) should be
        inner.pop(16)
    inner = "\t".join(inner)  # list to string
    fixed.append(inner)
with open("directory//filename.txt", "w") as output:
    for item in fixed:
        output.write("%s" % item)
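For reference, the pandas shift idea from the first EDIT can be made to work too. Below is a minimal sketch with made-up column names ('amount' standing in for [maara]): the file is read with one spare column so the over-long rows parse at all, and the bad rows are then shifted one cell to the left.
import io
import pandas as pd

# Toy data: one good row and one bad row where an extra tab after the
# product name pushed everything right and left 'amount' blank.
raw = "product\tamount\tprice\nBeer\t100\t9.5\nWine\t\t100\t9.5\n"
cols = ['product', 'amount', 'price', 'spare']  # one spare column for bad rows
df = pd.read_csv(io.StringIO(raw), sep='\t', names=cols, header=0, dtype=str)

bad = df['amount'].isna()                 # blank cell where the amount should be
start = df.columns.get_loc('amount')
# Shift the bad rows one cell to the left, from 'amount' onwards.
df.iloc[bad.values, start:] = df.iloc[bad.values, start:].shift(-1, axis=1).values
df = df.drop(columns='spare')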
csv data:
c1,v1,c2,v2,Time
13.9,412.1,29.7,177.2,14:42:01
13.9,412.1,29.7,177.2,14:42:02
13.9,412.1,29.7,177.2,14:42:03
13.9,412.1,29.7,177.2,14:42:04
13.9,412.1,29.7,177.2,14:42:05
0.1,415.1,1.3,-0.9,14:42:06
0.1,408.5,1.2,-0.9,14:42:07
13.9,412.1,29.7,177.2,14:42:08
0.1,413.4,1.3,-0.9,14:42:09
0.1,413.8,1.3,-0.9,14:42:10
My current code:
import pandas as pd
import csv
import datetime as dt
#Read .csv file, get timestamp and split it into date and time separately
Data = pd.read_csv('filedata.csv', parse_dates=['Time_Stamp'], infer_datetime_format=True)
Data['Date'] = Data.Time_Stamp.dt.date
Data['Time'] = Data.Time_Stamp.dt.time
#print (Data)
print (Data['Time_Stamp'])
Data['Time_Stamp'] = pd.to_datetime(Data['Time_Stamp'])
#Read timestamp within a certain range
mask = (Data['Time_Stamp'] > '2017-06-12 10:48:00') & (Data['Time_Stamp']<= '2017-06-12 11:48:00')
june13 = Data.loc[mask]
#print (june13)
What I'm trying to do is read the data in 5-second chunks, and if any one of the 5 seconds has a c1 value of 10.0 or above, replace that value of c1 with 0.
I'm still new to Python and I could not find examples for this. May I have some assistance, as this problem is way beyond my Python programming skills for now? Thank you!
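For the threshold check on its own, a vectorized pandas replacement could look like this sketch (assuming a c1 column as in the sample data; the 5-second chunking is left aside):
import pandas as pd

# Sketch: zero out c1 wherever it is 10.0 or above.
Data = pd.read_csv('filedata.csv')
Data.loc[Data['c1'] >= 10.0, 'c1'] = 0.0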
I don't know the modules around csv files, so my answer might look primitive, and I'm not quite sure what you are trying to accomplish here, but have you thought of dealing with the file textually?
From what I get, you want to read every c1 value, check it, and modify it if needed.
To read and modify the file, you could do:
with open('filedata.csv', 'r+') as csv_file:
    lines = csv_file.readlines()
    # For each line, isolate the first value and check - and modify - it if needed.
    # I'm seriously not sure, you might have wanted to read only one out of five lines.
    # For that, just do a while loop with an index which increments through lines by 5.
    for i, line in enumerate(lines[1:], start=1):  # skip the header line
        values = line.split(',')  # split comma-separated values
        # Check condition and apply the needed change.
        if float(values[0]) >= 10:
            values[0] = "0"  # directly as a string
        # Transform the list back into a single string and store it back.
        lines[i] = ",".join(values)
    # Rewrite the file.
    csv_file.seek(0)
    csv_file.writelines(lines)
    csv_file.truncate()  # in case the fixed content is shorter
# Here you are ready to use the file just like you were already doing.
# Of course, the above code could be put in a function for known advantages.
(I don't have python here, so I couldn't test it and typos might be there.)
If you only need the dataframe without the file being modified:
Pretty much the same to be honest.
Instead of the file-writing at the end, you could do:
from io import StringIO  # pandas needs a file-like object instead of a plain string

# Above code here, but without the file-rewriting at the end.
Data = pd.read_csv(
    StringIO("".join(lines)),
    parse_dates=['Time_Stamp'],
    infer_datetime_format=True
)
This should give you the Data you have, with changed values where needed.
Hope this wasn't completely off. Also, some people might find this approach horrible; we already have working modules for this kind of thing, so why bother dealing with the rough raw data ourselves? Personally, I think it's often much easier to understand how the text representation of a file can be used than to learn every external module I'll ever need. Your opinion might differ.
Also, this code might perform worse, as we iterate through the text twice (pandas does it again when reading). However, I don't think you'd get a faster result by reading the csv the way you already do and then iterating through the data to check the condition anyway. (You might save a cast per checked c1 value, but the difference is small, and iterating through a pandas dataframe may well be slower than through a list, depending on the state of its current optimisation.)
Of course, if you don't really need the pandas dataframe format, you could do it completely manually. It would take only a few more lines (or not, to be honest) and shouldn't be slower, since the amount of iteration would be minimized: you could check conditions on the data at the same time as you read it, as in the sketch below. Ask if there's anything!
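A rough sketch of that single-pass approach (untested; it assumes the sample layout above, with c1 in the first column and a header line):
# Single pass: check and fix each data line as it is read.
rows = []
with open('filedata.csv') as csv_file:
    header = csv_file.readline().strip().split(',')  # keep the header around
    for line in csv_file:
        values = line.strip().split(',')
        if float(values[0]) >= 10:  # values[0] is c1
            values[0] = '0'
        rows.append(values)
# 'rows' now holds the cleaned data as lists of strings, ready to use directly.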
I am trying to use ascii.read to scan in a large number of tabulated data files. The column headings do not appear to have self-consistent delimiters; there are anywhere from 2 to 7 spaces between column headers. The files look something like this:
K WAVELN NEFF GEOM ALB BOND/QFIT BOND/GEOM Q-FITTED
1 0.3000000 0.0556723 0.0000000 0.0000000 2.1435934 2.0582306
[...]
[...]
I first suspected I could treat them as tabs; however, this does not appear to be the case:
raw = ascii.read('filename', delimiter='\t')
will read the file but returns only a quite useless single column of data.
Now, this would not be a problem in normal cases - a simple
delimiter='\s'
could have done the trick. However, much to my frustration, one column is named "GEOM ALB" - complete with a space in the middle. This fouls up the delimiter, as ascii.read thinks this is two column headers, not one:
raw = ascii.read('filename', delimiter='\s')
InconsistentTableError: Number of header columns (8) inconsistent with data columns (7) at data line 0
This is solvable by replacing the "GEOM ALB" header with "GEOM_ALB" in the files in question; however, I would prefer to avoid spending the time to write a script to do that, particularly if there is a simpler and more elegant solution.
I found a workaround for my problem here. By calling ascii.read as
raw = ascii.read('filename', guess=False, header_start=None, data_start=2, names=('K', 'WAVELN', 'NEFF', 'GEOM ALB', 'BOND/QFIT', 'BOND/GEOM', 'Q-FITTED'))
I was able to bypass ascii.read's attempts to find and apply header names and define them myself. The key of course being
header_start=None
which tells ascii.read that there are no headers.
Sorry, I know this question has been asked many times, but I really can't find a solution that solves my problem.
I am using the pyspark module in Python to read a file:
data = sc.textFile("data/text_data.csv")
After some data cleaning, I get two columns, both of which contain Chinese characters. However, the first three records look like below, where the elements of the first tuple are the column names.
[('aybh_zw', 'jyaq'),
('������', '�ڶ��綫·�\U000ffd7c�\u0530�ſ�������������'),
('030', 'FF5E84D38B5B48CF97F26B5E6DAB4DD8')]
So I did this transformation next:
second_cleaned_data = first_cleaned_data.map(lambda s: (s[0].encode('UTF-8'), s[1].encode('UTF-8')))
However, the data becomes the following:
[(b'aybh_zw', b'jyaq'),
(b'\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd',
(b'030', b'FF5E84D38B5B48CF97F26B5E6DAB4DD8')]
Where I omitted the second element in the second tuple due to some stackoverflow formatting issues.
Since this is still incorrect, I then did the following and got:
[('aybh_zw', 'jyaq'),
('锟斤拷锟斤拷锟斤拷', '锟节讹拷锟界东路锟襟康硷拷园锟脚匡拷锟'),
('030', 'FF5E84D38B5B48CF97F26B5E6DAB4DD8')]
Still, this is incorrect since those characters are not what they should be.
So could anyone help me with this? I really don't know what to do.
Thank you
A sample of the csv file is below:
recordkey ajbh ajjf ajlx ajlx_zw ajly ajly_zw ajmc ajzt
QTIwMTUwNjAwMDFfMzcxNDAwMDE A2015060001 0 2 刑事 1 110指令 张俊杰被盗窃案 202 已立案 212000002 盗窃罪 盗窃罪 371499 经济 2.02E+13 东风路电业局宿舍 6/13/15 7:08 19B2569194BB4471E0530390300A15A6
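For what it's worth, the 锟斤拷 pattern means the text was decoded as GBK after the original bytes had already been replaced with U+FFFD characters; sc.textFile always decodes as UTF-8, so the information is lost at read time and no later encode/decode can bring it back. If the source file is actually GBK/GB18030-encoded, one workaround is to read the raw bytes and decode them explicitly. A sketch, assuming a GB18030-encoded file:
# Sketch: read raw bytes and decode with the file's real encoding
# (assumed GB18030 here), instead of letting textFile decode as UTF-8.
raw = sc.binaryFiles("data/text_data.csv")  # RDD of (path, contents as bytes)
lines = raw.flatMap(lambda kv: kv[1].decode('gb18030').splitlines())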
I am writing a bit of Python code to automate the manipulation of Excel spreadsheets. The idea is to use spreadsheet templates to create daily reports. I saw this idea working several years ago using Perl. Anyway.
Here are the simple rules:
Sheets within the Workbook are processed in the order they appear.
Within the sheets, cells are processed left to right, then top to bottom.
There are defined names which are single-cell ranges; they can contain static values or the results of queries. Cells can contain comments which hold SQL queries to run. ...
Here is the problem: as I process the cells, I need to check whether the cell has an attached comment and whether the cell has a name. I am able to handle processing the attached cell comments, but I cannot figure out how to determine if a cell is within a named range (in my case, a single-cell range).
I saw a posting that suggested this would work:
cellName = ws.ActiveCell.Name.Name
No luck.
Does anybody have any idea how to do this?
I am so close but no cigar.
Thanks for your attention to this matter.
KD
What you may consider doing is first building a list of all addresses of names in the worksheet, and checking the address of each cell against the list to see if it's named.
In VBA, you obtain the names collection (all the names in a workbook) this way:
Set ns = ActiveWorkbook.Names
You can determine if the names are pointed toward part of the current sheet, and a single cell, this way:
shname = ActiveSheet.Name
Dim SheetNamedCellAddresses() As String
ReDim SheetNamedCellAddresses(1 To ActiveWorkbook.Names.Count)
i = 1
For Each n In ns
    ' The name's Value is something like "=Sheet1!A1".
    ' If there is no colon, it is a single cell, not a range of cells.
    If Split(n.Value, "!")(0) = "=" & shname And InStr(n.Value, ":") = 0 Then
        SheetNamedCellAddresses(i) = Split(n.Value, "=")(1) ' Add the address to the array, without the "="
        i = i + 1
    End If
Next
So now you have a string array containing the addresses of all the named cells in your current sheet. Move that array into a python list and you are good to go.
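A rough Python translation of that VBA via win32com might look like this untested sketch (sheet names containing spaces would need extra quote handling):
import win32com.client as win32

excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.ActiveWorkbook
sheet_name = excel.ActiveSheet.Name

# Collect the addresses of all single-cell names that point at the active sheet.
named_cell_addresses = []
for n in wb.Names:
    target, _, address = n.Value.partition('!')  # Value is e.g. "=Sheet1!$A$1"
    if target == '=' + sheet_name and ':' not in address:
        named_cell_addresses.append(address)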
OK, so it errors out if the cell does NOT have a range name. If the cell has a range name, the following bit of code returns the name. Great success!
ws.Cells(r,c).Activate()
cell = xlApp.ActiveCell
cellName = cell.Name.Name
If there is no name associated with the cell, an exception is thrown.
So even in VBA you would have to wrap this bit of code in exception handling. Using exceptions for this check sounds expensive to me.
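The exception wrapper described above could look something like this sketch; pywintypes.com_error is the exception win32com raises for COM failures:
import pywintypes

def cell_name(ws, r, c):
    """Return the defined name of a cell, or None if it has no name."""
    try:
        return ws.Cells(r, c).Name.Name
    except pywintypes.com_error:
        return None  # no name attached to this cell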
I've already asked the root question but I thought I might see if I can get more help with this. I'm trying to work with XlDirectionDown in order to select the last filled cell in an Excel spreadsheet.
Ultimately, I'd like to use Python to select all filled cells in this sheet from A through AE. It will be copied into a text file and appended into SQL Server...so I don't want any blanks.
What I have so far:
import win32com.client as win32
excel = win32.gencache.EnsureDispatch('Excel.Application')
excel.Visible = 1
excel.Workbooks.Open('G:/working.xlsx')
XlDirectionDown = 4
last = excel.Range("A:A").End(XlDirectionDown)
excel.Range("A1:A"+str(last)).Select()
First of all, the XlDirectionDown does not seem to work. The cursor in Excel remains on the first cell.
Secondly, I get an exception for the last line in this code (something to do with Range). Does anybody understand what's going on with this code? Also, is there ANY documentation on win32com or Pywin32 out there?? I can't find any how-tos! Thanks as always, everyone.
I have used a specific cell rather than a range of cells as the starting point. Replace
last = excel.Range("A:A").End(XlDirectionDown)
with
last = excel.Range("A1:A1").End(XlDirectionDown)
However, if there are any blank cells, this will stop just before the first one. You probably want to use UsedRange() instead. This is the smallest range that contains all your cells, according to Excel; you may find (as I have) that the resulting range is wider than AE (it contains blank columns at the end) and has many entirely blank rows at the bottom. However, since you want to filter out blank cells anyway, those will be skipped during filtering.
As to the exception on the last line of code: End returns a Range object, and you can't convert a Range to a string; even if you could, str(last) would not be a row number, so "A1:A"+str(last) would be an invalid range.
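So a corrected version of those two lines might look like the sketch below, using the Row property of the Range that End returns:
XlDirectionDown = 4
last = excel.Range("A1:A1").End(XlDirectionDown)
# 'last' is a Range; its Row property gives a number we can put in an address.
excel.Range("A1:A" + str(last.Row)).Select()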
As to filtering out blank cells, I'm not sure what that means: when you copy the data to a text file, what will you put for blank cells? If you have "A blank C", will you write "A C"? The C would end up in the wrong column of your database. Anyway, just something that caught my attention.
There is no single place for win32com documentation, although the Python on Windows book has a lot of info, and Google gets you quite useful results, including SO hits. The one thing that keeps tripping me up whenever I use Excel COM (this is not specific to Python's win32com) is that everything in a workbook is a Range; you can't have an individual cell. Even when some methods or properties might lead you to think you are getting a cell, you are actually getting a range, so it often takes a bit of extra thinking to get to the desired cell.
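For example, asking for a single cell still hands back a 1x1 Range (a small sketch in the same win32com session as above):
# Cells(1, 1) is not a bare value but a 1x1 Range object.
cell = excel.ActiveSheet.Cells(1, 1)
print(cell.Row, cell.Column, cell.Value)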
I got started with win32com and Excel here.
In your code, what does excel.Range("A:A").End(XlDirectionDown) return? Test it. You might want to add .Select() and then use excel.Selection.Address to get the last cell. Test it in interactive mode; it's easier to see what's going on there.
As an alternative, you can use a while loop to go through your cells. This code loops through the rows until it hits an empty cell:
excel.Range("A1").Select()
while excel.ActiveCell.Value:
val = excel.ActiveCell.Value
print(val)
excel.ActiveCell.Offset(2,1).Select() # Move a row down
The last line is a bit funny: in VBA you would write Offset(1,0) to go one row down, but through win32com you have to add one to both the row and the column. Maybe due to indexing?