I have a file named "sample name_TIC.txt". The first three columns in this file are useful: Scan, Time, and TIC. It also has 456 columns after the first 3 that I don't need. For other data processing, these unneeded columns have to go away. So I wrote a bit of code to start:
import os
import pandas as pd

os.chdir(main_folder)
mydir = os.getcwd()
nameslist = ['Scan', 'Time', 'TIC']

for path, subdirs, files in os.walk(mydir):
    for file in files:
        if file.endswith('TIC.txt'):
            myfile = os.path.join(path, file)
            TIC_df = pd.read_csv(myfile, sep="\t", skiprows=1,
                                 usecols=[0, 1, 2], names=nameslist)
Normally, the for loop sits inside a function that is iterated over a very large set of folders with a lot of samples, hence the os.walk stuff, but we can ignore that right now. This code will eventually be completed to save a new .txt file with only the 3 relevant columns.
The problem comes in the last line, the pd.read_csv call. It results in a dataframe whose index column comprises the data from the first 456 columns, while the last 3 columns of the .txt are given the names in nameslist and are callable as columns in pandas (e.g. using .iloc). This is not a multi-index; it is a single index containing all the data and whitespace of those first columns.
In the example code, sep="\t" because that's how Excel can successfully import it. But I've also tried:
sep="\s"
delimiter=r"\s+" rather than a sep argument
including header=None
not including the usecols argument (EDIT: I made an error here and did not check the proper result from this change; using usecols is in fact the correct solution. See the edit below or the answer.)
setting index_col=False
How can I get pd.read_csv to take the first 3 columns and ignore the rest?
Thanks.
EDIT: In my end-of-day foolishness, I made an error when changing the target df name to the example's TIC_df. In the original code set I took this from, it was named mz207_df, and my call function was still referencing the old df name.
Changing the last line of code to:
TIC_df = pd.read_csv(myfile,sep="\s+",skiprows=1, usecols[0,1,2],names=nameslist)
successfully resolved my problem. Using sep="\t" also worked. Sorry for wasting people's time. I will post this with an answer as well in case someone needs to learn about usecols like I did.
Answering here to make sure the problem gets flagged as answered, in case someone else searches for it.
I made an error when calling the result from the code which included the usecols=[0,1,2] argument; I was calling an older dataframe. The following line of code successfully generated the desired dataframe.
TIC_df = pd.read_csv(myfile,sep="\s+",skiprows=1, usecols=[0,1,2],names=nameslist)
Using sep="\t" also generated the correct dataframe, but I default to \s+ to accomdate different and varible formatting from analytical machine outputs.
I have a problem similar to the one described in this other question:
Reading from a CSV file while it is being written to
Unfortunately, the solution there is not explained.
I'd like to create a script that dynamically plots some variables from a .csv file. The .csv is updated every time a sensor registers something.
My basic idea is to read the file at a fixed interval and, if the number of rows has increased, update the plot with the new values.
How can I proceed?
I am not that experienced with csv, but take this logic:
import csv

def writeandRead(one_row):
    # "a" appends; if your case is plain writing, just change "a" to "w".
    with open("file.csv", "a", newline="") as f:
        csv.writer(f).writerow(one_row)
    # Read the whole file back and return its rows.
    with open("file.csv", "r", newline="") as f:
        return list(csv.reader(f))

for row in rows:  # or whatever you have: a list, a dictionary, ...
    print(writeandRead(row))
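And a minimal sketch of the polling idea from the question, assuming matplotlib and a two-column time/value .csv; the file name and column layout are illustrative:

import csv
import time
import matplotlib.pyplot as plt

CSV_PATH = "sensor_log.csv"  # hypothetical file the sensor appends to
POLL_SECONDS = 1.0

plt.ion()  # interactive mode so the figure can be refreshed in place
fig, ax = plt.subplots()
last_count = 0

while True:
    with open(CSV_PATH, newline="") as f:
        rows = list(csv.reader(f))
    if len(rows) > last_count:  # new rows arrived since the last poll
        last_count = len(rows)
        xs = [float(r[0]) for r in rows]  # assumes column 0 holds time
        ys = [float(r[1]) for r in rows]  # assumes column 1 holds the value
        ax.clear()
        ax.plot(xs, ys)
        plt.pause(0.01)  # give the GUI a moment to redraw
    time.sleep(POLL_SECONDS)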
I am trying to read this huge text file: https://www.dropbox.com/s/3ikikw8bxde6y1i/TCAD_SPECIAL%20EXPORT_2019_20200409.zip?dl=0 (if you download the zip, the file is Special_ARB.txt; downloading is not necessary for my question, imo).
I am running this code (adding error_bad_lines=False) to ignore lines with more-than-expected fields, which works well:
pd.read_csv(r'~/Special_ARB.txt', sep="|",
            header=None, encoding='cp1252', error_bad_lines=False)
The problem is that read_csv() crashes when a line has only 1 field, with the following error:
Too many columns specified: expected 77 and found 1
Is there a way to tell python/pandas to ignore this error? It does not tell me which line it is, and there are more than a million rows, so I can't just find it on my own.
I tried a for loop to read line by line and figure it out from there, but the data is so large that python crashed.
The number of columns is 77, which is correctly identified by pandas when running the code, so I don't think that's the issue.
Thanks,
Errors and Exceptions
Python Try Except
import pandas as pd

try:
    df = pd.read_csv(r'~/Special_ARB.txt', sep="|", header=None,
                     encoding='cp1252', error_bad_lines=False)
except pd.errors.ParserError as err:
    # <do this>: e.g. log the error, or fall back to a line-by-line read
    print(err)
This should work for in-memory datasets; for large datasets you can use chunking: https://stackoverflow.com/a/59331754/9379924
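A minimal sketch of the chunked variant (same read options as above; the chunk size is arbitrary):

import pandas as pd

chunks = []
reader = pd.read_csv(r'~/Special_ARB.txt', sep="|", header=None,
                     encoding='cp1252', error_bad_lines=False,
                     chunksize=100000)  # rows per chunk; tune to your RAM
for chunk in reader:
    chunks.append(chunk)  # or filter/process each chunk here instead
df = pd.concat(chunks, ignore_index=True)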
I'm still new to python and cannot manage to do what I'm looking for. I'm using Python 3.7.0.
I have one file, called log.csv, containing a log of CANbus messages.
I want to check the content of columns Data2 and Data3 when the value in the ID column is 348.
If they are both different from "00", I want to build a new string called fault_code from "Data3+Data2".
Then I want to check in another CSV file where this code string appears and print column 6 of that row (labelled Description). But I want to do this last part only one time per fault_code.
Here is my code:
import csv

CAN_ID = "348"

with open('0.csv') as log:
    reader = csv.reader(log, delimiter=',')
    for log_row in reader:
        if log_row[1] == CAN_ID:
            if (log_row[5] + log_row[4]) != "0000":
                fault_code = log_row[5] + log_row[4]
                with open('Fault_codes.csv') as fault:
                    readerFC = csv.reader(fault, delimiter=';')
                    for fault_row in readerFC:
                        if "0x" + fault_code in readerFC:
                            print("{fault_row[6]}")
Here is a part of the log.csv file:
Timestamp,ID,Data0,Data1,Data2,Data3,Data4,Data5,Data6,Data7,
396774,313,0F,00,28,0A,00,00,C2,FF
396774,314,00,00,06,02,10,00,D8,00
396775,348,2C,00,00,00,FF,7F,E6,02
and this is a part of Fault_codes.csv:
Level;LED Flashes;UID;FID;Type;Display;Message;Description;RecommendedAction
1;2;1;0x4481;Warning;F12001;Handbrake Fault;Handbrake is active;Release handbrake
1;5;1;0x4541;Warning;F15001;Fan Fault;blablabla;blablalba
1;5;2;0x4542;Warning;F15002;blablabla
Also, can you think of a better way to do this task? I've read that pandas can be very good for large files. Since log.csv can have 100,000+ rows, maybe it's a better idea to use it. What do you think?
Thank you for your help!
Be careful with your indentation; you get this error because you sometimes use spaces and sometimes tabs to indent.
As PM 2Ring said, reading 'Fault_codes.csv' every time you read 1 line of your log is really not efficient.
You should read the fault-code file once and store its content in RAM (if it fits). You can use pandas to do it and store the content in a DataFrame; I would do that before reading your logs.
You do not need to store all of log.csv's lines in RAM, so I'd keep reading it line by line with the csv module, do my stuff, write to a new file, and read the next line. No need to use pandas here, as it would fill your RAM for nothing.
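A minimal sketch of that approach; the column indices are assumed from the samples in the question, and the FID lookup plus once-per-code set are my reading of what's wanted:

import csv

CAN_ID = "348"

# Read the fault-code table once, keyed by the FID column (index 3).
fault_messages = {}
with open('Fault_codes.csv') as fault:
    for fault_row in csv.reader(fault, delimiter=';'):
        if len(fault_row) > 6:
            fault_messages[fault_row[3]] = fault_row[6]

# Stream the log line by line; report each fault_code only once.
seen = set()
with open('log.csv') as log:
    for log_row in csv.reader(log, delimiter=','):
        if log_row[1] == CAN_ID:
            fault_code = log_row[5] + log_row[4]  # Data3 + Data2
            if fault_code != "0000" and fault_code not in seen:
                seen.add(fault_code)
                message = fault_messages.get("0x" + fault_code)
                if message:
                    print(message)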
import pandas as pd

readfile = pd.read_csv('42.csv')
filevalues = readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jun-18\nQty']
filevalues = filevalues.fillna(0)
print(filevalues)
I have sales forecast files that all have the same format. When I change which file is read (right now I'm reading 42.csv), sometimes the null values in the columns fill with 0 and sometimes they do not.
I am unsure why this happens, as all the files have the same format and seem identical. Any thoughts on why this may be happening? And please let me know if a screenshot of my files is needed.
In addition, when I run this program without the
filevalues.fillna(0)
call, the files that do not fill with 0 still just show blank spaces, while the files that do fill with 0 show NaN. I suppose a better question is: both kinds of files have blank spaces, but python seems to detect some of them as NaN and some of them as just blanks and does not write anything there. Why??
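A quick diagnostic to tell the two cases apart (assuming the same file and column as above):

import pandas as pd

readfile = pd.read_csv('42.csv')
col = readfile['Jun-18\nQty']
print(col.isna().sum())   # blanks pandas already parsed as NaN
print((col == '').sum())  # blanks read in as empty strings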
import numpy as np

filevalues = filevalues.replace(r'^\s*$', np.nan, regex=True)
Figured it out! It seems that some of the files had empty strings, and some of the files were simply detected as NaN (no idea why). I used the line above and it worked!
I've been working on some dataframes with Python. I load them in using read_csv(filename, index_col=0) and it's all fine. The files also open fine in Excel. I also opened them in Notepad, and they seem alright; below is an example line:
851,1.218108787,0.636454978,0.269719611,-0.849476404,-0.143909689,0.050626813,-0.094248374,-0.3096134,-0.131347142,0.671271112,0.167593329,0.439417259,-0.198164647,-0.031552824,-0.215189948,-0.1791156,0.092648696,-0.107840318,-0.162596466,0.019324121,0.040572892,-0.008307331,-0.077819297,-0.023809355,-0.148229913,-0.041082835,0.138234498,-0.070986117,0.024788437,-0.050982962,0.24689969,0
The first column is, as I understand it, an index column. Then there's a bunch of principal components, and at the end is a 1/0.
When I try and load the file into WEKA, however, it gives me a nasty error and urges me to use the converter, saying:
Reason:
32 Problem encountered on line: 2
When I attempt to use the converter with the default settings, it states a new error:
Couldn't read object file_name.csv invalid stream header: 2C636F6D
Could anyone help with any of this? I can't provide the entire data file, but if requested I can try to cut out a few rows and paste only those if the error still occurs. Are there any flags I need to specify when saving a file to CSV in python? At the moment I just use .to_csv('x.csv').
I think the index column is the issue preventing WEKA from reading it; when you write using pandas.to_csv(), set index=False:
df.to_csv('x.csv', index=False)
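If that alone doesn't fix it, note that WEKA's CSV loader expects attribute names on the first line, so it may also help to write a header row. A sketch, with made-up column names:

# Hypothetical naming: one name per principal component plus the class label.
df.columns = [f"PC{i}" for i in range(1, len(df.columns))] + ["label"]
df.to_csv('x.csv', index=False)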