Pandas won't separate columns in my comma-separated .txt file - python

I'm aware this has been asked so many times, but it's left me really scratching my head. I have a .txt file which looks like:
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
and so on for multiple rows.
So that's 64+64 = 128 columns separated by commas, while each row is enclosed in double quotes.
I have used the commands:
#Used this initially
df = pd.read_csv('test_data.txt')
#Used this after reading more stackoverflow answers
df = pd.read_csv('test_data.txt', header = None, sep=",", delimiter=',', quotechar='"', index_col = None)
I know sep and delimiter are the same parameter, but I tried both anyway. I shouldn't have to specify them at all, since pandas uses the comma by default.
After this I'm just using:
df.head()
And it outputs:
0
0 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
1 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
2 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
3 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
4 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
It just reads everything as one column. Please advise on how I can read all 128 columns.

This will get you to the desired result:
df = pd.read_csv('test_data.txt', header=None)
df = pd.DataFrame(df[0].str.split(',').tolist())
This will read your file, which has each row wrapped in quote marks, and pack it into a single column.
Then you split that column on commas and construct a new DataFrame from the result.
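The split leaves you with string columns. As a minimal sketch (assuming every field really is numeric), the same thing can be done in one pass with expand=True and a cast:
import pandas as pd

df = pd.read_csv('test_data.txt', header=None)
# expand=True splits directly into 128 columns instead of a column of lists;
# astype(float) assumes every field is numeric
df = df[0].str.split(',', expand=True).astype(float)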

Related

How can I clean text in an Excel file with Python?

I have an Excel file with numbers (integers) in some rows of the first column (A) and text in all rows of the second column (B):
I want to clean this text, that is, I want to remove tags like <br>. My current approach doesn't seem to work:
file_name = r"F:\Project\comments_all_sorted.xlsx"  # raw string, so the backslashes aren't treated as escape sequences
import pandas as pd
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)
clean_df = df.replace('<br>', '')
clean_df.to_excel('output.xlsx')
What this code does (which I don't want it to do) is add running numbers in the first column (A), overwriting the few numbers that were already there, and add a first row with '1' in its second column (cell B1):
I'm sure there's an easy way to solve my problem and I'm just not trained enough to see it.
Thanks!
Try this:
df['column_name'] = df['column_name'].str.replace(r'<br>', '')
The index in the output file can be turned off with index=False in the df.to_excel function, i.e,
clean_df.to_excel('output.xlsx', index=False)
By default, DataFrame.replace only matches entire cell values, so df.replace('<br>', '') won't strip <br> out of longer strings; you need the .str.replace accessor on each column (or pass regex=True, as sketched after the code below). In this case, I just iterate through all columns in case there are more than just the one column.
To get rid of the first column with the sequential numbers (that's the index of the dataframe), add the parameter index=False. The number 1 on the top is the column name. To get rid of that, use header=False:
import pandas as pd

file_name = r"F:\Project\comments_all_sorted.xlsx"  # raw string, so backslashes aren't escapes
# no header row, no index column, use only column B (which contains the text)
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B')
clean_df = df.copy()
for col in clean_df.columns:
    clean_df[col] = df[col].str.replace('<br>', '')
clean_df.to_excel('output.xlsx', index=False, header=False)
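For what it's worth, a whole-DataFrame replace does work once you pass regex=True, because it then performs substring replacement in every string cell; a minimal sketch of that variant:
# regex=True makes replace match substrings rather than whole cell values
clean_df = df.replace('<br>', '', regex=True)
clean_df.to_excel('output.xlsx', index=False, header=False)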

How to skip a line with more/less than 6 values in a .txt file when importing using Pandas

I have a .txt file with 170k rows. I am importing the txt file into pandas.
Each row has a number of values separated by a comma.
I want to extract the rows with 9 values.
I am currently using:
data = pd.read_csv('uart.txt', sep=",")
The first thing you should try is preprocessing the file:
import csv

# rewrite the file, keeping only the rows that have exactly 9 values
with open('uart.txt', 'r', newline='') as inp, open('uart_processed.txt', 'w', newline='') as outp:
    inp_csv = csv.reader(inp)
    outp_csv = csv.writer(outp)
    for row in inp_csv:
        if len(row) == 9:
            outp_csv.writerow(row)
There may be a more efficient way to do it, but it's the simplest thing you can do, and it entirely removes invalid rows.
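If you'd rather not write a second file, the same filtering can be done in memory before handing the data to pandas (a sketch using io.StringIO; header=None assumes the file has no header row):
import csv
import io
import pandas as pd

buf = io.StringIO()
with open('uart.txt', 'r', newline='') as inp:
    writer = csv.writer(buf)
    for row in csv.reader(inp):
        if len(row) == 9:  # keep only the 9-value rows
            writer.writerow(row)
buf.seek(0)  # rewind the buffer so read_csv starts at the beginning
data = pd.read_csv(buf, header=None)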
As @ksooklall answered, if you need only 2 columns, then for simplicity:
[a,b,c,d] will be in your DataFrame as [a, b]
[e] as [e, NaN]
So, if you're ok with that - go ahead and no preprocessing required.
If you know the names of the 9 columns, you can do:
df = pd.read_csv('uart.txt', names='abcdefghj')
This will only read the first 9 columns.
As long as your header rows are fine, you can use:
data = pd.read_csv('uart.txt', sep=",", error_bad_lines=False, warn_bad_lines=True)
This will skip any lines having more than the expected number of values and will also report which lines were skipped.
If you know the rest of the actual data (i.e. the lines that have 9 values) doesn't have any missing values, then you can chain dropna after reading it in to drop all rows that have fewer than 9 records, i.e. data = pd.read_csv('uart.txt', sep=",", error_bad_lines=False, warn_bad_lines=True).dropna()
However, if the records that have 9 values can have NAs (e.g. 242,2421,,,,,,,1) then I don't think there's a built-in way in Pandas and you'd have to pre-process the csv before reading it in.
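Note that in pandas 1.3 and later, error_bad_lines and warn_bad_lines are deprecated in favor of a single on_bad_lines parameter; a sketch of the equivalent call:
import pandas as pd

# 'warn' skips each bad line and emits a warning; use 'skip' to drop them silently
data = pd.read_csv('uart.txt', sep=',', on_bad_lines='warn')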

Reading .txt file columns into pandas data frame and creating new columns

I have a .txt file which looks like the following:
156 2.87893e+06
157 968759
699 2.15891e+06
700 44927.2
1108 830338
1156 70513.5
1172 64263.2
The above file is the output I obtain from running a C++ program. I want this output file in a pandas df.
I tried following code:
df = pd.read_csv("output.txt", index_col=0)
df
But here, the columns from the .txt file become one column in the df. I want the values in 2 separate columns, maybe with column headings.
Or, since it's a text file, if they are not 2 different columns in the .txt file, then each row has 2 values separated by a space. How can I get them into two different cols in a pandas df?
Also, I tried the following code:
df = pd.read_csv("output.txt")
df.iloc[:,(0)]
Now the very first row from the original text file doesn't appear at all, and again both values end up in one column.
The default delimiter for pandas.read_csv is the comma (,); you need to explicitly specify the sep parameter to read a space-delimited file:
df = pd.read_csv("output.txt", sep = " ", index_col=0, header=None)

pandas.read_csv not partitioning data at semicolon delimiter

I'm having a tough time correctly loading a csv file into a pandas dataframe. The file is a csv saved in MS Excel, where the rows look like this:
Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7";"0.02";"0.09";"0.16";"284.88";"10.32";"
I am using
filep="file_name.csv"
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";")
(I have tried several combinations and alternatives of read_csv arguments, but without any success... I have also tried read_table.)
What I want is for each semicolon-separated value to end up in a separate column of the dataframe (I understand that read_csv works this way(?)).
Unfortunately, I always end up with the whole row placed in the first column of the dataframe. So basically after loading I have many rows, but only one column (two if I also count the index).
I have placed a sample here:
datafile
Any ideas welcome.
Add quoting = 3. 3 stands for csv.QUOTE_NONE.
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";", quoting = 3)
This will give a [7 rows x 23 columns] dataframe.
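For readability, the magic number can be replaced with the named constant from the csv module:
import csv
import pandas as pd

raw_data = pd.read_csv(filep, engine="python", index_col=False,
                       header=None, delimiter=";", quoting=csv.QUOTE_NONE)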
The problem is the enclosing quote characters, which can be side-stepped by escaping the delimiter with a \ character (the python engine then treats the separator as a regular expression):
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None, delimiter=r'\;')

python pandas read_csv unable to read character double quoted twice

I'm trying to read a 2-column csv file (error.csv) with a semicolon separator which contains double-quoted semicolons:
col1;col2
2016-04-17_22:34:25.126;"Linux; Android"
2016-04-17_22:34:25.260;"{"g":2}iPhone; iPhone"
And I'm trying:
logs = pd.read_csv('error.csv', na_values="null", sep=';',
quotechar='"', quoting=0)
I understand that the problem comes from having a double-quoted "g" inside my double quotes on line 3, but I can't figure out how to deal with it. Any ideas?
You will probably need to pre-process the data so that it conforms to the expected CSV format. I doubt pandas will handle this just by changing a parameter or two.
If there are only two columns, and the first never contains a semi-colon, then you could split the lines on the first semi-colon:
import pandas

records = []
with open('error.csv', 'r') as fh:
    # first row is a header
    header = next(fh).strip().split(';')
    for rec in fh:
        # split only on the first semi-colon
        date, dat = rec.strip().split(';', maxsplit=1)
        # assemble records, removing quotes from the second column
        records.append((date, dat.strip('"')))

# create a data frame
df = pandas.DataFrame.from_records(records, columns=header)
You will have to parse the dates yourself with the datetime module if you want the first column to contain proper dates and not strings.
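Alternatively, pandas can parse them directly; a sketch using the header names from the sample file (the format string matches the timestamps shown above):
df['col1'] = pandas.to_datetime(df['col1'], format='%Y-%m-%d_%H:%M:%S.%f')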
