pandas.read_csv not partitioning data at semicolon delimiter - python

I'm having a tough time loading a CSV file into a pandas DataFrame correctly. The file is a CSV saved from MS Excel, where the rows look like this:
Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7";"0.02";"0.09";"0.16";"284.88";"10.32";"
I am using
filep = "file_name.csv"
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None, delimiter=";")
(I have tried several combinations of read_csv arguments, but without any success. I have also tried read_table.)
What I want to see in my dataframe is each semicolon-separated value in its own column (I understand that read_csv works this way(?)).
Unfortunately, I always end up with the whole row placed in the first column of the dataframe. So basically, after loading I have many rows but only one column (two if I also count the index).
I have placed a sample here: datafile
Any ideas welcome.

Add quoting=3. The value 3 corresponds to csv.QUOTE_NONE, which tells the parser to ignore quote characters entirely.
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None, delimiter=";", quoting=3)
This will give a [7 rows x 23 columns] DataFrame.
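For readability, the named constant from the csv module can be used instead of the magic number; a minimal sketch (file name taken from the question):

import csv
import pandas as pd

# csv.QUOTE_NONE (== 3) disables quote handling, so every ";" splits a field
raw_data = pd.read_csv(
    "file_name.csv",
    engine="python",
    index_col=False,
    header=None,
    delimiter=";",
    quoting=csv.QUOTE_NONE,
)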

The problem is the enclosing quote characters, which can be bypassed by escaping the delimiter with a \ character:
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None, delimiter=r'\;')
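This works because the Python engine treats any separator longer than one character as a regular expression and splits lines without honoring quotes. The quote characters then remain inside the values; a sketch of stripping them afterwards:

import pandas as pd

raw_data = pd.read_csv("file_name.csv", engine="python", index_col=False, header=None, delimiter=r'\;')
# regex splitting leaves the literal " characters in the data, so strip them
raw_data = raw_data.apply(lambda col: col.str.strip('"') if col.dtype == object else col)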

Related

Split column in several columns by delimiter '\' in pandas

I have a txt file which I read into a pandas DataFrame. The problem is that the text data inside this file is recorded with the delimiter '\'. I need to split the information in one column into several columns, but it does not work because of this delimiter.
I found this post on Stack Overflow that deals with just a single string, but I don't understand how to apply it to a whole dataframe: Split string at delimiter '\' in python
After reading my txt file into df, it looks something like this:
df
column1\tcolumn2\tcolumn3
0.1\t0.2\t0.3
0.4\t0.5\t0.6
0.7\t0.8\t0.9
Basically what I am doing now is the following:
df = pd.read_fwf('my_file.txt', skiprows=8)  # I use skiprows because there is irrelevant text
df['column1\tcolumn2\tcolumn3'] = "r'" + df['column1\tcolumn2\tcolumn3'] + "'"  # try to make it a raw string as the linked post suggested, but it does not really work
df['column1\tcolumn2\tcolumn3'].str.split('\\', expand=True)
and what I get is just the following (displayed as plain text inside the DataFrame):
r'0.1\t0.2\t0.3'
r'0.4\t0.5\t0.6'
r'0.7\t0.8\t0.9'
I am not very good with regular expressions and this seems a bit hard. How can I tackle this problem?
It looks like your file is tab-delimited, because of the "\t". This may work:
pd.read_csv('file.txt', sep='\t', skiprows=8)
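If the data has already been loaded as a single string column, it can also be split in place. A sketch, assuming the file really contains tab characters:

import pandas as pd

df = pd.read_fwf('my_file.txt', skiprows=8)
col = df.columns[0]  # the single combined column
# split each row on the tab character and expand into separate columns
out = df[col].str.split('\t', expand=True)
out.columns = str(col).split('\t')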

NaN issue with pandas.read_csv

I am trying to read a data file with a header. The data file is attached and I am using the following code:
import pandas as pd
data=pd.read_csv('TestData.out', sep=' ', skiprows=1, header=None)
The issue is that I have 20 columns in my data file, while I am getting 32 columns in the variable data. How can I resolve this issue? I am very new to Python and still learning.
Data_File
Your text file has two spaces together in front of any value that does not have a minus sign. If sep=' ', pandas sees this as two delimiters with nothing (NaN) in between.
This will fix it:
data = pd.read_csv('TestData.out', sep=r'\s+', skiprows=1, header=None)
In this case sep is interpreted as a regex that looks for "one or more spaces" as the delimiter, and it returns columns 0 through 19.
Your data file has inconsistent space delimitation, so you just have to skip the extra space after the delimiter. This simple code works:
data = pd.read_csv('TestData.out', sep=' ', skiprows=1, skipinitialspace=True)
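To see the difference between the two approaches, a small self-contained example (hypothetical data mimicking the sign-dependent spacing described above):

import io
import pandas as pd

raw = "h1 h2 h3\n 1.0 -2.0  3.0\n-4.0  5.0 -6.0\n"
# sep=' ' treats every single space as a delimiter, so double spaces produce empty (NaN) fields
bad = pd.read_csv(io.StringIO(raw), sep=' ', skiprows=1, header=None)
# sep=r'\s+' collapses any run of whitespace into one delimiter
good = pd.read_csv(io.StringIO(raw), sep=r'\s+', skiprows=1, header=None)
print(bad.shape)   # more than 3 columns, padded with NaNs
print(good.shape)  # exactly (2, 3)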

How to skip a line with more/less than 6 values in a .txt file when importing using Pandas

I have a .txt file with 170k rows. I am importing the txt file into pandas.
Each row has a number of values separated by a comma.
I want to extract the rows with 9 values.
I am currently using:
data = pd.read_csv('uart.txt', sep=",")
The first thing you should try is preprocessing the file:
import csv

with open('uart.txt', 'r') as inp, open('uart_processed.txt', 'w') as outp:
    inp_csv = csv.reader(inp)
    outp_csv = csv.writer(outp)
    for row in inp_csv:
        # keep only the rows that have exactly 9 values
        if len(row) == 9:
            outp_csv.writerow(row)
There may be more efficient ways to do this, but it is the simplest thing you can do, and it entirely removes invalid rows.
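Once preprocessed, the cleaned file should load normally (assuming there is no header row):

import pandas as pd
data = pd.read_csv('uart_processed.txt', header=None)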
As @ksooklall answered, if you only need 2 columns, then for simplicity:
[a,b,c,d] will be in your DataFrame as [a, b]
[e] as [e, NaN]
So if you're OK with that, go ahead; no preprocessing is required.
If you know the names of the 9 columns, you can do:
df = pd.read_csv('uart.txt', names='abcdefghj')
This will only read the first 9 columns.
As long as your header rows are fine, you can use:
data = pd.read_csv('uart.txt', sep=",", error_bad_lines=False, warn_bad_lines=True)
This will skip any lines having more than the desired number of values and will also report which lines were skipped.
If you know that the rest of the actual data (i.e. the lines that have 9 values) has no missing values, then you can chain .dropna() after reading to also drop all rows that have fewer than 9 values:
data = pd.read_csv('uart.txt', sep=",", error_bad_lines=False, warn_bad_lines=True).dropna()
However, if the records that have 9 values can contain NAs (e.g. 242,2421,,,,,,,1), then I don't think there's a built-in way in Pandas and you'd have to preprocess the CSV before reading it in.
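Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 in favor of a single on_bad_lines parameter; a sketch of the equivalent call on recent versions:

import pandas as pd
# 'warn' reports and skips malformed lines; 'skip' drops them silently
data = pd.read_csv('uart.txt', sep=",", on_bad_lines='warn')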

Remove rows containing blank space in python data frame

I imported a CSV file into Python (using a pandas DataFrame) and there are some missing values in it. In the DataFrame I have rows like the following:
> 08,63.40,86.21,63.12,72.78,,
I have tried everything to remove the rows containing elements similar to the last element in the above data, and nothing works. I do not know whether the above is categorized as white space, an empty string, or something else.
Here is what I have:
result = pandas.read_csv(file,sep='delimiter')
result[result!=',,']
This did not work. Then I did the following:
result.replace(' ', np.nan, inplace=True)
result.dropna(inplace=True)
This also did not work.
result = result.replace(r'\s+', np.nan, regex=True)
This also did not work. I still see the row containing the ,, element.
Also my DataFrame is 100 by 1. When I import it from the CSV file, all the columns become 1. (I do not know if this helps.)
Can anyone tell me how to remove the rows containing ,, elements?
> Also my dataframe is 100 by 1. When I import it from CSV file all the columns become 1
This is probably the key, and IMHO it is weird. When you import a CSV into a pandas DataFrame, you normally want each field to go into its own column, precisely so that you can later process the column values individually. So (still IMHO) the correct solution is to fix that.
Now, to directly answer your (probably XY) question: you do not want to remove rows containing blank or empty columns, because each row only contains one single column, but rows containing consecutive commas (,,). So you should use:
df = df[~df.iloc[:, 0].str.contains(',,')]
I think your code should work with a minor change:
result.replace('', np.nan, inplace=True)
result.dropna(inplace=True)
In case you have several rows in your CSV file, you can avoid the extra conversion step to NaN:
result = pandas.read_csv(file)
result = result[result.notnull().all(axis=1)]
This will remove any row where there is an empty element.
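Equivalently, result.dropna() with its default arguments drops any row that contains at least one NaN, so result = result.dropna() achieves the same thing.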
However, your added comment explains that there is just one row in the CSV file, and it seems that the CSV reader shows some special behavior. Since you need to select the columns without NaN, I suggest these lines:
result = pandas.read_csv(file, header=None)
selected_columns = result.columns[result.notnull().any()]
result = result[selected_columns]
Note the header=None option with read_csv.
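A minimal end-to-end sketch of this approach (hypothetical single-row data with trailing empty fields, mimicking the question):

import io
import pandas as pd

raw = '08,63.40,86.21,63.12,72.78,,\n'
result = pd.read_csv(io.StringIO(raw), header=None)
# the trailing empty fields become NaN columns; keep only columns with data
selected_columns = result.columns[result.notnull().any()]
result = result[selected_columns]
print(result)  # five columns remain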

Pandas won't separate columns in my comma separated .txt file

I'm aware this has been asked so many times, but it's left me really scratching my head. I have a .txt file which looks like:
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
and so on for multiple rows.
So that's 64+64 = 128 columns separated by commas, while each row is enclosed in double quotes.
I have used the commands:
#Used this initially
df = pd.read_csv('test_data.txt')
#Used this after reading more stackoverflow answers
df = pd.read_csv('test_data.txt', header = None, sep=",", delimiter=',', quotechar='"', index_col = None)
I know sep and delimiter are the same parameter, but I tried both anyway; I shouldn't have to specify them at all, since pandas uses commas by default.
After this I'm just using:
df.head()
And it outputs:
0
0 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
1 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
2 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
3 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
4 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
It just reads it all as one column. Please advise on how I can read all 128 columns.
This will get you to the desired result:
df = pd.read_csv('test_data.txt', header=None)
df = pd.DataFrame(df[0].str.split(',').tolist())
This reads your file, whose rows are each wrapped in quote marks, into a single column.
Then you split that column on commas and construct a new DataFrame from the result.
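Since str.split produces strings, the new columns may need a numeric conversion afterwards; a sketch assuming purely numeric data as in the question:

import pandas as pd

df = pd.read_csv('test_data.txt', header=None)
df = pd.DataFrame(df[0].str.split(',').tolist())
# the split yields string values; convert every column to numbers
df = df.apply(pd.to_numeric)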
