NaN issue with pandas.read_csv - python

I am trying to read a data file with a header. The data file is attached and I am using the following code:
import pandas as pd
data = pd.read_csv('TestData.out', sep=' ', skiprows=1, header=None)
The issue is that I have 20 columns in my data file, while I am getting 32 columns in the variable data. How can I resolve this issue? I am very new to Python and I am learning.
Data_File

Your text file has two spaces together in front of any value that does not have a minus sign. If sep=' ', pandas sees this as two delimiters with nothing (NaN) in between.
This will fix it:
data = pd.read_csv('TestData.out', sep=r'\s+', skiprows=1, header=None)
In this case sep is interpreted as a regex, which looks for "one or more whitespace characters" as the delimiter, and this returns columns 0 through 19.
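A self-contained sketch of both behaviours, using io.StringIO and made-up numbers shaped like the question's file (two spaces before any value without a minus sign):

```python
import io
import pandas as pd

# Two spaces precede values without a minus sign, one space otherwise
text = "-1.0  2.0 -3.0\n-4.0  5.0 -6.0\n"

# sep=' ' treats each space as a delimiter, so the double spaces
# produce empty fields that become an all-NaN column
bad = pd.read_csv(io.StringIO(text), sep=" ", header=None)

# sep=r'\s+' treats any run of whitespace as one delimiter
good = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None)

print(bad.shape)   # has an extra all-NaN column
print(good.shape)  # the expected three columns
```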

Your data file has inconsistent space delimitation, so you just have to skip the extra space after each delimiter. This simple code works:
data = pd.read_csv('TestData.out', sep=' ', skiprows=1, skipinitialspace=True)

Related

Split column in several columns by delimiter '\' in pandas

I have a txt file which I read into a pandas dataframe. The problem is that inside this file my text data is recorded with the delimiter '\'. I need to split the information in one column into several columns, but it does not work because of this delimiter.
I found this post on Stack Overflow for just one string, but I don't understand how to apply it when I have a whole dataframe: Split string at delimiter '\' in python
After reading my txt file into df, it looks something like this:
df
column1\tcolumn2\tcolumn3
0.1\t0.2\t0.3
0.4\t0.5\t0.6
0.7\t0.8\t0.9
Basically what I am doing now is the following:
df = pd.read_fwf('my_file.txt', skiprows=8)  # I use skiprows because there is irrelevant text
df['column1\tcolumn2\tcolumn3'] = "r'" + df['column1\tcolumn2\tcolumn3'] + "'"  # I try to make it a raw string as the post suggested, but it does not really work
df['column1\tcolumn2\tcolumn3'].str.split('\\', expand=True)
and what I get is just the following (displayed as plain text inside the dataframe):
r'0.1\t0.2\t0.3'
r'0.4\t0.5\t0.6'
r'0.7\t0.8\t0.9'
I am not very good with regular expressions and this seems a bit hard; how can I tackle this problem?
It looks like your file is tab-delimited, because of the "\t". This may work:
pd.read_csv('file.txt', sep='\t', skiprows=8)
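If the "\t" in the file are real tab characters, sep='\t' above is enough. If they are literal two-character backslash-t sequences (which the str.split attempt in the question suggests), a regex separator handles them. A minimal sketch assuming literal backslash-t text:

```python
import io
import pandas as pd

# Sample mimicking the question, with literal backslash-t sequences (not real tabs)
text = "column1\\tcolumn2\\tcolumn3\n0.1\\t0.2\\t0.3\n0.4\\t0.5\\t0.6\n"

# sep=r"\\t" is a regex matching a literal backslash followed by 't';
# multi-character separators require the python engine
df = pd.read_csv(io.StringIO(text), sep=r"\\t", engine="python")
```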

How to use 'Shift In' from text in csv file to split columns

I'm trying to import csv style data from a software designed in Europe into a df for analysis.
The data uses two characters to delimit the data in the files: 'DC4' and 'SI' ("Shift In", I believe). I'm currently concatenating the files into a df with read_csv, delimiting on the 'DC4' character. Then I use a regex replace to turn all the 'SI' characters into ';' in the df. Next I skip every other line to remove identifiers I don't need. If I open the data at this point, everything is split by 'DC4' and all 'SI' are converted to ';'.
What would you suggest to further split the df by the ';' character now? I've tried to split the df with the Series string methods but got type errors. I've exported to csv and reimported it using ';' as the delimiter, but it doesn't split the existing columns that were already split on the first import, for some reason. I also get parser errors on some rows far down the df, so I think there are dirty rows (this is just information I've found; if not helpful, please ignore it). I can ignore those lines without affecting the data I need.
The size of the df is around 60-70 columns and usually less than 75K rows when I pull a full report. I'm using PyCharm and Python 3.8. Thank you all for any help on this, I very much appreciate it. Here is my code so far:
path = file directory location
df = pd.concat([pd.read_csv(f, sep='\x14',  # the literal DC4 control character
                            comment=" ", na_values='Nothing', header=None, index_col=False)
                for f in glob.glob(path + ".file extension")], ignore_index=True)
df = df.replace('\x0f', ';', regex=True)  # the literal SI control character
df = df.iloc[::2]
df.to_csv(r'new_file_location', index=False, encoding='utf-8-sig')
So you have a CSV (technically not a CSV, I guess) that's separated by two different values (DC4 and SI), and you want to read it into a dataframe?
You can do so directly with pandas: the read_csv function accepts regex delimiters, so you could use "\x0f|\x14" to treat either SI or DC4 as a separator (note ASCII SI is 0x0F): pd.read_csv(path, sep="\x0f|\x14")
An example with readable characters:
The csv contains:
col1,col2;col3
val1,val2,val3
val4;val5;val6
Which can be read as follows:
import pandas as pd
df = pd.read_csv(path, sep=",|;")
which results in df being:
col1 col2 col3
0 val1 val2 val3
1 val4 val5 val6
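The readable-character example above can be run self-contained, with io.StringIO standing in for the file:

```python
import io
import pandas as pd

csv_text = "col1,col2;col3\nval1,val2,val3\nval4;val5;val6\n"

# A regex separator matching either ',' or ';'; regex separators
# need the python engine
df = pd.read_csv(io.StringIO(csv_text), sep=",|;", engine="python")
print(df)
```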

Converting Comma-delimited CSV to Tab-delimited CSV in pandas

I am using Python. I have a CSV file with values separated by tabs.
I applied a rule to each of its rows and created a new csv file; the resulting dataframe is comma separated, and I want this new csv to be tab separated as well. How can I do it?
I understand that sep='\t' can work, but where do I apply it?
I applied the following code, but it didn't work either:
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', header=None)
df_norm = df.apply(lambda x: np.where(x > 0, x / x.max(), np.where(x < 0, -x / x.min(), x)), axis=1)
df_norm.to_csv("file.csv", sep="\t")
Have you tried this?
pd.read_csv('file.csv', sep='\t')
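sep='\t' applies on both sides: in read_csv when the input is tab-separated, and in to_csv when writing tab-separated output. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# sep="\t" in to_csv produces tab-separated output;
# with no path given, to_csv returns the result as a string
out = df.to_csv(sep="\t", index=False)
print(out)
```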
I found the issue: the rule had changed the dtype to 'object', because of which I was unable to perform any further operations. I followed Remove dtype at the end of numpy array and converted my data frame to a list, which solved the issue.
df = pd.read_csv('data.csv', header=None)
df_norm = df.apply(lambda x: np.where(x > 0, x / x.max(), np.where(x < 0, -x / x.min(), x)), axis=1)
df_norm = df_norm.tolist()
df_norm = np.squeeze(np.asarray(df_norm))
np.savetxt('result.csv', df_norm, delimiter=",")

Pandas read_csv - Ignore Escape Char in Semicolon-Separated File

I am trying to load a semicolon-separated txt file, and there are a few instances where escaped chars are in the data. These are typically &lt; HTML entities, whose trailing semicolon adds an extra delimiter. This obviously messes up my data, and since dtypes are important it causes read_csv problems. Is there a way to tell pandas to ignore these when the file is read?
I tried deleting the chars from the file, and it works, but given that I want an automated process on millions of rows this is not sustainable.
df = pd.read_csv('file_loc.csv',
                 header=None,
                 names=column_names,
                 usecols=counters,
                 dtype=dtypes,
                 delimiter=';',
                 low_memory=False)
ValueError: could not convert string to float:
My first column is a string and the second is a float, but if the first is split at the &lt; entity, part of it spills into the second too.
Is there a way to tell pandas to ignore these, or to efficiently remove them before loading?
Given the following example csv file so57732330.csv (note the raw &lt; entity in the data):
col1;col2
1&lt;2;a
3;
we read it using StringIO after unescaping named and numeric HTML5 character references:
import pandas as pd
import io
import html

with open('so57732330.csv') as f:
    s = f.read()
f = io.StringIO(html.unescape(s))
df = pd.read_csv(f, sep=';')
Result:
col1 col2
0 1<2 a
1 3 NaN
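The same approach, self-contained with io.StringIO standing in for the file:

```python
import html
import io
import pandas as pd

raw = "col1;col2\n1&lt;2;a\n3;\n"

# html.unescape turns &lt; into <, removing the stray semicolon
# before the parser ever sees it
df = pd.read_csv(io.StringIO(html.unescape(raw)), sep=";")
print(df)
```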

pandas.read_csv not partitioning data at semicolon delimiter

I'm having a tough time loading a csv file into a pandas dataframe correctly. The file is a csv saved from MS Excel, where the rows look like this:
Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7";"0.02";"0.09";"0.16";"284.88";"10.32";"
I am using
filep = "file_name.csv"
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None, delimiter=";")
(I have tried several combinations and alternatives of the read_csv arguments, but without any success... I have also tried read_table.)
What I want to see in my dataframe is each semicolon-separated value in a separate column (I understand that read_csv works this way(?)).
Unfortunately, I always end up with the whole row placed in the first column of the dataframe. So basically after loading I have many rows but only one column (two if I also count the index).
I have placed sample here:
datafile
Any idea welcomed.
Add quoting=3. 3 stands for csv.QUOTE_NONE; see the csv module documentation.
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None, delimiter=";", quoting=3)
This will give a [7 rows x 23 columns] dataframe.
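A minimal sketch of the quoting=3 fix on a shortened version of the sample row:

```python
import csv
import io
import pandas as pd

row = 'Montservis, s.r.o.;"2 012";"-14.98";"-34.68"\n'

# quoting=csv.QUOTE_NONE (i.e. 3) tells the parser to treat quote
# characters as ordinary text, so every ';' splits a field
df = pd.read_csv(io.StringIO(row), sep=";", header=None, quoting=csv.QUOTE_NONE)
print(df.shape)
```

Note the quote characters stay in the parsed values, so a follow-up cleanup (e.g. stripping '"') may be needed.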
The problem is the enclosing quote characters, which can be ignored by escaping with the \ character:
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None, delimiter='\;')
