I'm trying to create a very human-readable script that will be multi-indexed. It looks like this:
A
    one : some data
    two : some other data
B
    one : foo
    three : bar
I'd like to use pandas' read_csv to read this in automatically as a multi-indexed DataFrame, with both \t and : used as delimiters, so that I can easily slice by section (i.e., A and B). I understand that something like header=[0,1] and perhaps tupleize_cols could be used to this end, but I can't get that far, since read_csv doesn't seem to want to handle both the tabs and the colons. If I use sep='[\t:]', it consumes the leading tabs. If I don't use the regexp and read with sep='\t', it gets the tabs right but doesn't handle the colons. Is this possible using read_csv? I could do it line by line, but there must be an easier way :)
This is the output I had in mind. I added labels to the indices and column, which could hopefully be applied when reading it in:
                           value
index_1 index_2
A       one           some data
        two     some other data
B       one                 foo
        three               bar
EDIT: I used part of Ben.T's answer to get what I needed. I'm not in love with my solution since I'm writing to a temp file, but it does work:
import re
import pandas as pd

with open('temp.csv', 'w') as outfile:
    for line in open(reader.filename, 'r'):
        if line[0] != '\t' or not line.strip():
            index1 = line.split('\n')[0]   # an un-indented (or blank) line is treated as a section name, e.g. 'A'
        else:
            outfile.write(index1 + ':' + re.sub('[\t]+', '', line))

pd.read_csv('temp.csv', sep=':', header=None,
            names=['index_1', 'index_2', 'Value']).set_index(['index_1', 'index_2'])
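A variant that avoids the temp file is to build the cleaned text in memory and hand it to read_csv via io.StringIO. This is just a sketch of the same logic ('input.txt' is a placeholder for the real file name, and it assumes the file starts with an un-indented section line):

import io
import re
import pandas as pd

buf = io.StringIO()
with open('input.txt') as infile:              # placeholder for the real file
    for line in infile:
        if not line.strip():
            continue                           # ignore blank lines
        if not line.startswith('\t'):
            index1 = line.rstrip('\n')         # section name, e.g. 'A' or 'B'
        else:
            buf.write(index1 + ':' + re.sub('[\t]+', '', line))
buf.seek(0)

df = pd.read_csv(buf, sep=':', header=None,
                 names=['index_1', 'index_2', 'Value']).set_index(['index_1', 'index_2'])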
You can use two delimiters in read_csv such as:
pd.read_csv( path_file, sep=':|\t', engine='python')
Note the engine='python' to prevent a warning.
EDIT: with your input format it seems difficult, but with input like:
A one : some data
A two : some other data
B one : foo
B three : bar
with a \t as the delimiter after A or B, you get a MultiIndex with:
pd.read_csv(path_file, sep=':|\t', header = None, engine='python', names = ['index_1', 'index_2', 'Value'] ).set_index(['index_1', 'index_2'])
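Once that runs, slicing by section (the original goal) works directly on the first index level; note that the second level may carry stray spaces from the ' : ' separator, so a .str.strip() could be needed before matching on it:

df = pd.read_csv(path_file, sep=':|\t', header=None, engine='python',
                 names=['index_1', 'index_2', 'Value']).set_index(['index_1', 'index_2'])
df.loc['A']   # every row in section A
df.loc['B']   # every row in section B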
I have a semicolon-delimited CSV read into a pandas DataFrame in which every column has dtype object. Some cells contain a double quote ("), a comma (,), or both (e.g. TES"T_ING,_VALUE). When I query the DataFrame with df.query on some condition to get a subset, the rows matching that pattern are omitted completely, while the remaining rows are returned just fine. A second requirement is that every " in the text needs a matching closing quote, but applying a lambda to replace " with "" is not working either. I have tried several methods, listed below.
Problem 1:
pd.read_csv("file.csv", delimiter=';')
pd.read_csv("file.csv", delmiter=';', thousands=',')
pd.read_csv("file.csv", delimiter=";", escapechar='"')
pd.read_csv("file.csv", delimiter=";", encoding='utf-8')
All of the above fail to load the data in question.
Problem 2:
Desired transformation: TES"T_ING,_VALUE → TES""T_ING,_VALUE
I have tried:
df.apply(lambda s: s.str.replace('"', '""'))
which doesn't do anything.
What exactly is going on? I haven't been able to find any questions tackling this particular type of issue anywhere.
Appreciate your help in advance.
EDIT: Sorry, I didn't provide mockup data originally due to sensitivity, but here is some fake data that illustrates the issue.
The following is a sample of how the CSV is structured:
Column1;Column2;Column3;Column4;Column5\n
TES"T_ING,_VALUE;Col2Value;Col3Value;Col4Value;Col5Value\n
Col1value;TES"T_ING,_VALUE2;Col3Value;Col4Value;Col5Value\n
I have tried quoting=csv.QUOTE_ALL/QUOTE_NONNUMERIC and quotechar='"' when loading the df, but the result ends up being:
Column1;Column2;Column3;Column4;Column5\n
"TES"T_ING,_VALUE;Col2Value;Col3Value;Col4Value;Col5Value";;;;\n
"Col1value;TES"T_ING,_VALUE2;Col3Value;Col4Value;Col5Value";;;;\n
So it interprets the whole row as the value of column 1 rather than actually splitting on the ; and applying the quoting only to column 1. I could iterate through each row of the df, split it myself, and load the remaining values into their respective columns, but the CSV is quite large, so that would take some time. The subset of the data the user queries on is supposed to be returned from an endpoint (that part is already working).
The problem was solved using pd.apply with a custom function to process each record:
df = pd.read_csv("csv_file.csv", delimiter=';', escapechar='\\')
def mapper(record):
if ';' in record['col1']:
content = record['col1'].split(';')
if len(content) == num_columns:
if '"' in content[0]:
content[0] = content[0].replace('"', '""')
record['col1'] = content[0]
# repeat for remaining columns
processed = df.apply(lambda x: mapper(x), axis=1)
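A possibly simpler alternative (a sketch, not verified against the real file): since the stray " characters are not CSV quoting at all, disabling quote handling lets the parser split on ; only and keep the quotes as literal text:

import csv
import pandas as pd

# quoting=csv.QUOTE_NONE treats " as an ordinary character, so a row like
# TES"T_ING,_VALUE;Col2Value;... is split on ';' into five plain fields.
df = pd.read_csv("csv_file.csv", sep=';', quoting=csv.QUOTE_NONE)

# Doubling the quotes afterwards (if still required) then works column by column:
df = df.apply(lambda s: s.str.replace('"', '""', regex=False) if s.dtype == object else s)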
I have a txt file which I read into a pandas dataframe. The problem is that inside this file my text data is recorded with the delimiter '\t'. I need to split the information in one column into several columns, but it does not work because of this delimiter.
I found this post on Stack Overflow, but it deals with just a single string and I don't understand how to apply it once I have a whole dataframe: Split string at delimiter '\' in python
After reading my txt file into df it looks something like this
df
column1\tcolumn2\tcolumn3
0.1\t0.2\t0.3
0.4\t0.5\t0.6
0.7\t0.8\t0.9
Basically what I am doing now is the following:
df = pd.read_fwf('my_file.txt', skiprows=8)  # I use skiprows because there is irrelevant text at the top
df['column1\tcolumn2\tcolumn3'] = "r'" + df['column1\tcolumn2\tcolumn3'] + "'"  # I try to make it a raw string as the post suggested, but it does not really work
df['column1\tcolumn2\tcolumn3'].str.split('\\', expand=True)
and what I get is just the following (just displayed like text inside a data frame)
r'0.1\t0.2\t0.3'
r'0.4\t0.5\t0.6'
r'0.7\t0.8\t0.9'
I am not very good with regular expressions and this seems a bit hard; how can I tackle this problem?
It looks like your file is tab-delimited, because of the "\t". This may work
pd.read_csv('file.txt', sep='\t', skiprows=8)
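If the file actually contains the literal two-character sequence \t rather than real tab characters (hard to tell from the repr above), a regex separator is one way to cope; this is an assumption about the file, not a certainty:

# r'\\t' is a regex for a literal backslash followed by 't'; a multi-character
# separator like this requires the python engine.
df = pd.read_csv('my_file.txt', sep=r'\\t', engine='python', skiprows=8)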
I imported a CSV file into Python (using a pandas DataFrame) and there are some missing values in the file. In the data frame I have rows like the following:
> 08,63.40,86.21,63.12,72.78,,
I have tried everything to remove the rows that contain elements like the last ones in the row above (the trailing ,,). Nothing works. I do not know whether that is categorized as whitespace, an empty string, or something else.
Here is what I have:
result = pandas.read_csv(file,sep='delimiter')
result[result!=',,']
This did not work. Then I did the following:
result.replace(' ', np.nan, inplace=True)
result.dropna(inplace=True)
This also did not work.
result = result.replace(r'\s+', np.nan, regex=True)
This also did not work. I still see the row containing the ,, element.
Also, my dataframe is 100 by 1. When I import it from the CSV file, all the columns collapse into one. (I do not know if this helps.)
Can anyone tell me how to remove rows containing ,, elements?
Also my dataframe is 100 by 1. When I import it from CSV file all the columns become 1
This is probably the key, and IMHO it is weird. When you import a CSV into a pandas DataFrame you normally want each field to go into its own column, precisely so you can later process those column values individually. So (still IMHO) the correct solution is to fix that.
Now, to directly answer your (probably XY) question: you do not want to remove rows containing blank or empty columns, because each of your rows only contains one single column; you want to remove rows containing consecutive commas (,,). So you should use:
df.drop(df[df.iloc[:, 0].str.contains(',,')].index)
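A quick illustration with made-up rows (the values are placeholders, only the trailing ,, matters):

import pandas as pd

df = pd.DataFrame({0: ['08,63.40,86.21,63.12,72.78,,',
                       '09,61.10,85.00,62.50,70.10,1.2']})
df = df.drop(df[df.iloc[:, 0].str.contains(',,')].index)
# only the second row, which has no ',,' run, remains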
I think your code should work with a minor change:
result.replace('', np.nan, inplace=True)
result.dropna(inplace=True)
In case you have several rows in your CSV file, you can avoid the extra conversion step to NaN:
result = pandas.read_csv(file)
result = result[result.notnull().all(axis = 1)]
This will remove any row where there is an empty element.
However, your added comment explains that there is just one row in the CSV file, and it seems that the CSV reader shows some special behavior. Since you need to select the columns without NaN, I suggest these lines:
result = pandas.read_csv(file, header = None)
selected_columns = result.columns[result.notnull().any()]
result = result[selected_columns]
Note the option header = None with read_csv.
I'm having a tough time loading a CSV file into a pandas dataframe correctly. The file is a CSV saved from MS Excel, where the rows look like this:
Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7";"0.02";"0.09";"0.16";"284.88";"10.32";"
I am using
filep="file_name.csv"
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";")
(I have tried several combinations and alternatives of read_csv arguments, but without any success... I have also tried read_table.)
What I want to see in my dataframe is each semicolon-separated value in its own column (I understand that read_csv works this way(?)).
Unfortunately, I always end up with the whole row placed in the first column of the dataframe. So basically, after loading I have many rows but only one column (two if I also count the index).
I have placed sample here:
datafile
Any ideas welcome.
Add quoting=3; 3 stands for csv.QUOTE_NONE (see the quoting section of the csv module and read_csv docs).
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";", quoting = 3)
This will give a [7 rows x 23 columns] dataframe.
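Equivalently, and arguably more readable, you can pass the csv module constant instead of the bare 3:

import csv
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None,
                       delimiter=";", quoting=csv.QUOTE_NONE)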
The problem is the enclosing quote characters, which can be bypassed by escaping the delimiter with a \ character:
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter='\;')
There is a weird .csv file, something like:
header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33
Pretty fine, but after these lines there is always a blank line followed by lots of useless lines. The whole thing is something like:
header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33
dhjsakfjkldsa
fasdfggfhjhgsdfgds
gsdgffsdgfdgsdfgs
gsdfdgsg
The number of lines at the bottom is totally random; the only marker is the empty line before them.
Pandas has a parameter "skipfooter" for ignoring a known number of rows in the footer.
Any idea how to ignore these rows without actually opening (open()...) the file and removing them?
There is no option to make read_csv stop at the first blank line. The parser cannot accept or reject lines based on arbitrary conditions; it can only ignore blank lines (optionally) or drop rows that break the expected shape of the data (rows with extra separators).
You can normalize the data with either of the approaches below (no manual file parsing, pure pandas):
1. Knowing the number of desired/trash data rows. [Manual]
pd.read_csv('file.csv', nrows=3)
# or, skipping from the bottom (skipfooter needs the python engine):
pd.read_csv('file.csv', skipfooter=4, engine='python')
2. Preserving the desired data by eliminating the others in the DataFrame. [Automatic]
df.dropna(axis=0, how='any', inplace=True)
The results will be:
  header1 header2 header3
0   val11   val12   val13
1   val21   val22   val23
2   val31   val32   val33
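For reference, the automatic variant end to end; this relies on the trash lines having fewer fields than the header (so the parser pads them with NaN) and on skip_blank_lines staying at its default of True:

import pandas as pd

df = pd.read_csv('file.csv')                 # the blank line is skipped by default
df.dropna(axis=0, how='any', inplace=True)   # junk rows, padded with NaN, are removed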
The best way to do this using pandas native functions is a combination of arguments and function calls - a bit messy, but definitely possible!
First, call read_csv with skip_blank_lines=False, since the default is True.
df = pd.read_csv(<filepath>, skip_blank_lines=False)
Then, create a dataframe that only contains the blank rows, using the isnull or isna method. This works by locating (.loc) the indices where all values are null/blank.
blank_df = df.loc[df.isnull().all(1)]
By utilizing the fact that this dataframe preserves the original indices, you can get the index of the first blank row.
Because this uses indexing, you will also want to check that there actually is a blank line in the csv. And finally, you simply slice the original dataframe in order to remove the unwanted lines.
if len(blank_df) > 0:
    first_blank_index = blank_df.index[0]
    df = df[:first_blank_index]
If you're using the csv module, it's fairly trivial to detect an empty row.
import csv

with open(filename, newline='') as f:
    r = csv.reader(f)
    for l in r:
        if not l:
            break
        # Otherwise, process data
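For completeness, a minimal sketch of feeding the collected rows into pandas afterwards (assuming the first row holds the headers):

import csv
import pandas as pd

rows = []
with open(filename, newline='') as f:
    for l in csv.reader(f):
        if not l:              # the first empty row marks the start of the footer
            break
        rows.append(l)

df = pd.DataFrame(rows[1:], columns=rows[0])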