Reading csv file in pandas with newlines and natural language

Reading csv file in pandas with newlines and natural language - python

I have a csv file with multiple columns that contain numbers and natural language text. The columns have a special separator (say "%#") so that one can immediately see the boundary between columns. For example:
55%#This is an example text%#24
32%#This is another example text%#1649
However, sometimes the text contains newlines:
212%#This is a possible
occurrence of
a text in the csv file%#3399
My problem is how to read the csv file with pandas such that the newline is not interpreted as another csv row. When I use df = pd.read_csv(filename, sep='%#') I get for example:
Col1 Col2 Col3
212 This is a possible None
occurrence of None None
a text in the csv file 3399
How could I solve this problem?

Related

Split column in several columns by delimiter '\' in pandas

I have a txt file which I read into pandas dataframe. The problem is that inside this file my text data recorded with delimiter ''. I need to split information in 1 column into several columns but it does not work because of this delimiter.
I found this post on stackoverflow just with one string, but I don't understand how to apply it once I have a whole dataframe: Split string at delimiter '\' in python
After reading my txt file into df it looks something like this
df
column1\tcolumn2\tcolumn3
0.1\t0.2\t0.3
0.4\t0.5\t0.6
0.7\t0.8\t0.9
Basically what I am doing now is the following:
df = pd.read_fwf('my_file.txt', skiprows = 8) #I use skip rows because there is irrelevant text
df['column1\tcolumn2\tcolumn3'] = "r'" + df['column1\tcolumn2\tcolumn3'] +"'" # i try to make it a row string as in the post suggested but it does not really work
df['column1\tcolumn2\tcolumn3'].str.split('\\',expand=True)
and what I get is just the following (just displayed like text inside a data frame)
r'0.1\t0.2\t0.3'
r'0.4\t0.5\t0.6'
r'0.7\t0.8\t0.9'
I am not very good with regular expersions and it seems a bit hard, how can I target this problem?

It looks like your file is tab-delimited, because of the "\t". This may work
pd.read_csv('file.txt', sep='\t', skiprows=8)

How to use 'Shift In' from text in csv file to split columns

I'm trying to import csv style data from a software designed in Europe into a df for analysis.
The data uses two characters to delimit the data in the files, 'DC4' and 'SI' ("Shift In" I believe). I'm currently concatenating the files and delimiting them by the 'DC4' character using read_csv into a df. Then I use a regex line to replace all the 'SI' characters into ';' in the df. I skip every other line in the code to remove the identifiers I don't need next. If I open the data at this point everything is split by the 'DC4' and all 'SI' are converted to ;.
What would you suggest to further split the df by the ; character now? I've tried to split the df by series.string but got type errors. I've exported to csv and reimported it using ; as the delimiter, but it doesn't split the existing columns that were already split with the first import for some reason? I also get parser errors on some rows way down the df so I think there are dirty rows (this is just information I've found. If not helpful please ignore it). I can ignore these lines without affecting the data I need.
The size of the df is around 60-70 columns and usually less than 75K rows when I pull a full report. I'm using PyCharm and Python 3.8. Thank you all for any help on this, I very much appreciate it. Here is my code so far:
path = file directory location
df = pd.concat([pd.read_csv(f, sep='', comment=" ", na_values='Nothing', header=None, index_col=False)
for f in glob.glob(path + ".file extension")], ignore_index=True)
df = df.replace('', ';', regex=True)
df = df.iloc[::2]
df.to_csv(r'new_file_location', index=False, encoding='utf-8-sig')

So you have a CSV (technically not a CSV I guess) that's separated by two different values (DC4 and SI) and you want to read it into a dataframe?
You can do so directly with pandas, the read_csv function allows you to specify regex delimiters, so you could use "\x0e|\x14" and use either DC4 or SI as selarator: pd.read_csv(path, sep="\x0e|\x14")
An example with readable characters:
The csv contains:
col1,col2;col3
val1,val2,val3
val4;val5;val6
Which can be read as follows:
import pandas as pd
df = pd.read_csv(path, sep=",|;")
which results in df being:
col1 col2 col3
0 val1 val2 val3
1 val4 val5 val6

Pandas read_csv - Ignore Escape Char in SemiColon Seperated File

I am trying to load a semicolon seperated txt file and there are a few instances where escape chars are in the data. These are typically &lt ; (space removed so it isn't covered to <) which adds a semicolon. This obviously messes up my data and since dtypes are important causes read_csv problems. Is there away to tell pandas to ignore these when the file is read?
I tried deleting the char from the file and it works now, but given that I want an automated process on millions of rows this is not sustainable.
df = pd.read_csv(file_loc.csv,
header=None,
names=column_names,
usecols=counters,
dtype=dtypes,
delimiter=';',
low_memory=False)
ValueError: could not convert string to float:
As my first column is a string and the second is a float, but if the first is split by the &lt ; it then goes on the 2nd too.
Is there a way to tell pandas to ignore these or efficiently remove before loading?

Give the following example csv file so57732330.csv:
col1;col2
1<2;a
3;
we read it using StringIO after unescaping named and numeric html5 character references:
import pandas as pd
import io
import html
with open('so57732330.csv') as f:
s = f.read()
f = io.StringIO(html.unescape(s))
df = pd.read_csv(f,sep=';')
Result:
col1 col2
0 1<2 a
1 3 NaN

Convert data from .data file to .csv file and put data in columns using pandas

I want to convert data from a .data file to a .csv file and put the data from the .data file in columns with values under them. However, the .data file has a specific format and I don't know how to put the text in it in columns. Here is how the .data file looks like:
column1
column2
column3
column4
column5
column6
column7
column8
column9
column10
column11
column12
column13
........
column36
1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444
1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444
1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444
1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444
The file as shown above has the names of 36 columns, each on 1 line. Under these are many datapoints, with 36 values in them that are separated by semicolons. The datapoints are 2 lines long and each datapoint is separated by a blank line. The .csv file must look like this:
column1,column2,column3,column4,column5,column6,column7,column8,column9,column10,column11,column12,column14,column15,column16,column17,column18,column20,column20,column21,column22,column23,column24,column25,column26,column27,column28,column29,column30,column31,column32,column33,column34,column35,column36
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
The first line of the .csv as shown above file must consist of 36 columns with the names in it separated by commas. The next lines must consist of all datapoints, each on 1 line and in which the 36 values must be separated by commas.
Can you use the software library 'pandas' for this? Anyways, this is my starting code:
with open("file.data") as fIn, open("file.csv", "w") as fOut:
for r, line in enumerate(fIn):
if not line:
break
Thanks

Sure you can do it with pandas. You just need to read first N lines (36 in your case) to use them as header and read rest of the file like normal csv (pandas good at it). Then you can save pandas.DataFrame object to csv.
Since your data splitted into adjacent lines, we should split DataFrame we've read on two and stack them one next to other (horizontaly).
Consider the following code:
import pandas as pd
COLUMNS_COUNT = 36
# read first `COLUMNS_COUNT` lines to serve as a header
with open('data.data', 'r') as f:
columns = [next(f).strip() for line in range(COLUMNS_COUNT)]
# read rest of the file to temporary DataFrame
temp_df = pd.read_csv('data.data', skiprows=COLUMNS_COUNT, header=None, delimiter=';', skip_blank_lines=True)
# split temp DataFrame on even and odd rows
even_df = temp_df.iloc[::2].reset_index(drop=True)
odd_df = temp_df.iloc[1::2].reset_index(drop=True)
# stack even and odd DataFrames horizontaly
df = pd.concat([even_df, odd_df], axis=1)
# assign column names
df.columns = columns
# save result DataFrame to csv
df.to_csv('out.csv', index=False)
UPD: code updated to correctly process data splitted onto two lines

Pandas: read_csv ignore rows after a blank line

There is a weird .csv file, something like:
header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33
pretty fine, but after these lines, there is always a blank line followed by lots of useless lines. The whole stuff is something line:
header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33
dhjsakfjkldsa
fasdfggfhjhgsdfgds
gsdgffsdgfdgsdfgs
gsdfdgsg
The number of lines in the bottom is totally random, the only remark is the empty line before them.
Pandas has a parameter "skipfooter" for ignoring a known number of rows in the footer.
Any idea about how to ignore this rows without actually opening (open()...) the file and removing them?

There is not any option to terminate read_csv function by getting the first blank line. This module isn't capable of accepting/rejecting lines based on desired conditions. It only can ignore blank lines (optional) or rows which disobey the formed shape of data (rows with more separators).
You can normalize the data by the below approaches (without parsing file - pure pandas):
Knowing the number of the desired\trash data rows. [Manual]
pd.read_csv('file.csv', nrows=3) or pd.read_csv('file.csv', skipfooter=4)
Preserving the desired data by eliminating others in DataFrame. [Automatic]
df.dropna(axis=0, how='any', inplace=True)
The results will be:
header1 header2 header3
0 val11 val12 val13
1 val21 val22 val23
2 val31 val32 val33

The best way to do this using pandas native functions is a combination of arguments and function calls - a bit messy, but definitely possible!
First, call read_csv with the skip_blank_lines=False, since the default is True.
df = pd.read_csv(<filepath>, skip_blank_lines=False)
Then, create a dataframe that only contains the blank rows, using the isnull or isna method. This works by locating (.loc) the indices where all values are null/blank.
blank_df = df.loc[df.isnull().all(1)]
By utilizing the fact that this dataframe preserves the original indices, you can get the index of the first blank row.
Because this uses indexing, you will also want to check that there actually is a blank line in the csv. And finally, you simply slice the original dataframe in order to remove the unwanted lines.
if len(blank_df) > 0:
first_blank_index = blank_df.index[0]
df = df[:first_blank_index]

If you're using the csv module, it's fairly trivial to detect an empty row.
import csv
with open(filename, newline='') as f:
r = csv.reader(f)
for l in r:
if not l:
break
#Otherwise, process data

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Reading csv file in pandas with newlines and natural language - python

Related

Split column in several columns by delimiter '\' in pandas

How to use 'Shift In' from text in csv file to split columns

Pandas read_csv - Ignore Escape Char in SemiColon Seperated File

Convert data from .data file to .csv file and put data in columns using pandas

Pandas: read_csv ignore rows after a blank line

Categories

Resources