Separating csv columns with delimiter / sep - python

My goal is to separate data stored in cells to multiple columns in the same row.
For example, I would like to take data that looks like this:
Row 1: [<1><2>][<3><4>][][]
Row 2: [<1><2>][<3><4>][][]
Into data that looks like this:
Row 1: [1][2][3][4]
Row 2: [1][2][3][4]
I tried using the code below to pull the csv and separate each line at the ">"
df = pd.read_csv('file.csv', engine='python', sep="\*>", header=None)
However, the code did not function as anticipated. Instead, the separation occurred at seemingly random points (I'm sure there's a pattern, but I don't see it), and each break created another row rather than another column. For example:
Row 1: [<1>][<2>]
Row 2: [<3>]
Row 3: [<4>]
I thought the issue might lie with reading the CSV file, so I tried re-scraping the site with the separator included, but that produced the same results, so I'm assuming it's an issue with the separator call. I arrived at that call after trying many others that caused various errors. For example, when I tried sep='>' I got the following error: ParserError: '>' expected after '"', and when I tried sep='\>' I got the following error: ParserError: Expected 36 fields in line 1106, saw 120. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
These errors sent me looking through multiple resources, but I have found none that successfully demonstrate how to separate each column within a row using a '>' delimiter. If anyone knows how to do this, please let me know. Your help is much appreciated!
Update:
Here is an actual screenshot of the CSV file for a better understanding of what I was trying to demonstrate above. My end goal is to have columns that each hold data on one descriptive factor, as opposed to the many factors per column they hold now.

Would this work:
string="[<1><2>][<3><4>][][]"
string=string.replace("[","")
string=string.replace("]","")
string=string.replace("<","[")
string=string.replace(">","]")
print(string)
Result:
[1][2][3][4]
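If the whole file follows that pattern, here is a minimal sketch of the same idea applied line by line (the file names are placeholders, and it splits on '>' rather than replacing brackets):
import csv

rows = []
with open("file.csv", encoding="utf-8") as f:  # placeholder input name
    for line in f:
        # drop the square brackets, then treat each <...> as one field
        stripped = line.strip().replace("[", "").replace("]", "")
        fields = [part.lstrip("<") for part in stripped.split(">") if part]
        rows.append(fields)

with open("out.csv", "w", newline="", encoding="utf-8") as f:  # placeholder output name
    csv.writer(f).writerows(rows)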

I ended up using Google Sheets. Once you upload the CSV, there is a menu titled "Data" with an option titled "Split text to columns."
If you want a faster way to do this with code, you can also do the following with pandas (assuming a DataFrame data that has a "Name" column):
# new data frame with split value columns
new = data["Name"].str.split(" ", n=1, expand=True)
# making a separate first name column from the new data frame
data["First Name"] = new[0]
# making a separate last name column from the new data frame
data["Last Name"] = new[1]
# dropping the old Name column
data.drop(columns=["Name"], inplace=True)
# df display
data
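The same str-accessor idea can be adapted to the '<...>' data from the original question. A hedged sketch, assuming the raw lines can first be loaded into a single column (the file name is a placeholder):
import pandas as pd

# pull each raw line into one column; the huge width is just a trick
# to keep read_fwf from splitting anything
raw = pd.read_fwf("file.csv", widths=[999999], header=None)

# extract every value from its <...> wrapper (one list per row),
# then spread the lists across columns
df = pd.DataFrame(raw[0].str.findall(r"<([^>]*)>").tolist())
print(df)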


using pandas.read_csv() for malformed csv data

This is a conceptual question, so no code or reproducible example.
I am processing data pulled from a database which contains records from automated processes. The regular record contains 14 fields, with a unique ID, and 13 fields containing metrics, such as the date of creation, the time of execution, the customer ID, the type of job, and so on. The database accumulates records at the rate of dozens a day, and a couple of thousand per month.
Sometimes, the processes result in errors, which result in malformed rows. Here is an example:
id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13 /*regular record, no error, 14 fields*/
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed" /*error in column 14*/
id3,m01,m02,"NO SUCH JOB error, failed" /*error in column 4*/
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded" /*error in column 7*/
The requirements are to (1) populate a dashboard from the metrics, and (2) catalog the types of errors. The ideal solution uses read_csv with on_bad_lines set to some function that returns a dataframe. My hacky solution is to munge the data by hand, row by row, and create two data frames from the output. The presence of the bad lines can be reliably detected by the use of the keyword "failed." I have written the logic that collects the "failed" messages and produces a stacked bar chart by date. It works, but I'd rather use a total Pandas solution.
Is it possible to use pd.read_csv() to return 2 dataframes? If so, how would this be done? Can you point me to any example code? Or am I totally off base? Thanks.
You can load your CSV file into a DataFrame and apply a filter:
df = pd.read_csv("your_file.csv", header = None)
df_filter = df.apply(lambda row: row.astype(str).str.contains('failed').any(), axis=1)
df[df_filter.values] #this gives a dataframe of "failed" rows
df[~df_filter.values] #this gives a dataframe of "non failed" rows
You need to make sure that your keyword does not appear in your regular data.
PS: There might be more optimized ways to do it.
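Regarding the on_bad_lines idea from the question: since pandas 1.4 it accepts a callable (with engine='python'), but be aware it only fires for rows with more fields than expected; the malformed rows here have fewer fields, which read_csv silently pads with NaN, so the filter above is still the right tool for this data. A sketch of the callable pattern anyway, with the file name as a placeholder:
import pandas as pd

bad_rows = []

def capture(fields):
    # called only for lines with MORE fields than the first row
    bad_rows.append(fields)
    return None  # returning None drops the line from the main frame

df_good = pd.read_csv("your_file.csv", header=None,
                      engine="python", on_bad_lines=capture)
df_bad = pd.DataFrame(bad_rows)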
This approach reads the entire CSV into a single column, then uses a mask identifying the failed rows to split out and build good and failed dataframes.
Read the entire CSV into a single column:
import io
import pandas as pd

# sim_csv is the path to your CSV file
dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)
Build a mask identifying the failed rows:
fail_msk = dfs[0].str.contains('failed')
Use that mask to split out and build separate dataframes:
df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)
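The read_fwf call with widths=[999999] is just a trick to load each raw line as a single fixed-width field, so no delimiter parsing happens on the first pass; squeeze() then turns the one-column frame into a Series so the selected lines can be joined back into text and re-parsed.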

How can I read a document with pandas (Python) that doesn't look like the average one?

I am trying to get the values from the columns of a file. The doc looks like this:
(screenshot of the data I want to read)
All the examples I have found use pd.read_csv or pd.DataFrame, but their data always has a clear header and nothing on top of the data (my file has about 10 lines at the top that I don't really need for what I am doing).
Also, I think maybe there is something wrong because I tried to run:
data = pd.read_csv('tdump_BIL_100_-96_17010100.txt',header=15)
and I get
(screenshot of the pd.read_csv output)
which is just each row in one column, so there is apparently no separation, and therefore no way of getting the columns I need.
So my question is if there is a way to get the data from this file with pandas and how to get it.
If the number of extra lines is fixed, skip the initial rows, indicate that no header is present, and that values are separated by whitespace:
df = pd.read_csv(filename, skiprows=15, header=None, sep=r'\s+')
See read_csv for documentation.
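If the number of junk lines varies from file to file, a hedged workaround is to scan for the first line that looks like data and compute skiprows yourself; the token-count threshold below is an assumption to adjust for your file:
import pandas as pd

def find_data_start(path, min_fields=5):
    # assume a data line has at least `min_fields` whitespace-separated values
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if len(line.split()) >= min_fields:
                return i
    return 0

path = 'tdump_BIL_100_-96_17010100.txt'
df = pd.read_csv(path, skiprows=find_data_start(path), header=None, sep=r'\s+')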

Trying to salvage a messed up dataset

I've been tasked with the tedious job of salvaging a mangled dataset that looks like this in its .txt file form
"DATE"|"ID"|"TITLE"|"NAME"|"SURNAME"|"INCOME"|"ADDRESS1"|"JOB"
||MR|ROBIN|WILLIAMS|291029102|
2019-01-01||MRS||DREW||SECOND_TO_THE_RIGHT_AND_STRAIGHT_ON_TILL_MORNING||
2021-02-04|||JONATHAN|SIMMONS|3012000|SOMEWHERE_IN_BEVERLY_HILLS|
|MARK|ZUCKERBURG||SILICON_VALLEY|
|||PARKER|JONES||SOMEWHERE_IN_HIS_HOMETOWN_IN_
NORTHERN_ITALY
2019-06-03|4444301243451|MRS|CARMEN|SANDIEGO||WHERE_IN_THE_WORLD_IS_
CARMEN_SANDIEGO|THIEF
2018-02-25|2395022812223|MR|ARNOLD|SCHWARZENEGGER|MUSCLE_BEACH_VENI
CE_CALIFORNIA_USA|ACTOR
|DOCTOR|STRANGE|NEW_YORK_SA
NCTUM|SORCERER_SUPREME
||HAL|JORDAN||EARTH|GREEN_LANTERN
2019-01-02|0212931229475|MR|KEVIN|BACON|129340212|SOME_PLACE_IN_THE_US|ACTOR
2019-01-03|2939583749642|MRS|NICOLE|KIDMAN|291928494|BEVERLY_HILLS|ACTOR
2019-03-15|2947959371923|MR|WARREN|BUFFET|2000000000|OMAHA_NEBRASKA|INVESTOR
2020-02-45|2847939172394|MRS|JULIE|ARIS|200034|SOMEWHERE_IN_THE_UK|LECTURER
There are two major problems with this dataset that I've been struggling with. The first is that there aren't enough delimiters in certain rows (e.g. the first row); the second is that long values in the address column have somehow caused the address value to "spill over" into the next row. When I asked the people in charge of extracting the data from their databases why the data looked like this, I was told it was caused by people pressing "Enter" while inputting long addresses, which also means the data was stored this way in the database to begin with, ruling out any possibility of a re-extraction rectifying these errors. According to them, the easiest way to rectify the second problem is to press backspace at the head of each problematic row, like this:
"DATE"|"ID"|"TITLE"|"NAME"|"SURNAME"|"INCOME"|"ADDRESS1"|"JOB"
||MR|ROBIN|WILLIAMS|291029102|
2019-01-01||MRS||DREW||SECOND_TO_THE_RIGHT_AND_STRAIGHT_ON_TILL_MORNING||
2021-02-04|||JONATHAN|SIMMONS|3012000|SOMEWHERE_IN_BEVERLY_HILLS|
|MARK|ZUCKERBURG||SILICON_VALLEY|
2019-01-02|0212931229475|MR|KEVIN|BACON|129340212|SOME_PLACE_IN_THE_US|ACTOR
2019-01-03|2939583749642|MRS|NICOLE|KIDMAN|291928494|BEVERLY_HILLS|ACTOR
2019-03-15|2947959371923|MR|WARREN|BUFFET|2000000000|OMAHA_NEBRASKA|INVESTOR
2020-02-45|2847939172394|MRS|JULIE|ARIS|200034|SOMEWHERE_IN_THE_UK|LECTURER
While technically not a bad solution, I'm not about to do this manually for a dataset that has over 2 million rows.
My first idea was to look for escape characters in the datafile using Notepad++, but there were no signs of escape characters in the dataset at all. My second idea was to try to read the datafile row by row and append it to a dataframe using the following lines of code:
import time
import pandas as pd

start = time.time()
with open("DATAFILE.txt", 'rt', encoding='UTF8') as f:
    data = f.read()
fl = data.splitlines()
# 350231 rows
len(fl)
df1 = pd.DataFrame()
for i in range(0, 10000):
    # note: DataFrame.append is deprecated in newer pandas versions
    df1 = df1.append([fl[i+1].split('|')], ignore_index=True)
print('Total Time:', round((time.time() - start), 2), '(sec)')
This did absolutely nothing as it gave me the exact same results as read_csv, which is why I'd very much appreciate some advice on how to solve this problem.
Example of a row with empty values filled in with NaN:
"DATE"|"ID"|"TITLE"|"NAME"|"SURNAME"|"INCOME"|"ADDRESS1"|"JOB"
NaN |NaN|MR|ROBIN|WILLIAMS|291029102|NaN
Note that this row is also missing a delimiter for the JOB column
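For what it's worth, the "press backspace" fix the extraction team suggested can be automated by gluing continuation lines back onto the record above them before parsing. A minimal sketch, assuming every real record starts with either a date or a '|' (an empty DATE field), which holds for the sample shown; it only addresses the spill-over problem, not the missing delimiters:
import io
import re

import pandas as pd

# a record starts with "YYYY-MM-DD|" or a bare leading '|' (empty DATE);
# anything else is an address that spilled onto its own line
starts_record = re.compile(r"^(\d{4}-\d{2}-\d{2}\||\|)")

merged = []
with open("DATAFILE.txt", encoding="UTF8") as f:
    header = next(f).rstrip("\n")
    for line in f:
        line = line.rstrip("\n")
        if merged and not starts_record.match(line):
            merged[-1] += line  # glue the spill-over back on
        else:
            merged.append(line)

df = pd.read_csv(io.StringIO("\n".join([header] + merged)), sep="|")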

Import table to DataFrame and set group of column as list

I have a table (Tab delimited .txt file) in the following form:
each row is an entry;
the first row holds the headers;
the first 5 columns are simple numeric parameters;
all columns after the 7th are supposed to form a list of values.
My problem is: how can I import and create a data frame where the last column contains a list of values?
---- Problem 1 ----
The header (first row) is "shorter", containing simply the names of some columns. The values after the 7th column do not have a header (because they are supposed to be a list). If I import the file as is, this appears to confuse the import functions.
If, for example, I import as follow
df = pd.read_table(path, sep="\t")
the DataFrame created has only as many columns as there are elements in the first row. Moreover, the data values assigned are mismatched.
---- Problem 2 ----
What is really confusing to me is that if I open the .txt in Excel and save it as Tab-delimited (without changing anything), I can then import it without problems, with headers too: columns with no header are simply given an "Unnamed: X" tag.
Why would saving in Excel change it? Using Notepad++ I can see only one difference: the original .txt has "Unix (LF)" line endings, while the one saved in Excel has "Windows (CR LF)". Both are UTF-8, so I do not understand how this would be an issue.
Nevertheless, from here I could manipulate the data and try to gather all columns I wish and make them into a list. However, I hope that there is a more elegant and faster way to do it.
Here is a screenshot of the .txt file.
Thank you,
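Absent the real file, here is a hedged sketch of one way to get the trailing values into a single list column with pandas alone: read the header line yourself, size the frame from the widest row, then collapse everything after the 7th column. The path, and the assumption that the header names the first 7 columns, are placeholders:
import pandas as pd

path = "table.txt"  # placeholder

# read the short header ourselves and measure the widest data row
with open(path, encoding="utf-8") as f:
    header = f.readline().rstrip("\n").split("\t")
    width = max(line.count("\t") + 1 for line in f)

# explicit names force pandas to allocate `width` columns; short rows pad with NaN
df = pd.read_csv(path, sep="\t", header=None, skiprows=1, names=range(width))

# keep the named columns as-is, fold the rest into one list per row
out = df.iloc[:, :7].rename(columns=dict(enumerate(header)))
out["values"] = df.iloc[:, 7:].apply(lambda r: r.dropna().tolist(), axis=1)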

Python read_csv - ParserError: Error tokenizing data

I understand why I get this error when trying df = pd.read_csv(file):
ParserError: Error tokenizing data. C error: Expected 14 fields in line 7, saw 30
When it reads in the CSV, it sees 14 strings/columns in the first row and, based on that first row, calls them the headers (which is what I want).
However, those columns are extended further down the rows (specifically when it gets to row 7).
I can find solutions that will read it in by skipping rows 1-6, but I don't want that. I still want the whole CSV to be read, but instead of the header being 14 columns, how can I tell it to make the header 30 columns, and if there is no text/string, just leave the column as "", or null, or some random numbering? In other words, I don't care what it's named; I just need the placeholder so it can parse after row 6.
I'm wondering is there a way to read in the csv, and explicitly say there are 30 columns but have not found a solution.
I can throw out some solutions that I think should work.
1) Set header=None and give column names via the 'names' argument of read_csv:
df = pd.read_csv(file, header=None, names=['field1', 'field2', ..., 'field30'])
PS: This will work if your CSV doesn't have a header already.
2) Secondly, you can try using the command below (if your CSV already has a header row):
df=pd.read_csv(file, usecols=[0,1,2,...,30])
Let me know if this works out for you.
Thanks,
Rohan Hodarkar
What about trying the following? Note that error_bad_lines=False will cause the offending lines to be skipped (in recent pandas versions this argument has been replaced by on_bad_lines='skip'):
data = pd.read_csv('File_path', error_bad_lines=False)
Just a few more collected answers.
It might be an issue with the delimiters in your data's first row.
To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,
df = pandas.read_csv('File_path', sep='delimiter', header=None)
In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. The documentation says: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0, 1, 2, ...}.
According to the docs, the delimiter should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." However, I have not had good luck with this, including in instances with obvious delimiters.
This might be a delimiter issue; many such files are created with tab separators, so try reading with the tab character (\t) as the separator. Try to open it using the following line of code:
data=pd.read_csv("File_path", sep='\t')
OR
pandas.read_csv('File_path',header=None,sep=', ')
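To answer the original ask directly (force 30 columns up front): passing explicit names makes pandas allocate that many columns, and rows with fewer fields are simply padded with NaN. A minimal sketch, assuming no line exceeds 30 fields:
import pandas as pd

# range(30) supplies 30 throwaway labels (0..29); skiprows=1 discards
# the old 14-column header so it isn't read as data
df = pd.read_csv('File_path', header=None, names=range(30), skiprows=1)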
