Fix bad lines downloaded when reading csv/txt in pandas - python

I have a dataset that I download daily from Amazon AWS. The problem is that some lines come down broken (see the image; you can also download the sample here). The two lines that start with "ref" should be appended to the previous row, the one starting with "001ec214-97e6-4d84-a64a-d1bee0079fd8", for that row to be correct. I have about a hundred cases like this to fix, so I wonder if there is a way to do this kind of append with pandas' read_csv function. I also can't simply ignore those rows, because I'd be losing useful data.

This one-liner should do it:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(open('filename.csv').read().replace('\n\\\nref :', ' ref :')), sep="\t", header=None)
It reads the file and splices the broken "ref :" continuation lines back onto their records, then loads the repaired string representation of the csv into pandas with StringIO.
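To see the mechanics in isolation, here is a minimal sketch on a hypothetical two-record fragment with the same break pattern (a record split across three physical lines: the record text, a lone "\", then a line starting with "ref :"):
from io import StringIO
import pandas as pd

# hypothetical fragment: record 1 is broken across three physical lines
broken = "a\tb\tD 5 Lt 2 urb Mariscal\n\\\nref : altura de elektra\tc\nx\ty\tz\tw\n"
fixed = broken.replace('\n\\\nref :', ' ref :')  # splice the record back together
df = pd.read_csv(StringIO(fixed), sep='\t', header=None)
print(df)  # two rows, four columns each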

Given the text from the Google doc in the questions:
001ea9a6-2c30-4ce6-b4ee-f445803303db 596d89bf-e641-4e9c-9374-695241694a6d 45640 MUNDO JURIDICO 26204 20350 \N A0000000031010 10530377 MATEO FLORES ELMER ASOC INTEGRACION LOS OLIVOS AV NAPOLES MZ I LTE 31 visanet 650011581 1 0 mpos contactless 0 0 0 visa 421410 5969 12 24 c6d06de1-c9f0-4b4a-992b-ad3a5f263039 000066 666407 2021-07-20 09:31:23 000 301201522830426 995212014496392 158348 \N \N \N PEN 200.00 200.00 0.00 1 0 0.00 200026214 5999 ASOC INTEGRACION LOS OLIVOS AV NAPOLES MZ I LTE 31 MUNDO JURIDICO 947761525 AUTHORIZED 2021-07-20 09:31:24 2021-07-20 09:32:00 210720 093123
001ec0fd-8d0e-4332-851a-bcd93fdf0a37 ee32d094-8fc4-4d92-b788-58ca2ae2a590 36750 Chifa Maryori 18923 25313 \N A0000000031010 46753818 Chifa Maryori San isidro Peru visanet 650011581 1 0 mpos contactless 0 0 0 visa 421355 5765 04 26 7d00f10e-d0b7-40ec-9b97-d20590cb7710 000708 620744 2021-06-27 16:52:21 000 301178787424243 996211783887688 100513 \N \N \N PEN 17.00 17.00 0.00 0 0 0.00 400034782 5499 San isidro Peru Chifa Maryori +51 01 5179805 AUTHORIZED 2021-06-27 16:52:23 2021-06-27 16:52:23 210627 165221
001ec214-97e6-4d84-a64a-d1bee0079fd8 3c4a98cc-d279-4f9e-af8b-5198647889c5 33214 Inversiones Polleria el rey MPOS 15699 23053 \N \N 88846910264 JoseOrbegoso Puelle D 5 Lt 2 urb Mariscal
\
ref : altura de elektra visanet 650011581 1 0 mpos chip 0 0 0 visa 455788 2123 09 22 8b3fd975-140a-42d5-8601-1b4bcc16bd60 000022 170790 2020-10-20 11:32:44 687 \N \N \N \N \N \N PEN 1.00 1.00 0.00 0 0 0.00 200020168 5999 D 5 Lt 2 urb Mariscal
\
ref : altura de elektra Inversiones Polleria el rey MPOS 964974226 DENIED 2020-10-20 11:32:44 2020-10-20 11:32:44 201020 113244
001ec66f-6350-4a33-a1fe-b9375dac2161 34169a7a-a66f-4258-80c2-8c0d4512aa36 27044 ABRAHAM LORENZO MELGAREJO 10353 13074 \N \N 99944748991 ABRAHAM LORENZO MELGAREJO JR RENOVACION 399 visanet 650011581 1 0 mpos chip 0 0 0 visa 455788 4712 08 24 83915520-1a1f-4d5e-b118-f0c57a9d96ae 000367 161286 2020-10-14 15:59:51 000 300288755920408 995202889603642 242407 \N \N \N PEN 15.00 15.00 0.00 1 0 0.00 200012523 5811 JR RENOVACION 399 ABRAHAM LORENZO MELGAREJO 957888909 AUTHORIZED 2020-10-14 15:59:52 2020-10-14 16:00:21 201014 155951
001eebaf-bccc-47a7-87a3-be8b09eb971b c5e14889-d61c-4f18-8000-d3bac0564cfb 27605 Polleria Arroyo Express 10792 21904 \N A0000000031010 41429707 Polleria Arroyo Express San isidro Peru visanet 650011581 1 0 mpos contactless 0 0 0 visa 421355 9238 09 25 3c5268e8-5731-4fea-905d-45b03ed623d2 000835 379849 2021-01-30 19:43:36 000 301031026163928 995210303271986 928110 \N \N \N PEN 24.00 24.00 0.00 0 0 0.00 400026444 5499 San isidro Peru Polleria Arroyo Express +51 01 5179805 AUTHORIZED 2021-01-30 19:43:37 2021-01-30 19:43:37 210130 194336
...and using the code from Concatenate lines with previous line based on number of letters in first column: if the "good" lines all start with 001 (that is what the regex below checks for), you can try...
Code:
import re
import pandas as pd

pattern = r'^001.*$'
with open('aws.txt') as f, open('aws_out.csv', 'w') as output:
    all_the_data = ""
    for line in f:
        # a line that does not start with 001 continues the previous record:
        # strip the trailing newline accumulated so far, so this line is
        # glued onto the end of that record
        if not re.search(pattern, line):
            all_the_data = re.sub("\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
    output.write(all_the_data)

df = pd.read_csv('aws_out.csv', header=None, sep='\t')
print(df)
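If the file is large, repeatedly re-joining one growing string gets quadratic; a list-based variant of the same idea, as a sketch under the same assumption that good lines start with 001:
import pandas as pd

rows = []
with open('aws.txt') as f:
    for line in f:
        if line.startswith('001') or not rows:
            rows.append(line.rstrip('\n'))
        else:
            # continuation: glue it onto the previous record
            rows[-1] += line.rstrip('\n')

df = pd.DataFrame([r.split('\t') for r in rows])
print(df)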


How to regex extract CAR MAKE from URL in pandas df column

I am trying to extract the entire make name, i.e. "Mercedes-Benz", from URL strings such as "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...",
but my pattern only returns the first letter, i.e. "M".
Please help me come up with the correct pattern to use on the pandas df.
Thank you.
CODE:
URLS_by_City['Make'] = URLS_by_City['Page'].str.extract('.+([A-Z])\w+(?=[\/])+', expand=True)
Clean_Make = URLS_by_City.dropna(subset=["Make"])
Clean_Make  # WENT FROM 5K rows --> to 2688 rows
Page City Pageviews Unique Pageviews Avg. Time on Page Entrances Bounce Rate % Exit **Make**
71 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Jose 310 149 00:00:27 149 2.00% 47.74% **B**
103 /used/Audi/2015-Audi-SQ5-286f67180a0e09a872992... Menlo Park 250 87 00:02:36 82 0.00% 32.40% **A**
158 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Francisco 202 98 00:00:18 98 2.04% 48.02% **B**
165 /used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cf... San Francisco 194 93 00:00:42 44 2.22% 29.38% **A**
168 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... (not set) 192 91 00:00:11 91 2.20% 47.40% **B**
... ... ... ... ... ... ... ... ... ...
4995 /used/Subaru/2019-Subaru-Crosstrek-5717b3040a0... Union City 10 3 00:02:02 0 0.00% 30.00% **S**
4996 /used/Tesla/2017-Tesla-Model+S-15605a190a0e087... San Jose 10 5 00:01:29 5 0.00% 50.00% **T**
4997 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Las Vegas 10 4 00:00:09 2 0.00% 40.00% **T**
4998 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Austin 10 4 00:03:29 2 0.00% 40.00% **T**
4999 /used/Tesla/2018-Tesla-Model+3-5f29cdc70a0e09a... Orinda 10 4 00:04:00 1 0.00% 0.00% **T**
TRIED:
example_url = "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1"
pattern = ".+([a-zA-Z0-9()])\w+(?=[/])+"
wanted_make = URLS_by_City['Page'].str.extract(pattern)
wanted_make
0
0 r
1 r
2 NaN
3 NaN
4 r
... ...
4995 r
4996 l
4997 l
4998 l
4999 l
It worked in an online regex tool, but unfortunately not in my Jupyter notebook.
EXAMPLE PATTERNS - I bolded what should match:
/used/**Mercedes-Benz**/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/**Audi**/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm
/used/**Mercedes-Benz**/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/**Audi**/2021-Audi-RS+5-b92922bd0a0e09a91b4e6e9a29f63e8f.htm
/used/**LEXUS**/2018-LEXUS-GS+350-dffb145e0a0e09716bd5de4955662450.htm
/used/**Porsche**/2014-Porsche-Boxster-0423401a0a0e09a9358a179195e076a9.htm
/used/**Audi**/2014-Audi-A6-1792929d0a0e09b11bc7e218a1fa7563.htm
/used/**Honda**/2018-Honda-Civic-8e664dd50a0e0a9a43aacb6d1ab64d28.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/used-inventory/index.htm
/new-inventory/index.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/
I have tried completing your requirement in a Jupyter Notebook. The code is below:
First, I created a dummy pandas dataframe, data_df, with the URLs in a column named urls.
Then I created a pattern based on the shape of the string to be extracted:
pattern = r"^/used/(.*)/(?=20[0-9]{2})"
Finally, I used the pattern to extract the required data from the URLs and saved it in another column of the same dataframe:
data_df['Car Maker'] = data_df['urls'].str.extract(pattern)
I hope this is helpful.
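Put together, a runnable stand-in for the steps above (the dummy dataframe is hypothetical, built from two URLs in the question):
import pandas as pd

# hypothetical dummy dataframe standing in for the one from the screenshots
data_df = pd.DataFrame({'urls': [
    "/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm",
    "/used/Porsche/2014-Porsche-Boxster-0423401a0a0e09a9358a179195e076a9.htm",
]})
pattern = r"^/used/(.*)/(?=20[0-9]{2})"
data_df['Car Maker'] = data_df['urls'].str.extract(pattern)
print(data_df['Car Maker'])  # 0: Audi, 1: Porsche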
I would use:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'([^/]+)/\d{4}\b')
This targets the URL path segment immediately before the portion with the year. You could also try this version:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'/[^/]+/([^/]+)')
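As a quick check, here is a sketch of the first pattern on a small hypothetical sample built from Page values in the question:
import pandas as pd

pages = pd.Series([
    "/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm",
    "/used/LEXUS/2018-LEXUS-GS+350-dffb145e0a0e09716bd5de4955662450.htm",
    "/used-inventory/index.htm",
])
print(pages.str.extract(r'([^/]+)/\d{4}\b'))
# 0 -> Audi, 1 -> LEXUS, 2 -> NaN (no make to extract)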
The code below will give you the Model and VIN values:
pattern2 = r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm'
pattern3 = r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm'
data_df['Model'] = data_df['urls'].str.extract(pattern2)
data_df['VIN'] = data_df['urls'].str.extract(pattern3)
Here is a screenshot of the output:

How to read a string with custom terminator as a pandas dataframe?

I was trying to read a text into a pandas DataFrame, but instead of 4 columns I got a lot of columns.
How to read the following text as pandas DataFrame?
txt="""2020-09-12, Budget, GD-0032-DD-XP, Ford,\\n 2020-04-22, Avis, D143123, Toyota,\\n 2020-04-03, Herz, 331029411, Jeep,\\n 2020-10-31, HERZ, , Hyundai,\\n 2020-09-10, Budget, Gd-1932-Ee-Rm, Chevrolet,\\n 2020-12-01, National, 9890001, Ford,\\n 2020-05-13, Alamo, W***, Hyundai,\\n 2020-01-21, Enterprise, GD-8888-TL-MP, Jeep,\\n"""
My attempt:
txt="""2020-09-12, Budget, GD-0032-DD-XP, Ford,\n 2020-04-22, Avis, D143123, Toyota,\n 2020-04-03, Herz, 331029411, Jeep,\n 2020-10-31, HERZ, , Hyundai,\n 2020-09-10, Budget, Gd-1932-Ee-Rm, Chevrolet,\n 2020-12-01, National, 9890001, Ford,\n 2020-05-13, Alamo, W***, Hyundai,\n 2020-01-21, Enterprise, GD-8888-TL-MP, Jeep,\n"""
# input file with only ONE row
with open('input000.txt', 'w') as fo:
    txt = txt.replace('\n', '\\n')
    fo.write(txt)
# read the data file
import io
import pandas as pd
df = pd.read_csv('input000.txt', lineterminator='\n')
df
My output
2020-09-12 Budget GD-0032-DD-XP Ford \n 2020-04-22 Avis D143123 Toyota \n 2020-04-03 Herz ... Ford.1 \n 2020-05-13 Alamo W*** Hyundai.1 \n 2020-01-21 Enterprise GD-8888-TL-MP Jeep.1 \n
0 rows × 33 columns
Required output
0 1 2 3
0 2020-09-12 Budget GD-0032-DD-XP Ford
1 2020-04-22 Avis D143123 Toyota
2 2020-04-03 Herz 331029411 Jeep
3 2020-10-31 HERZ Hyundai
4 2020-09-10 Budget Gd-1932-Ee-Rm Chevrolet
5 2020-12-01 National 9890001 Ford
6 2020-05-13 Alamo W*** Hyundai
7 2020-01-21 Enterprise GD-8888-TL-MP Jeep
You have to pass in the separator as well, using the sep parameter, plus header=None so the first row is not consumed as the header (skipinitialspace=True additionally strips the blank after each comma). Since each row also ends with a comma, parsing yields an extra empty column, which you can drop:
df = pd.read_csv(io.StringIO(txt), lineterminator='\n', sep=',', header=None, skipinitialspace=True).drop(columns=4)
In the text file, \n was written out as the two characters \\n, and since the C engine of read_csv doesn't support a multi-character line terminator, it may be better to open it as a plain text file and build the DataFrame ourselves:
with open("input000.txt") as file:
    df = pd.DataFrame([line.split(',') for line in file.read().split('\\n')])
Output:
2020-09-12 Budget GD-0032-DD-XP Ford
0 2020-04-22 Avis D143123 Toyota
1 2020-04-03 Herz 331029411 Jeep
2 2020-10-31 HERZ Hyundai
3 2020-09-10 Budget Gd-1932-Ee-Rm Chevrolet
4 2020-12-01 National 9890001 Ford
5 2020-05-13 Alamo W*** Hyundai
6 2020-01-21 Enterprise GD-8888-TL-MP Jeep
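A variant of the same idea that also strips the padding around each field and drops the empty trailing cell (every row ends with a comma); a sketch against the same input000.txt:
import pandas as pd

with open("input000.txt") as file:
    rows = [
        [cell.strip() for cell in line.split(',')][:-1]  # drop the trailing empty cell
        for line in file.read().split('\\n')
        if line.strip()  # skip the empty piece after the final \n marker
    ]
df = pd.DataFrame(rows)
print(df)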

Parse Problematic Fixed width text file to a pandas dataframe

I need to parse a fixed-width file (FWF) into a df, but the system that exports the file writes &amp; instead of the & symbol.
This breaks the fixed width, because the lines containing &amp; have extra characters, so I can't use read_fwf. I can import it with the code below, but that gives me a single column that I must split later.
import pandas as pd

cnv = lambda txt: txt.replace('&amp;', '&')
df = pd.read_csv('AccountStatement1.txt', skiprows=4, skipfooter=3, engine='python',
                 sep='$%^#~&', converters={i: cnv for i in range(1)})
df
I use sep='$%^#~&' so that everything lands in a single column, and correct the text with the converter.
What is the proper solution to this?
Sample of the text file:
=======================================================================================================================================================================================================================================================================================================================================================
From: Transaction Date: 21/05/2021 To: Transaction Date: 23/06/2021 IBAN:: CYxxxxxx0000000000000 Currency: EUR Previous Statement Balance: 1,111.10 BIC: xxxxxxxx
=======================================================================================================================================================================================================================================================================================================================================================
Transaction Date Transaction details Reference No. Description 1 Description 2 Value date Debit Credit Balance
27/05/2021 CHQ: 12568987 26/05/2021 645.00 9,708.70
27/05/2021 DEBIT EB2021057554434221149 xxxx xxxxxxxxx xxxxxx 27/05/2021 0,888.36 3,723.74
28/05/2021 I2456787437452 B/O: xxxxxxxxxxxxxxx LTD TRANSFER xxxxxxxxx xxxxxxxxxx 27/05/2021 19,002.00 13,755.74
28/05/2021 INWARD TRANSFER COMMISSION CY21jhjh884786 I2107675689452 28/05/2021 10.00 15,723.74
31/05/2021 ATM/POS DEBIT jhgjhkjhjk jkh f4 1211 xxxxxx xxxxx &amp; xx xxxxx 27/05/2021 60.00 52,680.74
31/05/2021 Service Charges MONTHLY MAINTENANCE FEE 31/05/2021 35.00 73,645.74
01/06/2021 Service Charges STATEMENT FEE - MAY 2021 31/05/2021 5.00 19,645.74
02/06/2021 ATM/POS DEBIT POS 567521 124454 1211 xxxxxxxxxxxx &amp; Exxxxxxx 31/05/2021 170.00 09,320.74
03/06/2021 CHQ: 13456784 02/06/2021 80.00 10,230.74
04/06/2021 ATM/POS DEBIT POS 345671 124258 1278 xxxxxxxxxxxx &amp; xxxxxxxx 02/06/2021 940.00 23,960.74
08/06/2021 ATM/POS DEBIT POS 345671 125678 1278 xxxxxxx xxxxx xxxxx 04/06/2021 13.20 13,347.54
15/06/2021 ATM/POS DEBIT POS 145671 156612 1671 xxxx xxxxxxxxxxxxxx680 11/06/2021 25.53 13,322.01
=======================================================================================================================================================================================================================================================================================================================================================
Number of records: 22 IBAN:: xxxx234567898765434567876545 Currency: EUR Current Statement Balance: 0,000.00
=======================================================================================================================================================================================================================================================================================================================================================
Maybe you could load the file, replace the problematic characters, then read it as fixed width with pd.read_fwf using io.StringIO to make an in-memory buffer:
>>> import io, pandas as pd
>>> with open('test.csv') as f:
... lines = f.readlines()
>>> pd.read_fwf(io.StringIO(''.join(lines[4:-3]).replace('&amp;', '&')))
a b c
0 11 fo& 0
This is the file's content, unaligned by the &amp; as you indicate:
>>> print(''.join(lines))
foo
bar
baz
qux
a b c
11 fo&amp; 0
ig
nore
me
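Applied to the statement above, the same idea might look like this; a sketch, assuming the filename from the question and its 4 header / 3 footer lines:
import io
import pandas as pd

with open('AccountStatement1.txt') as f:
    lines = f.readlines()

# undo the escaping so every row has its intended width again,
# then let read_fwf infer the column boundaries
body = ''.join(lines[4:-3]).replace('&amp;', '&')
df = pd.read_fwf(io.StringIO(body))
print(df.head())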

Data not showing in table form when using Jupyter Notebook

I ran the below code in Jupyter Notebook, I was expecting the output to appear like an excel table but instead the output was split up and not in a table. How can I get it to show up in table format?
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("Robbery_2014_to_2019.csv")
print(df.head())
Output:
X Y Index_ event_unique_id occurrencedate \
0 -79.270393 43.807190 17430 GO-2015134200 2015-01-23T14:52:00.000Z
1 -79.488281 43.764091 19205 GO-20142956833 2014-09-21T23:30:00.000Z
2 -79.215836 43.761856 15831 GO-2015928336 2015-03-23T11:30:00.000Z
3 -79.436264 43.642963 16727 GO-20142711563 2014-08-15T22:00:00.000Z
4 -79.369461 43.654526 20091 GO-20142492469 2014-07-12T19:00:00.000Z
reporteddate premisetype ucr_code ucr_ext \
0 2015-01-23T14:57:00.000Z Outside 1610 210
1 2014-09-21T23:37:00.000Z Outside 1610 200
2 2015-06-03T15:08:00.000Z Other 1610 220
3 2014-08-16T00:09:00.000Z Apartment 1610 200
4 2014-07-14T01:35:00.000Z Apartment 1610 100
offence ... occurrencedayofyear occurrencedayofweek \
0 Robbery - Business ... 23.0 Friday
1 Robbery - Mugging ... 264.0 Sunday
2 Robbery - Other ... 82.0 Monday
3 Robbery - Mugging ... 227.0 Friday
4 Robbery With Weapon ... 193.0 Saturday
occurrencehour MCI Division Hood_ID Neighbourhood \
0 14 Robbery D42 129 Agincourt North (129)
1 23 Robbery D31 27 York University Heights (27)
2 11 Robbery D43 137 Woburn (137)
3 22 Robbery D11 86 Roncesvalles (86)
4 19 Robbery D51 73 Moss Park (73)
Long Lat ObjectId
0 -79.270393 43.807190 2001
1 -79.488281 43.764091 2002
2 -79.215836 43.761856 2003
3 -79.436264 43.642963 2004
4 -79.369461 43.654526 2005
[5 rows x 29 columns]
Use display(df.head()); it produces slightly nicer output than a bare print().
The print function renders any object, a string or a computed value, as plain text, whereas display() renders a DataFrame as a formatted HTML table in the notebook. (A bare df.head() as the last expression of a cell is displayed the same way.)
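A minimal sketch of the same snippet using display(); the CSV filename is taken from the question:
import pandas as pd
from IPython.display import display

df = pd.read_csv("Robbery_2014_to_2019.csv")
display(df.head())  # renders as an HTML table in a Jupyter notebook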

Ignore backslash when reading tsv file in python

I have a large "|"-separated tsv with an address field where a bunch of values contain the following:
...xxx|yyy|Level 1 2 xxx Street\(MYCompany)|...
This ends up as:
line1) ...xxx|yyy|Level 1 2 xxx Street\
line2) (MYCompany)|...
I tried quoting=2 in read_table with pandas to treat non-numeric values as strings, but it still treats the backslash as a new line. What is an efficient way to handle rows whose fields contain a backslash escaping a newline; is there a way to ignore the newline after \?
Ideally it will prepare the data file so it can be read into a dataframe in pandas.
Update: here are 5 lines, with the breakage on the 3rd.
1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie
Here is another solution using regex:
import pandas as pd
import re

with open('input.tsv') as f:
    fl = f.read()

# replace backslash + newline with just the backslash, rejoining the broken lines
fl = re.sub('\\\\\n', '\\\\', fl)

with open('input_fix.tsv', 'w') as o:
    o.write(fl)

# Prime the number of columns by specifying names for each column;
# this takes care of the issue of a variable number of columns
cols = range(1, 17)
df = pd.read_csv('input_fix.tsv', sep='|', names=cols)
This will produce the desired, rejoined rows.
I think you can first try read_csv with a sep that does NOT occur in the values; it seems to read correctly then:
import pandas as pd
import io

temp = u"""
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep="^", header=None)
print(df)
0
0 49 XXX Ave|Australia
1 u7 38-46 South Street|Australia
2 XXX Margaret StreetNew South Wales|Australia
3 Po box ZZZ|Australia
Then you can create a new file with to_csv and read it back with read_csv and sep="|":
df.to_csv('myfile.csv', header=False, index=False)
print(pd.read_csv('myfile.csv', sep="|", header=None))
0 1
0 49 XXX Ave Australia
1 u7 38-46 South Street Australia
2 XXX Margaret StreetNew South Wales Australia
3 Po box ZZZ Australia
The next solution does not create a new file; it writes to the variable output and then reads it back with read_csv and io.StringIO:
import pandas as pd
import io

temp = u"""
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
0
0 49 XXX Ave|Australia
1 u7 38-46 South Street|Australia
2 XXX Margaret StreetNew South Wales|Australia
3 Po box ZZZ|Australia
output = df.to_csv(header=False, index=False)
print(output)
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret StreetNew South Wales|Australia
Po box ZZZ|Australia
print(pd.read_csv(io.StringIO(u"" + output), sep="|", header=None))
0 1
0 49 XXX Ave Australia
1 u7 38-46 South Street Australia
2 XXX Margaret StreetNew South Wales Australia
3 Po box ZZZ Australia
If I test it on your data, it seems that the 1st and 2nd rows have 14 fields, and the next two have 15 fields.
So I removed the last item from both of those rows (the 3rd and 4th); maybe it is only a typo (I hope so):
import pandas as pd
import io
temp=u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
0
0 1788768|1831171|208434489|2014-08-14 13:40:02|...
1 1788772|1831177|202234489|2014-08-14 13:41:37|...
2 1788776|1831182|205234489|2014-08-14 13:42:41|...
3 1788780|1831186|202634489|2014-08-14 13:43:46|...
output = df.to_csv(header=False, index=False)
print(pd.read_csv(io.StringIO(u"" + output), sep="|", header=None))
0 1 2 3 4 5 6 7 \
0 1788768 1831171 208434489 2014-08-14 13:40:02 108 c NaN Desktop
1 1788772 1831177 202234489 2014-08-14 13:41:37 108 c NaN iOS
2 1788776 1831182 205234489 2014-08-14 13:42:41 108 c NaN Desktop
3 1788780 1831186 202634489 2014-08-14 13:43:46 108 c NaN Desktop
8 9 10 11 \
0 coupon 49 XXX Ave Australia Victoria
1 NaN u7 38-46 South Street Australia New South Wales
2 NaN Level XXX Margaret Street(My Company) Australia New South Wales
3 NaN Po box ZZZ Australia New South Wales
12 13
0 3025 Melbourne
1 2116 Sydney
2 2000 Sydney
3 2444 NSW Other
But if the data is correct, add the parameter names=range(15) to read_csv:
print(pd.read_csv(io.StringIO(u"" + output), sep="|", names=range(15)))
0 1 2 3 4 5 6 7 \
0 1788768 1831171 208434489 2014-08-14 13:40:02 108 c NaN Desktop
1 1788772 1831177 202234489 2014-08-14 13:41:37 108 c NaN iOS
2 1788776 1831182 205234489 2014-08-14 13:42:41 108 c NaN Desktop
3 1788780 1831186 202634489 2014-08-14 13:43:46 108 c NaN Desktop
8 9 10 11 \
0 coupon 49 XXX Ave Australia Victoria
1 NaN u7 38-46 South Street Australia New South Wales
2 NaN Level XXX Margaret Street(My Company) Australia New South Wales
3 NaN Po box ZZZ Australia New South Wales
12 13 14
0 3025 Melbourne NaN
1 2116 Sydney NaN
2 2000 Sydney Sydney
3 2444 NSW Other Port Macquarie
