How to specify a row delimiter in pandas read_csv()? - python

I would like to use a row delimiter other than \n with pandas.read_csv(). Does anyone know how to do that?
In the documentation of read_csv(), I found nothing related, but in the to_csv() page, I found the parameter line_terminator.
How may I solve this problem?

Here is an example:
File: a.csv
hello,world:hell,worl:hel,wor:he,wo
import pandas as pd
a = pd.read_csv('a.csv', lineterminator=':')
print(a)
Output:
hello world
0 hell worl
1 hel wor
2 he wo
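The same example can be reproduced without a file on disk. This is a minimal sketch that assumes the content of a.csv is held in an in-memory string (the variable names are illustrative only). Note that lineterminator must be a single character and is only supported by the C parser.

```python
import io

import pandas as pd

# Same content as a.csv above, with ':' separating the rows.
raw = "hello,world:hell,worl:hel,wor:he,wo"

# lineterminator tells the C parser which single character ends a row.
df = pd.read_csv(io.StringIO(raw), lineterminator=":")
print(df)
```

The first record is still treated as the header, so the frame ends up with the columns hello and world and three data rows.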

Related

Using python to search for strings in a file and use output to group the content of the second folder

I tried writing Python code that searches for one or more strings in File1.txt and then modifies the findall output (e.g., changes cap00001 to 1). Next, the code should use the modified output to group the content of File2.txt based on matches in the "capNo" column of File2.txt.
File1.txt:
>cap00001 supr2
x2shh qewrrw
dsfff rggfdd
>cap00002 supr5
dadamic adertsy
waeee ddccmet
File2.txt
Ref capNo qual
AM1 1 Good
AM8 1 Good
AM7 2 Poor
AM2 2 Good
AM9 2 Good
AM6 3 Poor
AM1 3 Poor
AM2 3 Good
Required output:
capNo counts
1 2
2 3
The following code did not work for me:
import re
With open("File1.txt","r") as InFile1:
for line in InFile1:
match=re.findall(r'cap\d+',line)
if len(match) > 0:
match=match.remove(cap0000)
With open("File2.txt","r") as InFile2:
df=InFile2.read()
df2=df.groupby(match)["capNo"].value_counts()
print(df2)
How can I get this code working? Thanks
Change each With to with (Python keywords are lowercase).
Call the read function, e.g.:
with open('File1.txt') as f:
    InFile1 = f.read()
    # Do something with InFile1
In your code df is a string, so you can't call groupby on it (did you mean to convert it to a pandas DataFrame?)
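One way to put these points together is sketched below. It is not the asker's code fixed line by line: it assumes File2.txt is whitespace-delimited, uses in-memory strings in place of the two files so that it runs on its own, and the variable names are made up for illustration.

```python
import io
import re

import pandas as pd

# Stand-ins for File1.txt and File2.txt (assumed whitespace-delimited).
file1_text = """>cap00001 supr2
x2shh qewrrw
dsfff rggfdd
>cap00002 supr5
dadamic adertsy
waeee ddccmet
"""
file2_text = """Ref capNo qual
AM1 1 Good
AM8 1 Good
AM7 2 Poor
AM2 2 Good
AM9 2 Good
AM6 3 Poor
AM1 3 Poor
AM2 3 Good
"""

# Capture only the digits after "cap"; int() drops the leading zeros.
cap_numbers = {int(m) for m in re.findall(r"cap(\d+)", file1_text)}

# Load File2 as a DataFrame so that groupby is available.
df = pd.read_csv(io.StringIO(file2_text), sep=r"\s+")

# Keep only rows whose capNo appears in File1, then count rows per capNo.
counts = (
    df[df["capNo"].isin(cap_numbers)]
    .groupby("capNo")
    .size()
    .reset_index(name="counts")
)
print(counts)
```

With real files, io.StringIO(file2_text) would be replaced by the path to File2.txt, and file1_text by the result of f.read() as shown in the answer.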

Python dataframe to_csv() how to remove the tab at the beginning of the output text file

I want to save a pandas DataFrame into a text file and make it R-friendly for later analysis. I used dataframe.to_csv(filename, sep = '\t'). But I noticed that the output file begins with a tab, which read.table() in R does not handle well.
I used od -c filename, and it showed like this:
\t 1 2 3 4 \t 5 6 7 8 \t 1 2 3 ...
Is there any way to remove the tab at the beginning? Thank you in advance.
Looking at the documentation, it seems that this has to do with the index.
Try this one on for size:
dataframe.to_csv(filename, sep = '\t', index_label=False)
The docs state this:
index_label : str or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the object uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R.
I am on Windows so I cannot use the od command to check.
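The effect can also be checked without od by writing to a string instead of a file. The sketch below uses a small made-up DataFrame. The third variant, index=False, is not part of the answer above and is only relevant if the row labels are not needed at all.

```python
import pandas as pd

# Small made-up frame; the real data is not shown in the question.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# With no path given, to_csv returns the text instead of writing a file.
default_text = df.to_csv(sep="\t")                        # header starts with a tab
r_friendly_text = df.to_csv(sep="\t", index_label=False)  # no empty header field
no_index_text = df.to_csv(sep="\t", index=False)          # index column dropped

for text in (default_text, r_friendly_text, no_index_text):
    # repr() makes the tab characters visible as \t.
    print(repr(text.splitlines()[0]), repr(text.splitlines()[1]))
```

With index_label=False the header has one field fewer than the data rows, which is the layout read.table() in R interprets as "first column holds row names".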

How to read in csv with no specific delimiter?

I have a problem. I have a csv file which has no "," as delimiter but is built as a common excel file.
# 2016-01-01: Prices/Volumes for Market
23-24 24,57
22-23 30,1
21-22 29,52
20-21 33,07
19-20 35,34
18-19 37,41
I am only interested in reading in the second column, e.g. 24,57 in the first line. The data has no header. How could I proceed here?
pd.read_csv(f,usecols = [2])
Does not work because I think there is no column identified. Thanks for your help!
Maybe it is not suitable to read it as CSV.
Try using a regular expression and processing the file line by line:
https://docs.python.org/2/library/re.html
For example:
>>> import re
>>> re.search(r'(\d{2})-(\d{2}) (\d{2}),(\d{2})', "23-24 24,57").group(1)
'23'
>>> re.search(r'(\d{2})-(\d{2}) (\d{2}),(\d{2})', "23-24 24,57").group(2)
'24'
>>> re.search(r'(\d{2})-(\d{2}) (\d{2}),(\d{2})', "23-24 24,57").group(3)
'24'
>>> re.search(r'(\d{2})-(\d{2}) (\d{2}),(\d{2})', "23-24 24,57").group(4)
'57'
To read a file line by line in Python, see:
How to read large file, line by line in python
Try this:
pd.read_csv(f, delim_whitespace=True, names=['desired_col_name'], usecols=[1])
Alternatively, you might want to use pd.read_fwf.
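A slightly fuller sketch of the read_csv approach follows. It assumes the two columns are separated by whitespace (spaces or tabs), that the leading "#" line should be skipped as a comment, and that the comma is a decimal mark. The column names hour and price are hypothetical. sep=r"\s+" is used as the equivalent of delim_whitespace=True, which recent pandas versions deprecate.

```python
import io

import pandas as pd

# Stand-in for the file from the question.
raw = """# 2016-01-01: Prices/Volumes for Market
23-24 24,57
22-23 30,1
21-22 29,52
20-21 33,07
19-20 35,34
18-19 37,41
"""

prices = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",               # any run of whitespace separates the fields
    comment="#",              # skip the leading "# 2016-01-01: ..." line
    header=None,              # the data itself has no header row
    names=["hour", "price"],  # hypothetical column names
    usecols=["price"],        # keep only the second column
    decimal=",",              # parse 24,57 as 24.57
)
print(prices)
```

Without decimal=",", the price column would be read as strings such as "24,57" rather than as floats.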

Data reading - csv

I have some data in a .dfx file and I am trying to read it as a CSV with pandas. But it contains some special characters that pandas does not read. They are separators as well. I attached one line from it.
The "DC4" is removed when I print the file. The SI is read as a space, correctly. I tried several encodings (utf-8, latin1, etc.), but with no success.
I attached the printed first line as well. I marked the place where the characters should be.
My code is simple:
import pandas
file_log = pandas.read_csv("file_log.DFX", header=None)
print(file_log)
I hope I was clear and someone has an idea.
Thanks in advance!
EDIT:
The input. LINK: drive.google.com/open?id=0BxMDhep-LHOIVGcybmsya2JVM28
The expected output:
88.4373 0 12.07.2014/17:05:22 38.0366 38.5179 1.3448 31.9839
30.0070 0 12.07.2014/17:14:27 38.0084 38.5091 0.0056 0.0033
Examining example.DFX in hex (with xxd) shows that the two separators are 0x14 and 0x0f, respectively.
Read the csv with multiple separators using python engine:
import pandas
sep1 = chr(0x14) # the one shows dc4
sep2 = chr(0x0f) # the one shows si
file_log = pandas.read_csv('example.DFX', header=None, sep='{}|{}'.format(sep1, sep2), engine='python')
print(file_log)
And you get:
0 1 2 3 4 5 6 7
0 88.4373 0 12.07.2014/17:05:22 38.0366 38.5179 1.3448 31.9839 NaN
1 30.0070 0 12.07.2014/17:14:27 38.0084 38.5091 0.0056 0.0033 NaN
It seems it has an empty column at the end. But I'm sure you can handle that.
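One way to handle the trailing empty column is to drop every column that is entirely NaN. The sketch below does not use the real example.DFX: the two sample lines are invented, and the exact positions of the DC4 and SI separators in them are an assumption.

```python
import io

import pandas as pd

DC4 = chr(0x14)  # "device control 4"
SI = chr(0x0F)   # "shift in"

# Invented lines that mix both separators and end with a separator,
# which is what produces the empty last column.
sample = (
    f"88.4373{DC4}0{SI}12.07.2014/17:05:22{DC4}38.0366{DC4}\n"
    f"30.0070{DC4}0{SI}12.07.2014/17:14:27{DC4}38.0084{DC4}\n"
)

file_log = pd.read_csv(
    io.StringIO(sample),
    header=None,
    sep=f"[{DC4}{SI}]",  # regex character class: split on either separator
    engine="python",     # regex separators need the Python engine
)

# Drop columns in which every value is NaN (the trailing empty one).
file_log = file_log.dropna(axis=1, how="all")
print(file_log)
```

A character class is used here in place of the '{}|{}' alternation from the answer; both are regular expressions that match either of the two characters.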
The encoding seems to be ASCII here. DC4 stands for "device control 4" and SI for "Shift In". These are control characters in an ASCII file and are not printable. Thus you cannot see them when you issue print(file_log), although depending on your terminal they may have some visible effect (just as \n produces a new line).
Try typing file_log in your interpreter to get the representation of that variable and check if those special characters are included. Chances are that you'll see DC4 in the representation as '\x14' which means hexadecimal 14.
You may then further process these strings in your program by using string manipulation like replace.

reading in rho delimited file

I'm trying to use Pandas to read in a delimited file. The separator is a greek character, lowercase rho (þ).
I'm struggling to define the correct read_table parameters so that the resulting data frame is correctly formatted.
Does anyone have any experience or suggestions with this?
An example of the file is below
TimeþUser-IDþAdvertiser-IDþOrder-IDþAd-IDþCreative-IDþCreative-VersionþCreative-Size-IDþSite-IDþPage-IDþCountry-IDþState/ProvinceþBrowser-IDþBrowser-VersionþOS-IDþDMA-IDþCity-IDþZip-CodeþSite-DataþTime-UTC-Sec
03-28-2016-00:50:03þ0þ3893600þ7786669þ298662779þ67802437þ1þ300x250þ1722397þ125754620þ68þþ30þ0.0þ501012þ0þ3711þþþ1459122603
03-28-2016-00:24:29þ0þ3893600þ7352234þ290743769þ55727503þ1þ1x1þ1602646þ117915815þ68þþ31þ0.0þ501012þ0þ3711þþþ1459121069
03-28-2016-00:13:42þ0þ3893600þ7352234þ290743769þ55727503þ1þ1x1þ1602646þ117915815þ68þþ31þ0.0þ501012þ0þ3711þþþ1459120422
03-28-2016-00:21:09þ0þ3893600þ7352234þ290743769þ55727503þ1þ1x1þ1602646þ117915815þ68þþ31þ0.0þ501012þ0þ3711þþþ1459120869
I think what's happening is that the C engine isn't working here. If we switch to the Python engine, which is more powerful but slower, it seems to behave. For example, with the default C engine:
>>> df = pd.read_csv("out.rsv",sep="þ")
>>> df.iloc[:,:5]
TimeþUser-IDþAdvertiser-IDþOrder-IDþAd-IDþCreative-IDþCreative-VersionþCreative-Size-IDþSite-IDþPage-IDþCountry-IDþState/ProvinceþBrowser-IDþBrowser-VersionþOS-IDþDMA-IDþCity-IDþZip-CodeþSite-DataþTime-UTC-Sec
0 03-28-2016-00:50:03þ0þ3893600þ7786669þ29866277...
1 03-28-2016-00:24:29þ0þ3893600þ7352234þ29074376...
2 03-28-2016-00:13:42þ0þ3893600þ7352234þ29074376...
3 03-28-2016-00:21:09þ0þ3893600þ7352234þ29074376...
But with Python:
>>> df = pd.read_csv("out.rsv",sep="þ", engine="python")
>>> df.iloc[:,:5]
Time User-ID Advertiser-ID Order-ID Ad-ID
0 03-28-2016-00:50:03 0 3893600 7786669 298662779
1 03-28-2016-00:24:29 0 3893600 7352234 290743769
2 03-28-2016-00:13:42 0 3893600 7352234 290743769
3 03-28-2016-00:21:09 0 3893600 7352234 290743769
.. but seriously, þ? You're using þ as a delimiter? The only search hits google gives me for "rho delimited file" are all related to this question!
Note that you say lowercase rho, but it looks like thorn to me.. Maybe it's a lowercase rho on your end and got confused in posting?
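The Python-engine call can be tried on a shortened sample without the original file. In this sketch the header and rows are cut down to the first three fields of the data above, and the variable names are illustrative.

```python
import io

import pandas as pd

# First three fields of the sample data, still delimited by the thorn character.
sample = (
    "TimeþUser-IDþAdvertiser-ID\n"
    "03-28-2016-00:50:03þ0þ3893600\n"
    "03-28-2016-00:24:29þ0þ3893600\n"
)

# engine="python" is requested explicitly, as in the answer above.
df = pd.read_csv(io.StringIO(sample), sep="þ", engine="python")
print(df)
```

When reading from an actual file, the encoding argument of read_csv may also need to match the file's encoding for the delimiter to be recognized.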
