Pandas read_csv fails to separate tab-delimited data - python

I have some input files that look something like this:
GENE CHR START STOP NSNPS NPARAM N ZSTAT P
2541473 1 1109286 1133315 2 1 15000 3.8023 7.1694e-05
512150 1 1152288 1167447 1 1 15000 3.2101 0.00066347
3588581 1 1177826 1182102 1 1 15000 3.2727 0.00053256
I am importing the file like this:
df = pd.read_csv('myfile.out', sep='\t')
But all the data gets read into a single column. I have tried setting the encoding (encoding='utf-8', encoding='utf-16-le', encoding='utf-16-be'), but this does not work. Using sep=' ' does split the data, but into too many columns. Is there a way to read this data in correctly?

Try using \s+ (a regular expression that reads as "one or more whitespace characters") as your delimiter. Write it as a raw string so that Python does not treat \s as a string escape:
df = pd.read_csv('myfile.out', sep=r'\s+')
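As a minimal sketch of why this helps, the snippet below parses an in-memory sample in which the fields are separated by an inconsistent mix of spaces and tabs. The sample text is an assumption modelled on the rows in the question, not the actual file.

```python
import io

import pandas as pd

# Hypothetical sample: same columns as in the question, but with a mix of
# single spaces, runs of spaces, and a tab between the fields.
raw = (
    "GENE CHR START STOP NSNPS NPARAM N ZSTAT P\n"
    "2541473   1  1109286 1133315 2 1 15000 3.8023 7.1694e-05\n"
    "512150 1\t1152288 1167447 1 1 15000 3.2101 0.00066347\n"
)

# r'\s+' treats any run of whitespace as a single delimiter.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df.shape)
print(df.columns.tolist())
```

If the file really is tab-delimited, sep='\t' should also work, so a single column usually suggests the separators are spaces or a mix of both.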

Related

Reading in text file with time column which is separated by commas?

I have a txt-file with data that looks like this
A,B,C,Time
xyz,1,MN,14/11/20 17:20:08,296000000
tuv,0,ST,30/12/20 11:11:18,111111111
I read the data in using this code:
df = pd.read_csv('path/to/file',delimiter=',')
Because the values in my Time column contain a comma, this does not work correctly: each Time value is split into two fields. How can I solve this, and how can I make it work even if I have multiple columns with such a time format?
I would like to get a DataFrame which looks like this:
A B C Time
xyz 1 MN 14/11/20 17:20:08,296000000
tuv 0 ST 30/12/20 11:11:18,111111111
Thanks a lot!
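Each data row has one more field than the header, so read_csv treats the first field as the index and the header names end up one column off. Use the reset_index(), apply() and drop() methods to undo this: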
df=df.reset_index()
df['Time']=df[['C','Time']].astype(str).apply(','.join,1)
df=df.drop(columns=['C'])
df.columns=['A','B','C','Time']
Now if you print df, you will get the desired output:
A B C Time
0 xyz 1 MN 14/11/20 17:20:08,296000000
1 tuv 0 ST 30/12/20 11:11:18,111111111
If you wish to convert it back to a txt file, use:
df.to_csv('filename.txt',sep='|',index=False)
Note: the Time values contain both ',' and ' ', so choosing a separator such as '|' that does not occur in the data keeps the file unambiguous for tools that do not handle quoted fields.
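An alternative, shown here only as a sketch, is to split each line yourself and cap the number of splits so that everything after the third comma stays in the last column. It assumes the comma-containing column is the last one, so it would not cover several such columns, and it keeps every value as a string. The in-memory sample stands in for the real file.

```python
import io

import pandas as pd

# In-memory stand-in for the text file from the question.
raw = io.StringIO(
    "A,B,C,Time\n"
    "xyz,1,MN,14/11/20 17:20:08,296000000\n"
    "tuv,0,ST,30/12/20 11:11:18,111111111\n"
)

header = raw.readline().rstrip("\n").split(",")

# Split at most len(header) - 1 times, so any extra commas remain
# inside the final field instead of creating a fifth column.
rows = [
    line.rstrip("\n").split(",", len(header) - 1)
    for line in raw
    if line.strip()
]

df = pd.DataFrame(rows, columns=header)
print(df)
```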

Panda module export, split data

I'm trying to read a .txt file and output the count of each letter which works, however, I'm having trouble exporting that data to .csv in a specific way.
A snippet of the code:
freqs = {}
with open(Book1) as f:
    for line in f:
        for char in line:
            if char in freqs:
                freqs[char] += 1
            else:
                freqs[char] = 1
print(freqs)
And for the exporting to csv, I did the following:
test = {'Book 1 Output':[freqs]}
df = pd.DataFrame(test, columns=['Book 1 Output'])
df.to_csv(r'book_export.csv', sep=',')
Currently when I run it, the export looks like this (Manually done):
However I want the output to be each individual row, so it should look something like this when I open it:
I want it to separate it from the ":" and "," into 3 different columns.
I've tried various other answers on here but most of them end up with giving ValueErrors so maybe I just don't know how to apply it like the following one.
df[[',']] = df[','].str.split(expand=True)
Use DataFrame.from_dict together with DataFrame.rename_axis to set the index name; the CSV then looks the way you need:
#sample data
freqs = {'a':5,'b':2}
df = (pd.DataFrame.from_dict(freqs, orient='index',columns=['Book 1 Output'])
.rename_axis('Letter'))
print (df)
Book 1 Output
Letter
a 5
b 2
df.to_csv(r'book_export.csv', sep=',')
Alternatively, use a Series:
s = pd.Series(freqs, name='Book 1 Output').rename_axis('Letter')
print (s)
Letter
a 5
b 2
Name: Book 1 Output, dtype: int64
s.to_csv(r'book_export.csv', sep=',')
EDIT:
If there are multiple frequency dictionaries, change the DataFrame constructor:
freqs = {'a':5,'b':2}
freqs1 = {'a':9,'b':3}
df = pd.DataFrame({'f1':freqs, 'f2':freqs1}).rename_axis('Letter')
print (df)
f1 f2
Letter
a 5 9
b 2 3
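For completeness, here is a rough end-to-end sketch that goes from text to a CSV with one row per letter. It swaps the manual counting loop for collections.Counter, counts letters only, and writes to an in-memory buffer instead of book_export.csv; the sample text is made up for illustration.

```python
import io
from collections import Counter

import pandas as pd

# Made-up text standing in for the contents of the book file.
text = "abba\ncab\n"

# Count letters only; newlines and other characters are skipped.
freqs = Counter(ch for ch in text if ch.isalpha())

# One row per letter, with the letter as a named index.
s = (
    pd.Series(dict(freqs), name="Book 1 Output")
    .rename_axis("Letter")
    .sort_index()
)

# Write to a buffer here; pass a file name to write to disk instead.
buf = io.StringIO()
s.to_csv(buf)
csv_text = buf.getvalue()
print(csv_text)
```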

Pandas read_csv - How to handle a comma inside double quotes that are themselves inside double quotes

This is not the same question as double quoted elements in csv cant read with pandas.
The difference is that in that question: "ABC,DEF" was breaking the code.
Here, "ABC "DE" ,F" is breaking the code.
The whole string should be parsed in as 'ABC "DE", F'. Instead the inside double quotes are leading to the below-mentioned issue.
I am working with a csv file that contains the following type of entries:
header1, header2, header3,header4
2001-01-01,123456,"abc def",V4
2001-01-02,789012,"ghi "jklm" n,op",V4
The second row of data is breaking the code, with the following error:
ParserError: Error tokenizing data. C error: Expected 4 fields in line 1234, saw 5
I have tried playing with various sep, delimiter & quoting etc. arguments but nothing seems to work.
Can someone please help with this? Thank you!
Based on the two rows you have provided, here is an option where the text file is read into a Series and Series.str.extract() is then used with a regular expression to get the information you want into a DataFrame:
with open('so.txt') as f:
    contents = f.readlines()
s = pd.Series(contents)
s now looks like the following:
0 header1, header2, header3,header4\n
1 \n
2 2001-01-01,123456,"abc def",V4\n
3 \n
4 2001-01-02,789012,"ghi "jklm" n,op",V4
Now you can use regex extract to get what you want into a DataFrame:
df = s.str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}),([0-9]+),(.+),(\w{2})$')
# remove empty rows
df = df.dropna(how='all')
df looks like the following:
0 1 2 3
2 2001-01-01 123456 "abc def" V4
4 2001-01-02 789012 "ghi "jklm" n,op" V4
You can then set your column names with df.columns = ['header1', 'header2', 'header3', 'header4']
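The same idea can be tried without a file on disk. The sketch below is a variation rather than the exact code above: it places the outer double quotes outside the capture group so they are dropped from the third column, and it assumes every data row follows the date, number, quoted text, two-character code layout of the sample rows.

```python
import pandas as pd

# The sample rows from the question, held in memory instead of so.txt.
lines = [
    'header1, header2, header3,header4\n',
    '2001-01-01,123456,"abc def",V4\n',
    '2001-01-02,789012,"ghi "jklm" n,op",V4\n',
]
s = pd.Series([line.rstrip("\n") for line in lines])

# The greedy (.+) runs up to the last '",' that is followed by the
# two-character code, so inner quotes and commas stay in one field.
pattern = r'^(\d{4}-\d{2}-\d{2}),(\d+),"(.+)",(\w{2})$'

# Rows that do not match (such as the header) become all-NaN and are dropped.
df = s.str.extract(pattern).dropna(how="all")
df.columns = ["header1", "header2", "header3", "header4"]
print(df)
```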

how can I really change the values of specific text file with pandas

Hi all,
I have a txt file with many columns and no headers.
I use
df=pd.read_csv('a.txt',sep=' ',header=None,usecols=[2,3],names=['waiting time','running time'])
Suppose the columns look like this:
waiting time running time
0 8617344 8638976
1 8681728 8703360
2 8703488 8725120
3 8725120 8725760
4 4185856 4207488
From the running time column, I want to subtract the values of the waiting time column, so that I get:
waiting time running time
0 8617344 21632
1 8681728 21632
2 8703488 21632
3 8725120 640
4 4185856 21632
My question is: how do I make this change actually happen in the txt file? That is, the txt file itself should be updated accordingly.
If your question is how to update the text file with your new data, just use the writing counterpart of your first line:
# Save to file a.txt
# Use a space as the separator
# Don't save the index
df.to_csv('a.txt', sep=' ', index=False)
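A small sketch of the full round trip follows: subtract, write, and read back to confirm the file changed. The values are taken from the question, but the temporary output path and the use of header=False (to keep the file without a header row, as the original was) are assumptions added for illustration.

```python
import os
import tempfile

import pandas as pd

# A few values from the question, built in memory instead of read from a.txt.
df = pd.DataFrame({
    "waiting time": [8617344, 8681728, 8725120],
    "running time": [8638976, 8703360, 8725760],
})

# Replace the running time column with the difference of the two columns.
df["running time"] = df["running time"] - df["waiting time"]

# Hypothetical output location; point this at the real file to overwrite it.
out_path = os.path.join(tempfile.mkdtemp(), "a_updated.txt")

# header=False keeps the output headerless; index=False omits the row index.
df.to_csv(out_path, sep=" ", index=False, header=False)

# Read the file back to confirm the change is on disk.
check = pd.read_csv(out_path, sep=" ", header=None)
print(check)
```

Note that this writes only the columns held in df; if a.txt has other columns that were skipped by usecols, overwriting it this way would drop them.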

Change CSV before importing it to pandas

I have an issue with a CSV file I am trying to import into pandas. The structure of the file is as follows:
the first character of the file is a single quote;
the last character of the file is a single quote;
every line of the CSV starts with a double quote and ends with a double quote followed by \n
So I have issues importing it with pandas.read_csv. Ideally I would like pandas to just ignore the single and double quotes when importing (not taking them into account for the structure of the data frame, and not importing them as characters).
I do not really know whether I should manipulate the CSV file before using pandas.read_csv, or whether there is an option to just ignore these characters.
The first argument of pd.read_csv is either a file name or a stream.
You can read the file manually and manipulate the stream before handing it to pandas.
from io import StringIO
import pandas as pd
sio = StringIO("id,category,value\n1,beer,2.40\n2,wine,6.40\n3,$$$Theawsomestuff$$$###,166.00")
pd.read_csv(sio)
id category value
0 1 beer 2.4
1 2 wine 6.4
2 3 $$$Theawsomestuff$$$### 166.0
Thus, by subclassing StringIO, you can change the behavior of the read method:
class StreamChanger(StringIO):
    # Accept positional arguments too, since callers may pass a size to read()
    def read(self, *args, **kwargs):
        data = super().read(*args, **kwargs)
        data = data.replace("$", "")
        data = data.replace("#", "")
        return data
sio = StreamChanger("id,category,value\n1,beer,2.40\n2,wine,6.40\n3,$$$Theawsomestuff$$$###,166.00")
pd.read_csv(sio)
id category value
0 1 beer 2.4
1 2 wine 6.4
2 3 Theawsomestuff 166.0
Use the quoting parameter and pass the value 3 (csv.QUOTE_NONE) to read_csv. Once the DataFrame has been created, remove the quotes from the data and the headers.
import pandas as pd
df=pd.read_csv('check.txt',doublequote=True,delimiter=',',quoting=3)
df=df.replace({'"': '','\'':''}, regex=True)
df.columns = ['Id1','StartTime','start_lat','start_long','StartGeohash']
print(df)
Sample File
'Id1,StartTime,start_lat,start_long,StartGeohash
"113,2016-11-01 10:50:28.063,-33.139507,-100.226715,9vbsx2"
"113,2016-11-02 10:49:24.063,-33.139507,-100.226715,9vbsx2"
"115,2016-11-03 10:55:20.063,-36.197660,-101.186050,9y2jcm"'
output
Id1 StartTime start_lat start_long StartGeohash
0 113 2016-11-01 10:50:28.063 -33.139507 -100.226715 9vbsx2
1 113 2016-11-02 10:49:24.063 -33.139507 -100.226715 9vbsx2
2 115 2016-11-03 10:55:20.063 -36.197660 -101.186050 9y2jcm
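Another possibility, sketched below, is to clean the text before pandas sees it: strip the quote characters from both ends of every line and pass the result to read_csv through a StringIO. This assumes no field legitimately starts or ends a line with a quote, and the in-memory string stands in for the real file contents.

```python
import io

import pandas as pd

# Stand-in for the file: a single quote wraps the whole file and
# double quotes wrap each data line.
raw = (
    "'Id1,StartTime,start_lat,start_long,StartGeohash\n"
    '"113,2016-11-01 10:50:28.063,-33.139507,-100.226715,9vbsx2"\n'
    '"115,2016-11-03 10:55:20.063,-36.197660,-101.186050,9y2jcm"\''
)

# Remove any leading or trailing single and double quotes on each line.
cleaned = "\n".join(line.strip("'\"") for line in raw.splitlines())

# With the wrappers gone, the default parser can infer numeric types.
df = pd.read_csv(io.StringIO(cleaned))
print(df.dtypes)
```

Compared with quoting=3 followed by replace, this lets read_csv infer numeric types directly, because the values no longer carry stray quote characters.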
