Reading in text file with time column which is separated by commas? - python

I have a txt file with data that looks like this:
A,B,C,Time
xyz,1,MN,14/11/20 17:20:08,296000000
tuv,0,ST,30/12/20 11:11:18,111111111
I read the data in using this code:
df = pd.read_csv('path/to/file', delimiter=',')
This does not work correctly because the Time values themselves contain a comma. How can I solve this, and how can I make it work even when I have multiple columns with such a time format?
I would like to get a dataframe which looks like this:
A B C Time
xyz 1 MN 14/11/20 17:20:08,296000000
tuv 0 ST 30/12/20 11:11:18,111111111
Thanks a lot!

Use the reset_index(), apply() and drop() methods:
# read_csv used the first field of each row as the index; move it back into a column
df = df.reset_index()
# the fractional seconds ended up in 'Time'; re-join them with the time stored in 'C'
df['Time'] = df[['C', 'Time']].astype(str).apply(','.join, axis=1)
df = df.drop(columns=['C'])
# the data is now under the wrong column names, so reassign them
df.columns = ['A', 'B', 'C', 'Time']
Now if you print df you will get the desired output:
A B C Time
0 xyz 1 MN 14/11/20 17:20:08,296000000
1 tuv 0 ST 30/12/20 11:11:18,111111111
If you wish to write it back to a txt file, use:
df.to_csv('filename.txt', sep='|', index=False)
Note: don't use ',' or ' ' as the sep parameter, because that recreates the same problem the next time you load the txt/csv file.
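Another option, if every fractional-seconds value directly follows a time of the form HH:MM:SS, is to split only on commas that are not preceded by such a time. This is just a sketch; the lookbehind pattern is an assumption based on the two sample rows, but it generalises to any number of time columns that match it:
import pandas as pd

# Split on commas that are NOT immediately preceded by ":<two digits>",
# so "17:20:08,296000000" stays together in one field.
df = pd.read_csv('path/to/file', sep=r'(?<!:\d\d),', engine='python')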

Related

Pandas read_csv fails to separate tab-delimited data

I have some input files that look something like this:
GENE CHR START STOP NSNPS NPARAM N ZSTAT P
2541473 1 1109286 1133315 2 1 15000 3.8023 7.1694e-05
512150 1 1152288 1167447 1 1 15000 3.2101 0.00066347
3588581 1 1177826 1182102 1 1 15000 3.2727 0.00053256
I am importing the file like this:
df = pd.read_csv('myfile.out', sep='\t')
But all the data gets read into a single column. I have tried passing encoding='utf-8', encoding='utf-16-le' and encoding='utf-16-be', but this does not work. Using sep=' ' does split the data, but into too many columns. Is there a way to correctly read in this data?
Try using \s+ (which reads as "one or more whitespace characters") as your delimiter:
df = pd.read_csv('myfile.out', sep=r'\s+')
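If you prefer not to use a regex, delim_whitespace=True should behave the same way (note that newer pandas versions deprecate this parameter in favour of sep=r'\s+'):
# Equivalent alternative: treat any run of whitespace as the separator.
df = pd.read_csv('myfile.out', delim_whitespace=True)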

Panda module export, split data

I'm trying to read a .txt file and output the count of each letter, which works; however, I'm having trouble exporting that data to .csv in a specific way.
A snippet of the code:
freqs = {}
with open(Book1) as f:
    for line in f:
        for char in line:
            if char in freqs:
                freqs[char] += 1
            else:
                freqs[char] = 1
print(freqs)
And for the exporting to csv, I did the following:
test = {'Book 1 Output':[freqs]}
df = pd.DataFrame(test, columns=['Book 1 Output'])
df.to_csv(r'book_export.csv', sep=',')
Currently when I run it, the export looks like this (Manually done):
However I want the output to be each individual row, so it should look something like this when I open it:
I want it separated at the ":" and "," into 3 different columns.
I've tried various other answers on here, but most of them end up giving ValueErrors, so maybe I just don't know how to apply them, like the following one:
df[[',']] = df[','].str.split(expand=True)
Use DataFrame.from_dict with DataFrame.rename_axis to set the index name; then the csv looks like you need:
#sample data
freqs = {'a':5,'b':2}
df = (pd.DataFrame.from_dict(freqs, orient='index',columns=['Book 1 Output'])
.rename_axis('Letter'))
print (df)
Book 1 Output
Letter
a 5
b 2
df.to_csv(r'book_export.csv', sep=',')
Or, as an alternative, use a Series:
s = pd.Series(freqs, name='Book 1 Output').rename_axis('Letter')
print (s)
Letter
a 5
b 2
Name: Book 1 Output, dtype: int64
s.to_csv(r'book_export.csv', sep=',')
EDIT:
If there are multiple frequency dictionaries, change the DataFrame constructor:
freqs = {'a':5,'b':2}
freqs1 = {'a':9,'b':3}
df = pd.DataFrame({'f1':freqs, 'f2':freqs1}).rename_axis('Letter')
print (df)
f1 f2
Letter
a 5 9
b 2 3
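As a side note (not part of the original answer), the counting loop in the question can be shortened with collections.Counter, which builds the same freqs dictionary:
from collections import Counter

# Book1 stands for whatever path the question's Book1 variable holds.
with open(Book1) as f:
    freqs = dict(Counter(f.read()))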

Extract prefix from string in dataframe column where exists in a list

Looking for some help.
I have a pandas dataframe column and I want to extract the prefix where such prefix exists in a separate list.
pr_list = ['1 FO-','2 IA-']
Column in df is like
PartNumber
ABC
DEF
1 FO-BLABLA
2 IA-EXAMPLE
What I am looking for is to extract the prefix where present, put it in a new column, and leave the rest of the string in the original column.
PartNumber Prefix
ABC
DEF
BLABLA 1 FO-
EXAMPLE 2 IA-
I have tried some things like str.startswith, but I'm a bit of a Python novice and wasn't able to get it to work.
Much appreciated.
EDIT
Both solutions below work on the test data; however, I am getting an error:
error: nothing to repeat at position 16
This suggests something is askew in my dataset. I'm not sure what position 16 refers to, but looking at position 16 in both the prefix list and the PartNumber column, nothing seems out of the ordinary.
EDIT 2
I have traced it to an * in pr_list, which seems to be throwing it off. Is * some reserved character? Is there a way to escape it so it is read as plain text?
You can try:
df['Prefix'] = df.PartNumber.str.extract(r'({})'.format('|'.join(pr_list))).fillna('')
df.PartNumber = df.PartNumber.str.replace('|'.join(pr_list), '', regex=True)
print(df)
PartNumber Prefix
0 ABC
1 DEF
2 BLABLA 1 FO-
3 EXAMPLE 2 IA-
Maybe it's not exactly what you are looking for, but it may help.
import pandas as pd
pr_list = ['1 FO-','2 IA-']
df = pd.DataFrame({'PartNumber':['ABC','DEF','1 FO-BLABLA','2 IA-EXAMPLE']})
extr = '|'.join(x for x in pr_list)
df['Prefix'] = df['PartNumber'].str.extract('('+ extr + ')', expand=False).fillna('')
df['PartNumber'] = df['PartNumber'].str.replace('|'.join(pr_list), '', regex=True)
df
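Regarding the "nothing to repeat" error from EDIT 2: * is a regex metacharacter, and both solutions above build a regex from pr_list, so a literal * in a prefix has to be escaped. A sketch using re.escape, assuming the prefixes should always be matched as literal text:
import re

# Escape each prefix so characters like '*' are treated literally.
extr = '|'.join(re.escape(x) for x in pr_list)
df['Prefix'] = df['PartNumber'].str.extract('(' + extr + ')', expand=False).fillna('')
df['PartNumber'] = df['PartNumber'].str.replace(extr, '', regex=True)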

Pandas read_csv - How to handle a comma inside double quotes that are themselves inside double quotes

This is not the same question as double quoted elements in csv cant read with pandas.
The difference is that in that question: "ABC,DEF" was breaking the code.
Here, "ABC "DE" ,F" is breaking the code.
The whole string should be parsed as 'ABC "DE", F'. Instead, the inner double quotes lead to the issue mentioned below.
I am working with a csv file that contains the following type of entries:
header1, header2, header3,header4
2001-01-01,123456,"abc def",V4
2001-01-02,789012,"ghi "jklm" n,op",V4
The second row of data is breaking the code, with the following error:
ParserError: Error tokenizing data. C error: Expected 4 fields in line 1234, saw 5
I have tried playing with various sep, delimiter, quoting, etc. arguments, but nothing seems to work.
Can someone please help with this? Thank you!
Based on the two rows you have provided, here is an option where the text file is read into a Series object and Series.str.extract() is then used with a regex to get the information you want into a DataFrame:
with open('so.txt') as f:
    contents = f.readlines()

s = pd.Series(contents)
s now looks like the following:
0 header1, header2, header3,header4\n
1 \n
2 2001-01-01,123456,"abc def",V4\n
3 \n
4 2001-01-02,789012,"ghi "jklm" n,op",V4
Now you can use regex extract to get what you want into a DataFrame:
df = s.str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}),([0-9]+),(.+),(\w{2})$')
# remove empty rows
df = df.dropna(how='all')
df looks like the following:
0 1 2 3
2 2001-01-01 123456 "abc def" V4
4 2001-01-02 789012 "ghi "jklm" n,op" V4
and you can set your column names with df.columns = ['header1', 'header2', 'header3', 'header4']
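From there, a possible clean-up step (a sketch; keeping or dropping the quotes is up to you) is to strip the outer double quotes that the regex leaves around header3:
# The third capture group keeps the surrounding double quotes; remove them.
df['header3'] = df['header3'].str.strip('"')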

python pass string to pandas dataframe in a specific format

I am not entirely sure if this is possible but I thought I would go ahead and ask. I currently have a string that looks like the following:
myString =
"{"Close":175.30,"DownTicks":122973,"DownVolume":18639140,"High":177.47,"Low":173.66,"Open":177.32,"Status":29,"TimeStamp":"\/Date(1521489600000)\/","TotalTicks":245246,"TotalVolume":33446771,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":122273,"UpVolume":14807630,"OpenInterest":0}
{"Close":175.24,"DownTicks":69071,"DownVolume":10806836,"High":176.80,"Low":174.94,"Open":175.24,"Status":536870941,"TimeStamp":"\/Date(1521576000000)\/","TotalTicks":135239,"TotalVolume":19649350,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":66168,"UpVolume":8842514,"OpenInterest":0}"
The datasets can be varying lengths (this example has 2 datasets but it could have more), however the parameters will always be the same, (close, downticks, downvolume, etc).
Is there a way to create a dataframe from this string that takes the parameters as the index, and the numbers as the values in the column? So the dataframe would look something like this:
df =
0 1
index
Close 175.30 175.24
DownTicks 122973 69071
DownVolume 18639140 10806836
High 177.47 176.80
Low 173.66 174.94
Open 177.32 175.24
(etc)...
It looks like there are some issues with your input. As mentioned by @lmiguelvargasf, there's a missing comma at the end of the first dictionary. Additionally, there's a \n, which you can fix with a simple str.replace.
Once those issues have been solved, the process is pretty simple.
myString = '''{"Close":175.30,"DownTicks":122973,"DownVolume":18639140,"High":177.47,"Low":173.66,"Open":177.32,"Status":29,"TimeStamp":"\/Date(1521489600000)\/","TotalTicks":245246,"TotalVolume":33446771,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":122273,"UpVolume":14807630,"OpenInterest":0}
{"Close":175.24,"DownTicks":69071,"DownVolume":10806836,"High":176.80,"Low":174.94,"Open":175.24,"Status":536870941,"TimeStamp":"\/Date(1521576000000)\/","TotalTicks":135239,"TotalVolume":19649350,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":66168,"UpVolume":8842514,"OpenInterest":0}'''
myString = myString.replace('\n', ',')
import ast
list_of_dicts = list(ast.literal_eval(myString))
df = pd.DataFrame.from_dict(list_of_dicts).T
df
0 1
Close 175.3 175.24
DownTicks 122973 69071
DownVolume 18639140 10806836
High 177.47 176.8
Low 173.66 174.94
Open 177.32 175.24
OpenInterest 0 0
Status 29 536870941
TimeStamp \/Date(1521489600000)\/ \/Date(1521576000000)\/
TotalTicks 245246 135239
TotalVolume 33446771 19649350
UnchangedTicks 0 0
UnchangedVolume 0 0
UpTicks 122273 66168
UpVolume 14807630 8842514
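An alternative sketch: each line of the original (newline-separated) myString is valid JSON, so json.loads per line avoids the literal_eval step; note that the \/ escapes then simply decode to /:
import json
import pandas as pd

# Use the string as it was before the replace('\n', ',') step.
records = [json.loads(line) for line in myString.splitlines() if line.strip()]
df = pd.DataFrame(records).T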
