pandas read_csv: use the delimiter a fixed number of times - python

Suppose I have a log file structured as follows, one entry per line:
$date $machine $task_name $loggedstuff
I hope to read the whole thing with pd.read_csv('blah.log', sep=r'\s+'). The problem is that $loggedstuff has spaces in it. Is there any way to limit the delimiter to operate exactly 3 times, so that everything in $loggedstuff appears in the dataframe as a single column?
I've already tried using the csv module to parse it into a list of lists and then feeding that into pandas, but that is slow. I'm wondering if there's a more direct way to do this. Thanks!

Setup
tmp.txt
a b c d
1 2 3 test1 test2 test3
1 2 3 test1 test2 test3 test4
Code
import pandas as pd

# note: newer pandas versions reject sep='\n'; there, use any character
# that never occurs in the file (e.g. sep='\x01') to read whole lines
df = pd.read_csv('tmp.txt', sep='\n', header=None)
cols = df.loc[0].str.split(' ')[0]
df = df.drop(0)

def splitter(s):
    # Map the first len(cols) - 1 tokens to their columns and join
    # whatever is left over into the final column.
    vals = s.iloc[0].split(' ')
    d = dict(zip(cols[:-1], vals))
    d[cols[-1]] = ' '.join(vals[len(cols) - 1:])
    return pd.Series(d)

df.apply(splitter, axis=1)
returns
a b c d
1 1 2 3 test1 test2 test3
2 1 2 3 test1 test2 test3 test4

When using expand=True, the split elements expand out into separate columns.
The n parameter can be used to limit the number of splits in the output.
Details can be found in the documentation for pandas.Series.str.split.
Signature:
Series.str.split(pat=None, n=-1, expand=False)
expand : bool, default False
Expand the split strings into separate columns.
If True, return DataFrame/MultiIndex expanding dimensionality.
If False, return Series/Index, containing lists of strings.
For example, on a Series s of raw lines:
s.str.split(' ', n=3, expand=True)
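A fuller sketch applied to the log format from the question (assuming the file is blah.log and that the control character \x01 never occurs in it, so each line is read whole):
import pandas as pd

# Read each line as one string, then split on whitespace at most 3 times
# so everything in $loggedstuff stays together in the last column.
lines = pd.read_csv('blah.log', sep='\x01', header=None, names=['raw'])['raw']
df = lines.str.split(r'\s+', n=3, expand=True)
df.columns = ['date', 'machine', 'task_name', 'loggedstuff']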

I think you can read each line of the csv file as a single string, then convert the resulting dataframe to 3 columns with a regular expression.
df = pd.read_csv('./test.csv', sep='#', squeeze=True)
df = df.str.extract(r'([^\s]+)\s+([^\s]+)\s+(.+)')
in which you can change the separator to any character that does not appear in the document.
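Note that squeeze=True was removed in pandas 2.0; an equivalent sketch for newer versions uses DataFrame.squeeze instead:
import pandas as pd

# .squeeze('columns') collapses the single-column frame to a Series,
# which is what squeeze=True used to do at read time.
s = pd.read_csv('./test.csv', sep='#').squeeze('columns')
df = s.str.extract(r'(\S+)\s+(\S+)\s+(.+)')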

Related

How to identify records in a DataFrame (Python/Pandas) that contains leading or trailing spaces

I would like to know how to write a formula that would identify/display records of string/object data type in a Pandas DataFrame that contain leading or trailing spaces.
The purpose of this is to get an audit, in a Jupyter notebook, of such records before applying any strip functions.
The goal is for the script to identify these records automatically, without having to type the column names manually. The scope should be any column of str/object data type that contains a value with a leading space, a trailing space, or both.
Please note: I would like to see the resulting output in dataframe format.
Thank you!
Link to sample dataframe data
You can use:
df['col'].str.startswith(' ')
df['col'].str.endswith(' ')
or with a regex:
df['col'].str.match(r'\s+')
df['col'].str.contains(r'\s+$')
Example:
df = pd.DataFrame({'col': [' abc', 'def', 'ghi ', ' jkl ']})
df['start'] = df['col'].str.startswith(' ')
df['end'] = df['col'].str.endswith(' ')
df['either'] = df['start'] | df['end']
col start end either
0 abc True False True
1 def False False False
2 ghi False True True
3 jkl True True True
However, this is likely not faster than directly stripping the spaces:
df['col'] = df['col'].str.strip()
col
0 abc
1 def
2 ghi
3 jkl
Updated answer
To detect the columns with leading/trailing spaces, you can use:
cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()
cols[cols].index
Example with the data from the provided link:
Index(['First Name', 'Team'], dtype='object')
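To get the offending rows themselves in dataframe format, as the question asks, one possible sketch building on the same pattern (df is whatever frame you are auditing):
# Flag values with a leading or trailing space, column by column,
# then keep the rows where any column is flagged.
obj_cols = df.select_dtypes(include='object').columns
mask = df[obj_cols].apply(lambda c: c.astype(str).str.contains(r'^\s+|\s+$', na=False))
audit = df[mask.any(axis=1)]
print(audit)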

Extract values within the quotes signs into two separate columns with python

How can I extract the values within the quote signs into two separate columns with Python? The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'", "'FRH03';'29340'",
"'FRH05';'29350'", "'FRG02';'29360'"], columns = ['postcode'])
df
postcode
0 'FRH02';'29290'
1 'FRH01';'29300'
2 'FRT02';'29310'
3 'FRH03';'29340'
4 'FRH05';'29350'
5 'FRG02';'29360'
I would like to get an output like the one below:
postcode1 postcode2
FRH02 29290
FRH01 29300
FRT02 29310
FRH03 29340
FRH05 29350
FRG02 29360
I have tried several str.extract patterns but haven't been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd

df = pd.DataFrame(["'FRH02';'29290'",
                   "'FRH01';'29300'",
                   "'FRT02';'29310'",
                   "'FRH03';'29340'",
                   "'FRH05';'29350'",
                   "'FRG02';'29360'"],
                  columns=['postcode'])
# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')
# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)
# Delete the original column
del df['postcode']
print(df)
Output:
postcode1 postcode2
0 FRH02 29290
1 FRH01 29300
2 FRT02 29310
3 FRH03 29340
4 FRH05 29350
5 FRG02 29360
You can use Series.str.split:
p1 = []
p2 = []
# Strip the surrounding quotes while collecting each half.
for row in df['postcode'].str.split(';'):
    p1.append(row[0].strip("'"))
    p2.append(row[1].strip("'"))
df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2

Pandas read_csv fails to separate tab-delimited data

I have some input files that look something like this:
GENE CHR START STOP NSNPS NPARAM N ZSTAT P
2541473 1 1109286 1133315 2 1 15000 3.8023 7.1694e-05
512150 1 1152288 1167447 1 1 15000 3.2101 0.00066347
3588581 1 1177826 1182102 1 1 15000 3.2727 0.00053256
I am importing the file like this:
df = pd.read_csv('myfile.out', sep='\t')
But all the data gets read into a single column. I have tried changing the encoding with encoding='utf-8', encoding='utf-16-le', and encoding='utf-16-be', but this does not work. Separating with sep=' ' splits the data into too many columns, but at least it does separate. Is there a way to correctly read in this data?
Try using \s+ (which reads as "one or more whitespace characters") as your delimiter:
df = pd.read_csv('myfile.out', sep=r'\s+')
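If the file is actually aligned with runs of spaces rather than real tabs (which would explain why sep='\t' reads everything into one column), the fixed-width reader is another option; a minimal sketch:
import pandas as pd

# read_fwf infers the column boundaries from the whitespace alignment.
df = pd.read_fwf('myfile.out')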

Pandas module export, split data

I'm trying to read a .txt file and output the count of each letter, which works; however, I'm having trouble exporting that data to .csv in a specific way.
A snippet of the code:
freqs = {}
with open(Book1) as f:
    for line in f:
        for char in line:
            if char in freqs:
                freqs[char] += 1
            else:
                freqs[char] = 1
print(freqs)
And for the exporting to csv, I did the following:
test = {'Book 1 Output':[freqs]}
df = pd.DataFrame(test, columns=['Book 1 Output'])
df.to_csv(r'book_export.csv', sep=',')
Currently when I run it, the export looks like this (done manually):
However, I want the output to have one entry per row, so it should look something like this when I open it:
I want it to split on the ":" and "," into 3 different columns.
I've tried various other answers on here, but most of them end up giving ValueErrors, so maybe I just don't know how to apply them, like the following one:
df[[',']] = df[','].str.split(expand=True)
Use DataFrame.from_dict with DataFrame.rename_axis to set the index name; the csv then looks the way you need:
#sample data
freqs = {'a':5,'b':2}
df = (pd.DataFrame.from_dict(freqs, orient='index',columns=['Book 1 Output'])
.rename_axis('Letter'))
print (df)
Book 1 Output
Letter
a 5
b 2
df.to_csv(r'book_export.csv', sep=',')
Or, alternatively, use a Series:
s = pd.Series(freqs, name='Book 1 Output').rename_axis('Letter')
print (s)
Letter
a 5
b 2
Name: Book 1 Output, dtype: int64
s.to_csv(r'book_export.csv', sep=',')
EDIT:
If there are multiple frequency dictionaries, change the DataFrame constructor:
freqs = {'a':5,'b':2}
freqs1 = {'a':9,'b':3}
df = pd.DataFrame({'f1':freqs, 'f2':freqs1}).rename_axis('Letter')
print (df)
f1 f2
Letter
a 5 9
b 2 3
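As a side note, the counting step from the question can also be done with collections.Counter; a minimal sketch, assuming Book1 holds the path to the text file as in the question:
from collections import Counter
import pandas as pd

# Counter tallies every character in a single pass.
with open(Book1) as f:
    freqs = Counter(f.read())

s = pd.Series(freqs, name='Book 1 Output').rename_axis('Letter')
s.to_csv('book_export.csv')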

Change one column with one of multiple strings from another column if condition is met

I want to populate one column with one of several strings, taken from another column (if that string is contained in the column).
Right now I can do it by repeating the line of code for every different string; I'm looking for a more efficient way of doing it. I have about a dozen strings in total.
df.loc[df['column1'].str.contains('g/mL'),'units'] = 'g/mL'
df.loc[df['column1'].str.contains('mPa.s'),'units'] = 'mPa.s'
df.loc[df['column1'].str.contains('mN/m'),'units'] = 'mN/m'
I don't know how to make it check
df.loc[df['column1'].str.contains('g/mL|mPa.s|mN/m'),'units'] = ...
And then make it equal to the one that is contained.
Use str.extract:
# example dataframe
df = pd.DataFrame({'column1':['this is test g/mL', 'this is test2 mPa.s', 'this is test3 mN/m']})
column1
0 this is test g/mL
1 this is test2 mPa.s
2 this is test3 mN/m
df['units'] = df['column1'].str.extract('(g/mL|mPa.s|mN/m)')
column1 units
0 this is test g/mL g/mL
1 this is test2 mPa.s mPa.s
2 this is test3 mN/m mN/m
Or use a loop with str.contains:
L = ['g/mL', 'mPa.s', 'mN/m']
for val in L:
    df.loc[df['column1'].str.contains(val), 'units'] = val
Or use Series.str.extract with a list of all possible values:
L = ['g/mL', 'mPa.s', 'mN/m']
df['units'] = df['column1'].str.extract('(' + '|'.join(L) + ')')
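One caveat that applies to both the str.contains and str.extract patterns above: the '.' in 'mPa.s' is a regex wildcard, so a value like 'mPaXs' would also match. A sketch that escapes the values first:
import re

L = ['g/mL', 'mPa.s', 'mN/m']
# re.escape turns each value into a literal pattern, so '.' matches only '.'.
df['units'] = df['column1'].str.extract('(' + '|'.join(map(re.escape, L)) + ')')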
Actually, according to the docs, you can do exactly that using the regex=True parameter!
df.loc[df['column1'].str.contains('g/mL|mPa.s|mN/m', regex=True),'units'] = ...
