How to split a column in DataFrame with pandas? - python

I imported a dataset that looks like this:
Peak, Trough
0 1857-06-01, 1858-12-01
1 1860-10-01, 1861-06-01
2 1865-04-01, 1867-12-01
3 1869-06-01, 1870-12-01
4 1873-10-01, 1879-03-01
5 1882-03-01, 1885-05-01
6 1887-03-01, 1888-04-01
It is a CSV file, but when I check .shape, it is
(7, 1)
I thought a CSV file would automatically be separated at its commas, but that doesn't work for this one.
I want to split this column into two at the comma, and split the column name as well. How can I do that?

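Use the sep parameter of read_csv. A separator longer than one character is treated as a regular expression, which requires the Python parsing engine:
df = pd.read_csv(path, sep=', ', engine='python')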
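A minimal sketch of the sep approach, assuming the file content matches the sample in the question. The inline string and the variable name raw are stand-ins for the real file, and the pattern `,\s*` (a comma followed by optional whitespace) is one possible choice rather than the only one.

```python
import io

import pandas as pd

# Stand-in for the real CSV file (assumption: same layout as the question's sample).
raw = """Peak, Trough
1857-06-01, 1858-12-01
1860-10-01, 1861-06-01
1865-04-01, 1867-12-01
"""

# A multi-character separator is interpreted as a regex, so use the Python engine.
df = pd.read_csv(io.StringIO(raw), sep=r",\s*", engine="python")

print(df.shape)
print(list(df.columns))
```

With this input the frame should come out with two columns, Peak and Trough, instead of one.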

Save the data to a text or CSV file, then call read_csv with skipinitialspace=True, plus parse_dates to convert the values to datetimes:
df = pd.read_csv('data.txt', skipinitialspace=True, parse_dates=[0,1])
print (df.head())
Peak Trough
0 1857-06-01 1858-12-01
1 1860-10-01 1861-06-01
2 1865-04-01 1867-12-01
3 1869-06-01 1870-12-01
4 1873-10-01 1879-03-01
print (df.dtypes)
Peak datetime64[ns]
Trough datetime64[ns]
dtype: object
If the data are in a single column of an Excel file, you can apply Series.str.split to the first column, convert the results to datetimes, and finally set the new column names:
df = pd.read_excel('data.xlsx')
df1 = df.iloc[:, 0].str.split(', ', expand=True).apply(pd.to_datetime)
df1.columns = df.columns[0].split(', ')
print (df1.head())
Peak Trough
0 1857-06-01 1858-12-01
1 1860-10-01 1861-06-01
2 1865-04-01 1867-12-01
3 1869-06-01 1870-12-01
4 1873-10-01 1879-03-01
print (df1.dtypes)
Peak datetime64[ns]
Trough datetime64[ns]
dtype: object
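The Series.str.split route can be tried without an Excel file. The sketch below builds a one-column DataFrame by hand (an assumption standing in for the result of read_excel) and then follows the same three steps: split, rename, convert.

```python
import pandas as pd

# Hand-built stand-in for a sheet that was read into a single column.
df = pd.DataFrame(
    {"Peak, Trough": ["1857-06-01, 1858-12-01", "1860-10-01, 1861-06-01"]}
)

# Split the only column on ", " into separate columns.
parts = df.iloc[:, 0].str.split(", ", expand=True)

# Split the original header the same way to get the new column names.
parts.columns = df.columns[0].split(", ")

# Convert every column to datetimes.
df1 = parts.apply(pd.to_datetime)

print(df1)
print(df1.dtypes)
```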

Related

Sum of range of rows in a dataframe column

For the following csv file
ID,Kernel Time,device__attribute_warp_size,cycles_elapsed,time_duration
,,,cycle,msecond
0,2021-Dec-09 23:04:13,32,175013.666667,0.122208
1,2021-Dec-09 23:04:16,32,2988.833333,0.002592
2,2021-Dec-09 23:04:18,32,2911.666667,0.002624
I want to sum the values of a column, cycles_elapsed, but as you can see the first row is not a number. I wrote the following code, but the result is not what I expect.
import pandas as pd
import csv
df = pd.read_csv('test.csv', thousands=',', usecols=['ID', 'cycles_elapsed'])
print(df['cycles_elapsed'])
c_sum = df['cycles_elapsed'].loc[1:].sum()
print(c_sum)
$ python3 test.py
0 cycle
1 175013.666667
2 2988.833333
3 2911.666667
Name: cycles_elapsed, dtype: object
175013.6666672988.8333332911.666667
How can I fix that?
The problem is the second line of the file (the units row), which makes the column non-numeric. Skip it with the skiprows=[1] parameter to get a numeric column and the correct sum:
df = pd.read_csv('test.csv', skiprows=[1], usecols=['ID', 'cycles_elapsed'])
print (df)
ID cycles_elapsed
0 0 175013.666667
1 1 2988.833333
2 2 2911.666667
print (df.dtypes)
ID int64
cycles_elapsed float64
dtype: object
c_sum = df['cycles_elapsed'].sum()
print(c_sum)
180914.166667
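If the position of the non-numeric row is not known in advance, a different technique from the skiprows answer is to coerce the column with pd.to_numeric. This is a sketch under the assumption that any value that fails to parse should simply be ignored in the sum; the inline string stands in for test.csv.

```python
import io

import pandas as pd

# Stand-in for test.csv, copied from the question.
raw = """ID,Kernel Time,device__attribute_warp_size,cycles_elapsed,time_duration
,,,cycle,msecond
0,2021-Dec-09 23:04:13,32,175013.666667,0.122208
1,2021-Dec-09 23:04:16,32,2988.833333,0.002592
2,2021-Dec-09 23:04:18,32,2911.666667,0.002624
"""

df = pd.read_csv(io.StringIO(raw), usecols=["ID", "cycles_elapsed"])

# Values that cannot be parsed as numbers (here the unit "cycle") become NaN.
cycles = pd.to_numeric(df["cycles_elapsed"], errors="coerce")

# sum() skips NaN by default, so only the numeric rows are added up.
c_sum = cycles.sum()
print(c_sum)
```

The trade-off is that coercion silently hides any malformed value, so skiprows is the stricter choice when the file layout is fixed.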

Set the first column of pandas dataframe as header

This is my output DataFrame from reading an excel file
I would like my first column to be index/header
one Entity
0 two v1
1 three Prod
2 four 2015-05-27 00:00:00
3 five 2018-04-27 00:00:00
4 six Both
5 seven id
6 eight hello
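To set a row as the header of a pandas DataFrame, there are a few options.
Pass header=1 when reading the file, so that the second line (index 1) supplies the column names:
eg: df = pd.read_csv(inputfilePath, header=1)
Pass skiprows=1 when reading the file, so that the first line is skipped and the next one becomes the header:
eg: df = pd.read_csv(inputfilepath, skiprows=1)
Assign the first data row to the columns of an existing DataFrame:
eg: df.columns = df.iloc[0]
I hope this helps.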
One way is using T twice
df=df.T.set_index(0).T
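Since the question asks for the first column (not the first row) to become the index or header, here is a small sketch using set_index, which is a different technique from the answers above. The DataFrame is a shortened, hand-typed version of the one in the question, and the names indexed and as_header are illustrative.

```python
import pandas as pd

# Shortened stand-in for the DataFrame shown in the question.
df = pd.DataFrame(
    {
        "one": ["two", "three", "four"],
        "Entity": ["v1", "Prod", "Both"],
    }
)

# Use the first column as the row index.
indexed = df.set_index("one")

# Transpose so the former first column becomes the column header.
as_header = indexed.T

print(indexed)
print(as_header)
```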

Saving a Pandas dataframe in fixed format with different column widths

I have a pandas dataframe (df) that looks like this:
A B C
0 1 10 1234
1 2 20 0
I want to save this dataframe in a fixed format. The fixed format I have in mind has different column width and is as follows:
"one space for column A's value then a comma then four spaces for column B's values and a comma and then five spaces for column C's values"
Or symbolically:
-,----,-----
My dataframe above (df) would look like the following in my desired fixed format:
1, 10, 1234
2, 20, 0
How can I write a command in Python that saves my dataframe into this format?
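# Left-pad column B to 4 characters and column C to 5 characters
df['B'] = df['B'].apply(lambda t: (' '*(4-len(str(t)))+str(t)))
df['C'] = df['C'].apply(lambda t: (' '*(5-len(str(t)))+str(t)))
# header=False because the desired output has no header line
df.to_csv('path_to_file.csv', index=False, header=False)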
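The same padding can be sketched with str.rjust and a width table, which keeps the column widths in one place. The widths dictionary and the variable names are illustrative assumptions; the widths themselves (1, 4, 5) come from the question.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [10, 20], "C": [1234, 0]})

# Field width for each column, as described in the question.
widths = {"A": 1, "B": 4, "C": 5}

# Convert each column to strings and right-justify to its width.
padded = pd.DataFrame(
    {col: df[col].astype(str).str.rjust(width) for col, width in widths.items()}
)

# Join the padded fields of each row with commas.
lines = [",".join(row) for row in padded.itertuples(index=False)]
text = "\n".join(lines) + "\n"

print(text, end="")
```

The resulting text can then be written to a file with an ordinary open(...).write(text) call. Note that rjust does not truncate, so a value longer than its width would break the alignment.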

Prevent pandas read_csv treating first row as header of column names

I'm reading in a pandas DataFrame using pd.read_csv. I want to keep the first row as data, however it keeps getting converted to column names.
I tried header=False but this just deleted it entirely.
(Note on my input data: I have a string (st = '\n'.join(lst)) that I convert to a file-like object (io.StringIO(st)), then build the csv from that file object.)
You want header=None. Passing False gets type-promoted to the int 0. See the docs (emphasis mine):
header : int or list of ints, default ‘infer’ Row number(s) to use as
the column names, and the start of the data. Default behavior is as if
set to 0 if no names passed, otherwise None. Explicitly pass header=0
to be able to replace existing names. The header can be a list of
integers that specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be skipped
(e.g. 2 in this example is skipped). Note that this parameter ignores
commented lines and empty lines if skip_blank_lines=True, so header=0
denotes the first line of data rather than the first line of the file.
You can see the difference in behaviour, first with header=0:
In [95]:
import io
import pandas as pd
t="""a,b,c
0,1,2
3,4,5"""
pd.read_csv(io.StringIO(t), header=0)
Out[95]:
a b c
0 0 1 2
1 3 4 5
Now with None:
In [96]:
pd.read_csv(io.StringIO(t), header=None)
Out[96]:
0 1 2
0 a b c
1 0 1 2
2 3 4 5
Note that as of version 0.19.1, this raises a TypeError:
In [98]:
pd.read_csv(io.StringIO(t), header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no
header or header=int or list-like of ints to specify the row(s) making
up the column names
I think you need to pass header=None to read_csv:
Sample:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed from pandas
temp=u"""a,b
2,1
1,1"""
df = pd.read_csv(StringIO(temp),header=None)
print (df)
0 1
0 a b
1 2 1
2 1 1
If you're using pd.ExcelFile to read all the sheets of an Excel file:
xls = pd.ExcelFile("path_to_file.xlsx")
xls.sheet_names # Lists the sheet names in the Excel file
df = xls.parse(2, header=None) # Parse the sheet at index 2 (the third sheet) with header=None
df
Output:
0 1
0 a b
1 1 1
2 0 1
3 5 2
You can set custom column names to prevent this.
Say your dataset has two columns:
df = pd.read_csv(your_file_path, names=['first column', 'second column'])
If you have more columns, you can also generate the names programmatically and pass the list to the names parameter.
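Generating the names programmatically might look like the sketch below. The col_0, col_1, ... naming scheme is an assumption for illustration, and the column count is taken from the first line of an inline sample rather than a real file.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file.
raw = "a,b,c\n0,1,2\n3,4,5"

# Count the fields in the first line and build one name per column.
n_cols = len(raw.splitlines()[0].split(","))
names = [f"col_{i}" for i in range(n_cols)]

# With names given and header=None, the first line is kept as data.
df = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(df)
```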

Can pandas read a transposed CSV?

Can pandas read a transposed CSV? Here's the file (note I'd also like to select a subset of columns):
A,x,x,x,x,1,2,3
B,x,x,x,x,4,5,6
C,x,x,x,x,7,8,9
Would like to get this DataFrame:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
pd.read_csv('file.csv', index_col=0, header=None).T
In addition, if your file looks like this:
"some-line-you-want-to-skip"
A,x,x,x,x,1,2,3
B,x,x,x,x,4,5,6
C,x,x,x,x,7,8,9
It is possible to do the following:
df = pd.read_csv(filename, skiprows=1, header=None).T # Read csv, and transpose
df.columns = df.iloc[0] # Set new column names
df.drop(0,inplace=True) # Drop duplicated row
This will also end up with the df looking the way you want
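The question also asks for a subset of columns, which the snippets above do not cover. One possible sketch, assuming the wanted values sit at positions 5 to 7 as in the sample, passes usecols before transposing. The inline string stands in for file.csv.

```python
import io

import pandas as pd

# Stand-in for file.csv, copied from the question.
raw = """A,x,x,x,x,1,2,3
B,x,x,x,x,4,5,6
C,x,x,x,x,7,8,9
"""

df = (
    # Keep only the label column (0) and the numeric columns (5, 6, 7).
    pd.read_csv(io.StringIO(raw), header=None, usecols=[0, 5, 6, 7])
    .set_index(0)            # labels A, B, C become the index
    .T                       # transpose so the labels become the columns
    .reset_index(drop=True)  # renumber the rows 0, 1, 2
)

print(df)
```

Dropping the x columns before the transpose also keeps the remaining values numeric instead of mixing them with strings.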
