Remove leading comma in header when using pandas to_csv - python

By default to_csv writes a CSV like
,a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
But I want it to write like this:
a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
How do I achieve this? I can't set index=False because I want to preserve the index. I just want to remove the leading comma.
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'])
df.to_csv("test.csv") # this results in the first example above.

It is possible to do this by writing only the columns (without the index) first, and then appending the data without a header:
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'], index=list('XYZ'))
pd.DataFrame(columns=df.columns).to_csv("test.csv", index=False)
#alternative for empty df
#df.iloc[:0].to_csv("test.csv", index=False)
df.to_csv("test.csv", header=None, mode='a')
df = pd.read_csv("test.csv")
print (df)
a b c
X 0.0 0.0 0.0
Y 0.0 0.0 0.0
Z 0.0 0.0 0.0

Alternatively, try resetting the index so it becomes a column in the DataFrame, named index. This works with multiple index levels (a MultiIndex) as well.
df = df.reset_index()
df.to_csv('output.csv', index = False)
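For the DataFrame from the original question (default integer index), a minimal sketch of what this produces; note the former index becomes a regular column named index, so the header has four names rather than three:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((3,3)), columns=['a','b','c'])
df = df.reset_index()                 # the old index becomes a column named 'index'
df.to_csv('output.csv', index=False)  # no leading comma in the header anymore
# output.csv:
# index,a,b,c
# 0,0.0,0.0,0.0
# 1,0.0,0.0,0.0
# 2,0.0,0.0,0.0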

Simply set a name for your index: df.index.name = 'blah'. This name will appear as the first entry in the header row.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'])
df.index.name = 'my_index'
print(df.to_csv())
yields
my_index,a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
However, if (as per your comment) you wish to have 3 comma-separated names in the header while there are 4 comma-separated values in the rows of the CSV, you'll have to handcraft it; see the sketch below. It will NOT be compliant with any standard CSV format, though.
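If you do want that exact layout anyway, here is a minimal hand-crafted sketch (equivalent in spirit to the append approach in the first answer):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((3,3)), columns=['a','b','c'])
# write the three column names by hand, then append the rows (which include the index)
with open('test.csv', 'w') as f:
    f.write(','.join(df.columns) + '\n')
df.to_csv('test.csv', header=False, mode='a')
# test.csv:
# a,b,c
# 0,0.0,0.0,0.0
# 1,0.0,0.0,0.0
# 2,0.0,0.0,0.0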

Related

Replacing all instances of standalone "." in a pandas dataframe

Beginner question incoming.
I have a dataframe derived from an excel file with a column that I will call "input".
In this column are floats (e.g. 7.4, 8.1, 2.2,...). However, there are also some wrong values such as strings (which are easy to filter out) and, what I find difficult, single instances of "." or "..".
I would like to clean the column to generate only numeric float values.
I have used this approach for other columns, but cannot do so here because if I get rid of the "." instances, my floats will be messed up:
for col in [col for col in new_df.columns if col.startswith("input")]:
    new_df[col] = new_df[col].str.replace(r',| |\-|\^|\+|#|j|0|.', '', regex=True)
    new_df[col] = pd.to_numeric(new_df[col], errors='raise')
I have also tried the following, but it then replaces every value in the column with None:
for index, row in new_df.iterrows():
    col_input = row['input']
    if re.match(r'^-?\d+(?:.\d+)$', str(col_input)) is None:
        new_df["input"] = None
How do I get rid of the dots?
Thanks!
You can simply use pandas.to_numeric and pass errors='coerce', without the loop:
from io import StringIO
import pandas as pd
s = """input
7.4
8.1
2.2
foo
foo.bar
baz/foo"""
df = pd.read_csv(StringIO(s))
df['input'] = pd.to_numeric(df['input'], errors='coerce')
# Output:
print(df)
input
0 7.4
1 8.1
2 2.2
3 NaN
4 NaN
5 NaN
df.dropna(inplace=True)
print(df)
input
0 7.4
1 8.1
2 2.2
If you need to clean up multiple mixed columns, use:
cols = ['input', ...] # put here the name of the columns concerned
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df.dropna(subset=cols, inplace=True)

Pandas: Read csv with dtypes but mixed type columns(NA values)

I'm trying to downcast the columns of a CSV in the process of reading it, because doing it after reading the file is too time consuming. So far so good. The problem occurs, of course, if a column has NA values. Is there any possibility to ignore that, or to filter those values while reading, maybe with the converters argument of pandas read_csv? And what does the argument 'verbose' do? The documentation says something about "Indicate number of NA values placed in non-numeric columns".
My approach for downcasting so far is to read the first two rows and guess the dtype. I create a mapping dict for the dtype argument when reading the whole CSV. Of course, NaN values can occur in later rows, and that is where mixed dtypes come in:
import pandas as pd
df = pd.read_csv(filePath, delimiter=delimiter, nrows=2, low_memory=True, memory_map=True, engine='c')
if downcast == True:
    mapdtypes = {'int64': 'int8', 'float64': 'float32'}
    dtypes = list(df.dtypes.apply(str).replace(mapdtypes))
    dtype = {key: value for (key, value) in enumerate(dtypes)}
    df = pd.read_csv(filePath, delimiter=delimiter, memory_map=True, engine='c', low_memory=True, dtype=dtype)
Not sure if I properly understood your question, but you are probably looking for the na_values argument, where you can specify one or more strings to be recognized as NaN values.
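A minimal sketch with made-up data and marker strings, just to illustrate the argument:
from io import StringIO
import pandas as pd
# hypothetical data where '?' and 'n/a' mark missing values
s = "a,b\n1,2.5\n?,3.0\n4,n/a\n"
df = pd.read_csv(StringIO(s), na_values=['?', 'n/a'])
print(df)
#      a    b
# 0  1.0  2.5
# 1  NaN  3.0
# 2  4.0  NaN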
EDIT: Get the dtype from individual columns and save them to a dictionary for the down-casting. Again, you can limit the number of rows to be read into df, if you need to.
import csv
import pandas as pd

# get only the column headers from the csv:
with open(filePath, 'r') as infile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames

# iterate through each column to get its dtype:
dtypes = {}
for f in fieldnames:
    df = pd.read_csv(filePath, usecols=[f], nrows=1000)
    dtypes.update({f: str(df.iloc[:, 0].dtypes)})
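A possible follow-up (a sketch only, reusing filePath and the dtypes dict from the snippet above together with the mapdtypes idea from the question): translate the detected dtypes and pass them to the full read. Integer columns that can contain NaN still need a float or nullable integer type, as the next answer explains.
# sketch: map the detected dtypes to smaller ones and re-read the whole file
mapdtypes = {'int64': 'int8', 'float64': 'float32'}
dtype = {col: mapdtypes.get(dt, dt) for col, dt in dtypes.items()}
df = pd.read_csv(filePath, dtype=dtype, memory_map=True, engine='c')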
The original question relates to this one, so I'm answering with similar info. The pandas v1.0+ nullable "Integer Array" data types enable what you ask. Use the capitalized versions of the type names, such as 'Int16'. Missing values are recognized by pandas' .isnull(). Here is an example; note the capital 'I' in the pandas-specific Int16 data type (see the pandas documentation).
import pandas as pd
import numpy as np
dftemp = pd.DataFrame({'int_col': [4, np.nan, 3, 1],
                       'float_col': [0.0, 1.0, np.nan, 4.5]})
#Write to CSV (to be read back in to fully simulate CSV behavior with missing values etc.)
dftemp.to_csv('MixedTypes.csv', index=False)
lst_cols = ['int_col','float_col']
lst_dtypes = ['Int16','float']
dict_types = dict(zip(lst_cols,lst_dtypes))
#Unoptimized DataFrame
df = pd.read_csv('MixedTypes.csv')
df
Result:
int_col float_col
0 4.0 0.0
1 NaN 1.0
2 3.0 NaN
3 1.0 4.5
Repeat with explicit assignment of the column types, including Int16 for int_col:
df2 = pd.read_csv('MixedTypes.csv', dtype=dict_types)
print(df2)
int_col float_col
0 4 0.0
1 <NA> 1.0
2 3 NaN
3 1 4.5
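As a quick check that the <NA> values really are treated as missing (continuing from df2 above):
print(df2['int_col'].isnull())
# 0    False
# 1     True
# 2    False
# 3    False
# Name: int_col, dtype: bool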

Remove Column with Duplicate Values in Pandas

I have a database; a sample is shown in a screenshot (not reproduced here).
The DataFrame is generated when I load the data in Python with the code below:
import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv',index_col=None, header=0)
Output: (screenshot of the resulting DataFrame, not reproduced here)
Is there any way to avoid reading the duplicate columns in pandas, or to remove the duplicate columns after reading?
Please note: the column names are different once the data is read into pandas, so a command like df = df.loc[:, ~df.columns.duplicated()] won't work.
The actual database is very big and has many duplicate columns containing only dates.
There are 2 ways you can do this.
Ignore columns when reading the data
pandas.read_csv has the argument usecols, which accepts an integer list.
So you can try:
# work out required columns
df = pd.read_csv('file.csv', header=0)
cols = [0] + list(range(1, len(df.columns), 2))
# use column integer list
df = pd.read_csv('file.csv', usecols=cols)
Remove columns from dataframe
You can use similar logic with pd.DataFrame.iloc to remove unwanted columns.
# cols as defined in previous example
df = df.iloc[:, cols]
One way to do it could be to read only the first row and create a mask using drop_duplicates(). We pass this to usecols, without the need to specify the indices beforehand. It should be failsafe.
m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)
Full example:
from io import StringIO
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)
print(df)
# Date Value1 Value2
#0 2018-01-01 0 1
#1 2018-01-02 0 1
Another way to do it would be to remove all columns whose name contains a dot (.), since pandas renames duplicate columns to Date.1, Date.2, and so on. This should work in most cases, as the dot is rarely used in column names:
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
Full example:
from io import StringIO
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

df = pd.read_csv(StringIO(data))
df = df.loc[:, ~df.columns.str.contains('.', regex=False)]
print(df)
# Date Value1 Value2
#0 2018-01-01 0 1
#1 2018-01-02 0 1

pandas add columns when read from a csv file

I want to read from a CSV file using pandas read_csv. The CSV file doesn't have column names. When I use pandas to read it, the first row is used as the column names by default. But when I use df.columns = ['ID', 'CODE'], the first row is gone. I want to add column names, not replace the first row.
df = pd.read_csv(CSV)
df
a 55000G707270
0 b 5l0000D35270
1 c 5l0000D63630
2 d 5l0000G45630
3 e 5l000G191200
4 f 55000G703240
df.columns=['ID','CODE']
df
ID CODE
0 b 5l0000D35270
1 c 5l0000D63630
2 d 5l0000G45630
3 e 5l000G191200
4 f 55000G703240
I think you need the names parameter in read_csv:
df = pd.read_csv(CSV, names=['ID','CODE'])
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed unless mangle_dupe_cols=True, which is the default.
You may pass the column names at the time of reading the CSV file itself, as:
df = pd.read_csv(csv_path, names = ["ID", "CODE"])
Use the names argument in the function call to add the column names yourself:
df = pd.read_csv(CSV, names=['ID','CODE'])
You need both header=None and names=['ID','CODE'], because there are no column names/labels/headers in your CSV file:
df = pd.read_csv(CSV, header=None, names=['ID','CODE'])
The reason there is an extra index column added is that to_csv() writes the index by default, so you can either disable the index when saving your CSV:
df.to_csv('file.csv', index=False)
or you can specify an index column when reading:
df = pd.read_csv('file.csv', index_col=0)

Using Pandas to create DataFrame with Series, resulting in memory error

I'm using the Pandas library for remote sensing time series analysis. Eventually I would like to save my DataFrame to CSV using chunk sizes, but I run into a little issue. My code generates 6 NumPy arrays that I convert to Pandas Series. Each of these Series contains a lot of items:
>>> prcpSeries.shape
(12626172,)
I would like to add the Series into a Pandas DataFrame (df) so I can save them chunk by chunk to a csv file.
d = {'prcp': pd.Series(prcpSeries),
     'tmax': pd.Series(tmaxSeries),
     'tmin': pd.Series(tminSeries),
     'ndvi': pd.Series(ndviSeries),
     'lstm': pd.Series(lstmSeries),
     'evtm': pd.Series(evtmSeries)}
df = pd.DataFrame(d)
outFile ='F:/data/output/run1/_'+str(i)+'.out'
df.to_csv(outFile, header = False, chunksize = 1000)
d = None
df = None
But my code gets stuck at the following line, giving a MemoryError:
df = pd.DataFrame(d)
Any suggestions? Is it possible to fill the Pandas DataFrame chunk by chunk?
If you know each of these is the same length, then you could create the DataFrame directly from one array and then append each remaining column:
df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
...
Note: you can also use the to_frame method, which allows you to optionally pass a name (useful if the Series doesn't have one):
df = prcpSeries.to_frame(name='prcp')
However, if they are variable length then this will lose some data (any arrays which are longer than prcpSeries). An alternative here is to create each as a DataFrame and then perform an outer join (using concat):
df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...
df = pd.concat([df1, df2, ...], join='outer', axis=1)
For example:
In [21]: dfA = pd.DataFrame([1,2], columns=['A'])
In [22]: dfB = pd.DataFrame([1], columns=['B'])
In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
A B
0 1 1
1 2 NaN
