pandas add columns when read from a csv file - python

I want to read from a CSV file using pandas read_csv. The CSV file doesn't have column names. When I use pandas to read the CSV file, the first row is set as columns by default. But when I use df.columns = ['ID', 'CODE'], the first row is gone. I want to add, not replace.
df = pd.read_csv(CSV)
df
a 55000G707270
0 b 5l0000D35270
1 c 5l0000D63630
2 d 5l0000G45630
3 e 5l000G191200
4 f 55000G703240
df.columns=['ID','CODE']
df
ID CODE
0 b 5l0000D35270
1 c 5l0000D63630
2 d 5l0000G45630
3 e 5l000G191200
4 f 55000G703240

I think you need the names parameter in read_csv:
df = pd.read_csv(CSV, names=['ID','CODE'])
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed unless mangle_dupe_cols=True, which is the default.

You may pass the column names at the time of reading the CSV file itself:
df = pd.read_csv(csv_path, names = ["ID", "CODE"])

Use the names argument in the function call to add the column names yourself:
df = pd.read_csv(CSV, names=['ID','CODE'])

You need both header=None and names=['ID','CODE'] because there are no column names/labels/headers in your CSV file (strictly, passing names already implies header=None, but being explicit does no harm):
df = pd.read_csv(CSV, header=None, names=['ID','CODE'])
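Putting the answers together, a minimal sketch (io.StringIO stands in for the CSV file from the question):
import io
import pandas as pd

# stand-in for the headerless CSV from the question
data = io.StringIO("a,55000G707270\nb,5l0000D35270\nc,5l0000D63630\n")
df = pd.read_csv(data, header=None, names=['ID', 'CODE'])
print(df)
#   ID          CODE
# 0  a  55000G707270
# 1  b  5l0000D35270
# 2  c  5l0000D63630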

The reason an extra index column is added is that to_csv() writes the index by default, so you can either disable the index when saving your CSV:
df.to_csv('file.csv', index=False)
or you can specify an index column when reading:
df = pd.read_csv('file.csv', index_col=0)
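A minimal round-trip sketch of both options (the file name is illustrative):
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b'], 'CODE': ['55000G707270', '5l0000D35270']})
# Option 1: don't write the index at all
df.to_csv('file.csv', index=False)
print(pd.read_csv('file.csv'))               # no extra 'Unnamed: 0' column
# Option 2: write the index, then treat the first column as the index on read
df.to_csv('file.csv')
print(pd.read_csv('file.csv', index_col=0))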

Related

Remove leading comma in header when using pandas to_csv

By default to_csv writes a CSV like
,a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
But I want it to write like this:
a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
How do I achieve this? I can't set index=False because I want to preserve the index. I just want to remove the leading comma.
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'])
df.to_csv("test.csv") # this results in the first example above.
It is possible by writing only the columns (without the index) first and then appending the data without the header:
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'], index=list('XYZ'))
pd.DataFrame(columns=df.columns).to_csv("test.csv", index=False)
#alternative for empty df
#df.iloc[:0].to_csv("test.csv", index=False)
df.to_csv("test.csv", header=None, mode='a')
df = pd.read_csv("test.csv")
print (df)
a b c
X 0.0 0.0 0.0
Y 0.0 0.0 0.0
Z 0.0 0.0 0.0
Alternatively, try resetting the index so it becomes a column in the DataFrame, named index. This works with multiple indexes as well.
df = df.reset_index()
df.to_csv('output.csv', index = False)
Simply set a name for your index: df.index.name = 'blah'. This name will appear as the first name in the headers.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'])
df.index.name = 'my_index'
print(df.to_csv())
yields
my_index,a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
However, if (as per your comment) you wish to have 3 comma-separated names in the header while there are 4 comma-separated values in the rows of the CSV, you'll have to handcraft it. It will NOT be compliant with any standard CSV format though.
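For completeness, a minimal sketch of such a handcrafted file: write the 3-name header line by hand, then append the data rows with the index but without a header:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 3)), columns=['a', 'b', 'c'])
with open('test.csv', 'w') as f:
    f.write('a,b,c\n')           # 3 comma-separated names...
    df.to_csv(f, header=False)   # ...4 comma-separated values per row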

How to read a column header with a line break into Pandas?

I have a csv column heading:
"Submission S
tatus"
csv headers:
Unit,Publication ID,Title,"Submission S
tatus",Notes,Name,User ID
How can I refer to this when reading into the dataframe with the usecols parameter (or alternatively when renaming at a later stage)?
I have tried:
df = pd.read_csv('myfile.csv', usecols = ['Submission S\ntatus'])
error: Usecols do not match columns, columns expected but not found
df = pd.read_csv('myfile.csv', usecols = ['Submission S\rtatus'])
error: Usecols do not match columns, columns expected but not found
df = pd.read_csv('myfile.csv', usecols = ['Submission S
tatus']
error: SyntaxError: EOL while scanning string literal
How should I be referring to this column?
This is not the answer you wanted, but I hope it helps as a workaround: read the column by position and rename it afterwards.
df = pd.read_csv('myfile.csv', usecols=[n])  # n is your column position
df.rename(columns={df.columns[0]: "new column name"}, inplace=True)
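To find out what the header really contains, a sketch (assuming the file from the question): read the header alone, inspect the exact names pandas sees, then pass the exact object back to usecols instead of guessing at escape sequences.
import pandas as pd

# read just the header row
cols = pd.read_csv('myfile.csv', nrows=0).columns
print([repr(c) for c in cols])  # reveals whether the break is '\n', '\r' or '\r\n'
# 'Submission S...tatus' is the 4th header in the question's file
df = pd.read_csv('myfile.csv', usecols=[cols[3]])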
You can read a CSV file in the traditional way:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name
You can save the column names with
colnames = df.columns
and later rename the problematic column to something meaningful.

Read dataframe in pandas skipping first column to read time series data

The question is quite self-explanatory. Is there any way to read the CSV file, skipping the first column, to get at the time series data?
I tried this code:
df = pd.read_csv("occupancyrates.csv", delimiter = ',')
df = df[:,1:]
print(df)
But this is throwing an error:
"TypeError: unhashable type: 'slice'"
If you know the name of the column, just do:
df = pd.read_csv("occupancyrates.csv") # no need to use the delimiter = ','
df = df.drop(['your_column_to_drop'], axis=1)
print(df)
df = pd.read_csv("occupancyrates.csv")
df.pop('column_name')
A DataFrame is like a dictionary, where column names are the keys and the columns are the values. For example:
d = dict(a=1, b=2)
d.pop('a')
Now if you print d, the output will be
{'b': 2}
This is what df.pop('column_name') does above to remove a column from the DataFrame.
This way you don't need to assign the result back, as in
df = df.iloc[:, 1:]
nor specify inplace=True anywhere.
The simplest way to delete the first column should be:
del df[df.columns[0]]
or
df.pop(df.columns[0])
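If you would rather not load the first column at all, a sketch (reading the header first to learn the column names; occupancyrates.csv is the file from the question):
import pandas as pd

# read only the header row to learn the column names...
cols = pd.read_csv("occupancyrates.csv", nrows=0).columns
# ...then load everything except the first column
df = pd.read_csv("occupancyrates.csv", usecols=cols[1:])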

Python 2.7 - merge two CSV files without headers and with two delimiters in the first file

I have one CSV, test1.csv (it has no headers!). As you can see, the delimiter is a pipe, but there is also exactly one tab after the eighth column:
ug|s|b|city|bg|1|94|ON-05-0216 9.72|28|288
ug|s|b|city|bg|1|94|ON-05-0217 9.72|28|288
I have a second file, test2.csv, with only the pipe delimiter:
ON-05-0216|100|50
ON-05-0180|244|152
ON-05-0219|269|146
Because only one value (ON-05-0216) from the eighth column of the first file matches the first column of the second file, the output file should contain only that one row, extended with a SUM computed from the second and third columns of the second file (100+50).
So the final result is the following:
ug|s|b|city|bg|1|94|ON-05-0216 Total=150|9.72|28|288
or
ug|s|b|city|bg|1|94|ON-05-0216|Total=150 9.72|28|288
whatever is easier.
I thought pandas would be the best tool for this, but I am stuck on handling the two delimiters in the first file and on matching columns that have no names, so I am not sure how to continue.
import pandas as pd
a = pd.read_csv("test1.csv", header=None)
b = pd.read_csv("test2.csv", header=None)
merged = a.merge(b,)
merged.to_csv("output.csv", index=False)
Thank you in advance
Use:
import numpy as np
import pandas as pd

# Reading files
df1 = pd.read_csv('file1.csv', header=None, sep='|')
df2 = pd.read_csv('file2.csv', header=None, sep='|')
# splitting column 7 on the tab and concatenating with the rest
ndf = pd.concat([df1.iloc[:,:7], df1[7].str.split('\t', expand=True), df1.iloc[:,8:]], axis=1)
ndf.columns = np.arange(11)
# adding values from df2 and bringing them into the format Total=sum
df2.columns = ['c1', 'c2', 'c3']
tot = df2.eval('c2+c3').apply(lambda x: 'Total=' + str(x))
# finding which rows need to be retained (match on the last segment of the key)
idx_1 = ndf.iloc[:,7].str.split('-', expand=True).iloc[:,2]
idx_2 = df2.c1.str.split('-', expand=True).iloc[:,2]
ndf = ndf[idx_1.isin(idx_2)].reset_index(drop=True)
tot = tot[idx_2.isin(idx_1)].reset_index(drop=True)   # keep totals for matching keys only
# joining key and total with a tab and writing the output csv
ndf.iloc[:,7] = ndf.iloc[:,7].map(str) + chr(9) + tot
ndf.to_csv('out.csv', sep='|', header=False, index=False)
# OUTPUT
# ug|s|b|city|bg|1|94|ON-05-0216 Total=150|9.72|28|288
You can use pipe as the delimiter when reading the CSV (pd.read_csv(..., sep='|')) and only split the tab-separated column later, as in the example above.
When merging two DataFrames, you must have a common column to merge on. You could use it as the index for easier appending after you do the necessary math on the separate DataFrames.
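Here is a sketch of that merge-based approach; the column names (key, v1, v2, after_tab) are made up for illustration, and the file names come from the question:
import pandas as pd

# pipe-delimited reads; neither file has a header
df1 = pd.read_csv('test1.csv', header=None, sep='|')
df2 = pd.read_csv('test2.csv', header=None, sep='|', names=['key', 'v1', 'v2'])
# split the tab-separated eighth column into the key and the trailing number
split = df1[7].str.split('\t', expand=True)
df1['key'], df1['after_tab'] = split[0], split[1]
# compute the total per key
df2['total'] = 'Total=' + (df2['v1'] + df2['v2']).astype(str)
# an inner merge keeps only the matching key(s)
merged = df1.merge(df2[['key', 'total']], on='key')
# rebuild the row: key + tab + total, then the value that followed the tab
out = pd.concat([merged[[0, 1, 2, 3, 4, 5, 6]],
                 (merged['key'] + '\t' + merged['total']).rename('col7'),
                 merged['after_tab'],
                 merged[[8, 9]]], axis=1)
out.to_csv('out.csv', sep='|', header=False, index=False)
# ug|s|b|city|bg|1|94|ON-05-0216 Total=150|9.72|28|288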

Skipping range of rows after header through pandas.read_excel

I know the argument usecols in pandas.read_excel() allows you to select specific columns.
Say I read an Excel file with pandas.read_excel(). My Excel spreadsheet has 1161 rows. I want to keep the 1st row (with index 0) and skip rows 2:337. It seems the skiprows argument works only with 0-based indexing. I tried several different ways, but my code always reads all 1161 rows rather than only the header plus everything from the 338th row on. For example:
documentationscore_dataframe = pd.read_excel("Documentation Score Card_17DEC2015 Rev 2 17JAN2017.xlsx",
sheet_name = "Sheet1",
skiprows = "336",
usecols = "H:BD")
Here is another attempt:
documentationscore_dataframe = pd.read_excel("Documentation Score Card_17DEC2015 Rev 2 17JAN2017.xlsx",
sheet_name = "Sheet1",
skiprows = "1:336",
usecols = "H:BD")
I would like the dataframe to exclude rows 2 through 337 in the original Excel import.
As per the documentation for pandas.read_excel, skiprows accepts an int, a list-like, or a callable, not a string; to skip a range of rows after the header, pass a list-like.
Try this instead to exclude rows 1 to 336 inclusive (0-based, i.e. spreadsheet rows 2 through 337):
df = pd.read_excel("file.xlsx",
sheet_name = "Sheet1",
skiprows = range(1, 337),
usecols = "H:BD")
Note: the range constructor is considered list-like for this purpose, so no explicit list conversion is necessary.
You can also pass a function to skiprows=. For example, to skip the first 336 rows (after the header row):
df = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', skiprows=lambda x: 1 <= x <= 336)
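The callable sees 0-based row indices, so the two approaches are equivalent; a quick sketch to check (file and sheet names as in the answer above):
import pandas as pd

# both calls keep the header row (index 0) and skip spreadsheet rows 2..337
df_a = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', skiprows=range(1, 337))
df_b = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', skiprows=lambda x: 1 <= x <= 336)
assert df_a.equals(df_b)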
