I want to read a CSV file using pandas read_csv. The CSV file doesn't have column names, so when pandas reads it, the first data row is used as the header by default. But when I set df.columns = ['ID', 'CODE'], that first row is gone. I want to add column names, not replace the first row.
df = pd.read_csv(CSV)
df
a 55000G707270
0 b 5l0000D35270
1 c 5l0000D63630
2 d 5l0000G45630
3 e 5l000G191200
4 f 55000G703240
df.columns=['ID','CODE']
df
ID CODE
0 b 5l0000D35270
1 c 5l0000D63630
2 d 5l0000G45630
3 e 5l000G191200
4 f 55000G703240
I think you need the names parameter in read_csv:
df = pd.read_csv(CSV, names=['ID','CODE'])
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed unless mangle_dupe_cols=True, which is the default.
You may pass the column names at the time of reading the CSV file itself:
df = pd.read_csv(csv_path, names = ["ID", "CODE"])
Use the names argument in the read_csv call to supply the column names yourself:
df = pd.read_csv(CSV, names=['ID','CODE'])
You need both header=None and names=['ID','CODE'], because there are no column names/labels/headers in your CSV file:
df = pd.read_csv(CSV, header=None, names=['ID','CODE'])
The reason an extra index column is added is that to_csv() writes the index by default, so you can either disable the index when saving your CSV:
df.to_csv('file.csv', index=False)
or you can specify an index column when reading:
df = pd.read_csv('file.csv', index_col=0)
By default to_csv writes a CSV like
,a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
But I want it to write like this:
a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
How do I achieve this? I can't set index=False because I want to preserve the index. I just want to remove the leading comma.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'])
df.to_csv("test.csv")  # this results in the first example above
It is possible by writing only the columns (without the index) first, and then the data (without the header) in append mode:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'], index=list('XYZ'))
pd.DataFrame(columns=df.columns).to_csv("test.csv", index=False)
# alternative for writing the empty header-only frame:
# df.iloc[:0].to_csv("test.csv", index=False)
df.to_csv("test.csv", header=None, mode='a')
df = pd.read_csv("test.csv")
print(df)
a b c
X 0.0 0.0 0.0
Y 0.0 0.0 0.0
Z 0.0 0.0 0.0
Alternatively, try resetting the index so it becomes a column in the DataFrame, named index. This works with multiple indexes as well.
df = df.reset_index()
df.to_csv('output.csv', index = False)
Simply set a name for your index: df.index.name = 'blah'. This name will appear as the first entry in the header row.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'])
df.index.name = 'my_index'
print(df.to_csv())
yields
my_index,a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
However, if (as per your comment) you wish to have 3 comma-separated names in the header while there are 4 comma-separated values in each row of the CSV, you'll have to handcraft it. It will NOT be compliant with any standard CSV format, though.
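For instance, a minimal hand-rolled sketch, assuming the df from the example above (the file name is just a placeholder):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((3, 3)), columns=['a', 'b', 'c'])
# Write 3 comma-separated names, then rows that each carry 4 comma-separated
# values (index + 3 data cells). Most CSV parsers will not handle this cleanly.
with open('test.csv', 'w') as f:
    f.write(','.join(df.columns) + '\n')
    f.write(df.to_csv(header=False))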
I have a csv column heading:
"Submission S
tatus"
csv headers:
Unit,Publication ID,Title,"Submission S
tatus",Notes,Name,User ID
How can I refer to this when reading into the dataframe with the usecols parameter (or alternatively when renaming at a later stage)?
I have tried:
df = pd.read_csv('myfile.csv', usecols = ['Submission S\ntatus'])
error: Usecols do not match columns, columns expected but not found
df = pd.read_csv('myfile.csv', usecols = ['Submission S\rtatus'])
error: Usecols do not match columns, columns expected but not found
df = pd.read_csv('myfile.csv', usecols = ['Submission S
tatus']
error: SyntaxError: EOL while scanning string literal
How should I be referring to this column?
This may not be the answer you wanted, but I hope it helps if you are open to a workaround:
df = pd.read_csv('myfile.csv', usecols=[n])  # n is your column's position (0-based)
df.rename(columns={df.columns[0]: "new column name"}, inplace=True)
You can read a CSV file in the traditional way:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name
You can get the column names with
colnames = df.columns
and later replace the name of your problematic column with a meaningful word.
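For example, a sketch of that rename (the column position and the new name are placeholders):
import pandas as pd
df = pd.read_csv('myfile.csv')
# Replace the awkward multiline header with a meaningful word; position 2
# is only an assumption about where "Submission S\ntatus" sits in the file.
df = df.rename(columns={df.columns[2]: 'submission_status'})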
The question is quite self-explanatory. Is there any way to read the time-series data from the CSV file while skipping the first column?
I tried this code:
df = pd.read_csv("occupancyrates.csv", delimiter = ',')
df = df[:,1:]
print(df)
But this is throwing an error:
"TypeError: unhashable type: 'slice'"
If you know the name of the column, just do:
df = pd.read_csv("occupancyrates.csv") # no need to use the delimiter = ','
df = df.drop(['your_column_to_drop'], axis=1)
print(df)
df = pd.read_csv("occupancyrates.csv")
df.pop('column_name')
A DataFrame is like a dictionary, where column names are keys and the columns are the values. For example:
d = dict(a=1,b=2)
d.pop('a')
Now if you print d, the output will be
{'b': 2}
This is what pop does above to remove a column from the DataFrame. This way you do not need to assign the result back to the DataFrame, as in the other answer(s):
df = df.iloc[:, 1:]
Nor do you need to specify inplace=True anywhere.
The simplest way to delete the first column should be:
del df[df.columns[0]]
or
df.pop(df.columns[0])
I have one CSV, test1.csv (it has no headers!). As you can see, it is pipe-delimited, but there is also exactly one tab after the eighth column.
ug|s|b|city|bg|1|94|ON-05-0216 9.72|28|288
ug|s|b|city|bg|1|94|ON-05-0217 9.72|28|288
I have a second file, test2.csv, delimited only with pipes:
ON-05-0216|100|50
ON-05-0180|244|152
ON-05-0219|269|146
Because only one value (ON-05-0216) from the eighth column of the first file matches the first column of the second file, the output file should contain only one row, with an added SUM computed from the second and third columns of the second file (100+50).
So the final result is the following:
ug|s|b|city|bg|1|94|ON-05-0216 Total=150|9.72|28|288
or
ug|s|b|city|bg|1|94|ON-05-0216|Total=150 9.72|28|288
whatever is easier.
I thought pandas would be the best tool for this, but I am stuck on handling the multiple delimiters in the first file and on matching columns that have no names, so I am not sure how to continue.
import pandas as pd
a = pd.read_csv("test1.csv", header=None)
b = pd.read_csv("test2.csv", header=None)
merged = a.merge(b)
merged.to_csv("output.csv", index=False)
Thank you in advance
Use:
import numpy as np
import pandas as pd

# Reading files
df1 = pd.read_csv('file1.csv', header=None, sep='|')
df2 = pd.read_csv('file2.csv', header=None, sep='|')
# splitting file on tab and concatenating with rest
ndf = pd.concat([df1.iloc[:,:7], df1[7].str.split('\t', expand=True), df1.iloc[:,8:]], axis=1)
ndf.columns = np.arange(11)
# adding values from df2 and formatting them as Total=sum
df2.columns = ['c1', 'c2', 'c3']
tot = df2.eval('c2+c3').apply(lambda x: 'Total='+str(x))
# Finding which rows need to be retained
idx_1 = ndf.iloc[:,7].str.split('-',expand=True).iloc[:,2]
idx_2 = df2.c1.str.split('-',expand=True).iloc[:,2]
idx = idx_1.isin(idx_2)  # keep rows whose key suffix appears in df2
ndf = ndf[idx].reset_index(drop=True)
tot = tot[idx].reset_index(drop=True)
# concatenating both CSV together and writing output csv
ndf.iloc[:,7] = ndf.iloc[:,7].map(str) + chr(9) + tot
pd.concat([ndf.iloc[:,:8],ndf.iloc[:,8:]], axis=1).to_csv('out.csv', sep='|', header=None, index=None)
# OUTPUT
# ug|s|b|city|bg|1|94|ON-05-0216 Total=150|9.72|28|288
You can use the pipe as a delimiter when reading the CSV, pd.read_csv(..., sep='|'), and only split the tab-separated column later on.
When merging two DataFrames, you must have a common column to merge on. You could also use the key columns as the index for easier alignment after you do the necessary math on the separate DataFrames; see the sketch below.
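A minimal sketch of both steps, assuming the file names from the question ('key', 'extra', 'v1', and 'v2' are placeholder column names chosen here):
import pandas as pd
# Read both files with the pipe delimiter; neither has a header row.
df1 = pd.read_csv('test1.csv', sep='|', header=None)
df2 = pd.read_csv('test2.csv', sep='|', header=None, names=['key', 'v1', 'v2'])
# Split the mixed column (assumed to be column 7) on the tab character.
df1[['key', 'extra']] = df1[7].str.split('\t', expand=True)
# Merge on the now-common 'key' column and compute the total.
merged = df1.drop(columns=[7]).merge(df2, on='key')
merged['Total'] = merged['v1'] + merged['v2']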
I know the argument usecols in pandas.read_excel() allows you to select specific columns.
Say I read an Excel file in with pandas.read_excel(). My Excel spreadsheet has 1161 rows. I want to keep the 1st row (with index 0) and skip rows 2:337. It seems the skiprows argument works only with 0-based indexing. I tried several different ways, but my code always reads all 1161 rows rather than skipping rows 2 through 337. Such as this:
documentationscore_dataframe = pd.read_excel("Documentation Score Card_17DEC2015 Rev 2 17JAN2017.xlsx",
sheet_name = "Sheet1",
skiprows = "336",
usecols = "H:BD")
Here is another attempt:
documentationscore_dataframe = pd.read_excel("Documentation Score Card_17DEC2015 Rev 2 17JAN2017.xlsx",
sheet_name = "Sheet1",
skiprows = "1:336",
usecols = "H:BD")
I would like the dataframe to exclude rows 2 through 337 in the original Excel import.
As per the documentation for pandas.read_excel, skiprows must be list-like; a string such as "336" is not valid.
Try this instead to exclude rows with index 1 to 336 inclusive (i.e., spreadsheet rows 2 through 337):
df = pd.read_excel("file.xlsx",
sheet_name = "Sheet1",
skiprows = range(1, 337),
usecols = "H:BD")
Note: range constructor is considered list-like for this purpose, so no explicit list conversion is necessary.
You can also pass a function to skiprows=. For example, to skip the first 336 rows (after the header row):
df = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', skiprows=lambda x: 1 <= x <= 336)