I want to read a csv file and make the date the index column. However, this "international visitor arrivals statistics" header can't be removed. How do I remove it? I have no idea how it got there or how to get rid of it.
import pandas as pd
import datetime
data5 = pd.read_csv('visitor.csv', parse_dates = [0], index_col=[0])
#data5 = data5.drop([0,1,2], axis = 0) # delete rows with irrelevant data
data5.columns = data5.iloc[3] # set the new header row with the proper header
data5 = data5[4:7768] # Take remaining data less the irrelevant data and the header row
data5
My output (screenshot)
Original Excel file (screenshot)
Try using the header parameter in pd.read_csv, which sets the row to use as the header of your DataFrame. In your case the header is on the 5th row, so set header=4 like this:
data5 = pd.read_csv('visitor.csv', parse_dates = [0], index_col=[0], header=4)
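For illustration, here is a minimal sketch of how header=4 behaves, using a made-up in-memory CSV (the junk lines and column names are hypothetical stand-ins, not the actual contents of visitor.csv):

```python
import io
import pandas as pd

raw = """international visitor arrivals statistics
report generated monthly
all figures are provisional
source: tourism board
Date,Region,Arrivals
2020-01-01,Asia,100
2020-02-01,Asia,120
"""

# header=4 tells pandas that the fifth physical line holds the column
# names; the four report lines above it are discarded entirely.
df = pd.read_csv(io.StringIO(raw), header=4, parse_dates=[0], index_col=[0])
print(df.columns.tolist())   # ['Region', 'Arrivals']
print(len(df))               # 2
```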
Related
I'm trying to add a header to the csv file that I create in the code given below.
There's only one column in the csv file that I'm trying to create;
the data frame is built from an array, which is
[0.6999346, 0.6599296, 0.69770324, 0.71822715, 0.68585426, 0.6738229, 0.70231324, 0.693281, 0.7101939, 0.69629824]
I just want to create a csv file with a header, like this:
Desired csv file (screenshot): I want my csv file in this format.
Please help me with detailed code; I'm new to coding.
I tried this
df = pd.DataFrame(c)
df.columns = ['Confidence values']
pd.DataFrame(c).to_csv('/Users/sunny/Desktop/objectdet/final.csv',header= True , index= True)
But I'm getting this csv file:
Try this
import pandas as pd
array = [0.6999346, 0.6599296, 0.69770324, 0.71822715, 0.68585426, 0.6738229, 0.70231324, 0.693281, 0.7101939, 0.69629824]
df = pd.DataFrame(array)
df.columns = ['Confidence values']
df.to_csv('final.csv', index=True, header=True)
Your call pd.DataFrame(c) creates a new dataframe with no header, while your df is the dataframe with the header.
You are writing the header-less dataframe to csv; that's why you don't get your header in the csv. All you need to do is replace pd.DataFrame(c) with df.
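To convince yourself the header actually makes it into the file, you can round-trip the data through an in-memory buffer (a sketch: io.StringIO stands in here for the real file path):

```python
import io
import pandas as pd

array = [0.6999346, 0.6599296, 0.69770324]
df = pd.DataFrame(array, columns=['Confidence values'])

# Write to an in-memory buffer instead of a file path, then read it
# back to verify the header row was actually written.
buf = io.StringIO()
df.to_csv(buf, index=True, header=True)
buf.seek(0)

check = pd.read_csv(buf, index_col=0)
print(check.columns.tolist())   # ['Confidence values']
```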
Right, so this is my .csv file:
,n,bubble sort,insertion sort,quick sort,tim sort
0,10,9.059906005859375e-06,5.0067901611328125e-06,1.9073486328125e-05,1.9073486328125e-06
1,50,0.0001659393310546875,8.487701416015625e-05,5.3882598876953125e-05,3.0994415283203125e-06
2,100,0.0006668567657470703,0.0003230571746826172,0.00011801719665527344,7.867813110351562e-06
3,500,0.028728008270263672,0.011162996292114258,0.0013577938079833984,6.008148193359375e-05
4,1000,0.11858582496643066,0.049070119857788086,0.0027892589569091797,0.000141143798828125
5,5000,2.022613048553467,0.8588027954101562,0.011118888854980469,0.0006251335144042969
and I was a bit confused about how to remove the row headers (the leading index column) from this output, since it's the DataFrame that creates them:
df = pd.DataFrame(timming)
df = pd.DataFrame(timming , header=None)
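One likely fix, sketched with a hypothetical stand-in for timming: pd.DataFrame() takes no header argument at all (passing one raises a TypeError). The row labels you see are the index, and they are dropped at write time with to_csv(index=False):

```python
import io
import pandas as pd

# Hypothetical stand-in for the `timming` results in the question.
timming = {
    'n': [10, 50],
    'bubble sort': [9.06e-06, 1.66e-04],
    'tim sort': [1.91e-06, 3.10e-06],
}

df = pd.DataFrame(timming)

# The 0,1,2,... row labels are the index; index=False omits them
# when writing, so the csv starts directly with the column names.
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])   # 'n,bubble sort,tim sort'
```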
the csv data is like this:
,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
and here is my program:
data = pd.read_csv('train.csv',delimiter=',')
group = data.drop('quality',axis=1).values
print(group[0])
I want the result to be 7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6, but it comes out as 0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8. So how do I avoid the index column?
The problem is that the data before the first , are not converted to an index, so you need index_col=[0]. Then, after calling .values, the first column is omitted:
data = pd.read_csv('train.csv',delimiter=',', index_col=[0])
Or:
data = pd.read_csv('train.csv', index_col=[0])
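Here is a small sketch of the effect, using an in-memory snippet of the data above (trimmed to a few columns):

```python
import io
import pandas as pd

raw = """,fixed acidity,volatile acidity,quality
0,7.0,0.27,6
1,6.3,0.3,6
"""

# With index_col=[0] the unnamed first column becomes the index, so it
# no longer shows up in .values; without it, it is parsed as data.
data = pd.read_csv(io.StringIO(raw), index_col=[0])
group = data.drop('quality', axis=1).values
print(group[0])   # first row, without the leading index value
```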
I have the following csv file:
(screenshot of the csv file)
there are about 6-8 rows at the top of the file. I know how to make a new dataframe in Pandas and filter the data:
df = pd.read_csv('payments.csv')
df = df[df["type"] == "Order"]
print(df.groupby('sku').size())
df = df[df["marketplace"] == "amazon.com"]
print(df.groupby('sku').size())
df = df[df["promotional rebates"] > ((df["product sales"] + df["shipping credits"]) * -.25)]
print(df.groupby('sku').size())
df.to_csv("out.csv")
My issue is with the headers. I need to:
1. Look for the row that has date/time and another field, so that I do not have to change my code if the file keeps changing the row count before the headers.
2. Make a new DF excluding those rows.
What is the best approach to make sure the code does not break as the file changes, as long as the header row exists and a few fields match? Open to any suggestions.
considering a CSV file like this:
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
You can use the following to compute the header's line number:
#load the first 20 rows of the csv file as a one column dataframe
#to look for the header
df = pd.read_csv("csv_file.csv", sep="|", header=None, nrows=20)
# use a regular expression to check which row contains the header;
# the following generates an array of booleans,
# True where the row matches the regex "datetime.+settelment id.+type"
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type")
# get the row index of the header
header_index = df[indices].index.values[0]
and read the csv file starting from the header's index:
# to read the csv file, use the following:
df = pd.read_csv("csv_file.csv", skiprows=header_index)
Reproducible example:
import pandas as pd
from io import StringIO
st = """random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
"""
df = pd.read_csv(StringIO(st), sep="|", header=None, nrows=20)
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type")
header_index = df[indices].index.values[0]
df = pd.read_csv(StringIO(st), skiprows=header_index)
print(df)
print("columns")
print(df.columns)
print("shape")
print(df.shape)
Output:
datetime settelment id type
0 dd dd dd
columns
Index(['datetime', ' settelment id', ' type'], dtype='object')
shape
(1, 3)
Python newbie, please be gentle. I have data in two "middle sections" of multiple Excel spreadsheets that I would like to combine into one pandas dataframe. Below is a link to a data screenshot.
Within each file, my headers are in Row 4 with data in Rows 5-15, Columns B:O. The headers and data then continue with headers on Row 21, data in Rows 22-30, Columns B:L. I would like to move the headers and data from the second set and append them to the end of the first set of data.
This code captures the header from Row 4 and the data in Columns B:O, but it captures all rows under the header, including the second header and the second set of data. How do I move this second set of data and append it after the first set?
path = r'C:\Users\sarah\Desktop\Original'
allFiles = glob.glob(path + "/*.xls")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_excel(file_, sheet_name="Data1", usecols="B:O", index_col=None, header=3, skiprows=3)
    list_.append(df)
frame = pd.concat(list_)
Screenshot of my data
If all of your Excel files have the same number of rows and this is a one-time operation, you could simply hard-code those numbers in your read_excel calls. If not, it will be a little tricky, but you pretty much follow the same procedure:
for file_ in allFiles:
    top = pd.read_excel(file_, sheet_name="Data1", usecols="B:O", index_col=None,
                        header=4, skiprows=3, nrows=14)  # note the nrows kwarg
    bot = pd.read_excel(file_, sheet_name="Data1", usecols="B:L", index_col=None,
                        header=21, skiprows=20, nrows=14)
    list_.append(top.join(bot, lsuffix='_t', rsuffix='_b'))
You can do it this way:
df1 = pd.read_excel(file_, sheet_name="Data1", usecols="B:O", index_col=None, header=3, skiprows=3)
df2 = pd.read_excel(file_, sheet_name="Data1", usecols="B:L", index_col=None, header=20, skiprows=20)
# pay attention to axis=1
df = pd.concat([df1, df2], axis=1)
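A toy sketch of what axis=1 does (the frames and column names below are made up, standing in for the two Excel blocks):

```python
import pandas as pd

# Hypothetical stand-ins for the two header blocks read from the file.
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'c': [5, 6]})

# axis=1 pastes the frames side by side, aligning on the row index;
# axis=0 (the default) would stack them vertically instead.
df = pd.concat([df1, df2], axis=1)
print(df.shape)           # (2, 3)
print(list(df.columns))   # ['a', 'b', 'c']
```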