I loaded a CSV file using Python pandas and want to drop every second column. I can't access the file from the first to the last column. My CSV file has only one row and no captions. The original file has about 1000 columns; for testing I use 12 columns. How do I access the columns from the first to the last?
I am trying to drop the first column by index; later I want to iterate through the columns. I expect an index like 0 to 11 or 1 to 12. Here is my code:
import pandas as pd
df = pd.read_csv("test.csv", index_col=0)
print(len(df.columns)) #returns 11 - expected: 12
df = df.drop(df.columns[0], axis=1)
df.to_csv('output.csv')
The code works, but with index 0 it drops the second column instead of the first, index 2 drops the fourth one, and so on...
Hope you can help me
I've edited my code. Not pretty, but it works (reading without index_col=0, so the first column is no longer swallowed as the index):
import pandas as pd

fileName = 'test.csv'
dummy = pd.read_csv(fileName)  # first pass just to count the columns
length = len(dummy.columns)
del dummy
df = pd.read_csv(fileName, usecols=[i for i in range(length) if i % 2 == 0])  # keep columns 0, 2, 4, ...
df.to_csv('output.csv', index=False)
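A shorter alternative sketch, assuming the file really has no header row: read everything once with header=None and slice with iloc, where ::2 keeps every second column starting with the first:
import pandas as pd

df = pd.read_csv('test.csv', header=None)  # no caption row, so don't treat the first row as a header
df.iloc[:, ::2].to_csv('output.csv', index=False, header=False)  # keep columns 0, 2, 4, ...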
Thank you for your answers
I have a csv file with wrong data in its first row. The names of the labels are in row number 2, so when I store this file in a DataFrame the label names are incorrect, and the correct names become the values of row 0. Is there any function similar to reset_index() but for columns? PS: I cannot change the csv file. Here is an image for better understanding: DataFrame with wrong labels.
Hello, let's suppose your csv file is data.csv. Try this code:
import pandas as pd

# reading the csv file
df = pd.read_csv('data.csv')
# changing the header names to integers
df.columns = range(df.shape[1])
# saving the data in another csv file, without the integer header
df.to_csv('data_without_header.csv', header=False, index=False)
# reading the new csv file
new_df = pd.read_csv('data_without_header.csv')
# inspecting the first rows of the new data
new_df.head()
If you do not care about the rows preceding your column names, you can pass the header argument with the (0-indexed) number of the row that holds the correct names. For example, if the proper column names are in row 2:
df = pd.read_csv('my_csv.csv', header=2)
Keep in mind that this will drop the preceding rows from the DataFrame. If you still want to keep them, you can do the following:
df = pd.read_csv('my_csv.csv')
df.columns = df.iloc[2, :] # replace columns with values in row 2
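Note that row 2 then still sits inside the data as a duplicate of the header. A small follow-up sketch, assuming the default RangeIndex, for removing just that row while keeping rows 0 and 1:
df = df.drop(index=2)  # remove the row that now duplicates the header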
Cheers.
I'm trying to read an Excel or CSV file into a pandas dataframe. Only the first two columns should be read, and the top row of those two columns should supply the column names. The problem occurs when the first cell of the top row is empty in the Excel file:
            IDs
2/26/2010   2
3/31/2010   4
4/31/2010   2
5/31/2010   2
Then, the last line of the following code fails:
uploaded_file = request.FILES['file-name']
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0, 1])
else:
    df = pd.read_excel(uploaded_file, usecols=[0, 1])

ref_date = 'ref_date'
regime_tag = 'regime_tag'
df.columns = [ref_date, regime_tag]
Apparently, it only reads one column (i.e. the IDs). However, with read_csv it reads both columns, with the first column being unnamed. I want read_excel to behave the same way and read both columns regardless of whether the top cells are empty or filled. How do I go about doing that?
What's happening is the first "column" in the Excel file is being read in as an index, while in the CSV file it's being treated as a column / series.
I recommend you work the other way and amend pd.read_csv to read the first column as an index. Then use reset_index to elevate the index to a series:
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0, 1], index_col=0)
else:
    df = pd.read_excel(uploaded_file, usecols=[0, 1])

df = df.reset_index()  # this will elevate the index to a column called 'index'
This will give consistent output, i.e. first series will have label 'index' and the index of the dataframe will be the regular pd.RangeIndex.
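From there, restoring the labels the question assigns is one more line; a small sketch, reusing the ref_date and regime_tag names from the question:
df.columns = ['ref_date', 'regime_tag']  # rename 'index' and the value column in one step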
You could potentially use a dispatcher to get rid of the unwieldy if / else construct:
file_flag = {True: pd.read_csv, False: pd.read_excel}
read_func = file_flag[uploaded_file.name.endswith('.csv')]
df = read_func(uploaded_file, usecols=[0,1], index_col=0).reset_index()
So I currently have a dataset with a column called 'logid' which consists of 4-digit numbers. I have about 200k rows in my csv files and I would like to count each unique logid and output something like this:
Logid | #ofoccurences for each unique id. So it might be 1000 | 10, meaning that the logid 1000 is seen 10 times in the csv file column 'logid'. The separator | is not necessary, just making it easier for you guys to read. This is my code currently:
import pandas as pd
import os, sys
import glob

count = 0
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)
    counts = df['my_data'].value_counts()
counts
Using this I get a weird output that I don't quite understand:
4 16463
10013 490
pserverno 1
Name: my_data, dtype: int64
I know I am doing something wrong in the last line
counts = df['my_data'].value_counts()
but I am not too sure what. For reference, the values I am extracting come from column C in the Excel file (so I guess that's column 3?). Thanks in advance!
OK, from my understanding, I think the csv file may look like this:
row1,row1,row1
row2,row2,row2
row3,row3,row3
logid,header1,header2
1000,a,b
1001,c,d
1000,e,f
1001,g,h
And I have handled this format of csv file like this:
# skipping the first three rows
df = pd.read_csv("file_name.csv", skiprows=3)
print(df['logid'].value_counts())
And the output looks like this:
1001 2
1000 2
Hope this will help.
Update 1
df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)
In this line the parameter names=['my_data'] creates a new header for the data frame. As your csv file already has a header row, you can skip this parameter. And since the header you want comes after the first three rows, you can skip them with skiprows=3. One last thing: you are reading every csv file in the given path, so make sure all of the csv files have the same format. Happy coding.
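Putting those fixes together, a hedged sketch of the glob loop, assuming every file carries the same three junk rows, and accumulating the counts across files:
import glob
import pandas as pd

path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
total = None
for fname in glob.glob(path):
    df = pd.read_csv(fname, skiprows=3)  # the fourth row (logid, ...) becomes the header
    counts = df['logid'].value_counts()
    total = counts if total is None else total.add(counts, fill_value=0)
print(total)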
I think you need to create one big DataFrame by appending all the DataFrames to a list and then concatenating them first:
dfs = []
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, usecols=['logid'], low_memory=False)
    dfs.append(df)

df = pd.concat(dfs)
Then use value_counts - the output is a Series. So to get a 2-column DataFrame you need rename_axis with reset_index:
counts = df['logid'].value_counts().rename_axis('logid').reset_index(name='count')
counts
Or groupby and aggregate size:
counts = df.groupby('logid').size().reset_index(name='count')
counts
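With the small sample file shown above (logids 1000 and 1001, twice each), either version should produce a two-column frame like:
   logid  count
0   1000      2
1   1001      2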
You may try this:
counts = df['logid'].value_counts()
Now counts should give you the number of occurrences of each value.
I am reading a csv file, cleaning it up a little, and then saving it back to a new csv file. The problem is that the new csv file has a new column (the first column, in fact) labelled 'index'. Now this is not the row index, as I have turned that off in the to_csv() call, as you can see in the code; besides, the row index wouldn't have a column label.
df = pd.read_csv('D1.csv', na_values=0, nrows = 139) # Read csv, with 0 values converted to NaN
df = df.dropna(axis=0, how='any') # Delete any rows containing NaN
df = df.reset_index()
df.to_csv('D1Clean.csv', index=False)
Any ideas where this phantom column is coming from and how to get rid of it?
I think you need to add the parameter drop=True to reset_index:
df = df.reset_index(drop=True)
drop : boolean, default False
Do not try to insert index into dataframe columns. This resets the index to the default integer index.
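A minimal sketch of the difference, assuming a frame where dropna has left gaps in the index:
import pandas as pd

df = pd.DataFrame({'a': [1, None, 3]}).dropna()  # index is now [0, 2]
print(df.reset_index())             # adds an 'index' column holding 0 and 2
print(df.reset_index(drop=True))    # just renumbers to 0, 1 with no extra column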
I am trying to write a script that loops over files via a certain pattern/variable, then concatenates the 8th column of the files while keeping the first 4 columns, which are common to all files. The script works if I use the following commands:
reader = csv.reader(open("1isoforms.fpkm_tracking.txt", 'rU'), delimiter='\t') #to read the header names so i can use them as index. all headers for the three files are the same
header_row = reader.next() # Gets the header
df1 = pd.read_csv("1isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #1 with index as first 4 columns
df2 = pd.read_csv("2isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #2 with index as first 4 columns
df3 = pd.read_csv("3isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #3 with index as first 4 columns
result = pd.concat([df1.ix[:,4], df2.ix[:,4]], keys=["Header1", "Header2", "Header3"], axis=1) #concatenates the 8th column of the files and changes the header
result.to_csv("OutputTest.xls", sep="\t")
While this works, it is NOT practical for me to enter file names one by one, as I sometimes have hundreds of files, so I can't type in a df... function call for each. Instead, I was trying to use a for loop to do this, but I couldn't figure it out. Here is what I have so far:
k = 0
for geneFile in glob.glob("*_tracking*"):
    while k < 3:
        reader = csv.reader(open(geneFile, 'rU'), delimiter='\t')
        header_row = reader.next()
        key = str(k)
        key = pd.read_csv(geneFile, index_col=header_row[0:1], sep="\t")
        result = pd.concat([key[:,5]], axis=1)
        result.to_csv("test2.xls", sep="\t")
However, this is not working.
The issues I am facing are as follows:
1. How can I iterate over the input files and generate a different variable name for each, which I can then use in the pd.concat function one after the other?
2. How can I use a for loop to generate a string file name that is a combination of df and an integer?
3. How can I fix the above script to get my desired output?
A minor issue concerns the way I am using index_col: is there a way to use the column numbers rather than the column names? I know it works for index_col=0 or any other single number, but I couldn't use integers for more than 1 column of indexing.
Note that all files have the exact same structure, and the index columns are the same.
Your feedback is highly appreciated.
Consider using merge with right_index and left_index arguments:
import pandas as pd

numberoffiles = 100

# FIRST IMPORT (CREATE RESULT DATA FRAME)
result = pd.read_csv("1isoforms.fpkm_tracking.txt", sep="\t",
                     index_col=[0, 1, 2, 3], usecols=[0, 1, 2, 3, 7])

# ALL OTHER IMPORTS (MERGE INTO RESULT DATA FRAME, 8TH COLUMN SUFFIXED ITERATIVELY)
for i in range(2, numberoffiles + 1):
    df = pd.read_csv("{}isoforms.fpkm_tracking.txt".format(i), sep="\t",
                     index_col=[0, 1, 2, 3], usecols=[0, 1, 2, 3, 7])
    result = pd.merge(result, df, right_index=True, left_index=True, suffixes=[i-1, i])

result.to_excel("Output.xlsx")
result.to_csv("Output.csv")
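If the file names do not follow a strict numeric pattern, a hedged alternative sketch using the question's glob pattern and a single concat instead of repeated merges (assuming, as the question states, that the index columns are identical across files):
import glob
import pandas as pd

files = sorted(glob.glob("*_tracking*"))
frames = [pd.read_csv(f, sep="\t", index_col=[0, 1, 2, 3], usecols=[0, 1, 2, 3, 7])
          for f in files]
result = pd.concat(frames, axis=1, keys=files)  # one data column per file, labelled by file name
result.to_csv("Output.csv", sep="\t")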