Read the last column from an Excel file with pandas - python

It's like how to read certain columns from Excel using Pandas - Python but a little bit more complicated.
Say I have an Excel file called "foo.xlsx" and it grows over time - a new column will be appended on the right every month. However, when I read it, I need only the first two and the last columns. I expected usecols parameter can solve this problem so I went df = pd.read_excel("foo.xlsx", usecols=[0, 1, -1]) but it gives me only the first two columns.
My workaround turns out to be:
df = pd.read_excel("foo.xlsx")
df = df[df.columns[[0, 1, -1]]]
But it needs reading the whole file every time. Is there any way that I can get my desired data frame while reading the file? Thanks.

If you really want to do this (see my comment above) you could to this:
xl = pd.ExcelFile(file)
ncols = xl.book.sheets()[0].ncols
df = xl.parse(0, usecols=[0, 1, ncols-1])
This solution won't read the excel file twice.

One idea is get column count and pass to usecols:
from openpyxl import load_workbook
path = "file.xlsx"
wb = load_workbook(path)
sheet = wb.worksheets[0]
column_count = sheet.max_column
print (column_count)
Or read only first row of file:
column_count = len(pd.read_excel(path, nrows=0).columns)
df = pd.read_excel(path, usecols=[0, 1, column_count-1])
print (df)

You can use df.head() and df.tail() to read the first 2 and last line. For example:
df = pd.read_excel("foo.xlsx", sheet_name='ABC')
#print the first 2 column
print(df.head(2))
#print the last column
print(df.tail(1))
EDIT: Oops the above code reads rows and not columns. Yes, you have to read the file everytime. I don't think there's an option to read partial file.
For reading column maybe you can do something like this
df['Column Name'][index]

Related

Reset labels in Pandas DataFrame, Python

I have a csv file with a wrong first row data. The names of labels are in the row number 2. So when I am storing this file to the DataFrame the names of labels are incorrect. And correct names become values of the row 0. Is there any function similar to reset_index() but for columns? PS I can not change csv file. Here is an image for better understanding. DataFrame with wrong labels
Hello let's suppose you csv file is data.csv :
Try this code:
import pandas as pd
#reading the csv file
df = pd.read_csv('data.csv')
#changing the headers name to integers
df.columns = range(df.shape[1])
#saving the data in another csv file
df.to_csv('data_without_header.csv',header=None,index=False)
#reading the new csv file
new_df = pd.read_csv('data_without_header.csv')
#plotting the new data
new_df.head()
If you do not care about the rows preceding your column names, you can pass in the "header" argument with the value of the correct row, for example if the proper column names are in row 2:
df = pd.read_csv('my_csv.csv', header=2)
Keep in mind that this will erase the previous rows from the DataFrame. If you still want to keep them, you can do the following thing:
df = pd.read_csv('my_csv.csv')
df.columns = df.iloc[2, :] # replace columns with values in row 2
Cheers.

Pandas drop_duplicates() not working after add a row to DataFrame when read from a csv file

My code like below:
indexing_file_path = 'indexing.csv'
if not os.path.exists(indexing_file_path):
df = pd.DataFrame([['1111', '20200101', '20200101'],
['1112', '20200101', '20200101'],
['1113', '20200101', '20200101']],
columns = ['nname', 'nstart', 'nend'])
else:
df = pd.read_csv(indexing_file_path, header = 0)
print(df)
df.loc[len(df)] = ['1113', '20200202', '20200303']
# append() method not working either
print(df)
df.drop_duplicates('nname', keep = 'last', inplace = True)
print(df)
df.to_csv(indexing_file_path, index = False)
I want to keep the nname column unique in this file.
When the code run first time, it will save the records to csv file correctly, although the 1113 is not unique.
When the code run second time, it will save two 1113 rows to the csv file, because the DataFrame is created from a csv file.
After the third time run, it will always keep two 1113 rows.
Now I have a solution:
1, save to csv file with two 1113 row.
2, read the csv file again.
3, use drop_duplicates again.
4, save to csv file again.
Why the DataFrame created from a csv file is so different?
How can I save the unique row to csv file one time?
I can answer my question now.
The reason is:
When DataFrame is created from a csv file, pandas recognize the nname column as integer
But, when I add 1113 row again, pandas recognize the new row nname as a string, so the integer 1113 is not equals the string 1113, pandas will keep two row.
The solution is:
Read csv file as string.
df = pd.read_csv(indexing_file_path, header=0, dtype=str)

CSV File , Text to Column using Panda

def Text2Col(df_File):
for i in range(0,len(df_File)):
with open(df_File.iloc[i]['Input']) as inf:
with open(df_File.iloc[i]['Output'], 'w') as outf:
i=0
for line in inf:
i=i+1
if i==2 or i==3:
continue
outf.write(','.join(line.split(';')))
Above code is used to convert a csv file from text to column.
This code makes all values string ( because split() ) which is problematic for me.
I tried using map function but cant make it.
Is there any other way in which I can do this.
My input file has 5 columns, the first column is string, the second is int and the rest are float.
I think it required some modification in last statement
outf.write(','.join(line.split(';')))
Please let me know if any other input is required.
Ok, trying to help here. If this doesn't work, please specify in your question, what you're missing or what else needs to be done:
Use pandas to read in a csv file:
import pandas as pd
df = pd.read_csv('your_file.csv')
If you have a header on the first row, then use:
import pandas as pd
df = pd.read_csv('your_file.csv', header=0)
If you have a tab delimiter instead of a comma delimiter, then use:
import pandas as pd
df = pd.read_csv('your_file.csv', header=0, sep='\t')
Thank you !
Following Code worked:
def Text2Col(df_File):
for i in range(0,len(df_File)):
df = pd.read_csv(df_File.iloc[i]['Input'],sep=';')
df = df[df.index != 0]
df= df[df.index != 1]
df.to_csv(df_File.iloc[i]['Output'])
File_List="File_List.csv"
df_File=pd.read_csv(File_List)
Text2Col(df_File)
Input files are kept in same folder with same name as mentioned in File_List.xls
Output files will be created in same folder with separated in column. I deleted row 0 and 1 for my use. One can skip or add depending upon his requirement.
In above code df_file is dataframe contain two column list, first column is input file name and second column is output file name.

Add header with merged cells from one excel and insert to another excel Pandas

I have been searching over on how to append/insert/concat a row from one excel to another but with merged cells. I was not able to find what I am looking for.
What I need to get is this:
and append to the very first row of this:
I tried using pandas append() but it destroyed the arrangement of columns.
df = pd.DataFrame()
for f in ['merge1.xlsx', 'test1.xlsx']:
data = pd.read_excel(f, 'Sheet1')
df = df.append(data)
df.to_excel('test3.xlsx')
Is there way pandas could do it? I just need to literally insert the header to the top row.
Although I am still trying to find a way, it would actually be fine to me if this question had a duplicate as long as I can find answers or advice.
You can use pd.read_excel to read in the workbook with the data you want, in your case that is 'test1.xlsx'. You could then utilize openpyxl.load_workbook() to open an existing workbook with the header, in your case that is 'merge1.xlsx'. Finally you could save the new workbbok by a new name ('test3.xlsx') without changing the two existing workbooks.
Below I've provided a fully reproducible example of how you can do this. To make this example fully reproducible, I create 'merge1.xlsx' and 'test1.xlsx'.
Please note that if in your 'merge1.xlsx', if you only have the header that you want and nothing else in the file, you can make use of the two lines I've left commented out below. This would just append your data from 'test1.xlsx' to the header in 'merge1.xlsx'. If this is the case then you can get rid of the two for llops at the end. Otherwise as in my example it's a bit more complicated.
In creating 'test3.xlsx', we loop through each row and we determine how many columns there are using len(df3.columns). In my example this is equal to two but this code would also work for a greater number of columns.
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows
df1 = pd.DataFrame()
writer = pd.ExcelWriter('merge1.xlsx') #xlsxwriter engine
df1.to_excel(writer, sheet_name='Sheet1')
ws = writer.sheets['Sheet1']
ws.merge_range('A1:C1', 'This is a merged cell')
ws.write('A3', 'some string I might not want in other workbooks')
writer.save()
df2 = pd.DataFrame({'col_1': [1,2,3,4,5,6], 'col_2': ['A','B','C','D','E','F']})
writer = pd.ExcelWriter('test1.xlsx')
df2.to_excel(writer, sheet_name='Sheet1')
writer.save()
df3 = pd.read_excel('test1.xlsx')
wb = load_workbook('merge1.xlsx')
ws = wb['Sheet1']
#for row in dataframe_to_rows(df3):
# ws.append(row)
column = 2
for item in list(df3.columns.values):
ws.cell(2, column=column).value = str(item)
column = column + 1
for row_index, row in df3.iterrows():
ws.cell(row=row_index+3, column=1).value = row_index #comment out to remove index
for i in range(0, len(df3.columns)):
ws.cell(row=row_index+3, column=i+2).value = row[i]
wb.save("test3.xlsx")
Expected Output of the 3 Workbooks:

concatenate pandas dataframe in a loop of files

I am trying to write a script that loops over files via a certain pattern/variable, then it concatenates the 8th column of the files while keeping the first 4 columns which are common to all files. The script works if I use the following command:
reader = csv.reader(open("1isoforms.fpkm_tracking.txt", 'rU'), delimiter='\t') #to read the header names so i can use them as index. all headers for the three files are the same
header_row = reader.next() # Gets the header
df1 = pd.read_csv("1isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #1 with index as first 5 columns
df2 = pd.read_csv("2isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #2 with index as first 5 columns
df3 = pd.read_csv("3isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #3 with index as first 5 columns
result = pd.concat([df1.ix[:,4], df2.ix[:,4]], keys=["Header1", "Header2", "Header3"], axis=1) #concatenates the 8th column of the files and changes the header
result.to_csv("OutputTest.xls", sep="\t")
While this works, it is NOT practical for me to enter file names one by one as I sometimes have 100's of files, so cant type in a df...function for each. Instead, I was trying to use a for loop to do this but i couldnt figure it out. here is what I have so far:
k=0
for geneFile in glob.glob("*_tracking*"):
while k < 3:
reader = csv.reader(open(geneFile, 'rU'), delimiter='\t')
header_row = reader.next()
key = str(k)
key = pd.read_csv(geneFile, index_col=header_row[0:1], sep="\t")
result = pd.concat([key[:,5]], axis=1)
result.to_csv("test2.xls", sep="\t")
However, this is not working .
The issues I am facing are as follows:
How can I be able to iterate over input files and generate different
variable names for each which I can then have it used in the
pd.concat function one after the other?
How can I use a for loop to generate a string file name that is a
combination of df and an integer
How can I fix the above script get my desired item.
A minor issue is regarding the way I am using the col_index function: is there a way to use the column # rather than column names? I know it works for index_col=0 or any single #. But I couldn't use integers for > 1 column of indexing.
Note that all files have the exact same structure, and the index columns are the same.
Your feedback is highly appreciated.
Consider using merge with right_index and left_index arguments:
import pandas as pd
numberoffiles = 100
# FIRST IMPORT (CREATE RESULT DATA FRAME)
result = pd.read_csv("1isoforms.fpkm_tracking.txt", sep="\t",
index_col=[0,1,2,3], usecols=[0,1,2,3,7])
# ALL OTHER IMPORTS (MERGE TO RESULT DATA FRAME, 8TH COLUMN SUFFIXED ITERATIVELY)
for i in range(2,numberoffiles+1):
df = pd.read_csv("{}isoforms.fpkm_tracking.txt".format(i), sep="\t",
index_col=[0,1,2,3], usecols=[0,1,2,3,7])
result = pd.merge(result, df, right_index=True, left_index=True, suffixes=[i-1, i])
result.to_excel("Output.xlsx")
result.to_csv("Output.csv")

Categories

Resources