I am trying to read a bunch of tables into a dataframe using pandas. The files have an extension of .xls, but appear to be in HTML format, so I'm using the pandas.read_html() function. The issue I face is that the first column contains merged cells, and pandas is shifting values.
The original file: (screenshot omitted)
The contents of the pandas dataframe: (screenshot omitted)
As the screenshots showed, some of the values from the second column have been read into the first column. How can I make sure that the values are read into the correct column when one of the columns has merged cells?
Below is the code I'm using to read the files:
import os
import pandas

rawFileDir = 'C:/ftproot/Projects/Korea/Data/AL_Seg/Domestic'
rawFiles = os.listdir(rawFileDir)
for rawFile in rawFiles:
    # os.listdir() returns bare file names, so join them with the directory
    path = os.path.join(rawFileDir, rawFile)
    if not os.path.isfile(path):
        continue
    xl = pandas.read_html(path)
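One possible workaround, assuming the shift happens because rows under a merged first-column cell come back one cell short (so their last column ends up NaN): shift those rows back to the right and forward-fill the merged column. This is only a sketch of that idea, not a guaranteed fix for every file:

import os
import pandas

rawFileDir = 'C:/ftproot/Projects/Korea/Data/AL_Seg/Domestic'

for rawFile in os.listdir(rawFileDir):
    path = os.path.join(rawFileDir, rawFile)
    if not os.path.isfile(path):
        continue
    df = pandas.read_html(path)[0]  # read_html returns a list; take the first table
    # Rows that lost their merged first cell are shifted one column to the left,
    # which leaves their last column empty
    shifted = df.iloc[:, -1].isna()
    df.loc[shifted] = df.loc[shifted].shift(1, axis=1)
    # The merged first column is now blank on those rows; carry its value down
    df.iloc[:, 0] = df.iloc[:, 0].ffill()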
So I have an Excel sheet whose columns are, in this order:
Sample_name | column data | column data2 | column data ... n
I also have a .txt file that contains
Sample_name
What I want to do is filter the Excel file down to only the sample names contained in the .txt file. My current idea is to go through the Excel sheet and check each sample name against the names in the .txt file, and grab the whole record whenever it matches. However, this seems inefficient, and I need to do it in Python. I was hoping someone could suggest a better approach. Thank you very much.
Excel PowerQuery should do the trick:
Load .txt file as a table (list)
Load sheet with the data columns as another table
Merge (e.g. Left join) first table with second table
Optional: adjust/select the columns to be included or excluded in the resulting table
In Python, the same can be accomplished with pandas by joining two DataFrames; see the sketch below.
P.S. pandas supports loading CSV files, and .txt files as a variant of CSV, into a DataFrame.
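A minimal pandas sketch of that join. The file names sample_names.txt and data.xlsx are placeholders, and it assumes the shared column is called Sample_name:

import pandas as pd

# One sample name per line in the .txt file
samples = pd.read_csv("sample_names.txt", header=None, names=["Sample_name"])
data = pd.read_excel("data.xlsx")

# An inner join keeps only the rows whose Sample_name appears in the .txt file
filtered = data.merge(samples, on="Sample_name", how="inner")
filtered.to_excel("filtered.xlsx", index=False)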
I only need to import a specific column, filtered by a condition (such as specific values found in that column), and I want to avoid the unnecessary columns, since dropping them afterwards takes too much code. What code or syntax is applicable?
How to get a column from a pandas DataFrame is answered in Read specific columns from a csv file with csv module?
To quote:
Pandas is spectacular for dealing with csv files, and the following code would be all you need to read a csv and save an entire column into a variable:
import pandas as pd

df = pd.read_csv(csv_file)  # csv_file is the path to your file
saved_column = df.column_name  # you can also use df['column_name']
So in your case, you just save the filtered DataFrame in a new variable. This means you do newdf = data.loc[...] and then use the code snippet from above to extract the column you desire, for example newdf.continent.
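Put together, a small sketch; the file name and the population condition are placeholders, with the continent column taken from the example above:

import pandas as pd

df = pd.read_csv("data.csv")  # placeholder file name

# Keep only the rows matching a condition, then pull out a single column
newdf = df.loc[df["population"] > 1_000_000]  # placeholder condition
saved_column = newdf["continent"]             # or newdf.continent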
I have an Excel file (screenshot omitted) with 10 sheets containing different data, and I want to sort the data in each sheet by the 'BA_Rank' column in descending order.
After sorting the data, I have to write it to an Excel file (e.g. the data that was in sheet1 of the unsorted file should be written to sheet1 of the sorted file, and so on).
If I remove the heading from the first row, I can use the pandas sort_values() function to sort the data in the first sheet and save it to another file, like this:
import pandas as pd
import xlrd

doc = xlrd.open_workbook('without_sort.xlsx')
xl = pd.read_excel('without_sort.xlsx')
length = doc.nsheets
# print(length)
# for i in range(0, length):
#     sheet = xl.parse(i)
result = xl.sort_values('BA_Rank', ascending=False)
result.to_excel('SortedData.xlsx')
print(result)
So is there any way I can sort the data without removing the header from the first row? And how can I iterate over the sheets so as to sort the data present in each of them?
(Note: All the sheets contain the same columns and I need to sort every sheet using 'BA_Rank' in descending order.)
First point: you don't need to call xlrd when using pandas; it's done under the hood.
Secondly, the read_excel method is really flexible. You can (and in my opinion should) specify the sheet you're pulling data from. You can also set rows to skip, tell it where the header line is, or ignore the header entirely (and then set column names manually). Check the docs; they're quite extensive.
If the "10 sheets" figure is merely incidental, you could use something like xlrd to get the workbook's sheet count and work by index (or extract the sheet names directly).
The sorting looks right to me.
Finally, if you want to save it all in the same workbook, I would use openpyxl or some similar library (there are many others, like pyexcelerate for large files).
This procedure pretty much always looks like:
Create/Open destination file (often it's the same method)
Write down data, sheet by sheet
Close/Save file
If the data is to be written all on the same sheet, pd.concat([all_dataframes]).to_excel("path_to_store") should get it done.
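For the sheet-by-sheet case, a minimal sketch using only pandas (which drives openpyxl under the hood for .xlsx output), with the file and column names from the question. It reads every sheet, sorts each one by 'BA_Rank', and writes them back out under the same sheet names:

import pandas as pd

# sheet_name=None returns a dict of {sheet name: DataFrame} covering all sheets
sheets = pd.read_excel('without_sort.xlsx', sheet_name=None)

with pd.ExcelWriter('SortedData.xlsx') as writer:
    for name, df in sheets.items():
        df.sort_values('BA_Rank', ascending=False).to_excel(
            writer, sheet_name=name, index=False)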
I have to work with 50+ .txt files, each containing 2 columns and 631 rows, and I have to perform different operations on each (sometimes combining them with each other) before doing data analysis. I was hoping there was a way to import each text file into a different pandas DataFrame instead of doing it individually. The code I've been using for a single file is:
import pandas as pd

df = pd.read_table(file_name, skiprows=1, index_col=0)
print(df)
I use index_col=0 because the first column holds the x-values, and skiprows=1 because I have to drop the title, which is the first row (and the file's name in the folder) of each .txt file. I was thinking maybe I could use the glob package to import everything from the folder as a single DataFrame and then split it into different DataFrames, keeping the title as the name of each variable. Is there a feasible way to import all of these files at once into different DataFrames, stored under their title names? Each .txt file would become a DataFrame of 2 columns x 631 rows, not counting the title row. All values in the columns are integers.
Thank you
Yes. If you store your file paths in a list named filelist (built with glob, for example), you can use the following to read all the files and store them in a dict.
dfdict = {f: pd.read_table(f,...) for f in filelist}
Then you can use each data frame with dfdict["filename.txt"].
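A fuller sketch, assuming the files sit in a data/ folder (a placeholder path) and reusing the read_table arguments from the question:

import glob
import pandas as pd

filelist = glob.glob("data/*.txt")  # placeholder folder

# One DataFrame per file, keyed by the file path
dfdict = {f: pd.read_table(f, skiprows=1, index_col=0) for f in filelist}

# Access an individual file's DataFrame by its path, e.g.:
# dfdict["data/filename.txt"]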
So here is my situation: using Python, I want to copy specific columns from an Excel spreadsheet into specific columns of a CSV file.
The pre-filled column header names are different in each file, so I need to pass the mapping in as sublist parameters.
For example, in the first sublist, a data column needs to be copied from the spreadsheet to the CSV:

spreadsheet     csv
"scan_date" =>  "date_of_scan"
Two sublists as parameters: one with the column names to copy from Excel, one with the names of the destination columns in the CSV.
Not sure if a dictionary would be better than two individual sublists?
Also, the CSV's column header names are in the second row (not the first row like the Excel file), which has complicated things such as building the data frames.
So, ideally I would like to have the sublists converted to arrays, then:
iterate over the spreadsheet's columns to find "scan_date"
copy its data
iterate to find "date_of_scan" in the csv
paste the data
move on to the second item in the sublists and repeat.
I've tried pandas and openpyxl and just can't seem to figure out the approach/syntax of how to do it.
Any help would be greatly appreciated.
Thank you.
Clarification edit:
The csv file has some preexisting data within it. Also, I cannot move the headers to different columns, so if "date_of_scan" is in column "RF" then it must stay in column "RF". I was able to copy, say, the 5 columns of data from Excel into a temporary spreadsheet and then concatenate it into the csv, but the pasted columns always ended up at the beginning of the csv document (columns A, B, C, D, E).
It is hard to know the answer without seeing your specific dataset, but it seems to me that a simpler approach might be to read your Excel sheet into a DataFrame, drop everything except the columns you want in the csv, rename the columns, and then write the csv with pandas. Here's some pseudo-code:
import pandas as pd

df = pd.read_excel('your_file_name.xlsx')

drop_cols = []  # list of columns to get rid of
df = df.drop(drop_cols, axis='columns')  # drop() returns a new frame, so reassign

# Map old column names to new ones; in this example a, b, c are the old
# columns and x, y, z the new ones
col_dict = {'a': 'x', 'b': 'y', 'c': 'z'}
df = df.rename(columns=col_dict)

df.to_csv('new_file_name.csv')  # write the new file
And this will actually run in Python, though I created the df from dummy data instead of an Excel file:
# with dummy data
import pandas as pd

df = pd.DataFrame([0, 1, 2], index=['a', 'b', 'c']).T  # one row, columns a, b, c
col_dict = {'a': 'x', 'b': 'y', 'c': 'z'}
df = df.rename(columns=col_dict)
df.to_csv('new_file_name.csv')  # write the new file
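Given the clarification that the pre-existing CSV columns must keep their positions, a different sketch: write the Excel values into the existing CSV columns instead of concatenating new ones. The file names are placeholders, header=1 encodes the assumption that the CSV's headers sit on its second row (note that whatever is above that row is dropped on rewrite), and both files are assumed to have the same number of data rows:

import pandas as pd

# Excel column -> CSV column, per the mapping from the question
mapping = {"scan_date": "date_of_scan"}

xl = pd.read_excel("source.xlsx")          # placeholder file name
csv = pd.read_csv("target.csv", header=1)  # headers assumed on the second row

for src, dst in mapping.items():
    # Overwrite the existing column in place; column order is preserved
    csv[dst] = xl[src].values

csv.to_csv("target.csv", index=False)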