How to handle Excel data with merged headers in Python?

I have an Excel sheet that contains data like the following.
How can I handle this in Python using pandas?
Mainly, I want to plot this data in a graph and find the percentage of people who have registered for ANC out of the Estimated Number of Annual Pregnancies, year-wise, across the states.
Any ideas would be deeply helpful.
PS: I am using the IPython notebook on Linux Mint.
I need the data to be indexed like this:

I would recommend reading in the dataframe while skipping the merged header rows, then creating a dictionary to rename your columns.
Something like the following:
df = pd.read_excel(path, skiprows=8)
mydict = {"Original Col1": "New Col Name1", "Original Col2": "New Col Name2"}
df = df.rename(columns=mydict)
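Building on that, here is a minimal sketch of the percentage calculation and plot the question asks for; the renamed columns "State", "Estimated_Pregnancies", and "ANC_Registered" are hypothetical placeholders for whatever your renamed columns end up being:
import pandas as pd
df = pd.read_excel(path, skiprows=8)
df = df.rename(columns={"Original Col1": "State",
                        "Original Col2": "Estimated_Pregnancies",
                        "Original Col3": "ANC_Registered"})
# Percentage of estimated annual pregnancies that registered for ANC, per row/state.
df["ANC_pct"] = 100 * df["ANC_Registered"] / df["Estimated_Pregnancies"]
# Quick bar plot of the percentage by state.
df.plot(x="State", y="ANC_pct", kind="bar")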

Related

Iterate through multiple sheets in excel file and filter all the data after a value in a row and append all the sheets

I have an excel file with around 50 sheets. All the data in the excel file looks like below:
I want to read the first row and all the rows after 'finish' in the first column.
My script looks something like this:
df = pd.read_excel('excel1_all_data.xlsx')
df = df.head(1).append(df[df.index > df[df.iloc[:, 0] == 'finish'].index[0]])
The output looks like below:
The start and finish are gone.
My question is: how can I iterate through all the sheets in a similar way and append them into one dataframe? I would also like a column containing the sheet name, please.
The data in the other sheets is similar, but with different dates and names. 'start' and 'finish' will still be present, and we want to get everything after 'finish'.
Thank you so much for your help!
Try this code and let me know if it works for you:
import pandas as pd
wbSheets = pd.ExcelFile("excel1_all_data.xlsx").sheet_names
frames = []
for st in wbSheets:
    df = pd.read_excel("excel1_all_data.xlsx", st)
    frames.append(df.iloc[[0]])
    frames.append(df[5:])
res = pd.concat(frames)
print(res)
The pd.ExcelFile("excel1_all_data.xlsx").sheet_names call is what gets you the sheet names you need to iterate over.
In pandas.read_excel's documentation you'll find that you can read a specific sheet of the workbook, which is what I've used in the for loop.
I don't know if concat is the best way to solve this for huge files, but it worked fine on my sample.
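If you also want to pick up everything after 'finish' dynamically (instead of the hardcoded df[5:]) and record the sheet name as the question asks, a sketch along these lines should work; only the file name and the 'finish' marker come from the question, the rest is an assumption:
import pandas as pd
path = "excel1_all_data.xlsx"
frames = []
for sheet in pd.ExcelFile(path).sheet_names:
    df = pd.read_excel(path, sheet_name=sheet)
    # Locate the row whose first column equals 'finish'.
    finish_idx = df[df.iloc[:, 0] == 'finish'].index[0]
    # Keep the first row plus everything after 'finish'.
    part = pd.concat([df.head(1), df[df.index > finish_idx]])
    part["sheet_name"] = sheet  # record which sheet the rows came from
    frames.append(part)
res = pd.concat(frames, ignore_index=True)
print(res)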

How to use Vlookup in csv with python?

Original file
Data source
Output
My code is as follows.
import pandas as pd
file_dest = r"C:\Users\user\Desktop\Book1.csv"
# read csv data
book = pd.read_csv(file_dest)
file_source = r"C:\Users\user\Desktop\Book2.csv"
materials = pd.read_csv(file_source)
Right_join = pd.merge(book,
                      materials,
                      on='Name',
                      how='left')
Right_join.to_csv(file_dest, index=False)
However, the output looks like it just copied the contents rather than inserting the looked-up data the way VLOOKUP would. I have tried it with different kinds of data and the results are all the same. Please help me find the bug.
Since column names are different in each data source, you have to specify columns to join on in the left and right dataframes. Try this:
# assuming materials is your data source with the Price column
joined = book.merge(materials,
                    left_on="Custmor",
                    right_on="Name",
                    how="left")
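As a follow-up, a short sketch of writing the joined result out; dropping the duplicated key column and saving to a new file (the output name Book1_joined.csv is just an example) avoids overwriting your source data:
# Drop the lookup key from the data source if you don't need it twice,
# then write to a new CSV instead of overwriting Book1.csv.
joined = joined.drop(columns=["Name"])
joined.to_csv(r"C:\Users\user\Desktop\Book1_joined.csv", index=False)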

How do I convert these 2 columns as seen below in In [10] to a dataframe/table to be able to export to a csv file

Hi, I am very new to Python, and I plan to create a final exportable table of these reviews scraped from a website, to see which words were used most. I have managed to get these 2 columns but have no idea how to proceed. Can I directly export this into a table in Excel, or must I convert it into a dataframe and then export it to a CSV? And what code is required to do that? Thank you so much for your help!
It's convenient to use the pandas library for working with dataframes:
import pandas as pd
series = pd.Series(wordcount)
series.to_csv("wordcount.csv")
However, if you use the code above, you'll get a warning. To fix it, there are 2 ways:
1) Add header parameter:
series.to_csv("wordcount.csv", header=True)
2) Or convert the series to a dataframe and then save it (without the new index):
df = series.reset_index()
df.to_csv("wordcount.csv", index=False)
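For completeness, a minimal end-to-end sketch; the wordcount dictionary here is made up purely for illustration:
import pandas as pd
# Hypothetical word counts standing in for the scraped data.
wordcount = {"great": 42, "service": 31, "slow": 7}
df = pd.Series(wordcount).reset_index()
df.columns = ["word", "count"]
df = df.sort_values("count", ascending=False)  # most-used words first
df.to_csv("wordcount.csv", index=False)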

Python: convert excel data into dataframes

I want to put some data available in an excel file into a dataframe in Python.
The code I use is as below (two examples I use to read an excel file):
d=pd.ExcelFile(fileName).parse('CT_lot4_LDO_3Tbin1')
e=pandas.read_excel(fileName, sheetname='CT_lot4_LDO_3Tbin1',convert_float=True)
The problem is that the dataframe I get has values with only two digits after the decimal point. In other words, Excel values like 0.123456 end up in the dataframe as 0.12.
Some kind of rounding seems to be applied, but I cannot find how to change it.
Can anyone help me?
Thanks for the help!
You can try this. I used test.xlsx, which has two sheets, and 'CT_lot4_LDO_3Tbin1' is the second sheet. I also set the first value as Text format in Excel.
import pandas as pd
fileName = 'test.xlsx'
df = pd.read_excel(fileName,sheetname='CT_lot4_LDO_3Tbin1')
Result:
In [9]: df
Out[9]:
Test
0 0.123456
1 0.123456
2 0.132320
Without seeing the real raw data file, I think this is the best answer I can think of.
Well, when I try:
df = pd.read_csv(r'my file name')
I get something like this in df:
http://imgur.com/a/Q2upp
And I cannot put .fileformat in the statement.
You might be interested in disabling the column datatype inference that pandas performs automatically, by manually specifying the datatype for the column. Here is what you might be looking for:
Python pandas: how to specify data types when reading an Excel file?
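A brief sketch of that idea: reading the column through a str converter so pandas keeps the cell value exactly as stored, with no numeric inference or rounding. The file path and the column name 'MyColumn' are placeholders:
import pandas as pd
fileName = 'test.xlsx'  # placeholder path
# 'MyColumn' is a placeholder for the real column name.
df = pd.read_excel(fileName,
                   sheet_name='CT_lot4_LDO_3Tbin1',
                   converters={'MyColumn': str})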
Using pandas 0.20.1, something like this should work:
df = pd.read_csv('CT_lot4_LDO_3Tbin1.fileformat')
For example, for an Excel file:
df = pd.read_excel('CT_lot4_LDO_3Tbin1.xlsx')
Read this documentation:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Methods of reading in and creating a list/array with excel information from excel

Imagine I am given two columns: a,b,c,d,e,f and e,f,g,h,i,j (commas indicating a new row in the column)
How could I read such information in from Excel and put it into two separate arrays? I would like to manipulate this info and read it off later as part of an output.
You have a few choices here. If your data is rectangular, starts from A1, etc. you should just use pandas.read_excel:
import pandas as pd
df = pd.read_excel("/path/to/excel/file", sheetname = "My Sheet Name")
print(df["column1"].values)
print(df["column2"].values)
If your data is a little messier, and the read_excel options aren't enough to get your data, then I think your only choice will be to use something a little lower level like the fantastic xlrd module (read the quickstart on the README)
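As a rough sketch of that lower-level route with xlrd (the file path, sheet name, and column positions are assumptions):
import xlrd
book = xlrd.open_workbook("/path/to/excel/file")
sheet = book.sheet_by_name("My Sheet Name")
# Pull each column into its own Python list (column indices assumed).
first_col = sheet.col_values(0)   # e.g. a, b, c, d, e, f
second_col = sheet.col_values(1)  # e.g. e, f, g, h, i, j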
