Read an Excel sheet with multiple header rows using pandas - Python

I have an Excel sheet with multiple header rows, like:
_________________________________________________________________________
____|_____| Header1 | Header2 | Header3 |
ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColDK|
1 | ds | 5 | 6 |9 |10 | .......................................
2 | dh | ..........................................................
3 | ge | ..........................................................
4 | ew | ..........................................................
5 | er | ..........................................................
Here you can see that the first two columns do not have headers (they are blank), while the other columns have headers like Header1, Header2 and Header3. I want to read this sheet and merge it with other sheets that have a similar structure, joining on the first column, 'ColX'. Right now I am doing this:
import pandas as pd
totalMergedSheet = pd.DataFrame([1, 2, 3, 4, 5], columns=['ColX'])
file = pd.ExcelFile('ExcelFile.xlsx')
for i in range(1, len(file.sheet_names)):
    df1 = file.parse(file.sheet_names[i-1])
    df2 = file.parse(file.sheet_names[i])
    newMergedSheet = pd.merge(df1, df2, on='ColX')
    totalMergedSheet = pd.merge(totalMergedSheet, newMergedSheet, on='ColX')
But it is not reading the columns correctly, and I don't think it will return the results in the way I want. The resulting frame should look like:
________________________________________________________________________________________________________
____|_____| Header1 | Header2 | Header3 | Header4 | Header5 |
ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColK| ColL|ColM|ColN|ColO||ColP|ColQ|ColR|ColS|
1 | ds | 5 | 6 |9 |10 | ..................................................................................
2 | dh | ...................................................................................
3 | ge | ....................................................................................
4 | ew | ...................................................................................
5 | er | ......................................................................................

Pandas already has a function that will read in an entire Excel spreadsheet for you, so you don't need to manually parse/merge each sheet. Take a look at pandas.read_excel(). It not only lets you read in an Excel file in a single line, it also provides options to help solve the problem you're having.
Since you have subcolumns, what you're looking for is MultiIndexing. By default, pandas will read in the top row as the sole header row. You can pass a header argument into pandas.read_excel() that indicates how many rows are to be used as headers. In your particular case, you'd want header=[0, 1], indicating the first two rows. You might also have multiple sheets, so you can pass sheetname=None as well (this tells it to go through all sheets; in newer pandas versions the argument is spelled sheet_name). The command would be:
df_dict = pandas.read_excel('ExcelFile.xlsx', header=[0, 1], sheetname=None)
This returns a dictionary where the keys are the sheet names, and the values are the DataFrames for each sheet. If you want to collapse it all into one DataFrame, you can simply use pandas.concat:
df = pandas.concat(df_dict.values(), axis=0)

Sometimes the index is a MultiIndex too (as is indeed the case in the OP). To account for that, pass index_col= appropriately.
df_dict = pd.read_excel('Book1.xlsx', header=[0,1], index_col=[0,1], sheetname=None)
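If the goal is the wide layout shown in the question (every sheet's columns side by side, matched on ColX), a minimal sketch, assuming each sheet shares the same ColX/ColY rows, could look like this (sheet_name is the newer spelling of sheetname):
import pandas as pd
df_dict = pd.read_excel('ExcelFile.xlsx', header=[0, 1], index_col=[0, 1], sheet_name=None)
# rows are aligned on the shared (ColX, ColY) index; axis=1 lines the sheets up side by side
merged = pd.concat(df_dict.values(), axis=1)
merged = merged.reset_index()  # optional: bring ColX and ColY back as ordinary columns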

Related

How do I add two columns with different rows with special condition?

Hi, I have a PySpark DataFrame, and I would like to add two columns that come from different rows, under a special condition. One of the columns is a date type.
Here is the example of the data:
--------------------------------
| flag| date | diff |
--------------------------------
| 1 | 2014-05-31 | 0 |
--------------------------------
| 2 | 2014-06-02 | 2 |
--------------------------------
| 3 | 2016-01-14 | 591 |
--------------------------------
| 1 | 2016-07-08 | 0 |
--------------------------------
| 2 | 2016-07-12 | 4 |
--------------------------------
Currently I only know how to add the two columns, by using this code:
from pyspark.sql.functions import expr
dataframe.withColumn("new_column", expr("date_add(date_column, int_column)"))
The expected result:
There's a new column, called "new_date", which is the result of adding the "diff" column to the "date" column.
The catch is there's a special condition: if the "flag" is 1, "date" and "diff" come from the same row; if not, the "date" comes from the previous row.
I am aware that in this scenario, my data has to be correctly sorted.
If anyone could help me, I would be very grateful. Thank you.
You just have to create a column with the previous date using a Window, and then construct the new column depending on the value of 'flag':
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window().partitionBy().orderBy(F.col('date'))
dataframe = dataframe.withColumn('previous_date', F.lag('date', 1).over(w))
dataframe = dataframe.withColumn('new_date',
                                 F.when(F.col('flag') == 1,
                                        F.expr("date_add(previous_date, diff)")
                                 ).otherwise(F.expr("date_add(date, diff)"))
                                ).drop('previous_date')
In case you run into the same issues with Xavier's answer: the idea is the same, but I removed some unnecessary conditions from the Window and fixed the syntax error, as well as the date_add error I faced when I tried his version.
import datetime
from pyspark.sql.functions import *
from pyspark.sql.window import Window

df1 = spark.createDataFrame([(1, datetime.date(2014,5,31), 0), (2, datetime.date(2014,6,2), 2),
                             (3, datetime.date(2016,1,14), 591), (1, datetime.date(2016,7,8), 0),
                             (2, datetime.date(2016,7,12), 4)], ["flag", "date", "diff"])
w = Window.orderBy(col("date"))
df1 = df1.withColumn('previous_date', lag('date', 1).over(w))
df1 = df1.withColumn('new_date', when(col('flag') == 1, expr('date_add(date, diff)'))
                     .otherwise(expr('date_add(previous_date, diff)'))).drop('previous_date')
df1.show()
Output:
+----+----------+----+----------+
|flag| date|diff| new_date|
+----+----------+----+----------+
| 1|2014-05-31| 0|2014-05-31|
| 2|2014-06-02| 2|2014-06-02|
| 3|2016-01-14| 591|2016-01-14|
| 1|2016-07-08| 0|2016-07-08|
| 2|2016-07-12| 4|2016-07-12|
+----+----------+----+----------+

Extracting multiple columns from column in PySpark DataFrame using named regex

Suppose I have a DataFrame df in pySpark of the following form:
| id | type | description |
| 1 | "A" | "Date: 2018/01/01\nDescr: This is a test des\ncription\n |
| 2 | "B" | "Date: 2018/01/02\nDescr: Another test descr\niption\n |
| 3 | "A" | "Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n |
which is of course a dummy set, but will suffice for this example.
I have made a regex-statement with named groups that can be used to extract the relevant information from the description-field, something along the lines of:
^(?:(?:Date: (?P<DATE>.+?)\n)|(?:Descr: (?P<DESCR>.+?)\n)|(?:Warning: (?P<WARNING>.+?)\n)+$
again, dummy regex, the actual regex is somewhat more elaborate, but the purpose is to capture three possible groups:
| DATE | DESCR | WARNING |
| 2018/01/01 | This is a test des\ncription | None |
| 2018/01/02 | Another test descr\niption | None |
| 2018/01/03 | None | This is a warnin\ng, watch out |
Now I would want to add the columns that are the result of the regex match to the original DataFrame (i.e. combine the two dummy tables in this question into one).
I have tried several ways to accomplish this, but none have led to the full solution yet. A thing I've tried is:
def extract_fields(string):
    patt = <ABOVE_PATTERN>
    result = re.match(patt, string, re.DOTALL).groupdict()
    # Actually, a slight work-around is needed to overcome the None problem when
    # no match can be made; I'm using pandas' .str.extract for this now
    return result

df.rdd.map(lambda x: extract_fields(x.description))
This will yield the second table, but I see no way to combine this with the original columns from df. I have tried to construct a new Row(), but then I run into problems with the ordering of columns (and the fact that I cannot hard-code the column names that will be added by the regex groups) that the Row() constructor needs, resulting in a dataframe that has the columns all jumbled up. How can I achieve what I want, i.e. one DataFrame with six columns: id, type, description, DATE, DESCR and WARNING?
Remark. Actually, the description field is not just one field, but several columns. Using concat_ws, I have concatenated these columns into a new column, description, with the description fields separated by \n, but maybe this can be incorporated in a nicer way.
I think you can use pandas features for this case. First I convert df to an RDD to split the description field, then I build a pandas DataFrame and create a Spark DataFrame from it. It works regardless of how many fields the description column contains.
>>> import pandas as pd
>>> import re
>>>
>>> df.show(truncate=False)
+---+----+-----------------------------------------------------------+
|id |type|description |
+---+----+-----------------------------------------------------------+
|1 |A |Date: 2018/01/01\nDescr: This is a test des\ncription\n |
|2 |B |Date: 2018/01/02\nDescr: Another test desc\niption\n |
|3 |A |Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n|
+---+----+-----------------------------------------------------------+
>>> #convert df to rdd
>>> rdd = df.rdd.map(list)
>>> rdd.first()
[1, 'A', 'Date: 2018/01/01\\nDescr: This is a test des\\ncription\\n']
>>>
>>> #split description field
>>> rddSplit = rdd.map(lambda x: (x[0],x[1],re.split('\n(?=[A-Z])', x[2].encode().decode('unicode_escape'))))
>>> rddSplit.first()
(1, 'A', ['Date: 2018/01/01', 'Descr: This is a test des\ncription\n'])
>>>
>>> #create empty Pandas df
>>> df1 = pd.DataFrame()
>>>
>>> #insert rows
>>> for rdd in rddSplit.collect():
...     a = {i.split(':')[0].strip():i.split(':')[1].strip('\n').replace('\n','\\n').strip() for i in rdd[2]}
...     a['id'] = rdd[0]
...     a['type'] = rdd[1]
...     df2 = pd.DataFrame([a], columns=a.keys())
...     df1 = pd.concat([df1, df2])
...
>>> df1
Date Descr Warning id type
0 2018/01/01 This is a test des\ncription NaN 1 A
0 2018/01/02 Another test desc\niption NaN 2 B
0 2018/01/03 NaN This is a warnin\ng, watch out 3 A
>>>
>>> #create spark df
>>> df3 = spark.createDataFrame(df1.fillna('')).replace('',None)
>>> df3.show(truncate=False)
+----------+----------------------------+------------------------------+---+----+
|Date |Descr |Warning |id |type|
+----------+----------------------------+------------------------------+---+----+
|2018/01/01|This is a test des\ncription|null |1 |A |
|2018/01/02|Another test desc\niption |null |2 |B |
|2018/01/03|null |This is a warnin\ng, watch out|3 |A |
+----------+----------------------------+------------------------------+---+----+
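A different sketch that stays entirely inside Spark (an alternative, not what the answer above does): pyspark.sql.functions.regexp_extract can pull each field into its own column next to the existing ones, so no collect() round-trip through pandas is needed. The patterns below are illustrative and assume real newlines between the "Key: value" fields:
from pyspark.sql import functions as F

# group 1 of each pattern is the captured value; regexp_extract returns '' when there is no match
df_extracted = (df
    .withColumn('DATE',    F.regexp_extract('description', r'Date: ([^\n]+)', 1))
    .withColumn('DESCR',   F.regexp_extract('description', r'(?s)Descr: (.+?)\n(?=[A-Z]|$)', 1))
    .withColumn('WARNING', F.regexp_extract('description', r'(?s)Warning: (.+?)\n(?=[A-Z]|$)', 1)))
df_extracted.show(truncate=False)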

How to insert a Pandas Series into a specific column of an existing Excel file (without deleting content from that file)?

I have imported the following data from Excel with pandas:
import pandas as pd
sht = pd.read_excel(path, 'Table', index_col=None, header=None, usecols = "A:C")
sht.head()
|-------+------------+----------|
| jon | tyrion | daenerys |
| sansa | cersei | rhaegar |
| arya | jaime | 0 |
| bran | tywin | 0 |
| robb | 0 | 0 |
| 0 | 0 | 0 |
|-------+------------+----------|
Then I created the following Series (D) in pandas:
D = pd.Series((sht[sht!=0]).values.flatten()).dropna().reset_index(drop=True)
D
|----------|
| jon |
| tyrion |
| daenerys |
| sansa |
| cersei |
| rhaegar |
| arya |
| jaime |
| bran |
| tywin |
| rob |
|----------|
How could I insert the Series D in the column D of sht (the "Table" sheet of my spreadsheet)?
I tried:
writer = pd.ExcelWriter(path, engine='openpyxl')
D.to_excel(writer, 'Table', startrow=0, startcol=4, header=False, index=False)
writer.save()
But it deletes all the other tabs from my spreadsheet and also erases the values in the A:C columns of my spreadsheet...
The pd.ExcelWriter and .to_excel method in pandas overwrite the existing file. You are not modifying the existing file, but are deleting it and writing a new file with the same name.
If you want to write to an existing excel file, you probably want to use openpyxl.
import openpyxl
# open the existing file
wb = openpyxl.load_workbook('myfile.xlsx')
# grab the worksheet. my file has 2 sheets: A-sheet and B-sheet
ws = wb['A-sheet']
# write the series, D, to the 4th column
for row, v in enumerate(D, 1):
    ws.cell(row, 4, v)
# save the changes to the workbook
wb.save('myfile.xlsx')
Try this:
Open the Excel file and store its contents in a pandas object, say df.
Create the new column with df['new column'].
Store the values in the new column; now the pandas object has all your previous values plus the new column.
Save df to the Excel file using ExcelWriter; a sketch of these steps follows below.
PS: when opening and writing an Excel file in pandas there is an option to pick a specific sheet, i.e. sheet_name=
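A minimal sketch of those steps (file and sheet names are placeholders; the caveat from the previous answer still applies, since ExcelWriter recreates the workbook unless you append with mode='a'):
import pandas as pd

# 1) open the Excel file and store the sheet's contents in a DataFrame
df = pd.read_excel('myfile.xlsx', sheet_name='Table')
# 2) + 3) create the new column and store the values in it
df['new column'] = D.reindex(df.index)  # D is the Series built in the question
# 4) save df back to the Excel file using ExcelWriter
with pd.ExcelWriter('myfile.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Table', index=False)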

Treat excel/csv file as file just considering rows, instead of rows and columns

I need to get a specific row from an Excel file, and to do so I am using the pandas library in Python. Here is what I have:
path = r"C:\Users\Utilizador\Desktop\vendas2017.xlsx"
dataframe = pd.read_excel(path)
print dataframe.ix[1]
### I'm getting this, where the second row's values are labelled with the first row's cells (used as column names):
sales | meatMenu | fishMenu | vegetarianMenu ### very first row
323 | 1 | 0 | 0 ### second row
## I want to print something like this instead (plain text, one independent row at a time, with no notion of columns attached):
sales | meatMenu | fishMenu | vegetarianMenu ### very first row
323 | 1 | 0 | 0 ### second row
friday| sunny | Mild | rainy ### third row
1 | 0 | 1 | 0 ### fourth row
I just want the rows as plain text, instead of having the very first row's cells used as column labels to index the other rows. Do you know of any workaround to do this?
Much appreciated, and a happy Easter!
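One possible workaround, sketched under the assumption that the goal is simply to treat every row, including the first, as plain data: pass header=None so pandas does not promote the first row to column labels, then print the raw values row by row.
import pandas as pd

path = r"C:\Users\Utilizador\Desktop\vendas2017.xlsx"
# header=None keeps the first row as ordinary data; the columns are just 0, 1, 2, ...
dataframe = pd.read_excel(path, header=None)

# print each row as plain text, with no column labels attached
for row in dataframe.itertuples(index=False, name=None):
    print(" | ".join(str(v) for v in row))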

Splitting a dataframe into separate CSV files

I have a fairly large csv, looking like this:
+---------+---------+
| Column1 | Column2 |
+---------+---------+
| 1 | 93644 |
| 2 | 63246 |
| 3 | 47790 |
| 3 | 39644 |
| 3 | 32585 |
| 1 | 19593 |
| 1 | 12707 |
| 2 | 53480 |
+---------+---------+
My intent is to
Add a new column
Insert a specific value into that column, 'NewColumnValue', on each row of the csv
Sort the file based on the value in Column1
Split the original CSV into new files based on the contents of 'Column1', removing the header
For example, I want to end up with multiple files that look like:
+---+-------+----------------+
| 1 | 19593 | NewColumnValue |
| 1 | 93644 | NewColumnValue |
| 1 | 12707 | NewColumnValue |
+---+-------+----------------+
+---+-------+-----------------+
| 2 | 63246 | NewColumnValue |
| 2 | 53480 | NewColumnValue |
+---+-------+-----------------+
+---+-------+-----------------+
| 3 | 47790 | NewColumnValue |
| 3 | 39644 | NewColumnValue |
| 3 | 32585 | NewColumnValue |
+---+-------+-----------------+
I have managed to do this using separate .py files:
Step1
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
df.to_csv('ready.csv', index=False, header=False)
Step2
import csv
from itertools import groupby
for key, rows in groupby(csv.reader(open("ready.csv")),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
But I'd really like to learn how to accomplish everything in a single .py file. I tried this:
# -*- coding: utf-8 -*-
# This processes a large CSV file.
# It will add a new column, populate the new column with a uniform piece of data for each row, sort the CSV, and remove headers.
# Then it will split the single large CSV into multiple CSVs based on the value in column 0.
import pandas as pd
import csv
from itertools import groupby
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
for key, rows in groupby(csv.reader((df)),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
but instead of working as intended, it's giving me multiple CSVs named after each column header.
Is that happening because I removed the header row when I used separate .py files and I'm not doing it here? I'm not really certain what operation I need to do when splitting the files to remove the header.
Why not just groupby Column1 and save each group?
df = df.sort_values('Column1').assign(NewColumn='NewColumnValue')
print(df)
   Column1  Column2       NewColumn
0        1    93644  NewColumnValue
5        1    19593  NewColumnValue
6        1    12707  NewColumnValue
1        2    63246  NewColumnValue
7        2    53480  NewColumnValue
2        3    47790  NewColumnValue
3        3    39644  NewColumnValue
4        3    32585  NewColumnValue
for i, g in df.groupby('Column1'):
    g.to_csv('{}.csv'.format(i), header=False, index_label=False)
Thanks to Unatiel for the improvement. header=False will not write headers; note that index_label=False only suppresses the index label, so pass index=False if you also want to drop the index column.
This creates 3 files:
1.csv
2.csv
3.csv
Each having data corresponding to each Column1 group.
You don't need to switch to itertools for the filtering, pandas has all of the necessary functionality built-in.
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')  # Sorting isn't needed
df['NewColumn'] = 'NewColumnValue'
for key in df['Column1'].unique():  # For each value in Column1
    # These two steps can be combined into a single call;
    # I'll separate them for clarity:
    # 1) filter the dataframe on the unique value
    dw = df[df['Column1'] == key]
    # 2) write the resulting dataframe without headers
    dw.to_csv("%s.csv" % key, header=False)
pandas.DataFrame supports a method, to_csv(), to write its data as a CSV. You have no need for the csv module in this case.
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1').set_index('Column1')
df['NewColumn'] = 'NewColumnValue'
for key in df.index.unique():
    df.loc[key].to_csv('%d.csv' % int(key), header=False)
for key in df.index.unique(): will loop over every unique value in the index. In your example, it will loop over (1, 2, 3). header=False will make sure the header isn't written to the output file.
And to explain why you get the wrong output in your example, try print(list(df)). This should output all the columns in df. This is why for key, rows in groupby(csv.reader((df)), ...): iterates over the columns in df.
Actually, you should get 1 CSV for every column in your dataframe, and their contents are likely something like ,[NAME_OF_COLUMN] or maybe ,<itertools.... object at 0x.....>.
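A small sketch of that point (the column names are just taken from the example):
import csv
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2], 'Column2': [93644, 63246]})
print(list(df))             # ['Column1', 'Column2'] -- iterating a DataFrame yields its column names
for row in csv.reader(df):  # so csv.reader parses each column *name* as a line of CSV text
    print(row)              # ['Column1'] then ['Column2'], which is why the files end up named after headers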
