Long-time reader, first-time poster. I've been tracking time for two projects, grouped the data by project and date with pandas, and now want to fill it into an existing Excel template for a client, laid out with dates on the y-axis and projects on the x-axis. But I'm stumped: I can't work out how to convert the multi-index dataframe into the sorted xlsx layout.
Example data I want to sort
|Date | Project | Hours |
|-----------|---------------------------|---------|
|2022-05-09 |Project 1 | 5.50|
|2022-05-09 |Project 1 | 3.75|
|2022-05-11 |Project 2 | 1.50|
|2022-05-11 |Project 2 | 4.75|
etc.
Desired template
|Date |Project 1|Project 2|
|-----------|---------|---------|
|2022-05-09 | 5.5| 3.75|
|2022-05-11 | 4.75| 1.5|
etc...
So far I've tried a very basic iteration using openpyxl that inserts the dates, but I can't figure out how to
a) rearrange the data in pandas so I can simply insert it, or
b) write conditionally in openpyxl for a given date and project.
# code grouping dates and projects
df = df.groupby(["Date", "Project"])["Hours"].sum()
r = 10  # below the template headers, where I would start inserting time tracked
for date, project in df.index:  # the grouped index is a (date, project) MultiIndex
    sheet.cell(row=r, column=1).value = date
    r += 1
I've trawled StackOverflow for answers but am coming up empty. Thanks for any help you can provide.
I think your data sample is not correct: in the 2nd row, instead of 2022-05-09 | Project 1 | 3.75 it should be 2022-05-09 | Project 2 | 3.75. The same with the 4th row.
As I understand it, your data is in long format and your output is wide format. In this case, pivot_table can help:
df.pivot_table(index='Date', columns='Project', values='Hours')
Date Project 1 Project 2
2022-05-09 5.5 3.75
2022-05-11 4.75 1.5
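Applied to the sample above, the full round trip might look like the sketch below (column names taken from the question, and the second and fourth rows corrected as noted in the comment; aggfunc="sum" folds any duplicate Date/Project pairs together):

```python
import pandas as pd

# Sample data from the question, with the corrected project labels
df = pd.DataFrame({
    "Date": ["2022-05-09", "2022-05-09", "2022-05-11", "2022-05-11"],
    "Project": ["Project 1", "Project 2", "Project 2", "Project 1"],
    "Hours": [5.50, 3.75, 1.50, 4.75],
})

# Long format -> wide format: one column per project, one row per date
wide = df.pivot_table(index="Date", columns="Project",
                      values="Hours", aggfunc="sum").reset_index()
print(wide)
```

From here, `wide` can be written row by row into the template with openpyxl, since each row already matches the template's date/project layout.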
Related
I want to combine two datasets in Python based on multiple conditions using pandas.
The two datasets have different numbers of rows:
the first one contains almost 300k entries, while the second one contains almost 1000 entries.
More specifically, the first dataset "A" has the following information:
Path | Line | Severity | Vulnerability | Name | Text | Title
An instance of the content of "A" is this:
src.bla.bla.class.java| 24; medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress
While the second dataset: "B" contains the following information:
Class | Path | DTWC | DR | DW | IDFP
An instance of the content in "B" is this:
y.x.bla.MainActivity | com.lucao.limpazap_11| 0 | 0 | 0 | 0
I want to combine these two datasets as follows:
If A['Name'] is equal to B['Path'] AND B['Class'] is contained in A['Path'],
then
merge the two rows into another data frame "C".
An output example is the following:
Suppose that A contains:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress|
and B contains:
com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
the output should be the following:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress| com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
I'm not sure if this is the best or most efficient way, but I have tested it and it works. My answer is pretty straightforward: loop over the two dataframes and apply the desired conditions.
Suppose the dataset A is df_a and dataset B is df_b.
First we have to add a suffix to every column of df_a and df_b so both rows can be combined later.
df_a.columns= [i+'_A' for i in df_a.columns]
df_b.columns= [i+'_B' for i in df_b.columns]
And then we can apply this loop (DataFrame.append was removed in pandas 2.0, so the matching rows are collected in a list first):
rows = []
# Iterate through df_a
for idx_A, v_A in df_a.iterrows():
    # Iterate through df_b
    for idx_B, v_B in df_b.iterrows():
        # Apply the condition
        if v_A['Name_A'] == v_B['Path_B'] and v_B['Class_B'] in v_A['Path_A']:
            # Cast both series to dictionaries and merge them into one row
            rows.append({**v_A.to_dict(), **v_B.to_dict()})
df_c = pd.DataFrame(rows)
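With ~300k × ~1000 rows the double loop may be slow; a merge-based sketch can avoid it by joining on the equality condition first and filtering by the substring condition afterwards. The toy data below is illustrative only (column names as in the question; the suffixes are added by the merge itself, so they only appear on the overlapping Path column):

```python
import pandas as pd

# Illustrative stand-ins for datasets A and B
df_a = pd.DataFrame({
    "Path": ["src.bla.bla.class.java"],
    "Name": ["hr.kravarscan.enchantedfortress_15"],
})
df_b = pd.DataFrame({
    "Class": ["bla.class", "y.x.bla.MainActivity"],
    "Path": ["hr.kravarscan.enchantedfortress_15", "com.lucao.limpazap_11"],
})

# Equality condition A['Name'] == B['Path'] is handled by the join itself
merged = df_a.merge(df_b, left_on="Name", right_on="Path", suffixes=("_A", "_B"))

# Substring condition: keep rows where B's Class appears inside A's Path
mask = [cls in path for path, cls in zip(merged["Path_A"], merged["Class"])]
df_c = merged[mask]
print(df_c)
```

The join shrinks the candidate pairs first, so the Python-level substring check only runs over actual matches instead of every A×B combination.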
I'm trying something I've never done before and I'm in need of some help.
Basically, I need to filter sections of a pandas dataframe, transpose each filtered section, and then concatenate every resulting section together.
Here's a representation of my dataframe:
df:
id | text_field | text_value
1 Date 2021-06-23
1 Hour 10:50
2 Position City
2 Position Countryside
3 Date 2021-06-22
3 Hour 10:45
I can then use some filtering method to isolate parts of my data:
df.groupby('id').filter(lambda x: True)
test = df.query(' id == 1 ')
test = test[["text_field","text_value"]]
test_t = test.set_index("text_field").T
test_t:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
If I repeat the process looking for rows with id == 3 and then concatenate the result with test_t, I'll have the following:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
text_value | 2021-06-22 | 10:45
I'm aware that performing this with rows where id == 2 will give me other columns, and that's alright too; it's what I want as well.
What I can't figure out is how to do this for every "id" in my dataframe. I wasn't able to create a function or for loop that works. Can somebody help me?
To summarize:
1 - I need to separate my dataframe in sections according with values from the "id" column
2 - After that I need to remove the "id" column and transpose the result
3 - I need to concatenate every resulting dataframe into one big dataframe
You can use pivot_table:
df.pivot_table(
index='id', columns='text_field', values='text_value', aggfunc='first')
Output:
text_field Date Hour Position
id
1 2021-06-23 10:50 NaN
2 NaN NaN City
3 2021-06-22 10:45 NaN
It's not exactly clear how you want to deal with repeating values, though; a description of that would help (id=2 would make a good example).
Update: If you want to ignore the ids and simply concatenate all the values:
pd.DataFrame(df.groupby('text_field')['text_value'].apply(list).to_dict())
Output:
Date Hour Position
0 2021-06-23 10:50 City
1 2021-06-22 10:45 Countryside
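If you'd rather keep the explicit split/transpose/concatenate steps from the question, the loop can be sketched like this (sample data reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 3, 3],
    "text_field": ["Date", "Hour", "Date", "Hour"],
    "text_value": ["2021-06-23", "10:50", "2021-06-22", "10:45"],
})

pieces = []
for _, group in df.groupby("id"):
    # Drop the id, index by text_field, and transpose each section to one row
    pieces.append(group.set_index("text_field")["text_value"].to_frame().T)

# Stack all transposed sections into one big dataframe
result = pd.concat(pieces, ignore_index=True)
print(result)
```

pd.concat aligns the columns by name, so sections that introduce new fields (like id=2's Position) simply add new columns with NaN elsewhere.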
I have imported a .csv file and it contains a column with a serialized array in it.
How can I make 4 columns out of the array? I already tried some things with regex and the phpserialize package but I could not get it done.
This is how the column looks:
forecast
---------------------------------------------------------------------------
a:4:{s:5:"sunny";i:10;s:5:"rainy";i:70;s:8:"thundery";i:0;s:5:"snowy";i:20;}
Now I want the whole column separated into 4 columns like this:
sunny|rainy|thundery|snowy
--------------------------
10 |70 |0 |20
Is there an easy way to do this? Thanks in advance!
If your forecasts are saved as strings in your dataframe, then you can extract your desired values with a regex and pivot the dataframe. Something like this should help get you started (I've added a row with new values just to demonstrate):
>>> df
forecast
0 'a:4:{s:5:"sunny";i:10;s:5:"rainy";i:70;s:8:"t...'
1 'a:4:{s:5:"sunny";i:20;s:5:"rainy";i:80;s:8:"t...'
df.forecast.str.extractall(r'"(?P<column>.*?)";i:(?P<value>\d+)').reset_index(level=0).pivot(index='level_0', columns='column', values='value')
column rainy snowy sunny thundery
level_0
0 70 20 10 0
1 80 10 20 5
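The same regex can be sanity-checked on a single cell in plain Python before applying it column-wide. A minimal sketch (the parse_forecast helper is just for illustration):

```python
import re

def parse_forecast(s):
    # Extract every "key";i:value pair from a PHP-serialized associative array
    return {k: int(v) for k, v in re.findall(r'"(.*?)";i:(\d+)', s)}

row = 'a:4:{s:5:"sunny";i:10;s:5:"rainy";i:70;s:8:"thundery";i:0;s:5:"snowy";i:20;}'
print(parse_forecast(row))
# {'sunny': 10, 'rainy': 70, 'thundery': 0, 'snowy': 20}
```

Note this only handles integer values (the `i:` type tag); a fully general parser for serialized PHP data would be a job for the phpserialize package the question mentions.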
I believe my question can be solved with a loop, but I haven't been able to create one. I have a data sample which looks like this
sample data
And I would like to have dataframe that would be organised by the year:
result data
I tried the pivot function by creating a year column with df['year'] = df.index.year and then reshaping with pivot, but it populates only the first year column because of the index.
I have managed to do this type of reshaping manually, but with several years of data it is a time-consuming solution. Here is the example code for the manual solution:
mydata1 = pd.DataFrame()
mydata2 = pd.DataFrame()
mydata3 = pd.DataFrame()
mydata1['1'] = df['data'].iloc[160:664]
mydata2['2'] = df['data'].iloc[2769:3273]
mydata3['3'] = df['data'].iloc[5583:6087]
mydata1.reset_index(drop=True, inplace=True)
mydata2.reset_index(drop=True, inplace=True)
mydata3.reset_index(drop=True, inplace=True)
mydata = pd.concat([mydata1, mydata2, mydata3],axis=1, ignore_index=True)
mydata.columns = ['78','88','00','05']
Welcome to StackOverflow! I think I understood what you were asking for from your question, but please correct me if I'm wrong. Basically, you want to reshape your current pandas.DataFrame using a pivot. I set up a sample dataset and solved the problem in the following way:
import pandas as pd
#test set
df = pd.DataFrame({'Index':['2.1.2000','3.1.2000','3.1.2001','4.1.2001','3.1.2002','4.1.2002'],
                   'Value':[100,101,110,111,105,104]})
#create a year column for yourself
#by splitting on '.' and selecting year element.
df['Year'] = df['Index'].str.split('.', expand=True)[2]
#pivot your table
pivot = pd.pivot_table(df, index=df.index, columns='Year', values='Value')
#now, in my pivoted test set there should be unwanted null values showing up so
#we can apply another function that drops null values in each column without losing values in other columns
pivot = pivot.apply(lambda x: pd.Series(x.dropna().values))
Result on my end
| Year | 2000 | 2001 | 2002 |
|------|------|------|------|
| 0 | 100 | 110 | 105 |
| 1 | 101 | 111 | 104 |
Hope this solves your problem!
I am working on a project in GIS software where I need to have a column containing dates in the format YYYY-MM-DD. Currently, in Excel I have 3 columns: 1 with the year, 1 with the month and 1 with the day. Looks like this:
| A | B | C |
| 2012 | 1 | 1 |
| 2012 | 2 | 1 |
| 2012 | 3 | 1 |
...etc...
And I need it to look like this:
| A |
| 2012-01-01|
| 2012-02-01|
| 2012-03-01|
I have several workbooks that I need in the same format so I figured that perhaps python would be a useful tool so that I didn't have to manually concatenate everything in Excel.
So, my question is, is there a simple way to not only concatenate these three columns, but to also add a zero in front of the month and day numbers?
I have been experimenting a little bit with the python library openpyxl, but have not come up with anything useful so far. Any help would be appreciated, thanks.
If you're going to be staying in Excel, you may as well just use an Excel formula. If your year, month, and day are in columns A, B and C, you can type this in column D to concatenate them, then format it as a date:
=$A1 & "-" & $B1 & "-" & $C1
To get the zero-padded YYYY-MM-DD text directly, DATE plus TEXT also works:
=TEXT(DATE($A1,$B1,$C1),"yyyy-mm-dd")
Try this:
def DateFromABC(self, ws):
    import datetime
    # Start reading from row 1
    for i, row in enumerate(ws.rows, 1):
        sRow = str(i)
        # datetime goes into column 'D'
        Dcell = ws['D' + sRow]
        # Set datetime from 'ABC'
        Dcell.value = datetime.date(year=ws['A' + sRow].value,
                                    month=ws['B' + sRow].value,
                                    day=ws['C' + sRow].value)
        print('i=%s, type=%s, value=%s' % (i, str(type(Dcell.value)), Dcell.value))
    #end for
#end def DateFromABC
Tested with Python:3.4.2 - openpyxl:2.4.1 - LibreOffice: 4.3.3.2
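Since the question mentions several workbooks, a pandas sketch may also be worth considering: pd.to_datetime accepts a frame whose columns are named year/month/day, and strftime handles the zero padding. The column letters A/B/C are assumed from the question:

```python
import pandas as pd

# Stand-in for the three columns read from one workbook
df = pd.DataFrame({"A": [2012, 2012, 2012], "B": [1, 2, 3], "C": [1, 1, 1]})

# pd.to_datetime recognizes the column names year/month/day
dates = pd.to_datetime(df.rename(columns={"A": "year", "B": "month", "C": "day"}))
df["date"] = dates.dt.strftime("%Y-%m-%d")  # zero-padded YYYY-MM-DD strings
print(df["date"].tolist())
```

With pd.read_excel and pd.ExcelWriter around this, the same few lines could be repeated over each workbook.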