Python create dataframe provided columnindex & rowindex for multiple tables - python

All,
I have a dataset I extracted from a JSON file which essentially looks like this. The 'content' doesn't really matter; the point is that I have the table values (content) and their column and row indices.
columnIndex  rowIndex  content
0            0         x
1            0         y
2            0         z
3            0         xx
0            1         yy
1            1         zz
and so on for each row in the table extracted, about 10 rows and columns or so. So from that I can easily run a pivot:
pd.pivot(data, index='rowIndex', columns='columnIndex', values='content')
and this will construct the actual data how I need it.
The problem I'm having is I have data where there are multiple tables extracted so everything is listed together and there is no distinction between the first and second table.
For example
columnIndex  rowIndex  content
0            0         x
1            0         y
2            0         z
3            0         xx
0            1         yy
1            1         zz
0            0         x2
1            0         y2
I understand I can't pivot the data the same way since I have repeating rows due to multiple tables combined like this. Is there any way I can pivot it the same way but just have it all combined or even split as individual tables? From what I understand pivot_table looks like it should do the job but I can't get this to work.
I'm also very new to this so figuring this out as I go.
Appreciate any help on this; I hope it makes sense.
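One possible approach, sketched under the assumption that every extracted table starts at cell (0, 0) and that the rows of each table are listed together: flag each (0, 0) row as the start of a new table, take a cumulative sum to get a table number, and pivot per table. The `table_id` column name and the sample frame are made up for illustration.

```python
import pandas as pd

# Sample data shaped like the question's second listing (two tables back to back)
data = pd.DataFrame({
    "columnIndex": [0, 1, 2, 3, 0, 1, 0, 1],
    "rowIndex":    [0, 0, 0, 0, 1, 1, 0, 0],
    "content":     ["x", "y", "z", "xx", "yy", "zz", "x2", "y2"],
})

# Assumption: a new table begins whenever cell (0, 0) appears again
starts = (data["rowIndex"] == 0) & (data["columnIndex"] == 0)
data["table_id"] = starts.cumsum() - 1

# Option 1: one pivoted DataFrame per table
tables = {
    table_id: group.pivot(index="rowIndex", columns="columnIndex", values="content")
    for table_id, group in data.groupby("table_id")
}

# Option 2: everything in a single frame, indexed by (table_id, rowIndex)
combined = (
    data.set_index(["table_id", "rowIndex", "columnIndex"])["content"]
    .unstack("columnIndex")
)
```

Cells that a table does not have (for example columns 2 and 3 in the second row of the first table) come out as NaN. If the tables do not reliably start at (0, 0), the split condition would need to be adapted to whatever marks a table boundary in the JSON.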

Related

Column by column pairplotting of 2 dataframes

I want to be able to plot two dataframes against each other pairing each column successively (but not all columns against all columns). The dataframes are identical in size and column headers but differ in the values. So the dataframes are of the form:
df_X =
   A  B  C
0  1  1  1
1  2  2  2
...
df_Y =
   A  B  C
0  3  3  3
1  4  4  4
...
At the moment I can do this manually on subplots by starting with a merged dataframe with two header rows:
df_merge =
col  A     B     C
     X  Y  X  Y  X  Y
0    1  3  1  3  1  3
1    2  4  2  4  2  4
...
import matplotlib.pyplot as plt

col = ['A', 'B', 'C']
_, ax = plt.subplots(3, 1)
for i in range(3):
    ax[i].scatter(df_merge[col[i]]['X'], df_merge[col[i]]['Y'])
This works, but I am wondering if there is a better way of achieving this, particularly when trying to then calculate the numerical correlation value between the pairs, which would again involve another loop and several more lines of code.
You can get correlation with something like:
df_merge[["{}_X".format(col[i]), "{}_Y".format(col[i])]].corr()
You can generally assume that most statistical functions can be applied in a single line to dataframe content either with built-in Pandas functions (https://pandas.pydata.org/docs/user_guide/computation.html), or scipy/numpy functions which you can apply.
To title each plot with the correlation, for example, you can do
thisAX.set_title("Corr: {}".format(df_merge[["{}_X".format(col[i]), "{}_Y".format(col[i])]].corr()))
(I flattened your column names to make display a bit simpler, and I reversed one of the number pairs to show negative correlation)
Note: when calling .corr() on a two-column selection, you'll get a dataframe (the correlation matrix) returned. To get the X:Y correlation, you can pick out a single value with .corr()["{}_X".format(col[i])]["{}_Y".format(col[i])] (those are just the column and index names of the correlation matrix).
Here's a lightly styled version of the same plot (again, using the flattened version of your column names)
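If the two original dataframes are still available, the per-column correlations can also be computed without a loop or a merge. The sketch below uses `DataFrame.corrwith`, which correlates matching columns of two frames; the sample values are made up (column B of `df_Y` is reversed to give a negative correlation) and are not the asker's data.

```python
import pandas as pd

# Made-up sample frames with identical shape and column headers
df_X = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 2, 3, 4], "C": [1, 2, 3, 4]})
df_Y = pd.DataFrame({"A": [3, 4, 5, 6], "B": [6, 5, 4, 3], "C": [3, 4, 5, 7]})

# Pearson correlation between df_X[c] and df_Y[c] for every shared column c
pair_corr = df_X.corrwith(df_Y)
print(pair_corr)
```

The result is a Series indexed by column name, so a value such as `pair_corr["A"]` can be dropped straight into a subplot title.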

Pandas DataFrame MultiIndex Pivot - Remove Empty Headers and Axis Rows

This is closely related to the question I asked earlier here: Python Pandas Dataframe Pivot Table Column and Values Order. Thanks again for the help. Very much appreciated.
I'm trying to automate a report that will be distributed via email to a large audience so it needs to look "pretty" :)
I'm having trouble resetting/removing the Indexes and/or Axis post-Pivots to enable me to use the .style CSS functions (i.e. creating a Styler Object out of the df) to make the table look nice.
I have a DataFrame where two of the principal fields (in my example here they are "Name" and "Bucket") will be variable. The desired display order will also change (so it can't be hard-coded) but it can be derived earlier in the application (e.g. "Name_Rank" and "Bucket_Rank") into Integer "Sorting Values" which can be easily sorted (and theoretically dropped later).
I can drop the column Sorting Value but not the Row/Header/Axis(?). Additionally, no matter what I try I just can't seem to get rid of the blank row between the headers and the DataTable.
I (think) I need to set the Index = Bucket and Headers = "Name" and "TDY/Change" to use the .style style object functionality properly.
import pandas as pd
import numpy as np
data = [
    ['AAA',2,'X',3,5,1],
    ['AAA',2,'Y',1,10,2],
    ['AAA',2,'Z',2,15,3],
    ['BBB',3,'X',3,15,3],
    ['BBB',3,'Y',1,10,2],
    ['BBB',3,'Z',2,5,1],
    ['CCC',1,'X',3,10,2],
    ['CCC',1,'Y',1,15,3],
    ['CCC',1,'Z',2,5,1],
]
df = pd.DataFrame(data, columns =
    ['Name','Name_Rank','Bucket','Bucket_Rank','Price','Change'])
display(df)
  Name  Name_Rank Bucket  Bucket_Rank  Price  Change
0  AAA          2      X            3      5       1
1  AAA          2      Y            1     10       2
2  AAA          2      Z            2     15       3
3  BBB          3      X            3     15       3
4  BBB          3      Y            1     10       2
5  BBB          3      Z            2      5       1
6  CCC          1      X            3     10       2
7  CCC          1      Y            1     15       3
8  CCC          1      Z            2      5       1
Based on the prior question/answer I can pretty much get the table into the right format:
df2 = (pd.pivot_table(df, values=['Price','Change'], index=['Bucket_Rank','Bucket'],
                      columns=['Name_Rank','Name'], aggfunc=np.mean)
       .swaplevel(1,0,axis=1)
       .sort_index(level=0,axis=1)
       .reindex(['Price','Change'],level=1,axis=1)
       .swaplevel(2,1,axis=1)
       .rename_axis(columns=[None,None,None])
      ).reset_index().drop('Bucket_Rank',axis=1).set_index('Bucket').rename_axis(columns=[None,None,None])
which looks like this:
        1              2              3
        CCC            AAA            BBB
        Price  Change  Price  Change  Price  Change
Bucket
Y       15     3       10     2       10     2
Z       5      1       15     3       5      1
X       10     2       5      1       15     3
Ok, so...
A) How do I get rid of the Row/Header/Axis(?) that used to be "Name_Rank" (e.g. the integer "Sorting Values" 1,2,3). I figured a hack where the df is exported to XLS/re-imported with Header=(1,2) but that can't be the best way to accomplish the objective.
B) How do I get rid of the blank row above the data in the table? From what I've read online it seems like you should "rename_axis=[None]" but this doesn't seem to work no matter which order I try.
C) Is there a way to set the Header(s) so that both what used to be the "Name" and "Price/Change" rows are Headers, so that the .style functionality can be employed to format them separately from the data in the table below?
Thanks a lot for whatever suggestions anyone might have. I'm totally stuck!
Cheers,
Devon
In pandas 1.4.0 the options for A and B are directly available using the Styler.hide method.
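For A and B there is also a plain-pandas route that does not depend on the Styler at all: drop the ranking levels once they have done their sorting job, and clear the remaining axis names. This is only a sketch built on the question's sample data; the names `wide` and `clean` are made up, and the pivot is written slightly differently from the question's chain (Price/Change end up in alphabetical order here).

```python
import pandas as pd

data = [
    ["AAA", 2, "X", 3, 5, 1], ["AAA", 2, "Y", 1, 10, 2], ["AAA", 2, "Z", 2, 15, 3],
    ["BBB", 3, "X", 3, 15, 3], ["BBB", 3, "Y", 1, 10, 2], ["BBB", 3, "Z", 2, 5, 1],
    ["CCC", 1, "X", 3, 10, 2], ["CCC", 1, "Y", 1, 15, 3], ["CCC", 1, "Z", 2, 5, 1],
]
df = pd.DataFrame(
    data, columns=["Name", "Name_Rank", "Bucket", "Bucket_Rank", "Price", "Change"]
)

wide = (
    df.pivot_table(values=["Price", "Change"], index=["Bucket_Rank", "Bucket"],
                   columns=["Name_Rank", "Name"], aggfunc="mean")
    .reorder_levels([1, 2, 0], axis=1)  # column levels: Name_Rank, Name, measure
    .sort_index(axis=1)                 # columns ordered by Name_Rank
)

# A) drop the sorting levels now that rows and columns are in rank order
clean = wide.droplevel("Bucket_Rank", axis=0).droplevel("Name_Rank", axis=1)

# B) the blank row is the line reserved for axis names, so clear them
clean.index.name = None
clean.columns.names = [None, None]
```

On pandas 1.4 or later, the Styler-side equivalent should be along the lines of `df2.style.hide(axis="columns", level=0).hide(axis="index", names=True)`, which hides the rank level and the index-name row at render time without changing the dataframe; treat that call chain as an untested suggestion.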

extract multiple sub-fields from Pandas dataframe column into a new dataframe

I have a Pandas dataframe (approx 100k rows) as my input. It is an export from a database, and each of the fields in one of the columns contain one or more records which I need to expand into independent records. For example:
record_id  text_field
0          r0_sub_record1_field1#r0_sub_record1_field2#r0_sub_record2_field1#r0_sub_record2_field2#
1          sub_record1_field1#sub_record1_field2#
2          sub_record1_field1#sub_record1_field2#sub_record2_field1#sub_record2_field2#sub_record3_field1#sub_record3_field2#
The desired result should look like this:
record_id  field1                 field2                 original_record_id
0          r0_sub_record1_field1  r0_sub_record1_field2  0
1          r0_sub_record2_field1  r0_sub_record2_field2  0
2          r1_sub_record1_field1  r1_sub_record1_field2  1
3          r2_sub_record1_field1  r2_sub_record1_field2  2
4          r2_sub_record2_field1  r2_sub_record2_field2  2
5          r2_sub_record3_field1  r2_sub_record3_field2  2
It is quite straight-forward how to extract the data I need using a loop, but I suspect it is not the most efficient and also not the nicest way.
As I understand it, I cannot use apply or map here, because I am building another dataframe with the extracted data.
Is there a good Python-esque and Panda-style way to solve the problem?
I am using Python 3.7 and Pandas 1.2.1.
I think you need to explode based on # then split the # text.
df1 = df.assign(t=df['text_field'].str.split('#')
               ).drop('text_field', axis=1).explode('t').reset_index(drop=True)
df2 = df1.join(df1['t'].str.split('#', expand=True)).drop('t', axis=1)
print(df2.dropna())
record_id 0 1
0 0 r0_sub_record1_field1 r0_sub_record1_field2
1 0 r0_sub_record2_field1 r0_sub_record2_field2
3 1 sub_record1_field1 sub_record1_field2
5 2 sub_record1_field1 sub_record1_field2
6 2 sub_record2_field1 sub_record2_field2
7 2 sub_record3_field1 sub_record3_field2
Is it what you expect?
out = df['text_field'].str.strip('#').str.split('#').explode() \
        .str.split('#').apply(pd.Series)
prefix = 'r' + out.index.map(str) + '_'
out = out.apply(lambda v: prefix + v).reset_index() \
         .rename(columns={0: 'field1', 1: 'field2', 'index': 'original_record_id'})
>>> out
original_record_id field1 field2
0 0 r0_sub_record1_field1 r0_sub_record1_field2
1 0 r0_sub_record2_field1 r0_sub_record2_field2
2 1 r1_sub_record1_field1 r1_sub_record1_field2
3 2 r2_sub_record1_field1 r2_sub_record1_field2
4 2 r2_sub_record2_field1 r2_sub_record2_field2
5 2 r2_sub_record3_field1 r2_sub_record3_field2
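With the sample exactly as posted, where '#' is the only delimiter, splitting on '#' yields one field per row, so the fields still have to be paired up into sub-records. A minimal sketch of one way to do that, assuming every sub-record has exactly two fields and every string ends with a trailing '#'; the helper names `fields`, `pos` and `long` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "record_id": [0, 1, 2],
    "text_field": [
        "r0_sub_record1_field1#r0_sub_record1_field2#r0_sub_record2_field1#r0_sub_record2_field2#",
        "sub_record1_field1#sub_record1_field2#",
        "sub_record1_field1#sub_record1_field2#sub_record2_field1#sub_record2_field2#sub_record3_field1#sub_record3_field2#",
    ],
})

# One row per field, indexed by the original record_id
fields = (
    df.set_index("record_id")["text_field"]
    .str.strip("#")
    .str.split("#")
    .explode()
)

# Position of each field inside its original record: 0, 1, 2, 3, ...
pos = fields.groupby(level=0).cumcount()

# Assumption: two fields per sub-record, so pos // 2 numbers the sub-record
long = pd.DataFrame({
    "original_record_id": fields.index,
    "sub_record": (pos // 2).to_numpy(),
    "field": ("field" + (pos % 2 + 1).astype(str)).to_numpy(),
    "value": fields.to_numpy(),
})

out = (
    long.set_index(["original_record_id", "sub_record", "field"])["value"]
    .unstack("field")
    .reset_index()
    .drop(columns="sub_record")
)
out.columns.name = None
```

The new default integer index of `out` plays the role of the new record_id. This keeps the field values as they appear in the input and does not add the r1_/r2_ prefixes shown in the desired output.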

python pandas - transforming table

I would like to transform a table which looks similar to the one below:
X|Y|Z|
1|2|3|
3|5|2|
4|2|1|
The result I want to achieve should look like this:
col|1|2|3|4|5|
X |1|0|1|1|0|
Y |0|2|0|0|1|
Z |1|1|1|0|0|
So, after transformation the new columns should be unique values from previous table, the new values should be populated with count/appearance, and in the index should be the old column names.
I got stuck and I do not know how to handle this because I am a newbie in Python, so thanks in advance for your support.
Regards,
guddy_7
Use apply with value_counts, replace missing values with 0 and transpose with T:
df = df.apply(pd.Series.value_counts).fillna(0).astype(int).T
print (df)
1 2 3 4 5
X 1 0 1 1 0
Y 0 2 0 0 1
Z 1 1 1 0 0
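An alternative route to the same count table, shown only as a sketch: melt the frame to long form and cross-tabulate the old column names against the values. The `variable` and `value` column names are the defaults produced by `DataFrame.melt`.

```python
import pandas as pd

df = pd.DataFrame({"X": [1, 3, 4], "Y": [2, 5, 2], "Z": [3, 2, 1]})

# Long form: one row per (old column name, value) pair
long = df.melt()

# Count how often each value occurs under each old column name
result = pd.crosstab(long["variable"], long["value"])
print(result)
```

`crosstab` fills absent combinations with 0 and keeps integer counts, so no `fillna` or `astype` step is needed.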

overwrite slice of multi-index dataframe with series

I have a multi-index dataframe and want to set a slice of one of its columns equal to a series, with the values aligned by index label. The column slice's innermost index and the series' index contain the same labels, only in a different order (see example below).
I can do this by first sorting the series' index according to the column's index and then using series.values (see below), but this feels like a workaround and I was wondering if it's possible to directly assign the series to the column slice.
example:
import pandas as pd
multi_index=pd.MultiIndex.from_product([['a','b'],['x','y']])
df=pd.DataFrame(0,multi_index,['p','q'])
s1=pd.Series([1,2],['y','x'])
df.loc['a','p']=s1[df.loc['a','p'].index].values
The code above gives the desired output, but I was wondering if the last line could be done simpler, e.g.:
df.loc['a','p']=s1
but this sets the column slice to NaNs.
Desired output:
     p  q
a x  2  0
  y  1  0
b x  0  0
  y  0  0
obtained output from df.loc['a','p']=s1:
       p  q
a x  NaN  0
  y  NaN  0
b x  0.0  0
  y  0.0  0
It seems like a simple issue to me but I haven't been able to find the answer anywhere.
Have you tried something like that?
df.loc['a']['p'] = s1
Resulting df is here
     p  q
a x  2  0
  y  1  0
b x  0  0
  y  0  0
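A caveat on the chained form df.loc['a']['p'] = s1: it assigns into an intermediate object, so whether df itself changes depends on the pandas version, and with copy-on-write enabled it does not write back. A sketch of an alternative that lets pandas do the label alignment: give the series the missing outer level and use `DataFrame.update`. The name `s1_full` is made up for illustration.

```python
import pandas as pd

multi_index = pd.MultiIndex.from_product([["a", "b"], ["x", "y"]])
df = pd.DataFrame(0, multi_index, ["p", "q"])
s1 = pd.Series([1, 2], ["y", "x"])

# Prepend the outer level 'a' so the series index matches df's MultiIndex
s1_full = pd.concat({"a": s1})

# update() aligns on index labels and overwrites only where s1_full has values
df.update(s1_full.to_frame("p"))
print(df)
```

Rows under 'b' are untouched because the series has no entries for them. Depending on the pandas version, the updated column may come back as float rather than int.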
