python pandas - transforming table

I would like to transform a table which looks similar to the one below:
X|Y|Z|
1|2|3|
3|5|2|
4|2|1|
The result I want to achieve should look like this:
col|1|2|3|4|5|
X |1|0|1|1|0|
Y |0|2|0|0|1|
Z |1|1|1|0|0|
So, after the transformation, the new columns should be the unique values from the previous table, the values should be the counts of appearances, and the index should hold the old column names.
I got stuck and do not know how to handle this because I am a newbie in Python, so thanks in advance for the support.
Regards,
guddy_7

Use apply with value_counts, replace missing values with 0, and transpose with T:
df = df.apply(pd.value_counts).fillna(0).astype(int).T
print(df)
1 2 3 4 5
X 1 0 1 1 0
Y 0 2 0 0 1
Z 1 1 1 0 0
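Note that on newer pandas versions the top-level pd.value_counts is deprecated; a minimal equivalent sketch using Series.value_counts on each column (sample frame rebuilt from the question):
import pandas as pd

df = pd.DataFrame({'X': [1, 3, 4], 'Y': [2, 5, 2], 'Z': [3, 2, 1]})

# count occurrences of each value per column, then transpose so the
# old column names become the index
res = df.apply(lambda s: s.value_counts()).fillna(0).astype(int).T
print(res)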

Related

Dataframe reindexing in order

I have a dataframe like this
datasource datavalue
0 aaaa.pdf 5
0 bbbbb.pdf 5
0 cccc.pdf 9
I don't know if this is the reason, but this seems to be messing up a dash display, so
I would like to reindex it like:
datasource datavalue
0 aaaa.pdf 5
1 bbbbb.pdf 5
2 cccc.pdf 9
I used
data_all.reset_index()
but it is not working, the indices are still 0.
How should it be done?
EDIT1:
Thanks to the two participants who made me notice my mistake.
I should have put
data_all=data_all.reset_index()
Unfortunately it did not go as expected.
Before:
datasource datavalue
0 aaaa.pdf 5
0 bbbbb.pdf 5
0 cccc.pdf 9
Then
data_all.keys()
Index(['datasource', 'datavalue'], dtype='object')
So
data_all.reset_index()
After
index datasource datavalue
0 0 aaaa.pdf 5
1 0 bbbbb.pdf 5
2 0 cccc.pdf 9
data_all.keys()
Index(['index', 'datasource', 'datavalue'], dtype='object')
As you can see, a column "index" was added. I suppose I can drop that column, but I was expecting something that reindexes the df in one step without adding anything.
EDIT2: Turns out drop=True was necessary!
Thanks everybody!
I think this is what you are looking for.
df.reset_index(drop=True, inplace=True)
# drop: do not try to insert the old index into the dataframe columns; this resets the index to the default integer index.
# inplace: whether to modify the DataFrame in place rather than creating a new one.
Try:
data_all = data_all.reset_index(drop=True)
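A minimal runnable sketch of the fix, with the sample frame rebuilt from the question:
import pandas as pd

# duplicated index values, as in the question
data_all = pd.DataFrame(
    {'datasource': ['aaaa.pdf', 'bbbbb.pdf', 'cccc.pdf'], 'datavalue': [5, 5, 9]},
    index=[0, 0, 0],
)

# drop=True discards the old index instead of inserting it as an 'index' column
data_all = data_all.reset_index(drop=True)
print(data_all)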

Python create dataframe provided columnindex & rowindex for multiple tables

All,
I have a dataset I extracted from a JSON file which essentially looks like this; the 'content' doesn't really matter, the point is that I have the indices and the values of an extracted table, content being the table values and the indices respectively:
columnIndex  rowIndex  content
0            0         x
1            0         y
2            0         z
3            0         xx
0            1         yy
1            1         zz
and so on for each row in the extracted table, about 10 rows and columns or so. From that I can easily run a pivot:
pd.pivot(data, index='rowIndex', columns='columnIndex', values='content')
and this will construct the actual data how I need it.
The problem I'm having is that I have data where multiple tables were extracted, so everything is listed together and there is no distinction between the first and second table.
For example
columnIndex  rowIndex  content
0            0         x
1            0         y
2            0         z
3            0         xx
0            1         yy
1            1         zz
0            0         x2
1            0         y2
I understand I can't pivot the data the same way, since I have repeating rows because multiple tables are combined like this. Is there any way I can pivot it the same way, but have it all combined, or even split into individual tables? From what I understand, pivot_table looks like it should do the job, but I can't get it to work.
I'm also very new to this, so I'm figuring this out as I go.
Appreciate any help on this; hope it makes sense.
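One possible sketch: assuming each new table starts whenever columnIndex and rowIndex both reset to 0 (an assumption, since the question does not state how tables are delimited), you can label each table and pivot per group:
import pandas as pd

data = pd.DataFrame({
    'columnIndex': [0, 1, 2, 3, 0, 1, 0, 1],
    'rowIndex':    [0, 0, 0, 0, 1, 1, 0, 0],
    'content':     ['x', 'y', 'z', 'xx', 'yy', 'zz', 'x2', 'y2'],
})

# assign a table id that increments each time both indices reset to 0
table_id = ((data['columnIndex'] == 0) & (data['rowIndex'] == 0)).cumsum()

# pivot each table separately
tables = [
    g.pivot(index='rowIndex', columns='columnIndex', values='content')
    for _, g in data.groupby(table_id)
]
for t in tables:
    print(t)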

extract multiple sub-fields from Pandas dataframe column into a new dataframe

I have a Pandas dataframe (approx. 100k rows) as my input. It is an export from a database, and each of the fields in one of the columns contains one or more records which I need to expand into independent records. For example:
record_id  text_field
0          r0_sub_record1_field1#r0_sub_record1_field2#r0_sub_record2_field1#r0_sub_record2_field2#
1          sub_record1_field1#sub_record1_field2#
2          sub_record1_field1#sub_record1_field2#sub_record2_field1#sub_record2_field2#sub_record3_field1#sub_record3_field2#
The desired result should look like this:
record_id  field1                 field2                 original_record_id
0          r0_sub_record1_field1  r0_sub_record1_field2  0
1          r0_sub_record2_field1  r0_sub_record2_field2  0
2          r1_sub_record1_field1  r1_sub_record1_field2  1
3          r2_sub_record1_field1  r2_sub_record1_field2  2
4          r2_sub_record2_field1  r2_sub_record2_field2  2
5          r2_sub_record3_field1  r2_sub_record3_field2  2
It is quite straightforward to extract the data I need using a loop, but I suspect that is neither the most efficient nor the nicest way.
As I understand it, I cannot use apply or map here, because I am building another dataframe with the extracted data.
Is there a good Pythonic, pandas-style way to solve the problem?
I am using Python 3.7 and Pandas 1.2.1.
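For reference, a minimal sketch rebuilding the input above (values copied from the question's table):
import pandas as pd

df = pd.DataFrame({
    'record_id': [0, 1, 2],
    'text_field': [
        'r0_sub_record1_field1#r0_sub_record1_field2#r0_sub_record2_field1#r0_sub_record2_field2#',
        'sub_record1_field1#sub_record1_field2#',
        'sub_record1_field1#sub_record1_field2#sub_record2_field1#sub_record2_field2#sub_record3_field1#sub_record3_field2#',
    ],
})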
I think you need to strip the trailing #, split on #, explode, and then pair up consecutive fields.
s = df.set_index('record_id')['text_field'].str.rstrip('#').str.split('#').explode()
df2 = pd.DataFrame(s.to_numpy().reshape(-1, 2), columns=['field1', 'field2'])
df2['original_record_id'] = s.index[::2]
print(df2)
                  field1                 field2  original_record_id
0  r0_sub_record1_field1  r0_sub_record1_field2                   0
1  r0_sub_record2_field1  r0_sub_record2_field2                   0
2     sub_record1_field1     sub_record1_field2                   1
3     sub_record1_field1     sub_record1_field2                   2
4     sub_record2_field1     sub_record2_field2                   2
5     sub_record3_field1     sub_record3_field2                   2
Is it what you expect?
s = df['text_field'].str.rstrip('#').str.split('#').explode()
out = pd.DataFrame(s.to_numpy().reshape(-1, 2), columns=['field1', 'field2'])
out.insert(0, 'original_record_id', s.index[::2])
# prepend an 'r<record_id>_' prefix so each field carries its source record
prefix = 'r' + out['original_record_id'].astype(str) + '_'
out[['field1', 'field2']] = out[['field1', 'field2']].radd(prefix, axis=0)
>>> out
   original_record_id                 field1                 field2
0                   0  r0_sub_record1_field1  r0_sub_record1_field2
1                   0  r0_sub_record2_field1  r0_sub_record2_field2
2                   1  r1_sub_record1_field1  r1_sub_record1_field2
3                   2  r2_sub_record1_field1  r2_sub_record1_field2
4                   2  r2_sub_record2_field1  r2_sub_record2_field2
5                   2  r2_sub_record3_field1  r2_sub_record3_field2

How can I drop a column if the last row is NaN

I have found examples of how to remove a column based on all values or a threshold, but I have not been able to find a solution to my particular problem, which is dropping a column if its last row is NaN. The reason for this is that I am using time series data in which the collection of data doesn't all start at the same time, which is fine, but if I used one of the previous solutions it would remove 95% of the dataset. However, I do not want data whose most recent value is NaN, as that means the series is defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
You can use .iloc, .loc and .notna() to sort out your problem.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                   "B": ["t", 2, "y", np.nan],
                   "C": ["x", 3, "z", 6]})
df = df.loc[:, df.iloc[-1, :].notna()]
You can use a boolean Series to select the columns to drop:
df.drop(columns=df.columns[df.iloc[-1].isna()])
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
for col in list(temp_df.columns):
    if pd.isna(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(columns=col)
This will work for you.
Basically what I'm doing here is looping over all columns and checking whether the last entry is NaN, then dropping that column.
temp_df.columns
holds the column labels, and
temp_df.drop(columns=col)
drops the column with that label.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that someone can compare anything they wish to.
Another method I found is isnull(), a function in the pandas library, which works like this:
for col in list(temp_df.columns):
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(columns=col)
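For reference, a minimal runnable sketch of the boolean-mask approach on the sample data from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1, 'x', 4],
                   'B': ['t', 2, 'y', np.nan],
                   'C': ['x', 3, 'z', 6]})

# keep only the columns whose last value is not NaN
print(df.loc[:, df.iloc[-1].notna()])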

Trying to update a dataframe

I have a dataframe (df) which looks like:
0 1 2 3
0 BBG.apples.S BBG.XNGS.bananas.S 0
1 BBG.apples.S BBG.XNGS.oranges.S 0
2 BBG.apples.S BBG.XNGS.pairs.S 0
3 BBG.apples.S BBG.XNGS.mango.S 0
4 BBG.apples.S BBG.XNYS.mango.S 0
5 BBG.XNGS.bananas.S BBG.XNGS.oranges.S 0
6 BBG.XNGS.bananas.S BBG.XNGS.pairs.S 0
7 BBG.XNGS.bananas.S BBG.XNGS.kiwi.S 0
8 BBG.XNGS.oranges.S BBG.XNGS.pairs.S 0
9 BBG.XNGS.oranges.S BBG.XNGS.kiwi.S 0
10 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
11 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
12 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
13 BBG.XNGS.peaches.S BBG.XNGS.kiwi.S 0
I am trying to update a value (first row, third column) in the dataframe using:
for index, row in df.iterrows():
    status = row[3]
    if int(status) == 0:
        df[index]['3'] = 1
but when I print the dataframe out it remains unchanged.
What am I doing wrong?
Replace your last line with:
df.at[index, '3'] = 1
Obviously as mentioned by others you're better off using a vectorized expression instead of iterating, especially for large dataframes.
You can't modify a data frame by iterating like that. See here.
If you only want to modify the element at [1, 3], you can access it directly:
df.loc[1, 3] = 1
If you're trying to turn every 0 in column 3 into a 1, try this:
df.loc[df['3'] == 0, '3'] = 1
EDIT: In addition, the docs for iterrows say that you'll often get a copy back, which is why the operation fails.
If you are trying to update the third column for all rows based on the row having a certain value, as shown in your example code, then it would be much easier to use the where method on the dataframe:
df.loc[:, '3'] = df['3'].where(df['3'] != 0, 1)
Try to update the row using .loc or .iloc (depending on your needs).
For example, in this case:
if int(status) == 0:
    df.loc[index, '3'] = 1
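A minimal runnable sketch of the vectorized approach, using a cut-down hypothetical frame (integer column labels assumed, as in the printout above):
import pandas as pd

# hypothetical two-row sample shaped like the question's data
df = pd.DataFrame({
    1: ['BBG.apples.S', 'BBG.apples.S'],
    2: ['BBG.XNGS.bananas.S', 'BBG.XNGS.oranges.S'],
    3: [0, 0],
})

# vectorized update: set every 0 in column 3 to 1, no iterrows needed
df.loc[df[3] == 0, 3] = 1
print(df)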
