Unexpected result in pandas pivot_table - python

I am trying to do a pivot_table on a pandas DataFrame. I am almost getting the expected result, but it seems to be multiplied by two. I could just divide by two and call it a day, but I want to know whether I am doing something wrong.
Here is the code:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={"IND":[1,2,3,4,5,1,5,5],"DATA":[2,3,4,2,10,4,3,3]})
df_pvt = pd.pivot_table(df, aggfunc=np.size, index=["IND"], columns="DATA")
df_pvt is now:
DATA    2    3    4    10
IND
1     2.0  NaN  2.0  NaN
2     NaN  2.0  NaN  NaN
3     NaN  NaN  2.0  NaN
4     2.0  NaN  NaN  NaN
5     NaN  4.0  NaN  2.0
However, instead of 2.0 it should be 1.0! What am I misunderstanding / doing wrong?

Use the string 'size' instead. This triggers the pandas interpretation of "size", i.e. the number of elements in a group. The NumPy interpretation of size is the product of the lengths of each dimension: because no values column is specified, the slice np.size receives for each group still contains both the IND and DATA columns, so every count comes out as rows × 2, exactly double what you expect.
df_pvt = pd.pivot_table(df, aggfunc='size', index=["IND"], columns="DATA")
print(df_pvt)
DATA    2    3    4    10
IND
1     1.0  NaN  1.0  NaN
2     NaN  1.0  NaN  NaN
3     NaN  NaN  1.0  NaN
4     1.0  NaN  NaN  NaN
5     NaN  2.0  NaN  1.0
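Incidentally, since this is just a count of co-occurrences, pd.crosstab builds the same table directly and fills empty cells with 0 instead of NaN. A minimal sketch, using the same df as above:
counts = pd.crosstab(df["IND"], df["DATA"])  # integer counts; 0 where a pair never occurs
print(counts)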

Related

Combine two pandas DataFrames by index and column

The data looks like this:
df1 =       456089.0  456091.0  456093.0
5428709.0        1.0       1.0       NaN
5428711.0        1.0       1.0       NaN
5428713.0        NaN       NaN       1.0
df2 =       456093.0  456095.0  456097.0
5428711.0        2.0       NaN       NaN
5428713.0        NaN       2.0       NaN
5428715.0        NaN       NaN       2.0
I would like to have this output:
df3 =       456089.0  456091.0  456093.0  456095.0  456097.0
5428709.0        1.0       1.0       NaN       NaN       NaN
5428711.0        1.0       1.0       2.0       NaN       NaN
5428713.0        NaN       NaN       1.0       2.0       NaN
5428715.0        NaN       NaN       NaN       NaN       2.0
I tried several combinations of pd.merge, DataFrame.join, and pd.concat, but nothing worked the way I wanted, since I need to combine the data by both index and column.
Does anyone have an idea how to do this? Thanks in advance!
Let us try sum with concat:
out = pd.concat([df1, df2]).sum(axis=1, level=0, min_count=1).sum(axis=0, level=0, min_count=1)
Out[150]:
           456089.0  456091.0  456093.0  456095.0  456097.0
5428709.0       1.0       1.0       NaN       NaN       NaN
5428711.0       1.0       1.0       2.0       NaN       NaN
5428713.0       NaN       NaN       1.0       2.0       NaN
5428715.0       NaN       NaN       NaN       NaN       2.0
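Note that the level argument of sum has been removed in pandas 2.0. On recent versions the same result can be obtained with groupby; a sketch under the assumption that concat has already aligned the columns into their union, so only the duplicated row labels need collapsing:
out = pd.concat([df1, df2]).groupby(level=0).sum(min_count=1)  # sum rows sharing an index label; min_count=1 keeps NaN where nothing was present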

Rearranging columns based on sequence in pandas dataframe

I have a pandas dataframe as below. I want to rearrange the columns in my dataframe based on the numeric sequence, separately for the XX_ and YY_ columns.
import numpy as np
import pandas as pd
import math
import sys
import re
data = [[np.nan, 2, 5, np.nan, np.nan, 1],
        [np.nan, np.nan, 2, np.nan, np.nan, np.nan],
        [np.nan, 3, np.nan, np.nan, np.nan, np.nan],
        [1, np.nan, np.nan, np.nan, np.nan, 1],
        [np.nan, 2, np.nan, np.nan, 2, np.nan],
        [np.nan, np.nan, np.nan, 2, np.nan, 5]]
df = pd.DataFrame(data, columns=['XX_4', 'XX_2', 'XX_3', 'YY_4', 'YY_2', 'YY_3'])
df
My output dataframe should look like:
   XX_2  XX_3  XX_4  YY_2  YY_3  YY_4
0   2.0   5.0   NaN   NaN   1.0   NaN
1   NaN   2.0   NaN   NaN   NaN   NaN
2   3.0   NaN   NaN   NaN   NaN   NaN
3   NaN   NaN   1.0   NaN   1.0   NaN
4   2.0   NaN   NaN   2.0   NaN   NaN
5   NaN   NaN   NaN   NaN   5.0   2.0
Since this is a small dataframe, I can rearrange the columns manually. Is there any way of doing it based on the _2, _3 suffixes?
IIUC we can use a natural-sort function written by Mark Byers, based on Jeff Atwood's article on sorting alphanumeric strings:
https://stackoverflow.com/a/2669120/9375102
import re

def sorted_nicely(l):
    """Sort the given iterable in the way that humans expect."""
    convert = lambda text: int(text) if text.isdigit() else text
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)

df = pd.DataFrame(data, columns=['XX_9', 'XX_10', 'XX_3', 'YY_9', 'YY_10', 'YY_3'])
cols = df.columns.tolist()
print(df[sorted_nicely(cols)])
   XX_3  XX_9  XX_10  YY_3  YY_9  YY_10
0   5.0   NaN    2.0   1.0   NaN    NaN
1   2.0   NaN    NaN   NaN   NaN    NaN
2   NaN   NaN    3.0   NaN   NaN    NaN
3   NaN   1.0    NaN   1.0   NaN    NaN
4   NaN   NaN    2.0   NaN   NaN    2.0
5   NaN   NaN    NaN   5.0   2.0    NaN
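If the column names always follow a prefix_number pattern, a plain sort key that splits on the last underscore works as well; a sketch assuming every suffix is an integer:
cols = sorted(df.columns, key=lambda c: (c.rsplit('_', 1)[0], int(c.rsplit('_', 1)[1])))  # sort by prefix, then numeric suffix
print(df[cols])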

Cutting a dataframe in a loop

I have a dataset which consists of only one column. I want to cut the column into multiple dataframes.
I use a for loop to build a list containing the positions at which I want to cut the dataframe.
import pandas as pd

df = pd.read_csv("column.csv", delimiter=";", header=0, index_col=0)
number_of_pixels = int(len(df.index))
print("You have " + str(number_of_pixels) + " pixels in your file")
number_of_rows = int(input("Enter number of rows you want to create"))
pixel_counts = []  # this list contains the number of pixels per row
for i in range(0, number_of_rows):  # fill the list with the number of pixels per row
    pixels_per_row = int(input("Enter number of pixels in row " + str(i)))
    pixel_counts.append(pixels_per_row)
print(pixel_counts)
After cutting the column into multiple dataframes I want to transpose each dataframe and concatenate them all back together using:
df1 = df1.reset_index(drop=True)
df1 = df1.T
df2 = df2.reset_index(drop=True)
df2 = df2.T
frames = [df1, df2]
result = pd.concat(frames, axis=0)
print(result)
So I want to create a loop that cuts my dataframe into multiple frames at the positions stored in my list.
Thank you!
This is a problem that is better solved with numpy. I'll start from the point where you have received the list from your user input. The key is to use numpy.split to separate the values at the cumulative pixel counts, and then build a new DataFrame.
Setup
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({'val': np.random.randint(1,10,50)})
lst = [4,10,2,1,15,8,9,1]
Code
pd.DataFrame(np.split(df.val.values, np.cumsum(lst)[:-1]))
Output
   0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
0  3  3.0  7.0  2.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1  4  7.0  2.0  1.0  2.0  1.0  1.0  4.0  5.0  1.0  NaN  NaN  NaN  NaN  NaN
2  1  5.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3  2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4  8  4.0  3.0  5.0  8.0  3.0  5.0  9.0  1.0  8.0  4.0  5.0  7.0  2.0  6.0
5  7  3.0  2.0  9.0  4.0  6.0  1.0  3.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN
6  7  3.0  5.0  5.0  7.0  4.0  1.0  7.0  5.0  NaN  NaN  NaN  NaN  NaN  NaN
7  8  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
If your list asks for more pixels than the total number of rows in your initial DataFrame, you'll get extra all-NaN rows in your output. If your lst sums to less than the total number of pixels, the leftover values are all added to the last row. Since you didn't specify either of these conditions in your question, I'm not sure how you'd want to handle them.
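If you want to surface either mismatch early, one option is to validate the list against the column length before splitting; a minimal sketch:
total = sum(lst)
if total != len(df):
    raise ValueError(f"pixel counts sum to {total}, but the column has {len(df)} values")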

Visualizing multiple dummy variables over time

I have a dataframe with dummy variables for daily weather-type observations.
         date  high_wind  thunder  snow  smoke
0  2050-10-23        1.0      NaN   NaN    NaN
1  2050-10-24        1.0      1.0   NaN    NaN
2  2050-10-25        NaN      NaN   NaN    NaN
3  2050-10-26        NaN      NaN   NaN    1.0
4  2050-10-27        NaN      NaN   NaN    1.0
5  2050-10-28        NaN      NaN   NaN    1.0
6  2050-10-29        1.0      NaN   NaN    NaN
7  2050-10-30        NaN      1.0   NaN    NaN
8  2050-10-31        NaN      1.0   NaN    NaN
9  2050-11-01        1.0      1.0   NaN    NaN
10 2050-11-02        1.0      1.0   NaN    NaN
11 2050-11-03        1.0      1.0   NaN    NaN
12 2050-11-04        1.0      NaN   NaN    NaN
13 2050-11-05        1.0      NaN   NaN    NaN
14 2050-11-06        NaN      NaN   NaN    NaN
15 2050-11-07        NaN      1.0   NaN    NaN
16 2050-11-08        NaN      NaN   NaN    NaN
17 2050-11-09        NaN      NaN   1.0    NaN
18 2050-11-10        NaN      NaN   NaN    NaN
19 2050-11-11        NaN      NaN   1.0    NaN
20 2050-11-12        NaN      NaN   1.0    NaN
21 2050-11-13        NaN      NaN   NaN    NaN
For those of you playing along at home, copy the above and then:
import pandas as pd
df = pd.read_clipboard()
df.date = df.date.apply(pd.to_datetime)
df.set_index('date', inplace=True)
I want to visualize this dataframe with the date on the x axis and each weather type category on the y axis. Here's what I've tried so far:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
labels = df.columns.tolist()
# unsatisfying loop to give categories some y separation
for i, col in enumerate(df.columns):
    ax.scatter(x=df[col].index, y=(df[col] + i))  # add a little to each
ax.set_yticklabels(labels)
ax.set_xlim(df.index.min(), df.index.max())
fig.autofmt_xdate()
Which gives me a plot where the y labels don't line up with the categories (image not shown).
Questions:
How do I get the y labels aligned properly?
Is there a better way to structure the data to make plotting easier?
This aligns your y labels:
ax.set_yticks(range(1, len(df.columns) + 1))
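Without explicit tick positions, set_yticklabels attaches the labels to whatever ticks matplotlib chose automatically, which is why they drift. Setting positions and labels together keeps them paired; a short sketch using the labels list from the question:
ax.set_yticks(range(1, len(df.columns) + 1))
ax.set_yticklabels(labels)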

Delete rows in dataframe based on column values

I need to rid myself of all rows with a null value in column C. Here is the code:
import pandas as pd

infile = "C:\****"
df = pd.read_csv(infile)
A    B    C    D
1    1    NaN  3
2    3    7    NaN
4    5    NaN  8
5    NaN  4    9
NaN  1    2    NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is a NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan, 3],
                   [2, 3, 7, np.nan],
                   [4, 5, np.nan, 8],
                   [5, np.nan, 4, 9],
                   [np.nan, 1, 2, np.nan]],
                  columns=['A', 'B', 'C', 'D'])
df = df[df['C'].notnull()]
df
This is just proof that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
     A    B    C    D
0  1.0  1.0  NaN  3.0
1  2.0  3.0  7.0  NaN
2  4.0  5.0  NaN  8.0
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN

In [101]: df.dropna(subset=['C'])
Out[101]:
     A    B    C    D
1  2.0  3.0  7.0  NaN
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN

In [102]: df[df.C.notnull()]
Out[102]:
     A    B    C    D
1  2.0  3.0  7.0  NaN
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN

In [103]: df = df[df.C.notnull()]

In [104]: df
Out[104]:
     A    B    C    D
1  2.0  3.0  7.0  NaN
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN
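Since the session above shows the filter behaving correctly, an empty result usually means the column does not contain what you expect. A quick diagnostic sketch, assuming df is freshly loaded from your CSV:
print(df['C'].dtype)           # object dtype can mean 'NaN' was read as a literal string
print(df['C'].isnull().sum())  # how many values pandas actually treats as missing
print(len(df))                 # total rows, for comparison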
