Fill pandas blank groupby rows without resetting the index - python

Hi I have a table like this after group by:
t = df.loc[(year-3 <= year) & (year <= year-1), 'Net Sum'].groupby([month, association]).sum()
t
YearMonth  Type
1          Other        27471.73
           base     -14563752.74
           plan      16286620.30
2          Other       754691.36
           base      30465722.53
           plan      17906687.29
3          Other        20285.92
           base      29339325.21
           plan      15492558.91
How can I fill the blanks with the grouped YearMonth value without resetting the index? I'd like to keep YearMonth as the index.
Expected outcome:
t
YearMonth  Type
1          Other        27471.73
1          base     -14563752.74
1          plan      16286620.30
2          Other       754691.36
2          base      30465722.53
2          plan      17906687.29
3          Other        20285.92
3          base      29339325.21
3          plan      15492558.91

I think this can only be achieved by altering the display option:
with pd.option_context('display.multi_sparse', False):
    print(t)
If we refer to the docs:
display.multi_sparse True
“Sparsify” MultiIndex display (don’t display repeated elements in outer levels within groups)
Hence we can set this to False.
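For example, a minimal sketch (a small stand-in Series built by hand from the numbers above, since the original df isn't shown):
import pandas as pd

# A stand-in for the grouped Series t, with a two-level index
idx = pd.MultiIndex.from_product(
    [[1, 2, 3], ['Other', 'base', 'plan']], names=['YearMonth', 'Type'])
t = pd.Series([27471.73, -14563752.74, 16286620.30,
               754691.36, 30465722.53, 17906687.29,
               20285.92, 29339325.21, 15492558.91],
              index=idx, name='Net Sum')

# Temporarily disable sparsified display so the repeated YearMonth labels
# are printed on every row; the underlying index itself is unchanged
with pd.option_context('display.multi_sparse', False):
    print(t)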

The following should do the job:
t.reset_index()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
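For reference, a minimal sketch of what that produces (reusing t from above): it moves YearMonth and Type out of the index and into ordinary columns, which is what the question wanted to avoid.
# Turns the index levels into columns and replaces them with a default RangeIndex
t_flat = t.reset_index()
print(t_flat)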

Related

How to use np.where to create a new column based on previous rows?

I'm kind of new to Python and I've been working on migrating my Excel workflow to pandas because Excel can't handle hundreds of thousands of rows.
I have a table that looks like this in Excel,
where columns A and B are the inputs and column C is the output.
The formula for column C is
=IF(B2="new",A2,C3)
If 'Status' is equal to "new", the result is the value in column A,
and if 'Status' is not equal to "new", the result is the previous row of C.
I tried doing it using np.where and .shift(-1) with this code:
df['Previous'] = np.where(df['Status']=='new', df['Count'], df['Previous'].shift(-1))
but I receive this error:
KeyError: 'Previous'
It seems that I need to define the column 'Previous' first.
I tried searching Stack Overflow, but most of the related solutions are based on more complex problems and I was not able to map them onto my simple problem.
df.columns looks like
Index(['Count', 'Status'], dtype='object')
This is the result of my code once run.
Since you are creating the new column Previous, and that column is not yet defined when you reference it inside its own definition in the np.where() call, you get an error.
Also, your logic is not really taking a "previous" value: when you handle the first row there is no previous value at all, and even for the 2nd and 3rd rows the value is still not defined until you reach the 4th row.
So the solution needs to set a temporary placeholder for rows whose value is still unknown and fill those placeholders in later, once the defining values are known. Here we can use np.nan as the placeholder and then back-fill with .bfill(): a backward fill is used because the rows with index 0, 1 and 2 are filled from the value on the row with index 3.
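For reproducibility, here is a minimal construction of an input frame whose values are assumed to match the output shown further below:
import numpy as np
import pandas as pd

# Hypothetical input data: three blocks of four rows, each block ending with a 'new' row
df = pd.DataFrame({
    'Count': [4, 3, 2, 1, 40, 30, 20, 10, 400, 300, 200, 100],
    'Status': ['old', 'old', 'old', 'new'] * 3,
})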
To solve it, you can try the following:
df['Previous'] = np.where(df['Status']=='new', df['Count'], np.nan)
df['Previous'] = df['Previous'].bfill().astype(int)
print(df)
Count Status Previous
0 4 old 1
1 3 old 1
2 2 old 1
3 1 new 1
4 40 old 10
5 30 old 10
6 20 old 10
7 10 new 10
8 400 old 100
9 300 old 100
10 200 old 100
11 100 new 100
Here, I assumed the dtype of column Count is integer. If it is of string type, then you don't need to use the .astype(int) in the code above.
Alternatively, you can also do it in one step using .where() on column Count, instead of np.where() as follows:
df['Previous'] = df['Count'].where(df['Status'] =='new').bfill().astype(int)
print(df)
Count Status Previous
0 4 old 1
1 3 old 1
2 2 old 1
3 1 new 1
4 40 old 10
5 30 old 10
6 20 old 10
7 10 new 10
8 400 old 100
9 300 old 100
10 200 old 100
11 100 new 100
Similarly, no need to use .astype(int) in the code above if column Count is of string type.
.where() does the following: "Replace values where the condition is False", which is effectively the same as "retain values where the condition is True". So where the condition is True, we keep the values of the original Count column. You might then ask: what value is used where the condition is False? As the official documentation shows, that is controlled by the second parameter, other, which defaults to nan. Since we don't pass it here, nan is used wherever the condition is False, which gives the same effect as passing np.nan for the False case in the np.where() call.
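A minimal sketch of that equivalence (using the same hypothetical Count/Status frame as above):
cond = df['Status'] == 'new'

# All three give Count where the condition holds and NaN elsewhere
a = np.where(cond, df['Count'], np.nan)      # plain ndarray
b = df['Count'].where(cond)                  # other defaults to NaN
c = df['Count'].where(cond, other=np.nan)    # other given explicitly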

Python and pandas, groupby only column in DataFrame

I would like to group some strings in the column called 'tipology' and plot them in a plotly bar chart. The problem is that from the new table created with groupby I can't extract the x and y values to use in the graph:
tipol1 = df.groupby(['tipology']).nunique()
tipol1
the output gives me tipology as the index and the grouping based on how many times the values repeat:
          number  data
tipology
one            2   113
two           33    33
three         12    88
four          44   888
five          11    66
In the number column (which contains other values) it gives me the correct grouping of the tipology column.
Also in the data column it gives me values (I think it is grouping the dates, but not in the correct format).
I also found:
tipol=df.groupby(['tipology']).nunique()
tipol2 = tipol[['number']]
tipol2
to take only the number column, but no luck: I would need the tipology column (not as the index) and the column with the tipology grouping counts, so I can use them as the x and y axes to import into plotly!
One last try I made (making a big mess):
tipol=df.groupby(['tipology'],as_index=False).nunique()
tipol2 = tipol[['number']]
fig = go.Figure(data=[
    go.Bar(name='test', x=df['tipology'], y=tipol2)
])
fig.update_layout(barmode='stack')
fig.show()
Any suggestions? Thanks!
UPDATE
I would have too much code to give a full example; it would be difficult for me and would waste your time too. Basically I need a groupby with an added column that shows the grouping count, e.g.:
tipology Date
home 10/01/18
home 11/01/18
garden 12/01/18
garden 12/01/18
garden 13/01/18
bathroom 13/01/18
bedroom 14/01/18
bedroom 15/01/18
kitchen 16/01/18
kitchen 16/01/18
kitchen 17/01/18
I would like this to happen:
delete the Date column and insert a value column into the DataFrame that holds the count
tipology value
home 2
garden 3
bathroom 1
bedroom 2
kitchen 3
Then (I'm working with a Jupyter notebook),
keeping the Date column and adding the corresponding counts to the value column based on their grouping:
tipology Date value
home 10/01/18 1
home 11/01/18 1
garden 12/01/18 2
garden 12/01/18_____.
garden 13/01/18 1
bathroom 13/01/18 1
bedroom 14/01/18 1
bedroom 15/01/18 1
kitchen 16/01/18 2
kitchen 16/01/18_____.
kitchen 17/01/18 1
I need the columns so that I can assign them to the x and y axes and import them into a graph, so none of the columns should be the index.
By default the method groupby will return a dataframe where the fields you are grouping on will be in the index of the dataframe. You can adjust this behaviour by setting as_index=False in the group by. Then tipology will still be a column in the dataframe that is returned:
tipol1 = df.groupby('tipology', as_index=False).nunique()
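A minimal sketch of how that can feed the plotly figure (assuming, as in the question, that the count column is called number):
import plotly.graph_objects as go

# Group without moving tipology into the index
tipol1 = df.groupby('tipology', as_index=False).nunique()

fig = go.Figure(data=[
    go.Bar(name='test', x=tipol1['tipology'], y=tipol1['number'])
])
fig.update_layout(barmode='stack')
fig.show()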

Finding rows with highest means in dataframe

I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as a reference for where the scan starts. I am trying to find the placed object throughout my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives out this:
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby (base here is the Series of row means produced by df.mean(axis=1)):
moy = base.sort_values().tail(1)
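Another sketch, again assuming base is the Series of row means returned by df.mean(axis=1):
best_label = base.idxmax()     # index label of the row with the highest mean
best_row = df.loc[best_label]  # the corresponding row of the original DataFrame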
It looks as though your data is a string or a single column with a space between your two numbers. I suggest splitting the column into two and/or using something similar to the below to set the index to your specific column of interest.
import pandas as pd
# raw string avoids an invalid-escape warning for the \s+ delimiter
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter=r"\s+")
df = df.set_index("Index")
print(df)

Adding the quantities of products in a dataframe column in Python

I'm trying to calculate the sum of weights in a column of an excel sheet that contains the product title with the help of Numpy/Pandas. I've already managed to load the sheet into a dataframe, and isolate the rows that contain the particular product that I'm looking for:
dframe = xlsfile.parse('Sheet1')
dfFent = dframe[dframe['Product:'].str.contains("ABC") == True]
But I can't seem to find a way to sum up its weights, due to the obvious complexity of the problem (as shown below). For example, if the column 'Product Title' contains values like:
1 gm ABC
98% pure 12 grams ABC
0.25 kg ABC Powder
ABC 5gr
where ABC is the product whose weight I'm looking to add up. Is there any way that I can add these weights all up to get a total of 268 gm? Any help or resources pointing to the solution would be highly appreciated. Thanks! :)
You can use extractall for values with units or percentage:
(?P<a>\d+\.\d+|\d+) means extract float or int to column a
\s* - is zero or more spaces between number and unit
(?P<b>[a-z%]+) is extract lowercase unit or percentage after number to b
# add all possible units to a dictionary
d = {'gm':1,'gr':1,'grams':1,'kg':1000,'%':.01}
df1 = df['Product:'].str.extractall(r'(?P<a>\d+\.\d+|\d+)\s*(?P<b>[a-z%]+)')
print (df1)
            a      b
  match
0 0         1     gm
1 0        98      %
  1        12  grams
2 0      0.25     kg
3 0         5     gr
Then convert the first column to numeric and map the second by the dictionary of units. Then reshape with unstack, multiply the columns with prod, and finally sum:
a = df1['a'].astype(float).mul(df1['b'].map(d)).unstack().prod(axis=1).sum()
print (a)
267.76
A similar solution:
a = df1['a'].astype(float).mul(df1['b'].map(d)).prod(level=0).sum()
(Note that in newer pandas versions the level argument of prod was removed; use .groupby(level=0).prod() there instead.)
You need to do some data wrangling to get the column into a consistent format. You may do some matching to get the Product column aligned and consistent, similar to date-time formatting.
For example, you could do the following (a rough sketch follows below):
Make a separate column with only the numeric values (floats)
Change % values to decimals and multiply them by the quantity
Convert values given in kg to grams
With no strings left, sum the numeric column to get the total.
Pandas can work well with this problem.
Note: There is no shortcut to this problem, you need to get rid of strings mixed with decimal values for calculation of sum.
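A rough sketch of those steps (assuming the same unit conventions as the first answer; the helper name title_to_grams is made up for illustration):
import re
import pandas as pd

# Factors to convert every recognised unit to grams; '%' acts as a multiplier
unit_factor = {'gm': 1, 'gr': 1, 'grams': 1, 'kg': 1000, '%': 0.01}

def title_to_grams(title):
    """Multiply together every '<number><unit>' pair found in one product title."""
    pairs = re.findall(r'(\d+\.\d+|\d+)\s*([a-z%]+)', title)
    if not pairs:
        return 0.0
    total = 1.0
    for num, unit in pairs:
        # Unknown units are treated as a factor of 1 in this sketch
        total *= float(num) * unit_factor.get(unit, 1)
    return total

dfFent = dframe[dframe['Product:'].str.contains('ABC')]
total_grams = dfFent['Product:'].apply(title_to_grams).sum()
print(total_grams)   # 267.76 for the four example titles above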

Pandas/Python - Updating dataframes based on value match

I want to update the NaN values in the mergeAllGB.Intensity column with values from another dataframe where SId, Weekday and Hour match. I'm trying:
mergeAllGB.Intensity[mergeAllGB.Intensity.isnull()] = precip_hourly[precip_hourly.SId == mergeAllGB.SId & precip_hourly.Hour == mergeAllGB.Hour & precip_hourly.Weekday == mergeAllGB.Weekday].Intensity
However, this returns ValueError: Series lengths must match to compare. How could I do this?
Minimal example:
Inputs:
_______
mergeAllGB
SId Hour Weekday Intensity
1 12 5 NaN
2 5 6 3
precip_hourly
SId Hour Weekday Intensity
1 12 5 2
Desired output:
________
mergeAllGB
SId Hour Weekday Intensity
1 12 5 2
2 5 6 3
TL;DR this will (hopefully) work:
# Set the index to compare by
df = mergeAllGB.set_index(["SId", "Hour", "Weekday"])
fill_df = precip_hourly.set_index(["SId", "Hour", "Weekday"])
# Fill the nulls with the relevant values of intensity
df["Intensity"] = df.Intensity.fillna(fill_df.Intensity)
# Cancel the special indexes
mergeAllGB = df.reset_index()
Alternatively, the line before the last could be
df.loc[df.Intensity.isnull(), "Intensity"] = fill_df.Intensity
Assignment and comparison in pandas are done by index (which isn't shown in your example).
In the example, running precip_hourly.SId == mergeAllGB.SId results in ValueError: Can only compare identically-labeled Series objects. This is because we try to compare the two columns by value, but precip_hourly doesn't have a row with index 1 (default indexing starts at 0), so the comparison fails.
Even if we assume the comparison succeeded, the assignment stage is problematic.
Pandas tries to assign according to the index - but this doesn't have the intended meaning.
Luckily, we can use this to our benefit: by setting the index to ["SId", "Hour", "Weekday"], any comparison and assignment is done in relation to this index, so running df.Intensity = fill_df.Intensity would assign to df.Intensity the values in fill_df.Intensity wherever the indexes match, that is, wherever they have the same ["SId", "Hour", "Weekday"].
In order to assign only where Intensity is NA, we need to filter first (or use fillna). Note that filtering by df.Intensity[df.Intensity.isnull()] will work, but assigning to it will probably fail if you have several rows with the same (SId, Hour, Weekday) values.
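Putting it together on the minimal example above (a sketch, with the two frames constructed by hand):
import numpy as np
import pandas as pd

mergeAllGB = pd.DataFrame({'SId': [1, 2], 'Hour': [12, 5],
                           'Weekday': [5, 6], 'Intensity': [np.nan, 3]})
precip_hourly = pd.DataFrame({'SId': [1], 'Hour': [12],
                              'Weekday': [5], 'Intensity': [2]})

df = mergeAllGB.set_index(["SId", "Hour", "Weekday"])
fill_df = precip_hourly.set_index(["SId", "Hour", "Weekday"])
df["Intensity"] = df.Intensity.fillna(fill_df.Intensity)
mergeAllGB = df.reset_index()
print(mergeAllGB)
# SId=1, Hour=12, Weekday=5 -> Intensity 2.0 (filled from precip_hourly)
# SId=2, Hour=5,  Weekday=6 -> Intensity 3.0 (unchanged)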
