pandas.DataFrame.mul matching on other column - python

I have a pandas DataFrame (ignore the indices of the DataFrame)
Tab Ind Com Val
4 BAS 1 1 10
5 BAS 1 2 5
6 BAS 2 1 20
8 AIR 1 1 5
9 AIR 1 2 2
11 WTR 1 1 2
12 WTR 2 1 1
And a pandas series
Ind
1 1.208333
2 0.857143
dtype: float64
I want to multiply each element of the Val column of the DataFrame with the element of the series that has the same Ind value. How would I approach this? pandas.DataFrame.mul only matches on index, but I don't want to transform the DataFrame.

Looks like pandas.DataFrame.join could solve your problem (this assumes the series is named Ind, so that after the join the original column becomes Ind_orig and temp.Ind holds the mapped values):
temp = df.join(the_series, on='Ind', lsuffix='_orig')
df['ans'] = temp.Val * temp.Ind
Output
Tab Ind Com Val ans
4 BAS 1 1 10 12.083330
5 BAS 1 2 5 6.041665
6 BAS 2 1 20 17.142860
8 AIR 1 1 5 6.041665
9 AIR 1 2 2 2.416666
11 WTR 1 1 2 2.416666
12 WTR 2 1 1 0.857143
Or another way to achieve the same using a more compact syntax (thanks W-B):
df['New'] = df.Ind.map(the_series).values * df.Val
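Put together as a runnable sketch (sample data reconstructed from the question), the map approach is:

```python
import pandas as pd

# sample data reconstructed from the question
df = pd.DataFrame({"Tab": ["BAS", "BAS", "BAS", "AIR", "AIR", "WTR", "WTR"],
                   "Ind": [1, 1, 2, 1, 1, 1, 2],
                   "Com": [1, 2, 1, 1, 2, 1, 1],
                   "Val": [10, 5, 20, 5, 2, 2, 1]})
the_series = pd.Series({1: 1.208333, 2: 0.857143}, name="Ind")

# map looks each 'Ind' value up in the series' index, so no join is needed
df["ans"] = df["Ind"].map(the_series) * df["Val"]
```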

Related

pandas, expand series of dataframes

I have a series that looks like this:
result
3 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
8 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
11 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
14 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
17 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
20 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
How do I produce this result:
ABC American Heroes
3 1 2 3
8 1 2 3
11 1 2 3
14 1 2 3
17 1 2 3
20 1 2 3
This is driving me crazy, because if I concat I lose my index.
Here's my closest try: pd.concat(myDf.tolist(), axis=1)
This is a pretty convoluted structure. I tried reconstructing your series of dataframes this way (I don't see any series with this structure in the link you point to):
df_list = [pd.DataFrame({"ABC":[1],"American":[2],"Heroes":[3]}),
pd.DataFrame({"ABC":[1],"American":[2],"Heroes":[3]}),
pd.DataFrame({"ABC":[1],"American":[2],"Heroes":[3]})]
series = pd.Series(df_list)
And to get what you want:
df = pd.DataFrame(series.apply(lambda x: x.squeeze().to_list()).to_list(),
                  columns=series[0].columns)
Results:
ABC American Heroes
0 1 2 3
1 1 2 3
2 1 2 3
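If the goal is to also keep the series' original index (3, 8, 11, …), a variant using pd.concat plus reassigning the index works too — a sketch under the same assumption that every element is a one-row DataFrame:

```python
import pandas as pd

# a series of one-row DataFrames, reconstructed as above
df_list = [pd.DataFrame({"ABC": [1], "American": [2], "Heroes": [3]})
           for _ in range(3)]
series = pd.Series(df_list, index=[3, 8, 11])

# stack the one-row frames, then restore the series' own index
out = pd.concat(series.tolist(), ignore_index=True)
out.index = series.index
```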

Python - Adding rows to timeseries dataset

I have a pandas dataframe containing retail sales data which shows the total number of a product sold each week and the stock left at the end of the week. Unfortunately, the dataset only shows a row when a product has been sold and the stock left changes.
I would like to bulk out the dataset so that for each week there is a line for each product being sold. I've shown an example of this below - how can this be done?
As-Is:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 3 3 7
To-Be:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 2 0 10
2 3 3 7
Create a dataframe using product from itertools with all the combinations of the columns 'Week' and 'Product', and use merge with your original data. Let's say your dataframe is called dfp:
from itertools import product
new_dfp = (pd.DataFrame(list(product(dfp.Week.unique(), dfp.Product.unique())),
                        columns=['Week', 'Product'])
             .merge(dfp, how='left'))
You get the missing row in new_dfp:
Week Product Sold Stock
0 1 1 1.0 10.0
1 1 2 1.0 10.0
2 1 3 1.0 10.0
3 2 1 2.0 8.0
4 2 2 NaN NaN
5 2 3 3.0 7.0
Now you fillna on both columns with different values:
new_dfp['Sold'] = new_dfp['Sold'].fillna(0).astype(int)  # nothing sold in the missing rows
new_dfp['Stock'] = new_dfp.groupby('Product')['Stock'].ffill().astype(int)
To fill 'Stock', you need to groupby product and forward-fill (ffill), so each missing week carries over the previous week's value. At the end, you get:
Week Product Sold Stock
0 1 1 1 10
1 1 2 1 10
2 1 3 1 10
3 2 1 2 8
4 2 2 0 10
5 2 3 3 7
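An equivalent route, if you prefer index alignment to an explicit merge, is pd.MultiIndex.from_product plus reindex — a sketch on the As-Is data:

```python
import pandas as pd

dfp = pd.DataFrame({"Week": [1, 1, 1, 2, 2],
                    "Product": [1, 2, 3, 1, 3],
                    "Sold": [1, 1, 1, 2, 3],
                    "Stock": [10, 10, 10, 8, 7]})

# the full Week x Product grid; rows absent from dfp come back as NaN
full = pd.MultiIndex.from_product([dfp.Week.unique(), dfp.Product.unique()],
                                  names=["Week", "Product"])
out = dfp.set_index(["Week", "Product"]).reindex(full).reset_index()
out["Sold"] = out["Sold"].fillna(0).astype(int)
out["Stock"] = out.groupby("Product")["Stock"].ffill().astype(int)
```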

Conditional sum from rows into a new column in pandas

I am looking to create a new column in pandas based on the value in each row. My sample data:
df = pd.DataFrame({"A": ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   "Sales": [2, 3, 7, 1, 4, 3, 5, 6, 9, 10],
                   "Week": [1, 2, 3, 4, 5, 11, 1, 2, 3, 4]})
I want a new column "Last3WeekSales" corresponding to each week, having the sum of sales for the previous 3 weeks.
NOTE: Shift() won't work here as data for some weeks is missing.
Logic which I thought:
Checking the week no. in each row, then summing up the data from w-1, w-2, w-3.
Output required:
A Week Last3WeekSales
0 a 1 0
1 a 2 2
2 a 3 5
3 a 4 12
4 a 5 11
5 a 11 0
6 b 1 0
7 b 2 5
8 b 3 11
9 b 4 20
Use groupby, shift and rolling:
df['Last3WeekSales'] = df.groupby('A')['Sales']\
.apply(lambda x: x.shift(1)
.rolling(3, min_periods=1)
.sum())\
.fillna(0)
Output:
A Sales Week Last3WeekSales
0 a 2 1 0.0
1 a 3 2 2.0
2 a 7 3 5.0
3 a 1 4 12.0
4 a 4 5 11.0
5 a 3 6 12.0
6 b 5 1 0.0
7 b 6 2 5.0
8 b 9 3 11.0
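Note the rolling sum above counts the last 3 rows, not the last 3 calendar weeks, so a gap like week 11 would still pick up weeks 3–5. One way to make it week-aware is to reindex each group onto a continuous week range first — a sketch, where last3 is a hypothetical helper:

```python
import pandas as pd

df = pd.DataFrame({"A": ['a'] * 6 + ['b'] * 4,
                   "Sales": [2, 3, 7, 1, 4, 3, 5, 6, 9, 10],
                   "Week": [1, 2, 3, 4, 5, 11, 1, 2, 3, 4]})

def last3(g):
    # fill the missing weeks with 0 sales so the window is truly "last 3 weeks"
    s = g.set_index("Week")["Sales"].reindex(
        range(g.Week.min(), g.Week.max() + 1), fill_value=0)
    rolled = s.shift(1).rolling(3, min_periods=1).sum().fillna(0)
    return pd.Series(rolled.reindex(g.Week).values, index=g.index)

df["Last3WeekSales"] = pd.concat([last3(g) for _, g in df.groupby("A")])
```

With this, week 11 for group 'a' gets 0, matching the output the question asked for.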
You can use a rolling sum over the last 3 values (pandas.rolling_sum in old pandas versions; modern pandas uses .rolling(3).sum()), and shift(n) to shift your column by n rows (1 in your case).
If we suppose you have a column 'Sales' with the sales of each week, the code would be:
df["Last3WeekSales"] = df.groupby("A")["Sales"].apply(lambda x: x.shift(1).rolling(3).sum())

Rearrange misplaced columns

I want to ask about a data cleansing question for which I suppose Python may be more efficient. The data have a lot of misplaced columns, and I have to use characteristics of certain columns to move values to the right positions. Below is an example in Stata code:
forvalues i = 20(-1)2{
local j = `i' + 25
local k = `j' - 2
replace v`j' = v`k' if substr(v23, 1, 4) == "1980"
}
That is, I move contents in columns v25 - v43 backward by 2 if the observation in column v23 starts with "1980". Otherwise, the columns are correct.
Any help is appreciated.
The following is a simplified example to show it works:
In [65]:
# create some dummy data
import pandas as pd
import io
pd.set_option('display.notebook_repr_html', False)
temp = """v21 v22 v23 v24 v25 v28
1 1 19801923 1 5 8
1 1 20003 1 5 8
1 1 9129389 1 5 8
1 1 1980 1 5 8
1 1 1923 2 5 8
1 1 9128983 1 5 8"""
df = pd.read_csv(io.StringIO(temp),sep='\s+')
df
Out[65]:
v21 v22 v23 v24 v25 v28
0 1 1 19801923 1 5 8
1 1 1 20003 1 5 8
2 1 1 9129389 1 5 8
3 1 1 1980 1 5 8
4 1 1 1923 2 5 8
5 1 1 9128983 1 5 8
In [68]:
# I have to convert my data to a string in order for this to work; it may not
# be necessary for you, in which case the following commented-out line would work:
#df.v23.str.startswith('1980')
df.v23.astype(str).str.startswith('1980')
Out[68]:
0 True
1 False
2 False
3 True
4 False
5 False
Name: v23, dtype: bool
In [70]:
# now we can call shift by 2 along the column axis to assign the values back
df.loc[df.v23.astype(str).str.startswith('1980'),['v25','v28']] = df.shift(2,axis=1)
df
Out[70]:
v21 v22 v23 v24 v25 v28
0 1 1 19801923 1 19801923 1
1 1 1 20003 1 5 8
2 1 1 9129389 1 5 8
3 1 1 1980 1 1980 1
4 1 1 1923 2 5 8
5 1 1 9128983 1 5 8
So what you need to do is define the list of columns up front:
In [72]:
target_cols = ['v' + str(x) for x in range(25,44)]
print(target_cols)
['v25', 'v26', 'v27', 'v28', 'v29', 'v30', 'v31', 'v32', 'v33', 'v34', 'v35', 'v36', 'v37', 'v38', 'v39', 'v40', 'v41', 'v42', 'v43']
Now substitute this back into my method and I believe it should work:
df.loc[df.v23.astype(str).str.startswith('1980'),target_cols] = df.shift(2,axis=1)
See the documentation for shift to understand the parameters.
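Putting the pieces together as one runnable sketch (using v25–v27 as stand-ins for the full v25–v43 range, and float data so the dtypes stay uniform):

```python
import pandas as pd

df = pd.DataFrame({"v23": [19801923.0, 20003.0, 1980.0],
                   "v24": [1.0, 1.0, 1.0],
                   "v25": [5.0, 5.0, 5.0],
                   "v26": [8.0, 8.0, 8.0],
                   "v27": [9.0, 9.0, 9.0]})

target_cols = ["v25", "v26", "v27"]
mask = df.v23.astype(int).astype(str).str.startswith("1980")

# for the flagged rows, each target column takes the value two columns to its left
df.loc[mask, target_cols] = df.shift(2, axis=1)
```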

How to convert pandas dataframe so that index is the unique set of values and data is the count of each value?

I have a dataframe from a multiple choice questions and it is formatted like so:
Sex Qu1 Qu2 Qu3
Name
Bob M 1 2 1
John M 3 3 5
Alex M 4 1 2
Jen F 3 2 4
Mary F 4 3 4
The data is a rating from 1 to 5 for the 3 multiple choice questions. I want to rearrange the data so that the index is range(1, 6), where 1='bad', 2='poor', 3='ok', 4='good', 5='excellent', the columns are the same, and the data is the count of occurrences of each value (excluding the Sex column). This is basically a histogram with fixed bin sizes and the x-axis labeled with strings. I like the output of df.plot() much better than df.hist() for this, but I can't figure out how to rearrange the table to give me that histogram. Also, how do you change the x-labels to be strings?
Series.value_counts gives you the histogram you're looking for:
In [9]: df['Qu1'].value_counts()
Out[9]:
4 2
3 2
1 1
So, apply this function to each of those 3 columns:
In [13]: table = df[['Qu1', 'Qu2', 'Qu3']].apply(lambda x: x.value_counts())
In [14]: table
Out[14]:
Qu1 Qu2 Qu3
1 1 1 1
2 NaN 2 1
3 2 2 NaN
4 2 NaN 2
5 NaN NaN 1
In [15]: table = table.fillna(0)
In [16]: table
Out[16]:
Qu1 Qu2 Qu3
1 1 1 1
2 0 2 1
3 2 2 0
4 2 0 2
5 0 0 1
Using table.reindex or table.loc[some_array] you can rearrange the data (the old .ix indexer has been removed from pandas).
To transform to strings, use table.rename:
In [17]: table.rename(index=str)
Out[17]:
Qu1 Qu2 Qu3
1 1 1 1
2 0 2 1
3 2 2 0
4 2 0 2
5 0 0 1
In [18]: table.rename(index=str).index[0]
Out[18]: '1'
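To get the 'bad' … 'excellent' labels directly, pass a mapping to rename — a sketch on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["M", "M", "M", "F", "F"],
                   "Qu1": [1, 3, 4, 3, 4],
                   "Qu2": [2, 3, 1, 2, 3],
                   "Qu3": [1, 5, 2, 4, 4]},
                  index=["Bob", "John", "Alex", "Jen", "Mary"])

labels = {1: "bad", 2: "poor", 3: "ok", 4: "good", 5: "excellent"}
table = (df[["Qu1", "Qu2", "Qu3"]]
         .apply(lambda x: x.value_counts())
         .reindex(range(1, 6))   # make sure all five ratings appear, in order
         .fillna(0)
         .astype(int)
         .rename(index=labels))
```

table.plot(kind='bar') then puts the string labels on the x-axis, answering the second part of the question.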
