Python Pandas Groupby and Aggregation

Hi, I am trying to aggregate some data in a dataframe using agg, but my initial statement raised a warning: "FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version". I rewrote it based on the pandas documentation, but instead of the right column labels I now get function labels such as `<function size at 0x...>` (see the new-syntax output below). How can I correct the output so that the labels match the deprecated output, with column names std, mean, size, sum?
Deprecated Syntax Command:
Top15.set_index('Continent').groupby(level=0)['Pop Est']\
    .agg({'size': np.size, 'sum': np.sum, 'mean': np.mean, 'std': np.std})
Deprecated Syntax Output:
std mean size sum
Continent
Asia 6.790979e+08 5.797333e+08 5.0 2.898666e+09
Australia NaN 2.331602e+07 1.0 2.331602e+07
Europe 3.464767e+07 7.632161e+07 6.0 4.579297e+08
North America 1.996696e+08 1.764276e+08 2.0 3.528552e+08
South America NaN 2.059153e+08 1.0 2.059153e+08
New Syntax Command:
Top15.set_index('Continent').groupby(level=0)['Pop Est']\
.agg(['size', 'sum', 'mean', 'std'])\
.rename(columns={'size': np.size, 'sum': np.sum, 'mean': np.mean, 'std': np.std})
New Syntax Output:
<function size at 0x0000000002DE9950> <function sum at 0x0000000002DE90D0> <function mean at 0x0000000002DE9AE8> <function std at 0x0000000002DE9B70>
Continent
Asia 5 2.898666e+09 5.797333e+08 6.790979e+08
Australia 1 2.331602e+07 2.331602e+07 NaN
Europe 6 4.579297e+08 7.632161e+07 3.464767e+07
North America 2 3.528552e+08 1.764276e+08 1.996696e+08
South America 1 2.059153e+08 2.059153e+08 NaN
Dataframe:
Rank Documents Citable documents Citations Self-citations Citations per document H index Energy Supply Energy Supply per Capita % Renewable 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 Pop Est Continent
Country
China 1 127050 126767 597237 411683 4.70 138 1.271910e+11 93.0 19.754910 3.992331e+12 4.559041e+12 4.997775e+12 5.459247e+12 6.039659e+12 6.612490e+12 7.124978e+12 7.672448e+12 8.230121e+12 8.797999e+12 1.367645e+09 Asia
United States 2 96661 94747 792274 265436 8.20 230 9.083800e+10 286.0 11.570980 1.479230e+13 1.505540e+13 1.501149e+13 1.459484e+13 1.496437e+13 1.520402e+13 1.554216e+13 1.577367e+13 1.615662e+13 1.654857e+13 3.176154e+08 North America
Japan 3 30504 30287 223024 61554 7.31 134 1.898400e+10 149.0 10.232820 5.496542e+12 5.617036e+12 5.558527e+12 5.251308e+12 5.498718e+12 5.473738e+12 5.569102e+12 5.644659e+12 5.642884e+12 5.669563e+12 1.274094e+08 Asia
United Kingdom 4 20944 20357 206091 37874 9.84 139 7.920000e+09 124.0 10.600470 2.419631e+12 2.482203e+12 2.470614e+12 2.367048e+12 2.403504e+12 2.450911e+12 2.479809e+12 2.533370e+12 2.605643e+12 2.666333e+12 6.387097e+07 Europe
Russian Federation 5 18534 18301 34266 12422 1.85 57 3.070900e+10 214.0 17.288680 1.385793e+12 1.504071e+12 1.583004e+12 1.459199e+12 1.524917e+12 1.589943e+12 1.645876e+12 1.666934e+12 1.678709e+12 1.616149e+12 1.435000e+08 Europe
Canada 6 17899 17620 215003 40930 12.01 149 1.043100e+10 296.0 61.945430 1.564469e+12 1.596740e+12 1.612713e+12 1.565145e+12 1.613406e+12 1.664087e+12 1.693133e+12 1.730688e+12 1.773486e+12 1.792609e+12 3.523986e+07 North America
Germany 7 17027 16831 140566 27426 8.26 126 1.326100e+10 165.0 17.901530 3.332891e+12 3.441561e+12 3.478809e+12 3.283340e+12 3.417298e+12 3.542371e+12 3.556724e+12 3.567317e+12 3.624386e+12 3.685556e+12 8.036970e+07 Europe
India 8 15005 14841 128763 37209 8.58 115 3.319500e+10 26.0 14.969080 1.265894e+12 1.374865e+12 1.428361e+12 1.549483e+12 1.708459e+12 1.821872e+12 1.924235e+12 2.051982e+12 2.200617e+12 2.367206e+12 1.276731e+09 Asia
France 9 13153 12973 130632 28601 9.93 114 1.059700e+10 166.0 17.020280 2.607840e+12 2.669424e+12 2.674637e+12 2.595967e+12 2.646995e+12 2.702032e+12 2.706968e+12 2.722567e+12 2.729632e+12 2.761185e+12 6.383735e+07 Europe
South Korea 10 11983 11923 114675 22595 9.57 104 1.100700e+10 221.0 2.279353 9.410199e+11 9.924316e+11 1.020510e+12 1.027730e+12 1.094499e+12 1.134796e+12 1.160809e+12 1.194429e+12 1.234340e+12 1.266580e+12 4.980543e+07 Asia
Italy 11 10964 10794 111850 26661 10.20 106 6.530000e+09 109.0 33.667230 2.202170e+12 2.234627e+12 2.211154e+12 2.089938e+12 2.125185e+12 2.137439e+12 2.077184e+12 2.040871e+12 2.033868e+12 2.049316e+12 5.990826e+07 Europe
Spain 12 9428 9330 123336 23964 13.08 115 4.923000e+09 106.0 37.968590 1.414823e+12 1.468146e+12 1.484530e+12 1.431475e+12 1.431673e+12 1.417355e+12 1.380216e+12 1.357139e+12 1.375605e+12 1.419821e+12 4.644340e+07 Europe
Iran 13 8896 8819 57470 19125 6.46 72 9.172000e+09 119.0 5.707721 3.895523e+11 4.250646e+11 4.289909e+11 4.389208e+11 4.677902e+11 4.853309e+11 4.532569e+11 4.445926e+11 4.639027e+11 NaN 7.707563e+07 Asia
Australia 14 8831 8725 90765 15606 10.28 107 5.386000e+09 231.0 11.810810 1.021939e+12 1.060340e+12 1.099644e+12 1.119654e+12 1.142251e+12 1.169431e+12 1.211913e+12 1.241484e+12 1.272520e+12 1.301251e+12 2.331602e+07 Australia
Brazil 15 8668 8596 60702 14396 7.00 86 1.214900e+10 59.0 69.648030 1.845080e+12 1.957118e+12 2.056809e+12 2.054215e+12 2.208872e+12 2.295245e+12 2.339209e+12 2.409740e+12 2.412231e+12 2.319423e+12 2.059153e+08 South America

Try using just this:
Top15.set_index('Continent').groupby(level=0)['Pop Est'].agg(['size', 'sum', 'mean', 'std'])
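If you also want the original std/mean/size/sum column order with those exact labels, named aggregation (available in pandas 0.25 and later) lets you set both at once. A minimal sketch on a hypothetical two-continent stand-in for Top15 (the data here is made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for Top15: two continents, assumed values.
top15 = pd.DataFrame({
    'Continent': ['Asia', 'Asia', 'Europe'],
    'Pop Est': [1.37e9, 1.27e8, 6.39e7],
})

# Named aggregation: the keyword is the output label, the value is the
# aggregation to run, in exactly the order you list them.
result = (top15.set_index('Continent').groupby(level=0)['Pop Est']
          .agg(std='std', mean='mean', size='size', sum='sum'))
print(result)
```

This avoids the rename step entirely, since the labels never pass through function objects.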

Related

Complex slicing

I am trying to perform a slice with multiple conditions without success.
Here is what my dataframe looks like
I have many countries, whose names are stored as the index. For all those countries I have 7 different indicators, for two distinct years.
My goal is to select all the countries (and their indicators) whose 'GDP per capita (constant 2005 US$)' is greater than or equal to a previously defined threshold (gdp_min), OR that are named 'China', 'India', or 'Brazil'.
To do so, I have tried many different things but still cannot find a way to do it.
Here is my last try, with the error.
gdp_set = final_set[final_set['Indicator Name'] == 'GDP per capita (constant 2005 US$)']['2013'] >= gdp_min | final_set.loc[['China', 'India', 'Brazil']]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\roperator.py in ror_(left, right)
     55 def ror_(left, right):
---> 56     return operator.or_(right, left)

TypeError: ufunc 'bitwise_or' not supported for the input types, and
the inputs could not be safely coerced to any supported types
according to the casting rule ''safe''

During handling of the above exception, another exception occurred:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_16016/3232205269.py in <module>
----> 1 gdp_set = final_set[final_set['Indicator Name'] == 'GDP per capita (constant 2005 US$)']['2013'] >= gdp_min | final_set.loc[['China', 'India', 'Brazil']]

...

~\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in na_logical_op(x, y, op)
--> 325             raise TypeError(
    326                 f"Cannot perform '{op.__name__}' with a dtyped [{x.dtype}] array "
    327                 f"and scalar of type [{typ}]"

TypeError: Cannot perform 'ror_' with a dtyped [float64] array and
scalar of type [bool]
The error is very long, but from what I understand the problem comes from the second condition, which is not compatible with an OR ( | ).
Do you have any idea how I could do what I intend? The only workaround I can see is to create a new column holding the current index names, so that filtering might work with the OR condition.
IIUC, use:
m1 = final_set['Indicator Name'].eq('GDP per capita (constant 2005 US$)')
m2 = final_set['2013'] >= gdp_min
countries = list(final_set.index[m1 & m2]) + ['China', 'India', 'Brazil']
gdp_set = final_set[final_set.index.isin(countries)]
UPDATED:
This should do what you're asking:
gdp_set = final_set.loc[list(
{'China', 'India', 'Brazil'} |
set(final_set[((final_set['Indicator Name'] == 'GDP per capita (constant 2005 US$)') &
(final_set['2013'] >= gdp_min))].index)
)]
Explanation:
create a set containing the union of 'China', 'India', 'Brazil' with the set of any index values (i.e., Country Name values) for rows where value of Indicator Name matches the target and value of 2013 column is at least as large as gdp_min.
filter final_set on the countries in this set converted to a list and put the resulting dataframe in gdp_set.
Full test code:
import pandas as pd
final_set = pd.DataFrame({
    'Country Name': ['Andorra']*6 + ['Argentina']*4 + ['China']*2 + ['India']*2 + ['Brazil']*2,
    'Indicator Name': [f'Indicator {i}' for i in range(1, 6)] + ['GDP per capita (constant 2005 US$)'] + [f'Indicator {i}' for i in range(1, 4)] + ['GDP per capita (constant 2005 US$)'] + [f'Indicator {i}' if i % 2 else 'GDP per capita (constant 2005 US$)' for i in range(1, 7)],
    '2002': [10000.0/2]*6 + [15000.0/2]*4 + [8000.0/2]*6,
    '2013': [10000.0]*6 + [15000.0]*4 + [8000.0]*6,
    'Currency Unit': ['Euro']*6 + ['Argentine peso']*4 + ['RMB']*2 + ['INR']*2 + ['Brazilian real']*2,
    'Region': ['Europe & Central Asia']*6 + ['Latin America & Caribbean']*4 + ['Asia']*2 + ['South Asia']*2 + ['Latin America & Caribbean']*2,
    'GDP per capita (constant 2005 US$)': [10000.0]*6 + [15000.0]*4 + [8000.0]*6
}).set_index('Country Name')
print(final_set)
gdp_min = 14000.0
gdp_set = final_set.loc[list(
{'China', 'India', 'Brazil'} |
set(final_set[((final_set['Indicator Name'] == 'GDP per capita (constant 2005 US$)') &
(final_set['2013'] >= gdp_min))].index)
)]
print(gdp_set)
Input:
Indicator Name 2002 2013 Currency Unit Region GDP per capita (constant 2005 US$)
Country Name
Andorra Indicator 1 5000.0 10000.0 Euro Europe & Central Asia 10000.0
Andorra Indicator 2 5000.0 10000.0 Euro Europe & Central Asia 10000.0
Andorra Indicator 3 5000.0 10000.0 Euro Europe & Central Asia 10000.0
Andorra Indicator 4 5000.0 10000.0 Euro Europe & Central Asia 10000.0
Andorra Indicator 5 5000.0 10000.0 Euro Europe & Central Asia 10000.0
Andorra GDP per capita (constant 2005 US$) 5000.0 10000.0 Euro Europe & Central Asia 10000.0
Argentina Indicator 1 7500.0 15000.0 Argentine peso Latin America & Caribbean 15000.0
Argentina Indicator 2 7500.0 15000.0 Argentine peso Latin America & Caribbean 15000.0
Argentina Indicator 3 7500.0 15000.0 Argentine peso Latin America & Caribbean 15000.0
Argentina GDP per capita (constant 2005 US$) 7500.0 15000.0 Argentine peso Latin America & Caribbean 15000.0
China Indicator 1 4000.0 8000.0 RMB Asia 8000.0
China GDP per capita (constant 2005 US$) 4000.0 8000.0 RMB Asia 8000.0
India Indicator 3 4000.0 8000.0 INR South Asia 8000.0
India GDP per capita (constant 2005 US$) 4000.0 8000.0 INR South Asia 8000.0
Brazil Indicator 5 4000.0 8000.0 Brazilian real Latin America & Caribbean 8000.0
Brazil GDP per capita (constant 2005 US$) 4000.0 8000.0 Brazilian real Latin America & Caribbean 8000.0
Output:
Indicator Name 2002 2013 Currency Unit Region GDP per capita (constant 2005 US$)
Country Name
Brazil Indicator 5 4000.0 8000.0 Brazilian real Latin America & Caribbean 8000.0
Brazil GDP per capita (constant 2005 US$) 4000.0 8000.0 Brazilian real Latin America & Caribbean 8000.0
China Indicator 1 4000.0 8000.0 RMB Asia 8000.0
China GDP per capita (constant 2005 US$) 4000.0 8000.0 RMB Asia 8000.0
India Indicator 3 4000.0 8000.0 INR South Asia 8000.0
India GDP per capita (constant 2005 US$) 4000.0 8000.0 INR South Asia 8000.0
Argentina Indicator 1 7500.0 15000.0 Argentine peso Latin America & Caribbean 15000.0
Argentina Indicator 2 7500.0 15000.0 Argentine peso Latin America & Caribbean 15000.0
Argentina Indicator 3 7500.0 15000.0 Argentine peso Latin America & Caribbean 15000.0
Argentina GDP per capita (constant 2005 US$) 7500.0 15000.0 Argentine peso Latin America & Caribbean 15000.0
How about using a query?
# min GDP (I used an example number)
gdp_min = 3000.0
# Country name set.
countries = {"China", "India", "Brazil"}
# Create string expression to evaluate on DataFrame.
# Note: Backticks should be used for non-standard pandas field names
# (including names that begin with a numerical value).
expression = f"(`Indicator Name` == 'GDP per capita (constant 2005 US$)' & `2013` >= {gdp_min})"
# Add each country name as 'or' clause for second part of expression.
expression += "or (" + " or ".join([f"`Country Name` == '{n}'" for n in countries]) + ")"
# Collect resulting DataFrame to new variable.
gdp_set = final_set.query(expression)
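For reference, the boolean-mask route from the question can also be made to work directly: `|` binds tighter than `>=` in Python, so each comparison must be parenthesized, and both sides of `|` must be boolean masks rather than a mask and a sliced DataFrame. A minimal sketch on hypothetical data (the countries and values here are made up):

```python
import pandas as pd

# Hypothetical miniature of final_set: countries as the index.
final_set = pd.DataFrame({
    'Indicator Name': ['GDP per capita (constant 2005 US$)'] * 3,
    '2013': [20000.0, 5000.0, 8000.0],
}, index=['Norway', 'Kenya', 'China'])
gdp_min = 10000.0

# Parenthesize each comparison, and use index.isin() so both operands
# of `|` are boolean masks over the same rows.
mask = (
    ((final_set['Indicator Name'] == 'GDP per capita (constant 2005 US$)')
     & (final_set['2013'] >= gdp_min))
    | final_set.index.isin(['China', 'India', 'Brazil'])
)
gdp_set = final_set[mask]
print(gdp_set)
```

Here Norway passes on the GDP condition and China on the name condition, so both rows are kept.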

Sum two variables by two specific columns and compute quotient

I have a dataframe df1:
Plant name Brand Region Units produced capacity Cost incurred
Gujarat Plant Hyundai Asia 8500 9250 18500000
Haryana Plant Honda Asia 10000 10750 21500000
Chennai Plant Hyundai Asia 12000 12750 25500000
Zurich Plant Volkswagen Europe 25000 25750 77250000
Chennai Plant Suzuki Asia 6000 6750 13500000
Rengensburg BMW Europe 12500 13250 92750000
Dingolfing Mercedes Europe 14000 14750 103250000
I want a output dataframe with the following format:
df2= Region BMW Mercedes Volkswagen Toyota Suzuki Honda Hyundai
Europe
North America
Asia
Oceania
where the contents of each cell equals sum(cost incurred) / sum(units produced) for that specific Region and Brand.
Code I have tried, resulting in a ValueError:
for i, j in itertools.zip_longest(range(len(df2)), range(len(df2.columns))):
    if ((df2.index[i] in list(df1["Region"])) & (df2.columns[j] in list(df1["Brand"]))) == True:
        temp1 = df1["Region"] == df2.index[i]
        temp2 = df1["Brand"] == df2.columns[j]
        df2.loc[df2.index[i], df2.columns[j]] = df1[temp1 & temp2]["Cost incurred"].sum() / \
            df1[temp1 & temp2]["Units Produced"].sum()
    elif ((df2.index[i] in list(df1["Region"])) & (df2.columns[j] in list(df1["Brand"]))) == False:
        df2.loc[df2.index[i], df2.columns[j]] = 0
ValueError: The truth value of an array with more than one element is
ambiguous. Use a.any() or a.all()
df.pivot_table() is designed exactly for this pivot-and-aggregate capability. A quick(?) and dirty solution:
df1.pivot_table(index="Region", columns="Brand", values="Cost incurred", aggfunc="sum") \
    / df1.pivot_table(index="Region", columns="Brand", values="Units produced", aggfunc="sum")
Output
Brand BMW Honda Hyundai Mercedes Suzuki Volkswagen
Region
Asia NaN 2150.0 2146.341463 NaN 2250.0 NaN
Europe 7420.0 NaN NaN 7375.0 NaN 3090.0
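The same quotient can be computed with a single groupby instead of two pivot_tables. A minimal sketch on a hypothetical subset of the question's df1:

```python
import pandas as pd

# Hypothetical subset of the question's df1.
df1 = pd.DataFrame({
    'Brand': ['Hyundai', 'Honda', 'Hyundai', 'Volkswagen'],
    'Region': ['Asia', 'Asia', 'Asia', 'Europe'],
    'Units produced': [8500, 10000, 12000, 25000],
    'Cost incurred': [18500000, 21500000, 25500000, 77250000],
})

# Sum both columns per (Region, Brand) in one pass, take the quotient,
# then pivot Brand out into columns.
grouped = df1.groupby(['Region', 'Brand'])[['Cost incurred', 'Units produced']].sum()
df2 = (grouped['Cost incurred'] / grouped['Units produced']).unstack('Brand')
print(df2)
```

Summing before dividing gives the correct weighted cost per unit; dividing per row and then averaging would not.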

How to maintain the same index after sorting a Pandas series?

I have the following Pandas series, 'Reducedset', derived from the dataframe Top15:
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
Which gives me:
Country
United States 1.536434e+13
China 6.348609e+12
Japan 5.542208e+12
Germany 3.493025e+12
France 2.681725e+12
United Kingdom 2.487907e+12
Brazil 2.189794e+12
Italy 2.120175e+12
India 1.769297e+12
Canada 1.660647e+12
Russian Federation 1.565459e+12
Spain 1.418078e+12
Australia 1.164043e+12
South Korea 1.106715e+12
Iran 4.441558e+11
dtype: float64
I want to update the index of the dataframe so that its rows are in the same order as the series above.
How can I do this?
In other words, when I then look at the entire dataframe, the index order should match the series above, not the order below:
Reducedset
Rank Documents Citable documents Citations \
Country
China 1 127050 126767 597237
United States 2 96661 94747 792274
Japan 3 30504 30287 223024
United Kingdom 4 20944 20357 206091
Russian Federation 5 18534 18301 34266
Canada 6 17899 17620 215003
Germany 7 17027 16831 140566
India 8 15005 14841 128763
France 9 13153 12973 130632
South Korea 10 11983 11923 114675
Italy 11 10964 10794 111850
Spain 12 9428 9330 123336
Iran 13 8896 8819 57470
Australia 14 8831 8725 90765
Brazil 15 8668 8596 60702
The answer:
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
This first stage finds the mean of columns 10-20 for each row (axis=1) and sorts them in descending order (ascending=False).
Top15.reindex(Reducedset.index)
Here, we reindex the dataframe Top15 with the index of the sorted series, so its rows appear in the same order. (Note that Reducedset.reindex(Reducedset.index) would be a no-op, since it reindexes the series with its own index.)

Any way to get correct output after passing positional named argument (*args, **kwargs) to function?

I was trying to filter a dataframe based on a list of named arguments, thanks to #MkWTF on this post. However, I now want to loop over *args and apply the filtering function to each value, which should give me a list of dataframes filtered by the named arguments as output.
In my case, I need to pass the cty_rpt column as a positional argument (*args), loop through the country codes, and apply the filtering function to get one filtered dataframe per code:
minimal data
here is the minimal data that posted on gist minimal data
attempt:
import pandas as pd
df = pd.read_csv('mydf.csv', encoding='utf-8')

def data_filter(df, startDate, endDate, date_colname="date", inplace=False, **kwargs):
    s = ''
    for i, j in kwargs.items():
        s += '{}=="{}"&'.format(i, j)
    s += '{}>"{}"&'.format(date_colname, startDate)
    s += '{}<"{}"'.format(date_colname, endDate)
    return df.query(s, inplace=inplace)
The idea is to first subset df by looping through df.cty_rpt, then filter each subset by the list of **kwargs, which would produce a list of filtered dataframes.
I feel like the above attempt could be realized as follows:
new attempt to change the definition of my function
def func1(df, *args, startDate, endDate, date_col='date', inplace=False, **kwargs):
    output = []
    for arg in pd.unique(args):
        s = ''
        for i, j in kwargs.items():
            s += '{}=="{}"&'.format(i, j)
        s += '{}>"{}"&'.format(date_colname, startDate)
        s += '{}<"{}"'.format(date_colname, endDate)
        res = df.query(s, inplace=inplace)
        res = df.query(arg, inplace=False)
        output.append(res)
    return output
func1(df, df.cty_rpt,startDate='2013-12-31', endDate='2019-01-01', meat_type='Beef', temperature='Chilled',flow='E')
But I got an empty list, and I don't understand what's going on here. Any idea where the bug in my attempt is? Any quick solution?
goal:
I want a list of dataframes, each filtered by country code. I was explicitly looping through df.cty_rpt and calling the data_filter function, but I feel like using *args would simplify that; I just couldn't get what I actually need. Any idea how to get this done? Thanks.
I hope this gets you part way there. With your *args strategy, it appears that you want to query on unique values in (potentially) several columns. You can use pd.unique to get unique values from any one column and an outer for loop to extend that to multiple columns. I don't know how you want to handle multiple columns exactly, so I just guessed at one way.
You can build most of the query before starting the loop over unique cty_rpt values. Build the terms in a list and join them into a string per query.
I couldn't get this to work when adding in the start/end dates, so I left that part commented out.
import pandas as pd

def func1(df, *args, startDate, endDate, date_col='date', inplace=False, **kwargs):
    output = []
    query_terms = ['{}=="{}"'.format(*item) for item in kwargs.items()]
    # Todo: This didn't work for me, date query needs to be debugged
    # query_terms += [
    #     '{}>"{}"'.format(date_col, startDate),
    #     '{}<"{}"'.format(date_col, endDate)]
    for series in args:
        for name in pd.unique(series):
            print('querying', series.name, name)
            s = "&".join(query_terms + ['{}=="{}"'.format(series.name, name)])
            res = df.query(s, inplace=inplace)  # todo: i think inplace should always be false
            output.append(res)
    return output

df = pd.read_csv("mydf.csv", encoding="utf-8")
print(df)
result = func1(df, df.cty_rpt, startDate='2013-12-31', endDate='2019-01-01',
               meat_type='Beef', temperature='Chilled', flow='E')
for res in result:
    print('------------------------------')
    print(res)
Output
Unnamed: 0 flow cty_rpt origin destination value qty1 date animal_type meat_type temperature
0 0 E AR Argentina Albania 115691.00 18.26200 1/1/2017 Bovine Beef Frozen
1 1 I AR Argentina Albania 72425.20 19.17100 1/1/2016 Bovine Beef Frozen
2 2 I US Argentina Angola 109523.15 50.94100 5/1/2014 Bovine Beef Frozen
3 3 E US Argentina United Arab Emirates 1078.00 0.15300 10/1/2014 Bovine Beef Chilled
4 4 E US Argentina Albania 3373.00 0.26200 12/1/2014 Bovine Pork Frozen
5 5 E US Argentina Angola 36308.77 9.55494 4/1/2015 Bovine Pork Frozen
6 6 E AR Argentina Angola 10654.65 0.87569 6/1/2017 Bovine Pork Chilled
7 7 E AR Argentina United Arab Emirates 86.50 0.02000 7/1/2016 Bovine Pork Chilled
8 8 I AR Argentina Angola 68797.00 12.12000 1/1/2014 Bovine Beef Chilled
9 9 I AUC Argentina Angola 42000.00 21.00000 2/1/2017 Bovine Beef Frozen
10 10 I AUC Argentina Albania 180078.00 26.79100 12/1/2017 Bovine Beef Frozen
11 11 I AUC Argentina Angola 194402.47 45.29000 1/1/2015 Bovine Pork Frozen
12 12 I AUC Argentina United Arab Emirates 97928.05 6.47850 1/1/2014 Bovine Pork Chilled
13 13 E US Argentina Angola 61430.00 10.85000 4/1/2014 Bovine Beef Chilled
14 14 E US Argentina Angola 4153.80 1.97800 12/1/2014 Bovine Beef Frozen
15 15 E US Argentina Albania 55599.30 10.29300 6/1/2014 Bovine Beef Frozen
16 16 I US Argentina Angola 11531.00 0.20100 10/1/2014 Bovine Beef Frozen
17 17 I AR Argentina United Arab Emirates 1908.50 0.17800 4/1/2017 Bovine Pork Frozen
18 18 I AR Argentina Angola 59476.10 10.85600 1/1/2018 Bovine Pork Frozen
19 19 E CN Argentina Angola 452174.70 74.82600 12/1/2014 Bovine Pork Frozen
20 20 E CN Argentina Albania 101596.00 13.57200 11/1/2014 Bovine Pork Frozen
21 21 E KR Argentina Angola 135035.00 27.00700 5/1/2014 Bovine Beef Frozen
22 22 E KR Argentina Angola 86506.00 46.76000 10/1/2015 Bovine Beef Frozen
23 23 I KR Argentina Argentina 300876.85 24.53188 3/1/2014 Bovine Beef Chilled
24 24 E KR Argentina Albania 475380.06 72.74437 9/1/2015 Bovine Pork Frozen
25 25 E AR Argentina Albania 80396.00 8.77800 1/1/2018 Bovine Pork Frozen
26 26 I AR Argentina United Arab Emirates 160.00 0.02000 11/1/2014 Bovine Pork Chilled
27 27 I US Argentina Albania 212000.00 26.50000 10/1/2015 Bovine Beef Frozen
28 28 E US Argentina Albania 164459.08 20.70592 12/1/2015 Bovine Beef Frozen
29 29 E AUC Argentina Albania 235810.00 49.22200 3/1/2015 Bovine Beef Frozen
querying cty_rpt AR
querying cty_rpt US
querying cty_rpt AUC
querying cty_rpt CN
querying cty_rpt KR
------------------------------
Empty DataFrame
Columns: [Unnamed: 0, flow, cty_rpt, origin, destination, value, qty1, date, animal_type, meat_type, temperature]
Index: []
------------------------------
Unnamed: 0 flow cty_rpt origin destination value qty1 date animal_type meat_type temperature
3 3 E US Argentina United Arab Emirates 1078.0 0.153 10/1/2014 Bovine Beef Chilled
13 13 E US Argentina Angola 61430.0 10.850 4/1/2014 Bovine Beef Chilled
------------------------------
Empty DataFrame
Columns: [Unnamed: 0, flow, cty_rpt, origin, destination, value, qty1, date, animal_type, meat_type, temperature]
Index: []
------------------------------
Empty DataFrame
Columns: [Unnamed: 0, flow, cty_rpt, origin, destination, value, qty1, date, animal_type, meat_type, temperature]
Index: []
------------------------------
Empty DataFrame
Columns: [Unnamed: 0, flow, cty_rpt, origin, destination, value, qty1, date, animal_type, meat_type, temperature]
Index: []
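The commented-out date filter above likely fails because the date column holds strings such as '10/1/2014', and strings compare lexicographically rather than chronologically. A hedged sketch (assuming month/day/year strings as in the sample data) converts the column before comparing:

```python
import pandas as pd

# Hypothetical miniature of mydf.csv with the relevant columns.
df = pd.DataFrame({
    'cty_rpt': ['AR', 'US', 'AR'],
    'date': ['1/1/2017', '5/1/2014', '1/1/2012'],
})

# Convert string dates to real timestamps, then range comparisons
# behave chronologically instead of alphabetically.
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
mask = (df['date'] > '2013-12-31') & (df['date'] < '2019-01-01')
print(df[mask])
```

Once the column is datetime-typed, the same range conditions can also be expressed inside df.query().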

Convert Pandas DataFrame columns to rows

I have the following dict which I converted to dataframe
players_info = {'Afghanistan': {'Asghar Stanikzai': 809.0,
'Mohammad Nabi': 851.0,
'Mohammad Shahzad': 1713.0,
'Najibullah Zadran': 643.0,
'Samiullah Shenwari': 774.0},
'Australia': {'AJ Finch': 1082.0,
'CL White': 988.0,
'DA Warner': 1691.0,
'GJ Maxwell': 822.0,
'SR Watson': 1465.0},
'England': {'AD Hales': 1340.0,
'EJG Morgan': 1577.0,
'JC Buttler': 985.0,
'KP Pietersen': 1176.0,
'LJ Wright': 759.0}}
pd.DataFrame(players_info)
The resulting output is a dataframe with the teams as columns and the player names as the index.
But I want the columns mapped to rows like the following:
Player Team Score
Mohammad Nabi Afghanistan 851.0
Mohammad Shahzad Afghanistan 1713.0
Najibullah Zadran Afghanistan 643.0
JC Buttler England 985.0
KP Pietersen England 1176.0
LJ Wright England 759.0
I tried reset_index but it is not working as I want. How can I do that?
You need:
df = df.stack().reset_index()
df.columns=['Player', 'Team', 'Score']
Output of df.head(5):
             Player         Team   Score
0          AD Hales      England  1340.0
1          AJ Finch    Australia  1082.0
2  Asghar Stanikzai  Afghanistan   809.0
3          CL White    Australia   988.0
4         DA Warner    Australia  1691.0
Let's take a stab at this using melt. Should be pretty fast.
df.rename_axis('Player').reset_index().melt('Player').dropna()
Player variable value
2 Asghar Stanikzai Afghanistan 809.0
10 Mohammad Nabi Afghanistan 851.0
11 Mohammad Shahzad Afghanistan 1713.0
12 Najibullah Zadran Afghanistan 643.0
14 Samiullah Shenwari Afghanistan 774.0
16 AJ Finch Australia 1082.0
18 CL White Australia 988.0
19 DA Warner Australia 1691.0
21 GJ Maxwell Australia 822.0
28 SR Watson Australia 1465.0
30 AD Hales England 1340.0
35 EJG Morgan England 1577.0
37 JC Buttler England 985.0
38 KP Pietersen England 1176.0
39 LJ Wright England 759.0
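If you want the stack approach to yield the exact Player/Team/Score layout directly, you can name the axes before stacking. A minimal sketch on a subset of the dict:

```python
import pandas as pd

players_info = {
    'Afghanistan': {'Mohammad Nabi': 851.0, 'Mohammad Shahzad': 1713.0},
    'England': {'JC Buttler': 985.0},
}

df = pd.DataFrame(players_info)
# Name both axes before stacking so reset_index produces readable columns;
# dropna removes the (player, team) pairs that have no score.
out = (df.rename_axis(index='Player', columns='Team')
         .stack()
         .dropna()
         .rename('Score')
         .reset_index())
print(out)
```

This avoids having to assign df.columns by position after the fact.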
