df['Inflation'] = pd.read_csv('INFLATION_EUR.csv', header=None, names=['Date', 'Value'])['Value']
df['EURUSD'] = pd.read_csv('DEXUSEU.csv', header=None, names=['Date', 'Value'])['Value']
df = df.set_index('Date')
df['Gerber'] = 0
The data looks like this:
Date Interest Inflation EURUSD Gerber
... ... ... ... ...
01/31/2019 3.50 1.84 1.329 0
02/28/2019 3.50 1.84 1.317 0
03/31/2019 3.75 1.94 1.309 0
04/30/2019 3.75 1.91 1.300 0
05/31/2019 3.75 1.87 1.302 0
... ... ... ... ...
08/31/2019 0.00 1.24 1.375 0
09/30/2019 0.00 1.00 1.381 0
10/31/2019 0.00 0.91 1.370 0
11/30/2019 0.00 0.77 1.369 0
12/31/2019 0.00 0 1.362 0
...
For each row, I need to check whether inflation is lower than it was one year (12 months) earlier, and if so set Gerber to 1 instead of 0. The data is monthly.
The last row should be:
12/31/2019 0.00 0 1.362 1
since inflation on 01/31/2019 was 1.84, i.e. higher than 0. The data runs from 1990 to 2019.
How can I do this? I tried this:
df['Gerber'] = df['Inflation'].rolling(12).apply(lambda x: 1 if (x[0] < x[-1]) else 0)
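A rolling window of length 12 holds the current month plus only the 11 before it, so x[0] reaches back 11 months, not 12, and positional lookups like x[0] on the window Series are fragile across pandas versions. A simpler sketch (assuming one row per month with no gaps) compares each value against the one 12 rows earlier with shift:
df['Gerber'] = (df['Inflation'] < df['Inflation'].shift(12)).astype(int)
The first 12 rows compare against NaN, which is False, so Gerber stays 0 there; use shift(11) instead if you want the comparison against 01/31/2019 shown in your example.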
I would like to scrape the 2nd table on the page at https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard in Google Colab, but pd.read_html("https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard") only gives me the first table.
Please help me understand where I am going wrong.
This is one way to read that data. The tables after the first on that page are wrapped in HTML comments, which pd.read_html skips, so stripping the comment markers from the raw HTML first makes them parseable:
import pandas as pd
import requests

url = 'https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard'
# Drop the comment markers so the hidden tables become visible to read_html
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
df = pd.read_html(response, header=1)[2]
print(df)
Result in terminal:
Rk Player Nation Pos Squad Age Born MP Starts Min 90s Gls Ast G-PK PK PKatt CrdY CrdR Gls.1 Ast.1 G+A G-PK.1 G+A-PK Matches
0 1 Sahal Abdul Samad in IND MF Kerala Blasters 24 1997 20 19 1443 16.0 5 1 5 0 0 0 0 0.31 0.06 0.37 0.31 0.37 Matches
1 2 Ayush Adhikari in IND MF Kerala Blasters 21 2000 14 6 540 6.0 0 0 0 0 0 3 1 0.00 0.00 0.00 0.00 0.00 Matches
2 3 Gani Ahammed Nigam in IND FW NorthEast Utd 23 1998 6 0 66 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
3 4 Airam es ESP FW Goa 33 1987 13 8 751 8.3 6 1 5 1 2 0 0 0.72 0.12 0.84 0.60 0.72 Matches
4 5 Alex br BRA MF Jamshedpur 32 1988 20 12 1118 12.4 1 4 1 0 0 2 0 0.08 0.32 0.40 0.08 0.40 Matches
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
302 292 João Victor br BRA MF Hyderabad FC 32 1988 18 18 1590 17.7 5 1 3 2 2 3 0 0.28 0.06 0.34 0.17 0.23 Matches
303 293 David Williams au AUS FW Mohun Bagan 33 1988 15 6 602 6.7 4 1 4 0 1 2 0 0.60 0.15 0.75 0.60 0.75 Matches
304 294 Banana Yaya cm CMR DF Bengaluru 30 1991 5 2 229 2.5 0 1 0 0 0 1 0 0.00 0.39 0.39 0.00 0.39 Matches
305 295 Joe Zoherliana in IND DF NorthEast Utd 22 1999 9 6 677 7.5 0 1 0 0 0 0 0 0.00 0.13 0.13 0.00 0.13 Matches
306 296 Mark Zothanpuia in IND MF Hyderabad FC 19 2002 3 0 63 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
307 rows × 24 columns
Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
I'd like to shift a column in a MultiIndex dataframe in order to calculate a regression model with a lagged independent variable. Since my time series has missing dates, I only want a value shifted in when the previous day actually exists in the data. The df looks like this:
cost
ID day
1 31.01.2020 0
1 03.02.2020 0
1 04.02.2020 0.12
1 05.02.2020 0
1 06.02.2020 0
1 07.02.2020 0.08
1 10.02.2020 0
1 11.02.2020 0
1 12.02.2020 0.03
1 13.02.2020 0.1
1 14.02.2020 0
The desired output would look like this:
cost cost_lag
ID day
1 31.01.2020 0 NaN
1 03.02.2020 0 NaN
1 04.02.2020 0.12 0
1 05.02.2020 0 0.12
1 06.02.2020 0 0
1 07.02.2020 0.08 0
1 10.02.2020 0 NaN
1 11.02.2020 0 0
1 12.02.2020 0.03 0
1 13.02.2020 0.1 0.03
1 14.02.2020 0 0.1
Based on this answer to a similar question, I've tried the following:
df['cost_lag'] = df.groupby(['id'])['cost'].shift(1)[df.reset_index().day == df.reset_index().day.shift(1) + datetime.timedelta(days=1)]
But that results in an error message I don't understand:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
I've also tried to fill the missing dates following an approach suggested here:
ams_spend_ranking_df = ams_spend_ranking_df.index.get_level_values(1).apply(lambda x: datetime.datetime(x, 1, 1))
again resulting in an error message which does not enlighten me:
AttributeError: 'DatetimeIndex' object has no attribute 'apply'
Long story short: how can I shift the cost column by 1 day and add NaNs if I don't have data on the previous day?
You can add all the missing datetimes with DataFrameGroupBy.resample and Resampler.asfreq:
df1 = df.reset_index(level=0).groupby(['ID'])['cost'].resample('d').asfreq()
print (df1)
ID day
1 2020-01-31 0.00
2020-02-01 NaN
2020-02-02 NaN
2020-02-03 0.00
2020-02-04 0.12
2020-02-05 0.00
2020-02-06 0.00
2020-02-07 0.08
2020-02-08 NaN
2020-02-09 NaN
2020-02-10 0.00
2020-02-11 0.00
2020-02-12 0.03
2020-02-13 0.10
2020-02-14 0.00
Name: cost, dtype: float64
Then your DataFrameGroupBy.shift solution works on the resampled series as needed:
df['cost_lag'] = df1.groupby('ID').shift(1)
print (df)
cost cost_lag
ID day
1 2020-01-31 0.00 NaN
2020-02-03 0.00 NaN
2020-02-04 0.12 0.00
2020-02-05 0.00 0.12
2020-02-06 0.00 0.00
2020-02-07 0.08 0.00
2020-02-10 0.00 NaN
2020-02-11 0.00 0.00
2020-02-12 0.03 0.00
2020-02-13 0.10 0.03
2020-02-14 0.00 0.10
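For reference, the same idea as a single chained expression (a sketch, assuming the day level is already a DatetimeIndex); the assignment aligns on the (ID, day) index, so the inserted calendar days are simply dropped again:
df['cost_lag'] = (
    df.reset_index(level=0)      # move ID out of the index for resample
      .groupby('ID')['cost']
      .resample('D').asfreq()    # insert the missing calendar days as NaN
      .groupby(level='ID')
      .shift(1)                  # lag one real day within each ID
)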
I have a file like:
AFA MT 0 0 1.22 259 169 FOD 0 50.01 1.3 1.370 0.00 -0.02 1.78 0 0.0
S 2 50.620 1.960 2.452 0.00 -0.49 0.31
MKE MS 0 0 4.22 256 149 MDO 1 30.00 1.4 2.370 3.00 -0.52 4.82 0 0.0
KTE KL 0 0 1.22 259 169 FID 0 10.01 2.0 2.470 1.00 -0.12 0.78 1 1.0
S 3 70.610 1.960 2.52 0.00 -0.19 0.41
...
...
S lines are not always there, but they always start with S.
I'd like to split the file and create a dictionary keyed on the first fields (AFA, KTE, ...), while also appending the "S 2 50.620 ... 0.31" part to the previous key's value whenever it exists
(i.e. merge each S line into the previous line whenever one occurs).
So far I did:
import collections
st = {}
with open("file.txt") as f:
    for line in f:
        if len(line.split()) == 17 or len(line.split()) == 8:
            key, value = line.split(None, 1)
            st[key] = value.split()
            # order output as the order in file
            st = collections.OrderedDict(st)
            print([key], [value])
but this gives me:
['AFA'] ['MT 0 0 1.22 259 169 FOD 0 50.01 1.3 1.370 0.00 -0.02 1.78 0 0.0']
['S'] ['2 50.620 1.960 2.452 0.00 -0.49 0.31']
['MKE'] ['MS 0 0 4.22 256 149 MDO 1 30.00 1.4 2.370 3.00 -0.52 4.82 0 0.0']
['KTE'] ['KL 0 0 1.22 259 169 FID 0 10.01 2.0 2.470 1.00 -0.12 0.78 1 1.0']
['S'] ['3 70.610 1.960 2.52 0.00 -0.19 0.41']
whereas what I want to get is:
['AFA'] ['MT 0 0 1.22 259 169 FOD 0 50.01 1.3 1.370 0.00 -0.02 1.78 0 0.0 S 2 50.620 1.960 2.452 0.00 -0.49 0.31']
['MKE'] ['MS 0 0 4.22 256 149 MDO 1 30.00 1.4 2.370 3.00 -0.52 4.82 0 0.0']
['KTE'] ['KL 0 0 1.22 259 169 FID 0 10.01 2.0 2.470 1.00 -0.12 0.78 1 1.0 S 3 70.610 1.960 2.52 0.00 -0.19 0.41']
The logic you could use is:
Remember the last non-S line you've read.
If you read an S line, append it to the remembered non-S line, put the combined line in your dictionary, then forget the remembered line.
If you read a non-S line, put any previously remembered non-S line in your dictionary and remember the new line instead.
When you are done with the file, put any remaining remembered non-S line in your dictionary (a literal sketch of this follows after the code below).
This seems to do what you want, without much modification to your attempt:
import pprint
st = {}
with open("file.txt") as f:
    for line in f:
        if len(line.split()) == 17:        # a full data line: start a new entry
            key, value = line.split(None, 1)
            st[key] = value.split()
        elif len(line.split()) == 8:       # an S line: append it to the last entry
            st[key] += line.split()
pprint.pprint(st)
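A more literal sketch of the bullet logic above, buffering each non-S line until we know whether an S line follows (the variable names are mine):
st = {}
pending = None                      # (key, fields) of the last non-S line seen

with open("file.txt") as f:
    for line in f:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == 'S':        # merge the S line into the remembered line
            if pending is not None:
                key, value = pending
                st[key] = value + fields
                pending = None
        else:                       # flush the previous entry, remember this one
            if pending is not None:
                key, value = pending
                st[key] = value
            pending = (fields[0], fields[1:])

if pending is not None:             # flush the last remembered line at EOF
    key, value = pending
    st[key] = value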
To explain my problem better, let's pretend I have a shop with 3 unique customers, and my dataframe contains every purchase of my customers with weekday, name and paid price.
name price weekday
0 Paul 18.44 0
1 Micky 0.70 0
2 Sarah 0.59 0
3 Sarah 0.27 1
4 Paul 3.45 2
5 Sarah 14.03 2
6 Paul 17.21 3
7 Micky 5.35 3
8 Sarah 0.49 4
9 Micky 17.00 4
10 Paul 2.62 4
11 Micky 17.61 5
12 Micky 10.63 6
The information I would like to get is the average price per unique customer per weekday. What I often do in similar situations is to group by several columns with sum and then take the average of a subset of the columns.
df = df.groupby(['name','weekday']).sum()
price
name weekday
Micky 0 0.70
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
2 3.45
3 17.21
4 2.62
Sarah 0 0.59
1 0.27
2 14.03
4 0.49
df = df.groupby(['weekday']).mean()
price
weekday
0 6.576667
1 0.270000
2 8.740000
3 11.280000
4 6.703333
5 17.610000
6 10.630000
Of course this only works if every unique customer has at least one purchase per weekday.
Is there an elegant way to get a zero value for all combinations of unique index values that have no sum after the first groupby?
My solutions so far have been either to reindex on a MultiIndex created from the unique values of the grouped columns, or the unstack-fillna-stack combination, but neither really satisfies me.
Appreciate your help!
IIUC, let's use unstack and fillna then stack:
df_out = df.groupby(['name','weekday']).sum().unstack().fillna(0).stack()
Output:
price
name weekday
Micky 0 0.70
1 0.00
2 0.00
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
1 0.00
2 3.45
3 17.21
4 2.62
5 0.00
6 0.00
Sarah 0 0.59
1 0.27
2 14.03
3 0.00
4 0.49
5 0.00
6 0.00
And then:
df_out.groupby('weekday').mean()
Output:
price
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
I think you can use pivot_table to do all the steps at once. I'm not exactly sure what you want, but the default aggregation for pivot_table is the mean; you can change it to 'sum'.
df1 = df.pivot_table(index='name', columns='weekday', values='price',
                     fill_value=0, aggfunc='sum')
weekday 0 1 2 3 4 5 6
name
Micky 0.70 0.00 0.00 5.35 17.00 17.61 10.63
Paul 18.44 0.00 3.45 17.21 2.62 0.00 0.00
Sarah 0.59 0.27 14.03 0.00 0.49 0.00 0.00
And then take the mean of each column.
df1.mean()
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
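For completeness, the reindex-on-a-MultiIndex approach mentioned in the question would look something like this sketch, building every name/weekday combination with pandas.MultiIndex.from_product:
import pandas as pd

sums = df.groupby(['name', 'weekday']).sum()
full_idx = pd.MultiIndex.from_product(
    [df['name'].unique(), sorted(df['weekday'].unique())],
    names=['name', 'weekday'])
sums = sums.reindex(full_idx, fill_value=0)   # zeros for the missing combinations
print(sums.groupby(level='weekday').mean())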
I'm trying to reshape a dataframe, but I'm not able to get the results I need.
The dataframe looks like this:
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26
I need to reshape the dataframe so it will look like this:
m r s p O W N p O W N p O W N
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
1 4 4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
1 4 5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
I tried to use the pivot_table function
df.pivot_table(index=['m','r','s'], columns=['p'], values=['O','W','N'])
but I'm not able to get quite what I want. Does anyone know how to do this?
As someone who fancies himself as pretty handy with pandas, the pivot_table and melt functions are confusing to me. I prefer to stick with a well-defined and unique index and use the stack and unstack methods of the dataframe itself.
First, I'll ask if you really need to repeat the p-column like that? I can sort of see its value when presenting data, but IMO pandas isn't really set up to work like that. We could shoehorn it in, but let's see if a simpler solution gets you what you need.
Here's what I would do:
from io import StringIO
import pandas
datatable = StringIO("""\
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26""")
df = (
    pandas.read_table(datatable, sep=r'\s+')
          .set_index(['m', 'r', 's', 'p'])
          .unstack(level='p')
)
df.columns = df.columns.swaplevel(0, 1)
df = df.sort_index(axis=1)   # group the columns by p value
print(df)
Which prints:
p 1 2 3
O W N O W N O W N
m r s
1 4 3 2.81 3.70 3.03 0.58 0.78 0.60 1.98 1.34 1.81
4 2.14 2.82 2.31 0.67 0.00 0.00 0.00 0.04 0.15
5 1.47 1.94 1.59 1.03 2.45 1.68 0.01 0.00 0.26
So now the columns are a MultiIndex and you can access, for example, all of the values where p = 2 with df[2] or df.xs(2, level='p', axis=1), which gives me:
O W N
m r s
1 4 3 0.58 0.78 0.60
4 0.67 0.00 0.00
5 1.03 2.45 1.68
Similarly, you can get all of the W columns with df.xs('W', level=1, axis=1) (we say level=1 because that column level does not have a name, so we use its position instead):
p 1 2 3
m r s
1 4 3 3.70 0.78 1.34
4 2.82 0.00 0.04
5 1.94 2.45 0.00
You can similarly query the rows by using axis=0.
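For example, a quick sketch to grab the rows where s == 4 (xs drops the selected level by default):
print(df.xs(4, level='s', axis=0))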
If you really need the p values in a column, just add them manually and reindex your columns:
for p in df.columns.get_level_values('p').unique():
    df[p, 'p'] = p

cols = pandas.MultiIndex.from_product([[1, 2, 3], list('pOWN')])
df = df.reindex(columns=cols)
print(df)
1 2 3
p O W N p O W N p O W N
m r s
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
import csv

sk = ''
with open('df_col2.csv', 'r') as ann:
    for col in ann:
        an = col.lower().strip('\n').split(',')
        sk += an[0] + ','          # collect the first field of every row
sk = sk[:-1]                       # drop the trailing comma

with open('ss2.csv', 'w', newline='') as b:
    a = csv.writer(b)
    a.writerow([sk])