python split lines of file maybe every 2 lines and create dictionary

I have a file like:
AFA MT 0 0 1.22 259 169 FOD 0 50.01 1.3 1.370 0.00 -0.02 1.78 0 0.0
S 2 50.620 1.960 2.452 0.00 -0.49 0.31
MKE MS 0 0 4.22 256 149 MDO 1 30.00 1.4 2.370 3.00 -0.52 4.82 0 0.0
KTE KL 0 0 1.22 259 169 FID 0 10.01 2.0 2.470 1.00 -0.12 0.78 1 1.0
S 3 70.610 1.960 2.52 0.00 -0.19 0.41
...
...
S lines are not always there, but they always start with S.
I would like to split the file and create a dictionary whose keys are only the first fields (AFA, KTE ...), but also keep the "S 2 50.60 ... 0.31" part as part of the previous key's value, whenever it exists
(i.e. merge the S lines with the previous line whenever they occur).
So far I did:
import collections
st = {}
with open("file.txt") as f:
    for line in f:
        if len(line.split()) == 17 or len(line.split()) == 8:
            key, value = line.split(None, 1)
            st[key] = value.split()
            # order output as the order in file
            st = collections.OrderedDict(st)
            print([key], [value])
but this gives me:
['AFA'] ['MT 0 0 1.22 259 169 FOD 0 50.01 1.3 1.370 0.00 -0.02 1.78 0 0.0']
['S'] ['2 50.620 1.960 2.452 0.00 -0.49 0.31']
['MKE'] ['MS 0 0 4.22 256 149 MDO 1 30.00 1.4 2.370 3.00 -0.52 4.82 0 0.0']
['KTE'] ['KL 0 0 1.22 259 169 FID 0 10.01 2.0 2.470 1.00 -0.12 0.78 1 1.0']
['S'] ['3 70.610 1.960 2.52 0.00 -0.19 0.41']
while what I want to get is:
['AFA'] ['MT 0 0 1.22 259 169 FOD 0 50.01 1.3 1.370 0.00 -0.02 1.78 0 0.0 S 2 50.620 1.960 2.452 0.00 -0.49 0.31']
['MKE'] ['MS 0 0 4.22 256 149 MDO 1 30.00 1.4 2.370 3.00 -0.52 4.82 0 0.0']
['KTE'] ['KL 0 0 1.22 259 169 FID 0 10.01 2.0 2.470 1.00 -0.12 0.78 1 1.0 S 3 70.610 1.960 2.52 0.00 -0.19 0.41']

The logic you could use is:
remember the last non-S line you've read
If you read an S-line, add it to the remembered non-S line, put that in your dictionary, then forget the remembered non-S line
If you read a non-S-line, put any remembered non-S line in your dictionary & remember the new line instead
When you are done with the file, put any remembered non-S line in your dictionary
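Written out directly, that logic might look like the following sketch (it assumes the file layout shown above; the answer's code below takes a shortcut based on field counts instead):
import collections

st = collections.OrderedDict()

def store(raw, st):
    # first field becomes the key, the remaining fields the value
    key, value = raw.split(None, 1)
    st[key] = value.split()

remembered = None  # last non-S line seen so far
with open("file.txt") as f:
    for line in f:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "S":
            # merge the S line into the remembered line, store it, forget it
            store(remembered.rstrip() + " " + line.strip(), st)
            remembered = None
        else:
            # flush any remembered line, then remember this one
            if remembered is not None:
                store(remembered, st)
            remembered = line
if remembered is not None:
    store(remembered, st)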
This seems to do what you want, without much modification to your attempt:
import pprint
st = {}
with open("file.txt") as f:
    for line in f:
        if len(line.split()) == 17:
            # a data line: first field is the key, the rest the value
            key, value = line.split(None, 1)
            st[key] = value.split()
        elif len(line.split()) == 8:
            # an S line: append its fields to the most recent key's value
            st[key] += line.split()
pprint.pprint(st)

Related

Unable to scrape 2nd table from Fbref.com

I would like to scrape the 2nd table on the page at this link - https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard - in Google Colab,
but pd.read_html("https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard") only gives me the first table.
Please help me understand where I am going wrong.
This is one way to read that data. The later tables on fbref are embedded in HTML comments, which is why the raw HTML is fetched and the comment markers stripped before parsing:
import pandas as pd
import requests

url = 'https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard'
# the hidden tables sit inside <!-- ... --> comments; removing the markers
# makes them visible to read_html
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
df = pd.read_html(response, header=1)[2]
print(df)
Result in terminal:
Rk Player Nation Pos Squad Age Born MP Starts Min 90s Gls Ast G-PK PK PKatt CrdY CrdR Gls.1 Ast.1 G+A G-PK.1 G+A-PK Matches
0 1 Sahal Abdul Samad in IND MF Kerala Blasters 24 1997 20 19 1443 16.0 5 1 5 0 0 0 0 0.31 0.06 0.37 0.31 0.37 Matches
1 2 Ayush Adhikari in IND MF Kerala Blasters 21 2000 14 6 540 6.0 0 0 0 0 0 3 1 0.00 0.00 0.00 0.00 0.00 Matches
2 3 Gani Ahammed Nigam in IND FW NorthEast Utd 23 1998 6 0 66 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
3 4 Airam es ESP FW Goa 33 1987 13 8 751 8.3 6 1 5 1 2 0 0 0.72 0.12 0.84 0.60 0.72 Matches
4 5 Alex br BRA MF Jamshedpur 32 1988 20 12 1118 12.4 1 4 1 0 0 2 0 0.08 0.32 0.40 0.08 0.40 Matches
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
302 292 João Victor br BRA MF Hyderabad FC 32 1988 18 18 1590 17.7 5 1 3 2 2 3 0 0.28 0.06 0.34 0.17 0.23 Matches
303 293 David Williams au AUS FW Mohun Bagan 33 1988 15 6 602 6.7 4 1 4 0 1 2 0 0.60 0.15 0.75 0.60 0.75 Matches
304 294 Banana Yaya cm CMR DF Bengaluru 30 1991 5 2 229 2.5 0 1 0 0 0 1 0 0.00 0.39 0.39 0.00 0.39 Matches
305 295 Joe Zoherliana in IND DF NorthEast Utd 22 1999 9 6 677 7.5 0 1 0 0 0 0 0 0.00 0.13 0.13 0.00 0.13 Matches
306 296 Mark Zothanpuia in IND MF Hyderabad FC 19 2002 3 0 63 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
307 rows × 24 columns
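Note that recent pandas versions deprecate passing a literal HTML string to read_html; wrapping the markup in StringIO avoids the warning:
from io import StringIO

df = pd.read_html(StringIO(response), header=1)[2]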
Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

How to fill missing timestamps in pandas

I have a CSV file as below:
t dd hh v.amm v.alc v.no2 v.cmo aqi
0 201811170000 17 0 0.40 0.41 1.33 1.55 2.45
1 201811170002 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170007 17 0 0.40 0.37 1.35 1.45 2.40
Now I have to fill in the missing minutes by last observation carried forward. Expected output:
t dd hh v.amm v.alc v.no2 v.cmo aqi
0 201811170000 17 0 0.40 0.41 1.33 1.55 2.45
1 201811170001 17 0 0.40 0.41 1.33 1.55 2.45
2 201811170002 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170003 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170004 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170005 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170006 17 0 0.40 0.41 1.34 1.51 2.46
3 201811170007 17 0 0.40 0.37 1.35 1.45 2.40
I tried following this link but was unable to achieve the expected output. Sorry, I'm new to coding.
First create a DatetimeIndex with to_datetime and DataFrame.set_index, and then change the frequency with DataFrame.asfreq:
df['t'] = pd.to_datetime(df['t'], format='%Y%m%d%H%M')
df = df.set_index('t').sort_index().asfreq('Min', method='ffill')
print(df)
dd hh v.amm v.alc v.no2 v.cmo aqi
t
2018-11-17 00:00:00 17 0 0.4 0.41 1.33 1.55 2.45
2018-11-17 00:01:00 17 0 0.4 0.41 1.33 1.55 2.45
2018-11-17 00:02:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:03:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:04:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:05:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:06:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:07:00 17 0 0.4 0.37 1.35 1.45 2.40
Or use DataFrame.resample with Resampler.ffill:
df['t'] = pd.to_datetime(df['t'], format='%Y%m%d%H%M')
df = df.set_index('t').sort_index().resample('Min').ffill()
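As a self-contained sketch of either approach, assuming the frame is built inline rather than read from the CSV:
import pandas as pd

# tiny reproduction of the question's data; t is stored as %Y%m%d%H%M strings
df = pd.DataFrame({
    't': ['201811170000', '201811170002', '201811170007'],
    'aqi': [2.45, 2.46, 2.40],
})
df['t'] = pd.to_datetime(df['t'], format='%Y%m%d%H%M')

# one row per minute; newly created rows are forward-filled
filled = df.set_index('t').sort_index().asfreq('Min', method='ffill')
print(filled)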

Pandas the value 1 year ago is lower

df['Inflation'] = pd.read_csv('INFLATION_EUR.csv', header=None, names=['Date', 'Value'])['Value']
df['EURUSD'] = pd.read_csv('DEXUSEU.csv', header=None, names=['Date', 'Value'])['Value']
df = df.set_index('Date')
df['Gerber'] = 0
The data looks like this:
Date Interest Inflation EURUSD Gerber
... ... ... ... ...
01/31/2019 3.50 1.84 1.329 0
02/28/2019 3.50 1.84 1.317 0
03/31/2019 3.75 1.94 1.309 0
04/30/2019 3.75 1.91 1.300 0
05/31/2019 3.75 1.87 1.302 0
... ... ... ... ...
08/31/2019 0.00 1.24 1.375 0
09/30/2019 0.00 1.00 1.381 0
10/31/2019 0.00 0.91 1.370 0
11/30/2019 0.00 0.77 1.369 0
12/31/2019 0.00 0 1.362 0
..
I need to check if the inflation is lower than 1 year ago (12 months), in that case set Gerber to 1 instead of 0. This goes for each row. The data is monthly.
The last row should be:
12/31/2019 0.00 0 1.362 1
since inflation 01/31/2019 was 1.84, i.e. higher than 0. The data goes from 1990 to 2019.
How can I do this? I tried this:
df['Gerber'] = df['Inflation'].rolling(12).apply(lambda x: 1 if (x[0] < x[-1]) else 0)
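One way to express that comparison is with shift(12), which lines each row up with the value from 12 monthly rows earlier; a minimal sketch:
# 1 where this month's inflation is below the value 12 rows (one year) back;
# the first 12 rows have no year-ago value, so the comparison is False -> 0
df['Gerber'] = (df['Inflation'] < df['Inflation'].shift(12)).astype(int)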

transpose multiple columns Pandas dataframe

I'm trying to reshape a dataframe, but I'm not able to get the results I need.
The dataframe looks like this:
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26
I need to reshape the dataframe so it will look like this:
m r s p O W N p O W N p O W N
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
1 4 4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
1 4 5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
I tried to use the pivot_table function
df.pivot_table(index=['m','r','s'], columns=['p'], values=['O','W','N'])
but I'm not able to get quite what I want. Does anyone know how to do this?
As someone who fancies himself as pretty handy with pandas, the pivot_table and melt functions are confusing to me. I prefer to stick with a well-defined and unique index and use the stack and unstack methods of the dataframe itself.
First, I'll ask if you really need to repeat the p-column like that? I can sort of see its value when presenting data, but IMO pandas isn't really set up to work like that. We could shoehorn it in, but let's see if a simpler solution gets you what you need.
Here's what I would do:
from io import StringIO
import pandas
datatable = StringIO("""\
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26""")
df = (
    pandas.read_table(datatable, sep=r'\s+')
          .set_index(['m', 'r', 's', 'p'])
          .unstack(level='p')
)
df.columns = df.columns.swaplevel(0, 1)
df = df.sort_index(axis=1)  # was df.sort(axis=1, inplace=True); .sort() was removed from pandas
print(df)
Which prints:
p 1 2 3
O W N O W N O W N
m r s
1 4 3 2.81 3.70 3.03 0.58 0.78 0.60 1.98 1.34 1.81
4 2.14 2.82 2.31 0.67 0.00 0.00 0.00 0.04 0.15
5 1.47 1.94 1.59 1.03 2.45 1.68 0.01 0.00 0.26
So now the columns are a MultiIndex and you can access, for example, all of the values where p = 2 with df[2] or df.xs(2, level='p', axis=1), which gives me:
O W N
m r s
1 4 3 0.58 0.78 0.60
4 0.67 0.00 0.00
5 1.03 2.45 1.68
Similarly, you can get all of the W columns with df.xs('W', level=1, axis=1)
(we say level=1 because that column level does not have a name, so we use its position instead):
p 1 2 3
m r s
1 4 3 3.70 0.78 1.34
4 2.82 0.00 0.04
5 1.94 2.45 0.00
You can query rows the same way by using axis=0.
If you really need the p values in a column, just add it there manually and reindex your columns:
for p in df.columns.get_level_values('p').unique():
    df[p, 'p'] = p

cols = pandas.MultiIndex.from_product([[1, 2, 3], list('pOWN')])
df = df.reindex(columns=cols)
print(df)
1 2 3
p O W N p O W N p O W N
m r s
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
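If you then need the flat single-row header from the question, one option (a sketch, not part of the original answer) is to flatten the MultiIndex and pull the row index back out as ordinary columns:
# e.g. the (1, 'O') column becomes 'O_1'; m, r, s become regular columns again
df.columns = ['{}_{}'.format(c, p) for p, c in df.columns]
df = df.reset_index()
print(df)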

Transforming Pandas DataFrame into List of DataFrames

I have data that looks like this:
1.00 1.00 1.00
3.23 4.23 0.33
1.23 0.13 3.44
4.55 12.3 14.1
2.00 2.00 2.00
1.21 1.11 1.11
3.55 5.44 5.22
4.11 1.00 4.00
It comes in chunks of 4 lines: the first line of each chunk is the index and the remaining three are the values.
A chunk is always 4 lines, but the number of columns can be more than 3.
For example:
1.00 1.00 1.00 <- 1st chunk, the index = 1
3.23 4.23 0.33 <- values
1.23 0.13 3.44 <- values
4.55 12.3 14.1 <- values
My example above only contains 2 chunks, but in reality there can be more than that.
What I want to do is create a dictionary of data frames so I can process them
chunk by chunk; namely, from this:
In [1]: import pandas as pd
In [2]: df = pd.read_table("http://dpaste.com/29R0BSS.txt",header=None, sep = " ")
In [3]: df
Out[3]:
0 1 2
0 1.00 1.00 1.00
1 3.23 4.23 0.33
2 1.23 0.13 3.44
3 4.55 12.30 14.10
4 2.00 2.00 2.00
5 1.21 1.11 1.11
6 3.55 5.44 5.22
7 4.11 1.00 4.00
Into a list of data frames, such that I can do something like this (written out by hand):
>> # Let's call the new data frame `nd`.
>> nd[1]
      0      1      2
0 3.23 4.23 0.33
1 1.23 0.13 3.44
2 4.55 12.30 14.10
There are lots of ways to do this; I tend to use groupby, e.g. something like
>>> import numpy as np
>>> # label every 4 consecutive rows with the same key and group on it
>>> grouped = df.groupby(np.arange(len(df)) // 4)
>>> # the first value of each chunk becomes the dict key, the rest the frame
>>> d = {v.iloc[0][0]: v.iloc[1:].reset_index(drop=True) for k, v in grouped}
>>> for k,v in d.items():
... print(k)
... print(v)
...
1.0
0 1 2
0 3.23 4.23 0.33
1 1.23 0.13 3.44
2 4.55 12.30 14.10
2.0
0 1 2
0 1.21 1.11 1.11
1 3.55 5.44 5.22
2 4.11 1.00 4.00
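If you want an actual list (as in the title) rather than a dict, the same split works, just 0-based; a minimal sketch:
import numpy as np

nd = [v.iloc[1:].reset_index(drop=True)
      for _, v in df.groupby(np.arange(len(df)) // 4)]
# nd[0] is the first chunk's values, nd[1] the second's, ...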
