Dropping multiple columns in pandas at once - python

I have a data set consisting of 135 columns. I am trying to drop the columns which have more than 60% empty data. There are approximately 40 such columns, so I wrote a function to drop these empty columns, but I am getting a "Not contained in axis" error. Could someone help me solve this? Or is there any other way to drop these 40 columns at once?
My function:
list_drop = df.isnull().sum() / len(df)
def empty(df):
    if list_drop > 0.5:
        df.drop(list_drop, axis=1, inplace=True)
    return df
The other method I tried:
df.drop(df.count()/len(df)<0.5,axis=1,inplace=True)

You could use isnull + sum and then use the mask to filter df.columns.
m = df.isnull().sum(0) / len(df) < 0.6
df = df[df.columns[m]]
Demo
df
A B C
0 29.0 NaN 26.6
1 NaN NaN 23.3
2 23.0 94.0 28.1
3 35.0 168.0 43.1
4 NaN NaN 25.6
5 32.0 88.0 31.0
6 NaN NaN 35.3
7 45.0 543.0 30.5
8 NaN NaN NaN
9 NaN NaN 37.6
10 NaN NaN 38.0
11 NaN NaN 27.1
12 23.0 846.0 30.1
13 19.0 175.0 25.8
14 NaN NaN 30.0
15 47.0 230.0 45.8
16 NaN NaN 29.6
17 38.0 83.0 43.3
18 30.0 96.0 34.6
m = df.isnull().sum(0) / len(df) < 0.3 # 0.3 as an example
m
A False
B False
C True
dtype: bool
df[df.columns[m]]
C
0 26.6
1 23.3
2 28.1
3 43.1
4 25.6
5 31.0
6 35.3
7 30.5
8 NaN
9 37.6
10 38.0
11 27.1
12 30.1
13 25.8
14 30.0
15 45.8
16 29.6
17 43.3
18 34.6
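As an aside, the same threshold can also be expressed with the built-in DataFrame.dropna. A minimal self-contained sketch (with a made-up two-column frame), assuming you want to drop columns that are more than 60% empty:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan, np.nan],
                   'B': [1, 2, 3, np.nan, 5]})

# Keep only columns with at least 40% non-null values,
# i.e. drop columns that are more than 60% empty.
df = df.dropna(axis=1, thresh=int(len(df) * 0.4))
print(df.columns.tolist())  # ['B'] -- 'A' is 80% empty here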

Related

Moving forward in a pandas dataframe looking for the first occurrence of multi-conditions with reset

I am having trouble with multi-conditions moving forward in a dataframe.
Here's a simplification of my model:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'date': pd.date_range(start='2022-05-12', periods=27),
    'l': [10.0,9.9,11.1,10.9,12.1,9.6,13.1,17.9,18.0,15.6,13.5,14.2,10.5,9.5,7.6,9.8,10.2,15.3,17.7,21.8,10.9,18.9,16.4,13.3,7.1,6.8,9.4],
    'c': [10.5,10.2,12.0,11.7,13.5,10.9,13.9,18.2,18.8,16.2,15.1,14.8,11.8,10.1,8.9,10.5,11.1,16.9,19.8,22.0,15.5,20.1,17.7,14.8,8.9,7.3,10.1],
    'h': [10.8,11.5,13.4,13.6,14.2,11.4,15.8,18.5,19.2,16.9,16.0,15.3,12.9,10.5,9.2,11.1,12.3,18.5,20.1,23.5,21.1,20.5,18.2,15.4,9.6,8.4,10.5],
    'oc': [False,True,False,False,False,True,True,True,False,False,True,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False],
    's': [np.nan,9.3,np.nan,np.nan,np.nan,14.5,14.4,np.nan,np.nan,np.nan,8.1,np.nan,10.7,np.nan,np.nan,np.nan,np.nan,6.9,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    'i': [np.nan,9.0,np.nan,np.nan,np.nan,13.6,13.4,np.nan,np.nan,np.nan,7.0,np.nan,9.9,np.nan,np.nan,np.nan,np.nan,9.2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    't': [np.nan,15.5,np.nan,np.nan,np.nan,16.1,15.9,np.nan,np.nan,np.nan,16.5,np.nan,17.2,np.nan,np.nan,np.nan,np.nan,25.0,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
})
df = df.set_index('date')
# df Index is datetime type
print(df)
l c h oc s i t
date
2022-05-12 10.0 10.5 10.8 False NaN NaN NaN
2022-05-13 9.9 10.2 11.5 True 9.3 9.0 15.5
2022-05-14 11.1 12.0 13.4 False NaN NaN NaN
2022-05-15 10.9 11.7 13.6 False NaN NaN NaN
2022-05-16 12.1 13.5 14.2 False NaN NaN NaN
2022-05-17 9.6 10.9 11.4 True 14.5 13.6 16.1
2022-05-18 13.1 13.9 15.8 True 14.4 13.4 15.9
2022-05-19 17.9 18.2 18.5 True NaN NaN NaN
2022-05-20 18.0 18.8 19.2 False NaN NaN NaN
2022-05-21 15.6 16.2 16.9 False NaN NaN NaN
2022-05-22 13.5 15.1 16.0 True 8.1 7.0 16.5
2022-05-23 14.2 14.8 15.3 False NaN NaN NaN
2022-05-24 10.5 11.8 12.9 True 10.7 9.9 17.2
2022-05-25 9.5 10.1 10.5 False NaN NaN NaN
2022-05-26 7.6 8.9 9.2 False NaN NaN NaN
2022-05-27 9.8 10.5 11.1 False NaN NaN NaN
2022-05-28 10.2 11.1 12.3 False NaN NaN NaN
2022-05-29 15.3 16.9 18.5 True 6.9 9.2 25.0
2022-05-30 17.7 19.8 20.1 False NaN NaN NaN
2022-05-31 21.8 22.0 23.5 False NaN NaN NaN
2022-06-01 10.9 15.5 21.1 False NaN NaN NaN
2022-06-02 18.9 20.1 20.5 False NaN NaN NaN
2022-06-03 16.4 17.7 18.2 False NaN NaN NaN
2022-06-04 13.3 14.8 15.4 False NaN NaN NaN
2022-06-05 7.1 8.9 9.6 False NaN NaN NaN
2022-06-06 6.8 7.3 8.4 False NaN NaN NaN
2022-06-07 9.4 10.1 10.5 False NaN NaN NaN
This is the result I am trying to achieve:
date l c h oc s i t cc diff r
0 2022-05-12 10.0 10.5 10.8 False NaN NaN NaN NaN NaN NaN
1 2022-05-13 9.9 10.2 11.5 True 9.3 9.0 15.5 NaN NaN NaN
2 2022-05-14 11.1 12.0 13.4 False NaN NaN NaN NaN NaN NaN
3 2022-05-15 10.9 11.7 13.6 False NaN NaN NaN NaN NaN NaN
4 2022-05-16 12.1 13.5 14.2 False NaN NaN NaN NaN NaN NaN
5 2022-05-17 9.6 10.9 11.4 True 14.5 13.6 16.1 NaN NaN NaN
6 2022-05-18 13.1 13.9 15.8 True 14.4 13.4 15.9 True 5.3 t
7 2022-05-19 17.9 18.2 18.5 True NaN NaN NaN NaN NaN NaN
8 2022-05-20 18.0 18.8 19.2 False NaN NaN NaN NaN NaN NaN
9 2022-05-21 15.6 16.2 16.9 False NaN NaN NaN NaN NaN NaN
10 2022-05-22 13.5 15.1 16.0 True 8.1 7.0 16.5 NaN NaN NaN
11 2022-05-23 14.2 14.8 15.3 False NaN NaN NaN NaN NaN NaN
12 2022-05-24 10.5 11.8 12.9 True 10.7 9.9 17.2 NaN NaN NaN
13 2022-05-25 9.5 10.1 10.5 False NaN NaN NaN NaN NaN NaN
14 2022-05-26 7.6 8.9 9.2 False NaN NaN NaN True -7.0 s
15 2022-05-27 9.8 10.5 11.1 False NaN NaN NaN NaN NaN NaN
16 2022-05-28 10.2 11.1 12.3 False NaN NaN NaN NaN NaN NaN
17 2022-05-29 15.3 16.9 18.5 True 6.9 9.2 25.0 NaN NaN NaN
18 2022-05-30 17.7 19.8 20.1 False NaN NaN NaN NaN NaN NaN
19 2022-05-31 21.8 22.0 23.5 False NaN NaN NaN NaN NaN NaN
20 2022-06-01 10.9 15.5 21.1 False NaN NaN NaN NaN NaN NaN
21 2022-06-02 18.9 20.1 20.5 False NaN NaN NaN NaN NaN NaN
22 2022-06-03 16.4 17.7 18.2 False NaN NaN NaN NaN NaN NaN
23 2022-06-04 13.3 14.8 15.4 False NaN NaN NaN NaN NaN NaN
24 2022-06-05 7.1 8.9 9.6 False NaN NaN NaN True -7.7 i
25 2022-06-06 6.8 7.3 8.4 False NaN NaN NaN NaN NaN NaN
26 2022-06-07 9.4 10.1 10.5 False NaN NaN NaN NaN NaN NaN
Principles:
We always move forward in the dataframe
When oc is True we 'memorize' both c, s, i and t values from this row
Moving forward we look for the first occurrence of one of the following conditions:
h >= t
l <= s
l <= i
When it happens, we set cc to True, calculate the difference against the values 'memorized' when oc was True, and write a letter to distinguish which condition was met:
If h >= t: diff = t-c and r = 't'
If l <= s: diff = s-c and r = 's'
If l <= i: diff = i-c and r = 'i'
Once one of the conditions has been met, we again look for a row where oc is True and then for the conditions to be met, and so on until the end of the dataframe.
If oc is True again before one of the conditions has been met, we skip it.
What happens chronologically:
2022-05-13: oc is True so we memorize c, s, i, t
2022-05-17: oc is True but none of the conditions have been met, yet -> omission
2022-05-18: h > t[2022-05-13] -> diff = t[2022-05-13]-c[2022-05-13] = 15.5-10.2 = 5.3, r = 't'
2022-05-22: oc is True so we memorize c, s, i, t
2022-05-24: oc is True but none of the conditions have been met, yet -> omission
2022-05-26: l < s[2022-05-22] -> diff = s[2022-05-22]-c[2022-05-22] = 8.1-15.1 = -7.0, r = 's'
2022-05-29: oc is True so we memorize c, s, i, t
2022-06-05: l < i[2022-05-29] -> diff = i[2022-05-29]-c[2022-05-29] = 9.2-16.9 = -7.7, r = 'i'
A loop works but takes an enormous amount of time; if possible I'd like to avoid one.
I've tried a really good solution from Baron Legendre described here, which works perfectly when looking for equal values, but I can't seem to adapt it to my model. I'm also having an index problem: I get different results when using a datetime index, even when I reset it.
I've been stuck on this problem for a while now, so any help would be gladly appreciated.
IIUC, you can use the commented code below:
mem = False  # Memory flag
data = []    # Store new values
# Create groups to speed the process (remove rows before first valid oc)
grp = df['oc'].cumsum().loc[lambda x: x > 0]
# For each group
for _, subdf in df.groupby(grp):
    # Memorize new oc fields (c, s, i, t)
    if not mem:
        oc = subdf.iloc[0][['c', 's', 'i', 't']]
        mem = True
    # Extract l and h fields
    lh = subdf.iloc[1:][['l', 'h']]
    # Try to extract the first row where one of the conditions is met
    sr = (pd.concat([lh['h'] >= oc['t'], lh['l'] <= oc['s'], lh['l'] <= oc['i']],
                    keys=['t', 's', 'i'], axis=1)
            .rename_axis(columns='r').stack().rename('cc')
            .loc[lambda x: x].head(1).reset_index('r').squeeze())
    # Keep this row if it exists and unlock the memory
    if not sr.empty:
        sr['diff'] = oc[sr['r']] - oc['c']
        data.append(sr)
        mem = False
# Merge new values
out = df.join(pd.concat(data, axis=1).T[['cc', 'r', 'diff']])
Output:
>>> out
l c h oc s i t cc r diff
date
2022-05-12 10.0 10.5 10.8 False NaN NaN NaN NaN NaN NaN
2022-05-13 9.9 10.2 11.5 True 9.3 9.0 15.5 NaN NaN NaN
2022-05-14 11.1 12.0 13.4 False NaN NaN NaN NaN NaN NaN
2022-05-15 10.9 11.7 13.6 False NaN NaN NaN NaN NaN NaN
2022-05-16 12.1 13.5 14.2 False NaN NaN NaN NaN NaN NaN
2022-05-17 9.6 10.9 11.4 True 14.5 13.6 16.1 NaN NaN NaN
2022-05-18 13.1 13.9 15.8 False NaN NaN NaN True t 5.3
2022-05-19 17.9 18.2 18.5 False NaN NaN NaN NaN NaN NaN
2022-05-20 18.0 18.8 19.2 False NaN NaN NaN NaN NaN NaN
2022-05-21 15.6 16.2 16.9 False NaN NaN NaN NaN NaN NaN
2022-05-22 13.5 15.1 16.0 True 8.1 7.0 16.5 NaN NaN NaN
2022-05-23 14.2 14.8 15.3 False NaN NaN NaN NaN NaN NaN
2022-05-24 10.5 11.8 12.9 True 10.7 9.9 17.2 NaN NaN NaN
2022-05-25 9.5 10.1 10.5 False NaN NaN NaN NaN NaN NaN
2022-05-26 7.6 8.9 9.2 False NaN NaN NaN True s -7.0
2022-05-27 9.8 10.5 11.1 False NaN NaN NaN NaN NaN NaN
2022-05-28 10.2 11.1 12.3 False NaN NaN NaN NaN NaN NaN
2022-05-29 15.3 16.9 18.5 True 6.9 9.2 25.0 NaN NaN NaN
2022-05-30 17.7 19.8 20.1 False NaN NaN NaN NaN NaN NaN
2022-05-31 21.8 22.0 23.5 False NaN NaN NaN NaN NaN NaN
2022-06-01 10.9 15.5 21.1 False NaN NaN NaN NaN NaN NaN
2022-06-02 18.9 20.1 20.5 False NaN NaN NaN NaN NaN NaN
2022-06-03 16.4 17.7 18.2 False NaN NaN NaN NaN NaN NaN
2022-06-04 13.3 14.8 15.4 False NaN NaN NaN NaN NaN NaN
2022-06-05 7.1 8.9 9.6 False NaN NaN NaN True i -7.7
2022-06-06 6.8 7.3 8.4 False NaN NaN NaN NaN NaN NaN
2022-06-07 9.4 10.1 10.5 False NaN NaN NaN NaN NaN NaN
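The pd.concat(...).stack() chain in the middle of the loop does the heavy lifting: it builds one boolean column per condition, keyed by the letter to report, stacks them into a single labeled Series, and keeps the first True. A stripped-down sketch of just that trick, with made-up l/h rows and the oc values memorized on 2022-05-13:
import pandas as pd

lh = pd.DataFrame({'l': [12.0, 11.0, 9.0], 'h': [13.0, 16.0, 10.0]})
oc = pd.Series({'c': 10.2, 's': 9.3, 'i': 9.0, 't': 15.5})

# One boolean column per condition; stack -> first True wins.
sr = (pd.concat([lh['h'] >= oc['t'], lh['l'] <= oc['s'], lh['l'] <= oc['i']],
                keys=['t', 's', 'i'], axis=1)
        .rename_axis(columns='r').stack().rename('cc')
        .loc[lambda x: x].head(1).reset_index('r').squeeze())
print(sr)  # r='t', cc=True: row 1 is the first where any condition fires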

Converting a table of fixed width in text format into dataframe/excel/csv

I have some data in txt format with 38 columns which looks like this:
With the exception of the header row, most of the rows have missing values. I want to convert this table into an array/dataframe/Excel file, but it is not coming out the way it looks in the table.
I tried using Python:
df = pandas.read_csv(filename, sep='\s+', names=colnames, header=None)
I am confused about what separator to use.
The program should look for a value after a single space. If no value is present, it should fill it with NaN. How can I do that?
Thanks in advance!
You can use pandas.read_fwf (fixed-width format):
>>> df = pd.read_fwf('data.txt')
>>> df
INDEX YEAR MN DT MAX MIN ... T.2 G.2 DUR.2 T.3 G.3 DUR.3
0 14210 1972 9 1 32.0 22.0 ... NaN NaN NaN NaN NaN NaN
1 14210 1972 9 2 32.3 21.5 ... NaN NaN NaN NaN NaN NaN
2 14210 1972 9 3 32.8 22.4 ... NaN NaN NaN NaN NaN NaN
3 14210 1972 9 4 32.0 22.0 ... NaN NaN NaN NaN NaN NaN
4 14210 1972 9 5 33.2 23.6 ... 0.0 7.0 280.0 NaN NaN NaN
5 14210 1972 9 6 31.6 23.2 ... 5.0 8.0 45.0 0.0 8.0 NaN
6 14210 1972 9 7 31.5 21.0 ... 5.0 4.0 45.0 NaN NaN NaN
7 14210 1972 9 8 29.7 21.6 ... NaN NaN NaN NaN NaN NaN
8 14210 1972 9 9 29.7 21.1 ... NaN NaN NaN NaN NaN NaN
9 14210 1972 9 10 27.6 21.5 ... NaN NaN NaN NaN NaN NaN
10 14210 1972 9 11 30.3 21.3 ... 6.0 1.0 80.0 NaN NaN NaN
11 14210 1972 9 12 30.6 22.0 ... 5.0 5.0 30.0 NaN NaN NaN
12 14210 1972 9 13 30.2 21.4 ... 0.0 7.0 195.0 NaN NaN NaN
13 14210 1972 9 14 28.2 21.5 ... NaN NaN NaN NaN NaN NaN
14 14210 1972 9 15 30.3 21.9 ... 0.0 7.0 305.0 NaN NaN NaN
15 14210 1972 9 17 32.0 22.0 ... 6.0 7.0 135.0 NaN NaN NaN
16 14210 1972 9 18 32.0 20.5 ... 6.0 6.0 80.0 5.0 NaN NaN
[17 rows x 38 columns]
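If the automatic boundary inference ever guesses wrong, read_fwf also accepts explicit column positions. A small sketch with hypothetical field widths standing in for your real layout:
import io
import pandas as pd

raw = io.StringIO("INDEX  YEAR  MN\n"
                  "14210  1972   9\n"
                  "14210  1972   9\n")

# widths gives the character width of each field; alternatively,
# colspecs takes explicit (start, end) pairs for irregular layouts.
df = pd.read_fwf(raw, widths=[7, 6, 2])
print(df)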

To scrape the data from span tag using beautifulsoup

I am trying to scrape the webpage, where I need to decode the entire table into a dataframe. I am using beautiful soup for this purpose. In certain td tags, there are span tags which do not have any text. But the values are shown on the webpage in that particular span tag.
The following html code corresponds to that webpage,
<td>
<span class="nttu">::after</span>
<span class="ntbb">::after</span>
<span class="ntyc">::after</span>
<span class="nttu">::after</span>
</td>
But the value shown in this td tag is 23.8. I tried to scrape it, but I am getting an empty string.
How can I scrape this value using Beautiful Soup?
URL: https://en.tutiempo.net/climate/ws-432950.html
My code for scraping the table is given below:
http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retreived_data = requests.get(http_url).text
soup = BeautifulSoup(retreived_data, "lxml")
climate_table = soup.find("table", attrs={"class": "medias mensuales numspan"})
climate_data = climate_table.find_all("tr")
for data in climate_data[1:-2]:
    table_data = data.find_all("td")
    row_data = []
    for row in table_data:
        row_data.append(row.get_text())
    climate_df.loc[len(climate_df)] = row_data
I misunderstood your question at first, as you have two different URLs referenced. I see now what you mean.
Yeah, it is weird that in that second table they used CSS to fill in the content of some of those <td> tags. What you need to do is pull those special cases out of the <style> tag. Once you have that, you can replace those elements within the HTML source and finally parse it into a dataframe. I used pandas, as it uses BeautifulSoup under the hood to parse <table> tags. I believe this will get you what you want:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retreived_data = requests.get(http_url).text
soup = BeautifulSoup(retreived_data, "lxml")
hiddenData = str(soup.find_all('style')[1])
hiddenSpan = {}
for group in re.findall(r'span\.(.+?)}', hiddenData):
    class_attr = group.split('span.')[-1].split('::')[0]
    content = group.split('"')[1]
    hiddenSpan[class_attr] = content
climate_table = str(soup.find("table", attrs={"class": "medias mensuales numspan"}))
for k, v in hiddenSpan.items():
    climate_table = climate_table.replace('<span class="%s"></span>' % k, v)
df = pd.read_html(climate_table)[0]
Output:
print (df.to_string())
Day T TM Tm SLP H PP VV V VM VG RA SN TS FG
0 1 23.4 30.3 19 - 59 0 6.3 4.3 5.4 - NaN NaN NaN NaN
1 2 22.4 30.3 16.9 - 57 0 6.9 3.3 7.6 - NaN NaN NaN NaN
2 3 24 31.8 16.9 - 51 0 6.9 2.8 5.4 - NaN NaN NaN NaN
3 4 24.2 32 17.4 - 53 0 6 3.3 5.4 - NaN NaN NaN NaN
4 5 23.8 32 18 - 58 0 6.9 3.1 7.6 - NaN NaN NaN NaN
5 6 23.3 31 18.3 - 60 0 6.9 5 9.4 - NaN NaN NaN NaN
6 7 22.8 30.2 17.6 - 55 0 7.7 3.7 7.6 - NaN NaN NaN NaN
7 8 23.1 30.6 17.4 - 46 0 6.9 3.3 5.4 - NaN NaN NaN NaN
8 9 22.9 30.6 17.4 - 51 0 6.9 3.5 3.5 - NaN NaN NaN NaN
9 10 22.3 30 17 - 56 0 6.3 3.3 7.6 - NaN NaN NaN NaN
10 11 22.3 29.4 17 - 53 0 6.9 4.3 7.6 - NaN NaN NaN NaN
11 12 21.8 29.4 15.7 - 54 0 6.9 2.8 3.5 - NaN NaN NaN NaN
12 13 22.3 30.1 15.7 - 43 0 6.9 2.8 5.4 - NaN NaN NaN NaN
13 14 21.8 30.6 14.8 - 41 0 6.9 1.9 5.4 - NaN NaN NaN NaN
14 15 21.6 30.6 14.2 - 43 0 6.9 3.1 7.6 - NaN NaN NaN NaN
15 16 21.1 29.9 15.4 - 55 0 6.9 4.1 7.6 - NaN NaN NaN NaN
16 17 20.4 28.1 15.4 - 59 0 6.9 5 11.1 - NaN NaN NaN NaN
17 18 21.2 28.3 14.5 - 53 0 6.9 3.1 7.6 - NaN NaN NaN NaN
18 19 21.6 29.6 16.4 - 58 0 6.9 2.2 3.5 - NaN NaN NaN NaN
19 20 21.9 29.6 16.6 - 58 0 6.9 2.4 5.4 - NaN NaN NaN NaN
20 21 22.3 29.9 17.5 - 55 0 6.9 3.1 5.4 - NaN NaN NaN NaN
21 22 21.9 29.9 15.1 - 46 0 6.9 4.3 7.6 - NaN NaN NaN NaN
22 23 21.3 29 15.2 - 50 0 6.9 3.3 5.4 - NaN NaN NaN NaN
23 24 21.3 28.8 14.6 - 45 0 6.9 3 5.4 - NaN NaN NaN NaN
24 25 21.6 29.1 15.5 - 47 0 7.7 4.8 7.6 - NaN NaN NaN NaN
25 26 21.8 29.2 14.6 - 41 0 6.9 2.8 3.5 - NaN NaN NaN NaN
26 27 22.3 30.1 15.6 - 40 0 6.9 2.4 5.4 - NaN NaN NaN NaN
27 28 22.4 30.3 16 - 51 0 6.9 2.8 3.5 - NaN NaN NaN NaN
28 29 23 30.3 16.9 - 53 0 6.6 2.8 5.4 - NaN NaN NaN o
29 30 23.1 30 17.8 - 54 0 6.9 5.4 7.6 - NaN NaN NaN NaN
30 31 22.1 29.8 17.3 - 54 0 6.9 5.2 9.4 - NaN NaN NaN NaN
31 Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals:
32 NaN 22.3 30 16.4 - 51.6 0 6.9 3.5 6.3 NaN 0 0 0 1
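To see the substitution step in isolation, here is a minimal self-contained sketch with made-up class names and values (not the real site's markup), showing how CSS-generated content can be folded back into the HTML before pd.read_html parses it:
import io
import re
import pandas as pd

html = """<style>span.n2::after{content:"2"}span.n3::after{content:"3"}</style>
<table><tr><th>T</th></tr>
<tr><td><span class="n2"></span><span class="n3"></span>.8</td></tr></table>"""

# Map each span class to the text CSS would have rendered.
hidden = {m.group(1): m.group(2)
          for m in re.finditer(r'span\.(\w+)::after\{content:"(.+?)"\}', html)}
for cls, text in hidden.items():
    html = html.replace('<span class="%s"></span>' % cls, text)

print(pd.read_html(io.StringIO(html))[0])  # the T cell reads 23.8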

Is there a way to replace a whole pandas dataframe row using ffill, if one value of a specific column is NaN?

I am trying to sort out a dataframe where some rows are all NaN. I want to fill these using ffill. I'm currently trying this, although I feel like it's a mishmash of a few commands:
df.loc[df['A'].isna(), :] = df.fillna(method='ffill')
This gives an error:
AttributeError: 'NoneType' object has no attribute 'fillna'
but I want to filter which NaNs get ffilled based on whether one specific column is NaN, i.e.
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 NaN NaN NaN NaN NaN
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 NaN NaN NaN NaN NaN
So I would only like to fill a row iff the value of A is NaN, whilst leaving (C, 0) and (D, 0) as NaN, giving the dataframe below:
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 85 65 11 31 5
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 10 32 5 15 11
So just to clarify: the ONLY rows that get replaced with ffill are 3 and 8, and the reason is that the value of column A in rows 3 and 8 is NaN.
Thanks
---Update---
When I'm debugging and evaluate the expression df.loc[df['A'].isna(), :], I get
3 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
So I assume what's happening here is that I then attempt ffill on this new dataframe containing only rows 3 and 8, and obviously I can't ffill NaNs with NaNs.
Change values only in those rows where column A starts as NaN:
df.loc[df['A'].isna(), :] = df.ffill().loc[df['A'].isna(), :]
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
Try using a mask to identify the relevant rows where column A is null. Then take those same rows from the forward-filled dataframe.
mask = df['A'].isnull()
df.loc[mask, :] = df.ffill().loc[mask, :]
>>> df
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
You just want to fill (DataFrame.ffill) where (DataFrame.where) df['A'] is NaN, and leave the rest as it was before (df):
df = df.ffill().where(df['A'].isna(), df)
print(df)
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
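All three variants give the same frame. For trying them side by side, here is a self-contained version of the example data, typed out from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [45, 62, 85, np.nan, 90, 0, 58, 10, np.nan],
    'B': [88, 34, 65, np.nan, 38, 94, np.nan, 32, np.nan],
    'C': [np.nan, 2, 11, np.nan, 34, 45, 23, 5, np.nan],
    'D': [np.nan, 86, 31, np.nan, 93, 10, 60, 15, np.nan],
    'E': [3, np.nan, 5, np.nan, 8, 10, 11, 11, np.nan]})

mask = df['A'].isna()
df.loc[mask, :] = df.ffill().loc[mask, :]  # rows 3 and 8 copy rows 2 and 7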

Appending variable length columns in Pandas dataframe Python

I have a few csv files which contain a pair of bearings for many locations. I am trying to expand the values to include every number between the bearing pairs for each location and export the variable lengths as a csv in the same format.
Example:
df = pd.read_csv('bearing.csv')
Data structure:
A B C D E
0 0 94 70 67 84
1 120 132 109 152 150
Ideal result is a variable length multidimensional array:
A B C D E
0 0 94 70 67 84
1 1 95 71 68 85
2 2 96 72 69 86
...
n 120 132 109 152 150
I am looping through each column and getting the range of the pair of values, but I am struggling when trying to overwrite the old column with the new range of values.
for col in bear:
    min_val = min(bear[col])
    max_val = max(bear[col])
    range_vals = range(min_val, max_val + 1)
    bear[col] = range_vals
I am getting the following error:
ValueError: Length of values does not match length of index
You can use a dict comprehension with min and max in the DataFrame constructor, but you get a lot of NaN values at the end of the shorter columns:
df = pd.DataFrame({col: pd.Series(range(df[col].min(),
                                        df[col].max() + 1)) for col in df.columns})
print (df)
A B C D E
0 0 94.0 70.0 67.0 84.0
1 1 95.0 71.0 68.0 85.0
2 2 96.0 72.0 69.0 86.0
3 3 97.0 73.0 70.0 87.0
4 4 98.0 74.0 71.0 88.0
5 5 99.0 75.0 72.0 89.0
6 6 100.0 76.0 73.0 90.0
7 7 101.0 77.0 74.0 91.0
8 8 102.0 78.0 75.0 92.0
9 9 103.0 79.0 76.0 93.0
10 10 104.0 80.0 77.0 94.0
11 11 105.0 81.0 78.0 95.0
12 12 106.0 82.0 79.0 96.0
13 13 107.0 83.0 80.0 97.0
14 14 108.0 84.0 81.0 98.0
15 15 109.0 85.0 82.0 99.0
16 16 110.0 86.0 83.0 100.0
17 17 111.0 87.0 84.0 101.0
18 18 112.0 88.0 85.0 102.0
19 19 113.0 89.0 86.0 103.0
20 20 114.0 90.0 87.0 104.0
21 21 115.0 91.0 88.0 105.0
22 22 116.0 92.0 89.0 106.0
23 23 117.0 93.0 90.0 107.0
24 24 118.0 94.0 91.0 108.0
25 25 119.0 95.0 92.0 109.0
26 26 120.0 96.0 93.0 110.0
27 27 121.0 97.0 94.0 111.0
28 28 122.0 98.0 95.0 112.0
29 29 123.0 99.0 96.0 113.0
.. ... ... ... ... ...
91 91 NaN NaN NaN NaN
92 92 NaN NaN NaN NaN
93 93 NaN NaN NaN NaN
94 94 NaN NaN NaN NaN
95 95 NaN NaN NaN NaN
96 96 NaN NaN NaN NaN
97 97 NaN NaN NaN NaN
98 98 NaN NaN NaN NaN
99 99 NaN NaN NaN NaN
100 100 NaN NaN NaN NaN
101 101 NaN NaN NaN NaN
102 102 NaN NaN NaN NaN
103 103 NaN NaN NaN NaN
104 104 NaN NaN NaN NaN
105 105 NaN NaN NaN NaN
106 106 NaN NaN NaN NaN
107 107 NaN NaN NaN NaN
108 108 NaN NaN NaN NaN
109 109 NaN NaN NaN NaN
110 110 NaN NaN NaN NaN
111 111 NaN NaN NaN NaN
112 112 NaN NaN NaN NaN
113 113 NaN NaN NaN NaN
114 114 NaN NaN NaN NaN
115 115 NaN NaN NaN NaN
116 116 NaN NaN NaN NaN
117 117 NaN NaN NaN NaN
118 118 NaN NaN NaN NaN
119 119 NaN NaN NaN NaN
120 120 NaN NaN NaN NaN
If you have only a few columns, it is also possible to use:
df = pd.DataFrame({'A': pd.Series(range(df.A.min(), df.A.max() + 1)),
                   'B': pd.Series(range(df.B.min(), df.B.max() + 1))})
EDIT:
If the min value is in the first row and the max in the last, you can use iloc:
df = pd.DataFrame({col: pd.Series(range(df[col].iloc[0],
                                        df[col].iloc[-1] + 1)) for col in df.columns})
Timings:
In [3]: %timeit ( pd.DataFrame({col: pd.Series(range(df[col].iloc[0], df[col].iloc[-1] + 1)) for col in df.columns }) )
1000 loops, best of 3: 1.75 ms per loop
In [4]: %timeit ( pd.DataFrame({col: pd.Series(range(df[col].min(), df[col].max() + 1)) for col in df.columns }) )
The slowest run took 5.50 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.18 ms per loop
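For completeness, a self-contained version of the dict-comprehension idea, building the two-row frame from the question and exporting the expanded result (the output filename is made up):
import pandas as pd

df = pd.DataFrame({'A': [0, 120], 'B': [94, 132], 'C': [70, 109],
                   'D': [67, 152], 'E': [84, 150]})

out = pd.DataFrame({col: pd.Series(range(df[col].min(), df[col].max() + 1))
                    for col in df.columns})
out.to_csv('bearing_expanded.csv', index=False)  # NaN cells export as blanks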
