Removing columns for which the column names are floats - python

I have data as follows:
import pandas as pd
url_cities="https://population.un.org/wup/Download/Files/WUP2018-F12-Cities_Over_300K.xls"
df_cities = pd.read_excel(url_cities)
i = df_cities.iloc[:, 1].notna().idxmax()
df_cities.columns = df_cities.iloc[i].tolist()
df_cities = df_cities.iloc[i+1:]
df_cities = df_cities.rename(columns={2020.0: 'City_pop'})
print(df_cities.iloc[0:20,])
I want to remove all columns for which the column names (NOT COLUMN VALUES) are floats.
I have looked at a couple of links (A, B, C), but I could not find the answer. Any suggestions?

This will do what your question asks:
df = df[[col for col in df.columns if not isinstance(col, float)]]
Example:
import pandas as pd
df = pd.DataFrame(columns=['a',1.1,'b',2.2,3,True,4.4,'c'],data=[[1,2,3,4,5,6,7,8],[11,12,13,14,15,16,17,18]])
print(df)
df = df[[col for col in df.columns if not isinstance(col, float)]]
print(df)
Initial dataframe:
a 1.1 b 2.2 3 True 4.4 c
0 1 2 3 4 5 6 7 8
1 11 12 13 14 15 16 17 18
Result:
a b 3 True c
0 1 3 5 6 8
1 11 13 15 16 18
Note that 3 is an int, not a float, so its column has not been removed.
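If you instead want to keep only columns whose names are strings (so that int and bool names such as 3 and True are dropped as well), the same pattern works with a positive check; a minimal sketch:
df = df[[col for col in df.columns if isinstance(col, str)]]  # keep only str-named columns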

my_list = list(df_cities.columns)
for i in my_list:
    if type(i) != str:
        df_cities = df_cities.drop(columns=[i], axis=1)
Please try this code.
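Equivalently, the loop above can be collapsed into a single drop call; a sketch of the same idea:
# drop every column whose name is not a string, in one call
df_cities = df_cities.drop(columns=[c for c in df_cities.columns if not isinstance(c, str)])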

I think your basic problem is the call to read the Excel file.
If you skip the early rows and define the index correctly, you avoid the issue of having to remove float column headers altogether.
So change your call to read the Excel file to the following:
df_cities = pd.read_excel(url_cities, skiprows=16, index_col=0)
Which yields a df like the following:
Country Code Country or area City Code Urban Agglomeration Note Latitude Longitude 1950 1955 1960 ... 1990 1995 2000 2005 2010 2015 2020 2025 2030 2035
Index
1 4 Afghanistan 20001 Herat NaN 34.348170 62.199670 82.468 85.751 89.166 ... 183.465 207.190 233.991 275.678 358.691 466.703 605.575 752.910 897.041 1057.573
2 4 Afghanistan 20002 Kabul NaN 34.528887 69.172460 170.784 220.749 285.352 ... 1549.320 1928.694 2401.109 2905.178 3289.005 3723.543 4221.532 4877.024 5737.138 6760.500
3 4 Afghanistan 20003 Kandahar NaN 31.613320 65.710130 82.199 89.785 98.074 ... 233.243 263.395 297.456 336.746 383.498 436.741 498.002 577.128 679.278 800.461
4 4 Afghanistan 20004 Mazar-e Sharif NaN 36.709040 67.110870 30.000 37.139 45.979 ... 135.153 152.629 172.372 206.403 283.532 389.483 532.689 681.531 816.040 962.262
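If you still want the original City_pop rename on top of this, the year headers appear to come through as plain integers with this read (a guess based on the printout above; if they are still floats, keep the original 2020.0 key):
# assumption: with skiprows=16 the year columns are parsed as ints, not floats
df_cities = df_cities.rename(columns={2020: 'City_pop'})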

Related

Expanding a column of pandas dataframe with variable number of elements and leading texts

I am trying to expand a column of a pandas dataframe (see column segments in the example below). I am able to break it out into the components separated by ';'. However, as you can see, some of the rows in the column do not have all the elements. So what is happening is that the data which should go into the Geo column ends up going into the BusSeg column, since there was no Geo entry; or the data that should be in the ProdServ column ends up in the Geo column.
Ideally I would like to have only the data, and not the indicator, correctly placed in each cell. So in the Geo column it should say 'NonUs', not 'Geo=NonUs'. That is, after separating correctly, I would like to remove the text up to and including the '=' sign in each. How can I do this?
Code below:
import pandas as pd
company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
'Subseg=Tr1',
'BusSeg=Pharma',
'Geo=China;Prd=Alpha;Subseg=Tr4;',
'Prd=Beta;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
'BusSeg=Pharma;Geo=NonUs;']
print("\ndf1:")
df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True)
print(df1)
print(df1[['BusSeg','Geo','ProdServ','Sub','Misc']])
print(df1.dtypes)
print()
Your Data
import pandas as pd
company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
'Subseg=Tr1',
'BusSeg=Pharma',
'Geo=China;Prd=Alpha;Subseg=Tr4;',
'Prd=Beta;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
'BusSeg=Pharma;Geo=NonUs;']
df:
company clv date line segments
0 Rev 500 20191231 1 BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1
1 Rev 200 20191231 3 BusSeg=Dev;Prd=Alpha;Subseg=Tr1
2 Rev 3000 20191231 2 BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2
3 Rev 400 20181231 1 Subseg=Tr1
4 Rev 10 20181231 3 BusSeg=Pharma
5 Rev 300 20181231 2 Geo=China;Prd=Alpha;Subseg=Tr4;
6 Rev 560 20171231 1 Prd=Beta;Subseg=Tr1
7 Rev 500 20171231 3 BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;
8 Rev 600 20171231 2 BusSeg=Pharma;Geo=NonUs;
Comment out this line in your code: df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True), and add these two lines instead:
d = pd.DataFrame(df1['segments'].str.split(';').apply(lambda x:{i.split("=")[0] : i.split("=")[1] for i in x if i}).to_dict()).T
df = pd.concat([df1, d], axis=1)
df:
company clv date line segments BusSeg Geo Prd Subseg
0 Rev 500 20191231 1 BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1 Pharma NonUs Alpha Tr1
1 Rev 200 20191231 3 BusSeg=Dev;Prd=Alpha;Subseg=Tr1 Dev NaN Alpha Tr1
2 Rev 3000 20191231 2 BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2 Pharma US Alpha Tr2
3 Rev 400 20181231 1 Subseg=Tr1 NaN NaN NaN Tr1
4 Rev 10 20181231 3 BusSeg=Pharma Pharma NaN NaN NaN
5 Rev 300 20181231 2 Geo=China;Prd=Alpha;Subseg=Tr4; NaN China Alpha Tr4
6 Rev 560 20171231 1 Prd=Beta;Subseg=Tr1 NaN NaN Beta Tr1
7 Rev 500 20171231 3 BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1; Pharma US Delta Tr1
8 Rev 600 20171231 2 BusSeg=Pharma;Geo=NonUs; Pharma NonUs NaN NaN
I suggest filling the columns one by one instead of using split, with something like the following code:
col = ['BusSeg', 'Geo', 'ProdServ', 'Sub']   # column names to create
var = ['BusSeg', 'Geo', 'Prd', 'Subseg']     # variable names inside the 'segments' column
for c, v in zip(col, var):
    df1[c] = df1['segments'].str.extract(v + r'=(\w*)')
Here's a suggestion:
df1.segments = (df1.segments.str.split(';')
                .apply(lambda s:
                       dict(t.split('=') for t in s if t.strip() != '')))
df2 = pd.DataFrame({col: [dict_.get(col, '') for dict_ in df1.segments]
                    for col in set().union(*df1.segments)},
                   index=df1.index)
df1.drop(columns=['segments'], inplace=True)
df1 = pd.concat([df1, df2], axis='columns')
Result:
company clv date line Subseg Geo BusSeg Prd
0 Rev 500 20191231 1 Tr1 NonUs Pharma Alpha
1 Rev 200 20191231 3 Tr1 Dev Alpha
2 Rev 3000 20191231 2 Tr2 US Pharma Alpha
3 Rev 400 20181231 1 Tr1
4 Rev 10 20181231 3 Pharma
5 Rev 300 20181231 2 Tr4 China Alpha
6 Rev 560 20171231 1 Tr1 Beta
7 Rev 500 20171231 3 Tr1 US Pharma Delta
8 Rev 600 20171231 2 NonUs Pharma
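For completeness, a compact variant of the same dict-parsing idea used above (a sketch, not a different algorithm): strip the trailing ';', split, build one dict per row, and let the DataFrame constructor create the columns:
# parse 'key=value' pairs into one dict per row, then expand into columns
parsed = (df1['segments']
          .str.strip(';')
          .str.split(';')
          .apply(lambda parts: dict(p.split('=') for p in parts if p)))
df1 = df1.join(pd.DataFrame(parsed.tolist(), index=df1.index))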

Multiply rows starting by the same string in two different dataframes

I have two dataframes with the following structure:
hheuwh_df
Out[8]:
2017 2018
Geo
AT 2010.995000 1970.876944
BE 5432.053611 5457.952778
BG 105.436667 125.081944
CZ 5268.712500 5120.062222
DE 49986.276111 53605.346667
DK 1795.833333 1955.555556
EE 82.849167 82.500000
EL 165.879722 469.332500
ES 13977.728056 13635.448611
FI 61.250000 59.000000
FR 14052.315278 13945.601389
HR 1037.459167 1010.527500
HU 3441.843611 3336.155278
IE 1771.082500 1832.023333
IT 29621.374444 29911.172778
LT 196.750000 207.000000
LU 304.662222 250.367778
LV 244.393889 261.590833
NL 16704.166667 16704.166667
PL 10973.000000 10973.000000
PT 1801.363056 1961.978333
RO 6404.175000 6649.063333
SE 79.750000 72.000000
SI 265.800000 259.135556
SK 1635.002500 1598.825000
temp
Out[9]:
Percentage2017 Percentage2018
Geo
AT11 0.033278 0.033175
AT12 0.189876 0.189369
AT13 0.212882 0.214092
AT21 0.063956 0.063578
AT22 0.141037 0.140578
... ...
SI04 0.471823 0.472772
SK01 0.118096 0.119571
SK02 0.336823 0.335915
SK03 0.246955 0.246331
SK04 0.298125 0.298182
[242 rows x 2 columns]
I would like to multiply the values whose indexes start with the same characters in both dataframes.
That is, for each column multiply AT1, AT2, AT3 by AT, FR1, FR5 and FR7 by FR, and so on. What would be the best way to achieve this, storing the results either in the second dataframe or in a new dataframe? Thank you in advance.
If all values should be matched by their first 2 letters, select them in rename, multiply, and finally set the same index values as the original:
df = df.rename(lambda x: x[:2]).mul(df2).set_index(df.index)
print (df)
1995 1996
AT1 3 18
AT2 5 9
AT3 2 3
FR1 2 4
FR5 4 4
FR7 14 32
If you need to match values with the digits removed, use str.replace, which works with a regex:
df1 = df.copy()
df1.index = df1.index.str.replace(r'\d', '', regex=True)
df1 = df1.mul(df2).set_index(df.index)
print (df1)
1995 1996
AT1 3 18
AT2 5 9
AT3 2 3
FR1 2 4
FR5 4 4
FR7 14 32
Or rename:
import re
df = df.rename(lambda x: re.sub(r'\d', '', x)).mul(df2).set_index(df.index)
print (df)
1995 1996
AT1 3 18
AT2 5 9
AT3 2 3
FR1 2 4
FR5 4 4
FR7 14 32
EDIT:
If the column labels of hheuwh_df are integers, also convert the last 4 characters of temp's column names to integer years, and drop the rows that are entirely missing:
print (type(hheuwh_df.columns[0]))
int
df = (temp.rename(index=lambda x: x[:2],
                  columns=lambda x: int(x[-4:]))
          .mul(hheuwh_df)
          .dropna(how='all')
          .set_index(temp.index))
print (df)
2017 2018
Geo
AT11 66.921892 65.383843
AT12 381.839687 373.222996
AT13 428.104638 421.948987
AT21 128.615196 125.304414
AT22 283.624702 277.061939
SI04 125.410553 122.512035
SK01 193.087255 191.173104
SK02 550.706447 537.069300
SK03 403.772042 393.840161
SK04 487.435120 476.740836

Pandas how to preserve all values in dataframe into a csv?

I want to convert the HTML to CSV using pandas functions.
This is part of what I read into the dataframe df:
0 1
0 sequence 2
1 trainNo K805
2 trainNumber K805
3 departStation 鹰潭
4 departStationPy yingtan
5 arriveStation 南昌
6 arriveStationPy nanchang
7 departDate 2020-05-24
8 departTime 03:55
9 arriveDate 2020-05-24
10 arriveTime 05:44
11 isStartStation False
12 isEndStation False
13 runTime 1小时49分钟
14 preSaleTime NaN
15 takeDays 0
16 isBookable True
17 seatList seatNamepriceorderPriceinventoryisBookablebutt...
18 curSeatIndex 0
seatName price orderPrice inventory isBookable buttonDisplayName buttonType
0 硬座 23.5 23.5 99 True NaN 0
1 硬卧 69.5 69.5 99 True NaN 0
2 软卧 104.5 104.5 4 True NaN 0
0 1
0 departDate 2020-05-23
1 departStationList NaN
2 endStationList NaN
3 departStationFilterMap NaN
4 endStationFilterMap NaN
5 departCityName 上海
6 arriveCityName 南昌
7 gtMinPrice NaN
My code is like this:
for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
    df.to_csv(".\other.csv", index=True, encoding='utf-8-sig')
To preserve the characters in csv, I need to use utf-8-sig encoding. But I don't know how to write the format symbol %
,0,1
0,departDate,2020-05-23
1,departStationList,
2,endStationList,
3,departStationFilterMap,
4,endStationFilterMap,
5,departCityName,上海
6,arriveCityName,南昌
7,gtMinPrice,
This is what I got in the csv file; only the last part is preserved.
The dataframes are correct, but the csv needs correction. Can you show me how to produce the correct output?
You're saving each dataframe to the same file, so each one gets overwritten until only the last remains.
Note the addition of the f-string to change the save file name, e.g. f".\other_{i}.csv".
Each dataframe is a different shape, so they won't all fit together properly in one table.
To CSV
for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
    df.to_csv(f".\other_{i}.csv", index=True, encoding='utf-8-sig')
To Excel
with pd.ExcelWriter('output.xlsx', mode='w') as writer:
    for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
        df.to_excel(writer, sheet_name=f'Sheet{i}', encoding='utf-8-sig')
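If you actually want everything in one file rather than one file per table, you could append each table to the same handle (a sketch; it assumes stacking tables of different shapes into one CSV is acceptable):
# write every parsed table into a single CSV, one after another
with open('other.csv', 'w', encoding='utf-8-sig', newline='') as f:
    for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
        f.write(f'table {i}\n')      # small separator line between tables
        df.to_csv(f, index=True)
        f.write('\n')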

Skipping a row if more than 2 fields are empty

First, skip a row of data if it has more than 2 empty columns. After this step, the rows with more than 2 missing values will be filtered out.
Then, since some rows still have 1 or 2 empty columns, I will fill each empty cell with the mean value of that row.
I can run the second step with my code below; however, I am not sure how to filter out the rows with more than 2 missing values.
I have tried using dropna but it deleted all the columns of the table.
My code:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as pp
%matplotlib inline
# high technology exports percentage of manufatory exports
hightech_export = pd.read_csv('hightech_export_1.csv')
# skip a row of data if it has more than 2 empty columns
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
# Fill in data with mean value.
m = hightech_export.mean(axis=1)
for i, col in enumerate(hightech_export):
    hightech_export.iloc[:, i] = hightech_export.iloc[:, i].fillna(m)
My dataset:
Country Name 2001 2002 2003 2004
Philippines 71
Malta 62 58 60 58
Singapore 60 56
Malaysia 58 57 55
Ireland 47 41 34 34
Georgia 38 41 24 38
Costa Rica
You can make use of the .isnull() method for your first task.
Replace this:
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
with:
hightech_export = hightech_export.loc[hightech_export.isnull().sum(axis=1) <= 2]
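Putting both of the question's steps together (a sketch; it assumes 'Country Name' is the only non-numeric column, so the row means are taken over the year columns only):
# step 1: keep rows with at most 2 missing values
kept = hightech_export.loc[hightech_export.isnull().sum(axis=1) <= 2].copy()
# step 2: fill the remaining gaps with the mean of that row (year columns only)
year_cols = kept.columns.drop('Country Name')
row_means = kept[year_cols].mean(axis=1)
for c in year_cols:
    kept[c] = kept[c].fillna(row_means)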
Ok try this ...
import pandas as pd
import numpy as np
data1={'Name':['Tom',np.NaN,'Mary','Jane'],'Age':[20,np.NaN,40,30],'Pay':[np.NaN,np.NaN,20,25]}
data2={'Name':['Tom','Bob','Mary'],'Age':[40,30,20]}
df1=pd.DataFrame.from_records(data1)
Check the df
df1
Age Name Pay
0 20.0 Tom NaN
1 NaN NaN NaN
2 40.0 Mary 20.0
3 30.0 Jane 25.0
The record with index 1 has 3 missing values...
Replace and make the missing values None:
df1 = df1.replace({np.nan: None})
Now write a function to count missing values per row, and use it to build a list:
def count_na(lst):
    missing = [n for n in lst if not n]
    return len(missing)

missing_data = []
for index, n in df1.iterrows():
    missing_data.append(count_na(list(n)))
Use this list as a new column in the DataFrame:
df1['missing']=missing_data
df1 should look like this
Age Name Pay missing
0 20 Tom None 1
1 None None None 3
2 40 Mary 20 0
3 30 Jane 25 0
So filtering becomes easy....
# Now only take records with <2 missing
df1[df1.missing<2]
Hope that helps...
A simple way is to compare, row by row, the count of values with the number of columns of the dataframe. You can then replace NaN with the mean of the dataframe.
Code could be:
result = df.loc[df.apply(lambda x: x.count(), axis=1) >= (len(df.columns) - 2)].replace(
    np.nan, df.agg('mean'))
With your example data, it gives as expected:
Country Name 2001 2002 2003 2004
1 Malta 62.0 58.00 60.000000 58.0
2 Singapore 60.0 49.25 39.333333 56.0
3 Malaysia 58.0 57.00 39.333333 55.0
4 Ireland 47.0 41.00 34.000000 34.0
5 Georgia 38.0 41.00 24.000000 38.0
Try this
hightech_export.dropna(thresh=2, inplace=True)
in place of the line of code
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
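One caveat: thresh is the minimum number of non-NA values a row must have to be kept, so for "drop rows with more than 2 empty fields" it should normally be derived from the column count, for example:
# keep rows with at least (number of columns - 2) non-NA values,
# i.e. drop rows with more than 2 missing fields
hightech_export.dropna(thresh=len(hightech_export.columns) - 2, inplace=True)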

How to calculate the expanding mean of all the columns across the DataFrame and add to DataFrame

I am trying to calculate the mean of all previous rows for each column of the DataFrame and add the calculated mean columns to the DataFrame.
I am using a set of NBA games data that contains 20+ features (columns) that I am trying to calculate the means for. An example of the dataset is below. (Note: "...." represents the rest of the feature columns.)
Team TeamPoints OpponentPoints.... TeamPoints_mean OpponentPoints_mean
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Example for calculating two of the columns:
dataset = pd.read_csv('nba.games.stats.csv')
df = dataset
df['TeamPoints_mean'] = (df.groupby('Team')['TeamPoints'].apply(lambda x: x.shift().expanding().mean()))
df['OpponentPoints_mean'] = (df.groupby('Team')['OpponentPoints'].apply(lambda x: x.shift().expanding().mean()))
Again, this only calculates one mean and adds one column to the DataFrame at a time. Is there a way to get the column means and add them to the DataFrame without doing it one at a time? A for loop? An example of what I am looking for is below.
Team TeamPoints OpponentPoints.... TeamPoints_mean OpponentPoints_mean ...("..." = mean columns of rest of the feature columns)
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Try this one:
(0) sample input:
>>> df
col1 col2 col3
0 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797
2 0.042541 1.196383 6.568839
3 4.784911 0.444671 8.019933
4 3.831556 0.902672 0.198920
5 3.672763 2.236639 1.528215
6 0.792616 2.604049 0.373296
7 2.281992 2.563639 1.500008
8 4.096861 0.598854 4.934116
9 3.632607 1.502801 0.241920
Then processing:
(1) A side table to get all the means (I didn't find a cumulative mean function, so went with cumsum + count):
>>> df_side=df.assign(col_temp=1).cumsum()
>>> df_side
col1 col2 col3 col_temp
0 1.490977 1.784433 0.852842 1.0
1 5.217640 4.629801 8.619638 2.0
2 5.260182 5.826184 15.188477 3.0
3 10.045093 6.270855 23.208410 4.0
4 13.876649 7.173527 23.407330 5.0
5 17.549412 9.410166 24.935545 6.0
6 18.342028 12.014215 25.308841 7.0
7 20.624021 14.577855 26.808849 8.0
8 24.720882 15.176708 31.742965 9.0
9 28.353489 16.679509 31.984885 10.0
>>> for el in df.columns:
...     df_side["{}_mean".format(el)] = df_side[el]/df_side.col_temp
>>> df_side = df_side.drop([el for el in df.columns] + ["col_temp"], axis=1)
>>> df_side
col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842
1 2.608820 2.314901 4.309819
2 1.753394 1.942061 5.062826
3 2.511273 1.567714 5.802103
4 2.775330 1.434705 4.681466
5 2.924902 1.568361 4.155924
6 2.620290 1.716316 3.615549
7 2.578003 1.822232 3.351106
8 2.746765 1.686301 3.526996
9 2.835349 1.667951 3.198489
(2) joining back, on index:
>>> df_final=df.join(df_side)
>>> df_final
col1 col2 col3 col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797 2.608820 2.314901 4.309819
2 0.042541 1.196383 6.568839 1.753394 1.942061 5.062826
3 4.784911 0.444671 8.019933 2.511273 1.567714 5.802103
4 3.831556 0.902672 0.198920 2.775330 1.434705 4.681466
5 3.672763 2.236639 1.528215 2.924902 1.568361 4.155924
6 0.792616 2.604049 0.373296 2.620290 1.716316 3.615549
7 2.281992 2.563639 1.500008 2.578003 1.822232 3.351106
8 4.096861 0.598854 4.934116 2.746765 1.686301 3.526996
9 3.632607 1.502801 0.241920 2.835349 1.667951 3.198489
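As an aside, pandas does have a cumulative-mean building block: expanding().mean() produces the same side table in one call (a sketch of the shortcut, on the same sample data):
# cumulative (expanding) mean of every column, then join back on the index
df_side = df.expanding().mean().add_suffix('_mean')
df_final = df.join(df_side)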
I am trying to calculate the means of all previous rows for each column of the DataFrame
To get all of the columns, you can do:
df_means = df.join(df.cumsum() /
                   df.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
However, if Team is a column rather than the index, you'd want to get rid of it:
df_data = df.drop('Team', axis=1)
df_means = df.join(df_data.cumsum() /
                   df_data.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
You could also do
import numpy as np
df_data = df[[col for col in df.columns
              if np.issubdtype(df[col].dtype, np.number)]]
Or manually define a list of columns that you want to take the mean of, cols_for_mean, and then do
df_data = df[cols_for_mean]
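Tying this back to the question's per-team, previous-games-only means: a sketch that applies shift().expanding().mean() to every numeric column in one groupby (it assumes the team label column is named 'Team', as in the example):
# expanding mean of all *previous* rows, per team, for every numeric column
num_cols = df.select_dtypes('number').columns
means = (df.groupby('Team')[num_cols]
           .transform(lambda s: s.shift().expanding().mean())
           .add_suffix('_mean'))
df = df.join(means)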
