Get the avg of 2 numbers in one csv field python


I am trying to clean a dataset (CSV) in Python with pandas.
In the Projected investment column, some fields contain two numbers, for example 30-35. How can I get the average so that the field contains 32.5?

I think it is best to create a float column rather than mixing numbers with strings.
First replace missing with NaN, then split, convert to floats and finally take the row-wise mean:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Projected investment':['missing','30-35','77']})
print (df)
Projected investment
0 missing
1 30-35
2 77
df['Projected investment'] = df['Projected investment'].replace('missing', np.nan) \
.str.split('-', expand=True) \
.astype(float) \
.mean(axis=1)
print (df)
Projected investment
0 NaN
1 32.5
2 77.0
print (df['Projected investment'].dtypes)
float64
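If you need to keep missing as a string:
def parse_number(x):
    # Average the hyphen-separated numbers; return the value unchanged if it is not numeric.
    try:
        return np.mean(np.array(str(x).split('-')).astype(float))
    except ValueError:
        return x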
df['Projected investment'] = df['Projected investment'].map(parse_number)
print (df)
Projected investment
0 missing
1 32.5
2 77
print (df['Projected investment'].apply(type))
0 <class 'str'>
1 <class 'numpy.float64'>
2 <class 'numpy.float64'>
Name: Projected investment, dtype: object

This works as long as the column has no NaN or missing values; you need to handle those first.
df['Projected Investment'] = df['Projected Investment'].apply(lambda x: np.mean([int(n) for n in x.split('-')]))

This should work:
string_of_nums = "30-35"
# Split on the hyphen and convert each part to an integer.
nums = [int(num) for num in string_of_nums.split("-")]
# The average is the sum divided by the count.
avg = sum(nums) / len(nums)
print(avg)
#>>> 32.5 (as float)

df['Projected Investment'].apply(lambda x: x if x == 'Missing' else np.mean([int(i) for i in x.split('-')]))
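
The approaches above can be folded into one reusable helper. This is only a minimal sketch: range_mean is a hypothetical name, and it assumes that ranges use a single hyphen as separator, that values are non-negative, and that any non-numeric placeholder should become NaN.

```python
import numpy as np
import pandas as pd

def range_mean(col: pd.Series) -> pd.Series:
    """Average hyphen-separated numbers in each cell; unparseable cells become NaN."""
    # Split "30-35" into two columns; a single value like "77" leaves the second one empty.
    parts = col.astype(str).str.split('-', expand=True)
    # Coerce each piece to a number; text such as "missing" turns into NaN.
    nums = parts.apply(pd.to_numeric, errors='coerce')
    # The row-wise mean skips NaN, so "77" stays 77.0 and all-NaN rows stay NaN.
    return nums.mean(axis=1)

df = pd.DataFrame({'Projected investment': ['missing', '30-35', '77']})
df['Projected investment'] = range_mean(df['Projected investment'])
print(df)
```

Because the result is a plain float64 column, later arithmetic and aggregation work without further type checks.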

Related

change a column values with calculations

Here is my dataframe: dataFrame
I just want to multiply all values in "sous_nutrition" by 10^6.
When I run this code: proportion_sous_nutrition_2017['sous_nutrition'] = proportion_sous_nutrition_2017.sous_nutrition * 1000000
it gives me this: newDataFrame
I want to multiply by 1 million because the values are stated 'in millions' and this will make later calculations easier.
Any help would be greatly appreciated.

You can use pd.to_numeric(..., errors='coerce') to turn values that cannot be converted to numbers into NaN.
Try:
proportion_sous_nutrition_2017['sous_nutrition'] = 1e6 * pd.to_numeric(proportion_sous_nutrition_2017['sous_nutrition'], errors='coerce')

Try:
# create a new column called 'sous_nutrition_float' that only has 1.1 or 0.3 etc. and removes the > or < etc.
proportion_sous_nutrition_2017['sous_nutrition_float'] = proportion_sous_nutrition_2017['sous_nutrition'].str.extract(r'([0-9.]+)').astype(float)
proportion_sous_nutrition_2017['sous_nutrition'] = proportion_sous_nutrition_2017.sous_nutrition_float * 1000000
To find the dtypes run:
print(proportion_sous_nutrition_2017.info())
The types should be float or int before multiplying:
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   sous_nutrition        3 non-null      float64
 1   sous_nutrition_float  3 non-null      float64
...

One solution is to convert every "< xxx" to "xxx":
>>> df['sous_nutrition']
0 1.2
1 NaN
2 < 0.1
Name: sous_nutrition, dtype: object
>>> df['sous_nutrition'].str.replace('<', '').astype(float)
0 1.2
1 NaN
2 0.1
Name: sous_nutrition, dtype: float64
So, this should work:
proportion_sous_nutrition_2017['sous_nutrition_float'] = proportion_sous_nutrition_2017['sous_nutrition'].str.replace('<', '').astype(float) * 1000000

The error occurs because the column 'sous_nutrition' is not float as you expect, but string (object). To fix it, you need to change the dtype as indicated by Hamza usman ghani.
If there are errors when changing the type, try this code:
df['sous_nutrition'] = pd.to_numeric(df['sous_nutrition'], downcast='float', errors='coerce')
and then the multiplication works correctly:
df['sous_nutrition'] = df['sous_nutrition']*1000000
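
The answers above can be combined so that values such as "< 0.1" are kept rather than coerced to NaN. A minimal sketch, assuming the only non-numeric decoration is a leading "<" and that treating "< 0.1" as 0.1 is acceptable for the analysis; the sample frame is hypothetical and mirrors the values shown earlier.

```python
import pandas as pd

# Hypothetical sample mirroring the values shown in the answers above.
df = pd.DataFrame({'sous_nutrition': ['1.2', None, '< 0.1']})

# Drop the "<" marker and surrounding spaces; missing entries stay missing.
cleaned = df['sous_nutrition'].str.replace('<', '', regex=False).str.strip()
# Coerce to float (anything still unparseable becomes NaN), then scale to units.
df['sous_nutrition'] = pd.to_numeric(cleaned, errors='coerce') * 1_000_000
print(df)
```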

Multiplying values from a string column in Pandas

I have a column with land dimensions in Pandas. It looks like this:
df.LotSizeDimensions.value_counts(dropna=False)
40.00X150.00 2
57.00X130.00 2
27.00X117.00 2
63.00X135.00 2
37.00X108.00 2
65.00X134.00 2
57.00X116.00 2
33x124x67x31x20x118 1
55.00X160.00 1
63.00X126.00 1
36.00X105.50 1
In rows where there is only one X, I would like to create a separate column that multiplies the two values. In rows where there is more than one X, I would like to return a zero. This is the code I came up with:
def dimensions_split(df: pd.DataFrame):
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip()
    df.LotSizeDimensions = df.LotSizeDimensions.str.upper()
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip('`"M')
    if df.LotSizeDimensions.count('X') > 1:
        return 0
    df['LotSize'] = map(int(df.LotSizeDimensions.str.split("X", 1).str[0])*int(df.LotSizeDimensions.str.split("X", 1).str[1]))
This is coming back with the following error:
TypeError: cannot convert the series to <class 'int'>
I would also like to add a line where if there are any non-numeric characters other than X, return a zero.

The idea is to first strip and upper-case the LotSizeDimensions column into a Series, then use Series.str.split to get a DataFrame, and finally multiply the two columns if there is exactly one X; otherwise 0 is returned:
s = df.LotSizeDimensions.str.strip('`"M ').str.upper()
df1 = s.str.split('X', expand=True).astype(float)
#general data
#df1 = s.str.split('X', expand=True).apply(lambda x: pd.to_numeric(x, errors='coerce'))
df['LotSize'] = np.where(s.str.count('X').eq(1), df1[0] * df1[1], 0)
print (df)
LotSizeDimensions LotSize
0 40.00X150.00 6000.0
1 57.00X130.00 7410.0
2 27.00X117.00 3159.0
3 37.00X108.00 3996.0
4 63.00X135.00 8505.0
5 65.00X134.00 8710.0
6 57.00X116.00 6612.0
7 33x124x67x31x20x118 0.0
8 55.00X160.00 8800.0
9 63.00X126.00 7938.0
10 36.00X105.50 3798.0

I get this using a list comprehension:
import pandas as pd
df = pd.DataFrame(['40.00X150.00', '57.00X130.00', '27.00X117.00', '37.00X108.00',
                   '63.00X135.00', '65.00X134.00', '57.00X116.00', '33x124x67x31x20x118',
                   '55.00X160.00', '63.00X126.00', '36.00X105.50'])
# Multiply the two parts when splitting on "X" gives exactly two pieces; otherwise store None.
df[1] = [float(parts[0]) * float(parts[1]) if len(parts) == 2 else None
         for parts in (str_data.strip().split("X") for str_data in df[0])]
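
The question also asks for a zero whenever a value contains non-numeric characters other than X, which the answers above do not cover. One possible sketch uses a regular expression that accepts exactly two numbers separated by a single X. The pattern and the sample rows (including the invalid '12A X 30') are assumptions for illustration, and the stripping of the characters `"M from the question is left out for brevity.

```python
import pandas as pd

df = pd.DataFrame({'LotSizeDimensions': ['40.00X150.00', '33x124x67x31x20x118',
                                         '36.00X105.50', '12A X 30']})

# Normalise whitespace and case before matching.
s = df['LotSizeDimensions'].str.strip().str.upper()
# Capture exactly two numbers separated by one X; anything else fails to match.
dims = s.str.extract(r'^(\d+(?:\.\d+)?)X(\d+(?:\.\d+)?)$').astype(float)
# Non-matching rows give NaN in both groups, so their product is NaN, replaced by 0.
df['LotSize'] = (dims[0] * dims[1]).fillna(0)
print(df)
```

Anchoring the pattern with ^ and $ is what rejects rows with extra X separators or stray letters, so no separate count of X is needed.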

type conversion in python from float to int

I am trying to change data_df which is type float64 to int.
data_df['grade'] = data_df['grade'].astype(int)
I get the following error.
invalid literal for int() with base 10: '17.44'

I think you need to_numeric first, because the column holds strings and a string such as '17.44' cannot be cast directly to int:
data_df['grade'] = pd.to_numeric(data_df['grade']).astype(int)
Another solution is to cast to float first and then to int:
data_df['grade'] = data_df['grade'].astype(float).astype(int)
Sample:
data_df = pd.DataFrame({'grade':['10','20','17.44']})
print (data_df)
grade
0 10
1 20
2 17.44
data_df['grade'] = pd.to_numeric(data_df['grade']).astype(int)
print (data_df)
grade
0 10
1 20
2 17
data_df['grade'] = data_df['grade'].astype(float).astype(int)
print (data_df)
grade
0 10
1 20
2 17
---
If some values cannot be converted, to_numeric raises an error:
ValueError: Unable to parse string
You can add the parameter errors='coerce' to convert non-numeric values to NaN.
If there are NaN values, casting to int is not possible (see the docs):
data_df = pd.DataFrame({'grade':['10','20','17.44', 'aa']})
print (data_df)
grade
0 10
1 20
2 17.44
3 aa
data_df['grade'] = pd.to_numeric(data_df['grade'], errors='coerce')
print (data_df)
grade
0 10.00
1 20.00
2 17.44
3 NaN
If you want to change NaN to some number, e.g. 0, use fillna:
data_df['grade'] = (pd.to_numeric(data_df['grade'], errors='coerce')
                    .fillna(0)
                    .astype(int))
print (data_df)
grade
0 10
1 20
2 17
3 0
A small piece of advice:
Before using errors='coerce', use boolean indexing to check all rows that cannot be cast to numeric:
print (data_df[pd.to_numeric(data_df['grade'], errors='coerce').isnull()])
grade
3 aa
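
If replacing NaN with 0 would distort the data, pandas' nullable integer dtype is an alternative that keeps the gap as a missing value. A minimal sketch, assuming a pandas version that supports the 'Int64' extension dtype and that truncation (as in the examples above) is the desired conversion:

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame({'grade': ['10', '20', '17.44', 'aa']})

# Non-numeric strings become NaN instead of raising.
nums = pd.to_numeric(data_df['grade'], errors='coerce')
# Truncate toward zero first, because 'Int64' only accepts whole-number floats.
data_df['grade'] = np.trunc(nums).astype('Int64')
print(data_df)
```

The unparseable row shows up as <NA> while the other rows are true integers.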

What works is data_df['grade'] = pd.to_numeric(data_df['grade']).astype(int)
Calling astype(int) directly throws an error because a string such as '17.44' is not a valid integer literal; the values must be converted to float first, and going from float to integer loses information.
This solution truncates the value (i.e. 1.9 becomes 1), so you might want to specify in your question whether you want to convert float to integer by truncation or by rounding (i.e. 1.9 becomes 2).
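
The difference between truncating and rounding can be seen in a small sketch; the sample values, including '1.9', are assumptions chosen for illustration.

```python
import pandas as pd

grades = pd.to_numeric(pd.Series(['10', '20', '17.44', '1.9']))

# astype(int) drops the fractional part: 1.9 becomes 1.
truncated = grades.astype(int)
# round() first moves each value to the nearest integer: 1.9 becomes 2.
rounded = grades.round().astype(int)
print(truncated.tolist(), rounded.tolist())
```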

From:
data_df['grade'] = data_df['grade'].astype(int)
Need to change int into 'int'
data_df['grade'] = data_df['grade'].astype('int')

I found that this worked for me where none of the earlier answers did:
# np.int was an alias of the built-in int (removed in NumPy 1.24), so int is used directly.
data_df['grade'] = data_df['grade'].apply(int)

change string object to number in dataframe

I have an 880184 x 1 dataframe whose only column holds either integer objects or string objects. I want to change every string object to the number 0. It looks like this:
index column
..... ......
23155 WILLS ST / MIDDLE POINT RD
23156 20323
23157 400 Block of BELLA VISTA WY
23158 19090
23159 100 Block of SAN BENITO WY
23160 20474
Now the problem is that both numbers and strings are 'object' type, and I don't know how to change the string-like objects to 0, like below:
index column
..... ......
23155 0
23156 20323
23157 0
23158 19090
23159 0
23160 20474
Another problem is that the dataset is too large, so fixing it row by row with for loops takes too long. I want to use something like:
df.loc[df.column == ...] = 0

You can convert the column to numeric with pd.to_numeric and pass errors='coerce' so that you get NaN for the values that cannot be converted to numbers. In the end, you can replace the NaNs with zero:
df["column"] = pd.to_numeric(df["column"], errors="coerce").fillna(0)
Out[15]:
0 0.0
1 20323.0
2 0.0
3 19090.0
4 0.0
5 20474.0
Name: column, dtype: float64
If you want the integer values, add astype('int64') to the end:
df["column"] = pd.to_numeric(df["column"], errors="coerce").fillna(0).astype("int64")
Out[16]:
0 0
1 20323
2 0
3 19090
4 0
5 20474
Name: column, dtype: int64

Try converting everything to integers using the int() function.
Strings that cannot be converted raise an error. Wrap this in a try block and you are set.
Like this:
def converter(currentRowObj):
    # Return the value as an int, or 0 when it cannot be converted.
    try:
        obj = int(currentRowObj)
    except (ValueError, TypeError):
        obj = 0
    return obj
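
The answer above does not show how the function is applied to the column. A minimal sketch using Series.map follows; the three sample rows are assumptions based on the data in the question.

```python
import pandas as pd

def converter(value):
    # Return the value as an int, or 0 when it cannot be converted.
    try:
        return int(value)
    except (ValueError, TypeError):
        return 0

df = pd.DataFrame({'column': ['WILLS ST / MIDDLE POINT RD', '20323', 19090]})
# map calls converter once per element, without an explicit Python for loop.
df['column'] = df['column'].map(converter)
print(df)
```

On a column of this size, the pd.to_numeric approach is likely to be faster because it is vectorized, while map still calls a Python function for each row.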

Funny results with pandas argsort

I think I have hit on a bug in pandas. I was hoping to get some help either verifying the bug or helping me figure out where my logic error is located in my code.
My code is as follows:
import pandas, numpy, StringIO
def sq_fixer(sr):
    sr = sr.where(sr != '20200229')
    ranks = sr.argsort().astype(float)
    ranks[ranks == -1] = numpy.nan
    return ','.join(ranks.astype(numpy.str))
def correct_date(sr):
    date_fixer = lambda x: pandas.datetime(x.year -100, x.month, x.day) if x > pandas.datetime.now() else x
    sr = pandas.to_datetime(sr).apply(date_fixer).astype(pandas.datetime)
    return sr
txt = '''ID,RUN_START_DATE,PUSHUP_START_DATE,SITUP_START_DATE,PULLUP_START_DATE
1,2013-01-24,2013-01-02,,2013-02-03
2,2013-01-30,2013-01-21,2013-01-13,2013-01-06
3,2013-01-29,2013-01-28,2013-01-01,2013-01-29
4,2013-02-16,2013-02-12,2013-01-04,2013-02-11
5,2013-01-06,2013-02-07,2013-02-25,2013-02-12
6,2013-01-26,2013-01-28,2013-02-12,2013-01-10
7,2013-01-26,,2013-01-12,2013-01-30
8,2013-01-03,2013-01-24,2013-01-19,2013-01-02
9,2013-01-22,2013-01-13,2013-02-03,
10,2013-02-06,2013-01-16,2013-02-07,2013-01-11
3347,,2008-02-27,2008-04-10,2008-02-13
3588,2004-09-12,,2004-11-06,2004-09-06
3784,2003-02-22,,2003-06-21,2003-02-19
593,2009-04-03,,2009-06-01,2009-04-01
4148,2003-03-21,2002-09-20,2003-04-01,2003-01-01
4299,2004-05-24,2004-07-23,,2004-04-22
4590,2005-05-05,2005-12-05,2005-04-05,
4830,2001-06-12,2000-10-12,2001-07-28,2001-01-28
4941,2006-11-08,2006-12-19,2006-07-19,2007-02-24
1416,2004-04-03,2004-05-19,2004-02-06,
1580,2008-12-20,,2009-03-19,2008-12-19
1661,2005-10-03,2005-10-26,2005-09-12,2006-02-19
1759,2001-10-18,,2002-01-17,2001-10-17
1858,2003-04-14,2003-05-17,,2002-12-17
1972,2003-06-01,2003-07-14,2002-12-14,
5905,2000-11-18,2001-01-13,,2000-11-04
2052,2002-06-11,,2002-08-23,2001-12-12
2165,2006-10-01,,2007-02-27,2006-09-30
2218,2007-09-19,,2008-02-06,2007-09-09
2350,2000-08-08,,2000-09-22,2000-01-08
2432,2001-08-22,,2001-09-25,2000-12-16
2611,2005-05-07,,2005-06-05,2005-03-26
2612,2005-05-06,,2005-05-26,2005-04-11
7378,2009-08-07,2009-01-30,2010-01-20,2009-06-08
7550,2006-04-08,,2006-06-01,2006-04-01 '''
df = pandas.read_csv(StringIO.StringIO(txt))
sequence_array = ['RUN_START_DATE', 'PUSHUP_START_DATE', 'SITUP_START_DATE', 'PULLUP_START_DATE']
xsequence_array = ['X_RUN_START_DATE', 'X_PUSHUP_START_DATE', 'X_SITUP_START_DATE', 'X_PULLUP_START_DATE']
df[sequence_array] = df[sequence_array].apply(correct_date, axis=1)
fix_day = lambda x: x if x > 0 else 29
fix_month = lambda x: x if x > 0 else 02
fix_year = lambda x: x if x > 0 else 2020
for col in sequence_array:
    xcol = 'X_{0}'.format(col)
    df[xcol] = ['{0:04d}{1:02d}{2:02d}'.format(fix_year(c.year), fix_month(c.month), fix_day(c.day)) for c in df[col]]
df['X_AS_SEQUENCE'] = df[xsequence_array].apply(sq_fixer, axis=1)
When I run the code most of the results are correct. Take for example index 6:
In [31]: df.ix[6]
Out[31]:
ID 7
RUN_START_DATE 2013-01-26 00:00:00
PUSHUP_START_DATE NaN
SITUP_START_DATE 2013-01-12 00:00:00
PULLUP_START_DATE 2013-01-30 00:00:00
X_RUN_START_DATE 20130126
X_PUSHUP_START_DATE 20200229
X_SITUP_START_DATE 20130112
X_PULLUP_START_DATE 20130130
X_AS_SEQUENCE 1.0,nan,0.0,2.0
However, certain indices seem to throw pandas.argsort() for a loop. Take for example index 10:
In [32]: df.ix[10]
Out[32]:
ID 3347
RUN_START_DATE NaN
PUSHUP_START_DATE 2008-02-27 00:00:00
SITUP_START_DATE 2008-04-10 00:00:00
PULLUP_START_DATE 2008-02-13 00:00:00
X_RUN_START_DATE 20200229
X_PUSHUP_START_DATE 20080227
X_SITUP_START_DATE 20080410
X_PULLUP_START_DATE 20080213
X_AS_SEQUENCE nan,2.0,0.0,1.0
The argsort should return nan,1.0,2.0,0.0 instead of nan,2.0,0.0,1.0.
I have been on this for three days. At this point I am not sure if it is me or a bug. I am not sure how to backtrace it to get an answer. Any help would be most appreciated!

You might be interpreting the result of argsort incorrectly. argsort does not give the ranking of the values. Use the rank method if you want to rank the values.
The values in the Series returned by argsort give the corresponding positions of the original values after dropping the NaNs. In your case, since you convert 20200229 to NaN, you are argsorting NaN, 20080227, 20080410, 20080213. The non-NaN values are
nonnan = [20080227, 20080410, 20080213]
The result, NaN, 2, 0, 1, says:
argsort   sorted values
NaN       NaN
2         nonnan[2] = 20080213
0         nonnan[0] = 20080227
1         nonnan[1] = 20080410
So it looks OK to me.
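
The contrast between argsort and rank can be checked on the three non-NaN values from the question. This is a small sketch; subtracting 1 to get zero-based ranks is an assumption about the output the asker expected.

```python
import pandas as pd

s = pd.Series([20080227, 20080410, 20080213])

# argsort: entry i is the original position of the i-th smallest value.
order = s.argsort()
# rank: entry i is the (zero-based) rank of the i-th original value.
ranks = s.rank(method='first').astype(int) - 1

print(order.tolist())  # [2, 0, 1]
print(ranks.tolist())  # [1, 2, 0]
```

The rank output corresponds to the 1.0,2.0,0.0 sequence the question was expecting.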

If you want to sort a Series, just use sort_values(); use rank() if you want the rank of each value:
In [2]: a=pd.Series([3,2,1])
In [3]: a
Out[3]:
0 3
1 2
2 1
dtype: int64
In [4]: a.sort_values()
Out[4]:
2 1
1 2
0 3
dtype: int64
argsort() instead returns the positions that would sort the Series: entry i is the position, in the original Series, of the value that belongs at sorted position i.
In this case the smallest value, 1, sits at position 2, the value 2 at position 1, and the largest value, 3, at position 0:
In [5]: a.argsort()
Out[5]:
0 2
1 1
2 0
dtype: int64
