I have a pandas Series like this:
0 $233.94
1 $214.14
2 $208.74
3 $232.14
4 $187.15
5 $262.73
6 $176.35
7 $266.33
8 $174.55
9 $221.34
10 $199.74
11 $228.54
12 $228.54
13 $196.15
14 $269.93
15 $257.33
16 $246.53
17 $226.74
I want to get rid of the dollar sign so I can convert the values to numeric. I wrote a function to do this:
def strip_dollar(series):
    for number in dollar:
        if number[0] == '$':
            number[0].replace('$', ' ')
    return dollar
This function returns the original series untouched; nothing changes, and I don't know why.
Any ideas about how to get this right?
Thanks in advance
Use lstrip and convert to floats:
s = s.str.lstrip('$').astype(float)
print (s)
0 233.94
1 214.14
2 208.74
3 232.14
4 187.15
5 262.73
6 176.35
7 266.33
8 174.55
9 221.34
10 199.74
11 228.54
12 228.54
13 196.15
14 269.93
15 257.33
16 246.53
17 226.74
Name: A, dtype: float64
Setup:
s = pd.Series(['$233.94', '$214.14', '$208.74', '$232.14', '$187.15', '$262.73', '$176.35', '$266.33', '$174.55', '$221.34', '$199.74', '$228.54', '$228.54', '$196.15', '$269.93', '$257.33', '$246.53', '$226.74'])
print (s)
0 $233.94
1 $214.14
2 $208.74
3 $232.14
4 $187.15
5 $262.73
6 $176.35
7 $266.33
8 $174.55
9 $221.34
10 $199.74
11 $228.54
12 $228.54
13 $196.15
14 $269.93
15 $257.33
16 $246.53
17 $226.74
dtype: object
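As for why the function in the question changes nothing: Python strings are immutable, so number[0].replace('$', ' ') builds a new string that is thrown away, and nothing is ever written back to the series. A minimal sketch of a corrected function, assuming every value is a string with at most a leading '$' (the three-row series is just a shortened copy of the data above):

```python
import pandas as pd

# Shortened copy of the series from the question
s = pd.Series(['$233.94', '$214.14', '$208.74'])

def strip_dollar(series):
    # .str.lstrip returns a new Series; it has to be captured and returned,
    # because neither the strings nor the input series are modified in place
    return series.str.lstrip('$').astype(float)

cleaned = strip_dollar(s)
print(cleaned.tolist())  # [233.94, 214.14, 208.74]
```

The input series s still holds the original strings afterwards, so the result has to be assigned somewhere (here, cleaned).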
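Use str.replace("$", "", regex=False) and then pd.to_numeric. Passing regex=False makes sure "$" is treated as a literal character rather than as the regex end-of-string anchor.
Example:
import pandas as pd
df = pd.DataFrame({"Col" : ["$233.94", "$214.14"]})
df["Col"] = pd.to_numeric(df["Col"].str.replace("$", "", regex=False))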
print(df)
Output:
Col
0 233.94
1 214.14
CODE:
ser = pd.Series(data=['$123', '$234', '$232', '$6767'])
def rmDollar(x):
    # drop the first character (assumes every value starts with '$')
    return x[1:]
serWithoutDollar = ser.apply(rmDollar)
serWithoutDollar
OUTPUT:
0 123
1 234
2 232
3 6767
dtype: object
Hope it helps!
I'm using Python 3.7 and have a padding width stored in a variable. I want to use that variable inside the curly braces of a format string. The code shows what I am trying to do.
def print_formatted(number):
    for i in range(1, number + 1):
        binum = bin(i).replace('0b', '')
        ocnum = oct(i).replace('0o', '')
        hexnum = hex(i).replace('0x', '')
        length = len(bin(number).replace('0b', ''))
        print('{0:>length} {1:>length} {2:>length} {3:>length}'.format(i, ocnum, hexnum, binum)) # Error here
This is the code I have been trying to run. I want to right-align the numbers, padding each one to the length of the binary representation of the last number.
ValueError: Invalid format specifier
This is the error I get. What am I doing wrong?
You can use f-strings with the o, x and b format specifiers instead of calling the hex, oct and bin builtins and slicing the result, and use int.bit_length() instead of taking the length of the binary string, e.g.:
def print_formatted(number):
    # get number of bits required to store number
    w = number.bit_length()
    for n in range(1, number + 1):
        # print each number as decimal, then octal, then hex, then binary with padding
        print(f'{n:>{w}} {n:>{w}o} {n:>{w}x} {n:>{w}b}')
Running print_formatted(20) will give you:
    1     1     1     1
    2     2     2    10
    3     3     3    11
    4     4     4   100
    5     5     5   101
    6     6     6   110
    7     7     7   111
    8    10     8  1000
    9    11     9  1001
   10    12     a  1010
   11    13     b  1011
   12    14     c  1100
   13    15     d  1101
   14    16     e  1110
   15    17     f  1111
   16    20    10 10000
   17    21    11 10001
   18    22    12 10010
   19    23    13 10011
   20    24    14 10100
You must first substitute the value of the variable length into your format string. One way to do that is to call .format() twice, doubling the braces that have to survive the first pass.
Change the following line
print('{0:>length} {1:>length} {2:>length} {3:>length}'.format(i, ocnum, hexnum, binum))
to
print('{{0:>{length}}} {{1:>{length}}} {{2:>{length}}} {{3:>{length}}}'.format(length=length).format(i, ocnum, hexnum, binum))
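A single .format() call can also do this, because a format spec may itself contain a nested replacement field. A minimal sketch; format_row and the keyword name w are made up for the illustration:

```python
def format_row(i, width):
    # {w} inside each format spec is filled in from the keyword argument
    # first, then the outer field is formatted with the resulting width
    return '{0:>{w}} {0:>{w}o} {0:>{w}x} {0:>{w}b}'.format(i, w=width)

print(format_row(10, 5))  # '   10    12     a  1010'
```

This avoids both the doubled braces and the second .format() call.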
You can use f-strings.
def print_formatted(number):
    length = len(bin(number)) - 2
    for i in range(1, number + 1):
        ocnum = oct(i)[2:]
        hexnum = hex(i)[2:]
        binum = bin(i)[2:]
        print(' '.join(f'{n:>{length}}' for n in (i, ocnum, hexnum, binum)))
>>> print_formatted(20)
    1     1     1     1
    2     2     2    10
    3     3     3    11
    4     4     4   100
    5     5     5   101
    6     6     6   110
    7     7     7   111
    8    10     8  1000
    9    11     9  1001
   10    12     a  1010
   11    13     b  1011
   12    14     c  1100
   13    15     d  1101
   14    16     e  1110
   15    17     f  1111
   16    20    10 10000
   17    21    11 10001
   18    22    12 10010
   19    23    13 10011
   20    24    14 10100
I usually do it in two steps, creating a format string first. It works by escaping the braces - {{,}} - that need to remain in the format string.
>>> length = 4
>>> f'{{:>{length}}} {{:>{length}}} {{:>{length}}} {{:>{length}}}'
'{:>4} {:>4} {:>4} {:>4}'
>>>
fmt = f'{{:>{length}}} {{:>{length}}} {{:>{length}}} {{:>{length}}}'
print(fmt.format(i, ocnum, hexnum, binum))
I have the following dataframe:
Month
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
Name: Passengers, dtype: float64
As you can see, the index runs from 1 to 12 twice; instead, I would like it to run from 1 to 24. The problem is that when I plot it I see the following:
plt.figure(figsize=(15,5))
plt.plot(esta2,color='orange')
plt.show()
I would like to see a continuous line from 1 to 24.
esta2 = esta2.reset_index(drop=True) will get you 0-23 (without drop=True the old index is kept as an extra column). If you need 1-24 then you could just do esta2.index = np.arange(1, len(esta2) + 1).
Quite simply:
df.index = range(1, len(df.index) + 1)
df.index.name = 'Month'
print(df)
Val
Month
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
13 -0.075844
14 -0.089111
15 0.042705
16 0.002147
17 -0.010528
18 0.109443
19 0.198334
20 0.209830
21 0.075139
22 -0.062405
23 -0.211774
24 -0.109167
Just reassign the index:
df.index = pd.Index(range(1, len(df) + 1), name='Month')
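A self-contained sketch of the same idea, for anyone who wants to try it without the original data. The values are made up; only the repeated 1-12 index mirrors the question:

```python
import pandas as pd

# Hypothetical stand-in for esta2: twelve values repeated twice,
# indexed 1-12 and then 1-12 again
values = [round(0.01 * m, 2) for m in range(1, 13)] * 2
esta2 = pd.Series(values, index=list(range(1, 13)) * 2, name='Passengers')

# Replace the repeated labels with a unique 1-24 index
esta2.index = pd.RangeIndex(start=1, stop=len(esta2) + 1, name='Month')
print(esta2.index.is_unique)  # True
```

With a unique, increasing index, plt.plot(esta2) has one x position per value, so the line no longer jumps back from 12 to 1.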
I have to clean the membership_id column, but it contains lots of junk values like '0000000', '99999', '*', 'na'.
Membership IDs are serial numbers, 4 to 12 digits long:
IDs with 4 to 9 digits start with any non-zero digit, while IDs with 10 to 12 digits start with 1000 (1000xxxxxxxx).
Sorry for not describing the format clearly at the beginning; any ID that fails these criteria is invalid. I would like to replace every value that does not match the membership ID format with 0. Thanks for the help.
member_id
1 176828287
2 176841791
3 202142958
4 222539874
5 223565464
6 224721631
7 227675081
8 30235355118
9 %
10 ---
11 .
12 .215694985
13 0
14 00
15 000
16 00000000000000
17 99999999999999
18 999999999999999
19 : 211066980
20 D5146159
21 JulieGreen
22 N/a
23 NONE
24 None
25 PP - Premium Pr
26 T0000
27 T0000019
28 T0000022
If I understood correctly, the regex \A((1000\d{8})|([1-9]\d{3,10}))\Z will meet your requirements.
It matches:
12 digits beginning with 1000
4 to 11 digits beginning with a non-zero digit (1-9)
Below is a demo:
import pandas as pd
import re
df = pd.DataFrame(['176828287','176841791','202142958','222539874','223565464','224721631','227675081','30235355118',
'%','---','.','.215694985','0','00','000','00000000000000','99999999999999','999999999999999',':211066980',
'D5146159','JulieGreen','N/a','NONE','None','PP - PremiumPr','T0000','T0000019','T0000022'], columns=['member_id'])
r = re.compile(r'\A((1000\d{8})|([1-9]\d{3,10}))\Z')
df['valid'] = df['member_id'].apply(lambda x: bool(r.match(x)))
#you can use df['member_id'] = df['member_id'].apply(lambda x: x if bool(r.match(x)) else 0) to replace invalid id with 0
print(df)
Output:
member_id valid
0 176828287 True
1 176841791 True
2 202142958 True
3 222539874 True
4 223565464 True
5 224721631 True
6 227675081 True
7 30235355118 True
8 % False
9 --- False
10 . False
11 .215694985 False
12 0 False
13 00 False
14 000 False
15 00000000000000 False
16 99999999999999 False
17 999999999999999 False
18 :211066980 False
19 D5146159 False
20 JulieGreen False
21 N/a False
22 NONE False
23 None False
24 PP - PremiumPr False
25 T0000 False
26 T0000019 False
27 T0000022 False
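To see where the boundaries of this pattern fall, a few edge cases can be checked with the standard re module alone. This is only a sketch: the sample strings and the helper name is_valid are made up for the check, and re.fullmatch is used in place of the \A and \Z anchors.

```python
import re

# Same alternatives as the pattern above; fullmatch requires the whole
# string to match, so explicit start/end anchors are not needed
MEMBER_ID = re.compile(r'1000\d{8}|[1-9]\d{3,10}')

def is_valid(member_id):
    """Return True if the entire string matches the membership ID pattern."""
    return MEMBER_ID.fullmatch(member_id) is not None

samples = ['1234', '123', '0123', '100012345678', '123456789012']
results = {s: is_valid(s) for s in samples}
print(results)
```

Here '1234' and '100012345678' are accepted, while '123' (too short), '0123' (leading zero) and '123456789012' (12 digits not starting with 1000) are rejected.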
Do you already have a regex that matches the values you want to replace with 0? If not, you either have to create one, or make a dictionary terms = {'N/a':0, '---':0} of the individual items you want to replace and then call .map(terms) on the series.
pandas has built-in string methods, including regex pattern matching.
So you can easily create a boolean mask that distinguishes valid from invalid IDs:
pattern = r'1000\d{6,8}$|[1-9]\d{3,8}$'
mask = df.member_id.str.match(pattern)
To print only the valid rows, just use the mask as index:
print(df[mask])
member_id
1 176828287
2 176841791
3 202142958
4 222539874
5 223565464
6 224721631
7 227675081
To set invalid data to 0, just use the complement of the mask:
df.loc[~mask, 'member_id'] = 0
print(df)
member_id
1 176828287
2 176841791
3 202142958
4 222539874
5 223565464
6 224721631
7 227675081
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
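If the original frame should stay untouched, a variation is to build a cleaned copy of the column instead of assigning through .loc. This is a sketch under two assumptions: Series.str.fullmatch is available (pandas 1.1 or later), and the column holds plain Python strings (dtype=object is set explicitly on the small made-up sample).

```python
import pandas as pd

# Small hypothetical sample; object dtype so the column can hold both
# strings and the integer 0
df = pd.DataFrame({'member_id': ['176828287', '30235355118', '%', '0',
                                 'T0000', '100012345678']}, dtype=object)

# fullmatch anchors the pattern at both ends, so no trailing $ is needed
pattern = r'1000\d{6,8}|[1-9]\d{3,8}'
mask = df['member_id'].str.fullmatch(pattern)

# Keep valid ids, replace everything else with 0; df itself is not modified
cleaned = df['member_id'].where(mask, 0)
print(cleaned.tolist())
```

Only '176828287' and '100012345678' survive here; the other four values become 0.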
Here is my data:
import numpy as np
import pandas as pd
z = pd.DataFrame({'a':[1,1,1,2,2,3,3],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})
z
a b c
0 1 3 10
1 1 4 11
2 1 5 12
3 2 6 13
4 2 7 14
5 3 8 15
6 3 9 16
Question:
How can I do a calculation on different elements of each subgroup? For example, for each group, I want to take every element of column 'c' whose corresponding element in column 'b' is between 4 and 9, and sum them.
Here is the code I wrote (it runs, but I cannot get the correct result):
gbz = z.groupby('a')
# For displaying the groups:
gbz.apply(lambda x: print(x))
list = []
def f(x):
    list_new = []
    for row in range(0,len(x)):
        if (x.iloc[row,0] > 4 and x.iloc[row,0] < 9):
            list_new.append(x.iloc[row,1])
    list.append(sum(list_new))
results = gbz.apply(f)
The output result should be something like this:
a c
0 1 12
1 2 27
2 3 15
It might just be easiest to change the order of operations, and filter against your criteria first - it does not change after the groupby.
z.query('4 < b < 9').groupby('a', as_index=False).c.sum()
which yields
a c
0 1 12
1 2 27
2 3 15
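One thing to keep in mind with filtering first: a group with no row in range disappears from the result entirely. If such groups should show up with a sum of 0, one option is to zero out the values instead of dropping the rows. A minimal sketch; the extra group a=4 is made up to show the difference:

```python
import pandas as pd

# Data from the question plus one hypothetical group (a=4) with no b in range
z = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 3, 4],
                  'b': [3, 4, 5, 6, 7, 8, 9, 9],
                  'c': [10, 11, 12, 13, 14, 15, 16, 17]})

in_range = (z['b'] > 4) & (z['b'] < 9)

# Replace out-of-range c values with 0 rather than removing the rows,
# so every group in column a still appears in the result
result = z['c'].where(in_range, 0).groupby(z['a']).sum().reset_index()
print(result)
```

Groups 1 to 3 give 12, 27 and 15 as before, and group 4 is reported with 0 instead of being left out.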
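Use between (in pandas 1.3 and later the argument is inclusive='neither'; older versions took inclusive=False):
In [2379]: z[z.b.between(4, 9, inclusive='neither')].groupby('a', as_index=False).c.sum()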
Out[2379]:
a c
0 1 12
1 2 27
2 3 15
Or
In [2384]: z[(4 < z.b) & (z.b < 9)].groupby('a', as_index=False).c.sum()
Out[2384]:
a c
0 1 12
1 2 27
2 3 15
You could also group first (inclusive='neither' needs pandas 1.3 or later; older versions took inclusive=False).
z = z.groupby('a').apply(lambda x: x.loc[x['b']\
    .between(4, 9, inclusive='neither'), 'c'].sum()).reset_index(name='c')
z
a c
0 1 12
1 2 27
2 3 15
Or you can use
z.groupby('a').apply(lambda x : sum(x.loc[(x['b']>4)&(x['b']<9),'c']))\
.reset_index(name='c')
Out[775]:
a c
0 1 12
1 2 27
2 3 15
I want to count values from the STATE and ID columns row by row in my DataFrame, but I am getting a KeyError. My final code is below. For each ID number (1 to 12) I want to count the state changes. Here is my data; there are thousands of rows like these.
#this code works for state column
chars = "TGC-"
nums = {}
for char in chars:
    s = df["STATE"]
    A = s.str.contains("A:" + char)
    num = A.value_counts(sort=True)
    nums[char] = num
ATnum = nums["T"]
AGnum = nums["G"]
ACnum= nums["C"]
A_num= nums["-"]
ATnum
Out[26]:
False 51919
True 1248
dtype: int64
# and this one works for id column
pt = df.sort("ID")["ID"]
pt_num=pt.value_counts()
pt_values= pt_num.order()
pt_index= pt_num.sort_index()
#these are the total numbers of each id datas
pt_num
Out[27]:
10 5241
6 5144
11 4561
2 4439
3 4346
5 4284
9 4244
12 4218
7 4217
1 4210
4 4199
8 4064
dtype: int64
# i combine both ID and STATE columns and try to read row-by-row
draft
Out[21]:
ID STATE
0 11 chr1:100154376:G:A
1 2 chr1:100177723:C:T
2 9 chr1:100177723:C:T
3 1 chr1:100194200:-:AA
4 8 chr1:10032249:A:G
5 2 chr1:100340787:G:A
6 1 chr1:100349757:A:G
7 3 chr1:10041186:C:A
8 10 chr1:100476986:G:C
9 4 chr1:100572459:C:T
10 5 chr1:100572459:C:T
11 2 chr1:100671861:T:-
12 4 chr1:1021390:C:A
13 5 chr1:10228220:G:C
14 3 chr1:1026913:C:T
15 4 chr1:1026913:C:T
... ... ...
53152 6 chrY:21618583:G:C
53153 5 chrY:24443836:T:G
53154 6 chrY:24443836:T:G
53155 8 chrY:24443836:T:G
53156 10 chrY:24443836:T:G
53157 12 chrY:24443836:T:G
53158 3 chrY:5605924:C:T
53159 2 chrY:6932151:G:A
53160 10 chrY:7224175:G:T
53161 2 chrY:9197998:C:T
53162 3 chrY:9197998:C:T
53163 4 chrY:9197998:C:T
53164 11 chrY:9197998:C:T
53165 12 chrY:9197998:C:T
53166 11 chrY:9304866:G:A
[53167 rows x 2 columns]
draft= df[["ID", "STATE" ]]
chars = "TGC-"
number = {}
d = draft
for i in d["ID"]:
    if i==1:
        for item in chars:
            At = d["STATE"].str.contains("A:" + item)
            num = At.value_counts(sort=True)
            number[item] = num
            ATn=number["T"]
            AGn=number["G"]
            ACn=number["C"]
            A_n=number["-"]
KeyError: 'G'
In short, what I want is this: ID 1, for example, has 4210 rows, and I want to determine how many of them have a state of A:T, A:G, A:C and A:-.
Where am I going wrong?
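The KeyError looks consistent with the ATn/AGn/ACn/A_n lookups running inside the inner loop, where number["G"] is read before the "G" pass has filled it in. Also note that d["STATE"].str.contains(...) is evaluated on the whole column, so the if i==1 check never restricts the count to ID 1. One possible approach, sketched on a small made-up sample with the same ID and STATE columns, is to split off the last two fields of STATE and cross-tabulate them against ID:

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's `draft` frame
draft = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2],
    "STATE": [
        "chr1:100349757:A:G",
        "chr1:100194200:-:AA",
        "chr1:200:A:T",
        "chr1:500:A:-",
        "chr1:300:A:G",
        "chr1:400:C:T",
    ],
})

# Split from the right: column 1 is the original base, column 2 the new one
parts = draft["STATE"].str.rsplit(":", n=2, expand=True)
ref = parts[1]
alt = parts[2].rename("alt")

# Keep only changes that start from A, then count each target per ID
from_a = ref == "A"
counts = pd.crosstab(draft.loc[from_a, "ID"], alt[from_a])
print(counts)
```

In the resulting table each row is an ID and each column a target state, so counts.loc[1, "G"] is the number of A:G rows for ID 1.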