Remove last digit from String depending on length - python

I am trying to remove the last digit of each string in df[4], but only if the string is longer than 5 digits.
I tried adding .str[:-1] to df[4]=df[4].astype(str), but this removes the last digit from every string in the column.
df[3]=df[3].astype(str)
df[4]=df[4].astype(str).str[:-1]
df[5]=df[5].astype(str)
I tried several different combinations of if statements, but none have worked.
I'm new to Python and pandas, so any help is appreciated.

You can filter first on the string length:
condition = df[4].astype(str).str.len() > 5
df.loc[condition, 4]=df.loc[condition, 4].astype(str).str[:-1]
For example:
>>> df
4
0 1
1 11
2 111
3 1111
4 11111
5 111111
6 1111111
7 11111111
8 111111111
>>> condition = df[4].astype(str).str.len() > 5
>>> df.loc[condition, 4]=df.loc[condition, 4].astype(str).str[:-1]
>>> df
4
0 1
1 11
2 111
3 1111
4 11111
5 11111
6 111111
7 1111111
8 11111111
If the values are non-negative integers, however, it is more efficient to use integer division by 10:
condition = df[4].astype(str).str.len() > 5
df.loc[condition, 4]=df.loc[condition, 4] // 10
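For non-negative integers, the length test itself can also be done numerically: a value has more than 5 digits exactly when it is at least 100000. A minimal sketch, assuming column 4 holds non-negative integers (the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data; the column label 4 matches the question.
df = pd.DataFrame({4: [1, 11111, 111111, 1234567]})

# A non-negative integer has more than 5 digits iff it is >= 10**5.
too_long = df[4] >= 10**5

# Integer division by 10 drops the last digit of the selected rows only.
df.loc[too_long, 4] = df.loc[too_long, 4] // 10

result = df[4].tolist()
print(result)
```

This avoids the round trip through strings, but it only makes sense while the column is numeric.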

Accessing Elements of a Collection
>>> x = "123456"
# get element at index from start
>>> x[0]
'1'
# get element at index from end
>>> x[-1]
'6'
# get range of elements from n-index to m-index
>>> x[0:3]
'123'
>>> x[1:-2]
'234'
>>> x[-4:-2]
'34'
# get range from/to index with open end/start
>>> x[:-2]
'1234'
>>> x[4:]
'56'
List Comprehension Syntax
I haven't seen Python's list comprehension syntax mentioned yet, and it is really cool and easy.
# input list (standing in for the data frame column) with string lengths from 1 to n
df = [
'a',
'ab',
'abc',
'abcd',
'abcdf',
'abcdfg',
'abcdfgh',
'abcdfghi',
'abcdfghij',
'abcdfghijk',
'abcdfghijkl',
'abcdfghijklm'
]
# using list comprehension syntax: [element for element in collection]
df_new = [
# short hand if syntax: value_a if True else value_b
r if len(r) <= 5 else r[0:5]
for r in df
]
Now df_new contains only strings with a length of at most 5:
[
'a',
'ab',
'abc',
'abcd',
'abcdf',
'abcdf',
'abcdf',
'abcdf',
'abcdf',
'abcdf',
'abcdf',
'abcdf'
]
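Note that the comprehension above truncates every long string to 5 characters. If the goal is only to drop the final character of strings longer than 5, as in the question, a small variation might look like this (the sample values are illustrative):

```python
# Illustrative sample values, already converted to strings.
values = ['1', '11111', '111111', '1234567']

# Drop only the last character, and only when the string is longer than 5.
trimmed = [s[:-1] if len(s) > 5 else s for s in values]

print(trimmed)
```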

This happens because [:-1] removes the last character from every string.
try str df[4]=-1

Related

In a column of a dataframe, count number of elements on list starting by "a"

I have a dataset like
data = {'ID': ['first_value', 'second_value', 'third_value',
'fourth_value', 'fifth_value', 'sixth_value'],
'list_id': [['001', 'ab0', '44A'], [], ['005', '006'],
['a22'], ['azz'], ['aaa', 'abd']]
}
df = pd.DataFrame(data)
And I want to create two columns:
A column that counts the number of elements that start with "a" on 'list_id'
A column that counts the number of elements that DO NOT start with "a" on "list_id"
I was thinking of doing something like:
data['list_id'].apply(lambda x: for entity in x if x.startswith("a")
I thought about first counting the ones starting with “a” and then counting the ones not starting with “a”, so I tried this:
sum(1 for w in data["list_id"] if w.startswith('a'))
However, this does not work, and I cannot figure out how to fix it.
Any ideas? :)
Assuming this input:
ID list_id
0 first_value [001, ab0, 44A]
1 second_value []
2 third_value [005, 006]
3 fourth_value [a22]
4 fifth_value [azz]
5 sixth_value [aaa, abd]
you can use:
sum(1 for l in data['list_id'] for x in l if x.startswith('a'))
output: 5
If you would rather have a count per row:
df['starts_with_a'] = [sum(x.startswith('a') for x in l) for l in df['list_id']]
df['starts_with_other'] = df['list_id'].str.len()-df['starts_with_a']
NB. A list comprehension is usually faster than apply here.
output:
ID list_id starts_with_a starts_with_other
0 first_value [001, ab0, 44A] 1 2
1 second_value [] 0 0
2 third_value [005, 006] 0 2
3 fourth_value [a22] 1 0
4 fifth_value [azz] 1 0
5 sixth_value [aaa, abd] 2 0
Using pandas, something quite similar to your proposal works:
import pandas as pd
data = {'ID': ['first_value', 'second_value', 'third_value', 'fourth_value', 'fifth_value', 'sixth_value'],
'list_id': [['001', 'ab0', '44A'], [], ['005', '006'], ['a22'], ['azz'], ['aaa', 'abd']]
}
df = pd.DataFrame(data)
df["len"] = df.list_id.apply(len)
df["num_a"] = df.list_id.apply(lambda s: sum(map(lambda x: x[0] == "a", s)))
df["num_not_a"] = df["len"] - df["num_a"]

How to extract alphanumeric word from column values in excel with Python?

I need a way to extract all words that start with 'A' followed by a 6-digit numeric string right after (i.e. A112233, A000023).
Each cell contains sentences and there could potentially be a user error where they forget to put a space, so if you could account for that as well it would be greatly appreciated.
I've done research into using Python regex and Pandas, but I just don't know enough yet and am kind of on a time crunch.
Suppose your DataFrame is constructed with the following code:
import pandas as pd
df1=pd.DataFrame(
{
"columnA":["A194533","A4A556633 system01A484666","A4A556633","a987654A948323a882332A484666","A238B004867","pageA000023lol","a089923","something lol a484876A48466 emoji","A906633 A556633a556633"]
}
)
print(df1)
Output:
columnA
0 A194533
1 A4A556633 system01A484666
2 A4A556633
3 a987654A948323a882332A484666
4 A238B004867
5 pageA000023lol
6 a089923
7 something lol a484876A48466 emoji
8 A906633 A556633a556633
Now let's extract every match of the regex pattern:
result = df1['columnA'].str.extractall(r'([A]\d{6})')
Output:
               0
  match
0 0      A194533
1 0      A556633
  1      A484666
2 0      A556633
3 0      A948323
  1      A484666
5 0      A000023
8 0      A906633
  1      A556633
And count them:
result.value_counts()
Output:
A556633 3
A484666 2
A000023 1
A194533 1
A906633 1
A948323 1
dtype: int64
Collect the unique values into a list:
unique_list = [i[0] for i in result.value_counts().index.tolist()]
Output:
['A556633', 'A484666', 'A000023', 'A194533', 'A906633', 'A948323']
Put the value counts into a list:
unique_count_list = result.value_counts().values.tolist()
Output:
[3, 2, 1, 1, 1, 1]
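The pattern above also matches the first seven characters of a longer run such as A1234567. If such cases should be skipped, a negative lookahead is one option. A standard-library sketch (whether longer digit runs should be rejected is an assumption, since the question does not say, and the last sample cell is made up):

```python
import re
from collections import Counter

# Sample cells; the last one is a hypothetical case with 7 digits after 'A'.
cells = ["A194533", "A4A556633 system01A484666",
         "pageA000023lol", "A1234567 A906633"]

# 'A' followed by exactly six digits that are not followed by another digit.
pattern = re.compile(r'A\d{6}(?!\d)')

# findall scans the whole cell, so missing spaces are not a problem.
matches = [m for cell in cells for m in pattern.findall(cell)]
counts = Counter(matches)

print(matches)
print(counts.most_common())
```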

How do I iterate between lists of different lengths?

I have two lists which I need to iterate over together. Let me show how:
listA=[1,2,3,4]
listB=["A","B","C"]
From those lists I would like to have this list
ListC=("1A","2B","3C","4A")
And even make a longer list in which I can loop the same iteration
ListC=("1A","2B","3C","4A","1B","2C","3A","4C".... and so on)
I couldn't find any tutorial online that answers this question.
Thanks.
Use zip and itertools.cycle:
>>> from itertools import cycle
>>> listA = [1, 2, 3, 4]
>>> listB = ["A", "B", "C"]
>>> [f'{x}{y}' for x, y in zip(listA, cycle(listB))]
['1A', '2B', '3C', '4A']
# listA: 1 2 3 4
# cycle(listB): "A" "B" "C" "A" "B" "C" ...
cycle endlessly cycles through the elements of its argument; zip stops iterating after its shorter argument is exhausted.
You can use cycle with both lists, but the result will be an infinite sequence of values; you'll need to use something like itertools.islice to take a finite prefix of the result.
>>> from itertools import cycle, islice
>>> [f'{x}{y}' for x, y in islice(zip(cycle(listA), cycle(listB)), 8)]
['1A', '2B', '3C', '4A', '1B', '2C', '3A', '4B']
# cycle(listA): 1 2 3 4 1 2 3 4 1 2 3 4 1 ...
# cycle(listB): "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" ...
# Note that the result itself is a cycle of 12 unique elements, because
# the least common multiple (LCM) of 3 and 4 is 12.
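To see the full period mentioned above, one could compute the least common multiple of the two lengths and take exactly that many pairs. A sketch using math.gcd (the names period and full_cycle are illustrative):

```python
from itertools import cycle, islice
from math import gcd

listA = [1, 2, 3, 4]
listB = ["A", "B", "C"]

# LCM of the two lengths: after this many pairs the sequence repeats.
period = len(listA) * len(listB) // gcd(len(listA), len(listB))

full_cycle = [f'{x}{y}'
              for x, y in islice(zip(cycle(listA), cycle(listB)), period)]

print(period)
print(full_cycle)
```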
You can use modulo to take care of this kind of problem. Here's code to repeat this 100 times:
l1 = [1, 2, 3, 4]
l2 = ['a', 'b', 'c']
result = []
for i in range(100):
    result.append(str(l1[i % len(l1)]) + l2[i % len(l2)])
print(result)
Another option is to track the index into listB manually:
listA=[1,2,3,4]
listB=["A","B","C"]
listC=[]
for a in listA:
    index = listA.index(a)
    # past the end of listB: continue from the letter used last
    if listA.index(a) > len(listB) - 1:
        if listC[-1][1] != listB[-1]:
            index = listB.index(listC[-1][1]) + 1
        else:
            index = 0
    listC.append(str(a)+listB[index])
print(listC)

Strip all characters from column header before a :

I have columns named like this:
1:Arnston 2:Berg 3:Carlson 53:Brown
and I want to strip all the characters before and including the :. I know I can rename the columns, but that would be pretty tedious since my numbers go up to 100.
My desired output is:
Arnston Berg Carlson Brown
Assuming that you have a frame looking something like this:
>>> df
1:Arnston 2:Berg 3:Carlson 53:Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
You can use the vectorized string operators to split each entry at the first colon and then take the second part:
>>> df.columns = df.columns.str.split(":", n=1).str[1]
>>> df
Arnston Berg Carlson Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
import re
s = '1:Arnston 2:Berg 3:Carlson 53:Brown'
s_minus_numbers = re.sub(r'\d+:', '', s)
Gets you
'Arnston Berg Carlson Brown'
The best solution IMO is to use pandas' str attribute on the columns. This allows for the use of regular expressions without having to import re:
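df.columns = df.columns.str.extract(r'\d+:(.*)', expand=False)
Where the regex means: select everything ((.*)) after one or more digits (\d+) and a colon (:). Passing expand=False returns an Index rather than a DataFrame, so the result can be assigned back to df.columns.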
You can do it with a list comprehension:
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
print('Before: {!r}'.format(columns))
columns = [col.split(':')[1] for col in columns]
print('After: {!r}'.format(columns))
Output
Before: ['1:Arnston', '2:Berg', '3:Carlson', '53:Brown']
After: ['Arnston', 'Berg', 'Carlson', 'Brown']
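One thing to watch: col.split(':')[1] raises an IndexError for a name that contains no colon. If some headers might lack the numeric prefix, str.partition is a more tolerant alternative. A sketch (strip_prefix is a hypothetical helper name, and 'Total' is a made-up header without a prefix):

```python
def strip_prefix(name):
    """Drop everything up to and including the first ':'; keep names without one."""
    head, sep, tail = name.partition(':')
    return tail if sep else name

columns = ['1:Arnston', '2:Berg', '3:Carlson', '53:Brown', 'Total']
cleaned = [strip_prefix(col) for col in columns]

print(cleaned)
```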
Another way is with a regular expression using re.sub():
import re
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
pattern = re.compile(r'^.+:')
columns = [pattern.sub('', col) for col in columns]
print(columns)
Output
['Arnston', 'Berg', 'Carlson', 'Brown']
df = pd.DataFrame({'1:Arnston':[5,9,9],
'2:Berg':[0,3,2],
'3:Carlson':[2,2,9] ,
'53:Brown':[1,9,7]})
[x.split(':')[1] for x in df.columns.factorize()[1]]
output:
['Arnston', 'Berg', 'Carlson', 'Brown']
You could use str.replace and pass a regular expression:
In [52]: df
Out[52]:
1:Arnston 2:Berg 3:Carlson 53:Brown
0 1.340711 1.261500 -0.512704 -0.064384
1 0.462526 -0.358382 0.168122 -0.660446
2 -0.089622 0.656828 -0.838688 -0.046186
3 1.041807 0.775830 -0.436045 0.162221
4 -0.422146 0.775747 0.106112 -0.044917
In [51]: df.columns.str.replace(r'\d+:', '', regex=True)
Out[51]: Index(['Arnston', 'Berg', 'Carlson', 'Brown'], dtype='object')
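Note that str.replace returns a new Index and leaves df itself unchanged. To get a renamed frame in one step, DataFrame.rename accepts a function that is applied to each column label. A sketch, using a small frame with the same headers (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'1:Arnston': [5, 9, 9],
                   '2:Berg': [0, 3, 2],
                   '3:Carlson': [2, 2, 9],
                   '53:Brown': [1, 9, 7]})

# split(':', 1)[-1] keeps the part after the first colon,
# or the whole label when there is no colon at all.
renamed = df.rename(columns=lambda name: name.split(':', 1)[-1])

new_columns = list(renamed.columns)
print(new_columns)
```

rename returns a new DataFrame, so the original df keeps its prefixed headers unless the result is assigned back.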

How to add inputted numbers in a string (Python)

How do I add up all the numbers entered in a string?
Ex:
input:
5 5 3 5
output
18
and it must support negative numbers ('-')
Ex.
input
-5 5 3 5
output
8
I wrote something like this:
x = raw_input()
print sum(map(int,str(x)))
and it adds normally if x > 0.
But what should I do about ('-')?
I understand that I need to use split(), but my knowledge is not enough :(
You're close, you just need to split the string on spaces. Splitting will produce the list of strings ['-5', '5', '3', '5']. Then you can do the rest of the map and sum as you intended.
>>> s = '-5 5 3 5'
>>> sum(map(int, s.split()))
8
It's simple:
>>> input = raw_input('Enter your input: ')
Enter your input: 5 5 10 -10
>>> list_numbers = [int(item) for item in input.split(' ')]
>>> print list_numbers
[5, 5, 10, -10]
After that, you can do whatever you want with the list :)
You can use the following line:
sum(map(int, raw_input().split()))
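The answers above use Python 2 (raw_input and the print statement). In Python 3, raw_input was renamed to input and print is a function. A sketch of the same idea as a reusable function (sum_numbers is a hypothetical name):

```python
def sum_numbers(text):
    """Sum whitespace-separated integers; int() handles a leading '-'."""
    return sum(int(token) for token in text.split())

total_positive = sum_numbers('5 5 3 5')
total_mixed = sum_numbers('-5 5 3 5')

print(total_positive)
print(total_mixed)

# Interactive use would be: print(sum_numbers(input()))
```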
