I have the following data:
0 Ground out of 2
1 1 out of 3
2 1 out of 3
Name: Floor, dtype: object
I want to modify this data so that I can create two columns named first floor and max floor.
Looking at the first item as an example:
0 Ground out of 2
the first floor would be 0 and max floor would be 2 etc...
This is the code I have written to extract the first floor items:
first_floor = []
lower_floors = ['Ground', 'Basement']
for data in df.Floor:
    for char in lower_floors:
        if char in data:
            first_floor.append('0')
    else:
        first_floor.append(data[:2])
When I do this, I get the following output:
['0', 'Gr', '1 ', '1 ']
I am expecting
['0', '1 ', '1 ']
Can someone explain where I am going wrong?
Thanks in advance.
Your loop is written in the wrong order. But anyway, don't use a loop; rather use vectorized string extraction and fillna:
df['Floor'].str.extract(r'^(\d+)', expand=False).fillna(0).astype(int)
Or for more flexibility (Ground -> 0 ; Basement -> -1…):
(df['Floor'].str.extract(r'^(\w+)', expand=False)
   .replace({'Ground': 0, 'Basement': -1})
   .astype(int)
)
output:
0 0
1 1
2 1
Name: Floor, dtype: int64
As list:
df['Floor'].str.extract(r'^(\d+)', expand=False).fillna(0).astype(int).tolist()
output: [0, 1, 1]
First of all, the indent of the else case is wrong. It should be:
first_floor = []
lower_floors = ['Ground', 'Basement']
for data in df.Floor:
    for char in lower_floors:
        if char in data:
            first_floor.append('0')
        else:
            first_floor.append(data[:2])
And second, as you are looping through the Floor column, data is just a cell, not a row, so data[:2] only keeps the first 2 characters of the cell. This is why you see Gr.
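If you do want to keep a loop, here is a minimal sketch (assuming the df from the question) that appends exactly one value per row; the vectorized extraction above is still the better option:
# Sketch only: one append per row of the 'Floor' column
first_floor = []
lower_floors = ['Ground', 'Basement']
for data in df.Floor:
    if any(name in data for name in lower_floors):
        first_floor.append('0')
    else:
        first_floor.append(data.split()[0])  # first token, e.g. '1'
print(first_floor)  # ['0', '1', '1']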
Hi,
I managed to convert the table to a data frame as initialized in the picture above. I want to iterate through each row and list down all the numbers between 'start' and 'end' (of each row) including 'start' and 'end' values too.
I wrote down a code block that works when I replace 'i' with an integer. I want to iterate through all rows by using 'i' instead of an integer in a loop.
Could you please help?
I tried to adapt some solutions from StackOverflow but couldn't...
for i in range():
    bignumber = (df.end[i])
    while bignumber > df.start[i]:
        print(bignumber)
        bignumber = bignumber - 1
        if bignumber == df.start[i]:
            print(bignumber.tolist())
    i = i + 1
I tried to iterate using a for loop with an 'i' argument but couldn't.
The code you found is overcomplicated. You can use a normal for-loop with range():
for number in range(row['start'], row['end']+1):
    print(number)
And you can use .apply() to run it on every row in the DataFrame:
import pandas as pd

df = pd.DataFrame({
    'start': [1, 2, 3],
    'end': [4, 5, 6],
})
print(df)

def display(row):
    print('start:', row['start'], '| end:', row['end'])
    for number in range(row['start'], row['end']+1):
        print(number)

df.apply(display, axis=1)
Result:
start end
0 1 4
1 2 5
2 3 6
start: 1 | end: 4
1
2
3
4
start: 2 | end: 5
2
3
4
5
start: 3 | end: 6
3
4
5
6
If you need to iterate over the rows yourself, you can use df.iterrows():
for idx, row in df.iterrows():
    print('start:', row['start'], '| end:', row['end'])
    for number in range(row['start'], row['end']+1):
        print(number)
but apply is preferred and may be faster.
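As a small extra sketch (an assumption, not part of the original answer): if you want to keep the expanded numbers instead of only printing them, you can store them in a new column.
# Sketch: store the expanded range in a new column (uses the df defined above)
df['numbers'] = df.apply(lambda row: list(range(row['start'], row['end'] + 1)), axis=1)
print(df)
#    start  end       numbers
# 0      1    4  [1, 2, 3, 4]
# 1      2    5  [2, 3, 4, 5]
# 2      3    6  [3, 4, 5, 6]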
I need a way to extract all words that start with 'A' followed by a 6-digit numeric string right after (i.e. A112233, A000023).
Each cell contains sentences and there could potentially be a user error where they forget to put a space, so if you could account for that as well it would be greatly appreciated.
I've done research into using Python regex and Pandas, but I just don't know enough yet and am kind of on a time crunch.
Suppose your df's content is constructed from the following code:
import pandas as pd

df1 = pd.DataFrame(
    {
        "columnA": ["A194533", "A4A556633 system01A484666", "A4A556633",
                    "a987654A948323a882332A484666", "A238B004867", "pageA000023lol",
                    "a089923", "something lol a484876A48466 emoji", "A906633 A556633a556633"]
    }
)
print(df1)
Output:
columnA
0 A194533
1 A4A556633 system01A484666
2 A4A556633
3 a987654A948323a882332A484666
4 A238B004867
5 pageA000023lol
6 a089923
7 something lol a484876A48466 emoji
8 A906633 A556633a556633
Now let's fetch the targets matching the regex pattern:
result = df1['columnA'].str.extractall(r'([A]\d{6})')
Output:
0
match
0 0 A194533
1 0 A556633
1 A484666
2 0 A556633
3 0 A948323
1 A484666
5 0 A000023
8 0 A906633
1 A556633
And count them:
result.value_counts()
Output:
A556633 3
A484666 2
A000023 1
A194533 1
A906633 1
A948323 1
dtype: int64
Send the unique index into a list:
unique_list = [i[0] for i in result.value_counts().index.tolist()]
Output:
['A556633', 'A484666', 'A000023', 'A194533', 'A906633', 'A948323']
Value counts into a list:
unique_count_list = result.value_counts().values.tolist()
Output:
[3, 2, 1, 1, 1, 1]
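If you want the codes and their counts together (just a sketch building on the two lists above), you can zip them into a dict:
# Sketch: pair each unique code with its count
counts = dict(zip(unique_list, unique_count_list))
print(counts)
# {'A556633': 3, 'A484666': 2, 'A000023': 1, 'A194533': 1, 'A906633': 1, 'A948323': 1}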
I have a dataframe containing lists of words in each row in the same column. I'd like to remove what I guess are spaces. I managed to get rid of some by doing:
for i in processed.text:
    for x in i:
        if x == '' or x == " ":
            i.remove(x)
But some of them still remain.
>processed['text']
0 [have, month, #postdoc, within, on, chemical, ...
1 [hardworking, producers, iowa, so, for, state,...
2 [hardworking, producers, iowa, so, for, state,...
3 [today, time, is, to, sources, energy, much, p...
4 [thanks, gaetanos, club, c, oh, choosing, #rec...
...
130736 [gw, fossil, renewable, import, , , , , , , , ...
130737 [s, not, , go, ]
130738 [answer, deforestation, in, ]
130739 [plastic, regrind, any, and, grades, we, make,...
130740 [grid, generating, of, , , , gw]
Name: text, Length: 130741, dtype: object
>type(processed)
<class 'pandas.core.frame.DataFrame'>
Thank you very much.
Split on comma, remove empty values, and then combine again with comma:
def remove_empty(x):
    if type(x) is str:
        x = x.split(",")
        x = [y for y in x if y.strip()]
        return ",".join(x)
    elif type(x) is list:
        return [y for y in x if y.strip()]

processed['text'] = processed['text'].apply(remove_empty)
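A quick usage sketch on made-up data shaped like the question's column (the sample values are an assumption, not your real data):
import pandas as pd

# Hypothetical sample resembling the 'text' column from the question
processed = pd.DataFrame({'text': [['gw', 'fossil', '', '', 'gw'],
                                   ['s', 'not', '', 'go', '']]})
processed['text'] = processed['text'].apply(remove_empty)
print(processed['text'].tolist())
# [['gw', 'fossil', 'gw'], ['s', 'not', 'go']]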
You can use split(expand=True) to do that. Note: you don't have to explicitly write split(' ', expand=True); by default it splits on whitespace. You can also replace ' ' with any other separator. For example, if your words are separated with ',' or '-', you can use that separator to split the columns.
import pandas as pd

df = pd.DataFrame({'Col1': ['This is a long sentence',
                            'This is another long sentence',
                            'This is short',
                            'This is medium length',
                            'Wow. Tiny',
                            'Petite',
                            'Ok']})
print (df)
df = df.Col1.str.split(' ',expand=True)
print (df)
The output of this will be:
Original dataframe:
Col1
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny
5 Petite
6 Ok
Dataframe split into columns
0 1 2 3 4
0 This is a long sentence
1 This is another long sentence
2 This is short None None
3 This is medium length
4 Wow. Tiny None None None
5 Petite None None None None
6 Ok None None None None
If you want to limit them to 3 columns only, then use n=2
df = df.Col1.str.split(' ',n = 2, expand=True)
The output will be:
0 1 2
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny None
5 Petite None None
6 Ok None None
If you want to rename the columns to be more specific, then you can add rename to the end like this.
df = df.Col1.str.split(' ',n = 2, expand=True).rename({0:'A',1:'B',2:'C'},axis=1)
A B C
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny None
5 Petite None None
6 Ok None None
In case you want to replace all the None with '' and also prefix the column names, you can do it as follows:
df = df.Col1.str.split(expand=True).add_prefix('Col').fillna('')
Col0 Col1 Col2 Col3 Col4
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny
5 Petite
6 Ok
I have a string of number chars that I want to change to type int, but I need to remove the parentheses and the numbers inside them (it's just a multiplier for my application; this is how I get the data).
Here is the sample code.
import pandas as pd
voltages = ['0', '0', '0', '0', '0', '310.000 (31)', '300.000 (30)', '190.000 (19)', '0', '20.000 (2)']
df = pd.DataFrame(voltages, columns=['Voltage'])
df
Out [1]:
Voltage
0 0
1 0
2 0
3 0
4 0
5 310.000 (31)
6 300.000 (30)
7 190.000 (19)
8 0
9 20.000 (2)
How can I remove the substrings within the parenthesis? Is there a Pandas.series.str way to do it?
Use str.replace with a regex:
df.Voltage.str.replace(r"\s\(.*", "", regex=True)
Out:
0 0
1 0
2 0
3 0
4 0
5 310.000
6 300.000
7 190.000
8 0
9 20.000
Name: Voltage, dtype: object
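If you then want numeric values rather than strings (an assumption based on the question), you can chain pd.to_numeric onto the same expression:
# Sketch: convert the cleaned strings to numbers
pd.to_numeric(df.Voltage.str.replace(r"\s\(.*", "", regex=True))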
You can also use str.split():
df_2 = df['Voltage'].str.split(' ', expand=True).rename(columns={0: 'Voltage'})
df_2['Voltage'] = df_2['Voltage'].astype('float')
If you know the separating character will always be a space then the following is quite a neat way of doing it:
voltages = [i.rsplit(' ')[0] for i in voltages]
I think you could try this:
new_series = df['Voltage'].apply(lambda x:int(x.split('.')[0]))
df['Voltage'] = new_series
I hope it helps.
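As a quick check (just a sketch using the voltages list from the question), the converted values would be:
print(df['Voltage'].tolist())
# [0, 0, 0, 0, 0, 310, 300, 190, 0, 20]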
Hopefully, this will work for you:
idx = source_value.find(" (")
result = source_value[:idx] if idx != -1 else source_value
NOTE: the find function requires a string as source_value. But if you have parens in your value, I assume it is a string.
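To apply the same idea to the whole column, here is a small sketch (the helper name strip_parens is just an illustration) that also handles values without parentheses:
def strip_parens(source_value):
    # Keep only the part before " ("; leave values without parens untouched
    idx = source_value.find(" (")
    return source_value[:idx] if idx != -1 else source_value

df['Voltage'] = df['Voltage'].apply(strip_parens).astype(float)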
I have data that contains a 'None ...' string at random places. I am trying to replace a cell in the dataframe with an empty string only when it begins with 'None ...'. Here is what I tried, but I get errors like KeyError.
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'sub': ['None ... ', 'None ... test', 'math None ...', 'probability', 'chemistry']})
df.loc[df['sub'].str.replace('None ...', '', 1), 'sub'] = ''  # getting KeyError
Output I am looking for (I need to replace the entire value in the cell if 'None ...' is the starting string; notice the 3rd row shouldn't be replaced because 'None ...' is not the starting character):
id sub
1
2
3 math None ...
4 probability
5 chemistry
You can use the below to identify the cells to replace and then assign them an empty value:
df.loc[df['sub'].str.startswith("None"), 'sub'] = ""
df.head()
id sub
0 1
1 2
2 3 math None ...
3 4 probability
4 5 chemistry
You can simply replace 'None ...', and by using a regular expression you can apply this replacement only to strings that start with None.
df['sub'] = df['sub'].str.replace(r'^None \.\.\.*', '', n=1, regex=True)
The output looks like this:
id sub
0 1
1 2 test
2 3 math None ...
3 4 probability
4 5 chemistry
df['sub'] = df['sub'].str.replace(r'[\w\s]*?(None \.\.\.)[\s\w]*?', '', n=1, regex=True)
Out:
sub
id
1
2 test
3
4 probability
5 chemistry
Look at startswith; after we find the rows that need to be replaced, we use mask:
df['sub'] = df['sub'].mask(df['sub'].str.startswith('None ... '), '')
df
Out[338]:
id sub
0 1
1 2
2 3 math None ...
3 4 probability
4 5 chemistry
First, you are using the 'sub' strings as the index, which is why you received a KeyError.
Second, you can do this by:
df['sub'] = df['sub'].apply(lambda x: '' if x.find('None') == 0 else x)
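For reference, a quick check of the result (a sketch using the DataFrame built in the question):
print(df)
#    id            sub
# 0   1
# 1   2
# 2   3  math None ...
# 3   4    probability
# 4   5      chemistry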