Can someone explain this please? - python

I have the following data:
0 Ground out of 2
1 1 out of 3
2 1 out of 3
Name: Floor, dtype: object
I want to modify this data so that I can create two columns named first floor and max floor.
Looking at the first item as an example:
0 Ground out of 2
the first floor would be 0 and max floor would be 2 etc...
This is the code I have written to extract the first floor items:
floor_location = []
lower_floors = ['Ground', 'Basement']
for data in df.Floor:
    for char in lower_floors:
        if char in data:
            floor_location.append('0')
    else:
        floor_location.append(data[:2])
When I do this, I get the following output:
['0', 'Gr', '1 ', '1 ']
I am expecting
['0', '1 ', '1 ']
Can someone explain where I am going wrong?
Thanks in advance.

Your loop is not structured correctly.
But anyway, don't use a loop; use vectorized string extraction and fillna instead:
df['Floor'].str.extract(r'^(\d+)', expand=False).fillna(0).astype(int)
Or for more flexibility (Ground -> 0 ; Basement -> -1…):
(df['Floor'].str.extract(r'^(\w+)', expand=False)
    .replace({'Ground': 0, 'Basement': -1})
    .astype(int)
)
output:
0 0
1 1
2 1
Name: Floor, dtype: int64
As list:
df['Floor'].str.extract(r'^(\d+)', expand=False).fillna(0).astype(int).tolist()
output: [0, 1, 1]
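The question also asks for a max floor column; a minimal sketch for that second column, assuming every value ends with "out of N" (the max_floor column name is just illustrative):
# sketch: grab the number after 'out of' as the maximum floor
df['max_floor'] = df['Floor'].str.extract(r'out of (\d+)', expand=False).astype(int)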

First of all, your else is attached to the inner for loop, and a for/else runs whenever that loop finishes without hitting a break, which here is every time. Add a break once a lower floor is matched, so the else only runs when nothing matched:
floor_location = []
lower_floors = ['Ground', 'Basement']
for data in df.Floor:
    for char in lower_floors:
        if char in data:
            floor_location.append('0')
            break
    else:
        floor_location.append(data[:2])
And second, as you are looping through the Floor column, data is a single cell, not a row, so data[:2] just takes the first two characters of that string. This is why you see Gr.
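If you prefer to avoid the for/else construct entirely, a minimal sketch using any() (keeping the '0' convention and the data[:2] slice from the question) would be:
floor_location = []
lower_floors = ['Ground', 'Basement']
for data in df.Floor:
    if any(label in data for label in lower_floors):
        floor_location.append('0')       # Ground/Basement counted as floor 0
    else:
        floor_location.append(data[:2])  # leading digits, e.g. '1 ' from '1 out of 3'
print(floor_location)  # ['0', '1 ', '1 ']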

Related

python3 can't iterate loop using data frame

Hi,
I managed to convert the table to a data frame as initialized in the picture above. I want to iterate through each row and list down all the numbers between 'start' and 'end' (of each row) including 'start' and 'end' values too.
I wrote down a code block that works when I replace 'i' with an integer. I want to iterate through all rows by using 'i' instead of an integer in a loop.
Could you please help?
I tried to adapt some solutions from StackOverflow but couldn't...
for i in range():
    bignumber = (df.end[i])
    while bignumber > df.start[i]:
        print(bignumber)
        bignumber = bignumber - 1
        if bignumber == df.start[i]:
            print(bignumber.tolist())
    i=i+1
I tried to iterate using for loop with 'i' argument but couldn't.
The code you found is overcomplicated.
You can use a normal for-loop with range():
for number in range(row['start'], row['end']+1):
    print(number)
And you can use .apply() to run it on every row of the DataFrame:
import pandas as pd

df = pd.DataFrame({
    'start': [1,2,3],
    'end': [4,5,6],
})
print(df)

def display(row):
    print('start:', row['start'], '| end:', row['end'])
    for number in range(row['start'], row['end']+1):
        print(number)

df.apply(display, axis=1)
Result:
start end
0 1 4
1 2 5
2 3 6
start: 1 | end: 4
1
2
3
4
start: 2 | end: 5
2
3
4
5
start: 3 | end: 6
3
4
5
6
If you need to iterate over rows, you can use df.iterrows():
for idx, row in df.iterrows():
    print('start:', row['start'], '| end:', row['end'])
    for number in range(row['start'], row['end']+1):
        print(number)
but apply is preferred and it may work faster.
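If the goal is to store the expanded numbers rather than print them, a small sketch without .apply is shown below (it assumes start and end are plain integer columns; the numbers column name is just illustrative):
# build one list of numbers per row, without .apply
df['numbers'] = [list(range(s, e + 1)) for s, e in zip(df['start'], df['end'])]
print(df)  # each row now carries its own list, e.g. [1, 2, 3, 4] for start=1, end=4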

How to extract alphanumeric word from column values in excel with Python?

I need a way to extract all words that start with 'A' followed by a 6-digit numeric string right after (i.e. A112233, A000023).
Each cell contains sentences and there could potentially be a user error where they forget to put a space, so if you could account for that as well it would be greatly appreciated.
I've done research into using Python regex and Pandas, but I just don't know enough yet and am kind of on a time crunch.
Suppose your df's content is constructed with the following code:
import pandas as pd
df1 = pd.DataFrame(
    {
        "columnA": ["A194533", "A4A556633 system01A484666", "A4A556633",
                    "a987654A948323a882332A484666", "A238B004867",
                    "pageA000023lol", "a089923",
                    "something lol a484876A48466 emoji",
                    "A906633 A556633a556633"]
    }
)
print(df1)
Output:
columnA
0 A194533
1 A4A556633 system01A484666
2 A4A556633
3 a987654A948323a882332A484666
4 A238B004867
5 pageA000023lol
6 a089923
7 something lol a484876A48466 emoji
8 A906633 A556633a556633
Now let's fetch the targets matching the regex pattern:
result = df1['columnA'].str.extractall(r'([A]\d{6})')
Output:
0
match
0 0 A194533
1 0 A556633
1 A484666
2 0 A556633
3 0 A948323
1 A484666
5 0 A000023
8 0 A906633
1 A556633
And count them:
result.value_counts()
Output:
A556633 3
A484666 2
A000023 1
A194533 1
A906633 1
A948323 1
dtype: int64
Send the unique index into a list:
unique_list = [i[0] for i in result.value_counts().index.tolist()]
Output:
['A556633', 'A484666', 'A000023', 'A194533', 'A906633', 'A948323']
Value counts into a list:
unique_count_list = result.value_counts().values.tolist()
Output:
[3, 2, 1, 1, 1, 1]
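Since the question mentions the values living in an Excel file, a minimal end-to-end sketch might look like this (the file name, sheet and column name are placeholders, not from the original post):
import pandas as pd

# hypothetical workbook/column names -- adjust to your file
df = pd.read_excel('codes.xlsx', sheet_name=0)
matches = df['columnA'].astype(str).str.extractall(r'([A]\d{6})')[0]
print(matches.value_counts())
print(matches.unique().tolist())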

Removing empty words from column of tokenized sentences

I have a dataframe containing lists of words in each row in the same column. I'd like to remove what I guess are spaces. I managed to get rid of some by doing:
for i in processed.text:
    for x in i:
        if x == '' or x == " ":
            i.remove(x)
But some of them still remain.
>processed['text']
0 [have, month, #postdoc, within, on, chemical, ...
1 [hardworking, producers, iowa, so, for, state,...
2 [hardworking, producers, iowa, so, for, state,...
3 [today, time, is, to, sources, energy, much, p...
4 [thanks, gaetanos, club, c, oh, choosing, #rec...
...
130736 [gw, fossil, renewable, import, , , , , , , , ...
130737 [s, not, , go, ]
130738 [answer, deforestation, in, ]
130739 [plastic, regrind, any, and, grades, we, make,...
130740 [grid, generating, of, , , , gw]
Name: text, Length: 130741, dtype: object
>type(processed)
<class 'pandas.core.frame.DataFrame'>
Thank you very much.
Split on comma, remove empty values, and then combine again with comma:
def remove_empty(x):
    if type(x) is str:
        x = x.split(",")
        x = [y for y in x if y.strip()]
        return ",".join(x)
    elif type(x) is list:
        return [y for y in x if y.strip()]

processed['text'] = processed['text'].apply(remove_empty)
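As a quick usage check on a tiny sample (assuming every cell is already a list of tokens, so only the list branch matters):
import pandas as pd

processed = pd.DataFrame({'text': [['have', '', 'month', ' ', '#postdoc'],
                                   ['s', 'not', '', 'go', '']]})
processed['text'] = processed['text'].apply(remove_empty)
print(processed['text'].tolist())
# [['have', 'month', '#postdoc'], ['s', 'not', 'go']]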
You can use split(expand=True) to do that. Note: you don't have to specifically write split(' ', expand=True); by default it splits on whitespace. You can replace ' ' with anything. For example, if your words are separated with , or -, then you can use that separator to split the columns.
import pandas as pd
df = pd.DataFrame({'Col1':['This is a long sentence',
'This is another long sentence',
'This is short',
'This is medium length',
'Wow. Tiny',
'Petite',
'Ok']})
print (df)
df = df.Col1.str.split(' ',expand=True)
print (df)
The output of this will be:
Original dataframe:
Col1
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny
5 Petite
6 Ok
Dataframe split into columns
0 1 2 3 4
0 This is a long sentence
1 This is another long sentence
2 This is short None None
3 This is medium length
4 Wow. Tiny None None None
5 Petite None None None None
6 Ok None None None None
If you want to limit them to 3 columns only, then use n=2
df = df.Col1.str.split(' ',n = 2, expand=True)
The output will be:
0 1 2
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny None
5 Petite None None
6 Ok None None
If you want to rename the columns to be more specific, then you can add rename to the end like this.
df = df.Col1.str.split(' ',n = 2, expand=True).rename({0:'A',1:'B',2:'C'},axis=1)
A B C
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny None
5 Petite None None
6 Ok None None
In case you want to replace all the None with '' and also prefix the column names, you can do it as follows:
df = df.Col1.str.split(expand=True).add_prefix('Col').fillna('')
Col0 Col1 Col2 Col3 Col4
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny
5 Petite
6 Ok

How to remove strings between parentheses (or any char) in DataFrame?

I have a column of number strings that I want to change to type int, but first I need to remove the parentheses and the number inside them (it's just a multiplier for my application; this is how I get the data).
Here is the sample code.
import pandas as pd
voltages = ['0', '0', '0', '0', '0', '310.000 (31)', '300.000 (30)', '190.000 (19)', '0', '20.000 (2)']
df = pd.DataFrame(voltages, columns=['Voltage'])
df
Out [1]:
Voltage
0 0
1 0
2 0
3 0
4 0
5 310.000 (31)
6 300.000 (30)
7 190.000 (19)
8 0
9 20.000 (2)
How can I remove the substrings within the parenthesis? Is there a Pandas.series.str way to do it?
Use str.replace with regex:
df.Voltage.str.replace(r"\s\(.*", "", regex=True)
Out:
0 0
1 0
2 0
3 0
4 0
5 310.000
6 300.000
7 190.000
8 0
9 20.000
Name: Voltage, dtype: object
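If the end goal is an int column (as the question states), a short follow-up sketch could chain the conversion after the replace, assuming every remaining value is a whole number:
df['Voltage'] = (df.Voltage.str.replace(r"\s\(.*", "", regex=True)
                   .astype(float)
                   .astype(int))
print(df['Voltage'])  # now an integer column: 0, 0, 0, 0, 0, 310, 300, 190, 0, 20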
You can also use str.split()
df_2 = df['Voltage'].str.split(' ', n=0, expand=True).rename(columns={0: 'Voltage'})
df_2['Voltage'] = df_2['Voltage'].astype('float')
If you know the separating character will always be a space then the following is quite a neat way of doing it:
voltages = [i.rsplit(' ')[0] for i in voltages]
I think you could try this:
new_series = df['Voltage'].apply(lambda x:int(x.split('.')[0]))
df['Voltage'] = new_series
I hope it helps.
Hopefully, this will work for you:
idx = source_value.find(" (")
result = source_value if idx == -1 else source_value[:idx]
NOTE: the find function requires a string as source_value. But if you have parens in your value, I assume it is a string.
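To apply that idea to the whole column, you could wrap it in a small helper and use .apply; a minimal sketch (strip_parens is just an illustrative name):
def strip_parens(value):
    # keep everything before ' ('; if there is no ' (', keep the value unchanged
    idx = value.find(" (")
    return value if idx == -1 else value[:idx]

df['Voltage'] = df['Voltage'].apply(strip_parens)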

Python replace entire string if it begins with certain characters in dataframe

I have data that contains the string 'None ...' in random places. I am trying to replace a cell in the dataframe with an empty string only when it begins with 'None ...'. Here is what I tried, but I get errors like 'KeyError'.
df = pd.DataFrame({'id': [1,2,3,4,5],
'sub': ['None ... ','None ... test','math None ...','probability','chemistry']})
df.loc[df['sub'].str.replace('None ...','',1), 'sub'] = '' # getting key error
Output I am looking for (I need to replace the entire value in a cell if 'None ...' is the starting string; notice the 3rd row shouldn't be replaced because 'None ...' is not at the start):
id sub
1
2
3 math None ...
4 probability
5 chemistry
You can use the below to identify the cells to replace and then assign them an empty value:
df.loc[df['sub'].str.startswith("None"), 'sub'] = ""
df.head()
id sub
0 1
1 2
2 3 math None ...
3 4 probability
4 5 chemistry
You can simply replace 'None ...', and by using a regular expression you can apply this replacement only to strings that start with None.
df['sub'] = df['sub'].str.replace(r'^None \.\.\.*', '', n=1, regex=True)
the output looks like this:
id sub
0 1
1 2 test
2 3 math None ...
3 4 probability
4 5 chemistry
df['sub'] = df['sub'].str.replace(r'[\w\s]*?(None \.\.\.)[\s\w]*?', '', n=1, regex=True)
Out:
sub
id
1
2 test
3
4 probability
5 chemistry
Look at startswith; after we find the rows that need to be replaced, we use mask:
df['sub']=df['sub'].mask(df['sub'].str.startswith('None ... '),'')
df
Out[338]:
id sub
0 1
1 2
2 3 math None ...
3 4 probability
4 5 chemistry
First, you are using the replaced sub strings as an index for .loc, which is why you received the KeyError.
Second, you can do this with:
df['sub']=df['sub'].apply(lambda x: '' if x.find('None')==0 else x)
