How to strip and split in pandas - python

Is there a way to split by newline and also strip whitespace in a single line?
This is how my df looks originally:
df["Source"]
0 test1 \n test2
1 test1 \n test2
2 test1 \ntest2
Name: Source, dtype: object
I used to split on the newline and create a list with the code below:
Data = (df["Source"].str.split("\n").to_list())
Data
[['test1 ', ' test2 '], [' test1 ', ' test2 '], [' test1 ', 'test2 ']]
I want to improve this further and remove any leading or trailing whitespace, but I am not sure how to combine split and strip in a single line.
df['Port']
0 443\n8080\n161
1 25
2 169
3 25
4 2014\n58
Name: Port, dtype: object
When I try to split it on the newline, I get NaN for the rows that do not contain \n:
df['Port'].str.split("\n").to_list()
[['443', '8080', '161'], nan, nan, nan, ['2014', '58']]
The same approach works perfectly for other columns:
df['Source Hostname']
0 test1\ntest2\ntest3
1 test5
2 test7\ntest8\n
3 test1
4 test2\ntest4
Name: Source Hostname, dtype: object
df["Source Hostname"].str.split('\n').apply(lambda z: [e.strip() for e in z]).tolist()
[['test1', 'test2', 'test3'], ['test5'], ['test7', 'test8', ''], ['test1'], ['test2', 'test4']]

df['Source'].str.split('\n').apply(lambda x: [e.strip() for e in x]).tolist()

Use Series.str.strip to remove leading and trailing whitespace, then split by the regex \s*\n\s*, which matches zero or more whitespace characters before and after \n:
import pandas as pd

df = pd.DataFrame({'Source': ['test1 \n test2 ',
                              ' test1 \n test2 ',
                              ' test1 \ntest2 ']})
print (df)
Source
0 test1 \n test2
1 test1 \n test2
2 test1 \ntest2
Data = (df["Source"].str.strip().str.split(r"\s*\n\s*").to_list())
print (Data)
[['test1', 'test2'], ['test1', 'test2'], ['test1', 'test2']]
Or, if it is acceptable to split on arbitrary whitespace (spaces or \n here):
Data = (df["Source"].str.strip().str.split().to_list())
print (Data)
[['test1', 'test2'], ['test1', 'test2'], ['test1', 'test2']]
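On the NaN values in the Port column from the question: a likely cause (an assumption, since the question does not show how the frame was built) is that the single-value rows hold integers rather than strings. The .str accessor returns NaN for non-string elements of an object column. A minimal sketch with made-up data that casts to str first:

```python
import pandas as pd

# Hypothetical data: strings mixed with plain integers, as might come from a spreadsheet import
ports = pd.Series(["443\n8080\n161", 25, 169, 25, "2014 \n58\n"], name="Port")

# Without the cast, the integer rows would come back as NaN
result = (
    ports.astype(str)           # make every element a string
    .str.strip()                # drop leading/trailing whitespace, including a final \n
    .str.split(r"\s*\n\s*")     # split on newline plus any surrounding whitespace
    .to_list()
)
print(result)
```

Note that astype(str) also turns genuinely missing values into the string 'nan', so rows that are really empty may need separate handling.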

Related

Extract a list of values from a column in a pandas dataframe

I’m trying to extract a list of values from a column in a dataframe.
For example:
# dataframe with "num_fruit" column
fruit_df = pd.DataFrame({"num_fruit": ['1 "Apple"',
'100 "Peach Juice3" 1234 "Not_fruit" 23 "Straw-berry" 2 "Orange"']})
# desired output: a list of values from the "num_fruit" column
[['1 "Apple"'],
['100 "Peach Juice3"', '1234 "Not_fruit"', '23 "Straw-berry"', '2 "Orange"']]
Any suggestions? Thanks a lot.
What I’ve tried:
import re

def split_fruit_val(val):
    return re.findall(r'(\d+ ".+")', val)

result_list = []
for val in fruit_df['num_fruit']:
    result = split_fruit_val(val)
    result_list.append(result)
print(result_list)
#output: some values were not split appropriately
[['1 "Apple"'],
['100 "Peach Juice3" 1234 "Not_fruit" 23 "Straw-berry" 2 "Orange"']]
Let's split on whitespace that is followed by a number, using a positive lookahead:
fruit_df['num_fruit'].str.split(r'\s(?=\d+)')
0 [1 "Apple"]
1 [100 "Peach Juice3", 1234 "Not_fruit", 23 "Str...
Name: num_fruit, dtype: object
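The attempt in the question fails because ".+" is greedy and runs to the last quote in the string. As an alternative sketch (not the answerer's method), the quoted part can be matched with [^"]* so that each match stops at the first closing quote. This assumes the quoted names never contain a double quote themselves:

```python
import pandas as pd

fruit_df = pd.DataFrame({"num_fruit": [
    '1 "Apple"',
    '100 "Peach Juice3" 1234 "Not_fruit" 23 "Straw-berry" 2 "Orange"',
]})

# A number, a space, then a quoted run containing no double quotes
pattern = r'\d+ "[^"]*"'

result_list = fruit_df["num_fruit"].str.findall(pattern).tolist()
print(result_list)
```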

Why doesn't my pandas DataFrame strip method work for trailing whitespace, and how do I fix it?

I have this code to strip whitespace from the dataframe
# create a dataframe with 3 columns
dataFrame = pd.DataFrame({
'Product Category': [' Computer', ' Mobile Phone', 'Electronics ', 'Appliances', ' Furniture', 'Stationery'],'Product Name': ['Keyboard', 'Charger', ' SmartTV', 'Refrigerators', ' Chairs', 'Diaries'],'Quantity': [10, 50, 10, 20, 25, 50]})
print ("Dataframe before removing whitespaces...\n",dataFrame)
# removing whitespace from more than 1 column
dataFrame['Product Category'].str.strip()
dataFrame['Product Name'].str.strip()
# dataframe
print ("Dataframe after removing whitespaces...\n",dataFrame)
The Dataframe before removing whitespace...
Product Category Product Name Quantity
0 Computer Keyboard 10
1 Mobile Phone Charger 50
2 Electronics SmartTV 10
3 Appliances Refrigerators 20
4 Furniture Chairs 25
5 Stationery Diaries 50
The Dataframe after removing whitespace...
Product Category Product Name Quantity
0 Computer Keyboard 10
1 Mobile Phone Charger 50
2 Electronics SmartTV 10
3 Appliances Refrigerators 20
4 Furniture Chairs 25
5 Stationery Diaries 50
The whitespace after "Electronics" is not stripped. Any ideas how I can fix this?
As Umar said, you need to assign the result back to the column, because str.strip() returns a new Series instead of modifying the column in place:
# create a dataframe with 3 columns
dataFrame = pd.DataFrame({
'Product Category': [' Computer', ' Mobile Phone', 'Electronics ', 'Appliances', ' Furniture', 'Stationery'],'Product Name': ['Keyboard', 'Charger', ' SmartTV', 'Refrigerators', ' Chairs', 'Diaries'],'Quantity': [10, 50, 10, 20, 25, 50]})
print ("Dataframe before removing whitespaces...\n",dataFrame)
# removing whitespace from more than 1 column
dataFrame['Product Category'] = dataFrame['Product Category'].str.strip()
dataFrame['Product Name'] = dataFrame['Product Name'].str.strip()
# dataframe
print ("Dataframe after removing whitespaces...\n",dataFrame)
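A shorter sketch of the same point, using a reduced version of the data above: calling str.strip() without assignment leaves the frame untouched, while assigning the result back updates it. The loop over column names is just one convenient way to handle several columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Product Category": [" Computer", "Electronics ", "Appliances"],
    "Product Name": ["Keyboard", " SmartTV", "Refrigerators"],
    "Quantity": [10, 10, 20],
})

# The result is discarded here, so the column keeps its whitespace
df["Product Category"].str.strip()
unchanged = df["Product Category"].tolist()

# Assigning the stripped Series back is what updates each column
for col in ["Product Category", "Product Name"]:
    df[col] = df[col].str.strip()

print(unchanged)
print(df["Product Category"].tolist())
```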

Remove newline characters from pandas series of lists

I have a pandas DataFrame that contains two columns, one of tags containing numbers and the other with a list containing string elements.
Dataframe:
df = pd.DataFrame({
'tags': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'elements': {
0: ['\n☒', '\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 '],
1: ['', ''],
2: ['\n', '\nFor the Fiscal Year Ended June 30, 2020'],
3: ['\n', '\n'],
4: ['\n', '\nOR']
}
})
I am trying to remove all instances of \n from every element of the lists in the elements column, but I'm really struggling to do so. My solution was to use a nested loop and re.sub() to replace them, but it has done nothing (granted, this is a horrible solution). This was my attempt:
for ls in range(len(page_table.elements)):
    for st in range(len(page_table.elements[i])):
        page_table.elements[i][st] = re.sub('\n', '', page_table.elements[i][st])
Is there a way to do this?
You can explode and then replace the \n values.
You can leave out the .groupby(level=0).agg(list) to not put them back into lists, though this will have a different shape to the original DataFrame.
df["elements"] = (
df["elements"]
.explode()
.str.replace(r"\n", "", regex=True)
.groupby(level=0)
.agg(list)
)
Which outputs:
0 [☒, ANNUAL REPORT PURSUANT TO SECTION 13 OR 15...
1 [, ]
2 [, For the Fiscal Year Ended June 30, 2020]
3 [, ]
4 [, OR]
Also possible:
df['elements'] = df['elements'].map(lambda x: [y.replace('\n', '') for y in x])
0 [☒, ANNUAL REPORT PURSUANT TO SECTION 13 OR 15...
1 [, ]
2 [, For the Fiscal Year Ended June 30, 2020]
3 [, ]
4 [, OR]

Python: how to check entries with white spaces in a dataframe?

I have a dataframe df containing the information of car brands. For instance,
df['Car_Brand'][1]
'HYUNDAI '
where every entry has the same length, len(df['Car_Brand'][1]) = 30. Some entries contain only white space.
df['Car_Brand']
0 TOYOTA
1 HYUNDAI
2
3
4
5 OPEL
6
7 JAGUAR
where
df['Car_Brand'][2]
' '
I would like to drop from the dataframe all the entries with white spaces and reduce the size of the others. Finally:
df['Car_Brand'][1]
'HYUNDAI '
becomes
df['Car_Brand'][1]
'HYUNDAI'
I started by removing the white spaces in this way:
tmp = df['Car_Brand'].str.replace(" ","")
Use str.strip, then convert the column to bool to filter out the empty strings:
df['Car_Brand'] = df['Car_Brand'].str.strip()
df[df['Car_Brand'].astype(bool)]
It seems you need:
s = df['Car_Brand']
s1 = s[s != ''].reset_index(drop=True)
#if the entries consist only of whitespace
#s1 = s[s.str.strip() != ''].reset_index(drop=True)
print (s1)
0 TOYOTA
1 HYUNDAI
2 OPEL
3 JAGUAR
Name: Car_Brand, dtype: object
If the entries consist only of whitespace:
s = df[~df['Car_Brand'].str.contains(r'^\s+$')]
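Putting both steps together, here is a minimal sketch that strips the padding and then drops the rows that end up empty. The sample frame is made up to mimic the 30-character padded entries described in the question.

```python
import pandas as pd

# Hypothetical data: every entry padded to 30 characters, some of them blank
brands = ["TOYOTA", "HYUNDAI", "", "", "", "OPEL", "", "JAGUAR"]
df = pd.DataFrame({"Car_Brand": [b.ljust(30) for b in brands]})

# Step 1: remove the padding (whitespace-only entries become '')
df["Car_Brand"] = df["Car_Brand"].str.strip()

# Step 2: keep only the non-empty rows and renumber the index
df = df[df["Car_Brand"] != ""].reset_index(drop=True)

print(df["Car_Brand"].tolist())
```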

Python String Padding

I am trying to output a string from a tuple with different widths between each element.
Here is the code I am using at the moment:
b = (tuple3[3] + ', ' + tuple3[4] + ' ' + tuple3[0] + ' '
     + tuple3[2] + ' ' + '£' + tuple3[1])
print(b)
Say for example I input these lines of text:
12345 1312 Teso Billy Jones
12344 30000 Test John M Smith
The output will be this:
Smith, John M 12344 Test £30000
Jones, Billy 12345 Teso £1312
How can I keep the padding consistent with larger spacing between the 3 parts?
Also, when I input these strings straight from a text file this is the output I receive:
Smith
, John M 12344 Test £30000
Jones, Billy 12345 Teso £1312
How can I resolve this?
Thanks a lot.
String formatting to the rescue!
lines_of_text = [
    (12345, 1312, 'Teso', 'Billy', 'Jones'),
    (12344, 30000, 'Test', 'John M', 'Smith')
]
for mytuple in lines_of_text:
    name = '{}, {}'.format(mytuple[4], mytuple[3])
    value = '£' + str(mytuple[1])
    print('{name:<20} {id:>8} {test:<12} {value:>8}'.format(
        name=name, id=mytuple[0], test=mytuple[2], value=value)
    )
results in
Jones, Billy 12345 Teso £1312
Smith, John M 12344 Test £30000
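As for the stray line break after "Smith" when reading from a file: each line read from a file keeps its trailing newline, so the last field ends in "\n" unless it is stripped. A minimal sketch, assuming the fields are tab-separated (the question does not say) and using io.StringIO as a stand-in for the real file; format_line is a hypothetical helper name:

```python
import io

# Stand-in for the text file; the tab separator is an assumption
fake_file = io.StringIO(
    "12345\t1312\tTeso\tBilly\tJones\n"
    "12344\t30000\tTest\tJohn M\tSmith\n"
)

def format_line(line):
    # Remove the trailing newline before splitting, so the last field stays clean
    ident, amount, test, first, last = line.rstrip("\n").split("\t")
    name = "{}, {}".format(last, first)
    value = "£" + amount
    return "{:<20} {:>8} {:<12} {:>8}".format(name, ident, test, value)

formatted = [format_line(line) for line in fake_file]
for row in formatted:
    print(row)
```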
