Get rid of initial spaces at specific cells in Pandas - python

I am working with a large dataset (more than 2 million rows x 10 columns) that has a column of string values that were filled inconsistently. Some rows start and end with many space characters, while others don't.
What I have looks like this:
col1
0 (spaces)string(spaces)
1 (spaces)string(spaces)
2 string
3 string
4 (spaces)string(spaces)
I want to get rid of those spaces at the beginning and at the end and get something like this:
col1
0 string
1 string
2 string
3 string
4 string
Normally, for a small dataset, I would use a for loop (I know it's far from optimal), but that's not an option now, given the time it would take.
How can I use the power of pandas to avoid a for loop here?
Thanks!
edit: I can't just remove all whitespace, since the strings themselves contain spaces.

df['col1'] = df['col1'].apply(lambda x: x.strip())
might help (note the assignment back: strip returns new strings rather than modifying the column in place).
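For a frame with millions of rows, pandas' vectorized string methods are usually a better fit than apply, and they also pass missing values through safely. A minimal sketch of the same cleanup, on a hypothetical sample shaped like the question's data:

import pandas as pd

df = pd.DataFrame({'col1': ['   some string   ', 'other string', '  a b  ']})

# .str.strip() removes leading and trailing whitespace only;
# spaces inside each string are left untouched
df['col1'] = df['col1'].str.strip()
print(df['col1'].tolist())  # ['some string', 'other string', 'a b']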

Related

DataFrame and Python don't find the columns' names of a CSV file

I use DataFrames in Python to manipulate CSV files, so I use things like df['column_name'].
But df doesn't seem to find the column in the file (it raises a KeyError), even though the column really is there, and even after I checked the letters for typos.
So to make my program work and have the CSV file read by df and Python, I have to manually rename the column I want to manipulate before doing anything else.
To explain the situation: the files I manipulate are not mine, they're pregenerated, and Python apparently won't read them unless I change the column name, because everything works after changing it.
I hope that makes sense and that you'll be able to help!
Have you checked whether 'column_name' is capitalized the same way in the file and in your code? It sometimes raises an error when one is capitalized and the other isn't.
OK, I figured out what's happening, but I don't know how to deal with it:
From the file I want to manipulate, if I copy the column named "Time", I get:
(Time
)
I added the brackets just to show that it jumps to the next line, so it seems to be a problem in the original file: the column name literally has an "enter" (a line break) in it.
So in code, for example:
time = df['Time
']
It prevents the code from working.
I don't have any idea how to deal with it, and I don't think I can fix it by editing the column name in the file, because the files are pregenerated.
Have you checked for whitespace, like tabs or line breaks?
EDIT:
Now that you know the problem is a line break, and since other columns in the data frame may have the same problem, you can clean them all like this:
before:
import pandas as pd

df = pd.DataFrame([['A', 1],
                   ['B', 2],
                   ['C', 3],
                   ['D', 4],
                   ['E', 5]], columns=['column 1 \n', ' \n column2 \n'])
output:
column 1 \n \n column2 \n
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
After:
# clean the column names by stripping surrounding whitespace
new_columns = [i.strip() for i in df.columns]
df.columns = new_columns
column 1 column2
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
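The same cleanup can be done with the column Index's own string accessor, and it's often easiest to run it right after reading the file. A minimal sketch (the file name here is hypothetical):

import pandas as pd

df = pd.read_csv('pregenerated.csv')  # hypothetical file name
# the column Index supports .str too, so this strips spaces,
# tabs, and line breaks from every column name in one call
df.columns = df.columns.str.strip()
time = df['Time']  # now resolves without a KeyError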

Match all values in a str column of one dataframe with another dataframe's str column

I have two pandas dataframes:
Dataframe 1:
ITEM ID TEXT
1 some random words
2 another word
3 blah
4 random words
Dataframe 2:
INDEX INFO
1 random
3 blah
I would like to match the values from the INFO column (of dataframe 2) with the TEXT column of dataframe 1. If there is a match, I would like to see a new column with a "1".
Something like this:
ITEM ID TEXT MATCH
1 some random words 1
2 another word
3 blah 1
4 random words 1
I was able to create a match per value of the INFO column that I'm looking for with this line of code:
dataframe1.loc[dataframe1['TEXT'].str.contains('blah'), 'MATCH'] = '1'
However, my real dataframe 2 has 5000 rows, so I cannot copy-paste this manually for each value. Basically I'm looking for something like this:
dataframe1.loc[dataframe1['TEXT'].str.contains('Dataframe2[INFO]'), 'MATCH'] = '1'
I hope someone can help, thanks!
Give this a shot:
Code:
dfA['MATCH'] = dfA['TEXT'].apply(lambda x: min(len([y for y in dfB['INFO'] if y in x]), 1))
Output:
ITEM ID TEXT MATCH
0 1 some random words 1
1 2 another word 0
2 3 blah 1
3 4 random words 1
It's a 0 if it's not a match, but that's easy enough to weed out.
There may be a better / faster native solution, but it gets the job done by iterating over both the 'TEXT' column and the 'INFO'. Depending on your use case, it may be fast enough.
Looks like .map() in lieu of .apply() would work just as well. Could make a difference in timing, again, based on your use case.
Updated to take into account string contains instead of exact match...
You could get the unique values from the INFO column of the second dataframe, join them into a single regex pattern, and use Series.str.contains on the first dataframe's TEXT column (DataFrame.eval can express the same assignment, but plain str.contains is simpler and avoids eval's parser limitations):
pattern = '|'.join(df2['INFO'].unique())
df1['MATCH'] = df1['TEXT'].str.contains(pattern).astype(int)
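For reference, a short end-to-end sketch built from the question's sample data (dataframe names as in the question):

import re

import pandas as pd

dataframe1 = pd.DataFrame({'ITEM ID': [1, 2, 3, 4],
                           'TEXT': ['some random words', 'another word',
                                    'blah', 'random words']})
dataframe2 = pd.DataFrame({'INDEX': [1, 3], 'INFO': ['random', 'blah']})

# one alternation pattern from every INFO value; re.escape guards
# against values that happen to contain regex metacharacters
pattern = '|'.join(re.escape(v) for v in dataframe2['INFO'].unique())
dataframe1['MATCH'] = dataframe1['TEXT'].str.contains(pattern).astype(int)
print(dataframe1)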

Removing white space at the beginning of values in multiple columns

I found a solution to this:
df['Name'] = df['Name'].str.lstrip()
df['Parent'] = df['Parent'].str.lstrip()
I have this DataFrame df (there is a white space to the left of "A" and "C" in the second row, which doesn't show well here). I would like to remove that space.
Mark Name Parent age
10 A C 1
12 A C 2
13 B D 3
I tried
df['Name'].str.lstrip()
df['Parent'].str.lstrip()
then tried
df.to_excel('test.xlsx')
but the result in Excel still had the white spaces
I then tried defining another variable
x = df['Name'].str.lstrip()
x.to_excel('test.xlsx')
that worked fine in Excel, but it's a new DataFrame that only has the x column
I then tried repeating the same for 'Parent' and played around with joining multiple dataframes back onto the original, but I still couldn't get it to work, and that seems too convoluted anyway
Finally, even if my first attempt had worked, I would like to remove the white space in one go, without a separate call for each column name
You could try using
df['Name'] = df['Name'].str.replace(' ', '')
though this would delete all spaces, including the ones inside the strings.
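To handle several columns in one go, you can apply the vectorized lstrip to each selected column and assign the result back, which is what makes the change stick. A minimal sketch using the question's column names:

import pandas as pd

df = pd.DataFrame({'Mark': [10, 12, 13],
                   'Name': ['A', ' A', 'B'],
                   'Parent': ['C', ' C', 'D'],
                   'age': [1, 2, 3]})

cols = ['Name', 'Parent']
# Series.str.lstrip runs once per selected column; assigning
# back into df is what actually persists the change
df[cols] = df[cols].apply(lambda s: s.str.lstrip())
df.to_excel('test.xlsx')  # exported cells no longer start with spaces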

pandas str.extractall find unknown number of groups / regex

After some searching I seem to be coming up a bit blank. I'm also a total regex simpleton...
I have a csv file with data like this:
header1 header2
row1 "asdf (qwer) asdf"
row2 "asdf (hghg) asdf (lkjh)"
row3 "asdf (poiu) mkij (vbnc) yuwuiw (hjgk)"
I've put double quotes around the rows in header2 for clarity that it is one field.
I want to extract each occurrence of words between brackets (). There will be a least one occurrence per row, but I don't know ahead of time how many occurrences of bracketed words will appear in each line.
Using the wonderful https://www.regextester.com/, I think the regex I need is \(.*?\)
But I keep getting:
ValueError: pattern contains no capture groups
The code I used was:
pattern = r'\(.*?\)'
extracted = df.loc[:, 'header2'].str.extractall(pattern)
Any help appreciated. Thanks!
You need to include a capture group inside the parentheses. Also, when using extractall, I'd use unstack so it matches the structure of your DataFrame:
df.header2.str.extractall(r'\((.*?)\)').unstack()
0
match 0 1 2
0 qwer NaN NaN
1 hghg lkjh NaN
2 poiu vbnc hjgk
If you're concerned about performance, skip the pandas string operations and use re directly:
import re
pd.DataFrame([re.findall(r'\((.*?)\)', row) for row in df.header2])
0 1 2
0 qwer None None
1 hghg lkjh None
2 poiu vbnc hjgk
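If you'd rather get one list of matches per row instead of a wide frame, the extractall result can be grouped back onto the original row index. A small sketch under the same assumptions as above:

import pandas as pd

df = pd.DataFrame({'header2': ['asdf (qwer) asdf',
                               'asdf (hghg) asdf (lkjh)',
                               'asdf (poiu) mkij (vbnc) yuwuiw (hjgk)']})

# extractall yields one row per match with a (row, match) MultiIndex;
# grouping on the row level collapses the matches into lists
matches = (df['header2'].str.extractall(r'\((.*?)\)')[0]
           .groupby(level=0).agg(list))
print(matches.tolist())  # [['qwer'], ['hghg', 'lkjh'], ['poiu', 'vbnc', 'hjgk']]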

Concatenating a string with part of another string in Python

Problem:
I have a csv file with partial city names in one column (usually missing the first one or two letters), and another column in the same file that holds other information and often ends with the one or two missing letters.
e.g.
Column 1
w York
Column 3
word word Ne
My logic for approaching this problem is to keep a separate CSV file with valid city names and perform the Python version of a VLOOKUP before and after concatenation, so that it only concatenates when the value does not already match valid city data.
I am stuck on how to actually pull one or two characters from the end of a string in column 3 (a substring, repeated down the whole column) and merge them with the start of the string in column 1, but I already know how to execute the rest of my idea.
Here is a general script for concatenation using Pandas:
pd.concat([col1, col2.set_axis(col1.index[-len(col2):], inplace=False)], axis=1)
Would the addition of a -2 resolve the issue? i.e.
pd.concat([col1, col2.set_axis(col1.index[-len(col2)-2:], inplace=False)], axis=1)
Thank you!
If you decide on pandas, after loading the csv into a dataframe you can take the last one or two characters of column 3 and prepend them to column 2 like this (the .str slice already copes with strings shorter than two characters):
df_city_names['col3'].str[-2:] + df_city_names['col2']
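A short end-to-end sketch of that idea, with hypothetical column names and an inline stand-in for the valid-cities CSV, including the "only repair rows that aren't already valid" check the question describes:

import pandas as pd

df = pd.DataFrame({'city': ['w York', 'Boston'],
                   'info': ['word word Ne', 'other words']})
valid = {'New York', 'Boston'}  # would really come from the valid-cities CSV

# candidate repair: last two characters of info + the partial name
repaired = df['info'].str[-2:] + df['city']
# keep values that are already valid, otherwise use the repaired one
df['city'] = df['city'].where(df['city'].isin(valid), repaired)
print(df['city'].tolist())  # ['New York', 'Boston']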
