I found a solution to this:
df['Name'] = df['Name'].str.lstrip()
df['Parent'] = df['Parent'].str.lstrip()
I have this DataFrame df (there is a white space to the left of "A" and "C" in the second row, which doesn't show well here). I would like to remove that space.
Mark  Name  Parent  age
10    A     C       1
12    A     C       2
13    B     D       3
I tried
df['Name'].str.lstrip()
df['Parent'].str.lstrip()
then tried
df.to_excel('test.xlsx')
but the result in Excel didn't have the white spaces removed.
I then tried defining another variable
x = df['Name'].str.lstrip()
x.to_excel('test.xlsx')
that worked fine in Excel, but this is a new DataFrame, and it only had the x column.
I then tried repeating the same for 'Parent', and played around with joining multiple DataFrames back to the original, but I still couldn't get it to work, and that seems too convoluted anyway.
Finally, even if my first attempt had worked, I would like to be able to remove the white spaces in one go, without having to make a separate call for each column name.
You could try using
df['Name'] = df['Name'].str.replace(" ", "")
though this would delete all whitespace, including spaces inside values.
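For the "in one go" part of the question, here is a minimal sketch (column names taken from the question's DataFrame) that strips every string column at once instead of one call per column:

```python
import pandas as pd

# toy frame modeled on the question, with stray leading spaces in row 2
df = pd.DataFrame({'Mark': [10, 12, 13],
                   'Name': ['A', ' A', 'B'],
                   'Parent': ['C', ' C', 'D'],
                   'age': [1, 2, 3]})

# pick out the string (object) columns and strip them all in one pass
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.strip())

print(df['Name'].tolist())  # ['A', 'A', 'B']
```

After this, df.to_excel('test.xlsx') writes the cleaned values, since the stripped columns are assigned back to df.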
I have a dataframe with a column location which looks like this:
On the screenshot you see the case with 5 spaces in the location column, but there are a lot more cells with 3 and 4 spaces, while the most common case is just two spaces: between the city and the state, and between the state and the post code.
I need to perform str.split() on the location column, but due to the different number of spaces it will not work: if I substitute spaces with empty strings or commas, I'll get a different number of potential splits.
So I need to find a way to turn the spaces inside city names into hyphens, so that I can do the split later, but at the same time not touch the other spaces (between city and state, and between state and post code). Any ideas?
I have written the code below for easy understanding/readability. One way to solve the above query is to split the location column into city & state first, perform the operation on city, and merge it back with state.
import pandas as pd

df = pd.DataFrame({'location': ['Cape May Court House, NJ 08210',
                                'Van Buron Charter Township, MI 48111']})
df[['city', 'state']] = df['location'].str.split(",", expand=True)
df['city'] = df['city'].str.replace(" ", '_')
df['location_new'] = df['city'] + ',' + df['state']
df.head()
The final output will look like this, with the required output in the location_new column:
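The asker specifically wanted hyphens rather than underscores; a minimal variant of the same split-and-rejoin idea, splitting only on the first comma so everything before it counts as the city:

```python
import pandas as pd

df = pd.DataFrame({'location': ['Cape May Court House, NJ 08210',
                                'Van Buron Charter Township, MI 48111']})

# split once on the comma: the city is everything before it
df[['city', 'rest']] = df['location'].str.split(',', n=1, expand=True)

# hyphenate only the spaces inside the city name, leave the rest untouched
df['location_new'] = df['city'].str.replace(' ', '-') + ',' + df['rest']

print(df['location_new'].tolist())
# ['Cape-May-Court-House, NJ 08210', 'Van-Buron-Charter-Township, MI 48111']
```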
I have a large dataframe from a csv file which has a few dozen columns. I have another csv file with exactly the same structure which I concatenated to the original, but a particular column in the second file may have incorrect values. I want to delete rows which are duplicates apart from this one wrong column. For example, in the table below the last row should be removed. (The names of the specimens (Albert, etc.) are unique.) I have been struggling to find a way of deleting only the data which has the wrong value, without risking deleting the correct row.
0 Albert alive
1 Newton alive
2 Galileo alive
3 Copernicus dead
4 Galileo dead
...
Any help would be greatly appreciated!
You could use this to determine if a name is mentioned more than once:
df['RN'] = df.groupby(['Name']).cumcount() + 1
You can also expand the groupby to include more columns if there are any more constraints you want to put on what counts as a duplicate:
df['RN'] = df.groupby(['Name', 'Another Column']).cumcount() + 1
The advantage I like with this is that it gives you more control over the selection if you need it, e.g. df.loc[df['RN'] > 2].
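Putting the RN idea together for this question, here is a sketch assuming the file with the correct values was concatenated first, so the first occurrence of each name is the one to keep ('Status' is a hypothetical column name, since the question doesn't name it):

```python
import pandas as pd

# toy frame modeled on the question's data
df = pd.DataFrame({'Name': ['Albert', 'Newton', 'Galileo', 'Copernicus', 'Galileo'],
                   'Status': ['alive', 'alive', 'alive', 'dead', 'dead']})

# number each repeat of a name: the first occurrence gets RN == 1
df['RN'] = df.groupby(['Name']).cumcount() + 1

# keep only the first occurrence, then drop the helper column
cleaned = df.loc[df['RN'] == 1].drop(columns='RN')

print(cleaned['Name'].tolist())  # ['Albert', 'Newton', 'Galileo', 'Copernicus']
```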
I use DataFrame in Python in order to manipulate CSV files, so I use things like df['column_name'].
But df doesn't seem to find the column in the file (it raises a KeyError), even though the column really is there, and even though I checked the letters for typos.
So if I want my program to work, and my CSV file to be read by df and Python, I need to manually change the name of the column I want to manipulate before anything else.
To explain the situation: the files I manipulate are not mine, they're pregenerated, and it looks like Python doesn't want to read them unless I change the column name, because everything works after changing it.
I hope you have understood and that you'll be able to help me!
Have you checked whether 'column_name' is capitalized the same way in the file and in your code? It sometimes gives an error when one is in capitals and the other is in small letters.
OK, I figured the thing out but I don't know how to deal with it:
In the file I want to manipulate, if I copy the column named "Time", I get:
(Time
)
I added brackets just to show that it wraps to the next line, so it seems to be a problem with the original file: the column name literally has an "enter" in it.
So, for example, in code:
time = df['Time
']
It prevents the code from working.
I don't have any idea how to deal with it, and I don't think I can fix it by fixing the column's name in the file, because it is pregenerated.
Have you checked for white spaces, like tabs or line breaks?
EDIT:
Now that you know the problem is a line break, and that perhaps other columns in the data frame have the same problem, you can clean them all like this:
before:
df = pd.DataFrame([['A', 1],
                   ['B', 2],
                   ['C', 3],
                   ['D', 4],
                   ['E', 5]], columns=['column 1 \n', ' \n column2 \n'])
output:
column 1 \n \n column2 \n
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
After:
#cleaning the column names
new_columns = [i.strip() for i in df.columns]
df.columns = new_columns
column 1 column2
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
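The list comprehension above can also be written with the column Index's own vectorized string methods; a sketch with the same example columns, same result:

```python
import pandas as pd

df = pd.DataFrame([['A', 1], ['B', 2]],
                  columns=['column 1 \n', ' \n column2 \n'])

# the pandas Index supports .str accessors, so no loop is needed
df.columns = df.columns.str.strip()

print(list(df.columns))  # ['column 1', 'column2']
```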
Currently I have a data frame that looks like this:

name   value  position  length
table    5.0   1234567     .25
chair    8.0    789012       5
couch    6.0    345678       5
bed      5.3   1901234     .05
What I need to do first is edit the position column by adding a "+" before the tens place, so the first number should become 12345+67.
I think I would have to break up every number in position, measure its length, and then insert the "+" at length - 2?
Adding the "+" sign causes Excel to align the value left, so I also need to make sure it is aligned right.
I tried using df = df.style.set_properties(subset=["position"], **{'text-align': 'right'})
but this doesn't work because it appears I need columns that have a similar name.
What would be another way to get both of these complete?
Thank you in advance.
UPDATE
I was able to break the position column into two columns, add a third column with the "+" symbol, combine the three new columns to replace the position column, and lastly delete the helper columns, using the following:
df['col'] = df['position'].astype(str)
df['col1'] = df['col'].str[:-2]
df['col2'] = df['col'].str[-2:]
df['col3'] = '+'
df['position'] = df[['col1', 'col3', 'col2']].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
df = df.drop(["col", "col1", "col2", "col3"], axis=1)
The only thing left I need to do is be able to align the new value to the right because in excel it aligns left when I added the "+" sign
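For reference, the same insertion can be done without any helper columns; a minimal sketch, assuming every position value has more than two digits:

```python
import pandas as pd

df = pd.DataFrame({'position': [1234567, 789012, 345678, 1901234]})

# slice off the last two digits and rejoin with a '+' in between
pos = df['position'].astype(str)
df['position'] = pos.str[:-2] + '+' + pos.str[-2:]

print(df['position'].tolist())  # ['12345+67', '7890+12', '3456+78', '19012+34']
```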
Excel by default aligns numerical values to the right and text values to the left, so you won't be able to make the change in pandas. But when you write to Excel, you can modify the alignment after saving, using something like openpyxl.
see here for an example
Another option, if you want the cells to retain their numerical value, would be to leave them as numbers but format them to display with the '+'. You could do this by setting the number format to "#+##".
see here for number format codes documentation
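A sketch of both options with openpyxl (the cell addresses and file name here are illustrative; in practice you would loop over the real position column of the saved workbook):

```python
from openpyxl import Workbook
from openpyxl.styles import Alignment

wb = Workbook()
ws = wb.active

# option 1: a text value, which Excel would left-align, forced right
ws['A1'] = '12345+67'
ws['A1'].alignment = Alignment(horizontal='right')

# option 2: keep the numeric value and let a number format paint the '+'
ws['B1'] = 1234567
ws['B1'].number_format = '#+##'

wb.save('test.xlsx')
```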
To preface: I'm new to using Python.
I'm working on cleaning up a file where data was spread across multiple rows. I'm struggling to find a solution that will concatenate multiple text strings to a single cell. The .csv data looks similar to this:
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
with one or two blank rows between each entry, too.
The amount of rows used for 'description' isn't consistent. Sometimes it's just one cell, sometimes up to about four. The ideal output turns these multiple rows into a single row of useful data, without all the wasted space. I thought maybe I could create a series of masks by copying the data across a few columns, shifted up, and then iterating in some way. I haven't found a solution that matches what I'm trying to do, though. This is where I'm at so far:
#Add columns of description stuff and shift up a row for concatenation
DogData['Z'] = DogData['Y'].shift(-1)
DogData['AA'] = DogData['Z'].shift(-1)
DogData['AB'] = DogData['AA'].shift(-1)
#create series checks to determine how to concat values properly
YNAs = DogData['Y'].isnull()
ZNAs = DogData['Z'].isnull()
AANAs = DogData['AA'].isnull()
The idea here was basically that I'd iterate over column 'Y', check if the same row in column 'Z' was NA or had a value, and concat if it did. If not, just use the value in 'Y'. Carry that logic across but stopping if it encountered an NA in any subsequent columns. I can't figure out how to do that, or if there's a more efficient way to do this.
What do I have to do to get to my end result? I can't figure out the right way to iterate or concatenate in the way I was hoping to.
'''
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
'''
df = pd.read_clipboard(sep=',')
df.ffill().groupby([
    'name',
    'date'
]).description.apply(lambda x: ', '.join(x)).to_frame(name='description')
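The same approach as a self-contained script (using StringIO in place of the clipboard, so it runs on its own):

```python
import pandas as pd
from io import StringIO

csv = '''name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
'''

df = pd.read_csv(StringIO(csv))

# forward-fill the blank name/date cells, then glue each group's
# description lines into one comma-separated string
out = (df.ffill()
         .groupby(['name', 'date'])
         .description.apply(lambda x: ', '.join(x))
         .to_frame(name='description'))

print(out['description'].iloc[0])
# good dog, smells kind of weird, needs to be washed
```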
I'm not sure I follow exactly what you mean. I took that text, saved it as a csv file, and successfully read it into a pandas dataframe.
import pandas as pd
df = pd.read_csv('test.csv')
df
Output:
name date description
0 bundy 12-12-2017 good dog
1 NaN NaN smells kind of weird
2 NaN NaN needs to be washed
Isn't this the output you require?