replace a string in entire dataframe from excel with value - python

I have this kind of data from excel
dminerals=pd.read_excel(datafile)
print(dminerals.head(5))
Then I replace the 'Tr' and NaN value using for loop with this script
for key, value in dminerals.iteritems():
    dminerals[key] = dminerals[key].replace(to_replace='Tr', value=int(1))
    dminerals[key] = dminerals[key].replace(to_replace=np.nan, value=int(0))
Then I print it again; it seems to work, so I print the dataframe dtypes. But the column still shows an object data type.
print(dminerals.head(5))
print(dminerals['C'].dtypes)
I tried using .astype to change one of the columns, ['C'], to integer, but the result is a ValueError:
dminerals['C'].astype(int)
ValueError: invalid literal for int() with base 10: 'tr'
I thought I had already changed the 'Tr' values in the dataframe into integers. Is there anything I missed in the process above? Please help, thank you in advance!

You are replacing Tr with 1; however, there is a lowercase tr that is not being replaced (this is what your ValueError is saying). Remember, Python is case sensitive. Also, using for loops is extremely inefficient; you might want to try the following line of code instead:
dminerals = dminerals.replace({'Tr': 1, 'tr': 1}).fillna(0)
I'm using fillna(), which is the better way to fill the null values with a specified value (0 in this case), instead of using replace.
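To sketch why the one-liner makes the later astype call succeed, here is a minimal, self-contained example; the column name 'C' comes from the question, but the values are made up:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Excel data (values are hypothetical)
dminerals = pd.DataFrame({"C": ["Tr", "5", np.nan, "tr"]})

# Replace both casings of "Tr" in one pass, then fill the NaNs with 0
dminerals = dminerals.replace({"Tr": 1, "tr": 1}).fillna(0)

# The cast now succeeds because no non-numeric text is left
dminerals["C"] = dminerals["C"].astype(int)
print(dminerals["C"].tolist())  # [1, 5, 0, 1]
```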

Related

Access strings as data frames and redefine the object iteratively in python loop

for i in dataframe_list:
    i = eval(i)
    for num in range(1, len(dataframe_list)):
        for column in [column for column in eval(i).columns if column not in eval(dataframe[num])]:
            eval(i) = eval(i).withcolumn(column, lit=none)
        for column in [column for column in dataframe.columns if column not in dataframe2]:
            eval(dataframe[num]) = eval(dataframe[num]).withcolumn(column, lit=none)
dataframe_list is a list of the names of the dataframes. The problem is that Python doesn't recognise the strings as the objects, so I use the eval() function. On the first withColumn line containing the eval(i).withcolumn statement, I get an error saying eval must contain a string, bytecode, or code object. To my knowledge, i is an element of the list and is clearly a string. Can anyone help me get this to work?
eval can't be used as the target of an assignment. I've tried exec() etc. How do I do this? I just need to iteratively redefine the dataframe, and I can't do that if there is an eval on each side.
Python says each i in dataframe_list is a pyspark dataframe, but when I run the code either i or dataframe_list[z] comes up as a str that has no attribute columns, even though z is the i+1 index for accessing the list entries... so if dataframe_list[i] is a df then dataframe_list[z] must also be one. Any ideas?
You have already done eval(i) in the outer for loop, so there is no need to do eval(i) in the nested for loop again. Also, I believe eval(dataframe[num]) should be eval(dataframe_list[num]).columns?
for i in dataframe_list:
    i = eval(i)
    for num in range(1, len(dataframe_list)):
        # now use i, not eval(i), like this...
        # i = i.withColumn(column, F.lit(None))
I decided to create another list with data frame types exclusively instead of evaluating the initial list.
So I added a loop:
for i in dataframestr:
    i = eval(i)
    newdataframelist.append(i)
This gets around the issue of eval.
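Another way to avoid eval() entirely is to keep the dataframes in a dict keyed by name, so a name string maps straight to its object. A minimal sketch in pandas (the question uses pyspark frames; the frame names and columns here are invented for illustration):

```python
import pandas as pd

# Hypothetical frames; in the question these would be pyspark DataFrames
frames = {
    "sales": pd.DataFrame({"a": [1], "b": [2]}),
    "costs": pd.DataFrame({"a": [3], "c": [4]}),
}

# Align every frame to the union of all columns, adding missing ones as null
all_cols = set().union(*(df.columns for df in frames.values()))
for name, df in frames.items():
    for col in all_cols - set(df.columns):
        df[col] = None  # null column, like withColumn(col, lit(None)) in pyspark
    frames[name] = df
```

Looking a frame up is then just `frames["sales"]`, with no string evaluation involved.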

.strip() with in-place solution not working

I'm trying to find a solution for stripping blank spaces from some strings in my DataFrame. I found this solution, where someone said this:
I agree with the other answers that there's no inplace parameter for
the strip function, as seen in the
documentation
for str.strip.
To add to that: I've found the str functions for pandas Series
usually used when selecting specific rows. Like
df[df['Name'].str.contains('69')]. I'd say this is a possible reason
that it doesn't have an inplace parameter -- it's not meant to be
completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative
indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters, and/or
we'll consistently get "last 5 characters" instead!
So, I have a list of DataFrames called 'dataframes'. On the first dataframe (which is dataframes[0]), I have a column named 'CNJ' with string values, some of them with a blank space in the end. For example:
Input:
dataframes[0]['cnj'][9]
Output:
'0100758-73.2019.5.01.0064 '
So, following the comment above, I did this:
Input:
dataframes[0]['cnj'] = dataframes[0]['cnj'].strip()
Then I get the following error:
AttributeError: 'Series' object has no attribute 'strip'
Since the solution given in the other topic worked, what am I doing wrong to get this error? It seemed to me that it shouldn't work because it's a Series, but it should get the same result as the one mentioned above (data['Name'] = data['Name'].str.strip().str[-5:]), right?
Use
dataframes[0]['cnj']=dataframes[0]['cnj'].str.strip()
or better yet, store the dataframe in a variable first:
df0=dataframes[0]
df0['cnj']=df0['cnj'].str.strip()
The code in the solution you posted uses the .str accessor:
data['Name'] = data['Name'].str.strip().str[-5:]
The Pandas Series object has no string or date manipulation methods of its own. These are exposed through the Series.str and Series.dt accessor objects.
The result of Series.str.strip() is a new Series. That's why .str[-5:] is needed to retrieve the last 5 characters; that result is a new Series again. The expression is equivalent to:
temp_series=data['Name'].str.strip()
data['Name'] = temp_series.str[-5:]
You could just apply a transformation function on the column values like this.
data["Name"] = data["Name"].apply(lambda x: str(x).strip()[-5:])
What you need is a series of strings without the trailing spaces; at least that's my understanding from your query. Use str.rstrip() on the column, which strips whitespace from the right-hand side only.
Note: strip() on its own is a method of plain string objects, not of Series, so the error you are getting is expected.
Refer to link , and try implementing str.rstrip() provided by pandas.
For str.strip() you can refer to this link, it works for me.
In your case, assuming the dataframe column is named s, you can use the code below (note the assignment back, since str.strip() returns a new Series rather than modifying in place):
df[s] = df[s].str.strip()
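Tying the answers together, a small runnable sketch; the column name cnj comes from the question, but the values are made up:

```python
import pandas as pd

dataframes = [pd.DataFrame({"cnj": ["0100758-73.2019.5.01.0064 ", " 123 "]})]

# .strip() is a plain Python string method; on a Series it lives
# on the .str accessor, and the result must be assigned back
dataframes[0]["cnj"] = dataframes[0]["cnj"].str.strip()
print(dataframes[0]["cnj"][0])  # '0100758-73.2019.5.01.0064'
```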

Sort dataframe by absolute value without changing value or adding column

I have a dataframe that's the result of importing a csv and then performing a few operations and adding a column that's the difference between two other columns (column 10 - column 9 let's say). I am trying to sort the dataframe by the absolute value of that difference column, without changing its value or adding another column.
I have seen this syntax over and over all over the internet, with indications that it was a success (accepted answers, comments saying "thanks, that worked", etc.). However, I get the error you see below:
df.sort_values(by='Difference', ascending=False, inplace=True, key=abs)
Error:
TypeError: sort_values() got an unexpected keyword argument 'key'
I'm not sure why the syntax that I see working for other people is not working for me. I have a lot more going on with the code and other dataframes, so I don't think it's a pandas import problem.
I have moved on and just made a new column that is the absolute value of the difference column and sorted by that, and exclude that column from my export to worksheet, but I really would like to know how to get it to work the other way. Any help is appreciated.
I'm using Python 3
df.loc[(df.c - df.b).abs().sort_values(ascending=False).index]
Sorting by the absolute difference between "c" and "b", without creating a new column.
I hope this is what you were looking for.
key is an optional argument that was only added to sort_values() in pandas 1.1.0, so the TypeError suggests you are running an older version. It accepts a function that is applied to the column (as a Series) before sorting. Check the pandas documentation for your installed version.
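A sketch of both routes on toy data (the key= argument requires pandas 1.1.0 or newer; the index-based fallback works on older versions too):

```python
import pandas as pd

df = pd.DataFrame({"Difference": [-10, 3, -2, 7]})

# pandas >= 1.1.0: key is applied to the column before comparing
by_key = df.sort_values(by="Difference", ascending=False, key=abs)

# Older pandas: sort the absolute values, then reindex the original frame
by_index = df.loc[df["Difference"].abs().sort_values(ascending=False).index]

print(by_key["Difference"].tolist())  # [-10, 7, 3, -2]
```

Either way the stored values are untouched; only the row order changes.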

Pandas string.contains doesn't work if searched string contains the substring at the beginning of the string

I'm using str.contains to search for rows where the column contains a particular string as a substring
df[df['col_name'].str.contains('find_this')]
This returns all the rows where 'find_this' is somewhere within the string. However, in the rare but important case where the string in df['col_name'] STARTS with 'find_this', this row is not returned by the above query.
str.contains() returns false where it should return true.
Any help would be greatly appreciated, thanks!
EDIT
I've added some example data as requested.
Image of dataframe.
I want to update the 'Eqvnt_id' column, so for example, the rows where column 'Course_ID' contains AAS 102 all have the same 'Eqvnt_id' value.
To do this I need to be able to search the strings in 'Course_ID' for 'AAS 102' in order to locate the appropriate rows. However, when I do this:
df[df['Course_ID'].str.contains('AAS 102')]
The row that has 'AAS 102 (ENGL 102, JST 102, REL 102)' does not appear in the query!
The datatypes are all objects. I've tried mapping them and applying them to string type, but it has had no effect on the success of the query.
The data from the image can be found at https://github.com/isaachowen/stackoverflowquestionfiles
TLDR: Experiment with pandas.Series.str.normalize(), trying different Unicode forms until the issue is solved. 'NFKC' worked for me.
The problem had to do with the format of the data in the column that I was doing the...
df['column'].str.contains('substring')
...operation on. Using the pandas.Series.str.normalize() function works (link here). Sometimes, under circumstances that I can't deliberately recreate, the strings would have '\xa0' and '\n' appended at the beginning or the end. This post helps address how to deal with that problem. Following it, I looped through every string column and changed the Unicode form until I found one that worked: 'NFKC'.
You can use pandas.Series.str.find() instead - it returns the index where the substring is found; if it's at the start, the returned index is 0. If the substring is not found, it returns -1.
df[df['col_name'].str.find('find_this') != -1]
Let me know if this helps!
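A sketch of the hidden-character failure mode described above, using a non-breaking space ('\xa0') as the culprit; the course strings are adapted from the question:

```python
import pandas as pd

# "AAS\xa0102" renders like "AAS 102", but the space is a non-breaking space
df = pd.DataFrame({"Course_ID": ["AAS\xa0102 (ENGL 102)", "AAS 102", "REL 108"]})

raw = df["Course_ID"].str.contains("AAS 102")  # misses the first row
fixed = df["Course_ID"].str.normalize("NFKC").str.contains("AAS 102")

print(raw.tolist())    # [False, True, False]
print(fixed.tolist())  # [True, True, False]
```

NFKC folds compatibility characters such as the non-breaking space into their plain equivalents, which is why that form fixed the query.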

Error passing Pandas Dataframe to Scikit Learn

I get the following error when passing a pandas dataframe to scikitlearn algorithms:
invalid literal for float(): 2.,3
How do I find the row or column with the problem in order to fix or eliminate it? Is there something like df[df.isnull().any(axis=1)] for a specific value (in my case I guess 2.,3)?
If you know which column it is, you can use
df[df.your_column == '2.,3']
then you'll get all rows where the specified column has the value 2.,3. The quotes are needed because 2.,3 is a string, not a number (an unquoted 2.,3 is not valid Python).
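If you don't know which value (or column) is broken, one way to locate every unparseable cell without guessing is pd.to_numeric with errors='coerce'; the column name and data below are made up:

```python
import pandas as pd

df = pd.DataFrame({"your_column": ["1.5", "2.,3", "4.0"]})

# Coercion turns anything that is not a valid number into NaN,
# so rows where coercion produced NaN are the problem rows
bad_rows = df[pd.to_numeric(df["your_column"], errors="coerce").isna()]
print(bad_rows)  # the row containing '2.,3'
```

Repeating this per column narrows the error down before handing the frame to scikit-learn.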
