I would like to remove the characters from the columns in a pandas dataframe. I have around 10 columns and each one contains characters. Please see the sample column below. The column type is string and I would like to remove the characters and convert the column into float.
10.2\I
10.1\Y
NAN
12.5\T
13.3\T
9.4\J
NAN
12.2\N
NAN
11.9\U
NAN
12.4\O
NAN
8.3\U
13.5\B
NAN
13.1\V
11.0\Q
11.0\X
8.200000000000001\U
NAN
13.1\T
8.1\O
9.4\N
I would like to remove the '\', all the alphabetic characters, and make it into a float. I don't want to change the NAN.
I used df['column name'] = df.str[:4] - it removes the characters from some of the cells but not all of them. Also, I am unable to convert into a float as I am getting an error.
df['column name'] = df.str[:4]
df['column name'].astype(float)
0 10.2
1 10.1
2 NaN
3 12.5
4 13.3
5 9.4\
6 8.3\
22 8.1\
27 9.4\
28 NaN
29 10.6
30 10.8
31 NaN
32 7.3\
33 9.8\
34 NaN
35 12.4
36 8.1\
It's still not converting the other cells.
I get an error when I try to convert it into a float:
ValueError: could not convert string to float: '10.2\I'
Two reasons I can see why your code is not working:
Using [:4] will not work for all values in your example since the number of digits before the decimal point (and apparently after it) varies.
In the df['column name'] = df.str[:4] assignment there needs to be the same column identifier on the right side of the equal sign.
Here is a solution with a sample dataframe I prepared with two abbreviated columns like in your example. It uses [:-2] to truncate each value from the right side and then replaces remaining N's with the original NAN's before converting to float.
import pandas as pd
col = pd.Series(["10.2\I","10.1\Y",'NAN','12.5\T'])
col2 = pd.Series(["11.0\Q","11.0\X",'NAN',r'8.200000000000001\U'])
df = pd.concat([col,col2],axis=1)
df.rename(columns={0:'col1',1:'col2'},inplace=True)
df
col1 col2
0 10.2\I 11.0\Q
1 10.1\Y 11.0\X
2 NAN NAN
3 12.5\T 8.200000000000001\U
#apply the conversion to all columns in the dataframe
for col in df:
    df[col] = df[col].str[:-2].replace('N','NAN').astype(float)
df
col1 col2
0 10.2 11.0
1 10.1 11.0
2 NaN NaN
3 12.5 8.2
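If the number of trailing characters ever varies, a pattern-based variation (my own sketch, not part of the answer above) is to pull out the leading number with a regular expression and let pd.to_numeric coerce everything else, starting again from the raw string columns:
#hedged alternative: extract the numeric prefix, coerce 'NAN' and misses to NaN
for col in df:
    numbers = df[col].str.extract(r'^(\d+\.?\d*)', expand=False)  #leading numeric part
    df[col] = pd.to_numeric(numbers, errors='coerce')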
For instance, the column I want to split here is duration. It has data points like 110 or 2 Seasons. I want to make a different column for seasons, and in place of the seasons in my current column it should say null, as this would change the type of the column from string to int.
screenshot of my data
I tried the split function, but that is for splitting within a data point, not for splitting different kinds of data points into separate columns.
I have tried to replicate a portion of your dataframe in order to provide the below solution - note that it will also change the np.NaN values to 'Null' as requested.
Creating the sample dataframe off of your screenshot:
import numpy as np
import pandas as pd

movies_dic = {'release_year': [2021,2020,2021,2021,2021,1940,2018,2008,2021],
              'duration': [np.NaN, 94, 108, 97, 104, 60, '4 Seasons', 90, '1 Season']}
stack_df = pd.DataFrame(movies_dic)
stack_df
The issue is likely that the 'duration' column is of object dtype - namely, it contains both string and integer values. I have made 2 small functions that make use of the data types and allocate each value to its respective column. The first takes all the 'string' rows and places them in the 'series_duration' column:
def series(x):
    if type(x) == str:
        return x
    else:
        return 'Null'
Then the movies function keeps the integer values (i.e. those without the word 'Season' in them) as is:
def movies(x):
    if type(x) == int:
        return x
    else:
        return 'Null'
stack_df['series_duration'] = stack_df['duration'].apply(lambda x: series(x))
stack_df['duration'] = stack_df['duration'].apply(lambda x: movies(x))
stack_df
release_year duration series_duration
0 2021 Null Null
1 2020 94 Null
2 2021 108 Null
3 2021 97 Null
4 2021 104 Null
5 1940 60 Null
6 2018 Null 4 Seasons
7 2008 90 Null
8 2021 Null 1 Season
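Starting from the same freshly built stack_df, the split can also be written with a boolean mask and Series.where instead of two element-wise functions - a sketch of mine rather than part of the answer above:
#mask the rows that hold text like '4 Seasons', then route values by the mask
is_text = stack_df['duration'].apply(lambda x: isinstance(x, str))
stack_df['series_duration'] = stack_df['duration'].where(is_text, 'Null')
stack_df['duration'] = stack_df['duration'].where(~is_text, 'Null').fillna('Null')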
I have created an example to give you some ideas about how to manage the problem.
First of all, I created a DF with ints, strings in the format 'X seasons', and negative numbers:
import numpy as np
import pandas as pd

data = [5,4,3,4,5,6,'4 seasons', -110, 10]
df = pd.DataFrame(data, columns=['Numbers'])
Then I created the following loop. What it does is create new columns depending on the format of the value (string or negative number), insert the values there, and turn the original value into NaN.
index = 0
for n in df['Numbers']:
    if type(n) == str:
        df.loc[index, 'Seasons'] = n
        df['Numbers'] = df['Numbers'].replace([n], np.nan)
    elif n < 0:
        df.loc[index, 'Negatives'] = n
        df['Numbers'] = df['Numbers'].replace([n], np.nan)
    index += 1
The output would be like this:
Numbers Seasons Negatives
0 5.0 NaN NaN
1 4.0 NaN NaN
2 3.0 NaN NaN
3 4.0 NaN NaN
4 5.0 NaN NaN
5 6.0 NaN NaN
6 NaN 4 seasons NaN
7 NaN NaN -110.0
8 10.0 NaN NaN
Then you can adjust this example as you want.
I want to merge/join two large dataframes, where the values in the key column of the dataframe on the right are assumed to be substrings of the left 'id' column.
For illustration purposes:
import pandas as pd
import numpy as np
df1=pd.DataFrame({'id':['abc','adcfek','acefeasdq'],'numbers':[1,2,np.nan],'add_info':[3123,np.nan,312441]})
df2=pd.DataFrame({'matching':['adc','fek','acefeasdq','abcef','acce','dcf'],'needed_info':[1,2,3,4,5,6],'other_info':[22,33,11,44,55,66]})
This is df1:
id numbers add_info
0 abc 1.0 3123.0
1 adcfek 2.0 NaN
2 acefeasdq NaN 312441.0
And this is df2:
matching needed_info other_info
0 adc 1 22
1 fek 2 33
2 acefeasdq 3 11
3 abcef 4 44
4 acce 5 55
5 dcf 6 66
And this is the desired output:
id numbers add_info needed_info other_info
0 abc 1.0 3123.0 NaN NaN
1 adcfek 2.0 NaN 2.0 33.0
2 adcfek 2.0 NaN 6.0 66.0
3 acefeasdq NaN 312441.0 3.0 11.0
So as described, I only want to merge the additional columns when the 'matching' value is a substring of the 'id' value. If it is the other way around, e.g. 'abc' is a substring of 'abcef', nothing should happen.
In my data, a lot of the matches between df1 and df2 are actually exact, like the 'acefeasdq' row. But there are cases where one 'id' contains multiple 'matching' values. For the moment it is OK-ish to ignore these cases, but I'd like to learn how I can tackle this issue. And additionally, is it possible to mark the rows that were merged based on substrings versus the rows that were merged exactly?
You can use pd.merge(how='cross') to create a dataframe containing all combinations of the rows. And then filter the dataframe using a boolean series:
df = pd.merge(df1, df2, how="cross")
include_row = df.apply(lambda row: row.matching in row.id, axis=1)
filtered = df.loc[include_row]
print(filtered)
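To also mark which rows were matched exactly and which only by substring (the last part of the question), one possible extension - a sketch, not tested against your full data - is to compare the two key columns after filtering:
#True where the match was exact, False where it was substring-only
filtered = filtered.copy()
filtered["exact_match"] = filtered["matching"] == filtered["id"]
print(filtered)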
Docs:
pd.merge
Indexing and selecting data
If your processing can handle a CROSS JOIN (problematic with large datasets), then you could cross join and then delete/filter only the rows you want:
cross = pd.merge(df1, df2, how='cross') #build all row combinations
map = cross.apply(lambda x: str(x['matching']) in str(x['id']), axis=1) #create map of booleans
final = cross[map] #get only those where condition was met
I have a simple problem that probably has a simple solution but I couldn't find it anywhere. I have the following MultiIndex column DataFrame:
mux = pd.MultiIndex.from_product([['A','B','C'], ['Datetime', 'Str', 'Ret']])
dfr = pd.DataFrame(columns=mux)
| A | B | C |
|Datetime|Str|Ret|Datetime|Str|Ret|Datetime|Str|Ret|
I need to add values one by one at the end of a specific sub-column. For example, add one value at the end of column A sub-column Datetime and leave the rest of the row as it is, then add another value to column B sub-column Str and again leave the rest of the values in the same row untouched, and so on. So my questions are: Is it possible to target individual locations in this type of DataFrame? How? And is it possible to append not a full row but an individual value, always at the end after the previous value, without knowing where that end is? Thank you so much for your answers.
IIUC, you can use .loc:
idx = len(dfr) # get the index of the next row after the last one
dfr.loc[idx, ('A', 'Datetime')] = pd.to_datetime('2021-09-24')
dfr.loc[idx, ('B', 'Str')] = 'Hello'
dfr.loc[idx, ('C', 'Ret')] = 4.3
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 00:00:00 NaN NaN NaN Hello NaN NaN NaN 4.3
Update
I mean, for example, when I have a different number of values in different columns (for example 6 values in column A-Str but only 4 in column B-Datetime), but I don't really know how many. In that case what I need is to add the next value in that column after the last one, so I need to know the index of the last non-NaN value of that particular column so I can use it in your answer. If I use len(dfr) while trying to add a value to the column that only has 4 values, it will end up in the 7th row instead of the 5th row, because one of the columns may have more values than the others.
You can do it easily using last_valid_index. Create a convenient function append_to_col to append values in place in your dataframe:
def append_to_col(col, val):
    idx = dfr[col].last_valid_index()
    dfr.loc[idx+1 if idx is not None else 0, col] = val
# Fill your dataframe
append_to_col(('A', 'Datetime'), '2021-09-24')
append_to_col(('A', 'Datetime'), '2021-09-25')
append_to_col(('B', 'Str'), 'Hello')
append_to_col(('C', 'Ret'), 4.3)
append_to_col(('C', 'Ret'), 8.2)
append_to_col(('A', 'Datetime'), '2021-09-26')
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 NaN NaN NaN Hello NaN NaN NaN 4.3
1 2021-09-25 NaN NaN NaN NaN NaN NaN NaN 8.2
2 2021-09-26 NaN NaN NaN NaN NaN NaN NaN NaN
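The values appended this way stay as plain strings; if the ('A', 'Datetime') sub-column should hold real timestamps, one optional follow-up (a sketch, not part of the answer above) is:
#convert the appended strings to datetime; empty cells become NaT
dfr[('A', 'Datetime')] = pd.to_datetime(dfr[('A', 'Datetime')])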
I have a dataframe 'DF', part of which looks like this:
I want to select only the values between 0 and 0.01 to form a new dataframe (with blanks where the value was over 0.01).
To do this, I tried:
similarity = []
for x in DF:
    similarity.append([DF[DF.between(0, 0.01).any(axis=1)]])
simdf = pd.DataFrame(similarity)
simdf.to_csv("similarity.csv")
However, I get the error AttributeError: 'DataFrame' object has no attribute 'between'.
How do I select a range of values and create a new dataframe from them?
Just do the two comparisons:
df_new = df[(df>0) & (df<0.01)]
Example:
import pandas as pd
df = pd.DataFrame({"a":[0,2,4,54,56,4],"b":[4,5,7,12,3,4]})
print(df[(df>5) & (df<33)])
a b
0 NaN NaN
1 NaN NaN
2 NaN 7.0
3 NaN 12.0
4 NaN NaN
5 NaN NaN
If you want a blank string instead of NaN:
df[(df>5) & (df<33)].fillna("")
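The same masking can also be written in one step with DataFrame.where, which keeps the values where the condition holds and fills the rest - an equivalent sketch, assuming the example df above has no NaN of its own:
df.where((df>5) & (df<33), "")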
I have a df which has NaN in it. When I run df.dropna() it drops all rows which have a NaN value, but when I try df.dropna(thresh=2), nothing happens and no row gets deleted. Can someone please explain to me why this is occurring?
This is how I have changed the values to NaN:
for col in df.columns:
    df.loc[df[col] == '?', col] = np.nan
The first pic shows the total number of rows and columns,
the second pic is after using df.dropna(),
and the third pic is after using df.dropna(thresh=2).
thresh=2 says the row must have at least 2 valid / non-NaN values, otherwise that row is deleted.
In the given screenshots there are 13 columns.
So, to remove rows which have more than 2 NaN, the threshold should be thresh=11.
In this way pandas will remove all the rows where it finds more than 2 NaN.
Hope this helps!
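A small sketch with toy data of my own (not the asker's dataframe) showing how thresh counts the non-NaN values per row:
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [1, np.nan, np.nan],
                     'b': [2, 3, np.nan],
                     'c': [4, np.nan, np.nan]})
print(demo.dropna(thresh=2)) #keeps only the first row, the only one with at least 2 non-NaN values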
As per my knowledge, thresh works on the number of valid values in a row: it only responds when a row has just the integer value (e.g. the id) left and the rest are NaN, and thresh is greater than 1. E.g.:
import pandas as pd

data = [{'id': 1, 'name': 'John'},
        {'id': 2, 'name': 'Aaron', 'phone': 43242123213, 'age': 32},
        {'id': 3, 'name': 'Alan'}]
df = pd.DataFrame(data)
OUTPUT:
age id name phone
0 NaN 1 John NaN
1 32.0 2 Aaron 4.324212e+10
2 NaN 3 Alan NaN
>>> df.dropna(thresh=2)
It will not drop anything here, but if I remove the name 'Alan' at index 2, it will respond and delete the 3rd row, as only the integer value id is left there.