Why numpy.place is not replacing empty strings - python

Hello, I have a dataframe as shown below:
import numpy as np
import pandas as pd

daf = pd.DataFrame({'A': [10, np.nan, 20, np.nan, 30]})
daf['B'] = ''
The above code creates a data frame whose column B holds empty strings:
A B
0 10.0
1 NaN
2 20.0
3 NaN
4 30.0
The problem here is that I need to replace a column consisting entirely of empty strings (note: the entire column must be empty) with the value provided as the last argument of numpy.place, here 1. So I used the following code:
np.place(daf.to_numpy(),((daf[['A','B']] == '').all() & (daf[['A','B']] == '')).to_numpy(),[1])
which did nothing; it gave the same output:
A B
0 10.0
1 NaN
2 20.0
3 NaN
4 30.0
But when I assign daf['B'] = np.nan, the code seems to work fine: it checks whether the entire column is null and then replaces it with 1. Here is the data frame:
A B
0 10.0 NaN
1 NaN NaN
2 20.0 NaN
3 NaN NaN
4 30.0 NaN
Replace those NaN with 1 where the entire column is NaN:
np.place(daf.to_numpy(),(daf[['A','B']].isnull() & daf[['A','B']].isnull().all()).to_numpy(),[1])
which gave the correct output:
A B
0 10.0 1.0
1 NaN 1.0
2 20.0 1.0
3 NaN 1.0
4 30.0 1.0
Can someone tell me how to make the replacement work with empty strings, and give a reason why it is not working with an empty string as input?

If I'm understanding your question correctly, you want to replace a column of empty strings with a column of 1s. This can be done with DataFrame.replace():
daf.replace('', 1.0)
A B
0 10.0 1.0
1 NaN 1.0
2 20.0 1.0
3 NaN 1.0
4 30.0 1.0
This function also works with regex if you want to be more granular with the replacement.
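For example, a regex-based variant (a hedged sketch; the pattern below also catches whitespace-only cells, which the literal '' match would miss):
daf.replace(r'^\s*$', 1.0, regex=True)
As for the why: daf.to_numpy() must build a new array when the frame has mixed dtypes (float column A, object column B), so np.place mutates that temporary copy and the DataFrame itself is never touched. With B all NaN the frame is a single float64 block, and to_numpy() can hand back a view of it, which is likely why the NaN version appeared to work in place.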

Related

Convert Pandas DataFrames with different columns into an iterable and reform as one DataFrame

How can I convert the three DataFrames a,b,c below into one DF with columns A,B,C,D?
I specifically want to gather the multiple DataFrames into one iterable (dict/list of dicts) before reconstituting them as one DF instead of appending or concatenating them.
My attempt:
import pandas as pd

a = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
b = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
c = pd.DataFrame({'B': [13, 14, 15], 'C': [16, 17, 18], 'D': [19, 20, 21]})
list_of_dicts = []  # can be a list of lists/dicts
for i in [a, b, c]:
    x = i.to_dict('split')
    list_of_dicts.append(x)
pd.DataFrame.from_records(list_of_dicts)
#Solved below. Credit to Eric Truett.
import pandas as pd
import itertools

a = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
b = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
c = pd.DataFrame({'B': [13, 14, 15], 'C': [16, 17, 18]})
list_of_dicts = []
for i in [a, b, c]:
    x = i.to_dict('records')
    list_of_dicts.append(x)
pd.DataFrame.from_records(list(itertools.chain.from_iterable(list_of_dicts)))
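The 'records' orient is what makes this work: each frame becomes a list of {column: value} dicts, chain flattens the three lists into one, and from_records aligns the union of keys into columns, filling gaps with NaN. For instance (a quick check using frame c above):
c.to_dict('records')
# [{'B': 13, 'C': 16}, {'B': 14, 'C': 17}, {'B': 15, 'C': 18}]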
To create a single dataframe from these three, you can use the concat() function in pandas:
a = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
b = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
c = pd.DataFrame({'B': [13, 14, 15], 'C': [16, 17, 18], 'D': [19, 20, 21]})
d = pd.concat([a, b, c])
print(d)
will give you:
A B C D
0 1.0 4 NaN NaN
1 2.0 5 NaN NaN
2 3.0 6 NaN NaN
0 7.0 10 NaN NaN
1 8.0 11 NaN NaN
2 9.0 12 NaN NaN
0 NaN 13 16.0 19.0
1 NaN 14 17.0 20.0
2 NaN 15 18.0 21.0
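Note that concat keeps each frame's original index, which is why 0, 1, 2 repeat above. If you want a continuous 0-8 index instead, concat accepts ignore_index:
d = pd.concat([a, b, c], ignore_index=True)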
I think you can use the append function to add up multiple DataFrame objects. In the code below I initialize a variable s that will be used to combine the a, b, c DataFrames:
import pandas as pd

a = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
b = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
c = pd.DataFrame({'B': [13, 14, 15], 'C': [16, 17, 18], 'D': [19, 20, 21]})
s = pd.DataFrame()
for i in [a, b, c]:
    s = s.append(i)
print(s)
Printing s would give the following output
A B C D
0 1.0 4 NaN NaN
1 2.0 5 NaN NaN
2 3.0 6 NaN NaN
0 7.0 10 NaN NaN
1 8.0 11 NaN NaN
2 9.0 12 NaN NaN
0 NaN 13 16.0 19.0
1 NaN 14 17.0 20.0
2 NaN 15 18.0 21.0
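Be aware that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions the same loop pattern collects the pieces and concatenates once at the end (an equivalent sketch, using the a, b, c frames above):
parts = []
for i in [a, b, c]:
    parts.append(i)
s = pd.concat(parts)
print(s)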

Pandas set all values after first NaN to NaN

For each row I would like to set all values to NaN after the appearance of the first NaN. E.g.:
a b c
1 2 3 4
2 nan 2 nan
3 3 nan 23
Should become this:
a b c
1 2 3 4
2 nan nan nan
3 3 nan nan
So far I only know how to do this with apply and a for loop over each column per row - it's very slow!
Check with cumprod: notna() marks the non-null cells, the row-wise cumulative product drops to 0 at the first NaN and stays there, and where() keeps only the cells where the product is still 1.
df=df.where(df.notna().cumprod(axis=1).eq(1))
a b c
1 2.0 3.0 4.0
2 NaN NaN NaN
3 3.0 NaN NaN
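To see the intermediate mask for the sample frame (a sketch assuming the df above):
mask = df.notna().cumprod(axis=1).eq(1)
# row 1: True   True   True    -> no NaN, everything kept
# row 2: False  False  False   -> the NaN in 'a' zeroes out the whole row
# row 3: True   False  False   -> kept only up to the first NaN in 'b'
df.where(mask)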

Cutting dataframe loop

I have a dataset which is only one column. I want to cut the column into multiple dataframes.
I use a for loop to build a list containing the positions at which I want to cut the dataframe.
import pandas as pd

df = pd.read_csv("column.csv", delimiter=";", header=0, index_col=0)
number_of_pixels = int(len(df.index))
print("You have " + str(number_of_pixels) + " pixels in your file")
number_of_rows = int(input("Enter number of rows you want to create"))
list = []  # this list contains the number of pixels per row
for i in range(0, number_of_rows):  # this loop fills the list with the number of pixels per row
    pixels_per_row = int(input("Enter number of pixels in row " + str(i)))
    list.append(pixels_per_row)
print(list)
After cutting the column into multiple dataframes I want to transpose each dataframe and concatenate them all back together using:
df1=df1.reset_index(drop=True)
df1=df1.T
df2=df2.reset_index(drop=True)
df2=df2.T
frames = [df1,df2]
result = pd.concat(frames, axis=0)
print(result)
So I want to create a loop that cuts my dataframe into multiple frames at the positions stored in my list.
Thank you!
This is a problem that is better solved with numpy. I'll start from the point where you have received the list from your user input. The whole point is to use numpy.split to separate the values based on the cumulative number of pixels requested, and then create a new DataFrame.
Setup
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({'val': np.random.randint(1,10,50)})
lst = [4,10,2,1,15,8,9,1]
Code
pd.DataFrame(np.split(df.val.values, np.cumsum(lst)[:-1]))
Output
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 3 3.0 7.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 4 7.0 2.0 1.0 2.0 1.0 1.0 4.0 5.0 1.0 NaN NaN NaN NaN NaN
2 1 5.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 8 4.0 3.0 5.0 8.0 3.0 5.0 9.0 1.0 8.0 4.0 5.0 7.0 2.0 6.0
5 7 3.0 2.0 9.0 4.0 6.0 1.0 3.0 NaN NaN NaN NaN NaN NaN NaN
6 7 3.0 5.0 5.0 7.0 4.0 1.0 7.0 5.0 NaN NaN NaN NaN NaN NaN
7 8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
If your list sums to more than the total number of rows in your initial DataFrame, you'll get extra all-NaN rows in your output. If your lst sums to less than the total number of pixels, the leftover values all end up in the last row. Since you didn't specify either of these conditions in your question, I'm not sure how you'd want to handle them.
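For reference, the split positions are just the cumulative sums of the requested row lengths, dropping the final total (a quick check with the example list above):
import numpy as np

lst = [4, 10, 2, 1, 15, 8, 9, 1]  # sums to 50, the number of rows in df
print(np.cumsum(lst)[:-1])        # [ 4 14 16 17 32 40 49] -> indices where np.split cuts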

Duplicated rows from .csv and count - python

I have to get the number of times each complete row appears in my data frame, then keep only the rows that appear repeated, and show in a last column how many times each of those rows appears.
Input values for creating the correct output table:
dur,wage1,wage2,wage3,cola,hours,pension,stby_pay,shift_diff,educ_allw,holidays,vacation,ldisab,dntl,ber,hplan,agr
2,4.5,4.0,?,?,40,?,?,2,no,10,below average,no,half,?,half,bad
2,2.0,2.0,?,none,40,none,?,?,no,11,average,yes,none,yes,full,bad
3,4.0,5.0,5.0,tc,?,empl_contr,?,?,?,12,generous,yes,none,yes,half,good
1,2.0,?,?,tc,40,ret_allw,4,0,no,11,generous,no,none,no,none,bad
1,6.0,?,?,?,38,?,8,3,?,9,generous,?,?,?,?,good
2,2.5,3.0,?,tcf,40,none,?,?,?,11,below average,?,?,yes,?,bad
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
1,2.1,?,?,tc,40,ret_allw,2,3,no,9,below average,yes,half,?,none,bad
1,2.8,?,?,none,38,empl_contr,2,3,no,9,below average,yes,half,?,none,bad
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.3,4.4,?,?,38,?,?,4,?,12,generous,?,full,?,full,good
1,2.8,?,?,?,35,?,?,2,?,12,below average,?,?,?,?,good
2,2.0,2.5,?,?,35,?,?,6,yes,12,average,?,?,?,?,good
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.5,4.0,?,none,40,?,?,4,?,12,average,yes,full,yes,half,good
3,3.5,4.0,4.6,none,36,?,?,3,?,13,generous,?,?,yes,full,good
3,3.7,4.0,5.0,tc,?,?,?,?,yes,?,?,?,?,yes,?,good
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
I just have to keep those rows that are totally equal.
This is the resulting table:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff num_reps
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN 4
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0 2
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0 3
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN 2
As you can see in this table, we keep for example the row with index 6 because rows 6 and 17 of the input table are identical.
With my current code:
from pandas import DataFrame

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    x = data[data.duplicated(keep=False)].drop_duplicates()
    return x
I get the rows correctly; however, I do not know how to count the repetitions and then add them in the 'num_reps' column at the end of the table.
This is my result, without the last column that counts the number of repeated rows:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
How can I perform a correct count based on the equality of all the values in a row, and then add it and show it in the 'num_reps' column?
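A minimal sketch of one way to get the count (an assumption-laden sketch: data is the parsed DataFrame, and the pandas version supports dropna=False on groupby, i.e. >= 1.1):
def detect_duplicates(data):
    # keep only rows that occur more than once
    dupes = data[data.duplicated(keep=False)]
    # group identical rows together and count the size of each group
    return (dupes.groupby(dupes.columns.tolist(), dropna=False)
                 .size()
                 .reset_index(name='num_reps'))
Note this returns a fresh 0-based index rather than the original row positions (6, 8, 9, 43) shown in the expected table.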

Delete rows in dataframe based on column values

I need to rid myself of all rows with a null value in column C. Here is the code:
import pandas as pd

infile = "C:\****"
df = pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is a NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan, 3],
                   [2, 3, 7, np.nan],
                   [4, 5, np.nan, 8],
                   [5, np.nan, 4, 9],
                   [np.nan, 1, 2, np.nan]],
                  columns=['A', 'B', 'C', 'D'])
df = df[df['C'].notnull()]
df
It's just proof that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
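If the filter still comes back empty on your real file, a quick diagnostic (a sketch, assuming the df read from your CSV) is to check what pandas actually parsed into column C:
print(df.dtypes)               # C should be float64 if the blanks were parsed as NaN
print(df['C'].isnull().sum())  # how many rows the filter would remove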
