I have a dataframe and a list that contains some of the column names from my dataframe, as follows:
my_frame:
col1, col2, col3, ..., coln
2, 3, 4, ..., 2
5, 8, 5, ..., 1
6, 1, 8, ..., 9
my_list:
['col1','col3','coln']
Now, I want to create an array the size of my original dataframe (its total number of columns) consisting only of zeros and ones. Basically, I want the array to contain 1 if the column name appears in "my_list", otherwise 0. My desired output should look like this:
my_array = [1, 0, 1, 0, 0, ..., 1]
This should help you:
import pandas as pd

dictt = {'a': [1, 2, 3],
         'b': [4, 5, 6],
         'c': [7, 8, 9]}

df = pd.DataFrame(dictt)
my_list = ['a', 'h', 'g', 'c']
my_array = []

for column in df.columns:
    if column in my_list:
        my_array.append(1)
    else:
        my_array.append(0)

print(my_array)
Output:
[1, 0, 1]
If you want my_array to be a NumPy array instead of a list, use this:
import pandas as pd
import numpy as np

dictt = {'a': [1, 2, 3],
         'b': [4, 5, 6],
         'c': [7, 8, 9]}

df = pd.DataFrame(dictt)
my_list = ['a', 'h', 'g', 'c']
my_array = np.empty(0, dtype=int)

for column in df.columns:
    if column in my_list:
        my_array = np.append(my_array, 1)
    else:
        my_array = np.append(my_array, 0)

print(my_array)
Output:
[1 0 1]
I have used test data in my code for easier understanding. You can replace the test data with your actual data (i.e. replace my test dataframe with your actual dataframe). Hope this helps!
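As a side note, the loop can usually be replaced by a vectorized one-liner with Index.isin; this is a small sketch using the same test data (it assumes the names in my_list are exact column-name matches):
import pandas as pd

dictt = {'a': [1, 2, 3],
         'b': [4, 5, 6],
         'c': [7, 8, 9]}

df = pd.DataFrame(dictt)
my_list = ['a', 'h', 'g', 'c']

# Index.isin returns a boolean array; casting to int gives the 0/1 mask
my_array = df.columns.isin(my_list).astype(int)
print(my_array)  # [1 0 1]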
I have a csv file and I need to print the duplicate values in a column "hash". So I made the script below using pandas, but I'm not sure what it's printing; it seems like it's printing an entire row of duplicates, which I don't need. I need only the duplicate values in the "hash" column.
The script I made:
import pandas as pd
df = pd.read_csv("combined_values.csv")
dups = df[df.duplicated("hash")]
print(dups)
I also tried this one, but it seems to print the whole "hash" column:
import pandas as pd
df = pd.read_csv("combined_values.csv")
dups = df["hash"].duplicated
print(dups)
We can use Series.duplicated to create a boolean mask, then use loc to select from the DataFrame the rows where hash is duplicated, keeping just the hash column:
s = df.loc[df['hash'].duplicated(), 'hash']
We can set keep=False if all duplicates are wanted:
s = df.loc[df['hash'].duplicated(keep=False), 'hash']
With some sample data:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'hash': [4, 5, 4, 6]
})
s = df.loc[df['hash'].duplicated(), 'hash']
s:
2 4
Name: hash, dtype: int64
Or keeping all duplicates:
s = df.loc[df['hash'].duplicated(keep=False), 'hash']
s:
0 4
2 4
Name: hash, dtype: int64
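If you only need each duplicated value once rather than every duplicated row, a possible follow-up (just a sketch, reusing the same sample data) is to take the unique values of that selection:
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'hash': [4, 5, 4, 6]
})

# unique() collapses the repeated occurrences down to one entry per value
dup_values = df.loc[df['hash'].duplicated(), 'hash'].unique()
print(dup_values)  # [4]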
I would like to add a numpy array to each row in my dataframe:
I have a dataframe holding some data in each row, and now I'd like to add a new column that contains an n-element array.
For example:
Name, Years
Test, 2
Test2, 4
Now I'd like to add:
testarray1 = [100, 101, 1 , 0, 0, 5] as a new column='array' to Name='Test'
Name, Years, array
Test, 2, testarray1
Test2, 4, NaN
How can I do this?
import pandas as pd
import numpy as np
testarray1 = [100, 101, 1 , 0, 0, 5]
d = {'Name': ['Test', 'Test2'],
     'Years': [2, 4]}
df = pd.DataFrame(d) # create a DataFrame of the data
df.set_index('Name', inplace=True) # set the 'Name' column as the dataframe index
df['array'] = np.NaN # create a new empty 'array' column (filled with NaNs)
df['array'] = df['array'].astype(object) # convert it to an 'object' data type
df.at['Test', 'array'] = testarray1 # fill in the cell where index equals 'Test' and column equals 'array'
df.reset_index(inplace=True) # if you don't want 'Name' to be the dataframe index
print(df)
    Name  Years                   array
0   Test      2  [100, 101, 1, 0, 0, 5]
1  Test2      4                     NaN
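A shorter alternative sketch, assuming you want to key the arrays by the 'Name' value, is to build a dict and use Series.map; names without an entry automatically become NaN:
import pandas as pd

testarray1 = [100, 101, 1, 0, 0, 5]
df = pd.DataFrame({'Name': ['Test', 'Test2'], 'Years': [2, 4]})

# map looks each Name up in the dict; missing keys yield NaN
arrays_by_name = {'Test': testarray1}
df['array'] = df['Name'].map(arrays_by_name)
print(df)
Note that testarray1 here is a plain list; wrap it in np.array(testarray1) first if you specifically need an ndarray in the cell.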
Try this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['test', 'test2'], 'year': [1, 2]})
print(df)

x = np.arange(5)
# one entry per row: an array for the first row, NaN for the second
df['array'] = [x, np.nan]
print(df)
I have a pandas dataframe with hundreds of entries and an array of random index values. For example:
import pandas as pd
list1 = [13,2,32,34,15,7,19]
list2 = [15,65,95,9,90,88,10]
df1 = pd.DataFrame(list1)
df2 = pd.DataFrame(list2)
cols = [df1, df2]
df1.loc[:, cols]
and I have another array called
M =[1, 2, 5, 6, 9]
These are the indexes of the rows I want from the dataframe. Is there a way to create a new table that picks out only the rows whose index matches the values in the array M?
import pandas as pd
list1 = [13,2,32,34,15,7,19]
df1 = pd.DataFrame(list1)
M =[1, 2, 5, 6]
df1[df1.index.isin(M)]
Note that in your problem statement, cols is a list of dataframes, not a two-column dataframe; I am not sure whether that was intentional in your code and question.
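As a side note, if M holds positional row numbers rather than index labels (an assumption on my part), iloc does the same selection by position:
import pandas as pd

list1 = [13, 2, 32, 34, 15, 7, 19]
df1 = pd.DataFrame(list1)
M = [1, 2, 5, 6]

# iloc selects by integer position; with the default RangeIndex this
# returns the same rows as the isin-on-index approach above
print(df1.iloc[M])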
I would like to assign a constant numpy array value to a pandas dataframe column.
Here is what I tried:
import pandas as pd
import numpy as np
my_df = pd.DataFrame({'col_1': [1,2,3], 'col_2': [4,5,6]})
my_df['new'] = np.array([]) # did not work
my_df['new'] = np.array([])*len(my_df) # did not work
Here is what worked:
my_df['new'] = my_df['new'].apply(lambda x: np.array([]))
I am curious why this works with a simple scalar but does not work with a numpy array. Is there a simpler way to assign a numpy array value?
Your "new" column will contain arrays, so it must be an object-dtype column.
The simplest way to initialize it is:
my_df = pd.DataFrame({'col_1': [1, 2, 3], 'col_2': [4, 5, 6]})
my_df['new'] = None
You can then fill it as you want. For example:
for index, row in my_df.iterrows():
    # .at sets a single cell, so the whole array lands in one cell
    my_df.at[index, 'new'] = np.arange(row['col_1'], row['col_2'])

#    col_1  col_2        new
# 0      1      4  [1, 2, 3]
# 1      2      5  [2, 3, 4]
# 2      3      6  [3, 4, 5]
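To address the original question about assigning a constant array more directly, one hedged sketch is to repeat the array once per row in a plain list, so pandas stores each element as a single cell value:
import pandas as pd
import numpy as np

my_df = pd.DataFrame({'col_1': [1, 2, 3], 'col_2': [4, 5, 6]})

# one element per row; note that every cell refers to the same array object
my_df['new'] = [np.array([])] * len(my_df)
print(my_df)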
Is it possible to get the row number (i.e. "the ordinal position of the index value") of a DataFrame row without adding an extra column that contains the row number (the index can be arbitrary, i.e. even a MultiIndex)?
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
>>> result = df[df.a > 3]
>>> result.iloc[0]
a 4
Name: 2, dtype: int64
# but how can I get the original row index of iloc[0] in df?
I could have done df['row_index'] = range(len(df)) which would maintain the original row number, but I am wondering if Pandas has a built-in way of doing this.
Access the .name attribute and use get_loc:
In [10]:
df.index.get_loc(result.iloc[0].name)
Out[10]:
2
Looking at this from a different side:
for r in df.itertuples():
    print(getattr(r, 'Index'))
Here df is the dataframe. You may want to add a conditional so the index is captured only when a condition is met.
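If you need the ordinal positions of several filtered rows at once, a small sketch using Index.get_indexer (the multi-row counterpart of get_loc) could look like this:
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
result = df[df.a > 3]

# positional row numbers in df of every row that survived the filter
positions = df.index.get_indexer(result.index)
print(positions)  # [2 4 5]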