Printing out the duplicate values in a specific column in a CSV - Python

I have a CSV file and I need to print the duplicate values in a column "hash". So I made this using pandas, but I'm not sure what it's printing; it seems like it's printing an entire row for each duplicate, which I don't need. I need only the duplicate values in that column, "hash".
the script I made:
import pandas as pd
df = pd.read_csv("combined_values.csv")
dups = df[df.duplicated("hash")]
print(dups)
I also tried this one, but it seems to print a True/False flag for every row of the "hash" column:
import pandas as pd
df = pd.read_csv("combined_values.csv")
dups = df["hash"].duplicated()
print(dups)

We can try with Series.duplicated to create a boolean index then use loc to select from the DataFrame where hash is duplicated, and just the hash column:
s = df.loc[df['hash'].duplicated(), 'hash']
We can set keep=False if all duplicates are wanted:
s = df.loc[df['hash'].duplicated(keep=False), 'hash']
With some sample data:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'hash': [4, 5, 4, 6]
})
s = df.loc[df['hash'].duplicated(), 'hash']
s:
2 4
Name: hash, dtype: int64
Or keeping all duplicates:
s = df.loc[df['hash'].duplicated(keep=False), 'hash']
s:
0 4
2 4
Name: hash, dtype: int64
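If only the distinct hash values that occur more than once are wanted (rather than one row per duplicate occurrence), a small sketch using Series.value_counts on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'hash': [4, 5, 4, 6]
})

# Count how often each hash occurs, then keep the values seen more than once
counts = df['hash'].value_counts()
dup_values = counts[counts > 1].index.tolist()
print(dup_values)  # [4]
```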

Related

Can I create column where each row is a running list in a Pandas data frame using groupby?

Imagine I have a Pandas DataFrame:
# create df
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'val': [5, 4, 6, 3, 2, 3]})
Let's assume it is ordered by 'id' and by an imaginary, not shown, date column (ascending).
I want to create another column where each row is the running list of 'val' values seen so far for that 'id'.
The ending DataFrame will look like this:
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'val': [5, 4, 6, 3, 2, 3],
                   'val_list': [[5], [5,4], [5,4,6], [3], [3,2], [3,2,3]]})
I don't want to use a loop because the actual df I am working with has about 4 million records. I imagine I would use a lambda function in conjunction with groupby, something like this:
df['val_list'] = df.groupby('id')['val'].apply(lambda x: x.runlist())
This raises an AttributeError because the runlist() method does not exist, but I am thinking the solution would be something like this.
Does anyone know what to do to solve this problem?
Let us try
df['new'] = df.val.map(lambda x : [x]).groupby(df.id).apply(lambda x : x.cumsum())
Out[138]:
0 [5]
1 [5, 4]
2 [5, 4, 6]
3 [3]
4 [3, 2]
5 [3, 2, 3]
Name: val, dtype: object
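On newer pandas versions, groupby(...).apply can prepend the group key to the result's index, which breaks direct assignment back into df. A sketch that builds the same running lists explicitly with itertools.accumulate, assuming (as the question does) that the frame is already sorted by 'id':

```python
import operator
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'val': [5, 4, 6, 3, 2, 3]})

# Wrap each value in a one-element list, then accumulate within each group;
# operator.add concatenates lists, so every row gets its own fresh list.
parts = [
    pd.Series(list(accumulate(([v] for v in grp), operator.add)), index=grp.index)
    for _, grp in df.groupby('id')['val']
]
df['val_list'] = pd.concat(parts)
print(df['val_list'].tolist())  # [[5], [5, 4], [5, 4, 6], [3], [3, 2], [3, 2, 3]]
```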

Comparing two data frames columns and assigning Zero and One

I have a dataframe and a list which includes some of the column names from my dataframe, as follows:
my_frame:
col1, col2, col3, ..., coln
2, 3, 4, ..., 2
5, 8, 5, ..., 1
6, 1, 8, ..., 9
my_list:
['col1','col3','coln']
Now, I want to create an array whose size is the total number of columns in my original dataframe, consisting only of zeros and ones: the array should contain 1 if the corresponding column name appears in "my_list", otherwise 0. My desired output would look like this:
my_array = [1,0,1,0,0,...,1]
This should help you:
import pandas as pd

dictt = {'a': [1, 2, 3],
         'b': [4, 5, 6],
         'c': [7, 8, 9]}
df = pd.DataFrame(dictt)
my_list = ['a', 'h', 'g', 'c']
my_array = []
for column in df.columns:
    if column in my_list:
        my_array.append(1)
    else:
        my_array.append(0)
print(my_array)
Output:
[1, 0, 1]
If you want my_array as a numpy array instead of a list, then use this:
import pandas as pd
import numpy as np

dictt = {'a': [1, 2, 3],
         'b': [4, 5, 6],
         'c': [7, 8, 9]}
df = pd.DataFrame(dictt)
my_list = ['a', 'h', 'g', 'c']
my_array = np.empty(0, dtype=int)
for column in df.columns:
    if column in my_list:
        my_array = np.append(my_array, 1)
    else:
        my_array = np.append(my_array, 0)
print(my_array)
Output:
[1 0 1]
I have used test data in my code for easier understanding. You can replace the test data with your actual data (i.e. replace my test dataframe with your actual dataframe). Hope this helps!
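The loop can also be replaced with a single vectorized call. A sketch with the same test data, using pandas' Index.isin:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
my_list = ['a', 'h', 'g', 'c']

# Index.isin yields a boolean mask over the column names;
# astype(int) turns True/False into 1/0
my_array = df.columns.isin(my_list).astype(int)
print(my_array)  # [1 0 1]
```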

Adding Numpy ndarray into dataframe

I would like to add a numpy array to each row in my dataframe:
I have a dataframe holding some data in each row, and now I'd like to add a new column which contains an n-element array.
for example:
Name, Years
Test, 2
Test2, 4
Now I'd like to add:
testarray1 = [100, 101, 1 , 0, 0, 5] as a new column='array' to Name='Test'
Name, Years, array
Test, 2, testarray1
Test2, 4, NaN
How can I do this?
import pandas as pd
import numpy as np

testarray1 = [100, 101, 1, 0, 0, 5]
d = {'Name': ['Test', 'Test2'],
     'Years': [2, 4]}
df = pd.DataFrame(d)                      # create a DataFrame of the data
df.set_index('Name', inplace=True)        # set the 'Name' column as the dataframe index
df['array'] = np.nan                      # create a new empty 'array' column (filled with NaNs)
df['array'] = df['array'].astype(object)  # convert it to an 'object' data type
df.at['Test', 'array'] = testarray1       # fill in the cell where index equals 'Test' and column equals 'array'
df.reset_index(inplace=True)              # if you don't want 'Name' to be the dataframe index
print(df)
Name Years array
0 Test 2 [100, 101, 1, 0, 0, 5]
1 Test2 4 NaN
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['test', 'test2'], 'year':[1,2]})
print(df)
x = np.arange(5)
df['array']=[x,np.nan]
print(df)
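If the 'Name' values are unique, another sketch is to key the arrays by name in a dict and let Series.map fill unmatched rows with NaN automatically:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Test', 'Test2'], 'Years': [2, 4]})
testarray1 = [100, 101, 1, 0, 0, 5]

# Rows whose Name is not a key in the dict get NaN
df['array'] = df['Name'].map({'Test': testarray1})
print(df)
```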

Get row index from DataFrame row

Is it possible to get the row number (i.e. "the ordinal position of the index value") of a DataFrame row without adding an extra row that contains the row number (the index can be arbitrary, i.e. even a MultiIndex)?
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
>>> result = df[df.a > 3]
>>> result.iloc[0]
a 4
Name: 2, dtype: int64
# but how can I get the original row index of iloc[0] in df?
I could have done df['row_index'] = range(len(df)) which would maintain the original row number, but I am wondering if Pandas has a built-in way of doing this.
Access the .name attribute and use get_loc:
In [10]:
df.index.get_loc(result.iloc[0].name)
Out[10]:
2
Looking at this from a different side:
for r in df.itertuples():
    print(r.Index)  # the index label of this row
Where df is the data frame. Maybe you want to add a conditional so you only grab the index when a condition is met.
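When the ordinal positions of all filtered rows are wanted at once, Index.get_indexer generalizes get_loc. A sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
result = df[df.a > 3]

# Map every label in the filtered frame back to its ordinal
# position in the original df, in one call
positions = df.index.get_indexer(result.index)
print(positions)  # [2 4 5]
```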

Can pandas.DataFrame have list type column?

Is it possible to create a pandas.DataFrame which includes a list-type field?
For example, I'd like to load the following csv to pandas.DataFrame:
id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"
Strip the double quotes:
id,scores
1, [1,2,3,4]
2, [1,2]
3, [0,2,4]
And you should be able to do this:
import pandas
query = [[1, [1,2,3,4]], [2, [1,2]], [3, [0,2,4]]]
df = pandas.DataFrame(query, columns=['id', 'scores'])
print(df)
You can use:
import pandas as pd
import io
temp=u'''id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"'''
df = pd.read_csv(io.StringIO(temp), sep=',', index_col=[0] )
print(df)
scores
id
1 [1,2,3,4]
2 [1,2]
3 [0,2,4]
But dtype of column scores is object, not list.
One approach use ast and converters:
import pandas as pd
import io
from ast import literal_eval
temp=u'''id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"'''
def converter(x):
    # parse the bracketed string into a Python list
    return literal_eval(x)

# define a converter for each column that needs one
converters = {'scores': converter}
df = pd.read_csv(io.StringIO(temp), sep=',', converters=converters)
print(df)
id scores
0 1 [1, 2, 3, 4]
1 2 [1, 2]
2 3 [0, 2, 4]
# check list membership:
print(2 in df.scores[2])
# True
print(1 in df.scores[2])
# False
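The wrapper function isn't strictly necessary: literal_eval itself can be passed as the converter. A condensed sketch with the same sample data:

```python
import io
from ast import literal_eval

import pandas as pd

temp = u'''id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"'''

# literal_eval parses each quoted "[...]" string into a real Python list
df = pd.read_csv(io.StringIO(temp), converters={'scores': literal_eval})
print(df['scores'].tolist())  # [[1, 2, 3, 4], [1, 2], [0, 2, 4]]
```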
