I have a CSV file and I need to print the duplicate values in a column "hash". I wrote this using pandas, but I'm not sure what it's printing. It seems to print the entire row for each duplicate, which I don't need; I only need the duplicate values in the "hash" column.
The script I made:
import pandas as pd
df = pd.read_csv("combined_values.csv")
dups = df[df.duplicated("hash")]
print(dups)
I also tried this one, but it seems to print the whole "hash" column:
import pandas as pd
df = pd.read_csv("combined_values.csv")
dups = df["hash"].duplicated
print(dups)
We can use Series.duplicated to create a boolean mask, then use loc to select only the hash column for the rows where hash is duplicated. By default (keep='first') the mask marks every occurrence after the first:
s = df.loc[df['hash'].duplicated(), 'hash']
We can set keep=False if all occurrences of a duplicated value are wanted, including the first:
s = df.loc[df['hash'].duplicated(keep=False), 'hash']
With some sample data:
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 3, 4],
'b': [5, 6, 7, 8],
'hash': [4, 5, 4, 6]
})
s = df.loc[df['hash'].duplicated(), 'hash']
s:
2    4
Name: hash, dtype: int64
Or keeping all duplicates:
s = df.loc[df['hash'].duplicated(keep=False), 'hash']
s:
0    4
2    4
Name: hash, dtype: int64
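As a side note on the second attempt in the question: df["hash"].duplicated without parentheses is the method object itself, so print shows the bound method together with the whole column instead of calling it. The sketch below, which is not part of the answer above, reuses the same sample data to get each duplicated value once and to count how often it occurs. The names dup_values and dup_counts are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'hash': [4, 5, 4, 6]
})

# keep=False marks every row whose hash occurs more than once
mask = df['hash'].duplicated(keep=False)

# each duplicated value listed once (illustrative name)
dup_values = df.loc[mask, 'hash'].unique().tolist()
print(dup_values)

# how many times each duplicated value occurs (illustrative name)
counts = df['hash'].value_counts()
dup_counts = counts[counts > 1]
print(dup_counts)
```

For a real file, the DataFrame would come from pd.read_csv("combined_values.csv") as in the question.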
Related
Imagine I have a Pandas DataFrame:
# create df
df = pd.DataFrame({'id': [1,1,1,2,2,2],
'val': [5,4,6,3,2,3]})
Let's assume it is ordered by 'id' and by a date column (not shown), both ascending.
I want to create another column where each row is a list of 'val' at that date.
The ending DataFrame will look like this:
df = pd.DataFrame({'id': [1,1,1,2,2,2],
'val': [5,4,6,3,2,3],
'val_list': [[5],[5,4],[5,4,6],[3],[3,2],[3,2,3]]})
I don't want to use a loop because the actual df I am working with has about 4 million records. I am imagining I would use a lambda function in conjunction with groupby (something like this):
df['val_list'] = df.groupby('id')['val'].apply(lambda x: x.runlist())
This raises an AttributeError because the runlist() method does not exist, but I think the solution would look something like this.
Does anyone know what to do to solve this problem?
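Let us try (group_keys=False keeps the original row index, so the result aligns with df on assignment in recent pandas versions):
df['new'] = df.val.map(lambda x: [x]).groupby(df.id, group_keys=False).apply(lambda x: x.cumsum())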
Out[138]:
0 [5]
1 [5, 4]
2 [5, 4, 6]
3 [3]
4 [3, 2]
5 [3, 2, 3]
Name: val, dtype: object
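An alternative sketch that does not rely on cumsum over list objects: build the running lists per group with itertools.accumulate and put them back on the original index. This is a swapped-in technique, not the method from the answer above; the helper name running_lists is hypothetical, and the rows are assumed to be already in the desired order within each id.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'val': [5, 4, 6, 3, 2, 3]})


def running_lists(values):
    """Return [[v0], [v0, v1], ...] for an iterable of values (hypothetical helper)."""
    # acc + item builds a new list each step, so earlier results are not mutated
    return list(accumulate(([v] for v in values), lambda acc, item: acc + item))


parts = []
for _, group in df.groupby('id', sort=False)['val']:
    # keep the group's original index so the pieces align with df
    parts.append(pd.Series(running_lists(group), index=group.index))

df['val_list'] = pd.concat(parts)
print(df)
```

Either way, a group of n rows ends up storing n(n+1)/2 values in total, so memory use can grow quickly when groups are large.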
I have a dataframe and a list containing some of the dataframe's column names, as follows:
my_frame:
col1, col2, col3, ..., coln
2, 3, 4, ..., 2
5, 8, 5, ..., 1
6, 1, 8, ..., 9
my_list:
['col1','col3','coln']
Now I want to create an array with one entry per column of my original dataframe, consisting only of zeros and ones. The array should contain 1 where the column name appears in "my_list" and 0 otherwise. My desired output looks like this:
my_array={[1,0,1,0,0,...,1]}
This should help you:
import pandas as pd
dictt = {'a':[1,2,3],
'b':[4,5,6],
'c':[7,8,9]}
df = pd.DataFrame(dictt)
my_list = ['a','h','g','c']
my_array = []
for column in df.columns:
    if column in my_list:
        my_array.append(1)
    else:
        my_array.append(0)
print(my_array)
Output:
[1, 0, 1]
If you want my_array to be a NumPy array instead of a list, use this:
import pandas as pd
import numpy as np
dictt = {'a':[1,2,3],
'b':[4,5,6],
'c':[7,8,9]}
df = pd.DataFrame(dictt)
my_list = ['a','h','g','c']
my_array = np.empty(0,dtype = int)
for column in df.columns:
    if column in my_list:
        my_array = np.append(my_array,1)
    else:
        my_array = np.append(my_array,0)
print(my_array)
Output:
[1 0 1]
I have used test data in my code to make it easier to follow. You can replace the test dataframe with your actual dataframe. Hope this helps!
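The same result can also be obtained without an explicit loop. A minimal sketch, using the same test data as above: Index.isin returns a boolean NumPy array with one entry per column, which can then be cast to integers.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [4, 5, 6],
                   'c': [7, 8, 9]})
my_list = ['a', 'h', 'g', 'c']

# True where the column name is in my_list, then cast True/False to 1/0
my_array = df.columns.isin(my_list).astype(int)
print(my_array)  # [1 0 1]
```

Names in my_list that are not columns of the dataframe ('h' and 'g' here) are simply ignored.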
I would like to add a numpy array to each row in my dataframe:
I have a dataframe holding some data in each row, and now I would like to add a new column which contains an n-element array.
For example:
Name, Years
Test, 2
Test2, 4
Now I would like to add:
testarray1 = [100, 101, 1 , 0, 0, 5] as a new column='array' to Name='Test'
Name, Years, array
Test, 2, testarray1
Test2, 4, NaN
How can I do this?
import pandas as pd
import numpy as np
testarray1 = [100, 101, 1 , 0, 0, 5]
d = {'Name':['Test', 'Test2'],
'Years': [2, 4]
}
df = pd.DataFrame(d) # create a DataFrame of the data
df.set_index('Name', inplace=True) # set the 'Name' column as the dataframe index
df['array'] = np.nan # create a new empty 'array' column (filled with NaNs); the np.NaN alias was removed in NumPy 2.0
df['array'] = df['array'].astype(object) # convert it to an 'object' data type
df.at['Test', 'array'] = testarray1 # fill in the cell where index equals 'Test' and column equals 'array'
df.reset_index(inplace=True) # if you don't want 'Name' to be the dataframe index
print(df)
    Name  Years                   array
0   Test      2  [100, 101, 1, 0, 0, 5]
1  Test2      4                     NaN
Try this
import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['test', 'test2'], 'year':[1,2]})
print(df)
x = np.arange(5)
df['array']=[x,np.nan]
print(df)
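If the arrays are keyed by name, another possible sketch is to build the whole column in one pass from a lookup dictionary, leaving NaN where no array exists. The dictionary name arrays_by_name is hypothetical and not from the answers above; the values are plain lists, as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Test', 'Test2'], 'Years': [2, 4]})

# hypothetical lookup: which array belongs to which Name
arrays_by_name = {'Test': [100, 101, 1, 0, 0, 5]}

# one cell per row: the matching list, or NaN when the name has no array
df['array'] = [arrays_by_name.get(name, np.nan) for name in df['Name']]
print(df)
```

The resulting column has object dtype, so each cell can hold a list of a different length.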
Is it possible to get the row number (i.e. "the ordinal position of the index value") of a DataFrame row without adding an extra row that contains the row number (the index can be arbitrary, i.e. even a MultiIndex)?
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
>>> result = df[df.a > 3]
>>> result.iloc[0]
a    4
Name: 2, dtype: int64
# but how can I get the original row index of iloc[0] in df?
I could have done df['row_index'] = range(len(df)) which would maintain the original row number, but I am wondering if Pandas has a built-in way of doing this.
Access the .name attribute and use get_loc:
In [10]:
df.index.get_loc(result.iloc[0].name)
Out[10]:
2
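To get the positions of all filtered rows at once, a minimal sketch under two assumptions: the index labels are unique (Index.get_indexer raises an error otherwise), and the string index used here is only for illustration. np.flatnonzero on the boolean mask gives the same positions without consulting the index at all.

```python
import numpy as np
import pandas as pd

# illustrative non-default index, so labels and positions differ
df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]}, index=list('uvwxyz'))
result = df[df['a'] > 3]

# positions of result's labels within df's index (requires unique labels)
positions = df.index.get_indexer(result.index)
print(positions)

# same positions taken directly from the boolean mask
mask_positions = np.flatnonzero((df['a'] > 3).to_numpy())
print(mask_positions)
```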
Looking at this from a different angle:
for r in df.itertuples():
    getattr(r, 'Index')  # index label of the current row
Here df is the data frame. You may want to add a conditional inside the loop to get the index only when a condition is met.
Is it possible to create a pandas.DataFrame that includes a list-type field?
For example, I'd like to load the following csv to pandas.DataFrame:
id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"
Strip the double quotes:
id,scores
1, [1,2,3,4]
2, [1,2]
3, [0,2,4]
And you should be able to do this:
import pandas
query = [[1, [1,2,3,4]], [2, [1,2]], [3, [0,2,4]]]
df = pandas.DataFrame(query, columns=['id', 'scores'])
print(df)
You can use:
import pandas as pd
import io
temp=u'''id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"'''
df = pd.read_csv(io.StringIO(temp), sep=',', index_col=[0] )
print(df)
       scores
id
1   [1,2,3,4]
2       [1,2]
3     [0,2,4]
But the dtype of the scores column is object, and each value is a string, not a list.
One approach is to use ast.literal_eval together with the converters parameter:
import pandas as pd
import io
from ast import literal_eval
temp=u'''id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"'''
def converter(x):
    # parse the string, e.g. "[1,2,3]", into a Python list
    return literal_eval(x)
# map each column name to its converter
converters={'scores': converter}
df = pd.read_csv(io.StringIO(temp), sep=',', converters=converters)
print(df)
   id        scores
0   1  [1, 2, 3, 4]
1   2        [1, 2]
2   3     [0, 2, 4]
# check lists:
print(2 in df.scores[2])
# True
print(1 in df.scores[2])
# False
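If the CSV has already been loaded without a converter, the strings can still be parsed afterwards. A minimal sketch in Python 3, assuming every cell in the column is a valid Python literal; the variable names csv_text and raw are illustrative.

```python
import io
from ast import literal_eval

import pandas as pd

csv_text = '''id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"'''

raw = pd.read_csv(io.StringIO(csv_text))

# without a converter, each cell is still a plain string
print(type(raw.loc[0, 'scores']))

# parse every string into a real Python list after loading
raw['scores'] = raw['scores'].apply(literal_eval)
print(type(raw.loc[0, 'scores']))
print(raw)
```

literal_eval only accepts Python literals, so it raises an error on malformed cells rather than executing arbitrary code.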