Remove duplicates from rows in my dataset - python

I have a CSV file which contains 436 columns and 14k rows.
The data inside the cells is of string type.
For example, it looks like this:
A,A,A,B,B,C,C,,,,,
D,F,D,F,D,F,H,,,,,
My goal is to get every row with its unique values only. Like that:
A,B,C,,,,,,,,
D,F,H,,,,,,,,
The file is a csv/txt file. I can use a Jupyter notebook (with Python 3, or any other code you provide), since this is my working environment. Any help would be amazing!
I have also loaded the csv into the notebook as a DataFrame. What do you suggest?

First, read your csv file into a NumPy array. Then, for each row, I'd do something like:
import numpy as np

s = 'A,A,A,B,B,C,C'
f = s.split(',')
np.unique(np.array(f))

which prints array(['A', 'B', 'C'], dtype='<U1') on Python 3. Note that np.unique also sorts the values.
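
For the whole file, a minimal sketch using the csv module (the names input.csv and output.csv are assumptions, since the question doesn't name the file), padding each row back to its original width so the trailing empty cells are kept:

import csv

with open('input.csv', newline='') as src, open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # dict.fromkeys keeps first-seen order; np.unique would sort instead
        uniques = list(dict.fromkeys(c for c in row if c))
        # pad with empty strings back to the original row width
        writer.writerow(uniques + [''] * (len(row) - len(uniques)))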

If you have the csv loaded as a dataframe df:
0 1 2 3 4 5 6
0 A A A B B C C
1 D F D F D F H
Iterate over the rows and find the unique values in each row:
unique_vals = []
for _, row in df.iterrows():
    unique_vals.append(row.unique().tolist())
unique_vals
[['A', 'B', 'C'], ['D', 'F', 'H']]
You haven't mentioned the return data type, so I've returned a list.
Edit: If the data set is too large to process at once, consider the chunksize option of pd.read_csv.
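
For example, a hedged sketch of the chunked variant (the 1000-row chunk size and the name input.csv are assumptions); dropna() removes the NaN values that the trailing empty cells become on read:

import pandas as pd

unique_vals = []
for chunk in pd.read_csv('input.csv', header=None, chunksize=1000):
    for _, row in chunk.iterrows():
        unique_vals.append(row.dropna().unique().tolist())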

Related

Multi-slice pandas dataframe

I have a dataframe:
import pandas as pd
df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
that I would like to slice into two new dataframes such that the first contains every nth value, while the second contains the remaining values not in the first.
For example, in the case of n=3, the second dataframe would keep two values from the original dataframe, skip one, keep two, skip one, and so on, so the original values are split into one set selected by the slice and another set holding the remainder.
I have achieved this successfully using a combination of iloc and isin:
df1 = df.iloc[::3]
df2 = df[~df.val.isin(df1.val)]
but what I would like to know is:
Is this the most Pythonic way to achieve this? It seems inefficient, and not particularly elegant, to take what I want out of a dataframe and then get the rest by checking what is not in the new dataframe. Instead, is there an iloc expression, like the one used to generate df1, which could do the second part of the slicing and replace the isin line? Even better, is there a single expression that could execute the entire two-step slice in one step?
Use the index modulo 3 and keep the rows where it is not equal to 0 (the complement of the sliced rows):
import numpy as np

# for a default RangeIndex
df2 = df[df.index % 3 != 0]

# for any index
df2 = df[np.arange(len(df)) % 3 != 0]

print(df2)
val
1 b
2 c
4 e
5 f
7 h
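
If you want both pieces at once, one sketch (a judgment call, not the only way) builds the boolean mask a single time and indexes with it and its negation:

import numpy as np

mask = np.arange(len(df)) % 3 == 0
df1, df2 = df[mask], df[~mask]  # every 3rd row, and all the rest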

Creating multiple empty lists separated by commas

I'm trying to write a dataframe to a csv file. However, the columns F to N should be empty. This is the dataframe I'm using:
data = [['a'], ['b'], ['c'], ['d'], ['e'], ['o']]
dataFrame = pandas.DataFrame(data).transpose()
The letters clarify under which column the data should go. For example, 'c' goes under column C. However, with the current code 'o' goes under column F. Is there a way to tell this dataFrame to skip columns F to N and write 'o' under column O?
I assumed it's possible to write [], [], [] many times, but that seems unnecessary. Is there a smart way to create multiple empty lists separated by commas, like in the example above?
Thanks for reading. If anything is unclear please let me know!
With a Pandas dataframe, what you want is not directly possible. The name Pandas is derived from "panel data". As such, it's built around NumPy arrays, one for each series or "column" of data, so there are no "placeholders" for series that should be skipped over when exporting to a CSV or Excel file.
You can explicitly set your index equal to your dataframe values and then use pd.DataFrame.reindex with a list of letters. If you have more than 26 columns, see Get Excel-Style Column Names from Column Number.
import pandas as pd
from string import ascii_lowercase
data = [['a'], ['b'], ['c'], ['d'], ['e'], ['o']]
df = pd.DataFrame(data)
df.index = df[0]
df = df.reindex(list(ascii_lowercase)).T.fillna('')
print(df[list('abcdefg') + list('mnopqrs')])
0 a b c d e f g m n o p q r s
0 a b c d e o
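
Since the original goal was a csv export, a short usage note (out.csv is a placeholder name): writing the reindexed frame puts 'o' under column o, with empty cells for f through n:

df.to_csv('out.csv', index=False)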

Python csv Merge several column to one cell

How can I merge several columns into one cell?
How can I convert a CSV file that contains 1-by-X cells, where 1 is the row count and X is a column count unknown to the user, into a new CSV file in which a single cell combines all the data from the original file?
Right now (one row, four columns; in fact the number of columns varies, as the data is extracted from a log file):
1 A B C D
What I want (one row, one column):
1 A
B
C
D
The index of the row may not always be 1, as I have many similar rows like that.
please refer to the original file and the expected new file at the following URL for details
https://drive.google.com/drive/folders/1cX0o86nbLAV5Foj5avCK6E2oAvExzt5t
161.csv is the file at the present
161-final.csv is what I want...
The count of rows will not change. However, the count of columns is variable, as the data is extracted from a log file. In the end, each row should have only one column.
I am new to pandas. Would one way be to calculate the number of columns and then merge them into one cell?
I'd really appreciate your help.
code:
import pandas as pd
import numpy as np
df = pd.DataFrame([['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']])
print(df)
df1 = np.empty(df.shape[0], dtype=object)
df1[:] = df.values.tolist()
print(df1)
output:
0 1 2 3
0 a a a a
1 b b b b
[list(['a', 'a', 'a', 'a']) list(['b', 'b', 'b', 'b'])]
Not sure this is what you want, but you can put the content of each row into a single column with groupby():
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape(1, 4))
df.columns = ['A', 'B', 'C', 'D']
df = df.groupby(df.index).apply(lambda x: x.values.ravel())
df = df.to_frame('unique_col')  # if needed
Output :
unique_col
0 [0, 1, 2, 3]
I'm not sure it's possible to get the output as anything other than a list, as shown in your example.
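
If the goal is literally one newline-joined cell per row, as in 161-final.csv, a minimal sketch assuming the filenames from the question and no header row:

import pandas as pd

df = pd.read_csv('161.csv', header=None, dtype=str)
merged = df.apply(lambda row: '\n'.join(row.dropna()), axis=1)
merged.to_frame().to_csv('161-final.csv', index=False, header=False)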
I used a "dumb" way to do it: pd.read_table(file) does the magic.
def CsvMerge(file1, file2, output):
    # preparation for getFormated():
    # merge "Format.csv" and "text.csv"
    df1 = pd.read_csv(file1)
    # read_table merges the columns into one column
    # and also avoids a csv parsing error
    df2 = pd.read_table(file2)
    df3 = pd.concat([df1, df2], axis=1)
    print(df3)
    # the with block closes the file; no explicit f.close() needed
    with open(output, 'w+', encoding="utf-8") as f:
        df3.to_csv(f, index=False)

pandas DataFrame automatically reordering my columns [duplicate]

When outputting the result to a CSV file, I generated a pandas dataframe, but the column order changed automatically. I am curious why this happened.
As Youn Elan pointed out, Python dictionaries weren't ordered before Python 3.7, so if you use a dictionary to provide your data on an older version, the columns can end up in arbitrary order. You can use the columns argument to set the order of the columns explicitly, though:
import pandas as pd

before = pd.DataFrame({'lake_id': range(3), 'area': ['a', 'b', 'c']})
print('before')
print(before)

after = pd.DataFrame({'lake_id': range(3), 'area': ['a', 'b', 'c']},
                     columns=['lake_id', 'area'])
print('after')
print(after)
Result:
before
area lake_id
0 a 0
1 b 1
2 c 2
after
lake_id area
0 0 a
1 1 b
2 2 c
I notice you use a dictionary.
Before Python 3.7, dictionaries in Python were not guaranteed to preserve any order; it depended on multiple factors, including the keys themselves. Keys are guaranteed to be unique, though.
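
On Python 3.7 and later, dictionaries preserve insertion order and pandas keeps that order, so the reordering above no longer happens; on older versions, a simple alternative to the columns argument is to reorder after construction:

before = before[['lake_id', 'area']]  # select the columns in the desired order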

Appending Separate outputs to one dataframe (Pandas)

I have 4 pieces of output as shown below. They come from 4 separate functions. I would like to store them in one single dataframe.
print("Number of Genes-", number_of_genes)
print(donor.ix[[0],0])
print(test.sum(axis=1).argmax(), test.sum(axis=1).max())
I tried something like this, but it doesn't work well:
print(number_of_genes, donor.ix[[0],0], test.sum(axis=1).argmax(), test.sum(axis=1).max())
Appending it to a dataframe doesn't seem to work either. Thanks for your help.
NB: each of these is for the same input.
Start by creating an empty DataFrame with your desired columns.
Assign variables to your data.
One way to append the row of new data is via a dictionary.
df = pd.DataFrame(columns=['n', 'd', 'max1', 'max2'])

n = number_of_genes
d = donor.ix[[0], 0]
max1 = test.sum(axis=1).argmax()
max2 = test.sum(axis=1).max()

# append returns a new DataFrame, so assign the result back
df = df.append({'n': n, 'd': d, 'max1': max1, 'max2': max2}, ignore_index=True)
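
Note that DataFrame.append was removed in pandas 2.0; a hedged alternative sketch builds the new row as a one-row frame and concatenates it:

row = pd.DataFrame([{'n': n, 'd': d, 'max1': max1, 'max2': max2}])
df = pd.concat([df, row], ignore_index=True)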
