Remove Column with Duplicate Values in Pandas

I have a dataset that is loaded into a pandas DataFrame with the code below:
import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv',index_col=None, header=0)
Is there any way to avoid reading duplicate columns in pandas, or to remove the duplicate columns after reading?
Please note: the column names are different once the data is read into pandas, so a command like df=df.loc[:,~df.columns.duplicated()] won't work.
The actual dataset is very big and has many duplicate columns, all of them date columns.

There are two ways you can do this.
Ignore columns when reading the data
pandas.read_csv has a usecols argument, which accepts a list of integer column positions.
So you can try:
# work out the required columns; reading only the header (nrows=0) is enough
header = pd.read_csv('file.csv', header=0, nrows=0)
# assuming the columns alternate Date, Value, Date, Value, ...:
# keep column 0 (the first date column) and every odd-numbered column
cols = [0] + list(range(1, len(header.columns), 2))
# use the integer column list
df = pd.read_csv('file.csv', usecols=cols)
Remove columns from the dataframe
You can use similar logic with pd.DataFrame.iloc to remove unwanted columns after reading.
# cols as defined in the previous example
df = df.iloc[:, cols]
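As a self-contained illustration of both approaches, here is a minimal sketch that uses an in-memory CSV in place of the real file. The sample data and the alternating Date, Value layout are assumptions and are not taken from the question's file.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample with an alternating Date, Value, Date, Value layout
csv_text = (
    "Date,Value1,Date,Value2\n"
    "2018-01-01,0,2018-01-01,1\n"
    "2018-01-02,0,2018-01-02,1\n"
)

# Read only the header row to count the columns
n_cols = len(pd.read_csv(StringIO(csv_text), nrows=0).columns)
cols = [0] + list(range(1, n_cols, 2))

# Approach 1: skip the repeated date columns while reading
df_read = pd.read_csv(StringIO(csv_text), usecols=cols)

# Approach 2: read everything, then select the same positions
df_all = pd.read_csv(StringIO(csv_text))
df_sel = df_all.iloc[:, cols]

print(list(df_read.columns))
print(list(df_sel.columns))
```

Both approaches rely on knowing the column layout in advance; if the layout is irregular, the positions in cols would have to be worked out differently.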

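One way to do it is to read only the first row and use drop_duplicates() to find the positions of the unique column names. These positions are passed to usecols, so there is no need to work out the column indices by hand. It should work as long as the duplicated columns share the same header name.
# StringIO comes from the standard library (from io import StringIO);
# pd.compat.StringIO is not available in recent pandas versions
m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)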
Full example:
from io import StringIO
import pandas as pd
data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''
# read the header row as data, transpose it, and keep the position of each first occurrence
m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)
print(df)
#         Date  Value1  Value2
#0  2018-01-01       0       1
#1  2018-01-02       0       1
Another way is to remove all columns whose name contains a dot (.). When pandas reads duplicate column names it renames the repeats to Date.1, Date.2 and so on, so this should work in most cases, as a dot is rarely used in column names:
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
Full example:
from io import StringIO
import pandas as pd
data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''
df = pd.read_csv(StringIO(data))
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
print(df)
#         Date  Value1  Value2
#0  2018-01-01       0       1
#1  2018-01-02       0       1
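Since the question is about columns with duplicate values rather than duplicate names, a third option is to compare the column contents directly. The following is a minimal sketch, assuming the frame is small enough to transpose comfortably; the sample data is hypothetical.

```python
from io import StringIO

import pandas as pd

csv_text = (
    "Date,Value1,Date,Value2\n"
    "2018-01-01,0,2018-01-01,1\n"
    "2018-01-02,0,2018-01-02,1\n"
)
df = pd.read_csv(StringIO(csv_text))  # columns: Date, Value1, Date.1, Value2

# Transpose so that each column becomes a row, then flag every row that
# repeats an earlier row exactly
is_repeat = df.T.duplicated()

# Keep only the columns that were not flagged
deduped = df.loc[:, ~is_repeat.to_numpy()]
print(list(deduped.columns))
```

This does not depend on the renamed headers, but it would also drop two value columns that happen to hold identical data, and transposing a very large frame can be slow.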

Related

extracted data from sql for processing using python

I have saved out a data column as follows:
[[A,1], [B,5], [C,18]....]
I was hoping to group A, B, C as shown above into Category and 1, 5, 18 into Values/Series, so that I can update my PowerPoint chart using python-pptx.
Example:
Category  Values
A         1
B         5
Is there any way I can do this? Currently the above example is extracted as a string, so I believe I have to convert it to lists first?
Thanks in advance!

Try parsing your string into a real list of lists, then create your dataframe from that list:
import pandas as pd
import re
s = '[[A,1], [B,5], [C,18]]'
cols = ['Category', 'Values']
# strip the outer brackets, grab the text inside each inner [...] and split it on the comma
data = [row.split(',') for row in re.findall(r'\[([^]]+)\]', s[1:-1])]
df = pd.DataFrame(data, columns=cols)
print(df)
# Output:
#   Category Values
# 0        A      1
# 1        B      5
# 2        C     18
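The parsed values are still strings at this point. As a minimal sketch of a follow-up step (the variable names categories and values are only illustrative), the Values column can be converted to numbers and both columns pulled out as plain lists, which is the shape a charting library typically expects:

```python
import re

import pandas as pd

s = '[[A,1], [B,5], [C,18]]'
pairs = [row.split(',') for row in re.findall(r'\[([^]]+)\]', s[1:-1])]
df = pd.DataFrame(pairs, columns=['Category', 'Values'])

# Tidy the labels and turn the numeric strings into integers
df['Category'] = df['Category'].str.strip()
df['Values'] = pd.to_numeric(df['Values'])

categories = df['Category'].tolist()
values = df['Values'].tolist()
print(categories, values)
```

How the two lists are handed to python-pptx is left out here, as it depends on the chart being updated.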

You should be able to just use pandas.DataFrame and pass in your data, unless I'm misunderstanding the question. Anyway, try:
df = pandas.DataFrame(data=d, columns = ['Category', 'Value'])
where d is your list of tuples.

from prettytable import PrettyTable
column = [["A",1],["B",5],["C",18]]
columnname = []
columnvalue = []
t = PrettyTable(['Category', 'Values'])
for data in column:
    columnname.append(data[0])
    columnvalue.append(data[1])
    t.add_row([data[0], data[1]])
print(t)

Take specific columns from excel into Dataframe

I only need two columns from an Excel sheet. One is always located at column B; the other shifts around a bit depending on the month. The one at B doesn't have a name but the other one does, so how do I either set a name for the one at B, or find the one whose header string I know, and extract both into a DataFrame?
Current implementation:
file_location = r'Desktop\Excelfile.xlsx'
df = pd.read_excel(file_location, index_col=None, na_values=['NA'], usecols="B,K")
Any ideas?

I found out that a column without a header is named Unnamed: N, where N is its zero-based position, so column B is called Unnamed: 1, column C Unnamed: 2, and so on. I just renamed it like this:
df = df.rename(columns = {"Unnamed: 1":"Product"})
df = pd.DataFrame(df,columns=["Amount","Product"])
and now it works as intended. The full code:
import pandas as pd
file_location = r'Desktop\Excelfile.xlsx'
df = pd.read_excel(file_location)
# the header-less column B is read as "Unnamed: 1"; give it a real name
df = df.rename(columns = {"Unnamed: 1":"Product"})
# keep only the two columns of interest, selected by name
df = pd.DataFrame(df,columns=["Amount","Product"])
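The same idea can be tried without an Excel file. The sketch below uses a small hand-built DataFrame as a stand-in for what read_excel might return; the column names other than "Amount" and the sample values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the sheet: column B has no header, and the
# named column ("Amount") sits somewhere further to the right
raw = pd.DataFrame({
    "Unnamed: 0": [None, None],
    "Unnamed: 1": ["Widget", "Gadget"],
    "Other": [1, 2],
    "Amount": [10.0, 20.0],
})

# Rename column B by position, so the generated name does not matter
raw = raw.rename(columns={raw.columns[1]: "Product"})

# Select the other column by its header, wherever it currently sits
result = raw[["Amount", "Product"]]
print(result)
```

Selecting by header name means the code keeps working when the named column moves between months, as long as its header text stays the same.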

Remove leading comma in header when using pandas to_csv

By default to_csv writes a CSV like
,a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
But I want it to write like this:
a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
How do I achieve this? I can't set index=False because I want to preserve the index. I just want to remove the leading comma.
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'])
df.to_csv("test.csv") # this results in the first example above.

This is possible by first writing only the column names without the index, then appending the data without a header:
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'], index=list('XYZ'))
pd.DataFrame(columns=df.columns).to_csv("test.csv", index=False)
# alternative way to write only the header row
#df.iloc[:0].to_csv("test.csv", index=False)
df.to_csv("test.csv", header=False, mode='a')
df = pd.read_csv("test.csv")
print (df)
     a    b    c
X  0.0  0.0  0.0
Y  0.0  0.0  0.0
Z  0.0  0.0  0.0
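The two-step write can be checked without touching the disk. This is a minimal sketch that writes to an in-memory buffer instead of test.csv; the buffer is an illustrative substitute, not part of the original answer.

```python
from io import StringIO

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 3)), columns=['a', 'b', 'c'])

buf = StringIO()
# Step 1: header row only (an empty slice of the frame, written without index)
df.iloc[:0].to_csv(buf, index=False)
# Step 2: the data rows with their index, written without a header
df.to_csv(buf, header=False)

text = buf.getvalue()
print(text)

# Reading back: the header has one field fewer than the data rows, so
# pandas uses the first field of each row as the index
roundtrip = pd.read_csv(StringIO(text))
```

The round trip recovers a 3x3 frame because read_csv treats the extra leading field as the index.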

Alternatively, try resetting the index so that it becomes a column in the data frame, named index. This works with a MultiIndex as well.
df = df.reset_index()
df.to_csv('output.csv', index = False)

Simply set a name for your index: df.index.name = 'blah'. This name will appear as the first name in the headers.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((3,3)), columns = ['a','b','c'])
df.index.name = 'my_index'
print(df.to_csv())
yields
my_index,a,b,c
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
However, if (as per your comment) you want three comma-separated names in the header while there are four comma-separated values in each row of the CSV, you'll have to handcraft it. It will NOT be compliant with any standard CSV format, though.
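One option that may avoid handcrafting is the index_label parameter of to_csv. The pandas documentation describes index_label=False as not printing fields for index names (it is mentioned there for easier importing in R). The following is a small sketch of that behaviour, not something taken from the answers above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 3)), columns=['a', 'b', 'c'])

# index_label=False drops the index field from the header row only;
# the index values are still written at the start of every data row
csv_text = df.to_csv(index_label=False)
first_line = csv_text.splitlines()[0]
print(csv_text)
```

As noted above, a file whose header has fewer fields than its rows is not standard CSV, so other tools may or may not read it as intended.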

Wide to long returns empty output - Python dataframe

I have a dataframe which can be generated with the code below:
df = pd.DataFrame({
    'person_id': [1, 2, 3],
    'date1': ['12/31/2007', '11/25/2009', '10/06/2005'],
    'val1': [2, 4, 6],
    'date2': ['12/31/2017', '11/25/2019', '10/06/2015'],
    'val2': [1, 3, 5],
    'date3': ['12/31/2027', '11/25/2029', '10/06/2025'],
    'val3': [7, 9, 11]})
I followed the solution below to convert it from wide to long:
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
                j='grp').sort_index(level=0)
Though this works with the sample data shown above, it doesn't work with my real data, which has more than 200 columns. Instead of person_id, my real data has subject_ID, with values like DC0001, DC0002, etc. Does i always have to be numeric? With my real data, the stub names are added as new columns and the result has zero rows.
My real data might contain NAs as well. Do I have to fill them with default values for wide_to_long to work?
Can you please help me find the issue? Any other approach that achieves the same result would also be helpful.

Try adding the suffix argument, which lets the part after the stub name be a string rather than a number:
pd.wide_to_long(......................., suffix=r'\w+')
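To make the effect of suffix concrete, here is a minimal sketch with made-up column names (date_first, val_first and so on are assumptions, not the asker's real columns). The sep argument tells wide_to_long which separator sits between the stub and the suffix.

```python
import pandas as pd

df = pd.DataFrame({
    'subject_ID': ['DC0001', 'DC0002'],
    'date_first': ['12/31/2007', '11/25/2009'],
    'val_first': [2, 4],
    'date_second': ['12/31/2017', '11/25/2019'],
    'val_second': [1, 3],
})

# Columns are matched as <stub><sep><suffix>; with the default suffix
# (digits only) none of these columns would match
long_df = pd.wide_to_long(
    df,
    stubnames=['date', 'val'],
    i='subject_ID',
    j='grp',
    sep='_',
    suffix=r'\w+',
).sort_index()
print(long_df)
```

When no column matches the stub, separator and suffix pattern, the result has the stub columns but no rows, which matches the symptom described in the question.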

The issue is with your column names: the numbers used to convert from wide to long need to be at the end of the column names, or you need to specify a suffix pattern that matches them. I think the easiest solution is to create a function that accepts a regex and the dataframe and renames the matching columns.
import pandas as pd
import re
def change_names(df, regex):
    # Select one of the column groups
    old_cols = df.filter(regex = regex).columns
    # Create list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stubname of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col) # returns a list like ['1']
        # Make new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename columns
    df.rename(columns = dd, inplace = True)
    return df
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
# Change date columns
tdf = change_names(tdf, 'date$')
# Change val columns
tdf = change_names(tdf, 'val$')
print(tdf)
   person_id      hdate1  tval1      hdate2  tval2      hdate3  tval3
0          1  12/31/2007      2  12/31/2017      1  12/31/2027      7
1          2  11/25/2009      4  11/25/2019      3  11/25/2029      9
2          3  10/06/2005      6  10/06/2015      5  10/06/2025     11
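The same renaming can also be written as a single regular-expression substitution passed to rename. This is a compact sketch of an alternative, assuming each affected name has the shape letters, digits, letters; the helper name move_digits_to_end is made up for illustration.

```python
import re

import pandas as pd

def move_digits_to_end(name):
    # 'h1date' -> 'hdate1'; names without an embedded number do not
    # match the pattern and are returned unchanged
    return re.sub(r'^(\D+)(\d+)(\D+)$', r'\1\3\2', name)

tdf = pd.DataFrame({
    'person_id': [1, 2],
    'h1date': ['12/31/2007', '11/25/2009'],
    't1val': [2, 4],
    'h2date': ['12/31/2017', '11/25/2019'],
    't2val': [1, 3],
})

# rename accepts a function that is applied to every column name
renamed = tdf.rename(columns=move_digits_to_end)
print(list(renamed.columns))
```

Unlike the function above, this returns a new DataFrame and leaves tdf untouched.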

It is quite late to answer this question, but I am putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
## You can use m13op22's solution to rename your columns so that the numeric
## part is at the end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
                          'h2date': 'hdate2', 't2val': 'tval2',
                          'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion (in this example 'hdate' and 'tval') as
## stubnames. The mistake you made was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)
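On the two side questions (whether i must be numeric, and whether NAs must be filled first), a small experiment suggests neither is required. The sketch below uses made-up data with string IDs and a few missing values; it is an illustration under those assumptions, not the asker's real data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_ID': ['DC0001', 'DC0002', 'DC0003'],   # string identifiers
    'date1': ['12/31/2007', '11/25/2009', None],    # one missing date
    'val1': [2, 4, np.nan],                         # one missing value
    'date2': ['12/31/2017', '11/25/2019', '10/06/2015'],
    'val2': [1, 3, 5],
})

long_df = pd.wide_to_long(
    df, stubnames=['date', 'val'], i='subject_ID', j='grp'
).sort_index(level=0)
print(long_df)
```

The missing entries simply stay missing in the long result, which points back to the column names as the cause of the empty output.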

Using Pandas to create DataFrame with Series, resulting in memory error

I'm using the pandas library for remote sensing time series analysis. Eventually I would like to save my DataFrame to CSV in chunks, but I have run into an issue. My code generates 6 NumPy arrays that I convert to pandas Series. Each of these Series contains a lot of items:
>>> prcpSeries.shape
(12626172,)
I would like to add the Series into a Pandas DataFrame (df) so I can save them chunk by chunk to a csv file.
d = {'prcp': pd.Series(prcpSeries),
'tmax': pd.Series(tmaxSeries),
'tmin': pd.Series(tminSeries),
'ndvi': pd.Series(ndviSeries),
'lstm': pd.Series(lstmSeries),
'evtm': pd.Series(evtmSeries)}
df = pd.DataFrame(d)
outFile ='F:/data/output/run1/_'+str(i)+'.out'
df.to_csv(outFile, header = False, chunksize = 1000)
d = None
df = None
But my code gets stuck at the following line and raises a MemoryError:
df = pd.DataFrame(d)
Any suggestions? Is it possible to fill the Pandas DataFrame chunk by chunk?

If you know each of these are the same length then you could create the DataFrame directly from the array and then append each column:
df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
...
Note: you can also use the to_frame method, which optionally lets you pass a name (useful if the Series doesn't have one):
df = prcpSeries.to_frame(name='prcp')
However, if they are variable length then this will lose some data (any arrays which are longer than prcpSeries). An alternative here is to create each as a DataFrame and then perform an outer join (using concat):
df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...
df = pd.concat([df1, df2, ...], join='outer', axis=1)
For example:
In [21]: dfA = pd.DataFrame([1,2], columns=['A'])
In [22]: dfB = pd.DataFrame([1], columns=['B'])
In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
A B
0 1 1
1 2 NaN
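On the second part of the question (filling the output chunk by chunk), one possibility is to never build the full DataFrame and instead write one slice of the arrays at a time. The following is a minimal sketch with small stand-in arrays, assuming all arrays have the same length; the chunk size, the temporary output path and the reduced set of three columns are illustrative choices.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-ins for the large arrays in the question
n = 2500
arrays = {
    'prcp': np.arange(n, dtype=float),
    'tmax': np.arange(n, dtype=float) + 1,
    'tmin': np.arange(n, dtype=float) - 1,
}

chunk = 1000  # rows per write
out_path = os.path.join(tempfile.mkdtemp(), 'run1.out')

for start in range(0, n, chunk):
    stop = min(start + chunk, n)
    # Build a DataFrame for this slice only; the index keeps the original row numbers
    part = pd.DataFrame(
        {name: arr[start:stop] for name, arr in arrays.items()},
        index=range(start, stop),
    )
    # Overwrite the file on the first chunk, append on the later ones
    part.to_csv(out_path, header=False, mode='w' if start == 0 else 'a')

with open(out_path) as fh:
    lines = fh.read().splitlines()
print(len(lines))
```

Whether this avoids the MemoryError depends on how much memory the source arrays already occupy, since only the DataFrame copy is avoided.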
