I only need two columns from an Excel sheet: one is always located at B, and the other shifts around a bit depending on the month. The one at B doesn't have a header, but the other one does. How do I either set a name for the column at B, or find the column whose name I know, and extract both into a DataFrame?
Current Implementation:
file_location = r'Desktop\Excelfile.xlsx'
df = pd.read_excel(file_location, index_col=None, na_values=['NA'], usecols="B,K")
any ideas?
I found out that the first unnamed column is called Unnamed: 1, the next Unnamed: 2, and so on, so I just renamed it like this:
df = df.rename(columns = {"Unnamed: 1":"Product"})
df = pd.DataFrame(df,columns=["Amount","Product"])
and now it works as intended:
import numpy as np
import pandas as pd
file_location = r'Desktop\Excelfile.xlsx'
df = pd.read_excel(file_location)
df = df.rename(columns = {"Unnamed: 1":"Product"})
df = pd.DataFrame(df,columns=["Amount","Product"])
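For reference, a variant that doesn't rely on the Unnamed: 1 label at all: column B is picked by position and the named column by its header. This is a minimal sketch, assuming the named column is "Amount" as in the code above and the same placeholder path:
import pandas as pd

# Placeholder path, as in the question
file_location = r'Desktop\Excelfile.xlsx'

raw = pd.read_excel(file_location, na_values=['NA'])

# Column B is always the second column (position 1), so select it by position;
# "Amount" is looked up by name, so it can shift between months without breaking.
df = pd.DataFrame({
    "Product": raw.iloc[:, 1],
    "Amount": raw["Amount"],
})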
I was using this bit of code (reworked for my application) when I found that df_temp.drop(index=sample.index, inplace=True) performed the same action on df_input, i.e. it emptied it as well! I was not expecting that at all.
I solved it by changing df_temp = df_input to df_temp = df_input.copy(), but can someone explain to me what is going on here?
import seaborn as sns
import pandas as pd

df_input = sns.load_dataset('diamonds')
df = df_input.loc[[]]
df_temp = df_input  # this is where we're sampling from

n_samples = 1000
for _ in range(n_samples):
    sample = df_temp.sample(1)
    df_temp.drop(index=sample.index, inplace=True)
    df = df.append(sample)

assert((df.index.value_counts() > 1).sum() == 0)
df
Pandas does not copy the whole DataFrame when you simply assign it to a new variable. After executing df_temp = df_input you end up with two variables referring to the exact same DataFrame. It's not that they refer to two identical DataFrames; they actually point to the same object (think of it as giving one DataFrame two names). So no matter which variable (name) you use to alter the DataFrame, the change is visible through the other variable too. If you use .copy() you get what you intended: two variables holding two distinct versions of the DataFrame.
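A small sketch that makes the difference visible (the column name here is made up):
import pandas as pd

df_input = pd.DataFrame({'a': [1, 2, 3]})

df_alias = df_input            # both names point to the same object
df_alias.drop(index=[0], inplace=True)
print(len(df_input))           # 2 -- the "original" shrank as well

df_copy = df_input.copy()      # an independent copy of the data
df_copy.drop(index=[1], inplace=True)
print(len(df_input))           # still 2 -- the original is untouched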
I have a dataframe that can be generated with the code given below:
df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'],
                   'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'],
                   'val2': [1, 3, 5],
                   'date3': ['12/31/2027', '11/25/2029', '10/06/2025'],
                   'val3': [7, 9, 11]})
I followed the solution below to convert it from wide to long:
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
j='grp').sort_index(level=0)
Though this works with the sample data shown above, it doesn't work with my real data, which has more than 200 columns. Instead of person_id, my real data has subject_ID with values like DC0001, DC0002, etc. Does i always have to be numeric? Instead of reshaping, it just adds the stub values as new columns to my dataset and returns zero rows.
This is how my real columns look (screenshot of the real column names omitted).
My real data might contain NAs as well. Do I have to fill them with default values for wide_to_long to work?
Can you please help me figure out what the issue might be? Any other approach that achieves the same result would also be helpful.
Try adding the additional argument to the function that allows string suffixes:
pd.wide_to_long(..., suffix=r'\w+')
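For illustration, a minimal runnable sketch of what passing suffix does; the frame here is hypothetical, with letter suffixes instead of digits:
import pandas as pd

# Hypothetical frame whose suffixes are letters rather than numbers
df = pd.DataFrame({'person_id': [1, 2],
                   'dateA': ['2020-01-01', '2020-02-01'],
                   'valA': [10, 20],
                   'dateB': ['2021-01-01', '2021-02-01'],
                   'valB': [30, 40]})

# The default suffix is '\d+' (digits only); '\w+' also accepts string suffixes
long_df = pd.wide_to_long(df, stubnames=['date', 'val'],
                          i='person_id', j='grp', suffix=r'\w+')
print(long_df.sort_index(level=0))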
The issue is with your column names: the numbers used to convert from wide to long need to be at the end of your column names, or you need to specify a suffix to group by. I think the easiest solution is to create a function that accepts a regex and the dataframe.
import pandas as pd
import re

def change_names(df, regex):
    # Select one of three column groups
    old_cols = df.filter(regex=regex).columns
    # Create list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stubname of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col)  # returns a list like ['1']
        # Make new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename columns
    df.rename(columns=dd, inplace=True)
    return df
tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'],
                    't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'],
                    't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'],
                    't3val': [7, 9, 11]})
# Change date columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
   person_id      hdate1  tval1      hdate2  tval2      hdate3  tval3
0          1  12/31/2007      2  12/31/2017      1  12/31/2027      7
1          2  11/25/2009      4  11/25/2019      3  11/25/2029      9
2          3  10/06/2005      6  10/06/2015      5  10/06/2025     11
This is quite a late answer to the question, but I'm putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'],
                    't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'],
                    't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'],
                    't3val': [7, 9, 11]})
## You can use m13op22's solution to rename your columns so that the numeric
## part is at the end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
'h2date': 'hdate2', 't2val': 'tval2',
'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion (in this example 'hdate', 'tval') as
## stubnames. The mistake you were making was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)
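As a side note on the question "does i have to be numeric?": it doesn't. A small sketch reusing the renamed tdf from above, with hypothetical string IDs in the style of the DC0001 values mentioned in the question:
# i does not have to be numeric: string IDs work as well (hypothetical values)
tdf2 = tdf.copy()
tdf2['person_id'] = ['DC0001', 'DC0002', 'DC0003']
df2 = pd.wide_to_long(tdf2, stubnames=['hdate', 'tval'],
                      i='person_id', j='grp').sort_index(level=0)
print(df2)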
I have a database; a sample is shown below (screenshot omitted).
The DataFrame is generated when I load the data in Python with the code below:
import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv',index_col=None, header=0)
Output: (screenshot of the loaded DataFrame omitted)
Is there any way to avoid reading the duplicate columns in pandas, or to remove the duplicate columns after reading?
Please note: the column names change once the data is read into pandas (duplicates get renamed), so a command like df = df.loc[:, ~df.columns.duplicated()] won't work.
The actual database is very big and has many duplicate columns, all of which are Dates.
There are 2 ways you can do this.
Ignore columns when reading the data
pandas.read_csv has the argument usecols, which accepts an integer list.
So you can try:
# work out required columns
df = pd.read_csv('file.csv', header=0)
cols = [0] + list(range(1, len(df.columns), 2))
# use column integer list
df = pd.read_csv('file.csv', usecols=cols)
Remove columns from dataframe
You can use similar logic with pd.DataFrame.iloc to remove unwanted columns.
# cols as defined in previous example
df = df.iloc[:, cols]
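To make this concrete, a minimal, self-contained sketch of the usecols approach; the inline data is a made-up stand-in, assumed to follow the question's pattern of every other column being a duplicated Date (reading only the header with nrows=0 avoids a full first pass over the file):
from io import StringIO
import pandas as pd

# Small stand-in for the real file: every other column repeats the Date
data = """Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1"""

# First pass: read only the header to work out which positions to keep
header = pd.read_csv(StringIO(data), nrows=0)
cols = [0] + list(range(1, len(header.columns), 2))

# Second pass: read only those column positions
df = pd.read_csv(StringIO(data), usecols=cols)
print(df)
#          Date  Value1  Value2
# 0  2018-01-01       0       1
# 1  2018-01-02       0       1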
One way to do it is to read only the first row and create a mask using drop_duplicates(). We then pass this to usecols, without having to specify the indices beforehand. It should be fail-safe.
m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)
Full example:
from io import StringIO
import pandas as pd
data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''
m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)
print(df)
#          Date  Value1  Value2
# 0  2018-01-01       0       1
# 1  2018-01-02       0       1
Another way to do it would be to remove all columns with a dot in their name (pandas renames duplicated columns by appending .1, .2, and so on). This should work in most cases, as the dot is rarely used in column names:
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
Full example:
from io import StringIO
import pandas as pd
data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''
df = pd.read_csv(StringIO(data))
df = df.loc[:, ~df.columns.str.contains('.', regex=False)]
print(df)
#          Date  Value1  Value2
# 0  2018-01-01       0       1
# 1  2018-01-02       0       1
I'm trying to use Python to read my CSV file, extract specific columns into a pandas DataFrame, and display that DataFrame. However, I don't see the DataFrame; I get Series([], dtype: object) as the output. Below is the code I'm working with:
My document consists of:
product sub_product issue sub_issue consumer_complaint_narrative
company_public_response company state zipcode tags
consumer_consent_provided submitted_via date_sent_to_company
company_response_to_consumer timely_response consumer_disputed?
complaint_id
I want to extract:
sub_product issue sub_issue consumer_complaint_narrative
import pandas as pd
df=pd.read_csv("C:\\....\\consumer_complaints.csv")
df=df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
dataset = pd.read_csv(input_file)
df = pd.DataFrame(dataset)
cols = [1,2,3,4]
df = df[df.columns[cols]]
Specify here the column numbers you want to select; in a DataFrame, columns start from index 0:
cols = []
You can also select columns by name. Just use the following line:
df = df[["Column Name","Column Name2"]]
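Putting it together for this particular file, a minimal sketch that selects the wanted columns by name while reading; the path is the placeholder from the question and the column names are assumed to match the header listed there:
import pandas as pd

# Column names taken from the question's header; adjust if they differ
wanted = ['sub_product', 'issue', 'sub_issue', 'consumer_complaint_narrative']

# usecols keeps only these columns while the file is being read
df = pd.read_csv("C:\\....\\consumer_complaints.csv", usecols=wanted)
print(df.head())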
A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:,'B':'F']
Hope that helps.
This worked for me, using positional slicing with iloc:
df = pd.read_csv(...)
df1 = df.iloc[:, n1:n2]
where n1 < n2 are both column positions in the range, e.g. if you want columns 3-5, use
df1 = df.iloc[:, 3:5]
For the first column, use
df1 = df.iloc[:, 0]
Though I'm not sure how to select a discontinuous range of columns.
We can also use iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
will return the top 3 rows of the second and third columns (remember, numbering starts at 0).
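A tiny self-contained example of the same call, with dataset2 as a made-up stand-in:
import pandas as pd

# Hypothetical stand-in for dataset2
dataset2 = pd.DataFrame({'product': ['a', 'b', 'c', 'd'],
                         'sub_product': ['a1', 'b1', 'c1', 'd1'],
                         'issue': ['x', 'y', 'z', 'w'],
                         'sub_issue': ['x1', 'y1', 'z1', 'w1']})

# First three rows of the columns at positions 1 and 2
print(dataset2.iloc[:3, [1, 2]])
#   sub_product issue
# 0          a1     x
# 1          b1     y
# 2          c1     z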
I'm using the pandas library for remote sensing time series analysis. Eventually I would like to save my DataFrame to CSV in chunks, but I'm running into a little issue. My code generates six NumPy arrays that I convert to pandas Series, and each of these Series contains a lot of items:
>>> prcpSeries.shape
(12626172,)
I would like to put the Series into a pandas DataFrame (df) so I can save them chunk by chunk to a CSV file:
d = {'prcp': pd.Series(prcpSeries),
'tmax': pd.Series(tmaxSeries),
'tmin': pd.Series(tminSeries),
'ndvi': pd.Series(ndviSeries),
'lstm': pd.Series(lstmSeries),
'evtm': pd.Series(evtmSeries)}
df = pd.DataFrame(d)
outFile ='F:/data/output/run1/_'+str(i)+'.out'
df.to_csv(outFile, header = False, chunksize = 1000)
d = None
df = None
But my code gets stuck at the following line, giving a MemoryError:
df = pd.DataFrame(d)
Any suggestions? Is it possible to fill the Pandas DataFrame chunk by chunk?
If you know each of these is the same length, then you can create the DataFrame directly from one array and then add each remaining column:
df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
...
Note: you can also use the to_frame method, which optionally lets you pass a name (useful if the Series doesn't have one):
df = prcpSeries.to_frame(name='prcp')
However, if they have different lengths then this will lose some data (any values in arrays longer than prcpSeries). An alternative here is to create each as a DataFrame and then perform an outer join (using concat):
df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...
df = pd.concat([df1, df2, ...], join='outer', axis=1)
For example:
In [21]: dfA = pd.DataFrame([1,2], columns=['A'])
In [22]: dfB = pd.DataFrame([1], columns=['B'])
In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
   A    B
0  1    1
1  2  NaN
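Putting the pieces together for the six Series from the question, a minimal sketch; the dummy data here is made up, and the names mirror the question's prcp, tmax, etc.:
import numpy as np
import pandas as pd

# Dummy stand-ins for the six Series in the question (the real ones are much larger)
series = {'prcp': pd.Series(np.random.rand(5)),
          'tmax': pd.Series(np.random.rand(5)),
          'tmin': pd.Series(np.random.rand(4)),   # deliberately shorter
          'ndvi': pd.Series(np.random.rand(5)),
          'lstm': pd.Series(np.random.rand(5)),
          'evtm': pd.Series(np.random.rand(5))}

# An outer join keeps every row even when the Series have different lengths;
# missing positions become NaN
df = pd.concat([s.rename(name) for name, s in series.items()],
               join='outer', axis=1)

# Write out in chunks, as in the question
df.to_csv('combined.out', header=False, chunksize=1000)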