Importing Excel into Pandas with text in cell A1 - Transpose Problem - python

If I take an Excel spreadsheet with the text "Sessions" in cell A1, heading a first column of day names ("Mon", "Tue", "Wed", ...), and bring it into Pandas with:
import pandas as pd
df = pd.read_excel("sessions.xlsx")
Then in my Jupyter notebook the dataframe shows "Sessions" as the first column header, with the day names as its values.
I then perform a transpose with:
df_t = df.T
Which gives me a dataframe whose column headers are the old row numbers (0, 1, 2, ...) rather than "Mon", "Tue", "Wed" etc., so when I try to address the columns to change formats and the like, I can't address them as I would like.
Using header=None in pd.read_excel doesn't help.
I can go into the Excel spreadsheet first and delete the contents of cell A1, which does work, but I want this to be efficient and hands-off.
Any suggestions to prevent this from happening?

You can copy the dataframe into a new dataframe from the second row onwards, using the first row of the transposed dataframe as the column headers.
import pandas as pd

df = pd.read_excel("sessions.xlsx")
df_t = df.T
# Row 0 of the transpose is the original "Sessions" column, i.e. the day
# names; use it as the header for the remaining rows
new_df = pd.DataFrame(df_t.values[1:], columns=df_t.iloc[0])
Result:
Sessions Mon Tue Wed Thu Fri Avg Max
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
3 1 2 3 4 5 6 7
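Alternatively, a minimal sketch assuming the day names sit in the first column under the A1 label: telling read_excel to treat that first column as the index means the transpose yields the day names as column headers directly, with no copying step.
import pandas as pd

# Make the "Sessions" column the index at read time, so transposing
# promotes "Mon", "Tue", ... to column headers
df = pd.read_excel("sessions.xlsx", index_col=0)
df_t = df.T
print(df_t.columns)  # e.g. Index(['Mon', 'Tue', 'Wed', ...], name='Sessions')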

Related

Pandas total count each day

I have a large dataset (df) with lots of columns and I am trying to get the total number of each day.
|datetime|id|col3|col4|col...
1 |11-11-2020|7|col3|col4|col...
2 |10-11-2020|5|col3|col4|col...
3 |09-11-2020|5|col3|col4|col...
4 |10-11-2020|4|col3|col4|col...
5 |10-11-2020|4|col3|col4|col...
6 |07-11-2020|4|col3|col4|col...
I want my result to be something like this
|datetime|id|col3|col4|col...|Count
6 |07-11-2020|4|col3|col4|col...| 1
3 |09-11-2020|5|col3|col4|col...| 1
2 |10-11-2020|5|col3|col4|col...| 1
4 |10-11-2020|4|col3|col4|col...| 2
1 |11-11-2020|7|col3|col4|col...| 1
I tried to use resample via a grouper, like this: df = df.groupby(['id','col3', pd.Grouper(key='datetime', freq='D')]).sum().reset_index(), and below is my result. I am still new to programming and Pandas; I have read up on the pandas docs but am still unable to do it.
|datetime|id|col3|col4|col...
6 |07-11-2020|4|col3|1|0.0
3 |09-11-2020|5|col3|1|0.0
2 |10-11-2020|5|col3|1|0.0
4 |10-11-2020|4|col3|2|0.0
1 |11-11-2020|7|col3|1|0.0
Try this:
df = df.groupby(['datetime','id','col3']).count()
If you want the count values for all columns based only on the date, then:
df.groupby('datetime').count()
You'll get a DataFrame that has the datetime as the index, with each column cell holding the number of entries for that index.
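If you want the Count attached as a column on the rows themselves, as in the desired output, a minimal sketch (using only the datetime and id columns from the example; the other columns are omitted) could be:
import pandas as pd

df = pd.DataFrame({
    'datetime': ['11-11-2020', '10-11-2020', '09-11-2020',
                 '10-11-2020', '10-11-2020', '07-11-2020'],
    'id': [7, 5, 5, 4, 4, 4],
})
# Attach each (datetime, id) group's size to its rows as a new column,
# then keep one representative row per group
df['Count'] = df.groupby(['datetime', 'id'])['id'].transform('size')
df = df.drop_duplicates(subset=['datetime', 'id'])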

Make Pandas figure out how many rows to skip in pd.read_excel

I'm trying to automate reading in hundreds of excel files into a single dataframe. Thankfully the layout of the excel files is fairly constant: they all have the same header (though its casing may vary) and the same number of columns, and the data I want to read is always stored in the first sheet.
However, in some files a number of rows have been skipped before the actual data begins. There may or may not be comments and such in the rows before the actual data. For instance, in some files the header is in row 3 and then the data starts in row 4 and down.
I would like pandas to figure out on its own how many rows to skip. Currently I use a somewhat complicated solution: I first read the file into a dataframe and check whether the header is correct; if not, I search for the row containing the header and then re-read the file, now knowing how many rows to skip.
import pandas as pd

def find_header_row(df, my_header):
    """Find the row containing the header."""
    for idx, row in df.iterrows():
        row_header = [str(t).lower() for t in row]
        if len(set(my_header) - set(row_header)) == 0:
            return idx + 1
    raise Exception("Can't find header row!")

my_header = ['col_1', 'col_2', ..., 'col_n']
df = pd.read_excel('my_file.xlsx')
# Make columns lower case (case may vary)
df.columns = [t.lower() for t in df.columns]
# Check if the header of the dataframe matches my_header
if len(set(my_header) - set(df.columns)) != 0:
    # If not, use my function to find the row containing the header
    n_rows_to_skip = find_header_row(df, my_header)
    # Re-read the dataframe, skipping the right number of rows
    df = pd.read_excel('my_file.xlsx', skiprows=n_rows_to_skip)
Since I know what the header row looks like, is there a way to let pandas figure out on its own where the data begins? Or can anyone think of a better solution?
Let us know if this works for you:
import pandas as pd
df = pd.read_excel("unamed1.xlsx")
df
Unnamed: 0 Unnamed: 1 Unnamed: 2
0 NaN bad row1 badddd row111 NaN
1 baaaa NaN NaN
2 NaN NaN NaN
3 id name age
4 1 Roger 17
5 2 Rosa 23
6 3 Rob 31
7 4 Ives 15
# Index of the first row in which every column is populated: the header row
first_row = (df.count(axis=1) >= df.shape[1]).idxmax()
# Promote that row's values to column headers and keep only the rows below it
df.columns = df.loc[first_row]
df = df.loc[first_row+1:]
df
3 id name age
4 1 Roger 17
5 2 Rosa 23
6 3 Rob 31
7 4 Ives 15
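Since the question mentions reading hundreds of files, here is a hedged sketch of how the same detection could be applied per file before concatenating; the folder path is an assumption, and reading with header=None keeps the row numbering simple:
import glob
import pandas as pd

frames = []
for path in glob.glob('reports/*.xlsx'):  # hypothetical folder of workbooks
    raw = pd.read_excel(path, header=None)
    # The first row in which every column is populated is taken as the header
    header_row = (raw.count(axis=1) >= raw.shape[1]).idxmax()
    frame = raw.iloc[header_row + 1:].copy()
    # Lower-case the header since its casing varies between files
    frame.columns = [str(c).lower() for c in raw.loc[header_row]]
    frames.append(frame)

df = pd.concat(frames, ignore_index=True)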

Pandas: Merge two data frames and keep non-intersecting data from a single data frame

Desire:
I want a way to merge two data frames and keep the non-intersected data from a specified data frame.
Problem:
I have duplicate data and I expected this line to remove that duplicate data:
final_df = new_df[~new_df.isin(previous_df)].dropna()
Example data and data test:
record = Record(1000, 9300815, '<redacted type>', '<redacted id>')
test_df = pd.DataFrame([record])
if not final_df.empty:
    # this produces an empty data frame
    empty_df = test_df[test_df.isin(final_df)].dropna()
    # this produces the record
    record_df = final_df[final_df.col01 == record.col01]
Background:
I'm loading xml data and converting the xml file into several different record types as namedtuples. I split each record type into its own dataframe. I then compare the current set of data from the xml file with the data already loaded into the database by constructing previous_df as such:
previous_df = pd.read_sql_table(table_name, con=conn, schema=schema, columns=columns)
Columns are dynamically created based on the fields in the named tuple. The database schema is generated using sqlalchemy, and I have added UniqueConstraint to manage when I think there are duplicates within the database.
Thanks in advance for any help provided.
PRESERVING SINGLE RECORDS FROM BOTH DATAFRAMES:
Try concatenating the dataframes first, so you are sure you will have duplicates, then apply drop_duplicates; I think you will end up with what you are after. See the example below:
#Create dummy data
df1 = pd.DataFrame(columns=["A","B"],data=[[1,2],[3,4],[5,6]])
print(df1)
A B
0 1 2
1 3 4
2 5 6
df2 = pd.DataFrame(columns=["A","B"],data=[[3,4],[5,6],[7,8],[9,10]])
print(df2)
A B
0 3 4
1 5 6
2 7 8
3 9 10
#Concatenate dataframes
df = pd.concat([df1,df2],axis=0)
print(df)
A B
0 1 2
1 3 4
2 5 6
0 3 4
1 5 6
2 7 8
3 9 10
#Drop duplicates
df = df.drop_duplicates(keep=False)
print(df)
A B
0 1 2
2 7 8
3 9 10
PRESERVING SINGLE RECORDS FROM ONE DATAFRAME ONLY:
If you only want to keep the data from the new dataframe, use a dirty little trick: concatenate the old dataframe twice, so all of its records fall under the drop_duplicates criteria. Like so:
#Concatenate dataframes with old dataframe taken twice!
df = pd.concat([df1,df1,df2],axis=0)
#Now you will only end up with the records from second dataframe
df = df.drop_duplicates(keep=False)
print(df)
A B
2 7 8
3 9 10
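Coming back to the original isin attempt: DataFrame.isin with another DataFrame matches element-wise and requires the index and column labels to line up, which is a common source of surprises here. An anti-join via merge's indicator flag is another way to keep only the rows of new_df that are absent from previous_df (this sketch assumes the two frames share column names):
# Left-join new_df against previous_df and keep rows with no match
merged = new_df.merge(previous_df.drop_duplicates(), how='left', indicator=True)
only_new = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')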

Pandas: Dataframe.Drop - ValueError: labels ['id'] not contained in axis

Attempting to drop a column from a DataFrame in Pandas. DataFrame created from a text file.
import pandas as pd
df = pd.read_csv('sample.txt')
df.drop(['id'], 1, inplace=True)
However, this generates the following error:
ValueError: labels ['id'] not contained in axis
Here is a copy of the sample.txt file :
a,b,c,d,e
1,2,3,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8
Thanks in advance.
So the issue is that your "sample.txt" file doesn't actually include the data you are trying to remove.
Your line
df.drop(['id'], 1, inplace=True)
is attempting to take your DataFrame (which includes the data from your sample file), find the column whose header in the first row (axis 1) is 'id', and drop it in place (modify the existing object rather than create a new object missing that column; the call returns None).
The issue is that your sample data doesn't include a column with a header equal to 'id'.
With your current sample file, you can only do a drop where the value in axis 1 is 'a', 'b', 'c', 'd', or 'e'. Either correct your code to drop one of those values, or get a sample file with the correct header.
The documentation for Pandas isn't fantastic, but here is a good example of how to do a column drop in Pandas: http://chrisalbon.com/python/pandas_dropping_column_and_rows.html
** Below added in response to a comment from @saar
Here is my example code:
Sample.txt:
a,b,c,d,e
1,2,3,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8
Sample Code:
import pandas as pd
df = pd.read_csv('sample.txt')
print('Current DataFrame:')
print(df)
df.drop(['a'], 1, inplace=True)
print('\nModified DataFrame:')
print(df)
Output:
>>python panda_test.py
Current DataFrame:
a b c d e
0 1 2 3 4 5
1 2 3 4 5 6
2 3 4 5 6 7
3 4 5 6 7 8
Modified DataFrame:
b c d e
0 2 3 4 5
1 3 4 5 6
2 4 5 6 7
3 5 6 7 8
This is an example of dropping part of a dataframe by row index, in case you need it:
bad = pd.read_csv('bad_modified.csv')
# Pick 10 random rows, then drop them from the original by their index
A = bad.sample(n=10)
B = bad.drop(A.index, axis=0)
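As a side note, the bare positional 1 in df.drop(['a'], 1) was deprecated in later pandas releases; naming the axis, or using the columns keyword, says the same thing more clearly:
import pandas as pd

df = pd.read_csv('sample.txt')
df.drop(columns=['a'], inplace=True)  # equivalent to axis=1, but explicit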

How to read all rows of a csv file using pandas in python?

I am using the pandas module for reading the data from a .csv file.
I can write out the following code to extract the data belonging to an individual column as follows:
import pandas as pd
df = pd.read_csv('somefile.tsv', sep='\t', header=0)
some_column = df.column_name
print(some_column)  # Gives the values of all entries in the column
However, the file that I am trying to read now has more than 5000 columns and writing out the statement
some_column = df.column_name
is now not feasible. How can I get all the column values so that I can access them using indexing?
e.g. to extract the value present at the 100th row and the 50th column, I should be able to write something like this:
df([100][50])
Use DataFrame.iloc or DataFrame.iat, but Python counts from 0, so you need 99 and 49 to select the 100th row and 50th column:
df = df.iloc[99,49]
Sample - select the 3rd row and 4th column:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,10],
                   'E':[5,3,6],
                   'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 10 6 3
print (df.iloc[2,3])
10
print (df.iat[2,3])
10
Combining selection by column name with selection by row position is possible with Series.iloc or Series.iat:
print (df['D'].iloc[2])
10
print (df['D'].iat[2])
10
Pandas has positional indexing for dataframes, so you can use
df.iloc[[index]]["column header"]
The index is in a list because you can pass multiple indexes at once this way.
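For instance, a short sketch of that list form, reusing the df built in the answer above:
# Rows 0 and 2 of column 'B' (values 4 and 6 in the sample frame)
print(df.iloc[[0, 2]]['B'])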
