How to properly index a 2D array in a pandas dataframe? - python

I am reading a .xlsx Excel file into a pandas dataframe.
Here is what it looks like in text form:
        1            2            3            4
3.5     15.48403728  23.22605592  30.96807456  38.7100932
4       17.41954194  26.12931291  34.83908388  43.54885485
4.5     19.3550466   29.0325699   38.7100932   48.3876165
5       21.29055126  31.93582689  42.58110252  53.22637815
As you can see, the top-left cell is empty.
The row labels are amounts, the column labels are materials, and the values are prices.
I really don't know how to assign names properly for indexing.
If I were to try
df.columns = ['Material 1',...'Material 4']
It errors because it expects five column names, since there are five columns.
Really what I want is to label the top-left column as amount/material or something like that, but I don't have a clue how to do it.
I think the best way would be for me to try and transform this dataframe into something like this:
Amount Material Price
3.5 1 15.48...
3.5 2 23.22...
...
5 4 53.22...
as this will hopefully make it easier to deal with.
Any idea how to do this?
I believe this is called 'unpivot columns' in Excel, or something like that?

I am not sure how you read the Excel file, but if all you want is to rename your columns, you can set the column names while reading the Excel file itself.
Supposing my file name is MyExcelFile.xlsx and the column names I want are 'Amount', 'Material_1', 'Material_2', 'Material_3' and 'Material_4', then I would read it as follows. If these column names do not exist (in the Excel file), then you have to pass header=None explicitly.
MyDF = pd.read_excel('/FullPathToYourExcelFile/MyExcelFile.xlsx', names=['Amount','Material_1','Material_2','Material_3','Material_4'], header=None)
See the documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html. If you have already done this, as I suggested above, then I am sorry I underestimated your problem requirements. All the best.
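To get the long Amount/Material/Price layout described in the question (the 'unpivot' operation), one option is DataFrame.melt. A minimal sketch, assuming MyDF was read with the column names above:
# Unpivot the wide table: one row per (Amount, Material) pair, with the price as the value.
long_df = MyDF.melt(id_vars='Amount', var_name='Material', value_name='Price')
print(long_df.head())
The Material column will then contain the former column labels ('Material_1', ..., 'Material_4') and Price the corresponding values.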

Related

I need help concatenating a CSV file and a pandas dataframe together without duplicates

My code currently looks like this:
df1 = pd.DataFrame(statsTableList)
df2 = pd.read_csv('StatTracker.csv')
result = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
I get an error and I'm not sure why.
The goal of my program is to pull data from an API and then write it all to a file for analysis. df1 is, let's say, the first 100 games written to the CSV file as the first version. df2 is me reading back those first 100 games the second time around and comparing them to df1 (new data, the next 100 games) to check for duplicates and delete them.
The part that is not working is the drop-duplicates part. It gives me an 'unhashable type: list' error; I would assume that's because the two dataframes are lists of dictionaries. The goal is to pull 100 games of data, then pull the next 50, but if I pull game number 100 again, to drop that one and just add 101-150, and then add it all to my CSV file. Then if I run it again, to pull 150-200, but drop 150 if it's a duplicate, etc.
Based on your explanation, you can use this one-liner to find the rows unique to df1:
df_diff = df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
This code checks whether each row exists in the other dataframe. To do the comparison, it converts each row to a tuple (applying the tuple conversion along axis 1, the row axis).
This solution is indeed slow, because it compares each row of df1 to all rows of df2, so it has O(n²) time complexity.
If you want a more optimised version, try pandas' built-in compare method (note that compare requires the two dataframes to have identical labels and shape):
df1.compare(df2)
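As a rough, self-contained illustration of the row-to-tuple comparison above (the column names and values here are made up):
import pandas as pd

# Two hypothetical frames that share two overlapping rows.
df1 = pd.DataFrame({'game_id': [1, 2, 3], 'score': [10, 20, 30]})
df2 = pd.DataFrame({'game_id': [2, 3, 4], 'score': [20, 30, 40]})

# Convert each row to a tuple and keep only the df1 rows not present in df2.
df_diff = df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
print(df_diff)  # only the game_id == 1 row remains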

Pandas deleting partly duplicate rows with wrong values in specific columns

I have a large dataframe from a CSV file which has a few dozen columns. I have another CSV file which I concatenated to the original. Now, the second file has exactly the same structure, but a particular column may have incorrect values. I want to delete rows which are duplicates except for this one wrong column. For example, in the data below the last row should be removed. (The names of the specimens (Albert, etc.) are unique.) I have been struggling to find a way of deleting only the data which has the wrong value, without risking deleting the correct row.
0 Albert alive
1 Newton alive
2 Galileo alive
3 Copernicus dead
4 Galileo dead
...
Any help would be greatly appreciated!
You could use this to determine if a name is mentioned more than once:
df['RN'] = df.groupby(['Name']).cumcount() + 1
You can also expand it out to include more columns in the groupby if there are any more constraints you want to put on the duplicates:
df['RN'] = df.groupby(['Name', 'Another Column']).cumcount() + 1
The advantage I like with this is that it gives you more control over the RN selection if you need it, e.g. df.loc[df['RN'] > 2].
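A minimal sketch of using this to drop the later (presumably wrong) duplicates, assuming the two columns are called 'Name' and 'Status' (those names are made up; substitute your own):
import pandas as pd

# Toy version of the data above.
df = pd.DataFrame({
    'Name': ['Albert', 'Newton', 'Galileo', 'Copernicus', 'Galileo'],
    'Status': ['alive', 'alive', 'alive', 'dead', 'dead'],
})

# Number each occurrence of a name: 1 for the first row, 2 for the second, ...
df['RN'] = df.groupby(['Name']).cumcount() + 1

# Keep only the first occurrence of each name and drop the helper column.
cleaned = df.loc[df['RN'] == 1].drop(columns='RN')
print(cleaned)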

pandas dataframe aggregate adding index in some cases

I have a pandas dataframe with an id column and relatively large text in another column. I want to group by the id column and concatenate all the large texts into one single text whenever the id repeats. It works great in a simple toy example, but when I run it on my real data it adds the index of the added rows to the final concatenated text. Here is my example code:
data = {"A":[1,2,2,3],"B":['asdsa','manish','shukla','wfs']}
testdf = pd.DataFrame(data)
testdf = testdf.groupby(['A'],as_index=False).agg({'B':" ".join})
As you can see, this code works great, but when I run it on my real data it adds indexes at the beginning of column B, so it will say something like "1 manish \n 2 shukla" for A=2. It obviously works here, but I have no idea why it misbehaves with the larger text in my real data. Any pointers? I tried to search, but apparently no one else has run into this issue.
OK, I figured out the answer: if any rows in the dataframe have NaN or nulls, it does that. Once I removed the NaNs and nulls, it worked.
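A minimal sketch of that fix on the toy example, with a missing value added to column B to stand in for the problematic rows (drop or fill the NaNs before aggregating, then join as before):
import pandas as pd

# Example with a missing value in B; the NaN is what triggers the odd behaviour.
data = {"A": [1, 2, 2, 3], "B": ['asdsa', 'manish', None, 'wfs']}
testdf = pd.DataFrame(data)

# Drop the missing values first, then concatenate per group as before.
testdf = testdf.dropna(subset=['B'])
result = testdf.groupby(['A'], as_index=False).agg({'B': " ".join})
print(result)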

Cleaning survey data in python - how to find and clean common rows in two files?

I am working on a survey data analysis project which consists of 2 Excel files: the pre-survey file contains 800+ response records, while the post-survey file contains around 500 responses. Both of them have (at least) one common column, SID (Student ID). Something (Y) happened in between, and I am interested in analysing the effectiveness of Y, and to what degree Y impacts different categories of people.
What adds more complexity is that each Excel file contains multiple tabs. Different interviewers interviewed several interviewees and documented different sections of the survey in each tab. Columns may or may not be the same for different tabs, so it would be hard to compile them into one file. (Or does it actually make sense to combine them into one with lots of null values?)
I am trying to find the students who did both the pre- and post-surveys. How can I do this across sheets and files using python/pandas/other packages?
Bonus if you could also suggest an approach to solving the problem.
If I'm understanding this correctly, your data is currently formatted like this:
survey1.xlsx
Sheet 1 (interviewer a)
STU-ID QUESTION 1 RESPONSE 1 QUESTION 2 RESPONSE 2
00001 tutoring? True lunch a? False
survey1.xlsx
Sheet 2 (interviewer b)
STU-ID QUESTION 1 RESPONSE 1 QUESTION 2 RESPONSE 2
00004 tutoring? True lunch a? TRUE
survey2.xlsx
Sheet 1
STU-ID QUESTION 1 RESPONSE 1 Tutorer GPA
00001 improvement? True Jim 3.5
survey2.xlsx
Sheet 2 (interviewer b)
STU-ID QUESTION 1 RESPONSE 1 Tutorer GPA
00004 improvement? yes Sally 2.8
If that's the case, and without knowing the data well, I would combine the tabs so that the pre-survey has one row per unique student ID. (I'm not sure whether the same student was interviewed by multiple interviewers; if they were, you may need to do a groupby, but that sounds messy.)
Then I would do the same for the post-survey responses, and join them into a single dataframe. From that df, create a new df with only the responses you care about (this could get rid of some NA answers).
Do a df.describe() and a df.dtypes.
Transform the data so that answers such as "yes/no" become booleans, or at least so they're all in the same format, and do the same for numerical responses (int64 or float64).
Finally, I would dropna, so that the df meets your requirement of containing responses from both the first survey and the second survey.
Side note: with only 800 responses, it may be easier to do this just in Excel. If you aren't comfortable with Python, it could take you several hours to accomplish this, when in Excel it could take 20 minutes.
If your goal is to learn Python, then go for it.
Python
import pandas as pd

# cols is your list of the columns to keep, e.g. cols = ['STU-ID', 'QUESTION 1', 'RESPONSE 1']
df_s1s1 = pd.read_excel('survey1.xlsx', na_values="Missing", sheet_name='sheet 1', usecols=cols)
df_s1s1.head()
df_s1s2 = pd.read_excel('survey1.xlsx', na_values="Missing", sheet_name='sheet 2', usecols=cols)
df_s1s2.head()
And then the same for the second survey file:
df_s2s1 = pd.read_excel('survey2.xlsx', na_values="Missing", sheet_name='sheet 1', usecols=cols)
df_s2s1.head()
df_s2s2 = pd.read_excel('survey2.xlsx', na_values="Missing", sheet_name='sheet 2', usecols=cols)
df_s2s2.head()
To add the different sheets to the same dataframe as rows, you would use something like this:
df_survey_1 = pd.concat([df_s1s1, df_s1s2])
df_survey_1.head()
Then the same for the second survey:
df_survey_2 = pd.concat([df_s2s1, df_s2s2])
df_survey_2.head()
And then to create the larger dataframe with all of the columns, you would use something like this:
master_df = pd.merge(df_survey_1, df_survey_2, left_on='STU-ID', right_on='STU-ID')
Drop NA
master_df = master_df.dropna(axis = 0, how ='any')
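If the goal is specifically the students who completed both surveys, a small sketch building on the df_survey_1 and df_survey_2 frames above (assuming the shared column is named 'STU-ID', as in the example data):
# An inner merge keeps only students whose STU-ID appears in both surveys.
both = pd.merge(df_survey_1, df_survey_2, on='STU-ID', how='inner')

# Or, if you just want the set of common student IDs:
common_ids = set(df_survey_1['STU-ID']) & set(df_survey_2['STU-ID'])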
hope this helps

How to combine multiple rows of data into a single string per group

To preface: I'm new to using Python.
I'm working on cleaning up a file where data was spread across multiple rows. I'm struggling to find a solution that will concatenate multiple text strings to a single cell. The .csv data looks similar to this:
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
with one or two blank rows between each entry, too.
The number of rows used for 'description' isn't consistent. Sometimes it's just one cell, sometimes up to about four. The ideal output turns these multiple rows into a single row of useful data, without all the wasted space. I thought maybe I could create a series of masks by copying the data across a few columns, shifted up, and then iterating in some way. I haven't found a solution that matches what I'm trying to do, though. This is where I'm at so far:
# Add columns of description stuff and shift up a row for concatenation
DogData['Z'] = DogData['Y'].shift(-1)
DogData['AA'] = DogData['Z'].shift(-1)
DogData['AB'] = DogData['AA'].shift(-1)
# Create series checks to determine how to concat values properly
YNAs = DogData['Y'].isnull()
ZNAs = DogData['Z'].isnull()
AANAs = DogData['AA'].isnull()
The idea here was basically that I'd iterate over column 'Y', check if the same row in column 'Z' was NA or had a value, and concat if it did; if not, just use the value in 'Y'. Carry that logic across, but stop if an NA is encountered in any subsequent column. I can't figure out how to do that, or if there's a more efficient way to do this.
What do I have to do to get to my end result? I can't figure out the right way to iterate or concatenate in the way I was hoping to.
import pandas as pd

# Copy the CSV text below to the clipboard; read_clipboard then parses it.
'''
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
'''
df = pd.read_clipboard(sep=',')

# Forward-fill name/date into the blank rows, then join each group's
# description lines into a single string.
df.fillna(method='ffill').groupby([
    'name',
    'date'
]).description.apply(lambda x: ', '.join(x)).to_frame(name='description')
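If you would rather not rely on the clipboard, a self-contained sketch of the same approach (io.StringIO stands in for the CSV file; df.ffill() is used instead of the older fillna(method='ffill'), which is deprecated in recent pandas versions):
import io
import pandas as pd

csv_text = '''name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
'''

df = pd.read_csv(io.StringIO(csv_text))

# Forward-fill name/date into the blank cells, then join each group's descriptions.
result = (
    df.ffill()
      .groupby(['name', 'date'])['description']
      .apply(', '.join)
      .to_frame(name='description')
)
print(result)
# One row for ('bundy', '12-12-2017'):
# "good dog, smells kind of weird, needs to be washed"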
I'm not sure I follow exactly what you mean. I took that text, saved it as a csv file, and successfully read it into a pandas dataframe.
import pandas as pd
df = pd.read_csv('test.csv')
df
Output:
name date description
0 bundy 12-12-2017 good dog
1 NaN NaN smells kind of weird
2 NaN NaN needs to be washed
Isn't this the output you require?
