First column in dataframe lost after grouping - Python

Please excuse me if this question is too n00bish, I am brand new to Python and need to use it for work, which unfortunately means diving into higher level stuff without first understanding the basics...
I have a massive CSV of text transcripts which I read into a pandas dataframe. The transcripts are broken apart into segments in the original database they come from, so the rows must be grouped by ID to get a single record for each interaction. The format is something like this:
ID TEXT
1 This is the beginning of a convo
1 heres the middle
1 heres the end of the convo
2 this is the start of another convo...etc.
I used this code to group by ID and create singular records:
df1 = df.groupby('ID').TEXT.apply(' '.join)
This code worked great, but now I am stuck with a Series that no longer seems to recognize the index "ID"; I think it's been merged with the text or something. When I use to_frame() the problem remains. How might I separate the ID out again and use it to index the data?

groupby returns the grouped column as the index. Looking at your code, this is what I see:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2],
                   'TEXT': ['This is the beginning of a convo',
                            'heres the middle',
                            'heres the end of the convo',
                            'this is the start of another convo...etc.']})
df1 = df.groupby('ID').TEXT.apply(' '.join)
print(df1)
ID
1 This is the beginning of a convo heres the mid...
2 this is the start of another convo...etc.
Name: TEXT, dtype: object
You can take the series df1 and reset its index if you want ID as a column in a dataframe, or move on with ID as the index of the series, which can be handy depending on what your next steps will be.
df1 = df1.reset_index()
print(df1)
ID TEXT
0 1 This is the beginning of a convo heres the mid...
1 2 this is the start of another convo...etc.
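If you'd rather not reset the index afterwards, a minimal sketch (reusing the toy df above): as_index=False keeps the grouping key as a regular column from the start.
# as_index=False keeps ID as a column instead of moving it into the index
df1 = df.groupby('ID', as_index=False).agg({'TEXT': ' '.join})
print(df1)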

Related

pandas dataframe aggregate adding index in some cases

I have a pandas dataframe with an id column and relatively large text in another column. I want to group by the id column and concatenate all the large texts into one single text whenever the id repeats. It works great in a simple toy example, but when I run it on my real data it adds the row index into the final concatenated text. Here is my example code:
data = {"A":[1,2,2,3],"B":['asdsa','manish','shukla','wfs']}
testdf = pd.DataFrame(data)
testdf = testdf.groupby(['A'],as_index=False).agg({'B':" ".join})
As you can see, this code works great here, but when I run it on my real data it adds indexes at the beginning of column B, e.g. for A=2 it says something like "1 manish \n 2 shukla". It obviously works in this toy example, so I have no idea why it misbehaves on the larger real text. Any pointers? I tried to search, but apparently no one else has run into this issue.
OK, I figured out the answer: it does that if any rows in the dataframe have NA or nulls in the text column. Once I removed the NAs and nulls, it worked.
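For reference, a minimal sketch of that fix on the toy frame above, with a NaN injected into B (hypothetical data): drop or fill the nulls before grouping, so everything handed to " ".join is a string.
import pandas as pd
import numpy as np

data = {"A": [1, 2, 2, 3], "B": ['asdsa', np.nan, 'shukla', 'wfs']}
testdf = pd.DataFrame(data)
# drop the NaN rows first (or use testdf['B'].fillna('') to keep them)
testdf = testdf.dropna(subset=['B'])
testdf = testdf.groupby(['A'], as_index=False).agg({'B': " ".join})
print(testdf)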

Compare two date columns in pandas DataFrame to validate third column

Background info
I'm working on a DataFrame where I have successfully joined two different datasets of football players using fuzzymatcher. These datasets did not have keys for an exact match and instead had to be matched by the players' names. An example match of the name columns from the two databases is the following:
long_name name
L. Messi Lionel Andrés Messi Cuccittini
As part of the validation process for an 18,000-row database, I want to check the two date-of-birth columns in the merged DataFrame df, ensuring that the columns match like the example below:
dob birth_date
1987-06-24 1987-06-24
Both date columns have been converted from strings to dates using pd.to_datetime(), e.g.
df['birth_date'] = pd.to_datetime(df['birth_date'])
My question
I have another column called 'value'. I want to update my pandas DataFrame so that if the two date columns match, the entry is unchanged, but if they don't match, the data in the value column is changed to null. This is something I can do quite easily in Excel with a date-diff calculation, but I'm unsure how to do it in pandas.
My current code is the following:
df.loc[(df['birth_date'] != df['dob']),'value'] = np.nan
Reason for this step (feel free to skip)
The reason for this code is that it will quickly show me fuzzy matches that are inaccurate (approx 10% of total database) and allow me to quickly fix those.
Ideally I also need to work on the matching algorithm to ensure a perfect date match; however, my current algorithm works quite well in its current state and the project is nearly complete. If this is something you know about, I'd be happy to hear any advice on it too.
Many thanks in advance!
IIUC:
Try np.where. It works as follows:
np.where(condition, x, y)  # if condition, assign x, else assign y
Here the condition is df['birth_date'] != df['dob'], x is np.nan, and y is the prevailing df['value']:
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
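A self-contained sketch of the whole pattern on toy data (hypothetical values, not your real frame):
import pandas as pd
import numpy as np

df = pd.DataFrame({'dob': pd.to_datetime(['1987-06-24', '1990-01-01']),
                   'birth_date': pd.to_datetime(['1987-06-24', '1991-05-05']),
                   'value': [100.0, 50.0]})
# keep value where the dates agree, blank it out where they differ
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
print(df)  # the second row's value becomes NaN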

How to deal with KeyError: 0 or KeyError: 1, etc.

I am new to Python and this data science world, and I am trying to play with different datasets.
In this case I am using the housing price index from Quandl, but unfortunately I get stuck when I need to take the abbreviated state names from the wiki page, always getting the same KeyError.
import quandl
import pandas as pd

# pull every single housing price index from quandl
# quandl api key
api_key = 'xxxxxxxxxxxx'

# get stuff from quandl
df = quandl.get('FMAC/HPI_AK', authtoken=api_key)  # Alaska
##print(df.head())

# get 50 states using pandas read_html from wikipedia
fifty_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
##print(fifty_states[0][1])  # first data frame is index 0, looking for column 1, from element 1 on

# get quandl frannymac query names for each of the 50 states
for abbv in fifty_states[0][1][2:]:
    print('FMAC/HPI_' + str(abbv))
The problem occurs in the following step:
fifty_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
##print(fifty_states[0][1])  # first data frame is index 0, looking for column 1, from element 1 on
I have tried different ways to get just the abbreviation, but it does not work:
for abbv in fifty_states[0][1][2:]:
    print('FMAC/HPI_' + str(abbv))

for abbv in fifty_states[0][1][1:]:
    print('FMAC/HPI_' + str(abbv))
It always raises KeyError: 0.
I just need this step to work, and to have the following output:
FMAC/HPI_AL,
FMAC/HPI_AK,
FMAC/HPI_AZ,
FMAC/HPI_AR,
FMAC/HPI_CA,
FMAC/HPI_CO,
FMAC/HPI_CT,
FMAC/HPI_DE,
FMAC/HPI_FL,
FMAC/HPI_GA,
FMAC/HPI_HI,
FMAC/HPI_ID,
FMAC/HPI_IL,
FMAC/HPI_IN,
FMAC/HPI_IA,
FMAC/HPI_KS,
FMAC/HPI_KY,
FMAC/HPI_LA,
FMAC/HPI_ME
for the 50 US states, and then proceed to do some data analysis on this data.
Can anybody tell me what I am doing wrong? Cheers.
Note that fifty_states is a list of DataFrames, filled with
content of tables from the source page.
The first of them (at index 0 in fifty_states) is the table of US states.
If you don't know the column names in a DataFrame (e.g. df),
then to get column 1 from it (numbering from 0), run:
df.iloc[:, 1]
So, since we want this column from fifty_states[0], run:
fifty_states[0].iloc[:, 1]
Your code failed because you attempted to apply [1] to this DataFrame,
but this DataFrame has no column named 1.
Note that e.g. fifty_states[0][('Cities', 'Capital')] gives a proper result,
because this DataFrame has a MultiIndex on columns, and one of the columns
has Cities at the first MultiIndex level and Capital at the second.
And getting back to your code, run:
for abbv in fifty_states[0].iloc[:, 1]:
    print('FMAC/HPI_' + str(abbv))
Note that [2:] is not needed. You probably wanted to skip the 2 initial rows
of the <table> HTML tag, which contain the column names, but in pandas these
are actually kept in the MultiIndex on columns, so you don't need to skip
anything to get all the values.
If you want these strings as a list, for future use, the code can be:
your_list = ('FMAC/HPI_' + fifty_states[0].iloc[:, 1]).tolist()
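And if you then want to pull each state's index with that list, a sketch assuming the quandl setup from your question (api_key as you defined it):
query_names = ('FMAC/HPI_' + fifty_states[0].iloc[:, 1]).tolist()
for query in query_names:
    state_df = quandl.get(query, authtoken=api_key)  # one request per state
    print(query, state_df.shape)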

How to combine multiple rows of data into a single string per group

To preface: I'm new to using Python.
I'm working on cleaning up a file where data was spread across multiple rows. I'm struggling to find a solution that will concatenate multiple text strings into a single cell. The .csv data looks similar to this:
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
with one or two blank rows between each entry, too.
The number of rows used for 'description' isn't consistent: sometimes it's just one cell, sometimes up to about four. The ideal output turns these multiple rows into a single row of useful data, without all the wasted space. I thought maybe I could create a series of masks by copying the data across a few columns, shifting up, and then iterating in some way, but I haven't found a solution that matches what I'm trying to do. This is where I'm at so far:
#Add column of description stuff and shift up a row for concatenation
DogData['Z'] = DogData['Y'].shift(-1)
DogData['AA'] = DogData['Z'].shift(-1)
DogData['AB'] = DogData['AA'].shift(-1)
#create series checks to determine how to concat values properly
YNAs = DogData['Y'].isnull()
ZNAs = DogData['Z'].isnull()
AANAs = DogData['AA'].isnull()
The idea here was basically that I'd iterate over column 'Y', check if the same row in column 'Z' was NA or had a value, and concat if it did. If not, just use the value in 'Y'. Carry that logic across but stopping if it encountered an NA in any subsequent columns. I can't figure out how to do that, or if there's a more efficient way to do this.
What do I have to do to get to my end result? I can't figure out the right way to iterate or concatenate in the way I was hoping to.
'''
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
'''
import pandas as pd

df = pd.read_clipboard(sep=',')
(df.fillna(method='ffill')
   .groupby(['name', 'date'])
   .description.apply(lambda x: ', '.join(x))
   .to_frame(name='description'))
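Reading from the file itself instead of the clipboard, a sketch assuming the file is named dogdata.csv (hypothetical name): read_csv skips fully blank rows by default, and the forward-fill carries each name/date down over its description rows.
import pandas as pd

df = pd.read_csv('dogdata.csv')  # hypothetical filename; blank rows are skipped by default
# forward-fill name and date so every description row carries its record's keys
df[['name', 'date']] = df[['name', 'date']].ffill()
out = (df.groupby(['name', 'date'])
         .description.apply(' '.join)
         .reset_index())
print(out)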
I'm not sure I follow exactly what you mean. I took that text, saved it as a csv file, and successfully read it into a pandas dataframe.
import pandas as pd
df = pd.read_csv('test.csv')
df
Output:
name date description
0 bundy 12-12-2017 good dog
1 NaN NaN smells kind of weird
2 NaN NaN needs to be washed
Isn't this the output you require?

Importing Two Financial Stocks with Python

So, I'm trying to import the stock prices for the S&P 500 (SPY) and BP (the O&G/energy company). The result I am looking for is a "table" of 3 columns: one for dates, one for the Adj Close of SPY, and one for the Adj Close of BP. However, my code produces:
ValueError: columns overlap but no suffix specified: Index(['SPY'], dtype='object')
I understand what this error is telling me, though: the extracted 'Adj Close' column ends up overlapping; a column called 'SPY' appears more than once, and the join() method I am using is confused because column names must be unique. Well, something like that is how I've interpreted it...
The code:
import pandas as pd

def test_run():
    start_date = '2016-03-10'  # start date parameter
    end_date = '2017-03-10'    # end date parameter
    dates = pd.date_range(start_date, end_date)
    df1 = pd.DataFrame(index=dates)  # create empty dataframe df1

    # create dataframe for SPY stock
    dfSPY = pd.read_csv("C:\SPY.csv", index_col="Date", parse_dates=True,
                        usecols=['Date', 'Adj Close'], na_values=['nan'])
    # rename Adj Close column to SPY to prevent clash
    dfSPY = dfSPY.rename(columns={'Adj Close': 'SPY'})

    # join the 2 dataframes using DataFrame.join(), and how='inner'
    df1 = df1.join(dfSPY, how='inner')

    # read in more stocks; SPY & BP
    symbols = ['SPY', 'BP']
    for symbol in symbols:
        df_temp = pd.read_csv("C{}.csv".format(symbol), index_col='Date',
                              parse_dates=True, usecols=['Date', 'Adj Close'],
                              na_values=['nan'])
        # rename to prevent clash
        df_temp = df_temp.rename(columns={'Adj Close': symbol})
        df1 = df1.join(df_temp)  # use default how='left'

    print(df1)

if __name__ == "__main__":
    test_run()
So, that's the code I've got. If there's anyone out there who can shed some light on what on Earth I've done wrong, please let me know.
Many thanks!
The code you provided is overwriting the value of df_temp in your for loop; it will only end up with the value assigned during the last iteration. I assume the last two lines posted below are actually inside your for loop:
for symbol in symbols:
    df_temp = pd.read_csv("C{}.csv".format(symbol), index_col='Date', parse_dates=True,
                          usecols=['Date', 'Adj Close'], na_values=['nan'])
    df_temp = df_temp.rename(columns={'Adj Close': symbol})
    df1 = df1.join(df_temp)  # use default how='left'
There's already an 'SPY' column after you join dfSPY to df1. You have 'SPY' again in your list of symbols, which throws an error because pandas can't join dataframes with overlapping column names unless you specify a suffix to distinguish the columns.
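Two minimal fixes, either of which should clear the error (sketched against the code as posted):
# Option 1: SPY is already joined into df1, so loop over the remaining symbols only
symbols = ['BP']

# Option 2: keep the loop as-is but give join a suffix for overlapping columns
# df1 = df1.join(df_temp, rsuffix='_dup')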
I just wanted to get this question closed. I gave up on importing the .CSV files of stocks and just "imported" directly from Yahoo Finance. This doesn't really answer my original question, so I still don't know what went wrong, but I feel the following solution is much more efficient and "elegant":
import pandas as pd
import pandas.io.data as web
import datetime
start = datetime.datetime(2000,1,1)
end = datetime.date.today()
BP=web.DataReader("BP","yahoo",start,end)
SPY=web.DataReader("SPY","yahoo",start,end)
df_stocks=pd.DataFrame({"BP":BP["Adj Close"],"SPY":SPY["Adj Close"]})
df_stocks.tail()
BP SPY
Date
2017-03-07 33.869999 237.000000
2017-03-08 33.310001 236.559998
2017-03-09 33.500000 236.860001
2017-03-10 34.330002 237.690002
2017-03-13 34.070000 237.809998
Thanks to anyone who had a look.
