Pandas Column of Lists to Separate Rows - python

I've got a dataframe that contains analysed news articles w/ each row referencing an article and columns w/ some information about that article (e.g. tone).
One column of that df contains a list of FIPS country codes of the locations that were mentioned in that article.
I want to "extract" these country codes such that I get a dataframe in which each mentioned location has its own row, along with the other columns of the original row in which that location was referenced (there will be multiple rows with the same information, but different locations, as the same article may mention multiple locations).
I tried something like this, but iterrows() is notoriously slow, so is there any faster/more efficient way for me to do this?
Thanks a lot.
'events' is the column that contains the locations
'event_cols' are the columns from the original df that I want to retain in the new df.
'df_events' is the new data frame
for i, row in df.iterrows():
    for location in df.events.loc[i]:
        try:
            df_storage = pd.DataFrame(row[event_cols]).T
            df_storage['loc'] = location
            df_events = df_events.append(df_storage)
        except ValueError as e:
            continue

I would group the DataFrame with groupby(), explode the lists with a combination of apply and a lambda function, and then reset the index and drop the level column that is created to clean up the resulting DataFrame.
df_events = df.groupby(['event_col1', 'event_col2', 'event_col3'])['events']\
    .apply(lambda x: pd.DataFrame(x.values[0]))\
    .reset_index().drop('level_3', axis=1)
In general, I always try to find a way to use apply() before most other methods, because it is often much faster than iterating over each row.
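For what it's worth, on pandas 0.25 or newer, DataFrame.explode may be a simpler route to the same result. The following is only a minimal sketch: the sample frame and the column names tone and source are made up for illustration, and 'events', 'event_cols' and 'loc' follow the names used in the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the analysed articles.
df = pd.DataFrame({
    "tone": [1.5, -0.3],
    "source": ["a", "b"],
    "events": [["US", "FR"], ["GM"]],  # list of country codes per article
})
event_cols = ["tone", "source"]  # columns to carry over to every new row

df_events = (
    df[event_cols + ["events"]]
    .explode("events")                  # one row per list element
    .rename(columns={"events": "loc"})  # match the column name in the question
    .reset_index(drop=True)
)
print(df_events)
```

Each article row is repeated once per location it mentions, with the other columns copied along.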

Related

Iterate over a pandas DataFrame & Check Row Comparisons

I'm trying to iterate over a large DataFrame that has 32 fields and over 1 million rows.
What I'm trying to do is iterate over each row and check whether any of the other rows have duplicate information in 30 of the fields, while the other two fields have different information.
I'd then like to store the ID info of the rows that meet these conditions.
So far I've been trying to figure out how to check two rows with the code below. It seems to work when comparing single columns but throws an error when I try more than one column. Could anyone advise on how best to approach this?
for index in range(len(df)):
    for row in range(index, len(df)):
        if df.iloc[index][1:30] == df.iloc[row][1:30]:
            print(df.iloc[index])
As a general rule, you should always try not to iterate over the rows of a DataFrame.
It seems that what you need is the pandas duplicated() method. If you have a list of the 30 columns you want to use to determine duplicate rows, the code looks something like this:
df.duplicated(subset=['col1', 'col2', 'col3']) # etc.
Full example:
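# Set up test df
import pandas as pd
from io import StringIO

# The header names 4 fields but each row has 5, so pandas uses the
# first field (One, Two, Three) as the index.
sub_df = pd.read_csv(
    StringIO("""ID;col1;col2;col3
One;23;451;42;31
Two;24;451;42;54
Three;25;513;31;31"""
    ),
    sep=";"
)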
Find which rows are duplicates in col1 and col2. Note that by default the first instance is not marked as a duplicate, but later ones are. This behaviour can be changed with the keep parameter, as described in the documentation for duplicated().
mask = sub_df.duplicated(["col1", "col2"])
This gives a boolean Series that is True for each row whose col1 and col2 values repeat those of an earlier row.
Now, filter using the mask.
sub_df["ID"][sub_df.duplicated(["col1", "col2"])]
Of course, you can do the last two steps in one line.
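To get closer to the original question (same values in 30 fields, different values in the other two), one possible approach is to group on the matching columns and count distinct values in the remaining ones. This is only a sketch under assumptions: the tiny frame, the names match_cols and differ_cols, and the reading that both of the other fields must vary within a group are all mine, not the asker's.

```python
import pandas as pd

# Hypothetical data: col1/col2 stand in for the 30 fields that must match,
# x/y stand in for the 2 fields that must differ.
df = pd.DataFrame({
    "ID":   ["a", "b", "c", "d"],
    "col1": [1, 1, 2, 2],
    "col2": [5, 5, 6, 6],
    "x":    [10, 11, 20, 20],
    "y":    [7, 8, 9, 9],
})
match_cols = ["col1", "col2"]
differ_cols = ["x", "y"]

# For every row, the number of distinct x and y values within its group
# of rows that share the same match_cols values.
n_distinct = df.groupby(match_cols)[differ_cols].transform("nunique")

# Keep rows whose group has more than one distinct value in both x and y.
mask = n_distinct.gt(1).all(axis=1)
ids = df.loc[mask, "ID"].tolist()
print(ids)
```

If it is enough for either of the two fields to differ, swapping all(axis=1) for any(axis=1) should cover that case.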

EDA for loop on multiple columns of dataframe in Python

Just a random q. If there's a dataframe, df, from the Boston Homes ds, and I'm trying to do EDA on a few of the columns, set to a variable feature_cols, which I could use afterwards to check for na, how would one go about this? I have the following, which is throwing an error:
This is what I was hoping to try to do after the above:
Any feedback would be greatly appreciated. Thanks in advance.
There are two problems in your pictures. The first is a KeyError: to access a subset of a dataframe's columns, you need to pass the column names in a list, not a tuple, so the first line should be
feature_cols = df[['RM','ZN','B']]
However, this will return a dataframe with three columns. What you are trying to do in the for loop will not work with pandas. Rather than looping over the columns, you can use this one-liner:
df.isna().sum()
This will print the name of every column in the dataframe along with the number of missing values in that column. Of course, if you want to check only a subset of columns, you can replace df with df[list_of_column_names].
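As a small illustration of the subset case, here is a sketch with made-up values; only the column names RM, ZN and B come from the question, and the AGE column is added just to show that it is left out of the count.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few columns of the housing data.
df = pd.DataFrame({
    "RM":  [6.5, np.nan, 7.1],
    "ZN":  [18.0, 0.0, np.nan],
    "B":   [396.9, 392.8, 390.1],
    "AGE": [65.2, np.nan, 61.1],
})
feature_cols = ["RM", "ZN", "B"]

# Count missing values per column, restricted to the chosen columns.
na_counts = df[feature_cols].isna().sum()
print(na_counts)
```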
You only need to store the column names in a list in order to access multiple columns, for example
feature_cols = ['RM','ZN','B']
You can then select those columns with
x = df[feature_cols]
Now to iterate on columns of df, you can use
for column in df[feature_cols]:
    print(df[column]) # or anything
As per your updated comment, if your end goal is only to see null counts, you can get them without looping, e.g.
df[feature_cols].info(verbose=True, show_counts=True)  # on pandas older than 1.2, pass null_counts=True instead

Using describe() method to exclude a column

I am new to using python with data sets and am trying to exclude a column ("id") from being shown in the output. Wondering how to go about this using the describe() and exclude functions.
describe() filters by data type: include and exclude take dtypes, not column names. If id is the only column with its data type, then
df.describe(exclude=[datatype])
or if you just want to remove the column(s) in describe, then try this
cols = set(df.columns) - {'id'}
df1 = df[list(cols)]
df1.describe()
Ta-da, it's done. For more info, see the pandas documentation for describe().
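Another way to leave a column out by name, which also keeps the original column order (a set does not), is DataFrame.drop. A minimal sketch with made-up data; only the 'id' column name comes from the question.

```python
import pandas as pd

# Hypothetical data with an 'id' column we do not want summarised.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "height": [1.6, 1.7, 1.8],
    "weight": [60, 70, 80],
})

# drop() returns a new frame without 'id'; df itself is unchanged.
summary = df.drop(columns="id").describe()
print(summary)
```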
You can do that by slicing your original DF to remove the 'id' column. One way is through .iloc. Let's suppose the column 'id' is the first column of your DF; then you could do this:
df.iloc[:,1:].describe()
The first colon represents the rows, the second the columns.
Although somebody already responded with an example from the official docs, which is more than enough, I just want to add this, since it might help a few people:
If your DataFrame is large (say, hundreds of columns), removing one or two columns might not be enough; instead, create a smaller DataFrame holding only what you're interested in and go from there.
Example of removing 2+ columns:
# All columns except the ones you don't want
columns_to_keep = set(your_bigger_data_frame.columns) - {'column_1', 'column_2', 'column3', 'etc'}
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_to_keep)]
your_new_smaller_data_frame.describe()
IF your DataFrame is medium/small size, you already know every column and you only need a few columns, just create a new DataFrame and then apply describe():
I'll give an example from reading a .csv file and then read a smaller portion of that DataFrame which only holds what you need:
df = pd.read_csv(r'.\docs\project\file.csv')  # raw string so the backslashes are not treated as escapes
df = df[['column_1','column_2','column_3','etc']]
df.describe()
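Note that output.describe(exclude=['id']) does not work, because exclude expects data types rather than column names; use output.drop(columns=['id']).describe() instead.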

pandas max function results in inoperable DataFrame

I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, has a better way to generate this new DataFrame?
The result is a Series, not a DataFrame. To get a one-column DataFrame, use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
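A quick sketch of the difference, using a made-up four-column frame (the column names a to d are placeholders):

```python
import pandas as pd

# Hypothetical four-column frame.
df1 = pd.DataFrame({"a": [1, 5], "b": [4, 2], "c": [3, 3], "d": [0, 9]})

row_max = df1.max(axis=1)         # a Series: one value per row, no column name
df2 = row_max.to_frame("maximum")  # a DataFrame with a single named column

print(type(row_max).__name__, row_max.dtype)
print(df2)
```

A Series has a dtype attribute and can be renamed with rename(), so the result of max(axis=1) is usable as it is; to_frame is only needed when a DataFrame is specifically required.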

Splitting Pandas Dataframe with groupby and last

I am working with a pandas dataframe where I want to group by one column, grab the last row of each group (creating a new dataframe), and then drop those rows from the original.
I've done a lot of reading and testing, and it seems that I can't do that as easily as I'd hoped. I can do a kludgy solution, but it seems inefficient and, well, kludgy.
Here's pseudocode for what I wanted to do:
df = pd.DataFrame
last_lines = df.groupby('id').last()
df.drop(last_lines.index)
Creating the last_lines dataframe is fine; it's dropping those rows from the original df that's the issue. The problem is that the original index (from df) is disconnected when last_lines is created. I looked at filter and transform, but neither seems to address this problem. Is there a good way to split the dataframe into two pieces based on position?
My kludge solution is to iterate over the group iterator and create a list of indexes, then drop those.
grouped = df.groupby('id')
idx_to_remove = []
for _, group in grouped:
    idx_to_remove.append(group.tail(1).index[0])
df.drop(idx_to_remove)
Better suggestions?
If you use .reset_index() first, you'll get the index as a column and you can use .last() on that to get the indices you want.
last_lines = df.reset_index().groupby('A').index.last()
df.drop(last_lines)
Here the index is accessed as .index because "index" is the default name given to this column when you use reset_index. If your index has a name, you'll use that instead.
You can also "manually" grab the last index by using .apply():
last_lines = df.groupby('A').apply(lambda g: g.index[-1])
You'll probably have to do it this way if you're using a MultiIndex (since in that case using .reset_index() would add multiple columns that can't easily be combined back into indices to drop).
Try:
df.groupby('A').apply(lambda x: x.iloc[:-1, :])
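A different technique from the answers above, which avoids groupby altogether, is to build a boolean mask with duplicated(keep='last') and use it to split the frame in two. This is only a sketch on made-up data; the 'id' column name follows the question, everything else is assumed.

```python
import pandas as pd

# Hypothetical data: several rows per id.
df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 2, 3],
    "value": [10, 11, 20, 21, 22, 30],
})

# duplicated(keep="last") is False only for the last occurrence of each id,
# so negating it marks the last row of every group.
is_last = ~df.duplicated(subset="id", keep="last")

last_lines = df[is_last]   # last row of each group, original index kept
remaining = df[~is_last]   # everything else

print(last_lines)
print(remaining)
```

Because both pieces keep the original index, there is no need to reconstruct which rows to drop.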
