Passing a defaultdict into a df - python

I am trying to import a txt file with states and universities listed in it. I have used defaultdict to import the txt and parse it into a dict where the universities are attached to their state. How do I then put that data into a pandas dataframe with two columns (State, RegionName)? Nothing so far has worked.
I built an empty dataframe with:
ut = pd.DataFrame(columns = {'State', 'RegionName'})
and have tried a couple of different methods but none have worked.
with open('ut.txt') as ut:
    for line in ut:
        if '[edit]' in line:
            a = line.rstrip().split('[')
            d[a[0]].append(a[1])
        else:
            b = line.rstrip().split(' ')
            d[a[0]].append(b[0])
            continue
This gets me a nice list:
defaultdict(<class 'list'>, {'State': ['edit]', 'School', 'School2', 'School3', 'School4', 'School5', 'School6', 'School7', 'School8'],
The edit] is part of the original txt file signifying a state. Everything after are the towns the schools are in.
I'd like to build a nice 2 column dataframe where state is the left column and all schools on the right...

Considering the following dictionary
data_dict = {"a": 1, "b": 2, "c": 3}
If you want to create a dataframe from that dictionary and name the columns State and RegionName respectively, the following will do the work
data_items = data_dict.items()
data_list = list(data_items)
df = pd.DataFrame(data_list, columns = ["State", "RegionName"])
Which gives
[In]: print(df)
[Out]:
State RegionName
0 a 1
1 b 2
2 c 3
If one doesn't pass the column names when creating the dataframe, and the columns end up named a and b, one can rename them with pandas.DataFrame.rename
df = df.rename(columns = {"a": "State", "b": "RegionName"})
If the goal is solely reading a txt file with a structure like this
column1 column2
1 2
3 4
5 6
Then the following will do the work
colnames = ['State', 'RegionName']
df = pd.read_csv("file.txt", sep=r'\s+', names=colnames, header=0)
Note that names= must be passed by keyword (passing colnames positionally would be taken as the sep argument), and header=0 tells pandas to discard the file's own header row in favor of the given names.
Note that if the columns already have the names you want, just use the following
df = pd.read_csv("file.txt")
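Applied to the original question, here is a minimal sketch that parses the state/town structure directly into a two-column dataframe. It assumes the txt file has lines like Alabama[edit] marking a state, followed by town lines such as Auburn (Auburn University); the parenthetical handling is an assumption about the file layout, not something the question confirms:

```python
import pandas as pd

def parse_university_towns(lines):
    # Build (State, RegionName) pairs: a line containing "[edit]"
    # starts a new state; every following line is a town in that state.
    rows = []
    state = None
    for line in lines:
        line = line.rstrip()
        if '[edit]' in line:
            state = line.split('[')[0]
        elif state is not None:
            # Drop any trailing "(University Name)" part, if present
            rows.append((state, line.split(' (')[0]))
    return pd.DataFrame(rows, columns=['State', 'RegionName'])

lines = ['Alabama[edit]', 'Auburn (Auburn University)', 'Florence (UNA)']
ut = parse_university_towns(lines)
```

This avoids the intermediate defaultdict entirely: the list of tuples goes straight into the DataFrame constructor.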

Related

pandas create a subset according to a value in a column

I have this dataframe:
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
I would like to work on two separate dataframe according to the value (id) of the first column. Ideally, I would like to have:
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
and
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
As you can see I have one dataframe with all 87 in the first column and another with 86.
This is how I read the dataframe:
dfr = pd.read_csv(fname,sep=',',index_col=False,header=None)
I think groupby is not the right option, if I have understood the command correctly.
I was thinking about query as:
aa = dfr.query(dfr.iloc[:,0]==86)
However, I have this error:
expr must be a string to be evaluated, <class 'pandas.core.series.Series'> given
You can simply slice your dataframe:
df_86 = df.loc[df['ColName'] == 86,:]
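Note that because the question's file was read with header=None, the columns are the integers 0, 1, 2 rather than names, so the mask uses column 0. A minimal sketch with a few made-up rows:

```python
import pandas as pd

# Columns are 0, 1, 2 because the file was read with header=None
dfr = pd.DataFrame([[86, '1/28/2004 0:00:00', 16.9],
                    [87, '11/15/2004 0:00:00', 7.39],
                    [86, '5/25/2004 0:00:00', 17.01]])

df_86 = dfr[dfr[0] == 86]   # rows whose first column equals 86
df_87 = dfr[dfr[0] == 87]
```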
Another way is to do it dynamically, without having to specify the groups beforehand.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.repeat([1, 2, 3], 4), 'col2': np.repeat([10, 11, 12], 4)})
Get the unique groupings:
groups = df['ID'].unique()
Create an empty dict to store new data frames
new_dfs = {}
Loop through and create new data frames from the slice:
for group in groups:
    name = "ID" + str(group)
    new_dfs[name] = df[df['ID'] == group]

new_dfs['ID1']
Which gives:
ID col2
0 1 10
1 1 10
2 1 10
3 1 10
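The same split can also be done without the explicit loop: iterating over a groupby yields (key, sub-DataFrame) pairs, which feed directly into a dict comprehension. A sketch on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.repeat([1, 2, 3], 4),
                   'col2': np.repeat([10, 11, 12], 4)})

# dict mapping each unique ID to its own sub-DataFrame
new_dfs = {f"ID{key}": sub for key, sub in df.groupby('ID')}
```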

Remove grave accent from IDs

I have an ID column containing a grave accent, like `1234ABC40, and I want to remove just that character from the column while keeping the dataframe form.
I tried this on the column only. The file is read into x here and has multiple columns; id is the column I want to fix.
x = pd.read_csv(r'C:\filename.csv', index_col=False)
id = str(x['id'])
id2 = unidecode.unidecode(id)
id3 = id2.replace('`','')
This changes to str but I want that column back in the dataframe form
DataFrames have their own replace() function. Note, for partial replacements you must enable regex=True in the parameters:
import pandas as pd
d = {'id': ["12`3", "32`1"], 'id2': ["004`", "9`99"]}
df = pd.DataFrame(data=d)
df["id"] = df["id"].replace('`','', regex=True)
print(df)
id id2
0 123 004`
1 321 9`99
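For a string column, Series.str.replace with regex=False does the same literal removal without any regex involvement, which avoids having to think about escaping. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'id': ["`1234ABC40", "32`1"]})
# regex=False treats the pattern as a literal string, not a regex
df['id'] = df['id'].str.replace('`', '', regex=False)
```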

Pandas, for each unique value in one column, get unique values in another column

I have a dataframe where each row contains various meta-data pertaining to a single Reddit comment (e.g. author, subreddit, comment text).
I want to do the following: for each author, I want to grab a list of all the subreddits they have comments in, and transform this data into a pandas dataframe where each row corresponds to an author, and a list of all the unique subreddits they comment in.
I am currently trying some combination of the following, but can't get it down:
Attempt 1:
group = df['subreddit'].groupby(df['author']).unique()
list(group)
Attempt 2:
from collections import defaultdict
subreddit_dict = defaultdict(list)
for index, row in df.iterrows():
    author = row['author']
    subreddit = row['subreddit']
    subreddit_dict[author].append(subreddit)

for key, value in subreddit_dict.items():
    subreddit_dict[key] = set(value)

subreddit_df = pd.DataFrame.from_dict(subreddit_dict, orient='index')
Here are two strategies to do it. No doubt, there are other ways.
Assuming your dataframe looks something like this (obviously with more columns):
df = pd.DataFrame({'author':['a', 'a', 'b'], 'subreddit':['sr1', 'sr2', 'sr2']})
>>> df
author subreddit
0 a sr1
1 a sr2
2 b sr2
...
SOLUTION 1: groupby
More straightforward than solution 2, and similar to your first attempt:
group = df.groupby('author')
df2 = group.apply(lambda x: x['subreddit'].unique())
# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())
Result:
>>> df2
author
a [sr1, sr2]
b [sr2]
The author is the index, and the single column is the list of all subreddits they are active in (this is how I interpreted how you wanted your output, according to your description).
If you wanted the subreddits each in a separate column, which might be more useable, depending on what you want to do with it, you could just do this after:
df2 = df2.apply(pd.Series)
Result:
>>> df2
0 1
author
a sr1 sr2
b sr2 NaN
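If you would rather have author as a regular column than as the index, a shorter variant uses agg plus reset_index. A sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# One row per author, unique subreddits collected into a list
df2 = (df.groupby('author')['subreddit']
         .agg(lambda s: list(s.unique()))
         .reset_index())
```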
Solution 2: Iterate through dataframe
You can make a new dataframe with all unique authors:
df2 = pd.DataFrame({'author':df.author.unique()})
And then just get the list of all unique subreddits they are active in, assigning it to a new column:
df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']]))
for _, x in df2.iterrows()]
This gives you this:
>>> df2
author subreddits
0 a [sr2, sr1]
1 b [sr2]
By using sacul's sample data
df['subreddit'].groupby(df['author']).unique().apply(pd.Series)
Out[370]:
0 1
author
a sr1 sr2
b sr2 NaN
Using the groupby.agg() "aggregate" function:
DataFrameGroupBy.agg(arg, *args, **kwargs): aggregate using one or
more operations over the specified axis. Function to use for
aggregating the data. If a function, must either work when passed a
DataFrame or when passed to DataFrame.apply.
df = pd.DataFrame({'numbers': [1, 2, 3, 6, 9], 'colors': ['red', 'white', 'blue', 'red', 'white']}, columns=['numbers', 'colors'])
df.groupby('colors', as_index=True).agg(unique=('numbers', lambda x: set(x)),
                                        nunique=('numbers', 'nunique'))
(The older nested-dict form of agg for renaming was removed in pandas 1.0, so named aggregation is used here.)

Text To Column Function

I am trying to write my own function in Python 3.5, but not having much luck.
I have a data frame that is 17 columns, 1,200 rows (tiny)
One of the columns is called "placement". Within this column, I have text contained in each row. The naming convention is as follows:
Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_
The following code works perfectly and does exactly what I need it to do; I just don't want to do this for every data set I have:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
df_detailed = df.join(df_detailed)
new_columns = *["Then i rename the columns labelled 0,1,2 etc"]*
df_detailed.columns = new_columns
df_detailed.head()
What I'm trying to do is build a function, that takes any columns with _ as the delimitator and splits it across new columns.
I have tried the following (but unfortunately defining my own functions is something I'm horrible at).
def text_to_column(df):
    df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
    headings = df_detailed.columns
    headings.replace(" ", "_")
    df_detailed = df.join(df_detailed)
    df_detailed.columns = headings
    return (df)
and I get the following error "AttributeError: 'RangeIndex' object has no attribute 'replace'"
The end goal here is to write a function where I can pass the column name into the function, it separates the values contained within the column into new columns and then joins this back to my original Data Frame.
If I'm being ridiculous, please let me know. If someone can help me, it would be greatly appreciated.
Thanks,
Adrian
You need the rename function to replace the column names:
headings = df_detailed.columns
headings.replace(" ", "_")
change to:
df_detailed = df_detailed.rename(columns=lambda x: x.replace(" ", "_"))
Or convert the columns to_series, because replace does not work on an index (the column names):
headings.replace(" ", "_")
change to:
headings = headings.to_series().replace(" ", "_")
Also:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
can be changed to:
df_detailed = df['Placement'].str.rstrip('_').str.split('_', expand=True).astype(str)
EDIT:
Sample:
df = pd.DataFrame({'a': [1, 2], 'Placement': ['Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_', 'a_b_c_d_f_g_h_i_']})
print (df)
Placement a
0 Campaign_Publisher_Site_AdType_AdSize_Device_A... 1
1 a_b_c_d_f_g_h_i_ 2
#input is DataFrame and column name
def text_to_column(df, col):
df_detailed = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
#replace columns names if necessary
df_detailed.columns = df_detailed.columns.to_series().replace(" ", "_")
#remove column and join new df
df_detailed = df.drop(col, axis=1).join(df_detailed)
return df_detailed
df = text_to_column(df, 'Placement')
print (df)
a 0 1 2 3 4 5 6 7
0 1 Campaign Publisher Site AdType AdSize Device Audience Tactic
1 2 a b c d f g h i
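To get meaningful column labels instead of 0..7, the function can take an optional list of names to assign after the split. The names below come from the asker's naming convention (Campaign, Publisher, ...); the names parameter itself is an addition to the answer's function, not part of it:

```python
import pandas as pd

def text_to_column(df, col, names=None):
    # Split the delimited column; optionally label the new columns
    parts = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
    if names is not None:
        parts.columns = names
    return df.drop(col, axis=1).join(parts)

names = ['Campaign', 'Publisher', 'Site', 'AdType',
         'AdSize', 'Device', 'Audience', 'Tactic']
df = pd.DataFrame({'a': [1], 'Placement': ['C_P_S_AT_AS_D_AU_T_']})
out = text_to_column(df, 'Placement', names)
```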

Slicing Pandas DataFrame based on csv

Let's say I have a Pandas DataFrame like following.
df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Country': ['US', 'UK', 'SL']})
Country Name
0 US A
1 UK B
2 SL C
And I have a csv like the following.
Name,Extended
A,Jorge
B,Alex
E,Mark
F,Bindu
I need to check whether df['Name'] is in the csv and if so get the "Extended" value. If not, I just keep the "Name". So my expected output is as follows.
Country Name Extended
0 US A Jorge
1 UK B Alex
2 SL C C
Following shows what I tried so far.
f = open('mycsv.csv', 'r')
lines = f.readlines()

def parse(x):
    for line in lines:
        if x in line.split(',')[0]:
            return line.strip().split(',')[1]

df['Extended'] = df['Name'].apply(parse)
Name Country Extended
0 A US Jorge
1 B UK Alex
2 C SL None
I cannot figure out how to get the "Name" for C in "Extended" (the else part in the code). Any help?
You can use the "fillna" function from pandas like this:
import pandas as pd
df1 = pd.DataFrame({'Name': ['A', 'B', 'C'],
                    'Country': ['US', 'UK', 'SL']})
df2 = pd.read_csv('mycsv.csv', index_col=None)
df_merge = pd.merge(df1, df2, how="left", on="Name")
df_merge["Extended"] = df_merge["Extended"].fillna(df_merge["Name"])
Note that fillna('Name') would fill the gaps with the literal string 'Name'; passing the Name column fills each gap with that row's name instead.
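An alternative that skips the merge entirely: build a lookup from the csv and use Series.map, falling back to the name itself. A sketch with the csv's sample data inlined as a Series:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Country': ['US', 'UK', 'SL']})
# Stand-in for the csv contents: Name -> Extended
lookup = pd.Series({'A': 'Jorge', 'B': 'Alex', 'E': 'Mark', 'F': 'Bindu'})

# Names found in the lookup get their "Extended" value; the rest
# fall back to the name itself via fillna.
df['Extended'] = df['Name'].map(lookup).fillna(df['Name'])
```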
You could just load the csv as a df and then assign using where:
df['Name'] = df2['Extended'].where(df2['Name'] != df2['Extended'], df2['Name'])
So here we use the boolean condition to test if 'Name' is not equal to 'Extended' and use that value, otherwise just use 'Name'.
Also is 'Extended' always either different or same as 'Name'? If so why not just assign the value of extended to the dataframe:
df['Name'] = df2['Extended']
This would be a lot simpler.
