I have a pandas DataFrame like the one below:
From_email,To_email,email_count
110165.74#compuserve.com,klay#enron.com,1
2krayz#gte.net,klay#enron.com,1
"<""d#piassick"".#enron#enron.com>",klay#enron.com,1
I would like to change it to a dictionary of the following format:
hrc_dict = {('110165.74#compuserve.com', 'klay#enron.com'): 1,
('2krayz#gte.net', 'klay#enron.com'): 1,
('<"d#piassick".#enron#enron.com>', 'klay#enron.com '): 1}
What is the best way to do this?
You can use a dict comprehension to create the dict from your DataFrame.
import pandas as pd

df = pd.DataFrame({
    'From_email': ['110165.74#compuserve.com', '2krayz#gte.net', '<"d#piassick".#enron#enron.com>'],
    'To_email': ['klay#enron.com', 'klay#enron.com', 'klay#enron.com'],
    'email_count': [1, 1, 1]})

d = {tuple(x[:2]): x[2] for x in df[['From_email', 'To_email', 'email_count']].values}
First we explicitly select the necessary columns from your DataFrame in the desired order. We then iterate over the rows: for each row, the tuple of email addresses (the first two columns) becomes the key, and the value is simply the third column (email_count).
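A pandas-native alternative (a sketch using only the columns shown above) is to make the two email columns the index and call to_dict() on the count column; the MultiIndex entries become the tuple keys automatically:

# The (From_email, To_email) MultiIndex turns into tuple keys.
d = df.set_index(['From_email', 'To_email'])['email_count'].to_dict()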
I would like to apply the loop below where for each index value the unique values of a column called SERIAL_NUMBER will be returned. Essentially I want to confirm that for each index there is a unique serial number.
index_values = df.index.levels
for i in index_values:
    x = df.loc[[i]]
    x["SERIAL_NUMBER"].unique()
The problem, however, is that my dataset has a multi-index, and as you can see below, its levels are stored in a FrozenList. I am only interested in the index values that contain a long number; the word "vehicle", which is also an index level, can be ignored, as it is repeated all over the dataset.
How can I extract these values into a list so I can use them in the loop?
index_values
>>
FrozenList([['0557bf98-c3e0-4955-a23f-2394635ab531', '074705a3-a96a-418c-9bfe-14c37f5c4e6f', '0f47e260-0fa2-40ba-a417-7c00ea74248c', '17342ca2-6246-4150-8080-96d6125cf2b5', '26c6c0d1-0134-4b3a-a149-61dd93afab3b', '7600be43-5d0a-49b3-a1ee-fd107db5822f', 'a07f2b0c-447c-4143-a361-d7ddbffdcc77', 'b929801c-2f32-4a95-bfc4-48a05b48ee01', 'cc912023-0113-42cd-8fe7-4df4005127c2', 'e424bd02-e188-462e-a1a6-2f4ed8fe0a2d'], ['vehicle']])
Without an example it's hard to judge, but I think you need:
df.index.get_level_values(0).unique() # add .tolist() if you want a list
import pandas as pd

df = pd.DataFrame({'A': [5] * 5, 'B': [6] * 5})
df = df.set_index('A', append=True)

df.index.get_level_values(0).unique()
# Int64Index([0, 1, 2, 3, 4], dtype='int64')

df.index.get_level_values(1).unique()
# Int64Index([5], dtype='int64', name='A')
To drop duplicates from an index level, use the .duplicated() method:
df[~df.index.get_level_values(1).duplicated(keep='first')]

     B
  A
0 5  6
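Putting this together for your SERIAL_NUMBER check, a sketch (assuming the long IDs are level 0 of your MultiIndex and SERIAL_NUMBER is a regular column) would be:

# For each top-level index value, confirm there is exactly one serial number.
for idx in df.index.get_level_values(0).unique():
    serials = df.xs(idx, level=0)['SERIAL_NUMBER'].unique()
    if len(serials) != 1:
        print(idx, 'has multiple serial numbers:', serials)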
I have some data in a text file that I am reading into pandas. A simplified version of the txt read in is:
idx_level1|idx_level2|idx_level3|idx_level4|START_NODE|END_NODE|OtherData...
353386066294006|1142|2018-09-20T07:57:26Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:26Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:26Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:31Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:31Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:31Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:36Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:36Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:36Z|3|18260005359901|18260004567689|...
353386066736543|22|2018-04-17T07:08:23Z||||...
353386066736543|22|2018-04-17T07:08:24Z||||...
353386066736543|22|2018-04-17T07:08:25Z||||...
353386066736543|22|2018-04-17T07:08:26Z||||...
353386066736543|403|2018-07-02T16:55:07Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:07Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|6|18260004283163|18260006215338|...
353386066736543|403|2018-07-02T16:55:01Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:01Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|6|18260004283163|18260006215338|...
And the code I use to read it in is as follows:
mydata = pd.read_csv('/myloc/my_simple_data.txt', sep='|',
dtype={'idx_level1': 'int',
'idx_level2': 'int',
'idx_level3': 'str',
'idx_level4': 'float',
'START_NODE': 'str',
'END_NODE': 'str',
'OtherData...': 'str'},
parse_dates = ['idx_level3'],
index_col=['idx_level1','idx_level2','idx_level3','idx_level4'])
What I really want to do is have a separate pandas DataFrame for each unique idx_level1 & idx_level2 combination. So in the above example there would be 3 DataFrames pertaining to idx_level1|idx_level2 values of 353386066294006|1142, 353386066736543|22 and 353386066736543|403 respectively.
Is it possible to read in a text file like this and output each change in idx_level2 to a new Pandas DataFrame, maybe as part of some kind of loop? Alternatively, what would be the most efficient way to split mydata into DataFrame subsets, given that everything I have read suggests that it is inefficient to iterate through a DataFrame.
Read your DataFrame in as you are currently doing, then groupby on the first two index levels and use a list comprehension:
group = mydata.groupby(level=[0,1])
dfs = [group.get_group(x) for x in group.groups]
You can then access the individual DataFrames with dfs[0], dfs[1], and so on.
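If you want to keep track of which subset is which, a variation (same idea, just keyed by the group label) is a dict comprehension over the groupby:

# Keys are (idx_level1, idx_level2) tuples, values are the sub-DataFrames.
dfs_by_key = {key: frame for key, frame in mydata.groupby(level=[0, 1])}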
To specifically address your last paragraph, you could create a dict of DataFrames, keyed by the unique values in the column, using something like:
dfs = {}
col_values = df[column].unique()
for value in col_values:
    key = 'df' + str(value)
    # Boolean indexing returns the matching rows as a new DataFrame.
    dfs[key] = df[df[column] == value].reset_index(drop=True)
where column = 'idx_level2'.
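Note that with the read_csv call in the question, idx_level2 lives in the index rather than the columns, so (a hypothetical usage sketch) you would first move it back out:

df = mydata.reset_index()  # idx_level2 becomes an ordinary column again
column = 'idx_level2'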
Read the table as-is (without setting an index) and use groupby, for instance:
data = pd.read_table('/myloc/my_simple_data.txt', sep='|')
groups = dict()
for group, subdf in data.groupby(data.columns[:2].tolist()):
    groups[group] = subdf
Now you have all the sub-DataFrames in a dictionary whose keys are tuples of the two indexers (e.g. (353386066294006, 1142)).
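For example, with the sample data above you could pull out the first group like this:

# The key tuple uses the parsed column dtypes (both integers here).
sub = groups[(353386066294006, 1142)]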
I have a DataFrame like:
df = pd.DataFrame({'A' : (1,2,3), 'B': ([0,1],[3,4],[6,0])})
I want to create a new column called test that contains a 1 if a 0 exists within the list in column B for that row. The result would hopefully look like:
df = pd.DataFrame({'A' : (1,2,3), 'B': ([0,1],[3,4],[6,0]), 'test': (1,0,1)})
With a DataFrame that contains strings rather than lists, I have created additional columns from the string values using the following:
df.loc[df['B'].str.contains(0),'test']=1
When I try this with the example df, I generate a
TypeError: first argument must be string or compiled pattern
I also tried to convert the list to a string but only retained the first integer in the list. I suspect that I need something else to access the elements within the list but don't know how to do it. Any suggestions?
This should do it for you:

import numpy as np

# 1 where the list in B contains a 0, otherwise 0
df['test'] = np.where(df['B'].apply(lambda x: 0 in x), 1, 0)
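If you prefer to avoid numpy here, an equivalent sketch of the same membership test is:

# int(True) == 1 and int(False) == 0, so this yields the 1/0 flags directly.
df['test'] = df['B'].apply(lambda lst: int(0 in lst))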
I have a dataframe with three columns like this:
Subject: {1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, ...}
datetime: {6/4/16 3:04:30, 6/5/16 6:02:15, ...}
markers: {}
It is sorted by subject then by datetime, and the markers column is empty
I also have a dictionary which maps subject numbers to lists of datetimes. These datetimes are not exactly the same as the ones already in the DataFrame. I want to add these datetimes to the markers column in their corresponding subject-and-date rows for comparison purposes. For example, a dictionary entry with key (subject) 1 and a list of values like {6/4/16 5:00:15, 6/5/16 6:10:30} would have its first value added to row 1, because the subject and date match, and its second value added to row 2 for the same reason.
I thought of looping through each dictionary key and all its corresponding datetimes, but finding the matching row in the DataFrame for each datetime within the nested loops would be very inefficient. It would be something like this:
for subject in df.iloc[:, 0]:
    # go to subject in dictionary and loop through datetimes in the
    # corresponding list, adding the matching datetime to the current row
    # O(n^2) time!
Is there a more efficient way to do this?
Thanks!
Try this; you will have to customize it somewhat to meet your specific needs, but the logic is basically the same:
import pandas as pd

df = pd.DataFrame({'colA': [100, 200], 'colB': [None, None]})  # colB stands in for the empty markers column
dict1 = {100: ['rat', 'cat', 'hat'], 200: ['hen', 'men', 'den']}
df = pd.concat([df['colA'], df['colA'].map(dict1).apply(pd.Series)], axis=1)
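For reference, the result of the sketch above looks like this, with the mapped list expanded into one column per element:

   colA    0    1    2
0   100  rat  cat  hat
1   200  hen  men  den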
I have a pandas DataFrame with a column that is a small selection of strings. Let's call the column 'A'; all of the values in it are string_1, string_2, or string_3.
Now, I want to add another column and fill it with numeric values that correspond to the strings.
I created a dictionary
d = { 'string_1' : 1, 'string_2' : 2, 'string_3': 3}
I then initialized the new column:
df['B'] = pd.Series(index=df.index)
Now, I want to fill it with the integer values. I can look up the value associated with each string in the dictionary with:
for s in df['A']:
    n = d[s]
That works fine, but using plain df['B'] = n to fill the new column inside the for-loop doesn't work, and I haven't been able to figure out the indexing with pandas.
If I understand you correctly, you can just call map:
df['B'] = df['A'].map(d)
This will perform the lookup and fill the values you are looking for.
Rather than initializing an empty column and filling it, you can simply populate it with apply:
df['B'] = df['A'].apply(d.get)
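One practical difference to be aware of (a small sketch, assuming a string not in d may appear in A): both approaches tolerate missing keys, but the sentinel differs slightly:

s = pd.Series(['string_1', 'string_4'])
s.map(d)        # 1.0 and NaN -- 'string_4' is not in d
s.apply(d.get)  # same lookup via dict.get; misses come back as None/NaN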