I have a problem with indexing a pandas DataFrame that I fill in a loop. Simplified, it looks like this:
import pandas as pd

d = pd.DataFrame(columns=['img', 'time', 'key'])
for i in range(5):
    image = i
    timepoint = i+1
    key = i+2
    temp = pd.DataFrame({'img':[image], 'timepoint':[timepoint], 'key': [key]})
    d = pd.concat([d, temp])
The problem is that every row ends up with index 0, so I cannot access a specific row with .loc[]. Does anybody have an idea how I can fix this and get a normal index column?
You may want to use the ignore_index parameter in your concatenation:
d = pd.concat([d, temp], ignore_index=True)
This gives me the following result:
img key time timepoint
0 0.0 2.0 NaN 1.0
1 1.0 3.0 NaN 2.0
2 2.0 4.0 NaN 3.0
3 3.0 5.0 NaN 4.0
4 4.0 6.0 NaN 5.0
(The NaN values in the time column come from the mismatch between 'time' in the initial columns and 'timepoint' in temp.)
Alternatively, you can reset the index after the loop:
d = d.reset_index(drop=True)
PS: It's better practice to build a list of rows and turn it into a DataFrame at the end; that is much less computationally expensive and it gives you a proper index straight away.
This list can be a list of lists combined with the columns argument of the DataFrame constructor, or a list of dictionaries with column names as keys (a sketch of the list-of-lists variant follows the code below). In your case:
list_of_dicts = []
for i in range(5):
    new_row = {'img': i, 'time': i+1, 'key': i+2}
    list_of_dicts.append(new_row)
d = pd.DataFrame(list_of_dicts)
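For completeness, a minimal sketch of the list-of-lists variant mentioned above, with the column names passed to the constructor:
list_of_lists = []
for i in range(5):
    # each inner list is one row: img, time, key
    list_of_lists.append([i, i + 1, i + 2])
d = pd.DataFrame(list_of_lists, columns=['img', 'time', 'key'])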
I think it is better to first fill plain lists with the values and then call the DataFrame constructor once (repeated concat copies the whole frame on every iteration):
image, timepoint, key = [], [], []
for i in range(5):
    image.append(i)
    timepoint.append(i+1)
    key.append(i+2)
d = pd.DataFrame({'img':image, 'time':timepoint, 'key': key})
print (d)
img key time
0 0 2 1
1 1 3 2
2 2 4 3
3 3 5 4
4 4 6 5
I have the following dataframe, which contains a column of nested tuples:
index nested_tuples
1 (('a',(1,0)),('b',(2,0)),('c',(3,0)))
2 (('a',(5,0)),('d',(6,0)),('e',(7,0)),('f',(8,0)))
3 (('c',(4,0)),('d',(5,0)),('g',(6,0)),('h',(7,0)))
I am trying to unpack the tuples to obtain the following dataframe:
index a b c d e f g h
1 1 2 3
2 5 6 7 8
3 4 5 6 7
I.e. for each tuple (char, (num1, num2)), I would like char to become a column and num1 the entry. I initially tried all sorts of methods with to_list(), but because the number of mini-tuples and the chars in them differ between rows, I couldn't use those without losing information. Eventually the only solution I could think of was:
for index, row in df.iterrows():
    tuples = row['nested_tuples']
    if not tuples:
        continue
    for mini_tuple in tuples:
        df.loc[index, mini_tuple[0]] = mini_tuple[1][0]
However, with the actual dataframe I have, where the nested tuples are long and the df is very large, iterrows is incredibly slow. Is there a better vectorised way to do this?
It's probably more efficient to clean the data in vanilla Python before building the DataFrame:
out = pd.DataFrame([{k:v[0] for k,v in tpl} for tpl in df['nested_tuples'].tolist()])
A bit more concisely:
out = pd.DataFrame(map(dict, df['nested_tuples'])).stack().str[0].unstack()
Yet another option using apply:
out = pd.DataFrame(df['nested_tuples'].apply(lambda x: {k:v[0] for k,v in x}).tolist())
Output:
a b c d e f g h
0 1.0 2.0 3.0 NaN NaN NaN NaN NaN
1 5.0 NaN NaN 6.0 7.0 8.0 NaN NaN
2 NaN NaN 4.0 5.0 NaN NaN 6.0 7.0
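All three options build the result on a fresh RangeIndex (0, 1, 2). If the original index from df (1, 2, 3 in the example) should be kept, one small follow-up, assuming out has the same number of rows in the same order as df, is:
# copy the original index across to the unpacked frame
out.index = df.index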
I'm sure there is an elegant solution for this, but I cannot find one. In a pandas dataframe, how do I remove all duplicate values in a column while ignoring one value?
repost_of_post_id title
0 7139471603 Man with an RV needs a place to park for a week
1 6688293563 Land for lease
2 None 2B/1.5B, Dishwasher, In Lancaster
3 None Looking For Convenience? Check Out Cordova Par...
4 None 2/bd 2/ba, Three Sparkling Swimming Pools, Sit...
5 None 1 bedroom w/Closet is bathrooms in Select Unit...
6 None Controlled Access/Gated, Availability 24 Hours...
7 None Beautiful 3 Bdrm 2 & 1/2 Bth Home For Rent
8 7143099582 Need Help Getting Approved?
9 None *MOVE IN READY APT* REQUEST TOUR TODAY!
What I want is to keep all None values in repost_of_post_id, but omit any duplicates of the numerical values, for example if there are duplicates of 7139471603 in the dataframe.
[UPDATE]
I got the desired outcome using this script, but I would like to accomplish this in a one-liner, if possible.
# remove duplicate repost id if present (i.e. don't remove rows where repost_of_post_id value is "None")
# ca_housing is the original dataframe that needs to be cleaned
ca_housing_repost_none = ca_housing.loc[ca_housing['repost_of_post_id'] == "None"]
ca_housing_repost_not_none = ca_housing.loc[ca_housing['repost_of_post_id'] != "None"]
ca_housing_repost_not_none_unique = ca_housing_repost_not_none.drop_duplicates(subset="repost_of_post_id")
ca_housing_unique = pd.concat([ca_housing_repost_none, ca_housing_repost_not_none_unique])  # concat instead of the deprecated DataFrame.append
You could try dropping the None values, then detecting duplicates, then filtering them out of the original DataFrame.
In [1]: import pandas as pd
...: from string import ascii_lowercase
...:
...: ids = [1,2,3,None,None, None, 2,3, None, None,4,5]
...: df = pd.DataFrame({'id': ids, 'title': list(ascii_lowercase[:len(ids)])})
...: print(df)
...:
...: print(df[~df.index.isin(df.id.dropna().duplicated().loc[lambda x: x].index)])
id title
0 1.0 a
1 2.0 b
2 3.0 c
3 NaN d
4 NaN e
5 NaN f
6 2.0 g
7 3.0 h
8 NaN i
9 NaN j
10 4.0 k
11 5.0 l
id title
0 1.0 a
1 2.0 b
2 3.0 c
3 NaN d
4 NaN e
5 NaN f
8 NaN i
9 NaN j
10 4.0 k
11 5.0 l
You could use drop_duplicates and merge with the NaNs as follows:
df_cleaned = df.drop_duplicates('post_id', keep='first').merge(df[df.post_id.isnull()], how='outer')
This will keep the first occurrence of each duplicated id and all NaN rows.
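Applied directly to the frame from the question, where the missing ids are the literal string "None" rather than NaN, a possible one-liner along the same lines (column name taken from the question) is:
ca_housing_unique = ca_housing[(ca_housing['repost_of_post_id'] == "None") | ~ca_housing['repost_of_post_id'].duplicated()]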
The example below:
import pandas as pd
list1 = ['a','a','a','b','b','b','b','c','c','c']
list2 = range(len(list1))
df = pd.DataFrame(zip(list1, list2), columns= ['Item','Value'])
df
gives:
  Item  Value
0    a      0
1    a      1
2    a      2
3    b      3
4    b      4
5    b      5
6    b      6
7    c      7
8    c      8
9    c      9
Required: a GroupFirstValue column, as shown below.
The idea is to use a lambda to get the 'first' value for each group... for example, "a"'s first value is 0, "b"'s first value is 3, and "c"'s first value is 7. That's why those numbers appear in the GroupFirstValue column.
Note: I know that I can do this in two steps: keep the original df, build a grouped-by df, and then merge them together. The idea is to see if this can be done more efficiently in a single step. Many thanks in advance!
groupby and use first
df.groupby('Item')['Value'].first()
or you can use transform and assign to a new column in your frame
df['new_col'] = df.groupby('Item')['Value'].transform('first')
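For reference, a sketch of the two-step groupby-then-merge approach mentioned in the question, which the transform('first') one-liner replaces:
# first Value per Item, as its own small frame
firsts = (df.groupby('Item')['Value'].first()
            .reset_index()
            .rename(columns={'Value': 'GroupFirstValue'}))
# broadcast it back onto the original rows
df = df.merge(firsts, on='Item', how='left')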
Use mask and duplicated
df['GroupFirstValue'] = df.Value.mask(df.Item.duplicated())
Out[109]:
Item Value GroupFirstValue
0 a 0 0.0
1 a 1 NaN
2 a 2 NaN
3 b 3 3.0
4 b 4 NaN
5 b 5 NaN
6 b 6 NaN
7 c 7 7.0
8 c 8 NaN
9 c 9 NaN
I have a DataFrame with some NaN records that I want to fill, where the fill value is computed from a combination of data from the NaN record itself (its index, in this example) and data from the non-NaN records. The original DataFrame should be modified.
Details of input/output/code below:
I have an initial DataFrame that contains some pre-calculated data:
Initial Input
import numpy as np
import pandas as pd

raw_data = {'raw':[x for x in range(5)]+[np.nan for x in range(2)]}
source = pd.DataFrame(raw_data)
raw
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 NaN
6 NaN
I want to identify the NaN records and "update" them, where the calculation uses both the non-NaN data and some data from the NaN records themselves.
In this contrived example I calculate this as:
Calculate average/mean of 'valid' records.
Add this to the index number of 'invalid' records.
Finally this needs to be updated on the initial DataFrame.
Desired Output
raw valid
0 0.0 1
1 1.0 1
2 2.0 1
3 3.0 1
4 4.0 1
5 7.0 0
6 8.0 0
The current solution I have (below) makes a calculation on a copy then updates the original DataFrame.
# Setup grouping by NaN in 'raw'
source['valid'] = ~np.isnan(source['raw'])*1
subsets = source.groupby('valid')
# Mean of 'valid' is used later to fill 'invalid' records
valid_mean = subsets.get_group(1)['raw'].mean()
# Operate on a copy of group(0), then update the original DataFrame
invalid = subsets.get_group(0).copy()
invalid['raw'] = subsets.get_group(0).index + valid_mean
source.update(invalid)
Is there a less clunky or more efficient way to do this? The real application is on significantly larger DataFrames (and with significantly more involved processing of the NaN rows).
Thanks in advance.
You can use combine_first:
# mean omits NaNs by default
m = source['raw'].mean()
# same as
# m = source['raw'].dropna().mean()
print (m)
2.0
#create valid column if necessary
source['valid'] = source['raw'].notnull().astype(int)
#update NaNs
source['raw'] = source['raw'].combine_first(source.index.to_series() + m)
print (source)
raw valid
0 0.0 1
1 1.0 1
2 2.0 1
3 3.0 1
4 4.0 1
5 7.0 0
6 8.0 0
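Since only the NaN positions need new values, a line using fillna with the same index-based series (and the same m as above) should be equivalent here:
# fillna aligns the filler Series on the index, so only rows 5 and 6 change
source['raw'] = source['raw'].fillna(source.index.to_series() + m)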
I have a dataframe that looks like this:
parent region
estid
1 NaN A
2 NaN B
3 1.0 A
4 1.0 B
5 2.0 C
6 2.0 C
7 8.0 A
What I want is to create an extra column containing the region of the parent, defaulting to None if the parent is not found in the data, e.g.:
parent region parent_region
estid
1 NaN A None
2 NaN B None
3 1.0 A A
4 1.0 B A
5 2.0 C B
6 2.0 C B
7 8.0 A None
The following returns the correct result:
df["parent_region"] = df.apply(lambda x : df.loc[x["parent"]]["region"] if not math.isnan(x["parent"]) and x["parent"] in df.index else None, axis = 1)
But I'm very scared of inefficiencies, given that my dataframe has 168 million rows. Is there a better way to do it? I looked at lookup and get but I can't quite figure out how to work with IDs that can be NaN or not present in the dataframe.
For instance, I thought this could work: df.lookup(df["region"], df["parent"]), but it doesn't like null keys very much. df.get("region") does not return the parent's region but the column itself, so it doesn't do what I want.
You can use the Series.map method, which works like a dictionary lookup: the values of the parent column are looked up against the index of the region column (estid), and the matching region values are returned.
Additionally, na_action='ignore' can be used to speed up the mapping, since all NaNs in parent are skipped and simply propagated.
Lastly, the missing values must be replaced with None using the Series.replace method.
df["parent_region"] = df.parent.map(df.region, na_action='ignore').replace({np.NaN:None})
Out[121]:
estid
1 None
2 None
3 A
4 A
5 B
6 B
7 None
Name: parent_region, dtype: object
We could also use a merge for this, joining the frame on itself to match parent to estid (here x is the question's DataFrame with estid reset to a regular column):
z = pd.merge(x, x[['estid', 'region']],
             left_on='parent',
             right_on='estid',
             how='left',
             suffixes=('', '_parent'))  # left join
del z['estid_parent']  # remove the unneeded column
z['region_parent'] = z['region_parent'].replace({np.NaN: None})  # replace NaNs, same as the other answer
z
estid parent region region_parent
0 1 NaN A None
1 2 NaN B None
2 3 1.0 A A
3 4 1.0 B A
4 5 2.0 C B
5 6 2.0 C B
6 7 8.0 A None
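If estid should go back to being the index, as in the question's frame, one further step could be:
z = z.set_index('estid')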