I have a timeseries dataframe with an appId column, and I split it into a dictionary of sub-dataframes, one per unique appId, with this command:
dfs = dict(tuple(timeseries.groupby('appId')))
After that I want to remove from my dataframe all groups that have fewer than 30 rows. I removed those entries from my dictionary (dfs) and then tried this code:
pd.concat([dfs]).drop_duplicates(keep=False)
but it doesn't work.
I believe you need to get the group sizes with transform('size') and then filter by boolean indexing:
df = pd.concat(dfs)   # concat accepts the dict of frames directly; [dfs] raises a TypeError
df = df[df.groupby('appId')['appId'].transform('size') >= 30]
#alternative 1
#df = df[df.groupby('appId')['appId'].transform('size').ge(30)]
#alternative 2 (slower on large data)
#df = df.groupby('appId').filter(lambda x: len(x) >= 30)
Another approach is to filter the dictionary:
dfs = {k: v for k, v in dfs.items() if len(v) >= 30}
EDIT:
timeseries = timeseries[timeseries.groupby('appId')['appId'].transform('size') >= 30]
dfs = dict(tuple(timeseries.groupby('appId')))
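For example, here is a minimal end-to-end sketch with made-up data (threshold lowered from 30 to 3 so the toy frame stays small):
import pandas as pd

# Toy data (assumed): appId 'a' has 2 rows, 'b' has 4
timeseries = pd.DataFrame({'appId': ['a', 'a', 'b', 'b', 'b', 'b'],
                           'value': range(6)})

# Keep only appIds with at least 3 rows, then rebuild the dictionary
timeseries = timeseries[timeseries.groupby('appId')['appId'].transform('size') >= 3]
dfs = dict(tuple(timeseries.groupby('appId')))
print(list(dfs))  # ['b']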
I have a data frame with various column names (asset.new, few, value.issue, etc.), and I want to change some characters or symbols in the column names. I can do it like this:
df.columns = df.columns.str.replace('.', '_')
df.columns = df.columns.str.replace('few', 'LOW')
df.columns = df.columns.str.replace('value', 'PRICE')
....
But I think there should be a better and shorter way.
You can create a dictionary with the original character as the key and the replacement as the value, and then iterate through it:
df = pd.DataFrame({'asset.new': [1, 2, 3],
                   'few': [4, 5, 6],
                   'value.issue': [7, 8, 9]})
replaceDict = {'.': '_', 'few': 'LOW', 'value': 'PRICE'}
for k, v in replaceDict.items():
    df.columns = [c.replace(k, v) for c in df.columns]
print(df)
output:
   asset_new  LOW  PRICE_issue
0          1    4            7
1          2    5            8
2          3    6            9
or:
df.columns = df.columns.to_series().replace([r"\.", "few", "value"], ['_', 'LOW', 'PRICE'], regex=True)
Produces the same output.
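Another option, a sketch not taken from the answers above, is DataFrame.rename with a callable, which applies all the plain-string replacements in a single pass over each name:
import pandas as pd

df = pd.DataFrame({'asset.new': [1, 2, 3],
                   'few': [4, 5, 6],
                   'value.issue': [7, 8, 9]})
replaceDict = {'.': '_', 'few': 'LOW', 'value': 'PRICE'}

def fix_name(name):
    # apply every replacement in turn (no regex involved)
    for old, new in replaceDict.items():
        name = name.replace(old, new)
    return name

df = df.rename(columns=fix_name)
print(df.columns.tolist())  # ['asset_new', 'LOW', 'PRICE_issue']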
Use Series.replace with a dictionary - it is also necessary to escape the . because it is a special regex character:
d = {r'\.': '_', 'few': 'LOW', 'value': 'PRICE'}
df.columns = df.columns.to_series().replace(d, regex=True)
More general solution with re.escape:
import re
d = {'.': '_', 'few': 'LOW', 'value': 'PRICE'}
d1 = {re.escape(k): v for k, v in d.items()}
df.columns = df.columns.to_series().replace(d1, regex=True)
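Run end to end on the example frame from above, this gives:
import re
import pandas as pd

df = pd.DataFrame({'asset.new': [1, 2, 3],
                   'few': [4, 5, 6],
                   'value.issue': [7, 8, 9]})
d = {'.': '_', 'few': 'LOW', 'value': 'PRICE'}
d1 = {re.escape(k): v for k, v in d.items()}  # '.' becomes '\.'
df.columns = df.columns.to_series().replace(d1, regex=True)
print(df.columns.tolist())  # ['asset_new', 'LOW', 'PRICE_issue']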
I have 10 dataframes (e.g. dfc, df1, df2, df3, df4, dft1, dft2, dft3, dft4, dft5) and I want to check the length of each one. If the length of a dataframe is less than 2, I want to add its name to an empty list. How can I do this?
You can store the dataframes in a dictionary using their names as keys and then iterate over the dictionary:
dic = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4}
d = []
for k, v in dic.items():
    if len(v) < 2:
        d.append(k)
print(d)
You can also use a list comprehension instead of the for loop:
dic = {'df1': df1,'df2': df2,'df3': df3,'df4': df4}
d = [k for k, v in dic.items() if len(v) < 2]
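For example, with two toy frames (made-up data):
import pandas as pd

df1 = pd.DataFrame({'a': [1]})        # 1 row -> too short
df2 = pd.DataFrame({'a': [1, 2, 3]})  # 3 rows -> long enough
dic = {'df1': df1, 'df2': df2}
d = [k for k, v in dic.items() if len(v) < 2]
print(d)  # ['df1']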
If I understand you correctly, you want to collect the names of the short dataframes in a list. Iterating over a list of name strings would test the length of the names rather than of the frames themselves, so pair the names with the actual dataframe objects. I would do it like this:
names = ['dfc', 'df1', 'df2', 'df3', 'df4', 'dft1', 'dft2', 'dft3', 'dft4', 'dft5']
frames = [dfc, df1, df2, df3, df4, dft1, dft2, dft3, dft4, dft5]
short_dataframes = []  # the empty list
for name, frame in zip(names, frames):
    if len(frame) < 2:
        short_dataframes.append(name)  # adds the name to the empty list
print(short_dataframes)
If, for example, only dfc has fewer than 2 rows, the print gives ['dfc']
Say I have a long list and I want to iteratively join them to produce a final dataframe.
The data is originally in dict so I need to iterate over the dictionary first.
header = ['apple', 'pear', 'cocoa']
for key, value in data.items():
    for idx in header:
        # Flatten the dictionary to a dataframe
        data_df = pd.json_normalize(data[key][idx])
        # Here I start to get lost.....
How can I iteratively join the dataframe?
Manually it can be done like this:
data_df = pd.json_normalize(data["ParentKey"]['apple'])
data_df1 = pd.json_normalize(data["ParentKey"]['pear'])
final_df = data_df1.join(data_df, lsuffix='_left')
# or
final_df = pd.concat([data_df, data_df1], axis=1, sort=False)
Since the list will be large, I want to iterate instead. How can I achieve this?
Is this what you're looking for? You can use k as a counter to indicate whether or not it's the first iteration, and for later ones join to the accumulated dataframe:
header = ['apple', 'pear', 'cocoa']
for key, value in data.items():
    k = 0
    for idx in header:
        data_df = pd.json_normalize(data[key][idx])
        if k == 0:
            final_df = data_df
        else:
            final_df = final_df.join(data_df, lsuffix='_left')
        k += 1
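An alternative sketch (with a toy stand-in for the question's nested dict): collect the normalized frames in a list and concatenate once at the end, which avoids the repeated joins inside the loop:
import pandas as pd

# Toy stand-in for the question's data
data = {'ParentKey': {'apple': [{'a': 1}], 'pear': [{'b': 2}], 'cocoa': [{'c': 3}]}}
header = ['apple', 'pear', 'cocoa']
for key, value in data.items():
    # normalize each sub-dict once, then concatenate side by side in one call
    frames = [pd.json_normalize(data[key][idx]) for idx in header]
    final_df = pd.concat(frames, axis=1, sort=False)
print(final_df)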
How can you combine multiple columns from a dataframe into a list?
Input:
df = pd.DataFrame(np.random.randn(10000, 7), columns=list('ABCDEFG'))
If I wanted to create a list from column A I would perform:
df1 = df['A'].tolist()
But if I wanted to combine numerous columns into this list, it wouldn't be efficient to write df['A','B','C'...'Z'].tolist()
I have tried the following, but it just adds the column headers to a list.
df1 = list(df.columns)[0:8]
Example input:
A B C D E F G
0 0.787576 0.646178 -0.561192 -0.910522 0.647124 -1.388992 0.728360
1 0.265409 -1.919283 -0.419196 -1.443241 -2.833812 -1.066249 0.553379
2 0.343384 0.659273 -0.759768 0.355124 -1.974534 0.399317 -0.200278
Intended Output:
[0.787576, 0.646178, -0.561192, -0.910522, 0.647124, -1.388992, 0.728360,
0.265409, -1.919283, -0.419196, -1.443241, -2.833812, -1.066249, 0.553379,
0.343384, 0.659273, -0.759768, 0.355124, -1.974534, 0.399317, -0.200278]
Is this what you are looking for?
lst = df.values.tolist()
flat_list = [item for x in lst for item in x]
print(flat_list)
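If the frame is all-numeric, an equivalent flattening without the Python-level loop is (assuming pandas >= 0.24 for to_numpy; use df.values on older versions):
flat_list = df.to_numpy().ravel().tolist()  # row-major, matching the intended output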
You can use to_dict:
df = pd.DataFrame(np.random.randn(10, 10), columns=list('ABCDEFGHIJ'))
df.to_dict('list')
Out[1036]:
{'A': [-0.5611441440595607,
-0.3785906500723589,
-0.19480328695097676,
-0.7472526275034221,
-2.4232786057647457,
0.10506614562827334,
0.4968179288412277,
1.635737019365132,
-1.4286421753281746,
0.4973223222844811],
'B': [-1.0550082961139444,
-0.1420067090193365,
0.30130476834580633,
1.1271866812852227,
0.38587456174846285,
-0.531163142682951,
-1.1335754634118729,
0.5975963084356348,
-0.7361022807495443,
1.4329395663140427],
...}
Or use values.tolist():
df[list('ABC')].values.tolist()
Out[1041]:
[[0.09552771302434987, 0.18551596484768904, -0.5902249875268607],
[-1.5285190712746388, 1.2922627021799646, -0.8347422966138306],
[-0.4092028716404067, -0.5669107267579823, 0.3627970727410332],
[-1.3546346273319263, -0.9352316948439341, 1.3568726575880614],
[-1.3509518030469496, 0.10487182694997808, -0.6902134363370515]]
Edit: to flatten the nested lists, use np.concatenate(df[list('ABC')].T.values.tolist()). Note that the .T makes this flatten column by column; drop it to flatten row by row, as in the intended output above.
I have a pandas df that contains 2 columns 'start' and 'end' (both are integers). I would like an efficient method to search for rows such that the range that is represented by the row [start,end] contains a specific value.
Two additional notes:
It is possible to assume that ranges don't overlap
The solution should support a batch mode: given a list of inputs, the output should be a mapping (a dictionary or whatever) from each input to the index of the row whose range contains it.
For example:
start end
0 7216 7342
1 7343 7343
2 7344 7471
3 7472 8239
4 8240 8495
and the query of
[7215,7217,7344]
will result in
{7217: 0, 7344: 2}
Thanks!
Brute force solution, could use lots of improvements though.
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]})
search = [7215, 7217, 7344]
res = {}
for i in search:
    mask = (df.start <= i) & (df.end >= i)
    idx = df[mask].index.values
    if len(idx):
        res[i] = idx[0]
print(res)
Yields
{7344: 2, 7217: 0}
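Since the question guarantees non-overlapping ranges, a vectorized alternative (a sketch, not part of the original answers) is pd.IntervalIndex, whose get_indexer returns the position of the containing interval, or -1 if there is none:
import pandas as pd

df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]})
query = [7215, 7217, 7344]

# closed='both' makes both endpoints inclusive
intervals = pd.IntervalIndex.from_arrays(df['start'], df['end'], closed='both')
pos = intervals.get_indexer(query)  # -1 where no interval contains the value
res = {q: df.index[p] for q, p in zip(query, pos) if p != -1}
print(res)  # {7217: 0, 7344: 2}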
Selected solution
This new solution could have better performance. But there is a limitation: it only works if there are no gaps between ranges, as in the example provided.
# Test data
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start','end'])
query = [7215,7217,7344]
# Reshaping the original DataFrame
df = df.reset_index()
df = pd.concat([df['start'], df['end']]).reset_index()
df = df.set_index(0).sort_index()
# Creating a DataFrame with a continuous index
max_range = max(df.index) + 1
min_range = min(df.index)
s = pd.DataFrame(index=range(min_range,max_range))
# Joining them
s = s.join(df)
# Filling the gaps
s = s.bfill()   # fillna(method='backfill') is deprecated in modern pandas
# The join duplicates labels where start == end, and .loc with missing
# labels raises a KeyError in modern pandas, so deduplicate and reindex
s = s[~s.index.duplicated()]
s.reindex(query).dropna()['index'].astype(int).to_dict()
# Result
{7217: 0, 7344: 2}
Previous proposal
import numpy as np
import pandas as pd

# Test data
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start','end'])
# Constructing a DataFrame containing the query numbers
query = [7215,7217,7344]
result = pd.DataFrame(np.tile(query, (len(df), 1)), columns=query)
# Merging the data and the query
df = pd.concat([df, result], axis=1)
# Making the test
df = df.apply(lambda x: (x >= x['start']) & (x <= x['end']), axis=1).loc[:,query]
# Keeping only values found
df = df[df==True]
df = df.dropna(how='all', axis=0).dropna(how='all', axis=1)  # axis=(0,1) is no longer supported
# Extracting to the output format
result = df.to_dict('split')
result = dict(zip(result['columns'], result['index']))
# The result
{7217: 0, 7344: 2}
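A further batch-mode sketch (again an addition, not part of the original answers) uses pd.merge_asof to match each query to the nearest preceding start and then verifies it falls inside that range:
import pandas as pd

df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]})
query = [7215, 7217, 7344]

# merge_asof needs both sides sorted on the join keys
q = pd.DataFrame({'q': sorted(query)})
m = pd.merge_asof(q, df.reset_index().sort_values('start'),
                  left_on='q', right_on='start')

# keep only queries that actually fall inside the matched range
m = m[m['q'] <= m['end']]
result = dict(zip(m['q'], m['index'].astype(int)))
print(result)  # {7217: 0, 7344: 2}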