Working with Python Pandas 0.19.1.
I'm calling a function in a loop; each call returns a numeric list of length 4. What's the easiest way of concatenating the results into a DataFrame?
I'm doing this:
result = pd.DataFrame()
for t in dates:
    result_t = do_some_stuff(t)
    result.append(result_t, ignore_index=True)
The problem is that the values end up in a single column instead of as rows: if dates has a length of 250, it gives a single-column df with 1000 rows. Instead, what I want is a 250 x 4 df.
I think you need to append all the results to a list and then use concat. Note that pd.concat only accepts Series/DataFrame objects, so wrap each returned list in a Series; concatenating along axis=1 and transposing gives the 250 x 4 shape:
result = []
for t in dates:
    result.append(pd.Series(do_some_stuff(t)))
print(pd.concat(result, axis=1).T)
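Since each call returns a plain list of equal length, an even simpler route is to hand the list of lists straight to the DataFrame constructor. A minimal sketch (do_some_stuff here is a hypothetical stand-in for the question's function):

import pandas as pd

# Hypothetical stand-in: returns a numeric list of length 4.
def do_some_stuff(t):
    return [t, t + 1, t + 2, t + 3]

dates = range(250)
rows = [do_some_stuff(t) for t in dates]

# Each inner list becomes one row: shape (250, 4).
result = pd.DataFrame(rows)
print(result.shape)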
I want to select some rows from an existing Pandas DataFrame based on a condition and insert them into a new DataFrame.
At first, I tried this way:
second_df = pd.DataFrame()
for specific_idx in specific_idx_set:
    second_df = existing_df.iloc[specific_idx]
len(specific_idx_set), second_df.shape  # => 1000, (15,)
As you can see, I'm iterating over a set of 1000 indexes. However, after adding these 1000 rows to the new Pandas DataFrame (second_df), I saw that only one of the rows was stored in it, while I expected to see 1000 rows with 15 columns.
So I tried a new way:
specific_rows = list()
for specific_val in specific_idx_set:
    specific_rows.append(existing_df[existing_df[col] == specific_val])
new_df = pd.DataFrame(specific_rows)
And I got this error:
ValueError: Must pass 2-d input. shape=(1000, 1, 15)
Then, I wrote this code:
specific_rows = list()
new_df = pd.DataFrame()
for specific_val in specific_idx_set:
    specific_rows.append(existing_df[existing_df[col] == specific_val])
pd.concat([new_df, specific_rows])
But I got this error:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
You need to modify your last solution: remove the empty DataFrame and pass concat a list of DataFrames only:
specific_rows = list()
for specific_val in specific_idx_set:
    specific_rows.append(existing_df[existing_df[col] == specific_val])
out = pd.concat(specific_rows)
The problem with your solution is that if you pass concat a list mixing a DataFrame with a plain list, an error is raised:
pd.concat([new_df, specific_rows])
# specific_rows is a list
# new_df is a DataFrame
If you need to include the DataFrame as well, join the lists: adding the one-element list [new_df] to the list specific_rows produces a single list of DataFrames:
pd.concat([new_df] + specific_rows)
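As an aside, if the goal is simply to keep every row whose value in col belongs to the set, a single isin mask avoids the loop entirely. A minimal sketch using the question's names (existing_df, col, specific_idx_set):

# One boolean mask instead of one filter per value.
out = existing_df[existing_df[col].isin(specific_idx_set)]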
I have a very large dictionary of dataframes. It contains around 250 dataframes, each of which has around 50 columns. My goal is to concat the dataframes to create one large df; however, as you can imagine, this isn't great because the result would be way too large to view outside of Python.
My goal is to split the large dictionary of dataframes in half and turn it into two large, but manageable, files.
I will try to replicate what it looks like:
d = {df1, df2,........,df500}
df = pd.concat(d)
# However, is there a way to split it 50/50?
df1 = pd.concat(d)  # only gets the first 250 dfs
df2 = pd.concat(d)  # only gets the last 250 dfs
How about something like this?
v = list(d.values())
part1 = v[:len(v)//2]
part2 = v[len(part1):]
df1 = pd.concat(part1)
df2 = pd.concat(part2)
First of all, as written d = {df1, df2, ...} is not a dictionary, it's a set, which can be converted to a list.
A list can then be divided in two as you need:
d = list(d)
ln = len(d)
d1 = d[0:ln//2]   # first half
d2 = d[ln//2:]    # second half
df1 = pd.concat(d1)
df2 = pd.concat(d2)
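If the collection ever needs to be split into more than two parts, the same slicing idea generalizes. A sketch under the assumption that dfs is a plain list of DataFrames (split_concat is a hypothetical helper, not a pandas function):

import pandas as pd

def split_concat(dfs, n_parts):
    """Concat a list of DataFrames into n_parts roughly equal DataFrames."""
    size = -(-len(dfs) // n_parts)  # ceiling division: DataFrames per chunk
    return [pd.concat(dfs[i:i + size]) for i in range(0, len(dfs), size)]

# e.g. two halves, as in the question:
# df1, df2 = split_concat(d, 2)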
I have a pandas df of 7000 rows x 7 columns, and a list (row_list) containing the values I want to filter on.
What I want to do is select the rows of df whose value in a given column matches one of the values in the list.
This is what I got when I tried:
Empty DataFrame
Columns: [A, B, C, D, E, F, G]
Index: []
df = pd.read_csv('filename.csv')
df1 = pd.read_csv('filename1.csv', names = 'A')
row_list = []
for index, rows in df1.iterrows():
    my_list = [rows.A]
    row_list.append(my_list)
boolean_series = df.D.isin(row_list)
filtered_df = df[boolean_series]
print(filtered_df)
replace
boolean_series = df.D.isin(row_list)
with
boolean_series = df.D.isin(df1.A)
and let us know the result. If it doesn't work, show a sample of df and df1.A.
(1) generate separate dfs for each condition, concat them, then dedup (slow)
(2) write a custom function that annotates a bool column (default False, set to True when the condition is fulfilled), then filter based on that column
(3) keep a list of the indices of all rows holding your row_list values, then filter with iloc based on that indices list
Without an MRE, sample data, or a reason why your method didn't work, it's difficult to provide a more specific answer.
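For completeness, the reason the original attempt prints an empty frame is that the loop builds row_list as a list of one-element lists, so isin compares each cell against a list object and never matches. Flattening the list first fixes it; a sketch reusing the question's names:

# Build a flat list of values rather than a list of single-element lists.
row_list = df1['A'].tolist()
filtered_df = df[df['D'].isin(row_list)]
print(filtered_df)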
I've a DataFrame df of, let us say, one thousand rows, and I'd like to split it into 10 lists where each list contains a DataFrame of 100 rows. So:
zero = df[0:100]
one = df[100:200]
two = df[200:300]
...
nine = df[900:1000]
What could be a good (preferably one-line) way to do this?
Assuming the index is a running integer (use .reset_index() first if not):
[d for g,d in df.groupby(df.index//100)]
Returns a list of dataframes.
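For example, on a hypothetical 1000-row frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(1000)})
# Rows 0-99 get group 0, rows 100-199 get group 1, and so on.
parts = [d for g, d in df.groupby(df.index // 100)]
print(len(parts), parts[0].shape)  # 10 (100, 1)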
Like this maybe, with size as the chunk length (100 here):
size = 100
list_of_dfs = [df.loc[i:i+size-1, :] for i in range(0, len(df), size)]
I have several dataframes in a list, obtained after using np.array_split, and I want to concat some of them into a single dataframe. In this example, I want to concat the 3 dataframes contained in b (all but the 2nd one of a, i.e. excluding a[1]):
df = pd.DataFrame({'country': ['a','b','c','d'],
                   'gdp': [1,2,3,4],
                   'iso': ['x','y','z','w']})
a = np.array_split(df,4)
i = 1
b = a[:i]+a[i+1:]
desired_final_df = pd.DataFrame({'country': ['a','c','d'],
                                 'gdp': [1,3,4],
                                 'iso': ['x','z','w']})
I have tried to create an empty df and then append the elements of b in a loop, but with no complete success:
CV = pd.DataFrame()
CV = [CV.append[(b[i])] for i in b] #try1
CV = [CV.append(b[i]) for i in b] #try2
CV = pd.DataFrame([CV.append[(b[i])] for i in b]) #try3
for i in b:
    CV.append(b) #try4
I have reached a solution which works, but it is not efficient:
CV = pd.DataFrame()
CV = [CV.append(b) for i in b][0]
In this case, CV ends up holding three copies of the same dataframe with all the rows, and I just take the first of them. However, in my real case, with big datasets, building the same thing three times costs much more computation time.
How could I do that without repeating operations?
According to the docs, DataFrame.append does not work in place the way list.append does; the resulting DataFrame object is returned instead. Catching that object should be enough for what you need:
df = pd.DataFrame()
for next_df in list_of_dfs:
    df = df.append(next_df)
You may want to use the keyword argument ignore_index=True in the append call so that the indices become continuous, instead of starting from 0 for each appended DataFrame (assuming that the index of the DataFrames in the list all start from 0).
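Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the loop should be replaced by a single concat. A sketch, with list_of_dfs as in the answer above:

import pandas as pd

# One concat instead of repeated appends; ignore_index=True renumbers
# the rows 0..n-1 instead of repeating each frame's own index.
df = pd.concat(list_of_dfs, ignore_index=True)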
To concatenate multiple DFs, resetting the index, use pandas.concat:
pd.concat(b, ignore_index=True)
output:
  country  gdp iso
0       a    1   x
1       c    3   z
2       d    4   w