Python DataFrame: split into lists

I have a DataFrame df of, say, one thousand rows, and I'd like to split it into a list of 10 DataFrames of 100 rows each. So:
zero = df[0:100]
one = df[100:200]
two = df[200:300]
...
nine = df[900:1000]
What would be a good way to do this, preferably a one-liner?

Assuming the index is a running integer (use .reset_index(drop=True) if it isn't):
[d for g,d in df.groupby(df.index//100)]
Returns a list of dataframes.

Like this, maybe:
size = 100
list_of_dfs = [df.loc[i:i+size-1, :] for i in range(0, len(df), size)]
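Another option, a sketch assuming NumPy is acceptable: np.array_split also copes with row counts that aren't an exact multiple of the chunk size (note that recent pandas versions deprecate passing DataFrames through this path, in which case the slicing approaches above are safer):
import numpy as np

# split df into 10 roughly equal pieces; returns a list of DataFrames
list_of_dfs = np.array_split(df, 10)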

Related

How to split a dictionary of df in half using pandas?

I have a very large dictionary of dataframes. It contains around 250 dataframes, each with around 50 columns. My goal is to concat them into one large df; however, as you can imagine, that isn't great, because the result would be far too large to view outside of Python.
So my goal is to split the large dictionary of dataframes in half and turn it into two large, but manageable, files.
I will try to replicate what it looks like:
d = {df1, df2, ..., df500}
df = pd.concat(d)
# However, is there a way to split it 50/50?
df1 = pd.concat(d)  # only the first 250 dfs
df2 = pd.concat(d)  # only the last 250 dfs
How about something like this?
v = list(d.values())       # the DataFrames, in insertion order
part1 = v[:len(v) // 2]    # first half
part2 = v[len(part1):]     # second half
df1 = pd.concat(part1)
df2 = pd.concat(part2)
First of all, as written it's not a dictionary, it's a set, which can be converted to a list. A list can then be divided in two as needed:
d = list(d)
ln = len(d)
d1 = d[0:ln // 2]
d2 = d[ln // 2:]
df1 = pd.concat(d1)
df2 = pd.concat(d2)
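If d really is a dict and you want to keep its keys, pd.concat also accepts a mapping and uses the keys as the outer index level. A minimal sketch along those lines (not from the original answers):
items = list(d.items())
half = len(items) // 2
df1 = pd.concat(dict(items[:half]))  # keys become the outer index level
df2 = pd.concat(dict(items[half:]))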

Pandas split DataFrame according to indices

I've been working on a pandas DataFrame:
df = pd.DataFrame({'col':[-0.217514, -0.217834, 0.844116, 0.800125, 0.824554]}, index=[49082, 49083, 49853, 49854, 49855])
As you can see, the index suddenly jumps by 770 (due to some sorting I did earlier).
Now I would like to split this DataFrame into many different ones, each made of rows whose indices follow each other (here the first two rows would be in one DataFrame while the last three would be in a different one).
Does anyone have an idea as to how to do this?
Thanks!
Group by the index minus an increasing 0, 1, 2, ... sequence: within a run of consecutive indices the difference stays constant, so each run becomes its own group. Then collect each group as a separate DataFrame in a list:
import numpy as np

all_dfs = [g for _, g in df.groupby(df.index - np.arange(len(df.index)))]
all_dfs
output:
[            col
 49082 -0.217514
 49083 -0.217834,
             col
 49853  0.844116
 49854  0.800125
 49855  0.824554]
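A pandas-only variant of the same idea (an alternative sketch, not from the original answer): flag every position where the index step is not 1, then cumulatively sum the flags to get group labels:
all_dfs = [g for _, g in df.groupby(df.index.to_series().diff().ne(1).cumsum())]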

Filtering a large Pandas DataFrame based on a list of strings in column names

Stack Overflow Family,
I have recently started learning Python and am using Pandas to handle some factory data. The csv file is essentially a large dataframe (1621 rows × 5633 columns). While I need all the rows, as these are data for each unit, I need to filter out many unwanted columns. I have identified a list of strings in the column names that I can use to find only the wanted columns; however, I am not able to figure out what a good logic for this would be, or whether there are any built-in Python functions for it.
dropna is not an option for me, as some of the wanted columns have NA as values (for example, test limit).
dropna for columns that are all NA is also not good enough, as I would still end up with a large number of columns.
Looking for some guidance here. Thank you for your time.
If you have a list of valid columns you can just use df.filter(cols_subset, axis=1) to drop everything else.
You could use a regex to also match substrings from your list in column names:
df.filter(regex='|'.join(cols_subset), axis=1)
Or you could match only columns starting with a substring from your list:
df.filter(regex='^('+'|'.join(cols_subset)+')', axis=1)
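One caveat, an addition not from the original answer: if any entry in cols_subset contains regex metacharacters (., +, parentheses), escape the entries first:
import re
df.filter(regex='|'.join(map(re.escape, cols_subset)), axis=1)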
EDIT:
Given the time complexity of my previous solution, I came up with an alternative using a list comprehension:
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
app = ["app", "ban"]
new_list = [x for x in fruits if any(y in x for y in app)]
output:
['apple', 'banana']
This should only display the columns you need. In your case you just need to do:
my_strings = ["A", "B", ...]
new_list = [x for x in df.columns if any(y in x for y in my_strings)]
print(new_list)
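To actually subset the DataFrame with that list (assuming that is the intended final step):
df_filtered = df[new_list]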
If you know the exact column names, you could do something like this:
unwanted_cols = ['col1', 'col4'] #list of unwanted cols names
df_cleaned = current_df.drop(unwanted_cols, axis=1)
# or
current_df.drop(unwanted_cols, inplace=True, axis=1)
If you don't know the exact column names, you could first retrieve all of the columns:
all_cols = current_df.columns.tolist()
then apply a regex to the column names to find the ones that match your list of strings, and apply the same code as above.
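A minimal sketch of that idea, with a hypothetical wanted_substrings list standing in for your list of strings:
import re

wanted_substrings = ['temp', 'pressure']  # hypothetical substrings to keep
pattern = re.compile('|'.join(map(re.escape, wanted_substrings)))
all_cols = current_df.columns.tolist()
unwanted_cols = [c for c in all_cols if not pattern.search(c)]
df_cleaned = current_df.drop(unwanted_cols, axis=1)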
You can drop columns from a dataframe by applying str.contains with a regular expression to the column names. The example below drops every column whose name starts with 'abc':
df.drop(df.columns[df.columns.str.contains('^abc')], axis=1)

compare list of dictionaries to dataframe, show missing values

I have a list of dictionaries
example_list = [{'email':'myemail#email.com'},{'email':'another#email.com'}]
and a dataframe with an 'Email' column
I need to compare the list against the dataframe and return the values that are not in the dataframe.
I can certainly iterate over the list and check each value against the dataframe, but I was looking for a more pythonic way, perhaps a list comprehension or a map over the dataframe?
To return those values that are not in df['Email'], here are a couple of options involving set difference operations:
np.setdiff1d
import numpy as np

emails = [d['email'] for d in example_list]
diff = np.setdiff1d(emails, df['Email'])  # returns a sorted array of unique values
set.difference
# returns a set
diff = set(d['email'] for d in example_list).difference(df['Email'])
One way is to take one set from another. For a functional solution you can use operator.itemgetter:
from operator import itemgetter
res = set(map(itemgetter('email'), example_list)) - set(df['Email'])
Note: - is syntactic sugar for set.difference.
I ended up converting the list into a dataframe, merging the two dataframes on a column, and then building a dataframe out of the missing values.
so, for example
example_list = [{'email':'myemail#email.com'},{'email':'another#email.com'}]
df_two = pd.DataFrame(example_list).rename(columns={'email': 'Email'})  # align the column name with df_one
common = df_one.merge(df_two, on=['Email'])
df_diff = df_two[~df_two.Email.isin(common.Email)]  # list entries missing from df_one
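A more direct route without the merge, sketched under the assumption that df_one holds the 'Email' column:
known = set(df_one['Email'])
missing = [d['email'] for d in example_list if d['email'] not in known]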

Combining results from a loop into a DataFrame

Working with Python Pandas 0.19.1.
I'm calling a function in a loop, and it returns a numeric list of length 4 each time. What's the easiest way of concatenating the results into a DataFrame?
I'm doing this:
result = pd.DataFrame()
for t in dates:
    result_t = do_some_stuff(t)
    result = result.append(result_t, ignore_index=True)
The problem is that it concatenates along the column instead of by rows. If dates has a length of 250, it gives a single-column df with 1000 rows. Instead, what I want is a 250 x 4 df.
I think you need to append all the pieces to a list and only then call concat. Since do_some_stuff returns a plain list of 4 numbers, wrap each one in a one-row DataFrame first:
result = []
for t in dates:
    result.append(pd.DataFrame([do_some_stuff(t)]))  # one row, 4 columns
print(pd.concat(result, ignore_index=True))  # 250 x 4
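Simpler still, if do_some_stuff really returns a plain length-4 list, the whole frame can be built in one call (a sketch, not from the original answer):
result = pd.DataFrame([do_some_stuff(t) for t in dates])  # 250 rows x 4 columns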
