I have a Pandas dataframe (approx. 100k rows) as my input. It is an export from a database, and each field in one of the columns contains one or more records which I need to expand into independent records. For example:
record_id  text_field
0          r0_sub_record1_field1#r0_sub_record1_field2#r0_sub_record2_field1#r0_sub_record2_field2#
1          sub_record1_field1#sub_record1_field2#
2          sub_record1_field1#sub_record1_field2#sub_record2_field1#sub_record2_field2#sub_record3_field1#sub_record3_field2#
The desired result should look like this:
record_id  field1                 field2                 original_record_id
0          r0_sub_record1_field1  r0_sub_record1_field2  0
1          r0_sub_record2_field1  r0_sub_record2_field2  0
2          r1_sub_record1_field1  r1_sub_record1_field2  1
3          r2_sub_record1_field1  r2_sub_record1_field2  2
4          r2_sub_record2_field1  r2_sub_record2_field2  2
5          r2_sub_record3_field1  r2_sub_record3_field2  2
It is quite straightforward to extract the data I need using a loop (roughly like the sketch below), but I suspect that is neither the most efficient nor the nicest way.
As I understand it, I cannot use apply or map here, because I am building another dataframe with the extracted data.
Is there a good Pythonic, Pandas-style way to solve the problem?
I am using Python 3.7 and Pandas 1.2.1.
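A minimal sketch of that loop approach, assuming the input frame is called df and each sub-record contributes exactly two #-separated fields:
import pandas as pd

# Walk the rows, split the packed text field, and emit one output row per
# pair of fields (two fields per sub-record is an assumption of this sketch).
rows = []
for _, row in df.iterrows():
    parts = [p for p in row['text_field'].split('#') if p]
    for i in range(0, len(parts), 2):
        rows.append({'field1': parts[i],
                     'field2': parts[i + 1],
                     'original_record_id': row['record_id']})
result = pd.DataFrame(rows).rename_axis('record_id').reset_index()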
I think you need to explode based on # then split the # text.
df1 = (df.assign(t=df['text_field'].str.split('#'))
         .drop(columns='text_field')
         .explode('t')
         .reset_index(drop=True))
df2 = df1.join(df1['t'].str.split('#', expand=True)).drop(columns='t')
print(df2.dropna())
record_id 0 1
0 0 r0_sub_record1_field1 r0_sub_record1_field2
1 0 r0_sub_record2_field1 r0_sub_record2_field2
3 1 sub_record1_field1 sub_record1_field2
5 2 sub_record1_field1 sub_record1_field2
6 2 sub_record2_field1 sub_record2_field2
7 2 sub_record3_field1 sub_record3_field2
Is this what you expect?
out = df['text_field'].str.strip('#').str.split('#').explode() \
.str.split('#').apply(pd.Series)
prefix = 'r' + out.index.map(str) + '_'
out = out.apply(lambda v: prefix + v).reset_index() \
         .rename(columns={0: 'field1', 1: 'field2', 'index': 'original_record_id'})
>>> out
original_record_id field1 field2
0 0 r0_sub_record1_field1 r0_sub_record1_field2
1 0 r0_sub_record2_field1 r0_sub_record2_field2
2 1 r1_sub_record1_field1 r1_sub_record1_field2
3 2 r2_sub_record1_field1 r2_sub_record1_field2
4 2 r2_sub_record2_field1 r2_sub_record2_field2
5 2 r2_sub_record3_field1 r2_sub_record3_field2
I would like to transform a table which looks similar to the one below:
X|Y|Z|
1|2|3|
3|5|2|
4|2|1|
The result I want to achieve should look like this:
col|1|2|3|4|5|
X |1|0|1|0|0|
Y |0|2|0|0|1|
Z |1|1|1|0|0|
So, after the transformation the new columns should be the unique values from the previous table, the values should be the counts of each appearance, and the index should hold the old column names.
I got stuck and do not know how to handle this, because I am a newbie in Python, so thanks in advance for the support.
Regards,
guddy_7
Use apply with value_counts, replace missing values with 0, and transpose with T:
df = df.apply(pd.value_counts).fillna(0).astype(int).T
print (df)
1 2 3 4 5
X 1 0 1 1 0
Y 0 2 0 0 1
Z 1 1 1 0 0
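For reference, a minimal reproducible sketch using the table from the question (recent pandas versions deprecate the top-level pd.value_counts, so pd.Series.value_counts is used here; it behaves the same):
import pandas as pd

df = pd.DataFrame({'X': [1, 3, 4], 'Y': [2, 5, 2], 'Z': [3, 2, 1]})

# Count each value per column, fill the gaps with 0, then transpose
out = df.apply(pd.Series.value_counts).fillna(0).astype(int).T
print(out)  # reproduces the X/Y/Z-by-1..5 table shown above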
I am trying to add additional index rows to an existing pandas dataframe after loading csv data into it.
So let's say I load my data like this:
from io import StringIO
import pandas as pd

columns = ['Relative_Pressure', 'Volume_STP']
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True, index_col=False, header=None)
df.columns = columns
where contents is a string in csv format. The resulting DataFrame might look something like this:
For clarity, I would now like to add additional index rows to the DataFrame's columns, as shown here:
However, in that link the multiple index rows are generated right when the DataFrame is created. I would like to add rows such as unit or descr to the columns of an existing DataFrame.
How could I do this?
You can create a MultiIndex on the columns by specifically creating the index and then assigning it to the columns separately from reading in the data.
I'll use the example from the link you provided. The first method is to create the MultiIndex when you make the dataframe:
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
As stated, this is not what you are after. Instead, you can make the dataframe (from your file for example) and then make the MultiIndex from a set of lists and then assign it to the columns:
df = pd.DataFrame({'desc A':[1,2,3], 'desc B':[4,5,6]})
# Output
desc A desc B
0 1 4
1 2 5
2 3 6
# Create a multiindex from lists
index = pd.MultiIndex.from_arrays((['A', 'B'], [1, 2], ['desc A', 'desc B']))
# Assign to the columns
df.columns = index
# Output
A B
1 2
desc A desc B
0 1 4
1 2 5
2 3 6
# Name the columns
df.columns.names = ['NAME','LENGTH','DESCRIPTION']
# Output
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
There are other ways to construct a MultiIndex, for example, from_tuples and from_product. You can read more about Multi Indexes in the documentation.
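Applied to the columns from the question, a hedged sketch (the data, units, and descriptions below are invented for illustration):
import pandas as pd

# Stand-in for the frame read from csv; values are made up
df = pd.DataFrame({'Relative_Pressure': [0.01, 0.05, 0.10],
                   'Volume_STP': [10.2, 25.4, 40.1]})

# Attach hypothetical unit and description rows as extra column levels
df.columns = pd.MultiIndex.from_arrays(
    [df.columns, ['p/p0', 'cm3/g STP'], ['relative pressure', 'adsorbed volume']],
    names=['NAME', 'UNIT', 'DESCR'])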
I have a dataframe which looks like:
ID1 ID2 Issues Value1 Value2 IssueDate
1 1 1 56.85490855 9.489650847 02/12/2015
1 1 2 89.55441203 23.60227363 07/02/2015
1 2 1 21.8456428 23.37353082 01/10/2015
2 2 1 55.10795933 1.928443984 13/08/2015
2 2 2 10.22459873 24.44298882 07/04/2015
4 1 1 55.29748656 6.308424035 19/02/2015
and I want to turn it into multiple dataframes (the one below shows Value1; imagine a second one for Value2) which look like:
Value 1
2015_1 2015_2 2015_3 2015_4 2015_5 2015_6 2015_7 2015_8 2015_9 2015_10 2015_11 2015_12
ID1 ID2
1 1 89.55441203 56.85490855
1 2 21.8456428
2 2 10.22459873 55.10795933
4 1 55.29748656
The only way I can work out how to do this is to use a lambda function and add values in specific ranges to the associated columns. The problem is that my dataset is very large, and doing this line by line, looping over every possible month/year combination, would take a very long time.
Is there a clever way to use masks or melt to reshape the data into the tables I am looking for?
I guess you are looking for something like this:
df.IssueDate = pd.to_datetime(df.IssueDate)
df['Date'] = df.IssueDate.dt.year.astype(str) + '_' + df.IssueDate.dt.month.astype(str)
pd.pivot_table(df[['ID1', 'ID2', 'Value1', 'Date']], columns='Date', index=['ID1', 'ID2'])
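If you need one table per value column, the same pivot can be repeated with an explicit values argument; a sketch assuming the df prepared above (pivot_table averages duplicate ID1/ID2/Date combinations by default):
value1 = pd.pivot_table(df, values='Value1', index=['ID1', 'ID2'], columns='Date')
value2 = pd.pivot_table(df, values='Value2', index=['ID1', 'ID2'], columns='Date')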
I have a pandas DataFrame which contains information in columns which I would like to extract into a new column.
It is best explained visually:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number Type 1': [1, 2, np.nan],
                   'Number Type 2': [np.nan, 3, 4],
                   'Info': list('abc')})
The table shows the initial DataFrame with the Number Type 1 and Number Type 2 columns.
I would like to extract the types and create a new Type column, refactoring the DataFrame accordingly.
Basically, the numbers are collapsed into a single Number column and the types are extracted into a Type column. The information in the Info column stays bound to the numbers (e.g. 2 and 3 share the same information, b).
What is the best way to do this in Pandas?
Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
4 b 2 3
5 c 2 4
Another solution with set_index and stack:
df = df.set_index('Info').stack().rename_axis(('Info','Type')).reset_index(name='Number')
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
2 b 2 3
3 c 2 4
The documentation on DataFrame.query() is very terse (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html), and I was also unable to find examples of projections (selecting columns) via web search.
So I tried simply providing the column names: that gave a syntax error. Likewise for typing select and then the column names. So, how do I do this?
After playing around with this for a while and reading through the source code for DataFrame.query, I can't figure out a way to do it.
If it's not impossible, apparently it's at least strongly discouraged. When this question came up on github, prolific Pandas dev/maintainer jreback suggested using df.eval() for selecting columns and df.query() for filtering on rows.
UPDATE:
javadba points out that the return value of eval is not a dataframe. For example, to flesh out jreback's example a bit more...
df.eval('A')
returns a Pandas Series, but
df.eval(['A', 'B'])
does not return a DataFrame; it returns a list (of Pandas Series).
So it seems ultimately the best way to maintain flexibility to filter on rows and columns is to use iloc/loc, e.g.
df.loc[0:4, ['A', 'C']]
output
A C
0 -0.497163 -0.046484
1 1.331614 0.741711
2 1.046903 -2.511548
3 0.314644 -0.526187
4 -0.061883 -0.615978
DataFrame.query is more like the WHERE clause of a SQL statement than the SELECT part.
import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
To select a column or columns you can use the following:
df['A'] or df.loc[:,'A']
or
df[['A','B']] or df.loc[:,['A','B']]
To use the .query method you do something like
df.query('A > B'), which returns all the rows where the value in column A is greater than the value in column B.
A B C D
2000-01-03 1.265936 -0.866740 -0.678886 -0.094709
2000-01-04 1.491390 -0.638902 -0.443982 -0.434351
2000-01-05 2.205930 2.186786 1.004054 0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589
which, in my opinion, is more readable than boolean index selection with
df[df['A'] > df['B']]
How about
df_new = df.query('col1==1 & col2=="x" ')[['col1', 'col3']]
This would filter rows where col1 equals 1 and col2 equals "x", and return only the col1 and col3 columns.
Note that you do need to filter on rows for query to be useful here; for selecting columns only, .loc or .iloc is the better tool.
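A minimal runnable sketch of that pattern, using a made-up frame with columns col1, col2, col3:
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2],
                   'col2': ['x', 'y', 'x'],
                   'col3': [10, 20, 30]})

# Filter rows with query, then select the wanted columns
out = df.query('col1 == 1 & col2 == "x"')[['col1', 'col3']]
print(out)
#    col1  col3
# 0     1    10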
pandasql
https://pypi.python.org/pypi/pandasql/0.1.0
Here is an example from the following blog: http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html. The inputs are two DataFrames, meat and births, and this approach gives the projections, filtering, aggregation, and sorting expected from SQL.
@maxpower did mention this package is buggy, so let's see: at least the code from the blog, shown below, works fine.
from pandasql import sqldf, load_meat, load_births

pysqldf = lambda q: sqldf(q, globals())
q = """
SELECT
m.date
, m.beef
, b.births
FROM
meat m
LEFT JOIN
births b
ON m.date = b.date
WHERE
m.date > '1974-12-31';
"""
meat = load_meat()
births = load_births()
df = pysqldf(q)
The output is a pandas DataFrame as desired.
It is working great for my particular use case (evaluating US crime data):
odf = pysqldf("select %s from df where sweapons > 10 order by sweapons desc limit 10" %scols)
p('odf\n', odf)
odf
: SMURDER SRAPE SROBBERY SAGASSLT SOTHASLT SVANDLSM SWEAPONS
0 0 0 0 1 1 10 54
1 0 0 0 0 1 0 52
2 0 0 0 0 1 0 46
3 0 0 0 0 1 0 43
4 0 0 0 0 1 0 33
5 1 0 2 16 28 4 32
6 0 0 0 7 17 4 30
7 0 0 0 0 1 0 29
8 0 0 0 7 16 3 29
9 0 0 0 1 0 5 28
Update: I have done a bunch of stuff with pandasql now (calculated fields, limits, aliases, cascaded dataframes); it is just so productive.
Another update (3 years later): this still works, but be warned that it is very slow (seconds vs. milliseconds).
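For illustration, a hedged sketch of the kind of query described in the first update (an alias, a calculated field, a LIMIT), reusing the pysqldf helper and the crime columns shown earlier; the specific aggregation is invented:
# 'df' is any in-scope DataFrame with the SMURDER/SRAPE/SROBBERY/SWEAPONS columns
odf = pysqldf("""
    SELECT smurder + srape + srobbery AS violent,
           sweapons
    FROM df
    ORDER BY violent DESC
    LIMIT 5
""")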
Just a simpler example solution (using get):
My goal:
I want the lat and lon columns out of the result of the query.
My table details:
df_city.columns
Index(['name', 'city_id', 'lat', 'lon', 'CountryName',
'ContinentName'], dtype='object')
# All columns
city_continent = df_city.get(df_city['ContinentName']=='Oceania')
# Only lat and lon
city_continent[['lat', 'lon']]
lat lon
113883 -19.12753 -169.84623
113884 -19.11667 -169.90000
113885 -19.10000 -169.91667
113886 -46.33333 168.85000
113887 -46.36667 168.55000
... ... ...
347956 -23.14083 113.77630
347957 -31.48023 131.84242
347958 -28.29967 153.30142
347959 -35.60358 138.10548
347960 -35.02852 117.83416
3712 rows × 2 columns
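For comparison, the same selection can be done in a single step with .loc on the df_city frame above:
# Boolean mask for the rows plus an explicit column list in one .loc call
city_latlon = df_city.loc[df_city['ContinentName'] == 'Oceania', ['lat', 'lon']]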