Python multiple pivot from the same column

I have a dataframe with just one column with content like:
view: meta_record_extract
dimension: e_filter
type: string
hidden: yes
sql: "SELECT * FROM files"
dimension: category
type: string
...
What I am trying to produce is a dataframe with columns and data like this:
view                | dimension | label | type   | hidden | sql
meta_record_extract | e_filter  | NaN   | string | yes    | "SELECT * FROM files"
NaN                 | category  | NaN   | string | ...
Given that splitting the string data like
df.header[0].split(': ')[0]
gives me the label with [0] or the value with [1], I tried this:
df.pivot_table(df, columns=df.header.str.split(': ')[0], values=df.header.str.split(': ')[1])
but it did not work and raised an error.
Can anyone help me to achieve the result I need?

Use str.findall() + map, as follows:
str.findall() extracts the key-value pairs of each row into a list of tuples. We then map dict over these lists so that pd.DataFrame can turn them into a dataframe.
(Assuming your column is labelled Col1):
df_extract = df['Col1'].str.findall(r'(\w+):\s*(.*)')
df_result = pd.DataFrame(map(dict, df_extract))
Result:
print(df_result)
view dimension type hidden sql
0 meta_record_extract NaN NaN NaN NaN
1 NaN e_filter NaN NaN NaN
2 NaN NaN string NaN NaN
3 NaN NaN NaN yes NaN
4 NaN NaN NaN NaN "SELECT * FROM files"
5 NaN category NaN NaN NaN
6 NaN NaN string NaN NaN
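For intuition: each element of df_extract is a list of (key, value) tuples, and dict() turns that list into one record (a quick check on the first row):
print(df_extract.iloc[0])        # [('view', 'meta_record_extract')]
print(dict(df_extract.iloc[0]))  # {'view': 'meta_record_extract'}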
Update
To compress the rows and minimize the NaNs, we can further use .apply() with .dropna(), as follows:
df_compressed = df_result.apply(lambda x: pd.Series(x.dropna().values))
Result:
print(df_compressed)
view dimension type hidden sql
0 meta_record_extract e_filter string yes "SELECT * FROM files"
1 NaN category string NaN NaN
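For completeness, a self-contained version of the whole pipeline (the raw lines are copied from the question; the column name Col1 is an assumption):
import pandas as pd

# sample input: one 'key: value' string per row
lines = ['view: meta_record_extract', 'dimension: e_filter', 'type: string',
         'hidden: yes', 'sql: "SELECT * FROM files"',
         'dimension: category', 'type: string']
df = pd.DataFrame({'Col1': lines})

df_extract = df['Col1'].str.findall(r'(\w+):\s*(.*)')    # list of (key, value) tuples per row
df_result = pd.DataFrame(map(dict, df_extract))          # one sparse row per input line
df_compressed = df_result.apply(lambda x: pd.Series(x.dropna().values))
print(df_compressed)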

Related

Python Pandas maximum and minimum values of every row if header name contains str

I have a df like this and I want to get, for each row, the minimum value across the headers that contain 'ask' and the maximum value across the headers that contain 'bid':
symbol bid_Binance ask_Binance bid_Kucoin ask_Kucoin bid_Mexc ask_Mexc
819 GRINUSDT NaN NaN 0.107270 0.108250 NaN NaN
424 MITXUSDT NaN NaN 0.005252 0.005300 0.009010 0.009860
expected result:
symbol bid_Binance ask_Binance bid_Kucoin ask_Kucoin bid_Mexc ask_Mexc min_ask max_bid
819 GRINUSDT NaN NaN 0.107270 0.108250 NaN NaN 0.108250 0.107270
424 MITXUSDT NaN NaN 0.005252 0.005300 0.009010 0.009860 0.005300 0.009010
I have tried this way; it takes the max and min values of every row, but I don't know how to restrict it to certain header names:
df_merged['min_'] = df_merged.min(axis=1)
df_merged['max_'] = df_merged.max(axis=1)
You can use filter with an anchored regex:
df['min_ask'] = df.filter(regex='^ask').min(1)
df['max_bid'] = df.filter(regex='^bid').max(1)
If you don't want to anchor to the start, you can use like:
df['min_ask'] = df.filter(like='ask').min(1)
df['max_bid'] = df.filter(like='bid').max(1)
output:
symbol bid_Binance ask_Binance bid_Kucoin ask_Kucoin bid_Mexc ask_Mexc min_ask max_bid
819 GRINUSDT NaN NaN 0.107270 0.10825 NaN NaN 0.10825 0.10727
424 MITXUSDT NaN NaN 0.005252 0.00530 0.00901 0.00986 0.00530 0.00901
You can create min_ask as follows:
df.loc[:, df.columns.str.contains('ask')].apply(min, axis=1)
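One caveat on the .apply(min, axis=1) variant: Python's builtin min is order-dependent when NaN is present (min([nan, 0.1]) is nan, while min([0.1, nan]) is 0.1), so the NaN-skipping pandas reduction is safer:
df['min_ask'] = df.loc[:, df.columns.str.contains('ask')].min(axis=1)  # skips NaN by default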

Pandas, how to calculate delta between one cell and another in different rows

I have the following frame:
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2
123,45,,,
123,,46,,
123,,47,,
123,,48,,
123,,49,,
123,,51,,
124,45,,,
124,,46,,
124,,47,,
124,,48,,
124,,49,,
124,,51,,
I'd like to add a DELTA column that is (EVENT2TIME - EVENT1TIME):
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2, DELTA
123,45,,,,
123,,46,,,1
123,,47,,,2
123,,48,,,3
123,,49,,,4
123,,51,,,6
124,45,,,,
124,,46,,,1
124,,47,,,2
124,,48,,,3
124,,49,,,4
124,,51,,,6
I think the first thing to do is to copy the value from the row where EVENT1TIME is populated into the other instances of that USERID. But I suspect there may be a better way.
I am making some assumptions:
You want to calculate the difference between the EVENT2TIME column and the first row's EVENT1TIME
You want to store the results into DELTA
You can do this as follows:
import pandas as pd

df = pd.read_csv('abc.txt')
print(df)
df['DELTA'] = df.iloc[:, 2] - df.iloc[0, 1]
print(df)
The output of this will be:
USERID EVENT1TIME EVENT2TIME MISC1 MISC2 DELTA
0 123 45.0 NaN NaN NaN NaN
1 123 NaN 46.0 NaN NaN 1.0
2 123 NaN 47.0 NaN NaN 2.0
3 123 NaN 48.0 NaN NaN 3.0
4 123 NaN 49.0 NaN NaN 4.0
5 123 NaN 51.0 NaN NaN 6.0
If you know EVENT1TIME is always and only in the first row, just store it as a variable and subtract it.
val = df.EVENT1TIME[0]
df['DELTA'] = df.EVENT2TIME - val
If EVENT1TIME has values scattered every so often, use back- or forward-filling on its empty rows. The fill is not stored in the final output df.
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.ffill() # forward fill (down) all nan values
# OR
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.bfill() # back fill (up) all nan values
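If EVENT1TIME can differ between users, fill within each USERID group so values don't leak across user boundaries (a sketch, assuming the columns shown above):
df['DELTA'] = df.EVENT2TIME - df.groupby('USERID')['EVENT1TIME'].ffill()  # per-user forward fill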
EDIT: Keeping this for continuity despite how hacky it is.
import numpy as np

locations = list(df[~np.isnan(df.EVENT1TIME)].index)
vals = df.EVENT1TIME.loc[locations]             # all EVENT1TIME values
locations.append(df.EVENT1TIME.index[-1] + 1)   # last row index + 1
last_loc = locations[0]
for next_loc in locations[1:]:
    temp = df.loc[last_loc:next_loc - 1]
    df.loc[last_loc:next_loc - 1, 'DELTA'] = temp.EVENT2TIME - vals[last_loc]
    last_loc = next_loc
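A less hacky equivalent of that loop labels each segment with a cumulative sum and broadcasts the segment's first EVENT1TIME (a sketch, assuming every non-null EVENT1TIME starts a new segment):
seg = df['EVENT1TIME'].notna().cumsum()                    # segment id, bumped at each EVENT1TIME
start = df.groupby(seg)['EVENT1TIME'].transform('first')   # first non-null EVENT1TIME per segment
df['DELTA'] = df['EVENT2TIME'] - start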

pd.json_normalize() gives "str object has no attribute 'values'"

I manually create a DataFrame:
import pandas as pd
df_articles1 = pd.DataFrame({
    'Id': [4, 5, 8, 9],
    'Class': [
        {'encourage': 1, 'contacting': 1},
        {'cardinality': 16, 'subClassOf': 3},
        {'get-13.5.1': 1},
        {'cardinality': 12, 'encourage': 1},
    ]
})
I export it to a CSV file so I can import it again later:
df_articles1.to_csv(f"""{path}articles_split.csv""", index = False, sep=";")
I can split it with pd.json_normalize():
df_articles1 = pd.json_normalize(df_articles1['Class'])
I import its csv file to a DataFrame:
df_articles2 = pd.read_csv(f"""{path}articles_split.csv""", sep=";")
But calling pd.json_normalize(df_articles2['Class']) on the reloaded frame fails with:
AttributeError: 'str' object has no attribute 'values'
That is because when you save with to_csv(), the data in your 'Class' column is stored as a string, not as a dictionary/JSON. So after loading that saved data:
df_articles2 = pd.read_csv(f"""{path}articles_split.csv""", sep=";")
turn it back into its original form using eval() with apply():
df_articles2['Class'] = df_articles2['Class'].apply(eval)
Finally:
resultdf = pd.json_normalize(df_articles2['Class'])
If you now print resultdf, you will get your desired output.
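You can confirm the round-trip problem directly: after read_csv, the 'Class' cells are plain strings (a quick check):
print(type(df_articles2['Class'].iloc[0]))  # <class 'str'>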
While the accepted answer works, using eval is bad practice.
To parse a string column that looks like JSON/dict, use one of the following options (last one is best, if possible).
ast.literal_eval (better)
import ast
objects = df2['Class'].apply(ast.literal_eval)
normed = pd.json_normalize(objects)
df2[['Id']].join(normed)
# Id encourage contacting cardinality subClassOf get-13.5.1
# 0 4 1.0 1.0 NaN NaN NaN
# 1 5 NaN NaN 16.0 3.0 NaN
# 2 8 NaN NaN NaN NaN 1.0
# 3 9 1.0 NaN 12.0 NaN NaN
json.loads (even better)
import json
objects = df2['Class'].apply(json.loads)
normed = pd.json_normalize(objects)
df2[['Id']].join(normed)
# Id encourage contacting cardinality subClassOf get-13.5.1
# 0 4 1.0 1.0 NaN NaN NaN
# 1 5 NaN NaN 16.0 3.0 NaN
# 2 8 NaN NaN NaN NaN 1.0
# 3 9 1.0 NaN 12.0 NaN NaN
If the strings are single quoted (as Python dict reprs are), use str.replace to convert them to double quotes (and thus valid JSON) before applying json.loads:
objects = df2['Class'].str.replace("'", '"').apply(json.loads)
normed = pd.json_normalize(objects)
df2[['Id']].join(normed)
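Note that the blanket quote replacement corrupts strings that themselves contain apostrophes, while ast.literal_eval still parses them (a small check; the sample string is made up):
import ast

s = "{'note': \"it's fine\"}"   # a dict repr whose value contains an apostrophe
print(ast.literal_eval(s))      # {'note': "it's fine"}
print(s.replace("'", '"'))      # {"note": "it"s fine"} -- no longer valid JSON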
pd.json_normalize before pd.to_csv (recommended)
If possible, when you originally save to CSV, just save the normalized JSON (not raw JSON objects):
df1 = df1[['Id']].join(pd.json_normalize(df1['Class']))
df1.to_csv('df1_normalized.csv', index=False, sep=';')
# Id;encourage;contacting;cardinality;subClassOf;get-13.5.1
# 4;1.0;1.0;;;
# 5;;;16.0;3.0;
# 8;;;;;1.0
# 9;1.0;;12.0;;
This is a more natural CSV workflow (rather than storing/loading object blobs):
df2 = pd.read_csv('df1_normalized.csv', sep=';')
# Id encourage contacting cardinality subClassOf get-13.5.1
# 0 4 1.0 1.0 NaN NaN NaN
# 1 5 NaN NaN 16.0 3.0 NaN
# 2 8 NaN NaN NaN NaN 1.0
# 3 9 1.0 NaN 12.0 NaN NaN

How to load a text file of data with many commented rows into pandas?

I am trying to read a delimited text file into a dataframe in Python. The delimiter is not being identified when I use pd.read_table. If I explicitly set sep=' ', I get an error: Error tokenizing data. C error. Notably, the defaults work when I use np.loadtxt().
Example:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment='%',
              header=None)
0
0 1850 1 -0.777 0.412 NaN NaN...
1 1850 2 -0.239 0.458 NaN NaN...
2 1850 3 -0.426 0.447 NaN NaN...
3 1850 4 -0.680 0.367 NaN NaN...
4 1850 5 -0.687 0.298 NaN NaN...
If I set sep = ' ', I get another error:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment='%',
              header=None,
              sep=' ')
ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58
Looking up this error, people suggest using header=None (already done) and setting sep explicitly, but that is what causes the problem: Python Pandas Error tokenizing data. I looked at line 78 and can't see any problems. If I set error_bad_lines=False I get an empty df, suggesting there is a problem with every entry.
Notably this works when I use np.loadtxt():
pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
                        comments='%'))
0 1 2 3 4 5 6 7 8 9 10 11
0 1850.0 1.0 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850.0 2.0 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850.0 3.0 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
3 1850.0 4.0 -0.680 0.367 NaN NaN NaN NaN NaN NaN NaN NaN
4 1850.0 5.0 -0.687 0.298 NaN NaN NaN NaN NaN NaN NaN NaN
This suggests to me that there isn't something wrong with the file, but rather with how I am calling pd.read_table(). I looked through the documentation for np.loadtxt() in the hope of setting the separator to the same value, but it just shows delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).
I'd prefer to be able to import this as a pd.DataFrame, setting the names, rather than having to import as a matrix and then convert to pd.DataFrame.
What am I getting wrong?
This one is quite tricky: the columns are aligned with runs of spaces, so sep=' ' treats every single space as a delimiter and expects a field between each pair. Use the whitespace regex sep='\s+' instead:
import pandas as pd

url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
df = pd.read_csv(url,
                 sep=r'\s+',
                 comment='%',
                 header=None,
                 names=('Year', 'Month', 'M.Anomaly', 'M.Unc.',
                        'A.Anomaly', 'A.Unc.', '5y.Anomaly', '5y.Unc.',
                        '10y.Anomaly', '10y.Unc.', '20y.Anomaly', '20y.Unc.'))
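Note that with comment='%' alone, this loads both tables in the file (the air-inferred and the water-inferred sea-ice series) into a single frame; the answer below separates them. A quick way to see it (a sketch):
print(df.shape)                            # roughly twice the length of one series
print(df['Year'].is_monotonic_increasing)  # False: Year restarts where the second table begins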
The issue is that the file has 77 rows of commented text for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Air Temperatures'.
Two of those rows are headers.
There's a bunch of data, then two more header rows and a new set of data for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Water Temperatures'.
This solution separates the two tables in the file into separate dataframes.
It is not as nice as the other answer, but the data is properly separated into different dataframes.
The headers were a pain; it would probably be easier to create a custom header manually and skip the lines of code that separate the headers from the text.
The important point is separating the air and water data.
import requests
import pandas as pd
import math
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# specify the data from the ranges in the file
air_header1 = data[74].split() # not used
air_header2 = [v.strip() for v in data[75].split(',')]
# combine the 2 parts of the header into a single header
air_header = air_header2[:2] + [f'{air_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(air_header2[2:])]
air_data = [v.split() for v in data[77:2125]]
h2o_header1 = data[2129].split() # not used
h2o_header2 = [v.strip() for v in data[2130].split(',')]
# combine the 2 parts of the header into a single header
h2o_header = h2o_header2[:2] + [f'{h2o_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(h2o_header2[2:])]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=air_header)
h2o = pd.DataFrame(h2o_data, columns=h2o_header)
Without the header code
Simplify the code by using a manual header list.
import pandas as pd
import requests
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# manually created header
headers = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.',
           'Annual_Anomaly', 'Annual_Unc.',
           'Five-year_Anomaly', 'Five-year_Unc.',
           'Ten-year_Anomaly', 'Ten-year_Unc.',
           'Twenty-year_Anomaly', 'Twenty-year_Unc.']
# separate the air and h2o data
air_data = [v.split() for v in data[77:2125]]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=headers)
h2o = pd.DataFrame(h2o_data, columns=headers)
air
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
h2o
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.724 0.370 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.221 0.430 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.443 0.419 NaN NaN NaN NaN NaN NaN NaN NaN
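One caveat on this approach: because the rows were built by splitting strings, every column has object dtype. Converting to numeric before plotting or computing may help (a hedged suggestion):
air = air.apply(pd.to_numeric, errors='coerce')  # strings like '1850' and 'NaN' become numbers/NaN
h2o = h2o.apply(pd.to_numeric, errors='coerce')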

Can I separate the values of a dictionary into multiple columns and still be able to plot them?

I want to separate the values of a dictionary into multiple columns and still be able to plot them. At the moment all the values are in one column.
Concretely, I would like to split the different values in each list, use the length of the longest list as the number of columns, and fill the gaps in the shorter lists with something like NaN so that I can still plot the data in seaborn.
This is the dictionary that I used:
dictio = {'seq_7009': [6236.9764, 6367.049999999999], 'seq_418': [3716.3642000000004, 3796.4124000000006], 'seq_9143_unamb': [4631.958999999999], 'seq_2888': [5219.3359, 5365.4089], 'seq_1101': [4287.7417, 4422.8254], 'seq_107': [5825.695099999999, 5972.8073], 'seq_6946': [5179.3118, 5364.420900000001], 'seq_6162': [5531.503199999999, 5645.577399999999], 'seq_504': [4556.920899999999, 4631.959], 'seq_3535': [3396.1715999999997, 3446.1969999999997, 5655.896546], 'seq_4077': [4551.9108, 4754.0073,4565.987654,5668.9999976], 'seq_1626_unamb': [3724.3894999999998]}
This is the code for the dataframe:
df = pd.Series(dictio)
test = pd.DataFrame({'ID': df.index, 'Value': df.values})
seq_107 [5825.695099999999, 5972.8073]
seq_1101 [4287.7417, 4422.8254]
seq_1626_unamb [3724.3894999999998]
seq_2888 [5219.3359, 5365.4089]
seq_3535 [3396.1715999999997, 3446.1969999999997, 5655....
seq_4077 [4551.9108, 4754.0073, 4565.987654, 5668.9999976]
seq_418 [3716.3642000000004, 3796.4124000000006]
seq_504 [4556.920899999999, 4631.959]
seq_6162 [5531.503199999999, 5645.577399999999]
seq_6946 [5179.3118, 5364.420900000001]
seq_7009 [6236.9764, 6367.049999999999]
seq_9143_unamb [4631.958999999999]
Thanks in advance for the help!
Convert the Value column to a list of lists and load it into a new dataframe, then call plot. Something like this:
df = pd.DataFrame(test.Value.tolist(), index=test.ID)
df
0 1 2 3
ID
seq_107 5825.6951 5972.8073 NaN NaN
seq_1101 4287.7417 4422.8254 NaN NaN
seq_1626_unamb 3724.3895 NaN NaN NaN
seq_2888 5219.3359 5365.4089 NaN NaN
seq_3535 3396.1716 3446.1970 5655.896546 NaN
seq_4077 4551.9108 4754.0073 4565.987654 5668.999998
seq_418 3716.3642 3796.4124 NaN NaN
seq_504 4556.9209 4631.9590 NaN NaN
seq_6162 5531.5032 5645.5774 NaN NaN
seq_6946 5179.3118 5364.4209 NaN NaN
seq_7009 6236.9764 6367.0500 NaN NaN
seq_9143_unamb 4631.9590 NaN NaN NaN
df.plot()
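Since the goal is a seaborn plot, it may also be easier to reshape to long form and let seaborn ignore the missing entries (a sketch; the value-column name 'mass' is made up):
import seaborn as sns

long_df = (df.reset_index()
             .melt(id_vars='ID', value_name='mass')
             .dropna(subset=['mass']))
sns.stripplot(data=long_df, x='ID', y='mass')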
