This is what my original dataset looks like:
url      boolean  details                                                        numberOfPages  date
xzy.com  0        {'https://www.eltako.depdf': {'numberOfPages': 440, 'date': '2017-09-20'},
                   'https://new.com': {'numberOfPages': 240, 'date': '2017-09-20'}}
The numberOfPages and date columns are initially empty, while the details column holds a dictionary. I want to iterate over all rows (urls) and check their details column. For each key in the details dictionary, I want to create a separate row and fill the numberOfPages and date columns from the corresponding values. The result should look something like this:
url      boolean  pdfLink                    numberOfPages  date
xzy.com  0        https://www.eltako.depdf   440            2017-09-20
                  https://new.com            240            2017-09-20
I tried this, but the second line gives me an error: TypeError: string indices must be integers
def arrange(df):
    df = df.explode('details').reset_index(drop=True)
    out = pd.DataFrame(df['details'].map(lambda x: [x[y] for y in x]).explode().tolist())
The original dtype of the Info/details column was dict. I also tried changing it to str, but I still got the same error. Then I tried changing the lambda function to this:
lambda x:[y for y in x]
but the output I get is something like this:
url      boolean  details                    0
xzy.com  0        https://www.eltako.depdf   h
NaN      NaN      NaN                        t
NaN      NaN      NaN                        t
NaN      NaN      NaN                        p
So basically the characters of the link are being exploded into different rows (iterating over a string yields its individual characters). How can I fix this?
For reference, here is the dataframe as a dict (df.to_dict()):
{'Company URL': {0: 'https://www.eltako.de/'},
 'Potential Client': {0: 1},
 'PDF Link': {0: nan},
 'Number of Pages': {0: nan},
 'Creation Date': {0: nan},
 'Info': {0: {'https://www.eltako.de/wp-content/uploads/2020/11/Eltako_Gesamtkatalog_LowRes.pdf': {'numberOfPages': 440,
                                                                                                   'date': '2017-09-20'}},
          1: {'https://new.com': {'numberOfPages': 230,
                                  'date': '2017-09-20'}}}}
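One way this could be approached (a sketch based on the column names in the dict above, not a definitive answer): instead of exploding the dict column, build one output record per key of each row's Info dict and construct a new dataframe from those records.
import pandas as pd

def arrange(df):
    # one output row per PDF link found in each row's Info dict
    records = []
    for _, row in df.iterrows():
        info = row['Info'] if isinstance(row['Info'], dict) else {}
        for pdf_link, meta in info.items():
            records.append({
                'Company URL': row['Company URL'],
                'PotentialIent': row['Potential Client'],
                'PDF Link': pdf_link,
                'Number of Pages': meta.get('numberOfPages'),
                'Creation Date': meta.get('date'),
            })
    return pd.DataFrame(records)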
I want to create a table (dataframe) from some live data fields that update at a fixed interval, and as soon as the data updates I need to add it to my table.
I get data in tabular format like below
Time        col1   col2   col3
timestamp1  123    456    789
timestamp2  7584   4547   6545
timestamp3  8974   1241   2140
When the script runs for the first time, it creates a new, separate list of column names for my desired table from the above data, which looks like below:
Timestamp col1_456 col1_4547 col1_1241 col3_456 col3_4547 col3_1241
col1 and col3 values change regularly; col2 is static. I'm just confused about how to match the column names of the new table and add the values under those columns. The next chunk of data looks like this:
Time        col1    col2   col3
timestamp4  17823   456    10789
timestamp5  758404  4547   65045
timestamp6  89744   1241   14140
Desired output
Timestamp col1_456 col1_4547 col1_1241 col3_456 col3_4547 col3_1241
timestamp4 17823 758404 89744 10789 65045 14140
Please help, thanks!
You could try it like this. I used your example data for df1 and df2, pivoted each one, and concatenated them together. Every time you get new data, you can run the same three lines of code (change the timestamp, pivot, concat), and the new data is appended as new rows to the existing dataframe output. Setting the Time column to a single value is needed to collapse each chunk into one row of the output; change that value for every data chunk you receive to keep the rows distinct. A small helper wrapping these steps is sketched after the output below.
import pandas as pd

df1 = pd.DataFrame(
{
"Time": {0: "timestamp1", 1: "timestamp2", 2: "timestamp3"},
"col1": {0: 123, 1: 7584, 2: 8974},
"col2": {0: 456, 1: 4547, 2: 1241},
"col3": {0: 789, 1: 6545, 2: 2140},
}
)
df2 = pd.DataFrame(
{
"Time": {0: "timestapm4", 1: "timestamp5", 2: "timestamp6"},
"col1": {0: 17823, 1: 758404, 2: 89744},
"col2": {0: 456, 1: 4547, 2: 1241},
"col3": {0: 10789, 1: 65045, 2: 14140},
}
)
df1["Time"] = "timestamp_1"
tmp1 = df1.pivot(index="Time", columns="col2", values=["col1", "col3"]).reset_index()
tmp1.columns = tmp1.columns.map(lambda x: f"{x[0]}_{x[1]}")
df2["Time"] = "timestamp_2"
tmp2 = df2.pivot(index="Time", columns="col2", values=["col1", "col3"]).reset_index()
tmp2.columns = tmp2.columns.map(lambda x: f"{x[0]}_{x[1]}")
output = pd.concat([tmp1, tmp2])
print(output)
Time_ col1_456 col1_1241 col1_4547 col3_456 col3_1241 col3_4547
0 timestamp_1 123 8974 7584 789 2140 6545
0 timestamp_2 17823 89744 758404 10789 14140 65045
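If it helps, the same three steps can be wrapped in a small helper (a sketch; add_snapshot and label are illustrative names, not part of the answer above), so each new chunk of data is handled with one call:
def add_snapshot(output, new_df, label):
    new_df = new_df.copy()
    new_df["Time"] = label  # collapse this chunk into a single row label
    tmp = new_df.pivot(index="Time", columns="col2", values=["col1", "col3"]).reset_index()
    tmp.columns = tmp.columns.map(lambda x: f"{x[0]}_{x[1]}")  # flatten the MultiIndex columns
    return pd.concat([output, tmp])

# e.g. output = add_snapshot(output, df3, "timestamp_3")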
I have a dataframe:
import pandas as pd
df = pd.DataFrame({
'ID': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ'],
'value': [100, 120, 130, 200, 190, 210],
'value2': [2100, 2120, 2130, 2200, 2190, 2210],
'state': ['init','mid', 'final', 'init', 'mid', 'final'],
})
I want to create a dictionary of the unique values of the column 'ID'. I can extract the unique values with:
df.ID.unique()
But that gives me an array. I want the output to be a dictionary that looks like this:
dict = {0:'ABC', 1: 'XYZ'}
If the number of unique entries in the column is n, then the keys should start at 0 and go up to n-1. The values should be the names of the unique entries in the column.
The actual dataframe has 1000s of rows and is often updated. So I cannot maintain the dict manually.
Try this:
dict(enumerate(df.ID.unique()))
{0: 'ABC', 1: 'XYZ'}
If you want to get the unique values for a particular column as a dict, try:
val_dict = {idx:value for idx , value in enumerate(df["ID"].unique())}
Output when printing val_dict:
{0: 'ABC', 1: 'XYZ'}
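Since the dataframe is updated often, one option (just a sketch, not from the answers above) is to wrap the one-liner in a small helper and rebuild the mapping after each update:
def id_mapping(frame, col='ID'):
    # {0: first unique value, 1: second unique value, ...} for the current frame
    return dict(enumerate(frame[col].unique()))

val_dict = id_mapping(df)   # {0: 'ABC', 1: 'XYZ'}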
I want to drop columns that have no content in any of the rows, and also drop the other columns that start with the same name.
In this example, Line of business > Organization should be dropped since there are only blanks in all of its rows. And since this column is dropped, all other columns starting with "Line of business >" should also be dropped from the pandas data frame. The complete data frame follows the same structure of [some text] > [Organization/Department/Employees].
data = pd.DataFrame({'Process name': {0: 'Ad campaign', 1: 'Payroll', 2: ''},
'Line of business > Organization': {0: "", 1: "", 2:''},
'Line of business > Department': {0: "Social media", 1: "People", 2:''},
'Line of business > Employees': {0: "Linda, Tom", 1: "Manuel, Olaf", 2:''}})
Result:
output = pd.DataFrame({'Process name': {0: 'Ad campaign', 1: 'Payroll', 2: ''}})
I hope I understand the case correctly, but I think you could try this:
First, replace the empty "" values with NaNs:
data.replace('', np.nan, inplace=True)
Then, identify the empty cols like this:
empty_cols = [col for col in data.columns if data[col].isnull().all()]
Next, identify the columns to be deleted (this assumes that '>' separates the prefix used to identify related columns):
delete_cols= [col for col in data.columns for empty_col in empty_cols if col.split('>')[0] == empty_col.split('>')[0]]
Finally, drop the columns you don't need and drop rows with null values from the remaining columns:
data = data.drop(delete_cols, axis=1).dropna()
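Putting the steps together (a minimal sketch with the needed imports; drop_empty_groups is just an illustrative name):
import numpy as np
import pandas as pd

def drop_empty_groups(data):
    data = data.replace('', np.nan)
    # columns that are empty in every row
    empty_cols = [col for col in data.columns if data[col].isnull().all()]
    # every column sharing its prefix (the text before '>') with an empty column
    delete_cols = [col for col in data.columns
                   for empty_col in empty_cols
                   if col.split('>')[0] == empty_col.split('>')[0]]
    return data.drop(delete_cols, axis=1).dropna()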
I have JSON output from the m3inference package in Python like this:
{'input': {'description': 'Bundeskanzlerin',
'id': '2631881902',
'img_path': '/root/m3/cache/angelamerkeicdu_224x224.jpg',
'lang': 'de',
'name': 'Angela Merkel',
'screen_name': 'angelamerkeicdu'},
'output': {'age': {'19-29': 0.0,
'30-39': 0.0001,
'<=18': 0.0001,
'>=40': 0.9998},
'gender': {'female': 0.9991, 'male': 0.0009},
'org': {'is-org': 0.0032, 'non-org': 0.9968}}}
I store it in:
org = pd.DataFrame.from_dict(json_normalize(org['output']), orient='columns')
gender.male gender.female age.<=18 ... age.>=40 org.non-org org.is-org
0 0.0009 0.9991 0.0000 ... 0.9998 0.9968 0.0032
I don't know where the 0 value in the first column is coming from. I save the org.is-org column to isorg:
isorg = org['org.is-org']
but when I append it to a pandas dataframe the dtype is object, and the value changes to
0    0.0032
Name: org.is-org, dtype: float64
not 0.0032.
How to fix this?
"i dont know where 0 value in first column coming from then i save org.isorg column to isorg"
That "0" is an index to your dataframe. Unless you specify your dataframe index, pandas will auto create the index. You can change you index instead.
code example:
org.set_index('gender.male', inplace=True)
An index is like an address for your data; it is how any data point in the dataframe or series can be accessed.
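If the goal is just the bare number 0.0032 rather than a one-element Series (an assumption about the intent here), another option is to pull the scalar out before appending:
isorg = org['org.is-org'].iloc[0]   # a plain float, 0.0032, instead of a Series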
I'm working on a df as follows:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'ID': {0: 'S0001', 1: 'S0002', 2: 'S0003'},
                   'StartDate': {0: Timestamp('2018-01-01 00:00:00'),
                                 1: Timestamp('2019-01-01 00:00:00'),
                                 2: Timestamp('2019-04-01 00:00:00')},
                   'EndDate': {0: Timestamp('2019-01-02 00:00:00'),
                               1: Timestamp('2020-01-02 00:00:00'),
                               2: Timestamp('2020-04-01 00:00:00')},
                   'Color': {0: 'Blue', 1: 'Green', 2: 'Red'},
                   'Type': {0: 'Small', 1: 'Mid', 2: 'Mid'}})
Now I want to create a df with 366 rows between the Start and End dates (one row per day), and I want to add the Color, Type and ID for every row between StartDate and EndDate.
I'm doing the following, which works well:
OutputDF = pd.concat([pd.DataFrame(data=Row['ID'],
                                   index=pd.date_range(Row['StartDate'], Row['EndDate'],
                                                       freq='1D', closed='left'),
                                   columns=['ID'])
                      for index, Row in df.iterrows()])
and I get a df with two columns: the ID and the days in the Start/End date range.
I'm able to add the Color/Type by doing a pd.merge on 'ID', but I think there is a direct way to add the Color and Type columns when creating the DF.
I've tried data = [Row['ID'], Row['Type'], Row['Color']] and data = Row[['ID', 'Color', 'Type']], but neither works.
So how can I create my dataframe with the Color (and Type) for every item across the whole 366 rows directly, without requiring the merge?
Sample of current output (screenshot omitted): it goes on for all the days between the Start/End dates for each item.
Desired output (screenshot omitted): the same rows, but with the Color and Type columns filled in for every day.
Thanks
Try the pd.DataFrame constructor with a dictionary for data:
pd.concat([pd.DataFrame({'ID':Row['ID'],
'Color':Row['Color'],
'Type':Row['Type']},
index = pd.date_range(Row['StartDate'],
Row['EndDate'],
freq='1D',
closed = 'left'))
for index, Row in df.iterrows()])
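One note on this approach: in newer pandas versions (1.4 and later) the closed= argument of pd.date_range has been replaced by inclusive=, so the equivalent call there would be:
pd.date_range(Row['StartDate'], Row['EndDate'], freq='1D', inclusive='left')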