I have a problem similar to this question, except that the original question was about producing multiple CSV outputs. In my case, I'm wondering whether there is a way to create the multiple dataframes in the environment through a loop, so I can carry on with some data analysis. This is what I'm doing manually:
us = df[df['country_code'].str.match("US")]
mx = df[df['country_code'].str.match("MX")]
ca = df[df['country_code'].str.match("CA")]
au = df[df['country_code'].str.match("AU")]
You could use the same approach as in the linked post, but save the different dfs into a dictionary:
codes = ['US', 'MX', 'CA', 'AU']
result_dict = {}
for code in codes:
    # .str methods inside query() typically require engine='python'
    result_dict[code] = df.query(f'country_code.str.match("{code}")', engine='python')
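Each entry is then an ordinary DataFrame you can pull back out by key, e.g.:
us = result_dict['US']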
You can also build the dict with a comprehension, matching each code:
df = pd.DataFrame({'country_code': ['US','MX', 'CA', 'AU']})
codes = ['US', 'MX', 'CA', 'AU']
out = {code : df[df['country_code'].str.match(code)] for code in codes}
Output:
>>> out["US"]
country_code
0 US
>>> type(out["US"])
pandas.core.frame.DataFrame
>>> out["CA"]
country_code
2 CA
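If you don't want to hard-code the list of codes at all, a minimal sketch using groupby on the same df builds one entry per distinct country in a single pass (each group is already a DataFrame):
out = {code: group for code, group in df.groupby('country_code')}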
I have a dataframe that contains strings of email addresses in this format:
d = {'Country': 'A', 'Email': '123@abc.com,456@def.com,789@ghi.com'}
df = pd.DataFrame(data=d, index=[0])
and I want only the username part of each email, so the new dataframe should look like this:
d = {'Country': 'A', 'Email': '123,456,789'}
df1 = pd.DataFrame(data=d, index=[0])
The best way I could think of is to split the original string on commas, delete the domain part of each email, and join the list back together. Are there better ways to solve this problem?
If you want a string as output, you can remove the part starting at the @. Use str.replace with the @[^,]+ regex:
df['Email'] = df['Email'].str.replace(r'@[^,]+', '', regex=True)
Output:
Country Email
0 A 123,456,789
For a list you could use str.findall with a lookahead for the @:
df['Email'] = df['Email'].str.findall(r'[^,]+(?=@)')
Output:
Country Email
0 A [123, 456, 789]
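If you then want the list joined back into a single comma-separated string, Series.str.join chains directly onto the findall result:
df['Email'] = df['Email'].str.findall(r'[^,]+(?=@)').str.join(',')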
This is a regex question, not really a pandas question, but here's a solution that returns a list (which you can join back together into a string):
import re
df['Email'].apply(lambda s: re.findall(r'\w+(?=@)', s))
Output:
0 [123, 456, 789]
Name: Email, dtype: object
Try this:
import pandas as pd
d = {'Country': ['A', 'B'], 'Email': ['123@abc.com,456@def.com,789@ghi.com', '134@abc.com,436@def.com,229@ghi.com']}
df = pd.DataFrame(d)
df['Email'] = df['Email'].str.findall(r'\w+(?=@)').apply(', '.join)
df
Output:
  Country          Email
0       A  123, 456, 789
1       B  134, 436, 229
Here is one way to do it, working on the dict directly:
import re

# strip everything from the @ up to the next comma in each value
d = {k: re.sub(r"@[^,]+", '', v) for k, v in d.items()}
d
{'Country': 'A', 'Email': '123,456,789'}
I have a Python project with:
df_testR with columns ['Name', 'City', 'Licence', 'Amount']
df_testF with columns ['Name', 'City', 'Licence', 'Amount']
I want to compare both dfs. The result should be a df where I see the Name, City, Licence and Amount. Normally, df_testR and df_testF should be exactly the same.
In case they are not, I want to see the difference as Amount_R vs Amount_F.
I referred to: Diff between two dataframes in pandas
But I receive a table with TRUE and FALSE only:
Name  City  Licence  Amount
True  True  True     False
But I'd like to get a table that lists ONLY the lines where differences occur, and that shows the differences like this:
Name  City  Licence  Amount_R  Amount_F
Paul  NY    YES      200       500
Here, both tables contain Paul, NY and Licence = YES, but table R contains 200 as Amount while table F contains 500. I want my analysis to return a table that captures only the lines where such differences occur.
Could someone help?
import copy
import pandas as pd

# Two frames that agree except for one Amount value; df2 also loses row 1
data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)
First find the missing rows and print them:
matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns].copy()
metadata_match['check'] = metadata_match.all(axis=1)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.iloc[missing_rows, :])
Then drop these rows and merge:
df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)
Now remove the rows that have the same amount:
df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]
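If you prefer the column names from your desired output over the default _x/_y, merge accepts a suffixes argument; with df2 as the left frame and df1 as the right, something like this (the Amount filter above would then compare the suffixed names instead):
df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns, suffixes=('_F', '_R'))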
I assumed the DFs are sorted.
If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.
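As an aside, when the two frames are identically labelled (same index, same columns, no missing rows), pandas 1.1+ offers DataFrame.compare, which returns only the differing cells; a minimal sketch under that assumption:
diff = df1.compare(df2)  # overlapping columns come back as ('Amount', 'self') vs ('Amount', 'other')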
My data contains place information as a full post code, for example CZ25145. I would like to create a new column containing just the country part, with value CZ. How can I do this?
I have this:
import pandas as pd
df = pd.DataFrame({
'CODE_LOAD_PLACE' : ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832']
},)
I would like to get it like below:
df = pd.DataFrame({
'CODE_LOAD_PLACE' : ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832'],
'COUNTRY_LOAD_PLACE' : ['PL', 'CZ', 'DE', 'DE', 'SK']
},)
I tried .factorize and .groupby, but without the desired result.
Use .str and select the first 2 characters:
df["COUNTRY_LOAD_PLACE"] = df["CODE_LOAD_PLACE"].str[:2]
How can I add the outputs of different for loops into one dataframe? For example, I have scraped data from a website and built lists of names, emails and phone numbers using loops. I want to add all the outputs into a table in a single dataframe.
I am able to do it for one single loop, but not for multiple loops.
Please look at the code and output below.
When I remove zip from the for loop it gives the error "too many values to unpack".
Loop
phone = soup.find_all(class_="directory_item_phone directory_item_info_item")
for phn in phone:
    print(phn.text.strip())
##Output - List of Numbers
Code for df
df = list()
for name, mail, phn in zip(faculty_name, email, phone):
    df.append(name.text.strip())
    df.append(mail.text.strip())
    df.append(phn.text.strip())
df = pd.DataFrame(df)
df
An efficient way to create a pandas.DataFrame is to first create a dict and then convert it into a DataFrame.
In your case you could probably do:
import pandas as pd
D = {'name': [], 'mail': [], 'phone': []}
for name, mail, phn in zip(faculty_name, email, phone):
    D['name'].append(name.text.strip())
    D['mail'].append(mail.text.strip())
    D['phone'].append(phn.text.strip())
df = pd.DataFrame(D)
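A variant of the same idea, in case rows feel more natural than columns: pd.DataFrame also accepts a list of dicts, one per row (assuming the same faculty_name, email and phone lists of scraped tags):
rows = [{'name': n.text.strip(), 'mail': m.text.strip(), 'phone': p.text.strip()}
        for n, m, p in zip(faculty_name, email, phone)]
df = pd.DataFrame(rows)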
Another way, with a lambda function:
import pandas as pd
text_strip = lambda s: s.text.strip()
D = {
    'name': list(map(text_strip, faculty_name)),
    'mail': list(map(text_strip, email)),
    'phone': list(map(text_strip, phone))
}
df = pd.DataFrame(D)
If the lists don't all have the same length you may try this (but I am not sure it is very efficient):
import pandas as pd
columns_names = ['name', 'mail', 'phone']
all_lists = [faculty_name, email, phone]
max_length = max(map(len, all_lists))
D = {c_name: [None] * max_length for c_name in columns_names}
for c_name, l in zip(columns_names, all_lists):
    for ind, element in enumerate(l):
        D[c_name][ind] = element
df = pd.DataFrame(D)
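Alternatively, itertools.zip_longest pads the shorter lists for you; a sketch under the same assumptions about the scraped faculty_name, email and phone lists:
import pandas as pd
from itertools import zip_longest

# zip_longest fills missing entries with the fillvalue (None by default)
rows = zip_longest((n.text.strip() for n in faculty_name),
                   (m.text.strip() for m in email),
                   (p.text.strip() for p in phone))
df = pd.DataFrame(rows, columns=['name', 'mail', 'phone'])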
Try this:
data = {'name': [name.text.strip() for name in faculty_name],
        'mail': [mail.text.strip() for mail in email],
        'phn': [phn.text.strip() for phn in phone]}
df = pd.DataFrame.from_dict(data)
I want to create a dictionary that has predefined keys, like this:
dict = {'state':'', 'county': ''}
and read through and get values from a spreadsheet, like this:
for row in range(rowNum):
    for col in range(colNum):
and update the values for the keys 'state' (sheet.cell_value(row, 1)) and 'county' (sheet.cell_value(row, 3)) like this:
dict[{}]
I am confused on how to get the state value with the state key and the county value with the county key. Any suggestions?
Desired outcome would look like this:
>>>print dict
[
{'state':'NC', 'county': 'Nash County'},
{'state':'VA', 'county': 'Albemarle County'},
{'state':'GA', 'county': 'Cook County'},....
]
I made a few assumptions regarding your question. You mentioned in the comments that State is at index 1 and County is at index 3; what is at index 2? I assumed that they occur sequentially. In addition, there needs to be a way to map the headings to the data columns, hence I used a list, as it maintains order.
# A list containing the headings that you are interested in the order in which you expect them in your spreadsheet
list_of_headings = ['state', 'county']
# Simulating your spreadsheet
spreadsheet = [['NC', 'Nash County'], ['VA', 'Albemarle County'], ['GA', 'Cook County']]
list_of_dictionaries = []
for i in range(len(spreadsheet)):
    dictionary = {}
    for j in range(len(spreadsheet[i])):
        dictionary[list_of_headings[j]] = spreadsheet[i][j]
    list_of_dictionaries.append(dictionary)
print(list_of_dictionaries)
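The inner pairing can also be written with zip, since dict(zip(keys, values)) builds each row's dictionary in one step; a compact equivalent of the loop above:
list_of_dictionaries = [dict(zip(list_of_headings, row)) for row in spreadsheet]
print(list_of_dictionaries)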
Raqib's answer is partially correct but had to be modified for use with an actual spreadsheet with rows and columns and the xlrd module. What I did was first use xlrd methods to grab the cell values that I wanted and put them into a list (similar to the spreadsheet variable raqib has shown above). Note that the parameters sI and cI are the column index values I picked out in a previous step: sI = StateIndex and cI = CountyIndex.
rows = []
for row in range(rowNum):
    # one [state, county] pair per spreadsheet row; looping over col here as well would duplicate each row
    rows.append([str(sheet.cell_value(row, sI)), str(sheet.cell_value(row, cI))])
Now that I have a list of the states and counties, I can apply raqib's solution:
list_of_headings = ['state', 'county']
fipsDic = []
print(len(rows))
for i in range(len(rows)):
    temp = {}
    for j in range(len(rows[i])):
        temp[list_of_headings[j]] = rows[i][j]
    fipsDic.append(temp)
The result is a nice dictionary list that looks like this:
[{'county': 'Minnehaha County', 'state': 'SD'}, {'county': 'Minnehaha County', 'state': 'SD', ...}]
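Since each row only needs the two named fields, a shorter sketch (same sheet, sI, cI and rowNum as above) skips the intermediate list entirely:
fipsDic = [
    {'state': str(sheet.cell_value(row, sI)), 'county': str(sheet.cell_value(row, cI))}
    for row in range(rowNum)
]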